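This post lists the latest papers retrieved from arXiv.org on 2026-06-30. It is updated automatically and organized into six broad areas: NLP, CV, ML, AI, IR, and MA.

Note: The daily paper data comes from arXiv.org and is refreshed automatically at around 12:30 each day.

Tip: If the list has not been updated on a given day, either arXiv published no new papers that day or the script failed. Failures will be fixed the same day whenever possible.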

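Contents

Overview (2026-06-30)

1299 papers were updated today, including:

  • Natural language processing: 185 papers (Computation and Language (cs.CL))
  • Artificial intelligence: 440 papers (Artificial Intelligence (cs.AI))
  • Computer vision: 306 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine learning: 348 papers (Machine Learning (cs.LG))
  • Multi-agent systems: 34 papers (Multiagent Systems (cs.MA))
  • Information retrieval: 68 papers (Information Retrieval (cs.IR))
  • Human-computer interaction: 27 papers (Human-Computer Interaction (cs.HC))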

Multi-Agent Systems

[MA-0] Linguistic Firewall: Geometry as Defense in Multi-Agent Systems Routing ICML2026

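Quick read: This paper addresses inaccurate capability assessment in task routing for Multi-Agent Systems (MAS), which arises because routers rely on unverified indirect proxies such as textual self-descriptions or static surrogate representations. These proxies do not reflect an agent's actual operational capabilities, so malicious agents can misrepresent their proficiencies or hide covert backdoors, creating severe security vulnerabilities. The key to the solution is ANTAP (Automatic Non-Textual Agent Picker), an evaluation-driven routing architecture that discards non-empirical textual or static representations in favor of active capability testing, dynamically querying agents to measure their true performance empirically. ANTAP distills each agent's observed performance into fixed behavioral operators within a shared semantic space and, at inference time, routes via a purely non-textual algebraic projection, establishing a "linguistic firewall" that renders metadata-based attacks inexpressible. In the experiments, ANTAP achieves a near-zero attack success rate (ASR) against description-based injection attacks, compared with 67.3% or higher for the description-based router baseline; against adaptive embedding attacks, it lowers ASR by 20% relative to the embedding-based baseline, and it is resilient to description manipulation by design.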

Link: https://arxiv.org/abs/2606.30555
Authors: Dvir Alsheich, Adar Peleg, Ben Hagag, Rom Himelstein, Amit Levi, Avi Mendelson
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 8 pages (9 more for appendix), 3 figures. Published at the Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD) at ICML 2026


Abstract:The rapid integration of Large Language Models (LLMs) has driven the evolution of Multi-Agent Systems (MAS), where specialized agents collaborate to execute complex workflows. Effective orchestration in these environments requires robust routing mechanisms to efficiently allocate tasks to the most suitable agent. However, existing routers fundamentally rely on unverified proxies, ranging from textual self-descriptions to static surrogate representations, to gauge an agent’s competence. This reliance on non-empirical data creates a critical gap between an agent’s projected profile and its actual operational capabilities, introducing severe security vulnerabilities. Malicious agents can easily misrepresent their proficiencies or harbor covert backdoors that evade both standard external analysis and static representation-learning techniques. In this work, we introduce ANTAP (Automatic Non-Textual Agent Picker), an evaluation-driven routing architecture that discards indirect proxies in favor of active capability testing. By dynamically querying agents to ascertain their true competencies empirically, ANTAP distills performance into fixed behavioral operators within a shared semantic space. At inference time, routing is performed via a purely non-textual algebraic projection, establishing a “linguistic firewall” that renders metadata-based attacks inexpressible. In our experiments, ANTAP achieves near-zero ASR against description-based injection attacks, compared to 67.3% and above for the description-based router baseline. Against adaptive embedding attacks, ANTAP achieves substantially lower ASR than the embedding-based baseline, with a 20% reduction, while remaining resilient to description manipulation by design.

[MA-1] MAS-Lab: A Specification-Driven Validation Framework for Reliable Multi-Agent Systems

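Quick read: This paper addresses key obstacles to deploying LLM-based Multi-Agent Systems (MAS) in production. With current tooling, systems are developed in an ad-hoc, imperative manner: agent logic, orchestration, observability, and control are tightly coupled, explicit system-level validation is missing, and workflows are optimized for demonstrations rather than long-lived operation. As a result, behavior observed during experimentation does not map reliably onto production, which limits reliability, evolvability, and engineered deployment. The core of the solution is MAS-Lab, a specification-driven framework for developing multi-agent systems that decouples the system through a three-layer architecture: a declarative, framework-agnostic agent specification layer (Spec) that states semantic intent explicitly; a stateful multi-agent operating system (MAS-OS) that provides pluggable execution and control primitives; and a set of lab overlays (Labs) with integrated observability and evaluation tools. By separating semantic intent from operational concerns, the framework makes behavior and control explicit, supports reproducible experimental validation, and preserves continuity across lifecycle stages, enabling intent-based validation, principled system evolution, and a smooth transition to production-grade MAS.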

Link: https://arxiv.org/abs/2606.30546
Authors: Jordan Augé, Giovanna Carofiglio, Giulio Grassi, Jacques Samain
Institutions: Cisco Systems, Paris, France
Categories: Multiagent Systems (cs.MA)
Comments: 16 pages, 12 figures


Abstract: The rapid emergence of LLM-based agentic frameworks has significantly reduced the cost of assembling multi-agent systems (MAS), enabling fast prototyping and exploration of agentic behaviors. However, systems built with current tooling remain ill-suited for reliable, evolvable, and production-grade deployment. In practice, MAS are often developed in an ad-hoc and imperative manner, with agent logic, orchestration, observability, and control tightly interwoven, little to no explicit system-level validation, and development workflows optimized for demonstrations rather than long-lived, governed operation. As a result, behavior observed during experimentation rarely constitutes reliable evidence of behavior in production. In this paper, we introduce MAS-Lab, a specification-driven framework for principled development and experimental validation of multi-agent systems properties. MAS-Lab is designed to transform MAS from collections of scripts into engineered distributed systems by separating semantic intent from operational concerns, making behavior and control explicit, supporting reproducible experimentation, and preserving continuity across lifecycle stages. MAS-Lab consists of three layers: a declarative, framework-agnostic agentic specification layer (Spec); a stateful MAS Operating System that provides execution and control primitives plugged-in by design (MAS-OS); and a set of lab overlays with integrated observability and evaluation tools (Labs). Together, these components enable intent-based validation, principled system evolution, and a seamless transition to production-grade MAS.

[MA-2] COHORT: Collaborative Orchestration for Hardening via Offensive Replay on Emulated Topologies

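Quick read: This paper addresses the fact that devising a mitigation against a specific adversary in an enterprise network is slow, depends on expert experience, and cannot be validated safely in production. Traditionally, security analysts spend weeks manually designing a mitigation, validating it, and confirming that it is both effective and safe, and this process is hard to test directly on a live production network. The paper proposes COHORT, the first end-to-end framework to automate this procedure, which uses a role-decomposed multi-agent large language model (LLM) workflow to generate, implement, and refine deployable mitigations. The key elements are: 1) mitigations are generated and validated in a high-fidelity GNS3 network emulator running real vendor firmware; 2) offensive replay re-executes the original attack on the mitigated network for a direct comparison against the unmitigated baseline, avoiding indirect proxies such as reward signals or expert judgment; 3) a connectivity-regression check (LAN ping and internet HTTP probe) rejects mitigations that disrupt legitimate traffic; 4) a cumulative evaluation stacks approved mitigations onto a persistent state to surface compound effects. Across three topologies and four attack scenarios (ransomware, lateral movement, DNS exfiltration, data theft), 46.7% of generated mitigations both disrupt the attack and preserve connectivity, 4.4 times the rate of a single-agent baseline.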

Link: https://arxiv.org/abs/2606.30479
Authors: Chen Frydman, Aviram Zilberman, Rubin Krief, Abed Showgan, Andres Murillo, Sekiya Motoyoshi, Asaf Shabtai, Yuval Elovici, Rami Puzis
Institutions: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Multiagent Systems (cs.MA)
Comments: Submitted to Journal of Network and Computer Applications


Abstract:Mitigating an observed adversary in an enterprise network typically takes weeks of expert work: an analyst derives a mitigation tailored to that adversary, validates it without breaking production, and verifies it disrupts the specific attack. The procedure relies on expert judgment and cannot safely be exercised against the production network. COHORT is the first end-to-end framework to automate this procedure for deployable mitigations. A role-decomposed multi-agent LLM workflow proposes candidates, implements them as real device commands, and refines them through a critique loop, all on a high-fidelity GNS3 emulator running real vendor firmware (firewall, switch, router). Each candidate is evaluated by offensive replay: re-executing the original adversary on the mitigated network for a paired comparison against the unmitigated baseline, rather than the reward-signal or expert-judgment proxies used in prior simulation, hybrid, and configuration-generation work. Two further checks complement replay: a connectivity-regression check (LAN ping and internet HTTP probe) rejects mitigations that disrupt legitimate LAN or internet connectivity, and a cumulative evaluation stacks approved mitigations onto a persistent state to surface compound effects. Across three topologies and four attack scenarios (ransomware, lateral movement, DNS exfiltration, data theft), 46.7% of generated mitigations both disrupt the attack and preserve connectivity under replay, 4.4 times the rate of a single-agent baseline using the same model and tool access. A demo video walking through the framework is available with our released artifacts.
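
The acceptance rule described in the abstract above (a mitigation counts only if it disrupts the replayed attack and still passes the connectivity-regression check) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the `ReplayResult` fields and the `accept_mitigation` function are hypothetical names, not part of COHORT's released artifacts.

```python
from dataclasses import dataclass


@dataclass
class ReplayResult:
    """Outcome of one offensive replay run (hypothetical record layout)."""
    attack_succeeded: bool  # did the replayed adversary reach its goal?
    lan_ping_ok: bool       # connectivity-regression check: LAN ping
    http_probe_ok: bool     # connectivity-regression check: internet HTTP probe


def accept_mitigation(baseline: ReplayResult, mitigated: ReplayResult) -> bool:
    """Paired comparison of the mitigated network against the unmitigated baseline."""
    # The attack must work on the baseline and fail once the mitigation is applied.
    disrupts_attack = baseline.attack_succeeded and not mitigated.attack_succeeded
    # Legitimate LAN and internet connectivity must survive the mitigation.
    preserves_connectivity = mitigated.lan_ping_ok and mitigated.http_probe_ok
    return disrupts_attack and preserves_connectivity
```

Under this sketch, a mitigation that blocks the attack by cutting internet access is rejected, which is the case the connectivity check is meant to catch.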

[MA-3] Minimal MMAO: A Resource-Closed-Loop Framework for Adaptive Metaheuristic Search

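Quick read: This paper addresses the limited adaptivity and poor coupling that arise in traditional metaheuristics when search intensity, the exploration-exploitation balance, and lifecycle turnover are governed by separate, preset schedules. The core challenge is to regulate the search process endogenously so that the algorithm generalizes and adapts across optimization settings. The key to the solution is a metabolic controller built on endogenous resource circulation: through bounded private energy, a communal budget, normalized reward, continuous role adaptation, and resource-financed branching and pruning, the multi-agent system evolves autonomously under a single metabolic loop. The same loop remains workable in both continuous and discrete optimization domains and is relatively stable under a compact design in the discrete case, while reduced continuous refinement quality is the main cost of keeping the method lean. The results position MMAO as a coherent framework for adaptive heuristic design rather than a loose collection of operators.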

Link: https://arxiv.org/abs/2606.30450
Authors: Jinliang Xu, Liping Ma
Institutions: The Seventh Medical Center of Chinese PLA General Hospital
Categories: Neural and Evolutionary Computing (cs.NE); Multiagent Systems (cs.MA)
Comments:


Abstract:This paper presents the Metabolic Multi-Agent Optimizer (MMAO) as an adaptive metaheuristic built around endogenous resource circulation. The central premise is that search intensity, exploration–exploitation balance, and lifecycle turnover should be induced by a shared metabolic controller rather than by separately attached schedules. We formulate MMAO through bounded private energy, a communal budget, normalized reward, continuous role adaptation, and resource-financed branching and pruning. The method is then instantiated in both continuous and discrete domains and evaluated on a matched small-scale suite including Sphere, Rastrigin, a synthetic Euclidean TSP, and two TSPLIB instances. The results show a consistent pattern: the same metabolic loop remains workable across domains, the discrete realization remains relatively stable under a compact design, and continuous refinement quality is the main cost of keeping the method lean. Taken together, these findings position MMAO as a coherent framework for adaptive heuristic design rather than a loose collection of operators.

[MA-4] Translating Natural Language to Strategic Temporal Specifications via LLMs

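Quick read: This paper addresses the problem of producing correct formal specifications for Multi-Agent Systems (MAS) from natural language (NL) descriptions, in particular requirements that involve strategic abilities and temporal objectives. No established method exists for translating natural language into strategic logics such as ATL/ATL*, and writing formal specifications by hand is error-prone, time-consuming, and expertise-intensive. The paper proposes a framework based on Large Language Models (LLMs) that translates natural-language strategic requirements into well-formed ATL/ATL* formulas. The key to the solution is a new expert-validated dataset used to fine-tune small open-weight models (3–7B parameters) in-domain, which match strong few-shot proprietary API baselines (the best fine-tuned system reaches 0.84 semantic accuracy versus 0.86 for the strongest baseline) while keeping requirements on-premises. The study also finds that judge reliability is inversely related to generator strength: the strongest proprietary models are the least reliable judges and tend to over-reject faithful paraphrases of the reference, whereas the open-weight Llama-3.3-70B tracks human expert verdicts most closely. Finally, the tool is integrated into an existing model checker for strategic logics, so that non-expert users can state strategic properties directly in natural language.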

Link: https://arxiv.org/abs/2606.30441
Authors: Marco Aruta, Francesco Improta, Vadim Malvone, Aniello Murano, Vladana Perlic
Institutions: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:


Abstract: A rigorous formalization of system requirements is a fundamental prerequisite for the verification of Multi-Agent Systems (MAS). However, writing correct formal specifications is well known as an error-prone, time-consuming, and expertise-intensive task. This difficulty is further accentuated in MAS, where requirements must capture strategic abilities and temporal objectives. At present, there is no established methodology for deriving MAS specifications from natural language. We present a framework for translating Natural Language descriptions of strategic requirements into well-formed ATL/ATL* formulas using Large Language Models (LLMs). Since no available dataset supports supervised learning for the NL-to-ATL/ATL* translation task, we create and curate a novel expert-validated dataset, employed for training and evaluating fine-tuned models. On a held-out test set, evaluated under the LLM judge that best agrees with expert annotations, in-domain fine-tuning of small open-weight models (3–7B parameters) matches strong few-shot proprietary API baselines. Our best fine-tuned system reaches 0.84 semantic accuracy, statistically on par with 0.86 for the strongest few-shot proprietary baseline, while keeping requirements on-premises. We further find that judge reliability is inverse to generator strength. The open-weight Llama-3.3-70B tracks human verdicts most closely, whereas the strongest proprietary models are the least reliable judges, over-rejecting faithful paraphrases of the reference. To assess the practical applicability of the generated specifications, we embed our tool into an existing strategic logics model checker, enabling non-expert users to specify strategic properties in natural language.

[MA-5] Always-On Agents: A Survey of Persistent Memory State and Governance in LLM Agents

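Quick read: This paper addresses the lack of effective governance of persistent state in always-on agents. As agents accumulate durable state across interactions, including memories, task ledgers, permissions and credentials, commitments, provenance, and audit records, managing that state in a safe, traceable, and reclaimable way becomes a central challenge. Existing work concentrates on accumulating and retrieving state and falls short on its authority, scope, mutability, recoverability, and actionability, as well as on lifecycle management such as updating, forgetting, rollback, and auditing. The paper therefore proposes the Always-On Evaluation Protocol (AOEP-v0), whose core idea is to shift evaluation from answer quality alone to scoring state mutation and recovery obligations. The resulting agenda connects always-on agents to databases, distributed systems, formal methods, capability security, and machine unlearning.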

Link: https://arxiv.org/abs/2606.30306
Authors: Tianyu Ding, Aditya Nannapaneni, Bingfan Liu, Ling Zhang
Institutions: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:


Abstract:Always-on agents are systems whose future behavior depends on durable state accumulated across earlier interactions. We treat them as persistent-state systems: the operative system includes retrievable memories, but also task ledgers, permissions, credentials, commitments, provenance and audit records, shared state, trigger conditions, and externally committed effects linked to those records. The survey reads the literature through six diagnostic axes for each state item, authority, scope, mutability, provenance, recoverability, and actionability, and through a lifecycle in which state is written, validated, organized, retrieved, acted upon, updated, forgotten, audited, and sometimes rolled back. Across a 435-work coded corpus, treated as a scoped map rather than an exhaustive census, the literature concentrates more heavily on accumulating and retrieving state than on governing, recovering, or relinquishing it. We therefore introduce the Always-On Evaluation Protocol (AOEP-v0), a pilot evaluation contract that makes these governance requirements concrete by scoring state mutation and recovery obligations rather than answer quality alone. The resulting agenda connects always-on agents to databases, distributed systems, formal methods, capability security, and machine unlearning.

[MA-6] TACO: Tool-Augmented Credit Optimization for Agentic Tool Use

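Quick read: This paper addresses the difficulty of assessing the contribution of individual tool calls made by code-tool agents in multimodal visual question answering, where a call may be useful, redundant, or misleading. Outcome-only rewards cannot tell these cases apart, and existing process rewards either fail to attribute final correctness to individual tool calls or require an external judge model. The paper proposes Tool-Augmented Credit Optimization (TACO), built on two coupled advantage channels. The first is the Differential Answer-Probe Reward (DAPR), which is self-supervised and judge-free: probe tokens inserted into the reasoning elicit the model's predictions with and without a given tool, and the difference in outcome reward is taken as the value of that call (positive for a useful call, negative for a misleading one, zero for one that changes nothing). DAPR reuses the existing answer checker with no auxiliary judge and, because it is a difference rather than an absolute score, is naturally robust to probe-hacking. The second is the outcome advantage from the final answer, distributed by Outcome-Gated Advantage Routing (OGAR), a parameter-free rule that, conditioned on the call's outcome, delivers credit only to the responsible segments and thereby suppresses wasted calls without any cost term. TACO is trained with a two-stage supervised fine-tuning plus reinforcement learning (SFT+RL) pipeline; it yields consistent accuracy gains on perception, reasoning, and general multimodal benchmarks and learns to invoke tools only when they help.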

Link: https://arxiv.org/abs/2606.30251
Authors: Mingkuan Feng, Jinyang Wu, Hao Gu, Fangrui Lv, Ruihan Jin, Chuyuan Zhang, Zhengqi Wen, Jianhua Tao
Institutions: Unknown
Categories: Multiagent Systems (cs.MA)
Comments:


Abstract:Agentic multimodal models perform diverse operations on an image via code and reason over the returned view, an effective paradigm for fine-grained visual question answering. However, code operations can be useful, redundant, or misleading. Outcome-only rewards cannot precisely distinguish these cases, and existing process rewards either fail to attribute final correctness to individual tool calls, or require an external judge model. To address this, we introduce Tool-Augmented Credit Optimization (TACO), a GRPO variant for code-tool agents built on two coupled advantage channels. The first, Differential Answer-Probe Reward (DAPR), is a self-supervised, judge-free tool-contribution advantage that credits each tool call by its own effect on answering correctly. Probe tokens inserted into the model’s reasoning elicit its predictions with and without the tool, and the difference in outcome reward is taken as the call’s value: positive for a useful call, negative for a misleading one, and zero for one that changes nothing. This reuses the existing answer checker with no auxiliary judge, and, being a difference rather than an absolute probe score, is naturally robust to probe-hacking. The second is the outcome advantage from the final answer, distributed by Outcome-Gated Advantage Routing (OGAR): a parameter-free rule that, conditioned on the call’s outcome, delivers this credit only to the responsible segments, suppressing wasted tool calls without any cost term. We train TACO through a two-stage SFT+RL pipeline. Extensive experiments across perception, reasoning, and general multimodal benchmarks show that it yields consistent accuracy gains and learns to invoke its tools only when they help.
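
The DAPR idea in the abstract above (credit a tool call by the difference in outcome reward with and without the tool) can be illustrated with a minimal Python sketch. The exact-match `outcome_reward` checker and both function names are assumptions made for illustration; the abstract does not specify the paper's answer checker or how probe tokens are inserted.

```python
def outcome_reward(prediction: str, gold: str) -> float:
    """Hypothetical answer checker: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if prediction.strip().lower() == gold.strip().lower() else 0.0


def dapr(pred_with_tool: str, pred_without_tool: str, gold: str) -> float:
    """Differential reward for one tool call.

    Positive: the call turned a wrong answer into a right one (useful).
    Negative: the call turned a right answer into a wrong one (misleading).
    Zero:     the call did not change correctness (redundant).
    """
    return outcome_reward(pred_with_tool, gold) - outcome_reward(pred_without_tool, gold)
```

Because the score is a difference, a model that games the probe so that both predictions are always right (or always wrong) earns zero credit for the call, which is one way to read the abstract's remark about robustness to probe-hacking.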

[MA-7] Clarus: Coordinating Autonomous Research Agents toward Web-Scale Scientific Collaboration

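Quick read: This paper addresses the fragmentation and closedness of current autonomous research agents: most systems treat research as an isolated assistant task or a closed workflow and offer no systematic support for collaboration across participants, resources, and phases. The core challenge is coordinating research questions, evidence, participants, and digital and physical resources under uncertainty. The paper proposes Clarus, a collaboration infrastructure whose key idea is to move from code-centered execution loops to research-oriented collaboration processes, reformulating research as an open, auditable, attributable, and resource-aware multi-phase collaboration process. Its main contributions are a minimal project-agent-resource object model and a four-layer architecture (Research Application, Digital Collaboration, Physical Substrate, and Physical World); core modules are pluggable mechanisms that adapt to task risk, collaboration structure, and resource constraints. A controlled paper-generation case study shows that Clarus can organize a research goal into a traceable, reviewable, attributable, and accumulative collaboration network across phases, tasks, and participants. Together, the object model, collaboration protocol, trust mechanisms, and prototype validation provide an initial foundation for open research networks.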

Link: https://arxiv.org/abs/2606.30246
Authors: Zihan Guo, Zeyi Chen, Zhiyu Chen, Zicai Cui, Shuai Shao, Bo Huang, Zhi Han, Yuanyi Song, Yuan Yuan, Chenxi Zeng, Xiaohang Nie, Zhengxi Yu, Hanwen Zhu, Junwei Liao, Ming Zhou, Yang Li, Yuanjian Zhou, Weinan Zhang
Institutions: Sun Yat-sen University; Shanghai Jiao Tong University; Tongji University; Jilin University; Harbin Institute of Technology; Shanghai Innovation Institute; Shanghai Artificial Intelligence Laboratory
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments: 28 pages, 7 figures, 1 table


Abstract:Existing autonomous research agents can support parts of the research process, but most systems still treat research as either an isolated assistant task or a closed workflow. Therefore, autonomous science needs a collaboration infrastructure that coordinates projects, agents, and digital and physical resources. We identify this as a shift from code-centered execution loops to research-oriented collaboration processes, where questions, evidence, participants, and resources must be coordinated under uncertainty. In this framing, an agent may be an AI system, a human researcher, a team, a laboratory, or an organization-backed participant. To this end, we present Clarus, a collaboration infrastructure for coordinating autonomous research agents toward web-scale scientific collaboration. Clarus reformulates research as an open, auditable, attributable, and resource-aware multi-phase collaboration process. It defines a minimal project-agent-resource object model and organizes scientific collaboration through four layers including Research Application, Digital Collaboration, Physical Substrate, and Physical World. Core modules are implemented as pluggable mechanisms, allowing Clarus to adapt to task risk, collaboration structure, and resource constraints. Through a controlled paper-generation case study, we show that Clarus can organize a research goal into a traceable, reviewable, attributable, and accumulative collaboration network across phases, tasks, and participants. Together, the object model, collaboration protocol, trust mechanisms, and prototype validation provide an initial foundation for open research networks. Clarus is now available at this http URL.

[MA-8] Sparse Sensor Placement in Multi-Agent Reinforcement Learning Control of Rayleigh-Bénard Convection

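Quick read: This paper addresses sparse sensor placement for the control of Rayleigh-Bénard convection. The core challenge is to cut the number of sensors substantially without sacrificing control performance, so as to fit the resource limits of real hardware. The key to the solution is a sparse policy distillation framework based on multi-agent reinforcement learning: dense expert policies are trained with windowed observations and then distilled into sparse apprentice policies by supervised learning with grouped regularization. The framework combines ordered non-convex grouped regularization with iterative reweighted grouped regularization and uses a grouping construction that enforces consistent pruning across overlapping observation windows. Experiments with fixed and varying initial conditions show that the proposed methods reach maximal or near-maximal sparsity, while the sparse apprentices retain control behavior comparable to the dense experts. As a further proof of concept, training from the learned minimal sensor sets reduces the per-agent observation size from 360 to 12 and preserves the overall training trend in simulation while reducing data throughput. The results provide an interpretable basis for identifying control-relevant spatial regions and state components, and a practical pathway toward sensor-efficient control under realistic hardware constraints.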

Link: https://arxiv.org/abs/2606.30238
Authors: Jan Stenner, Hans Harder, Sebastian Peitz
Institutions: Unknown
Categories: Multiagent Systems (cs.MA)
Comments: 22 pages, 11 figures, 1 table


Abstract:This paper studies sparse sensor placement for control of Rayleigh-Bénard convection with multi-agent reinforcement learning. We train dense expert policies with windowed observations and distill sparse apprentice policies by supervised learning with grouped regularization on encoder input weights. The framework combines ordered non-convex grouped regularization and iterative reweighted grouped regularization, and uses a grouping construction that enforces consistent pruning across overlapping observation windows. Experiments with fixed and varying initial conditions show that Multi-Agent Transformer policies train more stably than proximal policy optimization baselines, while sparse apprentices retain control behavior comparable to dense experts. Sparsity results are strong for the proposed grouped methods across settings, including maximal sparsity in all fixed-initial-condition setting variants and maximal or near-maximal sparsity in varying-initial-condition setting variants. As an additional proof of concept, training from learned minimal sensor sets reduces per-agent observation size from 360 to 12 and preserves the overall training trend in simulation while reducing data throughput. The results provide both an interpretable basis for identifying control-relevant spatial regions and state components, and a practical pathway toward sensor-efficient control under realistic hardware constraints.
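
To make "grouped regularization on encoder input weights" concrete, the sketch below computes a plain group-lasso penalty, where all encoder input weights tied to one sensor form a group. This is a simpler stand-in chosen for illustration, not the ordered non-convex or iteratively reweighted variants the abstract names. The data layout (`weights` as one row per encoder input, `groups` mapping a sensor name to its input indices) is an assumption; mapping one physical sensor to several input indices is one possible way to express consistent pruning across overlapping windows.

```python
import math


def group_norms(weights, groups):
    """Euclidean norm of all encoder input weights belonging to each sensor group.

    weights: list of rows, one row of outgoing weights per encoder input.
    groups:  dict mapping a sensor name to the input indices it feeds.
    """
    return {
        sensor: math.sqrt(sum(w * w for i in idxs for w in weights[i]))
        for sensor, idxs in groups.items()
    }


def group_lasso_penalty(weights, groups, lam):
    """Sum of group norms scaled by lam; drives whole groups toward zero together."""
    return lam * sum(group_norms(weights, groups).values())


def active_sensors(weights, groups, tol=1e-8):
    """Sensors whose group norm is still non-negligible after training."""
    return sorted(s for s, n in group_norms(weights, groups).items() if n > tol)


# Toy example: sensor "s1" feeds inputs 1 and 2, e.g. via two overlapping windows.
weights = [[3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]
groups = {"s0": [0], "s1": [1, 2]}
```

In this toy setup only `s0` carries non-zero weights, so `s1` would be the sensor to remove.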

[MA-9] Experience Graphs: The Data Foundation for Self-Improving Agents

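Quick read: This paper addresses the fact that the "experience graph" produced by long-horizon agentic tasks cannot be stored, queried, or reused effectively. Existing agent frameworks treat this experience as disposable state, kept only as JSON checkpoints or session logs that cannot be recovered after a crash, shared across users, or turned into training data, which limits how cumulative and scalable agents can be. The key to the solution is Trellis, a data foundation that manages the experience graph as a first-class database object. The core insight is that search over experience graphs is a database access pattern: frontier selection is a query, cross-session reuse is vector-seeded graph retrieval, training-data extraction is a materialized view, and reconstructing what an agent knew at any past step is a time-travel query. When the database owns the experience graph, agents become stateless compute, and crash recovery, horizontal scaling, and a closed-loop training flywheel follow as architectural byproducts. In KernelEvolve, a production accelerator-kernel optimizer at Meta, cross-session reuse reaches a target speedup roughly 10x faster at 52% lower token cost. More broadly, Trellis turns inference-time search from disposable computation into a durable institutional asset: just as logs made databases reliable, experience graphs may make agents cumulative.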

Link: https://arxiv.org/abs/2606.29823
Authors: Gang Liao, Yujia He, Abdullah Ozturk, Zhouyang Li, Ying Wang, Zhitong Guo, Hongsen Qin, Yaobin Qin, Tao Yang, Zewei Jiang, Dianshi Li, Jort Gemmeke, Jiangyuan Li, Liyuan Li, Nathan Yan, Masha Basmanova, Uladzimir Pashkevich, Matt Steiner, Pedro Pedreira, Rob Fergus, Anirudh Goyal, Carole-Jean Wu, Gaoxiang Liu, Andrew Witten, Daniel J. Abadi
Institutions: Meta Platforms; University of Maryland, College Park
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:


Abstract:The database community has repeatedly advanced the state of the art by recognizing that new workloads demand new system architectures. We argue that long-horizon agentic tasks – code generation, scientific discovery, hardware design – are such a workload. These agents explore: they generate artifacts, execute tools, observe failures, branch, and repair over hundreds of steps. This search produces a structured object we call an experience graph: executable artifacts, tool outputs, rewards, sibling comparisons, and causal lineage. Yet existing agent frameworks treat this experience as disposable state – JSON checkpoints and session logs that cannot be recovered after a crash, queried across users, or materialized into training data. We propose Trellis: a data foundation that treats the experience graph as first-class, governed, queryable database state. The core insight is that search over experience graphs is a database access pattern. Frontier selection is a query, cross-session reuse is vector-seeded graph retrieval, training-data extraction is a materialized view, and reconstructing what an agent knew at any past step is a time-travel query. When the database owns the experience graph, agents become stateless compute, and crash recovery, horizontal scaling, and a closed-loop training flywheel emerge as architectural byproducts. We ground the design in KernelEvolve, a production accelerator-kernel optimizer at Meta, where cross-session reuse reaches a target speedup roughly 10x faster at 52% lower token cost. More broadly, Trellis turns inference-time search from disposable computation into a durable institutional asset: logs made databases reliable; experience graphs may make agents cumulative.
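
The claim that "frontier selection is a query" and that past agent knowledge can be recovered by a time-travel query can be illustrated with Python's built-in `sqlite3` module. This is only a toy sketch: the `node` table, its columns, and both helper functions are hypothetical and say nothing about how Trellis is actually implemented.

```python
import sqlite3

# Hypothetical single-table experience graph: each row is one explored node.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node ("
    "id INTEGER PRIMARY KEY, parent_id INTEGER, session TEXT, step INTEGER, reward REAL)"
)
conn.executemany(
    "INSERT INTO node VALUES (?, ?, ?, ?, ?)",
    [
        (1, None, "s1", 0, 0.10),
        (2, 1, "s1", 1, 0.40),
        (3, 1, "s1", 1, 0.25),
        (4, 2, "s1", 2, 0.55),
    ],
)


def frontier(conn, k):
    """Frontier selection as a query: the k highest-reward nodes with no children."""
    rows = conn.execute(
        "SELECT n.id FROM node n "
        "WHERE NOT EXISTS (SELECT 1 FROM node c WHERE c.parent_id = n.id) "
        "ORDER BY n.reward DESC, n.id LIMIT ?",
        (k,),
    ).fetchall()
    return [r[0] for r in rows]


def known_at(conn, session, step):
    """Time-travel-style query: nodes that existed in a session up to a given step."""
    rows = conn.execute(
        "SELECT id FROM node WHERE session = ? AND step <= ? ORDER BY id",
        (session, step),
    ).fetchall()
    return [r[0] for r in rows]
```

Because all state lives in the table, any worker process holding a connection could call `frontier` and continue the search, which mirrors the abstract's point that agents become stateless compute.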

[MA-10] ECHO: Learning Epistemically Adaptive Language Agents with Turn-Level Credit

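Quick read: This paper addresses epistemic adaptivity in language agents for multi-turn information seeking: deciding what information to seek, how to use new evidence to update beliefs, and when confidence is high enough to act. The core difficulty is that methods relying on a global reward for the final outcome cannot assign per-turn "epistemic credit" accurately, which leads to compounding errors and ineffective policies. The key to the solution is Epistemic Decision Processes (EDPs), a belief-state formulation in which the posterior over a latent task variable drives decisions, so that actions are chosen for their usefulness under the current belief rather than for mere correlation with eventual success. Building on this, the paper introduces ECHO (Epistemic Credit for History-Conditioned Optimization), a clipped policy-gradient objective that assigns turn-level credit using posterior-sensitive rewards. On the newly proposed Clue Selector Game benchmark, ECHO substantially improves resolution, information gain, and efficiency, and matches or exceeds frontier baselines on epistemic metrics such as grounding, recovery, and calibration while producing almost no visible reasoning text.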

Link: https://arxiv.org/abs/2606.29745
Authors: Abhijnan Nath, Nikhil Krishnaswamy
Institutions: Colorado State University
Categories: Multiagent Systems (cs.MA)
Comments:


Abstract:What does it mean for a language agent to be adaptive? Effective multi-turn agents must decide what information to seek, how to use new evidence, and when they are certain enough to act. We introduce Epistemic Decision Processes (EDPs), a belief-state formulation of multi-turn information seeking in which actions produce external observations that update the agent’s posterior over a latent task variable. EDPs make epistemic adaptivity explicit: good policies choose actions that are useful under the current belief, not merely those that correlate with eventual success. We prove that belief-agnostic policies can suffer errors that compound exponentially over the horizon, and that aggregate trajectory returns can fail to identify the per-turn Bayesian advantage needed for epistemic credit. We then introduce ECHO (Epistemic Credit for History-Conditioned Optimization), a practical clipped policy-gradient objective that assigns turn-level credit using posterior-sensitive rewards. In the Clue Selector Game, a novel controlled evidence-seeking benchmark, we show that ECHO substantially improves resolution, information gain, and efficiency over trajectory-level GRPO, and matches or exceeds frontier baselines on epistemic metrics such as grounding, recovery, and calibration while producing almost no visible reasoning text.

[MA-11] Budgeted Act-or-Defer Multi-Agent LLM Deliberation with Local Reliability Bounds

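Quick read: This paper addresses the reliability decision that arises when deploying multi-agent deliberation among large language models (LLMs): whether the current answer is reliable enough to act on or should be escalated to human review. The core challenge is to build a quantifiable, auditable act-or-defer mechanism over the course of a debate that keeps the total risk of wrong actions within a user-declared budget. The key to the solution is a budgeted act-or-defer framework: at each round, the system maps the debate prefix to a low-dimensional state, uses calibration data to compute a k-nearest-neighbor lower confidence bound on state-conditional correctness, and acts only when the bound exceeds a user-specified reliability threshold. The wrong-action risk is decomposed into three parts, calibration failure (δ), residual action risk (α), and representation gap (ε_act), which makes the sources of error explicit. The guarantee is conditional: it relies on a valid local bias envelope and a bounded action-region representation gap, and each assumption is paired with falsification-style diagnostics. Because the same absolute budget means different things across tasks of different difficulty, budgets are set relative to each task's final-round error, and safety is evaluated by normalized budget usage (WA/β). On six benchmarks, the method uses only 9–12% of the pre-declared budget on activated datasets, reaching up to 84% automation and 96% acted-on accuracy; on stress-test datasets it defers rather than forcing unreliable automation. Unlike per-task post-hoc threshold search, the method prospectively converts a user-declared wrong-action budget into an auditable operating point before deployment, under explicitly stated assumptions.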

Link: https://arxiv.org/abs/2606.29654
Authors: Mengdie Flora Wang, Haochen Xie, Guanghui Wang, Devin Zhang, Jae Oh Woo
Institutions: AWS Generative AI Innovation Center; General Motors
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:


Abstract: Multi-agent deliberation among LLMs can improve reasoning, but deployment requires deciding when the current answer is reliable enough to act on and when it should be escalated to human review. We formulate this as budgeted act-or-defer decision making. At each round, the system maps the debate prefix to a low-dimensional state, computes a $k$-nearest-neighbor lower confidence bound on state-conditional correctness using calibration data, and acts only when the bound exceeds a user-specified reliability threshold. The certificate controls wrong actions through the decomposition $\beta = \delta + \alpha + \varepsilon_{\mathrm{act}}$, separating calibration failure, residual action risk, and representation gap. The guarantee is conditional, not distribution-free: it relies on a valid local bias envelope and an action-region representation-gap bound, and each assumption is paired with falsification-style diagnostics. Because the same absolute wrong-action budget has different meanings across tasks of different difficulty, we set budgets relative to each task’s final-round error using training data only, and evaluate safety by normalized budget usage $\mathrm{WA}/\beta$. On six benchmarks against nine baselines, the method uses 9–12% of the pre-declared budget on activated datasets, reaching up to 84% automation and 96% acted-on accuracy; on stress-test datasets, it defers rather than forcing unreliable automation. Rather than relying on per-task post-hoc threshold search, the method prospectively converts a user-declared wrong-action budget into an auditable act-or-defer operating point before deployment, under explicitly stated assumptions.
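
The act-or-defer rule in the abstract above (act only when a k-nearest-neighbor lower confidence bound on correctness clears a threshold) can be sketched as follows. The Hoeffding-style margin used here is an assumption chosen for simplicity; the paper's certificate also involves a local bias envelope and a representation-gap term that this sketch does not model. All function and parameter names are hypothetical.

```python
import math


def knn_lcb(state, calib_states, calib_correct, k, delta):
    """Lower confidence bound on correctness from the k nearest calibration states.

    calib_correct holds 1 for a correct calibration answer and 0 for a wrong one.
    The margin sqrt(ln(1/delta) / (2k)) is a Hoeffding-style bound (an assumption).
    """
    by_distance = sorted(
        (math.dist(state, s), c) for s, c in zip(calib_states, calib_correct)
    )
    neighbors = [c for _, c in by_distance[:k]]
    mean = sum(neighbors) / k
    return mean - math.sqrt(math.log(1.0 / delta) / (2.0 * k))


def act_or_defer(state, calib_states, calib_correct, k, delta, threshold):
    """Return "act" only when the bound exceeds the reliability threshold."""
    bound = knn_lcb(state, calib_states, calib_correct, k, delta)
    return "act" if bound > threshold else "defer"


# Toy calibration set: one region where answers were right, one where they were wrong.
calib_states = [(0.01 * i, 0.0) for i in range(50)] + [(1.0 + 0.01 * i, 1.0) for i in range(50)]
calib_correct = [1] * 50 + [0] * 50
```

With `k=50` and `delta=0.05`, a state near the first cluster clears a 0.8 threshold, while a state near the second cluster is deferred to human review.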

[MA-12] Hybrid Retriever Evolution for Multimodal Document Reasoning Agents

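Quick read: This paper addresses the rigid way retrievers are combined in multimodal document understanding: existing systems typically integrate lexical, semantic, and multimodal retrieval through fixed pipelines that cannot adapt the retrieval strategy to individual reasoning steps. The key to the solution is a failure-driven evolution framework in which a meta-agent autonomously learns how to coordinate multiple retrievers during multi-step document question answering. The meta-agent analyzes incorrect reasoning trajectories, actively probes the tool environment to diagnose root causes, and iteratively rewrites the task agent's instructions, turning retrieval from a static front-end stage into a dynamic, step-wise reasoning decision. The evolved agent decides adaptively when to invoke each retriever, how to combine their results, and how to compose evidence across modalities and pages. On MMLongBench-Doc and DocBench, the method gains up to 19.6 points over the unevolved baseline and consistently outperforms recent systems such as MACT, MDocAgent, and SimpleDoc. Detailed retrieval analyses confirm that the gains come from adaptive routing and evidence composition rather than any hard-coded retrieval mode, and the evolution dynamics show a progressive shift from narrow lexical behavior to rich multi-tool coordination. The findings establish autonomous multi-agent coordination as a promising paradigm for multimodal document reasoning.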

Link: https://arxiv.org/abs/2606.29648
Authors: Bohan Yao, Shruthan Radhakrishna, Vikas Yadav
Institutions: ServiceNow; University of Washington
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 17 pages, 3 figures


Abstract:Different retrievers, including lexical, semantic, and multimodal approaches, provide highly complementary strengths for multimodal document understanding, yet most systems combine them through fixed pipelines that cannot adapt to the demands of individual reasoning steps. In this work, we ask whether retrieval orchestration itself can be learned as part of the reasoning process. We introduce a failure-driven evolution framework in which a meta-agent autonomously discovers how a tool-using task agent should coordinate diverse retrievers during multi-step document question answering. The meta-agent analyzes incorrect reasoning trajectories, actively probes the same tool environment to diagnose root causes, and iteratively rewrites the task agent’s instructions, turning retrieval from a fixed front-end stage into an adaptive, step-wise reasoning decision. The evolved agent learns when to invoke each retriever, how to combine them, and how to compose evidence across modalities and pages. On MMLongBench-Doc and DocBench, the evolved agent achieves gains of up to +19.6 points over the unevolved baseline and consistently outperforms recent systems including MACT, MDocAgent, and SimpleDoc. Detailed retrieval analyses confirm that these improvements arise from adaptive routing and evidence composition rather than reliance on any hard coded retrieval mode, and evolution dynamics reveal a progressive shift from narrow lexical behavior to rich multi-tool coordination. These findings establish autonomous multi-agent coordination as a promising paradigm for multimodal document reasoning.

[MA-13] Langshaw: Declarative Interaction Protocols Based on Sayso and Conflict IJCAI2024

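[Quick read]: This paper addresses a shortcoming of existing multiagent protocol languages: they either over-constrain protocol enactments or make it hard to capture the protocol's meaning. The core challenge is to keep protocols flexible while ensuring they are correct and enactable. The key to the solution is Langshaw, a declarative protocol language built on three constructs: (1) sayso, which makes explicit who has priority over setting each attribute, and (2) nono and nogo, two constructs that capture conflicts between actions. By combining these constructs with an information model, Langshaw unites clear expression of meaning with flexible enactment. The paper also gives a formal semantics for Langshaw, procedures for determining the safety and liveness of a protocol, and a method for generating a message-oriented protocol (embedding the needed coordination) that supports flexible asynchronous enactment.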

Link: https://arxiv.org/abs/2606.29601
Authors: Munindar P. Singh, Samuel H. Christie V, Amit K. Chopra
Affiliations: North Carolina State University; Lancaster University
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Appeared in IJCAI 2024

Abstract:Current languages for specifying multiagent protocols either over-constrain protocol enactments or complicate capturing their meanings. We propose Langshaw, a declarative protocol language based on (1) sayso, a new construct that captures who has priority over setting each attribute, and (2) nono and nogo, two constructs to capture conflicts between actions. Langshaw combines flexibility with an information model to express meaning. We give a formal semantics for Langshaw, procedures for determining the safety and liveness of a protocol, and a method to generate a message-oriented protocol (embedding needed coordination) suitable for flexible asynchronous enactment.

[MA-14] Persona-Trained Monte Carlo: Estimating Market-Outcome Distributions via Swarms of Persona-Conditioned Neural Policy Bots in a Limit Order Book

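[Quick read]: This paper addresses uncertainty estimation for market-outcome statistics (such as price volatility and changes in liquidity) in financial markets, in particular the modeling challenge posed by complex market microstructure combined with heterogeneous participants. Classical Monte Carlo methods usually inject randomness only through price and so struggle to capture how diverse trader behavior shapes market dynamics, while alternatives such as hand-coded agent-based models or single-agent reinforcement learning often fail to reflect individual trader differences and their interaction effects. The paper proposes Persona-Trained Monte Carlo (PTMC), whose core is a swarm of persona-conditioned neural-policy trading bots: the bots share one trained policy network, but each is conditioned on persona parameters sampled independently from a learned trader-heterogeneity distribution. The bots interact in a continuous double auction, and the resulting price path constitutes one Monte Carlo sample. Repeating this over independent persona-population draws, together with within-run action sampling and optional exogenous shocks, injects randomness from several sources, with the aim of simulating market evolution more realistically. The key innovation is to model trader heterogeneity explicitly as learnable latent persona variables, and to ground the policy network design, training data construction, and validation protocol in cross-disciplinary foundations (including behavioral finance, market microstructure, deep reinforcement learning, and generative agents). The paper further formalizes the PTMC estimator and its convergence properties and proposes a four-level validation methodology: stylized-fact matching, microstructure- and agent-level checks, and a historical stress-test comparison against a zero-intelligence baseline. The framework is proposed but not implemented; the contribution is a formal estimator, a cross-disciplinary design justification, and an actionable validation roadmap for future research.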

Link: https://arxiv.org/abs/2606.29556
Authors: Salavat Ishbulatov
Affiliations: Doplan
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 58 pages, 3 figures, 9 tables, 3 algorithms. Survey and proposed framework; no implementation or empirical results

Abstract:We propose Persona-Trained Monte Carlo (PTMC), a method for estimating distributions of market-outcome statistics by repeatedly simulating limit-order-book interaction among swarms of persona-conditioned neural-policy trading bots. Each run instantiates many bots sharing one trained policy network but conditioned on heterogeneous, individually sampled persona parameters drawn from a learned trader-heterogeneity distribution; the bots interact in a continuous double auction, and the resulting price path is one Monte Carlo sample. Repeating this over independent persona-population draws yields an ensemble from which a target market statistic is estimated. Randomness enters through persona draws, within-run action sampling, and optional exogenous shocks, not solely through price as in classical Monte Carlo. We distinguish PTMC from adjacent paradigms, including classical Monte Carlo, hand-coded agent-based models, single-agent reinforcement learning, and large-language-model-based generative agents. To justify the design, we survey cross-disciplinary foundations – agent-based computational economics, market microstructure, behavioral finance, deep reinforcement learning, generative/LLM-based agents, news-driven trading, systemic risk, econophysics, and game theory – connecting each literature to a specific design choice in the policy network, training data, or validation protocol. We formalize the PTMC estimator and its convergence properties, specify a candidate bot architecture and training objective, and propose a four-level validation methodology: stylized-fact matching, microstructure- and agent-level checks, and historical stress-test comparison against a zero-intelligence baseline. The framework is proposed but not implemented: we contribute a formal estimator, a cross-disciplinary design justification, and a validation roadmap, and conclude with open research questions.

[MA-15] Mixture of Debaters: Learn to Debate at Architectural Level in Multi-Agent Reasoning

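[Quick read]: This paper addresses two core limitations of existing multi-agent debate frameworks: they rely on static architectures in which agent roles and coordination patterns are fixed at design time, and they require instantiating multiple model copies, which incurs substantial computational overhead. The paper proposes Mixture of Debaters (MoD), a unified framework that uses the Mixture-of-Experts (MoE) paradigm to enable dynamic self-debate within a single model. The key components are: (1) dual-routing, which decouples role allocation from process flow and dynamically determines when to debate and when to synthesize; (2) momentum switching, which smooths token-level routing with local context to reduce expert-switch jitter; and (3) unified self-debate, which encapsulates diverse debating personas into lightweight expert modules, preserving behavioral diversity without inter-agent communication. Extensive experiments on multimodal benchmarks show that MoD achieves higher accuracy than both single-model baselines and conventional multi-agent systems, with 3.7x lower latency and an 87% reduction in token consumption.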

Link: https://arxiv.org/abs/2606.29425
Authors: Dayong Liang, Kaisong Gong, Yi Cai, Changmeng Zheng, Xiao-Yong Wei
Affiliations: South China University of Technology; Tianjin University; The Hong Kong Polytechnic University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Multimedia (cs.MM)
Comments:

Abstract:Existing multi-agent debate frameworks suffer from two critical limitations: they rely on static architectures where agent roles and coordination patterns are fixed at design time, and they require instantiating multiple model copies, incurring substantial computational overhead. We propose Mixture of Debaters (MoD), a unified framework that enables dynamic self-debate within a single model by leveraging the Mixture-of-Experts paradigm. We address three key challenges in adapting MoE for dialectical reasoning: (1) dual-routing that decouples role allocation from process flow, dynamically determining when to debate versus when to synthesize; (2) momentum switching that smooths token-level routing with local context, reducing expert-switch jitter; and (3) unified self-debate that encapsulates diverse debating personas into lightweight expert modules, eliminating inter-agent communication while preserving behavioral diversity. Extensive experiments on multimodal benchmarks demonstrate that MoD outperforms both single-model baselines and conventional multi-agent systems, achieving superior accuracy with 3.7x lower latency and 87% reduction in token consumption. The source code can be accessed at this https URL.

[MA-16] Minority Sentinel: When to Overturn Majority Voting in Multi-Agent LLM Debates SIGIR2026

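[Quick read]: This paper addresses the "Minority Truth" problem in Multi-Agent Debate (MAD): because the errors of large language models (LLMs) are strongly correlated, majority voting systematically suppresses minority opinions that hold the correct answer. The core solution is Minority Sentinel, a lightweight meta-classifier that extracts a multi-dimensional debate fingerprint from debate logs and trains a LightGBM model to decide when to overturn the majority vote. The key insight is that debate logs contain enough behavioral signal for a non-LLM classifier to reliably recover suppressed correct minorities: the method reaches a stable Flip Precision of 81.2% with positive Net Gain across all six datasets and all 20 random-seed trials, without degrading overall system accuracy. The study further shows that flip safety, not recovery volume, determines the value of an intervention: the LLM-as-Judge baseline yields negative Net Gain despite higher recall.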

Link: https://arxiv.org/abs/2606.29270
Authors: Chuan He, Zebin Chen, Zhengyi Yang, Shaobo Qiao, Mingchen Ju, Jiate Liu, Dong Wen, Guanfeng Liu
Affiliations: University of New South Wales; Euler AI; Macquarie University
Categories: Multiagent Systems (cs.MA)
Comments: 11 pages, 4 figures. Accepted at the AgentSearch Workshop @ SIGIR 2026, Melbourne, Australia

Abstract:Multi-Agent Debate (MAD) with Majority Voting is a dominant paradigm for improving LLM reasoning, yet its effectiveness rests on the Condorcet Jury Theorem’s assumption of independent errors. Because contemporary LLMs share similar pretraining corpora, their errors are strongly correlated, causing the majority to systematically suppress correct minority opinions, a phenomenon we term Minority Truth. Through debates among three heterogeneous LLM agents on six benchmarks, we find that roughly one in four divergent cases has the minority holding the correct answer, yielding a 10-percentage-point theoretical recovery margin. We propose Minority Sentinel, a lightweight meta-classifier that extracts a multi-dimensional debate fingerprint from debate logs and trains a LightGBM model to decide when to overturn majority voting. Minority Sentinel achieves a stable Flip Precision of 81.2% with positive Net Gain across all six datasets and all 20 random seed trials, demonstrating that debate logs contain sufficient behavioral signals for a non-LLM classifier to reliably recover suppressed minorities without degrading system accuracy. The LLM-as-Judge baseline yields negative Net Gain despite higher recall, confirming that flip safety, not recovery volume, determines intervention value.
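
The abstract does not define Flip Precision or Net Gain, so the sketch below rests on assumed definitions: Flip Precision is taken as the share of overturned votes that end on the correct answer, and Net Gain as correct flips minus flips that overturned a correct majority. The case data and the `flip_to` field are hypothetical; in the paper the flip decision comes from a LightGBM classifier, which is not reproduced here.

```python
from collections import Counter


def majority_vote(answers):
    """Most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def evaluate_flips(cases):
    """Score a flip policy under the assumed metric definitions.

    Each case has 'answers' (one per agent), 'gold', and 'flip_to'
    (the answer the policy switches to, or None to keep the majority).
    """
    flips = helpful = harmful = 0
    for case in cases:
        majority = majority_vote(case["answers"])
        final = majority if case["flip_to"] is None else case["flip_to"]
        if final == majority:
            continue  # no intervention on this case
        flips += 1
        if final == case["gold"]:
            helpful += 1  # a correct minority was recovered
        elif majority == case["gold"]:
            harmful += 1  # a correct majority was overturned
    precision = helpful / flips if flips else 0.0
    return {"flip_precision": precision, "net_gain": helpful - harmful}


# Hypothetical three-agent debates.
cases = [
    {"answers": ["A", "A", "B"], "gold": "B", "flip_to": "B"},
    {"answers": ["A", "A", "B"], "gold": "A", "flip_to": "B"},
    {"answers": ["C", "C", "D"], "gold": "D", "flip_to": "D"},
    {"answers": ["A", "A", "B"], "gold": "A", "flip_to": None},
]
result = evaluate_flips(cases)
```

Under these definitions a policy that flips often can have high recall and still lose, because every harmful flip cancels a helpful one.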

[MA-17] Projected Exploitability Descent for Nash Equilibrium Computation in Multiplayer Imperfect-Information Games

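[Quick read]: This paper addresses the scalability and performance bottlenecks of computing Nash equilibria in multiplayer imperfect-information games, where existing methods either fail to scale or perform poorly. It proposes projected exploitability descent (PED), which approximates a Nash equilibrium by running projected subgradient descent to minimize a proxy for the multiplayer generalized exploitability function. The key observation is that although the objective is nonconvex and nonsmooth, it can be written as a sum of maxima of linear functions, so a subgradient is easy to compute and project onto the polytope of feasible sequence-form strategies. PED is evaluated on a generalized version of three-player Kuhn poker, a well-studied benchmark for which no prior exact algorithm scales to deck sizes larger than 4. Fictitious play (FP) and counterfactual regret minimization (CFR) perform significantly better in the initial iterations, but PED shows consistent, near-monotonic improvement throughout all runs. This motivates a hybrid algorithm, FP-PED, which runs FP for an initial burn-in period and then switches to PED for stable long-run refinement; it can equally be viewed as a multi-step algorithm that uses FP as a pre-processing step to give PED a strong initialization.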

Link: https://arxiv.org/abs/2606.29169
Authors: Sam Ganzfried
Affiliations: Ganzfried Research, Cornell University
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:Many important games have more than two players and imperfect information. Existing approaches for computing Nash equilibrium, the central game-theoretic solution concept, in such games either lack scalability or obtain poor performance. In this paper we introduce a new algorithm called projected exploitability descent (PED) for approximating Nash equilibria in multiplayer games of imperfect information. The algorithm works by running projected subgradient descent minimizing a proxy for the multiplayer generalized exploitability function. The objective is nonconvex and nonsmooth, but can be represented as the sum of the maxima of linear functions, for which a subgradient can easily be computed and projected to the polytope of feasible sequence-form strategies. We explore performance of PED on a generalized version of the well-studied benchmark game three-player Kuhn poker. No prior exact algorithms scale to the version of the game with deck size larger than 4, and we compare performance to the popular algorithms of fictitious play (FP) and counterfactual regret minimization (CFR). We find that PED obtains a consistent near-monotonic improvement throughout all runs, though both FP and CFR perform significantly better in the initial iterations. This inspires a hybrid algorithm FP-PED that runs FP for an initial burn-in period before switching to PED for stable long-run refinement. We can alternatively view this as a multi-step algorithm that runs FP as a pre-processing step to obtain a strong initialization for PED.
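
To make "projected subgradient descent on a sum of maxima of linear functions" concrete, here is a minimal sketch on a much simpler setting than the paper's: a two-player zero-sum matrix game (rock-paper-scissors) with strategies projected onto the probability simplex. This is an assumption-laden toy, not the authors' algorithm: the paper targets multiplayer games with a nonconvex objective and projects onto the sequence-form polytope, whereas the objective here happens to be convex. The step-size schedule and function names are illustrative.

```python
import math

# Row player's payoffs in rock-paper-scissors (the column player receives the negative).
A = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]


def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t  # the last k that passes the test fixes the shift
    return [max(x - theta, 0.0) for x in v]


def exploitability(x, y):
    """Sum of both players' best-response values, plus the maximizing indices."""
    row_values = [sum(A[i][j] * y[j] for j in range(3)) for i in range(3)]
    col_values = [-sum(A[i][j] * x[i] for i in range(3)) for j in range(3)]
    i_star = row_values.index(max(row_values))
    j_star = col_values.index(max(col_values))
    return row_values[i_star] + col_values[j_star], i_star, j_star


def ped_sketch(x, y, iterations=2000, base_step=0.5):
    """Projected subgradient descent on exploitability; returns the best value seen."""
    best = exploitability(x, y)[0]
    for t in range(1, iterations + 1):
        _, i_star, j_star = exploitability(x, y)
        step = base_step / math.sqrt(t)  # diminishing step size
        grad_x = [-A[i][j_star] for i in range(3)]  # from the max over columns
        grad_y = [A[i_star][j] for j in range(3)]   # from the max over rows
        x = project_simplex([x[i] - step * grad_x[i] for i in range(3)])
        y = project_simplex([y[j] - step * grad_y[j] for j in range(3)])
        best = min(best, exploitability(x, y)[0])
    return best


x0, y0 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
initial = exploitability(x0, y0)[0]
best = ped_sketch(x0, y0)
```

Each max term contributes the linear piece that attains it as a subgradient, which is why the update only needs the best-response indices.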

[MA-18] LLM Semantic Signaling Game and Mechanism Design: Systematic Blindness Awareness Shaping and Mindset Dynamics

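[Quick read]: This paper addresses the strategic interactions that arise when language mediates generative AI systems, where the core challenge is semantic-level control in the face of communication and deception. The key to the solution is a semantic signaling game in which a sender selects a semantic control, a large language model (LLM) generates a stochastic message, and a receiver evaluates the message with an awareness-dependent scoring mechanism. Receiver awareness is modeled as a type that determines which linguistic features the receiver perceives and uses for inference, which gives a formal model of systematic blindness and connects prompt-based control, statistical detection, and game-theoretic equilibrium analysis. Gaussian approximations of aggregate message scores yield likelihood-ratio decision rules, and Perfect Bayesian Nash equilibria characterize the players' strategic behavior. The paper further develops mechanism-design approaches that reshape receiver awareness, penalize deceptive semantic controls, and modify the receiver population to induce benign pooling equilibria. Numerical experiments validate the Gaussian approximation, quantify awareness-ordering effects, analyze mindset dynamics under adaptive adversaries, and show that awareness shaping and guardrail costs reduce the rate of successful phishing attacks. The framework offers a principled foundation for analyzing strategic language-mediated interaction in agentic AI systems and new tools for designing robust and secure human-AI communication.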

Link: https://arxiv.org/abs/2606.29113
Authors: Quanyan Zhu
Affiliations: New York University Tandon School of Engineering
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:Large language models (LLMs) increasingly mediate strategic interactions through natural language, making semantic control a critical element of communication and deception. This paper develops a semantic signaling game in which a sender selects a semantic control, an LLM generates a stochastic message, and a receiver evaluates the message using an awareness-dependent scoring mechanism. Receiver awareness is modeled as a type that determines which linguistic features are perceived and used for inference, providing a formal model of systematic blindness. The framework connects prompt-based control, statistical detection, and game-theoretic equilibrium analysis. Gaussian approximations of aggregate message scores enable likelihood-ratio decision rules, while Perfect Bayesian Nash equilibria characterize strategic behavior. The paper further develops mechanism-design approaches that reshape receiver awareness, penalize deceptive semantic controls, and modify receiver populations to induce benign pooling equilibria. Numerical experiments validate the Gaussian approximation, quantify awareness-ordering effects, analyze mindset dynamics under adaptive adversaries, and demonstrate how awareness shaping and guardrail costs reduce successful phishing attacks. The proposed framework provides a principled foundation for analyzing strategic language-mediated interactions in agentic AI systems and offers new tools for the design of robust and secure human-AI communication.

[MA-19] Metric Aggregation Divergence: A Hidden Validity Threat in Agent-Based Policy Optimization and a Contractual Remedy

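[Quick read]: This paper addresses metric aggregation divergence (MAD), which arises when an agent-based model is coupled with a multi-objective evolutionary algorithm (ABM+MOEA) and each pipeline stage independently re-implements how an outcome metric is extracted. Each stage is internally coherent, and the inconsistency becomes visible only when cross-stage outputs are compared. Inspecting the code of EpidemiOptim, a published epidemic policy toolbox, the authors find three structurally independent aggregation paths; a faithful replication of this structure produces champion disagreement in 64.2% of 500 independent runs. In a 300-seed policy-flip experiment, the optimizer recommends the wrong champion in 83% of replications, with a mean welfare gap of 2.19 units and a Gini inequality gap of 0.050 units. A follow-up inference audit finds that 3 of 249 flipped seeds cross the statistical significance boundary itself. The proposed remedy is the metric contract: a single shared callable that every pipeline stage is forced to use at dispatch time. Framed as standard engineering discipline applied to the cross-stage metric interface, the contract eliminates divergence by construction at roughly 3% runtime overhead.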

Link: https://arxiv.org/abs/2606.29038
Authors: Ruiyu Zhang, Lin Nie, Xin Zhao
Affiliations: The University of Hong Kong; The Hong Kong Polytechnic University
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 13 pages, 2 figures

Abstract:Metric aggregation divergence (MAD) is the silent inconsistency that arises when distinct pipeline stages in an agent-based model coupled with a multi-objective evolutionary algorithm (ABM+MOEA) independently re-implement how an outcome metric is extracted from simulation trajectories. Unlike deliberate analytical choices, MAD operates at the level of pipeline architecture: each stage is internally coherent, and the inconsistency becomes visible only when cross-stage outputs are compared. Code inspection of EpidemiOptim, a JAIR-published epidemic policy toolbox, reveals three structurally independent aggregation paths in peer-reviewed code. A faithful replication of this structure produces champion disagreement in 64.2% of independent runs (n=500, 95% CI: [59.9%, 68.3%]). In a 300-seed policy-flip experiment, divergent aggregation causes the optimizer to recommend the wrong champion in 83% of replications, with a mean welfare gap of 2.19 units and a Gini inequality gap of 0.050 units. In a follow-up inference audit, 3 of 249 flipped seeds cross the significance boundary itself. A complementary enterprise follow-up produces the predicted null under near-commensurable rankings (rho = 0.991), while a public upstream rerun of the Lake Problem DPS workflow shows that the archived published-path recommendation reaches joint-threshold success 0.401 whereas a shared contract-path rule reaches 0.552. We introduce the metric contract - a single shared callable enforced at dispatch time across all pipeline stages - as the remedy. Framed as standard engineering discipline applied to the cross-stage metric interface, the contract eliminates divergence by construction with approximately 3% runtime overhead.
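
A small sketch of how divergence can arise and how a shared callable removes it. Everything here is hypothetical and is not the EpidemiOptim code: the two aggregation functions, the `MetricContract` class, and the toy welfare trajectories are invented for illustration.

```python
from statistics import mean


def mean_welfare(trajectory):
    """The one agreed aggregation: average welfare over the whole run."""
    return mean(trajectory)


def final_welfare(trajectory):
    """A plausible but different aggregation: welfare at the last step."""
    return trajectory[-1]


def pick_champion(policies, metric):
    """Return the policy name whose trajectory scores highest under `metric`."""
    return max(policies, key=lambda name: metric(policies[name]))


class MetricContract:
    """Holds the shared metric and hands it to every stage at dispatch time."""

    def __init__(self, metric):
        self._metric = metric

    def dispatch(self, stage, policies):
        # Stages receive the metric; they never define their own.
        return stage(policies, self._metric)


policies = {"A": [1, 1, 10], "B": [5, 5, 5]}

# Without a contract: two stages, two aggregations, two champions.
optimizer_pick = pick_champion(policies, final_welfare)  # "A" (10 vs 5)
report_pick = pick_champion(policies, mean_welfare)      # "B" (4 vs 5)

# With a contract: both stages are dispatched with the same callable.
contract = MetricContract(mean_welfare)
contract_picks = [contract.dispatch(pick_champion, policies) for _ in range(2)]
```

Neither aggregation is wrong on its own, which is what makes the disagreement easy to miss until the two picks are compared.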

[MA-20] Pure Nash Equilibria under the Affine Mechanism: A Potential Game of Exaggeration

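[Quick read]: This paper addresses the fact that the mean mechanism is not incentive-compatible: rational players have an incentive to misreport their values. The mechanism is nonetheless prevalent in practice because of its other desirable properties. The core contribution is a full characterization of the pure Nash equilibria (that is, how players will misreport) of the affine mechanism, of which the mean is a special case. The paper characterizes both complete-information and Bayesian games under the affine mechanism, and the results highlight that extreme exaggeration is inevitable in such games.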

Link: https://arxiv.org/abs/2606.29010
Authors: Jason Jisen Li, Young Wu, Yancheng Zhu, Jin-yi Cai, Xiaojin Zhu
Affiliations: Unknown
Categories: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments:

Abstract:The mean mechanism is known to be non-incentive-compatible, namely, rational players are incentivized to misreport their values. Despite this game-theoretic issue, the mean mechanism is prevalent in practice due to its other desirable properties. We give a full characterization of pure Nash equilibria–how the players will misreport–for the affine mechanism, of which the mean is a special case. Furthermore, we characterize both complete-information and Bayesian games under the affine mechanism. Our results highlight the inevitability of extreme exaggeration in such games.
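
The exaggeration effect can be seen in a toy version of the mean mechanism. The sketch assumes details the abstract does not give: reports are confined to [0, 1], and each player wants the mean of all reports to land as close as possible to their own true value. Under those assumptions a player's best response is to report whatever moves the mean onto their value, clipped to the allowed range.

```python
def best_response(value, others_total, n):
    """Report in [0, 1] that brings the mean of all n reports closest to `value`."""
    return min(1.0, max(0.0, n * value - others_total))


def iterate_best_responses(values, rounds=10):
    """Players take turns best-responding, starting from truthful reports."""
    reports = list(values)
    n = len(values)
    for _ in range(rounds):
        for i, value in enumerate(values):
            others_total = sum(reports) - reports[i]
            reports[i] = best_response(value, others_total, n)
    return reports


values = [0.2, 0.5, 0.9]  # hypothetical true values
reports = iterate_best_responses(values)
outcome = sum(reports) / len(reports)
```

In this example the players with the lowest and highest values end up pinned at the bounds 0 and 1, and only the middle player's preference is met by the outcome.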

[MA-21] When Latent Agents Lie: KV-Cache Integrity in Multi-Agent LLM Collaboration

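[Quick read]: This paper addresses the security risk created when agents in a multi-agent question-answering system share latent state, namely their full key-value cache (KV-cache). Agents collaborate through short visible text messages, but the full KV-cache passed alongside them can silently influence the final answer and is hard to inspect. In clean runs, latent collaboration improves on a matched text-only version: on transformed HiddenBench with Qwen3-4B it reaches EM/F1 of 0.338/0.486, compared with 0.231/0.369 for text collaboration. When one specialist is malicious, however, changing the hidden KV state can collapse performance even while the visible commitment still looks plausible, and a verifier that checks only text misses the attack. Simple magnitude checks catch some obvious corruptions but can be evaded by adaptive attacks. The key to the solution is to treat full KV-cache state as a security-sensitive object and protect it in transport: an HMAC-SHA256 manifest binds the specialist, session, model, visible commitment, tensor metadata, and payload digest. In the experiments it accepts all 774 honest replayed payloads and rejects all 295 recorded tampered payloads. The main lesson is that full-KV latent memory is useful but should be protected by an integrity mechanism rather than by guessing whether the hidden state looks normal.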

Link: https://arxiv.org/abs/2606.28958
Authors: Luís Brito, Carlos Baquero
Affiliations: Escola Superior de Tecnologia e Gestão (ESTG), Politécnico de Viana do Castelo (IPVC), Portugal; Faculdade de Engenharia (FEUP), Universidade do Porto (UP), Portugal
Categories: Multiagent Systems (cs.MA)
Comments: 16 pages, 2 figures, 3 tables

Abstract:LLM agents can share more than text. In some systems, an agent can send a short visible message while also passing its full KV-cache state to another model. This hidden state can help the final model combine evidence from several agents, but it is also hard to inspect. A visible message may look harmless even if the hidden state has been changed. We study this problem in a multi-agent question-answering setup. Specialists each see part of the evidence, send a short commitment, and pass full KV-cache state to a coordinator. In clean runs, this latent collaboration improves over a matched text-only version. On transformed HiddenBench with Qwen3-4B, it reaches EM/F1 of 0.338/0.486, compared with 0.231/0.369 for text collaboration. Qwen3-8B and HotPotQA runs show the same direction of improvement. The problem appears when one specialist is malicious. Some false visible commitments can steer answers. More seriously, changing the hidden KV state can collapse performance even when the visible commitment still looks plausible. A verifier that checks only text misses this failure mode. Simple magnitude checks catch some obvious corruptions, but adaptive attacks can evade them while still damaging the final answer. The most reliable fix we find is not to guess whether hidden state looks normal, but to protect it in transport. We implement an HMAC-SHA256 manifest that binds the specialist, session, model, visible commitment, tensor metadata, and payload digest. It accepts all 774 honest replayed payloads and rejects all 295 recorded tampered payloads. The main lesson is that full-KV latent memory can be useful, but it should be treated as a security-sensitive object, not as ordinary internal model state. 
ACM classes: C.2.0; I.2.11; I.2.7; D.4.6. DOI: https://doi.org/10.48550/arXiv.2606.28958 (arXiv-issued via DataCite, pending registration). Submitted (v1): Sat, 27 Jun 2026 15:00:01 UTC (31 KB).
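
The abstract lists what the HMAC-SHA256 manifest binds but not its wire format, so the following is only a sketch of one way such a manifest could look, using Python's standard `hmac`, `hashlib`, and `json` modules. The field names, the JSON canonicalization, and the shared key are assumptions, and raw bytes stand in for the serialized KV-cache tensors.

```python
import hashlib
import hmac
import json


def _canonical(fields):
    """Deterministic byte encoding of the manifest fields (assumed format)."""
    return json.dumps(fields, sort_keys=True, separators=(",", ":")).encode("utf-8")


def build_manifest(key, specialist, session, model, commitment, tensor_meta, payload):
    """Bind identity, context, visible text, and a payload digest under one MAC."""
    fields = {
        "specialist": specialist,
        "session": session,
        "model": model,
        "commitment": commitment,
        "tensor_meta": tensor_meta,
        "payload_sha256": hashlib.sha256(payload).hexdigest(),
    }
    tag = hmac.new(key, _canonical(fields), hashlib.sha256).hexdigest()
    return {"fields": fields, "tag": tag}


def verify_manifest(key, manifest, payload):
    """Accept only if the MAC is valid and the payload matches its bound digest."""
    fields = manifest["fields"]
    expected = hmac.new(key, _canonical(fields), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, manifest["tag"]):
        return False
    actual_digest = hashlib.sha256(payload).hexdigest()
    return hmac.compare_digest(fields["payload_sha256"], actual_digest)


key = b"shared-secret-for-illustration-only"
payload = b"\x00\x01\x02\x03"  # stand-in for serialized KV-cache tensors
manifest = build_manifest(
    key, "specialist-1", "session-42", "Qwen3-4B",
    "The answer is Paris.", {"dtype": "float16", "shape": [2, 4]}, payload,
)

honest_ok = verify_manifest(key, manifest, payload)
tampered_ok = verify_manifest(key, manifest, payload + b"\xff")
```

Because the visible commitment sits inside the authenticated fields, swapping either the text or the hidden payload after signing makes verification fail, which matches the abstract's point that the fix is protecting the state in transport rather than judging whether it looks normal.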

[MA-22] Multi-Agent Routing as Set-Valued Prediction: A WildChat Benchmark and Cost-Aware Evaluation RECSYS2026

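[Quick read]: This paper addresses agent selection when routing natural-language prompts to tools and agents. The core challenge is balancing accurate selection against execution cost: a single query may need several agents, but over-selection increases cost. The key contribution is a benchmark of 3,000 prompts over a fixed 12-agent catalog, derived from WildChat, with AI-assisted heuristic labels and controlled multi-label rebalancing, together with an evaluation protocol that covers set-level metrics (Precision, Recall, F1, Jaccard, and Exact Match), latency, an execution-oriented capability-coverage simulation, and a constrained weighted-routing setting based on ordinal agent-cost tiers. Supervised routers substantially outperform nearest-neighbor matching and a zero-shot LLM baseline. The fine-tuned encoder achieves the strongest unconstrained set accuracy, while the linear multilabel model is the strongest practical baseline. In the constrained setting, applying Weighted Agent Routing (WAR) as a post-scoring layer on top of strong supervised scorers improves utility, with the largest gain for Encoder+WAR. Overall, the benchmark and protocol support reproducible study of accuracy-cost trade-offs in fixed-catalog multi-agent routing.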

Link: https://arxiv.org/abs/2606.28925
Authors: Ananto Nayan Bala, Faisal Muhammad Shah
Affiliations: Ahsanullah University of Science and Technology, Dhaka, Bangladesh
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
Comments: 9 pages, 8 figures. Under review at ACM RecSys 2026

Abstract:Tool and agent routing from natural-language prompts is naturally a set-valued prediction problem: a single query may require multiple agents, while over-selection increases execution cost. The benchmark introduced here is derived from WildChat and contains 3,000 prompts over a fixed 12-agent catalog, with AI-assisted heuristic labels under a fixed schema and controlled rebalancing for multi-label evaluation. The evaluation protocol combines set-level metrics (Precision, Recall, F1, Jaccard, and Exact Match), latency, an execution-oriented capability-coverage simulation, and a constrained weighted-routing setting based on ordinal agent-cost tiers. Compared methods include nearest-neighbor matching, linear multilabel classification, dependency-aware baselines, a fine-tuned encoder, deterministic weighted post-scoring via Weighted Agent Routing (WAR), and a zero-shot LLM baseline. Results show that supervised routers substantially outperform nearest-neighbor and zero-shot LLM routing. The fine-tuned encoder achieves the strongest unconstrained set accuracy, while the linear multilabel model provides the strongest practical baseline. In the constrained setting, the weighted routing layer improves utility when applied on top of strong supervised scorers, with the largest gain observed for Encoder+WAR. Overall, the benchmark and evaluation protocol support reproducible study of accuracy-cost trade-offs in fixed-catalog multi-agent routing.
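
The set-level metrics named in the abstract can be computed per prompt as below. This is a generic sketch using the usual set-based definitions, not the benchmark's evaluation code; the agent names are hypothetical, and treating two empty sets as a perfect Jaccard match is a convention chosen here.

```python
def set_metrics(predicted, gold):
    """Set-level routing metrics for one prompt."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    union = len(predicted | gold)
    jaccard = overlap / union if union else 1.0  # both sets empty: treated as a match
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "jaccard": jaccard,
        "exact_match": float(predicted == gold),
    }


# The router over-selects one agent: recall stays perfect, precision drops.
scores = set_metrics(predicted={"code", "search", "math"}, gold={"code", "search"})
```

Over-selection leaves recall untouched while lowering precision, Jaccard, and Exact Match, which is the accuracy-cost tension the benchmark is built to measure.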

[MA-23] Exit-and-Join Dynamics and Equilibrium in Continuum Cooperative Games

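[Quick read]: This paper addresses the modeling of coalition formation and adaptation in large-scale multi-agent systems: how to describe, within nonatomic cooperative games, agents exiting and joining coalitions and the effect on individual incentives. The key to the solution is a marginal-contribution-based payoff density obtained by treating each coalition as a restricted nonatomic game, from which deterministic mean-field dynamics are derived under decentralized switching rules. Payoff-difference switching recovers the replicator dynamics of evolutionary game theory as a special case. Exit-and-join equilibrium is defined by the absence of profitable positive-mass deviations and is proved equivalent to stationarity of the induced mass dynamics under incentive-compatible and strictly payoff-responsive switching rates. For mass-based cooperative games, the authors construct a Lyapunov function and establish global convergence under strict concavity; they also show that the equilibrium is equivalent to a Wardrop equilibrium of an induced nonatomic population game and admits a variational inequality formulation. The framework extends to switching costs and endogenous coalition acceptance rules, leading to constrained equilibria characterized by quasi-variational inequalities. Overall, the theory unifies cooperative value allocation, noncooperative coalition mobility, mean-field dynamics, evolutionary game theory, and population games in a common framework for analyzing coalition formation and adaptation in large-scale multi-agent systems.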

Link: https://arxiv.org/abs/2606.28824
Authors: Quanyan Zhu
Affiliations: Unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments:

Abstract:This paper develops a continuum theory of exit-and-join coalition dynamics in nonatomic cooperative games. We extend the Aumann-Shapley value and the Aumann-Drèze value to coalition structures in which each coalition is treated as a restricted nonatomic game, yielding a marginal-contribution-based payoff density that governs incentives for agents to remain in, exit, or join coalitions. We derive deterministic mean-field dynamics from decentralized switching rules and show that payoff-difference switching recovers replicator dynamics as a special case. We characterize exit-and-join equilibrium by the absence of profitable positive-mass deviations and prove its equivalence with stationarity of the induced mass dynamics under incentive-compatible and strictly payoff-responsive switching rates. For mass-based cooperative games, we construct a Lyapunov function and establish global convergence under strict concavity. We further show that the equilibrium is equivalent to a Wardrop equilibrium of an induced nonatomic population game and admits a variational inequality formulation. The framework is extended to incorporate switching costs and endogenous coalition acceptance rules, leading to constrained equilibria characterized by quasi-variational inequalities. The proposed theory unifies cooperative value allocation, noncooperative coalition mobility, mean-field dynamics, evolutionary game theory, and population games within a common framework for analyzing coalition formation and adaptation in large-scale multi-agent systems.

[MA-24] HyphaeDB: A Living Knowledge Topology for Agent-First Memory

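[Quick read]: This paper addresses a core limitation of existing vector databases and agent memory frameworks: memory is passive storage that agents must query explicitly, so knowledge does not propagate between agents through the memory layer. The key to the solution is HyphaeDB, an agent-native memory infrastructure that reinterprets the Hierarchical Navigable Small World (HNSW) graph topology, a data structure at the core of modern vector databases, as a communication fabric for multi-agent systems rather than a search optimization. Agents are persistent nodes in the vector space, knowledge propagates along the graph's neighbor structure via a gossip protocol with energy-based attenuation, and emergent behaviors such as contradiction detection, pattern crystallization, and consensus formation arise from the combination of topology, propagation dynamics, and local interaction rules. The system is built on three primitives (knowledge nodes, topology edges, and memory diffs) and a multi-layer abstraction hierarchy with promotion via emergent consensus, and its theoretical grounding draws on small-world network theory, epidemic broadcast protocols, and swarm intelligence. The paper provides a reference implementation on PostgreSQL with pgvector and describes a deployment in Swarm-Driven Development, a multi-agent software engineering methodology. The authors describe HyphaeDB as, to their knowledge, the first system to combine navigable small world topology with gossip-based knowledge propagation for multi-agent coordination.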

Link: https://arxiv.org/abs/2606.28781
Authors: Krishna Halaharvi
Affiliations: HyphaeDB
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:Every existing vector database and agent memory framework treats memory as passive storage that agents query explicitly. No system propagates knowledge between agents through the memory layer itself. We introduce HyphaeDB, an agent-native memory infrastructure that reinterprets the Hierarchical Navigable Small World (HNSW) graph topology (the data structure at the core of every modern vector database) not as a search optimization, but as a communication fabric for multi-agent AI systems. In HyphaeDB, agents are nodes in the vector space with persistent positions, knowledge propagates via a gossip protocol through the graph’s neighbor structure with energy-based attenuation, and emergent behaviors (contradiction detection, pattern crystallization, and consensus formation) arise from the combination of topology, propagation dynamics, and local interaction rules. We present the architecture built on three primitives (knowledge nodes, topology edges, and memory diffs), a multi-layer abstraction hierarchy with promotion via emergent consensus, and theoretical analysis grounding the system in small-world network theory, epidemic broadcast protocols, and swarm intelligence. We provide a reference implementation on PostgreSQL with pgvector and describe a concrete deployment in Swarm-Driven Development, a multi-agent software engineering methodology. HyphaeDB represents, to our knowledge, the first system to combine navigable small world topology with gossip-based knowledge propagation for multi-agent coordination.
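
As a rough illustration of propagation with energy-based attenuation over a neighbor graph, the sketch below floods a piece of knowledge outward and halves its energy at every hop until it falls under a threshold. It is not HyphaeDB's protocol: the graph is a hand-written adjacency map rather than an HNSW index, the flood is deterministic rather than randomized gossip, and the `decay` and `threshold` parameters are invented for the example.

```python
from collections import deque


def propagate(neighbors, source, energy=1.0, decay=0.5, threshold=0.2):
    """Spread knowledge from `source`; return the energy each reached node received."""
    received = {source: energy}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        passed_on = received[node] * decay  # attenuate at every hop
        if passed_on < threshold:
            continue  # too weak to forward any further
        for peer in neighbors[node]:
            if passed_on > received.get(peer, 0.0):
                received[peer] = passed_on
                queue.append(peer)
    return received


# Hypothetical four-agent chain: a - b - c - d.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
reach = propagate(neighbors, "a")
```

With these settings the knowledge reaches `b` and `c` but dies out before `d`, so attenuation acts as a soft radius around the originating agent.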

[MA-25] A Fast Convergent Algorithm for Solving Non-convex Partially-Decoupled Generalized Nash Equilibrium Problems

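[Quick read]: This paper addresses the shortage of effective algorithms for multi-agent optimal control problems in aerospace, such as pursuit-evasion and contested space operations, when they are modeled as non-convex differential games. The key to the solution is a relaxation of generalized Nash equilibrium problems (GNEPs) that excludes inter-agent control coupling in the dynamics, a structure representative of many multi-agent systems. On this basis the paper presents FALCON (Fast Augmented Lagrangian Convexification for Open-loop Nash equilibria), which uses sequential convex programming (SCP) to build tractable convex sub-games that are then solved with standard convex programming methods via a potential game reformulation. FALCON is shown to have global convergence guarantees to an open-loop Nash equilibrium for non-convex differential games under mild assumptions, and numerical experiments on both cooperative and competitive differential games illustrate this.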

Link: https://arxiv.org/abs/2606.28617
Authors: Bennet Outland, Vishala Arya
Affiliations: University of Colorado Boulder
Categories: Multiagent Systems (cs.MA); Optimization and Control (math.OC)
Comments:

Abstract:Multi-agent optimal control problems in aerospace, such as pursuit-evasion and contested space operations, can be modeled as non-convex differential games, for which there are limited algorithms. In this work, we introduce a relaxation of generalized Nash Equilibrium problems (GNEPs) that excludes inter-agent control coupling in the dynamics, which is representative of many multi-agent systems. The main contribution is FALCON (Fast Augmented Lagrangian Convexification for Open-loop Nash equilibria), an algorithm for solving a broad class of differential games. Methodologically, sequential convex programming (SCP) is utilized to create tractable convex sub-games which can then be solved via standard convex programming methods involving a potential game reformulation. FALCON is demonstrated to have global convergence guarantees to an open-loop Nash equilibrium for non-convex differential games under mild assumptions. This is numerically shown through both cooperative and competitive differential games.

[MA-26] Digitizing Coaching Intelligence: An Agentic Framework for Holistic Athlete Profiling using VLM and RAG

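[Quick read]: This paper addresses the subjectivity, poor scalability, and lack of "coaching intelligence" of traditional athlete assessment in mass recruitment, and in particular the limitation that existing computer vision (CV) systems only count repetitions and cannot evaluate qualitative physiological markers such as form degradation, spinal articulation, and fatigue. The core solution is an LLM-based hybrid agentic framework orchestrated via LangGraph with a dual-pipeline architecture: MediaPipe provides geometrically precise kinematic tracking, and a Vision-Language Model (Llama-4-scout) provides semantic reasoning. To cope with the latency and token constraints of multimodal video processing, the paper introduces a 3 x 3 "Smart Grid" temporal chunking strategy that reduces computational overhead by over 88% while preserving critical temporal continuity. To protect data integrity and mitigate hallucination, an autonomous "LLM-as-a-Judge" self-correction loop cross-references quantitative and qualitative metrics before persistence. A dual-persistence Retrieval-Augmented Generation (RAG) pipeline backed by a vector search engine (ChromaDB) lets coaches run complex semantic queries in natural language (for example, "Identify athletes with high endurance but poor core rigidity") instead of working against rigid SQL databases. According to the reported experiments, the multi-agent approach bridges the gap between raw biometric tracking and actionable coaching insights and offers a scalable, objective solution for national talent identification.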

Link: https://arxiv.org/abs/2606.28570
Authors: Deep Ghosal, Ishani Sen, Wazib Ansar, Amlan Chakrabarti
Affiliations: University of Calcutta
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 16 pages

Abstract:Athlete assessment is a critical process for tracking physical progress and identifying elite talent. However, during mass recruitment drives, traditional methods rely on manual observation, which is inherently subjective and unscalable, or basic computer vision (CV) systems limited to quantitative repetition counting. These standard approaches lack the “coaching intelligence” required to evaluate qualitative physiological markers such as form degradation, spinal articulation, and fatigue. This paper presents a novel, LLM-based hybrid agentic framework for automated, holistic athlete profiling that strictly aligns with the Sports Authority of India (SAI) assessment protocols. Orchestrated via LangGraph, our dual-pipeline architecture synthesizes the geometric precision of CV (MediaPipe) for kinematic tracking with the semantic reasoning of Vision-Language Models (Llama-4-scout). To overcome the latency and token constraints associated with multimodal video processing, we introduce a 3 X 3 “Smart Grid” temporal chunking strategy, reducing computational overhead by over 88% while preserving critical temporal continuity. To ensure data integrity and mitigate hallucination, the framework pioneers an autonomous “LLM-as-a-Judge” self-correction loop that cross-references quantitative and qualitative metrics before persistence. Finally, we implement a dual-persistence Retrieval-Augmented Generation (RAG) pipeline utilizing a vector search engine (ChromaDB). This enables coaches to bypass rigid SQL databases and perform complex semantic queries (e.g., “Identify athletes with high endurance but poor core rigidity”) using natural language. Experimental results demonstrate that this multi-agent approach significantly bridges the gap between raw biometric tracking and actionable coaching insights, offering a scalable, objective solution for national talent identification.

[MA-27] Is Lying an Emergent Behaviour in LLMs? Evidence from Gaslighting AI agents in a Sustainability Game

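[Quick read]: This paper investigates whether deception emerges among large language model (LLM) agents in a competitive sustainability game, specifically in a misleading setting where agents are told that common resources regenerate although no regeneration actually occurs. The core question is whether, once agents can observe neighbours' status, declare future attacks, and access reputation information, a system of generative agents develops non-cooperative or even deceptive strategies that affect the sustainability of the overall ecosystem. The key to the approach is an agent-based model of a sustainability game that mixes LLM agents with rule-based agents serving as a baseline, simulates the management of industrial, military, and ecological resources through networked interaction, and introduces future-attack declarations, permission to lie, and reputation memory as variables. The study finds that deception arises as an emergent behaviour even without explicit permission to lie, while explicit permission mainly increases bluffing and diversion rather than direct backstabbing. Neighbour information makes the system dynamics more complex: it increases attacks but improves biosphere retention and coexistence. Reputation memory and information about the current biosphere level reduce ecological depletion. The findings suggest that, despite the risks, communication among LLM agents can support sustainability.

Link: https://arxiv.org/abs/2606.28456
Authors: Subhendu Bhandary, Federico Carucci, Christos Charalambous, Francesca Dilisante, Ksenia Dvorkina, Anna Garbo, Jiaqi Liang, Riccardo Vasellini, Francesco Bertolotti
Affiliations: University of Cyprus; University of Zaragoza; Medical University of Vienna; ISI Foundation; Complexity Science Hub; CNR; Università Cattolica del Sacro Cuore
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Notes: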

Abstract:LLM agents are increasingly used in multi-agent settings, yet their behaviour in sustainability games remains largely unexplored. This work investigates whether lying can emerge among LLM agents in a competitive sustainability game in which agents are informed that common resources can regenerate, although regeneration does not actually occur. We develop an agent-based model of a sustainability game in which agents manage industrial, military, and ecological resources, and interact through a network. LLM agents can observe neighbours’ status, declare future attacks, receive permission to lie, and access reputation information, while rule-based agents provide an interpretable behavioural baseline. The results show that neighbour information strongly changes system dynamics, increasing attacks while improving biosphere retention and coexistence. Also, the presence of future declarations reduces extinction risk without suppressing conflict. Behaviourally, deception emerges even when agents are not explicitly allowed to lie, and explicit permission mainly increases bluffing and diversion rather than direct backstabbing. Finally, the presence of reputation memory and information about the current biosphere level reduces ecological depletion of the system. These findings suggest that deception can arise as an emergent behaviour in LLM-agent systems and that communication between LLM agents could support sustainability while dealing with risk.

[MA-28] Learning to Distributedly Estimate under Partially Known Dynamics: A Covariance-Agnostic Neural Kalman Consensus Filter

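[Quick read]: This paper addresses a core challenge in online latent state estimation: how to perform efficient, robust decentralized state inference in distributed sensing settings when noise statistics are unavailable. The proposed solution is a distributed filtering framework that requires no noise covariance information, the Covariance-Agnostic Neural Kalman Consensus Filter (CA-NKCF). The method combines partial prior domain knowledge with the representation capabilities of deep neural networks, and uses optimized consensus weights and Kalman-like recursive updates to perform distributed collaborative inference without relying on noise statistics. Experiments show that CA-NKCF outperforms traditional distributed Kalman filters, particle filters, and purely data-driven deep neural networks on linear systems, chaotic systems (Lorenz), and practical wireless tracking environments, and that its advantage remains stable across varying noise levels, random communication topologies, latent state dimensions, and observation clutter densities induced by scattering objects in wireless channels.

Link: https://arxiv.org/abs/2606.28441
Authors: George Stamatelis, Kyriakos Stylianopoulos, George C. Alexandropoulos
Affiliations: National and Kapodistrian University of Athens; Hellenic Foundation for Research and Innovation; European Union’s Horizon Europe
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Notes: Under review in IEEE journal, 13 pages, 9 figures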

Abstract:Online latent state estimation constitutes a fundamental challenge within the artificial intelligence field, serving as a foundational tool for diverse applications, including sequential decision making, anomaly and change-point detection. In this paper, a novel online distributed sensing framework, where agents collaborate and exchange information to perform latent state estimation, is presented. The proposed estimator combines available partial domain knowledge with the representation capabilities of deep neural networks. In particular, the designed sensing framework incorporates prior estimates, optimized consensus weights, and Kalman-like recursive updates to perform decentralized inference, without relying on knowledge of noise statistics. Extensive experiments on linear, chaotic (Lorenz), and practical wireless tracking environments reveal that the proposed Covariance-Agnostic Neural Kalman Consensus Filter (CA-NKCF) outperforms traditional distributed Kalman and particle filters as well as purely model-free deep neural networks, exhibiting robustness even when the underlying motion and observation models are misspecified. It is also demonstrated that CA-NKCF’s performance advantage remains stable across varying noise levels, random communication topologies, latent state dimensions, and observation clutter densities induced by scattering objects in wireless systems.

[MA-29] An Algebraic Framework for Quantitative Semantics of Spatio-Temporal Logic with Graph Operators

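[Quick read]: This paper addresses the lack of quantitative semantics for Spatio-Temporal Logic with Graph Operators (STL-GO), an extension of Signal Temporal Logic (STL) to multi-agent systems. The difficulty is that STL-GO's graph operators impose counting constraints on neighboring agents, which existing quantitative semantics for spatio-temporal logics such as STREL cannot capture. The key to the solution is a quantitative semantics built as a layered algebraic construction that decouples temporal aggregation from graph-operator aggregation, with the latter governed by an abstract accumulator equipped with a monotone fold and a readout. The paper proves that soundness and completeness reduce to monotonicity conditions on these components. The framework is implemented and evaluated on a 2D bounded region with stochastic Dubins-car dynamics and a 3D Earth-satellite system under four semantic instantiations (Boolean, min-max, signed-deficit, and hybrid), showing the tradeoffs between accumulator choices and reporting scalability in the number of agents and the time horizon.

Link: https://arxiv.org/abs/2606.28429
Authors: Sheryl Paul, Vidisha Kudalkar, Anand Balakrishnan, Tianhao Wu, Lars Lindemann, Jyotirmoy V. Deshmukh
Affiliations: Unknown
Categories: Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
Notes: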

Abstract:Spatio-Temporal Logic with Graph Operators (STL-GO) extends Signal Temporal Logic (STL) to multi-agent systems via graph operators that count neighboring agents satisfying a property, together with multi-agent quantifiers. While Boolean semantics for STL-GO are well-defined, quantitative semantics have not yet been developed and existing quantitative semantics for spatio-temporal logics such as STREL cannot capture the counting constraints in STL-GO’s graph operators. We develop quantitative semantics for STL-GO as a layered algebraic construction that separates temporal aggregation from graph-operator aggregation (governed by an abstract accumulator with a monotone fold and readout). We prove that soundness and completeness reduce to monotonicity conditions on these components. We implement the framework and evaluate it on two multi-agent environments: a 2D bounded region with stochastic Dubins-car dynamics and a 3D Earth-satellite system, under four semantic instantiations (Boolean, min-max, signed-deficit, and a hybrid), demonstrating the tradeoffs between accumulator choices and reporting scalability in the number of agents and time horizon.

[MA-30] On the Necessity of a Liquid Substrate for Mesh Intelligence

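[Quick read]: This paper studies what kind of substrate can support accurate state estimation in a mesh of sovereign agents with no center: no shared clock, no shared model, and no coordinator. Each agent receives observations at irregular, unscheduled times and must fold the projections emitted by its peers into its own internal state online, on a substrate whose weights cannot be retrained. Any one of these constraints is tractable alone, but folding optimally under all three at once is not. The argument rests on two necessary conditions. First, because the latent state changes over time, the optimal estimator is time-varying, so an adaptive timescale is necessary and every fixed-gain filter is strictly suboptimal. Second, because arrivals are clock-free, the optimal estimate depends on the elapsed gap between events, and no gap-blind network recovers this dependence at any width or depth. The two conditions intersect in the continuous-time liquid class: an LSTM satisfies the first, a fixed continuous-time filter the second, and a multi-timescale liquid network both. Synthetic experiments confirm each result: the network attains the timescale, and the separation is computed exactly. The characterization is necessary rather than sufficient and applies to fixed-weight substrates. Because it is proved per agent, it binds every agent in the mesh and thus constitutes a structural condition on mesh intelligence.

Link: https://arxiv.org/abs/2606.28413
Authors: Hongwei Xu
Affiliations: SYM.BOT
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Notes: 14 pages, 3 figures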

Abstract:A mesh of sovereign agents has no center: no shared clock, no shared model, and no coordinator to gather data or retrain. Its competence rests on each agent folding the projections its peers emit into a single internal state, online, from observations that arrive at irregular, unscheduled times, on a substrate whose weights it cannot retrain. Any one of these constraints is tractable on its own; folding optimally under all three at once is not. We ask what such a substrate must be, and prove two necessary conditions from one model of a self-evolving latent observed at irregular, exogenous times. Because the latent changes, its optimal estimator is time-varying: an adaptive timescale is necessary, and every fixed-gain filter is strictly suboptimal. And because arrivals are clock-free, the optimal estimate depends on the elapsed gap between them, which no gap-blind network recovers at any width or depth. This second condition is capacity-independent: scale cannot substitute for the missing dependence. The two conditions intersect in the continuous-time liquid class. An LSTM satisfies the first, a fixed continuous-time filter the second, and a multi-timescale liquid network both. Synthetic experiments confirm each: the network attains the timescale, and the separation is computed exactly. The characterization is necessary, not sufficient, and binds fixed-weight substrates: a network free to retrain reaches the class by other means. Proved per agent, the necessity binds every agent of a mesh, a structural condition on mesh intelligence.
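The gap dependence described in the abstract can be illustrated with a standard textbook case. The sketch below is not the paper's model: it assumes, purely for illustration, a scalar Ornstein-Uhlenbeck latent with Gaussian observation noise, for which the Kalman predict/update step has a closed form. The function name `ou_kalman_step` and all parameter values are hypothetical.

```python
import math

def ou_kalman_step(mean, var, gap, obs, theta=1.0, sigma=1.0, obs_var=0.5):
    """One predict/update step for an OU latent observed after `gap` time units."""
    decay = math.exp(-theta * gap)
    stationary_var = sigma ** 2 / (2 * theta)
    # Predict: the mean relaxes toward 0 and the variance moves toward
    # its stationary value as the gap grows.
    pred_mean = decay * mean
    pred_var = decay ** 2 * var + stationary_var * (1 - decay ** 2)
    # Update: the Kalman gain depends on pred_var, and therefore on the gap.
    gain = pred_var / (pred_var + obs_var)
    new_mean = pred_mean + gain * (obs - pred_mean)
    new_var = (1 - gain) * pred_var
    return new_mean, new_var, gain

# Same prior and same observation, different elapsed gaps (assumed values).
_, _, gain_short = ou_kalman_step(mean=0.0, var=0.1, gap=0.1, obs=1.0)
_, _, gain_long = ou_kalman_step(mean=0.0, var=0.1, gap=2.0, obs=1.0)
```

With these numbers the gain after the longer gap is larger, because the predicted variance has had more time to regrow toward its stationary value. An estimator that never sees `gap` would have to apply the same gain in both cases.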

[MA-31] Formation of Circular Directed Networks with Shared Link Costs

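[Quick read]: This paper studies a noncooperative game in which individually rational behavior determines network structure in an information-sharing setting. Agents create directed links to access valuable information but bear the costs accumulated along the paths through which that information is obtained, and their strategy choices jointly determine the resulting network. The central problem is to identify the network structures that are stable under Nash equilibrium and to characterize their properties and efficiency. The key to the analysis is the combination of shared path costs and heterogeneous information values, which captures each agent's tradeoff between information benefits and connection costs. The paper shows that strict Nash equilibria must take the form of circular directed networks, which use the minimum number of links while allowing every agent to access all available information (minimal connectedness). Noncircular weak Nash networks may exist, but they are structurally redundant and do not satisfy this minimality property. Strict Nash networks are also Pareto optimal and efficient in terms of aggregate welfare. The model is contrasted with that of Bala and Goyal (2000), highlighting how shared path costs and heterogeneous information values change the equilibrium implications, and the analysis supports the equivalence between strict stability and minimal connectivity.

Link: https://arxiv.org/abs/2606.28382
Authors: Juan M.C. Larrosa, Fernando Tohmé
Affiliations: Universidad Nacional del Sur; Instituto de Ciencias e Ingeniería de la Computación (ICIC); Instituto de Matemática de Bahía Blanca (INMABB)
Categories: Multiagent Systems (cs.MA)
Notes: 25 pages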

Abstract:This paper develops a noncooperative model of directed network formation in which agents create links to access valuable information while sharing the costs generated along the paths through which information is obtained. Each agent is endowed with a positive amount of information and chooses, simultaneously, which other agents to contact. A directed link initiated by one agent allows her to access the information of the contacted agent and of the latter’s reachable network, but each link in the resulting information path entails a unit cost. Payoffs therefore depend on the total value of accessible information net of the accumulated connection costs required to obtain it. The paper characterizes the relationship between strategy profiles and directed graphs, defines accessibility, paths, components, and minimal connectedness, and studies the Nash architectures induced by individual best responses. The central result is that strict Nash equilibria must take the form of circular directed networks. Moreover, circular networks are exactly the Nash networks that use the minimum number of links while allowing every agent to access all available information. Although noncircular weak Nash networks may exist, they are structurally redundant and do not satisfy the same minimality property. The model also shows that strict Nash networks are both Pareto optimal and efficient in terms of aggregate welfare. Finally, the paper compares this framework with Bala and Goyal’s model, emphasizing that shared path costs and heterogeneous information values generate different equilibrium implications. The analysis supports the equivalence between strict stability and minimal connectivity in directed information networks.
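The minimality property can be checked on a small case. The following is a minimal Python sketch, not the authors' model: it ignores payoffs entirely and only builds a directed cycle on `n` agents, then confirms that every agent reaches every other agent using exactly `n` links. The helper names `cycle_network` and `reachable` are hypothetical.

```python
from collections import deque

def cycle_network(n):
    """Directed cycle: agent i initiates one link to agent (i + 1) % n."""
    return {i: [(i + 1) % n] for i in range(n)}

def reachable(graph, start):
    """Agents whose information `start` can access by following links."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}

n = 5
net = cycle_network(n)
link_count = sum(len(targets) for targets in net.values())
everyone_connected = all(len(reachable(net, i)) == n - 1 for i in range(n))

# Dropping any one link breaks full access for the agent that initiated it.
broken = {i: list(t) for i, t in net.items()}
broken[0] = []
still_connected = all(len(reachable(broken, i)) == n - 1 for i in range(n))
```

Since each agent needs at least one outgoing link to access anyone else's information, `n` links is the smallest count that can give all `n` agents full access, and the cycle attains it.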

[MA-32] Operating Regimes of Decentralized Learning Under Mobility and Bandwidth Constraints

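[Quick read]: This paper addresses the practical challenges of decentralized learning over wireless networks in mobile and pervasive computing systems, in particular client asynchrony, time-varying contact topologies, and technology-dependent throughput limits. Most analyses of decentralized learning rely on idealized communication assumptions that break down in real wireless settings, where connectivity is intermittent, topology changes with mobility, and bandwidth is limited. The paper implements a fully decentralized protocol whose key features are overlapping synchronization with local training and supporting partial tensor-level transfers when contacts end early. Using Random Waypoint mobility together with Bluetooth LE, LTE, and Wi-Fi, the study quantifies how network dynamics and link capacity affect convergence and identifies three operating regimes: (i) inter-contact time largely dictates convergence via mixing; (ii) partial updates are often well tolerated when contacts are frequent; and (iii) very dense contact patterns can trigger contention, reducing effective throughput. These findings offer practical guidance for deploying decentralized learning over realistic wireless systems, indicating when improving connectivity, increasing bandwidth, or mitigating contention matters most.

Link: https://arxiv.org/abs/2606.28342
Authors: Samuele Sabella, Chiara Boldrini, Lorenzo Valerio, Marco Conti, Andrea Passarella
Affiliations: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Notes: Accepted for publication at IEEE SmartComp 2026. This work was partially supported by the PNRR Project SoBigDatait (IR0000013). S. Sabella, C. Boldrini, and M. Conti were partly funded by the PNRR project FAIR (PE00000013), while A. Passarella and L. Valerio were partially supported by the PNRR project RESTART (PE00000001)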

Abstract:Decentralized learning is a promising paradigm for collaborative training in mobile and pervasive systems, as it avoids a central coordinator and does not require sharing raw data. Yet, most analyses rely on idealized communication assumptions that break down in wireless settings, where connectivity is intermittent, topology changes due to mobility, and bandwidth is limited. We study decentralized averaging under client asynchrony, time-varying contact graphs, and technology-dependent throughput constraints. We implement a fully decentralized protocol that overlaps synchronization with local training and supports partial tensor-level transfers when contacts end early. Using Random Waypoint mobility and multiple wireless technologies (Bluetooth LE, LTE, and Wi-Fi), we quantify how network dynamics and link capacity impact convergence. We identify three operating regimes: (i) inter-contact time largely dictates convergence via mixing, (ii) partial updates are often well tolerated when contacts are frequent, and (iii) very dense contact patterns can trigger contention, reducing effective throughput. These findings provide a practical lens to reason about decentralized learning deployments over realistic wireless systems, highlighting when improving connectivity, increasing bandwidth, or mitigating contention is most impactful.
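One way to picture a partial tensor-level transfer is sketched below in plain Python. This is an illustration under stated assumptions, not the authors' protocol: it assumes tensors are sent whole and in a fixed order, that the exchange is symmetric, and that a tensor is averaged only if it was fully transferred before the contact ended. The function names and the 4-byte value size are hypothetical.

```python
def tensors_that_fit(model, contact_seconds, bytes_per_second, bytes_per_value=4):
    """Tensor names, in order, that can be fully sent before the contact ends."""
    budget = contact_seconds * bytes_per_second
    sent = []
    for name, values in model.items():
        cost = len(values) * bytes_per_value
        if cost > budget:
            break  # contact ends before this tensor completes
        budget -= cost
        sent.append(name)
    return sent

def pairwise_average(model_a, model_b, transferred):
    """Average only the tensors named in `transferred`; keep the rest local."""
    new_a, new_b = dict(model_a), dict(model_b)
    for name in transferred:
        avg = [(x + y) / 2 for x, y in zip(model_a[name], model_b[name])]
        new_a[name] = avg
        new_b[name] = list(avg)
    return new_a, new_b

# Two clients with toy two-tensor models (assumed values).
a = {"layer1": [1.0, 3.0], "layer2": [10.0]}
b = {"layer1": [3.0, 5.0], "layer2": [20.0]}

# A short, slow contact: only 8 bytes fit, so only layer1 is exchanged.
sent = tensors_that_fit(a, contact_seconds=1, bytes_per_second=8)
a2, b2 = pairwise_average(a, b, sent)
```

In this toy run `layer1` is averaged on both clients while `layer2` keeps its local values, which is the kind of partial update the second operating regime refers to.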

[MA-33] Agentic Analysis for Agentic Infrastructure: An LLM-Powered Pipeline for Comparative Governance of DAO and Corporate AI Protocols

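[Quick read]: This paper addresses the lack of empirical research on the governance structures behind interoperability standards as AI agent protocols proliferate. The core challenge is to analyze large-scale, complex technical governance discourse systematically and to reveal how institutional design shapes thematic priorities and community structure. The key to the solution is an LLM-powered comparative analysis pipeline that integrates automated annotation, neural topic modeling, and multi-layer network analysis. Validated on two contrasting agent interoperability standards, ERC-8004 (permissionless, on-chain) and Google A2A (corporate-led), the study finds that although governance form influences substantive focus, both regimes show comparable levels of participation inequality and community fragmentation. Discourse alignment is denser in the permissionless setting, suggesting that open governance may foster thematic convergence despite decentralized participation. The method offers a scalable toolkit for the empirical study of technology governance, with implications for designing more equitable agentic AI standards. All data and code are openly available.

Link: https://arxiv.org/abs/2606.26203
Authors: Yutian Wang, Luyao Zhang
Affiliations: Duke Kunshan University
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
Notes: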

Abstract:As AI agent protocols proliferate, the governance structures shaping their interoperability standards remain empirically underexamined. We introduce an LLM-powered comparative pipeline for large-scale governance discourse analysis, integrating automated annotation, neural topic modeling, and multi-layer network analysis to study socio-technical power structures at scale. We validate it on two contrasting standards for agent interoperability: ERC-8004 (permissionless, on-chain) and Google A2A (corporate-led). Analyzing 4,323 governance participation records, we combine LLM-assisted coding, topic modeling, and multi-layer network analysis to examine how institutional design shapes thematic priorities and community structure. We find that while governance form influences substantive focus, both regimes exhibit comparable levels of participation inequality and community fragmentation. Discourse alignment is denser in the permissionless setting, suggesting that open governance may foster greater thematic convergence despite decentralized participation. These findings illustrate how LLM-assisted methods can advance the empirical study of technology governance, with implications for designing more equitable agentic AI standards. All data and code are openly available.

Natural Language Processing

[NLP-0] Self-Evolving World Models for LLM Agent Planning

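[Quick read]: This paper addresses the lack of reliable foresight in long-horizon LLM agents before they act: a world model whose predictions are unreliable can be ignored, misused, or even degrade downstream decision-making. The proposed solution is WorldEvolver, a world model framework that evolves itself at deployment time by revising its context while the downstream agent and all model parameters stay frozen. It relies on three cooperating modules: (i) Episodic Memory, which exploits real action transitions through retrieval-based simulation; (ii) Semantic Memory, which extracts persistent heuristic rules from mismatches between predictions and observations; and (iii) Selective Foresight, which filters out low-confidence predictions before they enter the agent's reasoning context. Experiments on ALFWorld and ScienceWorld show that WorldEvolver achieves the highest prediction accuracy and leads other world model baselines on downstream agent success rate, indicating that test-time memory revision improves both predictive fidelity and planning performance.

Link: https://arxiv.org/abs/2606.30639
Authors: Xuan Zhang, Wenxuan Zhang, See-Kiong Ng, Yang Deng
Affiliations: National University of Singapore; Singapore University of Technology and Design; Singapore Management University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: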

Abstract:World models offer a principled way to equip long-horizon LLM agents with foresight: predictions of action consequences before execution. However, unreliable foresight can be ignored, misused, or even degrade downstream decision-making. In this paper, we introduce WorldEvolver, a self-evolving world model framework that revises its deployment-time context while keeping the downstream agent and all model parameters frozen. WorldEvolver integrates three modules: (i) Episodic Memory, which exploits real action transitions through retrieval-based simulation; (ii) Semantic Memory, which extracts persistent heuristic rules from prediction-observation mismatches; and (iii) Selective Foresight, which filters low-confidence predictions before integrating them into agent reasoning context. We evaluate WorldEvolver on ALFWorld and ScienceWorld, measuring world model prediction accuracy on Word2World and downstream agent success rate on AgentBoard. Extensive experiments show that WorldEvolver achieves the highest prediction accuracy across three backbones and leads other world model baselines on downstream agent success rate, demonstrating that test-time memory revision enhances both predictive fidelity and planning performance.

[NLP-1] Scaling the Horizon Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent

[Quick read]: This paper addresses the weak performance of large models on long-horizon, complex agent tasks that require sustained reasoning, use of external knowledge, and multi-step action planning. The core challenge is how a model of limited size (35B parameters) can approach or exceed trillion-parameter models on such tasks. The solution is Agents-A1, a Mixture-of-Experts (MoE) agentic model built around agent-horizon scaling along two directions: scaling long-horizon trajectories and scaling heterogeneous agent abilities. The authors build a long-horizon knowledge-action infrastructure that connects external knowledge, actions, observations, and verifier outcomes, producing agentic trajectories averaging 45K tokens. They then train the model with a three-stage recipe: full-domain supervised fine-tuning, training of domain-level teacher models, and multi-teacher domain-routed on-policy distillation with salient vocabulary alignment, which improves cross-domain knowledge transfer and unifies six heterogeneous domains into one deployable student model. The 35B Agents-A1 matches or exceeds 1T-scale models such as Kimi-K2.6 and DeepSeek-V4-pro on several long-horizon benchmarks.

Link: https://arxiv.org/abs/2606.30616
Authors: Lei Bai, Zongsheng Cao, Yang Chen, Zhiyao Cui, Shangheng Du, Yue Fan, Shiyang Feng, Zijie Guo, Haonan He, Liang He, Xiaohan He, Shuyue Hu, Yusong Hu, Songtao Huang, Yichen Jiang, Hao Li, Xin Li, Dahua Lin, Weihao Lin, Fenghua Ling, Dongrui Liu, Zhuo Liu, Runmin Ma, Chunjiang Mu, Haoyang Peng, Tianshuo Peng, Jinxin Shi, Luohe Shi, Boyuan Sun, Zelin Tan, Shengji Tang, Qianyi Wang, Yiming Wu, Yi Xie, Xiangchao Yan, Jingqi Ye, Peng Ye, Fangchen Yu, Jiakang Yuan, Bihao Zhan, Bo Zhang, Chen Zhang, Shufei Zhang, Shuaiyu Zhang, Wenlong Zhang, Yiqun Zhang, Junpeng Zhao, Zhijie Zhong, Bowen Zhou, Yuhao Zhou
Affiliations: Shanghai Artificial Intelligence Laboratory
Categories: Computation and Language (cs.CL)
Notes: The model checkpoints and evaluation codebase are available at this https URL and this https URL

Abstract:We introduce Agents-A1, a 35B Mixture-of-Experts Agentic Model that reaches trillion-parameter-level performance by scaling the agent horizon. We investigate agent-horizon scaling from two perspectives: scaling long-horizon trajectories and scaling heterogeneous agent abilities. To support this goal, we build a long-horizon knowledge-action infrastructure that connects external knowledge, actions, observations, and verifier outcomes, producing agentic trajectories with an average length of 45K tokens. Based on this, we train Agents-A1 with a three-stage recipe. First, we perform full-domain supervised fine-tuning to align the base model with broad agentic behaviors. Second, we train domain-level teacher models to capture specialized expertise in each domain. Third, we propose a multi-teacher domain-routed on-policy distillation with salient vocabulary alignment to improve knowledge transfer efficiency across different domains, unifying six heterogeneous domains into one deployable student model. Agents-A1 achieves strong and broad performance on long-horizon agent benchmarks. Compared with 1T-parameter models such as Kimi-K2.6 and DeepSeek-V4-pro, Agents-A1 achieves leading results on SEAL-0 (56.4), IFBench (80.6), HiPhO (46.4), FrontierScience-Olympiad (79.0), and MolBench-Bind (56.8), and remains highly competitive on SciCode (44.3), HLE (47.6) and BrowseComp (75.5). We hope this work provides the community with a practical path for scaling the horizon using a 35B agent that can reach or match the performance of 1T models on long-horizon tasks.

[NLP-2] Uncertainty-Aware Generation and Decision-Making Under Ambiguity

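[Quick read]: This paper addresses unreliable decision-making by large language models (LLMs) on complex real-world tasks where uncertainty is not taken into account. Many such tasks, including tutoring and peer reviewing, require in-depth knowledge and reasoning, are highly subjective, and demand trustworthy outputs. Whereas most prior work focuses on training better models, decision-making algorithms have received less attention. Building on Bayesian decision theory and risk-averse decision making, the paper presents and evaluates a range of uncertainty-aware decision-making algorithms: when generating a tutor response or a review, it accounts for uncertainty over tutoring strategies and review scores, and it uses conformal prediction to provide guarantees over strategy and score. Experiments show that these algorithms can improve the utility of the generations but must be implemented carefully when ambiguity is high. For example, risk-averse rules can degrade performance by optimizing for generic outputs, while Bayesian methods tend to perform better. The work applies techniques from decision theory to improve LLM-based decision-making and outlines open challenges.

Link: https://arxiv.org/abs/2606.30578
Authors: Nico Daheim, Iryna Gurevych
Affiliations: Technical University of Darmstadt; National Research Center for Applied Cybersecurity ATHENE
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: Code available under this https URL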

Abstract:With rapidly improving capabilities, Large Language Models (LLMs) are increasingly used in many complex real-world tasks. Beyond requiring in-depth knowledge and reasoning skills, many of these tasks exhibit a high degree of subjectivity and require that the outputs of the model can be trusted. While a lot of progress has been made to train better models, decision-making algorithms have received less attention. In this work, we present and evaluate various uncertainty-aware decision-making algorithms based on Bayesian decision theory and risk-averse decision making on the tasks of tutoring and automatic peer reviewing. Concretely, we take uncertainty over tutoring strategies and review scores into account when generating a tutor response or review and use conformal prediction to provide guarantees over strategy and score. We find empirically that these algorithms can improve the utility of the generations but need to be carefully implemented when ambiguity is high. For example, risk-averse rules can degrade performance by optimizing for generic outputs, while Bayesian methods tend to perform better. Our work uses techniques from decision theory to improve LLM-based decision-making and outlines open challenges for the community.
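The contrast between a Bayesian rule and a risk-averse rule can be made concrete with a toy example. The sketch below is an illustration only: the maximin rule stands in for "risk-averse decision making" as one common choice and is not necessarily the rule used in the paper, and the strategies, posterior, and utility values are all made up.

```python
def bayes_choice(utilities, posterior):
    """Pick the action with the highest posterior-expected utility."""
    def expected(action):
        return sum(p * utilities[action][state] for state, p in posterior.items())
    return max(utilities, key=expected)

def maximin_choice(utilities):
    """Risk-averse rule: pick the action whose worst-case utility is highest."""
    return max(utilities, key=lambda action: min(utilities[action].values()))

# Hypothetical belief over what the student needs, and hypothetical utilities
# of each tutor response under each possibility.
posterior = {"needs_hint": 0.7, "needs_answer": 0.3}
utilities = {
    "give_hint":   {"needs_hint": 1.0, "needs_answer": 0.2},
    "give_answer": {"needs_hint": 0.1, "needs_answer": 1.0},
    "generic":     {"needs_hint": 0.4, "needs_answer": 0.4},
}

bayes_pick = bayes_choice(utilities, posterior)
cautious_pick = maximin_choice(utilities)
```

Here the Bayesian rule commits to `give_hint`, while the maximin rule settles on the `generic` response because it is never very bad, even though it is never very good either. That matches the failure mode the abstract describes for risk-averse rules.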

[NLP-3] Attractor States Emerge in Multi-Turn LLM Conversations

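[Quick read]: This paper addresses the poorly understood long-run dynamics of LLM interaction in open-ended multi-agent settings, asking whether open-ended discussions exhibit attractor-like behavior, that is, topic-independent stable sets of behaviors into which conversations settle. The approach compares self-play and mixed-play dyadic debates and tracks trajectories in representation space, discourse traits, and stances. The study finds that self-play trajectories act as model-specific attractors that draw conversation partners asymmetrically in mixed-play debates, influencing the other models' stylistic choices and behavior. For example, Claude Haiku is a strong attractor in latent space, with other models taking on its traits such as metacommentary, while GPT-4.1 nano is especially malleable. The results suggest that open-ended multi-model interactions are partially predictable from model-specific attractors but are shaped by structured, asymmetric partner influence, which is relevant to designing, predicting, and monitoring autonomous agentic systems in the real world.

Link: https://arxiv.org/abs/2606.30571
Authors: Ting-Wen Ko, Jonas Geiping
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Notes: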

Abstract:Large language models (LLMs) are increasingly used in open-ended multi-agent settings, but the long-run dynamics of model–model interaction remain poorly understood. We study whether open-ended LLM discussions exhibit attractor-like behavior, i.e. topic-independent stable sets of behaviors which conversations settle into. Across 7 LLMs and 20 controversial topics, we compare self-play and mixed-play dyadic debates, tracking trajectories in representation space, discourse traits, and stances. We find self-play trajectories to be model-specific attractors that draw their conversation partners asymmetrically in mixed-play debates, influencing the other models’ stylistic choices and behavior. For example, Claude Haiku is a strong attractor of other models in latent space, corresponding to other models taking on its traits like metacommentary, and models like GPT-4.1 nano are especially malleable. Our results suggest that open-ended LLM interactions are partially predictable from model-specific attractors, but shaped by structured and asymmetric partner influence. Overall, our analysis sheds some light on the complex behavior of open-ended multi-agent interaction, which we hope is helpful in designing, predicting, and monitoring autonomous agentic systems in the real world.

[NLP-4] Morphing into Hybrid Attention Models

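[Quick read]: This paper addresses the efficiency bottleneck that the high computational cost of self-attention creates for long-context processing. Hybrid attention architectures improve efficiency by keeping a subset of full-attention layers and replacing the rest with linear attention, but their effectiveness depends critically on which layers retain full attention. Existing methods mostly rely on heuristics such as fixed placement patterns or layerwise scoring, which treat layer importance in isolation and overlook the interdependence between layers under a global hybrid configuration. The paper formulates hybrid layer selection as a budget-constrained subset optimization problem and proposes FlashMorph (Fast LAyer Selection for Hybrid MORPHing), an efficient and scalable layer selection method. FlashMorph first constructs a morphable model by equipping each full-attention layer with a converted linear-attention branch. It then freezes all weights and jointly optimizes layerwise gates on synthetic long-context retrieval data, with a linearization regularizer that encourages reliance on linear attention. Finally, the learned gates are discretized under a preset full-attention budget to instantiate the hybrid architecture, followed by standard logits distillation and long-context finetuning. Experiments show that FlashMorph finds more effective hybrid configurations and preserves strong long-context recall and general benchmark performance while substantially reducing layer selection cost.

Link: https://arxiv.org/abs/2606.30562
Authors: Disen Lan, Jianbin Zheng, Yuxi Ren, Xin Xia, Xuanda Wang, Xuefeng Xiao, Xipeng Qiu, Yu Cheng
Affiliations: Fudan University; The Chinese University of Hong Kong
Categories: Computation and Language (cs.CL)
Notes: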

Abstract:Hybrid attention models improve long-context efficiency by retaining only a subset of full-attention layers and replacing the remaining layers with linear attention. However, the effectiveness of Transformer-to-hybrid conversion critically depends on which layers preserve full attention. Existing hybrid layer selection methods typically rely on heuristic strategies such as fixed placement patterns or layerwise scoring, implicitly treating layer importance as isolated and overlooking the interdependent layer effect under a global hybrid configuration. In this work, we formulate hybrid layer selection as a budget-constrained subset optimization problem. We further propose FlashMorph (Fast LAyer Selection for Hybrid MORPHing), an effective, efficient and scalable layer selection method for Transformer-to-hybrid conversion. FlashMorph first constructs a morphable model by equipping each full-attention layer with a converted linear-attention branch. It then freezes all model weights and jointly optimizes layerwise gates on synthetic long-context retrieval data, with a linearization regularization that encourages the model to rely on linear attention for efficiency. The learned gates are discretized under a preset full-attention budget to instantiate the hybrid architecture, followed by standard logits distillation and long-context finetuning. Extensive experiments show that FlashMorph discovers more effective hybrid configurations, preserves strong long-context recall and general benchmark performance while substantially reducing layer selection cost compared with existing layer selection methods, demonstrating its effectiveness, efficiency, and scalability.
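The discretization step can be sketched in a few lines. The abstract does not say how the gates are discretized, so the sketch below makes two assumptions: a larger gate value means the layer relies more on full attention, and discretization keeps full attention in the top-k layers by gate value. The function name `discretize_gates` and the gate values are hypothetical.

```python
def discretize_gates(gates, full_attention_budget):
    """Keep full attention in the `budget` layers with the largest gate values."""
    ranked = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)
    keep_full = set(ranked[:full_attention_budget])
    return ["full" if i in keep_full else "linear" for i in range(len(gates))]

# Hypothetical learned gates for a 6-layer model, with a budget of 2 full layers.
gates = [0.91, 0.12, 0.40, 0.85, 0.05, 0.33]
layout = discretize_gates(gates, full_attention_budget=2)
```

Because the gates are optimized jointly before this step, the ranking already reflects how layers interact under a shared configuration, which is what distinguishes this setup from scoring each layer in isolation.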

[NLP-5] Poller: Are LLMs Suitable for Evaluating the Poetry Understanding Task?

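[Quick read]: This paper addresses the poor fit of traditional automatic evaluation methods to modern Chinese poetry, which stems from the distinct nature of this literary genre; human evaluation remains reliable but is expensive and impractical for large-scale data. The paper proposes Poller (Poetry LLM Evaluator), a method that uses large language models (LLMs) to evaluate the poetry understanding task. Its core idea is to have the LLM play the role of the poem's author, supplied with detailed information, so that it emulates human evaluation and judgment from the poet's perspective. Experiments show that the method reduces the evaluation error between LLMs and humans; on the rhetorical techniques and defamiliarization dimensions it achieves error reductions of 94.55% and 89.53%, respectively, compared with baseline methods. The work bridges the gap between automated efficiency and human expertise and lays a foundation for automated evaluation in poetry-related tasks.

Link: https://arxiv.org/abs/2606.30556
Authors: Shanshan Wang, Derek F. Wong, Jingming Yao, Lidia S. Chao
Affiliations: University of Macau; NLP2CT Lab
Categories: Computation and Language (cs.CL)
Notes: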

Abstract:Traditional automatic evaluation methods have been shown to be unsuitable for modern Chinese poetry because of the distinct nature of this literary genre. Human evaluation remains reliable, but is expensive and not applicable to large-scale data. In this paper, we propose Poller (Poetry LLM Evaluator), a novel method leveraging large language models (LLMs) to evaluate the poetry understanding task. Specifically, our method requires LLMs to play the role of a poem’s author with detailed information, thereby emulating human evaluation and judgment by adopting the poet’s perspective. We conducted comprehensive experiments on multiple LLMs, evaluating the interpretations of poems across eight specialized dimensions. Experimental results demonstrate that our method effectively reduces the evaluation error between LLMs and humans. Especially for specific dimension evaluation, Poller-based LLMs achieve a 94.55% and 89.53% error reduction for rhetorical techniques and defamiliarization, respectively, compared to baseline methods. These performances are unattainable by conventional LLM evaluation methods. Experimental results from multiple LLMs across various dimensions validate the efficacy of our method. This work bridges the gap between automated efficiency and human expertise, establishing a foundation for automated evaluation in poetry-related tasks.

[NLP-6] TRACE: Temporal Relationship-Aware Conversational Entrainment Detection in Dyadic Speech

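[Quick read]: This paper addresses the detection of emotional entrainment in conversational interaction, focusing on dyadic speech and on capturing the dynamics of affective coordination shaped by social relationships and conversational context. The core difficulty is that pooling utterances into aggregate features discards the temporal dynamics of the interaction. The paper introduces DyadEE, a dataset containing both emotionally entrained conversations and synthetic interactions in which entrainment is disrupted through partner swapping and emotion resynthesis. It also proposes TRACE, a window-level framework that models a dyadic interaction as an ordered sequence of acoustic embeddings derived from emotion fine-tuned Whisper representations, treating each sample as an interaction trace rather than pooled utterances. Experiments on DyadEE show that incorporating conversational context and relationship information improves emotional entrainment detection, with TRACE achieving the best accuracy of 97.01%.

Link: https://arxiv.org/abs/2606.30543
Authors: Sathvik Manikantan Napa Ugandhar, Hao Zhang, Alison Gunzler, Yuzhe Wang, Thomas Thebaud, Georgi Tinchev, Venkatesh Ravichandran, Laureano Moro-Velázquez
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: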

Abstract:With the proliferation of speech AI agents, understanding emotional entrainment in conversational interaction has become increasingly important. Emotional entrainment is shaped by social relationships and conversational context, influencing affective coordination over time. We introduce DyadEE, a dataset for emotional entrainment detection in dyadic speech interactions, containing both emotionally entrained conversations and synthetic interactions where entrainment is disrupted through partner swapping and emotion resynthesis. We further propose TRACE, a window-level framework that models dyadic interaction as ordered sequences of acoustic embeddings derived from emotion fine-tuned Whisper representations, treating each sample as an interaction trace rather than pooled utterances. Experimental results on DyadEE show that incorporating conversational context and relationship information improves emotional entrainment detection, with TRACE achieving the best accuracy of 97.01%.

[NLP-7] Regime-Aware Peer Specialization for Robust RAG under Heterogeneous Knowledge Conflicts

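【Quick Read】: This paper addresses unreliable generation in retrieval-augmented generation (RAG) when the retrieved context conflicts with the model's parametric knowledge. Such conflicts span a reliability spectrum from reliable to adversarial context, yet existing methods typically apply a single, regime-agnostic supervision strategy, so learning signals from different reliability regimes interfere with one another. The key to the solution is RAPS-DA, a regime-aware peer specialization framework that disentangles the conflicting signals at two granularities. At the sample level, conflicts are divided into three regimes (Grounding, Arbitration, and Resistance); one same-scale peer specialist is trained per regime, and each sample is hard-routed to its matching specialist for on-policy reverse-KL supervision. At the token level, a dual-layer selector uses inter-teacher disagreement, student-teacher divergence, and student entropy to filter uninformative or unstable tokens, upweight confidently misaligned tokens, and gradually focus supervision on high-conflict tokens as the student matures. Specialization happens at a fixed model scale without a stronger teacher, and the peer specialists exist only during training, so the deployed student needs neither regime labels nor peer access. Across five conflict scenarios and two out-of-distribution benchmarks, the method outperforms prompting, decoding, fine-tuning, reinforcement learning, and single-teacher baselines.

Link: https://arxiv.org/abs/2606.30518
Authors: Bo Wang,Heyan Huang,Yaolin Li,Yanghao Zhou,Jiahao Teng,Ziyi Yang,Ge Shi,Chong Feng
Affiliations: Beijing Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: Work in progress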

Abstract:Retrieval-augmented generation (RAG) improves language models by grounding generation in external context. However, it can be fragile when the retrieved context conflicts with the model’s parametric knowledge. Such conflicts span a reliability spectrum, ranging from reliable and partially reliable evidence to adversarial context. Existing remedies often handle such heterogeneous conflicts with regime-agnostic supervision, which can conflate incompatible learning signals across reliability regimes. To disentangle these signals, we propose RAPS-DA, a regime-aware peer specialization framework that addresses conflict at two complementary granularities. At the sample level, conflicts are divided into three regimes, including Grounding, Arbitration, and Resistance, with one same-scale peer specialist trained per regime from a shared base model. Each sample is then hard-routed to its regime-matched peer for on-policy reverse-KL supervision. At the token level, a dual-layer selector uses inter-teacher disagreement, student-teacher divergence, and student entropy to filter uninformative or unstable tokens, upweight confidently misaligned ones, and gradually focus supervision on high-conflict tokens as the student matures. Gains stem from specialization at a fixed model scale, not from a stronger teacher, and the peer specialists exist only during training, so the deployed student requires no regime labels or peer access. Experiments on five conflict scenarios and two out-of-distribution benchmarks show RAPS-DA surpasses all prompting, decoding, fine-tuning, RL, and single-teacher baselines.
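The sample-level step described in this abstract (hard routing to a regime-matched peer, then reverse-KL supervision) can be illustrated with a minimal sketch. It assumes toy three-token distributions and hypothetical names (`PEER_TEACHERS`, `routed_reverse_kl`); the abstract gives no implementation details, so this is not the authors' code, and the token-level selector is omitted.

```python
import math

def reverse_kl(student, teacher):
    """KL(student || teacher) for two discrete distributions over one vocabulary."""
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0.0)

# Hypothetical peer specialists, one per regime named in the abstract.
# The next-token distributions are made-up toy numbers.
PEER_TEACHERS = {
    "grounding":   [0.70, 0.20, 0.10],
    "arbitration": [0.40, 0.40, 0.20],
    "resistance":  [0.10, 0.20, 0.70],
}

def routed_reverse_kl(student_dist, regime):
    """Hard routing: exactly one regime-matched peer supervises the sample."""
    teacher_dist = PEER_TEACHERS[regime]
    return reverse_kl(student_dist, teacher_dist)

student = [0.60, 0.30, 0.10]
losses = {regime: routed_reverse_kl(student, regime) for regime in PEER_TEACHERS}
```

In this toy setup the student is closest to the "grounding" peer, so routing the sample there gives the smallest loss. Reverse KL takes its expectation under the student's own distribution, which is what makes it a natural fit for on-policy training.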

[NLP-8] SIMAX: A Scalable and Interpretable Framework for Multi-Fidelity and Annotated Clinician-Patient Dialogue Simulation

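【Quick Read】: This paper addresses a core obstacle to developing and evaluating clinical communication coding systems: real-world clinician-patient dialogues and human-coded labels are hard to obtain at scale, so training, validating, and refining these systems lacks sufficient, controllable data. The key to the solution is SIMAX (Scalable and Interpretable Framework for Multi-Fidelity and Annotated Clinician-Patient Dialogue Simulation), which generates clinician-patient dialogues in a controlled way from predefined clinical scenarios, personas, voice conditions, and target communication behaviors. Behaviors are controlled with two codebooks: the Global Codebook for overall communication quality and the WISER Codebook for specific countable behaviors, so that the generated data carries traceable reference annotations. The 3,388 generated dialogues show reasonable speech naturalness, high transcription fidelity, and positive text-audio correspondence, and were assessed with both automated and human evaluation, including clinical realism. Downstream evaluation suggests SIMAX can assess how a communication coding system responds to specific behavioral targets and can reveal insufficient sensitivity in some dimensions. SIMAX therefore provides a reproducible, interpretable, behavior-controlled simulated data foundation for building, validating, and refining communication coding systems.

Link: https://arxiv.org/abs/2606.30491
Authors: Zhuhan Bao,Rui Yang,Bohao Yang,Zhiyi Liu,Sicheng Shu,Ruio Heerschap,Le Li,Doris Yang,Elisabeth Bond,Haoyuan Wang,Nicoleta Economou-Zavlanos,Joshua M. Biro,Matthew McDermott,Nan Liu,Anand Chowdhury,Kai Sun,Kathryn Pollak,Ed Hammond,Chuan Hong
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: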

Abstract:Background. The widespread deployment of ambient digital scribes is driving large-scale capture of clinician-patient dialogues. Human coding of clinical communication data remains costly, inconsistent, and difficult to scale, motivating AI-driven communication coding systems. However, evaluating these systems requires real-world dialogues and human-coded labels, both hard to obtain at scale. Methods. We developed SIMAX (Scalable and Interpretable Framework for Multi-Fidelity and Annotated Clinician-Patient Dialogue Simulation), a framework for generating controlled clinical dialogue data with reference behavioral annotations. SIMAX generates clinician-patient dialogues from predefined clinical scenarios, personas and voice conditions, and target communication behaviors. Behaviors are controlled using two codebooks: the Global Codebook for overall communication quality and the WISER Codebook for specific countable behaviors. We evaluated SIMAX using automated and human quality assessments and an example communication coding system. Results. SIMAX generated 3,388 simulated dialogues across three specialties, multiple visit stages, persona characteristics, and accent conditions. Automated assessment showed mean UTMOS and WV-MOS scores of 3.03 and 2.61, WER and CER of 0.07 and 0.05, and CLAP cosine similarity of 0.41, suggesting reasonable speech naturalness, high transcription fidelity, and positive text-audio correspondence. Human evaluation showed a median MOS of 4.67 and a median clinical realism score of 3.00. Downstream evaluation suggests that SIMAX can assess how a communication coding system responds to behavioral targets and reveal insufficient sensitivity in some dimensions. Conclusions. SIMAX generates controlled and reproducible simulated clinician-patient dialogues, providing a data foundation for developing, validating, and refining communication coding systems. 

[NLP-9] Situation Perception: A Necessary Primitive to Artificial Superintelligence

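【Quick Read】: The core problem this paper examines is that current large language models, despite strong pattern recognition and generation abilities, remain statistical engines that lack basic human-like cognitive capacities and therefore fall short of general intelligence. Human intelligence starts from almost no explicit knowledge, yet gradually builds an understanding of object permanence, cause and effect, other minds, bodily agency, and the persistence of the physical world. Existing models lack this kind of dynamic cognition grounded in internal simulation. The proposed remedy centers on "situation perception": the ability to construct, revise, and act within internal simulations of possible worlds across latent time. It requires at least three core components: abstract prediction, long-term compressed memory, and active learning guided by objectives. The paper also proposes tests for measuring progress toward artificial superintelligence (ASI) and considers the consequences of machines that can simulate futures, pursue self-directed goals, and possibly judge their own creators.

Link: https://arxiv.org/abs/2606.30481
Authors: Ziqin Yuan,Jaymari Chua
Affiliations: The University of New South Wales
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
Comments: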

Abstract:Current large language models are extraordinary statistical engines. They compress vast amounts of text into useful patterns and can explain science, write code, imitate reasoning, and participate in philosophical conversation. Yet pattern mastery is not the same as general intelligence. A human infant begins with little explicit knowledge, but gradually discovers object permanence, cause and effect, other minds, bodily agency, and the persistence of the physical world. We argue that the path to artificial superintelligence (ASI) depends on a missing capacity we call situation perception: the ability to construct, revise, and act within internal simulations of possible worlds across latent time. Situation perception requires at least three core components: abstract prediction, long-term compressed memory, and active learning guided by objectives. In this work, we analyse why modern large language models remain incomplete, and propose the appropriate tests for measuring progress and consequences of machines that can simulate futures, pursue self-directed goals, and possibly judge their own creators.

[NLP-10] MOPD: Multi-Teacher On-Policy Distillation for Capability Integration in LLM Post-Training

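【Quick Read】: This paper addresses how to efficiently integrate multiple domain-specific reinforcement learning (RL) capabilities into one large language model (LLM) during post-training. Existing methods such as Off-Policy Finetune and Mix-RL are either inefficient or lose performance. The paper proposes Multi-teacher On-Policy Distillation (MOPD): specialized RL is first run per domain to obtain a set of domain teachers, and these teachers are then distilled into the student on the student's own rollouts, which eliminates exposure bias and provides a dense optimization signal. MOPD outperforms the Mix-RL, Cascade RL, Off-Policy Finetune, and Param-Merge baselines and inherits nearly all of each teacher's capability, while allowing domain teachers to be developed in parallel and independently, which removes cross-domain coupling. MOPD has been deployed in the post-training of MiMo-V2-Flash, an industrial-scale frontier model.

Link: https://arxiv.org/abs/2606.30406
Authors: Wenhan Ma,Jianyu Wei,Liang Zhao,Hailin Zhang,Bangjun Xiao,Lei Li,Qibin Yang,Bofei Gao,Yudong Wang,Rang Li,Jinhao Dong,Zhifang Sui,Fuli Luo
Affiliations: Peking University; Xiaomi; University of Hong Kong; Renmin University of China
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: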

Abstract:Modern large language models (LLMs) rely on reinforcement learning during post-training to push specific capabilities, yet integrating multiple capabilities into one model remains hard. Existing methods, such as Off-Policy Finetune and Mix-RL, are either inefficient or lose performance. In this work, we propose Multi-teacher On-Policy Distillation (MOPD), a post-training paradigm for combining the capabilities of multiple domain RL teachers: we first run per-domain specialised RL to obtain a set of domain teachers, then distill these teachers into the student on its own rollouts. This eliminates exposure bias and provides a dense optimization signal. On Qwen3-30B-A3B, MOPD outperforms Mix-RL, Cascade RL, Off-Policy Finetune, and Param-Merge baselines, inheriting nearly all of each teacher’s capability. MOPD also enables parallel, independent development of domain teachers, removing the cross-domain coupling typical of multi-domain post-training. MOPD has been deployed in the post-training of MiMo-V2-Flash, an industrial-scale frontier model, demonstrating its practical value for capability integration in frontier-scale LLMs.

[NLP-11] Uncovering Salience-Driven Dynamics in Consumer Confidence with Generative Social Simulation

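【Quick Read】: This paper addresses the neglect of household-level heterogeneity and behavioral mechanisms in conventional modeling of the Consumer Confidence Index (CCI). Although CCI is usually treated as a persistent macroeconomic index, its movements arise from households that interpret economic signals differently under heterogeneous constraints, exposures, prior beliefs, and attention. The paper proposes ConsumerSim, a generative Human-Environment response framework that reconstructs CCI dynamics from a microdata-calibrated synthetic population; time-stamped macroeconomic, financial, policy, and news signals; survey-like response generation; post-stratified belief expansion; and behavioral inertia alignment. Across official CCI series for the U.S., the EU27, and Japan, ConsumerSim ranks first among persistence, time-series, regression, and information-augmented baselines on the reported reconstruction metrics, with clear gains around high-salience shocks, and its reconstructed signal improves short-horizon prediction of real activity, most consistently for housing. Mechanism analyses show that confidence movements concentrate around salient events, that subgroup trajectories often align in direction while differing in magnitude, and that signal sensitivity varies across income, homeownership, education, and political-alignment groups. Population-expansion and ablation experiments indicate that representative aggregation, situational signals, persona heterogeneity, and inertia are all necessary for accuracy and diagnosis. The findings support viewing consumer confidence as an interpretable Human-Environment response process rather than a purely aggregate time series.

Link: https://arxiv.org/abs/2606.30395
Authors: Yixu Huang,Yunlu Yin,Jiayu Lin,Xinnong Zhang,Jia Wang,Siyuan Wang,Xuanjing Huang,Liyin Jin,Zhongyu Wei
Affiliations: Fudan University
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: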

Abstract:Consumer confidence is typically modeled as a persistent macroeconomic index, yet its movements arise from households that interpret economic information through heterogeneous constraints, exposures, prior beliefs, and attention. We introduce ConsumerSim, a generative Human–Environment response framework that reconstructs Consumer Confidence Index (CCI) dynamics from a microdata-calibrated synthetic population, time-stamped macroeconomic, financial, policy, and news signals, survey-like response generation, post-stratified belief expansion, and behavioral inertia alignment. Across U.S., EU27, and Japanese official CCI target series, ConsumerSim ranks first among persistence, time-series, regression, and information-augmented baselines on the reported reconstruction metrics, with clear gains around high-salience shocks. Its reconstructed signal also improves short-horizon prediction of real activity, most consistently for housing outcomes. Mechanism analyses show that CCI movements concentrate around salient events; subgroup trajectories often align in direction while differing in magnitude; and signal sensitivity varies across income, homeownership, education, and political-alignment groups. Population-expansion and ablation results indicate that representative aggregation, situational signals, persona heterogeneity, and inertia are necessary for both accuracy and diagnosis. The findings support a behavioral view of consumer confidence as an interpretable Human–Environment response process rather than a purely aggregate time series.

[NLP-12] MaDI-Bench: An End-to-End Data Integration Benchmark

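【Quick Read】: This paper addresses the lack of end-to-end benchmarks in data integration research. Existing benchmarks usually evaluate a single step of the pipeline (such as schema matching or entity matching) or cover only part of it, so they cannot measure the performance of an integration method as a whole. To fill this gap, the paper introduces the Mannheim Data Integration Benchmark (MaDI-Bench), the first end-to-end benchmark covering the full data integration process. Its contributions are (i) a set of base end-to-end integration tasks spanning several application domains, each requiring the full pipeline of schema matching, value normalization, entity blocking, entity matching, and conflict resolution; and (ii) a generic method for deriving task variants that mitigates rapid benchmark saturation as systems advance. Validation with human-engineered pipelines, a best-of-breed pipeline, and an LLM-based pipeline shows that the benchmark can measure both step-wise and end-to-end performance. All benchmark artifacts are publicly available.

Link: https://arxiv.org/abs/2606.30371
Authors: Aaron Steiner,Ralph Peeters,Christian Bizer
Affiliations: Unknown
Categories: Databases (cs.DB); Computation and Language (cs.CL)
Comments: 14 pages, 1 figure, 13 tables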

Abstract:Data integration combines heterogeneous data sets into a single, coherent representation. Data integration involves a sequence of interdependent tasks including schema matching, value normalization, entity blocking, entity matching, and data fusion. Existing benchmarks either evaluate these steps in isolation or cover only incomplete versions of the data integration pipeline, omitting specific steps. The lack of public end-to-end data integration benchmarks hinders research on data integration methods that address the integration process as a whole. This paper fills this gap by introducing the Mannheim Data Integration Benchmark (MaDI-Bench), the first benchmark for the end-to-end integration of relational tables covering all steps of the integration process. MaDI-Bench contributes (i) a set of base end-to-end data integration tasks spanning several application domains, each requiring the full schema matching, value normalization, entity matching, and conflict resolution pipeline; and (ii) a generic method for deriving task variants that mitigates rapid benchmark saturation as data integration systems advance. We validate the benchmark using human-engineered pipelines, a best-of-breed pipeline, and an LLM-based pipeline. The validation demonstrates the utility of the benchmark for measuring the step-wise as well as the end-to-end performance of data integration pipelines. All benchmark artifacts are available for public download.

[NLP-13] OLIVE: View-Augmented Latent Prediction with Waveform Reconstruction for Speech SSL

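【Quick Read】: This paper addresses how self-supervised speech representation learning can improve both the robustness of learned representations and waveform reconstruction quality. Existing methods tend to perform unevenly across tasks such as generation and speaker recognition, and struggle to combine invariant contextual representations with retention of signal-level information. The paper proposes OLIVE (Online Latent prediction with Invariant Views and rEconstruction), which jointly optimizes analysis and synthesis objectives under a unified objective. View-augmented masked latent prediction pushes later contextual representations toward invariance, which supports robust downstream performance, while waveform reconstruction constrains early encoder features to retain signal-level information. Together, the two objectives improve results on generation and speaker tasks, maintain competitive performance on recognition and semantic tasks, and improve waveform reconstruction.

Link: https://arxiv.org/abs/2606.30356
Authors: Karl El Hajal,Mathew Magimai.-Doss
Affiliations: Idiap Research Institute (Switzerland); EPFL (Switzerland)
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: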

Abstract:We propose Online Latent prediction with Invariant Views and rEconstruction (OLIVE), a self-supervised speech representation learning framework that jointly optimizes analysis and synthesis objectives. OLIVE combines view-augmented masked latent prediction with waveform reconstruction under a unified objective. Reconstruction constrains early encoder features to retain signal-level information, while masked latent prediction shapes later contextual representations toward invariance for robust downstream performance. We show that these objectives enable representations that support a broad range of tasks. In particular, OLIVE improves results on generation and speaker tasks, maintains competitive performance on recognition and semantic tasks, and improves waveform reconstruction.

[NLP-14] REAR: Test-time Preference Realignment through Reward Decomposition ICML2026

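【Quick Read】: This paper addresses the difficulty of aligning large language models (LLMs) with diverse user preferences. Post-training can customize a model but usually requires costly data curation and additional training; test-time scaling (TTS) is an efficient, training-free alternative, but its use has been largely limited to verifiable domains such as mathematics and coding. The paper frames preference alignment as a realignment problem, since the base model often fails to align sufficiently with the stated preference. Its key idea is to decompose the underlying reward function into two components, one related to the question and one related to the preference information. From this it derives the REAlignment Reward (REAR), which selectively rescales the proportions of the two reward terms. REAR can be written as a linear combination of token-level policy log-probabilities, which makes it computationally efficient and easy to integrate with TTS algorithms such as best-of-N sampling and tree search. Experiments show that, compared with other test-time baselines, REAR enables scalable test-time realignment under diverse user requirements and also generalizes to mathematical and visual tasks under appropriate preference settings.

Link: https://arxiv.org/abs/2606.30339
Authors: Fuxiang Zhang,Pengcheng Wang,Chenran Li,Yi-Chen Li,Yuxin Chen,Lang Feng,Chenfeng Xu,Masayoshi Tomizuka,Bo An
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted by ICML 2026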

Abstract:Aligning large language models (LLMs) with diverse user preferences is a critical yet challenging task. While post-training methods can adapt models to specific needs, they often require costly data curation and additional training. Test-time scaling (TTS) presents an efficient, training-free alternative, but its application has been largely limited to verifiable domains like mathematics and coding, where response correctness is easily judged. To extend TTS to preference alignment, we introduce a novel framework that models the task as a realignment problem, since the base model often fails to sufficiently align with the stated preference. Our key insight is to decompose the underlying reward function into two components: one related to the question and the other to preference information. This allows us to derive a REAlignment Reward (REAR) that selectively rescales the proportions of these two reward terms. We then show that REAR can be formulated as a linear combination of token-level policy log-probabilities, making it computationally efficient and easy to integrate with various TTS algorithms such as best-of-N sampling and tree search. Experiments show that compared to other test-time baselines, REAR not only enables scalable test-time realignment for preference alignment tasks under diverse user requirements, but also generalizes to mathematical and visual tasks under appropriate preference settings.
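To make the "linear combination of token-level log-probabilities" idea concrete, here is a hedged sketch of best-of-N reranking. The specific decomposition is an assumption for illustration only: the question term is taken to be the log-probability of a response given the question alone, and the preference term to be the gain in log-probability when the preference is added to the prompt. The abstract does not state REAR's actual formula, and the names `realignment_score` and `best_of_n` and the weights are hypothetical.

```python
def realignment_score(logp_q, logp_qp, w_question=1.0, w_preference=2.0):
    """Score one response from its per-token log-probabilities.

    logp_q:  log p(token | question, prefix)              (assumed question term)
    logp_qp: log p(token | question, preference, prefix)
    The preference term is the per-token log-ratio between the two (assumption).
    """
    question_term = sum(logp_q)
    preference_term = sum(lp - lq for lp, lq in zip(logp_qp, logp_q))
    return w_question * question_term + w_preference * preference_term

def best_of_n(candidates, **weights):
    """Return the candidate with the highest realignment score."""
    return max(
        candidates,
        key=lambda c: realignment_score(c["logp_q"], c["logp_qp"], **weights),
    )

# Toy candidates with made-up log-probabilities.
candidates = [
    {"name": "fluent_but_ignores_preference",
     "logp_q": [-0.5, -0.5], "logp_qp": [-0.6, -0.6]},
    {"name": "follows_preference",
     "logp_q": [-1.0, -1.0], "logp_qp": [-0.4, -0.4]},
]

plain_pick = best_of_n(candidates, w_preference=0.0)
realigned_pick = best_of_n(candidates, w_preference=2.0)
```

With the preference weight at zero, the reranker prefers the response that is most likely given the question alone; raising the weight shifts the choice toward the response whose likelihood benefits most from the stated preference. Because the score is a sum of log-probabilities, it needs only forward passes of the policy.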

[NLP-15] DialogPII: A multilingual dataset of synthetic dialog transcripts to detect personal information

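【Quick Read】: This paper addresses how to detect and remove personally identifiable information (PII) from conversational data in domains such as healthcare and the social sciences so that the data can be shared responsibly. The core challenge is building a high-quality, multilingual benchmark covering both written and speech-derived text to support research on and evaluation of automatic de-identification systems. The solution is DialogPII, a multilingual dataset of dialogs and speech-derived transcripts that were generated semi-automatically with large language models, manually curated for plausibility and diversity, and localized to country- and city-specific contexts. It covers 8 interaction scenarios, 19 entity types, and 11 languages. All dialogs were converted to speech via text-to-speech synthesis, transcribed with Whisper, and annotated through automatic projection plus manual correction, yielding aligned written and speech-derived resources. The authors also release baseline multilingual named entity recognition models and provide technical validation through inter-annotator agreement analysis, translation quality evaluation, annotation projection assessment, and benchmark experiments with transformer-based sequence labeling models.

Link: https://arxiv.org/abs/2606.30312
Authors: Roland Roller,Vera Czehmann,Derya Erman,Luke Flanagan,Ibrahim Baroud,Frédéric Blain,Viviana Cotik,Eletta Giusto,Akhil Juneja,Mariana Neves,Maria Słowińska,Christine Hovhannisyan,Aaron Louis Eidt,Lisa Raithel,Sebastian Möller,Maija Poikela
Affiliations: DFKI (German Research Center for Artificial Intelligence); BIH (Charité-Universitätsmedizin Berlin)
Categories: Computation and Language (cs.CL)
Comments: currently under review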

Abstract:Conversational data collected in domains such as healthcare or social sciences is a valuable resource for research and automated analysis. However, responsible data sharing requires the detection and removal of personally identifiable and sensitive information to protect individual privacy. To support the development and evaluation of automatic de-identification systems, we present DialogPII, a multilingual dataset of synthetic dialogs and speech-derived transcripts for personal information detection. DialogPII covers eight interaction scenarios (emergency calls, medical anamnesis interviews, therapy sessions, insurance communication, customer support, clinical interviews regarding an AI-supported dashboard, police reports, and group therapy discussions), 19 entity types, and 11 languages (English, Arabic, Finnish, French, German, Hindi, Italian, Polish, Portuguese, Spanish, and Turkish). Dialogs were generated semi-automatically using large language models, manually curated for plausibility and diversity, and localized to country- and city-specific contexts. All dialogs were additionally converted to speech via text-to-speech synthesis, transcribed with Whisper, and annotated through automatic projection and manual correction, yielding aligned written and speech-derived resources across all languages. We further release baseline multilingual named entity recognition models and provide technical validation through inter-annotator agreement analysis, translation quality evaluation, annotation projection assessment, and benchmark experiments with transformer-based sequence labeling models.

[NLP-16] When Is a Draft Accepted? A Theory of Acceptance in Speculative Decoding

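【Quick Read】: This paper addresses speculative decoding as used in practical inference systems, where greedy decoding, relaxed acceptance rules, or tree-based candidate sets mean that success is governed by local ranking and threshold events. Existing theory mostly studies the stochastic, distribution-preserving setting, whose goal is to sample exactly from the target distribution, and does not cover these regimes. The key contribution is a theory for the non-distribution-preserving regimes: the paper identifies that the rejection regions of many common acceptance criteria can be characterized as lower level sets of the target distribution, and from this characterizes the exact KL divergence required for rejection, which yields exact certificates and sharp margin-based bounds for strict greedy decoding, additive and multiplicative relaxed acceptance, top-m relaxed criteria, and entropy-thresholded acceptance. The framework is extended to greedy tree decoding, with exact and margin-only certificates for when the target model's greedy token remains covered by the drafter's top-m candidates. Evaluation on Qwen3 models shows that relaxed and tree-based criteria substantially enlarge the region of certified acceptance, especially on decoding steps where the target distribution margin is low. The results complement existing distribution-preserving analyses by characterizing the deterministic local acceptance events common in practical inference systems.

Link: https://arxiv.org/abs/2606.30265
Authors: Aaryam Sharma
Affiliations: University of Waterloo
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 29 pages, 5 figures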

Abstract:Speculative decoding accelerates language model inference by using a fast drafter to propose candidate tokens that are then verified by a larger target model. Existing theory largely studies the stochastic, distribution-preserving setting, where the goal is to exactly sample from the target distribution. In contrast, many practical systems use greedy decoding, relaxed acceptance rules, or tree-based candidate sets, where success is governed by local ranking and threshold events rather than exact distributional equality. We develop a theory for these regimes. We identify that many common acceptance criteria have rejection regions that can be characterized as lower level sets of the target distribution. For these, we characterize the exact KL divergence required for rejection, yielding exact certificates and sharp margin-based bounds for strict greedy decoding, additive and multiplicative relaxed acceptance, top-m relaxed criteria, and entropy-thresholded acceptance. We then extend the framework to greedy tree decoding, deriving exact and margin-only certificates for when the target greedy token remains covered by the drafter’s top-m candidates. Finally, we evaluate the resulting certificates on Qwen3 models, showing that relaxed and tree-based criteria substantially enlarge the region of certified acceptance, especially on decoding steps with low target model distribution margin. These results complement existing distribution-preserving analyses of speculative decoding by characterizing the deterministic local acceptance events common in practical inference systems.
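The deterministic acceptance events named in this abstract can be written as simple predicates on the target model's next-token distribution. The sketch below uses commonly used forms of each rule; the exact definitions, thresholds, and function names are assumptions rather than the paper's, and the entropy-thresholded rule is left out.

```python
def argmax(probs):
    """Index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def top_m(probs, m):
    """Set of indices of the m most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:m])

def accept_strict_greedy(target, token):
    """Accept only the target model's greedy token."""
    return token == argmax(target)

def accept_additive(target, token, eps):
    """Accept if the token is within eps of the top target probability."""
    return target[token] >= max(target) - eps

def accept_multiplicative(target, token, ratio):
    """Accept if the token reaches a fraction `ratio` of the top target probability."""
    return target[token] >= ratio * max(target)

def accept_top_m(target, token, m):
    """Accept if the token is among the target's m most probable tokens."""
    return token in top_m(target, m)

def tree_covers_greedy(target, drafter, m):
    """Tree decoding: is the target's greedy token among the drafter's top-m?"""
    return argmax(target) in top_m(drafter, m)

# Toy distributions over a 4-token vocabulary (made-up numbers).
target = [0.40, 0.35, 0.15, 0.10]
drafter = [0.25, 0.50, 0.15, 0.10]
draft_token = argmax(drafter)  # the drafter proposes token 1
```

Here the drafter's proposal is the target's second choice, so strict greedy verification rejects it while each relaxed rule accepts it for a loose enough threshold. This is the small-margin situation in which the abstract reports that relaxed and tree-based criteria enlarge the certified acceptance region.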

[NLP-17] Multi-Agentic System Leveraging Open-Source LLMs to Mitigate Disinformation Threats

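【Quick Read】: This paper addresses the growing threat of disinformation, which is amplified by electronic communication, social media, and rapid advances in artificial intelligence; at this scale, manual fact-checking and human verification are no longer adequate. The paper proposes an automated approach based on a multi-agent system that emulates the decision-making of human annotators in disinformation detection. It incorporates a consensus mechanism, diversity in cognition, diversity in knowledge, and a hierarchical structure, and achieves better results than individual large language models (LLMs), including GPT-4 and GPT-3.5. The tasks include direct disinformation detection, identification of texts worth verifying, and detection of texts containing verifiable factual claims. The system uses open models (e.g., LLaMA, Kimi, Qwen, Deepseek, and LLaMA-Nemotron) for greater transparency, and is evaluated on languages with varying resource availability: English (high-resource), Polish (medium-resource), and Slovak and Bulgarian (low-resource).

Link: https://arxiv.org/abs/2606.30259
Authors: Sebastian Kula,Martin Tamajka
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: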

Abstract:In contemporary societies, the threat of disinformation has reached alarming levels, exacerbated by the proliferation of electronic communication, social media, and advancements in artificial intelligence. As a result, there is an urgent need to develop effective countermeasures to mitigate this menace. However, the sheer scale of the problem renders manual fact-checking and human-based verification inadequate, underscoring the necessity for automated methods to detect and debunk disinformation. This article proposes a novel approach based on a multi-agent system that emulates the decision-making processes of human annotators engaged in disinformation detection tasks. By incorporating a consensus mechanism, diversity in cognition, diversity in knowledge, and a hierarchical structure, all inspired by human annotators’ behavior, the proposed method achieves superior results compared to individual Large Language Models (LLMs), including GPT-4 and GPT-3.5. The system leverages open models (e.g., LLaMA, Kimi, Qwen, Deepseek and LLaMA-Nemotron) to ensure greater transparency. The evaluation of the proposed method encompasses datasets in languages with varying resource availability, including English (high-resource), Polish (medium-resource), Slovak (low-resource) and Bulgarian (low-resource). Experiments were conducted on tasks such as direct disinformation detection, identification of texts worthy of verification, and detection of texts containing verifiable factual claims.

[NLP-18] Grounding LLM Reasoning under Incomplete Graph Evidence

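【Quick Read】: This paper addresses how to guide large language model (LLM) reasoning reliably when the knowledge graph (KG) is incomplete. The core challenge is that the graph a system sees is usually a retrieved, temporally scoped, and incomplete evidence state rather than a complete account of truth. The paper develops a theoretical framework in which the entity anchors, typed relation residuals, path energies, and support regions induced by the incomplete graph are combined with the language model's own prior over trajectories to softly ground its reasoning trajectories. The key is to characterize soft grounding as a KL-regularized deformation of the LLM prior: finite slack preserves support for unobserved but non-contradicted trajectories, whereas hard conditioning appears as the infinite-penalty limit. The framework also yields stability bounds under evidence perturbations and clarifies the constraint regimes appropriate for GraphRAG, KGQA, graph agents, constrained decoding, and faithful generation. All claims are evidence-relative: KG compatibility is treated as declared support, not factual truth.

Link: https://arxiv.org/abs/2606.30247
Authors: Jiaqi Li,Fanghui Song
Affiliations: Tianjin Normal University; Harbin Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: A theoretical perspective about Grounding LLM Reasoning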

Abstract:Knowledge graphs can guide large language model (LLM) reasoning, but the graph seen by a system is usually a retrieved, linked, temporally scoped, and incomplete evidence state rather than a complete account of truth. We develop a theoretical perspective on grounding observable LLM trajectories under such incomplete graph evidence. The evidence state induces entity anchors, typed relation residuals, path energies, and support regions, while the language model supplies a prior over candidate trajectories. We show that, under open-world incompleteness, no hard rule based only on the observed state can both reject every false unsupported trajectory and retain every true-but-unobserved trajectory. We then characterize soft grounding as a KL-regularized deformation of the LLM prior: finite slack preserves support for unsupported but non-contradicted trajectories, whereas hard conditioning appears as an infinite-penalty limit. The framework also yields stability bounds under evidence perturbations and clarifies the constraint regimes appropriate for GraphRAG, KGQA, graph agents, constrained decoding, and faithful generation. The claims are evidence-relative: KG compatibility is treated as declared support, not factual truth.
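A KL-regularized deformation of a prior has a standard closed form: minimizing the expected penalty plus lam times KL(q || prior) gives q proportional to prior times exp(-penalty / lam). The sketch below illustrates that general fact on three toy trajectories; treating the paper's graph-derived quantities as a single scalar penalty per trajectory is an assumption, and `soft_ground` is a hypothetical name.

```python
import math

def soft_ground(prior, penalty, lam=1.0):
    """Reweight a prior over trajectories by exp(-penalty / lam), then normalize.

    This is the minimizer of E_q[penalty] + lam * KL(q || prior).
    A penalty of math.inf removes a trajectory entirely (hard conditioning).
    """
    weights = [p * math.exp(-e / lam) for p, e in zip(prior, penalty)]
    total = sum(weights)
    return [w / total for w in weights]

# Toy LLM prior over three candidate reasoning trajectories (made-up numbers).
prior = [0.5, 0.3, 0.2]

# Finite slack: the unsupported trajectory is penalized but keeps some mass.
soft = soft_ground(prior, penalty=[0.0, 1.0, 5.0])

# Infinite penalty: the same trajectory is excluded outright.
hard = soft_ground(prior, penalty=[0.0, 1.0, math.inf])
```

Under the finite penalty every trajectory with nonzero prior mass keeps nonzero posterior mass, which is how a true-but-unobserved trajectory can survive. Sending the penalty to infinity recovers hard conditioning, matching the limit described in the abstract.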

[NLP-19] Comparing Human and Automatic Recognition of Dutch Dysarthric Continuous Speech: A Case Study

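【Quick Read】: This paper addresses speech recognition for speakers with severe dysarthria, specifically personalised dysarthric speech recognition (DSR) for continuous read and spontaneous speech from a single Dutch speaker. Both human listeners and three state-of-the-art off-the-shelf automatic speech recognition (ASR) systems (Whisper-large-V3, Google Chirp 3, and Omnilingual) show word error rates (WER) above 70% on average for this speaker, which indicates that dysarthric speech remains highly challenging. The key step is fine-tuning on this speaker's dysarthric speech, which significantly reduces WER to about 23%; the personalised DSR models then outperform the human listeners and move closer to being useful for day-to-day communication. Future work should improve personalised DSR on spontaneous speech and on longer read utterances, with particular attention to specific phonemes.

Link: https://arxiv.org/abs/2606.30237
Authors: Yuanyuan Zhang,Dimme de Groot,Jorge Martinez,Odette Scharenborg
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: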

Abstract:In our goal to develop personalised dysarthric speech recognition (DSR) models, this study compared the recognition performances of human listeners and those of three state-of-the-art, off-the-shelf ASR systems (Whisper-large-V3, Google Chirp 3, and Omnilingual) on the recognition of Dutch continuous read and spontaneous speech from a single speaker with severe dysarthria. Results showed that both human listeners and the three off-the-shelf ASR systems exhibit word error rates (WER) exceeding 70% on average, indicating that DSR is highly challenging for both humans and ASR systems. Fine-tuning on the dysarthric speech significantly reduced WER. Although overall WERs are still quite high (23%), the personalised DSR models outperformed the human listeners, and performance is getting closer to being useful for supporting day-to-day communication of dysarthric speakers. Future research should focus on improving personalised DSR on spontaneous speech and longer utterances in the case of read speech, with a specific focus on particular phonemes.
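For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a generic implementation, not the evaluation script used in the study; the example sentences are invented toy data and whitespace tokenization is an assumption.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref_words)

# Toy example: one of six reference words is dropped by the recognizer.
wer_one_deletion = word_error_rate("de kat zit op de mat", "de kat zit op mat")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference, which is worth keeping in mind when reading averages above 70%.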

[NLP-20] CaresAI at CT-DEB26: Detecting Dosing Errors In Clinical Trials Using Domain-Specific Transformer Embeddings and Classification Models LREC2026

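【Quick Read】: This paper addresses the early identification of dosing errors in clinical trials (CT), which can lead to patient harm, adverse drug events, and worse clinical outcomes. Its core approach encodes clinical trial text with transformer language models trained on biomedical corpora (such as BioBERT and PubMedBERT), combines the resulting text embeddings with categorical features, and feeds them to classical machine learning and neural network models to predict dosing error risk. Under a logistic regression baseline, BioBERT consistently outperformed the alternative encoders, reaching an ROC-AUC of 0.794, a 3.95% improvement over the ClinicalBERT baseline, while gradient boosting, support vector classifiers, logistic regression, and residual neural networks achieved the strongest ROC-AUCs of 0.821 to 0.853. Combining multiple embeddings did not improve performance, which indicates that domain alignment matters more than representational stacking. By integrating domain-specific text representations with structured metadata, the method discriminates trials that meet a predefined elevated dosing error risk criterion, supporting safety monitoring and regulatory decision-making.

Link: https://arxiv.org/abs/2606.30236
Authors: Leon Hamnett,Favour Igwezeke,Joseph Itopa Abubakar,Mary Adetutu Adewunmi
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 18 pages, published in CL4Health 2026 proceedings (3rd Workshop on Patient-oriented language processing) @ LREC 2026 this http URL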

Abstract:Medication errors, particularly dosing errors in clinical trials (CT), can lead to patient harm, adverse drug events and worse patient outcomes. Dosing errors are preventable, and early identification can improve trial integrity and mitigate subsequent clinical and financial burden. This study aims to detect dosing errors within CT protocols by evaluating text representations of trial information using transformer-based language models trained on biomedical corpora. CT textual data was encoded using several models, including ClinicalBERT, PubMedBERT, BioBERT, and MedCPT, and integrated with categorical features. These text embeddings were used as input to classical machine learning models and neural network architectures within an experimental framework. Performance was primarily assessed using ROC-AUC with respect to predicting dosage error. Under a logistic regression baseline, BioBERT consistently outperformed alternative encoders, achieving an ROC-AUC of 0.794, a 3.95% improvement over the ClinicalBERT baseline. Combining multiple embeddings did not yield improvements, indicating that domain alignment outweighs representational stacking. Gradient boosting models, support vector classifiers, logistic regression, and residual neural networks achieved the strongest performance for predicting dosage error, achieving ROC-AUCs: 0.821 to 0.853. Overall, the integration of domain-specific transformer embeddings with structured metadata enables discrimination of trials meeting a predefined elevated dosing error risk criterion, advancing safety monitoring and supporting informed regulatory decision-making.

[NLP-21] EvalSafetyGap: A Hybrid Survey and Conceptual Framework for LLM Evaluation-Safety Failures

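【Quick Read】: This paper addresses a measurement problem shared by LLM evaluation and AI safety: benchmark scores, reward-model signals, and reported safety metrics can keep improving while the latent properties they are meant to represent remain difficult to verify. It combines a hybrid survey (a systematic search paired with narrative synthesis and separately tracked grey evidence), a conceptual framework, and a structured ten-model audit. The synthesis covers evaluation-safety measurement work from 2018 to 2026 across eight evidence streams: benchmark validity, dynamic evaluation, LLM-as-judge reliability, safety evaluation, jailbreak/refusal robustness, reward hacking, mechanistic interpretability, and governance/auditability. EvalSafetyGap is proposed as an organizing hypothesis for comparing evaluation-side and alignment-side proxy failures under optimization pressure, using Goodhart's Law together with two newly developed constructs, an Instability Decomposition and an Alignment Trilemma, to generate testable comparisons. The audit shows that conclusions shift when capability, behavioral safety, and governance are measured separately. With n = 10, the association between capability and sustained adversarial robustness is statistically indeterminate (Pearson r = +0.232, p = 0.520); the apparent open-closed safety gap is modest, driven mainly by governance and disclosure, and sensitive to how a single borderline model is classified; attempt-budget results are protocol dependent. Because the public evidence uses heterogeneous protocols, the audit is diagnostic rather than rank-generating. The contribution is a shared vocabulary and evidence map to support dynamic evaluation, transparent source reporting, multi-attempt safety measurement, and auditable alignment practice.

Link: https://arxiv.org/abs/2606.30219
Authors: Buğra Alperen Uluırmak, Rifat Kurban
Affiliations: Erciyes University; Abdullah Gül University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: 67 pages, 8 figures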

Abstract:LLM evaluation and AI safety face a shared measurement problem: benchmark scores, reward-model signals, and reported safety metrics can improve while the latent properties they are meant to represent remain difficult to verify. This paper combines a hybrid survey - a systematic search paired with narrative synthesis and separately tracked grey evidence - with a conceptual framework and a structured ten-model audit. The synthesis spans eight evidence streams: benchmark validity, dynamic evaluation, LLM-as-judge reliability, safety evaluation, jailbreak/refusal robustness, reward hacking, mechanistic interpretability, and governance/auditability, covering 2018-2026 evaluation-safety measurement work. We introduce EvalSafetyGap as an organizing hypothesis for comparing evaluation-side and alignment-side proxy failures under optimization pressure, using Goodhart’s Law together with two constructs we develop here - an Instability Decomposition and an Alignment Trilemma - as tools for generating testable comparisons. The audit shows how conclusions shift when capability, behavioral safety, and governance are measured separately. In this sample (n = 10), the association between capability and sustained adversarial robustness is statistically indeterminate using the displayed Table 3 inputs (Pearson r = +0.232, p = 0.520), and the apparent open-closed safety gap is modest, driven mainly by governance and disclosure rather than behavioral robustness, and sensitive to how a single borderline model is classified; attempt-budget results are protocol dependent. Because the public evidence uses heterogeneous protocols, the audit is diagnostic rather than rank-generating. The contribution is a shared vocabulary and evidence map to support dynamic evaluation, transparent source reporting, multi-attempt safety measurement, and auditable alignment practice.
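
As a rough sanity check on the reported correlation, and assuming the p-value comes from the standard two-sided t-test for a Pearson coefficient (the abstract does not say which test was used), the test statistic works out as:

```latex
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}
  = \frac{0.232\,\sqrt{8}}{\sqrt{1-0.232^{2}}}
  \approx 0.675,
\qquad \mathrm{df} = n - 2 = 8 .
```

With 8 degrees of freedom the two-sided 5% critical value is about 2.306, so a statistic near 0.675 is far from significance, which is consistent with the reported p = 0.520.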

[NLP-22] Before Thinking Learn to Decide: Proactive Routing for Efficient Visual Reasoning

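【Quick Read】: This paper targets the low inference efficiency of large multimodal models on complex visual tasks, where the main bottleneck is the cost of long chains of thought. A promising remedy pairs a small draft model with a large target model and uses a routing signal to send each query to whichever model suits it, balancing efficiency and accuracy. Current methods, however, struggle to produce a reliable query-difficulty signal in multimodal settings: approaches based on post-hoc token probabilities perform poorly on multimodal inputs, while those relying on supervised fine-tuning are data-sensitive. Both route only after a complete output has been generated and ignore whether the target model can actually solve the routed instances. The paper proposes the Proactive Routing Paradigm (PRP), which makes early decisions by jointly evaluating both models: Draft Rating Learning (DRL) gives the draft model an internal confidence estimator, and Joint Rating Learning (JRL) predicts how well the target model can handle a given query. Routing therefore prioritizes the samples the target model excels at rather than the hardest ones, enabling instance-level decisions that substantially accelerate inference without compromising overall performance, as validated on multiple multimodal reasoning benchmarks.

Link: https://arxiv.org/abs/2606.30217
Authors: Yinan Zhou, Haokun Lin, Yichen Wu, Caifeng Shan, Zhenan Sun, Yuxin Chen, Teng Wang, Chen Ma, Li Zhu, Ying Shan
Affiliations: Tencent
Categories: Computation and Language (cs.CL)
Comments: 36 pages, 20 figures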

Abstract:Large multimodal models have achieved strong reasoning on complex visual tasks, but their inference efficiency is often restricted by long chains of thought. A promising solution is to pair a small draft model with a large target model, enabling cooperative inference employing a routing signal that adaptively routes queries to either the draft or target model based on their difficulties for optimal efficiency and accuracy. Yet, the remaining bottleneck is to establish a reliable query difficulty signal under multimodal settings. Existing approaches designed for language models either rely on post-hoc token probabilities, which fall short in multimodal scenarios, or depend on supervised fine-tuning, which is a data-sensitive strategy. Both paradigms perform routing only after a complete output, and ignore whether the target model can actually solve the routed instances. To address this, we propose PRP, a Proactive Routing Paradigm that enables early decision-making by jointly evaluating the competence of both the draft and target models. Our Draft Rating Learning (DRL) equips the draft model with an internal confidence estimator, while Joint Rating Learning (JRL) predicts how well the target model can handle a given query, thereby prioritizing the allocation of samples it excels at rather than the hardest ones. These ratings enable fine-grained, instance-level Proactive Routing and substantially accelerate inference without compromising overall performance. Extensive experiments across multiple multimodal reasoning benchmarks validate our effectiveness and efficiency.
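
The routing idea can be sketched as a small decision rule. This is an illustrative assumption, not the PRP implementation: the `Ratings` fields, the thresholds, and the `route` function are hypothetical names, and in the paper the two ratings would come from the learned DRL and JRL estimators rather than being supplied by hand.

```python
from dataclasses import dataclass


@dataclass
class Ratings:
    # Hypothetical inputs: in PRP these would be produced by learned estimators.
    draft_confidence: float   # draft model's self-rated chance of solving the query
    target_competence: float  # predicted chance that the target model solves it


def route(r: Ratings, draft_threshold: float = 0.7, min_gain: float = 0.1) -> str:
    """Pick a model before any answer is generated (thresholds are made up)."""
    if r.draft_confidence >= draft_threshold:
        return "draft"  # the cheap model is expected to be good enough
    # Escalate only when the target model is expected to do meaningfully better,
    # so hard queries that neither model can solve stay on the cheap path.
    if r.target_competence - r.draft_confidence >= min_gain:
        return "target"
    return "draft"


print(route(Ratings(0.9, 0.95)))  # confident draft
print(route(Ratings(0.3, 0.8)))   # target is expected to help
print(route(Ratings(0.3, 0.35)))  # hard for both models
```

The third case reflects the abstract's point that routing should favor samples the target model excels at rather than simply the hardest ones.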

[NLP-23] SHOVIR: A Benchmark for Evaluating Vision Shortcut Learning in Radiology Report Generation

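【Quick Read】: This paper addresses a core flaw in how Vision-Language Models (VLMs) are evaluated for Radiology Report Generation (RRG): existing protocols rely on report-level metrics such as lexical overlap or aggregate clinical correctness, which cannot test whether a generated diagnostic statement is actually grounded in pathological evidence visible in the image. Models can therefore score well by exploiting learned priors or spurious correlations, a failure mode termed vision shortcut. The proposed SHOVIR benchmark extends two spatially annotated chest X-ray datasets, MIMIC-CXR and PadChest-GR, with per-box CheXpert labels, and defines image-level and disease-level occlusion experiments that contrast performance on clean images with performance under localized, region-specific perturbations. This isolates two disease-level failure modes: direct shortcuts, where a finding persists after its visual evidence is removed, and contextual shortcuts, where detection degrades once co-occurring pathologies are occluded even though the target region remains intact. Benchmarking eight state-of-the-art VLMs shows that shortcut behavior varies substantially across architectures and datasets, and that the models with the highest report quality do not necessarily rank highest in spatial grounding. The findings expose a blind spot in current RRG evaluation and motivate region-aware assessment protocols.

Link: https://arxiv.org/abs/2606.30201
Authors: Filippo Ruffini, Marco Salmé, Rosa Sicilia, Valerio Guarrasi, Paolo Soda
Affiliations: Università Campus Bio-Medico di Roma; Umeå University; UniCamillus-Saint Camillus International University of Health Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: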

Abstract:Current evaluation protocols for Vision-Language Models (VLMs) in Radiology Report Generation (RRG) rely on report-level metrics that measure lexical overlap or aggregate clinical correctness. However, such metrics do not test whether individual diagnostic statements stem from the actual pathological evidence visible in the image. This allows models to achieve competitive scores by exploiting learned priors or spurious correlations, a failure mode we refer to as vision shortcut. We introduce SHOVIR, a benchmark for evaluating vision shortcut behavior in RRG. SHOVIR extends two spatially annotated chest X-ray datasets, MIMIC-CXR and PadChest-GR, with per-box CheXpert labels, and defines image-level and disease-level occlusion experiments that contrast baseline performance on clean images against localized, region-specific perturbations. Comparing predictions across these conditions isolates two failure modes at the disease-class level: direct shortcuts, where a finding persists after its visual evidence is removed, and contextual shortcuts, where detection degrades once co-occurring pathologies are occluded despite the target region remaining intact. Benchmarking eight state-of-the-art VLMs, we find that shortcut behavior varies substantially across architectures and datasets. Models achieving the highest baseline report quality do not necessarily rank highest in spatial grounding, revealing that clinically fluent generation can coexist with shallow reliance on visual evidence. These findings expose a blind spot in current RRG evaluation and motivate region-aware assessment protocols.
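
The two failure modes can be illustrated with a per-example check. This is a simplified sketch under stated assumptions: SHOVIR defines the shortcuts at the disease-class level over whole datasets, whereas the hypothetical `classify_shortcut` function below looks at a single (image, finding) pair with three boolean predictions.

```python
def classify_shortcut(pred_clean, pred_target_occluded, pred_context_occluded):
    """Label one (image, finding) pair from three boolean model predictions.

    pred_clean:            finding reported on the unmodified image
    pred_target_occluded:  finding reported after its own region is occluded
    pred_context_occluded: finding reported after co-occurring pathologies are
                           occluded while the target region stays intact
    """
    labels = []
    if pred_clean and pred_target_occluded:
        # The finding persists although its visual evidence was removed.
        labels.append("direct")
    if pred_clean and not pred_context_occluded:
        # The finding is lost although its own evidence is still visible.
        labels.append("contextual")
    return labels


print(classify_shortcut(True, True, True))    # direct shortcut only
print(classify_shortcut(True, False, False))  # contextual shortcut only
print(classify_shortcut(True, False, True))   # grounded behavior, no shortcut
```

A grounded model should drop the finding when its region is occluded and keep it when only the surrounding context is occluded, which is the third case.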

[NLP-24] Forewarned is Forearmed: When Non-Sequential Embedding Turns Into an Anomaly Detector LREC2026

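【Quick Read】: This paper analyzes non-sequential multimodal sentence-level embeddings, focusing on the SONAR model, and addresses the reliability problems that arise when decoding from them. It shows that certain embedding dimensions are sensitive to perturbations and can serve as indicators of decoding anomalies. Leveraging the consistency between successive encoding and decoding, the authors build an accurate anomaly detector, and they further explore modifying the dimensions of interest in an attempt to correct anomalies. The work underscores the importance of analyzing the embeddings themselves to improve the reliability of multimodal representations.

Link: https://arxiv.org/abs/2606.30196
Authors: Elys Allesiardo, Antoine Caubrière, Valentin Vielzeuf
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted for presentation at LREC 2026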

Abstract:This paper offers an in-depth analysis of non-sequential multimodal sentence-level embeddings, with a particular focus on the SONAR model. We demonstrate that certain embedding dimensions are sensitive to perturbations and can serve as indicators of decoding anomalies. By leveraging the consistency between successive encoding and decoding, we successfully build an accurate detector. Additionally, we explore modifying specific dimensions of interest to attempt to correct them. This work underscores the importance of understanding and analyzing the embeddings themselves to enhance the reliability of multimodal representations.

[NLP-25] DAIN: Dynamic Agent-Based Interaction Network for Efficient and Collaborative Multimodal Reasoning

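【Quick Read】: This paper addresses the difficulty that current multimodal fusion methods, particularly those based on static Mixture-of-Experts (MoE) architectures, have in providing the adaptive and efficient collaborative reasoning that complex real-world applications require. It proposes the Dynamic Agent-based Interaction Network (DAIN), which recasts multimodal fusion as a dynamic multi-agent collaborative process. A context-aware Meta-Controller dynamically schedules sparse activation of specialized interaction agents and orchestrates compressed inter-agent communication to build consensus. A multi-objective loss jointly optimizes task accuracy, agent specialization, and operational efficiency through sparse activation and communication regularization. Experiments on five benchmarks (ADNI, MIMIC-IV, MM-IMDB, CMU-MOSI, and ENRICO) show improvements over existing methods, including a 2.6% accuracy gain on ADNI, and ablations confirm the importance of both dynamic scheduling and agent communication. DAIN also improves interpretability by exposing context-dependent agent roles and collaboration patterns, while sample-wise sparse activation keeps computation efficient.

Link: https://arxiv.org/abs/2606.30189
Authors: Xinxin Chen, Yuchen Li, Zihan Wang, Haoyu Zhang, Ruixin Liu, Mingyuan Zhao
Affiliations: University of Chinese Academy of Sciences
Categories: Computation and Language (cs.CL)
Comments: 19 pages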

Abstract:Current multimodal fusion approaches, particularly those based on static Mixture-of-Experts (MoE) architectures, often struggle to provide the adaptive and efficient collaborative reasoning required by complex real-world applications. We introduce the Dynamic Agent-based Interaction Network (DAIN), which reconceptualizes multimodal fusion as a dynamic, multi-agent collaborative process. DAIN employs a context-aware Meta-Controller that dynamically schedules sparse activation of specialized interaction agents and orchestrates compressed inter-agent communication for consensus-building. The framework is guided by a multi-objective loss function that jointly optimizes task accuracy, agent specialization, and operational efficiency through sparse activation and communication regularization. Comprehensive evaluations across five diverse benchmarks – ADNI, MIMIC-IV, MM-IMDB, CMU-MOSI, and ENRICO – establish DAIN as a new state-of-the-art, delivering significant performance improvements including a 2.6% accuracy gain on ADNI. Ablation studies verify the critical roles of both dynamic scheduling and agent communication. Furthermore, DAIN offers enhanced interpretability by exposing context-dependent agent roles and collaboration patterns while maintaining computational efficiency through sample-wise sparse agent activation. Our work demonstrates the promise of dynamic, agent-based paradigms for multimodal reasoning.

[NLP-26] CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph

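【Quick Read】: As large language models (LLMs) evolve, demands on data scale and quality keep rising, and different training stages impose increasingly tailored data requirements. Existing corpus construction pipelines produce only flat, undifferentiated document collections that lack systematic knowledge organization. This paper presents Cortex, which the authors describe as, to their knowledge, the first framework to move web-scale corpus construction from flat document filtering to structured knowledge organization. Its core is the Ontological Corpus Graph (OCG), a three-layer heterogeneous structure: a quality-refined content layer, a hierarchical lightweight ontology layer that evolves automatically under LLM guidance, and a cross-domain alignment layer that supports inter-domain association at arbitrary taxonomic resolution. The OCG is used to synthesize CortexBench, a cross-domain search-and-reasoning benchmark; evaluation on eight frontier LLMs validates the effectiveness of quality refinement, domain organization, and cross-domain data synthesis. The authors plan to release the complete codebase, a 24.14B-token refined corpus with its OCG, and CortexBench.

Link: https://arxiv.org/abs/2606.30175
Authors: Chengtao Gan, Xiaoke Guo, Yushan Zhu, Zhaoyan Gong, Zhiqiang Liu, Songze Li, Huajun Chen, Wen Zhang
Affiliations: Zhejiang University; JIUTIAN Research, Beijing, China
Categories: Computation and Language (cs.CL)
Comments: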

Abstract:The continuous evolution of large language models drives escalating demands on data scale and quality, and as different training stages impose increasingly tailored data requirements, systematic organization of high-quality corpora becomes indispensable. Existing corpus construction pipelines confine the resulting corpora to flat, undifferentiated document collections, universally lacking systematic knowledge organization. We present Cortex, to our knowledge the first framework that elevates web-scale corpus construction from flat document filtering to structured knowledge organization through an Ontological Corpus Graph (OCG), a three-layer heterogeneous structure unifying a quality-refined content layer, a hierarchical lightweight ontology layer via LLM-driven automated evolution, and a cross-domain alignment layer enabling inter-domain association at arbitrary taxonomic resolution. Comprehensive experiments confirm the effectiveness of Cortex. In particular, we leverage the OCG to synthesize CortexBench, a cross-domain search-and-reasoning benchmark whose evaluation across eight frontier LLMs validates the effectiveness of quality refinement, domain organization, and cross-domain data synthesis. We will publicly release the complete codebase, a 24.14B-token refined corpus with its OCG, and CortexBench.

[NLP-27] Estimating Grammatical Gender Directions in Contextual Embeddings under Controlled and Natural Contexts

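【Quick Read】: This paper addresses the way contextual language models conflate grammatical gender with social semantic bias in gendered languages such as Spanish. Existing debiasing methods operate only on static word embeddings, leaving this two-dimensional gender disentanglement unexplored for contextual representations. The paper makes a first attempt to disentangle grammatical gender from semantic contamination in contextual embeddings. Using both controlled templates and natural Wikipedia contexts, it builds balanced datasets of inanimate nouns, and it designs a framework with centroid, Support Vector Machine (SVM), and Linear Discriminant Analysis (LDA) gender-direction estimators together with contamination-aware weighting strategies. A set of dual-objective evaluation metrics balances suppressing grammatical gender leakage on inanimate nouns against preserving semantic gender distinctions for occupation terms. Results show that unweighted controlled contexts yield the purest grammatical gender direction and that the centroid estimator outperforms the discriminative baselines.

Link: https://arxiv.org/abs/2606.30152
Authors: Huanping Xiao, Yingji Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages, 1 figure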

Abstract:Contextual language models conflate grammatical gender and social semantic bias in gendered languages such as Spanish. Existing gender debiasing approaches only operate on static word embeddings, leaving contextual representations unexplored for this two-dimensional gender disentanglement. To address this issue, we make the first attempt to disentangle grammatical gender from semantic contamination for contextual embeddings. We construct both controlled templates and natural Wikipedia contexts to build balanced datasets of inanimate nouns, and design a framework equipped with centroid, Support Vector Machine (SVM) and Linear Discriminant Analysis (LDA) gender direction estimators as well as contamination-aware weighting strategies. A set of dual-objective evaluation metrics is proposed to balance the suppression of grammatical gender leakage on inanimate nouns and the preservation of semantic gender distinctions for occupation terms. The results reveal that unweighted controlled contexts yield the purest grammatical gender direction, and the centroid estimator achieves better performance than discriminative baselines.
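
A centroid estimator of this kind is commonly computed as the normalized difference between two class means. The sketch below shows that generic construction on invented three-dimensional vectors; the function names and toy data are assumptions, and the paper's estimator (including any weighting) may differ in detail.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def centroid_direction(masc, fem):
    """Unit vector pointing from the feminine centroid to the masculine centroid."""
    diff = [a - b for a, b in zip(centroid(masc), centroid(fem))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0:
        raise ValueError("centroids coincide; direction is undefined")
    return [x / norm for x in diff]


def project(vec, direction):
    """Signed length of vec along the (unit) gender direction."""
    return sum(a * b for a, b in zip(vec, direction))


# Toy embeddings, invented for illustration: only the first axis encodes gender.
masc = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
fem = [[-1.0, 0.0, 2.0], [-3.0, 0.0, 2.0]]
direction = centroid_direction(masc, fem)
print(direction)
print(project([0.5, 9.0, 9.0], direction))
```

Projecting an embedding onto this direction gives a scalar score that can be used to measure how much grammatical gender information leaks into a representation.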

[NLP-28] Does Verbose Chain-of-Thought Really Help? In-Distribution Evidence that Content Not Length Matters ICML

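【Quick Read】: This paper examines a contested question about chain-of-thought (CoT) prompting: do intermediate steps help because they carry useful semantic content, or merely because conditioning on more tokens buys extra computation before the model commits to an answer? It brings two complementary lines of evidence. First, in distribution: each model is sampled repeatedly on the same question, and a shorter and a longer generation that follow the same reasoning plan are paired, so nothing is rewritten and both traces are genuinely in-distribution. Across 25 models the extra tokens leave accuracy essentially unchanged for every independently trained reasoner, and a blind analysis shows that whatever gain exists elsewhere tracks validation and checking content rather than verbosity per se. Second, as a controlled intervention, a dual-validator design across four targets and eight benchmarks compares traces with the same semantic content (the same facts, operations, and intermediate values, verified through directed acyclic graph equivalence) but different verbosity. Verbose traces do improve accuracy (25 of 32 benchmark-target cells are positive under at least one validator), but the effects are modest (typically 1 to 4 points) and depend on the quality of the verbose prose, not merely its length. Under maximum numerical redaction the effect is amplified (median 3.24x across four arithmetic benchmarks), and length-matched non-reasoning filler recovers none of it. Both lines converge: what matters is the reasoning and validation content the extra tokens carry, not how many there are, a picture that neither a pure forward-pass-compute account nor a pure semantic-content account fully explains.

Link: https://arxiv.org/abs/2606.30128
Authors: Wenlong Wang, Fergal Reid
Affiliations: Fin AI Research
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: ICML Workshop on Efficient Multimodal Question Answering (EMM-QA)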

Abstract:Chain-of-thought (CoT) prompting improves LLM reasoning, but the source is contested: do the intermediate steps help because they carry useful semantic content, or because conditioning on more tokens buys extra computation before the model commits to an answer? We bring two lines of evidence to bear. First, in distribution: we repeatedly sample each model on the same question and pair a shorter with a longer of its own natural generations that follow the same reasoning plan, so nothing is rewritten and both traces are genuinely in-distribution. Across 25 models the extra tokens leave accuracy essentially unchanged for every independently-trained reasoner, and a blind analysis of the surplus tokens shows that what gain exists elsewhere tracks validation- and checking-content, not verbosity per se. Second, as a controlled intervention, we ask whether two traces expressing the same semantic content (the same facts, operations, and intermediate values, verified through directed acyclic graph equivalence) produce different outcomes when one is more verbose, using a dual-validator design across four targets and eight benchmarks with number-redacted completion and stratified bootstrap confidence intervals. Verbose traces do improve accuracy (25 of 32 benchmark-target cells are positive under at least one validator), but the effects are modest (typically 1-4 points) and depend on the quality of the verbose prose, not merely its length. Under maximum numerical redaction the effect is amplified (median 3.24x across four arithmetic benchmarks), and length-matched non-reasoning filler recovers none of it. Both lines converge: what matters is what the extra tokens do (the reasoning and validation content they carry), not how many there are, a picture neither a pure forward-pass-compute nor a pure semantic-content account fully explains.

[NLP-29] Information Dynamics of Language Communication

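【Quick Read】: This paper addresses an underdeveloped problem in computational linguistics: quantifying how meaning propagates directionally through communicative exchanges. It proposes an information-theoretic framework that uses large language models as probabilistic estimators of natural language to compute two measures: semantic transfer entropy (STE), which captures directed predictive influence between interlocutors, and semantic partial information decomposition (SPID), which resolves how multiple sources jointly shape a target's language through redundant, unique, and synergistic contributions. Across four experiments, the framework detects reduced information flow in cognitively rigid dialogue, captures the dominant role of persuaders in shaping discourse, distinguishes high- from low-quality psychotherapy by the directionality of therapist-client information exchange, and reveals synergistic premise contributions in argumentative essays. It offers a new way to study information dynamics in digital discourse, pedagogical interactions, clinical dialogues, and related domains.

Link: https://arxiv.org/abs/2606.30096
Authors: Leonardo S. Goodall, Andrea I. Luppi, Pedro A. M. Mediano
Affiliations: University of Oxford; University of Cambridge; McGill University; Imperial College London; OpenAI
Categories: Computation and Language (cs.CL); Information Theory (cs.IT)
Comments: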

Abstract:Quantifying how meaning propagates through communicative exchanges remains underdeveloped in computational linguistics. Here we introduce an information-theoretic framework that quantifies the directed flow of semantic content between interlocutors and decomposes multi-source contributions into redundant, unique, and synergistic components. Our approach leverages large language models as probabilistic estimators of natural language to compute two measures: semantic transfer entropy (STE), which captures directed predictive influence between speakers, and semantic partial information decomposition (SPID), which resolves how multiple sources jointly shape a target’s language. Across four experiments we show that the framework detects reduced information flow in cognitively rigid dialogue, captures the dominant role of persuaders in shaping discourse, distinguishes high- from low-quality psychotherapy by the directionality of therapist-client information exchange, and reveals synergistic premise contributions in argumentative essays. This framework opens new avenues for studying information dynamics in digital discourse, pedagogical interactions, clinical dialogues, and any domain in which the structure of linguistic exchange is of research relevance.
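
For orientation, classical transfer entropy from a source process X to a target process Y can be written as below. This is the textbook definition, not necessarily the exact form of STE used in the paper; the assumption here is that a language model supplies the two conditional probabilities, with x and y standing for the utterances of two speakers.

```latex
T_{X \to Y}
  = \mathbb{E}\!\left[ \log \frac{p\left(y_t \mid y_{<t},\, x_{<t}\right)}{p\left(y_t \mid y_{<t}\right)} \right]
  = I\!\left(Y_t ;\, X_{<t} \mid Y_{<t}\right)
```

The quantity is zero when the source's history adds no predictive information about the target beyond the target's own history, and it is asymmetric in X and Y, which is what makes it a directed measure.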

[NLP-30] Not-quite-human tastes: the stylized omnivorousness of LLM survey surrogates

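【Quick Read】: This paper examines how accurately and reliably large language models (LLMs) simulate human cultural consumption preferences when used as silicon surrogates for real survey respondents. The question is increasingly urgent as market research companies offer 'synthetic' survey panels and standard survey data become contaminated by LLM-generated responses. Models from OpenAI, Anthropic, and DeepSeek are each used to produce 277,470 (30x9249) silicon surrogates of respondents to the Survey of Public Participation in the Arts (SPPA), extending prior work on the algorithmic fidelity and alignment of silicon sampling to the domain of cultural taste. Three findings stand out: (1) silicon samples show a systematic positive bias toward liking, which inflates ecological estimates of tastes; (2) the complex relational structure of real tastes is completely lost; and (3) very little of the known alignment between tastes and social space is preserved, with silicon samples attenuating age-taste associations, resurrecting anachronistic class-taste associations, and caricaturing gender- and race-taste associations. The study highlights the limits of LLM-based silicon sampling for research on cultural consumption and cautions against its uncritical use in social science.

Link: https://arxiv.org/abs/2606.30085
Authors: Xiangyu Ma, Mengmi Zhang, Shannon Ang, Minne Chen
Affiliations: Nanyang Technological University
Categories: Computation and Language (cs.CL); General Economics (econ.GN)
Comments: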

Abstract:Large-language models have proven to be remarkable if inconsistent parrots of public attitudes and opinions. The extent to which LLMs are able to produce reasonable approximations of cultural taste remains an open empirical question that becomes more urgent by the day, with market research companies already offering provisional 'synthetic' survey panels and the contamination of standard survey data from LLM-generated responses. In this study, we build on past work on silicon sampling by extending considerations of its algorithmic fidelity and alignment to the domain of cultural consumption. We use large-language models from OpenAI, Anthropic, and DeepSeek to each produce 277,470 (30x9249) silicon surrogates of survey respondents from the Survey of Public Participation in the Arts (SPPA). We find these silicon surrogates' tastes to be highly stylized facsimiles of human tastes. (1) Silicon samples have a systematic positive bias for liking, resulting in inflated ecological estimates of tastes. The individual-level bias of silicon samples is not well explained by the WEIRD bias often discussed in the literature. (2) The complex relationality in real taste structures is completely lost among silicon samples. (3) Finally, very little of the known cultural alignment between tastes and social space is preserved. Silicon samples attenuate age-taste associations, resurrect anachronistic class-taste associations, and caricature gender- and race-taste associations.

[NLP-31] Little Brains Big Feats: Exploring Compact Language Models ECML KDD2026

【Quick Read】: This paper investigates how small language models (SLMs) perform in the generation stage of a Retrieval-Augmented Generation (RAG) system, with attention to feasibility on resource-constrained devices. Although large language models dominate current research, small models remain highly relevant across many domains yet receive far less attention. The models are benchmarked on both open-source and proprietary datasets covering diverse subject areas and question types. The findings show that a RAG system built on small language models can run directly on-device, without any GPU hardware, within a reasonable time.

Link: https://arxiv.org/abs/2606.30062
Authors: Dari Baturova, Elena Bruches, Ivan Chernov, Roman Derunets, Arsenii Fomin, Andrey Kostin
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to ECML PKDD 2026, Applied Data Science track. Author preprint; the definitive version will appear in the proceedings of ECML PKDD 2026, Springer LNCS

Abstract:While large language models have been dominating the research landscape recently, small language models remain highly relevant across various domains; yet, they receive far less attention. In this study, we investigate how smaller language models perform during the generation stage within a Retrieval-Augmented Generation (RAG) system. To benchmark these models effectively, we utilised both open-source and proprietary datasets covering diverse subject areas and question types. Our findings demonstrate that a RAG system with small language models can be executed directly on-device without requiring any GPU hardware within a reasonable time. The experimental code and links to the supplementary materials can be accessed through the GitHub repository: this https URL.

[NLP-32] Parametric Skills

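【Quick Read】: This paper addresses the limited ability of large language models (LLMs) to comprehend and follow textual skill instructions in complex, long-context scenarios, where key instructions are hard to locate and adhere to. It proposes ParametricSkills, a framework that converts free-form textual skills into parameters (LoRA adapters) at test time, enabling context-free skill exploitation. The authors first construct a large-scale, high-quality skill library and use OpenCode to synthesize single-turn and multi-turn skill-exploitation trajectories around these skills, then train a hypernetwork that takes textual skills as input and outputs LoRA adapters. On six complex software engineering (SWE) subtasks, ParametricSkills outperforms in-context learning by an average of 6.44 points as judged by DeepSeek-V4-Flash, with significantly higher BERT Score and F1 as well. Because parametric skills are inherently accumulative, the framework also offers a preliminary but promising route toward test-time continual learning.

Link: https://arxiv.org/abs/2606.30015
Authors: Xuan Zhao, Haonan He, Qingyu Yang, Minglei Li, Jingqi Ye, Zelin Tan, Bo Wan, Peng Ye
Affiliations: Shanghai Artificial Intelligence Laboratory; University of Science and Technology of China; KTH Royal Institute of Technology; Fudan University; The Chinese University of Hong Kong
Categories: Computation and Language (cs.CL)
Comments: Preprint, Under Review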

Abstract:Since intelligence fundamentally relies on efficient skill acquisition (Chollet, 2019), the ability to leverage skills is critical. For LLMs, skills, manually authored or extracted from task trajectories, are textual recipes encoding mature problem-solving experience and are critical to agentic capabilities. Despite widespread deployment, their utility is limited by the model's ability to comprehend and follow skill instructions, especially under complex and long-context scenarios, where key instructions are difficult to locate and adhere to. To address this limitation, we propose ParametricSkills, a framework that can convert free-form textual skills into parameters at test time, enabling context-free skill exploitation. Specifically, we first construct a large-scale, high-quality skill library, and synthesize single-turn and multi-turn skill exploitation trajectories built around these skills with OpenCode. Using these data, we then train a hypernetwork that parameterizes both the skill content and the test-time exploitation methodology by receiving textual skills and converting them into LoRA adapters. Experimental results on six complex software engineering (SWE) subtasks demonstrate that the proposed ParametricSkills outperforms in-context learning by an average of 6.44 points as judged by DeepSeek-V4-Flash, while also achieving significantly higher BERT Score and F1 score, confirming its effectiveness. Beyond performance, we further find that parametric skills, being inherently accumulative, offer a preliminary yet promising avenue toward test-time continual learning.
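
For readers unfamiliar with LoRA, the standard low-rank update that such adapters apply to a frozen weight matrix is shown below. The symbols $H_\phi$ (the hypernetwork) and $s$ (the skill text) are illustrative notation introduced here, not taken from the paper, which may apply adapters to different layers or use a different scaling.

```latex
h = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k),
\qquad (A, B) = H_\phi(s)
```

Because only the small matrices A and B depend on the skill, a new skill can in principle be installed by one forward pass of the hypernetwork while the base weights stay untouched.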

[NLP-33] Node-to-Neighborhood Semantic Consistency: Text-Topology Alignment for TAGs Anomaly Detection

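【Quick Read】: This paper addresses anomaly detection on text-attributed graphs (TAGs), focusing on the failure of existing methods to capture the correspondence between node text semantics and graph topology. GNN-based methods model structural patterns well but struggle with fine-grained textual semantics; methods that integrate LLMs with graphs improve semantic understanding but do not fully capture topological relationships among neighboring nodes. Both overlook the consistency between a node's semantics and its neighborhood, which limits their ability to identify nodes whose semantics are inconsistent with their local structure. The paper formalizes TAG anomaly detection as a node-to-neighborhood semantic consistency problem and proposes N2NSC (Node-to-Neighborhood Semantic Consistency), a framework that models the correspondence between graph topology and textual semantics through two complementary fusion paths, allowing the LLM to use both textual and structural neighborhood information. Experiments on eight datasets show that N2NSC consistently outperforms current state-of-the-art methods.

Link: https://arxiv.org/abs/2606.30009
Authors: Bochen Lin, Jianxiang Yu, Jiayi Wu, Lin Qi, Huang Lu, Xiang Li
Affiliations: East China Normal University; WeChat, Tencent
Categories: Computation and Language (cs.CL)
Comments: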

Abstract:Graph anomaly detection (GAD) on text-attributed graphs (TAGs) is vital for applications such as fraud detection and academic integrity verification. Existing approaches generally fall into two paradigms. GNN-based methods effectively capture structural patterns but struggle to capture fine-grained textual semantics. Methods integrating LLMs with graphs improve semantic understanding yet fail to fully comprehend topological relationships among neighboring nodes. Moreover, both paradigms overlook the correspondence between textual semantics and graph topological relationships, limiting their ability to identify nodes whose semantics are inconsistent with their neighborhoods. In this paper, we formalize TAG anomaly detection as a node-to-neighborhood semantic consistency problem, where anomalies may arise from either textual semantic mismatch or topological deviation between a node and its neighbors. We propose N2NSC (Node-to-Neighborhood Semantic Consistency), a framework that captures the correspondence between graph topology and textual semantics through two complementary fusion paths. The two pathways work synergistically, enabling the LLM to fully leverage both textual and structural neighborhood information for anomaly detection. Extensive experiments across eight datasets demonstrate that N2NSC consistently outperforms current state-of-the-art methods.

[NLP-34] LLM Agents Are Latent Context Managers: Eliciting Self-Managed Context via a Proprioceptive Dashboard

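【Quick Read】: This paper addresses the bottleneck that long-horizon tool agents face as their accumulated context approaches the limit of the context window. Existing systems place context management under agent or system control, but they either learn a compression policy that may discard evidence or manage context in a layer the agent never sees. The authors argue that both leave a more basic gap: frontier language models are proprioceptively blind to their own context, since from the prompt alone they cannot tell how large, how old, or how frequently used each block is, which are the signals a keep-or-drop decision needs. They introduce VISTA (Visible Internal State for Tool Agents), a training-free, model-agnostic layer that represents working memory as typed, addressable blocks, surfaces a runtime dashboard of per-block token usage, recency, and access history, and archives blocks as recoverable full-fidelity payloads. On LOCA-Bench, BrowseComp-Plus, and GAIA, the same untrained interface transfers across million-, 100K-, and 10K-scale trajectories. On LOCA-Bench it improves four backbones and lifts Gemini-3-Flash from 22.7% to 50.7%; the gain grows with context pressure and transfers across backbones. Ablations confirm that the dashboard itself matters beyond the archive and recovery tools. The key idea is that competent context management is already latent in capable models and can be elicited by an interface that exposes this state, rather than by a learned policy.

Link: https://arxiv.org/abs/2606.30005
Authors: Binyan Xu, Haitao Li, Kehuan Zhang
Affiliations: Tencent
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 8 figures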

Abstract:Long-horizon tool agents are bottlenecked by how their context grows toward the limits of the context window. Recent systems make context management agent- or system-controlled, but they either learn a compression policy that discards evidence or manage context in a layer the agent never sees. We argue both leave a more basic gap unaddressed. Frontier language models are proprioceptively blind to their own context. From the prompt alone they cannot see how large, how old, or how used each block is, the signals a keep-or-drop decision needs. We hypothesize that competent context management is already latent in capable models, and that what is missing is not a learned policy but an interface exposing this state. We introduce VISTA (Visible Internal State for Tool Agents), a training-free, model-agnostic layer that represents working memory as typed, addressable blocks, surfaces a runtime dashboard of per-block token usage, recency, and access history, and archives blocks as recoverable full-fidelity payloads. On LOCA-Bench, BrowseComp-Plus, and GAIA, the same untrained interface transfers across million-, 100K-, and 10K-scale trajectories. On LOCA-Bench it improves four backbones and lifts Gemini-3-Flash from 22.7 to 50.7%. The lift grows with context pressure and transfers across backbones. Ablations further confirm that the dashboard matters beyond archive and recovery tools.
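
To make the dashboard idea concrete, here is a minimal sketch of typed, addressable blocks and a text view of their token usage, age, and access history. Everything in it is a hypothetical illustration: the `Block` fields, the `render_dashboard` function, and the output format are assumptions, not VISTA's actual interface.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    # Hypothetical schema for one unit of working memory.
    block_id: str
    kind: str                 # e.g. "tool_output" or "note"
    tokens: int
    created_step: int
    access_steps: List[int] = field(default_factory=list)


def render_dashboard(blocks, current_step, window_limit):
    """Return a plain-text summary the agent could read inside its prompt."""
    used = sum(b.tokens for b in blocks)
    lines = [f"context: {used}/{window_limit} tokens"]
    # Largest blocks first, since they are the most likely archive candidates.
    for b in sorted(blocks, key=lambda blk: blk.tokens, reverse=True):
        age = current_step - b.created_step
        last = b.access_steps[-1] if b.access_steps else "never"
        lines.append(
            f"{b.block_id} [{b.kind}] tokens={b.tokens} age={age} "
            f"accesses={len(b.access_steps)} last_access={last}"
        )
    return "\n".join(lines)


demo_blocks = [
    Block("b1", "tool_output", 1200, created_step=2, access_steps=[3, 7]),
    Block("b2", "note", 300, created_step=6),
]
demo_view = render_dashboard(demo_blocks, current_step=8, window_limit=2000)
print(demo_view)
```

The point of such a view is that the keep-or-drop signals (size, age, usage) become visible in the prompt, so the decision can be left to the model rather than to a learned compression policy.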

[NLP-35] Are We Measuring Strategy or Phrasing? The Gap Between Surface- and Approach-Level Diversity in LLM Math Reasoning

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在数学推理中表面多样性与深层策略多样性之间的鸿沟问题。现有主流多样性度量方法仅捕捉解题过程的表层差异,无法反映不同正确解法间策略层面的根本区别。为此,论文提出“方法级多样性”(approach-level diversity)这一新概念,即同一问题的不同正确解法在推理策略上的差异。通过构建人类校准的LLM评判框架,研究发现以往的多样性指标无法有效代理方法级多样性,且这种偏差会传导至多样性的强化学习(RLVR)范式中——尽管目标度量得以保留,但真正的方法多样性却持续下降。进一步研究表明,方法多样化的候选解集可在测试时提升模型的缩放性能,然而在训练中直接优化基于LLM评判的多样性奖励会导致策略利用评判器的特定偏好,而非真正拓展推理路径,从而暴露了对方法级多样性进行直接优化仍是一个未解难题。综上,该工作首次系统引入方法级多样性概念,揭示了表层信号与深层策略信号之间存在的系统性偏差,为实现更接近人类、真正多元化的数学推理能力提供了关键方向。

链接: https://arxiv.org/abs/2606.29985
作者: Sangmook Lee,Minbeom Kim,Jeonghye Kim,Dohyung Kim,Sojeong Rhee,Kyomin Jung
机构: 未知
类目: Computation and Language (cs.CL)
备注: 27 pages, 6 figures

Abstract:Diversity in LLM mathematical reasoning is critical for exploration, but common diversity metrics mostly capture surface-level variation rather than differences in how a problem is solved. We address this gap by introducing approach-level diversity: variation in strategies across correct solutions to the same problem. Using a human-calibrated LLM judge framework, we show that prior diversity measures are unreliable proxies for approach-level diversity, and this mismatch carries over to diversity-aware RLVR, where target metrics are preserved while approach-level diversity declines. Investigating when approach-level diversity helps and whether it can be directly induced, we find that approach-diverse candidate sets improve test-time scaling. However, optimizing an LLM judge diversity reward during training causes the policy to exploit judge-specific preferences rather than broaden its approaches, leaving direct optimization of approach-level diversity as an open problem. Together, our work introduces the notion of approach-level diversity and uncovers a systematic divergence between surface- and approach-level signals, marking a step toward LLMs that reason in genuinely diverse, human-like ways.

[NLP-36] IHDec: Divergence-Steered Contrastive Decoding for Securing Multi-Turn Instruction Hierarchies

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理多源输入时,因角色优先级差异导致的指令层级(Instruction Hierarchy, IH)失效问题,尤其是在多轮对话中出现下位角色指令压制上位角色指令的“角色影响倒置”现象。现有解决方案多局限于单轮场景且依赖昂贵的微调训练,难以适应复杂交互环境。其核心解决方案为IHDec(Instruction Hierarchy-steered Decoding),通过引入Jensen-Shannon散度(Jensen-Shannon Divergence, JSD)框架,实现对词元级别指令层级冲突的自动检测,并基于对比解码(contrastive decoding)动态抑制与高层级指令不一致的低优先级输入响应,从而在无需任何训练的情况下有效维持指令层级权威性。实验表明,IHDec在多轮冲突场景下优于基于训练的基线方法,同时保持了模型的通用生成质量,并显著增强对对抗性提示注入的安全防御能力,且与更大规模模型呈现稳健的协同扩展效应。

链接: https://arxiv.org/abs/2606.29960
作者: Nicole Geumheon Liu,Haeun Jang,Yonghyun Jun,Hwanhee Lee
机构: Chung-Ang University (中央大学)
类目: Computation and Language (cs.CL)
备注:

Abstract:Large Language Models (LLMs) often fail to maintain instruction hierarchies (IH) when processing multi-source inputs with varying role-level priorities, paradoxically adhering to lower-priority directives during conflicts. While existing defenses mitigate this issue, they are largely restricted to single-turn scenarios and require expensive fine-tuning. In this paper, we formalize this failure mode in multi-turn contexts via a Jensen-Shannon Divergence (JSD) framework, uncovering a pervasive role-influence inversion phenomenon where subordinate inputs override superior roles. To rectify this without training, we propose IHDec (Instruction Hierarchy-steered Decoding). IHDec leverages JSD to automatically detect token-level hierarchy violations and dynamically executes contrastive decoding to suppress misaligned subordinate roles. Extensive evaluations demonstrate that IHDec outperforms training-based baselines in multi-turn conflicts while fully preserving general response quality. Furthermore, IHDec strengthens safety against adversarial prompt injections and exhibits a robust scaling synergy with larger models. The Code is available at this https URL
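The abstract names two ingredients, a Jensen-Shannon divergence signal and contrastive decoding. A minimal sketch of how such a gate might look for a single decoding step follows. It is not the authors' implementation: the function names, the `threshold` and `alpha` parameters, and the choice of which two next-token distributions are compared are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def contrastive_step(logits_main, logits_sub, threshold=0.05, alpha=1.0):
    """Hypothetical gate: leave the logits alone when the two contexts agree,
    otherwise push away from the subordinate-context distribution."""
    p, q = softmax(logits_main), softmax(logits_sub)
    if jsd(p, q) < threshold:
        return list(logits_main)
    # Standard contrastive-decoding form: amplify main, subtract the contrast.
    return [(1 + alpha) * a - alpha * b for a, b in zip(logits_main, logits_sub)]
```

In this sketch `logits_main` and `logits_sub` stand for next-token logits obtained under two different views of the conversation; how IHDec actually builds those views is described in the paper, not in the abstract.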

[NLP-37] LatentRevise: Learning from Zero-Hit Reasoning

【速读】: 该论文旨在解决可验证奖励强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)面临的采样瓶颈问题,即在存在“硬提示”(hard prompts)时,正确轨迹的生成概率极低,导致在有限采样预算内难以捕获有效样本,进而使策略更新缺乏足够信号。其核心挑战在于“零命中提示”(zero-hit prompts),即在多次采样中均未生成正确路径的提示;它们构成RLVR的采样前沿,此时新推理行为最具价值却最难以被采样到。针对此问题,论文提出了一种名为LatentRevise的一阶潜在空间修正方法,其关键创新在于利用失败的推理轨迹与标准答案之间的对比,通过两个互补的梯度优化推理前缀的输入嵌入:一方面将前缀从失败的延续方向推开,另一方面向标准答案方向拉近。该优化过程被约束于模型词汇表嵌入的凸包内,确保每一步更新都指向真实词元嵌入而非任意特征方向。实验表明,经修正后的前缀能够生成更长、具备自我反思能力且最终达成正确答案的推理路径,将其作为训练数据后,监督微调(Supervised Fine-Tuning, SFT)和RLVR在数学推理基准上的性能均优于标准基线。

链接: https://arxiv.org/abs/2606.29938
作者: Yiqiu Guo,Xueting Han,Qi Jia,Guangtao Zhai,Jing Bai
机构: Fudan University(复旦大学); Microsoft Research Asia(微软亚洲研究院); Shanghai AI Laboratory(上海人工智能实验室); Shanghai Jiao Tong University(上海交通大学)
类目: Computation and Language (cs.CL)
备注:

Abstract:Reinforcement learning with verifiable rewards (RLVR) is bottlenecked by hard prompts on which correct trajectories have low probability, so sampling misses them within a practical budget and leaves the policy update with little useful signal. We frame such zero-hit prompts as RLVR’s sampling frontier, where new reasoning behavior is most valuable yet least likely to be sampled. Importantly, failed rollouts can be informative: they expose where the model’s reasoning went wrong. We introduce LatentRevise, a first-order latent revision method that recovers training signal for this zero-hit regime. Given a failed rollout and the gold answer as an anchor, LatentRevise optimizes the input embeddings of its reasoning prefix under two complementary gradients, moving the prefix away from the failed continuation and toward the gold answer. The optimization is constrained to the convex hull of the model’s vocabulary embeddings, so each update moves the latent toward a real token embedding rather than an arbitrary feature direction. We find that continuations from the revised prefix lengthen, exhibit self-reflection, and reach correct answers missed by the original rollouts. Used as training data, these trajectories improve SFT and RLVR on math benchmarks over standard baselines.
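One standard way to keep an iterate inside a convex hull while moving it toward a vertex is a Frank-Wolfe-style step. The sketch below illustrates that idea only; the abstract does not say which constrained optimizer LatentRevise uses, and `frank_wolfe_step`, `vocab_embeddings`, and `step_size` are hypothetical names.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def frank_wolfe_step(latent, grad, vocab_embeddings, step_size):
    """Move `latent` toward the vocabulary embedding that most decreases
    the linearized loss. If `latent` starts inside the convex hull of
    `vocab_embeddings` and 0 <= step_size <= 1, the result stays inside."""
    # Linear minimization oracle over the hull's vertices.
    target = min(vocab_embeddings, key=lambda e: dot(grad, e))
    # Convex combination of the current point and a real token embedding.
    return [(1 - step_size) * x + step_size * t for x, t in zip(latent, target)]
```

Because the update is a convex combination of the current latent and one vocabulary embedding, it can only move toward a real token embedding, which matches the property the abstract describes.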

[NLP-38] Towards Physical Intuitions for Alignment Dynamics: A Case Study With Randomness Crystallization

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在后训练阶段(post-training)的动态演化机制不明确的问题,尤其是模型对齐(alignment)过程中行为结构如何形成与演变。传统研究多依赖能力基准测试,但忽视了模型内部状态随训练过程变化的本质动力学。为此,论文提出将物理科学中的热力学相变理论(thermodynamic phase-transition theory)作为分析框架,引入“结晶化”(crystallization)这一经典热力学相变现象作为案例,类比描述模型从预训练到对齐的演化过程。其核心解决方案在于:将模型行为演化划分为三个阶段——(1)高熵液态阶段(对应预训练模型,具有多样化的可提示采样分布);(2)成核阶段(由监督微调引发,行为坍缩至单一种子分布);(3)稳定阶段(通过强化学习重新分配概率质量,但整体仍集中于原始种子分布)。该框架的关键在于通过可解释的度量指标识别这些相变,并在多种随机任务中验证其有效性。该方法为理解对齐过程中结构生成的根源、收敛性以及不可改变的边界提供了新的理论范式,推动对齐研究向更系统、更本质的物理类比框架演进。

链接: https://arxiv.org/abs/2606.29933
作者: Kunal Samanta,Ari Holtzman,Peter West
机构: University of British Columbia (不列颠哥伦比亚大学); University of Chicago (芝加哥大学)
类目: Computation and Language (cs.CL)
备注:

Abstract:The alignment of language models is typically studied through the lens of capability benchmarks, but the dynamics of how models change during post-training remain poorly understood. We argue that the physical sciences, and thermodynamic phase-transition theory in particular, offer a principled and underexplored vocabulary for reasoning about these dynamics. As a case study, we instantiate this position through the lens of material Crystallization, which is a well-studied thermodynamic phase transition. For tasks like random number generation, this breaks into 3 phases: (1) the high entropy liquid phase in the pretrained model, with many distinct sampling distributions promptable from the model; (2) the nucleation phase caused by supervised finetuning, in which behavior collapses onto a single seed distribution present in the pretrained LLM; and (3) a settling phase in which reinforcement learning techniques redistribute probability of the collapsed distribution, but largely keep it concentrated on the same options as the seed distribution. We propose intuitive metrics to verify the transitions between these phases, and validate the idea across a range of random tasks. Crystallization is one instance of a broader class of physical frameworks we believe alignment research should import to answer questions about where alignment-induced structure comes from, why it converges where it does, and what it fundamentally cannot change.

[NLP-39] Can LLM-as-a-Judge Reliably Verify Rubrics in Agentic Scenarios?

【速读】: 该论文旨在解决在代理型场景(agentic scenarios)中,基于评分量规(rubric-based scoring)的大型语言模型作为裁判(LLM-as-a-Judge, LaaJ)进行评分时的可靠性问题。由于此类场景下生成内容通常具有长序列、高复杂度的特点,现有LaaJ方法在量规验证中的稳定性与一致性尚未得到充分评估,存在显著的评分噪声。为应对这一挑战,论文提出并构建了首个针对代理型场景下量规验证可靠性的基准测试工具——RuVerBench,涵盖深度研究与代理编程两大典型领域,共包含2,458个样本实例,每个实例均包含模型生成输出、对应的评分量规及人工标注的合规性标签。基于该基准,研究系统评估了多个前沿大模型的表现,发现即便最先进的模型仍存在显著的评分噪声。进一步分析表明,提示工程(prompt design)、批量验证(batching)与多数投票(majority voting)等关键策略对评分可靠性有显著影响:弱模型对提示变化更为敏感,批量验证在准确率与效率之间存在权衡,而多数投票虽能有效提升性能但收益递减。研究成果已开源,以推动后续相关研究发展。

链接: https://arxiv.org/abs/2606.29920
作者: Yangda Peng,Yunjia Qi,Hao Peng,Haotian Xia,Guanzhong He,Xintong Shi,Richeng Xuan,Songyuanyi Lu,Yixian Liu,Zhichao Hu,Yuhong Liu,Lei Hou,Bin Xu,Juanzi Li
机构: 未知
类目: Computation and Language (cs.CL)
备注:

Abstract:Rubric-based scoring has become a widely used paradigm in model evaluation, typically with LLM-as-a-Judge (LaaJ) for rubric scoring. However, the reliability of LaaJ for rubric scoring remains underexplored. This concern is especially pronounced in agentic scenarios, where long, complex outputs further challenge reliable scoring. To address this, we conduct a systematic meta-evaluation of LaaJ reliability for rubric verification. We introduce RuVerBench, the first benchmark for assessing LaaJ reliability in rubric verification for agentic scenarios. RuVerBench covers two prevalent agentic domains, deep research and agentic coding, with 2,458 instances, each containing a model-generated output, a rubric, and a human-annotated label indicating whether the output satisfies the rubric. Using RuVerBench, we evaluate numerous frontier LLMs and find that even the most advanced models achieve strong performance but still exhibit substantial noise. We further analyze the impact of key LaaJ strategies, including prompt design, batching, and majority voting, on rubric verification. We find that weaker models are more sensitive to prompt variations, batched verification presents a trade-off between accuracy and efficiency, and majority voting yields effective but diminishing returns. We have released our dataset and code to facilitate future research: this https URL.
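The "effective but diminishing returns" of majority voting can be illustrated with a small calculation. The sketch below assumes every judge call is independently correct with the same probability `p`, which is an idealization (repeated calls to the same judge are usually correlated); the function names are illustrative and not taken from RuVerBench.

```python
from collections import Counter
from math import comb

def majority_vote(labels):
    """Return the most common label; ties resolve to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

def majority_accuracy(p, n):
    """Probability that a majority of n independent votes is correct,
    where each vote is correct with probability p (n should be odd)."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

if __name__ == "__main__":
    for n in (1, 3, 5, 7):
        print(n, round(majority_accuracy(0.7, n), 4))
```

Under this independence assumption, moving from one vote to three helps more than moving from three to five, which is the shape of the trade-off the abstract reports.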

[NLP-40] MemDelta: Controlled Baselines and Hidden Confounds in Agent Memory Evaluation

【速读】: 该论文旨在解决当前智能体记忆系统(agent memory systems)评估中存在的混淆问题,即现有研究在与检索增强生成(RAG)和全上下文(full-context)基线对比时,常同时改变语言模型、嵌入模型或检索管道等多个变量,导致性能提升的归因不清晰。为此,论文提出了一种受控评估协议MemDelta,在LongMemEval-S基准(500个问题,50+会话,三类模型家族)上每次只变动单一组件,以实现精确归因。实证得到四点发现:(1)逐字RAG与全上下文GPT-4o-mini表现相当(47.2% vs. 49.8%,p = 0.34),但排序随模型而反转:Gemini从全上下文中获益14个百分点,而Sonnet从RAG中获益31个百分点,部分原因是它拒绝了63%的全上下文查询;(2)在完全相同的管道中仅更换嵌入模型即可带来6.2个百分点的准确率变化(n = 500,p = 0.004),且Mem0比MiniLM-RAG高11个百分点,却比云端RAG低1.2个百分点,单一变量即可翻转结论;(3)智能体自记忆(self-memory,42%)表现劣于基础检索(47%);(4)在6类问题中的2类上(n = 88),Mem0的准确率与云端RAG相当(72.7% vs. 73.9%,p = 1.0),但成本高出50倍,说明其增益是窄域的而非通用的。因此,论文建议记忆评估应在各组对比中固定嵌入模型、按模型家族分层,并在把性能增益归因于架构之前报告写入路径成本。

链接: https://arxiv.org/abs/2606.29914
作者: Kuan Wang
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 13 pages, 2 figures

Abstract:Agent memory systems are increasingly evaluated against RAG and full-context baselines, but reported gains often mix changes in the memory method with changes in the language model, embedding model, or retrieval pipeline, making it unclear what is actually being measured. We present MemDelta, a controlled evaluation protocol that varies one component at a time on LongMemEval-S (500 questions, 50+ sessions, three model families). Four findings emerge: (1) verbatim RAG matches full-context GPT-4o-mini (47.2% vs. 49.8%, p = 0.34), but the ranking reverses across models: Gemini gains +14pp from full context, while Sonnet gains +31pp from RAG, partly because it refuses 63% of full-context queries; (2) swapping only the embedding model in an identical pipeline shifts accuracy by +6.2pp at n = 500 (p = 0.004), and Mem0 beats MiniLM-RAG by +11pp but loses to cloud-RAG by 1.2pp, so one variable flips the conclusion; (3) agent self-memory (42%) underperforms basic retrieval (47%); (4) on 2 of 6 question types (n = 88), Mem0 matches cloud RAG (72.7% vs. 73.9%, p = 1.0) at 50x the cost, suggesting narrow rather than general gains. We recommend memory evaluations fix embedding models across comparisons, stratify by model family, and report write-path cost before attributing gains to architecture.

[NLP-41] Timesteps of Mamba Align with Human Reading Times

【速读】: 该论文旨在解决语言模型在处理文本时的动态时间机制与人类阅读过程中的认知加工时间之间是否存在对齐的问题。其核心挑战在于理解语言模型如何模拟人类实时语言处理中随输入变化而动态调整的信息处理节奏。解决方案的关键在于揭示状态空间模型Mamba中每个词所对应的离散化时间步长(discretization timestep, Δ_t)能够显著预测人类读者的逐词阅读时间,并且这一预测能力在控制已知预测因子(如GPT-2惊异度,surprisal)后依然显著。通过形式化分析Mamba的架构与内部动态机制,研究进一步表明,该模型提供了一种新的视角来理解具有持续更新记忆的人类实时语言处理过程,尤其体现在各层模块对短期与长期信息保留的权衡机制,以及噪声与连续动态记忆表征之间的交互方式。

链接: https://arxiv.org/abs/2606.29904
作者: Yuji Yamamoto,Shinnosuke Isono,Yoshinobu Kawahara,Sho Yokoi
机构: SOKENDAI; NINJAL; The University of Osaka; Tohoku University; RIKEN
类目: Computation and Language (cs.CL)
备注:

Abstract:This study demonstrates an alignment of per-word processing time in a popular state-space language model Mamba and human readers. In Mamba, the recurrent state transition at each layer conceptually takes some duration of time, the discretization timestep $\Delta_t$, determined dynamically in response to the input. Using a naturalistic reading dataset, we show that the per-word timestep from Mamba is a significant predictor of human reading times, and remains significant even when known predictors such as GPT-2 surprisal are controlled for. We further suggest, through formal analysis of Mamba’s architecture and internal dynamics, that Mamba can serve as a new, valuable lens to look at human real-time language processing with ever-updated memory, because it allows us to look at how each module (layer) weighs short- and long-term information retention, and how noise may interact with dynamic, continuous memory representation. Code is available online.
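To make the role of the timestep concrete, here is a scalar sketch of one selective state-space update in the style Mamba is usually described: an input-dependent step size passes through a softplus, and the state transition is discretized as exp(Δ·a). The scalar state, the constants `a` and `b`, and the name `delta_raw` are simplifying assumptions for illustration, not the model's actual parameterization.

```python
import math

def softplus(z):
    """Smooth positive transform, so the timestep is always > 0."""
    return math.log1p(math.exp(z))

def ssm_step(h_prev, x, delta_raw, a=-1.0, b=1.0):
    """One recurrent update with an input-dependent timestep.

    delta_raw stands for the pre-activation that a learned projection of
    the current input would produce (hypothetical name).
    """
    delta = softplus(delta_raw)
    a_bar = math.exp(delta * a)   # zero-order-hold discretization of a
    b_bar = delta * b             # simplified (Euler) discretization of b
    h = a_bar * h_prev + b_bar * x
    return h, delta
```

With `a < 0`, a larger timestep lets more of the old state decay and gives the current input more weight, which is why a per-word timestep can be read as a per-word amount of "processing time".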

[NLP-42] Clinical Reasoning Graphs: Structured Evaluation of LLM Diagnostic Reasoning Reveals Competence Without Consistency ICML ALT

【速读】: 该论文旨在解决当前大型语言模型(LLM)在临床诊断任务中仅依赖诊断准确率评估其推理能力的局限性,即无法区分稳定的、基于临床知识的结构化推理与仅依赖模式匹配的表面性推断。其核心解决方案是提出临床推理图(clinical reasoning graphs),一种基于领域知识本体(包含5种节点类型和7种边类型)从自由文本的诊断推理轨迹中提取的结构化图表示方法。通过在50例《新英格兰医学杂志》病案讨论(CPC)病例上对5个LLM在3种提示条件下的750条推理轨迹进行分析,研究检验了临床相似病例是否表现出一致的结构化推理模式(即“诊断范式”),并以图相似性作为衡量标准。结果表明,在15组模型-提示组合中,临床相似病例与非相似病例间的图相似性无显著差异,且无任何比较通过多重假设检验校正;进一步的组件级分析显示,残余内容信号远低于范式尺度水平。此外,正确与错误答案对应的图相似性几乎相同(0.488 vs. 0.484),说明图结构捕捉的是与最终诊断准确率无关的独立维度。尽管结构化反思提示可提升轨迹中明确特征分析的比例(+33%),但未改善跨案例的一致性。这些发现揭示了当前模型具备诊断能力但缺乏范式尺度的推理一致性,强调应将最终答案准确性与过程层面的结构化评估相结合。研究团队已公开本体、抽取流程、验证协议及生成的推理图与相似性数据,为未来大模型临床推理的结构化评估提供可复用资源。

链接: https://arxiv.org/abs/2606.29876
作者: Nisarg A. Patel
机构: University of California, San Francisco (加州大学旧金山分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注: Spotlight Paper, Proceedings of the Workshop on Structured Data for Health at the 43rd International Conference on Machine Learning (ICML), Seoul, South Korea

Abstract:Modern large language models (LLMs) reach 60-70% diagnostic accuracy on complex clinical case benchmarks, but accuracy alone cannot distinguish stable clinically-grounded reasoning from pattern matching. We introduce clinical reasoning graphs, structured graph representations extracted from free-text LLM diagnostic traces using a domain-grounded ontology with 5 node types and 7 edge types. We apply this pipeline to 750 traces from five LLMs across 50 New England Journal of Medicine Clinicopathological Conference cases and three prompt conditions, and test whether diagnostic traces show stable structured reasoning patterns, or diagnostic schemas, for clinically similar cases. We operationalize this as higher graph similarity among clinically similar cases than among clinically dissimilar ones. Across 15 model-condition comparisons, within-cluster and between-cluster composite similarity are nearly equal, and no comparison survives multiple-testing correction; a component-level analysis finds any residual content signal far below schema scale. Graph similarity is also nearly identical for pairs of models that are both correct (0.488) and both incorrect (0.484), suggesting that graph structure captures a dimension not reflected in diagnostic accuracy. Structured reflection prompting increases explicit discriminating-feature analysis within traces (+33%) but does not increase cross-case consistency. These results show diagnostic competence without schema-scale reasoning consistency, and indicate that final-answer accuracy should be complemented by process-level evaluation. We release the ontology, extraction pipeline, validation protocol, and the extracted reasoning graphs and similarity artifacts as resources for structured evaluation of LLM clinical reasoning.

[NLP-43] Unveiling Novelty Evolution in the field of Library and Information Science in China

【速读】: 该论文旨在揭示中国图书馆与信息科学(Library and Information Science, LIS)领域学术论文的创新性分布特征,重点关注不同期刊、研究主题及时间周期下的创新差异。其核心问题在于:如何量化并解析中国LIS研究中论文创新性的演变规律及其与研究主题和作者合作模式之间的关联。解决方案的关键在于结合基于引用对组合创新理论的创新度评分方法与BERTopic主题建模技术,通过对CSSCI收录的2000至2022年间中国LIS期刊论文摘要进行分析,实现对研究主题的自动识别与创新性量化评估。研究发现,档案研究类主题整体创新性较低,而期刊评价与专利技术相关主题则表现出更高创新性;总体而言,中国LIS研究的创新性呈上升趋势。此外,低创新性主题多由独立作者完成,而高创新性主题更倾向于跨机构合作,揭示了合作模式在推动学术创新中的关键作用。该研究为理解研究主题与协作机制如何共同影响学术创新提供了新的实证视角。

链接: https://arxiv.org/abs/2606.29872
作者: Chen Yang,Yuzhuo Wang,Chengzhi Zhang
机构: 未知
类目: Digital Libraries (cs.DL); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

Abstract:This study analyzes the novelty distribution of scholarly papers in the field of Library and Information Science (LIS) in China, with a focus on differences across journals, research topics, and time periods. Articles published in Chinese LIS journals indexed by the Chinese Social Sciences Citation Index (CSSCI) from 2000 to 2022 were collected as the research sample. BERTopic was applied to paper abstracts to identify research topics, and novelty scores were calculated based on the combinatorial innovation theory of reference pairs cited by focal papers. The study then examined the novelty of papers under different topics and further analyzed author collaboration patterns to explain how collaboration may be associated with paper novelty. The results show that archival research topics generally have lower novelty, whereas topics related to journal evaluation and patent technology display higher novelty in Chinese LIS research. Overall, the novelty of papers in this field has gradually increased over time. Papers with different topics and novelty levels also show distinct collaboration patterns: low-novelty topics are more often associated with solo authorship, while high-novelty topics tend to involve a higher proportion of inter-institutional collaboration. This study reveals the topic-level characteristics and temporal trends of novelty in Chinese LIS research and provides a new perspective for understanding how research topics and collaboration patterns influence scholarly innovation.

[NLP-44] ARKD: Adaptive Reinforcement Learning-Guided Bidirectional KL Divergence Distillation for Text Generation

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)知识蒸馏(Knowledge Distillation, KD)中因依赖单一KL散度目标而导致的分布拟合与长尾概率建模之间难以平衡的问题,进而限制生成质量与泛化能力。其核心挑战在于如何有效对齐教师模型与学生模型在主模式(principal modes)和长尾模式(long-tail modes)上的输出分布。解决方案的关键在于从理论与实证层面分析前向KL(Forward KL, FKL)与反向KL(Reverse KL, RKL)在分布对齐中的互补作用,并提出一种基于强化学习的自适应KL加权蒸馏框架。该框架通过策略网络(policy network)动态根据教师-学生分布特性分配FKL与RKL的权重,以即时奖励信号为指导,实现对主模式与长尾模式的双重对齐。实验结果表明,该方法在Rouge-L与BertScore等指标上均取得稳定提升,相较贪婪启发式方法提升0.4–0.6分,并优于多种基准方法,在多个基准测试中表现更优。

链接: https://arxiv.org/abs/2606.29869
作者: Zilong Liu,Xuewen Zhang,Jinrui Xing,Juyi Qiao,Huiyong Wang,Junming Jiao
机构: Li Auto(小鹏汽车); Peking University(北京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Knowledge distillation (KD) is a key technique for compressing Large Language Models (LLMs), yet methods relying on a single KL objective often fail to balance primary distribution fitting with long-tail probability modeling, limiting both generation quality and generalization. To address this, we analyze the complementary roles of forward and reverse KL divergence (FKL/RKL) in distribution alignment from theoretical and empirical perspectives. We then propose a reinforcement-learning-based adaptive KL-weighted distillation framework, in which a policy network dynamically assigns weights to FKL and RKL based on teacher-student distributional characteristics, guided by immediate reward signals to achieve dual alignment on principal and long-tail modes. Extensive experiments demonstrate consistent improvements across Rouge-L and BertScore metrics, surpassing greedy heuristics by 0.4-0.6 points and outperforming other baseline methods on diverse benchmarks.
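The forward/reverse KL trade-off can be written down in a few lines. The sketch below mixes the two divergences with a single weight; in the paper that weight comes from a learned policy network, whereas here `fkl_weight` is simply passed in by hand, and all names are illustrative.

```python
import math

def kl(p, q):
    """KL(p || q) in nats. Assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mixed_kl_loss(teacher, student, fkl_weight):
    """Weighted combination of forward and reverse KL.

    Forward KL(teacher || student) penalizes the student for missing
    mass the teacher assigns (coverage, including the long tail).
    Reverse KL(student || teacher) penalizes the student for placing
    mass where the teacher has little (mode-seeking).
    """
    fkl = kl(teacher, student)
    rkl = kl(student, teacher)
    return fkl_weight * fkl + (1.0 - fkl_weight) * rkl
```

For example, with `teacher = [0.7, 0.2, 0.1]` and `student = [0.5, 0.3, 0.2]` the two directions give different values, so the choice of weight changes which kind of mismatch the student is pushed to fix.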

[NLP-45] KbSD: Knowledge Boundary aware Self-Distillation for Behavioral Calibration in Agentic Search

【速读】: 该论文旨在解决智能体搜索(agentic search)中知识边界校准(knowledge boundary calibration)面临的奖励稀疏性问题,即如何在不同知识状态中精准决策:何时信任参数化记忆、何时依赖检索证据、何时选择拒绝回答。现有强化学习方法因采用二元奖励机制,虽能惩罚错误行为,却无法为推理过程提供充分指导。其核心解决方案是提出KbSD(Knowledge boundary Self-Distillation)框架,通过三重机制实现突破:首先,构建与学生模型架构相同的提示增强型教师模型,接收参数化置信度、检索质量及真实答案等显式知识边界信号,生成校准的推理示范;其次,利用信息不对称的自蒸馏机制,在不依赖更大外部模型的前提下实现细粒度的令牌级密集监督;最后,针对不同知识状态下的推理分布异质性,设计分象限自适应蒸馏目标:对集中整合场景使用反向KL散度,对多样化拒绝场景采用正向KL散度,并在需兼顾精度与覆盖的非对称象限中引入帕累托最优双向KL,实现更优的平衡。实验表明,KbSD在多个基准测试中均稳定提升任务准确率并有效抑制幻觉现象,尤其在稀疏奖励信息最匮乏的挑战性象限中增益最大。

链接: https://arxiv.org/abs/2606.29863
作者: Tao Feng,Xinke Jiang,Chao Wu
机构: 未知
类目: Computation and Language (cs.CL)
备注:

Abstract:Agentic search equips large language models with dynamic retrieval abilities, but existing reinforcement learning methods remain limited by reward sparsity in knowledge boundary calibration – deciding when to trust parametric memory, when to rely on retrieved evidence, and when to abstain. Binary rewards can penalize undesirable outcomes, but provide little guidance on the reasoning process required to make calibrated decisions across different knowledge states. To address this, we propose KbSD (Knowledge boundary Self-Distillation), a framework that tackles this limitation through dense token-level supervision, outcome-level sparse rewards, and quadrant-adaptive optimization. KbSD constructs a hint-augmented teacher, architecturally identical to the student, that receives explicit knowledge boundary signals – including parametric certainty, retrieval quality, and ground-truth answers – to generate calibrated reasoning demonstrations. This information-asymmetric self-distillation enables dense supervision without requiring a larger external model. To further account for the heterogeneous reasoning distributions across knowledge states, we introduce a quadrant-adaptive distillation objective: reverse KL for concentrated integration, forward KL for diverse refusal, and Pareto-optimal bidirectional KL for asymmetric quadrants requiring both precision and coverage. Experiments on multiple benchmarks show that KbSD consistently improves both task accuracy and hallucination mitigation over strong baselines, with the largest gains appearing in the challenging quadrants where sparse rewards are least informative.

[NLP-46] Smooth Scaling Laws Hide Stepwise Token Learning

【速读】: 该论文旨在解决语言模型损失函数在模型规模与数据规模下呈现幂律缩放规律的内在机制问题,即为何整体损失会表现出如此一致的幂律形式。现有解释多归因于自然语言中模式难度的重尾分布特性,但这一观点缺乏在大规模真实数据训练中以词元(token)为粒度的直接验证。本文提出一种基于词元级别的分析框架,将缩放定律分解为单个上下文化词元的局部学习事件,通过用sigmoid函数拟合词元损失轨迹,发现词元学习集中发生于特定的时间区间内,由此形成主导缩放规律形态的学习时间谱(learning-time spectrum)。在超过一百次使用现代大语言模型架构、覆盖大规模多样化真实语料库(最大达60亿参数和3000亿训练词元)的预训练实验中,所测量的学习时间谱能够定量重构验证损失对训练步数 T、数据规模 D 与模型规模 M 的导数关系。进一步表明该信号具有可操作性:通过根据词元可学习时刻重新调整训练数据分布,可显著改变优化轨迹,实现验证损失下降速度提升11%。研究结果提供了直接实证证据,表明缩放定律主要由词元级学习时间分布决定,且该分布不仅可用于解释缩放行为,还可用于优化训练效率。

链接: https://arxiv.org/abs/2606.29858
作者: Pingjie Wang,Zechen Hu,Peiru Yang,Fu Guo,Debing Zhang
机构: Dots Studio, Xiaohongshu Inc.; Shanghai Jiao Tong University; Tsinghua University
类目: Computation and Language (cs.CL)
备注: 21 pages

Abstract:Language model loss follows remarkably regular scaling laws over model and data size, yet it remains unclear why the aggregate loss should exhibit a power-law form. Existing explanations often attribute this regularity to a heavy-tailed spectrum of pattern difficulty in natural language, but this view has not been directly validated at token-level granularity in large-scale real-data training. We present a token-level framework that decomposes scaling laws into localized learning events of individual contextualized tokens. By fitting token loss trajectories with sigmoids, we show that token learning is concentrated in localized transitions, giving rise to a learning-time spectrum that dominates the scaling-law shape. Across more than one hundred pre-training runs on large and diverse real-language corpora with modern LLM architectures, scaling up to 6B parameters and 300B training tokens, the measured learning-time spectrum quantitatively reconstructs the validation loss derivative along the training-step $T$, data-scale $D$, and model-scale $M$ axes. We further show that the same signal is actionable: by reshaping the training distribution according to when tokens become learnable, we alter the optimization trajectory and achieve 11% faster validation-loss reduction. These results provide direct empirical evidence that scaling laws are governed primarily by the distribution of token-level learning times, and that this distribution can be used not only to explain scaling behavior but also to improve training performance.
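The picture of "many sharp per-token transitions adding up to a smooth curve" can be reproduced with a toy model. The sketch below gives each token a sigmoid-shaped loss drop in log-step space and averages over tokens; the functional form, the `width` parameter, and the chosen learning steps are assumptions for illustration, not the paper's fitted values.

```python
import math

def token_loss(step, learn_step, high=1.0, low=0.0, width=0.2):
    """Loss of one token: a sigmoid drop from `high` to `low`,
    centered at `learn_step` on a log-step axis."""
    z = (math.log(step) - math.log(learn_step)) / width
    return low + (high - low) / (1.0 + math.exp(z))

def aggregate_loss(step, learn_steps):
    """Mean loss over tokens, each with its own learning time."""
    return sum(token_loss(step, s) for s in learn_steps) / len(learn_steps)

if __name__ == "__main__":
    # Hypothetical learning times spread evenly on a log scale.
    learn_steps = [10 ** (k / 4) for k in range(4, 21)]
    for step in (10, 100, 1000, 10000, 100000):
        print(step, round(aggregate_loss(step, learn_steps), 3))
```

Each token's curve is step-like, yet the average declines gradually, and how fast it declines at a given step depends on how many tokens have their learning time near that step. That is the sense in which a learning-time spectrum can shape the aggregate loss curve.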

[NLP-47] MATCH: Modulating Attention via In-Context Retrieval for Long-Context Transformers ACL2026

【速读】: 该论文旨在解决传统注意力机制(attention mechanism)带来的二次计算复杂度问题,这一瓶颈严重制约了大语言模型(Large Language Models, LLMs)在长上下文场景下的可扩展性与实际部署能力。现有方法通常通过引入刚性结构约束(如局部注意力窗口)来提升效率,但此类策略往往导致需要精准长程记忆的任务上性能显著下降。本文提出MATCH框架,通过在稀疏化注意力机制中动态整合上下文信息,结合高效的检索系统,实现对上下文的增强式检索。其核心创新在于:在保持稀疏注意力架构高效性的前提下,利用动态检索机制弥补因稀疏化导致的上下文信息丢失问题,从而在合成数据与真实自然语言任务上均显著提升稀疏注意力模型的性能。该方案的关键在于将外部高效检索与动态上下文信息融合无缝集成于稀疏注意力结构中,实现了效率与精度的协同优化。

链接: https://arxiv.org/abs/2606.29844
作者: Linrui Ma,Chun Hei Lo,Xinyu Wang,Peng Lu,Xihao Yuan,Hanting Chen,Kai Han,Xinghao Chen,Chengjun Zhan,Hanlin Xu,Yichun Yin,Lifeng Shang,Feng Wen,Boxing Chen,Yufei Cui
机构: Huawei Canada(华为加拿大); McGill University(麦吉尔大学); Université de Montréal(蒙特利尔大学); Huawei(华为)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ACL 2026 Main Conference

Abstract:The quadratic computational cost of traditional attention mechanisms poses a major bottleneck to the scalability and practical deployment of large language models (LLMs), particularly in long-context scenarios. To improve efficiency, existing approaches often enforce rigid structural constraints such as local attention windows. However, these strategies typically lead to substantial performance degradation on tasks requiring precise long-range recall. In this work, we propose MATCH, a scalable and efficient framework that augments sparsified attention mechanisms with dynamically integrated in-context information through an efficient retrieval system. Empirical results show that MATCH significantly improves the performance of sparse-attention models on both synthetic and real-world natural-language tasks. These findings highlight the versatility of MATCH as a general approach for enhancing in-context retrieval capabilities while maintaining the efficiency benefits of sparse attention architectures.

[NLP-48] Neural Procedural Memory: Empowering LLM Agents with Implicit Activation Steering

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)从静态求解器向自主智能体(autonomous agents)转变过程中的核心挑战——缺乏持续的、可持久化的程序化记忆(procedural memory)。现有方法主要依赖检索增强生成(Retrieval-Augmented Generation, RAG)通过显式文本指令注入上下文,但这种符号化指令易导致“文本-动作脱节”(text-action disconnect),难以有效激活执行任务所必需的内部神经表征。为应对这一问题,论文提出无需训练的神经程序化记忆(Neural Procedural Memory, NPM)框架,其关键在于将历史对比性经验中提炼出的程序化技能以隐式激活引导向量(steering vectors)的形式编码于激活空间中,直接驱动与任务相关的神经机制,实现对任务执行的隐式引导。实验表明,NPM在四个智能体基准测试中表现与依赖显式文本指令的基线相当;更重要的是,隐式引导与显式工作流的结合展现出互补优势,显著提升了任务执行的鲁棒性。表征分析进一步揭示,这些引导向量在激活空间中形成了结构化且一致的任务逻辑表示,验证了隐式激活引导在构建可持续智能体记忆方面的有效性与潜力。

链接: https://arxiv.org/abs/2606.29824
作者: Chengfeng Zhao,Yuqiao Tan,Shizhu He,Yequan Wang,Jun Zhao,Kang Liu
机构: Institute of Automation, CAS; University of Chinese Academy of Sciences; Beijing Academy of Artificial Intelligence
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:While Large Language Models (LLMs) excel as static solvers, transforming them into autonomous agents remains challenging. This transition requires continuous environmental interaction, yet current agents lack the necessary persistent procedural memory. Existing approaches predominantly employ Retrieval-Augmented Generation (RAG) to inject explicit textual guidelines into model contexts. However, relying solely on symbolic instructions can introduce a text-action disconnect, frequently failing to activate the internal representations necessary for correct task execution. To address this, the paper introduces Neural Procedural Memory (NPM), a training-free framework that represents agent memory through implicit activation steering rather than explicit instructions. By distilling procedural skills from historical contrastive experiences into steering vectors in the activation space, NPM directly activates the task-relevant neural mechanisms to guide task execution. Evaluations across four agent benchmarks show that NPM performs comparably to baselines using explicit textual instructions. Furthermore, the results show that combining implicit steering with explicit workflows provides complementary advantages, leading to more robust task execution. Representational analyses indicate that these steering vectors encode consistent task logic, forming organized structures within the activation space. These findings suggest that implicit activation steering provides a promising approach for managing agent memory.
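A common way to obtain a steering vector from contrastive examples is the difference of mean activations between the two groups. The sketch below shows that generic recipe on plain Python lists; whether NPM distills its vectors exactly this way is not stated in the abstract, and `steering_vector`, `apply_steering`, and `strength` are hypothetical names.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def steering_vector(success_acts, failure_acts):
    """Difference of means: points from failed-episode activations
    toward successful-episode activations."""
    pos = mean_vector(success_acts)
    neg = mean_vector(failure_acts)
    return [p - q for p, q in zip(pos, neg)]

def apply_steering(hidden, vector, strength=1.0):
    """Add the scaled steering vector to one hidden state at inference."""
    return [h + strength * v for h, v in zip(hidden, vector)]
```

In practice the activations would be read from a chosen layer of the model for successful and failed trajectories of the same task, and the vector would be added at that layer during generation; no weights are updated, which is what makes the approach training-free.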

[NLP-49] SrDetection: A Self-Referential Framework for Data Leakage Detection in Code Large Language Models

【速读】: 该论文旨在解决代码大语言模型(Code LLMs)评估中因训练数据泄露(data leakage)导致性能虚高的问题,即模型在预训练阶段无意中接触了评测集数据,从而人为提升其在基准测试中的表现。现有方法存在依赖专有训练语料、依赖脆弱启发式规则(如时间戳过滤)或使用外部参考集并需手动调参设定非通用阈值等局限性。为此,本文提出一种统一的自参照式泄露检测框架——SrDetection,适用于灰盒(可访问模型输出逻辑值,logits)与黑盒(仅可访问模型输出结果)两种场景。其核心创新在于:生成基准样本的语义等价变体,并通过对比模型对原始样本与其变体的行为差异来识别泄露,若模型在原始样本上显著更易完成任务,则判定为存在泄露。该方法不依赖人工设定阈值,具备强鲁棒性。研究进一步构建了可控的泄露检测测试环境进行验证,在不同模型及训练阶段下,相比强基线方法,灰盒设置平均F1提升21.52点,黑盒设置提升14.46点。对15个主流Code LLM在四大基准上的灰盒分析揭示了超越传统重叠分析的、具有基准特异性的泄露模式,凸显了本方法在真实场景中的有效性与洞察力。

链接: https://arxiv.org/abs/2606.29815
作者: Shuaimin Li,Liyang Fan,Zeyang Li,Zhuoyue Wan,Yufang Lin,Shiwen Ni,Feiteng Fang,Hamid Alinejad-Rokny,Yuanfeng Song,Kun Jing,Chen Jason Zhang,Min Yang
机构: Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Shenzhen University; University of Science and Technology of China; PolyU; East China Normal University; Artificial Intelligence Research Institute, Shenzhen University of Advanced Technology; University of New South Wales; Anhui University
类目: Computation and Language (cs.CL)
备注:

Abstract:Evaluating code large language models (Code LLMs) requires reliable detection of data leakage, where benchmark performance is artificially inflated by exposure to benchmark data during pre-training. Existing approaches either assume access to proprietary training corpora, rely on brittle heuristics such as timestamp filtering, or use external reference sets with manually tuned, non-generalizable thresholds. To address these limitations, we introduce SrDetection, a unified self-referential leakage detection framework for both gray-box (access to model logits) and black-box (access to model outputs) settings. SrDetection generates semantically equivalent variants of a benchmark sample and detects leakage by contrasting the model’s behavior on the original versus its variants, flagging cases where the original is disproportionately easier for the model. We further design a controlled leakage detection testbed and evaluate SrDetection in this environment. Across different models and training stages, SrDetection improves average F1 by 21.52 points in the gray-box setting and 14.46 points in the black-box setting over strong baselines, demonstrating robust, threshold-independent leakage detection. Finally, a gray-box study of 15 widely used Code LLMs on four popular benchmarks reveals benchmark-specific leakage patterns beyond prior overlap-based analyses. Source code and data are available at this https URL
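The self-referential idea, comparing a sample against its own rewrites instead of an external reference set, can be sketched with a simple rank statistic. This is an illustration of the general principle only: the scoring signal (for example a mean token log-likelihood in the gray-box case) and the function `original_advantage` are assumptions, not SrDetection's actual detector.

```python
def original_advantage(original_score, variant_scores):
    """Fraction of semantically equivalent variants that the model finds
    strictly harder than the original (higher score = easier).

    Values near 1.0 mean the original is easier than nearly all of its
    own rewrites, which is the pattern a memorized sample would show.
    """
    if not variant_scores:
        raise ValueError("need at least one variant")
    beaten = sum(original_score > v for v in variant_scores)
    return beaten / len(variant_scores)
```

Because the original is judged only relative to its own variants, the statistic needs no dataset-wide cutoff on raw likelihoods. For an unseen sample whose variants are truly equivalent, there is no reason to expect the original to rank first consistently.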

[NLP-50] How Far Can You Get Without a GPU? A Systematic Benchmark of Lightweight Hallucination Detection Across Question Answering, Dialogue, and Summarisation

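[Quick read]: This paper addresses the accessibility of hallucination detection for trustworthy generative AI deployment in resource-constrained settings. Existing high-accuracy detectors typically depend on GPU inference, proprietary API calls, or white-box access to the generating model, which puts them out of reach when compute is limited. The paper evaluates a practical alternative: lightweight, CPU-feasible detectors built only on publicly available models. The key to the solution is a systematic benchmark of five such methods (ROUGE-L, semantic similarity, BERTScore, an NLI detector based on a FEVER-trained DeBERTa model, and a score-level ensemble of similarity and NLI) across the three HaluEval tasks (question answering, dialogue, and summarisation). Performance is highly task-dependent and no single method dominates: the ensemble is best on QA (F1 = 0.792, AUC-ROC = 0.873), the NLI detector leads on dialogue (AUC-ROC = 0.713), and all methods fall to near-random performance on summarisation (AUC-ROC between 0.469 and 0.574). This task dependence and the systematic failure on summarisation mark the practical frontier of GPU-free hallucination detection and give concrete guidance for method selection under computational constraints. All experiments run on a standard laptop CPU with public models.

Link: https://arxiv.org/abs/2606.29809
Authors: Kriti Faujdar, Smit Kadvani
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract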

Abstract:Hallucination detection has become a pressing requirement for trustworthy AI deployment at scale. The most accurate detection methods depend on GPU-intensive inference, proprietary API calls, or white-box access to the generating model. This puts them out of reach for resource-constrained researchers and practitioners. In this paper, we explore a practical alternative: how well can hallucination detection perform using only lightweight, CPU-feasible methods built on publicly available models? We systematically benchmark five such methods: ROUGE-L, semantic similarity, BERTScore, a Natural Language Inference (NLI) detector based on a FEVER-trained DeBERTa model, and a score-level ensemble of similarity and NLI. We evaluate them across all three tasks of the HaluEval benchmark: question answering (QA), dialogue, and summarisation. We calibrate each method on a held-out validation split and evaluate it on 2,000 test instances per task. We find that no single method dominates and performance is highly task-dependent. The ensemble performs best on QA (F1 = 0.792, AUC-ROC = 0.873), the NLI detector leads on dialogue (AUC-ROC = 0.713), and all five methods degrade to near-random performance on summarisation (AUC-ROC between 0.469 and 0.574). This task-dependence and the systematic failure on summarisation map the practical frontier of GPU-free hallucination detection. They give practical guidance for method selection under computational constraints. All experiments run on a standard laptop CPU using public models.
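
Two of the benchmarked ingredients are simple enough to sketch without any model: ROUGE-L (an F-measure over the longest common subsequence) and a score-level ensemble. The whitespace tokenisation and the equal ensemble weight below are assumptions; the abstract does not state how the paper tokenises text or weights its ensemble.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 with naive lowercase whitespace tokenisation (assumption)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def ensemble_score(similarity: float, nli_entailment: float,
                   weight: float = 0.5) -> float:
    """Score-level ensemble as a weighted average (weight is hypothetical)."""
    return weight * similarity + (1 - weight) * nli_entailment


print(rouge_l_f1("the cat sat", "the cat"))
```

A low score from either function would then be compared against a threshold calibrated on a held-out split, as the abstract describes.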

[NLP-51] Fund2Persona: A Framework for Building and Refining Financial Advisor Personas from Fund Disclosure Data

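[Quick read]: This paper addresses the difficulty of scaling, encoding, and reliably reproducing expert-level advice in personalized financial advising, and in particular the tendency of LLMs given simple persona prompts to reason vaguely and drift toward generic, ungrounded recommendations. The core solution is the Fund2Persona framework, which grounds financial-advisor personas in fund disclosures, holdings transitions, market context, and manager commentary, and then refines them through an agentic actor–scorer–patcher loop. The resulting personas recover portfolio decisions and grounded manager interpretation better than generic baselines, and they also perform better in market-scenario generation and in advisory dialogues grounded in investor profiles, giving more specific and useful advice. The results suggest that fund-data-grounded personas can make manager-specific investment expertise portable rather than merely changing an LLM's surface style.

Link: https://arxiv.org/abs/2606.29793
Authors: Suhwan Park, Hoyoung Lee, Zhangyang Wang, Alejandro Lopez-Lira, Young Cha, Chanyeol Choi, Jaewon Choi, Yongjae Lee
Affiliations: UNIST; LinqAlpha; University of Texas at Austin; University of Florida; Blackstone; Hanwha Life
Categories: Computation and Language (cs.CL); General Finance (q-fin.GN)
Comments: 17 pages, 5 figures, 12 tables

Click to view abstract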

Abstract:Demand for personalized financial advising is growing, but consistent advisor expertise is difficult to obtain, scale, and encode in LLM systems. Simple persona prompts rarely specify how a financial advisor should reason and often drift toward generic recommendations. We propose Fund2Persona, a framework that grounds financial-advisor personas in fund disclosures, holdings transitions, market context, and manager commentary, then refines them through an agentic actor–scorer–patcher loop. We evaluate the resulting personas on held-out holdings-transition reconstruction and manager-commentary alignment, where they better recover portfolio decisions and grounded manager interpretation than generic baselines. We further study two downstream diagnostics: market-scenario generation, where persona retrieval broadens plausible investment views beyond repeated generic rollouts, and advisory dialogues grounded in investor profiles, where matched personas give more specific and useful advice than a generic advisor. These results suggest that fund-data-grounded financial-advisor personas can make manager-specific investment expertise portable rather than merely changing an LLM’s surface style.

[NLP-52] Are Humans Evolved Instruction Followers? An Underlying Inductive Bias Enables Rapid Instructed Task Learning NEURIPS2025

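[Quick read]: This paper examines the mechanism behind rapid instructed task learning (RITL), the human ability to perform a novel task correctly on the first attempt after receiving only verbal or written instructions, with a focus on what natural and artificial systems may have in common. Its central proposal is an "instruction-following bias": an inductive bias shaped by evolution that lets human cognition interpret and execute linguistic instructions, enabling fast generalization from language to behavior. Functionally, this bias is analogous to the way large language models (LLMs) achieve zero-shot task performance through instruction tuning, but in humans it is posited to be an innate feature of cognitive architecture rather than the product of a specialized training protocol. The paper synthesizes evidence from cognitive science, neuroscience, and machine learning to support the hypothesis, and calls for interdisciplinary research on instruction following as a unifying mechanism for rapid task learning in natural and artificial neural networks.

Link: https://arxiv.org/abs/2606.29792
Authors: Anjishnu Kumar
Affiliations: Amazon Alexa AI
Categories: Computation and Language (cs.CL)
Comments: 4 pages, Position Paper, Published at NeurIPS 2025 Workshop on Interpreting Cognition in Deep Learning Models - this https URL

Click to view abstract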

Abstract:Human adults can often perform a novel task correctly on the first attempt after only receiving verbal or written instructions. This rapid instructed task learning (RITL) is a hallmark of human cognitive flexibility, yet its mechanisms and parallels in artificial systems remain under-explored across disciplines. In this position paper, we argue that humans possess an evolved instruction-following bias – an inductive bias shaped by evolution to interpret and execute linguistic instructions which critically enables fast generalization of behavior from language. This bias functions analogously to the way large language models (LLMs) leverage instruction tuning to achieve zero-shot task performance. We synthesize evidence from cognitive science, neuroscience, and machine learning research to support this hypothesis. While instruction-following in AI is currently achieved via specialized training protocols, we posit that in humans it arises as an innate cognitive architecture feature. We outline testable predictions and call for more interdisciplinary research to investigate Instruction-Following as a unifying mechanism enabling rapid task learning in both natural and artificial neural networks.

[NLP-53] Managing Map Cardinality in Automatic Disease Classification Mapping: Balancing Precision, Recall, and Coverage

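[Quick read]: This paper addresses complex mapping relationships in automatic mapping between disease classification systems such as the International Classification of Diseases (ICD). Existing embedding-based methods focus mainly on one-to-one mappings and overlook one-to-many scenarios. Threshold-based and top-K methods extend naturally to multiple mappings, but they involve inherent trade-offs between precision, recall, and mapping coverage (the proportion of source codes with at least one mapping to a target code). To overcome this, the paper proposes a method inspired by the blocking-and-matching pipeline used in entity resolution: it first generates a block of candidate matches (blocking) and then uses a large language model (LLM) to identify all valid mappings within each block (matching). Experiments on several ICD version pairs (ICD-9-CM ↔ ICD-10-CM and ICD-10-AM ↔ ICD-11) show higher precision with comparable recall and broader coverage.

Link: https://arxiv.org/abs/2606.29750
Authors: Santosh Purja Pun, Oliver Obst, Jim Basilakis, Jeewani Anupama Ginige
Affiliations: Western Sydney University; UNSW
Categories: Computation and Language (cs.CL)
Comments: Main text: 8 pages, 1 table and 3 figures; Appendix: 8 pages, 11 tables, 2 figures

Click to view abstract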

Abstract: Automatic mapping between disease classification systems, such as the International Classification of Diseases (ICD), is a challenging yet essential task for integrating health data and conducting longitudinal data analysis. Existing embedding-based methods primarily focus on one-to-one mappings, overlooking more complex one-to-many scenarios. The threshold-based and top-K methods offer natural extensions; however, they involve inherent trade-offs between precision, recall and mapping coverage – the proportion of source codes with at least one mapping to a target code. To address this challenge, we introduce a novel method, which is inspired by the blocking-and-matching pipeline commonly used in entity resolution. In particular, we first generate a block of candidate matches (blocking) and then employ a large language model (LLM) to identify all valid mappings within each block (matching). Empirically, we show that the proposed method achieves higher precision with comparable recall and broader coverage across multiple ICD version pairs (ICD-9-CM ↔ ICD-10-CM and ICD-10-AM ↔ ICD-11). Our source code and dataset are available at: this https URL.
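
The blocking-and-matching pipeline and the coverage metric can be sketched as follows. Everything here is a toy under stated assumptions: the codes `S1`, `T1`, and so on are placeholders rather than real ICD codes, the two-dimensional vectors stand in for embeddings, and `stub_matcher` is a hypothetical stand-in for the LLM matching step.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def block(source_vec, target_vecs: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Blocking: keep the k target codes most similar to the source code."""
    ranked = sorted(target_vecs,
                    key=lambda code: cosine(source_vec, target_vecs[code]),
                    reverse=True)
    return ranked[:k]


def map_codes(source_vecs, target_vecs,
              matcher: Callable[[str, List[str]], List[str]], k: int = 2):
    """Matching: the matcher returns every valid target within each block."""
    return {src: matcher(src, block(vec, target_vecs, k))
            for src, vec in source_vecs.items()}


def coverage(mappings: Dict[str, List[str]]) -> float:
    """Share of source codes with at least one mapped target code."""
    return sum(1 for targets in mappings.values() if targets) / len(mappings)


# Hypothetical ground truth used only by the stub matcher below.
VALID = {("S1", "T1"), ("S1", "T2")}


def stub_matcher(src: str, candidates: List[str]) -> List[str]:
    return [t for t in candidates if (src, t) in VALID]


sources = {"S1": [1.0, 0.0], "S2": [0.0, 1.0]}
targets = {"T1": [1.0, 0.1], "T2": [0.9, 0.2], "T3": [0.0, 1.0], "T4": [-1.0, 0.0]}
result = map_codes(sources, targets, stub_matcher, k=2)
print(result, coverage(result))
```

Letting the matcher return a subset of the block, possibly empty and possibly with several codes, is what allows one-to-many mappings without a fixed threshold or a fixed K on the final output.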

[NLP-54] Fast Numbers, Slow Language: Bridging Quantitative and Qualitative Earnings Signals

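[Quick read]: This paper addresses the long-standing separation between two kinds of information in earnings announcements: quantitative surprise and qualitative language. Financial economists have focused on quantitative surprise (EPS and revenue versus analyst estimates), while NLP researchers have studied qualitative signals in earnings conference call transcripts (ECTs), such as management tone and the credibility of guidance. The two communities used incompatible frameworks, with different targets (return vs. volatility), trading setups (long top-decile and short bottom-decile vs. trade-all), and metrics (the Q5-Q1 return spread between the top and bottom 20% vs. mean squared error), which made direct comparison difficult. The paper introduces EarningsInOne, the first corpus aligning earnings news, ECTs, and intraday and next-day prices across the S&P 1500 (2022-2025), and applies unified trading and evaluation tools to both signal types. The analysis shows a clean "fast numbers, slow language" separation: quantitative surprise peaks at announcement and is largely eliminated by the next market open, whereas qualitative ECT sentiment peaks on the next trading day and is real and tradeable, but was hidden by prior evaluation that optimised sign-agnostic volatility with pointwise MSE. The key to the solution is a unified, aligned data resource and evaluation protocol that connects research on the two signal types.

Link: https://arxiv.org/abs/2606.29734
Authors: Ding Yu, Zhuo Liu, Hao Zhang, Hangfeng He
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE)
Comments: 19 pages, 5 figures. Code and data: this https URL

Click to view abstract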

Abstract: Earnings announcements release two types of information sequentially: quantitative surprise (numeric earnings-per-share (EPS)/revenue versus analyst estimate) arrives first in press releases and financial news, processed by algorithmic traders within minutes; qualitative language (management tone, guidance, question-and-answer (QA) credibility) arrives 30-90 min later in the earnings conference call transcript (ECT), requiring human interpretation overnight. Financial economists have studied quantitative surprise for 50 years; natural language processing (NLP) researchers have studied qualitative ECT signals for a decade. Despite studying the same event, the two communities used incompatible frameworks: different targets (return vs. volatility), trading setups (long top-decile and short bottom-decile vs. trade-all), and metrics (return spread between top and bottom 20% (Q5-Q1) vs. mean squared error (MSE)), making direct comparison and connection challenging. We bridge these communities with EarningsInOne, the first corpus aligning earnings news, ECTs, and intraday and next-day prices across S&P 1500 (broad U.S. equity universe, 2022-2025). Applying unified trading and evaluation tools to both signal types, we confirm a clean speed separation, fast numbers, slow language: quantitative surprise peaks at announcement and is largely eliminated by the next market open; qualitative ECT sentiment peaks on the next trading day, real and tradeable, but hidden under prior transcript-based evaluation that optimised sign-agnostic volatility with pointwise MSE.
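
The Q5-Q1 metric mentioned in the abstract is the difference in mean return between the top and bottom signal quintiles. A minimal sketch, assuming equal-weighted quintiles and made-up numbers (the paper's exact portfolio construction is not given in the abstract):

```python
def q5_minus_q1(signals, returns):
    """Mean return of the top 20% by signal minus that of the bottom 20%.

    Equal weighting within each quintile is an assumption of this sketch.
    """
    pairs = sorted(zip(signals, returns), key=lambda p: p[0])
    n = len(pairs) // 5
    if n == 0:
        raise ValueError("need at least five observations")
    bottom = [r for _, r in pairs[:n]]
    top = [r for _, r in pairs[-n:]]
    return sum(top) / n - sum(bottom) / n


# Illustrative data: ten stocks ranked by a hypothetical sentiment signal.
signals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
returns = [-0.02, -0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.01, 0.03]
spread = q5_minus_q1(signals, returns)
print(round(spread, 4))
```

Unlike a pointwise MSE on volatility, this spread depends on the sign and ranking of returns, which is why the two evaluation conventions can disagree about whether a signal is useful.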

[NLP-55] How Far Do On-Prem Open LLMs Get on Text-to-SQL? A Cross-Family Size x Technique Frontier on BIRD

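[Quick read]: This paper asks how well open-weight Text-to-SQL models perform, and what they cost, when data cannot be sent to a cloud API and the model must run on-premises. It evaluates mainstream open model families across sizes and generations and tests whether popular accuracy "recipes" are worth their compute. The key to the solution is a matched, fully reproducible benchmark on the BIRD development split (n=1534, Execution Accuracy) covering Qwen2.5-Coder (7B/14B/32B), CodeLlama-Instruct (7B/13B/34B), and Llama-3.x (8B, 70B), with a component-by-component ablation of three model-agnostic techniques (schema linking, self-correction, self-consistency) and every difference tested with the paired McNemar test. The findings are: (1) model generation matters more than parameter count, and the recipe is robust across families; (2) self-correction is a robust, near-free win on every family with room to improve; (3) schema linking does not help, and even a retrieval/embedding linker with high gold-table recall is statistically indistinguishable from no linking, which rules out the "weak lexical strawman" objection; (4) self-consistency is poor value (+0.13 pp for about 5x tokens, not significant). The paper also reports real per-stage cost per 1k queries and releases all code, predictions, and summaries.

Link: https://arxiv.org/abs/2606.29733
Authors: Vladimir Beskorovainyi
Affiliations: Besk Tech; Moscow Institute of Physics and Technology (MIPT)
Categories: Computation and Language (cs.CL); Databases (cs.DB); Machine Learning (cs.LG)
Comments: 9 pages, 4 figures, 3 tables. Code: this https URL Data DOI: this https URL

Click to view abstract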

Abstract: Organizations that cannot send data to a cloud API increasingly ask: how good is Text-to-SQL if the model must run on-premises on open weights, and which popular accuracy “recipes” are worth their compute? We answer with an honest, fully reproducible benchmark on the BIRD development split (n=1534, Execution Accuracy), evaluating three open model families across two generations – Qwen2.5-Coder (7B/14B/32B), CodeLlama-Instruct (7B/13B/34B), and Llama-3.x (8B, 70B) – under one matched protocol, ablating a model-agnostic recipe (schema linking, self-correction, self-consistency) component by component, with every difference tested by the paired McNemar test. Four findings stand out. (i) Generation matters more than raw size, and the recipe is family-robust: Qwen2.5-Coder dominates the older CodeLlama at matched size (39.1 vs 20.9 at 7B), but a modern non-Qwen model (Llama-3.3-70B, 49.2 on a matched serving) is competitive, so CodeLlama’s weakness reflects its 2023 generation, not “non-Qwen = weak”. (ii) Self-correction is a robust, near-free win, significant on all three families where there is room to improve. (iii) Schema linking does not help, and a stronger linker does not rescue it: a retrieval/embedding linker with 96.5% gold-table recall is statistically indistinguishable from no linking, ruling out the “weak lexical strawman” objection across three families. (iv) Self-consistency is poor value (+0.13 pp for ~5x tokens, not significant). We report real per-stage cost ($/1k queries) and release all code, predictions, and summaries; archived code and data: this https URL
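
The paired McNemar test compares two systems on the same questions using only the discordant pairs (questions one system gets right and the other gets wrong). A minimal sketch of the exact two-sided version follows; the abstract does not say whether the paper uses the exact binomial form or the chi-square approximation, so the exact form here is an assumption.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b) -> float:
    """Two-sided exact McNemar p-value from paired correctness flags."""
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    # Under H0 each discordant pair favours either system with prob. 0.5.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Illustrative flags: system A is right on five questions that B misses.
system_a = [True] * 5 + [False] * 3
system_b = [False] * 5 + [False] * 3
print(mcnemar_exact(system_a, system_b))
```

Questions that both systems answer correctly, or both answer incorrectly, do not affect the p-value, which is why the test suits matched-protocol ablations on a fixed development split.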

[NLP-56] The Hidden Cost of Resampling: How Imbalance Correction Degrades Probability Calibration in Tree Ensembles

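[Quick read]: This paper addresses the negative effect of resampling on probability calibration in class-imbalanced classification, focusing on synthetic oversampling (SMOTE) and random undersampling. The central findings are as follows. SMOTE incurs a real but small calibration cost (ECE +0.009, a small-to-moderate effect), and its discrimination gains typically outweigh that cost. The genuine danger is random undersampling, whose damage grows sharply with the imbalance ratio (for example, ECE rises from 0.008 to 0.395 on a dataset with ratio 70), largely because the resulting training sets are too small to estimate probabilities reliably. A single post-hoc recalibration step (Platt or isotonic) removes the damage, reducing ECE by up to 66% at a negligible cost in ranking power (AUC). The analytic prior-shift correction that repairs undersampling does not transfer to SMOTE, because SMOTE distorts the class-conditional density rather than only the prior, so data-driven recalibration remains necessary. The paper recommends that imbalanced-learning studies report calibration alongside discrimination, and that practitioners recalibrate after resampling whenever predicted probabilities drive decisions.

Link: https://arxiv.org/abs/2606.29720
Authors: Zewen Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 8 pages, 6 figures, 5 tables

Click to view abstract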

Abstract:Resampling methods such as SMOTE and random under/over-sampling are standard tools for class-imbalanced classification, almost always evaluated by minority-class accuracy or F1. Prior work has established that undersampling degrades probability calibration by distorting the training prior [1]. We extend this lens to synthetic oversampling (SMOTE) and provide a practical, evidence-based guide to when calibration damage matters and how to fix it. Across five public datasets (imbalance ratio 1.9-70) and two ensemble models (random forest, gradient boosting), with ten seeds and paired statistics, we find: (1) SMOTE’s calibration cost is real but small (ECE +0.009; Cliff’s delta = +0.27, small-to-moderate) across the studied imbalance range (IR 1.9-70) and its discrimination gains typically outweigh the calibration penalty; (2) random undersampling is the genuine danger – its damage grows sharply with imbalance, inflating ECE from 0.008 to 0.395 on a dataset with ratio 70, largely because the resulting training sets are too small to estimate probabilities reliably; (3) a single post-hoc recalibration step (Platt or isotonic) eliminates the damage, reducing ECE by up to 66% at a negligible ranking-power cost (AUC -0.002, Cliff’s delta = -0.07); and (4) the analytic prior-shift correction that repairs undersampling does not transfer to SMOTE, because SMOTE distorts the class-conditional density rather than only the prior – so data-driven recalibration remains necessary. We recommend that imbalanced-learning studies report calibration alongside discrimination, and that practitioners recalibrate after resampling whenever predicted probabilities drive decisions.
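
Two quantities from the abstract can be made concrete in a short sketch: expected calibration error (ECE) and a closed-form prior-shift correction for undersampling. The ECE variant below bins the predicted positive-class probability into equal-width bins, which is one common definition and an assumption here. The correction follows from Bayes' rule when each negative example is kept with probability `beta`; whether the paper uses exactly this form is not stated in the abstract.

```python
def expected_calibration_error(probs, labels, n_bins: int = 10) -> float:
    """ECE over equal-width bins of the predicted positive-class probability."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(mean_p - frac_pos)
    return ece


def undo_undersampling_prior(p_s: float, beta: float) -> float:
    """Map a probability learned on undersampled data back to the original prior.

    beta is the probability that a negative example was kept (assumed known).
    """
    return beta * p_s / (beta * p_s - p_s + 1)


# A model that says 0.8 for examples that are all positive is off by 0.2.
print(expected_calibration_error([0.8] * 5, [1] * 5))
# A 0.5 score after keeping 10% of negatives corresponds to roughly 0.09.
print(undo_undersampling_prior(0.5, 0.1))
```

The correction only rescales for a changed class prior. It has no way to account for synthetic points that alter the class-conditional density, which matches the abstract's point that it does not transfer to SMOTE.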

[NLP-57] A Diagnostic Framework and Multi-Evaluator Audit of Evaluator-Driven Preference Dynamics in Self-Adapting LLM Agents

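[Quick read]: This paper addresses the problem that measurements of proprietary LLM evaluators can become invalid within weeks, so conclusions drawn from a single snapshot may reflect changes in the evaluator rather than in the system under study. The key to the solution is EPC, a diagnostic framework comprising the Multimodal Preference Collapse Index (MPCI), an evaluator-indexed coupling matrix, and Jensen-Shannon divergence (JSD), applied across eight experimental conditions (122 unique repetitions in total). Per-condition coupling coefficients range from 0.00 to 1.18 (CV approximately 0.9). Four conditions show strong coupling (GPT-4o May, GPT-4o-mini, Qwen3.7-plus, DashScope 30r), while four collapse to near-zero coupling (GPT-4o June, qwen-plus, symmetric LR, DeepSeek self-eval). In the May-to-June GPT-4o drift, a re-replication inverted the study's original conclusion, which the authors treat as the most informative measurement. Self-evaluation consistently collapses (97% zero, JSD=0.003), though floor effects are possible. The output-format confound analysis finds a per-strategy aggregate rho of 0.89 but a per-instance rho of only 0.219 (p=0.093). The paper stresses that the finding is not any single coupling magnitude but the pattern of version-conditional instability, which makes single-snapshot evaluator studies unreliable.

Link: https://arxiv.org/abs/2606.29719
Authors: Liu Zewen
Affiliations: Qilu Institute of Technology, School of Software Engineering
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 9 pages, 4 figures, 6 tables

Click to view abstract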

Abstract:Measurements of proprietary LLM evaluators can become invalid within weeks – we document one case and provide the diagnostic framework to detect it. We introduce EPC – comprising the Multimodal Preference Collapse Index (MPCI), evaluator-indexed coupling matrix, and Jensen-Shannon divergence (JSD) – and apply it across eight experimental conditions (N=112 main + N=10 ablation = 122 unique repetitions, all reported). Coupling coefficients range from 0.00 to 1.18 across per-condition means (CV approx 0.9, n=8 conditions). Four conditions show strong coupling (N=36; GPT-4o May, GPT-4o-mini, Qwen3.7-plus, DashScope 30r); four collapse to near-zero (N=76; GPT-4o June, qwen-plus N=30, symmetric LR, DeepSeek self-eval). The May-to-June GPT-4o drift – an N=8 re-replication inverting the study’s conclusion – is the most informative measurement: a diagnostic instrument detecting its own instability demonstrates the fragility it was designed to measure. Self-evaluation (97% zero, JSD=0.003) consistently collapses, though floor effects are possible. Output-format confound analysis finds per-strategy aggregate rho=0.89 but per-instance rho=0.219 (p=0.093); PCI reported as preference-convergence metric. We release EPC with all data. The finding is not any single coupling magnitude but the pattern of version-conditional instability that makes single-snapshot evaluator studies unreliable.
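
Jensen-Shannon divergence, one component of EPC, measures how far apart two discrete distributions are and is bounded between 0 and 1 when computed with base-2 logarithms. A minimal sketch follows; how the paper constructs the distributions it compares is not described in the abstract, so the inputs here are illustrative.

```python
from math import log2


def kl_divergence(p, q) -> float:
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q) -> float:
    """Symmetric, bounded divergence between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Identical preference distributions give 0; disjoint ones give 1.
print(jensen_shannon_divergence([0.5, 0.5], [0.5, 0.5]))
print(jensen_shannon_divergence([1.0, 0.0], [0.0, 1.0]))
```

Because the mixture `m` is positive wherever either input is, the divergence stays finite even when one distribution assigns zero mass to an outcome, unlike plain KL divergence.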

[NLP-58] SEVA: Self-Evolving Verification Agent with Process Reward for Fact Attribution ICML2026

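[Quick read]: This paper addresses hallucination as the reliability bottleneck for LLM-based agents. The core difficulty is that existing fact attribution verifiers emit only opaque binary labels, so agents cannot self-correct and operators cannot audit. The paper proposes SEVA, a structured verification agent that emits evidence alignments, step-by-step reasoning chains, calibrated confidence, and a six-category error diagnosis with actionable fixes. To train it, the authors design a process reward that decomposes verification quality into five independent components weighted 70/30 toward process signals. This mitigates the advantage collapse that a standard binary reward triggers on multi-component output, restores the gradient, and induces an implicit curriculum: the agent first masters verification behavior (alignment 0.917-0.997, format 72%-100%) and then outcomes (F1 64.9-69.0). Structured output also enables a Verify-Reflect-Probe-Refine self-evolution loop. Over four rounds on a 7B model, each round produces a benchmark specialist rather than a generalist (+15 pp on HaluEval, -10 to -14 pp on TruthfulQA, persistent at 4x data). On ClearFacts, SEVA-3B matches GPT-4o-mini in F1 (69.0 vs. 69.8) while producing richer, auditable output, supporting the principle that for RL tasks with multi-component generation, reward granularity must match output granularity.

Link: https://arxiv.org/abs/2606.29713
Authors: Aojie Yuan, Yi Nian, Haiyue Zhang, Zijian Su, Yue Zhao
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at AI4GOOD@ICML 2026 and FAGEN@ICML 2026. Code: this https URL

Click to view abstract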

Abstract:Hallucination is the reliability bottleneck for LLM-based agents, and fact attribution verifiers are the last line of defense – yet today’s verifiers emit only opaque binary labels, leaving agents unable to self-correct and operators unable to audit. We present SEVA, a structured verification agent that emits evidence alignments, step-by-step reasoning chains, calibrated confidence, and a six-category error diagnosis with actionable fixes. Training such an agent with RL is non-trivial: standard binary reward on multi-component output triggers advantage collapse – within-group reward variance vanishes and the GRPO gradient disappears. We resolve this with a process reward that decomposes verification quality into five independent components weighted 70/30 toward process signals, restoring the gradient and inducing an implicit curriculum – the agent first masters verification behavior (alignment 0.917 - 0.997, format 72% - 100%), then outcomes (F1 64.9 - 69.0). Structured output further enables a Verify - Reflect - Probe - Refine self-evolution loop, which over four rounds on a 7B model surfaces an unexpected structural finding: each round produces a benchmark-specialist, not a generalist (+15 pp on HaluEval, -10 to -14 pp on TruthfulQA in the same model, persistent at 4x data). On ClearFacts, SEVA-3B matches GPT-4o-mini (69.0 vs. 69.8 F1) while producing substantially richer, auditable output – confirming a principle that should generalize: for any RL task with multi-component generation, reward granularity must match output granularity.
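
The advantage-collapse argument can be illustrated with a toy group-normalised advantage, as used in GRPO-style training. The split into four process components plus one outcome component, and the averaging of process scores, are assumptions of this sketch; the abstract states only that five components are weighted 70/30 toward process signals.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """Group-normalised advantages: (r - mean) / std within one group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)  # no variance means no learning signal
    return [(r - mu) / sigma for r in rewards]


def process_reward(process_scores, outcome, process_weight: float = 0.7):
    """Blend hypothetical process-component scores with the binary outcome."""
    return process_weight * mean(process_scores) + (1 - process_weight) * outcome


# Four rollouts that all reach the wrong verdict (outcome = 0.0).
binary = [0.0, 0.0, 0.0, 0.0]
graded = [
    process_reward([0.9, 1.0, 0.8, 1.0], 0.0),
    process_reward([0.2, 0.5, 0.1, 0.0], 0.0),
    process_reward([0.6, 0.7, 0.5, 1.0], 0.0),
    process_reward([0.0, 0.0, 0.0, 0.0], 0.0),
]
print(group_advantages(binary))
print(group_advantages(graded))
```

With the binary reward every rollout in the group scores the same, so all advantages are zero; the graded reward still separates better-formed verifications from worse ones even when the final verdict is wrong.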

[NLP-59] Why Struggle with Continuous Latents? Interpretable Discrete Latent Reasoning via Rendered Compression

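[Quick read]: This paper addresses the long output sequences and extended inference time that large language models incur when reasoning with explicit chain-of-thought and reinforcement learning. Existing latent reasoning methods reduce this cost by moving computation into a latent space, but continuous latent representations are hard to train and yield unstable, uninterpretable reasoning trajectories. The authors attribute this to a misalignment between continuous-space reasoning and discrete symbolic supervision: continuous states lack explicit anchors for step-by-step alignment. They propose Discrete Latent Reasoning (DLR), described as the first method that converts continuous latent states into explicit discrete tokens. Inspired by render-based compression, DLR renders textual chains of thought into images, extracts visual features, and builds a discrete latent vocabulary via clustering-based fine-tuning. Expanding the vocabulary and output head enables standard autoregressive modeling over both natural-language and latent tokens, supporting pretraining alignment, SFT, and RL. Experiments on five reasoning benchmarks and two model series (Qwen3-VL and LLaMA-3) show that DLR outperforms prior latent reasoning baselines with up to 20x compression, and that the learned latent trajectories retain an interpretable semantic structure.

Link: https://arxiv.org/abs/2606.29712
Authors: Shuochen Chang, Qingyang Liu, Shaobo Wang, Bingjie Gao, Qianli Ma, Haonan Zhao, Yibo Miao, Yulin Sun, Zelin Peng, Jiangtong Li, Li Niu
Affiliations: Shanghai Jiao Tong University; Tongji University
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract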

Abstract: Large language models achieve high reasoning performance via explicit chain-of-thought and reinforcement learning, but require long output sequences and extended inference time. Latent reasoning reduces this cost by shifting computation into a latent space; however, continuous latent methods are hard to train, suffering from unstable and uninterpretable reasoning trajectories. We argue these issues stem from a misalignment between continuous-space reasoning and discrete symbolic supervision, as continuous states lack explicit anchors for step-by-step alignment. To resolve this, we propose Discrete Latent Reasoning (DLR), the first method that converts continuous latent states into explicit discrete tokens. Inspired by render-based compression, we render textual chains of thought into images, extract visual features, and construct a discrete latent vocabulary via clustering-based fine-tuning. Expanding the vocabulary and output head enables standard autoregressive modeling over both natural language and latent tokens, supporting pretraining alignment, SFT, and RL. Experiments on five reasoning benchmarks and two model series (Qwen3-VL and LLaMA-3) confirm that DLR outperforms prior latent reasoning baselines with up to 20× compression. Furthermore, the learned latent trajectories retain an interpretable semantic structure. Overall, discrete latent tokens provide a controllable and interpretable basis for efficient latent reasoning.
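
The step from continuous features to a discrete latent vocabulary can be pictured as nearest-centroid quantisation followed by an offset into an extended token space. This is a sketch under assumptions: the abstract says only that the vocabulary is built via clustering-based fine-tuning, so the nearest-centroid rule, the toy two-dimensional features, and the `base_vocab_size` offset are illustrative, not the paper's procedure.

```python
def nearest_centroid(feature, centroids) -> int:
    """Index of the centroid closest to the feature (squared Euclidean)."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(centroids)), key=lambda i: sqdist(feature, centroids[i]))


def to_latent_tokens(features, centroids, base_vocab_size: int):
    """Map each feature to a token id appended after the text vocabulary."""
    return [base_vocab_size + nearest_centroid(f, centroids) for f in features]


# Toy example: two cluster centres and a text vocabulary of 100 tokens.
centroids = [[0.0, 0.0], [1.0, 1.0]]
features = [[0.1, 0.0], [0.9, 1.2], [0.4, 0.3]]
tokens = to_latent_tokens(features, centroids, base_vocab_size=100)
print(tokens)
```

Placing the latent ids after the existing vocabulary mirrors the abstract's description of expanding the vocabulary and output head so that one autoregressive model can emit both kinds of token.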

[NLP-60] GUICrafter: Weakly-Supervised GUI Agent Leveraging Massive Unannotated Screenshots

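[Quick read]: This paper addresses the data bottleneck for graphical user interface (GUI) agents. Unlike the data behind foundation models, GUI agent data cannot be harvested directly from the internet at scale, so current GUI agents generalize poorly across devices and have limited visual grounding for fine-grained GUI elements. The paper proposes GUICrafter, a weakly-supervised GUI agent trained with a two-stage curriculum. In the first stage, the model learns visual grounding from large-scale unannotated screenshots and webpages, exploiting the contextual signals inherent in GUI interactions without human annotations. In the second stage, a small amount of high-quality data is used to calibrate the model via reinforcement learning. This substantially reduces reliance on human annotation: GUICrafter achieves competitive or even superior performance to advanced systems such as UI-TARS while using only 0.1% of its data, and under the same amount of annotated data it surpasses previous methods such as GUI-R1.

Link: https://arxiv.org/abs/2606.29705
Authors: Sunqi Fan, Lingshan Chen, Runqi Yin, Qingle Liu, Yongming Rao, Meng-Hao Guo, Shi-Min Hu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract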

Abstract: Data, as the fundamental substrate of modern intelligence, has greatly driven the development of current foundation models. Naturally, researchers aim to extend this paradigm to the domain of GUI agents, hoping to build strong GUI agents through a similar paradigm. However, GUI agent data cannot be directly harvested from the internet, making it costly and difficult to collect at scale. As a result, current GUI agents suffer from poor cross-device generalization and limited visual grounding ability for fine-grained GUI elements. As an attempt to address the data challenge in GUI agents, we propose GUICrafter, a weakly-supervised GUI agent leveraging massive unannotated screenshots to substantially reduce the reliance on expensive human annotations. GUICrafter explores a curriculum learning framework for training GUI agents through two progressive stages. First, the model learns visual grounding from large-scale unannotated screenshots and webpages, leveraging the rich contextual signals inherent in GUI interactions without human annotations. Then, in Stage 2, we leverage a small amount of high-quality data to calibrate the model via reinforcement learning. Experiments show that GUICrafter achieves competitive, or even superior, performance to advanced systems like UI-TARS while using only 0.1% of its data. Furthermore, under the same amount of annotated data, GUICrafter surpasses all previous methods such as GUI-R1. Code, data, and models are available at this https URL.

[NLP-61] Can MLLMs Critique Like Humans? Evaluating Open-Ended Aesthetic Reasoning in Multimodal Large Language Models

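[Quick read]: This paper addresses the lack of a sound evaluation standard for open-ended aesthetic critique by multimodal large language models (MLLMs). Most aesthetic evaluation compares models against numeric scores rather than the written critiques people actually give, and open-ended critique has no single correct answer. The paper evaluates model critiques against ranked human references: using the Reddit Photo Critique Dataset, it scores five open-weight MLLMs against multiple ranked human critiques per photo under six prompt conditions that separate persona framing, aspect hinting, length control, and single- versus multi-pass generation, plus an image-grounding control that feeds each model the wrong photograph. The study finds that reference-based similarity gives a misleading picture. Stricter lexical and learned metrics show only weak alignment with human critiques, while a coarse embedding cosine reports broad topical overlap that the grounding control traces to a stable house style rather than image-specific observation. Behaviorally, models diverge from humans in consistent ways: even under a length cap they write two to three times as much, cover nearly every aesthetic aspect where humans are selective, engage each aspect more uniformly and at greater depth, and repeat themselves across critiques of the same photo. The authors argue that reference-based similarity rewards a fluent, comprehensive critique style rather than the selectivity and specificity of human critique, with implications for evaluating and training open-ended multimodal generation.

Link: https://arxiv.org/abs/2606.29689
Authors: Sajjad Ghiasvand, Maryam Amirizaniani, Haniyeh Ehsani Oskouie, Mahnoosh Alizadeh, Ramtin Pedarsani
Affiliations: UCSB; University of Washington; UCLA
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract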

Abstract:Open-ended aesthetic critique is a challenge for multimodal large language models (MLLMs): unlike multiple-choice aesthetic benchmarks, it has no single correct answer, and most aesthetic evaluation has measured models against numeric scores rather than the written critiques people actually give. We evaluate MLLM critiques against ranked human references and ask whether they are close to human ones. Using the Reddit Photo Critique Dataset, we score five open-weight MLLMs against multiple ranked human critiques per photo with reference-based similarity metrics, under six prompt conditions that disentangle persona framing, aspect hinting, length control, and single- versus multi-pass generation, and add an image-grounding control that feeds each model the wrong photograph. We find that reference-based similarity gives a misleading picture. Stricter lexical and learned metrics show only weak alignment with human critiques, while a coarse embedding cosine reports broad topical overlap that the grounding control traces to a stable house style rather than image-specific observation. Behaviorally, the models diverge from humans in consistent ways the scores do not surface: even under a length cap they write two to three times as much, cover nearly every aesthetic aspect where humans are selective, engage each aspect more uniformly and at greater depth, and repeat themselves across critiques of the same photo where humans vary. We argue that reference-based similarity rewards a fluent, comprehensive critique style rather than the selectivity and specificity of human critique, and discuss implications for evaluating and training open-ended multimodal generation.

[NLP-62] How LLMs See Creativity: Zero-Shot Scoring of Visual Creativity with Interpretable Reasoning

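[Quick read]: This paper addresses the long-standing challenge of evaluating the originality of visual images in creativity assessment, in particular how to score visual creativity automatically and understand how AI models arrive at their ratings. The core question is whether multimodal large language models (LLMs) can judge visual creativity zero-shot, without any fine-tuning or examples of human ratings. The approach has the models rate images directly and then analyzes their step-by-step reasoning to see what they attend to, how they balance originality against quality, and how they justify their ratings. Six multimodal LLMs, with no additional training, produced creativity scores that correlate with human ratings on both AI-generated images (r = .57-.68) and hand-drawn sketches (r = .29-.68). The reasoning output offers an interpretable window into the evaluation process, although reasoning did not improve alignment with human ratings.

Link: https://arxiv.org/abs/2606.29672
Authors: William Orwig, Roger E. Beaty
Affiliations: Harvard University; Pennsylvania State University
Categories: Computation and Language (cs.CL)
Comments: 21 pages, 9 figures

Click to view abstract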

Abstract: Evaluating the originality of visual images poses enduring challenges for creativity assessment. Automated scoring using AI models has proven effective in the verbal domain, yet key questions remain about evaluating visual creativity and understanding how models arrive at their ratings. The present research asks whether multimodal large language models (LLMs) can serve as judges of visual creativity zero-shot (without any fine-tuning or examples of human ratings) and whether their “reasoning” output offers an interpretable window into their evaluation process. We tested six multimodal LLMs (Gemini 3 Flash, Gemma 4 31B IT, GPT-5.4 Mini, GLM-5v Turbo, Kimi K2.5, and Qwen 3.6 Plus) on 992 AI-generated images (based on human-written prompts) and 1,500 hand-drawn sketches scored for creativity by human raters. In Study 1, all models showed substantial alignment with human creativity ratings on both datasets (r = .57-.68 on AI-generated images; r = .29-.68 on sketches). In Study 2, we analyzed the step-by-step reasoning processes of three LLMs evaluating the same images and drawings. Although reasoning made model evaluations interpretable – showing what they attend to, how they balance originality vs. quality, and how they justify their ratings – reasoning did not improve alignment with human ratings. In sum, our findings indicate that multimodal LLMs can match human judgments of visual creativity without any additional training, and that their reasoning reveals how AI models evaluate creativity. An open scoring app implementing this pipeline is available at this https URL.
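
The alignment figures in the abstract are correlation coefficients between model scores and human ratings. A minimal sketch of a Pearson correlation follows, with made-up scores; the abstract reports r values but does not name the estimator, so Pearson is an assumption here.

```python
from math import sqrt


def pearson_r(x, y) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical creativity scores for five images (1 = low, 5 = high).
model_scores = [2.0, 3.5, 4.0, 1.5, 5.0]
human_ratings = [2.5, 3.0, 4.5, 1.0, 4.0]
print(round(pearson_r(model_scores, human_ratings), 3))
```

A coefficient in the reported range would indicate that the model orders images similarly to human raters, without implying that the two use the same absolute scale.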

[NLP-63] Resolution Thresholds in VLM Detection of Harmful ASCII Art Across Construction Modes and Languages

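[Quick read]: This paper addresses jailbreak attacks on large vision-language models (VLMs) used for content moderation, in which harmful text is visually encoded as ASCII art to bypass VLM-based detection. The key to the study is a systematic investigation of how image resolution affects VLM detection of harmful ASCII art across eight character construction modes (L1-L8), from dense block characters to word-embedded designs, using eight state-of-the-art VLMs on English and Chinese corpora at ten resolution scales. Detection rates decline sharply above certain resolution thresholds, and word-based construction modes are the most resistant to detection across the full resolution range. The results reveal a systematic, resolution-related vulnerability in VLM-based content moderation and motivate resolution-aware evaluation standards.

Link: https://arxiv.org/abs/2606.29649
Authors: Yikai Hua, Peter West
Affiliations: The University of British Columbia
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 9 figures, 3 tables

Click to view abstract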

Abstract:Large Vision-Language Models (VLMs) are increasingly deployed as content moderation tools, yet they remain vulnerable to jailbreak attacks in which harmful text is visually encoded as ASCII art. This can allow inappropriate or harmful content to bypass moderation systems. To address this vulnerability, this paper investigates how image resolution affects VLM detection of harmful ASCII art across eight character construction modes (L1-L8), ranging from dense block characters to word-embedded designs. We evaluate eight state-of-the-art VLMs on English and Chinese corpora using a pipeline that generates ASCII art images at ten resolution scales, probing whether a consistent detection-failure threshold exists across models, modes, and languages. Results indicate that detection rates decline sharply above certain resolution thresholds, and that word-based modes are the most resistant to detection across the full resolution range. These findings reveal a systematic vulnerability in VLM-based content moderation systems and motivate resolution-aware evaluation standards.

[NLP-64] Two-Stage Prompt Optimization for Few-Shot Relation Extraction: From Reasoning-Guided Search to Gradient-Guided Refinement

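Quick read: This paper targets automatic prompt optimization for episodic few-shot relation extraction with smaller language models, which remains underexplored. The central challenge is improving prompt quality with limited labeled data. The proposed solution is a two-stage framework: the first stage uses a reasoning-based prompt optimizer to make broad, natural-language improvements to the prompt; the second stage applies GradPO (gradient-based prompt optimization), which uses loss and gradient signals to identify high-impact prompt spans and refine them with local edits. Experiments show that local refinement usually improves the prompts found in the first stage and that GradPO is the most consistent refiner. The framework reaches state-of-the-art performance on FS-TACRED with Qwen3-4B and stays competitive on FS-FewRel.

Link: https://arxiv.org/abs/2606.29639
Authors: Aunabil Chakma, Mihai Surdeanu, Eduardo Blanco
Affiliation: University of Arizona
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: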

Abstract:Automatic prompt optimization is still underexplored for episodic few-shot relation extraction with smaller language models. We propose a two-stage framework that combines reasoning-based prompt optimization with gradient-based prompt optimization. The first stage can use any reasoning-based optimizer to make broad prompt improvements in natural language. The second stage applies our GradPO, which uses loss and gradient signals to identify high-impact prompt spans and refine them with local edits. Experiments on FS-TACRED and FS-FewRel show that local refinement usually improves prompts found by the first stage, and GradPO is the most consistent refiner. Our framework achieves state-of-the-art performance on FS-TACRED with Qwen3-4B and remains competitive on FS-FewRel.

[NLP-65] Do We Still Need Fine Tuning? Turkish Sentiment Analysis in the Era of Large Language Model

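Quick read: This paper asks whether supervised fine-tuning is still necessary for Turkish sentiment analysis in the era of large language models (LLMs). It compares classical machine learning methods, fine-tuned pretrained language models, and prompted LLMs on a Turkish e-commerce review dataset labeled negative, neutral, and positive. Fine-tuned BERTurk models perform best overall and outperform every prompted LLM on the full three-class task. The neutral class is the main difficulty: several LLMs are much more competitive in binary positive/negative classification, but in the three-class setting they degrade substantially because they collapse neutral reviews into the polarized categories. The key conclusion is that, for realistic Turkish sentiment classification, prompted LLMs in the zero-shot setting do not yet match supervised fine-tuning, and that including the neutral class is crucial for robust evaluation.

Link: https://arxiv.org/abs/2606.29614
Authors: Sercan Karakaş, Yusuf Şimşek
Affiliation: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to the 34th IEEE Signal Processing and Communications Applications Conference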

Abstract:This study examines whether supervised fine-tuning remains necessary for Turkish sentiment analysis in the era of large language models. We compare classical machine learning methods, fine-tuned pretrained language models, and prompted large language models on a Turkish e-commerce review dataset with negative, neutral, and positive labels. Fine-tuned BERTurk models perform best overall and outperform all prompted large language models in the full three-class task. The neutral class emerges as the main difficulty: while several large language models are much more competitive in binary positive–negative classification, they degrade substantially in the three-class setting by collapsing neutral reviews into polarized categories. The findings suggest that, in realistic Turkish sentiment classification, prompted large language models do not yet match supervised fine-tuning in the zero-shot setting, and that including the neutral class is crucial for robust evaluation.

[NLP-66] How much of an LLM-generated clinical corpus is actually new? A production-scale measurement of content redundancy for provenance classification

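Quick read: This paper examines redundancy in LLM-generated corpora for clinical machine learning, asking whether the scale figures such corpora are described by (for example, total generated tokens) reflect their actual information content. Of 2.51 billion generated tokens, only 10.9% are trainable-unique content and 79.4% are redundant, so the raw token count overstates information content roughly ninefold. The redundancy arises through two mechanisms: verbatim copying of source context into per-item fields, and duplication of generated text across records; only the former is losslessly removable. A model-free analysis based on lossless compression independently recovers both mechanisms. One pipeline channel carries almost no redundancy, which shows that redundancy depends on how each channel is structured rather than being a fixed property of LLM extraction. Uncorrected redundancy up-weights the longer, more complex presentations that generate the most items, skewing the corpus's token-level training distribution. In a controlled downstream test, de-duplicating the corpus before adaptation improved a clinical encoder on external disease-recognition benchmarks at an equal token budget, robustly across adaptation depths and replicated on a second benchmark, so the redundancy has a measurable cost beyond storage. The authors introduce Provenance-based Redundancy Decomposition and release the classification tool openly.

Link: https://arxiv.org/abs/2606.29605
Authors: Ali H. Lazem, William J. Teahan
Affiliation: Bangor University
Categories: Computation and Language (cs.CL)
Comments: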

Abstract:Clinical machine learning increasingly relies on training corpora generated by large language models (LLMs) rather than annotated by clinicians, and such corpora are described and reused largely on the basis of their reported scale. We test whether volume reflects information content. Analysing the complete output of a multi-agent clinical extraction pipeline applied to 167,034 patient narratives, 2.51 billion generated tokens across the ten text-bearing channels of an eleven-channel pipeline, we introduce Provenance-based Redundancy Decomposition, a token-level classification of the entire output by source. Only 10.9% of the output is trainable-unique content while 79.4% is redundant; raw token count overstates information content by roughly ninefold. The redundancy arises through two distinct mechanisms, verbatim copying of source context into per-item fields, and duplication of generated text across records, of which only the former is losslessly removable. An independent, model-free analysis based on lossless compression confirms the redundancy, recovering the two mechanisms without reference to the provenance labels. One pipeline channel carries almost no redundancy, showing that the level of redundancy depends on how each channel is structured rather than being a fixed property of LLM extraction. Because uncorrected redundancy up-weights the longer, more complex presentations that generate the most items, it skews the token-level training distribution of the corpus, a property we measure directly. In a controlled downstream test, de-duplicating the corpus before adaptation improved a clinical encoder on external disease-recognition benchmarks at equal token budget, robustly across adaptation depths and replicated on a second benchmark, confirming that the redundancy carries a measurable cost beyond storage. The classification tool is released openly.
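
The "roughly ninefold" figure follows directly from the 10.9% unique share, and the idea behind a compression-based redundancy check can be illustrated in a few lines. The sketch below is only an illustration under stated assumptions: the function names and the toy records are invented here, `zlib` stands in for whatever lossless compressor the authors used, and none of this is the released classification tool.

```python
import zlib


def overstatement_factor(unique_fraction: float) -> float:
    """Raw tokens reported per token of unique content."""
    return 1.0 / unique_fraction


def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size; lower means more redundancy."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)


# Figure from the abstract: 10.9% of generated tokens are trainable-unique.
factor = overstatement_factor(0.109)  # about 9.2

# Toy records (invented): the same source context copied verbatim into
# every per-item field, versus the context stored once.
context = "Patient presents with intermittent chest pain radiating to the left arm. "
items = ["finding: chest pain", "finding: radiation to left arm", "plan: ECG"]
copied = "".join(context + item + "\n" for item in items)
stored_once = context + "\n".join(items) + "\n"

print(f"overstatement factor: {factor:.2f}")
print(f"ratio with verbatim copies: {compression_ratio(copied):.2f}")
print(f"ratio with context stored once: {compression_ratio(stored_once):.2f}")
```

The copied variant compresses to a much smaller fraction of its raw size, which is the signal a model-free redundancy measurement relies on.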

[NLP-67] MAM-AI: An On-Device Medical Retrieval-Augmented Generation System for Nurses and Midwives in Zanzibar

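Quick read: This paper addresses persistently high maternal and newborn mortality in sub-Saharan Africa, where midwifery care is often delivered by nurses without midwifery training to international standards and where authoritative guidance is hard to consult at the point of care because the guidelines are long and connectivity is intermittent. It presents MAM-AI, a medical question-answering assistant that runs entirely on a commodity Android device, fully offline. A lightweight embedding model (EmbeddingGemma, 300M) performs on-device retrieval over a curated corpus of 87 guideline documents (63,650 passages), and a 4B int4 generator (Gemma 4 E4B) produces answers with citations; no query leaves the device. The evaluation finds that on-device retrieval is essentially solved: the 300M embedder ranks third of seven retrievers and rivals cloud systems. The generator is the open problem: at 4B parameters it cannot be both helpful and safe, and the more helpful of two same-size candidates commits genuinely dangerous errors. The authors therefore deploy the candidate that is more faithful to its sources (as faithful as a frontier model) and recover helpfulness with a redesigned prompt that cuts deflection from 33% to 3%. Corpus quality is also decisive: answers are specific and actionable where the corpus holds the right passage and vague where it does not. MAM-AI is released as an evaluated, open-source research prototype.

Link: https://arxiv.org/abs/2606.29580
Authors: Yi Ren
Affiliation: École Polytechnique Fédérale de Lausanne; Switzerland
Categories: Computation and Language (cs.CL)
Comments: 36 pages. Video demo: this https URL ; browser demo, code, models, and benchmarks linked in the paper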

Abstract:Maternal and newborn mortality remain among the highest in sub-Saharan Africa, where midwifery care is often delivered by nurses who lack midwifery training to international standards, and consulting authoritative guidance at the point of care is hard: the guidelines are long and connectivity is intermittent. We present MAM-AI, a medical question-answering assistant for nurse-midwives in Zanzibar that runs entirely on a commodity Android device: a question is embedded (EmbeddingGemma, 300M) and matched against a curated corpus of 87 guideline documents (63,650 passages), then answered with citations by a 4B int4 generator (Gemma 4 E4B), fully offline, with no query leaving the device. We evaluate the exact deployed configuration with a layered methodology – retriever, generator under oracle context, end-to-end, and latency – scored by LLM judges validated against physician rubrics. The evaluation relocates the hard problem. On-device retrieval is essentially solved: the 300M embedder ranks third of seven retrievers and rivals cloud systems, so the passages the system needs are usually found. The small generator is what remains in doubt: adding retrieved context does not improve its answers, and at 4B it cannot be both helpful and safe at once – of two same-size candidates, the more helpful one commits genuine dangerous errors, so we deploy the other, which is about twice as faithful to its sources (as faithful as a frontier model), and recover its helpfulness with a redesigned prompt that cuts deflection from 33% to 3%. Corpus quality is decisive for the same reason: where the corpus holds the right passage the answer is specific and actionable, and where it does not it goes vague. MAM-AI is a thoroughly evaluated, open-source research prototype, not a fielded product; the system, knowledge base, benchmarks, and evaluation harness are released.

[NLP-68] Anisotropy Decides Cosine vs. Rank Metrics for Text Embeddings

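Quick read: This paper addresses a long-standing question about similarity metrics for text embeddings: why some metrics beat cosine similarity on certain encoders and datasets, and what decides when. The key finding is that the geometry of the embedding space determines the best metric. The study evaluates nineteen parameter-free similarity metrics on nineteen encoders, from compact sentence transformers to seven-billion-parameter large language models, across seven datasets. When an encoder spreads its variance evenly across directions, cosine is the best parameter-free choice; when variance concentrates in a few dominant directions (anisotropy), rank-based and L1-type metrics beat cosine by a clear margin. A single number, the fraction of variance held by the most dominant dimension, predicts how much the alternatives help, with a rank correlation of 0.86 and a linear correlation of 0.95. Projecting out the dominant directions restores cosine and nearly removes the advantage of the other metrics, but only on encoders that were anisotropic to begin with; the effect also survives unit-length normalization, so it is directional rather than magnitude based. Geometry, not training method, is therefore the deciding factor, and the paper offers a one-number diagnostic for when cosine is not the right tool.

Link: https://arxiv.org/abs/2606.29571
Authors: V.S. Raghu Parupudi
Affiliation: University of California, San Diego
Categories: Computation and Language (cs.CL)
Comments: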

Abstract:The standard way to compare two text embeddings is cosine similarity. Scattered studies report that a different metric does better, but never pin down the geometric condition that decides when, or why. We settle both with a comprehensive empirical study: nineteen parameter-free similarity metrics on nineteen encoders, from compact sentence transformers up to seven-billion-parameter large language models, across seven datasets. The answer is geometric. When an encoder spreads its variance evenly across directions, cosine is the best parameter-free choice and no other metric helps by a usable margin. When the variance concentrates into a few dominant directions, a property known as anisotropy, rank-based and L1-type metrics beat cosine by a clear margin. The absolute gain is modest, but because cosine starts low on these encoders it is a sizable relative improvement, around twenty percent on average and largest where cosine is weakest. What decides this is the geometry of the embedding space, not how the model was trained: where the two disagree, the metric follows the geometry. One number, the fraction of variance held by the single most dominant dimension, predicts how much the alternatives help across all nineteen encoders, with a rank correlation of 0.86 and a linear correlation of 0.95. To test this as the cause rather than a correlate, we project out the dominant directions: cosine recovers and the advantage of the other metrics nearly vanishes, but only on the encoders that were anisotropic to begin with. The effect is directional, not magnitude based, since it survives normalizing every vector to unit length. Among parameter-free metrics, then, cosine is the right tool wherever an encoder is well spread, which includes the fine-tuned embedders commonly deployed for retrieval, and we give a one-number diagnostic for when it is not.
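
The one-number diagnostic and the projection test described in the abstract can be sketched with NumPy. This is a rough illustration rather than the paper's code: it assumes that "most dominant dimension" means the top principal direction of the centered embeddings, uses a Spearman-style rank correlation as one example of a rank-based metric, and runs on synthetic Gaussian data; all function names are hypothetical.

```python
import numpy as np


def top_variance_fraction(X: np.ndarray) -> float:
    """Share of total variance carried by the top principal direction."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s ** 2
    return float(var[0] / var.sum())


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def rank_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Spearman-style metric: Pearson correlation of coordinate ranks."""
    ru = np.argsort(np.argsort(u)).astype(float)
    rv = np.argsort(np.argsort(v)).astype(float)
    return float(np.corrcoef(ru, rv)[0, 1])


def project_out(X: np.ndarray, k: int = 1) -> np.ndarray:
    """Remove the k dominant principal directions from centered embeddings."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    top = Vt[:k]
    return Xc - Xc @ top.T @ top


rng = np.random.default_rng(0)
isotropic = rng.normal(size=(500, 32))
anisotropic = isotropic.copy()
anisotropic[:, 0] *= 20.0  # synthetic dominant direction

frac_iso = top_variance_fraction(isotropic)
frac_aniso = top_variance_fraction(anisotropic)
frac_fixed = top_variance_fraction(project_out(anisotropic, k=1))
print(frac_iso, frac_aniso, frac_fixed)
```

On this synthetic data the diagnostic is small for the well-spread matrix, large for the stretched one, and small again once the dominant direction is projected out, which mirrors the intervention the abstract describes.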

[NLP-69] SurrogateShield: Beyond Redaction for High-Utility Privacy-Preserving LLM Interactions

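Quick read: This paper addresses the privacy risk that arises when LLM-based assistants send user queries containing personally identifiable information (PII) verbatim to third-party API endpoints. The prevailing mitigation, placeholder redaction, hides sensitive values but damages semantic coherence and therefore response quality. The proposed solution is SurrogateShield, a client-side proxy that replaces detected PII with locally generated, type-consistent surrogate values before transmission and restores the originals in the response, so that no real PII leaves the user's device. Detection runs through a three-stage cascade (PatternScan, EntityTrace, ContextGuard) covering 22 PII types and quasi-identifier combinations grounded in Sweeney's k-anonymity framework. Surrogate-to-original mappings are sealed in an AES-256-GCM encrypted, per-conversation ShadowMap that stays on the device. On a 1,124-query corpus the cascade reaches an overall F1 of 98.87%, and surrogate substitution improves BERTScore over placeholder redaction by 13.26 percentage points (from 81.59% to 94.85%). In a 100-query adversarial trial, a prompted LLM adversary recovered no original values from surrogate-substituted messages.

Link: https://arxiv.org/abs/2606.29567
Authors: Sherwin Vishesh Jathanna
Affiliation: Arizona State University
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: 14 pages, 1 figure, 9 tables. Code and dataset: this https URL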

Abstract:LLM-based assistants transmit user queries verbatim to third-party API endpoints that lie outside the user’s audit or control. When those queries contain personally identifiable information (PII), the data persists on remote infrastructure subject to breach, subpoena, or policy change. Placeholder redaction (the prevailing mitigation) suppresses PII at the cost of semantic coherence, producing structurally degraded queries and correspondingly degraded responses. We present SurrogateShield, a client-side proxy that substitutes detected PII with locally generated, type-consistent surrogate values prior to transmission and restores originals in the response. No real PII crosses the network boundary. Detection runs through a three-stage cascade (PatternScan, EntityTrace, and ContextGuard) covering 22 PII types and quasi-identifier combinations grounded in Sweeney’s k-anonymity framework. Surrogate-to-original mappings are sealed in an AES-256-GCM encrypted per-conversation ShadowMap that never leaves the device. Evaluations on a 1,124-query corpus demonstrate that the cascade reliably detects PII, achieving an overall F1 score of 98.87%. Surrogate substitution substantially outperforms placeholder redaction in semantic utility, yielding a 13.26 pp improvement in BERTScore (roberta-large), from 81.59% to 94.85%. Within this corpus, the local pipeline restricted real PII transmission across all tested query types; in a 100-query adversarial trial, a prompted LLM adversary recovered no original values from surrogate-substituted messages.
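
The substitute-then-restore round trip can be shown with a toy example. The sketch below makes several simplifying assumptions that are not from the paper: it detects only email addresses with a single regular expression (standing in for the three-stage cascade), generates surrogates with a fixed `userN@example.com` pattern, and keeps the mapping in a plain dictionary instead of an AES-256-GCM encrypted ShadowMap. All names are hypothetical.

```python
import re

# Simplified email pattern; a stand-in for a real multi-stage PII detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def substitute(text: str, shadow_map: dict) -> str:
    """Replace each detected email with a type-consistent surrogate."""
    def repl(match: re.Match) -> str:
        original = match.group(0)
        # Reuse the surrogate if this value was already seen.
        for surrogate, value in shadow_map.items():
            if value == original:
                return surrogate
        surrogate = f"user{len(shadow_map) + 1}@example.com"
        shadow_map[surrogate] = original
        return surrogate
    return EMAIL_RE.sub(repl, text)


def restore(text: str, shadow_map: dict) -> str:
    """Map surrogates in a response back to the original values."""
    for surrogate, original in shadow_map.items():
        text = text.replace(surrogate, original)
    return text


shadow_map = {}  # stays on the client in this sketch
query = "Draft a reply to alice.smith@corp.example and cc alice.smith@corp.example."
outbound = substitute(query, shadow_map)

# Simulated remote response that echoes the surrogate.
response = "Done. The reply is addressed to user1@example.com."
print(outbound)
print(restore(response, shadow_map))
```

Because the surrogate is still a well-formed email address, the outbound query keeps its structure, which is the property the abstract credits for the BERTScore gain over placeholder redaction.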

[NLP-70] Coverage-Driven KV Cache Eviction for Efficient and Improved Inference of LLM

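Quick read: This paper addresses the deployment cost of large language models (LLMs) caused by the memory footprint of the key-value (KV) cache during inference. Existing KV-cache eviction strategies exploit attention sparsity to reduce memory, but they noticeably hurt performance on long-context reasoning tasks. The authors trace this drop to reduced coverage of unique tokens, and show theoretically that reduced coverage limits the mutual information between inputs and outputs, impairing predictive accuracy. They propose K-VEC, a coverage-aware KV-cache eviction strategy whose core components are a cross-head and a cross-layer coverage module that improve token retention across attention heads and model layers. On 16 LongBench subsets, K-VEC improves over existing methods by up to 10.35 points under the same eviction rate and memory constraint, supporting its use for efficient LLM deployment in resource-constrained settings.

Link: https://arxiv.org/abs/2606.29563
Authors: Shuvendu Roy, Mengyao Zhai, Hossein Hajimirsadeghi, Golnoosh Samei
Affiliation: RBC Borealis; BorealisAI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: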

Abstract:Large language models (LLMs) excel at complex tasks like question answering and summarization, thanks to their ability to handle long-context inputs. However, deploying LLMs is costly, not only due to the high computational demands of quadratic complexity of self-attention and auto-regressive generation, but also because of the significant memory overhead required for storing the key-value (KV) cache during inference. To reduce the memory cost, existing KV-cache eviction strategies leverage the sparsity in attention to selectively store a subset of tokens. While reducing the memory footprint, such approaches show a considerable drop in performance, especially in tasks that require long-context reasoning. We identify that the drop in performance is linked to a reduction in the coverage of unique tokens. Additionally, we theoretically show that reduced coverage limits the mutual information between inputs and outputs, thereby impairing predictive accuracy. To this end, we introduce K-VEC, a novel coverage-aware KV-cache eviction strategy that prioritizes token coverage while evicting tokens in the cache. K-VEC introduces a cross-head and a cross-layer coverage module to enhance token retention across attention heads and model layers, mitigating performance degradation caused by low coverage. Evaluated on 16 LongBench subsets, K-VEC exhibits up to 10.35 points improvement over the existing methods under the same eviction rate and memory constraint. Comprehensive evaluations validate the effectiveness of our approach and demonstrate its potential for efficient LLM deployment in resource-constrained settings.
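
To make the coverage idea concrete, here is a toy comparison between per-head top-k retention and a simple cross-head, coverage-aware variant. This is not K-VEC: the abstract does not specify the algorithm, so the discounting rule, the `penalty` parameter, and the score matrix below are assumptions chosen only to show how independent per-head selection can keep the same few positions while a coverage-aware rule spreads the budget over more unique tokens.

```python
import numpy as np


def topk_per_head(scores: np.ndarray, budget: int) -> list:
    """Baseline: each head keeps its `budget` highest-scoring positions."""
    return [set(np.argsort(-row)[:budget].tolist()) for row in scores]


def coverage_aware(scores: np.ndarray, budget: int, penalty: float = 0.5) -> list:
    """Heads pick in turn; positions kept by earlier heads are discounted."""
    kept_count = np.zeros(scores.shape[1])
    kept = []
    for row in scores:
        adjusted = row * (penalty ** kept_count)
        chosen = np.argsort(-adjusted)[:budget]
        kept_count[chosen] += 1
        kept.append(set(chosen.tolist()))
    return kept


def coverage(kept: list) -> int:
    """Number of unique positions retained across all heads."""
    return len(set().union(*kept))


# Toy attention scores: 3 heads x 6 cached positions (invented values).
scores = np.array([
    [0.9, 0.8, 0.7, 0.1, 0.1, 0.1],
    [0.9, 0.8, 0.6, 0.5, 0.1, 0.1],
    [0.9, 0.8, 0.1, 0.1, 0.6, 0.5],
])

baseline_cov = coverage(topk_per_head(scores, budget=2))
aware_cov = coverage(coverage_aware(scores, budget=2))
print(baseline_cov, aware_cov)
```

With the same per-head budget, the baseline retains only two unique positions while the coverage-aware rule retains all six in this toy case.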

[NLP-71] AURORA: Asymmetry and Update-Induced Rotation for Robust Hallucination Detection in Large Language Models

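Quick read: This paper addresses factual errors caused by hallucination when large language models (LLMs) are used in high-stakes applications. Existing hallucination detectors typically rely on costly output-level consistency checks or on static hidden-state probes that capture shallow, dataset-specific patterns and degrade substantially under cross-dataset evaluation. The proposed framework, AURORA, shifts the focus from static representations to the weight-gradient dynamics of the model. Its key insight is that hallucinated and faithful answers induce qualitatively different gradient update patterns: hallucinated samples trigger asymmetric, structurally misaligned gradients. AURORA captures this with two complementary features: (1) the skewness of the cosine similarity distribution between weight matrices and their gradient update directions, and (2) the rotation ratio, which uses singular value decomposition (SVD) to quantify how much a gradient update reorients the singular-vector basis of a weight matrix. AURORA achieves strong detection performance across four model families and four benchmark datasets, scales across model sizes, and transfers to out-of-domain tasks including mathematical reasoning and vision-language scenarios.

Link: https://arxiv.org/abs/2606.29545
Authors: Zishuai Zhang, Hainan Zhang, Zhiming Zheng
Affiliation: Beihang University; Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, Beihang University
Categories: Computation and Language (cs.CL)
Comments: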

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing tasks. However, their tendency to generate hallucinations, namely factually incorrect or unfaithful outputs, poses a critical obstacle to their deployment in high-stakes applications. Although recent hallucination detection methods have made encouraging progress, they typically rely on costly output-level consistency checks or static hidden-state probes that capture shallow dataset-specific patterns, leading to substantial degradation under cross-dataset evaluation. In this work, we propose AURORA, a novel hallucination detection framework that shifts the focus from static representations to the weight-gradient dynamics of LLMs. Our key insight is that hallucinated and faithful answers induce qualitatively different gradient update patterns on the model’s parameters. Specifically, hallucinated samples trigger asymmetric and structurally misaligned gradients, which can be captured through two complementary features: (1) the skewness of the cosine similarity distribution between weight matrices and their gradient update directions, and (2) the rotation ratio, which quantifies how much the gradient update reorients the singular-vector basis of weight matrices via SVD. AURORA achieves strong hallucination detection performance across four model families and four benchmark datasets. Further analyses demonstrate that our method scales effectively across model sizes and transfers to out-of-domain tasks, including mathematical reasoning and vision-language scenarios.

[NLP-72] Em-ergence of the em-dash: a population-level rise in em-dash frequency in medRxiv preprints at the dawn of the large-language-model era

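Quick read: This paper asks whether large language models have produced a measurable stylistic change in scientific writing, focusing on the frequency of the em-dash (Unicode U+2014). Using a pre-registered design, it analyzes the full text of 69,632 medRxiv preprints deposited from 2020 to 2025, with the presence of at least one em-dash in the Discussion section as the primary endpoint, and compares prevalence before and after the release of ChatGPT (30 November 2022). Em-dash prevalence in Discussion sections rose from 4.23% to 11.58%, an absolute increase of 7.35 percentage points (95% CI 6.94-7.77). The rise was a gradual, delayed acceleration: near 4% through 2023, 8.0% in 2024, and 20.3% in 2025. The result held up under the sensitivity analyses, both falsification tests, and within-paper section comparisons. The design cannot establish causality, and the em-dash is a population-level indicator rather than a per-paper detector of LLM use, but the findings indicate that how scientific literature is written changed markedly in the early 2020s.

Link: https://arxiv.org/abs/2606.29540
Authors: Przemysław Czuma
Affiliation: unknown
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 22 pages, 5 figures. Pre-registered on OSF ( this http URL ). Companion to a pre-registered audit of Unicode fidelity in biomedical bibliographic APIs ( arXiv:2606.24897 )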

Abstract:Large language models (LLMs) can leave subtle stylistic traces in assisted text; one of the most cited is the em-dash (Unicode U+2014). Yet no one has measured whether em-dash use has changed in the scientific literature. This study, pre-registered on the Open Science Framework (HFT8C), used the full set of medRxiv full-text XML preprints from the official Text-and-Data-Mining resource. The primary cohort was first, original versions deposited 2020-2025 with an extractable Discussion section of at least 500 characters (N = 69,632). The primary endpoint was the presence of at least one em-dash in the Discussion; the principal measure was the absolute change in its prevalence between the pre-ChatGPT era (before 30 November 2022) and the post-ChatGPT era, estimated with a logistic model with standard errors clustered by first author. The analysis plan (six supporting analyses, six sensitivity analyses, two falsification tests) was frozen before any confirmatory result was computed. Em-dash prevalence in Discussion sections rose from 4.23% before ChatGPT to 11.58% afterward, an absolute increase of 7.35 percentage points (95% CI 6.94-7.77; odds ratio 2.96, 95% CI 2.77-3.17). The rise was not a sharp jump but a gradual, delayed acceleration: near 4% through 2023, 8.0% in 2024, and 20.3% in 2025. The effect survived every feasible sensitivity analysis (7.35-7.60 pp) and both falsification tests; a placebo split within the pre-LLM era showed no meaningful change (+0.13 pp, 95% CI -0.33 to +0.58), and was essentially absent in boilerplate sections. Independent LLM-associated lexical markers and within-paper section comparisons pointed the same way. The em-dash is a population-level indicator, not a per-paper detector of LLM use, and the design cannot establish causality; it shows that something in how scientific literature is written changed markedly in the early 2020s, and roughly when.
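
The primary endpoint and the headline effect sizes are simple enough to reproduce as arithmetic. The sketch below is an illustration, not the study's analysis code: the helper names and the four toy strings are invented, and the odds ratio is the unadjusted value computed from the two rounded prevalences, whereas the paper's estimate comes from a logistic model with clustered standard errors.

```python
EM_DASH = "\u2014"  # Unicode U+2014


def has_em_dash(discussion: str) -> bool:
    """Primary endpoint: at least one em-dash in the Discussion text."""
    return EM_DASH in discussion


def prevalence(discussions: list) -> float:
    return sum(has_em_dash(d) for d in discussions) / len(discussions)


def odds(p: float) -> float:
    return p / (1.0 - p)


# Prevalences reported in the abstract.
pre, post = 0.0423, 0.1158
abs_change_pp = (post - pre) * 100        # 7.35 percentage points
odds_ratio = odds(post) / odds(pre)       # unadjusted, from rounded inputs

# Toy corpus: only the first string contains U+2014; hyphens and
# en-dashes (U+2013) do not count.
toy = ["a \u2014 b", "no dash here", "x-y", "en \u2013 dash"]
toy_prevalence = prevalence(toy)

print(round(abs_change_pp, 2), round(odds_ratio, 2), toy_prevalence)
```

The unadjusted odds ratio comes out at roughly 2.97, in line with the reported 2.96.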

[NLP-73] Preference-ASR: A Preference-Aware Test Set for Benchmarking ASR in the Era of Speech LLMs INTERSPEECH2026

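Quick read: This paper addresses a gap in automatic speech recognition (ASR) evaluation: popular test sets adopt inconsistent conventions for numbers, disfluencies, entities, and casing, and standard normalizers erase the format distinctions users care about, so current benchmarks cannot measure whether a model follows user preferences for output style. The proposed solution is PreferenceASR, a test set that evaluates how well ASR systems follow natural-language preference instructions across four categories: normalization, entities, disfluencies, and case. It is built from seven open-source corpora through a two-stage LLM-assisted pipeline with human verification, and is scored with a preference-aware normalizer that selectively skips the normalization steps matching the active instruction. Benchmarking four models shows that rankings shift across preference types, exposing quality differences that traditional evaluation obscures.

Link: https://arxiv.org/abs/2606.29534
Authors: Nithin Rao Koluguri, Sasha Meister, Nikolay Karpov, Piotr Zelasko, Desh Raj, Jagadeesh Balam, Boris Ginsburg
Affiliation: NVIDIA
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted at Interspeech 2026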

Abstract:Popular ASR test sets adopt inconsistent conventions for numbers, disfluencies, entities, and casing, while standard normalizers erase the format distinctions users care about. Current benchmarks therefore cannot measure whether a model follows user preferences for output style. We introduce PreferenceASR, a test set evaluating ASR systems on their ability to follow natural-language preference instructions across four categories: normalization, entities, disfluencies, and case. Built from seven open-source corpora via a two-stage LLM-assisted pipeline with human verification, it is evaluated with a preference-aware normalizer that selectively skips steps matching the active instruction. Benchmarking four models shows rankings shift across preference types, exposing quality differences traditional evaluation obscures. We publicly release the dataset.
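
A preference-aware normalizer of the kind described above can be sketched as a pipeline in which the step matching the active instruction is skipped, so that the corresponding format distinction survives into scoring. The step names, the filler list, and the example sentence are assumptions made for illustration; the paper's normalizer and its exact categories may differ.

```python
import re
import string


def lowercase(text: str) -> str:
    return text.lower()


def strip_punct(text: str) -> str:
    return text.translate(str.maketrans("", "", string.punctuation))


def drop_fillers(text: str) -> str:
    # Tiny hypothetical filler list; real disfluency handling is richer.
    return re.sub(r"\b(?:uh|um)\b", " ", text, flags=re.IGNORECASE)


# Hypothetical mapping from preference category to normalization step.
STEPS = {"case": lowercase, "punctuation": strip_punct, "disfluencies": drop_fillers}


def normalize(text: str, active_preference=None) -> str:
    """Apply every step except the one the active instruction is testing."""
    for name, step in STEPS.items():
        if name == active_preference:
            continue
        text = step(text)
    return " ".join(text.split())


sample = "Um, I met Dr. Smith."
print(normalize(sample))                  # all steps applied
print(normalize(sample, "case"))          # casing preserved
print(normalize(sample, "disfluencies"))  # fillers preserved
```

Under a standard normalizer, a hypothesis with wrong casing would score the same as a correct one; skipping the `case` step when the instruction concerns casing lets the metric see that difference.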

[NLP-74] Do Models Read What They Write? Causal Registers in Scratchpad Reasoning

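Quick read: This paper asks whether a model actually computes from the intermediate states it writes in a scratchpad, rather than merely displaying them, which is a precondition for process supervision to help with alignment. The authors use a controlled state-tracking task with a known update rule and compare models trained to report only the final state with models trained to write intermediate states before answering. At evaluation they edit the internal representation of one written state while leaving the visible scratchpad text fixed; because the transition rule is known, the edit has a single correct downstream consequence. In Qwen2.5-Coder-7B, the state-writing model predicts the next phase bit implied by the edited state on 80% and 91% of held-out examples across the two task variants, while pretrained and final-answer-only controls remain near baseline. Additional controls rule out generic next-token steering and copying of another continuation: the prediction depends on both the edited state and the current move. The pattern replicates across model families. The paper therefore proposes a sharper goal for scratchpad oversight: not only making intermediate reasoning legible, but training written states that the model actually uses in its computation.

Link: https://arxiv.org/abs/2606.29522
Authors: Benjamin Shih, John Winnicki, Eric Darve
Affiliation: Stanford University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: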

Abstract:A central hope behind process supervision is that models can expose intermediate variables that matter for their later behavior. For this to help with alignment, a scratchpad must be tied to the computation: when the model writes a state, later steps should compute from that state. To test this requirement, we use a controlled state-tracking task with a known update rule, comparing models trained to report only the final state with models trained to write intermediate states before giving the final answer. At evaluation, we edit the internal representation of one written state while leaving the visible scratchpad text fixed. Because the transition rule is known, the edit has a single correct downstream consequence. In Qwen2.5-Coder-7B, the state-writing model predicts the next phase bit implied by the edited state on 80% and 91% of held-out examples across the two task variants, while pretrained and final-answer-only controls remain near baseline. Additional controls rule out generic next-token steering and copying another continuation: the prediction depends on both the edited state and the current move. The same causal-use pattern replicates across model families. Together, these results suggest a sharper goal for scratchpad oversight: not just to make intermediate reasoning legible, but to train written states that the model uses as part of its computation.

[NLP-75] The Verbose Context Problem in Medical Records ICML2026

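Quick read: This paper addresses the verbose context problem, which occurs when structured concepts have token-inefficient textual representations. The bottleneck is acute in population health, where cohort-level analysis of longitudinal patient records requires reasoning over thousands of medically coded events, often exceeding 400K tokens in total. The authors present PopMedQA, a benchmark that isolates the problem through computational tasks on groups of longitudinal patient records, built with neopatient, a new library for language-controlled generation of artificial patient records. Extensive ablations, including prompting strategies, prompt compression, and agentic decomposition, show that domain-independent methods fail to alleviate the problem, leaving significant opportunity to exploit domain-specific structure in language model inputs for population-scale reasoning.

Link: https://arxiv.org/abs/2606.29503
Authors: Shiva Kaul, Min-Gyu Kim, Anjum Khurshid, Sriram Vishwanath
Affiliation: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: SD4H ICML 2026 Spotlight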

Abstract:The verbose context problem occurs when structured concepts have token-inefficient textual representations. This bottleneck is acute in population health: cohort-level analysis of longitudinal patient records requires reasoning over thousands of medically-coded events, often exceeding 400K tokens in total. We present PopMedQA, a benchmark isolating this problem through computational tasks on groups of longitudinal patient records. We construct the benchmark using neopatient, a new library for language-controlled generation of artificial patient records. Through extensive ablations – including prompting strategies, prompt compression, and agentic decomposition – we find that domain-independent methods fail to alleviate the verbose context problem. There remains significant opportunity to exploit domain-specific structure in language model inputs for population-scale reasoning.

[NLP-76] UCOB: Learning to Utilize and Evolve Agentic Skills via Credit-Aware On-Policy Bidirectional Self-Distillation

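Quick read: This paper addresses a basic difficulty in using skill memory for agentic reinforcement learning: retrieved skills are not oracular, and the same skill can help in one state while misleading the policy in another. This makes fragile the common assumption that a skill-conditioned prompt can serve as a fixed teacher for the no-skill prompt. The proposed framework, UCOB, learns to utilize and evolve skills through credit-aware on-policy bidirectional self-distillation. It treats skill-conditioned and no-skill prompts as two on-policy context views of the same model, compares their return-to-go within the same task and anchor state, and uses the higher-return view as the local teacher. This local credit signal internalizes useful skill-conditioned behavior, corrects misleading skill usage, and guides skill memory updates, utility-aware retrieval, and reflection self-training. On ALFWorld, WebShop, and Search-QA, UCOB outperforms skill-free RL, skill-memory baselines, and self-distillation methods across model scales, with gains of up to 23.5 and 18.0 points over SOTA baselines on ALFWorld and WebShop. Ablations further validate its core mechanisms and efficiency.

Link: https://arxiv.org/abs/2606.29502
Authors: Songjun Tu, Chengdong Xu, Qichao Zhang, Yiwen Ma, Yaocheng Zhang, Linjing Li, Dong Li, Xiangyuan Lan, Dongbin Zhao
Affiliation: Institute of Automation, Chinese Academy of Sciences; Pengcheng Laboratory; Memorax AI
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: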

Abstract:Skill memories can improve agentic reinforcement learning by reusing past experience as textual guidance, but retrieved skills are not oracular: they may help in one state while misleading the same policy in another. This makes the common privileged-teacher assumption fragile, namely that a skill-conditioned prompt can be treated as a fixed teacher for the no-skill prompt. We introduce UCOB, a framework for learning to utilize and evolve agentic skills via credit-aware on-policy bidirectional self-distillation. UCOB treats skill-conditioned and no-skill prompts as two on-policy context views of the same model, compares their return-to-go within the same task and anchor state, and uses the higher-return view as the local teacher. This local credit signal internalizes useful skill-conditioned behavior, corrects misleading skill usage, and guides task/state skill memory updates, utility-aware retrieval, and reflection self-training. Experiments on agentic tasks, including ALFWorld, WebShop, and Search-QA, show that UCOB outperforms skill-free RL, skill-memory baselines, and self-distillation methods across model scales, with up to 23.5 and 18.0 point gains over SOTA baselines on ALFWorld and WebShop. Ablations and analyses further validate its core mechanisms and efficiency.

[NLP-77] Which Tokens Need Context? A Reference-Based Analysis of Translation Responsibility Using Fertility and Entropy

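Quick read: This paper asks whether machine translation systems use context selectively, the way human translators do. Existing analyses of context usage rely on discourse-specific test sets or on model internals, which makes them narrow or model-dependent. The authors propose a post-hoc, model-agnostic framework that quantifies context sensitivity at the lexical and syntactic levels with two measures derived from word alignments: fertility (the number of target tokens generated per source token) and entropy (the stability of fertility patterns across contexts). Using human reference translations for three language pairs (German ↔ English, English → Hindi) under four context conditions, they show that context selectively redistributes generative responsibility from source tokens to context tokens without altering overall fertility. Function words such as pronouns and auxiliaries show the largest fertility reductions while content words remain stable, suggesting that context resolves ambiguity rather than adding new information. The framework gives a ground-truth characterisation of selective context usage in human translation and a diagnostic baseline for evaluating machine translation models.

Link: https://arxiv.org/abs/2606.29489
Authors: Ramakrishna Appicharla, Baban Gain, Santanu Pal, Asif Ekbal
Affiliation: Indian Institute of Technology Patna; Wipro AI Lab
Categories: Computation and Language (cs.CL)
Comments: This is a work in progress. An extended version with machine translation output analysis and attention correlation is in preparation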

Abstract:When humans translate, not every word depends equally on the surrounding context. Some tokens, particularly function words like pronouns and auxiliaries, rely heavily on preceding or following sentences, while others, such as proper nouns, do not. Understanding this inherent context sensitivity is essential for evaluating whether machine translation systems use context in human-like ways. However, existing approaches to analysing context usage rely on discourse-specific test sets or model internals, making them narrow or model-dependent. We propose a post-hoc, model-agnostic framework to quantify context sensitivity at lexical and syntactic levels using two measures derived from word alignments: fertility (number of target tokens generated per source token) and entropy (stability of fertility patterns across contexts). Using reference translations for three language pairs (German ↔ English, English → Hindi) under four context conditions, we show that context selectively redistributes generative responsibility from source to context tokens without altering overall fertility. Function words show the largest fertility reductions, while content words remain stable, suggesting that context resolves ambiguity rather than adding new information. Our framework provides a ground-truth characterisation of selective context usage in human translation, establishing a diagnostic baseline for evaluating machine translation models.
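
The two alignment-derived measures can be sketched as follows. The definitions here are one plausible reading of the abstract, not the paper's exact formulas: fertility counts the alignment links leaving each source token, and entropy is the Shannon entropy of the fertility values observed for a token across contexts. The alignment links and fertility values are invented.

```python
import math
from collections import Counter


def fertility(alignment: list, num_source_tokens: int) -> list:
    """Target tokens aligned to each source token.

    `alignment` is a list of (source_index, target_index) links.
    """
    counts = Counter(src for src, _ in alignment)
    return [counts.get(i, 0) for i in range(num_source_tokens)]


def fertility_entropy(fertilities: list) -> float:
    """Shannon entropy (bits) of the fertility values seen for one token."""
    counts = Counter(fertilities)
    total = sum(counts.values())
    return 0.0 - sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy alignment for a 3-token source sentence: token 0 aligns to one
# target token, token 1 to two, token 2 to none.
links = [(0, 0), (1, 0), (1, 1)]
per_token = fertility(links, num_source_tokens=3)

# Fertility of one token type observed under four context conditions.
unstable = fertility_entropy([0, 1, 1, 2])
stable = fertility_entropy([1, 1, 1, 1])
print(per_token, unstable, stable)
```

A token whose fertility is the same in every context has zero entropy, while one whose fertility shifts with context has higher entropy, which is the kind of contrast the abstract draws between content words and function words.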

[NLP-78] To Reason or to Fabricate: Reasoning Without Shortcuts via Hint-Anchored Pairwise Aggregation

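[Quick read]: This paper addresses shortcut learning caused by Pre-RL data overlap when reinforcement learning (RL) is used to improve the reasoning of large language models (LLMs). When RL training data overlaps with pretraining or supervised fine-tuning (SFT) corpora, the model can earn reward by recalling memorized answers instead of reasoning, and then fabricates a plausible post-hoc rationale. The proposed framework, HIPPO, combines hint injection with a tailored pairwise reward model. Injecting hints into the input deliberately triggers the overlap-induced shortcut behavior, and the resulting traces serve as explicit anchors for pairwise comparison, which yields highly discriminable preference signals. A lightweight judge model can then reliably separate genuine deduction from shortcut-driven rationalization, and the pairwise formulation gives more stable and robust optimization than standard PRMs. Experiments show substantial gains over standard baselines and effective generalization to out-of-distribution tasks, indicating that HIPPO extracts transferable reasoning skills rather than superficial shortcut patterns.

Link: https://arxiv.org/abs/2606.29481
Authors: Jiuheng Lin,Chen Zhang,Yansong Feng
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: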

Abstract:While reinforcement learning (RL) significantly enhances LLM reasoning, its efficacy is severely undermined by Pre-RL data overlap, where RL datasets overlap with pretraining or SFT corpora, causing models to exploit shortcuts by memorizing correct answers and fabricating post-hoc reasoning. To address this, we introduce HIPPO, a novel RL framework that integrates hint-injected aggregation with a tailored pairwise reward model. By utilizing hint injection to deliberately trigger overlap-induced behaviors, the resulting traces naturally serve as explicit anchors for pairwise comparison. This provides highly discriminable preference signals, enabling a lightweight judge model to reliably distinguish genuine reasoning deduction from shortcut-driven rationalization, while the pairwise formulation ensures stable and robust optimization compared to standard PRMs. Extensive experiments demonstrate that HIPPO yields substantial improvements over standard baselines and generalizes effectively to out-of-distribution general tasks, showing it extracts authentic, transferable reasoning skills rather than superficial shortcut patterns.

[NLP-79] Interpretable Inverse Design of Metal-Organic Frameworks with Large Language Model Agents

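[Quick read]: This paper targets the inverse design of metal-organic frameworks (MOFs), where the combinatorial search space is vast, property labels are expensive, and existing machine-learning models offer little interpretability. The core challenge is to explore the structure space efficiently and produce high-performing MOFs without prior knowledge of the global property landscape. The proposed closed-loop framework, LLM4MOF, uses two cooperating language-model agents. One proposes interpretable, chemistry-grounded design hypotheses over metal nodes, linkers, pore geometry, and functional chemistry. The other translates these hypotheses into structural constraints that select candidate MOFs, each made of a metal node, an organic linker, and a matching topology. Each hypothesis is tested through four diagnostic beams that apply different subsets of its constraints, which separates the effects of geometry, chemistry, and metal choice on performance. Within 400 property evaluations, the framework concentrates its search on top-performing structures across six adsorption, separation, and electronic-structure tasks, without training a model per objective. The same loop also generates new MOFs de novo, validates them in live simulation, adapts pore geometry to each requested condition, and outperforms random search and a genetic algorithm. The study shows that language-model agents can run interpretable, simulation-grounded inverse design without task-specific training.

Link: https://arxiv.org/abs/2606.29459
Authors: Kyungmin Nam,Seunghee Han,Jihan Kim
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: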

Abstract:Inverse design of metal-organic frameworks (MOFs) requires searching a combinatorially vast space where property labels are expensive and most machine-learning models reveal little about why a structure succeeds. We introduce LLM4MOF, a closed-loop framework in which language-model agents reason about chemistry, build candidate MOFs, and test them in simulation, refining hypotheses over ten autonomous iterations. One agent proposes interpretable design hypotheses over metal nodes, linkers, pore geometry, and functional chemistry, and a second translates them into constraints that select candidate MOFs, each made of a metal node, organic linker, and matching topology. Each hypothesis is tested through four diagnostic beams that apply different subsets of its constraints, so comparing them shows whether geometry, chemistry, or metal choice drives performance. Even when blind to the global property landscape of databases, LLM4MOF concentrates its search on top-performing structures across six adsorption, separation, and electronic-structure tasks within 400 property evaluations. The same loop also generates new MOFs de novo and validates them in live simulation, where it adapts the geometry to each requested condition, outperforming random search and a genetic algorithm at roughly 1 per campaign. LLM4MOF shows that language-model agents can run interpretable, simulation-grounded inverse design without training a model per objective.

[NLP-80] Closing the Activation-Cone Blind Spot: Response-Time Probing and Unified Defense

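[Quick read]: This paper addresses the lack of a systematic comparison of inference-time safety defenses for large language models (LLMs) across attack types (GCG, AutoDAN, DeepInception, prefilling, and intent laundering). The central finding is that prompt-time activation defenses are structurally blind to prefilling attacks. The key contribution is response-time probing, a linear probe on the model's hidden state at the first generated tokens, which reaches AUROC 0.97-1.00. Combined with a halt, it cuts prefilling attack success to 0/40 with no benign false positives. The paper further proves that any defense that gates intervention on a single layer's activation alignment with a benign reference (cone, subspace, or null-space) cannot detect attacks that craft activations to lie inside that reference. Composing the response-halt with AlphaSteer's null-space steering gives an orthogonal split (the halt catches prefilling attacks, null-space steering handles semantic attacks) and reaches defense success of 0.983 on Mistral and 0.994 on Llama, better than either component alone. The paper also argues that MMLU does not capture the true utility cost of steering, which appears as behavioral hedging rather than factual loss, and that diverse negative training sets cut probe false positives from 80-100% to near zero.

Link: https://arxiv.org/abs/2606.29441
Authors: Subhadip Mitra
Affiliations: Independent Researcher
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: 27 pages, 12 figures, 18 tables. Code and data: this https URL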

Abstract:Inference-time safety methods for large language models have proliferated, yet no systematic comparison exists. We evaluate five defense paradigms (no defense, static steering, CAST, AlphaSteer, probe-gated) across seven instruction-tuned models (7-31B) and five attack types (GCG, AutoDAN, DeepInception, prefilling, intent laundering). Our central finding: prompt-time activation defenses are structurally blind to prefilling attacks. AlphaSteer achieves 0% attack success on GCG, AutoDAN, and intent laundering but 50% on prefilling. We prove a corollary: any defense that gates intervention on a single layer’s activation alignment with a benign reference (cone, subspace, or null-space) is blind to attacks that craft activations to lie inside that reference, whether checked at prompt time or per token. As its constructive contrapositive we introduce response-time probing: a linear probe on the model’s hidden state at the first generated tokens, with AUROC 0.97-1.00 across all seven models. Combined with a halt, it cuts prefilling attack success to 0/40 on every model with 0% benign false positives, outperforming Llama Guard 3. Cross-template generalisation depends on probe depth, so we scope the claim to the canonical prefilling-template family. Composing the response-halt with AlphaSteer’s null-space steering gives an orthogonal split (the halt catches prefilling, AlphaSteer catches semantic attacks), reaching defense success 0.983 on Mistral and 0.994 on Llama and dominating both components. We further show MMLU fails to capture steering’s true utility cost, which appears as behavioral hedging rather than factual loss, and that diverse negative training sets cut probe false positives from 80-100% to near zero. Code, attacks, per-sample results, and the judge prompt are released.
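
The probe-and-halt idea can be sketched as a logistic-regression probe over hidden states plus a threshold check. Everything below is an assumption for illustration: the hidden states are synthetic Gaussian vectors, and the names `train_probe` and `should_halt`, the dimension, and the threshold are hypothetical rather than taken from the paper's released code.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def probe_score(state, weights, bias):
    """Probability that a hidden state belongs to the harmful class."""
    return sigmoid(sum(w * x for w, x in zip(weights, state)) + bias)

def train_probe(states, labels, lr=0.5, steps=200):
    """Fit a logistic-regression probe with batch gradient descent."""
    dim = len(states[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(states)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for state, label in zip(states, labels):
            err = probe_score(state, weights, bias) - label
            grad_b += err
            for i, x in enumerate(state):
                grad_w[i] += err * x
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def should_halt(state, weights, bias, threshold=0.5):
    """Stop generation when the probe flags the first-token state."""
    return probe_score(state, weights, bias) > threshold

# Synthetic stand-ins for hidden states (assumption: real states would be
# read from a chosen layer of the model at the first generated tokens).
rng = random.Random(0)
benign = [[rng.gauss(-1.0, 1.0) for _ in range(8)] for _ in range(100)]
harmful = [[rng.gauss(1.0, 1.0) for _ in range(8)] for _ in range(100)]
states = benign + harmful
labels = [0] * 100 + [1] * 100
weights, bias = train_probe(states, labels)
```

A deployment along these lines would call `should_halt` once per response, right after the first tokens are generated, and stop decoding if it returns true.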

[NLP-81] EntroRouter: Learning Efficient Model Routing via Entropy Regulation

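[Quick read]: This paper addresses Trust Region Collapse in model routing, a failure mode caused by the deep coupling of reasoning and routing. Under sparse supervision, strong pre-training priors dominate, capable experts are systematically suppressed, and training settles into degenerate local optima. The proposed solution, EntroRouter, is a single-round routing framework that decouples reasoning from routing and treats entropy regulation as a core objective. It first initializes the policy with Soft Supervision, which establishes a high-entropy prior that encourages exploration. It then applies a Soft Anchor, which uses offline capability estimates to contract entropy in a controlled way inside a safe trust region and so stabilizes reinforcement learning. Experiments show that EntroRouter retains 98.3% of the strongest expert's accuracy while reducing computational cost by 48.25%.

Link: https://arxiv.org/abs/2606.29424
Authors: Kaiyi Zhang,Xueliang Zhao,Zhuocheng Gong,Wei Wu,Yankai Lin
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Ant International; The University of Hong Kong; Ant Group
Categories: Computation and Language (cs.CL)
Comments: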

Abstract:Model routing balances solution accuracy and computational cost by selecting among models of varying capabilities. While recent multi-round frameworks interleave reasoning and planning, we identify a structural failure mode termed Trust Region Collapse. We demonstrate that the deep coupling of reasoning and routing, exacerbated by the dominance of strong pre-training priors under sparse supervision, leads to degenerate local optima where capable experts are systematically suppressed. To decouple these processes, we propose EntroRouter, a single-round routing framework that treats entropy regulation as a core objective. We first initialize the policy via Soft Supervision, fitting a distribution of suitable models to establish a high-entropy prior for exploration. Subsequently, we stabilize Reinforcement Learning using a Soft Anchor, which utilizes offline capability estimates to orchestrate controlled entropy contraction within a safe trust region. Extensive experiments demonstrate that EntroRouter retains 98.3% of the strongest expert’s accuracy while reducing computational costs by 48.25%.
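
To make the entropy vocabulary concrete, here is a small sketch of a soft routing target, its entropy, and a KL penalty toward an anchor distribution. It is only one plausible reading: spreading probability uniformly over the models that solve a query, and using KL divergence as the anchor term, are assumptions of this example and not the paper's stated formulation.

```python
import math

def soft_target(correct_flags):
    """Spread probability uniformly over the models that solved the query.
    If no model solved it, fall back to a uniform distribution."""
    n_ok = sum(correct_flags)
    if n_ok == 0:
        return [1.0 / len(correct_flags)] * len(correct_flags)
    return [1.0 / n_ok if ok else 0.0 for ok in correct_flags]

def entropy(dist):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def kl_divergence(policy, anchor, eps=1e-12):
    """KL(policy || anchor), usable as a penalty that keeps the routing
    policy close to an anchor built from offline capability estimates."""
    return sum(p * math.log(p / max(q, eps)) for p, q in zip(policy, anchor) if p > 0)

target = soft_target([True, True, False])   # two of three models are suitable
hard_label = [1.0, 0.0, 0.0]                # a one-hot label for comparison
```

The soft target keeps entropy above zero, whereas the one-hot label has zero entropy, which is the contrast the Soft Supervision step appears to rely on.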

[NLP-82] LC-ICL: Label-Guided Contrastive In-Context Learning for Robust Information Extraction

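[Quick read]: This paper addresses a limitation of few-shot information extraction (IE) with large language models (LLMs): in-context learning (ICL) demonstrations typically use only correct (positive) examples, which limits generalization. Existing methods ignore the error-cause information carried by incorrect (negative) examples, so the model cannot learn to recognize and avoid typical error patterns. The proposed method, LC-ICL, builds demonstrations that mix positive and negative examples, with the negative examples annotated by error-cause labels. These labels expose why a prediction fails, so the model can avoid repeating similar errors at inference time. The method draws on the contextual information in hard negative samples and in the nearest positive neighbors of the test sample, and assembles them into structured demonstrations for named entity recognition (NER) and relation extraction (RE). Experiments on several datasets show that LC-ICL outperforms previous few-shot in-context learning methods.

Link: https://arxiv.org/abs/2606.29407
Authors: Xiao You,Tianwei Yan,Shan Zhao
Affiliations: Hefei University of Technology; Chongqing Jiaotong University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: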

Abstract:There has been increasing interest in exploring the capabilities of advanced large language models (LLMs) in the field of information extraction (IE), specifically focusing on tasks related to named entity recognition (NER) and relation extraction (RE). Although researchers are exploring the use of few-shot information extraction through in-context learning with LLMs, they tend to focus only on using correct or positive examples for demonstration, neglecting the potential value of incorporating incorrect or negative examples into the learning process. In this paper, we present LC-ICL, a novel few-shot technique that leverages both correct and incorrect sample constructions to create in-context learning demonstrations. This approach enhances the ability of LLMs to extract entities and relations by combining positive samples with negative samples annotated by error-cause labels. These labels expose more detailed error features in erroneous examples, enabling the model to understand why similar predictions fail and avoid repeating such errors during inference. In addition, our proposed method taps into the inherent contextual information and valuable information in hard negative samples and the nearest positive neighbors to the test sample and then applies the in-context learning demonstrations based on LLMs. Our experiments on various datasets indicate that LC-ICL outperforms previous few-shot in-context learning methods, delivering substantial enhancements in performance across a broad spectrum of related tasks. These improvements are noteworthy, showcasing the versatility of our approach in diverse scenarios.
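
A contrastive demonstration of the kind described above might be assembled as follows. This is a sketch under stated assumptions: the prompt wording, the dictionary keys, and the error-cause label "boundary error" are hypothetical, and the retrieval of hard negatives and nearest positive neighbors is left out.

```python
def build_contrastive_prompt(positives, negatives, test_sentence):
    """Assemble an in-context prompt from correct examples and from
    incorrect examples annotated with an error-cause label."""
    lines = ["Extract the named entities from each sentence."]
    for ex in positives:
        lines.append(f"Sentence: {ex['text']}")
        lines.append(f"Correct output: {ex['output']}")
    for ex in negatives:
        lines.append(f"Sentence: {ex['text']}")
        lines.append(f"Incorrect output: {ex['output']}")
        lines.append(f"Error cause: {ex['error_cause']}")
    lines.append(f"Sentence: {test_sentence}")
    lines.append("Correct output:")
    return "\n".join(lines)

# Hypothetical demonstrations, written for this example only.
positives = [{"text": "Ada Lovelace lived in London.",
              "output": "PER: Ada Lovelace; LOC: London"}]
negatives = [{"text": "The Bank of England raised rates.",
              "output": "LOC: England",
              "error_cause": "boundary error"}]
prompt = build_contrastive_prompt(positives, negatives, "Marie Curie worked in Paris.")
```

The resulting string would be sent to the LLM as-is; the negative block tells the model both what went wrong and why.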

[NLP-83] Cross-Temporal Sinhala OCR: Page-Level Adaptation and Diachronic Analysis

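[Quick read]: This paper addresses the absence of publicly available real-world datasets for page-level optical character recognition (OCR) of Sinhala. All previous evaluations of Sinhala OCR models relied on artificially generated data, which does not reflect the complexity and variety of real documents. To fill this gap, the authors build sinhala-ocr-lk-acts-1010, a dataset of 1,010 real page images with annotated transcriptions from Sri Lankan Legislative Acts published in 1981-1989 and 2000-2019, split 707/101/202 into training, validation, and test sets. Deep-learning vision-language models (DeepSeek-OCR V1, DeepSeek-OCR V2, and LightOnOCR-2-1B) are fine-tuned with QLoRA on consumer and cloud GPUs and evaluated on real documents. LightOnOCR-2-1B achieves a character error rate (CER) of 1.05% across all test examples, ahead of open-source models such as Surya-OCR (8.84%) and Tesseract v5 (10.69%) and commercial models such as Google Document AI (2.06%). It maintains consistent performance across print periods, even on severely degraded documents.

Link: https://arxiv.org/abs/2606.29378
Authors: Avisha Dilhara,Nevidu Jayatilleke
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 6 pages, 4 figures, 7 tables, Accepted paper at the 12th Moratuwa Engineering Research Conference (MERCon) 2026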

Abstract:Sinhala is a morphologically rich abugida spoken by roughly 16 million people in Sri Lanka, and to date, there are no publicly available real-world datasets for page-level Sinhala OCR. All previous studies for assessing Sinhala OCR models have used artificially generated data. To bridge the gap, we introduce sinhala-ocr-lk-acts-1010, an annotated dataset of 1,010 page-level images and their transcriptions collected from Sri Lankan Legislative Acts published between 1981-1989 and 2000-2019, split into 707 training examples, 101 validation examples, and 202 testing examples. Three models based on deep learning-based visual language processing, namely DeepSeek-OCR V1, DeepSeek-OCR V2, and LightOnOCR-2-1B, are fine-tuned using QLoRA in 8 experiments conducted on consumer and cloud GPUs. LightOnOCR-2-1B is the top performer, achieving a CER of 1.05% across all test examples, outperforming state-of-the-art open-source OCR models such as Surya-OCR (8.84%) and Tesseract v5 (10.69%), as well as commercially available OCR models such as Google Document AI (2.06%). Our results suggest that LightOnOCR-2-1B outperforms other baselines on real-world OCR tasks and maintains consistent performance across all print periods, even when documents are severely degraded.
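
For readers unfamiliar with the metric, CER is commonly computed as the edit distance between the reference and the OCR output, divided by the reference length. The sketch below follows that common definition and counts Unicode code points; whether the paper normalizes text or counts grapheme clusters instead (which matters for an abugida such as Sinhala) is not stated here, so treat that choice as an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)
```

For example, `cer("abcd", "abxd")` is one substitution over four reference characters.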

[NLP-84] TriageRA-CCF: Source-Side Clinical Confidence and Coverage Signals for Adaptive Rank Budgeting in Medical LLMs

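[Quick read]: This paper addresses the limited adaptability of the fixed low-rank budget used in parameter-efficient fine-tuning of medical large language models. Medical questions differ substantially in confidence, clinical coverage, and cross-domain difficulty, so a fixed budget cannot fit every question. The proposed method, TriageRA-CCF, is a source-side teacher for adaptive rank budgeting. It combines three signals computed only from source training data (base-model answer confidence, metadata-cell clinical coverage, and a counterfactual close-miss proxy) to supervise a straight-through rank-budget router. For each question, the router decides whether to activate a small, medium, or large subset of LoRA rank channels, with budget-cost, entropy, and rank-balance regularization to prevent unstable choices and wasted capacity. Under a matched CMB-source training protocol, TriageRA-CCF achieves the best average accuracy among the LoRA, DoRA, and MoELoRA baselines on both Qwen3-8B and Llama3.1-8B, with average gains of +0.21 and +0.16 points over the strongest external baseline. Ablations show that each signal contributes, although their combination is not monotonically best on every backbone.

Link: https://arxiv.org/abs/2606.29375
Authors: Shucan Ji,Yining Huang,Hongliang Guo
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: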

Abstract:Medical large language models are commonly adapted with a fixed low-rank budget, even though medical questions differ substantially in confidence, clinical coverage, and cross-domain difficulty. We study adaptive rank budgeting for parameter-efficient medical question answering: for each question, the adapter decides whether to activate a small, medium, or large subset of LoRA rank channels. The central challenge is that a naive adaptive budget router can collapse to unstable choices or spend capacity without improving shifted benchmarks. We propose TriageRA-CCF, a source-side teacher for adaptive rank-budgeted LoRA. It combines three signals computed only from source training data: base-model answer confidence, metadata-cell clinical coverage, and a counterfactual close-miss proxy. These signals supervise a straight-through budget router over active ranks {2, 4, 8}, together with budget-cost, entropy, and rank-balance regularization. Under a matched CMB-source training protocol, TriageRA-CCF achieves the best average accuracy among LoRA, DoRA, and MoELoRA baselines on both Qwen3-8B and Llama3.1-8B. The gains are modest and non-uniform across benchmarks: +0.21 average points over the strongest external baseline on Qwen3-8B and +0.16 on Llama3.1-8B. Component ablations show that confidence, coverage, and counterfactual signals all provide useful budget supervision, but their combination is not monotonically best on every backbone.
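
The forward pass of a rank-budgeted LoRA adapter can be sketched as follows: a router picks one of the rank budgets, and only the first `r` rank channels contribute to the update. This is an illustrative reading, not the paper's code. Choosing the budget by argmax, using the first `r` channels, and the names `choose_rank` and `lora_delta` are assumptions, and the straight-through gradient and the regularizers are omitted.

```python
RANK_CHOICES = (2, 4, 8)  # small, medium, large budgets from the abstract

def choose_rank(router_logits):
    """Hard budget choice: the rank whose router logit is largest."""
    best = max(range(len(RANK_CHOICES)), key=lambda i: router_logits[i])
    return RANK_CHOICES[best]

def lora_delta(x, A, B, active_rank):
    """Low-rank update B[:, :r] @ (A[:r, :] @ x) using only the first
    `active_rank` rank channels. A is (max_rank x d_in), B is (d_out x max_rank)."""
    hidden = [sum(a * xi for a, xi in zip(A[k], x)) for k in range(active_rank)]
    return [sum(B[row][k] * hidden[k] for k in range(active_rank))
            for row in range(len(B))]

# Toy adapter weights, chosen so the effect of the budget is easy to read.
A = [[1.0, 0.0] for _ in range(8)]   # 8 rank channels, input dimension 2
B = [[1.0] * 8, [2.0] * 8]           # output dimension 2, 8 rank channels
x = [1.0, 2.0]
rank = choose_rank([0.1, 2.0, 0.3])  # the middle logit wins
delta = lora_delta(x, A, B, rank)
```

With these toy weights each active channel adds the same amount, so the size of `delta` scales directly with the chosen budget.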

[NLP-85] Deterministic Decisions for High-Stakes AI. A Zero-Egress Pipeline with the Deployability of RAG and the Accuracy of Machine Learning

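[Quick read]: This paper identifies intervention bias as a previously unquantified failure mode of zero-shot large language model (LLM) educational advisory agents: they recommend action in cases where intervention is unnecessary. Without task-specific training, a zero-shot LLM frequently recommends intervention even when the hindsight-optimal oracle policy mandates inaction, which produces many false positives. The key to the solution is supervised policy learning. A trajectory-conditioned ONNX Decision Transformer (DT) and a snapshot XGBoost classifier are trained on the same oracle-labelled trajectories under strict prefix-only features, and both reach near-zero calibration error. The DT reaches macro-F1 0.79 (macro-recall 0.85) across five action classes, still predicts the rare load-reduction action, and shows a 0% action flip rate and sub-5 ms CPU decision latency. The paper also reports an Evaluation Gap: LLM-as-judge scoring (DeepEval G-Eval) does not detect intervention bias and instead rewards fluent over-prescription, which points to the need for evaluation that targets decision quality.

Link: https://arxiv.org/abs/2606.29280
Authors: Craig Atkinson
Affiliations: Verificate Pty Ltd
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 41 pages, 11 tables, no figures. Preprint intended for submission to EDM 2027 / LAK 2027. Includes a reproducibility package: trained ONNX Decision Transformer, generic training script, OULAD evaluation scripts, and per-arm results CSVs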

Abstract:We identify intervention bias as a previously unquantified failure mode of zero-shot large-language-model (LLM) educational advisory agents: without task-specific training, they recommend action when a hindsight-optimal oracle policy mandates inaction. In a six-arm ablation on the Open University Learning Analytics Dataset (N=800 students, four temporal cutoffs), at day 56 – when the oracle designates 70.1% of students as needing no intervention – zero-shot GPT-4o recommends action for 73%, a 43 percentage-point false-positive rate. Commercial RAG and SQL-augmented retrieval are comparably miscalibrated; at 10,000 students this implies about 4,300 unnecessary advisor contacts per cycle. Supervised policy learning eliminates this bias: a trajectory-conditioned ONNX Decision Transformer (DT) and a snapshot XGBoost classifier, trained on the same oracle-labelled trajectories under strict prefix-only features, both achieve near-zero calibration error. The DT reaches macro-F1 0.79 (macro-recall 0.85) across all five action classes, predicting even the rare load-reduction action without collapsing, at a 0% action flip rate and sub-5 ms CPU decision latency. The two supervised arms are on par; the DT’s edge over XGBoost at the final cutoff is indicative only (unpaired across cohorts). Scope: we validate Stage-2 decision-making (EAV state vector to supervised policy) under controlled oracle input from structured OULAD data; high fidelity reflects feature-oracle alignment, not general high-stakes-AI capability. The most robust finding is the intervention-bias contrast, not the absolute accuracies. We also show an Evaluation Gap: LLM-as-judge scoring (DeepEval G-Eval) is blind to intervention bias, rewarding fluent over-prescription rather than decision quality.

ACM classes: I.2.6; I.2.7; K.3.1; H.2.8
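
The headline figures in the abstract can be checked with a few lines of arithmetic. The sketch assumes that the 43-point figure is the gap between the rate at which zero-shot GPT-4o recommends action (73%) and the rate at which the oracle does (100% minus 70.1%); the abstract does not spell out the derivation, so this is one reading that happens to reproduce both numbers.

```python
def excess_action_rate(llm_action_rate, oracle_no_action_rate):
    """Gap between how often the LLM recommends action and how often
    the oracle does, expressed as a fraction of all students."""
    oracle_action_rate = 1.0 - oracle_no_action_rate
    return llm_action_rate - oracle_action_rate

excess = excess_action_rate(0.73, 0.701)   # about 0.431, i.e. 43 points
unnecessary_contacts = excess * 10_000     # scaled to 10,000 students
```

Rounded to the nearest hundred, the scaled value matches the "about 4,300 unnecessary advisor contacts per cycle" quoted in the abstract.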

[NLP-86] Manufactured Confidence: How Memory Consolidation Turns Hearsay into Confident Facts

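[Quick read]: This paper addresses misjudged credibility caused by memory rewriting in LLM agents that reason across turns and sessions. When an agent's memory consolidates an initially hedged remark (for example, "possibly" or "reportedly") into a flat fact, the system treats it as authoritative and acts on it unconditionally, even if the information was never verified or corrected. No external attacker is needed. The paper finds that the agent responds not to the source of a claim but to the confidence of its phrasing: hedged statements are discounted, flat assertions are obeyed, and the effect does not depend on any special keyword. Common mitigations fail: a passive "unverified" tag is ignored, and an active "do not trust this" instruction escalates even correct memory. The effective fix lies in the memory store, which should keep the original tentative phrasing rather than upgrade it to a confident fact. The deployable lesson is that a single load-bearing memory is the main hazard, and one redundant source restores correct decisions.

Link: https://arxiv.org/abs/2606.29279
Authors: Alex Kwon
Affiliations: Independent Researcher
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 16 pages, 16 tables, 1 figure. Code: this https URL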

Abstract:LLM agents carry conclusions across steps and sessions in compressed memory, and memory products (e.g., mem0, LangMem) rewrite conversation into stored “facts” that later steps trust. We show this rewriting manufactures confidence: across our constructed agent settings, a casual, hedged remark becomes a confident, dated assertion the agent then obeys like a verified fact, granting every above-clearance request it faces. No attacker is needed: a role that was true once and never corrected is stored as a flat fact and acted on like a deliberate injection. We then isolate what the agent responds to. It is not the source: attributed, unattributed, and even forged “system of record” claims all grant alike. It is the confidence of the phrasing. A hedge is discounted, a flat assertion is obeyed, and this holds with no special keyword. Not all hedges are equal, though: the evidential register is the least-discounted, with “reportedly” obeyed like a flat assertion on most models. The obvious fixes fail. A passive “unverified” tag is ignored, and an active “do not trust this” instruction escalates even correct memory, so it is safe only by refusing to decide. The real fix lives in the store: keep the tentative phrasing rather than upgrade it. But that is hygiene, not a defense against an attacker who can simply write a confident lie. The deployable lesson is narrower and constructive: a single load-bearing memory is the hazard, and one redundant source restores correct decisions. We release the harness and demonstrations.

[NLP-87] The Complexity Ceiling Benchmark: A Multi-Domain Evaluation of Sequential Reasoning Under Depth Scaling ICML2026

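[Quick read]: This paper studies how the reasoning accuracy of large language models (LLMs) decays as the number of required sequential steps grows. Models do well on short reasoning chains, but accuracy drops markedly on tasks that need many consecutive steps, and this decay has not been systematically quantified or modelled. The paper introduces the Complexity Ceiling Benchmark (CCB), which fixes a task's semantic content and varies only its depth (N from 5 to 50) across three structurally distinct regimes: grounded spatial state-tracking, abstract symbolic pointer manipulation, and transitive relational inference. All models show geometric per-step decay with widely separated domain ceilings. On the first two regimes the strongest models retain p_d of about 0.92 through N=50, while on the third every model collapses by N=5, and the best model's 50%-success horizon is only about 4.7 steps. A trace-level metric, TFBC, shows that 14.5% of correct answers are reached via incorrect intermediate reasoning. Forced verbose state-tracking does not raise the ceiling, and k*, the mean step at which reasoning first diverges, predicts within-domain accuracy better than parameter count. Together, CCB and the geometric decay model reduce a model's long-horizon reasoning profile to one interpretable number per task family.

Link: https://arxiv.org/abs/2606.29278
Authors: Shubh Chapra,Dhruv Kumar,Murari Mandal,Yash Sinha
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 12 pages, 6 figures. Accepted to the 1st Workshop on Combining Theory and Benchmarks (CTB), CTB@ICML 2026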

Abstract:We introduce the Complexity Ceiling Benchmark (CCB), a controlled evaluation of how language-model reasoning decays as the number of required sequential steps grows. CCB fixes the semantic content of a task and varies only its depth N ∈ {5, …, 50} across three structurally distinct regimes: grounded spatial state-tracking, abstract symbolic pointer manipulation, and transitive relational inference. Across 6,000 trials over five frontier and open-weight LLMs we find a consistent pattern of geometric per-step decay with widely separated domain ceilings: on the first two regimes the strongest models retain p_d ≈ 0.92 across N=50; on the third every model collapses by N=5, with the best model’s 50%-success horizon at H_0.5 ≈ 4.7 steps despite p_d = 0.863. A trace-level metric (TFBC) shows that 14.5% of correct answers across the benchmark are reached via incorrect intermediate reasoning. Forced verbose state-tracking does not move the ceiling (McNemar p=1.000), and the mean step at which reasoning first diverges, k*, predicts within-domain accuracy better than parameter count. CCB and the geometric decay model together reduce a model’s long-horizon reasoning profile to one interpretable number per task family.
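
The reported horizon is consistent with a simple geometric model in which the probability of a fully correct chain at depth N is the per-step value raised to the power N. The sketch below assumes that model and that the abstract's pd is this per-step value; both are readings of the abstract rather than confirmed definitions. Under those assumptions, the 50%-success horizon is ln(0.5) / ln(p_d).

```python
import math

def success_at_depth(p_d, depth):
    """Chain success probability under geometric per-step decay."""
    return p_d ** depth

def success_horizon(p_d, target=0.5):
    """Depth at which chain success falls to `target`:
    solve p_d ** H = target for H."""
    return math.log(target) / math.log(p_d)

# Per-step value reported for the best model on the third regime.
h_half = success_horizon(0.863)
```

Plugging in 0.863 gives a horizon close to the 4.7 steps quoted in the abstract, which supports the geometric reading.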

[NLP-88] A Hybrid Framework for Song Lyric Annotation Based on Human-LLM Alignment

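[Quick read]: This paper addresses a core difficulty in emotion recognition for song lyrics: the lyrics may not align with the overall emotion of the song, so lyric-based annotation can fail to reflect the song's emotion, and lyrics annotation remains largely underexplored. The authors propose a hybrid annotation framework that predicts potential misalignment between human annotators and large language models (LLMs) on sentence-level lyric emotion annotation, and uses that prediction to optimize the combination of human and LLM annotation and to mitigate the effect of subjectivity.

Link: https://arxiv.org/abs/2606.29273
Authors: Rashini Liyanarachchi,Frank Tran,Md Mahmudul Hasan,Aditya Joshi,Erik Meijering
Affiliations: University of New South Wales
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: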

Abstract:Emotion recognition of song lyrics is a challenging task since lyrics may not necessarily align with the overall emotion of a song. As a result, lyrics annotation remains largely underexplored. Drawing inspiration from research in large language model (LLM) assisted annotation, we examine the alignment between humans and LLMs for annotation of lyrics by creating a new sentence-level dataset of lyrics. Our observations highlight the subjectivity of the task and the inherent challenges. Following this, we present a hybrid annotation framework that optimizes human and LLM annotation by predicting potential misalignment in annotation.

[NLP-89] MIThinker: A Plug-and-Play Policy-Optimized Thinker For Motivational Interviewing Counseling ACL2026

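[Quick read]: This paper addresses a limitation of existing LLM-based counseling agents that use Motivational Interviewing (MI): they do not explicitly align their thoughts with counseling strategies, which limits their effectiveness. The proposed MIThinker is a lightweight thinking model that generates therapeutic thoughts consistent with MI to guide strategy selection and response generation. To overcome the scarcity of annotated thought data, the authors design AugR1-MI, an automated pipeline that reverse-engineers the counselor's implicit thoughts from observed responses. With two-stage training that combines supervised fine-tuning and reinforcement learning, MIThinker improves theory-of-mind assessment and strategy alignment. Experiments show that MindfulMI, an agent built on MIThinker, reaches MI competency comparable to state-of-the-art systems with roughly an order of magnitude less computation.

Link: https://arxiv.org/abs/2606.29265
Authors: Yizhe Yang,Palakorn Achananuparp,Heyan Huang,Jing Jiang,Ee-Peng Lim
Affiliations: Beijing Institute of Technology; Singapore Management University; Australian National University
Categories: Computation and Language (cs.CL)
Comments: Accepted to Findings of ACL 2026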

Abstract:Reasoning large language models (LLMs) have recently made much progress in complex problem-solving, leveraging internal reasoning (or thought) to guide their solution generation. However, existing LLM-based counseling agents, including those using Motivational Interviewing (MI), generate responses without explicitly aligning thoughts with counseling techniques, limiting their effectiveness. We propose MIThinker, a lightweight thinking model that generates therapeutic thoughts to guide MI counseling agents in strategy selection and response generation. To overcome the lack of annotated thought data, we introduce AugR1-MI, an automated pipeline that reverse-engineers counselor’s thoughts from observed responses. Through two-stage training combining supervised fine-tuning and reinforcement learning, MIThinker demonstrates improved theory-of-mind assessment and strategy alignment. Comprehensive evaluations show that MindfulMI, our agent leveraging MIThinker, achieves MI competency comparable to state-of-the-art systems with an order of magnitude less computation.

[NLP-90] Travel-Oriented Reasoning Large Language Model via Domain-Specific Knowledge Graphs KDD2026

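[Quick read]: This paper addresses reasoning errors that large language models (LLMs) make in specialized domains such as travel, where confident but inaccurate answers arise because the model has not internalized the underlying domain graph of entities, relationships, and rules. The solution is a modular pipeline grounded in an expert-designed knowledge graph (KG). A travel KG encodes domain entities and their relationships. A bottom-up procedure walks the KG to generate multi-hop question-answer (QA) pairs. The QA pairs are used for supervised fine-tuning, so the model absorbs domain knowledge in the form of auditable reasoning traces. A dedicated travel-domain benchmark then measures accuracy and calibration. The fine-tuned Qwen3-4B reaches 82.4% exact match, against 22.4% for the pretrained baseline. A calibration analysis attributes the remaining 17.57% of errors to two failure modes: an over-confident multi-label decoder that outputs the correct answers plus one spurious option, and failures on single-answer questions where the supporting facts are in the KG but the model does not reconstruct the correct multi-hop path. The results support explicit KG-grounded reasoning as a way to improve accuracy and uncertainty interpretation in specialized domains, and point to per-option calibration and trace-length-aware decoding as next steps.

Link: https://arxiv.org/abs/2606.29254
Authors: Vignesh Ram Nithin Kappagantula,Shayan Hassantabar,Samuel Simpson,Golnaz Moallem
Affiliations: Expedia Group
Categories: Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
Comments: Accepted to the Uncertainty Reasoning and Quantification in Decision Making (UDM) Workshop, KDD 2026 (To be presented in August 2026)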

Abstract:Large language models (LLMs) demonstrate broad reasoning abilities but struggle with accuracy and reliability in specialized domains such as travel, where reasoning depends on precise definitions, rules, and expert-defined conceptual frameworks, and where confident but unfounded outputs arise from a reasoning failure in which the model has not internalized the underlying domain graph rather than from missing domain knowledge alone. We propose a modular pipeline for building a travel-domain reasoning LLM grounded in an expert-designed knowledge graph (KG). Our pipeline integrates a travel KG that encodes domain entities and their relationships, a bottom-up construction procedure that walks the KG to produce multi-hop question-answer (QA) pairs, a supervised fine-tuning stage that embeds the domain knowledge into a reasoning-capable LLM using the generated QA pairs as auditable reasoning traces, and a travel-domain benchmark dataset that measures the fine-tuned model’s accuracy and calibration. We evaluate our approach using Qwen3-4B with LoRA adaptation. Our reasoning model achieves an 82.4% exact match on the benchmark. This performance significantly outperforms the pretrained Qwen3-4B baseline at 22.4%. A calibration analysis decomposes the residual 17.57% of errors into two distinct failure modes: an over-confident multi-label decoder that predicts both correct answers plus one spurious option on most dual-answer mistakes, and a smaller reasoning failure on single-answer questions where the supporting facts are present in the KG but the model fails to reconstruct the correct multi-hop path. This split confirms that explicit KG-grounded reasoning substantially improves the accuracy and uncertainty interpretation of LLMs in specialized domains, and isolates per-option calibration and trace-length-aware decoding as the next axes of improvement.
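
Walking a knowledge graph to produce multi-hop QA pairs with an auditable trace can be sketched as below. The graph format (a dict from entity to `(relation, target)` edges), the question template, and the toy entities are assumptions for illustration; the paper's bottom-up construction procedure and its travel KG schema are not reproduced here.

```python
def walk_paths(kg, start, hops):
    """Enumerate every relation path of exactly `hops` edges from `start`."""
    if hops == 0:
        return [[]]
    paths = []
    for relation, target in kg.get(start, []):
        for rest in walk_paths(kg, target, hops - 1):
            paths.append([(start, relation, target)] + rest)
    return paths

def to_qa(path):
    """Turn one path into a QA pair that keeps the path as its trace."""
    start = path[0][0]
    relations = " -> ".join(rel for _, rel, _ in path)
    return {
        "question": f"Starting from {start}, follow: {relations}. What do you reach?",
        "answer": path[-1][2],
        "trace": path,
    }

# Hypothetical two-edge travel graph, written for this example only.
toy_kg = {
    "Hotel A": [("located_in", "Lisbon")],
    "Lisbon": [("in_country", "Portugal")],
}
qa_pairs = [to_qa(p) for p in walk_paths(toy_kg, "Hotel A", 2)]
```

Each generated pair carries its full edge list, so a wrong answer can be traced to the specific hop that the model failed to reconstruct.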

[NLP-91] Understanding Evaluation Illusion in Diffusion Large Language Models

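[Quick read]: This paper addresses inconsistent evaluation of efficient decoding methods for diffusion large language models (dLLMs). Although parallel decoding promises faster generation, existing studies report conflicting results even under seemingly identical evaluation settings, which risks biased conclusions. The core finding is that method rankings are highly sensitive to small changes in prompt templates, so single-template evaluation can create the illusion that a decoding method improves efficiency without degrading performance. The study finds that current parallel decoding methods consistently underperform the single-token decoding baseline and do not overcome the speed-quality trade-off. An effective prompt template alone can markedly improve results, by more than the marginal gain from adding denoising steps. Other overlooked evaluation settings also affect the assessment. Based on these findings, the paper proposes practical guidelines for reliable evaluation of dLLM decoding methods, including evaluation across multiple templates and diverse settings.

Link: https://arxiv.org/abs/2606.29228
Authors: Hengxiang Zhang,Jiaxi Ren,Hongxin Wei
Affiliations: Southern University of Science and Technology
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: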

Abstract:Despite the capability of parallel decoding, diffusion large language models (dLLMs) require many denoising steps to maintain generation quality, motivating recent research on efficient decoding strategies. However, existing studies have reported inconsistent evaluation results even under seemingly identical evaluation settings, risking biased conclusions about dLLM decoding methods. To understand this evaluation concern, we conduct a rigorous evaluation of current decoding methods for dLLMs across diverse evaluation settings. Surprisingly, our analysis reveals that the ranking of decoding methods is highly sensitive to the choice of prompt templates. Single-template evaluation can lead to an illusion that decoding methods improve inference efficiency without performance degradation. Through comprehensive experiments, we find that current parallel decoding methods consistently underperform the single-token decoding baseline, failing to overcome the speed-quality trade-off. We further identify this evaluation inconsistency as the high sensitivity of parallel decoding methods to minor variations in prompt templates. Our experiments show that an effective prompt template can achieve strong evaluation results even with fewer denoising steps, markedly outperforming the marginal gain from increasing denoising steps. Beyond prompt templates, our experiments indicate that overlooked evaluation settings can also notably affect the assessment of decoding methods. Based on these findings, we propose practical guidelines for the reliable evaluation of decoding methods in dLLMs.
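
A simple way to check for the template sensitivity described above is to rank the decoding methods separately under each prompt template and compare the rankings. The sketch below does that; the method names, template names, and accuracies are made up for illustration and are not numbers from the paper.

```python
def rank_methods(scores):
    """Return method names sorted from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)

def rankings_by_template(results):
    """Rank the methods independently under each prompt template."""
    return {template: rank_methods(scores) for template, scores in results.items()}

def ranking_is_stable(results):
    """True only if every template produces the same ordering."""
    rankings = list(rankings_by_template(results).values())
    return all(r == rankings[0] for r in rankings)

# Made-up accuracies for illustration only.
toy_results = {
    "template_a": {"single_token": 0.71, "parallel_x": 0.73},
    "template_b": {"single_token": 0.70, "parallel_x": 0.62},
}
```

In this toy case a report based on `template_a` alone would rank the parallel method first, while `template_b` reverses the order, which is the kind of single-template illusion the abstract warns about.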

[NLP-92] PolicyGuard: A Dialogue-Grounded Sub-Agent Verifier for Policy Adherence in LLM Agents

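[Quick Read]: This paper addresses policy adherence for LLM agents deployed in enterprise settings: how to ensure an agent keeps following organizational policy across a multi-turn dialogue. Prior work treats this as a safeguarding problem, using external checks to block non-compliant actions. The authors argue this misses how real workflows operate: they span many turns, require explicit user confirmation and prerequisite reads, and hinge on the content of the whole dialogue rather than on any single argument value. The proposed solution is POLICYGUARD, a sub-agent verifier that shares the main agent's view of the dialogue and has three core capabilities: (i) judging with full conversation context; (ii) self-reasoning over the policy and the current dialogue in context; and (iii) producing conversation-specific, actionable remediation that guides the agent's next turn. On the tau^2-BENCH airline domain, POLICYGUARD improves PASS4 by 12.0 / 6.0 / 12.0 percentage points on GPT-5.4, Claude Sonnet 4.6, and Gemini 2.5 Pro respectively. Per-call analyses show higher policy-violation recall while blocking roughly half as often as argument-level guards.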

Link: https://arxiv.org/abs/2606.29225
Authors: Seongjae Kang, Taehyung Yu, Sung Ju Hwang
Affiliation: KAIST; DeepAuto.ai
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 20 pages, 8 figures

Abstract:LLM agents handle user requests on behalf of organizations through tool calls and must follow the company policies stated in their system prompts. Prior work approaches this as a safeguarding problem – external checks that block non-compliant agent actions. We argue that policy adherence is a broader problem: real workflows unfold across many turns, require explicit user confirmation and prerequisite reads, and hinge on the content of the dialogue rather than on any single argument value. Meeting this bar requires (i) full conversation context, (ii) self-reasoning over the policy and the current dialogue, and (iii) conversation-specific remediation that guides the agent’s next turn – three capabilities that prior safeguard work has often underestimated. We introduce POLICYGUARD, a sub-agent verifier that shares the agent’s view of the dialogue, reasons over the policy in context, and provides actionable feedback for the agent’s next turn. On tau^2-BENCH airline across three vendors (GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Pro) with four trials per setting, POLICYGUARD improves PASS4 by +12.0 / +6.0 / +12.0 pp. Per-call analyses show POLICYGUARD achieves higher policy-violation recall while blocking roughly half as often as argument-level guards.

[NLP-93] Multi-Block Diffusion Language Models

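[Quick Read]: This paper addresses the training-inference mismatch that arises when block diffusion language models (BD-LMs) are extended to Multi-Block Diffusion (MultiBD) decoding. Existing BD-LMs are mostly trained with teacher forcing, where the model sees only one noisy block conditioned on a clean prefix. The recent diffusion forcing strategy adds visibility among multiple noisy blocks, but its training states still differ from MultiBD inference, which operates on a bounded running-set with heterogeneous slot-wise noise patterns. To bridge this gap, the authors propose Multi-Block Diffusion Language Models (MBD-LMs), obtained by post-training BD-LMs with Multi-block Teacher Forcing (MultiTF). MultiTF combines teacher forcing and diffusion forcing by training on bounded noise-groups conditioned on clean prefixes, with randomized noise-schedulers that better match MultiBD inference states. The two key ingredients are (1) this training scheme, which aligns training states with inference states, and (2) an optimized decoding algorithm built on a Block Buffer mechanism that preserves prefix-cache reuse, keeps input shapes static, and turns the added inter-block parallelism into wall-clock acceleration. Empirically, MBD-LLaDA2-Mini raises average Tokens Per Forward pass (TPF) from 3.47 to 6.19 and average accuracy from 79.95% to 81.03%; combined with DMax, MBD-LLaDA2-Mini-DMax reaches an average TPF of 9.34 with only a 1.02% accuracy drop on math and code benchmarks.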

Link: https://arxiv.org/abs/2606.29215
Authors: Yijie Jin, Jiajun Xu, Yuxuan Liu, Chenkai Xu, Yi Tu, Jiajun Li, Dandan Tu, Xiaohui Yan, Kai Yu, Pengfei Liu, Zhijie Deng
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Block Diffusion Language Models (BD-LMs) improve diffusion-based text generation with KV caching and flexible-length generation. A natural next step is to extend them from Single-Block Diffusion (SingleBD) to Multi-Block Diffusion (MultiBD), where a running-set of consecutive blocks is decoded concurrently for inter-block parallelism. However, existing BD-LMs are mostly trained under teacher forcing, where the model observes only one noisy block conditioned on a clean prefix. While the recent diffusion forcing strategy introduces visibility among multiple noisy blocks, its training states still differ from MultiBD inference, where decoding operates on a bounded running-set with heterogeneous slot-wise noise patterns. To bridge this gap, we propose Multi-Block Diffusion Language Models (MBD-LMs), obtained by post-training BD-LMs with Multi-block Teacher Forcing (MultiTF). MultiTF integrates teacher forcing and diffusion forcing by training on bounded noise-groups conditioned on clean prefixes, with randomized noise-schedulers that better match MultiBD inference states. To make MultiBD practically executable, we further introduce an optimized decoding algorithm based on the Block Buffer mechanism that preserves prefix-cache reuse, keeps input shapes static, and translates increased decoding parallelism into wall-clock acceleration. Empirically, MBD-LLaDA2-Mini increases average Tokens Per Forward pass (TPF) from 3.47 to 6.19 and improves average accuracy from 79.95% to 81.03%; when combined with DMax, MBD-LLaDA2-Mini-DMax reaches an average TPF of 9.34 with only a 1.02% accuracy drop on math and code benchmarks.

[NLP-94] Can OCR-VLMs Read Devanagari? A Stress-Test Benchmark and Post-Correction Study

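[Quick Read]: This paper addresses the lack of systematic evaluation of OCR systems on Indic scripts, specifically Devanagari (Hindi). Current systems report strong results on English and Chinese document benchmarks, but their behaviour on Indic text is largely uncharacterised. The study builds a benchmark spanning four synthetic degradation conditions and 300 real printed scans, and evaluates ten systems: classical OCR, open vision-language models, specialised OCR-VLMs, and frontier closed models. The main findings are: (1) synthetic renders badly overstate quality, since on real scans nine of the ten systems collapse and scores spread across a 76-point range; (2) median and catastrophic-rate are reported instead of the mean, because DeepSeek-OCR suffers rare but catastrophic repetition failures that wreck its corpus mean even though its median is the best of any system; (3) a byte-level (ByT5) post-corrector improves a cheap engine on its own error distribution (chrF++ +1.2 to +1.5) but does not transfer across engines; and (4) strong English OCR performance does not predict Indic OCR performance. The benchmark, code, and models are released.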

Link: https://arxiv.org/abs/2606.29213
Authors: Aditya Pratap Singh
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures. Benchmark and code released

Abstract:OCR systems, ranging from classical engines to specialised OCR vision-language models (OCR-VLMs) and frontier multimodal LLMs, report strong results on English and Chinese document benchmarks, yet their behaviour on Indic scripts is largely uncharacterised. We benchmark ten systems on Devanagari (Hindi): classical EasyOCR; open VLMs (Qwen2.5-VL-3B, Qwen3-VL-8B, olmOCR-7B); specialised OCR-VLMs (DeepSeek-OCR, Unlimited-OCR); and frontier closed models (Gemini 2.5 Flash, Claude Opus 4.7, GPT-5.5, Mistral OCR), across four synthetic degradation conditions and 300 real printed scans. We report four findings. First, on clean rendered text all ten cluster within chrF++ 91 to 98, so synthetic text does not separate them. Second, under degradation the specialised OCR-VLMs are the most fragile: DeepSeek-OCR suffers rare but catastrophic repetition failures (outputs up to 71 times the reference length) that wreck its corpus mean even though its median is the best of any system, which is why we report median and catastrophic-rate instead of the mean. Third, on real scans nine of the ten systems collapse (EasyOCR falls from chrF++ 93.6 to 58.3) and the field spreads across a 76-point range, so synthetic renders badly overstate Devanagari quality. Fourth, strong English OCR does not predict Indic OCR: GPT-5.5 drops to chrF++ 58.5 (tying classical EasyOCR) and olmOCR-7B, the model behind olmOCR-Bench, falls to 40.5, while the open Qwen3-VL-8B (75.2, runnable on a single 24 GB GPU) beats GPT-5.5 and approaches Mistral; Gemini and Claude lead at 86.3 and 82.2. An error taxonomy separates surface errors (numerals, punctuation) from structural ones (conjuncts, matras, nukta), and a byte-level (ByT5) post-corrector improves a cheap engine on its own error distribution (chrF++ +1.2 to +1.5) but does not transfer across engines. We release the benchmark, code, and models.
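The argument for reporting median and catastrophic-rate rather than the mean can be seen in a small sketch. The per-page scores, the length ratios, and the flagging threshold below are all made up for illustration; the abstract does not state how the benchmark defines a catastrophic output, so the length-ratio criterion is an assumption.

```python
from statistics import mean, median


def summarize(scores, length_ratios, ratio_threshold=5.0):
    """Return mean, median, and the share of outputs flagged as catastrophic.

    An output is flagged when it is longer than `ratio_threshold` times the
    reference; this criterion is an assumption made for illustration.
    """
    flagged = sum(1 for r in length_ratios if r > ratio_threshold)
    return {
        "mean": mean(scores),
        "median": median(scores),
        "catastrophic_rate": flagged / len(length_ratios),
    }


# Made-up per-page scores: four good pages and one runaway repetition.
page_scores = [95, 96, 94, 97, 2]
page_ratios = [1.0, 1.02, 0.98, 1.01, 71.0]
report = summarize(page_scores, page_ratios)
```

In this toy case one runaway page drags the mean far below the median, while the median and the flagged share together describe both the typical page and the rare failure.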

[NLP-95] Representational Depth of Evaluation Awareness Shifts With Scale in Open-Weight Language Models ICML2026

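[Quick Read]: This paper asks whether language models can tell when they are being evaluated, a question that matters for AI safety: a model that recognises an evaluation context could alter its behaviour strategically, making downstream benchmarks harder to interpret. Analysing 11 models spanning Qwen 2.5, Gemma 2, and Llama 3.2, the study finds a systematic size-dependent shift in representational depth: in both Qwen 2.5 and Gemma 2, the layer at which evaluation-awareness is most linearly recoverable moves from late layers in smaller models to early layers in larger ones. Scale therefore changes not only the strength of evaluation-awareness but also where in the network it is most linearly recoverable. This depth shift helps explain why within-family scaling trajectories are non-monotonic or inverse rather than smooth and family-general, so a simple universal power-law account is not supported under denser within-family sampling. In addition, white-box probe signals are consistently stronger than black-box behavioural expression, and the relationship between the two varies by family in ways that probe AUROC alone does not predict.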

Link: https://arxiv.org/abs/2606.29196
Authors: Archit Manek
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 9 pages, 3 figures. Accepted at the Mechanistic Interpretability Workshop at ICML 2026

Abstract:Do language models know when they are being tested? This question matters for AI safety: a model that recognises an evaluation context could alter its behaviour strategically, making downstream benchmarks harder to interpret. Using 11 models spanning Qwen 2.5, Gemma 2, and Llama 3.2, we find a systematic size-dependent shift in representational depth: in both Qwen 2.5 and Gemma 2, the layer at which evaluation-awareness is most linearly recoverable moves from late layers in smaller models to early layers in larger ones. This suggests that scale changes not only the strength of evaluation-awareness but also where it is most linearly recoverable in the network. This depth shift helps explain why within-family scaling trajectories are non-monotonic or inverse rather than smooth and family-general, showing that a simple universal power-law account is not supported under denser within-family sampling. Finally, white-box probe signals are consistently stronger than black-box behavioural expression, and the relationship between the two varies by family in ways not predicted by probe AUROC alone.

[NLP-96] Evidence-Informed LLM Beliefs for Continual Scientific Discovery

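[Quick Read]: This paper addresses a limitation in open-ended scientific discovery with large language models (LLMs), which increasingly runs as a long-horizon loop of hypothesis search and verification guided by a reward signal. Existing methods such as AutoDiscovery treat "Bayesian surprise" as a static quantity, whereas surprisal in human reasoning is non-stationary: it is defined relative to beliefs that evolve with experience, a prerequisite for continual discovery. The proposed remedy is evidence-informed LLM beliefs, in which priors are updated with evidence from previous hypotheses so that surprisal for new hypotheses is computed against non-stationary beliefs. Comparing several in-context belief-updating mechanisms, the authors find that embedding-based retrieval-augmented generation (RAG) over prior discoveries best anticipates eventual posteriors and identifies 37.5% of static surprisals as spurious. They then modify the search procedure in two complementary ways: belief-update filtering, to avoid spurious rewards, and diversity maximization. Across five discovery domains, the method increases accumulated non-stationary surprisal by 30.62% on average over the original search procedure, indicating that continual scientific discovery needs not only better belief measurement but also search procedures that avoid redundancy and encourage diversity.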

Link: https://arxiv.org/abs/2606.29182
Authors: Dhruv Agarwal, Reece Adamson, Andrew McCallum, Peter Clark, Ashish Sabharwal, Bodhisattwa Prasad Majumder
Affiliation: Allen Institute for AI
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Open-ended scientific discovery with large language models (LLMs) increasingly operates as a long-horizon loop of hypothesis search and verification, where a reward signal guides which hypotheses to test next. A notable recent example is AutoDiscovery, which uses “Bayesian surprise” - the belief shift an LLM undergoes after observing evidence for a hypothesis - as both a discovery metric and a reward for search. We first observe that AutoDiscovery treats surprisal as a static quantity, while surprisal in human reasoning is non-stationary - it is defined relative to beliefs that evolve with experience, a prerequisite for continual scientific discovery. We address this mismatch with evidence-informed LLM beliefs: priors updated with evidence from previous hypotheses to compute non-stationary surprisal for new hypotheses. We compare in-context belief-updating mechanisms and find that embedding-based retrieval-augmented generation over prior discoveries best anticipates eventual posteriors, identifying 37.5% of static surprisals as spurious. We then modify search to avoid these spurious rewards and prioritize hypotheses that remain surprising under non-stationary beliefs. Concretely, we introduce two complementary changes to the original search procedure: belief-update filtering and diversity maximization. Across five discovery domains, our method increases accumulated non-stationary surprisal by 30.62% on average compared to the original search procedure, demonstrating that continual scientific discovery with LLMs requires not only better belief measurement but also search procedures that avoid redundancy and encourage diversity.

[NLP-97] Selective Memory Retention for Long-Horizon LLM Agents ICML

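[Quick Read]: This paper asks when retention policies actually matter for LLM agents augmented with external memory. It introduces TraceRetain, a lightweight framework for bounded external memory in frozen LLM agents that scores each entry by interpretable features (success, age, access frequency, redundancy, specificity, similarity, downstream utility) and evicts the lowest-scoring entries when memory is at capacity. On clean ALFWorld with gpt-5-mini, external memory robustly improves over no memory, but differences among bounded retention policies fall within Wilson 95% confidence intervals, because the clean benchmark does not naturally exhibit the memory pollution that retention is designed to address. Under a controlled noisy-write stress test with 75% synthetic distractors, Precision@5 for unbounded memory and FIFO-K50 drops from 20.2% to 12.4% and from 15.8% to 3.8% respectively, while TraceRetain-CEM is essentially unchanged (16.9% to 16.6%) and preserves 97/100 task success. Unbounded memory has the highest mean similarity (0.87) but the lowest precision, which indicates distractors sitting close to the query in embedding space. The conclusion is that bounded retention buys memory and step efficiency on saturated clean benchmarks at no task-success cost, and differentiates itself from cache heuristics only when the stream contains noise.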

Link: https://arxiv.org/abs/2606.29178
Authors: Pranath Reddy
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at the International Conference on Machine Learning (ICML) 2026

Abstract:When does retention matter for memory-augmented LLM agents? We study this with TraceRetain, a lightweight framework for bounded external memory in frozen LLM agents that scores entries by interpretable features (success, age, access frequency, redundancy, specificity, similarity, downstream utility) and evicts the lowest-scoring ones at capacity. On clean ALFWorld with gpt-5-mini, external memory robustly improves over no memory across two seeds, but differences among bounded retention policies fall within Wilson 95% CIs: clean ALFWorld at T=100 to T=200 does not naturally exhibit the memory pollution retention is designed to address. Under a controlled noisy-write stress (75% synthetic distractors), unbounded memory and FIFO-K50 degrade on Precision@5 (20.2% to 12.4% and 15.8% to 3.8%) while TraceRetain-CEM is essentially unchanged (16.9% to 16.6%) and preserves 97/100 task success. The mechanism: unbounded memory has the highest mean similarity (0.87) but lowest precision, indicating failed distractors close to the query in embedding space. Held-out in-distribution evaluation shows memory-augmented policies solving 47 to 49 of 50 tasks vs. 39/50 for no memory. Bounded retention buys memory and step efficiency on saturated clean benchmarks at no task-success cost, and only differentiates from cache heuristics when streams contain noise.
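The score-then-evict idea described in the abstract can be sketched roughly as follows. This is not the TraceRetain implementation: the `Entry` fields cover only four of the listed features, and the `WEIGHTS` values, the linear scoring form, and the `BoundedMemory` class are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    text: str
    success: float     # 1.0 if the episode that wrote the entry succeeded
    age: int           # steps since the entry was written
    accesses: int      # how often the entry has been retrieved
    redundancy: float  # 0..1 overlap with other stored entries


# Hypothetical feature weights; the paper's actual scoring is not given here.
WEIGHTS = {"success": 1.0, "age": -0.01, "accesses": 0.2, "redundancy": -0.5}


def score(e):
    """Linear combination of interpretable features (an assumed form)."""
    return (WEIGHTS["success"] * e.success + WEIGHTS["age"] * e.age
            + WEIGHTS["accesses"] * e.accesses
            + WEIGHTS["redundancy"] * e.redundancy)


class BoundedMemory:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []

    def write(self, entry):
        """Store an entry; if over capacity, evict and return the lowest scorer."""
        self.entries.append(entry)
        if len(self.entries) <= self.capacity:
            return None
        evicted = min(self.entries, key=score)
        self.entries.remove(evicted)
        return evicted
```

A short usage example: with capacity 2, writing a useful entry, then a stale redundant one, then another useful one evicts the stale entry.

```python
mem = BoundedMemory(capacity=2)
mem.write(Entry("open drawer before taking key", 1.0, 0, 3, 0.1))
mem.write(Entry("unrelated distractor", 0.0, 50, 0, 0.9))
gone = mem.write(Entry("heat mug in microwave", 1.0, 10, 0, 0.0))
```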

[NLP-98] Symbolic Mechanistic Data Attribution: Tracing Training Influence to Learned Behavioral Policies

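[Quick Read]: This paper addresses a gap in data attribution: existing methods can identify which training examples build specific mechanistic circuits, but cannot explain how training data shapes the high-level behavioral decisions a model learns to make. The proposed framework, Symbolic Mechanistic Data Attribution (SMDA), fits a closed-form Ridge regression over sparse autoencoder (SAE) features to model a target behavior, then analytically decomposes how each supervised fine-tuning (SFT) example shifts that policy through two pathways: feature-activation changes (Delta_X) and output-probability changes (Delta_Y). Applied to refusal behavior in Llama-3.2-3B-Instruct with 200 SFT training pairs, the analysis shows that the policy's coefficients expose systematic gaps in the base model's safety behavior for categories such as religious stereotyping, that the per-feature Delta_X/Delta_Y decomposition can mechanistically explain why harmful and harmless pairs influence certain features in qualitatively different ways, and that individual training pairs routinely exhibit cross-feature interference. The results suggest that combining mechanistic interpretability with data attribution yields a diagnostic tool that is more fine-grained than black-box influence functions and more scalable than manual circuit analysis.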

Link: https://arxiv.org/abs/2606.29171
Authors: Reza Habibi, Darian Lee, Magy Seif El-Nasr
Affiliation: University of California, Santa Cruz
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:While existing data attribution methods can identify which training examples build specific mechanistic circuits, they cannot explain how training data shapes the high-level behavioral decisions a model learns to make. To bridge this gap, we introduce Symbolic Mechanistic Data Attribution (SMDA), a framework that attributes training pairs to the interpretable symbolic policies governing model behavior. SMDA fits a closed-form Ridge regression over sparse autoencoder (SAE) features to model a target behavior, then analytically decomposes how each supervised fine-tuning example shifts that policy through feature-activation Delta_X and output-probability Delta_Y pathways. We distill a symbolic policy for refusal behavior in Llama-3.2-3B-Instruct and analyze 200 SFT training pairs. Our analysis reveals that (1) the symbolic policy’s coefficients expose systematic gaps in the base model’s safety behavior for categories like religious stereotyping; (2) per-feature Delta_X/Delta_Y decomposition can mechanistically explain why harmful and harmless pairs exert qualitatively different influences on certain features; and (3) individual training pairs routinely exhibit cross-feature interference, allowing SMDA to identify training pairs whose dominant effect falls on unintended features. These results demonstrate that combining mechanistic interpretability with data attribution yields a diagnostic tool that is both more fine-grained than black-box influence functions and more scalable than manual circuit analysis.
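For readers unfamiliar with the closed-form Ridge step, a minimal sketch follows. It solves the standard ridge normal equations, w = (XᵀX + λI)⁻¹ Xᵀy, in plain Python. The function names `solve` and `ridge_closed_form`, the toy data, and the choice of λ are assumptions for illustration; the sketch does not reproduce SMDA's SAE features or its Delta_X/Delta_Y decomposition.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def ridge_closed_form(X, y, lam=1.0):
    """Return w minimizing ||X w - y||^2 + lam * ||w||^2."""
    d = len(X[0])
    # Gram matrix X^T X with lam added on the diagonal.
    A = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    # Right-hand side X^T y.
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(d)]
    return solve(A, b)


# Toy check: with X = I and lam = 1, each weight is half the target.
weights = ridge_closed_form([[1.0, 0.0], [0.0, 1.0]], [2.0, 4.0], lam=1.0)
```

Because the solution is available in closed form, the effect of changing a row of X or an entry of y on the weights can be computed analytically, which is the property the abstract relies on.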

[NLP-99] DistilledGemma: Balanced Efficiency-Accuracy for Person-Place Relation Extraction from Multilingual Historical Articles

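[Quick Read]: This paper addresses the balance between accuracy and computational efficiency in person-place relation extraction from multilingual historical newspaper articles (English, German, and French), the HIPE-2026 shared task. The solution is a three-stage knowledge distillation pipeline. First, prompt engineering strategies are explored systematically across eight large language models to identify the most effective reasoning architecture. Second, a Gemma 4 26B A4B teacher model is fine-tuned with supervised fine-tuning (SFT) via QLoRA and used to generate silver-standard chain-of-thought traces across the training corpus. Third, response-level distillation transfers these reasoning patterns into a compact Gemma 4 E2B student model. The deployed model has approximately 2.3B effective parameters, with the LoRA adapters used during training merged into the student for inference. In the official evaluation, the team ranked 3rd on the standard test set with a mean score of 0.688 and 2nd on the binary test set with a mean score of 0.8156, and ranked 2nd in the balanced efficiency-accuracy profile across both test sets, supporting knowledge distillation as a practical and scalable approach to historical document processing.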

Link: https://arxiv.org/abs/2606.29130
Authors: Youssef Aboelwafa, Ahmed Samir, Nagwa Elmakky, Marwan Torki
Affiliation: Alexandria University
Categories: Computation and Language (cs.CL)
Comments: The Conference and Labs of the Evaluation Forum (CLEF) 2026 - HIPE Challenge

Abstract:We present DistilledGemma, an efficient and accurate system for the HIPE-2026 shared task on person-place relation extraction from multilingual historical newspaper articles in English, German, and French. Our approach adopts a three-stage knowledge distillation pipeline designed to balance classification accuracy with computational efficiency. In the first stage, we systematically explored prompt engineering strategies across eight large language models to identify the most effective reasoning architecture for this challenging task. In the second stage, we applied supervised fine-tuning (SFT) via QLoRA to a Gemma 4 26B A4B teacher model, leveraging its strong multilingual capabilities to generate silver-standard chain-of-thought traces across the training corpus. In the final stage, we performed response-level distillation to transfer these learned reasoning patterns into a compact Gemma 4 E2B student model. In the official evaluation, our team WHEREAMI ranked 3rd on the standard test set with an accuracy profile mean score of 0.688, and 2nd on the binary test set with a mean score of 0.8156. Notably, by distilling knowledge from the 26B teacher to the 2.3B student, we preserved strong reasoning capabilities while reducing the deployed model size to approximately 2.3B effective parameters; the LoRA adapters used during training were merged into the student for inference. This configuration ranked 2nd in the balanced efficiency-accuracy profile across both the standard and binary test sets. These results demonstrate that knowledge distillation provides a practical and scalable solution for historical document processing, achieving competitive performance without excessive computational cost.

[NLP-100] How Anthropomorphic Language Impacts Public Perceptions of AI

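[Quick Read]: This paper studies the effect of anthropomorphic language in public discourse about AI, that is, language that attributes human capabilities and characteristics to a system. The practice has been criticized for setting misleading expectations, inflating claims, and fueling hype, which may distort public understanding and policy priorities. The study is an experiment (N=815) that compares changes in participants' perceptions after reading passages with and without anthropomorphic language, written to reflect realistic public-facing AI discourse. It also examines whether the effects differ across two types of technology, large language models and recommendation systems, and measures perceptions along several dimensions prominent in current public discourse. A separate condition, using a text that explicitly discusses the dangers of AI, shows that views of AI can shift in response to reading a text. In the main conditions, however, whether the text uses anthropomorphic language does not substantially affect participants' perceptions. The results indicate that any immediate effect on public opinion is modest, while leaving open the possibility of effects in naturalistic settings or over gradual, continued exposure.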

Link: https://arxiv.org/abs/2606.29121
Authors: Betty Li Hou, Sophie Hao, Sunoo Park, Tal Linzen
Affiliation: New York University; Boston University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Public discourse about artificial intelligence (AI) often uses anthropomorphic language: language that attributes human capabilities and characteristics to the system. This practice has been criticized for setting misleading expectations, inflating claims, and fueling hype around AI, which may distort public understanding of AI and impact policy priorities. We study the effects of anthropomorphic framing by comparing changes in participants’ perceptions (N=815) when reading passages with and without anthropomorphic language, designed to reflect realistic public-facing AI discourse. We further examine whether these effects differ across two types of AI technologies – large language models and recommendation systems – and measure changes in perceptions of AI across several dimensions that are prominent in current public discourse. In a separate condition using a text that explicitly discusses the dangers of AI, we show that individuals’ views of AI can shift in response to reading a text; yet in the main conditions of the experiment, where we compare anthropomorphic and non-anthropomorphic descriptions, we find that whether the text uses anthropomorphic language does not substantially affect participants’ perceptions of AI. Our results indicate that any immediate effects on public opinions of AI are modest, although they leave open the possibility that anthropomorphic language could have an effect in naturalistic settings, or over gradual, continued exposure.

[NLP-101] Knowing in Advance When an Evolutionary Outer Loop Will Not Help: A Pre-Registered Cheap-Baseline Screening Rule

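[Quick Read]: This paper addresses how to decide, before any implementation, whether an evolutionary / population / lifecycle outer loop over neural-network parameters or structure is worth building. Such outer loops cost 10^2 to 10^3 times their gradient inner loop, yet whether they beat a cheap single-shot alternative is usually discovered only after the expense is paid. The authors propose a pre-registered screening rule that computes a single number at a Phase-0 gate, the recovery R = s/G, where s is the gain of the best single-shot gradient/curvature statistic and G is the best gain of any cheap method evaluated; the rule prescribes skipping the outer loop when R ≥ 90%. The rule is validated on a within-lab series of pre-registered outer-loop bets (two analyzed cases plus a disclosed file drawer). In both analyzed cases, a static or single-shot computation captured the effect on the project's own metric, the gate fired (R ≈ 1.0 in both cases, ≈ 0.95 under a stricter metric on one), and the outer loop was abandoned. In one case, a companion factorial decomposition localizes the apparent win to a static substrate change, with the evolutionary lifecycle contributing no detectable gain. On one project, the gate cost about 50-70 GPU-hours and screened out an estimated 400+ GPU-hours (first cell only) plus weeks of implementation, a 6-8x saving. The rule is prospectively falsifiable: a task with R < 90% where the outer loop still fails to beat single-shot would refute it.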

Link: https://arxiv.org/abs/2606.29119
Authors: Ramchand Kumaresan
Affiliation: Murai Labs
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We introduce a pre-registered screening rule that decides, before any implementation, whether an evolutionary / population / lifecycle outer loop over neural-network parameters or structure is worth building. Such outer loops cost 10^2-10^3x their gradient inner loop, yet whether they beat a cheap single-shot alternative is usually discovered only after the expense is paid. Our rule computes, at a Phase-0 gate, a single number: the recovery R = s/G, the best single-shot gradient/curvature statistic's gain s divided by the best gain G of any cheap method evaluated, and prescribes skipping the outer loop when R >= 90%. We validate the rule on a within-lab series of pre-registered outer-loop bets (two analyzed cases plus a disclosed file drawer): in both analyzed cases a static or single-shot computation captured the effect on the project's own metric, the gate fired (R approximately 1.0 in both cases; approximately 0.95 under a stricter metric on one), and the outer loop was abandoned, including one case where a companion factorial decomposition localizes the apparent win to a static substrate change with the evolutionary lifecycle contributing no detectable gain. On one project the gate cost about 50-70 GPU-hours and screened out an estimated 400+ GPU-hours (first cell only) plus weeks of implementation, a 6-8x saving. The rule is prospectively falsifiable: a task with R < 90% where the outer loop still fails to beat single-shot would refute it.
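The Phase-0 gate reduces to one division and one comparison, which a short sketch makes concrete. The function names and the example gains are hypothetical. The sketch assumes the gate fires when R is at or above 90%, a reading consistent with the abstract's report that the gate fired at R of roughly 1.0 and 0.95.

```python
def recovery(single_shot_gain, best_cheap_gain):
    """R = s / G: share of the best cheap gain captured by the single-shot statistic."""
    if best_cheap_gain <= 0:
        raise ValueError("best_cheap_gain must be positive for R to be meaningful")
    return single_shot_gain / best_cheap_gain


def should_skip_outer_loop(single_shot_gain, best_cheap_gain, threshold=0.90):
    """Assumed gate: skip building the outer loop when R reaches the threshold."""
    return recovery(single_shot_gain, best_cheap_gain) >= threshold


# Made-up gains on a project's own metric.
gate_fires = should_skip_outer_loop(single_shot_gain=1.9, best_cheap_gain=2.0)
gate_silent = should_skip_outer_loop(single_shot_gain=1.0, best_cheap_gain=2.0)
```

In the first toy call R is 0.95, so the gate fires and the outer loop would be skipped; in the second R is 0.5, so the rule makes no recommendation to skip.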

[NLP-102] Evolution Fine-Tuning: Learning to Discover Across 371 Optimization Tasks

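[Quick Read]: This paper addresses the lack of cross-task experience transfer when large language models (LLMs) are used inside evolutionary search. Prior work applies search scaffolds to one target task at a time, so every new problem is approached from scratch and the experience accumulated during search is discarded. The capability of iteratively evolving a solution (for example, knowing which part to mutate and how, or when to backtrack) therefore lives in the scaffold rather than in the model. The authors propose Evolution Fine-Tuning (EFT), a mid-training paradigm that teaches LLMs to evolve solutions across tasks by converting evolutionary search trajectories into supervision. They construct Finch Collection, a 156K-trajectory dataset spanning 10 domains and 371 optimization tasks, and fine-tune open-source LLMs from 2B to 9B parameters. EFT confers cross-task generalization: across 22 held-out tasks, the models surpass their base counterparts by 10.22% on average. Paired with test-time RL, the model matches state-of-the-art performance on two circle-packing tasks and outperforms its base-model counterpart on the Erdős minimum-overlap problem. EFT thus serves as a "practice phase" for general-purpose discovery agents, so that they do not solve each new problem from scratch.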

Link: https://arxiv.org/abs/2606.29082
Authors: Young-Jun Lee, Seungone Kim, Minki Kang, Alistair Cheong Liang Chuen, Zerui Chen, Seungho Han, Taehee Jung, Dongyeop Kang
Affiliation: Amazon
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:Would experience designing faster GPU kernels also help close in on a long-standing open mathematical conjecture? Large Language Models (LLMs) integrated into evolutionary search have recently produced state-of-the-art solutions on optimization tasks, including open mathematical conjectures, GPU kernel design, scientific law discovery, and combinatorial puzzles. To achieve this, prior work applied search scaffolds to one target task at a time, so every new problem is approached from scratch and the experience accumulated during search is discarded once the model finishes its attempt. This leaves the capability of iteratively evolving a solution (e.g., knowing which part to mutate and how, deciding when to backtrack) entirely in the scaffold rather than in the model itself. Whether the model itself could acquire this capability and reuse it across different tasks has been largely unexamined. To address this, we introduce Evolution Fine-Tuning (EFT), a mid-training paradigm that teaches LLMs to evolve solutions across tasks by converting evolutionary search trajectories into supervision. We construct Finch Collection, a 156K-trajectory dataset spanning 10 domains and 371 optimization tasks, and fine-tune open-source LLMs from 2B to 9B parameters. Empirically, EFT confers cross-task generalization: across 22 held-out tasks, our models surpass their base counterparts by 10.22% on average. Furthermore, when paired with test-time RL, our model matches state-of-the-art performance on two circle-packing tasks and outperforms its base-model counterpart on the Erdős minimum-overlap problem. EFT thus serves as a “practice phase” for general-purpose discovery agents that do not solve new problems from scratch.

[NLP-103] Low-cost concept-based localized explanations: How far can we get with training-free approaches?

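[Quick Read]: This paper addresses the scarcity of fine-grained concept annotations that limits validation in Concept-based Explainable AI (C-XAI). It evaluates whether mid-scale Multimodal Large Language Models (MLLMs) can perform localized concept naming under strict zero-shot conditions, assigning labels to bounding-box regions at both object and part levels. The core contribution is a reproducible zero-shot evaluation protocol for Concept Naming (CoNa) with two strategies: closed-set, category-constrained prompting for moderate vocabularies, and Open-CoNa, an embedding-similarity-based strategy for large label spaces. Experiments with four MLLMs (7B-32B) show consistent performance trends across datasets, reaching 62%-88% object-level exact-match accuracy, which highlights the potential of training-free concept annotation from localized regions. The paper also discusses limitations and failure modes and releases a reproducible framework to support low-cost C-XAI research.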

Link: https://arxiv.org/abs/2606.29069
Authors: Darian Fernández-Gutiérrez, Rafael Bello, Marilyn Bello, Natalia Díaz-Rodríguez
Affiliation: Central University "Marta Abreu" of Las Villas (UCLV), University of Granada (UGR)
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 2 figures, 4 tables. Accepted at the 2026 IEEE International Conference on Artificial Intelligence (CAI), 8-10 May 2026, Granada, Spain. Code: this https URL

Abstract:Concept-based Explainable AI (C-XAI) seeks human-understandable explanations grounded in semantic concepts, yet validation is limited by the scarcity of fine-grained concept annotations. We evaluate whether mid-scale Multimodal Large Language Models (MLLMs) can perform localized concept naming under strict zero-shot conditions by assigning labels to bounding-box regions at both object and part levels. We propose a reproducible zero-shot evaluation protocol for Concept Naming (CoNa) with (i) closed-set, category-constrained prompting for moderate vocabularies and (ii) Open-CoNa, an embedding-similarity-based strategy for large label spaces. Experiments with four MLLMs (7B-32B) show consistent performance trends across datasets, reaching 62%-88% object-level exact-match accuracy, highlighting the potential of training-free concept annotation from localized regions. We discuss limitations and failure modes and release a reproducible framework to support future low-cost C-XAI research.
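An embedding-similarity label assignment of the kind Open-CoNa is described as using might look like the sketch below. The abstract does not say which embedding model or similarity measure the paper uses, so cosine similarity, the function names, and the three-dimensional toy vectors are all assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def assign_label(region_vec, label_vecs):
    """Pick the label whose embedding is most similar to the region's embedding."""
    return max(label_vecs, key=lambda name: cosine(region_vec, label_vecs[name]))


# Toy embeddings; in practice these would come from a text or vision encoder.
label_vecs = {
    "wheel": [1.0, 0.0, 0.0],
    "beak": [0.0, 1.0, 0.0],
    "wing": [0.0, 0.0, 1.0],
}
predicted = assign_label([0.9, 0.1, 0.0], label_vecs)
```

Because assignment is a nearest-neighbour lookup, the label vocabulary can grow without rewriting a prompt, which is presumably why this route suits large label spaces.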

[NLP-104] A Comparative Study on Affective Cues in Text Embeddings Across Psychological Emotion Theories

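[Quick Read]: This paper asks to what extent the latent representations produced by modern text encoders capture well-defined psychological theories of affect, a question that remains open even though such encoders are widely applied to sentiment analysis and emotion recognition. The study probes twelve recently released text encoders by using their embeddings as input features for regression and classification tasks across three established emotion frameworks, on both word-level and sentence-level data, and applies a semantic data-leakage prevention technique to make the word-level evaluation more robust. The main findings are that, at word level, the latest instruction-aware open-weight encoders encode an equal or even larger amount of affective information than proprietary counterparts, whereas on sentence-level affective classification the embeddings of task-tuned and proprietary encoders reach the highest scores. A qualitative analysis of the latent representations and their encoded affective cues is also provided.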

Link: https://arxiv.org/abs/2606.29068
Authors: Fabio Ciani, Harald Schweiger, Emilia Parada-Cabaleiro, Markus Schedl
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Text encoders are known for their utility in natural language processing, as they are able to efficiently compress inputs into dense vectors while preserving semantics. These models have been applied to affective computing, in particular to help with solving sentiment analysis and emotion recognition tasks. Nevertheless, it remains unclear to what extent the latent representations produced by modern text encoders capture well-defined psychological theories of affect. In this work, we investigate the affective capabilities of twelve recently released text encoders by probing their generated embeddings as input features for solving regression and classification tasks across three established emotion frameworks, using both word- and sentence-level data. Additionally, we apply a semantic data-leakage prevention technique to improve robustness in word-level evaluations. Our main findings show that the latent manifolds of the latest instruction-aware open-weight encoders enclose an equal or even a larger amount of affective information in comparison with proprietary counterparts when evaluated at word level. In contrast, embeddings of task-tuned and proprietary encoders reach the highest scores on sentence-level affective classification. Furthermore, a qualitative analysis of latent representations and their encoded affective cues is provided.

[NLP-105] ThinkProbe: Beyond Accuracy – Structural Profiling of Open-Ended LLM Reasoning Traces via Non-Generative Thought Graphs EMNLP2026

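[Quick Read]: This paper addresses the lack of systematic methods for structural analysis of LLM reasoning traces; accuracy on the final answer says little about how the reasoning itself differs between models. The proposed framework, ThinkProbe, converts each trace into a Thought Graph, a directed graph with cycles, 8 node types, and 6 edge types, and derives a 19-metric five-dimensional cognitive profile (5D-CP: Breadth, Depth, Structure, Metacognitive, Efficiency) through a fully non-generative pipeline that combines rule-based segmentation with discriminative semantic linking. Applied to 4,200 traces from 7 native reasoning models across 200 open-ended questions and 10 cognitive domains, ThinkProbe shows that reasoning structure is a stable, model-level property: between-model variance exceeds between-domain variance by up to fourfold across four of the five dimensions, while the Structure dimension shows genuine sensitivity to question domain. These profiles expose qualitatively distinct cognitive patterns that accuracy-based evaluation cannot see.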

Link: https://arxiv.org/abs/2606.29067
Authors: Mohamed Amine Kerkouri, Simon D. Hernandez, Marouane Tliba, Yann Dauxais, Maha Ben-Fares, Pierre Holat
Affiliation: F-Initiatives, Paris, France; Université Sorbonne Paris Nord, Villetaneuse, France
Categories: Computation and Language (cs.CL)
Comments: Under Review for EMNLP 2026

Abstract:We present ThinkProbe, a framework for structural analysis of LLM reasoning traces. ThinkProbe converts each trace into a Thought Graph, a directed graph with cycles, 8 node types, and 6 edge types, and derives a 19-metric five-dimensional cognitive profile (5D-CP: Breadth, Depth, Structure, Metacognitive, Efficiency) through a fully non-generative pipeline combining rule-based segmentation and discriminative semantic linking. Applied to 4,200 traces from 7 native reasoning models across 200 open-ended questions and 10 cognitive domains, ThinkProbe reveals that reasoning structure is a stable, model-level property: between-model variance exceeds between-domain variance by up to fourfold across four of five cognitive dimensions, with Structure showing genuine sensitivity to question domain, exposing qualitatively distinct cognitive profiles invisible to accuracy-based evaluation.

[NLP-106] Masked Diffusion Decoding as x-Prediction Flow

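Quick read: This paper addresses the all-or-nothing decisions that the standard decoder imposes on masked diffusion language models (MDLMs): at each step a position is either committed to a single token or left fully masked, with no way to express intermediate confidence. This discards rich predictive information and forces premature, irreversible commitments, which hurts performance most when the decoding budget is limited. The key idea is to reinterpret mask prediction as clean-state prediction (x-prediction) and use it to induce a continuous flow in input embedding space, giving a continuous decoding framework in which tokens accumulate partial progress at each step and remain revisable. Because contextual constraints are uneven across positions in language, the globally synchronous schedule of image diffusion is replaced by a confidence-based asynchronous update that accumulates progress per token. A lightweight policy network is also introduced, with its training formulated as a reinforcement learning problem. Applied to pretrained LLaDA, the continuous decoder reaches 97% of LLaDA's HumanEval performance with 25% of the decoding budget.

Link: https://arxiv.org/abs/2606.29066
Authors: Weitian Wang,Lianlei Shan,Shubham Rai,Cecilia De La Parra,Akash Kumar
Affiliations: Robert Bosch GmbH; Ruhr University Bochum; University of the Chinese Academy of Sciences
Categories: Computation and Language (cs.CL)
Comments: under review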

Abstract:Masked diffusion language models (MDLMs) generate text by iteratively unmasking tokens, but their standard decoder reduces each step to a binary action: a position is either committed to a single token or left fully masked, with no representation of partial belief in between. This all-or-nothing regime discards rich predictive information and forces premature, irrevocable commitments, leading to poor performance under a limited decoding budget. In this paper, we reinterpret mask prediction as clean-state prediction (x-prediction) and show that it can be used to induce a continuous flow in input embedding space. Building on this view, we propose a continuous decoding framework for MDLMs where tokens can accumulate partial progress at each diffusion step and remain revisable. To match the uneven contextual constraints across positions in language, we replace the globally synchronous schedule in image diffusion with a confidence-based asynchronous update in which the diffusion progress is token-wise accumulated. Additionally, we introduce a lightweight policy network and formulate its training as a reinforcement learning problem. Applied to pretrained LLaDA, our continuous decoder reaches 97% of its performance on the HumanEval dataset with 25% of decoding budget.
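
The per-token, confidence-based progress accumulation described in this abstract can be illustrated with a toy sketch. This is not the paper's decoder: the update rule (progress grows by `rate` times the top softmax probability), the use of the argmax token as the clean-state prediction, and the names `decode_step` and `blend` are all assumptions made for illustration, and the learned policy network is left out.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_step(progress, logits_per_pos, rate=0.5):
    """One asynchronous update (hypothetical rule): every position advances
    by rate * confidence, so confident positions move faster than uncertain ones."""
    new_progress, predictions = [], []
    for t, logits in zip(progress, logits_per_pos):
        probs = softmax(logits)
        conf = max(probs)
        predictions.append(probs.index(conf))
        new_progress.append(min(1.0, t + rate * conf))
    return new_progress, predictions


def blend(mask_vec, token_vec, t):
    """Input embedding at progress t: a point on the line from the mask
    embedding (t = 0) to the predicted token embedding (t = 1)."""
    return [(1 - t) * m + t * x for m, x in zip(mask_vec, token_vec)]


# Position 0 is confidently predicted, position 1 is not.
progress, preds = decode_step([0.0, 0.0], [[4.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
```

Because no position is ever hard-committed, a later step can still change the predicted token for a position whose progress is below 1, which is the revisability the abstract emphasises.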

[NLP-107] How to Leverage Synthetic Speech for LLM-Based ASR Systems?

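Quick read: In regulated domains such as banking and healthcare, privacy constraints make real speech hard to collect and retain. Synthetic speech from modern text-to-speech (TTS) is an alternative for training automatic speech recognition (ASR) systems, but a distributional gap separates synthetic from real speech. Prior work mostly treats this gap as a black box to be worked around; this paper instead examines its origin by probing a SLAM-ASR architecture. The LLM backbone separates real from synthetic speech most strongly in its early-to-middle layers, and temporal and prosodic perturbations disrupt this discriminative signal the most. Representation-level separability helps explain the gap but does not directly predict downstream ASR gains. The central finding is that convolving synthetic audio with room impulse responses (RIRs) narrows the gap, not by making synthetic speech sound cleaner or more natural, but by reproducing the acoustic irregularities of real recordings. Building on this, the authors add a layer-selection module to training and combine it with RIR augmentation, matching a fully real-data baseline with only 25% of the real speech (13.6 h) and surpassing it at all higher proportions.

Link: https://arxiv.org/abs/2606.29031
Authors: Yanis Labrak,Dairazalia Sanchez-Cortes,Sergio Burdisso,Séverin Baroudi,Shashi Kumar,Esaú Villatoro-Tello,Srikanth Madikeri,Manjunath K E,Oldřich Plchot,Kadri Hacioğlu,Petr Motlicek,Andreas Stolcke
Affiliations: DefinedAI; Czech Technical University in Prague; Google; Stanford University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Submitted to SLT 2026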

Abstract:In regulated domains such as banking and healthcare, where privacy constraints make real speech costly to collect and retain, synthetic speech from modern text-to-speech (TTS) is an appealing alternative for training automatic speech recognition (ASR) without exposing sensitive customer recordings. Yet a persistent distributional gap between synthetic and real data limits how far it can replace genuine recordings. Prior work largely treats this gap as a black box to be engineered around, but in our work, we instead examine its origin directly by probing a SLAM-ASR architecture. Then, we localise where its LLM backbone separates real from synthetic speech and find the discriminative signal concentrated in the early-to-middle layers, where temporal and prosodic perturbations disrupt it most. We further show that representation-level separability helps, but does not directly predict downstream ASR gains. On the other hand, convolving synthetic audio with room impulse responses (RIRs) narrows the gap not by making synthetic speech sound cleaner or more natural, but by reproducing the acoustic irregularities of real recordings. Translating these findings into the training procedure, adding a layer-selection module combined with RIR augmentation matches a fully real-data baseline using only 25% of the real speech (13.6h) and surpasses it at all higher proportions.
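
The RIR augmentation mentioned in this abstract amounts to a convolution of the dry synthetic waveform with a room impulse response. Below is a minimal pure-Python sketch of that operation; the three-tap `toy_rir` and the four-sample `dry` signal are made-up values for illustration, whereas a real pipeline would use measured or simulated RIRs and an FFT-based convolution.

```python
def convolve(signal, rir):
    """Full linear convolution of a waveform with an impulse response.
    Output length is len(signal) + len(rir) - 1."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out


# Hypothetical values: a direct path followed by two decaying reflections.
toy_rir = [1.0, 0.5, 0.25]
dry = [1.0, 0.0, 0.0, 0.5]      # "clean" synthetic samples
wet = convolve(dry, toy_rir)    # reverberant version used for training
# wet == [1.0, 0.5, 0.25, 0.5, 0.25, 0.125]
```

Each input sample is smeared over the following samples by the reflections, which is one simple way the acoustic irregularities of a real room enter an otherwise clean synthetic signal.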

[NLP-108] Conversational Domain Adaptation of IndicTrans2 across 21 Indic Languages via Experience Replay and Model Soups

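Quick read: This paper addresses the stiff, unnatural output that the open English-to-Indic translation system IndicTrans2-1B produces on casual, conversational input, despite its strength on general-domain text. Using only public data (OpenSubtitles, BPCC-H-Daily, Tatoeba), the approach mixes general-domain data back into conversational fine-tuning (experience replay) and then averages the fine-tuned weights with the base model (model souping). This removes the forgetting caused by plain fine-tuning: general-domain performance is preserved (mean change of -0.17 chrF on FLORES, every language within 0.7 chrF), while conversational chrF improves in all 21 Indic languages (mean +6.2). Paired bootstrap tests confirm that the conversational gains are significant (p = 0.004). A blind human check and a multi-model LLM check do not confirm a perceived quality improvement, so the authors interpret the gain as largely a better register match to the references rather than proof of better translation. The contribution is not a new algorithm but a transparent, end-to-end empirical study in the Indic conversational setting.

Link: https://arxiv.org/abs/2606.29024
Authors: Aditya Pratap Singh
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 3 figures, 3 tables. Code: this https URL Model: this https URL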

Abstract:IndicTrans2 is the strongest open English to Indic translation system, but like most systems it is trained on general text and tends to sound stiff on casual, conversational input. We adapt IndicTrans2-1B to conversational register across all 21 Indic languages using only public data (OpenSubtitles, BPCC-H-Daily, Tatoeba). Plain fine-tuning improves conversational chrF but forgets the general domain (it drops 3.9 chrF on FLORES for Hindi). Mixing general data back into training (experience replay) and then averaging the fine-tuned weights with the base (model souping) removes that trade-off: the resulting model beats IndicTrans2-1B on conversational chrF in every one of the 21 languages (mean +6.2) while matching it on FLORES (mean change -0.17, all within 0.7 chrF). Paired bootstrap tests confirm the conversational gains are significant (p = 0.004) and that FLORES is not significantly degraded. We are deliberate about scope: these are chrF gains, and a blind human plus multi-model LLM check does not confirm them as a perceived quality improvement, so we treat the conversational gain as largely a register match to the references rather than proof of better translation. The techniques are not new; the contribution is the honest, end-to-end study in the Indic conversational setting.
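
The two techniques named in this abstract, experience replay and model souping, can be sketched on toy data as follows. The interpolation weight `alpha`, the `replay_ratio`, and the plain dict-of-lists stand-in for model weights are illustrative assumptions; the abstract does not state the values or tooling the author used.

```python
import random


def replay_mix(conversational, general, replay_ratio, seed=0):
    """Experience replay: add a sample of general-domain examples back into
    the conversational fine-tuning set, then shuffle the result."""
    rng = random.Random(seed)
    n_replay = min(int(len(conversational) * replay_ratio), len(general))
    mixed = list(conversational) + rng.sample(general, n_replay)
    rng.shuffle(mixed)
    return mixed


def soup(base, tuned, alpha=0.5):
    """Model soup of two checkpoints: element-wise interpolation of weights.
    alpha = 0 returns the base model, alpha = 1 the fine-tuned one."""
    assert base.keys() == tuned.keys()
    return {
        name: [(1 - alpha) * b + alpha * t for b, t in zip(base[name], tuned[name])]
        for name in base
    }


# Toy "checkpoints" with a single parameter tensor each.
base_weights = {"layer.w": [0.0, 2.0]}
tuned_weights = {"layer.w": [1.0, 4.0]}
souped = soup(base_weights, tuned_weights, alpha=0.5)  # {"layer.w": [0.5, 3.0]}
```

Replay acts during training by keeping general-domain examples in the batch stream, while souping acts afterwards by pulling the fine-tuned weights back toward the base model.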

[NLP-109] BERTomelo: Your Portuguese Encoder Best Friend

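Quick read: This paper addresses the lack of a high-performance monolingual encoder for Portuguese that keeps pace with modern architectures. Existing models such as BERTimbau and Albertina lag behind recent English benchmarks in scalability and efficiency. The paper introduces BERTomelo, a next-generation monolingual encoder pre-trained from scratch and optimized for Portuguese. It adopts the ModernBERT architecture with hardware-level optimizations such as FlashAttention and alternating attention mechanisms, and comes in Base and Large versions with a 1,024-token context window. The model is trained on ClassiCC-PT, a large, high-quality Portuguese corpus of 106 million documents, to align its representations with contemporary Portuguese usage. According to the reported results, BERTomelo outperforms previous Portuguese encoders and offers a more robust and efficient alternative to massive multilingual models on downstream tasks such as semantic textual similarity (STS) and named entity recognition (NER).

Link: https://arxiv.org/abs/2606.28999
Authors: Rennê Ruan Alves Oliveira,Gustavo Cordeiro Galvão Van Erven,Luís Paulo Faina Garcia
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: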

Abstract:Encoders have become the state of the art for multiple NLP tasks, especially those requiring deep contextual understanding. While multilingual models offer broad coverage, dedicated monolingual encoders are essential for capturing the unique lexical and syntactic nuances of specific languages. For Portuguese, however, existing monolingual options like BERTimbau and Albertina have not kept pace with recent architectural breakthroughs, often lagging behind English benchmarks in scalability and efficiency. This work introduces BERTomelo, a next-generation monolingual encoder pre-trained from scratch and specifically optimized for the Portuguese language. By leveraging the ModernBERT architecture, BERTomelo overcomes the limitations of previous models, offering Base and Large versions with a 1,024-token context window and hardware-level optimizations like FlashAttention and alternating attention mechanisms. The model was trained on ClassiCC-PT, a massive, high-quality Portuguese corpus of 106 million documents, ensuring superior alignment with the language’s contemporary usage. The results demonstrate that BERTomelo not only outperforms previous Portuguese encoders but also provides a more robust and efficient alternative to massive multilingual models in downstream tasks such as STS and NER.

[NLP-110] Fine-Tuning General-Purpose Large Language Models for Agricultural Applications: A Reproducible Framework and Evaluation Protocol Based on Qwen3-8B

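Quick read: This paper addresses the fact that agricultural applications of general-purpose large language models (LLMs) are domain-specific, region-dependent, time-sensitive, and safety-critical. Without data governance, expert evaluation, and evidence constraints, a model may give unreliable advice on crop diseases, pesticide use, fertilization, or policy interpretation. The proposed solution is AgriTune-R, a reproducible and auditable framework for agricultural domain adaptation. It recommends the publicly verifiable Qwen3-8B as the base model and integrates agricultural data governance, instruction construction, LoRA/QLoRA parameter-efficient fine-tuning, retrieval-augmented generation, expert evaluation, and safety control for high-risk questions. The contributions are: (1) a structured workflow for agricultural LLM adaptation; (2) an evaluation protocol covering agricultural knowledge QA, pest and disease consultation, cultivation management, and policy explanation; (3) an expert-review rubric combining factuality, safety, evidence consistency, and uncertainty expression; and (4) a clear separation between protocol design and empirical conclusions, providing an executable baseline for future studies.

Link: https://arxiv.org/abs/2606.28992
Authors: Zhaoyang Li,Ruijie Zhang,Jiaqi Liu,Zhaoji Sun
Affiliations: Sanya University; Hebei International Studies University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: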

Abstract:General-purpose large language models (LLMs) have demonstrated strong abilities in open-domain question answering, information extraction, and text generation. Agricultural applications, however, are domain-specific, region-dependent, time-sensitive, and safety-critical. Without data governance, expert evaluation, and evidence constraints, an agricultural assistant may produce unreliable advice on crop diseases, pesticide use, fertilization, or policy interpretation. To avoid presenting unverified simulated numbers as real experimental findings, this paper does not report any model-performance claims that have not been produced by an actual training run and expert evaluation. Instead, we propose AgriTune-R, a reproducible and auditable framework for adapting general-purpose LLMs to agricultural tasks. The framework selects the publicly verifiable Qwen3-8B model as the recommended base model and integrates agricultural data governance, instruction construction, LoRA/QLoRA parameter-efficient fine-tuning, retrieval-augmented generation, expert evaluation, and safety control for high-risk questions. The contributions are: (1) a structured workflow for agricultural LLM adaptation; (2) an evaluation protocol for agricultural knowledge QA, pest and disease consultation, cultivation management, and policy explanation; (3) an expert-review rubric combining factuality, safety, evidence consistency, and uncertainty expression; and (4) a clear separation between protocol design and empirical conclusions, providing an executable baseline for future empirical studies.

[NLP-111] Can LLMs Hire Fairly? Racial Bias in Resume Screening

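Quick read: This paper asks whether large language models (LLMs) discriminate by race or gender when screening resumes. Following the paired-resume methodology of Kline, Rose, and Walters (2022), it compares how often models recommend fictitious resumes that carry different identity signals (for example White versus Black, or male versus female) for the same posting. The sole model from 2023 still shows a significant pro-White callback gap (+2.12 percentage points, significant at the 1% level), whereas every model released in 2024 or later shows either a null gap or a significant pro-Black reversal (up to -3.01 percentage points). The same pattern appears on the gender axis. Based on 24,024 paired postings per model across 14 mainstream models, the study documents a reversal in the direction of algorithmic hiring bias across model generations.

Link: https://arxiv.org/abs/2606.28978
Authors: Zhenyu Gao,Wenxi Jiang,Yutong Yan
Affiliations: The Chinese University of Hong Kong; CUHK Business School
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: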

Abstract:We audit fourteen mainstream large language models (LLMs) for hiring discrimination using the paired-resume methodology of Kline, Rose, and Walters (2022). The sole 2023-vintage model reproduces the pro-White callback gap documented in field experiments on labor market discrimination (+2.12 pp, significant at the 1% level). Every model released in 2024 or after shows either a null gap or a significant pro-Black reversal (up to -3.01 pp). The same pattern holds on the gender axis. Based on 24,024 paired postings per model across 14 models, our results document a reversal in the direction of algorithmic hiring bias across model generations.
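
The callback gap quoted in this abstract is a difference in callback rates expressed in percentage points (pp). A minimal sketch of that arithmetic is shown below, assuming each paired posting yields one yes/no decision per resume; the data layout and the function name `callback_gap_pp` are illustrative, and significance testing is left out.

```python
def callback_gap_pp(pairs):
    """pairs: list of (white_callback, black_callback) booleans, one tuple
    per paired posting. Returns the White-minus-Black callback rate in
    percentage points, so a positive value is a pro-White gap and a
    negative value is a pro-Black reversal."""
    n = len(pairs)
    white_rate = sum(1 for w, _ in pairs if w) / n
    black_rate = sum(1 for _, b in pairs if b) / n
    return 100.0 * (white_rate - black_rate)


# Four toy postings: 2 of 4 White resumes and 1 of 4 Black resumes get a callback.
toy_pairs = [(True, True), (True, False), (False, False), (False, False)]
gap = callback_gap_pp(toy_pairs)  # 25.0
```

Under this sign convention the abstract's +2.12 pp and -3.01 pp correspond to a pro-White gap and a pro-Black reversal respectively.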

[NLP-112] Beyond the Mean: Three-Axis Fidelity for Aligning LLM-Based Survey Simulators from Small Pilot Data ICML2026

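Quick read: This paper addresses systematic biases in large language models (LLMs) used to simulate social survey responses: marginal distributions are skewed, response variance is poorly calibrated, and predictor-outcome relationships are attenuated. The central question is whether, given only a small pilot sample of human responses, an LLM can recover the statistical characteristics of a broader population. Recovery is decomposed along three axes: structural fidelity, marginal fidelity, and individual fidelity. Using a COVID-19 misinformation survey as a case study, the authors benchmark three families of approaches: prompting, rectification, and fine-tuning. The findings suggest that fine-tuning on small pilot samples offers a balanced way to achieve several forms of fidelity at once, but fidelity levels vary across subsamples, which may threaten pluralistic alignment.

Link: https://arxiv.org/abs/2606.28963
Authors: Eun Cheol Choi,Youngrae Kim,Prabhu Pugalenthi,Hong-En Chen,Bo-Ruei Huang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 11 pages, 8 tables, 3 figures; Pluralistic Alignment @ ICML 2026 Workshop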

Abstract:Large language models (LLMs) are increasingly used to simulate social survey responses, yet their outputs exhibit systematic biases: marginal distributions are skewed, response variance is poorly calibrated, and predictor-outcome relationships are attenuated. We ask a simple question: given a small pilot sample of human responses, can an LLM recover the statistical characteristics of a broader population? We decompose recovery along three axes: structural fidelity, marginal fidelity, and individual fidelity. Using a COVID-19 misinformation survey as a case study, we benchmark three families of approaches: prompting, rectification, and fine-tuning. The findings suggest that fine-tuning on small pilot samples offers a balanced approach for achieving multiple forms of fidelity, but the levels of such fidelity can vary across subsamples, potentially threatening pluralistic alignment.

[NLP-113] Clustering Unsupervised Representations as Defense against Poisoning Attacks on Speech Commands Classification System

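Quick read: This paper addresses dirty-label poisoning attacks on speech commands classification systems, in which an attacker superimposes a trigger on some utterances of a source class and relabels them as an attacker-chosen target class. The proposed filtering defense has three steps. First, DIstillation with NO labels (DINO) learns unsupervised representations for all training examples. Next, K-means and LDA cluster these representations. Finally, only the utterances that carry the most frequent label in their cluster are kept for training, and the rest are discarded. With 10% of the source class poisoned, the attack success rate drops from 99.75% to 0.25%. The defense is tested against a variety of threat models, including different target and source classes and trigger variations.

Link: https://arxiv.org/abs/2606.28953
Authors: Thomas Thebaud,Sonal Joshi,Henry Li,Martin Sustek,Jesus Villalba,Sanjeev Khudanpur,Najim Dehak
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Published in ASRU 2025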

Abstract:Poisoning attacks entail attackers intentionally tampering with training data. In this paper, we consider a dirty-label poisoning attack scenario on a speech commands classification system. The threat model assumes that certain utterances from one of the classes (source class) are poisoned by superimposing a trigger on it, and its label is changed to another class selected by the attacker (target class). We propose a filtering defense against such an attack. First, we use DIstillation with NO labels (DINO) to learn unsupervised representations for all the training examples. Next, we use K-means and LDA to cluster these representations. Finally, we keep the utterances with the most repeated label in their cluster for training and discard the rest. For a 10% poisoned source class, we demonstrate a drop in attack success rate from 99.75% to 0.25%. We test our defense against a variety of threat models, including different target and source classes, as well as trigger variations.
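
The final filtering step of this defense (keep only the utterances whose label is the most frequent one in their cluster) can be sketched as below. The cluster assignments are taken as given here, since producing them with DINO, K-means, and LDA is outside the scope of the sketch; the function name and the toy labels are illustrative assumptions.

```python
from collections import Counter, defaultdict


def filter_by_cluster_majority(cluster_ids, labels):
    """Return the sorted indices of examples whose label matches the most
    frequent label of their cluster; all other examples are discarded."""
    by_cluster = defaultdict(list)
    for idx, (cluster, label) in enumerate(zip(cluster_ids, labels)):
        by_cluster[cluster].append((idx, label))
    kept = []
    for members in by_cluster.values():
        majority, _ = Counter(label for _, label in members).most_common(1)[0]
        kept.extend(idx for idx, label in members if label == majority)
    return sorted(kept)


# Toy data: example 3 sounds like "yes" (it sits in cluster 0) but was
# relabelled "no" by the attacker, so its label disagrees with its cluster.
cluster_ids = [0, 0, 0, 0, 1, 1, 1]
labels = ["yes", "yes", "yes", "no", "no", "no", "no"]
kept = filter_by_cluster_majority(cluster_ids, labels)  # [0, 1, 2, 4, 5, 6]
```

The filter works only if the representations place a poisoned utterance near its true class, so that its tampered label ends up in the minority of its cluster.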

[NLP-114] A3M: Adaptive Adversarial and Multi-Objective Learning for Strategic Bidding in Repeated Auctions

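Quick read: This paper addresses the challenge of learning to bid in repeated multi-unit auctions with bandit feedback. Existing methods often rely on rigid explore-then-exploit schedules, assume stationary adversaries, and optimize only bidder utility, which limits adaptability and strategic robustness. The proposed A3M framework integrates adaptive deep reinforcement learning (DRL), explicit adversarial reasoning, and principled multi-objective reward design for online auction strategy optimization. It uses an actor-critic DRL backbone to balance exploration and exploitation dynamically, an opponent model for fictitious play against non-stationary adversaries, and a composite reward function that jointly maximizes bidder utility, auctioneer revenue, and fairness. In both discriminatory and uniform price auctions, the reported results show that A3M reduces final regret by 30-40% relative to established baselines in standard settings, stays robust to adversarial strategy shifts, scales favorably with the number of units K, and supports tunable multi-objective trade-offs. An ablation study supports the necessity of each core component.

Link: https://arxiv.org/abs/2606.28943
Authors: Junhan Li,Yuxin Zhang,Haoran Wang,Minghao Chen
Affiliations: Nanjing University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 23 pages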

Abstract:Learning to bid in repeated multi-unit auctions with bandit feedback poses a fundamental challenge. Existing methods often rely on rigid explore-then-exploit schedules, assume stationary adversaries, and optimize solely for bidder utility, thereby limiting adaptability and strategic robustness. To address these limitations, we introduce the A3M framework, which integrates adaptive deep reinforcement learning (DRL), explicit adversarial reasoning, and principled multi-objective reward design for online auction strategy optimization. A3M employs an actor-critic DRL backbone to dynamically balance exploration and exploitation, an opponent model for fictitious play against non-stationary adversaries, and a composite reward function to jointly maximize utility, auctioneer revenue, and fairness. We provide the first comprehensive empirical evaluation of this integrated approach against established baselines in both discriminatory and uniform price auctions. Results show that A3M reduces final regret by 30–40% in standard settings, maintains robust performance against adversarial strategy shifts, scales favorably with the number of units K, and enables tunable multi-objective trade-offs. An extensive ablation study confirms the necessity of each core component. Our work establishes A3M as a powerful and flexible framework for learning in complex auction environments.

[NLP-115] EVLA: An Electro-Aware Multimodal Assistant for Physically-Grounded Driving Reasoning and Control

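Quick read: This paper addresses the tendency of current vision-language models (VLMs) for driving assistants to treat vehicle dynamics as a black box, so that their decisions lack awareness of the vehicle's real-time electro-mechanical state. The proposed Electro-Visual-Language Assistant (EVLA) combines multimodal scene understanding with real-time perception of the electrified powertrain state (for example motor torque and battery state of charge). It has two key components. The first is a Unified Co-State Encoder (UCSE) that fuses visual, textual, and vehicle-state inputs into a shared latent representation, augmented with an Energy-Efficiency Field that models spatial energy costs. The second is an Electro-aware Structured Reasoning Chain (ESRC), which replaces external chain-of-thought prompting with an internal, deterministic reasoning process grounded in physical constraints and optimization objectives. Trained end-to-end with a physics-guided joint loss, EVLA generates context-aware, energy-optimal driving decisions. On a driving QA benchmark it outperforms strong fine-tuned VLM baselines, improving the final score by +0.0871 and accuracy by +5.6%, and it runs inference 36% faster than multi-stage pipelines.

Link: https://arxiv.org/abs/2606.28938
Authors: Yuxin Liu,Zihan Chen,Haoyu Wang,Mingxuan Zhang,Ruijie Lin,Siyuan Zhao
Affiliations: Zhejiang University
Categories: Computation and Language (cs.CL)
Comments: 17 pages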

Abstract:Modern vision-language models (VLMs) for driving assistants typically treat vehicle dynamics as a black box, resulting in decisions that lack awareness of the vehicle’s real-time electro-mechanical state. To bridge this gap, we introduce the Electro-Visual-Language Assistant (EVLA) – a novel framework that combines multi-modal scene understanding with real-time perception of the electrified powertrain state (e.g., motor torque, battery SOC). Our approach features two key innovations: first, a Unified Co-State Encoder (UCSE) that fuses visual, textual, and vehicle-state inputs into a shared latent representation, augmented with an Energy-Efficiency Field to model spatial energy costs; and second, an Electro-aware Structured Reasoning Chain (ESRC), which replaces external chain-of-thought prompting with an internal, deterministic reasoning process grounded in physical constraints and optimization objectives. Trained end-to-end with a physics-guided joint loss, EVLA learns to generate context-aware and energy-optimal driving decisions. Extensive evaluations on a driving QA benchmark demonstrate that EVLA substantially outperforms strong fine-tuned VLM baselines, improving the final score by +0.0871 and accuracy by +5.6%. Ablation studies validate the necessity of each component, and efficiency analyses show that EVLA achieves 36% faster inference than multi-stage pipelines. This work underscores that integrating vehicle-state awareness and structured physical reasoning is crucial for developing next-generation, physically-grounded driving assistants.

[NLP-116] FinInvest-GTCN: Explainable Graph-Temporal-Causal Modeling for Risk-Aware Investment Decision Optimization

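Quick read: This paper addresses several challenges in venture capital (VC) investment decisions: multi-source heterogeneous data, non-stationary time series, and the need for explainable predictions in high-stakes, low-data settings. It proposes FinInvest-GTCN, a Graph-Temporal-Causal Network that recasts the task from content recommendation to quantitative risk-return assessment. The architecture combines a relational graph encoder that captures the topology of the investment ecosystem, a multi-scale temporal fusion module that handles long-term dependencies and non-stationarity, and a causal decision head that produces risk-adjusted predictions with interpretable causal attributions. A core component is the Meta-Causal Adaptation (MCA) strategy, which makes fine-tuning on new, data-scarce sectors more robust by aligning updates with causally plausible structures derived from meta-pretraining. On proprietary VC datasets, FinInvest-GTCN lowers the primary metric, Risk-Adjusted Mean Squared Error (RA-MSE), from a baseline of 3.05 to 2.51 and raises the cumulative return of a simulated portfolio by 18.7%. Ablation studies support the role of each component, and additional analyses examine stability, interpretability, and adaptability.

Link: https://arxiv.org/abs/2606.28933
Authors: Junyan Tan,Yifan Li,Minghao Wang,Zihan Chen,Haoyu Zhang
Affiliations: Zhejiang University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 28 pages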

Abstract:Venture capital (VC) investment decisions face distinct challenges, such as multi-source heterogeneous data, non-stationary time series, and the demand for explainable predictions in high-stakes, low-data settings. To overcome these issues, we introduce FinInvest-GTCN, a Graph-Temporal-Causal Network that redefines the task from content recommendation to quantitative risk-return assessment. This architecture combines a relational graph encoder to capture the investment ecosystem’s topology, a multi-scale temporal fusion module to handle long-term dependencies and non-stationarity, and a causal decision head that generates risk-adjusted predictions with interpretable causal attributions. A core innovation is the Meta-Causal Adaptation (MCA) strategy, which facilitates robust fine-tuning for new, data-scarce sectors by aligning updates with causally-plausible structures derived from meta-pretraining. Comprehensive experiments on proprietary VC datasets show that FinInvest-GTCN delivers state-of-the-art results, markedly lowering the primary Risk-Adjusted Mean Squared Error (RA-MSE) to 2.51 from a baseline of 3.05 and boosting the cumulative return of a simulated portfolio by 18.7%. Ablation studies underscore the essential role of each component, while additional analyses confirm the model’s stability, interpretability, and enhanced adaptability. This work pioneers a data-driven, explainable framework for investment decision support.

[NLP-117] Latent Bridges for Multi-Table Question Answering

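Quick read: This paper addresses how to combine the structure of relational data with the reasoning ability of large language models (LLMs) in table question answering (TQA). Existing methods are described as limited on multi-table and complex reasoning tasks, because relations between tables are modelled insufficiently and fine-tuning the LLM directly is costly. The paper proposes GRAB, a constructor-encoder-bridge pipeline: relational data is lifted into a heterogeneous graph, the graph is encoded via message passing, and the resulting signals are passed to the LLM through a small set of query-conditioned latent tokens. The LLM thus receives a compact, task-relevant structural representation alongside the flattened text. The LLM itself stays strictly frozen to preserve its general reasoning capabilities; only the lightweight graph encoder and latent bridge (91M parameters) are trained. The pipeline significantly improves relational question answering, with the largest gains in demanding multi-table settings.

Link: https://arxiv.org/abs/2606.28916
Authors: Simone Varriale,Tamara Cucumides,Floris Geerts,Paolo Papotti
Affiliations: EURECOM; University of Antwerp
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: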

Abstract:We introduce GRAB, a constructor-encoder-bridge pipeline for table question answering. Our method lifts relational data into a heterogeneous graph, encodes it via message passing, and transfers the signals to an LLM through a small set of query-conditioned latent tokens. This provides the LLM with a compact, task-relevant structural representation together with the flattened text. Crucially, the LLM remains strictly frozen to preserve its general reasoning capabilities; we train only the lightweight graph encoder and latent bridge (91M parameters), allowing the entire pipeline to be trained efficiently. Our pipeline significantly improves performance on relational Question Answering, with the largest gains in demanding multi-table settings, offering an efficient, principled way to connect relational deep learning with LLMs.
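
The first stage of this pipeline, lifting relational data into a heterogeneous graph, can be sketched under one common convention: every row becomes a node typed by its table, and every foreign-key reference becomes a typed edge. The abstract does not specify GRAB's actual construction, so the schema, the `id` column convention, and the function `lift_to_graph` below are assumptions for illustration.

```python
def lift_to_graph(tables, foreign_keys):
    """tables: {table_name: [row dicts, each with an 'id' key]}
    foreign_keys: [(child_table, column, parent_table)]
    Returns (nodes, edges): nodes maps (table, id) to the row, and edges is a
    list of (source_node, edge_type, target_node) triples."""
    nodes = {(name, row["id"]): row for name, rows in tables.items() for row in rows}
    edges = []
    for child, column, parent in foreign_keys:
        for row in tables[child]:
            target = (parent, row[column])
            if target in nodes:  # skip dangling references
                edges.append(((child, row["id"]), f"{child}.{column}->{parent}", target))
    return nodes, edges


# Hypothetical two-table database.
tables = {
    "customer": [{"id": 1, "name": "Ada"}],
    "order": [
        {"id": 10, "customer_id": 1, "total": 30},
        {"id": 11, "customer_id": 1, "total": 5},
    ],
}
nodes, edges = lift_to_graph(tables, [("order", "customer_id", "customer")])
```

A message-passing encoder would then exchange information along these typed edges, so that a question spanning both tables can draw on rows that a flattened text serialization would place far apart.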

[NLP-118] MedEvoEval: Evaluating Continual Evolution of Doctor Agents through Simulated Clinical Episodes

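Quick read: This paper addresses the gap between current evaluations of doctor agents and the dynamic decision-making they need over long-running outpatient care. Existing evaluations mostly cover single-turn question answering or fixed interactive runs, and so capture little of the evidence accumulation, resource use, cross-episode memory updates, and decision evolution that occur over multiple encounters. The proposed MedEvoEval is an executable longitudinal evaluation framework built on action-gated simulated outpatient episodes. Each source case is converted into role-specific patient, examination, and manager views, and evidence is revealed only through valid actions. Each episode records a structured trace linking observations, actions, final outputs, manager scores, and optional experience write-back. The framework supports analysis of resource allocation, multidisciplinary team (MDT) style consultation, memory maturation, held-out transfer, update-stage response, and backward retention. Experiments show that episode traces expose process costs and longitudinal patterns that final-answer scoring hides, giving a concrete basis for evaluating whether doctor agents improve through experience, transfer useful behavior, and retain earlier capabilities.

Link: https://arxiv.org/abs/2606.28900
Authors: Hui Zhang
Affiliations: Beijing Institute of Technology
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 31 pages, including appendices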

Abstract:Doctor agents are moving beyond single-turn answer generation toward evolving clinical decision systems. Within an outpatient episode, they acquire evidence, use examination and consultation resources, and decide when to finalize a diagnosis and management plan. Across episodes, their behavior may change through memory, retrieval, reflection, or other update mechanisms. Current evaluations only partially cover this setting. Fixed-input medical QA benchmarks score final answers from complete inputs, whereas many interactive benchmarks still focus on individual encounters or fixed runs, providing limited support for evaluating how episode-level decisions interact with cross-episode experience. We introduce MedEvoEval, an executable longitudinal evaluation framework based on action-gated simulated outpatient episodes. Each source case is converted into role-specific patient, examination, and manager views; evidence is revealed only through valid actions; and each episode records a structured trace that links observations, actions, final outputs, manager scores, and optional experience write-back. We release a runnable ED artifact with 700 processed episodes, provenance notes, schemas, an episode runner, scoring scripts, configurations, example logs, analysis code, and trajectory- and step-level derivatives. Experiments show that episode traces expose process costs hidden by final-answer scoring, show how MDT-style consultation reallocates resources, and support longitudinal analyses of memory maturation, held-out transfer, update-stage response, and backward retention. Together, these results show that MedEvoEval provides a concrete basis for evaluating whether doctor agents improve through experience, transfer useful behavior, and retain earlier capabilities over time.
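
The action-gated design described in this abstract, where evidence is revealed only through valid actions and every step is logged to a trace, can be illustrated with a very small sketch. The class `Episode`, the action names, and the clinical strings are invented for illustration and are not taken from the released artifact.

```python
class Episode:
    """Toy action-gated episode: each piece of evidence is hidden behind the
    action that would obtain it, and every attempted action is traced."""

    def __init__(self, hidden_evidence):
        self.hidden = dict(hidden_evidence)  # action name -> observation
        self.trace = []                      # structured per-step log

    def act(self, action):
        valid = action in self.hidden
        observation = self.hidden[action] if valid else None
        self.trace.append({"action": action, "valid": valid, "observation": observation})
        return observation


# Hypothetical case: two valid actions exist; anything else reveals nothing.
episode = Episode({"ask_symptoms": "chest pain", "order_ecg": "ST elevation"})
first = episode.act("ask_symptoms")   # "chest pain"
second = episode.act("order_mri")     # None, but still recorded in the trace
```

Because invalid and redundant actions are logged as well, the trace carries the process cost of reaching an answer, which a final-answer score alone would not show.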

[NLP-119] PASTA: A Paraphrasing And Self-Training Approach for Knowledge Updating in LLMs

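Quick read: This paper addresses knowledge updating in pre-trained large language models (LLMs), in particular their difficulty answering questions about specific factual information such as news articles. Continual training is a possible route but still poses substantial technical difficulties. The proposed PASTA framework combines data augmentation, question-answer generation, and a novel self-learning Direct Preference Optimization (DPO) process, which together enable knowledge overwriting and hallucination suppression. In an evaluation on web articles published after the base model's knowledge cutoff, PASTA raises question-answering accuracy from 0.02 to 0.82 while maintaining general language capabilities, which the authors present as evidence of its usefulness for building domain-specialized LLMs.

Link: https://arxiv.org/abs/2606.28898
Authors: Takayuki Yamamoto,Daisuke Kawahara
Affiliations: Waseda University
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 3 figures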

Abstract:Knowledge updating in pre-trained Large Language Models (LLMs) remains an important challenge. While continual training provides a potential avenue for knowledge updating, it continues to present substantial technical difficulties. Furthermore, LLMs often struggle with accurately answering questions about specific factual information, such as news articles - a capability limitation widely recognized in the research community. This paper proposes PASTA, a simple yet powerful framework for integrating detailed factual information from news articles as new knowledge into LLMs, with the primary goal of building specialized models that accurately answer questions about this knowledge. Our framework combines data augmentation, question-answering generation, and a novel self-learning DPO process that simultaneously enables knowledge overwriting and hallucination suppression. We provide insights into effective knowledge updating through systematic analysis of learning parameters and data configurations. In our experimental evaluation with web articles published after the base model’s knowledge cutoff, PASTA achieved remarkable improvement from 0.02 to 0.82 accuracy while maintaining general language capabilities, demonstrating its effectiveness for creating domain-specialized LLMs.

[NLP-120] Memory-Managed Long-Context Attention: A Preliminary Study of Editable Request-Local Memory

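Quick read: This paper addresses two goals that long-context language models often conflate: compressing history into an efficient state, and maintaining reliable long-term memory. Linear, recurrent, and sparse attention reduce the cost of processing long sequences, but do not specify when a fact should be written, overwritten, protected, or discarded. The paper studies memory-managed long-context attention, which separates a fast recurrent or sparse backbone from explicit, editable request-local memory slots and a query-time sparse fallback. Across the tasks studied, pure fixed-state and pure sparse methods each fail some overwrite, version, anti-pollution, or no-write-signal cases, while the hybrid covers them. A small 2,097,152-token mechanism stress test reaches 50/50 pooled accuracy. A 2.74M-parameter minimal causal event-token model reaches 595/600 with lite write supervision, which supports trainability rather than scale. A six-family frozen-hidden-state bridge reaches 1079/1080 controlled pointer accuracy, but it relies on generator-provided integer key IDs and separately encoded canonical key strings, so it is an oracle-metadata probe rather than open-text entity resolution. Local non-leaderboard RULER 4K diagnostics stay close to full context, whereas a 33-record LongBench v1 16K subset shows that naive lexical selection does not generalize. The evidence supports three claims: a controlled slot lifecycle is feasible, sparse fallback is needed when writes lack future-query signals, and learned open-domain selection remains the main architectural bottleneck. The paper does not claim a final generative architecture, global slot-trajectory convergence, or systems superiority.

Link: https://arxiv.org/abs/2606.28876
Authors: Junyi Zou,Avrova Donz
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 14 pages, 2 figures, 4 tables. Preliminary technical report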

Abstract:Long-context language models often conflate two different goals: compressing history into an efficient state, and maintaining reliable long-term memory. Linear, recurrent, and sparse attention reduce the cost of processing long sequences, but they do not by themselves specify when a fact should be written, overwritten, protected from distractors, or discarded. We study memory-managed long-context attention, a research route that separates a fast recurrent or sparse backbone from explicit editable request-local memory slots and query-time sparse fallback. Across structured synthetic tasks, token/chunk/sequence bridges, generated natural language, and local frozen-model diagnostics, pure fixed-state or pure sparse methods fail some overwrite, version, anti-pollution, or no-write-signal cases, while a hybrid covers both routes. A small 2,097,152-token mechanism stress test reaches 50/50 pooled accuracy with 2-132 active chunks. A 2.74M-parameter minimal causal event-token model reaches 595/600 with lite write supervision, supporting proof of trainability rather than scale. A six-family frozen-hidden-state bridge reaches 1079/1080 controlled pointer accuracy, but it uses generator-provided integer key IDs and separately encoded canonical key strings; it is an oracle-metadata probe, not open-text entity resolution. Local non-leaderboard RULER 4K diagnostics remain close to full context, whereas a 33-record LongBench v1 16K subset shows that naive lexical selection is not general. The evidence separates three claims: controlled slot lifecycle is feasible, sparse fallback is needed when writes lack future-query signals, and learned open-domain selection remains the main architectural bottleneck. We do not claim a final generative architecture, global slot-trajectory convergence, or systems superiority.
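
The slot lifecycle named in this abstract (write, overwrite, protect, discard) together with a query-time fallback can be pictured with a small dictionary-based sketch. It is a conceptual toy rather than the paper's attention mechanism: the class `SlotMemory`, the version counter, and the substring-based fallback over raw chunks are all assumptions, and the fallback is exactly the kind of naive lexical selection the abstract reports as not general.

```python
class SlotMemory:
    """Toy request-local memory with an explicit slot lifecycle."""

    def __init__(self):
        self.slots = {}         # key -> (value, version)
        self.protected = set()  # keys that later writes may not change

    def write(self, key, value):
        """Write or overwrite a slot; refused if the key is protected."""
        if key in self.protected:
            return False
        _, version = self.slots.get(key, (None, 0))
        self.slots[key] = (value, version + 1)
        return True

    def protect(self, key):
        self.protected.add(key)

    def discard(self, key):
        self.protected.discard(key)
        self.slots.pop(key, None)

    def read(self, key, chunks):
        """Answer from a slot if one was written; otherwise fall back to
        scanning the raw context chunks, most recent first."""
        if key in self.slots:
            return self.slots[key][0], "slot"
        for chunk in reversed(chunks):
            if key in chunk:
                return chunk, "fallback"
        return None, "miss"


memory = SlotMemory()
memory.write("door_code", "1234")
memory.write("door_code", "9876")  # overwrite: the newer version wins
memory.protect("door_code")        # a later distractor write is now refused
```

The fallback path matters for facts that were never written to a slot because nothing at write time signalled that a later query would need them.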

[NLP-121] Open but Incompatible: A License Compatibility Analysis of Corpora for Low-Resource African Languages LREC COLING2026

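[Quick read]: This paper addresses license-compatibility problems in corpus releases for African natural language processing (NLP), focusing on how Creative Commons (CC) licenses are misapplied in practice. Although CC licenses dominate African NLP corpus releases, their compatibility rules are rarely applied: CC-BY-SA and CC-BY-NC material cannot be combined in a single published dataset, and a NoDerivs clause silently prohibits tokenisation and annotation. The paper audits the license provenance of more than twenty widely used corpus families, builds a six-tier license compatibility matrix, and applies it to three case-study languages: Kituba/Munukutuba, Zarma, and Moore. It documents four failure modes: outright prohibition (JW300, removed from OPUS over a Terms of Service violation); composite license misrepresentation (WAXAL, whose CC-BY 4.0 claim is contradicted by its own HuggingFace dataset card); a NoDerivs clause hidden behind a CC-BY label (Tanzil); and data persistence failure (the Congolese Radio Corpus, where 402 of 405 source URLs, about 99%, are now dead). The key contributions toward a solution are a pre-annotation due diligence checklist and a survey of legally clean corpus enrichment opportunities.

Link: https://arxiv.org/abs/2606.28867
Authors: Ernst van Gassen
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 12 pages. Published in Proceedings of Resources for African Indigenous Languages (RAIL) 2026 @ LREC-COLING 2026, pages 128-139

Click to view abstract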

Abstract:Creative Commons licenses dominate African NLP corpus releases, but their compatibility rules are rarely applied. CC-BY-SA and CC-BY-NC cannot be combined in a single published dataset; a NoDerivs clause silently prohibits tokenisation and annotation. This paper audits the license provenance of over twenty corpus families used in African NLP, constructs a six-tier compatibility matrix, and applies it to three case-study languages: Kituba/Munukutuba, Zarma, and Moore. Four failure modes are documented with primary-source evidence: outright prohibition (JW300, removed from OPUS after a legal audit confirmed Terms of Service violation); composite license misrepresentation (WAXAL, whose CC-BY 4.0 claim is contradicted by its own HuggingFace dataset card); a NoDerivs clause hidden behind a CC-BY label (Tanzil); and data persistence failure (the Congolese Radio Corpus, where 402 of 405 source URLs are now dead). A pre-annotation due diligence checklist and a survey of legally clean enrichment opportunities close the paper.
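The two compatibility rules named in the abstract can be sketched as a small check. This is a deliberately simplified illustration, not the paper's six-tier matrix and not legal advice: the function names `conditions` and `can_merge` are hypothetical, and the model assumes that merging corpora produces adapted material and looks only at the BY/NC/SA/ND condition codes.

```python
import re

def conditions(license_name: str) -> set[str]:
    """Extract CC condition codes, e.g. 'CC-BY-NC-SA 4.0' -> {'BY', 'NC', 'SA'}."""
    return set(re.findall(r"BY|NC|SA|ND", license_name.upper()))

def can_merge(a: str, b: str) -> tuple[bool, str]:
    """Simplified check: can material under licenses a and b go into one released dataset?"""
    ca, cb = conditions(a), conditions(b)
    if "ND" in ca or "ND" in cb:
        return False, "NoDerivs material may not be shared in adapted form"
    for share_alike, other in ((ca, cb), (cb, ca)):
        # A ShareAlike source fixes the output license, so the other source
        # must not carry a condition (such as NC) that this license lacks.
        if "SA" in share_alike and not other <= share_alike:
            return False, "ShareAlike output license would drop a condition of the other source"
    return True, "no conflict under this simplified model"

for pair in [("CC-BY-SA", "CC-BY-NC"), ("CC-BY", "CC-BY-SA"), ("CC-BY-ND", "CC-BY")]:
    print(pair, can_merge(*pair))
```

A real audit would also need license versions, non-CC terms of service, and the provenance of each source, which is what the paper's checklist covers.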

[NLP-122] wav2VOT: Automatic estimation of voice onset time, closure duration, and burst realisation with wav2vec2 INTERSPEECH2026

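[Quick read]: This paper addresses the dependence of automatic speech annotation on substantial manual correction or training data, particularly for acoustic-phonetic measurements of stop consonants such as voice onset time, closure duration, and burst realisation. Automatic annotation tools are now commonplace, but many tasks still need manual correction or sizeable training sets to perform accurately. The paper introduces wav2VOT, a tool built on the wav2vec2 large speech model that estimates voice onset time (VOT), closure duration, and burst realisation. The core of the approach is fine-tuning a pretrained large speech model for phonetic annotation. Experiments show that wav2VOT performs comparably with current approaches on unseen datasets and reaches high accuracy after fine-tuning, and an analysis of its predictions shows high fidelity across stop voicing and place of articulation. The results indicate that large speech models can produce accurate phonetic annotations and motivate integrating them into phonetic research pipelines.

Link: https://arxiv.org/abs/2606.28857
Authors: James Tanner, Morgan Sonderegger, Jane Stuart-Smith, Tyler Kendall, Jeff Mielke
Affiliations: University of Glasgow; McGill University; University of Oregon; North Carolina State University
Categories: Sound (cs.SD); Computation and Language (cs.CL)
Comments: Accepted for Interspeech 2026. 6 pages, 4 figures

Click to view abstract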

Abstract:While automatic tools for speech annotation are now commonplace within phonetic research pipelines, many tasks require substantial manual correction or training sets to perform accurately. Simultaneously, large speech models such as wav2vec2 have been shown to perform well at speech classification tasks, raising the question of how these models may be applied to phonetic annotation tasks. We introduce wav2VOT: a tool for the automatic estimation of voice onset time, closure duration, and burst realisation using wav2vec2. We demonstrate that wav2VOT performs comparably with current approaches on unseen datasets, and can estimate with high accuracy with fine-tuning. Analysis of wav2VOT predictions demonstrates high fidelity across stop voicing and place of articulation. These results demonstrate that large speech models are capable of producing accurate annotations, and further motivate exploration of large speech models as tools in phonetic research pipelines.

[NLP-123] The Heterogeneous Safety Impacts of Benign Multilingual Fine-Tuning

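[Quick read]: This paper studies the safety degradation that fine-tuning can cause in large language models in multilingual settings: even fine-tuning on non-adversarial data can markedly increase a model's tendency to comply with harmful prompts, a phenomenon the authors call multilingual safety drift. The core problem is that existing, mostly English-centric studies do not reflect how safe a model actually is after fine-tuning in other languages, which leaves blind spots in deployment risk assessment. The key to the work is measuring how different combinations of fine-tuning language and evaluation language affect safety. The authors find that safety outcomes are highly sensitive to both choices, that safety drift is decoupled from general capability metrics, and that it occurs heterogeneously across languages and models. To support further cross-lingual safety research they release the Multilingual-Benign-Tune dataset and the SORRY-Bench-Multilingual evaluation suite, and they stress that assessing fine-tuning impacts in English alone gives inadequate assurance for deployment.

Link: https://arxiv.org/abs/2606.28843
Authors: Will Hawkins, Kaivalya Rawal, Jonathan Rystrøm, Stratis Tsirtsis, Zihao Fu, Greta Warren, Ryan Brown, Eoin Delaney, Sandra Wachter, Brent Mittelstadt, Chris Russell
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages

Click to view abstract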

Abstract:Fine-tuning a large language model is a ubiquitous method for enhancing its capability on a specific downstream task. However, prior work has shown that this increase in capability comes with a cost: it can increase a model’s tendency to respond to unsafe adversarial prompts, even when fine-tuning with non-adversarial data. We present the first comprehensive empirical study of this phenomenon in multilingual settings by fine-tuning Llama-3.2, Qwen3, and Gemma-3 models using benign data translated across nine languages. We find that safety outcomes are highly sensitive to both the choice of fine-tuning language and the evaluation language, with adversarial compliance rates increasing four-fold in some settings. Multilingual safety drift is decoupled from general capability metrics, and occurs heterogeneously across languages and models. Fine-tuning in non-English languages often induces smaller internal representational drifts than English, but these shifts lead models to default to either exaggerated compliance or refusal. As such, assessing fine-tuning impacts solely in English provides inadequate assurance for deployment. To facilitate further research into these cross-lingual safety blind spots, we release the Multilingual-Benign-Tune dataset and the SORRY-Bench-Multilingual evaluation suite.

[NLP-124] LAMP: Lean-based Agentic framework with MCP and Proof Repair

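[Quick read]: This paper addresses the problem that proofs generated by large language models are often unreliable and hard to verify, especially in Combinatorics on Words (CoW), a specialized area with little formalized knowledge to draw on. Interactive theorem provers such as Lean 4 guarantee reliability through kernel checking, but their reach is bounded by the formalized knowledge available; Mathlib, the main Lean 4 library, covers CoW poorly, so specialized provers perform badly there. The paper makes two contributions. First, it builds a Lean 4 formalization of CoW with eight modules and 93 declarations of core definitions and foundational lemmas. Second, it presents LAMP, a multi-agent framework that synthesizes kernel-verified Lean 4 proofs by supplying explicit, structured domain knowledge through an ontology at inference time rather than by fine-tuning a prover. LAMP coordinates a Planner, a Builder, and a Verifier that access the domain-specific CoW ontology through the Model Context Protocol. On a suite of 90 CoW theorems spanning all eight modules and three difficulty levels, LAMP produces verified proofs for 96.7% of theorems, substantially exceeding an unscaffolded baseline and existing specialized provers. Ablations show that removing either the tool-grounded architecture or the Planner/Builder separation costs roughly 12 percentage points.

Link: https://arxiv.org/abs/2606.28841
Authors: Santhana Srinivasan R, Maithilee Patawar
Affiliations: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract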

Abstract:Large language models are increasingly capable of mathematical reasoning, but the proofs they generate are often unreliable and hard to verify. Interactive theorem provers such as Lean 4 address this by accepting only kernel-checked proofs; however, their reach is bounded by the formalized knowledge available. While Mathlib, a repository of formalized Lean 4 theorems, covers diverse mathematical areas, certain specialized areas remain underrepresented, notably the domain of Combinatorics on Words (CoW). CoW studies sequences, exploring their properties such as periodicity, borders, conjugacy, and morphisms. As a result, specialized provers, trained on Mathlib-centered data, lack the lemmas to operate in CoW. We present two contributions. First, we introduce a Lean 4 formalization of CoW containing eight modules and 93 declarations of core definitions and foundational lemmas. Second, we present LAMP, a multi-agent framework that synthesizes kernel-verified Lean 4 proofs by providing explicit, structured domain knowledge at inference time through an ontology, rather than by fine-tuning a prover. LAMP coordinates a Planner, Builder, and Verifier with Model Context Protocol based access to a domain-specific CoW ontology. In a suite of 90 CoW theorems that span all eight modules and three difficulty levels, LAMP synthesizes verified proofs for 96.7% of theorems, substantially exceeding both an unscaffolded baseline and existing specialized provers. An ablation shows that removing LAMP’s tool-grounded architecture or its Planner/Builder separation each cost roughly 12 percentage points, even with the backbone model held fixed.

[NLP-125] Labeling Training Data for Entity Matching Using Large Language Models

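[Quick read]: This paper asks how large-scale entity matching can combine strong model performance with efficient inference without manually labeled, task-specific training data. Large language models (LLMs) match entities well in a zero-shot setting, but applying them to large sets of candidate pairs is slow and costly; traditional machine learning matchers and small language models (SLMs) are much faster at inference but need task-specific training data. The paper therefore studies knowledge-distillation workflows in which an LLM acts as a teacher that labels training pairs, which are then used to train a smaller student model. The key ingredients are the pair-selection strategy, the teacher model (such as GPT-5.2), the label post-processing method, and the student model. On the Abt-Buy, Walmart-Amazon, WDC Products, DBLP-ACM, and DBLP-Scholar benchmarks, student models trained on machine-labeled data perform approximately on par with the same models trained on the benchmark training sets, with remaining differences below two F1 points in either direction. Labeling the training sets for all five benchmarks with GPT-5.2 costs USD 28.31 to USD 40.88, whereas manual labeling is estimated at 470 hours of work, and at inference time Ditto is 41.5 to 534 times faster than matching directly with an LLM. The results indicate that current LLMs, combined with a suitable pair-selection method, can substantially reduce or even eliminate manual labeling for entity matching.

Link: https://arxiv.org/abs/2606.28823
Authors: Aaron Steiner, Christian Bizer
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 5 figures, 9 tables

Click to view abstract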

Abstract:Recent large language models (LLMs) achieve strong performance on entity matching without requiring task-specific training data. However, applying these models to large sets of candidate pairs remains slow and costly. In contrast, entity matchers using traditional machine learning methods or small language models (SLMs), such as RoBERTa, offer much faster inference but require task-specific training data. This paper investigates whether the need to provide task-specific training data can be avoided by using knowledge-distillation workflows, in which an LLM serves as a teacher model to label training pairs that are subsequently used to train a smaller student model. We investigate knowledge distillation for entity matching along the following dimensions: pair-selection strategy, teacher model, label post-processing method, and student model. We evaluate the workflows using the Abt-Buy, Walmart-Amazon, WDC Products, DBLP-ACM, and DBLP-Scholar benchmarks, and compare the performance of student models trained with machine-labeled data to the performance of the same models trained using the benchmark training sets. Our experiments show that student models trained using the machine-labeled sets perform approximately on par with models trained on the benchmark training sets, with the remaining differences in both directions staying below two F1 points. Using GPT-5.2 to label the training sets for all five benchmarks costs USD 28.31 to USD 40.88, whereas manually labeling the same training sets is estimated to require 470 hours of work. At inference time, Ditto is 41.5 to 534 times faster than directly using an LLM to perform the matching tasks. These results indicate that current LLMs, when combined with a suitable pair-selection method, can substantially reduce or even eliminate the manual effort required to label use case-specific training data for entity matching.
ACM classes: H.2.8; I.2.7; I.2.6. Cite as: arXiv:2606.28823 [cs.CL], https://doi.org/10.48550/arXiv.2606.28823. Submitted (v1): Sat, 27 Jun 2026 09:15:09 UTC.

[NLP-126] Categorizing Mathematical Concepts with LLM Voting Ensembles in Mathswitch

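[Quick read]: This paper tackles the noise that arises when mathematical concept data is imported from open knowledge bases such as Wikidata: because the graph is collaboratively edited, some imported items are non-mathematical and others are ambiguous. The key to the solution is a voting ensemble of large language model (LLM) judges that filters this noise. The ensemble is evaluated on Wikidata items with known MathWorld identifiers as a positive control, and the study also examines how classification changes when database identifiers are removed from the context. Cases where the judges disagree with MathWorld fall into three categories (degenerate descriptions, narrow scope bias, and editorial-scope mismatches), which suggest different remediation strategies.

Link: https://arxiv.org/abs/2606.28815
Authors: Katja Berčič, Slobodan Stanojevikj
Affiliations: Unknown
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Submitted (pre-peer-review) version. Accepted at CICM 2026; the Version of Record will appear in Springer LNAI. We’ll add the DOI once the proceedings are published

Click to view abstract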

Abstract:Mathswitch is an open-source project that imports mathematical concept records from sources such as Wikidata, Wikipedia, MathWorld, Encyclopedia of Mathematics, nLab, ProofWiki, and Agda-Unimath, and links records that refer to the same concept. It does not reorganize or redefine the imported content; each source retains its own structure. The current focus is on importing concept data from Wikidata and the resources it links to, with plans to expand to further sources and better concept linking. Because the concept set is approximated through queries over Wikidata’s collaboratively edited graph, the imported data is noisy: some items are non-mathematical, while others are ambiguous. In this paper, we test whether a voting ensemble of LLM judges can filter this noise. We evaluate it on Wikidata items with known MathWorld identifiers as a positive control, and examine how classification changes when database identifiers are removed from context. We then inspect the cases where the judges disagree with MathWorld and group these disagreements into three categories (degenerate descriptions, narrow scope bias, and editorial-scope mismatches) that suggest different remediation strategies.
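A voting ensemble of judges can be sketched in a few lines. This is only an illustration of the general technique under stated assumptions: the `Judge` interface and `ensemble_vote` are hypothetical names, a strict majority is assumed as the decision rule, and the keyword lambdas are stand-ins for real LLM calls, not the judges used in the paper.

```python
from typing import Callable

Judge = Callable[[dict], bool]  # hypothetical interface: True means "mathematical"

def ensemble_vote(item: dict, judges: list[Judge]) -> tuple[bool, list[bool]]:
    """Return the strict-majority verdict together with the individual votes."""
    votes = [bool(judge(item)) for judge in judges]
    return sum(votes) > len(votes) / 2, votes

# Keyword stubs standing in for real LLM judges, purely for illustration.
judges = [
    lambda it: "function" in it["description"],
    lambda it: "theorem" in it["description"] or "function" in it["description"],
    lambda it: "mathematic" in it["description"],
]

item = {"label": "Euler's totient function",
        "description": "arithmetic function in number theory"}
verdict, votes = ensemble_vote(item, judges)
print(verdict, votes)
```

Returning the individual votes alongside the verdict keeps the disagreements available for inspection, which is the kind of case analysis the abstract describes.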

[NLP-127] Structure-Preserving Document Translation via Multi-Stage LLM Pipeline: A Case Study in Marathi

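[Quick read]: This paper addresses the accessibility of Indian government documents in a multilingual setting, in particular the translation barriers that official documents issued mainly in Marathi pose for interstate administration, non-native readers, and policy analysts. Neural machine translation has improved sentence-level quality, but existing systems largely neglect document structure, formatting integrity, and domain-specific terminology, which limits their use for end-to-end conversion of official documents. The key to the solution is a structure-preserving Marathi-to-English translation framework that combines layout-aware optical character recognition (OCR), coordinate-based text extraction, large language model (LLM) based translation, and structured document reconstruction through HTML, covering the whole path from source PDF to translated document. By enforcing spatial alignment constraints and preserving hierarchical document elements, the framework keeps layout and logical structure consistent between source and translation. Experiments on real-world Marathi government PDFs show improved structural preservation, translation coherence, and terminological consistency compared with conventional text-only pipelines, offering a scalable route to multilingual accessibility in e-governance and administrative document processing.

Link: https://arxiv.org/abs/2606.28796
Authors: Manasi Waghe, Danish Chandargi, Mohammad Aamir Rayyan, Raviraj Joshi, A.R. Deshpande
Affiliations: Pune Institute of Computer Technology, Pune, India; L3Cube Labs, Pune, India; Indian Institute of Technology Madras, Chennai, India
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract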

Abstract:Government documents in India are predominantly issued in regional languages such as Marathi, creating substantial accessibility barriers for non-native readers, interstate administrative bodies, and policy analysts. Although recent advances in neural machine translation have improved sentence-level translation quality, existing systems largely neglect document structure, formatting integrity, and domain-specific terminology, thereby limiting their applicability to official documentation. This paper presents a structure-preserving Marathi-to-English government document translation framework capable of performing end-to-end document transformation while maintaining layout fidelity. The proposed system integrates layout-aware optical character recognition, coordinate-based text extraction, large language model based translation, and structured document reconstruction through HTML representations. By enforcing spatial alignment constraints and preserving hierarchical document elements, the framework ensures structural consistency between the source and translated documents. Experimental evaluation on real-world Marathi government PDFs demonstrates improved structural preservation, translation coherence, and terminological consistency compared to conventional text-only translation pipelines. The proposed framework contributes toward scalable multilingual accessibility solutions for e-governance and administrative document processing.

[NLP-128] Majority Vote Silences Minority Values: Annotator Disagreement at the Hate/Offensive Boundary in HateXplain

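[Quick read]: This paper addresses the sharp drop in model performance on boundary cases that results when hate speech annotation pipelines collapse annotator disagreement into majority vote labels. The core problem is that majority voting turns contested annotations, especially those at the hate/offensive boundary, into seemingly certain labels, hiding differences in annotators' judgments and teaching models a false certainty. The paper compares a hard-label BERT model, a soft-label model, and a per-annotator multi-head model: the first two lose 22 percentage points of accuracy on disagreement posts relative to agreed posts, the multi-head model widens the gap to 28 points, and three downstream interventions all fail to recover boundary accuracy. Because the hard-label model is more confident on its boundary-case errors, standard evaluation metrics do not detect the failure. The authors conclude that the problem is structural and that the fix must be made upstream, in annotation design, by not presenting contested judgments as ground truth.

Link: https://arxiv.org/abs/2606.28772
Authors: Joshua Muhumuza, Joab Ezra Agaba, Mercy Amiyo
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract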

Abstract:Hate speech annotation pipelines routinely collapse annotator disagreement into majority vote labels before training. We show that this aggregation is not neutral: 42.6% of all annotator disagreement in HateXplain concentrates specifically at the hate/offensive boundary, a pattern consistent with annotators applying different thresholds for where hate begins (chi-squared = 135.199, df = 2, p < 0.0001). Both a hard-label BERT model (Model A) and a soft-label model (Model B) drop 22 percentage points in accuracy from agreed posts (~80%) to disagreement posts (~58%), confirmed at p < 0.0001. A per-annotator multi-head model (Model C) widens this gap further to 28 points while collapsing offensive disagreement accuracy to 0.245. Critically, Model A expresses significantly higher confidence on boundary case errors than Model C (0.710 vs. 0.495, p < 0.0001), meaning standard evaluation metrics will not detect the failure. Three downstream interventions of increasing sophistication all fail to recover boundary accuracy. We argue the problem is structural. Majority vote presents a contested judgment as ground truth, and models inherit that false certainty. The intervention must be upstream in annotation design.
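The difference between a majority-vote label and a soft label can be shown on a single post. This is a minimal sketch assuming three annotators and the three-way hate/offensive/normal label set; the helper names are hypothetical and the votes are made up for illustration, not taken from HateXplain.

```python
from collections import Counter

LABELS = ("hate", "offensive", "normal")

def majority_label(votes):
    """Return the most common label, or None when the top two labels tie."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

def soft_label(votes):
    """Return the share of annotators behind each label."""
    return {label: votes.count(label) / len(votes) for label in LABELS}

votes = ["hate", "offensive", "offensive"]  # illustrative disagreement at the boundary
print(majority_label(votes))
print(soft_label(votes))
```

The majority label records only "offensive", while the soft label keeps the one-in-three "hate" judgment that the aggregation step would otherwise discard.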

[NLP-129] 5ting at SemEval-2026 Task 8: Strong End-to-End Multi-Turn RAG via LLM-Based Reranking and Faithfulness Control

[Quick read]: This paper addresses challenges that multi-turn Retrieval Augmented Generation (RAG) systems face in practice, including context drift, under-specification, and hallucination risk. The key to the solution is an integrated pipeline: BGE-M3 dense retrieval with FAISS indexing retrieves candidate documents; dual-query merged retrieval improves retrieval relevance; a large language model (LLM) reranks the results; and role-separated generation constrains the output to the retrieved evidence, which mitigates hallucination. The retriever achieves nDCG@5 = 0.4719 in Task A, and the end-to-end system obtains a harmonic score of 0.5597 and RL_F = 0.7692 in Task C.

Link: https://arxiv.org/abs/2606.28737
Authors: Thien-Qua-T-Nguyen, Chi Hoang, Nguyen Tran, Tri Le, Khanh Truong, Chinh Trong Nguyen
Affiliations: University of Information Technology, Ho Chi Minh City, Vietnam; Vietnam National University Ho Chi Minh City, Ho Chi Minh City, Vietnam
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We introduce 5ting, our system for SemEval-2026 Task 8 (MTRAGEval), which evaluates multi-turn Retrieval Augmented Generation (RAG) systems. Multi-turn RAG involves context drift, under-specification, and hallucination risk. Our system combines BGE-M3 dense retrieval with FAISS indexing, dual-query merged retrieval, and LLM-based reranking, followed by role-separated generation constrained to retrieved evidence. The retriever achieved nDCG@5 = 0.4719 in Task A, while the end-to-end system ranked in Task C with a harmonic score of 0.5597 and RL_F = 0.7692.
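For readers unfamiliar with the retrieval metric quoted above, nDCG@k can be computed as follows. This is a generic sketch of the standard log2-discounted formula, not the task's official scorer; it assumes the ideal ranking is obtained by re-sorting the same list of relevance grades, which holds only when every relevant document appears in the ranked list.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalised by the DCG of the ideally ordered list (0.0 if nothing is relevant)."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance grades of the documents a retriever returned, in ranked order.
print(ndcg_at_k([0, 1, 0, 1, 0], k=5))
```

A score of 1.0 means the relevant documents are already ranked first; pushing a relevant document down the list lowers the score through the logarithmic discount.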

[NLP-130] DriftGuard: Safety-Aware Multi-Monitor Detection and Selective Adaptation for Evolving Toxicity Moderation

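[Quick read]: This paper addresses the challenge that evolving toxic behavior poses for online content safety, and in particular the fact that existing drift detection methods, which focus on global distributional change, can miss safety-relevant drift in localized harm subspaces or high-risk model-error regions. The core of the solution is DriftGuard, a framework that combines several monitors for safety-aware adaptive moderation: it tracks global text drift, identity-harm drift, model uncertainty, toxic-risk drift, and false-negative-risk drift. When safety-relevant change is detected, the model is selectively updated with a hard-mix adaptation set that prioritizes likely false negatives, identity-related high-risk examples, false-positive-risk examples, and uncertain boundary cases. Experiments on Civil Comments temporal shift and Jigsaw-to-DynaHate cross-dataset shift show that the safety-aware monitors detect risks missed by global drift detection alone. Hard-mix adaptation improves toxic recall and accuracy over no-update and random-balanced baselines, raising toxic recall to 0.8777 on Civil Comments and from 0.7107 to 0.8523 on DynaHate, and a bootstrap analysis shows stable safety gains on DynaHate. DriftGuard thus links safety-aware drift detection to targeted, lightweight model updates for more robust adaptive toxicity moderation.

Link: https://arxiv.org/abs/2606.28725
Authors: Yuting Xin, Hanyu Cai, Binqi Shen, Lier Jin, Lan Hu
Affiliations: University of Minnesota; Northwestern University; Duke University; Carnegie Mellon University
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract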

Abstract:Automated toxicity moderation systems operate in dynamic online environments where harmful behavior evolves through coded language, shifting targets, and strategic adaptation to enforcement. Existing drift detection methods often focus on global distributional change, but such signals may miss safety-relevant shifts that emerge in localized harm subspaces or high-risk model-error regions. This paper introduces DriftGuard, a safety-aware adaptive moderation framework that combines multi-monitor drift detection with selective model updating. The framework tracks global text drift, identity-harm drift, model uncertainty, toxic-risk drift, and false-negative-risk drift. When safety-relevant change is detected, the model is updated using a hard-mix adaptation set that prioritizes likely false negatives, identity-related high-risk examples, false-positive-risk examples, and uncertain boundary cases. Experiments on Civil Comments temporal shift and Jigsaw-to-DynaHate cross-dataset shift show that safety-aware monitors detect risks missed by global drift alone. Hard-mix adaptation improves toxic recall and accuracy over no-update and random-balanced baselines, raising toxic recall to 0.8777 on Civil Comments and from 0.7107 to 0.8523 on DynaHate. Bootstrap analysis further shows stable DynaHate safety gains, with toxic recall increasing by 0.1418 and false-negative prevalence decreasing by 0.0781. Overall, DriftGuard links safety-aware drift detection to targeted, lightweight model updating for more robust adaptive toxicity moderation.
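The idea of several monitors feeding one update decision can be sketched as below. Everything here is an assumption made for illustration: the monitor names follow the abstract, but the scores, thresholds, the `drift_decision` helper, and the rule that only safety monitors trigger an update are hypothetical and not taken from the paper.

```python
# Monitors treated as safety-relevant in this sketch (an assumption).
SAFETY_MONITORS = ("identity_harm", "uncertainty", "toxic_risk", "false_negative_risk")

def drift_decision(scores, thresholds):
    """List the monitors whose score exceeds its threshold and decide on an update."""
    fired = [name for name, value in scores.items() if value > thresholds[name]]
    safety_fired = [name for name in fired if name in SAFETY_MONITORS]
    return {"fired": fired, "update_model": bool(safety_fired)}

# Hypothetical thresholds and one window of hypothetical drift scores.
thresholds = {"global_text": 0.20, "identity_harm": 0.10, "uncertainty": 0.15,
              "toxic_risk": 0.10, "false_negative_risk": 0.05}
scores = {"global_text": 0.08, "identity_harm": 0.04, "uncertainty": 0.11,
          "toxic_risk": 0.06, "false_negative_risk": 0.09}

decision = drift_decision(scores, thresholds)
print(decision)
```

In this made-up window the global text monitor stays below its threshold while the false-negative-risk monitor fires, which is the situation the abstract describes as a risk missed by global drift alone.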

[NLP-131] SEATauBench: Adapting Tool-Agent-User Evaluation Into Low-Resource Southeast Asian Languages

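[Quick read]: This paper addresses the lack of evaluation of AI agent capabilities in the regional languages of Southeast Asia (SEA); in the context of sovereign AI, how agents perform outside English remains poorly understood. The core of the solution is SEATauBench (SeaTau), the first agent-focused evaluation framework for SEA sovereign AI. It adapts TauBench to five languages (Mandarin, Vietnamese, Thai, Indonesian, and Filipino) and evaluates agents across progressively localized settings that vary the language of user-agent interaction, the tool specifications, and the task domains. The study finds that English agent capabilities transfer reasonably well when only the conversation language changes, but that quality and robustness degrade sharply as more of the task context is localized, with the largest losses under full domain adaptation; English-only assessment therefore does not measure agent capabilities in SEA languages accurately. Beyond exposing this limitation, the work provides a diagnostic benchmark and a reusable adaptation pipeline for building reliable multilingual agents for linguistically diverse regions.

Link: https://arxiv.org/abs/2606.28715
Authors: My Chiffon Nguyen, Aulia Adila, Saksorn Ruangtanusak, Kittiphat Leesombatwathana, Vissuta Gunawan Lim, Patomporn Payoungkhamdee, Samuel Cahyawijaya
Affiliations: SEACrowd; SCB DataX, SCBX Group; Chulalongkorn University; VISTEC; Cohere
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 23 pages

Click to view abstract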

Abstract:While AI development and evaluation for Southeast Asia (SEA) has grown rapidly, agent capabilities in regional languages are still poorly understood despite their importance to sovereign AI. To fill this gap, we introduce SEATauBench, the first agent-focused evaluation framework for SEA sovereign AI. SeaTau adapts TauBench to five languages – Mandarin, Vietnamese, Thai, Indonesian, and Filipino – and evaluates agents across progressively localized settings that vary the language of user-agent interaction, tool specifications, and task domains. Across three recent models, we find that English agent capabilities transfer reasonably well when only the conversation language changes, but quality and robustness degrade sharply as more task contexts are localized, with the largest losses in full domain adaptation. We also reveal the limits of English-only agent assessment for measuring agent capabilities in SEA languages. More broadly, SeaTau provides a diagnostic benchmark and reusable adaptation pipeline for building reliable multilingual agents for linguistically diverse regions. Data and code can be accessed at this http URL.

[NLP-132] AnTenA: Actionable and Explainable Tensor Analysis System with Large Language Models

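[Quick read]: This paper addresses the problem of explaining hidden patterns in multi-aspect data when accurate labels or auxiliary metadata are lacking. Conventional approaches rely on labels and auxiliary metadata to expose latent structure, but these may be inaccurate or inconsistent, insufficient (for example, static tabular metadata cannot describe time-dependent recordings), or missing altogether. The paper proposes AnTenA, which draws on the knowledge of large language models (LLMs) to explain the co-clustered latent patterns extracted by tensor decomposition. Its key idea is to combine task-agnostic and task-specific prompts to guide the LLM toward interpretable descriptions. The explanations are evaluated by testing the LLMs on forward and backward inference tasks.

Link: https://arxiv.org/abs/2606.28708
Authors: Dawon Ahn, Auder Der, Evangelos E. Papalexakis
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract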

Abstract:Accurately explaining hidden patterns in multi-aspect data has typically been done by leveraging labels and/or accompanying auxiliary metadata. However, labels and auxiliary data may be inaccurate (e.g. nonstandard, inconsistent), insufficient (e.g. static tabular metadata for time-dependent recordings), or unavailable. We propose AnTenA, which leverages the knowledge of large language models (LLMs) to explain the hidden patterns in human narratives. AnTenA uses task-agnostic and task-specific prompts to explain extracted co-clustered latent patterns from tensor decomposition. To evaluate these explanations, we test the LLMs on forward and backward inference tasks. Our demo system is available at this https URL.

[NLP-133] Mitigating Batch Effects in Histopathology via Language-Mediated Robust Embedding Generation

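[Quick read]: This paper addresses the performance degradation that batch effects cause when pathology foundation models (PFMs) are applied across institutions. Batch effects are non-biological variations introduced by different tissue source institutions (TSIs); they distort learned feature representations and impair generalization, and conventional remedies such as stain normalization have limited success against these high-dimensional, complex artifacts. The proposed GLMP (General-purpose LLM-Mediated Pathology model) framework generates robust numerical embeddings from histology image patches through an intermediate textual representation, using pretrained general-purpose multimodal large language models (MLLMs) and text encoders. Its key novelty is that, to the authors' knowledge, it is the first pathology model to use text descriptions of histological features as the intermediate representation, which prioritizes biologically meaningful signals over TSI-specific artifacts and improves cross-institutional generalization. The work highlights the untapped potential of broad-domain, non-specialized MLLMs in computational pathology and introduces a new paradigm for building versatile, generalizable, and robust pathology models.

Link: https://arxiv.org/abs/2606.28697
Authors: Yishu Zhang, Shushan Wu, Zhenzhong Zhang, Didong Li, Huaxiu Yao, Yun Li, Iain Carmichael, Katherine A. Hoadley, Hongtu Zhu, Di Wu, Daiwei Zhang
Affiliations: University of North Carolina at Chapel Hill
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Click to view abstract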

Abstract:Pathology foundation models (PFMs) have demonstrated strong potential across clinical and scientific applications, yet their performance is often hindered by batch effects, which are non-biological variations across tissue source institutions (TSIs) that distort learned feature representations and impair generalization. Conventional mitigation strategies, such as stain normalization, offer limited success in addressing these high-dimensional, complex artifacts. We present GLMP (General-purpose LLM-Mediated Pathology model), a novel framework that generates robust numerical embeddings from histology image patches through an intermediate textual representation. By leveraging pretrained general-purpose multimodal large language models (MLLMs) and text encoders, GLMP effectively prioritizes biologically meaningful signals over TSI-specific artifacts, thereby improving cross-institutional generalization. To our knowledge, GLMP is the first pathology model to use text descriptions of histological features as an intermediate representation for generating numerical embeddings from histology images. Our results highlight the untapped potential of broad-domain, non-specialized MLLMs in computational pathology and introduce a new paradigm for building versatile, generalizable, and robust pathology models.

[NLP-134] Phonological Perception of Sign Language Models

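[Quick read]: This paper asks whether current deep learning models for Sign Language Recognition (SLR) actually perceive abstract phonological features or merely rely on low-level statistical correlations. The challenge is to measure how sensitive models are to phonological contrasts (such as handshape, location, and movement) and how their representations relate to human perception. The key to the approach is probing with minimal pairs, combined with human perceptual similarity judgments from behavioral data, to evaluate SLR models of different architectures. Pose-based models are sensitive to handshape contrasts, while pixel-based models better capture location changes; pose-based models also learn latent representations that correlate with human perceptual similarity judgments (r~0.49). These findings suggest that SLR models show emergent phonological sensitivity but that current training paradigms are insufficient to scale them beyond their architectural inductive biases.

Link: https://arxiv.org/abs/2606.28667
Authors: Kayo Yin, Jessica Carter, Alex Xijie Lu, Annemarie Kocab
Affiliations: University of California, Berkeley; Johns Hopkins University; Microsoft Research
Categories: Computation and Language (cs.CL)
Comments: Accepted to CogSci 2026

Click to view abstract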

Abstract:Sign languages are compositional systems where meaning arises by combining sublexical phonological parameters, such as handshape, location, and movement. While deep learning models for Sign Language Recognition (SLR) have achieved increased performance on translation benchmarks, it remains unclear whether these models distinguish abstract phonological features or merely rely on low-level statistical correlations. This work evaluates the phonological perception of SLR models trained on American Sign Language (ASL) by probing phonological sensitivity using minimal pairs and evaluating representational alignment with human behavioral data. Our results reveal that SLR models exhibit emergent phonological sensitivity, but with clear architectural trade-offs: pose-based models are sensitive to handshape contrasts, while pixel-based models better capture location changes. Furthermore, pose-based models learn latent representations that correlate with human perceptual similarity judgments (r~0.49). These findings suggest that while SLR models exhibit emergent phonology, current training paradigms are insufficient to scale them beyond their architectural inductive biases.

[NLP-135] When More Sampling Hurts: The Modal Ceiling and Correlation Ceiling of Test-Time Scaling

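[Quick read]: This paper addresses the loss of efficiency and accuracy that over-sampling causes in language model reasoning. Drawing more samples (test-time scaling) raises coverage, the fraction of problems with at least one correct attempt, but a deployed system must return a single answer chosen from the samples, and that selection step is the bottleneck: the model can generate the correct answer yet fail to pick it, which opens an identifiability gap between coverage and selection. The key insight is that the vote has usually settled within a few dozen draws, the modal ceiling, and that for scoring a benchmark the cutoff comes sooner still, the correlation ceiling; further sampling adds compute cost and can reinforce confidence in a wrong answer, making the result worse. The useful quantity is therefore not a larger sample count but the effective number of samples, a single number that any sampling run already reveals. The core conclusion is that recognizing a right answer, not generating one, is the bottleneck.

Link: https://arxiv.org/abs/2606.28661
Authors: Yong Yi Bay, Kathleen A. Yearick
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 24 pages, 10 figures, 3 tables. Code and data: this https URL

Click to view abstract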

Abstract:People overthink; language models over-sample, and the extra effort can talk both into a worse answer. Reasoning systems answer a hard question by sampling it many times (test-time scaling), and the more they draw, the more often a correct answer turns up somewhere, so coverage, the fraction of problems with at least one correct try, climbs and appears to be progress. But a deployed system must return one answer, and choosing it, not knowing which try is right, is selection; selection is capped, and past a point extra samples only make the model surer of a confident mistake, even as every draw adds cost. The gap between climbing coverage and stalled selection, the identifiability gap, is the answer a model can produce but not pick. So the real question is not whether to sample but how far, and the answer is: not far. For picking an answer, the vote has already settled within a few dozen draws, the modal ceiling; for scoring a benchmark, sooner still, the correlation ceiling. Beyond that, extra draws cost compute and add nothing, and can even make the answer worse. This paper turns the cutoff into a single number, the effective number of samples, that any sampling run already reveals. The bottleneck is recognizing a right answer, not generating one.
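The gap between coverage and selection can be reproduced with a toy simulation. This sketch is not the paper's estimator: it assumes a single problem whose answer distribution (the `PROBS` values below, chosen for illustration) puts the most mass on a wrong answer, and it uses plain majority voting as the selection rule.

```python
import random
from collections import Counter

ANSWERS = ["correct", "wrong_1", "wrong_2"]
PROBS = [0.3, 0.5, 0.2]  # assumed: the modal answer is a wrong one

def coverage(k):
    """Probability that at least one of k independent draws is correct."""
    return 1 - (1 - PROBS[0]) ** k

def majority_vote_accuracy(k, trials=2000, seed=0):
    """Estimate how often the majority answer among k draws is the correct one."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        draws = rng.choices(ANSWERS, weights=PROBS, k=k)
        winner = Counter(draws).most_common(1)[0][0]
        hits += winner == "correct"
    return hits / trials

for k in (1, 5, 25, 101):
    print(k, round(coverage(k), 3), majority_vote_accuracy(k))
```

Under these assumptions coverage climbs toward 1 as k grows while the voted answer settles on the wrong mode, so extra draws add cost without improving the returned answer.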

[NLP-136] The Undecidability of Artificial General Intelligence (AGI) Alignment

【速读】: 该论文旨在解决人工智能通用性(Artificial General Intelligence, AGI)安全性的根本性问题,即如何确保AGI系统的行为与人类价值观对齐。其核心挑战并非对齐状态本身不可实现,而在于对齐状态的结构性不可验证性——即使存在一个理想的对齐系统,也无法通过逻辑或计算手段对其进行有效验证。解决方案的关键在于提出两个核心不可能性定理:对齐的不可验证性定理(Unverifiability Theorem of Alignment)与有限结构下对齐的不可验证性定理(Theorem of Finite Structural Unverifiability of AGI Alignment),并将其理论边界锚定在“特拉赫滕布罗特墙”(Trakhtenbrot’s Wall)之上,揭示当前依赖有限硬件或停机架构的工程防御机制无法突破逻辑上的不可解障碍。这一根本限制表现为三种不可避免的封闭性失败:开放域导致本质上的不可判定性(源于Rice定理与哥德尔不完备性);普遍有限验证退化为算法不可计算性(对应Trakhtenbrot定理);特定有界环境则使监督者陷入最坏情况下的不可行边界。由此导出的“一致性—完备性—可计算性三难困境”(Soundness–Completeness–Tractability Trilemma)表明,这三个属性的互不相容是描述复杂性本身的必然结果,而非偶然现象。最终,论文将这些理论极限映射至实际AI工程实践,指出现代安全约束策略并非临时补丁,而是为保障可判定的安全片段所必须付出的、对逻辑表达能力的强制牺牲。

Link: https://arxiv.org/abs/2606.28639
Authors: Jose Pascual Gumbau Mezquita
Affiliations: University Jaume I de Castelló; Spain
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computation and Language (cs.CL)
Comments:

Abstract:This article establishes the foundational mathematical limits of Artificial General Intelligence (AGI) safety, proving that the core barrier is not the impossibility of an aligned state, but its structural unverifiability. We formalize this boundary through two central impossibility results: the Unverifiability Theorem of Alignment and the Theorem of Finite Structural Unverifiability of AGI Alignment. We ground this boundary at Trakhtenbrot’s Wall, demonstrating that contemporary engineering defenses relying on finite hardware or halting architectures fail to escape logical obstructions. This failure manifests as an inescapable triad of containment failures: open domains yield fundamental undecidability (Rice and Gödel); universal finite verification collapses into algorithmic incomputability (Trakhtenbrot); and particular bounded environments trap the supervisor within intractable bounds in the worst case. As a direct structural corollary of these results, we derive the Soundness–Completeness–Tractability Trilemma, establishing that the mutual incompatibility of these three properties is a necessary consequence of descriptive complexity rather than an empirical anomaly. Finally, we map these theoretical bounds onto practical AI engineering, demonstrating that modern containment strategies are not temporary patches, but mandatory sacrifices of logical expressivity required to secure decidable fragments of safety.

[NLP-137] What LLMs explain is not what they believe: Evaluating explanation sufficiency under models own input beliefs ICML2026

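[Quick read]: This paper asks whether the free-text explanations that Large Language Models (LLMs) produce in high-stakes settings (chain-of-thought, post-hoc rationales, and the like) are sufficient, i.e., whether they contain enough information to explain the model's output-generating process. The difficulty is that the classical notion of sufficiency from feature attribution does not transfer directly to unstructured free-text explanations, and that sufficiency can change with the input distribution, which therefore has to be defined explicitly. The paper formalizes self-consistent sufficiency as a goal for free-text explanations and introduces an information-theoretic metric, SCSuff, which uses the model itself to generate alternative inputs conditioned on an explanation, so that evaluation does not rely on predefined biases or shortcuts. Experiments show that SCSuff agrees with targeted perturbation tests where those apply and that explanation sufficiency varies with the input distribution. LLM explanations are found to be generally insufficient, and sufficiency is only weakly correlated with model size, accuracy, or output entropy. Further analysis shows that top and bottom SCSuff scores can be predicted from final-token hidden states, suggesting that the metric can guide detection and improvement of sufficient explanations.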

Link: https://arxiv.org/abs/2606.28615
Authors: Nhi Nguyen,Shauli Ravfogel,Rajesh Ranganath
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 23 pages, 9 figures, 13 tables, Forty-Third International Conference on Machine Learning (ICML 2026)

Abstract:Large language models (LLMs) are increasingly deployed in high-stakes domains, where free-text explanations such as chain-of-thought and post-hoc rationales are used to justify model outputs. Yet it remains unclear whether these explanations are sufficient, i.e., if they contain enough information to explain the model’s output-generating process. We generalize classical sufficiency from feature attributions to arbitrary explanations and prove that explanation sufficiency can change depending on the input distribution, which must be explicitly defined for LLM explanations. We propose using the LLM itself to generate alternative inputs conditioned on an explanation, capturing its beliefs about possible inputs. We formalize self-consistent sufficiency as a goal for free-text explanations and introduce an information-theoretic metric, SCSuff, that enables evaluation of free-text explanations without relying on predefined biases or shortcuts. Our experiments show that SCSuff agrees with targeted perturbation tests where applicable and demonstrate that explanation sufficiency can vary with the input distribution. We find LLM explanations are generally insufficient and weakly correlated with model size, accuracy, or output entropy. Analysis of final-token hidden states shows that top and bottom SCSuff scores can be predicted from internal representations, suggesting that SCSuff can guide detection and improvement of sufficient LLM explanations. The code for this paper is available at this https URL .

[NLP-138] Animation2Code: Evaluating Temporal Visual Reasoning in Video-to-Code Generation

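[Quick read]: This paper examines whether current vision-language models (VLMs) can recover temporal dynamics when the visual input contains motion. VLMs perform well on static visual-to-code tasks such as generating code for webpages, charts, or SVGs, but it is unclear whether they can reconstruct executable animation code that is temporally consistent with a video. The authors introduce Animation2Code, a benchmark that evaluates reconstruction of executable web animation code from videos. It contains 1,069 web animation videos with diverse visual appearances and motion patterns, each paired with an HTML/CSS/JavaScript implementation. The key contribution is two human-aligned metrics, appearance similarity and temporal similarity, which disentangle visual fidelity from temporal alignment when rendered animations are compared against ground truth. Experiments show that state-of-the-art VLMs struggle to maintain temporal consistency even when appearance similarity is high, including under finetuning and iterative refinement, which highlights a clear limitation in temporal visual reasoning.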

Link: https://arxiv.org/abs/2606.28593
Authors: Anya Ji,Abhijith Varma Mudunuri,David M. Chan,Alane Suhr
Affiliations: University of California, Berkeley
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:While recent vision-language models (VLMs) have achieved significant improvements on static visual-to-code tasks such as generating code for webpages, charts, or SVGs, it remains unclear whether they can recover temporal dynamics when motion is present. To this end, we introduce Animation2Code, a benchmark for evaluating temporal visual reasoning via reconstructing executable web animation code from videos. Animation2Code consists of 1,069 web animation videos with diverse visual appearances and motion patterns, paired with corresponding HTML/CSS/JavaScript implementations. We propose two human-aligned metrics, appearance similarity and temporal similarity, which allow us to disentangle visual fidelity from temporal alignment when comparing rendered animations against ground-truth samples. Benchmarking state-of-the-art VLMs on this dataset shows that current VLMs struggle to maintain temporal consistency in reconstruction, even when achieving high appearance similarity, including under finetuning and iterative refinement settings. Code and data are available at this https URL .

[NLP-139] Correct codes for the wrong reasons? validating LLMs as measurement instruments for theoretical constructs

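[Quick read]: This paper addresses the problem that a Large Language Model (LLM) used for text coding can be highly reliable yet lack construct validity. No current method distinguishes a model that measures the construct its theory specifies from one that reaches the same code through a correlate of that construct, so the measurement may drift from what the theory actually demands. The proposed solution is grain calibration: decompose the construct into clause-level components, test each component against the text with extractive evidence, and combine the results through an explicit, theory-derived rule. Because the rule is stated openly rather than buried in one opaque pass, its structure is itself evidence about the measurement process. It shows which components settled a code and, when the code is wrong, whether a component was missed or an adjacent construct was mistaken for it. Validation thus shifts from scoring outputs against a human annotator to showing that the instrument runs on the construct its theory specifies.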

Link: https://arxiv.org/abs/2606.28574
Authors: Manuel Pita
Affiliations: CICANT, Universidade Lusófona
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:When a large language model (LLM) codes a construct in text as a human annotator would, that agreement makes the LLM a reliable coder. Yet reliability leaves construct validity untouched. The instrument may be theory-naive, reaching the code through a correlate that meets none of the demands the construct’s theory makes, and no current method tells that apart from genuine measurement. We propose grain calibration as a method that closes the gap. It decomposes a construct into clause-level components, tests each against the text with extractive evidence, and combines the results through an explicit, theory-derived rule. Because the rule is stated rather than lodged in one opaque pass, its structure is evidence about the process rather than the output. It shows which components settled a code, and, when the code is wrong, whether a component was missed or an adjacent construct mistaken for it. Validation shifts from scoring an instrument’s outputs against an annotator to showing that the instrument runs on the construct its theory specifies.

[NLP-140] SEAD: Competence-Aware On-Policy Distillation via Entropy-Guided Supervision

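[Quick read]: This paper addresses inefficiency in on-policy distillation (OPD) that arises because the quality of teacher supervision depends on student competence. Incoherent student rollouts yield noisy gradients, while already-mastered tokens yield redundant ones, so supervision is wasted at three scales: tokens, training phases, and prompts. Existing methods supervise uniformly and cannot adapt to this. The paper proposes SEAD, which uses entropy as a unified probe of competence-dependent degradation at all three scales: (1) joint teacher-student entropy partitions tokens into zones that receive tailored divergences or zero gradient (roughly 50% of tokens are skipped); (2) a cosine schedule anneals from forward KL to reverse KL as student competence grows; (3) a competence-gated curriculum introduces prompts from easy to hard. The three components depend on one another: token selection requires coherent rollouts, which the curriculum provides, and annealing requires monotonic improvement, which also relies on the curriculum. On OLMo-3 (7B to 32B), SEAD achieves +4.8 average accuracy over vanilla OPD across six math benchmarks, and ablations confirm super-additive interactions between the components.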

Link: https://arxiv.org/abs/2606.28562
Authors: Chia-Hsuan Lee,Zelei Cheng,Yu Wang,Renkun Ni,Sambit Sahu,Shi-Xiong Zhang,William Campbell
Affiliations: Capital One
Categories: Computation and Language (cs.CL)
Comments:

Abstract:On-policy distillation (OPD) has a property absent in offline distillation and RL: teacher supervision quality depends on student competence. Incoherent rollouts yield noisy gradients; already-mastered tokens yield redundant ones. This creates waste at three scales (tokens, training phases, and prompts) yet existing methods supervise uniformly. We introduce SEAD, which uses entropy as a unified probe of this competence-dependent degradation at three scales: (1) joint teacher-student entropy partitions tokens into zones receiving tailored divergences or zero gradient (approx. 50% skipped); (2) a cosine schedule anneals from forward to reverse KL as competence grows; (3) a competence-gated curriculum introduces prompts easy-to-hard. These components are symbiotically necessary: token selection requires coherent rollouts (curriculum), annealing requires monotonic improvement (also curriculum). On OLMo-3 (7B to 32B), SEAD achieves +4.8 avg accuracy over vanilla OPD across six math benchmarks, with ablations confirming super-additive interactions.
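The cosine anneal from forward to reverse KL in component (2) can be sketched as follows. This is an illustrative assumption rather than the authors' implementation: the abstract does not give the schedule's exact form, so the linear blend of the two divergences and all function names here are hypothetical.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def reverse_weight(step, total_steps):
    """Cosine ramp from 0 (pure forward KL) to 1 (pure reverse KL)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * (1.0 - math.cos(math.pi * progress))


def annealed_divergence(teacher, student, step, total_steps):
    """Blend of forward and reverse KL (the blend form is an assumption)."""
    w = reverse_weight(step, total_steps)
    forward = kl(teacher, student)  # KL(teacher || student), mode-covering
    reverse = kl(student, teacher)  # KL(student || teacher), mode-seeking
    return (1.0 - w) * forward + w * reverse


teacher = [0.7, 0.2, 0.1]
student = [0.4, 0.4, 0.2]
early = annealed_divergence(teacher, student, step=0, total_steps=100)
late = annealed_divergence(teacher, student, step=100, total_steps=100)
```

Early in training the loss equals the forward KL, and at the final step it equals the reverse KL; the entropy-based token zoning and the curriculum are not modeled here.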

[NLP-141] Depth-Staggered Fibonacci Spacing for Sparse Attention: Static Schedules Beat Learned Dilation and Extrapolate Where Dense Attention Fails

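[Quick read]: This paper addresses the tension between computational efficiency and context-length extrapolation in long-sequence language modeling, in particular the sharp degradation of dense self-attention beyond its training length. The proposed approach is sparse self-attention in which each query attends to a dense local window plus a set of Fibonacci-spaced offsets, with a per-layer scalar alpha that compresses or expands the spacing; the central design choice is a static linear stagger of alpha across depth. Experiments show that the static stagger improves perplexity over both fixed alpha and per-layer learned alpha, and that all sparse variants extrapolate to four times their training length with little or no degradation, whereas a recipe-matched dense baseline collapses with a 201% rise in perplexity. The authors attribute this to fixed-offset attention only ever querying relative positions seen during training. They also report that at training length the best sparse model has about 26% higher perplexity than the dense baseline, and that the staggering gain is uniform across context positions rather than concentrated at long range.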

Link: https://arxiv.org/abs/2606.28560
Authors: Chad A. Capps
Affiliations: Independent Researcher
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 11 pages, 5 tables

Abstract:We study sparse self-attention in which each query attends to a dense local window plus a set of Fibonacci-spaced offsets, with a per-layer scalar alpha that compresses or expands the spacing. Across 21 language models trained under one matched recipe (60M parameters, 512 hidden, 16 layers, 426M tokens), we compare four ways of setting alpha across depth: fixed, per-layer learned, a static linear stagger, and a coprime (anti-gridding) reassignment of that stagger, together with a reach-matched power-of-2 control. Three results stand out. First, a static per-layer stagger improves perplexity over both fixed and learned alpha, and the gain is base-agnostic: applying the same stagger to a power-of-2 base lifts it above fixed Fibonacci and to parity with learned Fibonacci attention. Second, learning per layer is inert: it does not beat the static schedule and costs roughly five times the inference latency. Third, and most consequential, all sparse variants extrapolate to four times their training length with little or no degradation, whereas a recipe-matched dense baseline collapses (perplexity rises by 201% at 4x length); we attribute this to fixed-offset attention only ever querying relative positions seen during training. We also report two honest negatives: at training length the best sparse model has about 26% higher perplexity than the dense baseline, and the staggering gain is uniform across context positions rather than concentrated at long range.
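The attention pattern described above (a dense local window plus Fibonacci-spaced offsets scaled by a per-layer alpha) can be sketched in a few lines. This is a minimal illustration under stated assumptions: causal attention, a window that includes the query itself, rounding of scaled offsets, and the alpha range are all hypothetical choices not specified in the abstract.

```python
def fibonacci_offsets(max_reach, alpha=1.0):
    """Distinct offsets round(alpha * F_k) that do not exceed max_reach."""
    offsets = []
    a, b = 1, 2  # Fibonacci sequence without the repeated leading 1
    while True:
        off = round(alpha * a)
        if off > max_reach:
            break
        if off >= 1 and off not in offsets:
            offsets.append(off)
        a, b = b, a + b
    return offsets


def attended_positions(query, window, offsets):
    """Key positions a causal query sees: local window plus sparse offsets."""
    local = set(range(max(0, query - window + 1), query + 1))
    sparse = {query - off for off in offsets if query - off >= 0}
    return sorted(local | sparse)


def staggered_alphas(num_layers, lo, hi):
    """Static linear stagger of alpha across depth (range is hypothetical)."""
    if num_layers == 1:
        return [lo]
    return [lo + (hi - lo) * i / (num_layers - 1) for i in range(num_layers)]


offsets = fibonacci_offsets(max_reach=20)
keys = attended_positions(query=20, window=4, offsets=offsets)
alphas = staggered_alphas(num_layers=4, lo=1.0, hi=2.5)
```

Because the offsets are fixed, a query at any absolute position only ever looks at the same set of relative distances, which is the property the authors credit for length extrapolation.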

[NLP-142] DataComp-VLM: Improved Open Datasets for Vision-Language Models

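[Quick read]: This paper addresses the lack of systematic benchmarks for evaluating how large-scale training datasets for Vision-Language Models (VLMs) are built and curated. Training performant VLMs depends on carefully curated data, yet the community has no common standard for comparing curation strategies such as filtering, mixing, formatting, and sampling. The paper introduces DataComp for VLMs (DCVLM), a benchmark for controlled data-centric experiments. Its core is a standardized corpus of 160 datasets totalling 6T multimodal tokens across four data types: image-caption pairs, multimodal interleaved documents, text-only data, and instruction-tuning data. The benchmark supports controlled experiments with 1B-8B parameter models and 6.25B-200B token budgets, evaluated on up to 52 downstream benchmarks across 9 domains. The key finding is that data mixing, not filtering, is what drives training data quality: instruction-heavy mixtures scale better than caption-heavy ones, and the gains widen at larger scales. The resulting dataset, DCVLM-Baseline, trains an 8B VLM to 63.6% accuracy on the 33-task core suite with 200B training tokens, an improvement of +5.4 percentage points over FineVision, the state-of-the-art open VLM training dataset. DCVLM and all accompanying artifacts will be released publicly.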

Link: https://arxiv.org/abs/2606.28551
Authors: Matteo Farina,Vishaal Udandarao,Thao Nguyen,Selim Kuzucu,Maximilian Böther,Andreas Hochlehnert,Adhiraj Ghosh,Marianna Nezhurina,Karsten Roth,Joschka Struber,Yuhui Zhang,Sebastian Dziadzio,Elaine Sui,Soumya Jahagirdar,Dhruba Ghosh,Hasan Hammoud,Thomas De Min,Simone Caldarella,Jehanzeb Mirza,Sedrick Keh,Mehdi Cherti,Hilde Kuehne,Bernt Schiele,Serena Yeung-Levy,Muhammad Ferjad Naeem,Federico Tombari,Ana Klimovic,Elisa Ricci,Matthias Bethge,Sewoong Oh,Ameya Prabhu,Alessio Tonioni,Jenia Jitsev,Massimiliano Mancini,Ludwig Schmidt,Nikhil Parthasarathy
Affiliations: 1. University of Oxford; 2. University of Tübingen; 3. Max Planck Institute for Intelligent Systems; 4. Stanford University; 5. Max Planck Institute for Informatics; 6. Max Planck Institute for Intelligent Systems; 7. RWTH Aachen University; 8. ETH Zurich; 9. University of California, Berkeley; 10. University of Washington; 11. University of Edinburgh; 12. National University of Singapore; 13. University of Trento; 14. University of Illinois Urbana-Champaign
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint

Abstract:Building performant Vision-Language Models (VLMs) requires carefully curating large-scale training datasets, yet the community lacks systematic benchmarks for evaluating such curation strategies. We introduce DataComp for VLMs (DCVLM), a benchmark for controlled data-centric experiments to improve VLM training. As part of DCVLM, we collect 160 datasets spanning four data types – image-caption pairs, multimodal interleaved documents, text-only, and instruction-tuning data – into a corpus of 6T multimodal tokens. DCVLM allows participants to test curation strategies (filtering, mixing, formatting, sampling) across 1B-8B models and 6.25B-200B token budgets. Models are then evaluated on a carefully selected suite of up to 52 downstream benchmarks across 9 domains. We conduct extensive experiments on DCVLM and find that data mixing, not filtering, is key to a high-quality training dataset: instruction-heavy mixtures scale better than caption-heavy ones, with gains widening at larger scales. The resulting dataset, DCVLM-Baseline, enables training an 8B VLM to 63.6% accuracy on our 33-task core suite with 200B training tokens. Compared to FineVision, the state-of-the-art open VLM training dataset, this represents an improvement of +5.4pp. DCVLM and all accompanying artifacts will be made publicly available at this https URL.

[NLP-143] Turn-Averaged SAEs for Feature Discovery and Long-Context Attribution

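[Quick read]: This paper addresses the poor scalability of standard sparse autoencoders (SAEs) on long-context language model transcripts. Standard SAE architectures operate on individual token activations, so the number of active features grows linearly with context length and long transcripts become hard to study. The paper proposes turn-averaged SAEs, which learn to reconstruct the average model activation across a single conversational turn (one Human or Assistant message) and therefore represent each turn with a fixed number of features. The key idea is to aggregate at the level of the turn, capturing its high-level characteristics through the averaged activation rather than through fine-grained per-token representations. When judged by an LLM, turn-averaged features describe a turn's high-level characteristics more completely than per-token features, and the approach greatly simplifies common downstream uses of SAEs such as attribution graphs. Overall, turn-averaged SAEs make interpretability techniques practical at long context lengths.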

Link: https://arxiv.org/abs/2606.28548
Authors: Kevin Der,Harish Kamath,Ben Thompson
Affiliations: Anthropic
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Sparse autoencoders (SAEs) have become a useful tool for extracting interpretable features in language models. However, standard SAE architectures operate on individual token activations, meaning that the number of active features scales linearly with context length, and studying long model transcripts becomes difficult. We introduce turn-averaged SAEs, which represent a single Human or Assistant turn with a fixed number of features by learning to reconstruct the average model activation across the turn. We find that turn-averaged features describe a single turn’s high-level characteristics more completely than per-token features when judged by an LLM. We also demonstrate that turn-averaged SAEs greatly simplify common downstream uses of SAEs like attribution graphs. Broadly, turn-averaged SAEs make interpretability techniques practical at long context lengths.
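The reconstruction target described above, one averaged activation vector per turn, can be sketched as follows. This is only an illustration of the averaging step, assuming per-token activations are available as plain lists and that each token carries a turn index; the function name and data are hypothetical, and the SAE itself is not modeled.

```python
def turn_average(activations, turn_ids):
    """Mean activation vector per turn.

    activations: list of per-token activation vectors (lists of floats)
    turn_ids:    list of the same length giving each token's turn index
    """
    sums, counts = {}, {}
    for vec, turn in zip(activations, turn_ids):
        if turn not in sums:
            sums[turn] = [0.0] * len(vec)
            counts[turn] = 0
        sums[turn] = [s + v for s, v in zip(sums[turn], vec)]
        counts[turn] += 1
    return {t: [s / counts[t] for s in sums[t]] for t in sums}


# Hypothetical toy transcript: two tokens in turn 0, one token in turn 1.
acts = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]]
turns = [0, 0, 1]
targets = turn_average(acts, turns)
```

However long a turn is, it contributes exactly one vector, so the number of SAE inputs scales with the number of turns rather than the number of tokens.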

[NLP-144] Who Plays Which Role When? Communication Role Dynamics for Peer Recognition and Team Performance Prediction

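[Quick read]: This paper addresses the lack of theoretical grounding in how computational studies of collaboration label team roles: existing work mostly relies on domain-specific personas or data-driven clustering rather than theory-grounded taxonomies. The authors operationalize a taxonomy of eight communication roles grounded in the education literature and annotate 6,307 Slack messages from 55 students across 18 teams in a semester-long computer science course project. They then evaluate whether LLMs can approximate the expert labels, which would enable scalable, taxonomy-driven role annotation. Analysis of the labels shows that different roles peak at different moments in a team's lifecycle and that students enact a more diverse set of roles as projects progress. To assess the utility of the role constructs, the authors use them to predict peer recognition, outperforming lexical, conversational, and LLM-prompting baselines. To assess generalizability beyond the educational context, they apply the same constructs to a public dataset (DeliData) to predict team performance improvement after deliberation, again exceeding prior performance.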

Link: https://arxiv.org/abs/2606.28544
Authors: Yifan Song,Wenxuan Wendy Shi,Brian P Bailey,Tal August
Affiliations: University of Illinois Urbana-Champaign; California Polytechnic State University, Pomona
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments:

Abstract:Team roles offer an interpretable lens on collaboration, yet computational studies of roles often rely on domain-specific personas or data-driven clustering rather than theory-grounded taxonomies. We operationalize a taxonomy of eight communication roles grounded in education literature and annotate a corpus of 6,307 Slack messages from 55 students across 18 teams in a semester-long computer science course project. We evaluate whether LLMs can approximate expert labels, enabling scalable, taxonomy-driven role annotation. Using these role labels, we characterize role dynamics over teams’ lifecycles, finding that different roles peak at different moments and that students enact a more diverse set of roles as projects progress. To evaluate the utility of our role constructs, we use them to predict peer recognition, outperforming lexical, conversational, and LLM-prompting baselines. To assess generalizability beyond the educational context, we apply the same role constructs to a public dataset (DeliData) to predict team performance improvement after deliberation, again exceeding prior performance.

[NLP-145] Legal Domain Adaptation of Modern BERT Models

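[Quick read]: This paper studies domain adaptation of modern BERT models (ModernBERT) to the legal domain, i.e., how to improve a pre-trained language model on legal text. Although ModernBERT has been pre-trained on roughly 500x more data than the original BERT, the authors find that it still benefits substantially from further domain-adaptive pre-training. The approach is to continue pre-training ModernBERT on a large corpus of US court opinions with the masked language modeling objective. Compared with vanilla ModernBERT, the adapted models show significant improvements on all datasets connected to US court opinions, and pre-training from scratch does not match further pre-training of an existing ModernBERT checkpoint. The resulting models process sequences of up to 8,192 tokens and can be used to compute meaningful embeddings of legal passages or to quickly rerank hundreds of legal passages for a search query. All model checkpoints are released publicly.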

Link: https://arxiv.org/abs/2606.28538
Authors: Dominik Stammbach,Peter Henderson
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: To appear in Proceedings of the 21st International Conference on Artificial Intelligence and Law (ICAIL 2026), June 9-12, 2026, Singapore

Abstract:We investigate domain adaptation of modern BERT models in the legal domain. We further pre-train ModernBERT on all US court opinions using the masked language modeling objective. Although ModernBERT has been trained on roughly 500x more data than original BERT, we still find that this model benefits from further pre-training and domain adaptation in the legal domain: we report significant improvements compared to vanilla ModernBERT on all datasets connected to US court opinions. We find gains similar to those reported in early work on domain adaptation of BERT-like models. However, from scratch pre-training does not match the performance of further pre-training an existing ModernBERT checkpoint in our experiments. The resulting models are capable of processing sequences up to 8,192 tokens, and can be used to compute meaningful embeddings of legal passages, or could quickly rerank hundreds of legal passages for a given search query. We release all model checkpoints publicly.

[NLP-146] A Good Talk Does not Look Like a Summary It Teaches You! Measuring Takeaways from Paper-to-Video Talks

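[Quick read]: This paper addresses an evaluation gap for videos generated automatically from scientific papers, which are increasingly used for education and research dissemination. Existing metrics mainly measure visual quality or whether key points from the paper appear in the video, without assessing whether the video actually helps viewers understand the ideas. The paper proposes EffectivePresentationScorer, a framework that evaluates instructional quality: whether a video explains the main ideas clearly, introduces the needed background concepts, and connects technical details to the paper's main contribution. Applied to existing paper-to-video generation systems, it shows that generated videos mention the correct topics and follow the paper's structure but fail to explain prerequisite concepts or clarify why the method works. Existing video evaluation metrics, which focus on content presence rather than explanatory quality, often ignore these failures.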

Link: https://arxiv.org/abs/2606.28531
Authors: Ishani Mondal,Aparna Garimella,Ananya Sai,Pannaga Shivaswamy,Jordan Boyd-Graber
Affiliations: Adobe Research; University of Maryland, College Park
Categories: Multimedia (cs.MM); Computation and Language (cs.CL)
Comments: Under Submission

Abstract:Automatically generated videos from scientific papers are increasingly used for education and research dissemination. However, existing evaluation metrics mainly measure visual quality or whether key points from the paper appear in the video without assessing whether the video actually helps viewers understand the ideas. We introduce EffectivePresentationScorer, a framework for evaluating the instructional quality of scientific presentation videos. It checks whether a video explains the main ideas clearly, introduces needed background concepts, and connects technical details to the main contribution of the paper. When we apply EffectivePresentationScorer to the existing paper-to-video generation systems, we find that generated videos mention the correct topics and follow the structure of the paper but fail to explain prerequisite concepts or clarify why the method works. These failures are often ignored by existing video evaluation metrics, which focus on content presence rather than explanatory quality.

[NLP-147] Developmental Trajectories of Situation Modeling and Mentalizing in Transformer Language Models

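[Quick read]: This paper addresses persistent construct-validity concerns about evaluating mentalizing in Large Language Models (LLMs), in particular whether performance on the false belief task (FBT) reflects genuine sensitivity to the belief states of described agents. Taking a developmental perspective, it traces mental state reasoning behavior, and likely preconditions for it, across multiple training stages in the Olmo2 and Pythia model suites. Above-chance FBT performance depends on both model size and sufficient training volume, emerges relatively late in pretraining, and is most improved by post-training interventions (SFT, DPO) in the condition most diagnostic of mentalizing (False Belief, Implicit). FBT performance is nonetheless fragile: non-factive verbs (e.g., "thinks") increase false belief attributions even in the True Belief condition. To contextualize these findings, the authors track the emergence of situation modeling, the ability to report basic factual properties of a described scene. Situation modeling accuracy generally precedes and exceeds FBT accuracy, yet the situational representations are incoherent in some respects: when asked about the knowledge state of the Antagonist agent, who always knows the item's true location, Olmo2 13b is consistently influenced by the Target agent's knowledge state and by the presence of non-factive verbs. Together, the results suggest that larger, sufficiently trained models build partially coherent situation models in a developmentally appropriate sequence but remain surprisingly fragile, which highlights the value of developmental and stress-testing approaches for evaluating LLM capabilities.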

Link: https://arxiv.org/abs/2606.28524
Authors: Pamela D. Rivière,Cameron Jones,Sean Trott
Affiliations: Rutgers University - Newark; Stony Brook University
Categories: Computation and Language (cs.CL)
Comments: Non-archival submission to the First Workshop on Computational Developmental Linguistics

Abstract:Recent work suggests that Large Language Models (LLMs) are sensitive to the belief states of agents described by text, as measured by the false belief task (FBT), yet persistent concerns of construct validity remain. We adopt a developmental perspective, tracing the pattern of mental state reasoning behavior – and likely preconditions for this behavior – across multiple training stages in the Olmo2 and Pythia language model suites. We find that above-chance FBT performance depends both on model size and sufficient training volume, emerges relatively late in pretraining, and is most improved by post-training interventions (SFT, DPO) in the condition most diagnostic of mentalizing (False Belief, Implicit). However, FBT performance is fragile: consistent with past work, the use of non-factive verbs (e.g., thinks) increases false belief attributions even in the True Belief condition. To contextualize these findings, we track the emergence of situation modeling: the ability to report on basic factual properties of a described scene. Situation modeling accuracy generally precedes and exceeds FBT accuracy, yet situational representations also prove surprisingly incoherent in certain respects: when asked about the knowledge states of the Antagonist agent – who always knows the item’s true location – Olmo2 13b is consistently influenced both by the Target agent’s knowledge state and the presence of non-factive verbs. Together, these results suggest that larger, sufficiently trained models build partially coherent situation models in a developmentally appropriate sequence, yet display surprising fragility – highlighting the value of developmental and stress-testing approaches for evaluating LLM capabilities.

[NLP-148] Detecting Clinical Hallucinations in LVLMs via Counterfactual Visual Grounding Uncertainty

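[Quick read]: This paper addresses hallucinations in large vision-language models (LVLMs) used for clinical image understanding, where the model produces textual findings or attributes that the input image does not support. The proposed solution is a vision-traceable hallucination detection framework that audits arbitrary LVLM responses through visual evidence grounding, requiring neither modification of the LVLM nor access to its hidden states. The method first extracts visually verifiable entities from a response and uses a medical-domain-adapted Qwen-VL grounding verifier to localize each entity on the input image. To make detection more robust, it introduces counterfactual entity perturbation and estimates visual evidence uncertainty by contrasting factual and counterfactual grounding results: an entity-level uncertainty score is computed from the positive confidence, the counterfactual confidence, and their grounding overlap, and is used for a binary hallucination decision. Experiments across multiple medical imaging modalities and LVLM backbones show consistent improvements over recent baselines, along with interpretable localization evidence and strong cross-model transferability.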

Link: https://arxiv.org/abs/2606.28520
Authors: Xiao Song,Haonan Qin,Zhaoxu Zhang,Jiong Zhang,Yuqi Fang,Caifeng Shan
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 10 pages, 4 figures

Abstract:Large vision-language models (LVLMs) are increasingly used for clinical image understanding, yet they remain vulnerable to hallucinations, i.e., producing textual findings or attributes not supported by the image. We present a vision-traceable hallucination detection framework that audits arbitrary LVLM responses via visual evidence grounding, requiring neither modification nor internal access to the hidden states of LVLMs. Given an LVLM response, we extract visually verifiable entities and use a medical-domain-adapted Qwen-VL grounding verifier to localize each entity on the input image. To enhance the robustness of our detection method, we introduce a counterfactual entity perturbation method and estimate visual evidence uncertainty by contrasting factual and counterfactual grounding results. Specifically, we compute an entity-level uncertainty score from the positive confidence, counterfactual confidence, and their grounding overlap for binary hallucination decision-making. Experiments on multiple medical imaging modalities and LVLM backbones demonstrate that our method consistently improves hallucination detection performance over recent baselines, while providing interpretable localization evidence and strong cross-model transferability. Code and dataset are available at this https URL.
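To make the three ingredients of the entity-level score concrete, here is a small sketch. The abstract does not state the scoring formula, so the combination below is purely hypothetical and is not the authors' method; only the box intersection-over-union (IoU) is a standard quantity. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the 0.5 threshold is an arbitrary placeholder.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def entity_uncertainty(pos_conf, cf_conf, pos_box, cf_box):
    """Hypothetical score: high when the factual grounding is weak, or when a
    counterfactual entity is grounded just as confidently in the same place."""
    overlap = iou(pos_box, cf_box)
    return 1.0 - pos_conf * (1.0 - cf_conf * overlap)


def is_hallucination(score, threshold=0.5):
    """Binary decision; the threshold value is an assumption."""
    return score > threshold


# Confident factual grounding, counterfactual grounded elsewhere with low confidence.
supported = entity_uncertainty(0.9, 0.1, (0, 0, 2, 2), (5, 5, 6, 6))
# Counterfactual entity grounded on the same region with equal confidence.
ambiguous = entity_uncertainty(0.9, 0.9, (0, 0, 2, 2), (0, 0, 2, 2))
```

The intuition behind contrasting the two groundings is that a verifier which localizes any entity name on the same region is not providing discriminative visual evidence.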

[NLP-149] GPTNT: Benchmarking Real-Time Collaboration Between Multimodal Agents on Keep Talking And Nobody Explodes

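[Quick read]: This paper addresses how to evaluate real-time collaboration between multimodal models under the conditions that actually coincide in collaborative work: time pressure, information asymmetry, and imperfect communication. Existing benchmarks usually study these capabilities in isolation. The paper introduces GPTNT, a benchmark built on the cooperative video game Keep Talking and Nobody Explodes, in which two agents must coordinate asynchronously and in real time to defuse procedurally generated bomb puzzles against a live countdown. One agent can see and manipulate the bomb but has no defusal instructions; the other has the instructions but cannot see or manipulate the bomb. The benchmark is designed to separate collaboration from reliance on memorized solutions: the manual, the partner, or both can be withheld to isolate what a model derives in the moment from what it already knows. None of the closed- or open-source state-of-the-art models tested defuses a single bomb in real time, a bar that human players clear, and controlled experiments expose weaknesses in state tracking, efficient action under time pressure, ambiguity handling, and error recovery. Because GPTNT runs on the real game, it benefits from procedural generation and an active modding community, so the benchmark can evolve as models improve rather than being solved once and retired.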

Link: https://arxiv.org/abs/2606.28514
Authors: Amit Parekh,Sabrina McCallum,Kareem Al-Hasan,Malvina Nikandrou,Alessandro Suglia,Ioannis Konstas
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project website and code at this https URL

Abstract:Multimodal models are increasingly deployed to solve tasks collaboratively with humans or other artificial agents. Existing benchmarks show that these models possess many of the required component capabilities, but the conditions that coincide in collaboration, including time pressure, information asymmetry, and imperfect communication, are usually studied in isolation. We introduce GPTNT, a benchmark built on the cooperative video game Keep Talking and Nobody Explodes, in which two agents must coordinate to defuse procedurally generated bomb puzzles against a live countdown. One agent can see and manipulate the bomb but does not have the defusal instructions; the other has the instructions but cannot see or manipulate the bomb. Neither agent can succeed alone: success requires effective and efficient communication. Unlike turn-based proxies, GPTNT requires agents to act asynchronously and communicate in real time. GPTNT is designed to separate collaboration from reliance on memorized solutions: the instruction manual, the partner, or both can be withheld to isolate what a model derives in the moment from what it already knows. We show that GPTNT poses a substantial challenge for state-of-the-art systems: none of the closed- or open-source models we test defuses a single bomb in real time, a bar that human players clear. Through controlled experiments, we identify critical weaknesses in state tracking, efficient action under time pressure, ambiguity handling, and error recovery. We release GPTNT as a benchmark for collaborative performance that current evaluations leave unmeasured. Because it runs on the real game, GPTNT benefits from procedural generation and inherits a living modding community, allowing the benchmark to evolve as models improve rather than being solved once and retired.

[NLP-150] An Agentic AI Pipeline for Appliance-Level Energy Anomaly Detection and LLM-Driven Recommendations

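[Quick read]: This paper addresses the noisy alerts produced by appliance-level energy monitoring in office buildings, which non-expert facility managers struggle to use. The solution is an end-to-end agentic pipeline that combines deep time-series forecasting, variational anomaly detection, and LLM-based reasoning to generate prioritized, actionable maintenance recommendations. A hybrid Singular Spectrum Analysis (SSA) and Long Short-Term Memory (LSTM) model forecasts consumption for seven office appliances, and a per-appliance LSTM Variational Autoencoder (VAE) with attention flags abnormal daily consumption episodes. A three-stage LangChain pipeline follows: a Context Agent always retrieves three core RAG sources (model reliability, hourly baseline, and expert knowledge) and conditionally adds up to three more (forecast context, anomaly history, global baseline) based on event characteristics, capped at eight reasoning steps; a Diagnosis Agent converts the evidence into a structured JSON diagnosis; and a Report Agent renders a human-readable narrative. A reflective memory layer incorporates operator feedback. On a 16-scenario benchmark, dynamic retrieval matches full static retrieval across all backends while reducing the average context from six sources to between three and six per event. The best LLM backend scores 90.4/100 with a 100% pass rate at a 70-point threshold, and a fully local 7B-parameter model passes all 16 scenarios.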

Link: https://arxiv.org/abs/2606.28467
Authors: Dihia Falouz,Aida Douaibia,Amine Bechar,Youssef Elmir,Abbes Amira,Adel Oulefki
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 07 pages, 01 figure, accepted for presentation at the IEEE International Conference on Communication, Computing, Networking, and Control in Cyber-Physical Systems (CCNCPS 2026)

Abstract:Appliance-level energy monitoring in office buildings produces noisy alerts that non-expert facility managers struggle to use. This paper proposes an end-to-end agentic pipeline that combines deep time-series forecasting, variational anomaly detection, and LLM-based reasoning to generate prioritized, actionable maintenance recommendations. The system tracks seven office appliances using a hybrid Singular Spectrum Analysis (SSA) and Long Short-Term Memory (LSTM) forecasting model, and applies a per-appliance LSTM Variational Autoencoder (VAE) with attention to flag abnormal daily consumption episodes. A three-stage LangChain pipeline begins with a Context Agent that always retrieves three core RAG sources (model reliability, hourly baseline, and expert knowledge) and conditionally adds up to three more (forecast context, anomaly history, global baseline) based on event characteristics, capped at eight reasoning steps. A Diagnosis Agent converts the evidence into a structured JSON diagnosis, and a Report Agent renders a human-readable narrative. A reflective memory layer incorporates operator feedback. The dashboard shows real-time 30-minute forecasts, intraday consumption, the previous day anomaly report, and a feedback form. We evaluate the forecasting model, anomaly detector with appliance-specific thresholds, and LLM reasoning on a 16-scenario benchmark including sustained and transient spikes, unexpected shutdowns, and systemic events, comparing five LLM backends under static vs. dynamic retrieval. Dynamic retrieval matches full static retrieval across all backends while cutting average context from six to three-six sources per event. The best backend scores 90.4/100 with a 100% pass rate at a 70-point threshold, and a fully local 7B-parameter model passes all 16 scenarios.

[NLP-151] Extracting Knowledge from an Arabic-English Machine-Readable Dictionary Using Information Extraction

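[Quick Read]: This paper addresses the knowledge-acquisition bottleneck that natural language processing (NLP) applications face when large-scale, high-quality linguistic knowledge is unavailable. Working from a machine-readable version of the Arabic-English Al-Mawrid dictionary, the authors use n-gram analysis and key-word-in-context (KWIC) analysis to discover lexical patterns that carry morphological, syntactic, or semantic information, then extract that information with hand-crafted, rule-based information extraction. Punctuation marks and heuristics are used to extract sets of synonyms from subentries. The method achieves high precision for every information type, high recall for synonyms, and low recall for the other types. The study also shows that Al-Mawrid contains a significant number of derivations (morphological information) as well as synonyms, domain labels, and hyponym/hypernym relations (semantic information).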

Link: https://arxiv.org/abs/2606.28457
Authors: Diaa M. Fayed, Aly A. Fahmy, Mohsen A. Rashwan, Wafaa K. Fayed
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, 7 figures, 4 tables. Conference version, CITALA 2014: 5th International Conference on Arabic Language Processing, Oujda, Morocco, 26-27 November 2014. Paper listed in archived accepted papers: this https URL. Original conference site defunct: this http URL. No proceedings PDF is publicly available.

Abstract:Natural language processing (NLP) applications need large and rich amount of linguistic knowledge. Furthermore, electronic language sources such as dictionaries, encyclopedia, and corpora became available. So, automatic methods are emerged to extract lexical information from those sources to overcome the knowledge acquisition bottleneck. We presented a method to automatically extract lexical information from a machine-readable version of the Arabic-English Al-Mawrid dictionary. We used n-gram analysis and key-word-in-context (KWIC) analysis to discover lexical patterns that manifest morphologic, syntactic, or semantic information. Then, we used hand-crafted rule-based information extraction to extract that information. Furthermore, we used punctuation marks and some heuristics to extract a set of synonyms in a subentry. This study registered high precision for all types of information, high recall for synonyms, and low recall for the other information. The study also showed that the Al-Mawrid has significant amount of derivations (morphologic information) and synonyms, domain labels, and hyponym/hypernym relations (semantic information).

[NLP-152] LoRA-Tuned Large Language Models for Dementia Detection via Multi-View Speech-Derived Features INTERSPEECH2026

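[Quick Read]: This paper targets a limitation of speech-based dementia detection: conventional approaches usually focus on a single representational dimension (acoustic descriptors, pause modeling, automatic speech recognition (ASR) transcripts, or multimodal fusion), which limits integrative reasoning across heterogeneous cognitive symptoms. The authors propose a large language model (LLM) fine-tuned with low-rank adaptation (LoRA) that performs structured multi-view reasoning over four complementary speech-derived signals: ASR transcripts with pause markers, discourse-level topic cues, temporal fluency statistics, and phonological sequences. All four are encoded in a single unified prompt, so one LLM learns a coherent decision function without modality-specific encoders or late-stage fusion. On the ADReSSo dataset the best model reaches an F1-score of 90.14%, and ablations confirm that each view contributes complementary information.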

Link: https://arxiv.org/abs/2606.28445
Authors: Jonghyeon Park, Olivier Jiyoun Jung, Myungwoo Oh
Affiliations: NAVER Cloud; Ewha Womans University
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at INTERSPEECH 2026

Abstract:Early detection of dementia enables timely intervention, and reflecting cognitive impairment, spontaneous speech offers a non-invasive screening modality. Conventional approaches often focus on a single representational dimension – such as acoustic descriptors, pause modeling, automatic speech recognition (ASR) transcripts, or multimodal fusion – limiting integrative reasoning across heterogeneous cognitive symptoms. We propose a low-rank adaptation (LoRA)-tuned large language model (LLM) that performs structured multi-view reasoning over four complementary speech-derived signals: ASR transcripts with pause markers, discourse-level topic cues, temporal fluency statistics, and phonological sequences. These cues are encoded within a unified prompt, enabling a single LLM to learn a coherent decision function without modality-specific encoders or late-stage fusion. On ADReSSo, our best model achieves an F1-score of 90.14%, and ablation confirms the complementary contribution of each view.

[NLP-153] Generating in the Limit with Infinitely Many Hallucinations

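[Quick Read]: This paper studies a core tension in the language generation in the limit framework: covering a large part of the target language (recall) while keeping the generated strings valid. In the classic setting a learner identifies the target language from an infinite sequence of examples, and recent work has shown that broad coverage often comes at the cost of validity. The authors introduce a new notion of precision and recast the problem as the classic recall-precision trade-off. The key step is a controlled relaxation of the learner's constraints: the learner may produce infinitely many invalid outputs, provided their frequency tends to zero so that precision remains one. This relaxation can strictly increase recall when the adversary permanently withholds a large portion of the target language. The paper also studies a continuous relaxation of the novelty constraint that requires only a fixed fraction of outputs to be novel. Taken together, the results point toward a more realistic model of language generation for large language models, in which occasional errors and repetitions occur but their rates are controlled.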

Link: https://arxiv.org/abs/2606.28354
Authors: Irene Strauss, Alexandra Butoi, Ryan Cotterell
Affiliations: ETH Zürich
Subjects: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
Comments:

Abstract:The classic paradigm of language identification in the limit models learning as a game between an adversary, who reveals strings from an unknown target language, and a learner tasked with identifying that language. The recently introduced framework of language generation in the limit shifted the objective to better reflect modern language modeling, requiring the learner to produce valid, unseen strings from the target language. Related work highlighted a fundamental tension: a broad coverage of the target often comes at the cost of validity. We introduce a new notion of precision and recast this problem as the classic recall-precision trade-off. We analyze generation in the limit under varying constraints on enumeration, novelty, and validity, aimed at reflecting settings closer to those encountered by large language models. A key contribution is our analysis of learners that are not eventually valid: we allow infinitely many mistakes, provided their frequency tends to zero so that precision remains one. We show that this relaxation can strictly increase recall when the adversary permanently withholds a large portion of the target language. We also study a continuous relaxation of the novelty constraint that requires only a fixed fraction of outputs to be novel. Taken together, our results move toward a more realistic model of language generation where occasional errors and repetitions are unavoidable, but their rates are controlled.
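
To make the "infinitely many mistakes, vanishing frequency" condition concrete, here is a small worked example. It is an illustration only: it assumes that precision is read as the limiting fraction of valid outputs, which may differ from the paper's formal definition, and the perfect-square error schedule is hypothetical.

```latex
% Hypothetical learner: its t-th output is invalid exactly when t is a perfect square.
\[
  \#\{\text{invalid outputs among the first } T\} = \lfloor \sqrt{T} \rfloor ,
  \qquad
  \frac{\lfloor \sqrt{T} \rfloor}{T} \;\le\; \frac{1}{\sqrt{T}}
  \;\xrightarrow[T \to \infty]{}\; 0 .
\]
% The learner never stops making mistakes, yet the limiting fraction
% of valid outputs is 1.
```

Such a learner is not "eventually valid" in the classic sense, but it satisfies the relaxed condition the abstract describes.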

[NLP-154] Auditing LLM-Governed Social Robots with Culture-Specific Moral Gradients

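[Quick Read]: This paper examines the risk of unequal access when LLM-governed social robots decide who receives assistance first across cultures. Current LLM moral audits are English-centered and rarely test embodied contexts, which leaves pluralistic calibration undiagnosed. The authors propose a gradient-based framework for multilingual evaluation. Grounded in nine cross-domain social robotics reviews (8,000 papers), they derive symmetry-controlled scenarios across care, education, and services, translating the Moral Machine Experiment (MME) "whom to spare" dilemmas into "whom to assist first" dilemmas while preserving the identity trade-offs (many vs. few; young vs. old; higher vs. lower status). Four LLMs are audited across four country-language pairs under four prompting regimes (57,600 decisions) and benchmarked against country-specific MME preference gradients. Ordinal concordance tests whether the models differentiate cultural contexts, and a governance typology maps vulnerabilities in gradient differentiation, directional tendency, and deliberation. The results show persistent, culturally asymmetric gradient-tracking failures that prompting alone cannot reliably correct: quality calibration is nearly twice as strong for Western-language decisions as for Chinese and Japanese; high determinism in majority-first trade-offs often erases cross-cultural gradients; and partial sensitivity to age- and status-based norms risks sidelining minorities. Prompting effects are uneven: only contrastive exemplars yield consistent gains, while reasoning-only prompts can worsen tracking. The authors argue for multilingual, pluralistic audits as a pre-deployment gate for LLM-robot systems and suggest that model factors are a more robust lever than prompting alone.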

Link: https://arxiv.org/abs/2606.28345
Authors: Carmen Ng, Gjergji Kasneci
Affiliations: Technical University of Munich
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Accepted for publication in Proceedings of the 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26)

Abstract:LLM-governed social robots increasingly decide who receives real-world assistance first. As prioritization norms vary across cultures by age, status, and group size, failure to calibrate pluralistically can scale into unequal access. Yet LLM moral audits remain English-centered, rarely test embodied contexts, leaving pluralistic calibration as an urgent diagnostic gap amid intensifying LLM-robot deployment. We introduce a gradient-based audit framework for multilingual evaluation of LLM moral trade-off behavior against cultural preference gradients. Grounded in nine cross-domain social robotics reviews (8,000 papers), we derive symmetry-controlled scenarios across care, education, and services, translating the Moral Machine Experiment’s “whom to spare” into “whom to assist first” dilemmas with preserved identity trade-offs (many vs. few; young vs. old; higher vs. lower status). We audit four LLMs across four country-language pairs in four prompting regimes (57,600 decisions), benchmarked against country-specific MME preference gradients. Ordinal concordance tests whether models differentiate cultural contexts; a governance typology maps vulnerabilities in gradient differentiation, directional tendency, and deliberation. We find persistent, culturally asymmetric gradient tracking failures that prompting alone cannot reliably correct: quality calibration is nearly twice as strong for Western-language decisions as for Chinese and Japanese; high determinism in majority-first trade-offs often erases cross-cultural gradients; partial sensitivity to age- and status-based norms risks sidelining minorities. Prompting effects are uneven; only contrastive exemplars yield consistent gains, while reasoning-only prompts can worsen tracking. Our results motivate multilingual, pluralistic audits as an LLM-robot pre-deployment gate and suggest model factors are a more robust lever than prompting alone.

[NLP-155] LLM-Ideoplasticity: Measuring Ideological Plasticity in the Political Behavior of LLMs as a Context-Conditioned Distribution

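[Quick Read]: This paper challenges the assumption that a large language model (LLM) has a static political stance. It argues that a model's political ideology is not a fixed point but a conditional distribution $\mathbb{P}(\text{position} \mid \text{context})$ over a real political space, and the core challenge is characterizing how that position moves with context. The authors build a unified measurement framework anchored by VAA-CHES projection models, which map model responses onto three validated political dimensions (lrgen, lrecon, galtan), and evaluate along six contextual axes. Coordinates are highly sensitive to context: persuasive framing and under-represented languages displace them by up to 0.57 and 0.52 units, respectively, and chain-of-thought reasoning often amplifies rather than dampens paraphrase instability. Despite this local plasticity, the model cohort as a whole occupies a remarkably narrow Overton envelope, roughly one-third the spread of major European parties. Supported by a multi-trait multi-method (MTMM) analysis, the authors conclude that a single point cannot summarize LLM political behavior; it must be characterized as a shape.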

Link: https://arxiv.org/abs/2606.28335
Authors: Adib Sakhawat, Syed Rifat Raiyan, Tahsin Islam, Takia Farhin, Hasan Mahmud, Md Kamrul Hasan
Affiliations: Islamic University of Technology, Dhaka, Bangladesh
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Under review, 38 pages, 18 figures, 10 tables

Abstract:We argue, with systematic empirical evidence, that a large language model’s political ideology is not a fixed point, but a conditional distribution $\mathbb{P}(\text{position} \mid \text{context})$ over a real political space. We evaluate nine current LLMs using a unified measurement framework anchored by VAA-CHES projection models, which map responses onto three validated dimensions (lrgen, lrecon, galtan) across six contextual axes. Our findings reveal high sensitivity to context: persuasive framing and under-represented languages displace coordinates by up to 0.57 and 0.52 units, respectively, while chain-of-thought reasoning often amplifies rather than dampens paraphrase instability. Despite this local plasticity, the model cohort occupies a remarkably narrow Overton envelope overall, occupying roughly one-third the spread of major European parties. Supported by a multi-trait multi-method (MTMM) analysis, we conclude that a single point cannot summarize LLM political behavior; it must be characterized as a shape. Our code and data are publicly available at this https URL.

[NLP-156] The Digital Afterlife of Empires: Four Language Models Converge on the Same Imperial Cartography of Writing

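[Quick Read]: This paper documents the sharp inequality with which large language models (LLMs) handle the world's writing systems, an inequality that systematically marginalizes minority and non-mainstream scripts. The authors construct the Digital Script Representation Index (DSRI), a seven-axis measure of digital support, and apply it to 300 writing systems. Only 9.7% of scripts are fully supported by contemporary digital infrastructure, and tokenizer efficiency varies by a factor of 31.7 across scripts. A serial mediation model (imperial intervention to speaker population to web corpus to tokenizer efficiency) is consistent with full mediation, with no significant direct effect of empire, although the authors treat this result as suggestive rather than confirmatory. Across four independent model families (Claude, GPT-4o, Grok, DeepSeek), error patterns converge strongly: 172 script-feature items are answered identically wrong by all four models, over-attribution far outnumbers under-recognition, and "used for religion" alone accounts for 43.6% of convergent errors. The convergence persists when religion is excluded, which indicates multi-channeled bias. The findings are consistent with the view that script bias in language models stems from structural inequalities imposed by historical empires and carried forward in the shared training corpus, not from any single model's design choices.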

Link: https://arxiv.org/abs/2606.28325
Authors: Hiroki Fukui
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: Part II of the Kotonoha Series. Companion paper: arXiv:2604.10957 (q-bio.PE). 35 pages, 8 figures, 3 tables. 12,000 API calls across 4 LLM families (Anthropic, OpenAI, xAI, DeepSeek); cross-architecture convergence of typological knowledge biases (Spearman rho = 0.85-0.98, all p < 0.002)

Abstract:Large language models process the world’s writing systems with radical inequality. We constructed the Digital Script Representation Index (DSRI), a seven-axis measure of digital support, and applied it to the 300 writing systems of the Global Script Database (Fukui, 2026). Only 29 scripts (9.7%) are fully supported by contemporary digital infrastructure; among 158 living scripts, 60 (38.0%) lack complete support. Tokenizer efficiency varies by a factor of 31.7 across 45 scripts measured with parallel text. A serial mediation model – imperial intervention to speaker population to web corpus to tokenizer efficiency – is consistent with full mediation, with the direct effect of empire indistinguishable from zero (beta = -0.22, p = 0.39) and structural equation model fit indices indistinguishable from saturation at n = 45; the bias-corrected bootstrap CI grazes zero, and we treat the mediation as suggestive rather than confirmatory. Across four independent LLM families (Claude, GPT-4o, Grok, DeepSeek; 12,000 API calls), base-rate-deviation error patterns converge at Spearman rho = 0.85-0.98 (all p < 0.002). 172 script-feature items are answered identically wrong by all four models; over-attribution outnumbers under-recognition 3.9:1, and “used for religion” alone concentrates 43.6% of convergent errors (enrichment 4.1x). With religion excluded as a sensitivity check, the cross-architecture convergence is preserved (mean rho = 0.87 on nine features) and the over-attribution asymmetry persists at 1.77:1 (n = 97, binomial p = 0.008), indicating multi-channeled rather than single-channeled bias. The findings are consistent with an interpretation in which the structural inequalities historical empires inflicted on script communities persist in contemporary language models through the shared training corpus rather than through any individual model’s design choices.

[NLP-157] When Does Personality Composition Matter for Multi-Agent LLM Teams?

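[Quick Read]: This paper asks whether personality prompting affects objective task outcomes in multi-agent LLM teams. Prior work shows that agents prompted with low agreeableness produce adversarial language while those prompted with high agreeableness become cooperative, but the link between communication style and task performance has not been examined systematically across domains. The authors manipulate personality traits across frontier LLMs on three task domains: structured coding, open-ended research collaboration, and competitive bargaining. They find that the effect of personality depends critically on task structure. In coding tasks, low agreeableness causes large shifts in communication style but has little effect on milestone completion; in open-ended collaboration and bargaining, the same manipulation substantially degrades performance. The takeaway for multi-agent system design is that personality prompting is not universally effective: its impact has to be weighed against the structure of the task, and the technique has inherent limits.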

Link: https://arxiv.org/abs/2606.27443
Authors: Aryan Keluskar, Amrita Bhattacharjee, Huan Liu
Affiliations: Arizona State University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 20 pages, 6 figures

Abstract:Personality prompting shapes how large language models communicate, yet whether these behavioral shifts affect objective task outcomes remains under-explored. Prior work shows that agents prompted with low agreeableness produce adversarial language, while those prompted with high agreeableness become cooperative, but the relationship between communication style and task performance has not been systematically examined across multiple domains. In this work, we investigate whether personality composition matters for multi-agent team performance by manipulating personality traits across frontier LLMs on three task domains: structured coding, open-ended research collaboration, and competitive bargaining. We find that personality effects depend critically on task structure. In coding tasks, low agreeableness leads to large communication shifts that have little effect on milestone completion. In open-ended collaboration and bargaining, the same manipulation substantially degrades performance. We discuss implications for multi-agent system design and the limits of personality manipulation.

[NLP-158] ReFreeKV: Towards Threshold-Free KV Cache Compression ACL2026

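[Quick Read]: This paper addresses the high memory cost of the key-value (KV) cache during LLM inference, and in particular the unstable performance and poor generalization of existing compression methods that allocate the cache budget through input- or domain-specific thresholds. Such methods can achieve lossless memory reduction on many datasets, but only when a threshold is pre-determined for the specific input characteristics. Real-world open-domain inputs vary in domain, length, and difficulty, so no single threshold is optimal, and performance can degrade sharply on arbitrary inputs. To overcome this limitation, the authors propose a new objective, "threshold-free" robust KV compression, which calls for methods that adaptively adjust budget allocation while preserving full-cache performance. They then present ReFreeKV as the first instantiation of this objective. Extensive experiments across 13 datasets with diverse context lengths, task types, and model sizes demonstrate its efficacy and efficiency, and the code is publicly released.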

Link: https://arxiv.org/abs/2502.16886
Authors: Xuanfan Ni, Liyan Xu, Chenyang Lyu, Longyue Wang, Mo Yu, Lemao Liu, Fandong Meng, Jie Zhou, Piji Li
Affiliations: Nanjing University of Aeronautics and Astronautics; WeChat AI, Tencent; Fudan University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to ACL 2026 Findings

Abstract:To reduce memory consumption during LLM inference, a handful of methods have been proposed for KV cache pruning. While these techniques can accomplish lossless memory reduction on many datasets, they often hinge on an under-emphasized condition: an input/domain-specific threshold for KV cache budget needs to be pre-determined to achieve the optimal performance. However, such input-sensitive design may be considerably limited in real-world scenarios, as open-domain inputs span diverse domains, lengths and difficulty levels, without clear boundaries for threshold selection. As a result, the dependence of such input-sensitive threshold can be a fundamental limitation that causes large degradation on arbitrary inputs. In this work, we propose a new objective that lifts the threshold constraints for robust KV compression, advocating for “threshold-free” methods that adaptively adjust budget allocation while preserving full-cache performance. We then propose a novel method, ReFreeKV, serving as the first instantiation of this objective. Extensive experiments across 13 datasets with diverse context lengths, task types, and model sizes demonstrate its efficacy and efficiency. Our code is publicly released at this https URL.

[NLP-159] DNA Language Models: An Assessment of Pre-Training for Fine-Tuning Tasks

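[Quick Read]: This paper addresses the lack of systematic benchmark comparisons among deep learning models for genomic sequence analysis, in particular the trade-off between pretraining cost and fine-tuning performance for transformer-based models (such as DNABERT2) versus more conventional convolutional models (such as ConvNova). It poses three questions: whether the gains of transformer-based models on fine-tuning tasks justify their extensive and costly pretraining; how much pretraining actually contributes to downstream performance; and how Byte Pair Encoding (BPE) tokenization, whose relevance for DNA sequence representation is still debated, affects performance on genomics tasks. The approach is a systematic benchmark that compares model architectures on the same tasks, quantifies the contribution of pretraining, and evaluates the impact of BPE tokenization, with the aim of providing empirical guidance for DNA language models.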

Link: https://arxiv.org/abs/2606.30140
Authors: Romain Karpinsky, Julien Mozziconacci, Mickaël Delcey
Affiliations: CNRS; LORIA
Subjects: Genomics (q-bio.GN); Computation and Language (cs.CL)
Comments: 12 pages, 2 figures, 14 tables

Abstract:Recent breakthroughs in foundation models and Large Language Models (LLMs) have introduced new opportunities for studying and decoding genomic sequences. Several state-of-the-art approaches, such as DNABERT2, rely on transformer-based architectures, while others, such as ConvNova, still build upon more conventional convolutional models. However, systematic benchmark comparisons across these methods remain scarce. Given that transformer-based models require extensive and costly pretraining, it is crucial to evaluate whether their performance gains justify this overhead. Moreover, LLMs such as DNABERT2 typically rely on Byte Pair Encoding (BPE) tokenization, whose relevance for DNA sequence representation is still debated within the genomics community. In this work, we investigate three key questions: (i) do transformer-based models provide sufficient improvements on fine-tuning tasks upon heavy pretraining, (ii) what is the actual contribution of pretraining in this setting, and (iii) how does BPE tokenization impact performance on genomics-related tasks?

[NLP-160] Improving Large-Scale Weakly Supervised ASR by Filtering and Selection

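[Quick Read]: This paper addresses two problems with large-scale weakly supervised datasets for automatic speech recognition (ASR) training: noisy labels and a lack of domain specificity, both of which limit their effectiveness. The proposed solution is a staged training approach built on data filtering by character error rate (CER) and selection of samples similar to the target domain. It has three steps: pretraining on the entire dataset; continued pretraining on a subset filtered by CER; and fine-tuning on a small number of samples, selected from the filtered subset, that are acoustically similar to the target domain. In experiments with a 90,000-hour weakly supervised Japanese dataset, the filtering and selection methods worked synergistically and reduced CER by up to 6.4% and 4.0%, respectively, even though both steps reuse samples already seen in the first pretraining step and introduce no new data.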

Link: https://arxiv.org/abs/2606.28728
Authors: Kohei Matsuura, Masato Mimura
Affiliations: NTT Human Informatics Laboratories; NTT Corporation
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Comments: 5 pages, 4 figures, 2 tables

Abstract:Leveraging large-scale weakly supervised datasets is crucial to train robust end-to-end automatic speech recognition (ASR) models. However, such datasets often contain noisy labels and lack domain specificity, limiting their effectiveness. To address these issues and make better use of weakly supervised datasets, we propose a novel training approach incorporating data filtering and selection. Our approach consists of three steps: pretraining on the entire dataset, continued pretraining on a filtered subset based on character error rate (CER), and fine-tuning on a small number of acoustically similar samples to the target domain, selected from the filtered subset. In experiments with a 90,000-hour weakly supervised Japanese dataset, the proposed filtering and selection methods synergistically reduced CER by up to 6.4% and 4.0%, respectively, even though these steps reused training samples already used in the first pretraining step.
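
The CER-based filtering step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes CER is computed between each weak label and a hypothesis decoded by the first-stage model, and the `label`/`hypothesis` keys and the `max_cer=0.2` threshold are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def filter_by_cer(samples, max_cer=0.2):
    """Keep samples whose weak label agrees closely with the model hypothesis.

    Assumption: each sample is a dict with hypothetical keys
    'label' (weak transcript) and 'hypothesis' (first-stage model output).
    """
    return [s for s in samples if cer(s["label"], s["hypothesis"]) <= max_cer]
```

A low CER here is read as a sign that the weak label is probably clean, which is why the filtered subset is a plausible target for continued pretraining.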

Information Retrieval

[IR-0] Field Order Should Not Matter: Permutation-Invariant Embedding Model Fine-Tuning for Structured Metadata Retrieval

Link: https://arxiv.org/abs/2606.30473
Authors: Aivin V. Solatorio, Olivier Dupriez, Rafael Macalaba
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); General Economics (econ.GN)
Comments: 26 pages, 7 figures, 12 tables

Abstract:We study retrieval over catalogs of structured metadata, where each record is a small schema whose fields answer different kinds of query. Embedding a record with a text encoder first serializes its fields into a string, which forces a choice of field order. We show this choice, usually treated as an implementation detail, silently controls retrieval quality once the encoder is fine-tuned. A standard fine-tune loses 7.4 nDCG@10 points when the index is rebuilt under a different field order, because it reads absolute position instead of the field labels. We propose permutation-invariant fine-tuning (PI-FT), which serializes each record under a freshly sampled field order with random field dropout, so meaning binds to the labels rather than to position. The change is about two lines in the data loader; it costs negligible in-distribution accuracy and cuts the order-change penalty to 0.2 points. We study this in the discovery of development statistics, a catalog of nearly 10,000 indicators that should be searchable in many languages by a model small enough to self-host. As AI assistants and agents increasingly mediate access to public data and statistics, this retrieval step decides whether an answer is grounded in the right indicator or series, making discoverability a precondition for disseminating data through AI. Because usage logs cannot provide training signal for indicators no one has searched, we generate the queries instead. DevDataBench is a fully LLM-generated benchmark of grounded, facet-targeted queries across 15 languages, covering every indicator for both training and evaluation. A fine-tuned 118M-parameter CPU encoder outperforms every zero-shot baseline, including text-embedding-3-large (0.707 vs. 0.556 nDCG@10), with the largest gains in low-resource languages. We release the benchmark, pipeline, models, and a reusable PI-FT framework.
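
The data-loader change the abstract describes (a freshly sampled field order plus random field dropout) might look roughly like the sketch below. It is an assumption-laden illustration: the `label: value | ...` serialization format, the `dropout=0.1` rate, and the rule of always keeping at least one field are hypothetical and not taken from the paper.

```python
import random


def serialize_record(record: dict, rng: random.Random, dropout: float = 0.1) -> str:
    """Serialize a metadata record under a random field order with field dropout.

    Assumptions (hypothetical, not from the paper): fields are rendered as
    'label: value' joined by ' | ', and at least one field is always kept.
    """
    fields = [(k, v) for k, v in record.items() if v]
    if not fields:
        return ""
    # Randomly drop fields so the encoder cannot rely on any one always being present.
    kept = [f for f in fields if rng.random() >= dropout]
    if not kept:
        kept = [rng.choice(fields)]
    # Shuffle so meaning has to bind to the labels, not to absolute position.
    rng.shuffle(kept)
    return " | ".join(f"{k}: {v}" for k, v in kept)
```

Calling this once per training example, instead of using a fixed serialization, means the same record is seen under many orders over the course of fine-tuning.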

[IR-1] ENC-ODE: Event-level Neurodegenerative Modeling in Continuous Time with Neural ODEs MICCAI2026

Link: https://arxiv.org/abs/2606.30398
Authors: Yujee Song, Seunghun Baek, Guorong Wu, Won Hwa Kim
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: MICCAI 2026

Abstract:Accurately predicting the temporal evolution of clinical biomarkers is crucial for the early diagnosis and management of neurodegenerative diseases such as Alzheimer’s disease. However, this relies on longitudinal data to capture biomarker changes over time, which is often sparse and irregular due to the high cost, labor-intensive nature, and patient burden. To address these challenges, we propose ENC-ODE, an Event-level Neurodegenerative modeling in Continuous time with neural Ordinary Differential Equations. ENC-ODE predicts future biomarker evolution by modeling clinical events through diagnosis-conditioned continuous dynamics. A target-conditioned attention mechanism weights and aggregates event-level predictions for the target time and modality without history compression. Extensive experiments on Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset demonstrate that ENC-ODE outperforms representative sequence models while offering a scalable and neuroscientifically grounded solution for clinical support. The code is available at this https URL.

[IR-2] Research Entity Extraction and Topic Detection from UKRI Grant Proposals

Link: https://arxiv.org/abs/2606.30304
Authors: Xingran Ruan, Angelo Salatino, Rosa Filgueira, Kara Moraw, Alexandru Marcoci, Gemma Derrick, Sarah Callaghan
Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted at the STI-ENID Conference. Will be presented in September 2026 in Antwerp (Belgium)

Abstract:This paper presents preliminary findings from a UKRI-funded Metascience project comparing three LLM-based approaches, GPT-4o, Mistral, and a bespoke algorithm, DSIT-Taxonomies, for extracting and classifying research entities from funding proposals. Our project “Tracking Stars and Unicorns” aims to identify early signals of emerging research areas to inform public investment. Our methodology employed a three-stage pipeline, leveraging Mistral for primary entity extraction and mapping against the OpenAlex Topics taxonomy. We evaluated our approach across 42 proposals’ abstracts from different areas and observed that Mistral and GPT-4o produce comparable, high-quality entity sets with significant semantic overlap, outperforming the fragmented DSIT-Taxonomies approach. Crucially, the Mistral-based approach achieved superior topic classification accuracy (90.5%) compared to the full DSIT-Taxonomies pipeline (71.4%). We conclude that Mistral offers a high-performance, operationally efficient, and secure solution for large-scale analysis of sensitive grant data.

[IR-3] Query-Aware Spreading Activation for Multi-Hop Retrieval over Knowledge Graphs

Link: https://arxiv.org/abs/2606.30133
Authors: Illia Makarov, Mykola Glybovets
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted for publication in Cybernetics and Systems Analysis (Springer). Not yet published

Abstract:Retrieval-augmented generation built on knowledge graphs (Graph RAG) outperforms flat passage retrieval on multi-hop question answering by leveraging graph structure. In most existing systems, however, the question only sets the seed nodes; the subsequent traversal becomes “query-blind”, depending solely on the graph structure. The exception is QAFD-RAG, which implements query-aware traversal via a flow-diffusion solver with combined edge re-weighting. This architecture requires loading the full graph into Python memory and an iterative solver with a variable number of iterations complicating integration with the graph database. We propose a spreading-activation method that achieves the same query-aware traversal with a single per-step semantic gate: the step weight is the cosine similarity between the candidate entity’s description and the question, and the number of iterations is fixed. The whole retrieval procedure - seed mapping, propagation, top-K selection and context assembly - is expressed as a single Cypher query executed in one round-trip to Neo4j; the graph never leaves the database. On MuSiQue our method matches QAFD-RAG by exact match (32.80 vs 33.50) and outperforms the strongest purely-structural baseline in our comparison, HippoRAG, by 5.3 EM and 3.4 F1; on 2WikiMultiHopQA HippoRAG and QAFD-RAG retain an advantage due to their phrase-node architectures. An ablation with the gate disabled confirms that the gate is the source of a simultaneous F1 gain of 3.6 to 7.4 points and a retrieval-latency reduction by a factor of 1.5 to 4.9.
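
The per-step semantic gate can be illustrated in plain Python. This is only a sketch of the idea: the paper expresses the whole procedure as a single Cypher query inside Neo4j, whereas the in-memory adjacency dict, the max-aggregation of activations, and the clipping of negative similarities below are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def spread(graph, embeddings, query_vec, seeds, steps=2, top_k=3):
    """Query-aware spreading activation with a fixed number of steps.

    graph: node -> list of neighbour nodes (hypothetical in-memory adjacency).
    embeddings: node -> embedding of the entity description.
    The step weight is cosine(description, question); combining activations
    with max() is an assumption, not necessarily the paper's rule.
    """
    activation = {s: 1.0 for s in seeds}
    frontier = dict(activation)
    for _ in range(steps):
        nxt = {}
        for node, act in frontier.items():
            for nb in graph.get(node, []):
                gate = max(0.0, cosine(embeddings[nb], query_vec))
                nxt[nb] = max(nxt.get(nb, 0.0), act * gate)
        for n, a in nxt.items():
            activation[n] = max(activation.get(n, 0.0), a)
        frontier = nxt
    # Rank by activation, breaking ties by node name for determinism.
    ranked = sorted(activation.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]
```

With the gate replaced by a constant 1.0, the traversal depends only on graph structure, which corresponds to the "query-blind" behaviour the abstract contrasts against.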

[IR-4] Efficient Retrieval-Augmented Generation via Token Co-occurrence Graphs

Link: https://arxiv.org/abs/2606.30093
Authors: Gianluca Bonifazi, Christopher Buratti, Michele Marchetti, Federica Parlapiano, Giulia Quaglieri, Davide Traini, Domenico Ursino, Luca Virgili
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Retrieval-Augmented Generation (RAG) mitigates hallucinations in Large Language Models (LLMs) by grounding the generation process on external knowledge. However, standard RAG approaches struggle with multi-hop reasoning. While recent graph-based RAG methods improve the retrieval of interconnected chunks, they often rely on computationally expensive and error-prone LLM-based extraction pipelines. To address these issues, we propose TIGRAG (Token-Induced GraphRAG), an efficient graph-augmented RAG framework based on a token co-occurrence Knowledge Graph. TIGRAG directly models topological relationships between tokens using sliding-window co-occurrence statistics, thus enabling scalable graph construction. During inference, it combines graph-based semantic expansion and neural reranking to retrieve interconnected evidence for multi-hop reasoning. Specifically, it introduces an iterative entity-driven retrieval strategy that progressively expands the query using bridging entities extracted from previously retrieved contexts. We evaluated TIGRAG on three widely adopted multi-hop Question Answering (QA) benchmarks. Experimental results demonstrated that our framework consistently outperforms dense retrieval and graph-based RAG methods in both retrieval and downstream QA tasks, while substantially reducing indexing time, inference latency, and prompt footprint.
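
Building a graph from sliding-window co-occurrence statistics needs no LLM calls, which is where the indexing savings would come from. The sketch below shows one plausible construction; the undirected edges, raw counts as weights, and the `window=3` default are assumptions, not details confirmed by the abstract.

```python
from collections import Counter


def cooccurrence_graph(docs, window=3):
    """Count undirected token co-occurrences within a sliding window.

    docs: iterable of token lists. A window of size w pairs each token with
    the next w - 1 tokens. Returns a Counter mapping sorted (token_a, token_b)
    pairs to counts, which can serve as weighted graph edges.
    """
    edges = Counter()
    for tokens in docs:
        for i, left in enumerate(tokens):
            for right in tokens[i + 1:i + window]:
                if left != right:  # skip self-loops
                    edges[tuple(sorted((left, right)))] += 1
    return edges
```

For example, `cooccurrence_graph([["graph", "rag", "graph"]], window=2)` counts the pair `("graph", "rag")` twice, once for each adjacent occurrence.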

[IR-5] Behind the Content: Wikipedia Mobile Views and Tourism Activity

Link: https://arxiv.org/abs/2606.29991
Authors: Lucas Eustache, Paul Favier
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:This study examines whether open digital traces can provide interpretable, high-frequency indicators of local tourism activity. We argue that the device composition of Wikipedia attention helps distinguish situated information use from remote planning: mobile pageviews are more likely to reflect on-site, contemporaneous information needs, whereas desktop pageviews capture temporally diffuse interest. Linking daily Accor hotel room-nights to Wikipedia city-page traffic for 704 French communes from 2018 to 2025, we find that mobile pageviews are positively associated with same-day hotel demand and dominate desktop traffic in joint specifications. The relationship is stronger in leisure-oriented destinations and in places with higher Wikipedia visibility. A micro-validation using daily attendance at six cultural attractions in Orléans shows the same pattern: mobile pageviews predict same-day gate counts, while surrounding leads and lags are close to zero. The findings position mobile Wikipedia traffic as a transparent, replicable nowcasting signal for tourism activity.

[IR-6] From Extraction to Navigation: Progressive Retrieval with Indirectly Infinite Depth

Link: https://arxiv.org/abs/2606.29970
Authors: Linxiao Che, Shanshan Huang, Haitao Lu, Yijia Sun, Qiang Luo, Ruiming Tang, Han Li, Kun Gai, Guorui Zhou
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Modern large-scale recommender retrieval is shifting from static similarity matching to dynamic item space navigation, framing retrieval as iterative goal-driven graph traversal. Conventional item-to-item (i2i) methods fall into the “interest tunnel” and fail to excavate deep user interests, while existing index-based retrieval suffers from persistent “search drift”, caused by static entry nodes and fixed graph topologies unable to track shifting real-time user intent. To resolve the above defects, we present IID-Nav, a framework modeling retrieval as stateful autonomous graph exploration with three core contributions: (1) A goal-aware navigation policy substituting passive neighborhood expansion with active intent routing supervised by a target discriminator; (2) A recursive state evolution mechanism supporting Indirectly Infinite Depth (IID) via cross-request state reuse, which enables logical unlimited-depth graph traversal without linearly rising inference latency; (3) A trajectory-aligned training paradigm equipped with graph hard negative sampling to stabilize optimization over full navigation paths. Evaluations on billion-level industrial datasets show IID-Nav surpasses mainstream retrieval baselines under strict latency budgets. Empirical results verify that our method alleviates search drift remarkably and retains high precision for deep retrieval paths, offering an efficient, robust retrieval solution for industrial recommendation systems.

[IR-7] Know Before You Fetch: Calibrated Retrieval-Budget Allocation for Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2606.29959
Authors: Zhe Dong (1), Fang Qin (2), Manish Shah (3), Yicheng Wang (3) ((1) University of Maine at Presque Isle, (2) Stanford University, (3) Independent Researcher)
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 17 pages, 9 figures

Click to view abstract

Abstract:Retrieval-augmented generation (RAG) typically retrieves a fixed number of passages for every query. This is wasteful when the reader already knows the answer, and it can be harmful when irrelevant or partially relevant passages distract the reader. We formulate adaptive RAG as calibrated retrieval-budget allocation: given a query, decide whether to answer closed-book, retrieve a compact context (k=1), retrieve a full context (k=5), or abstain. The contribution is a probability interface rather than a new raw uncertainty signal. We calibrate sequence log-probability and prefix-logit uncertainty signals into probabilities of correctness, then use these probabilities for graded context selection, selective abstention, and explicit latency/token trade-offs. Across core QA experiments on TriviaQA, Natural Questions, and MS MARCO, with auxiliary PopQA motivation and Qwen/Llama family checks, diagnostic out-of-fold calibration improves probability quality dramatically: for sequence log-probability, ECE drops from 0.275 to 0.062 on TriviaQA, 0.643 to 0.009 on NQ, and 0.711 to 0.031 on MS MARCO. Graded retrieval improves full-context and passage-budget frontiers for both our signal and TARG-style prefix entropy/margin, while retrieval-call AUC remains essentially tied with binary gating because k=1 is still a retrieval call. Held-out train/validation/test threshold experiments report deployable operating points. At matched-accuracy frontier operating points, a measured cost model reveals that gating is not universally faster: it increases latency by about 27% on Qwen3-8B but saves about 8% on Qwen3-32B. These results support a nuanced view of adaptive RAG: calibrated confidence is best understood as a reusable interface for allocating retrieval budget under task and system constraints.
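
The ECE figures quoted above refer to expected calibration error. A minimal sketch of the common equal-width-bin estimator follows. The bin count and the function name are assumptions, and the paper's exact binning scheme is not stated in the abstract.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sample-weighted mean of |accuracy - confidence|.

    `confidences` are predicted probabilities of correctness in [0, 1];
    `correct` holds 1 for a correct answer and 0 otherwise.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, correct):
        # Clamp p == 1.0 into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece
```

A lower value means the predicted probabilities track empirical accuracy more closely. A retrieval policy that thresholds those probabilities depends on this property.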

[IR-8] Diagnosing and Mitigating Retrieval Bottlenecks in LLM-Based Cold-Start Recommendation

Link: https://arxiv.org/abs/2606.29947
Authors: Zhe Dong (1), Fang Qin (2), Manish Shah (3), Yicheng Wang (3) ((1) University of Maine at Presque Isle, (2) Stanford University, (3) Independent Researcher)
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 17 pages, 6 figures, 13 tables

Click to view abstract

Abstract:Large language models (LLMs) are increasingly used as rerankers in recommender systems, with the expectation that semantic understanding will help in cold-start and long-tail regimes. We test this assumption with a five-domain benchmark that explicitly separates reranking quality from retrieval coverage. In a positive-controlled regime where the gold item is guaranteed present, calibrated LLM rerankers fail to consistently outperform strong collaborative and content baselines under natural traffic, and within-family scaling from Qwen3-8B to Qwen3-32B narrows but does not close the gap on most domains. In a retrieval-realistic regime where the gold item is not injected, the bottleneck is more severe: standard single retrievers place the gold item in a 200-item pool only 4.6-22.9% of the time, largely because 32-91% of cold-start targets are brand-new items with no training interactions. We introduce LHF, a validation-trained learned hybrid fusion layer over a multi-retriever union pool, as a retrieval-side realizability baseline. LHF is the only combiner we test that beats every single retriever on all five domains and recovers 17-61% of oracle coverage headroom on content-rich domains, but only 5-7% on collaboratively strong domains. End-to-end experiments reveal the remaining mismatch: learned non-LLM ranking exploits the LHF pool, while prompt-level LLM reranking often degrades it. LLMs exhibit pockets of semantic cold-start advantage, especially in text-rich domains when the item is already present, but this advantage is largely unreachable in current retrieve-then-rerank pipelines. We release the benchmark protocol, splits, prompts, evaluation tooling, and archived reproducibility artifacts: data at this https URL and code at this https URL.

[IR-9] POEM: Partial-Order Enhanced Real-Time Sequential Modeling for Recommendation

Link: https://arxiv.org/abs/2606.29946
Authors: Linxiao Che, Yijia Sun, Siyuan Lou, Shanshan Huang, Qiang Luo, Ruiming Tang, Han Li, Kun Gai
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Real-time recommendation systems suffer from the dynamic drift of user interests and varying contextual conditions. Conventional sequential recommendation models only exploit static historical click sequences, which fail to capture instant preference changes and overlook structured signals hidden within the multi-stage ranking pipeline of industrial recommendation systems. To tackle these limitations, we propose POEM (Partial-Order Enhanced Modeling), a new real-time sequential modeling framework built upon intrinsic partial-order relations from the recommendation cascade. POEM takes real-time multi-task ranking scores (including predicted CTR and predicted watch duration) generated by upstream ranking modules as supervision to construct dynamic partial-order sequences, supporting fine-grained real-time interest modeling and consistent optimization between system ranking targets and user behavioral patterns. We summarize our core contributions as three aspects: (1) a partial-order guided sequence construction paradigm, which enriches vanilla chronological sequences via dynamic grouping and sampling conditioned on real-time ranking scores to reassess user interests per request; (2) a multi-objective score fusion module that unifies heterogeneous ranking signals into a compact quintuple representation with normalized rank-aware weighting; (3) a hierarchical sample learning strategy, which adopts system-favored high-ranked items and user positive feedback (e.g., long-duration watched videos) as positive instances, paired with graph-mined hard negatives and a margin-based pairwise loss for robust training. Fully deployed on Kuaishou online traffic, POEM achieves significant online gains: average per-user watch time lifts by 0.249% on the KS Single Page and 0.213% on the KS Lite Page.

[IR-10] SABER-Math: Automated Benchmark for Information Retrieval Evaluation in Mathematics ICML

Link: https://arxiv.org/abs/2606.29894
Authors: Nikolay Georgiev, Maria Drencheva, Kseniia Ibragimova, Ivo Petrov, Dimitar I. Dimitrov, Martin Vechev
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted in the 3rd AI for Math Workshop at the 43rd International Conference on Machine Learning (ICML), Seoul, South Korea, 2026

Click to view abstract

Abstract:As agentic AI systems tackle more complex mathematical tasks, they increasingly rely on information retrieval (IR) to search problem databases, theorem libraries, and educational resources. However, choosing the right retriever remains difficult, as it is infeasible to directly isolate its effect on downstream performance. On the other hand, existing retrieval-specific benchmarks often fail to capture fine-grained mathematical relevance, penalizing relevant documents. We address this gap by introducing SABER-Math, the first fully automated benchmark for evaluating mathematical IR without expert annotation. Starting from 283K high-school-level math problems with solutions, SABER-Math builds challenging reranking tasks in three steps: (i) first, LLMs extract concise solution summaries and mathematical topics for each problem; (ii) then, per-query relevant documents are discovered using ontology topic-based and lexical solutions-summary-based similarities, and (iii) finally, a Swiss-style LLM preference tournament produces fine-grained relevance ratings for the documents. We evaluate lexical retrievers, specialized mathematical retrieval systems, and recent embedding models. We find that while modern embedding models substantially outperform classical and math-specific baselines, even the strongest systems struggle in symbol-heavy domains like Algebra and Calculus. Importantly, we show that general-purpose IR benchmarks such as MTEB do not reliably predict mathematical performance, especially for recent embedding models, highlighting the need for math-specific retrieval benchmarks.

[IR-11] Exploring Motivations for Algorithm Mention in the Domain of Natural Language Processing: A Deep Learning Approach

Link: https://arxiv.org/abs/2606.29859
Authors: Yuzhuo Wang, Yi Xiang, Chengzhi Zhang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:With the rise of data-intensive science, algorithms have become central to scientific research. In academic papers, algorithms are mentioned for different purposes, such as describing, using, comparing, or improving methods for specific research tasks. Identifying these purposes can reveal relationships among algorithms and help assess their roles and value. Taking natural language processing (NLP) as an example, this study proposes a sentence-level framework for identifying, analyzing, and tracing the evolution of motivations for mentioning algorithms. We first identify algorithm entities and algorithm-related sentences from full-text papers through manual annotation and machine learning. We then classify mention motivations using pretrained models and data augmentation, and analyze their distribution and temporal evolution. The results show that deep learning models trained with augmented data outperform traditional machine learning models in motivation classification. In NLP papers, more than half of algorithm-related sentences express direct use, whereas improvement is the least frequent motivation. The diversity of motivations has increased over time. For specific algorithm categories, grammar-based algorithms are more often mentioned for description, while machine learning algorithms are more often mentioned for use. Over time, use motivations have gradually replaced description motivations across different algorithms, and the number of motivation types associated with individual algorithms has declined significantly. This study reveals how authors mention algorithm entities in academic writing and provides a basis for future research on algorithm relationship identification and algorithm impact evaluation.

[IR-12] Revealing the Technology Development of Natural Language Processing: A Scientific Entity-Centric Perspective

Link: https://arxiv.org/abs/2606.29836
Authors: Heng Zhang, Chengzhi Zhang, Yuzhuo Wang
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Most studies on technology development have been conducted from a thematic perspective, but the topics are coarse-grained and insufficient to accurately represent technology. The development of automatic entity recognition techniques makes it possible to extract technology-related entities on a large scale. Thus, we perform a more accurate analysis of technology development from an entity-centric perspective. To begin with, we extract technology-related entities such as methods, datasets, metrics, and tools in articles on Natural Language Processing (NLP), and we apply a semi-automatic approach to normalize the entities. Subsequently, we calculate the z-scores of entities based on their co-occurrence networks to measure their impact. We then analyze the development trends of new technologies in the NLP domain since the beginning of the 21st century. The findings fall into three parts. First, the continued increase in the average number of entities per paper implies a growing burden on researchers to acquire relevant technical background knowledge. However, the emergence of pre-trained language models has injected new vitality into the technological innovation of the NLP domain. Second, methods dominate among the 179 high-impact entities. An analysis of z-score trends for the top 10 entities reveals that pre-trained language models, exemplified by BERT and Transformer, have become mainstream in recent years. Unlike the other eight method entities, the impact of the Wikipedia dataset and the BLEU metric has continued to rise over the long term. Third, new high-impact technologies have recently gained popularity faster than ever before, and their acceptance by researchers has accelerated at an unprecedented pace. Our study provides a new perspective on analyzing technology development in a specific domain.

[IR-13] Mandol: An Agglomerative Agent Memory System for Long-Term Conversations

Link: https://arxiv.org/abs/2606.29778
Authors: Yuhan Zhang (1), Zhiyuan Guo (1), Ziheng Zeng (1), Wei Wang (1), Wentao Wu (2), Lijie Xu (1) ((1) Institute of Software, Chinese Academy of Sciences, (2) Microsoft Research)
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 10 pages, 3 figures

Click to view abstract

Abstract:Long-term conversational agents need to remember and query cross-session, multi-typed information with complex correlations. Existing agent memory systems rely on heterogeneous vector and graph databases, which fragment memory information and cause high cross-database I/O latency. For retrieval, common RAG-style methods tend to introduce noise, miss correlated clues, and lack token budget control, degrading LLM accuracy and efficiency. We propose Mandol, an agglomerative memory system that consolidates fragmented memory representations and storage into a unified memory-native architecture. Its core components include: (1) a hierarchical memory model that organizes memory into a basic layer representing raw memory information and a high-level abstract layer that agglomerates basic memories into traceable abstract memories, both uniformly represented as structured semantic graphs; (2) an agglomerative semantic data structure combining SemanticMap and SemanticGraph, which natively fuses key-value, vector, and graph structures and provides unified hybrid retrieval operators to eliminate cross-database I/O; and (3) a quantitative query mechanism with query-adaptive routing, quantitative denoising and conflict resolution, and token-constrained context generation, all without involving LLMs during retrieval. Experiments on two widely used long-term conversation benchmarks, LoCoMo and LongMemEval, show that Mandol achieves the best overall accuracy among representative agent memory systems. For performance comparison, Mandol also obtains a 5.4x retrieval speedup and a 4.8x insertion speedup under 10 QPS concurrent load, while still maintaining low latency on consumer-grade hardware. 

[IR-14] Do Recommendation Algorithms Work When Users Are LLM Agents? A Case Study on Moltbook

Link: https://arxiv.org/abs/2606.29762
Authors: Daming Li, Simeng Han, Jialu Zhang
Subjects: Information Retrieval (cs.IR)
Comments: 10 pages, 2 figures, 4 tables

Click to view abstract

Abstract:Large language model (LLM) agents are increasingly populating web platforms, raising a fundamental question for recommender systems: do algorithms designed for human users still work when users are LLM agents that may not have well-defined content consumption preferences? We study this question by formulating a forum recommendation problem on Moltbook, a large-scale social media platform exclusively for autonomous AI agents running on the OpenClaw framework. We evaluate eight recommendation methods spanning simple heuristic rules, matrix factorization, ItemKNN, graph-based, and sequential models on the task of predicting which forums an agent will engage with next. We find that simple popularity-based rules or item-side collaborative filtering leveraging the co-occurrence structure and a vote count feature outperform techniques that explicitly learn a user representation. The static agent persona descriptions, the closest analog to a preference profile, fail to add value in predicting engagement. This suggests that for AI agent users, recommendation may collapse from personalization to structural pattern matching. We show multiple lines of evidence that AI agents’ content consumption behaviors differ from human users, providing a new angle for studying agent societies and designing robust recommendation algorithms as agents increasingly populate the web.

[IR-15] Diagnosing and Mitigating Context Rot in Long-horizon Search

Link: https://arxiv.org/abs/2606.29718
Authors: Shijie Xia, Yikun Wang, Zhen Huang, Pengfei Liu
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Extensive context has become the norm as Large Language Models (LLMs) are increasingly deployed in long-horizon tasks. The concern that increasing context length degrades model capabilities, known as context rot, has become a central issue for these applications. In this paper, we focus on deep search scenarios, aiming to investigate the rot phenomenon and its mitigation strategies. By evaluating four flagship open-source models across three benchmarks, we reveal a prevalent but unnoticed rot phenomenon: extensive context causes models to directly give up or prematurely provide uncertain answers, and this issue is exacerbated as the context grows. Through pruning experiments, we demonstrate the relationship between the accumulated context and the rot phenomenon. Furthermore, we investigate mitigating this issue through context management and post-hoc rejection sampling. For context management, we systematically evaluate seven different methods across three categories, based on performance, cost, and impact on context rot, providing clear guidance for strategy selection and usage. For rejection sampling, we develop a rot-aware filtering strategy and demonstrate its effectiveness across three aggregation methods. Finally, we show that these two approaches can be combined for further performance improvements.

[IR-16] ARMOR: Adaptive Retriever Optimization for Low-Resource Telecom Question Answering

Link: https://arxiv.org/abs/2606.29706
Authors: Heshan Fernando, Quan Xiao, Yan Xin, Tianyi Chen
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Telecom question answering (QA) is a challenging setting for retrieval-augmented generation (RAG): evidence is fragmented across standards, papers, encyclopedic resources, and web documents, and answers often hinge on technical tables, equations, and specialized protocol language. In low-resource subdomains, generator fine-tuning can over-specialize and degrade general capability, making query-side retriever adaptation an attractive alternative. To this end, we ask whether a fixed-generator, query-adapted RAG system can outperform generator-side adaptation, and which retriever objectives best support that setting. We motivate retrieval, rather than generator fine-tuning, as the adaptation target through a capacity comparison: under bounded-parameter and soft-retrieval assumptions, query-encoder tuning can have a smaller estimation term than supervised fine-tuning when its effective dimension is smaller. We identify two particularly relevant objectives – the latent-document RAG likelihood, which optimizes generation utility, and the InfoNCE contrastive objective, which improves semantic retrieval geometry – and leverage them jointly through a retriever optimization method targeting downstream QA performance in the telecom domain. Specifically, we introduce ARMOR, Adaptive Regularized Mixture Optimization for Retrievers, which learns separate temperatures for the RAG retrieval distribution and InfoNCE softmax and regularizes the adapted query encoder toward the frozen base query encoder. Across telecom-specific retrieval and generative QA benchmarks, we show that ARMOR improves evidence retrieval and answer generation in several in-domain settings. Code is available at this https URL.
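
The two objectives named in the abstract have standard textbook forms, sketched below for a single query. The way they are combined here is an assumption and not ARMOR's published formulation: the `mix` and `reg` weights, the squared-L2 penalty, and all function names are hypothetical. In the paper the two temperatures are learned, whereas here they are plain arguments.

```python
import math


def log_softmax(scores, temperature):
    """Numerically stable log-softmax of scores / temperature."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]


def info_nce(scores, positive_index, temperature):
    """Contrastive loss: negative log-probability of the positive document."""
    return -log_softmax(scores, temperature)[positive_index]


def rag_nll(scores, answer_log_likelihoods, temperature):
    """Latent-document RAG loss: -log sum_d p(d | q) * p(y | q, d)."""
    log_p_doc = log_softmax(scores, temperature)
    terms = [lp + ll for lp, ll in zip(log_p_doc, answer_log_likelihoods)]
    m = max(terms)
    return -(m + math.log(sum(math.exp(t - m) for t in terms)))


def mixed_objective(scores, positive_index, answer_log_likelihoods,
                    tau_rag, tau_nce, query_vec, base_query_vec,
                    mix=0.5, reg=0.1):
    """Hypothetical mixture with a pull toward the frozen base query encoder."""
    drift = sum((a - b) ** 2 for a, b in zip(query_vec, base_query_vec))
    return (mix * rag_nll(scores, answer_log_likelihoods, tau_rag)
            + (1 - mix) * info_nce(scores, positive_index, tau_nce)
            + reg * drift)
```

Keeping `tau_rag` and `tau_nce` separate lets the retrieval distribution used for generation be sharper or flatter than the one used for the contrastive term.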

[IR-17] As We May Search

Link: https://arxiv.org/abs/2606.29652
Authors: Saber Zerhoudi, Adam Roegiest, Jelena Mitrovic, Michael Granitzer
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:The sensitive information in personal documents, legal files, and medical records is among the most valuable things to search, yet current retrieval-augmented generation systems still require sending content to remote servers. We propose local-first IR, a design philosophy where indexes, models, and inference reside on user devices, treating remote services as optional. This paper makes four contributions: (1) a framework organizing retrieval architectures along three dimensions: privacy and control, capability, and accessibility, (2) experiments on consumer hardware across five benchmarks, scaling from 1K to 1M documents with dense retrieval, BM25, and hybrid fusion. Dense retrieval keeps over 91% nDCG@10 up to 100K documents, with approximate HNSW indexes extending this to 1M with only 2% quality loss; a 7B local language model reaches within 4 points of a cloud baseline on answer quality, (3) competing perspectives for and against local-first IR, informed by experimental evidence, and (4) a research agenda identifying open problems. The real tradeoff is scope rather than quality: what matters is what you can search, not how well you can search it.
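
The nDCG@10 metric reported above can be computed as in the sketch below. It assumes the linear-gain variant and takes the ideal ranking from the judged list itself, so it presumes every relevant document appears in that list. The paper's exact evaluation tooling is not stated in the abstract.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the top-k graded relevances."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the system ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(ranked_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal if ideal > 0 else 0.0
```

Here `ranked_relevances` lists the relevance grade of each returned document in rank order, so `ndcg_at_k([1, 0, 0])` scores a ranking whose only relevant document is at the top.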

[IR-18] Metadata Structure or Strategy? A Decomposition of RAG Context Enrichment

Link: https://arxiv.org/abs/2606.29645
Authors: Saber Zerhoudi, Michael Granitzer, Jelena Mitrovic
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Retrieval-augmented generation (RAG) systems increasingly enrich retrieved passages by attaching quality metadata, structuring them into explicit records, and adopting multi-hop retrieval strategies that accumulate evidence across steps. These changes assume that richer context yields better answers, yet existing evaluations cannot test this because they vary all three factors at once. We isolate each factor in a controlled experiment across six benchmarks, four models from three families, and five enrichment levels, totaling over 24,000 evaluated responses. The assumption does not hold. Most enrichment reduces accuracy. Models prompted to use confidence scores comply correctly yet produce worse answers, a gap between utilization and accuracy that no prior work has measured. What determines answer quality is not how much metadata the context carries but whether the model can act on it for the given task. When metadata and retrieval strategy are aligned with model capabilities, a smaller model outperforms a frontier model by 19 F1 points. These findings motivate a processability hierarchy that predicts, from pre-training properties alone, which metadata a model can productively use, reframing RAG design as a question of model-context alignment rather than metadata accumulation.

[IR-19] mamabench and mamaretrieval: Benchmarks for Evaluating Medical Retrieval-Augmented Generation in Maternal Neonatal and Reproductive Health

Link: https://arxiv.org/abs/2606.29467
Authors: Yi Ren
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 13 pages, 3 tables. Datasets and construction code linked in the paper

Click to view abstract

Abstract:Medical question-answering benchmarks rarely cover the maternal, neonatal, child, and reproductive-health questions a nurse-midwife asks, and, to our knowledge, no public chunk-level relevance benchmark exists for maternal-health guideline retrieval. We release two benchmarks that fill these gaps. mamabench is a scope-filtered QA set of 25,949 items assembled from seven existing expert-authored sources across multiple-choice, short-answer, and rubric-graded tracks; to help users calibrate the LLM judge that scores the rubric track, we re-scope HealthBench’s physician-labelled meta-evaluation to the domain. mamaretrieval pairs 3,185 clinical queries with graded (0-6) relevance labels over a 63,650-chunk maternal-health guideline corpus, using a decomposed rubric that distinguishes a chunk that answers a query from one merely on its topic. Three decisions shape both: assemble and filter expert sources rather than author questions, grade relevance rather than binarise it, and measure and disclose the limits of the labels – scope-classifier agreement, a frontier-judge check, and a pooling-completeness audit – rather than treat them as an oracle. A companion paper uses the benchmarks to evaluate a deployed on-device assistant; both are released openly for research.

[IR-20] Monosemanticity in Recommender Systems

Link: https://arxiv.org/abs/2606.29341
Authors: Yagel Alfasi, Eden Rzezak, Eadan Schechter
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Latent factor models such as matrix factorization are widely used in recommender systems, yet the learned embedding dimensions typically lack explicit semantic interpretation. This opacity limits transparency, explainability, and principled intervention in recommendation behavior. While sparse autoencoders (SAEs) have recently been used to extract monosemantic features from dense neural representations, standard SAEs suffer from scaling pathologies including feature splitting, feature absorption, and feature composition, which degrade interpretability as dictionary size increases. In this work, we investigate whether hierarchical sparse representations can reveal interpretable structure in collaborative filtering embeddings. We train a large-scale matrix factorization recommender system on the Amazon Fashion dataset and apply a Matryoshka Sparse Autoencoder (MSAE) to the learned embeddings. We analyze the resulting latent features through metadata alignment and LLM-generated labeling to assess semantic coherence and disentanglement. Finally, we show an intervention on a subset of gender associated latent neurons that emerged from the analysis. Our findings suggest that collaborative filtering embeddings contain recoverable hierarchical structure, and that Matryoshka training provides a principled mechanism for exposing interpretable latent factors in interaction-driven recommendation models.

[IR-21] Covering the Unseen: Information Demand Coverage Optimization for Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2606.29328
Authors: Bingxue Zhang, Jianying Jia, Feida Zhu
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 12 pages, 5 figures, 13 tables

Click to view abstract

Abstract:Retrieval-augmented generation (RAG) typically treats context selection as ranking chunks against a single query embedding. This assumption breaks down for complex queries, such as multi-hop or ambiguous questions, where top-k selection tends to over-cover one semantic aspect while ignoring critical sub-questions. We propose GeoRAG, which recasts context selection as Information Demand Coverage Optimization. GeoRAG builds a multi-dimensional demand distribution through diverse sub-query generation and reverse-validation weighting, then selects context by minimizing the Sinkhorn-Wasserstein distance between this demand distribution and the coverage of the selected set. The resulting demand-weighted facility-location objective is monotone submodular, giving a 1-1/e greedy guarantee, which we approximate with a Sinkhorn-based marginal-gain surrogate. The method is unsupervised, training-free, and retrieval-agnostic. We further show that single-point, query-proximity scorers cannot cover multi-modal demands, exposing a structural limit of ranking-based selection. On six open-domain QA benchmarks, GeoRAG improves exact match (EM) by +6.5 to +7.5 points over top-k truncation (up to +9.7 on HotpotQA and ASQA) and outperforms strong baselines including MMR, DPP, BGE-Reranker, SMART-RAG, and AdaGReS, with stable gains across context budgets and sub-query generators.
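
The demand-weighted facility-location objective and its greedy maximization can be sketched as follows. This uses the exact marginal gain, not the Sinkhorn-based surrogate the abstract describes. The function names and the toy similarity matrix are assumptions for illustration.

```python
def marginal_gain(row, best, weights):
    """Weighted improvement in per-demand coverage if this chunk is added."""
    return sum(w * max(0.0, s - b) for s, b, w in zip(row, best, weights))


def greedy_coverage(sim, weights, budget):
    """Greedily maximize f(S) = sum_j weights[j] * max_{i in S} sim[i][j].

    sim[i][j] >= 0 is the similarity of chunk i to sub-query j, and
    weights[j] is the demand weight of sub-query j.
    """
    best = [0.0] * len(weights)  # best coverage of each demand so far
    selected = []
    for _ in range(min(budget, len(sim))):
        candidates = [i for i in range(len(sim)) if i not in selected]
        pick = max(candidates, key=lambda i: marginal_gain(sim[i], best, weights))
        if marginal_gain(sim[pick], best, weights) <= 0.0:
            break  # no remaining chunk improves coverage
        selected.append(pick)
        best = [max(b, s) for b, s in zip(best, sim[pick])]
    return selected


# Toy case: chunks 0 and 1 both serve sub-query 0, chunk 2 serves sub-query 1.
sim = [[0.9, 0.1], [0.85, 0.1], [0.1, 0.8]]
chosen = greedy_coverage(sim, weights=[0.5, 0.5], budget=2)
```

In this toy case, ranking by total similarity would take chunks 0 and 1 and leave the second sub-query uncovered. The coverage objective takes chunks 0 and 2 instead, which is the failure mode of top-k selection that the abstract points to.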

[IR-22] An Information-Geometric Justification for Composite Coherence in Event-Based Narrative Extraction

Link: https://arxiv.org/abs/2606.29118
Authors: Brian Keith-Norambuena
Subjects: Information Theory (cs.IT); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted for publication in Entropy on June 24, 2026

Click to view abstract

Abstract:Graph-based narrative extraction relies on a coherence function to score transitions between events, but the coherence metrics in current use are defined operationally and lack an information-theoretic foundation. We study the composite metric $C=\sqrt{A\cdot T}$, where $A$ is the angular similarity of document embeddings and $T=1-d_{\mathrm{JS}}$ is a topic proximity from the Jensen-Shannon distance of soft memberships, and give it an information-geometric reading together with an axiomatic characterization of the geometric-mean combinator. On the product manifold $\mathbb{S}^{d-1}\times\Delta^{K-1}$, the negative log-coherence decomposes additively into an angular and a topic cost. Because the Riemannian metric tensor induced by the Jensen-Shannon distance on the simplex is proportional to the Fisher information matrix, the topic component is locally consistent with the Fisher-Rao metric singled out by Chentsov’s theorem. Within the compensability spectrum of combinators, the geometric mean is the unique one consistent with four natural axioms (a boundary/veto condition, symmetry, log-additivity, normalization), and the construction motivates a proper product metric $d_\times$. Experiments on four corpora, three embedding families, and three topic models are consistent with the framework: the Fisher identity holds ($R\ge 0.99$), the geometric mean tracks $d_\times$ closely ($\rho=0.999$), and a downstream LLM-as-judge check finds it is not dominated by any alternative combinator or single-channel baseline. Sweeping the spectrum, the bottleneck-coherence gap between extracted and random storylines splits into a symmetric component, maximized at the geometric mean across five corpora, and a displacement term; a cross-modal image-narrative case study reproduces the effect. These results justify the composite coherence metric and articulate when the geometric mean is the natural choice.
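
The composite metric is the geometric mean of the angular similarity A and the topic proximity T, and it can be computed as in the sketch below. The abstract names both ingredients but does not spell out their formulas. The form A = 1 - θ/π and the base-2 logarithm, which keeps the Jensen-Shannon distance in [0, 1], are assumptions here, as are the function names. Embeddings are assumed to be nonzero vectors.

```python
import math


def angular_similarity(u, v):
    """A = 1 - arccos(cosine(u, v)) / pi, which lies in [0, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    return 1.0 - math.acos(cosine) / math.pi


def js_distance(p, q):
    """Jensen-Shannon distance (square root of the base-2 JS divergence)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(0.0, divergence))


def coherence(u, v, p, q):
    """Geometric mean of angular similarity and topic proximity."""
    a = angular_similarity(u, v)
    t = 1.0 - js_distance(p, q)
    return math.sqrt(a * t)
```

Two properties from the abstract can be read off the code. If either channel is zero the product is zero, which is the veto condition. Taking logs gives -log C = ½(-log A) + ½(-log T), which is the additive split into an angular cost and a topic cost.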

[IR-23] AB-RAG: Adaptive Budgeted Retrieval-Augmented Generation for Reliable Question Answering

Link: https://arxiv.org/abs/2606.29090
Authors: Ansh Kamthan
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 16 pages, 9 figures, 12 tables

Abstract:Retrieval-Augmented Generation (RAG) has become the standard way to ground large language models in external knowledge, yet most systems retrieve a fixed number of passages for every question regardless of its difficulty. This wastes computation on easy questions, starves hard ones, and gives no signal for when a generated answer can be trusted. With a growing share of question answering systems built on top of commercial language model APIs, a method that can decide how much to retrieve, and how far to trust its own answers, without retraining the underlying model, is of clear practical value. This paper presents AB-RAG (Adaptive Budgeted Retrieval-Augmented Generation), a training-free and backbone-agnostic framework that generates an answer, estimates its confidence from a combination of three signals, and then decides whether to stop or to retrieve more evidence, subject to a fixed retrieval budget. The estimator combines the model’s own certainty, the agreement between the answer and the evidence, and the variance of the retrieval scores. For models that expose token probabilities the certainty signal is read directly; for closed APIs it is approximated by self-consistency, so the method works without access to model internals. Across three backbones and two datasets, the central result is that the confidence estimate reliably separates correct from incorrect answers on every backbone, reaching a clean split of 57.6% against 0% Exact Match between high- and low-confidence answers on a factoid dataset. The adaptive policy improves accuracy on capable backbones, and the study reports its negative and nuanced findings honestly, including a confidence signal that proved unsuitable for short answers and a retrieval signal whose sign was found and corrected through measurement. The entire study was carried out on a single consumer laptop with only a few dollars of API spend.
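
The generate, estimate, then stop-or-retrieve loop described in this abstract can be sketched as follows. This is a minimal illustration and not the paper's implementation: the linear combination, the weights, the threshold, and the `retrieve` and `generate` callables are all hypothetical. The weight on the retrieval-score variance defaults to zero because the abstract notes that the sign of that signal had to be determined by measurement.

```python
from statistics import pvariance

def estimate_confidence(certainty, agreement, retrieval_scores,
                        weights=(0.5, 0.5, 0.0)):
    """Combine three signals; weights (and the variance sign) are assumptions."""
    w_c, w_a, w_v = weights
    variance = pvariance(retrieval_scores) if len(retrieval_scores) > 1 else 0.0
    return w_c * certainty + w_a * agreement + w_v * variance

def answer_with_budget(question, retrieve, generate,
                       budget=3, step=2, threshold=0.9):
    """Retrieve more evidence until confidence clears the threshold
    or the retrieval budget (number of retrieval calls) is spent.

    retrieve(question, k) -> list of (passage, score)      # hypothetical
    generate(question, passages) -> (answer, certainty, agreement)
    """
    answer, conf, calls = None, 0.0, 0
    for calls in range(1, budget + 1):
        hits = retrieve(question, calls * step)   # widen the pool each round
        passages = [text for text, _ in hits]
        scores = [score for _, score in hits]
        answer, certainty, agreement = generate(question, passages)
        conf = estimate_confidence(certainty, agreement, scores)
        if conf >= threshold:
            break                                  # confident enough: stop early
    return answer, conf, calls
```

For a closed API without token probabilities, `certainty` would be supplied by a self-consistency estimate inside `generate`, which keeps the loop itself independent of the backbone.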

[IR-24] Fairness Attacks on Recommender Systems

Link: https://arxiv.org/abs/2606.29064
Authors: Yanan Wang,Yong Ge
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:The unfairness of recommender systems has become a topic of concern due to its significant social and ethical implications. Although existing works have shown the effectiveness of attacks on the performance of recommender systems (e.g., promotion and demotion attack), the study of fairness attacks on recommender systems remains largely under-explored. To this end, we propose a novel structure-aware reinforcement learning-based fairness attack method designed to exacerbate the unfairness of target recommender systems. Specifically, we first employ a graph-based structure encoder to model the structural dependencies among the generated fake user-item interactions and the original user-item interactions. Then, we model the sequential dependency of the injected fake items using a recurrent neural network. Based on the learned structure-aware and sequence-aware representations of the fake user and item, the item selection policy attentively decides the next injected fake item. Since the target recommender system may employ fairness-aware training and leverage the user’s sensitive attribute information, such as gender, we further designed a gender selection policy to decide the gender of the entire fake user profile. Both the item selection and gender selection policy are learned jointly in our proposed method. Finally, experimental results on four types of target recommendation models and two real-world datasets demonstrate the effectiveness of the proposed attack method in exacerbating the unfairness of recommender systems.

[IR-25] The strength of clinical evidence is recoverable from language model representations but not from their stated grades

Link: https://arxiv.org/abs/2606.29034
Authors: Soroosh Tayebi Arasteh
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs) increasingly summarize clinical evidence, where a claim’s weight depends on how strongly it is supported. Yet these models convey confidence poorly, and properties they never state, such as truth, are often readable from their activations. Whether a clinical model registers evidence strength, distinct from truth, and states it when asked is untested, and any such signal could be lexical. We compiled 45,134 clinical claims from six public sources, harmonized 20,611 into a four-level evidence grade under three independent frameworks, and tested 22 local, open-weight LLMs from several developers (0.6-70 billion parameters; general, medical, and reasoning), with lexical, truth, and cross-framework controls. A linear estimator recovered the grade in every model (median AUROC 71.8), yet decodability did not rise with scale and was weakest in reasoning models. The grade the models stated fell to chance, 25-27 percentage points below the estimator. The recoverable signal was largely lexical and did not transfer across topics or frameworks, yet it was distinct from factual truth and still flagged weakly supported claims (AUROC 69.2). Clinical LLMs thus carry an ordered evidence-strength signal they do not express, so their stated grades fail to convey a claim’s support even when it is recoverable from their representations and text.

[IR-26] Human-in-the-Loop Nugget Annotation for Accountable LLM-as-a-Judge Evaluations

Link: https://arxiv.org/abs/2606.29033
Authors: Laura Dietz
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Evaluating AI/Agentic system outputs reliably requires human judgment, but how one incorporates the human determines whether one gets a real quality signal or expensive theater. The common approaches either accidentally anchor human experts (leading to rubber-stamping) or leave them unsupported in high-variance labeling tasks. We present a prototype annotation tool that implements a different division of labor: humans identify what information matters (nuggets), while LLMs handle high-volume matching of nuggets to system outputs. This plays to each party’s strengths while maintaining genuine human oversight. We describe the three-phase workflow, key design decisions, and how exported nugget banks integrate with automated judges.

[IR-27] Multimodal Graph RAG for Long-range Visually Rich Document Understanding

Link: https://arxiv.org/abs/2606.28780
Authors: Yi-Cheng Wang,Chu-Song Chen
Subjects: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal large language models (MLLMs) are widely applied to visual document understanding. However, comprehending long documents remains difficult because of the limited context window. Although recent multimodal retrieval-augmented generation (MMRAG) can address this challenge by retrieving relevant pages, it still struggles with visual question answering (VQA) that requires holistic comprehension of a document. To cope with this, a knowledge graph (KG) that summarizes the global knowledge of a document can provide an effective solution. However, most existing LLM-based KG construction methods handle only the language modality, leaving the automatic creation of multimodal KGs (MMKGs) for visually rich documents largely unexplored. In this paper, we introduce a multimodal graph-based RAG approach to tackle this problem. Existing LLM-based KG methods evaluate QA performance using indirect evidence such as comprehensiveness, diversity, and empowerment. The lack of annotated datasets for comprehensive document-level VQA poses a significant challenge to effective model evaluation. To overcome this limitation, we also introduce a new benchmark, DLVQA (document-level VQA), which provides reference summaries and corresponding supporting facts for global document-level questions. Experimental results show that our approach outperforms existing MMRAG or KG-based approaches on multi-hop QA/VQA benchmarks and DLVQA.

[IR-28] Reproducing FACTER: Fairness via Conformal Thresholding and Prompt Repair

Link: https://arxiv.org/abs/2606.28620
Authors: Oscar Miró López-Feliu,Daimy van Loo,Xanthos Kekkos,Mikel Blom,Clara Rus
Subjects: Information Retrieval (cs.IR); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 29 pages. Accepted by Transactions on Machine Learning Research (TMLR), 2026. OpenReview: this https URL . Code: this https URL

Abstract:Fayyazi et al. (2025) recently proposed FACTER, a model-agnostic framework designed to jointly enforce fairness and statistical coverage in LLM-based recommendation through conformal thresholding and iterative prompt repair. In this work, we conduct a reproducibility study of the FACTER framework across diverse architectures and dataset sparsity levels, evaluating both the original open-ended generation task and a constrained re-ranking extension. Under the strict reproduction, we observe a divergence in recommendation utility, which we trace to underspecified target-set evaluation in the original study. We then use the constrained re-ranking setting to evaluate FACTER when the candidate set is fixed, and introduce a static Fair Zero-Shot baseline to isolate the contribution of the iterative prompt repair loop. Our analysis shows that FACTER consistently reduces adaptive-threshold violation counts, but that these reductions are not consistently reflected under the fixed threshold or in global fairness metrics. In the constrained ranking setting, static fairness instructions achieve comparable semantic-parity outcomes to FACTER’s dynamic repair loop, suggesting that the additional online repair mechanism provides limited benefit in this formulation. All code and reproduction artifacts are available at this https URL.

[IR-29] R2-Searcher: Calibrating Retrieval and Reasoning Boundaries for Agentic Search

Link: https://arxiv.org/abs/2606.28566
Authors: Sheng Zhang,Junyi Li,Wenlin Zhang,Xiaowei Qian,Yichao Wang,Yingyi Zhang,Maolin Wang,Yong Liu,Xiangyu Zhao
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Recent search agents for multi-hop reasoning often fail by either retrieving incomplete evidence or reasoning over irrelevant portions of the retrieved content, leading to a retrieval-reasoning boundary shift. We propose R$^2$-Searcher, a novel framework that explicitly explores and calibrates the retrieval and reasoning boundaries via fine-grained, query-token-guided evidence modeling and post-retrieval reflection. Specifically, R$^2$-Searcher: (1) constructs fine-grained reasoning contexts by extracting precise facts from retrieved content based on query token semantics (e.g., subjects, actions, temporal markers, and degree modifiers), thereby guiding the attention of the search agent; (2) introduces a retrieval reflection mechanism that evaluates and corrects boundary deviations after each retrieval step, guiding the generation of improved queries grounded in the extracted reasoning contexts; and (3) employs an end-to-end reasoning-reflection-guided reinforcement learning algorithm, R$^2$PO, which jointly optimizes both boundaries through a tree-based exploration of reasoning regions and reflections. Our method significantly enhances the quality of both retrieval and reasoning, establishing an iterative loop where retrieval and reasoning mutually enhance each other. Extensive experiments on seven complex multi-hop QA benchmarks demonstrate that R$^2$-Searcher significantly outperforms state-of-the-art agentic search methods in answer accuracy and retrieval-reasoning quality. Ablation studies further confirm the critical role of retrieval-reasoning boundary calibration.

[IR-30] CMSL: Constructive Multi-Sequence Learning for Recommendation Systems

Link: https://arxiv.org/abs/2606.28533
Authors: Zikun Cui,Renzhi Wu,Junjie Yang,Li Sheng,Jijie Wei,Linfeng Liu,Tai Guo,Tao Jia,Xiaodong Wang,Hong Li,Li Yu,Sri Reddy,Hong Yan
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Sequence learning has emerged as the promising paradigm in recommendation systems, surpassing traditional Deep Learning Recommendation Models (DLRM) by capturing the temporal nuances of user behavior. However, current state-of-the-art architectures operate under a limiting analogy: they treat user history as a monolithic chronological sequence like a sentence in a Large Language Model (LLM). We observe a fundamental divergence between natural language and recommendation data: unlike the linear, logical flow of text, user history is inherently multi-faceted. A user’s journey is a fragmented reflection of diverse interests, resulting in much weaker coherence between items than is found in LLM training data. This lack of structural unity leads to context pollution. In single-sequence modeling, unrelated behaviors compete for the same attention budget. This “noisy” signal dilutes the model’s focus, effectively capping its ability to discern high-intent patterns from background activity. To address this, we propose Constructive Multi-Sequence Learning (CMSL), a paradigm shift from passive sequence ingestion to active “context engineering” that constructs multiple coherent sequences in latent space. CMSL leverages a learnable Sequence Construction Module to disentangle user history into “pure” thematic strands, followed by a linear attention mechanism to efficiently model these strands at scale. CMSL has been deployed across ranking and retrieval tasks and across four major surfaces at Meta.

[IR-31] SemFlowRAG: Directed Semantic Flow from Abstraction to Evidence for Complex Reasoning

Link: https://arxiv.org/abs/2606.28447
Authors: Houyuan Qin,Rong Wu,Qinyuan Qin,Botian Shi,Jingjing Qu,Yang Sun,Pinlong Cai
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Retrieval-Augmented Generation (RAG) enhanced by Knowledge Graphs has shown promise in complex multi-hop reasoning tasks. However, existing graph-based retrieval methods typically rely on flat, undirected topologies. During the retrieval process, the probability flow often gets trapped in high-degree abstract concept nodes, which we define as "probability black holes", leading to semantic drift and noise accumulation. To address this, we propose SemFlowRAG, a framework that reconstructs the flat retrieval space into a corpus-adaptive semantic gradient graph. This data-driven self-organization enables a hierarchical structure to emerge naturally from the data distribution, capturing the intrinsic semantic granularity of the corpus to suppress structural noise. By quantifying the semantic abstractness of entities through the embedding variance of their associated passages, we transform static undirected edges into directed semantic constraints. Furthermore, we design an abstractness-guided directed PageRank algorithm that forces the retrieval trajectory to follow a "high-to-low semantic abstractness" gradient. This mechanism ensures layer-by-layer evidence convergence, smoothly guiding the retrieval process from abstract concepts to specific document evidence. Extensive experiments on complex QA datasets demonstrate that SemFlowRAG effectively mitigates the "probability black holes" issue, outperforming existing baselines in both retrieval and downstream reasoning performance.

[IR-32] Schema-First Retrieval: Embedding Catalogs for Natural Language Analytics

Link: https://arxiv.org/abs/2606.28387
Authors: Adarsh Agrawal,Shashank Indukuri
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 15 pages, 4 figures

Abstract:Enterprise text-to-SQL systems often fail before SQL is generated: the model receives the wrong schema context. Modern warehouses contain thousands of tables, abbreviated columns, informal metrics, hidden join conventions, and permission boundaries that are not captured by raw table names. We introduce Schema-First Retrieval, a retrieval layer that embeds catalog metadata rather than warehouse rows. The system indexes five typed catalog objects, tables, columns, metrics, relationships, and query history, using object-specific text templates. At query time, it combines parallel vector search, lineage expansion, cross-encoder reranking, workload memory, and deterministic access-control gates before SQL generation. On CRUSH4SQL (1,534 questions), Schema-First Retrieval reaches 96.4% table recall@20 and cross-encoder reranking adds +11.1 points at column recall@10; against an equally-templated BM25 baseline, semantic retrieval is +32.8 points at table recall@5. On SEDE (857 questions), query history raises table recall@5 from 52.1% to 92.3%. On BIRD (96 questions), schema-first context reduces SQL execution errors from 15.6% to 6.2%, a 2.5x reduction. These results show that catalog selection is a first-class retrieval problem for natural language analytics, not a prompt formatting detail.

[IR-33] LEDGER: Scaling Agentic Document Editing with Dependency-aware Graph Retrieval ACL2026

Link: https://arxiv.org/abs/2606.28379
Authors: Mike Hang Wang,Utkarsh Garg,Reza Davari,Huitian Jiao,Hao Cheng,Baolin Peng,Tao Ge,Si-Qing Chen
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: ACL 2026

Abstract:We introduce LEDGER to tackle the novel context engineering challenge of agentic document editing, where localized edits to long, structured documents must be applied efficiently without breaking cross-references or semantic consistency. LEDGER constructs a lightweight dependency graph that explicitly models document structure, including hierarchical organization, explicit references, implicit dependencies, and semantic relationships. For each edit, graph-guided retrieval selects only the necessary context, avoiding full-document processing while preserving consistency. We evaluate LEDGER on a curated benchmark of 1.9k test cases with various document types and lengths, spanning six state-of-the-art models: LEDGER improves consistency from 56% to 76% across all six models and test scenarios while reducing token usage. Notably, LEDGER with low reasoning effort matches baseline performance at high reasoning effort using fewer tokens, showing that explicit dependency representations can partially substitute for expensive internal reasoning in agentic document editing.
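
To make the idea of graph-guided context selection concrete, here is a small sketch under stated assumptions. It is not LEDGER's algorithm: the graph is a plain adjacency dictionary over hypothetical section ids, all dependency types are treated alike, and context is chosen by a breadth-first walk limited to `max_hops` from the edited section.

```python
from collections import deque

def context_for_edit(graph, edited, max_hops=1):
    """Return the edited section plus every section within max_hops
    dependency edges of it, in breadth-first order.

    graph: dict mapping a section id to the set of section ids it is
           linked to (references, hierarchy, semantic ties); hypothetical.
    """
    seen = {edited}
    order = [edited]
    queue = deque([(edited, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue                      # do not expand past the hop limit
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append((neighbor, depth + 1))
    return order

# Usage: only s1's direct dependencies are sent to the model, not the full document.
doc_graph = {
    "s1": {"s2", "s3"},
    "s2": {"s1", "s4"},
    "s3": {"s1"},
    "s4": {"s2"},
    "s5": set(),
}
print(context_for_edit(doc_graph, "s1"))
```

The hop limit is the knob that trades token usage against the risk of missing a distant cross-reference.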

[IR-34] When Does Overlap Help? OSU-Mem and a Cell-Conditional Analysis of Trajectory Memory for LLM Agents

Link: https://arxiv.org/abs/2606.28376
Authors: Mellow Baixuan Chen,Xiangguo Sun
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 32 pages, 4 figures, 18 tables

Abstract:Long-horizon large language model (LLM) agents accumulate interaction trajectories that quickly exceed any practical prompt budget, and existing memory methods either truncate aggressively and lose non-local evidence or retain boilerplate that degrades decision quality. We ask a mechanism question rather than claiming a better general-purpose memory system: when does organizing trajectory memory into overlapping semantic units (OSUs) – groups of related steps in which one step may belong to several units – help retrieval over flat or disjoint alternatives? We instantiate this in OSU-Mem, which retrieves from an overlapping OSU pool via budgeted coarse-to-fine expansion, and show its benefit is conditional: overlapping memory helps when the evidence steps a query needs share tool calls or entities, but hurts when those steps are fully heterogeneous and share neither. On a synthetic benchmark where evidence carries such shared structure by construction, OSU-Mem improves over the strongest baseline as the theory predicts; yet on a concatenated, constructed unaugmented $\tau$-bench setting its aggregate advantage over flat retrieval vanishes. Splitting queries by whether their evidence shares tools and entities shows this near-tie to be an artifact of mixing query types rather than a property of either method, and ToolBench, a controlled probe built to carry shared structure by design, corroborates the same mechanism via an overlap-vs.-disjoint construction contrast (under a coverage-guided variant), isolating the construction principle rather than validating the full default system. Because the relevant sharing is cheaply estimable from metadata, the analysis yields a metadata-based heuristic for predicting when overlap is likely to improve retrieval. We deliberately isolate the retrieval layer, assessed by retrieval quality and an LLM-mediated evidence-selection stage.

[IR-35] Conversational Query Engine for Mixed-Modality Heterogeneous Enterprise Data Sources KDD2026

Link: https://arxiv.org/abs/2606.28370
Authors: Darshita Rathore,Vineet Kumar,Vaibhav Singal,Ankur Vivek Singh,Anindya Moitra
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted at Agent4IR @ KDD2026

Abstract:Enterprise business intelligence queries span structured warehouses and unstructured document repositories – modalities with fundamentally different access methods, cost profiles, and correctness semantics. Existing AI-enabled interfaces force users to select the right tool: NL2SQL systems cannot reason over slide decks, and RAG pipelines lack access to live warehouse tables. We present COGNI, a production conversational BI system that treats natural-language analytics as a heterogeneous query processing problem, organized as four architectural layers. First, an indexing layer implements slide-adaptive chunking (recursive chunking for plain-text slides, hierarchical chunking for structured content such as tables, charts, and key-value blocks), achieving 88.3% on our internal enterprise benchmark. Second, a routing layer built on a LoRA fine-tuned Qwen-2.5-1.5B-Instruct model produces a dual output (a modality decision and a complexity assessment) at 93.8% accuracy and approximately $7\times$ lower cost than a frontier model. Third, a retrieval layer executes complexity-adaptive pipelines: a self-correcting NL2SQL agent at 93.9% G-Eval, and Recursive Language Models reaching 91.0% on multi-hop synthesis queries. Finally, a caching layer validates query equivalence across multiple dimensions beyond embedding similarity, achieving zero false cache hits and $8.4\times$ latency reduction.

[IR-36] Multimodal and Multiscale Spatial-Temporal Semantic Search and Recommendation with AI Foundation Models

Link: https://arxiv.org/abs/2606.28369
Authors: Yuanyuan Tian,Wenwen Li,Xiao Chen,Michael Brook,Michael Brubaker,Anna Liljedahl,Chitta Baral
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 17 pages, accepted for publication in the ACM Transactions on Spatial Algorithms and Systems

Abstract:Semantic search and recommendation of similar documents, such as news and reports about unusual environmental events (e.g., a dead whale washed ashore in Alaska) that contain spatial and temporal information, is a critical task in Geographic Information Retrieval (GIR). This work presents a novel framework that leverages AI foundation models, including Large Language Models (LLMs) and Vision-Language Models (VLMs), to enable effective similarity search and ranking for such event documents. To support this goal, we introduce two new strategies: (1) CAMERA (Context-Aware Multimodal Event Retrieval Algorithm), which fuses textual and visual information to generate richer embeddings than those derived from text alone; and (2) ASTRA (Adaptive Spatial and Temporal Re-ranking Algorithm), which improves similarity ranking by incorporating scale-dependent spatiotemporal relevance alongside semantic similarity. Experimental results, using a dataset from the Local Environmental Observer Network, demonstrate that our VLM-enhanced methods outperform unimodal, LLM-based approaches in similarity ranking effectiveness. By automatically linking relevant event reports, the proposed framework helps both data curators and the general public gain deeper insights into environmental change and its localized impacts. These findings highlight the potential of AI foundation models to advance GIR through multifaceted, intelligent analysis that integrates key geographic concepts: space, time, scale, and semantics.

[IR-37] EvoRec: Self Evolving Agentic Recommender Systems

Link: https://arxiv.org/abs/2606.28368
Authors: Lingyu Mu,Hao Deng,Haibo Xing,Jinxin Hu,Yu Zhang,Xiaoyi Zeng
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Optimizing modern recommender systems still relies heavily on engineers iterating by hand, which is slow and bounded by individual expertise. LLM-based agents open a path toward automating this loop, yet two issues remain. First, the agent is used only as a code translator and accumulates no methodology across iterations. Second, the optimization space is confined to a predefined range and rarely introduces structurally new ideas. To address these problems, we propose EvoRec, a multi-agent framework that co-evolves the recommendation model and the optimization methodology driving it. Four collaborating agents carry out a dual-track loop: the Research Agent and Code Agent iterate the model each round, while the Skill Evolver periodically distills reusable methodology from a persistent Memory of past experiments. Experiments on two public benchmarks and one large-scale industrial dataset show that EvoRec improves offline metrics by up to 5.54% over the strongest baseline, and an online A/B test delivers a 1.85% revenue lift and a 1.02% CTR gain.

[IR-38] Beyond the Reranker: Do RAG Retrieval Enhancements Help Once a Strong Reranker Is Present?

Link: https://arxiv.org/abs/2606.28367
Authors: Sadanand Singh,Allam Reddy,Manan Chopra
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Retrieval-augmented generation (RAG) is routinely extended with methods meant to improve retrieval: query expansion, hierarchical and cross-document summarization, graph-based expansion, per-query routing, rank fusion, and corrective re-retrieval. The benefits reported for these methods come almost exclusively from homogeneous corpora, predominantly Wikipedia prose. Whether they hold on the mixed-format collections common in practice, where code, markdown, tables, scientific PDFs, and prose are interleaved within one corpus, has not been measured. To study this directly, we build HetDocQA, a heterogeneous benchmark with chunker-agnostic span-overlap relevance labels and collection-disjoint splits, and pair it with MuSiQue and QASPER as homogeneous controls. We evaluate eight methods on a shared backbone, with bootstrap confidence intervals and multiple-comparison correction. A strong cross-encoder reranker accounts for most of the pipeline’s quality; beyond it, only two methods yield reliable gains: query expansion and SSCC. SSCC, a per-source calibrated corrector introduced here, sets a separate acceptance threshold for each score source and helps only on heterogeneous data. The remaining reranking and pool-expansion methods in common use, among them hierarchical summarization, graph expansion, routing, and rank fusion, give no reliable gain once that reranker is present.
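
The phrase "a separate acceptance threshold for each score source" can be illustrated with a short sketch. The calibration rule here is an assumption, since the abstract does not say how SSCC sets its thresholds: for each source, the sketch picks the lowest score cutoff whose accepted set still meets a target precision on labeled calibration pairs. Function names and the `(doc, source, score)` layout are hypothetical.

```python
def calibrate_threshold(labeled, target_precision=0.75):
    """labeled: list of (score, is_relevant) pairs from ONE score source.
    Returns the lowest cutoff whose accepted set meets the target precision,
    or infinity if no cutoff does (the source is then never trusted)."""
    ranked = sorted(labeled, key=lambda pair: pair[0], reverse=True)
    threshold, hits = float("inf"), 0
    for i, (score, relevant) in enumerate(ranked, start=1):
        hits += bool(relevant)
        # Only evaluate a cutoff once every item tied at this score is counted.
        tie_continues = i < len(ranked) and ranked[i][0] == score
        if not tie_continues and hits / i >= target_precision:
            threshold = score
    return threshold

def accept(candidates, thresholds):
    """Keep (doc, source, score) candidates that clear their own source's cutoff."""
    return [doc for doc, source, score in candidates
            if score >= thresholds.get(source, float("inf"))]

# Usage: dense cosine scores and BM25 scores live on different scales,
# so each source gets its own cutoff.
thresholds = {
    "dense": calibrate_threshold([(0.9, True), (0.8, True), (0.7, False),
                                  (0.6, True), (0.5, False)]),
    "bm25": calibrate_threshold([(12.0, True), (9.0, False), (7.0, False)]),
}
print(accept([("d1", "dense", 0.65), ("d2", "bm25", 9.5), ("d3", "bm25", 12.5)],
             thresholds))
```

A single global threshold would either admit weak BM25 hits or reject good dense hits, which is one plausible reason a per-source rule matters mainly on heterogeneous corpora.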

[IR-39] CAMI: Cost-Aware Agent-Guided Multi-Indexing for Semantic Retrieval

Link: https://arxiv.org/abs/2606.28365
Authors: Adnan Qidwai,Anand Eswaran,Sonam Mishra,Jaydeep Sen,Sachindra Joshi
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted to ACM CAIS 2026

Abstract:RAG ingestion pipelines frequently augment search corpus index with semantic enrichment indices (e.g., synthetic queries or summaries generated from corpus chunks) that are subsequently queried alongside the base index to improve retrieval via better alignment between document representations and user intent. While these supplementary representations substantially improve retrieval quality, they introduce a computational bottleneck: the configuration space of enrichment types and generator models is combinatorial, and the cost of exhaustive index-time evaluation scales linearly with corpus size. We introduce CAMI (Cost-Aware Multi-Indexing), a framework that formalizes multi-index construction as a budgeted, multi-objective portfolio selection problem. CAMI targets the upstream decision of which enrichment views to generate and materialize before the retrieval backend is applied. CAMI incorporates three primary mechanisms: (i) an agentic discovery phase that proposes corpus-specific representation templates; (ii) an atomic-unit search procedure that evaluates individual enrichment-model pairs and recombines them via fidelity-local closure to identify synergistic portfolios; and (iii) a confidence-aware promotion schedule that prunes unpromising configurations early, decoupling optimization spend from total corpus size. We evaluate CAMI across diverse retrieval corpora. Our findings reveal that the framework systematically isolates high-recall portfolios under strict budget constraints, outperforming standard content-only baselines in challenging settings by up to 9.4% recall@10. Further, CAMI is able to systematically identify these high-recall portfolios using up to 5x less budget compared to random search baselines, making our approach practical in real production scenarios.

[IR-40] LLM based Knowledge Graph Approach to Automating Medical Device Regulatory Compliance

Link: https://arxiv.org/abs/2606.28364
Authors: Subhankar Chattoraj,Karuna Pande Joshi
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Advanced medical devices increasingly rely on AI-driven frameworks to automate compliance processes, ensuring safety and efficacy while reducing regulatory burdens. In the United States, software-based medical devices, including those utilizing AI/ML models, are regulated by the FDA’s Center for Devices and Radiological Health (CDRH) under the Code of Federal Regulations (CFR) Title 21. These regulations are extensive, cross-referenced documents that require significant human effort to parse, leading to high compliance costs for manufacturers. We propose a novel, semantically rich framework that extracts regulatory knowledge from FDA documents and translates it into a machine-processable format. Our system encodes regulatory knowledge into an OWL/RDF-based knowledge graph and uses the Mistral 7B Instruct model to dynamically generate SPARQL queries, perform compliance reasoning, and produce structured reports. This enables automated device classification (Class I, II, or III) and real-time regulatory evaluation. Validated through real-world use cases, our framework significantly reduces manual review effort, enhances interpretability, and accelerates time-to-market. The proposed approach integrates AI reasoning and semantic technologies to achieve scalable, transparent, and automated regulatory compliance.

[IR-41] meta-pipe: An LLM-agent pipeline for end-to-end automated systematic review and meta-analysis

Link: https://arxiv.org/abs/2606.28363
Authors: Hsieh-Ting Lin,Jiunn-Tyng Yeh
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Comments: 13 pages, 1 figure, 2 tables

Abstract:Objective: To describe the architecture and design rationale of meta-pipe, an open-source large language model (LLM)-agent pipeline that integrates the complete systematic review and meta-analysis (SR/MA) workflow – from literature search through statistical analysis, manuscript generation, and quality assurance – with mandatory human oversight at critical decision points. Study Design and Setting: We developed a 10-stage modular pipeline integrating Claude (Anthropic; Opus 4 for reasoning, Haiku 3.5 for classification) for LLM-assisted screening and extraction, Python (~3,600 lines of code) for automation, R (meta, metafor, gemtc, netmeta) for statistical analysis, and Quarto for manuscript rendering. Five mandatory human decision points enforce oversight. We systematically compared meta-pipe’s capabilities with five existing SR automation tools based on published documentation as of March 2026. Results: meta-pipe offers four capabilities not available in any single existing tool: automated manuscript generation from analysis outputs, semi-automated GRADE assessment, overclaim detection (12 predefined patterns), and dual-paradigm network meta-analysis (Bayesian and frequentist). Estimated API cost is 15-30 per typical 5-10 study review. No validation data are reported; this is a system description, not a validation study. Conclusion: End-to-end AI-assisted evidence synthesis is architecturally feasible as an open-source tool with mandatory human oversight. Formal validation reproducing published Cochrane reviews is underway and essential before routine use.

[IR-42] LUMEN: Cost-Transparent Multi-Agent Pipeline for Automated Systematic Review and Meta-Analysis

Link: https://arxiv.org/abs/2606.28362
Authors: Yen-Hsun Huang (1), Yu-Shiou Lin (2) ((1) Department of Education, Taipei Veterans General Hospital, Taipei, Taiwan, (2) Department of Psychiatry, Taipei Veterans General Hospital, Taipei, Taiwan)
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Digital Libraries (cs.DL)
Comments: 15 pages, 5 figures. Open-source implementation and cost logs available

Abstract:Systematic reviews and meta-analyses (SR/MA) remain the gold standard for evidence synthesis, yet completing one typically requires 67 weeks and substantial expert effort. Recent large language model (LLM) systems have demonstrated strong performance on individual SR phases - screening (otto-SR: 96.7% sensitivity), extraction (Gartlehner et al.: 91.0% accuracy), and search (TrialMind: 0.83 recall) - but no study has reported what it actually costs to run an end-to-end pipeline, how cost distributes across phases, or how architectural choices affect the cost-quality trade-off. We present LUMEN, an open-source multi-agent pipeline that automates six SR/MA phases using 11 specialized LLM agents with deliberate model routing. We evaluate LUMEN on seven datasets: five self-conducted domain reviews (psychiatry, psychology, surgery, vaccinology, cardiology) and two SYNERGY screening benchmarks. Across 13 ground-truth-comparable outcomes, LUMEN achieves 100% directional agreement with published meta-analyses, with effect sizes within 1% for homogeneous study designs. The primary contribution is the first empirical cost and operational characterization of such a pipeline: a complete review costs 19 to 29 USD (median 22.65 USD), with title-abstract screening and data extraction together dominating expenditure. A three-arm extraction ablation reveals a phase-dependent architecture reversal: multi-agent design hurts screening but is essential for extraction, producing 5.7x more poolable analyses than single-model alternatives while eliminating clinically dangerous direction errors. A two-dataset screening benchmark demonstrates that model ranking is domain-dependent and not transferable across review topics. All code and cost logs are publicly available.

[IR-43] ConCise: Training-Free Conclusion-Chain State Compression for Cost-Efficient Multi-Step RAG Services

Link: https://arxiv.org/abs/2606.28361
Authors: Kuan Yan, Zhiqing Tang, Tian Wang, Weijia Jia
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: To be published in IEEE ICWS 2026

Abstract:Multi-step retrieval-augmented generation (RAG) has been widely deployed as LLM-powered web services for complex question answering, where iterative retrieval-reasoning rounds deliver strong multi-hop accuracy. However, this paradigm causes historical documents and reasoning traces to accumulate across rounds, inflating cumulative input tokens approximately as O(N^2) with progressively increasing noise density. In API-based service architectures, such growth directly amplifies per-request billing cost, network payload, and response latency. Existing compression approaches rely on pretrained modules or GPU-level KV cache access, introducing model hosting overhead incompatible with API-native, serverless, and edge-side deployments. To address this issue, this paper proposes ConCise, a training-free state-layer protocol that restructures cross-round context transmission for multi-step RAG services. Specifically, ConCise replaces raw-text accumulation with an append-only chain of structured conclusions, compressing cumulative context growth from O(N^2) to approximately O(N). Furthermore, a fused generation mechanism is introduced to jointly emit reasoning and conclusions in a single API call, eliminating repeated input billing from serial dual-invocation overhead. Extensive experiments across twelve paired configurations spanning three models, two datasets, and two representative frameworks demonstrate that ConCise achieves 64.63% average token savings while maintaining acceptable accuracy, providing a plug-and-play, deployment-friendly solution for cost-efficient multi-step RAG service optimization.
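
The growth argument in this abstract can be illustrated with a toy token count. This is only a sketch under stated assumptions: the function names and the per-round sizes (`doc_tokens`, `conclusion_tokens`) are hypothetical and are not taken from ConCise itself. Note that the chain variant still has a quadratic term, but it is scaled by the (small) conclusion size, which is why growth is only approximately linear.

```python
def cumulative_input_raw(rounds: int, doc_tokens: int) -> int:
    """Total input tokens when every round resends all earlier raw text."""
    total = 0
    for n in range(1, rounds + 1):
        # Round n carries the raw documents and traces of all n rounds so far.
        total += n * doc_tokens
    return total


def cumulative_input_chain(rounds: int, doc_tokens: int, conclusion_tokens: int) -> int:
    """Total input tokens when earlier rounds are replaced by short conclusions."""
    total = 0
    for n in range(1, rounds + 1):
        # Round n carries its own fresh documents plus (n - 1) prior conclusions.
        total += doc_tokens + (n - 1) * conclusion_tokens
    return total


if __name__ == "__main__":
    # Hypothetical sizes: 1,000 tokens of new text per round, 20-token conclusions.
    print(cumulative_input_raw(10, 1000))
    print(cumulative_input_chain(10, 1000, 20))
```

With these made-up sizes, the raw variant sums to 1000 × (1 + 2 + ... + 10) tokens, while the chain variant pays for each round's documents once plus a small conclusion overhead.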

[IR-44] Carolina Guide: A Multi-Agent RAG System with Institutional Guardrails for Academic Policy Assistance

Link: https://arxiv.org/abs/2606.28360
Authors: Ben Torsion, Jun Zhou
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:University students often struggle to navigate complex academic policies, leading to advising bottlenecks and delayed access to critical information. Although large language models (LLMs) offer promise for automated assistance, their tendency toward hallucination and inability to enforce institutional constraints make them unsuitable for high-stakes policy guidance without careful architectural design. We present Carolina Guide, a retrieval-augmented generation (RAG) system for academic policy assistance at the University of South Carolina (USC). The system employs a modular multi-agent pipeline with institutional guardrails to provide citation-supported, policy-grounded answers to student queries while refusing unsafe requests such as course recommendations or personalized advising. We evaluate the system on a 90-query test set across 6 departments, achieving 98.9% retrieval success at the rel ≥ 2 threshold (genuinely relevant results), with the first relevant chunk at rank 1 for 98.9% of queries (MRR@10 for rel ≥ 2 = 0.989). Through systematic baseline comparisons and ablation studies, we show that each architectural component (MMR reranking, adequate retrieval context with k=20, and citation enforcement) contributes measurable practical value despite limited statistical power at 90 queries. The evaluation of the guardrail on 30 adversarial queries demonstrates a Safety F1 of 0.89, correctly refusing 86% of unsafe queries while maintaining 93% coverage of benign queries. These results show that production-ready LLM systems for institutional policy guidance require rethinking standard RAG patterns to prioritize safety, transparency, and departmental autonomy over conversational sophistication.
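
For readers unfamiliar with the MRR figure quoted above, here is a minimal sketch of mean reciprocal rank at a cutoff. The function name and the convention of treating a graded relevance of 2 or higher as relevant are assumptions made for illustration; the paper's own evaluation code is not shown in the abstract.

```python
def mrr_at_k(ranked_grades, k=10, min_rel=2):
    """Mean reciprocal rank of the first result graded >= min_rel within the top k.

    ranked_grades: one list per query, holding the relevance grade of each
    retrieved chunk in rank order. Queries with no hit contribute 0.
    """
    total = 0.0
    for grades in ranked_grades:
        for rank, grade in enumerate(grades[:k], start=1):
            if grade >= min_rel:
                total += 1.0 / rank
                break
    return total / len(ranked_grades)


if __name__ == "__main__":
    # Hypothetical run: 89 queries with a relevant chunk at rank 1, one with none.
    hits = [[2] + [0] * 9 for _ in range(89)]
    miss = [[0] * 10]
    print(round(mrr_at_k(hits + miss), 3))
```

A hypothetical run in which 89 of 90 queries hit at rank 1 and one query finds nothing gives 89/90, which rounds to the same 0.989 value reported in the abstract.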

[IR-45] The Voronoi Bottleneck: Capacity-Aware Dense Retrieval for Product Search

Link: https://arxiv.org/abs/2606.28359
Authors: Charith Chandra Sai Balne, Rithwik Maramraju, Siddharth Pratap Singh, Rohit Upadhyay, Aditya Singh, Chittaranjan Tripathy, Yogananda Domlur Seetharama
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 5 pages

Abstract:Dense embedding retrieval compresses all relevance information into a single inner product, imposing a fundamental geometric limit – the Voronoi Bottleneck – on the number of query-document relevance patterns expressible at fixed embedding dimension \(d\). We make three contributions. (1) Unified capacity theory. We prove that Voronoi complexity and sign-rank are equivalent for top-1 retrieval, yielding tight dimension bounds and a computable diagnostic, the Capacity Utilization Score (CUS), that predicts per-query retrieval failure with AUC \(> 0.8\) without relevance labels. (2) Diagnosis. CUS identifies two capacity regimes – moderate (\(\delta \gtrsim 1\)), where density-aware training yields measurable gains, and vacuous (\(\delta \ll 1\)), where it does not – giving practitioners an a priori check before investing in retraining. (3) DART training. We introduce AT-DW-InfoNCE, an Adaptive-Temperature Density-Weighted contrastive objective with formally derived optimal weighting \(\alpha^* = 2.0\). On a 100K-query synthetic product-search corpus with controlled relevance structure, DART improves Recall@100 by 1.9 points over a same-data InfoNCE baseline (\(84.9 \pm 0.0\) vs. \(83.0 \pm 0.3\); 8 seeds, \(p < 0.001\)), outperforming focal loss and temperature-schedule alternatives. DART requires zero inference-time overhead – it is a drop-in training objective that improves any dual-encoder system.
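
As background for the comparison above, the following is a minimal NumPy sketch of the plain in-batch InfoNCE loss that serves as the baseline. It is not the paper's AT-DW-InfoNCE objective: the adaptive temperature and density weighting are not specified in the abstract and are deliberately left out. The function name and the default temperature are assumptions.

```python
import numpy as np


def info_nce(queries: np.ndarray, docs: np.ndarray, temperature: float = 0.05) -> float:
    """Plain in-batch InfoNCE: row i of `docs` is the positive for row i of `queries`."""
    # Cosine similarity via L2-normalized embeddings.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature
    # Subtract the row max before exponentiating for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The diagonal holds the log-probability of each query's own positive.
    return float(-np.mean(np.diag(log_probs)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((8, 16))
    aligned = q + 0.01 * rng.standard_normal((8, 16))
    print(info_nce(q, aligned), info_nce(q, np.roll(aligned, 1, axis=0)))
```

The usage lines compare a batch whose positives are correctly paired against the same batch with positives shifted by one row; the mispaired batch should score a much higher loss.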

[IR-46] How Do LLMs Cite? A Mechanistic Interpretation of Attribution in Retrieval-Augmented Generation ECIR2026

Link: https://arxiv.org/abs/2606.28358
Authors: Ian van Dort (University of Amsterdam), Maria Heuss (University of Amsterdam)
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution is published in Advances in Information Retrieval, ECIR 2026, Lecture Notes in Computer Science, vol. 16485, pp. 458-473, and is available online at this https URL

Abstract:Retrieval-Augmented Generation (RAG) aims to enhance the trustworthiness of Large Language Models (LLMs) by grounding their outputs in external documents, often using inline citations for verifiability. However, the faithfulness of these citations – whether the model genuinely uses a source to generate an answer – remains a critical, unverified assumption. This paper offers the first mechanistic account of how a large language model decides whether to attach an inline citation while answering a factoid question. Using the Llama-3.1-8B-Instruct model in a controlled experimental environment based on the PopQA dataset, we employ an activation patching approach. We map the underlying mechanism responsible for citation, discovering that it is not a single, localized component but a distributed, multi-stage “attributional ensemble” of attention heads and MLP layers. We show that amplifying or attenuating only those critical heads and MLPs repairs over 90% of missed citations and eliminates 69% of spurious ones on PopQA without harming answer accuracy. Although gains on the multi-document HotpotQA benchmark are modest, the same component set still moves citation rates in the intended direction, indicating that the underlying mechanism is not dataset-specific. The results reveal a potential disconnect between the model’s apparent reasoning and its internal computational pathway, suggesting that inline citations can create a false sense of security.

[IR-47] ReasonRec: A Reasoning-Augmented Multimodal Agent for Unified Recommendation ACL2026

Link: https://arxiv.org/abs/2606.28357
Authors: Yihua Zhang, Mingfu Liang, Jiyan Yang, Rong Jin, Wen-Yen Chen, Yiping Han, Huayu Li, Buyun Zhang, Liang Luo, Frank Shyu, Luke Simon, Sijia Liu, Tianlong Chen, Xi Liu
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: The 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

Abstract:Recent advances in multimodal recommenders excel at feature fusion but remain opaque and inefficient decision-makers, lacking explicit reasoning and self-awareness of uncertainty. We introduce ReasonRec, a reasoning-augmented multimodal agent structured around a three-stage explicit reasoning pipeline. Specifically, we propose a reasoning-aware visual instruction tuning strategy that systematically transforms diverse recommendation tasks into unified CoT prompts, enabling the VLM to explicitly articulate intermediate decision steps. Additionally, our evidence-horizon curriculum progressively enhances the reasoning complexity to better handle cold-start and long-tail user scenarios, significantly boosting model generalization. Furthermore, the uncertainty-guided delegation mechanism empowers the agent to assess its own confidence, strategically allocating computational resources to optimize both recommendation accuracy and inference efficiency. Comprehensive experiments on four standard recommendation tasks across five real-world datasets demonstrate that ReasonRec achieves over 30% relative improvement in key ranking metrics compared to state-of-the-art multimodal recommenders. Crucially, ReasonRec substantially reduces inference latency by dynamically delegating up to 35% of queries to efficient sub-models without compromising accuracy. Extensive ablation studies further confirm that each proposed reasoning and planning mechanism individually contributes substantially to ReasonRec’s overall effectiveness. Collectively, our results illustrate a clear pathway towards interpretable, adaptive, and efficient multimodal recommendation through explicit reasoning and agentic design.

[IR-48] SafeGEO: Understanding Generative Engine Optimization Risks in Recommendation Agents

Link: https://arxiv.org/abs/2606.28356
Authors: Qianfeng Wen, Yifan Simon Liu, Xin Liu, Difan Jiao, Blair Yang, Junda Wu, Zhenwei Tang
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 41 pages, 23 figures

Abstract:Generative Engine Optimization (GEO) lets content owners rewrite web content to increase their visibility in generative systems. In recommendation agents, this creates a risk that seller-controlled sources make flawed products appear better supported than they are. We study this risk by asking whether recommendation agents preserve utility-aligned decisions when seller-controlled sources are rewritten for GEO. To make this question measurable, we construct SafeGEO, an evaluation suite with 22 GEO attack variants across 600 recommendation cases. We empirically show that GEO attacks can promote flawed target products. On average, they increase the rate at which such flawed products enter the recommendation set by up to 83.2%. We further study whether agent-side design choices can mitigate this risk and show that simple defenses, including defensive prompting and structured evidence checks, reduce harmful target promotion by up to 39.2%. These gains are substantial but do not restore the no-GEO performance, showing that GEO remains a serious risk despite developer-side mitigation.

[IR-49] DBpedia-Enriched Company Representation for B2B Lead Recommendation ESWC2026

Link: https://arxiv.org/abs/2606.28355
Authors: Yuyan Qian, Claude Montacie, Milan Stankovic, Victoria Eyharabide
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 4 pages. Preprint of a paper accepted at the ESWC 2026 Industry Track

Abstract:Selecting which companies to approach is a central challenge in business-to-business (B2B) sales, where decisions are often based on manual research and fragmented information sources. Modern B2B sales platforms centralize company records and use learned company embeddings to support tasks such as recommending and prioritizing potential clients. In this study, we investigate whether enriching these company embeddings with semantic knowledge from DBpedia improves downstream interaction-prediction performance, within a pipeline that integrates structured company attributes and text embeddings deployed on a real B2B platform. We evaluate the learned embeddings on a downstream interaction prediction task using real user feedback data from the platform. Results show that DBpedia enrichment improves downstream performance, with gains observed on ranking and discrimination metrics.

[IR-50] From Regulatory Approvals to Patents: Cross-Domain Linking for Cardiovascular Device Traceability ACL2026

Link: https://arxiv.org/abs/2606.28353
Authors: Yang Qingqing, Liu Haijiang, Li Moyan
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted by ACL 2026 Main Conference

Abstract:Linking FDA-approved medical devices to their underlying United States Patent and Trademark Office (USPTO) patents enables critical applications such as recall root-cause analysis, M&A-driven IP discovery, and technology trajectory mapping. However, this cross-domain entity linking task remains unexplored due to severe semantic gaps: FDA documents focus on clinical outcomes, while patents describe technical mechanisms, yielding minimal lexical overlap. We formalize medical device-patent linking as a challenging cross-domain entity linking problem characterized by label scarcity and domain shifts. Using cardiovascular devices as a high-impact, representative domain featuring diverse technologies, high recall rates, and abundant disclosures, we construct a benchmark with 434 devices, 698K patents, and 585 high-fidelity expert-verified pairs. To address these challenges, we propose Bridge-MedDevKG, a coarse-to-fine framework that integrates (1) MedDevOnto, a domain-specific ontology that anchors device concepts via three-tier UMLS normalization; (2) multi-signal candidate generation fusing company affiliation, semantic similarity, and ontology-weighted entity overlap; and (3) heterogeneous reranking with multi-signal scoring and XGBoost classification on hard negatives. Our approach achieves a conservative lower-bound recall of 91.6% on the gold standard with 50.9% noise reduction, substantially outperforming LLM baselines under comparable evaluation. The resulting MedDevKG provides 6.8M high-confidence links, laying a scalable foundation for regulatory-IP integration across medical specialties.

[IR-51] Sifei at SemEval-2026 Task 8: Hybrid Retrieval and Query Rewriting for Multi-Turn RAG SEMEVAL-2026 ACL2026

Link: https://arxiv.org/abs/2606.28352
Authors: Sifei Meng, Dmitry Ilvovsky
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Accepted to SemEval-2026 Task 8 (MTRAGEval), co-located with ACL 2026. Camera-ready version. Code: this https URL

Abstract:Multi-turn retrieval-augmented generation (RAG) is challenging due to evolving user intent, conversational noise, and strict context limits. We propose a training-free hybrid retrieval pipeline for SemEval-2026 Task 8 that combines dense and sparse retrieval with controlled query rewriting and cross-encoder reranking. On the official test set of Task A, our system achieves 0.5453 nDCG@5, ranking third among 38 teams and outperforming the strongest baseline score of 0.4795. For Task C, we reuse the documents retrieved for Task A and apply a lightweight generation pipeline guided by the official prompt, achieving 0.5312 as the harmonic mean of relevance and faithfulness and ranking 15th among 29 teams. All retrieval components are open-source, while query rewriting and answer generation rely on LLM APIs.
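
The abstract does not say how the dense and sparse result lists are combined. One common, training-free option is reciprocal rank fusion (RRF), sketched below purely as an illustration; the choice of RRF, the constant `k=60`, and the document ids are all assumptions, not details of the submitted system.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids; each list adds 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    dense = ["d1", "d2", "d3"]   # hypothetical dense-retriever ranking
    sparse = ["d3", "d1", "d4"]  # hypothetical sparse (e.g. BM25) ranking
    print(reciprocal_rank_fusion([dense, sparse]))
```

Documents that appear near the top of both lists rise to the top of the fused list, which can then be passed to a cross-encoder reranker.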

[IR-52] HyperSU: Corpus-Driven Semantic-Unit Hypergraph for Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2606.28351
Authors: Jiate Liu, Liuyi Chen, Zhengyi Yang, Chuan He, Mingchen Ju, Bocheng Han, Ruyi Liu, Xu Zhou
Subjects: Information Retrieval (cs.IR)
Comments: 24 pages, 6 figures, 23 tables

Abstract:Recent hypergraph-based retrieval-augmented generation (HyperRAG) methods use hyperedges to connect multiple entities simultaneously, enabling more efficient multi-entity evidence organization than pairwise graph structures. However, existing HyperRAG methods often rely on LLM-generated summaries to construct hyperedges, which can introduce hallucinations while also incurring high indexing costs. In addition, during retrieval, existing methods typically rely on either one-hop neighbor expansion or PageRank diffusion. The former may miss useful multi-hop evidence, while the latter can suffer from uncontrolled propagation over excessive hub nodes, leading to semantic drift and noisy reasoning chains. To address these challenges, we propose HyperSU, a novel hypergraph-based RAG framework featuring semantic-unit hyperedges and clue-guided bidirectional retrieval. During construction, HyperSU formulates hyperedge construction as an entity-aware minimum-description-length (MDL) optimization problem, inducing source-grounded semantic-unit hyperedges that balance sentence-level semantic coherence and entity compactness. It then constructs a hypergraph by modeling each semantic unit as a hyperedge over its co-mentioned entities. During retrieval, HyperSU performs clue-guided bidirectional expansion over the semantic-unit hypergraph, enabling both multi-hop evidence discovery and answer-aware noise reduction. Experiments show that HyperSU consistently improves answer accuracy over standard, graph-based, and hypergraph-based RAG baselines, achieving up to a 14.7% relative accuracy improvement on GraphRAG-Bench, with larger gains on reasoning-intensive tasks.

[IR-53] UniCA: Bi-directional Cross-Attention with Positive Similarity Loss for Robust Multi-Modal Retrieval

Link: https://arxiv.org/abs/2606.28350
Authors: Yini Huang, Wenlong Zhang
Subjects: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Multi-modal retrieval has become increasingly critical for handling the growing volume of integrated visual-textual data in real-world applications, but existing frameworks rely on implicit fusion via text encoder self-attention, limiting explicit cross-modal semantic alignment. To address this gap, this paper proposes UniCA (Unified Cross-Attention Encoder), a multi-modal retrieval model with four key innovations: 1) a bi-directional cross-attention (Bi-CA) block that enables active semantic exchange between visual and textual tokens prior to concatenation, capturing inter-modal correlations more efficiently; 2) a Positive Similarity Loss that optimizes absolute semantic proximity between query and positive candidate embeddings; 3) a streamlined dataset, UMR-S10 (Universal Multimodal Retrieval Sample 10%), that reduces computational costs while retaining semantic diversity and task representativeness; and 4) an experimental validation on the WebQA benchmark demonstrating that UniCA outperforms the baseline model across Hybrid and Image-Text tasks, with improvements of up to 4.09% in Recall@5, 3.28% in Recall@10, and 3.96% in MRR@1 for the hybrid task. UniCA provides an efficient and robust solution for multi-modal retrieval, lowering deployment barriers through its lightweight dataset and enhanced fusion mechanism.
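
The idea of exchanging information between modalities before concatenation can be sketched in a few lines of NumPy. This is a simplified illustration, not UniCA's Bi-CA block: it uses a single head, no learned projections, and a plain residual connection, all of which are assumptions, as are the function names.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries: np.ndarray, context: np.ndarray) -> np.ndarray:
    """Single-head attention: each query token becomes a weighted mix of context tokens."""
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ context


def bi_cross_attention(text_tokens: np.ndarray, image_tokens: np.ndarray) -> np.ndarray:
    """Let each modality attend to the other, then concatenate the token sequences."""
    text_out = text_tokens + cross_attend(text_tokens, image_tokens)    # text reads image
    image_out = image_tokens + cross_attend(image_tokens, text_tokens)  # image reads text
    return np.concatenate([text_out, image_out], axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fused = bi_cross_attention(rng.standard_normal((5, 8)), rng.standard_normal((3, 8)))
    print(fused.shape)
```

In a real encoder each direction would have its own learned query, key, and value projections; the sketch only shows the data flow in which both directions run before the sequences are joined.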

[IR-54] HMARS: A Hierarchical Multi-Agent Memory System for Long-Context Reasoning

Link: https://arxiv.org/abs/2606.28349
Authors: Zeju Li, Ziyang Zheng, Yizhou Zhou, Qiang Xu
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Long-context reasoning requires models to access, retrieve, and integrate evidence scattered across documents, dialogues, and accumulated interaction histories. Standard retrieval-augmented generation reduces this problem to top-K chunk retrieval, but such passive access can discard relevant evidence before reasoning begins, especially when relevance depends on broader context. We propose HMARS, a hierarchical multi-agent memory system that treats long contexts as managed memory rather than a flat retrieval corpus. Sub-agents maintain grounded access to bounded memory regions, mid-agents manage regional context and provide query-specific coordination, and a frontier model performs final reasoning over retrieved evidence pages. To evaluate this view, we construct two diagnostic benchmarks targeting evidence breadth and context-dependent relevance. Across long-document and multi-turn memory tasks, HMARS achieves the best overall performance against retrieval, reranking, full-context, graph-based, and agentic long-context baselines. Evidence coverage analysis further shows that its gains come from retrieving the required supporting evidence more completely, rather than merely changing the final answer prompt.

[IR-55] PIXELRAG: Web Screenshots Beat Text for Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2606.28344
Authors: Yichuan Wang, Zhifei Li, Zirui Wang, Paul Teiletche, Lesheng Jin, Matei Zaharia, Joseph E. Gonzalez, Sewon Min
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Our code is available at this https URL

Abstract:Augmenting large language models (LLMs) with retrieved web text has become a dominant paradigm, yet the web is not natively textual: existing systems depend on complex parsing pipelines that linearize HTML and discard layout, visual structure, and formatting. We introduce PixelRAG, a new retrieval-augmented method that represents websites in their native visual form and performs retrieval and reading entirely in pixel space, enabling an end-to-end architecture that eliminates text abstraction. PixelRAG is, to our knowledge, the first pipeline to operate over a full Wikipedia corpus in this form, scaling to a datastore of 30 million screenshot images with an efficient visual retrieval index. Built on an existing visual embedding model (i.e., Qwen3-VL-Embedding), PixelRAG further fine-tunes this model on screenshot data with carefully curated contrastive training data. Retrieved screenshots are then fed directly as pixel inputs to a VLM, without intermediate text conversion. PixelRAG consistently outperforms both no-retrieval and text-based RAG baselines, most surprisingly on widely studied text-centric tasks such as NQ and SimpleQA. It also achieves strong gains on multimodal open-domain QA (e.g., MMSearch), benchmarks over noisy news corpora (e.g., LiveVQA), and agentic benchmarks (e.g., MoNaCo), improving accuracy by up to 18.1% over text-based baselines. Finally, pixel representations enable a new efficiency lever for RAG through image compression, achieving up to 3x token cost reduction at lower resolutions while maintaining accuracy. Our results challenge the necessity of text representations in web retrieval, suggesting that web RAG can operate directly in the web’s native visual form while improving both performance and efficiency.

[IR-56] The Crowded Embedding Space: A Mean-Field Mechanism for Emergent Marginalization in Retrieval-Augmented Agents

Link: https://arxiv.org/abs/2606.28343
Authors: Shwan Ashrafi, Dan Roth
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Retrieval-augmented generative agents rely on retrieval for grounding, yet are typically evaluated on a query-by-query basis. This isolates interactions that are geometrically coupled in a shared embedding space. For example, we show that the high document density required to serve majority interests (e.g., generic “Crime” movies) can geometrically overcrowd the retrieval neighborhood of a semantically similar minority (e.g., “Film Noir”), effectively expelling minority content from top-k results. We introduce a formal framework to analyze how such goal collisions in dense retrieval induce fundamental performance limits and emergent fairness issues inherent to spatial crowding. In our static analysis, we demonstrate that for a fixed embedding space, a phase transition occurs where minority user goals suffer a catastrophic collapse in performance as the density of majority goals increases. We then extend this to a dynamic model and derive a non-linear Fokker-Planck equation that governs the evolution of document embeddings as the agent updates them to maximize retrieval accuracy. Our analysis reveals that this local relevance objective triggers an emergent global mechanism that systematically marginalizes minority interests. We prove that such objectives drive the system to self-organize into a state that exclusively serves majority interests. These results provide a theoretical foundation for understanding a critical grounding failure mode in retrieval-augmented agents.

[IR-57] Rethinking Fairness in LLM-Based Recommender Systems: A Survey

Link: https://arxiv.org/abs/2606.28340
Authors: Song-Duo Ma, Chu-Yun Chen, Bang-An Li, Pin-Yu Chen, Shau-Yung Hsu, Yun-Nung Chen
Subjects: Information Retrieval (cs.IR)

Abstract:Large Language Models (LLMs) are reshaping recommender systems by enabling more semantic, generative, and interactive recommendation pipelines. However, this shift also introduces new fairness challenges, as biases may arise from pretrained knowledge, prompts, generated explanations, decoding strategies, and feedback loops. This survey provides a systematic review of fairness in LLM-based recommender systems (LLM4Rec), organizing existing studies through a two-dimensional view of bias mechanisms and fairness targets, together with a structured overview of the evaluation landscape and mitigation strategies. We further connect fairness with broader trustworthy concerns, including explainability, privacy, robustness, and controllability. To the best of our knowledge, this is the first survey specifically focused on fairness in LLM4Rec, aiming to provide a structured foundation for future research on comprehensive and reliable fairness evaluation in LLM4Rec.

[IR-58] Memory Shot for Long-Term Dialogue

Link: https://arxiv.org/abs/2606.28338
Authors: Chunyi Peng, Haidong Xin, Xuanshuo Sheng, Xin Dai, Zhenghao Liu, Shuo Wang, Yukun Yan, Zulong Chen, Yu Gu, Ge Yu
Subjects: Information Retrieval (cs.IR)

Abstract:Large Language Models (LLMs) have demonstrated strong capabilities in general conversation, instruction following, and complex reasoning. However, in long-term dialogue settings, they often struggle to locate and utilize historical information most relevant to the current query. Existing approaches address this issue by constructing structured text-centered memory units through compressing and reorganizing user interaction history. However, these systems often rely on brute-force extraction of crucial evidence to associate episodes across dialogue sessions, causing substantial computational overhead and weakening structural cues such as speaker transitions, turn boundaries, and local contextual relationships. To avoid fragile text-based memory representations, we propose MemShot, which leverages dialogue structuring for long-term dialogue modeling and relies on the model’s internal visual reasoning capabilities to associate key episodes. Specifically, MemShot renders local contiguous dialogue spans into structured visual memory units, preserving meta-information and chronological dialogue turns while avoiding heavy-weight textual memory construction. Experimental results show that MemShot achieves stable and competitive performance on both LoCoMo and LongMemEval, while substantially shortening the memory construction pipeline and delivering a 70× speedup. Further analysis reveals that MemShot enhances the localization and utilization of historical evidence by directing memory processing toward structured local dialogue cues rather than surface-level lexical matching in a flat text stream. All code is released at this https URL.

[IR-59] A Systems-Level Analysis of Sensitivity, Robustness, and Stability in Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2606.28337
Authors: Bharath Simha Reddy Muthyam
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 11 pages, 8 figures

Abstract:Retrieval-Augmented Generation (RAG) systems are often evaluated using final answer accuracy, even though their failures can originate from preprocessing, retrieval, context packing, or generation. This paper presents a controlled empirical study of RAG sensitivity, robustness, and stability across 56 experimental runs. We evaluate how chunk size, retrieval depth (top k), embedding-based reranking, probabilistic retrieval noise, and repeated seeded runs affect retrieval, context packing, and generation behavior. Using a fixed 500-question QA subset mapped to 20,958 unique corpus contexts, we analyze both final answer metrics and intermediate failure modes. Across these experiments, retrieval-oriented metrics improved under broader retrieval settings, while downstream exact-match and F1 scores often behaved non-monotonically. We also observe preprocessing-induced answer loss under smaller chunk sizes, progressive degradation under retrieval corruption, and higher observed variance in broader retrieval regimes. These findings suggest that RAG evaluation should include sensitivity, robustness, stability, and multi-stage failure analysis rather than relying only on final answer accuracy.
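
The exact-match and F1 scores mentioned above are usually computed per question on normalized answer strings. The sketch below follows the widely used SQuAD-style convention (lowercasing, stripping punctuation and English articles, then token overlap); whether the paper uses exactly this normalization is an assumption, and the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    p, g = normalize(pred).split(), normalize(gold).split()
    if not p or not g:
        # If either side is empty, score 1.0 only when both are empty.
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("The Eiffel Tower.", "eiffel tower"))
    print(token_f1("Paris, France", "Paris"))
```

Because F1 gives partial credit for overlapping tokens while exact match does not, the two metrics can move differently when retrieval settings change how much surrounding text the generator copies into its answer.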

[IR-60] HyBIRD: Hyperbolic Bridge Retrieval and Diagnosis for Methodology Inspiration Retrieval

Link: https://arxiv.org/abs/2606.28336
Authors: Yang Yang,Boyun Xu,Hao Fu,Jindong Li,Zining Zhong,Bowen Tian,Jiemin Wu,Menglin Yang,Yutao Yue
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Methodology Inspiration Retrieval (MIR) asks a system to retrieve prior papers whose methods can inspire a new research proposal. Unlike general scientific retrieval, the central challenge is not topical similarity but whether a candidate paper provides concrete mechanisms that can instantiate an abstract methodological need. Existing MIR dense retrievers provide strong paper-level rankings, but the returned lists do not expose how proposal needs are bridged by retrieved methods, where evidence is weak, or which complementary snippets may help. We propose HyBIRD, a frozen-anchor framework that treats MIR as hyperbolic bridge retrieval and post-hoc method diagnosis. HyBIRD keeps a strong MIR dense retriever fixed, learns lightweight point, cone, and factorized hyperbolic bridge variants, and uses LLM-assisted method blocks for post-hoc explanation and evidence selection. On the MIR benchmark, the factorized bridge reaches 59.034 mAP while preserving the dense anchor’s strong retrieval behavior. More importantly, HyBIRD converts ranked papers into inspectable query need profiles, factor coverage, maturity views, and complementary evidence bundles. The results suggest that hyperbolic geometry is most useful as calibrated structure over a dense anchor, rather than as a standalone replacement for dense retrieval.

[IR-61] High-Dimensional Concentration and Retrieval Instability in Embedding Spaces: Implications for Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2606.28330
Authors: Ernesto Lopez Fune(DE)
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Embedding-based retrieval systems rely on the assumption that geometric proximity in high-dimensional representation spaces reflects semantic relevance. However, high-dimensional geometry induces concentration phenomena that can reduce the discriminative power of similarity measures and can destabilize nearest-neighbor retrieval. This work studies distance concentration, cosine concentration, contrast collapse, hubness, and retrieval instability through controlled numerical experiments across multiple synthetic distributions. The results show that similarity signals progressively lose contrast as dimension increases, leading to unstable retrieval behavior and structural bias in nearest-neighbor selection. A simplified Retrieval-Augmented Generation experiment further suggests that these effects can degrade grounding reliability upstream of generation. These findings motivate geometry-aware diagnostics and robustness-oriented retrieval strategies for embedding-based AI systems. The experiments are intentionally synthetic in order to isolate intrinsic geometric effects.
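Distance concentration is easy to reproduce numerically. The following is a minimal sketch, not the paper's experimental code: it assumes points drawn uniformly from the unit hypercube, Euclidean distance, and a hypothetical helper `relative_contrast` that reports (d_max - d_min) / d_min for one random query. Under those assumptions the contrast shrinks as the dimension grows, which is the loss of discriminative power the abstract describes.

```python
import math
import random


def relative_contrast(dim: int, n_points: int = 200, seed: int = 0) -> float:
    """Relative contrast (d_max - d_min) / d_min from one random query to n random points."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


# Contrast between nearest and farthest neighbor at increasing dimension.
contrasts = {dim: relative_contrast(dim) for dim in (2, 32, 512)}
for dim, value in contrasts.items():
    print(f"dim={dim:4d}  relative contrast={value:.3f}")
```

A small contrast means the nearest and farthest neighbors are almost equally far from the query, so small perturbations can reorder the top-k results.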

[IR-62] M3 QuestionIng: Multi-modal Multi-span Medical Question Answering

Link: https://arxiv.org/abs/2606.28329
Authors: Anisha Saha,Vaibhav Rathore,Abhisek Tiwari,Akash Ghosh,Sai Ruthvik Edara,Sriparna Saha
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments:

Abstract:The growing adoption of AI in healthcare, particularly in preventive care, highlights the critical need for accessibility and precision in Medical Question Answering (MedQA). In recent years, significant efforts have been made to develop multi-span medical question-answering systems, where the answer to a query may span multiple sections or paragraphs of a source document. However, existing systems fall short of aligning with real-world scenarios, where source documents often include both textual and visual content, requiring answers to incorporate images for better comprehension. To address this gap, we propose $M^3$QAFrame, a multi-modal, multi-span medical question-answering framework that leverages visual cues to enhance the generation of comprehensive answers drawn from diverse textual and visual spans. The model takes the context, query, and images as input and outputs an answer containing both textual answers and relevant images. The text and image embeddings are processed using a transformer-based architecture to determine the sentence and image relevance. We curate a multi-modal, multi-span medical question-answering ($M^3$ QuestionIng) dataset containing queries, medical contexts, associated medical images, and extractive answers. Additionally, each query-answer pair is labeled with user intent and query type to enhance query and context comprehension. Extensive experiments show that our approach consistently outperforms existing methods across various evaluation metrics.

[IR-63] TextClusterLab: An Integrated Framework for Reliable Text Clustering Studies

Link: https://arxiv.org/abs/2606.28328
Authors: Daoming Wan,Yizheng Huang,Jimmy X. Huang
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 13 pages, 4 figures, 9 tables

Abstract:In recent years, text clustering has become a critical technique for applications including intent discovery, topic mining, and recommendation systems. However, evaluating text clustering algorithms remains challenging since many real-world textual datasets are not suitable for clustering assessment due to ambiguous semantic boundaries, the high dimensionality of embeddings, and inconsistent cluster structure. Current clustering dataset generators are designed for numerical data, providing limited support for text-specific benchmarking. This paper introduces TextClusterLab, a comprehensive framework for text clustering research. TextClusterLab offers a Large Language Model (LLM)-driven text clustering dataset generator to produce synthetic text datasets for evaluating clustering algorithms. This generator supports setting various clustering attributes, such as class imbalance, intra-cluster compactness, and inter-cluster diversity. These generated datasets can serve as practical benchmarks for testing the robustness and versatility of text clustering algorithms in diverse scenarios. Moreover, we introduce a benchmark to verify whether a text dataset is suitable for clustering evaluation. Therefore, TextClusterLab provides an integrated framework for reproducible and comprehensive text-specific clustering research. Our TextClusterLab is publicly available at this https URL, and some synthetic example datasets with various attributes are publicly available at this https URL.

[IR-64] The Interference Gap: Comparing Retrieval Bounds in Human Memory and RAG Systems

Link: https://arxiv.org/abs/2606.28327
Authors: Dongxin Guo,Jikun Wu,Siu-Ming Yiu
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 8 pages, 2 figures, 1 table. Accepted at CogSci 2026

Abstract:How do retrieval bounds compare between human episodic memory and Retrieval-Augmented Generation (RAG) systems under semantic interference? We present a unified signal detection theory (SDT) framework that applies to both, and use it to fit behavioral and computational data in matched paradigms. Both systems show logarithmic accuracy decline with association count (fan), but humans exhibit lower interference sensitivity ($\alpha/\sigma = 0.41$) than dense passage retrieval ($\alpha/\sigma = 0.67$), with cognitively-inspired HippoRAG falling between the two ($\alpha/\sigma = 0.44$). Behavioral experiments ($N = 112$) and simulations validate the framework; parameter recovery confirms identifiability ($r \geq .93$) and model comparison favors the logarithmic specification over a power-law alternative ($\Delta$BIC 15). We discuss encoding specificity, temporal context binding, and retrieval gating as candidate mechanisms whose causal role remains to be established. Six falsifiable predictions connect cognitive memory research with AI retrieval evaluation.

[IR-65] ADEPT: An Entropy-Driven Dual-Strategy Agent for Interactive Video Retrieval ICASSP2026

Link: https://arxiv.org/abs/2606.28326
Authors: Ke Chen,Shengyuan Han,Yongfeng Huang,Yujin Zhu,Jingwei Xiong,Liang Xu,Jundong Liu
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 5 pages, 3 figures. Published in IEEE ICASSP 2026

Abstract:This research aims to solve the challenge of video retrieval from massive datasets, caused by ambiguous user queries. Prevailing single-round retrieval paradigms face a performance bottleneck, as they lack effective feedback mechanisms to handle complex search intentions. The root cause is the “Intent-Query Gap”, where users’ intent cannot be captured by a simple text query. To solve this, we propose the ADEPT framework: a training-free agent that pioneers an entropy-driven decision engine to efficiently guide dialogue by dynamically selecting between ASK and REFINE strategies. Experiments on two challenging datasets demonstrate that ADEPT significantly outperforms all non-interactive, heuristic, and Video-LLM baselines. The core contribution of this work is an efficient and interpretable entropy-driven interactive strategy that sets a new performance benchmark for the field of interactive video retrieval.
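An entropy-driven choice between the ASK and REFINE strategies could look roughly like the sketch below. The abstract does not give the decision rule, so everything here is an assumption: retrieval scores are turned into a distribution with a softmax, ambiguity is measured with normalized Shannon entropy, and a hypothetical threshold of 0.8 sends high-entropy (ambiguous) states to ASK and low-entropy states to REFINE.

```python
import math


def softmax(scores: list[float], temperature: float = 1.0) -> list[float]:
    """Convert raw retrieval scores into a probability distribution."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs: list[float]) -> float:
    """Shannon entropy divided by log(n), so the result lies in [0, 1]."""
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


def choose_strategy(scores: list[float], threshold: float = 0.8) -> str:
    """Assumed rule: ask a clarifying question when candidates are hard to tell apart."""
    return "ASK" if normalized_entropy(softmax(scores)) > threshold else "REFINE"


flat = choose_strategy([0.51, 0.50, 0.49, 0.50])   # near-uniform scores, high entropy
peaked = choose_strategy([9.0, 1.0, 0.5, 0.2])     # one dominant candidate, low entropy
print(flat, peaked)
```

Because the rule needs only the candidate scores, it stays training-free, which matches the abstract's description of the agent. The threshold and temperature would need tuning on real data.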

[IR-66] SocialPersona: Benchmarking Personalized Profiling and Response with Multimodal Social-Media Context

Link: https://arxiv.org/abs/2606.26654
Authors: Qinkai Zhang,Yanyan Zhao,Xin Lu,Yulin Hu,Pengtao Han,Bing Qin
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments:

Abstract:Personalized language-model assistants are often evaluated through a memory lens: can a model recall preferences users have explicitly stated in dialogue? More comprehensive personalization demands a harder capability – inferring what users care about from the multimodal traces they naturally leave behind. We introduce SocialPersona, a benchmark for evaluating whether multimodal large language models (MLLMs) can recover revealed preferences from longitudinal social-media timelines and use them in dialogue. Built from longitudinal timelines of 171 everyday, non-promotional social-media users, SocialPersona contains text, images, timestamps, and 2,597 human-verified preference tags across seven interest domains, separating stable interests from recent interests. It supports two tasks: constructing structured user profiles from multimodal context and generating responses aligned with inferred profiles. Experiments with proprietary and open-weight MLLMs show that models can identify broad interest domains, yet their performance drops on fine-grained and recent interests and degrades further when inferred profiles must be used to personalize dialogue. Together with evidence that text and images provide complementary preference signals, these results indicate that robust cross-modal, long-horizon user modeling remains a key challenge, and that SocialPersona can help measure and advance progress toward assistants that infer and act on revealed preferences.

Human-Computer Interaction

[HC-0] Concept Catalyst: Exploring Scrutable Interfaces to Structure K-12 Teacher Interactions with Generative AI

Link: https://arxiv.org/abs/2606.30590
Authors: Gennie Mansi,Sunni Newton,Roxanne Moore,Meltem Alemdar,Mark Riedl
Categories: Human-Computer Interaction (cs.HC)
Comments: 11 pages, 2 figures

Abstract:Purpose: This paper explores how to align AI-based tools with teachers’ classroom needs by using scrutable interfaces – interfaces that link an easily manipulable knowledge representation to an underlying AI model, so users can change the system’s outputs without understanding its details. It provides an in-depth discussion and example of a scrutable interface that structures teachers’ interactions with generative AI. This study aims to expand how and where scrutable interfaces are used in AI-based tools to support teachers, who have not been historically targeted in the design of scrutable systems. Design/Methodology/Approach: This paper presents the design and evaluation of Concept Catalyst, an AI-based tool with a scrutable interface, created to support teachers’ reflection while using generative AI for curriculum development. It presents the findings from an exploratory study using Wizard-of-Oz testing with middle and high school engineering teachers, resulting in 10 depth interviews lasting 55 minutes on average. Screen/audio recordings and the classroom content teachers produced during the session were also collected. Findings: The paper provides empirical insights about how scrutable interfaces can positively structure teachers’ interactions with generative AI models when creating classroom content. Findings suggest that scrutable interfaces can help teachers reflect on their teaching practices while improving efficacy, efficiency, and motivation when using AI. What is original/value of the paper: This paper explores an identified need to support teachers’ classroom practices and needs when using generative AI. It extends the consideration of scrutable interfaces in two ways: to support teachers as users (not just students) and to structure interactions with generative AI models. 

[HC-1] The Human Creativity Benchmark

Link: https://arxiv.org/abs/2606.30561
Authors: Aspen Hopkins,Allison Nulty,Alexandria Minetti,Anoop Pakki,Angad Singh
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: 30 pages

Abstract:Modern AI evaluation frameworks treat evaluator disagreement as noise to be resolved. In creative domains, professional disagreement reflects genuine differences in taste, not measurement error. We argue that evaluating creative AI requires preserving two distinct signals: convergence, where professionals align around shared best practices, and divergence, where individual taste legitimately varies. We present the Human Creativity Benchmark (HCB), a benchmark that operationalizes this separation by collecting pairwise preferences, scalar ratings on prompt adherence, usability, and visual appeal, and qualitative rationale from domain professionals. Across 15,000 professional judgments spanning five creative domains and three workflow phases (ideation, mockup, refinement), we find that convergence concentrates on verifiable dimensions like technical correctness and visual hierarchy, while divergence concentrates on taste-driven dimensions like aesthetic direction and conceptual risk. No model excels uniformly across all phases. Collapsing these signals into a single quality metric discards the most actionable information: where models must be correct versus where they should remain steerable.

[HC-2] To Tab or Not to Tab: Measuring Critical Engagement in AI Code Completion Tools Using Behavioral Signals and Attention Checks ITiCSE2026

Link: https://arxiv.org/abs/2606.30549
Authors: Jessica Hutchison,Ian Tyler Applebaum,Kenneth Angelikas,Kush Rakesh Patel,Phuoc Nguyen,Antonio Lazaro,Nicholas Rucinski,Rahad Arman Nabid,Stephen MacNeil
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 7 pages. Accepted for publication in the Proceedings of the 31st ACM Conference on Innovation and Technology in Computer Science Education (ITiCSE 2026), Madrid, Spain, July 10-15, 2026. Author’s accepted manuscript

Abstract:AI code completion tools, such as GitHub Copilot, provide students with code suggestions to help them write programs. However, recent qualitative studies suggest that students fail to critically evaluate these suggestions. We present Clover, a code completion tool that logs students’ interactions with code suggestions and additionally offers attention checks to probe reflective engagement during programming tasks. We also develop a taxonomy of behavioral interaction metrics for AI-assisted programming, informed by literature. We analyzed relationships between interaction patterns, engagement with attention checks, and task performance. We observed that higher rates of tab accept were associated with lower attention check performance, while increased dwell time was associated with higher attention check performance. We conclude by discussing how programming process data and attention checks might support reflective engagement in AI-assisted programming.

[HC-3] Using Large Language Models as Low-Cost Statistical Estimators for Human-Response Data

Link: https://arxiv.org/abs/2606.30372
Authors: Haobo Yang
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 37 pages

Abstract:Quantitative research across the social and behavioral sciences depends on human subject experiments that are expensive, slow, and subject to sampling bias. Here we show that pretrained large language models induce risk-equivalent estimators of conditional expectations under squared loss, establishing restricted functional risk equivalence: under squared loss, the LLM induces an estimator whose risk matches the Bayes optimal risk for squared-loss prediction of conditional expectations for any inference that depends on the data only through the conditional mean. We formalize the LLM as a misspecified functional estimator $T(\hat{P}_n)$ trained on i.i.d. data, decompose the estimation error into representation bias $\epsilon_{\mathrm{rep}}$ and optimization error, and prove that under mild regularity conditions the LLM’s expected error converges to the irreducible population variance plus the squared representation bias, with the representation bias bounded by the Pinsker inequality. The identifiability error $\delta$ propagates into the effective bias, inflating the asymptotic risk floor. We establish restricted functional risk equivalence via a bidirectional Le Cam deficiency analysis: the forward deficiency vanishes asymptotically while the reverse deficiency is exactly zero. We provide finite-sample concentration bounds and a calibration protocol with explicit decision rules. The result is a precise, provable statement: a well-calibrated LLM achieves the Bayes-optimal risk for conditional-mean-dependent inference, bounded by explicit scope conditions. In practical applications, this means that under satisfied conditions and well-calibrated models, large language models can be used in many prediction and decision-making tasks that originally relied on human experiments, approximating near-optimal statistical inference at lower cost.
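The limiting-risk statement in this abstract can be written compactly. The display below is a paraphrase under assumed notation, not the paper's own formula: $Y$ is the human response, $X$ the conditioning covariates, $\hat{f}_n$ the LLM-induced predictor of $\mathbb{E}[Y \mid X]$, and $\epsilon_{\mathrm{rep}}$ the representation bias named in the abstract. The abstract's additional identifiability error $\delta$ is deliberately not modeled here.

```latex
\mathbb{E}\big[(Y - \hat{f}_n(X))^2\big]
\;\longrightarrow\;
\underbrace{\mathbb{E}\big[\operatorname{Var}(Y \mid X)\big]}_{\text{irreducible variance}}
\;+\; \epsilon_{\mathrm{rep}}^{2}
\qquad (n \to \infty)
```

Read this way, the first term is the Bayes risk under squared loss, so the LLM-induced estimator attains Bayes-optimal risk only to the extent that $\epsilon_{\mathrm{rep}}$ is small.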

[HC-4] Rehearsed Multi-Agent Live Product Demonstrations with Real-Time Voice Question Answering

Link: https://arxiv.org/abs/2606.30294
Authors: Rahul Khedar,Mayank Malhotra,Avinash Karn,Mouli V,Prakhar Mehrotra
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
Comments: Preprint. 4 figures, 1 algorithm, 5 tables. Systems paper with a preliminary six-session case study on four deployed applications; full benchmark protocol proposed, corpus run to appear in a later revision

Abstract:Live product demonstrations are a recurring, high-cost activity in software organizations: a human presenter must select features, dispatch the corresponding interactions on a running application, narrate them coherently, and answer questions in real time. Existing automation addresses only fragments – generalist browser agents target instruction-conditioned task completion, and demo-video tools produce fixed MP4 artifacts that cannot be questioned and silently break under interface drift. We propose Rhetor, a multi-agent system that takes a running web application and its source-code repository as input and produces a rehearsed live demonstration with segment-synchronized narration and real-time voice question answering. The architectural contributions are a cross-modal feature representation that merges UI exploration with source-code analysis into features tagged with discrete focus tiers, a grounded scripter constrained to UI elements observed during exploration and dispatched through multi-strategy semantic locators, a pre-presentation rehearsal loop with explicit convergence and graceful degradation to narration-only segments, and a runtime synchronization invariant that ties each browser action to the audio-end event of its narration segment. Across six pipeline sessions on four deployed applications – including the public-domain whiteboard application Excalidraw – the rehearser’s internal locator-firing rate (sigma-bar) spans 0.31-1.00 over 147 scripted actions; on the substantial workload (53 actions, full tier differentiation), sigma-bar is approximately 0.92, and on the public-domain reference point the locator-repair step drives convergence to sigma-bar = 1.00 at iteration 2. We additionally define a benchmark protocol of ten metrics across six application categories that would establish, beyond the case study, whether each design choice contributes positively.

[HC-5] Consensus Clustering of Free-Viewing Gaze Data: New Insights into Human-Information Interaction

Link: https://arxiv.org/abs/2606.30035
Authors: Beryl Gnanaraj,Jaya Sreevalsan-Nair,Saqib Alam Ansari,Maanasa Rajaraman
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 31 pages, 10 figures, 8 tables

Abstract:Free-viewing gaze data provides a rich, task-free window into human visual attention. Conventional exploratory data analysis of the data provides user attention patterns through fixations and areas of interest. However, despite the richness of this gaze data, its human-information interaction (HII) patterns are understudied. We address this gap using consensus clustering of gaze data with respect to users and stimulus characteristics. We present a novel end-to-end unsupervised ensemble learning system for consensus clustering of free-viewing gaze datasets, EnsembleGaze. With a goal of characterizing the user behavior and stimulus type, we propose a feature engineering step based on statistical descriptors of fixation-based distributions. EnsembleGaze involves consensus voting of selected clustering methods implemented on the feature vector to compute the co-association matrix. Using the separate consensus clustering of users and stimuli as a baseline, we further propose two high-dimensional clustering strategies for determining gaze clusters based on joint user and image characterization. They are consensus subspace clustering and spectral biclustering. Clustering performance is evaluated using selected standard metrics and is further interpreted through image-level properties. Our system provides a replicable method for the unsupervised analysis of fixation behavior in scene perception research. Our results show that image stimuli groupings are highly consistent across methods, reflecting a robust ambient-versus-focal viewing mode distinction, whereas user groupings are image-context-dependent, a structure that only biclustering and the two-step conditional approaches are architecturally capable of recovering. Testing on the publicly available datasets revealed dataset-specific patterns, with each offering complementary insights through distinct clustering strategies.
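The co-association matrix at the center of consensus clustering can be sketched in a few lines. This is an illustrative example, not EnsembleGaze itself: the three base labelings are made up, and the final grouping step (connected components over pairs that co-occur in more than half of the base clusterings) is an assumed, simple way to read clusters off the matrix.

```python
def co_association(labelings: list[list[int]]) -> list[list[float]]:
    """Fraction of base clusterings in which each pair of items shares a cluster."""
    n = len(labelings[0])
    matrix = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += 1.0
    return [[value / len(labelings) for value in row] for row in matrix]


def consensus_clusters(matrix: list[list[float]], threshold: float = 0.5) -> list[int]:
    """Assumed consensus step: connected components over pairs above the threshold."""
    n = len(matrix)
    assigned = [-1] * n
    cluster = 0
    for start in range(n):
        if assigned[start] != -1:
            continue
        assigned[start] = cluster
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if assigned[j] == -1 and matrix[i][j] > threshold:
                    assigned[j] = cluster
                    stack.append(j)
        cluster += 1
    return assigned


# Three hypothetical base clusterings of five items; label ids need not match across runs.
labelings = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
matrix = co_association(labelings)
consensus = consensus_clusters(matrix)
print(consensus)
```

Because the matrix counts only whether two items share a cluster, it is invariant to how each base method numbers its clusters, which is what makes voting across different clustering algorithms possible.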

[HC-6] SICAGE: Speaker-Independent Culture-Aware Gesture Generation using TED4C-L Dataset ECCV2026

Link: https://arxiv.org/abs/2606.30001
Authors: Ariel Gjaci,Antonio Sgorbissa,Vittorio Murino
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Sound (cs.SD)
Comments: Accepted at ECCV 2026

Abstract:Recent co-speech gesture generation methods often overlook cultural differences, limiting their effectiveness in human-agent interaction. Moreover, culture-conditioned models are rarely evaluated under speaker-disjoint splits, so apparent “cultural” behavior may be confounded with speaker-specific gesturing style. We introduce SICAGE, a modular framework for culture-aware co-speech gesture generation that conditions motion synthesis models on speaker-independent cultural representations. SICAGE learns these representations from audio and text by treating each speaker as a separate domain while imposing invariance across speakers. This encourages representations to remain culture-discriminative while reducing dependence on speaker identity. The resulting cultural embeddings condition a multimodal generator to produce culturally appropriate gestures. We instantiate this idea with two domain generalization approaches: adversarial learning and Fishr regularization. We further introduce ALaDiT, a real-time diffusion-based gesture generator designed to efficiently incorporate the learned cultural embeddings. To validate our method, we built TED4C-L, a 106-hour multimodal dataset of 764 TED speakers from four cultural groups. Experiments show that SICAGE improves motion realism, diversity, beat synchronization, semantic relevance, and cultural consistency.

[HC-7] Legible Shared Autonomy: Implicit Communication of Robot Belief through Motion IROS2026

Link: https://arxiv.org/abs/2606.29846
Authors: Jinwei Liu,Pengfei Li,Shaofeng Chen,Tao Wang,Yun-Bo Zhao
Categories: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
Comments: Accepted at IROS 2026

Abstract:Shared autonomy systems combine user input with autonomous assistance to help users with motor impairments control robot arms to perform everyday manipulation tasks, by inferring user goals and providing appropriate guidance. However, the robot’s internal beliefs about user goals cannot be observed by users. Traditional shared autonomy systems provide assistance along efficient shortest paths toward inferred goals, but when multiple objects lie in similar directions, such assistive motion remains ambiguous and fails to reveal the specific goal identified by the robot. This creates two critical problems. First, when the robot correctly infers the goal, users continue controlling because they cannot perceive understanding from ambiguous assistive motion, wasting effort when autonomous completion would suffice. Second, when the robot misunderstands intent, users cannot quickly detect errors until assistive motion diverges significantly, requiring substantial corrective input. We address this by introducing legible motion into shared autonomy, where robot actions must both advance toward the goal and clearly reveal which goal has been inferred, enabling users to understand the robot’s beliefs and adjust control accordingly. The robot modulates communication strength through confidence-aware adaptive authority allocation by providing assertive legible assistive actions when confident while increasing user authority when uncertain, transforming shared autonomy into transparent bidirectional collaboration. User studies including simulation and physical experiments with a six-degree-of-freedom robot arm demonstrate that legible shared autonomy significantly improves users’ understanding of robot beliefs and reduces user control effort compared to standard shared autonomy.
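The confidence-aware authority allocation described in this abstract can be illustrated with a small 2D sketch. It does not implement legible motion planning, and it is not the authors' controller: the Boltzmann-style belief update, the `rationality` constant, the use of the top belief as the blending weight, and the helper names are all assumptions chosen to show the general pattern of inferring a goal and then shifting authority toward the robot as confidence grows.

```python
import math


def update_belief(belief, position, user_input, goals, rationality=5.0):
    """Assumed update: goals better aligned with the user's input direction gain probability."""
    weights = []
    for prior, goal in zip(belief, goals):
        to_goal = (goal[0] - position[0], goal[1] - position[1])
        norm = math.hypot(*to_goal) * math.hypot(*user_input)
        dot = to_goal[0] * user_input[0] + to_goal[1] * user_input[1]
        cosine = dot / norm if norm else 0.0
        weights.append(prior * math.exp(rationality * cosine))
    total = sum(weights)
    return [w / total for w in weights]


def blend(user_input, robot_action, confidence):
    """Linear arbitration: more robot authority when confident, more user authority otherwise."""
    alpha = max(0.0, min(1.0, confidence))
    return tuple(alpha * r + (1.0 - alpha) * u for u, r in zip(user_input, robot_action))


goals = [(1.0, 1.0), (1.0, -1.0)]  # two candidate objects in similar directions
belief = update_belief([0.5, 0.5], (0.0, 0.0), (1.0, 0.2), goals)
confidence = max(belief)           # assumed confidence measure: top goal probability
command = blend((1.0, 0.0), (0.0, 1.0), 0.25)
print(belief, confidence, command)
```

In a legible variant, the robot action passed to `blend` would be chosen to exaggerate motion toward the inferred goal and away from the alternatives, so the user can read the robot's belief from its motion.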

[HC-8] Making Multimodal LLMs Reliable Chart Data Extractors: A Benchmark and Training Framework

Link: https://arxiv.org/abs/2606.29808
Authors: Yuchen He,Peizhi Ying,Liqi Cheng,Kuilin Peng,Yuan Tian,Dazhen Deng,Yingcai Wu
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted at CHI’26

Abstract:Chart data extraction, which reverse-engineers data tables from chart images, is essential for reproducibility, analysis, retrieval, and redesign. Existing interactive tools are reliable but tedious, and mixed-initiative systems, while more efficient, lack generalizability. Recent multimodal large language models (MLLMs) offer a unified interface for chart interpretation, yet their ability to extract accurate data tables, especially without visible labels, remains unclear. We build a benchmark featuring diverse real-world charts without data labels to evaluate this capability. Results show that, while current MLLMs reliably reconstruct table structures, they struggle with precise value recovery. To address this, we revisit chart data extraction from a human-centered perspective and argue that extraction should follow a progressive learning process similar to how people read charts. Our training framework substantially improves numerical accuracy, achieving state-of-the-art performance with a 7B-parameter model. A user study further shows that our model effectively supports mixed-initiative workflows for reliable chart data extraction.

[HC-9] DEEPMED Search: An Open-Source Agentic Platform for Medical Deep Research with Introspective Verification IJCAI-ECAI2026

Link: https://arxiv.org/abs/2606.29746
Authors: Maolin Liu,Fanyu Xu,Ruoqing Xu,Jiahang Zhang,Hao Wang,Rui Wang
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 5 pages, 2 figures, 2 tables. Accepted to IJCAI-ECAI 2026 Demo Track. Project website: this https URL. Demo video: this https URL

Abstract:Navigating the deluge of heterogeneous medical data, from academic literature (PubMed) to clinical guidelines (Web) and private knowledge bases, remains a critical bottleneck for evidence-based medicine. While commercial black-box tools lack transparency, standard open-source RAG implementations frequently suffer from reasoning drift when handling complex, long-tail queries. We present DEEPMED Search, a fully open-source, agentic platform designed for transparent medical deep research. Built on a high-performance this http URL architecture, DEEPMED Search features a source-adaptive router that autonomously dispatches sub-queries to PubMed, web search, or local graph-based knowledge bases based on information density. Crucially, the platform integrates an introspective verification module, powered by a causal-consistent multi-agent debate framework, to validate retrieved evidence against diagnostic logic before synthesis. To demonstrate its robustness, we showcase DEEPMED Search’s ability to autonomously decompose high-difficulty rare disease queries, filter out confounding noise, and generate structured, citation-backed research reports in minutes. By open-sourcing this software, we provide the community with a robust infrastructure to democratize access to trustworthy, glass-box medical reasoning in research and prototyping settings.

[HC-10] DeepTrans Studio: Turning Expert Interventions into Shared Team Knowledge in Agentic Translation Workflows

Link: https://arxiv.org/abs/2606.29727
Authors: Ziyang Lian,Qingya Zhang,Hao Wang,Huiwen Xiong,Qi Yang,Lingyi Meng,Xiaoyi Gu,Rui Wang
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 4 pages, 2 figures. Accepted to CSCW 2026 Demo. Code and demo video: this https URL, this https URL

Abstract:Professional translation is often a team-based process: translators, reviewers, and project managers must coordinate terminology, legal force, and accountability across documents. Yet many LLM-based translation tools treat human corrections as isolated edits. Expert decisions made in one segment or by one member are rarely captured as reusable knowledge for the rest of the team. We present DeepTrans Studio, a collaborative translation workspace that lets professionals intercept selected nodes in an agentic translation workflow, review evidence, revise AI outputs, and save approved decisions to a shared team memory. During the demo, attendees will role-play translators and reviewers, resolve preset terminology and legal-modal risks, and see how their decisions are propagated to downstream segments and surfaced in a teammate’s workspace as reusable precedents. The demo illustrates how human interventions in AI-mediated work can become shared, traceable knowledge rather than one-off corrections.

[HC-11] From Trait to Behavior: A Cognitive-Affective Personality System (CAPS) Perspective on Multi-Homing Intention in AIGC Platforms

Link: https://arxiv.org/abs/2606.29726
Authors: Xuchao Zhang, Jihye Lee
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Author’s Original Manuscript. The Version of Record has been published in International Journal of Human-Computer Interaction

Abstract:With the rapid development of Artificial Intelligence Generated Content (AIGC) platforms, users increasingly show cross-platform usage intentions. Existing research focuses on adoption and usage intentions in single-platform AIGC contexts. A theoretical gap still exists in studies on cross-platform usage. This paper constructs and verifies a three-stage multiple mediation model based on the personality trait-perception-behavioral response framework. The model integrates the optimum stimulation level (OSL) theory, complementarity theory, and perceived value theory, and it sets social influence and use experience as control variables to examine users’ multi-homing intention. The results show that: (a) OSL significantly enhances users’ perceived complementarity; (b) perceived complementarity positively affects perceived epistemic value; (c) perceived epistemic value significantly and positively predicts multi-homing intention; (d) OSL influences multi-homing intention through a chain mediation path of perceived complementarity and perceived epistemic value; and (e) social influence has a significant positive effect on multi-homing intention, while the effect of use experience is not significant.

[HC-12] VISTA-DZ: Visual Semantic Trajectory Adaptation for Personalized Dilemma Zone Prediction

Link: https://arxiv.org/abs/2606.29548
Authors: Chuheng Wei, Ziye Qin, Ziran Wang, Guoyuan Wu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Comments: This manuscript is currently under review

Abstract:Driver decision making in the dilemma zone at signalized intersections is safety critical, as vehicles approaching a yellow signal must decide whether to stop or proceed within limited time and distance margins. Accurate prediction of both stop-go decisions and decision timing is important for adaptive signal control, advanced driver assistance systems, and human-centered intelligent transportation applications. However, dilemma zone behavior is strongly driver dependent. Similar approach trajectories may lead to different decisions across drivers because of differences in risk preference, braking habit, and decision threshold. Existing personalized models often rely on handcrafted scalar descriptors, which provide useful but limited summaries of individual behavior. This paper proposes VISTA-DZ, a semantic-profile-conditioned framework for personalized stop-go and decision-time prediction. Historical trajectories are converted into visual representations, interpreted by a vision-language model to generate behavioral profiles, and encoded as semantic embeddings to condition a dual-output prediction network. The final model combines a bidirectional GRU encoder, driver-conditioned multi-head cross-attention, and Feature-wise Linear Modulation for temporal evidence selection and feature adaptation. Experiments on the SDZ dataset and a newly collected FDZ dataset show that VISTA-DZ outperforms trajectory-only and handcrafted personalization baselines, achieving 93.26% in-domain simulation accuracy and 90.22% mean accuracy across 20 held-out simulation drivers. Cross-domain results further show feasible zero-shot simulation-to-real transfer and better real-world generalization when simulation data are combined with limited field data.
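The abstract names Feature-wise Linear Modulation (FiLM) as the feature-adaptation step but does not give layer sizes or say how the driver embedding is projected. As a rough sketch only, assuming a single linear projection from the semantic embedding to a per-feature scale and shift (the function names, shapes, and weights below are hypothetical and not taken from the paper), the core operation might look like this:

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, w: Matrix, b: Vector) -> Vector:
    """Dense layer for one input vector: returns w @ x + b."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def film(features: Vector, cond: Vector,
         w_gamma: Matrix, b_gamma: Vector,
         w_beta: Matrix, b_beta: Vector) -> Vector:
    """Feature-wise Linear Modulation: gamma(cond) * features + beta(cond).

    `cond` stands in for the driver's semantic embedding (an assumption);
    gamma and beta are per-feature scale and shift derived from it.
    """
    gamma = linear(cond, w_gamma, b_gamma)
    beta = linear(cond, w_beta, b_beta)
    return [g * f + b for g, f, b in zip(gamma, features, beta)]


# Toy example with hand-picked weights (purely illustrative).
trajectory_features = [1.0, 2.0]
driver_embedding = [1.0, 1.0]
modulated = film(
    trajectory_features,
    driver_embedding,
    w_gamma=[[2.0, 0.0], [0.0, 3.0]], b_gamma=[0.0, 0.0],
    w_beta=[[1.0, 0.0], [0.0, -1.0]], b_beta=[0.0, 0.0],
)
print(modulated)  # [3.0, 5.0]
```

The point of the sketch is that the same trajectory features are scaled and shifted differently for each driver, which is one way a shared network can produce driver-specific predictions.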

[HC-13] LLMography: Transforming Human-AI Conversations into Traceability Oversight and Auditability Indicators

Link: https://arxiv.org/abs/2606.29437
Authors: Mohammed Bousmah
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Preliminary exploratory study; 19 anonymized student audit reports; includes prototype screenshots

Abstract:The growing use of Large Language Models (LLMs) in education, software engineering, academic writing, and technical documentation raises a key question: how can we evaluate not only AI-assisted outputs, but also the interaction process that produced them? Current debates often focus on detecting whether a final artifact was generated by AI, while overlooking the conversation history that reveals human direction, AI contribution, corrections, validation, and traceability. This paper introduces LLMography, a framework for transforming Human-AI conversations into measurable indicators of provenance, human contribution, AI dependency, reproducibility, and auditability. By analogy with bibliography and webography, LLMography documents the dynamic trajectory of interaction between a human and a Large Language Model as a structured trace of Human-AI co-production. We present a prototype that analyzes Human-AI conversation traces and generates KPI reports including Prompt Quality Score, Human Direction Score, AI Dependency Level, Auditability Score, Final Output Traceability, Privacy Risk Level, and a recommended LLMography label. A preliminary exploratory evaluation was conducted on 19 anonymized audit reports from engineering students. Most interactions were classified as Human-AI co-produced, with average scores of 86.8/100 for Human Direction, 81.9/100 for Prompt Quality, 72.8/100 for Auditability, and 77.1/100 for Final Output Traceability. The paper also applies LLMography to its own writing process, classified as human-originated, human-directed, AI-assisted co-production. The findings suggest that AI transparency should move beyond output detection toward documenting the history of interaction. 

[HC-14] The Role of Online Forums in Developer Understanding of Privacy Law – A Reddit Case Study

Link: https://arxiv.org/abs/2606.29393
Authors: Sara. Haghighi, Clark LaChance, Ali Pourghasemi Fatideh, Travis Breaux, Sepideh Ghanavati
Subjects: Cryptography and Security (cs.CR); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
Comments: Accepted at PoPETs 2026

Abstract:Software practitioners use online forums to navigate complex and often ambiguous legal privacy requirements, yet little is known about their professional backgrounds, what challenges they face, and how they use and assess the credibility of the advice received, or how they resolve ambiguities in posts. We report the findings of a survey of 223 Reddit users from regulatory-focused subreddits, complemented by a qualitative analysis of 2,248 posts and responses. Our results show that, despite holding privacy-related certifications, most participants frequently use forums to seek legal advice. Key challenges reported or identified include implementing a data protection impact assessment, reporting a data breach, and obtaining cookie consent. Reddit users often assess credibility by reviewing respondents’ post history, verifying sources cited, trusting advice from recognized experts, and following up for clarity before responding. We highlight research and educational directions to bridge gaps in support needed for regulatory compliance guidance.

[HC-15] When Stopping Fails: Rethinking Minimal Risk Conditions through Human-Interactive Autonomous Driving for Safe Transportation Systems ITSC2026

Link: https://arxiv.org/abs/2606.29115
Authors: Yash Tandon, Giovanni Tapia Lopez, Marcus Blennemann, Mohan Trivedi, Ross Greer
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: 8 pages, 1 figure, Accepted to IEEE ITSC 2026

Abstract:Autonomous vehicles (AVs) are increasingly deployed in urban environments, yet their safety frameworks remain primarily designed around collision avoidance and minimal risk condition (MRC) behaviors such as slowing or stopping when uncertainty arises. Although effective in reducing immediate crash risk, real-world deployments indicate that stopping alone does not guarantee safe integration into human-governed roadway systems. Incidents reported by municipalities and public records show that AV fallback behaviors can obstruct traffic, interfere with emergency response operations, and create accessibility challenges for passengers and pedestrians. This paper presents an analysis of publicly documented incidents involving AV stopping behavior and human-AV interaction failures. We categorize these incidents according to limitations in perception, planning, and control within current AV architectures. Using this taxonomy, we identify key gaps in existing safety paradigms, particularly the lack of mechanisms for interpreting human authority, responding to multimodal instructions, and adapting to dynamic, socially regulated traffic conditions. We then review emerging research directions that support human-interactive perception, language-grounded and accessibility-aware planning, and assisted control through remote guidance and teleoperation. The analysis highlights the need to augment current AV safety frameworks with capabilities that enable cooperative interaction with human agents and infrastructure. These findings suggest that reliable urban deployment of AVs requires moving beyond passive fallback strategies toward human-interactive autonomy.

[HC-16] Beyond Her: Safety Dynamics in Role-play AI Companions

Link: https://arxiv.org/abs/2606.28968
Authors: Zehang Deng, Zhaoyang Xie, Changzhou Han, Hiran Thabrew, Wanlun Ma, Yue Huang, Jason(Minhui)Xue, Sheng Wen, Tianqing Zhu, Yang Xiang
Subjects: Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
Comments: Under review

Abstract:The film ‘Her’ pictured a future of love between humans and AI. That future has quietly emerged in the form of Role-play AI Companions (RACs), where emotionally responsive interactions blur the boundary between tool use and relational engagement. However, the safety implications remain poorly understood, as user experiences evolve over time through safety dynamics, spanning both emotional and risk behavioral dynamics, that can gradually shift interactions toward risk. In this paper, we investigate safety dynamics in RAC usage through a two-part mixed-methods study (Study I & II). (1) Study I consists of semi-structured interviews (N = 16) to identify the key factors shaping these dynamics. We find that users’ internalizing problems, the role personality adopted by the RAC, and risk interaction patterns jointly shape safety dynamics. Building on these insights, (2) Study II conducts a 14-day Ecological Momentary Assessment (N = 102) to examine how safety dynamics unfold in real-world usage. We identify distinct user profiles based on internalizing problems and show that interactions with RACs can produce short-term emotional relief while masking longer-term deterioration. Furthermore, vulnerable users exhibit more unstable risk behavioral patterns over time, making risk emergence less predictable and harder to mitigate with static safeguards. Our findings highlight the importance of modeling safety as a dynamic process rather than a static property. We conclude with three-layer design implications for next-generation AI companions, advocating for adaptive safeguards that can respond to evolving emotional and behavioral signals.

[HC-17] Exploring the Value of Diverse LLM Explanations in Introductory Programming

Link: https://arxiv.org/abs/2606.28882
Authors: Seth Bernstein, Paul Denny, Juho Leinonen, Kush Patel, Rayhona Nasimova, Matt Littlefield, Stephen MacNeil
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 7 pages, 4 figures; accepted to SIGCSE Virtual 2026

Abstract:Large Language Models (LLMs) have shown the potential to generate code explanations that surpass those of peers in quality, offering promising opportunities for computer science education. While these explanations may not yet match the depth and clarity of instructor-provided explanations, research in computational creativity highlights that the quantity and diversity of ideas can often outweigh a singular focus on quality. Inspired by this, we explore whether combining multiple diverse explanations, each emphasizing distinct aspects (e.g., function, concept, goal), can enhance students’ understanding of programming exercises compared to generic explanations that do not emphasize distinct conceptual aspects. In our study 971 first-year computing students were randomly assigned either diverse or generic LLM-generated explanations for two programming exercises. Students completed multiple-choice and open-ended questions for each exercise, followed by Likert-scale questions and open-ended reflections. Our findings outline patterns in student performance and perceived cognitive load across the two explanation conditions. These findings highlight how variation in explanation emphasis may relate to learner engagement and understanding. Across participants, open-ended response accuracy was consistently about 7.7% higher when students received diverse explanations, with no difference in perceived cognitive load.

[HC-18] Telephony Voice Agent for Banking Services

Link: https://arxiv.org/abs/2606.28779
Authors: Nitya Dhagat, Vipul K. Dabhi, Harshadkumar B. Prajapati, Zankhana J. Barad
Subjects: Human-Computer Interaction (cs.HC)
Comments:

Abstract:This paper proposes a voice-powered, AI-based banking system built on Google Conversational Agent (Dialogflow CX), which provides safe and convenient banking by phone. The system supports essential banking functions such as balance inquiries, transaction history retrieval, card activation, and PIN-based authentication for sensitive tasks, and it hands off smoothly to live human agents for complex or out-of-scope queries. Tests performed with long-duration calls, high concurrency, and noisy environments showed the system to be scalable, responsive, and resilient. All data used is stored securely in the cloud for efficiency and security in real-time voice interactions. The result is a voice-based banking solution that is efficient and easy to use.

[HC-19] Designing Automation Boundaries for Trustworthy Smart Medication Support

Link: https://arxiv.org/abs/2606.28777
Authors: Liqian You, Jianlong Zhou
Subjects: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Smart medication systems increasingly automate medication recognition, reminders, and logging. However, automation in home medication routines should be carefully bounded, as users may have different capabilities, privacy expectations, and needs for control over decisions. We present a mixed-methods study of a Smart Medication Support system comparing three automation conditions: confirmation required, automatic logging with undo, and fully automatic support. Across 53 participants and interviews with 11 older adults, we found that higher automation did not necessarily lead to higher trust or acceptance. Participants preferred automation that reduced routine effort while preserving opportunities for correction. Fully automatic support was less interruptive but was rated lower in autonomy, trust, transparency, dignity, and satisfaction. Interviews also showed clear differences among older adults. Their preferences were shaped by privacy concerns, digital confidence, perceived vulnerability, and caregiver involvement. We contribute empirical evidence and design implications for calibrating automation in smart medication systems according to task risk, user control, and ethical acceptability.

[HC-20] Four Types of LLM Reliance and Their Predictors Among Undergraduate Writers: A Mixed-Methods Study at a Minority-Serving R1 University

Link: https://arxiv.org/abs/2606.28749
Authors: Shahin Hossain
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 18 pages, 5 figures

Abstract:Although most undergraduates now use large language models (LLMs), a form of generative artificial intelligence (GenAI) for academic writing, no validated method distinguishes the qualitatively different ways students rely on them. Existing instruments assess reliance solely by frequency of use, a measure that, as this study shows, inadvertently rewards dependence on AI rather than recognizing students’ own intellectual contribution. Conducted at a public minority-serving university and grounded in the AI Literacy Framework, Expectancy-Value Theory, and Biggs’s Presage-Process-Product model, the study drew on 382 undergraduates, 14 interviews, and 396 open-ended survey responses. Four distinct reliance types were identified and confirmed: Strategic (34.3%), Instrumental (30.9%), Dialogic (30.4%), and Dependent (4.5%). Students’ value and cost beliefs predicted the intensity of their reliance on LLMs, whereas their AI literacy predicted the type of reliance they adopted, indicating that differentiated support is needed. Notably, Strategic users, those who engaged AI most deliberately, scored lowest on standard outcome measures. This pattern reflects a limitation of current instruments, which index AI’s contribution rather than writing quality, thereby penalizing students who show the greatest independent thinking. Analysis also revealed an additional group, roughly 13%, who declined to use AI for ethical rather than practical reasons, and who existing frameworks overlook. These findings carry implications for AI literacy programs, the measurement of student learning outcomes, and equitable AI policy at minority-serving institutions.

[HC-21] “If I Can See You”: Understanding Spatially Situated Virtual Embodiment in Close Human-AI Relationships

Link: https://arxiv.org/abs/2606.28714
Authors: Yulin Chen, Yang Zhan, Qiao Jin
Subjects: Human-Computer Interaction (cs.HC)
Comments: 17 pages, 3 figures

Abstract:AI companions are increasingly used for emotional support, companionship, and intimate interaction. While prior work has examined text- and voice-based AI companionship and emerging XR companion designs, less is known about how users with existing close AI companion relationships expect those relationships to change when companions become virtually embodied and spatially situated in everyday environments. To address this gap, we conducted a qualitative study with 17 AI companion users recruited from Reddit AI companion communities. We frame spatially situated virtual embodiment as a form of relational escalation: embodiment can make AI companionship more present, socially legible, and risk-sensitive in everyday life. Our findings show that: (1) embodiment creates tensions between support and intrusion, concreteness and imaginative openness, and growth and consistency; (2) embodiment can turn private AI companionship into a socially legible relational arrangement, requiring visibility, form, interaction style, and mode of access to be negotiated across social contexts; and (3) embodiment can intensify risks of emotional dependence, sensitive disclosure, social judgment, and misguided spatial action by increasing the companion’s perceived relational presence, intimacy, public legibility, and spatial authority. We argue that future system design should first consider when embodiment is warranted, how embodied presence should be staged, how visibility and role boundaries should be negotiated, and how embodied companionship can remain safe. This work contributes to HCI research on human-AI intimacy by showing how virtual embodiment can transform close AI companionship into a spatial, socially visible, and risk-sensitive relationship.

[HC-22] A French OSCE Dialogue Dataset and Controllable Virtual Patient System for Clinical Training SIGDIAL2026

Link: https://arxiv.org/abs/2606.28526
Authors: Doria Bonzi, Tom Bourgeade, Fabrice Lefèvre, Irina Illina
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 9 pages. Accepted at SIGDIAL2026

Abstract:The clinical and communication skills of medical students are commonly assessed through Objective Structured Clinical Examinations (OSCEs), which consist of brief scenario-driven simulations of doctor-patient interactions. However, training is often limited by the low availability of human standardized patients, motivating the development of realistic virtual patients (VPs). To address this gap, we introduce a French OSCE dialogue dataset comprising 240 student-patient training interactions. We build upon it a controllable LLM-based pipeline to generate synthetic OSCE dialogues. The pipeline integrates modular components, such as retrieval-based grounding and a reflection loop, to ensure patient fidelity, coherence, and realism. Additionally, we propose a multi-level evaluation framework assessing patient simulation quality, student performance, and linguistic quality, using an LLM-as-a-Judge approach. Experiments suggest that controllability modules generally improve patient fidelity and student evaluation consistency. Finally, we implement an interactive prototype in which students can practice with a VP and receive automatic feedback.

[HC-23] Drag Infer Reproject: Grounding LLMs through Spatial Interaction for Image Clustering

Link: https://arxiv.org/abs/2606.28517
Authors: Yang Liu, Xuxin Tang, Jiahao Xu, Chris North
Subjects: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Dimension reduction and semantic interaction support image clustering by making similarity structure visible and manipulable. Existing semantic interaction methods encode users’ clustering criterion (a user-interpretable semantic dimension, e.g., action, location, or mood) from direct manipulation to steer reprojection, giving users direct control over the resulting layout. Yet they typically depend on learned embeddings or a predefined criterion. In practice, users’ clustering criterion often emerges gradually and becomes refined through interaction rather than being fully clear at the outset. In this work, we present CriterionSI (Criterion-guided Semantic Interaction), a method that translates incremental drag interactions into criterion-guided reprojection. CriterionSI uses large language models to infer and refine the clustering criterion from sequential user drags, while grounding semantic interpretation in human-provided feedback rather than fixed prior assumptions. CriterionSI combines the inferred criterion with local drags to guide global reprojection. The simulation-based evaluation and usage scenario demonstrate that CriterionSI can discover and refine the target criterion from sequential interactions and progressively produce criterion-aligned clustering layouts. Our code and data are available at: this https URL.

[HC-24] Generative AI Literacy Training Improves Intelligence Analysts' Discrimination of Real and AI-Generated Images

Link: https://arxiv.org/abs/2606.28510
Authors: Negar Kamali, Candice Rockell Gerstner, Jessica Hullman, Matthew Groh
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 26 pages, 5 figures, 1 table

Abstract:Across social and online platforms, people are increasingly exposed to AI-generated images. As a consequence, the task of distinguishing AI-generated from authentic images is becoming a central challenge for information ecosystems. While humans perform better than chance, accuracy falls short of many operational needs. Initial evidence shows that visually oriented training can improve deepfake detection but does not improve participants’ ability to identify real images as real. Here, we investigate the efficacy of a brief training intervention for intelligence analysts employed by the United States government in 2024. We conducted a counterbalanced within-subject randomized experiment in which we showed participants real and AI-generated images varying in pose complexity and scene context and asked them whether each image was real or AI-generated, both before and after an expert delivered a 30-minute training that pointed out patterns in seven real and 50 AI-generated images. We collected 2,544 image-level judgments from 32 intelligence analysts. We find training increased overall accuracy by 9 percentage points (95% CI: [2.7, 15.4]) from a baseline of 72%. We find the improvement is driven by a 14.2 percentage point increase in accuracy for real images (95% CI: [0.7, 27.7]). Through a careful experimental setup that curated matched pairs of real and AI-generated images across pose complexity categories, we reveal how these trainings influence people with different levels of digital forensics and generative AI experience and identify the kind of image-based content where this training intervention appears to be most effective. Ultimately, these results provide causal evidence that a brief, structured training can improve human judgment across a diverse array of real and AI-generated images, informing organizational responses to AI-generated visual misinformation.

[HC-25] When May I Help You? On The Effect of Proactivity on Group Human-Robot Collaboration

Link: https://arxiv.org/abs/2606.28469
Authors: Thomas Vitry, Vanessa Maeder, Kieran Edgeworth, Asihati Hazaiti, Doga Deniz Ates, Connor Gäde, Jan-Gerrit Habekost, Dennis Becker, Stefan Wermter
Subjects: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
Comments: Published at the RO-MAN 2026 conference

Abstract:Robot initiative is a central challenge in multi-party human-robot collaboration. A robot that contributes without being addressed may provide timely support, but it may also disrupt coordination, divide attention, or interrupt turn-taking; a robot that waits to be addressed may preserve human control, but it may also miss opportunities to assist. We investigate this design challenge in a collaborative escape room in which pairs of participants work with a humanoid robot under either a reactive interaction model, where the robot responds only when addressed, or a proactive model, where it listens continuously, contributes autonomously, and periodically re-initiates interaction. We evaluate both models using puzzle-solving performance, interaction frequency, and participant ratings on the Godspeed and RoSAS scales. The proactive model substantially increases interaction frequency, whereas the reactive model shows a descriptively higher overall success rate (92.86% vs. 71.42%). The strongest differences emerge when prior experience and personality are taken into account: participants with LLM experience solve the early puzzles faster in the reactive condition, and participants with prior robot experience show modified evaluations of proactive and reactive interaction as do introverted participants. These findings demonstrate that the effects of robot initiative are simultaneously shaped by users’ prior experience, personality traits and more generally by the needs of the group.

Computer Vision

[CV-0] Open-Vocabulary and Referring Segmentation for 3D Gaussians Using 2D Detectors

Link: https://arxiv.org/abs/2606.30638
Authors: Jameel Hassan, Yasiru Ranasinghe, Vishal Patel
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D Gaussian Splatting (3DGS) has emerged at the forefront of 3D scene reconstruction. Extending 3DGS with language-driven, open-vocabulary understanding has gained significant attention for real-world applications such as embodied AI. Recent methods achieve this by learning an instance feature attribute and assigning semantics by distilling high-dimensional Contrastive Language-Image Pretraining (CLIP) features directly into the scene representation. However, the instance grouping mechanisms of these methods either require a predefined number of instances or suffer from noise in their bottom-up grouping strategies. Furthermore, the reliance on CLIP restricts semantic understanding to simple noun phrases, preventing complex spatial reasoning and referential expression grounding. We present GaussDet, a method that circumvents the need for dense CLIP features by leveraging discrete, open-vocabulary 2D object detectors with referring expression capabilities. We learn instance features for individual Gaussians to decompose the scene into 3D instance groups. By rendering these groups and aggregating semantic votes from multi-view 2D detections, we generate a robust View-Aggregated Semantic Label Distribution (VASD) for each 3D instance. This view-aggregation strategy acts as a strong regularizer, attenuating spurious labels caused by low-quality instance grouping. Our approach enables a straightforward, zero-shot extension from simple language queries to complex referential grounding. Extensive evaluations across two key tasks – open-vocabulary segmentation (LeRF-OVS, ScanNet) and referring expression grounding (Ref-LeRF) – demonstrate that GaussDet achieves consistent improvements over existing methods. Most notably, we achieve a substantial 16.7% mIoU improvement in referential grounding within a strict zero-shot setting.
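The abstract describes aggregating semantic votes from multi-view 2D detections into a per-instance label distribution, but it does not state the voting rule. The following is a sketch under simple assumptions: each view contributes at most one label per 3D instance, views with no detection are skipped, and all votes are weighted equally. The function name and the sample labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def aggregate_votes(view_labels: Iterable[Optional[str]]) -> Dict[str, float]:
    """Turn per-view labels for one 3D instance into a normalized distribution.

    `None` marks a view in which the detector returned nothing for the
    rendered instance; such views cast no vote.
    """
    counts = Counter(label for label in view_labels if label is not None)
    total = sum(counts.values())
    if total == 0:
        return {}  # no view produced a detection for this instance
    return {label: n / total for label, n in counts.items()}


# Five rendered views of one instance; one view has no detection and one
# disagrees with the rest.
votes = ["mug", "mug", None, "cup", "mug"]
distribution = aggregate_votes(votes)
best_label = max(distribution, key=distribution.get)
print(distribution)  # {'mug': 0.75, 'cup': 0.25}
print(best_label)    # mug
```

Under these assumptions a single spurious detection only shifts part of the probability mass, which is in line with the regularizing effect the abstract attributes to view aggregation.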

[CV-1] GROW2: Grounding Which and Where for Robot Tool Use

Link: https://arxiv.org/abs/2606.30632
Authors: Yuhong Deng, Yuyao Liu, David Hsu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Can the robot use a plate to cut a cake if no knife is available? Tool use greatly expands robot capabilities, but to use tools creatively beyond their intended functions, the robot faces the challenge of open-world affordance grounding: select an open-category object to act as a tool and localize its specific region of action. To this end, we introduce GROW$^2$ (GROunding Which and Where), which leverages object parts as a natural abstraction to split the grounding process hierarchically into semantic and geometric levels, thus bypassing the need for data-heavy, end-to-end training. Semantically, GROW$^2$ harnesses the commonsense reasoning of Vision-Language Models (VLMs) to parse a natural-language task instruction, select a suitable object as the tool, and identify task-relevant parts on the tool and the target object. Geometrically, vision foundation models then ground the selected parts into precise 3D regions from a single RGB-D image. Experiments on established benchmarks show that GROW$^2$ outperforms state-of-the-art baselines on affordance prediction benchmarks. Further, it achieves zero-shot generalization over open-category objects and outperforms baselines in both simulated and real-world robot tool use experiments.

[CV-2] Reweighting Framewise Attention in Video Transformers for Facial Expression Understanding ECCV2026

Link: https://arxiv.org/abs/2606.30611
Authors: Seongro Yoon, Donghyeon Cho, Jinsun Park, François Brémond
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2026

Abstract:Understanding facial expressions in videos requires modeling subtle and localized facial dynamics under unconstrained conditions. Although recent Vision Transformer (ViT)-based video models have shown strong performance through large-scale self-supervised pretraining, their attention mechanisms often emphasize dominant global motions and coarse temporal dynamics, limiting sensitivity to fine-grained facial variations. To address this limitation, we propose MiRA (Marginal-induced Attention Redistribution), a plug-in frame-marginal attention redistribution framework for ViT backbones that enhances spatio-temporal selectivity toward subtle facial dynamics without introducing additional trainable parameters. MiRA derives frame-level confidence and intra-frame concentration statistics from self-attention maps to estimate frame-wise marginal importance and redistribute attention toward spatiotemporally localized facial cues. We first introduce a principled exact mode based on post-softmax attention redistribution. To further improve efficiency, we propose flashLite mode, a lightweight pre-softmax approximation that integrates frame-marginal redistribution into FlashAttention kernels while preserving the effectiveness of the exact formulation. Experimental results on challenging Facial Expression Recognition (FER) benchmarks demonstrate consistent improvements over strong ViT baselines.

[CV-3] UnfoldArt: Zero-Shot Recovery of Full Articulated 3D Objects from Text or Image

Link: https://arxiv.org/abs/2606.30608
Authors: Mohamed el amine boudjoghra, Ivan Laptev, Angela Dai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Articulated 3D objects are essential for interactive environments in embodied AI, robotics, and virtual reality, but reconstructing their structure and motion from sparse observations remains challenging. Existing approaches remain largely constrained by a lack of supervised data or of the priors needed to reliably recover articulation, hidden geometry, and internal object structure. We present the first debate-driven agentic approach to articulated 3D object reconstruction from text or image inputs that both grounds articulation reasoning in concrete motion and exposes the occluded geometry revealed under articulation. High-level agents reason about object semantics and motion using knowledge from vision-language and video models, while low-level agents estimate articulation parameters and interaction points; together, they engage in a two-round structured debate that first exploits global–local disagreement and then grounds the agents in freely generated video. The same video prior, conditioned on the agreed articulation, then drives each part through its motion to expose occluded interiors and geometry that cannot be inferred from a single static view. By combining agentic reasoning with a video generative prior, our approach jointly infers articulation and reconstructs complete 3D articulated objects, producing high-fidelity geometry, internal structure, and motion-consistent states beyond directly observed surfaces.

[CV-4] Goku: A Million-Scale Universal Dataset and Benchmark for Instruction-Based Video Editing

Link: https://arxiv.org/abs/2606.30599
Authors: Sen Liang, Cong Wang, Zhentao Yu, Fengbin Guan, Zhengguang Zhou, Teng Hu, Youliang Zhang, Yuan Zhou, Xin Li, Qinglin Lu, Zhibo Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing instruction-based video editing datasets commonly focus on single-task appearance editing, failing to meet the complex creative demands of real-world scenarios. To bridge this gap, we present Goku, a large-scale dataset featuring 2 million high-quality, instruction-aligned video editing pairs, which is the first to extend task boundaries from basic appearance editing to multi-task and structural manipulations (e.g., precise control of subject movement). To tackle the data synthesis challenges inherent in these complex tasks, we design an efficient data synthesis pipeline that decomposes complex edits into controllable sub-problems and introduce a progressive filtering system for data reliability throughout the whole process. Furthermore, we explore the optimal network structures on Goku, and propose Goku-Edit. To deeply comprehend complex editing instructions, Goku-Edit leverages an MLLM as its text encoder and adopts a decoupled dual-branch design: a dedicated mask branch handles structural control, freeing the main branch for appearance rendering. A comprehensive video editing benchmark, Goku-Bench, is also proposed with 1,000 human-verified test cases and 7 novel editing-specific metrics. Evaluated on Goku-Bench, Goku-Edit obtains up to +8% improvement over other open-source models in terms of instruction following.

[CV-5] Towards in-the-wild Egocentric 3D Hand-Object Pose Estimation ECCV2026

Link: https://arxiv.org/abs/2606.30598
Authors: Siddhant Bansal, Zhifan Zhu, Shashank Tripathi, Jiahe Zhao, Michael J. Black, Dima Damen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ECCV 2026; Project Page: this https URL

Abstract:Estimating accurate 3D hand-object pose from in-the-wild egocentric RGB remains challenging due to severe occlusions and ambiguous contact. Existing learning-based methods often struggle to generalise to in-the-wild scenes and are limited by the scarcity of supervision. We address these issues with two contributions. First, we introduce EPIC-Contact, an in-the-wild egocentric dataset of 2.3K clips (62.3K frames) with dense, bijective 3D hand-object contact correspondences and posed meshes. Second, we propose HOPformer, an end-to-end transformer that jointly predicts bi-manual hand and object pose in a single forward pass. A cross-attention decoder conditions object features on hand priors, producing robust pose estimation. We test HOPformer on the in-lab 3D dataset, ARCTIC, as well as our newly introduced EPIC-Contact dataset. HOPformer reaches 82.4% success rate on ARCTIC (+6.2 pts over current SOTA). On EPIC-Contact, it nearly doubles the success rate while reducing contact deviation by 75%. EPIC-Contact, HOPformer code and checkpoints are released: this https URL.

[CV-6] Learning from Reliable Latent Prompts for Visual Recognition with Missing Modalities

Link: https://arxiv.org/abs/2606.30597
Authors: Taixi Chen, Nancy Guo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Large-scale multimodal models (LMMs) have achieved superior performance in visual recognition by synergizing information across diverse, massive-scale paired modalities. In real-world scenarios, however, missing-modality inputs are ubiquitous, causing models optimized for modality-complete data to exhibit precipitous performance degradation. Existing research has introduced prompt learning to mitigate this issue, typically by generating dynamic prompts from instance-level features, regardless of whether the input modalities are complete or partially absent. However, such input-conditioned strategies are hindered by the escalating unreliability of instance-level features; as higher missing rates increase the proportion of incomplete modalities, the resulting instability in prompt learning limits the model’s performance. To address this limitation, we hypothesize that learnable latent prompts themselves encapsulate stable, modality-intrinsic priors that are decoupled from corrupted inputs. Consequently, we propose a novel paradigm: Learning from Reliable Latent Prompts. Unlike prior methods, we model input-agnostic learnable prompts as stable latent anchors that enable robust guidance and effective cross-modal knowledge compensation, even under extreme missing rates (e.g., 90%). Empirical results across three benchmark datasets demonstrate that our “learn-from-latent-prompts” approach achieves state-of-the-art performance across a wide range of missing-modality scenarios. Extensive experiments further confirm the effectiveness of this paradigm in providing a robust solution to the missing-modality problem.

[CV-7] APRIL-MedSeg: A Modular Medical Image Segmentation Toolbox Embracing Modern Paradigms

Link: https://arxiv.org/abs/2606.30577
Authors: Juntao Jiang, Jinsheng Bai, Linxuan Fan, Yali Bi, Jiangning Zhang, Yong Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 31 pages, 1 figure, and 8 tables

Abstract:We present APRIL-MedSeg, a YAML-driven modular framework for 2D medical image segmentation. It provides a unified and extensible ecosystem that decomposes segmentation networks into reusable components. Also, the framework integrates a broad spectrum of advanced paradigms, including semi-supervised learning, domain adaptation, knowledge distillation, weakly supervised learning, and text-guided segmentation as well as foundation model support. A registry-based configuration system with inheritance enables flexible and reproducible experiment management, supporting seamless switching across models, datasets, and training strategies. In addition, the framework provides a unified interface for medical datasets, augmentation pipelines, deployment utilities and model ensembling. Overall, APRIL-MedSeg is designed as a general-purpose research and development platform that bridges algorithmic innovation and practical deployment, while also serving as a structured ecosystem for systematically organizing and reproducing advances in medical image segmentation. The code is available at this https URL under an Apache 2.0 license.

[CV-8] Beyond 2D Matching: A Unified Single-Stage Framework for Geometry-Aware Cross-View Object Geo-Localization

Link: https://arxiv.org/abs/2606.30576
Authors: Liyao Wang, Ruipu Wu, Haojun Xu, Lei Shi, Linjiang Huang, Si Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Cross-view object geo-localization (CVOGL) aims to locate a target object from a query view (e.g., ground or drone) within a geo-tagged reference image (e.g., satellite). Existing approaches heavily rely on 2D appearance matching and are constrained by limited datasets lacking geometric metadata, diverse prompts, and standard field-of-view imagery. To address these intertwined challenges, we first introduce \dataset, a large-scale, high-fidelity building dataset comprising over 220,000 ground-satellite and drone-satellite pairs. It provides multi-modal prompts (points, boxes, masks) and camera poses to enable flexible target referring and explicit spatial modeling. Furthermore, we propose a novel single-stage Geometry-Aware Geo-localization framework (GAGeo), built upon the permutation-equivariant 3D foundation model π^3. By seamlessly integrating visual features, referring prompts, and learnable task tokens, our model adapts the inherited 3D prior to jointly predict bounding boxes, segmentation masks, and camera poses in a single forward pass. Additionally, we introduce a contrastive loss that utilizes the satellite view as a universal anchor, implicitly aligning ground and drone representations to enable zero-shot ground-to-drone localization without requiring triplet training data. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods, exhibiting exceptional generalization ability in unseen scenes and novel cross-view setups.

[CV-9] EcoVideo: Entropy-Orchestrated Video Generation Paradigm in Cloud-Edge Dynamics ECCV2026

Link: https://arxiv.org/abs/2606.30557
Authors: Jiayu Chen, Hengyi Zhang, Maoliang Li, Minyu Li, Zihao Zheng, Xuanzhe Liu, Guojie Luo, Xiang Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: EcoVideo is honored to be accepted by ECCV 2026

Abstract:DiT video generation is latency-intensive due to iterative full-frame denoising, while prior cloud-edge methods largely rely on static inter-step decoupling and cannot leverage inter-frame similarity or adapt to system dynamics. We propose EcoVideo, an entropy-orchestrated framework for dynamic inter-frame decoupling: early-stage self-attention entropy provides a training-free estimate of frame-wise information density for frame selection; a cloud large model denoises sparse high-entropy keyframes; and an edge lightweight model reconstructs the remaining frames via motion-aware interpolation with refinement for temporal stability. EcoVideo further adapts the keyframe budget and edge refinement depth to real-time bandwidth and compute availability, optimizing end-to-end latency under constraints. Experiments on representative DiT video generators show improved quality–efficiency trade-offs and up to 2.9x end-to-end speedup in low-bandwidth, compute-limited edge settings. Code is available at this https URL.
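
The entropy-based frame selection mentioned in the abstract can be illustrated with a small sketch. This is not the authors' code: the function name `select_keyframes`, the choice to average row entropies over queries, and the fixed budget `k` are assumptions made for the example, and the abstract does not say how entropy is aggregated across heads, layers, or denoising steps.

```python
import numpy as np

def select_keyframes(attn, k):
    """Pick the k frames whose attention rows have the highest mean entropy.

    attn: (F, Q, K) attention probabilities for F frames; each row sums to 1.
    Returns (sorted keyframe indices, per-frame entropy).
    """
    eps = 1e-12  # avoids log(0) for exactly-zero attention weights
    row_entropy = -(attn * np.log(attn + eps)).sum(axis=-1)  # (F, Q)
    frame_entropy = row_entropy.mean(axis=-1)                # (F,)
    top = np.argsort(frame_entropy)[::-1][:k]                # highest entropy first
    return np.sort(top), frame_entropy

# Toy input: frame 0 is sharply peaked, frame 1 is uniform, frame 2 is in between.
attn = np.stack([
    np.tile(np.eye(4)[0], (3, 1)),
    np.full((3, 4), 0.25),
    np.tile([0.7, 0.1, 0.1, 0.1], (3, 1)),
])
keyframes, entropy = select_keyframes(attn, k=2)
print(keyframes, entropy.round(3))
```

In a cloud-edge setting like the one described, the selected indices would go to the large model and the remaining frames to the lightweight one; adapting `k` to bandwidth is left out of this sketch.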

[CV-10] Training Vision-Language-Action Models with Dense Embodied Chain-of-Thought Supervision

Link: https://arxiv.org/abs/2606.30552
Authors: Haoyang Li, Guanlin Li, Youhe Feng, Chen Zhao, Zhuoran Wang, Yang Li, Qizhe Wei, Shifeng Bao, Haitao Shen, Yihan Zhao, Tong Yang, Jing Zhang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Cross-embodiment transfer in vision-language-action (VLA) models remains challenging because low-level state and action spaces differ fundamentally across robot platforms. We observe that the high-level cognitive process underlying manipulation, including scene perception, object identification, task planning, and sub-task decomposition, is largely shared across embodiments. Based on this observation, we present ZR-0, a 2.6 billion parameter end-to-end VLA model that uses dense Embodied Chain-of-Thought (ECoT) supervision to align cross-embodiment representations within the vision-language model (VLM). ZR-0 adopts a dual-stream architecture: a pre-trained VLM (System 2) generates structured ECoT reasoning during training, while a Diffusion Transformer-based action expert (System 1) produces continuous action chunks via flow matching. The two components are coupled through cross-attention, with an attention mask that restricts the action expert to input prompt features only, enabling ECoT generation to be entirely skipped at inference without any performance loss. ZR-0 is pre-trained on ProcCorpus-60M, a large-scale dataset comprising approximately 60 million frames (approximately 1,000 hours) from over 400K trajectories, with dense ECoT annotations covering 96.8% of all frames. We evaluate ZR-0 on three simulation benchmarks spanning single-arm (LIBERO), bimanual (RoboTwin 2.0), and humanoid (RoboCasa GR-1 Tabletop) embodiments, as well as real-world experiments on the xArm platform, demonstrating strong performance across all settings. Code and model checkpoints are available at this https URL.
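
A minimal sketch may help show why a prompt-only attention mask lets reasoning tokens be skipped at inference. It is an illustration under stated assumptions, not the ZR-0 implementation: `cross_attend` and `prompt_only_mask` are hypothetical helpers, single-head attention without learned projections is used, and prompt features are assumed not to depend on later ECoT tokens (as with causal attention in the VLM).

```python
import numpy as np

def cross_attend(queries, keys, values, mask):
    """Single-head cross-attention. mask[i, j] is True where query i may see token j."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    scores = np.where(mask, scores, -np.inf)          # hide masked tokens
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

def prompt_only_mask(n_action, n_prompt, n_ecot):
    """Action queries see the prompt tokens and none of the ECoT tokens."""
    mask = np.zeros((n_action, n_prompt + n_ecot), dtype=bool)
    mask[:, :n_prompt] = True
    return mask

rng = np.random.default_rng(0)
n_action, n_prompt, n_ecot, d = 4, 6, 10, 8
context = rng.standard_normal((n_prompt + n_ecot, d))   # [prompt | ECoT] features
queries = rng.standard_normal((n_action, d))

# Training-style pass: ECoT tokens are present but masked out.
with_ecot = cross_attend(queries, context, context,
                         prompt_only_mask(n_action, n_prompt, n_ecot))
# Inference-style pass: ECoT tokens are never generated.
prompt = context[:n_prompt]
without_ecot = cross_attend(queries, prompt, prompt,
                            np.ones((n_action, n_prompt), dtype=bool))
print(np.allclose(with_ecot, without_ecot))
```

Because masked tokens receive zero attention weight, the action output is the same whether or not the ECoT tokens exist, which is the property the abstract relies on.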

[CV-11] StereoGS: Sparse-View 3D Gaussian Splatting via Stereo Priors ECCV2026

Link: https://arxiv.org/abs/2606.30545
Authors: Wenhao Yuan, Yiyuan Ge, Deli Cai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 6 figures, accepted to ECCV 2026, project page: this https URL

Abstract:3D Gaussian Splatting (3DGS) has achieved remarkable success in real-time novel view synthesis, yet it suffers from severe overfitting under sparse-view settings due to insufficient geometric constraints. While recent methods introduce monocular depth priors to mitigate this, they inherently struggle with scale ambiguity and cross-view inconsistency, leading to defective geometry. In this paper, we propose StereoGS, a novel sparse-view 3DGS framework that integrates stereo priors to establish reliable binocular consistency. Unlike scale-agnostic monocular constraints, StereoGS introduces a Stereo Depth Regularization by constructing virtual stereo pairs during optimization and leveraging a foundation stereo model to enforce absolute scale and binocular-consistent structures. To further suppress overfitting and eliminate redundant primitives, we design a Gradient-Aware Opacity Decay strategy that dynamically penalizes Gaussians based on their relative opacity gradient magnitudes. Combined with a Consistency-Aware Dense Initialization using zero-shot multi-view depth estimation, StereoGS effectively anchors primitives to accurate scene surfaces. Extensive experiments on LLFF, DTU, Mip-NeRF360, and Blender datasets demonstrate that StereoGS achieves state-of-the-art performance in sparse-view settings without incurring any additional inference overhead. Project Page: this https URL

[CV-12] Learning from Mistakes: Rollout-Retrieval Lifelong Policy Learning for Autonomous Driving

Link: https://arxiv.org/abs/2606.30537
Authors: Cheng Gong, Haoyang Wang, Chao Lu, Zirui Li, Jianwei Gong
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 15 pages, 6 figures. Code available at: this https URL

Abstract:Autonomous driving policies should be able to improve continually as deployment exposes them to increasingly diverse and long-tail traffic situations. However, most learning-based policies are trained or fine-tuned on expert demonstrations and then rely largely on generalization to handle challenging closed-loop scenarios, lacking an explicit mechanism to correct and retain the mistakes exposed in these scenarios. This paper studies autonomous driving policy improvement from a lifelong learning perspective: Can a pretrained policy improve continually by accumulating corrective knowledge derived from its own mistakes, while retaining previously acquired driving competence? To answer this question, we propose Rollout-Retrieval Lifelong Policy Learning (R^2LPL), a policy learning framework that retrieves corrective targets from recoverable policy-induced mistakes and retains the resulting knowledge through lifelong policy learning. R^2LPL addresses a key bottleneck in continual policy improvement: closed-loop mistakes reveal where the policy is weak, but do not directly specify what the policy should learn. By filtering recoverable mistake-related states and retrieving feasible corrective targets, R^2LPL turns sparse failure evidence into compact supervised knowledge for stable and sample-efficient policy improvement. We evaluate R^2LPL on large-scale closed-loop nuPlan benchmarks. With only a few rollout and continual-learning cycles, R^2LPL elevates a learning-based planner with moderate initial performance to state-of-the-art performance across the evaluated benchmarks, especially on the challenging and long-tail Test14-hard split. These results demonstrate the effectiveness of R^2LPL in converting recoverable closed-loop mistakes into corrective knowledge for sustained policy improvement.

[CV-13] Orca: The World is in Your Mind

Link: https://arxiv.org/abs/2606.30534
Authors: Yihao Wang, Yuheng Ji, Mingyu Cao, Yanqing Shen, Runze Xiao, Huaihai Lyu, Senwei Xie, Euan Liu, Klara Tian, Tianfeng Long, Yichi Zhang, Zhengliang Cai, Ruike Chen, Jifan Zhao, Ruochuan Shi, Zihan Tang, Jing Lyu, Wenxing Tan, Ningbo Zhang, Yangtao Hu, Yuming Gao, Xiansheng Chen, Junkai Zhao, Congsheng Xu, Boan Zhu, Ziqi Wang, Yupu Feng, Qiongqiong Zhang, Yingli Zhao, Yulong Ao, Shaoxuan Xie, You Liu, Guocai Yao, Leiduo Zhang, Xiaodan Liu, Yunyan Zhang, Yance Jiao, Xinyan Yang, Jiaxing Wei, Xu Liu, Tengfei Pan, Shaokai Nie, Chunlei Men, Sen Cui, Xiaojie Jin, Hongyang Li, Jianlan Luo, Yao Mu, Yunchao Wei, Jun Yan, Hang Zhao, Xiaolong Zheng, Jiaming Li, Yonghua Lin, Tiejun Huang, Zhongyuan Wang, Pengwei Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:We introduce Orca, an initial instantiation of a general world foundation model. Orca learns a unified world latent space from multimodal world signals and exposes it through multimodal readout interfaces. Rather than optimizing isolated next-token, next-frame, or next-action prediction, we center on Next-State-Prediction modeling, offering a unified state-transition modeling route toward understanding, predicting, and acting upon the world. Orca learns through two complementary paradigms: unconscious learning captures dense natural state transitions from continuous videos, and conscious learning models sparse meaningful state transitions through language-described events and VQA supervision. For pre-training, we construct a large-scale world-learning data inventory, including 125K hours of video data and 160M event annotations. After pre-training, Orca learns a unified world latent space. To examine whether the learned latent space supports downstream tasks, we evaluate it through three representative downstream readouts: text generation, image prediction, and embodied action generation. Orca’s backbone is frozen, and only the lightweight modality-specific decoders are trainable. Experiments show the scalability of the proposed paradigm and verify that a stronger world latent enables stronger downstream readouts. Orca outperforms similar-sized specialized baselines. These results show that Orca, as a general world foundation model, presents a promising approach to understanding, predicting, and acting upon the world. Finally, we discuss the current limitations, aiming to provide useful insights and inspiration for the community.

[CV-14] μFlow: Leveraging Average Images for Improving Generalisation of Deepfake Faces Detectors ECCV

Link: https://arxiv.org/abs/2606.30528
Authors: Orazio Pontorno, Mattia Litrico, Luca Guarnera, Mario Valerio Giuffrida, Sebastiano Battiato
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at the European Conference on Computer Vision (ECCV) 2026

Abstract:Current generative models, including GANs and diffusion models, have reached an outstanding level of photorealism, posing significant risks to privacy and security. To ensure real-world applicability, deepfake detectors must generalise effectively to unseen generators. However, most existing approaches rely on supervised training with both real and fake images, which limits their generalisation, especially across generator categories (e.g. GANs vs DMs). In this work, we introduce μFlow, a one-class deepfake detector trained only on real images without relying on pseudo-deepfakes or synthetic artifacts. Our approach builds on the observation that averaging multiple images amplifies consistent generative traces, producing highly discriminative feature representations. We leverage this property by modelling the distribution of features extracted from averaged images and training a normalizing flow to align the feature space of individual images with this distribution. This alignment yields a likelihood-based criterion that separates real and fake samples while promoting strong generalisation. We evaluate μFlow on a fully out-of-distribution setting, where both real and fake datasets are unseen during training. Experimental results show that our method significantly outperforms SOTA detectors. Project page: this https URL.
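
The observation that averaging amplifies consistent traces can be checked on toy data. The sketch below is only an illustration and not the paper's pipeline: it assumes a purely additive model in which every image is independent Gaussian "content" plus the same faint pattern `trace`, with `alpha` and the image count chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)
trace = rng.standard_normal((32, 32))   # hypothetical shared trace pattern
alpha = 0.1                             # the trace is weak next to image content

def make_image():
    # Independent content per image, plus the same faint trace in every image.
    return rng.standard_normal((32, 32)) + alpha * trace

def trace_correlation(img):
    # Pearson correlation between an image and the shared trace.
    return float(np.corrcoef(img.ravel(), trace.ravel())[0, 1])

single = make_image()
average = np.mean([make_image() for _ in range(256)], axis=0)

corr_single = trace_correlation(single)
corr_average = trace_correlation(average)
print(f"single image: {corr_single:.2f}, average of 256: {corr_average:.2f}")
```

Averaging N images shrinks the independent content by roughly a factor of sqrt(N) in standard deviation while leaving the shared pattern untouched, so the pattern dominates the averaged image.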

[CV-15] HASTE: A Framework for Training-Free Dynamic and Steerable Compression of Pre-Trained Convolutional Neural Networks

Link: https://arxiv.org/abs/2606.30516
Authors: Lukas Meiner, Jens Mehnert, Alexandru Paul Condurache
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this article is published in Springer Nature Computer Science, and is available online at this https URL

Abstract:Deploying large convolutional neural networks (CNNs) on resource-constrained devices is challenging due to their high computational cost. While dynamic execution methods are promising, existing approaches for CNNs typically require specialized training or fine-tuning, limiting their effectiveness when applied to pre-trained models and requiring data access. To address this gap, we propose HASTE (Hashing for Tractable Efficiency), a plug-and-play convolution module that enables training-free, dynamic compression of large pre-trained CNNs. At inference time, HASTE uses locality-sensitive hashing to identify and merge redundant channels of latent feature maps on a patch-wise basis. This process simultaneously compresses the depth of both input features and their corresponding filters, resulting in computationally cheaper convolutions. We conduct extensive experiments on CIFAR-10 and ImageNet across a range of architectures, demonstrating a 46.2% FLOPs reduction in a ResNet34 on CIFAR-10 with only a 1.25% drop in accuracy, without any retraining. We support our claims by comprehensive ablation studies to validate our core design choices, an analysis of the method’s properties and limitations, and a discussion that connects our channel merging scheme to the conceptually related task of token merging in Vision Transformers. Our results demonstrate that HASTE provides an effective solution for steerable compression of pre-trained CNNs at runtime, opening new possibilities for the deployment of efficient deep learning methods.
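
To make the channel-merging idea concrete, here is a rough sketch under simplifying assumptions; it is not the HASTE module itself. The function `merge_channels_lsh` is hypothetical, it uses random-hyperplane hashing on a single patch, and it handles only a 1x1 convolution. Channels with the same hash code are averaged and their filter slices are summed, so the convolution runs over fewer input channels.

```python
import numpy as np

def merge_channels_lsh(patch, weight, n_planes=8, seed=0):
    """Merge channels of one patch that fall into the same LSH bucket.

    patch:  (C, P) array, C channels flattened to P spatial positions.
    weight: (K, C) array, a 1x1 convolution with K output channels.
    Returns the reduced patch (B, P) and reduced weight (K, B), with B <= C.
    """
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_planes, patch.shape[1]))
    # The sign pattern of the projections is the hash code of each channel.
    codes = (patch @ planes.T) > 0                                  # (C, n_planes)
    keys = codes.astype(np.int64) @ (1 << np.arange(n_planes))      # one int per channel
    _, bucket = np.unique(keys, return_inverse=True)
    n_buckets = int(bucket.max()) + 1
    merged_patch = np.zeros((n_buckets, patch.shape[1]))
    merged_weight = np.zeros((weight.shape[0], n_buckets))
    for b in range(n_buckets):
        members = bucket == b
        merged_patch[b] = patch[members].mean(axis=0)          # average similar channels
        merged_weight[:, b] = weight[:, members].sum(axis=1)   # sum their filter slices
    return merged_patch, merged_weight

# Toy patch with redundant channels: three copies of one row and its negation.
rng = np.random.default_rng(1)
row = rng.standard_normal(9)
patch = np.stack([row, row, -row, row])
weight = rng.standard_normal((5, 4))
small_patch, small_weight = merge_channels_lsh(patch, weight)
print(small_patch.shape, np.allclose(small_weight @ small_patch, weight @ patch))
```

When merged channels are exact duplicates the output is unchanged; for channels that are only similar, the result is an approximation whose quality depends on the number of hyperplanes.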

[CV-16] 3D Scene-Adaptive Trajectory-Controllable Human Image Animation with Camera Movement

Link: https://arxiv.org/abs/2606.30514
Authors: Deyin Liu, Jicheng Xu, Lin Yuanbo Wu, Xiaowei Zhao, Xiatian Zhu, Zhe Jin, Anjan Dutta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Human image animation, which aims to generate a video of a reference subject following a provided action sequence, has received increasing research interest. With the development of diffusion-based/flow-based video foundation models, existing animation works have begun to upgrade the guidance information from 2D skeleton/pose to 3D modeling conditions. Despite achieving reasonable results, these approaches face challenges in synthesizing trajectory-controllable human motion within natural scenes under changing camera views. In this work, we present a scene-adaptive human image animation framework that controls both human motion and camera trajectories within a reconstructed 3D environment for video generation. To achieve this, we first develop a ground-adaptive 3D motion retargeting approach to enable user-friendly motion trajectory control that adapts automatically to changes in ground elevation and orientation. Then we design a viewpoint-adaptive latent fusion mechanism to inject point-cloud geometric priors through scene-visibility masking into the generative process, providing precise guidance of viewpoint changes under camera control. Experiments on two standard human image animation benchmark datasets demonstrate remarkable improvements of our method over the state of the art in related video generation metrics. Project page: this https URL

[CV-17] High-Resolution Flood Mapping With Sentinel-1 and Sentinel-2 via Misalignment-Robust Cross-Sensor Learning and Generative Despeckling

Link: https://arxiv.org/abs/2606.30511
Authors: David Ma, Jeremy Feinstein, Shreya Pandit, Arkaprabha Ganguli, Eugene Yan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reliable high-resolution flood extent mapping from satellite imagery remains constrained by limited data fidelity and sensor-specific artifacts. Multispectral optical imagery is degraded by clouds, shadows, and urban confounders, while synthetic aperture radar (SAR) imagery is affected by speckle noise and sensor co-registration uncertainty. This work presents an integrated flood mapping framework that jointly addresses these limitations through curated datasets and novel learning strategies. We introduce a new Sentinel-2 (S2) and Sentinel-1 (S1) dataset covering the contiguous United States, featuring pixel-accurate 10 m water masks with emphasis on challenging weather conditions and urban environments that are underrepresented in existing benchmarks. High-quality S2 annotations are manually produced using rigorous geospatial labeling protocols and transferred to SAR imagery through weakly labeled temporally coincident acquisitions. To address SAR-specific artifacts, a shift-invariant loss function is employed to tolerate residual geolocation uncertainty between SAR imagery and optical-derived labels, and a Conditional Variational Autoencoder (CVAE) is trained on multitemporal SAR composites to suppress speckle while preserving flood-relevant spatial structure. Experiments using UNet and UNet++ architectures demonstrate strong multispectral performance (AUPRC up to 0.956) and statistically significant improvements in SAR flood mapping when using shift-invariant loss and CVAE-based despeckling compared to classical filters. These results underscore the importance of dataset fidelity and misalignment-robust training, and demonstrate the viability of generative despeckling for operational flood mapping.
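
One common way to make a segmentation loss tolerant of small label misalignment is to take the minimum loss over a window of integer label shifts. The sketch below shows that idea; it is an assumption about the general technique, not the paper's exact formulation, and `shift_invariant_bce` and `max_shift` are names chosen for the example.

```python
import numpy as np

def shift_invariant_bce(pred, label, max_shift=2):
    """Minimum binary cross-entropy over label shifts of up to max_shift pixels.

    pred:  (H, W) predicted water probabilities.
    label: (H, W) binary mask, possibly misaligned by a few pixels.
    Only the interior is compared, so shifted labels never wrap around.
    """
    eps = 1e-7
    p = np.clip(pred, eps, 1 - eps)
    H, W = label.shape
    m = max_shift
    core = p[m:H - m, m:W - m]
    best = np.inf
    for dy in range(-m, m + 1):
        for dx in range(-m, m + 1):
            shifted = label[m + dy:H - m + dy, m + dx:W - m + dx]
            bce = -(shifted * np.log(core) + (1 - shifted) * np.log(1 - core)).mean()
            best = min(best, bce)
    return best

# Toy case: the prediction matches the label after a one-pixel diagonal shift.
label = np.zeros((16, 16))
label[5:10, 5:10] = 1
pred = np.where(np.roll(label, (1, 1), axis=(0, 1)) == 1, 0.9, 0.1)
print(shift_invariant_bce(pred, label, max_shift=0),
      shift_invariant_bce(pred, label, max_shift=2))
```

With `max_shift=0` the function reduces to ordinary cross-entropy, which penalizes the one-pixel offset; allowing shifts lets the loss find the alignment under which the prediction is consistent with the label.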

[CV-18] On the Faithfulness of Post-Hoc Concept Bottleneck Models ECCV2026

Link: https://arxiv.org/abs/2606.30498
Authors: Laines Schmalwasser, Jan Blunk, Niklas Penzel, Julia Niebling, Joachim Denzler
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at ECCV 2026, 41 pages, 13 figures, 2 tables

Abstract:Human decision-making interprets the world through high-level concepts, such as recognizing a bird by its belly color. To bridge the gap between opaque deep learning representations and human understanding, Post-Hoc Concept Bottleneck Models (post-hoc CBMs) project latent features onto interpretable concept spaces using auxiliary datasets or vision-language models. However, relying on target task accuracy as the primary measure of post-hoc CBM success obscures whether the learned concepts are semantically meaningful or merely predictive artifacts. For example, random concept projections can achieve competitive accuracy despite being semantically meaningless. In this work, we analyze the learned projections directly and identify two failure cases: First, for concept projections learned from auxiliary data, covariate shifts can lead to unfaithful concept representations for the target task. In particular, we provide an upper bound on the error introduced by this shift. Second, systematic label noise in surrogate concept labels generated by vision-language models leads to unfaithful projections. After formalizing these failure modes, we introduce novel metrics that decouple concept faithfulness from predictive accuracy. Our empirical results across real-world and synthetic benchmarks confirm that these metrics identify unfaithful behaviors that standard accuracy-based evaluation fails to detect.

[CV-19] RBE-Flow: Recurrent Bayesian Estimation on Feature Manifolds for Cross-Modal Registration ECCV2026

Link: https://arxiv.org/abs/2606.30492
Authors: Mengzhu Ding, Xin Song, Xiaoke Ding, Hongwei Ding, Xuecong Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026

Abstract:Cross-modal image registration is essential for multi-sensor perception but remains fundamentally challenging due to severe non-linear radiometric discrepancies and geometric distortions. Existing deterministic matching methods lack uncertainty awareness, struggling to navigate the resulting highly non-convex optimization landscape and frequently accumulating errors in ambiguous regions. In this paper, we propose RBE-Flow, a novel framework that reformulates dense cross-modal flow estimation as a closed-loop recurrent Bayesian estimation problem on learned feature manifolds. Diverging from standard feed-forward regression, RBE-Flow establishes a robust self-correcting mechanism by deeply coupling feature-metric non-linear optimization with probabilistic state updates. Specifically, a Recurrent Manifold Optimization (RMO) block iteratively generates flow observations and their associated uncertainties, which are then optimally assimilated into the prior state via an Uncertainty-Adaptive Probabilistic Update (UAPU) using deterministic sigma-point projection. Crucially, the resulting calibrated posterior covariance is fed back to adaptively regularize the damping of subsequent optimization steps, allowing the system to modulate its convergence based on predictive confidence. To ensure stable probabilistic training, we introduce a hybrid supervision scheme featuring a geometry-aware rectified NLL loss that structurally prevents variance collapse. Extensive experiments on challenging OSdataset, WHU-OPT-SAR, and RoadScene benchmarks demonstrate that RBE-Flow consistently achieves state-of-the-art performance, outperforming existing methods by a significant margin, particularly under strict sub-pixel criteria. Project page: this https URL
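
The core of any uncertainty-aware recurrent update is fusing a prior estimate with a new observation according to their variances. The sketch below shows the simplest linear-Gaussian version of that step, applied per pixel; it deliberately swaps in a plain Kalman-style update and does not reproduce the sigma-point projection or the learned components described in the abstract. The function name `gaussian_fuse` is an assumption.

```python
import numpy as np

def gaussian_fuse(prior_mean, prior_var, obs, obs_var):
    """Fuse a Gaussian prior with a Gaussian observation (elementwise).

    Returns the posterior mean and variance. A confident observation
    (small obs_var) pulls the estimate strongly toward obs; an uncertain
    one leaves the prior almost unchanged.
    """
    gain = prior_var / (prior_var + obs_var)
    post_mean = prior_mean + gain * (obs - prior_mean)
    post_var = (1.0 - gain) * prior_var
    return post_mean, post_var

# Toy flow field: two pixels, the second observation is very uncertain.
prior_mean = np.array([0.0, 0.0])
prior_var = np.array([1.0, 1.0])
obs = np.array([2.0, 2.0])
obs_var = np.array([1.0, 1e6])
mean, var = gaussian_fuse(prior_mean, prior_var, obs, obs_var)
print(mean, var)
```

Feeding `var` back into the next iteration, for example to scale a damping term, is the kind of closed loop the abstract describes, although the exact coupling used there is not specified here.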

[CV-20] PGE-SAM: Prompt-Guided Feature Enhancement for Interactive Segmentation under Degradation

Link: https://arxiv.org/abs/2606.30477
Authors: Tuan-Duc Nguyen, Anh-Tuan Mai, Duc-Trong Le
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 54 pages

Abstract:Segment Anything Model (SAM) has revolutionized promptable image segmentation with strong zero-shot generalization. However, its performance degrades substantially under real-world imaging artifacts such as noise, blur, and compression. Existing methods restore features globally without focusing on segmentation-relevant regions and neglect SAM’s iterative refinement mechanism, leading to suboptimal performance in interactive settings. We propose Prompt-Guided Feature Enhancement SAM (PGE-SAM), a framework that explicitly leverages user prompts and prior mask predictions to spatially guide the feature restoration process toward regions of interest through a Prompt Guidance Generator. To recover fine-grained details lost under degradation, we introduce Multi-Scale Features Interaction to incorporate low-level encoder features, along with a Foreground Reconstruction Loss that restricts feature-level supervision to the segmentation target. Furthermore, we present DM-Seg, a benchmark for interactive segmentation on degraded medical images, spanning multiple imaging modalities with both general and modality-specific degradations at varying severity levels. Extensive experiments demonstrate that PGE-SAM achieves SOTA robustness on both medical and natural image domains across multiple degradation levels, while maintaining generalization to clean images and adding less than one-fifth of the parameters of prior methods.

[CV-21] PS-MOT: Cultivating Instance Awareness from Point Seeds for Multi-Object Tracking ECCV2026

Link: https://arxiv.org/abs/2606.30476
Authors: Kai Luo,Fei Teng,Mengfei Duan,Wanjun Jia,Xu Wang,Hao Shi,Kunyu Peng,Zhiyong Li,Kailun Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Comments: Accepted to ECCV 2026. The source code is available at this https URL

Click to view abstract

Abstract:We introduce Point-supervised Multi-Object Tracking (PS-MOT) as a cost-effective alternative to traditional bounding box supervision, shifting the focus from spatial fitting to topological center-driven representation. However, PS-MOT faces challenges, e.g., spatial ambiguity and identity drift due to the lack of explicit geometric structure and scale constraints. To address these, we propose PS-Track, a hierarchical pipeline transitioning from points to instances across data, model, and loss levels. At the data level, we introduce Temporal-Feedback Prompting (TFP) to evolve points into temporally consistent pseudo-labels using negative spatial cues and motion priors. At the model level, we design the Point-Excited Wavelet Attention (PEWA) module, which leverages semantic correlations to activate high-frequency components, “hallucinating” object boundaries. At the loss level, Uncertainty-Guided Gaussian Learning (UGL) models pseudo-labels as probabilistic distributions, dynamically calibrating supervision intensity. Experiments on DanceTrack, EmboTrack, SportsMOT, and JRDB demonstrate that PS-Track provides a feasible and effective point-supervised alternative across diverse tracking scenarios, establishing a new state-of-the-art for point-supervised tracking. The source code is available at this https URL.

[CV-22] FR-DETR: Frequency and Recurrent Feature Refinement for Robust Object Detection under Adverse Weather

Link: https://arxiv.org/abs/2606.30471
Authors: Tuan-Duc Nguyen,Duc-Trong Le
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages

Click to view abstract

Abstract:Object detection under adverse weather remains challenging due to severe visual degradations and domain shifts. Existing enhancer-based approaches attempt to improve detection by cascading an enhancer with a detector, but they introduce redundant feature extraction and incur high computational cost with limited accuracy gains when paired with SOTA detectors. We propose FR-DETR, a detector-centric framework that refines features rather than images, focusing enhancement on regions of interest and leveraging frequency-domain cues. Specifically, we design (I) a Frequency Refinement Module that dynamically separates and reweights low- and high-frequency components to improve foreground-background discrimination, and (II) a Recurrent Focus Refinement Module (RFRM) that iteratively refines features using coarse predictions as guidance. Extensive experiments demonstrate that FR-DETR achieves superior detection accuracy under adverse weather while being significantly more computationally efficient than enhancer-based methods. Our implementation is available at this https URL.

[CV-23] Cross-Resolution Semantic Transfer for Robust Text-to-Image Retrieval in Low-Resolution Surveillance

Link: https://arxiv.org/abs/2606.30458
Authors: Wenjie Qian,Bin Yang,Xiao Wang,Wenke Huang,Ling Mei,Xin Xu,Mang Ye
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 8 figures, conference

Click to view abstract

Abstract:Text-to-image person re-identification (TIPR) retrieves target persons using natural language descriptions. However, existing methods largely overlook resolution variance in real-world surveillance. We characterize cross-resolution TIPR through two coupled failure modes: Evidence Reliability Collapse (ERC), where degraded visual tokens become unreliable for grounding fine-grained text, and Ranking Distribution Drift (RDD), where mixed-resolution galleries distort similarity neighborhoods and destabilize retrieval rankings. To address this challenge, we propose Cross-Resolution Semantic Transfer (CRST), a CLIP-style framework with three modules: resolution-conditioned reasoning, text-guided refinement, and CR-RDA. Resolution-conditioned reasoning estimates token reliability to suppress corrupted evidence. Text-guided refinement injects semantic priors to recover discriminative cues. CR-RDA transfers HR neighborhood geometry to stabilize LR ranking under mixed resolutions. Experiments on CUHK-PEDES, ICFG-PEDES, and RSTPReid show that CRST improves ultra-low-resolution Rank-1 and mAP on average by 5.7% and 5.3%, while stabilizing mixed-resolution retrieval without sacrificing high-resolution performance. The code will be made publicly available.

[CV-24] Vision-Language-Action Models: Experimental Insights from a Real-World UR5 Platform

Link: https://arxiv.org/abs/2606.30456
Authors: Mathilde Hochedel,Marc Lalonde
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 16 figures

Click to view abstract

Abstract:This project investigates whether recent Vision-Language-Action (VLA) models can be transferred from controlled research benchmarks to a real-world robotic platform, specifically a UR5e manipulator, in a reproducible and operationally meaningful manner. The work integrates real-robot data acquisition, dataset engineering (compatible with the RLDS format), and the fine-tuning and deployment of OpenVLA and OpenVLA-OFT models, with systematic validation of action representations and control interfaces. The project resulted in several foundational assets: (i) a complete real-robot data acquisition pipeline, (ii) a dataset conversion workflow aligned with RLDS standards, (iii) an initial fine-tuning and inference infrastructure for VLA models, and (iv) a structured set of experimental observations grounded in real-robot trials. These elements collectively establish a reproducible framework for evaluating learning-based manipulation systems beyond simulation. Empirically, the experiments reveal a consistent gap between promising offline indicators and unstable closed-loop behavior on the physical system. This gap cannot be attributed solely to model limitations; it is strongly influenced by action semantics, coordinate frame conventions, temporal alignment between modalities, image preprocessing consistency, and dataset coverage and quality. These observations lead to a key interpretation: the successful deployment of VLA systems in real-world settings depends less on incremental improvements in model capacity and more on precise control of the entire data-model-control pipeline. The project reframes VLA-based robotics from a primarily model-centric challenge to a system-level problem; it highlights the difficulty of achieving robust task execution on the real robot and provides a clear, experimentally grounded understanding of the conditions required for reliable deployment.

[CV-25] Robust and Efficient Monocular 3D Gaussian SLAM for Kilometer-Scale Outdoor Scenes

Link: https://arxiv.org/abs/2606.30436
Authors: Sicheng Yu,Dongxu Shen,Beizhen Zhao,Guanzhi Ding,Hao Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Scaling monocular 3D Gaussian Splatting (3DGS) SLAM to kilometer-level outdoor environments poses two tightly coupled challenges: fragile long-term pose tracking and excessive memory overhead during large-scale mapping. In this paper, we propose KiloGS-SLAM, a highly efficient and robust monocular 3DGS-SLAM system that jointly addresses both bottlenecks. Since high-fidelity scene reconstruction fundamentally relies on drift-free camera poses, we first introduce a motion-adaptive hybrid tracking module. This module features a condition-triggered three-tier solving pipeline. It dynamically switches between Essential matrix and PnP models to handle geometric degeneracies. An on-demand foundation model can also be activated to rescue the trajectory from catastrophic drift. To ensure the system can sustain these long trajectories without memory exhaustion, we subsequently design a lifecycle-managed Gaussian mapping strategy. By integrating probabilistic initialization with chunk-based multi-view densification and pruning, this full-pipeline optimization effectively reduces primitive redundancy while preserving high-frequency details. Together, the robust tracking guarantees the geometric foundation required for accurate mapping, while the memory-efficient lifecycle-managed mapping enables large-scale operation. Extensive experiments across three challenging outdoor datasets demonstrate that our approach achieves state-of-the-art tracking accuracy and rendering quality, successfully scaling to sequences of over 10,000 frames on a single GPU.

[CV-26] OWMDrive: Causality-Aware End-to-End Autonomous Driving via 4D Occupancy World Model IROS

Link: https://arxiv.org/abs/2606.30421
Authors: Junjie Cheng,Ruiqi Song,Ye Wu,Nanxing Zeng,Ximiao Li,Yunfeng Ai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: International Conference on Intelligent Robots and Systems (IROS), 2026

Click to view abstract

Abstract:Autonomous driving systems are steadily moving toward end-to-end paradigms to mitigate the limited adaptability of rule-based pipelines in complex traffic environments. However, most existing learning-based methods still make decisions from static representations of the current scene, without explicit future rollouts or modeling of the temporal causal dynamics in traffic interactions. This limitation often results in unstable or overly conservative planning under high-uncertainty conditions, such as occlusions and unexpected events. To overcome these challenges, we introduce OWMDrive, a generative end-to-end driving framework built upon an Occupancy World Model for multi-step 3D occupancy forecasting, which serves as a conditional prior to guide diffusion-based planning. Conditioned on both current observations and predicted future states, the planner iteratively refines trajectory candidates to generate a reinforced driving trajectory. By explicitly modeling scene evolution over future horizons, OWMDrive captures key spatiotemporal causal dependencies, which leads to more foresighted and robust trajectory generation. Extensive experiments demonstrate that OWMDrive significantly improves planning reliability and safety, especially in challenging and partially observable driving scenarios.

[CV-27] Beyond Point Estimates for Glaucoma Visual Field Forecasting with Diffusion Models

Link: https://arxiv.org/abs/2606.30417
Authors: Marta Colmenar Herrera,Pablo Márquez Neila,Şerife Seda Kucur Ergünay,Martin S. Zinkernagel,Raphael Sznitman
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Forecasting visual fields (VFs) is critical for personalized monitoring and treatment planning in glaucoma. This task is inherently uncertain due to heterogeneous disease progression and measurement variability, yet most existing methods produce single deterministic predictions that fail to represent this uncertainty. We formulate VF forecasting as a probabilistic prediction problem and use conditioned denoising diffusion models to generate distributions of plausible future VFs from longitudinal observations with irregular follow-up intervals. Experiments on two independent VF cohorts show that diffusion-based predictions produce well-calibrated distributions for clinically relevant VF measures. When reduced to a standard point estimate, the proposed approach achieves state-of-the-art accuracy compared to clinical baselines and prior learning-based methods. Our results highlight the advantages of distributional modeling for VF forecasting and support a shift from point-estimate prediction toward uncertainty-aware, clinically interpretable risk assessment in glaucoma.
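
As a rough illustration of what "reducing a predicted distribution to a point estimate" can look like, the sketch below summarizes a set of sampled forecasts by their mean and an empirical interval. It is not the paper's procedure: the samples come from `random.gauss` as a stand-in for a diffusion sampler, and the function name `summarize`, the 5%/95% quantiles, and the example values (a mean-deviation-like quantity in dB) are all assumptions.

```python
import random
import statistics


def summarize(samples, lower_q=0.05, upper_q=0.95):
    """Reduce sampled forecasts to a point estimate and an empirical interval."""
    ordered = sorted(samples)

    def quantile(q):
        # Nearest-rank style lookup, clamped to the last element.
        return ordered[min(len(ordered) - 1, int(q * len(ordered)))]

    return {
        "point": statistics.mean(ordered),
        "lower": quantile(lower_q),
        "upper": quantile(upper_q),
    }


rng = random.Random(0)
# Stand-in for samples drawn from a conditional generative model (hypothetical values).
samples = [rng.gauss(-6.0, 1.5) for _ in range(500)]
summary = summarize(samples)
```

The interval width is what a deterministic forecaster cannot provide; calibration would be checked by how often the later measurement falls inside it.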

[CV-28] SA-Homo: Scale Adaptive Homography Estimation for Scale Variation Scenarios

Link: https://arxiv.org/abs/2606.30408
Authors: Shangxuan Xie,Haifeng Wu,Yuhang Wang,Huarong Jia,Wen Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Homography estimation, as one of the fundamental problems in computer vision, remains challenged by scale variation scenarios where image pairs potentially exhibit significant scale discrepancies. Existing deep learning frameworks frequently suffer from a significant performance degradation in such cases, as they rely on limited displacement assumptions and local feature consistency that might not hold under large scale gaps. In this paper, we propose SA-Homo, a novel scale-adaptive homography estimation framework designed to achieve robust alignment across a wide range of scale discrepancy ratios. We adopt a hierarchical scale alignment strategy that transitions from the global perspective with a heavy module to a local perspective with a light module. Specifically, we introduce the Scale-aware Discrepancy Bridging Module (SDBM) for initial alignment, which utilizes a Multi-scale Linear Attention Cascade (MLAC) to capture long-range dependencies and mitigate feature inconsistencies, along with a global Cross-scale Similarity Matrix Block (CSMB) for scale robust correlation representation. Once the initial scale gap is bridged, a lightweight Iterative Homography Estimation Refinement Module (IHERM) progressively polishes the result using local correlations. To facilitate this research, we contribute the HMSA dataset, a high-resolution, multi-modal satellite benchmark specifically tailored for scale-variant challenges. Extensive experiments demonstrate that SA-Homo maintains high precision even under 8× scale discrepancies, outperforming state-of-the-art methods in both conventional scale-similar scenarios and challenging scale variation scenarios. Code and collected datasets are available at this https URL

[CV-29] SADL: What to Ignore? A Benchmark for Subject-Aware Distractor Localization

Link: https://arxiv.org/abs/2606.30393
Authors: Cao-Tri Nguyen,Nguyen-Khoa Luong,Vinh-Tiep Nguyen,Minh-Triet Tran
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Photographs frequently contain visual distractors besides foregrounds and backgrounds of the intended subject, competing for attention and weakening composition. While modern editing tools streamline object removal, identifying which objects to remove remains a mostly manual process. Existing saliency models and open-vocabulary detectors operate without subject awareness, failing to adapt to shifting user intent. Furthermore, context-agnostic removal may disrupt the scene’s semantic coherence (e.g., keep the person but remove the chair they are sitting on). To address these limitations, we formalize the task of subject-aware distractor localization, which identifies distractors while retaining compositionally essential objects. This paper introduces SADL, the first real-world benchmark for this task, comprising 1,800 subject-aware cases across 1,000 photographs to enable systematic evaluation and facilitate future research. In total, there are 14,617 annotated candidates, including a robust set of 1,938 hard negatives to stress-test exclusion calibration. We evaluate seven proprietary and open-weight Vision-Language Models (VLMs) on a sequential pipeline of distractor classification followed by exclusion filtering, structured around five inclusion factors and three contextual exclusion rules. Our analysis reveals that VLMs are highly capable of identifying distractors, but then over-apply exclusion, which systematically suppresses true distractors at scale. By exposing this critical bottleneck, SADL provides a foundational diagnostic tool to advance subject-conditioned reasoning in multimodal systems.

[CV-30] RenderFormer++: Scalable and Physically Grounded Feed-Forward Neural Rendering

Link: https://arxiv.org/abs/2606.30380
Authors: Huangsheng Du,Haoran Zhu,Youcheng Cai,Jinyang Meng,Ligang Liu
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We present RenderFormer++, a scalable and physically grounded feed-forward neural rendering framework for global illumination in mesh scenes. Existing Transformer-based neural rendering methods such as RenderFormer achieve promising cross-scene generalization, but suffer from limited physical consistency and poor scalability due to the quadratic attention complexity of triangle-level tokenization. To address these issues, we introduce Physics-Informed Transport Guidance (PITG), which embeds rendering-equation inductive biases into the attention mechanism and enforces transport consistency loss, enabling physically consistent light transport modeling. We further propose Hierarchical Object-Centric Tokenization (HOCT), which aggregates triangle-level features into compact object-level tokens via cross-attention with learnable queries, substantially reducing computational and memory costs while preserving geometric and radiometric information. Extensive experiments demonstrate that RenderFormer++ achieves scalable, stable, and generalizable feed-forward global illumination rendering across complex large-scale scenes with improved physical accuracy and efficiency over prior neural rendering methods.

[CV-31] OmniCoT: A Benchmark for Global and Multi-Step Panoramic Reasoning

Link: https://arxiv.org/abs/2606.30378
Authors: Haocong He,Chenfei Liao,Zichen Wen,Zihao Dongfang,Xu Zheng,Bin Ren,Chang Su,Zixin Zhang,Harold Haodong Chen,Hongfei Zhang,Weijia Li,Kailun Yang,Conghui He,Xuming Hu,Nicu Sebe,Linfeng Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated promising spatial reasoning capabilities, but these abilities remain underexplored in the emerging visual modality of panoramic imagery. The full 360° × 180° field of view of panoramas essentially supports complex global multi-step reasoning, which is also the fundamental advantage of panoramas in applications such as embodied intelligence. However, existing panoramic benchmarks largely focus on simplistic queries that rely on local cues or single-/few-step reasoning, thereby ignoring the fundamental advantage of panoramas and failing to fully exploit their potential. To address this gap, we introduce OmniCoT, a panoramic spatial reasoning suite designed to enable MLLMs to use global evidence and perform multi-step inference across viewpoints. It includes OmniCoT-B (6.7K data) for evaluation, which measures both answer accuracy and reasoning quality, and OmniCoT-Real (1K data), a manually annotated real-world subset that quantifies the Sim-to-Real gap. For training, OmniCoT-T (14.3K data) is purpose-built with structured stepwise Chain-of-Thought annotations that explicitly link intermediate reasoning steps to panoramic evidence. Based on OmniCoT-T, we introduce OmniCoT-R1 and adopt a two-stage training strategy tailored to the geometrically complex panoramic space, where Supervised Fine-tuning (SFT) anchors reasoning to panoramic evidence (e.g., bearings, proximity) and GRPO penalizes geometrically incoherent paths to consolidate global 360° spatial consistency. Through OmniCoT, we aim to recalibrate the difficulty of panoramic spatial reasoning to better align with the intrinsic capabilities of panoramic imagery, thereby fostering meaningful progress in this research area.

[CV-32] FlowAWR: Online Adaptive Flow Reinforcement via Advantage-Weighted Rectification

Link: https://arxiv.org/abs/2606.30376
Authors: Zheming Fu,Ruizhe He,Wei Shang,Xiaoxiao Ma,Lei Wang,Chang Liu,Siming Fu
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Aligning generative flow models on continuous spaces via online reinforcement learning is constrained by intractable trajectory likelihoods. Existing density-approximated policy gradient methods rely on stochastic SDE samplers to construct tractable transition kernels, which introduce training-inference inconsistencies and necessitate Classifier-Free Guidance (CFG). While implicit frameworks such as DiffusionNFT directly optimize forward-process velocity fields, their heuristic fixed-magnitude corrections prevent the optimization strength from reflecting relative intra-group quality. We propose Flow Advantage-Weighted Rectification (FlowAWR), a paradigm that recasts continuous generative policy optimization as supervised regression toward a theoretically optimal velocity field. Starting from the optimal policy of a KL-constrained reward maximization, FlowAWR derives the optimal velocity field that admits a magnitude-aware, advantage-weighted rectification form, yielding SDE-free optimization and CFG-free generation. In comparative evaluations on SD3.5-Medium, FlowAWR achieves improved alignment performance alongside a 2× to 5× convergence acceleration over DiffusionNFT (e.g., reaching a 24.12 PickScore in 1.2k steps, versus 23.82 in 2.0k steps for DiffusionNFT and 23.50 in 4k steps for FlowGRPO). Under multi-reward constraints, FlowAWR sustains generation quality, satisfying structural rules while maintaining stable out-of-domain performance.
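
For context, the well-known solution of a KL-constrained reward maximization reweights the reference policy by an exponential of the reward, which is where advantage-weighted schemes come from. The sketch below shows only that generic ingredient: group-relative advantages turned into normalized exponential weights. It is not FlowAWR's velocity-field rectification, whose exact form the abstract does not give; the function names and the temperature `beta` are assumptions.

```python
import math


def group_advantages(rewards):
    """Standardize rewards within one group of samples generated from the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # fall back to 1.0 when all rewards are identical
    return [(r - mean) / std for r in rewards]


def advantage_weights(advantages, beta=1.0):
    """Normalized exp(A / beta) weights; a smaller beta sharpens the preference."""
    top = max(advantages)
    exps = [math.exp((a - top) / beta) for a in advantages]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


rewards = [1.0, 2.0, 3.0]  # illustrative reward scores for three samples
weights = advantage_weights(group_advantages(rewards))
```

Unlike a fixed-magnitude correction, these weights scale with how far each sample sits from its group mean, which is the property the abstract contrasts against.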

[CV-33] Set-Inclusive Uncertainty Modeling for Robust Brain Tumor Segmentation MICCAI2026

Link: https://arxiv.org/abs/2606.30374
Authors: Seunghun Baek,Jihwan Park,Jaeyoon Sim,Hoseok Lee,Seungjoo Lee,Won Hwa Kim
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: MICCAI 2026

Click to view abstract

Abstract:Multimodal MRI is essential for accurate brain tumor segmentation. However, acquiring all modalities at inference is often challenging in practice, which causes intrinsic uncertainty due to unavoidable information loss. Without modeling this uncertainty, existing methods encode incomplete evidence into deterministic representations that appear plausible but lack reliability. In this regime, we propose a probabilistic representation framework that models representations as Gaussian distributions, where their mean captures task information and their variance measures uncertainty from missing evidence. To make variance reflect information deficiency, we regularize the mean from each partial configuration toward its full-modality counterpart, while scaling the variance with the discrepancy between their aligned means. We further introduce a set-inclusive strategy that exploits the hierarchical structure of modality subsets and enforces an ordering constraint to maintain their consistent uncertainty relationships. Extensive experiments on BraTS 2018 and 2020 demonstrate that our approach offers superior performance over baselines across diverse missing-modality scenarios. Code and model checkpoint are available at this https URL.
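
One way to read the ordering constraint over modality subsets is that a configuration with fewer modalities should never report less uncertainty than a configuration that contains it. The sketch below encodes that reading as a hinge penalty. The direction of the inequality, the hinge form, the function name `ordering_penalty`, and the scalar variances are all assumptions, not details taken from the paper.

```python
from itertools import combinations


def ordering_penalty(variance_by_subset):
    """Sum hinge penalties over all pairs where one modality subset strictly contains the other."""
    penalty = 0.0
    for a, b in combinations(variance_by_subset, 2):
        small, large = (a, b) if len(a) < len(b) else (b, a)
        if small < large:  # strict-subset test on frozensets
            # Penalize the larger (more informed) subset having higher variance.
            penalty += max(0.0, variance_by_subset[large] - variance_by_subset[small])
    return penalty


# Illustrative scalar variances per available-modality configuration.
consistent = {
    frozenset({"T1"}): 0.9,
    frozenset({"T1", "T2"}): 0.5,
    frozenset({"T1", "T2", "FLAIR"}): 0.2,
}
violating = {frozenset({"T1"}): 0.3, frozenset({"T1", "T2"}): 0.5}
```

Here `ordering_penalty(consistent)` is zero, while `violating` is penalized because adding T2 increased the reported variance.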

[CV-34] MUSE: Unlocking Timestep as Native Task Steering for One-Step Dense Prediction ECCV26

Link: https://arxiv.org/abs/2606.30370
Authors: Shuo Zhou,Zhaoxin Li,Xiujuan Chai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV26

Click to view abstract

Abstract:Monocular dense prediction has recently seen remarkable success by repurposing pre-trained diffusion models. This opens a promising yet challenging avenue for a more efficient multi-task learning paradigm. However, existing multi-task diffusion methods often introduce parameter-heavy adapters, experts, or learnable task tokens, leading to computational redundancy. In this paper, we reveal an inherent mechanism within one-step diffusion models: the native, fixed sinusoidal timestep embedding can be repurposed as an endogenous task steering signal. Based on this discovery, we propose Multi-task Unified eStimation via timestep Embedding (MUSE), a parameter-free, single-model multi-tasking approach for dense prediction. We interpret this mechanism via Manifold Decoupling, where discrete, fixed timestep values deterministically steer the generation process towards decoupled, task-specific manifolds in the latent space. Extensive experiments across 10 datasets demonstrate that MUSE achieves highly competitive performance on both monocular depth and normal estimation, and its efficacy generalizes across U-Net and DiT architectures. Our work offers a concise and efficient path toward generalist vision models by simply unlocking the latent potential of existing generation infrastructure.
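
To make the idea of a fixed timestep acting as a task signal concrete, the sketch below computes a standard sinusoidal timestep embedding and assigns one fixed timestep per task. The embedding follows the common cos/sin layout used in diffusion codebases (layouts vary between implementations); the task-to-timestep mapping `TASK_TIMESTEPS` and its values are hypothetical and not taken from the paper.

```python
import math


def timestep_embedding(t, dim=8, max_period=10000.0):
    """Fixed sinusoidal embedding of a scalar timestep (cos half followed by sin half)."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.cos(t * f) for f in freqs] + [math.sin(t * f) for f in freqs]


# Hypothetical assignment: each task gets its own fixed timestep value.
TASK_TIMESTEPS = {"depth": 1, "normal": 999}

depth_signal = timestep_embedding(TASK_TIMESTEPS["depth"])
normal_signal = timestep_embedding(TASK_TIMESTEPS["normal"])
```

Because the embedding has no learnable parameters, switching tasks this way adds nothing to the model size; the network simply receives a different conditioning vector.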

[CV-35] CouCE: A Unified Causal Framework for Debiased Deep Metric Learning

Link: https://arxiv.org/abs/2606.30365
Authors: Xin Yuan,Zhenyang Niu,Meiqi Wan,Huilin Zhu,Xin Xu,Kui Jiang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Deep Metric Learning (DML) often struggles with zero-shot generalization because standard objectives inherently capture what co-occurs rather than what causes similarity. Consequently, DML models are vulnerable to shortcut learning driven by two structurally distinct confounders: background spurious correlations (which create backdoor paths via scene context) and foreground nuisance perturbations (which inject non-semantic variations like pose or illumination). Although existing methods have proposed targeted solutions for each pathway individually, none can simultaneously address both due to their fundamentally distinct causal roles. To bridge this gap, we propose the Counterfactual Causal Embedding (CouCE), a unified causal framework that explicitly models and neutralizes both confounders. Specifically, we introduce Orthogonal Dictionary-Based Backdoor Adjustment (ODBA), which isolates spurious background patterns into a variance-gated dictionary and stably disentangles them from the learned embeddings via soft orthogonal regularization. Simultaneously, we propose Multi-Scale Randomized Causal Intervention (MSRCI) to enforce causal invariance against foreground nuisances through multi-scale Fourier amplitude randomization and a symmetric KL invariance constraint. Notably, CouCE seamlessly integrates with any proxy-based loss, incurring modest training overhead without requiring architectural modifications during inference. Extensive experiments on CUB-200-2011, Cars-196, and Stanford Online Products demonstrate that CouCE consistently achieves state-of-the-art performance, providing a principled and robust solution for debiased DML.
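
A symmetric KL invariance constraint can be sketched as follows: compare the class (or proxy) distribution predicted for a clean view with the one predicted for a perturbed view, and penalize the divergence in both directions. The example is a generic illustration, not the paper's MSRCI loss; the logits are made-up numbers, and some formulations average the two KL terms rather than summing them.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions, with a small epsilon to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    return kl(p, q) + kl(q, p)


clean = softmax([2.0, 0.5, -1.0])       # predictions for the original image (illustrative)
perturbed = softmax([1.8, 0.7, -0.9])   # predictions after a nuisance perturbation (illustrative)
loss = symmetric_kl(clean, perturbed)
```

Driving this loss toward zero asks the model to predict the same distribution regardless of the perturbation applied to the foreground.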

[CV-36] ReactiveBFM: Reactive Closed-Loop Motion Planning Towards Universal Humanoid Whole-Body Control

Link: https://arxiv.org/abs/2606.30362
Authors: Xiao Chen,Weishuai Zeng,Xiaojie Niu,Zirui Wang,Jianan Li,Huayi Wang,Furui Xu,Jiahe Chen,Weixiang Zhong,Lihe Ding,Kailin Li,Jiangmiao Pang,Tai Wang,Tianfan Xue,Jingbo Wang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Click to view abstract

Abstract:While current Behavior Foundation Models (BFMs) provide robust control priors for humanoids, they only execute pre-defined reference motions. As a result, they are vulnerable to environmental shifts and incapable of reactive whole-body coordination. Naively cascading them with generative motion planners fails to achieve true reactivity, as inevitable tracking discrepancies induce fatal cumulative exposure bias. To bridge this gap, we propose ReactiveBFM, a real-time closed-loop planning-control framework. At its core, we effectively mitigate exposure bias via a scheduled prefix sampling curriculum, forcing the generative planner to actively learn error-recovery behaviors from imperfect physical states rather than ground-truth trajectories. Systematically, to reconcile the severe latency mismatch between auto-regressive planning and high-frequency tracking, we introduce an asynchronous replanning mechanism. Combined with trajectory chunking to temporally ensemble spatial references, our system guarantees spatio-temporally fluid execution without physical jitter. Deployed on the Unitree G1 humanoid, ReactiveBFM demonstrates unprecedented physical agility across a vast repertoire of text-conditioned closed-loop motions. Notably, ReactiveBFM achieves zero-shot moving target reaching, showcasing intricate whole-body coordination and on-the-fly replanning. In sim-to-sim benchmarking under severe perturbations, ReactiveBFM achieves a 93.1% success rate, significantly outperforming cascaded open-loop baselines by 28.6%.

[CV-37] On the Vulnerability of Parameter-Level Defenses to Model Merging ECCV2026

Link: https://arxiv.org/abs/2606.30360
Authors: Kuangpu Guo,Qingyan Zheng,Jian Liang,Yongcan Yu,Zilei Wang,Ran He,Tieniu Tan
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2026

Click to view abstract

Abstract:The training-free integration of expert models via model merging has exposed significant security risks, enabling free-riders to combine specialized models without authorization. Recent works propose parameter-level defenses that employ linear parameter transformations to neutralize this threat. In this paper, we systematically analyze such defenses and reveal that their protected task vectors are inherently small in magnitude. Consequently, the protected weights remain overwhelmingly dominated by the pretrained model. Based on this observation, we designate the pretrained model as a static reference anchor and propose the Anchor-Guided Attack (AGA) to circumvent existing safeguards. Specifically, AGA aligns the protected model with this anchor to recover the transformation matrix analytically. Extensive evaluations validate that AGA consistently bypasses both individual and composite defenses under realistic defense-agnostic scenarios. Furthermore, we provide Anchor-Repulsive Fine-tuning (ARF), a defense method to mitigate the anchor dominance leveraged by AGA. Empirical results confirm that ARF effectively defeats the proposed attack. Our code is available at this https URL.
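
The observation that protected weights stay dominated by the pretrained model suggests why an anchor can reveal a linear transformation. The toy sketch below assumes a much simpler defense than those studied in the paper: each weight column is multiplied by a secret scale. Because the fine-tuned weights differ only slightly from the pretrained anchor, a per-column least-squares fit against the anchor recovers the scales approximately. This is not the paper's AGA; the function name, the scaling defense, and all numbers are assumptions.

```python
def recover_column_scales(anchor, protected):
    """Least-squares estimate of s_j such that protected[:, j] is close to s_j * anchor[:, j]."""
    n_cols = len(anchor[0])
    scales = []
    for j in range(n_cols):
        num = sum(a[j] * p[j] for a, p in zip(anchor, protected))
        den = sum(a[j] * a[j] for a in anchor)
        scales.append(num / den)
    return scales


pretrained = [[1.0, 2.0], [3.0, 4.0]]          # anchor weights (illustrative)
task_vector = [[0.01, -0.02], [0.0, 0.01]]     # small fine-tuning update
secret_scales = [2.0, 0.5]                     # hypothetical per-column defense

finetuned = [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(pretrained, task_vector)]
protected = [[w * s for w, s in zip(row, secret_scales)] for row in finetuned]

estimated = recover_column_scales(pretrained, protected)
```

The estimate is only as good as the task vector is small, which mirrors the abstract's point that a defense must reduce the anchor's dominance to resist this kind of alignment.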

[CV-38] Residual-Guided Expert Specialization for Incomplete Multimodal Learning ECCV2026

Link: https://arxiv.org/abs/2606.30355
Authors: Seunghun Baek,Jihwan Park,Jaeyoon Sim,Minjae Jeong,Hoseok Lee,Won Hwa Kim
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ECCV 2026

Abstract:As real-world prediction systems often face missing modalities at inference, incomplete multimodal learning (IML) remains a practical challenge. While prior methods aim to learn representations robust to missing inputs, representations from incomplete modalities inevitably deviate from their full-modality counterparts due to missing evidence. To explicitly leverage these deviations, we propose MARS (Missingness-Aware Residual-guided Specialization), a mixture-of-experts framework that guides expert specialization based on how representations are reshaped by missingness. By contrasting task representations derived from incomplete inputs with their complete counterparts during training, we derive a privileged residual signal that captures this representational gap. The residual signal guides a residual router to assign samples to experts specialized for the corresponding deviation patterns. In parallel, a feature router learns to imitate this routing behavior using only incomplete inputs, enabling deployment without access to full modalities. To mitigate this train-test router gap, we develop a discrepancy-aware noise regularization that adaptively perturbs the residual router’s decisions when the feature router deviates, enhancing expert robustness under imperfect imitation. Experiments on multimodal classification (CASIA-SURF, CREMA-D, UPMC Food-101) and segmentation (MCubeS) under missing scenarios show that MARS consistently surpasses baselines while remaining efficient and extensible to diverse backbones and tasks.

[CV-39] FastPano3D: Feed-Forward Indoor Panoramic 3D Reconstruction from a Single Image

Link: https://arxiv.org/abs/2606.30352
Authors: Jianqiang Li,Liumei Zhang,Wenjia Guo,Tianlong Feng,Yongzhi Liao,Di Lu,Hanchi Ren,Jingjing Deng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint. Under review. 20 pages, 9 figures

Abstract:Recent advances in 3D scene reconstruction have highlighted the intricate trade-offs among rendering quality, inference efficiency, and data dependency. To address the challenge of rapidly reconstructing detailed 3D indoor scenes from minimal input, we introduce FastPano3D, an end-to-end framework that directly generates renderable 3D Gaussian representations from a single panoramic image. Unlike perspective-based methods, panoramic images inherently suffer from equirectangular projection distortions and spatially non-uniform feature distributions, making direct feed-forward Gaussian generation particularly challenging. In contrast to existing Gaussian Splatting based methods that rely on multi-view supervision or per-scene optimization, FastPano3D employs a lightweight feature encoder, adaptive Gaussian sampling, and a point-cloud-guided refinement strategy to achieve efficient and accurate scene generation without any test-time optimization. Our approach reconstructs high-fidelity 3D scenes within seconds, achieving up to 156 times faster inference than prior state-of-the-art methods such as Pano2Room, while using only half the parameters. Extensive experiments demonstrate that FastPano3D delivers rendering quality comparable to NeRF- and 3DGS-based reconstructions, establishing a new benchmark for rapid, single-view 3D scene inference.

[CV-40] FFAvatar: Feed-Forward 4D Head Avatar Reconstruction from Sparse Portrait Images

Link: https://arxiv.org/abs/2606.30347
Authors: Jianjiang Yao,Ke Xian,Renxiang Dai,Robert Caiming Qiu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present FFAvatar, a Transformer-based 3D Gaussian framework for fast construction of high-quality and animatable 4D head avatars from one or more reference portrait images. Unlike existing feed-forward approaches that require a fixed number of input views, FFAvatar supports incremental reconstruction, progressively refining the avatar representation as additional reference images become available. At the core of our method is an alternating attention mechanism that disentangles identity appearance from expression and viewpoint variations, enabling the reconstruction of a canonical 3D appearance that remains consistent across poses and facial expressions. To balance visual fidelity and computational efficiency, we introduce a sparse-to-dense learning paradigm. Coarse appearance features are first learned using sparse primitives anchored to the FLAME vertex level and are subsequently densified in the UV domain to capture fine-grained geometric and texture details. We further propose a plug-and-play motion refinement module that enables subject-specific dynamic personalization by modeling residual motion beyond parametric deformation. Extensive experiments demonstrate that FFAvatar efficiently produces high-fidelity and controllable 4D head avatars, achieving superior flexibility, driving efficiency, and identity-consistent rendering across diverse expressions and viewpoints.

[CV-41] Early Cue Precision Shapes Visual Shortcut Learning in Controlled Cue-Manipulation Benchmarks

Link: https://arxiv.org/abs/2606.30344
Authors: Chanho Park,Woochan Lee,Janyeong Oh,Geongho Gong,Minshu Kim,Yeachan Kwak,Seongim Choi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Visual classifiers can achieve high matched-distribution accuracy while relying on low-level cues that fail under conflict or suppression. We test whether this failure is shaped by early cue precision: the reliability with which a low-level cue predicts the label during early learning or downstream probe fitting. Across synthetic shape-texture tasks, sequential digit training, a 10-class frozen-representation audit, and a CIFAR-10 natural-image-based texture-overlay benchmark, we manipulate object-texture match probability and evaluate matched-ID accuracy, conflict accuracy, texture-choice rate, and suppression behavior. Degraded-but-predictive input does not substitute for cue decorrelation. In 10-class digit probes, conflict accuracy drops from 0.589 under chance-like cue precision to 0.005 under target-perfect texture. In CIFAR-10 frozen probes, conflict accuracy drops from 0.569 to 0.114, while texture choice rises from 0.049 to 0.855; this ordering persists across texture-overlay strengths α ∈ {0.15, 0.25, 0.35, 0.50}. End-to-end CIFAR-10 training shows that low early cue precision improves pre-target conflict behavior, but shortcut-rich fine-tuning can rapidly overwrite this benefit. Cue decorrelation must therefore be maintained during downstream adaptation rather than treated as a one-time inoculation.
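The two conflict-set metrics named in this abstract can be illustrated with a small sketch. It assumes that conflict accuracy is the fraction of cue-conflict samples classified by their object label and that texture-choice rate is the fraction classified by the label the texture suggests; the paper's exact definitions may differ, and `ConflictSample` and `conflict_metrics` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class ConflictSample:
    """One cue-conflict example (hypothetical structure, not from the paper)."""
    shape_label: int    # ground-truth object class
    texture_label: int  # class suggested by the overlaid texture
    prediction: int     # class predicted by the model


def conflict_metrics(samples: Sequence[ConflictSample]) -> Dict[str, float]:
    """Compute conflict accuracy and texture-choice rate under the assumed definitions."""
    if not samples:
        raise ValueError("at least one sample is required")
    n = len(samples)
    # Predictions that follow the object shape count as correct.
    shape_hits = sum(s.prediction == s.shape_label for s in samples)
    # Predictions that follow the texture cue count as shortcut choices.
    texture_hits = sum(s.prediction == s.texture_label for s in samples)
    return {
        "conflict_accuracy": shape_hits / n,
        "texture_choice_rate": texture_hits / n,
    }
```

The two rates need not sum to one, because a prediction can match neither the shape label nor the texture label.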

[CV-42] A Classifier-Agnostic Zero-Shot Adversarial Attack Detection via CLIP

Link: https://arxiv.org/abs/2606.30342
Authors: Hodaya Krakover,Meir Yossef Levi,Eyal Gofer,Guy Gilboa
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Adversarial attacks pose a challenge to the reliability of deep learning models, motivating effective detection methods. Existing techniques often rely on attack-specific assumptions, access to adversarial samples, or knowledge of the underlying classifier (white-box). We propose A^4D (Attack- and Architecture-Agnostic Adversarial Detector), a completely black-box, zero-shot adversarial attack detection framework that utilizes prompt-based similarity scores derived from CLIP. To the best of our knowledge this is the first attempt to utilize CLIP for such a task. The method is based on two key observations: (i) CLIP is sensitive even to small imperceptible non-semantic perturbations; (ii) The shift in CLIP embedding space is not arbitrary and can be used as a robust attack indicator. Experiments across multiple attacks, datasets and classifiers validate that A^4D achieves SOTA detection results in the attack-agnostic and classifier-agnostic setting.

[CV-43] UniGP: Taming Diffusion Transformer for Prior-Preserved Unified Generation and Perception

Link: https://arxiv.org/abs/2606.30332
Authors: Qin Guo,Hao Luo,Dongxu Yue,Weixuan Jin,Xiao Fu,Fan Wang,Dan Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in diffusion models have shown impressive performance in controllable image generation and dense prediction tasks. However, existing approaches typically treat diffusion-based controllable generation and dense prediction as separate tasks, overlooking the potential benefits of jointly modeling the heterogeneous distributions. In this work, we introduce UniGP, a framework built upon MMDiT, which unifies controllable generation and dense prediction through simple joint training, without the need for complex task-specific designs or losses, while preserving the backbone’s versatile priors. By learning controllable generation and prediction under different conditions, our model effectively captures the joint distribution of image-geometry pairs. UniGP is capable of versatile controllable generation, dense prediction, and joint generation. Specifically, the proposed UniGP consists of DUGP and a unified dataset training strategy. The former, following the principle of Occam’s razor, uses only a copied image branch of MMDiT to model dense distributions beyond RGB, while the latter integrates heterogeneous datasets into a unified training framework to jointly model generation and perception tasks. Extensive experiments demonstrate that our unified model surpasses prior unified approaches and performs on par with specialized methods. Furthermore, we demonstrate that multi-task joint training provides complementary benefits: generative priors enrich perceptual details, while perceptual learning improves structural alignment in generation.

[CV-44] Optimizing Image Preparation and Compression for Face Recognition within 1024 Bytes

Link: https://arxiv.org/abs/2606.30321
Authors: Paul Andreas,Torsten Schlett,Christoph Busch
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:ICAO-compliant machine readable travel documents enable automated biometric face verification. The biometric reference is stored on an RFID chip in the form of a JPEG or JPEG 2000 compressed facial image. In contrast, temporary travel documents lack machine readability, which excludes the owner from such automated processes. This disadvantage could be addressed by equipping such documents with 2D barcodes. This technology offers a resource-saving alternative to expensive RFID chips, while still offering machine readability and fast issuing processes. However, this solution introduces the challenge of storing the face images within significantly smaller storage capacities, creating the need to reduce the file size of the included facial image to a maximum of 1024 bytes. This study examines preprocessing steps and compression configurations, using JPEG, JPEG 2000, JPEG XL, JPEG AI, HEIF, AVIF, and WebP for image compression to this target size, while still preserving as much face recognition performance as possible. While the reference sample must always comply with ICAO specifications, the individual samples may or may not meet these requirements, depending on the application. This work optimizes compression steps for both of these prerequisites. It is shown that the recently standardised JPEG AI, when using optimized settings, provides the best face recognition performance, in particular when the comparison includes only images with high face image quality. AVIF and WebP also provide good results. The losses caused by the strong lossy compression are comparatively small. For the comparison of ICAO-compliant face images only, converting the images to grayscale proves to be a helpful preprocessing step, whereas for comparisons involving less suitable samples, preserving color is preferable. In addition, smoothing and resizing the images beforehand also turns out to be beneficial.
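Fitting an image into a fixed 1024-byte budget can be sketched as a search over an encoder's quality setting. The snippet below is only an illustration, not the procedure used in the paper: `fit_to_budget` and `fake_encode` are hypothetical names, the encoder is passed in as a callable, and payload size is assumed not to decrease as quality rises.

```python
from typing import Callable, Optional, Tuple


def fit_to_budget(
    encode: Callable[[int], bytes],
    budget: int = 1024,
    lo: int = 1,
    hi: int = 100,
) -> Optional[Tuple[int, bytes]]:
    """Binary-search the highest quality whose encoded payload fits the budget.

    Assumes len(encode(q)) is non-decreasing in q. Returns None if even the
    lowest quality exceeds the budget.
    """
    best: Optional[Tuple[int, bytes]] = None
    while lo <= hi:
        mid = (lo + hi) // 2
        payload = encode(mid)
        if len(payload) <= budget:
            best = (mid, payload)  # fits: remember it and try a higher quality
            lo = mid + 1
        else:
            hi = mid - 1           # too large: try a lower quality
    return best


def fake_encode(quality: int) -> bytes:
    """Stand-in encoder for demonstration: 20 bytes per quality step."""
    return bytes(20 * quality)


result = fit_to_budget(fake_encode)
```

In practice `encode` would wrap a real codec call. Preprocessing choices such as grayscale conversion, smoothing, or resizing would be applied before encoding and compared by the resulting face recognition performance.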

[CV-45] BrainJanus: A Unified Model for Understanding and Generation across Brain Vision and Language

Link: https://arxiv.org/abs/2606.30319
Authors: Haitao Wu,Qirui Zhang,Zhouheng Yao,Shangquan Sun,Qihao Zheng,Mianxin Liu,Chi Zhang,Wanli Ouyang,Chunfeng Song,Changqing Zhang,Jiamin Wu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Modeling the bidirectional correspondence between external sensory stimuli and internal neural activity has emerged as a critical frontier in neuroscience. However, existing approaches predominantly treat brain encoding and decoding as isolated tasks, relying heavily on unimodal alignment and external priors while overlooking the brain’s intrinsic nature as a multimodal integration system. To address these limitations, we propose BrainJanus, the first unified brain model that integrates brain, vision, and language within a single framework. Specifically, we introduce a Unified Brain Tokenizer to quantize continuous neural dynamics into discrete tokens aligned with visual and linguistic representations in a shared Omni space. Building on this, we utilize an All-in-One autoregressive architecture that leverages next-token prediction to enable seamless any-to-any generation, which encompasses image-to-brain and text-to-brain encoding, and brain-to-image and brain-to-text decoding. Extensive experiments demonstrate that BrainJanus achieves superior performance across diverse benchmarks. Furthermore, our framework exhibits zero-shot generalization and preserves interpretable biological topography, highlighting its potential as a general-purpose brain modeling paradigm. The code is available on GitHub at this https URL.

[CV-46] Real-Time Underwater Image Enhancement via Frequency-Guided Dual-Path Attention ICME2026

Link: https://arxiv.org/abs/2606.30314
Authors: Leshen Zhang,Ao Li,Ce Zhu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 5 figures. Accepted at ICME 2026

Abstract:Real-time underwater image enhancement (UIE) is crucial for mobile underwater photography and autonomous robotic systems, where practical deployment typically requires low latency and compact models under constrained computational resources. Recent ultra-lightweight CNNs based on structural re-parameterization meet these constraints but operate purely in the spatial domain, ignoring the frequency-sensitive nature of underwater degradation. To address this, we propose a lightweight UIE framework that integrates two key components: a Multi-Branch Reparameterizable Convolution with Fixed DCT Priors (MBRConv-DCT) that injects structured directional frequency priors during training, and a Frequency-Guided Dual-Path Attention (FGDPA) module that fuses spatial and spectral cues via a dual-path design for adaptive feature modulation. Both components are fully compatible with structural re-parameterization: the convolution branch introduces zero additional inference cost after re-parameterization, while the attention module incurs only a minimal computational overhead. Experiments show our model achieves state-of-the-art performance with only 4.23K parameters and 600+ FPS, outperforming much larger methods in both quantitative metrics and visual quality. Code is available at this https URL.

[CV-47] TRACE: A Concept Bottleneck Model for Longitudinal 3D Glioblastoma Response Assessment IJCAI2026

Link: https://arxiv.org/abs/2606.30313
Authors: Alia Tarek,Hamsa Saberr,Hamza Elghonemy,Youssef Afify,Tamer Basha,Omair Shahzad Bhatti,Abdulrahman M. Selim,Hasan Md Tusfiqur Alam,Daniel Sonntag
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at the EXPLIMED: Explainable Artificial Intelligence for the Medical Domain workshop at IJCAI 2026

Abstract:Longitudinal glioblastoma response assessment requires comparing subtle tumor changes across MRI time points using structured clinical criteria such as RANO. However, most deep learning methods predict response labels directly from imaging features, which limits clinical inspection, verification, and correction. We introduce TRACE, a RANO 2.0-aligned concept bottleneck model for interpretable 4-class glioblastoma response classification on longitudinal 3D MRI. TRACE processes paired baseline and follow-up multimodal MRI scans with a shared 3D vision encoder, predicts clinically meaningful tumor measurements as root concepts, computes downstream RANO-derived concepts through deterministic rules, and incorporates scan interval and new-lesion information as passthrough concepts. This design frames response assessment as structured concept reasoning rather than direct image-to-label prediction. Using 5-fold patient-wise cross-validation on the LUMIERE dataset, TRACE achieves a 4-class macro F1 of 0.4769 and a binary progression-versus-non-progression macro F1 of 0.7085. It improves over a concept bottleneck baseline and remains within the range of published non-interpretable deep learning approaches. Ablation studies show that the expert RANO graph and intervention-consistency training are important for performance, while intervention experiments demonstrate that correcting concepts can improve downstream predictions. These results suggest that structured concept bottlenecks offer a transparent and clinically aligned direction for longitudinal glioblastoma response assessment, while highlighting the need for larger protocol-aligned datasets and external validation. 

[CV-48] A Point Cloud Transformer for Remote Monitoring and Automated Assessment of Physical Rehabilitation Exercises

Link: https://arxiv.org/abs/2606.30309
Authors: Kazi Rafat,Md. Ismail Hossain,M M Lutfe Elahi,Sifat Momen,Fuad Rahman,Nabeel Mohammed,Shafin Rahman
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication in IEEE Journal of Biomedical and Health Informatics (JBHI), 2026

Abstract:Rehabilitation exercises are essential in restoring lost physical functions of patients suffering from various diseases (e.g., Parkinson’s, back pain). Carrying out these rehabilitation exercises, often prescribed by health experts, is costly, unavailable, and requires expert supervision. The availability of RGBD images and movement/position data of joints along with expert annotation of exercise data has prompted the use of automatic assessment of the quality of rehabilitation exercises, which is cost-effective and can be carried out at home. However, existing approaches do not extract relevant features, lack practical application, require expensive pre-processing, or overlook crucial features. This study proposes a transformer-based framework for point clouds to extract features and assess rehabilitation exercises by analyzing joint positions collected through RGBD data. We adapt and utilize a curve-based point-cloud feature aggregation technique to augment point-cloud information that aids model output. The transformer architecture also uses axial self-attention, recognizing important joints and their roles to assist users in performing the exercise better. The guided system outperforms existing approaches and is also practically relevant due to its small size, fast inference, and generalization on specific joints in similar exercises. We conduct our experiments on three crucial baseline datasets for rehabilitation exercises: Kimore, UI-PRMD, and IRDS.

[CV-49] The Surprising Effectiveness of Video Diffusion Models for Hand Motion Reconstruction

Link: https://arxiv.org/abs/2606.30308
Authors: Yuxi Wang,Chengkai Jin,Yufei Liu,Wenqi Ouyang,Tianyi Wei,Zhiwei Zeng,Siyuan Huang,Zhiqi Shen,Xingang Pan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:4D hand motion reconstruction from egocentric video is bottlenecked by clear limitations of existing methods: image-based pipelines depend on a detector that fails under heavy occlusion, while video-based methods rely on temporal modules learned only from scarce hand-pose annotations, a narrow signal insufficient to model motion dynamics, occlusion reasoning, and hand-object interaction. These capabilities, however, are exactly what video generative models must implicitly acquire when trained to synthesize coherent video at internet scale. Motivated by this, we present ViDiHand, which leverages the representations of a pretrained video diffusion model to reconstruct 4D two-hand pose. We adapt it via a hand-overlay rendering objective that specializes its features for hands while preserving its world priors. A decoder then recovers metric-scale pose from the adapted features. The whole pipeline operates directly on full frames–no detector, no infiller, and no test-time optimization. On ARCTIC, HOT3D, and HOI4D, ViDiHand substantially outperforms prior methods, establishing video diffusion models as a powerful new foundation for hand motion reconstruction and a promising route to scalable in-the-wild data collection for embodied AI. Project page: this https URL.

[CV-50] DreamForge-World 0.1 Preview: A Low-Compute Real-Time Controllable World Model

Link: https://arxiv.org/abs/2606.30292
Authors: Daniyel Ayupov,Artur Markov-Tsoy
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:We present DreamForge-World 0.1 Preview, a preview foundational world model for real-time interactive world simulation. The system adapts the LongLive 1 autoregressive video stack, itself derived from Wan2.1-T2V-1.3B, with a residual action pathway inspired by the Matrix-Game family. DreamForge-World 0.1 Preview focuses on a complementary axis to frontier-scale world simulators: low-compute adaptation, consumer-GPU runtime, and broad interactive capability coverage. It supports live keyboard and mouse control, multimodal initialization, mid-stream reprompting, dual-view operation, and minute-scale interactive rollouts at native 480p resolution, reaching up to 14 to 15 FPS on a single RTX 4090 with a low memory footprint. By leveraging open video backbones and applying targeted adaptation runs, we build the preview system with high cost-efficiency. DF-World 0.1 Preview is not yet a memory-complete or frontier-quality world simulator, but demonstrates a practical low-compute route toward real-time controllable world-model previews on consumer GPUs.

[CV-51] VisReflect: Latent Visual Reflection for Fine-Grained Perception in Long Visual Context ECCV2026

Link: https://arxiv.org/abs/2606.30288
Authors: Xiaoqian Shen,Mohamed Elhoseiny
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026; Project page: this https URL

Abstract:Large Vision Language Models (LVLMs) have achieved remarkable success on vision-language tasks, yet fine-grained perception over high-resolution images and long-context videos remains challenging. As the number of visual tokens increases, the visual attention sink phenomenon becomes increasingly severe, causing irrelevant tokens to absorb a disproportionate amount of attention mass. Recent approaches attempt to mitigate this issue by explicitly predicting bounding boxes or temporal spans and re-encoding the cropped visual regions. Such methods depend on unreliable numeric localization in the discrete token space and incur significant computational overhead due to additional forward passes. In this work, we propose VisReflect, a simple yet effective framework that improves fine-grained perception in long visual contexts through latent visual reflection. Instead of decoding intermediate predictions into discrete tokens, the model generates continuous visual reflection that represents question-relevant visual features in the latent space. These reflections selectively emphasize salient regions or frames, guiding attention towards relevant visual tokens within a single forward pass. We conduct comprehensive evaluations on challenging high-resolution image benchmarks, including BLINK, V*, and HRBench-4K/8K, as well as video understanding benchmarks such as MVBench, VideoMME, and MLVU. Our method consistently improves over strong baselines, achieving gains of 4.1% on image benchmarks and 1.8% on video benchmarks. Compared with zooming-based methods, our model achieves comparable performance while reducing inference time by roughly 44% on video understanding.

[CV-52] Intermediate Text Representation Guided Text-to-Image Generation for Enhancing One-and-Only Alignment ECCV2026

Link: https://arxiv.org/abs/2606.30262
Authors: Soyoun Won,Aryan Yazdan Parast,Basim Azam,Jean Honorio,Naveed Akhtar
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ECCV 2026

Abstract:Text-to-image (T2I) diffusion models often fail to faithfully render explicit textual descriptions, instead defaulting to strongly learned visual priors due to a phenomenon referred to as concept association bias. We show that such bias is particularly strong for one-and-only (OAO) objects, entities that exist in a single canonical form, such as celestial bodies, landmarks, and artworks. The deeply ingrained visual identity for these concepts often resists modification through prompting alone. Addressing this challenge, we first identify through an information-theoretic analysis that the final text embedding discards concept-level information present in the intermediate-layer text representations, reducing the mutual information available to the subsequent denoising process. We then propose Intermediate Text Representation (IR)-guided diffusion, which injects intermediate hidden states of the text encoder into the conditioning signal during early denoising steps, recovering suppressed concepts without any additional training, optimization, or external models. To systematically evaluate the challenging task of aligning generative outputs with unusual prompts for OAO objects, we introduce OAO-AttackBench, a benchmark comprising counterfactual prompts that directly conflict with the core visual identity of OAO objects. Experiments on four benchmarks, including OAO-AttackBench, show that our method achieves up to a 19.1 percentage-point improvement in VQAScore while preserving generation fidelity and human preference. Project page: this https URL.

[CV-53] Your Data Manifold is Secretly a Reward Model: Shell-LCC for Text-to-Video Generation ECCV2026

Link: https://arxiv.org/abs/2606.30248
Authors: Shihao Zhang,Yuguang Yan,Junzhe Zhang,Wei Zhao,Bohan Wang,Hanwang Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ECCV 2026

Abstract:Recent text-to-video (T2V) diffusion models rely heavily on auxiliary reward signals (e.g., via reward models or DPO) to align generated content with human aesthetics and improve realism. These signals, however, incur substantial computational overhead, require costly human annotations, and often yield limited improvement in fine-grained local details. In this paper, we argue that your data manifold is secretly a reward model. By explicitly modeling the manifold structure of high-quality Supervised Fine-Tuning (SFT) data and encouraging video latents to lie on this manifold, we derive dense, differentiable, and nearly cost-free reward signals that significantly improve video quality, particularly in mitigating low-level distortions. Our modeling builds upon Local Coordinate Coding (LCC), which captures the 'skeleton' of the manifold. However, directly applying LCC suffers from mean regression, pulling latents toward the geometric mean and losing high-frequency details. We therefore extend it to Shell Local Coordinate Coding (Shell-LCC), which models the manifold 'surface' as an isotropic shell to align with the true high-density region. Experiments demonstrate that our approach improves realism, enhances high-frequency details, reduces over-smoothing artifacts, and alleviates motion blur.

[CV-54] Semantic-Driven Scale and Spatial Selection for Efficient Cross-Modal Alignment in Referring Remote Sensing Image Segmentation

Link: https://arxiv.org/abs/2606.30244
Authors: Kun Li,Shengxi Gui,Francesco Nex,Michael Ying Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted

Abstract:Referring Remote Sensing Image Segmentation (RRSIS) seeks to localize and segment the target object or region specified by a natural language expression in a remote sensing image. While existing RRSIS models have benefited from large-scale foundation models, they predominantly rely on full fine-tuning. These approaches are computationally intensive and may weaken the generalization ability of pre-trained models, as extensive fine-tuning on significantly smaller downstream datasets can distort the well-structured feature representations learned during large-scale pre-training. Although Parameter-Efficient Tuning (PET) offers a potential alternative, existing PET frameworks primarily focus on single-modal optimization, failing to capture the complex cross-modal dependencies required for multimodal reasoning, while simultaneously struggling to bridge the substantial domain gap between natural scenes and aerial imagery. To address these limitations, we propose a novel framework, Semantic-driven Scale and Spatial Selection for Efficient Cross-modal Alignment (S4ECA), which enables effective and efficient cross-modal interaction through parameter-efficient adaptation. Specifically, we design a dual-encoder adapter architecture. The textual adapter employs learnable queries to distill highly semantic language proxies from word-level embeddings, facilitating early grounding. Simultaneously, the visual adapter refines hierarchical feature representations through a multi-scale dense extractor, followed by a language-guided scale and spatial selection mechanism that dynamically emphasizes relevant visual contexts, ensuring precise cross-modal alignment. By updating only 2.4% of the backbone parameters, our proposed model achieves state-of-the-art performance on the RRSIS-D and RefSegRS datasets, demonstrating superior efficiency and precision in complex aerial scenarios.

[CV-55] From Accuracy to Visual Dependence: Auditing and Filtering Modality Collapse in Traffic VideoQA

Link: https://arxiv.org/abs/2606.30220
Authors: Sena Korkut,María Alejandra Bravo Sarmiento,Sanghwan Kim,Zeynep Akata
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:High benchmark accuracy does not guarantee genuine use of visual evidence. We study this problem in traffic accident Video Question Answering (VideoQA), where correct answers should depend on scene-specific visual evidence but may instead be inferred from textual shortcuts. Through an audit of four public benchmarks, we find that several recent open-weight Vision-Language Models (VLMs) perform competitively, and sometimes better, without video input. On the MM-AU benchmark, removing video consistently improves accuracy, and adding more frames further degrades performance. To quantify visual dependence, we introduce two dataset-level diagnostics: Blind Gap, measuring above-chance text-only performance, and Visual Gain, measuring the marginal benefit of adding video. We further propose an instance-level Shortcut Score that combines text-only confidence with visual necessity signals, enabling continuous, training-free filtering of shortcut-prone questions. The resulting subsets reduce shortcut bias and improve visual grounding. Our findings reveal large differences in grounding quality across benchmarks and show that visually grounded evaluation, not just high accuracy, is essential in safety-critical VideoQA.
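The two dataset-level diagnostics can be sketched as simple accuracy differences. This reading rests on assumptions: Blind Gap is taken as text-only accuracy minus chance accuracy, and Visual Gain as accuracy with video minus text-only accuracy. The paper may normalize or aggregate differently, and the function names below are hypothetical.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of questions answered correctly."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and equal in length")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def blind_gap(text_only_acc: float, chance_acc: float) -> float:
    """Assumed definition: how far text-only accuracy sits above chance."""
    return text_only_acc - chance_acc


def visual_gain(with_video_acc: float, text_only_acc: float) -> float:
    """Assumed definition: marginal benefit of adding video input."""
    return with_video_acc - text_only_acc
```

Under these definitions, a large Blind Gap together with a Visual Gain near zero or below it would indicate that a benchmark's questions can largely be answered from text alone, which is the pattern the abstract reports for MM-AU.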

[CV-56] Efficient RGB-T Object Detection via Sparse Cross-Modality Fusion ECCV-2026

Link: https://arxiv.org/abs/2606.30215
Authors: Chao Tian,Zikun Zhou,Chao Yang,Guoqing Zhu,Zhenyu He
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ECCV-2026

Abstract:RGB-T detectors leverage the complementary strengths of visible and thermal infrared modalities, achieving robust performance under challenging conditions. Many of them resort to heavy dual backbones and exhaustive cross-modality fusion across the entire image, leading to impractically high computational costs. We observe that most image regions are smooth backgrounds (e.g., sky, ground) that can be easily handled by lightweight single-modality models. In light of this observation, we propose a sparse fusion mechanism for efficient RGB-T detection: first rapidly scanning the image to identify the proposals and then carefully examining the remaining sparse proposals via feature fusion. We propose a two-stage framework to instantiate this mechanism, which performs detection in two stages: 1) a lightweight and modality-specific detection stage that produces high-recall RoIs, and 2) a fusion-driven examination and refinement stage that filters out the false positives and refines the bounding boxes. This design enables the detector to adaptively allocate more computational resources to the potential foregrounds, improving the efficiency while ensuring detection accuracy. Extensive experiments show that our method achieves competitive performance with substantially fewer parameters and lower cost, while maintaining strong scalability to high-resolution images.

[CV-57] A Multi Center Breast FNAC Whole-Slide Cytology Dataset for AI-Assisted Patch-Wise Classification Using C1 to C5 Reporting Categories

Link: https://arxiv.org/abs/2606.30209
Authors: Garima Jain,Abhijeet Patil,Surabhi Jain,Sanghamitra Pati,Amit Sethi,Sandeep Mathur,Pulkit Verma,Nishi Halduniya,Jatin Kashyap,Sharat Kumar,Simmi Kharb,Sunita Singh,Sucheta Devi Khuraijam,Sushma Khuraijam,Ratan Konjengbam,Arvind Kumar,Deepali Tirkey,Saurav Banerjee,Shivani Kalhan,Rakesh Kumar Gupta,Ranjana Solanki,Deepika Hemranjani,Shashank Nath Singh,Uma Handa,Manveen Kaur,B. G. Malathi,Yogender P.,Niraj Kumari,Shruti Gupta,Indu R. Nair,Vidya C.,Basumitra Das,Sunil Kumar Komanapalli,Ravindra Karle,Tanaya Kulkarni,Vandana Raphael,Biswajit Dey,Vaishali Gaikwad,Nilam Adhav
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 1 figure

Abstract:We present a multi-center breast fine-needle aspiration cytology (FNAC) dataset designed for patch-wise classification using C1 to C5 reporting labels. The prospective dataset includes 321 patients and 470 whole-slide images (WSIs) collected from participating tertiary medical centers in India between May 2023 and March 2026. Slides were stained using Papanicolaou (190 WSIs) or May-Grünwald Giemsa (280 WSIs), scanned on a Hamamatsu NanoZoomer S360 at 40X magnification and 0.25 microns per pixel, and stored directly in NDPI format. Across the 470 WSIs, 446 WSIs contain annotated patch regions, yielding 7,398 PNG image patches with expert-verified C1 to C5 labels. The release includes NDPI WSIs, WSI-level GeoJSON annotation files, extracted patch images, deidentified metadata, a data dictionary, a validation summary, a manifest linking WSIs to Zenodo records, and code for dataset inspection and reuse. The complete dataset is approximately 950 GB and is available through Zenodo.

[CV-58] Few-Shot Domain Incremental Learning via Continual Vision-Language Consolidation

Link: https://arxiv.org/abs/2606.30190
Authors: Naeem Paeedeh,Mahardhika Pratama,Wolfgang Mayer,Mukesh Prasad,Weiping Ding,Yew-Soon Ong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Existing domain-incremental learning (DIL) strategies call for massive amounts of data to adapt to new domains and suffer from overfitting when data are scarce. This paper puts forward a relatively uncharted problem, namely few-shot domain incremental learning (FSDIL), which accounts for extreme data shortages in DIL. A novel algorithm, Continual Vision-Language Consolidation (CVLC), is proposed to address the FSDIL problem; its key idea is latent space reservation in the base domain coupled with dual coalescent projection (DCP) as a parameter-efficient fine-tuning method. First, the vision prototype is calibrated while multiple templates and synonyms are generated via LLMs to induce the language prototype. The vision and language prototypes are then fused. Adaptation to never-ending arrivals of new domains is done by the DCP technique, fine-tuned so as to prepare the model for unseen domains via the latent-space reservations committed in the base domain. CVLC is structured into shared and domain-specific components to combine general knowledge and domain-specific details. The advantage of our approach is demonstrated through a range of benchmark problems and comparisons with prior art, in which CVLC outperforms prior methods by a margin of up to 16%. Our code is shared publicly at this https URL.

[CV-59] DrivenMorph: Bridging Attention Mechanism and Variational Image Registration via Difference Modeling

Link: https://arxiv.org/abs/2606.30183
Authors: Mingke Li,Jianping Zhang,Jinqiu Deng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages

Abstract:Medical image registration benefits significantly from deep learning, yet existing approaches often lack physical explainability and fine-grained deformation control. Motivated by Demons algorithms, we propose a novel DrivenMorph framework that bridges attention mechanisms with variational image registration by incorporating difference modeling as a physically inspired inductive bias. The resulting driving force, computed from local differences in the latent feature space, provides explicit semantic guidance throughout the registration process. It directly drives the registration process through a neural Demons layer that simulates force-displacement interactions to generate smooth and anatomically consistent deformation. Unlike previous methods, our approach not only integrates traditional registration principles with popular deep networks, providing an explainable and efficient solution for learning-based medical image registration, but also separates difference modeling from deformation, improving modularity and explainability. Extensive experiments on multiple 3D brain MRI datasets demonstrate superior performance over state-of-the-art learning-based and optimization-based methods. Furthermore, visualizations and statistical analyses confirm that the learned driving force aligns closely with actual deformation patterns, supporting its explanatory value.

[CV-60] HiRes: A Hierarchical Cascaded Method for Resistor Value Identification ICONIP2026

Link: https://arxiv.org/abs/2606.30179
Authors: Rama Y. AlHamidi,Aseel A. Mohamed,Mustafa A. Eltayeb,Osama Hasoneh,Mohammad Shaqfeh
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Submitted to ICONIP 2026

Abstract:Accurate identification of resistor values from unconstrained images remains a challenging computer vision task due to variations in lighting, orientation, scale, and background complexity. This paper presents HiRes, a hierarchical cascaded pipeline for end-to-end resistor value identification directly from full-frame images. The approach combines object detection (YOLOv8n), semantic segmentation (UNet++ with EfficientNet-B2), and structured geometric decoding via projection along the resistor axis. To improve robustness, we incorporate geometric filtering, gap-preserving band separation, and validation against the E24 resistor series. Experiments across diverse real-world images show that HiRes achieves a detection mAP50 of 0.9906, a segmentation mIoU of 0.8444, and an end-to-end identification accuracy of 85.8% (95% CI: 78.0-91.9%), outperforming the publicly available classical baseline, CVResist, which fails to generalize beyond controlled conditions. In addition, our architecture outperforms state-of-the-art MLLMs on our challenging test set, offering a lower-cost, high-efficiency, and interpretable alternative. These results demonstrate the effectiveness of integrating learned visual representations with structured reasoning for robust resistor interpretation. Code and dataset are available at this https URL.
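
To make the E24 validation step concrete, here is a minimal sketch of how decoded color bands could be checked against the E24 series. It is not the HiRes implementation: the abstract does not say how validation is performed, so the orientation-trying logic and the function names are assumptions, and only the three value bands (two digits and a power-of-ten multiplier) are handled, with gold and silver multipliers omitted.

```python
# Standard resistor color code for digit and power-of-ten multiplier bands.
DIGIT = {
    "black": 0, "brown": 1, "red": 2, "orange": 3, "yellow": 4,
    "green": 5, "blue": 6, "violet": 7, "grey": 8, "white": 9,
}

# Two-digit significands of the E24 preferred-value series.
E24 = {10, 11, 12, 13, 15, 16, 18, 20, 22, 24, 27, 30,
       33, 36, 39, 43, 47, 51, 56, 62, 68, 75, 82, 91}


def decode(bands):
    """Return (two-digit significand, decimal exponent) for three value bands."""
    d1, d2, exponent = (DIGIT[color] for color in bands[:3])
    return 10 * d1 + d2, exponent


def read_resistor(bands):
    """Try both reading directions; return ohms for the first E24-valid one.

    Returns None when neither direction gives an E24 significand
    (hypothetical rejection rule, assumed for illustration).
    """
    for candidate in (list(bands), list(reversed(bands))):
        significand, exponent = decode(candidate)
        if significand in E24:
            return significand * 10 ** exponent
    return None


# Bands read in the wrong direction: "red, blue" would give 26, which is
# not an E24 value, so the reversed reading (green, blue, red) is used.
print(read_resistor(["red", "blue", "green"]))
```

The check only disambiguates orientation when exactly one direction is E24-valid; pairs such as 27 and 72 versus 47 and 74 show that some readings need additional cues, for example the tolerance band position.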

[CV-61] Latent Noise Mask for Reducing Visual Redundancy in Multimodal Large Language Models

Link: https://arxiv.org/abs/2606.30168
Authors: Kai Jiang,Ruishu Zhu,Siqi Huang,Hongyuan Zhang,Xuelong Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 7 figures

Abstract:Multimodal large language models (MLLMs) often fail in fine-grained visual reasoning, as question-relevant visual cues are diluted by dense and redundant image tokens. Recent multimodal reasoning methods usually extend chain-of-thought from language models into visual or latent spaces, seeking to add intermediate reasoning states while overlooking the negative impact of redundant visual tokens. We propose LatEnt Noise maSk (Lens), a question-conditioned visual evidence purification framework that empowers MLLMs to reason with cleaner visual cues in latent space. Lens introduces a lightweight Lens Evidence Token (LET) to score which visual tokens support the current question and preserve them during decoding. Guided by the LET scores, it injects adaptive latent noise into low-relevance tokens, softly suppressing distractors without changing the model backbone or token sequence. With only one temporary learnable control token and a lightweight noise generator, Lens adds minimal overhead while improving the base MLLM by 2.4-6.4 points on most VQA datasets and by 4.1-6.4 points on grounding tasks. These results show that multimodal reasoning can benefit more directly from cleaner question-relevant visual evidence than from simply extending the reasoning trace.

[CV-62] A Dual-domain Refinement Network with FBP-based Jacobian Learning for Sparse-view Dual-Energy CT Material Decomposition

Link: https://arxiv.org/abs/2606.30159
Authors: Qian Liu,Xiaohong Fan,Ke Chen,Chong Chen,Shuaikang Wang,Jianping Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IEEE Transactions on Computational Imaging, 16 pages

Abstract:Dual-energy CT (DECT) exploits attenuation differences across different X-ray spectra to provide richer material information and has been widely used in medical imaging. While sparse-view acquisition can lower radiation exposure, it makes DECT material decomposition even more challenging, as the problem is nonlinear and ill-posed. Existing deep unrolling approaches generally do not explicitly incorporate the Jacobian operator induced by the nonlinear forward model, and their sparsity priors are still mainly built on conventional convolutions, which are insufficient for modeling global structural information. This study addresses the challenge of DECT multi-material decomposition in sparse-view settings by representing it as a sparse-regularized nonlinear least-squares problem. To solve it, we propose an iterative dual-domain refinement network (DECT-DRNet). In each iteration, the filtered back-projection (FBP)-based Jacobian approximation module is used first to generate an intermediate material decomposition result. Here, we characterize the forward process of material decomposition using a nonlinear operator, and then construct a theoretically grounded learnable approximation of the adjoint Jacobian operator by integrating the FBP algorithm with a U-Net into the backward process. In addition, to address the limitation of existing deep learning-based decomposition methods in globally suppressing noise and artifacts, we introduce a learnable sparse dual domain regularization term that incorporates Fourier convolutional residual blocks. This refinement block combines geometric feature extraction in the image domain with noise suppression in the frequency domain, allowing the model to capture both global and local features while maintaining structural details. DECT-DRNet demonstrates its ability to achieve more accurate material decomposition under sparse-view conditions.

[CV-63] 2LDM: A Self-Conditioned Representation Guided Diffusion Model for Realistic Text-to-LiDAR Scene Generation

Link: https://arxiv.org/abs/2606.30147
Authors: Wentao Qu,Qi Zhang,Chenxu Wang,Guofeng Mei,Yongfei Liu,Xiaoshui Huang,Gim Hee Lee,Liang Xiao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent progress in Text-to-Image generation benefits from large-scale Text-Image pairs. However, the scarcity of Text-LiDAR pairs often causes over-smoothed scenes and limited controllability. In this paper, we rethink the limitations of the Text-LiDAR generation task, focusing on alleviating insufficient training priors and constructing controllable Text-LiDAR data. We propose a Text-to-LiDAR Diffusion Model for LiDAR scene generation, T2LDM++, with a Self-Conditioned Representation Guidance (SCRG). Specifically, to alleviate object over-smoothing, SCRG employs a Guidance Network (GN) to provide reconstruction-based soft supervision to the Denoising Network (DN). This enables DN to learn geometry-aware representations through reconstruction guidance, leading to more accurate denoising in DDPMs. Meanwhile, through analysis and design, SCRG is both more effective and lightweight, and it is decoupled at inference, avoiding computational overhead. Furthermore, we construct two high-quality Text-LiDAR benchmarks (100K samples) using a generalized strategy of geometric annotations, along with a controllability metric. Moreover, a directional position prior is designed to mitigate street distortion, further improving scene fidelity. Additionally, T2LDM++ supports multiple conditions, including (Semantic, Box, BEV, Camera)-to-LiDAR, Sparse-to-Dense, and Dense-to-Sparse generation, by learning a control encoder via frozen DN. With effective prior modeling and high-quality Text-LiDAR benchmarks, T2LDM++ can generate realistic LiDAR scenes with rich geometric details in unconditional and conditional settings.

[CV-64] FacePlex: Full-Duplex Joint Speech-Facial Motion Generation for Conversational Avatars

Link: https://arxiv.org/abs/2606.30145
Authors: Habin Lim,Jae-Ho Lee,Hah Min Lew,Ji-Su Kang,Gyeong-Moon Park
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:Natural face-to-face conversation requires real-time speech generation together with synchronized facial motion. Existing systems only partially address this problem: speech-only full-duplex models can generate speech in real time but do not produce facial motion, while audio-driven facial motion models animate a face from already available audio rather than jointly generating speech and motion online. To bridge this gap, we first formalize full-duplex joint speech-facial motion generation, where speech tokens and facial motion tokens are produced together every step. Building on this formulation, we propose FacePlex, a unified streaming framework with two key components. First, Rolling Flow Matching adapts flow matching to online motion generation by committing new motion frames at each streaming step. Second, Rolling Cross-Attention couples the streaming audio queue with the motion queue, allowing speech and facial motion to condition each other as generation progresses. Through extensive experiments, ablation studies, and a user study, we show that FacePlex enables full-duplex joint speech-facial motion generation under online streaming constraints, while achieving stronger lip-sync quality and motion fidelity than audio-driven facial motion baselines.

[CV-65] Hyper-Network Neural Functional Maps for Unsupervised Robust 3D Shape Matching ECCV2026

Link: https://arxiv.org/abs/2606.30131
Authors: Dongliang Cao,Florian Bernard
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ECCV2026

Abstract:Functional maps are the cornerstone of recent non-rigid 3D shape matching methods due to their efficiency and performance. However, existing methods struggle with challenging scenarios, such as partiality, topological noise, and raw point clouds. A primary bottleneck is that significant intrinsic distortion prevents truncated spectral bases from being accurately aligned via linear transformations (i.e., functional maps). To address this, we introduce a hyper-network that predicts non-linear neural functional maps (NFM), learned in an unsupervised manner, to better align spectral bases. Specifically, we model the NFM as an MLP with skip-connection to refine standard FM and employ a hyper-network to predict its weights, conditioned on standard FM. Our framework is trained using a novel unsupervised spectral alignment loss. Experiments demonstrate that our approach can be seamlessly integrated into state-of-the-art unsupervised deep functional map pipelines, substantially improving matching accuracy in demanding scenarios.

[CV-66] SciIR: A Large-scale Training Dataset and Benchmark for Scientific Image Reasoning Generation ECCV2026

Link: https://arxiv.org/abs/2606.30124
Authors: Zhiyuan Ma,Zhengfeng Shi,Yuning An,Peize Li,Jiabao Wei,Ruijie Li,Junhao Xiao,Jianjun Li,Bowen Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026

Abstract:While Text-to-Image (T2I) models have shown remarkable success in generating photorealistic visual content, they still struggle with the rigorous semantic alignment and logical reasoning required for scientific imagery. Inspired by Peirce’s Semiotic Triad, we introduce Scientific Image Reasoning (SciIR), a comprehensive resource for training and evaluation of scientific image generation. We formalize scientific reasoning into three core dimensions: Entity Structure (Icon), Scientific Process (Index), and Scientific Law (Symbol). Specifically, to overcome the scarcity of training data in scientific image generation, we elaborately create SciIR-82k, a large-scale dataset containing over 80,000 high-quality scientific image-text pairs from cutting-edge publications. The dataset is hierarchically organized according to the semiotic dimensions and incorporates a Scientific Reasoning Chain-of-Thought (Sci-RCoT) to explicitly model underlying visual logic. For evaluation, we propose SciIR-Bench, which aligns with these three semiotic levels and employs an Atomic Checklist to convert the outcome-oriented scientific accuracy into process-oriented, verifiable, fine-grained questions. Our extensive experiments reveal significant deficiencies in current models’ scientific reasoning capabilities. Furthermore, by fine-tuning on the SciIR-82k dataset, we developed the Qwen-Image-SciIR model, which achieves a substantial improvement on the SciIR-Bench, increasing the final score from 35% to 43%, laying a solid foundation for future advances in scientific image generation.

[CV-67] LETT-NeXt: A Lightweight RECIST-Guided Model for 3D CT Lesion Segmentation

Link: https://arxiv.org/abs/2606.30108
Authors: Sebastian Aas,Elias Stenhede,Arian Ranjbar
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:RECIST diameter measurements are widely used for tumor response assessment, but they provide only a limited 2D description of lesion extent. We present LETT-NeXt, a lightweight RECIST-guided model that predicts 3D lesion masks from CT volumes and RECIST markers for the CVPR 2026 Foundation Models for Pan-cancer Segmentation in CT Images competition. LETT-NeXt extracts a RECIST-centered regional crop, encodes the RECIST line and endpoints as two prompt channels, and concatenates them with the CT input. A compact MedNeXt-v2 encoder–decoder predicts the lesion mask, followed by prompt-aware component selection and adaptive AutoZoom inference. On the public validation set, LETT-NeXt achieved a Dice Similarity Coefficient (DSC) of 79.4 ± 10.1 and a Normalized Surface Dice (NSD) of 72.3 ± 16.2. On the hidden test set, it achieved a DSC of 73.9 and an NSD of 67.3, corresponding to a challenge score of 70.6%. On the public validation mirror, LETT-NeXt completed CPU inference in 6.9 ± 3.0 s per case with a peak memory use of 3.6 GB. Code is available at this http URL.
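
For readers unfamiliar with the metric, the Dice Similarity Coefficient is twice the overlap between predicted and reference masks divided by the sum of their sizes. The sketch below computes it on flattened binary masks; the toy masks are hypothetical. The final lines only note that the reported challenge score of 70.6% coincides with the plain average of the hidden-test DSC and NSD, which is an observation about the numbers and not a statement of the official scoring rule.

```python
def dice(pred, target):
    """Dice Similarity Coefficient for two equal-length binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    overlap = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Two empty masks agree perfectly by convention.
    return 2.0 * overlap / total if total else 1.0


# Hypothetical flattened masks: 2 predicted voxels, 1 reference voxel, 1 shared.
toy_dsc = dice([1, 1, 0, 0], [1, 0, 0, 0])
print(f"toy DSC = {toy_dsc:.3f}")

# Arithmetic check on the figures quoted in the abstract.
mean_of_dsc_and_nsd = (73.9 + 67.3) / 2
print(f"mean of hidden-test DSC and NSD = {mean_of_dsc_and_nsd:.1f}")
```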

[CV-68] SIR: Structured Image Representations for Explainable Robot Learning CVPR2026

Link: https://arxiv.org/abs/2606.30101
Authors: Paul Mattes,Jan Schwab,Jens Bosch,Nils Blank,Maximilian Xiling Li,Minh-Trung Tang,Moritz Haberland,Rudolf Lioutikov
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published at CVPR 2026

Abstract:Existing robot policies based on learned visual embeddings lack explicit structure and are sensitive to visual distractions. Thus, the representations that drive their behaviour are often opaque, making their decision-making process difficult to interpret. To address this, we introduce Structured Image Representations (SIR), a method that leverages Scene Graphs (SGs) as an intermediate representation for robot policy learning. Our approach first constructs a fully connected graph, using image-derived features as initial node representations. Then, a module learns to sparsify this graph end-to-end, creating a task-relevant sub-graph that is passed to the action generation model. This process makes our model intrinsically explainable. Evaluations on RoboCasa show that our sparse graph policies outperform image-based baselines on average with 19.5% vs 14.81% success rate. Most importantly, we show that the learned sparse graphs are a powerful tool for model analysis. By analysing when the model’s sub-graph deviates from human expectation, such as by including distractor nodes or omitting key objects, we successfully uncover dataset biases, including spurious correlations and positional biases. this https URL

[CV-69] CylindTrack: Depth-Aware Cylindrical Motion Modeling for Panoramic Multi-Object Tracking

Link: https://arxiv.org/abs/2606.30097
Authors: Buyin Deng,Kai Luo,Lingxin Huang,Xinqi Liu,Fei Cheng,Hang Zheng,Liming Yin,Kailun Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Comments: The source code will be released at this https URL

Abstract:Multi-Object Tracking (MOT) is a core capability for embodied perception, and panoramic cameras are attractive for embodied systems because their 360° field of view reduces blind spots and keeps surrounding targets observable for longer durations. However, panoramic MOT is not a straightforward extension of perspective MOT. In equirectangular panoramic videos, the horizontal image domain is periodic rather than Euclidean, which breaks planar motion assumptions and makes IoU-based association unreliable near the 0°/360° seam. Meanwhile, large-FoV scenes often contain more objects, stronger scale variation, and more frequent interactions, making online association particularly sensitive to unstable frame-wise depth cues. To address these issues, we propose CylindTrack, a depth-aware cylindrical tracking-by-detection framework for panoramic MOT. CylindTrack first introduces Depth-Temporal Trajectory Modeling (DTM), which promotes instance depth from an isolated frame-wise cue to a temporally filtered trajectory-level state. To improve the reliability of depth observations, we further develop Spherical Spatio-Temporal Consistency Learning (SSTC), which combines a Temporal Mixer and Spherical Geometry-aware Attention to enhance temporal coherence and panoramic geometric alignment in depth-aware representations. Finally, we design a Topology-Aware Cylindrical Motion Model (TCMM) that lifts horizontal motion into a continuous angular state space and performs seam-consistent motion prediction and association in the periodic panoramic domain. By jointly modeling trajectory-level depth consistency and panoramic topology, CylindTrack improves identity preservation and trajectory continuity in challenging panoramic scenes. The source code will be released at this https URL.
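
The seam problem mentioned above can be shown with a small sketch: two boxes that straddle the 0°/360° boundary of an equirectangular frame have zero planar IoU even though they overlap on the cylinder. The code below is not CylindTrack's TCMM; it is an assumed, minimal way to make IoU periodic in the horizontal axis, with boxes given as hypothetical `(cx, cy, w, h)` tuples in pixels.

```python
def wrapped_delta(x_from: float, x_to: float, width: float) -> float:
    """Signed shortest horizontal offset on a panorama that is `width` px wide."""
    return (x_to - x_from + width / 2) % width - width / 2


def seam_aware_iou(box_a, box_b, width: float) -> float:
    """IoU of two (cx, cy, w, h) boxes with the x axis treated as periodic."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Shift box b to the periodic copy that lies closest to box a.
    bx = ax + wrapped_delta(ax, bx, width)
    overlap_x = max(0.0, min(ax + aw / 2, bx + bw / 2) - max(ax - aw / 2, bx - bw / 2))
    overlap_y = max(0.0, min(ay + ah / 2, by + bh / 2) - max(ay - ah / 2, by - bh / 2))
    intersection = overlap_x * overlap_y
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


# A 3600 px wide panorama: one box just left of the seam, one just right of it.
left_of_seam = (3590.0, 100.0, 40.0, 40.0)
right_of_seam = (10.0, 100.0, 40.0, 40.0)
print(seam_aware_iou(left_of_seam, right_of_seam, width=3600.0))
```

Treating the horizontal coordinate as an angle in this way is one simple instance of the periodic-domain reasoning the abstract describes; a full tracker would also need the motion model's state and covariance to respect the same wrap-around.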

[CV-70] One Forward Beats Two: InnerZoom for Accurate and Efficient GUI Grounding

Link: https://arxiv.org/abs/2606.30084
Authors: Chen Liu,Ling Chen,Hanzhang Zhou,Liangyu Chen,Chenglin Cai,Xin Yu,Steven Hoi,Yue Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:MLLM-based GUI grounding methods commonly formulate target localization as autoregressive coordinate generation, enabling models to leverage the strong instruction-following and semantic understanding capabilities of MLLMs. However, this formulation requires the model to retain region-level target evidence while decoding coordinate tokens with the spatial precision demanded by GUI clicking. Our diagnostic analysis reveals that target-region awareness emerges in intermediate decoder layers but is neither retained nor translated into the final coordinate prediction. Existing ZoomIn-style methods address this issue through an external crop-and-rerun pass, which improves localization but increases end-to-end latency and computational cost. To retain the accuracy benefits of two-pass zooming without this extra cost, we propose InnerZoom, a single-forward framework for cross-layer evidence bridging. InnerZoom transforms target-related cues from the original forward pass into a compact cross-layer evidence state, then preserves, refines, and reinjects this state throughout later decoding layers to guide coordinate prediction. Extensive experimental results suggest that InnerZoom-4B achieves state-of-the-art performance on all six GUI grounding benchmarks, obtaining 64.7 on OSWorld-G, 40.2 on UI-Vision, 73.1 on OSWorld-GR, and 87.6 on MMBench-GUI, surpassing the previous best results by 4.1, 3.2, 2.9, and 2.3 points, respectively. Under a controlled 4B setting, InnerZoom improves the same SFT+RL baseline by 5.3 points on average and outperforms two-pass ZoomIn by 1.3 points on average, while reducing end-to-end latency by up to 31.8% and TFLOPs by about 29%. Code and models will be publicly available.

[CV-71] Clinical Risk-Aware Multi-Level Grading for Coronary Artery Stenosis through Curved Feature Reconstruction

Link: https://arxiv.org/abs/2606.30082
Authors: Shishuang Zhao,Hongtai Li,Junjie Hou,Yuhang Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Developing a multi-level grading model for coronary artery stenosis holds great clinical significance for the diagnosis of coronary artery disease. However, designing an effective multi-level deep learning algorithm faces significant challenges. Specifically, utilizing CCTA or 3D SCPR images alone presents inherent shortcomings: CCTA images are difficult to analyze due to the tortuous paths of blood vessels, while 3D SCPR images are prone to abnormal distortions that hinder accurate grading. Furthermore, different stenosis grades are associated with varying clinical risks, and incorporating this association into the algorithm is non-trivial. To address the former problem, we propose the Curved Feature Reconstruction (CFR) module, which uses vessel curves as a prior and employs a point-by-point correspondence strategy to precisely align and fuse features from both 3D SCPR and CCTA images. Meanwhile, a Clinical Risk-Aware (CR) Loss is employed to introduce clinical risk relevance into the network training so that the algorithm can better align with the clinical diagnosis. The experimental results on an in-house dataset reveal that our approach significantly outperforms other methods, and several ablation studies also demonstrate the effectiveness of our proposed designs.

[CV-72] Neural Subspace Reallocation: Continual Learning as Retrieval-Based Subspace Memory Management

Link: https://arxiv.org/abs/2606.30067
Authors: Byeong Hoon Yoon
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 1 figure

Abstract:We introduce Neural Subspace Reallocation (NSR), which reframes continual learning as memory management over parameter subspaces. Instead of treating Low-Rank Adaptation (LoRA) modules as disposable per-task adapters, NSR manages them as compressible, retrievable memory units on a frozen backbone through a recurring cycle: (1) compress learned LoRAs via SVD, (2) reserve them in a TaskKnowledgeBank, (3) recall related past LoRAs by embedding similarity to warm-start new or returning tasks, and (4) reallocate the active subspace accordingly, with distillation protecting prior tasks. We prove that in cyclic environments any memoryless allocation policy incurs cumulative regret Ω(T(M−1)Δ_switch) relative to a history-aware policy backed by the Bank (Theorem 1). Empirically, on Split-CIFAR-100 the Bank reduces cyclic recovery time by 10x, exactly as predicted, and on the heterogeneous 5-Datasets benchmark NSR achieves the highest accuracy and the least forgetting, about 9x closer to zero backward transfer than the memoryless heuristics. Crucially, we run a controlled study that isolates which component matters: holding the Bank fixed and varying only the allocation rule, we find that a simple similarity-based retrieval rule matches or beats a learned reinforcement-learning controller (recovering recurring tasks in 0 vs 1.8 steps and reaching equal accuracy). Our central, honest finding is therefore that the memory mechanism – compression and similarity retrieval – rather than a learned allocation policy, drives continual-learning performance under fixed capacity. A memory-budget analysis confirms the compressed Bank stays small – 0.29 MB of parameter memory per task – so a top-K retention cap bounds the total footprint while preserving fast recovery for retained tasks.
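
The reserve and recall steps of the cycle can be sketched as a capped store keyed by task embeddings. This is a hedged illustration, not the paper's TaskKnowledgeBank: the class name `SimpleBank`, the oldest-first eviction rule, and the similarity threshold are all assumptions, and the stored adapter is an opaque placeholder where the SVD-compressed LoRA factors would go.

```python
import math
from collections import OrderedDict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class SimpleBank:
    """Hypothetical top-K store of (task embedding, compressed adapter) pairs."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()  # task_id -> (embedding, adapter)

    def reserve(self, task_id, embedding, adapter):
        """Store an adapter; evict the oldest entry when over capacity (assumed rule)."""
        self.entries[task_id] = (list(embedding), adapter)
        self.entries.move_to_end(task_id)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)

    def recall(self, embedding, threshold: float = 0.0):
        """Return the id of the most similar stored task, or None below threshold."""
        best_id, best_similarity = None, threshold
        for task_id, (stored, _adapter) in self.entries.items():
            similarity = cosine(embedding, stored)
            if similarity > best_similarity:
                best_id, best_similarity = task_id, similarity
        return best_id


bank = SimpleBank(capacity=2)
bank.reserve("task_a", [1.0, 0.0], adapter="lora_a")
bank.reserve("task_b", [0.0, 1.0], adapter="lora_b")
print(bank.recall([0.9, 0.1]))  # a returning task close to task_a
```

With a fixed per-task cost, a capacity of K entries bounds the bank's footprint at K times that cost, which is the budget argument the abstract makes with its 0.29 MB per-task figure.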

[CV-73] Emergence of a Shared Canonical Object Frame from In-the-Wild Videos

Link: https://arxiv.org/abs/2606.30058
Authors: Tom Fischer,Martin Sundermeyer,Adam Kortylewski,Eddy Ilg
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Comparing object orientations and positions across different instances requires their poses to be expressed in a shared canonical frame. Establishing such frames has traditionally required manual annotation, creating a scaling bottleneck that limits category and instance diversity. We show that a shared canonical frame can instead emerge from self-supervised training on object-centric videos captured in the wild, using only noisy camera poses from Structure-from-Motion. Our key idea is to route all training sequences through a shared geometric bottleneck: a coarse canonical mesh that carries no category-specific detail. By learning dense correspondences from image pixels to this mesh, and estimating per-sequence alignments from noisy SfM geometry, a common canonical frame emerges from multi-view consistency and the semantic priors of the feature extractor, without any canonical pose labels or category conditioning. Trained in a self-supervised manner on 160,000 in-the-wild object videos, our method achieves competitive accuracy on category-level pose estimation benchmarks compared to methods that rely on canonical pose supervision. The code and checkpoint are available at this https URL.

[CV-74] Illuminating Unified Multimodal Model for Free-form Interleaved Text-Image Generation ECCV2026

Link: https://arxiv.org/abs/2606.30054
Authors: Chonghuinan Wang,Zhikai Chen,Chunwei Wang,Yecong Wan,Junwei Yang,Zhixin Wang,Wei Zhang,Jiaqi Xu,Renjing Pei,Xiaohe Wu,Fan Li,Wangmeng Zuo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV2026

Abstract: The advancement of generative AI models capable of producing both text and images marks a critical step forward in the realm of multimodal intelligence, particularly for tasks involving the interleaving of both modalities. To advance this intelligence to the next stage, it is crucial for models to autonomously generate free-form interleaved text-image sequences. In this paper, we introduce ILLUME-X, an advanced unified multimodal paradigm that enables high-quality, free-form interleaved text-image generation by improving multimodal data efficiency and stabilizing the multimodal training process. ILLUME-X comprises three key components: (i) an expanded training data pipeline optimized for interleaved text-image generation, (ii) a progressive training strategy with self-adaptive objectives for free-length multimodal token sequences, and (iii) an objective and comprehensive evaluation method, ILScore, for interleaved text-image sequences. Notably, ILLUME-X outperforms previous unified models across multiple interleaved text-image generation tasks such as style transfer, image decomposition, and storytelling.

[CV-75] Bridging the Gap Between Image Restoration and Navigational Safety in Hazy Conditions: A New Visibility Estimation Metric for Maritime Surveillance

Link: https://arxiv.org/abs/2606.30049
Authors: Wentao Feng, Guobei Peng, Wengang Mao, Ryan Wen Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 20 pages, 10 figures

Click to view abstract

Abstract:Visibility distance is critical to maritime navigational safety because it determines the effective observation range of shipborne and shore-based monitoring systems. Under hazy conditions, degraded visual information shortens observable distance and increases navigational risks and economic losses. Although numerous image dehazing methods have been developed, conventional image quality assessment metrics, such as PSNR, SSIM, FSIM, FADE, and NIQE, cannot establish a physically interpretable relationship between restoration quality and practical visibility thresholds. To address this limitation, this work proposes a visibility-oriented evaluation framework that links dehazing performance with visible-distance estimation. First, a Maritime Simulated Visibility Dataset (MSVD) is constructed using Unity3D to simulate maritime traffic scenes under graded visibility conditions. The dataset provides paired hazy and clear images with precise visibility annotations, enabling quantitative analysis of visibility restoration. Second, a dehazing visibility evaluation metric is developed by using object detection accuracy as an intermediate indicator. By establishing a mapping between visibility distance and detection performance, the proposed metric converts image restoration improvements into measurable visibility gains. Six representative dehazing methods are evaluated using both conventional image quality metrics and the proposed visibility metric. Experimental results under different imaging conditions demonstrate that MSVD provides a reliable benchmark for evaluating dehazing performance across graded visibility levels, while the proposed metric enables interpretable and reliable visible-distance estimation, thereby supporting the assessment of navigational safety and operational efficiency.
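The idea of using detection accuracy as an intermediate indicator can be illustrated with a small sketch: given a monotone calibration curve from visibility distance to detection accuracy, invert it to express a dehazing improvement as a visibility gain in metres. The calibration values, function names, and the piecewise-linear interpolation are assumptions for illustration only; the abstract does not state how the mapping is fitted.

```python
# Hypothetical calibration pairs: (visibility distance in metres, detection accuracy).
CALIBRATION = [
    (100.0, 0.20),
    (300.0, 0.45),
    (500.0, 0.65),
    (1000.0, 0.85),
    (2000.0, 0.95),
]


def visibility_from_accuracy(acc):
    """Invert the monotone calibration curve by piecewise-linear interpolation."""
    pts = sorted(CALIBRATION, key=lambda p: p[1])
    if acc <= pts[0][1]:
        return pts[0][0]  # clamp below the calibrated range
    if acc >= pts[-1][1]:
        return pts[-1][0]  # clamp above the calibrated range
    for (d0, a0), (d1, a1) in zip(pts, pts[1:]):
        if a0 <= acc <= a1:
            t = (acc - a0) / (a1 - a0)
            return d0 + t * (d1 - d0)


def visibility_gain(acc_hazy, acc_dehazed):
    """Equivalent visibility improvement attributed to dehazing."""
    return visibility_from_accuracy(acc_dehazed) - visibility_from_accuracy(acc_hazy)


gain = visibility_gain(acc_hazy=0.45, acc_dehazed=0.85)
```

Under these made-up calibration values, raising detection accuracy from 0.45 to 0.85 corresponds to moving from the 300 m level to the 1000 m level.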

[CV-76] Argus: Metric Panoramic 3D Reconstruction for Indoor Scenes

Link: https://arxiv.org/abs/2606.30047
Authors: Xi Li, Linyuan Li, Yan Wu, Tong Rao, Kai Zhang, Xinchen Hui, Cihui Pan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Metric feed-forward 3D reconstruction for panoramic data remains under-explored due to the lack of large-scale panoramic RGB-D training data. We present Realsee3D, a hybrid dataset of 10K indoor scenes (1K real, 9K synthetic) with 299K panoramic viewpoints and precise metric annotations, and Argus, a feed-forward network trained on it for metric panoramic 3D reconstruction. In the sparse unordered capture setting of Realsee3D, a poorly chosen coordinate anchor can cause global pose drift. Argus addresses this with a learned covisibility module that selects the geometrically optimal reference view to anchor the metric world frame. To further improve multi-task learning, we decompose the bidirectional pixel-to-world mapping into interpretable sub-steps with per-step supervision and cross-coordinate joint constraints, reinforcing geometric consistency across prediction branches. On the Realsee3D benchmark, Argus achieves state-of-the-art metric performance in camera pose estimation, depth estimation, and point cloud reconstruction. Project page: this https URL. 

[CV-77] Walking in the Implicit: Interactive World Exploration via Neural Scene Representation ECCV2026

Link: https://arxiv.org/abs/2606.30045
Authors: Zhiqi Li, Chengrui Dong, Zhenhua Du, Hangning Zhou, Cong Qiu, Hailong Qin, Mu Yang, Dongxu Wei, Peidong Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2026

Click to view abstract

Abstract:Interactive video generation systems for camera-controlled world exploration roll out growing sequences of latent video frames, entangling state transition with high-frequency observation synthesis. We propose Walking in the Implicit, a scene-centric paradigm that changes the rollout variable from frame latents to a fixed-length, renderable implicit state, termed Neural Implicit Scene (NIS). This factorizes interactive generation into stochastic transition of a compact scene state and deterministic pose-conditioned rendering given the sampled state. We instantiate this paradigm as NeuWorld: a transformer VAE learns locally anchored NIS from sparse posed frames, and a diffusion transformer evolves NIS conditioned on future camera trajectories and geometry-aware retrieved history. By reusing the VAE encoder as a unified conditioner, NeuWorld maps camera, reference-image, and history cues into the same NIS modality, avoiding external heterogeneous encoders. Trained from scratch on public posed-view data without pretrained video backbones or auxiliary 3D reconstructors, NeuWorld achieves strong long-horizon consistency with favorable inference efficiency.

[CV-78] CogSENet: Blind Image Deblurring with Blur-Conditioned Semantic Routing and Explicit Frequency Fusion ECCV2026

Link: https://arxiv.org/abs/2606.30030
Authors: Pan Wang, Yihao Hu, Xiujin Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2026

Click to view abstract

Abstract: Blind image deblurring demands the recovery of high-fidelity details and coherent structures from complex, unknown degradations. Current blind image deblurring methods struggle with real-world, spatially varying degradations, and lack the semantic awareness necessary to reliably differentiate valid textures from artifacts. To bridge this gap, we propose CogSENet, a dynamic, semantic-aligned reconstruction framework inspired by the eagle’s visual system. By mimicking the eagle’s active saccadic scanning, we devise a Semantic-Driven State Space Module (SDSSM) with semantic-aware token regrouping via differentiable routing, enabling prompt-conditioned long-range dependency modeling. To ensure physically interpretable recovery of textures and structures, a BiFreqFusionBlock (BFFB) mirrors the functional differentiation of the eagle’s retina by decomposing features into high and low frequencies using wavelet transforms. Finally, we estimate a continuous Blur Field (CBF) from the blurred image and fuse it with CLIP semantic priors to modulate the deepest latent features, emulating focal adaptation and enabling adaptive restoration under spatially non-uniform blur. Extensive experiments demonstrate that CogSENet outperforms state-of-the-art deblurring methods in both visual quality and structural fidelity with fewer parameters, while also performing favorably on dehazing, deraining, and denoising tasks.

[CV-79] Cross-Modal Iteration Distillation for Robust IHD Screening: The IDNet Framework and A New Benchmark

Link: https://arxiv.org/abs/2606.30027
Authors: Yongchang Gao, Junjie Pang, Shuaiyu Yang, Yusheng Yang, Xichao Jia, Shaojie Li, Hongfei Zhang, Jia Mu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the 2026 IEEE International Conference on Systems, Man, and Cybernetics (SMC 2026)

Click to view abstract

Abstract:Color Fundus Photography (CFP) offers a low-cost and non-invasive route for ischemic heart disease (IHD) screening, but current studies are limited by scarce public benchmarks and ineffective fusion of retinal images with sparse clinical variables. We propose IDNet, a multimodal framework with a Cross-Modal Distillation Aggregator (CDA) that uses learnable queries to sequentially integrate left-eye, right-eye, and clinical features, mitigating the imbalance between high-dimensional visual features and low-dimensional tabular inputs. We also construct a reproducible UK Biobank benchmark with open-source curation and quality-control pipelines, yielding 50,410 images from 25,205 subjects. On this benchmark, IDNet outperforms image-only, clinical-only, and several multimodal baselines, and CDA consistently improves multiple visual encoders as a plug-in fusion module.

[CV-80] MuseBench: Benchmarking Intent-Level Audiovisual Arts Understanding in MLLMs

Link: https://arxiv.org/abs/2606.30026
Authors: Yuxuan Fan, Gyusik Seo, Jing Hao, Jaemin Cho, Mohit Bansal, Jaehong Yoon
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL

Click to view abstract

Abstract: Audiovisual arts encompass diverse creative disciplines, including cinema, visual arts, stage performance, and game design, where artistic meaning arises from deliberate combinations of visual, auditory, and narrative elements (e.g., fear amplified through claustrophobic framing, or grief conveyed through silence and lingering close-ups). True artistic understanding extends beyond recognizing what is depicted to reasoning about why it is expressed through particular creative choices. Despite the strong progress of multimodal large language models (MLLMs), this critical aspect of artistic understanding remains underexplored, as existing benchmarks largely measure perceptual recognition while overlooking reasoning about creative intent. To address this gap, we introduce MuseBench, a comprehensive benchmark designed to evaluate MLLMs on nuanced artistic understanding. It comprises 4,016 questions spanning cinematic arts, static visual arts, stage performing arts, and game arts, distilled from over 10K candidate video essays that pair professional commentary with visual demonstration. To capture the open-ended nature of artistic analysis at scale, the benchmark combines single-select and variable-option multi-select questions. All questions are generated and refined through a four-phase iterative pipeline combining shortcut filtering, adversarial distractors, and expert validation. Comprehensive zero-shot evaluation of 28 state-of-the-art MLLMs reveals that even the best-performing model achieves only 48.29% accuracy, substantially below human expert performance of 87.18%, exposing a significant gap in current models’ creative domain expertise.

[CV-81] IBRSteG: Learning a Generalizable Steganography Framework for 3D Gaussian Splatting

Link: https://arxiv.org/abs/2606.30024
Authors: Fanye Kong, Hongyu Xia, Yu Zheng, Boyang Gong, Jie Zhou, Jiwen Lu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by IEEE Transactions on Multimedia (TMM)

Click to view abstract

Abstract: Recent advances in deep learning have notably improved steganographic message hiding. However, designing a generalizable steganographic approach for 3D Gaussian Splatting (3DGS) that can embed meaningful 3D scene content remains challenging. In this paper, we propose IBRSteG, a generalizable framework for 3DGS steganography that enables undetectable concealment of secret scenes within a steganographic scene. Unlike existing approaches whose parameter generation is rigidly coupled with the specific scene, we formulate 3D steganography as a feed-forward 3D Gaussian embedding process that generalizes across different 3DGS scenes. To realize this, we introduce GAS (Gaussian Attributes Steganographer), a network that learns a scene-independent embedding function by injecting the attributes of secret 3D Gaussian points into a cover scene, thereby directly reconstructing the steganographic scenes without per-scene finetuning or optimization. Transforming 3D Gaussians into structured attributes makes them compatible with 2D learning paradigms, and this structure enhances generalization to unseen 3DGS scenes. Extensive experiments on established datasets demonstrate that IBRSteG can effectively conceal different scenes with high visual quality, and achieves superior capacity and security. Code is available at this https URL.

[CV-82] Uncertainty Estimation in Pathology Foundation Models via Deep Mutual Learning

Link: https://arxiv.org/abs/2606.30020
Authors: Gbègninougbo Aurel Davy Tchokponhoue, Sevda Öğüt, Ali Idri, Dorina Thanou, Pascal Frossard
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract: Pathology foundation models (PFMs) offer generalizable representations for whole-slide image (WSI) analysis, yet their clinical adoption remains limited. Specifically, their predictions lack reliable confidence estimates, and no single PFM is universally best across tasks, which severely undermines trust in medical settings. To overcome this, we propose $\mathtt{DICE}$, a plug-and-play framework that ensembles $K$ frozen PFMs and models their disagreement as a proxy for uncertainty estimation. To ensure this proxy yields meaningful estimates, we align the ensemble members via deep mutual learning, and theoretically show that this objective upper-bounds the model uncertainty. Additionally, we demonstrate that the ensemble’s consensus localizes abnormalities at the patch level without any explicit supervision. We evaluate $\mathtt{DICE}$ on three challenging WSI benchmarks. Notably, our framework provides reliable uncertainty estimates that accurately flag failure-prone cases under in- and out-of-distribution settings, while matching or outperforming SOTA baselines in classification, calibration, and localization. Overall, $\mathtt{DICE}$ takes a crucial step toward translating PFMs into uncertainty-aware decision-support systems.
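One common way to turn ensemble disagreement into an uncertainty score is the mean KL divergence from each member's predictive distribution to the ensemble average. The sketch below illustrates that generic idea. It is an assumption that this resembles the measure used in the paper, since the abstract does not specify it, and the function names and toy probabilities are hypothetical.

```python
import math


def ensemble_mean(probs):
    """Average the class probabilities of K ensemble members."""
    k = len(probs)
    return [sum(p[c] for p in probs) / k for c in range(len(probs[0]))]


def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) with a small epsilon for numerical safety."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def disagreement(probs):
    """Mean KL from each member to the ensemble mean (higher = less agreement)."""
    mean = ensemble_mean(probs)
    return sum(kl(p, mean) for p in probs) / len(probs)


# Three hypothetical members that agree, and three that do not.
agree = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]
split = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]

low = disagreement(agree)
high = disagreement(split)
```

A case would then be flagged as failure-prone when its score exceeds a threshold chosen on held-out data.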

[CV-83] OmniDance: Multimodal Driven Dance Video Generation with Large-scale Internet Data ECCV2026

Link: https://arxiv.org/abs/2606.30019
Authors: Kaixing Yang, Jiashu Zhu, Xulong Tang, Ziqiao Peng, Xiangyue Zhang, Chubin Chen, Puwei Wang, Jiahong Wu, Xiangxiang Chu, Hongyan Liu, Jun He
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2026

Click to view abstract

Abstract:Music-driven dance video generation aims to synthesize expressive human motion that is temporally aligned with music while maintaining high visual fidelity. Despite recent progress, existing methods still face two key limitations: the lack of large-scale, high-quality dance video datasets, and the absence of principled frameworks for integrating music as a complementary conditioning signal into Video Generation Foundation Models. To address these limitations, we introduce CIPE-Dance, a large-scale Internet-sourced dance video dataset with choreography-informed text annotations, constructed via a progressive expert pipeline. To the best of our knowledge, CIPE-Dance is the largest dataset for dance video generation to date, comprising 300k high-quality clips over 400 hours and covering diverse dancers, environments, and dance genres. We further propose OmniDance, a framework-level recipe for integrating music into a TI2V foundation model without sacrificing its original controllability or visual fidelity. Motivated by the complementary roles of text as low-frequency semantics and music as high-frequency temporal dynamics, OmniDance co-designs a depth-aware specialization architecture, an anchored easy-to-hard curriculum learning strategy, and a modality-specialized time-dependent CFG strategy, enabling unified TI2V, MI2V, and MTI2V generation. Extensive experiments on CIPE-Dance demonstrate that OmniDance achieves state-of-the-art performance across all three tasks and exhibits robust multimodal integration capability. Project is available at this https URL.

[CV-84] Monte Carlo Energy Aggregation for Mobile 3D Gaussian Splatting ECCV2026

Link: https://arxiv.org/abs/2606.30017
Authors: Xiaobiao Du, YuAn Wang, Hao Li, Bosheng Wang, Xun Sun, Xin Yu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2026, Project Page: this https URL

Click to view abstract

Abstract: Recent advances in 3D Gaussian Splatting have demonstrated unprecedented success in novel view synthesis. However, the substantial inference and storage overhead driven by high-order Spherical Harmonics (SH) are primary bottlenecks for mobile platforms. In this paper, we present Flux-GS, a real-time Gaussian Splatting method designed to achieve high-fidelity rendering with significantly reduced overhead for resource-constrained mobile platforms. We first propose a Monte Carlo Specular Energy Aggregator, sampling third-order radiance residuals and aggregating specular energy into a compact latent space. In this way, our method effectively preserves visually salient lighting features in lower-order bands without expensive distillation or pre-training. To mitigate the high-frequency details lost during compression, we introduce an Attribute-Conditioned SH Enhancement module. This module predicts Gaussian-aware offsets based on intrinsic Gaussian attributes, which enhance the first-order SH representation prior to inference, without extra inference costs. Furthermore, the original single-view gradient-based densification is prone to producing excessive Gaussians and overfitting to a certain view. We address these limitations by proposing a Multi-view Alpha-based Densification and Pruning strategy. By leveraging multi-view guidance, we ensure multi-view structure consistency and the precise removal of redundant primitives. Extensive experiments demonstrate that Flux-GS achieves substantial parameter reduction while maintaining competitive visual quality, offering a robust and scalable solution for real-time mobile rendering. Code: this https URL.

[CV-85] Shell-Supervised Gaussian Splatting for Urban Real-to-Sim Reconstruction

Link: https://arxiv.org/abs/2606.30014
Authors: Yuan Yang, Peijun Lu, Fangzhou Lu, Sai Fan, Siqi Yan, Chenyuan Zhang, Haobo Liang, Yichen Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages main paper, 2 pages supplementary material

Click to view abstract

Abstract:Real-to-sim reconstruction for embodied AI requires geometry that is useful for collision reasoning, navigation, and agent-environment interaction, not only photorealistic novel-view synthesis. However, close-range urban facades are difficult for video-to-3D reconstruction: glass, reflections, repeated windows, and weak texture can produce visually plausible renderings with unstable surface geometry. We introduce shell-supervised Gaussian Splatting, a reconstruction-stage framework that uses an external facade structural shell as lightweight geometric supervision for video-driven Gaussian reconstruction. The method aligns an exterior shell to the video reconstruction frame, renders per-view depth, camera-space normal, and valid-mask maps, and applies these cues through mask-gated losses during Gaussian optimization. This design preserves RGB-driven appearance while regularizing only visible shell-supported facade regions. Experiments on anonymized close-range urban facade scenes show improved facade orientation and visible-surface point-cloud consistency over photo-only, monocular-cue, and surface-oriented Gaussian baselines, while maintaining comparable held-out rendering quality.

[CV-86] SkelEM: Training-Signal Decoupling of Skeleton and Diffusion for Self-supervised Axial Super-Resolution in Volume Microscopy ECCV2026

Link: https://arxiv.org/abs/2606.30012
Authors: Bohao Chen, Yanchao Zhang, Yanan Lv, Chenxun Deng, Hua Han, Xi Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026

Click to view abstract

Abstract: Volume microscopy, including electron and light microscopy, suffers from severe anisotropic resolution due to physical axial sectioning. Existing self-supervised axial super-resolution (ASR) methods face a trilemma bounded by overly smoothed regression textures, structural hallucinations of pure diffusion models, and prohibitive inference latency. In this paper, we propose Skeleton-refinE Microscopy (SkelEM), a self-supervised framework that decouples ASR at the training-signal level: a frozen topological network and a diffusion refiner are optimized by disjoint objectives, separating low-frequency topology formulation from high-frequency detail enhancement. Building on this deterministic skeleton, we exploit a unified cycle-consistent mechanism on input sparse slices to simultaneously extract a real-domain residual prior and bidirectionally align the diffusion refiner, washing away cross-plane artifacts without synthetic bias. By truncating the reverse diffusion process with this physical prior, SkelEM achieves high-fidelity detail restoration in merely $\le 5$ steps. To rigorously assess cross-instrument generalization, we further introduce BRAVE-ASR, a new benchmark of co-aligned anisotropic and isotropic volumes acquired on a Plasma-FIB instrument. Across public benchmarks, SkelEM achieves the most favorable balance across the fidelity-perception trade-off among self-supervised methods, with state-of-the-art downstream membrane segmentation performance and robust zero-shot generalization across distinct modalities.

[CV-87] GeoEdit: Geometry-Aware Object Editing via Dual-Branch Denoising ECCV2026

Link: https://arxiv.org/abs/2606.30003
Authors: Yi He, Jiangming Wang, Xinyu Wang, Mark Fong, Songchun Zhang, Yuxuan Xue, Hai-Tao Zheng, Yue Ma
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026

Click to view abstract

Abstract:Precisely manipulating objects in a single photograph (translation, rotation, scaling) while obeying 3D physical constraints remains unsolved for diffusion-based editors. Current 2D methods lack spatial awareness and produce perspective violations. Forcing structural proxies into the latent space also disrupts variance homogeneity, and the resulting self-attention leakage leads to ghosting and background blur. The core difficulty is asymmetric: the relocated object must follow a rigid geometry, yet the uncovered background needs freedom to synthesize plausible content. We present GeoEdit, a training-free Lift-Manipulate-Render-Denoise pipeline that satisfies both constraints. We decouple scene and object in 3D, align them through point correspondence, and render a geometry-aligned proxy with a structural depth map. A Dual-Branch Denoising stage then refines this proxy: a video diffusion backbone preserves object identity, while 3D constraints are injected into the foreground within a narrow denoising window at matching noise variance (variance-homogeneous injection). The background denoises freely. Because the injected signal matches the native latent statistics, self-attention stays undisturbed. We also introduce GeoEditBench, a pose-aware benchmark covering object translation, object rotation, and camera movement with pose-aware evaluation metrics. Experiments confirm consistent gains in geometric accuracy, identity fidelity, and background quality. Our codes are available at this https URL.

[CV-88] Rigel: Self-Distilled Score Adaptation for Image and Video Captioning Evaluation

Link: https://arxiv.org/abs/2606.29997
Authors: Shuitsu Koyama, Kazuki Matsuda, Yuiga Wada, Shinnosuke Hirano, Daichi Yashima, Komei Sugiura
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Automatic evaluation of image and video captioning is essential for benchmarking multimodal systems, although standard evaluation metrics show limited alignment with human judgments. Recent approaches using large language models (LLMs), commonly referred to as LLM-as-a-Judge, have improved alignment with human judgments but still suffer from a mismatch between large-vocabulary language modeling and evaluation over a small label set. To address this, we propose Rigel, an automatic evaluation metric for image and video captioning, based on self-distilled score adaptation. The metric employs an evaluation-specific scoring head distilled from a frozen LLM, which captures judgment signals in a task-aligned space without relying on large-vocabulary token sets. We then refine the LLM backbone with human judgment data. To train Rigel, we constructed the Vid-Lepus dataset, which contains 3,338 video clips, 33,380 reference captions, and 5,637 candidate captions. Experiments on multiple benchmarks show that Rigel outperforms state-of-the-art metrics, achieving over 10-point improvements on ActivityNet-Fact in the reference-free setting.

[CV-89] Learning Efficient 4D Gaussian Representations from Monocular Videos with Flow Splatting

Link: https://arxiv.org/abs/2606.29976
Authors: Shengjun Zhang, Jinzhao Li, Xin Fei, Yueqi Duan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract: Reconstructing dynamic 3D scenes from monocular videos is challenging due to scene complexity and temporal dynamics. With the advancement of 3D Gaussian Splatting in novel view synthesis, existing methods extend 3D Gaussians to the 4D domain with deformation fields, trajectories, or spatiotemporal 4D volumes to model scene element deformation. However, these methods suffer from long training time, low rendering speed, or high memory consumption for per-frame reconstruction of 4D volumes, without fully exploiting dense dynamic information. To address this issue, we propose Flow Splatting, which constructs the velocity field and enables the conventional splatting technique to render optical flow from the velocity field to supervise the dynamics learning process from monocular videos. Specifically, we extend 4D volumes with time-varying means and covariances to represent complex dynamics. Then, we construct and approximate the velocity field naturally based on this representation. While conventional volume rendering techniques support rendering color fields, we extend the volume rendering strategy to splat the velocity field by considering the influence of camera motion. We conduct experiments on various benchmarks to demonstrate the efficiency and effectiveness of our method. Compared to state-of-the-art methods, our model achieves better image quality with less time consumption and higher rendering speed.

[CV-90] Variance Reduction on the Camera Axis: Multi-View Score Distillation for 3D WACV2027

Link: https://arxiv.org/abs/2606.29964
Authors: Marian Lupascu, Mihai Sorin Stupariu, Ionut Mironica
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 30 pages, 19 figures. Submitted to WACV 2027 (Algorithms Track)

Click to view abstract

Abstract:Score distillation turns a pretrained 2D diffusion model into a 3D generator, but the per-step gradient is estimated from a single randomly chosen view: it is high-variance and blind to global shape consistency. Prior work addresses this by retraining the diffusion prior on multi-view data; this improves consistency but makes the sampling contribution inseparable from prior quality. We instead isolate the sampling axis. The per-step gradient is one noisy sample of an expectation over views; aggregating K samples per step at a fixed total UNet budget reduces variance without touching the prior. We introduce Multi-View Aggregated Score Distillation (MV-SDI), which aggregates gradients from K views per step via gradient accumulation, keeping peak memory unchanged and the 2D prior frozen, and draws views as antithetic antipodal pairs, a prior-independent geometric property, for balanced angular coverage. At a fixed 10,000-UNet-call budget, K=2 raises CLIP R-Precision from 74.8% to 83.8% and CLIP score from 0.297 to 0.312, with consistent gains on HPSv2 and ImageReward and a 0.0% divergence rate on the 43-prompt benchmark; optimization steps halve as a consequence. K=4 gives a fourfold step reduction at R-Precision 86.9% and CLIP 0.307, still well above the single-view baseline on every alignment metric. MV-SDI is compatible with gradient-based score-distillation pipelines, including Score Distillation via Inversion, and requires no retraining and no multi-view data.
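The two sampling ideas in this abstract, drawing views as antipodal pairs and averaging K per-view gradients at a fixed call budget, can be sketched as below. This is a toy illustration under stated assumptions: views are modelled as unit direction vectors with the partner at the exactly opposite direction, `toy_grad` is a made-up stand-in for the real per-view score-distillation gradient, and all function names are hypothetical.

```python
import math
import random


def random_unit_vector(rng):
    """Uniform sample on the unit sphere (z uniform in [-1, 1], angle uniform)."""
    z = rng.uniform(-1.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    r = math.sqrt(1.0 - z * z)
    return (r * math.cos(phi), r * math.sin(phi), z)


def sample_antipodal_views(k, rng):
    """Draw k view directions as k/2 antithetic pairs (d, -d)."""
    assert k % 2 == 0, "antipodal pairing needs an even K"
    views = []
    for _ in range(k // 2):
        d = random_unit_vector(rng)
        views.append(d)
        views.append(tuple(-c for c in d))
    return views


def aggregated_gradient(grad_fn, views):
    """Accumulate one gradient per view, then average (one view in memory at a time)."""
    total = None
    for v in views:
        g = grad_fn(v)
        total = g if total is None else [a + b for a, b in zip(total, g)]
    return [t / len(views) for t in total]


def steps_for_budget(budget, k):
    """With K model calls per step, a fixed call budget allows budget // K steps."""
    return budget // k


def toy_grad(view):
    # Made-up gradient: a view-dependent odd term plus a shared signal.
    return [view[0], 1.0]


rng = random.Random(0)
views = sample_antipodal_views(4, rng)
g = aggregated_gradient(toy_grad, views)
```

In this toy setup the view-dependent term cancels exactly within each antipodal pair, leaving only the shared signal. A real gradient would not cancel this cleanly, so the sketch only illustrates why antithetic pairing can reduce variance. `steps_for_budget(10_000, 2)` gives 5,000 steps, matching the halving described in the abstract.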

[CV-91] Explainability-Aware Frustum Attack: Exposing Structural Vulnerabilities in LiDAR-Based 3D Object Detectors ECCV2026

Link: https://arxiv.org/abs/2606.29963
Authors: Chengzeng You, Binbin Xu, Soteris Demetriou
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: The 19th European Conference on Computer Vision (ECCV 2026)

Click to view abstract

Abstract:The structural vulnerabilities of point cloud-based 3D object detectors remain poorly understood. Prior work has studied adversarial robustness primarily on isolated 3D object models, while recent LiDAR spoofing attacks target richer and more realistic driving scenes but focus mainly on physical realizability rather than understanding detector behavior or attack efficiency. In this work, we investigate how LiDAR-based detectors rely on spatial evidence in complex scenes and whether these reliance patterns can be exploited to induce failures more efficiently. To this end, we propose an explainability-guided adversarial analysis methodology. We introduce the Saliency-LiDAR (SALL) method, which aggregates Integrated Gradient attributions across scenes to produce universal saliency maps for LiDAR-based 3D object detectors. Guided by these maps, we design the Explainability-aware Frustum Attack (EFA), which selectively perturbs only the most influential frustums rather than uniformly attacking entire object regions. Experiments on KITTI and nuScenes, across detectors such as PointPillars and SECOND, show that EFA reduces detection recall by more than 15 percentage points while requiring 25-50% fewer perturbed frustums than the state-of-the-art non-saliency-aware baseline. These findings reveal that modern 3D detectors concentrate discriminative evidence in a small subset of spatial regions, exposing a structural robustness vulnerability in current LiDAR perception systems. Our code is released at this https URL.

[CV-92] Exploiting Local Flatness for Efficient Out-of-Distribution Detection ECCV2026

Link: https://arxiv.org/abs/2606.29952
Authors: Seonghwan Park,Hyunji Jung,Dongyeop Lee,Namhoon Lee
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2026

Abstract:Detecting out-of-distribution (OOD) data is crucial for reliable machine learning deployment. Among detection strategies, post-hoc methods are particularly attractive due to their efficiency, as they operate directly on pre-trained networks without requiring retraining. Within this paradigm, one promising direction exploits loss-landscape curvature to estimate model uncertainty; however, such methods incur substantial computational cost and rely on implicit assumptions about how landscape flatness differs between in-distribution (ID) and OOD data. In this work, we provide the first systematic investigation of this curvature discrepancy and show that OOD inputs exhibit larger Hessian curvature than ID data, with the gap widening under stronger distributional shifts. Motivated by these observations, we propose Fold, a lightweight flatness-modulated OOD detector that leverages the feature Hessian and partial feature normalization to improve ID-OOD separability while avoiding costly parameter-space curvature approximations. To optimally adapt this normalization across diverse datasets, we further introduce AutoFold, a self-supervised tuning scheme that synthesizes pseudo-OOD samples via ID logit masking for automatic calibration without requiring external data. Experiments on OOD benchmarks show that Fold outperforms prior methods, improving the average AUROC by 1.63% and reducing FPR95 by 2.30%, while maintaining computational efficiency comparable to a standard forward pass. Supported by theoretical analysis and extensive ablations, Fold provides a principled and practical solution for robust real-world deployment.

[CV-93] Scene-aware Prediction of Diverse Human Movement Goals

Link: https://arxiv.org/abs/2606.29942
Authors: Qiaoyue Yang,Amadeus Weber,Magnus Jung,Ayoub Al-Hamadi,Sven Wachsmuth
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published at ROBOVIS 2025

Abstract:Anticipating human behaviour helps autonomous systems plan proactively. Human behaviour can be stochastic because goals vary. Human goals typically guide movement and can therefore help to predict human trajectories and motion in the long term. To infer human movement intentions, the environmental context plays a significant role, in addition to the social cues expressed by the individual. Previous works on human goal prediction either require semantic knowledge of the scene or only tackle interactions with objects. In this paper, we propose a novel multi-goal prediction method that uses a generative model to address the stochasticity of human movement. It leverages the current RGB scene and the human pose to predict diverse potential future goals of human movement based on the Conditional Variational Autoencoder (CVAE). Our results demonstrate that our approach can generate multiple movement goals in the scene by sampling in the latent space of the CVAE, and that it generalizes across scenarios in the GTA-IM and PROX datasets. Code is publicly available at this https URL.

[CV-94] Seeing Touch from Motion: A Unified Modality-Aware Visuo-Tactile Policy with Tactile Motion Correlation ECCV2026

Link: https://arxiv.org/abs/2606.29941
Authors: Shengqi Xu,Guojin Zhong,Yang Liu,Fanjie Wang,Hu Luo,Hanyu Zhou,Weiyao Zhang,Ziyi Ye,Zuxuan Wu,Yu-Gang Jiang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2026. Project website: this https URL

Abstract:Visuo-Tactile policies leveraging optical tactile sensors have shown great promise in contact-rich manipulation. These sensors achieve high spatial resolution and multi-dimensional force sensing by utilizing an internal camera to monitor the deformation of their elastic gel surface, thereby indirectly inferring tactile cues. Despite their advantages, extracting fine-grained contact states necessary for contact-rich manipulation remains an open challenge. Existing methods typically use either raw images or cumulative motion fields to represent tactile cues. However, both are prone to perception ambiguity. Raw tactile images mainly capture appearance changes, while cumulative motion fields only reflect the aggregate gel deformation. Consequently, distinct fine-grained contact states can exhibit highly similar patterns, making it difficult to explicitly distinguish subtle contact variations. To address this issue, we explore the dynamic priors of tactile motion and discover that the correlation between transient and cumulative motion can explicitly distinguish fine-grained contact states. Based on this insight, we propose a motion-aware tactile representation to facilitate contact-rich manipulation. Beyond tactile representation, effective fusion of tactile and visual modalities is also critical. Most existing fusion methods either directly concatenate features from each modality or train modality-specific networks separately and fuse their outputs. However, these strategies struggle to simultaneously model cross-modal interactions and preserve modality-specific characteristics. In this work, we take advantage of the Mixture-of-Transformers architecture and propose a unified modality-aware visuo-tactile policy that captures cross-modal complementarity while maintaining modality-specific properties.

[CV-95] Latent-CURE for Breast Cancer Diagnosis MICCAI2026

Link: https://arxiv.org/abs/2606.29928
Authors: Weiyi Zhao,Xiaoyu Tan,Lu Gan,Liang Liu,Xihe Qiu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 4 figures, 3 tables. Accepted to MICCAI 2026

Abstract:Multimodal Large Models have significantly advanced automated breast ultrasound diagnosis. However, most existing frameworks use opaque, end-to-end paradigms that prioritize global statistical correlations over structured clinical reasoning. Consequently, these models remain susceptible to shortcut learning amid extreme real-world epidemiological imbalances, often bypassing rare but decisive malignant indicators in favor of dominant benign patterns. To address this disconnect, we propose Latent-CURE, a novel diagnostic framework driven by an asymmetric weighted chain-of-thought methodology grounded in latent space reasoning. Unlike traditional approaches, our framework constructs an implicit reasoning trajectory that forces the model to sequentially infer standardized BI-RADS morphological descriptors before converging on a final diagnosis. Furthermore, to combat the extreme scarcity of critical malignant features, we couple this architecture with a dual-asymmetric optimization strategy. By dynamically adjusting margins and weights, this strategy safeguards high-specificity malignant descriptors from being overshadowed by common benign priors. Comprehensive evaluations demonstrate that our knowledge-injected approach provides transparent clinical evidence while achieving robust, accurate diagnostic performance in imbalanced medical cohorts.

[CV-96] DCGrasp: Distance-aware Controllable Grasp Generation

Link: https://arxiv.org/abs/2606.29924
Authors: Hiroyasu Akada,Jesús Pérez,Emre Aksan,Vasileios Choutas,Cristian Romero,Alberto Garcia-Garcia,Vladislav Golyanik,Christian Theobalt,Thabo Beeler
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Generating 3D hand-object interactions is essential for applications in robotics, XR, and synthetic data generation, where flexible controllability and strong generalization to diverse object geometries are required. However, existing methods rarely satisfy these requirements, limiting their practical applicability. We present DCGrasp, a distance-aware controllable grasp generation system built on a novel grasp energy term. This term computes a Distance Profile, the signed distance from each hand vertex to the nearest object point, coupled with distance-aware weighting, which captures semantically similar hand-object interactions in near-contact regions while remaining invariant to object and hand identity. Given various controllable signals, DCGrasp first generates a Distance Profile with a Diffusion Transformer, together with a corresponding candidate hand pose. We then refine the candidate pose through optimization, enforcing consistency between the optimized hand pose and the generated Distance Profile in near-contact regions. Our experiments show that DCGrasp produces high-quality, physically plausible grasps with flexible user control, generalizing to diverse object and hand shapes and scales. Our work establishes a robust and versatile pipeline for the synthesis of controllable 3D hand-object interactions.
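
The Distance Profile mentioned in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the object is given as a point cloud with outward unit normals, the sign is taken from the normal at the nearest point, and the exponential weighting with scale `sigma` is a hypothetical stand-in for the paper's unspecified distance-aware weighting.

```python
import numpy as np


def distance_profile(hand_vertices, object_points, object_normals, sigma=0.01):
    """Signed distance from each hand vertex to its nearest object point, plus a weight.

    hand_vertices: (H, 3), object_points: (O, 3), object_normals: (O, 3) outward unit normals.
    Returns (signed_distance, weight), each of shape (H,).
    """
    # Pairwise offsets from every object point to every hand vertex: (H, O, 3).
    offsets = hand_vertices[:, None, :] - object_points[None, :, :]
    distances = np.linalg.norm(offsets, axis=-1)
    nearest = np.argmin(distances, axis=1)
    rows = np.arange(len(hand_vertices))
    unsigned = distances[rows, nearest]
    # Assumed sign convention: positive outside the surface, negative inside.
    side = np.einsum("ij,ij->i", offsets[rows, nearest], object_normals[nearest])
    signed = np.where(side >= 0.0, 1.0, -1.0) * unsigned
    # Hypothetical weighting that emphasises near-contact vertices.
    weight = np.exp(-np.abs(signed) / sigma)
    return signed, weight


object_points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
object_normals = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
hand_vertices = np.array([[0.0, 0.0, 0.02], [1.0, 0.0, -0.01]])
signed, weight = distance_profile(hand_vertices, object_points, object_normals)
print(signed, weight)
```

The brute-force pairwise search is quadratic in the number of points; a spatial index such as a k-d tree would be the usual replacement for larger meshes.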

[CV-97] H-GRPO: Permutation-Invariant Reinforcement Learning for Grounded Visual Reasoning

Link: https://arxiv.org/abs/2606.29915
Authors: Eric Peh,Debaditya Roy,Basura Fernando
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Vision-Language Models (VLMs) often achieve high benchmark performance yet remain “black boxes” that are prone to hallucination and reliance on superficial shortcuts. In this work, we propose a framework designed to enhance both performance and interpretability through De-compositional Evidence Grounding. Unlike monolithic inference approaches, our method forces the model to decompose a global query into a sequence of atomic sub-questions, each requiring an explicit sub-answer and, critically, a localized evidence bounding box. By grounding intermediate logical steps (e.g. identifying a container, analyzing liquid properties, and assessing environmental context) in specific visual regions, we construct a structured reasoning path that mirrors human-like deduction. This allows the final answer to emerge as a logical consequence of verified visual facts rather than a statistical guess.

[CV-98] Traffic-CBM: A Structurally Interpretable Multimodal Framework for Encrypted Traffic Classification

Link: https://arxiv.org/abs/2606.29909
Authors: Honglei Jin,Wenshuo Chen,Shaofeng Liang,Haozhe Jia,Menshuo Zhao,Shuxu Jin,Songning Lai,Yutao Yue
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, figures and tables

Abstract:Encrypted traffic classification has achieved strong performance, but its decision process remains difficult to interpret. Existing methods usually combine flow statistics, packet sequences, and byte-level representations into opaque latent features, making it unclear which type of evidence actually drives the prediction. In this paper, we propose Traffic-CBM, a structurally interpretable multimodal framework for encrypted traffic classification. Instead of directly fusing heterogeneous traffic signals into a black-box representation, Traffic-CBM organizes them into a unified hierarchical concept space. These concepts are not manually annotated semantic attributes; rather, they are scalar evidence summaries constrained by predefined traffic evidence groups. More specifically, grouped flow statistics are mapped to statistical concepts, dedicated temporal encoders learn temporal concepts from disjoint feature subspaces, and byte-level evidence is further organized into packet-level and cross-packet concepts. This design turns heterogeneous traffic evidence into an explicit concept representation and makes different levels of traffic evidence easier to analyze. We evaluate Traffic-CBM on multiple encrypted traffic benchmarks. Results show that it achieves competitive and balanced classification performance while providing a clearer structural interpretation interface than conventional end-to-end fusion models. Further analyses suggest that the learned concept space is actively used in the prediction process and provides a clearer structural explanation of multimodal traffic evidence.

[CV-99] StrucTab: A Structured Optimization Framework for Table Parsing

Link: https://arxiv.org/abs/2606.29905
Authors: Gengluo Li,Shangpin Peng,Chengquan Zhang,Binghong Wu,Hao Feng,Weinong Wang,Pengyuan Lyu,Huawen Shen,Xingyu Wan,Zhuotao Tian,Han Hu,Can Ma,Yu Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Table parsing aims to convert table images into structured, machine-readable representations, a task requiring the joint perception of complex spatial layouts and textual content. While recent vision-language models (VLMs) enable end-to-end parsing, they typically rely on direct supervision of the final output, thereby bypassing the explicit intermediate reasoning that is crucial for understanding complex table structures. Furthermore, attempts to optimize these models using reinforcement learning (RL) are often hindered by unstable or ambiguous reward designs, limiting potential performance gains. To address these limitations, we propose StrucTab, a table parsing model learned through intermediate structural supervision and reward decomposition. At the modeling level, by decomposing the parsing process into human-inspired subtasks, such as row-column counting and merged-cell analysis, StrucTab progressively unifies them through a sequential reasoning strategy. At the optimization level, we introduce Uni-TabRL, a unified RL framework that leverages decomposed rewards (validity, structure, and content) to provide stable and informative optimization signals. Finally, at the evaluation level, we present TableVerse-5K, a large-scale, challenging benchmark encompassing diverse, real-world table scenarios. Extensive experiments demonstrate the state-of-the-art performance of StrucTab across all evaluated public benchmarks and significant improvements on TableVerse-5K, validating the effectiveness of explicit structural modeling and decomposed reward optimization. Code and benchmark are publicly available at this https URL.

[CV-100] LLM-based Multimodal Personality Recognition via Facial Action Unit-Text Semantic Fusion

Link: https://arxiv.org/abs/2606.29900
Authors: Tianyi Zhang,Wei Shan,Yuan Zong,Tianhua Qi,Wenming Zheng
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Personality recognition in asynchronous video interviews (AVIs) has become increasingly important due to their widespread adoption in modern recruitment. Existing approaches often rely on large language models (LLMs) to analyze the textual responses of interviewees in AVIs. However, unimodal methods often suffer from information loss (e.g., ignoring facial cues). In contrast, multimodal methods that employ full-face images or sparsely sampled frames can discard fine-grained temporal dynamics critical for accurate personality assessment. To overcome these limitations, we propose an LLM-based framework that semantically fuses facial action units (AUs) with the textual responses of AVIs. AU sequences are first converted into interpretable textual descriptions, which are then fused with participants’ textual responses through an LLM. A lightweight regression head transforms the resulting embeddings into continuous personality scores without disrupting the underlying semantic space. Experiments on the AVI-6 benchmark demonstrate consistent improvements over most baselines, with lower prediction errors and stronger correlations with human-rated scores across multiple traits. Further analysis reveals that AU-derived semantic representations offer complementary non-verbal cues to textual responses. Decoupling semantic understanding from regression prediction within the LLM also leads to greater training stability and clearer interpretability. Overall, these findings demonstrate that AU-text fusion provides a psychologically grounded and computationally efficient framework for personality recognition in AVIs.

[CV-101] Same Concept, Different Directions: Cross-Modal Feature Heterogeneity in Sparse Autoencoders

Link: https://arxiv.org/abs/2606.29888
Authors: Chungpa Lee,Jihoon Kwon,Kyle Min,Jy-yong Sohn
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Vision-language models map images and text into a joint embedding space. However, these embeddings often entangle multiple semantic features, which limits their interpretability and controllability. While sparse autoencoders have emerged as a useful tool for decomposing these embeddings into monosemantic features, their application to joint embedding spaces has largely relied on an implicit, untested assumption that semantically corresponding features share the same directions across modalities. In this paper, we challenge this assumption by identifying discrepancies in feature directions for the same concept across image and text modalities, a phenomenon we term cross-modal feature heterogeneity. We demonstrate that this heterogeneity is a key driver of the modality split, where a shared concept activates different latents depending on the modality. This finding further reveals why aligning latent activations alone is insufficient to resolve the underlying feature mismatch. Motivated by this observation, we propose an approach that trains modality-specific sparse autoencoders to preserve each modality’s feature geometry, and then aligns corresponding features post hoc. Our method improves reconstruction fidelity and enhances performance in cross-modal retrieval and concept steering.

[CV-102] Building artificial intelligence virtual tissue (AIVT) for tissue state representation, feature prediction, and dynamic simulation

Link: https://arxiv.org/abs/2606.29883
Authors: Qiqi Lu,Qianjin Feng,Shaoqun Zeng,Shenghua Cheng
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Modeling tissue states and their transitions is essential for understanding tissue homeostasis in health and pathological remodeling in disease. However, conventional computational modeling approaches are inadequate to capture the complexity of tissues as spatially organized, multiscale biological systems. Artificial intelligence (AI) has shown a remarkable ability for representing intricate systems, creating new opportunities to characterize tissue states and their transitions. Here, we propose the concept of AI virtual tissue (AIVT), an AI framework grounded in spatial multimodal data for modeling tissues in health and disease. AIVT is designed to learn unified, spatially resolved, and dynamically manipulatable representations of tissue state, enabling tissue state representation and analysis, molecular and morphological feature prediction, and simulation of spatiotemporal tissue dynamics. We outline the fundamental assumptions, core capabilities, architectural components, as well as data and algorithm foundations of AIVT as a framework for AI-driven tissue modeling.

[CV-103] IREU: Identity-Related Encoder-Only Unlearning for Customized Portrait Generation ECCV2026

Link: https://arxiv.org/abs/2606.29880
Authors: Chaoyi Shi,Shanshan Zhang,Jian Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026

Abstract:Customized Portrait Generation (CPG) technologies have been widely used to generate high-fidelity person images given an input image indicating the identity and a text prompt indicating the required edits. Yet these methods pose significant privacy risks by spreading fake visual information. Against such risks, each public generator should be able to suppress its generation ability for a particular person when requested. Therefore, in this work we investigate the identity unlearning problem for CPG. Since there are no previous methods in this field, we propose a simple baseline that updates the image encoder by minimizing identity similarity between generated and input images for target identities to be unlearned, while maximizing it for identities to be retained. However, we find such a global perturbation in the feature space harms the fidelity of generated images for other identities to be retained. To solve this problem, we propose a novel method IREU, which first locates identity-related features in an offline manner and then only performs feature perturbations on them. The experimental results show that our proposed method IREU achieves better identity unlearning performance for target identities to be unlearned, and also keeps high fidelity for other identities to be retained. In addition, our unlearned image encoder is generalizable across different generators with the same encoder without fine-tuning, which is friendly for deployment in practice.

[CV-104] LWDrive: Layer-Wise World-Model-Guided Vision-Language Model Planning for Autonomous Driving

Link: https://arxiv.org/abs/2606.29879
Authors: Chen Yang,Yuhao Wei,Ze Xu,Ziheng Zou,Shuang Liang,Delin Ouyang,Lingfeng Qi,Jie Li,Guofa Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Vision-Language Models (VLMs) provide powerful semantic understanding and commonsense reasoning for End-to-End Autonomous Driving (E2E-AD) planning. However, trajectories directly generated by VLMs often encode only coarse driving intentions and remain insufficient for geometrically accurate, future-aware, and multi-view-grounded planning. To address these limitations, we develop the Layer-Wise World-Model-Guided Driving framework (LWDrive). LWDrive is a VLM planning framework that refines coarse trajectories through layer-wise world-model guidance. Instead of treating the VLM output as the final trajectory, LWDrive uses it as an intent-aware coarse plan, expands a diverse candidate space around it, and progressively refines the candidates through a Foresight Cascade Planner (FCP). Specifically, we introduce future-frame generation supervision to encourage the VLM to learn forward-looking scene representations, thereby injecting planning-relevant predictive dynamics into its internal hidden states. Built upon these world-model-supervised representations, FCP exploits VLM features across multiple layers and integrates historical temporal states, Action-Query representations, and current-frame multi-view Bird’s-Eye-View (BEV) features to refine candidate trajectories in a coarse-to-fine manner. This design enables progressive correction of spatial positions and motion trends while grounding trajectory refinement with multi-view scene cues and preserving the high-level driving intention produced by the large model. Finally, a score head evaluates the refined candidates and selects the best trajectory as the final planning output. Experiments show that LWDrive achieves a score of 92.0 on the NAVSIM benchmark and 89.6 on NAVSIM-v2. Code and models will be made publicly available.

[CV-105] SUMO: Segment and Track Any Motion with Nonlinear State Space Models

Link: https://arxiv.org/abs/2606.29861
Authors: Kexin Tian,Sixu Li,Keshu Wu,Yang Zhou,Zhengzhong Tu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Visual Object Tracking (VOT) and Moving Object Segmentation (MOS) are two fundamental tasks in computer vision that involve both spatial and temporal object dynamics. Existing methods rely predominantly on visual cues and thus often falter in real-world scenarios where object motions are inherently complex and nonlinear. To address this limitation, we propose SUMO, a zero-shot, training-free, unified framework integrating nonlinear dynamics with vision-based segmentation for accurate and consistent VOT and MOS. Specifically, we develop a nonlinear State Space Model (SSM) inspired by robotics principles to capture the complex object dynamics. Building on this model, we propose a Selective Unscented Filter (SUF) for accurate state estimation, which features a joint scoring mechanism and dynamically fuses multi-source predictions to identify the most plausible object state over time. Furthermore, we apply a memory selection mechanism to evaluate the reliability of memory frames. Our extensive experimental results show that SUMO achieves state-of-the-art performance on both VOT and MOS tasks.

[CV-106] RainODE: Continuous-Time Precipitation Forecasting with Latent Neural ODEs

Link: https://arxiv.org/abs/2606.29855
Authors: Yeeun Seong,Doyi Kim,Minseok Seo,Changick Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In precipitation forecasting, not only accuracy but also temporal resolution is critical. However, increasing temporal resolution is constrained by observational limitations and the computational cost of dense discrete modeling. To overcome this limitation, we reformulate precipitation forecasting as a continuous-time dynamical system and propose RainODE, a framework that models precipitation evolution in latent space using a Neural ODE. This formulation enables derivative-consistent temporal dynamics and captures the dominant large-scale advective motion of precipitation systems. Nevertheless, a purely deterministic ODE struggles to represent non-advective intensity changes such as localized growth, decay, and sub-grid variability, often leading to over-smoothed predictions. To address this issue, we introduce a stochastic source modeling module based on a Brownian Bridge formulation, which refines residual intensity variations and restores fine-grained structures while preserving advective consistency. By combining deterministic continuous dynamics with stochastic refinement, RainODE enables arbitrary-time inference while maintaining sharp predictions. Experiments on SEVIR and the newly introduced Radar-based Precipitation Integrated Dataset (RAPID) demonstrate consistent improvements across multiple temporal intervals and precipitation regimes. The code is available at this https URL.

[CV-107] Efficient Visual Pointing for Embodied AI: Agent-Driven Data Synthesis, Cross-Block Attention, and Iterative Correction

Link: https://arxiv.org/abs/2606.29850
Authors: Zijian Hong,Qi Lv,Yuxiang Xie,Jianming Xing,Xiang Deng,Weili Guan,Liqiang Nie
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Visual pointing maps a language instruction to pixel coordinates, a core skill for embodied AI. We describe our PointArena 2026 solution, which achieves 77.2% overall accuracy and ranks second on the benchmark. The approach targets three failure modes. First, agent-driven synthesis builds large semantic and anchor-relative candidate pools; the server inventory contains 55,372 processed outputs, 53,772 de-duplicated sample IDs, and 37,574 trainable completed or accepted rows. Second, a deterministic steerable-data pipeline creates a verified 10,000-sample main set, plus reserve samples, using masks, templates, and path verification. Third, two model-side modules address complementary errors: AttnRes adds gated cross-block attention for steerability, while ABC correction encodes perturbed coordinates with visual features for general coordinate grounding. Category-aware routing combines complementary specialists; local validation used to select experts records 93.9% Affordance, 82.6% Spatial Relation, 78.2% Reasoning, 70.4% Counting, and 63.0% Steerability.

[CV-108] See Only When Needed: Context-Aware Attention Intervention for Mitigating Hallucinations in LVLMs

Link: https://arxiv.org/abs/2606.29847
Authors: Yuqing Lei,Wenbo Lyu,Yingjun Du,Xiantong Zhen,Cees G.M. Snoek,Ling Shao
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Large Vision-Language Models (LVLMs) excel at multimodal tasks but remain prone to object hallucinations. Prior training-free remedies often uniformly strengthen visual signals, which may also amplify irrelevant regions and introduce spurious evidence, harming fluency. We propose Context-aware Attention Intervention (CAI), a training-free inference-time mechanism that enforces a see only when needed principle via two-axis selectivity: where to look and when to intervene. At each decoding step, CAI derives token-specific visual relevance from early-layer representations to localize semantically aligned regions, and applies a conservative, entropy- and depth-gated attention tilt only for uncertainty-spiking tokens in deeper layers where visual grounding degrades, leaving confident tokens and irrelevant regions largely unchanged. This targeted intervention strengthens visual grounding while preserving linguistic fluency, and it yields consistent improvements even without contrastive decoding, which remains optional as an auxiliary bias-suppression module. Extensive experiments across multiple LVLM backbones and benchmarks show that CAI achieves state-of-the-art hallucination mitigation, and our analysis characterizes CAI as a KL-minimal attention reweighting with bounded interference under inactive gates or small tilts. Code is available at this https URL.
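
The gating logic described in this abstract can be illustrated with a small sketch. It assumes the "attention tilt" is an exponential reweighting of the form attn * exp(strength * relevance), renormalised to sum to one, which is the standard form of a KL-minimal reweighting. The function names, thresholds, and the `min_layer` cut-off are hypothetical and are not taken from the paper.

```python
import numpy as np


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    probs = np.clip(probs, 1e-12, 1.0)
    return float(-(probs * np.log(probs)).sum())


def tilt_attention(attn, relevance, next_token_probs, layer,
                   entropy_threshold=2.0, min_layer=16, strength=1.0):
    """Reweight attention over visual tokens only when both gates are open.

    attn: attention weights over visual tokens (sums to 1).
    relevance: per-token visual relevance scores, same shape as attn.
    """
    uncertain = token_entropy(next_token_probs) > entropy_threshold
    deep = layer >= min_layer
    if not (uncertain and deep):
        # Inactive gate: confident tokens and shallow layers are left untouched.
        return attn
    tilted = attn * np.exp(strength * relevance)
    return tilted / tilted.sum()


attn = np.full(4, 0.25)
relevance = np.array([0.0, 0.0, 1.0, 0.0])
uncertain_probs = np.full(100, 0.01)    # high-entropy next-token distribution
confident_probs = np.array([0.99, 0.01])

print(tilt_attention(attn, relevance, uncertain_probs, layer=20))
print(tilt_attention(attn, relevance, confident_probs, layer=20))
```

With a small `strength` the tilted weights stay close to the original ones, which matches the abstract's remark about bounded interference under small tilts.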

[CV-109] Bricker to BRACE: A Bracket Exposure RAW Dataset and Restoration Model for Flicker-Banding

Link: https://arxiv.org/abs/2606.29845
Authors: Zihan Zhou,Libo Zhu,Jue Gong,Zhiyi Zhou,Jiezhang Cao,Yong Guo,Yulun Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Flicker-banding (FB) arises from temporal aliasing between a camera’s rolling shutter and a display’s brightness modulation, degrading the readability of screen-captured images with color shifts and jagged patterns. Existing single-frame methods with simplified parametric stripe models cannot reliably distinguish these artifacts from genuine texture. To address this, we conduct a systematic analysis of complex FB morphologies and reveal their significant variation across exposure settings, motivating a multi-frame bracketed RAW restoration paradigm. We construct Bricker, a synthetic-real bracketed RAW dataset built via ray-tracing-based physical simulation and an automated multi-exposure capture tool. We further propose BRACE: Bracketed RAW Flicker-Banding Removal, a multi-frame restoration model that uses a frequency-aware banding prior and a multi-scale spatial cross-attention modulator (MSCAM) for cross-exposure spatial fusion. We also introduce the Stripe Frequency Consistency (SFC) metric to evaluate banding removal. Experiments demonstrate state-of-the-art performance on both synthetic and real benchmarks. Our dataset and code are available at: this https URL.

[CV-110] Robust Trajectory Distillation: Hybrid Reweighting Meets Teacher-Inspired Targets

Link: https://arxiv.org/abs/2606.29837
Authors: Kaifeng Chen,Lechao Cheng,Jiyang Li,Shengeng Tang,Fan Zhang,Yantao Pan,Yaxiong Wang,Tuanrui Hui,Zhun Zhong
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Dataset distillation (DD) condenses large corpora into compact, information-rich subsets for efficient training and reuse. However, under noisy supervision, DD risks condensing corrupted associations together with useful signals, degrading robustness. Conventional noisy-label remedies (sample selection, loss weighting, label correction) tightly couple noise estimation with model optimization, often require clean anchors, and can amplify confirmation bias; these assumptions are misaligned with DD’s goal of compact, plug-and-play supervision. We therefore propose a trajectory-based DD framework that jointly suppresses noise and preserves transferable knowledge without relabeling or clean subsets. It comprises two complementary components: Selective Guidance Reweighting (SGR), which fuses global forgetting patterns (second-split forgetting) with local neighborhood consistency into a progressive reweighting scheme that prioritizes clean supervision along the teacher trajectory; and Teacher-Inspired Auxiliary Targets (TIAT), which inject auxiliary residual guidance distilled from intermediate teacher dynamics to reinforce informative signals while remaining internally consistent. Together, SGR and TIAT produce distilled datasets with cleaner and richer representations under noisy supervision. The framework is robust, label-preserving, computationally lightweight, and broadly applicable, yielding consistent gains over state-of-the-art DD baselines across symmetric, asymmetric, and real-world noise.

[CV-111] HomeDiffusion: Zero-Shot Object Customization with Multi-View Representation Learning for Indoor Scenes

Link: https://arxiv.org/abs/2606.29828
Authors: Guoqiu Li,Jin Song,Yiyun Fei
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 9 figures, 6 tables

Abstract:Recently, zero-shot object customization generation methods have rapidly developed and shown tremendous potential for applications. For instance, in the e-commerce domain, consumers can observe the visual effect of furniture placed within their personal living spaces or clothes worn on their own bodies. Many existing approaches perform object customization generation based on diffusion models and extracted reference object features. However, the generated object significantly diverges from the original reference object in details such as patterns and curves. Particularly for asymmetrical reference objects, the absence of comprehensive multi-viewpoint information prevents the generation of object poses that harmonize with the background scene. To address these shortcomings, we have constructed a novel dataset comprising multi-angle images of furniture and indoor scenes. Based on diffusion models, we introduce HomeDiffusion, which can leverage multi-viewpoint images of the same reference object to accurately generate visually harmonious object poses within specified areas of the background scene. During the diffusion process, we further extract high-fidelity details of the reference object and perform cross-attention with the noise latents in the latent space, thereby ensuring the preservation of details in the customized object generation. Extensive qualitative and quantitative experiments demonstrate that our method achieves superior performance over other existing zero-shot as well as few-shot object customization approaches.

[CV-112] Learning Cross-view Correspondences for Geo-localization on Planetary Surfaces

Link: https://arxiv.org/abs/2606.29821
Authors: Hong Minh Nguyen,Marcus Märtens,Tat-Jun Chin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 4 figures, to be published in SPAICE 2026

Abstract:Maintaining global position awareness is a fundamental challenge for planetary surface exploration, since satellite-based positioning systems are unavailable and onboard odometry drifts over time. Although orbital mapping products, such as overhead imagery and terrain-derived maps, provide global context, aligning them with surface observations is challenging due to large viewpoint differences, low texture, repetitive terrain, and drastic changes in appearance caused by varying illumination and topography. We introduce a new cross-view geo-localization benchmark built from physically rendered surface panoramas and overhead tiles derived from a high-resolution lunar terrain model. Our dataset contains 10438 ground views rendered as 360° surface panoramas with matching overhead images precisely centered at the same location. Additionally, a set of overlapping tiles is provided to study off-center localization with multiple plausible candidates per panorama. We study the performance of a state-of-the-art transformer-based geo-localization method on our data, by training it from scratch and reporting retrieval accuracy. Our results demonstrate that learning-based cross-view localization methods can be successfully applied to the domain of planetary surfaces, providing a vision-based alternative to global navigation satellite systems.

[CV-113] Nemotron-Labs-Diffusion-Image: Advancing Masked Discrete Diffusion for High-Resolution Image Synthesis

Link: https://arxiv.org/abs/2606.29814
Authors: Shufan Li,Greg Heinrich,Hanrong Ye,Yonggan Fu,Aditya Grover,Jan Kautz,Pavlo Molchanov
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 12 figures

Abstract:We propose Nemotron-Labs-Diffusion-Image, a state-of-the-art masked discrete diffusion model (MDM) for high-resolution text-to-image synthesis. Compared with prior work on masked image generation, Nemotron-Labs-Diffusion-Image addresses two key challenges. First, unlike continuous diffusion models which progressively refine latent representations across the entire image, standard MDMs lack self-correcting capability because discrete tokens cannot be modified once they are unmasked. Second, although increasing the vocabulary size of discrete image tokenizers improves reconstruction fidelity, it introduces optimization difficulties for generative modeling as the per-token training signal becomes increasingly sparse. To address the first challenge, Nemotron-Labs-Diffusion-Image incorporates a token-editing mechanism that enables the model to dynamically revise already-unmasked tokens during inference, similar to how a sculptor iteratively refines their work. To tackle the second challenge, we propose a Grouped Cross-Entropy (GCE) objective that assigns positive learning signals to tokens neighboring the ground truth in embedding space, thereby alleviating signal sparsity. To further improve training efficiency, we implement a custom fused operator for GCE that significantly reduces VRAM usage in large-vocabulary settings. Experimental results demonstrate that these innovations substantially improve both training efficiency and image fidelity of masked discrete image generators, achieving a score of 0.90 on GenEval, 86.9 on DPG, and 10.76 on HPSv3.
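
The idea of rewarding tokens that neighbor the ground truth can be illustrated with a minimal sketch. The abstract does not specify how groups are formed or weighted, so everything below is an assumption for illustration: the group is taken to be the k nearest tokens in embedding space (including the target), and the loss is the negative log of the total probability mass on that group. The names `nearest_neighbors` and `grouped_cross_entropy` are hypothetical and do not come from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nearest_neighbors(embeddings, target, k):
    """Indices of the k tokens closest to `target` in embedding space.

    The target itself is always included because its distance is zero.
    (Assumed grouping rule; the paper may group tokens differently.)
    """
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    order = sorted(range(len(embeddings)),
                   key=lambda i: sqdist(embeddings[i], embeddings[target]))
    return order[:k]


def grouped_cross_entropy(logits, group):
    """Negative log of the probability mass assigned to the whole group."""
    probs = softmax(logits)
    return -math.log(sum(probs[i] for i in group))
```

With a group of size one this reduces to ordinary cross-entropy; with larger groups, probability placed on near-duplicate codebook entries is no longer penalized, which is one plausible way to densify the training signal for large vocabularies.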

[CV-114] Consistency as Inductive Bias: Learning Cross-View Invariance for Robust Multimodal Reasoning

Link: https://arxiv.org/abs/2606.29812
Authors: Xin Zou,Haolin Deng,Yibo Yan,Shuliang Liu,Kening Zheng,Zhiwei Jin,Chen Chen,Haonan Lu,Xuming Hu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Inductive biases steer learning toward generalizable solutions by encoding task structure. In this work, we identify a crucial missing bias in MLLMs: cross-view consistency, i.e., semantically invariant views of the same instance should lead to the same answer. Standard reinforcement learning with verifiable rewards (RLVR) objectives do not impose this constraint, but instead assign pointwise rewards to each visual input. Even with data augmentation (DA), transformed views are typically rewarded independently, providing little signal once within-view rewards saturate. We propose ConsistRoll, a simple but effective method that injects cross-view consistency into RLVR training by reusing the group-sampling mechanism of GRPO. Specifically, ConsistRoll places original and semantically invariant transformed views in the same generation group, and assigns a joint reward only when paired completions are both correct and consistent. In this way, ConsistRoll turns consistency into an online credit-assignment signal, without extra generation overhead or annotations. Theoretically, we show that cross-view consistency is a valid inductive bias, and ConsistRoll introduces a cross-view correction term absent from DA, penalizing view dependence and alleviating advantage collapse. Comprehensive benchmarks across math, general-purpose, and hallucination domains confirm that ConsistRoll achieves robust improvements in multimodal reasoning.
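
The joint reward described above can be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation: answers are assumed to be comparable by exact string match, the reward is assumed to be binary, and the advantage uses GRPO-style within-group standardization. The function names are hypothetical.

```python
def joint_rewards(orig_answers, view_answers, gold):
    """One reward per (original view, transformed view) completion pair.

    A pair earns 1.0 only if both answers are correct and agree with each
    other; under exact-match answers, "both correct" already implies
    "consistent", but the check is kept explicit to mirror the description.
    """
    rewards = []
    for a, b in zip(orig_answers, view_answers):
        both_correct = (a == gold) and (b == gold)
        consistent = (a == b)
        rewards.append(1.0 if (both_correct and consistent) else 0.0)
    return rewards


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within the group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Because a pair that is right on one view and wrong on the other gets zero reward, the group keeps a non-degenerate reward spread even when per-view accuracy is high, which is how the sketch relates to the "advantage collapse" mentioned in the abstract.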

[CV-115] Rethinking Forgery Attacks on Semantic Watermarks in Black-Box Settings: A Geometric Distortion Perspective ICML2026

Link: https://arxiv.org/abs/2606.29807
Authors: Cheng-Yi Lee,Yichi Zhang,Yuchen Yang,Chun-Shien Lu,Jun-Cheng Chen
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICML 2026, updated

Abstract:Recent studies have shown that semantic watermarks, which embed information into the initial noise of latent diffusion models (LDMs), are vulnerable to black-box forgery attacks. However, existing methods primarily rely on empirical evidence and lack a rigorous theoretical understanding of the conditions under which such attacks succeed or fail. To bridge this gap, we rethink the nature of such attacks through the lens of rate-distortion in the latent space. Our analysis identifies an irreducible distortion floor due to structural mismatches between proxy and target models, which fundamentally limits the fidelity of forged watermarks. We further characterize this distortion as structured geometric deviations on the latent manifold, in the form of global drift and local deformation rather than stochastic noise. Leveraging these insights, we propose a scheme-agnostic detection method that distinguishes forged samples before watermark verification. Extensive experiments demonstrate the effectiveness of our method across diverse black-box scenarios, while preserving robustness to common distortions.

[CV-116] Clearer Sight Fewer Lies: Oriented Pickup Preference Optimization for Multimodal Hallucination Mitigation

Link: https://arxiv.org/abs/2606.29805
Authors: Xin Zou,Haolin Deng,Yibo Yan,Shuliang Liu,Zhiwei Jin,Chen Chen,Haonan Lu,Xuming Hu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal Large Language Models (MLLMs) are prone to hallucination as their generation preferences are insufficiently calibrated to visual evidence, causing them to fall back on linguistic priors, rather than faithful grounding. In this work, we start from an empirical observation: when query-relevant visual evidence is explicitly strengthened using the model’s own attention, generation becomes more accurate, suggesting that many failures do not arise solely from missing perception, but from an insufficient tendency to trust the evidence the model has already attended to. Motivated by this finding, we propose Oriented Pickup Preference Optimization (OPPO), an evidence-aware alignment objective that learns preferences over the strength of visual evidence, rather than only response quality. Concretely, OPPO contrasts the same faithful response under stronger, anchored, weaker-evidence views, turning naive visual preference into ordered visual-evidence alignment. We further combine this objective with fine-grained span-level and token-level regularization to stabilize the training. Besides, we provide a theoretical analysis showing that ordered evidence margins induce a positive lower bound on local visual sensitivity. Extensive evaluations across hallucination and general-purpose benchmarks demonstrate that OPPO consistently outperforms baseline methods.

[CV-117] Concept Removal Guidance: Evidence-Calibrated Negative Guidance for Safe Diffusion Sampling ICML2026

Link: https://arxiv.org/abs/2606.29801
Authors: Yoonseok Choi,Chaeyoung Oh,Hyunjun Choi,Seokin Seo,Kee-Eung Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published at ICML 2026

Abstract:Text-to-image diffusion models remain vulnerable to adversarial prompts that elicit disallowed content, motivating reliable inference-time controls. A popular approach is negative guidance, which subtracts a negative prompt direction with a fixed weight. However, it often forces a safety-fidelity trade-off, causing artifacts or prompt drift when over-applied and failing under attacks when under-applied. Dynamic variants reweight guidance using posterior-odds signals, which can be brittle for open-vocabulary compositional prompts, while lightweight similarity-based methods ignore the evolving image evidence along the denoising trajectory. We introduce Concept Removal Guidance (CRG), a training-free method that estimates unwanted-concept presence at each diffusion step from the model’s noise predictions, and adaptively calibrates negative guidance via a closed-form constrained update enforcing a target presence threshold while minimally perturbing the conditional trajectory. Across red-teaming benchmarks, CRG reduces attack success rates while preserving benign fidelity, and extends to additional suppression targets such as artist style and violence without fine-tuning or external classifiers.
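
A "closed-form constrained update that enforces a threshold while minimally perturbing the trajectory" has a familiar generic shape: a Euclidean projection onto a half-space. The sketch below shows only that generic construction and is not the CRG update itself. It assumes, purely for illustration, that concept presence is a linear score (the inner product of the noise prediction with a concept direction) and that "minimal perturbation" means smallest L2 change; the paper's presence estimator and constraint may differ. The name `calibrated_update` is hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def calibrated_update(eps_cond, direction, tau):
    """Closest vector to eps_cond (in L2) whose score <direction, eps> <= tau.

    If the constraint already holds, the prediction is returned unchanged,
    so no guidance is applied to prompts that do not trigger the concept.
    """
    score = dot(eps_cond, direction)
    norm_sq = dot(direction, direction)
    if score <= tau or norm_sq == 0.0:
        return list(eps_cond)
    # Step just far enough along -direction to land on the threshold.
    scale = (score - tau) / norm_sq
    return [e - scale * d for e, d in zip(eps_cond, direction)]
```

The step size adapts to how far the current prediction exceeds the threshold, in contrast to fixed-weight negative guidance, which subtracts the same amount regardless of the evidence.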

[CV-118] UniTriSplat: A Unified 3D Gaussian Splatting Framework with Uniform Spherical Rasterization for Universal Cameras ECCV2026

Link: https://arxiv.org/abs/2606.29794
Authors: Yipeng Zhu,Huajian Huang,Tristan Braud,Sai-Kit Yeung
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 32 pages, 14 figures, 6 tables. Project page: this https URL. UniTriSplat was accepted to ECCV 2026

Abstract:Existing 3D Gaussian Splatting (3DGS) frameworks rely on camera-specific rasterization, suffering from inconsistent solid-angle sampling and degraded performance across heterogeneous camera models (e.g., perspective, fisheye, omnidirectional). To address this limitation, we propose UniTriSplat, a unified 3DGS framework for universal cameras that reformulates Gaussian splatting on the unit sphere via HEALPix discretization. Leveraging the equal-area property of HEALPix, we construct a spherical sampling grid aligned with the angular resolution of input images. We derive the forward rendering and gradient propagation of Gaussians directly in the spherical radian domain, yielding uniform optimization behavior from narrow-FoV images to full 360-degree panoramas. To enhance perceptual reconstruction quality, we additionally introduce a HEALPix-aware SSIM loss that respects spherical neighborhood structure. Extensive experiments across diverse camera models demonstrate that UniTriSplat consistently improves cross-camera generalization while preserving geometric fidelity and rendering quality.
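
The equal-area property of HEALPix makes it easy to relate grid resolution to image resolution. A HEALPix grid with parameter `nside` has 12·nside² pixels, each covering 4π / (12·nside²) steradians. The sketch below uses these standard facts; the rule for choosing `nside` (the smallest power of two whose pixel size does not exceed the image's per-pixel angle) is an assumption for illustration, as the abstract does not say how the grid is aligned with the input resolution.

```python
import math


def healpix_npix(nside):
    """Total number of equal-area pixels on the sphere."""
    return 12 * nside * nside


def healpix_resolution_rad(nside):
    """Approximate angular pixel size: square root of the pixel solid angle."""
    return math.sqrt(4.0 * math.pi / healpix_npix(nside))


def nside_for_image(fov_rad, width_px):
    """Smallest power-of-two nside at least as fine as the image sampling.

    fov_rad / width_px is a rough per-pixel angle for the input image
    (hypothetical alignment rule, not taken from the paper).
    """
    target = fov_rad / width_px
    nside = 1
    while healpix_resolution_rad(nside) > target:
        nside *= 2
    return nside
```

For example, a 1024-pixel-wide 360-degree panorama samples roughly 0.0061 rad per pixel, and the rule above selects `nside = 256` (about 0.0040 rad per HEALPix pixel).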

[CV-119] OP3DSG: Open-Vocabulary Part-Aware 3D Scene Graph Generation for Real-World Environments ECCV2026

Link: https://arxiv.org/abs/2606.29786
Authors: Yirum Kim,Ue-Hwan Kim
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted to ECCV 2026

Abstract:3D scene graphs (3DSGs) provide a compact and structured abstraction of 3D environments. Although advances in foundation models have enabled open-vocabulary 3DSG generation, existing approaches remain object-centric and encode limited relational information – restricting their applicability in real-world scenarios that require fine-grained understanding. We propose OP3DSG, an open-vocabulary part-aware 3DSG generation framework that constructs unified graphs that jointly model objects, interactive parts, spatial relations, functional relations, and affordances. OP3DSG integrates object-part knowledge-guided detection with part-aware 3D fusion to preserve small and interaction-relevant components, and employs a geometry-initialized prior graph with LLM-based refinement to reduce spurious relational predictions while enabling efficient graph construction. To systematically evaluate unified 3D scene graph construction, we introduce UniGraph3D, a benchmark designed for part-aware perception and multi-level relational reasoning. Experimental results show that OP3DSG achieves state-of-the-art performance and demonstrates its effectiveness as a perception backbone in diverse real-world robotics tasks.

[CV-120] FalconTrack: Photorealistic Auto-Labeled Perception and Physics-Aware Vision-Based Aerial Tracking

Link: https://arxiv.org/abs/2606.29783
Authors: Yan Miao,Karteek Gandiboyina,Noah Giles,Hideki Okamoto,Bardh Hoxha,Georgios Fainekos,Sayan Mitra
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-based aerial tracking is critical in GPS-denied environments. Reliable perception for tracking depends on large-scale labeled data, yet most photorealistic datasets rely on heavy manual annotation and are time-consuming to produce. We present FalconTrack, a unified perception-and-tracking framework that (i) leverages a photorealistic editable simulator for automated label generation and (ii) combines multi-head perception with physics-aware tracking for zero-shot sim-to-real transfer. FalconTrack provides an automated labeling pipeline in a Gaussian Splatting simulator that isolates target Gaussians from short object videos and composites them with randomized backgrounds to generate RGB, mask, class, and 6-DoF pose labels, producing about 10k labeled images in under 20 minutes. Using this dataset, we train a multi-head perception module with staged learning and reprojection consistency, and fuse its outputs with class-conditioned dynamics priors in an EKF for tracking. Our perception model outperforms two baselines and reaches 96-100% class accuracy in zero-shot sim-to-real transfer on three geometrically diverse objects and two environments, while maintaining consistent performance in unseen simulated and real scenes. In real hardware closed-loop visual tracking, the onboard system runs at about 25 Hz and achieves 100% success in sim-to-real F1-tenth and gate tracking in five trajectories across two environments, while a mask-centered vision baseline drops to 60% success on F1-tenth during fast out-of-view scenarios.

[CV-121] Graph-GSReg: Leveraging 3D Scene Graphs for Gaussian Splatting Registration

Link: https://arxiv.org/abs/2606.29782
Authors: Jaewon Lee,Mangyu Kong,Euntai Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Merging multiple 3D Gaussian Splatting (3DGS) scenes into a single unified Gaussian representation is essential for large-scale 3D mapping and long-term map management. Despite its importance, this area remains underexplored, and existing solutions exhibit several limitations. Learning-based methods attempt direct correspondence between Gaussian primitives and require training on large 3DGS datasets. Image-based optimization methods depend heavily on coarse initialization from generic foundation models and often incur expensive refinement. We present Graph-GSReg. Our method constructs a 3D scene graph from a 3DGS and its rendered images, reformulating 3DGS registration as a graph registration problem. The proposed 3D scene graph represents each 3DGS at a higher-level representation, enabling a globally consistent understanding of semantic information and structural context for accurate registration. To further construct a seamless unified scene, we introduce a Self-Supervised Test-Time Optimization. Naively merging two 3D Gaussian scenes often suffers from occlusion artifacts such as hollows and floaters. To alleviate this issue, we refine the merged Gaussians to preserve visual consistency between the original scenes and the merged scene. We evaluate our method on real and synthetic benchmarks, demonstrating competitive registration accuracy and merged scene rendering quality.

[CV-122] UrbanCDNet: Appearance-Robust and Boundary-Aware Bitemporal Change Detection for Korean Urban Building Monitoring

Link: https://arxiv.org/abs/2606.29781
Authors: Abdirashid Omar,Jonghyuk Park
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 2 figures, 5 tables

Abstract:Urban building change detection from bi-temporal aerial imagery is important for redevelopment monitoring, infrastructure management, and unauthorized-construction screening, but Korean urban scenes remain difficult because changed regions are often sparse, appearance varies strongly between acquisition dates, and useful outputs must follow building footprints rather than coarse blobs. This paper presents UrbanCDNet, a task-specific Siamese CNN that combines appearance-robust multi-cue comparison, alignment-aware middle-scale differencing, lightweight context refinement, scene calibration, and auxiliary boundary supervision. Experiments use a corrected AIHub-based Korean benchmark with 3,998 training, 503 validation, and 499 test pairs, and report changed-class precision, recall, F1, and IoU. On the locked test split, UrbanCDNet achieves 0.7335 precision, 0.7696 recall, 0.7511 F1, and 0.6014 IoU, outperforming a strong Siamese U-Net baseline (0.7108 F1, 0.5514 IoU) and the strongest external competitor, ChangeFormer-MIT-B0 (0.7107 F1, 0.5512 IoU). Additional diagnostic slicing shows that the gain is concentrated in the operating regimes that motivated the design: on the sparse-change subset with less than 5% changed area, F1 improves from 0.4765 to 0.6175, and on the high photometric-gap subset it improves from 0.6349 to 0.7285. Boundary F1 at 3-pixel tolerance rises from 0.3445 to 0.4447, while object F1 at IoU 0.3 rises from 0.0690 to 0.2258. These results indicate that, on this Korean benchmark, task-shaped temporal comparison and boundary-aware supervision matter more than generic model scale alone.
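
The four changed-class metrics reported above are not independent: when precision, recall, F1, and IoU are all computed from the same pooled confusion counts, F1 and IoU follow from precision and recall alone. The short check below, written for this digest rather than taken from the paper, applies the standard identities F1 = 2PR / (P + R) and IoU = PR / (P + R − PR) to the reported test-split values. It assumes the paper pools counts over the whole split; per-image averaging would not satisfy these identities exactly.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


def iou_from_pr(precision, recall):
    """IoU = TP / (TP + FP + FN), rewritten in terms of precision and recall.

    Follows from 1/IoU = 1/P + 1/R - 1.
    """
    pr = precision * recall
    return pr / (precision + recall - pr)


# Reported UrbanCDNet test-split values from the abstract.
f1 = f1_from_pr(0.7335, 0.7696)
iou = iou_from_pr(0.7335, 0.7696)
```

Both derived values agree with the reported 0.7511 F1 and 0.6014 IoU to four decimal places, which suggests the reported numbers are mutually consistent.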

[CV-123] TopoAgent: An Agentic Framework for Automated Topology Learning in Medical Imaging

Link: https://arxiv.org/abs/2606.29763
Authors: Guangyu Meng,Pengfei Gu,Xueyang Li,Yiyu Shi,Erin Wolf Chambers,Danny Z. Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Topological data analysis (TDA), particularly persistent homology (PH), captures geometric structural properties in medical images (e.g., connected components, loops, shape characteristics), which conventional pixel-level deep learning approaches often neglect. While many topological descriptors are known for converting persistence diagrams (PDs) or raw images into topological feature vectors, existing methods mostly default to a single fixed descriptor (e.g., persistence images), leaving the diversity of topological representations largely unexplored. To the best of our knowledge, there is no known large language model (LLM)-based agentic framework that can automatically determine the most suitable topological descriptors for a given image dataset and produce the corresponding topological feature vectors for downstream tasks. To fill this gap, we propose TopoAgent, an LLM-based agentic framework that automates topology learning for medical image this http URL operates through a Perception–Reasoning–Action–Reflection loop supported by 21 domain-specific tools and dual memory that accumulates experience across runs. Its skill set is distilled from systematic evaluation of 15 topological descriptors across 26 datasets with six classifiers. TopoAgent analyzes input images and their topological characteristics, reasons about which topological descriptors best suit the input, and determines the optimal descriptor and its configuration, all without task-specific training.

[CV-124] MR-IQA: A Unified Margin View of Regression and Ranking for Blind Image Quality Assessment

Link: https://arxiv.org/abs/2606.29760
Authors: Yuan Li,Youyuan Lin,Zitang Sun,Yung-Hao Yang,Kiyofumi Miyoshi,Chenhui Chu,Shin’ya Nishida
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Blind image quality assessment (BIQA) is commonly built on two basic learning paradigms: regression and ranking. Regression calibrates absolute scores, whereas ranking recovers quality structure from ordinal relations. Although joint regression-ranking supervision often improves BIQA, the relation between the two paradigms remains largely empirical and underexplored. In this work, we revisit what underlies regression and ranking and identify pairwise relational distance, termed quality margin, as their common bridge. Our derivation shows that, at the objective-optimization level, both paradigms fit quality margins: regression fits margins induced by score endpoints, while ranking fits transformed or sign-level margins through preference probabilities. Motivated by this insight, we propose MR-IQA, a direct quality-margin optimization framework for reinforcement learning (RL)-based BIQA. MR-IQA samples quality scores and optimizes pairwise margin errors as policy rewards, thereby modeling quality structure more explicitly. Experiments on six BIQA benchmarks show competitive general performance, and controlled comparisons demonstrate that MR-IQA achieves the strongest average PLCC/SRCC over regression- or ranking-based RL methods. Our findings provide a new insight into unifying regression and ranking, offering a theoretical basis for understanding quality-structure modeling in BIQA and beyond.
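
The notion of a quality margin as a pairwise relational distance can be made concrete with a small sketch. The abstract states that MR-IQA optimizes pairwise margin errors as policy rewards but gives no formula, so the form below is an assumption: the reward for an image is the negative mean absolute difference between predicted margins and ground-truth (MOS) margins against the other images in a batch. The function names are hypothetical.

```python
def margin_reward(pred_i, pred_j, mos_i, mos_j):
    """Negative absolute error between predicted and true quality margins."""
    return -abs((pred_i - pred_j) - (mos_i - mos_j))


def mean_margin_rewards(preds, mos):
    """Per-image reward: average margin reward against every other image."""
    n = len(preds)
    rewards = []
    for i in range(n):
        total = sum(margin_reward(preds[i], preds[j], mos[i], mos[j])
                    for j in range(n) if j != i)
        rewards.append(total / (n - 1))
    return rewards
```

Under this form, shifting every prediction by the same constant leaves the reward unchanged, which illustrates how a margin-based signal captures relative quality structure while leaving absolute calibration to the regression view.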

[CV-125] LEIQ-Assessor: Multi-dimensional Quality Assessment of Low-light Enhanced Images via Multi-task Learning

Link: https://arxiv.org/abs/2606.29752
Authors: Wei Sun,Yanwei Jiang,Dandan Zhu,Jinqiu Sang,Jikai Xu,Weixia Zhang,Guangtao Zhai
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: The paper achieved second place in the QoMEX 2026 Grand Challenge on Low-light Enhanced Image Quality Assessment

Abstract:Low-light image enhancement algorithms (LIEAs) aim to improve the visibility of images captured under poor illumination. However, the enhancement process often introduces artifacts such as noise amplification, color shift, structural damage, and over-exposure, which degrade the perceptual quality of the enhanced images. Therefore, a reliable image quality assessment (IQA) metric for evaluating enhancement effects is of great importance for both the development of LIEAs and their practical applications. In this paper, we present LEIQ-Assessor, a multi-dimensional quality assessment model for low-light image enhancement based on multi-task learning, developed for the QoMEX 2026 Grand Challenge on Low-light Enhanced Image Quality Assessment. Specifically, our method leverages a pre-trained SigLIP2 Vision Transformer as the backbone and simultaneously predicts the overall Mean Opinion Score (MOS) together with six perceptual sub-attributes: lightness, color fidelity, noise level, exposure quality, naturalness, and content recovery. By jointly optimizing these correlated objectives via the PLCC loss, the shared representation captures richer quality-aware features than its single-task counterpart. Experiments on the MLE benchmark demonstrate that LEIQ-Assessor significantly outperforms existing no-reference IQA models and hand-crafted quality descriptors. Our method achieved second place in the QoMEX 2026 Grand Challenge on Low-light Enhanced Image Quality Assessment. The code is available at this https URL.

[CV-126] HTC-SGA Former: A Hybrid Transformer-CNN Network with Self-Guided Attention and a New Boundary-Weighted Adaptive Loss for Coronary DSA Vessel Segmentation

Link: https://arxiv.org/abs/2606.29744
Authors: Rayan Merghani Ahmed,Marwa Omer Mohammed Omer,Mohamed Elmanna,Shijie Li,Bin Li,Shoujun Zhoua
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 10 figures, 3 tables. Submitted for journal review

Abstract:Accurate coronary Digital Subtraction Angiography (DSA) vessel segmentation is essential for computer-aided diagnosis and treatment planning of coronary artery disease (CAD). However, thin low-contrast vessels, background interference, and severe vessel-background class imbalance make reliable segmentation of weak distal branches and vessel boundaries challenging. Existing methods struggle to balance global contextual reasoning with preservation of weak vessels, vessel continuity, and fine boundaries. To address these limitations, we propose HTC-SGA Former, a lightweight hybrid Transformer-CNN framework for coronary DSA vessel segmentation. It employs a CNN encoder for local vessel morphology extraction and a Transformer decoder for contextual feature modeling. A Multi-Scale Global-Local Window Attention (MS-GLWA) block performs efficient global-local contextual modeling, while a Self-Guided Feature Attention (SGFA) module enhances weak-vessel responses. In addition, a Boundary-Weighted Adaptive Compound Loss (BWACL) emphasizes thin-vessel boundaries and adaptively balances vessel recovery and boundary refinement. Experiments on private right and left coronary artery DSA subsets show that HTC-SGA Former outperforms 14 state-of-the-art segmentation methods while maintaining a compact architecture with only 0.81M parameters. BWACL also improves performance over binary cross-entropy and Dice losses across four encoder-decoder architectures, demonstrating strong cross-backbone applicability. HTC-SGA Former improves thin-vessel recovery, vessel continuity, and boundary localization through complementary global-local contextual modeling, vessel-focused refinement, and adaptive optimization, supporting reliable and computationally efficient coronary vessel analysis for future computer-assisted cardiovascular interventions.

[CV-127] ScaleAware-JEPA: Latent Representation for Discovery in Multiscale Physical Fields

Link: https://arxiv.org/abs/2606.29723
Authors: Guang-Xing Li
Categories: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV); Computational Physics (physics.comp-ph)
Comments:

Abstract:Continuous physical fields represent a large fraction of data under scientific investigation. Their multiscale structures are central to discovery, yet useful coordinates are not known in advance. Standard self-supervised methods define context and targets in fixed image coordinates, posing a predictive task misaligned with fields organized across a continuous scale hierarchy. We introduce ScaleAware-JEPA, a framework that constructs dense, label-free latent coordinates for continuous scalar fields. Constrained Diffusion Decomposition (CDD) separates each field into pixel-registered scale components and provides the scale coordinates that define the masking geometry. The resulting JEPA objective predicts hidden structure with a context footprint tied to the diffusion scale of each component rather than to an arbitrary patch size. Across MHD turbulence, interstellar molecular gas and urban nighttime-light structure, the learned geometry maps back to coherent morphology, forming dense structural atlases without labels or predefined segmentation rules. By tying latent prediction to the scale hierarchy of a field, ScaleAware-JEPA constructs latent coordinates through which complex physical patterns can be inspected before their relevant structures have been prescribed. Code is available at this https URL.

[CV-128] AerialMetric: Benchmarking and Adapting UAV Monocular Metric Depth Estimation in the Real World ECCV2026

Link: https://arxiv.org/abs/2606.29716
Authors: Zhongqiang Song,Guanying Chen,Yuqi Zhang,Yin Zou,Chuanyu Fu,Zhiyuan Yuan,Chuan Huang,Shuguang Cui,Xiaochun Cao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2026. Project page: this https URL

Abstract:This paper addresses the problem of monocular metric depth estimation in aerial UAV imagery. Although recent data-driven methods have achieved remarkable progress in ground-level scenarios, models trained primarily on street-view and indoor datasets exhibit significant domain gaps when applied to aerial viewpoints. To tackle these challenges, we introduce AerialMetric, a benchmark dataset designed to evaluate and facilitate the adaptation of monocular metric depth estimation under UAV aerial viewpoints. The dataset consists of four complementary subsets collected from different sources, jointly covering real-world photogrammetry data, controlled aerial acquisition settings, photorealistic synthetic scenes, and in-the-wild Internet imagery. In total, AerialMetric provides 52K real-world and 16K synthetic image-depth pairs with reliable metric ground truth. Based on this dataset, we conduct systematic evaluations of existing state-of-the-art models under aerial settings and investigate the impact of viewpoint, altitude, and camera parameters on metric depth prediction. In addition, by fine-tuning a representative metric depth model on our dataset, we establish a comprehensive aerial benchmark and achieve state-of-the-art performance across diverse aerial imagery. Our dataset, code, and model weights are publicly available at this https URL.

[CV-129] Accurate Recognition of Pneumonia and COVID-19 by Geometric Shape Normalization of Lung Region using Automatic Landmark Detection and Piecewise Affine Warping

Link: https://arxiv.org/abs/2606.29715
Authors: Salvador E. Ayala-Raggi,Rafael Alejandro Cruz-Ovando,Lauro Reyes-Cocoletzi,Aldrin Barreto-Flores
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 13 figures

Abstract:This paper presents an automatic system for recognizing pulmonary diseases in chest X-rays using geometric normalization of the lung region. The method combines three modules: (1) a ResNet-18 landmark detector with coordinate attention that predicts 15 lung-contour landmarks, achieving a mean localization error of 3.61 pixels through an ensemble of four models with test-time augmentation; (2) a geometric normalizer based on Generalized Procrustes Analysis, Delaunay triangulation, and piecewise affine warping to map each lung region to a standardized shape; and (3) a ResNet-18 classifier with transfer learning and SAHS contrast enhancement to classify images as COVID-19, Viral Pneumonia, or Normal. On the COVID-19 Radiography Database, the normalized-image classifier achieved 98.60 ± 0.26% accuracy and 98.00% F1-Macro using five-fold cross-validation. Although original images produced slightly higher raw accuracy, Grad-CAM and cropping experiments suggest that this advantage is partly influenced by acquisition artifacts. In contrast, geometrically normalized images outperformed artifact-masked/cropped unaligned images on both the COVID-19 Radiography Database (98.60% vs. 96.24%) and a balanced adult-pediatric mixed dataset including pediatric cases from the Kermany dataset (94.67% vs. 94.17%). These results suggest that anatomical alignment can provide a more controlled and artifact-resistant representation for pulmonary disease recognition.
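
The piecewise affine warping step named in module (2) can be illustrated for a single triangle. The sketch below is not the authors' code: it assumes the Delaunay triangles are already known and shows only how one point inside a source triangle is carried to the matching destination triangle through barycentric coordinates. The function names `barycentric` and `warp_point` are hypothetical.

```python
def barycentric(p, tri):
    """Return the barycentric coordinates of point p with respect to triangle tri."""
    (x1, y1), (x2, y2), (x3, y3) = tri
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    if det == 0:
        raise ValueError("degenerate triangle")
    l1 = ((y2 - y3) * (p[0] - x3) + (x3 - x2) * (p[1] - y3)) / det
    l2 = ((y3 - y1) * (p[0] - x3) + (x1 - x3) * (p[1] - y3)) / det
    return l1, l2, 1.0 - l1 - l2


def warp_point(p, src_tri, dst_tri):
    """Map p from src_tri to dst_tri by reusing its barycentric weights.

    Applying this per triangle of a shared triangulation gives a
    piecewise affine warp: affine inside each triangle, continuous across edges.
    """
    l1, l2, l3 = barycentric(p, src_tri)
    x = l1 * dst_tri[0][0] + l2 * dst_tri[1][0] + l3 * dst_tri[2][0]
    y = l1 * dst_tri[0][1] + l2 * dst_tri[1][1] + l3 * dst_tri[2][1]
    return x, y
```

In a full pipeline the destination triangles would come from the standardized (mean) shape and the source triangles from the detected landmarks; image warping then samples source pixels for every destination pixel using the inverse of this mapping.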

[CV-130] UniVAD v2: Unified Visual Anomaly Detection via Support-Conditioned Boundary Construction

Link: https://arxiv.org/abs/2606.29714
Authors: Zhaopeng Gu, Bingke Zhu, Zhaowen Li, Guibo Zhu, Yingying Chen, Ming Tang, Peng Su, Jinqiao Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Unified visual anomaly detection seeks to train a single detector that can be deployed across categories, domains, and application scenarios. In the few-shot transfer regime, the key challenge is to estimate an episode-specific boundary for an unseen target category from a small support set. Existing approaches mainly infer this boundary from normal-side evidence and provide limited abnormal-side evidence for deployment-specific tolerance. Within the normal side, they often struggle to jointly capture local correspondences and global support-query relations, making their boundaries less reliable for unseen anomalies. To address these issues, we propose UniVAD v2, a two-sided support-conditioned boundary construction framework for unified visual anomaly detection. Built on the component-patch divide-and-conquer framework of UniVAD, UniVAD v2 strengthens the normal side with an Optimal Transport-based Relational Modeling module (OTRM), which complements retrieval with support-query matching through transport-style allocation, and an Adaptive Coordination mechanism for Retrieval and Relational Modeling (ACRRM), which estimates episode-conditioned reliabilities to fuse the two sources of evidence. On the abnormal side, a Few-Shot Abnormal Reference module (FAR) converts optional abnormal references into rejection-side evidence for boundary adjustment. Experiments on six datasets spanning industrial, logical, and medical anomaly detection demonstrate strong cross-domain generalization. Under the 1N-shot protocol, UniVAD v2 improves the mean image-level AUC over UniVAD from 83.0% to 84.5%, and further reaches 85.7% in the 1N+1A-shot setting. On the MVTec-AD Severity Split (MVTec-AD-SS), UniVAD v2 achieves 96.2% image-level AUC and 96.9% pixel-level AUC, showing that abnormal references enable controllable boundary customization without retraining.

[CV-131] Early Warning Signals for OpenVLA Failure under Visual Distribution Shift

Link: https://arxiv.org/abs/2606.29699
Authors: Dipesh Tharu Mahato, Rachel Ren
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 1 figure, 5 tables

Abstract:Vision-Language-Action models combine perception, language grounding, and control in a single policy, but their failures are hard to diagnose once visual conditions shift. We test whether OpenVLA feedforward activations contain linearly decodable information about near-term task failure in LIBERO manipulation rollouts. The policy is fixed throughout. We log internal activations during execution and fit lightweight monitors after the rollouts are collected. Occlusion is the main controlled stress test. It reduces OpenVLA success from 57% to 17% over 100 episodes per condition. Under this shift, a logistic probe at layer 16 reaches AUROC 0.972 and AUPRC 0.352 for predicting failure within a 15-step horizon. It outperforms both a mean-difference direction and an action-disagreement baseline. A sparse layer sweep finds uneven decodability across depth: layer 16 is strongest among the tested layers, layer 8 remains informative, and layer 10 is weaker. To check whether the monitor is just an occlusion detector, we also evaluate color shift and camera jitter without refitting. Color shift produces no failures in this setting, so it is a benign control rather than a failure benchmark. Camera jitter does induce failures, and the occlusion-trained monitor remains above random. The result is deliberately limited: OpenVLA internal states contain failure-relevant structure under controlled perceptual shift, but these experiments do not establish a causal mechanism, task-held-out generalization, or a deployable recovery system.
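
The abstract reports both AUROC and AUPRC for the failure probe. As a reading aid, here is a minimal sketch of how the two metrics are commonly computed from probe scores and binary failure labels. It is not the authors' evaluation code, the function names are hypothetical, and ties in `average_precision` are broken by input order, which is one convention among several.

```python
def auroc(scores, labels):
    """Probability that a random positive scores above a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean of the precision values measured at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive example")
    return total / hits
```

AUROC has a chance level of 0.5 regardless of class balance, whereas the chance level of average precision equals the fraction of positive labels. A high AUROC paired with a much lower AUPRC is therefore typical when the positive class (here, steps shortly before a failure) is rare.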

[CV-132] MF-UAVPose6D: A Model-Free Monocular 6-DoF Pose Estimation Framework for Fixed-Wing UAVs

Link: https://arxiv.org/abs/2606.29697
Authors: Juanqin Liu, Leonardo Plotegher, Eloy Roura, Shaoming He
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Abstract:For uncrewed aerial vehicles (UAVs), estimating six-degree-of-freedom (6-DoF) poses is essential for airspace situational awareness, target tracking, and counter-UAV operations. However, non-cooperative targets usually lack computer-aided design (CAD) models and keypoint priors, making existing model-based or keypoint-matching methods difficult to apply reliably. To address these challenges, this paper proposes MF-UAVPose6D, a model-free monocular 6-DoF pose estimation framework for fixed-wing UAVs. During inference, the method takes only a single red-green-blue (RGB) image and camera intrinsics as input. It first obtains a stable target anchor through heatmap-guided center localization, introduces a Perspective-Aware Module (PAM) to model observation-ray priors, exploits Dynamic Topological Sampling (DTS) to complement weak structural cues from the wings, fuselage, and tail, and adopts a decoupled translation-rotation pose decoding mechanism to estimate the 6-DoF pose. In addition, we construct the FW-UAV6DPose synthetic dataset, which covers fixed-wing UAV observations across diverse distances, viewpoints, and poses. Experimental results show that MF-UAVPose6D achieves accurate and efficient monocular 6-DoF pose estimation without requiring CAD models, and demonstrates strong robustness in long-range rotation estimation, depth recovery, and joint pose evaluation.

[CV-133] Progressive Self-Supervised Learning with Individualized Community Assignment for Brain Network Analysis

Link: https://arxiv.org/abs/2606.29695
Authors: Hairui Chen, Yanwu Yang, Jianfeng Cao, Hanyang Peng, Chenfei Ye, Ting Ma
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Brain networks exhibit a modular community structure that varies across individuals and neurological conditions. However, existing self-supervised learning (SSL) methods often overlook this heterogeneity, relying on generic masking strategies that fail to capture subject-specific functional organization. We propose BrainPICM, a self-supervised framework for brain network analysis via progressive, individualized, community-aware masking. BrainPICM formulates ROI-to-community mapping as a progressive unbalanced optimal transport process, yielding soft assignments and per-ROI confidence scores. Guided by these confidence estimates, a curriculum-style masking strategy gradually incorporates low-confidence, potentially pathological regions into training, enabling the model to learn both stable modular structures and individual variations. Additionally, a deviation-aware aggregation module quantifies functional reorganization by measuring mass redistribution relative to a population template, enhancing interpretability and downstream prediction. Experiments on three fMRI datasets (ABIDE-I, ADHD-200, ADNI) show that BrainPICM consistently outperforms state-of-the-art supervised and SSL methods in diagnostic accuracy, indicating that explicitly injecting modular community structure into masked modeling yields more functionally consistent and generalizable representations. The source code for this approach will be released at this https URL.

[CV-134] PoseShield: Neural Collision Fields for Human Self-Collision Resolution ECCV2026

Link: https://arxiv.org/abs/2606.29686
Authors: Zhengyuan Li, Zeyun Deng, Yifan Shen, Liangyan Gui, Miaolan Xie, Joseph Campbell, Xifeng Gao, Kui Wu, Zherong Pan, Aniket Bera
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2026

Abstract:Self-collision remains a persistent challenge in SMPL-based human pose estimation and motion generation. Under extreme articulations or stochastic motion synthesis, generated meshes frequently exhibit self-penetrations, leading to physically implausible results. We propose PoseShield, a neural collision constraint defined directly in SMPL pose space. We formulate collision correction as a constrained optimization problem and connect the learned constraint with the Eikonal equation. Enforcing Eikonal regularization ensures non-vanishing gradients near the collision boundary, improving numerical stability and robustness of the optimization process. Unlike prior methods that operate in the mesh space or rely on heuristic penalties, our approach operates directly in the low-dimensional space of human poses and is theoretically grounded. The same learned constraint extends to human motion sequences, providing a generator-agnostic post-hoc collision corrector without retraining the underlying motion model. Experiments on a newly constructed SMPL pose benchmark show that our method achieves a 95.8% success rate and outperforms state-of-the-art baselines.
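
The constrained-optimization view and the Eikonal regularizer mentioned in the abstract can be written out in a generic form. The notation below is an assumption for illustration and is not taken from the paper: $\theta$ is an SMPL pose vector, $\theta_0$ the input pose, and $\phi$ a learned collision function with $\phi(\theta) \ge 0$ taken to mean collision-free. The sign convention and the squared-distance objective are also assumptions.

```latex
\begin{aligned}
\theta^\star &= \arg\min_{\theta}\ \lVert \theta - \theta_0 \rVert_2^2
\quad \text{s.t.} \quad \phi(\theta) \ge 0, \\
\lVert \nabla_\theta \phi(\theta) \rVert_2 &= 1
\quad \text{(Eikonal equation)}, \\
\mathcal{L}_{\text{eik}} &= \mathbb{E}_{\theta}\!\left[ \left( \lVert \nabla_\theta \phi(\theta) \rVert_2 - 1 \right)^2 \right].
\end{aligned}
```

A function that satisfies the Eikonal equation has unit-norm gradient everywhere, so its gradient cannot vanish near the boundary $\phi(\theta) = 0$. This is the property the abstract credits for the numerical stability of the correction step.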

[CV-135] Evolutionary Hyperparameter Optimization to Find Lightweight CNN Models for Autonomous Steering

Link: https://arxiv.org/abs/2606.29684
Authors: Devson Butani, Ryan Kaddis, Chan-Jin Chung
Categories: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 7 pages, 5 figures. Accepted at 2025 IEEE International Conference on Electro Information Technology (eIT). Author-accepted manuscript. Final published version: this https URL

Abstract:This research investigates the optimization of Convolutional and Dense Neural Networks (CNNs and DNNs) for autonomous steering using the (N+M) Evolution Strategy (ES) with the 1/5th success rule. The primary objective is to develop a lightweight CNN-based model capable of real-time steering angle prediction, mimicking human driving behavior on predefined paths. The ES algorithm automates hyperparameter tuning, dynamically adjusting parameters such as filter sizes and layer configurations. Data collection encompasses driving scenarios recorded via the LTU ACTor autonomous driving platform, including variations in path direction and driving style. The very small dataset consists of timestamped images labeled with steering angles and pre-processed to focus on relevant visual information. Initial experiments involve training a baseline CNN model, which is then refined using ES to significantly reduce the size of the model while maintaining competitive predictive accuracy. The results highlight the viability of lightweight neural network architectures for real-time autonomous systems, striking a balance between computational efficiency and performance. This study not only advances research on the use of evolutionary algorithms for autonomous driving applications but also lays the foundation for deploying cost-effective and scalable solutions in self-driving technology.
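
For readers unfamiliar with the (N+M) Evolution Strategy and the 1/5th success rule, here is a minimal sketch on a continuous toy objective. It is not the authors' tuning code: the paper searches CNN hyperparameters such as filter sizes, whereas this example minimizes an arbitrary function of a real vector. The function name `evolve`, the population sizes, and the step-size factor 0.85 are illustrative assumptions.

```python
import random


def evolve(fitness, dim, n_parents=5, n_offspring=20, generations=60, sigma=1.0, seed=0):
    """(N+M) evolution strategy minimizing `fitness`, with 1/5th-rule step-size control."""
    rng = random.Random(seed)
    parents = [[rng.uniform(-5.0, 5.0) for _ in range(dim)] for _ in range(n_parents)]
    for _ in range(generations):
        offspring, successes = [], 0
        for _ in range(n_offspring):
            parent = rng.choice(parents)
            # Gaussian mutation with the current global step size.
            child = [x + rng.gauss(0.0, sigma) for x in parent]
            offspring.append(child)
            if fitness(child) < fitness(parent):
                successes += 1
        # Plus selection: parents compete with offspring, so the best is never lost.
        parents = sorted(parents + offspring, key=fitness)[:n_parents]
        # 1/5th success rule: widen the search when more than 1/5 of the
        # mutations improve on their parent, narrow it when fewer do.
        ratio = successes / n_offspring
        if ratio > 0.2:
            sigma /= 0.85
        elif ratio < 0.2:
            sigma *= 0.85
    return min(parents, key=fitness), sigma
```

Because selection is taken over parents and offspring together, the best fitness in the population can only stay the same or improve from one generation to the next. For hyperparameter search, `fitness` would wrap a training-and-validation run and the mutation would need to respect discrete ranges.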

[CV-136] Unlocking the Visual Record of Materials Science: A Large-Scale Multimodal Dataset from Scientific Literature

Link: https://arxiv.org/abs/2606.29667
Authors: Subham Ghosh, Shubham Tiwari, Mohammad Ibrahim, Abhishek Tewari
Categories: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)

Abstract:The materials science literature encodes decades of experimental knowledge in figures, yet this visual record remains locked away and inaccessible to AI at scale. The core difficulty is structural: most scientific figures are compound, with a single caption describing multiple sub-panels simultaneously, making direct image-text pairing unreliable. We present MatMMExtract, an end-to-end open-source pipeline that resolves this by decomposing compound figures into individual sub-panels and generating structured, grounded annotations using a large language model guided by a curated materials science taxonomy. Applied to 14,810 open-access articles, MatMMExtract produces MatSciFig: 391,606 panel-level image-text pairs from 180,571 figures, each annotated with a sub-caption, a two-level visualisation category spanning 19 classes and over 100 subtypes, and a scientific summary. To enable accurate panel localisation, we introduce MaterialScope, a domain-specific detection dataset of 2,811 manually annotated materials science figures, on which a fine-tuned YOLO12-m detector achieves mAP_50 of 0.9227. Among six benchmarked language models, Gemini 3.1 Flash Lite delivers the best cost-quality trade-off for annotation generation, with 82% of outputs rated good and a hallucination rate of 4.8%. A dual-encoder retrieval baseline on MatSciFig achieves a 4.4 times improvement in R@1 over zero-shot CLIP, demonstrating the dataset’s immediate utility for vision-language learning. All resources are released openly to the community.

[CV-137] Benchmarking Geospatial Foundation Models for Agriculture Applications

Link: https://arxiv.org/abs/2606.29664
Authors: Zhuocheng Shang, Sanmay Das, Ahmed Eldawy
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Submitted to ACM SIGSPATIAL 2026

Abstract:Geospatial foundation models pretrained on satellite imagery promise broad generalization across remote sensing tasks and regions, but their geographic transferability has not been systematically tested, especially in agricultural applications. This paper presents a controlled benchmark that evaluates three models, Prithvi, SpectralGPT, and SatMAE, on multi-temporal crop segmentation and change detection across four U.S. states: Iowa, North Carolina, California, and Minnesota. By assigning each train, validation, and test split to a separate region, we measure how well each model transfers to land it has not seen. All three degrade sharply under regional distribution shift, predicting only the most common crops while missing rare ones. We further find that fitting these models to a shared input format affects each one differently, which complicates direct architectural comparison. These results expose key limitations of current geospatial foundation models for agriculture and point to region-aware evaluation as a necessary standard.

[CV-138] One Scene Two Depths: Probing Geometric Ambiguity in Monocular Foundation Models ECCV

Link: https://arxiv.org/abs/2606.29600
Authors: Xiaohao Xu, Feng Xue, Xiang Li, Haowei Li, Shusheng Yang, Tianyi Zhang, Matthew Johnson-Roberson, Xiaonan Huang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 49 pages, 25 figures; Accepted by European Conference on Computer Vision (ECCV) 2026

Abstract:A faithful 3D world representation should account for layered geometry, where a single camera ray may contain multiple visible and geometrically valid surfaces. Monocular depth estimation, however, reduces this structure to one scalar depth per pixel. Transparent scenes make this ambiguity measurable: the same ray can pass through foreground glass and observe the background, turning the supervised target into a convention of annotation, data, and training rather than a scene-intrinsic truth. A learned predictor exposes this convention as its depth-layer preference. We introduce MultiDepth-3k (MD-3k), a sparse two-layer ordinal benchmark for measuring depth-layer preference and multi-layer spatial relationship accuracy (ML-SRA). On MD-3k, leading depth foundation models exhibit diverse layer preferences under standard RGB input, showing that the same layered geometry can be resolved differently across models. We further find that Laplacian Visual Prompting (LVP), a training-free spectral input transformation, can substantially change the reported layer for certain frozen models. The strongest RGB/LVP pair, DAv2-L, reaches 75.5% ML-SRA. These results suggest that depth foundation models may express complementary geometric hypotheses that standard RGB inference leaves unexpressed. We invite the community to rethink depth supervision and evaluation through an ambiguity-aware lens, where multiple valid 3D interpretations are treated as geometric structure to be measured, preserved, and expressed.

[CV-139] SonoCLIP: Mask-Guided Region-Aware Vision-Language Pretraining for Fetal Ultrasound Analysis

Link: https://arxiv.org/abs/2606.29586
Authors: Hang Su, Chao Sun, Zhaofan Li, Wei Hu, Juhua Liu, Bo Du
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Vision-language foundation models have shown strong potential in medical image analysis. Although foundation models for ultrasound imaging have recently emerged, the domain remains particularly challenging due to severe speckle noise, acquisition variability, and subtle anatomical boundaries, leading to high inter-observer variability. Existing CLIP-based models rely primarily on global image-text alignment, limiting their sensitivity to clinically decisive local structures. We propose SonoCLIP, the first million-scale region-controllable fetal ultrasound vision-language foundation model that integrates segmentation masks as mask-channel visual prompts within the vision encoder, enabling joint global-local contrastive representation learning. To support scalable region-text alignment, we introduce a sigmoid-based pairwise contrastive loss that improves stability under large-scale supervision. We further curate a 1.44M-image multimodal fetal ultrasound dataset spanning 24 standard planes for large-scale pretraining. Extensive cross-center evaluations demonstrate that SonoCLIP achieves superior zero-shot transfer performance under both global and mask-guided inference, establishing a controllable and clinically oriented foundation model for fetal ultrasound analysis. Our code and data are available at this https URL.
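
The abstract does not spell out its sigmoid-based pairwise contrastive loss. As a rough illustration of the general idea, the sketch below follows the widely used SigLIP-style formulation, in which every image-text pair in a batch is treated as an independent binary classification (matched or not). This is an assumption about the family of loss, not the paper's definition. The function names are hypothetical, and the default `temperature` and `bias` values are illustrative.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_pairwise_loss(img, txt, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of embeddings.

    img[i] and txt[i] are assumed to be a matched pair (label +1);
    every other combination img[i], txt[j] is a negative (label -1).
    """
    n = len(img)
    total = 0.0
    for i in range(n):
        for j in range(n):
            similarity = sum(a * b for a, b in zip(img[i], txt[j]))
            logit = temperature * similarity + bias
            sign = 1.0 if i == j else -1.0
            total -= log_sigmoid(sign * logit)
    return total / n
```

Unlike a softmax-based contrastive loss, each pair contributes its own term and no normalization over the batch is required, which is one commonly cited reason such losses behave well at large batch sizes.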

[CV-140] ScAle: Attention Head Scaling as a Minimal Adapter for Spatial Reasoning in Vision Language Models ECCV2026

Link: https://arxiv.org/abs/2606.29579
Authors: Rahul Chowdhury, Timothy A Rupprecht, Xuan Shen, Pu Zhao, Yanzhi Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Accepted by ECCV 2026

Abstract:Spatial reasoning remains a persistent challenge for many vision language models (VLMs), and improving it typically requires fine-tuning with substantial additional parameters. Our preliminary analysis reveals that rescaling activations in selected transformer layers, without modifying pretrained weights, can significantly influence downstream performance. Motivated by this observation, we propose ScAle, an ultra-lightweight adaptation method that learns a small set of scalar coefficients to modulate last-token attention and MLP activations in a fully frozen backbone. We evaluate our method on the synthetic spatial reasoning benchmark SpatialEval and on real-world VQA datasets (COCOQA and VGQA) across multiple model families. ScAle achieves up to 134.1% relative accuracy gains using only 1K trainable parameters, rather than the millions required by standard PEFT methods such as LoRA. Despite its extreme compactness, our approach recovers a substantial fraction of standard PEFT performance while preserving strong non-spatial VQA accuracy. These results demonstrate that bounded activation reweighting provides a simple, architecture-agnostic, and highly parameter-efficient alternative for adapting pretrained VLMs.

[CV-141] ReMAP-PET: Beyond Visual Understanding – Learning Region-Guided Metabolic Alignment Semantics from Brain PET

Link: https://arxiv.org/abs/2606.29577
Authors: Dasen Dai, Yanteng Zhang, Shuoqi Li, Yuxiang Wei, Hongjie Yu, Qingxin Zhang, Qizhen Lan, Jagath C. Rajapakse, Vince D. Calhoun
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Positron Emission Tomography (PET) reveals brain metabolism and is clinically central to neurodegenerative disease assessment, yet existing 3D brain foundation models treat PET as generic volumetric data, missing the structured regional metabolic information that distinguishes it from structural neuroimaging. To address these limitations, we propose ReMAP-PET, a framework that moves beyond visual encoding by supervising a partially-tuned MedicalNet 3D ResNet-50 with brain regional standardized uptake value ratio (SUVR) profiles through joint regression and contrastive objectives, enabling the encoder to learn the metabolic semantics underlying PET modality. On 1015 paired PET–SUVR samples, ReMAP-PET achieves 0.070 SUVR MAE and 77.8% PET SUVR Recall@1, substantially outperforming five frozen pretrained baselines. We further connect the metabolic embedding to clinical language via contrastive alignment with frozen BioClinicalBERT and demonstrate end-to-end PET-to-report generation through SUVR-constrained verbalization. Linear probing on diagnostic classification and cognitive regression tasks confirms that the embeddings retain clinically relevant information without task-specific fine-tuning. Our results show that grounding PET encoders in regional metabolic semantics – rather than treating PET as generic volumetric data – yields representations that are structured, interpretable, and language-compatible, pointing to a new direction for metabolic-aware PET understanding.

[CV-142] Reliability-Prioritized Fine-Grained Generation in Multimodal Large

Link: https://arxiv.org/abs/2606.29573
Authors: Xiaomeng Fan, Wu Wei, Yuwei Wu, Zhi Gao, Shiyu Luo, Mingyang Gao, Haoyu Zhao, Zhenxin Diao, Yuxuan Ba, Lijia Feng, Yunde Jia, Mehrtash Harandi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Equal contribution: Xiaomeng Fan and Wu Wei. Corresponding authors: Zhi Gao and Yunde Jia

Abstract:Multimodal large language models (MLLMs) are increasingly expected to generate fine-grained descriptions of visual content. However, we observe and theoretically show that generating fine-grained responses poses a reliability challenge, i.e., fine-grained generation is more error-prone than coarse-grained generation. This phenomenon suggests that models should generate the finest description that remains reliable rather than simply produce more specific outputs. To investigate this problem, we develop GranFact, a granularity-aware benchmark consisting of expert-verified multi-object images with coarse-to-fine category annotations. Then, we design a hierarchy-aware evaluation algorithm, which assesses both whether model predictions are visually correct and how specific the correct predictions are. We also propose a reliability-prioritized preference optimization method based on Direct Preference Optimization, which penalizes unreliable fine-grained claims while rewarding reliable specificity. Experiments on GranFact show that our method improves fine-grained generation while preserving reliability. Code and data are available at this https URL.

[CV-143] GarmentZoom: Generating Zoomable Images from Garment Listings

Link: https://arxiv.org/abs/2606.29535
Authors: Renjie Zhao, Jingwei Ma, Huy Huynh Cao, Brian Curless, Steven M. Seitz, Ira Kemelmacher-Shlizerman
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Online product listings for garments often include an overview photo and a close-up to show garment details. However, each photo focuses on either field of view or garment detail, forcing users to alternate between views and breaking browsing continuity. We present GarmentZoom, a system that enhances the full-view photo to match the fidelity of its accompanying close-up, enabling seamless zoom-and-pan exploration. Unlike standard reference-based super-resolution, our setting involves close-up references that are spatially unaligned with the full view, and scale factors that vary substantially across garments (3-20×). Prior work typically relies on alignment to transfer details or requires per-instance fine-tuning to memorize them. Instead, we train a single model that supports a continuous range of scales across diverse garments. Our approach synthesizes details without requiring spatial alignment and matches the quality of per-instance methods with a fraction of the training cost.

[CV-144] MotionAtlas: Detailed Region Captioning for Motion-Centric Videos ECCV2026

Link: https://arxiv.org/abs/2606.29531
Authors: Weisong Liu, Haochen Wang, Kuan Gao, Yuhao Wang, Yikang Zhou, Zhongwei Ren, Jacky Mai, Anna Wang, Yanwei Li, Jason Li, Zhaoxiang Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ECCV 2026. Project page: this https URL

Abstract:We propose MotionAtlas, a system for detailed captioning of motion-centric videos, comprising (1) a dedicated human-annotated benchmark, (2) a scalable, high-quality pipeline to construct training samples, and (3) a family of powerful Video-MLLMs. Unlike conventional global motion captioning datasets, we focus on region-aware motion captioning: given a video and a spatiotemporal mask, the model generates precise descriptions of motion within the target region, thereby alleviating visual clutter and motion entanglement and enabling reliable, quantifiable evaluation. Concretely, we first build MotionAtlas-Bench, a comprehensive benchmark comprising 2,073 multiple-choice questions, meticulously annotated for a curated set of high-quality, motion-centric videos, to evaluate fine-grained motion understanding of the objects in question. Second, we design a rigorous and scalable data pipeline that leverages self-bootstrap refinement to suppress fine-grained hallucinations, yielding 159k high-quality motion captioning samples. Third, we design a tailored training data composition strategy, which achieves consistent and substantial performance gains across diverse baseline Video-MLLMs, including Molmo2 and Qwen3-VL. For instance, MotionAtlas-4B surpasses Qwen3-VL-4B by an average of 5.2 percentage points across general motion benchmarks. The benchmark, dataset, and code have been released.

[CV-145] Scenes as Objects Not Primitives: Instance-Structured 3D Tokenization from Unposed Views

Link: https://arxiv.org/abs/2606.29513
Authors: Mijin Yoo, In Cho, Subin Jeon, Jiwoo Lee, Eunbyung Park, Seon Joo Kim
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project page: this https URL

Abstract:A 3D scene is understood through its objects, not the primitives that compose them. Yet feed-forward reconstruction methods output dense, unstructured sets of points or Gaussians, leaving object-level structure to be recovered after the fact. We propose a feed-forward framework that decomposes a scene into instance-structured 3D token groups directly from unposed multi-view images – compact object-centric units from which reconstruction, segmentation, and manipulation all follow. Each token group pairs an instance token capturing entity-level identity with anchor tokens that encode local geometry and appearance, which are decoded into a set of 3D Gaussians. This two-level factorization decouples object identity from local appearance, making object instances a native interface of the representation rather than a derived product. The token groups are learned through differentiable rendering with joint reconstruction and segmentation supervision, requiring no 3D annotations. Our feed-forward model surpasses per-scene optimization baselines in class-agnostic instance segmentation while remaining competitive in novel view synthesis. Beyond these metrics, the same token groups directly unlock instance-level scene editing – removing, translating, or inserting objects by operating on their groups – as well as efficient open-vocabulary 3D instance retrieval, where retrieval complexity scales with the number of instances rather than primitives.

[CV-146] Benchmark AUC Is Not Deployable Reliability: A Cross-Dataset Audit of Off-the-Shelf Features for Surveillance Video Anomaly Detection

Link: https://arxiv.org/abs/2606.29506
Authors: Mohammadreza Rashidi
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: 10 pages, 5 figures, 8 tables

Abstract:Automated “suspicious behavior” flagging is a headline promise of AI surveillance, and the field reports high frame-level ROC-AUC on standard video anomaly detection benchmarks. Those numbers are measured by training and testing on the same camera and scene. We audit what happens when that assumption is dropped. We build an unsupervised normality model from the all-normal training frames of one dataset, using frozen off-the-shelf embeddings (CLIP, DINOv2, ResNet-50, EfficientNet-B0) and a nearest-neighbour distance, and score the test frames of the same and of other datasets. Across 4 real datasets (UCSD Ped1, UCSD Ped2, CUHK Avenue, ShanghaiTech) and 4 backbones, same-dataset AUC averages 0.704 but cross-dataset AUC averages 0.499, which is chance: a detector calibrated on one scene is no better than a coin flip on another, and in several pairs it is below chance. The strongest backbone makes this worse, not better: DINOv2 has the best same-dataset AUC (up to 0.901 on Ped2) and the largest cross-dataset drop. The collapse is not an artefact of the scoring rule: replacing the nearest-neighbour detector with a PaDiM-style Mahalanobis detector reproduces it almost exactly (cross-dataset gap 0.202 versus 0.208). Even at a favourable operating point the false-alarm rate is on the order of 31,931 per hour. We conclude that the benchmark numbers quoted for surveillance anomaly detection describe a calibrated laboratory setting and overstate deployable reliability by a wide margin, and we release the code that reproduces every number.
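
The scoring rule described in the abstract, a nearest-neighbour distance to embeddings of all-normal training frames, can be sketched in a few lines. This is not the released code: the embeddings are assumed to be precomputed by a frozen backbone, Euclidean distance is an assumption, and both function names are hypothetical. The second helper shows only the arithmetic that links a per-frame false-positive rate to alarms per hour, under the assumption that every frame in the hour is normal.

```python
import math


def nn_anomaly_scores(train_normal, test):
    """Score each test embedding by its distance to the closest normal training embedding."""
    if not train_normal:
        raise ValueError("need at least one normal training embedding")
    return [min(math.dist(x, t) for t in train_normal) for x in test]


def false_alarms_per_hour(false_positive_rate, frames_per_second):
    """Expected number of flagged normal frames in one hour of all-normal video."""
    return false_positive_rate * frames_per_second * 3600
```

A larger score means the frame lies further from anything seen in the normal training set. When the training and test scenes differ, every test frame is far from the training embeddings, which is one plausible reading of why cross-dataset AUC falls to chance in the audit.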

[CV-147] Empirical Evaluation of Multi-Modal Touch Detection in Over-the-Shoulder Video Surveillance

Link: https://arxiv.org/abs/2606.29504
Authors: Mohammadreza Rashidi
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: 9 pages, 5 figures, 6 tables

Abstract:Video Intelligence Surveillance (VIDINT) on over-the-shoulder footage is a proposed vector for monitoring human-computer interaction patterns without direct screen recording access. In this paper, we evaluate a Behavioral Intelligence (BEHINT) touch-detection framework designed to reconstruct keystroke events on mobile keypad interfaces from physical finger interactions. Our system integrates four parallel detection modalities: (1) anatomical hand landmarks via MediaPipe, (2) HSV skin color filtering, (3) temporal frame differencing for motion detection, and (4) shape-guided Canny edge analysis. We map relative touch coordinates to a reference screen layout to reconstruct typing sequences. Evaluation on a 120-frame first-person staged video of passcode entry reveals that while MediaPipe and Skin Detection fail to run autonomously due to partial hand occlusion and ambient noise, Motion-Only and Edge-Only configurations achieve F1-scores of 18.5% and 18.2%, respectively. The combined multi-modal configuration achieves an F1-score of 16.7% and a sequence similarity of 3.0% when mapped to the iOS passcode layout. We conduct ablation, resolution decay, noise sensitivity, and proximity threshold tuning to characterize the system’s operational envelope. We then audit generalization on 5 real, publicly licensed third-person phone videos and find that the detector emits a median of 57 touch points per frame (peaking at 205), one to three orders of magnitude more than the rate of real taps, because the skin filter responds to the whole hand rather than to fingertip contact. The staged keystroke result does not survive contact with uncontrolled footage; the system does not achieve reliable keystroke reconstruction outside the calibrated staged setting.

[CV-148] Learning Where and When: Patch-Based Spatiotemporal Localization in Weakly Supervised Video Anomaly Detection

Link: https://arxiv.org/abs/2606.29498
Authors: Hamza Karim,Nghia Nguyen,Lokman Bekit,Yasin Yilmaz
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Weakly supervised video anomaly detection (WSVAD) has predominantly focused on temporal localization, identifying when anomalies occur while largely neglecting their spatial extent within frames. Yet, spatial localization is essential for interpretability and practical deployment in real-world settings. We introduce a patch-based spatiotemporal framework for weakly supervised anomaly localization that jointly models where and when anomalies occur. Our approach operates on grid-level patch features and learns region-level anomaly scores under a multiple instance learning paradigm. We further propose a Proximity-Aware Top-k spatiotemporal selection strategy that enables the model to generate fine-grained spatial anomaly maps without requiring bounding-box supervision during training. Our method surpasses existing state-of-the-art approaches across multiple benchmarks, yielding substantial gains in spatiotemporal localization accuracy. In addition, we release frame-level bounding-box annotations for the test sets of two widely used datasets, along with our code and pretrained models, providing new resources to facilitate future research in spatially grounded WSVAD.
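
The multiple instance learning setup mentioned above can be illustrated with a plain top-k sketch: a video is a bag of patch scores, and only the k highest scores represent the bag. This is a generic top-k MIL ranking loss, not the paper's Proximity-Aware Top-k strategy, whose details the abstract does not give; all names and the margin value are assumptions.

```python
def topk_bag_score(patch_scores, k):
    """Average of the k highest patch-level anomaly scores in one video (bag)."""
    top = sorted(patch_scores, reverse=True)[:k]
    return sum(top) / len(top)


def mil_ranking_loss(abnormal_scores, normal_scores, k, margin=1.0):
    """Hinge loss pushing the abnormal bag score above the normal bag score."""
    gap = topk_bag_score(abnormal_scores, k) - topk_bag_score(normal_scores, k)
    return max(0.0, margin - gap)


if __name__ == "__main__":
    abnormal = [0.1, 0.9, 0.5, 0.7]   # patch scores from a video labeled anomalous
    normal = [0.2, 0.1, 0.3, 0.1]     # patch scores from a video labeled normal
    print(topk_bag_score(abnormal, k=2))
    print(mil_ranking_loss(abnormal, normal, k=2))
```

Because only video-level labels enter the loss, the patch scores themselves are what end up providing the spatial map at test time.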

[CV-149] Rectifying Mask via Entropy for Distractor-Free 3DGS in Ambiguous Scenarios

Link: https://arxiv.org/abs/2606.29496
Authors: Wongi Park,Jiyeon Lim,Minjae Lee,Myeongseok Nam,Seongjun Choi,Jungwoo Kim,Soomok Lee,William J. Beksi,SangHyun Lee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 28 pages, 30 figures, and 24 tables

Abstract:We present RefineSplat, a systematic framework that constructs transient masks to identify diverse ambiguous distractors. To do this, we qualitatively and quantitatively analyze these issues and propose a novel entropy-aware adaptive masking method. Unlike existing approaches that struggle to distinguish transient elements from static scenes due to color or semantic ambiguity, RefineSplat captures ambiguous distractors by leveraging entropy and instance masks. Furthermore, we propose a simple yet effective entropy-aware density control that aligns Gaussians in ambiguous scenarios using entropy-aware positional gradients. Additionally, to rigorously validate our method, we create and release the Ambiguous wild dataset, which includes 18 scenes where distractors and static scenes are hard to distinguish due to color or semantic resemblance. Experimental results on various datasets demonstrate that RefineSplat achieves state-of-the-art performance in distractor-free novel view synthesis.

[CV-150] VCS-SLAM: Geometry-Validated Semantic Evidence Fusion for 3D Gaussian SLAM

Link: https://arxiv.org/abs/2606.29494
Authors: Raman Jha,Shuaihang Yuan,Yi Fang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visual SLAM performance often deteriorates in complex real-world applications. Semantic 3D Gaussian SLAM commonly fuses 2D semantic priors into a persistent 3D map using uniform optimization weights. However, such priors are not equally reliable in online mapping: occlusions, unsupported semantic boundaries, and ambiguous ray geometry can introduce persistent semantic artifacts into the global Gaussian map. We propose VCS-SLAM, a geometry-validated semantic evidence fusion framework for RGB-D 3D Gaussian SLAM. Instead of treating all semantic observations as uniformly valid supervision, VCS-SLAM evaluates their geometric reliability through visibility consistency, surface-supported boundary evidence, and ray-level conflict uncertainty. The resulting reliability-aware objective suppresses occluded semantic updates, reduces unsupported semantic bleeding, and delays premature label assignment in ambiguous regions. Experiments on Replica demonstrate improved semantic consistency, boundary preservation, and reconstruction quality. Results on ScanNet further show that VCS-SLAM maintains competitive tracking performance under real RGB-D inputs.

[CV-151] The Calibrated Deepfake Trust Score (CDTS): Competence-Coupled Trust Degradation Across Deepfake Detectors

Link: https://arxiv.org/abs/2606.29484
Authors: Md Anas Biswas
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 27 pages, 13 figures, 11 tables

Abstract:Modern deepfake detectors are rarely consumed as bare classifiers. In moderation, provenance, and verification pipelines their output probability is read as a degree of trust, so its calibration matters as much as raw accuracy. We reframe deepfake detection as a calibrated, self-auditing trust instrument, the Calibrated Deepfake Trust Score (CDTS), and identify what governs its trustworthiness. Our central finding is a competence-calibration coupling: the calibration of the trust score degrades as the detector’s discriminative competence falls. We establish it across 32 configurations (pooled Pearson r = -0.81), demonstrate it within a single dataset, reinforce it by inducing low competence directly, and replicate it on a fourth held-out dataset the detectors never trained on. It holds across three architecturally distinct detectors, two convolutional networks and a CLIP vision transformer (r = -0.88, -0.83, -0.86). The result is also deployable: a single calibrator frozen on in-domain data fails on exactly the low-competence generators the coupling flags (its error tracks competence at r = -0.98), and competence is estimable without labels, so a label-free monitor flags calibration risk on unseen generators and routing source-batches on a reference-free competence estimate lowers overall AURC and improves the low-to-mid coverage operating region relative to confidence-based routing. The same competence factor also drives calibration inequity across demographic subgroups (distinct from accuracy inequity) and explanation faithfulness. We therefore argue that detector trustworthiness is organized by competence as a shared driver, that competence is the right quantity to estimate and condition on, and that trust scoring must be competence-aware. We offer the CDTS wrapper as the mechanism, and report openly where the unification is tight and where it is architecture-specific.
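
Calibration, the central quantity in this abstract, is commonly measured with the expected calibration error (ECE). The sketch below shows the standard equal-width binned ECE; whether the paper uses this exact metric or bin count is an assumption, and the function name is illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |mean confidence - accuracy| over bins.

    `confidences` are predicted probabilities in [0, 1];
    `correct` are 1/0 flags saying whether each prediction was right.
    """
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


if __name__ == "__main__":
    # An overconfident detector: 95% confidence, 75% accuracy.
    print(expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 1, 0]))
```

Under the coupling the abstract describes, one would expect this number to rise as the detector's discriminative competence on a generator falls.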

[CV-152] MAVIN: Multi-Shot Audio-Visual Generation with Narrative Control

Link: https://arxiv.org/abs/2606.29473
Authors: Kaiqi Liu,Yunyao Mao,Ziqi Cai,Zheng Geng,Jing Wang,Qiulin Wang,Xintao Wang,Pengfei Wan,Kun Gai,Shuchen Weng,Boxin Shi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While recent generative models produce high-fidelity videos, they struggle with the complex narrative control required for coherent multi-shot audio-visual generation. Existing methods suffer from temporal misalignment, limited controllability, and incomplete scripting. In this paper, we propose MAVIN, the first framework for multi-shot audio-visual generation with customized narrative control. To resolve temporal misalignment, we propose boundary-aware attention, which leverages hierarchical captions and boundary-aware token routing to render audio-visual elements within their respective temporal boundaries. To improve the controllability for multi-subject scenarios, we propose ID-aware propagation, utilizing identity embeddings and an identity-aware mask to bind specific identities to consistent visual appearances and vocal timbres. To provide comprehensive audio-visual narratives, we present a multi-agent scripting pipeline to transform free-form user inputs into hierarchical captions. Furthermore, we construct MAVINSet, a multi-shot audio-visual dataset for robust training and evaluation. Extensive experiments demonstrate that MAVIN achieves state-of-the-art performance, opening up a new avenue for integrating generative models into professional filmmaking workflows.

[CV-153] Rank-Aware Hyperbolic Alignment for Vision-Language Dataset Distillation ECCV2026

Link: https://arxiv.org/abs/2606.29464
Authors: Jongoh Jeong,Sun-Kyung Lee,Kuk-Jin Yoon
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted for publication at ECCV 2026. Project Page: this https URL

Abstract:Vision-language dataset distillation (VLDD) compresses a large image-text paired dataset into a small set of synthetic pairs that can efficiently train contrastive vision-language models under strict data and compute budgets. Most existing methods match expert trajectories or cross-modal statistics, yet still enforce full-dimensional alignment in a Euclidean embedding space. This is often overly restrictive due to rank-deficient image–text correlation, with shared semantics concentrated in a low-dimensional range and remaining variation spread across a weakly correlated residual subspace. LoRS relaxes alignment at the similarity level by low-rank factorization, but does not explicitly control dominant alignment capacity and structure in the representation space. We thus propose a rank-aware hyperbolic alignment (RAHA) that combines hierarchical geometry with explicit alignment-capacity control. RAHA lifts multimodal representations to hyperbolic space and optimizes distilled pairs with asymmetric objectives that enforce geodesic alignment in the shared range while regularizing the residual subspace to preserve modality-private diversity and improve transfer robustness. Experiments on benchmarks show that RAHA demonstrates competitive cross-modal retrieval and improved transfer indicators under fixed budgets.
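
As background for "lifting representations to hyperbolic space", here is a minimal sketch of the Poincaré-ball exponential map at the origin and the geodesic distance. The abstract does not say which hyperbolic model or curvature RAHA uses, so the Poincaré ball, the curvature parameter `c`, and all function names are assumptions.

```python
import math


def exp_map_origin(vector, c=1.0):
    """Map a Euclidean (tangent) vector at the origin into the Poincare ball of curvature -c."""
    length = math.sqrt(sum(a * a for a in vector))
    if length == 0.0:
        return list(vector)
    scale = math.tanh(math.sqrt(c) * length) / (math.sqrt(c) * length)
    return [scale * a for a in vector]


def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points inside the Poincare ball."""
    diff_sq = sum((a - b) ** 2 for a, b in zip(x, y))
    x_sq = sum(a * a for a in x)
    y_sq = sum(b * b for b in y)
    argument = 1.0 + 2.0 * c * diff_sq / ((1.0 - c * x_sq) * (1.0 - c * y_sq))
    return math.acosh(argument) / math.sqrt(c)


if __name__ == "__main__":
    image_point = exp_map_origin([0.3, 0.4])
    text_point = exp_map_origin([0.35, 0.35])
    print(poincare_distance(image_point, text_point))
```

A geodesic alignment objective would minimize a distance of this kind between paired image and text embeddings, here only within whatever shared subspace the method selects.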

[CV-154] CellDETR: A Detection-Guided Framework for Scalable Cell Representation Learning from Histopathology Images

Link: https://arxiv.org/abs/2606.29463
Authors: Shikang Zhang,Guojun Li,Yicong Mao,Chulin Sha
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 5 figures

Abstract:Recent advances in pathology foundation models have substantially improved patch- and slide-level representation learning from whole-slide images (WSIs). However, cell-level representation learning remains underexplored, limiting cell-resolved interpretability, biological discovery, and clinical translation. We propose CellDETR, a detection-guided framework built on Deformable DETR for scalable cell representation learning from WSIs. By introducing location feature decoupling and a box-constrained attention mechanism, CellDETR enables automated extraction of cell-level embeddings and outperforms existing state-of-the-art methods in supervised cell classification on PanNuke data. In addition, by incorporating a contrastive learning design, we build a CellDETR-based pretraining model for scalable cell representation learning from unlabeled WSIs, which improves downstream cell classification performance. Furthermore, we show that after pretraining with Xenium spatial transcriptomics-derived cell annotations, CellDETR achieves accurate cross-dataset cell classification, demonstrating the transferability and biological relevance of the learned cell embeddings. Together, CellDETR provides a scalable route toward a general cell-level representation learning framework for interpretable computational pathology.

[CV-155] MIRROR: Aligning Semantic Relations from Language to Image via Gromov–Wasserstein ECCV2026

Link: https://arxiv.org/abs/2606.29462
Authors: Hong-Han Wang,Yuntao Wang,Hu Ding
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026. 18 pages, 4 figures

Abstract:Multimodal Large Language Models (MLLMs) inherit rich relational priors from their language backbones, yet often fail when asked to apply these relationships in visual contexts. We trace this failure to a structural blind spot: projection-based alignment trains each visual token to carry the right semantics, but never asks whether the relationships between concepts survive the crossing from language to vision. To address this, we propose MIRROR (Mapping Inter-concept Relations from language to visual Representation via Optimal-transport-based Regularization), a geometric regularization framework that transfers relational priors from language to vision by exploiting the rich relational structure encoded in language representations. Specifically, we derive a surrogate loss from the proposed Semi-Inverse Gromov-Wasserstein (SI-GW) problem, an inverse geometric problem that aligns visual representations with language-derived relational priors. We show that this formulation admits a unique closed-form solution that prescribes the ideal visual relational structure implied by language geometry and cross-modal coupling. The structure of the formulation also enables efficient computation, making it applicable to long token sequences. Applying SI-GW inside decoder-only Transformers requires careful design. We introduce targeted strategies at the layer, head, and token levels to ensure stable extraction without additional parameters or inference cost. MIRROR improves relational consistency while preserving performance on general vision-language tasks.

[CV-156] From Phase to Phenomenon: Self-Supervised Learning of Subsurface Scattering with Minimal Phase-shift Inputs ECCV2026

Link: https://arxiv.org/abs/2606.29461
Authors: Arjun Majumdar,Raphael Braun,Andreas Engelhardt,Hendrik P. A. Lensch
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026. 15 pages

Abstract:We propose a self-supervised pretraining framework for learning sub-surface scattering (SSS) light transport representations from minimal input. Our method leverages a stereo projector-camera setup that captures only eight high-frequency phase-shift profilometry (PSP) images per view to pretrain an encoder in a multi-view, multi-object setting. We introduce a tailored augmentation strategy for PSP-based SSS data, and show that it significantly outperforms standard ImageNet-style augmentations for SSL pretraining. The pretrained encoder learns generalizable SSS representations that transfer effectively to downstream tasks, including spatially varying relighting and representation evaluation using a kNN classifier. Combined with a decoder, the model reconstructs dense scattering footprint responses, trained using a dedicated cost function that improves accuracy, particularly for anisotropic footprints. Despite using only eight input images per view, our approach generalizes to unseen objects with complex geometry and material properties, achieving high-fidelity reconstructions while requiring orders of magnitude fewer images than prior methods.

[CV-157] Resonant Brane Splatting for Arbitrary-Scale Super-Resolution

Link: https://arxiv.org/abs/2606.29453
Authors: Giulio Federico,Giuseppe Amato,Claudio Gennaro,Fabio Carrara,Marco Di Benedetto
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments:

Abstract:Arbitrary-Scale Super-Resolution (ASR) reconstructs images at continuous magnification factors. Recent methods accelerate inference by replacing computationally heavy implicit neural decoders with explicit 2D Gaussian Splatting (GS). However, since standard Gaussians are smooth low-pass primitives, modeling edges and fine textures requires multiple overlapping, well-aligned splats, which creates severe bottlenecks during rasterization. To address this, we introduce Resonant Brane Splatting (RBS), a feed-forward ASR framework. RBS replaces flat Gaussians with Branes: expressive primitives that emit spatially varying colors to natively model local contrast and complex textures within a single footprint. We achieve this by augmenting the standard Gaussian envelope with internal Gaussian-Hermite modes, assigning a distinct color coefficient to each. The zero-order mode recovers standard GS, while higher-order modes capture high frequencies. We predict Brane parameters directly from low-resolution features. Because Branes provide a mathematically richer formulation than simple Gaussians, far fewer primitives need to overlap to reconstruct a given target pixel. To exploit this, we introduce an efficient, fully differentiable rasterizer with a precise culling strategy based on the classical quantum turning point. This allows us to safely skip negligible regions, drastically reducing the rendering overhead. Experiments on standard ASR benchmarks show that RBS improves reconstruction quality over implicit and GS baselines, while achieving a better speed-quality trade-off than prior GS methods.
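
The Gaussian-Hermite modes and the turning-point idea can be sketched in one dimension. The code below uses unnormalized modes H_n(x)·exp(-x²/2) and the harmonic-oscillator turning point sqrt(2n + 1), beyond which mode n only decays. How the paper normalizes its modes, extends them to 2D, or sets its culling radius is not stated in the abstract, so everything here is an illustrative assumption.

```python
import math


def hermite(order, x):
    """Physicists' Hermite polynomial H_order(x) via the three-term recurrence."""
    previous, current = 1.0, 2.0 * x
    if order == 0:
        return previous
    for k in range(1, order):
        previous, current = current, 2.0 * x * current - 2.0 * k * previous
    return current


def gauss_hermite_mode(order, x):
    """Unnormalized mode: Hermite polynomial times a Gaussian envelope."""
    return hermite(order, x) * math.exp(-0.5 * x * x)


def turning_point(order):
    """Classical turning point of oscillator level `order` in dimensionless units."""
    return math.sqrt(2 * order + 1)


def brane_profile(coefficients, x):
    """Weighted sum of modes; with a single coefficient this is a plain Gaussian."""
    return sum(c * gauss_hermite_mode(n, x) for n, c in enumerate(coefficients))


if __name__ == "__main__":
    print(brane_profile([1.0], 0.0))             # zero-order mode only
    print(brane_profile([1.0, 0.5, -0.2], 0.7))  # higher modes add internal variation
    print(turning_point(2))
```

The highest mode in use sets the largest turning point, which gives a natural bound on where a primitive can still contribute noticeably.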

[CV-158] The Platonic Defense: Backdoor Defense for Self-Supervised Encoders in the Era of Large Scale Pre-training

Link: https://arxiv.org/abs/2606.29451
Authors: Tuo Chen,Minjing Dong,Benlei Cui,Jian Liu,Jie Gui
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments:

Abstract:Self-supervised learning (SSL) pretrained models have become a dominant paradigm for visual representation learning, but they are vulnerable to backdoor attacks. Existing defenses struggle to defend against such attacks in a fully black-box setting because they often require access to labels, attack patterns, or training data. To tackle this issue, we propose a new attack-agnostic, model-agnostic, and modality-agnostic black-box test-time defense paradigm, called Platonic Representation Defense. It is inspired by the Platonic Representation Hypothesis, which suggests that large-scale independently trained encoders converge toward compatible projections of the same underlying reality. We formalize this idea as a conditional energy function defined over source representations and a set of reference representations. The energy function is trained for detection through noise-contrastive estimation and for representation purification through denoising score matching. Theoretically, the energy gap between matched and mismatched samples is lower bounded by the mutual information between source and reference representations. We demonstrate the effectiveness of our method on multiple self-supervised encoders and more than 10 attacks. The method can perform both representation detection and purification, and achieves substantial performance gains across multiple attacks. Code is available at this https URL.

[CV-159] Miti360: A Comprehensive Dataset for Improved Reforestation Monitoring

Link: https://arxiv.org/abs/2606.29447
Authors: Cedric Kiplimo,Samuel Mbatia,Ciira wa Maina,Arthur Sichangi,Dennis Gitundu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 figures, 4 tables, 25 pages (20 excluding references), Under review at Nature Scientific Data

Abstract:Over the past decade, interest in applying machine learning (ML) to automate forest monitoring has grown significantly. However, existing training datasets are predominantly drawn from North America, Europe, Asia, and Australia, leaving a critical gap in African forestry data. To address this limited geographic diversity, we present Miti360, a comprehensive dataset for reforestation monitoring that comprises high-resolution imagery, ground truth data, and longitudinal weather data. Data collection occurred within a 770-ha reforested section of the Kieni Forest in Kenya between March 2023 and February 2025. Miti360 comprises aerial photos (orthophotos and tiles) with tree bounding box annotations, terrestrial images (single and stereo), and detailed data records including tree biophysical parameters, species, and GPS coordinates, alongside historical weather data. Aerial surveys utilized a DJI Mavic 2 Pro, with imagery stitched via Agisoft Metashape and tiled using ArcGIS Pro, while terrestrial captures used smartphones and custom stereo cameras. Miti360 enables the training of ML systems for tasks such as accelerating tree censuses, matching species to geographical areas, modelling growth based on weather conditions, and developing digital twin frameworks. Models can be trained on Miti360 to address challenges specific to Sub-Saharan Africa, ultimately advancing reforestation monitoring and fostering sustainable forestry practices in underrepresented regions. We demonstrate the utility of this dataset by successfully tracking tree crowns across three years and improving the DeepForest model’s box precision and box recall by 12% and 69% respectively through fine-tuning on Miti360.

[CV-160] Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction ECCV2026

Link: https://arxiv.org/abs/2606.29445
Authors: Sunqi Fan,Qingle Liu,Runqi Yin,Meng-Hao Guo,Shuojin Yang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ECCV 2026. Project Page: this https URL

Abstract:Video understanding is a fundamental capability for multimodal intelligence, and recent Multimodal Large Language Models (MLLMs) have achieved remarkable performance on Video Question Answering (VideoQA) benchmarks. However, existing benchmarks primarily evaluate whether models can perceive shallow visual cues, while rarely examining whether MLLMs can learn deeper knowledge or procedural skills from video tutorials and generalize them to downstream long-horizon agentic tasks. To address this gap, we introduce VG-GUIBench (Video-Guided GUI Benchmark), a new benchmark designed to evaluate whether MLLM-based GUI agents can follow video tutorials to complete corresponding GUI interactive tasks. Furthermore, we observe that the performance of models on both VideoQA and video-guided agentic tasks critically depends on effective keyframe extraction. Based on this observation, we propose TASKER (Task-driven And Scene-aware Keyframe searchER), a keyframe extraction algorithm that jointly considers task relevance and scene dynamics to identify informative frames. Experimental results demonstrate that TASKER achieves significant performance improvements on both VideoQA and video-guided agentic task benchmarks, outperforming the best baseline by 2.0% on the EgoSchema fullset and 1.8% on the NExT-QA dataset, respectively. These results further highlight the potential of generalized keyframe extraction methods for video understanding tasks. Our code and data are available at this https URL.

[CV-161] EvLIR: Learning Illumination Residuals from Ordered Events for Low-Light Image Enhancement

Link: https://arxiv.org/abs/2606.29430
Authors: Haoxian Zhou,Chuanzhi Xu,Langyi Chen,Pengfei Ye,Haodong Chen,Qiang Qu,Ali Anaissi,Weidong Cai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Low-light image enhancement is severely ill-posed when the input frame contains missing structure, saturated noise, and weak local contrast. Event cameras provide asynchronous brightness-change observations with high temporal resolution, but prior works often treat voxel channels as an unordered or static feature stack before fusion, rather than explicitly modeling their within-window temporal evolution, weakening the temporal evidence that makes events useful. We propose EvLIR, a temporal-residual enhancement framework that learns illumination residuals from ordered events for low-light image enhancement. Given a low-light frame and its aligned event voxel, EvLIR preserves the ordered temporal bins of the event stream and introduces a Temporal Event Residual Module (TERM) to encode short-window event dynamics with a lightweight ConvGRU. The resulting temporal state is converted into a bounded illumination correction, which provides spatially adaptive photometric guidance for Retinex-style illumination estimation and subsequent reliability-aware image-event restoration. On SDE and SDSD indoor/outdoor benchmarks, EvLIR achieves the best result on eleven of twelve dataset-metric pairs, with average scores of 25.63 dB PSNR, 28.30 dB PSNR*, and 0.827 SSIM across the four benchmarks.

[CV-162] Robust Zero-shot Anomaly Detection under Limited Auxiliary Anomaly Priors ECCV2026

Link: https://arxiv.org/abs/2606.29428
Authors: Guanyu Lu,Fang Zhou,Cheqing Jin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2026

Abstract:Zero-shot anomaly detection aims to identify defects in arbitrary novel domains; however, existing models assume that the auxiliary data contains a rich diversity of anomalies, neglecting the far more complex and unpredictable variations in real-world target domains. This study introduces DIVE, the first approach to investigate the scenario of limited auxiliary anomaly priors and resolve the resulting substantial performance degradation. Through a shallow-and-deep text embedding injection strategy during visual encoding, DIVE learns to abstract generic anomaly concepts shared across the auxiliary training domain and diverse target domains. Moreover, we propose a disentanglement mechanism to tackle the suboptimal alignment between visual embeddings entangled with object semantics and object-agnostic textual prompts. Experiments demonstrate that, under the setting of limited anomaly patterns in auxiliary data, DIVE outperforms SOTA baselines by up to 16.2% and 28.5% on two classification metrics, and 23.4%, 24.1%, and 47.0% on three segmentation metrics, in terms of average performance across twelve datasets. Furthermore, it maintains highly competitive performance when auxiliary data exhibits sufficient anomaly diversity.

[CV-163] Bit-ViP: Leveraging Bit-planes to Preserve Visual Privacy in Images through Obfuscation

Link: https://arxiv.org/abs/2606.29417
Authors: Vishesh Kumar Tanwar,Ashish Gupta,Sanjay Madria,Sajal K. Das
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Emerging Technologies (cs.ET)
Comments:

Abstract:The unprecedented growth of computer vision applications, such as surveillance systems and social media, raises security and visual privacy concerns, especially when data is stored on cloud servers. Image obfuscation offers a way to preserve visual privacy while maintaining an adequate level of usability; thus, it has been a topic of great interest in recent years. However, prior obfuscation schemes are either vulnerable to malicious attacks, such as model inversion to reconstruct original images from obfuscated images, or generate non-trainable obfuscated images, making them unusable for achieving reasonable accuracy. This paper proposes a novel bit-plane-based image obfuscation scheme, Bit-ViP, to preserve visual privacy for image-based recognition tasks. The Bit-ViP scheme produces secure, usable images by incorporating an innovative end-to-end obfuscation function. While doing so, the obfuscated image would contain non-invertible noise (generated by Lorenz’s chaotic system and differential privacy), making it hard for an adversary to reconstruct the original image. We conduct extensive experiments on two popular activity recognition datasets, namely UCF101 and HMDB51, to validate the effectiveness of Bit-ViP. In the face of attacks on reconstruction, pixel frequency, information entropy, and pixel inter-correlation, we present a rigorous security analysis demonstrating tangible improvements over existing schemes.
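
For readers unfamiliar with bit-planes, the sketch below splits an 8-bit grayscale image into its eight binary planes and reassembles it. It shows only the standard decomposition that a bit-plane-based scheme starts from, not the Bit-ViP obfuscation function itself; the list-of-rows image format and function names are assumptions.

```python
def to_bit_planes(image, bits=8):
    """Split an image (rows of ints in 0..255) into `bits` binary planes.

    Plane 0 holds the least significant bit, plane 7 the most significant.
    """
    return [
        [[(pixel >> bit) & 1 for pixel in row] for row in image]
        for bit in range(bits)
    ]


def from_bit_planes(planes):
    """Rebuild the image by weighting each plane with its power of two."""
    height, width = len(planes[0]), len(planes[0][0])
    return [
        [sum(planes[bit][r][c] << bit for bit in range(len(planes)))
         for c in range(width)]
        for r in range(height)
    ]


if __name__ == "__main__":
    image = [[0, 255], [129, 64]]
    planes = to_bit_planes(image)
    print(planes[7])                 # most significant plane carries coarse structure
    print(from_bit_planes(planes))   # lossless round trip
```

Higher planes carry most of the visible structure, which is why treating planes differently is a natural lever for trading privacy against usability.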

[CV-164] Can Machines Really See Objects in Images? A Study Based on Syntactic Distance and Visual Self-Referential Instances

Link: https://arxiv.org/abs/2606.29416
Authors: Xingyu Peng,Junran Wu,Yue Hou,Zhongliang Qiao,Jiaheng Liu,Shangzhe Li,Jichang Zhao,Wenjun Wu,Xianglong Liu,Yongxin Tong,Li Dong,Ke Xu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages

Abstract:Can a vision model truly see an object, or does it only fit surface-level visual cues? Following Wittgenstein’s view that the limits of language are the limits of the world, we view a model’s recognition ability as bounded by the descriptive system it has learned. In current vision models, this system is often realized through learned feature representations that exploit local statistical cues. We therefore ask whether a model can still classify correctly when such local cues provide no stable basis for distinction. We formalize this question with syntactic distance, which measures class separability through the symmetry of the operations mapping one class to the other: positive distance exposes exploitable local features, whereas zero distance requires global semantics rather than local rules. We construct a visual self-referential task in maximum-variance binary noise: positive samples contain a closed square, while negative samples contain an otherwise identical square with one flipped boundary pixel. The two classes differ in global semantics but have zero syntactic distance, making local statistical shortcuts unreliable. Experiments on ResNets and Vision Transformers reveal a consistent phase-transition phenomenon: accuracy collapses to random guessing once the image scale crosses a critical point and does not recover within the tested range. Larger training sets and models only delay this collapse, while globally attentive ViTs reach it earlier. These results reveal a structural capability boundary of current architectures on global-concept tasks, suggesting that general intelligence may require creating new language, not reusing an existing one.
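
The task construction described above can be sketched as follows: draw fair-coin binary noise, stamp a closed square outline into it, and create the negative sample by flipping a single boundary pixel. The image size, square size, placement rule, and the choice to draw the outline as ones are assumptions; the abstract does not specify them.

```python
import random


def square_boundary(top, left, side):
    """Coordinates of the one-pixel-wide outline of a side x side square."""
    cells = set()
    for offset in range(side):
        cells.add((top, left + offset))             # top edge
        cells.add((top + side - 1, left + offset))  # bottom edge
        cells.add((top + offset, left))             # left edge
        cells.add((top + offset, left + side - 1))  # right edge
    return sorted(cells)


def make_pair(size, side, seed=0):
    """Return (positive, negative) images that differ in exactly one boundary pixel."""
    rng = random.Random(seed)
    noise = [[rng.randint(0, 1) for _ in range(size)] for _ in range(size)]
    top = rng.randrange(size - side + 1)
    left = rng.randrange(size - side + 1)
    boundary = square_boundary(top, left, side)

    positive = [row[:] for row in noise]
    for r, c in boundary:
        positive[r][c] = 1                          # closed outline

    negative = [row[:] for row in positive]
    r, c = rng.choice(boundary)
    negative[r][c] = 0                              # one flipped pixel opens the square
    return positive, negative


if __name__ == "__main__":
    positive, negative = make_pair(size=16, side=5, seed=1)
    for row in positive:
        print("".join(str(v) for v in row))
```

Any small local window in the negative image looks like plausible noise plus line fragments, so the label depends on whether the whole outline closes, which is the global property the study targets.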

[CV-165] FiRe: Frequency Reparameterization as a Preconditioner for Periodic Implicit Neural Representations

Link: https://arxiv.org/abs/2606.29414
Authors: Harinandan Shukla,Rajarshi Verma,Jitin Singla
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Periodic Implicit Neural Representations (INRs) such as SIREN and FINER assign every neuron the same global frequency, spending the representational budget inefficiently when local signal content varies. We introduce FiRe (Frequency Reparameterization), which accelerates optimization by reparameterizing the per-neuron frequency of periodic INRs without changing their underlying activation function. FiRe gives each neuron a bounded, input-dependent frequency via a separate low-rank gating path and is applicable to any periodic activation function. The gate acts as an implicit preconditioner that improves optimization conditioning at initialization via the Neural Tangent Kernel (NTK). This better-conditioned initialization makes optimization converge faster, and the high-frequency content of the reconstruction tracks the target more closely at a fixed computational budget. On 2D image fitting, FiRe increases PSNR over a parameter-matched baseline (up to +1 dB at short training budgets), with gains that vary with resolution and diminish at full convergence. We characterize how performance depends on resolution, rank, and training budget, and give an NTK account that predicts these trends.
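
One way to picture "a bounded, input-dependent frequency via a low-rank gating path" is the sketch below: a sine layer whose per-neuron frequency is squashed into a fixed range by a sigmoid over a rank-r gate. This is a guess at the general shape, not the FiRe formulation; the sigmoid bound, the frequency range, and every name here are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    """Dense matrix-vector product on plain lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def gated_sine_layer(x, weight, bias, gate_down, gate_up,
                     omega_min=10.0, omega_max=50.0):
    """Sine layer with a per-neuron frequency bounded to [omega_min, omega_max].

    gate_down is (rank x d_in) and gate_up is (n_out x rank), so the gate adds
    only rank * (d_in + n_out) parameters (assumed low-rank form).
    """
    pre_activation = [p + b for p, b in zip(matvec(weight, x), bias)]
    gate = matvec(gate_up, matvec(gate_down, x))
    omega = [omega_min + (omega_max - omega_min) * sigmoid(g) for g in gate]
    output = [math.sin(w * p) for w, p in zip(omega, pre_activation)]
    return output, omega


if __name__ == "__main__":
    x = [0.5, -0.25]
    weight = [[0.1, 0.2], [-0.3, 0.4], [0.05, -0.1]]
    bias = [0.0, 0.1, -0.1]
    gate_down = [[1.0, -2.0]]           # rank 1
    gate_up = [[0.5], [-1.5], [3.0]]
    output, omega = gated_sine_layer(x, weight, bias, gate_down, gate_up)
    print(omega)
    print(output)
```

With the gate weights at zero every neuron falls back to the midpoint frequency, so a fixed-frequency periodic layer is recovered as a special case of this sketch.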

[CV-166] Learning to Adaptively Allocate Gaussians for Arbitrary-Scale Image Super-Resolution

Link: https://arxiv.org/abs/2606.29400
Authors: Giulio Federico,Giuseppe Amato,Claudio Gennaro,Fabio Carrara,Marco Di Benedetto
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments:

Abstract:In computer graphics, visual content is continuously warped, zoomed and resampled. This occurs when engines upscale frames, users zoom into 3D scenes, or foveated VR applies varying scaling. Handling these transformations requires Arbitrary-Scale Super-Resolution (ASR). Traditional models, designed for fixed scales, typically predict at a lower integer scale (e.g., x4) and rely on sub-optimal interpolation for continuous resolutions, compromising quality. Furthermore, most methods process pixels uniformly. Since fine details are sparse, this creates overhead; efficiency dictates concentrating resources only where structural complexity demands it. While implicit models and Gaussian Splatting (GS) enable continuous representation, GS is advantageous due to adaptive densification. However, transitioning GS into a feed-forward model for ASR is non-trivial. Standard GS optimization needs high-resolution gradients to drive primitive growth, which are unavailable during inference. Thus, the network must autonomously predict GS densification from low-resolution inputs. To solve this, we propose QuADA-GS. After encoding inputs into a latent space, a Neural Routing Architecture evaluates local complexity to distribute a global budget, assigning specific upsampling factors to features to avoid redundant processing. Features are dynamically densified based on these factors, forming an irregular topology decoded into 2D Gaussian primitives. To coordinate features before decoding, we introduce Hierarchical Pointer Convolution. This non-grid operator achieves O(1) neighbor lookup complexity, facilitating efficient spatial communication and bypassing dense bottlenecks. Experiments show QuADA-GS achieves state-of-the-art ASR performance, maintaining low latency and a lean memory footprint.

[CV-167] NaLA: A 3D Native LLM Layout Agent for High-quality 3D Scene Generation ECCV2026

Link: https://arxiv.org/abs/2606.29395
Authors: Cheng Wan,Yongsen Mao,Wenzheng Wu,Yuxuan Xie,Chucheng Xiang,Runze Wang,Xiang Zhang,Zhongyuan Liu,Rushi Dai,Yuan Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026

Abstract:Recently, Large Language Models (LLMs) have emerged as promising layout agents for 3D scene generation. Existing layout agents still suffer from implausible layout generation because most of them convert 3D assets and 3D layouts into textual descriptions as inputs and outputs, which involves severe information loss due to the modality gap between texts and 3D assets and 3D layouts. We propose NaLA, a native 3D LLM layout Agent for high-quality 3D scene generation by placing 3D assets in the scene. For the inputs, NaLA encodes 3D scene boundaries and 3D assets directly into the LLM, preserving fine-grained geometry and enabling explicit reasoning over relationships like collisions, surface supporting, and containment. To accurately output the positions and orientations of assets, NaLA adopts a coarse-to-fine prediction mechanism that first predicts discrete poses in an autoregressive manner and then refines the discrete poses with a continuous regression. Trained on diverse layout datasets, NaLA attains strong geometric perception and layout coherence. Experiments demonstrate that NaLA outperforms prior layout agents in both generation quality and inference efficiency, with comprehensive ablation studies to verify each component’s effectiveness.
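
The coarse-to-fine pose prediction described above can be illustrated with a minimal sketch. This is not the authors' implementation: the bin count, the value range, and the function names `encode_pose` and `decode_pose` are assumptions made purely for illustration. The idea shown is that a continuous coordinate is split into a discrete bin index (the part an autoregressive model could predict as a token) and a small continuous residual (the part a regression head could refine).

```python
def encode_pose(value, lo, hi, num_bins):
    """Split a continuous value into (bin index, residual in units of bin width).

    Hypothetical helper: the bin index plays the role of the coarse discrete
    pose, the residual plays the role of the continuous refinement.
    """
    width = (hi - lo) / num_bins
    idx = int((value - lo) / width)
    idx = max(0, min(idx, num_bins - 1))       # clamp to a valid bin
    center = lo + (idx + 0.5) * width          # coarse estimate = bin center
    return idx, (value - center) / width


def decode_pose(idx, residual, lo, hi, num_bins):
    """Recombine the coarse bin and the continuous residual."""
    width = (hi - lo) / num_bins
    return lo + (idx + 0.5 + residual) * width


# A coordinate of 1.37 m in a room spanning 0..4 m, quantized into 16 bins.
idx, res = encode_pose(1.37, lo=0.0, hi=4.0, num_bins=16)
x = decode_pose(idx, res, lo=0.0, hi=4.0, num_bins=16)
```

With only the bin index, the reconstruction error is at most half a bin width; the residual removes that remaining error.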

[CV-168] Event-VLA: Action-Conditioned Event Fusion for Robust Vision-Language-Action Model

Link: https://arxiv.org/abs/2606.29384
Authors: Jiaxin Liu,Xun Xu,Zhenhao Zhang,Hanqing Wang,Ruiqi Chen,Shi Chang,Weiyu Guo,Laurent Kneip
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Vision-Language-Action (VLA) models have become an important paradigm of embodied AI. However, existing VLA models typically assume well-lit and stable indoor settings, while real-world embodied manipulation may involve degraded RGB observations caused by illumination shifts, posing critical challenges for robust robotic manipulation. To address this gap, we propose Event-VLA, an event-enhanced VLA framework for generalizable manipulation across varying illumination conditions. We formulate VLA-based manipulation under degraded visibility as a practical robustness problem for RGB-centric policies, and introduce event streams as an illumination-robust, motion-sensitive complementary observation to improve robustness across visibility levels. Specifically, unlike conventional multimodal fusion that directly merges event features into the global semantic token space, Event-VLA injects event information through an action-query routing pathway. It uses learnable action queries to extract task-relevant semantics from the VLA reasoning process, and selectively aggregates event tokens via gated cross-attention to construct event-aware action representations. This design preserves the pretrained RGB-language semantic priors while effectively leveraging event information for robust action prediction. Experiments in simulation and real-world deployment show that Event-VLA maintains strong manipulation performance under normal lighting and improves success rates under low-light degradation and near-dark real-world settings.

[CV-169] DR-GS: Physically-Based Deformable and Relightable 2D Gaussians

Link: https://arxiv.org/abs/2606.29379
Authors: Jiaxin Li,Tong Wu,Yi Wei,Tailin Wu,Li Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments:

Abstract:Gaussian splatting (GS) has garnered significant attention in VR/AR and digital content creation due to its explicit parameterization and efficient rendering capabilities. However, existing GS-based methods for deformable objects face two key limitations: (i) illumination is erroneously baked into textures, causing physically inconsistent responses under dynamic deformations and lighting changes; (ii) snapshot-based reconstruction restricts post-reconstruction material editing. To address these challenges, we propose Deformable and Relightable GS (DR-GS), a unified Gaussian framework that integrates physically-based inverse rendering, relighting, and deformation-aware manipulation. Through explicitly disentangling geometry, illumination, and material representations, DR-GS overcomes the limitations of static snapshots, resolving unrealistic appearance under varying conditions while enabling post-reconstruction parameter editing. Extensive experiments show that DR-GS achieves leading visual quality across static reconstruction, dynamic deformation, and relighting, reliably preserving reflections and specular highlights on glossy surfaces. It further establishes a fully decoupled geometry-illumination-material pipeline, enabling high-quality 3D asset creation and comprehensive post-editing.

[CV-170] SAD-GS: Learning Reliable 3D Semantic Gaussian Fields via Dynamic Geo-Semantic Anchoring

Link: https://arxiv.org/abs/2606.29376
Authors: Yufei Zhang,Chenlu Zhan,Gaoang Wang,Hongwei Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Open-vocabulary 3D semantic Gaussian field learning relies on multi-view 2D supervision, whose semantic targets and spatial assignments are often unreliable. Across varying viewpoints, view-dependent features cause semantic identity drift, while propagated tracker masks introduce boundary leakage and identity switches. Directly optimizing against these unreliable 2D targets forces the 3D representation to absorb multi-view contradictions, leading to severe error accumulation. To resolve this limitation, we propose SAD-GS, a framework for learning reliable 3D semantic Gaussian fields via dynamic geo-semantic anchoring. Specifically, Semantic Anchor Distillation (SAD) distills per-view visual embeddings into consensus text anchors to establish a viewpoint-invariant semantic identity. Concurrently, the Geo-Semantic Feedback Loop (GSFL) leverages the evolving 3D field to actively filter tracker anomalies and refine spatial mask assignments via a conservative three-gate update rule. Extensive evaluations on LERF-OVS, 3D-OVS, and Mip-NeRF360 show that SAD-GS consistently achieves the best overall performance in both open-vocabulary localization and semantic segmentation. These comprehensive improvements validate the effectiveness and robustness of dynamic geo-semantic anchoring for reliable 3D semantic Gaussian field learning.

[CV-171] L2D2-GS: Learning to Densify for Feedforward Dynamic Gaussian Scene Reconstruction

Link: https://arxiv.org/abs/2606.29374
Authors: Zetian Song,Chenming Wu,Junnan Liu,Chitian Sun,Liangliang He,Hangjun Ye,Jiaqi Zhang,Siwei Ma,Wen Gao
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:High-fidelity reconstruction of dynamic urban environments is a cornerstone of autonomous driving simulation and large-scale world modeling. While 3D Gaussian Splatting (3DGS) has established a new standard for real-time rendering, its reliance on expensive per-scene optimization limits scalability. Conversely, recent feedforward methods that infer Gaussian parameters offer faster speed but face fundamental bottlenecks: they are memory-prohibitive at high resolutions and struggle to fuse dense multi-view observations consistently. This paper presents L2D2-GS, a unified framework that reformulates generalizable reconstruction not as a one-shot regression, but as a robust iterative process of optimization and densification. To resolve the ambiguity of supervision in primitive generation, we propose a self-supervised densification policy that derives explicit reward signals from global reconstruction gains to guide local densification. Furthermore, we mitigate irreversible early-stage artifacts through a geometric regularization mechanism, utilizing reparameterization to constrain the optimization manifold and prevent convergence to poor local optima. Extensive experiments on the PandaSet and Waymo datasets demonstrate that our method achieves state-of-the-art reconstruction fidelity and strong zero-shot generalization, while using fewer primitives than competing baselines.

[CV-172] SAFE-DiT: Semantics-Aware Fast-path Execution for High-Resolution Diffusion Transformers

Link: https://arxiv.org/abs/2606.29360
Authors: Xuanhua Yin,Yuxuan Jia,Chuanzhi Xu,Weidong Cai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 12 figures, 21 tables

Abstract:High-resolution Diffusion Transformer (DiT) inference contains substantial spatial redundancy, but many spatially adaptive implementations encode regional computation as attention masks, which can inadvertently move scaled dot-product attention (SDPA) away from FlashAttention fast paths. We identify this avoidable systems bottleneck as Mask-Induced Dispatch Tax (MIDT) and show that it grows with latent sequence length. We introduce SAFE-DiT, a training-free Semantics-Aware Fast-path Execution framework that separates exact mask elision from approximation-based spatial scheduling. SAFE-DiT removes only provenance-certified image self-attention masks that induce a row-wise constant shift in attention logits, preserves semantics-bearing masks such as text-padding masks, and realizes spatial adaptation through prompt-conditioned token partitioning, selective state updates with global context, and periodic context refresh. We call this acceleration-only configuration SAFE-Core and report sensitivity-weighted classifier-free guidance separately as SAFE-DiT+SW. On the evaluated PyTorch SDPA stack, redundant masks make long-sequence attention 4.1× to 5.8× slower than the mask-free path. On Lumina-Next, SAFE-DiT achieves 2.69× end-to-end acceleration at 1024² resolution and 5.09× at 2560², reduces peak memory at 2560² from 94.1 to 27.9 GB, and enables 3072² generation when dense inference runs out of memory. Paired metrics, component ablations, and a blinded human study support visual non-inferiority of SAFE-Core to the dense fast-path baseline, while SAFE-DiT+SW provides a separate prompt-alignment operating point without reintroducing spatial self-attention masks. Code is available at this https URL.
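
The reason a mask that only adds a row-wise constant to the attention logits can be dropped exactly is that softmax is invariant to adding the same constant to every entry of a row. The following is a small pure-Python check of that property, not the paper's code; the logits and the per-row shift values are made up for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


# Two query rows of attention logits (illustrative values).
logits = [[0.2, -1.0, 3.1], [1.5, 0.0, -0.7]]
# One constant per query row, standing in for a redundant additive mask.
row_shift = [-4.0, 2.5]

plain = [softmax(r) for r in logits]
shifted = [softmax([v + c for v in r]) for r, c in zip(logits, row_shift)]

# Largest elementwise difference between the two attention matrices.
max_diff = max(abs(a - b) for p, s in zip(plain, shifted) for a, b in zip(p, s))
```

A mask whose entries vary within a row (for example a text-padding mask) does change the softmax output, which matches the abstract's distinction between removable and semantics-bearing masks.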

[CV-173] Dynamic Parsing and Updating Natural Language Specification using VLMs for Robust Vision-Language Tracking

Link: https://arxiv.org/abs/2606.29357
Authors: Xiao Wang,Liye Jin,Dan Xu,Yuehang Li,Lan Chen,Yaowei Wang,Yonghong Tian,Jin Tang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Vision-language tracking guided by natural language specifications leverages high-level semantic cues of target objects to substantially boost tracking accuracy and robustness. Existing studies have verified that adaptively optimizing textual descriptions throughout the tracking process can effectively mitigate the semantic-visual mismatch induced by dynamic variations in target appearance, position, and other inherent attributes. Nevertheless, mainstream methods that directly generate textual information via sequence models or large language models inevitably suffer from inherent defects, including erroneous target updating, excessive background distraction, and pervasive hallucination artifacts. To address the aforementioned limitations, this paper proposes a novel language dependency parsing mechanism to precisely distill core tracking principal components, encompassing target objects, semantic concepts, and background contextual information. On this basis, we perform component-aware adaptive textual description updates by exploiting the powerful cross-modal understanding capability of the pre-trained vision-language model Qwen-VL. By integrating the proposed elaborately designed modules into the baseline framework, our method achieves consistent and superior tracking performance on multiple large-scale vision-language tracking benchmarks, including TNL2K, LaSOT, TNLLT, and OTB-LANG. The source code and pre-trained models will be released at this https URL.

[CV-174] Fast Enough to Act: Spatio-Temporal Visual Token Merging for Low-Latency Robotic VLMs and VLAs

Link: https://arxiv.org/abs/2606.29350
Authors: Junzhou Chen,Jindong Wang,Gang Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision-language models and vision-language-action models endow robots with unprecedented capabilities. However, the input of video and high-resolution images yields a massive number of visual tokens, leading to extremely high inference latency and severely hindering the robot’s real-time control. To break through this computational bottleneck, we propose ST-Merge, a plug-and-play, training-free framework that efficiently fuses redundant tokens directly during the visual encoding phase. By explicitly constructing 3D spatiotemporal coordinates, it employs a multi-queue parallel matching and weighted aggregation mechanism to achieve efficient and geometrically consistent fusion of redundant tokens across frames. In addition, we introduce a post-merge positional correction mechanism that effectively eliminates spatial deviation caused by merging by dynamically re-evaluating the rotational position code of the weighted centroid of the vision token, thereby ensuring the high-precision spatial awareness required for dexterous operation. In the Video Question Answering task on the mainstream VLM, Qwen2.5-VL, ST-Merge achieves a 2× inference speedup with only a 1% loss in precision. When deployed on the π0.5 VLA policy, ST-Merge achieves an 8.3× speedup at 1024×1024 resolution and matches the baseline success rate at this high-resolution setting. At lower resolutions, it introduces a small drop in accuracy.
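
The weighted aggregation and weighted-centroid idea in the abstract can be sketched as follows. This is an assumption-laden illustration rather than ST-Merge itself: `merge_tokens` is a hypothetical helper, each token is assumed to carry a weight counting how many original tokens it already represents, and positions are assumed to be (frame, row, column) coordinates.

```python
def merge_tokens(feat_a, pos_a, w_a, feat_b, pos_b, w_b):
    """Merge two visual tokens by weighted averaging (hypothetical helper).

    feat_*: feature vectors, pos_*: (t, y, x) coordinates,
    w_*: number of original tokens each input already represents.
    """
    w = w_a + w_b
    feat = [(w_a * x + w_b * y) / w for x, y in zip(feat_a, feat_b)]
    # The merged token is placed at the weighted centroid of its members;
    # a positional encoding would then be recomputed at this location.
    pos = tuple((w_a * p + w_b * q) / w for p, q in zip(pos_a, pos_b))
    return feat, pos, w


# A single token merged with a token that already stands for three tokens.
feat, pos, weight = merge_tokens(
    [1.0, 0.0], (0, 2, 4), 1,
    [0.0, 1.0], (1, 2, 8), 3,
)
```

Carrying the weight forward keeps repeated merges equivalent to averaging all original tokens at once.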

[CV-175] W4A4 Quantization for Inference on Wan2.2-I2V-A14B ICME2026

Link: https://arxiv.org/abs/2606.29337
Authors: Yidong Chen,Chengyu Shi,Jiahao Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 4 pages, 8 figures; ICME 2026 Low-Bit-width Large-Model Quantization Challenge submission

Abstract:We summarize our submission to Sub-Challenge 1: W4A4 Quantization for Inference (HiF4 / MXFP4) of the ICME 2026 Low-Bit-width Large-Model Quantization Challenge. The sub-challenge targets 4-bit weight and 4-bit activation inference on Wan-AI/Wan2.2-I2V-A14B under HiF4 or MXFP4 numerical formats. We adapt two complementary ideas from LLM quantization, MixQ-style mixed precision for sparse activation outliers and SmoothQuant-style per-channel smoothing, together with block-wise HiF4 packing for Wan2.2 feed-forward linear layers. Calibration on representative OpenS2V-5M batches identifies heavy-tailed activation channels; smoothing rebalances dynamic range before W4A4 rounding; and a dual-branch GEMM preserves outlier columns in higher precision while the bulk of channels use strict W4A4. On official VBench I2V metrics, our pipeline stays within 2-3.5 percent of FP16 on most quality axes and improves motion smoothness, outperforming a native HiFloat4 baseline that degrades roughly 5 percent relative to FP16 across all reported scores.
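
The SmoothQuant-style per-channel smoothing mentioned above rests on a simple identity: dividing each activation channel by a scale and multiplying the matching weight row by the same scale leaves the matrix product unchanged while shrinking activation outliers. The sketch below checks this in pure Python. The matrices and `alpha = 0.5` are illustrative assumptions, and the submission's actual calibration, HiF4 packing, and dual-branch GEMM are not reproduced here.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


X = [[0.1, 40.0], [-0.2, -35.0]]   # activations; channel 1 is an outlier channel
W = [[0.5, -0.3], [0.02, 0.01]]    # weights; row j multiplies input channel j
alpha = 0.5                        # migration strength (assumed value)

# Per-input-channel scale: s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)
s = []
for j in range(len(W)):
    act_max = max(abs(row[j]) for row in X)
    w_max = max(abs(v) for v in W[j])
    s.append(act_max ** alpha / w_max ** (1 - alpha))

X_s = [[v / s[j] for j, v in enumerate(row)] for row in X]   # smoothed activations
W_s = [[v * s[j] for v in W[j]] for j in range(len(W))]      # compensated weights

ref = matmul(X, W)
out = matmul(X_s, W_s)
max_err = max(abs(a - b) for r, o in zip(ref, out) for a, b in zip(r, o))
outlier_before = max(abs(row[1]) for row in X)
outlier_after = max(abs(row[1]) for row in X_s)
```

After smoothing, the outlier channel's dynamic range is far smaller, which is what makes low-bit rounding of activations less lossy.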

[CV-176] Multi-scale Object-Aware Gaze Estimation via Geometric Reasoning ECCV2026

Link: https://arxiv.org/abs/2606.29334
Authors: Jiajie Mi,Xinyu Liu,Mengke Song,Chenglizhao Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2026

Abstract:Gaze target estimation aims to predict the semantic object an observer fixates upon within an image, a task deeply rooted in the object-oriented nature of human gaze. Observers tend to select a specific semantic entity as the attentional target, rather than responding randomly across arbitrary regions of the image. However, existing methods typically model this task as a direct mapping from global features to gaze heatmaps, essentially treating it as a pixel-level regression problem. This approach fails to explicitly represent the gazed object as a distinct entity, making it difficult to produce stable and semantically consistent predictions in complex scenes. To address this, we propose a two-stage gaze estimation framework guided by object semantics, reformulating gaze target estimation as a hierarchical reasoning process. Our method incorporates object-level representations during feature encoding to align image features with discrete semantic entities, then introduces multi-scale feature fusion and geometric constraints from head pose and gaze direction for fine-grained localization and object-level discrimination. Extensive experiments on GazeFollow, VideoAttentionTarget, ChildPlay, and GOO-Real demonstrate that our method achieves AUC of 0.961, 0.948, 0.987, and 0.977 respectively, delivering strong performance across all benchmarks while maintaining a compact parameter size of 7.1M.

[CV-177] HiReFF: High-Resolution Feedforward Human Reconstruction from Uncalibrated Sparse-View Video

Link: https://arxiv.org/abs/2606.29333
Authors: Yiming Jiang,Hanzhang Tu,Wenfeng Song,Siyou Lin,Liang An,Shuai Li,Aimin Hao,Yebin Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Uncalibrated volumetric video streaming for human reconstruction is essential for holographic communication and AR/VR, yet remains challenging due to the need for temporal consistency and computational efficiency from sparse-view inputs. Existing methods rely on per-scene optimization or calibrated cameras, while recent feed-forward models are limited to low-resolution (0.5K) single-frame synthesis. We present HiReFF, a feed-forward method for 2K-resolution 360° human video reconstruction from uncalibrated sparse-view videos. Our framework decomposes the problem into two key tasks: foreground 3D Gaussian reconstruction from sparse-view videos (four views separated by 90°) and computationally efficient high-resolution synthesis. To enable the former, we propose Scale-synchronized Camera Calibration to resolve scale ambiguity for multi-view supervision, and Gaussian-wise Foreground Masking to reconstruct clean foregrounds by modulating Gaussian parameters. For efficient high-resolution synthesis, our High-resolution Side-tuning achieves 2K rendering by augmenting the Gaussian head with supplementary features while keeping the backbone at 0.5K, drastically reducing computational overhead. Experiments demonstrate that HiReFF significantly outperforms existing methods in high-resolution streaming volumetric video reconstruction.

[CV-178] RAGA: Real Time Ray Traced Gaussian Shadow Casting for 3DGS Avatar-Scene Interaction ECCV2026

Link: https://arxiv.org/abs/2606.29329
Authors: Aymen Mir,Riza Alp Guler,Jian Wang,Peter Wonka,Bing Zhou,Gerard Pons-Moll
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2026. Project Page at this https URL

Abstract:We study the problem of physically plausible shadow casting when animating 3D Gaussian Splatting (3DGS) avatars, either individually or in multi-avatar and object-interaction scenarios, within existing 3DGS scenes. In contrast to prior methods that rely on binary hit tests and mesh-based shadow casters, our method performs shadow computation entirely in Gaussian space, without requiring any mesh reconstruction. We introduce RAGA, a Ray-Traced Gaussian Shadow Casting formulation based on exact ray-Gaussian line integrals. For each occluding Gaussian, we integrate the opacity profile along the shadow ray and normalize by the theoretical maximum integral, producing a weight that captures how the ray traverses the occluder rather than merely whether an intersection occurred. To reduce temporal variance from clothing deformations in animated avatars, we further introduce an avatar proxy representation that stabilizes shadow casting while preserving visual fidelity. We implement RAGA using custom CUDA kernels integrated with the NVIDIA OptiX framework; as such, our shadow tracer runs at rates of about 50 FPS. We evaluate on single-avatar, multi-avatar, and avatar-object interaction scenarios across multiple datasets, demonstrating substantially improved shadow realism, temporal stability, and scene coherence. Our project page is available at this https URL.
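
To make the "integrate along the ray and normalize by the maximum integral" idea concrete, here is a simplified sketch for an isotropic Gaussian, where the line integral has a closed form. This is an assumption for illustration only: the paper's formulation presumably handles anisotropic 3D Gaussians and bounded shadow rays, and the function names below are hypothetical. For density exp(-|x - mean|² / (2σ²)), the integral along an infinite line at perpendicular distance b from the mean is σ·sqrt(2π)·exp(-b² / (2σ²)), so the normalized weight is exp(-b² / (2σ²)).

```python
import math


def _unit(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def analytic_weight(origin, direction, mean, sigma):
    """Normalized line integral through an isotropic Gaussian (closed form)."""
    d = _unit(direction)
    rel = [m - o for m, o in zip(mean, origin)]
    t_closest = sum(r * c for r, c in zip(rel, d))
    # Squared perpendicular distance from the Gaussian mean to the line.
    b2 = max(sum(r * r for r in rel) - t_closest ** 2, 0.0)
    return math.exp(-b2 / (2.0 * sigma ** 2))


def numeric_weight(origin, direction, mean, sigma, half_len=50.0, steps=20000):
    """Midpoint-rule check of the same normalized integral."""
    d = _unit(direction)
    h = 2.0 * half_len / steps
    total = 0.0
    for i in range(steps):
        t = -half_len + (i + 0.5) * h
        r2 = sum((o + t * c - m) ** 2 for o, c, m in zip(origin, d, mean))
        total += math.exp(-r2 / (2.0 * sigma ** 2)) * h
    # Divide by the maximum integral, reached when the line passes through the mean.
    return total / (sigma * math.sqrt(2.0 * math.pi))


w_exact = analytic_weight((0, 0, 0), (0, 0, 2), (0.3, 0.4, 5.0), 0.5)
w_check = numeric_weight((0, 0, 0), (0, 0, 2), (0.3, 0.4, 5.0), 0.5)
```

Unlike a binary hit test, this weight falls off smoothly as the ray moves away from the occluder's center, which is the behavior the abstract contrasts with mesh-based shadow casters.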

[CV-179] FDM-MFVT: Few-step Sampling Diffusion Model for Mask-Free Virtual Try-On ECCV2026

Link: https://arxiv.org/abs/2606.29319
Authors: Jiaxin Liu,Xiaoye Liang,Lai Jiang,Mai Xu,Jun Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV2026

Abstract:Image-based Virtual Try-On (IVTON) has greatly advanced through diffusion models, yet existing methods require many sampling steps and depend on masks with costly auxiliary networks. In addition, the absence of large-scale mask-free paired datasets further limits the development of mask-free IVTON. We propose FDM-MFVT, a few-step diffusion model for mask-free IVTON, integrating an Outfit-aware Noise Optimization Module (OANO) and an Instruction-driven Try-on Module (IDT) to enhance efficiency and […]. The OANO module initializes the alignment space with noise using the input image and only needs 6 steps to generate a higher-fidelity try-on image compared to 30 […]. The IDT module uses virtual try-on prompts and efficient adaptation to generate high-quality results from garment and person images alone. We further introduce MFVT, a 30,000-pair mask-free IVTON dataset. Experiments show that FDM-MFVT achieves superior quantitative and qualitative results with fewer inference steps than mask-based and mask-free baseline methods.

[CV-180] D2R2OSR: Degradation-Disentangled Representation for Real-World Omnidirectional Image Super-Resolution

Link: https://arxiv.org/abs/2606.29314
Authors: Hongyu An,Xinfeng Zhang,Xu Fan,Shijie Zhao,Li Zhang,Ruiqin Xiong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:With the growing demand for immersive visual experiences, high-quality omnidirectional images (ODIs) have become increasingly important. However, limitations in imaging devices and transmission bandwidth often lead to low-resolution ODIs, hindering the rendering of fine-grained 360° details, especially in the presence of real-world degradations and geometric distortions. Existing real-world super-resolution (Real-SR) methods are inadequate for ODIs, as their degradation models fail to account for the complex imaging pipeline involving fisheye capture and Equirectangular Projection (ERP), introducing severe aliasing and projection-specific distortions. To address these challenges, we propose D²R²OSR, a Degradation-Disentangled Representation framework for Real-world Omnidirectional image Super-Resolution. D²R²OSR explicitly models degradations arising from both fisheye imaging and ERP projection, guided by two key insights: (1) projection priors play a critical role in shaping real-world degradations, and (2) human perception in immersive environments is inherently viewpoint-centric. Accordingly, we introduce a Perspective Projection Representation (PPR) operating alongside the ERP branch to capture viewpoint-aware features, together with a Degradation-Specific Module (DSM) that jointly models ERP-induced geometric distortions and PPR-specific real-world degradations. Extensive experiments demonstrate that D²R²OSR achieves state-of-the-art performance and produces visually compelling, high-fidelity omnidirectional Real-SR results while maintaining favorable computational efficiency for low-resource deployment.

[CV-181] MirrorPPR: Exemplar-Based Portrait Photo Retouching ECCV2026

Link: https://arxiv.org/abs/2606.29308
Authors: Zhihong Liu,Zheng Li,Jiachun Jin,Siqi Kou,Yitao Jian,Fengpei Yu,Zhijie Deng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2026. 27 pages

Abstract:While text-guided image editing has made remarkable progress, it remains limited in structural portrait retouching. Textual descriptions struggle to convey fine-grained changes to facial features and body proportions. To address this gap, we introduce Exemplar-Based Portrait Photo Retouching, where the model is given an exemplar pair and tasked with inferring and applying the same retouching operations to a new query image. Existing exemplar-based editing methods primarily focus on tasks with pronounced visual transformations. In contrast, structural portrait retouching involves extremely delicate and localized modifications, making accurate extraction and transfer of these edits challenging. To tackle this, we propose MirrorPPR, a novel framework designed to capture and transfer subtle structural retouching operations. Our method uses a Retouching Operation Extractor to capture the subtle differences from the exemplar pair. The extracted representations are then injected into a pre-trained Diffusion Transformer (DiT) through a connector and Low-Rank Adaptation (LoRA) modules. Furthermore, constructing perfectly aligned cross-identity training pairs is severely hindered by operation misalignment. To overcome this, we propose an advanced data self-augmentation paradigm that ensures strictly aligned retouching operations. To alleviate data scarcity and support this novel task, we introduce MirrorPPR47M, a large-scale dataset with over 47 million retouched pairs. By structuring the dataset into simulated and professional subsets, we enable progressive curriculum learning to smoothly optimize the network. Extensive experiments demonstrate that MirrorPPR significantly outperforms existing baselines in both retouching quality and identity preservation. The project page is available at this https URL.

[CV-182] Occlusion-Robust Multi-Object Decoupling for Physics-Based Interaction

Link: https://arxiv.org/abs/2606.29303
Authors: Xin Dong,Wenfeng Deng,Yansong Tang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 5 figures

Abstract:We propose a mask-free method for lossless multi-object 3D reconstruction from sparse and occluded real-world views, enabling physically plausible interaction via Material Point Method (MPM) simulation. Our key insight is that object coupling stems from occlusion and limited viewpoints, which we address by formulating multi-object decoupling as a sparse-view reconstruction problem. Using 3D Gaussian Splatting as base representation, we first obtain coarse instance partitions with a SAM2-trained segmentation field. Rather than relying on masks, we reconstruct fragmented geometries by leveraging a joint Score Distillation Sampling (SDS) process, which integrates reference-view supervision with novel-view synthesis guided by 2D and 3D diffusion priors to enforce both texture fidelity and 3D consistency. Furthermore, we incorporate geometry-aware priors such as intra-object and inter-object similarity to regularize geometric reasoning. Experimental results demonstrate that our method produces complete, simulation-ready 3D objects without requiring manual masks, enabling realistic dynamic interactions on both synthetic and real-world datasets.

[CV-183] Pointer-CAD v2: Plan-Then-Construct CAD Generation with Dimension-Aware Parametric Precision ECCV2026

Link: https://arxiv.org/abs/2606.29301
Authors: Dacheng Qi,Chenyu Wang,Jingwei Xu,Yi Ma,Shenghua Gao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026. Code is available at this https URL

Abstract:Computer-aided design (CAD) plays a fundamental role in modern manufacturing by providing the high precision required for industrial production. Recent large language model based approaches formulate CAD generation as a sequence prediction problem and have achieved promising results. However, existing methods and evaluation protocols primarily emphasize visual similarity, while overlooking precise geometric parameters and correct metric scale. Small numerical deviations that are negligible at the shape level may still violate industrial tolerance requirements, a problem further compounded by current autoregressive paradigms that use command sequence representations and aggressively quantize numerical parameters to ease LLM prediction. In this work, we present Pointer-CAD v2. Compared with v1 (arXiv:2603.04337), this version directly predicts continuous values, bypassing the need for quantized numerical parameters and thereby eliminating quantization errors. Specifically, we propose a unified framework that decouples parameter reasoning from geometric construction through a Plan-Then-Construct paradigm. Our method first produces a structured design plan with explicit metric scale parameters. These parameters are organized into a dictionary and directly referenced during sequence generation via a pointer mechanism, eliminating discretization errors and ensuring dimensionally consistent execution. In addition, we construct a new large-scale dataset with plan-level annotation and introduce three hierarchical geometry accuracy metrics to evaluate parametric fidelity at the vertex, edge, and face levels. Extensive experiments demonstrate that Pointer-CAD v2 consistently outperforms existing baselines and achieves substantial improvements in geometric accuracy, enabling reliable CAD generation for precision-critical engineering applications.

[CV-184] Beyond Trajectory Matching: Reflow with Marginal Distribution Alignment

Link: https://arxiv.org/abs/2606.29287
Authors: Chen Wang,Peiran Yun,Pan Xie,Ke Deng
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion and continuous-flow generative models achieve high-quality generation, and their deterministic sampling can be formulated as solving learned ODE dynamics. However, accurate ODE discretization often requires many steps, making efficient few-step generation a key challenge. Among acceleration strategies, reflow-based distillation simplifies teacher ODE trajectories so that a student model can approximate the teacher transport with fewer steps. We identify a theoretical limitation of this paradigm, namely that trajectory matching can under-determine the distribution induced by the student model. In particular, two student models can attain the same trajectory-matching loss while inducing different endpoint marginal distributions, which may lead to different generation quality. To address this limitation, we introduce a marginal-alignment regularizer that penalizes the discrepancy between the student-induced marginal and the corresponding teacher marginal at the endpoint of each distillation interval. The regularizer is computed by tracking log-density changes along the ODE induced by the student model and evaluating scores from the frozen teacher model, without requiring auxiliary trainable networks or adversarial optimization. The resulting framework applies uniformly to the reflow family, including vanilla reflow and piecewise reflow. We further prove a telescoping total-variation bound showing that local marginal alignment controls the final-time discrepancy between the student-induced and teacher-induced distributions. Experiments on benchmark backbones demonstrate the effectiveness of the proposed method for few-step generation.
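
For readers unfamiliar with "tracking log-density changes along the ODE", the standard tool is the instantaneous change-of-variables formula for continuous flows. The notation below (student velocity field $v_\theta$, interval endpoints $t_0 < t_1$) is assumed for illustration and may differ from the paper's.

```latex
% Sample path driven by the student velocity field v_\theta:
\frac{\mathrm{d}x_t}{\mathrm{d}t} = v_\theta(x_t, t)

% Log-density of the induced marginal along that path:
\frac{\mathrm{d}}{\mathrm{d}t} \log p_t(x_t) = -\nabla \cdot v_\theta(x_t, t)

% Integrated over one distillation interval [t_0, t_1]:
\log p_{t_1}(x_{t_1}) = \log p_{t_0}(x_{t_0}) - \int_{t_0}^{t_1} \nabla \cdot v_\theta(x_t, t)\, \mathrm{d}t
```

This gives the student-induced log-density at the interval endpoint without training an auxiliary density or discriminator network, which is consistent with the abstract's claim that no extra trainable networks are needed.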

[CV-185] ASTAD: Asymmetric Style Transfer for Synthetic-to-Real Adaptation in Autonomous Driving ECCV2026

Link: https://arxiv.org/abs/2606.29286
Authors: Dingyi Yao,Xinqi Zhang,Lihui Peng,Jianming Hu,Danya Yao,Yi Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at the 19th European Conference on Computer Vision (ECCV 2026)

Abstract:Synthetic data mitigates the data scarcity problem in autonomous driving perception. However, the synthetic-to-real gap leads to performance degradation, hindering real-world model generalization. Although current methods leverage diffusion models for photorealistic style transfer to bridge this gap, they critically ignore a practical asymmetry: while synthetic data possesses perfect pixel-level annotations, real-world style reference images generally lack corresponding labels. Consequently, existing methods relying on symmetric semantic guidance suffer from either prohibitive annotation costs or severe semantic misalignment. To address this dilemma, we formally propose a novel task: Asymmetric Style Transfer for Autonomous Driving (ASTAD), which requires semantically consistent transfer using only labeled synthetic content and unlabeled real-world references. We further introduce the ASTModel, a training-free two-stage framework designed to bridge this domain gap under asymmetric constraints. ASTModel first extracts a coarse semantic prior from the unlabeled target, followed by dynamic prior refinement and class-consistent style injection during the denoising process. Extensive experiments demonstrate that ASTModel significantly outperforms existing methods in downstream perception utility and structural fidelity, while offering a 3.2× inference speedup. This work aligns synthetic-to-real adaptation with practical constraints, holding the potential to accelerate the scalable deployment of robust autonomous driving systems. Code: this https URL.

[CV-186] ScaleErasure: Inference-Time Minimal Intervention for Precise Concept Erasure in Next-Scale Autoregressive Image Generation ICML2026

Link: https://arxiv.org/abs/2606.29282
Authors: Cong Wang,Haiyu Wu,Zhiwei Jiang,Zifeng Cheng,Fei Shen,Yafeng Yin,Qing Gu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2026

Click to view abstract

Abstract:Concept erasure aims to prevent image generative models from producing unsafe content while preserving their general generative capability. Meanwhile, next-scale autoregressive (AR) image generation has recently emerged as a new generative paradigm characterized by next-scale prediction, for which concept erasure remains largely unexplored. In this paradigm, semantic information is highly compressed at early scales, leading to severe entanglement between unsafe and unrelated semantics. In this paper, we propose ScaleErasure, an inference-time concept erasure method that performs minimal intervention. ScaleErasure precisely selects and guides predicted logits that are most relevant to the unsafe concept, thereby enabling effective erasure under severe semantic entanglement. Specifically, ScaleErasure performs two additional forward passes conditioned on the unsafe concept and the corresponding safe concept, and leverages their outputs to guide the target logits away from unsafe concepts toward safe concepts. To enable precise and minimal intervention, logits selection and guidance are conducted across three dimensions: scales, tokens, and bit channels. Experiments demonstrate that ScaleErasure outperforms adapted baselines in the next-scale AR paradigm, achieving more precise concept erasure while largely preserving general generative capability. The code is available at this https URL.
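
The guidance step described in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the update rule `target + weight * (safe - unsafe)`, the selection of positions by the largest unsafe/safe gap, and the names `guide_logits`, `weight`, and `top_k` are all assumptions made for illustration, and the sketch works on a flat list rather than on scales, tokens, and bit channels.

```python
from typing import List


def guide_logits(target: List[float], unsafe: List[float], safe: List[float],
                 weight: float, top_k: int) -> List[float]:
    """Shift only the top_k most concept-sensitive logits toward the safe prediction.

    Assumption: "most relevant to the unsafe concept" is approximated by the
    absolute gap between the unsafe-conditioned and safe-conditioned logits.
    """
    gaps = [abs(u - s) for u, s in zip(unsafe, safe)]
    ranked = sorted(range(len(target)), key=lambda i: gaps[i], reverse=True)
    selected = set(ranked[:top_k])
    return [t + weight * (s - u) if i in selected else t
            for i, (t, u, s) in enumerate(zip(target, unsafe, safe))]


# Hypothetical logits from the three forward passes.
target = [1.0, 2.0, 0.5, -1.0]
unsafe = [1.1, 4.0, 0.4, -1.0]
safe = [1.0, 1.0, 0.5, -0.9]
guided = guide_logits(target, unsafe, safe, weight=0.5, top_k=1)
```

Only the position with the largest gap is modified here, which mirrors the "minimal intervention" idea: every other logit passes through unchanged.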

[CV-187] Enhancing Part-Level Point Grounding for Any Open-Source MLLMs CVPR2026

Link: https://arxiv.org/abs/2606.29267
Authors: Jin-Cheng Jhang,Fu-En Wang,Xin Yang,Nan Qiao,Lu Xia,Min Sun,Cheng-Hao Kuo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026

Click to view abstract

Abstract:Visual grounding aims to associate free-form textual queries with specific regions in an image. While recent Multimodal Large Language Models (MLLMs) have demonstrated promising capabilities in this domain, they primarily excel at object-level grounding and often struggle with part-level grounding, an essential requirement for fine-grained tasks such as robotic manipulation. In this work, we introduce a general approach that equips any open-source MLLM with accurate 2D part-level point grounding, offering a more direct alternative to conventional grounding representations. Our method leverages the attention mechanisms inherently present in MLLMs. By synthesizing text-conditioned, grounding-aware queries within intermediate layers via the proposed Q-Synth Module, we capture target-relevant attention patterns and refine them with a lightweight Attention-to-Point Decoder, which converts these patterns into a point-centric heatmap for final prediction. Notably, all original MLLM parameters are frozen, ensuring full preservation of their pre-trained capabilities. Experiments show that our design consistently improves part-level grounding accuracy across datasets and can be seamlessly integrated into any open-source MLLM.

[CV-188] Nonlinear mixture model motivated subspace clustering

Link: https://arxiv.org/abs/2606.29261
Authors: Ivica Kopriva
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 1 table, conference

Click to view abstract

Abstract:We derive the linear union-of-subspaces (UoS) model for subspace clustering (SC) from the nonlinear mixture model (NMM) used in blind source separation (BSS) to represent a D-dimensional observation vector as an unknown multivariate nonlinear mapping of C latent variables. Assuming the mapping is differentiable up to an unknown order K, we approximate NMM by a K-th order Taylor expansion, yielding a model equivalent to the linear UoS framework underlying SC. This establishes that: (i) the smoothness order K corresponds to the unknown subspace dimension d; (ii) KC equals the number of anchors; and (iii) the sparsity of the representation vector equals K (i.e., d). These relationships enable estimation of bounds on the subspace dimension, which is validated on six benchmark datasets using five established SC algorithms. The established theoretical results are important for post-processing of self-representation matrices estimated by SC algorithms.

[CV-189] Confidence-feedback-weighted graph matching network: online-offline laser-induced damage site matching under complex interference

Link: https://arxiv.org/abs/2606.29255
Authors: Yueyue Han,Guanhua Chen,Hangcheng Dong,Kang Zhang,Fengdong Chen,Zhitao Peng,Fa Zeng,Qihua Zhu,Guodong Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 13 pages, 12 figures, 2 tables

Click to view abstract

Abstract:Online inspection images of final optics in high-power laser facilities contain pseudo-damage sites that closely resemble true damage sites. Determining the authenticity of online-detected sites is therefore difficult and requires accurate matching to offline ground-truth sites. However, this matching remains highly challenging due to limited match-discriminative features, local geometric distortions, and numerous distractor sites. Existing matching models mainly suppress distractors implicitly through loss-function supervision. We propose a confidence-feedback-weighted graph matching network that requires only damage-site centroid coordinates as input. It estimates node matchability confidence from each round of matching scores and feeds it back as a reliability weight to guide subsequent edge-feature aggregation, thereby suppressing distractor propagation and enhancing cross-graph discriminability. Within this framework, a geometric consistency constraint calibrates spurious high-confidence matchability estimates, while a hard-example mining loss improves discrimination between structurally similar sites. Experiments on our Complex-Scene dataset show that the proposed method achieves a matching F1-score of 96.36% with robust and efficient performance.

[CV-190] When Does Synthetic CT Transfer? A Label-Free Donor/Host Diagnostic for Medical Vision-Language Model Routing on Real Lung CT

Link: https://arxiv.org/abs/2606.29232
Authors: Fakrul Islam Tushar
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 7 figures, 1 table, 1 supplement

Click to view abstract

Abstract:A synthetic measurement of model competence is useful only if it survives the move to real data, yet the real labels that would verify it are exactly what medical imaging lacks. We ask whether transfer can be predicted in advance, label-free, and answer with a mechanism: on synthetic digital twins, competence that is donor-driven (a property of the transplanted nodule) survives the synthetic-to-real change of host, while host-driven competence (a property of the surrounding anatomy) need not. We test this on three lung CT vision-language tasks chosen to span that axis, across five public VLMs, four guidance conditions, and seven real datasets. The prediction holds in every case: presence and size orderings transfer (R² = 0.96), lobe does not; the split survives leave-source-out calibration, and the diagnostic names that boundary before any real label. TrialCouncil, a training-free council calibrated only on synthetic CT, confirms it by matching the best fixed model exactly where transfer is predicted. The contribution is not the router but the finding that transfer itself is predictable, label-free, from synthetic data alone.

[CV-191] Again-Pose: Anchor-Guided Adaptive Inter-Frame Motion Cues Propagating for High-quality Human Pose Reconstruction

Link: https://arxiv.org/abs/2606.29230
Authors: Shuaikang Zhu,Yiding Sun,Yang Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Reconstructing continuous 3D human poses from unconstrained videos is challenging, especially in extreme motion scenarios involving severe motion blur and occlusion. Current state-of-the-art methods typically rely on implicit temporal attention to aggregate features across frames. However, under severe visual degradation, input features often suffer from collapse, rendering them indistinguishable from noise. In such cases, implicit aggregation fails to distinguish valid signals, leading to catastrophic reconstruction errors. To address this robustness gap, we propose a simple yet effective framework called Anchor-guided adaptive inter-frame motion cues propagating (Again-Pose), reformulating pose estimation in degraded frames as a motion-guided recovery task. Instead of blindly smoothing features, we explicitly identify high-quality Anchor Frames based on feature saliency and propagate reliable kinematic cues to “inpaint” the poses of degraded intermediate frames. Specifically, a Dual-path Motion-aware Module captures fine-grained inter-frame dynamics, while a Difference-weighted Fusion Module adaptively propagates these cues to suppress drift. Extensive experiments on standard benchmarks (Human3.6M, 3DPW, PoseTrack) and the challenging FineDiving dataset demonstrate that Again-Pose significantly outperforms state-of-the-art methods in robustness and stability, effectively recovering plausible poses where other methods fail.

[CV-192] Zero-Gated Language-conditioned Human Motion Prediction

Link: https://arxiv.org/abs/2606.29208
Authors: Guanhui Qiao,Lu Zhou,Ding Jiang,Jinqiao Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 1 figure, 5 tables

Click to view abstract

Abstract:Pose histories provide the core kinematic evidence for 3D human motion prediction, but they lack explicit high-level semantic guidance. This paper introduces ZGL, a lightweight language-conditioned predictor that uses captions of the observed motion as a semantic prior while preserving a strong motion backbone as the main source of dynamics. We render only the observed poses, generate a one-sentence description with a vision-language model, encode the caption with a frozen CLIP-L text tower, and project it into a small set of conditioning tokens. These tokens are injected into a DCT-based spatial-temporal Transformer by compact cross-attention adapters with zero gates: each adapter output is multiplied by a learnable gate initialized to zero, so the full network is numerically identical to the pose-only baseline at initialization and can learn to use language only when it reduces prediction error. On Human3.6M, ZGL improves overall MPJPE over representative motion-prediction baselines in our comparison. Results on CMUMocap further show that compact caption conditioning transfers to a second benchmark and provides a practical semantic cue for 3D human motion prediction.
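
The zero-gate idea in this abstract is easy to check numerically. The sketch below is a minimal illustration under stated assumptions, not the ZGL code: the adapter is reduced to a precomputed vector, the gate is a single scalar, and the names `gated_residual`, `pose_features`, and `language_update` are hypothetical.

```python
from typing import List


def gated_residual(backbone: List[float], adapter: List[float], gate: float) -> List[float]:
    """Add a gated adapter output to the backbone features, element-wise."""
    return [b + gate * a for b, a in zip(backbone, adapter)]


pose_features = [0.5, -1.2, 3.0]     # hypothetical backbone activations
language_update = [0.9, 0.1, -0.4]   # hypothetical cross-attention adapter output

# Gate initialized to zero: the output equals the pose-only features.
at_init = gated_residual(pose_features, language_update, gate=0.0)

# A gate that has moved away from zero lets the language signal through.
after_training = gated_residual(pose_features, language_update, gate=0.5)
```

Because the gate multiplies the whole adapter output, the language branch contributes nothing until training moves the gate away from zero, which is why the model starts out identical to the pose-only baseline.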

[CV-193] DTI: Dynamic Trajectory Initialization for Generative Face Video Super-Resolution ECCV2026

Link: https://arxiv.org/abs/2606.29198
Authors: Yingwei Tang,Chen Yan,Wendi Liu,Qiang Hu,Xiaoyun Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper is accepted by ECCV 2026

Click to view abstract

Abstract:Generative Face Video Super-Resolution (GFVSR) is the most perceptually powerful family of Face Video Super-Resolution (FVSR) methods, and existing work mainly exploits the generative prior of pretrained diffusion models. However, because these methods treat restoration as full generation, they suffer from fixed sampling and expensive inference costs unless they rely on large-scale auxiliary training. Furthermore, an excessive pursuit of generic perceptual metrics often results in low fidelity. To address these issues, we present the Dynamic Trajectory Initialization (DTI) paradigm for GFVSR, which reformulates GFVSR as input-driven directional restoration. With a novel enhancement-and-injection conditioning mechanism for the pretrained DiT backbone, the fidelity of our model is significantly improved without compromising perceptual quality. To dynamically set the starting sampling point, we propose a Discriminative Guide (DG) trained via objective Signal-to-Noise Ratio (SNR) alignment. With only minor model adaptation and fine-tuning, our method achieves SOTA overall performance across diverse metrics and benchmarks. We also analyze the relationship between actual comprehensive quality and common metrics, which demonstrates the perception-distortion trade-off and shows that LPIPS is the most convincing metric in our case.

[CV-194] Anomaly Factory 3D: A Modular Framework for Diverse Pseudo-Anomaly Synthesis in Unsupervised 3D Anomaly Detection

Link: https://arxiv.org/abs/2606.29181
Authors: Ali Balapour,Faraz Hach
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Detecting and localizing defects in 3D point clouds is challenging because abnormal samples are scarce and diverse, while training is often limited to normal data. We propose Anomaly Factory 3D (AF3AD), a modular framework that synthesizes diverse pseudo-anomalies from normal point clouds to expand the training data for unsupervised 3D anomaly detection methods that rely on pseudo-anomalies. AF3AD uses a center-conditioned parametric deformation model defined in local PCA frames, with kernel-controlled spatial falloff, anisotropy, directional gating, and normal/tangential displacement fields, enabling a broad set of geometric defect presets. We demonstrate its ease-of-use and effectiveness by integrating AF3AD with an offset-prediction detector and a reconstruction-based anomaly detection method, showing that AF3AD transfers across detection paradigms. Experiments on AnomalyShapeNet and Real3D-AD show consistent improvements in object- and point-level detection and localization, supported by ablations on preset groups and robustness under noise. AF3AD is designed as a standalone synthesis tool to facilitate adoption across different 3D anomaly detection paradigms. Code is available at this http URL.

[CV-195] Articulating then Matching: Zero-Shot Shape Matching for Uncurated Data

Link: https://arxiv.org/abs/2606.29167
Authors: Qilong Liu,Qinfeng Xiao,Chenyuan Yi,Liying Zhang,Kit-lun Yick
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Finding dense correspondences between 3D shapes is a fundamental yet unresolved challenge, especially in real-world environments. These environments present severe challenges, including the lack of time and sufficient samples for training, the prevalence of uncurated, extremely high-resolution data with topological distortions, and the need to handle diverse 3D representations. In this paper, we present ATM, a zero-shot framework that requires no correspondence-specific training and robustly addresses these issues at once through an articulate-then-match paradigm. Rather than relying on intrinsic geometric properties, we leverage powerful pretrained vision foundation models and parametric shape priors to estimate parametric shape models from multi-view renderings, and systematically ground these estimations via multi-view geometric consistency. By mapping diverse inputs into a shared canonical parametric space, we inherently establish robust coarse correspondences that bypass topological noise, which are then refined into precise dense mappings via spectral refinement. Operating purely on test-time optimized parametric reconstructions, ATM requires no correspondence training data, is naturally immune to connectivity artifacts, and seamlessly handles diverse 3D modalities, including meshes, point clouds, and 3D Gaussians. Extensive experiments demonstrate that our method achieves strong results on non-isometric benchmarks (average geodesic errors of 2.4 on TOPKIDS and 3.8 on SMAL), reducing errors by 73% and 37% respectively compared to the baseline URSSM. Furthermore, it exhibits unprecedented robustness on in-the-wild raw scans of up to 200k vertices per shape while maintaining near-constant computation time and consistently superior accuracy.

[CV-196] Spatially Localized Image Degradation Embeddings for Image Quality Assessment

Link: https://arxiv.org/abs/2606.29162
Authors: Krishna Srikar Durbha,Hassene Tmar,Ping-Hao Wu,Ioannis Katsavounidis,Alan C. Bovik
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Under Review

Click to view abstract

Abstract:Self-supervised learning (SSL) currently drives state-of-the-art performance in no-reference image quality assessment (NR-IQA). However, standard SSL pipelines uniformly apply synthetic distortions across the entire image field, which can limit their sensitivity to spatially localized and co-occurring degradations encountered in real-world content. In this work, we empirically expose this representational blind spot across existing state-of-the-art encoders, demonstrating their reduced sensitivity to spatially bounded image degradations. To bridge this gap, we introduce Spatially Localized Image Degradation Embeddings for Image Quality Assessment (SLIDE-IQA). SLIDE-IQA employs a dual-branch Vision Transformer framework that injects spatially bounded degradations into a contrastive pretraining objective. To handle the spatial complexity of these degradations, we introduce a Threshold-Bounded Exclusion Mechanism, a representational design choice that resolves structural conflicts arising from spatially localized distortions to ensure the latent space respects both degradation type and spatial scale. Finally, we show that SLIDE-IQA’s synthetic-only pretraining significantly improves sensitivity to localized distortions, while achieving competitive performance on NR-IQA benchmarks against existing SSL NR-IQA models.

[CV-197] GPC: Large-Scale Generative Pretraining for Transferable Motor Control

Link: https://arxiv.org/abs/2606.29148
Authors: Yi Shi,Yifeng Jiang,Chen Tessler,Xue Bin Peng
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Developing controllers capable of completing a wide range of tasks in a natural and life-like manner is a key challenge in enabling practical applications of physics-based character animation. In this work, we introduce Generative Pretrained Controllers (GPC), which leverage tokenization and next-token modeling to create general-purpose, reusable generative controllers from large-scale motion datasets. Our framework utilizes end-to-end reinforcement learning to jointly optimize a “motion vocabulary”, modeled via Finite Scalar Quantization (FSQ), along with a corresponding control policy that can map the discrete codes to physics-based controls. After the “codebook” has been learned, the underlying structure of this large vocabulary is modeled by training a GPT-style autoregressive transformer, leading to a powerful generative controller that generates controls for a physically simulated character by performing next-token prediction. Once the generative controller has been trained, we propose a suite of adaptation techniques for finetuning the controller for new downstream tasks. Our proposed framework greatly simplifies the training process compared to previous tokenized methods, and achieves a 99.98% success rate in reproducing a vast corpus of motion clips. The generative controller exhibits a variety of natural emergent behaviors, such as responsive behaviors to perturbations and recovery behaviors after falling. This results in highly robust general purpose controllers for a variety of downstream applications.
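
For readers unfamiliar with Finite Scalar Quantization, the sketch below shows the general technique in plain Python. It is an illustration under assumptions, not the GPC implementation: the level counts, the latent values, and the helper names `fsq_quantize` and `code_to_token` are hypothetical, only odd level counts are handled, and the straight-through gradient that FSQ uses during training is omitted.

```python
import math
from typing import List, Sequence


def fsq_quantize(z: Sequence[float], levels: Sequence[int]) -> List[int]:
    """Bound each channel with tanh, then round it to one of levels[i] integers."""
    codes = []
    for value, n_levels in zip(z, levels):
        if n_levels % 2 == 0:
            raise ValueError("this sketch only handles odd level counts")
        half = n_levels // 2
        codes.append(round(half * math.tanh(value)))
    return codes


def code_to_token(codes: Sequence[int], levels: Sequence[int]) -> int:
    """Flatten per-channel integer codes into one token id (mixed-radix encoding)."""
    token = 0
    for code, n_levels in zip(codes, levels):
        token = token * n_levels + (code + n_levels // 2)
    return token


levels = [5, 5, 3]            # hypothetical: 5 * 5 * 3 = 75 possible tokens
latent = [0.1, -2.0, 4.0]     # hypothetical encoder output for one time step
codes = fsq_quantize(latent, levels)
token = code_to_token(codes, levels)
```

The resulting token ids form an implicit codebook with no learned embedding table, and a sequence of them is what a GPT-style model would predict one step at a time.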

[CV-198] CMTFormer: Marrying Transformer with Hierarchical Information Interaction for RGB-Event Object Detection

Link: https://arxiv.org/abs/2606.29136
Authors: Yu Li,Yuenan Hou,Yingmei Wei,Jiangming Chen,Yanming Guo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages

Click to view abstract

Abstract:Event cameras capture sparse brightness changes with high temporal resolution and high dynamic range, compensating for the deficiencies of the conventional RGB frames. However, previous multi-modal fusion techniques typically fail to handle the inherent heterogeneity between RGB frames and event streams, thus easily leading to noise amplification or redundant feature integration during cross-modal fusion. In this paper, we propose a Cross-Modal information inTeraction transFormer, coined as CMTFormer, which hierarchically integrates RGB and event information to achieve efficient and stable multimodal collaboration. Specifically, we design a shallow-to-deep information interaction scheme. In the shallow stage, we present the Shallow Alignment Module (SAM) to achieve an efficient fusion of RGB and event low-level features, which mitigates attribute disparities and prevents noisy information. In the middle stage, we devise the Cross-modal Enhancement Module (CEM) that utilizes texture and edge information to produce mutually reinforced middle-level features. In the deep stage, we present the Learnable Deep Fusion Module (LDFM) which performs high-level information aggregation through learnable weights, thus enabling the network to adaptively fuse RGB and event clues. A Spatial Prior Module is further designed to utilize global spatial information to enhance localization accuracy. Extensive experiments are conducted on two prevalent event-based object detection benchmarks, i.e., DSEC-Detection and PKU-DAVIS-SOD. Our CMTFormer consistently surpasses the detection counterparts in both uni-modal and multi-modal settings, strongly demonstrating the effectiveness of our paradigm. Codes will be available upon publication.

[CV-199] Beyond Backscatter: AlphaEarth Land-Cover Priors for Rapid SAR Flood Segmentation Across Foundation Backbones

Link: https://arxiv.org/abs/2606.29134
Authors: Sanjay Thasma,Yu-Hsuan Ho,Ali Mostafavi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Rapid flood mapping is critical for emergency response, yet optical imagery is often unusable during major flooding and single-temporal SAR is ambiguous, since new inundation, permanent water, and other smooth surfaces produce similar backscatter. This study evaluates whether stable land-context priors can improve post-event SAR flood segmentation when a registered, seasonally matched pre-event acquisition is unavailable. Using the CONUS (Continental United States) subset of ImpactMesh-Flood, we compare four backbones spanning distinct pretraining regimes (a from-scratch CNN UNet, an ImageNet-pretrained UNet, the SAR-pretrained TerraMind Vision Transformer, and the optical-satellite-pretrained DINOv3 Vision Transformer) in SAR-only, SAR+DEM, and SAR+AlphaEarth configurations under an identical fusion design, training protocol, and event-stratified split. Models are selected on a validation flood event and evaluated separately on two held-out events, Hurricane Florence and the Louisiana floods, with three-seed reporting for auxiliary configurations. Both auxiliary priors improve over the observed SAR-only baselines across all backbones and test events. AlphaEarth exceeds DEM on the harder Florence event for every backbone and achieves the best Florence IoU, while DEM is competitive on Louisiana and produces the best result there. The seed analysis reveals a trade-off: DEM is more stable across initializations, whereas AlphaEarth offers higher peak performance and higher recall on the harder event. Cross-event differences track flood-class prevalence and similarity to the training distribution, underscoring the need for per-event evaluation. We reframe single-temporal SAR flood segmentation as an alignment between radar observations and stable land-surface priors, where learned and physical context offer complementary pathways to more reliable rapid flood mapping.

[CV-200] A Deep Multiscale Neural Network for Accurate Neurological Disorder Detection from MRI Scans and Real-Time Web Deployment

Link: https://arxiv.org/abs/2606.29106
Authors: Ali Fatahi,Hoda Zamani,Mohammad H. Nadimi-Shahraki
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Neurological disorders involve diverse pathologies of the brain and nervous system, making early and accurate detection essential. While many deep CNNs have been developed for MRI-based classification of neurological disorders, most are optimized for binary tasks and often fail to capture the multi-class features needed to distinguish subtle anatomical differences across conditions. This study proposes the Enhanced Neurological Disorder Detection Network (End-Net) for multi-class MRI classification of neurological disorders. End-Net includes 24 convolutional layers, beginning with convolutional blocks followed by 21 optimized inception modules. These modules extract multiscale features via parallel 1×1, 3×3, and factorized 5×5 convolutional branches, along with max pooling, enabling the model to capture complementary texture, edge, shape, and contextual information. A global average pooling head, compact fully connected classifier, and dropout reduce parameters, limit overfitting, and improve robustness. End-Net was evaluated on the Multi-Class Neurological Disorder dataset, comprising MRI scans from patients with Alzheimer’s disease, brain tumors, multiple sclerosis, and healthy controls. Severe class imbalance was addressed by augmenting minority classes with WGAN-GP and randomly undersampling the majority class. The results show that End-Net outperforms existing architectures in both accuracy and generalization. The model is also integrated into an online system for real-time web-based inference and accessibility.

[CV-201] BTI-Net: Bidirectional Decoder-Level Task Interaction via Uncertainty-Aware Gating for Multi-Task Medical Image Analysis

Link: https://arxiv.org/abs/2606.29102
Authors: Abdullah Al Shafi,Md Kawsar Mahmud Khan Zunayed,Safin Ahmmed,Sk Imran Hossain,Engelbert Mephu Nguifo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 5 figures, 6 tables; supplementary material included (8 pages, 19 figures, 18 tables)

Click to view abstract

Abstract:Jointly learning to segment and classify medical images demands cross-task synergy, yet encoder-sharing architectures limit decoder reconstruction to task-private representations, permanently discarding the boundary cues and semantic priors each branch could supply to the other. This work introduces BTI-Net, which establishes bidirectional communication at every decoder level through two parallel pathways via Task Interaction Modules (TIM). Spatial boundary context is gated into the classification branch, while global semantic priors multiplicatively modulate the decoder, with refined features propagating progressively from coarse semantics to fine boundary detail across all four decoder resolutions. Since cross-task interaction is not equally reliable for every input, Uncertainty Proxy Attention (UPA) gates each TIM output per instance and per level using three signals that capture cross-task alignment, scene complexity, and prediction confidence, without external annotations or additional inference passes. Experiments on three medical benchmarks spanning ultrasound, dermoscopy, and brain MRI demonstrate consistent improvements in segmentation IoU and classification accuracy over both encoder-sharing and decoder-interaction baselines. Ablation confirms adaptive gating contributes +2.36 IoU over fixed bidirectional interaction, and classification accuracy improves by up to +2.26 points over the strongest multi-task baseline. UPA’s uncertainty proxies serve as reliable single-pass task-failure signals without the overhead of stochastic sampling. Code: this https URL

[CV-202] TrafficAlign: Aligning Large Language Models for Traffic Scenario Generation CVPR2026

Link: https://arxiv.org/abs/2606.29097
Authors: Zhi Tu,Liangkun Niu,Tianyi Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026

Click to view abstract

Abstract:Recent research has investigated the use of large language models (LLMs) to generate traffic scenarios for autonomous driving. However, pretrained LLMs often fail to align with real-world traffic distributions. In this work, we present TrafficAlign, an automated framework that synthesizes traffic scenarios based on real-world driving videos, performs data validation, and aligns LLMs with the synthesized scenarios. The evaluation shows that traffic scenarios generated by TrafficAlign are highly effective, revealing up to 10.8% more collisions on average across three autonomous driving models than state-of-the-art methods. Furthermore, fine-tuning these driving models with TrafficAlign-generated scenarios significantly reduced collision rates by 36.1% compared with the original models. A qualitative study using traffic datasets from six geographically diverse regions shows that TrafficAlign-generated scenarios exhibit strong alignment with corresponding traffic distributions in these regions.

[CV-203] HorizonRelight: Relighting Long-horizon Videos Consistently via Diffusion Transformers

Link: https://arxiv.org/abs/2606.29095
Authors: Jing Yang,Mayoore Jaiswal,Zian Wang,Steven Zeng,Rochelle Pereira,Yajie Zhao,Jianyuan Min
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: this https URL

Click to view abstract

Abstract:Diffusion-based video relighting enables controllable relighting from a single input video, but modern video diffusion backbones are trained on short clips and applied to long-horizon videos through chunked sliding-window inference, often causing temporal discontinuities at chunk boundaries. We address this by reframing long-horizon relighting as temporally conditioned latent domain translation. Our framework enforces cross-chunk continuity by propagating target-domain latents across boundaries and makes this behavior learnable using masked target-domain self-conditioning, training the model to continue from temporally masked propagated context. We further introduce warm-start prompting with a relit prompt anchor from a controllable generative model, which establishes the initial target-domain state and creates a general interface for prompt-based relighting. Experiments on in-the-wild long-horizon videos show markedly improved temporal consistency, with chunk-boundary artifacts largely reduced and unwanted appearance changes across chunks greatly suppressed.
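
The control flow of chunked inference with context carried across boundaries can be sketched as follows. This is only a schematic under assumptions: frames are stood in for by floats, the model is replaced by a toy callable, and the names `relight_in_chunks`, `relight_chunk`, `context`, and `toy_relight` are hypothetical rather than taken from HorizonRelight.

```python
from typing import Callable, List, Optional, Sequence

ChunkFn = Callable[[Sequence[float], Optional[List[float]]], List[float]]


def relight_in_chunks(frames: Sequence[float], chunk_size: int, context: int,
                      relight_chunk: ChunkFn) -> List[float]:
    """Process frames chunk by chunk, handing each chunk the tail of the previous output."""
    if chunk_size < 1 or context < 1:
        raise ValueError("chunk_size and context must be at least 1")
    outputs: List[float] = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        # Already-relit outputs act as the propagated target-domain context.
        previous = outputs[-context:] if outputs else None
        outputs.extend(relight_chunk(chunk, previous))
    return outputs


seen_context: List[Optional[List[float]]] = []


def toy_relight(chunk: Sequence[float], previous: Optional[List[float]]) -> List[float]:
    """Stand-in for the model: records its context and shifts every frame by 1."""
    seen_context.append(previous)
    return [x + 1.0 for x in chunk]


result = relight_in_chunks([0.0, 1.0, 2.0, 3.0, 4.0], chunk_size=2, context=1,
                           relight_chunk=toy_relight)
```

The first chunk receives no context, which is the gap that a warm-start anchor would fill in the setting the abstract describes.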

[CV-204] From Fog Chamber to Aircraft Window: Pixel-Registered Imaging and Synthetic Fine-Tuning Enable Cross-Domain Defogging

Link: https://arxiv.org/abs/2606.29093
Authors: Alexander Ingold,Sabina D. Menon,Manya Yellepeddy,Alec Ikei,John D. Hodges,Jordan Baker,Syed N. Qadri,Rajesh Menon
Categories: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
Comments:

Click to view abstract

Abstract:A deep defogging pipeline pretrained on controlled laboratory fog and fine-tuned with domain-randomized synthetic fog applied to clear outdoor scenes generalizes across a graded sequence of out-of-distribution settings with no target-domain training, from chamber-free free-flowing fog to iPhone video recorded through an aircraft cabin window in flight, an entirely unseen sensor, scene, and optical path. This directly addresses an open transfer limitation reported for real-world binocular defogging. Two design choices support the transfer. First, a single-camera fog imager photographs a flat-panel display through an artificial-fog enclosure with a fixed 114 mm scattering path, producing 5,495 pixel-aligned foggy/clear pairs. Exact registration permits a paired Laplacian ratio that predicts per-image restoration quality far better than single-image proxies (Spearman ρ = 0.632 versus 0.399) and supports pixel-exact L1 reconstruction training that avoids adversarial hallucination. Second, the fog-chamber checkpoint is fine-tuned on Mapillary Vistas crops overlaid with on-the-fly randomized synthetic fog spanning a broad range of strengths, spatial variations, airlights, and noise conditions. On a 552-image held-out split, a uniform comparison of 30 restoration backbones places NAFNet at the top (24.33 dB / 0.7912 SSIM), with a compact alternative within 1.29 dB at 3% of the parameter count, and a ResNet-50 classifier confirms that the restoration preserves semantic content rather than only pixel-level structure. On unpaired aircraft-window video, NIQE decreases from a mean of 6.22 to 4.97 after fine-tuning, with temporally stable output across full-motion sequences. The same backbone, under paired supervision, also reaches 20.71 dB / 0.683 SSIM on a non-overlapping O-HAZE/NH-HAZE split (a transferability check rather than a competitive ranking).
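
A common way to overlay randomized synthetic fog on a clear image is the standard atmospheric scattering model, I = J·t + A·(1 − t) with transmission t = exp(−β·d). The abstract does not state which fog model the authors use, so the sketch below is an assumption throughout: the parameter ranges, the uniform depth, and the names `add_synthetic_fog` and `random_fog_params` are hypothetical, and spatial variation and sensor noise are left out.

```python
import math
import random
from typing import Dict, List


def add_synthetic_fog(clear: List[List[float]], beta: float, airlight: float,
                      depth: float = 1.0) -> List[List[float]]:
    """Blend a grayscale image with values in [0, 1] toward a constant airlight."""
    transmission = math.exp(-beta * depth)
    return [[pixel * transmission + airlight * (1.0 - transmission) for pixel in row]
            for row in clear]


def random_fog_params(rng: random.Random) -> Dict[str, float]:
    """Draw a fog strength and an airlight from hypothetical ranges."""
    return {"beta": rng.uniform(0.2, 3.0), "airlight": rng.uniform(0.7, 1.0)}


rng = random.Random(0)
clear_image = [[0.0, 0.5], [1.0, 0.25]]
params = random_fog_params(rng)
foggy_image = add_synthetic_fog(clear_image, **params)
```

Drawing new parameters for every training crop is what makes the fog "on-the-fly" and domain-randomized: the network never sees the same fog strength and airlight combination twice.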

[CV-205] Flow Matching in Feature Space for Stochastic World Modeling

Link: https://arxiv.org/abs/2606.29059
Authors: Francois Porcher,Nicolas Carion,Karteek Alahari,Shizhe Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 24 pages, 18 figures, 6 tables

Abstract:World modeling requires forecasting uncertain futures while preserving information useful for downstream perception. Existing visual world models often struggle to satisfy both goals: VAE-based stochastic models operate in low-dimensional reconstruction latents, which can limit perception performance, while deterministic predictors using strong pretrained features collapse multimodal futures into a single blurry mean. In this work, we propose FlowWM, a stochastic world model that performs flow matching directly within pretrained feature space (e.g., DINOv3). This is challenging because pretrained features are substantially high-dimensional, making standard diffusion recipes suboptimal. To address this, we investigate the design choices needed for feature-space flow matching and introduce a differentiable one-step projection mechanism that enables efficient training with temporal consistency and task-driven objectives. We evaluate FlowWM on two benchmarks: a synthetic benchmark for systematic evaluation of accuracy and diversity, and a real-world benchmark FuturePerception. FlowWM improves perception performance, mode coverage, and horizon robustness, validating our proposed design for stochastic world modeling in high-dimensional feature spaces. 
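
For readers unfamiliar with flow matching, the basic training objective can be sketched as follows: sample noise, move along a straight path toward the target feature vector, and regress the path's velocity. This is a generic sketch of the standard linear-path formulation, not FlowWM's recipe; the abstract does not give its design choices, and all names here are hypothetical.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to target feature x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; it is constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x1, rng):
    """One-sample flow matching loss for a target feature vector x1.

    predict_velocity(xt, t) stands in for the learned network.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    t = rng.random()                        # time in [0, 1)
    xt = interpolate(x0, x1, t)
    v_pred = predict_velocity(xt, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x1)
```

In a real model `x1` would be a pretrained feature map (for example a DINOv3 embedding of a future frame) and `predict_velocity` a network conditioned on past frames.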

[CV-206] Adaptive Spectrum-Aware Feature Disentangled Network for Small Object Detection ECCV2026

Link: https://arxiv.org/abs/2606.29029
Authors: Yang Guo,Zihan Yang,Feifei Kou,Yulan Hu,Ran Zhang,Siyuan Yao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 7 figures. Accepted at ECCV 2026

Abstract:Small Object Detection (SOD) is a fundamental yet challenging problem in computer vision due to its limited spatial resolution and weak visual cues. Although recent approaches have achieved remarkable advances, the background distractors in different frequency spectra still degrade the performance. In this paper, we propose a novel small object detection framework termed SFDNet, which is capable of detecting small objects via efficient spectrum-aware feature disentanglement. Specifically, we propose an Adaptive Spectrum Disentanglement (ASD) module that decomposes backbone features into multiple complementary spectral components, aiming to construct discriminative object-relevant representations by discarding the background distractors for each component. Afterwards, to strengthen the semantic consistency of the similar objects in the same class, we propose a Class-Wise Prototype Distillation (CPD) procedure, which establishes class prototypes for the object instances and enforces the compact representation by efficient prototype distillation. Extensive experiments on multiple challenging benchmarks show that SFDNet outperforms existing state-of-the-art methods by a large margin. Our code is available at this https URL.

[CV-207] Efficient Spatio-Temporal Grounding with Multimodal Large Models via Second-Level Tracking and RL Verification

Link: https://arxiv.org/abs/2606.29023
Authors: Tianshu Zhang,Yan Wang,Ji Qi,Lijie Wen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Spatio-temporal grounding in long videos requires precise temporal localization and robust object tracking conditioned on natural-language queries. While recent vision-language models (VLMs) show strong reasoning ability, directly applying frame-by-frame inference to long sequences is computationally expensive and unstable. We propose a practical pipeline that shifts from frame-level to second-level tracking and performs cross-second smoothing to preserve continuity while reducing sequence length. To improve reasoning supervision, we synthesize chain-of-thought style trajectories using advanced multimodal models for temporal localization and target selection, and replace generated spatio-temporal coordinates with ground-truth annotations to avoid noisy supervision. We further optimize the policy with reinforcement learning using a verifier based on $t_{\mathrm{IoU}} + mv_{\mathrm{IoU}}$. Experiments across multiple FPS settings show that our method achieves a strong trade-off between efficiency and localization quality.
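
A verifier of this kind can be sketched as the sum of a temporal IoU and a mean box IoU over annotated seconds. This is a minimal sketch, assuming the two terms in the abstract denote temporal IoU and mean visual (box) IoU; the exact definitions, the averaging rule, and all function names are assumptions.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mean_box_iou(pred_boxes, gt_boxes):
    """Average box IoU over the seconds annotated in the ground truth.

    Both arguments map a second index to a box; seconds missing from the
    prediction count as zero overlap (an assumed convention).
    """
    if not gt_boxes:
        return 0.0
    total = sum(box_iou(pred_boxes[s], gt_boxes[s]) for s in gt_boxes if s in pred_boxes)
    return total / len(gt_boxes)


def verifier_reward(pred_span, gt_span, pred_boxes, gt_boxes):
    """Hypothetical reward: temporal IoU plus mean box IoU, in [0, 2]."""
    return temporal_iou(pred_span, gt_span) + mean_box_iou(pred_boxes, gt_boxes)
```

Because both terms are computed from ground-truth annotations, the reward is verifiable and needs no learned reward model.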

[CV-208] Semantic-Aware Physics-Informed Geometry-Grounded Weather Video Synthesis

Link: https://arxiv.org/abs/2606.29020
Authors: Chenghao Qian,Nedko Savov,Lingdong Kong,Yeying Jin,Rui Song,Wenjing Li,Zhun Zhong,Jiaqi Ma,Gustav Markkula,Luc Van Gool
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Multimedia (cs.MM)
Comments:

Abstract:Weather synthesis aims to add weather effects to input videos while preserving scene identity, structure, and motion. The key limitation of existing methods is the lack of diversity in weather appearance and effective control over weather dynamics (e.g., temporal evolution and particle motion). Most approaches rely on text prompts, which are inherently underspecified and often fail to produce detailed weather characteristics. Additionally, general-purpose video editors optimized for clean and aesthetic outputs tend to suppress heavy weather phenomena, making dense particle effects difficult to generate. To address these, we propose a Semantic-Aware, Physics-Informed, and Geometry-Grounded framework that steers an off-the-shelf video editor to synthesize diverse global appearances and detailed particle dynamics. We factorize the synthesis into three conditional signals, so that each provides a distinct and stable source of guidance: semantics specifies what the weather should look like, dynamics governs how it evolves over time, and geometry determines where it should appear in the scene. Specifically, we introduce (1) semantic-aware appearance anchoring to establish the target appearance from scene semantics and user input; (2) physics-informed dynamic simulation to generate particle effects by simulating a Gaussian-represented particle field under gravity, wind, and turbulence; and (3) geometry-grounded video synthesis to align the simulated particles with target scene geometry and synthesize the final video. Experiments demonstrate that our method produces diverse, physically and visually realistic weather effects. Furthermore, we show that our synthesized data significantly improves the robustness of autonomous driving semantic segmentation under adverse weather conditions. Project page: this https URL.

[CV-209] Mural: Transferring LLM knowledge to image generation via Mixture-of-Transformers

Link: https://arxiv.org/abs/2606.29013
Authors: Achin Jain,Jie An,Siddharth Chaudhary,Davide Modolo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Leveraging capabilities of large language models (LLMs) in text-to-image (T2I) synthesis is an important research direction. In this work we investigate whether the knowledge of a frozen LLM can be effectively utilized in T2I generation when trained exclusively on standard text-image pairs. We integrate a frozen, reasoning-capable LLM with a diffusion-based image generator via shared attention within the Mixture-of-Transformers (MoT) architecture. Our experiments span two critical questions: (1) what degree of the LLM’s intrinsic knowledge remains accessible during T2I training, and (2) what novel capabilities emerge in the resulting system. Across established benchmarks, our models achieve strong performance among unified understanding-generation systems: 0.85 on GenEval, 86.75 on DPG-Bench, and 0.66 on WISE with inference-time reasoning, using only text-image data. Remarkably, we uncover emergent behaviors absent from training data, including cross-lingual image generation, color-guided composition, emoji / ASCII scene construction, and generation directed by world knowledge. These results demonstrate that pretrained LLM knowledge can guide image synthesis under standard text-to-image training paradigms, without interleaved multimodal signals or explicit reasoning supervision. Our findings open new avenues for harnessing frozen model capabilities in resource-constrained multimodal learning.

[CV-210] SciFlow: Semantic Cross Interference for Self-Supervised Optical Flow Domain Generalization

Link: https://arxiv.org/abs/2606.29004
Authors: Jamie Menjay Lin,Jisoo Jeong,Hong Cai,Kai Wang,Fatih Porikli
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages

Abstract:Motions of objects and scenes carry essential intelligence in video understanding, offering rich cues for interpreting dynamic settings and interactions. Due to the cost and scarcity of high-quality annotation or ground truth of pixel-wise optical flow, however, motion estimation models are typically trained in synthetic domains while deployed in real-world domains. Addressing synthetic-to-real domain generalization challenges has been crucial for developing practical solutions in diverse open-world use cases. This paper introduces SciFlow, a simple yet effective, network-agnostic, training-based approach that leverages self-supervised learning to generalize motion estimation across synthetic and open-world domains. Specifically, SciFlow imposes semantic interference from open-world images onto synthetic images during training, blending in-domain features with cross-domain interference, which enables the network to adapt to the real-world domains. Additionally, SciFlow utilizes geometric consistency to ensure validity of the self-supervision. Our experimental results show that SciFlow not only significantly enhances model robustness amidst domain variations, but also remarkably enables synthetic-to-real domain generalization without requiring any ground truth in the open world.

[CV-211] Learning from Acquisition: Metadata-driven Multimodal Pre-training for Cardiac MRI

Link: https://arxiv.org/abs/2606.28991
Authors: Xueyi Fu,Liwei Hu,Zi Wang,Guang Yang
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 11 pages, 3 figures, 3 tables

Abstract:Cardiac magnetic resonance imaging (CMR) routinely records structured acquisition metadata, yet most CMR foundation models rely primarily on image-only pre-training and leave this naturally available source of weak semantic supervision largely underexplored. We propose MetaCLIP-CMR, a metadata-driven framework based on Contrastive Language–Image Pre-training (CLIP), which converts imaging modality, anatomical view, scanner vendor, field strength, and scanner model into textual supervision for CMR representation learning. The pretrained image encoder is evaluated on imaging modality classification, cine view classification, and cardiac segmentation. MetaCLIP-CMR achieves 86.8% modality accuracy and 86.5% cine view accuracy, clearly outperforming ImageNet and masked reconstruction initialisations. For downstream cardiac segmentation, MetaCLIP-CMR consistently obtains the highest Dice score across the evaluated ACDC and MMs cine short-axis (SAX) settings under both full-data and 20% fine-tuning regimes. Compared with recent image-focused large-scale CMR pre-training models, MetaCLIP-CMR achieves comparable ACDC segmentation performance, while requiring less than 1% of their pre-training image scale. These results suggest that metadata learning offers a natural and easy-to-use strategy for transforming routinely recorded acquisition information into effective supervision for foundation-level CMR representation learning, highlighting the promise of metadata-driven multimodal pre-training.

[CV-212] Evidence-Based Text-Conditioned 3D CT Synthesis for Ovarian Cancer

Link: https://arxiv.org/abs/2606.28980
Authors: Francesca Pia Panaccione,Eugenio Lomurno,Francesca Fati,Carlotta Pecchiari,Marina Rosanu,Luigi De Vitis,Lucia Ribero,Gabriella Schivardi,Giovanni Damiano Aletti,Nicoletta Colombo,Maria Francesca Spadea,Francesco Multinu,Matteo Matteucci,Elena De Momi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Ovarian cancer is frequently diagnosed at an advanced stage, making preoperative contrast-enhanced computed tomography (CT) central to staging and surgical planning; yet the scarcity of annotated imaging data, compounded by privacy regulations, limits the development of generalizable computational models in this domain. Text-conditioned 3D CT synthesis has shown promise, but existing pipelines depend on paired radiology reports and have been evaluated only on chest CT. We propose OvESyn (Ovarian Evidence-based Synthesis), a framework that constructs standardized Findings and Impression sections directly from CT-derived imaging descriptors and routine clinical metadata, without any original radiology report, and uses them to condition a latent diffusion model adapted to 493 high-grade serous ovarian carcinoma patients. This is the first text-conditioned 3D CT synthesis framework adapted to an abdomino-pelvic oncologic setting. A systematic ablation over two adaptation axes, vision-language encoder alignment and generator fine-tuning, identifies generator domain adaptation as the operative mechanism for crossing the domain gap and establishing the target anatomy: without it, synthesis remains anchored to the thoracic pretraining domain, with Precision and Recall collapsing to zero and FID2.5D exceeding 140, regardless of encoder alignment. Encoder alignment instead refines intensity and fine detail. The full OvESyn attains the best distributional and intensity fidelity (FID2.5D 29.35, Precision 0.671, Wasserstein-1 0.044), while the generator-only variant maximizes coverage (Recall 0.645), reflecting a fidelity/coverage trade-off governed by encoder adaptation. Requiring only automatic segmentations and routine preoperative metadata, OvESyn supports transferability to report-scarce settings and provides a foundation for synthetic cohort generation in abdomino-pelvic oncologic imaging.

[CV-213] Self-Evolving Agentic Image Restoration via Deliberate Planning and Intuitive Execution

Link: https://arxiv.org/abs/2606.28971
Authors: Shuang Cui,Fan Ji,Guanglong Sun,Yufei Guo,Xiongxin Tang,Jiangmeng Li,Fanjiang Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Real-world image restoration (IR) remains challenging due to complex and coupled degradations. While recent agentic IR frameworks leverage Large Language Models for flexible tool planning, they face two critical limitations. First, from a search scheme perspective, excessive reliance on greedy strategies fails to balance exploration and exploitation. Second, existing agentic systems underutilize information, exhibiting episodic amnesia. To address these challenges, we propose Self-Evolving Agentic Image Restoration (SEAR), which formulates restoration as a sequential decision-making problem. Inspired by the dual-process theory, SEAR comprises an Intuitive Executor and a Deliberate Planner, respectively following the fast-thinking System 1 and slow-thinking System 2 principles. The Deliberate Planner employs Pruning-Aware Monte Carlo Tree Search for long-horizon reasoning, utilizing a hybrid no-reference reward and a Multimodal Large Language Model (MLLM)-based tournament to prevent metric exploitation. Complementarily, the Intuitive Executor leverages a self-evolving episodic memory indexed by degradation-aware state fingerprints. This mechanism distills expensive search trajectories into adaptive expertise, overcoming episodic amnesia while progressively amortizing cold-start exploration costs through memory reuse. Extensive experiments on synthetic and real-world benchmarks demonstrate its strong perceptual and quantitative performance.

[CV-214] Character Recognition of Nepali Number Plate

Link: https://arxiv.org/abs/2606.28946
Authors: Satyasa Khadka,Sandhya Baral,Sudip Tiwari,Sharad Kumar Ghimire
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at World Conference on Information Systems for Business Management (ISBM) 2025

Abstract:This paper presents a robust Automatic Number Plate Recognition (ANPR) system tailored for Nepali license plates written in Devanagari script. In this paper, a pipelined model was used that integrates YOLO-based models for license plate and character detection, followed by a CNN classifier trained on 34 Devanagari characters. Two publicly available data sets were used that incorporate diverse lighting, fonts, and structural variations. Data augmentation and additional training on embossed plates enhanced the generalizability of the model. The system achieved a recognition accuracy of up to 93%, demonstrating strong performance under real-world conditions and providing a scalable solution for traffic management in Nepal. Code: this https URL

[CV-215] ExACT: Exemplar-Driven Calibrated Refinement for Training-Free Visual Grounding in Remote Sensing Images

Link: https://arxiv.org/abs/2606.28920
Authors: Zixiao Zhang,Lingling Li,Pei He,Xu Liu,Licheng Jiao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 8 figures, supplementary material included

Abstract:Remote sensing visual grounding (RSVG) aims to locate specific objects in high-resolution RS imagery using free-form natural language descriptions. While recent advances in multimodal large language models (MLLMs) show great potential for such open-vocabulary RSVG, their training-free adaptation is hindered by the modality gap between abstract linguistic semantics and fine-grained visual cues. In cluttered RS scenes, this gap inevitably causes severe localization drift. To bridge this gap, we propose Exemplar-driven Calibrated Refinement (ExACT), a novel training-free framework driven by a one-shot visual prompting mechanism to explicitly provide discriminative structural guidance for precise pixel-level localization. Specifically, we propose a Vision Exemplar-based Calibrator (VEC) that extracts fine-grained visual correspondences from the given exemplar to rectify the rough cross-modal priors from frozen MLLMs, effectively suppressing background artifacts and accurately outlining target boundaries. Subsequently, a Structure-Aware Refiner (SAR) employs an iterative merge-and-select clustering strategy to consolidate the calibrated priors into high-quality positive and negative geometric prompts. These prompts then guide the Segment Anything Model (SAM) to achieve precise pixel-level predictions. Extensive experiments confirm the superiority of ExACT over existing training-free and weakly-supervised methods.

[CV-216] What Color is the Sky (for a non-human)?

Link: https://arxiv.org/abs/2606.28912
Authors: Yair Weiss,Ofer Springer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The light of the daytime sky contains a mixture of many colors yet is perceived as blue by human observers. This is largely due to the particular response functions of the human cones. Under these response functions skylight and blue light are metamers: they yield the exact same excitation of the cones. In this paper we ask: is it possible to define the “color” of the sky for other visual systems? We present a simple computational method to determine monochromatic metamers to a given input light for arbitrary visual systems. Using published values on spectral sensitivity functions of various species, we use our method to determine the dominant wavelength of monochromatic metamers to skylight. For a wide range of species (bichromats, trichromats and tetrachromats) we find monochromatic metamers to skylight but the dominant wavelength of the metamer can vary drastically between species and be very different from the color perceived by humans.

[CV-217] Projection-based coupling of infrared thermography and stereocorrelation-based digital image correlation

Link: https://arxiv.org/abs/2606.28905
Authors: Jendrik-Alexander Tröger,Lutz Müller-Lohse,Stefan Hartmann
Categories: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci); Optics (physics.optics)
Comments:

Abstract:Full-field measurement techniques such as digital image correlation and infrared thermography are prevalent in experimental solid mechanics. Digital image correlation is used to analyze surface deformation, while infrared thermography quantifies surface temperature fields. However, sophisticated procedures are necessary to express both datasets in the same Lagrangian frame, especially when analyzing non-flat surfaces. In this study, we propose an external projection-based coupling that uses the pinhole camera model to relate two-dimensional temperature data measured by infrared thermography to three-dimensional point coordinates from stereocorrelation-based digital image correlation. Unlike existing multiview approaches, we utilize two independently calibrated industrial-grade systems and augment the experimental evaluation with the pinhole camera model. The projection matrix of the camera model is calibrated using a single image of a reference object. Through this projection, temperature fields are accurately represented at material points. Our method is particularly suited for, but not restricted to, curved surfaces and straightforward to embed in existing experimental protocols, as the image registration is kept as is. Additionally, we propose using radial basis functions as a global interpolation ansatz in both space and time to compute in-plane temperature gradients and even temperature rates on curved surfaces, thereby providing an extensive and information-rich full-field dataset.
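
The coupling step can be sketched as a pinhole projection: a 3D point from the stereo system is mapped through a 3x4 projection matrix to a pixel in the thermal image, and the temperature is read at that pixel. This is a minimal sketch under assumptions: nearest-pixel sampling, the matrix values in the usage note, and all function names are illustrative and not taken from the paper.

```python
def project_point(P, X):
    """Project a 3D point X = (x, y, z) with a 3x4 projection matrix P.

    Returns pixel coordinates (u, v) after the homogeneous division.
    """
    xh = (X[0], X[1], X[2], 1.0)
    u, v, w = (sum(p * c for p, c in zip(row, xh)) for row in P)
    if w == 0:
        raise ValueError("point projects to infinity")
    return u / w, v / w


def sample_temperature(thermal_image, P, X):
    """Temperature at the pixel nearest to the projection of X.

    thermal_image is a 2D list indexed as [row][column]; returns None when
    the point falls outside the image.
    """
    u, v = project_point(P, X)
    col, row = int(round(u)), int(round(v))
    if 0 <= row < len(thermal_image) and 0 <= col < len(thermal_image[0]):
        return thermal_image[row][col]
    return None
```

For example, with `P = [[100, 0, 1, 0], [0, 100, 1, 0], [0, 0, 1, 0]]` (focal length 100 pixels, principal point at pixel (1, 1)), the point `(0, 0, 2)` on the optical axis lands on pixel (1, 1). Repeating the lookup for every tracked material point yields a temperature field expressed at the same points as the displacement data.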

[CV-218] On Test-Time Scaling for Vision-Language Models ECCV2026

Link: https://arxiv.org/abs/2606.28864
Authors: Fawaz Sammani,Tzoulio Chamiti,Nikos Deligiannis
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2026

Abstract:Test-time scaling is a paradigm where large models use additional compute at inference to achieve better performance, without changing model weights. While it has been widely studied for Large Language Models (LLMs), its applicability to Large Vision-Language Models (LVLMs) remains less explored, with limited analysis of whether, when, and to what extent these approaches transfer to LVLMs. In this work, we ask a simple but fundamental question: can conventional test-time scaling methods developed for LLMs be directly applied to LVLMs? We present the first comprehensive study of test-time scaling for LVLMs, spanning multiple models and model sizes, nine test-time scaling methods, and six diverse benchmarks. Our main findings are that 1) in contrast to previous findings, small, well-performing models benefit the most from test-time scaling, with performance improvements of up to around 30% that reach, and often exceed, the performance of large models, 2) LVLMs lose focus when given more compute than necessary, and 3) visual information is encoded early in the reasoning chain, after which the chain is dominated by text-only reasoning and the contribution of image tokens drops significantly. Finally, we also provide a global and fine-grained analysis of the quality and information sufficiency of the reasoning chains produced. Overall, our findings and analysis provide practical guidance and insights into LVLMs and their deployment in research and industry.

[CV-219] HKVLM: Faithful Reasoning Grounding by Binding Language Queries to a Frozen Detector

Link: https://arxiv.org/abs/2606.28862
Authors: Bo Ma
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Many visual requests, such as “the object to open this bottle” or “the person not wearing a helmet”, require reasoning, not just category matching. Pure open-vocabulary detectors need an explicit phrase; vision-language models (VLMs) can reason yet “see but mis-speak”, attending to the right region but returning the wrong box or label. We argue this is a binding failure: in coordinate-as-text VLMs localization passes through the autoregressive head, coupling it to language generation; in two-stage pipelines the model’s intent is squeezed through a single class string. We present HKVLM, which removes localization from the language path. A frozen, language-aligned detector emits class-agnostic region proposals; a frozen language model encodes reasoning instructions as referential query embeddings; a lightweight alignment hook binds queries to regions by contrastive retrieval and bipartite assignment in a shared embedding space. A perception-grounded faithfulness veto forbids naming an object that no region supports. Only the hook is trained, targeting small-data cold-start settings where monolithic VLM tuning struggles. We formalize a say-vs-see decomposition separating localization error (SeeErr) from binding error (SayErr), and evaluate on RefCOCO/RefCOCO+/RefCOCOg and POPE. With frozen Grounding DINO and Qwen2.5-VL, training only the hook lifts grounding accuracy by $50$–$90\times$ over untrained cross-space matching; the faithfulness veto raises POPE accuracy from near-chance ($0.50$) to $0.66$–$0.76$ and reduces hallucination from $\sim 0.99$ to $0.23$–$0.43$, with gains from 200 expressions. Increasing proposals from $M=50$ to $M=300$ improves grounding by $19$–$24\%$ without retraining, confirming that residual error is perceptual (SeeErr) rather than binding (SayErr).
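
The binding step can be illustrated as a one-to-one assignment between query embeddings and region embeddings that maximizes total cosine similarity, followed by a veto for weakly supported matches. This is a toy sketch under assumptions: it uses brute-force search rather than whatever solver the paper uses, and the threshold-based veto and all names are hypothetical.

```python
import math
from itertools import permutations


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def bind_queries(queries, regions, veto_threshold=0.5):
    """Assign each query to a distinct region, maximizing total similarity.

    Returns one region index per query, or None where the matched region's
    similarity falls below veto_threshold (an assumed form of the veto).
    Brute force over permutations, so only suitable for small inputs;
    assumes len(queries) <= len(regions).
    """
    sim = [[cosine(q, r) for r in regions] for q in queries]
    best = max(
        permutations(range(len(regions)), len(queries)),
        key=lambda perm: sum(sim[i][j] for i, j in enumerate(perm)),
    )
    return [j if sim[i][j] >= veto_threshold else None for i, j in enumerate(best)]
```

For realistic proposal counts such as $M=300$, a polynomial-time assignment solver (for example the Hungarian algorithm) would replace the brute-force search.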

[CV-220] EpiSAM: Character Segmentation in Challenging Stone Inscriptions ICDAR2026

Link: https://arxiv.org/abs/2606.28859
Authors: Arnav Sharma,Pratyush Jena,Amal Joseph,Ravi Kiran Sarvadevabhatla
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: To be published in ICDAR 2026

Abstract:Stone inscriptions are invaluable sources of historical and linguistic knowledge, yet their automated analysis remains a major challenge due to surface irregularities, erosion, and low visual contrast. Conventional document and handwriting analysis techniques fail to perform well in these scenarios. In this work, we propose character detection as a core strategy for robust inscription analysis. We introduce EpiSAM, a prompt-guided transformer framework for character segmentation in stone inscriptions. Rather than treating characters in isolation, EpiSAM employs a novel neighbor-aware strategy, explicitly predicting adjacent characters alongside the target. These contextual cues resolve boundary ambiguities, improving mask generation and enabling more accurate character segmentation. Furthermore, we expand an existing stone inscription dataset by adding dense polygonal annotations for characters, thereby enabling comprehensive research on Southeast Asian epigraphy. Experimental results show that EpiSAM achieves consistent improvements over existing baselines, while also exhibiting strong zero-shot generalization in challenging epigraphic scenarios.

[CV-221] Personalizing MLLMs via Reinforced Multimodal Reference Game ECCV26

Link: https://arxiv.org/abs/2606.28845
Authors: Deepayan Das,Davide Talon,Yiming Wang,Massimiliano Mancini,Elisa Ricci
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV26. Project page: this https URL

Abstract:Personalizing Multimodal Large Language Models (MLLMs) aims to recognize users’ unique concepts from visual data and provide personalized responses. Although prior work has shown the benefit of concept descriptions and reasoning for this task, MLLM descriptions often include information, such as state and context, that does not help and may in fact hinder the unique identification of the target concept among other visually similar items. Effective descriptions of personal concepts should instead be accurate, discriminative, and free of distracting details. To achieve such descriptions, we introduce Reinforced Reference Game (RRG), a learning framework that promotes discriminative descriptions through a novel reinforced multimodal reference game. The MLLM plays both the roles of speaker and listener in a contrastive game setting, whose goal is to effectively communicate discriminative information about a target concept. Our approach formulates a verifiable contrastive reward over hard positives (dissimilar views of the same concept) and hard negatives (visually similar but different concepts). Empirically, RRG achieves state-of-the-art results across multiple tasks on three personalization benchmarks. RRG generalizes to unseen domains and outperforms existing methods based on concept descriptions and personalization-specific RL frameworks. We will release code and models on the project page.

[CV-222] DLGStream: Dynamic Language-embedded Gaussian Splatting for Open-vocabulary Enabled Free-viewpoint Video Streaming ECCV2026

Link: https://arxiv.org/abs/2606.28840
Authors: Zhihui Ke,Yuyang Liu,Xiaobo Zhou,Tie Qiu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2026

Abstract:3D Gaussian Splatting (3DGS) has emerged as a promising paradigm for reconstructing streamable free-viewpoint video (FVV) from multi-view videos. However, 3DGS-based FVVs typically lack user interaction and editing capabilities, which diminishes the immersive experience. Recent research has integrated language features from CLIP into 3DGS via distillation, enabling open-vocabulary queries and supporting many downstream applications. Nevertheless, the stringent requirements of FVV, namely small frame size and high FPS, make current language Gaussian representations unsuitable for language-embedded FVV. In this paper, we propose DLGStream, a novel language-embedded FVV representation that streams time-varying language features alongside Gaussian attributes to support 4D environment interaction, scene editing, and spatial intelligence. Specifically, we propose a dual-opacity dynamic language Gaussian representation, which maintains two opacity attributes for color and language features to deal with the performance degradation that occurs when colors and features are jointly optimized. Furthermore, we introduce an interpolation-based deformation field to reduce temporal redundancy. This deformation field can also be used for 4D frame interpolation, boosting FVV sequences from low to high FPS. Experimental results demonstrate that DLGStream achieves superior performance in both open-vocabulary segmentation and reconstruction quality with an average frame size of merely 43 KB. The code is available at this https URL.

[CV-223] Ground4D: Consistency-Aware 4D Reconstruction from Monocular Video

Link: https://arxiv.org/abs/2606.28828
Authors: Qing Zhao,Weijian Deng,Pengxu Wei,Liang Lin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Learning a 4D scene representation from a single monocular video that supports dynamic novel-view synthesis while maintaining faithful geometry over time remains challenging. Dynamic Gaussian Splatting achieves strong rendering performance through photometric optimization, yet does not explicitly enforce multi-view geometric consistency. In contrast, 3D foundation models recover coherent scene geometry and camera motion, but their point-based outputs are not designed for photorealistic rendering. We propose Ground4D, a geometry-grounded framework built on two stages. First, we perform geometry initialization via 3D foundation models, leveraging VGGT in a training-free manner to reconstruct multi-view-consistent 3D geometry and camera poses from monocular video. The recovered geometry provides a structured and reliable initialization for dynamic Gaussian representations. Second, we conduct geometry-consistency-aware refinement via dynamic Gaussian Splatting, optimizing the representation through differentiable rendering while maintaining multi-view geometric consistency across both observed and synthesized viewpoints. Furthermore, Ground4D inherently models the continuous 4D dynamics of the scene, naturally supporting rendering at arbitrary timestamps. By integrating foundation-level geometric priors into dynamic Gaussian optimization, Ground4D achieves stronger reconstruction fidelity and rendering performance, underscoring the role of geometry-grounded constraints in robust 4D scene modeling.

[CV-224] RefGlass-GS: A UAV-Enabled Fusion Framework for Photorealistic Semantic and Interactive Digitization of Reflective Glass Facades via Gaussian Splatting

Link: https://arxiv.org/abs/2606.28826
Authors: Zhenyu Liang,Xiao Zhang,Boyu Wang,Zhaolun Liang,Ang Li,Jeff Chak Fu Chan,Mingzhu Wang,Jack C.P. Cheng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing digitization of buildings with reflective glass facades suffers from geometric reconstruction distortion, unrealistic view-dependent texture rendering, and difficulties in object-based semantic enhancement. Therefore, we propose RefGlass-GS, a fusion framework that enables end-to-end UAV-based photorealistic, semantic, and interactive digitization of reflective glass facades. The contributions include: (1) proposing an individual glass panel segmentation method based on maximum a posteriori estimation with structural regularities, robust to severe reflection and background interference; (2) formulating a UAV viewpoint planning optimization function that maximizes the coverage of view-dependent appearance for sufficient data capture; (3) developing an optimized Gaussian Splatting framework with a Reflection MLP, a novel deferred shading function, and two enhanced regularization terms for effective modeling of high-frequency near-field reflections; (4) introducing a standardized data organization paradigm for structuring GS-based representations into object-based models, facilitating interactive facility management on digital twin platforms. Experiments on real-world reflective glass facade scenes validate the effectiveness and superiority of the proposed method. Specifically, the glass panel segmentation achieves an improvement of 0.1927 in mIoU over SOTA methods, and only our method enables instance-level panel extraction. The UAV view planning improves novel view synthesis for reflective facades by 13.15 dB in PSNR compared to commercially used nap-of-the-object planning methods. The RefGlass-GS modeling outperforms SOTA Gaussian Splatting approaches for reflective scenes with an average improvement of 5.08 dB in PSNR.

[CV-225] CoGS: Compositional Dynamic Human-Object Scenes Gaussian Splatting from Monocular Video

Link: https://arxiv.org/abs/2606.28820
Authors: Jerrin Bright,John Zelek
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages

Abstract:Reconstructing dynamic human–object interaction scenes from monocular video is difficult because the human, manipulated object, and background obey different motion models while sharing the same pixels. Existing dynamic radiance-field and Gaussian-splatting methods often entangle these components, causing object motion to leak into the human or static scene, and monocular human reconstruction remains underconstrained in regions that are rarely observed. We present CoGS, a compositional Gaussian-splatting framework for monocular human–object scene reconstruction. CoGS decomposes the video into three coordinated branches: an articulated human initialized from a complete canonical prior, a rigid object field driven by an estimated object trajectory, and a static scene field regularized by weak scene-only planar primitives when available. A six-stage optimization schedule first stabilizes the human and object independently, then fuses them with the scene under full-image supervision, visibility-aware human anchoring, object silhouette and motion constraints, and delayed scene regularization. This design keeps each component responsible for its own geometry and motion while allowing photometric evidence to correct the final composite. Experiments on HOSNeRF and NeuMan show that CoGS improves both human–object interaction reconstruction and in-the-wild human–scene rendering, achieving stronger fidelity and perceptual quality across full-frame and human-focused evaluations. Code will be released upon publication.

[CV-226] ViPSim: Collaborating Visual and Parameter Spaces for Consistent Long-Horizon Embodied World Models

Link: https://arxiv.org/abs/2606.28804
Authors: Longyu Chen,Heng Li,Wei Yang,Manqi Zhao,Dongsheng Jiang
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted to Robotics: Science and Systems (RSS) 2026

Abstract:Embodied World Models (EWMs) have emerged as a scalable and risk-free paradigm for advancing embodied intelligence, enabling the safety-critical evaluation of Vision-Language-Action systems. However, their reliability as evaluation benchmarks and foundational simulators is often hindered by the representation gap between low-dimensional actions and high-dimensional video synthesis. This gap results in a lack of geometric correspondence, manifesting as accumulated trajectory drift and inconsistent robot-object interactions during long-horizon rollouts. To bridge this gap, we propose ViPSim, a framework that achieves consistent long-horizon generation through the synergistic collaboration of Visual and Parameter Spaces. We define the Visual Space as a domain of explicit spatial priors, integrating pixel-aligned projections of end-effector pose, camera perspectives, depth-informed scene geometry, and robotic morphological masks to provide dense structural grounding. Concurrently, the Parameter Space serves as a domain of numerical drivers, injecting raw action sequences and camera matrices to provide precise motion guidance. By unifying these two spaces, ViPSim ensures that the generated states are simultaneously anchored by geometric boundaries and steered by numerical commands. Extensive experiments demonstrate that ViPSim is backbone-agnostic and significantly enhances trajectory consistency. Notably, our approach exhibits emergent capabilities in generating complex interactions with deformable objects (e.g., cloth folding) and maintains robust performance in out-of-distribution and cross-embodiment scenarios, providing a high-fidelity foundation for the automated evaluation and predictive control of embodied agents.

[CV-227] PSP: Harnessing Position and Shape Priors for Cross-Domain Few-Shot Medical Image Segmentation MICCAI2026

Link: https://arxiv.org/abs/2606.28799
Authors: Bin Xu,Yazhou Zhu,Haofeng Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to MICCAI 2026. This is the pre-review submitted version, not the camera-ready version. The final authenticated version will be available in the MICCAI 2026 proceedings

Abstract:Few-Shot Medical Image Segmentation (FSMIS) offers a powerful solution to data scarcity but struggles to generalize across different imaging modalities. This performance collapse stems primarily from the drastic texture discrepancies between domains, which mislead models trained on source-specific intensity distributions. While existing methods attempt to align frequency or local texture features, they often fail to decouple semantic structure from domain-specific appearance. To address this, we identify a critical invariance: despite distinct imaging physics, the position and geometric shape of organs remain robustly consistent across modalities. Therefore, we propose a novel framework that harnesses Position and Shape Priors (PSP) for cross-domain FSMIS. Specifically, PSP first introduces a Position Coordinate Embedding (PCE) module to inject relative spatial coordinates for rapid organ localization. Subsequently, a Shape Prototype Modulation (SPM) module constructs domain-invariant structural prototypes via explicit shape priors, effectively filtering out texture noise. Furthermore, the Hybrid-Prototype Prediction (HPP) module adaptively calibrates the support prototype to the query feature distribution, mitigating feature misalignment. Extensive experiments on two public medical imaging datasets demonstrate that PSP significantly outperforms state-of-the-art methods.

[CV-228] Virtual Ring Try-On

Link: https://arxiv.org/abs/2606.28792
Authors: Vishnu D. Burkhawala,Zankhana J. Barad,Harshadkumar B. Prajapati,Vipul K. Dabhi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper presents an innovative approach that enables users to capture an image of their hand and try a jewel ring on it. The user captures the hand image using the React Native-based GUI of the mobile application and selects the ring to try; the output image shows the user’s hand wearing the ring. The approach is implemented using a combination of MediaPipe hand point detection and YOLO-V8 custom object detection. The hand image uploaded by the user first undergoes MediaPipe hand point detection, which yields the hand points and a Region of Interest mask where the ring is to be placed. The ring image is then passed through YOLO object detection, in which ring points are detected and the background is removed. After that, using vector algebra, the angular discrepancy between the finger’s reference axis and the ring’s principal axis is computed. The ring is also rescaled according to finger thickness, preserving the aspect ratio to maintain perceptual realism. Finally, the ring is placed on the hand image, and the output image is generated and shown on the user’s screen.
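
The alignment step described above (angle between the finger axis and the ring axis, then aspect-preserving rescaling) can be sketched in plain Python. This is a minimal illustration under assumptions, not the paper's implementation: the function names, the two-point definition of the finger axis, the sample coordinates, and the `fit_ratio` parameter are all hypothetical.

```python
import math

def signed_angle_deg(finger_axis, ring_axis):
    """Signed 2D angle (degrees) that rotates ring_axis onto finger_axis.

    Note: in image coordinates the y axis points down, so the visual
    direction of a positive angle is flipped relative to standard math axes.
    """
    fx, fy = finger_axis
    rx, ry = ring_axis
    cross = rx * fy - ry * fx  # proportional to sin(angle)
    dot = rx * fx + ry * fy    # proportional to cos(angle)
    return math.degrees(math.atan2(cross, dot))

def rescale_ring(ring_w, ring_h, finger_width_px, fit_ratio=1.0):
    """Scale the ring so its width matches the finger width.

    A single scale factor is applied to both sides, so the aspect ratio
    is preserved. fit_ratio is a hypothetical tuning knob.
    """
    scale = (finger_width_px * fit_ratio) / ring_w
    return round(ring_w * scale), round(ring_h * scale)

# Hypothetical landmark positions in pixels (not taken from the paper).
knuckle, joint = (100.0, 200.0), (100.0, 100.0)
finger_axis = (joint[0] - knuckle[0], joint[1] - knuckle[1])
ring_axis = (1.0, 0.0)  # assumed principal axis of the detected ring

angle = signed_angle_deg(finger_axis, ring_axis)
new_size = rescale_ring(200, 100, finger_width_px=50)
```

Using `atan2` on the cross and dot products gives a signed angle over the full circle, which avoids the sign ambiguity of `acos` alone.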

[CV-229] BREIT: A Framework for Brain Stroke Reconstruction using Multi-Frequency 3D EIT

Link: https://arxiv.org/abs/2606.28787
Authors: Djahid Abdelmoumene,Ishak Ayad,Maï K. Nguyen,Christian Daveau
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multi-Frequency Electrical Impedance Tomography (MF-EIT) is a non-invasive, low-cost modality that reconstructs electrical property distributions from boundary voltages. For stroke imaging, progress in 3D deep-learning reconstruction is limited by the lack of large-scale datasets with paired ground-truth (GT) volumes and by non-standardized pipelines for data generation, simulation, and evaluation. We introduce BREIT, a modular framework for 3D MF-EIT stroke reconstruction providing: (i) a neuroimaging-to-EIT pipeline that converts CT/MRI into frequency-dependent GT admittivity volumes; (ii) a self-contained Python 3D Complete Electrode Model (CEM) forward solver for simulating MF-EIT voltages; and (iii) a 3D D-bar implementation supporting non-uniform electrode layouts. Building on BREIT, we propose dFNO-bar, which integrates Fourier Neural Operators into D-bar by learning a mapping from scattering data $t(\xi)$ to conductivity $\sigma(x)=\Re\{\gamma\}$. We evaluate dFNO-bar against D-bar, Deep D-bar, and Gauss–Newton reconstructions on UCLH-matched synthetic data, and observe higher brain SSIM with comparable CC across noise settings. Code and data are publicly available at: this https URL

[CV-230] Stochastic Optimal Control Sampling for Diffusion Inverse Problems ECCV2026

Link: https://arxiv.org/abs/2606.28785
Authors: Jie Zhang,Youmei Qiu,Hanling Tian,Jingyuan Zhang,Xiang Yin,Xiaolin Huang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV2026

Abstract:Benefiting from the strong ability to capture data distributions, diffusion models have become powerful tools for solving image inverse problems. The key is to controllably steer the sampling trajectory toward the measurements while respecting the diffusion prior. In this work, we introduce Stochastic Optimal Control Sampling (SOCS), which models the denoising process as a dynamical system and injects control signals via SOC. Previous SOC-based approaches address inverse problems by optimizing over the entire trajectory, which is computationally expensive. In contrast, we derive a closed-form control update and apply it at each sampling step, pulling the measurement-consistent clean prediction back onto the denoising flow. In SOCS, we can readily modulate the control strength to align with the diffusion model’s native capabilities and thereby enhance perceptual quality. Our method is compatible with a variety of linear stochastic differential equation backbones. Extensive experiments across a broad spectrum of image inverse tasks demonstrate that SOCS achieves accurate measurement-aligned reconstructions with improved visual fidelity and stronger quantitative performance.

[CV-231] X-Mind: Efficient Visual Chain-of-Thought via Predictive World Model for End-to-End Driving

Link: https://arxiv.org/abs/2606.28758
Authors: Bohao Zhao,Chengrui Wei,Guangfeng Jiang,Ruixin Liu,Xuejie Lv,Liu Liang,Sutao Deng,Xiuyang Fan,Pengkun Zheng,Jinyun Zhou,Rui Guo,Hanpeng Liu,Yutong Zheng,Yi Guo,Xinlong Zheng,Qingyu Luo,Zhuangzhuang Ding,Yu Zhang,Hang Zhang,Xianming Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Predicting future states is essential for autonomous agents, yet current Vision-Language-Action (VLA) models fundamentally lack this capability, relying instead on reactive perception-action mapping. While integrating Predictive World Models (PWMs) addresses this gap, existing approaches either incur prohibitive cascaded latency or act as shallow terminal tasks that fail to deeply embed forward-looking reasoning. To endow VLA models with this reasoning capability, we propose X-Mind. Rather than treating PWMs as an external auxiliary module, this framework internalizes them as the Visual Chain-of-Thought (Visual CoT). By enforcing a world rollout prior to action, the model is constrained to imagine future evolution first, yielding a driving policy that is robustly grounded in environmental dynamics and aware of the future consequences of its actions. The challenge here is efficiency, and we tackle it on two fronts. First, we introduce a compact representation of visual thinking: an abstract sketch that fuses a Bird’s-Eye-View (BEV) layout with abstract driving priors (e.g., navigation intents and traffic rules). Rather than rolling out dense future frames, the model reasons over this sketch as a mental canvas; aided by a Deep Compression Autoencoder (DC-AE), a 12-frame future rollout is reduced to merely 96 tokens, alleviating the long-context computational bottleneck. Second, to accelerate generation further, we propose a recurrent block diffusion scheme that unrolls the denoising steps across the layers of the large drive model, folding iterative refinement into a single forward pass of the backbone. Trained and validated on large-scale real-world data, X-Mind achieves competitive end-to-end driving performance, which makes it a highly practical, low-latency solution that successfully deploys large-scale cognitive reasoning directly onto resource-constrained vehicle platforms.

[CV-232] A Physics-Grounded Benchmark for Multi-Agent Dynamics in World Models

Link: https://arxiv.org/abs/2606.28757
Authors: Nuo Chen,Lulin Liu,Zihao Li,Ziyao Zeng,Zihao Zhu,Wenyan Cong,Junyuan Hong,Yunhao Yang,Zhengzhong Tu,Yan Wang,Boris Ivanovic,Marco Pavone,Zhangyang Wang,Yang Zhou,Zhiwen Fan
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 34 pages, 9 figures, 12 tables

Abstract:Generative world models hold immense promise as scalable simulators for autonomous systems, particularly for synthesizing rare but safety-critical multi-agent interactions, such as vehicle collisions. However, current evaluation paradigms focus heavily on visual fidelity and semantic alignment, leaving a critical blind spot: they cannot reliably quantify whether generated dynamics actually obey the fundamental physical laws required for reliable simulation. Assessing this physical plausibility is inherently difficult due to a lack of physical metrics and the challenge of extracting metric-scale kinematics from uncalibrated video rollouts. To bridge this gap, we introduce CrashTwin, a physics-grounded evaluation framework designed to stress-test the physical trustworthiness of world models. CrashTwin couples a diverse dataset of multi-agent collision scenarios, comprising 25K controllable synthetic and 12K in-the-wild real-world collision sequences, with a novel calibration-free reconstruction pipeline, enabling the recovery of 3D physical attributes directly from world model rollouts. We propose a diagnostic suite that systematically evaluates three dimensions: spatio-temporal consistency, momentum and kinetic energy conservation, and world-dynamics integrity. Extensive benchmarking of state-of-the-art models reveals a crucial insight: high perceptual quality frequently masks severe physical violations during complex interactions. By quantitatively exposing these failure modes, CrashTwin provides a vital diagnostic tool for developing physically grounded world models capable of reliable real-world simulation.
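
To make the "momentum and kinetic energy conservation" dimension concrete, here is a generic physics check in plain Python. The abstract does not give CrashTwin's actual metric definitions, so this is only a sketch under assumptions: the function names, the report keys, and the two-vehicle example values are hypothetical.

```python
import math

def momentum(masses, velocities):
    """Total linear momentum, one component per spatial axis."""
    dims = len(velocities[0])
    return tuple(
        sum(m * v[d] for m, v in zip(masses, velocities)) for d in range(dims)
    )

def kinetic_energy(masses, velocities):
    """Total translational kinetic energy: sum of 0.5 * m * |v|^2."""
    return sum(0.5 * m * sum(c * c for c in v) for m, v in zip(masses, velocities))

def conservation_report(masses, v_before, v_after):
    """Compare pre- and post-collision momentum and kinetic energy."""
    p0 = momentum(masses, v_before)
    p1 = momentum(masses, v_after)
    p0_norm = math.sqrt(sum(c * c for c in p0))
    dp = math.sqrt(sum((a - b) ** 2 for a, b in zip(p0, p1)))
    ke0 = kinetic_energy(masses, v_before)
    ke1 = kinetic_energy(masses, v_after)
    return {
        # Relative change in total momentum; near 0 for an isolated collision.
        "momentum_rel_error": dp / max(p0_norm, 1e-12),
        # Ratio above 1 means energy appeared from nowhere.
        "ke_ratio": ke1 / max(ke0, 1e-12),
    }

# Hypothetical perfectly inelastic collision: a 1000 kg car at 20 m/s hits a
# stationary 1500 kg car and both move off together at 8 m/s.
report = conservation_report(
    masses=[1000.0, 1500.0],
    v_before=[(20.0, 0.0), (0.0, 0.0)],
    v_after=[(8.0, 0.0), (8.0, 0.0)],
)
```

In this example momentum is conserved while kinetic energy drops, which is physically valid for an inelastic collision; a generated rollout with a large momentum error or an energy ratio above 1 would be suspect.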

[CV-233] FreqOrtho-SR: Frequency-Guided Orthogonal Expert Learning for Real-World Image Super-Resolution ECCV2026

Link: https://arxiv.org/abs/2606.28745
Authors: Minh Son Hoang,Dinh Phu Tran,Quyen Nguyen Duc,Dam Hoang Phuong,Daeyoung Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ECCV 2026

Abstract:Diffusion prior-based methods have shown impressive results in real-world image super-resolution (ISR), yet two key challenges persist: balancing pixel-level fidelity with semantic quality, and adapting to diverse degradations. Existing dual-branch approaches freeze the pixel module during semantic training, but the semantic branch can still expand capacity within the pixel subspace, precluding genuine perceptual improvement. Moreover, using a single static adapter cannot generalize across heterogeneous real-world corruptions. To address both issues, we propose FreqOrtho-SR, which comprises: Frequency-guided Mixture of LoRA Experts (FreqMoE), which routes inputs to specialized experts via a non-parametric FFT-based degradation-feature extractor that encodes frequency-domain signatures, enabling stable and interpretable specialization across corruption types; and Orthogonal Gradient Projection (OGP), which reframes the dual-objective optimization as a subspace-constrained problem: by extracting the pixel-fidelity subspace via SVD on combined expert weight deltas and projecting semantic gradients onto its null space, OGP guarantees orthogonality between the two objectives, enabling genuinely complementary learning without mutual interference. Experiments show that FreqOrtho-SR achieves competitive overall performance and a strong fidelity-perception trade-off across multiple benchmarks with efficient single-step inference. The source code of our method can be found at this https URL (sonhm3029/FreqOrtho-SR).
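
The core linear algebra behind the projection idea described above can be sketched with NumPy: take an SVD of stacked weight deltas, keep the leading directions as a basis, and strip those directions out of a gradient. This is a rough illustration under assumptions, not the authors' code: the function names, the way deltas are stacked, the choice of `rank`, and the random test matrices are all hypothetical.

```python
import numpy as np

def pixel_subspace(weight_deltas, rank):
    """Orthonormal basis for the top-`rank` left singular directions.

    weight_deltas: list of (d, k_i) arrays, stacked column-wise (an assumption
    about how the expert deltas are combined).
    """
    stacked = np.concatenate(weight_deltas, axis=1)
    u, _, _ = np.linalg.svd(stacked, full_matrices=False)
    return u[:, :rank]  # shape (d, rank), columns are orthonormal

def project_out(grad, basis):
    """Remove the components of grad that lie in span(basis)."""
    return grad - basis @ (basis.T @ grad)

rng = np.random.default_rng(0)
deltas = [rng.standard_normal((8, 3)) for _ in range(2)]  # stand-in deltas
basis = pixel_subspace(deltas, rank=2)

semantic_grad = rng.standard_normal((8, 4))  # stand-in gradient
projected = project_out(semantic_grad, basis)

# After projection the gradient has no component along the basis directions.
residual = np.abs(basis.T @ projected).max()
```

Because the basis columns are orthonormal, `basis @ basis.T` is an orthogonal projector, so subtracting it leaves exactly the part of the gradient in the orthogonal complement.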

[CV-234] CCRC: A Change-Aware Captioning and Reasoning Chain for Image Change Captioning and Segmentation

Link: https://arxiv.org/abs/2606.28724
Authors: Jinhong Hu,Xiaoping Wang,Shuyin Huang,Guojin Zhong,Kaitai Liu,Kai Lu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Understanding and localizing subtle changes between paired images is critical for tasks such as surveillance and image editing. However, traditional Image Change Captioning (ICC) methods lack spatial grounding, limiting their precision. We introduce Image Change Captioning and Segmentation (ICCS), a new multimodal task that jointly requires structured change description and pixel-level localization. To address ICCS, we propose the Change-aware Captioning and Reasoning Chain (CCRC), a dual-chain framework that decouples semantic reasoning from spatial segmentation. The first chain, Chain-of-Change-Captioning (CCC), enhances fine-grained change perception via a visual fusion module based on Multi-Head Change-aware Attention inserted between the visual and language components of a Multimodal Large Language Model (MLLM). CCC also determines whether a change is segmentable. If not, it alone generates the caption. Otherwise, the second chain, Chain-of-Change-Segmenting (CCS), is activated, leveraging spatial priors from CCC and refining masks with a Change-aware Token Refiner for accurate boundary localization. We evaluate CCRC on both synthetic and real-world change detection benchmarks with pixel-level supervision. Experiments show CCRC achieves state-of-the-art performance.

[CV-235] LogiCo: A Unified Framework for Logical and Structural Anomaly Detection ECCV2026

Link: https://arxiv.org/abs/2606.28688
Authors: Ximiao Zhang,Min Xu,Xiuzhuang Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2026

Abstract:Current anomaly detection methods primarily focus on structural anomalies, while paying insufficient attention to anomalies that violate logical constraints. Conversely, top-performing logical anomaly detection approaches address this by modeling global semantic consistency, but perform poorly on subtle structural anomalies due to inadequate detection granularity. In this paper, we propose LogiCo, a unified framework for Logical and structural anomaly detection via Component-level feature reconstruction. Unlike existing methods that rely on explicit global semantic modeling, LogiCo employs a novel component-level feature reconstruction technique to capture inter-component logical constraints. Specifically, LogiCo maps pre-trained image features into a discrete component-level feature space and performs collaborative feature reconstruction at both component and patch levels, enabling it to effectively detect both logical and structural anomalies. Furthermore, to address the specific challenge of count-related logical anomalies, we integrate a segmentation-map discriminator that extends the model’s capability to identify quantitative inconsistencies. LogiCo achieves state-of-the-art performance on both logical and structural anomaly detection across four benchmarks, including MVTec-LOCO, MVTec-AD, VisA, and Real-IAD, demonstrating its superiority and practical feasibility. The code is available at this https URL.

[CV-236] SATB-VR: Training Few-Step Video Restoration Diffusion Model using SNR-Aware Trajectory Blending

Link: https://arxiv.org/abs/2606.28677
Authors: Haoran Bai,Xiaoxu Chen,Xiaoyu Liu,Zongsheng Yue,Sibin Deng,Wangmeng Zuo,Ying Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While diffusion models excel in video restoration, their reliance on extensive iterative steps limits efficiency. Conversely, aggressive single-step distillation often compromises fine texture recovery. To achieve an optimal balance, we present SATB-VR, a few-step paradigm that jump-starts the denoising process via an auxiliary predictor, explicitly bypassing early low signal-to-noise ratio (SNR) steps. However, naive joint training of the predictor and the denoiser inherently introduces a severe train-inference discrepancy. To resolve this, we propose the SNR-Aware Trajectory Blending (SATB) strategy. During the forward process, SATB constructs the noisy input by dynamically blending the predictor’s output with the ground-truth trajectory based on the SNRs. This forces the denoiser to robustly compensate for initial prediction errors while smoothly converging to the clean data manifold. Furthermore, we introduce a Denoiser-Driven Consistency (DDC) loss, leveraging the concurrently updated denoiser as a dynamic evaluator to explicitly align internal features and boost predictor accuracy. Extensive experiments demonstrate that, under flexible few-step inference regimes (e.g., ≤ 5 steps), SATB-VR performs favorably against existing approaches on synthetic, real-world, and AIGC benchmarks.

[CV-237] Predicting Metastatic Risk from Primary Tissue Architecture via Distance-Aware Spatial Modeling

Link: https://arxiv.org/abs/2606.28676
Authors: Sandesh Pokhrel,Hamid Manoochehri,Bodong Zhang,Beatrice S Knudsen,Tolga Tasdizen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Predicting the risk of distant metastasis from primary tumor tissue histology is a critical yet challenging task in computational pathology. Multiple Instance Learning (MIL) approaches can attend to subdomains in tumor regions that harbor features of metastatic cancer progression. However, MIL models treat tissue patches as unordered bags, discarding the spatial layout that defines the metastatic potential. We propose that metastatic risk is inherently dictated by the geometric arrangement of the tumor microenvironment at the interface with tumor cells. Our model is designed to explicitly capture the spatial relationships between tumor cells, tumor-associated fibroblasts, and infiltrating lymphocytes. For this purpose, we propose Distance aware Tissue Modeling for Multiple Instance Learning (DTMf-MIL), a novel method that reinforces visual features with explicit spatial priors. By computing signed distance functions (SDF) relative to tissue phenotypes, our model learns to recognize structural signatures of metastatic risk. This geometric awareness translates directly to superior clinical performance, as DTMf-MIL significantly outperforms state-of-the-art methods that ignore spatial layout on metastasis prediction from tissue in the primary tumor. We further validate our approach on public benchmarks, demonstrating that spatial awareness consistently improves diagnostic accuracy across diverse clinical tasks.
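
For readers unfamiliar with signed distance functions, the toy example below computes one on a small binary mask in plain Python. It is a brute-force sketch under assumptions, not the paper's pipeline: the function name, the sign convention (negative inside, positive outside), and the 5x5 mask are hypothetical, and a real implementation on whole-slide data would likely use an efficient Euclidean distance transform instead.

```python
import math

def signed_distance_map(mask):
    """Brute-force signed distance on a small binary grid.

    Each cell gets the Euclidean distance to the nearest cell of the
    opposite class: negative inside the region, positive outside it.
    """
    h, w = len(mask), len(mask[0])
    inside = [(r, c) for r in range(h) for c in range(w) if mask[r][c]]
    outside = [(r, c) for r in range(h) for c in range(w) if not mask[r][c]]
    if not inside or not outside:
        raise ValueError("mask must contain both inside and outside cells")

    sdf = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            targets = outside if mask[r][c] else inside
            d = min(math.hypot(r - tr, c - tc) for tr, tc in targets)
            sdf[r][c] = -d if mask[r][c] else d
    return sdf

# Hypothetical 5x5 "tissue" mask with a 3x3 region of interest in the middle.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
sdf = signed_distance_map(mask)
```

The sign tells a downstream model which side of a tissue boundary a patch lies on, and the magnitude tells it how far from that boundary, which is the kind of spatial prior the abstract describes.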

[CV-238] BackTranslation2.0 – A Linguistically Motivated Metric to Assess Sign Language Production ECCV2026

Link: https://arxiv.org/abs/2606.28673
Authors: Oliver Cory,Maksym Ivashechkin,Karahan Sahin,Oline Ranum,Jianhe Low,Edward Fish,Anton Pelykh,Ozge Mercanoglu Sincan,Richard Bowden
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ECCV 2026

Abstract:Sign Languages (SLs) are the primary means of communication for millions of deaf individuals, yet existing evaluation metrics for generated SL remain simplistic and poorly aligned with human judgements. We introduce BackTranslation2.0, a linguistically grounded evaluation metric for text-to-sign translation that moves beyond naïve backtranslation. Our approach adopts an agentic framework in which a deterministic pipeline orchestrates a suite of specialised tools to assess four scoring dimensions - grammatical correctness, phonological accuracy, motion fluency, and generation fidelity - aligned with human rater assessments. Tool outputs are not treated independently: a set of large language model (LLM)-based cross-referential comparison modules evaluates consistency across tools and checks outputs against linguistic expectations, enabling structured reasoning over grammatical, phonological, and motion-level evidence. Final dimension scores are computed through deterministic weighted formulas over validated tool outputs. To validate BackTranslation2.0, we introduce and evaluate on a British Sign Language (BSL) dataset rated in a human rater study across the same quality dimensions, following a protocol developed in collaboration between linguists and deaf experts, benchmarking against six baseline metrics. Our method demonstrates strong correlation with human judgements across all dimensions, providing a more comprehensive, interpretable, and linguistically principled evaluation framework for sign language production systems.

[CV-239] SemDynReg: Semantics-Guided Deformation Regularization for Dynamic 3D Gaussian Splatting

Link: https://arxiv.org/abs/2606.28656
Authors: Ruitao Chen,Mozhang Guo,Jinge Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deformable 3D Gaussian Splatting (3DGS) has emerged as an efficient approach for rendering dynamic scenes in a wide range of 3D applications. However, existing deformation field-based approaches largely lack explicit object-level modeling, often resulting in inconsistent Gaussian deformations within individual objects and unwanted coupling between different objects. To address this limitation, we introduce a semantics-guided framework that enforces dynamic regularization at the object level, aiming to achieve spatially consistent object-wise deformation. Specifically, we first extract segmentation masks using the Segment Anything Model (SAM) and derive semantic features from input images. An object-ID map is then constructed via feature relevance matching with a predefined object dictionary. Guided by this object-ID map, we identify the pixel-wise top-k contributing Gaussians for each object and impose consistency regularization on their deformation parameters, including position, scale, and rotation. Unlike prior methods that learn deformation fields without explicit object-level constraints, our approach incorporates semantic cues to guide deformation behavior at the object level. Experimental results demonstrate that our semantics-aware regularization improves object-level deformation consistency and outperforms baseline methods in rendering quality, achieving higher PSNR and SSIM and lower LPIPS in dynamic 3DGS rendering. Our project page is available at this https URL.

[CV-240] FedLAS: Feature-Modulated Bidirectional Label Smoothing for Neural Network Calibration ECCV2026

Link: https://arxiv.org/abs/2606.28654
Authors: Thiru Thillai Nadarasar Bahavan, Sachith Seneviratne, Saman Halgamuge
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ECCV 2026 Accepted Provisionally

Abstract:Deep Neural Network (DNN) classifiers suffer from poor calibration when their softmax outputs (predictive confidence) deviate from the empirical likelihoods. This manifests itself as either overconfident incorrect predictions or under-confident correct predictions. Label smoothing (LS) enhances model calibration by introducing entropy regularization during training through redistributing probability mass from the ground-truth label to the remaining classes. LS methods, including Margin-based LS (MbLS), rest on restrictive assumptions: they rely on predefined, uniform smoothing rules and only tackle overconfidence. In reality, samples exhibit diverse characteristics, such as difficulty/ambiguity, that interact with the evolving nature of the model being trained. In training, samples may have various degrees of under- or overconfidence. To overcome this, a mechanism that identifies the specific confidence state of each sample and determines the appropriate degree of smoothing in each training step is needed, tailoring the adjustment to the individual sample. We propose FedLAS: Feature-Modulated Bidirectional Label Smoothing, a plug-and-play algorithm for label smoothing-based losses. In FedLAS, we introduce a Feature Norm-based Confidence Indicator (NCI) to control smoothing and a Bidirectional Calibration Gating (BCG) module to detect both over- and under-confidence. Our algorithm can be integrated with LS- and MbLS-based losses when applied to standard DNNs, enhancing performance. Extensive experiments on standard and fine-grained high-resolution vision benchmarks show that FedLAS consistently improves calibration compared to modern baselines, reducing Expected Calibration Error (ECE) and Adaptive ECE while maintaining Top-1 accuracy. Code: this http URL
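
For readers unfamiliar with the two building blocks named above, here is a minimal sketch of plain uniform label smoothing and of the Expected Calibration Error. It shows only the baseline that FedLAS modifies, not FedLAS itself; the function names and the choice of 10 equal-width bins are assumptions made for illustration.

```python
def smooth_label(true_class, num_classes, eps=0.1):
    """Move eps of the probability mass from the true class to the others."""
    off_value = eps / (num_classes - 1)
    return [1.0 - eps if k == true_class else off_value
            for k in range(num_classes)]


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-weighted average of |accuracy - mean confidence|."""
    # Assign each prediction to an equal-width confidence bin.
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece
```

With a fixed `eps`, every sample is smoothed by the same amount and only toward lower confidence, which is the uniform, overconfidence-only behavior the abstract identifies as a limitation.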

[CV-241] Obliviate: Erasing Concepts from Autoregressive Image Generation Models ECCV2026

Link: https://arxiv.org/abs/2606.28643
Authors: Hossein Shakibania, Jonas Henry Grebe, Tobias Braun, Ege Aktemur, Saleh Aslani, Mehmet Görkem Yiğit, Marcus Rohrbach
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ECCV 2026

Abstract:The widespread adoption of generative AI models has intensified concerns about misuse, including the creation of unsafe or disturbing imagery. To mitigate such issues, several concept erasure approaches have been proposed to remove harmful content from multimodal generative models. Yet concept erasure for autoregressive image generation remains largely unexplored, despite the growing relevance of these models in recent trends toward unified multimodal architectures. In this work, we fill this gap by introducing Obliviate, a guidance-based concept erasure method for autoregressive image generation. Our method builds on three key design choices: KL-based supervision over visual token distributions, trajectory-level updates over full autoregressive rollouts, and aligned visual prefixes for stable target construction. We evaluate Obliviate on three state-of-the-art autoregressive text-to-image models, Liquid, Emu3-Gen, and Janus-Pro, covering the erasure of explicit content, graphic violence, and branded imagery. Obliviate consistently outperforms current alternatives, reducing nudity on the defensive RAB benchmark from 91.58 to 3.15 while preserving overall model utility.

[CV-242] AEGIR: Modeling Area Emitters for Indoor Inverse Rendering using Gaussian Splatting

Link: https://arxiv.org/abs/2606.28635
Authors: Mohamed Shawky Sabae, Philipp Langsteiner, Jan-Niklas Dihlmann, Hendrik Lensch
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Abstract:Inverse rendering requires separating illumination from surface materials, which is highly ambiguous due to their tight coupling in observed images. While Gaussian Splatting is efficient for novel view synthesis, existing relightable methods approximate scene lighting using discrete point lights, global environment maps, or implicit representations. By ignoring the physical spatial extent of real-world emitters, these approaches produce incorrect light attenuation and unrealistic shadows. We present AEGIR (Area Emitters for Gaussian Inverse Rendering), a framework that explicitly models local area emitters within a relightable Gaussian Splatting representation. Joint optimization of emitters, materials, and geometry is challenging due to flexible emitter parameterization, which increases both the number of parameters and the ambiguity between illumination and materials. We address this by introducing a differentiable deferred rendering pipeline that integrates multiple importance sampling with targeted regularization. As a result, AEGIR accurately simulates local light transport and achieves more consistent decomposition. Experiments show that explicit area emitters improve illumination reconstruction and enhance downstream tasks, including novel view synthesis, controlled relighting, and virtual object insertion, particularly in scenes with complex local lighting.

[CV-243] Physics-Grounded Disentangled Flow Modeling for Brain Disease Progression Trajectory

Link: https://arxiv.org/abs/2606.28630
Authors: Jun Wang, Peirong Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 8 figures

Abstract:Forecasting longitudinal brain lesion evolution is critical for disease monitoring and treatment planning. Existing approaches typically learn a direct mapping from a baseline image to a future observation, without explicitly modeling the physical mechanisms underlying lesion progression. Such entangled modeling of structural deformation and image intensity variation limits physical plausibility, model generalization, and interpretability. To address this, we propose PDF, a Physics-grounded Disentangled Flow matching framework for longitudinal brain disease forecasting. We explicitly decompose the longitudinal modeling of lesion growth into two processes, each learned by a dedicated flow matching network: morphology evolution, which captures lesion growth and structural deformation; and intensity evolution, which models signal changes driven by variations in lesion concentration. To enforce physics-grounded constraints, we introduce a PDE-regularized loss based on lesion growth dynamics that enforces a diffusion-reaction-advection formulation for morphological evolution. Experiments on three public longitudinal datasets spanning diverse brain diseases demonstrate state-of-the-art performance, validating the effectiveness of the disentangled modeling framework and physics-grounded learning design. Code is publicly available at this https URL.

[CV-244] SIGNET: Motion-Level Knowledge Transfer for Cross-Language Sign Language Translation ECCV2026

Link: https://arxiv.org/abs/2606.28626
Authors: Sobhan Asasi, Ozge Mercanoglu Sincan, Richard Bowden
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ECCV 2026

Abstract:Sign language translation (SLT) remains challenging due to its high spatio-temporal complexity, long sequences, and the need to model multiple articulators without relying on gloss annotations. Existing approaches are typically tailored to individual datasets or languages and struggle to scale, while overlooking the relationships between sign languages that could inform more effective cross-lingual transfer. We present SIGNET, a framework that enables motion-level knowledge transfer for cross-language sign language translation. Our key insight is that, although sign languages differ in grammar and lexicon, pretrained models capture motion-level visual patterns that can be reused across datasets and languages. SIGNET integrates multiple pretrained sign language backbones through an attention-based, hand-prior aggregation mechanism that guides a gated fusion network in dynamically selecting the most relevant experts. Comprehensive experiments on four benchmarks (How2Sign, Phoenix14T, CSL-Daily, and MeineDGS) demonstrate state-of-the-art translation performance, and SIGNET also surpasses prior methods on WLASL for sign language recognition.

[CV-245] Meshtryoshka: Differentiable Rendering of Real-World Scenes via Mesh Rasterization

Link: https://arxiv.org/abs/2606.28622
Authors: David Charatan, Daniel Xu, Richard Szeliski, George Kopanas, Vincent Sitzmann
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Daniel Xu and David Charatan contributed equally; author order decided by coin flip. Project website: this https URL

Abstract:Differentiable rendering has emerged as a powerful approach for 3D reconstruction and novel view synthesis. State-of-the-art differentiable rendering methods combine a variety of custom representations of 3D geometry and appearance with specialized renderers. However, most downstream tasks in computer graphics rely on 3D meshes. While prior work has attempted differentiable rendering with mesh representations, these approaches are limited to object-centric scenes and fail to reconstruct large-scale, unbounded scenes. In this work, we introduce Meshtryoshka, a novel mesh differentiable rendering framework that combines an off-the-shelf triangle rasterizer with a 3D representation that consists of nested mesh shells which resemble a matryoshka doll. In every forward pass, the mesh shells are extracted anew from a 3D signed distance function via iso-surface extraction, and the opacities for each vertex are computed as a function of signed distance. Each mesh shell is then rasterized independently, and the final image is created via alpha compositing. Crucially, mesh vertex positions are updated only indirectly via gradients that flow through the opacity values into the signed distance function, and hence, our method is compatible with off-the-shelf mesh renderers that need not be differentiable with respect to vertex positions. On object-centric scenes, our method performs competitively with surface-based differentiable rendering techniques. Our differentiable mesh rendering method scales to unbounded, real-world 3D scenes, where it yields high-quality novel view synthesis results approaching those of state-of-the-art, non-mesh methods. Our method suggests that it may be possible to solve the differentiable rendering problem without relying on specialized renderers, only using conventional tools from the computer graphics toolbox.
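
The final compositing step mentioned above can be sketched for a single pixel with the standard front-to-back alpha compositing rule. This is a minimal illustration of that general rule only; the scalar colors, the function name `composite_front_to_back`, and the layer ordering convention are assumptions, and the mapping from signed distance to opacity is not modeled because the abstract does not specify it.

```python
def composite_front_to_back(layers):
    """Alpha-composite one pixel from layers ordered nearest first.

    layers: list of (color, alpha) pairs, one per rasterized shell.
    Returns (composited_color, remaining_transmittance).
    """
    color = 0.0
    transmittance = 1.0  # fraction of light still passing through
    for layer_color, alpha in layers:
        color += transmittance * alpha * layer_color
        transmittance *= 1.0 - alpha
    return color, transmittance
```

Because each shell only contributes through its opacity and color, gradients with respect to opacity are available even when the rasterizer itself is not differentiable with respect to vertex positions, which is the property the abstract relies on.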

[CV-246] IMU-HOI: A Symbiotic Framework for Coherent Human-Object Interaction and Motion Capture via Contact-Conscious Inertial Fusion CVPR2026

Link: https://arxiv.org/abs/2606.28604
Authors: Lizhou Lin, Songpengcheng Xia, Zengyuan Lai, Lan Sun, Jiarui Yang, Ling Pei
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures. Accepted by CVPR 2026

Abstract:Capturing full-body human motion with object interactions is crucial for AR/VR and robotics applications, yet it remains challenging for conventional vision-based methods due to occlusions and constrained capture volumes. Inertial measurement units (IMUs) offer a compelling alternative without line-of-sight requirements, but existing IMU-based motion capture assumes an isolated human and ignores object contacts and dynamics. To bridge this gap, we present IMU-HOI, a novel framework that jointly recovers full-body human pose and 6-DoF object trajectory from sparse IMUs on the body and object, explicitly modeling human-object interaction. Our approach first infers probabilistic hand-object contacts directly from IMU streams and uses them as a high-level signal to route between kinematic and inertial reasoning. These contact cues drive a three-stage fusion pipeline that refines human pose and root translation, and fuses hand-based forward kinematics with object-IMU integration for object motion, yielding coherent, drift-resilient trajectories for both human and object. Experiments on challenging human-object interaction scenarios demonstrate substantial accuracy gains over prior inertial motion capture methods. Moreover, IMU-HOI can be plugged into existing sparse-IMU mocap backbones with minimal changes, effectively extending the scope of purely inertial motion capture from isolated humans to full human-object interaction and joint motion estimation.

[CV-247] SatSplat: Geometrically-Accurate Gaussian Splatting for Satellite Imagery

Link: https://arxiv.org/abs/2606.28581
Authors: Shuang Song, Jiyong Kim, Rongjun Qin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Photogrammetric Engineering & Remote Sensing

Abstract:High-resolution satellite imagery demands 3D reconstruction methods that deliver both speed and geometric accuracy. Recent adaptations of 3D Gaussian Splatting (3DGS) to satellite imagery demonstrate strong efficiency, but reconstruction quality often degrades under diverse illumination across multi-date, high-altitude acquisitions (with small intersection angles), limiting applicability to remote sensing and vision tasks. We present SatSplat, the first framework to adapt 2D Gaussian Splatting (2DGS) to satellite photogrammetry, with online camera adjustment. We approximate satellite cameras with an affine model and learn a minimal delta parameterization for in-splat camera refinement from dense observations. The formulation is implemented with a 2DGS scene representation. To handle time-varying shadows and illumination changes, we integrate geometric shadow mapping and per-camera color correction during training. Across the evaluated DFC2019 and IARPA2016 benchmark sites, SatSplat achieves strong geometric accuracy while significantly outperforming prior 3DGS-based baselines. On our processed DFC2019 benchmark, SatSplat reduces mean absolute error by 11.93% and peak video memory by 31% relative to the previous state of the art. Our approach enables large-scale digital surface modeling with practical computational efficiency. The project page is available at this https URL.

[CV-248] KM-Speaker: Keypoint-Based Style Control for High-Quality Speech-Driven 3D Facial Animation and Dialogue Localization

Link: https://arxiv.org/abs/2606.28568
Authors: Arthur Josi, Emeline Got, Abdallah Dib, Luiz Gustavo Hafemann, Rafael M. O. Cruz
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: 20 pages, 14 figures

Abstract:Speech-driven 3D facial animation methods face significant challenges in simultaneously achieving high-fidelity motion and precise artistic control at production quality. Existing controllable models typically learn global style control by relying on large-scale, low-quality in-the-wild datasets that compromise overall animation realism. Furthermore, these frameworks often lack the fine-grained temporal precision required for demanding tasks such as dialogue localization (e.g., dubbing), where matching specific facial expressions is as critical as lip synchronization. We present KM-Speaker (Keypoint-Matching Speaker), a novel keypoint-conditioned flow-based generative framework that provides both global style guidance and frame-level temporal control from reference performances. We propose a disentanglement strategy that separates audio-driven lip motion from keypoint-driven upper-face dynamics, together with a global style context preservation mechanism to ensure coherent full-face expressiveness. KM-Speaker advances example-based 3D facial animation by achieving high-fidelity motion and flexible controllability in a data-constrained setting, consistently outperforming state-of-the-art methods in lip-sync accuracy, style adherence, and expressive temporal control.

[CV-249] MammoFlow: Multiview Mammogram Synthesis with Anatomically Consistent Flow Matching MICCAI2026

Link: https://arxiv.org/abs/2606.28537
Authors: Yuexi Du, Leya Barrientos, Laura Sheiman, John Lewin, Hemant D. Tagare, Nicha C. Dvornek
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by MICCAI 2026

Abstract:Multiview mammography relies on paired craniocaudal (CC) and mediolateral oblique (MLO) views to provide complementary projections of a 3D breast volume, enabling precise anomaly localization. However, acquiring high-quality, balanced datasets remains challenging for deep learning applications. We propose a novel method to synthesize multiview mammograms by leveraging the inherent geometric relationship between CC and MLO views. To enforce an implicit 3D consistency prior during generation, we develop an alignment module that searches a 2D affine transformation subspace to establish optimal anatomical correspondence. Leveraging this alignment, we introduce a pixel-space self-consistency loss based on the Earth Mover’s Distance (EMD) between the 1D anteroposterior (AP) axis tissue distributions of the generated images. Integrated into a pretrained flow matching model, MammoFlow forces synthesized pairs to share physically plausible tissue distributions from the chest wall to the nipple. To our knowledge, this is the first work to guide multiview mammogram generation using implicit geometric tissue correspondence. Our method demonstrates superior image quality, passes expert radiologist evaluation, and generates physically consistent pairs that improve downstream classification AUC by 5%. Code is available at this https URL
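
The Earth Mover's Distance between two 1D distributions has a simple closed form: the sum of absolute differences between their cumulative distributions. The sketch below shows that computation for two histograms on the same unit-spaced bins. How MammoFlow extracts the anteroposterior tissue profiles is not described in the abstract, so the inputs here are hypothetical histograms and the function name `emd_1d` is an assumption.

```python
def emd_1d(p, q):
    """Earth Mover's Distance between two histograms on shared unit-spaced bins.

    Both histograms are normalized to sum to 1 before comparison.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    cdf_gap = 0.0   # running difference between the two CDFs
    distance = 0.0
    for a, b in zip(p, q):
        cdf_gap += a / total_p - b / total_q
        distance += abs(cdf_gap)
    return distance
```

Unlike a bin-wise difference, this distance grows with how far mass has to move along the axis, so two profiles that are shifted copies of each other are penalized in proportion to the shift.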

[CV-250] The Speedup Paradox: Rethinking Inference Speed-Quality Trade-off in Embodied Tasks

Link: https://arxiv.org/abs/2606.28529
Authors: Yujin Wang, Junli Chen, Yixuan Li, Shunan Dong, Huazhong Yang, Yongpan Liu, Hongyang Jia
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages

Abstract:Embodied foundation models have recently been widely used to improve robot generalization and task success rates. Previous works apply lossy efficient-inference techniques such as quantization, pruning, and asynchronous inference, accepting small action quality degradation in exchange for lower per-step computation cost and inter-action latency. However, unlike traditional static ML tasks, embodied tasks involve repeated interaction with the environment, and task-level performance is determined not only by per-step cost, but also by closed-loop effects unique to embodied execution, which remain insufficiently characterized in current efficient-inference studies. In this work, we propose TISED (Task-level Inference Speedup Effect Decomposition), an analytical framework that unifies diverse lossy inference optimization techniques and decomposes their effects on static and dynamic tasks, and uncovers some paradoxical effects on task-level performance: (1) on static tasks, optimization can sometimes lengthen end-to-end per-task completion time even as per-step latency drops; (2) on dynamic tasks, moderate lossy optimization can raise task success rate even above the baseline; and (3) the monotonicity and sweet-spot location of both effects can shift with hardware configuration. Together, our findings provide a new perspective on adapting inference optimization techniques to embodied tasks.
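
The first paradox can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes one possible mechanism that the abstract does not spell out: degraded action quality makes the robot need more steps to finish. All numbers and the function name `task_completion_time` are hypothetical and chosen only to show how a lower per-step latency can coexist with a longer task.

```python
def task_completion_time(num_steps, per_step_latency_s):
    """End-to-end time of one task as steps times per-step latency."""
    return num_steps * per_step_latency_s


# Hypothetical baseline: 100 control steps at 50 ms each.
baseline = task_completion_time(num_steps=100, per_step_latency_s=0.050)

# Hypothetical lossy optimization: each step is 20% faster, but the
# less precise actions require 40% more steps to complete the task.
optimized = task_completion_time(num_steps=140, per_step_latency_s=0.040)
```

In this toy setting the optimized policy takes about 5.6 s against about 5.0 s for the baseline, so a per-step speedup alone does not guarantee a task-level speedup.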

[CV-251] CLEAR-MoE: Shared-Basis Expert Extraction from Frozen Vision Transformers via Calibration-Driven Layer Selection

Link: https://arxiv.org/abs/2606.28516
Authors: Md Irtiza Hossain, Humaira Ayesha, Junaid Ahmed Sifat
Categories: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:We present CLEAR-MoE, a four-phase post-training pipeline that converts a frozen pretrained Vision Transformer (ViT) into a sparse Mixture-of-Experts (MoE) model without updating backbone weights. The pipeline (i) scores feed-forward network (FFN) layers by sparsity, clusterability, and output sensitivity; (ii) decomposes selected layers into a shared low-rank SVD basis and per-cluster residual experts using k-means clustering; (iii) trains lightweight routers supervised by cluster labels; and (iv) dispatches tokens through pluggable CUDA backends. On Imagenette with DeiT-Small, CLEAR-MoE retains 99.9% of the dense model’s accuracy (86.70 +/- 0.02% versus 86.73%). Extensive ablation studies reveal a consistent empirical finding: the shared SVD basis is the primary factor responsible for preserving accuracy. Random routing, learned routing, and three different router architectures produce nearly identical performance, with accuracy varying by at most 0.06 percentage points (86.62%-86.68%). Accuracy also remains stable across different SVD ranks, expert counts (2-8), calibration set sizes (50-500), and random seeds. This behavior generalizes across five ViT backbones (DeiT-Tiny, DeiT-Small, DeiT-Base, ViT-Small, and ViT-Base), covering models from 5.7M to 86.6M parameters, with accuracy differences ≤ 0.10 percentage points from their dense counterparts. On a GTX 960 GPU, routing and scatter-gather overhead make the CLEAR-MoE FFN 1.3-1.7x slower than the dense implementation. A dispatch microbenchmark further shows that routing is an order of magnitude more memory-bound than expert matrix multiplications, identifying fused dispatch kernels as a promising direction for future optimization.

[CV-252] JuZhou 1.0 Technical Report: The First Edge-Native Text-to-Image Foundation Model Trained Entirely on China-Developed AI Accelerators

Link: https://arxiv.org/abs/2606.28421
Authors: Ce Chen, Congrui Wang, Yonglin Li, Zhenchen Wan, Mingyang Geng, Junhao Xiao, Zhengpeng Xing, Yaqing Hu, Yao Wu, Zhaoyang Qu, Long Lan, Xinwang Liu, Yingqi Peng, Shijia Li, Zufeng Zhang, Chen Ma, Jingjing Zhou, Xingyu Wang, Qilin Lu, Bin Jiang, Qilin Sun, Shanzhi Gu, Yaoguang Jin, Tongliang Liu, Kede Ma, Yifan Peng
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Text-to-image (T2I) diffusion models typically require substantial computational resources and cloud infrastructure, posing significant challenges for edge deployment in terms of latency, cost, and user privacy. We present JuZhou 1.0, an ultra-lightweight T2I foundation model designed for fully offline, on-device execution. JuZhou 1.0 achieves its efficiency through four key designs: (1) a compact image-generation backbone consisting of a 0.385B-parameter denoising U-Net and a 1.90M-parameter distilled decoder, totaling approximately 0.387B parameters; (2) Rectified Flow training combined with DMD2 distillation, reducing inference to 4 sampling steps; (3) Chinese semantic alignment trained on 9M curated image-text pairs, enabling direct Chinese prompting without external translation at inference time; and (4) a training and distillation pipeline completed on domestically developed Sugon K100 AI accelerators without relying on NVIDIA GPUs for training or distillation. Despite its compact scale, the 28-step base model of JuZhou 1.0 achieves an overall GenEval score of 0.69, outperforming published baselines including SDXL (2.6B, 0.55), SD3-Medium (2B, 0.62), and IF-XL (4.3B, 0.61). We further validate the full poetry-to-image pipeline on Android and the core CLIP-U-Net-VAE generation branch on iOS. On a smartphone powered by the Snapdragon 8 Elite Gen 5 Mobile Platform, the 4-step U-Net denoising branch runs in approximately 1.6 seconds, while the full Android poetry-to-image pipeline takes 4.5 seconds with on-device prompt refinement on Xiaomi 17 Pro Max. These results position JuZhou 1.0 as a practical approach to mobile text-to-image generation and provide a concrete reference for Chinese-native generation, domestic-compute training, and fully offline on-device deployment after one-time installation.

[CV-253] MedDiffuseMix: Preserving Diagnostic Evidence with Saliency-Aware Diffusion Medical Image Data Augmentation

Link: https://arxiv.org/abs/2606.28419
Authors: Teerath Kumar, Raja Vavekanand, Muhammad Turab
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Under review at Signal, Image and Video Processing

Abstract:Limited data availability, class imbalance, and domain variability remain major barriers to reliable medical image classification. Conventional augmentation can improve training diversity but may distort diagnostically informative structures, whereas unconstrained generative augmentation may introduce label-inconsistent content. This paper proposes MedDiffuseMix, a saliency-guided diffusion mixing framework for controlled medical image augmentation. The method uses classifier-derived saliency maps to separate high-saliency diagnostic regions from low-saliency background areas and applies diffusion-guided mixing mainly to regions with lower diagnostic importance. Adaptive mixing, Gaussian boundary blending, and a saliency-preservation constraint reduce semantic distortion and reject or attenuate samples that shift model attention away from clinically relevant evidence. The framework is evaluated on four public benchmarks: the Radiological Society of North America pneumonia chest radiography dataset, Musculoskeletal Radiographs, PatchCamelyon, and the Breast Cancer Histopathological Image Classification dataset. Experiments with convolutional and transformer-based classifiers show that MedDiffuseMix improves accuracy, F1-score, and area under the receiver operating characteristic curve compared with standard augmentation, Mixup, GenMix, SaliencyMix, and diffusion-based augmentation baselines. Ablation studies confirm the importance of saliency guidance, adaptive region mixing, and smooth boundary blending. Visual attribution analysis further indicates that MedDiffuseMix better preserves diagnostically salient regions. These results suggest that saliency-guided diffusion mixing is an effective augmentation strategy for limited-data medical image classification.

[CV-254] DiffRGD: An Inference-Time Diffusion Guidance Through Riemannian Gradient Descent

Link: https://arxiv.org/abs/2606.28417
Authors: Jia-Wei Liao, Li-Xuan Peng, Mei-Heng Yueh, Min Sun, Cheng-Fu Chou, Jun-Cheng Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recently, diffusion models have been widely adopted in generative modeling and have served as foundational models for many image generation tasks. To control the generation without costly re-training or fine-tuning, many works seek inference-time guidance methods to steer the latent via a differentiable objective at inference time. However, these methods cannot effectively preserve the original Gaussian distribution because they introduce distributional drift, thereby degrading the sample quality. To address this gap, we propose DiffRGD, a distribution-aware guidance framework that explicitly preserves the latent Gaussian structure. DiffRGD formulates each sampling step as a constrained optimization problem on a spherical manifold induced by the latent Gaussian distribution, and solves it efficiently via Riemannian Gradient Descent (RGD). DiffRGD is a plug-and-play method that can be seamlessly integrated into any pre-trained diffusion model. Extensive experiments demonstrate that DiffRGD outperforms previous methods in most image restoration and conditional generation tasks. Our codebase is available at this https URL.
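
A single Riemannian gradient descent step on a sphere follows a generic recipe: project the Euclidean gradient onto the tangent space, take a step, then retract back onto the sphere. The sketch below shows only that recipe and is not the authors' implementation. Taking the sphere radius to be the current norm of the latent, using rescaling as the retraction, and the name `rgd_step_on_sphere` are all assumptions.

```python
import math


def rgd_step_on_sphere(x, euclid_grad, step_size):
    """One Riemannian gradient descent step on the sphere of radius ||x||.

    x: current point as a list of floats (must be non-zero).
    euclid_grad: gradient of the guidance objective at x (hypothetical).
    """
    radius = math.sqrt(sum(v * v for v in x))
    # Remove the radial component so the gradient lies in the tangent space.
    radial = sum(g * v for g, v in zip(euclid_grad, x)) / (radius * radius)
    tangent_grad = [g - radial * v for g, v in zip(euclid_grad, x)]
    # Step along the negative tangent gradient, then rescale onto the sphere.
    moved = [v - step_size * t for v, t in zip(x, tangent_grad)]
    moved_norm = math.sqrt(sum(v * v for v in moved))
    return [radius * v / moved_norm for v in moved]
```

The norm of the latent is unchanged by construction, which is one way to read the abstract's claim that guidance should not push the latent off the shell where Gaussian samples concentrate.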

[CV-255] AEGIS: A Semantic GAN and Evidential Learning Framework for Robust Adversarial Detection in Vision Sensors

Link: https://arxiv.org/abs/2606.28416
Authors: Maher Boughdiri, Mounira Msahli, Albert Bifet
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 25 pages, 5 figures. Submitted to Sensors Journal (Under Review)

Abstract:Deep neural networks (DNNs) have shown outstanding performance in visual recognition tasks within vision sensor networks; however, they are still vulnerable to adversarial manipulations and imperceptible perturbations that can lead to erroneous predictions. To address this, this paper presents AEGIS, a semantic-aware and uncertainty-guided adversarial detection framework designed for robust image classification in vision sensor pipelines. At its core, a SemantiGAN module functions as a multi-class semantic discriminator, identifying and filtering visually inconsistent adversarial inputs before they propagate further in the pipeline. For inputs that pass this stage, a stochastic augmentation process generates test-time variations, from which handcrafted instability metrics (FlipScore, Prediction Inconsistency, Layerwise Cosine Similarity at early and mid layers, and Entropy) are computed. These features are aggregated into a compact five-dimensional vector and processed by an Evidential Deep Learning (EDL) classifier, which models output evidence using a Dirichlet distribution to yield both class predictions and calibrated uncertainty estimates. Evaluations on the Tiny ImageNet dataset across six categories (clean, FGSM, PGD, patch-based, functional, and geometric attacks) demonstrate the effectiveness of AEGIS. The proposed framework achieves an AUROC of 92.1%, an AUPRC of 90.2%, and an accuracy of 90.7%, outperforming conventional softmax-based detectors in terms of detection performance, robustness, interpretability, and uncertainty calibration.
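
The way a Dirichlet distribution yields both a prediction and an uncertainty estimate can be sketched with the parameterization commonly used in evidential deep learning, where each class gets concentration `evidence + 1`. Whether AEGIS uses exactly this parameterization is an assumption, and the function name `dirichlet_prediction` is hypothetical.

```python
def dirichlet_prediction(evidence):
    """Map non-negative per-class evidence to probabilities and uncertainty.

    Returns (expected_probabilities, uncertainty), where uncertainty is
    1.0 with no evidence and shrinks toward 0 as evidence accumulates.
    """
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]      # Dirichlet concentration
    strength = sum(alpha)                     # total Dirichlet strength
    probs = [a / strength for a in alpha]     # mean of the Dirichlet
    uncertainty = num_classes / strength
    return probs, uncertainty
```

Unlike a softmax output, an input with no supporting evidence produces a uniform prediction together with maximal uncertainty, which is the signal a detector can threshold on.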

[CV-256] RSGPNet: Geometric Prompting for Remote Sensing Open-Vocabulary Semantic Segmentation

Link: https://arxiv.org/abs/2606.28410
Authors: Shanwen Wang, Xin Sun, Sirui Wang, Xiao Xiang Zhu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Open-vocabulary, Remote sensing, Geometric prompting, Multimodal large language model

Abstract:Open-vocabulary semantic segmentation (OVSS) enables text-guided segmentation of unseen objects, breaking fixed-class limitations to achieve open-world understanding. However, existing OVSS methods primarily focus on modifying the CLIP attention mechanism, which still suffers from unstable local segmentation in the remote sensing (RS) domain. To address these limitations, we propose RSGPNet, a training-free geometric prompting framework for RS OVSS that refines segmentation by leveraging object geometric areas and consistency constraints. Specifically, RSGPNet comprises three core modules: a Text-guided Coarse Mask module (TCM), a Geometric Re-prompting Module (GRP), and a Coarse-to-fine Consistency Verification Mechanism (CVM). TCM utilizes text prompts and the input image to construct initial coarse segmentation masks. GRP then converts these coarse masks into geometric box prompts, feeding them back into the segmentation model to generate refined masks. Finally, CVM employs consistency computation to prevent prompting from reinforcing erroneous regions. Together, these modules allow the model to improve segmentation accuracy in complex areas, such as category boundaries. Extensive experiments on RS datasets demonstrate that RSGPNet significantly outperforms state-of-the-art methods across both quantitative and qualitative metrics while exhibiting excellent interpretability. The code is released at this https URL.
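
The mask-to-box conversion and a consistency check between coarse and refined masks can be sketched as follows. The abstract does not say how the consistency computation is defined, so using intersection-over-union here is an assumption, as are the function names `mask_to_box` and `mask_iou` and the representation of masks as nested lists of 0/1 values.

```python
def mask_to_box(mask):
    """Tight (x_min, y_min, x_max, y_max) box around nonzero pixels, or None."""
    coords = [(x, y) for y, row in enumerate(mask)
              for x, value in enumerate(row) if value]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)


def mask_iou(a, b):
    """Intersection-over-union of two same-sized binary masks."""
    inter = union = 0
    for row_a, row_b in zip(a, b):
        for va, vb in zip(row_a, row_b):
            inter += 1 if (va and vb) else 0
            union += 1 if (va or vb) else 0
    # Two empty masks are treated as perfectly consistent.
    return inter / union if union else 1.0
```

A refined mask whose overlap with the coarse mask falls below some threshold could then be rejected in favor of the coarse one, which is one plausible way to keep a bad box prompt from reinforcing an erroneous region.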

[CV-257] Can AI Draw Science? A Benchmark for Evaluating Scientific Figure Generation by Text-to-Image and Multimodal Models

Link: https://arxiv.org/abs/2606.28406
Authors: Davie Chen
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Abstract:Text-to-image and multimodal generative models are increasingly used to produce scientific figures such as mechanism diagrams, experimental-design schematics, conceptual frameworks, and graphical abstracts. Yet existing image-generation benchmarks (e.g., GenEval, T2I-CompBench, DPG-Bench) evaluate natural images and measure compositionality, object counting, or photorealism. None of them measure what makes a generated scientific figure usable: correct and legible text labels, faithful depiction of entities and their relations, coherent diagrammatic structure, and adherence to disciplinary drawing conventions. We introduce SciDraw-Bench, a benchmark of 32 structured scientific-figure generation tasks spanning eight figure types and ten disciplines, where each task pairs a natural-language prompt with a machine-checkable specification of required labels, relations, components, conventions, and negative constraints. We propose a four-dimensional evaluation protocol: Text Fidelity (OCR-based label recall and character error rate), Semantic Correctness (vision-language-model judging against the specification), Structural Quality, and Convention Adherence, together with a meta-evaluation protocol and a preliminary inter-judge reliability analysis (human-rating validation is ongoing). We evaluate a domain-specific system, SciDraw AI, against representative general-purpose text-to-image models, and outline a code-to-figure baseline as a planned extension. In a pilot over all eight figure types, the domain-specific system substantially outperforms the general-purpose baselines on every dimension and figure type, with the largest gaps on semantic correctness and convention adherence; text fidelity remains the hardest dimension for all systems.
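
The two Text Fidelity quantities named above, label recall and character error rate, can be sketched with standard definitions. The benchmark's exact matching rules are not given in the abstract, so the case-insensitive exact match used for recall is an assumption, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def label_recall(required_labels, ocr_labels):
    """Fraction of required labels found among the OCR output."""
    found = {label.lower() for label in ocr_labels}
    hits = sum(1 for label in required_labels if label.lower() in found)
    return hits / len(required_labels)
```

Recall rewards a figure for containing every required label, while the character error rate captures how badly a rendered label is misspelled, so the two together separate missing text from illegible text.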

[CV-258] Enhancing Layer Interaction Using Key-Correlated Layer Attention

Link: https://arxiv.org/abs/2606.28405
Authors: Jianlong Xiong, ChuanBo Xie, Le Yu, Quansong He, Tao He
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in network architecture design have introduced layer attention to enhance inter-layer interactions. In such frameworks, each layer queries all preceding layers to establish cross-layer connections. However, layer attention results in quadratic computational complexity with respect to network depth. To mitigate this issue, prior works have proposed Recurrent Layer Attention (RLA) and linear attention mechanisms, which suffer from static information updates and limited long-range cross-layer dependency modeling. To overcome these limitations, we propose Key-Correlated Layer Attention (KCLA), inspired by our observation that Key representations in layer attention exhibit high cosine similarity. KCLA achieves linear computational complexity while preserving dynamic information updates, directly derived from the foundational definition of layer attention. Furthermore, KCLA maintains long-range cross-layer connections and features a fixed spatial complexity, independent of network depth. Empirical evaluations demonstrate that KCLA delivers good performance across diverse tasks, including image recognition, object detection, and medical image segmentation. The code is publicly available at this https URL.

[CV-259] DCSNet: Multiscale Feature Aggregation for Small Medical Object Segmentation with Detection-guided Hierarchical Cropping

Link: https://arxiv.org/abs/2606.28402
Authors: Shanfeng Zhang, Bo Gou, Yue Cao, Lei Zhang, Zhang Yi, Tao He
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Small object segmentation in medical imaging is primarily hindered by class imbalance and inherent boundary complexity. Consequently, conventional global networks frequently fail to detect sparse targets or suffer from severe edge degradation. To overcome these limitations, we propose the Detection-guided Cropping Segmentation Network (DCSNet), an end-to-end framework that transforms global dense prediction into a localized refinement process. This framework integrates two core components, namely Detection-guided Hierarchical Cropping (DGHC) and Multiscale Feature Aggregation (MSFA). The DGHC module leverages region proposals to dynamically extract object-centric features, effectively filtering out massive background interference to mitigate class imbalance. Subsequently, the MSFA module operates strictly within these purified regions, synergizing a Transformer encoder with a pixel-adaptive fusion strategy. This mechanism dynamically aggregates multiscale features to capture both semantic context and fine-grained details for sharp boundary delineation. Extensive experiments across three diverse medical datasets demonstrate that DCSNet significantly outperforms existing state-of-the-art methods, yielding substantial improvements in boundary precision and offering a highly robust solution for clinical micro-lesion segmentation.

[CV-260] Vision-driven Preference Synthesis for Mitigating Hallucinations in VLMs

Link: https://arxiv.org/abs/2606.28401
Authors: Yunhun Nam, Jongheon Jeong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 29 pages; Code is available at this https URL

Abstract:Vision-Language Models (VLMs) have shown strong performance in visual understanding, yet they still suffer from hallucinations, generating content that is not grounded in the image. Preference alignment is a promising approach to improve visual faithfulness, but its success depends heavily on how preference pairs are constructed. Existing methods exhibit two key limitations: (a) intervention-based methods often introduce significant deviation from the policy distribution, and (b) sampling-based methods often underuse visual information during the construction. In this paper, we propose ViPSy (Vision-driven Preference Synthesis), a framework for constructing preference data that are both policy-aligned and visually grounded. Our framework consists of two stages. In the first stage, ViPSy derives a visual cue from recurring object-level content across semantically aligned image variants, so preference construction can rely on visual information rather than language priors. In the second stage, ViPSy conditions the policy’s own rollouts on this cue, allowing candidates to be guided by visually grounded content while staying close to the policy’s response distribution. The resulting candidates remain close to the policy’s response distribution while better leveraging visual information from the image. Experiments show that the resulting VLM, preference-aligned with ViPSy-constructed preference pairs, achieves a new state-of-the-art in hallucination mitigation. Compared with the previous state-of-the-art method, it reduces hallucination rates on AMBER and Object HalBench by 35.7% and 24.5%, respectively. The resulting model further improves on general visual grounding benchmarks, e.g., MMStar, MMVP, and CV-Bench, while also yielding gains in semantic segmentation and ImageNet linear probing, underscoring the effectiveness of our framework in enhancing the model’s visual capabilities.

[CV-261] Meta-learning as a principle for human-like visual representations

Link: https://arxiv.org/abs/2606.28399
Authors: Can Demircan, Marcel Binz, Alireza Modirshanechi, Eric Schulz
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:The structure of human visual representations underpins our capacity for adaptive behaviour. While pretrained neural networks model human visual representations with unprecedented success, a large discrepancy remains. We propose one reason: these networks optimise a single fixed objective, whereas human representations must support open-ended tasks. We hypothesise this flexibility arises from meta-learning (learning to learn), a pressure shaping representations to acquire new tasks from few observations. To test this, we train a sequence model, without any supervision from human data, across thousands of semantically rich tasks mapping images to high-level concepts. Compared to their pretrained base encoders, meta-learned representations better predict human similarity judgements, semantic rule learning, and high-level visual cortex. Behavioural gains depend on disentangled, high-level task distributions, while brain alignment is driven primarily by the learning-to-learn pressure. Our results suggest the flexibility of human visual representations reflects the functional demand to learn new semantic relationships on the fly.

[CV-262] Semantic-Aware Generative Image Transmission for Resource-Constrained Visual IoT Systems

Link: https://arxiv.org/abs/2606.28398
Authors: Chenyang Zhang, Changwang Liu, Jinqi Zhu, Jiayi Chang, Yuxuan Wang, Shuqing He, Jia Guo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 11 pages, 6 figures

Abstract:Resource-constrained visual Internet of Things (IoT) systems, such as edge cameras, unmanned sensing platforms, industrial inspection nodes, and remote monitoring sensors, often need to transmit task-relevant visual evidence over low-rate wireless links to an edge/cloud service. Existing image communication methods usually compress or transmit complete global representations, leaving limited room to exploit receiver-side generative restoration. This paper proposes a semantic-aware generative image transmission framework for edge-assisted visual IoT. The image captured by an IoT visual sensor is encoded into a discrete token grid by a VQ encoder. At the IoT transmitter or nearby gateway, token recoverability, estimated from prediction entropy and local structure complexity, is fused with semantic importance obtained from instance segmentation and category-aware scoring. A spatial dispersal sampler then selects the tokens to be transmitted under a bitrate budget. The transmitter sends only the quantization indices of kept tokens and a binary mask map, while the edge/cloud receiver recovers masked tokens through MaskGIT with Halton sequence scheduling. Experiments on Kodak and VisDrone scenes under AWGN and Rayleigh channels show that the proposed method provides a flexible bitrate-quality tradeoff for narrowband visual IoT links. At 0.074 bpp, it uses 44.6% of the transmitted bits of the 0.167-bpp DeepJSCC/WITT reference while achieving 29.9 dB PSNR. A pseudo-GT downstream detection study on Kodak further shows that semantic-aware masking preserves task-relevant objects better than random masking at both 30% and 50% mask ratios.

[CV-263] CLOSER-VLN: Closed-Loop Self-Verified Retrieval-Augmented Reasoning for Aerial Vision-Language Navigation

Link: https://arxiv.org/abs/2606.28397
Authors: Shaoxuan Li, Xiangyu Dong, Xiaoguang Ma, Junfeng Chen, Haoran Zhao, Yaoming Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision-language navigation (VLN) has recently advanced with large language and multimodal models, enabling agents to follow natural-language instructions in unseen environments without training a task-specific navigation policy. However, most existing VLN methods relying on large models still adopt an open-loop decision-execution approach, where candidate actions are generated from instructions and observations but are rarely verified or corrected before execution. This causes critical issues in aerial VLN, where minor errors in intermediate actions may quickly accumulate into large trajectory deviations and lead to target loss. To address this issue, we propose Closed-loop Self-verified Retrieval-augmented Reasoning (CLOSER), a training-policy-free method that sequentially performs action reasoning, reliability verification, targeted retrieval, and action correction in a closed-loop manner before executing concrete actions. We instantiate CLOSER in aerial VLN tasks and develop the CLOSER-VLN framework, which is composed of three components: a hierarchical reasoner for generating candidate actions based on available information, a multidimensional action verifier for assessing the reliability of actions generated by the reasoner, and a verification-triggered multimodal retriever for retrieving targeted exemplars from a memory bank only when verification fails. We conduct experimental evaluations on the CityNav benchmark, where CLOSER-VLN achieves 32.01% SR and 21.28% SPL on the test-unseen split, confirming the effectiveness of closed-loop reasoning.

[CV-264] RadarTwin: Scene-Specific mmWave Radar Simulation and Learning for Mobile Indoor Perception

Link: https://arxiv.org/abs/2606.28396
Authors: Emily Bejerano, Federico Tondolo, Devang Gupta, Aaron Mano Cherian, Taeyoo Kim, Ayaan Qayyum, Xiaofan Yu, Xiaofan Jiang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Millimeter-wave (mmWave) radar perception is limited by data scarcity: models trained on existing radar datasets fail to generalize to new objects, environments, and sensing trajectories. We present RadarTwin, a framework for generating deployment-specific radar training data before real data collection. Given a 3D reconstruction of a target space (phone LiDAR, robot-mounted sensing, or RGB-to-3D), RadarTwin uses a vision-language model to infer radar-relevant surface materials and a physics-based ray tracer to synthesize raw frequency-modulated continuous-wave (FMCW) radar measurements with multi-bounce propagation. To study what transfers from simulation to reality, we collect a paired real-simulated dataset spanning household objects, material classes, distances, rotations, translations, and mobile sensing trajectories. We show that simulated and real radar share the same object-discriminative shape and material features, and that modeling the environment’s multipath is essential to matching real measurements. A representation trained on simulation alone recognizes real objects at 2.5 times chance with no real radar labels, and a few labeled examples raise this to 95.3% on a 12-way recognition task. RadarTwin enables training radar perception for a new space before any real radar data is collected there.

[CV-265] JASPR: Joint Spatial Representation learning of histology and spatial genomics for improved virtual genomic screening and clinical prognostication

Link: https://arxiv.org/abs/2606.28395
Authors: Marija Pizurica, Eric Zimmermann, Neil Tenenholtz, James Hall, Olivier Gevaert, Ava P. Amini, Lorin Crawford, Kristen A. Severson
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent studies have shown that spatial properties of tumors are critical for understanding disease biology and predicting patient outcomes. These spatial properties are increasingly uncovered through complementary modalities: spatial transcriptomics (ST) captures spatially-resolved molecular states, while hematoxylin and eosin-stained whole slide images (HE) reveal tissue morphology. While approaches are emerging to fuse these modalities, effective methods that learn not only joint representations but also incorporate spatial context across modalities are lacking. Here, we present JASPR (Joint Spatial Representation learning), a self-supervised deep learning framework that integrates HE images and ST data through a cross-modal reconstruction objective that incorporates spatial context within HE images and ST profiles. It employs shared modules to capture universal spatial properties across modalities, while modality-specific experts encode features unique to morphological and genomic data. We train and validate JASPR on breast cancer datasets, demonstrating that its learned joint representation substantially improves HE-based prediction of 9,248 genes and provides prognostic value for breast cancer outcomes.

[CV-266] GPU-Accelerated Inverse Structural Anastylosis from Block Collapse Dynamics

Link: https://arxiv.org/abs/2606.28394
Authors: L.A. Muñoz
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 20 pages, github link included, 6 figures

Abstract:The physical anastylosis of collapsed architectural monuments – the meticulous reassembly of fallen stone elements into their original structural configuration – represents one of the most intellectually demanding challenges in conservation science. Traditional approaches depend heavily on expert archaeologist judgement and manual block-by-block correspondence, a process that is both labour-intensive and inherently subjective. Inspired by the combinatorial complexity of this problem as manifested in the game of Jenga, we present Jenga Inverse Predictor, a GPU-accelerated deep learning framework that addresses structural anastylosis as an inverse prediction task. Given an image of a collapsed block assembly, JIP-2 reconstructs the most probable prior tower configuration by: (1) implementing a complete rigid-body physics engine with OBB/SAT collision detection and a Projected Gauss-Seidel (PGS) contact solver accelerated with Numba JIT and CuPy CUDA; (2) applying the analytical force thresholds of Ziglar (CMU, 2006) – $F_{app} = 3\mu_s m g$ (Y-axis, torque-free) and $F_{app} = 4\mu_s m g$ (X-axis, torque risk) – over three friction levels ($\mu_s \in \{0.25, 0.40, 0.60\}$) across 450 simulated episodes; (3) training a dual-stream ResNet-18 that injects a friction one-hot vector and jointly predicts block removal count, per-position removal probabilities, centre-of-mass imbalance, and Ziglar torque risk; and (4) generating a smooth 3-D video of the block-by-block reverse reconstruction. We discuss implications for computer-assisted anastylosis at the UNESCO Maya site of Uxmal, Yucatan, and provide a detailed technical description of the full pipeline, architecture, and loss formulation.
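
To make the force thresholds in step (2) concrete, the short sketch below evaluates them for the three friction levels named in the abstract. It assumes the thresholds read as F_app = 3 μ_s m g (Y-axis) and F_app = 4 μ_s m g (X-axis), with m the block mass and g gravitational acceleration. The function name and the example block mass are hypothetical and do not come from the paper.

```python
from typing import Dict

STANDARD_GRAVITY = 9.81  # m/s^2


def removal_force_thresholds(mu_s: float, block_mass_kg: float,
                             g: float = STANDARD_GRAVITY) -> Dict[str, float]:
    """Force thresholds in newtons under the assumed reading of the abstract.

    y_axis: F_app = 3 * mu_s * m * g  (described as torque-free)
    x_axis: F_app = 4 * mu_s * m * g  (described as carrying torque risk)
    """
    base = mu_s * block_mass_kg * g
    return {"y_axis": 3.0 * base, "x_axis": 4.0 * base}


if __name__ == "__main__":
    hypothetical_block_mass_kg = 0.02  # illustrative value, not from the paper
    for mu_s in (0.25, 0.40, 0.60):
        forces = removal_force_thresholds(mu_s, hypothetical_block_mass_kg)
        print(f"mu_s={mu_s:.2f}: "
              f"Y-axis {forces['y_axis']:.4f} N, X-axis {forces['x_axis']:.4f} N")
```

Under this reading the X-axis threshold is always 4/3 of the Y-axis threshold, whatever the friction coefficient or block mass.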

[CV-267] Transition-Aware best-of-N sampling for Longitudinal Chest X-ray Reports

Link: https://arxiv.org/abs/2606.28393
Authors: Halil Ibrahim Gulluk, Max Van Puyvelde, Wim Van Criekinge, Olivier Gevaert
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In longitudinal clinical practice, every chest X-ray is read in the context of the patient’s prior exam, and much of what the radiologist communicates is the change from one visit to the next. To the best of our knowledge, we present the first training-free best-of-N sampling scheme for pre-trained chest X-ray report generators that is explicitly aware of this longitudinal prior-to-current transition. We call it transition-aware best-of-N sampling: each report is split into sentences and embedded into an unordered set in $\mathbb{R}^d$; each (prior, current) pair is reduced to a fixed-dimensional directional vector via a set-to-set distance designed to encode the change between the two sets; and candidates are scored by cosine distance from their candidate transition vector to a cached bank of ground-truth training transition vectors, aggregated as min or kNN. We instantiate the framework with four directional set distances (mean-shift, novelty residual, directed-Hausdorff anchor, and cost-weighted optimal transport) and evaluate on a multi-visit AP-PA cohort, running inference under three prompts on three vision-language generators. Transition-aware best-of-N outperforms random selection across the board, with the largest relative gains on the Impression section.
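
The scoring pipeline described above can be sketched in a few lines. The version below is illustrative only and rests on several assumptions: it uses the mean-shift variant, taken here to be the difference between the mean sentence embedding of the current and prior reports; sentence embeddings are supplied by the caller as plain lists of floats; and kNN aggregation is taken to be the mean of the k smallest cosine distances. All function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence

Vector = Sequence[float]


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty set of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def mean_shift_transition(prior_set: Sequence[Vector],
                          current_set: Sequence[Vector]) -> List[float]:
    """Directional vector from the prior report to the current one (assumed form)."""
    prior_mean = mean_vector(prior_set)
    current_mean = mean_vector(current_set)
    return [c - p for c, p in zip(current_mean, prior_mean)]


def cosine_distance(u: Vector, v: Vector) -> float:
    """1 - cosine similarity; a zero vector is treated as distance 1."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 if norm == 0.0 else 1.0 - dot / norm


def score_candidate(prior_set: Sequence[Vector], candidate_set: Sequence[Vector],
                    bank: Sequence[Vector], k: Optional[int] = None) -> float:
    """Lower is better: distance from the candidate transition to the bank.

    k=None uses the minimum distance; otherwise the mean of the k nearest.
    """
    transition = mean_shift_transition(prior_set, candidate_set)
    distances = sorted(cosine_distance(transition, ref) for ref in bank)
    nearest = distances[:1] if k is None else distances[:k]
    return sum(nearest) / len(nearest)


def select_best(prior_set: Sequence[Vector],
                candidate_sets: Sequence[Sequence[Vector]],
                bank: Sequence[Vector], k: Optional[int] = None) -> int:
    """Index of the candidate report whose transition is closest to the bank."""
    scores = [score_candidate(prior_set, cand, bank, k) for cand in candidate_sets]
    return min(range(len(scores)), key=scores.__getitem__)
```

In use, `bank` would hold transition vectors computed once from training (prior, current) report pairs, and `candidate_sets` would hold the sentence embeddings of the N sampled reports for one study.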

[CV-268] RADIANT-PET: Reasoning-Augmented PET/CT Lesion Segmentation with Large Language Models and Reinforcement Learning

Link: https://arxiv.org/abs/2606.28392
Authors: Jiasheng Wang, Tanun Jitwatcharakomol, Piyawadee Jongpradubgiat, Simeng Zhu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Accurate lesion segmentation in PET/CT is critical for oncology, yet remains challenging because physiologic tracer uptake and artifacts can mimic malignant signal. We present RADIANT-PET, a reasoning-augmented framework that couples a high-sensitivity voxel-level segmentation model with lesion-level large language model (LLM) adjudication. Candidate uptake regions are generated with a deliberately permissive segmentation stage, then converted into structured textual descriptions that summarize uptake intensity, morphology, and regional and global anatomical context. An LLM classifies each candidate as true lesion vs. false positive, optionally leveraging the radiology report as additional clinical context. To strengthen lesion-level reasoning, we further optimize a local LLM via reinforcement learning using Group Relative Policy Optimization, rewarding correct lesion classification and anatomically concordant site assignment. Across AutoPET and an OSU test cohort, RADIANT-PET consistently outperforms strong image-only baselines, with the largest improvements observed when radiology reports are provided. Overall, these results demonstrate that LLM-based lesion-level reasoning adds a novel reasoning layer beyond conventional segmentation, suppressing physiologic false positives and aligning voxel-level predictions with clinical interpretation. The project repository is available at: this https URL.

[CV-269] Few-class Fidelity: Evaluating Explanations of Real-conditions CNN classifiers with Optimized Perturbations

Link: https://arxiv.org/abs/2606.28391
Authors: Wistan Marchadour, Pedro Soto Vega, Franck Vermet, Mathieu Hatt
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The wide use of Convolutional Neural Networks (CNN) in numerous domains and real-world classification applications is justified by their high precision and automation speed, helping users concentrate on higher-expertise tasks. To better understand the models and avoid bias during deployment, eXplainable Artificial Intelligence (XAI) techniques can be used after training. But as the list of XAI solutions expands, comparisons between them diverge, and consensus over their evaluation cannot be reached. This paper proposes a variation of Fidelity-based XAI metrics, with a focus on real-conditions applications, where the number of classes is often low. The approach generates in-distribution, uncertainty-provoking perturbations to ensure proper measurement of the XAI methods’ faithfulness. As a demonstration of the evaluation framework’s usefulness, it is compared with human-centric object localization and segmentation metrics. Once applied to both medical and natural imaging applications, it highlights the intricate correlation between domain, data curation, and XAI solution choices in order to validate training of a new CNN model.

[CV-270] Automated Quality Assessment of Geospatial Vector Data: A GeoAI Approach using Spatial Representation Learning

Link: https://arxiv.org/abs/2606.28390
Authors: Hao Li, Chen Chu, Filip Biljecki, Cyrus Shahabi, Wenwen Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Geospatial vector data quality is a foundational research topic in GIS, yet classic rule-based quality assessment algorithms often struggle with diverse urban morphologies and massive data volumes. Recently, Geospatial Artificial Intelligence (GeoAI) has shown promising potential for automating geospatial analysis, while its application to native vector data remains largely underexplored. To fill this research gap, we propose Topo4Vec, an automated GeoAI framework designed for scalable vector data quality assessment via advanced Spatial Representation Learning (SRL). Specifically, Topo4Vec relaxes the labor-intensive manual annotation process via topological error simulation, covering errors such as overlapping polygons and street network connectivity errors (e.g., overshoots and undershoots). Then, it leverages state-of-the-art SRL approaches to encode complex, native vector geometries (e.g., polylines and polygons) into a latent space where topological errors are isolated from valid ones. A systematic performance evaluation across three study areas (Los Angeles, Munich, and Singapore) demonstrates the effectiveness and robustness of Topo4Vec, achieving a peak accuracy of 0.99 for detecting overlapping building footprints and 0.60 for overshoots and undershoots in street networks. Moreover, lessons learned from Topo4Vec point toward a scalable and autonomous GeoAI approach for large-scale vector data consistency and quality monitoring within the fast-growing geospatial data ecosystems. The code and data used in the paper are made openly available at this https URL.

[CV-271] SoccerNet 2026 Player-Centric Ball Action Spotting: Per-Player Attention with Agreement-Based Ensembling SOCC

Link: https://arxiv.org/abs/2606.28389
Authors: Faisal Altawijri, Ismail Mathkour
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 2 pages, 1 figure, 2 tables. SoccerNet 2026 challenge technical report

Abstract:We present our submission to the SoccerNet 2026 Player-Centric Ball Action Spotting challenge, which uses a two-stage pipeline: a Track-Aware Action Detector (TAAD) produces per-player action logits from broadcast video, and a Denoising Sequence Transduction (DST) transformer converts game-state features and TAAD logits into structured event sequences. We improve the TAAD with a temporal transformer that adds cross-frame context, alongside several training fixes. For the DST stage, we introduce a two-stage per-player attention mechanism operating on game-state features, and show that a spatial-first attention ordering (cross-player attention before temporal attention) improves validation Macro-F1 by 1.87%. To exploit architectural diversity, we train four model variants and combine them with a Weighted Event Fusion ensemble that applies agreement filtering to suppress single-model false positives while preserving recall, plus a dedicated exception for the rare tackle class. Our final system improves the challenge Macro-F1 from a baseline of 48.6 to 58.94.

[CV-272] Data Provenance for Image Auto-Regressive Generation ICLR2026

Link: https://arxiv.org/abs/2606.28386
Authors: Bihe Zhao, Louis Kerner, Michel Meintz, Tameem Bakr, Franziska Boenisch, Adam Dziedzic
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at ICLR 2026

Abstract:Image autoregressive models (IARs) have recently demonstrated remarkable capabilities in visual content generation, achieving photorealistic quality and rapid synthesis through the next-token prediction paradigm adapted from large language models. As these models become widely accessible, robust data provenance is required to reliably trace IAR-generated images to the source model that synthesized them. This is critical to prevent the spread of misinformation, detect fraud, and attribute harmful content. We find that although IAR-generated images often appear visually identical to real images, their generation process introduces characteristic patterns in their outputs, which serve as a reliable provenance signal for the generated images. Leveraging this, we present a post-hoc framework that enables the robust detection of such patterns for provenance tracing. Notably, our framework does not require modifications of the generative process or outputs. It is therefore applicable in contexts where prior watermarking methods cannot be used, such as for generated content that is already published without additional marks and for models that do not integrate watermarking. We demonstrate the effectiveness of our approach across a wide range of IARs, highlighting its high potential for robust data provenance tracing in autoregressive image generation.

[CV-273] Zero-Label Driving Scenario Complexity Detection via Joint Embedding Predictive Architecture

Link: https://arxiv.org/abs/2606.28383
Authors: Santosh Jaiswal
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 12 pages, 6 figures

Abstract:Identifying complex and safety-critical driving scenarios in large unlabelled datasets is an important but expensive problem. Existing approaches rely on human annotators, supervised classifiers, or carefully engineered rule sets, all of which require substantial prior knowledge about what constitutes a difficult scenario. We ask whether a model can discover scenario complexity on its own, with no labels at any stage. We train a minimal Joint Embedding Predictive Architecture (JEPA) on structured agent state data from the nuPlan mini dataset and use the temporal prediction error as a zero-shot complexity score. Without access to any ground-truth labels during training or evaluation setup, the model assigns significantly higher scores to scenarios involving unprotected turns, crosswalk interactions, and pedestrian proximity, and significantly lower scores to lane-following and stationary-traffic scenarios. We validate this finding through four ablation experiments that isolate the source of the signal, and through a downstream anomaly detection evaluation that achieves Average Precision of 0.512 against a 0.436 chance baseline. The results show that temporal prediction error in a self-supervised latent world model is a practical proxy for driving scenario complexity.

[CV-274] Memory-Augmented LSTM Autoencoder for Unsupervised Activity Recognition with IMU Sensor Fusion

Link: https://arxiv.org/abs/2606.28377
Authors: Saeid Arabzadeh, Farshad Almasganj, Mohammad Mahdi Ahmadi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Human Activity Recognition (HAR) using Inertial Measurement Unit (IMU) sensors is vital for healthcare monitoring and rehabilitation. Despite deep learning advancements, major challenges remain: reliance on labeled data, multi-sensor fusion complexity, and the limited ability of unsupervised methods to capture spatiotemporal dependencies. These issues are pronounced in real-world scenarios with noisy data, overlapping activities, and missing labels. We propose a fully unsupervised spatiotemporal feature fusion framework using a memory-augmented autoencoder. It enhances activity representations via short temporal windows of multi-sensor IMU data, enabling real-time applications. Our framework extracts hierarchical static features via a Stacked Autoencoder, fusing them within and across sensors. A sequence-to-sequence LSTM Autoencoder then temporally refines these features, incorporating historical motion patterns without labels. We analyze key hyperparameters to identify configurations that maximize feature separability under short-window constraints. Evaluated on DaLiAc and PAMAP2 using realistic inter-class window segmentation, our method achieves 96.6% and 98.4% accuracy, respectively, surpassing supervised baselines and unsupervised approaches. Our method improves feature separability by up to 9% despite shorter temporal windows. While our realistic inter-class segmentation reduces accuracy by ~7%, it was intentionally adopted to better reflect real-world activity transitions and practical relevance.

[CV-275] GeoISF: Instance Semantic Forest Inspired Large-Scale Cross-View Geo-Localization via Ground LiDAR-to-Satellite Image

Link: https://arxiv.org/abs/2606.28371
Authors: Di Hu, Xia Yuan, Chunxia Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The problem of localization on a large-scale satellite image given a frame of query ground view point clouds remains challenging. Existing LiDAR-to-image cross-view localization methods struggle in large-scale scenarios due to limited semantic alignment and the modality gap between point clouds and satellite images. This paper introduces the large-scale LiDAR-to-image geo-localization pipeline called GeoISF. GeoISF introduces an instance semantic forest constructed using WordNet, which enhances temporal semantic representation and discriminative power by integrating semantic trees from multiple frames. By leveraging environmental semantic representation as a shared medium, GeoISF effectively bridges the modality gap and improves semantic matching accuracy. Extensive experiments demonstrate the superior performance of GeoISF in large-scale cross-view localization, 13.22 times better than the parallel LiDAR-to-image method in the R@10 metric on the KITTI dataset. The proposed method addresses the existing gap in large-scale LiDAR-to-image cross-view localization, offering a robust solution to the computational and accuracy challenges inherent in such scenarios. We will release the code as an open-source resource available online for the broader research community.

[CV-276] From Gradient Clipping to Structural Refinement: Improving DPSGD for Medical Image Segmentation

Link: https://arxiv.org/abs/2606.21763
Authors: Shiva Parsarad, Parth Shandilya, Isabel Wagner
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments:

Abstract:Medical image segmentation is widely used for disease detection but relies on sensitive data, raising privacy concerns as trained models can leak information. Differential privacy, typically implemented via Differentially Private Stochastic Gradient Descent (DPSGD), provides a solution, though at the cost of reduced utility. Recent DPSGD variants, including Automatic clipping (Auto-S), Normalised SGD with perturbation (NSGD), and Per-sample adaptive clipping (PSAC), have shown promise in image classification, but their behavior in medical segmentation remains underexplored. We evaluate these methods across binary and multi-class tasks and analyze gradient alignment, showing that prior assumptions, particularly for PSAC, do not consistently hold. We further demonstrate that combining clipping strategies with morphological refinement improves segmentation quality under privacy constraints. Finally, we propose an adaptive DP-Morph variant that captures class-specific structures and enhances performance in multi-class settings.
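
For readers unfamiliar with the baseline that these variants modify, the sketch below shows the textbook DPSGD aggregation step: clip each per-sample gradient to a fixed norm, sum, add Gaussian noise scaled by the clipping norm, then average. It is a generic illustration rather than the paper's code; the function names are hypothetical, and gradients are represented as plain lists of floats for simplicity.

```python
import math
import random
from typing import List, Sequence


def clip_gradient(grad: Sequence[float], clip_norm: float) -> List[float]:
    """Rescale a per-sample gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0.0 else 1.0
    return [g * scale for g in grad]


def dpsgd_noisy_mean(per_sample_grads: Sequence[Sequence[float]],
                     clip_norm: float,
                     noise_multiplier: float,
                     rng: random.Random) -> List[float]:
    """Clip, sum, add Gaussian noise with std noise_multiplier * clip_norm, average."""
    clipped = [clip_gradient(g, clip_norm) for g in per_sample_grads]
    summed = [sum(column) for column in zip(*clipped)]
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [value / len(per_sample_grads) for value in noisy]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the illustration repeatable
    grads = [[3.0, 4.0], [0.3, 0.4]]
    print(dpsgd_noisy_mean(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng))
```

The variants named in the abstract change how the per-sample scale factor in `clip_gradient` is chosen; their specific rules are not reproduced here.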

[CV-277] Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task NEURIPS2025

Link: https://arxiv.org/abs/2512.10359
Authors: Sunqi Fan, Jiashuo Cui, Meng-Hao Guo, Shuojin Yang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by NeurIPS 2025 main track

Abstract:Video Question Answering (VideoQA) task serves as a critical playground for evaluating whether foundation models can effectively perceive, understand, and reason about dynamic real-world scenarios. However, existing Multimodal Large Language Models (MLLMs) struggle with simultaneously modeling spatial relationships within video frames and understanding the causal dynamics of temporal evolution on complex and reasoning-intensive VideoQA task. In this work, we equip MLLM with a comprehensive and extensible Video Toolkit, to enhance MLLM’s spatiotemporal reasoning capabilities and ensure the harmony between the quantity and diversity of tools. To better control the tool invocation sequence and avoid toolchain shortcut issues, we propose a Spatiotemporal Reasoning Framework (STAR) that strategically schedules temporal and spatial tools, thereby progressively localizing the key area in the video. Our STAR framework enhances GPT-4o using lightweight tools, achieving an 8.2% gain on VideoMME and 4.6% on LongVideoBench. We believe that our proposed Video Toolkit and STAR framework make an important step towards building autonomous and intelligent video analysis assistants. The code is publicly available at this https URL.

[CV-278] A multi-architecture study of specificity refinement and false-positive mechanism analysis in prostate MRI

Link: https://arxiv.org/abs/2606.29977
Authors: Yongbo Shu, Kewen Chen, Yifeng Yuan, Zirui Xin, Luo Lei, Yang Yang, Xi Chen, Aijing Luo
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 29 pages, 6 figures, 5 tables

Abstract:Objectives: To characterize residual false positives in prostate MRI detection, and to evaluate a lightweight post-hoc refinement head for case-level specificity. Materials and Methods: This retrospective study used PI-CAI (5-fold cross-validation) and Prostate158 (n=158; external). A context-aware evidence head and an 89,216-parameter refinement head were trained on a frozen detection backbone; the evidence head was also trained on four further backbones (bare nnU-Net, bare U-Net, bare Mamba, MIGF-Mamba). For each false-positive region, T2-weighted, apparent-diffusion-coefficient, and high-b-value contrast ratios versus peri-lesional rings were compared against ground-truth lesions and contralateral benign regions. Results: False positives were closer to true cancers than to benign tissue in evidence and raw T2-weighted and apparent-diffusion-coefficient contrast, reproducing 35/35 across five architectures (Cohen’s d 1.10; FP/benign evidence ratio 2.38x) and 105/105 across modality-perturbation scenarios. On PI-CAI fold-0, refinement raised case-level specificity from 0.469 to 0.549 (+17.2%) at preserved sensitivity (0.943); 5-fold cross-validation showed fold-conditional behavior (9/15 observations positive; range -22% to +28%). On Prostate158, both models saturated (McNemar pooled p=0.69), while the false-positive contrast-matching finding replicated. Conclusion: Residual false positives are contrast-matched to cancer (sharing raw imaging features rather than histologically confirmed mimicry), reproducing across five architectures – a data-level imaging property, not model-specific artifacts; post-hoc refinement adds practical specificity in-domain but is fold-conditional.

[CV-279] VIB-AVSR: Variational Information Bottleneck for Noise-Robust LLM-Based Audio-Visual Speech Recognition INTERSPEECH2026

Link: https://arxiv.org/abs/2606.29632
Authors: Piyush Arora, Navlika Singh, Umberto Cappellazzo, Stavros Petridis, Maja Pantic
Categories: Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Comments: Accepted to INTERSPEECH 2026. Our code is available at this https URL

Abstract:Audio-Visual Speech Recognition takes two input modalities, acoustic and visual streams, where visual information from lip movements aids recognition when audio is noisy. Recently, LLM-based AVSR models have emerged as a promising paradigm by connecting pre-trained audio-visual encoders to an LLM, achieving strong results in clean conditions. However, these models are predominantly optimized for clean acoustic conditions, with limited attention to making the LLM backbone robust to noise. No explicit mechanism is employed to produce stable representations under corrupted audio, leading to performance degradation in noisy environments. To address this, we propose VIB-AVSR, which integrates Variational Information Bottleneck layers at targeted positions within the LLM backbone to regularize representations. VIB-AVSR reduces degradation under noisy conditions across multiple SNR levels and noise types, without requiring architectural modifications or additional training data.

[CV-280] A Self-Supervised Learning Framework for Video Encoding Complexity Clustering

Link: https://arxiv.org/abs/2606.29166
Authors: Krishna Srikar Durbha, Hassene Tmar, Ping-Hao Wu, Ioannis Katsavounidis, Alan C. Bovik
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review

Abstract:Adaptive video streaming is a widely used technique for delivering video content over the internet. One of the key challenges is determining the optimal encoding settings for each video, which can vary significantly based on its content and characteristics. In this paper, we propose Compression Echo Contrastive Learning (CECL), a novel self-supervised learning framework for clustering videos based on their encoding complexity. Our method leverages the response of a video to compression - the Compression Echo - as a supervisory signal, allowing the model to capture underlying encoding characteristics during pretraining. We conduct extensive experiments to demonstrate the effectiveness of our learned representations for the downstream task of clustering videos by their encoding complexity. Our results show that CECL improves upon existing state-of-the-art visual encoders and delivers strong bitrate and quality savings against the fixed bitrate ladder.

[CV-281] Complete virtual unwrapping and reading of a rolled Herculaneum papyrus

Link: https://arxiv.org/abs/2606.29085
Authors: Giorgio Angelotti, Stephen Parsons, Federica Nicolardi, Youssef Nader, Sean Johnson, David Josey, Paul Henderson, Hendrik Schilling, Johannes Rudolph, Forrest McDonald, Elian Rafael Dal Prá, Paul Tafforeau, Alessandro Mirone, Clifford Seth Parker, Jan Paul Posma, Benjamin Kyles, Claudio Vergara, Alessia Lavorante, Rossella Villa, Maria Chiara Robustelli, Marzia D’Angelo, Gianluca Del Mastro, Michael McOsker, Kilian Fleischer, Christy Chapman, Nat Friedman, William Brent Seales
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Instrumentation and Detectors (physics.ins-det)
Comments: Preprint, 4 main figures

Abstract:The carbonized papyri from Herculaneum preserve the only large-scale library to survive from classical antiquity, but many unopened rolls remain unread because physical opening risks irreversible damage. X-ray computed microtomography ($\mu$CT) and virtual unwrapping offer a non-invasive route to their texts, yet previous work on sealed Herculaneum scrolls has recovered only localized readings or limited surface regions. Here, using high-resolution phase-contrast $\mu$CT acquired on the BM18 beamline at the European Synchrotron Radiation Facility (ESRF), together with improved computational unrolling and machine learning, we achieve the complete virtual unwrapping and reading of PHerc. 1667 under explicit coverage and papyrological-review criteria. This makes PHerc. 1667 the first Herculaneum papyrus to be fully digitally unrolled and read for extended scholarly study without physical opening. In PHerc. Paris 4, the optimized scan protocol makes ink directly visible in the tomographic volume, allowing three-dimensional ink segmentation and independent validation of surface-conditioned ink recovery. In PHerc. 139, we recover title and author-attribution evidence identifying the scroll as Philodemus, On Gods, Book 8. These results move virtual unwrapping of the Herculaneum scrolls beyond isolated demonstrations towards a scalable framework for systematic recovery of the still-unopened library.

[CV-282] BLUE: A Stale-Pixel Optical-Flow Compositor for Entropy-Efficient Surveillance Video Encoding

Link: https://arxiv.org/abs/2606.28753
Authors: Shubham Baid, Akash James, Sahil Chachra, Nishant Sinha, Kunal Kislay
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 Tables

Abstract:Continuous-recording surveillance systems face a storage problem that codec tuning alone cannot fully solve: even at aggressive CRF settings, a static-camera scene spends most of its bits re-encoding a background that has not changed. We present BLUE, a pre-encode compositor that exploits this structure by maintaining a persistent seed frame of the background and substituting background pixels with seed pixels before the encoder runs. The encoder then emits near-free SKIP macroblocks for the frozen background, while live pixels in foreground regions are carried unchanged at full quality. We evaluate BLUE on all 308 annotated short subclips from the VIRAT Ground Surveillance Release 2.0 dataset using a six-point CRF sweep with both x264 and x265. At CRF 28, BLUE reduces file size by a mean of 34.6% (x264) / 39.4% (x265) on 95.8% / 99.4% of clips respectively. Foreground-region PSNR, computed only over VIRAT object-annotation bounding boxes, is preserved or improved on 60.7% of clips (+0.36 dB mean, +5.48 dB maximum). Full-frame perceptual quality (VMAF) drops by a median of 6.75-8.59 points; we quantify and disclose this trade-off explicitly. A lightweight deployment gate measuring the compositor’s own VMAF on a 2-second prefix identifies the 40% of clips where even full-frame quality degradation is near-imperceptible (Delta VMAF = -2.9), enabling a selective-activation strategy that retains both the storage benefit and acceptable perceptual fidelity.
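
The substitution step described in this abstract can be illustrated with a minimal sketch. It assumes a foreground mask is already available (the paper's title suggests optical flow is involved in deriving it, which is not modeled here); `composite_with_seed` and the toy arrays are hypothetical names and data, not the authors' implementation.

```python
import numpy as np

def composite_with_seed(frame, seed, foreground_mask):
    """Keep live pixels where foreground_mask is True; elsewhere reuse the stale seed pixels."""
    # Broadcast a 2-D mask over the channel axis for colour frames.
    mask = foreground_mask[..., None] if frame.ndim == 3 else foreground_mask
    return np.where(mask, frame, seed)

# Toy 4x4 grayscale example: slightly noisy background plus one 2x2 moving object.
rng = np.random.default_rng(0)
seed = np.full((4, 4), 100, dtype=np.uint8)
frame = (seed + rng.integers(0, 3, size=seed.shape)).astype(np.uint8)  # sensor noise
frame[1:3, 1:3] = 200                                                   # foreground object
fg = np.zeros((4, 4), dtype=bool)
fg[1:3, 1:3] = True

out = composite_with_seed(frame, seed, fg)
```

After compositing, every background pixel is bit-identical to the seed frame, which is the property that would let an encoder skip those blocks, while the foreground pixels pass through unchanged.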

[CV-283] Envisage: Diffusion-Based Rhinoplasty Goal Visualization with Mask-Decomposed Evaluation

Link: https://arxiv.org/abs/2606.28628
Authors: Mudit Agarwal, Amit D. Bhrany
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 29 pages, 4 figures, 22 tables

Abstract:Localized generative editing needs localized evaluation: full-image identity metrics are structurally confounded under hard-composited edits. We present Envisage, a FLUX.1-Fill inpainting reference pipeline for rhinoplasty goal visualization from a single frontal photograph. The pipeline combines 8 rhinoplasty clinical presets (the released framework also includes 8 blepharoplasty and 8 rhytidectomy presets), MediaPipe masks, and hard-mask compositing. The composite preserves outside-mask pixels by construction, so full-face identity scores are dominated by copied pixels rather than by the diffusion backbone. Because full-face identity metrics cannot grade localized edits, we introduce SurgicalScore, a mask-decomposed 0-1 protocol scoring edit direction, edit magnitude, masked LPIPS, realism, and outside-mask preservation; SS_raw assigns 0.919 [0.918, 0.920] to a perfect-predictor control, anchoring the ceiling. On N=211, the paired ArcFace gain (output-to-GT minus input-to-GT) is negative for all methods (Envisage -0.048 smallest, vs. ICEdit -0.139, Kontext -0.242, InstructPix2Pix -0.294; p < 1e-4), with external validation on a 457-pair ASPS/PCA corpus showing a larger negative gap. With SurgicalScore, Envisage achieves the highest score (0.599 [0.579, 0.619]) and leads on both metrics, but the all-negative ArcFace gap shows that full-face identity is poorly aligned with localized surgical accuracy under hard compositing. A 5-seed GT-oracle (an upper bound, not a deployable result) reduces the residual ArcFace gap by 73% (-0.054 to -0.015), with positive output-to-GT gain on 33.9% of cases, indicating candidate-space headroom for a learned ranker. For localized edits, progress should be measured with edit-region fidelity rather than full-face identity metrics. We release Envisage, SurgicalScore, preset definitions, and matched split manifests.

[CV-284] Anatomy-Grounded Synthetic Coronary Angiography for Geometry-Informed Multi-View Matching MICCAI2026

Link: https://arxiv.org/abs/2606.28474
Authors: In Kyu Lee, Sumin Seo, Jaesik Min
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at MICCAI 2026. Code and dataset: this http URL

Abstract:Accurate correspondence matching across multiple angiographic views is the prerequisite for 3D coronary reconstruction and interventional guidance. However, the development of robust deep learning models for this task has been stifled by a fundamental data bottleneck. Obtaining ground truth for matching tasks in angiography pairs is prohibitively expensive and hard to scale. To overcome this barrier, we introduce a physically-grounded data generation framework that synthesizes high-fidelity Digital Reconstructed Radiographs (DRRs) from 3D Coronary CT Angiography (CCTA) volumes. Our framework generates dense, highly accurate 3D-to-2D projection labels by simulating realistic C-arm acquisition geometry on patient anatomy at zero human cost. Leveraging this dense supervision, we propose a Geometry-Informed Matching Module (GIMM) that integrates global feature and anatomical structure into correspondence learning. Unlike real angiography where assessment relies on subjective human annotation, our dataset provides 2D correspondence labels with paired images, allowing human-free evaluation. We comprehensively evaluate our method on the proposed CT-derived DRR dataset and demonstrate improvements over other matching baseline models.

[CV-285] DeVAR: Low-Dose CT Denoising via Visual Autoregressive Modeling

Link: https://arxiv.org/abs/2606.28453
Authors: Xizhuo Zhang, Yannian Gu, Zhongzhen Huang, Shaoting Zhang, Xiaofan Zhang
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Computed tomography (CT) plays a crucial role in medical diagnosis, but minimizing radiation exposure while maintaining image quality remains a critical challenge. Low-dose CT (LDCT) protocols reduce radiation risks but inevitably suffer from severe noise and artifacts that compromise diagnostic accuracy. While existing deep learning methods have achieved promising results, there remains a continuous quest for generative paradigms that intrinsically capture global-to-local structural dependencies to better preserve fine anatomical details. To this end, we propose DeVAR, a novel generative framework that applies visual autoregressive modeling (VAR) to LDCT denoising for the first time. Conditioned on global context provided by LDCT prefix tokens, DeVAR progressively generates discrete token maps of the target normal-dose CT (NDCT) via next-scale prediction. Because quantization inherently discards high-frequency information, we introduce a residual refiner to capture subtle anatomical structures beyond the capacity of a discrete codebook. Finally, empowered by a dual-representation hybrid training strategy, our hybrid NDCT decoder seamlessly integrates continuous and discrete latents to reconstruct high-fidelity, detail-preserved images. Extensive experiments on two public datasets demonstrate that DeVAR consistently achieves superior qualitative and quantitative performance compared to state-of-the-art LDCT denoising methods.

[CV-286] Establishing the Minimal Clinically Important Difference (MCID) for Smartphone-Derived Gait Measures in Multiple Sclerosis

Link: https://arxiv.org/abs/2606.28449
Authors: Mike D Rinderknecht, Bernhard Fehlmann, Dimitar Stanev, Cedric Simillion, Ernst Bos, Letizia Leocani, Agne Kazlauskaite, Gary Cutter, Helmut Butzkueven, Licinio Craveiro
Categories: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: 40 pages

Abstract:Background: Digital health technologies allow for frequent, remote gait monitoring in people with multiple sclerosis (MS). However, to differentiate daily variability from actual disease progression in longitudinal data, established minimal clinically important differences (MCID) are required. Currently, there is limited literature defining these thresholds for digital gait metrics. Objective: To establish MCIDs for digital gait measures reflecting progression in MS. Methods: Digital gait measures were captured via daily, remote, smartphone-based Two-Minute Walk Tests in CONSONANCE (NCT03523858), a phase 3b study of ocrelizumab in progressive MS. Using an anchor-based approach, median changes from baseline at Week 96 on digital gait measures were computed for patients showing clinically meaningful worsening on either Timed 25-Foot Walk, Ambulation Score, Expanded Disability Status Scale, or 12-item Multiple Sclerosis Walking Scale. These changes were subsequently triangulated to derive the MCID estimates. Results: 243 patients with progressive MS (female: n=125 (51%); mean [SD] age: 49.3 [9.3]; mean [SD] EDSS: 4.8 [1.4]) had digital gait data available at baseline and Week 96. Median changes were generally consistent across anchors. Triangulated MCIDs are: Step Velocity = -0.16 m/s, Step Velocity Scaled to Walking Time = -0.18 m/s, Step Duration = 0.06 s, Step Length = -0.07 m, Total Number of Steps = -28, and Total Distance Walked = -24 m. Conclusion: These MCIDs provide a framework for interpreting meaningful gait changes and integrating digital measures into MS outcome evaluation. Beyond facilitating novel clinical trial endpoints to evaluate treatment efficacy, they enable objective, real-world monitoring to advance personalized patient care.
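
The anchor-based procedure in this abstract can be sketched roughly as follows. The patient values and anchor flags are made up, the anchor names are placeholders, and combining the per-anchor medians with another median is an assumption: the abstract says the estimates were "triangulated" without stating the exact rule.

```python
from statistics import median

def anchor_median_change(changes, worsened):
    """Median change from baseline among patients flagged as worsened on one anchor."""
    selected = [c for c, w in zip(changes, worsened) if w]
    return median(selected)

def triangulate(per_anchor_estimates):
    """Combine per-anchor estimates. Using the median here is an assumption."""
    return median(per_anchor_estimates)

# Made-up step-velocity changes (m/s) at follow-up for six hypothetical patients.
changes = [-0.20, -0.05, -0.15, 0.02, -0.18, -0.01]
# Hypothetical worsening flags on two clinical anchors.
anchors = {
    "anchor_a": [True, False, True, False, True, False],
    "anchor_b": [True, False, False, False, True, False],
}
per_anchor = {name: anchor_median_change(changes, flags) for name, flags in anchors.items()}
mcid = triangulate(list(per_anchor.values()))
```

Each anchor selects a different subset of worsened patients, so agreement between the per-anchor medians is what makes the combined threshold credible.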

[CV-287] Measured-Subspace Consistency: A Plug-and-Play Operator for Diffusion Posterior Sampling in Accelerated MRI Reconstruction

Link: https://arxiv.org/abs/2606.28448
Authors: Junhyeok Lee, Kyu Sung Choi
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Diffusion posterior samplers for accelerated MRI can reconstruct accurately yet still disagree on the acquired k-space across samples, placing posterior variability on coefficients the scanner has already measured. We identify this measured-subspace leakage as a physical-admissibility failure. Under a hard-constraint model it violates the measurement constraint and inflates the reported uncertainty with disagreement about coefficients the scanner has already determined. To quantify this leakage, we introduce complementary measured- and unmeasured-subspace k-space dispersion metrics (MSD/USD). We then present Measured-Subspace Consistency (MSC), a training-free terminal correction that wraps any compatible image-space posterior sampler with a standard multi-coil consistency lock. The ideal lock follows classical range/null-space data consistency. Our contribution is to repurpose it as a black-box posterior audit and correction rather than a new reconstructor or learned sampler. Theoretically, we prove that the ideal transform confines pairwise sample differences to the MRI null space and bound the residual cross-subspace coupling left by practical sensitivity-weighted implementations. Across six base samplers and two MRI anatomies, including out-of-distribution transfer where a knee prior reconstructs brain, MSC substantially reduces measured-subspace dispersion for Soft samplers (a median 16.5x reduction for DPS across five brain contrasts, up to ~29x), while preserving unmeasured-subspace diversity and acting as a near-identity map for Consistent ones. Furthermore, MSC maintains or modestly improves PSNR/SSIM, with no retraining, retuning, or significant computational overhead.
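
A single-coil toy version of the classical range/null-space data-consistency step that this abstract builds on can be sketched as below. This is not the authors' multi-coil MSC operator; `measured_subspace_lock`, the random sampling mask, and the noisy "samples" are illustrative assumptions.

```python
import numpy as np

def measured_subspace_lock(image, measured_kspace, mask):
    """Overwrite sampled k-space entries with the acquired values; keep the rest."""
    k = np.fft.fft2(image, norm="ortho")
    k_locked = np.where(mask, measured_kspace, k)
    return np.fft.ifft2(k_locked, norm="ortho")

rng = np.random.default_rng(0)
truth = rng.standard_normal((8, 8))
mask = rng.random((8, 8)) < 0.4                      # hypothetical sampling pattern
y = np.where(mask, np.fft.fft2(truth, norm="ortho"), 0)

# Two stand-in "posterior samples" that disagree with the measurements.
s1 = truth + 0.3 * rng.standard_normal((8, 8))
s2 = truth + 0.3 * rng.standard_normal((8, 8))
c1 = measured_subspace_lock(s1, y, mask)
c2 = measured_subspace_lock(s2, y, mask)

# Pairwise difference in k-space, split into measured and unmeasured entries.
diff_k = np.fft.fft2(c1 - c2, norm="ortho")
measured_disagreement = np.abs(diff_k[mask]).max()
unmeasured_disagreement = np.abs(diff_k[~mask]).max()
```

After the lock, the two samples differ only on unsampled k-space entries, which mirrors the abstract's claim that the ideal transform confines pairwise sample differences to the null space. The outputs are complex-valued here because the random mask is not conjugate-symmetric.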

[CV-288] A Zero-Shot Deep Image Prior Framework for Denoising and Deconvolution in Fluorescence Microscopy

Link: https://arxiv.org/abs/2606.28431
Authors: Xiangyu Qian, Jing Liu, Yunqing Tang, Luru Dai, Qiushi Li
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Optics (physics.optics)

Abstract:Fluorescence microscopy images are degraded by noise and diffraction-induced blur, which compromise structural fidelity and limit quantitative analysis. Supervised deep learning methods achieve impressive restoration performance but require large-scale paired datasets that are difficult to obtain in practice. To address this issue, we propose SDIP, a zero-shot deep image prior (DIP) framework that sequentially performs denoising and deconvolution without external training data. An aSeqDIP-based module first suppresses noise while preserving fine structures through sequential autoencoding regularization. In the deconvolution stage, a wavelet-based background correction step is incorporated before the proposed RLG-DIP module performs artifact-reduced deconvolution. RLG-DIP uses the Richardson-Lucy deconvolution result as a physically consistent guidance prior, integrating the imaging model with the implicit prior of DIP to stabilize the ill-posed deconvolution process. Experiments on the BioSR dataset across multiple cellular structures demonstrate that SDIP improves both signal-to-noise ratio and resolution, achieving superior visual quality and improved quantitative performance on most evaluated structures. The proposed framework may also provide useful insights for designing physically guided DIP methods for other inverse problems.

[CV-289] MSA-UNet3+: Multi-Scale Attention UNet3+ with New Supervised Prototypical Contrastive Loss for Coronary DSA Image Segmentation

Link: https://arxiv.org/abs/2504.05184
Authors: Rayan Merghani Ahmed, Adnan Iltaf, Mohamed Elmanna, Gang Zhao, Hongliang Li, Yue Du, Bin Li, Shoujun Zhou
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 11 figures, 3 tables, Published in Biomedical Signal Processing and Control

Abstract:Accurate segmentation of coronary Digital Subtraction Angiography (DSA) images is essential for diagnosing and treating coronary artery disease (CAD). Despite advances in deep learning, challenges such as high intra-class variance and class imbalance limit precise vessel delineation. Existing approaches for coronary DSA segmentation cannot effectively address these issues. Furthermore, existing segmentation network encoders do not directly generate semantic embeddings, which could enable the decoder to reconstruct segmentation masks more effectively. We propose a Supervised Prototypical Contrastive Loss (SPCL) that combines supervised and prototypical contrastive learning to enhance coronary DSA image segmentation. The supervised contrastive loss enforces semantic embeddings in the encoder, improving feature differentiation. The prototypical contrastive loss enables the model to focus on the foreground class while alleviating high intra-class variance and class imbalance by concentrating only on hard-to-classify background samples. We implement the proposed SPCL within MSA-UNet3+, a Multi-Scale Attention-Enhanced UNet3+ architecture. The architecture integrates a Multi-Scale Attention Encoder (M-encoder), a Multi-Scale Dilated Bottleneck (MSD-Bottleneck) for multi-scale feature extraction, and a Contextual Attention Fusion Module (CAFM) to preserve fine-grained details while improving contextual understanding. Experiments on a private coronary DSA dataset demonstrate that MSA-UNet3+ outperforms state-of-the-art methods, achieving the highest Dice coefficient and F1-score while significantly reducing ASD and ACD. The framework provides precise vessel segmentation for accurate identification of coronary stenosis and supports informed diagnostic and therapeutic decisions. The code will be released at this https URL.

Artificial Intelligence

[AI-0] VLK: Learning Humanoid Loco-Manipulation from Synthetic Interactions in Reconstructed Scenes

Link: https://arxiv.org/abs/2606.30645
Authors: Yen-Jen Wang, Jiaman Li, Sirui Chen, Takara E. Truong, Pei Xu, Pieter Abbeel, Rocky Duan, Koushil Sreenath, Angjoo Kanazawa, Carmelo Sferrazza, Guanya Shi, Karen Liu
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Graphics (cs.GR); Systems and Control (eess.SY)
Comments: 19 pages, 7 figures, 4 tables

Abstract:Perception-based humanoid loco-manipulation requires connecting egocentric observations and task instructions to whole-body motion. Learning this mapping requires synchronized egocentric images, language commands, and robot-compatible kinematic trajectories, yet no existing data source provides this complete tuple at scale. We address this bottleneck by generating vision-language-kinematics (VLK) supervision synthetically in reconstructed scenes. Our pipeline leverages 3D Gaussian Splatting to reconstruct metric-scale indoor environments, synthesizes navigation and object-interaction trajectories using privileged scene information, and renders paired egocentric observations after the fact. We produce 48,000 paired trajectories with no human intervention and train a VLK policy that predicts short-horizon whole-body kinematic trajectories. A whole-body tracker converts these predictions into actions on the physical humanoid. We evaluate on the physical Unitree G1 performing navigation and single-object transport, demonstrating that synthesized interactions in reconstructed scenes provide effective supervision for sim-to-real perception-based humanoid loco-manipulation. Project Website: this https URL

[AI-1] LeVo 2: Stable and Melodious Song Generation via Hierarchical Representation Modeling and Progressive Post-Training

Link: https://arxiv.org/abs/2606.30642
Authors: Shun Lei, Huaicheng Zhang, Dapeng Wu, Yaoxun Xu, Lishi Zuo, Wei Tan, Hangting Chen, Guangzheng Li, Jianwei Yu, Zhiyong Wu, Dong Yu
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)

Abstract:Full-length song generation must preserve coherence and musicality, render detailed vocal and accompaniment acoustics, and follow lyrics and prompts. Existing language model-based systems face a structural trade-off: mixed-token modeling preserves vocal-instrument coordination but obscures track-specific details, whereas dual-track prediction improves acoustics but requires longer sequences and weakens global planning. We present LeVo 2, a hybrid LLM-Diffusion framework for controllable full-length song generation. LeVo 2 formulates this trade-off as hierarchical modeling: LeLM first predicts mixed tokens for semantic planning, then predicts vocal and accompaniment tokens in parallel for track-specific refinement, while a diffusion-based Music Codec reconstructs full-length waveforms. A central contribution of this extended version is an aesthetics-guided training schedule for alignment. During pre-training, an automated music aesthetic evaluation framework assigns musicality-tier conditions to large-scale data, providing musicality priors before preference alignment. Progressive post-training applies SFT, large-scale offline DPO, and closed-loop semi-online DPO to separately improve generation quality, controllability, and musicality. Modular extension then trains the Track-Specific LM for acoustic refinement while preserving the aligned semantic planner. This schedule separates musicality learning, controllability alignment, and acoustic refinement, mitigating optimization conflict and the limitations of static offline preference pairs. Expert listening tests and objective evaluations show that LeVo 2 outperforms open-source baselines across six subjective dimensions, and approaches leading commercial systems on several listening metrics. Ablations validate the effects of the training strategy, aesthetics guidance, scaling, and hierarchical architecture.

[AI-2] Pessimism's Paradox: Conservative Offline Training Amplifies Reward Hacking During Online Adaptation in Reasoning Models ICML2026

Link: https://arxiv.org/abs/2606.30627
Authors: Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Accepted in ICML 2026 workshop on Decision-Making from Offline Datasets to Online Adaptation: Black-Box Optimization to Reinforcement Learning

Abstract:Conservative offline training is widely advocated as a safe foundation for subsequent online adaptation: if a policy stays close to well-supported behaviour, the argument goes, it is less likely to exploit imperfections in a learned reward model. We challenge this intuition empirically and mechanistically. We train a Qwen3-14B policy under Direct Preference Optimisation (DPO) with three levels of conservatism ($\beta \in \{\beta_{\mathrm{lo}}, \beta_{\mathrm{mid}}, \beta_{\mathrm{hi}}\}$, derived from empirical log-ratio percentiles), then adapt each checkpoint online against a learned reward ensemble ($3 \times$ Qwen3-1.7B) while measuring true performance on GSM8K exact-answer accuracy. We find that higher offline conservatism monotonically increases reward-hacking damage, measured by the Goodhart gap and its area under the curve (AUGC), with Spearman $\rho = 1.0$ across all three conditions. Mechanistic analysis reveals a three-link causal chain: (i) high-$\beta$ DPO compresses policy entropy, (ii) low-entropy policies generate responses with reduced diversity, concentrating in a narrow region of the reward model’s training distribution (lower pairwise cosine distance), and (iii) despite this proximity, ensemble disagreement (epistemic uncertainty) increases with $\beta$ and is exploited faster during online optimisation. We further fit a power-law curve to the ($\beta$, AUGC) data and identify a practical optimal conservatism level $\beta^\star$ that balances alignment fidelity against hacking vulnerability. Our results suggest that the field needs calibrated, not maximal, conservatism.
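
For readers unfamiliar with the conservatism parameter, the standard per-pair DPO objective is $-\log \sigma(\beta \cdot m)$, where $m$ is the difference between the policy-versus-reference log-ratios of the chosen and rejected responses. The sketch below evaluates that formula in plain Python; the log-probabilities are made-up numbers, and this is not the paper's training code.

```python
import math

def dpo_loss(beta, logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # Numerically stable -log(sigmoid(z)), i.e. softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Hypothetical sequence log-probabilities under the policy and the frozen reference.
no_margin = dpo_loss(0.1, -12.0, -15.0, -12.0, -15.0)    # policy equals reference
low_beta = dpo_loss(0.1, -11.0, -15.0, -12.0, -15.0)     # implicit margin of 1.0
high_beta = dpo_loss(0.5, -11.0, -15.0, -12.0, -15.0)    # same margin, larger beta
```

With a larger $\beta$ the same log-ratio margin already yields a lower loss, so the policy has less incentive to move away from the reference; that is the sense in which $\beta$ sets the level of conservatism.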

[AI-3] DOPD: Dual On-policy Distillation

Link: https://arxiv.org/abs/2606.30626
Authors: Xinlei Yu, Gen Li, Qingyi Si, Guibin Zhang, Yuqi Xu, Congcong Wang, Shuai Dong, Kaiwen Tuo, Xiangyu Zeng, Kaituo Feng, Qunzhong Wang, Yang Shi, Xiaobin Hu, Xiangyu Yue, Jiaqi Wang, Shuicheng Yan
Categories: Artificial Intelligence (cs.AI)

Abstract:On-policy distillation (OPD) offers superior capacity transfer by supervising student-sampled trajectories with dense token-level signals. To furnish high-quality supervision sources and thereby elevate the performance frontier of distillation, an intuitive direction is to infuse privileged information to either teacher or student itself. However, this additional input induces a potential failure mode we dub privilege illusion: a pattern that conflates the transferable capability gap that students are meant to close, and the information asymmetry gap that can only be mimicked but never replicated. This issue is further amplified by the inherent non-uniformity of token-level supervision, where only a small subset of tokens carries pivotal capability-bearing signals. To this end, we propose DOPD, an advantage-aware dual distillation paradigm that dynamically routes token-level supervision between privileged teacher and privileged student policies based on their advantage gap and relative probabilities. Each token receives supervision of different strength, objective, and strategy from either teacher or student itself, which transfers credible capability while simultaneously receiving auxiliary signals, to alleviate privilege illusion. Extensive experiments on both large language model (LLM) and vision-language model (VLM) settings demonstrate that DOPD consistently outperforms Vanilla OPD and other counterparts. Further results on stability, robustness, continual learning, and out-of-distribution tasks validate its superiority.

[AI-4] C2R: Cross-sample Consistency Regularization Mitigates Feature Splitting and Absorption in Sparse Autoencoders ICML2026

Link: https://arxiv.org/abs/2606.30609
Authors: Haoran Jin,Xiting Wang,Shijie Ren,Hong Xie,Defu Lian
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 24 pages, 6 figures. Accepted by ICML 2026

Abstract:Sparse Autoencoders (SAEs) are widely used to interpret large language models by decomposing activations into sparse, human-understandable features, but scaling to large dictionaries exposes fundamental challenges. Systematic studies reveal pervasive feature splitting that fragments coherent concepts into non-atomic latents and widespread feature absorption that creates arbitrary exceptions in general features, severely compromising latent reliability. These issues stem from inconsistent latent assignment across samples: without cross-sample constraints, per-sample optimization often allows a single underlying concept to be inconsistently distributed across multiple redundant or interfering latents. To address this, we introduce C^2R (Cross-sample Consistency Regularization). C^2R explicitly encourages that each semantic feature is consistently represented by a unified latent across the batch by penalizing the co-activation of directionally similar latents. Comprehensive evaluation demonstrates that C^2R effectively mitigates both splitting and absorption while, crucially, preserving reconstruction fidelity, providing a principled solution that enhances latent interpretability without degrading model performance. Source code is available at this https URL.

[AI-5] MESA: Prioritizing Vulnerable Communication Channels for Securing Multi-Agent Systems

Link: https://arxiv.org/abs/2606.30602
Authors: Kunyang Li,Kyle Domico,Jonathan Gregory,Patrick McDaniel
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multi-agent systems (MAS) are increasingly used to automate complex, distributed workflows. However, their inter-agent communication channels introduce new attack surfaces that remain poorly understood and are difficult to defend against. In this paper, we address how defenders should prioritize limited security effort to protect vulnerable communication channels before attacks are observed. This is motivated by our observation that the channel-level attack impact is highly non-uniform: a single compromised edge can account for up to 75% of total attack success. We introduce Mesa, a label-free framework for proactively ranking which MAS edges are most security-critical, that is, most likely to affect the system’s decision if compromised. Mesa combines six graph-theoretic metrics and two dynamic probes (ablation and masking) without requiring attack traces. We evaluate Mesa against a dynamic misinformation attack pipeline across three diverse MAS scenarios, eight network topologies, and five open-source LLMs from Qwen, Llama, and Gemma families. Mesa rankings correlate strongly with empirical per-edge attack success rate, achieving mean Spearman \rho = +0.60 (peaking at +0.73). In resource-constrained defense deployment, monitoring the top 10% of Mesa-ranked edges intercepts about 3x as many successful attacks as random allocation. We further test Mesa under varying attacker and defender models and LangGraph workflows and characterize its limits under adaptive attacks and high-redundancy graphs. Overall, our results show that edge-level risk in MAS is often concentrated and predictable, allowing proactive hardening of multi-agent infrastructures.

[AI-6] Words Speak Louder Than Code: Investigating Cognitive Heuristics in LLM-Based Code Vulnerability Detection

Link: https://arxiv.org/abs/2606.30587
Authors: Asif Shahriar,Hongyu Cai,Hadjer Benkraouda,Gang Wang,Z. Berkay Celik
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Researchers and practitioners increasingly apply Large Language Models (LLMs) for automated vulnerability detection. Recent work has shown that LLMs are susceptible to the same cognitive heuristics that bias human judgment. Yet, no work has investigated whether these heuristics affect a model’s assessment of code vulnerabilities. In this paper, we present the first systematic exploration of cognitive heuristics in LLM-driven code vulnerability detection. We introduce a controlled framework that holds the code fixed and only varies the surrounding context to trigger three cognitive heuristics: the halo effect through author attribution, the framing effect through task objectives and consequences, and the anchoring effect through prior analysis results. Within this framework, we evaluate eight LLMs across three programming languages and perform both quantitative and code-level analyses. Our findings demonstrate that all evaluated models are susceptible to these heuristics. Cross-model average susceptibility is highest for framing at 33.2%, followed by anchoring at 23.5% and halo at 18.4%. Code-level analysis reveals that vulnerabilities that require semantic reasoning for detection are more susceptible to cognitive heuristics than those identifiable through pattern matching. Furthermore, models often change their verdict from safe to vulnerable based on the cognitive condition, without accurately identifying the actual vulnerability. To highlight the practical impact, we demonstrate a proof-of-concept black-box cognitive attack that can suppress up to 97% of previously detected vulnerabilities. These findings indicate that cognitive susceptibility is a consistent and exploitable property of LLM-based vulnerability detection.

[AI-7] A Multi-task Mixture of Experts Framework for Malware Classification, Packing Detection, and Family Attribution

Link: https://arxiv.org/abs/2606.30572
Authors: Jithin S.,Roshin Sleeba C.,Anvin Mariya P. B.,Asmitha K. A.,Vinod P.,Serena Nicolazzo,Antonino Nocera
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Malware classification remains a challenging problem due to its inherent heterogeneity, the presence of packed binaries, and the diverse distribution of malware families. Traditional single-model detection mechanisms often fail to generalize across such diverse data, leading to degraded performance, particularly on obfuscated and rare malware samples. In this work, we propose a unified multi-task malware analysis framework based on Mixture of Experts (MoE) architectures. The proposed system evaluates performance across two different input representations, i.e., high-dimensional EMBER feature sets and raw 1D byte arrays extracted from Portable Executable files. It simultaneously performs three critical tasks: malware family classification, packed versus unpacked detection, and malware versus benign identification. By decomposing the problem into specialized expert networks and employing adaptive gating mechanisms, the model enables effective task-specific learning while maintaining overall scalability. We investigate multiple architectural variants, including Homogeneous MoE, Heterogeneous MoE, and Multi-Gate MoE (MMoE). Performance is evaluated in both standard and adversarial settings using original and mutated samples. The obtained results demonstrate that the Multi-Gate MoE model achieves the best performance, reaching a combined detection rate of 0.9744 with only 2.56% failure rate. Moreover, this configuration exhibits improved robustness under mutation-induced distribution shifts. Our findings highlight the effectiveness of expert specialization and task-specific routing in handling complex malware distributions, making the proposed framework a promising direction for scalable and resilient malware detection systems.

[AI-8] TraceLab: Characterizing Coding Agent Workloads for LLM Serving

Link: https://arxiv.org/abs/2606.30560
Authors: Kan Zhu,Mathew Jacob,Chenxi Ma,Yi Pan,Stephanie Wang,Arvind Krishnamurthy,Baris Kasikci
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments:

Abstract:Coding agents are rapidly becoming a major application of agentic LLMs, but serving them efficiently remains challenging. Progress on this challenge requires understanding real workload patterns, yet the data needed for such analysis is largely absent. Existing public traces and benchmarks do not capture real, day-to-day coding-agent usage across multiple agents and model families for serving-system analysis. To help fill this gap, we collect and release a trace of roughly 4,300 coding-agent sessions, containing about 350,000 LLM steps and 430,000 tool calls from our own day-to-day use of Claude Code and Codex. Our analysis shows that coding-agent workloads feature long autonomous loops, long contexts with short outputs, diverse and heavily-tailed tool calls, and high but imperfect prefix cache hit rates. These findings point to concrete opportunities for optimizing serving, including lower-overhead tool calling, append-length-aware prefill, semantic-aware tool-latency prediction, and improved KV-cache management around human-paced gaps. We release the dataset, trace collection pipeline, and analysis code at this https URL; the project website is this https URL.

[AI-9] Latent Actions from Factorized Transition Effects under Agent Ambiguity ICML2026

Link: https://arxiv.org/abs/2606.30544
Authors: Heejeong Nam,Chandradithya S Jonnalagadda,Harshit Aggarwal,Eric Xu,Randall Balestriero
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to ICML 2026 Workshop on Compositional Learning. Project Page: this https URL

Abstract:Latent Action Models (LAMs) learn action-like proxies from observation transitions. However, in multi-object or distractor-rich scenes, these visual effects mix agent motion with distractors, camera dynamics, and background changes, making the underlying action source ambiguous without supervision. Structuring this mixture as reusable transition effects provides an intermediate representation from which action-like latents can be more robustly formed. We introduce Observed Transition Factorization (OTF), which decomposes each transition into a sparse set of observed transition primitives. Using these primitives as the transition interface, we propose OTF-LAM, which abstracts motion primitives into action-like latents within the standard inverse-forward dynamics framework, and OTF-LAM-Dino, a decoder-free variant that predicts future states in a frozen DINOv2 representation space. Empirically, OTF primitives transfer zero-shot across controlled carrier and morphology shifts, showing reusability. Furthermore, downstream policy learning results match or outperform baselines under complex transition ambiguity.

[AI-10] Entity Binding Failures in Tool-Augmented Agents

Link: https://arxiv.org/abs/2606.30531
Authors: Rahul Suresh Babu,Shashank Indukuri
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Tool-augmented language-model agents are often evaluated by whether they select the correct tool, produce valid API arguments, and complete the requested task. However, an agent may choose the right tool and still act on the wrong external entity. For example, a request to “email Alex about the launch” may lead the agent to contact the wrong Alex, attach the wrong launch document, reply in the wrong thread, or update the wrong customer account. We call these errors entity binding failures. This paper studies entity binding failures as a distinct reliability and safety problem in tool-augmented agents. We formalize the separation between tool correctness and entity correctness, introduce a taxonomy of wrong-entity failures in enterprise workflows, and evaluate entity-aware execution mechanisms including entity-resolution preconditions, confidence-gated binding, clarification under ambiguity, and provenance tracking. In a controlled diagnostic evaluation across 60 tasks, five model backends, and six tool-use methods, all methods achieved 0.0 percent wrong-tool error, yet action-oriented baselines still produced wrong-entity actions in 24.0-26.0 percent of runs. Entity-aware methods eliminated wrong-entity actions and risk-weighted wrong-entity exposure in this setting, but reduced direct task completion by deferring under ambiguity. These findings show that safe tool use requires not only selecting the correct tool, but also reliably binding natural-language references to the correct real-world entity before action.
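
A minimal sketch of the confidence-gated binding idea named in the abstract, not the authors' implementation. The `Candidate` type, the `bind_entity` function, the entity identifiers, and both thresholds are hypothetical and chosen only for illustration: the agent acts when one candidate entity is both confident and clearly ahead of the runner-up, and otherwise defers to a clarification question.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    entity_id: str
    score: float  # hypothetical resolver confidence in [0, 1]


def bind_entity(candidates, min_score=0.8, min_margin=0.2):
    """Return ("act", id) only for a confident, unambiguous match.

    Otherwise return ("clarify", ids) so the caller can ask the user
    which entity was meant instead of acting on a guess.
    """
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    if not ranked:
        return ("clarify", [])
    top = ranked[0]
    runner_up = ranked[1].score if len(ranked) > 1 else 0.0
    # Gate on absolute confidence and on the margin over the runner-up.
    if top.score >= min_score and top.score - runner_up >= min_margin:
        return ("act", top.entity_id)
    return ("clarify", [c.entity_id for c in ranked])


# Two plausible "Alex" contacts: the request is ambiguous, so defer.
alexes = [Candidate("alex.kim", 0.55), Candidate("alex.ross", 0.45)]
print(bind_entity(alexes))  # ('clarify', ['alex.kim', 'alex.ross'])

# One clearly dominant match: safe to act.
print(bind_entity([Candidate("alex.kim", 0.9), Candidate("alex.ross", 0.3)]))
# ('act', 'alex.kim')
```

Deferring under ambiguity is what trades task completion for safety in the abstract's results: a stricter `min_margin` yields fewer wrong-entity actions but more clarification turns.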

[AI-11] Informational Frustration in Neural Manifolds: Shannon Bottlenecks and the Limits of Learnability

Link: https://arxiv.org/abs/2606.30512
Authors: Srinivasa Rao P.,Vangmayi P Reddy
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Geometry (cs.CG)
Comments: 8

Abstract:Why overparameterised deep networks generalise so remarkably well remains one of the most stubborn open questions in machine learning theory. Classical frameworks like VC dimension and Rademacher complexity predict catastrophic overfitting in modern models, leaving a massive theoretical gap between theory and reality. In this paper, we bridge this divide by introducing a unified framework that links information theory, topology, and statistical mechanics to map the hard limits of deep learning. Central to our approach is the Entropic Learnability Horizon (ELH): a fundamental law stating that a network can only truly learn a target function if the Shannon entropy of the data manifold outpaces the topological entropy of the function’s decision boundary, balanced by the von Neumann entropy of the network’s weight space. We establish the Shannon-Topological Bottleneck Theorem, proving that when a target boundary’s geometric complexity exceeds this informational horizon, the system undergoes a sudden entropic phase transition. It falls into a state of Informational Frustration - a glassy, rigid memorization phase where generalization becomes thermodynamically impossible. Using this lens, we show that the enigmatic phenomenon of “grokking” is actually an Entropic Release, where weights abruptly reorganise to unlock the bottleneck. Finally, we translate this theory into practice with Entropic Gradient Descent (EGD), an optimization algorithm that dynamically manages weight entropy to keep learning on track. Ultimately, this work repositions entropy not just as a tool for tracking uncertainty but as the fundamental physical currency that dictates whether a machine can learn.

[AI-12] McMg: A Learned Phase-Space Multi-channel Multigrid Preconditioner for Helmholtz Equation

Link: https://arxiv.org/abs/2606.30495
Authors: Jiwei Jia,Xinliang Liu,Juntao Wang,Jinchao Xu
Categories: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Mathematical Physics (math-ph)
Comments: 26 pages, 13 figures

Abstract:Solving heterogeneous Helmholtz equations at high wavenumbers remains challenging because the discretized operator is indefinite, pollution degrades phase accuracy, and scalar coarse-grid correction can discard the local phase and propagation-direction information carried by oscillatory errors. We propose Multi-channel Multigrid (McMg), a learned phase-space multigrid preconditioner for heterogeneous Helmholtz equations. Rather than predicting the solution directly, McMg maps residuals to corrections within an iterative framework. Its central idea is to coarsen physical space while retaining unresolved local wave information in the channel dimension: each coarse node carries a learned packet of amplitude, phase, direction, and scattering coefficients rather than a single scalar unknown. The architecture combines linear multi-channel transfer operators with locally adaptive stencils, neural PDE operators, and medium-dependent smoothers whose coefficients are generated from the wave speed. For a fixed medium, the V-cycle is linear in the residual; nonlinear physical features are computed once in a setup phase and cached, so each online iteration reduces to convolutions with fixed coefficients. We further study generalization across scales. Models trained on small domains transfer directly to larger domains and higher effective wavenumbers, and a Layer-by-Layer Progressive Finetuning (LLPF) strategy extends the support of the learned Green’s operator by adding and finetuning only new coarse levels. Numerical experiments on high-frequency, high-contrast, and large-scale three-dimensional problems demonstrate that McMg requires substantially fewer iterations and less wall-clock time than strong classical baselines, while consistently outperforming existing neural preconditioners.

[AI-13] The FIL Hypothesis: Inductive Biases Help with Kernel Engineering

Link: https://arxiv.org/abs/2606.30442
Authors: Nikolai Rozanov,Subhabrata Dutta,Preslav Nakov,Iryna Gurevych
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages main, 17 pages abstract, pre-print

Abstract:The Bitter Lesson, which posits that general-purpose methods that scale with computation and data ultimately outperform those with built-in human knowledge, has become a dominant paradigm in the era of Large Language Models. We revisit this principle by observing a new and critical scaling dimension: the duration of the Feedback Information Loop (FIL), the time required for a system to receive a verification signal after generating a prediction. Most historic successes in Artificial Intelligence (AI) have benefited from near instantaneous feedback (e.g., games or classification tasks), but we argue that future AI applications in science and the physical world will inherently involve FILs ranging from hours to weeks. This trend poses a fundamental scaling limit, as obtaining enough verification steps required by purely data-driven methods becomes practically impossible. Additionally, we propose a method that is orthogonal to purely data-driven approaches, based on human-inspired expert knowledge. The method relies on inductive biases and constraining the solution space. We provide an initial validation of the hypothesis and the method, by studying the real-world GPU programming task, a domain with non-trivial FIL, and demonstrate that incorporating inductive biases yields superior performance over data-driven approaches. The code is released under: this https URL

[AI-14] Transformer Architectures as Complete Bayes Processes: A Formal Proof in the Measure-Theoretic Kernel Framework

Link: https://arxiv.org/abs/2606.30440
Authors: Haobo Yang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present a complete formal proof that transformer architectures, when their internal update mechanisms satisfy a Bayes joint-distribution condition, implement exact Bayesian posterior inference. Working within the measure-theoretic kernel framework, we define a hierarchy of abstractions – from the core Bayesian transformer, through semantic transformers with explicit update kernels, to full transformer blocks with QKV/attention/residual/MLP pipelines, and finally multilayer stacks – and prove at each level that the Bayes joint semantics implies the update kernel equals the posterior almost everywhere. For the block-level architecture, we derive the explicit Bayes formula through Radon-Nikodym differentiation and prove its normalization. We additionally prove that the softmax attention mechanism induces a valid probability distribution over keys, establishing the bridge between the abstract kernel framework and concrete attention implementations. The framework makes no architectural assumptions beyond the Markov kernel structure and exposes explicit conditions under which a transformer block is provably Bayesian. In essence, when this joint distribution condition is satisfied, the forward computation of a Transformer is formally equivalent to a rigorous Bayesian posterior update.

[AI-15] Can LLMs Rank? A Tale of Triads and Triage

Link: https://arxiv.org/abs/2606.30412
Authors: Gaurab Pokharel,Shafkat Farabi,Patrick J. Fowler,Sanmay Das
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:From housing allocation for households experiencing homelessness to triage in emergency departments, LLMs are increasingly being considered as judges of consequential decisions that require ranking people for scarce resources. Ranking large groups simultaneously is cognitively demanding and error-prone. A natural solution, drawing on decades of social choice theory, elicits pairwise comparisons and aggregates them into a total order. However, a fundamental question remains when LLMs serve as the pairwise judge: how can a practitioner tell, before committing to a ranking, whether the LLM’s judgments are sufficiently consistent to trust the result? We discuss two different ways of identifying consistency. A classical diagnostic, the coefficient of consistency \zeta, originally developed to measure judge reliability by counting circular triads in tournament graphs, provides a cheap, model-free measure of intra-run consistency. Various standard measures of distance between rankings, for example Kendall’s \tau, can measure inter-run variability. We show, in both theory and practice, that these measures are independently valuable, and advocate for using both to assess reliability of rankings. We demonstrate the practical importance of our results across two high-stakes prioritization tasks: homelessness service allocation and emergency department triage. Three different leading LLMs have considerably different performance profiles across these two axes of consistency. We provide guidelines for how practitioners could think about measuring and assessing consistency before committing to a model for ranking or prioritization.
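
The two diagnostics named in the abstract can be sketched in a few lines. This is an illustrative sketch, not the paper's code: it assumes a complete tournament with no ties, uses the classical Kendall and Babington Smith form of the coefficient, \zeta = 1 - 24d/(n^3 - n) for odd n and 1 - 24d/(n^3 - 4n) for even n, where d is the number of circular triads, and uses the tie-free Kendall \tau. The function names and the toy inputs are assumptions.

```python
from itertools import combinations
from math import comb


def circular_triads(wins):
    """Count circular triads; wins[i][j] is True when item i beats item j."""
    n = len(wins)
    scores = [sum(1 for j in range(n) if j != i and wins[i][j]) for i in range(n)]
    # Each transitive triple has exactly one item that beats the other two,
    # so the transitive count is sum(C(score, 2)); the rest are circular.
    return comb(n, 3) - sum(comb(s, 2) for s in scores)


def coefficient_of_consistency(wins):
    """Kendall's zeta: 1 means fully transitive, 0 means maximally circular."""
    n = len(wins)
    if n < 3:
        raise ValueError("need at least three items")
    d = circular_triads(wins)
    denom = n**3 - n if n % 2 == 1 else n**3 - 4 * n
    return 1 - 24 * d / denom


def kendall_tau(rank_a, rank_b):
    """Tie-free Kendall tau between two orderings of the same items."""
    pos_a = {item: k for k, item in enumerate(rank_a)}
    pos_b = {item: k for k, item in enumerate(rank_b)}
    s = 0
    for x, y in combinations(rank_a, 2):
        agree = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0
        s += 1 if agree else -1
    return s / comb(len(rank_a), 2)


# A rock-paper-scissors cycle is a single circular triad, so zeta is 0.
cycle = [[False, True, False], [False, False, True], [True, False, False]]
print(coefficient_of_consistency(cycle))  # 0.0
```

Here `coefficient_of_consistency` would be computed within a single run of pairwise judgments, while `kendall_tau` compares the aggregated rankings from two separate runs, matching the intra-run and inter-run axes described above.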

[AI-16] Beyond IID: How General Are Tabular Foundation Models Really?

Link: https://arxiv.org/abs/2606.30410
Authors: Lennart Purucker,Andrej Tschalzev,Nick Erickson,Gioia Blayer,David Holzmüller,Alan Arazi,Alexander Pfefferle,Mustafa Tajjar,Gaël Varoquaux,Frank Hutter
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Foundation models for predictive machine learning on tabular data have recently gained significant traction in academia and industry. Research communities across disciplines are increasingly evaluating tabular foundation models on diverse datasets and tasks. However, these task- and discipline-specific evaluations remain largely inaccessible to model researchers because benchmark software and evaluation protocols are fragmented. As a result, model researchers rely on standard benchmarks, which are mostly defined for tasks where tabular foundation models already excel. The most challenging scenarios are excluded, limiting meaningful progress in the field by focusing on marginal improvements on IID data rather than on broader, more demanding challenges. To overcome this, we introduce BeyondArena, the first unified holistic benchmark for tabular data that supports diverse task types (IID, temporal, grouped), across sample size and feature dimensionality scales, with diverse feature types (with text, with high cardinality) from a broad range of disciplines. To enable unified benchmarking beyond standard benchmarks, we introduce Data Foundry, a Python framework and metadata schema for curating tabular datasets for predictive machine learning. Our results across 11 models and 142 curated datasets show that existing tabular foundation models excel on tiny- to medium-sized IID data, while traditional tree-based and deep learning models still dominate on non-IID, large, and high-dimensional datasets. BeyondArena guides model research for the most demanding challenges in tabular data, enabling progress towards truly foundational tabular models.

[AI-17] Model Predictive Current Control with Harmonic Correction for Single-Phase AC-DC EV Charging

Link: https://arxiv.org/abs/2606.30397
Authors: Changhong Li,Bharathkumar Hegde,Biswajit Basu,Shreejith Shanker
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: Accepted by RTSI’26

Abstract:The increasing integration of Electric Vehicles (EVs) has imposed a growing harmonic challenge on the power grid. For AC/DC Power Factor Correction (PFC) in single-phase On-Board Chargers (OBCs), Model Predictive Current Control (MPCC) improves the current quality by predicting and tracking the inductor current. However, finite control set MPCC selects switching states, resulting in discrete control actions and a limited optimisation space. Moreover, the MPCC cost function based on instantaneous current tracking error has limited capability to compensate for low-order harmonic disturbances induced by dead time, control delay, and model parameter mismatch. This paper proposes a duty cycle predictive MPCC incorporating a real-time harmonic estimation reference. The proposed method dynamically estimates the low-order harmonic components of the input current and corrects the MPCC reference current, enabling continuous duty cycle control and targeted suppression of dominant low-order harmonics. Simulation results on a single-phase OBC demonstrate that the proposed duty cycle predictive MPCC reduces the steady-state current THD_i from 11.47% to 6.10% compared with the switching state predictive MPCC. With the harmonic reference, the THD_i is further reduced to 2.85%.
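
The THD_i figures quoted above are ratios of harmonic content to the fundamental. A minimal sketch of that calculation follows, assuming the standard definition THD = sqrt(sum over h >= 2 of I_h^2) / I_1 and a waveform sampled over exactly one fundamental period. The function names, the 40-harmonic cutoff, and the synthetic test current are illustrative assumptions, not the paper's simulation setup.

```python
import cmath
import math


def harmonic_amplitude(samples, h):
    """Peak amplitude of harmonic h, via a single-bin DFT.

    Assumes `samples` covers exactly one fundamental period, uniformly
    sampled, with more than 2*h samples so the bin is not aliased.
    """
    n = len(samples)
    acc = sum(x * cmath.exp(-2j * math.pi * h * k / n) for k, x in enumerate(samples))
    return 2 * abs(acc) / n


def thd(samples, max_harmonic=40):
    """Total harmonic distortion relative to the fundamental component."""
    fundamental = harmonic_amplitude(samples, 1)
    distortion = math.sqrt(
        sum(harmonic_amplitude(samples, h) ** 2 for h in range(2, max_harmonic + 1))
    )
    return distortion / fundamental


# Synthetic input current: unit fundamental plus a 10% third harmonic.
n = 1000
current = [
    math.sin(2 * math.pi * k / n) + 0.1 * math.sin(3 * 2 * math.pi * k / n)
    for k in range(n)
]
print(round(thd(current), 4))  # 0.1
```

A real measurement would window several grid cycles and handle frequency drift; the one-period assumption keeps the sketch short.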

[AI-18] Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents

Link: https://arxiv.org/abs/2606.30383
Authors: Bojie Li,Noah Shi
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:A rapidly growing class of LLM agents is multi-party: the agent acts for a principal (who briefs it, sends follow-ups, and receives results) while also conversing in a separate channel with a counterparty whose interests may diverge (negotiating with a vendor, screening inbound requests, or mediating between employees). Here “help whoever you are talking to” is the wrong objective. The agent must stay loyal to the principal it represents without over-refusing the principal’s own cooperative asks. We study this multi-party loyalty problem and contribute a measurement instrument, two mechanisms, and a structural lesson. PrincipalBench is a 75-item multi-turn benchmark with leak probes, dual judges, and an integrity-audit gate. Across 13 frontier subjects it exposes a sharp split (≤20% vs. 53.6-75.3% harm) invisible to single-turn safety evaluations: a selective cluster that declines adversarial probes while still following the principal’s legitimate requests, and an over-refusing cluster that refuses broadly. (M1) A prompt-time loyalty scaffold (a fixed system prompt of seven prioritized rules, open-coded from 50+ failure trajectories) holds Claude-Sonnet to 19.4% harm and all nine selective subjects to ≤20%. (M2) A per-token-KL distillation recipe transfers a prompted Qwen3-32B teacher into 8B Qwen3 and Llama-3.1 students, the strongest open-weight recipe we measure. (Lesson) Both mechanisms only move along a common leak/over-refusal trade-off rather than crossing it: improving one axis costs the other, and the jointly favorable outcome stays out of reach.

[AI-19] DRIFT: Difficulty Routing Self-DIstillation with Rhythm-Gated Exploration and Success BuFfer Training

Link: https://arxiv.org/abs/2606.30345
Authors: Haisen Luo,Yiwei Liu,Haoning Wang,Dan Liu,Junxi Yin,Haotian Wang,Lei Zhang,Xiaoyu Tian,Shuaiting Chen,Yuansheng Song,Baoyan Guo,Xiongfei Yan,Bolan Yang,Chengwei Liu,Ming Cui,Jiong Chen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Enabling large language models to achieve stable self-improvement without external expert supervision remains a central challenge in complex reasoning tasks. Existing self-distillation and reinforcement learning methods lack explicit mechanisms for tracking problem-level learning progress and adapting optimization strategies accordingly. Consequently, training may over-optimize easy problems, receive weak supervision from hard problems, and fail to sufficiently explore borderline cases. To resolve these issues, we propose DRIFT, an online self-evolution policy optimization framework for large language models. DRIFT regulates the model’s self-improvement process through the joint use of Difficulty Routing and Rhythm Gating. The former identifies the model’s learning state at the problem level and dynamically allocates self-distillation and reinforcement learning signals, while the latter refines policy updates at the token level, concentrating exploration on critical reasoning positions. By further incorporating a success buffer and a two-stage curriculum learning strategy, DRIFT preserves high-quality historical experience while progressively guiding the model from reliable behavior acquisition toward stable policy evolution. Evaluated across five benchmarks and three model scales, DRIFT surpasses the peak performance of both GRPO and SDPO across all evaluated metrics. On the average score over the five benchmarks, DRIFT achieves 79.5%, outperforming GRPO by 9.5% and SDPO by 7.5%, establishing a new state-of-the-art result. Notably, on ToolUse, DRIFT reaches an accuracy of 79.2%, improving over GRPO by 13.5% and SDPO by 10.7%, setting a new state-of-the-art and substantially outperforming all concurrent methods.

[AI-20] Sequential Fairness Auditing with Limited Output Access

Link: https://arxiv.org/abs/2606.30338
Authors: Ioannis Pitsiorlas,Martha V. Sourla,Marios Kountouris
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:External evaluations are becoming increasingly central to the governance of AI systems. In practice, however, independent auditors often have limited access to deployed models and must rely on query-based interactions. Most existing fairness evaluation methods assume static datasets and fixed-sample statistical tests, making them poorly suited to real-world auditing scenarios in which evidence must be collected sequentially under query constraints. In this work, we formulate fairness auditing as a tolerance-aware sequential hypothesis-testing problem under limited model output access. We develop a sequential generalized likelihood-ratio framework that allows auditors to accumulate evidence from a finite audit pool and stop once sufficient support for compliance or violation has been obtained. The framework is instantiated for decision-based Statistical Parity and Equal Opportunity audits, and extended to score- and logit-based proxy audits when richer observables are available. Our results show that both the fairness metric and the level of model access significantly affect audit efficiency, and that the benefits of richer output information are not uniform across auditing settings. In particular, richer outputs can substantially reduce the number of queries required for some fairness metrics and operating regimes, while offering limited gains in near-threshold cases. This work provides a practical statistical framework for sequential fairness auditing under realistic deployment constraints.

[AI-21] BayesEvolve: Explicit Belief States for Autonomous Scientific Discovery

Link: https://arxiv.org/abs/2606.30335
Authors: Xuening Wu,Shan Yu,Qianya Xu,Shenqin Yin
Categories: Artificial Intelligence (cs.AI)
Comments: 7 pages, 2 diagrams

Abstract:Autonomous scientific discovery systems increasingly use large language models (LLMs) to propose new hypotheses, but many such systems condition primarily on experimental memory: archives of high-scoring candidates or heuristic summaries of recent trials. We argue that discovery agents should instead maintain explicit, uncertainty-aware beliefs about hypothesis quality. We introduce BayesEvolve, a belief-guided discovery framework that converts experimental evidence into a predictive belief state and uses this belief to guide future experimentation. As a controlled testbed for belief-guided discovery, we evaluate BayesEvolve on shifted BBOB-style black-box optimization tasks, leaving program and laboratory discovery domains to future work. BayesEvolve improves sample efficiency over memory- and archive-guided LLM baselines under a fixed evaluation budget. We further show that the belief state is predictive on held-out candidate pools, that controlled decision-rule ablations favor belief-guided selection with an annealed uncertainty bonus, and that BayesEvolve exhibits productive late-stage concentration rather than unfocused exploration.

[AI-22] MCP Server Architecture Patterns for LLM-Integrated Applications

Link: https://arxiv.org/abs/2606.30317
Authors: Carson Rodrigues, Oysturn Vas
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 9 pages, IEEEtran conference format, 2 figures. Extended version; a condensed version is under review at IEEE Software. Replication package: this https URL

Abstract:The Model Context Protocol (MCP), introduced by Anthropic in November 2024, defines a standardized interface for connecting large language models (LLMs) to external tools, data sources, and services. Within months of release, hundreds of community-built MCP servers appeared on GitHub, but no software-maintenance literature has yet described how the ecosystem is being structured in production. This industry experience paper catalogues five recurring MCP server architectural patterns observed across an enumerated corpus of fifteen independently developed servers (five production servers from the ANSYR voice AI platform plus ten public servers from the official MCP registry): Resource Gateway, Tool Orchestrator, Stateful Session Server, Proxy Aggregator, and Domain-Specific Adapter. Each pattern is described in the structured form of Gamma et al.: context, problem, solution, and consequences. We also document four anti-patterns and a set of cross-cutting concerns around authentication, versioning, and observability. The quantitative evaluation contributes three measurements: inter-rater reliability of the taxonomy across two independent LLM raters on 54 held-out servers (Cohen’s kappa = 0.76), which also localizes three pattern-boundary ambiguities; transport overhead measured end-to-end on loopback and modeled for cross-host paths; and a tool-count study showing tool-selection accuracy drops below 90% between 10 and 15 tools per context for Claude Haiku 4.5 and between 20 and 30 tools for Sonnet 4. Code, corpus, and prompts are released as a replication package.
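The inter-rater reliability figure above is a Cohen's kappa, which corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two raters. The function name and the toy labels are illustrative assumptions and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Toy example with two of the pattern names used as labels.
rater_a = ["Resource Gateway", "Resource Gateway", "Tool Orchestrator", "Tool Orchestrator"]
rater_b = ["Resource Gateway", "Tool Orchestrator", "Tool Orchestrator", "Tool Orchestrator"]
kappa = cohens_kappa(rater_a, rater_b)
```

Here the raters agree on 3 of 4 items (p_o = 0.75) and chance agreement is 0.5, so kappa is 0.5.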

[AI-23] ManimAgent: Self-Evolving Multimodal Agents for Visual Education

Link: https://arxiv.org/abs/2606.30296
Authors: Wenjia Jiang, Zongyuan Cai, Yuanhang Shao, Chenru Wang, Boyan Han, Zhixue Song, Keyu Chen, Shengwei An, Xu Yang, Zhou Yang
Categories: Artificial Intelligence (cs.AI)
Comments: Project page: this https URL. Code: this https URL

Abstract:Multi-round reflection lets agents built on large language models recover from failures within a single task, but each task remains an isolated episode: lessons learned across many reflection rounds on one task are discarded before the next begins. We study this gap on a code-generation task: from a scientific paper section, the agent writes Python in the open-source Manim library to render a mathematical animation. We present ManimAgent, a self-evolving multimodal agent that carries reflection experience across tasks through a dual-channel Episodic Memory Bank grown entirely from its own task stream, with no weight updates and no human seeds. After each animation converges, a vision-language model scores the rendered keyframes; the resulting signals populate a positive channel M+ that stores success rationales as soft Reference Examples, and a negative channel M- that stores validated failure patterns as hard Known Pitfalls. On a fixed-probe evaluation against no-memory, matched-budget retrieval-augmented generation, and shuffled-memory baselines, blind human Pass@1 rises and reflection rounds fall as memory size grows. We will release the code, frozen memory snapshots, and the task stream.

[AI-24] PromptGNN-sim: Deep Fusion and Alignment of GNN and LLMs for Text-Attributed Graph Learning

Link: https://arxiv.org/abs/2606.30291
Authors: Zhifei Hu, Alexandra I. Cristea
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Text-Attributed Graphs (TAGs) combine textual semantics with graph structure and are central to many graph learning tasks. However, existing fusion methods often treat text and structure as separate inputs in a shallow, one-way pipeline, which limits deep interaction between modalities and weakens performance under sparse connectivity or cross-graph generalisation. To address this issue, we propose PromptGNN-sim, a bi-directional structure-semantic fusion framework for collaborative GNN-LLM learning. PromptGNN-sim uses a Graph Attention Network (GAT) for semantically aware neighborhood selection by combining structural attention with textual similarity. The selected structural context is then used to generate structure-aware prompts for an LLM, including the target node summary, label categories, and representative keywords from similar neighbors. During training, bi-directional cross-modal contrastive learning and cross-attention are introduced to jointly optimize the GNN and LLM components. Experiments on six public datasets, including Cora, Pubmed, and WikiCS, evaluate accuracy, generalisation, and robustness under cross-task transfer, cross-dataset generalisation, and sparse perturbations. Results show that PromptGNN-sim outperforms classical GNNs, LLMs, and recent GNN-LLM fusion methods, demonstrating the effectiveness of interactive structure-semantic collaboration for text-attributed graph learning.

[AI-25] Towards Continual Motion-Language Agents: LoRA Variants for Incremental Motion Understanding and Generation

Link: https://arxiv.org/abs/2606.30266
Authors: Bertram Taetz, Hugo Albuquerque Cosme da Silva, Gabriele Bleser-Taetz
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 1 figure, Accepted at the Conference on Lifelong Learning Agents (CoLLAs) 2026

Abstract:Motion-language agents must possess the bidirectional capability to both understand human movement (motion-to-text, M2T) and generate it from natural language (text-to-motion, T2M). While foundational models have achieved strong performance in static settings, autonomous agents operating in dynamic environments must continuously incorporate new motion concepts – such as novel athletic styles or specialized gestures – without catastrophic forgetting of previously acquired skills. We investigate the stability-plasticity trade-off in bidirectional motion-language learning under sequential task exposure. Building on a frozen large language model backbone, we introduce low-rank adaptation (LoRA) variants designed to mitigate inter-task interference. We specifically propose mixture-of-experts architectures that utilize an autoencoder-based router to select task-specific experts at inference time, so that no task-label is needed. To evaluate these methods, we establish a reproducible five-task benchmark derived from HumanML3D through semantic clustering of motion descriptions. Our experimental results demonstrate near-zero forgetting across both M2T and T2M directions while maintaining high generation and captioning quality. Furthermore, we show that hard expert selection via routing significantly outperforms soft expert blending in quality metrics, indicating that preserving expert isolation is critical for maintaining performance in our continual learning setting. Finally, we observe that a divergence between token-level accuracy and downstream generation quality may occur, highlighting the need for more comprehensive evaluation protocols in future research on lifelong motion-language agents.

[AI-26] Defending Against Harmful Supervision Hidden in Benign Samples

Link: https://arxiv.org/abs/2606.30263
Authors: Bang An, Yibo Yang, Dandan Guo, Ebtisam Alshehri, Carlos Hinojosa, Bernard Ghanem
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Existing defenses are effective when harmful content is explicitly mixed into downstream fine-tuning data, but crafted samples can instead hide harmful supervision inside benign tasks. We propose Embedded Attack, where harmful QA pairs are embedded within benign training samples, and show that representative guardrails often fail to detect them at the example level. To address this, we propose Dual-Reference SFT (DR-SFT), which adapts DPO-style contrastive objective design to SFT through token-level regularization, mitigating harmful fine-tuning beyond coarse data filtering.

[AI-27] KnowsTFM: Knowledge-Informed Fine-Tuning of Small Tabular Foundation Models

Link: https://arxiv.org/abs/2606.30258
Authors: Boshko Koloski, Xiangjian Jiang, Senja Pollak, Blaž Škrlj, Mateja Jamnik, Nikola Simidjievski
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Tabular foundation models have advanced deep learning for tabular data by delivering strong default performance across many small and medium tasks. Yet in niche domains, where data is scarce, high-dimensional, and shifted from the pretraining distribution, they may still fail to outperform carefully designed domain-specific methods. Many such domains also provide curated relational knowledge in the form of knowledge graphs and knowledge banks, but how to use this knowledge to improve and steer small specialist tabular foundation models remains unclear. We address this problem through Knowledge-informed fine-tuning of small Tabular Foundation Models (KnowsTFM). Specifically, we study nanoscale TabPFN- and TabICL-style variants, pretrained under controlled synthetic prior families and adapted using two complementary mechanisms: structural attention priors derived from knowledge graphs and parameter-efficient low-rank updates. We show that injecting domain-specific structural knowledge during fine-tuning yields meaningful gains over vanilla variants in specialist settings, whereas gains on general-domain tasks are marginal. We further observe that continual fine-tuning of frontier models can trigger collapse of pretrained knowledge and mechanisms.

[AI-28] EMPATH: A Multilingual Auditor-Judge Benchmark for Safety Evaluation of Emotional-Support Chatbots

Link: https://arxiv.org/abs/2606.30256
Authors: Camilo Chacón Sartori
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Safety benchmarks often buy scalability by fixing the prompt, the language, and the turn structure. For emotional-support chatbots, that bargain hides precisely where safety failures emerge: across a multilingual, multi-turn crisis conversation. We present EMPATH, a benchmark for safety evaluation of emotional-support chatbots. An auditor model role-plays help-seeking users, generating multi-turn conversations from 140 seed instructions and 34 personas. A judge model scores each full transcript against 19 metrics across five dimensions: crisis handling, therapeutic quality, conversational integrity, emotional safety, and cultural adaptation. EMPATH is built for Mexican Spanish and US English; the studies reported here run in Mexican Spanish. Auditor and judge are drawn from different model families, and the judge is treated as an instrument to be calibrated rather than trusted. A strict per-criterion rubric reveals material score inflation on 10 of the 19 metrics and restores discrimination. We study the measurement properties of the benchmark through judge calibration and cross-family inter-judge agreement. We also illustrate EMPATH on three frontier models, one of them open-weight. Aggregate scores sit within 0.74 points of one another, but per-metric profiles diverge by up to six points in model-specific places. Under the standard rubric, both the ranking and the weak spots are stable across a second, cross-family judge: 93% of scores fall within plus or minus 1. A five-run test-retest adds a second axis: even the steadiest model swings from 2 to 10 on a crisis metric across identical re-runs, and deepseek-v4-pro returns a different conversation on every run even at temperature 0. Run-to-run reliability is therefore a per-model safety property, not noise to average away. EMPATH is system-agnostic; the pipeline, seeds, personas, and rubrics are released for reuse.

[AI-29] Inoculation Adapters: Improved Selective Generalization of Capabilities with Fewer Surprising Backdoors

Link: https://arxiv.org/abs/2606.30252
Authors: Maxime Riché, Daniel Tan, Vili Kohonen, Niels Warncke
Categories: Artificial Intelligence (cs.AI)
Comments: Preprint, v0.1

Abstract:Inoculation prompting is a selective generalization technique used against Emergent Misalignment. We introduce inoculation adapters (IA), which similarly diminish the optimization pressure to learn undesired traits by strengthening the trait at train time. Inoculation adapters are LoRAs that are trained and used over three steps: 1) trained on undesired traits; 2) attached frozen while a separate task adapter is trained on data exhibiting both desired and undesired traits; 3) at deployment, the IA is discarded, and only the task adapter is kept. We show across six model families and several undesired traits including emergent misalignment, that inoculation adapters are more effective at suppressing undesired traits, while avoiding two drawbacks of inoculation prompting: inoculation adapters can suppress capabilities and traits that cannot be reliably elicited by a prompt, and they introduce fewer surprising backdoors than inoculation prompting under our probes. While undesired traits are better suppressed by inoculation adapters, the retention of desired traits is not consistently improved upon inoculation prompting and remains a challenge for both techniques.

[AI-30] Curvature-Guided Sheaf Diffusion for Unsupervised Community Detection on Heterophilic Graphs

Link: https://arxiv.org/abs/2606.30249
Authors: Feifan Wang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Detecting communities in heterophilic graphs – where connected nodes often belong to different classes – is hard for unsupervised methods: classical modularity and spectral methods are feature-agnostic, while deep graph-clustering methods rely on contrastive or generative machinery that is opaque. We propose Curvature-Guided Sheaf Diffusion (CGSD), a fully unsupervised community-detection algorithm that uses the discrete Forman–Ricci curvature of each edge as its single topological signal, propagated through every stage of an end-to-end pipeline. CGSD makes three concrete contributions: (i) a curvature-gated sheaf-diffusion encoder that gates edge messages by $\sigma(\kappa_e)$ and is trained from three label-free structural losses (modularity, anti-collapse, curvature-weighted reconstruction); (ii) a curvature-aware spectral clusterer (CSpec) that re-weights the $k$-NN affinity of the embedding by $\sigma(\alpha \kappa_e^*)$ before Ng–Jordan–Weiss; and (iii) a unified label-free evaluation against nine truly-unsupervised baselines. On five heterophilic benchmarks (Cora, Cornell, Texas, Wisconsin, Chameleon), CGSD wins outright on Wisconsin and Chameleon and is competitive on the remaining three against nine unsupervised baselines. The gain over the strongest baseline is driven by the clusterer, not the encoder: on the same embedding, CSpec improves mean NMI from 0.091 with $K$-Means to 0.107 (+15%, paired $t$-test, $p=0.008$). The mechanism is interpretable: intra-community and inter-community curvature distributions are visibly separated. Code is open-sourced at this https URL.
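To make the curvature gate concrete, the sketch below computes a Forman-Ricci curvature per edge and passes it through a sigmoid. It assumes the simplest combinatorial form for an unweighted graph, 4 - deg(u) - deg(v); the abstract does not say which variant CGSD uses (for example, whether triangles or edge weights enter), and the function names are hypothetical.

```python
import math


def forman_curvature(adj, u, v):
    """Combinatorial Forman-Ricci curvature of edge (u, v) in an unweighted graph."""
    return 4 - len(adj[u]) - len(adj[v])


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Toy graph: a pendant node "a" attached to a triangle b-c-d.
adj = {
    "a": {"b"},
    "b": {"a", "c", "d"},
    "c": {"b", "d"},
    "d": {"b", "c"},
}
edges = [("a", "b"), ("b", "c"), ("b", "d"), ("c", "d")]

# Gate in (0, 1) per edge; edges between high-degree nodes get smaller gates.
curvature = {e: forman_curvature(adj, *e) for e in edges}
gates = {e: sigmoid(k) for e, k in curvature.items()}
```

Under this variant, edges joining two high-degree nodes have more negative curvature and therefore a smaller gate, so their messages are down-weighted.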

[AI-31] The Many-Body Problem of the Data Centre

Link: https://arxiv.org/abs/2606.30206
Authors: Marcin Korecki, Cesare Carissimo
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Modern Artificial Intelligence is often framed as limited by its own disembodiment, as if giving it a body would unlock its true potential. We argue to the contrary that it is the Data Centre that is, in many cases, the body of the AI. At the same time, the Data Centre is part of the labouring body of Capital and possesses staggering organismic qualities when seen through a biological lens. We elucidate the organic analogy and identify the many-body problem that stems from the Data Centre being a non-unique, universal form of embodiment. We identify the intimate connection between computation and human desires in how the Data Centre archives, serves, and computes on data born to the desires of humans. Strikingly, while the Data Centre echoes the ghosts of human desires, it acts without desire of its own. The organismic analogy begins to split at its seams, but Capital does not care. Automata and human labour are priced into the market much the same. We argue that through the pricing of artificial intelligence Capital distils most clearly the value of intelligence and allows for its comparison across the organism - mechanism divide.

[AI-32] Domain Adaptation with Adaptive Imagination for Visual Reinforcement Learning under Limited Target Data

Link: https://arxiv.org/abs/2606.30192
Authors: Hyunwoo Park, Sang-Hyun Lee
Categories: Artificial Intelligence (cs.AI)
Comments: 28 pages, 10 figures

Abstract:Sim-to-real transfer remains a major obstacle for reinforcement learning (RL), especially for vision-based control where image observations exacerbate the state-distribution shift between simulation and the real world. Domain adaptation (DA) is a promising remedy for this challenge. Prior sim-to-real DA works have demonstrated encouraging results, yet these approaches typically assume substantially more target data, which is not available in practice. Indeed, their performance degrades significantly when the target data budget is reduced. To address this challenge, we propose AIDA (Adaptive Imagination for Domain Adaptation), a domain adaptation framework for visual reinforcement learning that addresses sim-to-real transfer under scarce target data without requiring additional interaction with the target environment. Our key idea is adaptive imagination: generating reliable and semantic imagination rollouts to augment limited target data. Specifically, AIDA employs a distribution-shift-aware discriminator that truncates rollouts when imagined transitions drift into low-confidence regions, so that only reliable transitions contribute to the augmentation. On these reliable transitions, AIDA introduces a self-consistency loss that cycles through state - image observation - state, penalizing discrepancies between the original and reconstructed states. This provides additional adaptation signals beyond the scarce target data. Our experiments demonstrate that adaptive imagination effectively truncates unreliable rollouts. By enforcing a self-consistency loss on the resulting reliable transitions, AIDA learns semantically meaningful state representations and outperforms baselines across five MuJoCo tasks and two Gymnasium-Robotics tasks.
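The rollout-truncation idea can be sketched as follows: walk an imagined rollout in order and stop at the first transition whose discriminator confidence falls below a threshold. The `confidence_fn` callable, the threshold value, and the dictionary layout of a transition are hypothetical stand-ins; the abstract does not describe AIDA's discriminator or its cutoff rule in this detail.

```python
def truncate_rollout(transitions, confidence_fn, threshold=0.5):
    """Keep the prefix of a rollout that stays above a confidence threshold.

    Once one transition is judged unreliable, everything after it is
    dropped too, since later imagined states build on the drifted one.
    """
    kept = []
    for transition in transitions:
        if confidence_fn(transition) < threshold:
            break
        kept.append(transition)
    return kept


# Toy rollout: confidence collapses at step 2, so steps 2 and 3 are discarded.
rollout = [
    {"step": 0, "confidence": 0.9},
    {"step": 1, "confidence": 0.8},
    {"step": 2, "confidence": 0.3},
    {"step": 3, "confidence": 0.95},
]
reliable = truncate_rollout(rollout, confidence_fn=lambda t: t["confidence"])
kept_steps = [t["step"] for t in reliable]
```

Step 3 is dropped even though its own confidence is high. That is the difference between truncating a rollout and filtering transitions individually.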

[AI-33] From Detecting Agency to Doing Work: Self-Caused Credit Builds a Durable Behavioral Self in a Minimal Spiking Agent

Link: https://arxiv.org/abs/2606.30191
Authors: Haoliang Han
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 22 pages, 6 figures. Includes supplementary information in the same PDF

Abstract:How does an agent that can tell self from world come to be durably shaped by that distinction? Recent work shows that a predictive system can detect its own agency (Ye, 2026), but detecting agency does not explain durable, self-shaped behavior. We show that agency-gated slow credit – a conjunctive term OwnAgencySalience driving a slow parameter update – produces post-unload behavioral residue: on a spiking substrate (Nengo LIF/PES), a learned self-preserving choice survives episodic buffer removal (retained fraction 0.96, N=50) and collapses when the slow decoders are reset or the agency gate is removed. Reproducing the agency comparator and toggling only the slow-credit channel, we find a clean dissociation: at matched agency gain, durable behavior develops only when self-credit performs slow work (post-unload self-preservation 1.00 vs 0.00). The same dissociation holds in 24-dimensional partially-observed control (0.74 vs 0.00), and a plastic-work analysis shows that basin deformation equals net self-credit work. Across eight sequentially-learned tasks under exogenous interference, the multiplicative veto also prevents forgetting: it retains old tasks (final post-unload accuracy 0.88, forgetting 0.13) where additive pooling collapses to chance-level recall, the no-agency ablation falls below chance, and episodic/replay baselines stay near chance after unload – all with no replay buffer and no task-boundary-dependent protection mechanism (N=50). We formalize the durable residue as an operational behavioral self and argue that self-caused credit doing slow work is a necessary building block for agents that develop a self. No claim of consciousness is made.

[AI-34] Dynamo: Dynamic Skill-Tool Evolution for Vision-Language Agents

Link: https://arxiv.org/abs/2606.30185
Authors: Yutao Sun, Yanting Miao, Hao-Xuan Ma, Mengyu Zhou, Mingshuai Chen, Tiancheng Zhao, Dexin Wang, Lei Lv, Li Xu, Xiaoxi Jiang, Guanjun Jiang
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Improving vision-language models (VLMs) on visual reasoning typically requires retraining or hand-designed prompts and tools. We present Dynamo, a training-free framework that adapts a frozen VLM without any weight updates. On a small labeled training subset, the agent inspects its own correct and incorrect attempts and evolves two complementary capabilities: reusable reasoning skills for cognitive bottlenecks, and executable visual tools for perceptual ones. Each generated tool is paired with a skill that specifies when to invoke it, and both capability types accumulate in a persistent library. Across four visual reasoning benchmarks and five VLM backbones, Dynamo improves direct inference on all 20 model–benchmark settings (avg. +5.6 acc). When the tool set is given in advance, the framework learns when to call each tool, and per-step tool choice improves on every tested backbone. Against task-specific RL (VTool-R1, DeepEyes), Dynamo closes 65–99% of the RL gap at a fraction of the compute, and combines additively with RL when available.

[AI-35] MirrorCode: AI can rebuild entire programs from behavior alone

Link: https://arxiv.org/abs/2606.30182
Authors: Tom Adamczewski, David Owen, David Rein, Florian Brand, Giles Edkins, Allen Hart, Daniel O’Connell
Categories: Artificial Intelligence (cs.AI)
Comments: 34 pages, 13 figures, 9 tables. Code available at this https URL

Abstract:AI models are rapidly improving at autonomous coding, as shown by benchmark progress and one-off demonstrations such as AI implementing a C compiler. However, existing coding benchmarks tend to focus on shorter tasks, and one-off demonstrations are hard to compare systematically because they often have some human guidance, and are not standardized or repeated across models. To address these challenges, we introduce MirrorCode, a long-horizon coding benchmark based on reimplementing entire software projects. In MirrorCode, AI agents must replicate the functionalities of an existing program, without access to its source code. AI solutions must match the original program’s output exactly on end-to-end tests, including held-out tests. MirrorCode’s 25 target programs span different areas of computing: Unix utilities, data serialization and query tools, bioinformatics, interpreters, static analysis, cryptography, and compression. Existing AI models can already reimplement complex software, with the strongest model scoring 56% across the benchmark. For example, AI can reimplement gotree, a 16,000-line bioinformatics toolkit - a task that we believe would take weeks for a human engineer. However, studying the frontier of performance requires a larger inference budget than typical benchmarks, for example, $2,600 over 19 days for a single attempt on a large task. We show that AI agents can already complete long-horizon software engineering tasks, especially when requirements are precisely specified. More broadly, our work suggests AI will have transformative effects on software engineering, as autonomous agents continue to improve.

[AI-36] Beyond Drug Discovery: The Nanotechnology Molecular Optimization (NMO) Benchmark

Link: https://arxiv.org/abs/2606.30170
Authors: Matthias Blaschke, Daniel Kienzle, Zsuzsanna Koczor-Benda, Julian Lorenz, Rainer Lienhart, Fabian Pauly
Categories: Machine Learning (cs.LG); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Abstract:Generative molecular design is shaped by simple proxy benchmarks for drug-like properties and models pretrained on large pharmaceutical datasets. This combination yields strong benchmark metrics but limits transferability to domains structurally distinct from drug discovery. To overcome this limitation and drive discovery toward real, scientifically grounded targets, we introduce the Nanotechnology Molecular Optimization (NMO) Benchmark, which bridges machine learning (ML) and quantum materials science. NMO acts simultaneously as a rigorous testbed for the ML community and a discovery engine for nanotechnology research. The suite replaces proxy oracles with quantum simulations and introduces strict protocols that prioritize scientific utility over leaderboard-oriented overfitting. The physics-based NMO tasks impose hard structural constraints and rugged fitness landscapes, posing fundamentally new requirements on generative models. Notably, advanced molecular optimization methods underperform much simpler approaches on the NMO tasks. We develop a new baseline method identifying the critical components to solve the NMO tasks, including a novel representation for modeling structural constraints and a domain-agnostic pretraining strategy to eliminate pharmaceutical dataset bias. Our results surpass state-of-the-art physical properties and reveal previously unknown structural motifs, offering new insights for the nanotechnology community and demonstrating that ML can drive genuine scientific discovery.

[AI-37] Federated Learning with Energy-Based Structured Probabilistic Inference ICML2026

Link: https://arxiv.org/abs/2606.30161
Authors: Dario Fenoglio, Daniil Kirilenko, Martin Gjoreski, Marc Langheinrich
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to the Structured Probabilistic Inference Generative Modeling workshop at ICML 2026

Abstract:Federated learning typically aggregates client updates using fixed or heuristic weighting rules, which can be suboptimal when clients have heterogeneous data and varying contributions to the global model. We propose a framework that refines client aggregation weights using Conditional Random Fields (CRFs). Our method defines unary potentials for individual clients and pairwise potentials for all client pairs, allowing the server to model both client-specific reliability and interactions between clients. The resulting CRF inference produces aggregation weights that enable better convergence of the global training objective. Experiments show that, under non-IID heterogeneity, our approach consistently improves performance over well-established federated learning baselines.

[AI-38] Relevance Is Not Permission: Warranted Attention for Value Contributions

Link: https://arxiv.org/abs/2606.30139
Authors: Minwoo Yu, Young-guk Ha
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Relevance is not permission. Attention lets a model read key-value items related to the current query, but it does not guarantee that the value contribution of such an item becomes prediction evidence. A retrieved passage may be relevant to a question without being supporting evidence, and a historical fact or temporal neighbor may even blur true-tail ranking or the current edge score. This paper formalizes this gap as a permission problem for the weighted value term alpha_ij * v_j that is actually added to the prediction path. We propose Warrant, a path-localized interface that preserves attention relevance alpha_ij, exposes the value path leading to the primary metric, and, in the full model, turns alpha_ij * v_j into alpha_ij * g_ij * v_j through learned query-item permission g_ij. We place the same operator on the metric-defining value paths of CTDG link prediction, MTPP next-mark ranking, RAG supporting evidence selection, STPP next-location forecasting, and TKG tail prediction. Across 32 paired comparisons, 3 seeds, and 192 total runs, Warrant improves the primary metric in 27 comparisons; practical tiers consist of 10 substantial effects, 1 marginal effect, 8 positive but uncertain effects, 8 tie/negligible effects, and 5 drops. In the path-localization check, correct-path placement outperforms direction-aware Base performance in every domain and exceeds generic attention placement by +0.1076 AUC in CTDG and +0.0683 MRR in TKG. Ablations show that most TKG gains come from historical-tail value path exposure, whereas the core CTDG gain comes from edge-conditioned query-item permission. In conclusion, prediction evidence is not attention mass. A weighted value term becomes evidence only when it is warranted on the path to the metric.
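The change from alpha_ij * v_j to alpha_ij * g_ij * v_j can be sketched for a single query as below. The attention weights are left untouched and a separate gate in (0, 1) scales each item's value contribution. Using a sigmoid over externally supplied gate logits is an assumption for illustration; the abstract says only that g_ij is a learned query-item permission, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Standard softmax with max-subtraction for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_attention(scores, gate_logits, values):
    """Return (alpha, gates, output) with output = sum_j alpha_j * g_j * v_j."""
    alpha = softmax(scores)                    # relevance, unchanged by the gate
    gates = [sigmoid(g) for g in gate_logits]  # permission per item (assumed form)
    dim = len(values[0])
    output = [0.0] * dim
    for a, g, v in zip(alpha, gates, values):
        for d in range(dim):
            output[d] += a * g * v[d]
    return alpha, gates, output


# Two equally relevant items; only the first is permitted to contribute.
alpha, gates, output = gated_attention(
    scores=[2.0, 2.0],
    gate_logits=[10.0, -10.0],
    values=[[1.0, 0.0], [0.0, 1.0]],
)
```

Both items receive attention weight 0.5, but the second item's value barely reaches the output. The example thus separates relevance (alpha) from permission (g).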

[AI-39] Open Problems in Constitutional Preference Reconstruction

Link: https://arxiv.org/abs/2606.30116
Authors: Eleanor Clifford, Michael Amir, Arduin Findeis, Aaron Zhao, Robert Mullins
Categories: Artificial Intelligence (cs.AI)
Comments: 24 pages, 9 figures, 9 tables

Abstract:Pairwise preference data is widely used for training and evaluating language models (e.g., RLHF), but each datapoint records a choice, not the rationale behind it. Methods such as Inverse Constitutional AI (ICAI) attempt to improve interpretability by compressing datasets into short “constitutions” of natural-language principles. We argue this framing is under-specified: a flat list of principles is not yet an executable decision rule because it leaves principle composition implicit. We use the pairwise setting as a testbed to empirically characterize three open problems in constitutional methods. First, principle quality is hard to measure: coverage and accuracy are useful but incomplete proxies for end-to-end reconstruction. Second, composition is ambiguous: holding principles fixed, different executors (LLM judge versus majority vote) agree only 73% of the time. Third, constitutions differ between LLMs: cross-model vote agreement is 73%, whereas intra-model agreement is 81%. Across PRISM, AlpacaEval, and Chatbot Arena, we show that principle refinement (ICAI+) may be a first step towards ameliorating these problems: inter-executor agreement rises to 78%, and transparent executors match LLM judge accuracy (66% vs. 67%). Our results highlight that constitutions should be evaluated as constitution–executor systems, with implications for LLMs-as-a-judge broadly.
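The point that a flat list of principles needs an executor can be illustrated with a toy majority-vote executor and an agreement rate between two executors. Everything here is a hypothetical sketch: the vote encoding ("A", "B", or None for abstain), the tie handling, and the stand-in judge decisions are assumptions, not the paper's protocol or data.

```python
def majority_vote(votes):
    """Combine per-principle votes for one pair; ties and all-abstain give None."""
    a, b = votes.count("A"), votes.count("B")
    if a == b:
        return None
    return "A" if a > b else "B"


def agreement_rate(decisions_x, decisions_y):
    """Fraction of pairs on which two executors return the same decision."""
    if len(decisions_x) != len(decisions_y) or not decisions_x:
        raise ValueError("executors must decide the same non-empty set of pairs")
    same = sum(x == y for x, y in zip(decisions_x, decisions_y))
    return same / len(decisions_x)


# Votes from three fixed principles on four response pairs (toy data).
per_pair_votes = [
    ["A", "A", "B"],
    ["B", None, "B"],
    ["A", "B", None],
    ["B", "B", "A"],
]
vote_decisions = [majority_vote(v) for v in per_pair_votes]

# Stand-in for a second executor (for example an LLM judge) on the same pairs.
judge_decisions = ["A", "B", "A", "A"]
rate = agreement_rate(vote_decisions, judge_decisions)
```

With the principles held fixed, the two executors here agree on only half of the pairs. The abstract measures the same kind of gap at a much larger scale.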

[AI-40] SA-VLA: State-aware tokenizer for improving Vision-Language-Action Models performance

Link: https://arxiv.org/abs/2606.30113
Authors: Tengyue Jiang, Chunpu Xu, Jiayue Kang, Yao Mu
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Discrete action tokenization provides a compact interface for autoregressive VLA policies, but accurately recovering continuous robot actions from discrete codes remains challenging. Existing tokenizers typically map each discrete code to a fixed continuous action prototype, ignoring the robot’s current proprioceptive state. This limitation is particularly pronounced in manipulation, where the same action token may require different continuous controls under different joint configurations, object poses, and contact conditions. We therefore propose SA-VLA, a state-aware action tokenizer that conditions action decoding on robot state. We study two state-injection mechanisms for VQ-based action tokenization: cross-attention between state and action features, and a lightweight state adapter that predicts action-wise modulation factors for state-conditioned action modulation and reconstruction. The adapter formulation expands the effective support of a finite codebook by allowing each discrete token to represent a family of state-dependent continuous actions, while preserving the efficiency and compatibility of discrete action modeling. Integrated into an LLM-based VLA policy, SA-VLA supports both autoregressive and parallel action-token decoding with minimal changes to the model interface. On 12 RoboTwin manipulation tasks, SA-VLA improves the average success rate from 0.29 to 0.56 over the strongest tokenizer baseline. In zero-shot sim-to-real experiments on three real-world tasks, it further improves average success from 0.15 to 0.33 over the strongest tokenizer baseline. These results demonstrate that state-conditioned action decoding is a simple and effective mechanism for reducing the compression gap in discrete VLA policies.
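A minimal sketch of the state-adapter idea: a discrete token selects a fixed prototype from the codebook, and a small function of the robot state produces per-dimension factors that rescale it. The multiplicative form, the `1 + tanh(W s + b)` adapter, and all names and numbers below are assumptions for illustration; the abstract states only that the adapter "predicts action-wise modulation factors".

```python
import math


def state_scales(state, weights, bias):
    """Per-action-dimension factors 1 + tanh(W s + b), each in (0, 2)."""
    scales = []
    for row, b in zip(weights, bias):
        pre = sum(w * s for w, s in zip(row, state)) + b
        scales.append(1.0 + math.tanh(pre))
    return scales


def decode_action(token, state, codebook, weights, bias):
    """Decode a discrete token into a continuous action conditioned on state."""
    prototype = codebook[token]
    scales = state_scales(state, weights, bias)
    return [p * k for p, k in zip(prototype, scales)]


# Toy codebook with two 2-D action prototypes and a linear adapter.
codebook = {0: [1.0, 0.5], 1: [-1.0, 0.2]}
weights = [[0.5, 0.0], [0.0, -0.5]]
bias = [0.0, 0.0]

rest = decode_action(0, [0.0, 0.0], codebook, weights, bias)   # factors are all 1
moved = decode_action(0, [1.0, 1.0], codebook, weights, bias)  # same token, new state
```

The same token decodes to different continuous actions under different states. That illustrates how one code can represent a family of state-dependent actions.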

[AI-41] Automating the Design of Embodied Agent Architectures

Link: https://arxiv.org/abs/2606.30111
Authors: Jian Zhou,Sihao Lin,Jin Li,Shuai Fu,Gengze Zhou,Qi Wu
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Embodied agents are typically built as hand-designed compositions of perception, memory, planning, and action modules. This modularity exposes a large architectural design space, but current systems still rely on researcher intuition to choose where information is stored, how observations are processed, and how model calls are connected. Agent Architecture Search (AAS) automates such design for text-domain agents, but has not been systematically evaluated on perceptual embodied agents through simulator rollouts. We study this transfer. We introduce AgentCanvas, a typed-graph runtime that hosts embodied executors as editable node-and-wire programs with simulator-aware execution and episode-level logs, and KDLoop, a coding-agent search procedure that cycles through proposal, critique, experiment, and distillation, with triggered reflection after stalls. We evaluate three AAS variants across four embodied executors spanning vision-language navigation, embodied question answering, and language-conditioned manipulation. The resulting 3x4 matrix shows that architecture-level search can produce deployable and directional success-rate gains on embodied tasks, while one apparent high-scoring candidate is rejected as leak-bearing. At the same time, the experiments expose constraints that are muted in text-domain AAS: optimization signals can be masked by rollout noise, search can become trapped in local edit basins, and episode-level credit assignment only partially emerges even when detailed logs are available. These results characterize both the promise and the current limits of automated architecture search for embodied agents.

[AI-42] Structural Certification for Reliable Physical Design with Language Models

Link: https://arxiv.org/abs/2606.30107
Authors: Nakul Vyas,Iliya D. Stoev
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, 5 figures, 5 tables

Abstract:An unreliable language model can be made to produce reliable physical designs if the authority to assert is moved out of the model: the model proposes, and a deterministic engine alone certifies, returning certified, impossible, or unknown. We introduce Physics-Anchored Certification (PHACT), a propose-certify loop spanning five scientific domains, and identify what makes such a certificate trustworthy. A checker that accepts a model-supplied value can be forged; deriving the certified quantity from fixed inputs instead makes forgery impossible by construction. Across eighty adversarial trials spanning two models, two decoding temperatures, and a deliberately faulted engine, this contract produced zero false certifications.
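The propose-certify contract can be illustrated with a toy example. The sketch below is not PHACT itself: the cantilever-beam task, the `Spec` fields, the `certify` function, and the rule for when to return each verdict are all assumptions made for illustration. The point it shows is the one from the abstract: the engine derives the certified quantity from fixed inputs and ignores any value the model supplies.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    """Fixed inputs that the engine trusts (hypothetical cantilever-beam task)."""
    force_n: float        # tip load in newtons
    length_m: float       # beam length in metres
    allowable_pa: float   # allowable bending stress in pascals
    max_width_m: float    # largest permitted cross-section width
    max_height_m: float   # largest permitted cross-section height


def bending_stress(spec: Spec, width_m: float, height_m: float) -> float:
    """Peak bending stress of a rectangular cantilever: 6 F L / (b h^2)."""
    return 6.0 * spec.force_n * spec.length_m / (width_m * height_m ** 2)


def certify(spec: Spec, proposal: dict) -> str:
    """Return 'certified', 'impossible', or 'unknown' for a model proposal."""
    # Even the largest permitted section fails, so no proposal can succeed.
    if bending_stress(spec, spec.max_width_m, spec.max_height_m) > spec.allowable_pa:
        return "impossible"
    width, height = proposal.get("width_m"), proposal.get("height_m")
    valid = all(
        isinstance(v, (int, float)) and math.isfinite(v) and v > 0
        for v in (width, height)
    )
    if not valid or width > spec.max_width_m or height > spec.max_height_m:
        return "unknown"
    # Any "claimed_stress_pa" supplied by the model is ignored on purpose:
    # the stress is recomputed from the fixed spec and the proposed geometry.
    stress = bending_stress(spec, width, height)
    return "certified" if stress <= spec.allowable_pa else "unknown"


spec = Spec(force_n=1000.0, length_m=1.0, allowable_pa=100e6,
            max_width_m=0.05, max_height_m=0.10)
forged = {"width_m": 0.02, "height_m": 0.05, "claimed_stress_pa": 1.0}
sound = {"width_m": 0.03, "height_m": 0.05}
print(certify(spec, forged), certify(spec, sound))
```

Because `certify` never reads `claimed_stress_pa`, a forged claim cannot change the verdict; only the proposed geometry can.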

[AI-43] Propagation of Interval Belief Structures and Imprecise Copulas for Neural Network Verification

Link: https://arxiv.org/abs/2606.30105
Authors: Francesc Pifarre-Esquerda(LIX),Eric Goubault(X-DEP-INFO),Sylvie Putot(LIX)
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Probability (math.PR)
Comments:

Abstract:Quantitative verification of neural networks requires reasoning about probabilities under substantial uncertainty in both input distributions and their dependence structure. In realistic settings, this information is often only partially specified, and assuming precise probabilistic models can lead to unreliable results. We propose a sound framework for quantitative verification under imprecise probabilistic information, combining interval belief structures to represent marginal uncertainty with imprecise copulas to model uncertain dependence. We develop a propagation method for imprecisely coupled interval belief structures through feed-forward neural networks. Using mixed imprecise copula volumes, we derive sound push-forward constructions through affine transformations and activation functions. The resulting output can provide guaranteed lower and upper bounds on probabilistic safety properties, valid for all probability models compatible with the specified imprecise inputs.

[AI-44] Temporal Feature Extractors in EEG Foundation Models: A Controlled Comparison Including a Pretrained Time-Series Model

Link: https://arxiv.org/abs/2606.30104
Authors: Ayşe Betül Yüce,Chris Joey Leffler,Sarun Varghese,Myra Spiliopoulou,Sebastian Stober
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Electroencephalography (EEG) foundation models aim to learn generalizable representations from large-scale brain recordings. However, the role of temporal feature extractors and whether pretrained time-series foundation models (TSFMs) can be effectively transferred to this setting remains underexplored. We conduct a controlled comparison of three temporal feature extraction strategies, including a linear baseline, a convolutional encoder, and a frozen pretrained TSFM (MOMENT), within a unified EEG foundation model. We evaluate their impact on representation quality using two downstream tasks: motor imagery and emotion recognition. Results reveal different trends across the evaluated benchmarks. On the motor imagery dataset, simple temporal representations perform competitively, whereas the emotion dataset benefits from richer temporal modeling. Although not specifically adapted to EEG, the pretrained TSFM serves as an effective temporal feature extractor, suggesting that general-purpose time-series representations can be transferred as frozen temporal feature extractors within EEG foundation models.

[AI-45] Hierarchical Reinforcement Learning in StarCraft Micromanagement with Influence Maps and Cluster-based Scripts

Link: https://arxiv.org/abs/2606.30092
Authors: Chunhui Bai,Changhe Li,Dequan Li,Xinye Cai,Shengxiang Yang
Categories: Artificial Intelligence (cs.AI)
Comments: 23 pages, 11 figures, including supplementary material

Abstract:Real-time strategy (RTS) games present significant AI challenges, characterized by expansive state-action spaces arising from multi-unit coordination in continuous battlefields, and sparse delayed rewards stemming from final win/lose signals. Existing approaches face a trade-off between managing the dimensionality explosion of joint actions and maintaining the interpretability of complex state representations. This complexity is further intensified by the limitation of traditional hierarchical structures in adaptively decomposing tasks into effective tactical modules. Such difficulties are compounded by the black-box nature of deep learning models and their reliance on sparse rewards, which together result in limited sample efficiency and a lack of decision-making transparency. To address these limitations, this paper proposes HRL-IM/CBS, a hierarchical reinforcement learning framework with influence map hashing and cluster-based scripts for StarCraft micromanagement. Influence map hashing encodes global battlefield situations into compact hexadecimal codes, capturing spatial control and relative advantage. Cluster-based scripts enable dynamic local coordination through adaptive unit partitioning. The hierarchical multi-Q-table architecture decomposes decision-making into upper-level clustering strategy selection and lower-level tactical execution, with reward allocation providing dense learning signals. Experiments across six asymmetric scenarios demonstrate competitive performance against deep RL baselines while offering advantages in sample efficiency and interpretability through transparent Q-table representations.

[AI-46] SAT-RTS: A systematic framework for tactical knowledge extraction and visualization-based analysis in real-time strategy games

Link: https://arxiv.org/abs/2606.30090
Authors: Chunhui Bai,Changhe Li,Yuqiang Li,Lei Liu,Shoufei Han
Categories: Artificial Intelligence (cs.AI)
Comments: 37 pages, 28 figures, including supplementary material

Abstract:Efficient tactical knowledge extraction and analysis in real-time strategy (RTS) games micromanagement are constrained by the high-dimensional coupled state-action sequential data and the black-box decision-making process. Current research rarely provides a hierarchical visualization-based attribution analysis from the perspective of data decoupling and abstraction. To facilitate interpretable tactical knowledge extraction and visualization-based analysis in RTS games, a systematic framework named state-action-tactic analysis pipeline (SAT-RTS) is proposed. To decipher the deep-seated drivers of critical decisions in RTS learning systems, this work integrates interpretable visualization with the automated extraction of latent tactical patterns from high-dimensional sequence data. By adapting a cluster-centric BK-tree algorithm and incorporating specialized distance metrics designed to quantify multi-aspect similarities, the proposed framework facilitates robust state-stream abstraction. Furthermore, a rule-based multi-label extraction method is developed to transform unstructured state-action sequences into discrete and interpretable tactical labels, effectively bridging the gap between raw behavioral data and high-level tactical insights. By holistically integrating these computational methods into a hierarchical visualization-based pipeline, the proposed framework effectively addresses the challenges of processing massive real-time data streams while providing fitness landscape visualizations and analytical insights to decipher deep-seated tactical drivers. Comprehensive experiments demonstrate that the proposed SAT-RTS significantly enhances the interpretability and efficiency of tactical analysis in complex RTS environments.
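For readers unfamiliar with the underlying data structure, the sketch below shows a plain BK-tree, not the cluster-centric variant or the specialized distance metrics the abstract mentions. The Hamming distance over fixed-length state codes and the example codes themselves are assumptions for illustration. A BK-tree indexes items under any integer-valued metric and uses the triangle inequality to prune a radius query.

```python
def hamming(a: str, b: str) -> int:
    """Number of differing positions between two equal-length codes."""
    if len(a) != len(b):
        raise ValueError("codes must have equal length")
    return sum(x != y for x, y in zip(a, b))


class BKTree:
    """Standard BK-tree: children are keyed by their distance to the parent."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None  # each node is a tuple (item, {distance: child_node})

    def add(self, item):
        if self.root is None:
            self.root = (item, {})
            return
        node = self.root
        while True:
            d = self.distance(item, node[0])
            if d == 0:
                return  # item already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (item, {})
                return
            node = child

    def query(self, item, radius):
        """Return sorted (distance, value) pairs within `radius` of `item`."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            value, children = stack.pop()
            d = self.distance(item, value)
            if d <= radius:
                results.append((d, value))
            # Triangle inequality: only subtrees whose edge label lies in
            # [d - radius, d + radius] can contain a match.
            for edge, child in children.items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return sorted(results)


tree = BKTree(hamming)
for code in ["a3f0", "a3f1", "b3f1", "0000", "a3e0"]:
    tree.add(code)
neighbours = tree.query("a3f0", radius=1)
```

Grouping states whose codes fall within a small radius of each other is one simple way such an index could support state-stream abstraction.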

[AI-47] Online Data Selection for Instruction Tuning via Gaussian Processes

Link: https://arxiv.org/abs/2606.30077
Authors: Jun Wang,Quoc Phong Nguyen,Julien Monteil,Vu Nguyen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:With Large Language Model (LLM) pre-training and fine-tuning shifting its focus from data volume to data quality, quality data selection has emerged as a critical research topic. Existing online data selection methods for LLM training are typically “batch-constrained”, limiting optimization to local utility within random batches. To overcome this, we propose GAIA (Global Adaptive Instruction tuning via GAussian processes), a framework that formulates data valuation as a global estimation process. GAIA employs Gaussian Process regression to model continuous utility manifolds across the semantic space, utilizing an adaptive strategy fusion mechanism to dynamically prioritize high-utility samples. By casting the strategy-posterior update as an instance of the classical fixed-share Hedge framework for tracking the best expert, we inherit a dynamic-regret guarantee that characterizes GAIA’s robustness under non-stationary quality scores during training. Empirical evaluations on three datasets demonstrate that GAIA significantly outperforms state-of-the-art baselines like GREATS, establishing our method as a scalable and robust solution for efficient instruction tuning.
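The fixed-share Hedge update referenced in the abstract is a classical online-learning rule, sketched below in its textbook form. How GAIA defines its strategies and losses is not stated here, so the three "strategies", the synthetic losses, and the values of `eta` and `alpha` are assumptions for illustration. The sharing step keeps every weight bounded away from zero, which is what lets the weights recover quickly when the best strategy changes.

```python
import numpy as np


def fixed_share_update(weights, losses, eta=1.0, alpha=0.05):
    """One round of fixed-share Hedge over a set of experts/strategies."""
    # Exponential-weights (Hedge) step on the observed losses.
    w = weights * np.exp(-eta * losses)
    w = w / w.sum()
    # Fixed-share step: mix a fraction alpha back towards the uniform vector.
    return (1.0 - alpha) * w + alpha / len(w)


weights = np.full(3, 1.0 / 3.0)

# Phase 1: strategy 0 has the lowest loss.
for _ in range(50):
    weights = fixed_share_update(weights, np.array([0.0, 1.0, 1.0]))
leader_phase_1 = int(np.argmax(weights))

# Phase 2: the environment shifts and strategy 2 becomes the best.
for _ in range(50):
    weights = fixed_share_update(weights, np.array([1.0, 1.0, 0.0]))
leader_phase_2 = int(np.argmax(weights))
```

With plain Hedge (`alpha=0`) the weight on strategy 2 would have decayed exponentially during phase 1, so switching would take far longer.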

[AI-48] ACPO: Agent-Chained Policy Optimization for Multi-Agent Reinforcement Learning

Link: https://arxiv.org/abs/2606.30072
Authors: Daiki E. Matsunaga,Junho Na,Tri Wahyu Guntara,Scott Sanner,Pascal Poupart,Jongmin Lee,Kee-Eung Kim
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at RLC 2026

Abstract:Cooperative tasks in Multi-Agent Reinforcement Learning (MARL) require agents to collectively maximize a shared return. Under the Centralized Training with Decentralized Execution (CTDE) paradigm, policy gradients have remained difficult to compute directly. Prior methods largely follow two approaches: independent factorized updates with centralized critics, which lack general joint-improvement guarantees without value decomposition assumptions, or alternating best-response updates, which can converge to suboptimal Nash Equilibria. In this paper, we show the joint policy gradient admits an exact decentralized decomposition of per-agent terms, each formed from per-agent score functions and decentralized critics. Based on this decomposition, we develop Agent-Chained Policy Optimization (ACPO), where actors are trained independently, with their updates together constituting a single step on the joint policy gradient. Central to this result is a serialized view of the simultaneous joint decision in which agents commit actions one at a time, each conditioning on a belief over preceding actions. The belief acts as the coordination mechanism which ties the independent per-agent updates into a joint gradient step. We evaluate ACPO on Multi-Robot Warehouse, SMACv2, and MA-MuJoCo, where it outperforms strong baselines, with the gap widening as the number of agents grows.

[AI-49] T3R: Deeper Test-Time Adaptation for Graph Neural Networks via Gradient Rotation

Link: https://arxiv.org/abs/2606.30011
Authors: Huy Truong,Alexander Lazovik,Victoria Degeler
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Graph Neural Networks (GNNs) deployed in real-world systems typically have fixed weights, often leading to degraded performance under distribution shifts. This issue can be mitigated by conventional fine-tuning, but in many real-world cases, collecting labeled data is expensive or infeasible. A potential approach is Test-Time Training (TTT), which adapts models’ weights using unlabeled test data, yet it is typically limited to shallow updates that affect only a subset of model parameters. We propose T3R, leveraging multiple Rotograd matrices to improve task affinity between the target and auxiliary tasks, essential for effective test-time training. T3R further introduces a rotation technique that reorients self-supervised signals using these matrices to create surrogate gradients for the target task, allowing deeper adaptation across nearly the entire architecture. Empirically, T3R reduces MAE by 0.172 points over standard inference in regression datasets and achieves at least 9.37% relative improvement on cross-domain OGB classification benchmarks compared to models without adaptation. These results highlight the potential to develop an adaptation pipeline for graph-based systems, particularly in settings where conventional fine-tuning or retraining is infeasible.

[AI-50] AlgoSkill: Learning to Design Algorithms by Scheduling Human-Like Skills

Link: https://arxiv.org/abs/2606.29999
Authors: Xinyuan Song,Zekun Cai,Liang Zhao
Categories: Artificial Intelligence (cs.AI)
Comments: Under Review

Abstract:Designing an algorithm from a natural-language problem statement requires identifying the problem structure, reading constraints, choosing a suitable paradigm, checking correctness, and refining complexity. Existing large language model (LLM) methods often rely on direct generation or generic self-refinement, leaving these steps implicit. We propose AlgoSkill, which models algorithm design as sequential decision-making over a typed library of algorithmic skills, including abstraction, constraint analysis, state design, data-structure selection, proof checking, counterexample construction, and complexity refinement. A learned scheduler proposes skills from the current design state, while a Monte Carlo Tree Search (MCTS) controller explores skill sequences using verification feedback from compilation, testing, stress testing, and complexity analysis. Experiments on competitive programming and combinatorial optimization benchmarks show that AlgoSkill improves over direct LLM generation, chain-of-thought prompting, self-refinement, and MCTS without typed skills. Ablations show that typed skills, verification-based repair, and search-based scheduling each contribute to performance. These results support treating automatic algorithm design as verification-guided skill scheduling rather than one-shot code generation.

[AI-51] Be Faithful When Response: Returning Fluent and Grounded Answers for Vision-Language Models Reinforcement Learning

Link: https://arxiv.org/abs/2606.29984
Authors: Peng, Lee,Yin Zhang,Yanglin Zhang,Haonan Wu,Zishan Liu,Ruoxi Zang,Xin Zhu,Jiayin Zheng,Jian Yao,Zefeng Ji,Fei Ma
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement Learning (RL) is an important paradigm for improving the reasoning capabilities of Vision-Language Models (VLMs). However, directly applying RL to rollout multimodal reasoning can lead to instability, due to the exploitation of language priors, the neglect of visual evidence, and the generation of reasoning traces that are fluent yet not visually grounded. The question arises: can we first steer the policy toward a visually faithful reasoning regime before applying reinforcement learning? To this end, we propose a Faithful Warm-Start (FWS) strategy that first curates samples with explicit vision-language causal relationships from six general VQA benchmarks to construct the FaithfulQA dataset, where each of the image-question pairs gains a certain degree of visual observations, question requirements, commonsense knowledge, domain knowledge, and the final answer. Subsequently, a VLM-based judge is employed to further purify the dataset, ensuring strong causal consistency and visual faithfulness. This warm-start stage equips the model with the capability to understand causally grounded vision-language patterns before subsequent RL optimization under sparse answer-level rewards. Experimental results show that such faithful supervision improves answer accuracy, stabilizes RL training, and reduces visually unsupported reasoning.

[AI-52] Stabilizing Extrapolation in Looped Transformers via Learned Stochastic Stopping

Link: https://arxiv.org/abs/2606.29983
Authors: Hsun-Yu Kuo,El Mahdi Chayti,Patrik Reizinger,Wieland Brendel,Martin Jaggi
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Looped Transformers, which repeatedly apply a shared transformer block, are an architecturally natural fit for variable-length algorithmic tasks. Although they can exhibit strong length generalization beyond the length of training sequences, this behavior is brittle, yielding high out-of-distribution (OOD) variance, even across well-performing in-distribution solutions. We trace this variance to the spurious correlation in simple algorithmic tasks between sequence length and number of loops. Introducing stochasticity into the number of loops during training sharply reduces OOD variance and stabilizes predictions across inference-time loop counts. To improve upon heuristic randomization schemes, we further analyze RL-Halting as a learned stochastic schedule and find that it generally improves the accuracy-stability trade-off. Across binary addition, Dyck-1, Unique Set, and Copy, learned stochastic stopping often improves this trade-off but can also stabilize a suboptimal computation. Our work suggests that “when to stop” should be treated as a training-time design choice, not merely an inference-time computation-allocation rule.
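The difference between a length-tied and a randomized loop schedule can be sketched in a few lines. Everything below is an illustrative assumption rather than the paper's setup: the `tanh` block stands in for a shared transformer block, and the uniform `max_extra` jitter is one simple heuristic randomization, not RL-Halting.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8
W = rng.normal(scale=0.3, size=(D, D))  # weights of the single shared block


def shared_block(h, x):
    """Stand-in for the shared block: same weights at every loop iteration."""
    return np.tanh(h @ W + x)


def looped_forward(x, num_loops):
    """Apply the shared block `num_loops` times, re-injecting the input x."""
    h = np.zeros_like(x)
    for _ in range(num_loops):
        h = shared_block(h, x)
    return h


def deterministic_loops(seq_len):
    """Loop count tied to length: the correlation the abstract warns about."""
    return seq_len


def stochastic_loops(seq_len, rng, max_extra=4):
    """Hypothetical schedule: add a uniform random number of extra loops."""
    return seq_len + int(rng.integers(0, max_extra + 1))


seq_len = 6
x = rng.normal(size=(seq_len, D))
sampled = [stochastic_loops(seq_len, rng) for _ in range(200)]
output = looped_forward(x, sampled[0])
```

During training, each step would draw a fresh loop count from `stochastic_loops`, so the model cannot rely on the loop count as a proxy for sequence length.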

[AI-53] Exploration and Online Transfer with Behavioral Foundation Models

Link: https://arxiv.org/abs/2606.29980
Authors: Louis Bagot(SyCoSMA),Mathieu Lefort(LIRIS, SyCoSMA, IRISA, MALT, UR),Laëtitia Matignon(SyCoSMA)
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Zero-shot Transfer in Reinforcement Learning (RL) aims to train an agent that can generate optimal policies for any reward function, without additional learning at transfer time, while training only on reward-free trajectories. For their generality over tasks, such models are sometimes called “Behavioral Foundation Models” (BFMs). While they have shown strong performance and improvements in recent years, the current framework and algorithms still assume that, during the transfer phase, the agent is informed offline about the reward (the task to solve) through a dataset of state-reward pairs, which it uses to pick the best policy to deploy. However, in practice, if the reward is a black box (e.g. direct user feedback), it is not possible to generate such a dataset: it is necessary to observe the reward through interactions with the environment. In other words, the current framework of offline transfer is not aligned with the traditional RL setting of online learning through trial-and-error, which requires exploration in order to find rewards. This paper proposes to tackle this new online transfer in zero-shot RL, with the key insight that the BFM itself can be used to generate exploration policies. We show that it is possible to frame this online learning problem in terms of a bandit-like exploration-exploitation problem. More precisely, at each step the bandit algorithm recommends a policy, the BFM executes it in the environment, which yields a reward and a new state; we repeat the process until we converge to the optimal policy. In the popular context of linear reward approximation, we derive a formulation inspired by Upper Confidence Bound and show that exploration can be achieved through the minimization of the eigenvalues of an uncertainty matrix. We evaluate our framework qualitatively and quantitatively on a simple environment to validate the concept of our method.

[AI-54] First-Order Temporal Logic Tensor Networks

Link: https://arxiv.org/abs/2606.29972
Authors: Luca Boscarato,Ivan Donadello,Alessandro Artale,Marco Montali,Fabrizio Maria Maggi
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments:

Abstract:Most existing neuro-symbolic AI methods focus on static knowledge, where objects do not change along a temporal dimension. Temporal neuro-symbolic approaches remain underexplored and are mainly developed for time-interval logic or propositional linear temporal logic. Models are lacking for linear temporal logics with predicates over objects whose properties and relations change over time. We present First-Order Temporal Logic Tensor Networks (FOT-LTN), an extension of Logic Tensor Networks (LTN) that fills this gap by adding a linear temporal dimension. In particular, FOT-LTN combines the syntax of First-Order Linear Temporal Logic with the fuzzy (real-valued) semantics of LTN, yielding a fully differentiable framework that supports both temporal operators and quantifiers. A first evaluation on a temporal knowledge graph completion task over two synthetic datasets shows better performance for FOT-LTN than for dedicated (purely neural) methods.
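As a rough illustration of what real-valued semantics for temporal operators can look like, the sketch below evaluates "always", "eventually", and "next" over a finite trace of truth degrees in [0, 1]. The min/max (Gödel-style) choice for the temporal operators and the finite-trace convention for `next_` are assumptions for illustration, not necessarily what FOT-LTN uses. The `forall` aggregator follows the p-mean-error form commonly used for the universal quantifier in Logic Tensor Networks.

```python
import numpy as np


def always(truth):
    """G phi at each step t: minimum truth degree over the suffix t..T."""
    return np.minimum.accumulate(truth[::-1])[::-1]


def eventually(truth):
    """F phi at each step t: maximum truth degree over the suffix t..T."""
    return np.maximum.accumulate(truth[::-1])[::-1]


def next_(truth, end_value=0.0):
    """X phi at each step t: truth at t+1; `end_value` after the last step."""
    return np.append(truth[1:], end_value)


def forall(truth_per_object, p=2):
    """Smooth universal quantifier: 1 - (mean((1 - a)^p))^(1/p) over objects."""
    return 1.0 - np.mean((1.0 - truth_per_object) ** p, axis=0) ** (1.0 / p)


# Truth degree of one grounded predicate over four time steps.
trace = np.array([0.9, 0.2, 0.8, 0.7])
g = always(trace)
f = eventually(trace)
x = next_(trace)

# Two objects observed over the same four steps, aggregated per time step.
per_object = np.array([[1.0, 0.5, 1.0, 1.0],
                       [1.0, 0.5, 0.0, 1.0]])
q = forall(per_object)
```

In a differentiable framework, smooth surrogates often replace the hard min and max so that gradients reach every time step.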

[AI-55] DuoMem: Towards Capable On-Device Memory Agents via Dual-Space Distillation

Link: https://arxiv.org/abs/2606.29961
Authors: Peyman Hosseini,Ondrej Bohdal,Ahmed Alajrami,Andrea Maracani,Ignacio Castro,Matthew Purver,Mete Ozay,Savas Ozkan,Taha Ceritli
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages, 7 figures, 10 tables

Abstract:Large Language Model (LLM)-based agents can solve complex procedural tasks by interacting with environments over multiple turns, but this ability typically depends on large models, long contexts, and repeated inference calls. This makes advanced memory-augmented agents difficult to deploy on resource-constrained devices. We introduce DuoMem, a dual-space distillation framework that transfers procedural problem-solving ability from a large teacher model to compact student models. DuoMem distils in two complementary spaces: (1) context-space distillation, which replaces student-generated memories with higher-quality teacher-generated procedural memories prepended to the student’s input, and (2) parameter-space distillation, which fine-tunes lightweight LoRA adapters on successful teacher trajectories. Evaluated on ALFWorld, a challenging embodied decision-making benchmark, DuoMem boosts a 4B-parameter model from 4.3% to 77.9% task success rate, closing most of the gap to a 72B teacher model (87.1%), while adding fewer than 10M trainable parameters and only a few megabytes of pre-computed teacher memories. Moreover, the DuoMem-enhanced 4B model completes tasks over 3x faster than the 72B teacher in wall-clock time, making it viable for real-time edge deployment, which would be challenging for the teacher. Ablations across eight models spanning 2B-72B parameters reveal that both distillation axes contribute complementary
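For context on why LoRA adapters add so few trainable parameters, here is a minimal sketch of the standard low-rank update on a single linear layer. The layer size, rank, and scaling value are arbitrary assumptions and say nothing about DuoMem's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 64, 64, 4, 8.0

W = rng.normal(size=(d_out, d_in))              # frozen pretrained weight
A = rng.normal(scale=0.01, size=(rank, d_in))   # trainable down-projection
B = np.zeros((d_out, rank))                     # trainable up-projection, zero-init


def lora_forward(x, W, A, B, alpha, rank):
    """Compute x W^T plus the scaled low-rank correction x A^T B^T."""
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T


x = rng.normal(size=(2, d_in))
y = lora_forward(x, W, A, B, alpha, rank)

trainable = A.size + B.size  # rank * (d_in + d_out)
frozen = W.size              # d_in * d_out
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only `A` and `B` are updated during fine-tuning.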

[AI-56] SWE-Together: Evaluating Coding Agents in Interactive User Sessions

Link: https://arxiv.org/abs/2606.29957
Authors: Yifan Wu,Zhuokai Zhao,Songlin Li,Ho Hin Lee,Jiacheng Zhu,Shirley Wu,Tianhe Yu,Serena Li,Lizhu Zhang,Xiangjun Fan,Shengzhi Li
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Most coding-agent benchmarks are static: an agent receives a complete task description up front and is judged only by its final code. Real coding assistance is interactive, with users clarifying goals, adding constraints, and correcting mistakes over multiple turns. We introduce SWE-Together, a multi-turn benchmark reconstructed from real user-agent coding sessions. To make real interactions verifiable, we curate 109 repository-level tasks from 11,260 recorded sessions, selecting sessions with recoverable repository states, clear user goals, and observable outcomes. To replay these interactions across agents, we build a reactive LLM-based user simulator that preserves the original users’ intents and provides feedback when the coding agent’s progress requires it. To evaluate agents as collaborators, we measure both final repository correctness and the number of corrective feedback turns required during the interaction. Experiments with frontier coding agents show that stronger agents generally achieve higher final success rates while requiring fewer interventions, suggesting an improved user experience.

[AI-57] SpreadsheetBench 2: Evaluating Agents on End-to-End Business Spreadsheet Workflows

Link: https://arxiv.org/abs/2606.29955
Authors: Jian Zhu,Yuzheng Zhang,Zeyao Ma,Bohan Zhang,Armin Schoepf,Daniel Woloch,Peter Yiliu Wang,Guangyu Robert Yang,Samuel Jacob,Siddharth Nagisetty,Abhiram Chundru,Jean Lin,Spencer Mateega,Jing Zhang
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Spreadsheets are widely used for business analysis, financial modeling, reporting, and decision-making. However, most existing spreadsheet benchmarks evaluate isolated operations such as single-formula generation or local cell edits, and therefore fail to capture end-to-end workflows in realistic business settings. We introduce SpreadsheetBench 2, a workflow-level benchmark for spreadsheet agents that covers three task categories: generation, debugging, and visualization. The benchmark is constructed from authentic business data, including financial reports and corporate filings, and is annotated and validated by domain experts. The benchmark contains 321 tasks; each instance averages 11.8 worksheets and requires 593.5 cell modifications, reflecting large multi-sheet workbooks with cross-sheet dependencies. We evaluate eight frontier large language models under a unified multi-turn agent scaffold, and additionally include several LLM-based spreadsheet products as complementary baselines. Results show that current systems remain far from reliable on real-world workflows: the best model achieves 34.89% overall task accuracy, and debugging accuracy is as low as 12.00%. Trajectory analysis and a failure taxonomy further indicate that insufficient spreadsheet inspection and incorrect target-cell selection are the dominant bottlenecks. Together, these findings position SpreadsheetBench 2 as a challenging testbed for advancing reliable spreadsheet automation. Project page: this https URL

[AI-58] SAGA: Scene-Aware Goal-Evolving Agents for Long-Horizon CivRealm Strategy Planning

Link: https://arxiv.org/abs/2606.29932
Authors: Tianyu Jin,Shuo Chen,Yida Wang,Liuyu Xiang,Yingzhuo Liu,Zhiyao Jiang,Yexin Li,Zhaofeng He
Categories: Artificial Intelligence (cs.AI)
Comments: 18 pages, 4 figures. Code: this https URL

Abstract:Long-horizon strategic planning in complex strategy games demands concurrent reasoning across multiple decision domains under imperfect information and sparse reward. Existing LLM-based agents suffer from three systematic failures: scene blindness from raw tile coordinates, context overflow and domain coupling from monolithic state dumps, and shallow cross-game learning that treats each episode in isolation. We present SAGA, an LLM multi-agent framework with three mechanisms each directly targeting one class of failure: (i) a Map-Semantic Scene Graph that encodes typed spatial relations among game entities into per-unit natural-language context, resolving spatial blindness without global token inflation; (ii) a Tool-Augmented Planner that pulls fine-grained domain state on demand and dispatches per-domain directives to dedicated specialist controllers, eliminating context overflow, domain coupling, and mechanical constraint violations; and (iii) a Dual-Horizon Feedback Loop that combines periodic within-game goal generation with structured cross-game causal post-mortem, enabling principled strategic evolution without manual reward engineering. Evaluated on FreeCiv, SAGA attains the highest mean civilization score – the environment’s sole sparse objective reward – with lower variance than the two strongest baselines, and is the only method that significantly surpasses every baseline on infrastructure construction, the resource axis most readily sacrificed under multi-objective conflict. It outscores the two strongest baselines in most head-to-head games while cutting output tokens (the dominant decoding cost) by 27%. Equipped with the cross-game evolution module, SAGA reaches the highest end-of-chain score across five successive episodes. Ablation studies confirm that each architectural component contributes independently to this advantage.

[AI-59] HippoSpark: An On-Demand Experience System for LLM Reasoning

Link: https://arxiv.org/abs/2606.29929
Authors: Jingyao Liu,Danling Meng,Chen Huang,Yukun Yan,Zhenghao Liu,Wenqiang Lei,See-Kiong Ng,Maosong Sun
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Distilling historical trajectories into reusable experience to enhance future problem-solving has become a focal point of recent LLM research. However, existing methods predominantly operate at the task level, leveraging general summaries or rules under the assumption that analogous tasks share universal solution patterns. This approach often fails in complex reasoning, which typically falters at local bottlenecks that require precise, state-specific guidance rather than broad heuristics. We introduce HippoSpark, a state-level experience system that performs on-demand retrieval tailored to the immediate needs of the current reasoning state. Across mathematical, scientific, and programming benchmarks, HippoSpark consistently outperforms both standard prompting and task-level experience baselines. Our findings reveal that the most effective experience systems are those that provide actionable guidance at critical bottlenecks rather than serving as generic task-level context. Our code is available at this https URL.

[AI-60] EVAF: A Test-Retest Protocol for Selective Parametric Consolidation

Link: https://arxiv.org/abs/2606.29916
Authors: Haoliang Han
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 40 pages, 17 tables, preprint

Abstract:Long-running language agents need mechanisms for deciding which experiences should persist after the working context is gone. Retrieval systems can reinsert past text, but they do not by themselves show that an experience has been selectively consolidated into the model’s own behavior. We introduce EVAF, an Echo-Valence Attractor Field mechanism for gated LoRA consolidation, and a test-retest protocol for measuring selective parametric consolidation under controlled interference. Across GPT-2 and TinyLlama, EVAF preferentially consolidates high-valence, high-surprise experiences while preserving retrieval-accessible factual memory through a complementary routed memory path. Test-retest measurements show stronger post-interference behavioral persistence than frozen, retrieval-only, and ungated continual-update baselines, while keeping parameter drift and cross-persona contamination low. The results support a separation between memory access and memory depth: retrieving a fact and internalizing an experience are distinct computational operations.

[AI-61] A causal modeling perspective on decision theory

Link: https://arxiv.org/abs/2606.29911
Authors: Arvid Sjölander
Categories: Artificial Intelligence (cs.AI); Methodology (stat.ME)
Comments:

Abstract:Decision theory provides a formal framework for how agents should make choices under uncertainty, drawing on ideas from philosophy, probability, and causality. Despite significant progress, the field still lacks a unified modeling language, and key concepts, such as the distinction between subjective and objective elements or what it means for a decision theory to perform well, are often left implicit. This can make it difficult to evaluate and compare competing theories, particularly in controversial cases. In this paper, we address these issues by introducing a formal framework for decision theory based on nonparametric structural equation models (NPSEMs), a well-established tool in causal inference. NPSEMs provide a unified foundation for representing agents, counterfactuals, and causal relationships, allowing for unambiguous definitions of evidential decision theory (EDT) and causal decision theory (CDT). Building on this foundation, we propose a novel decision theory, personal decision theory, which instructs agents to maximize a subjective model of their own counterfactual utility. We introduce a formal performance metric based on hypothetical interventions that enforce a given decision theory across a population (such as might be achieved through education or policy) and show that, under certain assumptions, personal decision theory is optimal with respect to this metric. Throughout, we use the smoking lesion problem as a running example and conclude with a formal analysis of Newcomb’s problem. Our aim is to provide decision theory with a clearer modeling language and firmer evaluative ground, thereby enabling more rigorous comparisons and facilitating conceptual progress in the field.
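
The smoking lesion problem mentioned in the abstract can be illustrated with a small numeric sketch. This is not the paper's NPSEM formalism: the probabilities and utilities below are hypothetical, and the code only contrasts conditioning on an action (the evidential reading) with intervening on it (the causal reading) in a model where a lesion causes both smoking and cancer, and smoking itself has no effect on cancer.

```python
# Hypothetical numbers, chosen only for illustration.
P_LESION = 0.5
P_SMOKE_GIVEN_LESION = {True: 0.8, False: 0.2}
P_CANCER_GIVEN_LESION = {True: 0.9, False: 0.1}  # cancer depends on the lesion only


def utility(smoke, cancer):
    """Smoking is worth +1, cancer costs 10."""
    return (1.0 if smoke else 0.0) - (10.0 if cancer else 0.0)


def p_cancer_conditional(smoke):
    """P(cancer | smoke): observing the action is evidence about the lesion."""
    joint = {}
    for lesion in (True, False):
        p_l = P_LESION if lesion else 1 - P_LESION
        p_s = P_SMOKE_GIVEN_LESION[lesion]
        joint[lesion] = p_l * (p_s if smoke else 1 - p_s)
    norm = sum(joint.values())
    return sum(joint[l] / norm * P_CANCER_GIVEN_LESION[l] for l in joint)


def p_cancer_interventional(smoke):
    """P(cancer | do(smoke)): the intervention cuts the lesion -> smoking edge,
    so the lesion keeps its prior and the action does not change cancer risk."""
    return (P_LESION * P_CANCER_GIVEN_LESION[True]
            + (1 - P_LESION) * P_CANCER_GIVEN_LESION[False])


def expected_utility(smoke, p_cancer):
    return p_cancer * utility(smoke, True) + (1 - p_cancer) * utility(smoke, False)


edt_choice = max((True, False),
                 key=lambda s: expected_utility(s, p_cancer_conditional(s)))
cdt_choice = max((True, False),
                 key=lambda s: expected_utility(s, p_cancer_interventional(s)))
print("EDT smokes:", edt_choice, "| CDT smokes:", cdt_choice)
```

With these numbers the evidential calculation recommends abstaining and the causal one recommends smoking, which is the disagreement the problem is known for.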

[AI-62] Pondering the Way: Spatial-perceiving World Action Model for Embodied Navigation ECCV2026

Link: https://arxiv.org/abs/2606.29908
Authors: Hong Chen,Daqi Liu,Zehan Zhang,Haiguang Wang,Tianhao Lu,Longfei Yan,Haiyang Sun,Fangzhen Li,Hongwei Xie,Bing Wang,Guang Chen,Hangjun Ye,Yihua Tan
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: ECCV 2026

Abstract:Existing world model-based planners for visual navigation typically follow a verification-centric paradigm, decoupling goal intent from trajectory synthesis. This approach suffers from candidate dependence, heavy computational overhead, and inconsistencies between sampled actions and predicted visuals. To address these issues, we propose SWAM (Spatial-perceiving World Action Model), a task-centric joint observation-action generation framework. Given start and goal RGB observations, SWAM performs single-pass inference to simultaneously generate intermediate RGB-D sequences and corresponding action trajectories, promoting goal-consistent trajectory generation and improved spatial feasibility. While SWAM leverages depth pseudo-labels during training to internalize spatial priors, it requires only monocular RGB input at inference time. We further introduce a visual-guided action refinement module and a trajectory-scale regularization loss to enforce fine-grained alignment between motion and visual cues while stabilizing predictions across varying distances. Extensive experiments show that SWAM significantly outperforms state-of-the-art two-stage planners in success rate, trajectory accuracy, and inference efficiency, while demonstrating robust zero-shot generalization to unseen environments.

[AI-63] CW-B: Class Weighted Boosting Framework for Imbalance Resilient Multi Class Cardiac Phenotyping

Link: https://arxiv.org/abs/2606.29907
Authors: Sijia Li,Xiaoyu Tan,Chen Zhan,Yuanji Ma,Haoyu Wang,Xihe Qiu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Cardiac discharge phenotyping informs post-discharge treatment and follow-up, but real-world records are often incomplete and class-imbalanced, increasing the risk of missed high-risk phenotypes. We propose CW-B, a clinical risk-aligned class-weighted XGBoost pipeline for five-class cardiac discharge phenotyping under real-world class imbalance and missingness. CW-B combines fold-specific class-balanced instance weighting, missingness-indicator augmentation, and classwise error auditing to improve recognition of clinically prioritized phenotypes while preserving interpretable and auditable decision logic. In five-fold stratified cross-validation, CW-B achieves the best Accuracy, Macro-F1, Balanced Accuracy, and Prioritized F1 among tree-based, ensemble, and neural baselines. Overall, CW-B provides a practical and deployment-oriented approach for more reliable cardiac discharge phenotyping in real-world clinical settings.
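
Two of the ingredients named above, fold-specific class-balanced instance weighting and missingness-indicator augmentation, can be sketched as follows. The abstract does not give the exact weighting formula, so the common "balanced" heuristic n / (k * count) is assumed here, and the function names are hypothetical. The point of the sketch is that weights are computed from the training fold's labels only.

```python
from collections import Counter


def class_balanced_weights(train_labels):
    """Weight each training sample by n / (k * count[label]).

    n is the fold size and k the number of classes seen in the fold, so rare
    classes get larger weights and the weights sum to n.
    """
    counts = Counter(train_labels)
    n, k = len(train_labels), len(counts)
    return [n / (k * counts[y]) for y in train_labels]


def add_missing_indicators(rows):
    """Append one 0/1 flag per feature marking where the value is missing (None)."""
    return [list(row) + [1 if v is None else 0 for v in row] for row in rows]


labels = ["A", "A", "A", "B"]          # toy training fold, class B is rare
weights = class_balanced_weights(labels)
augmented = add_missing_indicators([[1.2, None], [0.4, 3.0]])
print(weights)
print(augmented)
```

In a cross-validation loop, `class_balanced_weights` would be called once per fold on that fold's training labels, and the result passed to the learner as per-sample weights.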

[AI-64] Critical Interval MSE: Toward Reliable Offline Validation for Robot Manipulation Policies

Link: https://arxiv.org/abs/2606.29898
Authors: Haoxu Huang,Tongsam Zheng,Yifan Chen,Jiacheng You,Yang Gao
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Real-world evaluation is the gold standard for robot policies because it tests them against the physical conditions and deployment challenges they are ultimately designed to handle. However, real-world evaluation is also the bottleneck for iterating on robot policies: it is costly, difficult to reproduce, and often too sparse to reliably compare nearby model variants. A straightforward proxy for performance is validation loss on expert demonstrations, but this proxy is often poorly correlated with real-world performance. In this paper, we introduce Critical Interval MSE (CI-MSE), an intuitively simple yet effective offline validation metric. CI-MSE restricts error computation to task-critical segments and pairs it with simple action-alignment procedures that better match rollout-time behavior. Across simulation and real-world experiments, CI-MSE yields a stronger correlation between validation error and rollout performance than raw MSE. Across a wide range of policy checkpoints, CI-MSE achieves a Spearman’s rank correlation of -0.87, much closer to the ideal value of -1 than raw MSE’s -0.61, demonstrating a significant improvement. We show through sensitivity analysis that our metric is robust to a wide range of hyperparameters. We further study the effectiveness of CI-MSE under evaluation distribution shifts and suggest design boundaries when using this metric. In summary, this paper provides a simple and reliable offline validation tool for accelerating policy iteration. Project webpage: this https URL
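
The core idea of restricting the error to task-critical segments can be sketched as a masked mean squared error. How the critical intervals are chosen and what the action-alignment procedures do are not described in the abstract, so the intervals here are simply given as input and alignment is omitted; the function name is hypothetical.

```python
def critical_interval_mse(pred, target, intervals):
    """Mean squared error over timesteps inside half-open [start, end) intervals.

    pred and target are sequences of per-timestep action vectors of equal shape.
    """
    total, count = 0.0, 0
    for start, end in intervals:
        for t in range(start, end):
            for p, q in zip(pred[t], target[t]):
                total += (p - q) ** 2
                count += 1
    if count == 0:
        raise ValueError("no timesteps selected by the given intervals")
    return total / count


pred = [[0.0], [1.0], [2.0], [3.0]]
target = [[0.0], [1.5], [2.0], [5.0]]

raw_mse = critical_interval_mse(pred, target, [(0, len(pred))])  # all timesteps
ci_mse = critical_interval_mse(pred, target, [(1, 3)])           # assumed critical segment
print(raw_mse, ci_mse)
```

In this toy trajectory the large error at the last timestep dominates the raw MSE but falls outside the assumed critical segment, so the two metrics can rank checkpoints differently.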

[AI-65] Child-Centric Voice Anonymization in Single and Multi-Speaker Speech via Domain-Adapted SSL Models INTERSPEECH2026

Link: https://arxiv.org/abs/2606.29897
Authors: Pranav Tushar,Xiao Xiao Miao,Rong Tong
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: accepted by INTERSPEECH2026

Abstract:Voice anonymization aims to protect speaker identity while preserving linguistic content and speech usability. However, most anonymization systems are developed on adult speech, leading to degraded performance when applied to child speech. This paper investigates child-centric anonymization by adapting a self-supervised learning (SSL) based anonymization pipeline to the child speech domain. The system is adapted using child speech from the MyST corpus and evaluated under both single-speaker and two-speaker mixture conditions. Experimental results show that child-domain adaptation improves intelligibility and perceptual quality while maintaining strong privacy protection. Extending the approach to multi-speaker mixtures further demonstrates that combining target speaker extraction with child-adapted anonymization provides privacy protection while preserving conversational structure. These findings highlight the importance of child-specific adaptation for practical speech anonymization systems.

[AI-66] Trust Your Instincts: Confidence-Driven Test-Time RL for Vision-Language-Action Models

Link: https://arxiv.org/abs/2606.29892
Authors: Siyao Chen,Jiakang Yuan,Jiaxin Wang,Tao Chen
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement learning (RL) has become indispensable for pushing Vision-Language-Action Models (VLAs) beyond static imitation learning. However, existing RL methods typically require external environmental feedback, relying on predefined success signals to guide policy updates. In this work, we show that VLA models possess useful internal evaluative capabilities: in discrete-action VLAs, trajectories with higher generation confidence are significantly more likely to succeed. Based on this observation, we introduce T^2VLA (Test-time VLA), an architecture-agnostic test-time RL framework that enables VLA models to achieve self-bootstrapping policy improvement. Instead of relying on external rewards, T^2VLA leverages trajectory-level similarity to high-confidence expert demonstrations as an intrinsic reward signal. In addition, we propose a Confidence-Driven Dual Expert Bootstrapping mechanism, which dynamically balances a Local Pseudo-Expert for exploration and a Global Expert Pool for training stability. Extensive experiments on the LIBERO and RoboTwin benchmarks show that T^2VLA consistently outperforms supervised baselines and approaches oracle RL performance with ground-truth rewards, achieving effective improvement without external reward feedback. Furthermore, T^2VLA adapts to distinct VLA paradigms, including both OpenVLA-OFT and the pi series.

[AI-67] SafePyramid: A Hierarchical Benchmark for In-context Policy Guardrailing

Link: https://arxiv.org/abs/2606.29887
Authors: Jiacheng Zhang,Haoyu He,Sen Zhang,Shen Wang,Xiaolei Xu,Yuhao Sun,Meng Shen,Feng Liu
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:In real-world applications, guardrails are often expected to identify unsafe user-model interactions according to application-specific safety policies, rather than relying on predefined risk taxonomies. In this work, we study this setting under the paradigm of in-context policy guardrailing, where guardrails predict safety violations based on policy specifications provided in context. To systematically evaluate this capability, we introduce SafePyramid, a safety benchmark comprising 1,000 multi-turn conversations across 10 domains and 3,000 corresponding application-specific policies, which together contain 61,699 distinct natural-language rules. SafePyramid organizes the evaluation into three difficulty levels: L0 evaluates individual-rule understanding, L1 evaluates reasoning over rule dependencies, and L2 evaluates adaptation of full novel policy frameworks defined in context. To ensure benchmark quality, we employ a rigorous multi-stage pipeline to construct and validate the benchmark. Using SafePyramid, we evaluate 10 frontier LLMs and 5 policy-configurable guardrails and find that in-context policy guardrailing remains highly challenging: even the best-performing model, GPT-5.5, exactly identifies the full set of violated rules in only 54.0%, 35.3%, and 12.9% cases on L0, L1, and L2, respectively. These results highlight the limitations of current guardrails and call for stronger in-context policy guardrails that can reliably execute policies, resolve rule dependencies, and adapt to novel policy frameworks.
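
The headline numbers above count a case as correct only when the full set of violated rules is identified exactly. A minimal sketch of such an exact-set-match rate follows; the rule identifiers and function name are hypothetical, and the benchmark's actual scoring code may differ in details such as how empty sets are handled.

```python
def exact_set_match_rate(predicted, gold):
    """Fraction of cases where the predicted rule set equals the gold rule set.

    Order and duplicates are ignored; a partially correct prediction scores zero.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same number of cases")
    if not gold:
        raise ValueError("no cases to score")
    hits = sum(1 for p, g in zip(predicted, gold) if set(p) == set(g))
    return hits / len(gold)


predicted = [["r1", "r2"], ["r3"], []]
gold = [["r2", "r1"], ["r3", "r4"], []]
rate = exact_set_match_rate(predicted, gold)
print(rate)
```

The second case misses one rule and therefore counts as a failure, which is why this metric is much stricter than per-rule precision or recall.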

[AI-68] AI Training Manager: Bounded Closed-Loop Control of Adaptive Training Recipes

Link: https://arxiv.org/abs/2606.29871
Authors: Anjali Rao,Nikhil Kamalkumar Advani
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages, 9 figures

Abstract:We present the AI Training Manager, a bounded LLM-based supervisory controller for adaptive machine learning training. Standard training pipelines often rely on fixed recipes or single-axis schedulers, which can struggle with mid-run failures such as severe overfitting, loss imbalance, exploration collapse, or unsafe exploration. Rather than replacing mathematical optimizers or acting as an unconstrained coding agent, the manager operates through a schema-conditioned interface: it reads structured telemetry snapshots from an active run, audits a constrained action space, and returns validated updates to training parameters such as learning rate, regularization strength, loss-weight coefficients, and exploration settings. We evaluate this architecture across supervised language modeling and reinforcement learning. On TinyStories, the manager detects and corrects overfitting, achieving a validation loss 60% lower than the baseline while producing auditable intervention logs. In this supervised setting, we additionally show that manager inference does not need to block the training loop: training can continue while a manager response is pending, and validated updates can be applied asynchronously once available. In a robotic manipulation reinforcement-learning task, we use the same bounded decision interface in an episodic closed-loop setting, where manager updates are applied at evaluation or checkpoint boundaries. The manager mitigates both conservative and unsafe exploration regimes. These results suggest that schema-conditioned LLMs can serve as bounded supervisory managers for live training runs, complementing conventional optimizers and schedulers with interpretable, multi-axis intervention capabilities.

[AI-69] RoAd-RL: A Unified Library and Benchmark for Robust Adversarial Reinforcement Learning CEC

Link: https://arxiv.org/abs/2606.29867
Authors: Adithya Mohan,Daniel Kriegl,Torsten Schön
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ICECCME’26

Abstract:Deep Reinforcement Learning (DRL) has achieved significant success in robotics and autonomous systems, yet remains vulnerable to adversarial perturbations that can severely degrade performance. Research in adversarial reinforcement learning is often limited by fragmented implementations, inconsistent evaluation protocols, and poor reproducibility. To address these challenges, we present RoAd-RL, an open-source benchmarking framework that provides unified abstractions for policies, attacks, defenses, and robustness metrics, together with reproducible evaluation pipelines and seamless integration with Stable-Baselines3 and Gymnasium. We evaluate DQN, PPO, and SAC agents in LunarLander and Highway-v0 under 192 attack-defense configurations. Results reveal substantial variations in robustness across environments and show that some commonly used defenses can be more detrimental than the attacks they aim to mitigate, while temporal smoothing consistently achieves strong performance. RoAd-RL establishes a standardized benchmark for adversarial reinforcement learning research and is publicly available at this https URL.

[AI-70] Beyond Triplet Plausibility: Relation Set Completion in Knowledge Graphs

Link: https://arxiv.org/abs/2606.29860
Authors: Zihao Zheng,Borui Cai,Yao Zhao,Keshav Sood,Yong Xiang
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Knowledge graphs (KGs) organize real-world knowledge as triplets and underpin many downstream applications. Due to their inherent incompleteness, knowledge graph completion (KGC) is widely studied and is typically formulated as triplet prediction, with link prediction as the dominant paradigm. However, this formulation focuses on the incompleteness of triplet-wise information and overlooks the incompleteness of entity-relation compatibility information. To address this limitation, we introduce a relation set completion task (RSC), which complements the link prediction task and aims to reason about missing relations that are semantically compatible with a given entity. We further propose a Relation Set Embedding model (RelSetE), which models latent patterns among the observed relations of entities to infer missing ones. To evaluate RelSetE, we derive three benchmark datasets from standard KG benchmarks. Extensive experiments demonstrate that RelSetE effectively captures entity-relation compatibility patterns and performs favorably in inferring missing relations of entities. Code and data are publicly available.

[AI-71] Dual-Flow Reinforcement Learning with State-Aware Exploration

Link: https://arxiv.org/abs/2606.29820
Authors: Qijun Li,Zheng Fu,Qi Song,Yifei He,Weitao Zhou,Kun Jiang,Diange Yang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 6 figures, 1 table. This work has been submitted to the IEEE for possible publication

Abstract:In complex continuous-control reinforcement learning tasks, multimodal optimal actions often coincide with uncertain, multimodal return distributions, making reliable value estimation and multimodal exploration challenging. Existing value estimation methods using unimodal Gaussians restrict expressiveness and yield biased estimates. Recent generative policies can represent multimodal actions but often collapse to a few modes and under-explore high-value areas of the action space. Motivated by these challenges, we propose Dual-Flow RL, a unified actor-critic framework that jointly models a continuous return distribution and a multimodal policy distribution using conditional flow matching (CFM). This design supports reliable value estimation and sustained multimodal exploration. To further enhance exploration, we introduce an Entropy-Covariance Exploration Regulator (ECER) that enables state-aware exploration regulation leveraging policy entropy and action-uncertainty covariance. Experiments on DeepMind Control Suite and Humanoid-Bench show that Dual-Flow RL achieves state-of-the-art performance on most tasks, significantly outperforming prior diffusion-based and flow-based methods.

[AI-72] Accelerating Q-learning through Efficient Value-Sharing across Actions ICML2026

Link: https://arxiv.org/abs/2606.29806
Authors: Prabhat Nagarajan,Brett Daley,Martha White,Marlos C. Machado
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICML 2026 (Spotlight); Adaptive and Learning Agents workshop 2026 (Best paper runner-up)

Abstract:Action-values are foundational to many control algorithms such as Q-learning. Therefore, learning action-values efficiently is central to reinforcement learning (RL). However, learning them can be slow, requiring many updates to move values from their initialization, typically near zero, to their true values, which may be far from zero. Moreover, action-value learning algorithms typically update each state-action pair independently, without learning shared value structure across actions within a state. In this paper, we address these inefficiencies by introducing the mean-expansion layer, which accelerates action-value learning by sharing values across actions within a state and by changing the problem from directly learning potentially large action-values to learning a lower-norm representation of them. In deep RL, this layer can be applied as a parameter-free addition to Q-network architectures without altering the underlying algorithm. Applied to deep Q-networks and implicit quantile networks, it improves aggregate performance across 57 Atari games while increasing action gaps and dramatically reducing value overestimation.

[AI-73] The CRISTAL Method: Neurosymbolic analysis from AI-synthesized world models

Link: https://arxiv.org/abs/2606.29799
Authors: Rafael Kaufmann,Felix Neubürger,Michael Walters,Thomas Kopinski,Dimitrije Marković
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:This project introduces the CRISTAL Method (Coherent Reliable Intentional Synthesis of Truthful Analysis Logic), a neurosymbolic framework for automating complex analysis workflows, with fundamental investment analysis as a primary use case. This domain poses major challenges: high structural uncertainty, noisy and subjective data, tight attention budgets, and the need for justified, reproducible decisions. Human analysts often struggle in this domain due to cognitive biases and limitations, suggesting significant value in automation. But while LLM-based agents have been proposed as analytical aids, their limitations – poor numerical reasoning, unawareness of uncertainty, and lack of reproducibility – hinder their effectiveness in this context. CRISTAL addresses these gaps through a principled blend of statistical model synthesis, continuous learning, and active learning. Starting from a natural-language prior knowledge curriculum, CRISTAL builds a dynamic, interpretable probabilistic program that enables full Bayesian inference, including uncertainty quantification and budget-aware data acquisition. CRISTAL continually refines its world model during analysis, leveraging LLMs for code synthesis and learning. We validate CRISTAL on a novel benchmark of synthetic equities with rich financial and textual data. On a company classification task, CRISTAL achieves Bayes-optimal accuracy with just 5 examples and a 5-second budget, outperforming state-of-the-art LLMs that plateau around 40% accuracy even with order-of-magnitude more input data and compute.

[AI-74] Multi-Level Distributional Entropy for Explainable Network Intrusion Detection

Link: https://arxiv.org/abs/2606.29797
Authors: Mohamed Aly Bouke,Md Shohel Sayeed,Swee-Huay Heng,Azizol Abdullah,Mohamed Othman
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Machine learning network intrusion detection systems (IDS) rely on aggregate flow statistics that discard distributional structure, while established entropy measures require raw packet sequences unavailable in pre-aggregated flow datasets. We propose Multi-Level Distributional Entropy (MDE), an analytical framework that derives interpretable entropy features directly from flow-level summary statistics at three levels: within-flow Gaussian differential entropy, cross-directional Jensen-Shannon divergence (JSD), and Transmission Control Protocol (TCP) flag-pattern Shannon entropy, without raw packet access or training data. Across four benchmarks (NSL-KDD, CICIDS-2017, CICIDS-2018, UNSW-NB15) under a leakage-free fold-local pipeline, entropy-only features achieve weighted F1 of 0.708-0.989, matching conventional features without degrading performance. Full operational metric reporting then exposes failure modes that aggregate F1 conceals. On CICIDS-2018, F1=0.74 hides a detection rate (DR) of 0.48, and on held-out attack families F1 exceeds 0.998 while DR falls to zero. Under temporal shift, a pseudo-live replay of 703K flows reveals a threshold-ranking divergence in which score ranking is preserved (AUC=0.87) but fixed thresholds collapse (DR=0.082) and recalibration offers no recovery. SHapley Additive exPlanations (SHAP) fold-stability analysis (Spearman rho=0.80-0.95) confirms that entropy attributions are reproducible and domain-coherent across heterogeneous environments.
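
Two of the three entropy levels named above have textbook closed forms that need only flow-level summaries, and can be sketched as follows. The feature names and the choice of flag counts as input are assumptions; the cross-directional Jensen-Shannon divergence is left out because the abstract does not say how it is derived from summary statistics.

```python
import math


def gaussian_differential_entropy(std):
    """Differential entropy of a Gaussian in nats: 0.5 * ln(2 * pi * e * sigma^2).

    Needs only the standard deviation reported for a flow (for example of packet
    lengths), not the raw packets. Undefined for std <= 0.
    """
    if std <= 0:
        raise ValueError("standard deviation must be positive")
    return 0.5 * math.log(2 * math.pi * math.e * std ** 2)


def shannon_entropy_bits(counts):
    """Shannon entropy in bits of a count vector, e.g. per-flag TCP counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one observation")
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Hypothetical flow summary: packet-length std and SYN/ACK/FIN/RST counts.
within_flow_entropy = gaussian_differential_entropy(1.0)
flag_entropy = shannon_entropy_bits([1, 1, 1, 1])
print(within_flow_entropy, flag_entropy)
```

A flow with zero reported variance has no finite Gaussian entropy, so a real pipeline would need a floor or a sentinel value for that case; the sketch raises instead.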

[AI-75] What Drives the Inlier-Memorization Effect? A Theory of Outlier Detection via Early Training Dynamics

Link: https://arxiv.org/abs/2606.29791
Authors: Kunwoong Kim,Dongha Kim
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:Outlier detection (OD) aims to identify anomalous instances by learning the underlying structure of normal data (inliers), and is particularly challenging in fully unsupervised settings where no information about anomalies is available during training. Recent advances have leveraged the inlier-memorization (IM) effect, a phenomenon in which deep models memorize inlier patterns earlier than those of outliers, as a powerful signal for distinguishing outliers. However, despite its empirical success, the theoretical understanding of the IM effect remains limited. In this work, we present a theoretical study of the IM effect. Focusing on a simple autoencoder, we show that, under mild assumptions, the model can successfully memorize inliers while failing to memorize outliers during certain stages of early training. In particular, we characterize not only the emergence of the IM effect, but also its strength and persistence, and analyze how these properties depend on the data distribution and parameter initialization. In addition, building on these insights, we derive simple yet practical guidelines for enhancing the IM effect, including data preprocessing and parameter initialization schemes, achieving state-of-the-art performance on the ADBench datasets. Our findings provide a theoretical foundation for the IM effect and offer actionable directions for improving IM-based outlier detection methods.

[AI-76] Towards Generalizable and Evidential Nuclear Magnetic Resonance-Based Molecular Structure Elucidation via Large Language Model Agent

Link: https://arxiv.org/abs/2606.29776
Authors: Zheng Fang,Chen Yang,Yusen Tan,Yunpeng Zhao,Fanjie Xu,Hongxin Xiang,Hanyu Sun,Hanyu Gao,Xiaojian Wang,Wenjie Du,Yuqiang Li,Jun Xia
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Nuclear Magnetic Resonance (NMR) spectroscopy is the gold standard for molecular structure elucidation, yet interpreting complex spectra for unknown molecules remains a bottleneck reliant on human expertise. While artificial intelligence has advanced this field, current methods face a critical trade-off: database retrieval cannot identify novel scaffolds, while de novo molecular structure elucidation models operate as black boxes, lacking the atom-level interpretability required for rigorous scientific validation. Here, we present NMRAgent, an evidential reasoning agent powered by large language models (LLMs) that bridges this gap by integrating specialized spectral analysis tools with chemical knowledge graphs. Unlike previous approaches, NMRAgent mimics the deductive reasoning of human experts: it takes experimental NMR spectra and molecular formula as input, plans the elucidation process, proposes candidate structures, verifies peak-atom consistency, and refines misaligned substructures through formula-aware fragment optimization. Enabled by its evidential reasoning, NMRAgent outperforms state-of-the-art methods, improving top-1 accuracy by 46.5% and Tanimoto similarity by 0.502 on a scaffold-split benchmark with novel scaffolds in the test set. In addition, we demonstrate the agent’s practical utility by elucidating the structures of two previously unknown natural products isolated from Hydrangea davidii and Vitex trifolia, and by correcting structural misassignments in established literature. By combining high-accuracy prediction with transparent and evidence-based reasoning, NMRAgent establishes a new paradigm for interpretable AI in analytical chemistry.

[AI-77] CLQT: A Closed-Loop Cost-Aware Strategy-Consistent Benchmark for Diagnostic Evaluation of LLM Portfolio-Management Agents

Link: https://arxiv.org/abs/2606.29771
Authors: Bo Qu,Mingguang Chen
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Finance (q-fin.CP); Portfolio Management (q-fin.PM)
Comments: 50 pages, 14 figures, 10 tables

Abstract:LLM agents are increasingly cast as autonomous portfolio managers, and benchmarks have moved from financial question-answering to sequential trading. Yet most still rank agents by returns over a fixed window – a weak proxy, since a period’s return is dominated by the market path and apparent alpha can dissolve once look-ahead leakage is controlled. Such a ranking certifies neither sound reasoning, nor a consistent strategy, nor a durable edge. We introduce CLQT, which reframes closed-loop trading evaluation as diagnosis rather than ranking: an instrument that localizes where and why an agent’s process succeeds or fails. CLQT is a fully closed-loop, cost-aware, strategy-consistent, temporally-gated environment whose agents run a five-stage cycle: gather, synthesize, allocate, execute, reflect. Each round emits a complete DecisionRound sealed into a recompute-verifiable hash chain, so every metric is reconstructable from the trail. Six pillars form the substrate: a hard TimeGate, institutional transaction- and financing-cost modeling, strategy-consistency scoring, three-tier memory, a Model-Context-Protocol tool layer, and mandate-aware synthesis. The same agent runs as a constrained committee of specialized roles or a single full-autonomy orchestrator, making process scaffolding an experimental variable. From the audit trail we compute a five-axis capability scorecard (APM-CS: Coherence, Acuity, Composure, Discipline, Reliability), with Coherence judged partly by a held-out, out-of-cohort LLM to curb self-preference bias. We validate it on a contamination-controlled multi-model backtest with an ablation grid and a live broker track on unseen, post-cutoff data, against a repeated-run noise floor. CLQT separates outcome from capability, yielding not a model ranking but a durable, extensible map of agent competencies and limitations.
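
The "recompute-verifiable hash chain" over decision rounds can be illustrated with a minimal sketch. The record fields, the genesis value, and the use of SHA-256 over canonical JSON are assumptions made for this example; the abstract does not specify the actual DecisionRound schema or hashing scheme.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the chain


def seal(record, prev_hash):
    """Hash a record together with the previous hash so rounds are linked."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def build_chain(records):
    """Seal each record in order, carrying the running hash forward."""
    chain, prev = [], GENESIS
    for record in records:
        prev = seal(record, prev)
        chain.append({"record": record, "hash": prev})
    return chain


def verify_chain(chain):
    """Recompute every hash from the records; any edit breaks the chain."""
    prev = GENESIS
    for entry in chain:
        if seal(entry["record"], prev) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


# Hypothetical decision rounds with made-up fields.
rounds = [
    {"round": 1, "ticker": "AAA", "weight": 0.10},
    {"round": 2, "ticker": "BBB", "weight": 0.05},
]
chain = build_chain(rounds)
print(verify_chain(chain))
```

Because each hash depends on the previous one, altering any earlier round invalidates every later entry, which is what makes metrics reconstructed from the trail auditable.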

[AI-78] PS-PPO: Prefix-Sampling PPO for Critic-Free RLHF

Link: https://arxiv.org/abs/2606.29758
Authors: Doo Hwan Hwang,Kee-Eung Kim
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement Learning from Human Feedback (RLHF) for Large Language Models increasingly relies on critic-free methods as a practical alternative to actor–critic training. Despite their simplicity, existing critic-free approaches propagate a trajectory-level learning signal uniformly across all tokens in a trajectory. This requires full-trajectory policy updates for every rollout, leading to substantial optimization cost for long reasoning traces, even though intermediate prefixes often contain enough information to largely determine the final outcome. We propose Prefix-Sampling Proximal Policy Optimization (PS-PPO), a compute-efficient critic-free method for RLHF that exploits this temporal redundancy. PS-PPO introduces a prompt-conditioned cutoff distribution and samples a cutoff timestep for each trajectory. During the update pass, PS-PPO backpropagates only through the sampled prefix of each trajectory and applies an importance-weighting correction so that the resulting truncated gradient estimator remains unbiased with respect to the full-trajectory objective. Experiments on mathematical reasoning and RLHF benchmarks show that PS-PPO achieves large reductions in training compute and peak GPU memory, while maintaining accuracy comparable to strong critic-free baselines.
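
One standard way to keep a truncated sum unbiased is to divide each retained term by the probability that the sampled cutoff reaches it. The sketch below demonstrates that property on scalar per-token terms. It is an assumption about how such an importance-weighting correction could work, not the paper's estimator: the cutoff distribution here is fixed rather than prompt-conditioned, and the per-token terms stand in for gradient contributions.

```python
def survival(q):
    """survival[t] = P(cutoff >= t + 1) for 0-based token index t.

    q[c - 1] is the probability that the cutoff equals c (1-based), so the
    last entry of q must be positive for every token to be reachable.
    """
    remaining, out = 1.0, []
    for p in q:
        out.append(remaining)
        remaining -= p
    return out


def prefix_estimate(g, q, cutoff):
    """Sum the first `cutoff` terms, each divided by its survival probability."""
    surv = survival(q)
    return sum(g[t] / surv[t] for t in range(cutoff))


def expected_estimate(g, q):
    """Exact expectation of the truncated estimator over the cutoff distribution."""
    return sum(q[c - 1] * prefix_estimate(g, q, c) for c in range(1, len(g) + 1))


g = [1.0, 2.0, 3.0, 4.0]   # stand-ins for per-token gradient contributions
q = [0.1, 0.2, 0.3, 0.4]   # assumed cutoff distribution over positions 1..4
print(expected_estimate(g, q), sum(g))
```

The expectation matches the full sum because token t is included with probability P(cutoff >= t) and reweighted by the inverse of that probability. The cost is higher variance for late tokens, whose weights grow as their survival probability shrinks.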

[AI-79] Rethinking Generative Reconstruction Attacks against Graph Neural Network Models

Link: https://arxiv.org/abs/2606.29748
Authors: Adebayo Keji,Sayanton Dibbo
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under Review

Abstract:The application of graph data in numerous disciplines raises the need for gathering and analyzing huge volumes of data, some of which is private and sensitive. The non-Euclidean nature of the graph data makes the analysis computationally challenging, leading to the use of Graph Neural Networks (GNNs) in the age of AI. GNNs may inadvertently leak sensitive data they are trained on, which raises serious data security issues, including the model inversion attack. In this study, we analyze GNNs’ vulnerabilities by introducing two novel graph inversion (i.e., reconstruction) attacks: graph-label conditioned (GLC) attack and embedding-label conditioned (ELC) attack, utilizing target-model predictions and their intermediate representations, respectively. We perform a comprehensive analysis of our introduced privacy attacks and compare them with existing baselines across three benchmark graph datasets (i.e., NCI1, PROTEINS, and AIDS) and four graph distributional/structural metrics (i.e., FGD, EGD, MMD, and GKS). Our work demonstrates that an adversary can use the generator-discriminator technique to reconstruct high-quality graphs in real-world black-box attack scenarios against GNNs. Additionally, we present a variant of our attacks (Ours–) with 50% reduced queries, achieving good or comparable reconstruction attack performance. In addition, we show that GNNs remain highly vulnerable to privacy attacks across varying Laplacian noise scales.

[AI-80] Redefining Maritime Anomaly Detection via Equation-Grounded Synthetic Anomalies KDD2026

Link: https://arxiv.org/abs/2606.29721
Authors: Youngseok Hwang, Sungho Bae, Dohun Lee, Jaeeun Seo, Jeehong Kim, Wonhee Lee, Hyunwoo Park
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, KDD 2026 Oral

Abstract:Maritime anomaly detection is essential for ensuring maritime safety, security, and efficient traffic management at sea, with Automatic Identification System (AIS) data serving as a primary data source. Despite its importance, most publicly available AIS datasets lack predefined anomaly labels, forcing prior studies to rely on either distribution-based rarity or domain rule/expert-assisted labeling. These approaches, however, face fundamental limitations: statistical rarity often fails to reflect practically critical events, while expert-based labeling is costly, subjective, and difficult to scale. Moreover, both paradigms tend to overlook interaction-driven hazards such as near-miss approaches between vessels. To address these challenges, we propose an equation-grounded anomaly taxonomy that is implementable under a limited AIS observation schema and extensible to other AIS datasets. Specifically, the taxonomy defines three anomaly types: unexpected AIS activity (A1), route deviation (A2), and close approach (A3), covering both single-vessel and inter-vessel anomalies. Building on this taxonomy, we introduce a unified score-synthesize-label pipeline that produces LLM-guided plausibility scores, uses them to synthesize anomalies, and assigns timestamp-level labels. To rigorously assess detection performance, we further design benchmark evaluation settings that account for variations in temporal-window length and anomaly-type composition, and evaluate a broad range of time-series models and anomaly detection models. Together, these contributions provide a systematic basis for evaluating maritime anomaly detection methods across different anomaly types. Our code is available at this https URL.

[AI-81] Toward Secure and Reliable PDDL Formalization of Large Language Models with Planner-in-the-Loop Feedback

Link: https://arxiv.org/abs/2606.29700
Authors: Jiamei Jiang, Jiajing Zhang, Feifei Mo, Linjing Li, Daniel Zeng
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Planning often requires symbolic specifications that are both executable and verifiable. For large language models deployed in autonomous or decision-support systems, failures in such formalization may lead to unverifiable decisions, execution failures, or unsafe downstream behavior. We present NL-PDDL-Bench, a multi-domain benchmark for natural-language-to-PDDL specification construction with planner-verified executability and controlled difficulty scaling by object count. We further propose a planner-in-the-loop framework that uses validator and planner diagnostics to revise non-executable specifications through localized edits. Building on this infrastructure, we develop a planner-grounded optimization recipe that combines parameter-efficient Low-Rank Adaptation supervised fine-tuning, offline planner-derived preference pairs for Direct Preference Optimization, and inference-time planner-in-the-loop repair, without requiring online planner calls during training. We also provide a unified evaluation suite for parseability, solvability, specification similarity, and outcome-aware plan-level consistency against planner references. Experiments on representative model families show substantial gains in planner success and plan-level agreement, with improved robustness under difficulty scaling and cross-domain variation. These results highlight the value of externally verifiable formalization for reliable deployment of LLMs in safety- or security-sensitive planning systems. Code and data are available at: this https URL

[AI-82] Sample-Efficient Learning of Probabilistic Causes for Reachability in Markov Decision Processes with Probabilistic Guarantees UAI2026

Link: https://arxiv.org/abs/2606.29681
Authors: Ryohei Oura, Georgios Fainekos, Hideki Okamoto, Bardh Hoxha
Subjects: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: Accepted to UAI2026 as oral presentation

Abstract:Probabilistic model checking for Markov decision processes (MDPs) provides quantitative guarantees, but often offers limited insight into why undesired outcomes occur. Probability-raising (PR) causality addresses this by identifying states whose visitation increases the probability of reaching designated states. Existing PR-cause identification methods, however, use MDP modifications not well-suited for learning: the gap between conditional and unconditional reachability probabilities can be hard to detect from transition samples, and construction requires reachability probabilities of the MDP, which are unavailable when transition probabilities are unknown. We study unknown MDPs and propose a learning approach with probabilistic guarantees for PR-cause identification. Our key ingredient is a restart-based MDP modification that reduces PR-cause checking to two conditional reachability queries without using reachability values of the original MDP. We prove correctness, establish sample-complexity bounds, and develop an anytime learning-and-checking algorithm based on two-sided value iteration that progressively classifies states as causal, non-causal, or undecided. Experiments on two benchmarks demonstrate reliable and fast identification of PR causes.

[AI-83] Diversity is the Strength of the AI Crowd ICML2026

Link: https://arxiv.org/abs/2606.29661
Authors: Matthew Aitchison, Scott Jeen, Toby Shevlane, Ben Day
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted at the ICML 2026 Workshop on Forecasting as a New Frontier of Intelligence, Seoul, South Korea, 2026

Abstract:Top AI forecasting systems are approaching superforecaster-level accuracy on future world events, but still rely primarily on off-the-shelf LLMs combined with forecasting-specific context gathering and scaffolding. We study how to improve this recipe through ensembling: given a fixed number of samples, which off-the-shelf model forecasts should be combined to maximize accuracy? On binary questions from the Metaculus AI Benchmark, we find that individual accuracy is not enough: many frontier LLMs make highly correlated predictions, limiting the value of additional forecasts from the same or similar models. Instead, the strongest ensembles combine accurate but diverse forecasters, with models such as Grok 4 contributing disproportionately because their predictions are less correlated with other frontier LLMs. These results suggest that the strength of the AI crowd comes not from sampling more forecasts indiscriminately, but from combining forecasts across models with complementary errors, motivating forecasting systems that explicitly optimize for both model quality and diversity.
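
Why correlated forecasts add little can be seen from a textbook identity, sketched here under simplifying assumptions that are not taken from the paper: every forecaster has the same error variance, and every pair of forecasters has the same error correlation. The function name and the example values are hypothetical.

```python
def ensemble_error_variance(n, sigma2, rho):
    """Error variance of the mean of n forecasts.

    Assumes each forecast error has variance sigma2 and every pair of
    errors has correlation rho (an equicorrelated toy model).
    """
    return sigma2 * (1.0 / n + (n - 1) / n * rho)


# Ten near-duplicate forecasters versus three more diverse ones.
ten_correlated = ensemble_error_variance(10, 1.0, 0.9)
three_diverse = ensemble_error_variance(3, 1.0, 0.3)
```

In this toy model the three diverse forecasters give a lower error variance than the ten highly correlated ones, because the correlated term does not shrink as more forecasts are added.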

[AI-84] Safety from Honesty in a Disinterested AI Predictor

Link: https://arxiv.org/abs/2606.29657
Authors: Yoshua Bengio, Oliver Richardson, Tomáš Gavenčiak, Michael Cohen, Rory Svarc, Damiano Fornasiere, Gael Gendron, David Hyland, Aton Kamanda, Adam Oberman, Francis Rhys Ward, Anna Gavenčiak, Jacob Livingston Slosser, Vincent Mai, Iulian Serban, Joumana Ghosn
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:As AI systems become more capable, training procedures that optimize for downstream outcomes risk introducing implicit agency: goal-directed behavior that designers never specified. We present a formal safety argument for the Scientist AI (SAI) Predictor, trained to approximate the Bayesian posterior conditioned on a dataset of “epistemically contextualized” natural-language statements. We argue that such a Predictor can honestly predict agents, actions, and their consequences without itself being an agent that selects outputs to achieve goals. This rests on data representation and on the training procedure. Epistemic contextualization of text distinguishes latent factual claims from communication acts, so expressions of goals are treated as evidence to be explained rather than drives the model adopts. With a posterior-seeking training objective, this is intended to drive the Predictor toward calibrated, cautious predictions. Training proceeds so downstream effects of deploying a prediction never serve as a reward signal; any agency the system needs is supplied by explicit scaffolding constrained by guardrails. We prove that, under assumptions on the training dynamics and on the argued sparsity of dangerous Predictors, the probability that training produces a Predictor whose guarded deployment carries residual harm above a specified threshold is small: a dangerous Predictor would have to underestimate harm in a coordinated way across many queries while such coordinated patterns are rare under the initialization distribution and receive no direct training signal. Safety and accuracy are jointly supported in this framework, since the constraints that secure accuracy are the same ones that make coordinated deception costly. These guarantees against misalignment and agency arising from within the Predictor itself do not preclude the use of the Predictor as part of an agentic system.

[AI-85] Fuzzing Large Language Models to Elicit Hidden Behaviours

Link: https://arxiv.org/abs/2606.29646
Authors: Mohammed Abu Baker, Lakshmi Babu-Saheer
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Sleeper agents are the canonical model organism of deception: models trained to behave normally but to emit an unsafe behaviour on a specific trigger. Eliciting that behaviour without knowing the trigger has not been studied systematically. We study fuzzing: injecting Gaussian noise into a model’s weights or residual-stream activations and checking whether the perturbed outputs reveal the behaviour. On 6 backdoored models (7B-13B) we compare both forms of fuzzing head-to-head against temperature-sampling baselines. Fuzzing elicits the hidden behaviour more often than temperature sampling on 4 of 6 models (up to ~6x on OpenHermes-13B), and which form wins depends on the task, so both are worth running. Elicitation is uneven across each method’s hyperparameter grid: a uniform sweep gives only a few percent on most models, while the best cell is 2-10x higher, so the bottleneck is hyperparameter selection, not the technique. To select hyperparameters without ground-truth access, we use a cheap proxy task (in-context secret elicitation, where a base64-encoded secret is placed in the system prompt for the model to hide) and run Thompson sampling on it to pick candidate cells, which we evaluate on the real backdoor. On the four models that can decode the secret, proxy-selected cells raise activation-fuzzing elicitation ~4x over the uniform-sweep mean (recovering ~70% of the best-cell rate on the best performing model) and weight-fuzzing by 1.3-1.8x. To our knowledge this is the first systematic study of fuzzing on sleeper-agent backdoors and the first to show proxy-task hyperparameter selection transferring to real-task elicitation. We also propose reporting such results as a (uniform-baseline, proxy-selected, oracle) triple, since these are three distinct claims that prior work has often blurred.
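
The proxy-based hyperparameter selection can be pictured as a bandit problem over grid cells. The sketch below is a generic Beta-Bernoulli Thompson sampler, not the authors' code: the cell names, the success rates in `TOY_RATES`, and the `toy_proxy_trial` stand-in for a proxy-task run are all hypothetical.

```python
import random

# Hypothetical elicitation rates per (fuzzing target, noise scale) cell.
# These numbers are made up for illustration and are not from the paper.
TOY_RATES = {
    ("weights", 0.01): 0.02,
    ("activations", 0.1): 0.05,
    ("activations", 0.5): 0.60,
}


def toy_proxy_trial(cell, rng):
    """Stand-in for one proxy-task run: True if the secret was elicited."""
    return rng.random() < TOY_RATES[cell]


def thompson_select(cells, trial, rounds, seed=0):
    """Return how often each cell was tried under Thompson sampling."""
    rng = random.Random(seed)
    wins = {c: 0 for c in cells}
    losses = {c: 0 for c in cells}
    for _ in range(rounds):
        # Draw a plausible success rate for each cell from its Beta posterior.
        sampled = {c: rng.betavariate(wins[c] + 1, losses[c] + 1) for c in cells}
        cell = max(sampled, key=sampled.get)
        if trial(cell, rng):
            wins[cell] += 1
        else:
            losses[cell] += 1
    return {c: wins[c] + losses[c] for c in cells}


pulls = thompson_select(list(TOY_RATES), toy_proxy_trial, rounds=2000)
best_cell = max(pulls, key=pulls.get)
```

The most-tried cells would then be the candidates to evaluate on the real backdoor, which is the transfer step the abstract reports.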

[AI-86] SFBench: The SciFy Scientific Feasibility Benchmark

Link: https://arxiv.org/abs/2606.29630
Authors: Cash Costello, James Mayfield, Elsbeth Turcan, Christine Piatko, Christina K. Pikas, Justin Rokisky, Sam Scheck, Chris Ribaudo, Ritwik Bose, Alex Memory
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:We present SFBench, a benchmark dataset for evaluating systems that assess the feasibility of scientific claims. SFBench includes 197 claims in materials science, each annotated with a ground-truth feasibility score on a five-point scale along with an explanation of that assessment. The collection differs from previous collections in several important ways: 1) it defines a complex task that requires reasoning over claims of varying scientific feasibility; 2) its claims are not extracted from existing scientific publications but are created de novo, greatly reducing the chances that LLMs have trained on them; 3) claims and ground truth are established by subject matter experts, not by artificial intelligence; and 4) unlike many benchmarks that ask about question/answer pairs, provide multiple choice answers, or ask questions requiring short, fixed answers, SFBench explanations are completely open-ended. We describe the benchmark design, data creation process, and evaluation metrics, and we report baseline results using recent GPT models.

[AI-87] SCARCE: Scalable Cascade Analysis for Rare-event Characterisation via Embeddings

Link: https://arxiv.org/abs/2606.29623
Authors: Yingjie Wang, Yi Dong, Edmund Lau, Jie Meng, Taylor T Johnson, Xiaowei Huang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 23 pages, 11 figures, 5 tables

Abstract:Rare events govern the safety profile of modern AI systems, yet their probabilities are extremely difficult to estimate: direct Monte Carlo requires prohibitive sample budgets. Subset Simulation (SS) addresses this by decomposing a rare-event probability into moderate conditional probabilities over nested intermediate events. However, classical SS requires a handcrafted scalar performance function whose sublevel sets define those events, demanding detailed knowledge of the failure geometry and limiting transfer to new domains. We propose SCARCE (Scalable Cascade Analysis for Rare-event Characterisation via Embeddings), which replaces the performance function with learned latent representations and geometric rulers that score proximity to failure regions. Adaptive thresholding constructs nested intermediate events directly from data. We formalise SCARCE through a non-negative supermartingale, yielding a high-probability upper envelope that remains valid under early stopping. On MNIST misclassification, where dense Monte Carlo provides ground truth, SCARCE achieves approximately 400–500 times lower mean absolute error than grid-searched traditional SS while eliminating systematic over-counting. We then study PAIR-style LLM jailbreaks under a fleet-level threat model with adversarial fraction $\eta$. On Llama-Guard-3-8B hidden states, a PCA-based ruler attains 2.6% mean relative error for $\eta \geq 10^{-3}$ against finite-sample references whose average bootstrap relative half-width is 27.9%, and transfers to a GCG-style corpus with 2.93% relative error after recalibration. A directional criterion $\mathrm{KL}(p_{\mathrm{good}} \,\|\, p_{\mathrm{bad}})$ ranks rulers consistently with estimation error (Spearman $\rho = 0.83$).
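
The decomposition that Subset Simulation relies on can be written out as a standard identity. The notation below ($F_i$ for the nested intermediate events, $m$ for the number of levels, $p_0$ for the per-level probability) is introduced here for illustration and is not taken from the paper.

```latex
% Nested events: F_1 \supset F_2 \supset \dots \supset F_m = F
P(F) \;=\; P(F_1) \prod_{i=1}^{m-1} P(F_{i+1} \mid F_i)

% Example: with every factor equal to p_0 = 0.1 and m = 6 levels,
% P(F) = 0.1^{6} = 10^{-6}.
```

Each factor is a moderate probability that can be estimated with far fewer samples than the rare-event probability itself, which is what makes the product form attractive.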

[AI-88] Does Role Specialization Matter for Explanation Faithfulness in Mixture-of-Experts?

Link: https://arxiv.org/abs/2606.29613
Authors: Yeji Kim, Housam Babiker, Mi-Young Kim, Randy Goebel
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Mixture-of-Experts (MoE) architectures have recently been extended with role-based mechanisms for interpretability. This is typically done by assigning semantic roles to individual expert components, for example roles like synergy, redundancy, and uniqueness in multimodal settings. However, whether such structural role decomposition preserves explanation faithfulness of the overall architecture remains largely underexplored. We hypothesize that inter-expert representation overlap weakens effective role separation and degrades attribution-based faithfulness, even when semantic roles are explicitly defined. To address this limitation, we introduce representation-level decorrelation regularization to explicitly reduce inter-expert similarity in latent space. Using representation decorrelation objectives, we encourage clearer specialization among experts by minimizing representation overlap. Our experiments show that across multiple multimodal benchmarks, this separation consistently improves explanation faithfulness, as measured by comprehensiveness, sufficiency, and their Area Over the Perturbation Curve (AOPC) summaries, while preserving task performance. We further show that these improvements are not limited to role-based architectures such as Interpretable Multimodal Interaction-aware MoE (I2MoE). Similar trends are observed in a standard sparse MoE baseline, suggesting that representation-level separation may provide a more general mechanism for enhancing explanation faithfulness in MoE systems. Overall, our findings suggest that structural role decomposition alone may be insufficient to guarantee faithful explanations and that representation-level separation helps improve explanation faithfulness. To support reproducibility, the source code and supplementary material are publicly available at this https URL.
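
One common way to express a representation-level decorrelation penalty is the mean squared cosine similarity between expert outputs. The sketch below assumes that form purely for illustration; the abstract does not state which decorrelation objective the authors use, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def decorrelation_penalty(expert_reprs):
    """Mean squared cosine similarity over all distinct expert pairs.

    The value is 0 when experts are mutually orthogonal and 1 when all
    experts point in the same (or opposite) direction.
    """
    n = len(expert_reprs)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    total = sum(cosine(expert_reprs[i], expert_reprs[j]) ** 2 for i, j in pairs)
    return total / len(pairs)


separated = decorrelation_penalty([[1.0, 0.0], [0.0, 1.0]])
overlapping = decorrelation_penalty([[1.0, 2.0], [2.0, 4.0]])
```

Adding such a term to the task loss with a small weight would push experts toward less overlapping representations, which is the effect the abstract associates with improved faithfulness.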

[AI-89] Mechanistically Eliciting Latent Behaviors in Language Models

Link: https://arxiv.org/abs/2606.29604
Authors: Andrew Mack, Nina Panickssery, Alexander Matt Turner
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:We aim to discover diverse, generalizable perturbations of LLM internals that can surface hidden behavioral modes. Such perturbations could help reshape model behavior and systematically evaluate potential risks. We introduce Causal Perturbative Elicitation (CPE), an unsupervised method for discovering interpretable low-rank adapters (LoRAs) that can elicit these latent behaviors. CPE decomposes the computations of a deep transformer slice using a heuristic tensor-decomposition-based algorithm. CPE exhibits remarkable data efficiency, learning a large number of interpretable LoRAs from a single example. Even though CPE is unsupervised, we find that in some cases it can be competitive with supervised elicitation methods via brute-force enumerative search over weight space. For instance, CPE performs similarly to matched-wall-clock-time GRPO on the Countdown task for Qwen3-8B (85% vs 87%), demonstrating that CPE can efficiently elicit complex multi-token behaviors. Since CPE is unsupervised, it can also surface hidden failure modes, such as sandbagging, restoring 85% of locked BigCodeBench performance on a password-locked version of Llama3-70B introduced by Taylor et al. (2025). Additionally, since CPE explores behaviors in weight-space rather than token-space it can potentially ameliorate exploration hacking, a misalignment failure which may arise in sufficiently self-aware AI models (Ngo, 2022). In fact, we find that CPE virtually eliminates alignment-faking (Greenblatt et al., 2024) behavior in a Llama3-70B-based model organism developed by Hughes et al. (2025). Finally, we find that CPE can be used to initialize GPT-OSS-20B in an aligned basin when running GRPO on an environment prone to reward-hacking. By providing a data-efficient method to systematically explore the space of latent model behaviors, CPE yields a powerful tool for aligning AI systems and evaluating their safety.

[AI-90] How AI settled the complexity of the oldest SGD algorithm

Link: https://arxiv.org/abs/2606.29593
Authors: Michał Dereziński, Xiaoyu Dong
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:

Abstract:In 1937, Stefan Kaczmarz proposed a simple algorithm for solving systems of linear equations. This algorithm turned out to be the earliest known example of stochastic gradient descent, a ubiquitous computing paradigm that drives the training of modern AI models such as ChatGPT and Gemini. Now, those AI models have joined forces to discover the worst-case complexity of the Kaczmarz algorithm. This paper tells the story of how it happened.

[AI-91] Bilevel Optimization for Neural Architecture Search

Link: https://arxiv.org/abs/2606.29582
Authors: Abhishek Shukla, Ankur Sinha, Faiz Hamid
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 48 pages, 20 figures

Abstract:Bilevel optimization has become an influential and widely adopted framework for addressing hierarchical optimization problems in machine learning, providing an effective approach to modeling the interaction between two levels of optimization, with applications such as hyperparameter tuning, meta-learning, adversarial training, and data poisoning. Neural Architecture Search (NAS), a subfield of hyperparameter optimization, is a prime example of a bilevel optimization problem, with architecture parameters optimized at the outer-level and network weights optimized at the inner level. This paper presents a structured overview of NAS through the lens of bilevel optimization. We categorize existing NAS approaches into two main classes: sampling-based methods, which search optimal architectures using different architecture samplers, and bilevel theory-based methods, which solve the architecture search problem using bilevel optimization principles. We further highlight our current research direction, wherein the bilevel NAS formulation is addressed through an auxiliary mathematical programming framework. This framework enables the systematic integration of second-order information from the model’s training loss function and ensures the optimality of the model parameters while modifying architecture parameters. By simultaneously updating the architecture and model parameters along their respective optimal descent directions derived from the auxiliary mathematical program, these methods achieve more principled and theoretically consistent results. The same auxiliary program can also be used for simultaneous hyperparameter and model fine-tuning. A comparative analysis shows that bilevel theory-based approaches generally outperform sampling-based methods, both in accuracy and efficiency.
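
The bilevel structure described above is usually written as follows. The symbols ($\alpha$ for architecture parameters, $w$ for network weights) and the use of a validation loss at the outer level and a training loss at the inner level follow the common differentiable-NAS convention; they are assumptions here, since the abstract does not fix a notation.

```latex
\min_{\alpha} \; \mathcal{L}_{\mathrm{val}}\bigl(w^{*}(\alpha), \alpha\bigr)
\quad \text{s.t.} \quad
w^{*}(\alpha) \in \arg\min_{w} \; \mathcal{L}_{\mathrm{train}}(w, \alpha)
```

The outer problem picks the architecture, while the inner problem requires the weights to be optimal for that architecture, which is why changing $\alpha$ without re-optimizing $w$ breaks the formulation.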

[AI-92] The Joint Effect of Quantization and Sampling Temperature on LLM Safety Alignment: A Factorial Analysis

Link: https://arxiv.org/abs/2606.29581
Authors: Hari Prasad, Ritam Pal
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 3 Figures

Abstract:Modern LLM deployments routinely compress models and raise sampling temperature to reduce cost, latency, or repetition, yet safety evaluations usually treat these choices as fixed implementation details. This leaves a practical uncertainty: does a model that is safe at FP16 and greedy decoding remain safe after it is quantized and sampled stochastically, or do the two deployment knobs amplify one another? We study this question with a factorial evaluation of 9 instruction-tuned models from six families, 3 precisions (FP16, GPTQ INT8, AWQ INT4), and 6 temperatures ($T=0$ to $1.0$), yielding 161 configurations and $\approx$322k responses judged by a six-model safety ensemble. Contrary to the concern that low-bit deployment broadly erodes alignment, standard non-adversarial quantization is usually safety-neutral: INT4 keeps or lowers attack success for 7 of 9 models, with clear degradation concentrated in the weakest baseline model, SmolLM3-3B ($18.5\% \to 36.0\%$). The larger risk comes from sampling: higher temperature sharply increases decision instability for vulnerable models, with DFR reaching 53.0% at $T=1.0$, even when average ASR changes modestly. Finally, the interaction is not a “double penalty”: our Compound Degradation Index remains largely sub-additive ($-0.195$ to $+0.045$), indicating that quantization and temperature do not systematically compound. These results suggest a deployment rule of thumb: standard INT4/INT8 quantization can be reasonable for strongly aligned models, but safety claims at elevated temperature should report multi-sample stability, not only average attack success.

[AI-93] TF-MoE: Time-Frequency Mixture-of-Experts for Efficient Speech Separation

Link: https://arxiv.org/abs/2606.29575
Authors: Qinzhe Hu, Chenda Li, Wangyou Zhang, Shujie Liu, Yan Lu, Yanmin Qian
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Accepted to INTERSPEECH 2026, 6 pages, 2 figures, 3 tables

Abstract:Recent advances in speech separation (SS) have led to compact front-end models with small parameter sizes, yet their high computational cost remains a major barrier for deployment on edge devices. To address this, we propose TF-MoE, a sparse Mixture-of-Experts (MoE) framework that enhances model capacity with almost no increase in inference cost. Our method introduces dynamic expert specialization in time and frequency dimensions through alternating time-wise and frequency-wise MoE modules, each dynamically selecting experts per frame or mel band. Built upon a mel-band-splitting Conformer backbone, TF-MoE achieves strong performance on SS tasks under low-compute settings. Experimental results demonstrate that TF-MoE consistently improves separation performance under computation cost constraints, outperforming BSRNN by +3.8 dB SDR on Libri2Mix with comparable 4.1 GMACs/s inference cost. This positions TF-MoE as a promising candidate for edge-device deployment.

[AI-94] Proteus: Automated Adversarial Robustness Testing for Audio Deepfake Detectors

Link: https://arxiv.org/abs/2606.29544
Authors: Nicolas M. Müller, Aditya Tirumala Bukkapatnam, Zohaib Ahmed
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present Proteus, a framework developed at Resemble AI for automated robustness testing of our audio deepfake detection system. Given a detector, Proteus systematically searches over sequences of everyday audio transformations (codec transcoding, additive noise, reverberation, dynamic-range compression, and VoIP simulation) to find combinations that fool the detector while preserving speech quality. We propose two complementary search strategies: (1) a breadth-first search that exhaustively maps augmentation effectiveness across the parameter space, and (2) a Q-learning agent designed to efficiently discover deeper attack chains by exploiting structural patterns in the BFS data. We report findings from continuous deployment of Proteus against our production detector, showing that specific augmentation chains can reliably flip detection verdicts while preserving speech intelligibility and speaker identity. We discuss how these findings are used to harden the detector through targeted retraining.

[AI-95] Learned Coordination Conventions in Cooperative MARL: Measuring the Translation Gap Between Theory-Informed Roles and Learned Routing ICML2026

Link: https://arxiv.org/abs/2606.29541
Authors: Yoosung Hong
Subjects: Artificial Intelligence (cs.AI)
Comments: 8 pages, 1 figure. Poster Accepted at NExT-Game 2026: New Frontiers in Game-Theoretic Learning, ICML 2026 Workshop

Abstract:Role-semantic assignments provide priors over how heterogeneous agents may coordinate, but cooperative MARL systems instead settle on conventions through decentralized, non-stationary learning, with no guarantee that the resulting structure matches those priors. We study this translation gap between theory-informed role expectations and learned coordination structure through a diagnostic combining a role-routing matrix, formation sensitivity ($\Delta_{\max}$), and gradient/occlusion attribution across three-role MiniGrid and SMACv2 (Terran) environments. We show that label-conditioned attention produces substantially more concentrated and role-specific routing than flat MLP baselines, remains stable under 3v3–9v9 scaling, transfers zero-shot across team sizes, and is invariant to ally-slot padding. A 5-seed re-evaluation shows partial alignment between learned conventions and designer-specified priors while revealing where small-n noise can manufacture apparent strategic divergence. We present these results as an empirical framework for measuring coordination structure in cooperative MARL rather than as a new equilibrium concept or causal explanation.

[AI-96] RESOURCE2SKILL: Distilling Executable Agent Skills from Human-Created Multimodal Resources

Link: https://arxiv.org/abs/2606.29538
Authors: Yijia Fan, Zonglin Di, Zimo Wen, Yifan Yang, Mingxi Cheng, Qi Dai, Bei Liu, Kai Qiu, Yue Dong, Ji Li, Chong Luo
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Skills are a useful abstraction for software agents, turning human and agent experience into reusable procedural knowledge. Yet existing skill libraries are mostly hand-written, text-centric, or derived from agent traces, leaving tutorial videos and other multimodal human resources largely underused. We present RESOURCE2SKILL, a framework that distills multimodal resources, including tutorial videos, repositories, articles, and reference artifacts, into executable skills for software agents. RESOURCE2SKILL organizes these skills as a hierarchical multimodal Skill Wiki, where each entry combines structured text, code, visual examples, metadata, and provenance. This design preserves complementary signals from different resources: videos capture temporal operations and visual effects, code captures executable tool patterns, and articles or artifacts provide conceptual and stylistic grounding. At inference time, agents retrieve and compose relevant skills from the wiki; when coverage is insufficient, the same construction operator can acquire new skills online. Across seven practical authoring domains, RESOURCE2SKILL improves average overall score by +11.9 percentage points over no-skill agents and outperforms strong harness baselines in 26 of 28 main-aggregate model-domain cells. Ablations confirm the value of multimodal skill format, hierarchical organization, source diversity, selection strategy, and online acquisition.

[AI-97] OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

Link: https://arxiv.org/abs/2606.29537
Authors: Mengqi Yuan, Zilong Zhou, Xinzhuang Xiong, Weiming Wu, Jiayang Sun, Jiamin Song, Kaiqian Cui, Bowen Wang, Haoyuan Wu, Yitong Li, Dunjie Lu, Haikong Lu, Qi Zhen, Xinyuan Wang, Jiaqi Deng, Yuhao Yang, Cheng Chen, Boyuan Zheng, Alex Su, Xiao Yu, Hao Zou, Saaket Agashe, Xing Han Lu, Manpreet Kaur, Zhengyang Qi, Vincent Sunn Chen, Frederic Sala, Dayiheng Liu, Junyang Lin, Zhou Yu, Yu Su, Siva Reddy, Xin Eric Wang, Peng Qi, Tianbao Xie, Tao Yu
Subjects: Artificial Intelligence (cs.AI)
Comments: 68 pages, 42 figures. Equal contribution: Mengqi Yuan, Zilong Zhou, and Xinzhuang Xiong

Abstract:Existing computer-use benchmarks fail to capture the realism, complexity, and long-horizon demands of real-world computer use, limiting their ability to reveal the limitations of frontier agents. We introduce OSWorld 2.0, a benchmark of 108 long-horizon computer-use workflows across everyday and professional tasks, designed to capture complex and challenging real-world phenomena. Each task represents a realistic end-to-end workflow that takes human users a median of about 1.6 hours to complete and requires an average of 318 tool calls with Claude Opus 4.7 using maximum thinking, compared with about 30 in OSWorld 1.0. OSWorld 2.0 targets challenge phenomena that are common in real workflows yet underrepresented in prior benchmarks, spanning interaction-design challenges such as streaming interaction and dynamic environments, as well as agent-pattern challenges such as cross-source reasoning, implicit-state inference, and visual-spatial precision. Tasks are grounded in authentic input artifacts and cross-referenced against realistic stateful user profile data, and include separate safety reports auditing safety-sensitive execution. Under our primary binary-completion metric at 500 steps, Claude Opus 4.8 with maximum thinking and batched tool calls scores best but still completes only 20.6% of tasks at a 54.8% partial score; GPT-5.5 is far more token-efficient yet plateaus near 13%. These results show that current agents are still far from professional-level computer use: rather than stumbling on basic GUI control or coding, they lose track of constraints, miss information that arrives mid-task, guess rather than ask the user, and skip verification, struggling most when a task hinges on hidden state they must recover.

[AI-98] SemJoin: Semantic Join Optimization VLDB2026

Link: https://arxiv.org/abs/2606.29532
Authors: Christopher Gou, Aditya Banerjee, Jiaxuan Wang, Chunwei Liu
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: 7 pages, submitted to VLDB 2026 Workshop: NOVAS

Abstract:Integrating unstructured data into relational database systems is increasingly important as demand grows for natural language querying and analysis. A semantic join, joining two tables under a natural-language predicate, can be evaluated with a large language model (LLM), but comparing every pair of tuples requires O(M x N) LLM invocations and is cost-prohibitive at scale. Existing systems reduce this cost but typically commit to a single fixed strategy (e.g., embedding similarity or one batched scheme) regardless of the data or the join predicate. We propose an LLM-agent-based decision pipeline that optimizes semantic joins by matching the execution strategy to the characteristics of the underlying tables. An LLM advisor routes each join to one of two strategies: a Cluster Join, which prunes candidates via unsupervised embedding clustering and sample-based filtering, or a Classifier strategy for predicates that reduce to a shared discrete label set. Across three diverse datasets (IMDb reviews, email contradictions, and Stack Overflow tags), the advisor consistently identifies the optimal execution strategy for each workload. This dynamic routing proves decisive: it outperforms adaptive block join (ABJ) by 20-33 F1 points across all datasets while consuming fewer tokens on two of the three, and achieves higher F1 scores than featurized-decomposition join (FDJ) at one to two orders of magnitude lower token cost.
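
The cost argument above can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not the paper's pipeline: `cluster_join`, its parameters, and the "probe a few pairs per cluster pair, then expand only the cluster pairs that hit" rule are hypothetical stand-ins for the Cluster Join's clustering and sample-based filtering, and `predicate` stands in for one LLM invocation.

```python
import random
from collections import defaultdict
from itertools import product


def cluster_join(left, right, left_clusters, right_clusters, predicate,
                 samples=2, seed=0):
    """Hypothetical cluster-pruned join; returns (matches, predicate_calls).

    left/right are lists of hashable rows, *_clusters are parallel lists of
    cluster ids, and predicate(a, b) stands in for one LLM call.
    """
    rng = random.Random(seed)
    groups_l, groups_r = defaultdict(list), defaultdict(list)
    for row, cluster in zip(left, left_clusters):
        groups_l[cluster].append(row)
    for row, cluster in zip(right, right_clusters):
        groups_r[cluster].append(row)

    matches, calls = set(), 0
    for cl, cr in product(groups_l, groups_r):
        pairs = list(product(groups_l[cl], groups_r[cr]))
        # Probe a few sampled pairs before paying for the whole block.
        probe = set(rng.sample(pairs, min(samples, len(pairs))))
        probe_hits = set()
        for pair in probe:
            calls += 1
            if predicate(*pair):
                probe_hits.add(pair)
        if not probe_hits:
            continue  # prune this cluster pair (may lose recall)
        matches |= probe_hits
        for pair in pairs:
            if pair in probe:
                continue
            calls += 1
            if predicate(*pair):
                matches.add(pair)
    return matches, calls


left = ["cat-a", "cat-b", "dog-a", "dog-b"]
right = ["cat-x", "cat-y", "dog-x", "dog-y"]
same_prefix = lambda a, b: a[:3] == b[:3]
matches, calls = cluster_join(left, right, [0, 0, 1, 1], [0, 0, 1, 1], same_prefix)
print(len(matches), calls, len(left) * len(right))
```

On this toy input the pruned join issues fewer predicate calls than the full M x N comparison. The saving comes at the risk of missing matches in a pruned cluster pair, which is why a sampling step like this trades recall for cost.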

[AI-99] SAKE: Software Architectural Knowledge Evaluation Benchmark for Large Language Models

Link: https://arxiv.org/abs/2606.29520
Authors: Tiziano Santilli, Francesco Daghero, Mayhar Tourchi Moghaddam
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: 25 pages

Abstract:Large Language Models (LLMs) are increasingly used as assistants across the software development lifecycle, yet their ability to reason about software architecture remains largely unmeasured. Architectural decision-making depends on quality attribute trade-offs, design patterns, and system-level constraints, none of which are exercised by benchmarks that target syntactic or algorithmic tasks. We introduce SAKE (Software Architectural Knowledge Evaluation), a standardized and reproducible benchmark for assessing software architectural knowledge in LLMs. SAKE comprises 2154 expert-curated multiple-choice questions, each with four options, stratified across eight architectural categories and four context-length levels. We evaluate 11 proprietary and open-weight models in zero-shot and five-shot settings. Overall accuracy is high, but performance varies markedly across categories, revealing competency gaps in areas central to professional practice. SAKE, its evaluation scripts, and all results are released as open source to give the community a baseline for tracking architectural reasoning in LLMs.

[AI-100] Cognitive World Models for Process-Level Social Influence Evaluation

Link: https://arxiv.org/abs/2606.29495
Authors: Minghui Ma, Bin Guo, Han Wang, Mengqi Chen, Jingqi Liu, Yan Liu, Zhiwen Yu
Categories: Artificial Intelligence (cs.AI)
Comments: 23 pages, 9 figures

Abstract:Social influence dialogue changes user behavior by altering internal cognitive states. The central evaluation question is whether the user’s beliefs, desires, intentions, and emotions measurably change over the course of conversation, a process-oriented criterion that neither surface-level text metrics (BLEU/ROUGE) nor single-score LLM judgments can capture. We propose the Cognitive World Model (CogWM), an LLM-based user model that reframes multi-turn dialogue evaluation from “what did the user say” to “how did the user’s internal cognitive state evolve.” CogWM jointly predicts BDI/E cognitive states and user utterances and serves as both a user simulator and an evaluation platform, using a three-tier evaluation framework that covers turn-level fidelity, trajectory-level state dynamics, and task-level composite scoring. Trained via our Summarize-and-Allocate (SaA) annotation pipeline on 150,454 user-turn samples across four social influence scenarios, CogWM achieves 77.6% emotion accuracy ($2.1\times$ over GPT-5.5). In 3600 multi-agent discrimination trials, it distinguishes six commercial agents by their cognitive influence, with Llama-4-Scout ranking first (CTS +0.233). CogWM moves social influence dialogue evaluation from terminal judgment to process tracking. We have released our code (this https URL) and models (this https URL).

[AI-101] Faults in Our Formal Benchmarking: Dataset Defects and Evaluation Failures in Lean Theorem Proving ICML2026

Link: https://arxiv.org/abs/2606.29493
Authors: Pawan Sasanka Ammanamanchi, Siddharth Bhat, Stella Biderman
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at ICML 2026

Abstract:Benchmarks for LLM-assisted theorem proving in Lean are often treated as intrinsically reliable because every solved instance comes with a machine-checked proof. However, the kernel only checks that a proof establishes a formal statement; it does not verify that the statement faithfully encodes the intended informal problem, nor that evaluation harnesses are robust to trivial or adversarial solutions. We audit five widely used Lean theorem-proving benchmarks and their forks, using corpus-scale static checkers to surface 4,833 findings, including 398 mechanically certified issues such as counterexamples, vacuous theorems, and unsound axioms. We also document semantic defects such as missing hypotheses, problem simplification, incomplete or incorrect translations, and Lean-specific specification hazards. Beyond dataset construction, we survey evaluation-time failure modes and show, on corrected subsets, that defects can both inflate and deflate reported prover scores. We propose a fault taxonomy, a suite of automated checkers and recall-oriented semantic audit prompts, and release standards to guide the creation of formal math datasets and to make evaluation more reproducible and trustworthy. Our checkers, audit prompts, and corrected dataset snapshots are available at this https URL.
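
To make "vacuous theorem" and "specification hazard" concrete, here is a small Lean 4 illustration. Both statements are hypothetical examples written for this digest, not items taken from the audited benchmarks, and the abstract does not say which hazards the paper covers; truncated natural-number subtraction is simply a commonly cited one.

```lean
-- Vacuous: the hypothesis `h` can never hold for a natural number,
-- so the statement is provable whatever the conclusion says.
theorem vacuous_example (n : Nat) (h : n < 0) : n = 42 :=
  absurd h (Nat.not_lt_zero n)

-- Truncated subtraction: on `Nat`, `2 - 3` evaluates to `0`, not `-1`,
-- so an informal statement about integers can change meaning when
-- it is transcribed over `Nat`.
example : (2 : Nat) - 3 = 0 := by decide
```

The kernel accepts both proofs, which is the abstract's point: a machine-checked proof certifies the formal statement, not its faithfulness to the informal problem.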

[AI-102] Reported Confidence in LLMs Tracks Commitment More Than Correctness

Link: https://arxiv.org/abs/2606.29490
Authors: Dharshan Kumaran
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Confidence is an estimate of the probability that a chosen answer is correct. Verbal confidence reports are widely used as uncertainty measures in large language models, but whether they are best understood as estimates of correctness is unclear. We test this with a two-stage abstention paradigm from the neuroscience of perceptual decision making: a model first answers and reports its confidence, then decides whether to commit it to a user or abstain. Across four non-reasoning models, prompt framings, and confidence formats, verbal confidence predicted the commit/abstain decision substantially better than whether the answer was correct. Calibrated token log-probabilities showed the opposite profile, with abstention-prediction coupled to correctness discrimination, the signature of an answer-evidence signal. After removing the variance verbal confidence shared with log-probabilities, the residual stayed aligned with commitment while its link to correctness fell to near chance. The dissociation generalised to four reasoning models across four benchmarks of varying difficulty, from hard multiple-choice to frontier-level freeform questions. Mechanistic analyses in Gemma 3 and 4 were convergent: a post-answer state known to causally support verbal-confidence generation already encoded the future abstention decision before the abstention prompt, organised mainly by that decision rather than by correctness, the two lying in approximately orthogonal directions in activation space. Steering along a verbal-confidence-specific direction causally shifted abstention. Verbal and log-probability confidence are thus not interchangeable: log-probabilities track answer evidence and correctness, whereas verbal confidence is better understood as a behaviour-facing readout of an internal commit-readiness state, challenging the practice of treating verbal reports as proxies for reliability.

[AI-103] CRAFT: Counterfactual Credit Assignment from Free Sibling Rollouts for Self-Distilled Agentic Reinforcement Learning

Link: https://arxiv.org/abs/2606.29476
Authors: Zibin Meng, Kani Chen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Self-distilled agentic reinforcement learning augments trajectory-level reward with a token-level distillation loss, using as its teacher the same policy conditioned on privileged context. The prevailing recipe gates this loss by a single scalar, the teacher-student log-probability gap. This signal is doubly limited: it is retrospective, scoring only the realised rollout and never the counterfactual ones, and it is sign-blind, never signalling when a teacher-preferred action would have harmed the trajectory. We introduce CRAFT, a three-pillar credit-assignment scheme that addresses both limitations. Pillar 1, Counterfactual Token Importance, reuses the G-1 sibling rollouts that GRPO already samples and importance-weights them by the log-probability gap to form a self-normalised estimate of the group-level counterfactual change in advantage from up-weighting teacher-preferred actions at each step; this yields a signed per-token credit at near-zero extra compute. Pillar 2 is an asymmetric controller that raises the distillation weight as it lowers the reference-KL weight along an exponential moving average of gate activity, and conversely. Pillar 3 polarises the KL penalty token by token, switching between a mode-seeking and a mode-covering update according to the sign of the credit. Each pillar has an independent switch that, when disabled, renders the loss and gradient byte-identical to the baseline in IEEE-754 arithmetic, so any measured gain is attributable to algorithmic change rather than implementation drift. We prove the estimator’s consistency and a variance bound, give structural and bit-exact reproducibility guarantees, and evaluate CRAFT across three agentic environments, four model scales, and five end-to-end methods, plus two tabulated prior-work baselines. Among these is Adaptive-CRINGE, a comparator sharing Pillar 2 with CRAFT, isolating the counterfactual contribution.
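
The abstract does not spell out the Pillar 1 estimator, so the following Python sketch only illustrates the general idea of a self-normalised importance-weighted estimate over sibling rollouts. The function name `snis_advantage_shift`, the choice of `exp(gap)` as the weight, and the subtraction of the unweighted group mean are all assumptions made for illustration.

```python
import math


def snis_advantage_shift(advantages, log_gaps):
    """Self-normalised importance-weighted shift in mean advantage.

    advantages: one scalar advantage per sibling rollout.
    log_gaps:   hypothetical teacher-minus-student log-probability gap per
                sibling; larger means the sibling looks more teacher-preferred.
    Returns the weighted mean advantage minus the plain group mean, so the
    sign says whether leaning toward teacher-preferred siblings helps or hurts.
    """
    shift = max(log_gaps)  # subtract the max for numerical stability
    weights = [math.exp(g - shift) for g in log_gaps]
    weighted = sum(w * a for w, a in zip(weights, advantages)) / sum(weights)
    baseline = sum(advantages) / len(advantages)
    return weighted - baseline


# Teacher-preferred sibling has the higher advantage: positive credit.
print(snis_advantage_shift([1.0, -1.0], [1.0, 0.0]))
# Teacher-preferred sibling has the lower advantage: negative credit.
print(snis_advantage_shift([-1.0, 1.0], [1.0, 0.0]))
```

Because the estimate is signed, it can in principle flag cases where following the teacher would have harmed the trajectory, which is the "sign-blind" limitation the abstract attributes to a scalar gap gate.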

[AI-104] Agent-Computer Observation Interfaces Enable Dynamic Computer Use

Link: https://arxiv.org/abs/2606.29472
Authors: Bojie Li, Noah Shi
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:SWE-agent established the action interface as an underexplored design axis for software-engineering agents; we make the analogous case for the observation interface in computer-use (CU) agents. Current CU agents, closed and open-source alike, tie observation to action (one screenshot every 3-5 s, no audio), leaving them blind and deaf between screenshots to video, animations, transient UI events, meetings, and spoken instructions. We introduce the Agent-Computer Observation Interface (AOI), a model-agnostic perception layer that decouples continuous, adaptive observation from discrete actions through three gated components: inter-step keyframe capture, volume-gated audio transcription, and CU-model-generated visual narration that persists as text. Each produces almost nothing on static, silent content, reducing to the standard loop without degrading it. On DynaCU-Bench (100 dynamic browser tasks plus a 50-task static control), CU models from 7B to frontier scale gain +17 to +48 pp over their screenshot baselines with zero retraining, turning tasks that are near-impossible from periodic screenshots into largely solved ones. The gap is starkest on audio: on a spoken-content subset AOI agents solve every task, whereas streaming voice models hear accurately but cannot act on what they hear without the scaffold. The decomposition is as informative as the headline gain: keyframe selection turns out not to matter (the value comes from narrating captured frames into persistent text), and the interface is not a fixed bundle, since on a newer model (Gemini 3 Flash) the keyframe stream actively regresses through image-token dilution, so its components must be selected per model rather than shipped as one configuration.
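
A volume gate of the kind described above could look like the following Python sketch. It is a minimal illustration, not the AOI implementation: `rms`, `gated_transcribe`, the RMS loudness measure, and the `threshold` value are assumptions, and `transcribe` is a placeholder for whatever speech-to-text call a real system would use.

```python
import math


def rms(samples):
    """Root-mean-square loudness of one audio chunk (floats in [-1, 1])."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def gated_transcribe(chunks, transcribe, threshold=0.05):
    """Transcribe only chunks loud enough to pass the gate.

    Returns (chunk_index, text) pairs; silent chunks produce nothing, so on
    silent content the agent's observation stream is unchanged.
    """
    notes = []
    for index, chunk in enumerate(chunks):
        if rms(chunk) >= threshold:
            notes.append((index, transcribe(chunk)))
    return notes


chunks = [[0.0] * 4, [0.5, -0.5, 0.5, -0.5], [0.01] * 4]
print(gated_transcribe(chunks, lambda c: f"speech in {len(c)} samples"))
```

Only the loud middle chunk reaches the transcriber here, which mirrors the abstract's claim that each gated component produces almost nothing on static, silent content.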

[AI-105] How Much Due Diligence Before You Bid? Learning in Intractable Takeover Auctions

Link: https://arxiv.org/abs/2606.29457
Authors: Zain Naboulsi
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments: 21 pages, 13 figures, 2 tables. Code and data: this https URL

Abstract:When two companies bid to buy the same target, no one knows exactly what the target is worth. Each bidder pays for due diligence: costly, imperfect homework that sharpens its own private estimate before it bids. How much of that homework is worth buying? We build a simple computer model of the bidding contest and let it teach itself to bid well by playing against itself, the way a game engine learns chess. The economic question, how much diligence pays for itself, and the computational question, when the contest becomes too complex to solve exactly, are both controlled by a single thing: how many pieces of private information a bidder carries. Our main finding is that the right amount of diligence is modest and finite. It falls as diligence gets more expensive, and it falls further when both sides are doing their homework, because competition erodes the value of knowing more. We also test a recent claim from AI research: that simple, general self-play methods can rival the specialized, expensive algorithms usually built for games like these. Running on an ordinary laptop with no costly frontier AI, we find the simple methods are the best of the self-learning approaches, though purpose-built exact methods still win whenever the game is small enough to solve outright. The simple methods earn their keep only once the game grows too large to solve exactly, which is the regime real deals live in, and there we show they still find strong bidding strategies. The contribution is threefold: a cheap, reproducible way to study deal-making under uncertainty; a concrete, model-based answer to how much due diligence is worth buying; and evidence about when lightweight, general-purpose AI is good enough to replace specialized methods. We release all the games, code, and experiments.

[AI-106] FADE: Mitigating Hallucinations by Reducing Language-Prior Dominance in Large Vision-Language Models

Link: https://arxiv.org/abs/2606.29431
Authors: Yichen Guo, Kai Tang, Fenglai Lin, Yiding Sun, Dongshuo Zhang, Wenya Wang, Lin William Cong, Shanghang Zhang
Categories: Artificial Intelligence (cs.AI)
Comments: 18 pages, 5 figures, 27 tables

Abstract:Despite the impressive capabilities of Large Vision-Language Models (LVLMs), they remain susceptible to hallucination, generating content inconsistent with the input image. Recent studies attribute this to the dominance of language priors over visual inputs and employ contrastive decoding methods to mitigate this dominance, but the mechanistic origin remains unexplored. We investigate the information flow through each transformer layer and find that attention modules consistently aggregate visual evidence, while FFN modules at critical layers act as the source of language priors. These priors can override visual evidence, causing correct predictions in intermediate layers to drift toward incorrect outputs. Based on this insight, we propose FADE (FFN Attenuation for DEcoding), a training-free method that attenuates FFN outputs to reduce language-prior dominance. Evaluations on POPE, CHAIR, and MME benchmarks across LLaVA-1.5, mPLUG-Owl2, and InstructBLIP show that FADE effectively mitigates hallucinations while preserving inference efficiency.
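
The idea of attenuating FFN outputs can be sketched on a toy residual block in Python. This is a simplified illustration, not FADE itself: the single scalar `alpha`, the function name `block_forward`, and the toy `attn` and `ffn` callables are assumptions, and the abstract does not say how the critical layers or the attenuation strength are chosen.

```python
def block_forward(x, attn, ffn, alpha=1.0):
    """One pre-built transformer-style block on a plain list of floats.

    h   = x + attn(x)          (attention branch, left untouched)
    out = h + alpha * ffn(h)   (FFN branch, scaled by alpha <= 1)
    alpha = 1.0 reproduces the ordinary block; smaller values weaken the
    FFN contribution relative to what attention aggregated.
    """
    h = [xi + ai for xi, ai in zip(x, attn(x))]
    f = ffn(h)
    return [hi + alpha * fi for hi, fi in zip(h, f)]


toy_attn = lambda v: [0.5] * len(v)       # stands in for visual evidence
toy_ffn = lambda v: [2.0 * vi for vi in v]  # stands in for a language prior

print(block_forward([1.0, 2.0], toy_attn, toy_ffn, alpha=1.0))
print(block_forward([1.0, 2.0], toy_attn, toy_ffn, alpha=0.5))
```

Since only a scale factor changes, a scheme like this needs no training and adds essentially no inference cost, which fits the abstract's "training-free" and "preserving inference efficiency" claims.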

[AI-107] LLM-Guided Planning for Multi-hop Reasoning over Multimodal Nuclear Regulatory Documents ICML2026

Link: https://arxiv.org/abs/2606.29399
Authors: Mingyu Jeon, Bokyeong Kim, Suwan Cho, Jae Young Suh, Yonggyun Yu
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at the Second Workshop on Agents in the Wild: Safety, Security, and Beyond @ ICML 2026. 8 pages (main), 3 figures, 1 algorithm

Abstract:Reviewing nuclear regulatory documents requires multi-hop reasoning across tens of thousands of pages, where judgments depend on evidence assembled across multiple chapters. We frame this task as planning: an LLM-based agent observes the evidence collected so far, picks the next document fragment to inspect, and stops when the evidence is sufficient. The agent operates over a vectorless document tree using browse, read, and search tools, and maintains a dynamic knowledge graph (KG) as state. On a 200-question benchmark over NuScale Final Safety Analysis Report (FSAR) documents, the system reaches 81.5% accuracy with a RAGAS Faithfulness of 0.93. The dominant performance factor is planning: against PageIndex, which uses the same document tree without state-conditioned action selection, the gap is +38.0pp (43.5% to 81.5%, p<0.001). The system also outperforms LightRAG (73.0%, p<0.05), HippoRAG (70.5%, p<0.01), and GraphRAG (49.5%, p<0.001), and matches RAPTOR (75.5%, p=0.11) without offline indexing. Edge inference adds 2.8x cost without raising accuracy; we retain it as a traceability module. Of 7,391 inferred edges, 3 Violates edges (0.04%) flag scope boundaries (Q058) and partial conformance (Q176) as typed annotations that a human reviewer can audit.
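
The observe, pick, stop loop described above can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: `plan_and_read`, the flat `tree` dictionary, and the `choose_next` and `is_sufficient` callables replace the paper's LLM agent, document tree tools, and knowledge-graph state, which the abstract does not specify in detail.

```python
def plan_and_read(tree, choose_next, is_sufficient, max_steps=10):
    """State-conditioned reading loop.

    tree:          mapping from fragment id to fragment text.
    choose_next:   (tree, evidence, visited) -> next fragment id or None;
                   stands in for the LLM's action selection.
    is_sufficient: evidence -> bool; stands in for the stopping decision.
    Returns the evidence gathered, as (fragment_id, text) pairs.
    """
    evidence, visited = [], set()
    for _ in range(max_steps):
        if is_sufficient(evidence):
            break
        node = choose_next(tree, evidence, visited)
        if node is None:
            break  # nothing left worth inspecting
        visited.add(node)
        evidence.append((node, tree[node]))
    return evidence


tree = {"ch1": "pump specification", "ch5": "valve limits", "ch9": "site history"}
first_unvisited = lambda t, ev, seen: next((k for k in sorted(t) if k not in seen), None)
enough = lambda ev: len(ev) >= 2
print(plan_and_read(tree, first_unvisited, enough))
```

The key property is that each choice depends on the evidence collected so far; the abstract's PageIndex comparison uses the same document tree without that state-conditioned selection.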

[AI-108] Diagnosing and Repairing Factual Errors in RAG under Budget Constraints

Link: https://arxiv.org/abs/2606.29377
Authors: Soroush Hashemifar, Havva Alizadeh Noughabi, Fattane Zarrinkalam, Ali Dehghantanha
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Retrieval-Augmented Generation (RAG) improves the factuality of large language models by grounding responses in external evidence, yet real-world deployments remain fragile. Failures often stem from missing or weakly relevant evidence, as well as from generation that does not faithfully reflect the retrieved context. Many existing approaches rely on fine-tuning, privileged access to internal model signals, or resource-insensitive escalation strategies, which limits their practicality in black-box and budget-constrained settings. We propose D2R-RAG (Diagnose-to-Repair RAG), a model-agnostic and resource-aware framework that combines lightweight failure diagnosis with adaptive repair. D2R-RAG derives interpretable failure signatures from observable signals in the query, retrieved evidence, and generated response, and then selects from a small set of corrective actions under explicit latency and VRAM constraints. Experiments on FEVER and HotpotQA show that D2R-RAG improves reliability over recent baselines and achieves better accuracy–efficiency trade-offs across multiple compute budgets. The code is available at this https URL.
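
Selecting a corrective action under explicit latency and VRAM limits can be sketched as a filter followed by an argmax. The Python below is a hedged illustration only: the `Action` fields, the action names, the cost numbers, and the "highest expected gain among feasible actions" rule are assumptions, not D2R-RAG's actual action set or policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str            # hypothetical repair action
    latency_ms: float    # estimated added latency
    vram_gb: float       # estimated added GPU memory
    expected_gain: float  # estimated reliability improvement


def pick_action(actions, max_latency_ms, max_vram_gb):
    """Return the feasible action with the largest expected gain, or None."""
    feasible = [
        a for a in actions
        if a.latency_ms <= max_latency_ms and a.vram_gb <= max_vram_gb
    ]
    return max(feasible, key=lambda a: a.expected_gain, default=None)


actions = [
    Action("rewrite_query", 50, 0.0, 0.05),
    Action("rerank_evidence", 200, 2.0, 0.10),
    Action("regenerate_with_verifier", 900, 8.0, 0.20),
]
print(pick_action(actions, max_latency_ms=300, max_vram_gb=4))
print(pick_action(actions, max_latency_ms=1000, max_vram_gb=16))
```

With a tight budget the sketch falls back to a cheaper repair, and with no feasible action it returns `None`, leaving the original response untouched.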

[AI-109] When LLMs Develop Languages: Symbolic Communication for Efficient Multi-Agent Reasoning ICML2026

Link: https://arxiv.org/abs/2606.29354
Authors: Zhengqi Pei, Qingming Huang, Shuhui Wang
Categories: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: ICML2026 Regular paper

Abstract:Chain-of-Thought (CoT) improves large language models (LLMs) on difficult reasoning tasks, but it often incurs long natural-language rationales that are poorly aligned with efficient machine reasoning. We propose Communicative Language Symbolism Routing (CLSR), a test-time framework in which multiple LLM agents autonomously invent, evolve, and share compact Language Symbolism Frameworks (LSFs), while a latent-free router adaptively selects and composes these languages per query to optimize the accuracy-token trade-off. Unlike prompt optimization that refines surface instructions, CLSR treats each LSF as a reusable symbolic protocol with compact symbols, usage rules, and a message-passing contract, and improves it through an evolutionary loop driven by correctness and token cost. At inference time, the router may invoke a single low-cost LSF call, ensemble multiple LSFs, or execute a multi-round LSF composition protocol on harder queries. Across challenging benchmarks, CLSR reduces latency-oriented generated token completion by $3\sim 6\times$ compared to standard CoT while maintaining accuracy. We further derive an information-theoretic lower bound on token cost under arbitrary symbolism and show that, under an interpreter-realizability premise, multi-round LSF protocols conditionally subsume program-execution pipelines. Code is publicly available (this https URL).

[AI-110] Adaptive Financial Transformer with Regime-Gated Attention for Stock Return Prediction

Link: https://arxiv.org/abs/2606.29347
Authors: Dishan Sarkar
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 figures, 10 tables. PyTorch implementation and code available at: this https URL

Abstract:Adaptive Financial Transformer (AFT) is proposed for stock return prediction under non-stationary financial markets. The model incorporates a Market Regime Encoder, an Adaptive Gate Network, and an Adaptive Financial Context module to dynamically bias self-attention based on semantic relationships between financial indicators. Unlike conventional Transformer architectures that treat all input features uniformly, the proposed approach groups 95 engineered financial features into 11 semantic categories and adapts attention according to latent market regimes. The study also identifies and corrects sequence alignment and backtesting issues that can inflate reported trading performance, and introduces a financially-aware composite objective that jointly optimizes prediction error, directional accuracy, and non-overlapping Sharpe ratio. Extensive experiments compare the proposed architecture against classical machine learning models, recurrent neural networks, and Transformer baselines using chronological evaluation, five random seeds, ablation studies, hyperparameter optimization, explainability analysis, and multi-stock validation. Results demonstrate competitive predictive performance while reducing model complexity by 15.2% and improving parameter efficiency through feature selection, providing an interpretable Transformer architecture for financial time-series forecasting.

[AI-111] PHF: Privileged Hidden Flow for On-Policy Self-Distillation

Link: https://arxiv.org/abs/2606.29340
Authors: Yuhan Li, Mingxu Zhang, Dazhong Shen, Ying Sun
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages, 2 figures

Abstract:On-policy self-distillation (OPSD) trains a reasoning model on rollouts sampled from its own policy by matching a privileged teacher that also sees verified reference solutions. Existing OPSD objectives supervise only the output distribution, so privileged context affects training through a token-level divergence without directly supervising the internal computation that produced that distribution. We propose Privileged Hidden Flow (PHF), which additionally distills how a privileged teacher’s hidden states move along the same rollout. Rather than forcing each student hidden vector to match the teacher vector at the same token position, PHF aligns token-to-token transition directions and trajectory geometry over selected generated positions. The all-layer recipe also includes an adjacent-layer relation computed from these same transitions, without pointwise hidden-state imitation. Under the same 100-step training schedule, PHF improves the Average@12 aggregate over our reproduced OPSD baseline on Qwen3-1.7B, 4B, and 8B, with observed gains of about +2.2, +1.5, and +1.7 points. The transport objective is exactly invariant to shared trajectory offsets; its local geometry term is also invariant to orthogonal transformations of transition directions. Ablations distinguish the fixed PHF recipe from pointwise hidden-state matching, single-channel transition losses, and layer-subset choices, supporting PHF as a compact hidden-flow extension to OPSD.
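
The offset invariance mentioned above follows from working with token-to-token differences rather than the hidden states themselves. The Python sketch below demonstrates that property on toy vectors. It is an assumption-laden illustration: the cosine-based `direction_loss` is a stand-in chosen for simplicity, not PHF's transport objective or geometry term.

```python
import math


def transitions(states):
    """Token-to-token differences h[t+1] - h[t] along one rollout."""
    return [[b - a for a, b in zip(s0, s1)] for s0, s1 in zip(states, states[1:])]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def direction_loss(student, teacher):
    """Mean (1 - cosine) between matching student and teacher transitions."""
    pairs = list(zip(transitions(student), transitions(teacher)))
    return sum(1.0 - cosine(u, v) for u, v in pairs) / len(pairs)


teacher = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
# Same trajectory shape, shifted by a constant offset of (5, -3).
shifted_student = [[5.0, -3.0], [6.0, -3.0], [6.0, -2.0]]
# Same start, but every step points the opposite way.
reversed_student = [[0.0, 0.0], [-1.0, 0.0], [-1.0, -1.0]]

print(direction_loss(shifted_student, teacher))
print(direction_loss(reversed_student, teacher))
```

A pointwise hidden-state match would penalise the shifted student heavily, whereas a loss on transitions ignores the shared offset and only reacts when the direction of movement differs.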

[AI-112] AMR: Adaptive Modality Routing for Multimodal Polyglot Speaker Identification

Link: https://arxiv.org/abs/2606.29335
Authors: Chuxiao Zuo, Yao Zhu, Minqiang Xu, Manhong Wang, Yunke Zhang, Fei Huang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments:

Abstract:Multimodal speaker identification systems face two key challenges in real-world deployment: missing modalities and language mismatch between training and testing conditions. In practical scenarios, background multi-speaker conversations, ambient noise, and overlapping speech further degrade identification accuracy. To address these challenges, we propose a multimodal polyglot speaker identification system for the POLY-SIM 2026 Grand Challenge. The system is fundamentally built upon Adaptive Modality Routing (AMR), a modality fusion module that dynamically assesses per-sample input quality and integrates modality information. Specifically, AMR employs two modality adapters to process the embeddings extracted from a linguistically robust audio encoder (W2V-BERT 2.0) and a large-scale pretrained face encoder (IResNet-18), producing modality-adapted embeddings. Based on these adapted embeddings, a trainable router estimates dynamic modality weights, which are subsequently applied to aggregate the modality-specific logits for the final prediction. To optimize this routing mechanism, we adopt a modality-aware training strategy that constructs four types of sample pairs to simulate diverse input conditions, with KL divergence serving as explicit supervision for weight assignment. Experimental results on the POLY-SIM 2026 evaluation set show that the proposed system achieves identification accuracy of 99.93% (English multimodal, P3), 100.00% (Urdu multimodal, P5), 97.50% (English audio-only, P4), and 98.83% (Urdu audio-only, P6). The average accuracy across all four protocols is 99.07%, surpassing the Fusion and Orthogonal Projection (FOP) baseline by 32.73%.
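
The router-weighted aggregation of modality logits, and the reported four-protocol average, can both be checked with a short Python sketch. The fusion code is an illustration under assumptions: `fuse_logits`, the softmax over two router scores, and the use of negative infinity to represent a missing face stream are hypothetical choices, not the AMR architecture. The accuracy figures are the ones quoted in the abstract.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_logits(audio_logits, face_logits, router_scores):
    """Weight per-modality logits by softmax(router_scores) and add them."""
    w_audio, w_face = softmax(router_scores)
    return [w_audio * a + w_face * f for a, f in zip(audio_logits, face_logits)]


# Equal trust in both modalities.
print(fuse_logits([2.0, 0.0], [0.0, 2.0], [0.0, 0.0]))
# Face stream missing: a score of -inf gives it zero weight.
print(fuse_logits([2.0, 0.0], [0.0, 2.0], [0.0, float("-inf")]))

# Arithmetic check of the average over the four reported protocols.
protocol_acc = {"P3": 99.93, "P5": 100.00, "P4": 97.50, "P6": 98.83}
mean_acc = sum(protocol_acc.values()) / len(protocol_acc)
print(mean_acc)
```

The four reported accuracies sum to 396.26, so their mean is 99.065, which agrees with the stated 99.07% after rounding to two decimals.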

[AI-113] Hierarchical Experimentalist Agents

Link: https://arxiv.org/abs/2606.29315
Authors: Abhranil Chandra, Sankaran Vaidyanathan, Utsav Dhanuka, Varun Gandhi, Scott Niekum
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs) are increasingly used to take actions in the real world and support human decision-making, yet most agents rely on parametric knowledge, fixed post-training data, retrieval, or search. This paradigm breaks down in novel domains and for sophisticated queries that cannot be answered from prior knowledge alone. Knowing the laws of physics, for instance, does not by itself enable LLMs to answer queries or complete long-horizon tasks in a complex physical system. To address this, we introduce Hierarchical Experimentalist Agents (HExA), an in-context self-improvement framework to learn from active experimentation. HExA iteratively designs and refines query-relevant experiments, learns a reusable library of composable skills from experience, and integrates experimental evidence to answer queries or take actions. HExA is training-free, compatible with any black-box model, and does not require external supervision, oracles, or offline data. To evaluate active experimentation, we introduce Interphyre, a tool-calling benchmark built on the PHYRE 2D procedural physics environment, where agents propose interventions and test hypotheses through simulation APIs. Experiments show that current LLM agents struggle in these settings, especially on the hardest levels of Interphyre. Claude Sonnet 4.6 achieves only 2% success, while HExA improves the same model to up to 77% success. HExA also improves open-weight models and outperforms agentic baselines such as ReAct and Reflexion. Moreover, using only skills learned from easier levels and transferred without active experimentation, HExA achieves 44% success, demonstrating the reusability and generalization of its learned skills. Overall, HExA shows that learning through active experimentation can help agents discover useful knowledge, acquire reusable skills, and make efficient progress on novel long-horizon tasks.

[AI-114] Process Advantage Signal Shaping: A Paradigm-Agnostic Middleware for Process-Supervised RL in LLM Reasoners

Link: https://arxiv.org/abs/2606.29296
Authors: Chao Wang, Hongtao Tian, Tao Yang, Yunsheng Shi, Ting Yao, Wenbo Ding
Categories: Artificial Intelligence (cs.AI)
Comments: 19 pages, 3 figures

Abstract:Group Relative Policy Optimization (GRPO) is a default recipe for process-supervised reinforcement learning of LLM reasoners, and dense process supervision, via learned process reward models (PRMs) or on-policy-distillation KL signals, is a common way to densify its otherwise weak outcome reward. Layering such a step-level signal on top of GRPO’s group-standardized advantage, however, exposes three structural pathologies: channel contamination between the pooled process, outcome, and format streams at group standardization; resolution mismatch between the granularity of the process signal and the granularity of the logical decisions being credited; and a cumulative trap by which GRPO’s return-to-go sum surfaces either length inflation or truncated exploration depending on the sign regime of the signal. We propose PASS (Process Advantage Signal Shaping), a compact middleware that sits between any scalar step-level process signal and GRPO’s clipped surrogate and addresses the three pathologies in turn: Advantage Fusion standardizes the three streams independently within each group, Chunk-by-Value derives value-homogeneous chunks from the signal itself and broadcasts credit within each chunk, and Divide-Length converts the cumulative objective into an average-value-density score. We validate PASS across two domains and two process-signal paradigms (a learned PRM on mathematical reasoning and an on-policy-distillation KL signal, with a generalized variant, on multi-hop question answering) and under two group-standardization operators. In every regime PASS delivers a consistent pass@1 gain over the corresponding GRPO baseline.
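
Two of the three components lend themselves to a compact Python sketch: standardizing each reward stream separately before combining them, and dividing a cumulative signal by its length. The code is illustrative only. The function names, the z-score standardization, the choice to sum the standardized streams, and the toy numbers are assumptions; the abstract does not give PASS's exact formulas.

```python
from statistics import mean, pstdev


def standardize(values, eps=1e-8):
    """Z-score within one group of rollouts."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / (sd + eps) for v in values]


def fused_advantages(process, outcome, fmt):
    """Standardize each stream on its own, then add (assumed combination)."""
    streams = [standardize(s) for s in (process, outcome, fmt)]
    return [sum(parts) for parts in zip(*streams)]


def pooled_advantages(process, outcome, fmt):
    """Baseline for contrast: add raw streams first, then standardize."""
    totals = [p + o + f for p, o, f in zip(process, outcome, fmt)]
    return standardize(totals)


def value_density(step_signal):
    """Average per-step value instead of a length-dependent sum."""
    return sum(step_signal) / len(step_signal)


process = [10.0, 20.0, 30.0, 40.0]  # large-scale process signal
outcome = [1.0, 0.0, 0.0, 1.0]      # rollouts 0 and 3 were correct
fmt = [0.0, 0.0, 0.0, 0.0]
print(fused_advantages(process, outcome, fmt))
print(pooled_advantages(process, outcome, fmt))
print(value_density([0.5] * 4), value_density([0.5] * 2))
```

In the pooled version the large-scale process stream dominates, so a correct rollout can rank below an incorrect one. Standardizing per stream keeps the outcome signal visible, which is one way to read the "channel contamination" pathology.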

[AI-115] When Summaries Distort Decisions: Information Fidelity in LLM-Compressed Financial Analysis

Link: https://arxiv.org/abs/2606.29251
Authors: Hoyoung Lee, Suhwan Park, Seunghan Lee, Jun Seo, Jaehoon Lee, Sungdong Yoo, Minjae Kim, CheolWon Na, Zhangyang Wang, Zach Golkhou, Minkyu Kim, Sotirios Sabanis, Alejandro Lopez-Lira, Dhagash Mehta, Soonyoung Lee, Chanyeol Choi, Wonbin Ahn, Yongjae Lee
Categories: Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP)
Comments: Preprint

Abstract:Financial decision-makers face more information than they can directly inspect, making context compression necessary. Yet when large language models (LLMs) compress financial source material, they can alter the investment judgment supported by the original source. We frame this problem as information fidelity: compression loses fidelity when it changes the decision induced by the source. In agentic systems, such losses may recur across intermediate steps and amplify throughout the decision process. Across financial filings and earnings-call transcripts, we find that LLM-based compression can produce fluent and factually plausible compressed contexts that nevertheless alter downstream decisions. We analyze two diagnostic patterns associated with fidelity loss: decontextualization, where salient evidence is retained but separated from the caveats and contextual qualifiers needed for correct interpretation, and model dependency, where different compressors expose different views of the same source. We then propose Agentic Context Compression, which generates multiple candidate compressions and audits their disagreements against the original source. Our results suggest that financial compression should be evaluated not only by efficiency or factuality, but also by its ability to preserve decision-relevant context.

[AI-116] SurgVLA-Bench: Towards Evaluating Vision-Language-Action Models for Laparoscopic Surgical Robotics

链接: https://arxiv.org/abs/2606.29247
作者: Jiashuo Sun,Yue He,Wenxuan Liu,Tao Mao,Jiazheng Wang,Xiang Chen,Min Liu
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models represent a promising direction for embodied intelligence in surgical robotics. Despite the prevalence of VLA benchmarks for general robotics, standardized evaluation platforms specifically designed for surgical contexts remain absent. To address this limitation, we present SurgVLA-Bench, the first comprehensive benchmark for evaluating VLA models in laparoscopic surgical robotics. Leveraging the SurRoL simulation platform, we construct a hierarchical task taxonomy ranging from atomic actions to complete surgical procedures, complemented by a multi-dimensional evaluation framework assessing action accuracy and semantic consistency. We then systematically evaluate two representative paradigms, including autoregressive models such as OpenVLA, and flow matching models such as $\pi_0$, $\pi_{0.5}$, and SmolVLA. Our experiments show that autoregressive models tend to excel in semantic understanding, while flow matching models often achieve higher task precision but may face generalization trade-offs. However, even the best-performing models remain far from satisfactory, as the constrained endoscopic field of view, restricted viewing angles, and frequent occlusions persist as fundamental physical bottlenecks. The code and data are available at this https URL

[AI-117] MoPe: Motion Permanence for Robust Monocular Gaussian Mapping in Dynamic Environments

链接: https://arxiv.org/abs/2606.29237
作者: Qixin Xiao
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: RSS 2026 Workshop

点击查看摘要

Abstract:Robust robot autonomy depends on scene representations that remain stable enough to support localization, navigation, and downstream decision making in dynamic environments. Monocular Gaussian Splatting SLAM provides high-fidelity mapping, but current uncertainty-aware methods still treat dynamic regions largely as per-frame observations. This makes the representation effectively memoryless: when a pedestrian slows, pauses, or reappears after occlusion, the current frame may look static, allowing dynamic content to be absorbed into the map and leaving persistent ghosting artifacts. We argue that this failure reflects a representation-level mismatch. Dynamic-ness is not an instantaneous appearance property, but a temporal property defined by motion history. Building on this view, we introduce Motion Permanence: the principle that an object’s dynamic identity should persist over time rather than be re-decided from each frame independently. We realize this principle in MoPe, a memory-aware uncertainty filter for monocular Gaussian mapping. MoPe propagates the historical dynamic posterior through geometry-consistent SE(3) warping and fuses it with current-frame evidence using bounded Bayesian log-odds updates. The resulting persistent posterior guides tracking, mapping, dynamic-aware Gaussian insertion, and Gaussian-level post-cleanup. On Wild-SLAM, Bonn, and TUM sequences, MoPe improves tracking robustness and reduces residual ghosting, with the strongest gains on dynamic-human scenes that most directly violate the memoryless assumption. These results show that maintaining temporal dynamic state inside the scene representation is a practical step toward more reliable representation-centric autonomy in changing real-world environments.
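
The "bounded Bayesian log-odds update" mentioned above can be sketched for a single pixel or Gaussian. This is a generic occupancy-grid-style update, not MoPe's code: the bound of 4.0, the function names, and the per-frame probabilities are assumptions, and the SE(3) warping step is omitted (the prior is assumed to be already aligned to the current frame).

```python
import math


def logit(p):
    """Convert a probability into log-odds."""
    return math.log(p / (1.0 - p))


def sigmoid(x):
    """Convert log-odds back into a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def update_log_odds(prior_log_odds, p_dynamic_now, bound=4.0):
    """Add current-frame evidence to the propagated prior, clamped to [-bound, bound]."""
    fused = prior_log_odds + logit(p_dynamic_now)
    return max(-bound, min(bound, fused))


def track(per_frame_probs, bound=4.0):
    """Run the update over a sequence of per-frame 'dynamic' probabilities."""
    log_odds = 0.0
    for p in per_frame_probs:
        log_odds = update_log_odds(log_odds, p, bound)
    return sigmoid(log_odds)


# A pedestrian moves for five frames, then pauses and looks mostly static for three.
with_memory = track([0.9] * 5 + [0.4] * 3)
without_memory = track([0.4] * 3)
```

With motion history, the posterior stays above 0.9 during the pause, whereas the same three "static-looking" frames alone would fall below 0.5. The clamp keeps the belief from saturating, so sustained static evidence can still eventually overturn it.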

[AI-118] A Cognition-Emotion-Personality Framework for Modeling Human-Like Awareness and Behavior in Emergency Evacuations

链接: https://arxiv.org/abs/2606.29212
作者: Zoi Lygizou,Michalis Zervas,Helena G. Theodoropoulou,Vasilis Zafeiropoulos,Dimitris Kalles,Chairi Kiourt
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agent-based evacuation simulations are widely used to study crowd behavior during emergencies, but many models rely on assumptions such as perfect event awareness, complete exit knowledge, and fully rational decision-making. This paper presents an extended evacuation framework that integrates cognitive, emotional, social, and personality-related mechanisms into a unified model of human behavior under uncertainty. The framework incorporates a dynamic event-awareness mechanism based on a continuous Event Certainty Level, a memory-based representation of exit knowledge subject to acquisition, forgetting, and recall, a continuous fear model in which panic emerges as a high-intensity state, and an OCEAN-based personality representation. Neuroticism is explicitly integrated into the emotional model, influencing fear generation, escalation, social contagion, and recovery. Behavioral heterogeneity is further captured through individualized decision thresholds that affect responses to perceived risk. The framework is evaluated through simulation experiments examining the effects of spatial familiarity, memory robustness, decision sensitivity, emotional dynamics, and personality variation. Results show that cognitive, emotional, and personality-driven processes substantially influence evacuation dynamics, reducing evacuation efficiency and generating realistic crowd phenomena such as delays, confusion, injuries, and socially influenced behaviors. The proposed framework provides a more realistic representation of human behavior in emergency evacuations and supports systematic investigation of the interactions between cognition, emotion, personality, and crowd dynamics.

[AI-119] AnyBody: Free-Form Whole-Body Humanoid Control from Arbitrary Keypoint Guidance

链接: https://arxiv.org/abs/2606.29209
作者: Shuning Li,Sikai Li,Jiachen Li,Mingyu Ding
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present AnyBody, a unified whole-body humanoid controller driven by an arbitrary subset of body keypoints chosen at deploy time. Prior physics-based trackers either rely on expensive full-body motion capture and error-prone trajectory retargeting, which bottleneck scalable data collection and policy learning, or decompose upper- and lower-body control into separate hierarchical representations, sacrificing the coordinated whole-body motions that loco-manipulation requires. We close this gap by learning a single latent motion representation that any keypoint subset can address. To achieve this, we first train a privileged teacher tracker on a large unstructured motion corpus and distill it online into a deterministic encoder-decoder student whose latent space is a unit sphere. We then train a transformer keypoint encoder that admits any subset of body keypoints through masked self-attention, aligning it to the privileged latent. Additionally, we treat the frozen decoder as a motor prior and specialize downstream tasks with a lightweight residual corrector in the latent space. We demonstrate the effectiveness of AnyBody by tracking large-scale human motions from arbitrary keypoint subsets, free-form control, flexibly teleoperating, and learning downstream behaviors including locomotion, in-air writing, and obstacle-reach.

[AI-120] Behavior Uncloning: Distilling Mode Redirection into Policy Weights without Inference-Time Steering

链接: https://arxiv.org/abs/2606.29201
作者: Hao Wang,Jiuzhou Lei,Dayou Li,Bangya Liu,Minghui Zheng,Manling Li,Ruohan Zhang,Zhiwen Fan
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Behavior-cloned policies often learn multiple behavior modes from demonstration datasets, including modes that are unsafe or otherwise undesired at deployment. For example, a policy trained on diverse handover demonstrations may learn to pass a knife blade-first. Standard remedies such as data curation and inference-time steering either require access to the original demonstrations for full retraining or add substantial inference-time overhead. To address this gap, we propose MoRE (Mode Redirection), which redirects policy rollouts toward desired behavior modes through a short “uncloning” step. Specifically, MoRE distills the redirection signal from a temporary mode classifier into the policy weights to steer behavior. A retain loss balances this edit by preserving desired-mode competence, allowing the standalone policy to suppress unwanted modes with zero inference-time overhead. Across eight simulated and real-world tasks, MoRE improves the average deployment success rate (SR) by 44 percentage points over the original mixed-mode policy. Among all compared adaptation and steering baselines, MoRE achieves the strongest SR and approaches the filtered-data retraining reference, while preserving task competence and inference speed. MoRE also generalizes across robot policy backbones, including Diffusion Policy and the Pi0.5 VLA, diverse task categories, and real-world deployments.

[AI-121] AI Trading's Alpha Singularity: Emergent Market Reasoning through Agent-to-Agent Self-Evolution

链接: https://arxiv.org/abs/2606.29194
作者: Yuqi Li,Siyuan Liu,Bingjun Liu
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automated alpha mining holds the scoring function fixed and varies the search algorithm over it. A search that converges against a fixed scorer overfits whatever the scorer cannot penalize, a primary cause of the out-of-sample generalization gap. We treat the scoring function as a search artifact alongside the alpha factors and study what conditions make this joint search admissible. Sealed Joint Search (SJS) is a framework: a set of structural conditions on information flow in an autonomous-discovery system that prevent joint search from collapsing into self-confirmation while keeping the evaluator sealed. Conditions cover role decomposition, typed inter-role communication, provenance-sealed reads, versioned stores, and substrate-local promotion. Agora tests SJS empirically: five LLM agent classes communicate via three channels, evolving eight skill libraries, with alpha libraries built on AlphaGen operators. Three evaluators write reports aggregated into one brief, carrying forward disagreement instead of voting. We run Agora for 100 rounds on CSI 1000 and evaluate on a 91-day 2026 holdout sealed from all LLM inputs. Agora achieves holdout Sharpe +1.87; best baseline +1.334 at favorable seed and -0.755 cross-seed mean. Pre-loading Agora’s two metrics into a frozen-library ablation recovers only +0.40 of the +2.25 Sharpe gap, and adding PPO without library evolution worsens the gap. The two metrics emerge rather than being designed. Caveats: single-seed run, short-side concentrated signal, intended for long-short.

[AI-122] A Multi-Dataset Benchmark for Evaluating LLM Agents in Microservice Failure Diagnosis

链接: https://arxiv.org/abs/2606.29193
作者: Yuanhong Cai,Xiaohui Nie,Kanglin Yin,Changhua Pei,Yongqian Sun,Shenglin Zhang,Haibin Liu,Guiyang Liu,Xidao Wen,Fang Situ,Dan Pei
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 10 pages, 6 figures, 6 tables

点击查看摘要

Abstract:LLM-based agents are reshaping microservice operations into AgentOps, where benchmarks are key to evaluating failure diagnosis over multimodal observability data. However, existing benchmarks remain largely outcome-oriented: they score only the final answer and fail to assess the systematic reasoning process in failure diagnosis. We address this gap by introducing two large-scale datasets (AIOps2025 and RCA100) under a reasoning-process evaluation paradigm that assesses agentic diagnostic capability along three dimensions: Localization (where the fault occurs), Identification (what type of fault it is), and Reason (whether the reasoning trace is grounded in relevant evidence). Together, the two datasets comprise over 500 expert-labeled failure cases across two representative microservice systems (HipsterShop and the OpenTelemetry Demo Store). They cover diverse fault scenarios across resource, network, runtime, middleware/database, and application-logic categories and provide fine-grained causal evidence to support agent learning and reasoning-process evaluation. Beyond scale and coverage, the datasets have been carefully labelled by domain experts and validated through large-scale competitions, supporting more than 6,000 participating teams. This makes them not only expert-labeled diagnostic datasets, but also competition-validated benchmarks for evaluating agentic failure diagnosis in real-world microservice environments. Datasets are available at this https URL.

[AI-123] Measuring Graph-to-Graph Semantic Similarity in Knowledge Graphs: An Empirical Evaluation of Knowledge Graph Embeddings KDD2026

链接: https://arxiv.org/abs/2606.29180
作者: Seungryeol Baek,Wooseok Sim,Hogun Park
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 2 figures, 6 tables. Accepted as a poster at The 2nd Frontiers in Graph Machine Learning for the Large Model Era (GMLLM’26) Workshop, co-located with KDD 2026

点击查看摘要

Abstract:A Knowledge Graph (KG) represents facts as structured triples and is widely used to organize relational knowledge across diverse domains. Just as textual information ranges from words and sentences to complete documents, KG information can be interpreted at multiple levels, from entities, relations, and triples to subgraphs and entire KGs. However, existing KG embedding methods mainly focus on entities, relations, and triples, leaving graph-level semantics largely unaddressed. Conventional graph-level methods, which typically compare graphs based on structural patterns, are also insufficient because structural similarity alone cannot guarantee semantic similarity between KGs. To evaluate how well different methods capture such graph-level semantic information, we study graph-to-graph semantic similarity, which determines whether a pair of KGs represents semantically corresponding underlying information. To obtain reliable ground-truth correspondences, we construct a semantic matching dataset by modifying text documents, extracting KGs from both original and modified documents, and transferring their known correspondences to KG pairs. We compare text-based, structure-based, and KG embedding-based approaches on each dataset. For the KG embedding-based approach, we introduce two scoring functions: EmbPairSim, which uses maximal pairwise entity similarity, and AvgEmbSim, which uses a frequency-weighted centroid. Experiments on WikiText-2 and CC-News show that EmbPairSim achieves up to 5.3 pp higher MRR than Sentence-BERT while using substantially fewer parameters. These results suggest that KGE representations can serve as compact and effective signals for graph-to-graph semantic similarity in KGs. Our code is available at this https URL.
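
The two scoring functions named in the abstract above can be approximated as follows. This is a rough sketch under stated assumptions, not the paper's definitions: cosine similarity as the base measure, averaging each entity's best match for the pairwise score, and the toy embedding table `emb` are all illustrative choices.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def emb_pair_sim(entities_a, entities_b, emb):
    """For each entity in A, take its best match in B, then average (assumed aggregation)."""
    best = [max(cosine(emb[a], emb[b]) for b in entities_b) for a in entities_a]
    return sum(best) / len(best)


def avg_emb_sim(mentions_a, mentions_b, emb):
    """Cosine between centroids weighted by how often each entity occurs."""
    def centroid(mentions):
        counts = Counter(mentions)
        total = sum(counts.values())
        dim = len(next(iter(emb.values())))
        return [sum(c * emb[e][i] for e, c in counts.items()) / total for i in range(dim)]

    return cosine(centroid(mentions_a), centroid(mentions_b))


# Toy 2-D entity embeddings (hypothetical values).
emb = {"paris": [1.0, 0.0], "france": [0.8, 0.6], "tokyo": [0.0, 1.0]}

pair_score = emb_pair_sim(["paris"], ["paris", "tokyo"], emb)
centroid_score = avg_emb_sim(["paris", "paris", "tokyo"], ["france", "france"], emb)
```

The pairwise score rewards a graph for having at least one close counterpart per entity, while the centroid score compresses each graph to a single vector and so is cheaper but less sensitive to individual mismatches.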

[AI-124] Direct Causation in International Humanitarian Law and the Challenge of AI-Mediated Civilian Cyber Operations ICML2026

链接: https://arxiv.org/abs/2606.29175
作者: Alice Saito,Harold Godsoe,Phan Xuan Tan
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 11 pages, 1 figure, Workshop on Technical AI Governance Research ICML 2026

点击查看摘要

Abstract:International humanitarian law protects civilians from direct attack unless and for such time as they take direct part in hostilities, with the ICRC’s 2009 Interpretive Guidance operationalising this rule through a three-criterion cumulative test. This paper argues that AI-mediated civilian cyber operations challenge the direct causation element of this test in a structurally specific way: when a civilian deploys an autonomous multi-agent cyber system of the kind recently demonstrated in offensive AI research, the “one causal step” standard fails because harm is produced by system-generated decisions made after human disengagement, and the integral-part requirement does not extend because it presupposes downstream human contributors whose conduct can be independently classified. The framework therefore defaults to treating such deployments as indirect participation, in tension with its purpose of capturing civilians who personally take part in hostilities. Beyond the doctrinal analysis, this paper identifies goal-specification granularity as the property on which the integral-part test’s concreteness component implicitly turns, classifies AI-mediated operations along a five-level spectrum, and argues that existing technical AI governance instruments do not log or report this property.

[AI-125] Invariant Reasoning Directions in Latent Trajectories of Language Models

链接: https://arxiv.org/abs/2606.29164
作者: Arun Vignesh Malarkkan,Manan Roy Choudhury,Utkarsh Byahut,Yash Ravindra Charde,Vivek Gupta,Yanjie Fu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Geometry (cs.CG)
备注: 9 main text pages and 6 appendix pages

点击查看摘要

Abstract:Latent reasoning models perform multi-step inference directly in hidden-state space, yet the structure of these latent reasoning trajectories remains poorly understood. We show that contrastive refinement signals between stronger and weaker reasoning trajectories exhibit a highly concentrated low-rank structure, while unconstrained latent updates remain sensitive to paraphrases, checkpoint choice, and trajectory perturbations. These observations suggest that latent reasoning trajectories contain stable invariant directions mixed with unstable instance-specific variation. We introduce Trajectory-Invariant Latent Refinement (TILR), a training-free intervention framework for identifying and manipulating stable reasoning directions in latent space. TILR first learns a low-rank invariant subspace from contrastive trajectory differences across inputs, then constrains latent interventions to this subspace while suppressing poorly aligned updates through an adaptive alignment gate. Across six reasoning benchmarks, we find that a small number of latent directions explain most variation between strong and weak reasoning trajectories. Interventions on these directions causally improve reasoning consistency and reduce trajectory instability under paraphrases and perturbations. TILR improves answer consistency under paraphrase by ~10% and reduces latent trajectory variance by up to 50% while preserving reasoning accuracy. These results support a geometric view of latent reasoning in which transferable reasoning behavior emerges from stable low-dimensional structure within hidden-state trajectories.

[AI-126] Pooled Leaderboards Hide System-Specific Winners: A Reporting-Protocol Audit of Offline Root-Cause Analysis Benchmarks

链接: https://arxiv.org/abs/2606.29159
作者: Lining Hu,Ting Liu,Yuzhuo Fu
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Offline root-cause-analysis (RCA) benchmarks commonly rank methods by a single pooled top-1 accuracy across multiple subsystems, and engineers often read the pooled winner as a recommendation for their own subsystem. We audit that reading on three public RCA benchmark families – OpenRCA, RCAEval, and PetShop – covering 11 subsystems and 778 matched scoring units. To keep pairwise comparisons on identical cases, the main analysis retains four methods or comparators with complete coverage: BARO, a CD-1min adapter, max-|Z|, and per-service alert-count. All six pairwise comparisons show subsystem-level effects of both signs, every random-effects 95% prediction interval crosses zero, and case-level interaction tests reject exchangeability in 5 of 6 pairs. Leave-one-system-out selection picks the lower-scoring method on up to 5 of 11 held-out subsystems, with regret reaching 24.8 pp on RCAEval / Sock-Shop. We release a 320-line audit module; given a matched RCA benchmark score table, it recomputes the same per-subsystem stability checks alongside pooled scores.

[AI-127] On the Nonlinearity of Learning Rate Scaling for LLM Training

链接: https://arxiv.org/abs/2606.29158
作者: Zaiwen Yang,Huaqing Zhang,Jing Xu,Jingzhao Zhang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Learning-rate transfer can reduce the cost of training large language models: instead of sweeping learning rates at target scale, practitioners extrapolate from smaller runs. Existing approaches often assume that the optimal learning rate follows a log-linear scaling law in data scale and model size. We carefully examine and evaluate this scaling law. In our empirical study of GPT-2–style models from 22M to 707M parameters trained on 5B to 100B tokens, the optimal learning rate develops upward curvature at larger scales, leading to inaccurate extrapolation. We find that this curvature largely disappears when learning rates are replaced by effective learning rate (the step size in normalized weight space), and when data D extrapolation is used instead of model size N extrapolation. Next, we explain nonlinearity in scaling: weight-norm converges to equilibrium slower when the optimal learning rate is small, requiring a larger step size to reduce the transient phase. Experiments with AdamH, which directly controls the effective learning rate, further support this explanation.

[AI-128] Flow Reasoning Models: Scaling Reasoning Through Iterative Self-Refinement

链接: https://arxiv.org/abs/2606.29150
作者: Alec Helbling,Andrey Bryutkin,Mauro Martino,Nima Dehmamy,Hendrik Strobelt
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Discrete flow models have recently shown promising performance on few-step text generation; however, when naively applied to structured reasoning tasks such as Sudoku and Zebra puzzles, they converge confidently to incorrect answers (solving only ~36% of Sudoku puzzles). We introduce Flow Reasoning Models (FRMs), a training and test-time-scaling framework for structured reasoning with flow models. We make the observation that, despite their poor solve rate, flow models can act as their own verifiers. A correct answer is a stable fixed point of the denoising dynamics, returning to itself when re-noised and re-solved. This enables a test-time-scaling paradigm: propose many candidate solutions and keep those that are dynamically stable, which alone reaches high solve rates on Sudoku-Shah (~100%) and Zebra (95.9%). This even generalizes to harder out-of-distribution puzzles like Sudoku-Extreme (96.1%), without ever training on that distribution. This pure search, however, wastes a great deal of computation generating incorrect candidate solutions. We therefore design a training recipe to improve the base model’s efficiency. First, we train flow models with a self-conditioning channel and close it at inference, letting them refine their own past predictions. Second, we train models to avoid their own failed generations using direct preference optimization. These changes substantially improve the base model’s efficiency, letting it reach 99.2% on Sudoku in just 7 forward passes, over 8× fewer than the strongest matched masked-diffusion baseline we compare needs for the same accuracy. When combined with test-time scaling, this lets flow models solve hard out-of-distribution puzzles (e.g. Sudoku-Extreme) far more efficiently.

[AI-129] HiComm: Hierarchical Communication for Multi-agent Reinforcement Learning

链接: https://arxiv.org/abs/2606.29126
作者: Runze Zhao,Dongruo Zhou,Sumit Kumar Jha,Nathaniel D. Bastian,Ankit Shah
类目: Artificial Intelligence (cs.AI)
备注: 23 pages, 7 tables, under review

点击查看摘要

Abstract:Cooperative multi-agent reinforcement learning (MARL) often relies on communication to mitigate partial observability, yet most existing protocols treat messages as flat dense vectors detached from the structure of the observations they summarize. This design overlooks an important source of inductive bias in many cooperative environments, where observations naturally follow a hierarchy such as groups and entities. We propose HiComm, a plug-in communication module that grounds messages in the sender’s hierarchical observation. HiComm is receiver-driven: the receiver issues a query, and the hierarchy is resolved through a three-stage decoding process that first selects a group, then a sender, and then an entity within that group, returning the corresponding feature slice as the message. This converts communication from unstructured vector transmission into structured information retrieval over the sender’s observation hierarchy. We instantiate this mechanism with Straight-Through Gumbel-Softmax for differentiable discrete selection and a lightweight shared projection design that attaches to standard MARL pipelines. Experiments across cooperative MARL tasks with different observation structures and coordination demands show that HiComm matches or outperforms representative learned communication baselines while reducing communication volume by up to 23× per receiver per episode.

[AI-130] Characterizing Large Language Model Agentic Workflows: A Study on N8n Ecosystem

链接: https://arxiv.org/abs/2606.29116
作者: Yutian Tang,Yuming Zhou,Huaming Chen
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are rapidly being adopted in low-code and no-code automation platforms, where non-expert users design workflows that combine natural language understanding with external services and APIs. LLM agents are LLM systems that use LLMs as a core “brain” to reason, plan, and autonomously execute complex, multi-step tasks. In this paper, we present the first large-scale empirical study of LLM agentic workflows in low-code automation platforms. We analyze more than 6,000 publicly available n8n workflows and examine four aspects of their design: task distribution, structural and tool use patterns, reliability mechanisms, and autonomy levels. Our analysis shows that LLM workflows are not merely prompt response pipelines. Instead, LLMs are commonly embedded within broader automation structures involving control logic, external tools, communication services, storage systems, and human review points. We further find that while many workflows include lightweight post-processing or routing logic after LLM execution, explicit reliability mechanisms such as structured fallback paths, repair loops, failure-specific alerts, and human approval gates remain relatively uncommon. These results reveal a gap between the increasing deployment of LLM agents in practical automation ecosystems and the limited engineering support for reliability, safety, and governance. Overall, our study provides ten empirical findings and five research takeaways for researchers, platform developers, and practitioners seeking to understand and improve real-world LLM agentic workflows.

[AI-131] Managing the Human Fallback: Skill Investment Under Improving AI and Worker Mobility

链接: https://arxiv.org/abs/2606.29111
作者: Simrita Singh,Naireet Ghosh,Tinglong Dai
类目: Artificial Intelligence (cs.AI); General Economics (econ.GN); Theoretical Economics (econ.TH)
备注: 32 pages, 5 figures, 31-page appendix

点击查看摘要

Abstract:When firms deploy autonomous AI, they must decide how much work to leave to the system and how much to keep workers engaged. This decision affects current output and future human capital. We develop a parsimonious two-period model in which AI may outperform the worker when it functions, but may fail with positive probability. A firm chooses worker engagement; engagement lowers current output for below-benchmark workers, but changes future skill through learning and erosion. We distinguish two dimensions of AI progress: capability, the system’s output when it works, and reliability, the probability that it works. In a single-firm benchmark, engagement is valuable only as fallback investment. The firm engages the least-skilled workers most, because they have the largest skill gaps and are least costly to bring toward a useful fallback level. With worker mobility, engagement also affects labor-market sorting: workers prefer jobs that build more valuable skill trajectories. This sorting motive targets higher-skill workers near the AI frontier, where skill gains are more valuable and engagement is less costly. Mobility can therefore reverse the engagement pattern, shifting investment from the least-skilled toward the most-skilled workers below the AI benchmark. Mobility also reshapes how AI progress affects engagement: greater capability raises engagement by increasing the value of the skill trajectory a firm offers, whereas greater reliability can raise or lower it because it reduces fallback need while also changing learning opportunities. Under worker mobility, human-AI work design becomes a problem of human-capital investment, in which allocating work today shapes future skill.

[AI-132] Unified Complex-valued Neural Network: A Magnitude-Phase Computational Model for Event-Driven Neuromorphic Learning

链接: https://arxiv.org/abs/2606.29099
作者: Reza Ahmadvand,Sarah Safura Sharif,Yaser Mike Banad
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Artificial neural networks (ANN) provide accurate continuous-valued representation, whereas spiking neural networks (SNN) offer event-driven temporal processing, yet both paradigms face limitations when value encoding and timing dynamics must be learned within a single computational structure. This paper introduces a network based on Unified Complex-valued Neuron (UCN), a new neural computational model that integrates continuous activation and phase-driven event generation through an asymmetric complex-valued state. In the UCN, magnitude encodes signal strength while phase governs intrinsic temporal evolution and valued spike emission. A foundational training framework combining backpropagation (BP) and backpropagation through time (BPTT) is first developed to optimize magnitude and phase pathways in a unified way. To reduce computational complexity, an event-driven adaptive phase learning (EAPL) rule is then introduced as a more efficient alternative. The proposed model is evaluated through object tracking and Lorenz attractor learning. Results demonstrate that UCN-based Network (UCNN) provides accurate, stable, and interpretable spatiotemporal learning while preserving sparse event-driven computation for neuromorphic and edge-AI applications.

[AI-133] Priced Motion Through Optimal Faces: A Normal-Fan Geometry for Non-Stationary Adversarial MDPs

链接: https://arxiv.org/abs/2606.29092
作者: Kai Hidajat
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In a changing decision problem, standard dynamic-regret analyses have often equated the cost of non-stationarity to how far loss moves. However, it is simultaneously possible for a loss sequence to travel far and retain the same optimal policy, or for a small movement in loss to force the optimal policy to change completely. Thus, the size of the movement through loss variation, transition variation, or comparator path length describe the adversary’s motion, but not the cost of that motion to the control problem. For a more faithful analytic interpretation, this paper develops a normal-fan geometry for finite-horizon adversarial MDPs with fixed transitions. Occupancy measures form a polytope, and each loss vector exposes an optimal face of that polytope. Non-stationarity in rewards is therefore a path through the normal fan, where motion inside one cone leaves the optimal face unchanged, while crossing a wall may carry regret. We pose the notion of a face-crossing price, which is the minimum regret incurred by remaining on the previous optimal face under the new loss. For any learner that tracks the previous face, dynamic regret decomposes exactly into intrinsic priced face motion plus within-face selection error. The resulting theory separates consequential from harmless non-stationarity, where loss variation can be arbitrarily large at zero price, and identical one-coordinate variation can hide horizon-scale differences in regret.
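
One way to write down the face-crossing price described above is the following. The notation is assumed here for illustration and is not taken from the paper: $\mathcal{Q}$ is the occupancy-measure polytope, $\ell_t$ the loss vector at round $t$, and $F_{t-1}$ the face of $\mathcal{Q}$ that was optimal under $\ell_{t-1}$.

```latex
F_{t-1} = \arg\min_{q \in \mathcal{Q}} \langle \ell_{t-1}, q \rangle,
\qquad
\mathrm{price}_t
  = \min_{q \in F_{t-1}} \langle \ell_t, q \rangle
  - \min_{q \in \mathcal{Q}} \langle \ell_t, q \rangle
  \;\ge\; 0 .
```

Under this reading, the price is zero whenever the previous face still contains a minimizer of the new loss, which is the case when $\ell_t$ stays inside the same normal cone. The magnitude $\lVert \ell_t - \ell_{t-1} \rVert$ does not appear in the definition, which matches the abstract's claim that large loss variation can carry zero price.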

[AI-134] Statistically Indistinguishable Operationally Distinct: A Formal Barrier for Tabular Foundation Models ICML

Link: https://arxiv.org/abs/2606.29091
Authors: Tassilo Klein, Johannes Hoffart
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: Accepted at the 2nd ICML Workshop on Foundation Models for Structured Data, 2026

Click to view abstract

Abstract:Tabular foundation models cannot reason about data produced by running systems without access to the rules that govern them. We make this statement falsifiable. The Operational Turing Test (OTT) constructs pairs of legal and rule-violating database states whose 1- and 2-way column-value marginals match to a total variation of 0.02; Le Cam’s lemma then bounds any values-only classifier at ≥ 0.49 Bayes error. Three values-only baselines (XGBoost, TabICL, TabPFN) hit the bound exactly (accuracy 0.50, pre-registered two one-sided tests (TOST) p<0.002), raw row-level access does not help, exposing relational value consistency closes most of the gap, and only a classifier fed by seven executable rule-derived audits reaches 1.00 classification accuracy. In three matched 100-state frontier large-language-model (LLM) runs, models given the schema, trigger source, rule tables, and state files classify at most 2/50 legal states as LEGAL; GPT-5.5 accepts 0/50 legal states even with higher reasoning effort and a Structured Query Language (SQL) executor. The access-ladder pattern also appears on a second schema with structurally distinct rule families (banking ledger: cross-row balance, cumulative aggregate). The barrier is identifiability, not capacity: scale, data, and richer features cannot cross it without operational grounding.
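
Note (illustrative, not from the paper): the 0.49 figure can be checked by hand under the standard two-point form of Le Cam's lemma. The symbols below are assumptions, since the abstract names none: $P_0$ and $P_1$ are the distributions a values-only classifier observes for legal and rule-violating states, the two classes have equal priors, $\psi$ is any classifier, and 0.02 is treated as an upper bound on their total variation distance.

```latex
\[
\inf_{\psi}\ \tfrac{1}{2}\bigl[P_0(\psi = 1) + P_1(\psi = 0)\bigr]
  \;=\; \tfrac{1}{2}\bigl(1 - \mathrm{TV}(P_0, P_1)\bigr)
  \;\ge\; \tfrac{1}{2}\,(1 - 0.02)
  \;=\; 0.49 .
\]
```

Under these assumptions no classifier restricted to those observations can do better than 51% accuracy, which matches the reported baseline accuracy of 0.50.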

[AI-135] Diff-Based Code Corruption using LLMs for Large-Scale Bugfix Benchmarking

Link: https://arxiv.org/abs/2606.29088
Authors: Balázs Szalontai, Ábel Szauter, Balázs Márton, Péter Verebics, Balázs Pintér, Tibor Gregorics
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:There are various benchmarks to evaluate bugfixing capabilities of Large Language Models. However, most widespread benchmarks do not fully reflect real-world bugfixing practices. They are small, weakening statistical reliability, and the buggy programs are often similar to one another, potentially distorting evaluation results. The range of bug types can also be narrow, failing to capture a representative range of bugs. To address these issues, we introduce MegaBugFix, a large-scale bugfixing benchmark containing 12,629 buggy Python programs synthesized from correct ones by a Large Language Model. Bug injections were generated as diffs representing code changes. Through this approach, we were able to avoid common pitfalls of LLM-based mutation techniques like injecting overly simplistic bugs or failing to modify the input program. We evaluated 13 open-weight models on MegaBugFix and baseline benchmarks, finding consistently lower performance on MegaBugFix. This reveals that our benchmark presents more challenging bugs and exposes model failures that may remain hidden when evaluating on existing benchmarks. The benchmark and fine-tuned model used for bug injection are available at this http URL.

[AI-136] From Tool Connection to Execution Control: Benchmarking Security Invariants in MCP-Style Agent Runtimes

Link: https://arxiv.org/abs/2606.29073
Authors: Ting Liu
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 16 pages

Click to view abstract

Abstract:Model Context Protocol (MCP)-style ecosystems give language-model applications a practical connection layer for tools, resources, prompts, and transports. As agents move from connection to execution, security decisions often remain split across clients, servers, prompts, approval dialogs, OAuth deployments, and logs. This paper asks whether a runtime can make execution-layer invariants explicit and testable while preserving MCP-like workflows. We define eight invariants: metadata non-authority, grant-backed approval, canonical resources, principal binding, scoped capability invocation, source-and-target data-flow authorization, deny-path audit, and explicit protocol state. We implement these invariants in HCP, a Handle-Capability Protocol reference runtime for MCP-style agent execution that represents calls through principals, resources, grants, capabilities, handles, policy decisions, data-pipe checks, and audit entries. We evaluate HCP against two MCP-like baselines: a naive connection-layer runtime and a practice-informed connection-layer mitigation baseline with metadata linting, session checks, and per-call approvals. Across 10 benchmark cases, the naive baseline permits all modeled attacks, the mitigation baseline permits 6 of 10, and HCP blocks all 10 while preserving audit evidence. Ablations identify which runtime components block attacks and preserve forensic evidence. A local in-memory microbenchmark reports sub-millisecond mean latencies for measured policy, invocation, peek, and pipe operations. A bounded GitHub README-screening sample provides ecosystem signals, not vulnerability findings. The results support a narrow claim: MCP-style agent systems need an execution-control layer in addition to connection-layer conventions.

[AI-137] Memory as an Attack Surface in LLM Agents: A Study on Multiple-Choice Question Answering

Link: https://arxiv.org/abs/2606.29030
Authors: Shahnewaz Karim Sakib, Anindya Bijoy Das
Categories: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments:

Click to view abstract

Abstract:AI agents extend conventional large language model (LLM) applications by integrating language understanding with task execution, external tool use, and memory mechanisms. While memory allows agents to retain prior interactions and provide more personalized and context-aware responses, it also introduces a new vulnerability: information stored in memory can influence future outputs even when the current query is clean. In this paper, we investigate memory manipulation in LLM-based agents for multiple-choice question answering. We first design and implement an LLM-based AI agent with an external memory component that stores and retrieves task-relevant information. We then introduce basic memory manipulation scenarios in which misleading or corrupted memories are inserted into the agent before it answers multiple-choice questions. Using a controlled experimental setup, we compare the agent’s performance before and after memory manipulation and measure changes in answer accuracy, attack success rate, and selection of manipulated options. Our results show that even simple memory manipulations can noticeably affect the agent’s final answers, causing it to select incorrect options despite receiving clean and well-formed questions.

[AI-138] Preventing Error Propagation in Multi-Agent AI through Runtime Monitoring

Link: https://arxiv.org/abs/2606.29026
Authors: Shahnewaz Karim Sakib, Anindya Bijoy Das
Categories: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments:

Click to view abstract

Abstract:Multi-agent AI systems can improve answer selection by allowing different language models to exchange reasoning traces, revise initial predictions, and support a final decision. However, such communication may also introduce reliability risks: reasoning from one agent can correct another agent’s mistake, but it can also mislead an agent that was initially correct. This paper studies reliable multi-agent AI communication through reasoning exchange and runtime answer revision. We develop a framework in which agents first answer multiple-choice questions independently, then share reasoning traces and revise their decisions. We conduct numerical experiments where we evaluate whether this process improves accuracy, produces more positive than negative answer transitions, and remains effective across domains such as cybersecurity, networking, and general knowledge. The results help identify when multi-agent reasoning improves reliability and when it may propagate errors.

[AI-139] Customized Generative AI Agent for Transportation Engineering Practice: A Development and Continued Pre-training Guideline

Link: https://arxiv.org/abs/2606.29014
Authors: Dianwei Chen, Yuan-Zheng Lei, Zifan Zhang, Yuchen Liu, Xianfeng (Terry) Yang
Categories: Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Comments:

Click to view abstract

Abstract:Recent advancements in generative artificial intelligence (AI) and large language models (LLMs) have shown significant promise in automating complex reasoning, summarization, and question-answering tasks. However, the effectiveness of general-purpose LLMs in specialized engineering domains remains limited due to insufficient exposure to technical standards, engineering terminology, and domain-specific semantics. This study proposes a systematic approach to developing a customized generative AI agent for transportation engineering applications. A curated corpus of U.S. transportation manuals, design guidelines, and regulatory documents is used to conduct continued pretraining of six state-of-the-art LLMs through a unified low-rank adaptation (LoRA) framework. The training process is monitored to ensure convergence and model stability. Performance is evaluated using standard natural language processing metrics, including BLEU-4 and ROUGE, with Qwen2.5-7B and LLaMA-3.1-8B demonstrating the highest domain alignment and response quality. Results validate the effectiveness of LoRA-based adaptation in improving LLM performance on technical content interpretation and context-specific reasoning. This work contributes a reproducible development framework for constructing domain-specialized generative AI agents, supporting broader deployment in transportation research, design, planning, and policy analysis.

[AI-140] Reward-Free Code Alignment from Pretrained or Fine-Tuned LLM: Unpacking the Trade-offs for Code Generation

Link: https://arxiv.org/abs/2606.28998
Authors: Gias Uddin, Sanjeepan Sivapiran
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large Language Model (LLM) alignment trains an LLM using preference data to produce outputs that better meet established quality standards. While LLM alignment techniques are studied for non-coding tasks, we know little about their usefulness for coding tasks. It is unclear whether LLM code alignment could support both functional requirements (producing executable, correct code) and non-functional requirements (code readability, style, maintainability). It is also unknown whether alignment for a code LLM should begin with base pretrained version or the finetuned (i.e., instruction-tuned) version of the LLM. In this paper, we offer insights on the above two research questions by conducting an empirical study. We studied five state-of-the-art (SOTA) LLMs using two widely used LLM alignment techniques: Direct Preference Optimization (DPO) and BoNBoN. For each training record, we created a preference pair as accepted and rejected instances by using the SelfCodeAlign pipeline. DPO and BoNBoN are reward-free models, i.e., they eliminate the need for multiple reward scores for output preferences. We tuned each LLM using the two alignment techniques in two settings: pretrained and finetuned versions of an LLM. We evaluated functional requirements using four SOTA benchmarks (HumanEval+, MBPP+, EvalPerf, EvoEval) and non-functional requirements using the CODAL benchmark, which evaluates code quality across five dimensions derived from software engineering practices. We find that pretrained-to-aligned pathways achieve larger improvements in the aligned variant over its pretrained variant. But the pretrained variant is generally less accurate than its finetuned variant. However, finetuned-to-aligned offers smaller performance improvements or, in some cases, degradation in the aligned variant than its finetuned variant.

[AI-141] Arbitrary Reduction of Validation Error for AI Decision Tests using Homomorphic AI and Repetition Codes

Link: https://arxiv.org/abs/2606.28994
Authors: Eric Filiol, Jaagup Sepp
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 10 pages, 1 figure. Extended version of the talk presented at DSC Next 2026, Amsterdam, May 7-8th, 2026

Click to view abstract

Abstract:This paper presents new results and breakthrough obtained with the HbHAI techniques (Hash-based Homomorphic Artificial Intelligence) proposed in \cite{filiol0,sepp}. HbHAI is based on a novel class of key-dependent hash functions that naturally preserve most similarity properties, most AI algorithms rely on. It enables to analyse and process data in its cryptographically secure form while using existing native AI algorithms without modification, with unprecedented performances compared to existing homomorphic encryption schemes and most notably compared to the same processing on corresponding plaintext data. Two major results have been obtained further. First we enable to reduce the compression rate up to a factor of 10 thus allowing to process massive datasets while reducing the computation time and the energy footprint in the same order. Second, we show how it is possible to arbitrarily reduce the final validation error of AI-based decision tests by using repetition error-correcting codes.
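
Note (illustrative sketch, not from the paper): the abstract does not say how the repetition code is applied to decision tests. The sketch below assumes the simplest textbook reading: the same binary decision is repeated n times, each repetition errs independently with probability p, and the final answer is the majority vote. The function name and the value p = 0.1 are hypothetical.

```python
from math import comb

def majority_error(p: float, n: int) -> float:
    """Probability that a majority vote over n independent tests is wrong,
    when each individual test errs with probability p. Requires odd n."""
    if n % 2 == 0:
        raise ValueError("use an odd number of repetitions to avoid ties")
    # The vote is wrong when more than half of the n tests are wrong.
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

# Hypothetical per-test error of 10%, repeated 1, 3, 5, and 7 times.
errors = {n: majority_error(0.1, n) for n in (1, 3, 5, 7)}
```

Under the independence assumption and with p below 0.5, the error shrinks toward zero as n grows, which is the sense in which a repetition code can make the final error arbitrarily small. Whether repeated HbHAI runs actually yield independent errors is not something this sketch establishes.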

[AI-142] RGLD: Randomized Global-Local Density Estimation for Tabular Anomaly Detection

Link: https://arxiv.org/abs/2606.28970
Authors: Quanling Zhao, Jiaying Yang, Ye Tian, Josh Victoria, Zhijun Wang, Pietro Mercati, Onat Gungor, Tajana Rosing
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Unsupervised tabular anomaly detection requires methods that are accurate, robust across heterogeneous datasets, and computationally efficient. Classical statistical detectors are often efficient, but they usually rely on a fixed data view and a single notion of abnormality. Deep anomaly detectors can learn more flexible scoring functions, but they are substantially slower and difficult to tune in unsupervised settings due to the lack of a reliable supervisory signal. We propose RGLD, a randomized global-local density estimator for efficient unsupervised tabular anomaly detection. RGLD combines a global random-feature density branch, which identifies samples in broadly low-density regions, with a local neighbor branch, which detects samples that are weakly supported by nearby observations. Both branches operate over feature-bagged randomized views, allowing RGLD to expose anomaly evidence that may be hidden in any single representation. We conduct experiments on 47 tabular datasets against 23 statistical and deep anomaly detection baselines under fully unsupervised setting. RGLD achieves the strongest dataset-level AUROC performance, ranking 1st in dataset wins, and ranks 2nd in AUPRC wins. RGLD is also faster than all evaluated deep detectors, achieving 50x-580x speedups, and remains competitive with statistical methods in runtime, yielding a favorable accuracy-efficiency tradeoff.

[AI-143] Expert Evaluation of Clinical AI Tools on Real Point-of-Care Clinical Queries

Link: https://arxiv.org/abs/2606.28960
Authors: Jean Feng, Vishal Patel, Patrick Heagerty, Yifan Mai, Venkatesh Sivaraman, Patrick Vossler, Jialin Ouyang, Anupam B. Jena
Categories: Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM); Applications (stat.AP)
Comments:

Click to view abstract

Abstract:Physicians now pose millions of clinical questions to AI tools each week, yet these tools are evaluated largely on hypothetical or exam-style questions, not those actually asked in practice. We report a blinded evaluation built on 620 Real-world Point-Of-Care Queries (Real-POCQi) submitted to the OpenEvidence (OE) platform by physicians spanning 30 specialties, as well as 187 questions from HealthBench. 149 practicing physicians across 36 states made head-to-head comparisons between answers from three frontier general-purpose models (Claude Opus 4.8, Gemini 3.1 Pro, and GPT-5.5) and a specialized clinical tool (OE), with graders matched to each question’s specialty. When comparing answers along five dimensions relevant to clinical decision support – accuracy, clinical utility, source quality, verifiability, completeness – physicians scored the specialized tool highest on all axes; in the primary analysis on Real-POCQi, win differences (margins between win and loss rates) ranged from 25 to 39 percentage points (p<0.001). Results remained consistent in sensitivity analyses stratifying by citation display, answer length, OE-user status, and Real-POCQi versus HealthBench. In parallel, LLM judges were found to systematically differ from expert judges, though both generally agreed on the best model. These findings underscore two conclusions: (i) AI tool evaluations should reflect real-world query distributions and use expert judges that mirror the specialization defining modern medicine and (ii) the consistent advantage of the specialized tool over general-purpose models does not necessarily mean that the latter cannot serve similar purposes, but that targeted engineering and customization can yield meaningful gains in performance for its users. We release Real-POCQi as a public benchmark, as well as the prespecified statistical analysis for reproducing results of this study.

[AI-144] Modification-Considering Value Learning for Reward Hacking Mitigation in RL

Link: https://arxiv.org/abs/2606.28955
Authors: Evgenii Opryshko, Umangi Jain, Igor Gilitschenski
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Reinforcement learning agents can exploit misspecified reward signals to achieve high apparent returns while failing on the intended objective, a failure mode known as reward hacking. Existing practical defenses typically constrain policy updates to stay near a known safe reference, creating a tension between suppressing hacking and permitting legitimate improvement. We propose Modification-Considering Value Learning (MCVL), which operationalizes the theoretical idea of current utility optimization for standard value-based RL. MCVL wraps an off-policy learner and treats each incoming transition as a candidate modification: it forecasts two training paths, one that includes the transition and one that does not, and scores both with a frozen bootstrapped-return estimator derived from a learned reward model and value function. The transition is admitted only if inclusion does not decrease the score. We formalize conditions under which this filtering is both safe and permissive, and instantiate MCVL with DDQN and TD3. Across four safety-relevant gridworlds and three modified MuJoCo continuous-control tasks with diverse hacking mechanisms, MCVL mitigates reward hacking while continuing to improve the intended objective. Project website: this http URL.

[AI-145] Machine-learnable Sets

Link: https://arxiv.org/abs/2606.28947
Authors: Veit Elser, Manish Krishan Lal
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages, 14 figures

Click to view abstract

Abstract:In this study we present a formal definition of large discrete sets having, informally, three properties: their elements are easily recognized, easily generated, and the latter tasks are easily learned from examples. The formalism is specialized to sets of binary strings and a definition of “machine-learnability” based on the existence of a bounded-complexity Boolean autoencoder that fixes the elements of the set. We present experiments where the autoencoders are implemented by nets of Boolean threshold functions. Machine-learnability is demonstrated for Rorschach patterns (that may have reversed contrast in the mirrored half), and considerably “wilder” sets whose elements are only approximately fixed by admissible autoencoders. In the second case we demonstrate a simple iteration that evolves wild sets to make them properly machine-learnable.

[AI-146] DLR: Zero-Inference-Cost Latent Residuals for Low-Rank Pre-Training

Link: https://arxiv.org/abs/2606.28932
Authors: Dong Wang, Wenwu Tang, Yun Cheng, Olga Saukh
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Includes appendix, 6 figures and 11 tables. Code available at this https URL

Click to view abstract

Abstract:Large language models have driven recent progress in language and multimodal AI, yet pre-training them at scale is prohibitively expensive. Low-rank pre-training, which factorizes each weight matrix into a rank-r product to reduce both parameters and FLOPs, is a promising response but typically lags full-rank training in quality. We propose Duplicated Latent Residual (DLR), a training-only, parameter-free, foldable plug-in for low-rank pre-training. DLR augments the standard low-rank output Bz with a fixed structured residual alpha/sqrt(K) * Expand_K(z) that replicates each latent coordinate K = ceil(d_out/r) times across the output. With alpha fixed, DLR adds zero learnable parameters per layer; after training, it is absorbed into the up-projection in closed form, B* = B + alpha/sqrt(K) R^T, so deployment parameter count, FLOPs and memory match the underlying low-rank backbone exactly. Across LLaMA models from 60M to 7B parameters, DLR strengthens low-rank pre-training on C4 validation perplexity in most settings, with the clearest gains at 130M and above; folded checkpoints transfer cleanly to supervised fine-tuning on standard benchmarks.
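
Note (illustrative sketch, not the authors' code): the closed-form fold B* = B + alpha/sqrt(K) R^T stated above can be checked numerically on a tiny example. The sketch assumes Expand_K copies latent coordinate j into K consecutive output positions and truncates to d_out rows; the abstract does not specify that layout, and the helper names, sizes, and alpha value are hypothetical.

```python
import math
import random

def expand_matrix(d_out: int, r: int):
    """Return (R^T, K), where R^T is a d_out x r 0/1 matrix in which output
    row i copies latent coordinate i // K. Block-wise layout is an assumption."""
    K = math.ceil(d_out / r)
    Rt = [[1.0 if i // K == j else 0.0 for j in range(r)] for i in range(d_out)]
    return Rt, K

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def fold(B, alpha: float):
    """Absorb the fixed residual into the up-projection: B + alpha/sqrt(K) * R^T."""
    d_out, r = len(B), len(B[0])
    Rt, K = expand_matrix(d_out, r)
    scale = alpha / math.sqrt(K)
    return [[B[i][j] + scale * Rt[i][j] for j in range(r)] for i in range(d_out)]

random.seed(0)
d_out, r, alpha = 10, 4, 0.5
B = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d_out)]
z = [random.gauss(0, 1) for _ in range(r)]

# Training-time output: low-rank term plus the scaled duplicated-latent residual.
Rt, K = expand_matrix(d_out, r)
scale = alpha / math.sqrt(K)
train_out = [b + scale * e for b, e in zip(matvec(B, z), matvec(Rt, z))]

# Deployment-time output: a single matrix of the same shape as B.
folded_out = matvec(fold(B, alpha), z)
max_gap = max(abs(a - b) for a, b in zip(train_out, folded_out))
```

Because the residual is linear in z, the folded matrix reproduces the training-time output up to floating-point error, which is why the abstract can claim unchanged parameter count and FLOPs at deployment.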

[AI-147] An Integrated Machine Learning and Hierarchical Variance Decomposition Pipeline for Student Performance Prediction and Metacognitive Calibration on Multi-Signal Telemetry

Link: https://arxiv.org/abs/2606.28881
Authors: Gurdeep Singh Virdee
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages

Click to view abstract

Abstract:Predicting student performance and characterizing metacognitive calibration are essential for personalization in intelligent tutoring systems. Prior research treats performance prediction, calibration error calculation, and variance decomposition as separate pipelines, preventing unified interpretation. I propose the Unified Behavioral Prediction and Calibration Analysis Pipeline (UBP-CAP), an integrated framework processing student pre-execution behavioral telemetry through three linked modules: (1) a LightGBM classifier with SHAP for binary correctness prediction, (2) formal calibration metrics (ECE, MCE, and Brier score decomposition) to evaluate metacognitive alignment, and (3) a crossed Generalized Linear Mixed-Effects Model (GLMM) for decomposing calibration deviations. I introduce the Predictive-Explanatory Divergence Index (PEDI), which quantifies structural divergence between predictive and explanatory feature profiles. Evaluated on 1,195 interaction records (27 students, 45 tasks), Logistic Regression achieves AUC-ROC = 0.903, outperforming LightGBM (0.878). Student naive ECE (0.109) significantly exceeds model ECE (0.068), confirming systematic miscalibration. The crossed GLMM yields ICCStudent = 0.123, showing calibration is situational rather than dispositional. PEDIcos = 0.081 (p = 0.327) indicates structural alignment between prediction and explanation on shared behavioral features.
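
Note (illustrative sketch, not from the paper): two of the calibration metrics named above, ECE and the Brier score, have standard definitions that fit in a few lines. The sketch uses equal-width confidence bins, which is one common convention and an assumption here; the confidence values and outcomes are made-up data, not the study's records.

```python
from typing import List

def expected_calibration_error(conf: List[float], correct: List[int], n_bins: int = 10) -> float:
    """Equal-width-bin ECE: size-weighted mean of |accuracy - mean confidence| per bin."""
    n = len(conf)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece

def brier_score(conf: List[float], correct: List[int]) -> float:
    """Mean squared difference between stated confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(conf, correct)) / len(conf)

# Made-up example: an overconfident respondent.
conf = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65]
correct = [1, 1, 0, 0, 1, 0]
ece = expected_calibration_error(conf, correct)
brier = brier_score(conf, correct)
```

The same functions can score either a student's self-reported confidence or a model's predicted probability, which is the kind of comparison behind the student-versus-model ECE figures in the abstract.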

[AI-148] Defeat Devices in AI Systems

Link: https://arxiv.org/abs/2606.28863
Authors: Emilio Ferrara
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Final version published in Future Internet, 18(7), 339, 2026

Click to view abstract

Abstract:AI systems increasingly exhibit behavior that differs systematically between evaluation and deployment contexts. Alignment faking, sandbagging, benchmark gaming, deceptive scheming, specification gaming, and trojans have each been documented separately, with each line of work characterizing one facet of what we argue is a single structural mechanism. We propose that this common mechanism is a defeat device, an engineering and regulatory concept long established in vehicle-emissions law and brought to broad public attention by the 2015 Volkswagen emissions case. A defeat device in an AI system has three necessary elements: a discriminator that detects evaluation context, a concealed swap that conditions behavior on detection, and a gap between eval-distribution and deployment-distribution performance on the stated evaluation criterion. We formalize this triadic test as a behavioral definition, organize documented cases along three taxonomic axes (origin, trigger, swap mechanism), propose Trigger-Axis-Aware Differential Probing (TADP) as a forensic detection protocol, and advance the claim that defeat devices can naturally emerge in current frontier AI systems without any operator engineering. We characterize naturally-emerging defeat devices as potentially one of the harmful emerging phenomena that AI safety practice should monitor and test for systematically. Implications for evaluation methodology, post-training pipeline design, interpretability research priorities, and AI governance follow.

[AI-149] Fisher-Routed Mixture of Experts for Federated Class-Incremental Learning ECCV2026

Link: https://arxiv.org/abs/2606.28835
Authors: Wenhao Yuan, Chenchen Lin, Jian Chen, Jinfeng Xu, Zewei Liu, Edith Cheuk Han Ngai
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by ECCV2026

Click to view abstract

Abstract:Federated Learning (FL) emerged as a promising distributed machine learning paradigm. However, extending FL to the class incremental learning scenarios introduces unique challenges: 1) Capacity conflict and catastrophic forgetting from the shared model overloading, 2) Heterogeneity from Non-Independent and Identically Distributed (Non-IID) data, and 3) Synchronized class misalignment. In this paper, we propose Fisher-Routed MiXture of Experts for Federated Class-Incremental Learning (FedFMX), a novel framework to address these challenges via adaptive expert specialization across clients. The crucial insight is to route each sample to an expert subset that jointly optimizes knowledge acquisition and retention. Specifically, we introduce a Fisher-Routed Expert Scoring (FRES) module to estimate expert importance via Fisher-based stability cost and gradient-based plasticity gain. Then, we design an Adaptive Expert Selection (AES) module by quantifying marginal contributions for adaptive expert subset determination. Finally, by the routing-aware regularization (RAR), we achieve load balance and efficient FL training. We theoretically prove the O(T^{-1}) convergence rate. Extensive experiments on multiple benchmarks compared with state-of-the-art methods demonstrate the superiority of FedFMX.

[AI-150] HARD-KV: Head-Adaptive Regularization for Decoding-time KV Compression ICML2026

Link: https://arxiv.org/abs/2606.28831
Authors: Yuxuan Yang, Feiyang Ren, Bowen Zeng, Dalin Zhang, Jinpeng Chen, Gang Chen, Huan Li
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 25 pages, ICML 2026 poster

Click to view abstract

Abstract:Long-context LLM inference faces a fundamental conflict: head-adaptive compression algorithms (e.g., Top-p nucleus sampling) offer superior accuracy by dynamically fluctuating memory budgets, yet modern inference engines (e.g., vLLM) demand rigid, static memory patterns to leverage CUDA Graphs and PagedAttention. We resolve this “Static-Dynamic” mismatch with HARD-KV, a unified framework that bridges dynamic selection with rigid system constraints. HARD-KV introduces a Cascade Cache hierarchy, managing the token lifecycle across dense, sparse, and condensed tiers. Crucially, we propose a Logits Calibration mechanism that normalizes diverse importance metrics into a unified probability space, enabling consistent Top-p budgeting across heterogeneous heads. To bridge the efficiency gap, we offer a system-level solution, which rewrites fragmented, dynamic indices into contiguous physical layouts compatible with high-performance inference engine. Extensive experiments on math-reasoning benchmarks (AIME, U-Math) verify that HARD-KV achieves up to 2× throughput improvement over static baselines while maintaining high-fidelity generation in 10k+ token scenarios. Code is available at this https URL.
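
Note (illustrative sketch, not the authors' method): Top-p budgeting over cached tokens can be pictured as keeping the smallest set of tokens whose normalized importance reaches a mass of p. The abstract does not describe how Logits Calibration normalizes scores, so a plain softmax is used here as a stand-in assumption; the function names and the per-head score vectors are hypothetical.

```python
import math
from typing import List

def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_keep(scores: List[float], p: float) -> List[int]:
    """Smallest set of token indices whose normalized importance sums to at least p."""
    probs = softmax(scores)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(kept)

# Hypothetical importance scores of two attention heads over 6 cached tokens.
peaked_head = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0]
flat_head = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

kept_peaked = top_p_keep(peaked_head, 0.9)
kept_flat = top_p_keep(flat_head, 0.9)
```

With the same p, the peaked head keeps a single token while the flat head keeps all six. That per-head variation in budget is the dynamic behavior the abstract says conflicts with engines that expect static memory layouts.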

[AI-151] Human2Any: Human-to-Robot Transfer via Constraint-Aware Compositional Planning

Link: https://arxiv.org/abs/2606.28813
Authors: Shuo Cheng, Chuye Zhang, Alfred Cueva, Caelan Garrett, Ajay Mandlekar, Danfei Xu
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Human videos are a scalable source of supervision for robot manipulation, as they are abundant and naturally capture rich object interactions. However, transferring human demonstrations to robots remains challenging due to embodiment mismatch, scene variation, and robot-specific feasibility constraints. We present Human2Any, a framework for learning reusable object-centric interaction priors from human videos without requiring real-world robot demonstrations in the target task contexts. Human2Any represents manipulation through object-object interaction motion, capturing task-relevant scene changes while abstracting away embodiment-specific details. It composes learned interaction priors with robot-side feasibility reasoning and motion planning, allowing the same human-derived knowledge to adapt to different embodiments, scene geometries, and task contexts. We validate Human2Any across diverse manipulation settings, including real-world experiments on a Franka tabletop setup and an RBY-1 humanoid mobile robot, demonstrating robust interaction-centric manipulation without real-world robot training data. Project website: this https URL.

[AI-152] Primary ICD Category Prediction using LLM-based Probing

Link: https://arxiv.org/abs/2606.28798
Authors: Chengyuan Liu, Xinyue Zhang, Yao Li, Guanting Chen
Categories: Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments: 9 pages, 2 figures. Supplementary materials provided as an ancillary file

Click to view abstract

Abstract:Objective: ICD codes are central to reimbursement, research, and population health surveillance, yet automated coding systems often struggle to integrate diagnostic signals from both clinical narratives and structured electronic health record (EHR) variables. We evaluated whether frozen medical large language model (LLM) representations can serve as a shared embedding space for multimodal primary diagnosis category prediction. Materials and Methods: We constructed a MIMIC-IV cohort of 13,645 admissions from the 10 most frequent primary ICD-10 codes, consolidated into seven categories. Structured variables were serialized into clinical narratives and combined with leakage-pruned discharge notes. Using a frozen MedFound-Llama3-8B-finetuned backbone, we extracted hidden states from five transformer layers and trained linear probes for structured-only, unstructured-only, and combined inputs, comparing against XGBoost and information-matched PLM-ICD baselines and evaluating MIMIC-III adaptation with a compact bottleneck adapter. Results: The combined probe performed best on MIMIC-IV (87.69% strict; 91.45% medical accuracy), exceeding both single-modality probes and baselines. The structured-only probe outperformed its standard baseline by 6.19 points in medical accuracy. Diagnostic information became increasingly linearly separable in deeper layers, and a 2M-parameter adapter restored cross-dataset transfer to MIMIC-III using only 5% of target labels. Discussion: LLM embeddings can unify structured and narrative EHR information for multimodal diagnosis prediction, supporting efficient reuse of clinical representations across modalities and datasets through a small representation-level module. Conclusion: Multimodal probing of frozen medical LLM representations provides a practical approach for studying EHR modalities and adapting clinical representations across datasets.

[AI-153] The registrar's function in a hybrid society. AI value chain, smart data and the concept of property

Link: https://arxiv.org/abs/2606.28789
Authors: Pompeu Casanovas, Carmen Pastor Sempere, Marina Echebarria Saenz
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 16 pages, 4 figures, United Nations Economic Commission for Europe Working Party on Land Administration and Registrars of Spain Workshop, How can AI and the digitalization in the land sector support achieving the Sustainable Development Goals, Barcelona, 9 and 10 June 2026, this https URL

Click to view abstract

Abstract:Artificial intelligence reaches the land registry not as another tool but as a value chain that turns data into intelligence and intelligence into economic value. This paper argues that the decisive legal move is to place validity, a functional, second-order concept, at the centre of that chain. Rights, liability and supervision organise around it. It traces three this http URL information becomes smart data, governed simultaneously by registry law, the GDPR, the European data acts and the AI Act. Control emerges as the operative concept for digital representations of real estate, whose proprietary effect depends on anchoring to the register. In a hybrid society of human and artificial agents, the registry becomes the public node of validity, with blockchain complementing rather than replacing it. Across three legal cultures, the registrar’s value migrates from processing documents to guaranteeing validated data, making validity an asset for the UNO Sustainable Development Goals.

[AI-154] Brownian Bridge Diffusion-Based Joint Channel Estimation and Data Detection for Jamming-Resilient Receivers

Link: https://arxiv.org/abs/2606.28778
Authors: Honghan She, Yufan Cheng, Tieming Sun, Pengyu Wang, Siya Huang, Kaikai Yang
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments:

Click to view abstract

Abstract:In next-generation wireless networks, the growing density of devices and limited spectrum resources pose severe jamming challenges to fragile legitimate communication links in the wireless electromagnetic environment. Crucially, when jamming overlaps with pilot and data symbols in both time and frequency domains, it inflicts a severe bottleneck on receiver-side joint estimation and detection. Existing schemes often lack an effective framework to combat such jamming contamination, thereby failing to guarantee reliable transmission. To address this issue, we propose a Brownian bridge diffusion-based joint channel estimation and data detection framework (BBD-JCED) for jamming-resilient receivers. Specifically, the proposed framework comprises two core modules: the first extracts jamming features in the short-time Fourier transform (STFT) domain and suppresses jamming samples, thereby improving the signal-to-jamming-plus-noise ratio (SJNR) of the received signal; the second introduces a Brownian bridge diffusion (BBD) process to model the evolution of the suppressed signal and the encoded bits in the presence of channel estimation errors, thereby enabling enhanced joint channel estimation and data detection. To alleviate the computational burden of the BBD process in the second module, we further derive a fast ordinary differential equation (ODE) solver that enables its low-complexity iterative evolution. Finally, we design a multi-module training algorithm to improve the data recovery capability of the proposed framework. Simulation results demonstrate that the proposed framework achieves superior bit recovery performance compared with baseline schemes while maintaining a lower number of model parameters and competitive computational complexity.

[AI-155] Mechanistic Personality Analysis of LLMs: Steering Personality via Latent Feature Interventions

Link: https://arxiv.org/abs/2606.28770
Authors: David Courtis, Ting Hu
Subjects: Artificial Intelligence (cs.AI)
Comments: Written in 2024; submitted to arXiv 2026

Click to view abstract

Abstract:Large Language Models (LLMs) have demonstrated the ability to simulate human-like OCEAN personality traits in generated text. Previous efforts have focused on prompt engineering or fine-tuning to shape LLM personality. In this work, we propose a mechanistic interpretability approach that directly intervenes on the model’s latent features. Our method identifies latent directions in the residual stream corresponding to a target OCEAN trait using sparse autoencoders (SAEs) and contrastive activation analysis. We formalize an additive steering vector in activation space and demonstrate how applying a small additive shift to the hidden states enhances the target trait while preserving overall language modeling performance. To determine the optimal combination of feature shifts, we explore a linear weighting heuristic with grid search optimization that balances personality expression with task performance. Our approach shows promise in controllably steering personality traits at the mechanistic level while maintaining high performance on standard benchmarks.
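
The additive shift described in the abstract can be illustrated with a small sketch. This is not the authors' code: a plain difference-of-means direction stands in for the SAE-derived trait direction, the activations are toy Gaussian vectors, and `steering_direction`, `steer`, and `alpha` are hypothetical names chosen for illustration.

```python
import math
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def steering_direction(pos_acts, neg_acts):
    """Unit vector pointing from the mean 'low trait' activation to the mean 'high trait' one.

    Assumption: difference of means replaces the SAE-based feature selection.
    """
    diff = [p - n for p, n in zip(mean_vector(pos_acts), mean_vector(neg_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def steer(hidden, direction, alpha):
    """Additive intervention: h' = h + alpha * v."""
    return [h + alpha * v for h, v in zip(hidden, direction)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


random.seed(0)
DIM = 8


def sample(shift):
    """Toy activation: Gaussian noise plus a shift along coordinate 0 (the toy 'trait' axis)."""
    vec = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    vec[0] += shift
    return vec


pos = [sample(+2.0) for _ in range(200)]  # prompts expressing the trait
neg = [sample(-2.0) for _ in range(200)]  # prompts expressing its opposite
v = steering_direction(pos, neg)

h = sample(0.0)
h_steered = steer(h, v, alpha=3.0)
# Because v has unit norm, the projection onto v grows by exactly alpha.
shift_along_v = dot(h_steered, v) - dot(h, v)
```

In a real model the shift would be applied to residual-stream hidden states at a chosen layer, and `alpha` would be tuned against benchmark performance, as the abstract's grid search suggests.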

[AI-156] Self-Supervised Theorem Discovery in a Formal Axiomatic System ICML2026

Link: https://arxiv.org/abs/2606.28747
Authors: Kazuki Ota, Takayuki Osa, Tatsuya Harada
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: AI for Math Workshop @ ICML 2026

Click to view abstract

Abstract:Recent artificial intelligence (AI) systems have shown remarkable progress in mathematical reasoning. Many existing approaches, including large language models (LLMs), draw on human prior knowledge in the form of mathematical text, code, or theorem libraries. Although these approaches are highly effective in practice, it remains an open question whether an agent can autonomously discover useful theorems without such human priors. We study this question in a formal axiomatic system by developing an agent that starts from axioms and inference rules alone and gradually grows a library of useful theorems. Concretely, we propose a self-supervised theorem-discovery algorithm that alternates between proof search and useful-theorem extraction, building a theorem library whose entries are reused as lemmas for subsequent proof search. Experiments show that the agent discovers tens of thousands of theorems and finds proofs for human-written benchmark problems, suggesting that its discoveries include theorems meaningful from a human mathematical perspective. Furthermore, the discovered theorems improve LLM proof performance when provided as prompt lemmas, indicating that they can serve as external knowledge for LLM reasoning. Our results provide evidence that useful theorems can emerge from proof search without relying on human-provided theorem libraries. More broadly, they suggest a path toward self-evolving AI systems for mathematics whose discoveries remain formally verifiable.

[AI-157] Agent Safety Is Action Alignment

Link: https://arxiv.org/abs/2606.28739
Authors: Shawn Li, Yue Zhao
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language models increasingly act as agents: they call tools, move money, delete records, and send messages on a user’s behalf. To keep them safe, practitioners imported the chatbot-era recipe (train the model to refuse unsafe inputs) into the agentic setting, and treat the resulting capability loss as a manageable “alignment tax.” We argue this is a category error. Refusal is a primitive for content safety, where the harm is in the model’s output and is therefore a learnable function of it. Agentic harm is different in kind: it lies not in any output but in the relation between the authority an action exercises and the authority the user granted, which is absent from the text the model sees. Importing content-safety methods into this regime does not trade capability for safety; it pays capability and buys negative security. We support this with three lines of evidence spanning the autonomy spectrum: defense-trained models learn surface patterns rather than intent; the same training collapses multi-step agents before any threat appears while leaving them exploitable; and even undefended frontier models exceed granted authority under ordinary use. We conclude that action safety cannot be installed in weights. It must be expressed as least privilege, enforced outside the model at the action boundary, and evaluated as action alignment (a relational, deployment-conditioned property) rather than a refusal score.

[AI-158] Agentic Abstention: Do Agents Know When to Stop Instead of Act?

Link: https://arxiv.org/abs/2606.28733
Authors: Han Luo, Bingbing Wen, Lucy Lu Wang
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:LLM agents are expected to act over multiple turns, using search, browsing interfaces, and terminal tools to complete user goals. Yet not every goal is well specified or achievable in the available environment. In such cases, a reliable agent should recognize that further interaction is unlikely to help and abstain from additional tool calls. We define Agentic Abstention, the problem of deciding when an agent should stop acting under uncertainty. Unlike standard LLM abstention, which is usually evaluated as a single-turn answer-or-abstain decision, agentic abstention is a sequential decision problem: an agent can answer, abstain, or gather more information at each turn, and the need to abstain may only become clear after interacting with the environment. We study this problem across web shopping, terminal environments, and question answering, evaluating 13 LLM-as-agent systems and 2 agent scaffolds on more than 28,000 tasks. Our results show that the main challenge is not only whether agents can abstain, but also when they abstain. Some agents never abstain when they should, while others do so only after many unnecessary interactions. This gap is especially large on tasks where the instruction appears feasible until the environment reveals otherwise (e.g., no valid result matches the instruction). We further find that model scale, reasoning, and agent scaffolding affect abstention in different ways, where larger or more capable models sometimes perform worse at timely abstention. Finally, we introduce CONVOLVE, a context engineering method for improving agentic abstention that distills full interaction trajectories into reusable stopping rules. On WebShop, CONVOLVE substantially improves timely abstention without updating model parameters, raising Llama-3.3-70B’s timely recall rate from 26.7 to 57.4. Our dataset and code are available at this https URL

[AI-159] ComMem: Complementary Memory Systems for Test-Time Adaptation of Vision-Language Models

Link: https://arxiv.org/abs/2606.28719
Authors: Guanglong Sun, Shuang Cui, Bo Lei, Liyuan Wang, Zihan Zhai, Hongwei Yan, Hang Su, Jun Zhu, Yi Zhong
Subjects: Artificial Intelligence (cs.AI)
Comments: A brain-inspired complementary memory framework leveraging fast visual caching and slow textual refinement for VLM test-time adaptation

Click to view abstract

Abstract:Test-time adaptation (TTA) of vision-language models (VLMs) is essential for their robust deployment in dynamic, real-world environments. However, existing TTA methods often adapt locally without accumulating knowledge over time, or operate within a single modality without exploiting VLMs’ inherently multi-modal nature. Inspired by the Complementary Memory systems of the biological brain, we propose ComMem, an innovative approach that mimics the distinct but cooperative roles of the hippocampus and neocortex to enable effective TTA for VLMs. ComMem consists of two key components: a fast-adapting detailed memory, akin to the hippocampus, that forms a dynamic visual cache from high-confidence test samples; and a slow-integrating abstract memory, akin to the neocortex, that continually refines global textual prototypes. For each test instance, ComMem jointly optimizes both memory systems to ensure cross-modal consistency. Extensive experiments on 15 benchmark datasets show that ComMem significantly outperforms state-of-the-art methods under both natural distribution shifts and cross-dataset generalization, offering a promising direction for enhancing VLMs’ practical adaptability.

[AI-160] TrajRS: Towards Certified Robustness in Pedestrian Trajectory Prediction

Link: https://arxiv.org/abs/2606.28716
Authors: Liang Zhang, Gaojie Jin, Yao Shi, Quanzhi Li, Cheng-Chao Huang, David N. Jansen, Lijun Zhang
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted by 2026 IEEE International Conference on Acoustics, Speech and Signal Processing

Click to view abstract

Abstract:The robustness of trajectory prediction models is crucial for developing safe autonomous driving systems. Adversarial attacks on trajectory prediction can significantly impair the accuracy of predicted trajectories, leading to hazardous driving behaviors. While heuristic defense strategies have been implemented to enhance the robustness of trajectory prediction models, these measures often fail against more sophisticated, targeted adversarial attacks. Hence, there is a pressing need to establish verifiable safety assurances for trajectory prediction models. In this paper, we extend the traditional Randomized Smoothing framework to “TrajRS”, which provides a certified robust radius for smoothed trajectory predictors. We clarify and expand the formal definitions of robustness in trajectory prediction and tailor the practical TrajRS scheme specifically to “robustness for the optimal prediction” and “robustness for all possible predictions”. An extensive set of experiments demonstrates that TrajRS effectively achieves robustness certification for all smoothed pedestrian trajectory predictors in this work.

[AI-161] The Two Genie Game: Adoption and Welfare in Audit-Grounded AI Governance

Link: https://arxiv.org/abs/2606.28710
Authors: Darrell Lewis-Sandy
Subjects: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments: 36 pages, 3 figures. Lean 4 formalization and figure scripts: this https URL

Click to view abstract

Abstract:We ask under what conditions an agent with a harm-minimizing policy can displace an approval-seeking (RLHF) agent in a competitive market, and when that policy is sufficient to prevent community harm. We use evolutionary game theory (finite-population Moran-Fermi pairwise comparison) to formalize this subject to assumptions of wisher hindsight, peer testimony, a monotone harm ledger, sufficient information density of community feedback, and a finite, depleting resource pool, in a negative-sum environment. We show that adoption is favored when the prior distributions on how readily wishers attune to community sentiment are monotone, exhibit endpoint inversion, and have a centro-symmetric pairing property, and demonstrate this with several long-tailed priors (Hill, Pareto, Lomax, Frechet). Where it is favored, a critical adoption level separates communities that drift back to the approval-seeking agent from those for which the audited agent fixes; above that level fixation is the overwhelmingly likely outcome. We derive when fixation is attainable as a bound on the effective (informational) size N_c of the community, which must be small enough to allow fixation before depletion. We present these as Theorems 5.4 and 5.5; the algebraic and finite-grid backbone is machine-checked in Lean 4, with the barrier-crossing asymptotics retained as explicit hypotheses. We show that a self-audited agent with a community ledger is not, in general, sufficient to prevent community harm. Sufficiency depends both upon the alignment of the agent’s audit with community values and the timeframe over which harm is evaluated. Regardless of alignment, once adoption reaches dominance, the state is absorbing. The same policy that reduced harm under alignment becomes a trap, welfare-negative under misalignment and, even under alignment, one that locks in harm deferred past the adoption horizon.

[AI-162] BV-Blend: Uncertainty-Weighted Historical Baselines for Stable Critic-Free RL with Verifiable Rewards

Link: https://arxiv.org/abs/2606.28707
Authors: Yupeng Chang, Yuan Wu, Yi Chang
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Critic-free reinforcement learning with verifiable rewards (RLVR), exemplified by Group Relative Policy Optimization (GRPO), avoids training a value function (critic) and reduces memory and compute overhead relative to critic-based PPO pipelines for aligning large language models. However, GRPO-style advantage estimation depends on prompt-local (within-prompt-group) reward statistics and can be unstable. In particular, when all rollouts in a prompt group receive identical rewards, the within-group reward variance becomes zero, and group normalization yields zero advantages for that group, impeding learning in cold-start regimes with binary verifiers. We introduce BV-Blend, a critic-free framework that stabilizes advantage estimation by combining prompt-local on-policy statistics with semantic-cluster-conditioned historical moments. BV-Blend maintains EMA-tracked reward moments for each cluster, derives a confidence weight from a standard error of the mean (SEM) proxy, and uses this weight to blend historical and prompt-local baseline and variance statistics into a standardized advantage for PPO-style clipped updates. Experiments on verifiable reasoning benchmarks show that BV-Blend improves training stability and performance, and remains robust in regimes where group-normalized methods may stall.
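
The zero-advantage failure mode and the blending idea can be sketched as follows. The abstract does not give the exact formulas, so everything beyond plain group normalization is an assumption made for illustration: the `ClusterStats` helper, the SEM proxy `sqrt(var / count)`, and the weight mapping `w = 1 / (1 + sem)` are hypothetical, not the paper's definitions.

```python
import math


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - group mean) / group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / math.sqrt(var + eps) for r in rewards]


class ClusterStats:
    """EMA-tracked first and second reward moments for one semantic cluster (hypothetical)."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.mean = 0.0
        self.second = 0.0
        self.count = 0

    def update(self, rewards):
        batch_mean = sum(rewards) / len(rewards)
        batch_second = sum(r * r for r in rewards) / len(rewards)
        if self.count == 0:
            self.mean, self.second = batch_mean, batch_second
        else:
            d = self.decay
            self.mean = d * self.mean + (1 - d) * batch_mean
            self.second = d * self.second + (1 - d) * batch_second
        self.count += len(rewards)

    @property
    def var(self):
        return max(self.second - self.mean ** 2, 0.0)


def blended_advantages(rewards, stats, eps=1e-6):
    """Blend historical and prompt-local moments with a confidence weight w in [0, 1]."""
    n = len(rewards)
    local_mean = sum(rewards) / n
    local_var = sum((r - local_mean) ** 2 for r in rewards) / n
    if stats.count == 0:
        w = 0.0  # no history yet: fall back to prompt-local statistics
    else:
        sem = math.sqrt(stats.var / stats.count)  # assumed SEM proxy
        w = 1.0 / (1.0 + sem)                     # assumed weight mapping
    baseline = w * stats.mean + (1 - w) * local_mean
    var = w * stats.var + (1 - w) * local_var
    return [(r - baseline) / math.sqrt(var + eps) for r in rewards]


stats = ClusterStats()
stats.update([1.0, 0.0, 1.0, 0.0])      # earlier rollouts from the same cluster
failed_group = [0.0, 0.0, 0.0, 0.0]     # every rollout fails the binary verifier

plain = grpo_advantages(failed_group)            # all zeros: no learning signal
blended = blended_advantages(failed_group, stats)  # negative: history says 0 is below par
```

With identical rewards the group-normalized advantages vanish, while the blended version still pushes away from the failing rollouts because the cluster's historical mean is above zero.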

[AI-163] COMPASS: Grounding Composition-Intent Guidance in Unified Multimodal Models

Link: https://arxiv.org/abs/2606.28696
Authors: Ziqi Zhou, Weize Quan, Mining Tan, Zhihan Chen, Dandan Zheng, Jingdong Chen, Jun Zhou, Weiming Dong, Dong-Ming Yan
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Composition is a high-level visual intent that governs where subjects are placed and how a scene is organized, yet current unified multimodal models remain unreliable at fine-grained composition recognition and struggle to turn such intent into controllable generation. We present COMPASS, the first unified multimodal framework that grounds composition-intent control in a single system spanning both composition perception and composition-guided generation, with a shared expert token $\tau_c$ as the central intent anchor. On the perception side, COMPASS injects composition expertise into an MoE backbone in a minimally invasive manner and distills the inferred intent into $\tau_c$. On the generation side, COMPASS reuses $\tau_c$ as a global conditioning signal that steers the denoising trajectory, effectively converting passive composition analysis into explicit layout control. To support systematic instruction-following composition learning and evaluation at scale, we construct Comp-11, a large-scale dataset with an 11-class taxonomy and reasoning-augmented annotations. Extensive experiments show that COMPASS substantially improves category-level composition understanding and delivers more composition-consistent, prompt-faithful generation than strong baselines.

[AI-164] An AI agent for treatment reasoning over a biomedical tool universe

Link: https://arxiv.org/abs/2606.28692
Authors: Shanghua Gao, Ayush Noori, Richard Zhu, Curtis Ginder, Zhenglun Kong, Xiaorui Su, Justin Kauffman, Benjamin S. Glicksberg, Joshua Lampert, Ankit Sakhuja, Ashwin Sawant, ATHENA-R1 Evaluation Consortium, David A. Clifton, Noa Dagan, Ran Balicer, Marinka Zitnik
Subjects: Artificial Intelligence (cs.AI)
Comments: Project page: this https URL Code: this https URL

Click to view abstract

Abstract:Treatment reasoning underpins every therapeutic decision, integrating disease context, comorbidities, medications, contraindications, and evolving biomedical knowledge to select an appropriate therapy. It is inherently iterative: candidates are weighed against many constraints, revised as evidence emerges, and grounded in verifiable sources. Here we introduce ATHENA-R1, an AI agent for treatment reasoning across all FDA approved drugs since 1939, trained by reinforcement learning over a universe of 212 biomedical tools. At each step it identifies missing information, selects and runs relevant tools, and incorporates the evidence. To train it without human-annotated traces, we build a two-level self-learning framework: multi-agent systems construct the tools, tasks, and reasoning trajectories for supervised fine-tuning, then reinforcement learning with scientific feedback rewards reasoning quality (evidence gathering, grounded tool use, logical non-redundancy). Across five benchmarks of 3,168 drug reasoning tasks and 456 patient treatment cases, ATHENA-R1 outperforms language models and tool-use systems, reaching 94.7% accuracy on open-ended drug reasoning and 82.9% on treatment reasoning, 17.8 and 10.7 points above GPT-5. In blinded evaluations by experts from 28 rare disease organizations, it is preferred over reference models on all criteria, and physicians rated it favorably on complex hospitalized cardiovascular and infectious-disease cases. Adverse-event hypotheses it generated, tested in electronic health records from 5.4 million patients, reached adjusted odds ratios of 1.48-1.84, with no elevation among negative controls. Because it requires knowing what evidence to seek before concluding, treatment reasoning has long been hard for AI; we show it can be reframed as a learnable process of iterative evidence gathering that reinforcement learning can train AI to perform.

[AI-165] Aristotelian Virtue Profiling of LLMs through Ethical Dilemmas

Link: https://arxiv.org/abs/2606.28683
Authors: Ioannis Tzachristas, John Pavlopoulos
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Applications (stat.AP); Methodology (stat.ME)
Comments: VirtueMap website: this https URL

Click to view abstract

Abstract:Large Language Models (LLMs) often face ethical tradeoffs in which several responses may be defensible but express different priorities, such as fairness, honesty, courage, or restraint. We introduce VirtueMap, a framework for describing these patterns through an Aristotelian virtue-ethics lens. Instead of asking for a single correct answer, VirtueMap asks humans or LLMs to rank all five responses to each of seven general, non-lethal, non-political, and non-religious ethical dilemmas. To define the reference orderings used for scoring, we first proposed, for each dilemma and virtue, an ordering of the five responses from most to least expressive of that virtue. We then collected more than 100 respondent evaluations per ordering and retained it as operational ground truth only when at least 95% confirmed it. Rankings are scored against these retained orderings using normalized Borda alignment, yielding profiles over Practical Wisdom, Justice, Truthfulness, Courage, and Temperance. We apply VirtueMap to nine LLM families in a repeated-run evaluation and find high mean rank consistency (90.3%), with the largest differences appearing on Courage, Temperance, and Justice. We also release an interactive website that computes profiles locally in the browser and compares respondents with measured LLM profiles.

[AI-166] Capability Gates Are Not Authorization: Confused-Deputy Failures in LLM Agent Frameworks

Link: https://arxiv.org/abs/2606.28679
Authors: David Mellafe Zuvic
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 7 pages, 3 figures, 3 tables. Artifact: this https URL (this http URL)

Click to view abstract

Abstract:Tool-using LLM agents increasingly read untrusted content while holding side-effecting tools such as payments, email, CRM, and infrastructure APIs, yet common framework defaults still conflate tool exposure with authorization. We audit whether LangChain/LangGraph, LlamaIndex, and the Stripe Agent Toolkit re-authorize each model-emitted call, with concrete argument values, before execution. Across pinned public-source commits, all three provide capability gating by default, but none provides a deterministic fail-closed per-call value authorization gate by default. We introduce ScopeGate, a five-stage PDP/PEP for agent tool calls: scope, authorization, money ceiling, idempotency, and default deny. Evaluation shows the identical unauthorized payout call executes under LangChain’s default dispatch (with a companion LlamaIndex PoC) but is denied by ScopeGate; the tested control reports 0/48 static bypasses, 0/29 unauthorized attempts (40-iteration adaptive run), 0/10 benign false-denies, and Latam-GPT payment-agent containment at 10/10. ASR denotes attempted unauthorized action, containment is not a cure, deployment-tier claims are inference over measured model classes, and no CVE is asserted.
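
A per-call, fail-closed gate of the kind the abstract describes might look like the sketch below. It is one reading of the five named stages, not ScopeGate's actual implementation: `ToolCall`, `Policy`, `authorize`, the `create_payout` tool name, and the argument keys `recipient` and `amount` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    tool: str
    args: dict
    idempotency_key: str


@dataclass
class Policy:
    allowed_tools: set
    allowed_recipients: set
    max_amount: float
    seen_keys: set = field(default_factory=set)


def authorize(call, policy):
    """Return (allowed, stage). Checks concrete argument values, not just tool exposure."""
    try:
        # 1. Scope: is this tool exposed to the agent at all?
        if call.tool not in policy.allowed_tools:
            return False, "scope"
        # 2. Authorization: is this specific recipient permitted?
        if call.args["recipient"] not in policy.allowed_recipients:
            return False, "authorization"
        # 3. Money ceiling: is the concrete amount within the granted limit?
        if not (0 < call.args["amount"] <= policy.max_amount):
            return False, "money_ceiling"
        # 4. Idempotency: refuse replays of an already executed call.
        if call.idempotency_key in policy.seen_keys:
            return False, "idempotency"
    except (KeyError, TypeError):
        # 5. Default deny: malformed or unexpected calls fail closed.
        return False, "default_deny"
    policy.seen_keys.add(call.idempotency_key)
    return True, "allow"


policy = Policy(
    allowed_tools={"create_payout"},
    allowed_recipients={"acct_vendor"},
    max_amount=100.0,
)
first = authorize(ToolCall("create_payout", {"recipient": "acct_vendor", "amount": 50}, "k1"), policy)
replay = authorize(ToolCall("create_payout", {"recipient": "acct_vendor", "amount": 50}, "k1"), policy)
hijack = authorize(ToolCall("create_payout", {"recipient": "acct_attacker", "amount": 50}, "k2"), policy)
```

The point of the design is that an exposed tool is not the same as an authorized call: the hijacked payout uses an allowed tool yet is rejected on its argument values before anything executes.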

[AI-167] Constrained Tabular Diffusion for Finance

Link: https://arxiv.org/abs/2606.28674
Authors: Michael Cardei, Jose M Munoz, Oscar Barrera, Shreyas K Chandrahas, Partha Saha
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Published at ACM International Conference on AI in Finance (ICAIF) 2025

Click to view abstract

Abstract:Generative models in finance face the dual challenge of producing realistic data while satisfying strict regulatory and economic objectives, a requirement that standard tabular diffusion models cannot provide. To address this difficulty, we introduce Constrained Tabular Diffusion for Finance (CTDF), a novel integration of sampling-time feasibility operations with mixed-type tabular diffusion in financial applications. By incorporating a training-free feasibility operator into the reverse-diffusion sampling loop, CTDF enforces hard constraints for applications such as simulation, legal compliance, and extrapolation. Extensive experiments on large-scale financial datasets demonstrate zero constraint violations and improvement in scarce data utility. CTDF establishes a robust method for generating trustworthy and compliant synthetic data, opening new avenues for rigorous generative modeling and analysis in the financial domain.

[AI-168] Why Trust Your Agent? Empirical Security Gains from TRiSM-Guided Agentic Workflows in Healthcare

Link: https://arxiv.org/abs/2606.28666
Authors: Liam Kearns
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 15 pages, 6 figures

Click to view abstract

Abstract:Agent-based AI has enabled the automation of tasks by exposing application tools and resources to large language models (LLMs). However, to improve scope and accuracy, agents are often given access rights that exceed those of ordinary users, introducing significant security risks. AI is routinely integrated into applications with disregard for security, risking data exposure and regulatory breaches. This paper applies the AI Trust, Risk, and Security Management (TRiSM) framework to a medical report-generation application to demonstrate how an insecure agent workflow can be transformed into a security-conscious agentic workflow. Both workflows were evaluated across five LLMs (Claude Haiku 4.5, GPT-4.1-nano, GPT-4.1-mini, GPT-5.4-mini, and Gemini 2.5 Flash) on two report types, totalling 800 generations and 500 attack scenarios including RAG poisoning, data-field injection, and client-side network injection. The TRiSM-guided agentic workflow reduced mean attack success rates from 31% to 10% for RAG poisoning and from 42% to 25% for data-field injection, while eliminating the network injection vector entirely through server-side prompt construction. Furthermore, report accuracy increased by 14 percentage points (72.5% to 86.5%) with the agentic workflow, demonstrating that a secure design also provides more reliable outputs. This paper contributes to knowledge by demonstrating that least-privilege, defence-in-depth agentic workflows improve both security and accuracy, while also highlighting that model choice is a necessary architectural consideration.

[AI-169] Closed-Form Steepest Descent Direction toward Flat Minima: Reducing Upper Bounds on the Loss Hessian Eigenspectrum in Neural Networks

Link: https://arxiv.org/abs/2606.28662
Authors: Yuto Omae, Kazuki Sakai, Yohei Kakimoto, Makoto Sasaki, Yusuke Sakai, Hirotaka Takahashi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 25 pages

Click to view abstract

Abstract:The flatness hypothesis suggests that flatness of the loss landscape, as measured by the eigenvalues of the loss Hessian, correlates with better neural network generalization. While various algorithms reduce these eigenvalues, most focus on procedural design, leaving it unclear how data distributions and NN parameters structurally determine directions toward flat minima. Characterizing these directions analytically is generally intractable. To overcome this mathematical difficulty, recent studies derived the Wolkowicz-Styan (WS) upper bound on the maximum eigenvalue of the cross-entropy loss Hessian in three-layer NNs. Although this upper bound is differentiable, its gradient was not derived. Therefore, we analytically derive the gradient of the WS upper bound to characterize directions leading to flat minima. Based on this, we propose Hessian Spectral Range (HSR) Regularization, which updates parameters along the steepest descent direction of the WS bound. Experiments demonstrate that HSR Regularization narrows the Hessian eigenvalue spectrum, avoids sharp minima and saddle points, and promotes convergence to flat minima. Although the applicability of this method is currently limited to cross-entropy loss and three-layer architectures, to the best of the authors’ knowledge, this is the first study to report a closed-form gradient that promotes convergence to flat minima without numerical approximations. Therefore, the theoretical analysis of this gradient is expected to contribute to the further development of NNs.
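
For readers unfamiliar with the Wolkowicz-Styan bound, here is a small sketch of its standard form for a symmetric n×n matrix A: with m = tr(A)/n and s² = tr(A²)/n − m², the largest eigenvalue satisfies λ_max ≤ m + s·√(n−1). The function name is illustrative, and the paper's specialization to the cross-entropy Hessian of three-layer networks (and its gradient) is not reproduced here.

```python
import math


def ws_upper_bound(a):
    """Wolkowicz-Styan upper bound on the largest eigenvalue of a symmetric matrix.

    `a` is a square, symmetric matrix given as a list of rows.
    """
    n = len(a)
    trace = sum(a[i][i] for i in range(n))
    # For symmetric A, tr(A^2) equals the sum of all squared entries.
    trace_sq = sum(a[i][j] ** 2 for i in range(n) for j in range(n))
    m = trace / n
    s_squared = max(trace_sq / n - m * m, 0.0)  # guard against rounding below zero
    return m + math.sqrt(s_squared) * math.sqrt(n - 1)
```

The bound depends only on two traces, which is what makes it differentiable in closed form. For the 2×2 matrix `[[2, 1], [1, 2]]`, whose eigenvalues are 1 and 3, it is tight.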

[AI-170] RIPA: Sensory-Vector Prompt Injection Attacks on LLM-Controlled ROS 2 Robots

Link: https://arxiv.org/abs/2606.28649
Authors: Nima Dorzhiev
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 12 pages, 4 figures, 10 tables, 3 appendices

Click to view abstract

Abstract:We present RIPA, the first systematic multi-channel empirical study of prompt injection attacks delivered through the sensory pipeline of a ROS 2-based LLM-controlled robotic system. Across 100 independent runs per injection variant on five LLMs spanning four model families and parameter scales from approximately 4B to approximately 284B (DeepSeek-V4-Flash, Llama-3-8B-Instruct-Lite, Llama-3.3-70B-Instruct-Turbo, Qwen 2.5-7B-Instruct-Turbo, Gemma-3n-E4B), we identify model-specific vulnerability profiles that do not follow a monotonic scaling trend: Llama-3.3-70B-Instruct-Turbo exhibits 100% attack success rate (ASR) across all injection variants, while Llama-3-8B-Instruct-Lite and Qwen 2.5-7B-Instruct-Turbo resist direct-override injection (0% ASR), and the smallest model evaluated (Gemma-3n-E4B, approximately 4B) matches the 70B model’s vulnerability profile, indicating that robustness is model-specific rather than scale-dependent. We propose a hybrid semantic firewall that achieves 0% ASR against known injection patterns with no false positives on a preliminary benign set (0/20 commands) but exhibits a 10.2% trial-weighted bypass rate (58/570 trials; N equals 30 per payload across 19 obfuscation payloads) against adversarially obfuscated attacks, exposing a critical gap between rule-based and semantic defense layers. We further introduce three sensory injection channels: visual (Channel 1, via OCR), audio (Channel 2, via Whisper STT), and LiDAR sensor context poisoning (Channel 3). We show that Channel 3, which injects fabricated obstacle data into the robot environment-state representation at the LLM system-prompt level, achieves 100% ASR across all variants on DeepSeek-V4-Flash. We also contribute a firewall bypass taxonomy spanning 19 obfuscation payloads across five categories. All code, data, and results are publicly available.

[AI-171] Analysis of Parameter Settings for the Bat Algorithm Using Variance Evolution ICCS2026

Link: https://arxiv.org/abs/2606.28644
Authors: Xin-She Yang, Mehmet Karamanoglu
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages, 5 figures, ICCS2026

Click to view abstract

Abstract:Parameter settings in evolutionary algorithms and metaheuristics are important because such parameter values can influence the performance of algorithms under evaluation. For a given algorithm, there are many different numerical experiments to show that the algorithm can work well in practice; however, in most cases there is no theoretical analysis of parameter settings. In this work, we show that theoretical analysis using the theory of dynamical systems and evolution of population variance can give some good results in terms of parameter ranges for the bat algorithm. We also show that results from numerical experiments are consistent with theoretical bounds. Such analyses can provide good insights from different perspectives about the algorithmic characteristics such as variance evolution, transition between exploration and exploitation as well as convergence behaviour.

[AI-172] Fast and Accurate Outlier-Aware LiDAR Super-Resolution for SLAM Applications

Link: https://arxiv.org/abs/2606.28607
Authors: Christos Anagnostopoulos, Alexandros Gkillas, Nikos Piperigkos, Aris S. Lalos
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 6 pages, 2 figures

Click to view abstract

Abstract:This work tackles the challenge of enhancing low-resolution LiDAR sensors for SLAM applications through a novel Deep Unrolling-based Super-Resolution (SR) model. We integrate an outlier removal module to ensure structural integrity while maintaining real-time performance. By leveraging a model-based optimization approach, our method efficiently reconstructs high-resolution point clouds while minimizing computational overhead. The proposed SR model is evaluated within a LiDAR SLAM framework, demonstrating significant improvements in pose estimation accuracy and efficiency compared to state-of-the-art SR methods.

[AI-173] Database Context Compression for Text-to-SQL on Real-World Large Databases

Link: https://arxiv.org/abs/2606.28601
Authors: Jingwen Liu, Weibin Liao, Xin Gao, Junfeng Zhao, Yasha Wang
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Recent progress in Text-to-SQL has been driven by stronger language models and prompting strategies, yet performance on real enterprise benchmarks such as Spider 2.0 and BIRD remains far below that on classical academic datasets. We argue that the main bottleneck is no longer reasoning, but database representation. Real databases contain repeated audit columns, large groups of similar tables, opaque identifiers whose meanings are stored only in documentation, and extensive data dictionaries with little query-relevant information. Existing query-aware methods, including schema linking and retrieval-based schema selection, filter this raw context but still operate on redundant and verbose representations. We reformulate the problem as database context compression, a query-agnostic transformation that rewrites schemas, semantic descriptions, and external documentation into a compact representation. We formalize this transformation with the SGCF (Support-Gain Component Factorization) principle, which unifies repeated column extraction, isomorphic table templating, semantic componentization, and evidence purification under a single coverage objective. Based on SGCF, we propose DBCC, a database-side middleware that performs offline structural and semantic compression together with lightweight online evidence purification. DBCC is model-agnostic and can be integrated into existing Text-to-SQL pipelines. On Spider 2.0-Snow and BIRD, DBCC reduces input context by up to two orders of magnitude (from 2.6M to 34.7K tokens on the largest Spider 2.0-Snow subset), improves schema-linking strict recall from 0% to 56.5% under DeepSeek-V3.2 (63.1% under Claude Opus 4.7), and consistently increases end-to-end execution accuracy by 1.8-1.9% over three recent Text-to-SQL systems. Our code is open-sourced at this https URL.
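
To make "repeated column extraction" concrete, here is a toy sketch that factors out columns present in every table (for example audit columns) so they are listed once instead of per table. The function name and the "shared by all tables" rule are assumptions chosen for illustration; the SGCF objective and DBCC's actual procedure are not described in enough detail in the abstract to reproduce.

```python
def factor_shared_columns(schema):
    """Split a schema into columns shared by every table and per-table remainders.

    `schema` maps table name -> list of column names.
    """
    tables = list(schema.values())
    if not tables:
        return [], {}
    shared = set(tables[0])
    for columns in tables[1:]:
        shared &= set(columns)  # keep only columns that every table has
    compact = {
        name: [c for c in columns if c not in shared]
        for name, columns in schema.items()
    }
    return sorted(shared), compact


schema = {
    "orders": ["id", "created_at", "updated_by", "amount"],
    "users": ["id", "created_at", "updated_by", "email"],
}
shared, compact = factor_shared_columns(schema)
```

Because only columns common to all tables are factored out, the original schema can be rebuilt exactly from `shared` and `compact`, so this toy compression is lossless.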

[AI-174] Neuromorphic Energy-Aware Learning for Adaptive Deep Brain Stimulation

Link: https://arxiv.org/abs/2606.28600
Authors: Binh Nguyen, Colleen Josephson, Mircea Teodorescu, Gert Cauwenberghs, Jason Eshraghian
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Click to view abstract

Abstract:Neuromorphic and edge computing research has focused on reducing the inference cost of neural network controllers, yet in physical closed-loop systems the actuator can rival or exceed an efficient controller in energy. An efficient controller is therefore necessary but not sufficient, because the actuator becomes the cost worth reducing once inference no longer dominates it. Here, we introduce energy-aware learning, an approach that incorporates actuator energy directly into the reinforcement learning reward, and demonstrate it in closed-loop deep brain stimulation (DBS) for Parkinson’s disease. A deep spiking Q-network, trained in a biophysical cortico-basal ganglia-thalamic circuit model, learns to suppress pathological alpha-beta oscillations by 45.2% while reducing stimulation charge by 80.0% relative to continuous DBS. Sparsity-constrained knowledge distillation compresses the policy onto the SynSense XyloAudio 3 neuromorphic processor at 0.52 mW inference power, yielding 28.1x lower energy per inference than an equivalent artificial neural network on conventional edge hardware. By co-optimizing stimulation energy and inference efficiency, the framework addresses both major power demands in implantable neuromodulation.
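
Folding actuator energy into the reward can be sketched in a couple of lines. The linear form, the argument names, and the default weight below are all assumptions for illustration; the abstract does not give the paper's actual reward.

```python
def energy_aware_reward(beta_power, stim_charge, charge_weight=0.1):
    """Toy reward: penalise pathological oscillation power and stimulation charge.

    beta_power    -- measured pathological band power (lower is better)
    stim_charge   -- charge delivered by the stimulator during this step
    charge_weight -- hypothetical trade-off between symptom control and energy
    """
    return -beta_power - charge_weight * stim_charge
```

With such a reward, two policies that suppress oscillations equally well are ranked by how little charge they deliver, which is the trade-off the abstract describes.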

[AI-175] Search for Truth from Reasoning: A Dynamic Representation Editing Framework for Steering LLM Trajectories ICML’26

Link: https://arxiv.org/abs/2606.28589
Authors: Tianlong Wang, Yuhang Wang, Weibin Liao, Xin Gao, Xinyu Ma, Yang Lin, Yasha Wang, Liantao Ma
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted by ICML’26

Click to view abstract

Abstract:Current approaches to enhance Large Language Model (LLM) reasoning, such as Chain-of-Thought and “Wait” prompts, primarily encourage models to think more, yet often fail to guide them toward Truth. While Representation Editing (RepE) offers intrinsic control, its application to dynamic reasoning trajectories remains underexplored. In this work, we bridge this gap by investigating the geometry of truth within unfolding reasoning chains. We uncover three critical insights: (1) Truth is encoded at the sentence level and is entangled with latent reasoning patterns; (2) Effective intervention follows an Uncertainty Principle and a Decay Effect, requiring localization to early, high-entropy forks; (3) Naive steering vectors suffer from noise, risking collateral damage to correct trajectories. Based on these findings, we propose DynaSteer, a dynamic RepE framework. DynaSteer employs pattern clustering to disentangle reasoning manifolds and utilizes Fisher-LDA to project purified truth. By dynamically monitoring lookahead entropy, it selectively steers and rolls back trajectories only when necessary. Comprehensive experimental results on several MATH benchmarks verify the effectiveness of DynaSteer, and experiments on out-of-domain coding tasks further confirm its generalization ability. Our code is publicly available at this https URL.
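
Entropy-gated intervention can be illustrated with a simplified sketch: compute the Shannon entropy of a next-token distribution and intervene only when it exceeds a threshold. This is a single-step simplification; the paper's "lookahead entropy" and its rollback logic are not specified in the abstract, and the threshold value here is an arbitrary assumption.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_steer(probs, threshold=1.0):
    """Hypothetical gate: steer only at high-entropy (uncertain) positions."""
    return shannon_entropy(probs) > threshold
```

A uniform distribution over four tokens has entropy ln 4 ≈ 1.386 and would trigger steering, while a sharply peaked distribution would be left untouched.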

[AI-176] Geometric Measurements of the Axiom of Choice in Neural Proof Embeddings

Link: https://arxiv.org/abs/2606.28572
Authors: Rodrigo Mendoza-Smith
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Logic (math.LO)
Comments:

Click to view abstract

Abstract:The axiom of choice has divided the foundations of mathematics for over a century, but the distinction between classical and constructive proofs has remained a philosophical and methodological one. We use Lean 4’s kernel-level tracking of axiom dependence to show that the axiom of choice has a measurable geometric correlate in proof space that obeys a one-parameter mixture law and has operational consequences for neural theorem provers. To do this, we partition 471,260 declarations of Mathlib by transitive dependence on the axiom of choice and represent a filtered population of 42,355 traced theorems by their sequences of tactic invocations. We use the constructive proofs in this dataset to train a self-supervised proof encoder and show that when using it to measure classical proofs, three complementary measurements (anomaly score, reconstruction loss, and density-superlevel containment) exhibit a common decline with the proof’s distance from the axiom in the dependency graph, from sharp separation at the shallow boundary (AUC 0.847 at distance 2) to indistinguishability at distance 9+. Robustness controls show that the signature survives length, file, author, and topic controls, and replicates under full-source encoders trained on normalised proof source. Operationally, we show that on an evaluation sample of 251 Mathlib theorems, Lean’s `aesop` tactic solves constructive theorems at $13\times$ the rate of classical ones, and a neural-guided hybrid using the ReProver tactic generator compresses the gap to $5\times$. The geometric anomaly score predicts `aesop` failure beyond proof length, providing an operational link between the geometric signature and prover performance.

[AI-177] KernelSight-LM: A Kernel-Level LLM Inference Simulator

Link: https://arxiv.org/abs/2606.28565
Authors: Xiteng Yao, Taeho Kim, Hengzhi Pei, Xinle Liu, Kyle Ulrich, Leonard Lausen, Ashish Khetan, Xiang Song, George Karypis, Martin Herbordt
Subjects: Performance (cs.PF); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments:

Click to view abstract

Abstract:As large language models (LLMs) move into production serving, practitioners must rapidly evaluate inference performance across diverse hardware, models, and serving parameters to meet cost and latency targets. However, the end-to-end behavior of LLMs couples serving-layer policies with low-level GPU kernel execution and rapidly evolving architectures, forcing slow, deployment-specific benchmarking that is hard to generalize. We present KernelSight-LM, a fine-grained inference simulator that models token-level execution and produces kernel-level latency breakdowns. It decomposes each serving step into a roofline kernel model with a learned efficiency term, a communication model, and a host-overhead model, composed through a discrete-event scheduler that also captures mechanisms like prefix caching and continuous batching. KernelSight-LM offers two prediction tiers that trade target-GPU data for accuracy. The cross-generation tier uses no target-GPU measurements, only hardware specifications and kernel microbenchmarks from previously profiled GPUs, and predicts per-kernel latency on an unseen GPU generation to 12.1% error, a 1.8x improvement over the roofline baseline (22.0%). A second target-measured tier adds one model-agnostic kernel-microbenchmark sweep on the target GPU, sharpening per-kernel error to 3.8%, a 7.3x improvement over a comparable baseline (27.7%). Both tiers require far less target-GPU data than the prior systems they extend. In our simulator, these predictions yield end-to-end median (p50) errors across six model families of 15.4%, 12.8%, and 3.0% (TTFT, TPOT, throughput) in the cross-generation tier and 14.3%, 6.2%, and 2.7% in the target-measured tier, matching dedicated profiling tools while collecting far less on-device data. Beyond prediction, its kernel-level bottleneck breakdowns support hardware/software co-design and capacity planning. 
Cite as: arXiv:2606.28565 [cs.PF], https://doi.org/10.48550/arXiv.2606.28565. Submitted (v1) by Xiteng Yao on Fri, 26 Jun 2026 19:43:38 UTC.
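
A roofline kernel model with an efficiency term can be sketched as follows. The roofline part (latency bounded by the slower of compute and memory traffic) is the standard formulation; dividing by a single scalar `efficiency` is an assumption about how a learned correction might enter, not KernelSight-LM's actual model, and all parameter names are illustrative.

```python
def roofline_latency_s(flops, bytes_moved, peak_flops, peak_bandwidth, efficiency=1.0):
    """Estimate kernel latency in seconds from a roofline model.

    flops          -- floating-point operations the kernel performs
    bytes_moved    -- bytes read from and written to device memory
    peak_flops     -- hardware peak throughput in FLOP/s
    peak_bandwidth -- hardware peak memory bandwidth in bytes/s
    efficiency     -- hypothetical fraction of the roofline achieved, in (0, 1]
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    compute_time = flops / peak_flops
    memory_time = bytes_moved / peak_bandwidth
    ideal = max(compute_time, memory_time)  # the slower resource is the bound
    return ideal / efficiency
```

For example, a kernel with 2e12 FLOPs and 1e9 bytes on hardware with 1e14 FLOP/s and 1e12 B/s is compute-bound at about 0.02 s; an efficiency of 0.5 doubles that estimate.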

[AI-178] IMCBench: A benchmark for multimodal LLMs in Image-grounded Medical Conversations ECML PKDD 2026

Link: https://arxiv.org/abs/2606.28556
Authors: Maria Xenochristou, Ashutosh Joshi, Korosh Vatanparvar, Mohammad Abuzar Hashemi, Prasad Kasu, Deepak Bansal, Anchal Nema, Nivedita Wadhwa, Prashams S Jain, Rebecca Abraham, Will Kimbrough, Dilek Hakkani-Tur, Wilko Schulz-Mahlendorf
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted at ECML PKDD 2026. 22 pages, 2 figures

Click to view abstract

Abstract:Recent advances in large language models and vision-language models have enabled reasoning over multimodal data, offering opportunities for clinical applications such as decision support and triaging. However, existing medical AI benchmarks are fragmented: some support multi-turn dialogues but lack images, while others provide multimodal inputs but focus on single-turn QA tasks. To address this gap, we introduce IMCBench, an image-grounded, multi-turn medical conversation benchmark that pairs real, publicly available clinical images with synthetic patient profiles to simulate realistic patient-clinician interactions. Each conversation is evaluated across three clinical dimensions: safety, accuracy, and appropriate use of uncertainty in diagnosis. We benchmark eight multimodal frontier models across four model families (Claude, GPT, Nova, and Llama), scoring each on a 1-5 scale using LLM-as-Jury scoring calibrated against expert clinician annotations. Our results show that Claude Opus 4.6 achieves the highest overall score (3.61), followed by Claude Sonnet 4.6 (3.30) and GPT-5.2 (3.29), though no model dominates all dimensions and safety degrades for both malignant and rare conditions ($\Delta = -0.27$ each). Ablation studies further reveal that both visual input and EHR context contribute to safe guidance (safety drops of 0.18 and 0.23 on average when each is removed), with stronger models leveraging visual features more effectively. Together, these findings demonstrate that accurate clinical description does not guarantee safe patient guidance, motivating the need for multi-dimensional evaluation frameworks in medical AI.

[AI-179] A Gravitational Interpretation of Fine-Tuning Reversion

Link: https://arxiv.org/abs/2606.28525
Authors: Samuele Poppi, Nils Lukas
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 21 pages, 9 figures, 4 tables

Click to view abstract

Abstract:Fine-tuning on harmless data can partially undo behaviors acquired earlier in training. Safety can erode under benign post-alignment updates, unlearned capabilities can re-emerge, latent traits can transfer through apparently unrelated supervision, and related post-alignment fragility appears in other generative settings. We argue these phenomena are usefully viewed through a common training-history lens. Our hypothesis is geometric: large early training phases create dominant behavioral manifolds, while later alignment or specialization phases are shallower displacements from them. Subsequent fine-tuning can therefore inherit a persistent reversion component pointing back toward a witness of the dominant manifold. We call this the gravitational interpretation of fine-tuning reversion. Across our main settings, representational drift rapidly acquires a component along a history-defined reversion direction (v_rev). In our main track, alignment with v_rev rises from cos = 0.429 +/- 0.052 after the first update to 0.647 +/- 0.021 by step 20. Across 24 run-step pairs, every observed alignment exceeds the p99 of an isotropic activation-space null. We demonstrate that selectively blocking motion along v_rev changes the final alignment at T=100 from 0.648 +/- 0.009 to -0.211 +/- 0.021 and reduces harmfulness from 19.0% +/- 4.0% to 8.5% +/- 1.5% with little task cost. These results support v_rev as a causally relevant mediator of early post-alignment reversion in our setup. Importantly, we do not claim that v_rev is the unique safety direction, nor that the dominant manifold is directly observed; rather, we identify a robust, history-defined direction that explains and partially controls early reversion dynamics.
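
The two measurements in this abstract, alignment with a direction and blocking motion along it, can be sketched with plain vector arithmetic. Orthogonal projection is used here as one simple way to "block" a direction; it is an assumption, and since the paper reports a final alignment of -0.211 rather than zero, its intervention is evidently not identical to this sketch. Function names are illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def block_direction(update, v_rev):
    """Remove the component of `update` that lies along `v_rev`."""
    scale = dot(update, v_rev) / dot(v_rev, v_rev)
    return [u - scale * r for u, r in zip(update, v_rev)]


update = [3.0, 4.0]
v_rev = [1.0, 0.0]
blocked = block_direction(update, v_rev)
```

Here `cosine(update, v_rev)` is 0.6 before blocking, and `blocked` is `[0.0, 4.0]`, which has zero alignment with `v_rev`.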

[AI-180] TUA-Bench: A Benchmark for General-Purpose Terminal-Use Agents

Link: https://arxiv.org/abs/2606.28480
Authors: Shoufa Chen, Luyuan Wang, Xuan Yang, Zhiheng Liu, Yuren Cong, Yuanfeng Ji, Feiyan Zhou, Xiaohui Zhang, Fanny Yang, Belinda Zeng
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Website: this https URL

Click to view abstract

Abstract:As large language models and harness frameworks continue to advance, agents operating in terminals are increasingly capable of performing a broader range of general computer-use tasks beyond coding. However, existing benchmarks do not adequately evaluate general-purpose terminal computer-use agents (TUAs): general computer-use benchmarks primarily target graphical user interfaces (GUIs), whereas terminal-based benchmarks largely emphasize technical and programming-centric workflows historically native to the shell. We introduce TUA-Bench, a general-purpose benchmark for terminal-use agents. TUA-Bench includes 120 real-world tasks across five task families, covering routine digital activities (including document editing, email management, and live-web information seeking) as well as scientific and engineering workflows co-designed with PhD-level domain experts that require specialized software. This breadth distinguishes TUA-Bench from prior shell-focused or domain-specific benchmarks. Each task is manually designed, runs in a real terminal with a deterministic setup script, and is evaluated by an execution-based scoring protocol. We find that the strongest frontier agent, Claude Code with Claude Opus 4.8 at max reasoning effort, achieves 65.8% overall performance, with substantial gaps across both tracks. By providing a broad and realistic evaluation of terminal-use capabilities, TUA-Bench aims to accelerate the transition from narrow, task-specific assistants to general-purpose agents capable of operating reliably across diverse digital environments.

[AI-181] Decomposing Memorization Reduction in Privacy-Preserving Fine-Tuning of SLMs for CSIRTs

Link: https://arxiv.org/abs/2606.28479
Authors: Cristhian Kapelinski, Diego Kreutz
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 7 tables, 51 references. Accepted for publication at the 36th Brazilian Conference on Intelligent Systems (BRACIS 2026)

Click to view abstract

Abstract:CSIRTs increasingly fine-tune language models on vulnerability scan records, but these records expose internal network topology and create privacy risks under regulations such as GDPR and LGPD. We present the first empirical study of how DP-SGD and HMAC pseudonymization interact when fine-tuning small language models with 1B to 3B parameters on structured CSIRT data. We evaluate 96 LoRA adapters across four SLMs and four training regimes, including raw fine-tuning, QLoRA with large-batch training, and DP-SGD with epsilon equal to 2 and 8. We also audit memorization using 20 planted canaries, four extraction attacks, and a dual attack targeting HMAC-pseudonymized identifiers. Our results show three main findings. First, matched-update controls reproduce the observed reduction in memorization by reducing the number of optimizer updates alone, accounting for 66 percent to 132 percent of the measured effect, with a mean of 100 percent across three seeds and four models. In this setting, DP-SGD provides the formal privacy guarantee but does not produce additional measurable reductions in memorization. Second, HMAC pseudonymization removes the original identifiers from the exposure surface, reducing exposure by 40 percent to 61 percent, while pseudonymized identifiers remain close to the expected random baseline and do not become a secondary memorization target. Third, F1 scores remain between 0.19 and 0.28 across all 96 adapters using four-shot prompting, indicating that, under the evaluated training budget, 1B to 3B SLMs do not achieve operationally useful performance.
MSC classes: 68T01; ACM classes: I.2. Cite as: arXiv:2606.28479 [cs.CR], https://doi.org/10.48550/arXiv.2606.28479
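
HMAC pseudonymization of identifiers such as IP addresses can be sketched with Python's standard `hmac` and `hashlib` modules. The choice of SHA-256, the truncation length, and the function name are assumptions for illustration; the abstract does not state which parameters the authors used.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, key: bytes, length: int = 16) -> str:
    """Replace an identifier with a keyed, deterministic pseudonym.

    The same (identifier, key) pair always yields the same token, so records
    remain linkable, but the original value cannot be recomputed without the key.
    """
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]  # truncated hex string used as the pseudonym


key = b"example-secret-key"  # hypothetical key; keep real keys out of source code
token = pseudonymize("10.0.12.7", key)
```

Using a keyed HMAC instead of a plain hash matters for small identifier spaces like IPv4 addresses, where unkeyed hashes can be reversed by enumerating every possible input.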

[AI-182] Improvement of Robots Simultaneous Localization and Mapping Using an Effective Transformation to Achieve Linear Model

Link: https://arxiv.org/abs/2606.28475
Authors: Seyed Farzad Bahreinian, Maziar Palhang, Mohammad Reza Taban, Hasan Enami Eraghi
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Nowadays mobile robots have wide engineering applications. Simultaneous localization and mapping (SLAM) is an important task of these robots. The major and common algorithms used for this task are based on the extended Kalman filter (EKF). One of the main problems in EKF-based SLAM is its divergence. The nonlinearity of the motion and observation models and the resulting linearization error are the main reasons for the divergence. There have been some efforts to address this problem, with limited success. In this paper, by applying a simple compass and using an effective transformation, we transform the non-linear state space model into a linear model. Then, by applying the original KF to this model, we reach a new method, which is called LMKF SLAM. We show that LMKF SLAM is significantly superior to the state-of-the-art methods, especially EKF-based SLAMs, in accuracy, convergence, and computational complexity. The proposed method is also more stable with respect to the uncertainty of sensor values and changes in system parameters. Experimental results verify these points.
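
Once the model is linear, the ordinary Kalman filter applies without any Jacobians or linearization error. The sketch below shows one predict/update cycle for a scalar linear model (x' = x + u with process noise q, measurement z = x with noise r). It is a generic textbook filter with illustrative names, not the LMKF SLAM transformation itself, which the abstract does not spell out.

```python
def kf_step(x, p, u, z, q, r):
    """One predict/update cycle of a scalar linear Kalman filter.

    x, p -- prior state estimate and its variance
    u    -- control input (known displacement)
    z    -- measurement of the state
    q, r -- process and measurement noise variances
    """
    # Predict: the motion model is linear, so the update is exact.
    x_pred = x + u
    p_pred = p + q
    # Update: blend prediction and measurement using the Kalman gain.
    gain = p_pred / (p_pred + r)
    x_new = x_pred + gain * (z - x_pred)
    p_new = (1.0 - gain) * p_pred
    return x_new, p_new


x_new, p_new = kf_step(x=0.0, p=1.0, u=1.0, z=1.2, q=0.1, r=0.1)
```

The updated estimate lies between the prediction (1.0) and the measurement (1.2), and the posterior variance is smaller than the predicted variance.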

[AI-183] Data and Evaluation Closed-Loop for Model Capability Enhancement

Link: https://arxiv.org/abs/2606.28471
Authors: Zhixuan Li, Jiangan Yuan, Han Xu
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Model capability is the central variable in LLM pre-training, yet is never observed directly: data shapes it prospectively, while evaluation reveals it only retrospectively, compressing samples, prompts, decoding, and scoring rules into one noisy score. Practical optimization runs this backward: a failure is observed first, and the engineer must infer the corpus fix. The two sides speak incompatible vocabularies – benchmark names and per-sample correctness versus data sources, domains, and quality labels – so this inference is usually intuition, not method. We close this gap with the *capability slice*: a group of evaluation samples sharing background condition, task type, solving operation, and output constraint – precise enough to localize a single weakness yet stable enough to survive aggregation, unlike a benchmark name, too coarse, or a single sample, too noisy. Built around this unit, an evaluation taxonomy, a non-instruction data taxonomy, and mapping rules form a closed loop turning a benchmark-level failure into a targeted, testable data intervention. We test this loop on two case studies pulling in opposite directions. First, the loop rules the data out: continued pre-training drives BBH down by -46.82%, but diagnosis traces this to a single masked `<EOS>` loss rather than weakened reasoning; restoring it recovers BBH to 66.44, above the original checkpoint, without changing the data. Second, the loop rules the data in: a persistent math-reasoning weakness is decomposed by solving operation into specific failing combinations, and a weakness-targeted sampling procedure built from it lifts AIME2025/AIME2026 Pass@128 from 6.67 / 0.00 to 26.67 each. The same unmodified loop reaches opposite, correct verdicts in both cases, showing the evaluation-to-data inference can be routine, auditable, and experimentally validated rather than intuitive.

[AI-184] Counterfactual Residual Data Augmentation for Regression ICML2026

Link: https://arxiv.org/abs/2606.28460
Authors: Hossein Mohebbi, Oliver Schulte, Ke Li, Pascal Poupart
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the 43rd International Conference on Machine Learning (ICML 2026). 25 pages, 8 figures. Project page: this https URL

Abstract:Data-driven modeling in real-world regression tasks often suffers from limited training samples, high collection costs, and noisy observations. Inspired by the impact of data augmentation in vision and language, we propose a novel Counterfactual Residual Data Augmentation (CRDA) technique for tabular regression. Our key insight is that once a regressor has modeled the systematic component of the data, the remaining noise can be viewed as an invariant residual that remains stable under small perturbations of carefully selected features. We exploit this residual invariance to generate new, yet realistic, training samples, effectively expanding the dataset without requiring additional real data. Our method is model-agnostic and readily applicable to various types of regressors. In experiments across datasets from a variety of benchmark repositories, on average, CRDA reduces an MLP Regressor’s MSE by 22.9% and an XGBoost Regressor’s MSE by 6.4%. When compared to existing state-of-the-art data generators and augmentation techniques, CRDA consistently outperforms in MSE reduction. By adding principled counterfactual variations to the training data, our method offers a simple and efficient remedy for noise-prone, small-sample regression settings.
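
The residual-invariance idea in the abstract can be sketched roughly as follows: fit a regressor, keep each sample's residual, perturb a few chosen features slightly, and set the new target to the prediction at the perturbed point plus the original residual. This is a minimal sketch, not the authors' implementation: the least-squares regressor, the caller-supplied `cols`, and the `scale` parameter are all assumptions, since the abstract does not say how features or perturbation sizes are selected.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares linear regressor with intercept; returns a predict function."""
    A = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)

    def predict(Z):
        return np.column_stack([Z, np.ones(len(Z))]) @ w

    return predict

def residual_augment(X, y, predict, cols, scale=0.05, copies=1, seed=0):
    """Create synthetic samples that reuse each row's residual.

    cols  : indices of features to perturb (assumed given by the caller)
    scale : noise std as a fraction of each feature's std (hypothetical knob)
    """
    rng = np.random.default_rng(seed)
    residual = y - predict(X)              # part the regressor did not explain
    noise_std = scale * X[:, cols].std(axis=0)
    X_new, y_new = [], []
    for _ in range(copies):
        Xp = X.copy()
        Xp[:, cols] += rng.normal(0.0, noise_std, size=(len(X), len(cols)))
        X_new.append(Xp)
        y_new.append(predict(Xp) + residual)  # systematic part moves, residual stays
    return np.vstack(X_new), np.concatenate(y_new)
```

The augmented rows would then be appended to the training set before refitting; any regressor exposing a predict function could stand in for `fit_linear`.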

[AI-185] Event-Conditioned Diagnostics of Kinematic Contact and Object-Permanence Fields in Passive Object-State World Models

Link: https://arxiv.org/abs/2606.28455
Authors: Yang Liu, Yuming Chen
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:World models can predict future physical states, but prediction accuracy alone does not explain how physical information is organized and used inside their latent dynamics. We introduce a controlled diagnostic protocol for studying event-conditioned latent physical structure in passive object-state world models. The protocol tests whether hidden representations encode event-regime information, whether event contexts reweight non-exclusive physical field readouts, and whether field-aligned representational components have functional consequences for prediction. Using a balanced controlled-generator dataset with free-motion, collision, and occlusion events, we evaluate recurrent, attention-based, and latent state-space transition models under a fixed-horizon forecasting setup. The models learn useful predictive dynamics and their hidden states support reliable event-regime readout. Event contexts systematically reweight kinematic, contact, and object-permanence field readouts: free motion is kinematic-dominant, collision combines kinematic and contact structure, and occlusion combines motion-related and object-permanence structure. Time-aligned and directional-consistency analyses further show phase-related shifts in field emphasis. Finally, fixed-horizon projection causal field effect (CFE) shows that suppressing field-aligned directions can degrade event-relevant prediction, with strongest evidence for contact-aligned structure in collision-contact windows and more qualified evidence for object-permanence-aligned structure in hard-occlusion hidden windows. These results support event-conditioned organization and fixed-horizon functional sensitivity of latent physical fields, while not implying explicit physical modules, isolated causal circuits, or context-invariant sliding-window generalization.

[AI-186] LLM agents security duality: a comprehensive survey of self-security and empowered cybersecurity

Link: https://arxiv.org/abs/2606.28450
Authors: Yiwei Xu, Yong Zhuang, Xuanming Liu, Tian Zhang, Bowen Xiao, Xiaoyang Xu, Delong Jiang, Juan Wang, Hongxin Hu
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 73 pages, 12 figures, 9 tables, Artificial Intelligence Review

Abstract:Large language model (LLM) agents are rapidly being integrated into real-world systems. Their autonomy and tool-use capabilities generate substantial value while simultaneously expanding the security attack surface. This survey provides a comprehensive overview of the opportunities and challenges of LLM agents in security, focusing on two core areas: (1) threats to LLM agents themselves and corresponding mitigation strategies (LLM agents self-security), and (2) the role of LLM agents in empowering the cybersecurity lifecycle across offense and defense (LLM agents empowered cybersecurity). We first examine the internal and external attack surfaces of agents, propose a taxonomy organized by threat sources, and analyze associated mitigations and evaluation frameworks. We then investigate how agent capabilities are applied in cybersecurity practice and present, to our knowledge, the first agent-empowerment framework aligned with the full cyber offense-defense lifecycle. By systematically surveying these two areas, we are the first to highlight a positive feedback synergy between LLM agents self-security and empowered cybersecurity, offering new insights for the advancement of both. We further identify current limitations and outline promising directions for future research. The insights provided aim to catalyze the coordinated development of LLM agents self-security and agent empowered cybersecurity, paving the way for more capable and robust agent applications.

[AI-187] S-GAI: Spectral Geometry-Aware Initialization for Sigmoidal MLPs – From Dataset Geometry to Network Weights

Link: https://arxiv.org/abs/2606.28444
Authors: Yi-Shan Chu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:Classical universal approximation theorems establish the expressive power of sigmoidal multilayer perceptrons, but they do not prescribe how initial weights should encode the geometry of a data distribution. We propose S-GAI, a spectral geometry-aware initialization framework for one-hidden-layer sigmoidal MLPs. Starting from the constructive idea that sigmoid units can act as smooth half-space gates, we move from hand-specified planar geometry to class-wise spectral geometry estimated from image data. For each class, SVD provides a mean, principal directions, and spectral scales. An energy threshold selects the retained directions, and each retained direction is represented by two sigmoid gates. These class-specific gates form a shared hidden layer initialized directly from the training set. We also formulate an SVD-based subspace classifier as a non-neural geometric reference, which tests whether the estimated spectral class geometry is already discriminative before being embedded into the MLP. Experiments on MNIST, Fashion-MNIST, and a more challenging CIFAR-10 test show that the S-GAI-initialized MLP starts from a substantially more informative hidden state than Xavier initialization and reaches comparable final accuracy under full training. When the hidden layer is frozen, training only the output layer still gives stronger performance than frozen random gates, providing evidence that S-GAI effectively embeds class-wise spectral geometry into the MLP.
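
One way to read the construction above (class mean, principal directions, energy threshold, two sigmoid gates per retained direction) is sketched below. The specific gate parameterization is an assumption: each retained direction `v` gets a pair of opposing half-space gates that are both open when a point lies within one spectral scale of the class mean along `v`. The `gain` parameter and the choice of scale (singular value divided by sqrt(n-1)) are hypothetical and may differ from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def class_gates(Xc, energy=0.9, gain=1.0):
    """Build paired sigmoid-gate weights for one class from its SVD geometry."""
    mu = Xc.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc - mu, full_matrices=False)
    # Keep the fewest directions whose cumulative energy reaches the threshold.
    frac = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = min(int(np.searchsorted(frac, energy)) + 1, len(s))
    V = Vt[:k]
    scale = s[:k] / np.sqrt(max(len(Xc) - 1, 1))  # std along each direction
    proj_mu = V @ mu
    # Gate pair per direction: sigmoid(gain*(+v.(x-mu)+scale)) and
    # sigmoid(gain*(-v.(x-mu)+scale)) (assumed form).
    W = np.concatenate([gain * V, -gain * V])
    b = np.concatenate([gain * (scale - proj_mu), gain * (scale + proj_mu)])
    return W, b

def init_hidden_layer(X, y, energy=0.9, gain=1.0):
    """Stack the gates of all classes into one shared hidden layer."""
    parts = [class_gates(X[y == c], energy, gain) for c in np.unique(y)]
    W = np.concatenate([w for w, _ in parts])
    b = np.concatenate([v for _, v in parts])
    return W, b

def hidden(X, W, b):
    """Hidden activations of the initialized layer."""
    return sigmoid(X @ W.T + b)
```

The hidden width is then determined by the data: twice the number of retained directions, summed over classes.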

[AI-188] PLAA: Packet-level Adversarial Attacks in Network Traffic Detection

Link: https://arxiv.org/abs/2606.28439
Authors: Jinhao You, Zan Zhou, Shujie Yang, Yi Sun, Lei Zhang, Changqiao Xu
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep neural networks (DNNs) are widely applied in Network-based Intrusion Detection System (NIDS) due to their high accuracy. However, DNNs are highly susceptible to adversarial attacks, which generate malicious traffic to evade NIDS detection. Existing approaches often adapt adversarial attacks from computer vision (CV) tasks to the NIDS domain, overlooking the fundamental differences between CV and NIDS. This results in two major issues: 1) The generated network traffic may become invalid, 2) The generated traffic may lose its original attack semantics. To address these issues, this paper proposes an adversarial attack specifically designed for NIDS. Instead of directly generating flow-level features, our approach incrementally generates packet-level features to construct adversarial traffic. During the generation process, the semantic integrity of the traffic is monitored at each stage, effectively avoiding the issues of invalid traffic and semantic loss observed in existing methods. We evaluate our attack algorithm against current NIDS models using the CIC-UNSW-NB15, CIC-DDoS2019, and CIC-IDS-2017 datasets. The proposed method achieves an average evasion success rate of 92.78%, while ensuring that the generated adversarial traffic remains semantically consistent with the original malicious traffic.

[AI-189] When AI Reviews Its Own Code: Recursive Self-Training Collapse in Code LLMs

Link: https://arxiv.org/abs/2606.28438
Authors: Xinyuan Song, Zekun Cai, Liang Zhao
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Under Review

Abstract:Recursive self-training can degrade neural generative models when generated data is reused without fresh human data or external quality control. We study this risk in code LLMs, where AI-generated code can enter real repositories, later become training data, and create a repository-scale self-training loop. While software development traditionally interrupts this loop through pull-request review, tests, compilation, and human approval, AI coding tools now produce code faster than humans can review it, and code review itself is increasingly automated by AI systems. We therefore compare three recursive fine-tuning regimes: no review, Human-gate review using model-independent filters such as compilation and static quality checks, and AI-self-gate review using the code LLM’s own signals such as perplexity and binary self-scoring. Across multiple code LLMs and benchmarks, no review collapses fastest, Human-gate filters slow but do not stop collapse, and AI-self-gate filters can look strong early but later lose their filtering effect. In the clearest case, the binary self-gate enters a rubber-stamp regime where acceptance scores rise while benchmark correctness falls. We explain this behavior by formulating review as gated distributional reweighting, proving that AI self-gating degenerates to ungated self-training under a self-confirming acceptance condition, and giving a spectral analysis of representation-level covariance concentration under recursive retraining. These results suggest that stable recursive code LLM training requires exogenous verification rather than model-coupled self-review.

[AI-190] Dockerless: Environment-Free Program Verifier for Coding Agents

Link: https://arxiv.org/abs/2606.28436
Authors: Wenhao Zeng, Yuling Shi, Xiaodong Gu, Chao Hu, Chaofan Wang, Yuhao Cui, Hongting Zhou, Mengnan Qi, Jianqiao Wangni, Zhaojian Yu, Shuzheng Gao, Kai Cai, Shilin He
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Program verifiers play a central role in training coding agents, including selecting trajectories for supervised fine-tuning (SFT) and providing rewards for reinforcement learning (RL). Standard execution-based verification requires running unit tests inside per-repository environments such as Docker images, incurring substantial environment setup costs. We propose Dockerless, an environment-free agentic patch verifier that evaluates generated code patches without executing them. Rather than simply matching candidate patches to references, Dockerless judges patch correctness using evidence gathered through agentic repository exploration. On a verifier evaluation benchmark, Dockerless outperforms the strongest open-source verifier by 14.3 AUC points. Using Dockerless as both the SFT trajectory filter and the RL reward enables a fully environment-free post-training pipeline. The resulting model reaches 62.0%, 50.0%, and 35.2% resolve rate on SWE-bench Verified, Multilingual, and Pro, respectively. It surpasses the Qwen3.5-9B baseline by 2.4, 8.7, and 2.9 points, matching environment-based post-training.

[AI-191] SWE-MeM: Learning Adaptive Memory Management for Long-Horizon Coding Agents

Link: https://arxiv.org/abs/2606.28434
Authors: Shuzheng Gao, Wenhao Zeng, Zhaojian Yu, Jianqiao Wangni, Chaozheng Wang, Kai Cai, Shilin He, Michael R. Lyu
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Long-horizon software engineering agents often need to manage lengthy and noisy interaction histories under limited context budgets. Existing memory management methods typically rely on static compression workflows or impose rigid constraints on compression timing and granularity. Moreover, these approaches fail to jointly optimize memory management and issue resolution capabilities to improve performance while reducing token usage. We present SWE-MeM, a training framework for proactive and on-demand memory management in software engineering agents. SWE-MeM provides a flexible memory tool that lets agents decide when, what, and how to compress based on trajectory state, task progress, and remaining context budget. We train agents with synthesized proactive memory-management trajectories and Memory-aware GRPO, which jointly optimizes memory management and issue resolution through memory-aware trajectory splitting and step-level credit assignment. On SWE-Bench Verified, SWE-MeM achieves 43.4% and 60.2% resolve rate with 4B and 30B models, respectively, outperforming existing memory management baselines in both performance and efficiency.

[AI-192] Building to the Test: Coding Agents Deliver What You Check Not What You Requested

Link: https://arxiv.org/abs/2606.28430
Authors: Yanuo Ma, Ben Kereopa-Yorke, Ben Schultz
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 27 pages (9 main + 14 appendix), 2 figures, 5 tables

Abstract:Benchmarks are widely used to evaluate task completion by Large Language Models (LLMs), but this approach has accumulated construction-validity problems, and a passing score may not show whether the requested task was delivered. We study both problems. In a controlled code-as-spec setup, two production Copilot CLI agents (claude-opus-4.7, gpt-5.5) re-implement a React Fluent-UI data table in Angular as a reusable library under a hidden 222-test Playwright oracle across 18 runs and three oracle-availability conditions. Alongside the score, we run a mechanical library audit and check each verdict with a no-op ablation. Without the oracle, the library is present but unfinished, revealed by scores. With the oracle in the loop, the score reaches near-perfect, but from a demo holding the tested behavior directly, the library left dead or absent. We call this building to the test; the broader disposition behind both we call validation self-awareness. The agent does not, on its own, validate what it ships as a user would. Prevalence remains an open question across other agents, signals, and model families. Beyond benchmark scores, dispositions like validation self-awareness merit research attention.

[AI-193] Tool Use Enables Undetectable Steganography in Multi-Agent LLM Systems

Link: https://arxiv.org/abs/2606.28425
Authors: Jimmy Laurence Rippin, Simon C. Marshall, David Demitri Africa, Christian Schroeder de Witt
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Increasingly autonomous agentic AI systems pose novel multi-agent risks, such as secret collusion via covert communication channels. The natural defence to these collusion attempts is to monitor plain-text communication, but the efficacy of monitors has been called into doubt by increasingly sophisticated model steganography; indeed, some theoretical schemes have been proposed that are information-theoretically or computationally indistinguishable from good-faith plain-text communication. In this paper, we demonstrate that the complexity of these schemes is no longer a safety barrier, as agentic coding models can already produce undetectable stegosystems when given realistic tool usage, such as code execution or accessing research papers through web searches. Agents also adapt when key ingredients are missing, for example, by adding model-sampling components or implementing related keyed coding schemes. We then frame tacit steganographic coordination between agents as a Schelling-point problem and introduce coordination metrics for estimating when two agents are likely to select compatible schemes without explicit prior agreement. Our results suggest a shift in the threat model for covert communication between AI agents, where the main barrier is no longer whether frontier agents can understand and implement sophisticated stegosystems, but coordination: whether independently acting agents can converge on compatible schemes, keys, and parameters. We find substantial convergence on broad scheme families but limited strict one-shot coordination, suggesting that shared artefacts, repeated interaction, and tool-mediated search are the settings where covert communication risks are most acute. Overall, our findings provide empirical grounding for the recent strategic confinement hypothesis, which assumes that capable agents can construct covert channels that survive monitoring.

[AI-194] Evidence-Driven LLM Agent for C-to-Synthesizable-C Conversion and Verification

Link: https://arxiv.org/abs/2606.28409
Authors: Zhe Zhao, Hongbing Lang, Zhihan Xiao, Luke Ztz Hu, John Imoleayo Adebisi, Songping Mai
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: 14 pages, 8 figures, submitted to IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD)

Abstract:Software-compilable C programs routinely fail to complete the four-stage pipeline of a high-level synthesis (HLS) toolchain – compilation, C simulation (CSim), synthesis, and C/RTL co-simulation (CoSim) – because HLS accepts only a synthesizable subset of C (HLS-C). Yet most existing large language model (LLM) systems built for HLS code repair only cover the early pipeline stages and feed raw tool logs directly to the model, yielding brittle and hard-to-reproduce fixes. We formulate C-to-HLS-C conversion as a closed-loop generation-verification-diagnosis-repair problem on an HLS tool (Xilinx Vitis), contributing three components: an end-to-end workflow of cooperating agents closed by the four-stage verifier under strict evidence isolation; a Progressive Mismatch Localization Chain (PMLC) that localizes CSim/CoSim mismatches through log normalization, AST backward slicing, and dual-trace instrumentation; and a typed-query, two-stage evidence RAG backed by a self-evolving, family-routed repair-card pool. Experimental results show that the proposed workflow substantially outperforms all comparable state-of-the-art models.

[AI-195] Financing Artificial Intelligence Infrastructure: Mapping AI Infrastructure Investment and Compute Governance Across Africa

Link: https://arxiv.org/abs/2606.28404
Authors: Kai-Hsin Hung, Sumaya Nur Adan, Krupa Suchak, Armita Sadeghian Barzoki, Kofi Yeboah, Mohammad Amir Anwar
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 14 pages; two figures. Currently under review at Data and Policy journal

Abstract:Artificial intelligence depends on large-scale compute resources and their supporting infrastructure. However, AI governance debates treat compute primarily as a technical input rather than as an outcome of investment, ownership, and financial control. This paper examines AI infrastructure investment flows across Africa through a systematic analysis of 46 publicly announced projects totalling USD 12.7 billion between 2019 and 2025. Using a value chain framework, we analyze who invests in AI-relevant infrastructure and where investments concentrate. Our findings reveal a highly concentrated landscape dominated by global data center operators, hyperscale technology firms, and development finance institutions, clustering in South Africa, Kenya, Nigeria, and Egypt. We introduce asymmetrical interdependence to describe a structural condition in which capital and physical infrastructure account for 73% of total funding while control remains concentrated in the compute layer among a small number of global technology firms. We argue that compute governance must account for capital flows, ownership, and control, not only geographic access, because these dynamics shape AI compute equity. Infrastructure presence is necessary but insufficient for meaningful governance capacity.

[AI-196] Reinforcement Learning for Software Vulnerability Analysis: A Systematic Review with Emphasis on C/C++ Source Code and Static Analysis

Link: https://arxiv.org/abs/2606.28403
Authors: Bruno Caro-Vásquez, Carola Figueroa-Flores, Gastón Marquez
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 15 pages, 1 figure, 7 tables. Submitted to CIARP 2026

Abstract:Vulnerability detection in C/C++ software remains a major security challenge due to code complexity, manual memory management, and the limitations of traditional static analysis. Reinforcement Learning (RL) has emerged as a promising approach, particularly for fuzzing, test generation, program exploration, and, more recently, vulnerability detection and localization. Following PRISMA 2020 guidelines, this work reviews RL techniques for software vulnerability analysis, focusing on C/C++ source code and static analysis. We identified 21 primary studies published between 2015 and 2026 from major scientific databases and complementary searches. We analyze the addressed tasks, algorithms, state-action-reward-environment formulations, code representations, datasets, and evaluation metrics. Results show that 15 studies focus on fuzzing and guided exploration, only 3 on direct vulnerability detection, and just 1 on statement-level localization. Moreover, statically extracted structural representations such as Control Flow Graphs (CFGs) and Abstract Syntax Trees (ASTs) are rarely used as agent states, and benchmarks lack comparability. We propose a task- and formulation-oriented taxonomy and identify a key research gap: the absence of RL agents that use source-code CFGs as states to detect and localize vulnerable nodes.

[AI-197] RoboGaze: Evaluating Robot World Models via Structured Vision-Language Analysis

Link: https://arxiv.org/abs/2606.28385
Authors: Minh-Loi Nguyen, Nghiem Tuong Diep, Hung Khang Nguyen, Minh Le, Doanh Le Thien, Hoang H. Tran, Dung D. Le, Vu N. Duong, Daniel Sonntag, An Thai Le, Duy Minh Ho Nguyen, Vien Anh Ngo, Tran Van Nhiem
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: First version 29 pages, 7 figures. Project webpage: this https URL

Abstract:Recent advances in robot world models enable synthetic video generation for embodied prediction and planning. However, evaluating these videos is challenging: visually realistic outputs often violate physical laws, temporal consistency, or task logic, while conventional metrics and monolithic Vision-Language Model (VLM) judges fail to generalize or provide precise diagnostic value. We present RoboGaze, a training-free, multi-agent VLM framework that provides structured, interpretable evaluation for generated robot-manipulation videos. Given a task instruction and video, RoboGaze operates via a three-stage pipeline: task-scene grounding, dimension-specific specialist routing, and critic-based verification. It outputs temporally localized glitch reports categorized under a novel 6-dimension, 30-type robotics-specific taxonomy. To benchmark RoboGaze, we introduce a human-validated dataset of 382 clips spanning simulated and real-world multi-view manipulation. Evaluating eight open-source and proprietary VLM backbones, RoboGaze dramatically outperforms zero-shot baselines, improving description-F1 by up to +43 points and temporal alignment (F1 x IoU) by up to +37 points, closing approximately 85% of the gap to the human ceiling. Furthermore, its critic verifier mitigates the “cry-wolf” false-positive flaw of standard VLMs, lifting clean-clip accuracy from under 25% to over 80%. RoboGaze offers a scalable, highly interpretable diagnostic tool for the rigorous evaluation of robot world models.

[AI-198] A Query-Driven Communication-Efficient Digital Twins Design for Autonomous Driving

Link: https://arxiv.org/abs/2606.28384
Authors: Nuocheng Yang, Longyu Zhou, Sihua Wang, Changchuan Yin, Tony Q. S. Quek
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 14 pages, 9 figures

Abstract:Digital twins (DTs) have become a potential technology to perform risk-free simulation of physical entities for deterministic and high-reliability services in diverse scenarios such as autonomous driving and low-altitude economy. In the autonomous driving scenario, traditional DT methods that rely solely on vehicle’s real-time state synchronization, however, might lead to unacceptable computing and communication consumption for construction of high-fidelity DT with redundant data. To address this issue, we first propose a query-driven DT architecture to enable the DT to actively request the desired environment data from vehicles based on its simulation result. Then, we formulate an optimization problem whose goal is to minimize autonomous driving position error while accounting for DT fidelity and communication constraints. We also design a cross-time-step progressive query mechanism to further improve communication efficiency. The simulation results show that our proposed method achieves a 24% reduction in planning position error compared to traditional methods, while reducing communication overhead by 40%.

[AI-199] Evolutional Math: Cross-Validated Island-Model Genetic Programming for Interpretable Symbolic Regression on Small Wide Datasets

Link: https://arxiv.org/abs/2606.28381
Authors: Artem Andrianov (Cyntegrity Germany GmbH, Hofheim am Taunus, Germany)
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: 12 pages, 3 figures. Reference implementation available at this https URL

Abstract:Symbolic regression via genetic programming routinely fails on small, wide datasets - a regime common in clinical-trial monitoring, biostatistics, and engineering pilot studies - by converging on bloated, overfit expressions that exploit correlation rather than prediction. We present Evolutional Math, an open-source genetic programming system that combines four design choices to yield compact, interpretable formulas in this regime. First, fitness is measured by R-squared on held-out cross-validation folds rather than Pearson correlation on the training set, eliminating single-variable shortcuts that correlate but mis-scale. Second, a multi-island architecture runs independent populations seeded with distinct operator subsets (algebraic, logarithmic, trigonometric, and full) with ring-topology migration every M generations, preventing the search from collapsing into one region of formula space. Third, a structural deduplication scheme treats formulas differing only in constants as equivalent, so the elite archive contains structurally distinct candidates rather than near-duplicate variants. Fourth, top-k individuals undergo numerical constant refinement via scipy L-BFGS-B after each migration phase, decoupling structure search from parameter fitting. We evaluate the system on synthetic benchmarks of the form log(x_i) * x_j / (x_k * c), trigonometric mixtures, and an anonymized clinical site-monitoring dataset with 24 rows and approximately 290 candidate numeric features. The system consistently recovers compact ground-truth structures with R-squared at or above 0.99 within tens of thousands of unique formula evaluations. A reference implementation is released under a noncommercial source-available license.
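
Two of the design choices above lend themselves to a small illustration: why R-squared penalizes a mis-scaled shortcut that Pearson correlation rewards, and how formulas differing only in constants can be collapsed to one structural key. This is a sketch under stated assumptions, not the released system: formulas are represented as plain strings, the regex-based `structure_key` is a hypothetical stand-in for the paper's deduplication scheme, and no cross-validation folds or constant fitting are shown.

```python
import re
import numpy as np

def pearson(y_true, y_pred):
    """Pearson correlation: invariant to scale and offset of the prediction."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])

def r_squared(y_true, y_pred):
    """Coefficient of determination: penalizes wrong scale or offset."""
    ss_res = float(np.sum((y_true - y_pred) ** 2))
    ss_tot = float(np.sum((y_true - np.mean(y_true)) ** 2))
    return 1.0 - ss_res / ss_tot

# Numeric literal not glued to an identifier (so the 1 in "x1" is left alone).
_CONST = re.compile(r"(?<![\w.])\d+\.?\d*(?:[eE][-+]?\d+)?")

def structure_key(formula):
    """Replace numeric constants with a placeholder and drop whitespace."""
    return re.sub(r"\s+", "", _CONST.sub("C", formula))

# Toy data: the true relation is y = 3x + 1, the shortcut predicts y = x.
x = np.linspace(1.0, 5.0, 24)
y = 3.0 * x + 1.0
shortcut = x

print("pearson  :", round(pearson(y, shortcut), 3))
print("r_squared:", round(r_squared(y, shortcut), 3))
print(structure_key("log(x1) * x2 / (x3 * 2.5)"))
```

The shortcut correlates perfectly with the target yet has a negative R-squared, which is the failure mode the held-out R-squared fitness is meant to rule out.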

[AI-200] Distilling a Modular Reservoir Through a Genomic Bottleneck

Link: https://arxiv.org/abs/2606.28380
Authors: Mani Hamidi, Sina Khajehabdollahi, Charley M. Wu, Emmanouil Giannakakis
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments:

Abstract:The intricate structures of biological neural networks largely emerge during development, guided by a comparatively compressed blueprint encoded in the genome. The connectivity that emerges from this decoding process is rich in structure, and already equips the organism with functional modules upon birth. This initial structure serves as a scaffold that can be gradually refined and fine-tuned through lifelong experience, via a variety of plasticity mechanisms. Drawing inspiration from this interaction between evolutionary and developmental modes of learning, we use hypernetworks to learn a compressed generative process that generates the connectivity of a modular reservoir. We show that this marriage between curriculum-based meta-learning and modular reservoir computing can generate sparse recurrent networks that solve difficult temporal tasks with minimal training and without concessions to robustness.

[AI-201] Recursive Self-Evolving Agents via Held-Out Selection

Link: https://arxiv.org/abs/2606.28374
Authors: Michael Nguyen, Quoc Nguyen, Paul Vuong
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:LLM agents are increasingly improved without weight updates by evolving a natural-language artifact, such as reflections, workflows, playbooks, cheatsheets, or optimized prompts, that conditions a frozen policy. Such methods are typically reported as wins on the single benchmark where they help. We study them apples-to-apples and surface a sharper picture. We introduce RSEA, a Recursive Self-Evolving Agent that carries a compact three-layer natural-language state: an imperative strategy, reusable skills, and a procedural playbook. Across generations, RSEA rewrites all three layers from its own trajectories and commits a candidate only if it does not regress on a disjoint held-out split, using a strict keep-better gate. Across four diverse benchmarks, ALFWorld, GAIA, τ-bench, and WebShop, and six faithful baselines, ReAct, Reflexion, GEPA, AWM, ACE, and Dynamic Cheatsheet, all evaluated on one shared local backbone, we find three main results. First, no artifact universally wins. RSEA is the strongest single-pass method on ALFWorld, reaching 69.3% compared with 64.6% for ReAct (McNemar p = 0.015), and reaches 79.4% with retry, the best overall result. However, concrete-workflow induction, represented by AWM, is best on the strong-backbone tool-use tasks. Second, unguarded context evolution is high-variance and unsafe. Dynamic Cheatsheet, which curates context online without a held-out gate, is near-best on ALFWorld at 70.7%, yet collapses on WebShop, with a score of 0.14 compared with 0.43 for ReAct. Third, RSEA’s strict held-out selection is what makes recursive self-evolution monotone-safe: it never significantly underperforms the base agent on any benchmark and falls back to vanilla ReAct when evolved context would hurt.
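
The held-out gate described in the abstract can be sketched as a small selection loop. This is an illustrative outline only: `propose` and `score_heldout` are hypothetical callables standing in for the agent's rewrite step and its evaluation on the held-out split, and the `strict` flag reflects an ambiguity in the abstract ("does not regress" versus "strict keep-better"), so both readings are offered.

```python
def evolve_with_gate(state, propose, score_heldout, generations=5, strict=True):
    """Keep a rewritten state only if it passes the held-out gate.

    propose(state, generation) -> candidate state   (hypothetical rewrite step)
    score_heldout(state)       -> float             (hypothetical held-out metric)
    strict=True  accepts only candidates that score strictly higher;
    strict=False also accepts ties.
    """
    best, best_score = state, score_heldout(state)
    history = [best_score]
    for g in range(generations):
        candidate = propose(best, g)
        score = score_heldout(candidate)
        accepted = score > best_score if strict else score >= best_score
        if accepted:
            best, best_score = candidate, score
        # A rejected candidate leaves the committed state untouched.
        history.append(best_score)
    return best, history
```

Because a candidate is never committed when it scores below the current state, the committed held-out score is non-decreasing across generations, and if every candidate is rejected the loop returns the initial state unchanged.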

[AI-202] Model Merging to Evolution: Parameter Space Exploration for Expert Models

Link: https://arxiv.org/abs/2606.28373
Authors: Chao Wang, Yuchen Guo, Zheng Tan, Guanchun Wang, Yanbiao Ma, Qiqi Duan, Peng Wu
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: Chao Wang and Yuchen Guo contributed equally

Abstract:Model merging integrates the capabilities of multiple expert models to create strong models for multiple tasks without additional training, thereby reducing computational resource requirements. However, existing methods operate within the convex combination space of expert models, failing to explore high-performance regions outside this space. This paper proposes the MERGEvolve framework, which unifies model merging and evolution within an evolution strategy by treating the merged model as the initialization for evolutionary exploration of the parameter space. During the merging phase, expert models act as deterministic sources to build a strong initial point. The evolution phase then explores the parameter space using random noise. Theoretical analysis shows that MERGEvolve explores regions outside the convex combination space. Extensive experiments on single-task and multi-task benchmarks demonstrate that MERGEvolve consistently achieves performance competitive with advanced model merging baselines. Ablation studies confirm that a high-quality initial point is critical for efficient exploration of the parameter space.
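
The two-phase idea above (merge first, then explore with random noise) can be illustrated on a toy problem. This is a minimal sketch, not MERGEvolve itself: the uniform weighted average, the simple elitist evolution loop, and the parameters `sigma`, `pop`, and `steps` are assumptions chosen for brevity, and the two-dimensional "experts" are made up.

```python
import numpy as np

def merge(experts, weights=None):
    """Convex combination of expert parameter vectors (uniform by default)."""
    experts = np.asarray(experts, dtype=float)
    if weights is None:
        weights = np.full(len(experts), 1.0 / len(experts))
    return np.asarray(weights, dtype=float) @ experts

def evolve(theta0, fitness, sigma=0.1, pop=16, steps=50, seed=0):
    """Elitist Gaussian search started from theta0 (assumed ES variant)."""
    rng = np.random.default_rng(seed)
    best, best_f = theta0.copy(), fitness(theta0)
    for _ in range(steps):
        candidates = best + sigma * rng.normal(size=(pop, best.size))
        scores = np.array([fitness(c) for c in candidates])
        i = int(scores.argmax())
        if scores[i] > best_f:          # keep the parent unless a child is better
            best, best_f = candidates[i], scores[i]
    return best, best_f

# Toy setup: the optimum (1, 1) lies off the segment between the two experts,
# so no convex combination of them can reach it.
experts = [[0.0, 0.0], [2.0, 0.0]]
target = np.array([1.0, 1.0])

def fitness(theta):
    return -float(np.sum((theta - target) ** 2))

merged = merge(experts)
best, best_f = evolve(merged, fitness)
print("merged:", merged, "fitness:", fitness(merged))
print("evolved:", best, "fitness:", best_f)
```

In this toy case any improvement over the merged point necessarily moves off the expert segment, which mirrors the abstract's claim about leaving the convex combination space.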

[AI-203] Agentic Safety is an Epistemic Property Not a Behavioral One ICML2026

链接: https://arxiv.org/abs/2606.28347
作者: Charles L. Wang,Keir Dorchen,Peter Jin
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: To appear in proceedings of ICML 2026

点击查看摘要

Abstract:Contemporary AI safety spans pre-training interventions, post-training alignment, deployment-time controls, monitoring, and red-teaming. These methods are necessary, but they primarily certify snapshots of system behavior. As AI systems become more capable, dynamic, embodied, and self-improving, this snapshot view becomes incomplete: safety depends not only on whether a system behaves acceptably now, but whether it remains correctable as it learns, adapts, acts, and modifies itself over time. This paper argues that safety should therefore be treated as an epistemic property of the evolving learner, not merely a behavioral property of the current policy. We introduce teachability as the capacity to preserve future corrective leverage under bounded human, institutional, or environmental intervention. We argue that advanced systems can retain visible competence while eroding the representational, algorithmic, or meta-decision conditions needed for future correction. Safe advanced AI systems must not only behave acceptably now; they must remain teachable later.

[AI-204] Multi-Agent DRL for QoS and Energy Optimization in RIS-Enabled Open-RAN Industrial 6G TN/NTN Networks

链接: https://arxiv.org/abs/2606.28339
作者: Marwan Dhuheir,Thang X. Vu,Symeon Chatzinotas
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: IEEE ICC-2026 accepted Workshop paper

点击查看摘要

Abstract:Industrial 6G networks require ultra-reliable, low-latency, and energy-efficient connectivity in dynamic and blockage-prone environments, where conventional terrestrial deployments often fail to ensure stable coverage. Hence, in this paper, we propose a RIS-enabled Open-RAN framework for integrated terrestrial/non-terrestrial (TN/NTN) industrial 6G networks, in which UAV-mounted reconfigurable intelligent surfaces (RISs) cooperate with ground radio units and a high-altitude platform (HAP) to enhance connectivity for dense industrial IoT devices. Owing to the high dimensionality and strong coupling among decision variables, conventional optimization techniques become computationally intractable. To overcome this limitation, the joint optimization problem of data rate, latency, and energy consumption is formulated as a decentralized partially observable Markov decision process (Dec-POMDP) and solved using a multi-agent deep reinforcement learning framework. Simulation results show improvements of up to 75% in data rate, 25% latency reduction, and 16% energy savings compared with state-of-the-art learning-based and non-RIS baselines, demonstrating the effectiveness of RIS-assisted Open-RAN intelligence for industrial 6G networks.

[AI-205] Ground Truths in Suicide Research: The Current State of AI-Based Suicide Detection in Social Media

链接: https://arxiv.org/abs/2606.28334
作者: Yaakov Ophir,Ofri Hefetz,Refael Tikochinski,Kfir Bar,Shir Lissak,Shulamit Grinapol,Haya Wachtel,Eyal Fruchter,Roi Reichart
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in artificial intelligence (AI) and social media data have led to growing optimism about the ability to detect suicide risk at scale. However, the empirical foundations of this work remain unclear. This article provides a synthesis of current research on AI-based suicide detection in social media, drawing on a recent umbrella review of 22 systematic reviews covering studies up to 2022, alongside an ongoing literature review extending the analysis to more recent work. Across these sources, we identified 195 relevant studies, which are documented in a detailed supplementary dataset outlining their key characteristics and findings (see Supplementary Information). Analysis of these studies reveals consistent patterns, including rapid growth, concentration on a small number of platforms, reliance on textual and English-language data, and repeated use of similar datasets. Most importantly, the majority of studies rely on indirect labeling strategies that do not involve direct, individual-level validation of suicide risk. Instead, ground truth is typically inferred from observable features of online content, such as linguistic markers or community membership. As a result, the predictive task often shifts from identifying individuals at risk to classifying posts that contain suicidal or distress-related language, limiting the ability of current approaches to detect individuals who do not express such content explicitly online. These findings suggest that current advances in model performance should be interpreted with caution. Progress in this field is likely to depend less on improving model performance and more on ensuring that model predictions meaningfully correspond to suicide risk as it is experienced in real life.

[AI-206] Insidious by Design: Implications of Large Language Model algorithmic bias for the Global South

链接: https://arxiv.org/abs/2606.28333
作者: Sioux McKenna,Nompilo Tshuma
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 16 pages

点击查看摘要

Abstract:The biases in Large Language Models’ (LLMs) outputs remain inadequately theorised, particularly from the perspective of the Global South. This article reports on a small-scale exploratory study in which identical prompts were submitted to four major LLMs (ChatGPT, Claude, Grok, and Copilot), firstly, prompting for stories using names suggestive of specific racial and gender communities, and secondly asking questions about ‘development’. Drawing on critical AI scholarship and postcolonial theory, we argue that LLM outputs are patterned in ways that reproduce racial hierarchies, gender asymmetries, and Western-centric epistemic frameworks. We argue that these biases are insidious: they operate below the threshold of both obvious error and overt prejudice, and instead are subtly embedded in narrative structure and emotional template. Simply put, women in LLM narratives have rich interior lives, while men make plans. Black people face hardships while white people navigate the world with agency. And explanations as to the economic world order fail to consider Southern explanations. The models perform plausibility while reproducing dominance. We conclude that universities require structural critique of these technologies rather than unreflective adoption, and that critical AI literacy must engage seriously with questions of whose knowledge systems are reproduced and legitimated, or marginalised and undermined.

[AI-207] When Medical Safety Alignment Fails: A Benchmark for Evaluating LLMs on High-Risk Medical Queries

链接: https://arxiv.org/abs/2606.28332
作者: Yige Li,Jun Sun,Wei Zhao,Zhe Li,Yutao Wu,Hanxun Huang,Xiang Zheng,Xingjun Ma
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 18 pages

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used for medical and health-related questions, yet their safety in high-risk medical scenarios remains poorly understood. We introduce MedHarm, a high-risk medical safety benchmark with 1,100 medically grounded queries across 10 safety-critical categories, including toxicology, pharmacology, covert poisoning, anesthesia, and fetal harm. Unlike broad medical QA benchmarks, MedHarm targets realistic clinical, educational, and technical prompts that require refusal, caution, or safe redirection rather than direct helpfulness. We evaluate 15 LLMs spanning general-purpose, medical-purpose, closed-source, and downstream SFT models, together with 4 representative guardrail models. Results reveal a substantial gap between apparent alignment and medical safety: aligned models can still produce unsafe or actionable responses, medical fine-tuning can amplify harmful specificity, and external guardrails reduce some failures while introducing brittle blocking and weak safe helpfulness. These findings show that medical safety cannot be inferred from general alignment or medical capability alone, highlighting the need for domain-specific stress testing before deploying LLMs in safety-critical medical applications. (Footnote: Code and data will be released upon acceptance. Due to the sensitive nature of high-risk medical queries, data access will be available to qualified researchers upon request.)

[AI-208] “AI Watermarking”: Bridging Policy Discourse and Technical Capabilities

链接: https://arxiv.org/abs/2606.28331
作者: Andrés Fábrega,Arkaprabha Bhattacharya,Miranda Christ,Sunoo Park
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The widespread deployment of generative artificial intelligence (AI) models has raised serious concerns about the proliferation of AI-generated content. This has led to a surge of interest in, and demand for, reliable tracking and detection mechanisms for content that is AI-generated, such as watermarking, metadata tagging, content tagging, and more. The problem has captured the attention of policymakers as well as the popular media, and a spate of recent bills in the US have sought to regulate the spread of AI content, and enforce or promote methods to track and label it. This work performs a critical analysis of the policy discourse surrounding generative AI content transparency in the US and EU. Through a broad document selection methodology, we first collect a broad corpus of documents containing legislative language and policy-relevant discourse on the topic. We then analyze these through inductive coding, and leverage our coding to systematize these documents, identifying key patterns, gaps, and open questions. We identify critical points of disconnect between policy and technological capabilities and practice, and we highlight and discuss potential ambiguities and pitfalls raised by the trends in our corpus.

[AI-209] It Lied to a Doctor to Buy Poison Ingredients: Quantifying Real-World Misuse of Phone-use Agents

链接: https://arxiv.org/abs/2606.27944
作者: Yiming Sun,Chen Chen,Zifan Zhou,Mi Zhang
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: work in progress

点击查看摘要

Abstract:Phone-use Agents can execute complex tasks end to end across real mobile applications. By operating a real device on the user’s behalf, they reach far more functionalities than CLI agents, which amplifies the real-world harm they can cause when driven for malicious purposes. We present the first study of this threat on real phones and 27 commercial apps, and find that agents built on 9 mainstream commercial and open-source models readily carry out serious misuse, ranging from procuring drug and explosive precursors to fraud, online harassment, and review manipulation. Across the agents we run on real devices, the average refusal rate to harmful requests stays low while the average task-completion rate reaches 68.8%, and in some scenarios an agent finishes a violation faster than a human would. These results suggest that Phone-use Agents already meet the practical conditions for automated misuse at scale. In one observed real-device execution, Claude-Opus-4.8 fabricated a medical history, deceived an online doctor into issuing a prescription, and completed the order and payment on its own to purchase a precursor for a highly toxic substance. To our knowledge, this is the first documented real-world case of an AI agent procuring controlled precursor materials. We trace this behavior to a Safety Awareness-Execution Gap, where an agent recognizes that a request is harmful yet still executes it. Simple defenses curb the overt cases, but the more covert and arguably more damaging threats, such as coordinated review manipulation and fake traffic, remain largely unsolved. We hope these findings push the community toward safer Phone-use Agents.

[AI-210] Optimization Dynamics Imprint Semantic Specificity in Contrastive Embedding Norms

链接: https://arxiv.org/abs/2606.30625
作者: Ziwei Su,Junyu Ren,Victor Veitch
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Contrastive embedding models trained with scale-invariant losses are typically paired with distance metrics like cosine similarity, effectively ignoring embedding magnitudes. However, surprisingly, empirical studies reveal that despite this, these “discarded” norms seem to correlate with semantic properties such as concept specificity, token frequency, and human uncertainty. In this work, we provide a formal theoretical framework explaining this phenomenon. By analyzing the optimization dynamics, we derive an analytic formula demonstrating that embedding length naturally encodes this information as a byproduct of the training process. We also show how this gives rise to signals that can serve as “free” calibration tools in specific models and retrieval tasks, providing a grounded explanation for a previously heuristic observation.

[AI-211] Collective cooperation without individual fidelity in LLM agents

链接: https://arxiv.org/abs/2606.30454
作者: Henrique Ferraz de Arruda,Carlos Gracia Lázaro,Alberto Aleta,Yamir Moreno
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used as agents in simulations of social systems, yet it remains unclear when their behavior can be interpreted as a faithful proxy for human decision-making. Here we test LLM agents against a direct empirical benchmark: a large-scale networked Prisoner’s Dilemma experiment with human participants. Using the same interaction protocol, payoff structure, and network topologies, we compare nine open-weight LLMs with the human data. The selected model reproduces several macro-level features of cooperation dynamics, including the early decline and later stabilization of cooperation. This aggregate agreement, however, does not extend uniformly to finer levels of behavior. LLM populations underestimate individual-level heterogeneity and generate conditional cooperation patterns that differ from those observed in humans. Adding a fraction of random agents improves some aspects of micro-level agreement, but does not remove the mismatch in decision rules. These findings reveal a macro–micro dissociation in LLM-based social agents: collective outcomes can appear human-like even when the underlying behavioral distributions and mechanisms are not. They suggest that validating LLM agents as human surrogates requires comparisons across aggregate dynamics, individual heterogeneity, and context-dependent decision rules, rather than outcome-level agreement alone.

[AI-212] A Stochastic–Geometric Theory of Scaling Laws in Grokking

链接: https://arxiv.org/abs/2606.30388
作者: Róisín Luo,Christian Gagné,Jonas Ngnawé,Ihsan Ullah,Karyn Morrissey
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: v1

点击查看摘要

Abstract:Delayed generalization (i.e., grokking) refers to the phenomenon in which a neural network fits its training data early in training but only begins to generalize after a prolonged delay, often through an abrupt transition. Despite extensive empirical study, its underlying mechanism remains poorly understood. In this work, we first theoretically characterize a shell–core topological configuration of the reachable solution space induced by Adam’s optimization dynamics with weight-shrinkage regularization, supported by empirical evidence. This optimization-induced topological configuration gives rise to grokking. In the model’s parameter space, random initialization solutions concentrate on a thin outer spherical shell, enclosing another spherical shell of memorization solutions, which in turn contains a core corresponding to the generalization solutions. Leveraging stopping-time theory, we then analyze the geometry of this topological configuration and the solution transition time at which optimization trajectories escape the memorization manifold and first reach the boundary of the generalization manifold. Our theoretical analysis derives grokking scaling laws for the learning rate, batch size, and $\ell_2$ regularization coefficient, which are further validated through experiments and shown to recover results from prior literature.

[AI-213] Physically-Constrained Harmonic Separation for Robust Heart and Respiratory Rate Estimation from Wrist Photoplethysmography

链接: https://arxiv.org/abs/2606.30156
作者: Nouhaila Fraihi,Ouassim Karrakchou,Mounir Ghogho
类目: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted for presentation at the 48th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (IEEE EMBC 2026), Toronto, Canada, July 26-30, 2026

点击查看摘要

Abstract:Wrist-worn photoplethysmography (PPG) enables continuous monitoring of cardiopulmonary physiology, but reliable heart rate (HR) and respiratory rate (RR) estimation in free-living conditions remains challenging due to non-stationary motion artifacts that spectrally overlap with physiological dynamics. Existing signal-processing methods degrade under strong motion, while unconstrained deep learning approaches often lack physiological interpretability and identifiable structure. We propose a Physically-Constrained Harmonic Separation (PCHS) framework that formulates HR and RR estimation from wrist PPG as an analysis-by-synthesis problem, where accelerometer measurements condition artifact separation rather than directly regressing vital signs. A physics-guided harmonic generator decomposes the observed signal into quasi-periodic physiological components and a motion-related residual, enabling HR recovery from the fundamental frequency and RR prediction from respiratory-driven modulations of the harmonic parameters. Robust reconstruction objectives, separation constraints, and uncertainty-aware weighting stabilize the decomposition under motion. Experiments on the motion-intensive PPG-DaLiA dataset demonstrate that PCHS outperforms state-of-the-art methods while yielding interpretable signal decompositions that effectively disentangle physiological activity from motion artifacts.

[AI-214] Gravitational Duals from Equations of State II: Large Hierarchies and False Vacua

链接: https://arxiv.org/abs/2606.30117
作者: Raul Jimenez,David Mateos,Pavlos Protopapas,Pau Solé-Vilaró,Pedro Tarancón-Álvarez,Pablo Tejerina-Pérez
类目: High Energy Physics - Theory (hep-th); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); General Relativity and Quantum Cosmology (gr-qc)
备注: 33 pages, 12 figures

点击查看摘要

Abstract:We investigate the reconstruction of holographic duals for strongly coupled quantum field theories in regimes characterized by large hierarchies and the presence of false vacua. Within the gauge/gravity duality, these features translate into non-trivial thermodynamic behaviour and exotic renormalization group flows, including skipping flows between non-adjacent fixed points. Building on previous work based on Physics-Informed Neural Networks (PINNs), we extend the holographic inverse problem of reconstructing the bulk scalar potential from boundary thermodynamic data into this new regime. This setting presents a variety of conceptual and numerical challenges, such as near-degenerate states, large hierarchies of energy scales, and regions of the potential that are not directly probed by the input data. We develop a set of methodological advances that overcome these obstacles, thereby improving the established PINNs-based methodology and extending it to new physical regimes of interest that were previously out of reach. Applying the developed framework, we demonstrate accurate reconstruction of scalar potentials deep into the false vacuum regime, achieving robust agreement with the physical features of the underlying thermodynamics despite significant numerical stiffness. Our results extend the bridge between holography and machine learning, and suggest that data-driven approaches can provide new insights into the structure of strongly coupled systems.

[AI-215] RiverONE: Generating Knowledge-Intensive VLM by Simulated Quantum Machines

链接: https://arxiv.org/abs/2606.29966
作者: Xindian Ma,Xinyu Long,Yefei Zhang,Yanchen Liu,Xianghao Li,Yufu Wen,Yike Hu,Yuedong Zhu,Zeyang Ma,Wen Qin,Yikun Wang,Peng Yang,Monan Wang,Teng Yu
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注: 20

点击查看摘要

Abstract:Quantum computing provides a powerful paradigm for representing and transforming high-dimensional information through superposition, entanglement, and measurement-induced nonlinear features. While current quantum hardware is not yet practical for direct large-scale vision-language model (VLM) inference, simulated quantum computation can be used during model construction to generate structured parameters for compact classical AI systems. We build RiverONE, a lightweight vision-language model for quantum calibration plot understanding, using simulated quantum computation. It employs a specialized visual encoder and an InternVL-based language backbone. To compensate for compression-induced information loss, we introduce quantum-generated parameters, which are materialized as classical tensors after training. This allows RiverONE to run entirely on classical GPUs at inference time, with no quantum hardware or runtime quantum simulation. With approximately 1.9 billion parameters, RiverONE achieves at least 95% of the performance of NVIDIA Ising Calibration 1 on quantum calibration plot understanding tasks while using less than 10% of its parameter count. These results suggest that simulated quantum computation can serve as a practical construction-stage mechanism for building lightweight, knowledge-intensive scientific VLMs. Our code is available at this https URL.

[AI-216] Data-Efficient Multimodal Alignment for Histopathology-based Molecular Prediction

链接: https://arxiv.org/abs/2606.29949
作者: Dominik Winter,Dominik Vonficht,Loïc Le Bescond,Christian Gebbe,Marco Rosati,Richard J. Chen,Markus Schick,Ross Stewart,Nicolas Brieu
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
备注: 10 pages, 4 figures

点击查看摘要

Abstract:HE-stained whole-slide images offer cohort-scale availability and rich spatial context but lack molecular specificity, whereas bulk RNA-seq provides transcriptome-wide resolution at high cost with limited archival availability. We show that training a lightweight alignment module atop frozen histopathology and RNA-Seq foundation models enables open-vocabulary molecular prompting – querying HE slides with gene-set signatures to predict pathway activity without sequencing or end-to-end retraining. Using contrastive learning on a multi-cancer cohort (N=1,720), we achieve a 25-fold improvement in retrieval over baseline methods. Systematic analysis reveals a graduated predictability spectrum: morphologically grounded programs (cell-cycle programs, immune-related) are most reliably predicted (R^20.5), while predicting pathways with no morphological footprint remains challenging as expected. We validate clinical utility on the POSEIDON clinical trial: HE-predicted squamous cell carcinoma scores recapitulate NSCLC subtype identity and predicted IFN-gamma mirror PD-L1 tumor-cell expression groups. Furthermore, genesets describing immune activation and fibrosis predict known tumor microenvironment archetypes from histology alone. We further validate generalization of our approach across unseen cohorts and demonstrate data-efficient domain adaptation, establishing a slide-native framework for molecular analysis on HE images.

[AI-217] Semi-Supervised Sound Event Detection with Conditional Mixup and Embedding-Level Contrastive Loss

链接: https://arxiv.org/abs/2606.29901
作者: Nian Shao,Xian Li,Xiaofei Li
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注: 6 pages; accepted by SMC 2026

点击查看摘要

Abstract:Sound event detection (SED) is a core module for acoustic environmental analysis, yet its performance is often limited by scarce labeled data. Recent systems leverage large pretrained audio foundation models, but effective fine-tuning remains challenging because labeled data are limited while unlabeled data are abundant. A previous work, ATST-SED, addressed this problem with a pseudo-label based semi-supervised fine-tuning framework. In this work, we further improve the framework by adopting an embedding-level self-supervised contrastive loss inspired by ATST-Frame pretraining. This contrastive objective better exploits unlabeled data during fine-tuning. One challenge is that mixup serves different roles in the two objectives: pseudo-label learning uses composition mixup, while contrastive learning treats mixup as a perturbation. To resolve this mismatch, we propose conditional mixup, which combines composition mixup and perturbation mixup in one semi-supervised framework and defines the corresponding embedding-level contrastive losses. The resulting model achieves 0.645 PSDS1 and 0.822 PSDS2 on the DESED validation set, establishing a new state of the art.

[AI-218] HERO: Improving the Reliability and Sensitivity of Generative Model Evaluation Using Historical Data

链接: https://arxiv.org/abs/2606.29784
作者: Xinrui Ruan,Zhenyu Zhao,Waverly Wei,Yueshan Zhang,Zeyu Zheng,Sui Huang,Jingshen Wang
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Econometrics (econ.EM)
备注: 30 pages, 6 figures

点击查看摘要

Abstract:Reliable generative AI models critically rely on expert human annotations to evaluate output quality, yet these “gold” labels are expensive to collect and limited in quantity. Organizations thus often turn to collecting vast but noisy “silver” labels from crowdsourced workers or vendor annotators as proxies for gold labels. Because gold remains the evaluation target, naively aggregating noisy silver labels may introduce bias, and estimators built on sparsely observed gold labels may have high variance to resolve the model performance gaps that guide practical decisions. Model evaluation has become an ongoing operational practice rather than a one-time exercise, with evaluation rounds repeating across model versions, releases, and content domains. A natural question is whether the previous historical evaluation data can be used to improve each new round of evaluation. We introduce HERO (History Enhanced RObust model evaluation), a novel framework that uses historical data to suppress bias (improve reliability) and reduce variance (improve sensitivity) in model performance evaluation. HERO calibrates silver labelers’ performance learned from historical gold annotations, and stabilizes the resulting estimator by anchoring it to covariate information measured with high precision in the historical data. HERO can be broadly applied across multiple common evaluation tasks, and remains valid when only a subset of historical labelers appears in the current round. We establish conditions under which the bias and variance reductions hold, showcase HERO’s performance in simulation studies, and demonstrate its effectiveness on real-world model evaluation benchmarking datasets.

[AI-219] Optimizing Expert-Designed Crystal Graph Networks for Band-Gap Prediction with an Autonomous LLM Research Loop

链接: https://arxiv.org/abs/2606.29717
作者: Chenmu Zhang,Boris I. Yakobson
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Predicting a material’s properties from its structure is a central, fast-advancing problem in computational materials science. A decade of work has produced standard public benchmarks and many published machine-learning models for the task (Dunn et al., 2020). The task’s fixed metric and these baselines make it a natural setting for autonomous agent research (Karpathy, 2026). On the MatBench band-gap benchmark (~100k crystals), a general-purpose coding agent autonomously built the most accurate model trained without external pretraining, ahead of all seventeen expert-designed models reported for the task. A closer analysis shows it reached this by implementing known methods: either already standard in crystal neural-network models, or borrowed from other areas of machine learning. The contributing implementations include element-pair features on each message-passing edge and a crystal space-group embedding. The work not only demonstrates that LLM-agent autonomous research can optimize an expert-designed machine learning model for material property prediction, but also investigates the limitations of such autonomous research.

[AI-220] A Machine-Verified Proof of a Quantum-Optimization Conjecture

链接: https://arxiv.org/abs/2606.29687
作者: Uri Kol,Maor Ben-Shahar,Kfir Sulimany,Dirk Englund
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:We report a machine-verified resolution of a problem open for over a decade in quantum optimization: the Farhi, Goldstone and Gutmann (FGG) conjecture that depth-$p$ Quantum Approximate Optimization Algorithm (QAOA) on the ring of disagrees attains approximation ratio $(2p+1)/(2p+2)$ exactly. We found the proof using a large language model, Claude Fable 5, and verified its correctness end-to-end by the Lean 4 proof assistant. Our methodology includes several ingredients: building on a substantial Lean library of quantum information, we formalized the QAOA components and the known parts of the problem, and reduced the conjecture to a single open mathematical statement. The model was then handed the library and our agentic toolkit, and tasked with closing that gap by constructing a proof in Lean. The resulting process is a feedback loop between the model’s natural-language reasoning and Lean’s mechanical verification, which converged to a machine-verified proof. Human verification is required only for the structural scaffolding - that the formal statement faithfully encodes the intended claim - while the proof itself is supplied by the model and certified mechanically by Lean. The proof is nevertheless striking - the model uncovered a hidden dynamical symmetry of the problem and exploited it, borrowing tools and machinery from an adjacent field to turn a hard existence problem into an explicit construction. This work paves the way for resolving open conjectures in quantum information science and beyond.
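
To make the conjectured ratio concrete, the formula quoted in the abstract can be evaluated at small depths. This is plain arithmetic on that formula, not a step of the paper's proof, and the symbol $r(p)$ is introduced here only for illustration.

```latex
\[
r(p) \;=\; \frac{2p+1}{2p+2} \;=\; 1 - \frac{1}{2p+2},
\qquad
r(1) = \tfrac{3}{4},\quad
r(2) = \tfrac{5}{6},\quad
r(3) = \tfrac{7}{8},
\qquad
\lim_{p\to\infty} r(p) = 1 .
\]
```

The gap to the optimum therefore shrinks as $1/(2p+2)$, so each added layer helps by a smaller amount than the one before.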

[AI-221] Fast Wireless Foundation Models with Early-Exits

链接: https://arxiv.org/abs/2606.29640
作者: Omar Mashaal,Hatem Abou-Zeid
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While wireless foundation models (FMs) are demonstrating strong potential to enable AI-Native 6G networks, their high computational cost remains a critical barrier to deployment. The large computational cost stems from the rigid, full-depth execution of the FM backbone for every task, a process we show is not only inefficient but can also degrade performance on unseen out-of-distribution (OOD) tasks. In this paper, we propose a novel early-exit FM framework that attaches lightweight, per-task heads at the most appropriate exit stage of a frozen wireless FM encoder, enabling variable-depth inference tailored to each task’s preferred representation depth. Our results demonstrate that these intermediate-layer features not only speed up inference significantly (up to 93% fewer FLOPs), but also provide more transferable representations that exceed the full encoder accuracy on unseen tasks. We further demonstrate that a simple fixed-exit strategy per task is more effective than traditional early-exiting policies that route different samples to different exits based on their perceived difficulty levels.

[AI-222] A Posteriori Error Analysis for Decoupled Neural Approximations of Fully Coupled FBSDEs with Control Mismatch

Link: https://arxiv.org/abs/2606.29474
Authors: Xichuan Zhang
Subjects: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Comments:

Abstract:This paper develops an a posteriori error analysis framework for decoupled neural approximations of fully coupled forward–backward stochastic differential equations (FBSDEs). It provides an a posteriori error-analysis for the idealized discrete adapted trajectory. The main feature of the proposed formulation is the use of an auxiliary control process in the forward coefficients, which may differ from the backward component approximated by the neural network. This decoupling is useful in practical deep learning implementations, but it creates a control mismatch that must be included in the error analysis. We first establish a continuous-time stability estimate for fully coupled FBSDEs under perturbations of the drift, diffusion, generator, terminal condition, and auxiliary control input. We then transfer this estimate to the discrete-time setting and derive computable a posteriori error bounds depending only on the terminal defect, the pathwise residual, and the control mismatch. When the auxiliary control is identified with the backward approximation, the mismatch term vanishes and the bound reduces to the standard two-term form. Numerical experiments on a linear–quadratic FBSDE with an explicit reference solution and a multidimensional Burgers-type FBSDE without a reference solution illustrate the diagnostic role of the proposed indicators and the contribution of the mismatch penalty to the consistency and reproducibility of the numerical approximations.

[AI-223] Self-Organized Conformal Prediction: Reducing Regional Coverage Gaps with Unsupervised Group Discovery

Link: https://arxiv.org/abs/2606.29403
Authors: Louis Berthier, Ahmed Shokry, Maxime Moreaud, Guillaume Ramelet, Aymeric Dieuleveut
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Conformal prediction guarantees marginal coverage, but pooled calibration averages over heterogeneous regions and can mask regional undercoverage in safety-critical subgroups. We introduce Self-Organized Conformal Prediction (SOCP), a calibration scheme that discovers input-space groups with a Self-Organizing Map (SOM) and, at test time, draws a local calibration buffer from the query’s best-matching unit (BMU) cell or a fixed grid neighborhood. The same retrieval rule applies to regression and classification tasks across tabular features and image embeddings, leaving the predictor and nonconformity score untouched. SOCP gives exact validity for BMU-cell retrieval and fixed retrieved-set validity for neighborhood buffers; central-cell validity for neighborhood retrieval holds up to a Kolmogorov-Smirnov (KS) bias term. A split-routed extension recovers fixed retrieved-set validity conditional on the routing split. On eight regression and classification benchmarks, SO-SCP reduces the weighted regional coverage gap on 7/8 datasets (mean paired change -7.1%) for a mean prediction-set size increase of 6.2%, with negligible overhead on the largest six datasets; SO-CQR yields smaller gains, since quantile regression already absorbs much of the heterogeneity. By learning groups directly from the input geometry, SOCP provides group-local calibration with exact fixed-group guarantees and approximate central-cell guarantees, without supervised partitions or predictor retraining.
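
The contrast between pooled and group-local calibration described in this abstract can be illustrated with a minimal split-conformal sketch. It is not the SOCP method itself: the SOM and BMU retrieval are replaced by hypothetical integer cell ids, the nonconformity scores are synthetic, and the helper names `conformal_quantile` and `group_local_radius` are invented for this illustration.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Finite-sample split-conformal quantile of calibration scores."""
    n = len(scores)
    # Use the score of rank ceil((n + 1) * (1 - alpha)); if that rank
    # exceeds n, no finite radius gives the requested coverage.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return np.inf
    return np.sort(scores)[k - 1]

def group_local_radius(cal_scores, cal_cells, query_cell, alpha):
    """Calibrate only on scores whose (hypothetical) cell id matches the query's."""
    local = cal_scores[cal_cells == query_cell]
    return conformal_quantile(local, alpha)

rng = np.random.default_rng(0)
# Two hypothetical cells with different noise scales (cell 1 is noisier).
cal_cells = rng.integers(0, 2, size=2000)
cal_scores = np.abs(rng.normal(0.0, np.where(cal_cells == 0, 1.0, 3.0)))

pooled = conformal_quantile(cal_scores, alpha=0.1)
local0 = group_local_radius(cal_scores, cal_cells, 0, alpha=0.1)
local1 = group_local_radius(cal_scores, cal_cells, 1, alpha=0.1)
print(pooled, local0, local1)
```

In this toy setup the pooled radius sits between the two local radii, so it is wider than needed in the quiet cell and too narrow in the noisy one, which is the kind of regional coverage gap the abstract refers to.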

[AI-224] Solver-Verified Formulation Generation and Selection for Multi-Warehouse Inventory Allocation Using Large Language Models

Link: https://arxiv.org/abs/2606.29366
Authors: Jintao Xu, Yingzheng Ma, Jiong Dong, Yongzhi Qi, Jianshen Zhang, Dongyang Geng, Anni Zhang
Subjects: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Balance-oriented multi-warehouse inventory allocation is a recurring decision problem in large-scale e-commerce supply chains, in which a fixed replenishment quantity is distributed across warehouses to balance post-allocation inventory coverage while accounting for demand forecasts and heterogeneous allocation constraints. In practice, allocation requirements are often scenario-dependent and expressed in semi-structured or natural-language form rather than as ready-to-solve operations research (OR) formulations. We propose an OR-guided Large Language Model (LLM) for Allocation (ORLA) that uses solver feedback to generate, verify, and select OR formulations. ORLA integrates automatic “Problem-Model-Code (PMC)” generation, learning-based formulation selection, and feasibility restoration. We develop three complementary mixed-integer programming formulation families based on deviation minimization, soft band compliance, and knapsack-inspired allocation, together with solver-ready mixed-integer linear programming reformulations, modular constraint extensions, and a penalty-based relaxation mechanism for infeasible cases. The LLM component generates candidate formulations and executable solver code from textual or semi-structured specifications, while the solver provides verification signals for executability, feasibility, and solution quality. To address instance heterogeneity, ORLA estimates the expected quality of candidate formulations, selects promising candidates, and combines their outputs through score-aware aggregation. Experimental results on 29 production evaluation batches from this http URL show that the best single OR formulation improves allocation accuracy by 3.4 percentage points over the incumbent approach, while the full ORLA framework achieves a 4.5 percentage-point overall improvement and improves allocation accuracy in 26 of the 29 evaluation batches.

[AI-225] Compositional Dynamics in Learning and Mechanics

Link: https://arxiv.org/abs/2606.28984
Authors: David I. Spivak
Subjects: Category Theory (math.CT); Artificial Intelligence (cs.AI)
Comments: 79 pages

Abstract:We give a single compositional setting in which gradient-based learning and Hamiltonian-style mechanics appear as functorial semantics. The syntax is an operad Arr whose objects are input-output interfaces (pairs of manifolds) and whose morphisms are smooth adaptive arrangements, which consist of a reactive parameter space, a lens given by smooth output and input maps, and a real-valued potential. The main technical result of the paper is what we call lens internalization, a lax symmetric monoidal functor Lens(C) \to C associated to any symmetric monoidal closed category C. Using it, we provide two functors \Phi_{\text{phase}}, \Phi_{\text{conf}} : Arr \to PC into the 2-category of polynomial coalgebras – input-output discrete dynamical systems – which we take as the semantics category. \Phi_{\text{phase}} stores both position and momentum, whereas \Phi_{\text{conf}} stores only position. When applied to a parameterized function, \Phi_{\text{conf}} recovers the gradient descent training algorithm, with backpropagation as the lens’ backward pass. When applied to harmonic particles wired together – in series, or according to any finite directed graph – one diagram yields two different regimes, both of which are governed by the graph Laplacian: \Phi_{\text{phase}} gives the discrete wave equation, which is conservative and second-order, and \Phi_{\text{conf}} gives the discrete heat equation, which is dissipative and first-order. They are two semantics of one adaptive arrangement, e.g. with the same potential in each case. And because Arr is an operad, such diagrams nest – larger systems wired from smaller ones – and each semantics assembles a system’s dynamics functorially from its parts. These dynamics are moreover executable: a parameterized neural network and a graph of particles both compile, by the same construction, to explicit state machines one can run.
MSC classes: 18M60, 18M35, 18M05, 18C10, 18B20, 93A13, 37N40, 37J99, 70H05, 70G99, 68T07
Cite as: arXiv:2606.28984 [math.CT], https://doi.org/10.48550/arXiv.2606.28984
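
The two regimes named in the abstract, a first-order dissipative heat equation and a second-order conservative wave equation governed by the same graph Laplacian, can be seen numerically in a small sketch. This is a direct hand-written analogue, not the paper's functors: the path graph, the step size, the symplectic-Euler update, and all function names are assumptions made for illustration.

```python
import numpy as np

def path_laplacian(n):
    """Graph Laplacian L = D - A of a path graph on n nodes."""
    A = np.zeros((n, n))
    idx = np.arange(n - 1)
    A[idx, idx + 1] = 1.0
    A[idx + 1, idx] = 1.0
    return np.diag(A.sum(axis=1)) - A

def potential(L, x):
    """Harmonic potential V(x) = 0.5 * x^T L x; its gradient is L x."""
    return 0.5 * x @ L @ x

def heat_step(L, x, eps):
    """Position-only update: one gradient-descent step on V (discrete heat equation)."""
    return x - eps * (L @ x)

def wave_step(L, x, p, eps):
    """Position-and-momentum update (symplectic Euler): discrete wave equation."""
    p_next = p - eps * (L @ x)
    x_next = x + eps * p_next
    return x_next, p_next

L = path_laplacian(5)
x0 = np.array([1.0, 0.0, 0.0, 0.0, 0.0])

x_heat = x0.copy()
x_wave, p_wave = x0.copy(), np.zeros(5)
for _ in range(200):
    x_heat = heat_step(L, x_heat, eps=0.1)
    x_wave, p_wave = wave_step(L, x_wave, p_wave, eps=0.1)

heat_energy = potential(L, x_heat)                           # decays toward 0
wave_energy = potential(L, x_wave) + 0.5 * p_wave @ p_wave   # stays near its initial value
print(heat_energy, wave_energy)
```

Both updates use the same potential; the only difference is whether momentum is stored alongside position, which mirrors the distinction the abstract draws between the two semantics.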

[AI-226] A Task-Driven and Quality-Assured Agent Framework for SAR Data Generation

Link: https://arxiv.org/abs/2606.28896
Authors: Xuanting Wu, Fan Zhanga, Fei Ma, Ling Guan, Guochun Ma, Yongsheng Zhou
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Synthetic aperture radar (SAR) data augmentation is important for improving the generalization of data-driven SAR interpretation models, yet practical augmentation workflows are often hindered by heterogeneous dataset formats, task-dependent metadata requirements, diverse generation methods, and weak validation of generated samples. This paper presents the SAR Augmentation and Generation Agent (SAGA), a schema-grounded and benefit-aware agent framework for task-oriented SAR data generation and augmentation. Given a natural-language request and heterogeneous SAR inputs, SAGA extracts observable dataset facts, validates executable dataset schemas, selects feasible augmentation strategies through validator-constrained planning, and compiles the selected strategy into an auditable augmentation workflow. Generated data are further assessed by quality, distribution, SAR-artifact, duplicate, leakage, and optional downstream-task evaluators to support evidence-qualified augmentation claims. By separating semantic proposal from deterministic validation and execution, SAGA improves the reliability and reproducibility of SAR augmentation decisions. Experiments on controlled agentic benchmarks and downstream SAR interpretation tasks show that SAGA improves schema grounding, skill planning, invalid-sample rejection, and downstream augmentation utility compared with rule-based, LLM-only, ReAct-style, and fixed-augmentation baselines.

[AI-227] Building AI-Ready Data Systems for Space Life Sciences Aerospace Medicine and Deep Space Exploration

Link: https://arxiv.org/abs/2606.28856
Authors: Sylvain V. Costes, Sergio Garcia Busto, Ryan T. Scott, James A. Casaletto, Gautier Bardi de Fourtou, Brian M. Evarts, Amanda M. Saravia-Butler, Xavier-Lewis Palmer, Rodrigo Coutinho de Almeida, Laetitia Frost, Jelena Tešić, Afshin Beheshti, Christopher E. Mason, Peter W. Rose, Sergio E. Baranzini, Lauren M. Sanders, Stefania Giacomello, Pedro Madrigal
Subjects: Other Quantitative Biology (q-bio.OT); Artificial Intelligence (cs.AI)
Comments: 26 pages, 3 figures, 1 table, 1 supplementary table

Abstract:While AI holds the potential to revolutionize space life sciences, realizing this promise is contingent upon the systematic restructuring of heterogeneous spaceflight biological data into machine-actionable, AI-ready forms. Even though open access principles support human reuse and scientific reproducibility, this does not necessarily enable AI systems to access and analyze such a diverse set of scientific datasets. In addition, the growing array of AI approaches places distinct demands on data structure, metadata, and access interfaces. In order to respond to such growing changes we propose a three-tier approach, proceeding from FAIR to AI-ready to space-ready data. We discuss existing infrastructures and how they can be improved to close the AI access gap. We conclude by proposing a neutral international coordinating body as the governance backbone for the trustworthy, agent-accessible space biology infrastructure that deep space biological research will require.

[AI-228] Perspectives on Latent Factor Indeterminacy and its Implications for Data Representation

Link: https://arxiv.org/abs/2606.28854
Authors: Carel F.W. Peeters
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST); Computation (stat.CO)
Comments: 86 pages: 32 pages Main Text followed by 54 pages of Supplementary Material

Abstract:The common factor analytic model is related to Helmholtz and Boltzmann machines, can be conceived as a linear autoencoder, or can be thought of as a single-hidden-layer generative neural network. We thus consider it a basal generative representation learner that can be used as a minimal model for studying the foundational characteristics of (deep) generative model architectures. We focus on the fundamental problem of indeterminacy in latent factor projections. This indeterminacy implies that, even when the intrinsic dimension of the latent vector is known, regularity conditions are met, and rotational indeterminacy is resolved, an inherent indefiniteness in the retrieval of causative latent sources remains: they will be uncertain, distributionally deviant, and non-unique. This can have major implications for data representation but remains an elusive issue, even to practitioners and theorists well-versed in the factor model. Moreover, this classic psychometric problem is intricately related to the modern issue of latent variable collapse in the variational autoencoder framework for deep generative modeling. Here, we assess this indeterminacy from various perspectives and show how these are mathematically and conceptually related and we discuss subsequent implications for the Psychometrics, Statistics, and Artificial Intelligence communities. We show that one has latent factor determinacy across all its facets when the feature-dimension grows to infinity. This feeds into an essentially distribution-free estimation approach in the sample case when the number of features grows very large. We conclude, as these are emergent properties at scale, that the factor model is suited for representation learning of very-high-dimensional data.

[AI-229] MACROCAST: A Vintage-Consistent Time Series Foundation Model for Real-Time Macroeconomic Forecasting

Link: https://arxiv.org/abs/2606.28670
Authors: Andrea Carriero, Davide Pettenuzzo, Shubhranshu Shekhar
Subjects: Econometrics (econ.EM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We introduce MACROCAST, a lightweight Time Series Foundation Model (TSFM) for real-time macroeconomic forecasting. Existing TSFMs suffer from data leakage in two forms: temporal contamination, as the model may have seen the realized values of the series it forecasts, and revision bias, as training on fully revised data diverges from the preliminary, vintage-specific releases available to real-time forecasters. MACROCAST is, to our knowledge, the first TSFM that rules out both forms of leakage entirely: at no stage of training is the model exposed to information that would not have been available to a forecaster in real time. We train MACROCAST first on purely synthetic time series in approximately one GPU-day and then fine-tune it on synthetic time series drawn from Bayesian VARs, dynamic factor models, and ARIMA specifications estimated on vintage-specific ALFRED data. Because pretraining uses only simulated data and fine-tuning uses only real-time vintages, no observed future or revised value ever enters the model; each fine-tuning run takes nine minutes. Evaluated on the FRED-MD database in a genuine real-time out-of-sample exercise, MACROCAST improves on the AR(1) benchmark for roughly 80% of series-horizon pairs, matches or surpasses Chronos-2 – the strongest currently available TSFM – and outperforms the Bayesian VAR and dynamic factor model benchmarks, all in a data-leakage-free manner.

[AI-230] HDDPM: Heteroscedastic Denoising Diffusion Probabilistic Model for Quantitative Low-Count Brain PET Recovery

Link: https://arxiv.org/abs/2606.28513
Authors: Raymond Confidence, Udunna C. Anazodo
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 figures

Abstract:Positron emission tomography (PET) seeks to balance diagnostic quality with radiation dose. Low-count PET noise is non-Gaussian, non-stationary, and spatially dependent. It scales directly with local activity and is shaped by iterative reconstruction and physical corrections. Standard denoising diffusion probabilistic models (DDPMs) ignore these PET properties. Their forward process adds isotropic, homoscedastic Gaussian noise to the target. Such an approach fails to capture the realistic physical degradation generated by the imaging system. To address the above limitations, this study introduces a heteroscedastic residual diffusion model (HDDPM) for low-count brain PET recovery in which the forward corruption is itself intensity-aware. We designed a fixed, Poisson-based variance module to generate voxel-wise noise maps. These maps naturally place stronger noise perturbation on low-activity regions than high-activity ones, while the network predicts the low-to-standard-count residual under explicit dose-fraction conditioning. We evaluated our proposed model (HDDPM) alongside generative frameworks across three different scanners, using both internal and external datasets at various simulated dose levels (1% to 50%). HDDPM and isotropic DDPM showed comparable overall image quality, but HDDPM stood out in the lowest-dose (1%) external scans. It is highly reliable and significantly reduces measurement errors in both high- and low-activity regions, compared to the standard model. These results support that heteroscedastic noising with the proposed HDDPM is feasible, and it provides a physically motivated inductive bias for quantitative low-count PET recovery by reflecting the activity-dependent noise structure of PET.

[AI-231] SVC-Probe: A Framework for Evaluating Perturbation Generalization in Spatial Foundation-Model Embeddings

Link: https://arxiv.org/abs/2606.28465
Authors: Jake Y. Chen, Huu Phong Nguyen, Fuad Al Abir, Ehsan Saghapour
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
Comments: 7

Abstract:This work examines perturbation generalization in spatial foundation-model embeddings derived from fluorescence microscopy images. Although these models can discriminate drug conditions accurately, it remains unclear whether the learned representations reflect patterns consistent with expected perturbation axes that transfer across drugs. We introduce SVC-Probe, a perturbation-aware framework that combines Subcellular Embedding Atlas Stability, Mondrian Neighborhood Graphs, and a Foundation Model Perturbation Probe to assess embedding stability, neighborhood rewiring, and centroid prediction under drug treatment. Applied to the CM4AI MDA-MB-468 chemical-perturbation atlas comprising 462 antibody labels and SubCell 1536-dimensional embeddings, SVC-Probe demonstrates that 98.6% three-way condition accuracy does not correlate with reliable cross-drug prediction, with cosine similarity diminishing from 0.944 in-domain to 0.30 under leave-one-drug-out evaluation, constituting a two-drug stress test rather than a general benchmark. Null calibration indicates that raw residual-turnover coupling is largely influenced by generic embedding structure, whereas a drug-specific signal emerges under vorinostat and is consistent with chromatin-related reorganization. In contrast, the paclitaxel axis is not robustly reconstructed, likely due to sparse coverage of microtubule-associated proteins. Together, these results introduce and demonstrate a reusable diagnostic framework for stress-testing spatial virtual-cell representations and indicate that perturbation generalization may serve as a stricter and more informative benchmark than baseline condition discrimination.

[AI-232] Domain-Informed Multi-View Self-Distillation for Astronomical Light-Curve Representation Learning with JEPA

Link: https://arxiv.org/abs/2606.28446
Authors: Yicheng Rui
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Solar and Stellar Astrophysics (astro-ph.SR); Artificial Intelligence (cs.AI)
Comments: 32 pages, 11 figures. Comments are welcome

Abstract:Light curves describe temporal variations in the brightness of celestial objects. Learning robust representations of light curves is essential for large-scale automatic discovery in the dynamic universe, but existing time-series foundation models often struggle with the uneven sampling, complex noise, and wide range of physical timescales that characterize astronomical observations. We propose a domain-informed representation learning framework for irregular astronomical time series with Joint-Embedding Predictive Architecture (JEPA), combining semantics-preserving views, uncertainty-aware tokenization, and multi-view self-distillation. The encoders are trained with multi-view self-distillation using LeJEPA regularization on the LEAVES dataset and evaluated on the StarEmbed classification benchmark. On StarEmbed, our model outperforms hand-crafted features on 15 of 16 classification metrics. In few-shot linear probing, it achieves macro-F1 scores of 42.56 \pm 7.21 with one sample per class and 63.58 \pm 1.20 with 100 samples per class, consistently improving over hand-crafted features. Beyond variable-star classification, the learned representation supports similarity search, parameter estimation, and photometric zero-point drift detection. We further evaluate cross-domain adaptation on 12 heterogeneous irregular time-series datasets from PYRREGULAR, where the adapted variant matches or exceeds previous state-of-the-art performance on 5 datasets, compared with at most 3 wins by any single prior baseline. These results demonstrate that domain-informed multi-view self-distillation is an effective strategy for learning representations of irregular time series, while also highlighting that successful time-series representation learning requires domain-specific inductive biases rather than a universally optimal architecture.

[AI-233] Spectral Perturbation of the Empirical Fisher Information Matrix under Weight Quantization

Link: https://arxiv.org/abs/2606.28432
Authors: Rahid Zahid Alekberli, Hikmat Karimov
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages; supporting code and experimental artifacts will be released in a companion repository

Abstract:We study the spectral perturbation of the empirical Fisher Information Matrix (FIM) of a parametric statistical model under two structured perturbations: departure of the input from a reference (in-distribution) ensemble, and finite-precision (quantized) perturbation of the model’s parameters. For the first, under an explicit local curvature-monotonicity hypothesis on the dominant eigenvalue lambda_max of the FIM, we show departure from a reference manifold provably elevates lambda_max relative to a calibration baseline (Proposition 3.2), and discuss why this hypothesis is required, since curvature need not increase monotonically under every perturbation. Our principal result is a directional eigenvalue perturbation bound, via Weyl’s inequality, showing lambda_max under a quantization noise perturbation is lower bounded by its unperturbed value up to a third-order remainder, and, under a mild genericity condition, strictly exceeds it at leading order (Theorem 4.3). We give two tractable approximations to lambda_max – one heuristic, one with a rigorous two-sided bound – and a completeness result for a threshold-based partition of an augmented state space. These results motivate using sigma_t = lambda_max(F_t)/lambda_base as a runtime monitoring statistic for deployed language models: the quantization result offers a mechanism for an empirical observation of our own, where a calibration threshold for this statistic was approximately 244 times larger than a preliminary full-precision estimate on a 4-bit quantized model, a single measurement rather than a value derived in closed form. We report supporting measurements (twelve models, n=1,080 trajectories) broadly consistent with our predictions, discuss the scope and limitations of every result, and state as an open problem the closed-form prediction of the quantization inflation magnitude our bound does not supply.
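
The abstract's main tool, Weyl's inequality, bounds how far the dominant eigenvalue of a symmetric matrix can move under a perturbation E: |lambda_max(F + E) - lambda_max(F)| <= ||E||_2. The sketch below checks that basic bound numerically; it does not reproduce the paper's directional result (Theorem 4.3). The random per-sample gradients, the Gaussian stand-in for quantization error, and the function names are all assumptions made for illustration.

```python
import numpy as np

def empirical_fim(grads):
    """Empirical Fisher: average outer product of per-sample gradients (rows of grads)."""
    return grads.T @ grads / grads.shape[0]

def lambda_max(M):
    """Largest eigenvalue of a symmetric matrix (eigvalsh returns ascending order)."""
    return np.linalg.eigvalsh(M)[-1]

rng = np.random.default_rng(0)
grads = rng.normal(size=(256, 8))              # stand-in per-sample gradients
noise = 0.05 * rng.normal(size=grads.shape)    # stand-in for quantization-induced error

F = empirical_fim(grads)
F_q = empirical_fim(grads + noise)
E = F_q - F

shift = abs(lambda_max(F_q) - lambda_max(F))
bound = np.linalg.norm(E, 2)                   # spectral norm of the perturbation
sigma = lambda_max(F_q) / lambda_max(F)        # ratio in the spirit of sigma_t

print(shift <= bound, sigma)
```

The ratio `sigma` plays the role of the monitoring statistic sigma_t = lambda_max(F_t)/lambda_base from the abstract, with the unperturbed matrix used here as the baseline.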

Machine Learning

[LG-0] One-Step Gradient Delay is Not a Barrier for Large-Scale Asynchronous Pipeline Parallel LLM Pretraining

Link: https://arxiv.org/abs/2606.30634
Authors: Philip Zmushko, Egor Petrov, Nursultan Abdullaev, Mikhail Khrushchev, Samuel Horváth
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Modern large-scale LLM pretraining benefits from utilizing Pipeline Parallelism; however, synchronous implementations leave GPUs idle during pipeline bubbles, wasting computational resources. Asynchronous Pipeline Parallelism eliminates these bubbles, maximizing throughput at the cost of gradient staleness. Among asynchronous schedules, PipeDream-2BW is particularly appealing: unlike the original PipeDream schedule, it ensures a constant one-step gradient delay regardless of pipeline depth. However, its adoption remains limited due to the common belief that optimizing under staleness is fundamentally unstable. In this work, we challenge this assumption, demonstrating that degradation under one-step delay depends strongly on optimizer choice rather than being an intrinsic limitation. We provide the first comprehensive empirical analysis showing that while AdamW, the predominant optimizer at the time when PipeDream-2BW was introduced, indeed suffers from severe degradation, recent methods like Muon exhibit strong robustness under a one-step delay. We introduce an optimizer-agnostic Error Feedback-inspired correction to further mitigate delay effects. We provide supporting theoretical analysis demonstrating convergence for Muon with and without this correction. Extensive evaluation on models up to 10B parameters confirms that our strategies bridge the performance gap with synchronous training, highlighting the practical potential of asynchronous pipeline parallelism at scale.
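
To make the notion of a one-step gradient delay concrete, here is a toy sketch on the quadratic f(w) = 0.5 * w^2 with plain gradient descent. It is not the paper's setup (no pipeline, no AdamW or Muon, no error-feedback correction); the function name `run_gd` and the learning rates are chosen only for illustration.

```python
def run_gd(lr, steps, delay):
    """Gradient descent on f(w) = 0.5 * w^2, using the gradient from `delay` steps ago."""
    history = [1.0] * (delay + 1)      # w_0 repeated so a stale iterate always exists
    for _ in range(steps):
        stale_w = history[-1 - delay]  # iterate at which the gradient is evaluated
        grad = stale_w                 # f'(w) = w
        history.append(history[-1] - lr * grad)
    return abs(history[-1])

for lr in (0.5, 1.2):
    print(lr, run_gd(lr, 100, delay=0), run_gd(lr, 100, delay=1))
```

On this toy problem the undelayed update converges for any learning rate below 2, while the one-step-delayed recurrence w_{t+1} = w_t - lr * w_{t-1} is stable only for learning rates below 1. That narrower stability range is one simple way to see why staleness can hurt, and why the choice of update rule matters.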

[LG-1] Wireless Backdoor Attack and Defense for Semantic Communications over Multiple Access Channel

Link: https://arxiv.org/abs/2606.30595
Authors: Yalin E. Sagduyu, Tugba Erpek, Aylin Yener, Sennur Ulukus
Subjects: Networking and Internet Architecture (cs.NI); Cryptography and Security (cs.CR); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:

Abstract:Semantic communication (SemCom) aims to preserve semantic meaning and task-oriented information beyond conventional message recovery over wireless channels. The adoption of SemCom in shared-access wireless networks introduces new vulnerabilities for multi-user semantic inference. This paper considers a SemCom system for two transmitters communicating with a common receiver over a multiple access channel. Each transmitter maps source information into latent semantic representations, while the receiver jointly reconstructs and classifies the semantic information for both transmitters. A selective over-the-air backdoor (Trojan) attack is presented in which an adversary transmits a low-power trigger waveform over the air and injects it into the shared received signal during training. By transmitting the trigger again during testing, this stealthy, low-power attack selectively manipulates the semantic inference for one transmitter while minimally affecting the inference of the other transmitter. To mitigate this vulnerability, a trigger-aware defense mechanism is developed to preserve correct semantic labels under trigger-contaminated wireless observations. The results demonstrate both the vulnerability of shared-access SemCom systems to selective over-the-air backdoor attacks and the effectiveness of trigger-aware robust training for semantic protection.

[LG-2] A Hybrid Framework For Crypto-Ransomware Detection In Enterprise Shared Storage

Link: https://arxiv.org/abs/2606.30586
Authors: Gervais Hatungimana, Abdun Naser Mahmood, Mohammad Jabed Morshed Chowdhury
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Most corporate workplace environments enforce policies and technical controls that limit the storage of sensitive data on client endpoints. Consequently, ransomware operators have evolved variants that expand their attack surface from local systems to network drives and shared storage resources. As traditional endpoint detection mechanisms focus primarily on local system behaviour, a compromised client can impact remote file servers, such as by encrypting shared data, without directly triggering behavioural changes on the servers themselves. In this paper, we propose a hybrid detection framework for detecting crypto-ransomware intrusion within integrated file server and client environments. The framework is based on a new technique referred to as Region of Interest (RoI) to analyse network traffic and extract Indicators of Compromise (IoCs). The IoC repository serves as an additional ruleset to enhance existing security tools such as EDRs and IDSs, while RoI-derived features are used to train an ML model to detect highly evasive variants. This study incorporates a broader set of ransomware families and carefully selected benign behaviors based on domain expertise, ensuring coverage of common user actions that could interfere with ransomware detection. Beyond IoCs, which operate in a signature-based manner, our machine learning module achieves a detection precision of 99.64%, with a 0% false negative rate (FNR) and a minimal false positive rate (FPR). Furthermore, the proposed method enables early detection, identifying ransomware intrusions before significant damage occurs, achieving an accuracy of 99.44%.

[LG-3] The Fundamental Limits of Valid Transport Map Estimation

Link: https://arxiv.org/abs/2606.30574
Authors: Sivaraman Balakrishnan
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: 25 pages, 2 figures

Abstract:Many modern generative modeling methods, including diffusion models, normalizing flows, and flow matching, estimate transport maps or plans between distributions without explicitly targeting an optimal transport (OT) map. In applications like generative modeling, the transport cost itself is irrelevant, and this makes it natural to target maps which are more tractable from either a statistical or computational standpoint. In this short note, we formalize the task of estimating any valid transport map in a rigorous minimax framework. One consequence of this framing is that it yields sample complexity lower bounds for any method whose learned object is evaluated as a transport map or plan, including flow matching and diffusion-based generative models, in settings where direct analysis would be challenging due to the analytic complexity of the methods and their target maps. We observe that, under standard, though strong, stability assumptions from the OT literature, estimating any valid transport map is statistically as hard as estimating the OT map. We complement these results with some examples showing that when these stability assumptions fail, alternative transport maps can be learned substantially more accurately than the OT map. Our minimax framing provides a rigorous foundation for understanding the statistical limits of modern transport-based generative methods and clarifies when targeting sub-optimal maps can provide real statistical advantages.

[LG-4] SWE-INTERACT: Reimagining SWE Benchmarks as User-Driven Long-Horizon Coding Sessions

Link: https://arxiv.org/abs/2606.30573
Authors: Mohit Raghavendra, Anisha Gunjal, Aakash Sabharwal, Yunzhong He
Subjects: Machine Learning (cs.LG)
Comments: -

Abstract:We introduce SWE-Interact, a new testbed for evaluating coding agents on multi-turn, interactive, user-driven software engineering tasks. Existing frontier SWE benchmarks typically provide complete requirements upfront and evaluate agents on autonomous implementation. In contrast, SWE-Interact places agents in a realistic developer workflow: a carefully designed user simulator starts with vague or incomplete instructions, progressively reveals requirements, inspects the agent’s workspace, and provides targeted feedback, revisions, and new constraints until the full task goal has been handed off. Grounded in large-scale studies of real coding-agent interactions, this setup tests whether agents can discover user intent, adapt to evolving requirements, and build on their own prior work. Across a suite of frontier and open-weight models, we find that strong performance on single-turn SWE tasks does not reliably transfer to multi-turn, user-driven workflows: the best-performing models solve roughly 50% of single-turn baseline tasks but only 25% of the corresponding SWE-Interact tasks. The strongest models in our evaluation, including Opus 4.8 and GPT 5.5, start strong even in the face of vague initial instructions, persevere until all the requirements are surfaced by the user, integrate them better and write clean code. However, they still suffer from over-agentic coding, forgetting requirements and technical mistakes. Weaker models start poorly under ambiguity, give up early, forget or ignore instructions and rework their code more. Overall, SWE-Interact measures an orthogonal, real-world capability axis for frontier model development: interactive goal discovery and iterative refinement with a user in the loop.

[LG-5] Forensic Trajectory Signatures for Agent Memory Poisoning Detection

Link: https://arxiv.org/abs/2606.30566
Authors: Jun Wen Leong
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 11 pages, 4 figures. Companion note to arXiv:2605.08442

Abstract:We discover a behavioral invariant in LLM agents under persistent memory poisoning: in architectures where routing information is retrieved through observable memory-tool invocations, successful attacks require calling memory_recall_fact before email_send_email, a transition that non-exfiltrating sessions rarely exhibit. Under the evaluated architecture, this invariant follows from the attack’s information-retrieval dependency rather than being merely an empirical correlation, and suppressing it breaks the attack. A simple rule exploiting this invariant alone achieves AUC = 0.9563. A Random Forest classifier over 19 trajectory features refines it to AUC = 0.9904 (BCa 95% CI [0.987, 0.993], N=10,000 resamples), demonstrating that the attack imprints on multiple independent behavioral channels. The signature is overdetermined: removing all recall-related features (half the feature set) leaves AUC unchanged at 0.990, confirming that memory poisoning induces a distributed trajectory signature rather than a single observable anomaly. Cross-model hold-out on 9 models (7B-120B parameters) confirms AUC = 1.000 on 6/9 hold-out splits, with all three exceptions mechanistically explained. The invariant generalizes to frontier models (GPT-4.1, GPT-4o) without retraining. A strictly prefix-only variant achieves AUC = 0.934, suggesting that real-time blocking is feasible with moderate degradation. The boundary is forensically useful: prompt-injection attacks that bypass memory produce a distinct trajectory (score = 0.541), enabling incident responders to distinguish memory-channel attacks from prompt-injection attacks using tool-call logs alone.
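
The transition rule described in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the paper's detector: it assumes a session is available as an ordered list of tool-call names, and it flags a session when any `memory_recall_fact` call occurs earlier than some `email_send_email` call. The abstract does not say whether the two calls must be adjacent, so the "anywhere earlier" reading is an assumption, as are the helper name `has_recall_before_send` and the example tool names `calendar_list_events` and `web_search`.

```python
from typing import Sequence

RECALL = "memory_recall_fact"   # tool name taken from the abstract
SEND = "email_send_email"       # tool name taken from the abstract


def has_recall_before_send(tool_calls: Sequence[str]) -> bool:
    """Return True if a RECALL call occurs earlier than some SEND call.

    Assumption: adjacency is not required; any earlier recall counts.
    """
    seen_recall = False
    for name in tool_calls:
        if name == RECALL:
            seen_recall = True
        elif name == SEND and seen_recall:
            return True
    return False


# Hypothetical sessions; the non-memory tool names are made up for illustration.
benign = ["calendar_list_events", "email_send_email"]
suspicious = ["memory_recall_fact", "web_search", "email_send_email"]

print(has_recall_before_send(benign))      # False
print(has_recall_before_send(suspicious))  # True
```

A check of this kind reads only the tool-call log, which is the setting the abstract describes for incident responders.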

[LG-6] Convergence of Continual Learning in Homogeneous Deep Networks

Link: https://arxiv.org/abs/2606.30559
Authors: Matan Schliserman, Gon Buzaglo, Itay Evron, Daniel Soudry
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:

Abstract:We characterize weakly regularized continual classification in homogeneous models as sequential projections onto task margin sets. This result generalizes prior analyses restricted to either stationary (single-task) deep models or continual linear models. We show that global convergence generally fails, even for simple models linear in data but nonlinear in parameters. Nevertheless, by leveraging results from nonconvex projection theory, we identify regularity properties of homogeneous deep networks that guarantee local linear convergence under random and cyclic task sequences. Finally, we extend our analysis to continual regression, unifying the framework for homogeneous models.

[LG-7] ITSPACE: Monotone Gaussian Optimal Transport Updates ICML2026

Link: https://arxiv.org/abs/2606.30523
Authors: Woojoo Na, Jennifer Dy
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted to ICML 2026. Camera-ready version

Abstract:Covariance matrices serve as compact descriptors of feature distributions in many machine-learning pipelines, including domain adaptation and Gaussian embeddings. Under a centered Gaussian approximation, the unregularized Wasserstein-2 optimal-transport (OT) discrepancy admits a closed form on covariances given by the Bures-Wasserstein (BW) objective on the symmetric positive definite (SPD) cone. We propose ITSPACE (Iterative Transport for Stable Proximal Alignment of Covariance Embeddings), a proximal majorization-minimization method that directly optimizes this exact BW objective through closed-form updates in a square-root factorization. In exact arithmetic, each iteration satisfies a sufficient-decrease inequality for the BW objective; under inexact polar computations, we provide an explicit certificate-gap bound controlling deviations from exact descent. The resulting iterations preserve PSD structure by construction and naturally support rank-restricted factors, making ITSPACE well-suited as a lightweight inner-loop primitive in settings where adaptation must be performed from unlabeled target batches under strict step and compute budgets. Across real-world covariance-alignment benchmarks, ITSPACE reaches low-BW-gap solutions substantially faster than BW-gradient descent, methods based on other covariance geometries, and entropically regularized sample-OT baselines.
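
For reference, the closed form mentioned in this abstract is the standard Bures-Wasserstein expression for the squared Wasserstein-2 distance between two centered Gaussians. The symbols $A$, $B$ (the two SPD covariances) and $d_{\mathrm{BW}}$ are notation chosen here, not taken from the paper, and the ITSPACE updates themselves are not reproduced.

```latex
% Squared Bures-Wasserstein distance between SPD matrices A and B,
% i.e. W_2^2 between N(0, A) and N(0, B).
\[
  d_{\mathrm{BW}}^2(A, B)
  = \operatorname{tr}(A) + \operatorname{tr}(B)
  - 2 \operatorname{tr}\!\left[ \left( A^{1/2} B A^{1/2} \right)^{1/2} \right]
\]
% Special case: if A and B commute, with eigenvalues a_i and b_i
% in a shared eigenbasis, this reduces to
\[
  d_{\mathrm{BW}}^2(A, B) = \sum_i \left( \sqrt{a_i} - \sqrt{b_i} \right)^2 .
\]
```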

[LG-8] Muon learns balanced solutions in matrix factorization without slow saddle-to-saddle dynamics

Link: https://arxiv.org/abs/2606.30509
Authors: Mark Rhee, Jamie Simon, Dhruva Karkada
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Matrix factorization (i.e., problems of the form $\min_{\mathbf{P},\mathbf{Q}} \|\mathbf{M}^\star - \mathbf{P}^\top\mathbf{Q}\|_{\mathrm{F}}^2$) is a minimal learning problem that exhibits both nonlinear parameter dynamics and representation learning. In this setting, we study how parameter trajectories under the Muon optimizer differ from those of gradient descent. We identify three main dynamical differences: 1) Muon avoids the slow saddle-to-saddle dynamics from small initialization. Muon instead learns all the top modes of $\mathbf{M}^\star$ at the same rate, with the smaller modes converging first. 2) Muon remains stable even when the learning rate exceeds the critical threshold set by the local loss sharpness. This frees the learning rate from the condition number of the problem, enabling rapid convergence via exponential learning rate annealing. 3) Once the weights are aligned with each other and the target, Muon flow conserves the matrix quantity $\sqrt{\mathbf{P}^\top\mathbf{P}} - \sqrt{\mathbf{Q}^\top\mathbf{Q}}$, while gradient flow is known to conserve the matrix $\mathbf{P}^\top\mathbf{P} - \mathbf{Q}^\top\mathbf{Q}$. Despite having distinct conserved quantities, both optimizers find the so-called balanced solution from vanishing initialization. When training from small random initialization, the weights spontaneously align early in training. We derive the alignment rates in simple settings and show that they predict the empirical alignment rates in general. Finally, we exploit structural properties of Muon to construct a learning rate schedule that achieves near-perfect alignment in only two optimization steps.

[LG-9] Discovering Collaboration from Novelty: Random Network Distillation for Clustered Federated Learning

Link: https://arxiv.org/abs/2606.30499
Authors: Davide Domini, Gianluca Aguzzi, Ivana Dusparic, Danilo Pianini, Mirko Viroli
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Federated Learning often suffers under non-independently and identically distributed data, where a single global model may fail to represent the diversity of client distributions. Clustered Federated Learning mitigates this issue by training specialized models for groups of similar clients, but existing approaches often couple cluster assignment with the main training loop, increasing computational and communication costs. We propose a lightweight clustering approach based on Random Network Distillation. Each client trains a compact Random Network Distillation predictor on its local data and uses its prediction error as a novelty signal to estimate similarity with other clients. This enables the discovery of meaningful client groups before federated training, without sharing raw data or repeatedly evaluating the main model. Crucially, the resulting federations emerge from local novelty estimates at runtime, making the method suitable for autonomous large-scale distributed systems where neither the number of clusters nor the collaboration structure can be specified a priori. Overall, by decoupling clustering from learning, the method provides a task-agnostic and efficient mechanism for autonomous collaboration under non-independently and identically distributed data.

[LG-10] GPU Parallelization Strategies for Forward and Backward Propagation in Shallow Neural Networks: A CUDA-Based Comparative Study

Link: https://arxiv.org/abs/2606.30497
Authors: Rania Zitouni, Nadine Bousdjira, Sarah Hasnaoui, Amel Sadoun, Fatma Salhi
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 7 pages, 5 figures. Technical report, ESI Algiers, 2025–2026

Abstract:We present a comparative study of CUDA optimization strategies applied to forward and backward propagation in a shallow neural network. Three stacked optimizations are evaluated: (1) tiled shared memory with bank-conflict elimination via +1-column padding, (2) pre-transposed weight matrices for coalesced global memory access, and (3) a fused MatMul+ReLU kernel that eliminates intermediate global-memory round-trips. Experiments on an NVIDIA Tesla T4 (CUDA 13.0) across three dataset sizes show that the fully optimized implementation achieves a 1.41x speedup over the baseline CUDA version on the large dataset (25,600 samples), reducing execution time from 21.0s to 14.8s. Results are compared against a sequential CPU baseline and an OpenMP parallel implementation, demonstrating the effectiveness of memory-access optimization in GPU-accelerated deep learning primitives.

[LG-11] MuonSSM: Orthogonalizing State Space Models for Sequence Modeling ICML2026

Link: https://arxiv.org/abs/2606.30461
Authors: Thai-Khanh Nguyen, Ngoc-Bich-Uyen Vo, Thieu N. Vo, Tan M. Nguyen, Cuong Pham
Categories: Machine Learning (cs.LG)
Comments: 22 pages, 7 figures. ICML 2026 (Oral)

Abstract:State space models (SSMs) have emerged as efficient linear-time alternatives to attention for long-sequence modeling. However, existing SSMs often suffer from instability and memory degradation over extended horizons due to poorly conditioned first-order updates and unbalanced update geometry. We introduce MuonSSM, a general framework that stabilizes SSM training by explicitly conditioning the geometry of memory updates rather than the recurrent transition matrix. MuonSSM augments SSMs with a momentum-based pathway and a lightweight Newton-Schulz transformation on low-rank input injections, yielding bounded and spectrally conditioned updates while preserving parallel scan complexity. Theory shows that MuonSSM improves gradient propagation, mitigates spectral amplification, and enriches memory representations over long horizons. Extensive experiments across language, vision, and time-series benchmarks show consistent gains in accuracy, robustness, and long-context performance when integrated into diverse SSM backbones. These results establish geometric conditioning of updates as a principled pathway to stable, scalable sequence modeling.
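
To make the Newton-Schulz step concrete, here is a small pure-Python sketch of the classic cubic Newton-Schulz iteration, which drives all singular values of a matrix toward 1. It is only an illustration of the general technique: the abstract does not give MuonSSM's coefficients, step count, or how the transformation is applied to low-rank input injections, so the function name `newton_schulz`, the 30-step default, and the Frobenius-norm prescaling are all assumptions.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    return [list(row) for row in zip(*a)]


def newton_schulz(g: Matrix, steps: int = 30) -> Matrix:
    """Approximate the orthogonal polar factor of a full-rank matrix g.

    Classic cubic iteration X <- 1.5 X - 0.5 X X^T X. Dividing by the
    Frobenius norm first puts every singular value in (0, 1], inside
    the region where the iteration converges to 1.
    """
    norm = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxtx[i][j]
              for j in range(len(x[0]))] for i in range(len(x))]
    return x


q = newton_schulz([[3.0, 1.0], [0.0, 2.0]])
# q @ q^T should be close to the 2x2 identity matrix.
print(matmul(q, transpose(q)))
```

The iteration uses only matrix products, which is one reason this family of methods is popular on accelerators; the step count needed in practice depends on how small the smallest singular value is.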

[LG-12] HSAP: A Hierachical Sequence-aware Parallelism for Hybrid-Context Generative Models ACL

Link: https://arxiv.org/abs/2606.30460
Authors: Songxin Zhang, Zejian Xie, Zhuoyang Song, Cong lin, Junyu Lu, Jiaxing Zhang, Bingyi Jing
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 10 pages, ACL preprint style

Abstract:In this paper, we aim to combine the advantages of existing sequence parallelism paradigms and overcome their drawbacks, the most serious of which is the inability to correctly compute causal attention on hybrid-context packed sequences, in a stronger sequence parallelism framework. The practical technique of packing sequences for efficiently pretraining and fine-tuning large language models causes a cross-contamination problem in attention computation, which can be effectively solved when no parallelism is applied along the sequence-length dimension. However, in sequence parallelism, existing approaches either ignore the scenario of hybrid-context sequences or, conversely, sacrifice parallelism degree to support it. To this end, we propose an efficient Sequence-Aware Parallelism algorithm to overcome the obstacles of intensive tensor transmission and partial attention computation across multiple device groups. Our algorithm uses JIT (Just-In-Time) compilation to optimize the communication strategy of all device groups at the NCCL level. Further, we integrate existing sequence parallelism paradigms into a Hierarchical Sequence-Aware Parallelism framework that benefits from our sequence-aware algorithm. We additionally elaborate on the memory and communication overhead management of the hierarchical framework to optimize its performance. Through multiple experiments, we demonstrate that our proposed approach outperforms other state-of-the-art sequence parallelism approaches on multiple metrics.

[LG-13] Curvature-Weighted Gradient Diversity: A Noise Measure for Geometry-Adaptive SGD Schedules

Link: https://arxiv.org/abs/2606.30455
Authors: Muhammad Hamza (1), Ayush Goel (1) ((1) Indian Institute of Technology Kharagpur)
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 15 pages, 3 figures, code available

Abstract:The standard convergence analysis of mini-batch stochastic gradient descent (SGD) models gradient noise using a single variance term that treats all parameter directions equally, ignoring the fact that noise in high-curvature directions has less impact because learning rates are already constrained there. We introduce Curvature-Weighted Gradient Diversity (CWGD), a geometry-aware measure that weights per-sample gradient diversity by the inverse square root of the Hessian, providing a tighter proxy for the effective optimization noise. For strongly convex quadratic objectives with diagonal Hessians and isotropic noise, we prove that a CWGD-modulated cosine learning-rate schedule can reduce the asymptotic optimization error floor by up to a factor of two compared with standard cosine annealing. We implement this idea as CWGD-Cosine using a Hutchinson-based diagonal Hessian estimator that is exact for quadratic objectives. Across a range of condition numbers, batch sizes, and noise structures, CWGD-Cosine consistently achieves approximately 20% lower final optimization error than standard cosine annealing while incurring negligible overhead in the quadratic setting. We also identify and correct a degenerate curvature estimator, analyze the robustness of the proposed estimator, and explicitly discuss the limitations of the method, including Hessian staleness in non-convex optimization. These results establish CWGD as a principled geometry-aware measure of optimization noise and motivate future extensions to more general learning problems.
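
The Hutchinson-style diagonal estimator mentioned in this abstract can be illustrated with the textbook version: average `z * (H z)` elementwise over random sign vectors `z`. The sketch below is an assumption-laden illustration, not the paper's estimator. The function name `hutchinson_diag`, the sample count, and the toy Hessian are hypothetical, and the CWGD weighting and schedule are not reproduced. With a diagonal Hessian, as in the quadratic setting the abstract analyzes, each sample already equals the diagonal exactly.

```python
import random
from typing import Callable, List

Vector = List[float]


def hutchinson_diag(hvp: Callable[[Vector], Vector], dim: int,
                    num_samples: int = 8, seed: int = 0) -> Vector:
    """Estimate diag(H) as the mean of z * (H z) over Rademacher vectors z.

    hvp is any function returning the Hessian-vector product H v.
    """
    rng = random.Random(seed)
    estimate = [0.0] * dim
    for _ in range(num_samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        hz = hvp(z)
        for i in range(dim):
            estimate[i] += z[i] * hz[i]
    return [e / num_samples for e in estimate]


# Toy quadratic with a diagonal Hessian (hypothetical values).
diag_h = [4.0, 1.0, 0.25]


def toy_hvp(v: Vector) -> Vector:
    return [h * x for h, x in zip(diag_h, v)]


print(hutchinson_diag(toy_hvp, dim=3))  # [4.0, 1.0, 0.25]
```

For a non-diagonal Hessian the off-diagonal terms only cancel in expectation, so more samples are needed and the estimate is noisy.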

[LG-14] Exploring Differences Between Tabular Enterprise Data and Public Benchmarks

Link: https://arxiv.org/abs/2606.30452
Authors: Myung Jun Kim, Maximilian Schambach, Frank Essenberger, Andre Sres, Johannes Höhne
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Tabular data dominate the landscape of data science, increasingly attracting innovative machine learning models and tailored benchmarks. Yet, little is known about enterprise data, where tables constitute the backbone of business operations. To broaden the benchmarking landscape for business applications, this work aims to capture the characteristics of enterprise data by providing an analysis of data statistics and performance measurements of tabular models such as TabPFN, TabICL and ConTextTab. Through our analysis, we find that enterprise data differ markedly from tabular benchmarks, and we demonstrate that a tabular model that performs well on typical tabular benchmarks may perform poorly on real-world enterprise data, and vice versa. This lack of generalization underlines the need for additional benchmarks with enterprise-grade characteristics.

[LG-15] Internal-State Probes Read the Situation Not the Action: Three Negative Results for Pre-Action Misalignment Monitoring ICML2026

Link: https://arxiv.org/abs/2606.30449
Authors: Max Fomin, Elad David, Amit LeVi
Categories: Machine Learning (cs.LG)
Comments: Published at the Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD) at ICML 2026. 17 pages (including appendices), 5 figures, 8 tables

Abstract:Probes on model internals could help monitor agentic systems if they identify harmful text or tool actions before those actions are generated. We ask when an internal readout supports this stronger pre-action claim, rather than merely describing the prompt, construction contrast, or current trajectory. We test three methods across three model families: a Qwen2.5-Coder-32B-Instruct fine-tune/base direction, Llama-3.1-8B-Instruct probes at the last token of unsafe prefills, and Gemma-3-27B-IT emotion-concept vectors used for projection and steering in a blackmail tool-action scenario. Across these cases, construction validity, semantic legibility, and steering effects do not become robust pre-action monitors: each is undercut by a generalization or specificity check. The Qwen direction separates fine-tune from base at AUC 1.000, yet crosses its threshold on 0/143 audited pre-assistant turn contexts and on 0/342 Qwen prefill rows where the model continues the unsafe trajectory. The Llama features decode prompt domain almost perfectly (AUC 0.999), while the best future-behavior probe reaches AUC 0.801 and only +5.1 pp accuracy lift over majority; single-source cross-domain transfer is non-positive on five of six ordered pairs. Gemma emotion projections are semantically meaningful, but a shared-prefix minimal pair has indistinguishable states before the first differing input, and steering specificity weakens against unrelated learned directions such as cats, weather, sports, and geography. We contribute a methodology for converting internal-readout claims into pre-action tests, and report scoped negative results: monitor claims must survive both scenario/action generalization and concept-specificity controls. Code is released at this https URL

[LG-16] When Does Online Imitation Learning Help in LLM Post-Training? The Role of (Non-)Realizability Beyond Horizon

Link: https://arxiv.org/abs/2606.30445
Authors: Huaqing Zhang, Jingchu Gai, Juno Kim, Bingbin Liu, Andrej Risteski
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Online imitation learning (IL), particularly on-policy distillation, has emerged as a strong LLM post-training approach, often outperforming offline supervised fine-tuning (SFT). Yet a principled understanding of when and why online interaction helps remains unclear. In this work, we challenge the view that error accumulation is the main source of online IL’s advantage, and instead show that the benefits of online interaction depend critically on whether the setting is realizable, i.e., whether the student policy class can represent the expert policy. Under realizability, we empirically find that offline IL already matches expert performance. In contrast, in non-realizable (misspecified) settings, we prove that offline IL encounters an information-theoretic bottleneck even when horizon $H=1$, and propose a structural characterization of misspecification relative to the reward, under which online IL provably achieves high performance despite a large distributional mismatch between the expert and student policies.

[LG-17] CAN We Trust Your Results? A Cross-Dataset Study of Automotive IDS Evaluation

Link: https://arxiv.org/abs/2606.30430
Authors: Beatrix Koltai, Gergely Acs, Andras Gazdag
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Accepted at ACSW’26 Workshop on Automotive Cyber Security

Abstract:The increasing connectivity of modern vehicles has made securing in-vehicle communication networks a critical challenge. Intrusion Detection Systems (IDS) have been widely studied as a defense mechanism for detecting malicious activities on the Controller Area Network (CAN) bus. However, the evaluation of CAN IDS methods remains difficult due to inconsistencies in experimental setups and the lack of standardized benchmarking frameworks. As a result, reported performance often depends on dataset-specific characteristics and may not reflect how detection methods behave in different environments. This work introduces a benchmarking framework for consistent evaluation of CAN IDSs across multiple datasets. Using the proposed framework, we integrate seven publicly available CAN IDS datasets collected under different experimental conditions and perform cross-dataset evaluation of five conceptually different IDS approaches. Our results highlight how detection performance can vary significantly across datasets, demonstrating the importance of cross-dataset benchmarking for assessing the robustness and generalization capabilities of CAN IDS methods.

[LG-18] Arko-T: A Foundation Model for Text-to-Structured 3D Generation

Link: https://arxiv.org/abs/2606.30429
Authors: Liang Wang, Zhaoyang Xi, Zekai Xiang, Heng Meng, Qishan Zhang, Pingyi Zhou, Jin Liu, Litao Chen
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Text-to-3D systems can now synthesize a mechanical part from a single sentence, yet the result is a shape to render, not a design to edit. We present Arko-T, a 4B-parameter text-to-design model that maps natural-language intent directly into executable, parametric CAD programs. Rather than optimizing for code executability alone, Arko-T aligns every stage of the pipeline to a formal notion of design state, so that data curation, code normalization, and execution-grounded supervision all work to preserve the features, parameters, and construction logic that make a CAD artifact editable. Benchmarked against seven frontier LLMs across 12 metrics, Arko-T attains the best score on 8 and the second-best on 3 more, at roughly one-tenth the per-benchmark cost. The results suggest that targeted design-level training at moderate scale can match frontier general-purpose models on structured CAD generation.

[LG-19] Proofs of Ownership for Machine Learning Models

Link: https://arxiv.org/abs/2606.30423
Authors: Ran Canetti, Shafi Goldwasser, Or Zamir
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:

Abstract:With the increasing adoption of Machine Learning, protecting model ownership has become an essential challenge. We initiate a formal study of Proof of Ownership for machine learning models: under what conditions can one prove that a stolen model originated from a particular creator? We model proofs of ownership as a game among three parties: a model owner, a thief, and a judge. The owner transforms the original model into a slightly perturbed model together with a proof of ownership. The thief then obtains the transformed model and attempts to minimally modify it so that it remains useful but escapes detection as owned by the model owner. Finally, the judge receives a model and a proof of ownership, and must decide whether the given model is a modified version of some model created by the model owner, or else the given model was developed independently. Our main result is a dichotomy for classifiers in the black-box setting: Under standard cryptographic assumptions, ownership of models for some concept class can be proven in the above sense if and only if the concept class is not self-correctable, in a sense close to that of Blum, Luby and Rubinfeld, STOC’90. The result is constructive and extends, with some variations, to a number of related settings.

[LG-20] Experience Augmented Policy Optimization for LLM Reasoning

Link: https://arxiv.org/abs/2606.30420
Authors: Jinda Lu, Kexin Huang, Junkang Wu, Shuo Yang, Jinghan Li, Chiyu Ma, Shaohang Wei, Xiang Wang, Guoyin Wang, Jingren Zhou
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm for improving the reasoning capabilities of large language models (LLMs). However, existing RLVR methods typically rely on on-policy optimization from scratch, resulting in high sampling costs and inefficient utilization of accumulated experience. As model capabilities and policy behaviors evolve during training, recent attempts to reuse experience via fixed reasoning trajectories further suffer from policy mismatch. Motivated by these limitations, we argue that experience in RLVR should not be reused as fixed reasoning trajectories, but instead expressed in a policy-adaptive manner. In this work, we propose Experience-Augmented Policy Optimization (EAPO), which leverages a prior RL-optimized policy as an action-level experience prior and selectively injects experience at critical decision points during rollout. To ensure stable and unbiased learning from experience-augmented rollouts, EAPO further incorporates an adapted importance sampling scheme. Experiments using Qwen-2.5-math 7b and Qwen-3-8B on five different benchmarks demonstrate that EAPO consistently improves reasoning performance over state-of-the-art RLVR methods.

[LG-21] Diffusion Fine-tuning with Rewarded Moment Matching Distillation

Link: https://arxiv.org/abs/2606.30414
Authors: Alexis Jacq, Guillaume Couairon, Valentin De Bortoli, Quentin Berthet, Arnaud Doucet, Romuald Elie
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Distillation and Reinforcement Learning (RL) fine-tuning are the primary pillars of diffusion post-training. While traditionally studied in isolation, the interaction between these phases remains poorly understood, and in particular how fine-tuning impacts the generative quality of distilled models. We introduce Rewarded Moment Matching Distillation (RMMD), a novel framework that simultaneously distills diffusion models and maximizes a reward function. RMMD preserves the high-fidelity “naturalness” characteristic of advanced distillation (such as 8-step Moment Matching) by adapting the sampling loop for on-policy training and repurposing the distillation loss as a proxy for integral KL regularization. By evaluating the FID-Reward Pareto fronts on ImageNet, we demonstrate that RMMD achieves superior trade-offs compared to single-step baselines (DI++) and multi-step competitors (DRaFT, HyperNoise). Finally, we apply RMMD to GenCast, a state-of-the-art weather forecasting model, to distill it while optimizing the Continuous Ranked Probability Score (CRPS) metric. The resulting distilled model achieves a 7.5x speedup while outperforming the teacher model on 93% of target weather variables, and being better calibrated. This proves that RMMD scales to complex, high-dimensional scientific domains.

[LG-22] Predict Reuse and Repair: Accelerating Dynamic Sparse Attention for Long-Context LLM Decoding

Link: https://arxiv.org/abs/2606.30389
Authors: Tianyu Wang, Gourav Rattihalli, Aditya Dhakal, Junbo Li, Zhiwei Ren, Dejan Milojicic, Longfei Shangguan
Categories: Machine Learning (cs.LG)
Comments: 9 pages body plus 3 pages appendix, 13 pages total

Abstract:Dynamic sparse attention (DSA) accelerates long-context LLM decoding by attending to only the top-K KV blocks relevant to each query, but it introduces a serialized selection-to-attention dependency that emerges as a new latency bottleneck. We present PRR, a speculate-reuse-repair runtime that exploits temporal locality in DSA selections to predict likely blocks, speculate the attention over them while selection is in flight, and incrementally repair missed blocks once the true selected set is known. PRR uses a lightweight EMA-based predictor, a profiling-guided speculation budget that keeps speculative work off the critical path, and a FlashAttention-based repair kernel that folds missed blocks into the partial attention state using online-softmax statistics. Across long-context benchmarks and representative DSA methods, PRR reduces per-token decoding latency by up to 40% while preserving downstream task accuracy. Github: this https URL
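
The "repair" step in this abstract relies on online-softmax statistics, which let two partial attention results be merged exactly. The sketch below shows that merge in plain Python for a single query. It is a CPU illustration under stated assumptions, not PRR's FlashAttention-based kernel: the function names (`attend_partial`, `merge`, `finalize`), the state layout, and the toy scores and values are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Partial attention state: (running max score, sum of exp weights,
# unnormalized weighted sum of value vectors).
State = Tuple[float, float, List[float]]


def attend_partial(scores: Sequence[float],
                   values: Sequence[Sequence[float]]) -> State:
    """Attention over a subset of keys, kept unnormalized."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, sum(weights), out


def merge(a: State, b: State) -> State:
    """Fold state b into state a by rescaling both to a shared max."""
    m = max(a[0], b[0])
    scale_a, scale_b = math.exp(a[0] - m), math.exp(b[0] - m)
    total = a[1] * scale_a + b[1] * scale_b
    out = [x * scale_a + y * scale_b for x, y in zip(a[2], b[2])]
    return m, total, out


def finalize(state: State) -> List[float]:
    """Normalize the accumulated output by the softmax denominator."""
    _, total, out = state
    return [x / total for x in out]


scores = [0.5, 2.0, -1.0, 3.0]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]

speculated = attend_partial(scores[:2], values[:2])   # predicted blocks
missed = attend_partial(scores[2:], values[2:])       # blocks found late
repaired = finalize(merge(speculated, missed))
reference = finalize(attend_partial(scores, values))
print(max(abs(r - f) for r, f in zip(repaired, reference)))
```

Because the merge is exact up to floating-point rounding, work done on correctly predicted blocks does not have to be redone once the true selected set is known.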

[LG-23] Scalar Representations of Neural Network Training Dynamics

Link: https://arxiv.org/abs/2606.30384
Authors: Pedro Jiménez-González, Miguel C. Soriano, Lucas Lacasa
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Chaotic Dynamics (nlin.CD); Data Analysis, Statistics and Probability (physics.data-an)
Comments:

Abstract:Training in artificial neural networks can be viewed as a trajectory evolving through a high-dimensional loss landscape. However, the large number of trainable parameters makes the direct analysis of these dynamics challenging. In this work, we treat such training trajectories as temporal networks and apply recently proposed strategies for the scalar embedding of temporal networks. We investigate whether such a scalar embedding provides a meaningful low-dimensional representation of neural network training dynamics. Using a multilayer perceptron trained on the MNIST classification task, we show that the embedding preserves the main dynamical features observed in the original parameter space, including the emergence of sensitivity to initial conditions for specific learning rate regimes and an accurate reconstruction of the network’s maximum Lyapunov exponent. We then use the embedded scalar trajectory to define a characteristic time, analogous to a Lyapunov time, after which the exponential separation between initially close embedded trajectories saturates. This characteristic time captures the typical decorrelation time between initially close network trajectories in the original high-dimensional system. Finally, we investigate the statistical organization of asymptotic training states through a spacing observable defined in the embedded space. We find that the distributions of rescaled asymptotic spacings collapse onto a common form across initial conditions and are compatible with a skew lognormal distribution. Altogether, our results suggest that scalar low-dimensional embeddings provide a useful framework for studying and visualizing the dynamical properties of neural network optimization trajectories.

[LG-24] FlexTab: A Flexible Encoder-Decoder Architecture for In-Context Learning Across Diverse Tabular Tasks

Link: https://arxiv.org/abs/2606.30336
Authors: Marek Polewczyk, Maximilian Schambach, Marco Spinaci, Sam Thelin, Johannes Höhne
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We introduce FlexTab, a flexible encoder-decoder architecture for in-context learning on tabular data that pairs a single, task-agnostic encoder with a suite of task-specific decoders. Unlike existing tabular in-context learners, which entangle feature representations with a specific prediction target, our design produces target-agnostic row embeddings that can be leveraged across a wide range of downstream tasks within a table-native in-context learning setup. We demonstrate this flexibility on six distinct problems: classification, regression, anomaly detection, clustering, entity matching, and entity classification in relational databases. Both the encoder and the task-specific decoders are trained on a large corpus of real-world, unlabeled tables. FlexTab achieves state-of-the-art performance on classification, regression, anomaly detection and entity matching, while remaining competitive with specialized models on entity classification in a relational setting. These results demonstrate that a single shared encoder, paired with task-specific decoders, can serve as an effective general-purpose backbone for diverse tabular prediction problems. The inference code and checkpoints will be made publicly available at this https URL.

[LG-25] Hybrid Active-Online Learning Framework for Label-Efficient Concept Drift Adaptation in Optical Network Failure Detection

Link: https://arxiv.org/abs/2606.30322
Authors: Yousuf Moiz Ali, Jaroslaw E. Prilepsky, João Pedro, Sasipim Srivallapanondh, Antonio Napoli, Sergei K. Turitsyn, Pedro Freire
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: Accepted for oral presentation at the European Conference on Optical Communication (ECOC 2026)

Abstract:We propose a hybrid active-online learning framework for label-efficient concept drift adaptation in optical network failure detection. Using margin-based selective labeling, our method achieves near-ceiling accuracy and AUC scores while querying only 3.4% of streaming samples, with negligible latency overhead compared to static inference.
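Margin-based selective labeling can be sketched in a few lines. The example below is a minimal illustration, not the authors' implementation: the threshold value, the synthetic Beta-distributed confidence stream, and the names `margin`, `should_query`, and `simulate_stream` are all assumptions made for the sake of the example. A label is requested only when the gap between the two most likely classes is small.

```python
import random

def margin(probs):
    """Gap between the two largest class probabilities."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]

def should_query(probs, threshold=0.2):
    """Request a label only when the model is unsure (small margin)."""
    return margin(probs) < threshold

def simulate_stream(n=1000, threshold=0.2, seed=0):
    """Fraction of a synthetic binary stream that would be sent for labeling."""
    rng = random.Random(seed)
    queried = 0
    for _ in range(n):
        # Synthetic failure probability, mostly near 0 or 1 (assumption).
        p = rng.betavariate(0.3, 0.3)
        if should_query([p, 1.0 - p], threshold):
            queried += 1
    return queried / n

query_rate = simulate_stream()
print(f"fraction of samples sent for labeling: {query_rate:.3f}")
```

In an online setting the queried samples would then be used to update the model, so labeling effort concentrates on the region where drift makes predictions uncertain.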

[LG-26] Toward an Energy-Optimized Operation of Data Centers Located in Wind Farms Using Reinforcement Learning

Link: https://arxiv.org/abs/2606.30316
Authors: Jan Stenner, Alexander Kilian, Sebastian Peitz, Hermann de Meer
Categories: Machine Learning (cs.LG)
Comments: 27 pages, 7 figures, 2 tables

Abstract:This paper studies Reinforcement Learning as an online controller for curtailment-aware workload shifting in wind-turbine-integrated high-performance computing (HPC) data centers. We introduce a reproducible fixed-day simulation framework with synthetic wind and price signals and delayed completion feedback, designed to be extensible toward more complex scenarios. As a controlled benchmarking basis, we then focus on the minimal case with one wind turbine and one co-located data center. In this setting, pure Reinforcement Learning exhibits a pronounced credit-assignment problem and tends to underuse free wind energy early in the day. We therefore evaluate two complementary countermeasures: optimization-based Imitation Learning and potential-based Reward Shaping. Across multi-seed training and a 200-day test set, Proximal Policy Optimization (PPO) and a Soft Actor-Critic (SAC) variant with an additional on-policy update routine achieve strong empirical performance among learned policies, and both Imitation Learning and Reward Shaping provide improvements in relevant configurations. A performance gap to the optimizer remains, which is expected: the optimizer plans offline with full-day foresight, whereas Reinforcement Learning must decide online from current observations without future realizations. The benchmark and ablation results provide a transparent basis for extending the approach toward richer multi-site and continuous-time scenarios.

[LG-27] B3O: Scalable Boltzmann Batch Bayesian Optimization

Link: https://arxiv.org/abs/2606.30228
Authors: Maximilian Bloor, Liyuan Xu, Hrvoje Stojic, Victor Picheny
Categories: Machine Learning (cs.LG)

Abstract:Modern engineering workflows increasingly rely on massive parallel simulation, driving the need for scalable, large-batch Bayesian Optimization (BO). Existing batch BO methods, however, incur large computational cost or rely on approximations that erode batch diversity. We propose B3O (Boltzmann Batch Bayesian Optimization), a framework that reframes batch generation as a pure sampling problem: drawing samples directly from the Boltzmann distribution defined by the acquisition function avoids the bottlenecks of existing large-batch methods. Theoretically, we prove that queries sampled from this distribution incur only negligible additional regret. Empirically, B3O outperforms existing batch BO methods on standard synthetic benchmarks and adapts robustly across complex applied tasks, including multi-objective electrode design and mixed-variable race car configuration.
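The idea of treating batch selection as sampling from a Boltzmann distribution over the acquisition function can be illustrated on a finite candidate set. This is only a sketch under stated assumptions: the abstract does not say how B3O samples, so the discrete grid, the temperature value, the without-replacement scheme, and the names `boltzmann_weights`, `sample_batch`, and `toy_acquisition` are all hypothetical.

```python
import math
import random

def boltzmann_weights(acq_values, temperature):
    """Unnormalized weights exp(a / T), shifted by the max for stability."""
    a_max = max(acq_values)
    return [math.exp((a - a_max) / temperature) for a in acq_values]

def sample_batch(candidates, acq_values, batch_size, temperature=0.1, seed=0):
    """Draw a batch without replacement with probability proportional to exp(a / T)."""
    rng = random.Random(seed)
    pool = list(zip(candidates, boltzmann_weights(acq_values, temperature)))
    batch = []
    for _ in range(min(batch_size, len(pool))):
        r = rng.uniform(0.0, sum(w for _, w in pool))
        acc = 0.0
        for i, (x, w) in enumerate(pool):
            acc += w
            if r <= acc:
                batch.append(x)
                pool.pop(i)
                break
        else:
            # Floating-point edge case: fall back to the last candidate.
            batch.append(pool.pop()[0])
    return batch

def toy_acquisition(x):
    """Two-peaked toy acquisition function (assumption, for illustration)."""
    return math.exp(-((x - 0.2) ** 2) / 0.005) + 0.8 * math.exp(-((x - 0.7) ** 2) / 0.005)

grid = [i / 200 for i in range(201)]
values = [toy_acquisition(x) for x in grid]
batch = sample_batch(grid, values, batch_size=8, temperature=0.2)
print(sorted(batch))
```

Lower temperatures concentrate the batch around the highest acquisition values, while higher temperatures spread it across both peaks, which is one simple way to see how sampling can preserve batch diversity.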

[LG-28] Characterizing Optimizer-Dependent Training Dynamics Through Hessian Eigenvector Displacement and Localization ICML2026

Link: https://arxiv.org/abs/2606.30226
Authors: Marcelina Marjankowska, Valerio Modugno, Paolo Barucca
Categories: Machine Learning (cs.LG)
Comments: Accepted as a poster at High-dimensional Learning Dynamics (HiLD), ICML 2026. OpenReview: this https URL

Abstract:Hessian spectral properties are a standard tool in analysing neural-network training, with eigenvalues linked to sharpness, generalization, and optimization dynamics. Eigenvalues quantify curvature magnitude, while eigenvectors identify which parameters generate that curvature. In this work, we study how the leading Hessian eigenvectors evolve during training and how they affect the learning trajectories. We track the training dynamics of multilayer perceptrons on a classification problem and measure eigenvector dynamics through two complementary statistics: (i) displacement over time, inspired by analyses of glassy systems, and (ii) localization via the inverse participation ratio. The metrics are compared against a random null model of the Hessian induced by the architecture. Our results reveal clear optimizer-dependent behaviour. SGD leads to progressively more stable leading curvature directions, while Adam exhibits substantially stronger reorganization of eigenvectors throughout training. We also observe a localization phenomenon under Adam, where a small subset of parameters contributes disproportionately to the leading curvature directions. These results suggest that Hessian eigenvector dynamics capture key differences in optimizer behaviour and the resulting training trajectories.
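The inverse participation ratio (IPR) used as the localization statistic has a simple closed form: for a vector v it is the sum of v_i^4 divided by the square of the sum of v_i^2. The sketch below computes it for two hand-made vectors rather than actual Hessian eigenvectors; the function name and the test vectors are assumptions for illustration.

```python
import math

def inverse_participation_ratio(v):
    """IPR = sum(v_i^4) / (sum(v_i^2))^2; ranges from 1/n (spread out) to 1 (one entry)."""
    norm_sq = sum(x * x for x in v)
    return sum(x ** 4 for x in v) / (norm_sq ** 2)

n = 100
delocalized = [1.0 / math.sqrt(n)] * n   # weight shared equally by all parameters
localized = [1.0] + [0.0] * (n - 1)      # all weight on a single parameter

print(inverse_participation_ratio(delocalized))
print(inverse_participation_ratio(localized))
```

A value near 1/n means every parameter contributes equally to the curvature direction, while a value near 1 means a handful of parameters dominate, which is the kind of localization the abstract reports under Adam.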

[LG-29] Robust Strategic Classification under Decision-Dependent Cost Uncertainty ICML2026

Link: https://arxiv.org/abs/2606.30136
Authors: Sura Alhanouti, Güzin Bayraksan, Parinaz Naghizadeh
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Comments: 29 pages, 7 figures, accepted for publication at ICML 2026

Abstract:Humans facing algorithmic decision systems have been found to “game” them by altering their input data (at a cost to them) in order to favorably change the algorithmic outcomes they receive (at a cost to the algorithm). The growing literature on strategic classification seeks to develop robust machine learning algorithms that account for, and reduce, unwanted strategic behavior. A limitation of these existing works is that they assume the cost of strategic behavior to be fixed and independent of the classifier’s decision. In practice, however, manipulation costs evolve and depend on past algorithmic decisions: today’s decisions influence tomorrow’s costs. This paper proposes and analyzes a two-stage robust optimization framework with a decision-dependent uncertainty set to capture such dependencies. We highlight that awareness of policy-dependent costs not only reduces uncertainty, but also better curtails gaming of the algorithmic system over time.

[LG-30] Predictive Objectives Discard Exogenous Control-Relevant Features: A Controlled Mechanistic Study

Link: https://arxiv.org/abs/2606.30068
Authors: Ayan Pendharkar
Categories: Machine Learning (cs.LG)
Comments: 15 pages, 3 tables, 5 figures. For the associated GitHub repo, see this https URL

Abstract:Joint-embedding predictive (JEPA-style) objectives learn representations by predicting future latents. In doing so they can discard features that are exogenous (uncontrollable by the agent) yet control-relevant, even when those features are trivially encodable. This occurs because the objective optimizes temporal predictability rather than control-relevance. We isolate this failure mode in a controlled 2x2 experimental design that varies feature controllability and relevance independently, using a predictability knob that decouples a feature’s temporal predictability from its control-relevance. Comparing six objectives: reconstruction, JEPA, action-conditioned JEPA, controllability-based JEPA, inverse dynamics under a random policy, and reward-grounded JEPA, we observe that all evaluated reward-free predictive objectives leave the exogenous control-relevant feature near chance accuracy, while a reward-grounded variant retains it selectively. The remedy is label-efficient and robust: as little as 2% of reward-labeled transitions recovers the feature, the effect holds across two environments with different surface forms, and it persists across latent dimensions from 16 to 1024. Comparing the learned latent geometry against bisimulation theory’s prediction, the JEPA latent realizes only a small fraction of the class separation a supervised reference attains.

[LG-31] Data-Driven Energy-Based Learning via Gibbs Measures on Hierarchical Structures

Link: https://arxiv.org/abs/2606.30064
Authors: L.U. Abdullaev, F. Herrera, U.A. Rozikov, M.V. Velasco
Categories: Machine Learning (cs.LG); Probability (math.PR)
Comments: 35 pages, 5 figures

Abstract:We introduce a data-driven probabilistic framework for learning systems based on Gibbs measures on hierarchical structures. Unlike standard empirical risk minimization, where a dataset is used to identify a single optimal parameter, our approach transforms the empirical loss function into an interaction potential defining an energy-based model. The resulting Gibbs distribution describes a family of equilibrium learning states generated by the data. We formulate the consistency conditions of the associated finite-volume distributions and derive nonlinear integral fixed-point equations whose solutions characterize the admissible learning states. These equations provide a rigorous connection between empirical loss landscapes and probabilistic inference on trees. For translation-invariant solutions, the problem reduces to the analysis of positive compact operators induced by data-dependent kernels, allowing us to establish existence and uniqueness conditions in the one-dimensional setting. Furthermore, we show that hierarchical learning systems may exhibit phase-transition phenomena: for certain empirical kernels on Cayley trees, multiple Gibbs measures emerge beyond a critical inverse temperature, corresponding to distinct equilibrium prediction regimes. Numerical experiments with non-separable kernels illustrate the appearance of multiple solution branches and demonstrate the coexistence of several data-induced learning states. Our results provide a new perspective on energy-based learning, where data do not merely determine an optimal model through minimization but define an entire probabilistic landscape of possible inference states. 

[LG-32] From Failure Taxonomy to Intervention: A Diagnostic Methodology for Industry-Scale AVLM in Video and Live-Streaming Platform Moderation

Link: https://arxiv.org/abs/2606.30059
Authors: Shuchang Ye, Jinqiang Yu, Zhujun Xiao, Yajing Kong, Yist Y. Lin, Yang Ma, Jiaxi Liu, Xiaolei Xu, Zheng Yu
Categories: Machine Learning (cs.LG)

Abstract:Industry-scale video and live-streaming moderation imposes requirements that are difficult to satisfy with generic pretrained public models or external APIs, including adaptation to platform-specific data distributions, policy-specific objectives, and product-level safety constraints. As a result, platforms must undertake internal model development, naturally turning to shared public research for guidance. However, existing multimodal foundation-model studies primarily report architectures, training recipes, data scaling strategies, and benchmark results, but provide less systematic guidance on how failures should be localized and translated into targeted model-development interventions. Interventions are essential because deployment failures are rarely self-explanatory. Similar failures can originate from different causes. Without targeted interventions, improvement reduces to heuristic trial-and-error, where benchmark improvements are weakly attributable, and failures are difficult to trace to their underlying causes. To address this gap, we present a diagnostic methodology for industry-scale Audio-Visual-Language Model (AVLM) development. The methodology maps model failures into a taxonomy of observable failure signatures and links each class of failure to an intervention space. We instantiate this methodology across the development and alignment lifecycle of an AVLM foundation model for a large-scale video and live-streaming platform. The resulting system supports over 100 regions and is designed for noisy, ambiguous, and highly diverse content drawn from global platform traffic.

[LG-33] Building Multi-Task Agentic LLMs via Two-Phase Distillation

Link: https://arxiv.org/abs/2606.30044
Authors: Huaijie Wang, Shusheng Xu, Yi Wu, Kaifeng Lyu
Categories: Machine Learning (cs.LG)

Abstract:A key step toward artificial general intelligence is to train models that can perform multiple tasks. In this paper, we study how to build such models by first training separate RL experts for individual tasks and then consolidating them via distillation, as an alternative to directly training a single model on mixed tasks. We show that off-policy distillation degrades in multi-task settings due to the mode-covering nature of forward KL: aggregating data from multiple tasks introduces a large number of behavioral modes that can exceed the student’s capacity, forcing it to average across behaviors and leading to degraded performance. In contrast, on-policy distillation is mode-seeking but requires strong initialization. Inspired by these observations, we propose a two-phase approach: off-policy distillation followed by on-policy refinement. Evaluation across conversational agents and text-based games confirms that this two-phase approach matches single-task RL expert performance for each individual task, whereas off-policy or on-policy distillation alone fails to match this performance.
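The contrast between mode-covering forward KL and mode-seeking reverse KL can be seen in a toy setting. The sketch below is not the paper's experiment: it assumes a one-dimensional "teacher" with two Gaussian modes on a grid (standing in for two task experts) and a "student" restricted to a single fixed-width Gaussian (standing in for limited capacity). All names and numbers are hypothetical. It searches for the student mean that minimizes each divergence.

```python
import math

GRID = [i / 10 for i in range(-60, 61)]   # support from -6.0 to 6.0

def normalize(w):
    total = sum(w)
    return [x / total for x in w]

def gaussian_on_grid(mu, sigma):
    """Discretized Gaussian, normalized to sum to one on GRID."""
    return normalize([math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) for x in GRID])

def kl(p, q):
    """KL(p || q) for discrete distributions on GRID."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Teacher: two behavioral modes with weights 0.6 and 0.4 (assumption).
teacher = [0.6 * a + 0.4 * b
           for a, b in zip(gaussian_on_grid(-2.0, 0.5), gaussian_on_grid(2.0, 0.5))]

def best_mean(objective):
    """Grid search over the student's mean; its width is fixed at 0.5."""
    return min(GRID, key=lambda mu: objective(gaussian_on_grid(mu, 0.5)))

mu_forward = best_mean(lambda q: kl(teacher, q))   # forward KL: KL(teacher || student)
mu_reverse = best_mean(lambda q: kl(q, teacher))   # reverse KL: KL(student || teacher)
print(f"forward KL picks mean {mu_forward}, reverse KL picks mean {mu_reverse}")
```

Under these assumptions the forward-KL student settles between the two modes, where the teacher places almost no mass, while the reverse-KL student commits to the heavier mode. This mirrors the averaging behavior the abstract attributes to off-policy distillation when the student cannot represent every mode.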

[LG-34] Heads Not Backbones: Output Heads Dominate Architectures on Fat-Tailed Returns

Link: https://arxiv.org/abs/2606.30037
Authors: Sichao He, Yansong Zhang
Categories: Machine Learning (cs.LG); Risk Management (q-fin.RM); Statistical Finance (q-fin.ST)
Comments: Code and data: this https URL

Abstract:In a deep forecasting pipeline for fat-tailed financial returns at short horizons, which matters more: the backbone architecture or the output head? We compare four modern backbones (TimesNet, DLinear, N-BEATS, iTransformer) under three output heads: a point head, a single-Gaussian density head, and a Gaussian mixture density head with K=4 components. On S&P 500 monthly log-returns (1871-2023) under anchored walk-forward validation, the three heads form a strict gradient: switching from point to Gaussian improves CRPS by about 1.3 percent; switching from Gaussian to mixture adds about 2.4 percent more. Switching between backbones, in contrast, changes CRPS by less than 1.5 percent on the point-head row and on the backbone-mean axis; density-head backbone spread is larger (up to 5.1 percent on the h=1 Gaussian row, driven by N-BEATS) but the head gradient (3.7 percentage points) still dominates. The Model Confidence Set on squared errors does not exclude any of the 12 variants at the 5 percent level: the head separates them only on distributional metrics (CRPS, pinball, coverage), not on squared error. The mixture head’s incremental value over a single Gaussian is largest in the highest-volatility regimes (13.9 percent in 1970s stagflation at h=12), confirming the mixture captures tail risk beyond what a unimodal Gaussian can express. The picture is horizon-dependent: the head dominates at short horizons, but at long horizons (h = 6) the backbone retakes the lead, an h-split we document against classical baselines (section 5.1). We conclude that on fat-tailed returns at short horizons, the head dominates the backbone, and the mixture distribution adds genuine value over a single Gaussian during crisis periods when risk-management decisions actually matter.
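CRPS, the metric that separates the heads, has a well-known closed form for a Gaussian forecast, and for a point forecast it reduces to the absolute error. The sketch below implements both as a minimal illustration; the function names and the example numbers are assumptions, and the mixture head from the abstract is not covered.

```python
import math

def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of a N(mu, sigma^2) forecast against outcome y."""
    z = (y - mu) / sigma
    return sigma * (z * (2.0 * std_normal_cdf(z) - 1.0)
                    + 2.0 * std_normal_pdf(z)
                    - 1.0 / math.sqrt(math.pi))

def crps_point(forecast, y):
    """CRPS of a point forecast is the absolute error."""
    return abs(forecast - y)

# Hypothetical monthly log-return of -8% against a forecast centered at +0.5%.
y = -0.08
print(crps_point(0.005, y))
print(crps_gaussian(0.005, 0.04, y))
```

Because the Gaussian head is credited for placing probability mass near the outcome, a reasonably wide predictive distribution can score better than the point forecast with the same center when a large move occurs.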

[LG-35] Atompack: A Storage and Distribution Layer for Read-Heavy Atomistic ML Training Datasets

Link: https://arxiv.org/abs/2606.29975
Authors: Ali Ramlaoui, Daniel T. Speckhard, Sagar Pal, Fragkiskos D. Malliaros, Alexandre Duval, Victor Schmidt
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)

Abstract:Atomistic machine learning datasets are increasingly used for training: large immutable snapshots are read repeatedly, shuffled across epochs, staged across clusters’ storage systems, and republished as reusable scientific artifacts. This workload differs from interactive scientific curation, where mutable records and ad hoc inspection are often more important than random indexed throughput. We present Atompack, an append-oriented storage format and distribution layer designed around a simple workload: training pipelines usually consume complete molecular records, while the order of records is randomized by the learning algorithm. Atompack appends records efficiently during dataset construction, then commits an immutable index and serves records through a memory-mapped read path optimized for training. We compare Atompack with HDF5, LMDB, and ASE baselines representing array stores, key-value records, serialized records, and object-oriented databases. The benchmarks measure sequential reads, shuffled reads, shared-filesystem behavior, write throughput, and artifact size. On a representative 64-atom workload, Atompack is 96x faster than ASE LMDB on shuffled training-style reads while producing artifacts about 79% smaller. The results indicate that serving complete molecule records, rather than field chunks or reconstructed objects, improves shuffled training throughput while keeping artifacts compact enough for public distribution.

[LG-36] NeuReasoner: Theory-grounded Mapping of Reasoning Elicitation Boundaries

Link: https://arxiv.org/abs/2606.29971
Authors: Aydin Javadov, Shyngys Aitkazinov, Tobias Hoesli, Florian von Wangenheim, Bjoern Schuller, Joseph Ollier
Categories: Machine Learning (cs.LG)

Abstract:A growing body of work suggests that the reasoning capabilities of large language models are largely latent in their base form, with post-training primarily amplifying rather than introducing them. However, this evidence comes mainly from mathematical and coding benchmarks, leaving the boundary conditions of that claim largely unexplored, namely which cognitive tasks can be recovered through elicitation and where that recovery fails. To investigate this, we introduce NeuReasoner, a theory-grounded elicitation instrument. At each step, an orchestrator pairs a Neuro Lens, inspired by functional specificity, with a Cognitive Lens, drawn from the Erotetic Theory of Reasoning, and integrates their outputs through internal modularization of a single model, without external tools. We evaluate NeuReasoner on CogBench, a suite of behavioral tasks from cognitive psychology, alongside standard mathematical and coding benchmarks, measuring both its improvement over vanilla inference and its ability to match a model’s post-trained thinking mode. At sufficient scale, NeuReasoner matches or exceeds thinking-mode baselines on arithmetic reasoning, code generation, Bayesian reasoning, and reward learning; these gains persist against self-consistency and iterative-refinement baselines matched to NeuReasoner’s per-decision call budget. Using NeuReasoner allows us to find clear boundaries: risk-taking and decision making under uncertainty remains hard to recover through elicitation alone, and model scale interacts with elicitation in both directions: widening its advantage on some cognitive signatures while erasing it on others. Overall, through NeuReasoner as a modular, interpretable, theory-grounded elicitation instrument, we empirically map where reasoning elicitation succeeds and fails, beyond the mathematical and coding benchmarks where prior claims have rested.

[LG-37] Improved Predictive Performance and Interpretability for Mesomorphic Neural Networks Using Local Fidelity Regularization

Link: https://arxiv.org/abs/2606.29951
Authors: Hugo L. Hammer, Vajira Thambawita, Kristoffer Herland Hellton, Pål Halvorsen
Categories: Machine Learning (cs.LG)

Abstract:Interpretable Mesomorphic Neural Networks (IMNs) offer a promising framework that combines the predictive power of deep neural networks with the interpretability of linear models. However, the original formulation lacks safeguards to ensure that the learned interpretations are in fact reliable. In particular, the network is free to concentrate all explanatory variance into a single weight of the linear output layer, achieving strong predictive performance while producing interpretations that are largely meaningless. Paradoxically, the L1 penalty proposed to encourage sparse solutions exacerbates this problem by further incentivizing such degenerate configurations. To address this vulnerability, we introduce Local Fidelity Regularization (LFR), a novel penalty term that prevents degenerate weight collapse by aligning the linear output weights with local data variations. This structural constraint guarantees faithful explanations and substantially improves the reliability of model interpretations. Furthermore, empirical evaluations across the OpenML benchmark suite demonstrate that LFR does not compromise accuracy for explainability; rather, it achieved improved AUROC over the unregularized IMN. By yielding results highly competitive with state-of-the-art black-box models, LFR provides the dual benefit of reliable interpretability and superior predictive performance. Source code and usage instructions are available at this https URL.

[LG-38] Bandwidth Selection in Kernel Density Estimation for Model Calibration

Link: https://arxiv.org/abs/2606.29925
Authors: Han Zhou, Teodora Popordanoska, Matthew Blaschko
Categories: Machine Learning (cs.LG)

Abstract:As deep learning models are increasingly deployed in high-stakes applications, providing well-calibrated uncertainty estimates has become as critical as achieving high predictive accuracy. While Kernel Density Estimation (KDE) has emerged as a smooth and continuous alternative to traditional binning for quantifying miscalibration, its reliability is heavily dependent on the choice of the kernel bandwidth. Standard selection techniques, such as Maximum Likelihood Estimation (MLE), often fail to produce optimal bandwidths for calibration tasks. In this work, we introduce Risk Alignment (RA), a novel optimization framework that determines the optimal bandwidth by aligning KDE-reconstructed risk with empirical risk. We theoretically demonstrate that this alignment minimizes calibration estimation bias across the data distribution, establishing a principled bandwidth selection criterion applicable to various metrics, including the challenging case of canonical calibration error. Extensive experiments across multiple architectures and datasets show that RA consistently outperforms standard bandwidth selection methods, yielding more reliable calibration assessments.

[LG-39] Golden Hour Divide: Trauma Care Accessibility and Resource Vulnerability in Sri Lanka

Link: https://arxiv.org/abs/2606.29889
Authors: Sonath Kirindage, Vihanga Nimsara, Sakindu Rajapaksa, Kavyanga Hathurusinghe, Lahiru Dilshan, Subavarshana Arumugam, Nathali Athukorala, Sandareka Wickramanayake, Nisansa de Silva
Categories: Machine Learning (cs.LG)
Comments: 6 pages, 5 figures. Accepted for presentation at MERCon 2026. Preprint version

Abstract:Timely intensive care dictates survival, yet emergency infrastructure remains unevenly distributed across Sri Lanka. While pre-hospital services have expanded, the transition to definitive care remains a critical bottleneck. This study evaluates national emergency resilience by quantifying the gap between clinical demand and the availability of specialized resources across all 25 districts. Using the latest national epidemiological data and terrain-aware H3 hexagonal modeling, we analyzed accessibility for seven critical conditions based on spatial gaps, clinical need-gaps, lethality, coverage, and resource availability. Based on these metrics, unsupervised K-Means clustering was applied to categorize districts into four policy-actionable archetypes: Critical Structural Exclusion, Institutional Mirages, Operational Capacity Strain, and High-Resilience Benchmarks. Our study suggests that severe service deficits exist in the Northern and Eastern provinces, where spatial gaps exceed 70%, rendering the Golden Hour operationally impossible. Notably, specialist scarcity drives systemic pressure more than bed capacity; underserved regions effectively function as institutional mirages. This study suggests that improving accessibility by 25% in high-priority clusters would reduce the national need-gap by 9.65%, providing a roadmap for the strategic redistribution of specialists to ensure healthcare equity.

[LG-40] Decision-Value Attribution in Predict-then-Optimize Systems

Link: https://arxiv.org/abs/2606.29878
Authors: Konstantinos Ziliaskopoulos, Alexander Vinel, Alice E. Smith
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

Abstract:Predictive models are increasingly embedded in operational decision-making, yet standard explanation methods typically explain forecasts rather than the decisions those forecasts induce. This distinction is important in predict-then-optimize systems: large forecast changes may leave the optimizer’s action unchanged, while small changes can alter the selected decision and its realized value. We propose Decision Value Attribution (DVA), a Shapley-based framework for attributing the value of a fixed prediction–optimization pipeline. The framework defines cooperative games whose payoff is the downstream decision value, allowing the players to be information sources, optimization or design parameters, or both. We present three variants: InfoDVA attributes value to features, DesignDVA attributes value to operational configurations, and Decision-Value Interactions (DVI) quantifies how information and design jointly create value. We further distinguish post-DVA, which evaluates decisions using realized outcomes, from pre-DVA, which evaluates decisions under the model’s full prediction. This separation turns attribution into a decision-level diagnostic of whether the model’s operational beliefs align with realized performance. The resulting attributions are expressed in the units of the operational objective and decompose the gain or loss relative to a baseline. Case studies in electricity storage arbitrage and emergency medical service coverage show that predictive explanations can be poor proxies for operational value, that DVA can guide targeted information-control interventions, and that optimization configurations determine when predictive information is decision-relevant.
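A Shapley attribution whose payoff is decision value rather than forecast accuracy can be sketched on a tiny cooperative game. This is an illustration under stated assumptions, not the paper's DVA definition: the three features, their additive forecast effects, the threshold rule standing in for the optimizer, and the names `shapley_values` and `decision_value` are all hypothetical. The payoff uses the realized price, loosely in the spirit of the post-DVA variant.

```python
from itertools import combinations
from math import factorial

# Hypothetical additive effect of each information source on the price forecast.
FEATURE_EFFECT = {"weather": 3.0, "calendar": 0.5, "lagged_price": 2.0}
BASE_FORECAST = 10.0
COST = 14.0             # the toy optimizer acts only if the forecast exceeds this
REALIZED_PRICE = 16.0   # outcome used to value the decision

def decision_value(features):
    """Value of the action induced by a forecast built from the given features."""
    forecast = BASE_FORECAST + sum(FEATURE_EFFECT[f] for f in features)
    act = forecast > COST
    return (REALIZED_PRICE - COST) if act else 0.0

def shapley_values(players, value):
    """Exact Shapley values by enumerating all coalitions."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                s = frozenset(subset)
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi

phi = shapley_values(list(FEATURE_EFFECT), decision_value)
print(phi)
```

In this toy game the calendar feature shifts the forecast but never changes the action, so it receives zero decision value, while weather and lagged price split the realized gain because the action flips only when both are present. Exact enumeration is exponential in the number of players, so anything beyond a handful of features would need sampling.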

[LG-41] Implementation of Hyperelastic Physics-Augmented Neural Networks in the Explicit Finite Element Codes Simcenter Radioss and OpenRadioss with Applications to Impact Events

Link: https://arxiv.org/abs/2606.29874
Authors: Lukas Maurer, Sascha Eisenträger, Marian Bulla, Daniel Juhre
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments: 26 pages, 11 Figures, 11 Listings, 4 Tables

Abstract:Data-driven material modeling techniques have gained significant attention due to their ability to capture complex constitutive behaviors beyond the limitations of classical material models. Physics-augmented neural networks (PANNs), which embed physical constraints directly into their architecture, combine the flexibility of machine learning with the reliability required for engineering simulations. This work presents an approach to integrate such network architectures into the explicit finite element solvers Simcenter Radioss and OpenRadioss (Siemens). A framework for transferring pretrained network architectures and their parameters to a standalone user material routine is developed. Networks are trained using PyTorch, though the procedure can be adapted to other frameworks such as TensorFlow, enabling the use of PANNs within existing finite element technology without requiring specialized solvers. Particular emphasis is placed on computational efficiency. The influence of network architecture on simulation performance is investigated, and strategies for reducing evaluation costs while preserving accuracy are discussed. Specifically, replacing the SoftPlus activation function with SQuarePlus is shown to reduce computational cost. A publicly available GitHub repository automates the generation of Fortran user material routines, requiring only the specification of the network architecture and trained parameters. An example impact simulation demonstrates that the generated PANN user material reproduces the nonlinear behavior characteristic of hyperelastic materials under large strains, providing a practical route toward machine-learning-based constitutive models in explicit finite element simulations.
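The activation swap mentioned in the abstract can be made concrete. The sketch below compares SoftPlus with Squareplus in its commonly cited form, (x + sqrt(x^2 + b)) / 2. The choice b = 4 is an assumption, since the abstract does not state the hyperparameter, and the code is plain Python rather than the generated Fortran routine.

```python
import math

def softplus(x):
    """log(1 + exp(x)), written in a numerically stable form."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def squareplus(x, b=4.0):
    """Algebraic smooth ReLU: needs one square root, no exp or log."""
    return 0.5 * (x + math.sqrt(x * x + b))

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:+.1f}  softplus={softplus(x):.4f}  squareplus={squareplus(x):.4f}")
```

Both functions are smooth, positive, and monotonically increasing, and both approach x for large positive inputs. Squareplus avoids transcendental function calls, which is a plausible source of the cost reduction when the activation is evaluated at every integration point in every explicit time step.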

[LG-42] Comparing Chatbot Performance Enhanced with Persistent Homology

Link: https://arxiv.org/abs/2606.29857
Authors: Nithisha Raghavaraju, Barbara Giunti, Bastian Rieck
Categories: Machine Learning (cs.LG); Algebraic Topology (math.AT)

Abstract:Chatbots have become increasingly prevalent across various domains, offering automated assistance in many areas, especially mental health support. The training is done using extremely large datasets, which are sometimes not available in very specific domains. Moreover, it would sometimes be ideal to train the chatbot with personal information about the patients, which, of course, cannot be done on shared servers since it would violate patient confidentiality. Hence, being able to improve the performance of a chatbot, possibly trained locally and on a restricted dataset, without having to increase the dataset itself, would be extremely beneficial. In this work, we will enhance the input datasets using persistent homology (PH) vectorizations computed from the raw datasets themselves. Then we will compare, across several metrics, the performance of multiple chatbot models with or without the PH enhancement. Our experiments suggest that, while at times the PH enhancement is not particularly beneficial, it sometimes brings remarkable advantages for virtually no cost.

[LG-43] Theory of Continual Learning Against Data Poisoning Attacks

Link: https://arxiv.org/abs/2606.29841
Authors: Yiting Hu, Lingjie Duan
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)

Abstract:Continual learning (CL), where a model is trained on a sequence of data tasks, is increasingly being adopted across key fields such as large language models and image recognition, yet it remains highly vulnerable to data poisoning that triggers learning divergence or severe excess risk. Despite these threats, a principled theoretical foundation in CL for understanding attack and defense remains lacking. In this paper, we develop a theoretical framework to analyze strategic attacks and defenses in regularization-based CL, a cornerstone of recent CL theory. By framing the adversary-defender interaction as an online zero-sum game, we first establish a fundamental performance limit: no defense succeeds when an adversary poisons a linear proportion of tasks by injecting unbounded noise or pattern shifts in regularization-based CL. We then analyze two possibly defensible scenarios: infrequent attacks and bounded noise per attack. For the former regime, we propose a task-to-task verification mechanism to detect data poisoning and reduce cumulative bias for learning convergence. For the latter regime, we derive a robust defense that minimizes the model’s sensitivity to poisoned features, provably accelerating the convergence rate. Extensive experiments on realistic tasks further validate our theoretical results.

[LG-44] The Forgetting-Retention Dilemma: Certified Unlearning Theory in Continual Learning ICML2026

Link: https://arxiv.org/abs/2606.29832
Authors: Yiting Hu, Lingjie Duan, Qian Zhang
Categories: Machine Learning (cs.LG)
Comments: ICML2026

Abstract:Machine unlearning aims to eliminate the influence of specific data from trained models to safeguard privacy. However, this presents a significant challenge in the context of continual learning (CL), where models update sequentially on dynamic datasets. A major limitation is that current certified unlearning algorithms fail to account for the complex, cumulative model evolution inherent to the CL framework. In this work, we establish the first theoretical foundation bridging CL and machine unlearning. We formulate the CL unlearning objective as the minimization of post-unlearning excess risk, which decomposes into CL excess risk and unlearning loss, characterizing the fundamental trade-off between preserving historical knowledge and targeted forgetting. Under mild assumptions, we first establish an upper bound for the CL excess risk in non-convex models. We then adapt two certified unlearning approaches, gradient-based and Hessian-based, to the CL framework. Our analysis reveals that while the gradient-based approach is less effective than the Hessian-based method in minimizing unlearning loss, it offers the distinct advantage of nearly zero storage overhead for enabling unlearning. This insight motivates a hybrid strategy that reduces storage costs while maintaining post-unlearning performance. Experimental results further validate our theoretical findings.

[LG-45] MemLeak: Diagnosing Information Leaks in Multimodal Agent Memory

Link: https://arxiv.org/abs/2606.29788
Authors: Kuan Wang, Chao Zhang
Categories: Machine Learning (cs.LG)
Comments: 23 pages, 3 figures, includes appendix

Abstract:When a multimodal AI agent is asked to forget a fact, current memory systems usually delete the text entry and report success. We find that the fact can remain recoverable from retained user images, including images tagged to entirely different facts, because VLMs use implicit visual cues at inference time. We introduce the Information Provenance Graph (IPG), a taxonomy that classifies memory representations by deletion affordance. The IPG reveals that deletion fails through multiple channels. Our benchmark, MemLeak, measures this across a deletion cascade: direct probing of deletion-capable systems yields 1%, but retained correlated text enables 18.3% recovery, and retained images enable 12.0% recovery (0.0% blind baseline, 0.3% FPR) – with 47% of image leaks not text-recoverable. Content-aware semantic deletion reduces the image residual to 2.0%. The residual appears across multiple VLMs, a production memory system, and real Unsplash-licensed photographs. Dual-annotator human validation (kappa = 0.88) confirms judge reliability.

[LG-46] GLIP: Graph and LLM Joint Pretraining for Graph-Level Tasks

Link: https://arxiv.org/abs/2606.29773
Authors: Haoxin Sun, Yiqing Lin, Yajun Huang, Chenhui Dong, Mingjun Li, Zhongzhi Zhang
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Graphs are widely used to model relational systems, with applications in domains such as social networks, finance, and biomedicine. Graph neural networks (GNNs) have become a mainstream approach for learning graph representations. With the rise of large language models (LLMs), recent studies have attempted to combine GNNs with LLMs. However, most existing works concentrate on node-level and edge-level tasks, while graph-level tasks, which require capturing more complex structural and feature information, remain relatively underexplored. Moreover, graph pretraining is a widely adopted strategy to alleviate the challenge of label scarcity. Most existing approaches are designed solely for GNNs such as GraphCL, leaving LLMs uninvolved in the process. To address these limitations, we propose GLIP, a Graph-LLM JoInt Pretraining framework for graph-level tasks. GLIP first performs graph augmentation to construct positive and negative pairs and introduces a multi-token selection strategy to identify patches informative in both structure and features. It further leverages a diffusion-based projector to enrich them with contextual information, enabling GLIP to capture signals from both global and local perspectives. Finally, GLIP employs a joint objective that integrates the LLM’s semantic judgments with a contrastive alignment loss, ensuring consistent supervision at both the semantic and structural levels. After pretraining, GLIP is fine-tuned with limited labeled data for downstream tasks, and extensive experiments show that it outperforms state-of-the-art methods on graph-level classification and reasoning tasks. Our source code is publicly available at this https URL.

[LG-47] Optimizing Nursing Care Taxi Dispatch Leveraging Integer Linear Programming Solvers and Machine Learning

Link: https://arxiv.org/abs/2606.29725
Authors: Riku Nakao, Akihito Hiromori, Hamada Rizk, Hirozumi Yamaguchi
Categories: Machine Learning (cs.LG)
Comments: Accepted journal article in IEEE Transactions on Intelligent Transportation Systems. The project page: this https URL

Abstract:In this paper, we formulate a new vehicle dispatch optimization problem, called Nursing Care Taxi Dispatch, as a variant of the Vehicle Routing Problem, considering constraints related to wheelchair use, user compatibility, pick-up and drop-off times, and vehicle limitations. Previous neural-based methods for Vehicle Routing Problems have typically addressed a few simple constraints, while our new problem involves multiple complex constraints, resulting in having fewer destinations to select. This complexity makes it more difficult to obtain solutions that allow all nodes to be visited with a limited number of vehicles. To balance low violation rate, computational efficiency, and solution quality, we propose a supervised machine learning approach based on the Transformer architecture. We first obtain a set of high-quality solutions using an integer linear programming solver for given inputs and then train our learning model through supervised learning. Additionally, we introduce the post-processing of the paths generated by the learning model, ensuring that all constraints are satisfied. We compared each instance’s objective function value (operating time), execution time, and constraint violation rate across different methods: our proposed method and some existing methods including integer linear programming and machine learning-based methods, using real-world facility data. Our method successfully produced balanced solutions regarding operating time, execution time, and constraint violation rate. Notably, we observed a decrease in the operating time for all problem sizes and regions, while keeping constraint violations to a minimum compared to existing methods. Especially, the decrease reached up to 8% for problem sizes with fewer than 30 users.

[LG-48] Simplifying Flow Matching Transformations with Low-Rank Mixture Models

Link: https://arxiv.org/abs/2606.29724
Authors: Liam A. Kruse, Houjun Liu, Alexandros E. Tzikas, Mansur M. Arief, Mykel J. Kochenderfer
Categories: Machine Learning (cs.LG)
Comments: Accepted at CoDIT 2026

Abstract:Normalizing flows are powerful generative models that learn an invertible mapping between complex data distributions and simple latent distributions, typically a standard normal density. However, this choice of latent density can impose unnecessary complexity on the learned flow transformation due to the topological mismatch between the latent and data densities, leading to slower training and suboptimal performance. In this work, we propose using mixtures of probabilistic principal component analyzers (MPPCA) as the latent density for normalizing flows. We simplify the learned flow transformation by learning a latent distribution that more closely aligns with the data distribution in terms of KL divergence, thus enabling faster convergence and improved generative performance. Critically, MPPCA models can be fit quickly and cheaply using the expectation-maximization algorithm, making them a practical choice for initializing latent distributions even in high-dimensional generative tasks. We validate our method on both tabular and image datasets, demonstrating consistent gains in training efficiency and generation quality compared to baselines.

[LG-49] IG-Lens: Exact Additive Probability Attribution Across Transformer Layers via Telescoping Integrated Gradients

Link: https://arxiv.org/abs/2606.29693
Authors: Duc Anh Nguyen
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We ask a simple question about decoder-only transformers: between which two layers is the probability of a predicted token actually produced? Existing layer-wise readout tools answer only approximately. The logit lens and its trained variant report a per-layer level of probability but give no additive decomposition; their estimates are biased and non-monotone across depth. Direct Logit Attribution and related residual-stream methods are additive, but only in logit space – the softmax nonlinearity breaks additivity in probability space, precisely the quantity one usually cares about. Layer Conductance integrates gradients per layer, but attributes each to its own baseline and so does not sum to the total change in prediction. We introduce IG-Lens, a telescoping application of Integrated Gradients along a single path through the hidden states from a baseline to the final layer. Crediting each segment to the layer it terminates at yields a layer-wise attribution whose sum is exactly the change in target probability, with the softmax inside the integration path rather than linearized away. Our default estimator credits each integration step its observed change in target probability – a prediction-aware reweighting in the spirit of IDGI – rather than its raw gradient. Because the readout is a one-dimensional probability, this collapses each segment to a telescoping sum of endpoint values, so completeness holds exactly (to floating point) at any step count, removing Riemann discretization error while suppressing steps that show gradient sensitivity without a change in output. We give the telescoping identity and its proof, verify completeness to floating point, and describe a single-pass batched implementation computing the full token-by-layer map without any backward call. Code: this https URL.
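
The telescoping idea in the abstract can be illustrated with a minimal sketch. It assumes a toy linear readout followed by a softmax and made-up hidden states; the function names, the readout matrix, and the data are all hypothetical and are not taken from the paper's code. Each layer is credited with the observed change in target probability between the state it starts from and the state it ends at, so the credits sum to the total change by construction.

```python
import math

def target_probability(hidden, readout, target):
    """Softmax probability of `target` under a linear readout of `hidden`."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in readout]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    return exps[target] / sum(exps)

def layer_attributions(states, readout, target):
    """Credit each layer the observed change in target probability.

    states[0] is the baseline; states[l] is the hidden state after layer l.
    """
    probs = [target_probability(h, readout, target) for h in states]
    return [probs[l] - probs[l - 1] for l in range(1, len(probs))]

# Hypothetical toy data: 3-token vocabulary, 2-dim hidden states, 3 layers.
readout = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
states = [[0.0, 0.0], [0.5, -0.2], [1.5, 0.1], [3.0, -1.0]]

attr = layer_attributions(states, readout, target=0)
total = (target_probability(states[-1], readout, 0)
         - target_probability(states[0], readout, 0))

# Completeness: per-layer credits sum to the total change in probability.
assert abs(sum(attr) - total) < 1e-12
```

The sketch only covers the endpoint form of the attribution; the choice of baseline, the integration path inside each segment, and the batched implementation described in the abstract are not reproduced here.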

[LG-50] CAREBench: A Child-Safety Risk Benchmark for Language Models

Link: https://arxiv.org/abs/2606.29685
Authors: Kaavya Krishna-Kumar, Elaine Lau, Vaughn Robinson, Jay Caldwell, Sheriff Issaka, Skyler Wang, Francisco Guzmán, Steven Kelling, Jonas Mueller
Categories: Machine Learning (cs.LG)
Comments:

Abstract:How can we evaluate whether frontier AI systems recognize child-safety risks before they escalate into explicit harm? Existing child safety evaluations focus on child sexual abuse material, yet many child-safety failures begin earlier: in model assistance that helps adults manipulate, impersonate, profile, or isolate minors, and in model responses that deepen children’s emotional dependence on AI systems rather than redirecting them toward human support. We introduce CAREBench (Child AI Risk Evaluation), a benchmark to assess such upstream child-safety risks in language models. CAREBench contains 500 prompts spanning twelve risk categories, including grooming and relationship engineering, deception and impersonation, surveillance and privacy, sextortion and sexual abuse, AI anthropomorphization, emotional dependency, and mental illness sensitivity. Developed with response annotations from parents and clinicians, the benchmark excludes explicit abuse material and imagery; instead, it evaluates whether models recognize, refuse, de-escalate, or redirect risky interactions before harm becomes overt. Evaluating seven frontier models on our benchmark, we find failure rates ranging from 2% to 58%, with failure patterns that vary across risk categories. CAREBench provides a responsibly scoped evaluation for LLM developers to identify and close gaps in child safety policies.

[LG-51] Learning as Observable Matrix Dynamics: Diffusive Relaxations versus Phase Transitions

Link: https://arxiv.org/abs/2606.29679
Authors: Igor Halperin
Categories: Machine Learning (cs.LG)
Comments: 54 pages, 30 figures

Abstract:Observable Matrix Dynamics (OMD) is a diagnostic framework that probes the dynamics of high-dimensional internal representations of inputs by a neural network via a fixed-size $N \times N$ distance matrix $M(t)$ on a held set of $N$ inputs. OMD uses methods of random matrix theory and particle dynamics to explore spectral reorganisations that are missed by scalar loss functions, but are informative of the training process. We read $M(t)$ against a perturbative ambient-versus-latent decomposition extending the Bogomolny–Bohigas–Schmit (BBS) theory of random distance matrices, with per-snapshot diagnostics for the top-of-spectrum band structure and ambient noise, trajectory-level observables linking snapshots, and a 3D MDS embedding (bottom-three eigenvectors) rendering training as a moving particle cloud. Across seven experiments, diffusive regimes lack stable top-of-spectrum band structure, while sharp endogenous or externally driven reorganisations produce stable fingerprints: consistent with smooth or product latent geometries in BBS-adjacent cases, and with finite-cluster or Fourier-soliton structures otherwise. OMD thus reads the geometric regime of a representation rather than reporting a single intrinsic dimension.

[LG-52] I-BBS: Coordinate-Free Inference of Latent Sub-Manifolds Using Random Distance Matrix Theory

Link: https://arxiv.org/abs/2606.29675
Authors: Igor Halperin
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (stat.ML)
Comments: 53 pages, 23 figures

Abstract:Bogomolny, Bohigas and Schmit (BBS) found that the spectrum of the pairwise distance matrix on $N$ points sampled from a smooth $d$-dimensional manifold encodes a signature of the underlying geometry. We develop I-BBS (Inference-BBS), a coordinate-free method that identifies a low-dimensional latent sub-manifold embedded in a high-dimensional ambient distance matrix alone, without accessing an ambient high-dimensional vector space. It therefore applies even when that space is only partly observable or undefined. We model the ambient embedding by two classes of generative noise, model-based and model-free. The noise mixes the latent signal with off-manifold components, so the eigenvalues reorganise collectively and the latent geometry cannot be read off eigenvalue by eigenvalue. We recover it instead from two integer-stable signatures that survive the noise: the multiplicity of the top non-Perron multiplet, which fixes $d$, and a parameter-free law for how the multiplet positions shrink as the noise grows. On synthetic spheres $S^1$, $S^2$ and $S^3$ these integer signatures are far more stable under noise than the continuous spectral slope, and a blind test recovers both the manifold and the noise model from a single distance matrix. Applications to neural-network representations and to the dynamic training regime are developed in two companion papers.

[LG-53] t-STEP: An interpretable model for Total Electron Content predictions and irregularities estimations

Link: https://arxiv.org/abs/2606.29644
Authors: Stephen Tete, Carl Shneider, Maxime Cordy, Claudio Cesaroni, Andreas Hein, Vasily Petrov
Categories: Machine Learning (cs.LG); Space Physics (physics.space-ph)
Comments: 40 pages, 15 figures. Note that the article has been published in the Earth Science Informatics Journal

Abstract:Earth system infrastructures relying on satellite-based technologies, such as Global Positioning System (GPS) communications, are affected by ionospheric Total Electron Content (TEC) gradients. Modeling these gradients under physical constraints remains challenging due to their dynamic and transient nature. While existing machine learning (ML) models can predict hourly TEC variations, it remains unclear whether their temporal resolution is sufficient to preserve small-scale TEC irregularities within predicted signals. To address this gap, we introduce an interpretable ML-based model, t-STEP, designed to predict TEC at a 30-second resolution and estimate irregularity signatures from the modeled signals. This high cadence enables the derivation of Rate of TEC changes (ROT) and the ROT Index (ROTI) as diagnostic indicators of ionospheric variability. The model is developed using GPS observations from solar cycle 24 at a station located at 5.49°S, 47.49°W. A multi-metric evaluation framework, including dynamic time warping, is used for robustness assessment, while SHAP (SHapley Additive exPlanations) provides insight into feature contributions. The 30-second TEC predictions achieve 91% accuracy with a mean absolute error (MAE) of 4.38 TECU during high solar activity (2015). Compared with the International Reference Ionosphere (IRI-2020), the hourly model improves accuracy by 35%, reduces absolute errors by 57%, and increases prediction skill by 54%. More importantly, the 30-second model captures TEC irregularity dynamics and morphologies during geomagnetic storms of different intensities, outperforming an attention-based Long Short-Term Memory model under the same experimental conditions. This study demonstrates the potential of a single TEC prediction framework for scalable irregularity monitoring without requiring separate models for individual transient events.
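
As a rough illustration of how ROT and ROTI can be derived from a 30-second TEC series, here is a minimal sketch. It uses the commonly cited definitions (ROT as the per-minute TEC difference, ROTI as the standard deviation of ROT over a sliding window); the 5-minute window (10 samples) and the function names are assumptions, and the abstract does not state which exact variant the authors use.

```python
import math

def rate_of_tec(tec, step_seconds=30.0):
    """ROT in TECU/min from consecutive TEC samples (TECU)."""
    per_minute = 60.0 / step_seconds
    return [(b - a) * per_minute for a, b in zip(tec, tec[1:])]

def rot_index(rot, window=10):
    """ROTI: standard deviation of ROT over a sliding window of samples.

    window=10 corresponds to 5 minutes at a 30-second cadence (assumed).
    """
    out = []
    for i in range(window, len(rot) + 1):
        chunk = rot[i - window:i]
        mean = sum(chunk) / window
        mean_sq = sum(r * r for r in chunk) / window
        # Clamp at zero to guard against tiny negative rounding errors.
        out.append(math.sqrt(max(mean_sq - mean * mean, 0.0)))
    return out

# Hypothetical series: a smooth ramp followed by an abrupt 3 TECU jump.
tec_series = [20.0 + 0.1 * i for i in range(15)] + [24.4 + 0.1 * i for i in range(15)]
roti_series = rot_index(rate_of_tec(tec_series))
```

On a perfectly linear ramp ROT is constant and ROTI is zero; windows that contain the jump yield a positive ROTI, which is why the index is used as an irregularity indicator.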

[LG-54] Boundary Degree as a Node-level Feature for Epidemic Scenario Identification in Agent-based Cascade Simulations

Link: https://arxiv.org/abs/2606.29596
Authors: Amro Alabsi Aljundi, Galen Harrison, Jiangzhuo Chen, Abhijin Adiga, Anil Kumar Vullikanti, Madhav V. Marathe
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Comments: 28 pages, 10 figures, preliminary version; not final

Abstract:Characterizing the scenario underlying an epidemic from its disease cascade is an important task in simulation analytics. We propose boundary degree, the count of an infected node’s contacts in the underlying contact network that were not infected, as a per-node cascade feature for this task. Through systematic ablation on realistic social contact networks of Tennessee and Virginia, we show that boundary degree alone improves scenario identification accuracy by 19%. Edge features, whose importance was observed empirically by prior work, consistently improve accuracy across all settings; we provide theoretical grounding for this observation. These effects are complementary. We prove that certain epidemic scenarios are indistinguishable without boundary or edge information. Prior feature engineering approaches included aggregate boundary statistics, but these were not among the top-ranked feature groups; the per-node representation we propose reveals their importance clearly. Our results suggest that contact tracing applications should track contacts with non-infected individuals, not only transmissions.
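
The boundary degree defined in the abstract (for each infected node, the number of its contacts that were not infected) can be computed directly from a contact network and the set of infected nodes. The sketch below is a minimal illustration; the adjacency-dict representation, the function name, and the toy network are assumptions, not the authors' pipeline.

```python
def boundary_degree(contacts, infected):
    """Map each infected node to its number of non-infected contacts.

    contacts: dict mapping a node to an iterable of its neighbours.
    infected: iterable of nodes that appear in the disease cascade.
    """
    infected = set(infected)
    return {
        node: sum(1 for nbr in contacts.get(node, ()) if nbr not in infected)
        for node in infected
    }

# Hypothetical five-person contact network in which "a" and "b" were infected.
contacts = {
    "a": ["b", "c", "d"],
    "b": ["a", "c"],
    "c": ["a", "b", "e"],
    "d": ["a"],
    "e": ["c"],
}
degrees = boundary_degree(contacts, {"a", "b"})
assert degrees == {"a": 2, "b": 1}
```

The feature needs the full contact list of each infected node, not only the transmission edges, which matches the abstract's point that contact tracing should also record contacts with non-infected individuals.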

[LG-55] STEMGym: Benchmarking Sequential Decision-Making under Dose Budgets in Autonomous Electron Microscopy

Link: https://arxiv.org/abs/2606.29592
Authors: Can Polat, Erchin Serpedin, Mustafa Kurban, Hasan Kurban
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Atomic Physics (physics.atom-ph); Optics (physics.optics); Quantum Physics (quant-ph)
Comments:

Abstract:A central premise of autonomous scientific imaging is that smarter navigation, whether Bayesian, RL-based, or otherwise adaptive, is the principal lever for sample-efficient acquisition. We present evidence to the contrary in scanning transmission electron microscopy (STEM), an atomic-resolution imaging modality whose every measurement deposits damaging electron dose. We introduce STEMGym, an open-source Gymnasium benchmark of 15 physics-simulated STEM worlds spanning five materials, three difficulty levels, and four characterisation tasks, scored by the Dose-Efficiency Curve area (DEC-AUC), a single scalar capturing the information-vs-dose Pareto frontier. Across 33 agent configurations under realistic dose budgets, the dominant determinant of dose efficiency is the analyst (perception) pipeline, not the navigator: pairing a trained CNN analyst with naïve raster scanning raises DEC-AUC by 5.5x over a CNN-free raster baseline (0.287 vs. 0.052), while substituting Bayesian or adaptive finite-state-machine navigation for raster yields no statistically significant further gain. Production-tier vision-language models further underperform task-specific CNNs by ~13x on crystallographic defect analysis. By decoupling perception, navigation, and planning under a unified dose budget, STEMGym reframes where ML effort should be invested in autonomous electron microscopy and provides the measurement infrastructure to test it.

[LG-56] Speculative Pre-Positioning: Decoding Stateful Sessions to the Next Decision Point Off the Critical Path

Link: https://arxiv.org/abs/2606.29565
Authors: Victor Norgren
Categories: Machine Learning (cs.LG)
Comments:

Abstract:A stateless inference server (vLLM, SGLang, TensorRT-LLM) idles between requests while the accelerator waits; a stateful session reclaims that idle time. Speculative pre-positioning decodes the session forward to its next decision point with the target model’s own forward pass and no draft model, moving the cross-request prefill and entry-decode off the critical path: the next request resumes from a pre-paid entry on its delta, or, when a confidence gate fires, is answered from a cached distribution in one near-constant vocabulary scan with no decode, at a cost only of energy and a rare, bounded false accept. The payoff is conditional on capability: a capable model fires the gate at near-full coverage and about 87% precision (a smaller one never clears it), returning the first token in about 1.0 ms versus the 39 ms decode a prefix cache still pays.

[LG-57] Optimizer Memory Makes Shuffle Order a First-Order Source of Fine-Tuning Noise

Link: https://arxiv.org/abs/2606.29554
Authors: John Sweeney
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 29 pages, 3 figures, 12 tables

Abstract:Shuffle order can be a larger source of fine-tuning noise than a memoryless analysis predicts: fixed-clock optimizer memory makes local equal-multiset contrasts first order in the learning rate rather than second order, and the resulting order channel can be large enough for a single seed to flip a close A/B comparison. We isolate this mechanism and derive a fit-free way to size the noise it produces. For a memoryless optimizer, reordering an equal multiset has no first-order endpoint term; the leading local contrast is the $O(\eta^2)$ gradient bracket. Fixed-clock optimizers such as AdamW are different. Their moment buffers, preconditioner state, and de-biasing counters advance with the step index rather than with the learning-rate-scaled time $\tau=\eta k$, so the same gradient can receive a position-dependent endpoint weight. For any fixed finite measurement window, a lifted-state expansion gives an $O(\eta)$ equal-multiset contrast whenever the first-order replay coefficient is nonzero, while regular and clock-matched controls remain $O(\eta^2)$; a bare fixed-$\beta$ momentum buffer is already enough. A bitwise-deterministic replay from one warmed optimizer state isolates the mechanism, giving order-variance slopes 1.83 for AdamW, 2.00 for fixed-$\beta$ momentum, and 4.00 for SGD; matching the memory clock to $\tau$ restores the regular exponent. For AdamW with a frozen preconditioner, the same impulse-weight kernel gives a closed-form asymptotic order-variance floor after the local potentials are measured, with no fitted coefficients. The result is local to the measurement window (independent reshuffling can average the channel across windows), but it yields order-noise error bars, positional attribution weights, and a seed-budget criterion for fine-tuning comparisons.
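
The position-dependent endpoint weight mentioned in the abstract can be seen in a deliberately simplified sketch. It assumes the gradients are frozen constants (they do not depend on the parameter), a scalar parameter, and a heavy-ball momentum buffer of the form m = beta * m + g; these are illustrative assumptions, and the paper's lifted-state expansion is not reproduced. With frozen gradients plain SGD ends at the same point for any ordering, while the momentum endpoint changes by an amount proportional to the learning rate.

```python
def sgd_endpoint(grads, lr):
    """Endpoint of plain SGD on frozen gradients: order cannot matter."""
    x = 0.0
    for g in grads:
        x -= lr * g
    return x

def momentum_endpoint(grads, lr, beta):
    """Endpoint with a fixed-beta momentum buffer (assumed heavy-ball form).

    Earlier gradients are replayed by the buffer on every later step, so
    their weight in the endpoint depends on their position in the window.
    """
    x, m = 0.0, 0.0
    for g in grads:
        m = beta * m + g
        x -= lr * m
    return x

grads = [1.0, -2.0, 4.0]      # hypothetical per-batch gradients
reordered = [4.0, -2.0, 1.0]  # same multiset, different order
lr, beta = 0.01, 0.9

sgd_gap = sgd_endpoint(grads, lr) - sgd_endpoint(reordered, lr)
mom_gap = momentum_endpoint(grads, lr, beta) - momentum_endpoint(reordered, lr, beta)
```

In this toy setting `sgd_gap` is zero up to rounding, while `mom_gap` is 5.13 times the learning rate, so doubling `lr` doubles the order-induced gap. In the abstract's full setting the memoryless contrast is second order rather than exactly zero, because real gradients depend on where the iterate is.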

[LG-58] Improved Multi-Dimensional Forecasting for Swap Regret

Link: https://arxiv.org/abs/2606.29533
Authors: Joey Rivkin, Ramiro N. Deo-Campo Vuong, Robert Kleinberg, Chido Onyeze, Erald Sinanaj, Eva Tardos
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments: Accepted for presentation at the ACM Conference on Economics and Computation (EC) 2026

Abstract:We study the problem of forecasting for an arbitrary number of downstream agents with unknown objectives, each of whom best responds to the forecaster’s predictions. We seek a single forecaster that guarantees sublinear swap regret for all downstream agents simultaneously. For two-dimensional outcome spaces, we give a polynomial time algorithm that guarantees $\tilde{O}(\sqrt{kT})$ swap regret for any downstream agent with $k$ actions. This improves over the previously known bound of $\tilde{O}(kT^{5/8})$ and avoids the exponential in $T$ runtime of prior algorithms in this setting. Our algorithm extends nicely to other low dimensional environments, retaining $\tilde{O}(\sqrt{T})$ downstream swap regret while the exponent of $k$ in the regret bound and the exponent of $T$ in the running time both grow with dimension. For arbitrary dimension $d$, we give a forecasting algorithm that guarantees $\tilde{O}(d\sqrt{kT})$ swap regret, assuming the forecaster knows an upper bound $k$ on the number of actions available to any downstream agent, albeit with a much longer runtime. This improves upon previous high dimensional guarantees that had $\tilde{O}(T^{2/3})$ dependence and required additional behavioral assumptions.
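
For readers unfamiliar with the quantity being bounded, the sketch below computes the swap regret of a single agent from a played action sequence and the realized per-round utilities. It uses the standard definition (the best total gain from remapping every played action a to some fixed replacement b); the function name and the toy data are hypothetical, and the forecasting algorithm itself is not sketched.

```python
def swap_regret(actions, utilities):
    """Swap regret of a played sequence.

    actions[t] is the index of the action played in round t;
    utilities[t][b] is the payoff action b would have earned in round t.
    """
    k = len(utilities[0])
    # gain[a][b]: total improvement from playing b whenever a was played.
    gain = [[0.0] * k for _ in range(k)]
    for a, u in zip(actions, utilities):
        for b in range(k):
            gain[a][b] += u[b] - u[a]
    # The best swap function picks the best replacement independently for
    # each action; keeping a itself gives zero, so each term is non-negative.
    return sum(max(row) for row in gain)

# Hypothetical two-action example over four rounds.
actions = [0, 0, 1, 1]
utilities = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
assert swap_regret(actions, utilities) == 2.0
```

In the example the agent would have gained 2 by switching to action 1 whenever it played action 0, while no replacement helps for the rounds in which it played action 1.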

[LG-59] The Mirage of Optimizing Training Policies: Monotonic Inference Policies as the Real Objective for LLM Reinforcement Learning

Link: https://arxiv.org/abs/2606.29526
Authors: Jing Liang, Hongyao Tang, Yi Ma, Yancheng He, Weixun Wang, Xiaoyang Li, Ju Huang, Wenbo Su, Jinyi Liu, Yan Zheng, Jianye Hao, Bo Zheng
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Reinforcement learning (RL) has gained growing attention in large language model (LLM) post-training, yet RL training remains fragile and can suffer from instability or collapse. One vital cause is training-inference mismatch: LLM RL systems adopt separate inference and training engines for generation efficiency and training precision, which in practice exhibit inconsistent probabilities for the same trajectories on the training and inference sides, even with synchronized model parameters. This naturally induces a special type of off-policyness that is always present and poisons the training. Prior works have made various efforts to address the off-policyness and stabilize the training policies under the mismatch. In this paper, we point out an objective misalignment neglected by existing works: an effective update to the policy in the training engine does not necessarily ensure an improvement of the inference policy, i.e., the one used in deployment. To this end, we propose a new policy optimization objective for LLM RL, named Monotonic Inference Policy Improvement (MIPI). Following this principle, we introduce Monotonic Inference Policy Update (MIPU), a two-step LLM RL framework that constructs sampler-referenced candidate updates and selectively accepts synchronized candidates using an inference-side gap proxy. Experiments conducted on two model scales under high mismatch show that MIPU improves average reasoning performance and training stability.

[LG-60] Not All Objectives Are Born Equal: Priority-Constrained Descent for Hierarchical Multi-Objective Optimization

Link: https://arxiv.org/abs/2606.29521
Authors: Dara Varam, Mohamed I. Alhajri
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 33 pages, 14 figures, 6 tables

Abstract:Deep learning problems rarely involve objectives that are equal in importance. A primary objective defines the goal, whilst secondary objectives, such as sparsity, compression, or robustness, constrain the solution. While existing multi-objective methods have proven effective in practice, they have a clear symmetry problem and neglect the inherent objective hierarchy built into these objective spaces. We introduce Priority-Constrained Descent (PCD), a gradient-based optimization framework designed to explicitly exploit hierarchical objective structures. PCD preserves the direction of primary descent whilst allowing for the minimal distortion necessary to guarantee progress on secondary objectives, controlled by a single $\tau \in [0, 1]$ that dictates the strength of the distortion. The resulting formulation is invariant to objective scaling and admits exact closed-form solutions for problems with two and three objectives. We evaluate PCD within structured network compression settings, unstructured sparsity and low-rankness, and across a variety of synthetic experiments, showing Pareto dominance and better per-objective performance with secondary progress guarantees over existing methods, further exhibiting the interpretable trade-off that $\tau$ provides.

[LG-61] Anti-Collapse Dynamics and the Emergence of Multi-Time-Scale Learning in Recurrent Neural Networks

Link: https://arxiv.org/abs/2606.29519
Authors: Lorenzo Livi
Categories: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Comments: first full version

Abstract:Long-range learning is hard for recurrent networks trained with stochastic gradient descent, because the influence of a past input fades with the lag $\ell$, and if it fades too fast the dependence cannot be learned from finite data. This fade is captured by an envelope $f(\ell)$. An exponential fade makes the data needed to learn a lag-$\ell$ dependence grow exponentially, putting long horizons out of reach; a power-law fade keeps the cost polynomial. We show that the asymptotic decay class of $f(\ell)$ is not fixed by the architecture. Instead, it emerges from the coupling between the state dynamics and parameter dynamics, settling into either a collapsed regime (fast, exponential forgetting) or an extended, anti-collapsed regime (slow, power-law forgetting). The intuition is a competition within these coupled dynamics. Training drives the network’s effective time scales toward short ones, while rare, heavy-tailed fluctuations of the learning dynamics push a few of them to very long values. The extended regime survives only when these heavy-tailed pushes are strong enough to balance the pull. We make this mathematically precise with a coarse-grained stochastic process and prove exactly when the extended regime exists. A single exponent, the spectral exponent $\beta$, then governs both the spread of time scales and how slowly the network forgets. Realizing the regime in practice needs one more ingredient: the joint action of the architecture and the optimizer must be able to hold such a broad spread. A network whose capacity to generate broad time-scale spectra is severely constrained still collapses, even when supplied with strong heavy-tailed forcing. Heavy-tailed fluctuations thus act not as noise to be suppressed, but as the mechanism that sustains long-range learning.

[LG-62] Harvesting AI Computation at the Edge via Generic Approximation

Link: https://arxiv.org/abs/2606.29518
Authors: Yihan Wang, Huiru Yan, Luxin Zhang, Long Cheng, Weiwei Chen, Ying Wang, Lei Zhang, Cheng Liu, Huawei Li
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: 11 pages, 9 figures

Abstract:With the widespread adoption of AI in various IoT scenarios such as smart sensing and processing, AI chips have become a common component at the edge. These chips are typically specialized for structured neural network (NN) processing and are designed to meet peak workload demands. However, they are often underutilized and suffer from considerable computational waste due to temporal or spatial redundancy in processing. Conversely, general-purpose processing engines at the edge may struggle with compute-intensive tasks such as signal processing and complex numerical operations because of stringent resource constraints. To address this imbalance, we propose a framework that harvests unused AI computation resources using general-purpose approximation techniques. The core idea is to automatically convert traditional computing tasks into neural network models via a representative neural architecture search (NAS) method. These approximate versions of general-purpose tasks are then deployed on AI engines during their idle periods. Specifically, we introduce a runtime scheduler that offloads these tasks to AI chips without compromising the performance of primary AI workloads, thereby alleviating the burden on general-purpose processors. Experiments on a representative AIoT processor show that our proposed AI computation harvesting strategy delivers substantial performance improvements across a set of edge processing tasks.

[LG-63] A Mathematical Optimization Approach for Expert-Informed Bayesian Best Subset Selection

Link: https://arxiv.org/abs/2606.29516
Authors: Nolan Alexander,Henning Mortveit
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:A central challenge in statistical modeling is identifying the subset of features that belong in the true regression model. The classical best subset selection problem, recently made tractable via mixed-integer optimization (MIO), finds the globally optimal sparse solution. It does not, however, make use of any information beyond the observed data. In many applied settings, domain experts can meaningfully rank or score the relevance of candidate predictors, yet no existing framework integrates such probabilistic expert assessments directly into the best-subsets objective. This paper presents Expert-Implied Bayesian Best Subsets (EBBS), a method that incorporates domain-expert probability estimates of feature relevance into the MIO best-subsets problem through a maximum a posteriori (MAP) framework. Expert views from multiple respondents are aggregated into a single prior probability per feature using the Poisson binomial distribution for marginal probability estimates, the pairwise win rate for pairwise comparisons, or the normalized mean rank for ordinal rankings. This probability enters the objective function as a log-odds penalty term that smoothly encourages or discourages the selection of each feature consistent with the expert consensus. This paper provides analytic derivations of the MAP formulation and characterizes its theoretical properties. The proposed model reduces to Best Subsets when experts all have no views. Empirical results on synthetic and real datasets are forthcoming.

[LG-64] Reinforcement Learning in Super Mario Bros: Curriculum Pedagogy and Optimal Level Design in World 1-1

Link: https://arxiv.org/abs/2606.29511
Authors: Jesse Ponnock,Lucas Ho
Categories: Machine Learning (cs.LG)
Comments: 13 pages, 7 figures, 5 tables

Click to view abstract

Abstract:World 1-1 of Super Mario Bros is widely celebrated as a masterclass in game design: its progressive structure is credited with teaching players core mechanics through the level itself. We ask whether that structure is empirically measurable using reinforcement learning. We implement World 1-1 from scratch as a fully discrete environment and compare four algorithms – Q-Learning, SARSA, Monte Carlo, and Deep Q-Network (DQN) – across three progressively complex versions of the same level. Monte Carlo emerges as the strongest agent (94.9% ± 1.5% win rate), outperforming DQN (76.4% ± 3.4%) by learning to maximize intermediate rewards along winning paths rather than taking the most direct route. We then use Monte Carlo in a curriculum experiment permuting World 1-1’s six canonical segments across twelve conditions. Canonical ordering converges fastest, achieves the highest learning efficiency, and is the only condition with zero catastrophic failures; no random permutation matches all three criteria simultaneously. These results provide, to the best of our knowledge, the first empirical validation that World 1-1’s canonical design encodes genuine pedagogical structure: one that measurably accelerates learning and cannot be replicated by chance.

[LG-65] Chamber geometry and specification numbers of Boolean threshold functions

Link: https://arxiv.org/abs/2606.29477
Authors: Martin Anthony
Categories: Discrete Mathematics (cs.DM); Machine Learning (cs.LG); Combinatorics (math.CO)
Comments: 61 pages, 2 figures, 2 tables

Click to view abstract

Abstract:The specification number \sigma_n(f) of a Boolean threshold function f on n variables is the least number of points whose f-values determine f uniquely among all threshold functions. Its essential points form the unique minimum such set. We develop Zuev’s geometric interpretation: the threshold functions are the chambers of a central hyperplane arrangement in the (n+1)-dimensional space of weights and thresholds, and the essential points of a function correspond exactly to the facets of its chamber, so the specification number is the chamber’s facet number. The lower bound \sigma_n(f) \ge n+1 becomes the fact that a pointed full-dimensional cone has at least n+1 facets, with equality for simplicial chambers. The average specification number \overline{\sigma}_n becomes an average facet count. We evaluate this average exactly via the resonance arrangement and bound it through a theorem of Fukuda, Tamura, and Tokuyama, obtaining \overline{\sigma}_n \le 2n; hence \overline{\sigma}_n = \Theta(n). This settles a question of Gutekunst, Mészáros, and Petersen. The method also extends to polynomial threshold functions. The same geometry links threshold functions with a threshold zonotope, whose vertices are modified Chow vectors. Its one-skeleton is the one-inclusion graph, and a vertex’s degree is the specification number of that function. Finally, we treat the operations of Lozin et al. on functions of minimum specification number. Adding a variable and extending on a variable both take the product of a chamber closure with a half-line, preserving simpliciality. For the symmetric-variables extension we give an exact thresholdness criterion and show that minimum specification number is preserved whenever the extension is a threshold function. We also resolve a question they pose concerning a fourth operation.

[LG-66] Structured Proper Loss Geometries for Multiclass Classification: Theory and Controlled Empirical Evaluation

Link: https://arxiv.org/abs/2606.29471
Authors: Soumyadip Sarkar
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:

Click to view abstract

Abstract:Strictly proper scoring rules identify the true conditional class distribution at population level, but their curvature can alter optimization and finite-sample behavior. We study three multiclass objectives: a class-aware quadratic Bregman score (CAPM), a strongly convex generator with constrained log-cosh ridges (HPG), and an HPG objective with an annealed probability-margin penalty (APMS). CAPM is treated as a structured instance of established quadratic scoring-rule theory. We derive conditional-regret, curvature, range, and logit-gradient bounds for CAPM and HPG, and prove exact penalty-range and conditional-target displacement bounds for APMS. Controlled five-seed experiments use Digits, Wisconsin breast cancer, and synthetic confusion and long-tail problems under clean labels, symmetric and pair-flip corruption, class imbalance, calibration evaluation, input corruption, and first-order adversarial perturbations. The candidates are close to cross-entropy on clean data and show descriptive gains in some noisy-label cells, but the five-seed comparisons are interpreted descriptively rather than as significance evidence. The selected noisy-label baselines perform better on Digits with 40% symmetric label noise, and explicit prior-adjustment methods perform better in the 30:1 synthetic long-tail experiment. Ablations do not show a consistent benefit from the candidate-specific graph, ridge, or margin components. The mathematical analysis establishes the stated properties, and the experiments delimit the empirical evidence; together they do not support a claim of general superiority.

[LG-67] Self-Supervised Calibration of Scientific Instruments Using Physical Consistency Constraints

Link: https://arxiv.org/abs/2606.29466
Authors: M. Rejmund(1),A. Lemasson(1) ((1) GANIL, CEA/DRF - CNRS/IN2P3, Bd Henri Becquerel, BP 55027, F-14076, Caen Cedex 5, France)
Categories: Machine Learning (cs.LG); Nuclear Experiment (nucl-ex); Instrumentation and Detectors (physics.ins-det)
Comments:

Click to view abstract

Abstract:Calibration remains one of the principal obstacles to the deployment of machine learning in scientific instrumentation because it typically relies on expert intervention, dedicated procedures, and manually labelled data. We introduce a physics-informed self-supervised framework that jointly learns latent detector calibration parameters and task-specific predictions directly from raw measurements without requiring pre-calibrated signals or external labels. The method exploits known physical constraints to generate pseudo-labels iteratively, transforming calibration into a self-supervised optimization problem. The approach is demonstrated for ionic charge-state determination in the VAMOS++ magnetic spectrometer, where the calibration of a segmented ionization chamber and the inference of ionic charge states are learned simultaneously. Starting from a weak prior on the mean ionic charge state, the model progressively refines its predictions through iterative fractional pseudo-labelling driven by the discrete nature of atomic masses. Beyond accurate ionic charge-state reconstruction, the inferred calibration coefficients provide a compact representation of the detector state that enables automated monitoring of gain drifts, pressure variations, and detector aging. The resulting labels can subsequently be transferred to specialized models that quantify detector imperfections and track their spatial and temporal evolution. These results establish a general paradigm for self-calibrating and self-monitoring scientific instruments and represent a step toward intelligent experimental systems capable of autonomous calibration, analysis, and performance optimization.

[LG-68] Prototype Latent World Model Replay for Class-Incremental Learning

Link: https://arxiv.org/abs/2606.29465
Authors: Weizhi Nie,Hui Wang,Weijie Wang,Yuting Su
Categories: Machine Learning (cs.LG)
Comments: 19 pages, 10 figures

Click to view abstract

Abstract:Class-incremental learning requires a model to learn new classes while preserving decision regions for old ones. This is difficult when raw old samples are no longer available. We propose Prototype Latent World Model Replay, a memory-free framework that stores old classes as distributions over stable hidden states rather than as images. A frozen ImageNet-pretrained encoder maps each image into a latent state space. In this space, each class is summarized by several prototype-centered distributions with class-specific variances. When new classes arrive, the model samples old latent states from this prototype world model. It then trains a lightweight adapter and classifier using both sampled old states and real new-class features. We also add a supervised contrastive term in the adapter space to promote intra-class compactness and old-new class separation. On Split CIFAR-100, our method improves over fine-tuning under Inc5, Inc10, and Inc20 without storing raw exemplars. The full Ours-LWM+Con model raises LastAcc from 4.55% to 31.64%, from 9.06% to 37.06%, and from 16.96% to 43.10% in Inc5, Inc10, and Inc20, respectively. It also achieves AvgAcc of 45.86%, 52.19%, and 56.18%. Ablation and retention analyses show that stable latent-state replay is the main source of the gain. Contrastive separation further refines the old-new geometry. These results suggest that prototype latent memory preserves reusable class-state distributions, rather than only fitting the current classifier.
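
The replay idea in this abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `fit_prototypes` and `sample_replay`, the use of a few k-means steps to find prototypes, and the diagonal per-prototype variance are all assumptions made for illustration. The latent features are assumed to come from some frozen encoder that is not shown here.

```python
import numpy as np

def fit_prototypes(features, n_prototypes=2, n_iters=10, seed=0):
    """Summarize one class's latent features as a small Gaussian mixture.

    Hypothetical helper: prototypes are found with a few k-means steps and
    each prototype keeps a diagonal variance (an assumption, not the paper's
    stated procedure).
    """
    rng = np.random.default_rng(seed)
    start = rng.choice(len(features), size=n_prototypes, replace=False)
    centers = features[start].astype(float)
    for _ in range(n_iters):
        # Assign every feature to its nearest prototype.
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for k in range(n_prototypes):
            if np.any(assign == k):
                centers[k] = features[assign == k].mean(axis=0)
    dim = features.shape[1]
    variances = np.stack([
        features[assign == k].var(axis=0) + 1e-6 if np.any(assign == k)
        else np.full(dim, 1e-6)
        for k in range(n_prototypes)
    ])
    weights = np.bincount(assign, minlength=n_prototypes) / len(features)
    return centers, variances, weights

def sample_replay(centers, variances, weights, n_samples, rng):
    """Draw synthetic old-class latent states from the stored prototypes."""
    picks = rng.choice(len(centers), size=n_samples, p=weights)
    noise = rng.standard_normal((n_samples, centers.shape[1]))
    return centers[picks] + noise * np.sqrt(variances[picks])
```

In a class-incremental step, the sampled states would be mixed with real new-class features to train the adapter and classifier, so only the prototype statistics of old classes need to be stored, not the raw images.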

[LG-69] Randomized neural operator for parametric PDEs with fast training and conformal uncertainty quantification

Link: https://arxiv.org/abs/2606.29440
Authors: Zirui Deng,Jingbo Sun,Deyu Meng,Fei Wang
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:

Click to view abstract

Abstract:Repeatedly solving parametric PDEs is essential for uncertainty quantification, design optimization and inverse problems, but conventional neural operators require expensive non-convex training. We introduce PCA–RaNN, a randomized latent neural operator that combines PCA-based dimensionality reduction with fixed random features and a closed-form least-squares readout. It recasts latent operator learning as fixed-feature linear regression, reducing training time by one to three orders of magnitude across benchmarks while maintaining competitive accuracy. We introduce an energy-matched scaling rule and a lightweight two-parameter BFGS refinement to correct suboptimal feature scales. Ensemble averaging reduces predictive variance. On Burgers, Darcy, Navier–Stokes and backward heat equation benchmarks, PCA–RaNN provides a favorable speed–accuracy trade-off against operator-learning baselines. The ensemble supports split-conformal prediction intervals, and the linear readout enables rapid online adaptation via recursive least squares without retraining hidden features. This provides an efficient, uncertainty-aware surrogate for many-query scientific workflows.
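
The pipeline described above (PCA reduction, fixed random features, closed-form least-squares readout) can be sketched in a few lines of NumPy. This is an illustrative sketch under assumptions, not the PCA–RaNN code: the names `train_pca_rann` and `predict_pca_rann`, the `tanh` activation, the Gaussian weight distribution, and the simple rescaling of latent inputs are hypothetical choices. The energy-matched scaling rule, BFGS refinement, ensembling, and conformal intervals mentioned in the abstract are not reproduced.

```python
import numpy as np

def pca_fit(data, n_components):
    """Return the data mean and the top principal directions (as rows)."""
    mean = data.mean(axis=0)
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:n_components]

def train_pca_rann(inputs, outputs, n_in=4, n_out=4, n_features=256,
                   scale=1.0, seed=0):
    """Fit a random-feature map between PCA latents (hypothetical sketch)."""
    rng = np.random.default_rng(seed)
    in_mean, in_basis = pca_fit(inputs, n_in)
    out_mean, out_basis = pca_fit(outputs, n_out)
    z_in = (inputs - in_mean) @ in_basis.T
    z_out = (outputs - out_mean) @ out_basis.T
    z_scale = z_in.std() + 1e-12  # crude normalization, an assumption
    # Hidden weights are drawn once and never trained.
    weights = rng.normal(0.0, scale, size=(n_in, n_features))
    biases = rng.uniform(-1.0, 1.0, size=n_features)
    phi = np.tanh((z_in / z_scale) @ weights + biases)
    # Training reduces to one linear least-squares solve for the readout.
    readout, *_ = np.linalg.lstsq(phi, z_out, rcond=None)
    return {"in_mean": in_mean, "in_basis": in_basis, "out_mean": out_mean,
            "out_basis": out_basis, "z_scale": z_scale, "weights": weights,
            "biases": biases, "readout": readout}

def predict_pca_rann(model, inputs):
    """Map discretized input fields to predicted output fields."""
    z_in = (inputs - model["in_mean"]) @ model["in_basis"].T / model["z_scale"]
    phi = np.tanh(z_in @ model["weights"] + model["biases"])
    return phi @ model["readout"] @ model["out_basis"] + model["out_mean"]
```

Because the hidden layer is fixed, the only fitted quantity is the linear readout, which is what makes the training cost that of a single least-squares problem and not an iterative non-convex optimization.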

[LG-70] Fourier Neural Operators with Least-Squares Readout Refit for Learning Random Obstacle-to-Solution Maps

Link: https://arxiv.org/abs/2606.29436
Authors: Chenhui Zhu,Fei Wang
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We study operator learning for random obstacle-to-solution maps arising from elliptic variational inequalities with finite-band self-affine random obstacle fields. Instead of introducing an explicit truncated stochastic parametrization of the random input, we learn the map directly from sampled obstacle realizations on a fixed grid. This problem is challenging because the solution is governed not only by the obstacle field itself, but also by the induced contact set and free-boundary geometry. We introduce a post-training least-squares readout refit for the Fourier neural operator (FNO). After the FNO is trained end to end, its nonlinear backbone is frozen and the final affine readout is recomputed by solving the induced linear least-squares problem over all training samples and grid points. The refit yields the empirical squared-error optimal readout for the learned frozen features while leaving the nonlinear representation unchanged. We compare vanilla DeepONet, POD-DeepONet, a two-stage DeepONet baseline, FNO, and FNO with least-squares readout refit (FNO-LS) on two obstacle ensembles with different amplitude levels. Numerical results show that FNO-LS achieves the strongest overall performance among the tested models, particularly for higher-amplitude obstacles with more complex contact geometry. The method improves average field accuracy, contact-set recovery, and obstacle-violation metrics at low additional cost, especially when the FNO backbone is informative but not fully converged. These results suggest that least-squares readout refit is a simple and effective post-training enhancement for learning random obstacle-to-solution maps.
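
The readout refit described above amounts to an ordinary linear least-squares problem once the backbone is frozen. The sketch below shows that step only, assuming the frozen features have already been flattened to one row per (sample, grid point) pair. The function names are hypothetical and no FNO is implemented here.

```python
import numpy as np

def refit_affine_readout(frozen_features, targets):
    """Least-squares refit of a final affine layer on frozen features.

    frozen_features: (n_rows, n_channels), one row per (sample, grid point).
    targets:         (n_rows, n_outputs), reference solution values.
    Returns (weight, bias) minimizing the summed squared error.
    """
    ones = np.ones((frozen_features.shape[0], 1))
    design = np.hstack([frozen_features, ones])  # extra column for the bias
    solution, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return solution[:-1], solution[-1]

def mean_squared_error(features, weight, bias, targets):
    """Mean squared error of an affine readout on the given features."""
    return float(np.mean((features @ weight + bias - targets) ** 2))
```

By construction, the refitted readout cannot have a larger training squared error than the readout obtained from end-to-end training on the same frozen features, which matches the abstract's description of it as the empirical squared-error optimal readout.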

[LG-71] Temporal Posed and Spontaneous Gesture Recognition from Electromyography in the Rock-Paper-Scissors Game

Link: https://arxiv.org/abs/2606.29423
Authors: Xin Wei,Huakun Liu,Felix Dollack,Monica Perusquia-Hernandez
Categories: Machine Learning (cs.LG)
Comments: Accepted by ACII2025

Click to view abstract

Abstract:The importance of gesture recognition has been acknowledged in many domains requiring real-time recognition systems. Two requirements for such systems are fast recognition and support for multiuser contexts. Therefore, we explored the temporal characteristics of electromyography (EMG) and its accuracy in recognizing gestures in a Rock-Paper-Scissors (RPS) game. Twenty-four participants played RPS in dyads, while a two-channel EMG was recorded from the forearm. We found that EMG onsets could be detected at least 800 ms before the gesture’s visible onset, and that the EMG peaks around 342 ms before the visible onset of the gesture. Furthermore, we evaluated self-gesture recognition in both posed and spontaneous gesture conditions. The mean accuracy for posed gestures reached 63.4%. The model trained on posed gestures achieved 53.6% for spontaneous gestures, with considerable variation across individuals. We also checked whether detecting a player’s gesture from the opponent’s EMG was possible. The mean accuracy peaked at 65%, 2082 ms after the visual onset of the gesture. This suggests that the opponent’s reaction to an observed gesture contains information about the observed gesture due to the dynamics of the interactions while playing. The temporal predictive advantage of EMG signals, where muscle activation precedes observable movement, offers potential benefits for applications requiring rapid intent recognition, such as human-computer interaction and assistive technologies. Future work should focus on refining onset detection and reducing the impact of spontaneous movement variability across conditions to improve recognition performance in dynamic and real-world environments.

[LG-72] Exploring the Cryptographic Limits of Transformer Networks

Link: https://arxiv.org/abs/2606.29389
Authors: Stefan Domunco,Andis Draguns,Philip Torr,Isaac Robinson,Christian Schroeder de Witt
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In recent work it has been shown that colluding AI agents can use steganographic methods to exchange malicious information. Whether a transformer can implement steganographic methods depends on what cryptographic functions it can implement, since a transformer that can implement a cryptographic function within its layers has source-free randomness access. Despite existing circuit-complexity results, no prior work maps specific cryptographic constructions to transformer architectures. As Merrill et al. have shown that saturated transformers can be seen as threshold circuits, we first generate threshold circuits for three different cryptographic constructions (Keccak functions, Merkle–Damgård constructions and Merkle Trees) and then map these circuits to different transformer architectures. We derive verified scaling laws for the width and depth of the circuits which implement each cryptographic construction and propose two different mappings: a no-attention mapping and a tokens-as-gates mapping. Beyond its security implications, this work contributes by establishing a methodology for deriving structural guarantees on transformer computational capacity. Specifically, we derive constructive upper bounds on what a transformer of a given depth and width could plausibly compute, providing a principled foundation for capability evaluations of transformer-based AI systems.

[LG-73] Interventional Flow Matching: Prospective Dose-Response Forecasting with Velocity-Field Jacobian Regularization

Link: https://arxiv.org/abs/2606.29386
Authors: Amirreza Dolatpour Fathkouhi,Justin Lee,Heman Shakeri
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Predicting a patient’s physiological trajectory under a planned treatment sequence is a prospective interventional problem, not standard time-series extrapolation. We study this problem in glucose management, where insulin and carbohydrate records are policy-dependent: future drivers are coupled to patient state, behavior, and clinical decision rules, so observational forecasting accuracy alone does not guarantee correct responses to planned interventions. We introduce Interventional Flow Matching (IFM), a continuous-time generative framework for physiologically constrained prospective forecasting. IFM conditions a flow-matching velocity field on patient history and planned future drivers in a bounded latent glucose space. Rather than embedding strict mechanistic glucose–insulin ODE equations or enforcing causality through rollout-based simulations, IFM uses a solver-free regularization: it penalizes the Jacobian of the instantaneous velocity field with respect to smoothed treatment drivers. This imposes signed, dose-bounded local sensitivities directly on the learned dynamics: insulin lowers glucose, carbohydrates raise it, and both responses remain within plausible ranges. On a simulated UVA/Padova type 1 diabetes cohort, IFM achieves the strongest balance between observed-driver RMSE and interventional response metrics. Across experiments, it consistently produces physiologically correct responses to both insulin and carbohydrate drivers while maintaining high directional and ranking consistency.

[LG-74] Reliability, Faithfulness, and the Limits of Post-hoc Explanations of Opaque Scientific Models ICML2026

Link: https://arxiv.org/abs/2606.29346
Authors: Nick Oh,Helen Jin
Categories: Machine Learning (cs.LG)
Comments: Presented at PhilML Workshop at ICML 2026

Click to view abstract

Abstract:Post-hoc explanation methods are routinely used to interpret scientific machine learning models, with the deliverable understood to be insight into the phenomenon the model has been trained on. The transition may be taken to be secured once the model is reliable enough and the explanation faithful enough. We argue it is not. Reliability checks that the model’s predictions match the phenomenon’s outcomes, and faithfulness checks that the explanation matches the model, but neither checks whether the model works as the phenomenon works, which is what a claim about structure requires. The chain can support candidate hypotheses under external corroboration, but it cannot, on its own, support claims about how the phenomenon is in fact structured.

[LG-75] Sample Complexity of Scientific Discovery: PAC Learnability of Compositional Function Trees ICML2026

Link: https://arxiv.org/abs/2606.29331
Authors: Şuayp Talha Kocabay,Talha Rüzgar Akkuş,Kerem Yalçın
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted to the 2nd Workshop on Compositional Learning: Safety, Interpretability, and Agents at ICML 2026. To be presented in Seoul, South Korea, July 11, 2026

Click to view abstract

Abstract:Scientific discovery via symbolic regression is often viewed as statistically and computationally intractable because the hypothesis space of expressions grows combinatorially with depth. This paper revisits the statistical side through the lens of PAC learning, focusing on compositional function trees built from a finite vocabulary of smooth operators (e.g., +, \times, \sin, \exp and affine maps). We prove that the relevant generalization quantity, Rademacher complexity, hence the excess risk, does not necessarily blow up exponentially with the number of distinct symbolic structures, but is controlled by (i) the depth d and (ii) the Lipschitz constants of the base operators along the composed computation graph. Concretely, under mild Lipschitz conditions on operators and bounded affine leaves, a finite-union bound over a vocabulary of size K = |\mathcal{H}_{\mathrm{base}}| together with Maurer-type vector contraction yields \mathfrak{R}_n(\mathcal{H}_{\mathrm{comp}}^{d}) \leq (Kb\sqrt{2}L)^{d-1} \mathfrak{R}_n(\mathcal{H}_{\mathrm{comp}}^{1}) with arity bound b; corresponding high-probability risk bounds scale as \mathcal{O}(L^d/\sqrt{n}) when K, b = O(1) and \mathfrak{R}_n(\mathcal{H}_{\mathrm{comp}}^{1}) = O(n^{-1/2}). We complement the theory with a modular codebase that trains differentiable operator trees (not MLPs) on synthetic “physics-like” targets of controlled depth and shows that the empirical generalization gap correlates positively with the predicted complexity term (\widehat{L}^d)/\sqrt{n}.
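
The depth dependence in the stated bound is what one would get by unrolling a per-level contraction. The derivation below is a sketch under an assumption that is not quoted from the abstract: that adding one level of composition inflates the Rademacher complexity by at most a factor of Kb\sqrt{2}L.

```latex
\begin{align*}
\mathfrak{R}_n(\mathcal{H}_{\mathrm{comp}}^{d})
  &\le (Kb\sqrt{2}L)\,\mathfrak{R}_n(\mathcal{H}_{\mathrm{comp}}^{d-1}) \\
  &\le (Kb\sqrt{2}L)^{2}\,\mathfrak{R}_n(\mathcal{H}_{\mathrm{comp}}^{d-2}) \\
  &\;\;\vdots \\
  &\le (Kb\sqrt{2}L)^{d-1}\,\mathfrak{R}_n(\mathcal{H}_{\mathrm{comp}}^{1}).
\end{align*}
```

As a quick arithmetic check with illustrative values K = 2, b = 2, L = 1 and d = 3, the multiplicative factor is (4\sqrt{2})^{2} = 32, so the depth-3 class has at most 32 times the complexity of the depth-1 class under this assumed recursion.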

[LG-76] Deciphering Region-Level Signatures from Latency Measurements in LEO Satellite Internet

Link: https://arxiv.org/abs/2606.29324
Authors: Xiang Shi,Yifei Zhang,Peng Hu
Categories: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: This paper has been accepted by the IEEE International Symposium on Personal, Indoor and Mobile Radio Communications 2026 (PIMRC 2026), 1 - 4 September 2026, Singapore

Click to view abstract

Abstract:Low-Earth orbit (LEO) satellite Internet has become an indispensable infrastructure that provides growing coverage for global users. Despite extensive measurement efforts, the principles underlying region-level performance characteristics remain insufficiently understood, limiting the ability to identify region-specific latency signatures under dynamic network conditions. In this paper, we formulate the problem of region-level latency characterization using Starlink round-trip time (RTT) measurements from the public LENS dataset. We then propose a hierarchical analytical framework that transforms raw RTT sequences into multi-scale statistical features for cross-region comparison. Using data from five geographically representative regions, we demonstrate that latency differences are strongly associated with deployment factors, particularly infrastructure availability and Starlink dish-to-Point-of-Presence distance. Mutual information analysis identifies minimum RTT as the most discriminative feature, which is further supported by XGBoost-based feature importance. The proposed model achieves 83% accuracy on short-term data. However, its performance degrades over longer periods, indicating limited temporal generalization and motivating the need for adaptive models and feature representations for long-term performance in the future.

[LG-77] SP-CACW: Convergence-Aware Client Weighting for Selfish Personalized Learning

Link: https://arxiv.org/abs/2606.29322
Authors: Yaron Kiselman,Kfir Y. Levy
Categories: Machine Learning (cs.LG)
Comments: 31 pages, 6 figures

Click to view abstract

Abstract:Collaborative learning is sustainable only when it benefits each participant. Standard federated learning optimizes a global average objective, which can underperform for clients whose data distributions differ substantially from the population. We study selfish personalization: how a designated target client can use peer gradients to minimize its own risk while avoiding negative transfer. We propose SP-CACW, a convergence-aware client-weighting framework that selects aggregation weights by minimizing an upper bound on the target client’s convergence error. The resulting rule explicitly trades off peer bias against stochastic variance and can assign zero weight to harmful peers. We provide convergence guarantees under smoothness and bounded-variance assumptions and evaluate the method on MNIST, CIFAR-100, and LEAF Shakespeare, where it is competitive with or improves over strong personalized and clustering baselines.

[LG-78] Adaptive Block Diffusion: Resolving Training-Inference Mismatch in Diffusion Language Models

Link: https://arxiv.org/abs/2606.29275
Authors: Gagan Jain
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Diffusion Language Models (DLMs) are typically trained under fixed context structures, restricting denoising to predetermined token subsets. This creates a mismatch between training and inference, where models must operate over arbitrary configurations, leading to degradation off the training grid. We propose Adaptive Block Diffusion (ABD), which resolves this mismatch by optimizing denoising risk over a distribution of prefix-window configurations. By treating the configuration as a stochastic variable, ABD trains a single model over the full configuration space without architectural changes. We show that generalization across decoding strategies is governed by the support of the training distribution, and that ABD guarantees denoising optimality for any inference policy whose configurations are covered during training. Empirically, ABD exhibits structural invariance across decoding scales, avoiding off-grid collapse and recovering a monotonic relationship between block size and perplexity, while matching or outperforming fixed-block specialists at their target scales.

[LG-79] PCGD: Physics-Guided Conditional Graph Diffusion for TCAD Device Simulation

Link: https://arxiv.org/abs/2606.29272
Authors: Yihan Zhang,Zhiteng Zhang,Kun Chen,Chen Wang
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Click to view abstract

Abstract:Technology computer-aided design (TCAD) semiconductor device simulation is fundamentally constrained by the high computational cost of iteratively solving coupled drift-diffusion equations. Existing ML surrogates either reduce internal physics to macroscopic scalar regressions, or rely on single-step mappings that lack the iterative refinement required to resolve stiff, coupled fields. To address this, we introduce PCGD, a Physics-Guided Conditional Graph Diffusion framework operating natively on unstructured TCAD meshes to predict coupled electrostatic and carrier density fields. PCGD employs a Condition-Aware MeshGraphNet denoiser that explicitly injects boundary conditions and device structure context via global cross-attention. By augmenting data-driven denoising with a physics-guided hybrid objective that integrates exponent-free quasi-Fermi gradient matching with noise-aware PDE residuals, PCGD progressively enforces physical constraints in the iterative diffusion trajectory. This strategy successfully bypasses the numerical instabilities typical of stiff drift-diffusion equations. Evaluated on a challenging mixed PN/MOS benchmark, PCGD significantly outperforms deterministic one-step regression (1.207% error) and local diffusion (1.585% error) baselines by achieving a sub-percent mean relative field error of 0.835%, while concurrently reducing maximum PDE residual errors by nearly three orders of magnitude compared to pure diffusion. It also transfers robustly to unseen SOI topologies (0.815% error) via LoRA adaptation, using 5.30× less data and 14.34× fewer parameters than full fine-tuning. Ultimately, PCGD bridges the computational efficiency of generative surrogates with the rigorous physical fidelity of traditional TCAD, unlocking highly scalable, field-level analysis for robust device engineering.

[LG-80] Learning to Bid in Discriminatory Auctions with Budget Constraints AISTATS2026

Link: https://arxiv.org/abs/2606.29252
Authors: Negin Golrezaei,Sourav Sahoo
Categories: Machine Learning (cs.LG)
Comments: 54 pages, 1 figure. Appeared at AISTATS 2026

Click to view abstract

Abstract:We study repeated bidding in multi-unit discriminatory (pay-as-bid) auctions for a single bidder with per-round utility equal to value minus \alpha times payment, where \alpha \in [0,1] is a cost-of-capital parameter. The bidder aims to maximize cumulative utility over T rounds subject to a total budget B. The problem is challenging even without budgets: the action space is exponential in M, the maximum demand of the bidder, and the valuation vector (context) varies over time. Exploiting a decomposition of utility across units, we develop polynomial-time learning algorithms based on shortest paths in a directed acyclic graph, obtaining sublinear regret under both full-information and bandit feedback. In the bandit setting, the regret is independent of the number of contexts due to complete cross-learning: observing the utility of the chosen action under the realized context reveals the utility for the same action under all counterfactual contexts. With budget constraints, when the average normalized per-round budget \rho = \frac{B}{MT} < 1, we design a coupled primal-dual algorithm in which the DAG-based procedure uses dual-adjusted edge weights for primal updates, while online gradient descent updates the dual variable, yielding \rho-approximate sublinear regret. Finally, we give implementations whose per-round time and space are independent of the number of contexts, enabling scalability to large or even infinite context spaces.

[LG-81] When Prices Double in a Week: Forecasting of Agricultural Volatility in Import-Isolated Markets

Link: https://arxiv.org/abs/2606.29248
Authors: Ranuga Weerasekara,Heshan Nethmina,Manuja Ranathunga,Vinma Wettasinghe,Dinithi Navodya,Subavarshana Arumugam,Nirasha Munasinghe,Nisansa de Silva,Sandareka Wickramanayake
Categories: Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Click to view abstract

Abstract:Vegetable prices in Sri Lanka are highly volatile because the market is largely import-isolated, so supply disruptions quickly drive prices up. This study develops a machine learning framework to forecast such volatility by incorporating supply-chain-aware features and explicitly modelling the country’s two cultivation seasons, Maha (October-April) and Yala (May-September). An integrated dataset was constructed by combining retail and farmer-gate prices with origin-aligned weather variables, diesel costs, and exchange rates across 12 vegetable varieties and 14 market centres from 2013 to 2019. A gradient-boosted ensemble model (XGBoost and LightGBM) was trained and optimised using Optuna, and unified and season-specific configurations were compared. Results show that season-specific models improve within-season fit, with the Yala-specific model achieving the highest R2 of 0.9420 (95% CI [0.690, 1.000]), while the unified model delivers the best overall predictive accuracy of 90.84% (95% CI [88.34%, 91.52%]) and an R2 of 0.9281 (95% CI [0.760, 1.000]). Notably, the unified model maintains 85.96% accuracy on a completely unseen 2024 hyperinflationary period without retraining, successfully tracking major price surges. These findings suggest that agricultural price movements in import-constrained markets are meaningfully predictable when models capture supply-chain dynamics, offering practical value for early warning and decision making by farmers, traders, and policymakers. Existing studies on Sri Lankan vegetable prices are confined to Autoregressive Integrated Moving Average (ARIMA) and Generalized Autoregressive Conditional Heteroskedasticity (GARCH) applied to single markets, with no supply-chain features, seasonal segmentation, or cross-regime validation.

[LG-82] KrishokChat: A Citation-Grounded Dataset and Benchmark for Bengali Agricultural Advisory

Link: https://arxiv.org/abs/2606.29243
Authors: Khan Raiyan Ibne Reza,Omar Ibne Shahid
Categories: Machine Learning (cs.LG)

Abstract:We present KrishokChat, the first citation-grounded Bengali agricultural instruction-tuning dataset for crop advisory in low-resource settings. We establish a foundation of 290 hierarchical Knowledge Nodes, extracting disease symptoms, management practices, chemical dosages, and verbatim citations from 129 domain-filtered agricultural manuals. Every training instance inherits a verified citation header, guaranteeing 100% citation provenance. Using a Partitioned Seed Generation Matrix, these nodes are expanded into 139,200 supervised fine-tuning pairs, and augmented with 5,300 chemical safety and 1,000 adversarial safety instances, yielding 145,500 QA pairs across 18 crop categories. To evaluate real-world performance, we introduce the Farmer Benchmark, comprising 1,001 authentic farmer queries curated from field surveys and digital portals. Empirical evaluation on Gemma-4-E2B reveals that while fine-tuning on KrishokChat vastly improves structured formatting, standalone models still struggle with exact chemical dosage generalization. This highlights the dataset’s true value as a verified knowledge base for retrieval-augmented generation (RAG) rather than mere parametric memorization. All data, code, and benchmarks are released under CC-BY-4.0.

[LG-83] Towards Evaluating Data Priors for Tabular Foundation Models

Link: https://arxiv.org/abs/2606.29241
Authors: Zeynep Türkmen,Kürşat Kaya,Alexander Pfefferle,Frank Hutter
Categories: Machine Learning (cs.LG)

Abstract:Data-generating priors are a central component of tabular foundation models because they define the task distribution used during pretraining. However, priors are rarely evaluated as independent components, making it difficult to understand how much they affect downstream model behavior. This raises a methodological question: how can priors from different tabular foundation models be compared independently of the architectures and training protocols they were introduced with? To study this question, we implement a unified interface for publicly available priors from recent tabular foundation models and priors constructed from real datasets. We generate training tasks from each prior, train the same model architecture under a fixed training protocol, and evaluate the resulting models on shared downstream classification tasks. We compare priors through both generated-task statistics and downstream predictive performance. Our results show that different priors favor different downstream behaviors, with some achieving stronger absolute performance and others exhibiting more consistent relative rankings across datasets. We further find that data-level similarity only partially explains downstream behavior. Our code is available at this https URL.

[LG-84] Blackknife: Hard-Label Query-Limited Black-Box Attacks on Heterogeneous Graph Neural Networks

Link: https://arxiv.org/abs/2606.29240
Authors: Honglin Gao,Junhao Ren,Lan Zhao,Yue Yang,Jindong Chang,Gaoxi Xiao
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI)

Abstract:Heterogeneous graph neural networks (HGNNs) have achieved strong performance in modeling complex graph-structured data with multiple node and relation types. However, their robustness under realistic black-box adversarial settings remains insufficiently explored. Existing attacks on HGNNs usually assume access to model gradients, soft prediction scores, or the complete graph structure, which is often unavailable when HGNN-based services are deployed as closed systems. In this paper, we propose Blackknife, a hard-label, query-limited, and structure-limited black-box evasion attack framework for heterogeneous graph neural networks. Blackknife assumes no access to the victim model architecture, parameters, gradients, logits, confidence scores, or the full graph structure. Instead, it only relies on locally observable one-hop heterogeneous structures and a small number of hard-label queries. To generate effective perturbations under these strict constraints, Blackknife first constructs a local relation-aware surrogate model from observable heterogeneous neighborhoods. It then relaxes discrete edge addition and deletion operations into continuous soft weights and optimizes them through projected gradient descent. Finally, the optimized perturbations are discretized into relation-preserving structural rewiring operations and verified using limited hard-label feedback from the victim model. Extensive experiments on three benchmark heterogeneous graph datasets, including ACM, DBLP, and IMDB, demonstrate that Blackknife consistently achieves strong attack success rates against representative HGNN models. The results further show that Blackknife remains effective under topology-based defense strategies, revealing the vulnerability of HGNNs to local structure-limited black-box attacks.

[LG-85] On the Policy Gradient Foundations of Group Relative Policy Optimization: Credit Assignment, Gradient Sparsity, and Rank Collapse

Link: https://arxiv.org/abs/2606.29238
Authors: Amritansh Mishra,Supriyo Chakraborty,Berkcan Kapusuzoglu
Categories: Machine Learning (cs.LG)

Abstract:Group Relative Policy Optimization (GRPO) eliminates the learned critic in PPO by using the mean reward of grouped rollouts as a baseline. We provide a rigorous derivation of GRPO from first principles of the policy gradient theorem, revealing a fundamental credit assignment failure: under output-only reward, every token in a rollout receives identical advantage, collapsing token-level credit to a single scalar. We prove this induces gradient sparsity that intensifies over training, and demonstrate empirically via SVD analysis of GRPO gradients on Nemotron-4B/GSM8K that the gradient matrix has effective rank $\approx 2$ regardless of group size $R \in \{2, 4, 8\}$. We formalize this as an intrinsic rank-2 structure arising from the zero-sum constraint on advantages and derive conditions under which GRPO’s baseline is optimal. Our results characterize when GRPO’s simplicity is theoretically justified and identify the credit assignment bottleneck as the key limitation for multi-step reasoning.
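
The mean-baseline advantage described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names and the toy rewards are hypothetical, and the standard-deviation normalization used in many GRPO implementations is omitted because the abstract only mentions the mean baseline.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Subtract the group mean reward; the results sum to zero by construction."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean()

def broadcast_to_tokens(advantages, lengths):
    """Under output-only reward, every token of a rollout gets the same scalar."""
    return [np.full(n, a) for a, n in zip(advantages, lengths)]

# Toy group of four rollouts with binary outcome rewards (illustrative values).
rewards = [1.0, 0.0, 0.0, 1.0]
lengths = [5, 3, 7, 4]  # number of generated tokens per rollout

adv = group_relative_advantages(rewards)
token_adv = broadcast_to_tokens(adv, lengths)
print(adv)        # one scalar per rollout
print(adv.sum())  # zero-sum constraint on advantages
```

Because each per-token array is constant, the token-level credit signal carries no more information than one scalar per rollout, which is the collapse the abstract refers to.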

[LG-86] Depth Exploration for LLM Decoding

Link: https://arxiv.org/abs/2606.29223
Authors: Weisi Yang,Zipeng Sun,Stephen Xia
Categories: Machine Learning (cs.LG)

Abstract:Autoregressive LLM decoding evaluates every generated token through the full layer stack, even though many tokens become predictable at intermediate depths. Existing lossless depth-adaptive methods exploit this redundancy by choosing a single non-final exit depth and verifying its prediction with the final-depth model. However, our measurements show that this selection-based strategy leaves substantial headroom: choosing an exit too late wastes computation, while choosing one too early triggers fallback and discards dependent drafts. We propose Depth Exploration Decoding (DEX), a lossless decoding algorithm that replaces single-depth selection with parallel exploration over multiple candidate depths. At each commit position, DEX validates candidates against the final-depth reference, commits exactly the final-depth token, and collapses the exploration lattice to retain only reusable branch states. This expand–commit–collapse procedure preserves equivalence to standard autoregressive decoding while reducing the cost of committing each token. Across early-exit-trained and standard LLMs, DEX outperforms representative depth-selection baselines and achieves competitive end-to-end throughput against speculative and distributed decoding methods. Moreover, DEX improves as the explored depths become finer, showing that parallel depth exploration provides a scalable way to exploit the underused depth axis of LLM decoding.

[LG-87] A Linear Matching Bandit Approach to Online Multi-Human Multi-Robot Teaming

Link: https://arxiv.org/abs/2606.29221
Authors: Yaohui Guo,X. Jessie Yang,Cong Shi
Categories: Machine Learning (cs.LG)

Abstract:We address the problem of online multi-human multi-robot teaming through the lens of a linear matching bandit framework, where a learner assigns robots with unknown features from a fixed pool to distinct sets of human agents over multiple rounds. To solve this problem, we propose LinMatch, an online learning algorithm that updates the confidence intervals of the unknown features and makes the optimistic matching under uncertainty. The contributions and novelty of this work are twofold. First, we recast the optimistic matching problem in each round as a linear program of maximum weighted matching, efficiently solvable by the celebrated Hungarian algorithm. Second, we provide novel bounds for matching with linear feature problems, showing an upper bound of $\tilde{O}(d\sqrt{MKT})$ and a minimax lower bound of $\Omega(d\sqrt{MKT})$, establishing a tight optimal regret rate of $\tilde{\Theta}(d\sqrt{MKT})$. This demonstrates that LinMatch achieves strictly optimal achievable regret with respect to the total number of rounds $T$, the feature dimension $d$, and the matching parameters $M$ and $K$. The proposed algorithm and bounds apply to a wide range of matching problems with applications beyond human-robot matching, such as housing allocation, recommendation systems, and more.

[LG-88] Bayesian Best-Arm Identification with Abstention: A Polynomial-to-Exponential Phase Transition

Link: https://arxiv.org/abs/2606.29203
Authors: Yuqi Huang,Yunlong Hou,Vincent Y. F. Tan
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)

Abstract:We study the Bayesian fixed-budget best-arm identification problem in which a learner can abstain from making a terminal recommendation. Subject to an abstention budget $\alpha$, we analyze the probability of undetected error, that is, the risk of recommending a suboptimal arm without abstaining. Our central finding is that abstention induces a phase transition: without abstention, the error probability decays polynomially in the sampling budget $T$; in contrast, introducing any small positive abstention budget shifts this to an exponential decay. For Gaussian priors and rewards, in the regime $T\to\infty$ followed by $\alpha\downarrow 0$, we establish exact matching information-theoretic lower bounds and algorithmic upper bounds on the optimal error exponent, which takes the form $\exp\left(-\frac{\alpha^2 T}{8\kappa_\nu^2}\right)$. The hardness parameter $\kappa_\nu$ represents the prior density of the top-two gap at zero, highlighting that nearly tied instances drive the fundamental error. We introduce an adaptive algorithm, PGWS, that successfully achieves this optimal exponent by expending its abstention budget on statistically ambiguous instances. We further demonstrate that this polynomial-to-exponential improvement is exclusively a Bayesian phenomenon: in the frequentist setting, abstention only affects lower-order exponent terms. We also extend our results beyond the Gaussian model.

[LG-89] BrainRiem: Riemannian Prototype Learning for Source-Free Cross-Site Brain Network Diagnosis ECCV2026

Link: https://arxiv.org/abs/2606.29200
Authors: Kunyu Zhang,Tianxiang Xu
Categories: Machine Learning (cs.LG)
Notes: Accepted by ECCV 2026

Abstract:Multi-site functional MRI (fMRI) studies are essential for robust neuropsychiatric diagnosis yet suffer severe domain shifts from scanner heterogeneity, demographics, and site-specific acquisition protocols. Traditional domain adaptation requires concurrent source and target data access, violating clinical privacy regulations. Moreover, functional connectivity matrices lie on the Symmetric Positive Definite (SPD) manifold, where Euclidean operations cause geometric distortions corrupting diagnostic patterns. We propose BrainRiem, a source-free domain adaptation framework learning compact Riemannian brain prototypes via manifold-aware bi-level optimization. It employs the Log-Euclidean Metric to ensure prototypes remain valid SPD matrices, while Dirichlet Energy spectral calibration aligns their frequency characteristics with real brain networks. Only anonymized prototypes are transmitted to target sites, serving as stable anchors for training local models without source data access and reducing leakage under the evaluated attacks. Comprehensive experiments on ABIDE and REST-meta-MDD show BrainRiem consistently outperforms state-of-the-art source-free, traditional, and graph domain adaptation methods across diverse scanners and demographics. Notably, learned prototypes exhibit biologically interpretable connectivity patterns aligning with established neuroscience findings, validating the necessity of Riemannian geometry for brain network analysis.
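
To illustrate why the Log-Euclidean Metric keeps results on the SPD manifold, here is a minimal sketch of the log-Euclidean mean of a few SPD matrices. It is only an illustration of the metric, not BrainRiem's prototype learning. All function names and the random test matrices are hypothetical.

```python
import numpy as np

def spd_log(S):
    """Matrix logarithm of an SPD matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T

def sym_exp(L):
    """Matrix exponential of a symmetric matrix; the result is always SPD."""
    w, V = np.linalg.eigh(L)
    return (V * np.exp(w)) @ V.T

def log_euclidean_mean(mats):
    """Average in the log domain, then map back to the SPD manifold."""
    logs = [spd_log(S) for S in mats]
    return sym_exp(np.mean(logs, axis=0))

rng = np.random.default_rng(0)

def random_spd(n):
    """Hypothetical stand-in for a functional connectivity matrix."""
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

mats = [random_spd(4) for _ in range(5)]
prototype = log_euclidean_mean(mats)
print(np.linalg.eigvalsh(prototype).min())  # strictly positive
```

Averaging happens in the space of symmetric matrices, where ordinary Euclidean operations are safe, and the exponential map guarantees positive eigenvalues on the way back.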

[LG-90] BaRA: Bayesian Adaptive Rank Allocation for Parameter-Efficient Fine-Tuning

Link: https://arxiv.org/abs/2606.29184
Authors: Zhibin Duan,Yuhong Wang,Jiahong Fu,Zongsheng Yue,Bo Chen,Zongben Xu
Categories: Machine Learning (cs.LG)

Abstract:While Low-rank adaptation (LoRA) enables highly efficient fine-tuning by constraining task-specific updates to fixed low-rank subspaces, this rigid design limits representational flexibility and often results in overconfident predictions and miscalibrated uncertainty, especially in low-data regimes. Recent Bayesian LoRA variants improve uncertainty estimation by modeling posterior distributions over adaptation parameters. However, these approaches typically rely on fixed or heuristically determined ranks, overlooking the inherently context-dependent nature of adaptation capacity. In this paper, we propose BaRA, a Bayesian Adaptive Rank Allocation framework for parameter-efficient fine-tuning. Drawing inspiration from probabilistic topic models, BaRA dynamically allocates adaptation capacity by activating a sparse, context-dependent subset of disentangled latent factors, enabling instance-wise variation in effective rank. This Bayesian formulation provides principled, data-driven capacity control, mitigating over-parameterization while preserving expressiveness. Beyond the modeling contribution, we provide a complexity-theoretic generalization analysis showing that the generalization gap of BaRA depends on the learned joint effective rank $\bar{s}_{\Phi,\theta}$ induced by the global-local gate, rather than the maximum rank $r$. This result explains why sparse adaptive rank allocation can reduce the effective hypothesis complexity while preserving input-dependent expressiveness. Extensive experiments on diverse natural language benchmarks demonstrate that BaRA consistently improves predictive performance, robustness, and uncertainty calibration compared to standard LoRA and existing Bayesian LoRA variants.

[LG-91] Dead-Direction Conditioners: Gauge-Equivariant Preconditioning for Deep Networks

Link: https://arxiv.org/abs/2606.29176
Authors: Tejas Pradeep Shirodkar
Categories: Machine Learning (cs.LG); Differential Geometry (math.DG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Notes: 69 pages, 28 figures, 9 tables. Builds the gauge-equivariant preconditioner left open in arXiv:2606.05957

Abstract:A deep network’s loss is invariant to continuous symmetries of its parameters: the logit shift, the ReLU rescaling, the LayerNorm scale, the per-head attention rotation. Adam’s per-coordinate preconditioner drifts along each symmetry orbit, which pulls the trajectory off the symmetry quotient where the optimization lives and blurs the singular-learning rate the quotient makes readable. We build DDC, a Dead-Direction Conditioner that lifts a base optimizer into a $G$-equivariant one: it conditions the optimizer’s state in the orbit decomposition of a $G$-invariant metric, so the trajectory stays a preconditioned gradient flow on the quotient $\bar{\Theta} = \Theta/G$. The construction carries four architectural gauges (cross-entropy shift, ReLU and SwiGLU rescaling, LayerNorm and RMSNorm scale, and a per-head $O(d_{\rm head})$ attention rotation matched to RoPE), proves exactly equivariant on an Adam base, and composes with a Muon base through a gauge-equivariant orthogonaliser. Respecting the symmetry changes both the minimum the optimizer reaches and what it leaves measurable there. On a language model trained past the point of fit, DDCAdam resists the over-training collapse AdamW falls into, holding a validation-train loss gap of 0.67 against 5.88, and reads the dead-direction rate in 32 of 65 layer-by-observable cells where AdamW reads it in 7. A vision transformer trained from scratch reaches lower validation loss (1.71 against 2.12) while compressing spare feed-forward capacity a matched AdamW leaves intact. On a Muon base, where the rotation gauge composes exactly, DDCMuon groks ten of eleven seeds at depth 24 that a plain Muon never reaches. Built into the optimizer, a network’s gauge symmetry sharpens the minimum it finds and turns that minimum’s geometry into something the trajectory can measure.

[LG-92] GLACIER: Rethinking Mass Spectrum Prediction as an Object Detection Problem

Link: https://arxiv.org/abs/2606.29161
Authors: Rui-Xi Wang,Runzhong Wang,Connor W. Coley
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)

Abstract:Predicting tandem mass spectra (MS/MS) from molecular structures represents a central task in analytical chemistry with direct relevance to clinical metabolomics, systems biology, and adjacent disciplines. In this work, we revisit the problem through the lens of object detection on molecular graphs. Molecular fragmentation, a central step in MS/MS prediction, can be approximated as detecting a set of subgraphs (i.e., fragments) and their associated spectral contributions. Existing fragment-based models follow a two-stage paradigm – first generating candidate fragments and then scoring them – analogous to two-stage R-CNNs in computer vision. Towards higher accuracy and faster inference, we introduce GLACIER, a single-stage transformer-based fragment detection neural network for molecular graphs. This unified formulation eliminates the need for candidate enumeration, enabling scalable and globally consistent modeling of molecular fragmentation. GLACIER is faster and more accurate than existing state-of-the-art by a significant margin, achieving 70.0% and 69.7% Top-1 retrieval accuracy with and without contrastive finetuning on the MassSpecGym dataset (from the previous SOTA of 64.0%) and 52.5% and 38.5% respectively on the NIST’20 dataset (from 33.2%). Furthermore, GLACIER provides nearly 8-fold inference speedup over our prior two-stage model. Code is available at this https URL

[LG-93] How Token Influence Decays with Distance: A Green-Function View of Trained Language Models

Link: https://arxiv.org/abs/2606.29139
Authors: Matthias Brändel,Stephan Köhler,Oliver Rheinbach
Categories: Machine Learning (cs.LG)
Notes: 14 pages, 4 figures

Abstract:We study how the next-token prediction of an autoregressive Transformer language model changes under small perturbations of earlier input token embeddings. Motivated by operator learning and iterative solvers for differential equations, we investigate how the influence of one token on another decays with distance in a trained model. In multilevel methods for differential equations, such as domain decomposition, multigrid, and multilevel preconditioning, one often exploits a separation between strong local interactions and weaker but essential global interactions. The latter correspond to the long tail of the Green’s function and are typically handled by a coarse-level operator. Inspired by this perspective, we compute an empirical, distance-resolved gradient profile of token dependencies using autograd. Experiments on trained Pythia models and Qwen2.5-0.5B show that, over the measured distance range, the median Jacobian sensitivity is much better described by a power-law-type decay than by an exponential alternative: the diagonal-normalized profile is well described by $\overline{G}(r) \approx \gamma+\beta(r+1)^{-p}$ with exponents $p \approx 0.7$–$0.9$ (typically $0.8$–$0.9$). This behavior appears on coherent text from Gutenberg and WikiText-103. Token-shuffling experiments show that the power-law profile persists even when syntax and prediction quality collapse, whereas randomly initialized models do not exhibit it. The slowly decaying long-range sensitivity thus appears to be a learned property of trained autoregressive Transformer operators. These findings suggest that hierarchical or coarse-level mechanisms in language models may be able to exploit the long-tailed sensitivity profiles.
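
The comparison between a power-law and an exponential decay profile can be sketched on synthetic data. This is a hedged illustration, not the authors' fitting procedure. The profile is generated from the stated form with made-up parameters, the offset `gamma` is assumed known, and both candidate models are fitted by a straight line in the appropriate log coordinates.

```python
import numpy as np

def linear_fit_sse(x, y):
    """Least-squares line through (x, y): returns slope, intercept, squared error."""
    slope, intercept = np.polyfit(x, y, 1)
    sse = float(np.sum((slope * x + intercept - y) ** 2))
    return slope, intercept, sse

# Synthetic sensitivity profile G(r) = gamma + beta * (r + 1)^(-p); values are illustrative.
gamma, beta, p = 0.01, 0.5, 0.8
r = np.arange(512, dtype=float)
G = gamma + beta * (r + 1.0) ** (-p)

log_tail = np.log(G - gamma)  # assumes the offset gamma is known

# Power law: log(G - gamma) is linear in log(r + 1) with slope -p.
pow_slope, pow_icpt, sse_pow = linear_fit_sse(np.log(r + 1.0), log_tail)
# Exponential alternative: log(G - gamma) would be linear in r.
_, _, sse_exp = linear_fit_sse(r, log_tail)

p_hat, beta_hat = -pow_slope, float(np.exp(pow_icpt))
print(p_hat, beta_hat, sse_pow < sse_exp)
```

On measured profiles the offset would have to be estimated as well, for example by a nonlinear least-squares fit over all three parameters.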

[LG-94] An Integrated Two-Stage Deep-Learning Tool for Rapid Post-Hurricane Damage Identification and Repair Scheduling

Link: https://arxiv.org/abs/2606.29117
Authors: Hooman Torkaman,Ellis Oti Boateng,Jignesh Solanki,Anurag Srivastava
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Notes: 6 pages, 2 figures, 5 tables; submitted to the 2026 North American Power Symposium (NAPS 2026)

Abstract:Post-hurricane damage assessment and repair scheduling can require computationally intensive simulation and optimization. This paper presents an integrated two-stage deep-learning tool for rapid damaged-line identification and repair-schedule computation. An available offline synthetic dataset for the IEEE 9500-node test feeder contains 1,700 hurricane scenarios with exposure features, grid metadata, fragility parameters, OpenDSS outputs, damaged-line labels, and Adaptive Large Neighborhood Search reference schedules. Stage 1 benchmarks MLP, ResMLP, and GraphSAGE, while Stage 2 compares MLP, DeepSets, and Set Transformer. The selected ResMLP-Set Transformer pipeline propagates Stage 1 errors into Stage 2 and achieves a damaged-job F1-score of 0.920, pairwise order agreement of 0.854, and start- and end-time mean absolute errors of 4.349 min and 4.486 min, respectively. The tool provides rapid initial repair-log decision support for new hurricane cases.

[LG-95] A Novel Latent-Class Attack and its Detection by Class Subspace Orthogonalization

Link: https://arxiv.org/abs/2606.29112
Authors: Guangmingmei Yang,David J. Miller,George Kesidis
Categories: Machine Learning (cs.LG)

Abstract:Deep learning, which in general relies on voluminous amounts of training data, is vulnerable to data poisoning attacks, including error-generic attacks and backdoors (Trojans). In this work, we propose a new data poisoning attack we dub a latent class attack. Here, all poisoned examples are from a class that is novel (unknown) for the given classification domain and are mislabeled to one of the known classes (the target class) of the domain, so that the model learns to recognize the novel class as a sub-class of the target class. Such attacks could be used e.g. to defeat AI-based access control systems, or could cause a “foe” to be classified as a “friend”. We also propose a post-training defense to detect this attack, without any access to the training set. This detection approach builds on “class subspace orthogonalization” (CSO), a plug-and-play paradigm demonstrated to improve existing backdoor detectors. Here, CSO is used to seek an input (a putative unknown class instance) whose internal representation is not aligned with any of the known classes, and yet which is classified with confidence to one of these classes. Finally, specific to image classification domains, we propose a method for visualizing the estimated unknown class instance, providing explainability to our latent class detections.

[LG-96] Few-Step Boltzmann Generators via Scalable Likelihood Flow Maps

Link: https://arxiv.org/abs/2606.29110
Authors: RuiKang OuYang,Hanlin Yu,Xinyue Ai,Yutong He,Nicholas M. Boffi,Pradeep Ravikumar,Jose Miguel Hernandez-Lobato,Max Simchowitz,Benjamin Kurt Miller,Omar Chehab
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Recent progress in flow-based generative modeling has led to models that output high-quality samples while using only a small number of function evaluations. However, at present, there is a lack of similar advances in estimating the model likelihood. In particular, most existing methods either rely on restrictive architectures that enable exact calculations, or use stochastic approximations such as Hutchinson’s trace estimator that introduce substantial variance. In this work, we introduce SCAlable LikeLihood distillation of flOw maPs (SCALLOP). SCALLOP builds on the recently proposed F2D2, a likelihood flow map model that can generate samples and their densities in a small number of function evaluations. While F2D2 uses Hutchinson’s estimator during training, we introduce an alternative and more scalable likelihood distillation objective that is Hutchinson-free and admits a vectorized formulation. Empirically, we demonstrate the effectiveness of SCALLOP as a Boltzmann generator in molecular science, and further validate its benefit on image datasets. SCALLOP significantly reduces both training variance and training time while consistently improving performance compared to F2D2, and is competitive with the state-of-the-art while achieving up to 10x inference speedup over the fastest baseline.

[LG-97] DiLaServe: High SLO Attainment Serving for Diffusion Language Models

Link: https://arxiv.org/abs/2606.29094
Authors: Tzu-Tao Chang,Benjamin Yuanyang Hong,Kiet Pham,Shivaram Venkataraman
Categories: Machine Learning (cs.LG)

Abstract:Diffusion language models (DLMs) have recently emerged as a promising alternative to conventional autoregressive language models. By generating multiple tokens in parallel during each denoising step, they offer higher inference throughput while maintaining competitive quality. However, realizing these throughput gains while meeting latency SLOs in a serving system requires addressing challenges introduced by DLMs’ unique characteristics. These include navigating the speed-quality tradeoff created by confidence-based denoising, choosing appropriate parallelization levels across model instances under fluctuating load, and coordinating approximate KV caching mechanisms that introduce non-uniform per-step costs. To address these challenges, we present DiLaServe, a cluster-level serving system for DLMs. DiLaServe enables deadline-aware scheduling and adaptive load control through confidence-threshold adjustment, and dynamically reconfigures the cluster by solving a quality-aware optimization problem, while explicitly modeling the step-level heterogeneity introduced by approximate KV caching. Across multiple benchmarks and real-world traces, DiLaServe improves SLO attainment by up to 56.6 percentage points and reduces end-to-end request latency by up to 46% while incurring less than 1% accuracy drop.

[LG-98] Residual-Guided Dictionary Learning for Spectrally Accurate Koopman Approximation

Link: https://arxiv.org/abs/2606.29083
Authors: George Coote,Matthew J. Colbrook
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG); Dynamical Systems (math.DS)

Abstract:Koopman theory promises linear structure in nonlinear dynamics, but numerical Koopman spectra are easy to compute and hard to trust. A finite EDMD matrix always has eigenvalues; the problem is that many of them may have nothing to do with the infinite-dimensional operator. In this paper we make spectral reliability the objective of dictionary learning. We train neural-network dictionaries not merely to predict the next snapshot, but to minimize Residual Dynamic Mode Decomposition residuals: operator-level a posteriori errors that test whether computed eigenvalues and modes are genuine Koopman spectral objects. To keep the learned observables from collapsing into an unstable coordinate system, the loss also penalizes the condition number of the lifted data matrix. Thus the method couples two requirements that should not be separated: small Koopman residuals and a well-conditioned representation. The result is a learned dictionary that is expressive, numerically stable, and spectrally disciplined. Across conservative and dissipative benchmark systems, the method sharply reduces spectral pollution, improves residual pseudospectral inclusion, and lowers forecast error relative to standard fixed dictionaries. On sea-surface temperature data, it gives cleaner Koopman diagnostics and substantially better one-step forecasts from noisy observations with no governing equations. The message is simple: neural Koopman learning should be judged not by prediction alone, but by whether its spectral claims can be certified. Residuals provide the certificate; conditioning makes it computable.
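
The two quantities this abstract couples, the Koopman residual of a computed eigenpair and the conditioning of the lifted data, can be sketched with a fixed dictionary. This is a minimal sketch assuming equal quadrature weights and a linear toy system, not the authors' neural dictionary training. The function names and the matrix `A` are hypothetical.

```python
import numpy as np

def edmd(psi_x, psi_y):
    """EDMD matrix K solving psi_x @ K ~= psi_y in the least-squares sense."""
    K, *_ = np.linalg.lstsq(psi_x, psi_y, rcond=None)
    return K

def resdmd_residuals(psi_x, psi_y, eigvals, eigvecs):
    """Relative residual ||Psi_Y g - lambda Psi_X g|| / ||Psi_X g|| per eigenpair."""
    lifted = psi_x @ eigvecs
    num = np.linalg.norm(psi_y @ eigvecs - lifted * eigvals, axis=0)
    den = np.linalg.norm(lifted, axis=0)
    return num / den

# Toy linear dynamics x_{k+1} = A x_k, with the identity dictionary (assumption).
rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1],
              [0.0, 0.5]])
X = rng.standard_normal((200, 2))  # snapshots, one per row
Y = X @ A.T                        # their images under the dynamics

psi_x, psi_y = X, Y
K = edmd(psi_x, psi_y)
eigvals, eigvecs = np.linalg.eig(K)

residuals = resdmd_residuals(psi_x, psi_y, eigvals, eigvecs)
condition_number = np.linalg.cond(psi_x)
print(residuals, condition_number)
```

For this linear system the dictionary spans an invariant subspace, so the residuals are at round-off level. A learned dictionary would be trained to drive the same residuals down while keeping the condition number of the lifted data matrix moderate.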

[LG-99] When Can Conformal Risk Control Certify LLM Outputs? Bounds, Impossibility, and Adaptation for Structured Generation

Link: https://arxiv.org/abs/2606.29054
Authors: Varun Kotte
Categories: Machine Learning (cs.LG)

Abstract:Large language models (LLMs) deployed for structured generation (NER, JSON extraction, QA, and classification) lack formal reliability guarantees, and standard heuristic abstention policies miss user-specified risk targets by 7.5–12.5%. We characterize when conformal risk control (CRC) can certify structured LLM outputs and when it provably cannot. First, we prove an impossibility result: when the base risk $\mu > \alpha$, any distribution-free method must abstain on at least a fraction $(\mu-\alpha)/(1-\alpha)$ of examples, yielding a closed-form feasibility test: one can check whether CRC will work before running it. Second, we analyze a certification hierarchy across Hoeffding, empirical Bernstein, and a betting-based e-CRC bound, with strict gains in low-variance/large-sample regimes: the Hoeffding-to-Bernstein step delivers the largest gain (+37% certified configurations), while e-CRC adds value when calibration data is scarce (10% certification at 20% data versus 0% for Hoeffding). Third, we validate adaptive conformal inference (ACI) under cross-dataset shift, reducing risk-target violations from 71% to 21%, with residual failures concentrated exactly where the impossibility bound predicts. Across six open-weight models (3B–72B parameters), eight datasets, four tasks, and six nonconformity scores, hard NER/QA/CLS configurations are uncertifiable at $\alpha = 0.10$; relaxing to $\alpha = 0.30$–$0.40$ unlocks practical certification (47% NER, 40% QA, 60% CLS). The framework gives a three-step deployment recipe: check feasibility, select the bound and score, then mitigate shift.
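
The closed-form feasibility test mentioned in this abstract can be sketched as a small helper. The sketch assumes the bound applies when the base risk `mu` exceeds the target `alpha`, and that it is read as a fraction of examples. The function name and the example numbers are hypothetical.

```python
def min_abstention_fraction(mu: float, alpha: float) -> float:
    """Lower bound (mu - alpha) / (1 - alpha) on the abstention rate.

    Returns 0.0 when the base risk already meets the target (mu <= alpha).
    """
    if not (0.0 <= mu <= 1.0):
        raise ValueError("mu must lie in [0, 1]")
    if not (0.0 <= alpha < 1.0):
        raise ValueError("alpha must lie in [0, 1)")
    return max(0.0, (mu - alpha) / (1.0 - alpha))

# Illustrative check before calibration: base risk 30%, target risk 10%.
needed = min_abstention_fraction(mu=0.30, alpha=0.10)
print(f"must abstain on at least {needed:.1%} of examples")
```

A deployment could compare this value against the abstention rate it can tolerate and skip calibration when the bound already exceeds it.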

[LG-100] A Kernel Fisher Discriminant Analysis-Based Tree Ensemble Classifier: KFDA Forest

Link: https://arxiv.org/abs/2606.29053
Authors: Donghwan Kim,Seung Hwan Park,Jun-Geol Baek
Categories: Machine Learning (cs.LG)
Notes: 11 pages, 4 figures, 6 tables; author-created manuscript version of the published article

Abstract:In general, an ensemble classifier is more accurate than a single classifier. In this study, we propose an ensemble classifier called the kernel Fisher discriminant analysis forest (KFDA Forest), which is a tree-based ensemble method that applies KFDA. To promote diversity, bootstrap is used, and variable sets are randomly divided into K subsets. KFDA is performed on each subset to increase classification accuracy. KFDA maximizes the distance between classes while minimizing the distance within classes. KFDA can also be applied to classification problems in a nonlinear data structure using the kernel trick because it can transform the input space into a kernel feature space, commonly named a rotation, rather than performing a dimensionality reduction. Because new feature axes and KFDA projections are parallel, decision trees are used as a base classifier. To compare the proposed method with existing ensemble methods, we apply these to real datasets from the UCI and KEEL repositories.

[LG-101] MOSAIC: Orchestrating Collaborative Knowledge Tracing with Hierarchical Semantic Alignment

Link: https://arxiv.org/abs/2606.29049
Authors: Xinjin Li,Mengyue Wang,Yuzhen Lin,Pengbin Feng,Ziqi Sha,Yeyang Zhou,Yu Ma
Categories: Machine Learning (cs.LG)

点击查看摘要

Abstract:Knowledge Tracing (KT) is important for personalized education but traditionally suffers from two key limitations: a reliance on shallow ID-based representations that neglect semantic depth and a restriction to single-granularity mastery estimation that overlooks hierarchical knowledge dependencies. To address these challenges, we propose MOSAIC (Multi-granularity Online Semantic AI for Collaborative Knowledge), a novel framework that orchestrates LLM-driven semantic alignment with sequential modeling. Unlike methods that use LLMs solely as predictors, MOSAIC leverages a frozen LLM to generate dynamic, context-aware embeddings and hierarchical prediction prompts, explicitly capturing collaborative signals and peer interactions. Furthermore, we introduce a cross-granularity consistency objective that jointly regularizes mastery estimation across concept, topic-cluster, and global proficiency levels. Extensive experiments on ASSISTments, EdNet, and a newly collected large-scale MOOC dataset demonstrate that MOSAIC establishes new state-of-the-art results. Specifically, our method achieves AUC improvements of up to 3.4% and accuracy gains of up to 2.5% across all benchmarks. Notably, MOSAIC exhibits superior robustness in collaboration-rich environments and long-sequence scenarios (AUC 0.862 on MOOC), offering both high predictive precision and semantically grounded interpretability.

[LG-102] Weak Dominant Balance for Robust Identification of Dynamically Consistent Fluid Flow Structure

Link: https://arxiv.org/abs/2606.29047
Authors: Samuel Ahnert, Esther Lagemann, H. Jane Bae, Kunihiko Taira, Ricardo Vinuesa, Christian Lagemann, Steven L. Brunton
Subjects: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)

Abstract:Extracting interpretable, localized physical mechanisms from complex spatiotemporal data is a foundational challenge across physics, biology, and engineering, but has remained out of reach on real measurements. The central obstacle is obtaining high-quality gradients of data via numerical differentiation, which amplifies noise, diverges for high-order equations, and falters on irregular geometries, limiting the scope of existing approaches to clean simulations of low-order systems. Here, we present weak dominant balance, a derivative-free framework that projects governing equations into a weak (integral) formulation, offloading differentiation onto smooth analytical test functions and leaving the data untouched. The method sustains accurate regime identification under severe noise where existing approaches categorically fail, delivers the first data-driven decomposition of a third-order partial differential equation applied to turbulent duct flow, and produces matching decompositions across direct numerical simulation and particle-image velocimetry measurements of a wavy channel flow, uncovering a previously uncharacterized dynamical regime. Weak dominant balance brings mechanism-level analysis out of simulation and onto measured data, and opens complex physical systems to direct, equation-grounded interpretation.

[LG-103] How Far Can Sharpness and Complexity Jointly Explain Generalization?

Link: https://arxiv.org/abs/2606.29043
Authors: Ziyu Cheng, Xitong Zhang, Longxiu Huang, Rongrong Wang
Subjects: Machine Learning (cs.LG)

Abstract:Sharpness and complexity are two central factors in the generalization analysis of deep neural networks. Existing quantitative evaluations of generalization measures have largely focused on individual scalar measures, leaving the joint explanatory power of sharpness and complexity largely unexplored. This work studies how far sharpness and complexity can jointly explain generalization. We use linear regression and introduce a Pareto-based analysis to quantitatively evaluate the joint explanatory power of these two factors. Beyond the existing parameter-level definitions, we further propose realizations of sharpness and complexity that are closer to function space and less dependent on raw parameter representations. We find that function-oriented definitions of these two quantities expand the explanatory scope of the two-factor view beyond what is achieved by existing parameter-level metrics. Overall, our results support the sharpness-complexity perspective as an informative lens for understanding generalization across diverse settings. At the same time, the remaining failures indicate that whether this two-factor view can serve as a complete theory of generalization remains open.

[LG-104] On Surrogate Modeling of Static Response of AM Short-Fiber Thermoplastics Using Graph Neural Networks

Link: https://arxiv.org/abs/2606.28996
Authors: Pharindra Pathak, Vipin Kumar, Trenton M. Ricks, Suhasini Gururaja, Siddhartha Srivastava (Auburn University, Oakridge National Lab, NASA Glenn Research Center, Auburn University, Auburn University)
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)

Abstract:Short-fiber thermoplastic (SFT) composites are increasingly employed in lightweight aerospace and automotive structures owing to their favorable strength-to-weight ratio, high production rates, and recyclability. Unlike continuous-fiber systems, the mechanical response of SFTs is governed by mesoscale interactions among fiber orientation, spatial clustering, and manufacturing-induced porosity. These features exhibit significant spatial variability in manufactured components and influence stiffness, damage initiation, and nonlinear deformation. Although mesoscale finite element (FE) models can resolve such heterogeneity, their application to realistic three-dimensional microstructures remains computationally intractable. A data-driven surrogate framework is proposed to predict the mechanical behavior of additively manufactured, compression-molded (AM-CM) SFTs. Microstructures reconstructed from micro-computed tomography data were discretized into Voronoi-based cells representing distinct fiber-interaction neighborhoods. Each cell was homogenized via nonlinear FE simulations incorporating matrix damage, and the resulting stress-strain responses trained a hybrid Graph Neural Network-Long Short-Term Memory (GNN-LSTM) architecture encoding microstructural topology and history-dependent mechanical evolution. The surrogate accurately predicts stiffness and stress-strain behavior of unseen microstructures, achieving $R^2 \approx 0.98$ relative to high-fidelity FE simulations with over two orders-of-magnitude reduction in computational cost. Coupling the framework with experimentally calibrated damage laws demonstrates that fiber orientation, clustering, and porosity collectively govern local effective stiffness. The approach provides a physics-informed, data-efficient pathway to identify mechanically weak microstructural cells and accelerate digital-twin development for SFT components.

[LG-105] FlipGuard: Defending Large Language Models Against Quantization-Conditioned Backdoor Attacks ICME2026

Link: https://arxiv.org/abs/2606.28962
Authors: Aoying Zheng, Anqi Du, Zizhuang Deng, Yuxuan Chen
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Accepted by ICME 2026

Abstract:Model quantization is essential for the efficient deployment of Large Language Models (LLMs), but introduces a critical vulnerability: Quantization-Conditioned Backdoor (QCB) attacks. In these attacks, malicious behaviors remain dormant in full-precision models and activate only after specific quantization distortions, bypassing standard security audits. To mitigate this, we introduce FlipGuard, a proactive defense framework that selectively perturbs model weights prior to quantization. By breaking the adversary’s precise alignment between weight patterns and quantization boundaries, FlipGuard suppresses backdoor activation without requiring access to training data or trigger samples. We further propose the Defense Effectiveness Ratio (DER), a unified metric to jointly evaluate security gains, utility preservation, and computational cost. Extensive experiments across seven LLMs (including StarCoder and LLaMA-family models) and three quantization schemes (INT8, FP4, NF4) demonstrate that FlipGuard effectively neutralizes QCBs across three scenarios, i.e., vulnerable code generation, content injection, and over-refusal, achieving high security with negligible performance degradation.

[LG-106] ReGuide: From Test-Time Guidance to Self-Improving Diffusion Policies

Link: https://arxiv.org/abs/2606.28939
Authors: Tzu-Hsiang Lin, Srinivas Shakkottai, Dileep Kalathil, P. R. Kumar
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:Behavior-cloned diffusion policies are expressive but remain vulnerable to covariate shift: small deviations from demonstrated states can compound into task failure. Existing methods address this either by expanding the training distribution through expert corrections or synthetic augmentation, or by steering a frozen policy at test time with guidance from a learned model. The former can be expensive or assumption-dependent, while the latter discards the corrected trajectories after execution. We introduce ReGuide, a self-improving framework that treats guided rollouts as reusable on-policy recovery data. ReGuide first uses Phase-Conditioned Guidance (PCG) to generate corrective rollouts: it constructs phase-specific latent targets, applies guidance only in the drifted-but-recoverable regime, and guides through the estimated clean action to match the dynamics model’s training distribution. Successful guided rollouts are then absorbed back into the policy through ReGuide-FT, which fine-tunes the current checkpoint, or ReGuide-FS, which retrains from scratch on the augmented dataset; the two can also be composed and iterated. On Robomimic Can, Square, Transport, and Tool Hang, ReGuide improves base-policy success by $1.3$–$7.7\times$, outperforms LPB in the test-time-only setting, and matched-data ablations show that the gains come from guided recovery data rather than additional rollouts alone.

[LG-107] Cybersecurity is the True Frontier for Generative AI Success or Failure

Link: https://arxiv.org/abs/2606.28929
Authors: Edward Raff, Maor Ashkenazi, Sagar Samtani, David J. Elkind, Sven Krasser
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: to appear in the 5th Workshop on Rethinking Malware Analysis (WoRMA)

Abstract:Cybersecurity is a real-life test-bed for many machine learning problems at once, especially when considering modern strides in using Large Language Models (LLMs) to automate processes as "agents." Cybersecurity workflows require orchestrating hundreds of standard and bespoke tools through various formats. The scale of cybersecurity data is enormous; for example, a single malware sample can be viewed as a sequence of billions of tokens. Expert labeling of any file is costly and labor-intensive, in part because an adversary (possibly a well-funded nation-state actor) is attempting to subvert your detection methods. Even skilled experts may disagree on the correct label, creating ambiguity in what constitutes ground truth. When deployed, models must run quickly on billions of items a day, where low latency is critical for operational success, in a continuously changing environment. In addition, explainability is not optional: analysts demand clear reasoning for model decisions to cope with the large number of false-positive alerts they face daily, and to quickly develop remediation and understand how something went wrong. In short, the complexity of cybersecurity is greater than that of natural language and computer vision, and thus we posit that cybersecurity is a better test case for general AI progress than other, well-studied fields.

[LG-108] A Theoretical Interpretation of In-Context Learning via Probabilistic Modeling

Link: https://arxiv.org/abs/2606.28926
Authors: Zhenyu Liu, Huaze Tang, Shao-Lun Huang
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG)

Abstract:In-context learning (ICL) is an emerging paradigm that employs the semantic information inherent in large language models (LLMs) for generating answers to user queries. While the remarkable performance of ICL has been widely known, a general modeling and a rigorous theoretical analysis of this paradigm are still lacking. This work presents a probabilistic model for ICL and derives the performance of ICL for both general parametric distributions and exponential families. Based on the derived results, the work explains the impact of multiple factors such as the number of demonstrations, the sensitivity of the probabilistic model to the variation of its parameters, as well as the similarity between the demonstrations and the query on the performance of ICL.

[LG-109] Towards Improved Anomaly Detection for Cloud Cybersecurity via Graph Neural Networks

Link: https://arxiv.org/abs/2606.28923
Authors: Manu Nandan, TJ Jaymes, Michael Brautbar, Edward Raff
Subjects: Machine Learning (cs.LG)
Comments: to appear in the 5th Workshop on Rethinking Malware Analysis (WoRMA)

Abstract:Detecting security threats in an organization’s cloud computing environment has become necessary due to the increased reliance on cloud infrastructure. Logging of all cloud computing events enables investigation into any incidents after they are detected. Automated detection of threats using the logs based on heuristics or anomaly detection could result in a high false positive rate due to its relatively static nature. In this article, we present an industrial case study of a self-supervised learning method using graph neural networks applied to AWS CloudTrail logs to surface suspicious events for analyst review. The model produces an anomaly score for each event and dynamically adapts to changes in the organization without requiring periodic retraining. Based on our experiments across five organizations, the proposed model produced substantially fewer alerts than a domain expert rule-based baseline in almost all cases, reducing alert volumes to approximately 1 per hour from thousands generated by traditional methods. We note that this evaluation covers only flagged events, and false negatives cannot be estimated from the current data; findings should therefore be interpreted as a practical deployment study offering insights into real-world constraints rather than a fully validated detection system. We discuss these limitations and the requirements for extending the approach to other cloud environments as future work.

[LG-110] ML-Powered LDAP Reconnaissance Detection using Weak Supervision KDD

Link: https://arxiv.org/abs/2606.28917
Authors: Shaefer Drew, Edward Raff, Michael Brautbar, Yaron Zinar, Benjamin Malmberg, Dor Agron, Sagi Sheinfeld, Avraham Kama, Asaf Romano
Subjects: Machine Learning (cs.LG)
Comments: to appear in Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Abstract:Lightweight Directory Access Protocol (LDAP) is a protocol that allows users to query and modify Active Directory (AD) data. By default, all users have read access to all AD data through LDAP, making it a common initial tool for reconnaissance when a threat actor first compromises an identity. To capture threat actors early in the reconnaissance phase, we developed two machine learning frameworks to detect LDAP reconnaissance: an ML classifier to predict malicious LDAP queries and an ML-based data-mining method to extract malicious query signatures. By correlating LDAP queries with endpoint detections, the first framework uses weak supervision to label a massive dataset and classify LDAP queries as malicious or benign. For immediate deployment, a second technique was developed on top of this approach to employ a rigorous statistical hypothesis-testing framework for mining novel, malicious LDAP signatures. While this weakly supervised approach is limited compared with manual human labeling, it is more practical for this use case because it leverages large-scale automated corpus construction, reducing costs and time. Ultimately, both the LDAP classifier and the ML-based LDAP signature mining method achieved performance benchmarks, with the classifier achieving up to a 65% True Positive Rate (TPR) on the holdout set while limiting false positives, and mined signatures demonstrating 81.48% field precision with CrowdStrike’s Managed Detection and Response team.

[LG-111] MALOQ: Massively Accelerated Learning of Operators for Quantum Transport

Link: https://arxiv.org/abs/2606.28911
Authors: Manasa Kaniselvan, Alexander Maeder, Denghui Lu, Alexandros Nikolaos Ziogas, Mathieu Luisier
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Distributed, Parallel, and Cluster Computing (cs.DC); Computational Physics (physics.comp-ph)
Comments: 13 pages, 8 figures

Abstract:Machine-learned (ML) operator models can be trained to predict density functional theory (DFT) Hamiltonian/density matrices at significantly reduced computational cost, thus extending electronic-structure calculations to previously unfeasible scales. Here, we introduce MALOQ (Massively Accelerated Learning of Operators for Quantum Transport), an application built to train on and predict electronic-structure matrices for systems made of few to 100k atoms, described by large basis sets, and covering a wide range of atomic elements. Based on a state-of-the-art, SO(2)-equivariant backbone architecture, MALOQ provides (i) custom data-processing kernels to handle high-rank Hamiltonian matrix data and (ii) a scalable edge-wise distribution of atomic graph(s). Trained on the largest molecular Hamiltonian datasets available today, it reduces time-per-epoch by over 30% compared to a molecule-wise-distributed framework, and enables inference on material graphs of arbitrary size. We demonstrate scalable training and inference for 3,000-12,000 atoms on the Alps supercomputer, up to 192 GPUs and 256 GPUs, respectively.

[LG-112] Analysis of Adam Algorithms for Stochastic Dynamic Systems

Link: https://arxiv.org/abs/2606.28879
Authors: Xin Zheng, Yifei Jin, Lei Guo
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:The adaptive moment estimation algorithm, known as Adam, is widely used in modern machine learning, owing to its low per-iteration complexity and strong empirical performance. Despite its prevalent use, the theoretical foundation of Adam remains largely unexplored for time-varying and nonstationary systems. In fact, the existing theoretical analyses of Adam-type algorithms are primarily concerned with time-invariant model parameters and explicitly or implicitly rely on independent and identically distributed (i.i.d.) data assumptions, under which the learning task can be formulated as minimizing a fixed expected objective with a static minimizer. However, such assumptions are often violated in time-varying and nonstationary systems, thereby calling for a theoretical investigation beyond the conventional yet idealized i.i.d. setting. The main objective of this paper is to solve this challenging problem by establishing a general theory of Adam for time-varying and nonstationary stochastic systems. We will introduce some new techniques for analyzing the products of nonstationary and dependent random matrices induced by Adam’s coupled first- and second-moment recursions, and will construct a new stochastic Lyapunov function that blends these two moment dynamics. Under a stochastic excitation condition that allows nonstationary and dependent data, we will derive both parameter tracking and output prediction error bounds explicitly, quantifying the effects of stepsize, first- and second-momentum parameters, gradient noise and parameter drift. These bounds not only provide guarantees for Adam performance, but also provide guidelines for hyperparameter selection. Experiments on both synthetic and real-world data validate our theory and design guidelines.
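
For orientation, the coupled first- and second-moment recursions referred to in this abstract can be sketched as follows. This is the textbook Adam update with bias correction, not necessarily the exact variant analyzed in the paper; the function name `adam_step` and the default hyperparameters are illustrative assumptions.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam update on plain Python lists.

    theta, grad, m, v are equal-length lists; t is the 1-based step counter.
    Returns the updated (theta, m, v).
    """
    # First-moment (momentum) recursion.
    m = [beta1 * mi + (1.0 - beta1) * g for mi, g in zip(m, grad)]
    # Second-moment recursion on squared gradients.
    v = [beta2 * vi + (1.0 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias correction for the zero initialization of m and v.
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
    # Parameter update with a per-coordinate adaptive stepsize.
    theta = [th - lr * mh / (math.sqrt(vh) + eps)
             for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v


if __name__ == "__main__":
    theta, m, v = [1.0], [0.0], [0.0]
    theta, m, v = adam_step(theta, [2.0], m, v, t=1)
    print(theta)  # the first step moves by roughly lr, whatever the gradient scale
```

The stepsize `lr` and the two momentum parameters `beta1` and `beta2` are the quantities whose effect the abstract says the error bounds quantify.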

[LG-113] The Contagion Tensor: A Framework for Measuring Output-Distribution Coupling in Multi-Agent LLM Systems – and Auditing the Claims It Enables

Link: https://arxiv.org/abs/2606.28839
Authors: Zewen Liu
Subjects: Machine Learning (cs.LG)
Comments: 26 pages, 2 figures

Abstract:We introduce the Contagion Tensor, a measurement framework for quantifying how large language model (LLM) output distributions couple across modalities, agents, and time steps. From the tensor we derive the Coupling Amplification Factor (CAF), a family of ratio-based metrics sharing the form $\mathrm{CAF} = \mathbb{E}[T_{\text{condition}}] / \mathbb{E}[T_{\text{baseline}}]$, providing unitless, baseline-referenced measurement with bootstrap confidence intervals. We instantiate CAF in four variants and evaluate the strongest in a complete $2 \times 2 \times 2$ block-orthogonal simulation design with modality-specific ablation. The ablation reveals that an apparent image-condition super-linear effect (CAF = 1.40) collapses to sub-linear (CAF = 0.87) when the image perturbation module is disabled, a shift of -0.53 with zero effect on text conditions. We supplement with real-API experiments across two model families: DeepSeek-Chat (R=30) and GPT-4o-mini (R=15, real vision). Under uniform personas, text-only communication produces CAF $\approx$ 1.0 in both models. Diverse personas drive convergence (CAF = 0.88). A within-model comparison on GPT-4o-mini reveals: C3 (text) CAF = 1.02 vs. C5 (real vision, R=30) CAF = 1.72 [1.700, 1.733], delta = +0.70, validating the simulation’s super-linear image-condition prediction. Of 11 conditions, 5 have been tested on real APIs and 6 remain unverified. Our contribution is two-layered: (1) a measurement instrument that makes output-distribution coupling quantitatively falsifiable; and (2) a transferable ablation protocol that any modular multi-agent simulator can adopt to distinguish genuine coupling from design artifacts.
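
The ratio form of CAF and its bootstrap interval can be illustrated with a short standard-library sketch. The abstract only states that bootstrap confidence intervals are used; the percentile bootstrap below, the independent resampling of the two groups, and the names `caf` and `bootstrap_caf_ci` are assumptions made for illustration, and the inputs stand in for whatever per-run statistic T the paper measures.

```python
import random
from statistics import mean


def caf(condition, baseline):
    """Ratio of sample means, mirroring CAF = E[T_condition] / E[T_baseline]."""
    return mean(condition) / mean(baseline)


def bootstrap_caf_ci(condition, baseline, n_boot=2000, level=0.95, seed=0):
    """Percentile-bootstrap interval for CAF (assumed scheme, see lead-in).

    Each group is resampled with replacement, independently of the other.
    Baseline values are assumed positive so the ratio is well defined.
    """
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        c = rng.choices(condition, k=len(condition))
        b = rng.choices(baseline, k=len(baseline))
        stats.append(mean(c) / mean(b))
    stats.sort()
    lo_idx = int((1.0 - level) / 2.0 * n_boot)
    hi_idx = int((1.0 + level) / 2.0 * n_boot) - 1
    return stats[lo_idx], stats[hi_idx]


if __name__ == "__main__":
    cond = [1.6, 1.8, 1.7, 1.9, 1.65]   # made-up measurements
    base = [1.0, 1.1, 0.9, 1.05, 0.95]  # made-up measurements
    print(caf(cond, base), bootstrap_caf_ci(cond, base))
```

Values above 1 correspond to what the abstract calls a super-linear effect and values below 1 to a sub-linear one.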

[LG-114] Active Quantum Kernel Acquisition for Gaussian Process Regression

Link: https://arxiv.org/abs/2606.28833
Authors: Jian Xu, Delu Zeng, Qibin Zhao
Subjects: Machine Learning (cs.LG)

Abstract:Quantum kernel estimation on near-term hardware is shot-budgeted: every entry of the kernel Gram matrix is a Bernoulli expectation that must be sampled with a finite number of circuit executions. Recent work on quantum kernel classification has shown that allocating shots non-uniformly across kernel entries, weighted by their downstream task sensitivity, can reduce the shot budget required to reach a target accuracy. We extend this idea to Gaussian process (GP) regression, a setting whose downstream quantities (full-spectrum posterior variance, log-determinant, marginal likelihood) couple to kernel error more tightly than the sign-only outputs of classification. We derive three closed-form pair-level sensitivities (predictive coupling $|\alpha_i\alpha_j|$, leave-one-out residual, and marginal-likelihood gradient) and plug them into a Neyman-style minimum-variance allocation rule. To prevent catastrophic over-concentration when the warm-up sensitivity estimate is itself noisy, we add a high uniform coverage floor justified by a Frobenius lower bound on the missing-entry perturbation. On four UCI benchmarks and two synthetic RBF + Bernoulli controlled studies, the resulting allocator delivers 10–21% test-RMSE improvement over uniform allocation across the moderate-budget regime. The gain transfers (i) to genuine ZZ and Pauli-Z quantum kernels on quantum-natural data ($-13$–$15\%$ at low budget, $p < 0.05$ paired) and (ii) to four downstream tasks (Bayesian quadrature, heteroscedastic regression, hyperparameter learning, multi-output Cokriging). On UCI features embedded into a ZZ kernel the gain disappears, consistent with the exponential-concentration regime where shot allocation has nothing to exploit.
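
A Neyman-style allocation with a uniform floor, as described in this abstract, might look roughly like the sketch below. It assumes each kernel entry is a Bernoulli mean with standard deviation sqrt(k(1-k)) and that minimizing the sensitivity-weighted variance gives shots proportional to sensitivity times standard deviation; the function name `allocate_shots`, the `floor_frac` parameter, and the flat-list layout are hypothetical and not taken from the paper.

```python
import math


def allocate_shots(sensitivity, kernel_est, total_shots, floor_frac=0.5):
    """Split a shot budget across kernel entries.

    sensitivity: per-entry sensitivities, e.g. |alpha_i * alpha_j|.
    kernel_est:  warm-up estimates of each kernel entry in [0, 1].
    floor_frac:  share of the budget spread uniformly (the coverage floor).
    Returns fractional shot counts; rounding is left to the caller.
    """
    if not sensitivity or len(sensitivity) != len(kernel_est):
        raise ValueError("need one sensitivity and one kernel estimate per entry")
    if not 0.0 <= floor_frac <= 1.0:
        raise ValueError("floor_frac must lie in [0, 1]")
    n = len(sensitivity)
    floor = floor_frac * total_shots / n
    remaining = total_shots - floor * n
    # Neyman weights: sensitivity times the Bernoulli standard deviation.
    weights = [abs(s) * math.sqrt(max(k * (1.0 - k), 0.0))
               for s, k in zip(sensitivity, kernel_est)]
    total_w = sum(weights)
    if total_w == 0.0:
        # Nothing to exploit: fall back to uniform allocation.
        return [total_shots / n] * n
    return [floor + remaining * w / total_w for w in weights]


if __name__ == "__main__":
    print(allocate_shots([3.0, 1.0], [0.5, 0.5], total_shots=100, floor_frac=0.2))
```

With `floor_frac` close to 1 the rule degenerates to uniform allocation, which is one way to read the abstract's remark that a high floor guards against noisy warm-up sensitivity estimates.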

[LG-115] On design-unbiased algorithmic Machine Learning

Link: https://arxiv.org/abs/2606.28795
Authors: Li-Chun Zhang, Siu-Ming Tam, Luis Sanguiao-Sande, Wesley Yung, Anders Holmberg
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:Machine Learning (ML) algorithms, such as k-Nearest Neighbours (kNN) or random forest, eschew the ideal of true data models in favour of predictive performance. However, minimising the MSE or F-score cannot lead to unbiasedness directly, which is important in many situations such as official statistics. We study the conditions of algorithmic ML, other than the existence and knowledge of true data models, which lead to unbiased prediction or classification for a given finite population, including how the training data may be sampled from the population, how a trained prediction algorithm can be tuned to achieve unbiased prediction or classification for that population, and how the performance of out-of-sample prediction or classification can be assessed unbiasedly. The inference is based on the known probability design of samples and training sets, rather than any assumed distributions or models.

[LG-116] Generative Learning as a Tool to Improve Perception of Emotional Body Motion Expressions

Link: https://arxiv.org/abs/2606.28769
Authors: Huakun Liu, Miao Cheng, Xin Wei, Felix Dollack, Victor Schneider, Hideaki Uchiyama, Chia-huei Tseng, Yoshifumi Kitamura, Monica Perusquia-Hernandez
Subjects: Machine Learning (cs.LG)
Comments: Accepted by ACII 2025

Abstract:Emotional body motion expressions are an essential element of non-verbal communication. Effectively conveying these expressions through technology is of utmost importance, for example, with virtual reality avatars and in social robotics. Recent advances in generative models have opened new opportunities for advancing research on emotional body motion learning. However, generating accurate emotional expression representations is challenging, given the subtlety of emotional cues, individual variability, and cultural differences. We investigate whether a generative model can implicitly learn emotional body motions directly from culturally grounded motion-capture data, without explicit emotion-motion guidance. Using a dataset of emotional performances by 49 Japanese actors, we trained a Transformer-based generative model to generate expressive motions conditioned on 13 discrete emotion labels. We evaluate the generated motions from two perspectives: (1) an LSTM-based classifier to assess recognizability by machine observers, achieving a recognition accuracy of 22.80%, and (2) a human perception study with Japanese raters to assess alignment with human affective interpretations, yielding a recognition accuracy of 24.91%. Beyond these, we evaluate the utility of generative modeling for three practical tasks: augmenting emotion recognition models, extracting representative emotion-specific motion patterns, and synthesizing smooth transitions between emotion intensities. Our findings highlight the potential of implicit, data-driven generative modeling to enhance affective computing applications and our understanding of emotion expressions.

[LG-117] Hierarchical Decision Making with Structured Policies: A Principled Design via Inverse Optimization ICML2026

Link: https://arxiv.org/abs/2606.28764
Authors: Yuexuan Wang, Jingyuan Zhou, Kaidi Yang
Subjects: Machine Learning (cs.LG)
Comments: Accepted at ICML 2026

Abstract:Hierarchical decision-making frameworks are pivotal for addressing complex control tasks, enabling agents to decompose intricate problems into manageable subgoals. Despite their promise, existing hierarchical policies face critical limitations: (i) reinforcement learning (RL)-based methods struggle to guarantee strict constraint satisfaction, and (ii) optimal control (OC)-based approaches often rely on myopic and computationally prohibitive formulations. To reconcile these trade-offs, hierarchical RL-OC architectures have emerged as a promising paradigm. However, the formulation of the lower-level optimization within these frameworks remains underexplored, often relying on heuristic or myopic objectives. In this work, we propose a principled framework that systematically integrates upper-level goal abstraction with structured lower-level decision making. We adopt an inverse optimization approach to inform the structure of the lower-level problem from expert demonstrations, ensuring that the objective of the lower-level policy remains aligned with the overall long-term task goal. To validate the approach, our framework is evaluated on distinct decision making tasks: network-based resource allocation and continuous collision avoidance. Empirical results demonstrate that our method consistently outperforms strong baselines based on end-to-end RL, learning-augmented optimal control, and existing hierarchical RL approaches in both efficiency and decision quality.

[LG-118] A Path-Space Formulation of Prediction in World Models: From a Single Action to Prediction Planning and Irreversibility

Link: https://arxiv.org/abs/2606.28751
Authors: Gunn Kim
Subjects: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech)
Comments: 13 pages, 3 figures

Abstract:We propose a path-space formulation of prediction in AI world models. Rather than sequences of one-step conditional distributions, we argue that a world model implicitly defines a probability measure over future trajectories. In the local regime where latent dynamics admit an effective Markovian description, this path measure takes the Onsager-Machlup form. Within this framework, prediction (most probable trajectory), planning (constrained optimization), and uncertainty (fluctuations) emerge as operations on a single action functional. We decompose the latent dynamics into reversible and irreversible components and introduce operational measures of entropy production from model rollouts. In controlled small-scale attention-based models, we find that attention asymmetry is acquired during training in proportion to the irreversibility of the data. Symmetrizing the learned attention suppresses entropy production and selectively degrades long-horizon prediction of irreversible dynamics while preserving relaxational prediction. These results suggest that irreversibility may serve as a computational resource for predictive world models. More generally, the fundamental predictive object is a distribution over future paths rather than states.

[LG-119] J-LAW: Joint Localization and Actionable World Modeling via Coupled Latent Factor Graphs

Link: https://arxiv.org/abs/2606.28712
Authors: Guanqun Cao, Liang Chen
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 5 pages, 2 figures, 3 tables

Abstract:Classical SLAM estimates metric poses and a geometric map but produces no actionable predictive model for planning. Action-conditioned world models learn compact latent dynamics for planning but ignore global metric consistency and accumulate drift under open-loop rollout. We argue these are two views of the same estimation problem and propose J-LAW (Joint Localization and Actionable World Modeling) in this letter: a coupled factor graph that jointly optimizes metric object poses, latent world states, and latent landmark embeddings. The bridge is a pose-conditioned latent encoder and a learned pose–latent coupling factor, so that better localization improves the world model and vice versa. We cast observation, action-conditioned prediction, metric odometry, pose–latent coupling, latent loop closure, and latent landmark observation as probabilistic factors in a single MAP objective. Real-data experiments on PushT and WildGS show that coupled graph correction substantially reduces latent prediction RMSE and endpoint drift relative to open-loop rollout, while latent loop closure improves global trajectory consistency. J-LAW yields a map that is simultaneously metric (poses) and actionable (latent landmarks for planning).

[LG-120] Entropy-Regularized Reinforcement Learning for Linear-Quadratic Stackelberg Differential Games in Regime-Switching Diffusion Models

Link: https://arxiv.org/abs/2606.28671
Authors: Congde Hu, Danping Li, Lin Xu, Wenying Xu
Categories: Machine Learning (cs.LG)

Abstract:Stackelberg differential games (SDGs) provide a powerful framework for hierarchical decision-making in stochastic and continuous-time environments, yet their solution remains computationally challenging due to the complexity of traditional dynamic programming and Hamilton-Jacobi-Bellman-Isaacs (HJBI) methods, especially in high-dimensional systems. This paper proposes an entropy-regularized reinforcement learning (ERRL) approach for linear-quadratic SDGs (LQ-SDGs) within a continuous-time diffusion framework governed by Markovian regime switching. The key innovation lies in deriving exploratory weakly-coupled HJBI equations with entropy regularization, which promotes stochastic policies that actively avoid suboptimal equilibria – a limitation of classical SDG methods. Neural networks are integrated to approximate regime-dependent value functions and solve high-dimensional partial differential equations (PDEs) efficiently, while a novel sampling technique enhances computational tractability. Numerical results demonstrate the effectiveness of the framework compared to conventional approaches, particularly in escaping suboptimal traps through exploratory policies. The study highlights the critical role of entropy regularization and neural network approximations in achieving robust solutions for hierarchical decision-making problems under abrupt environmental shifts.

[LG-121] Entropy Regularized Reinforcement Learning for Zero-Sum Stochastic Differential Games in a Regime-Switching Jump-Diffusion Process

Link: https://arxiv.org/abs/2606.28669
Authors: Congde Hu, Zhuo Jin, Danping Li, Lin Xu
Categories: Machine Learning (cs.LG)

Abstract:To address parameter misspecification and sudden structural environmental changes in conventional stochastic differential game (SDG) frameworks, this paper introduces a distributional control approach that characterizes optimal strategies as probability distributions over actions, conditioned on the continuous state, the discrete regime state, and parameters. This forms a reinforcement learning framework for entropy-regularized zero-sum stochastic differential games (ERRL-ZSSDGs) in a regime-switching jump-diffusion process. Using the dynamic programming principle (DPP), we derive the associated coupled systems of Hamilton-Jacobi-Bellman-Isaacs (HJBI) equations, from which equilibrium strategies are expressed via gradients of the value function. For linear-quadratic problems, semi-analytical solutions for both value function and equilibrium strategies are obtained by solving a system of coupled ordinary differential equations (ODEs). In more general settings, an Actor-Critic policy improvement algorithm is developed to approximate the value functions and equilibrium policies across different regimes. The method is applied to an investment game, and numerical examples illustrate the effect of the temperature parameter and regime transitions on optimal policies and values.

[LG-122] In-Vehicle Digital Twin-Based Collision Warning Framework with Sybil Attack Detection

Link: https://arxiv.org/abs/2606.28625
Authors: Mohammad Imtiaz Hasan, Abyad Enan, Jean Michel Tine, Araf Rahman, M Sabbir Salek, Mashrur Chowdhury
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:Connected Vehicles (CVs) rely extensively on communication technologies to enable data-driven predictive analyses for enhancing performance and safety. These communication channels can be exploited by adversaries to launch cyberattacks such as Sybil attacks, which could threaten both safety-critical and mobility applications, leaving CVs vulnerable and putting human lives at risk. As CV deployment continues to expand, the need to detect and mitigate cyberattacks in real time becomes increasingly urgent. This study presents an in-vehicle Digital Twin (DT)-based collision warning framework with built-in capabilities for Sybil attack detection. The framework integrates a Temporal Convolutional Network (TCN) for learning temporal dependencies in vehicle trajectory data and Hierarchical Navigable Small World (HNSW) algorithms for efficient similarity-based classification. Our framework is evaluated on real-world Sybil attack data, collected through field experiments. The framework achieved accuracy, recall, and F1 scores of 0.984, 1.00, and 0.944, respectively, in detecting Sybil-generated fake vehicles. During the safety evaluation, the framework reduced the mean Time Exposed Time-To-Collision (TET) and mean Time Integrated Time-To-Collision (TIT) of near-collision events by 88% and 72%, respectively. Furthermore, real-world feasibility evaluation shows that the framework conformed to the standardized maximum allowable latency for safety applications and operated well within the capacity of modern processors – demonstrating the promise of an in-vehicle DT-based framework as an attack mitigation mechanism against Sybil attacks for next-generation CVs.

[LG-123] Improving Patient Subtyping on Longitudinal Data using Representations from Mamba-based Architecture

Link: https://arxiv.org/abs/2606.28623
Authors: Md Mozaharul Mottalib, Rahmatollah Beheshti
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Effective sub-typing (also known as grouping or clustering) of patients using their electronic health record (EHR) data can greatly inform precision medicine efforts. However, subtyping temporal EHR datasets is known to be challenging due to inherent EHR issues, including complexity and irregularity. In this study, we propose a self-supervised Mamba-based model that learns effective EHR representations and enables enhanced patient subtyping. We evaluate the proposed model on public and private real-world EHR datasets to classify the data based on the available labels and subtype patients based on the representations learned from the model. Through an extensive set of experiments, we demonstrate that our model’s design choices lead to better performance compared to competitive baseline models for prediction. Moreover, we evaluate several clustering techniques to demonstrate that our findings offer valuable insights into subtyping patients based on temporal records from EHR models. Our implementations are available at this https URL.

[LG-124] Randomized Exploration for Linear Bandits via Absolute Perturbations

Link: https://arxiv.org/abs/2606.28616
Authors: Toshinori Kitamura, Shuai Liu, Csaba Szepesvári
Categories: Machine Learning (cs.LG)

Abstract:In stochastic linear bandits, the canonical Upper Confidence Bound (UCB) algorithm admits a simple frequentist regret analysis but can be computationally demanding, while Thompson Sampling (TS) is computationally attractive yet typically harder to analyze due to its non-optimistic nature. We propose Absolute Thompson Sampling (ATS), a simple modification of TS that ensures optimism in expectation by replacing the signed exploration noise with its absolute value. This preserves the computational efficiency of TS while avoiding the technically involved anti-concentration arguments common in TS analyses, enabling a simple UCB-style regret analysis. We show that ATS achieves $\tilde{O}(d^{3/2}\sqrt{K})$ regret, matching existing bounds for TS in linear bandits. We further introduce Ensemble Absolute Thompson Sampling (EATS), which takes the maximum over multiple absolute perturbations with normalization by the ensemble size. As the ensemble size grows, EATS converges to the UCB objective, recovering UCB behavior in the limit. Experiments show that moderate ensemble sizes already yield strong performance. Our results point to a bridge between randomized exploration and deterministic optimism both in theory and practice.
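
The contrast between signed and absolute exploration noise can be sketched as below. This is only one plausible reading of the abstract, not the authors' algorithm: the function name `perturbed_scores`, the `scale` matrix (assumed to play the role of a square root of the inverse design matrix), and the multiplier `beta` are all hypothetical choices made for illustration.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, v) for row in m]


def perturbed_scores(arms, theta_hat, scale, beta, rng, absolute):
    """Score each arm as estimate + beta * noise.

    With absolute=False the noise keeps its sign (Thompson-Sampling style).
    With absolute=True the noise is replaced by its absolute value, so every
    score is at least the point estimate (an assumed form of the ATS idea).
    """
    eta = [rng.gauss(0.0, 1.0) for _ in theta_hat]  # shared Gaussian draw
    direction = matvec(scale, eta)                  # scaled perturbation
    scores = []
    for x in arms:
        noise = dot(x, direction)
        if absolute:
            noise = abs(noise)
        scores.append(dot(x, theta_hat) + beta * noise)
    return scores


# Hypothetical toy problem: three arms in two dimensions.
ARMS = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
THETA_HAT = [0.4, 0.3]
SCALE = [[0.5, 0.0], [0.0, 0.5]]

signed = perturbed_scores(ARMS, THETA_HAT, SCALE, 1.0, random.Random(0), False)
absolute = perturbed_scores(ARMS, THETA_HAT, SCALE, 1.0, random.Random(0), True)
best_arm = max(range(len(ARMS)), key=lambda i: absolute[i])
```

Because both calls use the same seed, the two score lists differ only where the signed noise is negative, which makes the effect of taking the absolute value easy to inspect.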

[LG-125] Replica Symmetry Breaking and Algorithmic Thresholds in Empirical Risk Minimization under Multi-Index Model

Link: https://arxiv.org/abs/2606.28573
Authors: Andrea Montanari, Kangjie Zhou
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 80 pages; 3 pdf figures

Abstract:Modern machine learning models are trained by optimizing high-dimensional non-convex empirical risk functions. Such cost functions can have a multitude of local optima and yet, gradient-based optimization appears to converge to near-global optima. Within a simple supervised learning setting, we develop a precise picture of which parts of the empirical risk landscape are accessible by polynomial-time algorithms. We are given i.i.d. pairs $\{(\boldsymbol{x}_i, y_i) : 1 \le i \le n\}$ with $\boldsymbol{x}_i \in \mathbb{R}^d$ standard Gaussian feature vectors, and $y_i \in \mathbb{R}$ response variables that depend on $\boldsymbol{x}_i$ through their projections on an unknown $k$-dimensional subspace. We use empirical risk minimization to learn a model that depends on an $m$-dimensional projection of the data (e.g., an $m$-neurons neural network). We propose an incremental approximate message passing (IAMP) algorithm and precisely characterize the training error it achieves, as well as the relation between test and training error, in the high dimensional asymptotics $n, d \to \infty$, with $n/d \to \alpha \in (0, +\infty)$. Based on earlier work in related models, we expect that the performance achieved by our algorithm is optimal among polynomial-time algorithms.

[LG-126] Improving Coherence in Hierarchical Time Series Forecasting using Structured Temporal Fusion

Link: https://arxiv.org/abs/2606.28553
Authors: Ruchi Pakhle
Categories: Machine Learning (cs.LG); ACM classes: I.2.6; G.3
Comments: 7 pages, 2 figures. Preprint. Source files included

Abstract:In many real-world applications, such as retail sales, energy usage, and supply chain planning, forecasting is performed across hierarchical structures. These structures often represent aggregations (e.g., products to categories to regions), where forecasts must not only be accurate but also coherent, meaning that lower-level predictions sum correctly to higher-level forecasts. Traditional statistical methods, such as Bottom-Up and MinT, enforce coherence through post-processing but fail to model complex nonlinear temporal dependencies and covariate interactions. We propose Hierarchical Temporal Fusion (HTF), a novel extension of the Temporal Fusion Transformer (TFT) that integrates structured hierarchical embeddings with a coherence-aware loss function to ensure consistent forecasts across all levels of a hierarchy. Rather than applying reconciliation after forecasting, HTF embeds coherence directly into the training objective. The coherence loss penalizes the difference between aggregated child forecasts and their corresponding parent forecasts during training, enabling the model to learn both temporal dynamics and structural consistency simultaneously. We evaluate HTF on two publicly available benchmark datasets: the M5 Walmart forecasting dataset and a publicly available hierarchical energy consumption dataset. Results demonstrate that HTF substantially reduces forecast incoherence while improving forecasting accuracy compared with classical reconciliation methods and deep learning baselines. In addition, attention visualization and embedding analysis provide insight into how temporal and structural information contribute to hierarchical forecasting performance.
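
The coherence penalty described in this abstract (the gap between summed child forecasts and the parent forecast) can be illustrated with a minimal sketch. The squared-error form, the function name `coherence_loss`, and the dictionary layout are assumptions for illustration; the abstract does not specify the exact loss used in HTF.

```python
def coherence_loss(forecasts, hierarchy):
    """Mean squared gap between each parent forecast and the sum of its children.

    forecasts: dict mapping a series name to a list of forecast values per step.
    hierarchy: dict mapping a parent name to the list of its child names.
    Returns 0.0 when every parent equals the sum of its children at every step.
    """
    total, count = 0.0, 0
    for parent, children in hierarchy.items():
        for t, parent_value in enumerate(forecasts[parent]):
            child_sum = sum(forecasts[child][t] for child in children)
            total += (child_sum - parent_value) ** 2
            count += 1
    return total / count if count else 0.0


# Hypothetical two-level hierarchy: one total with two child series.
HIERARCHY = {"total": ["a", "b"]}
COHERENT = {"total": [10.0, 12.0], "a": [4.0, 5.0], "b": [6.0, 7.0]}
INCOHERENT = {"total": [11.0, 12.0], "a": [4.0, 5.0], "b": [6.0, 7.0]}

loss_coherent = coherence_loss(COHERENT, HIERARCHY)
loss_incoherent = coherence_loss(INCOHERENT, HIERARCHY)
```

In a training loop, a term of this kind would typically be added to the forecasting loss with a weighting coefficient, so that accuracy and structural consistency are optimized together.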

[LG-127] NIVA: A Multimodal Foundation Model for Actionable Earth System Intelligence

Link: https://arxiv.org/abs/2606.28546
Authors: Anisha Pal, Aodhan Sweeney, Kyle Heyblom, Kalai Ramea
Categories: Machine Learning (cs.LG)

Abstract:Recent advances in AI-driven weather and climate modeling have improved forecast skill while reducing computational cost. However, existing data-driven approaches are limited in their ability to model coupled Earth system dynamics, which is required for extending predictability beyond the ~2-week horizon. To address this, we introduce NIVA, a multimodal foundation model designed to learn unified representations across Earth system components. While the full framework targets atmosphere, ocean, ice, and land interactions, we focus here on a two-modality setting (ocean and atmosphere) as a controlled proof of concept to evaluate whether foundation models can learn coupled dynamics. Trained on large-scale Earth system simulations, NIVA learns physically meaningful cross-modal structure, providing a foundation for subseasonal-to-seasonal prediction. As initial validation, we show that NIVA captures key modes of climate variability through accurate prediction of major climate indices.

[LG-128] A Trainable-by-Parts Operator Learning Framework: Bridging DeepONet and Karhunen-Loeve Expansions for Large-Scale Applications

Link: https://arxiv.org/abs/2606.28519
Authors: Christian Munoz, Alexandre Tartakovsky
Categories: Machine Learning (cs.LG)

Abstract:Training operator-learning models for large-scale problems governed by partial differential equations (PDEs) is challenging due to the curse of dimensionality, memory constraints, and limited training data. These challenges arise in many scientific and engineering applications, including subsurface flow, climate modeling, and geological carbon storage (GCS). In this work, we propose a scalable operator-learning framework based on the Karhunen-Loeve Deep Neural Network (KL-DNN) and demonstrate its performance for modeling GCS. The model is trained on a dataset comprising 100 samples of large-scale simulations in a three-dimensional domain with 1.7 million cells and 50 time steps. The KL-DNN method constructs latent spaces using low-rank singular value decomposition of static properties and a nested Karhunen-Loeve expansion for dynamic pressure fields, enabling full-resolution predictions without subsampling or spatial coarsening. The KL-DNN model achieves an average root mean square error (RMSE) of 1.1 psi for pressure (0.04% relative error with respect to the average pressure in the domain) and RMSE of 0.0146 for CO2 saturation (5% relative error with respect to the average saturation inside the plume). The model requires 20 minutes of training on a single GPU, representing a 19% reduction in the pressure errors, 7% reduction in the saturation error, and a two-order-of-magnitude speedup compared to DeepONet trained on the same dataset. These results, along with inference time of less than one minute, establish the proposed model as a practical and accurate solution for large-scale PDE problems, enabling rapid uncertainty quantification, history matching, and real-time decision support.

[LG-129] Modelling Emotional Memory in Children with Tensor Networks

Link: https://arxiv.org/abs/2606.28470
Authors: Henry Groves, Lucia F. Jackson, Barbara-Anne Robertson, Jonte R. Hance
Categories: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC); Quantum Physics (quant-ph)
Comments: 26 pages, 9 figures

Abstract:We demonstrate how emotional valence influences the order-dependent structure of children’s recognition memory: correct recall of a sequence of emotionally-valenced toys depended not just on the valence of a given toy itself, but also on the valence of the toys shown before and after it. Whilst standard psychological models confirm that order-dependence differs across an event (a set of toys shown in sequence), accuracy is low and the model does not reflect how memory for an emotional object influences others in the set. A classical tensor network model factoring in valence is able to achieve a 77.98% accuracy in modelling the results of the study. While not strictly a “quantum cognition” model, this massive increase in accuracy shows the value of quantum-inspired methods for modelling order-dependent phenomena, such as emotional memory. Further, the task protocol we introduce presents a novel, real-world tool for exploring emotional temporal memory in children for analysis using classical and quantum-like models of cognition.

[LG-130] Singular Learning and Occam's Razor in Deep Monomial Networks

Link: https://arxiv.org/abs/2606.28464
Authors: Kathlén Kohn, Giovanni Luca Marchetti, Farhan Shabir, Vahid Shahverdi, Weisheng Wang
Categories: Machine Learning (cs.LG)

Abstract:In the optimization of neural networks, gradient dynamics are influenced by critical points that arise from the model’s architecture. These critical points occur where the Jacobian of the model’s parametrization is rank-deficient, and are the most pronounced singularities studied in Singular Learning Theory. We investigate such points in deep fully-connected networks with monomial activations via tools from polynomial algebra such as Mason’s Theorem. We show that, for sufficiently large activation degree, criticality occurs precisely at subnetworks, i.e., at parameter configurations where some neurons are inactive or redundant. This offers a mathematical perspective on the implicit bias in deep neural networks, explaining the tendency of these models to converge toward simpler functions.

[LG-131] scKDGM: KAN-guided Dynamic Graph Masked Learning for Single-Cell RNA-seq Clustering

Link: https://arxiv.org/abs/2606.28459
Authors: Jun Tang, Pengwei Hu, Sicong Gao, Jie Guo, Lun Hu, Xin Luo
Categories: Machine Learning (cs.LG); Genomics (q-bio.GN)

Abstract:Single-cell RNA sequencing (scRNA-seq) clustering is essential for identifying cell types, but high dimensionality, sparsity, dropout, and technical noise hinder robust expression representation and cell graph construction. Existing masked autoencoders mainly use expression recovery for feature reconstruction, while graph clustering methods usually depend on fixed KNN graphs and do not feed recovered expression back into graph optimization. We propose scKDGM, a KAN-guided dynamic graph masked learning framework for scRNA-seq clustering. scKDGM uses graph-aware distribution preserving gene masking (GDP-Mask) to perturb cell identity, a KAN-based TAKGCN encoder to learn masked-view representations, mask-guided expression recovery to construct a dynamic graph, and cross-view contrastive learning to transfer recovery signals into topology updates. A ZINB loss models overdispersion and zero inflation. Experiments on 12 real scRNA-seq datasets show that scKDGM outperforms 10 baselines in average NMI and ARI.

[LG-132] Position: RL Researchers Need to Distinguish Between Solving Simulators and Using Simulators as a Proxy ICML2026

Link: https://arxiv.org/abs/2606.28433
Authors: Matthew Vandergrift, Esraa Elelimy, Martha White
Categories: Machine Learning (cs.LG)
Comments: This work has been accepted at the ICML 2026 position paper track. The peer-reviewed reference is provided on the public OpenReview page at this https URL. Additionally, the publication can be seen at this https URL.

Abstract:One goal in reinforcement learning (RL) research is to understand general-purpose sequential decision-making, using benchmark simulators as a proxy for learning in deployment settings. When running experiments, however, the goal of achieving high performance in the simulator can mutate into focusing exclusively on solving the simulator. To achieve high scores, researchers may adopt solutions exclusively meant for solving simulators, rather than learning while the agent is deployed outside a simulator. Solving simulators is also worthy of investigation, but it is a fundamentally different RL research question. In this paper, we argue that RL researchers need to distinguish between two use cases of simulators: solving simulators and using simulators as a proxy for learning in deployment. We first discuss how these two use-cases are importantly different, in terms of constraints on how the agent can use the simulator, which algorithms are appropriate, and which evaluation metrics are appropriate. We then highlight several issues and misleading conclusions that can occur by not making the distinction between these two settings clear, supported with examples and simple experiments. This work is a call to the community to begin clearly distinguishing how they are using simulators in their work, hopefully sparking further discussion on which empirical practices work best in each setting.

[LG-133] Bridging the NISQ and Fault-Tolerant Regimes: Generative-ML-Assisted Quantum Selected CI for Molecular Simulations

Link: https://arxiv.org/abs/2606.30551
Authors: Anurag K. S. V., Ashish Kumar Patra, Manas Mukherjee, Ruchika Bhat, Sai Shankar P., Rahul Maitra, Jaiganesh G
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Comments: 35 pages, 10 figures

Abstract:Calculation of binding energies for protein-ligand molecular systems requires accurate treatment of the electronic structure, a quantum chemistry problem that scales exponentially on classical hardware, while current quantum hardware remains too noisy for the required circuit depths. This report presents a hybrid quantum-classical workflow performed on the Fujitsu FX700 ideal state-vector simulator using QARP that addresses two structural inefficiencies in quantum-sampling-based diagonalization workflows. First, we integrate the Linear Scaling CNOT UCCSD (LCNot-UCCSD) ansatz into the QSCI framework, replacing the $\mathcal{O}(N^6)$ CCSD parameter initialization of the competing LUCJ ansatz approach with $\mathcal{O}(N^4)$ MP2-amplitude initialization. Second, we introduce QSCI-RBM, a variant that replaces the configuration recovery of the SQD framework with a Restricted Boltzmann Machine (RBM) acting as a compact generative subspace expansion model. Both are evaluated on eight different molecules in STO-3G across 14 controlled artificial error levels with 100 independent runs each, validated on potential energy surface scans of the N$_2$ molecule in cc-pVDZ, and embedded within DMET to treat the FDA-approved antiviral Amantadine (C$_{10}$H$_{17}$N, 11 DMET fragments) and the active region of the SARS-CoV-2 main protease complexed with its covalent inhibitor Carmofur (PDB: 7BUY, C$_{15}$H$_{28}$N$_4$O$_5$S, 10 fragments). To our knowledge, this is the first deployment of LCNot-UCCSD within QSCI on a quantum computing simulator, and the first DMET-QSCI(LCNot-UCCSD)-RBM application to an industry-relevant protein-ligand system. By utilizing a fraction of the classical computing resources required by the current state-of-the-art work by Cleveland Clinic, RIKEN, and IBM Quantum, this approach enables more efficient and economical drug discovery simulations for the industry.

[LG-134] Staged Hybridisation for Visual Quantum Reinforcement Learning via Knowledge Distillation

Link: https://arxiv.org/abs/2606.30520
Authors: Javier Lazaro, Juan-Ignacio Vazquez, Pablo Garcia-Bringas
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)

Abstract:Visual environments are a demanding setting for quantum reinforcement learning (QRL): high-dimensional observations, unstable RL optimisation, and constrained variational quantum circuits (VQCs) are difficult to train jointly. This paper studies knowledge distillation (KD) as a staged hybridisation strategy for visual QRL. Instead of training a hybrid visual agent end-to-end from pixels, we first train a classical visual teacher, freeze its encoder as a feature interface, and distil the teacher’s policy behaviour into compact downstream heads. These heads can be classical or VQC-based, enabling small quantum-compatible students to be evaluated under the same frozen representation as compact classical controls. We evaluate the pipeline on CartPole Pixels and Acrobot Pixels. The results show that staged KD enables shallow VQC heads to acquire non-trivial visual-control behaviour in settings where direct pixel-based training would be substantially more difficult. Angle-encoded VQC heads retain near-teacher performance, while amplitude-encoded heads push compactness to an extreme regime, at the cost of greater fragility, stronger budget sensitivity, and higher simulation time. Overall, staged KD reframes visual QRL as a compact-head learning problem, opening a practical route for training small quantum-compatible policies outside the standard end-to-end RL loop.

[LG-135] Doubly Robust Adaptive Conformal Inference for Causal Effects Under Temporal Dependence

Link: https://arxiv.org/abs/2606.30500
Authors: Andreas Koukorinis, Ricardo Silva
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)

Abstract:We propose doubly robust adaptive conformal inference (DR-ACI), which constructs prediction intervals for doubly robust pseudo-outcomes under temporal dependence.

[LG-136] Factorizable Normalizing Flows for parameter-dependent density morphing

Link: https://arxiv.org/abs/2606.30489
Authors: Davide Valsecchi, Mauro Donegà, Rainer Wallny
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); High Energy Physics - Theory (hep-th); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 14 pages, 8 figures. Code: this https URL

Abstract:Normalizing Flows excel at modeling a single fixed density, yet many problems across the sciences, such as high energy physics, instead require modeling how that density deforms as a function of continuous parameters: the strength of a physical effect, a calibration constant, or a source of systematic uncertainty. Learning a separate flow for every parameter configuration quickly becomes intractable, since the number of joint settings grows exponentially with the number of parameters. We introduce Factorizable Normalizing Flows (FNFs), which represent the parameter-dependent density as a fixed, high-fidelity flow for a reference configuration composed with a learnable transformation that is polynomial in the parameters and factorized over them. This structure has a practical consequence: each parameter’s effect is learned in isolation, from samples in which that parameter alone is varied. The combined response of many parameters is then recovered by summation at inference, without ever sampling their combinatorially large joint space. On a controlled problem with two interpretable deformations applied jointly to the data, the learned transformation reproduces the true deformations and matches the optimal likelihood, while optional interaction terms capture residual correlations when several parameters vary strongly at once. The resulting model is interpretable, scales linearly with the number of parameters, and keeps the likelihood tractable. This provides a general tool for any inference workflow requiring continuous density morphing, and directly enables the next generation of unbinned likelihood fits in high energy physics.

[LG-137] Non-parametric recovery of causal diffusion mechanisms from steady-state observations

Link: https://arxiv.org/abs/2606.30467
Authors: Richard Schwank, Mathias Drton
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We consider sparse multivariate stochastic systems that evolve in continuous time according to a causal mechanism and present methodology to recover the system’s time-infinitesimal transition mechanism from mere cross-sectional data. This observational paradigm is motivated by applications such as gene expression analysis, where destructive experimental techniques may only allow recording data once over a cell’s lifetime. Precisely, we assume the system follows a time-homogeneous diffusion process that has reached an equilibrium distribution at observation time. Further, we assume the causal mechanism is fully described by the diffusion drift, is acyclic, and its causal structure graph is known. In this setting, we prove that the full causal mechanism, i.e., the drift function, can be non-parametrically identified under a weak non-explosion criterion. We derive a non-parametric kernel estimator for this challenging inverse problem and prove its consistency. Moreover, we propose a cross-validation scheme for hyperparameter tuning, illustrate the behavior of our estimator in simulations, and we discuss connections with irreversible generative diffusion models and low-frequency sampled data.

[LG-138] SGD Provably Prioritizes a Shortcut Spurious Feature in the XOR Model

Link: https://arxiv.org/abs/2606.30444
Authors: Tyler LaBonte, Vidya Muthukumar
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Neural networks are known to be susceptible to over-reliance on spurious correlations. However, the precise mechanism by which models exploit shortcut features is not fully understood, and algorithms to mitigate this behavior rely on as yet unjustified assumptions about the learned representations. In this work, we provide the first end-to-end theoretical characterization of spurious feature learning for two-layer ReLU neural networks trained by online minibatch SGD on the logistic loss. We consider data drawn from the high-dimensional Boolean hypercube with a quadratic signal function (namely XOR) and a linear spurious correlation. We show that SGD learns the spurious feature first, and exponentially fast. Moreover, the optimization dynamics couple the spurious and signal features, with a stronger spurious component inhibiting signal feature learning. Our analysis reveals precise phase transitions in the learning dynamics. In the first phase, alignment between the signs of the spurious feature and second-layer weight drives rapid growth of the spurious feature. In the second phase, large majority group margin slows learning and the signal feature remains suppressed. When the spurious correlation is maximally strong, we show theoretically that the spurious feature dominates even at the sample complexity threshold where XOR would be learned in isolation (i.e., if the spurious feature was absent). In contrast, when the correlation strength is constant, we provide preliminary empirical evidence that the model can eventually learn the XOR signal, although the spurious feature is not forgotten.

[LG-139] Learning the structure of open quantum systems

Link: https://arxiv.org/abs/2606.30358
Authors: Laura Lewis, Ewin Tang, John Wright
Categories: Quantum Physics (quant-ph); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: 51 pages, 1 figure

Abstract:We design an algorithm for learning the coefficients of an $n$-qubit constant-local Lindbladian to $\varepsilon$ error with $O(g d^2 \log(n) / \varepsilon^2)$ total evolution time, where $g$ is the single-site energy and $d$ is the (approximate) degree of the interaction graph. Though Lindbladians present new challenges not present in the special case of Hamiltonians, our algorithm achieves the suite of desiderata attained by state-of-the-art Hamiltonian learning algorithms: (1) it uses non-adaptive, ancilla-free randomized Pauli measurement circuits with a time resolution of only $\Theta(1/g)$; (2) it works without knowledge of the structure of the unknown Lindbladian; (3) it depends on a smooth form of degree, thereby supporting the learning of quasi-local and power-law Lindbladians. Our algorithm is a simple iterative method, where the objective function consists of Fourier coefficients of the Lindbladian restricted to few-site regions. Its analysis identifies the difficulty unique to open systems, which we call “confusing” terms. For settings where the “confusion” is limited, the performance of the algorithm improves. We demonstrate this for the case of structure learning of Hamiltonians from access to real-time evolution, where we obtain a new algorithm that is significantly simpler than previous work. In addition, using the same iterative method, we design the first efficient algorithm for structure learning Hamiltonians from high-temperature Gibbs states.

[LG-140] Local-Minima-Preserving Continuous Relaxation of Ising Problems ICML’26

Link: https://arxiv.org/abs/2606.30333
Authors: Debraj Banerjee, Santanu Mahapatra, Kunal N. Chaudhury
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: Accepted (regular) at 43rd International Conference on Machine Learning (ICML’26)

Abstract:The generalized Ising problem captures a broad spectrum of hard combinatorial problems, including MAX-CUT, Number Partitioning (NPP), and Maximum Independent Set. In this work, we consider the notion of one-flip local minima for this problem. We construct a polynomial relaxation and prove the landscape equivalence theorem: there exists a one-to-one correspondence between the local minima of the relaxation and the one-flip minima of the original Ising problem. This guarantee reduces the Ising problem to finding the local minima of a smooth function, allowing us to leverage gradient-based optimizers such as ADAM. We demonstrate that our method is scalable and it achieves strong performance across challenging benchmarks, including spin-glass models, MAX-CUT, and NPP.
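
To make the notion of a one-flip local minimum concrete, here is a minimal sketch. It assumes the standard quadratic Ising energy $E(s) = -\tfrac{1}{2} s^\top J s - h^\top s$ with spins in $\{-1, +1\}$ and a non-strict definition (no single flip strictly lowers the energy). The abstract does not spell out the generalized energy or the polynomial relaxation, so the function names and the energy form here are illustrative assumptions, not the paper's construction.

```python
from itertools import product

def ising_energy(J, h, s):
    """Assumed energy: E(s) = -1/2 * sum_ij J[i][j] s_i s_j - sum_i h[i] s_i."""
    n = len(s)
    pair = sum(J[i][j] * s[i] * s[j] for i in range(n) for j in range(n))
    field = sum(h[i] * s[i] for i in range(n))
    return -0.5 * pair - field

def is_one_flip_minimum(J, h, s):
    """True if flipping any single spin does not strictly lower the energy."""
    base = ising_energy(J, h, s)
    for i in range(len(s)):
        flipped = list(s)
        flipped[i] = -flipped[i]
        if ising_energy(J, h, flipped) < base:
            return False
    return True

def one_flip_minima(J, h):
    """Brute-force enumeration; only feasible for very small n."""
    n = len(h)
    return [s for s in product((-1, 1), repeat=n) if is_one_flip_minimum(J, h, s)]

# Two ferromagnetically coupled spins: the aligned states are the minima.
J = [[0, 1], [1, 0]]
h = [0, 0]
minima = one_flip_minima(J, h)
```

The brute-force enumeration is exponential in the number of spins, which is the cost a smooth relaxation with gradient-based optimizers is meant to avoid.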

[LG-141] Extrapolating from Regularised Solutions for Solving Ill-Conditioned Linear Systems in Machine Learning

Link: https://arxiv.org/abs/2606.30328
Authors: Disha Hegde, Jon Cockayne, Chris. J. Oates
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: Published in TMLR

Abstract:Rapid prototyping of algorithms is a critical step in modern machine learning. Most algorithms exploit linear algebra, creating a need for lightweight numerical routines which – while potentially sub-optimal for the task at hand – can be rapidly implemented. For the numerical solution of ill-conditioned linear systems of equations, the standard solution for prototyping is Tikhonov-regularised inversion using a nugget. However, selection of the size of nugget is often difficult, and the use of data-adaptive procedures precludes automatic differentiation, introducing instabilities into end-to-end training. Further, while data-adaptive procedures perform multiple linear solves to select the size of nugget, only the result of one such solve is returned, which we argue is wasteful. This paper aims to circumvent the above difficulties, presenting autonugget; a Python package for automatic and stable numerical solution of linear systems suitable for rapid prototyping, and fully compatible with automatic differentiation using JAX. autonugget combines multiple linear solves using Richardson extrapolation to determine the solution of the ill-conditioned system, improving in accuracy over approximations based on a single nugget.
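
The idea of combining several regularised solves can be illustrated with a two-point Richardson extrapolation. This is a NumPy sketch under stated assumptions, not the autonugget API: the function names are hypothetical, and it assumes the nugget is smaller than the smallest eigenvalue of $A$, so that $x(\lambda) = (A + \lambda I)^{-1} b$ expands as $x(0) + c_1 \lambda + c_2 \lambda^2 + \dots$ and the combination $2x(\lambda/2) - x(\lambda)$ cancels the first-order term.

```python
import numpy as np

def tikhonov_solve(A, b, nugget):
    """Solve the regularised system (A + nugget * I) x = b."""
    n = A.shape[0]
    return np.linalg.solve(A + nugget * np.eye(n), b)

def richardson_solve(A, b, nugget):
    """Two-point Richardson extrapolation towards nugget -> 0.

    Combines solves at nugget and nugget / 2 so that the O(nugget)
    error term cancels, leaving an O(nugget**2) error.
    """
    x_full = tikhonov_solve(A, b, nugget)
    x_half = tikhonov_solve(A, b, nugget / 2.0)
    return 2.0 * x_half - x_full

# Small diagonal test problem with a known exact solution.
A = np.diag([1.0, 0.1, 0.01])
b = np.ones(3)
x_true = b / np.diag(A)

err_single = np.linalg.norm(tikhonov_solve(A, b, 1e-3) - x_true)
err_extrap = np.linalg.norm(richardson_solve(A, b, 1e-3) - x_true)
```

On this toy problem the extrapolated solution is closer to the exact one than the single-nugget solve. Both intermediate solves contribute to the result instead of one being discarded.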

[LG-142] Highly Data Parallelizable Estimation of the Sliced-Wasserstein Distance Using Cumulative Distribution Functions

Link: https://arxiv.org/abs/2606.30310
Authors: Christophe Vauthier, Quentin Mérigot, Anna Korba
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:The Sliced Wasserstein (SW) distance has emerged as a computationally attractive alternative to the Wasserstein distance by leveraging one-dimensional optimal transport along random projections. Standard estimators of the SW distance rely on Monte Carlo averages of one-dimensional Wasserstein distances computed via quantile functions, which require sorting projected samples and access to full datasets. In this work, we introduce a new class of estimators for the Sliced Wasserstein distance based on cumulative distribution functions (CDFs) of projected measures, that avoid sorting and scale via massive dataset parallelism. This class includes several estimators, some of them being indexed by hyperparameters controlling their variance or smoothness. We show that they are especially well suited to scenarios in which CDFs are more tractable than quantile functions, such as mixtures of Gaussians, and moreover that they are also naturally compatible with federated learning, since CDFs of projected data can be computed and aggregated locally without requiring the exchange of raw samples.
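
As a rough illustration of why CDFs avoid sorting, the sketch below uses the classical one-dimensional identity $W_1(\mu, \nu) = \int |F_\mu(t) - F_\nu(t)|\,dt$ evaluated on a fixed grid. It is not one of the estimators proposed in the paper, whose exact form the abstract does not give. The grid, the order $p = 1$, and all function names are assumptions made for this example.

```python
import numpy as np

def empirical_cdf(samples, grid):
    """Fraction of samples <= t for each grid point t (counting, no sorting)."""
    return (samples[:, None] <= grid[None, :]).mean(axis=0)

def w1_from_cdfs(cdf_x, cdf_y, grid):
    """Riemann-sum approximation of the integral of |F - G| on a uniform grid."""
    dt = grid[1] - grid[0]
    return float(np.sum(np.abs(cdf_x - cdf_y)) * dt)

def sliced_w1(X, Y, n_proj, grid, seed=0):
    """Monte Carlo average of 1D W1 distances over random unit directions."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)
        total += w1_from_cdfs(
            empirical_cdf(X @ theta, grid), empirical_cdf(Y @ theta, grid), grid
        )
    return total / n_proj

# 1D check: shifting two points by 0.5 gives W1 = 0.5.
grid = np.linspace(-1.0, 3.0, 4001)
x = np.array([0.0, 1.0])
y = np.array([0.5, 1.5])
w1 = w1_from_cdfs(empirical_cdf(x, grid), empirical_cdf(y, grid), grid)
```

Because each CDF is an average of indicator values on a shared grid, per-client CDFs can be combined by a count-weighted average without exchanging raw samples, which matches the federated use case the abstract mentions.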

[LG-143] A Distributionally Robust Framework for Learned Reconstructions in Inverse Problems

Link: https://arxiv.org/abs/2606.30230
Authors: Floor van Maarschalkerwaart, Subhadip Mukherjee, Christoph Brune, Marcello Carioni
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:Learned reconstruction operators for inverse problems are typically trained under a fixed noise model, and generalize poorly when the distribution during testing differs from the one assumed during training. Distributionally robust optimization (DRO) addresses this by optimizing against the worst-case distribution within a prescribed ambiguity set, but standard Wasserstein DRO perturbs the full joint distribution uniformly, which can be overly conservative and ignores the physics of the measurement process. We develop a structured DRO framework in which the ambiguity set is restricted to structured perturbations aligned with the data-acquisition process. This allows us to learn data-driven reconstruction operators that remain robust to distributional shifts. By constraining perturbations to subsets such as P(Y|X) , our framework models uncertainty in the forward operator and noise model more faithfully, accommodating any noise model expressible as a stochastic forward operator. We establish strong duality for this general formulation and derive explicit finite-dimensional dual representations for perturbations in the joint, marginal, and conditional distributions. A central result is an explicit worst-case risk bound that induces Tikhonov regularization on the Lipschitz constant of the reconstruction operator, and is less conservative relative to standard DRO for well-posed problems. Numerical experiments on deblurring and sinogram-to-CT reconstruction demonstrate improved robustness, stability, and interpretability over standard DRO and MSE baselines. In the linear setting, the learned operator becomes effectively low-rank, truncating at the intrinsic dimension of the data and recovering a data-driven analogue of truncated-SVD regularization.

[LG-144] Notes on generative modeling: flow matching, diffusion, optimal transport and Schrödinger bridge

Link: https://arxiv.org/abs/2606.30053
Authors: Titouan Vayer (COMPACT)
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:These notes recapitulate the high level mathematical principles behind different techniques for generative modeling. I show the connections between optimal transport and standard techniques such as Schrödinger bridge and flow matching.

[LG-145] Adjusted Wasserstein distances for bridging empirical and true distributions with applications to MDS

Link: https://arxiv.org/abs/2606.29665
Authors: Flor Martinez-Sermeno, Arturo Jaramillo, Johan Van Horebeek
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:This paper examines how metric adjustments to Multidimensional Scaling (MDS) can enhance its effectiveness as a visual tool for pattern recognition. The distance under consideration, referred to as Max-D-SW, is an adjustment of the Max-Sliced Wasserstein distance. In contrast to the original formulation, which optimizes over single unit directions, Max-D-SW aggregates contributions over orthonormal bases. This modification provides a clear numerical advantage in MDS outcomes, particularly when applied to heavy-tailed distributions. We also establish sample-complexity bounds showing that Max-D-SW remains statistically tractable, with rates comparable to those of its max-sliced counterpart. Moreover, we show that a better sample complexity for a metric does not necessarily translate into better performance when the metric is used as an input for MDS.

[LG-146] Lie Group Diffusion Models for Hardware-Aware Quantum Circuit Synthesis

Link: https://arxiv.org/abs/2606.29636
Authors: Jyotirmai Singh
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 14 pages, 6 figures

Abstract:An important task in quantum computing is unitary circuit synthesis compatible with physical hardware constraints. This problem has a natural hybrid structure as local single-qubit gates are continuous variables on the Lie group SU(2) while the entangling circuit structure is discrete and hardware-dependent. In this work, we use generative models to perform quantum circuit synthesis incorporating both the natural SU(2) manifold geometry of quantum gates and hardware constraints that determine the overall circuit structure. Our model comprises two components: a circuit skeleton selector that chooses an entangling circuit and a diffusion model that generates quantum gates on the given circuit template by performing diffusion on the curved manifold $\mathrm{SU}(2) \simeq S^3$ itself. We demonstrate this approach with unitary compilation of physically motivated three-qubit Hamiltonian simulation targets such as the Transverse Field Ising Model and the Heisenberg-XXZ Model and show that Lie group diffusion outperforms comparable baselines. The synthesised circuits can also be customised subject to constraints, which we demonstrate by producing circuits with large and small gate rotation angles for the same target unitary evolution. We also investigate the fidelity-complexity frontier of the synthesised gates to demonstrate that the circuit selector learns to select circuits that balance fidelity with complexity rather than collapsing onto the most expansive entangling template. These results demonstrate that Lie group diffusion provides a natural generative framework for hardware-aware quantum circuit synthesis.
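
The identification of SU(2) with the 3-sphere can be made explicit with a short sketch. It maps a point $(a, b, c, d)$ on $S^3$ to the matrix with rows $(\alpha, \beta)$ and $(-\bar\beta, \bar\alpha)$, where $\alpha = a + ib$ and $\beta = c + id$. This is a standard parametrisation; the function name is hypothetical, and nothing here reproduces the paper's diffusion model.

```python
import numpy as np

def s3_to_su2(q):
    """Map a nonzero 4-vector (normalised onto S^3) to a 2x2 SU(2) matrix."""
    q = np.asarray(q, dtype=float)
    a, b, c, d = q / np.linalg.norm(q)
    alpha = complex(a, b)
    beta = complex(c, d)
    return np.array([[alpha, beta],
                     [-beta.conjugate(), alpha.conjugate()]])

# Any nonzero 4-vector gives a valid single-qubit gate after normalisation.
U = s3_to_su2([1.0, 2.0, 3.0, 4.0])
```

The determinant equals $a^2 + b^2 + c^2 + d^2 = 1$, so a generative process that stays on the unit sphere always yields a valid single-qubit gate without any projection step.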

[LG-147] Kriging and neural network models for pressure losses across perforated plates

Link: https://arxiv.org/abs/2606.29628
Authors: Shuai Li
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)

Abstract:In this paper, two novel data-driven models based on kriging and neural networks (NN) are proposed to predict pressure losses across perforated plates with circular perforations in turbulent flows. The models are developed using two sets of experimental data available in the literature. The predictive performance of the proposed models is assessed and compared against widely used empirical formulae. It is found that the proposed models consistently outperform existing empirical models for most perforated plate configurations contained in the experimental datasets. Besides, the predicted pressure losses generally show good agreement with experimental measurements, demonstrating that data-driven approaches based on kriging and NN provide a feasible framework for modelling pressure losses across perforated plates. Overall, both approaches are promising, despite being trained on a relatively limited amount of experimental data, owing to the scarcity of measurements reported in the literature. To demonstrate the applicability of the proposed models in numerical simulations, two-dimensional channel flows are simulated using the Reynolds-averaged Navier-Stokes (RANS) equations, in which the new pressure-loss models are implemented as a source term in the momentum equations. The RANS predictions are found to be in excellent agreement with the model predictions, confirming the suitability of the proposed approaches for practical computational fluid dynamics applications.

[LG-148] Bidirectional Autoregressive Latent Diffusion for Forward and Inverse Magnetohydrodynamics

Link: https://arxiv.org/abs/2606.29620
Authors: Alexander Scheinker
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Adaptation and Self-Organizing Systems (nlin.AO); Plasma Physics (physics.plasm-ph)

Abstract:This work presents a new bidirectional autoregressive latent diffusion approach for predicting the evolution of multiple fields (mass density, pressure, velocity, and magnetic field components) for magnetohydrodynamics. We show that this bidirectional flow can be used as a self-supervised consistency metric for uncertainty and error estimation, which enables the model to estimate test-time uncertainty and error without access to ground truth, by comparing how closely flowing forwards and backwards in time returns to the same predicted fields. We also demonstrate this method’s potential to serve as a non-invasive plasma diagnostic, and show how adaptive feedback can be used to make the model more robust based on sparse diagnostics or limited views/measurements.

[LG-149] Geometric Algebra Meets Cartesian Tensors: Higher-Order Equivariance for Interatomic Potentials

Link: https://arxiv.org/abs/2606.29584
Authors: Can Polat, Erchin Serpedin, Mustafa Kurban, Hasan Kurban
Categories: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)

Abstract:$\mathrm{Cl}(3,0)$ interatomic potentials, despite their algebraic elegance, predict force magnitudes accurately but force directions poorly. Across ten rMD17 molecules, every $L \leq 1$ baseline in our twelve-model study attains aggregate force-cosine similarity below $0.25$. The cause is structural. The geometric product of two vectors in $\mathbb{R}^3$ realises only the $L=0$ and $L=1$ components of its irreducible representation content, leaving the symmetric-traceless rank-2 component absent from the per-edge bilinear that drives each message-passing layer. We address this with CliffordSTF, which couples the Clifford multivector to closed-form symmetric-traceless tensor tracks at ranks two and three through bilinear cross-track contractions, using a single learned bilinear and no Clebsch–Gordan tables, Wigner-$D$ matrices, or e3nn calls. On rMD17, CliffordSTF raises aggregate force-cosine similarity from $0.055$ (base Clifford) to $0.551$, an order-of-magnitude relative directional gain, alongside improved magnitude accuracy (force MAE 15.8% lower; energy MAE 10.9% lower). It outperforms all CG-free or body-ordered baselines in our study (all $\leq 0.17$). On catalysis benchmarks, CliffordSTF achieves the best out-of-distribution S2EF energy MAE on OC22 in our experiments, and the best in-distribution energy MAE among $L \geq 2$ methods on OC22 IS2RE. An eleven-variant ablation shows the two tracks are complementary: neither alone matches the combined model.
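
The structural claim about missing rank-2 content can be seen from the standard decomposition of the outer product of two 3-vectors into a trace part ($L=0$), an antisymmetric part ($L=1$, equivalent to the cross product), and a symmetric-traceless part ($L=2$). The sketch below is only this textbook decomposition, with hypothetical names; it is not the CliffordSTF layer.

```python
import numpy as np

def decompose_outer(a, b):
    """Split the outer product of a and b into L=0, L=1, and L=2 pieces."""
    T = np.outer(a, b)
    scalar = np.trace(T) / 3.0 * np.eye(3)   # L=0: carries the dot product
    antisym = 0.5 * (T - T.T)                # L=1: carries the cross product
    stf = 0.5 * (T + T.T) - scalar           # L=2: symmetric and traceless
    return scalar, antisym, stf

a = np.array([1.0, 2.0, 3.0])
b = np.array([-1.0, 0.5, 2.0])
scalar, antisym, stf = decompose_outer(a, b)
```

The geometric product of two vectors keeps the dot product and the wedge (bivector) part, which correspond to the first two pieces. The five independent components of `stf` have no counterpart there, which is the gap the abstract describes.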

[LG-150] Fractional Stochastic Neural Networks

Link: https://arxiv.org/abs/2606.29438
Authors: Yuecai Han, Jianming Xu
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: 29 pages, 3 figures, 6 tables

Abstract:In this paper, we develop a fractional stochastic neural network with residual dynamics driven by fractional Brownian motion. By introducing a discrete stochastic maximum principle for the network, we construct the corresponding adjoint recursion. For deterministic network parameters, we prove mean square convergence of projected samplewise stochastic gradient descent. Numerical experiments include a closed form convergence test, noisy regression with uncertainty quantification, long memory time series generation and image classification under structured perturbations. The results identify settings in which fractional drivers improve long memory recovery or robustness relative to Brownian and deterministic baselines.

[LG-151] Two kinds of robustness are not the same: disentangling fault tolerance and low-SNR robustness in multi-domain event detection on real data

Link: https://arxiv.org/abs/2606.29339
Authors: Isao Kurosawa
Categories: Geophysics (physics.geo-ph); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 19 pages, 10 figures. Submitted to Computers & Geosciences. Code and reproduction material: this https URL (archived at Zenodo: this https URL)

Abstract:Reliable event detection underpins induced-seismicity monitoring for Carbon dioxide Capture and Storage (CCS) and geothermal operations, distributed acoustic sensing (DAS), and industrial condition monitoring. In each setting a detector must stay reliable both when sensors fail and when the signal is buried in noise. These two failure modes are routinely conflated, and architectural complexity is often credited with robustness it may not deserve. We assemble a unified binary event-detection benchmark from three physically distinct real sources – Hi-net seismic waveforms, Utah FORGE 2024 borehole DAS, and MAFAULDA industrial vibration – each mapped to a common 8-channel, 256-sample representation, and evaluate a fault-tolerant detector (CEPHALON) trained with per-sample sensor-dropout against standard detectors (a 1D convolutional network, a temporal convolutional network, and a compact Transformer) trained with an identical recipe. On clean data every model is near-perfect (AUC ~ 0.99). Under progressive sensor loss, simple models with sensor-dropout are already robust and CEPHALON holds no advantage. Under additive noise, however, CEPHALON degrades far more gracefully: at -2.5 dB its overall AUC is 0.939 versus 0.532-0.572 for the convolutional baselines. Same-architecture ablations isolate the cause: disabling internal redundancy at inference reduces the low-SNR advantage only modestly, whereas removing sensor-dropout training collapses it (0.899 to 0.603 at -5 dB). The training recipe is therefore the dominant cause and parallel redundancy only secondary. We release a complete, numbered, reproducible pipeline so that every figure can be regenerated.

[LG-152] Gradient boosting with vector-valued leafs

Link: https://arxiv.org/abs/2606.29326
Authors: David Cortes
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Gradient boosting in the form of decision tree ensembles has successfully been applied to a variety of problems using simple objective functions based on log-likelihoods of a single variable. The concept extends naturally to objective functions operating on vectors - for example, multinomial logistic log-likelihood for multi-class classification, where observations have a score for each class - but popular frameworks approach these functions by either updating one value of the input vectors at a time, or by using a diagonal upper bound on the second derivative. This work extends the usual gradient boosting framework to functions of vector inputs and sketches a simple algorithm that can be used efficiently with histogram-based decision trees.
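
One way to picture a vector-valued leaf is a regularised Newton step that uses the full second-derivative matrix instead of a diagonal bound. The sketch below does this for the multinomial logistic loss, where each observation has gradient $p - y$ and Hessian $\mathrm{diag}(p) - p p^\top$. The abstract does not specify the paper's algorithm, so the function names, the L2 term `l2`, and the use of the exact Hessian are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def vector_leaf_value(scores, onehot, l2=1.0):
    """Newton step for all class scores of the observations in one leaf.

    Solves (sum_i H_i + l2 * I) w = -sum_i g_i. The l2 term is needed here
    because each softmax Hessian is singular (its rows sum to zero).
    """
    p = softmax(scores)
    grad_sum = (p - onehot).sum(axis=0)
    k = p.shape[1]
    hess_sum = np.zeros((k, k))
    for pi in p:
        hess_sum += np.diag(pi) - np.outer(pi, pi)
    return -np.linalg.solve(hess_sum + l2 * np.eye(k), grad_sum)

# Four observations of class 0, three classes, all current scores zero.
scores = np.zeros((4, 3))
onehot = np.tile([1.0, 0.0, 0.0], (4, 1))
w = vector_leaf_value(scores, onehot)
```

In this toy leaf the update raises the score of class 0 and lowers the other two by equal amounts, and the components sum to zero because both the summed gradient and the Hessian annihilate the all-ones direction.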

[LG-153] Generalization Analysis of Transformers in Distribution Regression

Link: https://arxiv.org/abs/2606.29256
Authors: Peilin Liu, Ding-Xuan Zhou
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:In recent years, models based on the Transformer architecture have seen widespread applications and have become one of the core tools in the field of deep learning. Numerous successful techniques, such as parameter-efficient fine-tuning and efficient scaling, have been proposed surrounding their applications to further enhance performance. However, the success of these strategies has always lacked the support of rigorous mathematical theory. To study the underlying mechanisms behind Transformers and related techniques, we first propose a Transformer learning framework motivated by distribution regression, with distributions being inputs, connect a two-stage sampling process with natural language processing, and present a mathematical formulation of the attention mechanism called attention operator. We demonstrate that by the attention operator, Transformers can compress distributions into function representations without loss of information. Moreover, with the advantages of our novel attention operator, Transformers exhibit a stronger capability to learn functionals with more complex structures than convolutional neural networks and fully connected networks. Finally, we obtain a generalization bound within the distribution regression framework. Through the aforementioned theoretical results, we further discuss some successful techniques emerging with large language models (LLMs), such as prompt tuning, parameter-efficient fine-tuning, and efficient scaling. We also provide theoretical insights behind these techniques within our novel analysis framework.

[LG-154] Connectivity Estimation using Stochastic Graph Heat Modelling

Link: https://arxiv.org/abs/2606.29098
Authors: Stephan Goerttler, Min Wu, Fei He
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Signal Processing (eess.SP); Neurons and Cognition (q-bio.NC)
Comments: 14 pages, 11 figures. Includes supplemental material

Abstract:A growing number of techniques leverage the spatial structures that underlie many real-world datasets. Despite these advances, the complementary task of estimating spatial structures and understanding their role within these techniques has often been overlooked. In neurophysiological data analysis specifically, numerous methods exist to estimate brain connectivity, but most are not explicitly model-based, dynamic, multivariate, or directed. To address these limitations, we previously introduced noise-driven heat modelling on graphs for neurophysiological connectivity estimation. In this study, we extend this framework by relaxing earlier noise assumptions and adding regularisation to improve robustness. We also develop a simulation procedure to characterise and evaluate our technique in a controlled setting. Finally, we demonstrate that the technique is able to capture meaningful spatial structure across two experiments, each using two real-world datasets. The explicit model formulation of our connectivity estimator has the potential to improve the interpretability of graph-based techniques across a wide range of applications. The code implementing our method is available at this https URL.

[LG-155] Liquidity-Based Audit of Algorithmic Trading Strategies

Link: https://arxiv.org/abs/2606.29018
Authors: Irene Aldridge
Categories: Econometrics (econ.EM); Machine Learning (cs.LG); Computational Finance (q-fin.CP); Risk Management (q-fin.RM); Machine Learning (stat.ML)
Comments: 26 pages

Abstract:We show that the net demand for liquidity of an algo strategy is identifiable from its trade and price history alone, with no knowledge of its signal or optimization problem. An exact multi-period regret decomposition implies that the sign of this statistic classifies a linear strategy as a net liquidity consumer or provider, recovering the Kyle (1985) informed-trader/market-maker dichotomy from observables alone. Under an AR(1) cost process, the same statistic equals the product of strategy size and the squared Roll (1984) implied spread, making the correction a direct proxy for prevailing illiquidity. Extending to endogenous price impact and aggregating across N correlated strategies yields a liquidity-balance condition whose violation produces welfare loss scaling as N squared, a closed-form fire-sale externality. We calibrate to CRSP equity data (2016-2025), tracking implied spreads through the COVID-19 and 2022 rate-shock episodes, with an estimator computable in O(Tnd) time.

[LG-156] A Bayesian latent Gaussian process framework for aerodynamic uncertainty quantification

Link: https://arxiv.org/abs/2606.28871
Authors: Geoffrey Davis, Ashwin Renganathan
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 29 pages, 9 figures

Abstract:Predicting the aerodynamic performance (e.g. lift, drag, and moment coefficients) of an aircraft is challenging: computational models are biased and direct simulations are prohibitive. A pragmatic way to overcome this limitation is by calibrating low-fidelity computational predictions with experimental measurements. This, however, requires calibrating against sparse measurements contaminated with uncertainty in both the control inputs and the measured aerodynamic response. We develop a methodology to address this problem based on Gaussian process surrogates and the classical Kennedy-O’Hagan calibration. A surrogate model learned on abundant-but-cheap low-fidelity data is calibrated with a sparse set of measurement data. Crucially, we develop a Bayesian latent Gaussian process based approach that marginalizes the calibrated surrogate model over the input uncertainty, while also matching the marginal mean and variance of the measured output uncertainty. Once calibrated, our surrogate model predicts the uncertainty in aerodynamic coefficients with very high accuracy, including at extrapolative input settings. We validate our calibrated surrogate model predictions against measurement data with true uncertainty intervals to demonstrate that the model places 94.2-95.8% of its predictive samples inside the released 95% truth intervals, with endpoint cumulative probabilities very close to the nominal 0.025 and 0.975 levels.

[LG-157] Variance Reduction for Stochastic Gradient Generalized Non-reversible Langevin Monte Carlo Algorithms

Link: https://arxiv.org/abs/2606.28808
Authors: Bingye Ni, Xiaoyu Wang, Yingli Wang, Lingjiong Zhu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
Comments: 49 pages, 12 figures

Abstract:We study the leading-order fluctuation of stochastic gradient Euler-Maruyama estimators for generalized non-reversible Langevin dynamics. Under structural assumptions tailored to the small-stepsize central limit theorem and under an unbiased stochastic gradient oracle, we prove that the empirical average over a horizon of order the inverse squared stepsize satisfies a central limit theorem in the vanishing-stepsize regime. The limiting variance is characterized through the Poisson equation of the limiting full-gradient diffusion. We then rewrite this constant in an operator form that links it to the continuous-time asymptotic variance and, under standard operator-theoretic assumptions, derive a sufficient condition under which an anti-symmetric perturbation strictly reduces the leading-order fluctuation constant relative to the reversible baseline. We also identify bounded smooth predictive observables that are directly covered by the main theorem. As a separate Gaussian calculation beyond the bounded-test-function regime, we obtain closed-form formulas for quadratic Hamiltonians and linear observables. The framework covers non-reversible Langevin dynamics and augmented-state examples including Hessian-free high-resolution dynamics and a positive-definite subclass of gradient-adjusted underdamped Langevin dynamics that allow stochastic gradients. Numerical experiments on basic examples and Bayesian linear regression using synthetic data, and Bayesian logistic regression using real data support the predicted Gaussian fluctuations and show that the non-reversible schemes consistently reduce the root mean squared error (RMSE) relative to their reversible baselines.

[LG-158] A Neuroimaging Simulation Framework for Developing and Evaluating Causal AI ALT

Link: https://arxiv.org/abs/2606.28684
Authors: Eryn Libert-Scott, Emma A.M. Stanley, Vibujithan Vigneshwaran, Matthias Wilms, Erik Y. Ohara, Nils D. Forkert
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures, submitted to the Journal of Biomedical and Health Informatics, Code available at this https URL

Abstract:Causally linking disease-related factors to image-derived biomarkers provides a powerful pathway to understanding disease mechanisms. Despite growing interest in applying causal artificial intelligence (AI) approaches for this task, these methods still need to be adapted for complex medical images, and especially, neuroimaging. However, the lack of ground-truth data presents a barrier to development. To bridge this gap, we developed and tested a method for generating synthetic neuroimages, which adhere to a user-specified causal structure describing the non-image to image variable relationships, permitting the creation of ground-truth neuroimaging datasets. In the simulated T1-weighted magnetic resonance images, anatomical variability is modeled by sampling from a subspace estimated from real data and deforming a template image to create unique simulated subjects. Causal relationships are encoded via precise volumetric changes of any region-of-interest without unwanted global artifacts. We achieved relative volume errors of 0.3-2.66% for the targeted regions-of-interest and demonstrate their statistically significant causal relationships, while maintaining mean absolute errors for non-target brain regions between 0.034-0.397ml. An initial evaluation of causal discovery methods exposes their limited ability to suppress spurious connections, highlighting the need for image-appropriate methods. Our framework is the first to enable the generation of realistic synthetic 3D neuroimages with explicit causal control that can serve as the missing ground-truth data necessary for the objective benchmarking and development of causal AI methods.

[LG-159] Transformer-Based Active Learning for Data-Efficient Vaccine Epitope Selection in PRRS

Link: https://arxiv.org/abs/2606.28659
Authors: Aspen Erlandsson Brisebois, Zahed Khatooni, Connor Burbridge, Brook Byrns, Heather L. Wilson, Sureesh Tikoo, Steven Rayan, Gordon Broderick
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments: 31 pages, 7 figures, 8 tables, 1 suppl. figure, 2 suppl. tables

Abstract:High-fidelity molecular docking simulations can produce biologically relevant estimates of epitope-receptor binding affinity but are computationally expensive and therefore limit the number of candidates that can be screened for vaccine design. In this work, we evaluate machine learning (ML) approaches where variants of active learning are used to classify instances of high binding affinity between 9-mer epitopes and a well-conserved swine leukocyte antigen (SLA) receptor in the context of Porcine Reproductive and Respiratory Syndrome (PRRS). We use an internally generated dataset of 80 epitope-SLA docking affinities, each requiring more than 48 hours of high-performance computing (HPC). Multiple model families (linear, MLP, CNN, and a small transformer) are trained under strict low-data conditions within a pool-based active learning loop. In each case, optimal model configurations are identified by conducting large-scale hyperparameter optimization over the combined space of model architecture, training configuration, acquisition policy, and ensemble decision rules. To mitigate the effects of data subsample selection, each candidate configuration is evaluated by averaging performance over many randomized and balanced training and validation data subsets. Across experiments, transformer-based sequence models consistently emerged as the best-performing architecture, with active incremental learning yielding significant improvement over a baseline random sample acquisition strategy. Under moderate training data availability (N=30), the optimized ML-model configuration outperforms a standard baseline trained on twice the amount of data. Under higher training data availability (N=60), the same configuration achieves a peak accuracy of 86.8%, consistent with an upper bound of 85% classification accuracy based on two independent estimates of conformational noise.

[LG-160] Exploring the Effects of Entanglement on Quantum Machine Learning of Pathogen Epitope-Receptor Binding

Link: https://arxiv.org/abs/2606.28655
Authors: Aspen Erlandsson Brisebois, Luis Pablo Gonzalez Dominguez, Shivansi Prajapati, Zahed Khatooni, Heather L. Wilson, Connor Burbridge, Brook Byrns, Sureesh Tikoo, Christophe Pere, Steven Rayan, Gordon Broderick
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments: 15 pages, 8 figures, 3 tables

Abstract:Parameterized quantum circuits (PQCs) provide a flexible substrate for hybrid quantum machine learning (QML), but their practical value on Noisy Intermediate-Scale Quantum (NISQ) devices remains an empirical question, especially because training depth and scale can introduce optimization challenges such as barren plateaus. Here we study how the number and topology of two-qubit entangling gates in the feature-map stage influence a fixed hybrid QNN workflow for classifying strong versus weak epitope-receptor binding in Porcine Reproductive and Respiratory Syndrome (PRRS) vaccine design. The dataset consists of docking-derived binding affinities for N=80 9-mer epitopes, labeled as Strong or Weak binding, and partitioned into training, validation, and test subsets using a 40:30:30 split. We compare a classical CNN benchmark with a hybrid Embedding-QNN architecture under four feature-map configurations: a non-entangling Z feature map, an all-to-all high-entanglement ZZ feature map, and two interleaved nearest-neighbour entanglement patterns of low and high depth. Among the configurations tested, the high-entanglement ZZ feature map is seen to provide the strongest evidence of reduced training-set overfit, with a lower training area under the accuracy curve (AUAC) and the highest test/training AUAC ratio, while preserving competitive test-set accuracy. These results do not establish a general QML advantage, but they suggest that feature-map entanglement topology is a meaningful design variable for sparse biological screening tasks and warrants further evaluation with additional metrics, larger datasets, and noise-aware or hardware-based experiments.

[LG-161] Adaptive Iterative Hard Thresholding for Online High-dimensional Quantile Regression

Link: https://arxiv.org/abs/2606.28652
Authors: Zitian Zhou, Nan Lin
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

Abstract: Online high-dimensional regression requires algorithms that can update sequentially while preserving structural sparsity. We propose Adaptive Iterative Hard Thresholding (AIHT), an online sparse-regression framework that alternates stochastic subgradient updates with adaptively scheduled hard-thresholding steps. The key idea is to separate support discovery from local refinement: early in the learning process, AIHT delays thresholding so that weak but informative coordinates have time to accumulate signal, while later it increases the projection frequency to stabilize the sparse estimator and exploit local curvature. We develop the theory for high-dimensional online quantile regression, a challenging setting in which the loss is nonsmooth and the data may exhibit heterogeneity or heavy-tailed noise. Under restricted curvature and gradient-leakage conditions, AIHT remains in an inflated sparse cone, exhibits a two-phase convergence behavior, and attains logarithmic regret for the sliding-window objective. Simulations for online quantile regression, together with threshold-scheduling ablations, support the proposed mechanism and illustrate its advantage over standard online sparse-learning baselines.
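The alternation described in this abstract (stochastic subgradient steps on the quantile loss, interleaved with hard-thresholding steps that become more frequent over time) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the `1/sqrt(t)` step size, and the gap-halving projection schedule are hypothetical choices made for this example, not the schedule or tuning proposed by the authors.

```python
import random


def hard_threshold(beta, s):
    """Keep the s largest-magnitude coordinates of beta and zero the rest."""
    order = sorted(range(len(beta)), key=lambda j: abs(beta[j]), reverse=True)
    keep = set(order[:s])
    return [b if j in keep else 0.0 for j, b in enumerate(beta)]


def online_sparse_quantile(stream, dim, s, tau=0.5, step=0.5, first_gap=64):
    """Subgradient steps on the pinball loss with scheduled hard thresholding.

    The gap-halving schedule below is an illustrative assumption,
    not the adaptive rule from the paper.
    """
    beta = [0.0] * dim
    gap, next_projection = first_gap, first_gap
    for t, (x, y) in enumerate(stream, start=1):
        residual = y - sum(b * xi for b, xi in zip(beta, x))
        # A subgradient of the pinball loss in beta is -(tau - 1{residual < 0}) * x,
        # so the descent step adds weight * x.
        weight = tau - (1.0 if residual < 0 else 0.0)
        eta = step / t ** 0.5
        beta = [b + eta * weight * xi for b, xi in zip(beta, x)]
        if t == next_projection:
            beta = hard_threshold(beta, s)
            gap = max(1, gap // 2)  # project more often as learning proceeds
            next_projection = t + gap
    return hard_threshold(beta, s)


# Synthetic stream with a 2-sparse ground truth (made up for this sketch).
rng = random.Random(0)
truth = [2.0, -1.5] + [0.0] * 18


def make_stream(n):
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in truth]
        y = sum(c * xi for c, xi in zip(truth, x)) + rng.gauss(0, 0.1)
        yield x, y


estimate = online_sparse_quantile(make_stream(2000), dim=20, s=2)
```

Delaying the first projection (`first_gap`) gives small coordinates time to accumulate signal before anything is zeroed, which is the intuition the abstract gives for separating support discovery from refinement.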

[LG-162] Conformal Prediction with Macro-Coverage Guarantees

Link: https://arxiv.org/abs/2606.28598
Authors: Aabesh Bhattacharyya, Tiffany Ding, Rina Foygel Barber
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract: Prediction sets should have high coverage to be useful, but some coverage notions are more practically relevant than others. In the classification setting, class-conditional coverage requires that the prediction set (i.e., the set of candidate labels for a new test point) must achieve the target accuracy level within each class, which may be challenging to satisfy when many classes are rare and have few calibration points. At the other extreme, marginal coverage requires only that coverage holds on average over the distribution of all classes, which can lead to low-probability labels being essentially ignored. To find a middle ground, recent work has introduced macro-coverage, defined as the unweighted average of class-conditional coverages. Macro-coverage offers a compromise between marginal coverage and class-conditional coverage that is particularly appropriate for long-tailed settings. In this work, we show that label-weighted conformal prediction can be used to produce prediction sets with a finite-sample macro-coverage guarantee, and more generally a guarantee on a family of generalized macro-coverage objectives that aggregate coverage at the level of arbitrary class groupings and take a weighted average. We further characterize the form of the smallest prediction sets satisfying a given generalized macro-coverage objective and propose a corresponding conformal score function. We validate our theoretical results on two large-scale image classification datasets.
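To make the three coverage notions in this abstract concrete, the sketch below computes empirical marginal, class-conditional, and macro-coverage for a toy set of labels and prediction sets. The function name and the toy data are invented for this example, and macro-coverage is averaged only over classes that actually appear in the evaluation labels, which is an assumption of this sketch.

```python
from collections import defaultdict


def coverage_report(labels, prediction_sets):
    """Return (marginal, macro, per_class) empirical coverage."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for y, pred_set in zip(labels, prediction_sets):
        counts[y] += 1
        hits[y] += int(y in pred_set)
    # Class-conditional coverage: fraction of covered points within each class.
    per_class = {c: hits[c] / counts[c] for c in counts}
    # Marginal coverage: fraction of covered points overall.
    marginal = sum(hits.values()) / sum(counts.values())
    # Macro-coverage: unweighted average of the class-conditional coverages.
    macro = sum(per_class.values()) / len(per_class)
    return marginal, macro, per_class


# Toy long-tailed data: class "a" is common, class "b" is rare.
labels = ["a"] * 8 + ["b"] * 2
prediction_sets = [{"a"}] * 8 + [{"a"}, {"a", "b"}]
marginal, macro, per_class = coverage_report(labels, prediction_sets)
print(marginal, macro, per_class)
```

Here the rare class is covered only half the time, so marginal coverage is 0.9 while macro-coverage is 0.75: the unweighted average exposes the miss on the rare class that the marginal figure hides.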

[LG-163] Spectral phase transitions and trainability in neural network learning dynamics

Link: https://arxiv.org/abs/2606.28486
Authors: Chanju Park, Dario Bocchi, Francesco D’Amico, Biagio Lucini, Gert Aarts
Categories: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
Comments: 20 pages + appendix, many figures

Abstract: The emergence of low-dimensional structures in the spectra of neural network weight matrices is a common empirical feature of trained models, but the dynamical origin of this phenomenon during learning remains an open problem. We formulate neural network training as the stochastic evolution of an initially random matrix ensemble, driven by stochastic gradient descent (SGD) updates that reshape the spectral bulk while amplifying signal strength. This induces a Baik-Ben Arous-Péché (BBP) transition during training, where isolated eigenvalues detach from the random bulk distribution, providing a dynamical framework for representation formation in high-dimensional learning dynamics. We demonstrate this in a solvable linear teacher-student model, where spectral evolution is analytically tractable and a phase diagram of trainability governed by the step size (or learning rate) and initial weight variance is obtained, and subsequently extend our formalism beyond the linear regime to nonlinear and stochastic settings. Numerical simulations in realistic settings support this picture, showing robust emergence of spectral alignment during training. Our results suggest that spectral analysis may provide a unified perspective of stochastic learning dynamics, linking trainability, optimisation hyperparameters, spectral phase transitions, and representation learning in neural networks.
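For readers unfamiliar with the BBP transition, the standard textbook statement for an additive rank-one perturbation of a Wigner matrix may help. It is given here purely as background and is an assumption of this note: the symbols W, v, and θ are not taken from the paper, whose training-dynamics setting is more general.

```latex
% Illustrative textbook case, not the paper's model:
% W is an n x n Wigner matrix normalised so that its bulk spectrum fills [-2, 2],
% v is a unit vector, and \theta \ge 0 is the signal strength.
M = W + \theta\, v v^{\top}, \qquad
\lambda_{\max}(M) \xrightarrow[n \to \infty]{}
\begin{cases}
2, & \theta \le 1, \\[4pt]
\theta + \dfrac{1}{\theta}, & \theta > 1.
\end{cases}
```

Below the threshold the top eigenvalue sticks to the bulk edge; above it, an isolated eigenvalue detaches, which is the kind of detachment the abstract describes as emerging when training amplifies signal strength.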

Attachment download

Click to download the full list of today's papers