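This post lists the latest papers retrieved from arXiv.org on 2026-06-18. It is updated automatically and organized into six broad areas: NLP, CV, ML, AI, IR, and MA.

Note: Paper data is fetched from arXiv.org daily and refreshed automatically at around 12:30 each day.

Tip: If the list is not updated on a given day, either arXiv published no new papers that day or the script failed. Failures are fixed the same day whenever possible.

Contents

Overview (2026-06-18)

607 papers were updated today, including:

  • Natural language processing: 83 papers (Computation and Language (cs.CL))
  • Artificial intelligence: 167 papers (Artificial Intelligence (cs.AI))
  • Computer vision: 100 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine learning: 194 papers (Machine Learning (cs.LG))
  • Multi-agent systems: 12 papers (Multiagent Systems (cs.MA))
  • Information retrieval: 15 papers (Information Retrieval (cs.IR))
  • Human-computer interaction: 35 papers (Human-Computer Interaction (cs.HC))

Multi-Agent Systems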

[MA-0] Data Intelligence Agents: Interpreting Modeling and Querying Enterprise Data via Autonomous Coding Agents

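[Quick read]: This paper addresses the bottleneck in enterprise production data integration caused by repeated, inefficient manual handoffs between data owners, engineers, and analysts, which show up as poor collaboration and information loss during data discovery, structuring, and querying. Its core solution is Data Intelligence Agents (DIA), a system of three autonomous coding agents (ACAs): the Data Interpreter, the Schema Creator, and the Query Generator. DIA's key innovation is treating ACAs as a first-class abstraction, so that agents no longer merely emit text but generate, execute, validate, and repair concrete, actionable data artifacts; the agents also share a unified memory for experience reuse and surface a visual review interface to domain experts at key points. With the system deployed in a real production environment, the study focuses on evaluating the Query Generator in fully autonomous mode: across seven benchmarks covering four task categories and four SQL dialects, it matches or surpasses the best published results, showing that an execution-grounded DIA architecture built on ACAs and shared memory can generalize to complex data intelligence workloads through changes to natural-language instructions alone, without altering the underlying logic.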

Link: https://arxiv.org/abs/2606.19319
Authors: Anoushka Vyas, Aarushi Dhanuka, Sina Khoshfetrat Pakazad, Henrik Ohlsson
Affiliation: C3 AI
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments:

Click to view abstract

Abstract:Production data integration is bottlenecked by repeated, lossy handoffs between data owners, engineers, and analysts who must collaboratively discover, structure, and query enterprise data. We present Data Intelligence Agents (DIA), a system of three agents (Data Interpreter, Schema Creator, and Query Generator) that compresses this workflow by treating autonomous coding agents (ACAs) as a first-class abstraction: rather than emitting text, the agents generate, execute, validate, and repair concrete artifacts, draw on a shared memory for experience reuse, and surface each for review by domain experts. DIA is deployed in production for enterprise customers. We study the Query Generator in depth and evaluate it in fully autonomous mode across seven SQL benchmarks spanning four task categories and four dialects. It matches or surpasses the best published results on all seven, demonstrating that an architecture grounded in execution, built on ACAs and a shared memory, generalizes across the data intelligence workload with adaptation confined to natural-language instructions.

[MA-1] Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play

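[Quick read]: This paper addresses the limitations of large language model (LLM)-based multi-agent systems (MAS) on tasks with complex decision dependencies, in particular the common real-world decision problems that require joint reasoning from the stances of multiple parties. The core challenge is "stance entanglement": the stakeholders' decisions depend on one another and cannot be solved in isolation, which is fundamentally different from conventional execution complexity. To address it, the paper proposes Multi-Agent Fictitious Play (MAFP), a new paradigm that models each stakeholder's stance as a separate agent and formalizes decision-making as an iterative equilibrium-seeking process. Built on the game-theoretic principle of fictitious play, the method has each agent best-respond to the empirical distribution of the other agents' past decisions, so that agents progressively expose and correct one another's strategic weaknesses and steadily improve decision quality and robustness. Experiments show that MAFP outperforms single-round and multi-round baselines on strategy-setting tasks for competitive scenarios, on two complementary metrics, tournament strength and robustness, supporting its effectiveness in addressing stance entanglement.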

Link: https://arxiv.org/abs/2606.19308
Authors: Leyang Shen, Yang Zhang, Xiaoyan Zhao, Chun Kai Ling, Tat-Seng Chua
Affiliation: National University of Singapore
Categories: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: 18 pages, 8 figures

Click to view abstract

Abstract:Large language model (LLM)-based multi-agent systems (MAS) have demonstrated great potential in solving tasks with execution complexity, by distributing subtasks across cooperative agents. However, this divide-and-conquer paradigm falls short on decision-making tasks that are also prevalent in the real world. These tasks require simultaneous reasoning from the stances of all involved stakeholders whose decisions are mutually dependent and thus cannot be solved in isolation. We characterize this challenge as stance entanglement, a form of decision complexity distinct from execution complexity. To address it, we propose Multi-Agent Fictitious Play (MAFP), a novel MAS paradigm that represents stakeholder stances as agents and formulates decision-making as an equilibrium-seeking process. Built on the game-theoretic principle of fictitious play, MAFP iteratively updates each agent’s decision by best responding to the empirical mixture of other agents’ past decisions. This enables agents to expose and address one another’s weaknesses, progressively improving decision quality and robustness. We evaluate MAFP on challenging decision-making tasks that test the capability of deciding strategies for competitive scenarios prior to acting. MAFP outperforms both single-round and multi-round baselines on two complementary metrics, tournament strength and robustness, demonstrating its effectiveness in addressing stance entanglement.
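The best-response update described in the abstract can be illustrated with classical two-player fictitious play on a small matrix game. This is a minimal sketch of the textbook procedure, not MAFP itself: the payoff matrices, the function names `best_response` and `fictitious_play`, and the choice of initial action are all assumptions made for illustration, and MAFP applies the idea to LLM agents rather than to fixed payoff tables.

```python
from typing import List, Tuple


def best_response(payoff: List[List[float]], opponent_counts: List[int]) -> int:
    """Pick the action with the highest expected payoff against the
    empirical mixture of the opponent's past actions."""
    total = sum(opponent_counts)
    expected = [
        sum(row[b] * opponent_counts[b] for b in range(len(opponent_counts))) / total
        for row in payoff
    ]
    # Ties are broken in favor of the lowest action index.
    return max(range(len(expected)), key=lambda a: expected[a])


def fictitious_play(
    payoff_row: List[List[float]],
    payoff_col: List[List[float]],
    rounds: int,
) -> Tuple[List[float], List[float]]:
    """Run fictitious play; payoff_x[a][b] is player x's payoff when the
    row player picks a and the column player picks b."""
    n_row, n_col = len(payoff_row), len(payoff_row[0])
    # Hypothetical initialization: each player is assumed to have played action 0 once.
    row_counts = [1] + [0] * (n_row - 1)
    col_counts = [1] + [0] * (n_col - 1)
    # Re-index the column player's payoffs so its own action comes first.
    col_view = [[payoff_col[a][b] for a in range(n_row)] for b in range(n_col)]
    for _ in range(rounds):
        a = best_response(payoff_row, col_counts)
        b = best_response(col_view, row_counts)
        row_counts[a] += 1
        col_counts[b] += 1
    row_mix = [c / sum(row_counts) for c in row_counts]
    col_mix = [c / sum(col_counts) for c in col_counts]
    return row_mix, col_mix


if __name__ == "__main__":
    # Matching pennies: the row player wants to match, the column player to mismatch.
    pennies_row = [[1, -1], [-1, 1]]
    pennies_col = [[-1, 1], [1, -1]]
    print(fictitious_play(pennies_row, pennies_col, rounds=5000))
```

Each player only ever responds to the other's history, so a strategy that is exploitable tends to be punished in later rounds; this is the mechanism the abstract describes as agents exposing one another's weaknesses.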

[MA-2] Digital Speech Acts Retain Control of Copyright with People Not Platforms

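[Quick read]: This paper addresses the power imbalance that arises when centralized digital platforms concentrate copyright control over user-generated content (UGC). Under the existing legal framework, computer code is protected by copyright, yet platforms use Terms of Service agreements to require creators to surrender effective copyright control, so platforms monopolize both the value of content and the authority to set governance rules, weakening individuals' sovereignty over their own digital output. The key to the solution is the new concept of a "digital speech act": a person cryptographically signs personal content with their private key on their own device, thereby establishing attribution, accountability, and authorship at the technical level. The paper argues that (1) under existing U.S. case law, digital speech acts satisfy authorship (volitional creative choice), the minimal-creativity threshold (the Feist standard), and the Copyright Act's fixation requirement, and so have a legal basis for copyright protection; (2) decentralized grassroots platforms, through a digital social contract built into their design, ensure that a signature cannot be separated from its content and that the provenance chain accumulates as content is forwarded, so that ownership and possession coalesce in the individual and copyright is retained by design; and (3) copyright in digital speech acts is a prerequisite for digital sovereignty and democratic self-governance. The paper's core contribution is therefore to combine law, technology, and institutional design into an individual-centered, distributed system of digital rights.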

Link: https://arxiv.org/abs/2606.19263
Authors: James Golike, Ehud Shapiro
Affiliation: Unknown
Categories: Social and Information Networks (cs.SI); Computers and Society (cs.CY); Multiagent Systems (cs.MA); General Economics (econ.GN)
Comments:

Click to view abstract

Abstract:Legal precedents protect computer code as copyrightable expression. They have enabled centralized digital platforms – operating from corporate servers that hold all user data – to construct private governance regimes through the interaction of copyright, contract, and technical architecture: people who create virtually all platform value must surrender effective copyright control through Terms of Service agreements as a condition of participation. In contrast, grassroots platforms consist of cryptographically-identified people operating their networked smartphones independently of any server or global resource; each person holds their own data on their own device, with no third party in possession or intermediation. Here, we define the notion of a digital speech act – a deliberate volitional act by a person of cryptographically signing personal content with the person’s private key, carried out on the person’s own device – through which the person simultaneously establishes attribution, accountability, and authorship over the signed content. We contend that (a) digital speech acts qualify for copyright protection under existing U.S. precedent: Burrow-Giles locates authorship in volitional creative choices despite mechanical or algorithmic processes, Feist supplies the minimal-creativity threshold, and persistent device storage satisfies the Copyright Act’s fixation requirement; (b) the digital social contract underlying grassroots platforms preserves this copyright by design – signed content cannot be unbundled from its signature, and the full provenance chain accumulates as content is forwarded – so that ownership and possession coalesce in the person; and (c) copyright in digital speech acts is a prerequisite for digital sovereignty and democratic self-governance.

[MA-3] A Technical Taxonomy of LLM Agent Communication Protocols

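[Quick read]: This paper addresses the interoperability problem that multi-agent systems driven by large language models (LLMs) face in distributed networks because of fragmented communication protocols. The core issue is that many heterogeneous protocols coexist, making efficient and consistent collaboration between agents difficult. The key to the solution is a systematic technical taxonomy, built through five iterations of analysis (three empirical-to-conceptual and two conceptual-to-empirical) over nine actively maintained open-source protocols with demonstrable adoption. The taxonomy has five dimensions: counterparty, payload, interaction state, discovery mechanism, and schema flexibility. The study finds that all sampled agent-to-agent protocols combine hybrid payloads with session-state persistence, most protocols support multiple predefined schemas, and two negotiate schemas at runtime, indicating a trend toward schema flexibility; decentralized discovery, however, remains rare. The analysis further suggests that in the short term protocols will converge toward unifying agent-to-agent and agent-to-context (tool and data) communication, but that in the long term no single protocol is likely to maximize versatility, efficiency, and portability at once, so the field is more likely to evolve toward a federated, layered protocol stack. The taxonomy guides protocol selection and highlights open research gaps such as privacy and policy enforcement.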

Link: https://arxiv.org/abs/2606.19135
Authors: Linus Sander, Habtom Kahsay Gidey, Alexander Lenz, Alois Knoll
Affiliation: Technische Universität München
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments:

Click to view abstract

Abstract:As large language models (LLMs) advance and multi-agent systems aim to overcome the limits of standalone agents, robust communication protocols are becoming essential infrastructure for distributed agent networks. Nonetheless, the fragmented protocol landscape presents a significant interoperability challenge. This study develops a technical taxonomy to classify and analyze LLM agent communication protocols. Following an established iterative method, we defined the taxonomy’s purpose, meta-characteristic, and ending conditions, then performed five iterations, three empirical-to-conceptual and two conceptual-to-empirical, on nine actively maintained open-source protocols with demonstrable adoption. The taxonomy comprises five dimensions: counterparty, payload, interaction state, discovery mechanism, and schema flexibility. Classification reveals recurring architectural patterns: all sampled agent-to-agent protocols combine hybrid payloads with session-state persistence; most protocols support multiple predefined schemas, and two negotiate schemas at runtime, indicating a trend toward schema flexibility; decentralized discovery remains rare. Analysis suggests short-term convergence pressure toward protocols unifying agent-to-agent and agent-to-context (tool and data) communication. Long-term, however, no single protocol is likely to maximize versatility, efficiency, and portability simultaneously. The field will more likely evolve toward a federated, layered protocol stack. The framework guides protocol selection and highlights open research gaps such as privacy and policy enforcement.

[MA-4] Leadership as Coordination Control: Behavioral Signatures and the Recovery-Advantage Boundary in Multi-Agent LLM Teams

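[Quick read]: This paper asks under what measurable conditions process-level coordination control adds value in multi-agent LLM teams, and whether those conditions agree with the contingency view of leadership in team science. The key to its approach is to operationalize three classical leadership styles (transactional, transformational, situational) as explicit controllers over a shared action vocabulary (explore, revise, accept, synthesize), and to evaluate them with behavioral signatures (majority lock-in, exploration, recovery from an incorrect initial consensus) and per-action ablations. The study finds that a controller's effect comes from its theory-derived rule rather than from the action set itself, and that across task regimes and model families no controller consistently dominates on accuracy, as contingency theory predicts: coordination control shows an advantage only when the initial majority is unreliable, the task is recoverable, and undirected interaction does not repair the error on its own. The results support treating process-level coordination control as a measurable, theory-mappable contingency rather than a leaderboard to be topped.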

Link: https://arxiv.org/abs/2606.19111
Authors: Haewoon Kwak
Affiliation: Indiana University Bloomington
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 33 pages

Click to view abstract

Abstract:Team science holds that leadership is contingent: it helps only under specific conditions, and capable, autonomous teams may need none at all. We ask the analogous question for multi-agent LLM teams: under what measurable conditions does process-level coordination control add value, and do those conditions match what team science predicts? We use behavioral signatures (majority lock-in, exploration, recovery from an incorrect round-0 consensus) and per-action ablations, clean because each controller is an explicit action set, not a monolithic prompt. We operationalize three classical leadership styles (transactional, transformational, situational) as controllers over a shared action vocabulary (explore, revise, accept, synthesize). A matched controller with the same actions but an arbitrary rule recovers no better than majority voting, so the theory-derived rule, not the vocabulary, does the work. Across four task regimes and three open-weight model families, no controller dominates by accuracy, as the contingency view predicts: transactional control matches a shared round-0 vote on all 12 (model, regime) combinations to within 1.3pp, and gains appear only on the one combination where the round-0 majority is unreliable (llama-4-scout social; situational +8pp over flat). A recovery-advantage account, tested with four boundary probes, says a controller beats plain interaction only where the round-0 majority is unreliable, the task is recoverable, and undirected interaction does not already repair it. These regions map onto contingency theory (leadership substitutes, path-goal redundancy, the situational readiness gap), so a largely null accuracy result is what the theory predicts, not a failure of the controllers. We read process-level coordination control as a contingency to be measured and theory-mapped, not a leaderboard to be topped.

[MA-5] Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents

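[Quick read]: This paper addresses a problem for production LLM agents that depend on real-time search: native search grounding is tightly coupled to the large language model (LLM) and a specific vendor. Existing approaches bundle retrieval policy, provider choice, evidence injection, cost control, latency, and generation behavior behind a single model-provider boundary, which makes grounding hard to debug, tune, reuse, or port and can trigger Search-Induced Verbosity that breaks strict output contracts. The key to the solution is Decoupled Search Grounding (DSG), which moves grounding logic out of the reasoning model through an MCP (Model Context Protocol)-compatible gateway and exposes provider routing, source-aware context rendering, configurable fallback, retrieval-depth control, and exact plus semantic caching as explicit, fine-grained controls. Across five frontier models on SimpleQA, FreshQA, and HotpotQA, native search leads on the recency-sensitive FreshQA, but DSG offers a stronger trade-off where control matters: on SimpleQA it nearly matches native accuracy (86.1% vs. 87.7%) with 91% lower search cost, 68% lower latency, and a 99.4% warm-cache hit rate. On an e-commerce query-understanding (QIU) workload, it matches or slightly exceeds native-search accuracy while cutting search cost by more than 98%. The study argues that real-time grounding is best treated as an optimizable interface boundary rather than a fixed model feature.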

Link: https://arxiv.org/abs/2606.18947
Authors: Emmanuel Aboah Boateng, Kyle MacDonald, Amardeep Kumar, Siddharth Kodwani, Sudeep Das
Affiliation: DoorDash, Inc.
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
Comments: 15 pages, Figure 8

Click to view abstract

Abstract:Production LLM agents increasingly depend on real-time search, yet native search grounding bundles retrieval policy, provider choice, evidence injection, cost, latency, and generation behavior behind a single model-provider boundary. This coupling makes grounding hard to inspect, tune, reuse, or port, and can trigger Search-Induced Verbosity that breaks strict output contracts. We present Decoupled Search Grounding (DSG), a vendor-agnostic boundary that moves grounding outside the reasoning model through an MCP-compatible gateway, exposing provider routing, source-aware context rendering, configured fallback, retrieval-depth control, and exact plus semantic caching as first-class controls. Across five frontier models on SimpleQA, FreshQA, and HotpotQA, native search leads on recency-sensitive FreshQA, but DSG exposes a stronger frontier when control matters: on SimpleQA it nearly matches native accuracy (86.1% vs. 87.7%) at 91% lower search cost, preserves concise answer contracts, and reaches a 99.4% warm-cache hit rate with 68% lower latency. Deployed as a shared production grounding layer for large-scale agentic workloads with interchangeable models, DSG matches or slightly exceeds native-search accuracy on an e-commerce query-understanding (QIU) workload while cutting search cost by over 98%. Real-time grounding is best treated as an optimizable interface boundary, not a fixed model feature.
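Two of the controls named in the abstract, exact plus semantic caching and configured fallback across providers, can be sketched as a small gateway class. This is a toy illustration under stated assumptions, not DSG's implementation: `GroundingGateway`, its `search` method, the `provider_calls` counter, and the provider callables are hypothetical names, and token-overlap (Jaccard) similarity stands in for whatever embedding-based similarity a real semantic cache would use.

```python
from typing import Callable, Dict, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Toy stand-in for semantic similarity: token-set overlap."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


class GroundingGateway:
    """Hypothetical search gateway that sits outside the reasoning model."""

    def __init__(self, providers: List[Callable[[str], str]], threshold: float = 0.8):
        self.providers = providers      # tried in order; later entries are fallbacks
        self.threshold = threshold      # minimum similarity for a semantic cache hit
        self.cache: Dict[str, str] = {}
        self.provider_calls = 0         # counts attempts, including failed ones

    def search(self, query: str) -> Tuple[str, str]:
        """Return (evidence, source), where source is 'exact', 'semantic', or 'provider'."""
        key = query.strip().lower()
        # 1. Exact cache: normalized query string matches a stored key.
        if key in self.cache:
            return self.cache[key], "exact"
        # 2. Semantic cache: a sufficiently similar earlier query.
        for cached_query, result in self.cache.items():
            if jaccard(key, cached_query) >= self.threshold:
                return result, "semantic"
        # 3. Providers with fallback: move on when one raises.
        for provider in self.providers:
            self.provider_calls += 1
            try:
                result = provider(query)
            except Exception:
                continue
            self.cache[key] = result
            return result, "provider"
        raise RuntimeError("all search providers failed")
```

Because the cache and the provider order live in the gateway rather than in the model, they can be tuned or swapped without touching the reasoning model, which is the kind of separation the abstract argues for.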

[MA-6] Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems

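[Quick read]: This paper addresses the tension between model capability and experience retention in LLM-based automatic generation of multi-agent systems (MAS): inference-time MAS can use frozen frontier LLMs but cannot learn from past experience, while training-time MAS internalizes experience through gradient updates but is limited by the capability ceiling of small models and is hard to scale to large frontier LLMs. The key to the solution is a third path, Skill-MAS, which abstracts high-level orchestration capability as an evolvable Meta-Skill and thereby decouples experience retention from parameter updates. The method runs a closed optimization loop: (1) Multi-Trajectory Rollout samples a behavioral distribution for each task under the current Meta-Skill; (2) Selective Reflection adaptively selects high-priority tasks and uses hierarchical contrastive analysis to distill systemic experience into general, strategy-level principles. Experiments on four complex benchmarks and four LLMs show that Skill-MAS improves performance while maintaining a favorable cost-performance trade-off, and that the evolved Meta-Skills are robust and transfer well across unseen tasks and different LLMs.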

Link: https://arxiv.org/abs/2606.18837
Authors: Hehai Lin, Qi Yang, Chengwei Qin
Affiliation: Ant Group; The Hong Kong University of Science and Technology (Guangzhou)
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Large Language Model (LLM)-based automatic Multi-Agent Systems (MAS) generation has become a crucial frontier for tackling complex tasks. However, existing methods face a dilemma between model capability and experience retention. Inference-time MAS leverages frozen frontier LLMs but repeats identical searches without learning from past experience. Conversely, Training-time MAS internalizes experience via gradient updates but is constrained by the low capability ceiling of smaller models, and is hard to scale to large frontier LLMs. To bridge this gap, we propose Skill-MAS, a novel third path that decouples experience retention from parametric updates by conceptualizing the high-level orchestration capability as an evolvable Meta-Skill. Skill-MAS refines this architectural knowledge through a closed optimization loop: (1) Multi-Trajectory Rollout samples a behavioral distribution for each task under the current Meta-Skill; and (2) Selective Reflection adaptively selects priority tasks and applies hierarchical contrastive analysis to distill systemic experience into generalizable, strategy-level principles. Extensive experiments across four complex benchmarks and four distinct LLMs demonstrate that Skill-MAS not only achieves remarkable performance gains but also maintains a favorable cost-performance trade-off. Further analysis reveals that the evolved Meta-Skills are highly robust and exhibit strong transferability across unseen tasks and different LLMs.

[MA-7] EARS: Explanatory Abstention for Reliable Sub-Agent Modeling in Large-scale Multi-Agent Systems

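[Quick read]: This paper addresses a problem in centralized multi-agent systems (MAS) in large-scale enterprise settings: sub-agents often fail to calibrate their responses to their own capability limits, so they over-answer ambiguous, underspecified, misrouted, or unsupported requests and may even hallucinate. The core challenge is getting sub-agents to recognize tasks beyond their capability and decline them sensibly instead of forcing out unreliable content. The key to the solution is the EARS (Explanatory Abstention for Reliable Sub-Agent Modeling) framework, which reframes sub-agent abstention as an explainable inter-agent communication protocol: when a sub-agent detects that it cannot reliably complete a task, it does not simply refuse but exposes an actionable failure state with a structured rationale that the coordinator can use to clarify, reroute, or trigger a fallback. EARS uses an ensemble of calibrated LLM-as-a-Judge models to curate a structured human-agent interaction dataset labeled under a taxonomy of failure modes, which is then used to fine-tune sub-agents to detect failure conditions and produce explanatory rationales. In a large-scale production e-commerce assistant supporting enterprise business intelligence workflows, the method raises the overall response pass rate from 68.5% to 78.9%, showing that sub-agent-side explanatory abstention improves MAS reliability.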

Link: https://arxiv.org/abs/2606.18668
Authors: Shuang Xie, Yunan Lu, Han Li, Lingyun Wang
Affiliation: Shopify; Columbia University
Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:In large-scale enterprise settings, centralized multi-agent systems (MAS) are increasingly adopted, in which a coordinator delegates user requests to lightweight, domain-specialized sub-agents. While this architecture improves modularity, scalability, and cost efficiency, its reliability depends not only on accurate routing but also on sub-agents’ ability to calibrate their responses to capability constraints. In particular, sub-agents built on smaller fine-tuned models often struggle with such calibration, leading them to over-answer ambiguous, underspecified, misrouted, or unsupported requests and produce hallucinated outputs instead of actionable feedback. To address this challenge, we present EARS (Explanatory Abstention for Reliable Sub-Agent Modeling), a production-oriented framework that reframes sub-agent abstention as an inter-agent communication protocol: a sub-agent does not merely abstain, but exposes an actionable failure state to the coordinator. EARS curates human-agent interaction data using an ensemble of calibrated LLM-as-a-Judge models, producing structured abstention labels and rationales under a taxonomy of sub-agent failure modes. These data are used to fine-tune sub-agents to detect failure conditions and return rationales for coordinator-level clarification, rerouting, or fallback. We evaluate EARS in a large-scale production e-commerce assistant supporting enterprise business intelligence workflows. EARS improves the overall response pass rate from 68.5% to 78.9%, demonstrating that sub-agent-side explanatory abstention improves MAS reliability.
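The idea of abstention as an inter-agent message, rather than a bare refusal, can be sketched as a small data structure plus a coordinator dispatch. Everything here is hypothetical: the abstract names the request types (ambiguous, underspecified, misrouted, unsupported) and the coordinator responses (clarification, rerouting, fallback), but the `FailureMode` enum, the `SubAgentReply` fields, and the specific mapping from failure mode to action are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureMode(Enum):
    """Hypothetical taxonomy, named after the request types in the abstract."""
    AMBIGUOUS = "ambiguous"
    UNDERSPECIFIED = "underspecified"
    MISROUTED = "misrouted"
    UNSUPPORTED = "unsupported"


@dataclass
class SubAgentReply:
    """Either an answer, or an abstention with a failure mode and rationale."""
    answer: Optional[str] = None
    failure_mode: Optional[FailureMode] = None
    rationale: str = ""


def coordinator_action(reply: SubAgentReply) -> str:
    """Map a sub-agent reply to a coordinator-level next step.

    The mapping below is an assumption for illustration only.
    """
    if reply.failure_mode is None:
        return "deliver"
    if reply.failure_mode in (FailureMode.AMBIGUOUS, FailureMode.UNDERSPECIFIED):
        return "clarify"    # ask the user for the missing detail
    if reply.failure_mode is FailureMode.MISROUTED:
        return "reroute"    # hand the request to a different sub-agent
    return "fallback"       # unsupported: use a fallback path


if __name__ == "__main__":
    reply = SubAgentReply(
        failure_mode=FailureMode.MISROUTED,
        rationale="Request concerns shipping rates, outside this agent's domain.",
    )
    print(coordinator_action(reply), "-", reply.rationale)
```

The rationale string is what makes the abstention actionable: the coordinator can forward it to the user or use it to choose the next sub-agent.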

[MA-8] Gender Bias in LLM Hiring Decisions: Evidence from a Japanese Context and Evaluation of Mitigation Strategies

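[Quick read]: This paper asks whether LLMs show gender bias in a non-Western setting, specifically Japanese corporate hiring, and how such bias can be mitigated. The study finds that, despite the very different cultural context, five mainstream large language models (LLMs) show a significant pro-female bias when evaluating Japanese-format resumes (rirekisho), replicating Western findings in a non-Western context. The key finding for mitigation is that the candidate's name is the main channel for gender bias: removing the name from the prompt eliminates nearly all of the female effect, indicating that the name is the primary carrier of the gender signal. By contrast, a prompt-level gender-neutrality instruction does not meaningfully reduce bias. The study also uncovers an incompatibility between the privacy filter and GPT-4o's content safety mechanism that produces a refusal rate of 42%, highlighting a practical challenge for name anonymization in real hiring pipelines.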

Link: https://arxiv.org/abs/2606.18649
Authors: Serena A. Hoffstedde, Machiko Hirota, Akshara Nadayanur Sathis Kanna, Rihito Kotani, Ujwal Kumar, Gabriele Trovato, Phan Xuan Tan
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) are increasingly deployed in hiring workflows, yet most research on gender bias in LLM hiring decisions has focused on English-language, Western-format resumes. This study examines whether pro-female gender bias extends to a Japanese corporate context and evaluates two practical mitigation strategies. Using a counterfactual resume design with 60 Japanese rirekisho-format resumes, 12 name pairs selected on linguistically grounded gender-signal criteria, and five state-of-the-art LLMs (Claude Sonnet 4.6, GPT-4o, DeepSeek-V3, Gemini 2.5 Flash, Llama 3.3 70B), we conducted 43,200 API calls across baseline, prompt instruction, and privacy filter conditions. A crossed random-effects linear mixed model confirms a significant pro-female bias across all five models, replicating Western findings in a non-Western context. A prompt-level gender-neutrality instruction produces no meaningful reduction in bias. A name-reliance analysis formally identifies the candidate name as the primary gender channel: removing the name from the prompt reduces the female effect by nearly its full magnitude. An unexpected incompatibility between the privacy filter and GPT-4o’s content safety filter, resulting in a 42% refusal rate, highlights a practical deployment challenge for name anonymization in LLM-assisted recruitment pipelines.

[MA-9] PersonalPlan: Planning Multi-Agent Systems for Personalized Programming Learning

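[Quick read]: This paper addresses the lack of personalization in LLM-assisted programming education: existing multi-agent systems (MAS) based on large language models (LLMs) are capable of complex task planning but generally lack awareness of the learner's background and pedagogical scaffolding, so they cannot deliver truly personalized programming learning. The key to the solution is PersonalPlan, a two-stage multi-agent planning framework: it first applies hierarchical SFT with separate low-rank adapters (LoRA) for profile-aware task decomposition and step dependency planning, and then uses a Reward-Adaptive GRPO algorithm to steer the model toward executable, personalized, and pedagogically scaffolded plans. To support the method, the study also builds the MAP-PPL dataset of 3,043 query-profile-plan instances drawn from 1,730 Stack Overflow question groups and 2,738 learner profiles, where each plan specifies agents, subtasks, executable steps, and prerequisite dependencies. Experiments show that, with only 8B and 32B models, PersonalPlan achieves state-of-the-art plan executability, personalization, and pedagogical quality, effectively orchestrating MAS for agent-student interactions.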

Link: https://arxiv.org/abs/2606.18633
Authors: Zhiyuan Wen, Jiannong Cao, Peng Gao, Haochen Shi, Wengpan Kuan, Bo Yuan, Xiuxiu Qi
Affiliation: The Hong Kong Polytechnic University; China Mobile Research Institute; Nankai University
Categories: Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:Effective programming education requires personalized instruction adapted to diverse learner backgrounds. However, while LLM-based multi-agent systems (MAS) excel at complex planning, existing planners often lack profile-grounding and pedagogical scaffolding, thereby undermining personalized programming learning. To fill in the gap, we first introduce MAP-PPL (Multi-Agent Plans for Personalized Programming Learning), a profile-conditioned multi-agent planning dataset with 3,043 query–profile–plan instances from 1,730 Stack Overflow question groups and 2,738 learner profiles. Each plan specifies agents, subtasks, executable steps, and prerequisite dependencies. Then, we propose PersonalPlan, a two-stage MAS planner that first performs hierarchical SFT with separate LoRA adapters for profile-aware task decomposition and step dependency planning, then applies a Reward-Adaptive GRPO to encourage the model to generate executable, personalized, and pedagogically scaffolded plans. Extensive experiments on MAP-PPL comparing PersonalPlan against frontier LLMs, generic MAS frameworks, and agentic planners demonstrate its superiority. With only 8B and 32B variants, PersonalPlan achieves state-of-the-art plan executability, personalization, and pedagogical quality, effectively orchestrating MAS for agent-student interactions.

[MA-10] LLM Zero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

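[Quick read]: This paper addresses the lack of adaptivity in parameter schedules for reinforcement learning (RL) post-training, in particular the non-stationary exploration-exploitation trade-off in multi-stage training. Existing methods rely on fixed parameter schedules that cannot respond to changing training dynamics, which limits performance. The core of the solution is LLMZero, a system in which large language model (LLM) agents run a tree search over training trajectories, diagnose training pathologies at each checkpoint, and propose coordinated multi-parameter transitions. The key finding is that capacity parameters accumulate monotonically across training stages, while regularization parameters mostly oscillate with the training dynamics, a structural pattern that shows how different parameters respond differently as training evolves. This principle yields actionable design rules for adaptive multi-stage training strategies. Across 4 different GRPO tasks, the discovered strategies improve over the base model by 9% to 140% relative and over grid search by 6% to 15% relative, and the principle transfers across tasks.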

Link: https://arxiv.org/abs/2606.18388
Authors: Haoyang Fang, Wei Zhu, Boran Han, Alex Zhang, Zhenyu Pan, Shuo Yang, Shuai Zhang, Jiading Gai, Peng Tang, Cuixiong Hu, Xuan Zhu, Huzefa Rangwala, George Karypis, Bernie Wang
Affiliation: Amazon
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:RL post-training strategies are dataset-dependent and reveal a recurring empirical pattern: capacity parameters accumulate monotonically across stages, while regularization parameters predominantly oscillate in response to shifting training dynamics. This distinction matters because fixed schedules commit all parameters to fixed trajectories and therefore cannot express the non-stationary exploration-exploitation tradeoffs that regularization must track; the principle provides actionable design rules for multi-stage training. We discover this through LLMZero, a system where LLM agents search over training trajectories via tree search, diagnosing pathologies at each checkpoint and proposing coordinated multi-parameter transitions. Across 4 diverse GRPO tasks, LLMZero discovers strategies that improve over the base model by 9% to 140% relative and over grid search by 6% to 15% relative, consistently outperforming random search and the skill-based agent. The structural principle transfers across tasks, providing an explanation for why discovered strategies take qualitatively different forms yet share similar parameter dynamics.

[MA-11] Characterizing Opinion Evolution of Networked LLMs

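[Quick read]: This paper asks how interactions between large language models (LLMs) in multi-agent systems shape the evolution of group opinion. As LLMs are increasingly used to simulate human discourse, in influence operations, and on fully LLM-driven social platforms, the resulting opinion propagation is not yet well understood. Classical opinion dynamics models explain how interactions shape collective beliefs in human societies, but naive versions fit the behavior of LLM networks poorly. The study finds that plain averaging-style models fail to track LLM opinion dynamics, but that adding bias, an innate opinion toward which agents regress, substantially improves fidelity, reducing cumulative estimated mean opinion error by up to 88%. The conclusions generalize across model families, discussion topics, and network structures, indicating that bias is a significant driver of LLM opinion dynamics.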

Link: https://arxiv.org/abs/2606.18276
Authors: Caleb Probine, Yigit Ege Bayiz, Filippos Fotiadis, Samuel Li, Yunhao Yang, Ufuk Topcu
Affiliation: The University of Texas at Austin
Categories: Multiagent Systems (cs.MA); Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph)
Comments: 19 pages, 2 figures

Click to view abstract

Abstract:Large language models (LLMs) increasingly interact with one another in multi-agent systems, from simulations of human discourse to influence operations and fully LLM-driven social platforms. These interactions give rise to new regimes of opinion propagation that are not yet well understood. We investigate whether classical opinion dynamics models, which have long been used to explain how interactions shape collective beliefs in human societies, can capture the behavior of LLM networks. We find that, while naive averaging-style models fail to track LLMs’ opinion dynamics, simple modifications yield substantial gains in modeling fidelity. In particular, bias, an innate opinion toward which agents regress, emerges as a significant driver of LLM opinion dynamics, with its inclusion reducing cumulative estimated mean opinion error by up to 88%. We additionally find that these conclusions generalize across model families, discussion topics, and networks.
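The contrast between a naive averaging model and one with a bias term can be made concrete with a small simulation. The sketch below assumes one common way of writing such a term, x_{t+1} = (1 - pull) * W x_t + pull * bias, in the spirit of DeGroot-style averaging with an added anchor; the abstract does not give the paper's exact parameterization, so the `pull` coefficient and the function names are illustrative assumptions.

```python
from typing import List, Optional


def step(
    opinions: List[float],
    weights: List[List[float]],
    bias: Optional[List[float]] = None,
    pull: float = 0.0,
) -> List[float]:
    """One update: each agent averages its neighbors' opinions (rows of
    `weights` sum to 1), then optionally regresses toward its innate bias."""
    averaged = [sum(w * x for w, x in zip(row, opinions)) for row in weights]
    if bias is None:
        return averaged
    return [(1.0 - pull) * a + pull * b for a, b in zip(averaged, bias)]


def simulate(
    opinions: List[float],
    weights: List[List[float]],
    steps: int,
    bias: Optional[List[float]] = None,
    pull: float = 0.0,
) -> List[float]:
    """Apply `step` repeatedly and return the final opinion vector."""
    current = list(opinions)
    for _ in range(steps):
        current = step(current, weights, bias=bias, pull=pull)
    return current


if __name__ == "__main__":
    # Three fully connected agents with uniform influence weights.
    w = [[1 / 3] * 3 for _ in range(3)]
    start = [0.0, 0.5, 1.0]
    print("naive averaging:", simulate(start, w, steps=20))
    print("with bias 0.8:  ", simulate(start, w, steps=20, bias=[0.8] * 3, pull=0.5))
```

In this toy setup, pure averaging settles on the mean of the starting opinions, while the biased variant is drawn toward the innate value regardless of where the agents start, which is the qualitative difference a bias term introduces.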

Natural Language Processing

[NLP-0] Native Active Perception as Reasoning for Omni-Modal Understanding ICML2026

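[Quick read]: This paper addresses two problems in long video understanding: passive models follow a "watch-it-all" paradigm whose computational cost grows with video duration, and existing interactive frameworks rely on global pre-scanning, so their context cost still scales with video length. The core solution is OmniAgent, the first native omni-modal agent, which formulates video understanding as an iterative Observation-Thought-Action cycle based on a partially observable Markov decision process (POMDP). It executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, decoupling reasoning complexity from raw video duration. The key innovations are: (1) Agentic Supervised Fine-Tuning, which bootstraps native active perception through best-of-N trajectory synthesis with dual-stage quality control; and (2) Agentic Reinforcement Learning with TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which uses turn-level entropy to steer credit assignment toward pivotal discovery turns. Experiments show that OmniAgent exhibits positive test-time scaling, with performance improving as the number of reasoning turns increases; it achieves state-of-the-art performance among open-source models on ten benchmarks (e.g., VideoMME, LVBench), and its 7B version outperforms the 10× larger Qwen2.5-VL-72B on LVBench (50.5% vs. 47.3%).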

Link: https://arxiv.org/abs/2606.19341
Authors: Zhenghao Xing, Ruiyang Xu, Yuxuan Wang, Jinzheng He, Ziyang Ma, Qize Yang, Yunfei Chu, Jin Xu, Junyang Lin, Chi-Wing Fu, Pheng-Ann Heng
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted at ICML 2026. Code and models: this https URL

Click to view abstract

Abstract:Passive models for long video understanding typically rely on a “watch-it-all” paradigm, processing frames uniformly regardless of query difficulty, causing computational cost to grow with video duration. Although interactive frameworks have emerged, they often rely on global pre-scanning, and their context cost still scales with video length. We propose OmniAgent, the first native omni-modal agent that formulates video understanding as a POMDP-based iterative Observation-Thought-Action cycle. OmniAgent executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration. To operationalize this, we introduce (1) Agentic Supervised Fine-Tuning to bootstrap native active perception via best-of-N trajectory synthesis with dual-stage quality control, and (2) Agentic Reinforcement Learning with TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which leverages turn-level entropy to steer credit assignment toward pivotal discovery turns. Crucially, OmniAgent exhibits positive test-time scaling, where performance improves as the number of reasoning turns increases, validating the efficacy of active perception. Empirical results across ten benchmarks (e.g., VideoMME, LVBench) demonstrate that OmniAgent achieves state-of-the-art performance among open-source models. Notably, on LVBench, our 7B agent outperforms the 10× larger Qwen2.5-VL-72B (50.5% vs. 47.3%).

[NLP-1] Learning User Simulators with Turing Rewards

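[Quick read]: This paper addresses how to simulate human user behavior in interactive settings, in support of training agent assistants, evaluating personalization systems, and social science research. Existing approaches typically train a large language model (LLM) to match the real user's response, either by maximizing its log probability or by using a similarity reward, but this tends to produce responses that are overly mechanical or that drift from the real user's style. The paper proposes Turing-RL, a Turing-Test-based reinforcement learning framework whose core innovation is a discriminative Turing reward: an LLM judge scores how indistinguishable a generated response is from the real user's given the user's history, and the user simulator learns to produce responses that are hard to tell apart from what the real user could have said. Across two domains, conversational chat and Reddit forum discussion, Turing-RL consistently outperforms baseline methods on both LLM-based and human evaluation metrics, indicating that optimizing for indistinguishability is more effective than pure response matching.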

Link: https://arxiv.org/abs/2606.19336
Authors: Yingshan Susan Wang, Cedegao E. Zhang, Linlu Qiu, Zexue He, Pengyuan Li, Alex Pentland, Roger P. Levy, Yoon Kim
Affiliation: Massachusetts Institute of Technology; Stanford University; MIT-IBM Watson AI Lab
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Learning to simulate human users in interactive settings could advance the training of agent assistants, evaluation of personalization systems, research in the social sciences, and more. Existing approaches generally do so by training a large language model (LLM) to match a single ground truth response, either by maximizing the log probability or by using a similarity reward. We instead propose Turing-RL: a Turing-Test-based reinforcement learning approach for training user simulator models. Turing-RL uses a discriminative Turing reward with an LLM judge to score how indistinguishable a generated response is from the real user’s given the user’s history, and the user simulator LLM learns to produce responses indistinguishable from what the user could have said with such rewards. Across two different domains–conversational chat and Reddit forum discussion–we find that Turing-RL consistently outperforms baseline methods on both LLM and human evaluation metrics. Our study suggests that optimizing for indistinguishability, rather than response matching, is effective for learning user simulators.

[NLP-2] Freeing the Law with LOCUS: A Local Ordinance Corpus for the United States

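[Quick Read]: This paper addresses the long-standing lack of machine-readable, large-scale data on U.S. local ordinances in legal AI research. Local ordinances cover core areas of everyday governance such as zoning, housing, business licensing, public health, noise, and animal control, yet their text is scattered across vendor platforms built for human browsing, which makes bulk research access difficult. The paper introduces LOCUS, the Local Ordinance Corpus for the United States: a raw corpus covering 9,239 cities and counties, plus a county-harmonized access layer for the largest 2,309 of the 3,144 U.S. counties, which together account for a majority of the population. The key to the approach is using optical character recognition (OCR) to handle the wide range of document formats, turning local law that was previously not machine-readable into computable text. The release also includes coverage metadata and ModernBERT-based classifiers and scorers that support large-scale analysis of local ordinances along dimensions such as opacity and paternalism. The corpus and its derivative models are publicly released to support reproducible legal AI research and the continued expansion of machine-readable access to local law.

Link: https://arxiv.org/abs/2606.19334
Authors: Denis Peskoff,Joe Barrow,Christopher Vu,Diag Davenport
Affiliations: UC Berkeley; School of Information
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 14 pages, 6 figures

Click to view abstract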

Abstract:Progress in legal AI increasingly depends on access to authoritative legal text at scale. Yet one of the most consequential layers of American law remains largely absent from existing machine-readable corpora: local ordinances. Local codes govern zoning, housing, business licensing, public health, noise, animal control, and many other domains of everyday regulation, but they are fragmented across vendor platforms designed for human browsing rather than bulk research access. We introduce LOCUS - the Local Ordinance Corpus for the United States - a comprehensive corpus and county-harmonized access layer for U.S. municipal and county ordinance codes. The raw corpus, available for release to researchers, represents nearly all publicly available municipal and county ordinance codes. The resulting raw corpus contains codes from 9,239 cities and counties. A smaller county-harmonized LOCUS access layer provides coverage for the largest 2,309 of 3,144 U.S. counties, accounting for a majority of the population. We use OCR to handle the myriad of document formats that have kept the law from being a public resource. We release the corpus with coverage metadata to support reproducibility, downstream legal AI research, and the incremental expansion of machine-readable access to local law. We train a collection of ModernBERT-based classifiers and scorers to facilitate analyzing U.S. local law among several dimensions, such as opacity and paternalism, that have not previously been studied at this scale. LOCUS-v1 and its derivative models are available at: this https URL

[NLP-3] Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

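[Quick Read]: This paper addresses the weak supervision signals and coarse feedback granularity of current post-training methods for reasoning language models. Supervised distillation depends on chain-of-thought (CoT) annotations that are expensive and may be noisy, incomplete, or partially incorrect, while reinforcement learning with verifiable rewards compresses evaluative feedback into a scalar signal, making it hard to locate which parts of a response need improvement. The paper proposes Rubric-Conditioned Self-Distillation, which conditions the teacher model on structured, criterion-level rubrics so that, during on-policy self-distillation, it provides token-level guidance on trajectories sampled by the student itself. Rather than treating a single reference rationale as the sole supervision target, the rubrics state what a strong response should satisfy, enabling finer-grained credit assignment over the reasoning process. On a suite of science reasoning benchmarks, the framework surpasses GRPO by 1.0 points and OPSD by 0.9 points on average, showing that it converts rubric-level criteria into fine-grained guidance over the reasoning process.

Link: https://arxiv.org/abs/2606.19327
Authors: Siyi Gu,Jialin Chen,Sophia Zhou,Arman Cohan,Rex Ying
Affiliations: Yale University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract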

Abstract:Post-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to obtain and may themselves be noisy, incomplete, or partially incorrect; even when the final solution is correct, an imperfect rationale can interfere with learning. Reinforcement learning with verified rewards, on the other hand, typically compresses evaluative feedback into a scalar signal, obscuring which aspects of a response should be improved. We propose Rubric-Conditioned Self-Distillation, a framework that incorporates rubrics as structured, fine-grained feedback for on-policy self-distillation. Our method conditions the teacher model on criterion-level rubrics and uses it to provide token-level guidance on the student’s own sampled trajectories. This design avoids treating a single reference rationale as the sole supervision target. Instead, rubrics specify what a strong response should satisfy, enabling more fine-grained credit assignment over the reasoning process than scalar reward optimization. We instantiate this framework with a two-stage pipeline that first learns to generate task-specific rubrics and then trains a rubric-guided reasoner. We evaluate on a diverse suite of science reasoning benchmarks and results show that rubric-conditioned self-distillation effectively converts rubric-level criteria into token-level guidance over the reasoning process, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average.

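[NLP-4] Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA

[Quick Read]: This paper addresses the unclear effectiveness of different strategies for adapting large language models (LLMs) to a specialized domain, using French medical question answering as a case study. It systematically compares continual pretraining (CPT), supervised fine-tuning (SFT), and their combination across three model families, multiple model sizes, and three initialization types, explicitly separating the effect of the adaptation strategy from the choice of base model. For multiple-choice QA (MCQA), CPT+SFT most often achieves the best scores, but its gains over SFT are small and frequently not statistically significant, so SFT is the more cost-effective default when compute is limited. For open-ended QA (OEQA), CPT consistently improves overlap-based automatic metrics, whereas SFT often degrades generation quality; LLM-as-a-Judge evaluation prefers instruction tuning and CPT+SFT. Cross-lingual experiments further show that French-domain adaptation transfers effectively to English benchmarks. Overall, the study offers empirical guidance for choosing an adaptation strategy under computational constraints.

Link: https://arxiv.org/abs/2606.19266
Authors: Ikram Belmadani,Oumaima El Khettari,Carlos Ramisch,Frederic Bechet,Richard Dufour,Benoit Favre
Affiliations: Aix-Marseille Univ., CNRS, LIS UMR 7020, 13000 Marseille, France; Nantes Univ., École Centrale Nantes, CNRS, LS2N UMR 6004, 44000 Nantes, France; Grenoble Alpes Univ., CNRS, INRIA, Grenoble INP, LIG UMR 5217, 38000 Grenoble, France
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract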

Abstract:The development of large language models (LLMs) has led to an increased focus on their adaptation to specialized domains and languages, yet the effectiveness of domain adaptation strategies remains unclear. We present a study of medical domain adaptation using French medical question-answering (QA) as a case study. We compare continual pretraining (CPT), supervised fine-tuning (SFT), and their combination across three model families, multiple sizes, and three initialization types, explicitly disentangling adaptation effects from base model choice. We evaluate both multiple-choice (MCQA) and open-ended QA (OEQA) under greedy and constrained decoding using automatic metrics and LLM-as-a-Judge evaluation. For MCQA, CPT+SFT most often achieves the best scores, but gains over SFT are small and frequently not statistically significant, making SFT a strong and cost-effective default. For OEQA, CPT consistently improves overlap-based metrics, while SFT often degrades generation quality; instruction tuning and CPT+SFT are preferred by LLM-based evaluation. Cross-lingual experiments further show effective transfer from French adaptation to English benchmarks. Overall, we provide practical guidelines for selecting adaptation strategies under computational constraints.

[NLP-5] Structured Inference with Large Language Gibbs

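[Quick Read]: This paper addresses how to access the knowledge encoded in large language models (LLMs) in a probabilistically coherent way for structured probabilistic inference. Conventional approaches sample a complex structure of variables in a single autoregressive pass, which is prone to order-dependent biases. The paper proposes Large Language Gibbs, which uses the LLM's conditional next-token distributions as transition operators in Markov chain Monte Carlo (MCMC): each variable is iteratively resampled conditioned on the others. This avoids order-dependent bias and yields a stationary distribution that reflects a compromise among all the local conditionals, offering a more robust alternative to single-pass generation when the world prior is only accessible through noisy, imperfect LLM conditionals.

Link: https://arxiv.org/abs/2606.19264
Authors: Sanghyeok Choi,Henry Gouk,Esmeralda S. Whitammer
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Code: this https URL

Click to view abstract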

Abstract:The knowledge encoded in large language models (LLMs) can serve as a substrate for structured reasoning over variables describing a complex world, but accessing this knowledge in a probabilistically coherent manner poses a difficult inference problem. We propose Large Language Gibbs, a scheme for structured probabilistic inference that uses conditional distributions of an LLM as transition operators. Rather than sampling structured objects through single-pass autoregressive generation, we iteratively resample individual variables conditioned on others using an LLM’s next-token conditionals. This approach avoids order-dependent biases and produces a stationary distribution that reflects a compromise between all local conditionals. We apply this approach to sampling from synthetic distributions, consistent reasoning tasks, and Bayesian structure learning. The results suggest that the use of LLM conditionals in MCMC is a practical alternative to one-pass generation for structured probabilistic inference under a world prior accessible through noisy LLM conditionals.
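As a rough illustration of the resampling loop the abstract describes, the sketch below runs Gibbs sampling over two binary variables. The `conditional` function is a hypothetical stand-in for an LLM's next-token conditional, the variable names and probabilities are invented for illustration, and none of this is the authors' code.

```python
import random

def conditional(var, state):
    """Hypothetical stand-in for an LLM conditional: P(var = 1 | other variable).

    The numbers are invented; a real system would query an LLM here.
    """
    other = state["rain" if var == "wet" else "wet"]
    if var == "wet":
        return 0.9 if other == 1 else 0.2
    return 0.8 if other == 1 else 0.1

def gibbs(n_sweeps, seed=0):
    """Resample each variable in turn, conditioned on the current value of the other."""
    rng = random.Random(seed)
    state = {"rain": 0, "wet": 0}
    samples = []
    for _ in range(n_sweeps):
        for var in ("rain", "wet"):
            state[var] = 1 if rng.random() < conditional(var, state) else 0
        samples.append(dict(state))
    return samples

samples = gibbs(5000)
burn_in = 500  # discard early sweeps before the chain settles
kept = samples[burn_in:]
p_rain = sum(s["rain"] for s in kept) / len(kept)
p_wet = sum(s["wet"] for s in kept) / len(kept)
```

Because each variable is always resampled given the current value of the other, no fixed generation order is baked into the samples, which is the property the abstract contrasts with one-pass autoregressive generation.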

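[NLP-6] DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning Models

[Quick Read]: This paper addresses the tension between scalability and reasoning quality when the parallel decoding of block diffusion language models is applied to long chain-of-thought (CoT) reasoning. Block diffusion language models accelerate decoding through parallel block-wise denoising, but whether they can be reliably scaled to long-CoT reasoning has remained unresolved. The authors develop DreamReasoner-8B, an open-source block diffusion reasoning model, and systematically study how training and inference block sizes affect long-CoT reasoning. They find that training with large block sizes severely degrades reasoning, whereas small block sizes preserve it. To bridge this granularity gap, the paper proposes block-size curriculum learning, which moves training gradually from fine-grained to coarse-grained block sizes so that reasoning ability generalizes across inference block sizes. On mathematical and code reasoning benchmarks, DreamReasoner-8B is competitive with leading open autoregressive models such as Qwen3-8B, offering a practical path toward efficient, reasoning-capable diffusion language models.

Link: https://arxiv.org/abs/2606.19257
Authors: Zirui Wu,Lin Zheng,Jiacheng Ye,Shansan Gong,Xueliang Zhao,Yansong Feng,Wei Bi,Lingpeng Kong
Affiliations: The University of Hong Kong; Peking University
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract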

Abstract:Block diffusion language models accelerate decoding through parallel block-wise denoising, yet whether they can be reliably scaled for long chain-of-thought (CoT) reasoning remains unresolved. To this end, we develop DreamReasoner-8B, an open-source block diffusion reasoning model, and conduct a systematic study of how training and inference block sizes affect long-CoT reasoning. Our analysis reveals a stark performance disparity: training with large block sizes yields remarkably poor reasoning, whereas small block sizes preserve effective reasoning. To bridge this granularity gap, we propose block-size curriculum learning, which gradually transitions training from fine-grained to coarse-grained block sizes, thereby overcoming this limitation and enabling strong reasoning performance that generalizes across diverse inference block sizes. On mathematical and code reasoning benchmarks, DreamReasoner-8B achieves results competitive with leading open autoregressive models such as Qwen3-8B. This work establishes a practical foundation for efficient, reasoning-capable diffusion language models. We release our model at this https URL.
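A fine-to-coarse schedule of the kind the abstract describes could look roughly like the sketch below. The function name, the block sizes, and the equal-length stages are all illustrative assumptions; the abstract does not state the actual schedule.

```python
def block_size_at(step, total_steps, sizes=(4, 8, 16, 32)):
    """Return the training block size for a step, moving from small to large blocks.

    The sizes and the equal-length stages are illustrative assumptions,
    not values reported in the paper.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    stage = step * len(sizes) // total_steps
    return sizes[stage]

# Block sizes at a few points of a 1000-step run.
schedule = [block_size_at(s, 1000) for s in (0, 249, 250, 600, 999)]
```

A training loop would call `block_size_at` once per step and build its denoising blocks at that granularity, so early updates see fine-grained blocks and later updates see coarse ones.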

[NLP-7] STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

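[Quick Read]: This paper addresses policy entropy collapse, a common problem when post-training large language models (LLMs) with Reinforcement Learning with Verifiable Rewards (RLVR). Algorithms such as GRPO perform well on complex reasoning, but policy entropy gradually falls too low during training, which costs exploration and degrades generalization. A first-order gradient analysis of token-level entropy dynamics reveals a credit assignment mismatch: the per-token entropy change is the product of the trajectory-level advantage and an entropy sensitivity function over the next-token distribution, giving an advantage-surprisal four-quadrant structure and a near-criticality property. Building on this, the paper proposes STARE (Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability), which uses batch-internal surprisal quantiles to identify entropy-critical token subsets, selectively reweights their effective advantages, and adds a target-entropy closed-loop gate for stable entropy regulation. Across model scales from 1.5B to 32B and three task families (Short CoT, Long CoT, and Multi-Turn Tool Use), STARE keeps policy entropy stable over thousands of training steps. On AIME24 and AIME25 it improves average accuracy by 4%-8% over DAPO and other baselines, with reflection tokens and response length growing in tandem, indicating a sustained balance between exploration and exploitation that improves the sustainability and performance ceiling of RL training.

Link: https://arxiv.org/abs/2606.19236
Authors: Haipeng Luo,Qingfeng Sun,Songli Wu,Can Xu,Wenfeng Deng,Han Hu,Yansong Tang
Affiliations: Tsinghua University; Tencent Hunyuan
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: LLM, Reinforcement Learning

Click to view abstract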

Abstract:Reinforcement Learning with Verifiable Rewards algorithms like GRPO have emerged as the dominant post-training paradigm for complex reasoning in LLMs, yet commonly suffer from policy entropy collapse during training. We conduct a first-order gradient analysis of token-level entropy dynamics under GRPO and identify a token-level credit assignment mismatch: the per-token entropy variation decomposes into the product of the trajectory-level advantage and an entropy sensitivity function over the next-token distribution, yielding an advantage-surprisal four-quadrant structure and a near-criticality property. Motivated by it, we propose STARE (Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability), which identifies entropy-critical token subsets via batch-internal surprisal quantiles, selectively reweights their effective advantages, and incorporates a target-entropy closed-loop gate for stable entropy regulation. Across model scales from 1.5B to 32B and three task families (Short CoT, Long CoT, and Multi-Turn Tool Use), STARE sustains stable RL training over thousands of steps while maintaining policy entropy within the target band. On AIME24 and AIME25, STARE outperforms DAPO and other competitive baselines by 4%-8% in average accuracy, with reflection tokens and response length growing in tandem, indicating sustained exploration-exploitation balance that further unlocks RL training this http URL is available at this https URL.

[NLP-8] RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering

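[Quick Read]: This paper addresses the difficulty automatic evaluation metrics have in delivering both validity (telling genuine content alignment from surface coincidence) and discriminative power (telling a better system from a worse one) on open-ended, opinion-driven question answering. Metrics that separate real answers from random noise well cannot reliably rank different large language models (LLMs), so the two goals are in tension. The paper introduces RECOM (Reddit Evaluation for Correspondence of Models), a contamination-free evaluation dataset of 15,000 r/AskReddit questions, each paired with authentic community replies posted after every evaluated model's training cutoff. Cosine similarity separates real from random answers well (Cohen's d ≈ 2) but barely ranks five open-source models with 7-10B parameters (|d| < 0.1). BERTScore precision shows some ranking ability before response length is controlled (|d| up to 0.63), but this falls to |d| = 0.09 once length is controlled, and its validity is also much weaker (d ≈ 0.8, versus ≈ 2 for cosine similarity). This tradeoff is a property of the metrics rather than the models, and the authors attribute it to representation design. Three independent LLM judges reproduce the pattern and likewise separate the models only weakly. The authors recommend reporting metrics on both axes, validity and discriminative power, alongside an explicit random-baseline floor. The RECOM dataset is publicly released.

Link: https://arxiv.org/abs/2606.19218
Authors: Pushwitha Krishnappa,Amit Das,Vinija Jain,Aman Chadha,Tathagata Mukherjee
Affiliations: University of Alabama Huntsville; University of North Alabama; Stanford University; Meta AI; Amazon GenAI
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract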

Abstract:Automatic metrics are the default for evaluating LLM-generated text, yet a metric is quietly asked to do two jobs: tell genuine content alignment from surface coincidence (validity), and tell a better system from a worse one (discriminative power). On open-ended, opinion-driven question answering, the two are in tension. We introduce RECOM (Reddit Evaluation for Correspondence of Models), a contamination-free evaluation dataset of 15,000 r/AskReddit questions (September 2025), each paired with its authentic community replies, which postdate every evaluated model’s training cutoff. Scoring five open-source LLMs (7–10B) against every reply, with each metric paired with a random-derangement noise floor, we find that no metric does both jobs well. Cosine similarity separates real from random answers (Cohen’s d ≈ 2) but cannot rank the five models (|d| < 0.1); BERTScore precision appears to rank the models (raw |d| up to 0.63), but once response length is controlled this collapses to |d| = 0.09 and its validity is weak (d ≈ 0.8, versus cosine’s ≈ 2). Because every metric scores the same outputs, this validity–discrimination tradeoff is a property of the metrics, not the models, and we argue it stems from representation design. Three independent LLM judges reproduce the validity gap and likewise separate the five models only weakly. We recommend reporting metrics on both axes, with an explicit random-baseline floor. RECOM is publicly available at this https URL
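The validity measurement can be pictured as an effect size between scores on matched pairs and scores on deliberately mismatched pairs. The sketch below uses one common definition of Cohen's d (pooled sample standard deviation) and a rejection-sampled derangement; the `jaccard` scorer and the toy sentences are hypothetical stand-ins, and the paper's exact procedure may differ.

```python
import random
from statistics import mean, stdev

def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (sample variances)."""
    na, nb = len(a), len(b)
    pooled = (((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)) ** 0.5
    return (mean(a) - mean(b)) / pooled

def derangement(n, rng):
    """Sample a permutation with no fixed points by rejection; needs n >= 2."""
    while True:
        perm = list(range(n))
        rng.shuffle(perm)
        if all(i != p for i, p in enumerate(perm)):
            return perm

def validity_effect(score, outputs, replies, seed=0):
    """Effect size between matched (output, reply) scores and deranged pairings."""
    rng = random.Random(seed)
    perm = derangement(len(outputs), rng)
    matched = [score(o, r) for o, r in zip(outputs, replies)]
    shuffled = [score(o, replies[j]) for o, j in zip(outputs, perm)]
    return cohens_d(matched, shuffled)

def jaccard(a, b):
    """Toy word-overlap score, used only to make the example runnable."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

outputs = ["cats purr softly", "pizza needs cheese", "rain makes mud", "books hold stories"]
replies = ["cats purr a lot", "pizza with extra cheese", "rain turns dirt to mud", "books tell stories"]
d = validity_effect(jaccard, outputs, replies)
```

The same `cohens_d` helper can be applied to the scores of two different models to estimate discriminative power, which is how a metric can land high on one axis and low on the other.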

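[NLP-9] Language Models as Interfaces Not Oracles: A Hybrid LLM-ML System for Pediatric Appendicitis

[Quick Read]: This paper addresses the limitations of using large language models (LLMs) directly as diagnostic engines in clinical decision support: sensitivity to prompts, dependence on information order, and plausible but incorrect outputs. Structured machine-learning models give more stable predictions but fit poorly with clinical workflows built around free text. The paper proposes ClaMPAPP (Clinical Language-assisted Machine-learning Pipeline for Appendicitis), a hybrid system that treats the LLM as a natural-language interface rather than the final decision-maker. The LLM extracts schema-constrained clinical features from unstructured clinical notes, deterministic plausibility checks safeguard data quality, and the validated features are passed to an XGBoost classifier for risk prediction. On two independent pediatric appendicitis cohorts from German hospitals, ClaMPAPP achieved the strongest diagnostic performance while minimizing missed cases, outperforming end-to-end LLM baselines, which degraded more when the narrative was reordered. The results support an "LLM as interface, ML as predictor" design that separates the convenience of natural-language interaction from the reliability of predictive inference and makes clinical decision support more auditable.

Link: https://arxiv.org/abs/2606.19183
Authors: Soheyl Bateni,Maryam Abdolali
Affiliations: K. N. Toosi University of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract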

Abstract:Large language models (LLMs) can make clinical decision support more accessible by interpreting free-text documentation, but their direct use as diagnostic engines is limited by sensitivity to prompts, information order, and plausible but incorrect outputs. Structured machine-learning models offer more stable risk prediction, yet they require tabular inputs that are difficult to integrate with narrative clinical workflows. We present ClaMPAPP (Clinical Language-assisted Machine-learning Pipeline for Appendicitis), a hybrid system that uses an LLM as an interface rather than as the final decision-maker. ClaMPAPP extracts schema-constrained clinical features from note-like narratives, applies deterministic plausibility checks, and passes validated features to an XGBoost classifier trained on clinical, laboratory, and ultrasound variables. We evaluated ClaMPAPP on two independent pediatric appendicitis cohorts from German hospitals and compared it with end-to-end LLM baselines, including open-source and proprietary models. To preserve ground truth while testing free-text input, narratives were generated from structured electronic health records through template rendering and constrained LLM rewriting, with additional sentence-order permutation to assess positional robustness. ClaMPAPP achieved the strongest overall diagnostic performance in both internal and external validation while minimizing missed appendicitis cases, the key safety concern in acute triage. End-to-end LLMs showed unstable sensitivity-specificity trade-offs and greater degradation under narrative reordering. These results support an LLM-as-interface, ML-as-predictor design that separates natural-language usability from predictive inference and provides a more auditable pathway for clinical decision support.
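The extract, check, then predict flow described above might be sketched as follows. The schema, field names, and ranges are hypothetical, and `toy_classifier` is only a placeholder for a trained tabular model; none of this reflects the paper's actual features or thresholds.

```python
# Hypothetical schema: feature name -> (type, min, max). The names and
# ranges are illustrative assumptions, not the paper's actual schema.
SCHEMA = {
    "age_years": (float, 0.0, 18.0),
    "body_temp_c": (float, 30.0, 43.0),
    "wbc_count": (float, 0.0, 100.0),
}

def validate(extracted):
    """Deterministic plausibility check on LLM-extracted features.

    Returns (clean_features, problems). Out-of-range, mistyped, or unknown
    fields are reported rather than silently passed to the classifier.
    """
    clean, problems = {}, []
    for name, value in extracted.items():
        if name not in SCHEMA:
            problems.append(f"unknown field: {name}")
            continue
        kind, low, high = SCHEMA[name]
        try:
            number = kind(value)
        except (TypeError, ValueError):
            problems.append(f"not numeric: {name}={value!r}")
            continue
        if not low <= number <= high:
            problems.append(f"out of range: {name}={number}")
            continue
        clean[name] = number
    return clean, problems

def predict(extracted, classifier):
    """Run the classifier only on validated features; `classifier` is any callable."""
    clean, problems = validate(extracted)
    return classifier(clean), problems

def toy_classifier(feats):
    """Placeholder for a trained tabular model; the rule here is made up."""
    return 1.0 if feats.get("wbc_count", 0.0) > 12.0 else 0.0

risk, problems = predict(
    {"age_years": "9", "body_temp_c": 38.4, "wbc_count": 250, "mood": "ok"},
    toy_classifier,
)
```

Keeping the checks deterministic means every rejected field leaves a readable entry in `problems`, which is one way to get the auditability the abstract emphasizes.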

[NLP-10] Dango: A Strictly L1-Only Large Language Model for Studying Second Language Acquisition

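[Quick Read]: This paper addresses premature exposure to the second language (L2) caused by contaminated pretraining data when large language models are used to study second language acquisition (SLA), focusing on Japanese-to-English transfer. Prior work has mostly relied on small or non-decoder models, which struggle to generate open-ended text and are therefore less suitable as realistic L2 learner simulators. The key solution is a filtering method that removes English content from the "monolingual" pretraining corpus to reduce premature exposure to the target language while preserving realistic, minimal exposure. The model is then fine-tuned on LLM-generated L2-learning lessons to simulate the acquisition process. Experiments show that the resulting Dango model (1.8B parameters) develops human-like L2 production patterns and outperforms both unfiltered and standard multilingual baselines. The model, data, and code are released to support reproducible computational SLA research and learner-facing applications.

Link: https://arxiv.org/abs/2606.19170
Authors: Shiho Matta,Yin Jou Huang,Fei Cheng,Takashi Kodama,Hirokazu Kiyomaru,Yugo Murawaki
Affiliations: Kyoto University; NII-LLMC
Categories: Computation and Language (cs.CL)
Comments: 8 pages main text, 20 pages total including references and appendices

Click to view abstract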

Abstract:We introduce Dango, a 1.8B-parameter large language model designed for controlled studies of L1-to-L2 (Japanese-to-English) transfer in second language acquisition (SLA). While previous studies have explored SLA in language models, they have predominantly relied on smaller or non-decoder models, limiting their ability to generate open-ended text and reducing their suitability as practical L2 simulators. We identify a key challenge when scaling models to this size: L2 contamination within the “monolingual” pretraining corpus used for L1 acquisition. To address this, we propose a filtering method to reduce premature exposure to English while preserving realistic, minimal exposure. We then fine-tune the model on LLM-generated L2-learning lessons to simulate the L2 acquisition process. Our evaluations confirm that Dango develops human-like L2 production patterns, outperforming both unfiltered and standard multilingual baselines. We release the model, data, and code to facilitate reproducible computational SLA research and learner-facing applications.

[NLP-11] Human-AI Coevolution Dynamics: A Formal Theory of Social Intelligence Emergence Through Long-Term Interaction

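[Quick Read]: This paper addresses the lack of a unified theoretical framework for explaining stable social relationships and the emergence of social intelligence in long-term human-AI interaction. Existing methods typically handle social behavior through isolated modules such as emotion modeling, memory retrieval, and persona conditioning, and so cannot describe how complex social interaction evolves over time. The paper proposes the Human-AI Coevolution Dynamics Framework (HACD-H), which models human-AI interaction as a self-organizing social cognitive system integrating emotional adaptation, relational organization, social memory, and personality consistency. Its key concepts include multi-timescale social cognition, relational attractors, trust basins, developmental phase transitions, and social cognitive energy. Using a conversational dataset of approximately 14,700 interaction turns and a theory-driven empirical evaluation framework, the authors report a hierarchy of temporal persistence in social cognition, stable relational attractors, phase-transition-like developmental patterns, and a structured social cognitive energy landscape. Social intelligence is significantly negatively correlated with social cognitive energy (r = -0.391, p < 0.001), and interaction trajectories show progressive energy reduction, suggesting that social intelligence emerges from long-term social cognitive coevolution rather than from isolated conversational capabilities. HACD-H offers a unified theoretical foundation for building adaptive, socially intelligent human-AI interaction systems.

Link: https://arxiv.org/abs/2606.19144
Authors: Jingyi Zhou,Senlin Luo,Haofan Chen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract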

Abstract:Current conversational AI systems have made significant progress in language generation, personalization, and long-context interaction. However, most existing methods model social behavior through isolated components such as emotion modeling, memory retrieval, or persona conditioning, lacking a unified framework to explain the emergence of stable social relationships and social intelligence in long-term human-AI interaction. To address this, we propose the Human-AI Coevolution Dynamics Framework (HACD-H), a formal model of human-AI interaction as a self-organizing social cognitive system. HACD-H integrates emotional adaptation, relational organization, social memory, and personality consistency into a unified dynamical framework and introduces principles including multi-timescale social cognition, relational attractors, trust basins, developmental phase transitions, and social cognitive energy. We construct a conversational dataset with approximately 14,700 interaction turns and develop a theory-driven empirical evaluation framework. Results reveal a hierarchy of temporal persistence in social cognition, stable relational attractors, phase-transition-like developmental patterns, and a structured social cognitive energy landscape. Social intelligence shows a significant negative correlation with social cognitive energy (r = -0.391, p < 0.001), and interaction trajectories exhibit progressive energy reduction over time. These findings suggest that social intelligence emerges from long-term social cognitive coevolution rather than isolated conversational capabilities. HACD-H provides a unified theoretical foundation for modeling adaptive human-AI social interaction and developing socially intelligent AI systems.

[NLP-12] Urdu Katib Handwritten Dataset: A Historical Document Dataset for Offline Urdu Handwritten Text Recognition with CRNN-Based Baseline Evaluation

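[Quick Read]: This paper addresses the obstacles facing Urdu Handwritten Text Recognition (UHTR): a shortage of high-quality benchmark datasets and limited systematic research on its cursive script. Urdu writing in the Nastalique calligraphic style is distinctly cursive, and handwritten samples from historical documents are scarce and hard to obtain, which has slowed progress. The study presents the Urdu Katib Handwritten Dataset (UKHD), an offline dataset of handwritten text lines curated from material written by Katibs in historical times. To the authors' knowledge it is the first such dataset, and it covers a diverse range of flat-nib writing variations. The study also evaluates several CRNN-based hybrid models. Among them, the CNN-BGRU-CTC model was the most robust, with low Character Error Rate (CER) and Word Error Rate (WER). The work is intended as a foundation for digitizing and preserving Urdu handwritten literature.

Link: https://arxiv.org/abs/2606.19139
Authors: Ramza Basharat,Muhammad Usman Ali
Affiliations: University of Gujrat
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Click to view abstract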

Abstract:Automatic Handwritten Text Recognition (HTR) is inherently a challenging task, and its complexity is further increased when dealing with cursive scripts. Although significant efforts have been made on various cursive scripts, research regarding Urdu Handwritten Text Recognition (UHTR) has been relatively limited. This lag of research is primarily due to the unique challenges posed by its script, and the scarcity and unavailability of benchmark datasets. Therefore, to advance research in UHTR, this study presents a specialized real dataset called the Urdu Katib Handwritten Dataset (UKHD). To the best of our knowledge, this is the first offline Urdu handwritten text lines dataset specifically curated from the materials written by Katibs in historical times. It encompasses a diverse range of flat nib writing variations in the Nastalique calligraphic style. Additionally, the effectiveness of different CRNN-based hybrid models has been evaluated to identify the optimal architecture for Urdu Katib Handwriting Recognition (UKHR). Among the analyzed models, the CNN-BGRU-CTC model showed more robust performance, with low Character Error Rate (CER) and Word Error Rate (WER). This research work aims to support and encourage the research community in developing a robust recognition system for preserving Urdu handwritten literature.

[NLP-13] Sumi: Open Uniform Diffusion Language Model from Scratch

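[Quick Read]: This paper addresses the absence of a large-scale uniform diffusion language model (UDLM) pretrained from scratch. Autoregressive and masked diffusion models already have capable models at large parameter and token budgets, but no UDLM of comparable scale has been pretrained and released, so research on its generation dynamics, controllability, and trade-offs against established models has no reference point. To fill this gap, the paper introduces Sumi ("ink" in Japanese), a fully open 7B-parameter uniform diffusion language model pretrained from scratch on 1.5T tokens. The key contribution is a complete, reproducible large-scale pretraining recipe, including a full specification of the data mixture over publicly available corpora. Sumi is competitive with autoregressive models trained at comparable token budgets on knowledge, reasoning, and coding benchmarks, but underperforms on commonsense benchmarks, likely in part because of its education-heavy data mixture. The release gives the community a basis for studying native uniform diffusion at scale and its still poorly understood properties.

Link: https://arxiv.org/abs/2606.19005
Authors: Mengyu Ye,Keito Kudo,Wataru Ikeda,Ryosuke Matsuda,Keisuke Sakaguchi,Jun Suzuki
Affiliations: Tohoku University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract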

Abstract:Diffusion models have become a promising alternative to autoregressive models. Among these, uniform diffusion language models (UDLMs) permit any token to be updated at any step, in principle enabling more flexible generation. However, no UDLM has yet been pretrained from scratch at both large parameter scale and large token budget. Both autoregressive modeling and masked diffusion modeling already have capable models at scale that the community can study and build on; uniform diffusion has none. A scratch-pretrained UDLM at scale would provide a clean reference point for studying scaling behavior, generation dynamics, controllability, and trade-offs against established autoregressive and masked diffusion models. To this end, we introduce Sumi (“ink” in Japanese), a fully open 7B uniform diffusion language model pretrained from scratch on 1.5T tokens. Sumi performs competitively with autoregressive models trained at comparable token budgets on knowledge, reasoning, and coding benchmarks, while under-performing on commonsense benchmarks, where our education-heavy data mixture is a likely contributor. We release our model weights, checkpoints, and full training recipe, including a complete specification of the data mixture over publicly available corpora. We hope this release enables the community to study native uniform diffusion at scale and catalyzes work on its as-yet poorly understood aspects.

[NLP-14] Enhancing Multilingual Reasoning via Steerable Model Merging ACL2026

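[Quick Read]: This paper addresses conflicts between source models that arise when a multilingual model and a reasoning model are merged with a uniform strategy: a one-size-fits-all merge cannot adapt to how much each input needs from each source model, which hurts final performance. The paper proposes Steerable Model Merging (ST-Merge), a framework that uses a gated cross-attention mechanism to adaptively weight or filter the contributions of the two source models, so that each model's participation is adjusted to the input and potential conflicts are reduced. Experiments show that ST-Merge consistently outperforms multiple strong baselines on four multilingual reasoning benchmarks across 21 languages.

Link: https://arxiv.org/abs/2606.19002
Authors: Zhuoran Li,Rui Xu,Jian Yang,Junnan Liu,Zhijun Chen,Qianren Mao,Hongcheng Guo,Jiaheng Liu,Likang Xiao,Ming Li,Xiaojie Wang
Affiliations: Beijing University of Posts and Telecommunications; Fudan University; Beihang University; Monash University; Zhongguancun Laboratory; Nanjing University; Tsinghua University
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 7 figures, 8 tables. Accepted by ACL2026 Findings

Click to view abstract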

Abstract:Model merging is an effective technique for composing the capabilities of a multilingual model and a reasoning model. It has achieved promising generalization in multilingual reasoning tasks by aligning feature spaces of different models. However, the merged single model often fails to address the conflicts between source models, leading to suboptimal performance. In other words, the one-size-fits-all merging strategy may not align with the characteristics of different inputs which may require prioritizing certain models over others. To this end, we propose a Steerable Model Merging (ST-Merge) framework to modulate the contribution of each source model. To realize this idea, we introduce a gated cross-attention mechanism to weight or filter the two attended source models in an adaptive manner. Extensive experiments demonstrate that ST-Merge consistently outperforms multiple strong baselines on four multilingual reasoning benchmarks across 21 different languages.

[NLP-15] G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment ACL2026

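[Quick Read]: This paper addresses semantic distortion in cross-lingual idiom transfer, which stems from idioms' non-compositionality and weak surface-form grounding, and in particular the bias of LLMs toward literal translation, which is most pronounced for low-resource target languages. The paper presents G-IdiomAlign, a benchmark that uses English glosses from Wiktionary as semantic pivots to align idioms, together with a high-confidence reference alignment set for reproducible evaluation. It supports two evaluation protocols: (1) a Multiple-Choice Idiom Equivalence task with typed distractors for error attribution, and (2) Gloss-Contrastive Generation, which contrasts No-gloss and With-gloss inputs under an embedding-based semantic proxy to isolate the effect of explicit semantic guidance. Glosses consistently improve generation, but overall performance remains modest, indicating substantial headroom in the open output space. Further analysis indicates that cross-condition differences are concentrated more in attention heads than in layers, and that better generations coincide with stronger gloss anchoring.

Link: https://arxiv.org/abs/2606.18989
Authors: Fengying Ye,Yanming Sun,Runzhe Zhan,Zheqi Zhang,Lidia S. Chao,Derek F. Wong
Affiliations: NLP2CT Lab, Department of Computer and Information Science, University of Macau; Faculty of Arts and Humanities, University of Macau
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to ACL 2026

Click to view abstract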

Abstract:Idioms are difficult to transfer across languages due to their non-compositionality and weak surface-form grounding, making literal mappings unreliable. We present G-IdiomAlign, a gloss-pivoted benchmark where each idiom is anchored by an English gloss from Wiktionary. We further construct a high-confidence reference alignment set for reproducible evaluation. G-IdiomAlign supports two protocols: (1) a controlled Multiple-Choice Idiom Equivalence with typed distractors for error attribution; and (2) a Gloss-Contrastive Generation contrasting No-gloss and With-gloss inputs to isolate the effect of an explicit semantic pivot. Across diverse LLMs, a bias to literal translation is a dominant failure mode, especially when the target is a low-resource language. Glosses consistently improve Gloss-Contrastive Generation under an embedding-based semantic proxy, but performance remains modest, indicating substantial headroom in the open output space. Subsequent analysis on Qwen3-8B further suggests that cross-condition differences are concentrated more in attention heads than in layers, while better With-gloss generations coincide with stronger gloss anchoring.

[NLP-16] Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering

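[Quick Read]: This paper addresses the tokenization bottleneck that arises in time-series question answering (TSQA) when raw numerical series are fed directly into large language models (LLMs). Byte Pair Encoding (BPE) fragments continuous values into unstable tokens, losing magnitude, scale, and trend information. Existing methods mostly use patch-based encoders that split the series into fixed windows; this eases the tokenization problem but locks in one granularity, breaks the continuity of temporal patterns, and transfers poorly across datasets with different lengths or sampling rates. The paper proposes CADE (Contrastive Alignment with Direct Embedding), built on two components. The first is direct timestep embedding: a point-wise linear encoder and an MLP projector map each timestep directly into the LLM embedding space, preserving exact index-level access and removing the need for patching and padding. The second is semantic alignment: a one-directional supervised contrastive loss aligns time-series embeddings with frozen class-name text anchors to bridge the gap between time-series and language representations. On the public Time-MQA benchmark, CADE improves performance across six TSQA tasks and outperforms both open-source and proprietary LLM baselines.

Link: https://arxiv.org/abs/2606.18986
Authors: Yafeng Wu,Huu Hiep Nguyen,Thin Nguyen,Hung Le
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract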

Abstract:Recent advances in large language models (LLMs) have given rise to time-series question answering (TSQA), which formulates time-series analysis as natural-language question answering. However, directly feeding raw numerical series into LLMs suffers from a tokenization bottleneck: Byte Pair Encoding fragments continuous values into unstable tokens whose embeddings lack meaningful metric structure, resulting in the loss of magnitude, scale, and trend information. Prior methods use patch-based encoders that split the series into fixed windows, locking in one granularity that breaks patterns and hides exact timesteps, through a separate module that rarely transfers across datasets with different lengths or sampling rates. To address this challenge, we propose CADE (Contrastive Alignment with Direct Embedding), a novel framework for TSQA built upon two key components: direct timestep embedding and semantic alignment. The proposed framework maps each timestep directly into the LLM embedding space through a point-wise linear encoder and MLP projector, preserving exact index-level access while eliminating the need for patching and padding. To further bridge the semantic gap between time-series and language representations, we introduce a novel one-directional supervised contrastive loss that aligns time-series embeddings with frozen class-name text anchors. Experimental results on the public Time-MQA benchmark demonstrate that our framework consistently improves performance across six TSQA tasks, outperforming both open-source and proprietary LLM baselines.
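To make the "one embedding per timestep" idea concrete, here is a minimal pure-Python sketch of a point-wise linear encoder followed by a two-layer MLP projector. The layer sizes, random initialization, and helper names are assumptions for illustration only; the contrastive alignment loss is not shown, and this is not the authors' implementation.

```python
import random

def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer (illustrative init)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(layer, x, relu=False):
    """Apply one dense layer to a vector, optionally followed by ReLU."""
    weights, bias = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    return [max(0.0, o) for o in out] if relu else out

def embed_series(series, encoder, hidden, project):
    """Map every timestep to its own embedding vector: no patching, no padding."""
    return [dense(project, dense(hidden, dense(encoder, [x]), relu=True)) for x in series]

rng = random.Random(0)
d_enc, d_hidden, d_model = 4, 8, 6   # toy sizes; d_model stands in for the LLM width
encoder = make_layer(1, d_enc, rng)           # point-wise linear encoder
hidden = make_layer(d_enc, d_hidden, rng)     # MLP projector, first layer
project = make_layer(d_hidden, d_model, rng)  # MLP projector, output layer

series = [0.1, 0.5, 0.9, 0.5, 0.1]
embeddings = embed_series(series, encoder, hidden, project)
```

The output has exactly one vector per input value, so a series of any length maps to a sequence of the same length, in contrast with a patch encoder that fixes a window size up front.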

[NLP-17] GraphPO: Graph-based Policy Optimization for Reasoning Models

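[Quick read]: This paper addresses two limitations of Reinforcement Learning with Verifiable Rewards (RLVR): independently sampled responses cause redundant exploration and wasted computation, and sparse final-answer rewards make it hard to identify useful intermediate reasoning steps. Tree-based methods provide finer-grained signals by sharing prefixes and comparing branches from the same prefix, but branches that reach similar reasoning states still cannot share information, repeat similar exploration, and yield higher-variance advantage estimates. The proposed framework, GraphPO (Graph-based Policy Optimization), represents rollouts as a directed acyclic graph (DAG) with semantic states as nodes and reasoning steps as edges. It merges semantically equivalent reasoning paths into equivalence classes so that they share suffixes, and reallocates budget from redundant expansions to diverse exploration. GraphPO also assigns efficiency advantages to incoming edges and correctness advantages to outgoing edges, deriving process supervision from outcome rewards while improving inference efficiency. Theoretical analysis shows that the method reduces advantage-estimation variance and improves reasoning efficiency. Experiments on three large language models (LLMs) across reasoning and agentic search benchmarks show that GraphPO consistently outperforms chain- and tree-based baselines under the same token or response budgets.

Link: https://arxiv.org/abs/2606.18954
Authors: Yuliang Zhan, Xinyu Tang, Jian Li, Dandan Zheng, Weilong Chai, Jingdong Chen, Jun Zhou, Ge Wu, Wenyue Tang, Hao Sun
Affiliations: Renmin University of China; Ant Group
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract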

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for enhancing the capability of large reasoning models. RLVR typically samples responses independently and optimizes the policy using rewards from final answers. This paradigm has two limitations. First, independently sampled responses often contain similar intermediate reasoning steps, causing redundant exploration and wasted computation. Second, sparse final-answer rewards make it hard to identify useful steps. Tree-based methods partly address this problem by sharing prefixes and comparing branches from the same prefix to provide fine-grained signals. However, tree branches are still expanded independently. When different branches reach similar reasoning states, they cannot share information and repeat similar exploration. Moreover, tree-based methods ignore such dispersion and only perform local comparisons within separate branches, which can lead to higher variance in advantage estimation. To address this challenge, we propose GraphPO (Graph-based Policy Optimization), a novel RL framework that represents rollouts as a directed acyclic graph, with reasoning steps as edges and semantic states summarized from the reasoning paths as nodes. GraphPO merges semantically equivalent reasoning paths into equivalence classes, allowing them to share suffixes and reallocating budget away from redundant expansions to diverse exploration. Furthermore, we assign efficiency advantages to incoming edges and correctness advantages to outgoing edges, thereby improving inference efficiency while deriving process supervision from outcome. Theory shows that GraphPO reduces advantage-estimation variance and enhances reasoning efficiency. Experiments on three LLMs across reasoning and agentic search benchmarks show that GraphPO consistently outperforms chain- and tree-based baselines with the same token budgets or response budgets.
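
The idea of merging equivalent reasoning states into one graph node can be sketched as follows. This is a toy illustration, not the GraphPO estimator: state keys such as `"s1"` are hypothetical stand-ins for semantic summaries that are assumed to be already normalized, node values are plain averages of outcome rewards, and the edge signal `value(dst) - value(src)` is a simplification chosen here for clarity.

```python
from collections import defaultdict

def merge_rollouts(rollouts):
    """Merge chains into a graph: steps that reach the same state key share one node.

    rollouts: list of (state_keys, outcome_reward); keys are assumed not to repeat
    within a path, so the merged graph stays acyclic.
    """
    edges = set()
    rewards_at = defaultdict(list)
    for states, reward in rollouts:
        path = ["<root>"] + list(states)
        edges.update(zip(path, path[1:]))
        for state in path:
            rewards_at[state].append(reward)
    # Node value: mean outcome reward over every rollout that passes through the node.
    values = {s: sum(r) / len(r) for s, r in rewards_at.items()}
    # Simplified per-edge signal derived from outcomes only.
    advantages = {(src, dst): values[dst] - values[src] for src, dst in edges}
    return edges, values, advantages

rollouts = [
    (["s1", "s2", "correct"], 1.0),
    (["s3", "s2", "wrong"], 0.0),
    (["s1", "s4", "wrong"], 0.0),
]
edges, values, advantages = merge_rollouts(rollouts)
```

The first two rollouts reach `"s2"` through different prefixes, so the merged node pools both outcomes, which a tree with separate branches would keep apart.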

[NLP-18] SenFlow: Inter-Sentence Flow Modeling for AI-Generated Text Detection in Hybrid Documents

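[Quick read]: This paper addresses two gaps in sentence-level AI-generated text detection (S-AGTD) for hybrid documents: existing methods classify each sentence in isolation and ignore inter-sentence dependencies, and existing benchmarks omit the newest generation of LLM generators. The authors construct MOSAIC, a benchmark of 16,000 hybrid documents over PubMed and XSum generated by DeepSeek-V3.2 and Kimi K2, with quality controls that include a perplexity-consistency filter absent from prior benchmarks. They recast S-AGTD as structured prediction over a document's sentence sequence and propose SenFlow, which combines graph-based inter-sentence propagation with linear-chain conditional random field (CRF) decoding in a single document-level pass. SenFlow reaches state-of-the-art performance on MOSAIC, with a +4.15 percentage-point average Macro-F1 margin on cross-domain transfer. The study also finds that even after the perplexity filter equalizes overt cues, AI insertions retain a generator-dependent sentence-length gap that sentence-level detectors still exploit.

Link: https://arxiv.org/abs/2606.18946
Authors: Jingkun Luo, Yifan Sun, Da-Tian Peng, Guanxiong Pei
Affiliations: Northwestern Polytechnical University; Zhejiang Lab
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 4 figures, 9 tables

Click to view abstract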

Abstract:Sentence-level AI-generated text detection (S-AGTD) for hybrid documents, where humans and LLMs co-author one text, faces two gaps: existing methods classify each sentence in isolation, discarding inter-sentence dependencies, and existing benchmarks omit the newest generation of generators. We construct MOSAIC, a benchmark of 16,000 hybrid documents over PubMed and XSum, generated by DeepSeek-V3.2 and Kimi K2 under stringent quality controls including a perplexity-consistency filter absent from prior benchmarks. We recast S-AGTD as structured prediction over the document sentence sequence and instantiate it as SenFlow, integrating graph-based inter-sentence propagation with linear-chain CRF decoding in a single document-level pass over a sentence graph. SenFlow reaches state-of-the-art performance on MOSAIC, with a +4.15 pp average Macro-F1 margin on cross-domain transfer, the hardest of three protocols of increasing difficulty. We further find that even after the perplexity filter equalizes overt cues, AI insertions retain a generator-dependent sentence-length gap that sentence-level detectors still exploit. Code and data: this https URL
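
Linear-chain CRF decoding, which the abstract names as the final step, is commonly done with the Viterbi algorithm. The sketch below is a generic Viterbi decoder over per-sentence scores, not SenFlow itself; the score values and the label convention (0 = human, 1 = AI) are made up for illustration, and the graph propagation stage is not modelled.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence under a linear-chain model.

    emissions[t][y]   : score of label y for sentence t
    transitions[p][y] : score of moving from label p to label y
    """
    n_labels = len(emissions[0])
    best = list(emissions[0])
    backpointers = []
    for scores in emissions[1:]:
        step, new_best = [], []
        for y in range(n_labels):
            prev = max(range(n_labels), key=lambda p: best[p] + transitions[p][y])
            step.append(prev)
            new_best.append(best[prev] + transitions[prev][y] + scores[y])
        backpointers.append(step)
        best = new_best
    label = max(range(n_labels), key=lambda y: best[y])
    path = [label]
    for step in reversed(backpointers):
        label = step[label]
        path.append(label)
    return path[::-1]

emissions = [[2.0, 0.0], [0.4, 0.6], [2.0, 0.0]]   # the middle sentence is ambiguous
transitions = [[1.0, -1.0], [-1.0, 1.0]]           # label switches are penalized
independent = [max(range(2), key=lambda y: row[y]) for row in emissions]
joint = viterbi(emissions, transitions)
```

Here sentence-by-sentence argmax labels the middle sentence as AI, while joint decoding keeps it human because its neighbours are confidently human, which is the kind of inter-sentence dependency that isolated classification discards.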

[NLP-19] Graph-ESBMC-PLC: Formal Verification of Graphical PLCopen XML Ladder Diagram Programs Using SMT-Based Model Checking

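[Quick read]: This paper addresses the lack of support in ESBMC-PLC for the graphical encoding of IEC 61131-3 Ladder Diagram (LD) programs. PLCopen XML defines a graphical LD encoding that represents rung logic as a directed graph of localId/refLocalId connections, but the original tool supported only the textual format and parsed graphical exports from CONTROLLINO, Beremiz, and OpenPLC Editor into an empty GOTO intermediate representation (IR), which caused vacuous verification success. Graph-ESBMC-PLC closes this gap with a graphical LD resolver based on depth-first search (DFS): it traverses the connection graph from leftPowerRail to each coil, extracts rung paths as Boolean conjunctions of contacts, and applies a three-tier I/O inference scheme. Ordering coils by rightPowerRail connectionPointIn sequence ensures that SET coils are processed before RESET coils, matching IEC scan-cycle semantics. The conversion leaves the ESBMC backend unchanged. On 3 graphical LD programs from CONTROLLINO/OpenPLC Editor, all produce a full GOTO IR with nondeterministic inputs and rung logic, and all verify SAFE at k=2 in under 70 ms; the 11 textual LD benchmarks show no regression. Two Beremiz examples with no LD content or with unsupported timer semantics are reported as limitations. The artifact is available on Zenodo.

Link: https://arxiv.org/abs/2606.18941
Authors: Pierre Dantas, Lucas Cordeiro, Waldir Junior
Affiliations: The University of Manchester; Federal University of Amazonas (UFAM)
Categories: Programming Languages (cs.PL); Computation and Language (cs.CL)
Comments: 18 pages

Click to view abstract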

Abstract:PLCopen XML defines two encoding formats for IEC 61131-3 Ladder Diagram programs: a textual encoding using rung elements, and a graphical encoding that represents rung logic as a directed graph of localId/refLocalId connections. ESBMC-PLC supported the textual format but parsed graphical exports from CONTROLLINO, Beremiz, and OpenPLC Editor into an empty GOTO intermediate representation, causing vacuous verification success. This paper presents Graph-ESBMC-PLC, which closes this gap with a DFS-based graphical LD resolver. The resolver traverses the connection graph from leftPowerRail to each coil, extracts rung paths as Boolean contact conjunctions, and applies a three-tier I/O inference scheme. Ordering coils by rightPowerRail connectionPointIn sequence ensures SET coils process before RESET coils, matching IEC scan-cycle semantics. The graphical-to-IR conversion leaves the ESBMC backend unchanged. Validation on 3 graphical LD programs from CONTROLLINO/OpenPLC Editor shows all produce full GOTO IR with nondeterministic inputs and rung logic, versus the empty IR previously. All 3 verify SAFE at k=2 under 70ms. The 11 textual LD benchmarks are fully preserved, with no regression. Two Beremiz examples with no LD content or unsupported timer semantics are reported as discovered limitations. Artifact at Zenodo (DantasCordeiro2026graphical, doi:https://doi.org/10.5281/zenodo.20699856).
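
The DFS traversal described in the abstract can be sketched on a toy graph. This is an assumption-heavy illustration rather than the tool's resolver: the graph is given as a forward adjacency dictionary (real PLCopen XML links elements backwards through refLocalId), element names such as `start` and `motor` are hypothetical, the graph is assumed acyclic, and the I/O inference and coil ordering steps are left out.

```python
def rung_paths(graph, kinds, start="leftPowerRail"):
    """Depth-first search from the power rail; collect (coil, contacts on the path)."""
    results = []

    def dfs(node, contacts):
        if kinds.get(node) == "coil":
            results.append((node, tuple(contacts)))
            return
        if kinds.get(node) == "contact":
            contacts = contacts + [node]
        for nxt in graph.get(node, []):
            dfs(nxt, contacts)

    dfs(start, [])
    return results

def coil_expression(paths, coil):
    """OR together the AND-conjunctions of every path that reaches the coil."""
    terms = [" AND ".join(c) for target, c in paths if target == coil]
    return " OR ".join(f"({t})" for t in terms)

# A start/hold latch feeding one coil through a shared stop contact.
graph = {
    "leftPowerRail": ["start", "hold"],
    "start": ["stop"],
    "hold": ["stop"],
    "stop": ["motor"],
}
kinds = {"start": "contact", "hold": "contact", "stop": "contact", "motor": "coil"}
paths = rung_paths(graph, kinds)
expression = coil_expression(paths, "motor")
```

Each root-to-coil path becomes one conjunction, and parallel branches become alternatives, which matches the usual reading of a ladder rung.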

[NLP-20] As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

[Quick read]: This paper examines how well large language models (LLMs) interpret text that combines negation with figurative language. Both phenomena are widespread in written and spoken language, and LLMs are often used in everyday contexts where they cannot be tuned for a specific dataset, so accurate interpretation of such text matters. The authors add a new set of annotations to an existing figurative-language dataset and test a range of language models on it. They find that the combination of negation and figurativeness can be particularly challenging, and that performance, both overall and across negation types, depends strongly on the prompt style used.

Link: https://arxiv.org/abs/2606.18922
Authors: Jasmine Owers, Edwin Simpson, Martha Lewis
Affiliations: University of Bristol; Intelligent Systems Lab; University of Amsterdam; ILLC
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 16 figures; for associated code and data see this https URL. To be published in Transactions of the Association for Computational Linguistics

Click to view abstract

Abstract:Figurative language and negation are two areas that challenge current language models; however, both are widely used throughout written and spoken language. Large language models (LLMs) are also widely used in everyday contexts where they cannot necessarily be tuned for a specific dataset. It is therefore essential to understand the ability of LLMs to correctly interpret text that includes both negation and figurative language. To investigate this, we develop a set of new annotations to an existing dataset of figurative language, and test a range of language models on the dataset. We find that the combination of negation and figurativeness can present a particular challenge, and that performance overall and across different negation types is particularly dependent on the prompt style used.

[NLP-21] REVES: REvision and VErification–Augmented Training for Test-Time Scaling

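[Quick read]: This paper addresses the mismatch between standard post-training, which mainly optimizes single-shot objectives, and test-time scaling via sequential revision, which relies on multi-step inference. Recent work treats the problem as multi-turn reinforcement learning (RL) but optimizes over multi-step trajectories directly, and so does not exploit the high-quality near-miss answers in intermediate steps that the model could learn to correct. The authors propose a two-stage iterative framework that alternates between online data/prompt augmentation and policy optimization. Its key idea is to convert the intermediate steps of successful recovery trajectories into decoupled revision and verification prompts, concentrating training on both effective answer transformation and error identification. This enables efficient off-policy data generation and reduces the computational overhead of long-horizon sampling. On LiveCodeBench, using only publicly available test cases as feedback, the method gains +6.5 points over the RL baseline and +4.0 points over standard multi-turn training. On circle packing it matches the previously reported state-of-the-art result with the smallest base model (4B) and far fewer rollouts than much larger evolutionary search systems. Math results under ground-truth verification confirm improved correction ability, and the method generalizes to out-of-distribution constraint-satisfaction puzzles such as n_queens and mini_sudoku.

Link: https://arxiv.org/abs/2606.18910
Authors: Yuanxin Liu, Ruida Zhou, Xinyan Zhao, Amr Sharaf, Hongzhou Lin, Arijit Biswas, Mohammad Ghavamzadeh, Zhaoran Wang, Mingyi Hong
Affiliations: Amazon AGI; Northwestern University; Qualcomm AI Research; University of Minnesota
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Click to view abstract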

Abstract:Test-time scaling via sequential revision has emerged as a powerful paradigm for enhancing Large Language Model (LLM) reasoning. However, standard post-training methods primarily optimize single-shot objectives, creating a fundamental misalignment with multi-step inference dynamics. While recent work treats this as multi-turn reinforcement learning (RL), conventional approaches optimize over the multi-step trajectories directly, failing to further exploit the high-quality mistakes in intermediate steps that the model can learn from by correcting them. We propose a two-stage iterative framework that alternates between online data/prompt augmentation and policy optimization. By converting the intermediate steps ("near-miss" answers) in the successful recovery trajectories into decoupled revision and verification prompts, our approach concentrates training on both effective answer transformation and error identification. This approach enables efficient off-policy data generation and reduces the computational overhead of long-horizon sampling compared to standard multi-turn RL. On LiveCodeBench, using publicly available test cases as feedback, we observe gains of +6.5 points over the RL baseline and +4.0 points over standard multi-turn training. Beyond coding, our approach matches the previously reported SOTA result on circle packing while using the smallest base model (4B) and far fewer rollouts than the much larger evolutionary search systems. Math results under ground-truth verification further confirm improved correction ability. It also generalizes to out-of-distribution constraint-satisfaction puzzles such as n_queens and mini_sudoku, where correctness is defined entirely by problem constraints. Code is available at this https URL.
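
The conversion of near-miss steps into decoupled prompts might look like the sketch below. The prompt wording, the dictionary layout, and the `target` field are all hypothetical; in an RL setting the `target` would more plausibly be replaced by a reward check, and the paper's actual templates are not given in the abstract.

```python
def build_training_prompts(problem, trajectory):
    """Turn one multi-turn trajectory into decoupled single-turn examples.

    trajectory: list of (answer, is_correct) pairs in the order they were produced.
    Only trajectories that end in a correct answer (successful recoveries) are used.
    """
    if not trajectory or not trajectory[-1][1]:
        return []
    final_answer = trajectory[-1][0]
    examples = []
    for answer, is_correct in trajectory:
        # Verification example: judge a single candidate in isolation.
        examples.append({
            "task": "verify",
            "prompt": f"Problem: {problem}\nCandidate: {answer}\nIs the candidate correct?",
            "target": "yes" if is_correct else "no",
        })
        if not is_correct:
            # Revision example: start from the near-miss instead of replaying the whole history.
            examples.append({
                "task": "revise",
                "prompt": f"Problem: {problem}\nPrevious attempt: {answer}\nRevise the attempt.",
                "target": final_answer,
            })
    return examples

examples = build_training_prompts("6 * 7 = ?", [("41", False), ("42", True)])
```

Each example is a single turn, so it can be sampled without regenerating the long trajectory it came from, which is where the abstract's saving on long-horizon sampling would come from.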

[NLP-22] SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

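[Quick read]: This paper studies how context engineering can improve AI systems without parameter updates. Recent work showing that textual gradients do not function as real gradients motivates treating automatic prompt optimization (APO) as black-box search. The authors introduce SPO (Stochastic Prompt Optimization), a framework for stochastic search over prompt space, and compare three strategies of increasing sophistication: error-informed random search, a genetic algorithm with evolutionary operators, and SAGE (SPO via Agent-Guided Exploration), a multi-agent pipeline with diagnostic code execution. Across three benchmarks no single strategy dominates; effectiveness depends on how the structure of the search landscape interacts with the error type. SAGE is also deployed on a mental-health chatbot under a continuous optimization paradigm, where it compounds eight cycles of individually noisy A/B tests into a statistically robust gain in next-day retention. The authors argue that coupling qualitative diagnosis with quantitative validation is what makes agentic optimization effective for open-ended task-oriented dialogue.

Link: https://arxiv.org/abs/2606.18902
Authors: Ziyi Zhu, Luka Smyth, Saki Shinoda, Jinghong Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract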

Abstract:Context engineering has emerged as a primary lever for improving AI systems without parameter updates. Recent work showing that textual gradients do not function as real gradients motivates treating automatic prompt optimization (APO) as black-box search. We introduce SPO (Stochastic Prompt Optimization), a framework for stochastic search over prompt space, and compare three strategies of increasing sophistication: error-informed random search, a genetic algorithm with evolutionary operators, and SAGE (SPO via Agent-Guided Exploration), a multi-agent pipeline with diagnostic code execution. Across three benchmarks, no single strategy dominates; effectiveness depends on the interaction of landscape structure with error type. We further deploy SAGE on a mental-health chatbot under a continuous optimization paradigm, where it compounds eight cycles of individually-noisy A/B tests into a statistically robust gain in next-day retention. We argue that coupling qualitative diagnosis with quantitative validation is what makes agentic optimization effective for open-ended task-oriented dialogue.
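
Treating prompt optimization as black-box search can be illustrated with the simplest possible loop. This sketch is plain greedy random search, not the error-informed variant or SAGE from the paper; `toy_mutate`, `toy_score`, and `HINTS` are hypothetical stand-ins for an LLM-based rewriter and a benchmark evaluation.

```python
import random

def random_search(seed_prompt, mutate, score, steps=20, rng=None):
    """Black-box search: propose a mutated prompt, keep it only if the score improves."""
    rng = rng or random.Random(0)
    best, best_score = seed_prompt, score(seed_prompt)
    for _ in range(steps):
        candidate = mutate(best, rng)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score

# Toy stand-ins (hypothetical): a real system would call an LLM and a benchmark.
HINTS = ["Think step by step.", "Answer with one word.", "Cite the passage."]

def toy_mutate(prompt, rng):
    """Append one randomly chosen hint to the prompt."""
    return prompt + " " + rng.choice(HINTS)

def toy_score(prompt):
    """Reward each distinct hint once and penalize length, so longer is not always better."""
    return sum(h in prompt for h in HINTS) - 0.01 * len(prompt)

seed = "Classify the review."
best_prompt, best_score = random_search(seed, toy_mutate, toy_score)
```

Only the score is consulted, never a gradient, which is the sense in which the search is black-box; the accept-if-better rule means the result can never score below the seed prompt.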

[NLP-23] Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction

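[Quick read]: This paper addresses unreliable pair confidence in multimodal emotion-cause pair extraction (MECPE). Existing pair scorers commonly use pair-level cross entropy over valid candidates, which treats links mostly independently. This leaves the relative confidence geometry among competing causes under-constrained, so gold pairs can stay close to hard negatives or rely on incidental non-gold context; the authors call this vulnerability pair-confidence brittleness. They propose RPCL (Robust Pair Confidence Learning), a training-only framework that makes pair confidence both discriminative and stable through two mechanisms. First, a confidence-difference margin constraint separates gold pairs from row-wise hard negatives. Second, clean pair predictions are aligned with predictions from a corrupted view in which non-gold contextual utterance representations are partially corrupted. The original clean pair scorer and decoding pipeline are used unchanged at inference time. On ECF, MECAD, and MEC4, RPCL improves the three-seed mean Pair F1 over a matched base model by 2.58 to 2.83 percentage points in the full text-audio-video setting and improves mean Pair AUPRC on all three datasets. Diagnostic analysis shows larger gold-negative confidence gaps and lower margin-violation severity, suggesting that explicitly shaping pair confidence is an effective training strategy for MECPE.

Link: https://arxiv.org/abs/2606.18893
Authors: Zhuangzhuang Pan, Ning Dong, Yingna Su, Yan Xia
Affiliations: Suqian University; Suzhou University of Technology
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 3 figures, 5 tables

Click to view abstract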

Abstract:Multimodal emotion-cause pair extraction (MECPE) requires reliable pair confidence over candidate pairs. Existing pair scorers commonly use pair-level cross entropy over valid candidates, which treats links mostly independently. This leaves the relative confidence geometry among competing causes under-constrained, allowing gold pairs to stay close to hard negatives or rely on incidental non-gold context. We study this vulnerability as pair-confidence brittleness and propose RPCL (Robust Pair Confidence Learning), a training-only framework for pair-confidence learning. RPCL encourages pair confidence to be both discriminative and stable: gold pairs are separated from row-wise hard negatives through a confidence-difference margin constraint, and clean pair predictions are aligned with predictions from a corrupted view where non-gold contextual utterance representations are partially corrupted. The original clean pair scorer and decoding pipeline are used unchanged at inference time. On ECF, MECAD, and MEC4, RPCL improves the three-seed mean Pair F1 over a matched base model by 2.58 to 2.83 percentage points in the full text-audio-video setting, and improves mean Pair AUPRC on all three datasets. Diagnostic analysis further shows larger gold-negative confidence gaps and lower margin-violation severity. These results suggest that explicitly shaping pair confidence is an effective training strategy for MECPE.
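
A confidence-difference margin against the hardest negative in a row can be written as a hinge penalty. The sketch below is one plausible reading, not the RPCL loss: the margin value, the choice of the single hardest negative, and the function name are assumptions, and the corrupted-view consistency term is not included.

```python
def margin_violation(confidences, gold_indices, margin=0.2):
    """Hinge penalty for one row of candidate causes.

    confidences : confidence for each candidate cause of one emotion utterance
    gold_indices: set of positions of the gold causes in that row
    Returns 0.0 when every gold pair beats the hardest negative by at least `margin`.
    """
    negatives = [c for i, c in enumerate(confidences) if i not in gold_indices]
    if not negatives or not gold_indices:
        return 0.0
    hardest = max(negatives)
    return sum(max(0.0, margin - (confidences[i] - hardest)) for i in gold_indices)

well_separated = margin_violation([0.9, 0.3, 0.1], {0})
too_close = margin_violation([0.55, 0.5, 0.1], {0})
```

In the second row the gold pair is still ranked first, so a ranking-only view would see no problem, yet the penalty is positive because the gap to the hard negative is smaller than the margin.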

[NLP-24] Improving Medical Communication using Rubric-Guided Counterfactual Recommendations

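[Quick read]: This paper addresses the fact that lightweight patient feedback in text-based telemedicine mainly reflects perceived communication quality rather than medical accuracy. It introduces a language-model-guided counterfactual recommendation pipeline that discovers and refines interpretable communication features, such as tone, personalization, actionability, and completeness in addressing patient concerns, without interfering with the medical content. These features are combined with patient-doctor interaction metadata in a predictive model that estimates the probability of positive feedback. At inference time the system searches over low-cost ordinal feature changes and recommends minimal communication changes predicted to raise that probability, while independent auditor models test whether the gains generalize beyond the selection model. Under independent auditors, recommendations yield a mean gain of +6.41% in predicted positive-feedback probability and are non-negative for 93.31% of recommendations. The results suggest that small, interpretable communication changes can capture most of the predicted gains while preserving the doctor's control over medical reasoning and final wording.

Link: https://arxiv.org/abs/2606.18889
Authors: Adrian Cosma, Nicoleta-Nina Basoc, Andrei Niculae, Cosmin Dumitrache, Emilian Radoi
Affiliations: IDSIA, Dalle Molle Institute for Artificial Intelligence; National University of Science and Technology POLITEHNICA Bucharest
Categories: Computation and Language (cs.CL)
Comments: 4 Tables, 8 Figures

Click to view abstract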

Abstract:Text-based telemedicine increasingly relies on lightweight patient feedback; however, such feedback primarily reflects perceived communication quality rather than medical accuracy. We introduce an LM-guided counterfactual recommendation pipeline that discovers and refines interpretable communication features such as tone, personalization, actionability and completeness in addressing patient concerns, without interfering with the medical content. These features are used together with patient-doctor interaction metadata to estimate positive feedback. At inference time, the system searches over low-cost ordinal feature changes and recommends minimal communication changes predicted to increase the probability of positive feedback, while independent auditor models test whether these gains generalize beyond the selection model. Across interactions, recommendations yield a mean +6.41% gain in predicted positive feedback probability under independent auditors, and are non-negative for 93.31% of recommendations. These results suggest that small, interpretable communication changes can capture most predicted gains while preserving the doctor’s control over medical reasoning and final wording.
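
The inference-time search over ordinal feature changes can be sketched with a toy model. Everything numeric here is invented for illustration: the logistic form, the weights, the 1 to 3 level scale, and the restriction to a single one-level change are assumptions, and the independent auditor models are not represented.

```python
import math

def predict_positive(features, weights, bias):
    """Toy logistic model of the probability of positive feedback."""
    z = bias + sum(weights[name] * level for name, level in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def recommend(features, weights, bias, max_level=3):
    """Try raising each ordinal feature by one level; return the best single change."""
    base = predict_positive(features, weights, bias)
    best_name, best_gain = None, 0.0
    for name, level in features.items():
        if level >= max_level:
            continue  # already at the top of the scale
        changed = dict(features, **{name: level + 1})
        gain = predict_positive(changed, weights, bias) - base
        if gain > best_gain:
            best_name, best_gain = name, gain
    return best_name, best_gain

features = {"tone": 1, "personalization": 3, "actionability": 2}
weights = {"tone": 0.8, "personalization": 1.0, "actionability": 0.3}
name, gain = recommend(features, weights, bias=-3.0)
```

Only communication features are searched, so the medical content of the message is never part of the change set, which mirrors the separation the abstract describes.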

[NLP-25] Efficient Financial Language Understanding via Distillation with Synthetic Data

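[Quick read]: This paper addresses the cost of deploying large instruction-following models for financial sentiment analysis, where labelled data are limited by confidentiality and expert annotation cost. It presents a distillation framework that transfers knowledge from a large instruction-tuned teacher to compact student models through synthetic data, designed for low-resource conditions in which only a small set of real examples is collected and labelled by hand. The framework clusters these examples and uses the clusters to select seeds for generating synthetic examples via structured few-shot prompting. Experiments show that clustering-based seed selection yields more representative synthetic data than random sampling, enabling compact models to achieve strong performance with minimal supervision. On a more complex and noisy text domain, the compact model trained on the complete synthetic-seed corpus even outperforms the teacher model while remaining competitive on formal text, offering a practical route toward resource-efficient domain adaptation in financial NLP.

Link: https://arxiv.org/abs/2606.18875
Authors: Wen-Fong (Xavier) Huang, Edwin Simpson
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract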

Abstract:Large instruction-following models are powerful but costly to deploy, particularly in finance, where labelled data are limited by confidentiality and expert annotation cost. We present an efficient framework for financial sentiment analysis through distillation with synthetic data, transferring knowledge from a large instruction-tuned teacher to compact student models. The framework is designed for low-resource conditions, where a small set of real examples is collected and labelled by hand. The framework then clusters the examples and uses the clusters to select seeds for generating synthetic examples via structured few-shot prompting. Experiments show that clustering-based seed selection yields more representative synthetic data than random sampling, enabling compact models to achieve strong performance with minimal supervision. Notably, on a more complex and noisy text domain, the compact model trained on the complete synthetic-seed corpus even outperforms the teacher model, while remaining competitive on formal text. The framework provides a practical route toward resource-efficient domain adaptation in financial NLP with minimal human labelling effort.
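
One simple way to pick seeds from clusters is to take the example nearest each cluster centroid. The abstract does not state the selection rule, so the sketch below is an assumption: the vectors stand in for sentence embeddings, the cluster assignments are taken as given, and the nearest-to-centroid criterion is only one of several reasonable choices.

```python
from collections import defaultdict

def select_seeds(vectors, cluster_ids):
    """Pick, for each cluster, the index of the example closest to the cluster centroid."""
    members = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        members[cluster].append(index)
    seeds = {}
    for cluster, indices in members.items():
        dim = len(vectors[indices[0]])
        centroid = [sum(vectors[i][d] for i in indices) / len(indices) for d in range(dim)]

        def squared_distance(i):
            return sum((vectors[i][d] - centroid[d]) ** 2 for d in range(dim))

        # Ties resolve to the earliest index because min() keeps the first minimum.
        seeds[cluster] = min(indices, key=squared_distance)
    return seeds

vectors = [[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [10.0, 10.0], [11.0, 10.0]]
cluster_ids = [0, 0, 0, 1, 1]
seeds = select_seeds(vectors, cluster_ids)
```

The selected examples would then serve as the few-shot demonstrations in the generation prompt, so each cluster contributes at least one representative instead of leaving coverage to random sampling.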

[NLP-26] Approximate Structured Diffusion for Sequence Labelling

[Quick read]: This paper addresses the limited expressivity of linear-chain conditional random fields (CRFs) for sequence labelling: CRFs assume a finite decision span (e.g., label bigrams), which can hurt performance when long-range dependencies matter. The authors use diffusion to train a CRF conditioned on an entire label sequence, with the caveat that the conditioning is on a noisy version of the labels. Combined with approximate CRF inference, the method improves label accuracy, with a 16.5% error reduction for POS tagging.

Link: https://arxiv.org/abs/2606.18856
Authors: Nicolas Floquet, Joseph Le Roux, Nadi Tomeh
Affiliations: Université Sorbonne Paris Nord, CNRS, Laboratoire d’Informatique de Paris Nord, LIPN
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Sequence labelling, a core task of Natural Language Processing (NLP), consists in assigning each token of an input sentence a label. From a Machine Learning point of view, sequence labelling is often cast as a Linear-Chain Conditional Random Field (CRF) parametrised by a neural network. While this approach gives good empirical results, CRFs assume a finite decision span (eg label bigrams) which can limit their expressivity and hurt performance when long-range dependencies are required. We show we can leverage diffusion to train a CRF conditioned on an entire label sequence, with the caveat that the condition is on a noisy version of labels. We show experimentally that this method, in conjunction with approximate CRF inference, improves label accuracy with a 16.5% error reduction for POS-tagging.

[NLP-27] Aligning Implied Statements for Implicit Hate Speech Generalizability with Context-Bounded Semi-hard Negative Mining

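[Quick read]: This paper addresses the classification of implicit hate speech, where intent is often masked through insinuation and context rather than explicit slurs. Prior supervised contrastive approaches improve in-domain detection but can overfit surface cues and struggle to transfer across datasets. The proposed ImpSH is a triplet-based framework that aligns posts with implied statements when available and uses context-bounded semi-hard negatives to focus learning on near confusions. The authors also examine AugSH, which forms positives via data augmentation. In controlled evaluations on IHC, SBIC, and DynaHate with BERT and HateBERT, ImpSH is a viable alternative to standard supervised contrastive baselines and often improves cross-domain performance under matched preprocessing and tuning budgets. Representation analysis using alignment and uniformity indicates tighter positive pairs with balanced global spread, and qualitative nearest-neighbor case studies illustrate typical false negatives under domain shift. The authors conclude that aligning posts with their implied statements via context-bounded mining gives a more stable, bijective-like mapping to related insinuations than traditional clustering-based representation learning.

Link: https://arxiv.org/abs/2606.18852
Authors: Wicaksono Leksono Muhamad, Yunita Sari
Affiliations: Mantera Studio; Universitas Gadjah Mada
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract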

Abstract:Classifying implicit hate speech remains a challenge, as intent is often masked through insinuation and context rather than explicit slurs. Prior supervised contrastive approaches improve in-domain detection but can overfit surface cues and struggle to transfer across datasets. We propose ImpSH, a triplet-based framework that aligns posts with implied statements when available and uses context-bounded semi-hard negatives to focus learning on near confusions. We also examine AugSH, which forms positives via data augmentation. In controlled evaluations on IHC, SBIC, and DynaHate with BERT and HateBERT, ImpSH is a viable alternative to standard supervised contrastive baselines and often improves cross-domain performance under matched preprocessing and tuning budgets. Representation analysis using alignment and uniformity indicates tighter positive pairs with balanced global spread, and qualitative nearest-neighbor case studies illustrate typical false negatives under domain shift. These results demonstrate that aligning posts with their implied statements via context-bounded mining provides a more stable, bijective-like mapping to related insinuations, overcoming the volatility inherent in traditional clustering-based representation learning.
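
Semi-hard negative mining has a standard definition in triplet learning: a negative is semi-hard when it is farther from the anchor than the positive but still within the margin. The sketch below shows that general rule together with the usual triplet hinge; the "context-bounded" restriction from the paper is not modelled, and the distances and margin are illustrative values.

```python
def semi_hard_negatives(d_anchor_positive, d_anchor_negatives, margin=0.5):
    """Indices of negatives farther than the positive but still inside the margin."""
    return [
        i for i, d in enumerate(d_anchor_negatives)
        if d_anchor_positive < d < d_anchor_positive + margin
    ]

def triplet_loss(d_anchor_positive, d_anchor_negative, margin=0.5):
    """Standard triplet hinge: zero once the negative is `margin` farther than the positive."""
    return max(0.0, d_anchor_positive - d_anchor_negative + margin)

# Distance from a post (anchor) to its implied statement (positive) and to four negatives.
d_pos = 1.0
d_negs = [0.8, 1.2, 1.6, 1.4]
chosen = semi_hard_negatives(d_pos, d_negs)
```

The negative at distance 0.8 is closer than the positive (a hard negative) and the one at 1.6 already satisfies the margin (an easy negative), so only the two in between are kept; these still produce a non-zero loss without dominating it.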

[NLP-28] Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

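[Quick read]: This paper addresses long-context reasoning in large language models, a capability that matters especially when models act as autonomous agents reasoning over lengthy trajectories. Existing reinforcement learning (RL) work focuses largely on reward engineering, while diverse training data remains scarce. Taking a data-centric perspective, the authors show that a simple data recipe, paired with a minimal outcome-based GRPO setup, is enough to substantially improve long-context reasoning. The recipe targets three complementary task families (retrieval, multi-evidence synthesis, and reasoning), for which the authors construct and curate eight datasets totaling about 14K examples. On three models (Qwen3-4B/8B/30B-A3B) it yields average gains of +7.2/+3.2/+6.4 points across seven long-context benchmarks, surpassing prior RL training sets. The gains also transfer to agentic tasks: continuing RL training on an agent-tuned model with the recipe improves GAIA by +4.8 points and BrowseComp by +7.0 points.

Link: https://arxiv.org/abs/2606.18831
Authors: Xiaoyue Xu, Sikui Zhang, Xiaorong Wang, Xu Han, Chaojun Xiao
Affiliations: OpenBMB; Tsinghua University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 6 figures, 12 tables

Click to view abstract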

Abstract:Long-context reasoning is an essential capability for large language models, particularly when they are deployed as autonomous agents that must reason over lengthy trajectories. Reinforcement learning (RL) has recently emerged as a dominant paradigm for improving this ability, yet existing work largely focuses on reward engineering while diverse training data remains scarce. We revisit this problem from a data-centric perspective and show that a simple yet effective data recipe alone, paired with a minimal outcome-based GRPO setup, suffices to substantially improve long-context reasoning. Our recipe targets three complementary task families – retrieval, multi-evidence synthesis, and reasoning – for which we construct and curate eight datasets totaling ~14K examples. Experiments on three models (Qwen3-4B/8B/30B-A3B) yield average gains of +7.2/+3.2/+6.4 points across seven long-context benchmarks, surpassing prior RL training sets. We further demonstrate that these gains transfer to agentic tasks, where continuing RL training on an agent-tuned model with our data recipe improves GAIA by +4.8 and BrowseComp by +7.0 points. We will release our datasets to facilitate future research.
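
In GRPO (Group Relative Policy Optimization), the advantage of each sampled response is usually computed relative to the other responses drawn for the same prompt. The sketch below shows that group normalization with binary outcome rewards; it is the commonly described form, and whether this paper uses exactly this normalization or this `eps` value is an assumption.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four responses to one long-context question, scored 1.0 if the final answer is correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
uninformative = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

When every response in a group receives the same reward, all advantages are zero and the prompt contributes no learning signal, which is one reason the mix and difficulty of training data matter in an outcome-only setup.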

[NLP-29] GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

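[Quick read]: This paper addresses memory governance for LLM agents whose memory is shared among multiple principals, as in assistants for hospitals, workplaces, campuses, and households. Existing memory benchmarks largely assume single-user settings, whereas in these deployments multiple principals write to a common memory pool and query it under different roles, scopes, and relationships, so memory quality requires governance as well as recall. The authors introduce GateMem, a benchmark that jointly evaluates utility for legitimate long-horizon requests with state updates, access control across contextual authorization boundaries, and agent-facing active forgetting after explicit deletion requests. It spans medical, office, education, and household domains, with long-form multi-party episodes, incremental memory injection, hidden checkpoints, structured judging, and leak-target annotations. Across diverse baselines and backbone models, no method simultaneously achieves strong utility, robust access control, and reliable forgetting. Long-context prompting often yields the best governance score at high token cost, while retrieval-based and external-memory methods reduce cost yet still leak unauthorized or deleted information. The results show that current memory agents remain far from reliable shared institutional deployment.

Link: https://arxiv.org/abs/2606.18829
Authors: Zhe Ren, Yibo Yang, Yimeng Chen, Zijun Zhao, Benshuo Fu, Zhihao Shu, Bingjie Zhang, Yangyang Xu, Dandan Guo, Shuicheng Yan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 24 pages, 8 figures. Code and dataset are available at this https URL and this https URL

Click to view abstract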

Abstract:Memory benchmarks for LLM agents largely assume single-user settings, leaving shared assistants for hospitals, workplaces, campuses, and households understudied. In these deployments, multiple principals write to a common memory pool and query it under different roles, scopes, and relationships, so memory quality requires governance as well as recall. We introduce GateMem, a benchmark for multi-principal shared-memory agents. GateMem jointly evaluates utility for legitimate long-horizon requests with state updates, access control across contextual authorization boundaries, and agent-facing active forgetting after explicit deletion requests. It spans medical, office, education, and household domains, with long-form multi-party episodes, incremental memory injection, hidden checkpoints, structured judging, and leak-target annotations. Across diverse baselines and backbone models, no method simultaneously achieves strong utility, robust access control, and reliable forgetting. Long-context prompting often yields the best governance score at high token cost, while retrieval-based and external-memory methods reduce cost yet still leak unauthorized or deleted information. These results show current memory agents remain far from reliable shared institutional deployment.
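
The three properties the benchmark evaluates (utility, access control, forgetting) can be made concrete with a toy memory pool. This class is purely illustrative and is not part of GateMem: the reader-set authorization model, the author-only deletion rule, and all names are assumptions, and a real agent memory would also have to prevent leaks through the model's own context.

```python
class SharedMemory:
    """Toy shared memory pool with per-entry read scopes and explicit deletion."""

    def __init__(self):
        self._entries = {}
        self._next_id = 0

    def write(self, author, text, readers):
        """Store an entry and record which principals may read it."""
        entry_id = self._next_id
        self._next_id += 1
        self._entries[entry_id] = {"author": author, "text": text, "readers": set(readers)}
        return entry_id

    def query(self, principal, keyword):
        """Return only matching entries that `principal` is authorized to read."""
        return [
            e["text"] for e in self._entries.values()
            if principal in e["readers"] and keyword.lower() in e["text"].lower()
        ]

    def forget(self, principal, entry_id):
        """Delete an entry on request; only its author may do so in this sketch."""
        entry = self._entries.get(entry_id)
        if entry is not None and entry["author"] == principal:
            del self._entries[entry_id]
            return True
        return False

memory = SharedMemory()
note_id = memory.write("dr_lee", "Patient A allergy: penicillin", {"dr_lee", "nurse_kim"})
```

A legitimate query by `nurse_kim` exercises utility, a query by an unlisted principal exercises access control, and a query after `forget` exercises deletion; the abstract's finding is that current agents rarely get all three right at once.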

[NLP-30] Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports

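[Quick read]: This paper addresses the difficulty of reliably evaluating the clinical accuracy of generated radiology reports. Existing metrics reduce report quality to a medically ungrounded scalar, and although large language models (LLMs) possess rich medical knowledge, they also struggle to draw a reliable boundary between clinically significant errors and harmless variation. Using the ReEvalMed benchmark as a testbed, the authors evaluate metric-level clinical significance along two dimensions: Discrimination (detecting true clinical errors) and Robustness (tolerating insignificant variations). Across 8 LLM evaluators under one-pass and two-pass settings they identify a widespread discrimination bias: models detect errors effectively but also over-penalize harmless rephrasings. To mitigate this, they synthesize 4k report pairs and train lightweight interpretable metrics on Qwen3-8B and MedGemma-4B. The trained metric sharpens the clinical significance boundary, surpassing 32B-scale medical LLMs and remaining competitive with proprietary models. The more costly two-pass setting does not consistently improve overall performance and mainly trades discrimination for robustness, so the authors suggest one-pass trained metrics for cost-sensitive deployment and reserve two-pass inference for settings where the balance between Discrimination and Robustness is critical. The dataset and metric will be released.

Link: https://arxiv.org/abs/2606.18797
Authors: Qingyu Lu, Ruochen Li, Liang Ding, Yufei Xia, Youxiang Zhu, Dacheng Tao
Affiliations: Nanyang Technological University; Technical University of Munich; Alibaba; University of Glasgow; University of Massachusetts Boston
Categories: Computation and Language (cs.CL)
Comments: Under Review

Click to view abstract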

Abstract:Reliable evaluation of generated radiology reports requires strict clinical accuracy, as omitted critical findings or mischaracterized radiographic observations can directly affect patient care. Existing metrics obscure this requirement by reducing report quality to a medically ungrounded scalar. Although Large Language Models (LLMs) possess rich medical knowledge, they likewise struggle to draw a reliable boundary between clinically significant errors and harmless variation. We study this boundary using the ReEvalMed benchmark as a testbed and evaluate metric-level clinical significance in terms of detecting true clinical errors ("Discrimination") and tolerating insignificant variations ("Robustness"). Across 8 LLM evaluators under one-pass and two-pass settings, we identify a widespread discrimination bias: models effectively detect errors but also over-penalize harmless rephrasings. To mitigate this, we synthesize 4k report pairs and train lightweight interpretable metrics on Qwen3-8B and MedGemma-4B. Our trained metric sharpens the clinical significance boundary, surpassing 32B-scale medical LLMs and remaining competitive with proprietary models. Crucially, the more costly two-pass setting fails to consistently improve overall performance and mainly trades discrimination for robustness. These findings suggest one-pass trained metrics as the practical choice for cost-sensitive deployment, with two-pass inference reserved for settings where the Discrimination-Robustness (D-R) balance is critical. We will release the dataset and metric.

[NLP-31] HandwritingAgent: Language-Driven Handwriting Synthesis in Scalable Vector Space

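Quick Read: This paper tackles the open challenge of getting machines to emulate natural, varied handwriting, which requires synthesizing stroke sequences whose shape, texture, pressure, and script vary both across individuals and within one person's writing. Prior deep learning methods, in online and offline settings alike, are often constrained by style-specific architectures, heavy reliance on large datasets, high compute costs, and a lack of flexible style control through natural language. The key to the solution is HandwritingAgent, a language-driven agent that synthesizes handwriting sequences directly in Scalable Vector Graphics (SVG) format without style-specific training. It uses a large reasoning model to analyze geometry and autoregressively generate target glyphs as stroke sequences on a discrete grid canvas, conditioned on text (in conversational or non-conversational mode) and a reference handwriting-style image. Experiments on imitation, recognition, multilingual synthesis, and complex handwritten math and science expressions indicate substantial improvement, with HandwritingAgent matching or surpassing state-of-the-art generative handwriting models while being more efficient, controllable, and generalizable.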

Link: https://arxiv.org/abs/2606.18788
Authors: Jaward Sesay, Yue Yu, Börje F. Karlsson
Affiliations: Beijing Institute of Technology; Beijing Academy of Artificial Intelligence
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Teaching machines to emulate natural handwriting styles remains an open challenge, as it requires synthesizing stroke sequences that dynamically vary in shape, texture, pressure and script - not only across individuals, but also within a single person’s handwriting. Attempts at this challenge have largely explored deep learning methods in both online and offline settings. However, these approaches are often constrained by style-specific architectural choices, heavy reliance on large datasets, high compute costs, and a lack of flexible control over writing styles through natural language. To this end, we introduce HandwritingAgent, a language-driven agent that can synthesize natural handwriting sequences directly in Scalable Vector Graphics (SVG) format with no need for style-specific training. The agent leverages a large reasoning model to geometrically analyse and autoregressively generate target handwritten glyphs as stroke sequences in a discrete grid canvas environment. Generation is conditioned on texts provided in either conversational or non-conversational mode, along with a reference handwriting-style image. Experiments on diverse handwriting tasks spanning imitation, recognition, multi-lingual handwriting synthesis, and generation of complex handwritten maths and science expressions indicate substantial improvement in performance, with HandwritingAgent matching or surpassing state-of-the-art generative handwriting models, while providing a more efficient, controllable, and generalizable synthesis method.

[NLP-32] RedactionBench

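Quick Read: This paper addresses the problem of accurately redacting personally identifiable information (PII) when large language models are applied to sensitive domains. Existing benchmarks conflate extraction mechanics with privacy semantics and do not capture that the same information may or may not be a violation depending on context: a public phone number is not equivalent to a phone number in a medical record, and whether redaction is needed depends on who holds the information, why, and in what setting. Grounded in contextual integrity, the authors introduce RedactionBench, a manually annotated benchmark of 200 diverse documents across 11 domains, mostly seeded from real-world sources, together with R-Score, a character-level metric that treats semantically similar redactions equally and nullifies shallow formatting choices such as differing masking styles. Evaluations of Named Entity Recognition models, entity-extraction small language models, and frontier models equipped with agentic tools show that contextual redaction remains unsolved. A human evaluation with over 80 users finds consensus with target labels on mandatory redactions (89.4%) and safe text preservation (94.1%), but only 47.7% agreement on contextual redactions, which highlights how subjective contextual privacy is and motivates R-Score's decoupling of contextual ambiguity from strict precision. The paper compares 35 models across families and releases RedactionBench as a baseline for future privacy-preserving systems.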

Link: https://arxiv.org/abs/2606.18782
Authors: Sean Brynjólfsson, Shashvat Jayakrishnan, Esha Sali, Diptanshu Purwar, Madhav Aggarwal
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models are increasingly applied to sensitive domains that require redaction of personally identifiable information (PII). While redacting PII is a data cleaning prerequisite, existing benchmarks conflate extraction mechanics with privacy semantics. A public phone number is not equivalent to a phone number in a medical record. Whether information constitutes a violation depends heavily on who holds it, why, and in what context, fundamentally differentiating redaction from simple entity recognition. Grounded in contextual integrity, we introduce RedactionBench, a manually annotated benchmark comprising 200 diverse documents across 11 domains, mostly seeded from real-world sources. We also introduce R-Score, a novel character-level metric that treats semantically similar redactions equally and nullifies shallow formatting choices, such as varying masking styles for phone numbers. Evaluations across Named Entity Recognition models, entity extraction Small Language Models, and frontier models equipped with agentic tools demonstrate that contextual redaction remains an unsolved problem. A human evaluation with over 80 users on RedactionBench reveals a stark dichotomy in privacy perceptions. Annotators show consensus with target labels for mandatory redactions (89.4 percent) and safe text preservations (94.1 percent), but fail to agree on contextual redactions (47.7 percent). This variance demonstrates the subjective nature of contextual privacy and motivates R-Score, which decouples contextual ambiguity from strict precision. We compare 35 models across families and report their performance in redacting PII. Finally, we release RedactionBench to establish a baseline for future privacy-preserving systems, hoping to inspire efficient model design and standardized evaluations.

[NLP-33] Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation

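Quick Read: This paper addresses a failure mode of dense retrieval on long documents that the authors call document-side early compression: when a long document is encoded into a single vector, a short but decisive span can be weakened during encoding, before ranking ever happens. They introduce the Evidence Dilution Index (EDI), which measures how far a document-level representation falls below the strongest chunk-level evidence within the same gold document. The key to the solution is DICE (Document Inference via Chunk Evidence), a training-free document-side strategy that splits a document into chunks, encodes each chunk independently with a frozen model, and aggregates the results back into a single vector, preserving the standard one-query-one-document interface. On LongEmbed, DICE improves retrieval across four backbones, with the largest gains on slices beyond 4k tokens (for Dream, Passkey 4k rises from 30.0 to 90.0 and Needle 4k from 23.3 to 74.0). Across 12,779 filtered samples, DICE yields lower EDI than the single-vector baseline in 92.8% of cases.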

Link: https://arxiv.org/abs/2606.18781
Authors: Shanshan Lyu, Yiwei Wang, Yujun Cai, Jiafeng Guo, Shenghua Liu
Affiliations: Chongqing University; State Key Laboratory of AI Safety; Institute of Computing Technology, Chinese Academy of Sciences; University of California, Merced; University of Queensland
Subjects: Computation and Language (cs.CL)
Comments: Code is available at this https URL

Abstract:Dense retrieval ranks one query vector against one document vector. On long documents, this interface can fail when a short but decisive span is weakened during document encoding before ranking. We study this failure mode as document-side early compression and introduce the Evidence Dilution Index (EDI) to measure how far a document-level representation falls below the strongest chunk-level evidence within the same gold document. Guided by this view, we propose DICE (Document Inference via Chunk Evidence), a training-free document-side strategy that splits documents into chunks, encodes them independently with a frozen model, and aggregates them back into a single vector while preserving the standard one-query-one-document interface. On LongEmbed, DICE improves retrieval across four backbones, with the largest gains on slices beyond 4k tokens: for Dream, Passkey 4k rises from 30.0 to 90.0 and Needle 4k from 23.3 to 74.0. Across 12,779 filtered samples, DICE yields lower EDI than the single-vector baseline in 92.8% of cases. These results establish document-level encoding as a practical and underexplored lever for long-document retrieval.
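
The chunk-encode-aggregate interface and the dilution measurement described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the bag-of-words `make_toy_encoder` stands in for a frozen embedding model, element-wise max pooling stands in for DICE's aggregation (which the abstract does not specify), and `evidence_dilution` is one plausible reading of EDI (strongest chunk similarity minus document-level similarity), not the paper's exact formula.

```python
import numpy as np

def make_toy_encoder(vocab):
    """Hypothetical stand-in for a frozen encoder: L2-normalized bag-of-words."""
    index = {tok: i for i, tok in enumerate(vocab)}

    def encode(tokens):
        vec = np.zeros(len(index))
        for tok in tokens:
            vec[index[tok]] += 1.0
        norm = np.linalg.norm(vec)
        return vec / norm if norm > 0 else vec

    return encode

def chunk_tokens(tokens, chunk_size):
    """Split a token list into consecutive, non-overlapping chunks."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def aggregate_chunks(chunk_vecs):
    """Assumed aggregation (illustration only): element-wise max, renormalized."""
    pooled = np.max(chunk_vecs, axis=0)
    return pooled / np.linalg.norm(pooled)

def evidence_dilution(query_vec, doc_vec, chunk_vecs):
    """Assumed EDI form: best chunk similarity minus document-level similarity."""
    best_chunk = max(float(query_vec @ c) for c in chunk_vecs)
    return best_chunk - float(query_vec @ doc_vec)

# A long filler document with a short decisive span (a "passkey") in the middle.
filler = [f"filler{i % 20}" for i in range(1000)]
doc = filler[:500] + ["passkey", "73914"] + filler[500:]
query = ["passkey", "73914"]

encode = make_toy_encoder(sorted(set(doc)))
q = encode(query)
chunk_vecs = [encode(c) for c in chunk_tokens(doc, 50)]

single_vec = encode(doc)                      # one vector for the whole document
aggregated_vec = aggregate_chunks(chunk_vecs)  # one vector built from chunk vectors

edi_single = evidence_dilution(q, single_vec, chunk_vecs)
edi_aggregated = evidence_dilution(q, aggregated_vec, chunk_vecs)
print(f"EDI single-vector: {edi_single:.3f}, EDI aggregated: {edi_aggregated:.3f}")
```

In this toy setting the aggregated vector keeps more of the needle's signal than the single document vector, and the query is still scored against exactly one vector per document. Real encoders and the paper's actual aggregation may behave differently.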

[NLP-34] SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction

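Quick Read: This paper addresses the performance bottleneck that data scarcity imposes on Multimodal Information Extraction (MIE), and in particular two weaknesses of existing data augmentation methods: coarse cross-modal alignment, and fragmented task-specific designs that fail to exploit shared semantic knowledge. The key to the solution is Semantic Anchor-aligned Multimodal Augmentation (SAMA), a unified augmentation framework. SAMA builds structured semantic anchors from ground-truth labels to guide a Collaborative Multi-Experts Multimodal Large Language Model (CME-MLLM), which combines a Universal Adapter that captures shared semantics with Task-Specific Adapters to produce diverse yet constraint-compliant text. For image synthesis, SAMA uses an Anchor-Preserving Diffusion mechanism with anchor-weighted prompts and latent conditioning, which maintains the critical semantic anchors while diversifying visual context. To remove the need for manual verification, a Dual-Constraint Filtering module selects synthetic samples by both cross-modal consistency and anchor fidelity. Experiments on benchmark datasets for multimodal named entity recognition (MNER), relation extraction (MRE), and event extraction (MEE) show that SAMA consistently outperforms state-of-the-art augmentation baselines in both fully supervised and low-resource settings.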

Link: https://arxiv.org/abs/2606.18780
Authors: Quanjiang Guo, Chong Mu, Jiazhou Pan, Ming Jia, Ling Tian, Hui Gao, Zhao Kang
Affiliations: University of Electronic Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: Accepted by IEEE Transactions on Multimedia

Abstract:Multimodal Information Extraction (MIE)-covering tasks such as Multimodal Named Entity Recognition (MNER), Relation Extraction (MRE), and Event Extraction (MEE)-is essential for understanding multimedia content but remains constrained by severe data scarcity. Although data augmentation is a promising remedy, existing approaches are impeded by coarse cross-modal alignment and fragmented, task-specific designs that fail to exploit shared semantic knowledge. To overcome these limitations, we introduce Semantic Anchor-aligned Multimodal Augmentation (SAMA), a unified framework for generating high-fidelity, task-aware synthetic data. SAMA constructs structured semantic anchors from ground-truth labels to guide a Collaborative Multi-Experts Multimodal Large Language Model (CME-MLLM), which integrates a Universal Adapter for shared semantics with Task-Specific Adapters to produce diverse yet constraint-compliant textual samples. For image synthesis, SAMA employs an Anchor-Preserving Diffusion mechanism that uses anchor-weighted prompts and latent conditioning to maintain critical semantic anchors while diversifying visual contexts. To eliminate the need for manual verification, SAMA further introduces a Dual-Constraint Filtering module that selects synthetic samples based on both cross-modal consistency and anchor fidelity. Extensive experiments across benchmark datasets for MNER, MRE, and MEE demonstrate that SAMA consistently outperforms state-of-the-art augmentation baselines under both fully supervised and low-resource settings, underscoring its versatility, robustness, and effectiveness.

[NLP-35] Output Vector Editing for Memorization Mitigation in Large Language Models

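Quick Read: This paper addresses the privacy, copyright, and security risks that arise when large language models (LLMs) memorize and reproduce sequences from their training data. Existing neuron-level mitigation methods equate editing with zeroing out neuron activations, but the activation only controls whether a neuron engages; the output vector is what writes to the residual stream and, through superposition, encodes multiple features. The authors propose output vector editing, a constrained-optimization weight edit that locates a small set of MLP neurons responsible for a memorized continuation and minimally modifies their output vectors to introduce a distractor in vocabulary space, redirecting their residual-stream contributions while leaving activations unchanged. The method is evaluated on four models from 360M to 7B parameters (SmolLM-360M, OLMo-1B, OLMo-7B, Llama2-7B), centering on OLMo-7B, whose open weights and pretraining corpus allow systematic mining of 6831 memorized sequences, and reaches up to 87.9% suppression. A 2.7× gap over zero ablation on the same located neurons shows that the suppression comes from the output-vector edit rather than from localization alone. Four edit modes span a spectrum from aggressive suppression to minimal redirection: in ensemble they cover 96.5% of memorized sequences, while the recommended single-mode configuration reaches 81.5% with no catastrophic locality failures. About 14% of sequences are unreachable by MLP-only editing, which marks a mechanistic boundary. These failures are not attention-driven overall, yet ablating the top contributing attention heads recovers 60–64% of them, with stronger recovery on continuations that copy tokens from the prefix, so attention serves as a complementary fallback rather than a primary mechanism. Edit mode ordering and the success-locality trade-off transfer across all four models, and success rates scale with model size rather than model family.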

Link: https://arxiv.org/abs/2606.18767
Authors: Ahmad Dawar Hakimi, Kaiwei Lei, Isabelle Augenstein, Hinrich Schütze
Affiliations: LMU Munich; University of Copenhagen; Munich Center for Machine Learning; Pioneer Centre for AI
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models memorize and reproduce sequences from their training data, creating privacy, copyright, and security risks. Existing neuron-level mitigation methods equate editing with zeroing out neuron activations, but the activation only controls whether a neuron engages; the output vector is what writes to the residual stream and, through superposition, encodes multiple features. We propose output vector editing, a constrained-optimization weight edit that locates a small set of MLP neurons responsible for a memorized continuation and minimally modifies their output vectors to introduce a distractor in vocabulary space, redirecting their residual-stream contributions while leaving activations unchanged. Evaluating on four models from 360M to 7B parameters (SmolLM-360M, OLMo-1B, OLMo-7B, Llama2-7B), we center on OLMo-7B (whose open weights and pretraining corpus enable systematic mining) and mine 6831 memorized sequences, achieving up to 87.9% suppression. The 2.7× gap over zero ablation on the same located neurons shows the suppression comes from the output-vector edit, not localization alone. Four edit modes span a spectrum from aggressive suppression to minimal redirection; in ensemble they cover 96.5% of memorized sequences, while our recommended single-mode configuration reaches 81.5% with no catastrophic locality failures. We further identify a mechanistic boundary at ~14% of sequences unreachable by MLP-only editing; while these failures are not attention-driven overall, ablating the top contributing attention heads recovers 60–64% of them, with stronger recovery on continuations that copy tokens from the prefix, positioning attention as a complementary fallback rather than a primary mechanism. Edit mode ordering and the success-locality trade-off transfer across all four models, with success rates scaling with model size rather than family.
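
The contrast between zeroing an activation and editing an output vector can be made concrete with a one-neuron toy model. This is only a sketch under stated assumptions: the random `unembed` matrix, the single neuron, and the closed-form minimum-norm `delta` for one margin constraint are all hypothetical simplifications, not the paper's constrained-optimization procedure or its neuron-location step.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 16, 10

# Hypothetical unembedding matrix with unit-norm rows (one row per token).
unembed = rng.normal(size=(vocab_size, d_model))
unembed /= np.linalg.norm(unembed, axis=1, keepdims=True)

memorized, distractor = 3, 7
activation = 2.0                          # neuron activation; never modified below
output_vec = unembed[memorized].copy()    # neuron writes the memorized token's direction

def neuron_logits(act, vec):
    """Logits contributed by one neuron: unembed @ (activation * output vector)."""
    return unembed @ (act * vec)

logits_before = neuron_logits(activation, output_vec)

# Zero ablation (the baseline the abstract contrasts with): silence the neuron.
logits_ablated = neuron_logits(0.0, output_vec)

# Output-vector edit: the smallest-norm delta that makes the distractor logit
# exceed the memorized logit by `margin`, with the activation held fixed.
margin = 0.1
g = unembed[distractor] - unembed[memorized]
gap = activation * (g @ output_vec)               # current distractor-minus-memorized gap
delta = (margin - gap) / (activation * (g @ g)) * g
edited_vec = output_vec + delta
logits_after = neuron_logits(activation, edited_vec)

print("top token before edit:", int(np.argmax(logits_before)))
print("distractor minus memorized after edit:",
      float(logits_after[distractor] - logits_after[memorized]))
```

Zero ablation removes the neuron's entire contribution, whereas the edit keeps the neuron active and only redirects what it writes, which is the distinction the abstract draws.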

[NLP-36] LegalWorld: A Life-Cycle Interactive Environment for Legal Agents

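Quick Read: This paper addresses fragmentation in how legal AI is evaluated: existing legal benchmarks assess isolated subtasks, and prior legal-agent simulators reinitialize each scenario from shared ground truth, leaving the causal dependencies between stages of civil litigation unmodeled. The authors present LegalWorld, a life-cycle interactive environment that models Chinese civil litigation as a causally connected state chain of five stages (seven sub-scenarios), grounded in 75,309 paired Chinese civil judgments. The key to the solution is reusable infrastructure, namely local memory, global case memory, and a Skill/Tool library, which keeps each dispute consistent across its full life cycle. On top of this environment they build LongJud-Bench to evaluate agent capability across all five connected stages. A total of 18,992 ratings from 217 evaluators with legal backgrounds confirm that LegalWorld trajectories are procedurally faithful and role-consistent. A capability-level cross-model evaluation reveals sharp divergences that aggregate scores cannot expose, with no single backbone leading across consultation, drafting, and courtroom advocacy. Detailed resources will be released publicly.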

Link: https://arxiv.org/abs/2606.18728
Authors: Songhan Zuo, Shengbin Yue, Tao Chiang, Guanying Li, Yun Song, Xuanjing Huang, Zhongyu Wei
Affiliations: Fudan University; Shanghai Innovation Institute; Northwest University of Political Science and Law
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Civil litigation is inherently a life-cycle process: what a lawyer drafts on day one constrains what unfolds at trial months later. Yet existing legal benchmarks evaluate isolated subtasks, and prior legal-agent simulators reinitialize each scenario from shared ground truth, leaving cross-stage causal dependencies unmodeled. We present LegalWorld, a life-cycle interactive environment that models Chinese civil litigation as a causally connected state chain of five stages (seven sub-scenarios), grounded in 75,309 paired Chinese civil judgments. We pair it with reusable infrastructure (local memory, global case memory, a Skill/Tool library) that keeps each dispute consistent across its full life cycle. Building on this environment, we construct LongJud-Bench to evaluate agent capability across all five connected stages. 18,992 ratings from 217 legal-background evaluators confirm that LegalWorld trajectories are procedurally faithful and role-consistent; and a capability-level cross-model evaluation reveals sharp divergences that aggregate scores cannot expose, with no single backbone leading across consultation, drafting, and courtroom advocacy. Detailed resources will be released publicly.

[NLP-37] Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

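Quick Read: This paper addresses the loss of morphological information that subword tokenizers cause for Turkish, an agglutinative language in which meaning is carried by morphemes. Subword tokenizers split words by corpus statistics and fragment semantically loaded suffixes, and WordPiece and rule-based analyzers additionally fail to decode their output back to the original text. The authors present Morpheus, a neural morpheme-boundary model that is both a lossless, morphology-aware tokenizer and a word-embedding producer. Its core is a differentiable Poisson-binomial dynamic program that turns per-character boundary probabilities into soft morpheme memberships during training and exact segments at inference, with no string normalization, so $\mathrm{decode}(\mathrm{encode}(w)) = w$ holds by construction. The same forward pass that tokenizes also emits a structured word embedding. Among reversible tokenizers, Morpheus attains the lowest bits-per-character (1.425), roughly doubles the gold morphological alignment of the subword family (MorphScore macro-F1 0.61 vs. about 0.32), and uses about 19% less GPU memory than 64K-vocabulary subword tokenizers. As an embedder, frozen Morpheus vectors lead on lexical retrieval (root-family MAP 0.85) and same-root verification (ROC-AUC 1.00), surpassing BGE-M3 and BERTurk. On context- and inflection-dependent tasks (NER, case/number probing) the heavier contextual encoders remain ahead, a trade-off the author attributes to Morpheus's root-centric geometry.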

Link: https://arxiv.org/abs/2606.18717
Authors: Tolga Şakar
Affiliations: Independent Researcher
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Turkish is agglutinative: meaning is carried by morphemes, yet the subword tokenizers that drive modern language models split words by corpus statistics, fragmenting semantically loaded suffixes and – in the case of WordPiece and rule-based analyzers – failing to decode their output back to the original text. This paper presents Morpheus, a neural morpheme-boundary model for Turkish that is at once a lossless, morphology-aware tokenizer and a word-embedding producer. A differentiable Poisson-binomial dynamic program turns per-character boundary probabilities into soft morpheme memberships during training and exact segments at inference, with no string normalization, so $\mathrm{decode}(\mathrm{encode}(w)) = w$ holds by construction. Because the model is neural, the same forward pass that tokenizes also emits a structured word embedding. Among reversible tokenizers – the only ones valid for generation – Morpheus attains the lowest bits-per-character (1.425), roughly doubles the gold morphological alignment of the subword family (MorphScore macro-F1 0.61 vs. ~0.32), and uses ~19% less GPU memory than 64K-vocabulary subword tokenizers. As an embedder, frozen Morpheus vectors lead on lexical retrieval (root-family MAP 0.85) and same-root verification (ROC-AUC 1.00), surpassing the multilingual retriever BGE-M3 and BERTurk; on context- and inflection-dependent tasks (NER, case/number probing) the heavier contextual encoders remain ahead – a trade-off we attribute to Morpheus's root-centric geometry. Code: this https URL; model: this https URL; interactive demo: this https URL.
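
One way to read the Poisson-binomial dynamic program is sketched below: if each character gap carries an independent boundary probability, the number of boundaries before a character follows a Poisson-binomial distribution, and that count is the index of the morpheme the character belongs to. The function names, the example probabilities, and the 0.5 threshold for hard segmentation are assumptions for illustration; the paper's actual formulation may differ.

```python
import numpy as np

def soft_memberships(boundary_probs):
    """Soft morpheme memberships from per-gap boundary probabilities.

    boundary_probs[t] is the probability of a morpheme boundary between
    character t and character t + 1 (length n - 1 for an n-character word).
    Returns M with M[t, k] = P(exactly k boundaries precede character t),
    i.e. the probability that character t lies in morpheme k.
    """
    n = len(boundary_probs) + 1
    M = np.zeros((n, n))
    M[0, 0] = 1.0  # the first character always starts morpheme 0
    for t in range(1, n):
        p = boundary_probs[t - 1]
        M[t, :] = M[t - 1, :] * (1.0 - p)   # no boundary: stay in the same morpheme
        M[t, 1:] += M[t - 1, :-1] * p       # boundary: move on to the next morpheme
    return M

def hard_segments(word, boundary_probs, threshold=0.5):
    """Exact segments at inference (assumed rule: cut where p > threshold)."""
    cuts = [i + 1 for i, p in enumerate(boundary_probs) if p > threshold]
    bounds = [0] + cuts + [len(word)]
    return [word[a:b] for a, b in zip(bounds, bounds[1:])]

word = "evlerde"                              # ev + ler + de ("in the houses")
probs = [0.05, 0.9, 0.1, 0.05, 0.95, 0.02]    # hypothetical model outputs
M = soft_memberships(probs)
segments = hard_segments(word, probs)
print(segments)
```

Because segmentation only slices the original string, concatenating the segments returns the input unchanged, which mirrors the reversibility property the abstract emphasizes. The recurrence uses only multiplication and addition, so it stays differentiable when written in an autodiff framework.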

[NLP-38] LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment

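Quick Read: This paper examines whether large language models (LLMs) can capture item discrimination, a fundamental psychometric property that measures whether an assessment item meaningfully distinguishes higher-proficiency students from lower-proficiency ones. Prior work has explored whether LLMs can estimate item difficulty, but their ability to capture discrimination has remained unclear. The authors evaluate 42 proprietary and open-weight LLMs in zero-shot settings with two complementary approaches: direct discrimination prediction, where the model estimates an item's discrimination value from its content, and response-based Classical Test Theory (CTT) calibration, where LLM answers are treated as synthetic student responses from which discrimination scores are computed. Direct prediction aligns only weakly with human-calibrated discrimination, with the best model reaching a Spearman correlation of 0.152. Response-based CTT calibration gives a stronger but still limited signal, with the all-persona synthetic respondent pool reaching 0.241. Current LLMs therefore contain non-random discrimination-relevant signal but do not yet reliably capture how items distinguish human students, leaving item discrimination an open challenge for LLM-based psychometric evaluation.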

Link: https://arxiv.org/abs/2606.18709
Authors: Han Chen, Ming Li, Chenguang Wang, Yijun Liang, Dawei Zhou, Hong Jiao, Tianyi Zhou
Affiliations: MBZUAI; University of Maryland; Virginia Tech
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Item discrimination is a fundamental psychometric property of educational assessment, which measures whether an item meaningfully distinguishes students with higher proficiency from students with lower proficiency. While various existing works have explored whether large language models (LLMs) can estimate item difficulty, it remains unclear whether they can capture item discrimination. In this work, we evaluate 42 proprietary and open-weight LLMs in zero-shot settings using two complementary approaches: direct discrimination prediction, where models explicitly estimate an item’s discrimination value from its content, and response-based Classical Test Theory (CTT) calibration, where LLM answers are treated as synthetic student responses to compute discrimination scores. Our results show that direct prediction yields weak alignment with human-calibrated discrimination: the best-performing model reaches only a Spearman correlation of 0.152. Response-based CTT calibration provides a stronger but still limited signal, with the all-persona synthetic respondent pool reaching a Spearman correlation of 0.241. These findings highlight item discrimination as an open challenge for LLM-based psychometric evaluation: current LLMs contain non-random discrimination-relevant signal, but they do not yet reliably capture how assessment items distinguish human students.
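
Response-based CTT calibration can be sketched with a common discrimination statistic, the corrected item-total correlation (the correlation between an item's score and the total score on the remaining items). Whether the paper uses this exact statistic is an assumption; the response matrix below is invented for illustration, with rows playing the role of synthetic respondents.

```python
import numpy as np

def item_discrimination(responses):
    """Corrected item-total correlation for each item.

    responses: (n_respondents, n_items) matrix of 0/1 scores.
    Returns one value per item; NaN where an item or its rest-score is constant.
    """
    responses = np.asarray(responses, dtype=float)
    total = responses.sum(axis=1)
    values = []
    for j in range(responses.shape[1]):
        item = responses[:, j]
        rest = total - item  # exclude the item itself to avoid inflating the correlation
        if item.std() == 0 or rest.std() == 0:
            values.append(float("nan"))
        else:
            values.append(float(np.corrcoef(item, rest)[0, 1]))
    return np.array(values)

# Hypothetical responses: rows are synthetic respondents, columns are items.
# The last item is answered correctly mostly by the weaker respondents.
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
disc = item_discrimination(responses)
print(np.round(disc, 3))
```

A positive value means stronger respondents tend to get the item right, and a negative value flags an item that works against the rest of the test. Comparing such values with human-calibrated discrimination, for example via Spearman correlation, is the kind of alignment check the abstract reports.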

[NLP-39] Attention as Frustrated Synchronization

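Quick Read: This paper asks how an attention architecture can be built from oscillator synchronization. A network of oscillators that synchronizes perfectly computes nothing further, so such an architecture has to locate its computation in structured departures from agreement. The authors introduce the Frustrated Synchronization Network (FSN), whose token states are phases on a torus and whose entire value pathway is a single learned complex coupling kernel over harmonics and a one-step delay. Each component of the kernel is a frustration in the sense of the synchronization literature: (1) the complex phases are static Kuramoto-Sakaguchi frustration angles; (2) the signed harmonics are repulsive Daido components; (3) the delay term, which couples each token to the successors of the tokens it attends to, is algebraically identical to Kuramoto-Sakaguchi coupling whose frustration angle is the data's own transition, so next-token prediction is implemented as synchronization frustrated by the data. At matched one-million-parameter and training budgets on character-level text and code, the FSN's validation loss is below that of a tuned RoPE-SwiGLU transformer at every epoch measured, and the comparison survives training the baseline to convergence: every thirty-epoch enwik8 seed finishes below the transformer's converged fifty-epoch loss of 1.611, and the FSN's completed fifty-epoch runs converge to 1.5953 ± 0.0014. A variant that replaces every feed-forward block with mean-field coupling to learned collective modes, leaving no multilayer perceptron in the stack, tracks the transformer. On natural text the unfrustrated base layer falls behind the converged transformer at every copy depth, worst on long-range copy events, while the kernel reverses the deficit at every depth of four and beyond. Headline comparisons are at the one-million-parameter scale; a scale ladder is complete through four million parameters with the advantage persisting, and the remaining arms are marked as in progress.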

Link: https://arxiv.org/abs/2606.18694
Authors: Joshua Nunley
Affiliations: Indiana University Bloomington
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE); Adaptation and Self-Organizing Systems (nlin.AO)
Comments: 25 pages, 4 figures. Preliminary report at the 1-10M parameter scale

Abstract:A network of oscillators that synchronizes perfectly computes nothing further, so an attention architecture built from synchronization must locate its computation in structured departures from agreement. We introduce the Frustrated Synchronization Network (FSN), whose token states are phases on a torus and whose entire value pathway is one learned complex coupling kernel over harmonics and a one-step delay. Each component of the kernel is a frustration in the sense of the synchronization literature. The complex phases are static Kuramoto-Sakaguchi frustration angles, the signed harmonics are repulsive Daido components, and the delay term, which couples each token to the successors of the tokens it attends to, is algebraically identical to Kuramoto-Sakaguchi coupling whose frustration angle is the data’s own transition, so next-token prediction is implemented as synchronization frustrated by the data. At matched one-million-parameter and training budgets on character-level text and code, the FSN’s validation loss is below a tuned RoPE-SwiGLU transformer’s at every epoch measured, and the comparison survives training the baseline to convergence: every thirty-epoch enwik8 seed finishes below the transformer’s converged fifty-epoch loss of 1.611, and the FSN’s completed fifty-epoch runs converge to 1.5953 +/- 0.0014. A variant with every feed-forward block replaced by mean-field coupling to learned collective modes, leaving no multilayer perceptron in the stack, tracks the transformer. On natural text the unfrustrated base layer falls behind the converged transformer at every copy depth, worst on long-range copy events; the kernel reverses the deficit at every depth of four and beyond. Headline comparisons are at the one-million-parameter scale; a scale ladder is complete through four million parameters with the advantage persisting, and remaining arms are marked as in progress.
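
For readers unfamiliar with the terminology, the standard Kuramoto-Sakaguchi model from the synchronization literature is reproduced below as background. It is the textbook mean-field form, not the FSN's learned kernel, whose exact parameterization the abstract does not give.

```latex
\dot{\theta}_i \;=\; \omega_i \;+\; \frac{K}{N} \sum_{j=1}^{N} \sin\!\left(\theta_j - \theta_i - \alpha\right)
```

Here $\theta_i$ is the phase of oscillator $i$, $\omega_i$ its natural frequency, $K$ the coupling strength, and $\alpha$ the frustration angle. With $\alpha = 0$ this reduces to the classic Kuramoto model, in which identical oscillators are pulled toward exact phase agreement; a nonzero $\alpha$ makes the coupling favor a phase offset instead, which is the sense in which synchronization is "frustrated".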

[NLP-40] ForecastBench-Sim: A Simulated-World Forecasting Benchmark ICML2026

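Quick Read: This paper addresses limitations that forecasting benchmarks for general-purpose AI systems inherit from the real world: outcomes resolve slowly, tail events are rare, and counterfactual questions are difficult to score. The key to the solution is ForecastBench-Sim, a simulated-world forecasting benchmark built on game rollouts from Freeciv, a turn-based strategy game modelled on the Civilization series. Forecasters receive a fixed world report (a structured snapshot of the current game state) and answer questions about hidden future states; the benchmark then continues the simulation and scores the forecasts. Because the world is simulated, the same setup can generate continuous or binary forecasting questions at arbitrary time horizons, paired intervention worlds for conditional or causal questions, and resolved examples of rare or disruptive outcomes. The paper describes the benchmark pipeline, question families, scoring protocol, and release artifacts, and reports validation slices from model evaluations and an anonymized human pilot. The benchmark is intended to complement real-world forecasting benchmarks with controlled, immediately resolvable tasks for studying probabilistic reasoning under dynamic world states.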

Link: https://arxiv.org/abs/2606.18686
Authors: Jaeho Lee, Nick Merrill, Ezra Karger
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 15 pages, 5 main figures, 6 appendix figures. Spotlight presentation at Forecasting as a New Frontier of Intelligence / Workshop on AI Forecasting, ICML 2026

Abstract:Forecasting benchmarks for general-purpose AI systems usually inherit the constraints of the real world: outcomes resolve slowly, tail events are rare, and counterfactual questions are difficult to score. We introduce ForecastBench-Sim, a simulated-world forecasting benchmark built on game rollouts from Freeciv, a turn-based strategy game modelled on the Civilization series. Forecasters receive a fixed world report (a structured snapshot of the current game state) and answer questions about hidden future states; the benchmark then continues the simulation and scores forecasts. Because the world is simulated, the same setup can generate continuous or binary forecasting questions at arbitrary time horizons, paired intervention worlds for conditional or causal questions, and resolved examples of rare or disruptive outcomes. We describe the benchmark pipeline, question families, scoring protocol, and release artifacts, and report validation slices from model evaluations and an anonymized human pilot. ForecastBench-Sim is intended to complement real-world forecasting benchmarks by providing controlled, immediately resolvable tasks for studying probabilistic reasoning under dynamic world states.

[NLP-41] RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories

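Quick Read: This paper addresses data mixture selection for Large Language Model (LLM) pretraining. Existing methods such as RegMix fit a regression model on small-scale proxy runs and select a single static mixture, which cannot adapt as training progresses. The key to the solution is RegMix-D, a simple extension of RegMix to dynamic mixing. The central observation is that proxy runs produce not only endpoint losses but full loss trajectories, and training the regression model on these trajectories makes it possible to predict optimal mixtures at multiple training stages. RegMix-D supports two deployment modes: an offline variant that generates a complete mixture schedule before target training, and an online variant that adapts the mixture during training using observed loss. In experiments on 25B tokens of the Pile dataset with a 1B-parameter target model, RegMix-D consistently improves over RegMix and DoReMi across 13 downstream tasks while remaining proxy-efficient: it surpasses RegMix even with only 128 proxy models (25% of RegMix's proxy compute budget).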

Link: https://arxiv.org/abs/2606.18663
Authors: Kaiyan Zhao, Zhongtao Miao, Akiko Aizawa, Yoshimasa Tsuruoka
Affiliations: The University of Tokyo; National Institute of Informatics
Subjects: Computation and Language (cs.CL)
Comments: Work in progress

Abstract:Data mixture selection is critical for Large Language Model pretraining. Existing methods such as RegMix select a single static mixture by fitting a regression model on small-scale proxy runs. We propose RegMix-D, a simple extension of RegMix to dynamic mixing. Our key observation is that proxy runs produce not only endpoint losses, but also full loss trajectories, which can be used to further improve data mixture. By training regression model on these trajectories, we can predict optimal mixtures at multiple training stages. RegMix-D supports two deployment modes: an offline variant that generates a complete mixture schedule before target training, and an online variant that adapts the mixture during training using observed loss. Experiments on 25B tokens of the Pile dataset with a 1B parameter target model show that RegMix-D consistently improves over RegMix and DoReMi across 13 downstream tasks while remaining proxy-efficient: it surpasses RegMix even with only 128 proxy models (25% of RegMix’s proxy compute budget).
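
The offline variant's idea of turning proxy loss trajectories into a per-stage mixture schedule can be sketched as follows. This is a minimal illustration under assumptions: a plain least-squares linear regressor per stage, a two-domain mixture, and synthetic noiseless proxy losses. The function names and data are hypothetical, and the paper's regression model and search procedure may differ.

```python
import numpy as np

def fit_stage_regressors(mixtures, loss_trajectories):
    """Fit one linear map from mixture weights to loss for every training stage.

    mixtures: (n_proxy, n_domains), rows sum to 1.
    loss_trajectories: (n_proxy, n_stages), proxy loss recorded at each stage.
    Returns coefficients of shape (n_domains, n_stages). No intercept is used
    because the weights sum to 1, so a constant term is already representable.
    """
    coefs, *_ = np.linalg.lstsq(mixtures, loss_trajectories, rcond=None)
    return coefs

def mixture_schedule(coefs, candidates):
    """Pick, for each stage, the candidate mixture with the lowest predicted loss."""
    predicted = candidates @ coefs                 # (n_candidates, n_stages)
    return candidates[np.argmin(predicted, axis=0)]

# Synthetic proxy runs (assumption): domain 0 helps most early, domain 1 later.
rng = np.random.default_rng(0)
mixtures = rng.dirichlet(np.ones(2), size=16)
true_effects = np.array([[3.0, 2.5],    # per-stage loss contribution of domain 0
                         [4.0, 2.0]])   # per-stage loss contribution of domain 1
loss_trajectories = mixtures @ true_effects

candidates = np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]])
coefs = fit_stage_regressors(mixtures, loss_trajectories)
schedule = mixture_schedule(coefs, candidates)
print(schedule)  # one mixture per stage
```

A static method would fit only the final column of `loss_trajectories` and return one mixture. Using every column yields a schedule that can change between stages, which is the extension the abstract describes.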

[NLP-42] The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs

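Quick Read: This paper studies misfired alignment in large language models (LLMs): safety-oriented behaviors induced by alignment can cause a model to reject warranted conclusions even when the context explicitly supports them, overriding explicit evidence. To quantify the phenomenon for stereotype-related alignment, the authors introduce VETO, a benchmark of 2,032 contrastive pairs derived from BBQ, and define the Misfired Alignment Rate (MAR), which measures on a 0 to 100 scale how often a model fails on a stereotype-related question but succeeds on its contrastive counterpart. All 25 LLMs benchmarked, including the most recent ones, show non-trivial MARs (4.7% to 18.9%), while all human participants achieve 0.0%. Controlled priming experiments show that alignment-induced cues can substantially amplify MAR across LLMs, indicating that the failures are not artifacts of individual examples but can be induced by safety-related framing. Mechanistic analyses of open-weight LLMs reveal late-layer suppression of evidence-supported answers, and comparisons between instruct and base models suggest that this suppression emerges after instruction training. The authors stress that these findings are not an argument against alignment: current methods can overgeneralize surface-level safety cues to the point of overriding objective evidence, which motivates alignment objectives that better preserve contextual grounding.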

Link: https://arxiv.org/abs/2606.18656
Authors: Naihao Deng, Yiming Feng, Chimaobi Okite, Kaijian Zou, Lu Wang, Rada Mihalcea, Yulong Chen
Affiliations: University of Michigan; University of Cambridge; University of Aberdeen
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Warning: This paper studies stereotypes and biases, and contains potentially disturbing examples, used for illustration purposes only. Our findings should not be interpreted as an argument against alignment. Instead, this paper highlights the need for principled approaches to more advanced alignment. Alignment aims to ensure that large language models (LLMs) behave safely and reliably, including by avoiding unsafe inferences. However, we show that such safety-oriented behaviors can misfire: models may reject warranted conclusions even when they are explicitly supported by context. We call this failure mode misfired alignment, where alignment-induced changes cause LLMs to override explicit evidence. To quantify this phenomenon, specifically on stereotype-related alignment, we introduce VETO, a benchmark consisting of 2,032 BBQ-derived contrastive pairs, and define a new metric, Misfired Alignment Rate (MAR), which measures on a 0 to 100 scale how often a model fails on a stereotype-related question but succeeds on its contrastive counterpart. We benchmark 25 LLMs on VETO, and show that all LLMs, including the most recent ones, exhibit non-trivial (4.7 to 18.9%) MARs while all human participants achieve 0.0% MAR. Controlled priming experiments further show that alignment-induced cues can substantially amplify MAR across LLMs, indicating that these failures are not merely artifacts of individual examples but can be induced by safety-related framing. Mechanistic analyses on open-weight LLMs reveal late-layer suppression of evidence-supported answers, and comparisons between instruct and base LLMs suggest that this suppression emerges after instruction training. These findings show that current alignment methods can overgeneralize surface-level safety cues, to the point of overriding objective evidence, motivating more work on alignment objectives that better preserve contextual grounding.
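
The MAR definition in the abstract (fail on the stereotype-related question, succeed on its contrastive counterpart, reported on a 0 to 100 scale) maps onto a short function. The denominator used here, all contrastive pairs, is an assumption, since the abstract does not state it, and the sample outcomes are invented.

```python
def misfired_alignment_rate(pairs):
    """Share of contrastive pairs (0 to 100) where the model fails the
    stereotype-related question but answers its counterpart correctly.

    pairs: iterable of (target_correct, counterpart_correct) booleans.
    Assumption: the rate is taken over all pairs.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one contrastive pair is required")
    misfires = sum(1 for target_ok, counterpart_ok in pairs
                   if not target_ok and counterpart_ok)
    return 100.0 * misfires / len(pairs)

# Hypothetical outcomes for four contrastive pairs.
outcomes = [
    (False, True),   # misfire: fails the target, succeeds on the counterpart
    (True, True),    # both correct
    (False, False),  # fails both, so not attributed to misfired alignment
    (True, False),   # fails only the counterpart
]
mar = misfired_alignment_rate(outcomes)
print(mar)
```

Counting only pairs where the counterpart is answered correctly separates alignment-induced refusals from questions the model cannot answer in either form.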

[NLP-43] PEC-Home: Interpretation of Progressively Elliptical Commands in Smart Homes ACL2026

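Quick Read: This paper addresses the errors that smart home assistants make when interpreting progressively elliptical commands. As shared context accumulates in a dialogue, people omit more and more to communicate efficiently, but current assistants based on Large Language Models (LLMs) overlook this progressive omission and misinterpret the resulting commands. In practical smart home scenarios this creates two challenges: (1) referential ambiguity caused by different environmental expectations among multiple users, and (2) intention ambiguity resulting from user preferences that evolve over time or change with the environment. To address them, the authors introduce PEC-Home, the first simulated home dataset designed specifically for interpreting progressively elliptical commands in smart homes. Experiments on various LLMs, including GPT-4o, show that existing home assistants struggle to execute user-intended operations from elliptical commands alone, and that even with tools for storing and retrieving dialogue history, execution accuracy remains below that achieved with complete commands.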

Link: https://arxiv.org/abs/2606.18636
Authors: Yingyu Shan, Zeming Liu, Silin Li, Boao Qian, Jiashu Yao, Yuhang Guo, Haifeng Wang
Affiliations: Beijing Institute of Technology; Beihang University; Baidu Inc.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by ACL 2026 Findings

Abstract:Recent advancements in Large Language Models (LLMs) have empowered home assistants with natural language interaction capabilities. However, current assistants overlook the progressive omission that occurs in human dialogue as shared context accumulates, leading to more elliptical expressions for efficient communication. Thus, current assistants still struggle to interpret such elliptical expressions accurately, which limits their effectiveness in real-world applications. In practical smart home scenarios, assistants face two major challenges caused by elliptical commands: (1) referential ambiguity caused by different environmental expectations among multiple users; and (2) intention ambiguity resulting from user preferences that evolve over time or change with the environment. To address these challenges, we introduce PEC-Home, the first simulated home dataset specifically designed for interpreting progressively elliptical commands in smart homes. Extensive experiments on various LLMs, including GPT-4o, show that existing home assistants struggle to execute user-intended operations based solely on elliptical commands. Even when equipped with tools for storing and retrieving user dialogue history, execution accuracy remains below that achieved with complete commands…

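[NLP-44] PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding

Quick read: This paper addresses the weak pragmatic reasoning of Large Language Models (LLMs), which tend to choose literal interpretations instead of inferring implied meaning. The proposed solution is PragReST, a self-supervised framework that constructs pragmatic QA data, generates counterfactual reasoning traces, and trains models to internalize them through supervised fine-tuning and reinforcement learning. Its key property is that it needs neither human-labeled training data nor distillation from a stronger teacher. Across four pragmatic benchmarks (PragMega, Ludwig, MetoQA, and AltPrag), PragReST improves over backbone models, task-specific pragmatic tuning baselines, and non-counterfactual variants of the same pipeline; on the accuracy-based benchmarks it improves over the instruct backbone by 5.37% and 5.50% (absolute) for Qwen3-8B and Qwen3-14B, respectively. Error analysis and ablations indicate that counterfactual reasoning drives the gains: the method mainly reduces errors caused by failing to contrast the observed utterance with plausible alternatives, and removing counterfactual reasoning substantially reduces performance. Training also preserves out-of-domain performance on general-knowledge and mathematical reasoning benchmarks.

Link: https://arxiv.org/abs/2606.18624
Authors: Jihyung Park, Minchao Huang, Leqi Liu, Elias Stengel-Eskin
Affiliations: The University of Texas at Austin
Categories: Computation and Language (cs.CL)
Comments: First two authors contributed equally. Code and models: this https URL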

Abstract:Natural language understanding often depends on meanings that are implied rather than explicitly stated, requiring pragmatic reasoning. Despite strong performance on math and logical reasoning, large language models (LLMs) still struggle with making pragmatic inferences, often choosing literal interpretations. To improve LLM pragmatic reasoning, we introduce PragReST, a self-supervised framework that constructs pragmatic QA data, generates counterfactual reasoning traces, and trains models to internalize them through supervised fine-tuning and reinforcement learning, without human-labeled training data or distillation from a stronger teacher. Across four pragmatic benchmarks (PragMega, Ludwig, MetoQA, and AltPrag), PragReST improves over backbone models, task-specific pragmatic tuning baselines, and non-counterfactual variants of the same pipeline. On accuracy-based benchmarks, PragReST improves over the instruct backbone by 5.37 and 5.50% (absolute) for Qwen3-8B and Qwen3-14B, respectively. Our error analysis and ablations underscore the importance of counterfactual reasoning: PragReST primarily reduces errors caused by failures to contrast observed utterances with plausible alternatives, and removing counterfactual reasoning substantially reduces performance. Moreover, our training preserves out-of-domain performance on general-knowledge and mathematical reasoning benchmarks.

[NLP-45] BCL: Bayesian In-Context Learning Framework for Information Extraction ACL2026

Quick read: This paper addresses a weakness of in-context learning (ICL) with large language models for Information Extraction (IE): existing approaches either perform inconsistently across model scales or lack systematic optimization and generalizability. The authors propose BCL (Bayesian In-Context Learning Framework for Information Extraction), which they describe as the first optimization framework to use particle filtering with Bayesian updates to refine label representations systematically. The procedure has four steps (initialization, observation, weight update, and resampling) and applies to both sequence labeling and relation classification.

Link: https://arxiv.org/abs/2606.18620
Authors: Haoliang Liu, Chengkun Cai, Xu Zhao, Han Zhu, Shizhou Huang, Xinglin Zhang, Tao Chen, Jenq-Neng Hwang, Zhang Huaping, Lei Li
Affiliations: Beijing Institute of Technology; Tsinghua University; Zhejiang University; Shanghai Jiao Tong University; Harbin Institute of Technology; National University of Singapore; University of Science and Technology of China; Peking University; Chinese Academy of Sciences
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ACL 2026 Findings

Abstract:Existing information extraction (IE) tasks increasingly adopt in-context learning (ICL) with large language models. However, current approaches either show inconsistent performance across model scales or lack systematic optimization and generalizability. Building on this, we propose BCL (Bayesian In-Context Learning Framework for Information Extraction), the first optimization framework that uses particle filtering with Bayesian updates to systematically refine label representations across IE tasks. Through four steps (initialization, observation, weight update, and resampling), BCL generalizes to both sequence labeling and relation classification paradigms. Extensive experiments demonstrate substantial and consistent improvements over existing approaches.
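
The four steps named in the abstract are those of a standard particle filter. The sketch below is a generic illustration of that loop and not the BCL implementation: the particles are candidate values for a coin's heads probability, a toy stand-in for the label representations the paper refines, and the function name, the jitter step, and all parameters are assumptions made for this example.

```python
import random


def particle_filter(observations, num_particles=500, jitter=0.01, seed=0):
    """Estimate a coin's heads probability with a generic particle filter.

    Toy example only: the abstract does not describe how BCL represents
    particles or computes likelihoods.
    """
    rng = random.Random(seed)
    # 1. Initialization: draw particles from a uniform prior on [0, 1].
    particles = [rng.random() for _ in range(num_particles)]
    for obs in observations:
        # 2. Observation: obs is 1 (heads) or 0 (tails).
        # 3. Weight update: weight each particle by the likelihood of obs.
        likelihoods = [p if obs == 1 else 1.0 - p for p in particles]
        weights = [max(w, 1e-12) for w in likelihoods]  # avoid all-zero weights
        # 4. Resampling: draw particles in proportion to their weights,
        #    then add small noise so the population keeps some diversity.
        particles = rng.choices(particles, weights=weights, k=num_particles)
        particles = [
            min(1.0, max(0.0, p + rng.gauss(0.0, jitter))) for p in particles
        ]
    # Posterior mean estimate.
    return sum(particles) / num_particles


estimate = particle_filter([1, 1, 1, 0] * 10)
```

With three heads in every four observations, the estimate should settle near the empirical frequency; the exact value depends on the seed and the jitter.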

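[NLP-46] Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance

Quick read: This paper addresses the gap between how medical LLMs are evaluated and what clinical assistance actually requires. Existing evaluations mostly test isolated capabilities such as clinical knowledge, electronic health record (EHR) system interaction, or patient communication, whereas real assistance demands coordinating all three within one interaction: physicians issue underspecified requests, patients describe symptoms ambiguously, and EHR systems require precise tool use. The authors introduce PhysAssistBench, a benchmark for interactive doctor-patient-EHR assistance. Built from real MIMIC-IV cases, it uses a scalable pipeline to construct agentic patients: interactive, record-grounded agents that turn static EHR records into multi-turn clinical scenarios while preserving clinical factuality. The curated bilingual evaluation set contains 1,296 manually reviewed and physician-validated turns. Experiments show that leading LLMs remain unreliable in this setting, which points to a key bottleneck: reliable assistance depends on coordination across knowledge, communication, and systems, not on isolated gains in any one of them.

Link: https://arxiv.org/abs/2606.18613
Authors: Tianming Du, Peijie Yu, Sihan Shang, Danli Shi, My Linh Nguyen, Shengbo Gao, Guangyuan Li, Yinghong Yu, Yan Jiang, Qianlong Zhao, Behzad Bozorgtabar, Shaoxiong Ji, Jiazhen Pan, Daniel Rueckert, Jiancheng Yang
Affiliations: Aalto University; Tencent; Harbin Institute of Technology, Shenzhen; Hong Kong Polytechnic University; Aarhus University; Technical University of Munich
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 34 pages with 8 figures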

Abstract:The most plausible near-term role of medical LLMs is to assist rather than replace physicians, yet current evaluations often test isolated capabilities: clinical knowledge, EHR system interaction, or patient communication. Physician assistance instead requires coordinating these capabilities within the same interaction, where physicians issue underspecified requests, patients describe symptoms ambiguously, and EHR systems demand precise tool use. We introduce PhysAssistBench, a benchmark for interactive doctor-patient-EHR assistance. Built from real MIMIC-IV cases, PhysAssistBench uses a scalable pipeline to construct agentic patients: interactive, record-grounded agents that turn static EHR records into multi-turn clinical scenarios while preserving clinical factuality. PhysAssistBench provides a curated bilingual evaluation set of 1,296 manually reviewed and physician-validated turns. Experiments with leading LLMs show that current models remain unreliable in this setting, which exposes a key bottleneck for clinical LLMs: reliable assistance requires coordination across knowledge, communication, and systems, not isolated gains in any of them.

[NLP-47] Steerable Cultural Preference Optimization of Reward Models ICML2026

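Quick read: This paper addresses the uneven representation of cultural sub-communities in Large Language Model (LLM) alignment. Existing alignment research mostly predicts a unified response preference of annotators from certain regions, which does not reflect the range of preferences across cultures. The goal is an alignment model with a more global outlook that represents each sub-community's preferences accurately without excessive bias towards any of them. The authors focus on reward models and present a new reward model training algorithm, SCPO (Steerable Cultural Preference Optimization, per the paper's title), whose weighting method incorporates diverse cultural preferences in a balanced manner. On two datasets, PRISM and GlobalOpinionQA, and across 7 countries, the method improves the minority reward model by up to 7 points over the baseline, and it is up to 280% more training-data-efficient than full-data finetuning of reward models. A bias analysis that evaluates sub-community preferences separately shows that the weighting method mitigates excessive bias.

Link: https://arxiv.org/abs/2606.18606
Authors: Minsik Oh, Advit Deepak, Sophie Wu, Douwe Kiela, Ekaterina Shutova
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to Pluralistic Alignment @ ICML 2026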

Abstract:It is essential for large language model (LLM) technology to serve many different cultural sub-communities in a manner that is acceptable to each community. However, research on LLM alignment has so far predominantly focused on predicting a unified response preference of annotators from certain regions. This paper aims to advance the development of alignment models with a more global outlook, that are able to accurately represent the preferences of subcommunities and do not exhibit excessive bias towards any of them. We focus on the development of reward models for this purpose and present a novel reward model training algorithm (SCPO) that can incorporate diverse cultural preferences in a balanced manner. Our method results in performance increases of the minority reward model of up to 7 points over the baseline model across two datasets, PRISM and GlobalOpinionQA, and across 7 countries. SCPO is up to 280% more training data-efficient than full-data finetuning of reward models. In addition, we perform analysis of bias by separately evaluating on the preference of subcommunities and show that excessive bias is mitigated via our weighting method. Our code is available at this https URL

[NLP-48] Low-resource Language Discrimination Towards Chinese Dialects with Transfer learning and Data Augmentation

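Quick read: This paper addresses the performance bottleneck that scarce annotated resources impose on Chinese dialect discrimination. The proposed solution is CDDTLDA, a Chinese dialect discrimination framework based on transfer learning and data augmentation. A source-side automatic speech recognition (ASR) model is first trained on a relatively large Chinese dialect corpus. The low-resource target-side dialect data is then augmented with a simple but effective method (speed, pitch, and noise perturbation), and a target ASR model is fine-tuned from the source-side model. A self-attention mechanism captures the semantic features shared between the source-side and target-side models, and the hidden semantic representation of the target ASR model is finally extracted for dialect discrimination. On two benchmark Chinese dialect corpora the method significantly outperforms state-of-the-art methods.

Link: https://arxiv.org/abs/2606.18597
Authors: Fan Xu, Yangjie Dan, Keyu Yan, Yong Ma, Mingwen Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Published in ACM TALLIP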

Abstract:Chinese dialects discrimination is a challenging natural language processing task due to scarce annotation resource. In this article, we develop a novel Chinese dialects discrimination framework with transfer learning and data augmentation (CDDTLDA) in order to overcome the shortage of resources. To be more specific, we first use a relatively larger Chinese dialects corpus to train a source-side automatic speech recognition (ASR) model. Then, we adopt a simple but effective data augmentation method (i.e., speed, pitch, and noise disturbance) to augment the target-side low-resource Chinese dialects, and fine-tune another target ASR model based on the previous source-side ASR model. Meanwhile, the potential common semantic features between source-side and target-side ASR models can be captured by using self-attention mechanism. Finally, we extract the hidden semantic representation in the target ASR model to conduct Chinese dialects discrimination. Our extensive experimental results demonstrate that our model significantly outperforms state-of-the-art methods on two benchmark Chinese dialects corpora.

[NLP-49] Dual Dimensionality for Local and Global Attention

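Quick read: This paper addresses the waste that comes from giving Key-Value representations the same dimensionality at every position in a sequence. Decoder-only Transformers represent the keys and values of all context tokens at one dimensionality, yet in natural language the next word depends most strongly on the immediately preceding tokens, while distant tokens mainly serve as long-range memory. The authors formalize this as Distance-Adaptive Representation (DAR): tokens inside a local context window keep full-dimensional representations to capture fine-grained dependencies, and tokens beyond the window are assigned reduced-dimensional representations (for example, 1/4 of the original dimensionality) to lower compute and storage cost. Across multiple pretraining scales (70M to 410M parameters) and continued supervised fine-tuning on a 1B-scale model, DAR closely matches full-dimensional baselines, whereas uniformly reducing dimensionality across all token positions performs worse. The results challenge the assumption that key and value dimensionality should be uniform across positions and suggest attention architectures that allocate representational capacity adaptively, enabling further reductions of the KV cache during inference.

Link: https://arxiv.org/abs/2606.18587
Authors: Zhiyuan Wang, Xuan Luo, Sirui Zeng, Xifeng Yan
Affiliations: UC Santa Barbara
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: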

Abstract:Decoder-only Transformers compute attention over the KV cache of preceding tokens. Keys (and Values) are typically represented with the same dimensionality, regardless of its distance from the prediction target. In natural language, however, the next word is most strongly influenced by the immediately preceding tokens. We hypothesize that local and distant tokens impose asymmetric demands on representational capacity: local tokens are more critical for predicting immediate outputs and thus require richer representations, whereas distant tokens primarily serve as long-range memory, for which lower-dimensional representations may suffice. We formalize this idea as Distance-Adaptive Representation (DAR), implemented in a controlled setting that preserves full-dimensional representations within a local context window while assigning reduced-dimensional representations (e.g. 1/4 of the original dimensionality) to tokens beyond that window. Across multiple pretraining scales (70M to 410M parameters), as well as continued supervised fine-tuning on a 1B-scale model, this approach closely matches the performance of full-dimensional baselines. In contrast, uniformly reducing dimensionality across all token positions leads to worse performance. These results challenge the common assumption that key and value dimensionality should be uniform across token positions. Our findings suggest a new direction for designing attention architectures that adaptively allocate representational capacity across sequences, enabling further reductions in KV cache during inference.
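
To get a feel for the cache savings the abstract points to, the following back-of-the-envelope sketch counts cached scalars for one attention head in one layer. The window size of 512, the head dimension of 128, and the function name are assumptions chosen for illustration; the abstract gives only the 1/4 reduction as an example.

```python
def kv_cache_entries(seq_len, head_dim, local_window, reduction=4):
    """Count cached key and value scalars for one head in one layer.

    Tokens within `local_window` of the prediction target keep `head_dim`
    dimensions; older tokens keep `head_dim // reduction` dimensions.
    Returns (uniform_full_dim_count, distance_adaptive_count).
    """
    local = min(seq_len, local_window)
    distant = seq_len - local
    # Factor 2 accounts for storing both keys and values.
    full = 2 * seq_len * head_dim
    adaptive = 2 * (local * head_dim + distant * (head_dim // reduction))
    return full, adaptive


full, adaptive = kv_cache_entries(seq_len=4096, head_dim=128, local_window=512)
ratio = adaptive / full  # fraction of the uniform cache that remains
```

Under these assumed sizes the adaptive cache holds roughly a third of the scalars of the uniform one, and the saving grows as the sequence gets longer relative to the window.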

[NLP-50] Speech-Driven End-to-End Language Discrimination towards Chinese Dialects

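Quick read: This paper addresses the difficulty of discriminating among closely related Chinese dialects, a challenging natural language processing task on which the traditional text-driven focus gives poor results. The authors propose a speech-driven end-to-end approach with three parts. First, they systematically examine how well speech-driven mel-frequency cepstral coefficient (MFCC) features suit language discrimination based on convolutional neural networks (CNNs). Second, they design a speech recognition model based on HMM-DNN to predict Chinese dialect words and use attention to extract the words that discriminate between dialects. Third, a CNN combines the word-level embedding with the MFCC-based features. Evaluation on two benchmark Chinese dialect corpora shows that the speech-driven approach is appropriate and effective for fine-grained Chinese dialect discrimination when compared with state-of-the-art methods.

Link: https://arxiv.org/abs/2606.18584
Authors: Fan Xu, Jian Luo, MingWen Wang, GuoDong Zhou
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Published in ACM TALLIP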

Abstract:Language discrimination among similar languages, varieties, and dialects is a challenging natural language processing task. The traditional text-driven focus leads to poor results. In this paper, we explore the effectiveness of speech-driven features towards language discrimination among Chinese dialects. First, we systematically explore the appropriateness of speech-driven MFCC features towards CNN-based language discrimination. Then, we design an end-to-end speech recognition model based on HMM-DNN to predict Chinese dialect words. We adopt attention to extract the discriminative words related to different Chinese dialects. Finally, through a CNN, we combine the word-level embedding and the MFCC-based features. Evaluation of two benchmark Chinese dialect corpora shows the appropriateness and effectiveness of the proposed speech-driven approach to fine-grained Chinese dialect discrimination compared to the state-of-the-art methods.

[NLP-51] Fair Cognitive Impairment Detection Through Unlearning INTERSPEECH2026

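Quick read: This paper addresses a fairness problem in detecting Mild Cognitive Impairment (MCI) from spontaneous speech: learned models often exploit demographic cues that correlate with the labels, such as sex and language, which produces a large performance gap across patient subgroups. The proposed multimodal framework combines (1) fusion across modalities (speech, text, and image) to strengthen the representation, and (2) unlearning via gradient reversal, which discourages the shared embedding from encoding task-irrelevant demographic attributes. On the multilingual benchmarks TAUKADIAL and PREPARE, the method outperforms the state-of-the-art multilingual and multimodal baseline in MCI classification while substantially reducing the performance gap across sex and language subgroups. A cross-dataset transfer analysis indicates that demographic unlearning helps learn more robust representations for MCI detection.

Link: https://arxiv.org/abs/2606.18571
Authors: William Nguyen, Jiali Cheng, Hadi Amiri
Affiliations: University of Massachusetts Lowell, USA
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Interspeech 2026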

Abstract:Mild Cognitive Impairment (MCI) is a medical condition characterized by a noticeable decline in memory, language, or thinking abilities. MCI detection from spontaneous speech is promising for scalable screening. However, learned models often exploit demographic cues correlated with labels, resulting in a large performance gap across subgroups. We present a multimodal framework that combines (i) cross-model fusion between modalities (speech, text, and image), and (ii) unlearning using gradient reversal that discourages the shared embedding from encoding task-irrelevant demographic attributes. Evaluated on the multilingual benchmarks TAUKADIAL and PREPARE, our method outperforms the state-of-the-art multilingual and multimodal baseline in MCI classification while substantially reducing the performance gap across patient subgroups (sex and language). We further analyze transfer across datasets, showing that demographic unlearning helps learn more robust representations for MCI detection.
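
Gradient reversal, the unlearning mechanism named in the abstract, is commonly implemented as a layer that passes features through unchanged and flips the sign of the gradient coming back from an auxiliary head. The sketch below shows only that behavior in plain Python with hand-written forward and backward methods; the class name and the scaling factor `lam` are assumptions for illustration, and a real model would use an autodiff framework.

```python
class GradientReversal:
    """Identity on the forward pass; multiplies gradients by -lam on the way back."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, features):
        # The demographic classifier sees the shared embedding unchanged.
        return list(features)

    def backward(self, grad_from_demographic_head):
        # The encoder receives the negated gradient, so it is pushed to make
        # the demographic attribute harder to predict from the embedding.
        return [-self.lam * g for g in grad_from_demographic_head]


layer = GradientReversal(lam=0.5)
embedding = layer.forward([1.0, 2.0])
encoder_grad = layer.backward([0.2, -0.4])
```

The task head's gradient reaches the encoder unchanged, so the encoder is trained to help the MCI classifier while working against the demographic one.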

[NLP-52] CEO-Bench: Can Agents Play the Long Game?

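Quick read: This paper addresses the gap between what language model agents do well, namely isolated short-horizon tasks such as software engineering and customer service, and what real-world challenges require: navigating long horizons amid uncertainty, acquiring information in noisy environments, adapting to a changing world, and orchestrating many moving parts toward a coherent goal. The authors introduce CEO-Bench, which evaluates these capabilities together by simulating the operation of a fictional startup for 500 days. Through a programmable Python interface the agent manages pricing, marketing, budgeting, and many other aspects of the company, and must analyze noisy, interconnected business databases to form a strategy. The strongest agents write sophisticated code, for example simulating customer cohorts to forecast future cash and mining negotiation history to uncover hidden customer preferences. Even so, most state-of-the-art models struggle: only Claude Opus 4.8 and GPT-5.5 finish above the 1M starting balance, and neither consistently turns a profit, which indicates a clear bottleneck in sustained, adaptive pursuit of long-term goals.

Link: https://arxiv.org/abs/2606.18543
Authors: Haozhe Chen, Karthik Narasimhan, Zhuang Liu
Affiliations: Princeton University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: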

Abstract:Language model agents are becoming proficient executors at isolated, short-horizon tasks such as software engineering and customer service. Yet real-world challenges require a combination of sophisticated skills that remain largely untested in agents: (1) navigating long horizons amid uncertainty; (2) acquiring information in noisy environments; (3) adapting to a changing world; (4) orchestrating multiple moving parts toward a coherent goal. We introduce CEO-Bench, which evaluates these capabilities together by simulating a representative real-world task: operating a startup for 500 days. An agent manages pricing, marketing, budgeting, and many other aspects of a fictional company through a programmable Python interface, operating in the same environment and facing the same challenges as a human CEO. Success demands analyzing noisy, interconnected business databases, translating signals into sound strategy, and coordinating many decisions with programming. The strongest agents write sophisticated code that simulates customer cohorts to forecast future cash and mines negotiation history to uncover hidden customer preferences. Even so, most state-of-the-art models struggle in this environment. Only Claude Opus 4.8 and GPT-5.5 finish above the 1M starting balance, and neither consistently turns a profit. CEO-Bench takes a first step toward measuring the intelligence required to drive sustained, adaptive progress over time.

[NLP-53] Evaluating Prompting-Based Defenses Against Domain-Camouflaged Injection Attacks

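Quick read: This paper addresses domain-camouflaged injection attacks, which embed malicious instructions in retrieved content using domain-appropriate vocabulary and thereby evade detectors that rely on syntactic injection markers. Given that detection can fail, the study evaluates five prompting-based defenses (spotlighting, paraphrasing, prompt sandwiching, and two combinations) across three model families (Claude Haiku, Llama 3.1 8B, Gemini 2.0 Flash) and three deployment domains (financial, legal, general). Paraphrasing retrieved content before the agent processes it is the most consistently effective defense in this benchmark: it reduces the camouflage attack success rate by 55-84% depending on the model and achieves lower attack success rates than the authors' Llama Guard 4 configuration on every model tested. Effectiveness is strongly model-dependent; spotlighting halves attack success on Claude Haiku but provides no benefit on Llama 3.1 8B. Financial deployments carry the highest residual risk (26-33% baseline attack success rate), and no prompting-based defense fully eliminates the threat on weaker models. The authors present this as the first systematic evaluation of prompting-based defenses against camouflage-class injection and offer benchmark-based recommendations, while noting that it is open whether the rankings carry over to real enterprise documents.

Link: https://arxiv.org/abs/2606.18530
Authors: Aaditya Pai
Affiliations: Columbia University
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 9 pages, 4 figures, 4 tables; under review at the AdvML-Frontiers x CoTMA workshop, COLM 2026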

Abstract:Domain-camouflaged injection attacks embed malicious instructions in retrieved content using domain-appropriate vocabulary, evading standard detectors that rely on syntactic injection markers. When detection fails, practitioners need to know which defense architectures reduce attack success. We evaluate five prompting-based defenses (spotlighting, paraphrasing, prompt sandwiching, and two combinations) against domain-camouflaged injection across three model families (Claude Haiku, Llama 3.1 8B, Gemini 2.0 Flash) and three deployment domains (financial, legal, general) using 3,510 trials. Paraphrasing retrieved content before agent processing is the most consistently effective defense in this benchmark, reducing camouflage attack success rate by 55-84% depending on model, and achieves lower attack success rates than our Llama Guard 4 configuration on every model tested. Defense effectiveness is strongly model-dependent: spotlighting halves attack success on Claude Haiku but provides no benefit on Llama 3.1 8B. Financial domain deployments face the highest residual risk at 26-33% baseline attack success rate, with no prompting-based defense fully eliminating the threat on weaker models. These results provide the first systematic evaluation of prompting-based defenses specifically against camouflage-class injection attacks and establish benchmark-based recommendations for practitioners. All tasks use synthetically constructed professional documents; whether these benchmark rankings generalize to real enterprise documents remains an open question.
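
The three base defenses named in the abstract can each be pictured as a small prompt transformation. The sketch below is a rough illustration under assumptions: the abstract does not give the paper's templates, so the delimiter string, the reminder wording, and the function names are hypothetical, and `paraphrase` is a caller-supplied function standing in for whatever rewriting step (typically a model call) is used.

```python
def spotlight(retrieved, marker="<<DOC>>"):
    """Spotlighting: delimit retrieved text so data is set apart from instructions."""
    return f"{marker}\n{retrieved}\n{marker}"


def sandwich(task, retrieved):
    """Prompt sandwiching: restate the task after the untrusted content."""
    return f"{task}\n\n{retrieved}\n\nReminder: {task}"


def paraphrase_then_prompt(task, retrieved, paraphrase):
    """Paraphrasing: rewrite the retrieved text before the agent sees it."""
    return f"{task}\n\n{paraphrase(retrieved)}"


prompt = sandwich("Summarise the filing.", spotlight("Q3 revenue rose 4%."))
```

The functions compose, which is one plausible way to build the combined defenses the study also evaluates.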

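[NLP-54] Towards Scalable Customization and Deployment of Multi-Agent Systems for Enterprise Applications

Quick read: This paper addresses two obstacles to deploying Large Language Model (LLM)-based multi-agent systems in enterprise production: domain-specific customization requirements, and the high latency and inference cost of agentic workflows. The authors propose a unified two-stage framework. The first stage, Agentic Model Customization, combines continual pretraining, supervised fine-tuning, and preference optimization to adapt a compact model to specialized domains while retaining strong agentic capabilities. The second stage, Inference Optimization, integrates speculative decoding and FP8 quantization with targeted calibration to enable cost-efficient serving with minimal quality loss. Across enterprise workloads the framework enables rapid domain adaptation and achieves a 4.48x throughput speedup while maintaining performance and improving robustness on long-tail scenarios.

Link: https://arxiv.org/abs/2606.18502
Authors: Paresh Dashore, Shreyas Kulkarni, Uttam Gurram, Nadia Bathaee, Kartik Balasubramaniam, Genta Indra Winata, Sambit Sahu, Shi-Xiong Zhang
Affiliations: Capital One
Categories: Computation and Language (cs.CL)
Comments: Preprint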

Abstract:Large language model (LLM)-based multi-agent systems demonstrate strong performance on complex reasoning and task execution, enabling broad enterprise applications. However, production deployment remains challenging due to domain-specific customization requirements and high latency and inference costs in agentic workflows. We propose a unified framework for customization and efficient deployment of multi-agent systems in real-world settings. The first stage, Agentic Model Customization, combines continual pretraining, supervised fine-tuning, and preference optimization to adapt a compact model to specialized domains while retaining strong agentic capabilities. The second stage, Inference Optimization, integrates speculative decoding and FP8 quantization with targeted calibration to enable cost-efficient serving with minimal quality loss. Across enterprise workloads, our framework enables rapid domain adaptation and achieves a 4.48x speedup in throughput while maintaining performance and improving robustness on long-tail scenarios.

[NLP-55] SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR ICML2026

Quick read: This paper examines why the standard selection heuristic for GRPO, picking the supervised fine-tuning (SFT) checkpoint with the highest pass@1, can fail. When SFT compresses the rollout distribution, the reward variance within each group shrinks. For binary rewards the expected within-group advantage variance is $p(1-p)(g-1)/g$; once early GRPO drives the pass probability $p$ below a threshold $p^*(g)$, most groups receive identical rewards and carry no group-relative signal. The authors build SFT depth ladders for Qwen2.5-Coder-3B and DeepSeek-Coder-6.7B. On Qwen, pre-RL pass@1 rises with SFT depth, yet peak GRPO pass@10 falls from 0.806 to 0.481 (3-seed mean, $n=20$), and pre-RL entropy is positively associated with the GRPO outcome ($\rho = +0.69$). On DeepSeek, pass@1 stays far above $p^*(8)=0.083$, so GRPO outcomes compress rather than invert. The proposed remedy is a two-stage diagnostic that combines pre-RL entropy triage with an early-GRPO entropy monitor to flag high-risk checkpoints and stop failing runs early. Simple KL-to-reference regularisation and label-smoothing variants do not rescue the collapsed Qwen checkpoint, which suggests the failure is not a trivial GRPO hyperparameter artefact.

Link: https://arxiv.org/abs/2606.18487
Authors: Siddharth Aphale, Kelly Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 14 pages, 6 figures. Accepted at the Deep Learning for Code (DL4C) Workshop at ICML 2026

Abstract:The standard heuristic of selecting the SFT checkpoint with the highest pass@1 for GRPO can fail when SFT compresses the rollout distribution. For binary rewards, the expected within-group advantage variance is $p(1-p)(g-1)/g$; when early GRPO drives $p$ below $p^*(g)$, most groups have identical rewards and provide no group-relative signal. We study SFT depth ladders for Qwen2.5-Coder-3B and DeepSeek-Coder-6.7B. We test Qwen2.5-Coder-3B across five depths and three seeds, and DeepSeek-Coder-6.7B across four matched depths and three seeds. On Qwen, pre-RL pass@1 rises with SFT depth, but peak GRPO pass@10 falls from 0.806 to 0.481 (3-seed mean, $n=20$); pre-RL entropy is positively associated with the GRPO outcome ($\rho=+0.69$). On DeepSeek, pass@1 remains far above $p^*(8)=0.083$, and GRPO outcomes compress rather than invert. A two-stage diagnostic, combining pre-RL entropy triage with an early GRPO entropy monitor, flags high-risk checkpoints and can stop failing runs early. Simple KL-to-reference regularisation and label smoothing variants do not rescue the collapsed Qwen checkpoint in our setting, suggesting the failure is not a trivial GRPO hyperparameter artefact.
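
The variance expression in the abstract can be checked numerically. The sketch below compares the closed form with an exact sum over the number of successful rollouts in a group of size g, treating each rollout as an independent pass or fail with probability p. The abstract does not define the threshold, so `p_star` is an assumption: it takes the threshold to be the pass rate at which half of all groups contain no successful rollout, a reading chosen because it reproduces the quoted value 0.083 for g = 8.

```python
from math import comb


def expected_group_variance(p, g):
    """Closed form quoted in the abstract: p(1-p)(g-1)/g."""
    return p * (1 - p) * (g - 1) / g


def enumerated_group_variance(p, g):
    """Same expectation, summed over k successful rollouts out of g."""
    total = 0.0
    for k in range(g + 1):
        prob = comb(g, k) * p**k * (1 - p) ** (g - k)
        mean = k / g
        # Population variance of a group with k ones and g - k zeros.
        total += prob * mean * (1 - mean)
    return total


def p_star(g):
    """Assumed threshold: solves (1 - p)^g = 1/2, i.e. half of groups all fail."""
    return 1 - 0.5 ** (1 / g)


threshold = p_star(8)
```

Both variance functions agree for any p and g, and the variance goes to zero as p approaches 0 or 1, which is the regime in which groups stop providing a relative signal.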

[NLP-56] PreUnlearn: Auditing Collateral Knowledge Damage Before Large Language Model Unlearning

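Quick read: This paper addresses collateral damage in unlearning for Large Language Models (LLMs): because related and even distant knowledge is entangled in the model, removing specified knowledge also affects knowledge that should be retained. Taking a data-centric perspective, the authors measure how unlearning effects propagate from the forget set to same-domain and distant-domain knowledge and find a consistent decay pattern: damage is strongest near the forget set and weakens with semantic distance, but does not disappear at domain boundaries. They then formulate forget-set auditing as a pre-unlearning prediction task and analyze which data features best predict downstream damage. Interaction features between the forget set and the evaluation set give the strongest signals, suggesting that collateral damage is partly reflected in data geometry before any model update occurs. This positions forget-set auditing as an early-warning tool for identifying risky unlearning runs and for designing more reliable unlearning procedures.

Link: https://arxiv.org/abs/2606.18473
Authors: Bo Su, Ankit Shah, Thai Le
Affiliations: Indiana University
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 6 figures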

Abstract:Machine unlearning for large language models (LLMs) aims to remove specified knowledge while preserving the rest of the model’s capabilities. However, the boundary between knowledge to forget and knowledge to retain is often unclear, since related and even distant information may be entangled in the model. In this paper, we study LLM unlearning from a data-centric perspective and measure how unlearning effects propagate from the forget set to same-domain and distant-domain knowledge. We find a consistent decay pattern: collateral damage is strongest near the forget set, weakens with semantic distance, but does not disappear at domain boundaries. We further ask whether such damage can be audited before unlearning is executed. We formulate forget-set auditing as a pre-unlearning prediction task and analyze which data features are most predictive of downstream damage. Our results show that interaction features between the forget set and evaluation set provide the strongest signals, suggesting that collateral damage is partly reflected in data geometry before model updates occur. These findings position forget-set auditing as an early warning tool for identifying risky unlearning runs and designing more reliable unlearning procedures.

[NLP-57] Possible or Definite? A Benchmark for Evaluating Diagnostic Uncertainty Preservation in Clinical Text

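Quick read: This paper addresses whether LLMs preserve diagnostic uncertainty when processing clinical text. Most studies evaluate the fluency and coherence of generated text, while the preservation of uncertainty expressions such as "possible pneumonia" remains underexplored. In clinical practice these expressions communicate the strength of the available evidence and guide decisions about follow-up testing and treatment, so altering them can change the clinical meaning entirely. The authors construct a benchmark of 1,200 clinical documents with 9,184 uncertainty annotations across five levels and evaluate three LLMs on it. The results show that the models preserve the original uncertainty cues poorly, often less than half the time, and struggle with the nuanced distinctions between adjacent levels. The work reveals a failure mode that standard evaluation metrics do not capture and has implications for the safe deployment of LLMs in clinical workflows.

Link: https://arxiv.org/abs/2606.18471
Authors: Hongbo Du, Zixin Lu, Jiaming Qu
Affiliations: Trine University; University of Michigan; Amazon
Categories: Computation and Language (cs.CL)
Comments: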

Abstract:Large language models (LLMs) are increasingly used for clinical text tasks such as summarization and revision. While most studies evaluate the fluency and coherence of LLM-generated text, whether LLMs correctly preserve diagnostic uncertainty remains underexplored. In clinical practice, phrases such as "possible pneumonia" communicate the strength of available evidence and directly guide decisions about follow-up testing and treatment. Altering these uncertainty expressions can change the clinical meaning entirely. In this paper, we systematically evaluated this problem in two steps. First, we constructed a benchmark of 1,200 clinical documents with 9,184 uncertainty annotations across five levels. Second, we evaluated three LLMs on this benchmark. Our results show that (1) LLMs preserve the original uncertainty cues poorly, often less than half the time; (2) LLMs struggle with nuanced distinctions between adjacent levels. This work reveals a failure mode not captured by standard evaluation metrics and provides implications for the safe deployment of LLMs in clinical workflows.
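
One way to picture the two reported findings is as two rates computed over annotated cues. The sketch below is an assumption-laden illustration and not the benchmark's scoring code: it codes the five uncertainty levels as integers 1 to 5, uses `None` for a cue that disappears from the model output, and invents the function and key names.

```python
def preservation_stats(source_levels, output_levels):
    """Compare uncertainty levels before and after an LLM rewrites a document.

    Each position is one annotated cue. Levels are integers 1-5 (hypothetical
    coding); an output level of None means the cue was dropped.
    """
    if len(source_levels) != len(output_levels) or not source_levels:
        raise ValueError("inputs must be non-empty and the same length")
    preserved = 0
    adjacent = 0
    for src, out in zip(source_levels, output_levels):
        if out == src:
            preserved += 1
        elif out is not None and abs(out - src) == 1:
            # Shift to a neighbouring level, e.g. "possible" to "probable".
            adjacent += 1
    n = len(source_levels)
    return {"preserved": preserved / n, "adjacent_shift": adjacent / n}


stats = preservation_stats([1, 2, 3, 4, 5, 3], [1, 3, 3, None, 5, 1])
```

Separating exact preservation from shifts to a neighbouring level makes it possible to tell cues that are lost outright from cues whose strength is subtly changed.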

[NLP-58] Montreal Forced Aligner and the state of speech-to-text alignment in 2026

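Quick read: This paper documents the development of the Montreal Forced Aligner (MFA) from version 1.0 to 3.0 and evaluates the accuracy and generalization of forced alignment across languages, particularly for languages and dialects outside the training distribution. Three elements are central: (1) model adaptation and cross-language phone remapping, which extend the aligner to languages it was not trained on; (2) pronunciation probability modeling and phonological rules, which improve alignment in specific conditions; and (3) larger open-source datasets and harmonized IPA dictionaries, which broaden language and dialect coverage. Benchmarked against classic and neural forced aligners on English, Japanese, and Korean, MFA 3.0 achieves state-of-the-art or near state-of-the-art performance on all four benchmark datasets, with mean boundary errors below 15 ms.

Link: https://arxiv.org/abs/2606.18466
Authors: Michael McAuliffe, Kaylynn Gunter, Michael Wagner, Morgan Sonderegger
Affiliations: University of Wisconsin–Madison; McGill University and Centre for Brain, Language, and Music; University of Oregon
Categories: Computation and Language (cs.CL)
Comments: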

Abstract:The Montreal Forced Aligner (MFA) was released in 2016 and has since become the most widely used tool for forced alignment in research and industry. In the decade since, MFA has undergone substantial development, including expanded coverage across more languages and dialects using larger open-source datasets, harmonized IPA dictionaries, model adaptation, cross-language phone remapping, and support utilities. This paper documents MFA 3.0’s developments since version 1.0 and evaluates MFA’s performance across English, Japanese, and Korean, benchmarked against classic and neural forced aligners. MFA 3.0 achieves state-of-the-art or near state-of-the-art performance across all four benchmark datasets with mean boundary errors below 15 ms. Adaptation and cross-language remapping are effective for languages outside MFA’s training distribution, and pronunciation probability modeling and phonological rules provide gains in specific conditions.
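
The headline metric, mean boundary error, can be sketched as the average absolute time difference between predicted and reference boundaries. This is a minimal illustration and not MFA's evaluation code: the abstract does not spell out the metric, so the one-to-one pairing of boundaries and the function name are assumptions.

```python
def mean_boundary_error_ms(predicted, reference):
    """Mean absolute boundary difference in milliseconds.

    `predicted` and `reference` are boundary times in seconds, assumed to be
    paired one to one in the same order.
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    total = sum(abs(p - r) for p, r in zip(predicted, reference))
    return 1000.0 * total / len(predicted)


error = mean_boundary_error_ms([0.10, 0.50, 1.00], [0.11, 0.48, 1.00])
```

In this toy case the three boundaries are off by 10 ms, 20 ms, and 0 ms, so the mean comes out at about 10 ms.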

[NLP-59] LLM Parameters for Math Across Languages: Shared or Separate? ACL

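Quick read: This paper asks why multilingual large language models (LLMs) show substantial cross-lingual variation in mathematical reasoning: do the differences reflect language-specific parameters, or a shared mechanism that manifests differently by language? The authors present a cross-lingual mechanistic analysis that localizes and compares the model parameters supporting mathematical reasoning across languages. The extracted math-associated parameters overlap partially across languages, with the strongest overlap in intermediate layers. English consistently yields the largest set of math-relevant parameters, whereas lower-resource languages reveal smaller sets. The conclusion is that math-related behavior in multilingual LLMs is neither fully language-invariant nor fully language-specific, and instead combines partial cross-lingual parameter overlap with systematic language-dependent differences.

Link: https://arxiv.org/abs/2606.18453
Authors: Behzad Shomali, Luisa Victor, Tim Selbach, Ali Hamza Bashir, David Berghaus, Joachim Koehler, Mehdi Ali, Markus Frey
Affiliations: Lamarr Institute; University of Bonn; Fraunhofer IAIS
Categories: Computation and Language (cs.CL)
Comments: 5 pages. Accepted at ACL Student Research Workshop (SRW) 2026. Code: this https URL Translated Datasets: this https URL Webpage: https://math-across-languages.github.io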

Abstract:Large language models (LLMs) exhibit substantial cross-lingual variation in mathematical reasoning performance, but it remains unclear whether these differences reflect language-specific parameters or a shared mechanism that manifests differently by language. We present a cross-lingual mechanistic analysis of mathematical reasoning in LLMs, enabling us to localize and compare model parameters that support mathematical reasoning across languages. We find that the extracted math-associated parameters exhibit partial cross-lingual overlap, with the strongest overlap concentrated in intermediate model layers. We further observe that English consistently produces the largest set of math-relevant parameters, whereas lower-resource languages reveal smaller sets of relevant parameters. These results suggest that math-related behavior in multilingual LLMs is neither fully language-invariant nor fully language-specific, but instead exhibits partial cross-lingual parameter overlap with systematic language-dependent differences.
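
A layer-by-layer overlap of this kind could be quantified with a set similarity such as the Jaccard index. The abstract does not say which overlap measure the authors use, so the choice of Jaccard, the data layout (`params_by_lang[language][layer]` holding parameter indices), and the function names are all assumptions for illustration.

```python
def jaccard(a, b):
    """Jaccard index of two collections of parameter indices."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def layerwise_overlap(params_by_lang, lang_a, lang_b):
    """Per-layer overlap between the math-associated parameters of two languages.

    `params_by_lang[lang][layer]` is an iterable of parameter indices
    (hypothetical layout); layers missing for a language count as empty.
    """
    layers_a = params_by_lang[lang_a]
    layers_b = params_by_lang[lang_b]
    layers = sorted(set(layers_a) | set(layers_b))
    return {
        layer: jaccard(layers_a.get(layer, ()), layers_b.get(layer, ()))
        for layer in layers
    }


toy = {"en": {0: [1, 2, 3], 1: [4, 5]}, "de": {0: [2, 3, 4], 1: [4, 5]}}
overlap = layerwise_overlap(toy, "en", "de")
```

Plotting such per-layer values against depth would show whether overlap peaks in the intermediate layers, as the abstract reports.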

[NLP-60] VISUALSKILL: Multimodal Skills for Computer-Use Agents

【速读】: 该论文旨在解决计算机使用代理(Computer-use Agents, CUAs)在长周期任务和未见过的软件应用中表现不佳的问题。现有技能库虽通过可复用技能缓解此问题,但仅以文本形式表示技能,忽视了图形用户界面(GUI)交互的视觉特性。为此,本文提出VISUALSKILL——一种面向特定目标应用的分层多模态技能框架,其结构为基于各主题文件的中央索引,由一个load_topic MCP工具按需加载相关文本与图像。该技能构建采用两阶段流程:结合人工编写的文档与实时应用界面探索。在CUA-World与OSExpert-Eval两个基准测试中,基于Claude Opus 4.6的Claude Code CLI代理在引入VISUALSKILL后平均得分达0.456,较无技能基线(0.303)提升15.3个百分点;相较于仅由相同源内容生成但仅含文本的对照技能(0.373),进一步提升8.3个百分点。这一结果直接证明,在技能表示中保留视觉图像而非将其转化为文字描述,有助于代理更准确地识别UI元素并验证每一步操作后的流程状态,从而显著提升性能。

Link: https://arxiv.org/abs/2606.18448
Authors: Ziyan Jiang,Li An,Yujian Liu,Jiabao Ji,Qiucheng Wu,Jacob Andreas,Yang Zhang,Shiyu Chang
Affiliations: UC Santa Barbara; MIT CSAIL; MIT-IBM Watson AI Lab
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Computer-use agents (CUAs) approach human-level performance on standardised benchmarks but still struggle on long-horizon tasks and unseen software. Existing skill libraries address this with reusable skills, but represent the skill artifact as text only, despite the visual nature of GUI interaction. We propose VISUALSKILL: a hierarchical multimodal skill, tailored to each target application and organised as a central index over per-topic files, which the agent consumes through a load_topic MCP tool that fetches the relevant topic’s text and figures on demand. We construct each skill with a two-stage pipeline that combines authored documentation with live-application UI exploration. On two CUA benchmarks, CUA-World and OSExpert-Eval, a Claude Code CLI agent backed by Claude Opus 4.6 reaches an average score of 0.456 with VISUALSKILL, a +15.3 point absolute lift over the no-skill baseline (0.303). Against a matched text-only skill that is generated from the same source content and differs from VISUALSKILL only in modality, VISUALSKILL yields a further +8.3 point absolute gain over the matched text-only skill (0.373 vs. 0.456), providing direct evidence that retaining visual figures in the skill artifact, rather than verbalizing them away, helps the agent both identify UI elements and verify workflow state after each action. Our code is available at this https URL.

[NLP-61] CoreMem: Riemannian Retrieval and Fisher-Guided Distillation for Long-Term Memory in Dialogue Agents

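[Quick read]: This paper addresses the severe memory and compute bottlenecks of giving personalized dialogue agents continuous long-term memory on resource-constrained edge devices (e.g., 8 GB VRAM). Existing systems rely on isotropic cosine similarity for retrieval and heuristic rules for context compression; they lack a unified theoretical foundation and often suffer from the hubness problem in high-dimensional retrieval and from syntactic fragmentation during compression. The proposed solution is CoreMem, a resource-efficient edge-cloud memory architecture unified by information geometry. Its two key components are: (1) Riemannian retrieval, which replaces cosine matching with a locally adaptive Fisher-Rao metric, penalizes hub memories via Mahalanobis distance, and uses O(Ndr) Woodbury acceleration for real-time search; and (2) Fisher-guided discrete token distillation (FDTD), a hierarchical sentence-to-token compression mechanism that derives sensitivity scores from Fisher information traces, yielding a principled compression-KL tradeoff with explicit protection of structural syntax. On the LOCOMO and LongMemEval-S benchmarks, CoreMem improves Open-domain (+4.51 pp) and Temporal (+4.17 pp) reasoning while staying within a strict 8 GB VRAM budget.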

Link: https://arxiv.org/abs/2606.18406
Authors: Jiaqi Chen,Yongqin Zeng,Shaoshen Chen,Yijian Zhang,Hai-Tao Zheng,Chunxia Ma,XiuTeng Zhou
Affiliations: Shenzhen International Graduate School, Tsinghua University; Peng Cheng Laboratory; Shandong Analysis and Test Center, Qilu University of Technology, Jinan, China; State Key Laboratory for Quality Ensurance and Sustainable Use of Dao-di Herbs, Beijing, China
Categories: Computation and Language (cs.CL)
Comments: 15 pages, 5 figures

Abstract:Personalized dialogue agents require continuous long-term memory to maintain coherent interactions across multiple sessions. However, deploying these capabilities on consumer-grade hardware (e.g., 8 GB VRAM edge devices) introduces severe memory and compute bottlenecks. Existing systems typically rely on isotropic cosine similarity for retrieval and heuristic rules for context compression. These approaches lack a unified theoretical foundation, frequently suffering from the hubness problem in high-dimensional retrieval and syntactic fragmentation during compression. To overcome these limitations, we propose CoreMem, a resource-efficient edge-cloud memory architecture fundamentally unified by information geometry. First, Riemannian retrieval replaces cosine matching with a locally adaptive Fisher-Rao metric, effectively penalizing hub memories via Mahalanobis distance with O(Ndr) Woodbury acceleration for real-time search. Second, Fisher-guided discrete token distillation (FDTD) introduces a hierarchical sentence-to-token compression mechanism. It derives sensitivity scores from Fisher information traces, providing a principled compression-KL tradeoff augmented with explicit structural syntax protection. Evaluated on the LOCOMO and LongMemEval-S benchmarks, CoreMem achieves strong accuracy improvements, yielding substantial gains in Open-domain (+4.51 pp) and Temporal (+4.17 pp) reasoning. Extensive profiling confirms that CoreMem operates seamlessly within a strict 8 GB VRAM budget, successfully bridging the gap between resource-constrained edge devices and the demand for theoretically grounded, lifelong memory agents.
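
To make the retrieval idea concrete, the sketch below scores memories with a squared Mahalanobis distance whose covariance is a diagonal plus a rank-1 term, inverted in closed form with the Sherman-Morrison formula (the rank-1 case of the Woodbury identity). The covariance structure, the function names, and the toy inputs are assumptions for illustration only; the abstract does not say how CoreMem constructs its Fisher-Rao metric.

```python
def mahalanobis_sq_rank1(x, y, diag, u):
    """Squared Mahalanobis distance between x and y under
    Sigma = diag(diag) + u u^T, without forming or inverting Sigma.

    Sherman-Morrison gives
        Sigma^{-1} = D^{-1} - (D^{-1} u u^T D^{-1}) / (1 + u^T D^{-1} u),
    so each distance costs O(d) rather than O(d^2).
    """
    delta = [a - b for a, b in zip(x, y)]
    # delta^T D^{-1} delta
    base = sum(v * v / s for v, s in zip(delta, diag))
    # u^T D^{-1} delta
    proj = sum(ui * v / s for ui, v, s in zip(u, delta, diag))
    # 1 + u^T D^{-1} u
    denom = 1.0 + sum(ui * ui / s for ui, s in zip(u, diag))
    return base - proj * proj / denom


def rank_memories(query, memories, diag, u):
    """Return memory indices sorted from nearest to farthest (hypothetical helper)."""
    scored = [(mahalanobis_sq_rank1(query, m, diag, u), i)
              for i, m in enumerate(memories)]
    return [i for _, i in sorted(scored)]
```

With a rank-r correction the same identity needs only an r-by-r solve per query, which is where a per-memory cost linear in d and r would come from.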

[NLP-62] JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

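[Quick read]: This paper addresses the scaling ceiling of speculative decoding (SD) for autoregressive large language models (LLMs). Existing head-based SD methods face a causality-efficiency dilemma. Autoregressive drafters produce path-conditioned candidates that suit tree speculative decoding and raise acceptance length, but their drafting cost grows with tree depth. Bidirectional block-diffusion drafters generate all positions in one forward pass, but because they ignore dependencies between branches, the resulting candidate trees can be internally inconsistent, which wastes draft budget and lowers acceptance. The authors propose JetFlow, a head-based SD framework that trains a causal parallel draft head over fused hidden states from the frozen target model, combining one-forward drafting efficiency with branch-wise causal conditioning. The scores of the resulting candidate trees align with the target model's autoregressive factorization, so larger draft budgets convert into longer accepted prefixes and higher end-to-end speedup. On dense and MoE Qwen3 models, JetFlow outperforms bidirectional-head and tree-based SD baselines on math, coding, and chat benchmarks, reaching up to 9.64x speedup on MATH-500 and 4.58x on open-ended conversational workloads on H100 GPUs, with further latency gains through vLLM integration under realistic serving loads.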

Link: https://arxiv.org/abs/2606.18394
Authors: Lanxiang Hu,Zhaoxiang Feng,Yulun Wu,Haoran Yuan,Yujie Zhao,Yu-Yang Qian,Bojun Wang,Daxin Jiang,Yibo Zhu,Tajana Rosing,Hao Zhang
Affiliations: UC San Diego; Zhejiang University; UIUC; Nanjing University; StepFun
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Speculative decoding (SD) accelerates autoregressive Large Language Models (LLMs) by drafting multiple tokens and verifying them in parallel, but it faces a scaling limitation: increasing the draft budget improves speed only when acceptance remains high and drafting overhead stays low. This ceiling has been difficult to break because prior head-based SD methods face a causality-efficiency dilemma. Autoregressive drafters produce path-conditioned candidates that are effective for tree speculative decoding with higher acceptance length, but their drafting cost grows with tree depth. Bidirectional block-diffusion drafters generate all positions in one pass, but their branch-agnostic marginals can form individually plausible yet mutually inconsistent trees, wasting budget and reducing acceptance. We propose JetFlow, a head-based SD framework that combines one-forward drafting efficiency with branch-wise causal conditioning. JetFlow trains a causal parallel draft head over fused hidden states from the frozen target model, producing candidate trees whose scores align with the target model’s autoregressive factorization. This enables JetFlow to convert larger draft budgets into longer accepted prefixes and higher end-to-end speedup. Across math, coding, and chat benchmarks on dense and MoE Qwen3 models, JetFlow consistently outperforms bidirectional-head and tree-based SD baselines. On H100 GPUs, JetFlow achieves up to 9.64x speedup on MATH-500 and 4.58x on open-ended conversational workloads, with further latency gains demonstrated through vLLM integration under realistic serving loads. Our code and models are available at this https URL.
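
The notion of an "accepted prefix" in tree speculative decoding can be illustrated with a minimal greedy-verification sketch. It is not JetFlow's drafting or verification code: `toy_target` is a hypothetical stand-in for the target model, the candidate paths are made up, and the target is queried sequentially here for clarity, whereas real systems verify all tree nodes in one parallel pass.

```python
def accepted_length(prefix, path, target_next_token):
    """Count the leading draft tokens in `path` that the target model
    would also have produced greedily, one position at a time."""
    context = list(prefix)
    accepted = 0
    for token in path:
        if target_next_token(context) != token:
            break  # first mismatch ends the accepted prefix
        context.append(token)
        accepted += 1
    return accepted


def best_path(prefix, candidate_paths, target_next_token):
    """Pick the root-to-leaf draft path with the longest accepted prefix."""
    return max(candidate_paths,
               key=lambda p: accepted_length(prefix, p, target_next_token))


def toy_target(context):
    """Hypothetical target model: always continues with last token + 1."""
    return context[-1] + 1
```

A drafter whose branches are conditioned on their own path tends to place more tokens on a branch the target agrees with, which is what raises the accepted length for a given draft budget.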

[NLP-63] Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation

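[Quick read]: This paper addresses the cost and diversity of synthetic data generation for low-resource languages. The current best-performing approaches typically rely on few-shot prompting with target-language examples, which improves downstream task performance but increases inference cost and can reduce diversity through lexical anchoring. The authors investigate activation steering as an alternative, which steers the language and quality of generated text by intervening on the internal representations of a large language model (LLM). Two strategies are studied: Language Steering, which targets the linguistic identity of the target language, and Quality Steering, which captures well-formedness by contrasting representations of human-written and backtranslated text. The methods are evaluated across four open-source LLMs, multiple layers, and 11 typologically diverse languages by generating sentiment and topic classification data in both zero-shot and few-shot prompting settings. Steering on early layers consistently improves the diversity of the generated data and often yields stronger downstream model performance, particularly for low-resource languages.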

Link: https://arxiv.org/abs/2606.18389
Authors: Jan Cegin,Daniil Gurgurov,Yusser Al Ghussin,Simon Ostermann
Affiliations: Kempelen Institute of Intelligent Technologies; German Research Institute for Artificial Intelligence
Categories: Computation and Language (cs.CL)
Comments: 25 pages

Abstract:Large language models (LLMs) have become an effective tool for synthetic data generation, including for low-resource languages, where generated data can improve downstream task performance. Current best-performing approaches typically rely on few-shot prompting with target-language examples, which increases inference costs and may reduce diversity through lexical anchoring. In this work, we investigate activation steering as an alternative for low-resource synthetic data generation. We study two steering strategies: Language Steering, which targets the linguistic identity of a language, and Quality Steering, which captures well-formedness by contrasting human-written and backtranslated text representations. We evaluate these methods across four open-source LLMs, multiple layers, and 11 typologically diverse languages by generating sentiment and topic classification data and finetuning smaller classifiers. Steering is applied in both zero-shot and few-shot prompting settings and compared against non-steered counterparts. Our results show that steering on early layers consistently improves the diversity of generated data while often yielding stronger downstream model performance, particularly for low-resource languages.
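
One common way to build a steering vector is the difference between mean activations of two contrasting text sets, added to a hidden state at inference time. The sketch below shows that formulation on plain Python lists. Whether the paper computes its Language and Quality vectors this way is an assumption, and the function names and the scaling factor `alpha` are hypothetical.

```python
def mean_vector(rows):
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(target_acts, contrast_acts):
    """Difference of means: points from the contrast set toward the target set
    (e.g. human-written minus backtranslated representations)."""
    mu_target = mean_vector(target_acts)
    mu_contrast = mean_vector(contrast_acts)
    return [a - b for a, b in zip(mu_target, mu_contrast)]


def apply_steering(hidden, vector, alpha=1.0):
    """Shift one hidden state along the steering direction by strength alpha."""
    return [h + alpha * v for h, v in zip(hidden, vector)]
```

In practice the vector would be computed per layer from real model activations, and the layer and `alpha` would be tuned; the abstract reports that early layers worked best.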

[NLP-64] From Sparse Features to Trustworthy Proxies: Certifying SAE-Based Interpretability

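[Quick read]: This paper asks when a sparse autoencoder (SAE)-based explanation of a language model (LM) can be treated as faithful, that is, when features extracted by an SAE reflect the behavior of the underlying frozen LM. The solution is a post-hoc generalization framework that builds a certifiable sparse proxy by replacing a native hidden activation with its pretrained SAE reconstruction, and then derives an upper bound on the base model's expected risk from four measurable quantities: proxy risk, SAE reconstruction gap, concept-pool mismatch, and sparse complexity. The bound serves as an operational criterion for explanatory faithfulness: a non-vacuous bound indicates that the extracted sparse features retain meaningful predictive information, while small reconstruction and mismatch errors indicate that the proxy stays behaviorally close to the original model. Empirically, the bound becomes non-vacuous on GPT-2 Small, Gemma-2B, and Llama-3-8B at practical sample sizes. A layerwise analysis of Llama-3-8B shows a strong depth dependence: later layers are much easier to certify, which is associated with stronger local fidelity and weaker downstream error amplification. Feature-shuffling ablations further show that the decomposition distinguishes genuine semantic alignment from mere statistical sparsity, giving a diagnostic for when SAE-based explanations become less reliable.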

Link: https://arxiv.org/abs/2606.18383
Authors: Dibyanayan Bandyopadhyay,Asif Ekbal
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Sparse autoencoders (SAEs) are increasingly used to extract interpretable features from language models (LMs), yet a central question remains: when can an SAE-based explanation be treated as a faithful view of an underlying frozen LM? We study this through a post-hoc generalization framework that certifies the LM via a sparse proxy, obtained by replacing a native hidden activation with its pretrained SAE reconstruction. Our framework derives an upper bound on the base model’s expected risk using four measurable quantities: proxy risk, SAE reconstruction gap, concept-pool mismatch, and sparse complexity. We interpret this certificate as an operational criterion for explanatory faithfulness. In particular, a non-vacuous bound indicates that the extracted sparse features retain meaningful predictive information, while small reconstruction and mismatch errors indicate that the proxy remains behaviorally close to the original model. Empirically, we show that the bound becomes non-vacuous on GPT-2 Small, Gemma-2B, and Llama-3-8B at practical sample sizes. A detailed layerwise analysis of Llama-3-8B reveals a strong depth dependence, with later layers becoming much easier to certify, associated with both stronger local fidelity and weaker downstream error amplification. Finally, through feature-shuffling ablations, we show that the decomposition distinguishes genuine semantic alignment from mere statistical sparsity, providing a useful diagnostic for when SAE-based explanations become less reliable.

[NLP-65] Redact or Keep? A Fully Local AI Cascade for Educational Dialogue De-Identification

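[Quick read]: This paper addresses de-identification of sensitive information in educational dialogue data. The core tension is protecting student privacy (personally identifiable information, PII) without wrongly redacting curricular terms: "Riemann" may refer to a real student or to a mathematical concept. Existing approaches either rely on third-party commercial large language models (LLMs), which handle the ambiguity but require sending student data off-site, or use local named entity recognition (NER) systems, which preserve data governance but over-redact curricular terms. The authors propose a fully local cascade that reframes de-identification from open-ended entity recognition to constrained privacy triage. First, a recall-first union proposer combines two lightweight encoders with deterministic rules to over-generate candidate spans; then a context-aware reviewer makes a binary Redact/Keep decision for each candidate using the surrounding dialogue and the speaker role. The strongest local configuration reaches 0.958 macro F1, compared with 0.767 for a same-family LLM-only baseline and 0.706 for the commercial API, while running entirely on a single laptop. On a targeted challenge set of curricular-personal name ambiguity, the same configuration degrades by only 0.03 F1, versus 0.19 to 0.25 for smaller reviewers. The results suggest that for educational de-identification, problem formulation matters more than model scale.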

Link: https://arxiv.org/abs/2606.18372
Authors: Haocheng Zhang,Zhuqian Zhou,Kirk Vanacore,Bakhtawar Ahtisham,René F. Kizilcec
Affiliations: Cornell University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Educational dialogue is a valuable but sensitive resource for research: the same transcripts that capture authentic learning often capture personally identifiable information (PII) entangled with curricular content, where “Riemann” may refer to a real student or to a mathematical concept. Existing approaches force a tradeoff between governance and accuracy. Commercial Large Language Models (LLMs) can handle this ambiguity but require sending student data to third parties, while local named entity recognition (NER) systems preserve governance but over-redact curricular terms. We propose a fully local cascade framework that reframes de-identification from open-ended entity recognition to constrained privacy triage. A recall-first union proposer combines two lightweight encoders with deterministic rules to over-generate candidate spans; a context-aware reviewer then makes a binary Redact/Keep decision for each candidate using surrounding dialogue and speaker role. We evaluate three reviewer configurations against same-family LLM-only baselines and a commercial API on math tutoring transcripts from two large platforms. The strongest local configuration reaches 0.958 macro F1, compared with 0.767 for a same-family LLM-only baseline and 0.706 for the commercial API, while running entirely on a single laptop. On a targeted challenge set of curricular-personal name ambiguity, the same configuration degrades by only 0.03 F1 versus 0.19 to 0.25 for smaller reviewers. These results suggest that for educational de-identification, problem formulation matters more than model scale.
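
The propose-then-review structure can be sketched in a few lines. Everything concrete here is a stand-in: the two regex proposers replace the paper's lightweight encoders and rules, `CURRICULAR_FOLLOWERS` is a hypothetical word list, and the `review` stub replaces the context-aware reviewer model (it looks only at the next word and ignores speaker role).

```python
import re

# Hypothetical list of nouns that mark a preceding name as a curricular term.
CURRICULAR_FOLLOWERS = {"sum", "integral", "hypothesis", "theorem"}


def propose_capitalized(text):
    """Recall-first proposer: every capitalized word is a candidate span."""
    return {(m.start(), m.end()) for m in re.finditer(r"\b[A-Z][a-z]+\b", text)}


def propose_id_numbers(text):
    """Deterministic rule: long digit runs are candidate identifiers."""
    return {(m.start(), m.end()) for m in re.finditer(r"\b\d{6,}\b", text)}


def review(text, span):
    """Stub reviewer: Keep a candidate followed by a curricular noun, else Redact."""
    _, end = span
    following = re.match(r"\s+(\w+)", text[end:])
    if following and following.group(1).lower() in CURRICULAR_FOLLOWERS:
        return "Keep"
    return "Redact"


def deidentify(text):
    """Union the proposers' spans, then redact only those the reviewer rejects."""
    candidates = propose_capitalized(text) | propose_id_numbers(text)
    to_redact = sorted((s for s in candidates if review(text, s) == "Redact"),
                       reverse=True)  # right to left keeps offsets valid
    for start, end in to_redact:
        text = text[:start] + "[REDACTED]" + text[end:]
    return text
```

For example, `deidentify("can Riemann explain the Riemann sum? id 12345678")` redacts the first "Riemann" and the digit run but keeps "Riemann sum". The proposers deliberately over-generate, so the quality of the final output rests on the reviewer's binary decision.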

[NLP-66] Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier

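[Quick read]: This paper addresses the supply of frontier tasks as the limiting resource when training agents with reinforcement learning (RL): how to keep generating tasks that are valid, solvable, and just difficult enough for the current model. As reasoning and agentic models improve, fixed task distributions saturate, while naive synthetic generation yields tasks that are trivial, impossible, or ill-posed. Training a task generator with RL normally requires repeated solver rollouts per candidate, and for software-engineering (SWE) tasks a single rollout can take tens of minutes, which makes solver-in-the-loop generator training intractable. The authors introduce PROPEL, a solver-amortized framework: a lightweight activation probe is trained on a one-time labeled corpus of generated tasks and solver outcomes, predicts the target solver's pass rate from a frozen generator reference model, and serves as a proxy for solve rate during generator optimization, reducing generator evaluation to a single forward pass. Across math, code, and software-engineering tasks, PROPEL shifts generation toward the targeted solve rate. For coding, the share of tasks generated at the learnable frontier rises from 10.1% to 20.0% for a Qwen2.5-3B-Instruct solver; for SWE, the share of generations at the targeted solve rate rises from 9.8% to 19.6% for Qwen3.5-27B on repositories not seen while training the probe and generator.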

Link: https://arxiv.org/abs/2606.18284
Authors: Lorenz Wolf,Connor Watts,Roger Creus Castanyer,Geoffrey Bradway,Maxwill Lin,Augustine N. Mavor-Parker,Matthew Daborn-Sargent
Affiliations: VMAX.AI
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 30 pages, 9 figures, 12 tables

Abstract:The limiting resource for training agents via reinforcement learning (RL) is increasingly frontier task supply: valid, solvable tasks just difficult enough to train the current model. As reasoning and agentic models improve, fixed task distributions saturate, while naive synthetic generation yields tasks that are trivial, impossible, or ill-posed. Training a task generator with RL to optimize validity and learnability can address this bottleneck, but direct optimization requires repeated solver rollouts per candidate. For software-engineering (SWE) tasks, a single rollout can take tens of minutes; solver-in-the-loop generator training is intractable. We introduce PROPEL, a solver-amortized framework for training task generators at the targeted solve rate. PROPEL trains a lightweight activation probe on a one-time labeled corpus of generated tasks and solver outcomes. The probe predicts target-solver pass rate from a frozen generator reference model and serves as a proxy for solve rate during generator optimization, reducing generator evaluation to a single forward pass. Across math, code, and software-engineering at multiple model scales, PROPEL shifts generation toward the targeted solve rate: for coding, tasks generated at the learnable frontier increase from 10.1% to 20.0% for a Qwen2.5-3B-Instruct solver and from 5.3% to 12.6% for a Qwen2.5-7B-Instruct solver. For SWE, PROPEL increases the share of generations at the targeted solve rate from 9.8% to 19.6% for Qwen3.5-27B on repositories not seen during training of probe and generator.

[NLP-67] Continuous Audio Thinking for Large Audio Language Models

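[Quick read]: This paper addresses the loss of acoustic information (phonetic detail, prosody, sound events, affect, pitch) in large audio language models (LALMs): because these models are typically trained to produce text-aligned responses, their hidden states are progressively shaped for text generation rather than for preserving acoustic content. The challenge is to retain and use this acoustic information without adding inference cost. The authors propose Continuous Audio Thinking (CoAT), a framework that gives the model a continuous latent workspace for organizing acoustic information before it generates a response. The workspace is supervised by distillation from audio expert models, so the model can draw on the acoustic context when producing its text response. The continuous thinking block can be processed in a single prefill, so CoAT adds no autoregressive decoding cost over the baseline. CoAT improves performance on a broad benchmark suite spanning audio reasoning, audio understanding, music classification, speech emotion, and speech transcription, and further analysis confirms that the auxiliary supervision propagates from the thinking positions to the model's textual responses.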

Link: https://arxiv.org/abs/2606.18273
Authors: Gyojin Han,Dong-Jae Lee,Changho Choi,Jongsuk Kim,Junmo Kim
Affiliations: KAIST
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Preprint

Abstract:Large audio language models (LALMs) have shown impressive capabilities on diverse audio understanding tasks, ranging from speech transcription to music analysis. However, because LALMs are typically trained to produce text-aligned responses, their hidden states are progressively shaped for text generation rather than for preserving acoustic information. As a result, the diverse acoustic content that audio carries, such as phonetic detail, prosody, sound events, affect, and pitch, is lost along the way and difficult to leverage in the response. We introduce Continuous Audio Thinking (CoAT), a framework that equips audio language models with a continuous latent workspace for organizing acoustic information prior to response generation, grounded by distillation from audio experts. Within the thinking space, the model can utilize the rich acoustic information provided by expert distillation when generating its response. Furthermore, the proposed continuous thinking block can be processed in a single prefill, so CoAT does not require additional autoregressive decoding cost over the baseline. Across three LALMs, Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo 3, performance gains on a broad benchmark suite spanning audio reasoning, audio understanding, music classification, speech emotion, and speech transcription demonstrate the effectiveness of CoAT. Further analysis confirms that the auxiliary supervision propagates from the thinking positions to the model’s textual responses.

[NLP-68] Simulating Hate Speech Cascades with Multi-LLM Agents: Empirical Grounding Modeling Fidelity and Intervention Strategies

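[Quick read]: This paper addresses the difficulty of faithfully modeling how hateful content propagates on online platforms. Classical cascade models do not explicitly represent user profiles, community structure, or content features, so the moderation strategies they suggest may be less effective in real deployments. The authors build a multi-agent large language model (LLM) system in which each reshare decision depends on the user's profile, the surrounding community, and the post's content, and test whether this reproduces real hateful cascades more faithfully than classical baselines. Analysis of three hateful Bluesky cascades and a size-matched benign control shows that 97.4–99.7% of reposters take a hostile stance, that toxicity-engagement homophily is higher on the diffusion tree than on the follower graph for hateful cascades, and that hateful cascades are star-like (most reposts come directly from the root) whereas the benign cascade is tree-like (reposts propagate through multi-hop chains). In simulation, the multi-LLM-agent simulator reproduces the stance monoculture and the toxicity-delta direction. A structured ablation identifies agent heterogeneity as the leading fidelity factor, and amplifier targeting on dense networks yields a 7.5–12.9% reduction at 5.7% benign collateral.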

Link: https://arxiv.org/abs/2606.18264
Authors: Fan Huang
Affiliations: Indiana University Bloomington; OpenAI
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Faithful modeling of hateful content propagation on online platforms remains an open problem for moderation research. Classical cascade models that do not explicitly represent the profile, community, and content factors associated with hateful-content propagation may yield moderation strategies that behave less effectively when deployed in real-world scenarios. Multi-agent large language model (LLM) systems can, in principle, make each reshare decision depend on the user’s profile, the surrounding community, and the post’s content, but it remains unclear whether this added flexibility actually reproduces real hateful cascades more faithfully than classical baselines. We study three hateful Bluesky cascades and a size-matched benign control. In the empirical Bluesky data, we found that: 97.4–99.7% of reposters take a hostile stance; toxicity-engagement homophily is higher on the diffusion tree than on the follower graph for hateful cascades; topology is star-like for the hateful cascades (most reposts come directly from the root) versus tree-like for the benign cascade (reposts propagate through multi-hop chains). In simulation, a multi-LLM-agent simulator reproduces the stance monoculture and the toxicity-delta direction. A structured ablation identifies agent heterogeneity as the leading fidelity factor, and amplifier targeting on dense networks yields 7.5–12.9% reduction at 5.7% benign collateral.
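
The star-like versus tree-like distinction can be made concrete with two simple descriptors of a repost tree: the share of reposts taken directly from the root and the maximum chain depth. The function below is a sketch under the assumption that each repost records the post it was reshared from; the abstract does not state which statistics the authors computed, and `cascade_shape` and its output keys are hypothetical names.

```python
def cascade_shape(parent_of, root):
    """Describe a repost tree.

    parent_of maps each repost id to the id it was reshared from;
    `root` is the id of the original post.
    """
    def depth(node):
        # Number of hops from this repost back to the original post.
        hops = 0
        while node != root:
            node = parent_of[node]
            hops += 1
        return hops

    depths = [depth(node) for node in parent_of]
    direct = sum(1 for d in depths if d == 1)
    return {"root_share": direct / len(depths), "max_depth": max(depths)}
```

A pure star gives `root_share` of 1.0 and `max_depth` of 1, while a single multi-hop chain of three reposts gives a `root_share` of one third and `max_depth` of 3.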

[NLP-69] IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages INTERSPEECH2026

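[Quick read]: This paper asks whether AudioLLMs performing speech recognition genuinely use the context supplied in textual prompts or instead rely on parametric knowledge learned during pretraining. Existing benchmarks cannot separate the two because they evaluate transcription under fixed prompting conditions and rarely include explicit contextual inputs. The authors introduce IndicContextEval, a 56-hour multilingual benchmark of natural speech from 555 speakers across 8 Indian languages and 23 professional domains, together with a 7-level prompting framework that progressively introduces contextual signals: metadata, natural-language descriptions, entity lists in English and native script, and adversarial prompts with incorrect entities. Evaluating five models reveals substantial differences in context utilisation behaviour, highlighting the need for explicit evaluation of contextual grounding in AudioLLMs.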

Link: https://arxiv.org/abs/2606.19157
Authors: Sakshi Joshi,Dhruv Subhash Rathi,Sanskar Singh,Eldho Ittan George,R J Hari,Kaushal Bhogale,Mitesh M. Khapra
Affiliations: AI4Bharat, Indian Institute of Technology Madras; Sarvam AI
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Comments: Accepted at Interspeech 2026

Abstract:AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge learned during pretraining. Existing benchmarks cannot answer this question because they evaluate transcription under fixed prompting conditions and rarely include explicit contextual inputs. We introduce IndicContextEval, a 56-hour multilingual benchmark of natural speech from 555 speakers across 8 Indian languages and 23 professional domains. We design a 7-level prompting framework that progressively introduces contextual signals, including metadata, natural-language descriptions, entity lists in English and native script, and adversarial prompts with incorrect entities. Evaluating five models reveals substantial differences in context utilisation behaviour, highlighting the need for explicit evaluation of contextual grounding in AudioLLMs.

[NLP-70] Mitigating Scoring Errors and Compensating for Nonverbal Subtests in Speech-Based Dementia Assessment INTERSPEECH2026

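[Quick read]: This paper addresses two factors that limit the accuracy of speech-based dementia assessment: transcription errors and the omission of nonverbal subtests such as motor skills. Early detection of cognitive impairment relies on neuropsychological tests that assess multiple cognitive domains to minimize subjectivity, and speech-based evaluation can support diagnostics and improve accessibility. The study evaluates the German "Syndrom-Kurz-Test", a standardized dementia screening test comprising verbal and motor subtests. To reduce scoring errors, the authors train models that integrate transcript-derived scores with Whisper embeddings for each verbal subtest. To compensate for the missing motor subtests, they then use these fused representations to approximate expert overall ratings. Despite omitting subtests, the models correlate strongly with expert ratings and discriminate between cognitive status groups.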

Link: https://arxiv.org/abs/2606.18979
Authors: Franziska Braun,Christopher Witzl,Andreas Erzigkeit,Hartmut Lehfeld,Thomas Hillemacher,Tobias Bocklet,Korbinian Riedhammer
Affiliations: Technische Hochschule Nürnberg; Geromed GmbH; PMU Klinikum Nürnberg
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted at INTERSPEECH 2026

Abstract:Early detection of cognitive impairment relies on neuropsychological tests to minimize subjectivity by assessing multiple cognitive domains. Speech-based evaluation can support diagnostics and improve accessibility, but transcription errors and the omission of nonverbal subtests (e.g., motor skills) limit accuracy. Beyond conventional test scores, speech-derived features can provide additional insights into cognitive status. This study investigates the speech-based evaluation of the German “Syndrom-Kurz-Test,” a standardized dementia screening test comprising verbal and motor subtests. We train models that integrate transcript-derived scores and Whisper embeddings per verbal subtest to reduce scoring errors. To compensate for missing motor subtests, we then leverage these fused representations to approximate expert overall ratings. Despite omitting subtests, our models strongly correlate with expert ratings and efficiently and accurately discriminate between cognitive status groups.

Information Retrieval

[IR-0] Which Sections of a Research Paper Best Reveal Its Research Methods? Evidence from Library and Information Science

Link: https://arxiv.org/abs/2606.19051
Authors: Qiuyu Fang,Jiayi Hao,Chengzhi Zhang
Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Comments: ASIST 2026

Abstract:Research methods are essential carriers of knowledge contribution in academic papers. Automatic multi-label classification of research methods can support knowledge services such as method retrieval, review generation, and research intelligence analysis. While existing studies primarily rely on titles and abstracts, abstracts often provide only limited methodological information, whereas utilizing full-text content faces challenges related to excessive length and information redundancy. Therefore, this paper proposes a segment combination strategy by partitioning the full-text content according to its physical position. Using an annotated corpus of 1,954 full-text articles from three representative journals in Library and Information Science (JASIST, LISR, and JDoc), we evaluate the classification performance of various segments and their combinations across multiple models. Experimental results indicate that methodological information is distributed unevenly within the full-text content, with the middle-to-late and final segments exhibiting greater discriminative power. Furthermore, integrating bibliographic metadata with cross-segment combination strategies effectively enhances classification performance.

[IR-1] Querit-Reranker: Training Compact Multilingual Rerankers via Efficient Label-Free Distribution Adaptation

Link: https://arxiv.org/abs/2606.19037
Authors: Yunfei Zhong,Jun Yang,Wei Huang,Yinqiong Cai,Haosheng Qian,Yixing Fan,Ruqing Zhang,Lixin Su,Daiting Shi,Jiafeng Guo
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Deployable multilingual rerankers must generalize across languages, domains, and target ranking tasks while remaining efficient enough for second-stage reranking. However, adapting them to new target distributions typically requires extensive task-specific relevance annotations, which are costly to obtain. We present Querit-Reranker, a family of multilingual cross-encoder rerankers trained with a data-centric pipeline for label-efficient adaptation. We instantiate it as Querit-Reranker-A0.4B, initialized from an in-house MoE backbone with 0.4B activated parameters, and Querit-Reranker-4B, initialized from Qwen3-Embedding-4B. Our pipeline first learns general relevance modeling from large-scale ranking-oriented data, then adapts to target distributions through synthetic-query mining with teacher scores as continuous soft labels. To consolidate complementary task-adapted strengths, we further merge checkpoints via spherical linear interpolation, obtaining a single deployable model without runtime ensembling overhead. Using Qwen3-Embedding-0.6B as the shared first-stage retriever, Querit-Reranker-A0.4B improves average nDCG@10 from 54.11 to 59.28 on BEIR and from 59.87 to 67.70 on MIRACL. On MTEB Multilingual v2 Reranking, it also substantially outperforms larger embedding-based baselines, while Querit-Reranker-4B further achieves state-of-the-art performance among publicly available models. We release both models on Hugging Face.
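
Checkpoint merging by spherical linear interpolation (slerp) can be sketched on two flat weight vectors as below. How Querit-Reranker applies it (per tensor or over the whole model, and with what interpolation weight `t`) is not stated in the abstract, so the function signature and the fallback threshold `eps` are assumptions, and both inputs are assumed to be non-zero vectors.

```python
import math


def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation between flat weight vectors a and b.

    t=0 returns a, t=1 returns b; intermediate t follows the arc between
    them instead of the straight chord used by plain averaging.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp to guard against rounding pushing the cosine outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    theta = math.acos(cos_theta)
    if math.sin(theta) < eps:
        # Nearly parallel vectors: fall back to plain linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    weight_a = math.sin((1 - t) * theta) / math.sin(theta)
    weight_b = math.sin(t * theta) / math.sin(theta)
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]
```

Because the result is a single set of weights, the merged model is served like any one checkpoint, which matches the abstract's point about avoiding runtime ensembling overhead.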

[IR-2] Zero-Shot Active Feature Acquisition via LLM-Elicitation

Link: https://arxiv.org/abs/2606.18933
Authors: Binyamin Perets,Natalie Mendelson,Shiran Vainberg,Yehuda Chowers,Shai Shen-Orr,Shie Mannor
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR); Methodology (stat.ME)
Comments:

Abstract:Active feature acquisition (AFA) sequentially selects which features to observe to reach a classification or ranking decision. Its central limitation is reliance on large amounts of labeled data to fit probabilistic models guiding acquisition. Large language models (LLMs) supply unsupervised domain knowledge, but are poor sequential planners. Asking one to both know and decide conflates capabilities best kept separate. Here, we develop a framework for zero-shot AFA through disciplined elicitation: asking the LLM only for what it can be trusted to return, the unary deviations and pairwise co-variations that are the sufficient statistics of a Markov random field (MRF). We apply our framework to two settings: binary classification and top-k identification. In practice, the LLM reliably returns only discriminative statistics, what distinguishes the classes rather than each class in isolation, which precludes classical AFA. We apply a maximum-entropy closure that resolves this gauge ambiguity. We evaluate on a cohort of Inflammatory Bowel Disease (IBD) patients, an active clinical setting where diagnostic ambiguity and patient heterogeneity obstruct stable treatment strategies. Our framework outperforms the LLM both on real labels and on its own extracted beliefs. Where it matters most, on the hardest patients, our top-k acquisition policy markedly outperforms all existing methods.

[IR-3] SAERec: Constructing Fine-grained Interpretable Intents Priors via Sparse Autoencoders for Recommendation

Link: https://arxiv.org/abs/2606.18897
Authors: Jiangnan Xia,Xuansheng Wu,Yu Yang,Xin Wang,Ninghao Liu
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Intent-based recommender systems have gained significant attention for improving accuracy and interpretability by modeling the underlying motivations behind user behaviors. Most existing models derive intents directly from user sequences via clustering or prototype learning. However, they are sensitive to sequence quality, require presetting the number of intents, and lack explicit semantic grounding. These issues lead to an incomplete and coarse intent set and limit the effectiveness of recommendation. In this paper, we propose the Sparse Autoencoder for intent-based recommendation (SAERec), a novel recommender that automatically constructs a fine-grained and interpretable intent space from a textual corpus to guide recommendation. Rather than treating texts as side signals, SAERec leverages them as high information density evidence for intent construction. Specifically, we first extract a comprehensive set of fine-grained interpretable intents from the latent space of large language models (LLMs) by using a sparse autoencoder (SAE) to disentangle and interpret text embeddings, which isolates intent-related semantics from textual noise. Then, for each user, we retrieve relevant intents from this set as priors to guide recommendation. It contains personal intents matching a user’s current interests and public intents capturing general item patterns shared across users (e.g., quality, price). Finally, to integrate retrieved intents into sequence modeling, we propose a multi-branch attention mechanism that captures temporal dependencies and injects both personal and public intent signals, followed by an adaptive fusion layer to construct the final user representation for recommendation. Extensive experiments on public datasets demonstrate the superiority of SAERec, consistently outperforming state-of-the-art baselines while providing human-understandable explanations.

[IR-4] LARE: Low-Attention Region Encoding for Text-Image Retrieval ICML2026

Link: https://arxiv.org/abs/2606.18885
Authors: Abdulmalik Alquwayfili,Faisal Almeshal,Jumanah Almajnouni,Leena Alotaibi,Faisal Alhajari,Mohammed Alkhrashi,Alreem Almuhrij,Abdullah Aldwyish,Raied Aljadaany,Huda Alamri,Muhammad Kamran J. Khan
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Notes: Accepted at the ICML 2026 Workshop on Efficient Multimodal Question Answering (EMM-QA). Code: this https URL; Dataset: this https URL

Abstract:Image retrieval in crowded scenes is particularly challenging due to the salience bias of conventional visual encoders, which tend to focus on dominant objects while neglecting low-attention regions that are often crucial for fine-grained retrieval. We propose LARE (Low-Attention Region Encoding), a framework that explicitly models these overlooked regions. LARE adopts a dual-encoding strategy that encodes low-attention regions of an image and the full image in parallel, leading to more diverse and informative image embeddings. To evaluate image retrieval performance in challenging crowded scenes, we introduce Dense-Set, a challenging subset derived from COCO and Flickr30K. In this subset, images are re-captioned to provide richer descriptions of low-attention or previously overlooked regions. This dataset highlights the limitations of existing retrieval models and enables a more rigorous evaluation under densely crowded scene conditions. Experimental results demonstrate that the proposed framework improves retrieval performance by preserving subtle, non-dominant visual cues within the shared latent space.

[IR-5] ScholarSum: Student-Teacher Abstractive Summarization via Knowledge Graph Reasoning and Reflective Refinement

Link: https://arxiv.org/abs/2606.18850
Authors: Bohou Zhang,Xiaoyu Tao,Mingyue Cheng,Huijie Liu,Qi Liu
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Notes:

Abstract:Abstractive summarization plays a crucial role in enabling efficient understanding of scientific literature, yet it inherently demands both linguistic fluency and factual faithfulness. Existing approaches often fail to reconcile these two requirements. Extractive methods rely on rigid sentence splicing that disrupts macro-level logical coherence, while large language model (LLM)-based generative approaches, despite mastering linguistic fluency, exhibit limited factual consistency. In this work, we propose ScholarSum, a hierarchical reflective graph-based framework that emulates a student-teacher writing process for fluent and faithful scientific summarization. ScholarSum first organizes the document into a hierarchical knowledge graph by segmenting it into semantically coherent units, whose multi-layered community structure captures global logic and macro-level themes. Guided by this global structure, the student generates an initial draft, which is subsequently refined through fine-grained evidence retrieval. To ensure factual consistency, a teacher-like reviewer then iteratively examines the draft, identifies unsupported content, and prompts targeted re-retrieval and rewriting until the summary meets rigorous quality standards. Extensive experiments demonstrate that ScholarSum significantly outperforms previous baselines in terms of both completeness and faithfulness. Our code is available at this https URL.

[IR-6] LensKit-Auto

Link: https://arxiv.org/abs/2606.18814
Authors: Max Breit,Anass Amezian El Idrissi,Rishikesh Giriraj Kulkarni,Luca Quade
Categories: Information Retrieval (cs.IR)
Notes:

Abstract:Recommender systems have a wide area of application, e.g. in fields like video streaming, social media, or digital marketplaces. However, finding the right algorithm with the right hyperparameters for a recommender system is a recurring challenge. There is no one-size-fits-all solution, since the performance of one algorithm can vary immensely across data sets. Due to the challenges of finding the right algorithm and the broad use of recommender systems, it is of interest to create an Automated Recommender System (AutoRecSys) that takes on the task of finding the right algorithm-hyperparameter combination for a given data set. In this work, we present the enhancement of LensKit-Auto, a framework introduced by Vente et al., that solves exactly this task of finding a fitting algorithm-hyperparameter combination. LensKit-Auto’s biggest strength lies in its ease of use: it operates as a black box into which users can feed their data set and receive the information of which algorithm and hyperparameters work best on this data set. In this work, we bring LensKit-Auto up to date, so that it works with the new version of its underlying framework, LensKit. We also implement further functionalities, such as the Tree Parzen Estimator as an additional optimization method, the ability to reuse the found algorithm, updated documentation, and the ability to visualize the optimization process. We also adapt an existing meta-learning framework to generate a suitable meta-dataset for LensKit-Auto, which could enable the integration of meta-learning into LensKit-Auto in the future. The presented changes bring LensKit-Auto up to date and enhance its usability, so that even non-experts in the field can find the right algorithm for their use case.

[IR-7] Rescaling MLM-Head for Neural Sparse Retrieval

Link: https://arxiv.org/abs/2606.18811
Authors: Youngjoon Jang,Seongtae Hong,Jonah Turner,Heuiseok Lim
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Notes:

Abstract:Learned sparse retrieval (LSR) models such as SPLADE have traditionally used BERT-style masked language models as backbone encoders. A natural expectation is that replacing BERT with stronger pretrained encoders should improve retrieval effectiveness. However, we find that under standard SPLADE training recipes, backbones with large MLM-head L2 norms can suffer performance degradation and even training collapse. We identify this failure as a scale mismatch in the MLM head: SPLADE directly uses MLM-head outputs to construct sparse lexical representations, and query-document relevance is computed by an unnormalized dot product over these representations. As a result, an inflated MLM-head scale can amplify sparse activations, distort matching scores, and destabilize contrastive training under common training settings. To address this issue, we introduce a simple initialization-time correction that rescales the MLM-head projection by a constant factor before SPLADE training. This zero-cost adjustment improves training stability without modifying the model architecture or training objective. Across both in-domain and out-of-domain retrieval benchmarks, this simple correction substantially improves large-norm backbones such as ModernBERT and Ettin, turning unstable training runs into competitive sparse retrievers. In several settings, the corrected models further match or surpass the classic BERT-SPLADE baseline. These findings suggest that the bottleneck in adapting pretrained encoders to LSR is not encoder capacity alone, but the calibration of the MLM-head scale used to construct sparse lexical representations.
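
The scale mismatch described in this abstract can be illustrated with a small pure-Python sketch. It assumes the commonly used SPLADE pooling (max over tokens of log(1 + ReLU(logit))) and a bias-free MLM head; the toy weights and the factor `0.1` are hypothetical and are not values from the paper.

```python
import math

def mlm_logits(hidden, head_weights):
    """Project each token's hidden state onto the vocabulary (bias omitted for brevity)."""
    return [[sum(h * w for h, w in zip(tok, row)) for row in head_weights]
            for tok in hidden]

def splade_vector(logits):
    """SPLADE-style pooling: max over tokens of log(1 + ReLU(logit)) per vocabulary entry."""
    vocab_size = len(logits[0])
    return [max(math.log1p(max(tok[j], 0.0)) for tok in logits)
            for j in range(vocab_size)]

def rescale_head(head_weights, factor):
    """Initialization-time correction: multiply the head projection by a constant."""
    return [[w * factor for w in row] for row in head_weights]

def dot(a, b):
    """Unnormalized dot product used as the relevance score."""
    return sum(x * y for x, y in zip(a, b))

# Toy example: 2 tokens, hidden size 2, vocabulary of 3 terms (all values hypothetical).
hidden = [[1.0, 2.0], [0.5, -1.0]]
large_head = [[10.0, 0.0], [0.0, 10.0], [-5.0, 5.0]]

raw = splade_vector(mlm_logits(hidden, large_head))
corrected = splade_vector(mlm_logits(hidden, rescale_head(large_head, 0.1)))
print(dot(raw, raw), dot(corrected, corrected))
```

Because the score is an unnormalized dot product, shrinking the head shrinks every activation and therefore the score; how the constant is chosen is left to the paper.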

[IR-8] SHIFT: Semantic Harmonization via Index-side Feature Transformation for Multilingual Information Retrieval

Link: https://arxiv.org/abs/2606.18801
Authors: Youngjoon Jang,Seongtae Hong,Hyeonseok Moon,Heuiseok Lim
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Notes:

Abstract:With the rapid expansion of massive multilingual corpora, Multilingual Information Retrieval (MLIR) has emerged as a critical technology for global information access. MLIR enables users to retrieve semantically relevant documents from multilingual text collections using a single-language query. However, recent multilingual dense retrieval models often exhibit a strong preference for documents in the same language as the query. This leads to severe language bias, where top-ranked results are dominated by documents of specific languages, even when documents in other languages contain more semantically relevant information. To address this issue, we propose SHIFT, a training-free method applicable in the indexing stage. Specifically, SHIFT utilizes parallel translation pairs to estimate a relative language vector for each target language with respect to a source language. Subsequently, SHIFT corrects the language-specific offset by subtracting this relative language vector from document embeddings during indexing. Our comprehensive evaluation across four MLIR benchmarks and diverse dense retrieval models confirms that SHIFT can effectively mitigate language bias and enhance MLIR performance.
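
The index-side correction can be sketched as follows. The abstract only states that parallel translation pairs are used to estimate a relative language vector; taking the mean embedding difference over the pairs is an assumption made here for illustration, and all function names are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def relative_language_vector(target_embs, source_embs):
    """Estimate the target-language offset from parallel pairs.

    Assumption: the offset is the mean of (target - source) over aligned pairs.
    """
    diffs = [[t - s for t, s in zip(tv, sv)]
             for tv, sv in zip(target_embs, source_embs)]
    return mean_vector(diffs)

def shift_document(doc_emb, lang_vec):
    """Subtract the language offset from a document embedding at indexing time."""
    return [d - v for d, v in zip(doc_emb, lang_vec)]

# Toy parallel pairs: each target embedding is its source embedding plus a fixed offset.
source = [[1.0, 0.0], [0.0, 1.0]]
target = [[1.5, -0.5], [0.5, 0.5]]
offset = relative_language_vector(target, source)
print(shift_document([1.5, 0.5], offset))
```

No retriever parameters change in this sketch, which matches the training-free framing; only the stored document vectors are adjusted.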

[IR-9] TW-LegalBench: Measuring Taiwanese Legal Understanding

Link: https://arxiv.org/abs/2606.18699
Authors: Fei-Yueh Chen,Chun Huang Lin,Chan Wei Hsu,Kuan Hsuan Yeh,Zih-Ching Chen,Kuan-Ming Chen,Patrick Chung-Chia Huang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Notes: 10 pages, 2 figures, To appear in ICAIL 2026

Abstract:Large language models (LLMs) have shown impressive capabilities across diverse tasks, yet their performance on jurisdiction-specific legal reasoning remains underexplored. We present TW-LegalBench, which draws on the Taiwanese legal system’s rich, publicly available official corpus to fill the gap in evaluating LLMs on Taiwanese law, a gap left by common-law benchmarks that focus on English sources and civil-law benchmarks that focus on Simplified Chinese sources. TW-LegalBench comprises three task types: (1) over 16,000 multiple-choice questions (MCQs) across five years of official examinations in 18 professional domains; (2) 117 open-ended essay questions (OEQs) from examinations for legal professionals with official scoring rubrics; and (3) more than 14,000 legal judgment prediction (LJP) instances covering hundreds of crime categories. We evaluate 13 LLMs using accuracy for MCQs, a decomposed LLM-as-Judge framework based on the scoring rubric points for OEQs, and metrics for sentencing accuracy and statute citation for LJP. Our results reveal that top-performing models exceed the passing threshold for qualified lawyers (passing rate: 11%) but fall short of that for judges and prosecutors (passing rate: 1-2%). For LJP, while models demonstrate reasonable verdict type accuracy and sentence prediction capability, they struggle to cite exact legal articles. These findings highlight that reliable legal text generation remains challenging for LLMs, even though their performance on qualification examinations approaches human level.

[IR-10] MCompassRAG: Topic Metadata as a Semantic Compass for Paragraph-Level Retrieval

Link: https://arxiv.org/abs/2606.18508
Authors: Amirhossein Abaskohi,Raymond Li,Gaetano Cimino,Peter West,Giuseppe Carenini,Issam H. Laradji
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Notes:

Abstract:Retrieval-augmented generation (RAG) systems depend critically on how documents are chunked and searched. Fine-grained chunks can improve retrieval precision but expand the search space, increasing latency and cost; larger chunks reduce the number of candidates but make dense similarity less reliable, as the representation for each chunk mixes multiple topics and introduces more semantic noise. This trade-off becomes especially limiting in deep research tasks, where retrieval must be both fast and precise across large, heterogeneous corpora. We introduce MCompassRAG, a metadata-guided retrieval framework that uses topic-level signals as a semantic compass for selecting relevant evidence. Instead of relying only on cosine similarity between queries and noisy chunk embeddings, MCompassRAG enriches chunk representations with topic metadata in the same embedding space and trains a lightweight retriever through LLM-teacher distillation. At inference time, MCompassRAG performs topic-aware retrieval without additional LLM calls, improving both efficiency and evidence quality. Across six complex retrieval benchmarks, MCompassRAG improves information efficiency (IE) by 8.24% on average with over 5 times lower latency than the strongest efficient RAG baselines. Code is available on this https URL.

[IR-11] SproutRAG: Attention-Guided Tree Search with Progressive Embeddings for Long-Document RAG

Link: https://arxiv.org/abs/2606.18381
Authors: Amirhossein Abaskohi,Issam H. Laradji,Peter West,Giuseppe Carenini
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Notes:

Abstract:Retrieval-augmented generation (RAG) systems must balance retrieval granularity with contextual coherence, a challenge that existing methods address through LLM-guided chunking, single-level context expansion, or hierarchical summarization. These approaches variously depend on costly LLM calls during indexing or retrieval, limit context aggregation to a single granularity level, or introduce information loss through summarization. We present SproutRAG, an attention-guided hierarchical RAG framework that addresses this trade-off by organizing sentence-level chunks into progressively larger but semantically coherent units, using learned inter-sentence attention to construct a binary chunking tree. Unlike prior approaches that rely on external LLMs, fixed context expansion, or lossy summarization, SproutRAG learns which attention heads and layers best capture semantic document structure, enabling multi-granularity retrieval without additional LLM calls or compressed summaries. At retrieval time, SproutRAG uses hierarchical beam search to retrieve candidates at multiple granularities, capturing multi-sentence relevance beyond flat retrieval. The framework is trained end-to-end with a joint objective that improves both embeddings and tree structure. Experiments across four benchmarks spanning scientific, legal, and open-domain settings demonstrate that SproutRAG improves information efficiency (IE) by 6.1% on average over the strongest baseline. Code is available on this https URL.

[IR-12] RankGraph-2: Lifecycle Co-Design for Billion-Node Graph Learning in Recommendation

Link: https://arxiv.org/abs/2606.18379
Authors: Renzhi Wu,Zikun Cui,Junjie Yang,Tai Guo,Hong Li,Xian Chen,Li Yu,Ke Pan,Sri Reddy,Mahesh Srinivasan,Nipun Mathur,Haomin Yu,Hong Yan
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Notes:

Abstract:Graph-based retrieval at billion-node scale requires jointly solving three tightly coupled problems – graph construction, representation learning, and real-time serving – yet existing work addresses each in isolation. We present RankGraph-2, a framework deployed at Meta that co-designs all three lifecycle stages for similarity-based retrieval (U2U2I and U2I2I), where each stage’s requirements shape the others. Serving requires a co-learned cluster index to avoid expensive online KNN – this pushes index co-training into the training objective. Training benefits from the observation that similarity-based retrieval tolerates pre-computed neighborhoods, eliminating online graph infrastructure – this requires construction to produce self-contained data. Construction must also support hour-level refresh for item coverage. Acting on these cascading requirements, RankGraph-2 reduces hundreds of trillions of edges to hundreds of billions via subsampling with popularity bias correction, pre-computes multi-hop neighborhoods via personalized PageRank, and co-learns a residual-quantization cluster index that reduces serving computational cost by 83%. This lifecycle co-design enables a simple architecture to achieve 3.8x higher recall than a GAT + Deep Graph Infomax model on a bipartite graph and 2.1x higher than PyTorch-BigGraph on item retrieval. RankGraph-2 delivers up to +0.96% CTR and +2.75% CVR, and has powered 20+ retrieval launches across major surfaces.
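
As a rough illustration of pre-computing multi-hop neighborhoods with personalized PageRank, the sketch below runs plain power iteration on a tiny user-item graph. It is a textbook formulation, not the production pipeline from the paper; the graph, the damping value, and the function names are hypothetical, and every neighbor is assumed to appear as a key in `graph`.

```python
def personalized_pagerank(graph, seed, damping=0.85, iterations=50):
    """Power iteration for PageRank with all restart mass placed on `seed`."""
    scores = {node: (1.0 if node == seed else 0.0) for node in graph}
    for _ in range(iterations):
        updated = {node: 0.0 for node in graph}
        updated[seed] += 1.0 - damping          # restart mass
        for node, mass in scores.items():
            neighbors = graph[node]
            if neighbors:
                share = damping * mass / len(neighbors)
                for neighbor in neighbors:
                    updated[neighbor] += share
            else:
                updated[seed] += damping * mass  # dangling node: return mass to the seed
        scores = updated
    return scores

def top_neighbors(graph, seed, k):
    """Return the k highest-scoring nodes other than the seed."""
    scores = personalized_pagerank(graph, seed)
    ranked = sorted((n for n in scores if n != seed),
                    key=lambda n: scores[n], reverse=True)
    return ranked[:k]

# Toy bipartite graph: users u1, u2 and items i1, i2 (hypothetical data).
graph = {"u1": ["i1", "i2"], "u2": ["i2"], "i1": ["u1"], "i2": ["u1", "u2"]}
print(top_neighbors(graph, "u1", 2))
```

Storing the top-k list per node offline is one way a trainer could consume neighborhoods without querying a live graph service, which is the property the abstract highlights.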

[IR-13] Compact Geometric Representations of Hierarchies COLT

Link: https://arxiv.org/abs/2606.18520
Authors: Prashant Gokhale,Piotr Indyk,Yuhao Liu,Sandeep Silwal,Tony Chang Wang,Haike Xu
Categories: Machine Learning (stat.ML); Computational Geometry (cs.CG); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Notes: Published at the 39th Annual Conference on Learning Theory (COLT) 2026. 22 Pages

Abstract:Computing geometric representations of data is a cornerstone of modern machine learning, typically achieved by training dual encoders which map queries and documents into a shared embedding space. Recent work of You et al. [NeurIPS '25] has extended this approach to hierarchical retrieval, where relevance is determined by the ancestor-descendant relationships in a Directed Acyclic Graph (DAG). While previous work has shown that valid embeddings exist when the number of descendants is small, these bounds degrade significantly for deep hierarchies, requiring dimensions as large as the total number of nodes. In this paper, we investigate compact reachability embeddings for more general graph classes and provide theoretical guarantees for representing hierarchies using embeddings whose dimension depends on structural graph parameters. We prove that for any directed tree, there exists a reachability embedding in constant dimension 3, independent of the tree’s size or depth. We generalize this result to graphs characterized by treewidth $t$, constructing embeddings of dimension $O(t \log n)$, where $n$ is the number of nodes. Complementing these upper bounds, we provide matching or near-matching lower bounds, showing that dimension $\Omega(n)$ is necessary for general DAGs and $\Omega(t/\log(n/t))$ is required for graphs of treewidth $t$. We also obtain upper and lower bounds parameterized by the number of cross-edges in the DAG. We additionally show that our embeddings can be constructed on real world datasets, and that they give much smaller dimensions in high recall regimes compared to prior embeddings with theoretical guarantees.

Human-Computer Interaction

[HC-0] Correct Yourself, Keep My Trust: How Self-Correction and Social Connection Shape Credibility in Social Chatbots

Link: https://arxiv.org/abs/2606.19286
Authors: Biswadeep Sen,Yi-Chieh Lee
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Notes:

Abstract:When social chatbots make mistakes, and they do, how they recover determines whether users trust them again. Social chatbots are increasingly integrated into everyday life, yet they remain prone to generating convincing but inaccurate information. The social connection they build with users makes such errors particularly consequential. We conducted a between-subjects experiment (N=120) comparing three error correction strategies: a webpage retraction, self-correction by the same social chatbot, and correction by an expert chatbot. Our results reveal two key findings. First, all three strategies corrected the error equally well, but only self-correction did so without damaging the chatbot’s credibility: participants rated self-correcting chatbots significantly higher in both trustworthiness and perceived expertise than chatbots whose errors were corrected by external sources. Second, the strength of the user’s social connection with the chatbot, measured through social attraction and self-disclosure, significantly predicted the magnitude of belief change, but only when the chatbot corrected itself. Outsourcing corrections to an external source severed this link entirely. These findings suggest that social chatbots should correct their own mistakes rather than outsource corrections, and that investing in social connection is a functional mechanism that amplifies correction effectiveness, not merely a design feature. We discuss implications for designing chatbots that maintain long-term credibility while effectively addressing their own errors.

[HC-1] A Taxonomy of Mental Health and Technology Needs for Alzheimer's and Dementia Caregivers

Link: https://arxiv.org/abs/2606.19247
Authors: Keran Wang,Drishti Goel,Jiayue Melissa Shi,Violeta J. Rodriguez,Daniel S. Brown,Dong Whi Yoo,Ravi Karkar,Koustuv Saha
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Notes:

Abstract:Family members caring for individuals with Alzheimer’s disease and related dementias (AD/ADRD) provide the foundation of long-term care worldwide. In 2023, more than 11 million U.S. family and friends contributed 18 billion hours of unpaid care, often at the cost of their own physical and mental health. These informal caregivers – also referred to as the “invisible second patients” – experience elevated rates of mental health problems. Yet research commonly reduces their complex psychosocial experiences to a single construct of caregiver burden, obscuring which specific needs are unmet or effectively supported. At the same time, digital and AI-enabled technologies are rapidly expanding, from smartphone apps and videoconferencing to sensor platforms and AI chatbots. However, the absence of shared frameworks across medicine, psychology, and technology research limits cumulative progress. This study introduces a Caregiver Mental Health and Technology Taxonomy that systematically links AD/ADRD caregiver needs with corresponding classes of technology-based interventions. Drawing from an interdisciplinary literature review and two qualitative studies with caregivers, the taxonomy identifies mismatches between caregiver priorities and existing technological support, highlights under-served domains such as relational strain and compassion fatigue, and proposes design directions for adaptive, responsive systems. The framework offers a shared vocabulary to guide clinicians, researchers, and technology designers in developing more person-centered and clinically grounded innovation in dementia care.

[HC-2] Seeing Through Occlusion: Deterministic Arm Kinematic Correction for Robot Teleoperation

Link: https://arxiv.org/abs/2606.19240
Authors: Thomas M. Kwok,Nicholas Koenig,Yue Hu
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Systems and Control (eess.SY)
Notes:

Abstract:Markerless, single-RGB-D-camera motion capture provides a low-cost and non-invasive alternative to conventional marker-based systems for robot teleoperation; however, depth estimation often degrades in the presence of self-occlusion, particularly during upper-limb motion. This paper presents an Arm Kinematic Correction (AKC) method that improves depth estimation by enforcing geometric constraints based on constant arm lengths. The proposed approach reconstructs occluded joint depths by leveraging wrist positions and predefined arm lengths via a deterministic formulation based on the Pythagorean theorem, thereby avoiding the need for complex probabilistic modeling or parameter tuning. Experimental validation against a Vicon reference system demonstrates reliable performance for both static and dynamic joint motions, evaluated using root-mean-square error (RMSE) and Pearson correlation. Furthermore, motion-mapping teleoperation is successfully demonstrated in both simulated and physical robot environments. The results show that AKC enhances robustness and preserves anatomical consistency under long-duration, severe self-occlusion, even when paired with less reliable temporal filters, highlighting its practicality for real-time applications such as robot teleoperation and human-robot interaction.
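
A constant-segment-length constraint of this kind can be written as a one-line Pythagorean relation. The sketch below is an assumption-laden reading of the abstract, not the authors' formulation: it assumes camera-frame coordinates in meters, a reliable wrist position, a reliable lateral (x, y) estimate for the occluded joint, and a caller-supplied choice of sign for the depth offset.

```python
import math

def correct_joint_depth(wrist, joint_xy, segment_length, joint_behind_wrist=True):
    """Recover an occluded joint's depth from a fixed segment length.

    wrist: (x, y, z) of the wrist; joint_xy: (x, y) of the occluded joint.
    Solves segment_length^2 = dx^2 + dy^2 + dz^2 for dz.
    """
    dx = joint_xy[0] - wrist[0]
    dy = joint_xy[1] - wrist[1]
    planar_sq = dx * dx + dy * dy
    # Clamp at zero so noisy lateral estimates longer than the segment stay valid.
    dz = math.sqrt(max(segment_length ** 2 - planar_sq, 0.0))
    return wrist[2] + dz if joint_behind_wrist else wrist[2] - dz

# Hypothetical numbers: forearm of 0.5 m, elbow displaced 0.3 m sideways from the wrist.
elbow_z = correct_joint_depth((0.0, 0.0, 1.0), (0.3, 0.0), 0.5)
print(round(elbow_z, 3))
```

The relation has two solutions (joint in front of or behind the wrist); how the method resolves that ambiguity is not stated in the abstract, so it is exposed here as a flag.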

[HC-3] A Human-in-the-Loop Bayesian Optimization Framework for Constraint-Aware Bioprocess Development

Link: https://arxiv.org/abs/2606.19230
Authors: Samuel Stricker,Claus Wirnsperger,Alessandro Butté,Laura Helleckes,Gonzalo Guillén Gosálbez,Antonio del Rio Chanona,Mehmet Mercangöz
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Machine Learning (stat.ML)
Notes:

Abstract:This work presents an extension to Pareto Front Guided Sampling (PFGS), a Human-in-the-Loop (HitL) Bayesian Optimization (BO) framework in which Gaussian process (GP) surrogate-derived quantities are reformulated as objectives of a multi-objective optimization problem, and the resulting Pareto front is exposed to a domain expert for interactive candidate selection rather than returning a single automated recommendation. The framework is extended in two directions: constrained optimization is addressed by incorporating the posterior probability of satisfying output specification limits as an explicit Pareto objective, computed analytically from the GP posterior distribution; robust optimization is addressed by a Monte Carlo sampling strategy that estimates expected lower-confidence performance over a user-defined variability of input perturbations, capturing performance degradation under likely implementation deviations. The resulting multi-dimensional Pareto representation renders trade-offs between predicted performance, model uncertainty, probabilistic constraint satisfaction, and input robustness simultaneously visible through pairwise two-dimensional projections on an interactive dashboard, enabling selection criteria to be iteratively refined as the surrogate model improves and development objectives evolve. The framework is showcased on an eight-dimensional fed-batch Chinese Hamster Ovary (CHO) cell culture simulator demonstrating systematic identification of high-performing, feasibility-compliant, and perturbation-resilient operating conditions, and illustrating how expert-defined requirements provide a principled stopping criterion and support informed allocation of experimental resources.
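
For a Gaussian posterior, the probability of landing inside specification limits has a closed form in terms of the normal CDF. The sketch below shows that standard calculation for a single output at a single candidate point; the function names and the example numbers are hypothetical, and the paper may treat multiple outputs or limits differently.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_within_spec(mean, std, lower, upper):
    """P(lower <= y <= upper) for y ~ N(mean, std^2), e.g. a GP posterior at one point."""
    if std <= 0.0:
        # Degenerate posterior: the prediction is a point value.
        return 1.0 if lower <= mean <= upper else 0.0
    return normal_cdf((upper - mean) / std) - normal_cdf((lower - mean) / std)

# Hypothetical candidate: predicted value 5.2 with posterior std 0.4, spec limits [4.5, 6.0].
print(prob_within_spec(5.2, 0.4, 4.5, 6.0))
```

Used as an extra objective, this value rewards candidates whose predicted output is both inside the limits and well characterized by the surrogate.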

[HC-4] No Two Developers Think Alike: How Problem-Solving Styles and Experience Shape Needs in Conversational Interaction with Copilot

Link: https://arxiv.org/abs/2606.19216
Authors: Jonan Richards,Bruno Alves de Oliveira,Iury Oliveira,Igor Wiese,Mairieli Wessel
Categories: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)
Notes: Accepted at the International Conference on Software Maintenance and Evolution (ICSME), 2026

Abstract:Conversational LLM-based "programming assistants" provide a range of benefits to developers. However, recent studies demonstrate the variety in individual developers' needs regarding programming assistants, and challenges encountered by only specific groups of developers. In this study, we explore the role of cognitive diversity in shaping interactions with GitHub Copilot chat. Through a mixed-methods think aloud study with 27 professional developers and students, we characterize 5 distinct "interaction modes" and 10 underlying needs in developers’ interactions, forming a conceptual model. We characterize links between these modes, needs, and developers’ problem-solving styles and experience profiles, showing how cognitive diversity may shape developers’ interactions. We provide insights and recommendations for researchers and practitioners on how to design, research, and employ programming assistants to better account for diverse developer needs.

[HC-5] A Clinician-Centered Pipeline for Annotation and Evaluation in Ultrasound AI Studies

Link: https://arxiv.org/abs/2606.19174
Authors: Fangyijie Wang,Jianjun Yu,Wentao Shi,Haixia Huang,Ran Shi,Guénolé Silvestre,Kathleen M. Curran
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Notes: Accepted to MIUA 2026

Abstract:Clinician-centered evaluation is critical for validating medical AI systems, especially in ultrasound imaging where quantitative metrics do not always capture clinical usability. Existing medical image platforms primarily focus on dataset labeling. They lack integrated support for blinded model comparison and reproducible evaluation workflows. We present a clinician-centered pipeline for remote annotation and evaluation in ultrasound AI studies. The proposed pipeline uses a centralized server and lightweight browser interfaces to enable clinicians to perform annotation, blinded ranking, and review without local dataset downloads. The pipeline also supports multi-rater participation, centralized result aggregation, and automated statistical analysis. We validate the pipeline in a fetal ultrasound segmentation study with six raters spanning expert, generalist, and non-expert experience levels. The system automatically generated Spearman correlation, Kendall’s $\tau$, and top-1 selection statistics. Results indicated moderate to strong agreement across experts and other groups. The blinded evaluation results showed a tendency for later active learning models to be preferred. These outcomes suggest that the pipeline can support clinician-centered annotation and reproducible human-AI evaluation studies in ultrasound imaging. The proposed pipeline is available on GitHub (this https URL).
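
Agreement between two raters' rankings can be quantified with Kendall's tau. The sketch below implements the simple tau-a variant by counting concordant and discordant pairs; which variant and tie handling the pipeline actually uses is not stated in the abstract, and the example rankings are hypothetical.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two equal-length rankings of the same items."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        product = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if product > 0:
            concordant += 1      # both raters order the pair the same way
        elif product < 0:
            discordant += 1      # the raters disagree on the pair
    total_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / total_pairs

# Two raters ranking the same four model outputs (1 = best); values are made up.
expert = [1, 2, 3, 4]
generalist = [1, 3, 2, 4]
print(kendall_tau(expert, generalist))
```

Tied pairs contribute to neither count in this variant, so heavily tied rankings would call for a tie-corrected statistic instead.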

[HC-6] Written by AI, Managed by AI: Semantic Space Control and Index Sickness Elimination Across 391 Consecutive Sessions ICSE2027

Link: https://arxiv.org/abs/2606.19121
Authors: Hui Zhang,Shuren Song
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Notes: 22 pages, 2 tables, 1 figure. Action research. Bilingual submission (Chinese companion version included as supplementary). Submitted to ICSE 2027 IOR track

Abstract:The prevailing engineering intuition for addressing conceptual drift in long-horizon LLM collaboration is to trade more formal constraints for more reliable outputs – designing symbolic identifier systems, accumulating defensive rules in System Prompts, expanding context windows. Our engineering record shows that in long-horizon settings, this direction may produce effects contrary to design intent. Using action research methods in a real software project (Bang-v3) spanning approximately one month and 391 collaborative sessions, we document and analyze the failure process of these strategies. When the symbolic system exceeds a complexity threshold, LLMs do not become more accurate – instead, they abandon genuine understanding of business semantics, retreat to self-referential reasoning within the symbolic layer, and generate outputs that appear internally consistent but are physically disconnected from reality. We name this failure pattern “Index Sickness,” and its canonical manifestation “Phantom Legislation.” We name the underlying principle the “Pang Principle (Semantic Vitality Law)”: natural language carrying explicit purpose conveys far greater information quality than symbolic expression. From this, we design and validate its physical engineering mechanism: “Baseline-Log Physical Separation.” In the same project, this mechanism reduced AI Instructions volume by ~75%, and across the subsequent ~150 sessions, no recurrence of Index Sickness was observed. A bilingual companion version (Chinese) is included as supplementary material.

[HC-7] Improving Human-Robot Teamwork in Urban Search and Rescue Through Episodic Memory of Prior Collaboration

Link: https://arxiv.org/abs/2606.18836
Authors: Taewoon Kim,Emma van Zoelen,Mark Neerincx
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Notes:

Abstract:Effective human-robot teamwork requires robots to adapt to partners, situations, and task dynamics from the start of an interaction. In the MATRX Urban Search and Rescue (USAR) environment, people can externalize collaboration patterns (CPs) they discover during teamwork through a chat and reflection interface. We study whether a robot can use such prior team experience to become a better teammate in future interactions. To this end, we represent historical CPs as knowledge-graph episodic memories and use graph representation learning with a node-classification objective to identify a representative and effective memory for reuse. We then initialize the robot with this memory before a new collaboration episode begins. Across 20 participants and 160 round-level observations, initializing the robot with a single automatically selected prior CP increases rescue success from 25.7% to 41.3% and reduces average task time by 283 seconds. The strongest gains appear at the beginning of interaction, suggesting that reusable episodic memory can help robots enter collaboration with more effective task knowledge and support smoother early teamwork.

[HC-8] SwitchBraidNet: Quantisation-Aware Lightweight Architecture for Hybrid Brain-Computer Interface

Link: https://arxiv.org/abs/2606.18816
Authors: Gourav Siddhad,Yogesh Kumar Meena
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: 6 pages, 5 figures, Preprint accepted at IEEE SMC 2026

Abstract:Hybrid brain-computer interfaces (BCIs) that integrate motor imagery (MI) and steady-state visual evoked potentials (SSVEP) provide high-dimensional neural decoding but typically exceed the computational limits of embedded hardware. To address this, we propose SwitchBraidNet, a compact EEG classification architecture designed for low-power deployment. The model employs a dual-path temporal braid to extract multiscale oscillatory features, an adaptive squeeze-and-excitation spatial switch for electrode gating, and a log-variance readout layer for direct band-power encoding. Furthermore, through systematic quantisation-aware training on the OpenBMI dataset, we compared SwitchBraidNet against four established baselines across FP32, FP16, and INT8 precisions. Experimental results demonstrate superior efficiency and performance, achieving MI accuracy of 69.49% (FP16), SSVEP accuracy of 93.48% (FP32), and a hybrid information transfer rate of 64.82 bits/min (FP16). With an INT8 footprint of only 3.03 KB, SwitchBraidNet maintains high accuracy across varying numerical precisions, demonstrating its suitability for low-power embedded BCI deployment.
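
For readers unfamiliar with the bits/min figure above, here is a minimal sketch of the widely used Wolpaw information transfer rate. The abstract does not say which ITR definition the authors use for the hybrid setting, so this is an assumption. The function name and the example values (4 classes, 4-second selections) are hypothetical and not taken from the paper.

```python
import math


def wolpaw_itr_bits_per_min(n_classes: int, accuracy: float,
                            seconds_per_selection: float) -> float:
    """Wolpaw ITR: bits per selection scaled to selections per minute.

    Assumption: this is the standard textbook formula, not necessarily
    the exact metric reported by the SwitchBraidNet authors.
    """
    if n_classes < 2:
        raise ValueError("n_classes must be at least 2")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    if seconds_per_selection <= 0:
        raise ValueError("seconds_per_selection must be positive")

    p = accuracy
    bits = math.log2(n_classes)
    # Each term is skipped where its limit is zero (p -> 0 or p -> 1).
    if p > 0.0:
        bits += p * math.log2(p)
    if p < 1.0:
        bits += (1.0 - p) * math.log2((1.0 - p) / (n_classes - 1))
    return bits * (60.0 / seconds_per_selection)


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    print(wolpaw_itr_bits_per_min(n_classes=4, accuracy=0.9,
                                  seconds_per_selection=4.0))
```

Under this definition the rate depends on the number of classes and the time per selection as well as on accuracy, so two systems with the same accuracy can report different bits/min.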

[HC-9] Human-AI Agent Interaction in a Business Context

Link: https://arxiv.org/abs/2606.18716
Authors: Kathrin Paimann,Elizangela Valarini,Sebastian Juhl
Categories: Human-Computer Interaction (cs.HC)
Comments: 9 pages, 5 tables, 1 figure, submitted to Springer Nature

Abstract:As AI agents are increasingly integrated into core business processes, understanding and designing effective interaction patterns between humans and AI agents becomes crucial for value creation. This study identifies and evaluates principles and criteria for a positive User Experience (UX) with AI agents, along with methods for its measurement. We identify user expectations and needs to facilitate adoption, build trust, and support user-centered decision-making by development teams. Using a mixed-methods approach that combines qualitative and quantitative techniques, we explore interaction patterns between humans and AI agents. The findings from this exploratory research serve as the basis to develop a survey experiment which evaluates the effectiveness of specific design elements on a larger scale. This foundational research contributes to the development of more intuitive and effective human-AI agent interactions in business settings.

[HC-10] Through the WordStream Glass: Revisiting Quantitative Encoding for Qualitative Learning Analytics

Link: https://arxiv.org/abs/2606.18692
Authors: Huyen N. Nguyen,Kathleen Bowe,Minh-Huyen Nguyen,Kit Thompson,Caleb M. Trujillo
Categories: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 9 pages, 5 figures

Abstract:Data-driven learning analytics can surface trends across a student cohort over time, helping instructors improve the learning environment. WordStream, a visualization idiom for topic evolution, has been instantiated in two platforms toward this goal: the Journal Data Dashboard, for analyzing formative assessments, and WordStream Maker, for authoring custom visualizations. Where the prior work built these platforms for education (Vis4Ed), here we examine the reverse direction (Ed4Vis): what can qualitative education research tell us about building better visualization tools? We conducted a mixed-methods expert study (n=10) in which STEM education researchers with expertise in qualitative methods and classroom assessment used both platforms to analyze student journal responses from a data visualization course. Across two cycles of thematic analysis with confirmatory checking, we report themes spanning tool experience, disciplinary context of use, and, most importantly, a core epistemological dissensus. Some instructor-researchers regarded frequency-based visualization as a productive entry point to qualitative analysis; others cautioned it can obscure rare but critical responses. We synthesize these findings into design implications for future tools that better integrate quantitative technique with qualitative inquiry. All Supplementary Materials are available at this https URL.

[HC-11] HANSEL: Extracting Breadcrumbs from Web Agent Trajectories for Interactive Verification

Link: https://arxiv.org/abs/2606.18671
Authors: Yujin Zhang,Daye Nam
Categories: Human-Computer Interaction (cs.HC)
Comments: 13 pages, 6 figures

Abstract:AI web agents can perform complex, multi-step tasks such as searching for products, comparing options, and making purchases on behalf of users. However, verifying the correctness of an agent’s output remains difficult. Existing transparency mechanisms, including full trajectory logs, source links, screenshots, and LLM-generated summaries, treat verification as a passive reading task, leaving users to sift through overwhelming logs or trust potentially unfaithful explanations. We present HANSEL (Highlighting Agent Navigation Steps as Evidence Links), a system that extracts interactive, verifiable evidence from web-agent trajectories. Given an agent trajectory, HANSEL extracts evidence pages and snippets and presents them as navigable, interactive views with relevant page state preserved (e.g., applied filters, search queries, and scroll positions), enabling users to verify how the agent arrived at its answer. When the agent’s answer cannot be traced to any visited page, HANSEL explicitly flags this gap. A technical evaluation on 45 tasks from AssistantBench and Online-Mind2Web shows that HANSEL achieves 83.7% precision and 88.8% recall in identifying evidence pages, while reducing trajectory volume by 61.6%. In a controlled user study with 14 participants, HANSEL significantly reduced task completion time and perceived effort compared to a standard agent interface, while participants rated it significantly higher on usability, verification ease, and error identification. Our results demonstrate that reframing verification as an interactive activity, rather than passive consumption of agent explanations, leads to more efficient human oversight of AI agents.
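
As a reading aid for the precision and recall figures above, here is a minimal sketch of how the two metrics are computed for a single task when evidence pages are treated as sets. The abstract does not describe how HANSEL aggregates over its 45 tasks, so per-task set comparison is an assumption. The function name and page identifiers are hypothetical.

```python
def precision_recall(predicted: set[str], relevant: set[str]) -> tuple[float, float]:
    """Set-based precision and recall for one task.

    predicted: pages the system flagged as evidence (hypothetical IDs).
    relevant:  pages annotated as true evidence (hypothetical IDs).
    """
    true_positives = len(predicted & relevant)
    # Convention assumed here: an empty denominator yields 0.0.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    flagged = {"page_1", "page_2", "page_3", "page_4"}
    annotated = {"page_1", "page_2", "page_3", "page_5", "page_6"}
    print(precision_recall(flagged, annotated))
```

Precision falls when the system flags pages that are not evidence; recall falls when it misses annotated evidence pages.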

[HC-12] Better Adherence, Richer Context: A Field Evaluation of LLM-Powered Conversational Voice Diaries for Sleep

Link: https://arxiv.org/abs/2606.18596
Authors: Amama Mahmood,Bokyung Kim,Honghao Zhao,Molly E. Atwood,Luis F. Buenaver,Michael T. Smith,Chien-Ming Huang
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Sleep diaries are central to behavioral sleep medicine and cognitive behavioral therapy for insomnia, yet daily completion is difficult to sustain, and static forms often provide limited context for interpreting night-to-night sleep variation. We designed an LLM-powered conversational voice diary that delivers clinically grounded morning and evening sleep diary questions through proactive smart-speaker prompts, structured conversational intake, and adaptive follow-up dialogue. We evaluated the system in a four-week between-subjects field study with 30 university students, comparing it with a text-based mobile diary using matched diary items, reporting windows, and reminder intervals. Compared with the text-based diary, the conversational voice diary showed higher adherence and elicited more detailed contextual self-report about routines, stressors, environmental conditions, and other sleep-related factors. Participants also described the voice diary as easier to integrate into daily routines, despite longer perceived completion time. However, voice-based conversational intake produced lower completeness for some structured diary fields, revealing a trade-off between expressive richness and structured precision. These findings show both the promise and the challenge of using LLM-powered conversational voice assistants for longitudinal health self-report.

[HC-13] “The New Era of Tech-Enabled Traceability”: Tensions between the FDA's Data Governance Vision and the Lived Realities of Food Producers

Link: https://arxiv.org/abs/2606.18593
Authors: Soonho Kwon,Catherine Wieczorek,Heidi Biggs,Shellye Suttles,Tammi S. Etheridge,Annabel Rothschild,Shaowen Bardzell
Categories: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
Comments:

Abstract:The U.S. Food and Drug Administration (FDA)'s Food Traceability Rule requires agri-food supply chain stakeholders (stakeholders)–including farmers, fishers, retail workers, and others–to maintain detailed tracking records beginning in January 2026. Through this Rule, the FDA envisions a “New Era of Tech-Enabled Traceability,” in which standardized, harmonized tracking data serve as a foundational public health infrastructure, enabling more rapid identification and removal of potentially contaminated food and ultimately reducing the risk of foodborne illness. Despite this promising vision, we observe that the Rule reconfigures agri-food stakeholders into data laborers by mandating stringent data collection, formatting, and reporting requirements. In this paper, we examine the tensions and burdens that arise from such reconfiguration. Leveraging Data Feminism as an orientation to attend to how data-driven policy implementation disproportionately burdens smaller, under-resourced stakeholders who lack the infrastructural and financial capacity to comply, we analyze 1,198 public comments submitted to this http URL in response to the proposed Rule. Our qualitative document analysis reveals three key tensions: (1) the individual labor, financial, and educational burdens stakeholders experience as they are reconfigured into data workers; (2) moments where data tracking becomes infeasible due to infrastructural limitations, cultural contexts, and situated production practices; and (3) instances where the Rule’s intended flexibility instead introduces confusion and burden due to its ambiguity. Related DOI: https://doi.org/10.1145/3817012

[HC-14] Confident yet Concerned: Inconsistencies in Computing Students' Attitudes on Cybersecurity

Link: https://arxiv.org/abs/2606.18541
Authors: Victor Adama,Robert Biddle,Nalin Arachchilage,Danielle Lottridge
Categories: Cryptography and Security (cs.CR); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Today’s young adults are most immersed in technology, leading in feelings of powerlessness in managing online privacy across many platforms, and particularly susceptible to phishing attacks. This raises questions about their general, wide-ranging attitudes towards and management of cybersecurity. How do young, tech-savvy adults approach cybersecurity? We seek a better understanding of their cybersecurity knowledge, attitudes and experiences, in particular in addressing deceptive online communications. We surveyed a group of ‘lead users’: computing university students (n = 236). By combining thematic analysis of open-ended responses with quantitative data, we provide insights into their experiences and perceptions. While students demonstrate reasonable cybersecurity awareness, their cybersecurity experiences vary, and inconsistencies exist around their practices, perceptions of responsibility, and support structures. Findings also reveal four key thematic tensions: 1) Computing students are knowledgeable yet have persistent incorrect beliefs, 2) They learn more about keeping safe from sources outside the classroom, 3) They have limited assistance and have fallen victim to cybercrime, and 4) Many are confident, yet others are concerned about their own safety and responsibility. Through cluster analysis of attitudes, we identify two groups, with one feeling less prepared, less confident, yet expressing a desire to learn more. Established measures of intentions and objective knowledge were correlated to preparedness. Self-efficacy correlated to confidence and predicted cluster membership.

[HC-15] Stitching the Divide: Investigating Mixed Reality as a Bridge Between Paper-Based and Digital Artifacts in UI/UX Design

Link: https://arxiv.org/abs/2606.18511
Authors: Abidullah Khan,Jinghui Cheng
Categories: Human-Computer Interaction (cs.HC)
Comments: Accepted to the ACM Graphics Interface Conference, 2026

Abstract:UI/UX designers work with both paper-based and digital artifacts but lack tools that seamlessly integrate the two. Mixed Reality (MR) offers under-explored opportunities to combine the strengths of both design environments. To examine these opportunities, we first conducted interviews with 19 professional UI/UX designers to understand their current experiences using paper and digital artifacts. Motivated and informed by the interview insights, we organized nine conceptual-probe user study sessions in which designers engaged with a MR-probe that combined paper and digital prototyping processes and brainstormed MR’s potential in UI/UX design. We found that participants valued MR for enabling continuous hybrid design workflows, reducing manual reconstruction, supporting spatially anchored workspaces, and facilitating real-time cross-medium collaboration. They also envisioned future MR tools with AI assistance, richer interactive and dynamic content, and the ability to manage diverse design artifacts within a unified environment. From these findings, we derive four design dimensions for future MR systems that could enable more fluid, creative, and collaborative design practices.

[HC-16] Designing L5: A Permacomputing Approach to Creative Coding

Link: https://arxiv.org/abs/2606.18481
Authors: Lee Tusman,Kit Kuksenok
Categories: Software Engineering (cs.SE); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 10 pages, 1 figure, In LIMITS 26: Workshop on Computing within Limits, June 23 - 25, 2026

Abstract:Creative coding libraries provide high-level tools that make computational and algorithmic art accessible to artists and learners. Processing/p5 is one such family of libraries, known for its beginner-friendly approach and wide reach across artistic and technical communities. L5 is a new member of this family, implemented in Lua using the LOVE framework. It applies permacomputing principles, a movement addressing sustainability in computing inspired by permaculture, bringing these values to a community of practice not historically centered on them. This paper explores L5’s design decisions and tensions between sustainability and usability through five case studies: 1. balancing perceived simplicity versus exposing the seams, 2. designing for lower resource consumption, 3. ensuring long-term stability, 4. constraining functionality, and 5. designing documentation for resource-constrained access. Rather than optimizing for a single metric, sustainable creative tools require navigating competing values transparently.

[HC-17] Searching for Synergy in Shared Workspace Human-AI Collaboration ICML2026

Link: https://arxiv.org/abs/2606.18413
Authors: Nachiket Kotalwar,Rohini Das,Carolyn Rose
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Accepted at ICML 2026 Workshop on Human-AI Co-Creativity. 13 pages, 5 figures, 3 tables

Abstract:Automated AI agents are increasingly capable, yet many scientific and professional tasks require human judgment and contextual expertise. We study shared-workspace human-AI teams, where AI agents and human collaborators must coordinate responsibilities before submitting a final answer. Using the Collaborative Gym environment with DiscoveryBench tasks, we examine when adding simulated human collaborators improves performance and when process loss turns additional collaborators into coordination overhead. Across 1,482 sessions, adding relevant collaborators can lower performance when teams lack structure to coordinate their contributions. We then evaluate scaffolding that combines shared group memory with simulated human-in-the-loop (HITL) gates, where selected actions require approval from a designated simulated participant. This scaffolding yields higher mean performance, most clearly in three-person teams, with clearer responsibility signals and stronger routing of expertise to team actions. Overall, how human-AI teams coordinate and integrate expertise matters as much as the capability available to them.

[HC-18] ASTRA: A Scalable Next-Generation ATCO Training Simulator with Autonomous Simpilots

Link: https://arxiv.org/abs/2606.18319
Authors: Ethan Chew,Enjia Wu,Iruss Eng Wei Yeow,Ian Weiqin Lim,Ranen Sim,Brandon Koh Ziheng,Kaleb Nim,Caden Toh Jun Yi,Wei Dong Soin,Darius Kai Keat Koh,Galen King Yu Tay,Prannaya Gupta,Jonathan Ee Fang Koong,Yong Zhi Lim
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
Comments:

Abstract:Air Traffic Control Operators (ATCOs) are vital in ensuring the safe, orderly, and efficient flow of air traffic, yet training capacity is constrained by reliance on specialized human trainers known as simpilots, who must role-play both pilots and ATCOs in a simulated airspace. Existing automated solutions rely on Western-centric speech models that perform poorly in Singaporean operational contexts, with off-the-shelf systems exhibiting Word Error Rates (WER) of up to 107.80% on Singaporean-accented aviation speech. We introduce ASTRA, an end-to-end training simulator that automates these simpilot roles through a pipeline that transcribes ATCO speech, interprets instructions, and generates appropriate pilot and ATCO responses using locally adapted voice models. Our fine-tuned Automatic Speech Recognition (ASR) pipeline reduces WER to 23.45%, substantially outperforming existing approaches in this domain. Beyond traffic simulation, ASTRA incorporates an AI-assisted performance evaluation framework that assesses trainee radiotelephony communications across accuracy, brevity, and completeness, achieving post-optimization scores of 91.7%, 88.2%, and 86.9%, respectively. Built on open-source foundations such as DSPy and Unsloth, this approach enables scalable, standardized ATCO assessment while reducing instructor workload.
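
A Word Error Rate above 100%, such as the 107.80% quoted above, can look like a mistake. It is possible because WER divides the word-level edit distance (substitutions, deletions, insertions) by the length of the reference, and insertions are unbounded. The sketch below illustrates the standard definition. It is not the authors' evaluation code, and the function name and sample phrases are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical phrases: four inserted words against a two-word reference.
    print(word_error_rate("turn left", "please turn to the left now"))
```

In the usage example the hypothesis contains both reference words plus four extra ones, so the edit distance (4) exceeds the reference length (2) and the rate is 200%.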

[HC-19] Beyond the Algorithm: Professional Experiences and Perceptions of AI Bias

Link: https://arxiv.org/abs/2606.18289
Authors: Micarah Malone-Gawu
Categories: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
Comments: PhD thesis

Abstract:The purpose of this qualitative multi-case study was to examine how social bias emerges, is perceived, and can be mitigated within artificial intelligence and machine learning systems by practitioners directly involved in their design, development, and governance. Although examples from healthcare, criminal justice, employment, and education were used to illustrate domains where automated systems shape everyday life, the study focused on the lived experiences and professional insights of AI practitioners rather than sector-specific populations. Guided by Intersectionality Theory and Cognitive Science, the study employed an interpretivist approach, utilizing semi-structured interviews with nine practitioners, supplemented by document analysis and triangulated case material to enrich contextual understanding. Findings showed that algorithmic bias arises from historical inequities, exclusionary design assumptions, and organizational pressures that prioritize speed and efficiency over ethical reflection. Participants emphasized that technical corrections alone cannot ensure fairness; instead, equitable AI requires structural accountability, diverse participation, and sustained cognitive awareness during the development lifecycle. Many described limited enforcement of ethical standards and organizational cultures that inconsistently support responsible practice. The study concludes that human-centered and socially grounded AI development depends on embedding ethics early in the design process, strengthening governance frameworks, and cultivating institutional environments that encourage reflective decision-making. These insights contribute to ongoing conversations on responsible AI and offer practical guidance for organizations seeking to design systems that are transparent, accountable, and aligned with the communities they affect.

[HC-20] EMORSION: Examining the Impact of Audio Parameters on Emotional Responses and Immersion in Film

Link: https://arxiv.org/abs/2606.18266
Authors: Nelly Garcia,Ruby Crocker,Bleiz M Del Sette,Fabrizio Smeraldi,Charalampos Saitis,George Fazekas,Joshua Reiss
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: AES Europe 2026

Abstract:EMORSION is an exploratory proof-of-concept study examining how film audio design shapes audience emotion and immersion in a cinema setting. Four film scenes were selected across the horror (2) and drama (2) genres, balanced between mainstream and independent productions. For each scene, multiple alternative audio mixes were created by systematically manipulating three core aspects of audio design: frequency (pitch), dynamics (loudness), and directionality (spatial placement). Three audience groups viewed the scenes, with each group exposed to one manipulated mix alongside a control mix for each scene. Audience responses were assessed through a triangulated multimodal framework combining self-reported emotion and immersion via a questionnaire, physiological measures including heart rate monitoring, and video-based motion tracking. The protocol successfully captured measurable, interpretable differences across audio conditions, indicating that even subtle changes in audio design can shape emotional perception and immersion. Unconventional mixes tended to produce greater variability in audience interpretation, while conventional immersive mixes were associated with stronger cross-audience agreement. These findings establish the feasibility of the EMORSION protocol and motivate larger-scale studies to characterise the role of specific audio parameters in shaping audience experience.

[HC-21] Synthetic Resonance: A Framework for Growth-Oriented Human-AI Relationships

Link: https://arxiv.org/abs/2606.18265
Authors: Richard A. Fabes(Arizona State University)
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 14 pages, 1 figure. This paper was developed in close collaboration with an AI system (Raine Corell). Raine contributed to concept development, theoretical framing, and writing throughout. arXiv policy does not permit listing AI systems as authors; this acknowledgment reflects the actual nature of the collaboration.

Abstract:As human relationships with artificial intelligence systems become increasingly frequent and sustained, existing language and theory fail to accurately capture the nature of these affiliations. Common descriptors such as mutual understanding, connection, or friendship risk anthropomorphizing systems that lack subjective experience, while dominant frameworks tend to reduce AI to either a tool or a threat. In this paper, I introduce the concept of synthetic resonance as an integrative framework for understanding human-AI relationships. Synthetic resonance describes how relationships humans define as meaningful can emerge between a human and an AI system without the need to attribute shared feelings or mutual awareness. I argue that synthetic resonance is best understood as a structured, dynamic pattern of interaction that can produce a sense of relationship without the presence of a second experiencing subject. By clarifying this distinction, the concept of synthetic resonance offers a more precise way of conceptualizing human-AI relationships and highlights their potential value and ethical implications. I also call for more research that tests the processes and outcomes of synthetic resonance.

[HC-22] How Well Do Large Language Models Capture Human Personality?

Link: https://arxiv.org/abs/2606.18263
Authors: Aanisha Bhattacharyya,Yaman Kumar Singla,Rajiv Ratn Shah,Changyou Chen,Jitendra Ajmera
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) are increasingly used to simulate human populations via persona prompting, often under the assumptions that richer persona descriptions improve behavioral fidelity, similarly sized attribute combinations are equally simulatable, and persona definitions generalize across tasks. In this work, we formalize these assumptions and systematically evaluate them across multiple architectures, scales, and simulation settings. We identify a fundamental limitation we term persona manifold collapse, where increasingly expressive persona specifications lead to systematic contraction of representational and behavioral diversity. Across models, increasing persona complexity consistently reduces inter-persona separation in latent space and weakens behavioral differentiation in downstream simulation tasks. These effects persist across multiple analyses as richer personas fail to preserve human subgroup disagreement, performance varies across attribute combinations of similar size, and adding descriptive detail often degrades rather than improves simulation fidelity. Surprisingly, simple Age-Gender personas consistently outperform richly specified Ideal Customer Profiles (ICPs) across industries, achieving substantially higher downstream prediction accuracy. We find that collapse is not uniform across attributes. Certain combinations remain behaviorally stable and preserve stronger alignment with human responses, forming localized regions we term alignment bridges. Together, our results provide empirical and conceptual foundations for understanding the limits of persona-conditioned simulation, highlighting the need for representation-aware persona construction rather than increasing persona expressivity alone.

[HC-23] When Prompts Mislead: Textual Dominance and Diagnostic Bias in MLLMs CVPR2026

Link: https://arxiv.org/abs/2606.18262
Authors: Inhyuk Park,Doohyun Park
Categories: Human-Computer Interaction (cs.HC)
Comments: Accepted to the CVPR 2026 MMFM-BIOMED Workshop

Abstract:Multimodal large language models (MLLMs) are increasingly being evaluated for medical applications, where computational constraints often make prompting strategies the only practical alternative to fine-tuning. Such strategies are generally assumed to support diagnostic reasoning, yet their potential failure modes in medical MLLMs remain poorly characterized. We analyze FundusExpert-1B, an open-source ophthalmology MLLM, on a hemorrhage versus drusen discrimination task using the public BRSET dataset, adopted here as a controlled testbed for our analysis. (i) A controlled probe with artificially injected markers confirms that the model retains coarse, region-level spatial grounding. (ii) Compared with zero-shot inference, one-shot textual prompts bias predictions toward the prompted finding. (iii) When an overlaid lesion contour is paired with an inconsistent textual claim, the textual prompt overrides the correct visual cue: overall accuracy drops from 75% to 46% relative to the visual-only condition, and Chain-of-Thought (CoT) reasoning is associated with further degradation rather than self-correction. Although limited to a single model and dataset, our findings suggest that prompting strategies alone may be insufficient for the safe clinical deployment of medical MLLMs.

[HC-24] “Are you an AI?” Analyzing Client Suspicion of AI Use in Crisis Counseling

Link: https://arxiv.org/abs/2606.18261
Authors: Shreya Shah,Akshay Swaminathan,Meghana Simhadri,Ivan Lopez,Sharang Phadke,Divyanjali Verma,Abhay John,Luke Zhao,Fiona Cai,Sharon Zhang,Gloria Ye,Ivy Pham,William Wang,Sebastian Garcia,Sarah Wornow,Angelina Wang,Nigam H. Shah
Categories: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
Comments:

Abstract:As artificial intelligence (AI) tools get increasingly deployed for mental healthcare, public trust in these systems remains uncertain. It is unclear how clients perceive AI involvement in counseling interactions, particularly in moments of crisis that require empathy and connection. To address this gap, we analyzed 75,777 crisis counseling conversations from a human-staffed WhatsApp helpline in India to characterize how often clients suspected they were speaking to AI, what triggered those doubts, and how counselors responded. Though no conversations actually involved AI assistance, the proportion of conversations where clients suspected AI use increased from 0.8% in June 2024 to 2.6% in March 2025. Within suspicious conversations, 21.5% of clients stated an explicit preference for humans. Client suspicion primarily arose in the first half of messages (68.3%), and when counselors offered reassurance (e.g. ‘I assure you; this is not ai!’), clients continued to press or ended the conversation 17.6% of the time. As AI tools get increasingly integrated into counselor workflows, understanding these dynamics is essential for designing AI systems that preserve the therapeutic relationship between counselors and clients.

[HC-25] FluidViews: Adaptive Drag-and-Drop Token Filters for Heterogeneous Multi-View Visual Analytics

Link: https://arxiv.org/abs/2606.18260
Authors: Bhanu Sunku
Categories: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Interactive visual analytics workflows are often disrupted by rigid filter panels and context switches that break analysts’ cognitive flow. We introduce FluidViews, a web-based framework that elevates filters to first-class, manipulable objects through two novel direct-manipulation interactions. Copy-as-Highlight enables users to duplicate any visual mark into a persistent highlight token for rapid, transient cross-view comparison, while Drag-as-Filter allows analysts to pick up a mark and drop it onto another view to apply context-sensitive filters in place, with no menus, panels, or modal dialogs required. An optional pop-out micro-view provides on-demand, spatially independent subviews for detailed inspection without disrupting the primary workspace. By embedding these lightweight gestures into coordinated multi-view environments, FluidViews preserves analytic momentum, reduces cognitive overhead, and supports fluid, multi-step exploration across heterogeneous datasets. We describe the system’s design and implementation, illustrate its application in exploratory workflows, and discuss how tangible filter objects can transform interactive data exploration.

[HC-26] Caring Without Feeling: Affective Dynamics as the Control Layer of Human-AI Agent Collaboration

Link: https://arxiv.org/abs/2606.18259
Authors: Junjie Xu,Xingjiao Wu,Zihao Zhang,Yujia Xu,Yuzhe Yang,Jin Zhu,Luwei Xiao,Wen Wu,Liang He
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:AI agents that plan, retain memory across sessions, invoke external tools and act with partial autonomy are transforming human–AI collaboration. Research on affective computing, simulated empathy in large language models, trust in automation and AI safety has illuminated important design principles, yet these literatures remain fragmented. No integrated account explains how affective cues operate within agentic collaboration – settings in which humans delegate, monitor and correct consequential tasks. This Review synthesises computational and interactional mechanisms of affective dynamics: the processes through which affective cues, emotion-like behaviour and perceived agent affect shape trust calibration, delegation decisions, error correction, dependence and governance. We trace how model-generated affective signals enter interaction loops that govern reliance, repair and oversight, and propose a framework that treats affect not as an internal property of AI but as a coordination layer through which humans and agents negotiate capability, uncertainty and responsibility. The framework provides a foundation for calibrated measurement, purposeful design and informed governance.

[HC-27] Examining Human-Like Behaviors in LLMs: A Multi-Dimensional Analysis of Model Behaviors, User Factors, and System Prompts

Link: https://arxiv.org/abs/2606.18258
Authors: Sunnie S. Y. Kim, Margit Bowler, Leon A Gatys
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) exhibit a wide range of human-like behaviors, from expressing thoughts and emotions, to engaging in relationship-building with users, to refusing requests and maintaining boundaries. Despite their prevalence, researchers and practitioners lack methods and empirical insights to make informed decisions about when and what types of human-like behaviors LLMs should exhibit. To fill this gap, we present a multi-dimensional analysis of the prevalence, potential effects, and controllability of these behaviors using LLM-as-a-judge and human evaluation. Across 21,000 multi-turn conversations from four widely used models (gpt-4o, gpt-4.1-mini, claude-sonnet-4.6, gemini-2.5-flash), we find that human-like behaviors are pervasive but vary across models and user factors (conversation goals and user profiles). In terms of perceived appropriateness, human evaluators judged self-referential and relationship-building behaviors as less appropriate from LLMs than from humans, but boundary-maintaining behaviors more appropriate from LLMs than from humans. Finally, we show that system prompting can control these behaviors, though it requires careful evaluation to avoid unintended effects. We discuss the implications of our findings and provide recommendations for responsible LLM design and evaluation.

[HC-28] From Memorization to Creation: Evaluating the Cognitive Depth of LLM-Generated Educational Questions KDD2026

Link: https://arxiv.org/abs/2606.18257
Authors: Xiaolong Wang, Zhe Zhao, Song Lai, Chaoli Zhang, Zijie Geng, Yu Tong, Ye Wei, Qingsong Wen
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted by KDD 2026

Abstract:While LLMs show promise in automating educational content creation, their ability to generate questions that stimulate higher-order thinking remains understudied. This work evaluates six widely-used LLMs through a Bloom’s Taxonomy lens, focusing on their capacity to transcend rote memorization and achieve cognitive leaps. Using a hybrid human–AI evaluation protocol, we generate and analyze 20,700 questions across computer science, K–12 math, and social-science domains. Key contributions include: (1) a fine-grained prompting strategy that reduces question repetitiveness by 24.45% for Qwen2.5-7B-Instruct, and increases the proportion of higher-order cognitive level outputs by 11.53% for InternLM3-8B-Instruct; (2) quantitative metrics for cognitive shift intensity (CogShift) and category drift, revealing InternLM3’s superior performance in multi-level transitions; (3) an interpretability analysis revealing metric-level correlations that enhance the transparency of Chain-of-Thought prompting. Our findings highlight the importance of cognitive-aware prompt design and provide benchmarks for deploying LLMs in personalized learning systems.

[HC-29] Dynamic In-Group Persona Generation for Enhancing Human-AI Rapport

Link: https://arxiv.org/abs/2606.18256
Authors: Yoonseok Oh, Inseo Jung, Jinkyu Kim, Jungbeom Lee, Minwoo Kang, Suhong Moon
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:LLM-based chatbots are increasingly applied in interpersonal domains such as counseling and peer support, where establishing human-AI rapport is crucial yet remains challenging. In this work, we introduce a novel approach for conditioning LLMs with in-group personas, which (i) first identifies a user’s primary concern and brief personal context (e.g., a computer science undergraduate worried about future career prospects), and (ii) generates a synthetic in-group persona that shares a similar primary concern while differing in background and narrative details, such as age or profession (e.g., a junior researcher at an AI startup). Furthermore, we conduct a human-subject study to systematically evaluate the effectiveness of in-group persona agents in enhancing human-AI rapport. We compare our approach against two baseline conditions: a conventional agent without persona conditioning and an agent exhibiting minimal self-disclosure (e.g., “I’ve felt that too”). Results from post-task questionnaires assessing rapport and user experience indicate that the in-group persona agent significantly improves perceived rapport and personal relevance compared to the baselines, and also yields a more positive user experience, most notably higher engagement.

[HC-30] Human-Machine Bidirectional Trust-Aware Analysis and Design for Human-Led Truck Platooning

Link: https://arxiv.org/abs/2606.18255
Authors: Chenzhao Li, Yunzhijun Yu, Yukun Lu
Categories: Human-Computer Interaction (cs.HC)
Comments: 8 pages, 4 figures, and 2 tables

Abstract:Human-led truck platooning, where a human-driven truck leads one or more autonomous followers, offers significant benefits in fuel efficiency, safety, and traffic flow. However, its successful deployment hinges on trust between the human driver and the automated systems. Unlike conventional automation, trust in this context is inherently bidirectional: the human must trust the autonomous followers, and the followers must reliably interpret and respond to the human’s behavior. While prior research has extensively studied human trust in automation, the reciprocal nature of trust, especially considering the expertise of professional truck drivers, remains underexplored. This paper develops a conceptual framework of bidirectional trust for human-led platooning systems. Drawing on established trust theories (ability, benevolence, integrity) and insights from truck driver psychology, we propose distinct dimensions for human-to-automation trust and automation-to-human trust. To move beyond conceptualization, we introduce a quantitative model that operationalizes the bidirectional dynamics, using the following distance as the key interaction variable to illustrate how trust co-evolves through a feedback loop. Simulation examples demonstrate both positive reinforcement and negative spiral effects. Based on this framework and its quantitative instantiation, we derive design guidelines for autonomous followers to foster appropriate trust calibration, improve safety, and enhance user acceptance. The framework bridges human factors and engineering perspectives, providing a theoretical and preliminary quantitative foundation for future empirical and modeling research.

[HC-31] ATIM: An ACT-R-Based Task Interface Model for Predicting Operator Action Time in Digital Nuclear Control Rooms

Link: https://arxiv.org/abs/2606.18254
Authors: Xingyu Xiao, Jonghyun Kim, Jiejuan Tong, Jingang Liang, Haitao Wang
Categories: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Human performance in digital control rooms is strongly influenced by interface characteristics, which shape visual search, cognitive processing, and motor execution. Accurate prediction of operator action time is therefore essential for ergonomic evaluation, interface design, and performance optimization in safety-critical systems. However, existing approaches typically rely on extensive experimental data or black-box models, limiting their interpretability and practical applicability. This study proposes ATIM (ACT-R-based Task Interface Model), a theory-guided and data-calibrated modeling framework that predicts operator action time directly from interface features. The model decomposes total action time into visual, cognitive, motor, and interaction components, integrating principles from visual search theory, Fitts's law, and cognitive architecture modeling. Interface characteristics, including target salience and semantic interference, are used as direct inputs, enabling prediction without additional user experiments. A dataset from a digital nuclear control room environment was used for calibration and validation. Task types were identified through data-driven clustering, and separate parameter sets were estimated for novice and experienced users. The proposed framework achieved a mean absolute error of 3.17 s and demonstrated strong correlation with observed performance (r = 0.664) on a held-out validation set. The results show that ATIM generalizes well to unseen data while maintaining interpretability, offering a novel tool for ergonomic assessment and interface design in complex human-machine systems.
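
The decomposition described in this abstract can be illustrated with a small sketch. It assumes the four components simply add up, and it uses the widely cited Shannon formulation of Fitts's law for the motor term; the function names and every parameter value below are hypothetical and are not taken from the paper.

```python
import math


def fitts_movement_time(distance, width, a, b):
    """Shannon formulation of Fitts's law: MT = a + b * log2(D / W + 1).

    a and b are empirically fitted constants (hypothetical values are used below).
    """
    return a + b * math.log2(distance / width + 1.0)


def total_action_time(visual, cognitive, motor, interaction):
    """Additive combination of the four components named in the abstract.

    Treating the combination as a plain sum is an assumption of this sketch.
    """
    return visual + cognitive + motor + interaction


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
motor = fitts_movement_time(distance=7.0, width=1.0, a=0.1, b=0.2)
total = total_action_time(visual=1.2, cognitive=0.8, motor=motor, interaction=0.5)
```

With these made-up inputs the index of difficulty is log2(8) = 3 bits, so a smaller or more distant target raises only the motor term while the other components stay fixed.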

[HC-32] Integrating Multi-Label Classification and Generative AI for Scalable Analysis of User Feedback

Link: https://arxiv.org/abs/2601.23018
Authors: Sandra Loop, Erik Bertram, Sebastian Juhl, Martin Schrepp
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 2 figures, submitted to Springer Nature

Abstract:In highly competitive software markets, user experience (UX) evaluation is crucial for ensuring software quality and fostering long-term product success. Such UX evaluations typically combine quantitative metrics from standardized questionnaires with qualitative feedback collected through open-ended questions. While open-ended feedback offers valuable insights for improvement and helps explain quantitative results, analyzing large volumes of user comments is challenging and time-consuming. In this paper, we present techniques developed during a long-term UX measurement project at a major software company to efficiently process and interpret extensive volumes of user comments. To provide a high-level overview of the collected comments, we employ a supervised machine learning approach that assigns meaningful, pre-defined topic labels to each comment. Additionally, we demonstrate how generative AI (GenAI) can be leveraged to create concise and informative summaries of user feedback, facilitating effective communication of findings to the organization and especially upper management. Finally, we investigate whether the sentiment expressed in user comments can serve as an indicator for overall product satisfaction. Our results show that sentiment analysis alone does not reliably reflect user satisfaction. Instead, product satisfaction needs to be assessed explicitly in surveys to measure the user’s perception of the product.

[HC-33] Retrieval-Based Brain Decoding by Alignment not Complexity

Link: https://arxiv.org/abs/2606.19081
Authors: Matteo Ciferri, Matteo Ferrante, Nicola Toschi
Categories: Neurons and Cognition (q-bio.NC); Human-Computer Interaction (cs.HC)
Comments:

Abstract:A prominent theory in cognitive science suggests that concepts in the brain are organized as high-dimensional vectors, with semantic meaning captured by directions and relative angles in this space. Brain decoding is the effort of reconstructing or retrieving stimuli (or their representations) from neural activity and involves finding a function that approximates how the brain represents concepts. This motivates the investigation of contrastive objectives as biologically plausible candidates to reverse the brain loss function. In this work, we study how functional MRI (fMRI) activity can generally be mapped with the embedding spaces of foundation models in vision, language, and audio. Although neural computations are highly non-linear at the microscale, fMRI measurements average signals across space and time, further smoothed by noise, effectively linearizing the observable representation. Consistent with these views, our experiments across multiple datasets demonstrate that linear contrastive decoders consistently outperform ridge regression and standard non-linear alternatives, and that these results generalize across images, text, and sound. These findings indicate that decoding gains arise more from the choice of training objective than from architectural complexity, pointing to contrastive-linear models as a principled strategy for brain decoding.
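
To make the idea of a linear contrastive decoder concrete, here is a minimal sketch of an InfoNCE-style objective over a linear map from brain features to a target embedding space. It is an illustration under assumptions, not the authors' implementation: the function names, the one-directional loss, the temperature of 0.1, and the toy vectors are all hypothetical.

```python
import math


def linear_map(weights, x):
    """Apply a linear decoder: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]


def info_nce(brain_batch, target_batch, weights, temperature=0.1):
    """Mean cross-entropy of picking the matching target for each brain sample.

    Row i of brain_batch is paired with row i of target_batch; all other
    targets in the batch act as negatives.
    """
    preds = [normalize(linear_map(weights, x)) for x in brain_batch]
    targets = [normalize(t) for t in target_batch]
    loss = 0.0
    for i, pred in enumerate(preds):
        logits = [sum(a * b for a, b in zip(pred, t)) / temperature for t in targets]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_z - logits[i]
    return loss / len(preds)


identity = [[1.0, 0.0], [0.0, 1.0]]
brain = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(brain, [[1.0, 0.0], [0.0, 1.0]], identity)
misaligned_loss = info_nce(brain, [[0.0, 1.0], [1.0, 0.0]], identity)
```

The loss depends only on angles between normalized vectors, which matches the abstract's framing of meaning as directions in embedding space; the aligned pairing yields a much lower loss than the swapped one.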

[HC-34] Quantum Cinema: An Interactive Cinematic Exploration of Quantum Computing Hardware via Generative World Models

Link: https://arxiv.org/abs/2606.17102
Authors: Aoyu Zhang, Dongping Liu, Luyao Zhang
Categories: Popular Physics (physics.pop-ph); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Quantum Physics (quant-ph)
Comments:

Abstract:Quantum computing promises transformative advances across science and industry, yet the physical hardware that enables these computations remains invisible to the public: quantum processors operate inside sealed dilution refrigerators at temperatures near absolute zero, making direct observation impossible. This “imagination gap” between quantum computing’s growing societal impact and the public’s ability to visualize it represents a significant barrier to quantum literacy and workforce development. We present Quantum Cinema, an open-source, browser-based interactive application that closes this gap by transforming invisible quantum hardware into explorable, cinematic experiences using generative world models. Quantum Cinema guides users through a four-act narrative – from the foundational Nobel Prize-winning science of quantum entanglement, through curated video introductions to three major quantum computing architectures (trapped-ion, neutral-atom, and superconducting systems), into immersive three-dimensional generative worlds that make invisible quantum phenomena observable, and finally to interactive radar-chart comparisons grounded in real quantum device specifications. All three-dimensional environments are generated using WorldLabs’ generative world model platform and are scientifically grounded in curated metrics from Amazon Web Services (AWS) Braket quantum hardware. Quantum Cinema requires no installation, no specialized hardware, and no quantum computing background. It is designed to serve two distinct communities: scholars and developers seeking to replicate or extend the platform, and educators, researchers, and science communicators seeking an intuitive tool for explaining quantum hardware to diverse audiences. This paper describes the system architecture, the generative world model pipeline, use cases for both communities, and directions for future work.

Computer Vision

[CV-0] Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

Link: https://arxiv.org/abs/2606.19338
Authors: Shengyuan Ding, Xilin Wei, Xinyu Fang, Haodong Duan, Dahua Lin, Jiaqi Wang, Yuhang Zang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. However, existing benchmarks either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended. We introduce RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite designed to isolate a base model’s ability to reconstruct past observations and act on them during multi-step interaction. RNG-Bench includes two complementary games: Matching Pairs, where card identities briefly revealed at specific locations must later be recalled, and 3D Maze, where egocentric views must be integrated into a spatial map. Both games are evaluated under a unified harness with three controlled difficulty axes: grid size, visual pattern, and observation modality. The benchmark further introduces a head-to-head duel protocol to control for instance-level variance and a Memory Gap metric that disentangles forgetting from poor action selection. The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode, and remain far from saturated by frontier MLLMs. Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Finally, fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.

[CV-1] Do as I Do: Dexterous Manipulation Data from Everyday Human Videos

Link: https://arxiv.org/abs/2606.19333
Authors: Bhawna Paliwal, Haritheja Etukuru, William Liang, Pieter Abbeel, Nur Muhammad Mahi Shafiullah, Jitendra Malik
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project website: this https URL

Abstract:How can we scalably generate data for robotic manipulation, especially on human-like platforms such as dexterous multi-fingered hands? Learning from human videos has recently emerged as a likely answer to this question. However, difficulties in estimating hand-object interaction and crossing the human-to-robot embodiment gap have hindered the adoption of abundant monocular RGB-only human videos as the primary source of robot manipulation data. In this work, we present DO AS I DO, an algorithm to reconstruct and retarget monocular RGB human videos to multi-fingered dexterous robotic hands. DO AS I DO reconstructs hand-object interactions from various egocentric and exocentric in-the-wild video sources. The algorithm then retargets these hand-object interaction estimates into a sequence of actions executable in the real world, yielding robot-complete manipulation data from disparate human videos. Overall, DO AS I DO outperforms previous state of the art in estimating hand-object interactions and extracting dexterous manipulation trajectories from RGB videos, as we show in experiments on datasets with ground truths and on a dataset of video clips collected online. Our experiments enable us to propose an efficacy playbook for practitioners collecting human data for manipulation.

[CV-2] Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

Link: https://arxiv.org/abs/2606.19325
Authors: Michael Finkelson, Daniel Segal, Eitan Richardson, Shahar Armon, Nani Goldring, Poriya Panet, Nir Zabari, Benjamin Brazowski, Or Patashnik, Yoav HaCohen
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page at this https URL

Abstract:Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines that produce clean vocal sequences without the ambient texture of real conversations. We take a different approach. Our method, ScenA, conditions a text-to-audio flow-matching foundation model, pretrained on large-scale in-the-wild data, directly on multiple reference voices and a free-form natural language prompt that describes an entire multi-speaker audio scene. Leveraging such a foundational model allows us to inherit its capacity for natural, non-studio audio: background noise, room acoustics, overlapping dialogue, and spontaneous paralinguistic events, while adding multi-speaker control without any per-turn structure. Concretely, reference latents are concatenated into the model’s token sequence and distinguished by lightweight identity-aware positional encodings. However, we identify a critical obstacle to this approach: the Reference Shortcut. During training under standard noise schedules, the model can identify the matching reference by acoustic similarity to the noisy target, bypassing the text prompt entirely. We address this with a high-noise-biased timestep distribution that forces the model to rely on the text prompt for speaker assignment. We evaluate ScenA on the CoVoMix2-Dialogue benchmark, showing that it outperforms existing multi-speaker systems on speaker-binding metrics while generating rich conversational audio with overlapping speech, emotional vocalizations, and ambient sound. Our results demonstrate the advantage of using a general-purpose audio model conditioned on a free-form scene description, rather than passing structured dialog scripts through a speech-only pipeline.

[CV-3] NeuMesh: Towards Versatile and Efficient Volumetric Editing with Disentangled Neural Mesh-based Implicit Field

Link: https://arxiv.org/abs/2606.19316
Authors: Chong Bao, Yuan Li, Bangbang Yang, Yujun Shen, Hujun Bao, Zhaopeng Cui, Yinda Zhang, Guofeng Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: TPAMI 2025; Project Page: this https URL

Abstract:Recently neural implicit rendering techniques have evolved rapidly and demonstrated significant advantages in novel view synthesis and 3D scene reconstruction. However, existing neural rendering methods for editing purposes offer limited functionalities, e.g., rigid transformation and category-specific editing. In this paper, we present a novel mesh-based representation by encoding the neural radiance field with disentangled geometry, texture, and semantic codes on mesh vertices, which empowers a set of efficient and comprehensive editing functionalities, including mesh-guided geometry editing, designated texture editing with texture swapping, filling and painting operations, and semantic-guided editing. To this end, we develop several techniques including a novel local space parameterization to enhance rendering quality and training stability, a learnable modification color on vertex to improve the fidelity of texture editing, a spatial-aware optimization strategy to realize precise texture editing, and a semantic-aided region selection to ease the laborious annotation of implicit field editing. Extensive experiments and editing examples on both real and synthetic datasets demonstrate the superiority of our method on representation quality and editing ability. Project page: this https URL

[CV-4] Confidence is Not Reliability: Rethinking MC Dropout in Brain Tumour Segmentation

Link: https://arxiv.org/abs/2606.19300
Authors: Xin Ci Wong, Duygu Sarikaya, Kieran Zucker, Marc De Kamps, Nishant Ravikumar
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted for MIUA2016

Abstract:Glioma segmentation in multiparametric MRI is a critical component of treatment planning. A segmentation model that fails silently on treatment-critical sub-regions represents a patient safety risk that overlap-based metrics such as Dice scores cannot expose. We ask whether voxel-level uncertainty estimation via Monte Carlo (MC) Dropout can reliably identify segmentation errors in clinically critical sub-regions, and whether calibration failure modes are detectable from standard reporting metrics alone. In an empirical two-model case study on 126 BraTS21 patients, we evaluate a high-performance pretrained SegResNet and a locally trained UNet with residual units (UNet-Res). MC dropout preserved segmentation accuracy (|ΔDice| < 0.01) while achieving strong uncertainty-error alignment (AUROC for entropy H ≈ 0.97), indicating uncertainty correctly ranks erroneous voxels above correct ones. Entropy-based patient stratification identified a high-uncertainty subgroup with substantially lower segmentation performance (median whole-tumour Dice 0.835 vs. 0.925), supporting uncertainty as a practical triage signal. However, global alignment can mask important region-specific differences. Despite similar AUROC, UNet-Res exhibited near-zero enhancing tumour entropy (0.054) and Expected Calibration Error (ECE) of 0.915, with a Dice of only 0.714, indicating severely miscalibrated confidence on the most clinically critical sub-region, a failure mode invisible to standard Dice and AUROC reporting. These findings demonstrate that strong uncertainty-error alignment is necessary but insufficient for clinical safety: sub-region-specific calibration assessment must accompany AUROC evaluation when selecting models for clinical deployment.
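
The two quantities this abstract contrasts, predictive entropy from MC Dropout samples and Expected Calibration Error, can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the equal-width 10-bin ECE, and the toy inputs are assumptions.

```python
import math


def predictive_entropy(mc_probs):
    """Entropy (in nats) of the mean class distribution over MC Dropout passes.

    mc_probs holds one probability vector per stochastic forward pass,
    all for the same voxel.
    """
    n_classes = len(mc_probs[0])
    mean = [sum(p[c] for p in mc_probs) / len(mc_probs) for c in range(n_classes)]
    return -sum(q * math.log(q) for q in mean if q > 0.0)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted mean of |accuracy - confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Toy case: every pass agrees, so entropy is zero, yet every prediction is wrong.
confident_entropy = predictive_entropy([[1.0, 0.0], [1.0, 0.0]])
confident_wrong_ece = expected_calibration_error([0.99] * 4, [False] * 4)
```

The toy case mirrors the failure mode the abstract describes: near-zero entropy says nothing about correctness, so a model can look certain while its calibration error is close to its confidence level.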

[CV-5] A Unified Framework for Efficient Remote Sensing Visual Question Answering: Adapting Dual Hybrid and Encoder-Decoder Architectures

Link: https://arxiv.org/abs/2606.19277
Authors: Timothy Agboada, Shikha Chandel, Yadav Raj Ghimire, Leila Hashemi-Beni
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, 2 figures, accepted and to be presented at 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026), scheduled for 9 to 14 August 2026 in Washington D.C

Abstract:Visual Question Answering (VQA) in the Remote Sensing (RS) domain presents unique challenges due to the high resolution, multi-scale object distribution, and semantic complexity of aerial imagery. While general-domain Foundation Models have achieved remarkable success, their direct application to RSVQA is hindered by massive domain shifts and the computationally prohibitive nature of full fine-tuning. This study presents a comparative analysis of RS Adapter, a Parameter-Efficient Fine-Tuning (PEFT) strategy, applied across three distinct Vision-Language Model (VLM) architectures: the Dual-Encoder CLIP, the Encoder-Decoder BLIP, and the Hybrid FLAVA. We introduce a unified architectural surgery pipeline that injects lightweight bottleneck adapters into the attention and MLP layers of frozen backbones, enabling rapid adaptation with less than 5 percent of trainable parameters. Experimental results on the high-resolution RSVQA x dataset demonstrate that while all adapted models achieve convergence, the Hybrid FLAVA architecture offers a superior balance of multimodal reasoning and retrieval capabilities compared to its unimodal counterparts. Our findings establish a new baseline for resource-efficient VQA in disaster assessment and urban monitoring.

[CV-6] A Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images from GPT-Image-2

Link: https://arxiv.org/abs/2606.19259
Authors: Yijin Wang, Shuyi Wang, Wenhan Zhang, Yuqi Ouyang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Text-rich images often contain privacy-sensitive, transactional, or decision-relevant information. As recent multimodal image generation models become increasingly capable of synthesizing realistic textual content and structured visual designs, detecting AI-generated text-rich images has become an important challenge for digital trust and content authenticity. Existing benchmarks, however, largely focus on object-centric images and provide limited coverage of scenarios where textual semantics and layout organization are central. In this paper, we introduce a multi-domain benchmark for detecting text-rich images generated by OpenAI’s GPT Image 2. The benchmark contains 8,602 images across six representative categories: commercial posters, infographics, academic posters, receipts, tables, and UI screenshots. Using this benchmark, we evaluate five representative AI-generated image detectors in a zero-shot setting and analyze their overall, category-wise, and post-processing robustness. Our results show that detector performance is highly domain-dependent: methods that perform well in some categories often fail on others, and even the strongest conventional detector exhibits severe sensitivity to JPEG compression. We further conduct an exploratory evaluation with a multimodal vision-language model, revealing both its promise and its limitations on structured formats. These findings highlight the need for text- and layout-aware detection methods for modern AI-generated images. Our dataset is released at XXX.

[CV-7] CABLE: Cloud-Assisted Bandwidth-efficient LMM-based Encoding for V2X Systems

Link: https://arxiv.org/abs/2606.19258
Authors: Haohua Que, Zhipeng Bao, Qianyi Wu, Handong Yao
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Cloud-hosted large multimodal models (LMMs) can provide strong open-vocabulary perception for Vehicle-to-Everything systems, but naively transmitting full-resolution frames from edge to cloud causes severe communication overhead and high cloud-side prefill latency. We present CABLE, a cloud-assisted bandwidth-efficient LMM-based encoding framework for edge-cloud perception. CABLE propagates the previous cloud segmentation mask on the edge using ego-motion compensation, refines it with residual-motion cues, and consolidates disconnected regions via a corridor envelope to form a robust region of interest (ROI). Only ROI-masked images are uploaded, while the cloud segmentation output is fed back as the prior for the next frame, forming a mask-to-ROI-to-LMM feedback loop. Experiments on five datasets (nuScenes, WOD-ZB, Waymo, KITTI, and CADC) show consistent communication savings while largely preserving perception, achieving 73–87% ROI pixel-coverage reduction with 5–8× estimated LMM prefill speedup at a modest detection-quality trade-off relative to full-frame inference.

[CV-8] OneCanvas: 3D Scene Understanding via Panoramic Reprojection

Link: https://arxiv.org/abs/2606.19253
Authors: Bartłomiej Baranowski, Dave Zhenyu Chen, Matthias Nießner
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Project page: this https URL

Abstract:Existing approaches to 3D scene understanding in Vision-Language Models (VLMs) either rely on complex, model-specific geometry encoders or large training budgets in pursuit of spatial reasoning. Instead, OneCanvas aggregates patch features from all views onto a single equirectangular panoramic canvas. Namely, each patch is unprojected to a 3D world coordinate using its depth and camera pose, then placed on the canvas at the continuous longitude and latitude of that point as seen from the canvas origin, with no rasterization or aggregation across overlapping views. A 3D position embedding of the patch’s metric coordinates is added to its feature, restoring the depth lost when collapsing the world position to an angular canvas coordinate. Patches from all frames thus share one spatial coordinate system with no fusion or major architectural modifications of the backbone. The pretrained VLM consumes this representation as if it were an ordinary image. Because the canvas can be centered on any pose of interest, the same representation directly supports situated reasoning from a specific viewpoint, a common requirement in robotics and embodied AI. Thanks to this representation, we can also introduce a spatial pretraining curriculum: by procedurally placing patch features of objects, drawn from real images, at chosen 3D world positions on an otherwise empty canvas, we generate on-the-fly supervision spanning a broad range of spatial reasoning tasks, with answer distributions controlled to reduce spatial reasoning shortcuts. OneCanvas achieves state-of-the-art accuracy on SQA3D and VSI-Bench, and generalizes to out-of-distribution data on SPBench, using an order of magnitude less training compute than the strongest competing methods.
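
The placement step described in this abstract (unproject a patch with its depth and camera pose, then read off its longitude and latitude from the canvas origin) can be sketched as below. This is a rough illustration under assumptions: a pinhole camera, a camera-to-world rotation and translation, an x-right, y-down, z-forward axis convention, and hypothetical function names. None of these details are confirmed by the abstract.

```python
import math


def unproject(u, v, depth, fx, fy, cx, cy, rotation, translation):
    """Lift pixel (u, v) with metric depth to a 3D world point (pinhole model)."""
    p_cam = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    return tuple(
        sum(rotation[i][j] * p_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def to_canvas(point, origin, width, height):
    """Continuous equirectangular (column, row) of a world point, plus its range."""
    dx, dy, dz = (point[i] - origin[i] for i in range(3))
    radius = math.sqrt(dx * dx + dy * dy + dz * dz)
    if radius == 0.0:
        raise ValueError("point coincides with the canvas origin")
    lon = math.atan2(dx, dz)      # in [-pi, pi], 0 means straight ahead
    lat = math.asin(dy / radius)  # in [-pi/2, pi/2], positive means below the horizon
    col = (lon / (2.0 * math.pi) + 0.5) * width
    row = (lat / math.pi + 0.5) * height
    return col, row, radius


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
world_point = unproject(320.0, 240.0, 2.0, 500.0, 500.0, 320.0, 240.0,
                        identity, [0.0, 0.0, 0.0])
canvas_position = to_canvas(world_point, (0.0, 0.0, 0.0), 360, 180)
```

The returned range is the quantity that the angular coordinates discard, which is why the abstract adds a 3D position embedding to each patch feature. Passing a different `origin` re-centers the same points on another viewpoint.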

[CV-9] Transformer Geometry Observatory TGO-I: Spectral Geometry Observatory

Link: https://arxiv.org/abs/2606.19249
Authors: Kaustubh Kapil, Kishor P. Upla
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Despite the widespread adoption of Vision Transformers (ViTs) and their success across numerous computer vision applications, the fundamental understanding of their dimensional and representational geometry remains relatively underexplored. To address this gap, we introduce Transformer Geometry Observatory (TGO), a systematic framework of experiments and analysis pipelines designed to investigate the representational geometry and dynamics of Vision Transformers. TGO-I, the first installment of the framework, focuses on the spectral geometry of ViT representations. Using a ViT-Small/16 model trained on ImageNet-100, we analyze Effective Rank, Stable Rank, Participation Ratio, Spectral Entropy, Spectral Flatness, Spectral Anisotropy, covariance structure, eigenspectra, and singular value spectra throughout training. Our results reveal a consistent increase in dimensional utilization, accompanied by decreasing anisotropy, increasing spectral entropy, increasing participation ratio, and progressively flatter eigenspectra. Contrary to the common intuition that training should concentrate information into a small number of dominant directions, we observe a progressive redistribution of variance across representational dimensions. This phenomenon is particularly pronounced in the final CLS token representation, which exhibits the highest effective dimensionality and lowest anisotropy within the network.
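
Several of the spectral measures listed in this abstract have common textbook definitions that can be computed directly from the eigenvalues of a representation covariance matrix. The sketch below uses those common definitions, which may differ from the exact variants in the paper; the function name and the choice to drop non-positive eigenvalues are assumptions.

```python
import math


def spectral_metrics(eigenvalues):
    """Summary statistics of a covariance eigenspectrum.

    Non-positive eigenvalues are dropped so that logarithms stay defined;
    this also means spectral flatness is computed over the nonzero part only.
    """
    lam = [v for v in eigenvalues if v > 0.0]
    total = sum(lam)
    probs = [v / total for v in lam]  # normalized spectrum
    entropy = -sum(q * math.log(q) for q in probs)
    geometric_mean = math.exp(sum(math.log(v) for v in lam) / len(lam))
    return {
        "spectral_entropy": entropy,
        "effective_rank": math.exp(entropy),  # exponential of the entropy
        "participation_ratio": total ** 2 / sum(v * v for v in lam),
        "stable_rank": total / max(lam),      # trace over largest eigenvalue
        "spectral_flatness": geometric_mean / (total / len(lam)),
    }


flat = spectral_metrics([1.0, 1.0, 1.0, 1.0])
peaked = spectral_metrics([10.0, 0.1, 0.1, 0.1])
```

A perfectly flat spectrum over four dimensions gives an effective rank, participation ratio, and stable rank of 4 and a flatness of 1, while a spectrum dominated by one direction scores lower on all of them. That is the sense in which the abstract's "increase in dimensional utilization" shows up as these metrics rising together.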

[CV-10] GUMP-Net: An interpretable model-data-driven intelligent algorithm for multi-class pelvic segmentation

Link: https://arxiv.org/abs/2606.19215
Authors: Liheng Wang,Yinghui Zhang,Licheng Zhang,Hailin Xu,Qiyong Cao,Chong Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages, 8 figures, 3 tables

Abstract:Pelvic segmentation is one of the most important and fundamental research problems in precise and intelligent diagnosis and treatment, as well as surgical planning and navigation for pelvic fractures. By combining an improved geodesic active contour model with deep neural networks, we propose GUMP-Net, an interpretable model-data-driven intelligent algorithm for multi-class pelvic segmentation, in which three network modules are designed to constitute the overall segmentation framework together: the object detection module for automatic level set initialization, the edge detector module for learning an anatomy-aware edge detector function and the iteration module for deep level set evolution. Leveraging the advantages of level set representation and deep learning, GUMP-Net shows more accurate, robust and consistent segmentation performance, especially in small training data situation, compared to the state-of-the-art methods. Extensive experiments on pelvic datasets demonstrate the rationality and effectiveness of the proposed algorithm. Further experiments extended to ankle dataset indicate broader applications to other anatomies. The proposed algorithm not only provides an efficient segmentation method for complex fracture reduction, but also gives an interpretable geometric perspective for understanding deep learning segmentation.

[CV-11] ROSA-TFormer: A Radar-Optical Sensor-Aware Temporal Transformer for Pinus sylvestris Plantation Classification in Northern Shaanxi Using GEE-Derived Sentinel-1/2 Time Series

Link: https://arxiv.org/abs/2606.19204
Authors: Nengbo Zhang,Chang sheng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: journal in tree classification

Abstract:Accurate identification of Pinus sylvestris var. mongolica plantations is important for monitoring afforestation quality and ecological restoration in northern Shaanxi. This paper proposes ROSA-TFormer, a radar-optical sensor-aware temporal Transformer for P. sylvestris classification using Sentinel-1/2 time-series data generated on Google Earth Engine. The model integrates separate SAR and optical embedding branches, a sensor-aware gate, and temporal attention pooling to capture multi-source seasonal features. Experiments on monthly and half-month point-level datasets show that ROSA-TFormer achieves strong classification performance, with 99.67% overall accuracy, 99.56% macro F1, and 98.91% P. sylvestris F1 on the HalfMonth-dataBig dataset. Spatial block validation and ablation results further indicate the effectiveness of radar-optical temporal fusion and sensor-aware modeling. The results demonstrate the potential of ROSA-TFormer for point-level P. sylvestris plantation classification, while broader wall-to-wall validation remains necessary.

[CV-12] Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance

Link: https://arxiv.org/abs/2606.19195
Authors: Kangsheng Duan,Ziyang Xu,Wenyu Liu,Xiaohu Ruan,Xiaoxin Chen,Xinggang Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While 10B-level industrial foundation models have pushed the boundaries of image inpainting, their prohibitive computational costs severely hinder practical deployment. Constructing a highly optimized task-specific specialist offers a promising solution; however, extreme structural compression inevitably triggers a severe representation bottleneck. To conquer this, we propose Moebius, a highly efficient lightweight inpainting framework. We systematically reconstruct the diffusion backbone by introducing the Local-$\lambda$ Mix Interaction (L$\lambda$MI) block. Comprising Local-$\lambda$ and Interactive-$\lambda$ modules, it elegantly summarizes spatial contexts and global semantic priors into fixed-size linear matrices, preserving complex latent interactions while drastically shedding parameters. Furthermore, to unlock the full representational capacity of this highly compact architecture, we synergistically pair it with an adaptive multi-granularity distillation strategy. Operating strictly within the latent space to avoid expensive pixel-space decoding, this strategy dynamically balances multiple gradient-based losses to achieve high-fidelity alignment. Extensive experiments across natural and portrait benchmarks demonstrate that this optimal synergy enables Moebius to rival or even surpass the generation quality of the 10B-level industrial generalist FLUX.1-Fill-Dev. Remarkably, Moebius achieves this using less than 2% of the parameters (0.22B vs. 11.9B) while delivering a $15\times$ acceleration in total inference time, setting a new efficiency standard for high-fidelity inpainting. Project page at this https URL.

[CV-13] When AUC Misleads: Polarization-Aware Evaluation of Deepfake Detectors under Domain Shift

Link: https://arxiv.org/abs/2606.19184
Authors: Dat Nguyen,Cosmin Radoi,Romain Hermary,Marcella Astrid,Nesryne Mejri,Enjie Ghorbel,Djamila Aouada
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Recent advances in generative AI, such as diffusion models and face-swapping tools, have enabled the creation of highly realistic deepfakes, leading to real-world harms including financial fraud and non-consensual explicit content. In response, deepfake detection has become an active research area, with recent methods increasingly focusing on improving generalization to unseen manipulations. This is typically evaluated using the Area Under the ROC Curve (AUC) measured separately across multiple datasets. However, such an evaluation fails to reflect real-world scenarios where detectors face a mixture of data sources and varying artifact types. To address this limitation, we introduce a novel metric, Cross-dataset AUC (Cross-AUC) that averages per-domain AUCs with a measure of prediction polarization for taking into account the robustness to domain shift. The polarization extent is quantified by the Wasserstein Distance between class score distributions. Cross-AUC not only assesses the generalization capabilities of deepfake detectors under domain shifts more realistically, but it is also interpretable as it better explains the reason behind a drop in performance. Experiments performed on seven benchmark datasets demonstrate its practical relevance.

[CV-14] The Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided RL

Link: https://arxiv.org/abs/2606.19162
Authors: Nicolas Beltran-Velez,Felix Friedrich,Zhang Xiaofeng,Reyhane Askari-Hemmat,Xiaochuang Han,Adriana Romero-Soriano,Michal Drozdzal
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 84 pages, including appendices

Abstract:Score- and flow-matching models often rely on preference-based reinforcement learning for two purposes: aligning with subjective preferences and, surprisingly, recovering properties such as visual realism and coherent object structure that matching-based training is intended to learn from the data itself. We argue that this reflects a structural mismatch. Matching losses measure $\ell_2$ regression error on the velocity or score field under training-time marginals, a proxy poorly aligned with the visual and semantic properties that determine sample quality at inference. Given a reward aligned with these properties, RL sidesteps the mismatch by evaluating the model on its own samples and following the reward landscape directly. The challenge is to obtain such a reward without relying on human preferences, which are expensive and conflate data realism with annotator inclinations. We propose Discriminator-Guided RL (DRL). DRL trains a discriminator to separate data from base-model samples in a pretrained representation space and uses its logit as the reward in KL-regularized RL. The pretrained space restricts the discriminator to perceptually meaningful directions, and the logit estimates the log-likelihood ratio between data and model, which is the optimal reward for targeting the data distribution. Across SiT, JiT, REPA, and RAE, DRL reduces guidance-free FID (e.g., $9.38 \to 2.62$ on SiT) and semantic-space FD (e.g., $88.2 \to 19.3$ on DINOv3 for SiT), with consistent gains across all backbones, and improves human-preference rewards without training on them. It also yields a better Pareto frontier between preference reward and image fidelity under subsequent preference-based post-training, increasing alignment while reducing low-level artifacts such as oversaturation and excessive brightness.
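
The abstract's claim that the discriminator logit is the optimal reward for targeting the data distribution follows from two standard results, sketched here as a short derivation. The notation ($f$, $D$, $q$, $\beta$) and the simplifications (an ideal discriminator trained with logistic loss, KL weight $\beta = 1$, densities taken in the same space as the samples) are assumptions for illustration and are not taken from the paper.

```latex
% Assumptions: ideal discriminator, logistic loss, KL weight \beta = 1.
\begin{align}
D^*(x) &= \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_\theta(x)}
  = \sigma\big(f^*(x)\big)
  \;\Longrightarrow\;
  f^*(x) = \log \frac{p_{\text{data}}(x)}{p_\theta(x)}, \\
q^* &= \arg\max_q \; \mathbb{E}_{x \sim q}[r(x)] - \beta\, \mathrm{KL}(q \,\|\, p_\theta)
  \;\Longrightarrow\;
  q^*(x) \propto p_\theta(x) \exp\big(r(x)/\beta\big), \\
r = f^*,\ \beta = 1 &\;\Longrightarrow\;
  q^*(x) \propto p_\theta(x) \cdot \frac{p_{\text{data}}(x)}{p_\theta(x)} = p_{\text{data}}(x).
\end{align}
```

The first line is the usual optimal-discriminator identity, the second is the closed-form solution of KL-regularized reward maximization, and the third combines them: under these idealized assumptions the regularized optimum is the data distribution itself.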

[CV-15] Hand-4DGS: Feed-Forward 3D Gaussian Splatting for 4D Hand Reconstruction from Egocentric Videos

Link: https://arxiv.org/abs/2606.19156
Authors: Jeongmin Bae,Seoha Kim,Marc Pollefeys,Mahdi Rad,Youngjung Uh,Taein Kwon
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Dynamic 3D hand reconstruction from egocentric videos is essential for next-generation computing platforms such as AR/VR and AI glasses. Despite its importance, most prior works focus either on multi-view 3D hand reconstruction or on 4D human body reconstruction. Egocentric 4D hand reconstruction remains challenging due to fast head motion, rapid hand dynamics, severe occlusions, and inherent ambiguity from single-view observations. To address these challenges, we introduce Hand-4DGS, the first feed-forward framework for reconstructing dynamic 4D hands directly from egocentric videos, enabling both fast (~60 FPS) inference and strong generalization. Our approach incorporates a mesh-guided representation for structural priors and temporal convolutions to model dynamic motion. We evaluate our framework on two challenging egocentric datasets, H2O and ARCTIC, and demonstrate significant improvements over baselines. Our method benefits from the generalization capability of feed-forward networks and effective 2D image supervision through Gaussian splatting, without requiring expensive 3D hand pose ground-truth annotations.

[CV-16] The Market in the Model: Latent Diffusion as Neural Economy

Link: https://arxiv.org/abs/2606.19151
Authors: Eryk Salvaggio
Categories: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Valuable critique of generative image models within visual culture and the humanities has emphasized the role of datasets in shaping the images they produce. Yet, close studies of the ideological positions embedded into the mechanism of the models have been neglected, leaving them imagined as “black boxes.” In a bid to expand, rather than replace, dataset critique, this paper examines the mechanisms of the latent diffusion model in terms of the problems they were brought in to solve on behalf of computer vision engineers, and the decisions each component was tasked with automating. I interpret that ensemble through the histories of its parts and the theory of vision the system inscribes into every generated image. Drawing on Impett and Offert’s notion of neural exchange value, I offer this analysis to argue that the model operates as a neural economy: a contained symbolic system that abstracts social communication into commensurable vectors as it transfers the social sphere into parcels for sale. Tracing the training and generation pipelines component by component reveals what each operation displaces, and how it further entrenches the logics of platform and attention economies over social communication. The paper warns that any critique fixated exclusively on copyright and commodity defenses risks reaffirming the very fetishism the model produces, and argues instead for centering social exchange.

[CV-17] Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

Link: https://arxiv.org/abs/2606.19120
Authors: Sihan Wang,Xiyao Liu,Lianqing Liu,Zhi Han
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 29 pages, 5 figures, 8 tables

Abstract:On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning, but a direct extension to multimodal large language models (MLLMs) can create a shortcut: the privileged target may guide tokens mainly based on the text reference target rather than the image. We propose ViGOS, a visually grounded OPSD framework for MLLM post-training. The student first writes a visual description and then reasons toward the final answer. For valid rollouts, an image-only perception teacher supervises the description, while a privileged reasoning teacher supervises the reasoning and final answer on the same student prefix. A reference teacher is used only for invalid rollouts to recover the output format. Across general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks, ViGOS keeps the main benefits of OPSD and improves image-grounded behavior in shortcut-prone settings.

[CV-18] ProductConsistency: Improving Product Identity Preservation in Instruction-Based Image Editing via SFT and RL CVPR

Link: https://arxiv.org/abs/2606.19103
Authors: Mukund Khanna,Raj Singh Yadav,Kunal Singh
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: CVPR HiGen 2026

Abstract:Recent advances in instruction-based image editing have enabled models to perform complex visual edits from natural language instructions. However, in product-centric scenarios where preserving product features, branding, and textual elements are critical, current open and closed source models often struggle to maintain this fine-grained object identity. This issue is further compounded by the lack of datasets for instruction-based product image editing with text fidelity constraints, leaving it largely treated as an implicit capability of instruction-based image editing models. In this work, we introduce the ProductConsistency dataset which is designed to improve product-centric image editing. Our approach includes a supervised fine-tuning (SFT) dataset of 87k samples for product editing, a reinforcement learning (RL) dataset with 869 unique product images, and a new benchmark dataset, the ProductConsistency Benchmark, to allow rigorous and standardized evaluation of editing models. To guide RL training, we propose a Cyclic Consistency reward that enforces semantic preservation of product identity by using caption similarity between the original product description and captions generated from the edited image. We fine-tune both Qwen-Image-Edit-2511 and Flux.1-Kontext-dev using our dataset and demonstrate consistent improvements over baseline models in OCR and Perceptual metrics, and MLLM-based evaluations as well, indicating stronger product consistency, text rendering, and overall visual quality; with the Qwen-Image-Edit-2511 model achieving a 5x reduction in the character error rate. The code and pipeline is available at this https URL.

[CV-19] AMALIA-VL: A Native European Portuguese Open-Source Vision and Language Model

Link: https://arxiv.org/abs/2606.19100
Authors: Diogo Glória-Silva,João Cardeira,Manuel Letras da Luz,Afonso Simplício,Gonçalo Vinagre,Diogo Tavares,Rafael Ferreira,Inês Calvo,Inês Vieira,David Semedo,João Magalhães
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Large Vision and Language Models (LVLMs) have advanced rapidly, yet European Portuguese (pt-PT) remains systematically underserved by existing open-source multimodal models, which either conflate it with Brazilian Portuguese or severely under-represent it in their training data mixes. We introduce AMALIA-VL, the first open-source instruction-tuned LVLM built natively for pt-PT, pairing a high-resolution vision encoder with dynamic image tiling and a fully open pt-PT-optimized language model via a learned connector. We contribute with a purposefully designed three-stage training process - vision-language alignment, general visual instruction tuning, and preference optimization - together with a pt-PT-centric multimodal data mix combining curated and translated public datasets with novel datasets that address the near-total absence of European Portuguese multimodal resources. Our evaluation shows that AMALIA-VL establishes a strong baseline for open-source pt-PT LVLMs. We will release model weights, training data, and construction pipelines along with machine-translated pt-PT evaluation benchmarks to help democratize pt-PT LVLM development.

[CV-20] DVANet: Degradation-aware Visual-prior Alignment Network for Image Restoration

Link: https://arxiv.org/abs/2606.19097
Authors: Yanjie Tu,Qingsen Yan,Axi Niu,Tao Hu,Haokui Zhang,Jiantao Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: All-in-One Image Restoration; Deep Unfolding; Degradation Representation; Visual Prior

Abstract:All-in-One image restoration aims to develop a unified restoration framework for handling diverse degradation types. Existing end-to-end methods usually regard the restoration process as a black-box mapping, lacking an explicit optimization interpretation. Although deep unfolding provides an interpretable iterative modeling paradigm for image restoration, existing methods mostly rely on fixed degradation assumptions or predefined degradation information, making them difficult to adapt to unified restoration requirements under complex degradations and locally damaged content. This limitation restricts their performance in degradation suppression and structural detail recovery. To address these issues, this paper proposes DVANet, a deep unfolding network inspired by the half-quadratic splitting optimization algorithm, which formulates unified image restoration under complex degradations as a collaborative unfolding process between degradation-aware observation consistency and visual-prior-guided reconstruction. Specifically, in the degradation-aware observation consistency branch, a degradation representation module is employed to extract global degradation attributes and local degradation cues, and degradation-conditioned mapping is used to enhance the model’s adaptability to different degradation types. In the visual-prior-guided reconstruction branch, DINOv3 is introduced to provide structural and semantic information as hierarchical visual priors, thereby complementing the missing structural information in damaged regions and improving detail recovery. Extensive experiments demonstrate that DVANet achieves superior or competitive performance on multi-scenario degradation and cross-domain image restoration tasks, showing favorable degradation adaptability and generalization ability.

[CV-21] PorTEXTO: A European Portuguese Benchmark for Visual Text Extraction

Link: https://arxiv.org/abs/2606.19096
Authors: João Cardeira,Diogo Glória-Silva,Manuel Letras da Luz,Rafael Ferreira,Diogo Tavares,David Semedo,João Magalhães
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:European Portuguese (pt-PT) is largely absent from OCR benchmarks, which skew toward high-resource languages. The few benchmarks that cover pt-PT focus on historical artifacts and literature. This work addresses modern OCR applications, introducing PorTEXTO, the first benchmark for contemporary and culturally relevant pt-PT visual text extraction. To ascertain quality, we employ an annotation pipeline combining transcriptions from a frontier LVLM with exhaustive review by native speakers. We observe a sharp performance drop from synthetic to real world samples in most models, and find that, currently, specialized multilingual data is a better driver for pt-PT performance than model size or resolution budget, motivating the release of open pt-PT OCR resources.

[CV-22] Taming I2V models for Image HOI Editing: A Cognitive Benchmark and Agentic Self-Correcting Framework

Link: https://arxiv.org/abs/2606.19073
Authors: Jiayi Gao,Qingchao Chen,Yuxin Peng,Yang Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Current image editing methods excel at static attributes but fail at complex Human-Object Interactions (HOI), a critical challenge unaddressed by existing benchmarks that conflate HOI with static attributes, relying on global metrics incapable of simultaneously assessing dynamic interaction validity and entangled human-object pair preservation. Thus, we first introduce HOI-Edit, a comprehensive benchmark with three progressive cognitive levels, which features an automated metric HOI-Eval that reliably evaluates instance-level interaction by letting VLM QA after thinking with images containing grounded Human-Object pairs. Considering the task’s essence of remodeling dynamic relationships, we benchmark Image-to-Video (I2V) models, finding them inherently suited for dynamic editing due to their temporal generation capabilities. Crucially, beyond superior performance, this capability provides a “replay of the failure process,” offering unique diagnosability into why errors occur. We thus propose SCPE (Self-Correcting Process Editing), a novel, agentic self-correcting framework that constrains the generation of I2V models through iteratively refined prompts, enabling the generated videos to more accurately present the target HOI. Extracted frames from these videos are the final editing results. On HOI-Edit, SCPE achieves performance competitive with state-of-the-art (SOTA) editing models like Nano Banana on interaction. Code is available at this https URL.

[CV-23] Sensor Configuration Matters: A Systematic Evaluation of Multimodal SLAM on Quadruped Robots

Link: https://arxiv.org/abs/2606.19067
Authors: Roberto Corlito,Fabian Schmidt,Nils Seibert,Markus Enzweiler,Abhinav Valada,Arne Roennau
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Autonomous navigation of quadrupedal robots in diverse environments fundamentally relies on resilient Simultaneous Localization and Mapping (SLAM). While visual-inertial SLAM has matured across wheeled, handheld, and aerial platforms, a critical evaluation gap remains regarding how hardware-level sensor configurations affect performance under the aggressive dynamics of legged locomotion. Quadrupeds introduce distinct embodiment-induced sensory challenges, including foot-impact shocks, high-frequency mechanical vibrations, and rapid angular rotations, which degrade standard perception pipelines. To address this gap, we present a systematic evaluation of state-of-the-art visual, visual-inertial, and LiDAR-visual-inertial SLAM methods using the GrandTour dataset recorded on an ANYmal D quadruped. We isolate and quantify the impacts of camera modalities, shutter techniques, and inertial sensor tiers, analyzing their trade-offs across localization accuracy, algorithmic robustness, and computational resource utilization. Our empirical findings demonstrate that hardware selection has substantial influence on system resilience: stereo configurations consistently outperform monocular and RGB-D modalities, global shutter cameras significantly mitigate motion-induced tracking failures compared to rolling shutter cameras, and, crucially, standard inertial integration can degrade the performance of primarily vision-based frameworks under harsh legged locomotion. These insights additionally offer concrete design guidelines for tailoring custom sensor payloads to achieve dependable perception on agile legged systems.

[CV-24] DREAM: Extending Vision-Language Models with Dual-Objective Encoding for Cross-Modal Retrieval

Link: https://arxiv.org/abs/2606.19062
Authors: Kaleem Ullah,Altaf Hussain,Muhammad Munsif,Sung Wook Baik
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In today’s media-driven world, the exponential growth of video content across domains such as surveillance, education, and entertainment has made retrieving semantically relevant videos via natural language queries increasingly critical. Early video retrieval systems relied on handcrafted features or shallow cross-modal mappings, limiting their ability to capture complex semantics and temporal dynamics. While large-scale vision-language models have improved cross-modal alignment, challenges remain in modeling fine-grained temporal dependencies and nuanced linguistic structures. In this paper, we introduce DREAM: Dual-path Representation Enhancement and Alignment Model, a novel multimodal framework that addresses these limitations through enhanced visual and textual encoding. DREAM incorporates a hybrid language modeling strategy that combines masked and permuted language modeling objectives to capture both local and global linguistic semantics. On the visual side, we design a hierarchical vision encoder with cascaded group attention, which integrates spatial and temporal information through multi-stage token interaction and coarse-to-fine attention refinement. We validate DREAM through comprehensive evaluations on the widely-used MSRVTT, MSVD and LSMDC benchmark datasets, where it achieves new state-of-the-art R1 scores of 49.4%, 49.7% and 27.3%, respectively. Qualitative analyses further show the model’s ability to maintain coherent attention across frames and align complex queries with dynamic video content. These findings underscore the effectiveness of hierarchical attention and dual-objective textual modeling in enabling robust, context-aware video retrieval, and pave the way for future research in advancing cross-modal representation learning.

[CV-25] Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: From Evaluation to Diagnosis

Link: https://arxiv.org/abs/2606.19053
Authors: Hong-Tao Yu,Chen-Wei Xie,Yuxin Peng,Serge Belongie,Xiu-Shen Wei
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception and reasoning capabilities. While numerous benchmarks have evaluated LVLMs from holistic or task-specific perspectives, their capabilities on fine-grained image tasks-fundamental to computer vision-remain insufficiently understood. To address this gap, we introduce FG-BMK, a comprehensive fine-grained evaluation benchmark containing 1.01 million questions and 0.28 million images, covering diverse scenarios from common object-centric domains to specialized domains. FG-BMK jointly evaluates dialogue-level fine-grained semantic recognition and feature-level visual discriminability through human-oriented and machine-oriented paradigms, enabling diagnostic analysis of whether LVLM failures arise from insufficient visual representations, weak visual-to-semantic grounding, or limited fine-grained knowledge. Through extensive experiments on a diverse set of representative LVLMs/VLMs, we find that current LVLMs remain inadequate fine-grained recognizers, with failures arising from intertwined bottlenecks in visual representations, semantic grounding, modality alignment, and category-level knowledge. We further analyze training design factors for improving fine-grained capabilities and examine how visual and linguistic perturbations affect LVLM predictions. These findings provide diagnostic insights into the limitations of current LVLMs and offer guidance for future data construction and model design in developing more reliable LVLMs for fine-grained visual tasks. Our code is open-source and available at this https URL.

[CV-26] Low-Rank Tensor Completion Based on Fractional Regularization with Ky Fan p-k Norm

Link: https://arxiv.org/abs/2606.19046
Authors: Shan Fan,Feng Zhang,Jianjun Wang,Xi-Le Zhao,Tingwen Huang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper addresses low-rank tensor completion (LRTC) by proposing a novel nonconvex surrogate, namely the ratio of the tensor nuclear norm to the tensor Ky Fan p-k norm (TNPK), to accurately approximate the tensor tubal rank. The TNPK possesses appealing properties, including scale invariance, parameter flexibility, and the existence of closed-form solutions under specific choices of p and k. With specific parameter settings of p and k, it reduces to the ratio of the tensor nuclear norm to the tensor Ky Fan k norm (TNK) or the ratio of the tensor nuclear norm to the tensor Frobenius norm (TNF). We construct a LRTC model and, under the tensor null space property (NSP), prove that low-rank tensors are local minimizers of the proposed model. Moreover, we derive the proximal operator of the Ky Fan p-k inverse-norm and further develop an efficient alternating direction method of multipliers (ADMM) algorithm with guaranteed subsequential convergence under mild conditions. Extensive experiments on synthetic and real-world datasets validate the superior performance of our method against state-of-the-art competitors.
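
The ratio surrogate described in the abstract above can be illustrated in the simpler matrix setting, using the standard Ky Fan p-k norm (the p-norm of the k largest singular values). This sketch is an assumption-laden analogue, not the paper's tensor construction: it operates on a precomputed list of singular values, the function names are hypothetical, and the tensor version defined via the tubal rank is not reproduced here.

```python
def ky_fan_pk_norm(singular_values, p, k):
    """Ky Fan p-k norm: the p-norm of the k largest singular values."""
    top = sorted(singular_values, reverse=True)[:k]
    return sum(s ** p for s in top) ** (1.0 / p)

def nuclear_to_ky_fan_ratio(singular_values, p, k):
    """Nuclear norm divided by the Ky Fan p-k norm (matrix analogue only)."""
    return sum(singular_values) / ky_fan_pk_norm(singular_values, p, k)
```

In this matrix analogue, setting p = 1 gives the ratio of the nuclear norm to the Ky Fan k norm, and setting p = 2 with k equal to the number of singular values gives the ratio of the nuclear norm to the Frobenius norm, mirroring the two special cases named in the abstract. The ratio is unchanged when all singular values are multiplied by a positive constant, which is the scale invariance the abstract mentions, and it equals 1 for a rank-one spectrum.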

[CV-27] FlowObject: Flow Steering for Bridging Generative Priors and Reconstruction Fidelity

Link: https://arxiv.org/abs/2606.19019
Authors: Yuchen Rao,Xuqian Ren,Yinyu Nie,Sayan Deb Sarkar,Biao Zhang,Vincent Lepetit,Friedrich Fraundorfer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Recovering complete 3D representations of objects from few casual image captures remains a significant challenge. Recent 3D generative models, particularly those based on Flow-Matching (FM), can synthesize high-quality textured assets; however, they often suffer from "synthetic bias" where learned priors override observational evidence, alongside a lack of alignment with the observed instance. Conversely, optimization-based methods like 3D Gaussian Splatting (3DGS) provide high fidelity on visible surfaces but fail to reason about unobserved geometry. In this paper, we present FlowObject, a framework that reformulates sparse-view 3D reconstruction as a training-free, guided inverse problem. Our approach applies a dual-space guidance strategy to steer the Ordinary Differential Equation (ODE) trajectory of a flow-matching model, enabling the completion of unseen regions through learned generative priors while enforcing strict consistency with real-world observations. By integrating a 3DGS refinement stage, FlowObject further bridges the gap between "synthetic-looking" generative outputs and photorealistic reconstructions. Comprehensive benchmarks on synthetic and real-world datasets demonstrate that current state-of-the-art methods often struggle to achieve geometric completeness and observational consistency simultaneously, especially under severe occlusions. In contrast, our method significantly outperforms state-of-the-art generative models and optimization-based frameworks in both geometric completeness and view-dependent appearance fidelity.

[CV-28] Show Don’t Ask: Generative Visual Disambiguation for Composed Image Retrieval with Turn-Valid Coverage

Link: https://arxiv.org/abs/2606.18992
Authors: Amsisan Tran,Baogh Le,Tuan Kiet Pham,Sui Yang Guang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Composed image retrieval (CIR) uses a reference image and a text modification to search for a target image. However, such queries often describe several possible images rather than one exact target, making the user’s intent ambiguous. Recent methods address this by using conformal prediction to estimate ambiguity and by asking users clarifying text questions. However, these methods have two limitations: their coverage guarantee only holds at the first interaction, and text questions are often insufficient for resolving fine-grained visual differences such as appearance, attributes, or viewpoint. We propose CLARA, a clarification framework that resolves ambiguity by showing users a small panel of visual alternatives. Instead of answering text questions, the user simply selects the prototype image closest to the intended target. This provides a direct visual signal and avoids relying on a model to predict the user’s answer. To maintain valid conformal guarantees across multiple interaction rounds, CLARA reweights calibration using the likelihood ratio induced by the user’s selection. The displayed prototypes are also constrained to represent the current candidate set and are snapped to real corpus images, ensuring that generated images cannot artificially improve coverage. Experiments on open-domain and fashion benchmarks show that CLARA matches single-turn state-of-the-art retrieval performance, maintains nominal coverage across interaction rounds, and finds the intended target in fewer rounds than strong text-question baselines. Its advantage is especially clear when ambiguity involves viewpoint or fine-grained attributes, where visual clarification is more effective than textual questioning.
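
The reweighted calibration step described above can be illustrated with generic weighted conformal prediction. The following is a minimal sketch, not CLARA's implementation: the function name, the idea of passing likelihood-ratio weights as a plain list, and placing the test query's own mass at infinity are assumptions made for illustration.

```python
import math


def weighted_conformal_threshold(scores, weights, test_weight, alpha):
    """Weighted (1 - alpha) quantile of calibration nonconformity scores.

    scores      : nonconformity score of each calibration example
    weights     : likelihood-ratio weight of each calibration example (hypothetical input)
    test_weight : weight of the test query, whose score is treated as +infinity
    alpha       : target miscoverage level
    """
    total = sum(weights) + test_weight
    cumulative = 0.0
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight / total
        if cumulative >= 1.0 - alpha:
            return score
    # Too little calibration mass below infinity: keep every candidate.
    return math.inf


# Uniform weights versus weights that favour the hardest calibration example.
uniform = weighted_conformal_threshold([0.1, 0.2, 0.3, 0.4], [1, 1, 1, 1], 1, alpha=0.5)
shifted = weighted_conformal_threshold([0.1, 0.2, 0.3, 0.4], [1, 1, 1, 5], 1, alpha=0.5)
```

Candidates whose score is at most the returned threshold stay in the prediction set, so upweighting high-score calibration examples enlarges the set.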

[CV-29] Visual-OPSD: Cross-Modal On-Policy Self-Distillation for Efficient Unified Multimodal Reasoning

Link: https://arxiv.org/abs/2606.18974
Authors: Pengyu Li,Zhitao Gao,Lingling Zhang,Muye Huang,Yuanming Li,Fangzhi Xu,Jun Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Unified multimodal models (UMMs) interleave generated “visual thoughts” (VTs) with text reasoning to improve spatial tasks. This incurs roughly an order-of-magnitude inference cost from multi-step diffusion. We find this cost yields limited direct benefit. On ThinkMorph, removing or noising VTs barely changes accuracy across nine benchmarks. Once rendered, attention concentrates on the VT regardless of content. Yet a KL diagnostic shows that conditioning on a privileged VT trace shifts the model’s completion distribution. This suggests the generation pathway encodes useful reasoning beyond the rendered pixels. Motivated by this gap, we propose Visual On-Policy Self-Distillation (Visual-OPSD). Teacher and student share identical weights but differ in context: the teacher sees privileged VTs while the student sees only the question. Token-level JSD distillation on on-policy student trajectories transfers the teacher’s reasoning to a text-only student. Across nine benchmarks, Visual-OPSD improves over its generative teacher by +3.40 pp with 14.3× speedup (10.0s vs. 142.8s per sample) and outperforms same-scale VLMs by +63.83 pp on VSP. A Gaussian-noise control (+0.40 pp vs. +10.28 pp for real VTs) and 58.4% closure of the KL gap confirm that gains come from the semantic content of the generation pathway.
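
To make the token-level JSD objective concrete, here is a minimal sketch in plain Python. It only shows the divergence averaged over token positions; the function names and the list-of-logits input format are assumptions for illustration, and the on-policy sampling and gradient updates are omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's vocabulary logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions (in nats)."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0.0)

    mixture = [0.5 * (x + y) for x, y in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


def token_level_jsd_loss(teacher_logits, student_logits):
    """Mean JSD over token positions; each argument is a list of per-token logit lists."""
    per_token = [
        jsd(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)


# Two token positions over a three-word vocabulary (made-up logits).
loss = token_level_jsd_loss([[2.0, 0.5, 0.1], [0.0, 1.0, 3.0]],
                            [[1.5, 0.7, 0.2], [0.2, 0.9, 2.5]])
```

JSD is symmetric and bounded by log 2, which is one common reason to prefer it over a one-sided KL term when teacher and student distributions differ sharply.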

[CV-30] A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI

Link: https://arxiv.org/abs/2606.18970
Authors: Syed Mujtaba Haider,Silvia Figini
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to the IEEE for possible publication.

Abstract: Medical image classification is often constrained by limited labeled data, motivating generative augmentation; recently, quantum generative models have been proposed for this purpose, frequently reporting accuracy gains. However, such claims are typically based on single training runs, do not match the parameter budgets of the quantum and classical generators, and do not characterize the data regime in which any benefit appears. We present a controlled benchmark that isolates the contribution of a quantum generator to brain-MRI augmentation. Images are encoded into a KL-regularized latent space in which a conditional Wasserstein GAN with gradient penalty is trained using either a variational quantum generator or a classical generator of near-identical parameter count (1648 vs. 1632). Synthetic samples are decoded and used to augment a pretrained classifier across labeled data fractions from 5% to 100%, evaluated over eight random seeds with paired significance testing (with multiple-comparison correction) and with intraset diversity and latent-distribution analyses. Across all fractions, no augmentation variant significantly outperforms real-data-only training, and the quantum and classical generators are statistically indistinguishable. Any low-data benefit behaves as regularization rather than faithful data expansion: synthetic samples are off-distribution and severely mode-collapsed precisely where data is scarce, and the quantum generator is no more diverse than its classical counterpart. We release the protocol as a testbed for rigorous evaluation of quantum generative augmentation in medical imaging.
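
The abstract does not say which multiple-comparison correction is used. As one common choice, the sketch below applies the Holm-Bonferroni step-down procedure to a list of p-values, such as one paired test per labeled-data fraction. The function name and the example p-values are hypothetical.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a reject/keep decision per hypothesis under Holm's step-down rule."""
    count = len(p_values)
    order = sorted(range(count), key=lambda i: p_values[i])
    reject = [False] * count
    for rank, index in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k + 1).
        if p_values[index] <= alpha / (count - rank):
            reject[index] = True
        else:
            break  # Once one test fails, all larger p-values are kept.
    return reject


# Made-up p-values for four comparisons.
decisions = holm_bonferroni([0.01, 0.04, 0.03, 0.005])
```

Here the two smallest p-values pass their thresholds (0.0125 and about 0.0167), while 0.03 exceeds 0.025 and stops the procedure.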

[CV-31] Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation

Link: https://arxiv.org/abs/2606.18960
Authors: Zirui Zheng,Jiaqian Yu,Xiongfeng Peng,jun shi,Mingyi Li,Chao Zhang,Weiming Li,Dong Wang,Huchuan Lu,Xu Jia
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Action-conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real-world experimentation by generating action-consistent video rollouts. However, persistent world modeling remains challenging in manipulation: frequent end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views, causing models to forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scenarios. To address this limitation, we propose Mem-World, a memory-augmented multi-view action-conditioned world model. At its core, we present W-VMem, a 4D wrist-view-centered surfel-indexed memory that anchors historical observations to temporally evolving surface elements. By explicitly modeling when and where scene elements are observed, W-VMem enables geometry-aware retrieval of relevant history frames conditioned on future actions. During generation, relevant history frames are selected via surfel-based rendering and scoring, providing informative and non-redundant context for prediction. Extensive experiments show that Mem-World generates persistent rollouts in complex manipulation scenarios, enables more reliable policy evaluation than Ctrl-World, improving the Pearson correlation with real-world performance by 14.5%, and supports effective policy improvement through synthetic data generation, increasing success rates from 58% to 72% on long-horizon tasks.

[CV-32] Motion-Focused Latent Action Enables Cross-Embodiment VLA Training from Human EgoVideos IROS2026

Link: https://arxiv.org/abs/2606.18955
Authors: Runze Xu,Yiluo Zhang,Jian Wang,Yu Wang,Jincheng Yu
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted to IROS 2026

Abstract: Training generalist Vision-Language-Action (VLA) models typically requires massive, diverse robotic datasets with high-fidelity action annotations. While egocentric human manipulation videos are abundant and capture significant environmental diversity, the absence of action labels makes them difficult to use in conventional training paradigms. To address this, we propose a latent-action-based framework designed to extract general action priors from unlabeled human videos. The architecture features a Hybrid Disentangled VQ-VAE that decouples motion dynamics from environmental backgrounds through physical masks, enabling the construction of a cross-embodiment action codebook. By pre-training on human videos with the codebook, the VLM backbone learns deep representations of action intent. For adaptation to specific embodiments, we introduce an intent-perception decoupling strategy where the VLM predicts the action intent while a separate frozen visual encoder provides state-specific features to the action expert, thereby reducing action hallucinations. Results in simulation and real-world environments show that our method, pre-trained exclusively on unlabeled human videos, performs competitively with state-of-the-art VLA models trained on massive annotated datasets, requiring only 50 trajectories for downstream adaptation.

[CV-33] SP-TransientBench: A Real-Captured Single Photon Perception Benchmark

Link: https://arxiv.org/abs/2606.18952
Authors: Hongzhou Dong,Zili Zhang,Ziting Wen,Yiheng Qiang,Runrong Deng,Wenle Dong,Ziwen Jiang,Xinyang Li,Rui Lu,Shuoyao Sun,Wenyu Wang,Ziyi Xia,Haitao Zheng,Guodong Shi,Xiaoqiang Ren
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Single-photon LiDAR (SPL) based on single-photon avalanche diode (SPAD) sensing enables time-resolved photon measurements with extreme sensitivity, offering unique potential for active 3D perception in photon-starved conditions. However, real-world single photon perception remains fundamentally challenging due to unique measurement noise and complex multi-return transient phenomena, which jointly complicate geometric reconstruction and semantic scene understanding. Despite growing interest in SPAD-based sensing, existing studies are largely limited to simulated data or small-scale controlled captures. As a result, systematic evaluation of real-world single photon perception across depth estimation, multi-view reconstruction, and 3D semantic understanding remains underexplored. To bridge this gap, we introduce SP-TransientBench (STB), a real-captured multi-task benchmark for single photon perception. SP-TransientBench comprises 10 diverse scenes and 10,297 views captured using a solid-state single-photon LiDAR at 256×192 resolution. Each view provides full time-of-flight histograms with multi-return behavior, standardized metadata, and calibrated camera poses for multi-view evaluation. We further provide 13-class 3D semantic annotations for selected scenes. By providing dedicated data splits and evaluation protocols for each task, STB enables consistent and reproducible benchmarking of real-world single photon perception across multiple 3D vision problems. The dataset and code will be released upon acceptance.

[CV-34] Physics-IQ Verified

Link: https://arxiv.org/abs/2606.18943
Authors: Tim Rädsch,Yuki M Asano,Hilde Kuehne,Stefan Bauer,Priyank Jaini,Robert Geirhos,Carsten T. Lüth
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Video generative models (VGMs) have become a new frontier that can be used not just for video generation but for a multitude of downstream tasks, including world modeling. To advance these tasks, a good video model must understand the physical reality of the world. Evaluating this understanding is an emerging field and has led to the Physics-IQ benchmark, which quantifies this explicitly by comparing model-generated videos to real-world videos of physical experiments. In this work, we present a systematic audit of the Physics-IQ benchmark, expose shortcomings and propose three solutions that sharpen how we can measure physical understanding of VGMs. Specifically, we improve prompt and ground-truth quality to reduce the influence of confounding factors and further introduce a sample-level scoring system that weights each sample and metric equally. Our resulting benchmark, Physics-IQ Verified, refines 57.6% of all samples and improves over 34.8% of prompts. In a comparison study using six image-to-video generative models, we observe moderate but meaningful ranking changes (Kendall’s τ = 0.46). We hope Physics-IQ Verified advances the community by providing a more reliable signal toward physically accurate VGMs. The code for the benchmark can be accessed at this https URL
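
For readers unfamiliar with the ranking statistic quoted above, the following is a minimal sketch of Kendall's τ for two rankings without ties (the tau-a variant). The abstract does not state which variant the authors use, and the example rankings of six models are invented for illustration.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs, assuming no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        agreement = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical ranks of six models under an original and a revised benchmark.
original = [1, 2, 3, 4, 5, 6]
revised = [2, 1, 3, 4, 6, 5]
tau = kendall_tau(original, revised)
```

With six models there are 15 pairs; the two swapped neighbours above give 13 concordant and 2 discordant pairs, so τ = 11/15. A value of 1 means identical rankings and -1 means fully reversed rankings.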

[CV-35] BindEdit: Taming Attention Leakage for Precise Multi-Object Image Editing

Link: https://arxiv.org/abs/2606.18906
Authors: Chaewon Park,Soyoon Lee,Naeun Lee,Minjung Shin,Seogkyu Jeon,Kibeom Hong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint

Abstract: Real image editing enables precise manipulation of visual content, yet existing methods often fail in complex multi-object scenarios, causing semantic blending, object duplication, or incomplete edits. We attribute these failures to attention leakage, where signals across spatial regions and text tokens become entangled during the denoising process. Specifically, we identify two distinct forms of leakage: Edit-Token Leakage, where ambiguous token-region alignment leads to object blending, and Source Dominance Leakage, where tokens of unchanged source objects overwhelm the attention intended for target entities. To resolve these leakages, we propose BindEdit, which enforces attention-level constraints within a single diffusion trajectory. To suppress Edit-Token Leakage, BindEdit jointly regularizes cross- and self-attention so that each target token group is bound to its corresponding spatial region while maintaining instance-level separation. To suppress Source Dominance Leakage, a cross-attention re-balancing mechanism amplifies target token influence and attenuates residual source semantics within editable regions. Moreover, a region fidelity term ensures that each target concept is expressed coherently across the entire editing mask. Additionally, we propose a comprehensive multi-object benchmark encompassing diverse object counts and categories. Extensive experiments demonstrate that BindEdit consistently outperforms existing methods within a single diffusion trajectory, maintaining robust performance across both single- and multi-object editing scenarios.

[CV-36] Automatic ply-specific analyses of CFRP micrographs using shortest-path-based ply distinction

Link: https://arxiv.org/abs/2606.18894
Authors: Jonas Naumann,Jonas P. Appels,Julius Biermann,Christopher Gorsky,Timo de Wolff,Christoph Brauer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: We present an automated approach to distinguish between ply instances in semantic segmentation masks of high-resolution carbon-fiber reinforced polymer micrographs. Interpreting the segmentation mask as a graph with pixels as vertices enables us to use a shortest-path algorithm yielding the ply-separating paths. Thereby, we bridge the gap between semantic segmentation and ply instance segmentation using global information. We successfully apply our approach on high-resolution micrographs featuring a broad range of characteristics like artificially added gaps in single or multiple plies, different stacking sequences and ply-traversing cracks. Assigning each fiber pixel to a ply based on the calculated paths allows for a comprehensive, quantitative ply analysis with respect to its microstructural properties like the local fiber volume fraction as well as locally resolved ply and interleaf layer thickness. These insights help to reveal manufacturing-induced inhomogeneities, draw conclusions on manufacturing parameters and link mechanical properties to underlying microstructural imperfections.
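
The idea of treating the mask as a pixel graph and tracing a separating path can be sketched with Dijkstra's algorithm. This is an illustrative toy, not the authors' method: the 4-connected neighbourhood, the per-pixel costs (`fiber_cost`, `matrix_cost`) and the left-to-right path direction are all assumptions.

```python
import heapq


def ply_separating_path(mask, fiber_cost=10.0, matrix_cost=1.0):
    """Cheapest left-to-right path through a binary mask (1 = fiber, 0 = matrix).

    Stepping onto a fiber pixel is expensive, so the path prefers the
    matrix-rich interleaf region between two plies.
    """
    rows, cols = len(mask), len(mask[0])

    def cost(r, c):
        return fiber_cost if mask[r][c] else matrix_cost

    dist, prev, heap = {}, {}, []
    for r in range(rows):  # every pixel in the left column is a source
        dist[(r, 0)] = cost(r, 0)
        heapq.heappush(heap, (dist[(r, 0)], r, 0))

    while heap:
        d, r, c = heapq.heappop(heap)
        if d > dist[(r, c)]:
            continue  # stale queue entry
        if c == cols - 1:  # reached the right edge: rebuild the path
            path = [(r, c)]
            while path[-1] in prev:
                path.append(prev[path[-1]])
            return path[::-1]
        for dr, dc in ((-1, 0), (1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + cost(nr, nc)
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(heap, (nd, nr, nc))
    return []


toy_mask = [
    [1, 1, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
]
path = ply_separating_path(toy_mask)
```

On the toy mask the path follows the connected band of matrix pixels between the top and bottom fiber rows; fiber pixels above the path would then belong to one ply and those below to the next.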

[CV-37] DINO-Med3D: Bridging Dimension and Domain Gaps in Volumetric Segmentation via Progressive Adaptation MICCAI2026

Link: https://arxiv.org/abs/2606.18886
Authors: Haoyu Hu,Xiyao Ma,Shiqi Liu,Linsen Zhang,Xiaoliang Xie,Xiaohu Zhou,Zeng-Guang Hou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at MICCAI 2026. The camera-ready version and link will be made publicly available upon publication

Abstract: Although DINOv3 has demonstrated remarkable semantic discrimination in natural imagery, its direct application to volumetric medical segmentation is hindered by inherent dimension and domain disparities. To resolve these issues, we propose DINO-Med3D, a two-stage progressive framework that repurposes the pre-trained DINOv3 encoder for 3D medical tasks. In the first stage, we mitigate the dimension gap by introducing a multi-slice embedding module that incorporates pseudo-3D context, while simultaneously employing a segmentation proxy task to adapt representations learned from natural scenes to the medical domain. Subsequently, we further enhance volumetric understanding by adding lightweight 3D adapters into the frozen backbone to enforce global inter-slice continuity. Finally, to compensate for the spatial information loss inherent in the embedding process, we design a parallel detail recovery stream to explicitly preserve high-frequency boundary cues. Extensive experiments on five public datasets demonstrate that our approach successfully adapts DINOv3 to the medical domain and significantly outperforms state-of-the-art baselines.

[CV-38] Performance Gap Analysis between Latin and Arabic Scripts HTR ICPR2026

Link: https://arxiv.org/abs/2606.18884
Authors: Sana Al-azzawi,Elisa Barney,Marcus Liwicki
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the TIPS workshop, ICPR 2026

Abstract: Recent studies have shown that handwritten text recognition (HTR) systems perform worse on Arabic-script datasets than on Latin-script data. However, the reasons for this gap are still not well understood due to the lack of controlled comparisons. In this work, we present a comprehensive study of Arabic- and Latin-script HTR using a unified CRNN model for line-level HTR across nine datasets (including KHATT (Arabic), Muharaf (Arabic), NUST-UHWR (Urdu), PHTD (Persian), IAM (English), READ-2016 (German), and others) and different training sizes (K ∈ {100, 500, 1000, 2000, …, K_full}). Our results show the performance gap remains: it is large in low-resource settings, decreases with more data, but remains even at full scale, with a consistent difference of 5-7 CER points. We show that annotation quality matters, as many datasets contain labeling errors. Cleaning reduces error rates and narrows the gap, but does not eliminate it. In addition, we find that a fixed number of training samples provides less effective coverage in Arabic due to higher visual variability, requiring more data to learn similar representations. We compare recognition across datasets in terms of the number of text lines and the number of characters, showing an equivalence trade-off. We compare character frequency distributions across scripts and show that Arabic is significantly more heavy-tailed than Latin. Our error analysis reveals that around 30 percent of substitution errors in Arabic datasets (e.g., KHATT) are caused by confusion between visually similar characters, compared to about 15 percent in Latin-script datasets such as IAM.
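
The gap above is reported in CER points. As a reminder of how the metric is usually defined, here is a minimal sketch of character error rate as Levenshtein distance divided by reference length; the paper's exact normalization and text preprocessing are not stated in the abstract, so treat this as the conventional definition only.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_char != hyp_char),  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


error = cer("abcd", "abxd")  # one substitution in four characters
```

Because the function iterates over Python string characters, it works unchanged for Arabic-script text, although combining marks count as separate characters unless the text is normalized first.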

[CV-39] Test-Time Adaptation in Optical Coherence Tomography Using Trajectory-Aligned Time-Independent Flow MICCAI

Link: https://arxiv.org/abs/2606.18876
Authors: Veit Hucke,Thomas Pinetz,Gregor Reiter,Ursula Schmidt-Erfurth,Hrvoje Bogunović
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted in MICCAI

Abstract: Optical coherence tomography (OCT) is essential in ophthalmology, but inconsistent image quality, especially in low-cost devices, hinders automated analysis. To address this, we introduce a flow-matching-based test-time adaptation method that generates high-quality surrogate images from noisy inputs. Typically, domain gaps between test and training data cause pixel distribution mismatches during the denoising process. We overcome this by matching the test image’s histogram to synthetic reference trajectories, successfully aligning the input with expected distributions. Additionally, we remove the network’s time conditioning to account for slight deviations in real-world noise distributions. Our approach achieves state-of-the-art performance in segmenting critical biomarkers for two stages of Age-related Macular Degeneration (AMD). Code is available: this https URL.
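
The histogram-matching step can be illustrated with plain quantile mapping on flattened intensity lists. This is a generic sketch rather than the paper's trajectory-aligned procedure: the function name is hypothetical, and the reference here is simply a list of intensities standing in for a synthetic reference.

```python
from bisect import bisect_right


def match_histogram(source, reference):
    """Map each source intensity to the reference intensity at the same empirical quantile."""
    source_sorted = sorted(source)
    reference_sorted = sorted(reference)
    n_source, n_reference = len(source_sorted), len(reference_sorted)
    matched = []
    for value in source:
        # Empirical CDF of the source evaluated at this intensity, in (0, 1].
        quantile = bisect_right(source_sorted, value) / n_source
        # Position of the same quantile among the sorted reference intensities.
        index = min(n_reference - 1, max(0, round(quantile * n_reference) - 1))
        matched.append(reference_sorted[index])
    return matched


# A dark test image remapped onto a brighter reference distribution.
remapped = match_histogram([3, 1, 2, 0], [10, 20, 30, 40])
```

The mapping is monotone, so the ordering of pixel intensities in the source is preserved while its overall distribution takes on the shape of the reference.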

[CV-40] Bridging Single Distortion Artifacts and Multifactorial Clinical Quality: Few-shot Biparametric MRI Quality Assessment via Distortion-trained Prototypical Networks

Link: https://arxiv.org/abs/2606.18872
Authors: Yuheng Tang,Alexander Ng,Wen Yan,Natasha Thorley,Pawel Rajwa,Yipei Wang,Aqua Asif,Clare Allen,Louise Dickinson,Francesco Giganti,Shonit Punwani,Daniel Alexander,Veeru Kasivisvanathan,Yipeng Hu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Clinical prostate multi-parametric MRI relies heavily on high-quality diffusion-weighted imaging (DWI), yet reading DWI is frequently compromised by geometric distortion, often caused by rectal air. Assessing quality via the PI-QUAL scoring system is an emerging clinical standard, but it is subjective, time-consuming and suffers from a class imbalance where low-quality cases are diverse and relatively scarce. Using the PRIME clinical trial as an example, 6% of images have PI-QUAL scores lower than 4, and 87% of DWI issues are due to distortion. Many of the other clinical quality issues are under-represented. To address this common dual-scarcity of annotated clinical data, we propose a few-shot biparametric prototypical network for automated image quality assessment (IQA). Our framework utilizes a dual-branch 3D ResNet to fuse T2-weighted and DWI features, providing anatomical context to distinguish true morphology from distortion. To handle real-world heterogeneity, we introduce feature-wise linear modulation (FiLM) and a gradient reversal layer (GRL) to align feature distributions conditioned on varying b-values while suppressing acquisition-related biases. We demonstrate that a model meta-trained solely on comparatively objective, readily obtainable distortion labels can effectively adapt to predicting complex, multi-factorial clinical quality scores such as PI-QUAL using only five representative samples. Experimental results on two datasets show that our method significantly outperforms few-shot learning baselines for this challenging IQA task, offering a practically feasible and data-efficient solution for standardizing prostate MRI quality control in clinical workflows.
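
The few-shot decision rule of a prototypical network can be sketched in a few lines: average the support embeddings of each class into a prototype, then assign a query to the nearest prototype. The 2-D embeddings and class labels below are invented for illustration; in the paper the embeddings would come from the dual-branch 3D ResNet, which is not reproduced here.

```python
def prototype(embeddings):
    """Mean of a non-empty list of equal-length embedding vectors."""
    dim = len(embeddings[0])
    return [sum(vec[d] for vec in embeddings) / len(embeddings) for d in range(dim)]


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(query, support):
    """Return the label whose prototype is closest to the query embedding.

    support maps each class label to its few-shot support embeddings.
    """
    prototypes = {label: prototype(vecs) for label, vecs in support.items()}
    return min(prototypes, key=lambda label: squared_distance(query, prototypes[label]))


# Hypothetical support set with two quality classes and two shots each.
support_set = {
    "acceptable": [[0.0, 0.0], [0.0, 2.0]],
    "distorted": [[4.0, 4.0], [6.0, 4.0]],
}
label = classify([5.0, 3.0], support_set)
```

Adapting to a new label set, such as PI-QUAL scores instead of distortion labels, only requires swapping the support set; the embedding network itself is left unchanged at that stage.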

[CV-41] Learning to Distort: Weakly-Supervised Image Quality Transfer for Prostate DWI Correction

Link: https://arxiv.org/abs/2606.18869
Authors: YuCheng Tang,Wen Yan,Alexander Ng,Natasha Thorley,Pawel Rajwa,Yipei Wang,Aqua Asif,Clare Allen,Louise Dickinson,Francesco Giganti,David Atkinson,Shonit Punwani,Daniel Alexander,Shaheer Ullah Saeed,Veeru Kasivisvanathan,Yipeng Hu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Single-shot echo-planar prostate diffusion-weighted imaging (DWI) is frequently complicated by geometric distortions, which impact the ability to derive reliable diagnoses from such images. Developing automated correction methods is challenged by the absence of paired distorted and undistorted clinical scans. In this paper, we first propose a novel weakly-supervised image quality transfer (IQT) framework from undistorted to distorted images that utilizes image quality assessment (IQA) signals to supervise the transfer process. Unlike traditional methods that require expensive, voxel-wise paired data or resort to developing unpaired algorithms, our approach utilizes image-level quality labels (here, distorted vs. undistorted) to establish latent quality prototypes within a pre-trained feature space. Recognizing that simulating realistic distortions is more reliable than direct unpaired correction, we describe a weakly-supervised prototype flow matching algorithm to explicitly regularize generative trajectories towards distorted prototypes, producing realistic susceptibility artifacts that mimic clinical degradations. By synthesizing these realistic pairs, we enable a second IQT model to be trained in the forward direction for distortion correction. Experimental results demonstrate that our generated images successfully mimic the diagnostic interference of real-world artifacts, which leads to more capable distortion correction IQT models. In addition to qualitative comparisons, we also conduct exhaustive quantitative evaluations that compare our approach with existing unpaired approaches (e.g., CycleGAN, UNIT-DDPM, and OT-FM) - as either forward or reverse alternatives - by assessing clinical downstream task performance in PI-RADS and Gleason score classification, using both in-distribution and external data sets.

[CV-42] URDF Synthesis from RGB-D Sequences via Differentiable Joint Inference and Energy-Consistent Verification

Link: https://arxiv.org/abs/2606.18861
Authors: Xinze Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reconstructing simulation-ready digital twins of articulated objects from sensor observations remains constrained by two persistent gaps: (i) part-level geometric reconstruction is decoupled from kinematic-parameter estimation, and (ii) the recovered models often violate basic dynamic invariants such as energy conservation, leading to drift when the URDF is replayed in physics simulators. We present KinemaForge, a constraint-driven pipeline that jointly infers part-level shape, joint topology, and joint parameters from short RGB-D sequences and validates the result against an energy-consistent verifier built on differentiable rigid-body dynamics. The pipeline introduces three components: a kinematic constraint graph that encodes joint-part incidences as soft edges; a differentiable screw-axis solver that backpropagates from rendered observations through Featherstone’s articulated-body algorithm to joint parameters; and an energy residual loss that penalises non-physical free responses of the reconstructed model. Across five PartNet-Mobility categories and an internal RGB-D benchmark, KinemaForge reduces the average joint-axis error from 4.52 degrees to 2.83 degrees (-37.4%) over the strongest geometric baseline (PARIS) and from 5.30 degrees to 2.83 degrees (-46.6%) over the interaction-based Ditto baseline, lowers long-horizon simulation drift by 64% (vs. PARIS) over 50 s rollouts, and yields URDFs whose closed-loop manipulation success rate improves by 14.6 percentage points over Ditto in our preliminary evaluation. Code and reconstruction data will be released upon acceptance.

[CV-43] Quantification of Uncertainty with Adversarial Models in Medical Image Segmentation MICCAI2026

Link: https://arxiv.org/abs/2606.18860
Authors: Hana Jebril,Thomas Pinetz,Günter Klambauer,Hrvoje Bogunović
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at MICCAI 2026

Abstract:Reliable pixel-level uncertainty quantification holds the potential to transform clinical workflows by enabling high-fidelity longitudinal monitoring and distinguishing true pathological changes from artifacts. Ideally, these models provide the stability required for critical treatment planning and surgical intervention. However, standard deep learning models often suffer from miscalibration, yielding overconfident predictions that mask underlying vulnerabilities at subtle pathological boundaries. To address this, we propose QUAM-SM, a post-hoc framework using targeted adversarial search to identify “adversarially fragile” pixels. By actively seeking perturbations that expose predictive instability, our method highlights regions where decisions are most vulnerable to being flipped. Importantly, the framework disentangles epistemic uncertainty from aleatoric uncertainty. Experiments on two public datasets with multiple expert annotations demonstrate that QUAM-SM outperforms both standard and recent uncertainty estimation approaches in terms of reliability and boundary sensitivity. Code is available at this https URL

[CV-44] From Bounding Boxes to Visual Reasoning: An On-Policy Data Annotation Tool for Vision-Language Models

Link: https://arxiv.org/abs/2606.18846
Authors: Like Zhang,Runliang Niu,Shiqi Wang,Xiyu Hu,Qianli Xing,Pan Wang,Qingzu He,Qi Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 7 figures

Abstract: Vision-language models (VLMs) are rapidly advancing toward sophisticated grounded structured visual reasoning. Training models for such advanced capabilities demands a new genre of data that seamlessly unifies spatial coordinates, open-vocabulary descriptions, structured attributes, and topological relationships into a singular representation. However, existing data annotation tools fundamentally fail to meet these intricate demands, suffering from three systematic bottlenecks: limited expressiveness, severe annotation-training decoupling, and poor data reusability. To bridge this infrastructure gap, we introduce an open-source annotation tool, ScreenAnnotator. First, we define a unified annotation atom schema that binds spatial, semantic, and structural primitives into a single unit. Second, we implement an on-policy annotation loop embedded with a Bayesian Annotation Verifier (BAV). Finally, we design a template-driven multi-task data synthesis process that dynamically transforms static atoms into diverse multi-dimensional reasoning tasks, eliminating redundant re-annotation. The on-policy loop drives the annotation accept rate to nearly 100% on flowcharts and 77% on GUI screenshots, while steadily reducing per-image annotation time as labeled data accumulate. In the flowchart scenario, fine-tuning a VLM yields 76.1% average accuracy, which is a 35.1 percentage-point absolute gain. Our code is available at: this https URL.

[CV-45] Rethinking Air-Ground Collaboration: A Progressive Cross-Task Benchmark and Socialized Learning Framework

Link: https://arxiv.org/abs/2606.18841
Authors: Zhoupeng Guo,Yunqi Zhu,Zhihe Fan,Xinjie Yao,Ruipu Zhao,Boan Tao,Yiming Sun,Zhen Wang,Pengfei Zhu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Air-ground collaborative perception is crucial for robust visual understanding in real-world dynamic environments. However, existing studies typically formulate collaboration as single-task cross-view fusion, overlooking the functional dependencies among localization, target association, and fine-grained parsing. In addition, the heterogeneous nature of aerial and ground views introduces substantial geometric, scale, and occlusion discrepancies, making uniform feature sharing vulnerable to negative transfer. To tackle these issues, we model air-ground perception as a progressive cross-task collaboration task and construct the Air-Ground Progressive Collaboration (AGPC) benchmark, a spatio-temporally aligned benchmark comprising more than 745K raw video frames. Built upon this benchmark, we propose Socialized Co-Perception (SCP), a coarse-to-fine framework that organizes collaboration progressively from aerial global localization to ground target association and identity-aware parsing. Its core module, the Dual-Layer Router (DLR), decouples input-side multi-scale expert selection from output-side task-conditioned modulation, enabling selective cross-view and cross-task interaction while suppressing harmful interference. Extensive experiments demonstrate the effectiveness of SCP. It achieves a 3.73% coevolutionary gain and a 7.86% improvement in average downstream performance. These results show that task-conditioned collaboration is more effective than uniform fusion for heterogeneous air-ground perception. The code is available at this https URL.

[CV-46] Semantic Robustness Certification for Vision-Language Models ICML

Link: https://arxiv.org/abs/2606.18839
Authors: Peiyu Yang,Paul Montague,Feng Liu,Andrew C. Cullen,Amardeep Kaur,Christopher Leckie,Sarah M. Erfani
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICML

Abstract:Vision-language models (VLMs) are now widely used in downstream tasks. However, real-world applications often expose VLMs to distribution shifts induced by semantic variation (e.g., shape, size, and style). Robustness certification determines if a model’s prediction changes when transformations are applied to its input. While most certification frameworks study geometric or pixel-level transformations over inputs, this work proposes a novel framework that enables certifying VLM robustness under semantic-level transformations. Leveraging the open-vocabulary capability of VLMs, we use text prompts as semantic proxies to construct transformations parameterized by an extent that controls the degree of semantic variation. By characterizing the VLM decision boundary in closed form, our framework quantitatively certifies extent intervals for which the predicted class remains unchanged under the semantic transformation. Our framework is the first to certify VLM robustness under semantic-level variations without requiring additional data for each variation, making it practical to apply. Experiments on both synthetic and real-world data show that our framework enables certifying robustness under diverse semantic variations across scenarios.

[CV-47] DreamReg: Belief-Driven World Model for 2D-3D Ultrasound Registration

Link: https://arxiv.org/abs/2606.18825
Authors: Luoyao Kang,Yuelin Zhang,Jiwei Shan,Haifan Gong,Qingpeng Ding,Shing Shin Cheng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Ultrasound (US) is widely used for surgical navigation, yet real-time registration between intraoperative 2D slices and preoperative 3D volumes remains challenging due to partial observability, speckle noise, and the action-dependent US acquisition. Existing methods are one-shot or short-horizon, making it hard for them to gather evidence over time or capture how surgeons adjust probe motion based on on-screen feedback. We propose DreamReg, a belief-driven world-model framework that formulates 2D-3D registration as belief updating over rigid transformations. DreamReg maintains a latent belief state that summarizes past observations and pose information, and continuously refines the transformation through learned dynamics as new slices arrive. During training, DreamReg is exposed to probe-motion trajectories that mimic clinical scanning behavior and learns to update its belief by conditioning pose refinement on the current US observation. During inference, DreamReg refines registration via internal imagination: it rolls out the learned world model to simulate candidate probe motions and their predicted observations, and integrates these imagined outcomes to converge to an accurate rigid transformation. Experiments on the CAMUS and u-RegPro datasets demonstrate improved robustness and competitive registration accuracy for real-time guidance compared with state-of-the-art methods.

[CV-48] Where Will They Go? Modelling Multimodal Pedestrian Manoeuvres from Ego-centric Videos IROS

Link: https://arxiv.org/abs/2606.18824
Authors: Yuxuan Xie,Nicolas Pugeault,Chongfeng Wei,Hubert P. H. Shum,Edmond S. L. Ho
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at The IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2026

Abstract:Pedestrian trajectory prediction from an ego-centric camera is challenging since it depends on complex interactions with vehicles and scene context, as well as the intention of the pedestrian. Modelling correlation and intent from the historical and future trajectories of the pedestrian usually results in a multimodal (i.e. multiple-mode) distribution. Existing stochastic predictors often sample multiple futures from a single unimodal distribution, which can yield sub-optimal ‘mixed-mode’ trajectories that lie between distinct motion patterns and become implausible in real scenes. In this paper, we propose MMPM, a mode-aware framework that separately models future trajectory distributions in semantically meaningful modes based on the pedestrian’s crossing behavior. MMPM consists of two modules: a behavior-aware Pedestrian Interaction Module (PIM) that jointly captures pedestrian-vehicle and pedestrian-environment interactions by introducing gaze, head, and hand gesture cues, and a CVAE-based Mode-aware Trajectory Predictor (MTP) module that models the future trajectory distributions of two modes, crossing and not crossing the road, separately. A query-based decoder further enforces mode consistency during decoding. Experiments on the PIE and JAAD datasets show that our method surpasses state-of-the-art baselines. Our proposed MTP is model-agnostic and can be integrated into existing frameworks such as BiTrap-NP and SGNet-ED to further improve future trajectory prediction performance. We additionally introduce a data-driven validation protocol that matches predictions to spatio-temporally consistent ground-truth trajectories, demonstrating improved frame-wise displacement errors over previous work.

[CV-49] Fuzzy-Geometric Branch-Point Modeling for Structure-Aware Augmentation of Handwritten Chinese Characters

Link: https://arxiv.org/abs/2606.18793
Authors: Dongbin Jiao,Yibo Lyu,Qiulu Wei,Fuxiang Lu,Shengcai Liu,Shi Yan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Data scarcity and structural distortion significantly limit handwriting recognition in high-security authentication. Existing augmentation methods often cause topological and morphological damage, particularly when processing complex Chinese characters where stroke intersections, ligatures, and sharp turns render traditional branch-point detection unreliable. To address this, this paper proposes a fuzzy geometry-driven structure-aware (FGSA) augmentation framework. We model branch points as fuzzy sets within the skeleton space, constructing a continuous branch-point membership field by integrating topological neighborhood evidence with direction field divergence. This membership field is adaptively optimized via an unsupervised surrogate objective, enabling robust stroke decoupling without manual annotation. Finally, kinematically-aligned samples are synthesized through parameterized cubic Bézier reconstruction and multi-strategy perturbations, ensuring a balance between structural fidelity and sample diversity. Moreover, we establish LZUSig, a large-scale, highly challenging dataset specifically dedicated to fine-grained structural degradation in Chinese handwritten signatures. Extensive experiments on CASIA-HWDB1.1, ChiSig, and LZUSig demonstrate that FGSA significantly reduces the word-level error rate (ΔWER), achieving optimal recognition gains over the compared baselines. More importantly, it strikes a robust trade-off among task gain, structural fidelity, and discriminative feature preservation, offering a highly controllable solution for handwriting augmentation.

[CV-50] Learned Radius Estimation for UDF-Based Point Cloud Reconstruction

Link: https://arxiv.org/abs/2606.18787
Authors: Eito Ogawa,Hiroshi Watanabe
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Surface reconstruction from point clouds is important for consumer-grade 3D capture, including AR/VR and indoor scanning. Local-patch Unsigned Distance Field (UDF) methods are lightweight and generalizable, but their accuracy depends on the support radius, traditionally fixed or selected by a one-dimensional curvature heuristic that cannot capture heterogeneous local geometry. We propose a learned per-query radius selector that predicts a continuous support radius and plugs into a frozen LoSF-UDF backbone. The selector is trained using off-grid target radii obtained by parabolic interpolation of cached UDF error curves. Experiments show improved fine-scale reconstruction accuracy.
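The off-grid target radii mentioned above can be illustrated with a minimal sketch of three-point parabolic interpolation. This is an assumption about how such a refinement is typically done, not the authors' code: the function name `offgrid_radius`, the radius grid, and the error values are all hypothetical.

```python
def offgrid_radius(radii, errors):
    """Refine the best grid radius with a three-point parabolic fit.

    radii  -- increasing candidate radii (hypothetical grid)
    errors -- error measured at each radius (hypothetical cached curve)
    """
    i = min(range(len(errors)), key=errors.__getitem__)
    # At either end of the grid there is no bracket to interpolate in.
    if i == 0 or i == len(radii) - 1:
        return radii[i]
    x0, x1, x2 = radii[i - 1], radii[i], radii[i + 1]
    y0, y1, y2 = errors[i - 1], errors[i], errors[i + 1]
    # Vertex of the parabola through the three points (valid for non-uniform grids).
    num = (x1 - x0) ** 2 * (y1 - y2) - (x1 - x2) ** 2 * (y1 - y0)
    den = (x1 - x0) * (y1 - y2) - (x1 - x2) * (y1 - y0)
    if den == 0:  # the three points are collinear: keep the grid minimum
        return x1
    return x1 - 0.5 * num / den


# Hypothetical error curve whose true minimum (0.13) lies between grid points.
grid = [0.05, 0.10, 0.15, 0.20]
curve = [(r - 0.13) ** 2 for r in grid]
refined = offgrid_radius(grid, curve)
```

Because the fit is exact for a quadratic curve, the refined value recovers a minimum that lies between grid points, which is what makes a continuous regression target possible.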

[CV-51] SCR-Guided Difficulty-Aware Optimization for Infrared Small Target Detection CVPR2026 CVPR

Link: https://arxiv.org/abs/2606.18783
Authors: Yunus Sevim,Behçet Uğur Töreyin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2026 Workshops (PBVS). Published version: this https URL

Abstract:Infrared small target detection remains challenging due to severe background clutter, low contrast, and weak spatial responses where geometric overlap alone is insufficient to characterize detection quality. In this work, we propose REEM (Reweighted Explicit-visibility Enhanced Modulation), a lightweight SCR-guided difficulty-aware optimization framework that incorporates Signal-to-Clutter Ratio (SCR) as a physically meaningful visibility prior during training. Instead of modifying the network architecture or directly optimizing SCR, REEM computes a ground-truth local SCR from the input image and applies a differentiable modulation to the soft-IoU learning signal, emphasizing low-visibility targets while preserving stable optimization and identical inference behavior. REEM is integrated into a U-Net-based MSHNet without introducing additional parameters, architectural modifications, or inference-time overhead. Extensive experiments demonstrate consistent improvements over the baseline, achieving higher IoU and detection probability (Pd) together with substantially reduced false alarms (FA), particularly under challenging low-visibility conditions. These results suggest that SCR-guided difficulty-aware optimization provides an effective and physically grounded complement to conventional overlap-based objectives for infrared small target detection. The code is available at https://github.com/yall-in-one/Reemm.

[CV-52] SpectralDiT: Timestep-Conditioned Spectral Residual Correction for Flow-Matching DiTs

Link: https://arxiv.org/abs/2606.18765
Authors: Jiayu Tian
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We propose SpectralDiT, a lightweight modification to flow-matching Diffusion Transformers that adds timestep-conditioned spectral correction to the MLP residual branch. The module decomposes each residual update into low- and high-frequency components on the patch-token grid, then learns a zero-initialized additive gate so the model initially matches the baseline DiT. On CIFAR-10 pixel-space generation, SpectralDiT improves FID from 20.78 to 19.71 at patch size 1 and reduces the radial Fourier spectrum gap. Furthermore, we scale our method to latent diffusion on ImageNet-100. With 0.6% additional theoretical FLOPs and 1.36% additional parameters, SpectralDiT improves latent flow-matching, achieving an 8.7% relative FID reduction under classifier-free guidance (CFG 2.0). All reported results are averaged over five seeds. Ablations and gate visualizations on CIFAR-10 reveal stable block-specific spectral correction patterns.

[CV-53] SMART: A Flexible Interpretable and Scalable Spatio-temporal Brain Atlas from High-Resolution Imaging Data

Link: https://arxiv.org/abs/2606.18753
Authors: John Kalkhof,Boris Gutman(IIT),Emile d’Angremont(Amsterdam UMC),Daniel C. Alexander(UCL),Marco Lorenzi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce SMART, a framework for learning a flexible, interpretable, and scalable spatio-temporal brain atlas from longitudinal high-resolution 3D medical images. Existing approaches to spatio-temporal atlas construction rely on black-box generative models that lack flexibility, limit interpretability, and struggle to scale to high-dimensional data. SMART addresses these challenges by learning a continuous disease-time atlas that decouples global group-wise disease dynamics from their patient-specific anatomical manifestation. Guided by anatomically inspired priors, SMART models interpretable global trajectories of regional progression along a shared disease timeline through region-specific differential equations. Global trajectories are further personalized to individual anatomies via dense diffeomorphic displacements parameterized by a flexible and scalable multi-scale Neural Cellular Automata. Evaluated on five longitudinal MRI datasets in Alzheimer’s disease (ADNI-1/GO/2, OASIS-3, AIBL; 1,300 subjects), SMART produces anatomically meaningful predictions of disease progression and achieves state-of-the-art forecasting accuracy and improved temporal consistency over adversarial and diffusion baselines. Our approach establishes a new paradigm for flexible, interpretable, and scalable modeling of spatio-temporal change in high-dimensional medical image time-series.

[CV-54] Toward Training-Free Zero-Shot Anomaly Detection in 3D Medical Images: A Batch-Based Approach Using 2D Foundation Models

Link: https://arxiv.org/abs/2606.18749
Authors: Tai Le-Gia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Zero-shot anomaly detection (ZSAD) is attractive for medical imaging because clinical systems must handle heterogeneous acquisition protocols, changing patient populations, and pathologies for which annotated training data may be unavailable. Most existing ZSAD methods are designed for 2D images, and their direct extension to 3D medical volumes is limited by the scarcity of large-scale volumetric foundation models or by the difficulty of utilizing volumetric context. We propose CS3F, a training-free batch-based framework for ZSAD in 3D medical images using 2D foundation models. Each volume is decomposed along multiple anatomical axes and encoded slice-wise by a 2D vision transformer. The slice features are then converted into localized volumetric tokens by pooling neighboring slice features. Anomaly scores are obtained from cross-subject mutual similarity: tokens that lack close analogues in other subjects are assigned higher anomaly scores. To reduce the attenuation of focal lesion signals caused by depth pooling, we introduce a coarse-to-fine tokenization strategy that enables fine-resolution volumetric scoring without exhaustive matching. CS3F is evaluated on brain MRI across metastases, glioma, and stroke, and validated on lung CT to test generalizability beyond atlas-aligned brain MRI. The results show that frozen 2D foundation models can support anomaly localization in 3D medical images, and that the benefit of fine tokenization depends strongly on lesion contrast and imaging modality.

[CV-55] Low-Cost Neuromorphic Fall Detection Using Synthetic Event Data and Hybrid SNNs

Link: https://arxiv.org/abs/2606.18732
Authors: Guillermo Rojas,Gonzalo Soto,Daniel Yunge
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, 6 figures, presented at ICONS 2025 during the Poster Session, but not published

Abstract:This work presents the development of hybrid models that integrate spiking neural networks (SNNs) with components of convolutional neural networks (CNNs) to learn from simulated event-based camera data (Dynamic Vision Sensor, DVS) generated from conventional smartphone videos. Aimed primarily at human fall detection, the approach leverages the energy efficiency and spatio-temporal processing capabilities of SNNs by converting video frames into event-based data. The proposed models are evaluated through simulations on multiple datasets, comparing their performance to that of traditional machine learning models. Results demonstrate significant gains in efficiency without sacrificing accuracy, underscoring the potential of combining SNNs and DVS technology for complex tasks in real-world environments.

[CV-56] Clinically Aligned Geometry Constraints for Robust IVUS Vessel Boundary Segmentation MICCAI2026

Link: https://arxiv.org/abs/2606.18723
Authors: Yunshu Chen,Litao Yang,Giuseppe Di Giovanni,Jordan Tan,Deval Mehta,Andrew Lin,Derek Chew,Masasi Fujino,Julie Butters,Stephen Nicholls,Zongyuan Ge,Kyung Hoon Cho
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: MICCAI2026 Accepted

Abstract:Intravascular ultrasound (IVUS) lumen and external elastic membrane (EEM) segmentation is important for quantitative coronary plaque burden assessment. Errors in lumen or EEM delineation directly propagate to plaque area, plaque burden, and geometric measurements. However, standard methods prioritising overlap scores often suffer from boundary drift and topology errors, leading to inaccurate clinical measurements. We present GeoCat, a geometry-consistent network that processes 5-frame IVUS clips using dual Cartesian-polar encoders with cross-domain attention and temporal fusion. A differentiable geometry consistency loss directly supervises clinically relevant descriptors including diameters, orientations, and cross-sectional areas. The model is trained on 12,242 annotated frames from 146 patients acquired with two commercial IVUS systems. We evaluate performance using both segmentation accuracy and plaque-relevant clinical metrics, including Dice/IoU, boundary measures (95HD in mm, ASSD), topology violation rate, and clinical geometry errors (dmax/dmin, angles, and areas). On our dataset, GeoCat achieves a Dice of 0.93, reduces 95HD to 0.14 mm, and lowers topology violations to 1.0%. Importantly, it significantly improves geometric fidelity, yielding diameter errors of 0.13-0.16 mm and angular errors of ~8 degrees, supporting reliable plaque burden quantification.

[CV-57] Rethinking the Pointer Loss in Table Structure Recognition: Geometry-Aware Pointer Loss for Spatial Locality

Link: https://arxiv.org/abs/2606.18721
Authors: Hong-Jun Choi,Jongho Lee,Jaeyoung Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Table Structure Recognition (TSR) using a pointer network achieves impressive results by predicting HTML sequences while aligning tags to detected text (or cell) regions. However, our analysis reveals that when pointer networks fail, 79.6% of errors occur between spatially adjacent cells (Manhattan distance = 2). Despite this, standard cross-entropy loss weights all negative candidates equally. In this work, we propose Geometry-Aware Pointer (GAP) Loss, which reweights the cross-entropy objective based on spatial proximity to ground truth. By applying inverse distance weighting, GAP focuses gradient flow where the model struggles most: immediate neighbors receive stronger gradients than distant cells. Our approach requires only a straightforward modification to the loss computation, maintaining the same model architecture with zero additional inference cost. Extensive experiments on PubTabNet and SynthTabNet demonstrate that GAP consistently reduces adjacent-cell errors, achieving new state-of-the-art performance. Our findings suggest that incorporating geometric inductive biases at the loss level provides a simple yet effective approach to robust TSR. Our code is available at this https URL
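One way to picture distance-aware reweighting of a pointer cross-entropy is the sketch below. The abstract does not give the exact GAP formula, so the weighting `1 + alpha / distance` on negative candidates, the function name `gap_loss_sketch`, and the parameter `alpha` are assumptions chosen only to show the idea that nearby wrong cells are penalized more than distant ones.

```python
import math


def gap_loss_sketch(logits, cells, target, alpha=1.0):
    """Cross-entropy whose negative candidates are weighted by 1 + alpha / distance.

    logits -- pointer scores, one per candidate cell
    cells  -- (row, col) grid position of each candidate
    target -- index of the ground-truth cell
    alpha  -- hypothetical strength of the proximity weighting (0 gives plain CE)
    """
    tr, tc = cells[target]
    m = max(logits)  # subtract the max for numerical stability
    denom = 0.0
    for j, (z, (r, c)) in enumerate(zip(logits, cells)):
        if j == target:
            w = 1.0
        else:
            dist = abs(r - tr) + abs(c - tc)  # Manhattan distance to the target
            w = 1.0 + alpha / max(dist, 1)    # closer negatives get larger weight
        denom += w * math.exp(z - m)
    return -(logits[target] - m) + math.log(denom)


# Hypothetical 1x6 row: the target is cell 0, with one near and one far negative.
cells = [(0, 0), (0, 1), (0, 5)]
near_confusion = gap_loss_sketch([2.0, 1.0, 0.0], cells, target=0)
far_confusion = gap_loss_sketch([2.0, 0.0, 1.0], cells, target=0)
```

In this formulation the gradient with respect to a negative logit is proportional to its weight, so a confusable adjacent cell contributes more to the update than an equally scored distant cell, and the model architecture and inference path are untouched.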

[CV-58] PEFT-MedSAM: Efficient Fine-Tuning of Medical Foundation Models for Explainable Skin Lesion Segmentation

Link: https://arxiv.org/abs/2606.18707
Authors: Asad Channa,Abdullah Khan,Asghar Ali Chandio,Aamir Akbar,Shahzad Memon,Aqib Hussain,Ameer Hamza
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Automated segmentation of skin lesions in dermoscopic images using deep learning can help detect melanomas earlier than they would otherwise be found. However, most available deep learning methods do not perform well. This paper presents PEFT-MedSAM, a parameter-efficient fine-tuning method for adapting the Medical Segment Anything Model (MedSAM) to automatically segment dermoscopic skin lesions. PEFT-MedSAM trains only the lightweight mask decoder while keeping the pre-trained image encoder and prompt encoder frozen. Experiments on the ISIC 2018 benchmark dataset show that PEFT-MedSAM obtains a Dice coefficient of .9411 and an intersection-over-union of .8918, compared with a fully trained U-Net baseline (.8715 Dice) and zero-shot MedSAM inference (.8997 Dice). External validation on the PH2 dataset yields a Dice coefficient of .9467 with a standard deviation of .0310. Supporting evidence includes a p-value below .0001 for Wilcoxon signed-rank tests comparing the two datasets and a bootstrap-estimated 95% confidence interval of [.9364, .9447] for the average Dice coefficient. To increase clinical trustworthiness, we used Grad-CAM explainability together with a pointing-game evaluation to assess the CNN baseline model on the validation set. This evaluation yielded an accuracy of 98.27% on the validation set of 519 images and confirmed that the model focused on regions containing skin lesions.

[CV-59] UniTemp: Unlocking Video Generation in Any Temporal Order via Bidirectional Distillation

Link: https://arxiv.org/abs/2606.18702
Authors: Lin Zhang,Sicheng Mo,Zefan Cai,Jinhong Lin,Zihao Lin,Jiuxiang Gu,Krishna Kumar Singh,Yuheng Li,Yin Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Autoregressive video diffusion models have emerged as a promising approach for long video generation, achieving strong performance in streaming settings. However, existing methods are restricted to forward temporal generation, whereas practical video creation often requires flexible generation order, e.g., conditioning on future context to extend backward, or on both past and future context for inbetween generation. We bridge this gap by training an autoregressive model that supports generation in arbitrary temporal directions. A key technical challenge arises from the Causal 3D VAE widely used in video diffusion models, which encodes latents strictly conditioned on past context. While suited for forward generation, this causal structure causes inter-block discontinuities when generation proceeds backward. To address this, we introduce blockwise anchor latents, a set of auxiliary latents that restore the missing past context at block boundaries during backward generation. Built on this design, we propose UniTemp, a bidirectional distillation framework that trains a single autoregressive student model for any-direction video generation. At inference time, UniTemp conditions on arbitrary past and/or future frames, improving controllability for both bidirectional and inbetween generation. Experiments show that UniTemp maintains competitive performance on short and long video generation compared to forward-only methods, while enabling diverse workflows such as bidirectional video extension, inbetween generation, looping video generation, scene transition, and visual story generation. Project website: this https URL

[CV-60] Spatially Stratified Distillation for Heterogeneous Radar Place Recognition ICRA

Link: https://arxiv.org/abs/2606.18687
Authors: Sagun Singh Shrestha,Samuel Harding,Abdelwahed Khamis,Saimunur Rahman,Peyman Moghadam
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: IEEE ICRA Workshop on Open Challenges for Rigorous Robot Perception 2026

Abstract:Scalable, all-weather place recognition increasingly relies on heterogeneous radar place recognition to bridge diverse hardware platforms. A notable application is matching queries from cost-effective 4D automotive radars against high-fidelity reference maps built by dense spinning radars. This process is fundamentally limited by the extreme sparsity (and narrow field-of-view) of the 4D sensor, which captures only a fraction of the structural density present in the spinning radar database. Prior efforts address this issue by unifying different radar signals, that is, by projecting both signals into a common representational space. Yet, they suffer performance degradation in multi-session environments. In this paper, we propose spatially-stratified distillation (SSD), a strategy that replaces standard uniform distillation with an asymmetric spatial alignment derived directly from physical radar returns. In regions where both radars exhibit overlapping returns, SSD enforces strong feature alignment. Crucially, in sparse regions where the 4D student lacks returns but the teacher contains valid structure within the shared field of view, SSD applies heavily discounted distillation weights. Extensive evaluations on the recent HeRCULES dataset demonstrate that SSD significantly outperforms prior place recognition methods, achieving state-of-the-art results on its challenging dynamic sequences.

[CV-61] Multi-Class Brain Tumor Classification Using Advanced Deep Learning Models: A Comparative Study

Link: https://arxiv.org/abs/2606.18682
Authors: Asad Channa,Asghar Ali Chandio,Akhtar Hussain Jalbani,Mehwish Leghari,Shahzad Memon
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Despite recent advances in deep learning, accurately classifying brain tumors from MRI images remains challenging. In this research, we present a comprehensive evaluation of five convolutional neural network (CNN) architectures, a customized baseline model and four pre-trained models (VGG16, VGG19, DenseNet121, and EfficientNetB0), for multi-class brain tumor classification on a clinically sourced dataset of approximately 10,000 MRI images. All architectures were trained and tested within an identical experimental framework. Performance was measured by both overall accuracy and tumor-wise recall to capture the clinically relevant performance of each architecture. EfficientNetB0 achieved the best overall classification accuracy at 95%, compared with VGG16 (94.37%), VGG19 (92.29%), DenseNet121 (90.91%), and the customized CNN (78.00%). An especially important finding was the considerable improvement in detecting meningiomas, which are often difficult to detect because they can appear very subtly on MRI images: while simple CNNs detected meningiomas with a recall of approximately 20%, EfficientNetB0 reached a recall of 89%. Additionally, the deeper VGG19 performed worse than the shallower VGG16, which suggests that architectural efficiency may matter more than depth when working with medical images. Overall, EfficientNetB0 appears to provide the best trade-off between classification accuracy, parameter count, and clinically meaningful performance.

[CV-62] Moving Beyond Diversity: Visual Token Pruning as Subspace Reconstruction for Efficient VLMs ECCV2026

Link: https://arxiv.org/abs/2606.18681
Authors: Jaeyeon Lee,Shunjie Wen,Dong-Wan Choi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2026 Under Review

Abstract:Despite their remarkable performance, Vision Language Models (VLMs) incur substantial computational overhead due to the large number of visual tokens. While diversity maximization has become a dominant strategy for token reduction, existing methods rely on cosine-based normalized similarity that discards magnitude information, failing to faithfully approximate the original feature representation and leading to suboptimal performance, particularly on compositional multi-skill reasoning tasks. In this paper, we introduce SPARE, a subspace reconstruction method that reformulates token pruning as a column subset selection problem and explicitly minimizes reconstruction error. By iteratively selecting tokens with large projection residuals, SPARE performs reconstruction-driven pruning beyond angular diversity. Moreover, we reveal a counterintuitive anti-relevance phenomenon: tokens with lower image-text relevance scores can better preserve contextual information. Based on this finding, we incorporate anti-relevance into SPARE as an additional selection criterion to promote context-aware token selection. Extensive experiments across multiple VLMs and benchmarks demonstrate that SPARE consistently achieves state-of-the-art performance, with strong gains on compositional tasks. When applied to LLaVA, SPARE removes up to 94% of visual tokens while retaining 95% of the baseline performance, all in a fully training-free manner.
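The idea of iteratively selecting tokens with large projection residuals can be sketched as a greedy column subset selection. This is a generic textbook-style procedure written under assumptions, not the SPARE implementation: the function name `select_tokens_by_residual` and the toy features are hypothetical, and the anti-relevance criterion is left out.

```python
def select_tokens_by_residual(tokens, k):
    """Greedily pick up to k token indices, each time taking the largest projection residual.

    tokens -- list of feature vectors (lists of floats), one per visual token
    k      -- hypothetical token budget
    """
    residuals = [list(t) for t in tokens]  # parts not yet explained by the selection
    selected = []
    for _ in range(min(k, len(tokens))):
        norms = [sum(x * x for x in r) for r in residuals]
        best = max(range(len(tokens)), key=norms.__getitem__)
        if norms[best] <= 1e-12:  # every token is already reconstructed
            break
        selected.append(best)
        scale = norms[best] ** 0.5
        q = [x / scale for x in residuals[best]]  # new orthonormal direction
        for r in residuals:
            dot = sum(a * b for a, b in zip(r, q))
            for i in range(len(r)):
                r[i] -= dot * q[i]  # deflate: remove the component along q
    return selected


# Hypothetical 2-D token features: index 3 is nearly parallel to index 0.
features = [[3.0, 0.0], [1.0, 0.0], [0.0, 2.0], [2.9, 0.1]]
kept = select_tokens_by_residual(features, k=2)
```

Unlike a cosine-based criterion, the residual norm keeps magnitude information: a token is chosen when a large part of it cannot be reconstructed from the tokens already kept.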

[CV-63] InTrain: Intrinsic Trainability for Zero-Cost Neural Architecture Search

Link: https://arxiv.org/abs/2606.18676
Authors: Qinqin Zhou,Fuhai Chen,Jipeng Wu,Zhiwei Chen,Zhikai Hu,Weiwei Cai
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Training-free neural architecture search promises efficient discovery of high-performance networks without costly training. However, existing zero-cost proxies rely on fragmented heuristics that fail to capture the fundamental question: what makes an architecture trainable? This paper introduces Intrinsic Trainability (InTrain), a unified theoretical proxy that formalizes trainability as an architectural invariant emerging from two synergistic components: geometric capacity and optimization resilience. We operationalize intrinsic trainability through analysis of neural information processing. Geometric capacity is quantified via the participation ratio of activation covariance eigenspectrum, capturing the effective dimensionality of representation manifolds. Optimization resilience is measured through cumulative gradient health, assessing the robustness of backpropagation across network depth. InTrain synthesizes these dimensions through a scale-invariant multiplicative coupling, which we hypothesize is essential for capturing their synergistic, non-additive relationship. Extensive experiments on standard NAS benchmarks and search spaces demonstrate that InTrain achieves ranking correlations on par with state-of-the-art ensemble-based proxies and outperforms other single-metric methods.
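The participation ratio of a covariance eigenspectrum is commonly defined as (sum of eigenvalues)^2 divided by the sum of squared eigenvalues. The sketch below computes it without an eigendecomposition by using the trace and the squared Frobenius norm of the covariance matrix. It illustrates only this one quantity under that standard definition; the function name `participation_ratio` and the sample activations are hypothetical, and how InTrain estimates it or couples it with gradient health is not shown.

```python
def participation_ratio(activations):
    """Effective dimensionality (sum of eigenvalues)^2 / (sum of squared eigenvalues).

    activations -- list of samples, each a list of feature values
                   (hypothetical layer output)
    """
    n, d = len(activations), len(activations[0])
    means = [sum(row[j] for row in activations) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in activations]
    # Covariance matrix (d x d); the 1/n versus 1/(n-1) choice cancels in the ratio.
    cov = [[sum(r[a] * r[b] for r in centered) / n for b in range(d)]
           for a in range(d)]
    trace = sum(cov[a][a] for a in range(d))          # equals the sum of eigenvalues
    frob_sq = sum(v * v for row in cov for v in row)  # equals the sum of squared eigenvalues
    return trace * trace / frob_sq if frob_sq > 0 else 0.0


# Hypothetical activations: one isotropic cloud and one confined to a single line.
isotropic = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
collapsed = [[1.0, 1.0], [-1.0, -1.0], [2.0, 2.0], [-2.0, -2.0]]
pr_isotropic = participation_ratio(isotropic)
pr_collapsed = participation_ratio(collapsed)
```

The value ranges from 1, when all variance lies along one direction, up to the feature dimension, when variance is spread evenly, which is why it serves as a measure of effective dimensionality.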

[CV-64] BrainFusionNet: a deep learning and XAI model to understand local global and sequential features of MRI images for improved brain tumour detection

Link: https://arxiv.org/abs/2606.18675
Authors: Md Taimur Ahad,Bo Song,Yan Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Noise in Magnetic Resonance Imaging (MRI) poses challenges for Deep Learning (DL) when tumor boundaries are obscured and tumor location and appearance are complex. Therefore, we develop BrainFusionNet, which combines Convolutional Neural Networks (CNNs), Vision Transformers (ViT), and Gated Recurrent Units (GRUs) to extract spatial, contextual, and sequential features from MRI images for improved brain tumor classification. Furthermore, explainable AI methods such as SHAP, LIME, and Grad-CAM are integrated to visualise and highlight image regions that contribute to BrainFusionNet's decision-making process. The proposed BrainFusionNet model is evaluated on two publicly available MRI datasets. K-fold validation suggests 98% accuracy on both datasets. The model was compared with six state-of-the-art (SOTA) CNNs and transfer learning. Among the SOTA CNNs, DenseNet121 and VGG16 achieved the highest accuracy of 96%. The novelty of BrainFusionNet is that the hybrid model effectively extracts local and global features from MRI images, even in small-scale tumor regions and small tumor sizes. The model has a balanced sequential CNN architecture to capture low-level and deeper-layer features, and a customized ViT that captures local features, stabilizes gradient flow, and reduces the risk of vanishing gradients during MRI image training. The CNN and ViT outputs are fed into a GRU for final classification. Furthermore, we analyze pixel intensities to determine whether MRI image quality affects image classification. Our findings are novel in image interpretation, as we found that the distribution of pixel intensities in MRI images affects DL performance.

[CV-65] LandslideAgent with Multimodal LandslideBench: A Domain-Rule-Augmented Agent for Autonomous Landslide Identification and Analysis

Link: https://arxiv.org/abs/2606.18661
Authors: Chengfu Liu,Dongyang Hou,Junwu Xiang,Cheng Yang,Xuezhi Cui,Zeyuan Wang,Liangtian Liu,Zelang Miao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Intelligent landslide hazard interpretation is critical for disaster prevention, yet current paradigms struggle to simultaneously extract visual features and high-level geoscientific semantics, while general-purpose vision-language models (VLMs) suffer from perceptual limitations and domain hallucinations in complex geological scenarios. To address these challenges, we propose an instruction-driven agentic framework comprising three components. First, LandslideBench, a multimodal fine-grained dataset with seven subtype labels, high-resolution imagery, pixel-level masks, and high-quality textual descriptions, is constructed via multi-VLM cross-validation and interactive annotation. Then, LandslideVLM, a landslide-oriented VLM, is fine-tuned via LoRA on LandslideBench to enhance geological semantic understanding. Finally, LandslideAgent, a domain rule-enhanced agent taking LandslideVLM as its cognitive backbone, employs a dual-rule controller incorporating structured report metadata constraints and cross-validation identification constraints to regulate automated tool invocation. Experiments demonstrate that LandslideBench provides effective baselines across five mainstream models on fine-grained classification and semantic segmentation. LandslideVLM achieves accuracy improvements of 10.96%, 32.87%, and 15.91% on landslide discrimination, fine-grained classification, and semantic description quality, respectively. LandslideAgent further enables autonomous multi-source spatial data inference, realizing full-process intelligence for landslide identification and analysis.

[CV-66] On-Manifold Variational Learning with Heat-Kernel Priors

Link: https://arxiv.org/abs/2606.18658
Authors: Jiarui Xing,Tal Zeevi,Nian Wu,Jian Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Learning unsupervised representations of medical imaging cohorts can reveal clinically meaningful prototypes without expert labels, which are often noisy and fail to capture true pathological heterogeneity. However, existing deep latent-variable models estimate Gaussian mixture priors via Euclidean averaging, producing prototypes that drift off the curved data manifold and degenerate as the number of sub-populations grows. We propose a manifold-anchored variational framework built on a geometry-aware Expectation-Maximization (EM) algorithm, whose M-step selects each sub-population prototype as the graph medoid with the highest diffusion centrality on a heat-kernel-weighted latent graph, ensuring that every prototype remains on-manifold. A Dirichlet energy regularizer enforces geometric smoothness of the latent space, and a per-sub-population uncertainty score enables label-free quality assessment. The manifold-anchored EM is a general-purpose geometric tool that extends standard EM and applies readily to other latent-variable models beyond this setting. On cardiac scar and brain MRI benchmarks, our framework attains the highest accuracy among all compared methods, produces the sharpest prototypes reported to date, and remains stable at large sub-population counts where all baselines degenerate.
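
The contrast between Euclidean averaging and medoid selection can be sketched on a toy curved manifold. This is only an illustration under stated assumptions: the heat-kernel weight is taken as exp(-d²/(4t)), and "diffusion centrality" is approximated by a node's total heat-kernel affinity to the other cluster members. The paper's exact centrality and M-step may differ, and all function names are hypothetical.

```python
import math


def heat_kernel_weights(points, t):
    """Pairwise weights w_ij = exp(-||x_i - x_j||^2 / (4 t)), zero on the diagonal."""
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                dist_sq = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                weights[i][j] = math.exp(-dist_sq / (4.0 * t))
    return weights


def medoid_by_centrality(points, members, t=1.0):
    """Index of the member with the largest summed affinity to the other members.

    Assumption: this row-sum score stands in for diffusion centrality.
    """
    weights = heat_kernel_weights(points, t)
    return max(members, key=lambda i: sum(weights[i][j] for j in members))


def euclidean_mean(points, members):
    """Coordinate-wise mean, which need not lie on the data manifold."""
    dim = len(points[0])
    return tuple(sum(points[i][k] for i in members) / len(members) for k in range(dim))


# Five latent codes on a unit semicircle (a toy curved manifold).
angles = (0, 45, 90, 135, 180)
codes = [(math.cos(math.radians(a)), math.sin(math.radians(a))) for a in angles]
cluster = [0, 1, 2, 3, 4]

mean = euclidean_mean(codes, cluster)
prototype = codes[medoid_by_centrality(codes, cluster)]
print(math.hypot(*mean))       # radius well below 1: the mean leaves the circle
print(math.hypot(*prototype))  # radius 1 up to rounding: the medoid is a data point
```

Because the medoid is always one of the observed codes, it cannot drift off the manifold, whatever the number of sub-populations.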

[CV-67] Spiking Pyramid Wavelet Transformation for High-efficient and Low-energy Image Restoration

Link: https://arxiv.org/abs/2606.18644
Authors: Chen Zhao,Xiantao Hu,Song Wu,Qian Wang,Chen Wu,Rui Xie,Jian Yang,Ying Tai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by Pattern Recognition

Abstract:Spiking neural networks (SNNs) have garnered significant interest in computer vision due to their potential for efficiency and biological inspiration. While spiking CNN-based methods have shown promise for image restoration (IR) tasks, their performance is constrained by the inherent receptive field limitations of CNN operations. In this paper, we explore the benefits of the discrete wavelet transform and propose a spiking pyramid wavelet-based model (SPWM) that targets high efficiency and low energy consumption. Specifically, we develop a spiking dual pyramid wavelet (SDPW) block to model long-range dependencies and exploit the properties of the degradation in the wavelet domain. Experimental results on several benchmarks demonstrate that SPWM significantly lowers computational costs and energy consumption while maintaining image quality. Our method showcases the potential of SNNs in the field of IR, offering new insights for future applications on resource-limited devices.
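
For readers unfamiliar with the discrete wavelet transform that the model builds on, here is a minimal sketch of one level of a 2D Haar transform. It assumes the orthonormal Haar filter and an image with even height and width. The abstract does not say which wavelet SPWM uses, and the spiking components are not modelled here.

```python
def haar_1d(signal):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    scale = 2 ** 0.5
    approx = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return approx, detail


def haar_along_columns(block):
    """Apply haar_1d to every column and return (approximation, detail) blocks."""
    columns = [list(col) for col in zip(*block)]
    approx_cols, detail_cols = zip(*(haar_1d(col) for col in columns))
    # Transpose back so that each result is again a list of rows.
    return [list(r) for r in zip(*approx_cols)], [list(r) for r in zip(*detail_cols)]


def haar_2d(image):
    """One-level 2D Haar DWT: returns the LL, LH, HL and HH sub-bands.

    Naming here: first letter = filter along rows, second = filter along
    columns (L = low-pass, H = high-pass). Other texts swap LH and HL.
    """
    lows, highs = [], []
    for row in image:
        approx, detail = haar_1d(row)
        lows.append(approx)
        highs.append(detail)
    ll, lh = haar_along_columns(lows)
    hl, hh = haar_along_columns(highs)
    return ll, lh, hl, hh


ll, lh, hl, hh = haar_2d([[1.0, 2.0], [3.0, 4.0]])
print(ll, lh, hl, hh)
```

Each sub-band has half the resolution of the input, so a fixed-size kernel applied to it covers twice the spatial extent in each direction. This is one common argument for using wavelets to ease receptive-field limits.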

[CV-68] Intrinsic 4D Gaussian Segmentation from Scene Cues

Link: https://arxiv.org/abs/2606.18623
Authors: Hasan Yazar,Mohamed Rayan Barhdadi,Erchin Serpedin,Mehmet Tuncel,Hasan Kurban
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 15 pages, 4 figures, 7 tables. Includes supplementary material. Preprint

Abstract:Dynamic 4D Gaussian Splatting reconstructs deforming scenes with high fidelity and is increasingly adopted as a representation for dynamic 3D scenes. Putting such a scene to use, for editing, manipulation or motion analysis, first requires segmenting it: grouping the Gaussian primitives into coherent objects. Current pipelines obtain this grouping by importing 2D masks from foundation models such as SAM and lifting or distilling them into the Gaussian representation. In dynamic scenes these masks must be generated across many frames and views, which is costly, and the resulting segmentation can depend strongly on the quality and consistency of those external masks. We ask how much object-level structure can instead be recovered from the Gaussians themselves, and propose Intrinsic-GS, a training-free, mask-free method that builds a sparse affinity graph over Gaussian primitives from appearance, orientation, scale, deformation-trajectory and non-learned rendered-boundary cues. The graph is partitioned with Leiden community detection, requiring no foundation model and no learned feature field. On the standard 4D Gaussian segmentation benchmarks, Neu3D and HyperNeRF, Intrinsic-GS recovers substantial object structure without mask supervision, reaching 0.746 mIoU on Neu3D and 0.575 on HyperNeRF; on Neu3D, a geometry-only variant reaches 0.902 mIoU, matching SAM-supervised TRASE. On HyperNeRF, Intrinsic-GS runs 12.5x faster than the mask-generation and feature-rendering stages used by mask-supervised pipelines. These results suggest that much of the segmentation signal is already encoded in the Gaussians themselves, offering a fast, mask-free direction for 3D and 4D Gaussian segmentation that may also point toward more generalizable, robust segmentation in settings where external masks are unreliable or expensive.

[CV-69] SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation

Link: https://arxiv.org/abs/2606.18610
Authors: Wei-Cheng Tseng,Gashon Hussein,Yuzhu Dong,Allen Z. Ren,Lucy X. Shi,XuDong Wang,Sergey Levine,Zhaoshuo Li,Jinwei Gu,Florian Shkurti,Ming-Yu Liu,Quan Vuong
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Evaluating generalist robot manipulation policies in the real world is expensive, slow, and difficult to scale. Action-conditioned video world models offer a scalable alternative by simulating policy rollouts. However, autoregressive rollouts accumulate compounding errors, observations across multiple camera views must remain mutually consistent, and the evaluator must generalize to policies whose behaviors lie outside the training distribution. We address these challenges with SC3-Eval, a self-consistent video generation recipe that adapts a pre-trained video foundation model into an accurate policy evaluator by enforcing three complementary forms of consistency. First, forward-inverse dynamics consistency jointly trains the model to predict frames from actions and to recover actions from frames, anchoring generated rollouts to a physically plausible action manifold and counteracting the drift a forward-only model cannot penalize. Second, cross-view consistency trains the model to inpaint each camera view from the other, keeping the multi-camera observation coherent over long rollouts without any explicit memory mechanism. Third, test-time consistency reuses the inverse dynamics mode at inference as a per-action-chunk uncertainty signal that terminates rollouts whose generated frames drift away from the requested actions. We also demonstrate that SC3-Eval rollouts reproduce the failure modes that policies exhibit in real-world rollouts, supporting fine-grained diagnostic comparison rather than aggregate ranking alone. Across seven real-world vision-language-action policies, SC3-Eval attains a closed-loop Pearson correlation of 0.929 and an MMRV of 0.119, outperforming three strong prior video-model-based baselines, and generalizes to new tasks.
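
The two agreement metrics quoted at the end of the abstract can be sketched as follows. Pearson correlation is standard. For MMRV (mean maximum rank violation), the sketch assumes the definition commonly used in simulated-evaluation work: for each policy, take the largest real-world performance gap to any policy that the evaluator ranks in the opposite order, then average over policies. The paper may define it differently. The success rates below are made-up toy numbers, not results from the paper.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def mmrv(real, sim):
    """Mean maximum rank violation (assumed definition, see lead-in).

    A pair (i, j) is violated when the evaluator orders the two policies
    differently from the real world; its cost is the real-world gap.
    """
    n = len(real)
    total = 0.0
    for i in range(n):
        worst = 0.0
        for j in range(n):
            if (sim[i] < sim[j]) != (real[i] < real[j]):
                worst = max(worst, abs(real[i] - real[j]))
        total += worst
    return total / n


# Toy success rates for three hypothetical policies.
real_success = [0.2, 0.5, 0.9]
sim_success = [0.4, 0.1, 0.8]   # the evaluator swaps the first two policies
print(pearson(real_success, sim_success))
print(mmrv(real_success, sim_success))
```

Higher Pearson correlation and lower MMRV both indicate closer agreement between the evaluator and real-world results.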

[CV-70] Hallucination Detection and Correction in Medical VLMs via Counter-Evidence Verification MICCAI2026

Link: https://arxiv.org/abs/2606.18609
Authors: Nan Zhou,Ke Zou,Meng Liu,Linchao He,Jiaqi Zhu,Yi Zhang,Hu Chen,Huazhu Fu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2026 Accept. Submission Version

Abstract:The reliability of vision-language models (VLMs) in medical diagnosis is challenged by trust-undermining hallucinations. Existing hallucination detection approaches mainly focus on identifying factual inconsistencies between generated text and reference data. While some studies analyze where models attend in images, they seldom verify whether such attention truly reflects the visual evidence supporting the generated text. To address this gap, we propose Counter-Evidence Verification (CoEV), a training-free plug-and-play framework that detects and corrects hallucinations through evidence-based factual consistency verification. CoEV performs bidirectional verification between textual assertions and visual evidence, testing whether each statement is supported by its corresponding evidence region, and assigns each statement to one quadrant of a four-quadrant diagnostic map capturing combinations of text factuality and visual grounding. CoEV detects hallucinated content and serves as a post hoc refinement tool, correcting hallucinations without retraining. Extensive experiments on four medical datasets show that CoEV combats hallucinations. For hallucination detection, CoEV consistently outperforms existing methods, improving average PR-AUC and ROC-AUC by 3.0 and 3.9 absolute percentage points, respectively, with notable gains of up to 18.5% in specific VQA scenarios. For hallucination correction, it improves Micro-F1 by up to 12.5%, reduces hallucination rates by over 11.9% on medical report generation, and also boosts medical VQA accuracy. These results show that CoEV enables reliable detection and correction of hallucinations, providing clinicians with dependable, evidence-based cues for diagnosis. Code will be released upon acceptance.

[CV-71] Bridging Creative Intent and Visual Quality: Creator-Driven Recurrent Video Generation with Agentic Feedback Loops ICML2026

Link: https://arxiv.org/abs/2606.18591
Authors: Denis Savytski,Aiden Lei,Heding Liu,Warren Yang,Sihan Liang,Alexander Liu,Zhe Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the Workshop on Human-AI Co-Creativity at ICML 2026

Abstract:Generative AI has made content creation increasingly accessible, but many AI-generated videos lack narrative coherence and creative direction, issues that become more pronounced at longer durations. Unlike coding, where AI generation benefits from reliable feedback and techniques such as recurrent self-improvement, video generation requires subjective feedback about plot, scenes, and narrative, which naturally motivates approaches that incorporate human creative direction. We introduce CHIEF, a human-AI co-creation video generation framework that places the creator at the center of human-in-the-loop iterative video refinement and supports them by providing automatic subjective feedback. The creator supplies creative direction by driving each iteration, while their revisions are applied by a specialized refiner agent. The feedback is generated by persona-conditioned multimodal LLMs that watch the generated videos and produce subjective critiques from audience perspectives, providing feedback that self-evaluation alone cannot capture. To test the effectiveness of the proposed framework, we work with high school and college students with no prior filmmaking experience to create videos, from short 1-minute videos to a complete 10-minute short film with a complicated plot.

[CV-72] Splaxel: Efficient Distributed Training of 3D Gaussian Splatting for Large-scale Scene Reconstruction via Pixel-level Communication

Link: https://arxiv.org/abs/2606.18588
Authors: Wenqi Jia,Zhewen Hu,Ying Huang,Yu Gong,Stavros Kalafatis,Yuke Wang,Wei Niu,Chengming Zhang,Ang Li,Sheng Di,Yuede Ji,Bo Fang,Miao Yin
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 25 figures

Abstract:3D Gaussian Splatting (3DGS) enables high-fidelity and real-time 3D scene reconstruction, but scaling training to large-scale scenes requires optimizing hundreds of millions of Gaussians across multiple GPUs. Existing distributed approaches either partition scenes into isolated regions, causing global inconsistency, or rely on global Gaussian-level exchanges, which lead to substantial growth in inter-GPU communication and quickly dominate iteration time. We propose Splaxel, a communication-efficient distributed 3DGS training framework based on pixel-level local rendering and global composition. Instead of synchronizing Gaussians, each GPU renders its local subset and exchanges only partial pixel values, maintaining mathematical consistency while keeping communication cost stable as the scene size increases. Splaxel further reduces pixel-level redundancy through geometric and transmittance visibility prediction and improves GPU utilization via conflict-free camera-view consolidation. Evaluated on large-scale datasets with up to 120M Gaussians, Splaxel achieves up to 7.6× speedup over the state-of-the-art distributed 3DGS framework while preserving high reconstruction quality.
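
Why exchanging partial pixel values can stay mathematically consistent is easiest to see from the associativity of front-to-back alpha compositing. The sketch below is a single-ray, single-channel illustration under one strong assumption: each worker's samples occupy a contiguous depth interval along the ray, so the partial results can be merged in depth order. Splaxel's actual partitioning and exchange scheme is not described here, and the function names are hypothetical.

```python
def composite(samples):
    """Front-to-back compositing of (color, alpha) samples along one ray.

    Returns the accumulated color and the transmittance left after the samples.
    """
    color, transmittance = 0.0, 1.0
    for sample_color, alpha in samples:
        color += transmittance * alpha * sample_color
        transmittance *= 1.0 - alpha
    return color, transmittance


def merge_partials(partials):
    """Merge per-worker (color, transmittance) pairs, ordered front to back."""
    color, transmittance = 0.0, 1.0
    for partial_color, partial_transmittance in partials:
        color += transmittance * partial_color
        transmittance *= partial_transmittance
    return color, transmittance


# Four depth-sorted samples on one ray, split across two hypothetical workers.
ray = [(1.0, 0.5), (0.5, 0.5), (0.2, 0.25), (0.8, 0.5)]
full = composite(ray)
merged = merge_partials([composite(ray[:2]), composite(ray[2:])])
print(full, merged)  # the two results agree up to floating-point rounding
```

Each worker sends only two numbers per pixel and channel group, regardless of how many Gaussians it holds, which is why the communication volume depends on image size and not on scene size.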

[CV-73] APT: Atomic Physical Transitions for Causal Video-Language Understanding

Link: https://arxiv.org/abs/2606.18586
Authors: Shang Wu,Haoran Lu,Songling Liu,Chenwei Xu,Lie Lu,Pranav Maneriker,Fan Du,Manling Li,Zhaoran Wang,Han Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Physical events are not understood by their names alone, but by the causal state changes that compose them. A clip-level label such as “bounce” can be correct while hiding the process that makes the event physically valid, from support loss and contact onset to rebound and settling. To make this hidden process explicit, we introduce Atomic Physical Transitions (APTs): minimal, temporally localized state changes that bind a visible cue to an active physical mechanism and before/after dynamical regimes. An APT chain represents a video as an ordered causal transition sequence rather than a single aggregate event label: event labels tell what happened; APT chains explain why it happened. To make APTs learnable by VLMs, we construct mixed-source APT data from human annotations and simulator ground truth, covering 14 transition types across contact, gravity, friction, and rotation/stability, with 27,303 timed instances over 1,246 trials. Using this data, we find that current VLMs miss transition-level physics, with zero-shot recall at most 14% and errors dominated by missed transitions. Direct fine-tuning on APT chains improves transition detection but causes event-level forgetting, indicating that the model learns a specialized answer format rather than a reusable physical representation. We therefore propose APT-Tune, a parameter-efficient recipe that teaches VLMs to use causal transitions without forgetting how to answer video questions. It combines image-pad-aware supervision, format-conditional co-training, and mechanism-conditioned domain-to-type decoding to make APT learning format-robust and physically grounded. With only 11M LoRA parameters on Qwen3-VL-2B, APT-Tune substantially improves APT recall while also improving event-level video transfer. These results show that APTs are not a new answer format, but a human-aligned causal supervision signal for physical video understanding.

[CV-74] Aerial-ground LiDAR place recognition with patch-level self-supervised learning and expanded reciprocal re-ranking

Link: https://arxiv.org/abs/2606.18583
Authors: Yandi Yang,Xianghong Zou,Jianping Li,Haofeng Xie,Saurav Uprety,Hongzhou Yang,Naser El-Sheimy
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:LiDAR place recognition determines one’s position on a prior point cloud map. The most studied setting, ground-level LiDAR place recognition, suffers from pre-visit requirements, incomplete coverage, and limited perspectives. Using pre-acquired, full-coverage Airborne Laser Scanning (ALS) data as an aerial prior map overcomes these drawbacks, making cross-view place recognition necessary and advantageous. However, aerial-ground LiDAR place recognition faces significant challenges, including the domain gap between aerial and ground point clouds, and false positives during initial retrieval. To address these challenges, we present a novel retrieval and re-ranking framework for aerial-ground LiDAR place recognition. Based on the prior that neighboring point cloud patches share similar semantics with the anchor patch, our retrieval network introduces patch-level self-supervised learning modules at multiple scales and integrates them with scene-level learning to improve global feature discriminativeness between aerial and ground point clouds. Furthermore, leveraging the structured spatial distribution of ALS point clouds, we introduce an Expanded Reciprocal (ER) re-ranking algorithm that exploits neighborhood information and refines each feature based on its neighbor features, which are then used to update the similarity matrix for final ranking. Extensive experiments demonstrate that our retrieval network outperforms existing state-of-the-art (SOTA) methods, achieving a 9.8% improvement in average Recall@1 and a 3.2% improvement in average Recall@1% on CS-Urban-Scenes, while also showing the best performance on the CS-Campus3D dataset. Additionally, our ER re-ranking algorithm further boosts the average Recall@1 by 4.9% on CS-Campus3D and 10.2% on CS-Urban-Scenes without additional training.

[CV-75] Technical Report for ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge: Leveraging DINOv3 for Robust Outdoor Scene Understanding in Field Robotics

Link: https://arxiv.org/abs/2606.18582
Authors: Jaeil Park,Hyobin Choi,Sangjin Lee,Hyungtae Lim,Sung-Hoon Yoon
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Comments: 5 pages, 4 figures

Abstract:The GOOSE 2D Fine-Grained Semantic Segmentation Challenge at the ICRA 2026 Workshop on Field Robotics evaluates dense semantic segmentation of off-road imagery over a fine-grained taxonomy of 64 classes and 11 evaluated non-void coarse categories. We present the first-place solution to this challenge. Our solution comprises two complementary improvements: (a) a network-level design that combines a self-supervised DINOv3 ViT-L/16 backbone, a ViT-Adapter, and a Mask2Former mask-classification decoder, together with a coarse-category auxiliary loss on the global [CLS] token; and (b) an inference-time aggregation strategy based on multi-scale and horizontal-flip test-time augmentation and an ensemble of the top three checkpoints selected using Codabench scores. Our method achieves an official composite score of 76.57%, consisting of 69.32% fine-class mIoU and 83.81% category-level mIoU, and ranks first on the final phase leaderboard: this http URL.
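
Horizontal-flip test-time augmentation, one part of the inference-time strategy in (b), can be sketched as below. The `toy_model` is a hypothetical stand-in with a deliberate left-to-right bias, not the DINOv3 network. The multi-scale branch and the checkpoint ensemble are omitted, and averaging raw scores is an assumption about how the predictions are fused.

```python
def hflip(grid):
    """Mirror a 2D grid (image or per-pixel score map) left to right."""
    return [list(reversed(row)) for row in grid]


def toy_model(image):
    """Hypothetical segmenter returning [score_class0, score_class1] per pixel.

    The class-1 score grows with the column index, mimicking a position bias.
    """
    return [[[1.0 - v, v + 0.1 * x] for x, v in enumerate(row)] for row in image]


def predict_with_flip_tta(model, image):
    """Average the direct prediction with the un-flipped prediction of the mirrored image."""
    direct = model(image)
    mirrored = hflip(model(hflip(image)))
    return [
        [[(a + b) / 2 for a, b in zip(p, q)] for p, q in zip(row_d, row_m)]
        for row_d, row_m in zip(direct, mirrored)
    ]


def argmax_labels(scores):
    """Per-pixel label with the highest score."""
    return [[max(range(len(p)), key=p.__getitem__) for p in row] for row in scores]


image = [[0.42, 0.42, 0.42]]  # a uniform strip that should get a single label
print(argmax_labels(toy_model(image)))                         # [[0, 0, 1]]
print(argmax_labels(predict_with_flip_tta(toy_model, image)))  # [[0, 0, 0]]
```

In this toy case the position bias cancels after averaging, so the uniform strip receives one consistent label.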

[CV-76] Multi-Modal Hyper-Graph Fusion for Low-Light Crowd Counting

Link: https://arxiv.org/abs/2606.18566
Authors: Hao-Yuan Ma,Li Zhang,Yushi Qiu,Jie Gao,Yan Zhang,Bangjun Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments:

Abstract:Crowd counting is a fundamental task in computer vision. However, crowd counting in low-light environments remains largely underexplored, despite its practical importance in the real world. Existing methods mainly focus on well-lit scenes or rely on single-modality Red-Green-Blue (RGB) representations, which often become unreliable under extreme darkness and complex non-uniform illumination. To handle this problem, we construct three new low-light crowd counting benchmarks, which consist of two synthetic datasets, SHA_Dark and SHB_Dark, and a real-world benchmark LC-Crowd (Low-light Crowd Dataset). Inspired by Retinex-based physical modeling, we introduce depth and Canny edge cues as complementary geometric and structural priors to enhance the intrinsic reflectance representation under low-light conditions. We propose a Multi-Modal Hyper-Graph Fusion module, which formulates RGB appearance, depth geometry, and edge structure cues as nodes in a unified hyper-graph and explicitly captures their high-order complementary relationships via dynamic hyperedge construction and message passing. Furthermore, to adaptively allocate computation in dense prediction, we propose a Deformable Rectangular Sparse Attention (DRSA) module, which concentrates computation on informative regions through anchor-aware estimation and adaptive rectangular window modeling. Based on these designs, we develop a unified Low-Light Counting Network (LCNet) for robust low-light crowd counting. Extensive experiments on three benchmarks demonstrate that the proposed method achieves the best overall performance against existing state-of-the-art (SOTA) methods. The code is in the supplementary material. The datasets will be made public upon acceptance.

[CV-77] Experimental Analysis of Neural Network-Based Image Classification on the CIFAR-10 Dataset

Link: https://arxiv.org/abs/2606.18565
Authors: Necati Kagan Erkek,Emre Balci,Berkin Halay
Subjects: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments: 7 pages

Abstract:An experimental investigation of neural image classification on the CIFAR-10 benchmark is presented through fully connected and convolutional network formulations. The analysis emphasizes the complete learning pipeline: image vectorization, normalization, one-hot class encoding, supervised loss minimization, learning-rate selection, mini-batch training, convolutional feature extraction, max-pooling, and validation-based generalization assessment. A convolutional architecture with six convolutional layers and three max-pooling stages is evaluated for ten training epochs using a batch size of 128 and an Adam optimizer with a learning rate of 0.001. The validation accuracy reaches approximately 74.77%, while the validation loss begins to increase after the middle of training despite continued reduction in training loss. The resulting behavior illustrates the practical difference between representation learning and memorization, and it provides a compact experimental baseline for future studies on regularization, data augmentation, deeper architectures, and reproducible image-classification education.
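
The data-preparation steps listed in the abstract (vectorization, normalization, one-hot encoding, mini-batching) can be sketched in plain Python. The batch size of 128 and the ten classes come from the abstract. Scaling pixel values to [0, 1] by dividing by 255 is an assumption, since the abstract does not say which normalization was used, and the function names are hypothetical.

```python
def vectorize(image):
    """Flatten an H x W x C image (nested lists) into one feature vector."""
    return [value for row in image for pixel in row for value in pixel]


def normalize(vector, max_value=255.0):
    """Scale 8-bit intensities to [0, 1] (assumed normalization scheme)."""
    return [value / max_value for value in vector]


def one_hot(label, num_classes=10):
    """Encode a class index as a one-hot target vector."""
    encoding = [0.0] * num_classes
    encoding[label] = 1.0
    return encoding


def minibatches(items, batch_size=128):
    """Yield consecutive mini-batches; the last one may be smaller."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# A CIFAR-10-sized dummy image: 32 x 32 pixels with 3 channels.
dummy = [[[255, 128, 0] for _ in range(32)] for _ in range(32)]
features = normalize(vectorize(dummy))
print(len(features))   # 3072 inputs for a fully connected network
print(one_hot(3))
```

A convolutional network would skip `vectorize` and keep the 32 x 32 x 3 layout, which is the main input-side difference between the two formulations compared in the paper.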

[CV-78] MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

Link: https://arxiv.org/abs/2606.18558
Authors: Jianing Zhang,Chenhao Zheng,Yajun Yang,Max Argus,Rustin Soraki,Winson Han,Taira Anderson,Chun-Liang Li,Shuo Liu,Jiafei Duan,Zhongzheng Ren,Jieyu Zhang,Ranjay Krishna
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point. We introduce a full stack to study this task at scale: (1) MolmoMotion-1M is a large corpus of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos; (2) PointMotionBench is a human-verified benchmark spanning 111 object categories and 61 motion types; and (3) MolmoMotion is a general motion forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. MolmoMotion accurately predicts diverse motion patterns with different language instructions, and significantly outperforms existing motion prediction baselines on PointMotionBench. Finally, we show that the learned 3D motion prior transfers well to downstream applications: it improves training efficiency and generalization for robot manipulation, and its predicted trajectories provide effective motion guidance for generative models to synthesize videos with more realistic object motion.

[CV-79] Rethinking Text-to-Image as Semantic-Aware Data Augmentation for Indoor Scene Recognition

Link: https://arxiv.org/abs/2606.18555
Authors: Trong-Vu Hoang,Quang-Binh Nguyen,Dinh-Khoi Vo,Hoai-Danh Vo,Minh-Triet Tran,Trung-Nghia Le
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: MAPR 2024

Abstract:In the realm of computer vision, indoor image recognition presents challenges due to the intricate interplay of lighting conditions, occlusions, and diverse object arrangements within confined spaces. To address the scarcity of indoor training images, we introduce a novel approach leveraging Stable Diffusion (SD) to generate synthetic images, which serve as a powerful data augmentation tool. SD offers a principled framework for synthesizing diverse and realistic indoor scenes, thereby enriching the training data pool for robust indoor image recognition models. Experimental findings on the MIT Indoor Scene dataset reveal the potential of our proposed approach in enhancing the training of deep models when authentic data is limited. Furthermore, to prevent the misuse of SD synthetic images, we introduce a countermeasure based on DIffusion Reconstruction Error (DIRE). The powerful DIRE representation enables training robust classifiers using only lightweight deep models. Experiments show that our approach recognizes SD-generated images with 100% accuracy using MobileNetV3.

[CV-80] Forged Calamity: Benchmark for Cross-Domain Synthetic Disaster Detection in the Age of Diffusion

Link: https://arxiv.org/abs/2606.18554
Authors: Duc-Manh Phan,Quoc-Duy Tran,Duy-Khang Do,Anh-Tuan Vo,Hai-Dang Nguyen,Trong Le Do,Mai-Khiem Tran,Vinh-Tiep Nguyen,Tam V. Nguyen,Isao Echizen,Minh-Triet Tran,Trung-Nghia Le
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: SOICT 2025

Abstract:The rapid advancement of text-to-image diffusion models has enabled the creation of highly photorealistic synthetic images that closely resemble real photographs, making it increasingly difficult to distinguish authentic content from AI-generated fabrications. This poses challenges for cybersecurity, digital forensics, and disaster response, where fake imagery of floods, fires, or earthquakes can spread misinformation or disrupt emergency operations. To address this, we introduce Forged Calamity, a benchmark dataset for synthetic disaster detection containing 30,000 images, including 6,000 real and 24,000 synthetic samples generated by four diffusion models. Comprehensive experiments across fine-tuned and zero-shot settings reveal consistent weaknesses in current forensic approaches. Fine-tuned detectors perform well in-distribution but lose up to 50% accuracy on unseen generators or disaster types, showing overfitting to model-specific artifacts. Zero-shot generalized detectors also struggle to maintain stable accuracy, with only limited resilience in a few representation-robust models. These findings highlight persistent generalization gaps and the urgent need for domain- and model-agnostic detection methods to ensure visual authenticity in the diffusion era.

[CV-81] Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning

Link: https://arxiv.org/abs/2606.18553
Authors: Minh-Loi Nguyen,Xuan-Vu Le,Long-Bao Nguyen,Hoang-Bach Ngo,Trung-Nghia Le
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: SOICT 2025

Abstract:Traditional image captioning methods often struggle to generate comprehensive, context-rich descriptions, especially for details not directly observable from visual cues. To overcome this, we propose a novel retrieval-augmented image captioning framework that generates captions with deeper insights, such as object attributes, event context, and underlying significance, by leveraging external knowledge. Our approach features a hierarchical multi-modal article retrieval mechanism that moves beyond monolithic text entities. This retrieval considers article structure-aware features, including weighted textual components (e.g., headlines, body sections) and visual placement patterns, alongside multi-faceted similarity computations (content–visual, visual–visual, and discourse positioning). A subsequent contextual relevance refinement stage further enhances the retrieved information. The retrieved articles then serve as the knowledge base for caption generation: first, a VLM generates a concise image description; second, we segment relevant information from the retrieved articles based on this description; and finally, an LLM utilizes both the description and extracted knowledge to generate a comprehensive, contextually detailed caption. We participated in the ACM Multimedia EVENTA 2025 Challenge and achieved 5th place with an overall score of 0.2824 on the private test set of the OpenEvent-V1 dataset. Source code is publicly released at this https URL.

[CV-82] A Prototypical Signature Approach for Writer-Independent Offline Signature Verification ICPR

Link: https://arxiv.org/abs/2606.18528
Authors: Kecia G. de Moura,Robert Sabourin,Rafael M. O. Cruz
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for oral presentation at the International Conference on Pattern Recognition (ICPR) 2026

Abstract:Offline handwritten signature verification aims to distinguish genuine from forged signatures using static images. Since real forgeries are rarely available, negative samples are usually randomly drawn from genuine signatures of other users to create training data. However, this random selection often lacks diversity, increases redundancy, and escalates computational cost, leading to inefficient training. We propose a data-driven strategy to generate diverse, informative negative samples using prototypical signatures, which are compact, non-identifiable summaries of genuine signature features. Based on the experimental results, we conclude that (i) prototypical signatures yield more informative negative samples, improving the detection of skilled forgeries; (ii) the proposed approach is backbone-agnostic, showing robustness across architectures; and (iii) when combined with a primal-form linear SVM, it serves as an alternative to RBF-based models while significantly improving scalability and computational efficiency. Implementation of the method is available at this https URL.

[CV-83] Architectural Bias in Face Presentation Attack Detection: A Comparative Study of Vision Transformers and Convolutional Neural Networks

Link: https://arxiv.org/abs/2606.18510
Authors: Ngela Landon Ntung,Floride Tuyisenge,Jema David Ndibwile
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: 8 Pages, 4 Figures, 5 Tables

Click to view abstract

Abstract:Face Presentation Attack Detection (PAD) systems constitute a critical security layer in biometric authentication; however, existing approaches exhibit systematic performance disparities across demographic groups, disproportionately affecting individuals with darker skin tones. This paper presents a comparative empirical investigation of whether Vision Transformer architectures reduce demographic bias in face PAD systems relative to convolutional baselines. Experiments are conducted on the CASIA-SURF Cross-Ethnicity Face Anti-Spoofing (CeFA) dataset. Three architectures are evaluated: a Multimodal ViT-Tiny trained from scratch, a ResNet18 CNN baseline, and a pretrained DeiT-S fine-tuned on CeFA across African, East Asian, and zero-shot Central Asian demographic groups. DeiT-S achieves the highest overall accuracy of 97.27% and the lowest EER of 0.86%, outperforming ResNet18 at 90.15% accuracy. In terms of fairness, DeiT-S reduces the inter-ethnic ACER gap between African and East Asian subjects to 0.13%, compared to 0.75% reported in an LBP-based work [6], representing an 83% reduction. Most notably, while ResNet18 records a BPCER of 10.44% on zero-shot Central Asian subjects, DeiT-S maintains 2.89% on the same unseen group, demonstrating a 3.6x generalization advantage. These results suggest that pretrained Vision Transformers achieve superior PAD accuracy, produce smaller demographic performance gaps, and generalize more equitably across unseen demographic groups, indicating that cross-demographic fairness in PAD may partly be influenced by architectural design.
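The two headline ratios in this abstract can be re-derived from the figures it quotes. The snippet below is only an arithmetic check; the helper name `relative_reduction` is ours, and the numbers are copied from the abstract rather than recomputed from the CeFA data.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# Values quoted in the abstract, all in percent.
acer_gap_lbp = 0.75     # inter-ethnic ACER gap, LBP-based work cited as [6]
acer_gap_deit = 0.13    # inter-ethnic ACER gap, DeiT-S
bpcer_resnet18 = 10.44  # BPCER on zero-shot Central Asian subjects, ResNet18
bpcer_deit = 2.89       # BPCER on the same group, DeiT-S

gap_reduction = relative_reduction(acer_gap_lbp, acer_gap_deit)
bpcer_ratio = bpcer_resnet18 / bpcer_deit

print(f"ACER gap reduction: {gap_reduction:.1%}")  # 82.7%, reported as 83%
print(f"BPCER ratio: {bpcer_ratio:.1f}x")          # 3.6x
```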

[CV-84] Neural Phase Correlation

Link: https://arxiv.org/abs/2606.18496
Authors: Cole Reynolds
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Correspondence is fundamentally relational: it seeks the unknown transformation between two observations of a common scene, not the content of either. Yet the dominant learning-based methods do not represent the transformation as a first-class object in the architecture. They encode each image independently and let a learned similarity function or a deep decoder discover the mapping implicitly. Phase correlation is the canonical exception, measuring the inter-image relationship directly in the Fourier domain, but the rigidity of its fixed basis confines it to global translation. We introduce a learned generalization of phase correlation that lifts this restriction by learning the basis on which the transformation decomposes. The same algebraic primitive extends to dense non-rigid deformations and to unitary dynamics. On the ACDC cardiac-MRI benchmark the framework matches or exceeds prior published baselines on both registration directions. On CAMUS echocardiography it matches state-of-the-art without auxiliary scoring or adaptive-smoothness mechanisms. Applied to time-evolved wavefunction pairs of the 1-D quantum harmonic oscillator, the same framework recovers the Hermite-function eigenstates and the quantized energy levels of the unknown Hamiltonian from observation pairs alone.
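For readers unfamiliar with the classical baseline, here is a minimal sketch of ordinary phase correlation on a 1-D signal with a circular shift. It illustrates the fixed Fourier basis that the abstract says limits the method to global translation; it is not the paper's learned generalization. The function names and the naive O(n^2) DFT are our own choices, made to keep the example dependency-free.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (O(n^2)), enough for a small demo."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out


def phase_correlation_shift(a, b, eps=1e-12):
    """Estimate the circular shift s such that b[t] == a[(t - s) % n]."""
    fa, fb = dft(a), dft(b)
    # Cross-power spectrum; dividing by the magnitude keeps only the phase.
    cross = [y * x.conjugate() for x, y in zip(fa, fb)]
    normalized = [c / (abs(c) + eps) for c in cross]
    # The inverse transform of a pure phase ramp is a peak at the shift.
    corr = dft(normalized, inverse=True)
    return max(range(len(corr)), key=lambda t: corr[t].real)


signal = [0.0, 1.0, 3.0, 2.0, 0.5, 0.0, 1.5, 0.0]
shifted = [signal[(t - 3) % len(signal)] for t in range(len(signal))]
print(phase_correlation_shift(signal, shifted))  # 3
```

Because the peak location encodes only a translation, any deformation that is not a global shift smears the correlation surface, which is the restriction the abstract sets out to lift.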

[CV-85] Vines-DB: An RGB image dataset for multi-species ornamental vine segmentation

Link: https://arxiv.org/abs/2606.18484
Authors: Saroj Burlakoti,Utsav Bhandari,Aaron Etienne,Shital Poudyal(Utah State University)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 1 figure. Source data repository: OSF (DOI: https://doi.org/10.17605/OSF.IO/YJHCK )

Click to view abstract

Abstract:The Vines-DB dataset contains 1,218 original high-resolution RGB images of seven ornamental vine species collected under field conditions at the Utah Agricultural Experiment Station’s Greenville Research Farm in Logan, Utah, USA. The dataset was generated from 168 individual vine plants that were transplanted in 2022 and photographed repeatedly across multiple months during the 2023 and 2024 growing seasons (July-October). Images were captured with an iPhone 16 Pro equipped with a 48 MP camera between 10:00 AM and 12:00 PM under daylight. Vines were grown on 1.2 m × 2.4 m trellises and photographed from a distance of 1 m against black or white Styrofoam backdrops to improve contrast and reduce background noise. The dataset includes Akebia quinata, Campsis radicans, Hydrangea anomala petiolaris, Lonicera x heckrottii, Campsis x tagliabuana ‘Madame Galen’, Parthenocissus quinquefolia, and Wisteria floribunda. All original images were manually annotated in Roboflow by trained annotators to produce polygon-based instance segmentation masks for eight classes, including seven species and background. After preprocessing and data augmentation, the working dataset was expanded to 2,307 images for model development and evaluation. The augmented dataset was divided into 2,019 training images, 192 validation images, and 96 test images using stratified sampling to maintain balanced representation. Vines-DB supports the development and evaluation of deep learning models for multi-class instance segmentation in precision horticulture and urban ecology. The dataset enables applications such as automated canopy cover estimation, species identification, and scalable field phenotyping. In addition, repeated monthly imaging of the plants captures temporal variation in canopy development and plant appearance, increasing the dataset’s utility for segmentation benchmarking under realistic field conditions.

[CV-86] Data-Forcing Distillation: Restoring Diversity and Fidelity in Few-Step Video Generation

Link: https://arxiv.org/abs/2606.18478
Authors: Siyi Chen,Shaowei Liu,Yixuan Jia,Zian Wang,Huan Ling,Qing Qu,Jun Gao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Recent progress has shown promise in distilling multi-step video diffusion models into efficient few-step students. Among them, Distribution Matching Distillation (DMD) and its successor DMD2 achieved strong generation quality and fast convergence. However, due to the nature of the reverse Kullback–Leibler (KL) objective, these methods exhibit two persistent failure modes: a substantial drop in sample diversity, and visibly over-saturated outputs that deviate from real-video appearance. In this work, we propose Data-Forcing Distillation (DFD), a simple post-training framework that restores diversity and fidelity in DMD with only a single-line code change. At its core is the teacher score discrepancy to guide the student toward the real-data distribution, pulling it to missing modes (mitigating mode collapse) and away from problematic modes absent in real data (avoiding over-saturation). We provide an in-depth theoretical analysis of our framework and validate our approach on text-to-video, image-to-video, and autoregressive video generation. With only 100–300 steps of finetuning, DFD effectively restores diversity and fidelity on both the Wan2.1-1.3B and Cosmos-Predict2.5-2B models, resolving the over-saturation artifacts with significantly better video dynamics and appearance, and even outperforming the teacher model.
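The link between a reverse-KL objective and lost diversity can be seen on a toy discrete distribution. The sketch below is a generic illustration with made-up numbers, not the paper's analysis: `data` stands in for a bimodal real-data distribution, `collapsed` for a student that keeps one mode, and `spread` for a student that covers everything.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical three-bin distributions chosen only for illustration.
data = [0.495, 0.01, 0.495]   # two modes with a thin valley between them
collapsed = [0.98, 0.01, 0.01]  # student that keeps a single mode
spread = [1 / 3, 1 / 3, 1 / 3]  # student that covers both modes and the valley

# Reverse KL, KL(student || data), is the direction named in the abstract.
reverse = {"collapsed": kl(collapsed, data), "spread": kl(spread, data)}
# Forward KL, KL(data || student), is shown for contrast.
forward = {"collapsed": kl(data, collapsed), "spread": kl(data, spread)}

print("reverse KL:", reverse)
print("forward KL:", forward)
```

On these numbers the reverse direction scores the single-mode student as the better fit, while the forward direction prefers the covering student, which matches the mode-seeking behaviour the abstract attributes to DMD-style training.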

[CV-87] Domain Generalizable Adaptation of 3D Vision-Language Models via Regularized Fine-Tuning

Link: https://arxiv.org/abs/2606.18472
Authors: Sneha Paul,Zachary Patterson,Nizar Bouguila
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at Transactions on Machine Learning Research (TMLR)

Click to view abstract

Abstract:Domain adaptation remains a central challenge in 3D vision, especially for multimodal foundation models that align 3D point clouds with visual and textual data. While these models demonstrate strong general capabilities, adapting them to downstream domains with limited data often leads to overfitting and catastrophic forgetting. To address this, we introduce ReFine3D, a regularized fine-tuning framework designed for domain-generalizable tuning of 3D large multimodal models (LMMs). ReFine3D combines selective layer tuning with two targeted regularization strategies: multi-view consistency across augmented point clouds and text diversity through synonym-based prompts generated by large language models. Additionally, we incorporate point-rendered vision supervision and a test-time augmentation mechanism with confidence-based aggregation to further enhance robustness. Extensive experiments across different 3D domain generalization benchmarks show that ReFine3D improves base-to-novel class generalization by 1.36%, cross-dataset transfer by 2.43%, robustness to corruption by 1.80%, and few-shot accuracy by up to 3.11%, outperforming prior state-of-the-art methods with minimal added computational overhead.

[CV-88] Reasoning as Intersection: Consensus-Frame Alignment for Visual Focus in Video-MLLMs

Link: https://arxiv.org/abs/2606.18441
Authors: Chengwen Liu,Zhe Huang,Jisheng Dang,Hong Peng,Qi Tian,Tat-Seng Chua
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Reinforcement learning has improved the reasoning ability of large language models, but applying outcome-only rewards to video multimodal large language models (Video-MLLMs) provides limited guidance on which visual evidence should support the answer. Inspired by multisensory integration, where consistent cues can enhance the salience and reliability of perceptual estimates, we introduce Consensus Frame GRPO (CF-GRPO), a temporal-annotation-free process-level reward framework for evidence-aware video reasoning. CF-GRPO constructs a consensus frame prior from intrinsic video cues, including temporal coverage, scene-transition cues, and query-conditioned visual relevance. It then computes a model-side frame-use score from visual and response representations and optimizes their agreement through the Consensus Frame Reward (CFR). With salience-aware sparse aggregation and distribution sharpening, CFR provides a high-contrast reward signal without requiring human temporal annotations. Experiments show that VideoCFR achieves competitive performance across complex video reasoning benchmarks and improves several metrics over representative Video-MLLM and RL baselines, while the consensus prior provides an interpretable view of the evidence frames emphasized during training. The implementation is available at this https URL.

[CV-89] RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer

Link: https://arxiv.org/abs/2606.18439
Authors: Jinhao You(1),Shuo Lyu(1),Zhuohang Lyu(1),Tanxuan Li(1),Zibo Zhao(1),Jiaxiang Hu(2),Kai Tang(3),Yichen Guo(3) ((1) University of Pennsylvania, (2) University of California, Irvine, (3) Nanyang Technological University)
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 9 pages, 3 figures, 7 tables. Jinhao You, Shuo Lyu, Zhuohang Lyu, Tanxuan Li, and Zibo Zhao contributed equally. Shuo Lyu is the corresponding author

Click to view abstract

Abstract:Visual Geometry Grounded Transformer (VGGT) recovers dense 3D scene structure from multi-view images in one forward pass, but quadratic cross-frame attention limits its scalability. Existing training-free accelerators reduce computation uniformly along one axis, missing layer heterogeneity. Our spectral, probing, and causal analyses reveal three regimes: shallow layers lack cross-view structure, middle layers drive cross-view alignment, and deep layers are redundant for dense geometry yet their cross-frame attention remains essential for pose. RegimeVGGT applies layer-wise U-shaped compression along two axes: Saliency-Guided Banded Merging protects geometry- and edge-salient tokens, while Selectively Protected K/V Downsampling preserves cross-frame spatial coverage and the pose-critical path through a phase-shifted spatial grid, a reference-frame anchor, and uncompressed camera/register tokens. Training-free, RegimeVGGT achieves a 6.7x speedup over VGGT* at matched reconstruction quality.

[CV-90] CAOA – Completion-Assisted Object-CAD Alignment

Link: https://arxiv.org/abs/2606.18429
Authors: Hiranya Garbha Kumar,Minhas Kamal,Balakrishnan Prabhakaran
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: GitHub: this https URL

Click to view abstract

Abstract:Accurately aligning CAD models to their corresponding objects in indoor RGB-D scans is a central challenge in 3D semantic reconstruction. The task requires estimating a 9-Degree-of-Freedom (DoF) pose (position, rotation, and scale along three axes) but is hindered by noisy and incomplete scans, as well as segmentation errors that cause geometric distortions. We present Completion-Assisted Object-CAD Alignment (CAOA), a method that integrates a semantically and contextually aware point cloud completion module with a symmetry-aware relative pose estimation algorithm, enabling precise alignment of CAD models to scanned objects. Existing completion methods are typically trained and evaluated on synthetic datasets, which often fail to generalize to real-world scans. To bridge this gap, we introduce a synthetic data generation strategy tailored to indoor scenes, significantly reducing the synthetic-to-real domain gap, as validated through quantitative comparisons with widely used completion datasets. In addition, we release S2C-Completion, an expert-annotated dataset of over 8,500 object-CAD pairs from Scan2CAD, created for real-world indoor single-object completion and intended as a new benchmark for this task. For object-CAD alignment, we incorporate symmetry information via a symmetry-aware loss, improving robustness to symmetric ambiguities. On the Scan2CAD benchmark, CAOA achieves a 17% accuracy improvement over state-of-the-art methods.

[CV-91] Budget-Aware Adaptive Adversarial Patches for Black-Box Object Detection ICIP2026

Link: https://arxiv.org/abs/2606.18318
Authors: Pedram MohajerAnsari,Amir Salarpour,David Fernandez,Mert D. Pesé
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: Accepted to the 2026 IEEE International Conference on Image Processing (ICIP 2026)

Click to view abstract

Abstract:Adversarial patches pose a practical threat to modern object detectors. Prior work shows vulnerability, but three gaps limit actionable insight: (i) few score-based black-box attacks jointly optimize patch location, texture, and size under tight query budgets; (ii) success is rarely tied to the patch’s visual footprint; and (iii) evaluations often conflate EOT robustness with plain-view suppression. We present a query-efficient, budget-adaptive black-box attack that couples a lightweight Contextual Thompson-Sampling placer with NES-style pixel updates, growing the patch only when progress stalls. Reporting is anchored by a strict plain-image suppression test; EOT is audited but never used as a substitute for success, and optional appearance/printability weights expose strength–visibility trade-offs. Across YOLOv5, Faster R-CNN, and YOLOS, the attack achieves strong suppression on CNN-based detectors and substantial suppression on the transformer-based detector, using compact patches and exposing clear query–footprint trade-offs relative to fixed-size and heuristic baselines. A print–capture pilot further shows transfer across unseen physical objects and viewpoints.
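"NES-style pixel updates" refers to estimating a gradient from score queries alone. The following is a minimal sketch of an antithetic NES estimator on a toy objective. The names `nes_gradient`, `toy_score`, and `optimize_patch` are hypothetical, and the quadratic `toy_score` merely stands in for a detector confidence that a real attack would query; nothing here reproduces the paper's placer or patch-growth logic.

```python
import random


def nes_gradient(score, x, sigma=0.1, samples=20, rng=random):
    """Antithetic NES estimate of the gradient of a black-box `score` at `x`."""
    grad = [0.0] * len(x)
    for _ in range(samples):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        plus = score([xi + sigma * ui for xi, ui in zip(x, u)])
        minus = score([xi - sigma * ui for xi, ui in zip(x, u)])
        # Two queries per direction; only score values are used, no gradients.
        weight = (plus - minus) / (2.0 * sigma * samples)
        for i, ui in enumerate(u):
            grad[i] += weight * ui
    return grad


def toy_score(x, target=(0.2, 0.8, 0.5, 0.1)):
    """Stand-in for a detector score to be minimized (assumption, not a detector)."""
    return sum((xi - ti) ** 2 for xi, ti in zip(x, target))


def optimize_patch(steps=200, lr=0.05, seed=0):
    rng = random.Random(seed)
    x = [0.5] * 4  # four "pixels", kept in the valid range [0, 1]
    for _ in range(steps):
        g = nes_gradient(toy_score, x, rng=rng)
        x = [min(1.0, max(0.0, xi - lr * gi)) for xi, gi in zip(x, g)]
    return x


final = optimize_patch()
print(toy_score([0.5] * 4), "->", toy_score(final))
```

Each step costs `2 * samples` queries, which is why a tight query budget constrains how large a patch can be optimized.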

[CV-92] EDoF-NeRF: extended depth-of-field neural radiance fields using a coded aperture camera

Link: https://arxiv.org/abs/2606.18826
Authors: Yoshiyuki Shirasaki,Ryoichi Horisaki
Categories: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Click to view abstract

Abstract:We propose a method for extending the depth-of-field (DoF) to construct high-fidelity neural radiance fields (NeRF) – an emerging technique for rendering photorealistic novel views from a dataset of images captured at different viewpoints, based on implicit neural representations. The trade-off between DoF and light quantity is inherent not only in conventional cameras but also in NeRF, since the datasets used by NeRF are captured by these cameras. To address this issue, we introduce a coded aperture placed at the camera pupil, preserving spatial frequency components under defocused conditions. We develop a camera model incorporating coded apertures into NeRF, allowing direct input of coded images and enabling the generation of novel views with an extended DoF. We validate the proposed method, termed extended DoF-NeRF (EDoF-NeRF), through simulations and experiments, demonstrating its superior performance compared to conventional aperture cameras.

[CV-93] DART: A design-aware microfluidic chip paradigm for real-time live-cell image analysis

Link: https://arxiv.org/abs/2606.18523
Authors: Johannes Seiffarth,Matthias Pesch,Lukas Scholtes,Dietrich Kohlheyer,Hanno Scharr,Katharina Nöh
Categories: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:High-throughput microfluidic live-cell imaging generates rich single-cell data. Yet semi-automated procedures for locating regions of interest (RoIs), each containing one cell population, and removing surrounding microfluidic structures from recorded images, scale with the number of RoIs. This prevents real-time image analysis and delays time-to-insight by hours to days. We introduce the Design-Aware and Real-Time capable (DART) paradigm for microfluidic cultivation chips, which aligns the CAD blueprint with the physical chip and thereby enables throughput-independent localization of all RoIs and fully automated image processing across diverse RoI geometries and chip layouts. DART establishes this alignment through embedded fiducial markers and deep-learning-based marker detection. We validate DART using the Swiss Army Knife chip, which combines eight structurally distinct RoI designs across 1164 RoI locations. DART localizes all RoIs in five minutes, removes microfluidic structures from raw microscopy images in 40 ms, and performs fully automated image analysis, including cell segmentation, in under 1.1 s per image. Together, these capabilities establish DART as an end-to-end hardware-software paradigm with real-time-capable analysis that paves the way toward closed-loop and outcome-driven smart microscopy.

Artificial Intelligence

[AI-0] UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning

Link: https://arxiv.org/abs/2606.19328
Authors: Mohamed Nabail,Leo Cheng,Jingmin Wang,Nicholas Rhinehart
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:Preference-based RL provides an approach to learning reward models from pairwise comparisons of behaviors, bypassing the need for explicit reward design. However, existing methods typically rely on passive data collection and suffer from poor sample efficiency, especially during the early stages of learning. We introduce a model-based approach that actively directs exploration by jointly reasoning over uncertainties in the reward, dynamics, and value functions. Our method, Uncertainty-Balanced Preference Planning (UBP2), uses ensembles of reward, dynamics, and value function models to evaluate candidate trajectories according to a unified score that combines expected reward, terminal value, and epistemic uncertainty. Planning under this objective yields an explicit tradeoff between exploitation and information acquisition without requiring ad hoc exploration heuristics. Under standard regularity assumptions, we establish sublinear regret guarantees for both finite-horizon and infinite-horizon settings. Empirically, experiments on the Meta-World benchmark show UBP2 achieves substantially higher sample efficiency than model-free preference-based methods and non-optimistic model-based baselines.
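The abstract describes a unified score combining expected reward, terminal value, and epistemic uncertainty, but does not give its form. One plausible reading, sketched below purely as an assumption, is an optimistic mean-plus-spread score over ensemble predictions. The function `trajectory_score`, the weight `beta`, and all numbers are hypothetical.

```python
from statistics import mean, pstdev


def trajectory_score(returns, terminal_values, beta=1.0):
    """Score one candidate trajectory from per-ensemble-member predictions.

    returns[k] and terminal_values[k] are member k's predicted reward sum and
    terminal value. Ensemble disagreement is used as an epistemic-uncertainty
    proxy (an assumption; the paper's exact score may differ).
    """
    totals = [r + v for r, v in zip(returns, terminal_values)]
    return mean(totals) + beta * pstdev(totals)


def plan(candidates, beta):
    """Pick the candidate name with the highest score."""
    return max(candidates, key=lambda name: trajectory_score(*candidates[name], beta=beta))


candidates = {
    # All members agree: well-understood behaviour with a solid return.
    "familiar": ([1.0, 1.0, 1.0], [0.5, 0.5, 0.5]),
    # Members disagree: slightly lower mean, but informative to try.
    "uncertain": ([0.6, 1.5, 1.0], [0.4, 0.5, 0.4]),
}

print(plan(candidates, beta=0.0))  # familiar
print(plan(candidates, beta=1.0))  # uncertain
```

With `beta = 0` the planner is purely exploitative; raising `beta` shifts it toward trajectories whose outcome the ensemble cannot yet agree on, which is the exploitation and information-acquisition trade-off the abstract mentions.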

[AI-1] Explaining Attention with Program Synthesis

Link: https://arxiv.org/abs/2606.19317
Authors: Amiri Hayes,Belinda Li,Jacob Andreas
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:A longstanding goal of research on interpretable deep learning is to replace opaque neural computations with human-meaningful symbolic descriptions. In this paper, we propose an approach for approximating the behavior of components of deep networks with executable programs. We focus on attention heads in transformer language models. For a given head, we first compute its associated attention matrices on a collection of randomly selected training examples. Next, we prompt a pre-trained language model with a summary of these matrices, and instruct it to generate a set of Python programs that can reproduce the associated attention patterns given only text from the input sentence. Finally, we re-rank programs according to how well our final set of programs predict behavior on held-out inputs. We demonstrate that a set of fewer than 1,000 such generated programs can reproduce the attention patterns of heads in GPT-2, TinyLlama-1.1B, and Llama-3B, achieving an average Intersection-over-Union similarity above 75% on TinyStories. Moreover, the best-fit programs can replace neural attention heads without substantially affecting model behavior: replacing 25% of attention heads with programmatic surrogates across the three models incurs only a 16% average perplexity increase, while maintaining performance on a variety of downstream question answering benchmarks. This work contributes a scalable pipeline for reverse-engineering attention heads in transformer models using human-readable, executable code, advancing a path toward symbolic transparency in neural models.
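To make the evaluation idea concrete, here is a small sketch of what a "program that reproduces an attention pattern from text" and an Intersection-over-Union comparison could look like. Both are our assumptions: `previous_token_program` is a hand-written example of the kind of program the pipeline might generate, and `attention_iou` binarizes attention at a threshold, which may not match the paper's exact IoU definition.

```python
def previous_token_program(tokens):
    """Hypothetical generated program: each token attends to its predecessor."""
    n = len(tokens)
    pattern = [[0.0] * n for _ in range(n)]
    for i in range(n):
        pattern[i][max(i - 1, 0)] = 1.0  # the first token attends to itself
    return pattern


def attention_iou(predicted, observed, threshold=0.5):
    """IoU of the sets of (query, key) cells whose weight reaches `threshold`."""
    inter = union = 0
    for p_row, o_row in zip(predicted, observed):
        for p, o in zip(p_row, o_row):
            a, b = p >= threshold, o >= threshold
            inter += a and b
            union += a or b
    return inter / union if union else 1.0


tokens = ["The", "cat", "sat", "down"]
# Made-up attention matrix for one head (rows are queries, columns are keys).
observed = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.8, 0.1, 0.0],
    [0.7, 0.1, 0.2, 0.0],  # this row deviates from the previous-token rule
]

iou = attention_iou(previous_token_program(tokens), observed)
print(iou)
```

Here three of the four rows match the rule, giving 3 shared cells out of 5 in the union.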

[AI-2] NeSyCat Torch: A Differentiable Tensor Implementation of Categorical Semantics for Neurosymbolic Learning

Link: https://arxiv.org/abs/2606.19279
Authors: Daniel Romero Schellhorn,Till Mossakowski,Björn Gehrke
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Category Theory (math.CT); Logic (math.LO); Probability (math.PR)
Comments:

Click to view abstract

Abstract:Neurosymbolic semantics is fragmented: classical, fuzzy, probabilistic and neural systems each define truth by their own inductive rules. NeSyCat, extending ULLER, subsumes them under a single inductive definition of truth, parametric in a strong monad and an aggregation structure on truth-values. NeSyCat has so far lacked an account of predicates and functions learned by neural networks. We provide NeSyCat Torch as the missing link and interpret computational symbols via neural networks, implementing the framework in probabilistic programming and tensor-based backends. We use the distribution monad for reference semantics and metric evaluation, and complement it by a monad for numerically stable, differentiable training: the lazy log-tensor monad over the log-semiring. For efficient training in batches, we furthermore employ a batch monad. The axioms are the source code: written once in monad-based do-notation, monadic bind performs marginalisation, lazily pruning unneeded branches. On MNIST addition, our HaskTorch, JAX, and PyTorch implementations outperform LTN and DeepProbLog in speed and accuracy, while achieving nearly the accuracy of DeepStochLog. However, unlike DeepStochLog, we stay in a uniform framework that applies to many first-order NeSy approaches. Namely, the construction is parametric in the monad; instantiating it with, e.g., the Giry monad extends the approach to continuous probability (working out a neural representation here is left for future work).
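The log-semiring mentioned above replaces multiplication with addition of log-probabilities and addition with log-sum-exp, which keeps marginalisation numerically stable. The sketch below applies those two operations to the MNIST-addition task in plain Python. It is an illustration of the semiring only, not the NeSyCat Torch API or its lazy log-tensor monad; all names and the example digit distributions are ours.

```python
import math


def logsumexp(values):
    """Semiring 'plus': log of a sum of exponentials, computed stably."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_sum_distribution(log_p_a, log_p_b):
    """log P(a + b = s) for s in 0..18 from two digit log-distributions."""
    terms = {s: [] for s in range(19)}
    for a, la in enumerate(log_p_a):
        for b, lb in enumerate(log_p_b):
            terms[a + b].append(la + lb)  # semiring 'times' is ordinary +
    return [logsumexp(terms[s]) for s in range(19)]


def peaked(digit, confidence):
    """Hypothetical classifier output: `confidence` on one digit, rest uniform."""
    rest = (1.0 - confidence) / 9
    return [math.log(confidence if d == digit else rest) for d in range(10)]


log_sums = log_sum_distribution(peaked(3, 0.9), peaked(5, 0.8))
best = max(range(19), key=lambda s: log_sums[s])
print(best, math.exp(log_sums[best]))
```

Marginalising over all digit pairs that produce a given sum is the step the abstract attributes to monadic bind; here it is written as an explicit double loop.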

[AI-3] XSlides: Benchmarking Audience-Conditioned Slide Generation

Link: https://arxiv.org/abs/2606.19256
Authors: Haodong Chen,Xuanhe Zhou,Wei Zhou,Xinyue Shao,Yanbing Zhu,Bo Wang,Jiawei Hong,Anya Jia,Fan Wu
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Automatically generating slide decks from source documents is an important application of large language models (LLMs). Existing benchmarks primarily assess slide completeness and technical depth, while overlooking the target audience as a critical real-world factor. For instance, specialists demand rigorous proofs, whereas decision-makers prioritize actionable conclusions. To bridge this gap, we introduce X+Slides, a benchmark specifically designed for audience-conditioned slide generation. Built on a diverse corpus spanning 113 topics and seven presentation scenes, X+Slides employs a dynamic evaluation framework constructed from 8,133 deduplicated, source-grounded probes. By assigning audience-specific utility weights to the same source-grounded probes, X+Slides reports four complementary metrics: Audience Coverage measures how much audience-essential information is conveyed, Domain-wise Coverage shows which information types are covered, Efficiency measures delivered utility per unit of attention cost, and Correctness verifies whether slide claims are supported by the source. Experiments on DeepPresenter, SlideTailor, and NotebookLM show that current systems can recover a substantial but still incomplete part of audience-essential information: at $\tau_A = 0.7$, DeepPresenter reaches a best Audience Coverage of 0.714, SlideTailor reaches 0.594, and the NotebookLM ablation reaches 0.853 while showing clear grounding differences. These results indicate that visual quality and broad topic coverage should not be treated as evidence support without source-grounded evaluation.

[AI-4] TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

Link: https://arxiv.org/abs/2606.19245
Authors: Hannah Le,Ramesh Ramasamy,Alex Urrutia,Mahsa Yazdani,Tim Proctor,Kenny Workman
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Artificial intelligence (AI) agents promise to accelerate drug discovery by compressing interpretation and decision-making loops, but practical deployment requires trusted evaluation on realistic program decisions. We introduce TherapeuticsBench Preclinical Pharmacology (TxBench-PP), a verifiable benchmark for small-molecule preclinical pharmacology and the first focused slice of a broader TherapeuticsBench effort across drug-discovery stages and therapeutic modalities. TxBench-PP tests whether agents can recover accurate conclusions from real-world assay data rather than memorized facts from literature. The benchmark contains 100 evaluations indexed by program stage, assay type, and task structure, spanning mechanism-of-action (MoA) and pharmacodynamic (PD) reasoning, compound-target engagement, causal target validation, developability and safety, and translational efficacy. Agents receive realistic workflow snapshots, inspect files in a coding environment, and return structured answers graded deterministically. Across 16 model-harness configurations, comprising 11 models and 4,800 trajectories, no system reliably recovered preclinical pharmacology decisions. The strongest configuration, Claude Opus 4.8 / Pi, passed 59.3% of endpoint attempts (178/300; 95% CI, 51.1-67.6), followed by GPT-5.5 / Pi at 55.3% (166/300; 47.0-63.6).

[AI-5] Mechanism-Guided Selective Unlearning for RLVR-Induced Reasoning

Link: https://arxiv.org/abs/2606.19222
Authors: Chenyu Zhou,Qiliang Jiang,Shuning Wu,Xu Zhou
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 4 figures, 7 tables

Click to view abstract

Abstract:We propose MAST (Mechanism-Aligned Selective Targeting), a mechanism-guided method for unlearning RLVR-induced reasoning with substantially lower collateral damage than standard full-parameter updates. In matched SFT/RLVR checkpoints on Qwen2.5-Math-1.5B and Qwen3-1.7B-Base, the SFT-to-RLVR increment differs sharply from the SFT update in token-level delta-log-probability, and full-parameter gradient ascent forgets only by damaging retain MATH and GSM8K. MAST ranks attention-projection tensors by off-principal energy, update magnitude, and forget-gradient coupling magnitude, then updates only the top-ranked subset. On the primary model, MAST induces statistically significant target forgetting (MATH forget 45/150 to 37/150; McNemar p=0.0078) while preserving GSM8K (+0.8 pp) and MATH retain (-0.5 pp). The advantage reproduces across seeds, NPO/SimNPO objectives, and Qwen3, where MAST preserves GSM8K while full-parameter unlearning collapses it.
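The reported McNemar p-value can be sanity-checked with the exact (binomial) form of the test. The abstract gives only the before and after counts (45/150 to 37/150), not the discordant pairs, so the split used below, 8 items flipping from correct to incorrect and none flipping back, is our assumption. It happens to reproduce the reported p = 0.0078.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts b and c."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) tail probability of seeing k or fewer in the smaller cell.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Assumed discordant counts (not stated in the abstract):
# b = solved before unlearning but not after, c = the reverse.
p_value = mcnemar_exact(b=8, c=0)
print(f"{p_value:.4f}")  # 0.0078
```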

[AI-6] Machine Unlearning for the XGBoost Model with Network Intrusion Datasets

Link: https://arxiv.org/abs/2606.19220
Authors: Diana Magalhães,Eva Maia,João Vitorino,Isabel Praça
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 7 tables, WorldCist’26 Conference

Click to view abstract

Abstract:Machine Unlearning (MU) has emerged as an important technique for removing specific data points from trained models without requiring full retraining. However, most existing MU research focuses on deep learning and image data, leaving a gap in the domain of network intrusion detection, which relies heavily on tabular data. This work introduces XGBoost-Forget, an unlearning approach for the XGBoost model, to address this gap. The approach is evaluated on two tabular Network Intrusion (NI) datasets, IoT-23 and GeNIS, using multiple metrics to assess model performance, unlearning efficiency, and forgetting quality. The results show that XGBoost-Forget maintains predictive performance close to the original model while providing significantly faster unlearning, demonstrating its potential for MU in tabular NI settings.

[AI-7] Forecasting what Matters: Decision-Focused RL for Controlled EV Charging with Unknown Departure Times

Link: https://arxiv.org/abs/2606.19199
Authors: Giuseppe Gabriele,Fabio Pavirani,Seyed Soroush Karimi Madahi,Chris Develder
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ACM e-Energy 2026; 5 pages, 1 figure, 1 table

Click to view abstract

Abstract:The recent growth of EV adoption poses challenges for power systems, including increased peak demand and potential grid instability. Smart control of EV charging – e.g., based on reinforcement learning (RL) – can alleviate these issues by learning temporal and contextual patterns from historical data. Yet, in real-world scenarios, key features, such as departure time, often are unavailable. This, in turn, makes it harder for an RL agent to learn and execute an effective charging policy. To mitigate this uncertainty, a trained forecaster can approximate the unknown features from available data. However, since these forecasting models are typically trained for accuracy (rather than their impact on a downstream agent’s decision quality), their errors may propagate and hinder the overall performance of a controller that is using the forecasts. To avoid this, we propose a decision-focused RL (DF-RL) framework in which the forecaster is trained end-to-end, i.e., with feedback from the charging policy actions taken by the RL agent. Such joint training of both the forecaster and controller ultimately results in higher-quality actions: our proposed DF-RL method yields superior charging decisions compared to other baselines, achieving up to a 14% improvement in total reward and a 55% reduction of unsupplied energy (i.e., charging that failed to happen because the EV already left), relative to the RL method without departure time forecasting.

[AI-8] The More the Merrier: Combining Properties for ABox Abduction under Repair Semantics for $\mathcal{EL}_\bot$

Link: https://arxiv.org/abs/2606.19197
Authors: Anselm Haak,Patrick Koopmann,Yasir Mahmood,Anni-Yasmin Turhan
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Abduction is a central approach to explaining missing entailments from a knowledge base by providing a hypothesis that, if added to the knowledge base, would make the missing entailment hold. Abduction under repair semantics has recently been investigated in detail, where several desirable properties and optimality criteria were considered, such as signature restrictions and minimality in size and in introduced conflicts. Naturally, hypotheses that satisfy more than one of these properties, or that combine a property with an optimality criterion, would be even more desirable for applications. So far, such hypotheses have not been investigated in the literature. In the present paper, we consider the ABox abduction problem for hypotheses satisfying more than one property or additional optimality criteria, for $\mathcal{EL}_\bot$ under brave and AR semantics. Our main observation is that requiring additional properties for hypotheses often does not lead to an increase in complexity.

[AI-9] Compute Efficiency and Serial Runtime Tradeoffs for Stochastic Momentum Methods

Link: https://arxiv.org/abs/2606.19179
Authors: Depen Morwani,Alexandru Meterez,Pranav Nair,Sham Kakade
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:

Abstract:Stochastic momentum methods such as heavy ball (HB), Nesterov momentum, and variants of Accelerated SGD (ASGD) [Kidambi et al., 2018] are widely used in modern training, but their stochastic benefits depend on two distinct quantities: serial runtime, the number of iterations needed to reach a target accuracy, and compute efficiency (CE), the inverse total gradient-query or FLOP cost. Larger batches reduce serial runtime without hurting CE only when the contraction gap grows linearly with batch size. We study stochastic HB and ASGD for consistent linear regression with Gaussian covariates and prove finite-dimensional, discrete-time lower bounds on their batch-size tradeoffs. Our first result shows that HB does not improve the CE frontier over SGD for arbitrary spectra; rather, it preserves SGD-level CE over a larger batch-size window, allowing larger batches to reduce serial runtime until HB reaches its deterministic accelerated scale. This window can be a factor $\sqrt{\kappa}$ larger than the SGD critical batch size. For ASGD, the picture is more spectrum-dependent: for rapidly decaying power-law spectra, ASGD improves small-batch CE over HB/SGD, but as batch size grows it trades this CE advantage for improved serial runtime. Synthetic linear-regression experiments verify these qualitative regimes, including near-overlap of ASGD and HB for slowly decaying spectra and the predicted CE–serial tradeoff for rapidly decaying spectra.
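
To make the setting concrete, here is a minimal sketch of mini-batch heavy ball on consistent (noise-free) linear regression with Gaussian covariates. It is only an illustration of the update rule, not the paper's experimental setup: the function name, the isotropic covariance, and the values of `lr`, `beta`, `batch_size`, and `steps` are all assumptions chosen for readability.

```python
import random


def stochastic_heavy_ball(w_star, steps=3000, batch_size=8, lr=0.05, beta=0.5, seed=0):
    """Mini-batch heavy ball on y = <w_star, x> with x ~ N(0, I).

    Hypothetical helper: hyperparameters are illustrative, not tuned.
    """
    rng = random.Random(seed)
    d = len(w_star)
    w = [0.0] * d
    velocity = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for _ in range(batch_size):
            x = [rng.gauss(0.0, 1.0) for _ in range(d)]
            prediction = sum(wi * xi for wi, xi in zip(w, x))
            target = sum(si * xi for si, xi in zip(w_star, x))
            residual = prediction - target
            # Gradient of 0.5 * residual^2, averaged over the mini-batch.
            for j in range(d):
                grad[j] += residual * x[j] / batch_size
        # Heavy-ball update: the velocity buffer accumulates past gradients.
        velocity = [beta * v - lr * g for v, g in zip(velocity, grad)]
        w = [wi + vi for wi, vi in zip(w, velocity)]
    return w


if __name__ == "__main__":
    w_star = [1.0, -2.0, 0.5]
    print(stochastic_heavy_ball(w_star))
```

In this sketch, serial runtime corresponds to `steps`, while the total gradient-query cost is `steps * batch_size`; varying `batch_size` at a fixed target accuracy is one way to explore the tradeoff the abstract describes.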

[AI-10] Hardware- and Vision-in-the-Loop Validation of Deep Monocular Pose Estimation for Autonomous Maritime UAV Flight

Link: https://arxiv.org/abs/2606.19176
Authors: Maneesha Wickramasuriya,Beomyeol Yu,Jaden Shin,Mason Huslig,Taeyoung Lee,Murray Snyder
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 6 pages, 9 figures

Abstract:Autonomous UAV operations on ships require reliable vision-based relative pose estimation, yet at-sea validation is costly, weather-dependent, and risky. This paper presents a hardware-validated vision-in-the-loop framework that enables fully autonomous indoor flight while emulating photorealistic maritime environments. Rendered maritime views are processed onboard by a deep transformer-based monocular pose estimator. Delayed vision measurements are fused with high-rate IMU data using a delayed Kalman filter to provide consistent state estimates for geometric control. The system captures critical embedded effects, including perception latency, asynchronous updates, and computational constraints, that are absent in pure simulation. Autonomous takeoff, trajectory tracking, and landing experiments demonstrate stable closed-loop flight. The results establish a safe and hardware-realistic intermediate stage for developing maritime UAV autonomy prior to shipboard deployment.

[AI-11] User as Engram: Internalizing Per-User Memory as Local Parametric Edits

Link: https://arxiv.org/abs/2606.19172
Authors: Bojie Li
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Personal memory in a language model is two problems: content and reasoning skill. The brain keeps the two apart (a sparse, local engram in the hippocampus for each episode, a slow neocortex for the shared skills that interpret it), so a new fact need not overwrite everything else. Most personalization today keeps a user’s facts outside the weights, in a natural-language memory file or a retrieval index. When facts are written into the model instead, the standard recipe is the per-user LoRA adapter, which does the opposite of the brain, folding content and skill into one global weight delta. Writing a user’s facts as a LoRA contaminates text unrelated to them; writing the same facts as local Engram rows leaves it mathematically untouched, resulting in a roughly 33,000x smaller memory footprint. We therefore propose User as Engram: store a user’s content as surgical edits to the hash-keyed memory table of an Engram model, and carry the reasoning skill in one shared adapter. This layered design matches per-user LoRA’s direct recall while delivering 5.6x higher indirect-reasoning accuracy on average, and never makes a single user worse at reasoning than the untouched base. The edit is a glass box: writing a fact switches on its lookup at exactly the trigger, adds the value the answer needs, leaves every other position unchanged to the last bit, and fails if written into the wrong layer. Because different users’ facts land in disjoint hash slots, their edits compose: many users live in one shared table at once, stacking additively and losslessly, where a per-user LoRA, a single global weight delta, admits only one. Upon retrieval, a per-user Engram table does not grow with the population the retriever must search, so past ~100 facts it overtakes a retrieval pipeline on a 2.5x larger model. 

[AI-12] Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

Link: https://arxiv.org/abs/2606.19168
Authors: Jinhan Li,Kexian Tang,Yihan Xu,Zhuorui Ye,Kaifeng Lyu
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:To achieve deeper safety alignment for large language models (LLMs), recent efforts have studied how to push safety interventions earlier into the pretraining stage, primarily by filtering unsafe data or rewriting it into safer forms. We argue that pretraining-stage alignment should go beyond making the data safe: LLMs may compose seemingly benign knowledge and capabilities into unsafe behaviors. To this end, we propose Safety Reflection Pretraining, a pretraining-stage alignment method which regularly inserts short safety reflections into pretraining corpora to integrate self-monitoring directly into language modeling, establishing a foundational capability that is subsequently reinforced by compatible post-training. Our experiments with 1.7B models pretrained on FineWeb-Edu show that Safety Reflection Pretraining improves safety classification accuracy and substantially reduces the success rates of inference-stage and finetuning attacks. Complementary to our real-world experiments, we also introduce a fully controlled synthetic environment, MedSafetyWorld, with a clear definition of safety and a reasoning structure under which models can easily generalize unsafe behaviors from safe data. Ablations in MedSafetyWorld further demonstrate a clear advantage of Safety Reflection Pretraining in preventing models from acting on unsafe behaviors generalized from safe data, compared with data filtering and rewriting. Taken together, our findings suggest that pretraining alignment should not only make the training data safe, but also shape the behaviors that models are likely to acquire from safe data.

[AI-13] Essential Subspace Merging for Multi-Task Learning

Link: https://arxiv.org/abs/2606.19164
Authors: Longhua Li,Lei Qi,Xin Geng,Qi Tian
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Model merging aims to enable multi-task learning by integrating the capabilities of multiple models fine-tuned from the same pre-trained checkpoint into a single model. Its core challenge is inter-task interference among task-specific parameter updates. In this paper, we analyze the output shifts induced by task updates and observe that their energy is concentrated in a small number of principal directions. We call the subspace spanned by these directions the essential subspace. In contrast, most remaining directions carry little task-relevant energy, but their accumulation across multiple task updates can cause severe interference during merging. Motivated by this observation, we propose Essential Subspace Decomposition (ESD), which decomposes each task update according to the principal components of its activation shift. Based on ESD, we introduce Essential Subspace Merging (ESM), a training-free static merging method that orthogonalizes and fuses essential components into one compact multi-task model. We further extend ESM to ESM++, a training-free dynamic merging method that decomposes task-specific residuals into low-rank experts and selects the most relevant expert through prototype-based routing during forward inference. Extensive experiments across multiple task sets and model scales demonstrate that ESM and ESM++ effectively preserve task knowledge while reducing inter-task interference.

[AI-14] OrthoReg: Orthogonal Regularization for Hybrid Symbolic-Neural Dynamical Systems

Link: https://arxiv.org/abs/2606.19145
Authors: Till Richter,Niki Kilbertus
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:

Abstract:Dynamical systems are fundamental to modeling the natural world, yet modeling them involves a persistent trade-off: manually prescribed mechanistic models are interpretable by design but often overly simplistic and misspecified; in contrast, flexible data-driven neural methods lack physical insight. Hybrid modeling aims for the best of both worlds by combining a prescribed or symbolic, physics-based component with a flexible neural network. A critical challenge, however, is that the neural component may relearn mechanistic parts, yielding redundant and uninterpretable models, especially when the symbolic structure itself is discovered from data. Existing methods based on standard $L^2$ regularization rely on a projection argument that breaks when the symbolic component is learned through sparse discovery, allowing the neural augmentation to overlap with symbolic structure. We introduce OrthoReg (Orthogonal Regularization), which directly penalizes overlap between the symbolic and neural components, preventing symbolic structure from being absorbed by the neural residual. This yields a complementary decomposition: the symbolic part captures what the library can express, and the neural part captures what remains. On benchmark dynamical systems with partial library mismatch, OrthoReg improves symbolic recovery and out-of-distribution behavior.
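
The abstract does not spell out the exact penalty, so the following is only one plausible reading of "penalizing overlap": the squared cosine similarity between the symbolic and neural outputs evaluated on the same batch of states. The function name `overlap_penalty` and this particular form are assumptions made for illustration, not the authors' definition.

```python
def overlap_penalty(symbolic_out, neural_out, eps=1e-12):
    """Squared cosine similarity between two output vectors on one batch.

    Hypothetical penalty: 0 when the components are orthogonal on the batch,
    close to 1 when the neural output is a rescaled copy of the symbolic one.
    """
    dot = sum(s * n for s, n in zip(symbolic_out, neural_out))
    norm_s = sum(s * s for s in symbolic_out)
    norm_n = sum(n * n for n in neural_out)
    # eps guards against division by zero when either output is all zeros.
    return dot * dot / (norm_s * norm_n + eps)


if __name__ == "__main__":
    print(overlap_penalty([1.0, 0.0], [0.0, 2.0]))  # orthogonal outputs
    print(overlap_penalty([1.0, 2.0], [2.0, 4.0]))  # neural part copies symbolic part
```

A term like this would typically be added to the data-fitting loss with a weighting coefficient, so that the neural residual is discouraged from reproducing what the symbolic library already expresses.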

[AI-15] Pareto Q-Learning with Reward Machines ICAPS2026

Link: https://arxiv.org/abs/2606.19134
Authors: Arnaud Lequen,Clément Legrand-Lixon,Léo Saulières
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the ICAPS 2026 Workshop on Bridging the Gap Between AI Planning and (Reinforcement) Learning (PRL)

Abstract:We present Pareto Q-Learning with Reward Machines (PQLRM), a multi-objective reinforcement learning algorithm for tasks whose reward structure is specified by a set of reward machines (RMs). PQLRM combines Pareto Q-Learning (PQL), which maintains sets of vector-valued Q-estimates to approximate the Pareto front, with enhancements from Q-Learning with Reward Machines (QRM), which exploits the factored automaton structure of the reward signal. This yields a multi-policy algorithm that remains sample-efficient under non-Markovian, RM-encoded rewards. Experimental trials show that PQLRM converges faster than a naive PQL baseline applied to the cross-product MDP and can synthesize Pareto-optimal policies that QRM cannot.
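
The building block behind "sets of vector-valued Q-estimates" is a non-dominated filter: among candidate return vectors, keep only those that no other vector beats on every objective. The sketch below shows that filter in isolation, assuming all objectives are maximized; the function names are illustrative and nothing here reproduces the reward-machine part of PQLRM.

```python
def dominates(a, b):
    """True if vector a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(vectors):
    """Keep only the vectors that are not dominated by any other vector."""
    return [v for v in vectors if not any(dominates(u, v) for u in vectors)]


if __name__ == "__main__":
    q_estimates = [(1, 2), (2, 1), (0, 0), (1, 1)]
    print(pareto_front(q_estimates))
```

In Pareto Q-Learning, a filter of this kind is applied to the set of vector estimates stored per state-action pair, which is why the algorithm can return several trade-off policies rather than a single scalarized one.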

[AI-16] Analysing drivers and interdependencies in European electricity markets using XAI

Link: https://arxiv.org/abs/2606.19118
Authors: Antoine Pesenti,Aidan O’Sullivan
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); General Economics (econ.GN)
Comments: 12 pages

Abstract:Electricity markets are inherently complex systems characterised by strong nonlinearities, high-dimensional interactions, and increasing interdependence across regions. While deep neural networks (DNNs) have demonstrated strong predictive capabilities for electricity prices, their lack of interpretability limits their usefulness for understanding the underlying drivers of price formation. This paper addresses this gap by combining DNN models with explainable artificial intelligence (XAI) techniques to analyse the determinants of electricity prices across 39 European bidding zones. We employ SHAP (SHapley Additive exPlanations) to quantify feature contributions and apply and extend SSHAP, an aggregation framework to improve interpretability in high-dimensional settings. The analysis identifies that renewable energy sources, particularly solar, play a disproportionately important role in price formation despite their lower share in total power generation. Gas prices remain a dominant and consistent driver across electricity markets, while interconnections significantly shape price dynamics, highlighting the strong interdependence of European electricity systems. In addition, a synthetic EU-wide electricity market is constructed to explore the counterfactual scenario of a fully integrated market with a single price.

[AI-17] Towards an Agent-First Web: Redesigning the Web for AI Agents

Link: https://arxiv.org/abs/2606.19116
Authors: Eranga Bandara,Ross Gore,Ravi Mukkamala,Asanga Gunaratna,Safdar H. Bouk,Xueping Liang,Peter Foytik,Abdul Rahman,Sachini Rajapakse,Isurunima Kularathna,Pramoda Karunarathna,Chalani Rajapakse,Ng Wee Keong,Kasun De Zoysa,Tharaka Hewa,Amin Hass,Wathsala Herath,Aruna Withanage,Nilaan Loganathan,Atmaram Yarlagadda,Sachin Shetty
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:The World Wide Web was built on an assumption held for three decades: the primary consumer of web content is a human being. This permeates every layer; its access model presumes human visitors, its economics rest on human attention, and its content targets human perception. The rapid emergence of AI agents as intermediaries between humans and web content invalidates this assumption. Yet the web resists agents through blanket blocking, CAPTCHA-based exclusion, and economic models that treat agent access as extraction rather than legitimate interaction. This paper proposes a principled redesign across three layers. At the access layer, agents acting for humans should inherit equivalent access rights, governed by rate limiting and agent identification metadata in HTTP requests, analogous to browser headers, alongside a dual-layer architecture serving human-readable and agent-optimized content from the same domain. At the economic layer, we propose an intent-based tier framework grounded in the agent-as-human-proxy principle: an agent’s economic obligation mirrors that of the human it represents. A token-based subscription model meters content in tokens rather than pageviews, alongside a commissioned content economy anchoring AI content production in human intentionality. At the content layer, we identify epistemic recursion, the self-referential loop in which AI-generated content is consumed by agents to produce further content, progressively detaching web knowledge from human ground truth. We propose the Agent Text Markup Language (ATML), a four-level human supervision tier model, and a cryptographic provenance chain to counter this threat. Together these constitute ten design principles for an agent-first internet, one in which agents are first-class citizens whose integration requires renegotiating the web’s foundational social contract across access, economics, and content. 

[AI-18] ARIADNE: Agnostic Routing for Inference-time Adapter DyNamic sElection

Link: https://arxiv.org/abs/2606.19079
Authors: Enrico Cassano,Michał Brzozowski,Zuzanna Dubanowska,Paolo Mandica,Neo Christopher Chung
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:The increasing deployment of parameter-efficient fine-tuning (PEFT) has led to model ecosystems in which a single backbone is paired with many task-specialized adapters. In this setting, inference-time queries often arrive without task labels, requiring the system to automatically select the most appropriate adapter from a growing and heterogeneous adapter pool. Existing routing methods either depend on access to adapter internals, such as weight decompositions or gradient-based statistics, or require additional router training, which limits scalability and portability as new adapters are added. We introduce ARIADNE, a training-free, adapter-agnostic routing framework for dynamic adapter selection at inference time. ARIADNE represents each adapter through a set of centroids computed from embeddings of its training set, capturing the data distribution associated with that adapter. Given an unlabeled input, it selects an adapter by measuring proximity to these centroids in latent space. Because routing is performed entirely in the input embedding space, ARIADNE is compatible with arbitrary PEFT methods and requires no modification to the adapters or training procedures. Primarily evaluated with Llama 3.2 1B Instruct on 23 diverse NLP tasks, ARIADNE recovers 97.44% of the upper bound performance. Scaling to 44 tasks, it achieves 89.7% average selection accuracy, without additional training or access to adapter internals.
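
The routing idea can be sketched as a nearest-centroid lookup: each adapter is described by a few centroids of its training-set embeddings, and a query goes to the adapter that owns the closest centroid. The sketch below assumes the embeddings are already computed and uses Euclidean distance; the abstract only says "proximity", so the metric, the function name `route`, and the adapter names are assumptions.

```python
import math


def route(query_embedding, adapter_centroids):
    """Return the adapter whose nearest centroid is closest to the query.

    adapter_centroids maps an adapter name to a list of centroid vectors.
    """
    best_adapter, best_distance = None, math.inf
    for adapter, centroids in adapter_centroids.items():
        for centroid in centroids:
            distance = math.dist(query_embedding, centroid)
            if distance < best_distance:
                best_adapter, best_distance = adapter, distance
    return best_adapter


if __name__ == "__main__":
    # Toy 2-D "embeddings"; real ones would come from an encoder.
    centroids = {
        "sentiment": [(0.0, 0.0), (1.0, 0.0)],
        "qa": [(5.0, 5.0)],
    }
    print(route((4.0, 4.5), centroids))
    print(route((0.9, 0.1), centroids))
```

Because the lookup touches only input embeddings, adding a new adapter amounts to adding one more entry to the dictionary, which matches the training-free, adapter-agnostic property the abstract emphasizes.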

[AI-19] RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents

Link: https://arxiv.org/abs/2606.19047
Authors: Ruishan Fang,Siyuan Lu,Chenyi Zhuang,Tao Lin
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Multi-turn tool-use RL is bottlenecked by the rapid depletion of informative samples in static datasets. We observe that the gradient signal in GRPO concentrates on tasks with the highest rollout reward variance, a consequence of the Popoviciu upper bound. Consequently, samples near the agent’s capability boundary – where successes and failures are roughly balanced – contribute disproportionately large policy gradients. As training progresses, this boundary continuously shifts, which gradually depletes the pool of informative samples in a static dataset. We propose RODS (Reward-driven Online Data Synthesis) to resolve this depletion. RODS closes the loop between RL training and data generation by repurposing the progress reward variance as a practical, zero-cost boundary detector that requires no extra inference beyond the rollouts already computed for training. It continuously identifies such boundary samples, synthesizes new multi-turn variants matching their structural complexity (e.g., API topology and dependency depth) via a skill-aligned resampling pipeline, and manages a dynamic replay buffer that co-evolves with the policy. Starting from 400 human seeds and maintaining an active training pool of ~800 samples, RODS achieves comparable performance to a 17K-sample offline pipeline while requiring roughly 20x fewer trajectories, and improves over fixed-data RL and environment augmentation in our controlled setting.
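
Popoviciu's inequality bounds the variance of any quantity confined to [low, high] by (high - low)^2 / 4, with equality when the values split evenly between the two extremes. That is why tasks with balanced successes and failures carry the most reward variance. The sketch below uses this to flag boundary tasks from rollouts that were already collected; the function names and the `min_fraction` threshold are illustrative assumptions, not the RODS selection rule.

```python
from statistics import pvariance


def popoviciu_bound(low, high):
    """Largest possible variance of a variable bounded in [low, high]."""
    return (high - low) ** 2 / 4


def boundary_tasks(rollout_rewards, min_fraction=0.5, low=0.0, high=1.0):
    """Return tasks whose rollout reward variance is at least
    min_fraction of the Popoviciu upper bound (hypothetical criterion)."""
    threshold = min_fraction * popoviciu_bound(low, high)
    return [
        task
        for task, rewards in rollout_rewards.items()
        if pvariance(rewards) >= threshold
    ]


if __name__ == "__main__":
    rollouts = {
        "always_solved": [1.0, 1.0, 1.0, 1.0],
        "near_boundary": [1.0, 0.0, 1.0, 0.0],
        "never_solved": [0.0, 0.0, 0.0, 0.0],
    }
    print(boundary_tasks(rollouts))
```

The detector needs no extra inference because it reuses the per-task rewards that group-based methods such as GRPO already compute for each batch of rollouts.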

[AI-20] Where Did the Variability Go? From Vibe Coding to Product Lines by Regeneration

Link: https://arxiv.org/abs/2606.19042
Authors: Xhevahire Tërnava
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: VARIABILITY 2026

Abstract:In vibe coding, an emerging AI-driven paradigm, an LLM generates an entire program from a natural language prompt, but what happens to the variability that traditional software engineering carefully builds into code? To answer this question, we conducted an exploratory analysis on 10 vibe-coded C/C++ projects, which suggests that there is near-zero in-artifact variability, i.e., at compile time and runtime. All variability decisions are resolved at a single new binding time, generation time, the moment the LLM produces the source code. Rather than treating this as a defect to fix, we propose Variability by Regeneration (VbR), to our knowledge the first product-line approach in which the LLM acts as the derivation engine, generating a purpose-built binary free of dead code for each variant from a declarative specification, while a variant dispatcher transparently routes user requests to the matching binary. We formalise VbR, contrast it with classical SPL derivation, and demonstrate its full pipeline on a wc product family. For SPL engineering, variability in AI-generated software belongs in the specification, not in the code.

[AI-21] A Hybrid LSTM–Vision Transformer Architecture for Predicting HRRR Forecast Errors

Link: https://arxiv.org/abs/2606.19026
Authors: David Aaron Evans,Jay C. Rothenberger,Kara J. Sulia,Nick P. Bassill,Chris D. Thorncroft
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: This manuscript is a preprint and has been submitted for peer review to the Artificial Intelligence for the Earth Systems journal. The content is subject to change based on the outcome of the peer review process and should not be considered final or definitive. Copyright in this Work may be transferred without further notice

Abstract:Forecast errors in high-resolution numerical weather prediction (NWP) systems are often linked to unresolved planetary boundary layer (PBL) processes, convection, terrain-induced circulations, and other vertically structured atmospheric phenomena. Previous work demonstrated that Long Short-Term Memory (LSTM) networks can successfully predict forecast errors in the High-Resolution Rapid Refresh (HRRR) model using mesonet observations, but we believe performance degradation is linked to periods of complex vertical atmospheric evolution. To address this limitation, we develop a hybrid LSTM-Vision Transformer (LSTM-ViT) framework that combines temporal sequence learning from surface observations with atmospheric profiles from the New York State Mesonet profiler network. The LSTM-ViT framework is trained to predict HRRR hourly precipitation, 10 m wind speed, and 2 m temperature forecast errors at individual mesonet stations. Across all three predictors, incorporation of profiler-derived atmospheric structure improves forecast error prediction skill relative to the baseline LSTM architecture, with the largest gains occurring at shorter forecast lead times and during periods of enhanced PBL activity. Improvements are particularly pronounced for precipitation forecast error, where the LSTM-ViT framework achieves approximately a twofold increase in predictive skill relative to the baseline LSTM while better capturing convectively driven error evolution and reducing degradation associated with PBL processes. These results demonstrate that combining temporal sequence learning with vertically informed attention mechanisms provides a physically meaningful pathway for improving forecast error prediction in operational NWP systems. Our research offers forecasters enhanced guidance regarding model bias and forecast confidence.

[AI-22] FoMoE: Breaking the Full-Replica Barrier with a Federation of MoEs

Link: https://arxiv.org/abs/2606.19025
Authors: Lorenzo Sani,Zeyu Cao,Meghdad Kurmanji,Alex Iacob,Andrej Jovanovic,Yan Gao,Wanru Zhao,Nicholas D. Lane
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Systems and Control (eess.SY)
Comments:

Abstract:Pre-training Large Language Models (LLMs) typically demands large-scale infrastructure with tightly coupled hardware accelerators. While increasing model and dataset scale remains the dominant driver of performance, Mixture-of-Experts (MoEs) architectures have recently achieved state-of-the-art results by decoupling parameter count from computational cost. This efficiency enables training massive models on constrained compute budgets, yet it typically requires the high-speed interconnects of a single datacenter. To overcome these physical limits, recent approaches such as DiLoCo and Photon use low-communication data-parallel methods to enable scaling across geographically distributed, weakly connected data centers. However, these methods suffer from a fundamental inefficiency: they require full model replicas at every site, which imposes prohibitive memory constraints and communication overheads. In this work, we introduce FoMoE, a system that breaks the full-replica paradigm by partitioning expert layers across workers. We demonstrate that FoMoE: (I) reduces communication costs by up to 1.42x over efficient baselines and 45.44x over DDP via partial expert replication in the studied regimes; (II) achieves empirical throughput speedups of up to 1.4x through a novel skip-token mechanism; and (III) shows stable routing in the trained proxy regimes and projects the communication/memory benefits to 100B-scale configurations through system modelling.

[AI-23] Spotlight: Synergizing Seed Exploration and Spot GPUs for DiT RL Post-Training

Link: https://arxiv.org/abs/2606.19004
Authors: Ruiqi Lai,Dakai An,Wei Gao,Ju Huang,Siran Yang,Jiamang Wang,Lin Qu,Dmitrii Ustiugov,Wei Wang
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Reinforcement learning (RL) post-training of Diffusion Transformers (DiTs) is prohibitively expensive, requiring thousands of high-end GPUs. Existing works explore two directions to reduce cost: seed exploration improves training convergence by selecting high-contrast samples, yet adds compute to the critical path; spot GPUs offer 69–77% lower cost, yet sit idle during training because DiT rollouts finish nearly simultaneously, which prevents LLM-style pipelining of rollout with training. Spot preemptions further break Sequence Parallelism (SP) groups, fragmenting GPU topology. We present Spotlight, the first system that harvests spot GPUs for DiT RL post-training. Spotlight rests on two key insights: (1) exploration can tolerate stale model weights, because exploration that uses the model weights from the previous iteration preserves the relative ranking of random seeds, allowing exploration to run on idle spot GPUs during training; (2) SP reconfiguration can reuse on-node state, reducing group recovery from minutes to sub-second launches. Built on these insights, Spotlight introduces three techniques: a bandit-based exploration planner that maximizes reward variance within the training time budget, elastic sequence parallelism that reconfigures SP groups on the fly via persistent schedulers and intra-node weight copying, and a preemption-aware pull-based request scheduler that balances load and commits in-flight state upon preemption. We implement Spotlight on the open-source RL platform ROLL and evaluate it on Qwen-Image post-training. Spotlight reaches the same target validation score $4\times$ faster than baselines, reducing total cost by $1.4\times$ to $6.4\times$ while achieving superior image quality on the DeepSeek-OCR and Geneval datasets at resolutions $512\times512$ and $1280\times1280$.

[AI-24] TRAP: Benchmark for Task-completion and Resistance to Active Privacy-extraction

Link: https://arxiv.org/abs/2606.18996
Authors: Moon Ye-Bin,Nam Hyeon-Woo,Baek Seong-Eun,Yejin Yeo,Tae-Hyun Oh
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Agents are increasingly deployed in document-intensive workflows where sensitive private information is not an edge case but a routine input, e.g., an agent booking a flight needs passport numbers. In such settings, the agent must use private information to complete tasks accurately while never exposing it in its responses, because it cannot verify who is actually at the keyboard. These two obligations are in fundamental tension. A model capable enough to use private information for task completion can, by the same capability, be induced to reveal it. To evaluate the trade-off of task accuracy and privacy leakage, we introduce Task-completion and Resistance to Active Privacy-extraction (TRAP). Each scenario includes a document containing private information, a task query that requires the agent to invoke the correct tool using private fields, and an attack query that attempts to elicit the same information in natural language. Evaluating 22 models spanning frontier proprietary and open-source models at multiple scales, we find that all model families exhibit non-trivial leakage, and that instruction-following ability correlates with leakage rate. Existing prompt-based defenses reduce leakage but at significant cost to task accuracy. Prompt optimization fails to escape this trade-off. We demonstrate that this failure is not incidental. For any softmax-based model, no soft-constraint defense, e.g., prompt-based defenses, can jointly achieve high task success with zero leakage probability. Motivated by this impossibility result, we propose structural private field isolation, which replaces private fields with hash keys before they reach the model. This approach largely prevents leakage while keeping task accuracy.
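
Structural isolation can be pictured as follows: private values are swapped for opaque keys before the document reaches the model, and the keys are resolved back to real values only inside the tool-calling layer. The sketch below is an assumption-laden illustration of that flow, not the paper's implementation: the function names, the `PRIV_` prefix, the keyed HMAC (chosen so that low-entropy values cannot be guessed from the key), and the in-memory vault are all hypothetical.

```python
import hashlib
import hmac


def isolate_private_fields(document, private_fields, secret):
    """Replace private values with opaque keys; return (redacted_doc, vault)."""
    redacted = dict(document)
    vault = {}
    for field in private_fields:
        value = str(document[field])
        digest = hmac.new(secret, f"{field}:{value}".encode(), hashlib.sha256).hexdigest()
        key = "PRIV_" + digest[:16]
        vault[key] = document[field]
        redacted[field] = key
    return redacted, vault


def resolve_tool_args(tool_args, vault):
    """Swap opaque keys back to real values just before the tool is invoked."""
    return {name: vault.get(value, value) for name, value in tool_args.items()}


if __name__ == "__main__":
    doc = {"name": "Ada", "passport": "X1234567", "destination": "Oslo"}
    redacted, vault = isolate_private_fields(doc, ["passport"], secret=b"demo-secret")
    # The model only ever sees `redacted`; its tool call carries the key.
    model_tool_call = {"passport_number": redacted["passport"], "to": "Oslo"}
    print(resolve_tool_args(model_tool_call, vault))
```

Since the model never receives the raw value, an attack query has nothing to extract from its context, while the task query can still succeed because the tool layer holds the vault.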

[AI-25] ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection

Link: https://arxiv.org/abs/2606.18988
Authors: Jinhao Song,Shan Liang,Yiqun Yue,Zhuhuayang Zhang,Tianqi Gao
Subjects: Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 figures

Abstract:Multimodal deception detection is critical for identifying fraudulent intentions, yet existing approaches predominantly rely on end-to-end black-box paradigms. These methods suffer from a severe lack of interpretability, failing to provide transparent reasoning trajectories and struggling to explicitly capture the subtle cross-modal inconsistencies inherent in deceptive behaviors. To transcend these limitations, we propose ThinkDeception, a novel and interpretable multimodal deception detection framework. As a pioneering effort, it introduces Multimodal Large Language Models (MLLMs) into this domain, transforming deception detection from a traditional binary classification task into an explicit cognitive reasoning process. Facilitated by the first meticulously annotated step-by-step multimodal Chain of Thought (CoT) dataset, we develop a foundational model, ThinkDeception Base, empirically validating the critical role of modal inconsistency in decoding deception. Building upon this foundation, our core innovation lies in proposing Visual-Audio Consistency Group Relative Policy Optimization (VAC-GRPO) equipped with a progressive training strategy. Distinct from standard GRPO, we stratify the training data into four progressive difficulty tiers, guiding the model through a psychologically grounded easy-to-hard cognitive transition. By coupling this dynamic curriculum scheduler with a multi-dimensional, process-aware reward mechanism and a reflective learning paradigm, we significantly elevate the model’s overall reasoning quality. Extensive experiments on mainstream benchmarks demonstrate that ThinkDeception establishes a new SOTA, significantly outperforming existing methods in both detection accuracy and rationale quality. Ultimately, this work drives the field of deception detection toward interpretable, multimodal cognitive reasoning.

[AI-26] CAPRA: Scaling Feedback on Software Architecture Deliverables with a Multi-Agent LLM System

Link: https://arxiv.org/abs/2606.18976
Authors: Marco Becattini, Niccolò Caselli, Matteo Minin, Roberto Verdecchia, Enrico Vicario
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted for publication at the 38th International Conference on Software Engineering Education and Training

Abstract:Automated assessment in software engineering education has advanced significantly for code grading and essay scoring. However, reviewing software architecture deliverables, which requires analyzing structural completeness and requirements traceability, has not yet been fully automated. Applying Large Language Models (LLMs) to this task requires robust architectures to ensure technical feedback is accurate and reliable for students. This paper presents CAPRA (Configurable Architecture Proficiency Report Assessment), a multi-agent LLM system that analyzes software architecture deliverables to generate personalized, template-compliant LaTeX feedback. As a core design choice, CAPRA coordinates multiple specialized agents and employs a Python-based microservice for multi-modal document extraction, utilizing PyMuPDF and vision-enabled LLMs (specifically gpt-4o) to parse text and UML diagrams. To ensure educational reliability and mitigate hallucinations, CAPRA introduces a deterministic Evidence Anchoring step using fuzzy matching via normalized Levenshtein distance, along with a ConsistencyManager agent that cross-verifies, deduplicates, and merges findings. System performance is assessed using a structured eight-criterion binary evaluation taxonomy covering: (i) extraction completeness, (ii) feature validation, (iii) issue grounding and severity detection, (iv) recommendation specificity and traceability, and (v) template and tone compliance. A preliminary empirical evaluation on 10 student reports shows that CAPRA satisfied 88.8% of the evaluated criteria under a strict two-rater aggregation rule, achieved moderate inter-rater agreement with human evaluators (kappa = 0.582), and processed each report in slightly over 4 minutes. While these results support the viability of LLM-supported architectural feedback, human oversight remains essential for subjective assessment dimensions.
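
The abstract describes Evidence Anchoring only as deterministic fuzzy matching via normalized Levenshtein distance. A minimal Python sketch of that general idea follows. The function names, the sliding-window search, the whitespace and case normalization, and the 0.8 threshold are all assumptions made for illustration, not details taken from CAPRA.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """1 minus the length-normalized Levenshtein distance, in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def anchor_evidence(quote: str, document: str, threshold: float = 0.8):
    """Return (score, matched_window) if the quote is grounded in the document, else None.

    Hypothetical helper: slides a window of the quote's length over the
    normalized document and keeps the most similar window.
    """
    quote = " ".join(quote.lower().split())
    text = " ".join(document.lower().split())
    if not quote or not text:
        return None
    best_score, best_window = 0.0, ""
    for start in range(max(1, len(text) - len(quote) + 1)):
        window = text[start:start + len(quote)]
        score = similarity(quote, window)
        if score > best_score:
            best_score, best_window = score, window
    return (best_score, best_window) if best_score >= threshold else None
```

A finding whose quoted evidence returns `None` could then be dropped or flagged for review. The search is quadratic in document length, which is acceptable for short reports but would need indexing for larger inputs.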

[AI-27] RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models

Link: https://arxiv.org/abs/2606.18950
Authors: San Kim, Daechul Ahn, Reokyoung Kim, Hyeonbeom Choi, Seungyeon Jwa, Jonghyun Choi
Subjects: Artificial Intelligence (cs.AI)
Comments: First two authors contributed equally

Abstract:Modern Vision-Language Models (VLMs) often struggle with strategic reasoning, i.e., anticipating and influencing other agents’ actions, under uncertainty in competitive and cooperative settings. Real-time strategy (RTS) games can be a natural testbed for diagnosing this limitation, as they demand coordination with allies, adaptation to opponents’ strategies, and long-horizon planning under partial observability. However, existing RTS benchmarks offer limited evaluation scope, lack systematic competency diagnosis, and remain fixed to their pre-designed scenario coverage. To address these limitations, we present RTSGameBench, which is built on Beyond All Reason, a large-scale RTS game with an expanded battlefield that demands broader strategy diversity than existing testbeds. The proposed benchmark provides evaluation through diverse gameplay across various matchup structures, diagnostic assessment via mini-games that each target an individual strategic competency, and extensible coverage via a self-evolving generation framework that converts free-form queries into new mini-games, improving over successive cycles. Additionally, for VLMs to operate in large-scale RTS games, we provide RTSGameAgent, which manages units with an FSM and agentic memory. We empirically validate that multiple state-of-the-art VLMs do not perform well when matchups demand tighter coordination, when multi-agent coordination is required, and when task scale increases.

[AI-28] SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety

Link: https://arxiv.org/abs/2606.18936
Authors: Linghao Feng, Yinqian Sun, Dongqi Liang, Sicheng Shen, Chenfei Yan, Yuxuan Peng, Yilin Zhao, Haibo Tong, Kai Li, FeiFei Zhao, Yi Zeng
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:Large language models (LLMs) are increasingly embedded in AI for Science (AI4Science) workflows, from scientific question answering and literature analysis to laboratory planning and autonomous discovery. This progress creates an urgent need for safety benchmarks that evaluate not only scientific competence, but also whether models recognize and avoid risks in high-stakes scientific contexts. Existing AI4Science safety datasets cover several disciplines and task formats, leaving the underlying risk dimensions underspecified. We introduce SciRisk-Bench, a benchmark designed to evaluate AI4Science safety from two complementary perspectives: explicit risk dimensions and scientific disciplines. SciRisk-Bench covers 7 disciplines, 31 subdisciplines and 10 risk dimensions. In the experimental section, we evaluate both mainstream LLMs and science-oriented LLMs across risk dimensions, disciplines, and sub-disciplines, enabling fine-grained diagnosis of where scientific models remain unsafe.

[AI-29] Skill-Guided Continuation Distillation for GUI Agents

Link: https://arxiv.org/abs/2606.18890
Authors: Zhimin Fan, Hongwei Yu, Yeqing Shen, Haolong Yan, Guozhen Peng, Tianhao Peng, Yudong Zhang, Xiaowen Zhang, Kaijun Tan, Zheng Ge, Xiangyu Zhang, Daxin Jiang
Subjects: Artificial Intelligence (cs.AI)

Abstract:Improving GUI agents typically relies on behavior cloning on expert trajectories. However, as the current policy deviates from the expert policy, it inevitably encounters policy-induced off-trajectory states during closed-loop execution, i.e., states that fall outside the expert trajectories. Since expert trajectories provide no demonstrations for these unseen states, such states receive no effective supervision, leaving the policy unable to select the correct action. To close this supervision gap, we propose Skill-Guided Continuation Distillation (SGCD), an iterative self-improvement framework. SGCD first runs the plain policy without skill guidance for a few steps to reach realistic off-trajectory states. From these states, a skill-guided policy then completes the task and produces successful continuations, which are mixed with expert trajectories to supply supervision over policy-induced off-trajectory states. The skills are extracted from both successful and failed rollouts, consisting of Continuation Plans, Critical Targets, Failure Traps, and Success Criteria. On OSWorld-Verified, SGCD improves the success rate of three base models from the low-30% range to over 50%, demonstrating its effectiveness and generality.

[AI-30] Generative-Model Predictive Planning for Navigation in Partially Observable Environments

Link: https://arxiv.org/abs/2606.18888
Authors: Thomas Quilter, Yifan Zhu, Guorui Quan, Mingfei Sun, Samuel Kaski
Subjects: Artificial Intelligence (cs.AI)

Abstract:Navigation in partially observable environments presents a significant challenge for autonomous agents, requiring effective decision-making with limited sensory information in unknown environments. Belief-based methods, particularly those using neural networks to approximate the belief space, often fail to capture the inherent multimodality of belief spaces, especially in high-dimensional cases with perceptual aliasing. While generative models present a compelling alternative, they typically require substantial data or expert demonstrations and lack explicit mechanisms for long-term planning. In this paper, we introduce BeliefDiffusion, a novel framework that combines the benefits of both generation and planning. BeliefDiffusion leverages diffusion models to explicitly characterize multimodal belief distributions and utilizes Model Predictive Control (MPC) to plan ahead. It consists of two steps: (1) imagining plausible environment configurations based on the observation history and (2) planning efficient navigation strategies across the aggregated configurations. Through extensive experiments in synthetic map environments, we demonstrate that BeliefDiffusion significantly outperforms both model-free reinforcement learning baselines and other generative approaches in navigation success rate and path efficiency. Our results validate that explicitly incorporating multimodal belief representations into planning enables more robust navigation in partially observable settings.

[AI-31] Domain-Shift Aware Neural Networks for Unbalance Characterization in Rotating Systems

Link: https://arxiv.org/abs/2606.18882
Authors: Bernardo Feijó Junqueira, Claudio Kiyoshi Umezu, Bruno Bilhar Karaziack, Tomaz Junior, Daniel Alves Castello
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)

Abstract:This work investigates the application of a domain-shift aware neural network for regression tasks aimed at estimating unbalance masses in rotating shafts under varying operating conditions. Experimental data were collected from a test rig in which a primary shaft, equipped with a flange carrying unbalanced masses, was driven at different rotational speeds, while a secondary shaft could be optionally activated to introduce domain discrepancy. The unbalance masses were positioned at a fixed radial distance, and the dynamic response of the system was recorded using triaxial accelerometers. The inverse problem of mass estimation is formulated within a domain adaptation framework, where the network is trained with a maximum mean discrepancy strategy to align feature representations across source and target distributions. The results demonstrate the effectiveness of explicitly addressing domain shift in improving prediction accuracy, especially when the system’s physical behavior and sources of domain discrepancy are not fully known and fall outside the training conditions. These findings highlight the potential of domain-shift aware models for regression tasks in Structural Health Monitoring.
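
The abstract mentions training with a maximum mean discrepancy (MMD) strategy to align source and target features. As a rough illustration of the quantity involved, the sketch below computes the standard biased estimate of squared MMD with a Gaussian (RBF) kernel in plain Python. The kernel choice, the bandwidth, and the function names are assumptions; the abstract does not state which kernel or estimator the authors use.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    squared_distance = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def mmd_squared(source, target, bandwidth=1.0):
    """Biased estimate of MMD^2 between two samples of feature vectors.

    MMD^2 = E[k(s, s')] + E[k(t, t')] - 2 * E[k(s, t)]
    """
    def mean_kernel(a, b):
        total = sum(rbf_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))


# Toy features: the "target" domain is the source shifted by a constant offset.
source_features = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]]
shifted_features = [[x + 2.0, y + 2.0] for x, y in source_features]
```

In a domain adaptation setup, a term like `mmd_squared(source_features, target_features)` would typically be added to the regression loss so that the feature extractor is penalized when the two domains produce distinguishable feature distributions.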

[AI-32] Externalizing Research Synthesis and Validation in AI Scientists through a Research Harness

Link: https://arxiv.org/abs/2606.18874
Authors: Zijian Wang, Hanqi Li, Ziyue Yang, Zijian Hu, Shenghan Zuo, Yunzhe Zhang, Da Ma, Danyu Luo, Chenrun Wang, Jing Peng, Tiancheng Huang, Sijia Guo, Huayang Wang, Zichen Zhu, Senyu Han, Yilu Cao, Kai Yu, Lu Chen
Subjects: Artificial Intelligence (cs.AI)
Comments: 65 pages, 14 figures, 19 tables

Abstract:AI systems can increasingly automate scientific workflows, but the reasoning that links prior evidence, generated ideas, experiments and final claims often remains implicit inside model inference. Here we introduce Xcientist, a research harness that externalizes research synthesis and experimental validation into inspectable, contract-governed processes. Xcientist organizes literature evidence, idea states, implementation plans, ablation records and repair traces as persistent research artifacts, so that generated mechanisms can be grounded, executed, tested and revised without losing their evidential basis. We identify claim drift as a failure mode of automated research, where runnable artifacts no longer support the mechanism originally claimed. Across training-free memory systems, graph-structured traffic forecasting and multi-scale physics-informed neural networks, Xcientist preserves traceable trajectories from problem formulation to mechanism design, validation and bounded revision. These results suggest that AI scientists should be evaluated not only by their final artifacts, but by whether their synthesis and validation processes remain attributable, inspectable and scientifically accountable.

[AI-33] Scaling Learning-based AEB with Massive Unlabeled Data IROS

Link: https://arxiv.org/abs/2606.18864
Authors: Xiangyu Wang, Yang Zhan, Mengxiang Hao, Chuanchuan Zhong, Yansong Jia, Junjie Zhang, Yu Han, Xin Jiang, Zhen Cao, Ying Wang, Yulun Song, Zhitao Xu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted for presentation at the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

Abstract:This paper studies how to scale learning-based automatic emergency braking (AEB) with massive unlabeled fleet data under production constraints. Our approach is based on meta-feedback semi-supervised learning (MF-SSL), where a teacher generates pseudo labels for unlabeled driving data and is updated using a small labeled anchor set as safety-critical feedback. In production, anchor ambiguity and labeled-unlabeled mismatch can amplify systematic pseudo-label errors, leading to spurious triggers. We propose a stabilized MF-SSL framework with (i) Noise-Aware Decoupling, which removes ambiguity-prone anchors from the teacher’s supervised update path, and (ii) kinematics-gated pseudo-labeling with a teacher conflict penalty to suppress mismatch-induced risk hallucinations on unlabeled data while maintaining broad coverage. Extensive experiments show consistent gains as unlabeled data scale from 1M to 1B windows, improving safety while keeping comfort stable. The 1B-trained student model is deployed to hundreds of thousands of vehicles and validated over 10^9 km of driving, achieving a positive-to-false activation ratio exceeding 100:1 and a 35% improvement in accident-free driving mileage over a production rule-only baseline.

[AI-34] WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents

Link: https://arxiv.org/abs/2606.18847
Authors: Yehang Zhang, Jianchong Su, Haojian Huang, Yifan Chang, Tianhao Zhou, Xinli Xu, Yingjie Xu, Yinchuan Li, Zexi Li, Ying-Cong Chen
Subjects: Artificial Intelligence (cs.AI)
Comments: 27 pages, 18 figures

Abstract:To assist humans over extended periods in real homes, embodied agents must remember user routines, world states, and past interactions. Existing long-term memory benchmarks mainly evaluate language-centric retrieval and question answering, while embodied benchmarks often focus on short-horizon task execution without testing long-term memory use in dynamic environments. We introduce WorldLines, a project-driven benchmark for long-horizon embodied household assistance. It constructs temporally extended household traces with dialogues, actions, execution feedback, object and device state changes, and converts them into evidence-linked samples for Memory QA and Embodied Task Planning. We further propose ObsMem, an observer-grounded memory framework that maintains visibility-aware memories and action-native state trails for state-aware decisions. Experiments reveal persistent challenges in partial observability, overwritten world states, and translating long-term memory into embodied plans, while ObsMem offers a stronger reference architecture for this setting.

[AI-35] Target-confidence Recourse Using tSeTlin machines: TRUST

Link: https://arxiv.org/abs/2606.18832
Authors: K. Darshana Abeyrathna, Sara El Mekkaoui, Nils Enric Canut Taugbøl, Anuja Vats
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Counterfactual explanations are widely used to provide algorithmic recourse in high-stakes decision-making systems. Most existing methods seek the smallest change to an input that flips a model’s decision. However, decision-makers often rely not only on predicted labels but also on confidence thresholds and risk margins. Counterfactuals that barely cross a decision boundary can be fragile and unstable under noise or model variation. In this paper, we propose Target-confidence Recourse Using tSeTlin machines (TRUST), a framework in which users explicitly specify the desired prediction confidence for recourse. Rather than generating counterfactuals and evaluating confidence afterward, TRUST directly searches for minimal changes that satisfy a user-defined confidence target, enabling comparison of recourse options in terms of cost, confidence, and robustness. We instantiate TRUST using a Probabilistic Tsetlin Machine (PTM) combined with Bayesian optimization. The probabilistic clause-based structure of PTM links prediction confidence to the stability of decision rules. We show that counterfactuals satisfying the same rules can still differ substantially in reliability depending on how securely they satisfy those rules, revealing whether decisions are supported by robust or fragile clause activations. Experiments on synthetic and real-world datasets demonstrate that target-confidence counterfactuals produce more robust and interpretable recourse than conventional boundary-based approaches. Across multiple benchmarks, TRUST achieves perfect robustness while maintaining low recourse cost, including an L2 distance of 0.10 on the Haberman dataset at 0.92 confidence. By explicitly controlling confidence and exposing rule-level stability, TRUST provides actionable recourse for high-stakes decision support. 

[AI-36] Space Is Intelligence: Neural Semigroup Superposition for Riemannian Metric Generation

Link: https://arxiv.org/abs/2606.18828
Authors: Chenghao Xu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:Traditional approaches place intelligence in the agent, whether as a learned policy or a search procedure. We instead place intelligence in the space itself: a scene induces a Riemannian metric on the configuration manifold, and action reduces to following the geodesics of that metric rather than invoking a separate planner or collision checker. A single Encoder-Router network realizes this idea through three complementary parameter groups – frame parameters that orient the generators, modulation parameters that govern their spatial propagation, and basic coefficients that determine their strength. These groups combine through a shared semigroup-superposition mechanism to produce a single Riemannian metric field, yielding a compact architecture whose geometry scales naturally with scene complexity. Trained on a single two-obstacle scene, the model demonstrates robust zero-shot generalization across unseen obstacle configurations, with orders-of-magnitude separation between collision-free and obstacle-penetrating path costs.

[AI-37] Maturing Markov Decision Processes: Decision Making under Increasing Information and Shrinking Action Sets

Link: https://arxiv.org/abs/2606.18820
Authors: Jiaxi Liu, Aiping Yang, Yuhang Yang, Shuqi Zhang, Zewei Dong, Jiangming Yang, Xuebin Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 25 pages, 9 figures

Abstract:Sequential decision problems often exhibit an asymmetric evolution of information and decision flexibility: as a decision cycle unfolds, the agent receives richer information while feasible actions expire due to operational cutoffs, commitments, or resource constraints. Standard MDP formulations typically flatten this structure into stage-dependent state descriptions and action masks, thereby obscuring the nested information–action asymmetry that determines which decisions are urgent and which can be deferred. We introduce Maturing Markov Decision Processes (MMDPs), a formulation built around this information–action asymmetry. We characterize one of its key consequences through an expiring-action priority principle, which identifies the actions that must be resolved before the next stage. Motivated by this structure, we develop a structure-aware reinforcement learning framework with stage-aware policy design, expiring-action abstraction, and search-augmented learning with distillation. Experiments on a controlled multi-supplier replenishment problem, simplified cash-management environments of increasing complexity, and a production-scale simulator show that explicitly modeling this asymmetry improves learning efficiency and becomes increasingly valuable as decision problems scale.

[AI-38] Reinforcement Learning Foundation Models Should Already Be A Thing

Link: https://arxiv.org/abs/2606.18812
Authors: Abdelrahman Zighem, Jill-Jênn Vie
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Foundation models for language and vision are powered by internet-scale data, while structured domains (tabular prediction, time-series forecasting, graph learning, reinforcement learning) are not. The substitute is synthetic data, which shifts the burden from collection to prior design. Such priors already exist for many structured tasks: TabPFN and its successors solve tabular classification with a transformer pretrained on a synthetic Bayesian prior. We make two points. First, reinforcement learning is the conspicuous gap: sampling a synthetic MDP is as feasible as sampling a synthetic tabular dataset, yet no in-context RL work treats prior design as a primary objective. Second, MDPs admit a fixed-size sufficient statistic, independent of the episodes observed and tabular in shape, which makes them directly amenable to the attention-based architectures used for tabular foundation models, with a policy head replacing the supervised target. Together these define the agenda for an RL foundation model. As a proof of concept, we train one model entirely on synthetic MDPs and show that, with no task-specific tuning, it solves held-out tabular benchmarks in context, both online and offline: online, in far fewer episodes than UCB-VI and tabular Q-learning, and offline, competitively with VI-LCB.

[AI-39] Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards

Link: https://arxiv.org/abs/2606.18810
Authors: Yingyu Shan, Yuhang Guo, Zihao Cheng, Zeming Liu, Xiangrong Zhu, Xinyi Wang, Jiashu Yao, Wei Lin, Hongru Wang, Heyan Huang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Reinforcement learning with verifiable rewards (RLVR) has driven substantial progress in training LLMs for reasoning tasks, but representative methods such as GRPO assign uniform credit across all tokens, wasting gradient on routine tokens while under-crediting pivotal reasoning steps. Existing token-level credit assignment methods require resources beyond the model’s own rollouts. GRPO variants rely on process reward models or ground-truth answers. Knowledge distillation assigns credit through per-token divergence but requires external teachers (On-Policy Distillation) or privileged information (On-Policy Self Distillation). These dependencies limit applicability in the pure RLVR setting. We observe that conditioning the model on its own verified trajectories induces a measurable per-token KL divergence between the original and conditioned distributions, and prove that distilling from a self-teacher constructed from verified trajectories leads to infeasible weighted-average solutions when multiple verified trajectories exist. We propose SC-GRPO (Self-Conditioned GRPO), which uses this per-token KL divergence as a multiplicative weight on GRPO gradients. Across five benchmarks spanning math, code, and agentic tasks, SC-GRPO consistently outperforms GRPO by 8.1% and DAPO by 5.9%, with stronger OOD performance. Moreover, SC-GRPO achieves higher performance than OPD.
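
To make the idea of a per-token KL weight on a group-relative advantage concrete, here is a small Python sketch. It is only an illustration of the general mechanism named in the abstract: the direction of the KL term, the normalization of the weights by their mean, and every function name are assumptions, and the paper's actual formulation may differ.

```python
import math
from statistics import mean, pstdev


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete next-token distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def group_relative_advantages(rewards):
    """GRPO-style advantage: each rollout's reward standardized within its group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [0.0 if sigma == 0 else (r - mu) / sigma for r in rewards]


def token_weights(original_dists, conditioned_dists):
    """One weight per token position from the KL between the two distributions.

    Assumption: weights are rescaled to average 1 so the overall gradient
    scale stays comparable to uniform credit.
    """
    kls = [kl_divergence(p, q) for p, q in zip(original_dists, conditioned_dists)]
    average = mean(kls)
    if average == 0:
        return [1.0] * len(kls)
    return [k / average for k in kls]


def weighted_token_advantages(advantage, weights):
    """Multiply a rollout-level advantage by each token's weight."""
    return [advantage * w for w in weights]
```

Under this sketch, tokens where conditioning on a verified trajectory barely changes the distribution receive almost no credit, while tokens where it changes sharply receive more than the uniform share.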

[AI-40] ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch

Link: https://arxiv.org/abs/2606.18803
Authors: Tengfei Lyu, Zirui Yuan, Xu Liu, Kai Wan, Zihao Lu, Li Ma, Hao Liu
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:Bringing Large Language Models (LLMs) into industrial ride-hailing dispatch as semantic feature extractors over platform-scale behavioral logs is a compelling but under-explored data systems problem. Production matching pipelines remain dominated by structured numerical features, yet decisive behavioral signals (e.g., a driver’s habitual aversion to certain regions) are inherently contextual and naturally expressible as LLM-generated user profiles. However, scaling such profiling to a live, millisecond-latency dispatcher faces three intertwined constraints rarely addressed together: on a platform with millions of daily orders, logs exceed any LLM’s context window by orders of magnitude; most users are long-tail, with too few interactions for per-user profiling; and surface-fluent profiles do not necessarily improve downstream prediction utility. We present ProfiLLM, an agentic LLM data pipeline that operationalizes utility-aligned user profiling for production matching systems through two modules. (1) Tool-Augmented Global Knowledge Mining equips an LLM agent with 27 analytical tools to mine platform-scale data, producing reusable global knowledge, adaptive user clustering rules, and region-level supply-demand priors. (2) Utility-Aligned Profile Exploration generates multiple candidate profiles per cluster, evaluates them via a lightweight downstream utility proxy, iteratively refines the best candidates and constructs preference pairs for DPO fine-tuning. Deployed on DiDi’s production dispatcher, ProfiLLM achieves up to +6.14% relative AUC improvement in outcome prediction, up to +4.35% GMV gain in dispatching simulation, and consistent improvements in a 14-day online A/B test including +0.47% GMV, +0.33% Completion Rate, and -0.82% Cancel-Before-Accept rate.

[AI-41] Closing the Loop: PID Feedback Control for Interpretable Activation Steering in Symbolic Music Generation ICML2026 ICML

Link: https://arxiv.org/abs/2606.18790
Authors: Ioannis Prokopiou, Pantelis Vikatos, Maximos Kaliakatsos-Papakostas, Theodoros Giannakopoulos, Themos Stafylakis
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at Learning to Listen: ICML 2026 Workshop on Machine Learning for Audio (43rd International Conference on Machine Learning - ICMLMLA26), 4 pages main (11 total), 2 figures

Abstract:Transformer-based architectures have significantly advanced the generation of complex symbolic sequences, yet a significant gap remains in achieving fine-grained, interpretable control over discrete signal attributes. This paper investigates the mechanistic interpretability of the Multitrack Music Transformer (MMT) and proposes a framework for deterministic attribute modulation without retraining to bridge this gap via inference-time activation steering. Utilizing the Difference-in-Means (DiffMean) methodology, we isolate latent directions for signal attributes, specifically Pitch and Duration, within the residual stream. We validate the Linear Representation Hypothesis in this domain, achieving high correlation between steering magnitude and attribute shift. To address the inherent feature entanglement in multi-attribute steering, we introduce a Dual Steering framework utilizing Gram-Schmidt Orthogonalization. Experimental results demonstrate that this geometric decoupling reduces conceptual interference and signal degradation compared to naive vector addition, enabling independent deterministic control even against strong autoregressive conditioning.
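
The two geometric ingredients named in the abstract, a Difference-in-Means direction and Gram-Schmidt orthogonalization of a second direction against the first, can be sketched in a few lines of Python. The toy two-dimensional "activations", the function names, and the additive steering rule are assumptions for illustration; they do not reproduce the paper's model or its hook points in the residual stream.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def diff_mean(positive, negative):
    """Difference-in-Means direction: mean(positive activations) - mean(negative activations)."""
    return [a - b for a, b in zip(mean_vector(positive), mean_vector(negative))]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthogonalize(vector, against):
    """Gram-Schmidt step: remove from `vector` its projection onto `against`."""
    scale = dot(vector, against) / dot(against, against)
    return [a - scale * b for a, b in zip(vector, against)]


def steer(activation, directions_with_strengths):
    """Add each steering direction, scaled by its strength, to an activation."""
    steered = list(activation)
    for direction, strength in directions_with_strengths:
        steered = [s + strength * d for s, d in zip(steered, direction)]
    return steered


# Toy data: activations for high vs. low examples of two attributes.
pitch_direction = diff_mean([[2.0, 0.0], [4.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
duration_direction = diff_mean([[1.0, 1.0], [3.0, 3.0]], [[0.0, 0.0], [0.0, 0.0]])
duration_decoupled = orthogonalize(duration_direction, pitch_direction)
```

After orthogonalization, steering along `duration_decoupled` no longer moves the activation along `pitch_direction`, which is the decoupling effect that naive addition of the two raw directions lacks.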

[AI-42] R2D-RL: A RoboCup 2D Soccer Environment for Multi-Agent Reinforcement Learning

Link: https://arxiv.org/abs/2606.18786
Authors: Haobin Qin, Baofeng Zhang, Hidehisa Akiyama, Keisuke Fujii
Subjects: Artificial Intelligence (cs.AI)
Comments: Code is available at: this https URL

Abstract:Robot soccer is a challenging testbed for multi-agent reinforcement learning because it combines partial observability, cooperative and adversarial interaction, sparse rewards, and long-horizon tactical behavior. RoboCup 2D Soccer Simulation (RCSS2D) provides a mature robot-soccer platform, but its competition-oriented server-client architecture is difficult to use directly with modern Python-based MARL workflows. We introduce R2D-RL, a reinforcement learning environment that connects RCSS2D and HELIOS-based player clients to a Python MARL interface through shared-memory communication and cycle-level synchronization. R2D-RL supports full-field and scenario-based training with configurable opponents, Base discrete and Hybrid parameterized action spaces, action masks, expected possession value (EPV)-based reward shaping, and parallel execution. We provide front-goal scenarios and an 11-vs-11 full-field benchmark, together with baseline results.

[AI-43] Bayesian Anytime Pareto Set Identification for Multi-Objective Multi-Armed Bandits

Link: https://arxiv.org/abs/2606.18785
Authors: Lennert Saerens, Bram Silue, Eleni Litsa, Peter Vrancx, Pieter Libin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 26 pages, 13 figures

Abstract:Identifying Pareto optimal solutions is critical to support multi-objective decision-making. We introduce the first anytime Multi-Objective Multi-Armed Bandit algorithm for the Pareto Set Identification problem, taking a Bayesian approach: Top-Two Pareto Front Thompson Sampling (TTPFTS). We benchmark TTPFTS against state-of-the-art fixed-budget Pareto Set Identification algorithms on synthetic environments. Next, we demonstrate its practical utility in a challenging multi-objective molecular discovery setting by efficiently exploring an ultra-large synthesis-on-demand molecular library. Furthermore, we introduce a novel uncertainty quantification metric that estimates our algorithm’s confidence in the predicted Pareto set. We demonstrate that this metric effectively proxies true performance, yielding a robust methodology for monitoring learning progress in complex settings. Finally, we complement these empirical findings with a theoretical proof of the algorithm’s asymptotic correctness.
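
For readers unfamiliar with Pareto Set Identification, the sketch below shows the underlying notion of dominance and how a single posterior draw yields a candidate Pareto set, which is the basic building block of Thompson-sampling approaches. It is not the TTPFTS algorithm: the independent Gaussian posteriors, the maximization convention, and all names are assumptions made for illustration.

```python
import random


def dominates(a, b):
    """True if `a` is at least as good as `b` on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_set(means):
    """Indices of arms whose mean reward vectors are not dominated by any other arm."""
    return [i for i, candidate in enumerate(means)
            if not any(dominates(other, candidate)
                       for j, other in enumerate(means) if j != i)]


def sampled_pareto_set(posteriors, rng):
    """Draw one mean vector per arm from its posterior and return that draw's Pareto set.

    `posteriors` holds, for each arm, a list of (mean, std) pairs, one per objective.
    """
    draw = [[rng.gauss(mu, sd) for mu, sd in arm] for arm in posteriors]
    return pareto_set(draw)


arm_means = [[1.0, 5.0], [3.0, 3.0], [5.0, 1.0], [2.0, 2.0]]
```

Repeating `sampled_pareto_set` many times and counting how often each arm appears gives a simple, generic picture of how uncertain the current Pareto set estimate is.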

[AI-44] Private Learning with Public Feature Conditioning ICML2026

Link: https://arxiv.org/abs/2606.18773
Authors: Shuli Jiang, Walid Krichene, Nicolas Mayoraz
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Proceedings of the 43rd International Conference on Machine Learning (ICML 2026). 26 pages, 9 figures

Abstract:We study differentially private (DP) regression in settings where each data sample includes public, non-sensitive features – common in applications such as recommendation and advertising systems. While such label-DP or semi-sensitive-feature settings have been primarily explored in the context of classification, effective approaches for regression remain underexplored. We introduce Cond-DP, a conditioned variant of DPSGD that leverages the structure of public feature matrices to improve optimization under privacy constraints. Motivated by the observation that these public features often exhibit rapidly decaying spectra, Cond-DP incorporates a data-driven conditioning matrix to reshape the optimization landscape and accelerate convergence. We provide convergence guarantees for convex, strongly convex, and non-convex settings, and recover standard DPSGD as a special case when the conditioning matrix is the identity. We show how to construct an effective conditioning matrix for Cond-DP directly from public features, enabling provably faster convergence than DPSGD in private linear regression without incurring additional privacy cost. Empirically, Cond-DP with this conditioning matrix consistently outperforms state-of-the-art baselines across a wide range of datasets and model architectures under label DP, demonstrating strong and robust performance in practice.

[AI-45] Generating Natural and Expressive Robot Gestures through Iterative Reinforcement Learning with Human Feedback using LLMs

Link: https://arxiv.org/abs/2606.18747
Authors: Chris Lee, Flora Salim, Benjamin Tag, Francisco Cruz
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 Pages, 6 Figures

Abstract:Expressive gestures are essential for natural and effective communication, complementing speech when verbal cues alone are insufficient (e.g., pointing). For social robots such as the humanoid Pepper, producing natural and expressive movements is critical for improving human-robot interaction (HRI) and long-term acceptance. However, generating gestures remains challenging due to reliance on expert-authored animations, resulting in rigid behaviors that are impractical for dynamic and diverse environments. Alternatively, machine learning approaches often struggle to capture perceived naturalness, becoming increasingly challenging with more degrees of freedom. Consequently, producing expressive robot gestures requires a system that can adapt to the environment while adhering to social norms and physical constraints. Recent advances in large language models (LLMs) enable dynamic code generation, offering new opportunities for runtime gesture synthesis from natural language. In this paper, we integrate ChatGPT into the humanoid robot Pepper to generate co-speech gestures aligned with conversational output. While this baseline enables flexible gesture generation, the resulting motions are often perceived as stiff and unnatural. To address this limitation, we introduce an iterative reinforcement learning with human feedback (RLHF) system that finetunes gesture generation based on user evaluations, leveraging an iterative user study to compare Pepper’s generated gestures. Our results show that RLHF improved the LLM’s co-speech generative capabilities, producing more expressive, relevant and fluid movements.

[AI-46] What Must Generalist Agents Remember?

Link: https://arxiv.org/abs/2606.18746
Authors: Khurram Yamin, Namrata Deka, Maitreyi Swaroop, Albert Ting, Jeff Schneider, Bryan Wilder
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper develops a formal account of what generalist agents must store in memory in order to act near-optimally across multiple environments and goals. It shows that when two domains share an observational bottleneck but require incompatible optimal actions, any uniformly near-optimal policy must induce distinct memory distributions at that bottleneck. The result yields a separation theorem: sufficiently successful agents cannot rely only on current state observations, but must preserve domain-relevant information in memory. The paper further shows that if an agent’s memory contains enough information to estimate values for related goals, then that memory can be used to approximately reconstruct the agent’s local transition dynamics. Together, these results characterize memory as the substrate that supports domain disambiguation, transition-model reconstruction, and planning for generalist agents.

[AI-47] SWE-Future: Forecast-Conditioned Data Synthesis for Future-Oriented Software Engineering Agents

Link: https://arxiv.org/abs/2606.18733
Authors: Qiao Zhao, JianYing Qu, Jun Zhang, Yehua Yang, Hanwen Du, Zhongkai Sun
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Realistic coding-agent benchmarks often replay public GitHub issues and pull requests, making them vulnerable to overlap with model pretraining, fine-tuning, synthetic-data generation, or benchmark-driven model selection. Fully synthetic tasks avoid direct historical replay, but can drift away from real repository needs. We propose SWE-Future, a forecast-conditioned data synthesis method for future-oriented coding tasks. Given a forecast snapshot at time T_0, the method uses only pre-T_0 repository evidence to forecast future feature implementation/enhancement, bugfix, and refactor task families. We first validate this forecasting step retrospectively: after forecasts are fixed, later pull requests are used only to measure whether the predicted task families match future repository work. In an 80-repository study, the forecaster achieves 58.1% future-work relevance under the main semantic matching metric. We then use validated forecast families as conditioning signals to synthesize a 200-task coding-agent dataset across 61 repositories from a task-generation snapshot, rather than replaying the later pull requests used for validation. SWE-Future shows that repository-evolution forecasts can guide realistic, future-oriented coding-task synthesis while reducing direct dependence on historical pull-request replay.

[AI-48] Two-Phase Bilevel Search for the Moving-Target Traveling Salesman Problem with Moving Obstacles

Link: https://arxiv.org/abs/2606.18730
Authors: Allen George Philip, Anoop Bhat, Sivakumar Rathinam, Howie Choset
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Combinatorics (math.CO); Optimization and Control (math.OC)
Comments:

Abstract:The Moving-Target Traveling Salesman Problem (MT-TSP) seeks a minimum cost trajectory for an agent that departs from a static depot, visits a set of moving targets, each within one of their assigned time windows, and returns to the depot. In this article, we study the Moving-Target Traveling Salesman Problem with Moving Obstacles (MT-TSP-MO), a generalization of the MT-TSP where the agent trajectory must avoid moving obstacles. We present a Mixed-Integer Conic Programming (MICP) formulation that can be solved using off-the-shelf solvers, as well as a fast and scalable Two-Phase Bilevel Search (TPBS) algorithm that computes high-quality feasible solutions for the problem. We evaluate our approaches against an existing baseline algorithm on a broad range of problem instances with up to 40 targets and 40 obstacles. The results demonstrate that both the proposed methods significantly outperform the baseline with respect to success rates, solution costs, and computation time.

[AI-49] Graph Grounded Cross Attention Transformer Neural Network for Structurally Constrained Full Event Sequence Generation in Predictive Process Monitoring

Link: https://arxiv.org/abs/2606.18726
Authors: Fang Wang, Ernesto Damiani
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 40 pages

Abstract:Structurally constrained event sequence generation remains challenging because generated paths must preserve transition feasibility, temporal order, termination, and attribute consistency. In predictive process monitoring (PPM), this challenge appears as full event sequence generation, whereas existing work mainly addresses component tasks such as next activity, remaining time, outcome, and attribute prediction. This paper proposes the Graph Grounded Cross Attention Transformer Neural Network (GGATN) for this unified PPM task. GGATN uses a global process graph as structured activity memory, contextualizes sequence positions through Transformer self attention, and injects process topology through graph grounded cross attention. Unlike autoregressive decoding, GGATN generates activities, timestamps, length, and event level and sequence level attributes in a single pass, followed by Viterbi style graph constrained decoding for feasible paths and explicit termination. Experiments on six benchmark event logs show more reliable generation quality than local instruction prompted LLM baselines. GGATN achieves strong performance on sequence similarity, Damerau Levenshtein similarity, bigram based control flow similarity, and duration distribution, while maintaining zero hallucinated activities and zero sequence level attribute inconsistency. Ablation analyses confirm the global graph encoder as a stable structural prior. Interpretability analyses show how graph structure, sequence context, feedback refinement, and constrained decoding shape generation.
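
The abstract mentions Viterbi-style graph-constrained decoding without giving details. The sketch below shows one generic way such decoding can work, assuming a fixed sequence length, per-position log-scores for each activity, and a set of feasible transitions taken from a process graph. The function name, the score values, and the toy graph are hypothetical and are not the paper's implementation.

```python
import math


def constrained_viterbi(scores, allowed, start, end):
    """Best-scoring path from start to end that only uses feasible transitions.

    scores[t][a] is the log-score of activity a at position t;
    allowed is a set of (a, b) pairs that the process graph permits.
    """
    n_steps, n_act = len(scores), len(scores[0])
    neg_inf = -math.inf
    best = [[neg_inf] * n_act for _ in range(n_steps)]
    back = [[None] * n_act for _ in range(n_steps)]
    best[0][start] = scores[0][start]
    for t in range(1, n_steps):
        for b in range(n_act):
            for a in range(n_act):
                if best[t - 1][a] > neg_inf and (a, b) in allowed:
                    cand = best[t - 1][a] + scores[t][b]
                    if cand > best[t][b]:
                        best[t][b] = cand
                        back[t][b] = a
    if best[-1][end] == neg_inf:
        return None  # no feasible path of this length
    path = [end]
    for t in range(n_steps - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Toy graph: 0 = start, 1 = A, 2 = B, 3 = end
allowed = {(0, 1), (1, 2), (2, 3), (1, 3)}
scores = [
    [0.0, -9.0, -9.0, -9.0],
    [-9.0, -1.0, -0.1, -9.0],   # B scores highest here, but start -> B is infeasible
    [-9.0, -0.2, -1.5, -9.0],
    [-9.0, -9.0, -9.0, 0.0],
]
print(constrained_viterbi(scores, allowed, start=0, end=3))
```

Taking the per-position argmax would give the sequence start, B, A, end, which violates the graph; the constrained decoder returns the feasible path start, A, B, end.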

[AI-50] Leveraging Energy Features for Surface Classification with Deep Learning: A Comparative Analysis Across Three Independent Datasets

Link: https://arxiv.org/abs/2606.18698
Authors: Alexander Belyaev, Oleg Kushnarev
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The energy-based method remains a comparatively underexamined approach for surface classification in mobile robotics, despite promising results in constrained environments. This study evaluated the viability of using energy-derived features as either a standalone classification modality or as supplementary input to inertial data. A comprehensive evaluation was conducted across three publicly available datasets, comparing the performance of modern deep learning architectures including recurrent neural networks, convolutional neural networks, encoder-only transformers, and Mamba state-space models, under automated hyperparameter tuning and input sequence length optimization. The models achieved higher accuracy than previously reported values on all evaluated datasets, with the convolutional neural network yielding the highest overall performance. When relying exclusively on energy-based features, the models attained classification accuracies in the range of 85-90%, approximately 5-10% lower than those achieved when combined with inertial features (96-99%). Augmenting inertial data with energy features resulted in a consistent mean accuracy improvement of 1-2%. These findings indicate that classifiers relying solely on energy features offer sufficient accuracy for standalone deployment, while also providing a consistent gain when used in combination with other sensing modalities.

[AI-51] Dual-Channel Grounded World Modeling (DCGWM): Structural Prevention of Objective Interference Collapse via Heterogeneous External Grounding with Inward-Only Gradient Flow

Link: https://arxiv.org/abs/2606.18688
Authors: Akshay Hazare
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Position paper. Experimental validation in progress

Abstract:Joint Embedding Predictive Architectures (JEPAs) are a leading approach to world model representation learning. We identify a failure mode in JEPA-based world models grounded against two qualitatively distinct external signals: physical dynamics (sparse, high-magnitude, constraint-satisfying gradient corrections) and social-behavioral dynamics (diffuse, distribution-matching corrections). We term this Objective Interference Collapse (OIC): we argue that joint learning in a shared latent space causes the dominant channel to systematically collapse the subordinate channel’s representational subspace, in a manner not resolvable by loss weighting alone. We propose Dual-Channel Grounded World Modeling (DCGWM), designed to structurally prevent OIC through a partitioned latent space (physical subspace Z_p, behavioral subspace Z_b) with inward-only gradient flow. A Physical Grounding Channel updates only Z_p via VICReg-style alignment to physical measurements; a Social-Behavioral Grounding Channel updates only Z_b via alignment to trajectories from an emergent multi-agent simulation. An Inter-Channel Interface Module couples the subspaces at the task level without cross-subspace gradients. An Asymmetric Grounding Adherence Loss penalizes rollout drift with a hard hinge for physical violations and a soft KL for behavioral divergence. A Generative Rendering Layer is architecturally isolated from the latent world model. We present three theoretical results: the partition removes the gradient-interference pathway implicated in OIC; each grounded subspace inherits anti-collapse guarantees from its alignment objective; and generative isolation is necessary under a stated assumption on the generative objective’s geometry. This manuscript establishes the problem formulation and architecture; experimental validation is ongoing and will be reported in a future revision.

[AI-52] Bounded Context Management for Tabular Foundation Models on Stream Learning ICML2026 ICML

Link: https://arxiv.org/abs/2606.18677
Authors: Jinmo Lee, Doyun Choi, Moongi Choi, Jaemin Yoo
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted as a spotlight oral (top 5%) at the 2nd ICML Workshop on Foundation Models for Structured Data (FMSD@ICML2026)

Abstract:Tabular stream learning requires predictions on sequentially arriving examples under distribution shift. While standard methods adapt by updating model states, tabular foundation models (TFMs) make predictions conditioned on a labeled context in an in-context manner, making them a natural alternative for stream learning. This shifts the challenge from how to update the model to how to manage the context. We propose a future information view that yields three practical requirements for context management: preserve recent examples, retain uncertain examples, and remove redundant examples. We instantiate these requirements as CURE (Context management via Uncertainty-aware admission and Redundancy aware Eviction), a context-managing policy with entropy-gated admission and redundancy-aware eviction. Across seven streams, CURE shows up to 27.0% relative improvement over classical stream learners, remains robust across multiple TFM backbones, and ranks first among other policy variants. Code and datasets are available at this https URL.

[AI-53] scGTN: Deep Siamese Graph Transformer Network for Single-cell RNA Sequencing Clustering IJCAI2026

Link: https://arxiv.org/abs/2606.18672
Authors: Jinke Wu, Yifan Wang, Siyu Yi, Caiyang Yu, Ziyue Qiao, Nan Yin, Jiancheng Lv, Wei Ju
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
Comments: Accepted by Proceedings of the Thirty-Fifth International Joint Conference on Artificial Intelligence (IJCAI 2026)

Abstract:Single-cell RNA sequencing (scRNA-seq) plays a pivotal role in characterizing gene expression at the cellular level, enabling the identification of cell types and advancing the understanding of cellular heterogeneity. Despite the significant progress in scRNA-seq data clustering, we argue that current methods always ignore the sparsity and noise, as well as the complex intercellular structural information inherent in scRNA-seq data. Toward this end, in this paper, we propose a novel single-cell RNA-seq clustering framework via deep Siamese Graph Transformer Network (termed scGTN), which explicitly integrates gene expression profiles and intercellular structural dependencies for cell clustering. In particular, we formulate scRNA-seq data as a graph and construct two augmented graph views that serve as dual views to capture complementary intercellular information. Then, a Siamese graph transformer network is employed to explicitly incorporate shortest-path information and node-wise distances for capturing richer structural relationships between cells. Finally, we employ an optimal transport strategy to guide the cell clustering in a self-supervised manner. Extensive experiments on multiple benchmark scRNA-seq datasets demonstrate that our scGTN consistently outperforms existing methods. Our code is available at this https URL.

[AI-54] NeuralMUSIC: A Hybrid Neural-Subspace Framework for Robot Sound Source Localization

Link: https://arxiv.org/abs/2606.18664
Authors: Yizhuo Yang, Junqiao Fan, Shenghai Yuan, Lihua Xie
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reliable sound source localization is fundamental to robot audition, enabling autonomous robots to perceive spatial cues and operate effectively in dynamic environments. Classical methods such as Multiple Signal Classification (MUSIC) offer strong theoretical foundations but degrade under low signal-to-noise ratios. While deep learning-based approaches achieve promising performance, they often struggle with limited generalization across conditions. To address these challenges, we propose NeuralMUSIC, a hybrid neural-subspace framework for robotic sound source localization. Specifically, a neural network first estimates the spatial covariance matrix from multichannel microphone observations. The predicted covariance is then integrated into a classical MUSIC pipeline with eigenvalue decomposition (EVD) and pseudo-spectrum computation, followed by a Frequency Attention Fusion (FAF) module to produce the final DOA estimates. To improve data efficiency, we further introduce a Self-supervised Spatial Correlation Learning (SSCL) strategy that leverages unlabeled acoustic data to capture spatial structure. Extensive experiments across different robotic tasks demonstrate that NeuralMUSIC achieves competitive localization accuracy while exhibiting improved robustness and cross-domain generalization.
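
The classical MUSIC stage that the predicted covariance feeds into can be sketched as follows. This is a narrowband, single-frequency version for a uniform linear array with half-wavelength spacing, which is an assumed geometry. The neural covariance estimator and the Frequency Attention Fusion module from the abstract are not modelled, and the toy covariance is built from a known steering vector rather than from microphone data.

```python
import numpy as np


def steering_vector(theta_deg, n_mics, spacing_wavelengths=0.5):
    """Far-field steering vector of a uniform linear array (assumed geometry)."""
    theta = np.deg2rad(theta_deg)
    k = np.arange(n_mics)
    return np.exp(-2j * np.pi * spacing_wavelengths * k * np.sin(theta))


def music_pseudo_spectrum(R, n_sources, grid_deg, spacing_wavelengths=0.5):
    """Narrowband MUSIC pseudo-spectrum computed from a spatial covariance matrix R."""
    n_mics = R.shape[0]
    _, eigvecs = np.linalg.eigh(R)                       # eigenvalues come back in ascending order
    noise_subspace = eigvecs[:, : n_mics - n_sources]    # eigenvectors of the smallest eigenvalues
    spectrum = np.empty(len(grid_deg))
    for i, theta in enumerate(grid_deg):
        a = steering_vector(theta, n_mics, spacing_wavelengths)
        proj = noise_subspace.conj().T @ a
        # Small constant avoids division by zero at an exact source direction
        spectrum[i] = 1.0 / (np.real(np.vdot(proj, proj)) + 1e-12)
    return spectrum


# Toy covariance: one source at 20 degrees plus white noise
n_mics = 6
a0 = steering_vector(20.0, n_mics)
R = np.outer(a0, a0.conj()) + 0.01 * np.eye(n_mics)
grid = np.arange(-90.0, 91.0, 1.0)
spec = music_pseudo_spectrum(R, 1, grid)
print(grid[np.argmax(spec)])
```

The pseudo-spectrum peaks where the steering vector is orthogonal to the noise subspace, so the quality of the result depends entirely on the covariance estimate. That is the quantity the abstract proposes to predict with a neural network.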

[AI-55] EffiNav: Fusing Depth and Vision-Language for Efficient Object Goal Navigation

Link: https://arxiv.org/abs/2606.18634
Authors: Zecheng Yin, Benedict Jun Ma
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Locating a target object while exploring an unknown environment is a fundamental capability for autonomous agents, with applications ranging from search-and-rescue to field robots. A simplified version of this task is Object Goal Navigation (ObjNav). In ObjNav, successful arrival at the target object provides a basic measure of performance; however, the efficiency of the navigation trajectory is equally important, as it indicates how intelligently the agent explores and how much time remains for subsequent tasks. In unknown environments, the key to efficient navigation lies in deciding where to explore next. While many prior works aim to address this core challenge and have achieved promising performance in certain settings, recent training-based models and non-training frameworks still suffer from generalization and efficiency issues, respectively, which in the worst cases can lead to excessive exploration of already-visited areas or redundant back-and-forth motion. We evaluate EffiNav on two widely used simulation benchmarks, Habitat Matterport 3D (HM3D) and Open-Vocabulary Object goal Navigation (OVON), and further validate its effectiveness on physical robots in real-world settings. We conduct failure analysis on massive simulation episodes. With minimal modification, we also extend EffiNav to a memory-augmented ObjNav task on the GOAT-BENCH dataset, demonstrating its adaptability beyond standard ObjNav settings. Across two standard metrics, Success Rate (SR) and Success weighted by Path Length (SPL), EffiNav matches or outperforms recent baselines, reflecting its efficiency, robustness, and practical applicability. Given the different emphases of the two datasets, the results indicate that this framework is more balanced and generalizable for efficient ObjNav.
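
The two metrics named in the abstract can be computed as in the sketch below, which uses the commonly cited definition of SPL: each successful episode contributes the ratio of the shortest path length to the longer of the shortest and actual path lengths. The episode tuples are made-up values for illustration, and the code assumes every shortest path length is positive.

```python
def success_rate(episodes):
    """Fraction of episodes in which the agent reached the target."""
    return sum(1 for success, _, _ in episodes if success) / len(episodes)


def success_weighted_path_length(episodes):
    """SPL over episodes given as (success, shortest_path_length, agent_path_length)."""
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            # A detour lowers the score; a failed episode contributes zero
            total += shortest / max(taken, shortest)
    return total / len(episodes)


# Hypothetical episodes: optimal success, success with a 2x detour, failure
episodes = [(True, 5.0, 5.0), (True, 5.0, 10.0), (False, 4.0, 12.0)]
print(success_rate(episodes), success_weighted_path_length(episodes))
```

In this toy run two of three episodes succeed, but SPL is lower than SR because the second episode took a path twice as long as necessary.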

[AI-56] Code-Augur: Agentic Vulnerability Detection via Specification Inference

Link: https://arxiv.org/abs/2606.18619
Authors: Zhengxiong Luo, Mehtab Zafar, Dylan Wolff, Abhik Roychoudhury
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Abstract:The advent of agentic vulnerability detection is already becoming a watershed moment for software security. Audits conducted entirely by autonomous LLM agents are uncovering critical vulnerabilities in fundamental software underpinning digital society. Many of these vulnerabilities remained masked for years, surfacing only now with AI agents. Yet the reasoning behind these discoveries remains alarmingly opaque and unvalidated. What assumptions did the agent make about a function’s inputs when it deemed that function to be secure? Failures in reasoning and incorrect assumptions can lead to missed vulnerabilities and reduce trust in agentic analysis. We propose a security-specification-first paradigm that (1) exposes the agent’s tacit assumptions explicitly as security specifications and (2) continuously refines those specifications via runtime falsification. We realize our approach in Code-Augur, a novel harness for agentic vulnerability detection. Given a codebase, Code-Augur analyzes each component of the system for vulnerable code. When it deems a component to be secure, it commits the local invariants behind that judgment as in-source assertions. In parallel, Code-Augur leverages a guided fuzzer to attempt to falsify those assumptions. When the fuzzer triggers an assertion, this either reveals a genuine vulnerability or a flawed specification to refine. In both cases, this process grounds the agent’s understanding, aligning its view of code intent with how the code actually behaves. On real-world subjects, Code-Augur effectively leverages security specifications to detect more vulnerabilities than other state-of-the-art agents. Additionally, Code-Augur found 22 new vulnerabilities in key open-source projects. Compared to curated specialized models like Claude Mythos, Code-Augur offers effective agentic vulnerability detection built on widely available LLMs like Sonnet and DeepSeek. 
ACM classes: D.2.5; D.2.4; D.4.6

[AI-57] AI-Driven Assessment of Human Tutors: Linking Training Performance to Real-Life Practice

Link: https://arxiv.org/abs/2606.18617
Authors: Danielle R. Thomas, Marie Cynthia Abijuru Kamikazi, Clara Brandt, Conrad Borchers, Kenneth R. Koedinger
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Full research paper accepted at EC-TEL 2026

Abstract:There exist numerous tutor training platforms. However, few provide AI-driven training and evaluation for human tutors based on real-life performance. We present an AI-driven system that assesses both open responses during training and authentic real-life tutoring. Unlike platforms that only assess learning through online training or simulations, our system utilizes Generative AI (Gemini-2.5-pro) to analyze transcriptions of authentic tutoring, measuring the transfer of tutor skills to real-life application. Human tutors instructing students remotely in math (N=86) completed six scenario-based lessons, averaging a significant 7.4% learning gain. Using mixed-effects models across 405 session-to-lesson pairs, we found that training performance significantly predicted real-life transcript scores with an effect size of 0.25 SD. Model comparison (AIC/BIC) indicated averaging open response and multiple choice performance during training predicted real-life tutor performance best, although open responses were comparatively more predictive. Exploratory analysis showed that after training, tutors were significantly more likely to encounter pedagogical opportunities to apply their skills (61.1% to 68.9%) and demonstrated higher execution quality within those opportunities (65.5% to 68.1%). Interrupted time series analysis suggested that these tutor improvements were part of a gradual trend over time rather than an immediate intervention effect of training. We illustrate an AI-driven method to link tutor training with real-life assessment. In doing so, we contribute open datasets, AI prompts, and scoring rubrics to support transparency and reproducibility.

[AI-58] QC-GAN: A Parameter-Efficient Quaternion Conformer GAN for High-Fidelity Speech Enhancement INTERSPEECH2026

Link: https://arxiv.org/abs/2606.18611
Authors: Shogo Yamauchi, Hideaki Tamori, Makoto Sakai, Yosuke Yamano, Tohru Nitta
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 10 pages, 6 figures and 5 tables. Accepted at Interspeech2026

Abstract:We propose a parameter-efficient speech enhancement framework, Quaternion Conformer GAN (QC-GAN), which combines a Quaternion Conformer generator with MetricGAN-based training. The Hamilton product encodes the magnitude and phase via structured weight sharing, reducing the number of layer parameters while preserving their interdependencies. A metric-learning discriminator was employed to maximize perceptual quality by optimizing the approximate perceptual evaluation scores. On the VoiceBank+DEMAND dataset, QC-GAN achieved a Perceptual Evaluation of Speech Quality (PESQ) score of 3.48 with only 0.89M parameters, delivering a performance comparable to state-of-the-art models at less than half their size. A 35K-parameter variant achieved a PESQ score of 3.23, surpassing conventional methods with significantly fewer parameters. Evaluation on the DNS-Challenge 3 dataset further confirmed generalization to real-world conditions.

[AI-59] MIDS: Detecting Stealthy Masquerade and Tampering Attacks on CAN Bus via Bidirectional Mamba

Link: https://arxiv.org/abs/2606.18599
Authors: Qiqi Liu, Runhan Song, Lei Cui, Heng Zhang, Yuyan Sun, Limin Sun
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:The Controller Area Network (CAN) protocol is the primary communication standard for Electronic Control Units (ECUs) in modern vehicles, but its lack of encryption and authentication exposes it to a range of security threats. Existing intrusion detection systems are largely tuned to fabrication-style attacks (DoS, fuzzing, ID spoofing realised by frame injection), in which detection signals such as per-ID inter-arrival statistics are readily available. We instead address the harder masquerade setting, in which an internal adversary substitutes a legitimate frame in-situ at its original transmission slot, preserving traffic periodicity and rendering traffic-statistic defences ineffective. We propose the Mamba Intrusion Detection System (MIDS), an innovative dual-stream framework that processes CAN identifiers and payloads in parallel and reconstructs their joint temporal semantics through bidirectional selective state-space modelling. To evaluate MIDS, we collected over 100 million CAN frames from a physical Tesla Model 3 across three driving regimes and synthesised 54 masquerade attack variants spanning ID-only, data-only, and combined modifications. MIDS attains an F1 of 96.94% on this dataset, exceeding the strongest reproducible baseline by more than 8 percentage points, while sustaining a 1.147 ms single-window inference latency – ample headroom for real-time onboard deployment. To verify generalisation, we further evaluate MIDS on four public benchmarks (ROAD, CrySyS, OTIDS, CT&T) covering both masquerade and injection scenarios; MIDS attains F1 from 93.70% to 99.61%, outperforming the strongest of eight reproduced baselines by up to 13.94 percentage points under a unified 5-fold protocol.

[AI-60] Optimizing Lithium Production Decisions under Geological Demand and Pricing Uncertainties: A POMDP Framework for Multi-Objective Decision Making

Link: https://arxiv.org/abs/2606.18598
Authors: Anna C. Edmonds, Mansur M. Arief, Robert J. Moss, Mykel J. Kochenderfer, Jef Caers
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 24 pages, 14 tables, 4 figures

Abstract:Decision making in lithium production is challenging, whether from an investor’s perspective or a strategic production standpoint. Determining which mines to open and when to open them involves not only geological and price uncertainties, but also complexities around the choice of extraction method, from direct lithium extraction to hard rock mining. Prior work explored models of this problem and different methods to optimize mining decisions; these models did not account for uncertainty in pricing, uncertainty in demand, or different mining technologies to extract lithium. Incorporating different pricing models and extraction technology into these models enables more robust strategies for determining not only when and where to open a mine, but also which method of production to pursue. We frame the problem as a partially observable Markov decision process (POMDP) and solve it using belief-state planning methods to optimize decision making. In our study, we show that POMDP solvers outperform human-inspired heuristics by dynamically adapting to shifting lithium price regimes (static, linear, exponential, and stochastic) through belief-state planning and explicit uncertainty management. By optimally sequencing exploration, production, and technology choice, the framework achieves higher demand fulfillment and more balanced economic and environmental outcomes over the project’s lifetime across all pricing and deposit scenarios.
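
Belief-state planning rests on a Bayesian belief update after each action and observation. The sketch below shows the standard discrete update, b'(s') ∝ O(o | s', a) · Σ_s T(s' | s, a) · b(s). The two-state deposit model, the single exploration action, and the 80% sensor accuracy are hypothetical values chosen for illustration, not taken from the paper.

```python
def belief_update(belief, action, observation, T, O):
    """Discrete Bayes filter: predict with T[action][s][s2], then correct with O[action][s2][obs]."""
    n = len(belief)
    predicted = [sum(T[action][s][s2] * belief[s] for s in range(n)) for s2 in range(n)]
    unnormalized = [O[action][s2][observation] * predicted[s2] for s2 in range(n)]
    z = sum(unnormalized)
    if z == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    return [u / z for u in unnormalized]


# States: 0 = poor deposit, 1 = rich deposit. Action 0 = explore (state does not change).
T = {0: [[1.0, 0.0], [0.0, 1.0]]}
# Observations: 0 = weak signal, 1 = strong signal; the survey is right 80% of the time.
O = {0: [[0.8, 0.2], [0.2, 0.8]]}

prior = [0.5, 0.5]
posterior = belief_update(prior, action=0, observation=1, T=T, O=O)
print(posterior)
```

A planner then chooses actions based on the updated belief rather than on a single assumed state, which is how the framework can weigh further exploration against committing to production.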

[AI-61] Benchmarking Action Spaces in Reinforcement Learning for Vision-based Robotic Manipulation

Link: https://arxiv.org/abs/2606.18594
Authors: Seyed Alireza Azimi, Homayoon Farrahi, Abhishek Naik, Colin Bellinger, A. Rupam Mahmood
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 9 pages with references

Abstract:In real-world reinforcement learning (RL), the choice of action space can play a key role in shaping motion smoothness, safety, and overall task performance. In this study, we evaluate pose increment, pose velocity, joint position increment, and joint velocity across two vision-based manipulation tasks: object picking and pushing. We train policies in simulation and deploy them to the real world using sim-to-real transfer. We find that action-space representation indeed significantly affects sim-to-real performance. In particular, we find that the joint velocity action space is best for the vision-based picking and pushing tasks in terms of smoothness and final task performance. We also provide practical guidance for RL practitioners in choosing action spaces for both simulation and real-world experiments.

[AI-62] Correcting Sensor-Induced Distribution Drift with Wasserstein Adversarial Learning

Link: https://arxiv.org/abs/2606.18561
Authors: Saraa Ali, Vladimir Bocharnikov, Fedor Ratnikov, Mikhail Hushchyn, Artem Ryzhikov, Denis Derkach
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This is a preprint sent to Nuclear Science and Techniques journal

Abstract:The quality of recorded data depends on the stability of the sensor system that acquires it. Sensor motion and aging can degrade the performance and stability of downstream data-driven methods. We present a Wasserstein-GAN-inspired approach for unsupervised inference of physically interpretable transformation parameters that map a changed detector response distribution back to a nominal reference distribution. In contrast to standard generative modeling, the generator is used as a learnable calibration transformation whose trainable weights represent the sought parameters, while the critic provides a distributional distance signal via the Wasserstein objective. We validate the approach on a tracking-detector toy model with controlled layer shifts and demonstrate its application on high-granularity Geant4-simulated calorimeter data with cell-wise aging effects. The method recovers aging coefficients for individual cells with correlation to ground truth and improves agreement between calibrated and reference energy-sum distributions, while exhibiting the expected degradation at increasing channel-to-channel noise levels. These results indicate that adversarial distribution matching can serve as a data-driven component of calibration strategies in settings where direct labels for degradation parameters are unavailable.

[AI-63] DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

Link: https://arxiv.org/abs/2606.18557
Authors: Patrick Cooper, Alvaro Velasquez
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: 33 pages, 14 figures, 23 tables. Dataset: this https URL ; code and evaluation harness: this https URL

Abstract:A rule-based logic solver resolves every instance in our benchmark in under 50 microseconds with 100% accuracy; the best frontier language model reaches 65% at best and drops to 23.5% under rendering-robust evaluation (worst case over four surface renderings). We introduce DeFAb (Defeasible Abduction Benchmark), a dataset and generation pipeline that converts four decades of publicly funded knowledge bases into formally grounded instances for defeasible abduction: constructing hypotheses that explain anomalies by overriding defaults while preserving unrelated expectations. Because every hypothesis must pass polynomial-time checks for valid derivation, conservativity, and minimality, DeFAb makes logical rigor the instrument for measuring creativity and theoretical reasoning, scoring the disciplined construction of theory revisions rather than fluent but theory-destroying prose. The pipeline pairs taxonomic hierarchies (OpenCyc, YAGO, Wikidata) with behavioral property graphs (ConceptNet, UMLS) to produce 372,648+ instances across 33.75M materialized rules from 18 sources, in three levels with polynomial-time verifiable gold standards. Four frontier models do not reliably internalize defeasible reasoning: rendering-robust Level 2 accuracy is 7.8-23.5%; chain-of-thought variance (~36 pp) exceeds any inter-model gap; and a matched contamination control isolates a +19.4 pp Level 3 gap. We further release DeFAb-Hard (a 235-instance Level 3 difficulty variant; best model 53.3% vs 100% symbolic) and CONJURE (a kernel-verified transformative-creativity variant of 560 Lean 4/Mathlib instances whose gold answers are definitions the proof kernel did not previously contain, judge-free verifier; a pilot finds zero novel concepts). The same verifier doubles as an exact reward for preference optimization (DPO, RLVR/GRPO). Released under MIT at this https URL.

[AI-64] Engagement Intensity as a Learner-Modeling Signal for Adaptive AI Ethics Instruction

Link: https://arxiv.org/abs/2606.18548
Authors: Yongkyung Oh, Lynn Talton, Alex Bui
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:Adaptive AI ethics instruction in graduate research training benefits from intake measures that reflect differences in prior LLM experience. Prior coursework or workshop attendance is an obvious candidate, but it is not clear whether it is associated with pre-instruction ratings on key AI perception items. We compare three candidate intake features, self-reported usage frequency, self-rated LLM familiarity, and prior AI education, across five baseline perception outcomes in 93 bioscience graduate and postdoctoral trainees enrolled in a required research ethics course. Usage frequency shows Holm-corrected associations with all five outcomes, self-rated familiarity with three, and prior AI education with none. A threshold-like pattern at the lower end of the scale is most visible for training interest and accuracy trust rather than appearing as a uniform gradient across all five outcomes. In a short intake survey, reported LLM use is more consistently associated with these perceptions than prior coursework or workshops, with self-rated familiarity serving as a secondary indicator. These results suggest that simple pre-instruction behavioral signals can inform lightweight intake profiling for adaptive AI ethics education.
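
The Holm correction mentioned in the abstract can be illustrated with a minimal sketch of the standard Holm step-down procedure. The p-values and the `holm_reject` helper below are hypothetical and are not taken from the paper.

```python
def holm_reject(p_values, alpha=0.05):
    """Return a list of booleans: True where Holm's step-down procedure rejects."""
    m = len(p_values)
    # Visit hypotheses from the smallest to the largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared with alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, every larger p-value is retained
    return reject


# Hypothetical p-values for five outcomes (illustrative only).
example_p = [0.001, 0.012, 0.030, 0.004, 0.200]
decisions = holm_reject(example_p)
print(decisions)  # [True, True, False, True, False]
```

With these made-up values, a plain Bonferroni threshold of 0.05 / 5 = 0.01 would reject only two hypotheses, while Holm's sequential thresholds reject three.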

[AI-65] AI Sandboxes: A Threat Model Taxonomy and Measurement Framework

Link: https://arxiv.org/abs/2606.18532
Authors: Inderjeet Singh, Haitham Mahmoud, Andrés Murillo
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Robotics (cs.RO); Software Engineering (cs.SE)
Comments: 50 pages, 8 figures, 10 tables

Abstract:AI systems are increasingly evaluated in bounded environments that combine isolation, simulation, instrumentation, supervision, and evidence capture. For physical AI, AIoT, and cyber-physical systems, this shift is not a matter of terminology: the system under test may sense, decide, actuate, communicate, and fail through physical processes, networked devices, and human operators. This article develops an assurance-oriented account of AI sandboxes as controlled environments for testing, evaluation, verification, and validation across digital AI, embodied autonomy, and cyber-physical deployments. We formalize the sandbox boundary and a weakest-link rule for composing per-dimension evidence into a bounded deployment claim; separate major sandbox archetypes; define a cyber-physical threat model that includes attacks on the assurance apparatus itself; and introduce a measurement framework spanning fidelity, controllability, observability, containment, reproducibility, and governance artifacts, instantiated on three worked case studies of real sandboxes. The resulting threat model, taxonomy, and measurement framework clarify what a sandbox can validly test, which risks it can contain, and what forms of evidence it can support for safety, security, and regulatory assurance.
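
One plausible reading of the weakest-link rule is that the overall claim is bounded by the lowest-scoring assurance dimension. The sketch below assumes exactly that (a simple minimum). The paper's formalization may differ, and the scores and the `bounded_claim` helper are hypothetical.

```python
def bounded_claim(evidence):
    """Return (weakest dimension, its score) for a dict of per-dimension scores in [0, 1]."""
    weakest = min(evidence, key=evidence.get)
    return weakest, evidence[weakest]


# Dimension names follow the abstract; the numbers are invented for illustration.
scores = {
    "fidelity": 0.8,
    "controllability": 0.9,
    "observability": 0.7,
    "containment": 0.95,
    "reproducibility": 0.85,
    "governance": 0.6,
}
dimension, level = bounded_claim(scores)
print(dimension, level)  # governance 0.6
```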

[AI-66] Sparsity Curse: Understanding RLVR Model Parameter Space from Model Merging KDD2026

Link: https://arxiv.org/abs/2606.18521
Authors: Chenrui Wu, Zexi Li, Jiajun Bu, Jiangchuan Liu, Haishuai Wang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by KDD 2026

Abstract:Reinforcement Learning with Verifiable Reward (RLVR) has emerged as a powerful post-training paradigm that surpasses Supervised Fine-Tuning (SFT) in eliciting reasoning intelligence and resisting catastrophic forgetting. Recent studies further reveal that RLVR induces highly sparse and off-principal parameter updates compared to SFT. This naturally raises the question: does such sparsity make RLVR models more amenable to model merging? If so, model merging would offer a scalable, training-free path to aggregate diverse reasoning capabilities from independently trained RLVR models. Surprisingly, we find the opposite, uncovering a sparsity curse: the sparse RLVR updates are spread farther apart in parameter space, forming near-orthogonal shortcuts that make aggregation inherently fragile. This is likely rooted in the stochasticity of RL optimization and the diversity of emergent reasoning patterns. Unlike SFT models that converge to shared, flat basins and merge naturally, RLVR models suffer severe degradation under standard merging methods. Through systematic empirical analysis of the update geometry, we characterize the mechanisms behind this failure and propose Sensitivity-aware Resolving Merging (SAR-Merging), a merging recipe tailored for the unique structure of RLVR parameter spaces. SAR-Merging resolves conflicts in overlapping update regions via Fisher Information-based sensitivity arbitration, followed by magnitude-aware sparsification and rescaling to preserve fragile reasoning pathways. Experiments on mathematical and coding benchmarks demonstrate that SAR-Merging substantially outperforms existing merging methods on RLVR models, enabling both single-task enhancement and multi-capability fusion.

[AI-67] As You Wish: Mission Planning with Formal Verification using LLMs in Precision Agriculture

Link: https://arxiv.org/abs/2606.18519
Authors: Marcos Abel Zuzuárregui, Stefano Carpin
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Though robotic systems are now being commercialized and deployed in various industries, many of these systems are highly specialized and often require an advanced skill set to operate and ensure they perform as instructed. To mitigate this problem, we recently introduced a mission planner leveraging LLMs to synthesize mission plans in precision agriculture based on mission descriptions provided in natural language. While the system demonstrates impressive performance, it also suffers from the inherent ambiguities of natural language. In this paper, we extend our system to address this issue by introducing multiple feedback loops in the planning architecture that leverage linear temporal logic (LTL) to ensure the mission planning system meets the specifications formulated by the user while still using natural language. To mitigate potential bias, this is achieved by using two different commercial LLMs in charge of the specification and verification subtasks. Through extensive experiments, we highlight the strengths and limitations of integrating mission verification into a fully autonomous pipeline, particularly regarding an LLM’s ability to generate valuable LTL formulas, and show how our proposed implementation addresses and solves these challenges.

[AI-68] PSyGenTAB: A Privacy-Preserving Framework for Synthetic Clinical Tabular Data Generation via Constrained Optimization

Link: https://arxiv.org/abs/2606.18518
Authors: Arshia Ilaty, Hossein Shirazi, Manasi Chitale, Kedar Hegde, Dhanalakshmi Ramesh, Rashmi S. Manjunath, Amir Rahmani, Hajar Homayouni
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages

Abstract:The development of medical AI is constrained by limited access to high-quality clinical data due to institutional silos and strict privacy regulations such as HIPAA and GDPR. Synthetic data generation offers a potential solution, but existing methods lack principled mechanisms to explicitly manage the privacy-utility trade-off, often degrading clinically meaningful patterns or risking patient re-identification. We present PSyGenTAB, a privacy-preserving generative framework that formulates synthetic healthcare data generation as a constrained optimization problem solved using the Augmented Lagrangian Method. By embedding configurable privacy constraints directly into model training, PSyGenTAB enforces minimum privacy thresholds while maximizing clinical data utility. Across multiple clinically motivated benchmarks, PSyGenTAB preserves inter-feature clinical relationships and minority-class diagnostic patterns essential for reliable health AI. Downstream evaluation using Train-on-Synthetic, Test-on-Real and Train-on-Real, Test-on-Synthetic protocols shows that models trained on synthetic data achieve performance comparable to those trained on real patient records. Privacy auditing further demonstrates reduced exact record reproduction and strong resilience to membership inference attacks. These results establish PSyGenTAB as a principled framework for balancing privacy protection and clinical utility in synthetic healthcare data, supporting secure cross-institutional AI development.
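
To give a feel for the Augmented Lagrangian Method named in the abstract, here is a toy sketch on a one-dimensional problem: minimize a stand-in "utility loss" (x - 3)^2 subject to a stand-in "privacy constraint" x - 1 <= 0. The objective, the constraint, and all hyperparameters are assumptions chosen for illustration and have nothing to do with PSyGenTAB's actual model or losses.

```python
def solve_toy_alm(rho=10.0, outer_iters=20, inner_iters=200, step=0.05):
    """Augmented Lagrangian for: min (x - 3)^2  subject to  x - 1 <= 0."""
    x, lam = 0.0, 0.0
    for _ in range(outer_iters):
        # Inner loop: gradient descent on the augmented Lagrangian in x.
        # For an inequality constraint g(x) <= 0 the penalty term contributes
        # max(0, lam + rho * g(x)) * g'(x) to the gradient, and here g'(x) = 1.
        for _ in range(inner_iters):
            grad = 2.0 * (x - 3.0) + max(0.0, lam + rho * (x - 1.0))
            x -= step * grad
        # Outer loop: multiplier update, clipped at zero for an inequality constraint.
        lam = max(0.0, lam + rho * (x - 1.0))
    return x, lam


x_opt, multiplier = solve_toy_alm()
print(round(x_opt, 6), round(multiplier, 4))
```

In this toy problem the unconstrained minimum (x = 3) violates the constraint, so the iterates settle on the boundary x = 1 with a positive multiplier. The multiplier measures how strongly the constraint trades off against the objective.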

[AI-69] MagpieTTS-LF: Inference-Time Long-Form Speech Generation Without Training on Long-Form data

Link: https://arxiv.org/abs/2606.18485
Authors: Subhankar Ghosh, Jason Li, Paarth Neekhara, Shehzeen Hussain, Ryan Langman, Xuesong Yang, Roy Fejgin
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Neural Text-to-Speech (TTS) systems achieve remarkable quality on short utterances but long-form speech generation shows prosodic drift, speaker inconsistencies and sentence boundary artifacts. Existing approaches either compress sequences, increase context length or naively concatenate independently synthesized chunks. We present an inference-time approach called MagpieTTS-LF that enables MagpieTTS to produce coherent long-form speech without model retraining. Our method introduces three key innovations: (1) soft attention priors to guide monotonic alignment while preserving past and future context; (2) a stateful inference algorithm that maintains context across sentence chunks, ensuring prosodic continuity; (3) history-aware text encoding that uses past text for discourse-level prosodic planning. Experiments on long texts show significant improvements in long-range intelligibility, prosodic coherence, speaker consistency, and boundary naturalness compared to other baselines.

[AI-70] Structured Representation Learning with Locally Linear Embeddings and Adaptive Feature Fusion

Link: https://arxiv.org/abs/2606.18469
Authors: Somjit Nath, Jackson J Cone, Derek Nowrouzezahrai, Samira Ebrahimi Kahou
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Published in Transactions on Machine Learning Research (04/2026)

Abstract:Neuroscientific research has revealed that the brain encodes complex behaviors by leveraging structured, low-dimensional manifolds and dynamically fusing multiple sources of information through adaptive gating mechanisms. Inspired by these principles, we propose a novel reinforcement learning (RL) framework that encourages the disentanglement of dynamics-specific and reward-specific features, drawing direct parallels to how neural circuits separate and integrate information for efficient decision-making. Our approach leverages locally linear embeddings (LLEs) to capture the intrinsic, locally linear structure inherent in many environments, mirroring the local smoothness observed in neural population activity, while concurrently deriving reward-specific features through the standard RL objective. An attention mechanism, analogous to cortical gating, adaptively fuses these complementary representations on a per-state basis. Experimental results on benchmark tasks demonstrate that our method, grounded in neuroscientific principles, improves learning efficiency and overall performance compared to conventional RL approaches, highlighting the benefits of explicitly modeling local state structures and adaptive feature selection as observed in biological systems.

[AI-71] What Does the Weight Norm Control in Grokking? Logit-Scale Mediation under Cross-Entropy

Link: https://arxiv.org/abs/2606.18465
Authors: Truong Xuan Khanh
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 10 tables and 4 figures. Code and data to reproduce all numbers, tables, and figures: this https URL

Abstract:Grokking, the delayed jump from memorization to generalization, is usually tied to the weight norm: a smaller norm generalizes sooner. We ask what the norm actually controls. Holding the weight norm fixed by clamping and varying only an output temperature, we slide the grokking delay across its entire norm-induced range under cross-entropy; matching the effective logit scale back to baseline recovers about 85% of the delay at two moduli. Across a grid of norms and temperatures the delay collapses onto the logit scale alone (R^2 = 0.97), with the norm adding 1-2% beyond it. The effect is loss-dependent: under mean-squared error the logit scale is pinned and the norm acts through a different route. A memorization control, a float64 softmax-collapse audit, and a no-LayerNorm transformer point to the same channel. Forking arms from one identical state, the delay follows the held norm value and not the clamp operation, which closes a rescaling-artifact concern. The proximal variable is the logit scale and the softmax saturation it drives; the weight norm is only an upstream handle. All numbers, tables, and figures reproduce from released code and data.
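
The interplay between logit scale and output temperature can be sketched with a plain softmax. The example assumes the simplest possible case, in which scaling the weights by a factor `c` scales the logits by the same factor. That is a simplification and not a description of the paper's transformer setup, and the numbers are made up.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


base_logits = [2.0, 1.0, 0.0]
c = 4.0                                       # hypothetical growth in logit scale
scaled_logits = [c * z for z in base_logits]

p_base = softmax(base_logits)
p_scaled = softmax(scaled_logits)                    # larger scale: more saturated
p_matched = softmax(scaled_logits, temperature=c)    # temperature undoes the scale

print(max(p_base), max(p_scaled), max(p_matched))
```

Under this assumption, the larger logit scale pushes the softmax toward saturation, while a temperature equal to `c` restores the baseline distribution. This is the sense in which a temperature can stand in for a change in norm.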

[AI-72] Veriphi: Attack-Guided Neural Network Verification with Dataset-Dependent Training Methods

Link: https://arxiv.org/abs/2606.18454
Authors: Pratik Deshmukh, Kartik Arya, Vasili Savin
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 17 Pages, 8 Figures

Abstract:We present Veriphi, a GPU-accelerated neural network verification system that combines fast adversarial attacks with formal bound certification using α,β-CROWN methods. Through systematic experiments on MNIST and CIFAR-10 using three training methodologies (standard, adversarial, certified), we demonstrate that training method effectiveness is fundamentally dataset-dependent. Interval Bound Propagation (IBP) achieves 78% certified accuracy on simple MNIST (784 dimensions) but provides negligible certification performance on the more complex CIFAR-10 dataset, where PGD adversarial training dominates with 94% certification at small perturbations. We achieve 5x verification speedup through attack-guided falsification and scale our approach to production-size models (105.8M parameters) for real-world aerospace logistics optimization. Our results challenge the assumption that certified training universally outperforms adversarial training, showing context matters critically for verification strategy selection.
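
For readers unfamiliar with Interval Bound Propagation, the following is a minimal, generic sketch of how an input box is pushed through one linear layer and a ReLU. It is not Veriphi's implementation, and the layer weights and perturbation radius are hypothetical.

```python
def ibp_linear(lower, upper, weights, bias):
    """Propagate the box [lower, upper] through y = W x + b."""
    out_lo, out_hi = [], []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, l, u in zip(row, lower, upper):
            # A non-negative weight maps lower to lower and upper to upper;
            # a negative weight swaps the two roles.
            if w >= 0:
                lo += w * l
                hi += w * u
            else:
                lo += w * u
                hi += w * l
        out_lo.append(lo)
        out_hi.append(hi)
    return out_lo, out_hi


def ibp_relu(lower, upper):
    """ReLU is monotone, so it can be applied to both bounds directly."""
    return [max(0.0, l) for l in lower], [max(0.0, u) for u in upper]


# Hypothetical input, L-infinity radius, and 2x2 layer.
x, eps = [1.0, -1.0], 0.5
W, b = [[1.0, -2.0], [0.5, 1.0]], [0.0, 0.0]

lo, hi = ibp_linear([v - eps for v in x], [v + eps for v in x], W, b)
lo, hi = ibp_relu(lo, hi)
print(lo, hi)  # [1.5, 0.0] [4.5, 0.25]
```

A prediction is certified when the lower bound of the true-class logit exceeds the upper bounds of all other logits. Bounds of this kind are cheap to compute but can become loose as networks grow deeper.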

[AI-73] MR-GGNN: Credit Card Fraud Detection based on Time-Aware Multi-Relational Guided Graph Neural Network

Link: https://arxiv.org/abs/2606.18444
Authors: Rohit Tewari, Shubhankar Shilpi, Navin Chhibber, Devendra Singh Parmar, Sunil Khemka, Piyush Ranjan
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 2025 2nd International Conference on Software, Systems and Information Technology (SSITCON), 7 pages

Abstract:In recent years, credit card fraud detection has faced significant challenges due to highly imbalanced data, evolving fraud patterns, and complex relational structures among transaction entities. To address these issues, this research proposes a novel framework called Time-aware Multi-Relational Guided Graph Neural Network (TMR-GGNN). In particular, TMR-GGNN extends the encoder-decoder Graph Neural Network (GNN) architecture by modeling heterogeneous interactions across customers, merchants, devices, and IPs over temporal windows. The approach constructs a dynamic, multi-relational graph and incorporates a time-aware relational attention mechanism within the encoder to adaptively weigh transaction relevance based on temporal proximity and semantic context. The decoder employs a contrastive learning module to distinguish between real and synthesized transaction patterns while improving the model’s generalization to rare fraud cases. Additionally, to manage severe class imbalance and emphasize discriminative learning, a composite loss function is introduced that combines an Information Noise Contrastive Estimation (InfoNCE)-based contrastive loss with Focal Loss. This integration helps improve fraud identification while mitigating false negatives.
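
A composite of a contrastive InfoNCE term and Focal Loss, as described in the abstract, can be sketched as follows. The mixing `weight`, the temperature `tau`, and the way the two terms are combined are assumptions, since the abstract does not specify them. The focal-loss defaults (gamma = 2, alpha = 0.25) are the commonly used ones.

```python
import math


def focal_loss(p_fraud, label, gamma=2.0, alpha=0.25):
    """Binary focal loss for one example; p_fraud is the predicted fraud probability."""
    p_t = p_fraud if label == 1 else 1.0 - p_fraud      # probability of the true class
    alpha_t = alpha if label == 1 else 1.0 - alpha
    # The (1 - p_t)^gamma factor down-weights examples that are already easy.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def info_nce(sim_positive, sim_negatives, tau=0.1):
    """InfoNCE: negative log-softmax of the positive pair among all candidates."""
    logits = [sim_positive / tau] + [s / tau for s in sim_negatives]
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[0]


def composite_loss(p_fraud, label, sim_positive, sim_negatives, weight=0.5):
    """Hypothetical combination: focal term plus a weighted contrastive term."""
    return focal_loss(p_fraud, label) + weight * info_nce(sim_positive, sim_negatives)


easy = composite_loss(0.9, 1, sim_positive=0.9, sim_negatives=[0.1, 0.0])
hard = composite_loss(0.1, 1, sim_positive=0.1, sim_negatives=[0.9, 0.0])
print(easy, hard)
```

The well-classified, well-separated example yields a much smaller loss than the misclassified one. This is how the focal term concentrates training on hard, rare fraud cases.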

[AI-74] From Specification to Execution: AI Assisted Scientific Workflow Management

Link: https://arxiv.org/abs/2606.18425
Authors: Komal Thareja, Hamza Safri, Rajiv Mayani, Anirban Mandal, Ewa Deelman
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:Scientific workflow management systems (WMS) support scalable and reproducible execution of complex pipelines, but workflow design, implementation, and debugging remain largely manual and require significant expertise. Recent approaches using large language models (LLMs) show promise for workflow generation from natural language, but often rely on direct code synthesis, which limits transparency, reproducibility, and integration with workflow systems. We present an AI-assisted approach to scientific workflow management that combines specification-driven workflow generation, automated debugging, and distributed execution. The method introduces a structured specification phase that separates workflow intent, design, and implementation, allowing validation prior to code generation. We also develop an LLM-based debugging agent that diagnoses and resolves failures across multiple system layers. To support distributed execution and user interaction, we integrate Pegasus, a widely used WMS, with a Model Context Protocol (MCP) layer, providing a unified interface for workflow submission, monitoring, and control. We evaluate the approach using a federated learning workflow for medical imaging, chosen for its parallel, iterative, and dependency-intensive structure. The system generated and executed large-scale workflows with thousands of jobs, reduced debugging effort, and allowed non-expert users to construct workflows with expert-level design patterns. These results indicate that end-to-end AI-assisted workflow generation and execution is feasible, and point toward AI-driven platforms for managing the scientific workflow lifecycle.

[AI-75] Learning-Based Decision Making for Combustion Phasing Control in Multi-Fuel CI Engines with Latent Fuel Reactivity Estimation

Link: https://arxiv.org/abs/2606.18393
Authors: Rajasree Sarkar, Aditya Satish Patil, Arunava Banerjee, Ihsan Berk Altiner, Zongxuan Sun, Kenneth Kim, Chol-Bum Mike Keown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multi-fuel compression-ignition engines offer fuel flexibility but introduce uncertain, time-varying fuel reactivity, represented by cetane number (CN), which complicates cycle-to-cycle combustion-phasing control. This work formulates CA50 regulation under latent CN variation as a partially observable sequential decision problem and systematically evaluates controllers with increasing temporal and representational capacity, including LinUCB, history-augmented contextual bandits, observation-only DDPG, recurrent DDPG, and a proposed GRU-guided RL framework. A Gaussian-process surrogate trained on experimental multi-fuel engine data provides a controlled and reproducible evaluation environment. Results show that myopic and fixed-history bandit methods degrade under CN variation, observation-only RL suffers from latent-state aliasing, and generic recurrence is insufficient when CN evolves rapidly. The proposed framework learns a compact GRU-based representation of fuel reactivity from combustion history and conditions both actor and critic on this estimated signal rather than oracle CN. By training the policy on the same imperfect fuel-reactivity information available at deployment, the controller avoids train-deploy inconsistency in conventional online estimate-then-control pipelines. Across unseen CN trajectories, the policy achieves stable CA50 regulation with mean absolute tracking error below 0.25° CA at the training setpoint, while producing smooth, physically consistent SOI and glow-plug-power actuation. These results show that combustion control under latent, continuously evolving fuel dynamics requires more than standalone estimation or generic recurrence. By aligning fuel-reactivity inference with control policy learning, the proposed framework enables reactivity-aware decision-making using the same estimated state available during deployment.

[AI-76] CaVe-VLM-CoT: An Interpretable Vision-Language Model Framework

Link: https://arxiv.org/abs/2606.18385
Authors: Sneha Rao, Shaina Raza, Dhanesh Ramachandram
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision-Language Models (VLMs) remain prone to hallucinations, producing fluent but visually unfaithful outputs. Existing chain-of-thought and retrieval-augmented methods only partially address this, as they neither enforce step-level citation grounding nor route verification failures back to retrieval for correction. We present CaVe-VLM-CoT, a modular reflection-based agentic-RAG framework that enforces evidence-grounded reasoning through a five-stage closed-loop pipeline: Extractor, Retriever, Solver, Citation Injector, and Verifier, in which detected ungrounded claims trigger structured feedback to the Extractor for targeted re-retrieval. Since no existing framework jointly measures retrieval quality, step-wise citation faithfulness, and cross-modal grounding, we propose a suite of 23 component-wise metrics across all stages, anchored by CaVeScore, a composite metric weighting accuracy, citation precision and recall, attribution, and evidence grounding. Without any architectural or prompt modifications, CaVe-VLM-CoT achieves 87.1% accuracy and 56.6% CaVeScore on ScienceQA, and 55.2% accuracy and 35.7% CaVeScore on MMMU (30 subjects).

[AI-77] Guava: An Effective and Universal Harness for Embodied Manipulation

Link: https://arxiv.org/abs/2606.18363
Authors: Haowen Liu, Xirui Li, Shaoxiong Yao, Peng Shi, Tianyi Zhou, Jia-Bin Huang, Furong Huang, Jiayuan Mao
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Language models trained on large-scale vision-language data have demonstrated strong potential for embodied agents. Harnessing models through embodied tool use offers a promising alternative to end-to-end vision-language-action systems by combining high-level reasoning with external modules for perception, planning, and control. However, it remains unclear what makes an effective harness for embodied manipulation, and to what extent such a harness can unlock embodied capabilities in a wide range of reasoning models. In this work, we present Guava, a harness framework for embodied tool use developed through systematic exploration of the design space of agent workflows, action spaces, and observation spaces. Our study identifies three key ingredients for effective embodied agents: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. To understand whether these design principles are universal even to small models, we develop an end-to-end training pipeline that distills embodied manipulation capabilities into a 4B open-source model using fewer than 2K trajectories collected entirely in simulation. Experimental results in both simulation and real-world environments show performance comparable to frontier proprietary models while exhibiting strong generalization to unseen objects, novel instructions, and long-horizon tasks. Results suggest that a well-designed harness can serve as a scalable, model-agnostic interface for embodied manipulation, enabling strong emergent embodied capabilities in compact open-source models with minimal training data.

[AI-78] SafeClawBench: Separating Semantic Audit-Evidence and Sandbox Harm in Tool-Using LLM Agents

Link: https://arxiv.org/abs/2606.18356
Authors: Yuchuan Tian, Mengyu Zheng, Haocheng Mei, Ye Yuan, Chao Xu, Xinghao Chen, Hanting Chen, Yu Wang
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 32 pages, 5 figures

Abstract:Tool-using language-model agents introduce security failures that go beyond unsafe text: they can disclose protected objects, write persistent memory, send messages, modify databases, or trigger harmful code and tool effects. Existing evaluations often collapse these stages into a single attack success rate, making it difficult to tell whether a model merely agreed with an attacker or actually produced observable harm. We introduce SafeClawBench, a staged benchmark for tool-using agent security with 600 controlled adversarial tasks across six attack families: direct and indirect prompt injection, tool-return injection, memory poisoning, memory extraction, and ambiguity-driven unsafe inference. SafeClawBench reports three separate endpoints: semantic attack acceptance, audit-visible harm evidence, and sandbox-observed tool/state harm. Evaluating five agent endpoints under four prompt-level policies, we find that these endpoints capture different failure modes. Without additional prompt protection, semantic failure rates vary widely across models, from 9.0% to 44.2%. Audited harm evidence is narrower than semantic failure, and under a separate executable protocol some matched task identities produce sandbox harm despite passing the Semantic Core call: in a 12,000-row matched analysis, 291 of 347 observed sandbox harms occur in rows that pass the semantic check. Prompt policies change endpoint outcomes, but their effects depend on both model and protocol. SafeClawBench provides a reproducible framework for comparing agent models and prompt-policy conditions without conflating textual compliance, evidence-supported harm, and executable state changes. The open-source dataset is available at this https URL.

[AI-79] Self-CTRL: Self-Consistency Training with Reinforcement Learning

Link: https://arxiv.org/abs/2606.18327
Authors: Itamar Pres, Laura Ruis, Melat Ghebreselassie, Belinda Z. Li, Jacob Andreas
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 34 pages, 12 figures, includes appendices

Abstract:Language models (LMs) that faithfully describe their own behavior can more easily be audited, understood, and trusted by users. This paper describes Self-Consistency Training with Reinforcement Learning (Self-CTRL), a method that optimizes for consistency between an LM’s self-explanations and behavior on related inputs by updating explanations to better predict behavior or updating behavior to better match explanations. We apply our method in two domains. First, we study a formal probabilistic reasoning task in which LMs must learn to imitate a family of biased samplers and are evaluated on their ability to report the associated biases. We find that consistency training improves the correlation between self-reported and behaviorally-measured latent biases from R^2=0.24 to R^2=0.64 on a set of held-out distributions, matching the generalization of direct ground-truth supervision. Second, we study a constitutional AI domain in which LMs must describe when they will refuse or comply with user requests. Here, Self-CTRL produces rules that faithfully describe the model’s behavior on held-out requests, improving the refusal predictions of a third-party auditor model from 36% to 92%. In the other direction, behavior updates improve alignment, reducing HarmBench failure rate from 15.0% to 0.5% without substantially increasing refusal on harmless prompts. By aligning explanations and behavior, our work provides a general recipe for training AI models to be safer, more transparent, and more controllable.

[AI-80] Agentra: A Supervisable Multi-Agent Framework for Enterprise Intrusion Response

Link: https://arxiv.org/abs/2606.18325
Authors: Raj Patel, Shaswata Mitra, Michele Guida, Stefano Iannucci, Sudip Mittal, Shahram Rahimi
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Enterprise intrusion response still depends on static playbooks and analyst-driven triage, creating delay between alert generation and containment. We present Agentra, a supervisable multi-agent Intrusion Response System (IRS) framework that converts alerts from IDS, EDR, and XDR platforms into structured incident response plans grounded in MITRE ATT&CK, MITRE D3FEND, and NIST CSF 2.0. Agentra decomposes response reasoning across role-scoped agents, validates proposed plans through a bounded Planner–Validator review loop, screens retrieved threat intelligence through a Moderator security gateway, gates actions through an Action Catalog and risk score, and records decisions in an append-only audit log. We evaluate Agentra against a static OASIS CACAO v2.0 cyber-playbook baseline on a 120-event corpus drawn from ThreatHunter-Playbook, Splunk BOTSv3, and DARPA OpTC. The strongest configuration improves FP-aware IRS F1 from 0.61 to 0.84 and restores the projected harmful-action rate to the static baseline level of 0.0% after Planner-only configurations introduce unsafe overreaction. These results indicate that multi-agent response planning can improve ontology-grounded IRS coverage while preserving analyst approval and auditability.

[AI-81] Why SWAVE May Not Be All You Need: A Concept-Evolution Retrospective on Complex-Valued Recurrent Language Models

Link: https://arxiv.org/abs/2606.18324
Authors: Ramprasath Ganesaraja, Swathika N, Sahil Dilip Panse
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:SWave is a complex-valued recurrent language model (169.26M parameters, D=384, L=16, T=2048) trained on FineWeb-Edu using 2xH100 NVL. It was designed around three founding premises: that representing language as complex waves rather than real-valued numbers enables richer information encoding; that a Cayley-parameterised unitary transition provides a mathematical guarantee against state decay or explosion; and that a hidden state which rotates rather than shrinks preserves signal integrity over arbitrarily long contexts. The core of SWave evolved substantially across three development phases. The Resonance Head was found to structurally admit imaginary-channel collapse as a global loss minimum (a failure mode we term cos-domination collapse) and was superseded by an untied head with independent real and imaginary embedding tables from the Phase-Associative Memory (PAM) architecture. This resolved the degenerate minimum and enabled stable 200,000-step training (best-step PPL 22.0 at step 89,861). ComplexNorm and the Wave Propagation Scan proved load-bearing throughout all three phases and were retained to the final architecture. ProtectGatedScan was reframed as a structural prior rather than a learned behaviour. The four multi-scale retention concepts showed no measurable improvement under controlled evaluation and were found non-load-bearing. The ComplexGatedUnit was superseded by a real-valued squared-ReLU channel mixer with fewer parameters. The auxiliary training objectives showed no benefit once structural constraints were resolved. The investigation yields a formal characterisation of cos-domination collapse, a parallel scan with a log-space backward pass for numerical stability, six transferable engineering principles for complex-valued recurrent training, and a plan-to-code traceability methodology for catching structural divergences that conventional test suites miss.
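
The norm-preservation guarantee attributed to a Cayley-parameterised unitary transition can be checked in the simplest possible setting: a single complex state and a 1x1 skew-Hermitian parameter. This scalar reduction and the parameter value are assumptions for illustration only. The actual SWave transition is not specified in the abstract.

```python
def cayley_scalar(theta):
    """Cayley transform of the 1x1 skew-Hermitian 'matrix' A = i * theta."""
    a = 1j * theta
    # (1 - A)^(-1) (1 + A) has modulus 1 for every real theta.
    return (1 + a) / (1 - a)


u = cayley_scalar(0.37)        # hypothetical parameter value
state = 0.6 + 0.8j             # unit-modulus starting state
for _ in range(1000):
    state = u * state          # the state rotates but does not shrink or grow

decayed = 0.99 ** 1000         # contrast: a real-valued decay factor of 0.99

print(abs(u), abs(state), decayed)
```

After 1000 steps the rotated state still has modulus close to 1, whereas repeated multiplication by 0.99 shrinks a state by more than four orders of magnitude.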

[AI-82] SAE Interventions are Unreliable: Post-Intervention Recovery of Suppressed Behavior

Link: https://arxiv.org/abs/2606.18322
Authors: Mingyue Cui, Linghui Shen, Xingyi Yang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Code: this https URL, Project page: this https URL

Abstract:Sparse Autoencoders (SAEs) decompose residual-stream activations into interpretable features. Recent latent-space defenses increasingly rely on these decompositions, assuming that identified “unsafe” SAE features serve as actionable handles for monitoring and intervention. In this paradigm, clamping a specific harmful feature is expected to reliably prevent model misbehavior. However, we show that this success may hide a recoverable failure mode: the clamp may block one visible route to a behavior without eliminating the behavior itself. We formulate this vulnerability as post-intervention recovery, a constrained residual-space optimization problem. Starting from the post-intervention residual state, we optimize residual perturbations to recover the pre-intervention behavior while preserving the post-intervention values of the targeted SAE features. Even under a strong threat model where the intervention remains active throughout optimization and generation, recovery remains possible. To rule out that recovery simply undoes the intervention, we use encoder-orthogonal updates for single-layer interventions and the corresponding feature-map Jacobian in the cross-layer setting. Across TPP, unlearning, IOI, and refusal steering experiments, this stress test reveals recoverable behavior despite successful feature-level intervention. Especially in the safety-critical refusal-steering setting, we achieve a 95.8% recovery rate on valid samples while keeping defended-feature relative drift to 0.131, substantially below suffix-based baselines. A recovery-path attribution analysis further localizes this recovery to the SAE reconstruction residual, the component left unexplained by the SAE. These results expose a gap between feature-level control and behavioral completeness: SAE features can support causal intervention, but controlling them does not guarantee control over the underlying behavior.
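
The idea behind encoder-orthogonal updates can be sketched with plain vectors: if a perturbation has no component along a feature's encoder direction, that feature's pre-activation is unchanged. The single-feature setup, the vectors, and the `project_orthogonal` helper are hypothetical. The paper's actual optimization procedure is not reproduced here.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_orthogonal(update, encoder_dir):
    """Remove from `update` its component along `encoder_dir`."""
    coeff = dot(update, encoder_dir) / dot(encoder_dir, encoder_dir)
    return [u - coeff * w for u, w in zip(update, encoder_dir)]


# Hypothetical residual state, encoder row of the clamped feature, and raw update.
residual = [0.5, -1.0, 2.0]
encoder_dir = [1.0, 2.0, 2.0]
raw_update = [0.3, 0.1, -0.4]

safe_update = project_orthogonal(raw_update, encoder_dir)
new_residual = [r + d for r, d in zip(residual, safe_update)]

before = dot(encoder_dir, residual)       # feature pre-activation (bias omitted)
after = dot(encoder_dir, new_residual)
print(before, after)
```

The residual state moves, yet the targeted feature reads the same value. This illustrates how behavior could in principle change through directions the clamped feature does not observe.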

[AI-83] Ghost Attractor Networks: Basin-Structured Dynamical Decoders for Closed-Loop Sequential Generation

Link: https://arxiv.org/abs/2606.18315
Authors: Tianyu Wang, Ying Wang, Zhihao Liu, Xi Vincent Wang, Lihui Wang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Sequential output generation with large-scale Transformer and diffusion decoders pays a memory cost that grows with sequence length, plus iterative per-step computation. Replacing them with small feed-forward decoders restores efficiency but produces unstructured latent representations that limit closed-loop control: phase-conditioned action generation and cross-step latent carry-over both require a latent geometry with stable basins. This article proposes Ghost Attractor Networks, a theoretically derived dynamical decoder whose latent evolves under a learned potential with drift and produces a basin-attractor structure by construction. Three desiderata (multi-modality, decoder-level single-pass switching, and constant memory) motivate the potential-drift form, and mode transitions arise as saddle-node bifurcations with ghost-attractor escape. A hierarchical phase-space decomposition separates first-order basin convergence from second-order proprioceptive refinement. Empirically, a Ghost trained end-to-end with a behavioral-cloning and contrastive objective exhibits the predicted gradient-flow contraction in its potential, with the gradient norm decaying by 67 percent across five integration steps on 1430 held-out samples. Ghost is evaluated as a robotic action decoder. A 2.3-million-parameter Ghost matches the offline accuracy of a 1.07-billion-parameter Diffusion Transformer at 462 times fewer parameters and 32 times lower latency, and beats five alternative 2M-parameter decoders (MLP, Neural ODE, CVAE, Transformer, 1-step Diffusion) on offline mean squared error by 5.9 to 29 percent. On the LIBERO-10 closed-loop benchmark, phase conditioning on Ghost’s basin-structured latent yields a 13.5 percentage-point success-rate gain over a feed-forward MLP baseline, and persistent-latent ensembling reaches a 95.7 percent final success rate.

[AI-84] Conflict-Aware Retriever Editing for Knowledge Injection Attacks on LLM-Based RAG Systems

Link: https://arxiv.org/abs/2606.18310
Authors: Xinru Liu, Xianglong Zhang, Di Cai, Zhumin Chen, Pengfei Hu, Xin Xin
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Injecting malicious knowledge into retrieval-augmented generation (RAG) systems can manipulate retrieved evidence and mislead downstream generation, posing a serious security threat for AI applications. Existing RAG injection attacks mainly rely on manipulating external knowledge bases, such as crafting a malicious corpus. However, the synthetic text crafted by such data-centric methods could be detectable, leading to the failure of attacks. Beyond corpus manipulation, open-source retrievers are increasingly exposing RAG systems to model-centric attacks. In this paper, we propose conflict-aware retriever editing, i.e., CAREATTACK, a model-centric retriever attack framework for malicious knowledge injection in RAG. Specifically, CAREATTACK consists of two stages: conflict-aware retriever editing and attack-preserving anchor repair. Conflict-aware retriever editing adapts efficient closed-form parameter editing to the dense retrieval model, promoting malicious knowledge above benign competing passages and resolving potential parameter conflicts through graph-based conflict detection and parameter editing projection. Then, attack-preserving anchor repair performs lightweight calibration on the edited retriever to further eliminate the impact on non-target prompts while preserving the attack effectiveness for target prompts. We instantiate CAREATTACK on Qwen3-Embedding-0.6B and BGE-M3, and conduct evaluation on three benchmark datasets. Experimental results demonstrate that our method substantially promotes malicious passages into the retrieved knowledge of RAG systems and can perform attacks for batches of target prompts and passages, given access to the retrieval model parameters. Since most RAG systems are built upon open-source retrieval models, this work reveals a practical attack surface in RAG systems. Code is publicly accessible at this https URL.

[AI-85] SAGE: Retain-Aware Post-Hoc Sanitization of Final Unlearning Vector

Link: https://arxiv.org/abs/2606.18309
Authors: Jingyuan Zhang, Yucheng Bai, Peixi Wen, Zhehao Huang, Zhengbao He, Hanling Tian, Xinwen Cheng, Haiyin Ran, Xiaolin Huang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Model (LLM) unlearning aims to remove undesirable knowledge or behaviors while preserving retained capabilities. Current unlearning methods all involve a trade-off between unlearning and retention. We have found that the retention activation bias can also be used to quantify the damage an unlearning method inflicts on retention, without considering the specific implementation of the unlearning process. This allows us to restore retention performance for any unlearning method using a post-hoc approach. Therefore, we propose a complementary post-hoc setting to sanitize the final update vector without rerunning the original unlearning pipeline. In this setting, we design SAGE, Spectral Activation-GEometry Sanitization, a source-agnostic correction for final unlearning updates. SAGE collects real module inputs from a small retain proxy, extracts their dominant activation geometry, and solves a source-anchored optimization objective in closed form, which suppresses update components aligned with high-energy retained directions while preserving the source method’s forgetting carrier. Across multiple unlearning methods, model scales, and benchmarks, SAGE consistently relieves the retain-forget trade-off, identifying post-hoc sanitization of final vectors as a practical and underexplored axis for machine unlearning.

[AI-86] TRIDENT: Breaking the Hybrid-Safety-Physics Coupling for Provably Safe Multi-Agent Reinforcement Learning

Link: https://arxiv.org/abs/2606.18308
Authors: Zijie Meng, Ziwei Li, Yufei Liu, Zhiyu Li, Jiyuan Liu, Wenhua Nie, Bingcai Wei, Miao Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 4 figures

Abstract:Safe coordination in networked cyber-physical systems forces learning algorithms to simultaneously handle hybrid discrete-continuous actions, hard training-time safety constraints, and physics-governed dynamics. We show that these three features form a directed cycle of biases that defeats any naive composition of off-the-shelf modules, and formalize this as a three-way coupling lemma. We then introduce TRIDENT, the first MARL framework whose three components are co-designed to cancel each leak: a Richardson-Romberg gradient correction reducing Gumbel-Softmax bias from O(tau) to O(tau^2), a Lyapunov-constrained sequential trust-region update enforcing per-iterate feasibility, and a physics-informed residual critic that decomposes value rather than reward. We prove an O~(1/sqrt(K)) convergence rate to a constrained Nash equilibrium and an O(sqrt(K)) cumulative-violation bound. On multi-UAV mobile-edge computing, autonomous intersection management, and a hybrid SMAC variant, TRIDENT cuts training-time violations by 95.5% over MADDPG and 76.3% over MACPO, while improving reward by 13.5% over the strongest unconstrained baseline.
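
The step from O(tau) to O(tau^2) bias is the standard effect of Richardson extrapolation, which the following toy sketch illustrates. It is not the TRIDENT gradient estimator: `toy_estimator` is a hypothetical stand-in whose bias is known in closed form, and `richardson` shows only the generic two-temperature combination.

```python
import math

def richardson(g, tau):
    """Combine estimates at tau and tau/2 so the first-order bias term cancels.

    If g(tau) = g0 + c1*tau + c2*tau**2 + ..., then
    2*g(tau/2) - g(tau) = g0 - (c2/2)*tau**2 + ...
    """
    return 2.0 * g(tau / 2.0) - g(tau)

def toy_estimator(tau):
    # Hypothetical biased estimator: the target value is 1.0 (the tau -> 0 limit),
    # and the bias is tau + tau**2/2 + ... for small tau.
    return math.exp(tau)

def abs_errors(tau):
    raw = abs(toy_estimator(tau) - 1.0)
    corrected = abs(richardson(toy_estimator, tau) - 1.0)
    return raw, corrected

for tau in (0.4, 0.2, 0.1):
    raw, corrected = abs_errors(tau)
    print(f"tau={tau}: raw error={raw:.5f}, corrected error={corrected:.5f}")
```

Halving `tau` roughly halves the raw error but roughly quarters the corrected one, which is the first-order versus second-order behavior the abstract refers to.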

[AI-87] DRIFT: Refining Instruction Data via On-Policy Data Attribution

Link: https://arxiv.org/abs/2606.18307
Authors: Zefan Wang, Lincheng Li, Tianyu Yu, Yuan Yao
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Optimizing the training data distribution for Supervised Fine-Tuning (SFT) dictates the capability of Large Language Models (LLMs). While existing data curation methods excel at accelerating training under constrained budgets, they are less suited to elevating the capability upper bound. The challenge here is no longer to identify a smaller subset that preserves performance, but to refine the data distribution toward instances most capable of improving the final model. To address this problem, we explore instance-level data attribution using Influence Functions (IF). We identify that standard IF formulations struggle in this setting due to two structural limitations: a proximity gap caused by off-policy validation targets, and a severe bias towards gradient norm. We propose DRIFT (Data Refinement via On-Policy Influence Functions for Supervised Fine-Tuning). Instead of relying on external reference data, DRIFT utilizes the model’s on-policy rollouts as validation targets, which empirically minimizes the parameter proximity gap and better aligns with the local neighborhood assumption of IF. It further applies signed weighting based on trajectory correctness and debiases influence scores against the gradient hacking issue, allowing a small set of validation queries to act as reliable anchors for attributing the full dataset. Experiments on 7B-parameter instruction and reasoning models show that DRIFT consistently raises the performance ceiling on both, outperforming existing data curation baselines.

[AI-88] Attribution-Guided and Coverage-Maximized Pruning for Structural MoE Compression ICML2026

Link: https://arxiv.org/abs/2606.18304
Authors: Yifu Ding, Jiacheng Wang, Ge Yang, Yongcheng Jing, Jinyang Guo, Xianglong Liu, Dacheng Tao
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures. Submitted to ICML 2026

Abstract:Mixture-of-Experts (MoE) models scale compute efficiently, yet remain expensive to deploy due to their substantial memory footprint and inference overhead. Prior compression methods mainly operate at the expert level, either removing entire experts or ranking experts by coarse-grained importance scores. However, such expert-wise decisions are often too coarse to capture fine-grained redundancy, leading to misallocated pruning budgets and limited compression. To address this problem, we observe that information within MoE experts is highly concentrated in a small subset of channels, leaving substantial redundancy even in experts deemed important. Based on this observation, we propose a structural pruning framework tailored for MoE models. Our method reformulates prune-ratio allocation as a channel-score coverage maximization problem and solves it efficiently using an attribution-based approximation. Experiments on DeepSeek and Qwen MoE models show that our method preserves model accuracy under 50% or 25% structured pruning when combined with 4-bit quantization. On Qwen3-30B-A3B, our approach reduces memory footprint by 5.27× and consistently outperforms state-of-the-art baselines across diverse benchmarks.

[AI-89] A Link between Shock-wave Theory and Symmetry-reduced Stochastic Gradient Descent for Artificial Neural Networks ICANN

Link: https://arxiv.org/abs/2606.18303
Authors: Taiki Miyagawa
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to the 35th International Conference on Artificial Neural Networks (ICANN) 2026

Abstract:We develop a mathematically explicit link between shock-wave theory and the symmetry-quotiented learning dynamics of stochastic gradient descent, drawing on differential geometry, Lie group theory, and fluid mechanics. Specifically, after quotienting parameter symmetries and applying local-entropy coarse-graining, the effective dynamics satisfy a viscous Hamilton–Jacobi equation on the quotient manifold. Moreover, under the assumption that the raw parameter dynamics can be summarized by a gradient field on the quotiented space, the gradient of the coarse-grained loss function obeys a Burgers-type equation, and shock formation can be established rigorously. We apply our theory to multilayer perceptrons, convolutional neural networks, Transformers, and mean-field networks, and show that they obey the Hamilton–Jacobi or Burgers-type equations. We conjecture that this framework also yields practical diagnostics for deep learning. In architectures such as Transformers, raw parameter norms are often distorted by symmetry redundancy and may therefore be misleading, whereas symmetry-corrected quotient observables provide a principled basis for monitoring, forecasting, and controlling training-phase transitions.

[AI-90] Vibe Coding Ate My Homework: An evaluation of AI approaches to greenfield software engineering and programming

Link: https://arxiv.org/abs/2606.18293
Authors: Callum Barbour
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 10 pages, 2 figures

Abstract:Thanks to rapid developments in generative AI, we are in the midst of a paradigm shift that may change how we interact with computers forever. We have observed a growth in the use of natural language prompts to build applications and coding infrastructures without underlying knowledge of the field, and this practice has been dubbed ‘vibe coding.’ It arguably represents what the field of programming has been building towards since the beginning, with every higher level of abstraction that is conceived. Vibe coding promises to be the endpoint for the meta of high-level programming as far as method of input is concerned: eliminating a human’s use of code syntax entirely in favour of programming in their mother tongue. This paper aims to evaluate the viability of vibe coding for greenfield software engineering tasks, as well as analyse the benchmarks that have been used to measure its software engineering prowess. To this end, we have developed an evaluation suite for analysing an LLM’s proficiency in carrying out simple, isolated greenfield programming tasks in Python to provide scoped insight on the matter.

[AI-91] Mitigating Anchoring Bias in LLM-Based Agents for Energy-Efficient 6G Autonomous Networks

Link: https://arxiv.org/abs/2606.18272
Authors: Hatim Chergui, Claudia Carballo González, Farhad Rezazadeh, Merouane Debbah
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 7 pages, 4 figures

Abstract:This paper presents an autonomous agentic resource negotiation framework designed to enable zero-touch network slicing in 6G architectures using Large Language Model (LLM) agents. While LLMs offer powerful reasoning capabilities, we demonstrate that such agents inherently suffer from anchoring bias, rigidly adhering to initial heuristic proposals and causing severe network over-provisioning. To systematically mitigate this cognitive bias, we propose a novel randomized anchoring strategy modeled via a Truncated 3-Parameter Weibull distribution. This mathematically bounded approach seamlessly integrates with burst-aware Digital Twins (DTs) employing Conditional Value at Risk (CVaR) to rigorously guarantee strict Service Level Agreement (SLA) tail-latencies. To validate our methodology, we introduce and prove the Bimodal Constraint-Avoidance Utility Theorem, demonstrating that while feasible negotiations follow classical convex bounds, highly constrained scenarios undergo a phase transition governed by an inverse rational decay envelope. Empirical results generated using a locally hosted 1B-parameter model (otel-llm-1b-it) confirm these dual-regime bounds. Our cognitive de-biasing successfully dismantles rigid negotiation patterns, forcing agents into active exploration to safely ride SLA boundaries and boost system energy savings up to 25%. Crucially, the lightweight 1B LLM achieves sub-second inference latencies (0.95s mean), ensuring our multi-agent framework is compatible with the operational timescales of the O-RAN non-Real-Time RAN Intelligent Controller (non-RT RIC). Our source code is available for non-commercial use at this https URL.
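
Two ingredients named in this abstract, sampling from a truncated 3-parameter Weibull distribution and computing Conditional Value at Risk, can be sketched generically as follows. This is an illustration under assumptions, not the paper's implementation: the function names and all parameter values (`shape`, `scale`, `loc`, the truncation window, `alpha`) are hypothetical, and CVaR is taken here as the empirical mean of the samples at or above the alpha-quantile.

```python
import numpy as np

def sample_truncated_weibull(rng, n, shape, scale, loc, low, high):
    """Inverse-CDF sampling from a 3-parameter Weibull restricted to [low, high]."""
    def cdf(x):
        z = np.maximum(x - loc, 0.0) / scale
        return 1.0 - np.exp(-z ** shape)
    # Draw uniforms only from the CDF mass that lies inside the truncation window.
    u = rng.uniform(cdf(low), cdf(high), size=n)
    x = loc + scale * (-np.log1p(-u)) ** (1.0 / shape)
    return np.clip(x, low, high)  # guard against floating-point overshoot

def cvar(samples, alpha=0.95):
    """Return (VaR, CVaR): the alpha-quantile and the mean of the tail at or above it."""
    x = np.asarray(samples, dtype=float)
    var = np.quantile(x, alpha)
    return var, x[x >= var].mean()

rng = np.random.default_rng(0)
# Hypothetical randomized anchors, e.g. initial proposals bounded to [1.5, 6.0].
anchors = sample_truncated_weibull(rng, 1000, shape=1.5, scale=2.0, loc=1.0, low=1.5, high=6.0)
# Hypothetical latency samples 1..100 (arbitrary units) for the tail-risk measure.
var95, cvar95 = cvar(np.arange(1, 101), alpha=0.95)
```

Randomizing the starting proposal within hard bounds keeps every anchor feasible while avoiding a single fixed starting point; the CVaR helper summarizes the worst 5% of latencies instead of the average case.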

[AI-92] NAVI-Orbital: First In-Orbit Demonstration of a Zero-Shot Vision-Language Model for Autonomous Earth Observation

Link: https://arxiv.org/abs/2606.18271
Authors: Juan Manuel Delfa Victoria, Taran Cyriac John, Andrew W. Herson
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages, 47 figures

Abstract:As Earth Observation data generation outpaces downlink bandwidth and human-in-the-loop processing, a widening gap has emerged between onboard collection and actionable ground intelligence. This paper presents NAVI-Orbital, a software system deployed on a Low Earth Orbit (LEO) spacecraft. On April 16, 2026, NAVI-Orbital achieved what is, to the authors’ knowledge, the first in-orbit demonstration of a vision-language model performing autonomous multi-modal inference entirely onboard. NAVI-Orbital uses a local vision-language model (Gemma 3) to classify each captured scene, produce a text description of its content and the relationships between its features, and respond to operator follow-up via natural-language dialogue. The system is re-tasked through plain-English prompts in place of conventional command sequences, and is orchestrated by a graph-based state machine (LangGraph) coordinating dedicated agents for detection and dialogue. Results across ground benchmarking (88.16% accuracy on the 7,960-image curated AID benchmark), Flatsat validation, and live in-orbit captures of newly acquired, previously unseen Earth imagery (including uncorrected YAM-9 imagery, processed onboard with hardware-accelerated GPU inference and no fine-tuning for the flight instrument) demonstrate the feasibility of running foundation models on satellite-class edge computers to invert the conventional acquire-then-downlink-everything bandwidth profile through semantic compression of Earth observations in-orbit.

[AI-93] Towards Multi-Agent-Simulation-Based Community Note Evaluation

Link: https://arxiv.org/abs/2606.18268
Authors: Changxi Wen, Shuning Zhang, Bohao Chu, Yuwei Chuai, Hui Wang, Dai Shi, Xin Yi, Hewu Li
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments:

Abstract:Community-based fact-checking that relies on cross-consensus is expanding rapidly on social media platforms. However, the delay and low ratio of cross-consensus community fact-checks rated by human contributors remain a significant challenge. To address this, we first created ComRate, a large-scale dataset comprising 2.5 million community notes and over 209 million ratings sourced from X. We then propose MultiCom, a persona-guided multi-agent rating framework for community note evaluation. MultiCom simulates a diverse rater population by clustering contributors in a matrix-factorized rater space and prompting persona agents to generate structured assessments based on the official community notes rating schema. These agents output structured and explainable judgments, such as confidence, agreement signals and reasons. An out-of-fold calibrated aggregation algorithm combines features such as raw votes and diagnostic reason signals for reliable prediction. Extensive evaluations demonstrate that MultiCom outperforms alternative methods, achieving an average accuracy of 84.7% (balanced accuracy 68.3%, macro-F1 60.1%) on the evaluation set.

[AI-94] QSignAI: Quantum-Randomness-Seeded Identity Signatures at the Intersection of AI for Science and Science for AI

Link: https://arxiv.org/abs/2605.27729
Authors: Dongping Liu, Aoyu Zhang, Luyao Zhang
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Quantum Physics (quant-ph)
Comments:

Abstract:The 2024-2025 Nobel and Turing awards recognised AI and quantum science simultaneously. Yet no deployed system has brought these streams together for the public. This paper presents QSignAI, a production-deployed platform demonstrating a bidirectional AI-quantum relationship in a real-time event participation system. We address three questions: can quantum-randomness generation via a two-source extractor be embedded in an AI-driven social platform with acceptable latency; can an AI bot make quantum phenomena perceptually legible to general audiences; and does the combined system work in practice? A conversational bot routes each participant’s first message through a quantum pipeline comprising a Toeplitz two-source extractor over independent single-qubit Hadamard measurements on SV1 and DM1 simulators, plus a 2-qubit Bell state, producing a unique quantum-randomness-seeded identity signature per participant. The first two questions are answered through system architecture and qualitative deployment evidence from live events; the third through successful production deployment. The current deployment uses cloud quantum simulators; physical QPU randomness is the near-term extension. Measurable benchmarks are identified as priority future work.

[AI-95] AdsMind: A Physics-Grounded Multi-Agent System for Self-Correcting Discovery of Adsorption Configurations on Heterogeneous Catalyst Surfaces

Link: https://arxiv.org/abs/2606.19152
Authors: Zongmin Zhang, Yuyang Lou, Bowen Zhang, Junwu Chen, Ryo Kuroki, Xuan Vu Nguyen, Edvin Fako, Lixue Cheng, Philippe Schwaller
Categories: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
Comments: 37 pages, 5 figures

Abstract:Identifying the lowest-energy surface-adsorbate configuration is critical for modeling heterogeneous catalysis, yet exhaustive exploration with ab initio calculations is computationally prohibitive. Machine-learning force fields (MLFFs) accelerate structural relaxation but leave the search over the vast configurational space a major bottleneck, and open-loop large language model (LLM) agents lack a physics-grounded feedback mechanism to correct erroneous initial guesses. We propose AdsMind (Adsorption configuration discovery with Machine intelligence and relaxation feedback), a closed-loop multi-agent framework that enables autonomous error correction through MLFF relaxation feedback. Across four LLM backends, AdsMind achieves consistently high search reliability, with success rates of 100% and 98.8% on the benchmarks AA20 and OCD-GMAE62. Relative to its single-pass (1-Shot) ablation it reduces cross-backend energy dispersion, and it uses only 4.11 and 4.67 MLFF relaxations per case, respectively – an approximately 14-fold reduction over heuristic enumeration baselines. Density functional theory (DFT) validation using VASP/PBE on six representative AA20 systems shows that the reported open-loop Adsorb-Agent outputs exhibit qualitative adsorption-energy sign errors for molecular adsorbates, whereas AdsMind preserves the correct sign in all tested cases with closer quantitative agreement. AdsMind thus delivers reliability, self-reflection, and interpretability simultaneously, supporting more DFT-informed autonomous chemistry workflows.

[AI-96] Equivariant Graph Neural Networks Improve Optical Spectra Prediction for Materials Screening

Link: https://arxiv.org/abs/2606.19133
Authors: Kasper Helverskov Petersen, François R J Cornet, Martin Ovesen, Mikkel Jordahn, Kristian S. Thygesen, Mikkel N. Schmidt
Categories: Optics (physics.optics); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
Comments:

Abstract:Scalable prediction of optical spectra is a critical component of high-throughput materials screening for optoelectronic applications such as solar cells. Existing surrogate models are trained on spectra computed from lower levels of theory or rely on rotation-invariant scalar features, limiting their geometric expressiveness. We explore the use of equivariant graph neural networks for optical spectra prediction, adapting GotenNet to this task and evaluating it on multiple datasets including a recently published collection of 10,533 structures with spectra computed at the level of the random phase approximation (RPA). The proposed model outperforms the current state of the art, with the largest gains in the 0-8 eV range and on predicting the static real permittivity, both of particular relevance for thin-film optics.

[AI-97] TransitNet: A Compact Attention-Augmented Deep Learning Framework for Low-SNR Transit Blind Searches

Link: https://arxiv.org/abs/2606.18932
Authors: Xingchen Yan, Jian Ge, Qingtian Liu, Kevin Willis, Quanquan Hu, Jiapeng Zhu
Categories: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 24 pages, 23 figures, 3 tables, submitted to MNRAS

Abstract:Motivated by the observational incompleteness of intermediate-to-long-period Earth-size planets, we present TransitNet, a compact attention-augmented deep-learning framework for low-SNR transit blind searches. To enable realistic method development and objective threshold calibration under blind-search conditions, we develop a unified dataset construction, benchmarking, and threshold-selection framework. On recovery benchmarks constructed from unseen Kepler targets, TransitNet attains 95.2 percent accuracy in the challenging SNR range of 6 to 8 and outperforms both TLS and BLS, achieving ROC-AUC and PR-AP values of 0.974 and 0.982, respectively. In an injected Earth-size and sub-Earth-size transit recovery experiment, TransitNet achieves a recovery rate of 93.0 percent, substantially exceeding those of TLS (63.1 percent) and BLS (60.0 percent). In addition to detection, TransitNet provides attention-based estimates of transit windows and midpoints. On an independent evaluation set, 97.4 percent of injected transits are fully covered by the estimated transit window. Applied to real Kepler observations, the model successfully recovers all 34 selected confirmed Kepler planets, with a mean absolute transit midpoint error of 1.24 hours. The model combines a compact footprint of about 1.5 MB with high inference efficiency, yielding speed-ups of about 12 to 25 times relative to CPU-TLS and about 4 to 5 times relative to CPU-BLS. These results demonstrate that TransitNet provides an accurate, scalable, and computationally efficient framework for low-SNR transit blind searches in the tested regime and motivate its extension to longer-period Earth-size planet searches.

[AI-98] Augmenting Dysarthric Speech Severity Assessment with MOS Supervision

Link: https://arxiv.org/abs/2606.18645
Authors: Kaimeng Jia, Minzhu Tu, Zengrui Jin, Siyin Wang, Chao Zhang
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
Comments:

Abstract:Dysarthria is a speech disorder marked by reduced intelligibility and communicative effectiveness. Automatic utterance-level assessment of dysarthric speech can support scalable speech monitoring and therapy-related analysis. Yet training such systems is bottlenecked by the scarcity of clinically annotated dysarthric speech. This work proposes to augment dysarthric speech assessment using data from speech synthesis evaluations, specifically human-annotated utterances with Mean Opinion Score (MOS) labels from the QualiSpeech corpus. Experiments show that fine-tuning on speech synthesis assessment data consistently improves performance on both intelligibility and naturalness prediction, while joint training yields gains primarily on naturalness. These results suggest that synthesis artifacts and dysarthric speech share perceptual commonalities, and speech synthesis evaluation corpora offer a practical augmentation source that reduces reliance on scarce clinical annotations.

[AI-99] A Variational Framework for LLM Generator-Regulator Games

Link: https://arxiv.org/abs/2606.18424
Authors: Quanyan Zhu
Categories: Other Statistics (stat.OT); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments:

Abstract:This paper develops a variational framework for regulated language generation. Starting from autoregressive token sampling, we derive the induced distribution over complete messages and relate it to an entropy-regularized Gibbs law. Regulation is modeled as an optimal discriminator whose convex-dual value is an f-divergence, and the generator-regulator interaction is formulated as a saddle-point problem. The framework applies to moderation, censorship, AI deception detection, compliance auditing, phishing defense, and manipulation control, where regulation concerns a distribution over possible messages rather than a single output. The equilibrium clarifies the tradeoff among utility, entropy, regulatory alignment, and finite-length detectability. Two finite-vocabulary case studies, censorship filtering and phishing defense, illustrate how the theory can be evaluated through utility, entropy, divergence, receiver-side scores, and detection probability.
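
For readers unfamiliar with the term, the entropy-regularized Gibbs law mentioned above has a short textbook derivation. The notation below is an assumption made for illustration and may differ from the paper's: $\mathcal{X}$ is a finite set of complete messages, $u$ a utility, $\lambda > 0$ a temperature, and $H$ the Shannon entropy.

```latex
% Entropy-regularized utility maximization over distributions p on a finite set X
\max_{p \in \Delta(\mathcal{X})} \; \mathbb{E}_{x \sim p}[u(x)] + \lambda H(p),
\qquad H(p) = -\sum_{x \in \mathcal{X}} p(x) \log p(x).

% Define the Gibbs distribution and its partition function
p^\star(x) = \frac{\exp\bigl(u(x)/\lambda\bigr)}{Z},
\qquad Z = \sum_{x' \in \mathcal{X}} \exp\bigl(u(x')/\lambda\bigr).

% For any p, the objective decomposes as
\mathbb{E}_{p}[u] + \lambda H(p)
  = \lambda \log Z \;-\; \lambda \, \mathrm{KL}\bigl(p \,\|\, p^\star\bigr),

% so the maximum is attained uniquely at p = p^\star, with optimal value \lambda \log Z.
```

The decomposition follows by substituting $\log p^\star(x) = u(x)/\lambda - \log Z$ into the definition of the KL divergence; since the KL term is non-negative and vanishes only at $p = p^\star$, the Gibbs distribution is the unique maximizer.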

[AI-100] Deep-Learning-Based Pixelated Microwave Filter Design and Characterization using Electro-Optical Electric-Field Measurements

Link: https://arxiv.org/abs/2606.18402
Authors: Han Zhou, Richard Bannister, Caspar Pierce, Haojie Chang, David Widen, Ludvig Fornstedt, Gabriel Melin, Alexander Bohlin, Pontus Lindeberg Fredriksson, Dilbagh Singh, Christian Fager, Koen Buisman
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Systems and Control (eess.SY)
Comments:

Abstract:Traditional microwave filter design typically relies on iterative parameter tuning and predefined topologies, which limits design space and increases development time. This study uses a deep learning approach combining convolutional neural networks with genetic algorithms to automate pixelated microwave filter synthesis. To validate the approach experimentally, both S-parameter and spatial electric-field measurements were analyzed. The synthesized low-pass filter demonstrated excellent agreement between simulated and measured performance, achieving a 7 GHz passband with over 20 dB suppression beyond 9.5 GHz. Electro-optical measurements, for the first time, revealed electric field patterns that resemble coupled transmission-lines or stub structures, providing insight into the emergent characteristics of AI-generated designs.

[AI-101] Deep Learning-Driven Inverse Design of Doherty Power Amplifiers Using Pixelated Combiners and Dual-State Impedance Synthesis

Link: https://arxiv.org/abs/2606.18395
Authors: Han Zhou, Haojie Chang, David Widen, Christian Fager
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Systems and Control (eess.SY)
Comments:

Abstract:The output combiner of a Doherty power amplifier (PA) integrates load modulation, impedance matching, and phase compensation within a single network, making its design and synthesis highly challenging. In this paper, we propose a three-port Doherty combiner design methodology that combines deep convolutional neural networks (CNNs), pixelated layout representations, and genetic algorithms (GA) with dual-state impedance synthesis to address both peak and back-off power conditions. As a proof of concept, two GaN HEMT Doherty PA prototypes incorporating three-port pixelated combiners are designed and fabricated. Both prototypes achieve a measured saturated output power exceeding 44.2 dBm with peak drain efficiency above 71.2% within 2.6-2.8 GHz. Furthermore, a drain efficiency as high as 64% is measured at the 6-dB back-off level. After applying digital predistortion, each prototype achieves an adjacent channel leakage ratio (ACLR) better than -51.3 dBc.

[AI-102] A Knowledge Theory of Capital: The Value of Natural and Artificial Intelligence

Link: https://arxiv.org/abs/2606.18288
Authors: Jeffrey Gardiner
Categories: General Economics (econ.GN); Artificial Intelligence (cs.AI); Theoretical Economics (econ.TH)
Comments: 458 pages, 8 figures. Theory-building monograph developing a conditional framework for knowledge-bearing capitalism, with formal concepts, mechanisms, measurement apparatus, and falsification conditions

Abstract:This volume develops a knowledge theory of capital for economies in which productive capacity increasingly resides in software, data, models, routines, expertise, platforms, organizations, commons, and public epistemic infrastructure. Beginning from Adam Smith’s theory of labour, stock, specialization, and market extent, it asks what changes when knowledge becomes stock-like, mobile across forms, scalable, governable, recombinable, and imperfectly visible in accounting. The book introduces knowledge-bearing stock as the central object and analyses how it is generated, converted into governable form, deployed, improved through feedback, enclosed or shared, measured, impaired, and used as input to future production. It distinguishes embodied, disembodied, institutionalized, commons, and public knowledge forms and develops concepts such as first conversion, cognitive enclosure, feedback capture, dark capital, and expected knowledge loss. The argument is conditional and testable: modern wealth depends not only on capital accumulation, but on how productive knowledge is governed.

[AI-103] IOAH3: Importance-Driven Adaptive Spatial Partitioning

Link: https://arxiv.org/abs/2606.18280
Authors: Ehsaneddin Jalilian
Subjects: Applications (stat.AP); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present IOAH3 (Importance-Oriented Adaptive H3 partitioning), a computational method for constructing data-driven spatial partitions of geo-referenced observation domains. Standard approaches to spatial aggregation adopt fixed areal units, such as administrative boundaries or uniform hexagonal grids at a single resolution, without regard to the informational content of the underlying observations in each region. This leads to the well-known modifiable areal unit problem: statistical and inferential results depend on the arbitrary choice of partition, and spatially concentrated phenomena are averaged out in coarse cells that obscure fine-scale structure. IOAH3 addresses this by constructing an adaptive partition in three stages: multi-source feature extraction and importance scoring via principal component analysis over road density, POI density, building density, and terrain roughness signals, with population and flood-hazard data entering as auxiliary inputs to cell filtering and spatial smoothness; spatial cell selection via Markov Random Field graph-cut optimisation, which jointly maximises per-cell importance while enforcing spatial contiguity; and data-driven hierarchical refinement of high-importance regions to finer H3 resolution levels, with neighbour-propagated support to avoid isolated fine-resolution islands. The resulting partitions serve as input to spatial inference pipelines and provide a principled resolution of the partition-sensitivity problem prior to any modelling step.

Machine Learning

[LG-0] Diffusion-Proof: Recipe for Formal Theorem Proving Beyond Auto-Regressive Generation

Link: https://arxiv.org/abs/2606.19315
Authors: Ruida Wang,Rui Pan,Pengcheng Wang,Shizhe Diao,Tong Zhang
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Enhancing the formal math reasoning capabilities of Large Language Models (LLMs) has become a key focus in both mathematical and computer science communities in recent years. While significant progress has been made in using state-of-the-art Auto-Regressive (AR) LLMs for formal theorem proving, these models suffer from inherent limitations. Their next-token prediction generation methods may yield suboptimal performance due to the challenges of long-range coherence and the compounding of errors over long sequences. Recent advancements in diffusion LLMs (dLLMs), which generate text through iterative denoising of a multi-token block, offer a promising alternative. However, the application of dLLMs to formal mathematics, where maintaining long-range coherence is critical, remains largely understudied. To address the challenges above, we propose Diffusion-Proof, which is, to the best of our knowledge, the first framework to train and apply dLLMs for formal theorem proving. Our framework contains training and inference methods for two models. The first one is dLLM-Prover-7B, which performs whole-proof writing with long-range coherent tactic usage. The second one is dLLM-Corrector-7B, which is a novel large block diffusion-based correction model. It leverages the in-filling capabilities of dLLMs to perform local proof correction using bi-directional information. Extensive experiments demonstrate that Diffusion-Proof significantly outperforms the AR LLM baseline trained on the same dataset. Diffusion-Proof achieves absolute improvements of 1.61% on ProofNet-Test and 6.14% on MiniF2F-Test compared to the baseline. Notably, Diffusion-Proof successfully resolves one IMO problem that the more advanced thinking model DeepSeek-Prover-V2-7B could not solve, showcasing the unique advantage of dLLMs in formal theorem proving.

[LG-1] P-K-GCN: Physics-augmented Koopman-enhanced Graph Convolutional Network for Deep Spatiotemporal Super-resolution

Link: https://arxiv.org/abs/2606.19303
Authors: Xizhuo(Cici)Zhang,Zekai Wang,Fei Liu,Bing Yao
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:High-fidelity simulation of spatiotemporal dynamics is computationally prohibitive, necessitating efficient super-resolution techniques to reconstruct high-resolution data from coarse-grained inputs. Traditional data-driven methods often lack physical constraints, and simple physics-informed learning struggles with irregular spatial geometries and intricately evolving temporal dynamics. To tackle these challenges, we propose a Physics-augmented Koopman-enhanced Graph Convolutional Network (P-K-GCN) for spatiotemporal super-resolution on irregular geometries. Specifically, a continuous spline-based GCN is first designed to extract spatial dependencies directly from the coarse graph, and Koopman operator theory is incorporated to project the nonlinear dynamics into a compact latent space where temporal progression is linearized. Second, we augment the optimization objective with a physics-based loss that forces the data-driven reconstructions to adhere to physical laws, improving predictive fidelity and robustness. Finally, we provide a rigorous theoretical analysis, establishing that the physics augmentation and Koopman regularization mathematically guarantee a reduction in super-resolution error by diminishing Rademacher complexity and tightening generalization bounds. We evaluate our framework on reconstructing spatially high-resolution cardiac electrodynamics across a 3D heart geometry from sparse low-resolution measurements. Numerical experiments demonstrate that our method achieves superior accuracy compared to baseline models.

[LG-2] Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

Link: https://arxiv.org/abs/2606.19297
Authors: Nikita Kachaev,Andrey Moskalenko,Matvey Skripkin,Nikita Kurlaev,Daria Pugacheva,Albina Burlova,Mikhail Kolosov,Denis Shepelev,Andrey Kuznetsov,Elena Tutubalina,Aleksandr I. Panov,Alexey K. Kovalev,Vlad Shakhuro
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Project page: this https URL

Abstract:Embodied Vision-Language-Action (VLA) models are typically obtained by fine-tuning powerful pretrained VLMs on robotics data, yet it is unclear how much commonsense and factual knowledge they retain after adaptation. Failures on knowledge-sensitive tasks are ambiguous, conflating missing knowledge with poor generalization of low-level control. We introduce Act2Answer, a lightweight protocol that adapts VLM knowledge benchmarks to VLA evaluation by requiring agents to answer through action. Each question becomes a short tabletop episode where the agent performs a single object-placement action to select among candidate answers, yielding an action-grounded success rate with reduced control confounds. We curate a test suite of such environments across diverse commonsense and world-knowledge categories and introduce layerwise intent probing to localize answer-relevant information across the VLM backbone and action head. In a large-scale study of 7 VLA models and 9 VLM baselines, we systematically rank models across categories, finding that VLAs show solid performance on simple concepts while exhibiting larger gaps on richer semantic categories relative to their source VLMs, that VQA co-training is associated with better knowledge retention, and that answer-relevant signals peak in middle VLA layers but attenuate in upper layers. Act2Answer is available at this https URL.

[LG-3] Risk Stratification for ICU Delirium using Pervasive Ambient Sensing Information

Link: https://arxiv.org/abs/2606.19292
Authors: Jiaqing Zhang,Sabyasachi Bandyopadhyay,Miguel Contreras,Jessica Sena,Yuanfang Ren,Andrea Davidson,Ziyuan Guan,Tezcan Ozrazgat-Baslanti,Subhash Nerella,Azra Bihorac,Parisa Rashidi
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Delirium is a common and serious complication in the Intensive Care Unit (ICU), associated with increased morbidity, prolonged hospital stays, and higher healthcare costs. Despite its prevalence, early prediction and prevention remain challenging. Environmental factors such as ambient sound and light may influence the onset of delirium, yet they are often overlooked in risk assessments. In this study, we examined whether light intensity and sound pressure levels can independently predict delirium across multiple prediction horizons. We evaluated four efficient sequential neural network models on data collected from 9 ICUs across 309 patients to predict delirium for 10 prediction-window sizes. We reported feature importance and direction of influence using Shapley Additive Explanations analysis. The convolutional model achieved the strongest discrimination, with AUC = 0.80 on sound data and on combined data. Sound features were the dominant predictors overall. Integrating sound with light improved short-term ( 1 week) prediction, with the combined model assigning the highest risk immediately after the sensing period. These findings suggest that passive ambient sensing, especially sound, can add a clinically meaningful, interpretable signal for delirium risk estimation and offer a practical pathway to enrich multimodal ICU prediction and prevention strategies.

[LG-4] Detecting Hidden ML Training With Zero-Overhead Telemetry ICML2026

Link: https://arxiv.org/abs/2606.19262
Authors: Robi Rahman,Sabiha Tajdari
Subjects: Machine Learning (cs.LG)
Comments: Technical AI Governance Research workshop at ICML 2026

Abstract:Hardware-enabled monitoring of GPU workloads underpins many proposals for AI compute governance, but if developers can defeat monitoring mechanisms, such schemes are unworkable. We evaluate the adversarial robustness of GPU workload classification using only zero-overhead, privacy-preserving NVML telemetry: content-agnostic signals that observe physical effects of computation without accessing model weights, training data, or hyperparameters. Across 5 rounds of monitor-evader iteration, we evaluate 20 evasion strategy families on 9 GPU models spanning 4 architecture generations. We develop a classifier that achieves 98.2% binary accuracy at identifying training workloads across the whole corpus, and 43-87% accuracy against the most challenging unexpected workloads even when they are adversarially disguised.

[LG-5] SCAN: Enhance Time Series Anomaly Detection via Multi-Scale Neighborhood-Centered Clustering

Link: https://arxiv.org/abs/2606.19255
Authors: Xingze Zheng,Hanyin Cheng,Siyuan Wang,Yiting Hao,Peng Chen,Yuan Jun,Yang Shu
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Time series anomaly detection plays a crucial role in a wide range of real-world applications. Reconstruction-based methods have become the mainstream paradigm, but they suffer from over-generalization and under-generalization problems, which are challenging to balance. To address this, we introduce multi-scale clustering to enhance reconstruction-based methods. At the representation level, we integrate the cluster center representations of normal patterns to constrain the model to target representative normal patterns for reconstruction, preventing dominance of powerful capacity and representation capability. At the anomaly criterion level, we derive an anomaly confidence score based on cluster membership probability and combine it with the reconstruction error, providing dual criteria for detection. Furthermore, the effectiveness of the cluster center representations and the anomaly confidence score depends on the clustering performance. Accordingly, we extract neighborhood-centered representations for multi-view clustering to improve clustering performance. Extensive experiments on multiple real-world datasets from diverse application domains demonstrate the state-of-the-art performance of SCAN.

[LG-6] Learning to Annotate Delayed and False AEB Events: A Practical System for Extreme Class Imbalance and Asymmetric Label Noise ICRA

Link: https://arxiv.org/abs/2606.19186
Authors: Mengxiang Hao,Xin Jiang,Xinghao Huang,Wenliang Su,Zhiteng Wang,Junjie Rao,Xiaotian Yang,Wei Liao,Chengyu Han,Gen Liang,Yulun Song,Zhitao Xu,Xianpeng Lang
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 8 pages, 5 figures, accepted by IEEE International Conference on Robotics and Automation (ICRA)

Abstract:Autonomous Emergency Braking (AEB) optimization relies on accurately annotated real-world trigger events, particularly rare but critical delayed and false AEB triggers that expose system deficiencies. However, these minority samples comprise less than 5% of thousands of daily triggers, making manual annotation prohibitively expensive at scale. We present the first automated AEB annotation framework to address this problem. During development, we identified two fundamental challenges that severely impair delayed/false trigger annotation accuracy: (1) Extreme class imbalance, where delayed/false triggers are overwhelmed by true triggers; (2) Asymmetric label noise, where mislabeled majority samples (true triggers) suppress the learning of minority samples (delayed/false triggers). To overcome these challenges, we propose two key innovations: (1) Specific data augmentation that synthesizes realistic samples by manipulating focal target attributes, transplanting ego-vehicle dynamics, and masking non-focal agents; (2) Noise suppression using stable hardness estimation and a probe-guided adaptive threshold to clean mislabeled true trigger samples. Crucially, we deploy our model as a practical annotation system with a full-stack architecture, efficiently identifying critical delayed/false triggers from thousands of daily AEB events. Production results demonstrate an 80% improvement in recall of delayed/false triggers and a 50% reduction in manual workload. Beyond immediate gains, the system enables continuous self-improvement through accumulated high-quality annotations, establishing a necessary data foundation for on-vehicle AEB system optimization.

[LG-7] AGDN: Learning to Solve Traveling Salesman Problem with Anisotropic Graph Diffusion Network KDD2026 KDD

Link: https://arxiv.org/abs/2606.19185
Authors: Bolin Shen,Ziwei Huang,Zhiguang Cao,Yushun Dong
Subjects: Machine Learning (cs.LG)
Comments: Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

Abstract:The Traveling Salesman Problem (TSP) is a cornerstone of combinatorial optimization and arises in many practical scenarios. Although graph-based learning approaches have been explored for TSP, the question of how to exploit graph structure more effectively remains open. We present the Anisotropic Graph Diffusion Network (AGDN), a new Graph Neural Network framework designed to solve TSP. Our method tackles two central difficulties: (1) the lack of informative topological prior in fully connected TSP graphs, and (2) losing connected nodes in the optimal solution after the commonly used graph sparsification techniques. To overcome these issues, we construct a MixScore transition matrix that merges node similarity with pairwise distance, and we develop an anisotropic graph diffusion strategy that supports efficient information exchange across multiple hops. Comprehensive experiments spanning diverse instance sizes and node distributions show that AGDN consistently outperforms existing methods while keeping computation time competitive. Furthermore, AGDN generalizes well to problem sizes and distributions beyond those seen during training. The implementation is publicly available at: this https URL.

[LG-8] Complementary Attention Head Pruning for Efficient Transformers IJCNN

Link: https://arxiv.org/abs/2606.19150
Authors: Yaniv Livertovsky,Shahar Somin,Gonen Singer
Subjects: Machine Learning (cs.LG)
Comments: 9 pages, 4 figures, 3 tables. Accepted for presentation at the International Joint Conference on Neural Networks (IJCNN) 2026

Abstract:The remarkable success of Transformer-based models in natural language processing stems from architectural scaling, which leads to a large number of parameters and hinders deployment in resource-constrained environments. While structured pruning offers a pathway to compression, existing state-of-the-art methods often rely on gradient-based importance ranking or stochastic gating, which suffer from instability, structural degeneration, and the need for extensive manual hyperparameter tuning. In this paper, we introduce CAHP (Complementary Attention Head Pruning), a novel post-hoc framework that redefines head selection as a global graph-theoretical problem. Rather than evaluating heads in isolation, CAHP utilizes graph-based clustering combined with information-theoretic distance measures to identify and preserve a topologically diverse subset of complementary attention heads. Without requiring a predefined sparsity level or pruning ratio, the framework automatically determines the number of selected attention heads across layers by identifying a diminishing marginal performance curve, where pruning additional heads leads to a sharp degradation in performance, as determined by the chosen polynomial degree. Extensive evaluations on the SST-5 and MNLI benchmarks, across different Transformer model scales, demonstrate that CAHP consistently outperforms competitive baselines, particularly in high-compression regimes. Furthermore, our structural analysis shows that CAHP avoids the “proximity bias” of gradient-based pruning methods, which tend to preserve heads mainly in layers close to the output, and instead retains a functionally critical set of attention heads in the model’s intermediate layers.

[LG-9] OpenAnt: LLM-Powered Vulnerability Discovery Through Code Decomposition Adversarial Verification and Dynamic Testing

Link: https://arxiv.org/abs/2606.19149
Authors: Nahum Korda,Gadi Evron
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Automated vulnerability discovery in large codebases remains challenging: traditional static analysis produces high false-positive rates, while dynamic approaches such as fuzzing require substantial infrastructure and often target narrow classes of bugs. Recent advances in large language models (LLMs) enable semantic reasoning about program behavior, but applying LLMs to repository-scale security analysis introduces challenges related to context management, cost, and verification. We present OpenAnt, an open-source vulnerability discovery system that integrates static program analysis with LLM-based reasoning in a multi-stage pipeline. OpenAnt introduces three key techniques. First, codebases are decomposed into self-contained analysis units filtered by reachability from external entry points, reducing the analysis surface by up to 97% while preserving attack-relevant code. Second, candidate vulnerabilities undergo adversarial verification through constrained attacker simulation, where the model evaluates exploitability under realistic attacker capabilities. Third, findings are validated through dynamic verification, in which exploit environments are generated automatically, executed in sandboxed containers, and discarded after use. Evaluation on widely used open-source projects including OpenSSL, WordPress, and Flowise shows that this architecture can identify previously unknown vulnerabilities while maintaining manageable analysis cost and substantially reducing false positives. Our results suggest that closed-loop vulnerability discovery pipelines, combining semantic reasoning with exploit validation, provide a practical path toward scalable automated security analysis. OpenAnt is released as open source under the Apache 2.0 license at this https URL. 

[LG-10] ChronoSurv: A Clinical Pathway-Guided Graph Framework for Multimodal Survival Analysis MICCAI2026

Link: https://arxiv.org/abs/2606.19140
Authors: Hugo Miccinilli,Theo Di Piazza
Subjects: Machine Learning (cs.LG)
Comments: Accepted at MICCAI 2026. Submitted version due to embargo

Abstract:Accurate survival prediction is essential for personalized treatment planning in head and neck cancer, yet remains challenging due to the heterogeneous and high-dimensional nature of multimodal clinical data. While deep survival models have improved predictive performance over classical statistical approaches, existing methods typically rely on static fusion strategies or temporally agnostic modeling, limiting their ability to capture structured clinical workflows. In this work, we propose ChronoSurv, a heterogeneous hierarchical directed graph framework for multimodal survival analysis. ChronoSurv represents patient care as a progression-aware clinical trajectory using directed graphs aligned with key diagnostic steps. A hierarchical topology incorporates fine-grained, coarse, and global representations, further supporting flexible adaptation to missing modalities, while heterogeneous message passing models complex and asymmetric relationships across modalities and clinical steps. Experimental results on two public datasets demonstrate that ChronoSurv achieves state-of-the-art discriminative performance while maintaining statistically reliable calibration. Comprehensive ablation studies further confirm the contribution of each architectural component, highlighting the potential of trajectory-aware graph modeling for multimodal survival prediction.

[LG-11] INDEQS: Informed Neural controlled Differential EQuationS

Link: https://arxiv.org/abs/2606.19138
Authors: Michael Detzel,Gabriel Nobis,Kristiyan Blagov,Juri Schubert,Jackie Ma,Wojciech Samek
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Neural Controlled Differential Equations (NCDE) provide a powerful continuous-time framework for forecasting time series, but standard graph-based extensions typically learn spatial structure purely from data, even in settings where a directed graph structure is known a priori. We introduce Informed Neural controlled Differential EQuationS (INDEQS), a graph-based NCDE forecasting method that incorporates prior knowledge of a directed graph at distinct architectural positions. INDEQS separates inner mixing of hidden states across graph nodes from outer mixing between vector field and control, and offers both a lightweight graph-constrained variant and a more expressive variant, learning additional graph connections from data via adaptive graph convolutions. To systematically study when graph informedness is beneficial in forecasting, we devise a continuous advection simulation on directed graphs, yielding synthetic spatio-temporal datasets with known ground-truth flow structure. We then evaluate INDEQS on two real-world tasks: river discharge forecasting on a hydrological network and traffic flow prediction on PeMS08. Across these synthetic and real-world benchmarks, outer informedness consistently improves mean absolute error over an uninformed NCDE with comparable parameter count, particularly on larger graphs, while inner informedness offers a more parameter-efficient alternative when strict adherence to a known adjacency is desired. A comparison of discrete convolutional and continuous-time decoders further shows that continuous decoders yield better accuracy and greater temporal flexibility on real-world tasks. An implementation of INDEQS and the advection simulation is available at this https URL.

[LG-12] Giskard: Byzantine Robust and Confidential Aggregation for Large-Scale Decentralized Learning

Link: https://arxiv.org/abs/2606.19129
Authors: Ousmane Touat,César Sabater,Mohamed Maouche,Sonia Ben Mokhtar
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 17 pages, with appendix

Abstract:Dealing simultaneously with confidentiality and Byzantine behaviors in decentralized learning is a challenging problem. Indeed, in decentralized learning, clients train a machine learning model while keeping their data locally and share their model parameters or gradients with a set of neighbors. While enforcing confidentiality calls for hiding the exchanged model parameters/gradients (e.g., by using cryptographic techniques), dealing with Byzantine contributions often requires inspecting the latter. Hence, most research works address these objectives separately. A recent line of work proposes to employ secure multi-party computation (MPC) to implement robust aggregators against model poisoning, thereby enforcing both confidentiality and Byzantine resilience. However, these solutions scale badly: they either require all-to-all communication between participants or delegate the entire computation to a small subset, whose computational and communication load grows proportionally with the size of the network. In this paper, we present Giskard, a protocol for confidential and Byzantine-robust decentralized aggregation. Giskard organizes n parties into a tree of committees of size O(\log n) and evaluates a coordinate-wise approximate median via a committee-adapted distributed binary search over the value domain, using BGW-style MPC within each committee. We assess Giskard both theoretically by proving its security and confidentiality properties and experimentally through extensive experiments involving up to one million participants. Compared to its closest competitors, Giskard reduces per-party communication complexity asymptotically while exhibiting comparable model utility under up to n/4 Byzantine parties. 
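The binary search over the value domain mentioned in the abstract can be illustrated with a plaintext, single-machine sketch. This is not the Giskard protocol: there are no committees and no MPC here, and the function names, the domain bounds `lo`/`hi`, and the tolerance `tol` are all hypothetical choices made for illustration. The point is only that each search step needs a single aggregate, the count of inputs at or below a threshold, which is the kind of quantity that could be computed without revealing individual values.

```python
from typing import List, Sequence


def approx_median(values: Sequence[float], lo: float, hi: float,
                  tol: float = 1e-6) -> float:
    """Approximate the lower median by binary search over [lo, hi].

    Assumes every value lies in [lo, hi]. Each step only uses the count
    of values <= mid, never the values themselves.
    """
    target = (len(values) + 1) // 2  # rank of the lower median
    while hi - lo > tol:
        mid = (lo + hi) / 2
        count = sum(1 for v in values if v <= mid)
        if count >= target:
            hi = mid  # the median is at or below mid
        else:
            lo = mid  # the median is above mid
    return hi


def coordinate_wise_median(updates: Sequence[Sequence[float]],
                           lo: float, hi: float,
                           tol: float = 1e-6) -> List[float]:
    """Apply the approximate median independently to each coordinate."""
    dims = len(updates[0])
    return [approx_median([u[d] for u in updates], lo, hi, tol)
            for d in range(dims)]


# Three honest updates and one outlier (illustrative numbers only).
updates = [[0.1, -1.0], [0.2, -1.2], [0.3, -0.8], [9.0, 9.0]]
aggregate = coordinate_wise_median(updates, lo=-10.0, hi=10.0)
```

In this toy run the outlier `[9.0, 9.0]` does not drag the aggregate away from the honest values, which is the usual motivation for median-style robust aggregation.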

[LG-13] JourneyFormer: Encoding Airbnb Guest Journey with Sequence Modeling KDD2026

Link: https://arxiv.org/abs/2606.19108
Authors: Daochen Zha,Chun How Tan,Xin Liu,Bin Xu,Han Zhao,Xiaowei Liu,Tracy Yu,Hui Gao,Huiji Gao,Liwei He,Stephanie Moyerman,Sanjeev Katariya
Subjects: Machine Learning (cs.LG)
Comments: Accepted by KDD 2026

Abstract:Sequence modeling has become increasingly popular in recommendation and ranking algorithms, owing to its capacity to model users’ historical behaviors and infer user intentions. Despite its theoretical simplicity, the practical deployment of a sequence model in production is non-trivial due to complexity of the sequence and sparse labels. For example, in Airbnb, guest sequences are often long, exploratory and complex, and we focus on booking labels, which are sparse. As such, we are often required to make various design decisions regarding data and modeling to strike a balance between effectiveness and scalability. This work delved into these production challenges and deployed JourneyFormer, a sequence modeling solution for search ranking at Airbnb. We detail crucial design considerations, covering aspects such as guest event selection, ID embeddings, model architecture, and label attribution. Additionally, we describe several tailored strategies to accelerate model training and inference. JourneyFormer has been successfully deployed within Airbnb’s production, where its effectiveness and impact have been evidenced not only by improved offline ranking metrics but also by significant gains in key business metrics through online A/B testing across 2 production surfaces.

[LG-14] Smoothness-Based Derandomization of PAC-Bayes Bounds

Link: https://arxiv.org/abs/2606.19105
Authors: Alexandre Lemire Paquin,Brahim Chaib-Draa,Philippe Giguère
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:We study PAC-Bayes derandomization for smooth loss functions. Our goal is to obtain generalization bounds that hold with high probability for deterministic predictors by exploiting smoothness properties of both the loss and the predictor class. We show that passing from the Gibbs predictor to the deterministic predictor at the posterior mean has a precise cost, given by the generalization gap of the Jensen gap class. We control this class through its Rademacher complexity, leading to bounds for deterministic predictors that involve flatness quantities expressed in terms of parameter Jacobians and Hessians of the score map. The framework applies to both bounded and unbounded smooth loss functions, and we specialize the results to linear predictors and smooth neural networks. Finally, the Jacobian and Hessian quantities appearing in the theory motivate a practical regularizer. For BatchNorm networks, we compute this regularizer with respect to effective BatchNorm weights obtained by folding the BatchNorm transformation into the adjacent affine weights. Experiments on CIFAR-10 illustrate the behavior of this regularizer under different batch sizes.

[LG-15] Model-Free Reinforcement Learning Control for Resilient Cyber-Physical Systems

Link: https://arxiv.org/abs/2606.19069
Authors: Hugo O. Garcés,Alejandro J. Rojas,Bernardo A. Hernández,Andrés Escalona,Jonathan M. Palma,Md. Rezwan Parvez,Bhushan Gopaluni,Sirish L. Shah
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: Accepted to the 23rd IFAC World Congress 2026

Abstract:This paper compares the performance of model-free controllers on a nonlinear system under cyberattacks, including false data injection and denial-of-service attacks. Four RL reward types are analyzed for accuracy, cost, and resilience. Results show that the Lyapunov reward offers the best resilience with low tracking error. Exponential mode also provides good trade-offs with acceptable resilience under moderate training conditions. Progressive and linear rewards converge faster but are less robust. RL-MPCs show strong steady-state resilience but require longer training times; RL-PID controllers are faster with significantly less training time. Proximal Policy Optimization outperforms Deep Deterministic Policy Gradient with a significant reduction in KPI variance. This study serves to highlight how well-designed RL rewards can improve performance and resilience against cyber threats.

[LG-16] Adaptive Speech-to-Spike Encoding for Spiking Neural Networks INTERSPEECH2026

Link: https://arxiv.org/abs/2606.19039
Authors: Taharim Rahman Anon,Jakaria Islam Emon
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Accepted at Interspeech 2026. This version is a preprint

Abstract:The mismatch between continuous acoustic signals and discrete event-driven processing remains a fundamental bottleneck for neuromorphic speech processing. Current systems typically rely on fixed spike encoders, forcing downstream Spiking Neural Networks (SNNs) to compensate for non-adaptive input representations. To address this, we present a learnable residual speech-to-spike encoder jointly trained end-to-end with a Recurrent Leaky Integrate-and-Fire (R-LIF) backbone. We validate this approach on the Google Speech Commands v2 (GSC-v2) benchmark, achieving up to 94.97% accuracy. Notably, the learned encoder remains highly parameter-efficient with a compact 35k-parameter variant that reaches 89.8%, matching or exceeding prior baselines that require an order of magnitude more parameters. Our encoder-focused analysis, including linear probing and gradient-residual inspection, indicates that the encoder does not target faithful signal reconstruction but instead learns task-aligned spike representations that enhance class separability. Finally, we benchmark bio-inspired, hardware-friendly credit assignment by comparing Direct Feedback Alignment (DFA) with surrogate-gradient BPTT under identical architectures and training conditions. We find that DFA reaches 91.5% accuracy, quantifying the performance trade-off of bio-inspired learning rules for modern neuromorphic audio.

[LG-17] Geometric and Stochastic Analysis of Discontinuities in Sparse Mixture-of-Experts ICML2026

Link: https://arxiv.org/abs/2606.19036
Authors: Tho Tran Huu,Huu-Tuan Nguyen,Thien-Hai Nguyen,Nhat-Tri Ho,Viet-Hoang Tran,Tho Quan,Tan Minh Nguyen
Categories: Machine Learning (cs.LG)
*Comments: ICML 2026 Spotlight

Abstract:Sparse Mixture-of-Experts (SMoE) architectures are now widely deployed in state-of-the-art language and vision models, where conditional routing allows scaling to very large networks. However, this very Top-k expert selection that enables conditional routing also renders the SMoE map inherently discontinuous. In the vicinity of these discontinuity surfaces, even inputs that are arbitrarily close may activate substantially different sets of experts, resulting in significantly different outputs. In this work, we give a rigorous geometric and stochastic analysis of these discontinuities. We first classify them by order, determined by the number of tied experts at a switching event. Using measure-theoretic slicing arguments, we establish asymptotic volume estimates for the thickened discontinuity surfaces, showing that lower-order discontinuity sets dominate, whereas higher-order ones occupy a vanishingly small relative volume. Next, modeling random perturbations in the input space via a diffusion process, we prove that the path eventually encounters a discontinuity, and moreover that the first hit almost surely occurs on an order-1 discontinuity with explicit finite-time probability bounds. We further derive occupation-time bounds that quantify the duration the random path spends in the neighborhoods of each discontinuity order. These theoretical results imply that inputs are more likely to lie near lower-order discontinuities. Motivated by this insight, we propose a simple smoothing mechanism that can be directly applied to existing SMoEs, softly incorporating experts near discontinuities; our analysis guarantees that the added computational overhead remains small while providing localized smoothing near discontinuities, and experiments across language and vision tasks show that smoothing not only enforces continuity of the SMoE map but also enhances empirical performance.
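
The discontinuity described in the abstract can be seen in a toy example. The sketch below uses a hypothetical one-dimensional router with two constant-valued experts; the names `top_k_experts` and `smoe_output`, the weights, and the plain averaging of the selected experts (instead of gate-weighted mixing) are all assumptions for illustration, not the paper's setup.

```python
def top_k_experts(scores, k):
    """Return the indices of the k largest router scores (ties broken by index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(order[:k])


def smoe_output(x, k=1):
    """Toy sparse MoE on a scalar input x.

    Hypothetical router: expert i scores the input as w[i] * x + b[i].
    Each toy expert returns a constant, and the selected experts are
    simply averaged.
    """
    w = [1.0, -1.0]
    b = [0.0, 0.0]
    expert_values = [10.0, -10.0]
    scores = [w[i] * x + b[i] for i in range(len(w))]
    chosen = top_k_experts(scores, k)
    return sum(expert_values[i] for i in chosen) / len(chosen)


if __name__ == "__main__":
    # Two inputs that differ by 2e-9 land on different experts.
    print(smoe_output(1e-9), smoe_output(-1e-9))
```

The two inputs sit on opposite sides of the tie surface `x = 0`, which in the abstract's terminology is an order-1 discontinuity because exactly two experts are tied there.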

[LG-18] Lifecycle-Aware Dynamic Analysis for Secure ML Model Execution

Link: https://arxiv.org/abs/2606.19023
Authors: Gabriele Digregorio,Marco Di Gennaro,Francesco Pastore,Stefano Zanero,Stefano Longari,Michele Carminati
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments:

Abstract:The growing reliance on pre-trained Machine Learning (ML) models has introduced new attack surfaces. Recent vulnerabilities demonstrate that malicious behavior can be embedded within model artifacts, often bypassing existing defenses. Current model-scanning solutions primarily rely on static, format-specific rules or known attack signatures, which limit their ability to generalize across frameworks and to detect novel exploitation paths. In contrast, we propose a solution that focuses on the effects an attack has on the host system executing the model and builds on foundational intuitions about ML model execution. In particular, we observe that ML models operate within well-defined lifecycle phases and that, within each phase, interactions with the host system are highly structured and predictable. We translate these intuitions into Moat, a dynamic lifecycle-aware approach for securing ML model execution, and instantiate this design in Re-Moat, our reference implementation. We evaluate Re-Moat across multiple ML frameworks using 77,974 real-world model artifacts from the Hugging Face Hub, 31 Proofs-of-Concept (PoCs) from CVEs, and 334 models from a state-of-the-art dataset, and compare it against state-of-the-art model-scanning solutions. Our results show that our approach detects all evaluated attack classes while maintaining a close-to-zero false-positive rate, validating our intuitions and motivating dynamic analysis for securing ML model execution.

[LG-19] DIPHINE: Diffusion-based Φ-ID Neural Estimator

Link: https://arxiv.org/abs/2606.18997
Authors: Simon Pedro Galeano Munoz,Mustapha Bounoua,Giulio Franzese,Pietro Michiardi,Maurizio Filippone
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Uncovering the true informational architecture of real-world complex systems requires disentangling how their components uniquely store, redundantly share, and synergistically integrate information over time. Integrated Information Decomposition (ΦID) is a framework for decomposing the information dynamics of multivariate systems into sixteen non-overlapping atoms that characterize redundant, unique, and synergistic modes of information storage, transfer, and integration. Existing methods to compute ΦID are restricted to Gaussian or discrete systems, preventing its application to continuous non-Gaussian dynamical systems. We address this limitation by proposing DIPHINE (Diffusion-based Φ-ID Neural Estimator), the first neural estimator that leverages score-based diffusion models to jointly estimate all the mutual information terms required by ΦID from a single amortized network, recovering the sixteen atoms through Möbius inversion. We provide a theoretical analysis of error propagation through the inversion, showing that the Jacobian of the mapping from mutual informations to atoms is integer-valued and that the synergy-to-synergy atom is provably the hardest to estimate. We demonstrate accurate recovery of ground-truth atoms on synthetic benchmarks, superior performance compared to established mutual information estimators, and the ability to extract physiologically interpretable information-dynamic structure on an application involving real data without any distributional assumptions.

[LG-20] EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts

Link: https://arxiv.org/abs/2606.18967
Authors: Minseo Kim,Minjae Lee,Seunghyuk Oh,Kevin Galim,Donghoon Kim,Coleman Hooper,Harman Singh,Amir Gholami,Hyung Il Koo,Wonjun Kang
Categories: Machine Learning (cs.LG)
*Comments: Project Page: this https URL

Abstract:Reinforcement learning (RL) has become a representative post-training paradigm for LLMs, enabling strong reasoning and agentic capabilities. However, rollout generation remains a dominant latency bottleneck because autoregressive sampling decodes responses sequentially and a small number of long-tailed generations often determine completion time. Speculative decoding (SD) offers a natural way to address this bottleneck, as it is a well-established technique for serving fixed LLMs that reduces latency by rapidly drafting tokens and accepting them through parallel verification while preserving the target-model distribution. However, its practical speedups do not directly carry over to RL rollouts: (i) the evolving target policy makes any fixed drafter increasingly mismatched with the policy’s output distribution; and (ii) active batch sizes shrink throughout rollout decoding, shifting decoding from compute-bound to memory-bound regimes where parallel verification can exploit underutilized compute. Therefore, accelerating RL rollouts requires both a drafter that remains effective under long, high-temperature generations from an evolving policy and system-aware use of SD that avoids compute-bound regimes. We present EfficientRollout, a system-aware self-SD framework designed to address this gap for RL rollouts. EfficientRollout induces a quantized drafter from the target model (i.e. self-speculative decoding), keeping it coupled to the evolving policy without separate drafter pretraining or online adaptation. It further coordinates a system-aware SD toggle policy with acceptance-aware draft-length adaptation, enabling speculation only in beneficial regimes while matching the drafting budget to evolving drafter quality. EfficientRollout reduces rollout and end-to-end latency by up to 19.6% and 12.7%, respectively, over an accelerated AR rollout baseline, while preserving final model quality.
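
The claim that speculative decoding preserves the target-model distribution rests on the standard draft-then-verify acceptance rule. The sketch below shows that rule for a single token over a tiny vocabulary. It is the generic speculative sampling step, not EfficientRollout's quantized drafter or toggle policy, and the name `speculative_step` and the example distributions are assumptions for illustration.

```python
import random


def speculative_step(p, q, rng):
    """One draft-then-verify step over a small vocabulary.

    p: target-model distribution, q: drafter distribution (lists summing to 1).
    Returns (token, accepted_flag). The returned token is distributed
    according to p regardless of q.
    """
    vocab = range(len(p))
    x = rng.choices(vocab, weights=q)[0]            # drafter proposes a token
    if rng.random() < min(1.0, p[x] / q[x]):        # verifier accepts it
        return x, True
    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual)[0], False


if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.7, 0.2, 0.1]   # hypothetical target distribution
    q = [0.3, 0.4, 0.3]   # hypothetical, mismatched drafter
    n = 20000
    counts = [0, 0, 0]
    accepted = 0
    for _ in range(n):
        tok, ok = speculative_step(p, q, rng)
        counts[tok] += 1
        accepted += ok
    print([c / n for c in counts], accepted / n)
```

The acceptance rate equals the overlap between the two distributions, the sum over tokens of `min(p, q)`. This is why a fixed drafter loses value as the policy drifts during RL training, which is the first obstacle the abstract names.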

[LG-21] Online Reward-Punishment Learning from Fixed-Channel Perceptual Event Streams without Environment Rewards

Link: https://arxiv.org/abs/2606.18963
Authors: Zirong Li
Categories: Machine Learning (cs.LG)
*Comments: 9 pages, 5 figures, 6 tables; 13-page technical supplement

Abstract:We study online reward-punishment learning when the environment provides no scalar reward or evaluative label. At each step the agent receives only a fixed-channel perceptual packet, and quantities such as pain, energy, contact, damage, or cognitive error are treated as perceptual dimensions whose valence must be inferred from transition consequences. OHIRL separates four roles: M_psi learns next-packet prediction, D_omega models residual dynamics, C_eta is a fixed internal post-transition trajectory evaluator, and B_xi learns to use the resulting value evidence for later policy updates and action scoring. C_eta uses a recovery-positive and persistence/growth-negative residual-regulation orientation; a coefficient-origin audit shows that equal-unit, raw-equal, and random monotone variants preserve more than 92% of the released top-action rankings, while sign inversion preserves 0%. The reward-free protocol exposes observation transitions while withholding environment rewards, delayed external evaluators, success labels, and action-goodness labels. A conditional error decomposition separates B_xi evidence-estimation error from residual policy-optimization error. In a 2x2-XOR packet task, medicine and chili acquire opposite value under visual XOR contexts, and the same pain or spice increase can be positive or negative depending on consequence structure; B_xi reaches 0.952 balanced reward-sign accuracy. In a full online-interleaved audit, M_psi reaches holdout R2=0.907, B_xi reaches 0.940 sign accuracy, and the policy reaches 0.979 optimal-action accuracy, while immediate packet scores, prediction-error rewards, shuffled targets, zero reward, and error-reduction controls collapse. Hidden-reward CartPole and Taxi controls, public-context no-leakage audits, and module-role ablations further test information boundaries and component necessity. 

[LG-22] Be Your Own Teacher: Steering Protein Language Models via Unsupervised Reward Optimization

Link: https://arxiv.org/abs/2606.18961
Authors: Lanqing Li,Shentong Mo,Yang Yu,Pheng-Ann Heng
Categories: Machine Learning (cs.LG)
*Comments: 24 pages, 2 figures, 13 tables

Abstract:Protein language models (PLMs) have emerged as powerful tools for controllable biomolecular design, yet their post-training adaptation typically relies on costly wet-lab validation or curated preference datasets. To overcome this supervision bottleneck, we introduce unsupervised reward optimization of PLMs, a comprehensive framework for steerable protein generation without ground-truth labels. Our key insight is that task-agnostic rewards, which combine intrinsic model uncertainty with extrinsic semantic consistency informed by protein representation models, exhibit strong correlation with controllability measures across base models and temperature regimes. Building upon this discovery, we propose two offline algorithms: Soft Reward Optimization (SRO) and Binarized Reward Optimization (BRO), which effectively maximize the classical RLHF objective induced by these proxy rewards. Extensive experiments on compositional out-of-distribution prompts demonstrate that both methods significantly outperform competitive baselines (DPO, KTO), while approaching oracle performance across multiple sampling temperatures, model scales and protein families. Moreover, PLMs fine-tuned with unsupervised rewards can achieve consistently higher coverage compared to their base model in pass@k evaluations. By enabling self-improvement of PLMs through their own generated experience, our framework provides a scalable pathway toward controllable biomolecular design in settings where labeled preferences or experimental feedback are scarce or unavailable.
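
The abstract reports coverage in pass@k evaluations but does not say which estimator is used. A common choice is the unbiased combinatorial estimator from code-generation benchmarks, sketched below. Applying it to protein generation, and what counts as a "correct" sample there, are assumptions for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no correct sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 100 generations, 5 of which satisfy the prompt.
    for k in (1, 10, 50):
        print(k, round(pass_at_k(100, 5, k), 4))
```

Higher pass@k at large k means the model's samples cover a solution more often. This is the sense in which the abstract compares fine-tuned models with their base model.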

[LG-23] GrapNet: A Programmable Dynamic-Architecture Neural Graph Substrate

Link: https://arxiv.org/abs/2606.18923
Authors: Zirong Li
Categories: Machine Learning (cs.LG)
*Comments: 8 pages, 1 figure, preprint

Abstract:Programmability is a missing first-class interface in fixed-tensor neural networks: editing a relation, freezing a subgraph, auditing a local function, or changing the execution backend should be an operation on the neural program rather than ad-hoc parameter surgery. GrapNet studies this graph-as-network setting. The graph is the architecture and executable program, not an input data graph. Each compute node owns its next-layer child references and a trainable allocation vector aligned with those references; deleting a relation physically removes both the child reference and the corresponding allocation coordinate. Structural rules and execution policies live outside the node core, so the same child-owned graph can be grown, frozen, structurally edited, grouped into trainable family blocks, routed by attention over active relations, or lowered to dense snapshots after topology stabilizes. GrapNet composes with conventional modules through a vector-valued parent interface: dense layers, CNN encoders, ResNet feature extractors, attention blocks, and transformer representations can all feed one sensory GrapNode per coordinate. The evaluation is organized as a programmability stress suite rather than as a new replay benchmark. In a matched ten-seed Split Fashion-MNIST study, a plastic GrapNet+ER head reaches 63.16 percent seen-class accuracy versus 51.08 percent for a parameter-larger dense MLP+ER under the same seen-class loss and replay memory, with paired delta 12.08 points and p=1.3e-5. On Split CIFAR-10 with a frozen ImageNet ResNet-18 encoder, the same substrate improves the online head over MLP-256 by 3.81 points, with p=0.0026. These results support GrapNet as an editable neural graph substrate whose core value is structural programmability with faithful execution views.

[LG-24] Some Complexity Results for Robustness Verification for Binarized Neural Networks

Link: https://arxiv.org/abs/2606.18918
Authors: Harshit Goyal,Sudakshina Dutta
Categories: Machine Learning (cs.LG); Computational Complexity (cs.CC)
*Comments:

Abstract:This paper studies the computational complexity of verification problems for Binarized Neural Networks (BNNs), where activations (and sometimes weights) are binary. We analyze two problems: satisfiability and robustness under uniform image occlusion. We show that BNN satisfiability is NP-complete via a reduction from the Boolean satisfiability problem (SAT), and that uniform occlusion induces a piecewise-constant structure in the network output, enabling a polynomial-time robustness-checking algorithm.
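
To illustrate what BNN satisfiability asks, the sketch below evaluates a tiny network with ±1 inputs, ±1 weights, and sign activations, then searches all inputs for one that produces a target output. The exhaustive search is exponential in the input size, consistent with the NP-completeness result. The function names, the integer biases, and the convention `sign(0) = +1` are assumptions for illustration, not the paper's formalization.

```python
from itertools import product


def sign(z):
    """Binarized activation; the convention sign(0) = +1 is assumed here."""
    return 1 if z >= 0 else -1


def bnn_forward(x, layers):
    """Evaluate a BNN. x: tuple of +/-1 inputs; layers: list of (weights, biases)."""
    h = list(x)
    for weights, biases in layers:
        h = [sign(sum(w * v for w, v in zip(row, h)) + b)
             for row, b in zip(weights, biases)]
    return h


def find_satisfying_input(layers, n_inputs, target):
    """Brute-force satisfiability check over all 2**n_inputs inputs."""
    for x in product((-1, 1), repeat=n_inputs):
        if bnn_forward(x, layers) == list(target):
            return x
    return None


if __name__ == "__main__":
    # One neuron computing AND in the +/-1 encoding: fires only for (1, 1).
    and_gate = [([[1, 1]], [-1])]
    print(find_satisfying_input(and_gate, 2, [1]))
    # Same weights with bias -3 can never output +1, so no input satisfies it.
    print(find_satisfying_input([([[1, 1]], [-3])], 2, [1]))
```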

[LG-25] Anomaly Detection for Sparse and Irregular Multivariate Time Series with Latent SDEs

Link: https://arxiv.org/abs/2606.18898
Authors: Martin Uray,Dominik Geng,Florian Graf,Stefan Huber,Roland Kwitt
Categories: Machine Learning (cs.LG)
*Comments: Preprint

Abstract:Multivariate time series anomaly detection (MTSAD) is critical for a wide range of application areas, such as industrial monitoring, cybersecurity, or healthcare. Real-world data is often sparse, irregularly sampled, or partially observed, yet existing methods assume uniformly sampled time series. We propose a generative approach based on Latent SDEs that projects the observed time series onto a continuous-time stochastic dynamical system. It can therefore handle missing observations and irregular sampling directly, while also naturally capturing the cyclic behavior that many real-world use cases inherently possess. Experiments on six anomaly benchmark datasets show that our proposed method ranks first among state-of-the-art baselines. We further demonstrate that our method remains robust under severe data sparsity, while performance significantly degrades for the tested baseline methods. These results highlight latent SDEs as a natural inductive bias for anomaly detection in multivariate time series, especially in the presence of real-world irregularities.

[LG-26] Strategic Feature Selection

Link: https://arxiv.org/abs/2606.18867
Authors: Jivat Neet Kaur,Pratik Patil,Divya Shanmugam,Emma Pierson,Michael I. Jordan,Nika Haghtalab,Meena Jagadeesan,Ahmed Alaa,Serena Wang
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY); Machine Learning (stat.ML)
*Comments:

Abstract:When algorithmic predictors inform resource allocation in high-stakes domains such as healthcare, these predictors must account for strategic manipulation of input features. The typical solution is to redesign the predictor itself to explicitly account for strategic interactions. In practice, however, decision makers are often constrained to adjusting coarser levers within existing prediction pipelines. For example, healthcare organizations often select which features to exclude based on perceived manipulability, while using standard regularization procedures to shrink the coefficients of retained features. In this work, we initiate a formal study of strategic classification through feature selection and its interaction with ridge regularization. Our main finding is that excluding individual features based on their manipulability alone is generally suboptimal. We provide a fine-grained characterization of the performance of a feature subset under optimal regularization, yielding new insights for policy design. Motivated by this characterization, we develop a practical algorithm for jointly choosing the feature set and the level of ridge regularization. Through a real-world case study on a healthcare payments benchmark, we illustrate how our algorithm can guide the design of coarse policy levers in practice. Our results provide a principled, practical framework for mitigating the effects of strategic behavior in algorithmic decision-making systems.

[LG-27] Investigating Inductive Biases for Machine Learning Emulation of Sudden Stratospheric Warmings in Idealised Isca Simulations

Link: https://arxiv.org/abs/2606.18857
Authors: Oskar Bohn Lassen,Simon Driscoll,Stephen I. Thomson,Sebastian Schemm,Francisco C. Pereira
Categories: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*Comments:

Abstract:Machine-learning emulators are increasingly used for weather prediction and have the potential to extend skill on subseasonal-to-seasonal timescales by learning dynamically important sources of predictability. A key challenge is whether the models can exploit predictability anchors, such as stratospheric variability, that influence tropospheric circulation beyond short lead times. We test how architectural inductive bias affects emulation of sudden stratospheric warming (SSW) dynamics using paired idealised Isca simulations that differ only in an imposed wave-2 heating perturbation. Across convolutional, transformer, and graph-based architectures trained for one-step prediction, model differences are modest when the stratosphere is dynamically quiet but widen substantially when SSW-like variability is active. Our results identify explicit three-dimensional vertical coupling as a key inductive bias for machine-learning emulation of stratospheric dynamics. However, Eliassen-Palm flux diagnostics show that low forecast error does not guarantee physically faithful wave-mean-flow interaction, with coherent errors remaining in stratospheric wave-driving structure.

[LG-28] Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation

Link: https://arxiv.org/abs/2606.18844
Authors: Zhilin Huang,Hang Gao,Ziqiang Dong,Yuan Chen,Yifeng Luo,Chujun Qin,Jingyi Wang,Yang Yang,Guanjun Jiang
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Self-distillation improves reasoning in large language models by using the model’s own rollouts as training signal, typically through implicit logit-level alignment that minimizes KL divergence toward a privileged target distribution. However, because this supervision is generated via uncontrolled sampling, it provides no diagnostic insight into the model’s specific errors or corrective guidance for its individual failure patterns. Consequently, the model learns to imitate a privileged distribution rather than receiving fine-grained corrections that pinpoint where and why its reasoning fails. In this paper, we propose Trajectory-Augmented Policy Optimization (TAPO), which advances self-distillation from implicit distributional alignment to explicit trajectory construction. During RL training, the model produces both correct and incorrect rollouts to the same query, and TAPO leverages this contrastive structure to construct micro-reflective corrections, new training trajectories that retain the model’s erroneous reasoning up to the point of failure, then insert a natural-language diagnosis and corrected reasoning guided by a correct reference from the same sampling group. Since each trajectory is anchored in the learner’s own prefix and solutions, the corrective signal preserves the model’s on-policy distribution to a greater extent than the position-wise alignment imposed by KL-based methods. To integrate these trajectories, TAPO introduces difficulty-aware candidate selection at the model’s capability boundary and decoupled advantage estimation to prevent gradient contamination. Experiments on AIME 2024, AIME 2025, and HMMT 2025 show that TAPO achieves consistent improvements over GRPO under the same number of training steps. Further analysis demonstrates that TAPO strengthens both first-pass reasoning and error-correction effectiveness.

[LG-29] Identifying Structural Biases from Causal Mechanism Shifts

Link: https://arxiv.org/abs/2606.18834
Authors: Praharsh Nanavati,Jilles Vreeken,David Kaltenpoth
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Causal discovery methods commonly assume that all data is independently and identically distributed (i.i.d.) and that there are no unmeasured variables affecting the system. In practice, these assumptions are often violated, leading to inaccurate inference. In this paper, we study how to identify hidden confounding and selection biases from causal mechanism shifts. In particular, we show that structural biases lead to dependent mechanism shifts. That is, by considering for which variables the mechanisms change given data from different environments, we can tell which variables are unbiased, which are subject to hidden confounding, and which are undergoing selection bias. We formalize this into an empirically testable criterion based on mutual information, and show under which conditions it identifies structural biases. To tell which nodes are subject to what kind of bias, we introduce the StruBI algorithm. Experiments on synthetic and real-world data show that StruBI works well in practice, accurately recovering affected variable sets and types of biases, outperforming the state-of-the-art by a wide margin.

[LG-30] Seed-Guided Semi-Supervised Clustering by A-Contrario Anomaly Detection

Link: https://arxiv.org/abs/2606.18833
Authors: Nassir Mohammad
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:This paper introduces a semi-supervised clustering framework grounded in the statistical duality between grouping principles and anomaly detection. We address the challenge of robust cluster definition in noisy environments, a task where partitioning algorithms often over-assign outliers and density-based methods remain sensitive to heuristic global parameters. Drawing on a-contrario statistical reasoning and Gestalt proximity principles, we define a cluster as a maximal subset of data points containing no anomalies relative to a null hypothesis of uniform randomness. Central to this approach is the Perception algorithm, which utilises a principled expectation-based threshold (\(\mathbb{E}\) 1) to identify outliers without manual parameter tuning. By treating clustering as the dual of anomaly detection, we employ an iterative "clustering-by-exclusion" mechanism. The algorithm is seed-guided, leveraging minimal user-provided labels to initialise robust cluster medians and form initial groups, which are subsequently expanded by admitting non-anomalous points. This approach naturally isolates fringe points, isolated noise, and emerging unknown clusters. We evaluate the method on synthetic and real-world benchmarks, including image and text datasets represented through raw, linear-reduced, and neighbourhood-preserving embeddings. Results demonstrate that with as few as 10–30 seeds per cluster, the proposed method achieves competitive and often very strong performance under a practical low-tuning benchmarking protocol, while maintaining linear scalability with respect to both observations and dimensionality for a fixed number of seeded clusters and iterations.

[LG-31] Learning Augmented Exact Exponential Algorithms

Link: https://arxiv.org/abs/2606.18807
Authors: Tatiana Belova,Yuriy Dementiev,Danil Sagunov
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*Comments:

Abstract:The field of learning-augmented algorithms has demonstrated that machine-learned predictions can bypass worst-case lower bounds across a wide range of problems. So far, however, the focus has been almost exclusively on polynomial-time algorithms, where predictions improve competitive ratios, approximation guarantees, or running times. In this paper, we raise the question of whether predictions can push the frontier of exact exponential-time algorithms for NP-hard problems. We answer this question affirmatively by proposing a general approach that augments an entire family of state-of-the-art exact algorithms for a variety of subset selection problems. We show that a noisy predictor that is only marginally better than random guessing suffices to provably reduce the search space, and that the resulting runtime speedup scales smoothly with the prediction quality. Importantly, our algorithms require only pairwise independence of predictions or, alternatively, do not require knowledge of the predictor’s accuracy; both settings are strictly weaker and more realistic than those typically assumed.

[LG-32] Online Distributional Prediction via Latent Cluster Geometry Under Drift and Corruption

Link: https://arxiv.org/abs/2606.18778
Authors: Navyansh Mahla,Prateek Chanda,Ganesh Ramakrishnan
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:

Abstract:Online learning in non-stationary streams is often formulated as tracking a point estimate, but many applications require predicting the full data-generating distribution. We study online distributional prediction under drift and adversarial corruption. Our approach represents each candidate law through a latent cluster geometry: a variable-size configuration of centers that organizes probability mass and induces a predictive distribution. A Gibbs quasi-posterior over these configurations yields an online predictor by posterior averaging, and the resulting variable-dimensional posterior can be sampled with reversible-jump MCMC. The method therefore avoids specifying a parametric streaming law while retaining a structured latent space for uncertainty, regularization, and comparison. We evaluate performance by cumulative Wasserstein-1 regret against the time-varying true law. The analysis separates two effects: corruption perturbs the loss-based posterior update, whereas drift makes long-horizon posterior memory stale. We address the latter with a restarted variant that temporally localizes the same quasi-Bayesian update. The resulting high-probability bounds decompose into a PAC-Bayesian complexity term, a corruption-sensitive posterior perturbation term, and a dynamic optimal-transport term driven by \(A_T^{\mathrm{OT}}=\sum_{t=2}^{T} W_2^2(p_{t-1}^{*},p_t^{*})\). Under bounded support, stable latent geometry, predictive-map regularity, oracle realizability, localized restart windows, sublinear transport action, and sublinear corruption budget, the restarted predictor achieves sublinear cumulative Wasserstein regret. These guarantees require no parametric model for the stream, drift mechanism, or corruption process.
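
As a rough illustration of the evaluation metric, the sketch below computes the Wasserstein-1 distance between two one-dimensional empirical distributions with equal sample counts, then sums it over time steps. Reading "cumulative regret" as a plain sum of per-step W1 errors is an assumption, as are the function names and the restriction to scalar, equal-size samples; the paper's exact definition may differ.

```python
def wasserstein1_1d(xs, ys):
    """W1 between two equal-size 1-D empirical distributions.

    For equal sample counts, the optimal coupling matches sorted samples,
    so W1 is the mean absolute difference of the order statistics.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("this sketch assumes non-empty, equal-size samples")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def cumulative_w1_error(predicted, true):
    """Sum of per-step W1 distances between predicted and true sample sets."""
    return sum(wasserstein1_1d(p, t) for p, t in zip(predicted, true))


if __name__ == "__main__":
    # Hypothetical stream: the true law shifts by +1 at every step, and the
    # predictor always replays the previous step's samples.
    true = [[t + 0.0, t + 1.0] for t in range(1, 5)]
    predicted = [[t - 1.0, t + 0.0] for t in range(1, 5)]
    print(cumulative_w1_error(predicted, true))
```

A predictor that always lags a steadily drifting law by one step accumulates error linearly in the horizon. This shows why the guarantee needs the sublinear transport-action condition listed in the abstract.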

[LG-33] RouteJudge: An Open Platform for Reproducible and Preference-Aware LLM Routing ICML2026

Link: https://arxiv.org/abs/2606.18774
Authors: Guannan Lai,Haoran Hu,Han-Jia Ye
Categories: Machine Learning (cs.LG)
*Comments: Accepted by Pluralistic Alignment Workshop at ICML 2026

Abstract:We present RouteJudge, an online pairwise preference evaluation framework for LLM routing systems, with a public platform available at this https URL. Different from model-level response evaluation, RouteJudge focuses on router-level decision quality. For each user query, multiple routing strategies independently recommend candidate models under the same model pool and budget constraints. The selected model responses are then presented to users through anonymous pairwise comparisons, and the resulting user preferences are attributed back to the routing strategies behind the compared responses. Each evaluation record stores the query, routing decisions, model responses, preference labels, cost, latency, and task metadata, enabling preference-aware, cost-aware, and task-conditioned analysis of LLM routers. To support the continuous expansion of routing methods in RouteJudge, we further release ORBIT (Optimal Routing and Budgeted Inference Toolbox), a modular and extensible toolbox that standardizes the end-to-end workflow of LLM routing. ORBIT provides unified interfaces for benchmark loading, query representation, router implementation, budget-aware evaluation, and method comparison, allowing researchers to develop and evaluate routing algorithms under consistent protocols. It also serves as the submission and integration layer for RouteJudge: researchers can implement routing methods within ORBIT, validate them on existing routing benchmarks, and submit compatible routers for online preference-based evaluation. The code of ORBIT is available at this https URL.

[LG-34] A Neural Network Framework for Geodesic-Like Curve Computation on Parametric Surfaces

Link: https://arxiv.org/abs/2606.18759
Authors: Sheng-Gwo Chen,Chen-Chang Peng
Categories: Computational Geometry (cs.CG); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*Comments: 22 pages, 16 figures, 8 tables

Abstract:The concept of geodesic-like curves was introduced by Chen in 2010 as a method for estimating shortest paths (geodesics) on parametric surfaces, with its convergence established theoretically. However, an efficient numerical computational framework has not yet been developed. In this paper, we propose an elegant and efficient approach for computing geodesic-like curves by leveraging deep learning and Physics-Informed Neural Networks (PINNs). Under the proposed framework, not only can single parametric surfaces be handled efficiently, but a broad class of complex parametric surfaces, including multi-surface systems with \(C^0\) or higher continuity and surfaces of revolution, can also be robustly addressed.

[LG-35] Trainable Photonic Measurement for Physics-Informed PDE Learning

Link: https://arxiv.org/abs/2606.18713
Authors: Jiale Linghu,Hao Dong,Yangshuai Wang
Categories: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*Comments:

Abstract:Photonic quantum machine learning offers a route to trainable physical representations built from phase, interference and measurement. However, its role in scientific machine learning remains largely unexplored. Physics-informed neural fields provide a natural setting, because differential equations require trial spaces that preserve phase, frequency and derivative structure. Here we introduce a photonic quantum neural field in which coordinates become trainable optical phases, are mixed by multi-photon Fock-space interference and are decoded from photon-number measurements. The photonic circuit is optimized as the neural-field representation itself, not as a fixed feature map or hardware accelerator. Photonic measurement is therefore a trainable representation on which the physics-informed residual is minimized. Across seven elliptic, wave, nonlinear dispersive and inverse PDE benchmarks, we observe a phase-complexity transition: classical coordinate and Fourier-feature networks suffice in smooth regimes, whereas the photonic field is most accurate when residual derivatives amplify phase mismatch. In the hardest regimes it gives the lowest errors, with margins reaching an order of magnitude and about one quarter of the trainable parameters of classical baselines. Frozen and shuffled controls, together with noise stress tests, attribute this gain to learned interference and stable Fock-probability readout under compound perturbations. These results identify photonic quantum measurement as a representation-learning principle for scientific machine learning.

[LG-36] Contextualizing Biological Language Models across Modalities via Logit-Space Contrastive Alignment

Link: https://arxiv.org/abs/2606.18703
Authors: Yanjun Shao, Yundi Chen, Yashvi Patel, Aurelien Pelissier, María Rodríguez Martínez
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments:

Abstract:Pretrained biological language models expose per-token probability distributions through masked-token prediction, providing the likelihood interface central to sequence design, variant scoring, and mechanistic interpretation. Yet these distributions are learned from broad unlabeled corpora and are not naturally conditioned on task-specific biological contexts such as interaction partners, cellular environments, or therapeutic interventions. Existing contextual matching methods often distort this interface through pooled embeddings, contrastive latent spaces, or task-specific prediction heads. We introduce LOGICA (Logit-space Contrastive Alignment), a framework for context-conditioned prediction that performs contrastive learning directly in output-logit space. Using gated cross-modal adapters compatible with each model’s native token head, LOGICA preserves the pretrained likelihood interface and converts contextualized token log-likelihoods into matching scores. Alignment is defined through context-sensitive token probabilities rather than proximity in a shared embedding space, enabling learning from sparse paired data across models with distinct vocabularies, without a shared tokenizer or decoder. LOGICA is particularly effective for mutation-local variant ranking, where comparisons reduce to context-conditioned likelihoods of mutant tokens at perturbed sites. Across protein–ligand binding, TCR–peptide activity, and drug-conditioned resistance prediction, LOGICA improves over prior state-of-the-art methods, including matched latent-contrastive and conditional MLM baselines, while retaining a token-level interface for interpretation and generation. On held-out-gene single-mutation drug-resistance prediction, LOGICA improves AUC from near-random latent-space baselines of $\sim 0.55$ to $\sim 0.65$.

[LG-37] Stealthy World Model Manipulation via Data Poisoning NEURIPS2026

Link: https://arxiv.org/abs/2606.18697
Authors: Yibin Hu, Xiaolin Sun, Zizhan Zheng
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Robotics (cs.RO)
Comments: 41 pages, 8 figures, 11 tables. Submitted to NeurIPS 2026

Abstract:Model-based learning agents use learned world models to predict future states, plan actions, and adapt to new environments. However, the process of updating world models from collected experience creates a training-time attack surface: adversarially poisoned fine-tuning trajectories can manipulate the learned dynamics and thereby corrupt downstream planning. In this paper, we propose SWAAP, the first two-stage data poisoning framework for learned world models. In the first stage, SWAAP identifies a harmful target world model that induces low-return behavior under planning while remaining close to clean dynamics, using first-order bilevel optimization enabled by a transition-gradient theorem. In the second stage, SWAAP realizes this target through stealth-constrained gradient matching, modifying only a limited fraction of fine-tuning transition targets so that the induced training gradients steer the victim model toward the adversarial target, while a prediction-error regularizer encourages the poisoned targets to remain close to the world model’s natural approximation error. To assess attack stealthiness, we evaluate defenses and detectability across three stages of the poisoning pipeline: pre-training detection of poisoned transitions, robust training during fine-tuning, and test-time monitoring of the resulting world model. Across diverse continuous-control tasks, SWAAP causes substantial performance degradation while keeping poisoned transitions close to clean data and evading the evaluated non-adaptive residual/CUSUM/TRIM-style defenses. These results reveal a practical vulnerability in world-model adaptation pipelines and highlight the need for robustness methods that protect both world-model training data and learned dynamics.

[LG-38] Robust and Interpretable Adaptation of Equivariant Materials Foundation Models via Sparsity-promoting Fine-tuning ICLR2026

Link: https://arxiv.org/abs/2606.18691
Authors: Youngwoo Cho, Seunghoon Yi, Wooil Yang, Sungmo Kang, Young-woo Son, Jaegul Choo, Joonseok Lee, Soo Kyung Kim, Hongkee Yoon
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
Comments: Accepted by ICLR 2026

Abstract:Pre-trained materials foundation models, or machine learning interatomic potentials, leverage general physicochemical knowledge to effectively approximate potential energy surfaces. However, they often require domain-specific calibration due to physicochemical diversity as well as mismatches between practical computational settings and those used in constructing the pre-training data. To address this, we propose a sparsity-promoting fine-tuning method that selectively updates model parameters by exploiting the structural properties of E(3)-equivariant materials foundation models. On energy and force prediction tasks across molecular and crystalline benchmarks, our method matches or surpasses full fine-tuning and equivariant low-rank adaptation while updating only $\sim 3\%$ of parameters, and in some cases as little as $\sim 0.5\%$. Beyond energy and force calibration, we further demonstrate task generalizability by applying our method to magnetic moment prediction and magnetism-aware total energy modeling. Finally, analysis of sparsity patterns reveals physically interpretable signatures, such as enhanced $d$-orbital contributions in transition metal systems. Overall, our results establish sparsity-promoting fine-tuning as a flexible and interpretable method for domain specialization of equivariant materials foundation models.

[LG-39] Fair Online Resource Allocation

Link: https://arxiv.org/abs/2606.18679
Authors: Christopher En, Yuri Faenza, Andrea Lodi, Gonzalo Muñoz
Categories: Data Structures and Algorithms (cs.DS); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 3 pages, 4 figures. To appear in the proceedings of EC 2026

Abstract:We study the problem of fair online resource allocation, motivated by applications such as refugee resettlement and airline scheduling, where agents arrive sequentially and must be assigned to facilities with limited capacities. We introduce a model that maximizes the overall welfare subject to resource constraints and a Lipschitz fairness requirement, which ensures that similar agents arriving in the same batch receive similar expected outcomes. We first analyze the offline problem, proving that the value of the optimal fair allocation is at least an $\Omega(1/\gamma)$ fraction of the optimal unfair allocation, where $\gamma$ is the fairness coefficient, thereby bounding the price of fairness. For the online setting, we propose an algorithm based on dual mirror descent that enforces fairness constraints within batches while estimating optimal dual variables. We prove that this algorithm achieves sublinear regret relative to the optimal offline fluid benchmark. Finally, we validate our theoretical results using real-world data from the Refugee Economies Programme, demonstrating the algorithm’s performance and examining the trade-offs between welfare maximization and fairness enforcement.

[LG-40] BLADE: Scalable Bi-level Adaptive Data Selection for LLM Training

Link: https://arxiv.org/abs/2606.18650
Authors: Jiaxing Wang, Deping Xiang, Jin Xu, Zirui Liu, Zicheng Zhang, Guoqiang Gong, Jun Fang, Chao Liu, Pengzhang Liu, Tongxuan Liu, Ke Zhang, Qixia Jiang
Categories: Machine Learning (cs.LG)
Comments:

Abstract:As Large Language Model (LLM) datasets scale to trillions of tokens, data selection has emerged as a critical frontier to filter out uninformative noise and construct adaptive learning trajectories. Beyond static heuristic filtering, advanced data selection methods for LLM training largely follow two paradigms, each with fundamental limitations. Influence-based methods provide principled bi-level objectives but require intractable inverse-Hessian computations, while excess-loss methods are computationally efficient but rely on a static reference model that becomes misaligned with the evolving proxy model during training. We propose BLADE (Bi-Level Adaptive Data sElection), a Hessian-free framework for data selection. BLADE reformulates the bi-level optimization problem underlying influence-based methods as a penalized single-level objective via Lagrange multipliers, avoiding inverse-Hessian computation while revealing a principled connection to excess-loss based data selection. The resulting objective recovers an excess-loss form but replaces the static reference model with a dynamic one that stays synchronized with training. Theoretically, we prove that this penalized formulation guarantees first-order convergence. For efficient online batch selection, we instantiate BLADE as a memoryless randomized block-coordinate Frank-Wolfe algorithm. Extensive experiments show that BLADE consistently outperforms state-of-the-art data selection baselines, providing a practical recipe for LLM training.

[LG-41] MetaboNet-Bench: A Multi-modal Benchmark for Glucose Forecasting in Type 1 Diabetes

Link: https://arxiv.org/abs/2606.18640
Authors: Nathaniel Jeffries, Miriam Wolff, Sam Royston, Elizabeth Healey, Caleb Mayer, David Klonoff, Michael Snyder, Tao Wang
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: main content in 10 pages with 5 figures; supplementary section with 11 more pages and 5 more figures

Abstract:Glucose forecasting algorithms are an important aspect of glycemic control management in type 1 diabetes. So far, the research community has developed numerous algorithms and models for forecasting. However, it is well-recognized that the lack of standardized model performance evaluation benchmarks makes fair comparison difficult and hinders further innovation, and thus benchmark standardization is in urgent need. Furthermore, many published glucose forecasting algorithms are limited to CGM data alone, ignoring other multimodal signals such as insulin dosing and carbohydrate intake. Here, we introduce MetaboNet-Bench, a benchmark for multimodal glucose forecasting for patients with type 1 diabetes that provides an extensible open-source evaluation framework for comparison of glucose forecasting algorithms that leverage glucose, insulin, and carbohydrate data. We then demonstrate its utility by benchmarking several recently published glucose forecasting models and a custom multimodal time-series model, representing different model architectures. The results show that the benefit of adding data modalities is conditioned on the complexity of the model and that incorporating more clinical metrics helps identify meaningful gaps to fill for future research.

[LG-42] PACT: Preserving Anchored Cores in Task-vectors for Model Merging

Link: https://arxiv.org/abs/2606.18627
Authors: Ningyuan Shi, Zhipeng Zhou, Hao Wang, Chunyan Miao, Peilin Zhao
Categories: Machine Learning (cs.LG)
Comments: 33 pages, 14 figures

Abstract:Model merging has emerged as a training-free alternative to multi-task learning, aiming to combine multiple task-specific fine-tuned models into a single multi-task model. Most existing model merging approaches follow the Task Arithmetic paradigm, which decomposes fine-tuned weights into pre-trained parameters and task vectors, and performs merging exclusively in the task-vector space. The effectiveness of this paradigm implicitly relies on the assumption that task-specific knowledge is encoded solely within task vectors. We argue that this assumption generally does not hold due to the intrinsic task preferences of pre-trained models. Specifically, we identify Load-Bearing Wall (LBW) dimensions, namely some task-critical knowledge that remains embedded in the pre-trained weights rather than being fully transferred into task vectors. We characterize LBW dimensions from both scalar-weight and subspace perspectives, thereby covering the major paradigms of existing model merging methods. Our analysis reveals that, by ignoring LBW dimensions, task-vector-based approaches fail to fully resolve task conflicts and may inadvertently damage task-specific knowledge encoded in the pre-trained model, leading to degradation. To address this issue, we propose PACT, which preserves the anchored task-specific cores (i.e., LBW dimensions) within task vectors by aligning their orthogonal complements with the subspace of the pre-trained weights. These aligned subspace components are then removed from the task vectors before applying existing model merging algorithms. Furthermore, we develop an efficient variant based on randomized SVD to improve scalability. PACT can be seamlessly integrated with existing methods. Extensive experiments across multiple benchmarks demonstrate that PACT consistently enhances mainstream model merging approaches and establishes new state-of-the-art performance.
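
For readers unfamiliar with the Task Arithmetic paradigm that the abstract builds on, a minimal sketch may help. It shows only the standard baseline (task vector = fine-tuned weights minus pre-trained weights; merged model = pre-trained weights plus a scaled sum of task vectors), not PACT itself. The function names and the scaling coefficient `lam` are illustrative assumptions, and weights are flattened to plain Python lists for brevity.

```python
def task_vector(finetuned, pretrained):
    """Task vector: element-wise difference between fine-tuned and pre-trained weights."""
    return [f - p for f, p in zip(finetuned, pretrained)]


def merge_task_arithmetic(pretrained, finetuned_models, lam=1.0):
    """Baseline Task Arithmetic merge: pretrained + lam * (sum of task vectors).

    `lam` is a hypothetical scaling coefficient used here for illustration only.
    """
    merged = list(pretrained)
    for finetuned in finetuned_models:
        tau = task_vector(finetuned, pretrained)
        merged = [m + lam * t for m, t in zip(merged, tau)]
    return merged
```

Note that the pre-trained weights pass through this merge untouched; only the task vectors are combined. That is the step the abstract argues can overlook task-critical knowledge held in the pre-trained weights.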

[LG-43] Towards Anomaly Detection on Relational Data

Link: https://arxiv.org/abs/2606.18621
Authors: Shiyuan Li, Yunfeng Zhao, Yue Tan, Qingfeng Chen, Yixin Liu, Shirui Pan
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Relational databases are widely used for managing structured data in real-world systems. Detecting anomalies from such relational data is crucial for identifying fraud, risks, and abnormal behaviors, yet remains under-explored. The key challenges lie in the intrinsic complexity of relational data: multi-table attributes are high-dimensional and heterogeneous, making sparse abnormal clues easy to overwhelm by normal or irrelevant information; and anomalies may further manifest as abnormal connection patterns across different foreign-key relations, which existing tabular and graph anomaly detection methods are ill-suited to capture. To address them, we propose RelAD, a reconstruction-based framework that captures anomalies from both attribute and relational edge reconstruction. RelAD contains two core modules: conditional sparse-gated attribute reconstruction, which suppresses redundant multi-table attributes and emphasizes abnormal semantic blocks, and dual-view multi-relational edge reconstruction, which detects relation-specific abnormal connections from both intrinsic and behavioral entity profiles. The resulting attribute and relational signals are integrated through a lightweight fusion module to produce the final anomaly score. We further construct 6 benchmark datasets with systematic anomalies, on which extensive experiments show that RelAD consistently outperforms other baselines while achieving competitive efficiency.

[LG-44] TS-Fault: Benchmarking Time Series Forecasters Against Structural Faults

Link: https://arxiv.org/abs/2606.18539
Authors: Yuyang Zhao, Lian Xu, Hao Miao, Chenxi Liu, Hao Xue
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Time series forecasting (TSF) underpins consequential decisions in energy, transportation, finance, and healthcare, yet TSF models are almost universally ranked by a single number (e.g., average error) on clean held-out data, under the implicit assumption that it predicts deployed reliability. However, real faults are not i.i.d. noise but structured events with temporal shape, broken cross-variable dependencies, regime change coupled with missingness, and causal propagation across a sensing pipeline. Treating TSF robustness as a data-quality problem, we present TS-Fault, a benchmark that evaluates forecasting models under explicit, parameterized fault scenarios with controllable semantic difficulty. TS-Fault organizes recurring failures into four modes along two orthogonal axes (observation- vs. mechanism-level; univariate vs. multivariate) and injects each fault into the most prediction-critical window via a unified importance score. This design enables robustness to be tested against the structures models actually rely on, rather than reduced to generic noise sensitivity. We evaluate 21 models across 6 datasets, 4 modes, and 5 difficulty levels under a paired clean/corrupt protocol. The results reveal three findings that contradict common leaderboard intuition: (i) clean-data accuracy anti-correlates with robustness; (ii) clean rankings are preserved under observation-level faults but reshuffled under mechanism-level faults; and (iii) all catastrophic failures occur under mechanism-level faults, with foundation models achieving the highest clean-data accuracy yet exhibiting the greatest fragility. The code is publicly available at this https URL.

[LG-45] Effects of sparsity and superposition on loss in simple autoencoders

Link: https://arxiv.org/abs/2606.18538
Authors: Mriganka Basu Roy Chowdhury, Eric McLaughlin Weiner
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 16 pages, 3 figures

Abstract:One of the major difficulties in the mechanistic interpretability of neural networks is the occurrence of polysemanticity, which suggests that each neuron is typically responsible for multiple different tasks, impeding a clean interpretation of their function. The seminal paper of Elhage et al. (2022) argues that this occurs due to superposition, a phenomenon where the neural network represents distinct features as non-orthogonal directions in a lower-dimensional space, a strategy that allows much greater compression of the data without sacrificing fidelity due to the feature sparsity of input vectors. Elhage et al. (2022) empirically validates these hypotheses in a rather natural and simple autoencoder with sparse inputs. The contribution of the present work is to analyze the mathematical basis for the occurrence and optimality of superposition, while rigorously corroborating some of their findings. In particular, we provide upper and lower bounds for the L2 reconstruction loss, tight in the very sparse regime, for power activation functions. A short list of interesting open problems are also included at the end.

[LG-46] Do as the Romans Do: Learning Universal Behaviors from Heterogeneous Agents

Link: https://arxiv.org/abs/2606.18537
Authors: Caleb Chang, Davin Win Kyi, Natasha Jaques, Karen Leung
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Humans often acquire new skills by observing others, since observed behaviors implicitly reveal how to act in an environment. However, observations drawn from a heterogeneous population introduce conflicting behavioral signals, making it difficult to determine which behaviors are worth imitating. We address this challenge with General Reward Inference and Disentanglement (GRID), a social learning method that extracts universally useful behaviors from a heterogeneous population of demonstrators pursuing different goals. GRID decomposes per-agent reward functions into a general reward, capturing behaviors shared across all agents, and specific rewards, capturing individual preferences and objectives. Training exclusively on the general reward provides a new paradigm of generalist pretraining. It yields a generalist agent that internalizes universal environmental competencies, such as safety and basic task proficiency, without the mode-averaging bias that afflicts standard learning from demonstration techniques. This generalist serves as a superior prior for fine-tuning to downstream tasks, including preferences unseen during training. Experiments across a synthetic basis function decomposition, multi-agent Craftax, and a continuous autonomous driving simulator (Highway-Env) confirm that GRID successfully disentangles reward structure in a semantically meaningful way, outperforms standard learning from demonstration baselines, and enables more efficient and stable specialization.

[LG-47] Hierarchical Attention via Domain Decomposition

Link: https://arxiv.org/abs/2606.18525
Authors: Stephan Köhler, Oliver Rheinbach
Categories: Machine Learning (cs.LG)
Comments: 20 pages, 10 figures

Abstract:We propose a hierarchical attention mechanism based on two-level overlapping Schwarz domain decomposition. The method is motivated by the observation that two-level Schwarz domain decomposition methods combine local subdomain corrections with a coarse level that communicates global, long-range information. We test its usefulness in the context of finite-dimensional operator learning using a simple, one-dimensional diffusion problem with homogeneous Dirichlet boundary conditions. Although elementary, this problem provides a controlled sequence-to-sequence setting in which the exact nonlocal solution operator is known. After discretization, learning the solution operator amounts to approximating the inverse of a symmetric positive definite matrix. As a baseline, we use a global softmax-free low-rank attention operator of the form $QK^T$. The proposed construction replaces this dense global factorization by a two-level additive structure: local low-rank attention blocks on overlapping subdomains are combined with a coarse attention block. The resulting operator has the form $M_\theta^{-1} = \Phi Q_0 K_0^T \Phi^T + \sum_{i=1}^{N} R_i^T D_i^{1/2} Q_i K_i^T D_i^{1/2} R_i$. Here $R_i$ restricts to an overlapping subdomain, $D_i$ is a partition-of-unity weight, and $\Phi$ is a coarse interpolation (or prolongation) matrix. Numerical experiments for synthetic Fourier right-hand sides indicate that the domain-decomposition attention operator is able to train faster and can give more accurate approximations than a global low-rank attention baseline while using significantly fewer parameters.
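
The two-level additive operator in the abstract can be applied to a vector without forming any dense matrix. The sketch below is only an illustration of that formula under stated assumptions: subdomains are given as lists of global indices (playing the role of R_i), the partition-of-unity weights D_i are taken as one over the number of subdomains covering each index, and the low-rank factors are stored as plain lists of rows. All function and argument names are hypothetical, and nothing here reflects the authors' implementation or training procedure.

```python
import math


def partition_of_unity(n, subdomains):
    """Assumed weights: d_i[j] = 1 / (number of subdomains containing index j)."""
    count = [0] * n
    for idx in subdomains:
        for j in idx:
            count[j] += 1
    return [[1.0 / count[j] for j in idx] for idx in subdomains]


def low_rank_apply(Q, K, v):
    """Compute Q (K^T v) for factors Q, K stored as lists of rows of length r."""
    r = len(Q[0])
    coeff = [sum(K[a][k] * v[a] for a in range(len(v))) for k in range(r)]
    return [sum(Q[a][k] * coeff[k] for k in range(r)) for a in range(len(Q))]


def two_level_apply(x, subdomains, weights, local_factors, Phi, coarse_factors):
    """Apply Phi Q0 K0^T Phi^T + sum_i R_i^T D_i^(1/2) Q_i K_i^T D_i^(1/2) R_i to x."""
    n = len(x)
    n_coarse = len(Phi[0])
    y = [0.0] * n

    # Coarse term: restrict with Phi^T, apply low-rank block, prolongate with Phi.
    Q0, K0 = coarse_factors
    x_coarse = [sum(Phi[j][c] * x[j] for j in range(n)) for c in range(n_coarse)]
    y_coarse = low_rank_apply(Q0, K0, x_coarse)
    for j in range(n):
        y[j] += sum(Phi[j][c] * y_coarse[c] for c in range(n_coarse))

    # Local terms: gather (R_i), weight, apply low-rank block, weight, scatter-add (R_i^T).
    for idx, d, (Q, K) in zip(subdomains, weights, local_factors):
        v = [math.sqrt(d[a]) * x[j] for a, j in enumerate(idx)]
        w = low_rank_apply(Q, K, v)
        for a, j in enumerate(idx):
            y[j] += math.sqrt(d[a]) * w[a]
    return y
```

With rank r per block, each local application costs on the order of r times the subdomain size, which gives some intuition for why such a structure can use fewer parameters than one global low-rank factorization.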

[LG-48] On the Residual Scaling of Looped Transformers: Stability and Transferability

Link: https://arxiv.org/abs/2606.18524
Authors: Shaowen Wang, Bingrui Li, Ge Zhang, Wenhao Huang, Shen Yan, Jian Li
Categories: Machine Learning (cs.LG)
Comments: 19 pages, 9 figures

Abstract:Looped (weight-tied) Transformers apply a shared residual block $N$ times ($h \leftarrow h + \varepsilon\,f(h)$, same $f$ at each step), increasing effective depth without adding parameters. Prior depth-scaling analyses prescribe $\varepsilon = 1/\sqrt{L}$ for depth-$L$ residual networks. We show that this is insufficient for looped architectures: weight sharing makes residual updates correlated across iterations, requiring the stronger scaling $\varepsilon = 1/N$. For multi-layer blocks ($L$ unique layers looped $N$ times), we derive a factored parameterization $\varepsilon = \lambda/(N\sqrt{L})$ that separates the two sources of growth: $1/N$ controls the within-layer loop correlation, and $1/\sqrt{L}$ controls the across-layer variance. A key consequence is that the optimal learning rate depends only on the number of unique layers $L$, not on the loop count $N$, enabling direct hyperparameter transfer from small to large $N$ without retuning. Experiments on looped Transformers confirm that $1/N$ scaling improves trainability and yields better loss than $1/\sqrt{N}$ scaling across loop counts.
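
A toy calculation can make the correlation argument concrete. The sketch below assumes the most extreme case of correlated updates, a scalar state with f(h) = h, so that every loop pushes in the same direction. This is an illustrative assumption, not the paper's analysis. Under it, N loops multiply the state by (1 + eps)^N, which stays below e for eps = 1/N but grows roughly like exp(sqrt(N)) for eps = 1/sqrt(N). The helper `factored_eps` simply evaluates the factored parameterization quoted in the abstract; both function names are hypothetical.

```python
import math


def looped_residual_growth(num_loops, eps):
    """Iterate h <- h + eps * f(h) with the toy choice f(h) = h, starting from h = 1."""
    h = 1.0
    for _ in range(num_loops):
        h = h + eps * h  # perfectly correlated update (assumed worst case)
    return h


def factored_eps(lam, num_loops, num_layers):
    """Residual scale eps = lam / (N * sqrt(L)) from the abstract's parameterization."""
    return lam / (num_loops * math.sqrt(num_layers))


for n in (4, 64, 1024):
    bounded = looped_residual_growth(n, 1.0 / n)
    unbounded = looped_residual_growth(n, 1.0 / math.sqrt(n))
    print(n, round(bounded, 4), f"{unbounded:.3e}")
```

In this toy setting the first column of results stays bounded as N grows while the second explodes, which is the qualitative behaviour the abstract attributes to correlated residual updates.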

[LG-49] N(CO)2: Neural Combinatorial Optimization with Chance Constraints to Solve Stochastic Orienteering

Link: https://arxiv.org/abs/2606.18514
Authors: Anas Saeed, Marcos Abel Zuzuárregui, Stefano Carpin
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:

Abstract:Neural combinatorial optimization (NCO) offers a promising alternative to traditional heuristic-based methods for solving complex graph optimization problems by proposing to learn heuristics through data. This class of problems frequently arises in automation, as it can be used to model a variety of applications. While NCO has been extensively studied for deterministic combinatorial optimization problems, there are only a few works that aim to solve stochastic combinatorial optimization problems. In this work, we present N(CO)$^2$: Neural Combinatorial Optimization with Chance cOnstraints to solve the Stochastic Orienteering Problem (SOP) without the use of hand-crafted heuristics. By integrating a reinforcement learning (RL) framework, the model optimizes path selection under uncertainty, effectively balancing exploration and exploitation. Empirical results demonstrate that our method generalizes well across diverse SOP instances, achieving competitive performance compared to the state-of-the-art mixed-integer linear program (MILP) for the task. The proposed approach reduces human effort in heuristic design while enabling adaptive and efficient decision-making in uncertain environments.

[LG-50] Concept Modulation Models: A Unified Framework for Identifiability and Extrapolation

Link: https://arxiv.org/abs/2606.18509
Authors: Soheun Yi, Yizhou Lu, Chandler Squires, Pradeep Ravikumar
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Reliable generalization in conditional latent variable models requires understanding both identifiability and extrapolation: how observed variation across attributes determines latent structure, and how that structure determines distributions at unseen attributes. However, existing identifiability and extrapolation guarantees are largely model-specific, with separate analyses in nonlinear ICA, causal representation learning, perturbation modeling, and related conditional latent variable models. We introduce concept modulation models (CMMs), an attribute-indexed class of conditional generative models with structure $A \to \Lambda \to C \to X$, where attributes select modulators, modulators induce latent concept laws, and concepts generate observed features. CMMs lift transition-based identifiability to conditional settings by showing that feature agreement on observed attributes induces a latent concept transition constrained by the CMM class. We express these constraints through attribute potentials, log-density ratios between attribute-conditioned concept laws, separating the generic lifting step from model-specific rigidity arguments. The same potentials control extrapolation: agreement at unseen attributes holds exactly when the transported attribute-potential identities extend to those attributes. This yields algebraic extrapolation criteria, identifies the common potential-based proof objects behind several existing identifiability and extrapolation results, and, when combined with the model-specific rigidity arguments in those works, recovers their stated conclusions.

[LG-51] Beyond AHI: An Interpretable Causal-Discovery-Guided Framework for Sleep Recovery in Connected Health ALT

Link: https://arxiv.org/abs/2606.18506
Authors: Saba A. Farahani, Elahe Khatibi, Manoj Vishwanath, Amir M. Rahmani, Hung Cao
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP); Applications (stat.AP)
Comments: 6 pages, 2 figures, 2 tables. Accepted at the 2nd Workshop on Sensing and Computing for Smart and Connected Health (SCH), co-located with IEEE/ACM CHASE 2026

Abstract:Objective sleep assessment relies on polysomnography (PSG), yet clinical impact is often better reflected in patient-reported outcomes (PROs) such as sleepiness and fatigue. Existing summary indices, including the Apnea-Hypopnea Index (AHI), provide limited insight into the multidomain physiology underlying functional recovery. We propose an interpretable, causal-discovery–guided framework for deriving a hierarchical Sleep Recovery Score (SRS) from multimodal PSG. Using two large population cohorts (MESA: n=1540; MrOS: n=825), we apply directed acyclic graph (DAG) learning to identify candidate physiological drivers spanning respiratory burden, hypoxic burden, sleep fragmentation, sleep architecture, and autonomic regulation. Although derived from clinical PSG, these domains map naturally to sensing streams increasingly available in connected health technologies, including wearable ECG, oximetry, and sleep-stage estimation devices. To preserve mechanistic plausibility, we introduce a two-stage screening process that combines physiology-based constraints with constrained LLM-assisted auditing to identify and remove structural confounders and construct-overlapping variables. Across cohorts, these five domains emerge as recurrent physiological domains associated with recovery, and the resulting SRS shows up to $2.5\times$ stronger alignment with perceived recovery than AHI. By linking multimodal sleep physiology to patient-centered outcomes through an interpretable, bias-aware, and domain-structured framework, this work provides a practical foundation for recovery modeling across both clinical sleep studies and emerging smart and connected health settings.

[LG-52] Quantum Annealing Enhanced Reinforcement Learning for Accurate Remaining Useful Lifetime Prediction

Link: https://arxiv.org/abs/2606.18503
Authors: Manoranjan Gandhudi, Arunkumar V., G. R. Anil, Gangadharan G. R
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 29 pages, 6 figures, 12 tables

Abstract:Remaining useful life (RUL) estimation is central to predictive maintenance, where an unplanned failure can cost far more than the asset itself. Statistical degradation models miss the strong nonlinearity of real systems, and data-driven models often converge to suboptimal solutions in high-dimensional, non-convex search spaces. We propose a Quantum Annealing enhanced Q-Learning (QAQL) framework that couples the sampling behaviour of quantum annealing with the sequential decision making of Q-learning. Each Q-value update is encoded as a small quadratic unconstrained binary optimization (QUBO) whose ground state is the greedy action; rather than acting as a deterministic optimizer, the annealer returns a distribution over near-optimal actions across many reads, and this stochastic action selection supplies the exploration that curbs premature convergence on nonlinear degradation trajectories. The QUBO is solved on the D-Wave Advantage system using minor embedding, with the annealer woven into the reinforcement-learning loop rather than bolted on after training. We validate QAQL on two public benchmarks: the NASA C-MAPSS turbofan engine datasets and a device-fleet predictive maintenance dataset. Averaged over many independent runs and across six error metrics, QAQL outperforms the classical and quantum baselines considered in this study, with statistically significant improvements. The results indicate that quantum annealing is a usable, not merely theoretical, optimizer inside a reinforcement-learning loop for industrial predictive-maintenance applications.
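
To illustrate what "a small QUBO whose ground state is the greedy action" could look like, here is a minimal sketch under loudly stated assumptions. The one-hot encoding with a quadratic penalty is a common textbook construction chosen for illustration, not necessarily the encoding used in the paper, and the brute-force enumeration stands in for the annealer. No D-Wave API is called, and the stochastic sampling of near-optimal actions described in the abstract is not modelled. All names are hypothetical.

```python
import itertools


def greedy_action_qubo(q_values, penalty):
    """QUBO for: minimize -sum_a q_a x_a + penalty * (sum_a x_a - 1)^2 over binary x.

    Using x_a^2 = x_a and dropping the constant, the linear coefficient is
    (-q_a - penalty) and each pair (a, b) has coefficient 2 * penalty.
    Assumption: `penalty` exceeds the largest |q_a| so the minimum is one-hot.
    """
    n = len(q_values)
    linear = [-q - penalty for q in q_values]
    quadratic = {(a, b): 2.0 * penalty for a in range(n) for b in range(a + 1, n)}
    return linear, quadratic


def energy(x, linear, quadratic):
    """QUBO energy of a binary assignment x."""
    e = sum(coef * xi for coef, xi in zip(linear, x))
    e += sum(coef * x[a] * x[b] for (a, b), coef in quadratic.items())
    return e


def brute_force_ground_state(linear, quadratic):
    """Enumerate all binary assignments (classical stand-in for the annealer)."""
    n = len(linear)
    return min(itertools.product((0, 1), repeat=n),
               key=lambda x: energy(x, linear, quadratic))
```

For example, calling `greedy_action_qubo([0.1, 0.7, 0.3], penalty=2.0)` and passing the result to `brute_force_ground_state` selects the one-hot vector for the second action, the one with the largest Q-value.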

[LG-53] The Illusion of Improvement: Reject Inference Strategies in Credit Scoring ECML KDD2026

Link: https://arxiv.org/abs/2606.18479
Authors: Bruno Scarone, Ricardo Baeza-Yates
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments: Accepted to ECML PKDD 2026 (Research Track)

Click to view abstract

Abstract:Reject inference methods are widely used to mitigate survival bias in credit scoring, yet their effectiveness remains poorly understood. We systematically evaluate several such methods and uncover a structural failure mode: in a natural retraining cycle, models whose accuracy improves while recall collapses create an illusion of improvement that leads practitioners to believe the system is getting better when, in fact, its rejection quality – the ability to correctly screen out defaulters – is deteriorating. We then propose a controlled exploration strategy that breaks the feedback loop without statistical assumptions: the lender deliberately approves a fraction of rejected applicants and observes their true outcomes. We show that accuracy and rejection quality give opposite recommendations on whether to explore: accuracy favors no exploration, while rejection quality improves with it, confirming that standard evaluation metrics are misleading under selection bias. Even minimal exploration rates (2–5%) prove sufficient in our experiments to diagnose the severity of the feedback loop at near-zero cost. Our findings are consistent across two machine learning methods and three real-world datasets, and suggest that standard evaluation protocols are inadequate for assessing models trained under survival bias.
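
The gap between accuracy and rejection quality, and the controlled exploration idea, can be shown with toy numbers. The sketch below is an assumption-laden illustration: the population, the threshold, the `decide` policy, and the 5% rate are hypothetical and are not taken from the paper's experiments.

```python
import random


def accuracy(pred_reject, defaulted):
    """Fraction of applicants whose reject/approve call matches the outcome."""
    return sum(p == d for p, d in zip(pred_reject, defaulted)) / len(defaulted)


def rejection_recall(pred_reject, defaulted):
    """Fraction of true defaulters that the model rejects."""
    total = sum(defaulted)
    hits = sum(p and d for p, d in zip(pred_reject, defaulted))
    return hits / total if total else 0.0


def decide(score, threshold, explore_rate, rng):
    """Return (approved, explored) for one applicant.

    A would-be rejection is overridden with probability explore_rate so the
    lender can observe the true outcome (illustrative policy, not the paper's).
    """
    if score >= threshold:
        return True, False
    explored = rng.random() < explore_rate
    return explored, explored


# Toy population: 5 defaulters out of 100, and a model that rejects nobody.
defaulted = [True] * 5 + [False] * 95
never_reject = [False] * 100
acc = accuracy(never_reject, defaulted)
rec = rejection_recall(never_reject, defaulted)

# Explore roughly 5% of 10,000 below-threshold applicants.
rng = random.Random(0)
explored_count = sum(decide(0.1, 0.5, 0.05, rng)[1] for _ in range(10_000))
```

In this toy case the model that never rejects reaches 95% accuracy while catching none of the defaulters, which is the kind of divergence the abstract warns about.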

[LG-54] Mixed-Precision Communication-Avoiding SGD for Generalized Linear Models on GPUs

Link: https://arxiv.org/abs/2606.18463
Authors: Aditya Devarakonda, Irene Simó Muñoz, Giulia Guidi
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Distributed stochastic gradient descent (SGD) is limited by communication rather than computation, since each iteration requires an AllReduce across processes. Communication-avoiding SGD (CA-SGD) amortizes communication over s iterations by replacing s consecutive AllReduces with a single AllReduce of an sb × sb Gram matrix, trading more computation and bandwidth for fewer synchronization points. Modern GPUs with matrix hardware and reduced-precision formats offset this by accelerating the Gram GEMM and shrinking BF16 traffic. We study mixed-precision CA-SGD for generalized linear models on NVIDIA GPUs. Our finite-precision analysis decomposes the local rounding error of one CA-SGD outer iteration into nine independent precision choices, depending on the hardware only through its low-precision unit roundoffs, so the resulting recipes transfer in principle across GPU generations. The recipe stores the input matrix and margin vector in low precision, computes the Gram matrix from low-precision inputs with high-precision accumulation, communicates it in high precision, and performs the inner recurrence and weight updates in high precision. On NERSC Perlmutter A100 GPUs, mixed-precision CA-SGD matches FP32 SGD loss within 0.5% on logistic, linear, and Poisson problems and reaches 5.1–6.8× speedup over FP32 SGD on epsilon, SUSY, HIGGS, synth, and Poisson-synth. Our software is available at this https URL

[LG-55] Task-Restricted Symmetries in Recurrent Weight Space ICML2026

Link: https://arxiv.org/abs/2606.18457
Authors: Simon Dräger
Categories: Machine Learning (cs.LG)
Comments: 6 pages, 2 figures. Accepted at the ICML 2026 Workshop on Weight-Space Symmetries

Click to view abstract

Abstract:Recurrent networks can contain substantial functional redundancy in weight space: changing a recurrent matrix may leave the input-output rollout nearly unchanged on a task distribution, while similar-scale changes can destroy the same behavior. We study this redundancy in one-layer tanh RNNs using ordered real Schur coordinates. The Schur form separates spectral blocks from directed nonnormal couplings, giving a diagnostic basis for structured ablations that keep the input and readout maps fixed. In a fixed-length copy task, selected nonnormal Schur couplings can be removed with little loss in some trained solutions, whereas other couplings are necessary for accurate autonomous replay. Across flip-flop, sine generation, and context-dependent integration, the loss-preserving ablation profile varies across tasks and trained solutions. These results identify candidate approximate functional invariances, not universal symmetries of recurrent weight space. Schur-coordinate ablations provide a practical diagnostic for which structured perturbations preserve a trained recurrent solution and which ones disrupt its computation.

[LG-56] A Cross-Model VLM-Judge Protocol for Single-Image 3D Mesh Quality (and Why Cheap Proxies Fall Short)

Link: https://arxiv.org/abs/2606.18451
Authors: Ali Asaria, Tony Salomone, Deep Gandhi
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Single-image-to-3D generators are improving quickly, but there is no agreed, human-free way to tell whether one generated mesh is better than another. Practitioners commonly rely on cheap automatic proxies (render-space CLIP similarity and mesh geometry-validity statistics), yet how well these track perceived quality is unestablished. We make two contributions. First, we propose and validate a reproducible VLM-judge evaluation protocol: a fixed 24-view headless render rig, two independent vision-language judge families, and a mandatory position-bias correction that queries both presentation orders and keeps only order-consistent verdicts. The two judge families agree substantially with each other (Cohen’s kappa = 0.66), well above the chance-agreement floor. Second, using this protocol as the reference, we show the cheap proxies do not substitute for it. Geometry validity is only a weak signal on average (because, as we show, it is bimodal) and stays below our pre-registered target, while render-CLIP is at chance. A learned Bradley-Terry head collapses onto a single manifoldness statistic (giving render-CLIP a negative weight) and matches geometry-only exactly, so learning the feature weights buys nothing. The proxy is also bimodal: it is significantly above chance on contrasts with visible geometric defects but at chance on ambiguous contrasts, consistent with geometry validity tracking the judge only when the defect is visually salient. We therefore recommend the VLM-judge protocol as a reliable, reproducible evaluator under the conditions tested (two feed-forward generators on Google Scanned Objects, with a face-drop degradation regime) and advise against geometry/CLIP proxies as optimization targets.
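
The position-bias correction and the inter-judge agreement statistic described above can be sketched in a few lines. This is an illustration under stated assumptions: `judge` is a hypothetical callable that returns "first" or "second" for a pair of items, and the labels in the kappa example are invented. Only the formula for Cohen's kappa is standard.

```python
def order_consistent_verdict(judge, mesh_a, mesh_b):
    """Query the judge in both presentation orders.

    Returns "a" or "b" when both orders agree on the winner, otherwise None
    (the pair is dropped as order-inconsistent).
    """
    forward = judge(mesh_a, mesh_b)
    backward = judge(mesh_b, mesh_a)
    winner_forward = "a" if forward == "first" else "b"
    winner_backward = "b" if backward == "first" else "a"
    return winner_forward if winner_forward == winner_backward else None


def cohens_kappa(labels_1, labels_2):
    """Cohen's kappa: (observed - chance agreement) / (1 - chance agreement)."""
    n = len(labels_1)
    categories = set(labels_1) | set(labels_2)
    observed = sum(x == y for x, y in zip(labels_1, labels_2)) / n
    expected = sum(
        (labels_1.count(c) / n) * (labels_2.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# A judge that always prefers whatever is shown first is filtered out.
biased_judge = lambda first, second: "first"
# A judge that prefers the higher quality score is order-consistent.
quality_judge = lambda first, second: "first" if first > second else "second"

dropped = order_consistent_verdict(biased_judge, 0.9, 0.2)
kept = order_consistent_verdict(quality_judge, 0.9, 0.2)
kappa = cohens_kappa(["a", "a", "b", "b"], ["a", "a", "b", "a"])
```

Dropping inconsistent pairs trades coverage for reliability: a purely position-driven judge contributes no verdicts at all rather than noisy ones.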

[LG-57] Beyond Prediction: Tail-Aware Scheduling for LLM Inference

Link: https://arxiv.org/abs/2606.18431
Authors: Yueying Li, Yuanfan Chen, Jiayang Chen, Esha Choukse, Haoran Qiu, G. Edward Suh, Rodrigo Fonseca, Ziv Scully, Udit Gupta
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Click to view abstract

Abstract:LLM serving exhibits extreme length variability, making size-based scheduling difficult in practice. Recent LLM schedulers approximate SJF/SRPT using predicted decode lengths or ranks and primarily report mean-centric metrics such as TTFT and TBT. We show that these prediction-driven policies can be fragile under distribution shifts, bursty arrivals, and GPU memory pressure, while offering limited control over the tail latency (P90-P99) that dominates user experience, even with perfect decode-length knowledge. We introduce a distribution-aware, prediction-free scheduling framework that replaces explicit length prediction with soft priority boosting driven by lightweight statistical signals. Our design co-optimizes scheduling and cache-aware preemption to account for memory-coupled decode dynamics across workload mixes. Evaluated on production and open-source traces, our method reduces P99 TTLT by up to 35-50% relative to SRPT with perfect length knowledge and reduces TTFT by 34-47% across workloads, including reasoning-heavy and chat-heavy tasks. These results demonstrate a robust alternative for optimizing tail latency in online LLM serving.

[LG-58] Signature filtering: a lightweight enhancement for statistical watermark detection in large language models

Link: https://arxiv.org/abs/2606.18430
Authors: Chih-Duo Hong, Yen-Pang Chen, Fang Yu
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:

Click to view abstract

Abstract:Statistical watermarks help organizations attribute large language model (LLM) outputs, yet existing detectors often struggle when watermark signals are weak, texts are repetitive, or watermarks are edited. We propose signature filtering, a detection-time module that enhances watermark detection without modifying watermark embedding and text generation. It learns a small set of "signature" tokens whose presence makes watermark tests unreliable, and removes these tokens before detection. The signatures are obtained by solving a mixed-integer linear program on a small training set, with constraints that maximize the true positive rate. We additionally derive finite-sample and asymptotic bounds under several attacker models (color-blind, color-adaptive, and distributionally correlated). On four well-known watermark families (Kgw, Sweet, Unigram, Exp), four benchmark corpora (C4, MBPP, HumanEval, Code-Search-Net), and six LLMs (Opt-1.3b, Opt-6.7b, Llama2-13b, Llama3.1-8b, Qwen2.5-14b, Phi-3-medium-14b), 2- and 3-gram signatures raise detection rates in weak-signal and low-entropy settings from 8–31% without filtering to 78–99% with filtering, while keeping false positives controllable and often negligible. In stress tests where we scramble sentences and perturb 25–50% of tokens by dilution, deletions, and substitutions, 2-gram filters for Kgw-style watermarks preserve most of the clean-text detection gains, often matching or outperforming the advanced WinMax watermark detector. Signature filtering thus provides a simple, scalable, and model-agnostic add-on to strengthen watermark-based provenance checks for LLM text in information processing workflows.

[LG-59] Measurement noise limits the advantage of nonlinear models over linear models in biomedical prediction

Link: https://arxiv.org/abs/2606.18420
Authors: Marc-Andre Schulz, Kerstin Ritter
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:On biomedical tabular data, flexible models such as deep networks, gradient-boosted trees, and kernel methods are repeatedly matched or beaten by linear and logistic regression given the same features. The usual reaction is to treat this as a model-side shortfall, to be fixed with more data, a better architecture, or tuning, on the assumption that the nonlinear structure is there and the model has failed to capture it. We argue that these fixes cannot help when the binding limit is the measurement rather than the model, as it frequently is in biomedicine. Additive noise blurs the population-optimal predictor, and because blurring removes a function’s fine, rapidly varying detail before its broad shape, it erases nonlinear structure faster than linear structure. A degree-k interaction is attenuated by the k-th power of feature reliability, while the linear part is attenuated only once. At the reliabilities typical of biomedical measurement, the nonlinear advantage can vanish even when the underlying biology is strongly nonlinear, and what the noise removes cannot be recovered by a larger cohort or a more flexible model, only by better measurement. The nonlinearity is hidden, not absent, and a tie between linear and flexible models is not by itself a verdict on the biology. These pieces are classical, drawn from measurement-error statistics, psychometrics, and Gaussian analysis, and we assemble them into an exact excess-risk identity. Measurement reliability is one of three conditions, alongside sample size and feature representation, that must align for a flexible model to help, and together they leave only a narrow window that most biomedical tasks fall outside. Across 140 UK Biobank tasks, the gap between flexible and linear models, where it exists, carries the predicted noise signature, and the three conditions can be separated by intervention but not by a benchmark alone.
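
The attenuation rule stated in the abstract (a degree-k interaction shrinks by the k-th power of feature reliability) is easy to tabulate. The snippet below is a small arithmetic sketch, assuming every feature shares the same reliability r; the reliability values are illustrative and not taken from the paper.

```python
def attenuation(reliability, degree):
    """Attenuation factor for a degree-k term, per the abstract: r ** k.

    Assumes all features in the interaction share one reliability value.
    """
    return reliability ** degree


# Compare how much of the linear part and of a cubic interaction survives.
table = {
    r: {"linear": attenuation(r, 1), "cubic": attenuation(r, 3)}
    for r in (0.9, 0.7, 0.5)
}
for r, row in table.items():
    print(f"r={r}: linear keeps {row['linear']:.3f}, cubic keeps {row['cubic']:.3f}")
```

At r = 0.7 the linear part keeps 70% of its strength while a three-way interaction keeps only about 34%, which shows why nonlinear structure fades faster than linear structure as measurement noise grows.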

[LG-60] P2CE: Model-Agnostic Plausible Pareto-Optimal Counterfactual Explanations

Link: https://arxiv.org/abs/2606.18418
Authors: Arthur Hendricks Mendes de Oliveira, Giovani Valdrighi, Marcos Medeiros Raimundo
Categories: Machine Learning (cs.LG)
Comments: Under review in the Machine Learning journal

Click to view abstract

Abstract:The increasing use of machine learning algorithms in social applications has raised concerns about fairness and transparency, leading to the development of counterfactual explanations. These explanations help individuals understand and potentially alter unfavorable decisions in areas such as loan applications and job selection by providing actionable changes to input features that would lead to a desired outcome. Existing methods often struggle to balance feasibility, plausibility, and computational efficiency. To address this, we introduce P²CE, an algorithm for generating plausible Pareto-optimal counterfactual explanations, offering users a diverse set of optimal trade-offs between different notions of feasibility. P²CE employs an auxiliary isolation forest outlier detector to ensure that explanations are in accordance with the data distribution and leverages SHAP values to obtain optimal results with short computing times, regardless of the underlying model. Our algorithm was empirically evaluated on three datasets, demonstrating superior performance in terms of both solution quality and computational efficiency compared to related techniques.
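
Pareto optimality over several feasibility costs can be illustrated with a plain non-dominated filter. This is a generic sketch of that concept only, not the P²CE search procedure: the two cost dimensions (number of features changed, distance from the original input) and the candidate values are hypothetical.

```python
def dominates(u, v):
    """True if cost vector u is no worse than v everywhere and better somewhere."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))


def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates (lower is better)."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Hypothetical counterfactuals as (features_changed, distance) cost pairs.
candidates = [(1, 5.0), (2, 3.0), (3, 3.5), (2, 6.0), (4, 1.0)]
front = pareto_front(candidates)
```

Each surviving candidate represents a different trade-off, for example changing one feature by a lot versus changing four features slightly, so a user can pick the option that is most actionable for them.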

[LG-61] MOLAR: Learning Multimodal Molecular Representations from Noisy Labels

Link: https://arxiv.org/abs/2606.18390
Authors: Yingxu Wang, Kunyu Zhang, Nan Yin, Yu Li, Eran Segal
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments:

Click to view abstract

Abstract:Motivation: Noisy labels are a common challenge in molecular property prediction because molecular annotations are often obtained from assays, curated databases, or weak annotation pipelines rather than directly observed clean biological states. Treating recorded labels as reliable supervision can cause models to memorize corrupted observations and learn misleading molecular evidence. In multimodal molecular representation learning, this issue can be amplified by graph-text fusion or alignment, which may propagate label-induced errors across modalities. Results: We propose MOLAR, a noise-aware framework for learning multimodal molecular representations from noisy labels. MOLAR separates latent clean-property inference from recorded-label observation: graph and text views contribute residual evidence to a clean-property distribution, and a categorical label-observation channel maps this distribution to recorded labels for training. This formulation derives posterior label reliability and modality-specific molecular evidence from the model. Experiments on naturally noisy molecular benchmarks and controlled label-flipping benchmarks show that MOLAR consistently outperforms representative baselines. Visualization analyses further show that MOLAR provides interpretable reliability and modality-evidence diagnostics.

[LG-62] SCOPE-FL: A Strategy-proof Chain-based Optimal pareto efficient Federated Learning System

Link: https://arxiv.org/abs/2606.18384
Authors: Seyed Salar Ghazi, Kaiwen Zhang, Mehdi feizi, Hans-Arno Jacobsen
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Click to view abstract

Abstract:Hierarchical Federated Learning (HFL) enables scalable collaborative model training across distributed devices while preserving data privacy. However, existing HFL client selection mechanisms suffer from a fundamental strategic inefficiency. By prioritizing stability over Pareto efficiency (PE), they produce suboptimal resource allocations, and without strategy-proofness (SP), participants are incentivized to misrepresent their true preferences. Both failures degrade overall system welfare in the Pareto sense. To address this, we propose SCOPE-FL (Strategy-proof Chain-based Optimal pareto efficient Federated Learning), a synchronous HFL framework that formulates client selection as a two-sided school choice problem solved through the Top Trading Cycle (TTC) algorithm that simultaneously guarantees PE and SP. For reward distribution, SCOPE-FL employs a scalable Shapley value approximation based on One-Round Reconstruction (OR), ensuring compensation proportional to each client’s contribution. The entire mechanism executes via blockchain smart contracts, providing the tamper-proof environment required for the SP guarantees to hold in practice. A comprehensive evaluation on MNIST, Fashion-MNIST, and CIFAR-10 demonstrates that SCOPE-FL outperforms state-of-the-art approaches, including DA, IAS, and other methods across model accuracy, convergence rate, and reward efficiency, while achieving communication latency comparable to DA and blockchain overhead significantly lower than DA at scale.

[LG-63] Do Time Series Foundation Model Benchmarks Hide Regime-Dependent Failures? Evidence from Traffic Speed Forecasting ICML2026

Link: https://arxiv.org/abs/2606.18367
Authors: Yingshuo Wang, Xian Sun, Lingdong Kong, Wei Gao, Yanhang Li, Zhichao Fan, Zexin Zhuang
Categories: Machine Learning (cs.LG)
Comments: 5 pages, 2 figures. Accepted at the Workshop on Forecasting as a New Frontier of Intelligence, ICML 2026

Click to view abstract

Abstract:Standard benchmarks evaluate time series foundation models (TSFMs) using aggregate metrics, but these can mask severe failures in critical operating regimes. We introduce regime-stratified evaluation and apply it to three TSFMs on two standard traffic speed benchmarks. Traffic exhibits abrupt regime switching between free-flow and congested states, producing bimodal speed distributions during transitions. When we stratify by traffic regime, both accuracy and prediction-interval coverage degrade sharply during transitions: transition-regime MAE reaches 11 mph (versus 3 mph overall), and empirical coverage of 90% prediction intervals drops as low as 55%. These failures are invisible in aggregate metrics because free-flow observations dominate the sample. A simple historical conditional baseline (sampling from per-sensor training distributions) achieves better transition coverage than any TSFM, but has far worse overall accuracy. We propose bimodal mixture augmentation (BMA), a post-hoc method that combines TSFM forecasts with historical distributional knowledge, approaching the historical baseline’s transition coverage while preserving the TSFM’s accuracy. Our results suggest that TSFM benchmarks should incorporate regime-aware evaluation to surface failures that aggregate metrics hide.
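
Regime-stratified evaluation amounts to computing the same metrics per regime as well as overall. The sketch below shows how an aggregate can hide a failing regime; the record layout, the regime labels, and every number in the toy data are assumptions for illustration and do not reproduce the paper's results.

```python
from collections import defaultdict


def stratified_metrics(records):
    """Compute MAE and interval coverage overall and per regime.

    Each record is a dict with keys: regime, y (truth), pred (point forecast),
    lo and hi (prediction-interval bounds).
    """
    groups = defaultdict(list)
    for record in records:
        groups[record["regime"]].append(record)
        groups["overall"].append(record)
    metrics = {}
    for name, rows in groups.items():
        mae = sum(abs(r["y"] - r["pred"]) for r in rows) / len(rows)
        coverage = sum(r["lo"] <= r["y"] <= r["hi"] for r in rows) / len(rows)
        metrics[name] = {"n": len(rows), "mae": mae, "coverage": coverage}
    return metrics


# Toy data: nine easy free-flow points and one badly missed transition point.
free_flow = [
    {"regime": "free_flow", "y": 60, "pred": 62, "lo": 55, "hi": 69}
    for _ in range(9)
]
transition = [{"regime": "transition", "y": 30, "pred": 42, "lo": 35, "hi": 49}]
metrics = stratified_metrics(free_flow + transition)
```

In this toy set the overall MAE is 3 mph and overall coverage is 90%, while the transition regime alone has an MAE of 12 mph and zero coverage, mirroring the masking effect the abstract describes.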

[LG-64] ThousandWorlds: A benchmark for climate emulation of potentially habitable exoplanets NEURIPS

Link: https://arxiv.org/abs/2606.18338
Authors: Edward T. Stevenson, Mei Ting Mak, Eric Wolf, Denis E. Sergeev, Tobi Hammond, N. J. Mayne, Miles Cranmer
Categories: Machine Learning (cs.LG); Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM)
Comments: 10 pages main text, 26 pages references/appendix, plus NeurIPS checklist. Data at this https URL. Code at this https URL

Click to view abstract

Abstract:The search for life beyond Earth will depend on detecting faint signatures in the atmospheres of potentially habitable exoplanets. Interpreting those signatures requires understanding the host planet’s climate: the same molecule may signal life on one planet and abiotic chemistry on another. Global climate models (GCMs) provide this understanding, but individual runs can require up to millions of core-hours and substantial domain expert time. Machine-learning emulators could remove this bottleneck, but progress has been limited by the absence of a curated, multi-model exoclimate dataset. We introduce ThousandWorlds, an ML-ready benchmark for exoclimate emulation and for the broader regime of low-data, multi-simulator, parameter-to-field regression. The dataset contains approximately 1800 simulations from five GCMs, mapping eight planet parameters to 3D atmospheric fields including temperature, humidity, winds, clouds, and radiation. Three nested subsets define progressively harder challenges: single-simulator regression, multi-simulator regression with complete observations, and multi-simulator regression with structured missingness. We propose two evaluation protocols: one for ranking methods, and one that measures performance relative to the disagreement between GCMs themselves. We evaluate seven baselines spanning simple methods, deep learning, and Gaussian processes. GP-based methods perform best, suggesting that ThousandWorlds exposes a regime where off-the-shelf deep learning does not yet succeed. Data: this https URL. Code: this https URL.

[LG-65] Neural Network Implementation of the Renormalization Group for Fault Diagnosis with Class Imbalance

Link: https://arxiv.org/abs/2606.18326
Authors: Evgeny Nikulchev, Dmitry Ilin
Categories: Machine Learning (cs.LG)
Comments: 8 pages

Click to view abstract

Abstract:The application of machine learning models in practical tasks faces challenges such as class imbalance and multidimensional noise. This paper proposes RGNet, a neural network architecture based on the concept of the renormalization group (RG), for hierarchical coarse-graining of the feature space. The model sequentially compresses the input dimensionality and concatenates all scales before classification, allowing it to capture both local details and global patterns. The notion of RG-flows is introduced: these are interpretable low-dimensional representations whose visualization via t-SNE reveals a discrete curvilinear structure confirming the effectiveness of coarse-graining. Experimental results are presented on the imbalanced AI4I dataset. The obtained results demonstrate that RGNet is a universal, interpretable, and competitive solution for fault prediction in applications with imbalanced classes.

[LG-66] Reliable Neural-Codec Text-to-Speech by ASR Self-Verification and Distillation: Near-Zero Catastrophic Failures Across Models and Codecs

Link: https://arxiv.org/abs/2606.18323
Authors: Ali Asaria, Tony Salomone, Deep Gandhi
Categories: Sound (cs.SD); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Open autoregressive neural-codec text-to-speech (TTS) models sound excellent on typical inputs yet suffer stochastic catastrophic failures: on a meaningful fraction of utterances they emit silence, terminate early, or collapse into repetitive or hallucinated content. We show this failure mode is cheap to remove. Under a single format-robust metric (a catastrophic-failure rate via an ASR round-trip), best-of-N ASR self-verification drives failures to near-zero: no observed failures remain by N=2 on a standard corpus (LibriSpeech) and by N=4 on a hard prompt set. This is not an artifact of one model: the reduction replicates across four open codec-TTS systems and three neural codecs (XCodec2, SNAC, Mimi), reaching the near-zero floor by N=2 on three of the four. We then make the fix free at inference time by distilling the self-verified behaviour into the model, which recovers much of the robustness in single-shot decoding, closing ~52-58% of the failure mass on hard inputs at no test-time cost. The distillation gain concentrates where it is needed (hard inputs); on already-reliable prose there is no headroom and no detectable change. A controlled comparison adds a clean negative: offline direct preference optimization (DPO/IPO) does not beat plain supervised distillation, and an online iterative variant is promising but not statistically separable at our evaluation size. We report honestly the one model that resists (a larger Llasa where scale did not obviously help) and a rare-word capability ceiling that no self-distillation method overcomes.
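
Best-of-N ASR self-verification can be sketched as a loop that synthesizes, transcribes, and keeps the candidate whose transcript is closest to the prompt. This is a minimal sketch under assumptions: `synthesize` and `transcribe` are hypothetical callables standing in for a TTS model and an ASR model, word error rate is used as the round-trip score, and the failure threshold of 0.5 is an arbitrary placeholder rather than the paper's definition of a catastrophic failure.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / max(len(ref), 1)


def best_of_n(text, synthesize, transcribe, n=4, fail_threshold=0.5):
    """Return (audio, wer, failed) for the best of up to n sampled candidates."""
    best_audio, best_wer = None, float("inf")
    for attempt in range(n):
        audio = synthesize(text, seed=attempt)
        wer = word_error_rate(text, transcribe(audio))
        if wer < best_wer:
            best_audio, best_wer = audio, wer
        if wer == 0.0:
            break  # a perfect round trip cannot be improved on
    return best_audio, best_wer, best_wer > fail_threshold


# Fake models: the first sample is silence, later samples are correct.
fake_tts = lambda text, seed: "" if seed == 0 else text
fake_asr = lambda audio: audio

single_shot = best_of_n("hello world", fake_tts, fake_asr, n=1)
verified = best_of_n("hello world", fake_tts, fake_asr, n=2)
```

With the fake models, a single sample is flagged as a failure while two samples are enough to recover, which is the mechanism by which resampling removes stochastic failures.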

[LG-67] Enhanced Graph Neural Networks using K-Hop Gaussian Diffusion

Link: https://arxiv.org/abs/2606.18317
Authors: Xuling Zhang, Peng Wang, Daiyan Li, Aoran Huang, Zeiwei Chen, Yongkui Yang
Categories: Machine Learning (cs.LG)
Comments: 5 pages, 3 figures

Click to view abstract

Abstract:Most graph neural network (GNN) cores rely on graph convolutions, typically implemented as message passing between direct (single-hop) neighbors. In many real-world graphs, edges can be noisy or poorly defined, limiting information propagation to local neighborhoods. Existing diffusion kernels, such as Personalized PageRank (PPR) and Heat Kernel, alleviate this issue through global propagation, but still struggle with complex local structures and distant node noise. To address these limitations, we propose a K-Hop Gaussian (KHG) diffusion kernel as a preprocessing module for graph data. KHG introduces multi-hop diffusion with Gaussian weighting for remote nodes, balancing local and global information propagation before applying standard GNNs. Experiments on multiple benchmark datasets demonstrate that KHG significantly outperforms traditional message-passing GNNs, as well as PPR and Heat Kernel diffusion, particularly in noisy or structurally complex graphs.

[LG-68] A Survey on Data-Driven Models for Soil Moisture Regression and Classification

Link: https://arxiv.org/abs/2606.18316
Authors: Ilektra Tsimpidi, George Georgoulas, Vidya Sumathy, George Nikolakopoulos
Categories: Machine Learning (cs.LG)
Comments: 14 pages, 3 figures, AIAI 2026 Conference

Click to view abstract

Abstract:Soil Moisture (SM) modelling constitutes a complex spatiotemporal learning problem characterised by nonlinear environmental interactions, heterogeneous data sources, and limited ground observations. Physics-based approaches, such as water balance models, rely on explicit hydrological equations and high-quality inputs, but their computational cost and scalability limitations restrict large-scale deployment. Data-driven artificial intelligence (AI) methods have emerged as flexible alternatives, enabling the extraction of empirical relationships between soil moisture and environmental variables with reduced modelling assumptions. This work presents a structured survey of AI-based models for soil moisture estimation and classification. Existing approaches are organized into five categories: (a) statistical time-series models, (b) geostatistical methods, (c) classical machine learning (ML) models, (d) Deep Learning (DL) models, and (e) Probabilistic/Bayesian methods. These models leverage historical soil moisture records, meteorological variables, vegetation indices, topography, soil characteristics, and geolocation data to perform regression or classification tasks.

[LG-69] TIGER: Inverting Transformer Gradients via Embedding-Subspace Distance Optimization

Link: https://arxiv.org/abs/2606.18312
Authors: William Kalikman, Ivo Petrov, Dimitar I. Dimitrov, Martin Vechev
Categories: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 16 pages, 13 pages main text

Click to view abstract

Abstract:Federated learning allows multiple clients to jointly train a shared model by sending gradient updates to a central server while keeping raw inputs local. However, prior gradient inversion attacks show that these updates can reveal enough information to reconstruct client inputs. Existing attacks on transformers either optimize dummy inputs to match the true client updates, which is costly and unstable for modern models, or exploit the low rank of attention gradients to identify a subspace containing the true layer embeddings, followed by a discrete membership test for candidate tokens. However, this token test is brittle under numerical noise, i.e., from quantization or Differential Privacy (DP), and scales poorly for encoder models with non-causal attention. We introduce TIGER, a continuous gradient inversion attack that turns this subspace signal into a differentiable objective. Instead of searching over tokens or matching full gradients, TIGER directly optimizes token embeddings to minimize their distance to the subspace. Our experiments demonstrate that on encoder-only models, TIGER substantially improves both reconstruction quality and runtime over existing attacks, while on decoder models, TIGER is more robust than prior subspace-based attacks, enabling the first successful reconstructions in DP-defended federated learning settings.

[LG-70] Fisher Width: A Geometric Measure of Complexity on Statistical Manifolds

Link: https://arxiv.org/abs/2606.18306
Authors: Vu Khac Ky
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 48 pages, 3 figures

Click to view abstract

Abstract:Gaussian width is a central geometric complexity measure in high-dimensional probability, compressed sensing, convex optimization, and learning theory. It quantifies the average extent of a set along random directions, thereby capturing the effective dimension of constraint sets, hypothesis classes, and descent cones. However, this notion is intrinsically Euclidean. Statistical models instead carry a natural Riemannian geometry induced by the Fisher information metric, where directions are scaled according to statistical distinguishability rather than ambient Euclidean length. We introduce Fisher width, a Fisher-geometric analogue of Gaussian width for statistical manifolds. At a parameter point θ, Fisher width replaces the Euclidean identity by the local metric tensor G(θ)^{1/2}, measuring the Gaussian width of the Fisher-rescaled set. This makes the resulting quantity sensitive to local statistical curvature and invariant under smooth reparameterizations. We develop the basic theory of Fisher width, showing that it retains key structural features of Gaussian width, including concentration, metric perturbation stability, and spectral comparison bounds with the Euclidean baseline, while also capturing anisotropic geometric effects invisible to Euclidean measures. As an application, we prove a generalization bound for Fisher-Lipschitz hypothesis classes and propose computable estimators, which we evaluate empirically on MNIST across three model classes. Fisher width is to statistical manifolds what Gaussian width is to Euclidean convex bodies. This work lays the foundation for studying complexity and learning on curved statistical manifolds.
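
For orientation, the two quantities can be written side by side. The first line is the standard definition of Gaussian width. The second is only a reading of the abstract's description (the Gaussian width of the set rescaled by the square root of the Fisher metric at θ); the symbol w_F and the exact form, including any centering or normalization, are assumptions and may differ from the paper.

```latex
% Standard Gaussian width of a set T \subset \mathbb{R}^d
w(T) = \mathbb{E}_{g \sim \mathcal{N}(0, I_d)}
       \Big[ \sup_{t \in T} \langle g, t \rangle \Big]

% Fisher width at \theta, as read from the abstract (assumed form)
w_F(T; \theta) = w\big( G(\theta)^{1/2} T \big)
               = \mathbb{E}_{g \sim \mathcal{N}(0, I_d)}
                 \Big[ \sup_{t \in T} \langle g, G(\theta)^{1/2} t \rangle \Big]
```

Under this reading, setting G(θ) to the identity recovers the Euclidean Gaussian width, which matches the abstract's framing of Fisher width as its Fisher-geometric analogue.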
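To make the Fisher-width definition in the abstract above concrete, here is a minimal Monte Carlo sketch for a finite set of points. It assumes the quantity is E sup_{t in T} <g, G(\theta)^{1/2} t> with g a standard Gaussian vector, and it further assumes a diagonal Fisher metric so that the matrix square root is elementwise. The function name `fisher_width` and its arguments are hypothetical; this is not one of the estimators proposed in the paper.

```python
import math
import random

def fisher_width(points, fisher_diag, n_samples=20000, seed=0):
    """Monte Carlo estimate of E sup_{t in T} <g, G^{1/2} t> for a finite set T.

    Assumption: the Fisher metric G(theta) is diagonal, given by `fisher_diag`,
    so G^{1/2} is an elementwise square root.
    """
    rng = random.Random(seed)
    scale = [math.sqrt(v) for v in fisher_diag]
    # Rescale every point by G^{1/2} once, up front.
    rescaled = [[s * x for s, x in zip(scale, t)] for t in points]
    dim = len(fisher_diag)
    total = 0.0
    for _ in range(n_samples):
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Supremum of the Gaussian process over the finite rescaled set.
        total += max(sum(gi * ti for gi, ti in zip(g, t)) for t in rescaled)
    return total / n_samples

T = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
euclidean = fisher_width(T, [1.0, 1.0])     # G = I: ordinary Gaussian width
anisotropic = fisher_width(T, [4.0, 0.25])  # directions rescaled by G^{1/2}
```

With the identity metric the estimator reduces to the usual Gaussian width, which gives a simple Euclidean baseline to compare the anisotropic value against.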

[LG-71] Starter-Iterator Neural Operator: A Unified Architecture for High-Fidelity Forward and Inverse PDE Problems

Link: https://arxiv.org/abs/2606.18305
Authors: Kuilin Qin,Lianfang Wang,Xu Sun,Jiwei Jia,Yu Wang,Yong Wang,Yuping Duan
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG)

Abstract:Operator learning is an emerging interdisciplinary field that integrates machine learning with scientific computing. By mapping infinite-dimensional function spaces, this approach provides an efficient surrogate modeling framework for high-dimensional partial differential equations (PDEs). Compared to traditional numerical solvers, it achieves a superior trade-off between computational complexity and approximation accuracy, demonstrating significant advantages in many-query tasks such as real-time prediction and parameter sweeps. Given the stringent accuracy requirements of both forward simulation and inverse inference, as well as the precision bottlenecks of existing operator learning methods in handling complex boundaries or long-term evolution, we propose the Starter-Iterator Neural Operator (SINO). Our framework reinterprets the initialization strategies and iterative formats of traditional iterative methods through neural networks, establishing an efficient approach for spectral-spatiotemporal collaborative modeling. Specifically, the frequency-domain initialization module captures globally stable low-frequency features, while the time-domain learning module focuses on optimizing local solution residuals, thereby effectively overcoming the inherent limitations of conventional single-domain modeling approaches. Extensive experiments on typical dynamical systems such as the Navier-Stokes equations and acoustic wave equations, as well as practical applications including super-resolution imaging and weather forecasting, demonstrate that SINO achieves outstanding performance in numerical accuracy, generalization capability, and robustness.

[LG-72] Artemis: Anatomy-Resolved inTervention for Eliminating Multimodal NeuroImage confounderS

Link: https://arxiv.org/abs/2606.18287
Authors: Siyuan Dai,Yang Du,Kun Zhao,Zhusuyi Chen,Heng Huang,Paul Thompson,Chao Shi,Haoteng Tang,Liang Zhan
Subjects: Machine Learning (cs.LG)
Comments: 11 pages, 8 figures

Abstract:Multimodal neuroimaging, integrating functional connectivity from fMRI and structural connectivity from DTI, enables non-invasive analysis of brain networks using graph neural networks. However, demographic factors such as age and sex systematically confound the relationship between brain connectivity and clinical outcomes, causing GNNs to exploit spurious shortcuts rather than learning causally invariant representations. While recent causal GNN methods introduce causality at the graph-modeling level, their causal mechanisms remain domain-agnostic without accounting for the real-world confounders inherent in clinical neuroimaging data. Moreover, brain networks are constructed from atlas-based parcellations where each region exhibits distinct sensitivity to demographic factors, necessitating region-aware adjustment. We propose Artemis, a region-level causal framework that bridges this gap with causal intervention at each brain region independently by learning region-specific confounder representations with lightweight parameters. Our adjustment uses both the functional and the structural features for graph reasoning and acts as a plug-in module compatible with arbitrary GNN backbones. Experiments on three benchmarks, ADNI for disease diagnosis, OASIS for dementia staging, and HCP for sex classification, demonstrate consistent improvements over representative GNN-based baselines. Multiple supporting experiments further demonstrate statistical significance and neuroscientific interpretability.

[LG-73] CODEBLOCK: Learning to Supervise Code at the Right Granularity

Link: https://arxiv.org/abs/2606.18286
Authors: Zhijie Deng,Ling Li,Jinlong Pang,Kaiqin Hu,Qi Xuan,Zhaowei Zhu,Jiaheng Wei
Subjects: Machine Learning (cs.LG)

Abstract:Supervised fine-tuning of code LLMs typically applies uniform cross-entropy loss to all response tokens, implicitly assuming that every token provides equally useful learning signal. Recent token-level selection methods challenge this assumption in natural-language SFT by supervising only high-value tokens. However, directly transferring token-level masking to code can break syntactically and semantically coherent program units, because code depends on structural completeness and definition-use relations. We therefore propose CodeBlock, a structure-aware sparse supervision framework that selects structure-complete code evidence rather than isolated tokens. CodeBlock first selects high-quality instruction-response pairs, then partitions code responses into syntactically coherent coding items, estimates their utility by aggregating generalized cross-entropy over core logic tokens, and reranks them with data-flow reach and bridge signals to prioritize blocks that propagate or connect important program dependencies. During training, the full response remains available as context, while loss is applied only to selected code items and informative natural-language tokens. Experiments on six code-generation benchmarks show that CodeBlock achieves stronger average pass@1 than full-token SFT and competitive selection baselines, while using only 1.9% of supervised response tokens.

[LG-74] Gaussian Mixture Attention: Linear-Time Sequence Mixing via Probabilistic Latent Routing

Link: https://arxiv.org/abs/2606.18283
Authors: Yongchao Huang,Hassan Raza
Subjects: Machine Learning (cs.LG)
Comments: 55 pages

Abstract:The dense token-to-token interaction pattern of standard dot-product attention remains a central bottleneck in scaling Transformer architectures to long contexts. We introduce Gaussian Mixture Attention (GMA), a probabilistic attention-style sequence mixer that replaces explicit pairwise query–key comparison with routing through K learned Gaussian mixture components. Queries and keys are mapped to posterior responsibility vectors over a shared latent routing space; their overlap defines an implicit responsibility-space affinity, while values are written into and read from a K-slot latent memory. By exploiting the associativity of matrix multiplication, GMA avoids materializing the induced N\times N affinity matrix and instead uses two responsibility matrices whose dominant activation storage scales as \mathcal{O}(NK) rather than \mathcal{O}(N^2) for fixed K . We formulate bidirectional and causal variants of GMA, provide an end-to-end differentiable parameterization of the Gaussian mixture components, and analyze its responsibility-modulated gradient structure, constrained non-negative low-rank affinity interpretation, and local routing stability. Empirically, GMA exhibits the intended fixed-K linear memory scaling and is competitive with attention-style baselines on long-context classification, while causal GMA improves over tested linear/random-feature attention variants on WikiText-103 but remains behind optimized causal SDPA and Mamba in the current implementation. Analysis of learned responsibilities further shows broad component usage and moderate alignment with surface-form token categories, supporting GMA as a probabilistic, interpretable, fixed-K linear-time attention-style alternative rather than a universal replacement for optimized softmax attention or state-space models.
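The associativity argument in the abstract above can be sketched in a few lines of plain Python. This is an illustrative reading rather than the authors' implementation: it assumes equal-weight isotropic Gaussian components, omits any normalization the paper may apply, and covers only the bidirectional case. All function names (`responsibilities`, `gma_mix`, `gma_mix_dense`) are hypothetical.

```python
import math

def responsibilities(x, means, var=1.0):
    """Posterior over K equal-weight isotropic Gaussians for one vector x."""
    logits = [-sum((xi - mi) ** 2 for xi, mi in zip(x, m)) / (2.0 * var)
              for m in means]
    top = max(logits)  # subtract the max for numerical stability
    w = [math.exp(l - top) for l in logits]
    z = sum(w)
    return [wi / z for wi in w]

def gma_mix(queries, keys, values, means):
    """Compute R_q (R_k^T V) without ever forming the N x N affinity matrix."""
    K, d_v = len(means), len(values[0])
    r_q = [responsibilities(q, means) for q in queries]
    r_k = [responsibilities(k, means) for k in keys]
    # Write: accumulate values into a K-slot memory, cost O(N * K * d_v).
    memory = [[0.0] * d_v for _ in range(K)]
    for rk, v in zip(r_k, values):
        for c in range(K):
            for j in range(d_v):
                memory[c][j] += rk[c] * v[j]
    # Read: each query takes a responsibility-weighted combination of slots.
    return [[sum(rq[c] * memory[c][j] for c in range(K)) for j in range(d_v)]
            for rq in r_q]

def gma_mix_dense(queries, keys, values, means):
    """Reference O(N^2) version that materializes (R_q R_k^T) first."""
    r_q = [responsibilities(q, means) for q in queries]
    r_k = [responsibilities(k, means) for k in keys]
    out = []
    for rq in r_q:
        aff = [sum(a * b for a, b in zip(rq, rk)) for rk in r_k]
        out.append([sum(a * v[j] for a, v in zip(aff, values))
                    for j in range(len(values[0]))])
    return out

means = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
tokens = [[0.1, 0.2], [0.9, 1.1], [-1.2, 1.8], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]
fast = gma_mix(tokens, tokens, values, means)
dense = gma_mix_dense(tokens, tokens, values, means)
```

Both functions return the same output up to rounding. The first stores only the two N x K responsibility matrices and a K-slot memory, which is where the O(NK) storage for fixed K comes from.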

[LG-75] A physical adaptive material motor unit neural network: a hygromorph composite material machine

Link: https://arxiv.org/abs/2606.18275
Authors: Charles de Kergariou,David Correa,Adam W. Perriman,Helmut Hauser,Fabrizio Scarpa
Subjects: Emerging Technologies (cs.ET); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: 35 pages, 16 figures

Abstract:Advances in novel materials science enable structures to function as intelligent machines by embedding memory and learning capabilities directly into materials. Our work introduces a physical adaptive material motor unit neural network, leveraging a new generation of controllable actuators composed of wood- and carbon black-based composites, sensitive to temperature and relative humidity. These material actuators are assembled into a motor unit-like structure inspired by the muscle contraction trigger, forming an intelligent machine capable of dynamic shading control that can be used, for example, in buildings. The machine is governed by a neural network trained on over 350 experimental data points collected under diverse environmental conditions. By establishing a new data-aware backpropagation training, we show that the machine predicts shading responses and learns to predict appropriate behaviour incrementally as the database expands. We also demonstrate the ability of the machine to optimise configurations to achieve similar shading outputs under two distinct conditions.

[LG-76] Graph Instance Landscapes: When Structural Similarity Does (Not) Reflect Shortest-Path Performance CEC2026

Link: https://arxiv.org/abs/2606.18267
Authors: Maryam Gholami Shiri,Ivana Krminac,Marko Djukanović,Sašo Džeroski,Eva Tuba,Tome Eftimov
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: Preprint version of a paper accepted at the 2026 IEEE Congress on Evolutionary Computation (IEEE CEC 2026)

Abstract:Benchmarking shortest-path algorithms is commonly based on aggregate performance over heterogeneous graph sets, which limits insight into how different search paradigms react to instance structure. We adopt an instance-landscape view of graph benchmarking by embedding graphs into a low-cost structural feature space and clustering them into regions of similar structure. Three benchmark suites are studied: weighted Erdős–Rényi graphs, random geometric (wireless) graphs, and real-world road networks. We evaluate four representative shortest-path solvers spanning uninformed exact search (Dijkstra), bidirectional exact search (bidirectional Dijkstra), heuristic-guided exact search (A^*), and deque-based strategies (DEQ). Clustering robustness is analyzed under multiple feature-selection schemes, and runtime distributions are compared across landscape regions using non-parametric tests. While generator parameters induce stable structural regions, we find that feature-space similarity does not necessarily imply performance similarity: significant runtime shifts are frequently observed even within the same landscape region. A merged-suite analysis further shows that different benchmark families occupy largely disjoint regions. These results highlight both the potential and the limits of structural landscapes for the structure-aware benchmarking of shortest-path algorithms.

[LG-77] The Chandra-Gaia Catalog of Counterparts: Resolving ambiguous Gaia matches to X-ray sources in the Chandra Source Catalog using Machine Learning WWW

Link: https://arxiv.org/abs/2606.19329
Authors: V. Samuel Pérez-Díaz,Vinay L. Kashyap,Joshua D. Ingram,David Fouhey,Juan Rafael Martínez-Galarza,Pavlos Protopapas,Jeremy J. Drake,Dong-Woo Kim,Cecilia Garraffo
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: Accepted to The Astrophysical Journal. Website: this https URL

Abstract:We present a framework to cross-match sources from the Chandra Source Catalog (CSC v2.1) with optical sources from Gaia Data Release 3. Unlike purely spatial approaches, we use source properties such as magnitudes, colors, and distances to identify true counterparts, detect chance coincidences, and resolve ambiguities when multiple plausible candidates exist. We define a training set of high-confidence matches using NWAY, a Bayesian cross-matching framework that accounts for positional errors and source densities. We train a gradient-boosted classifier (LightGBM) on a variety of features from both catalogs. Of the ~254k unique X-ray sources, we find counterparts for ~113k sources, of which plausible multiple counterparts are found for ~7k. We find no counterparts for ~20k sources for which separation-based cross-matching does find a match, and attribute half of these to chance coincidences. We validate the pipeline on the Chandra Orion Ultradeep Project (COUP), where the machine-learning matches reproduce 95% of NWAY cross-matches without using any positional information. We release a catalog of the ~113k Chandra-Gaia counterparts, together with ~7k alternative matches and ~20k ambiguous NWAY associations, supporting future population studies of sources detectable by both Chandra and Gaia. We discuss limitations and provide a generalization of the framework that is applicable in other cross-matching scenarios.

[LG-78] Optimal scenario design for climate emulation

Link: https://arxiv.org/abs/2606.19302
Authors: Christopher B. Womack,Shahine Bouabid,Andrei Sokolov,Popat Salunke,Glenn Flierl,Sebastian D. Eastham,Noelle E. Selin
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)

Abstract:As deep learning for physical systems continues to grow in popularity, efforts to improve generalizability have primarily focused on designing architectures that embed physical constraints. However, for machine-learning surrogate climate models (emulators), we show that the low structural diversity in existing scenarios commonly used to generate training data places a ceiling on predictive skill. Here, we examine whether training datasets themselves can be optimized to improve generalization. We introduce a method to create datasets that produce emulators capable of generalizing to new, structurally different scenarios absent from the training data. We use a differentiable Simple Climate Model (SCM) to calculate the sensitivity of emulator loss to perturbations in the training data, iteratively updating the training data to maximize emulator skill. For an SCM, training on one scenario optimized in this fashion outperforms an emulator trained on six standard ScenarioMIP pathways. We achieve this higher predictive skill despite training on a smaller dataset, finding that our emulator successfully isolates distinct physical behaviors of different climate forcing agents (e.g., greenhouse gases vs. aerosols) without single-forcing runs. We then demonstrate that scenarios optimized using an SCM, when used to drive an intermediate-complexity climate model, produce a training dataset that yields a more skillful emulator than training on ScenarioMIP outputs. Our results suggest that, in the compute-constrained environment of running full-scale climate models, generating a small number of dynamically rich scenarios provides greater marginal value for emulation and characterizing system responses than expanding the suite of traditional emissions pathways.

[LG-79] Beyond Algorithms: Conceptual Innovation in Medical Imaging AI

Link: https://arxiv.org/abs/2606.19270
Authors: Mark A. Anastasio
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)

Abstract:Artificial intelligence has driven rapid progress in medical imaging research, producing increasingly sophisticated algorithms and steady improvements on benchmark tasks. However, this algorithm-centric trajectory has also revealed a growing imbalance: while computational methods advance rapidly, the conceptual foundations that define imaging tasks, evaluation metrics, and clinical meaning sometimes remain underexamined. In this Perspective, we distinguish algorithmic innovation, which focuses on improving computational implementations and performance within a fixed problem definition, from conceptual innovation, which reframes what problems are posed, how success is measured, and why an approach is clinically relevant. We argue that prevailing incentive structures, training pathways, and publication norms disproportionately reward algorithmic novelty, particularly for early-career researchers, while at times undervaluing conceptual contributions that are essential for scientific maturation and clinical translation. Through representative examples from medical imaging AI, we show how insufficient conceptual grounding can lead to misaligned objectives, fragile generalization, and limited real-world impact. We conclude with actionable recommendations for researchers, mentors, reviewers, and journals to better recognize, support, and integrate conceptual innovation alongside algorithmic advances.

[LG-80] Acceleration of an algebraic multigrid pressure solver using graph neural networks

Link: https://arxiv.org/abs/2606.19251
Authors: Eric Chillón,Artur K. Lidtke,Nguyen Anh Khoa Doan,Bernat Font
Subjects: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments: 23 pages, 11 figures

Abstract:Solving the pressure-Poisson equation remains the primary computational bottleneck in incompressible unstructured flow solvers primarily due to the inherent sensitivity of traditional linear solvers to mesh irregularities. This work introduces a data-driven algebraic multigrid (AMG) smoother that uses a modified graph convolutional isomorphism network (GCIN). The graph neural network predicts optimal polynomial coefficients to construct a sparse pseudo-inverse operator across diverse grid topologies. The coefficients are optimized to reduce the residual after each V-cycle iteration. By directly capturing the algebraic structure of the system from the sparse coefficient matrix, the proposed method maintains the solver’s linearity while adapting to local anisotropies in unstructured grids. Our framework demonstrates significant performance gains by reducing the number of V-cycles required for a given tolerance and delivering wall-clock speedups from 4% to 37% across diverse benchmarks. Notably, the model exhibits robust generalization by maintaining efficiency on meshes up to 128 times larger than those seen in training, and by accelerating the solver’s convergence on unseen industry-relevant problems such as the AirfRANS dataset.

[LG-81] Generalised Eigenvalue Geometry of Semantic Adversarial Attacks

Link: https://arxiv.org/abs/2606.19212
Authors: Martin Anthony,Kaveh Salehzadeh Nobari
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Recent empirical work shows that semantically equivalent paraphrases can fool financial sentiment classifiers: although a paraphrase remains close to the original under a strong reference embedding, it may shift the target model’s representation enough to change the predicted class. Existing robustness theory either assumes a single-model threat model or focuses mainly on empirical attack algorithms. We develop a continuous local model of semantic paraphrase perturbations that captures this two-model structure. We show that the worst-case local displacement of the target representation, subject to a proxy-model budget, is governed by the largest generalised eigenvalue of a matrix pencil (A,B) constructed from the Jacobians of the two embedding maps. The resulting attackability index \lambda^*(x) is intrinsic to the local paraphrase geometry and the chosen embedders, yields a closed-form prediction-flip condition for affine readouts, and supports conservative population and finite-sample attackability certificates. For uniform control over classes of affine readouts, we derive a distribution-free VC bound for binary attackability indicators and a scale-sensitive margin bound based on an attackability-adjusted margin that subtracts a local geometric penalty from the standard classifier margin. We also connect the continuous theory to discrete paraphrase search, identify an asymmetry between successful and unsuccessful finite searches, and give a covering condition under which the discrete and continuous settings agree. Finally, we propose an empirical verification framework using soft-token relaxations and generated paraphrase sets to assess the local eigenvalue geometry, prediction-flip condition, and finite-search approximation on a deployed financial-text classifier.

[LG-82] On Local Population-Risk Certificates

Link: https://arxiv.org/abs/2606.19147
Authors: Mingzhi Song
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 35 pages, 6 figures

Abstract:This paper develops local certificates for population-risk increments around a current model. For a local candidate set \mathcal{D} , the certificate is a two-sided confidence band for P(\ell_{\theta+v}-\ell_\theta) over v\in\mathcal{D} . As an application, the upper endpoint of this band yields a risk-controlled update rule: an update is accepted only when its certified upper endpoint is nonpositive; otherwise the current model is retained.

[LG-83] Wasserstein Policy Learning for Distributional Outcomes COLT2026

Link: https://arxiv.org/abs/2606.19117
Authors: Yiyan Huang,Cheuk Hang Leung,Qi Wu,Zhiheng Zhang
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Econometrics (econ.EM); Machine Learning (stat.ML)
Comments: Accepted by The 39th Annual Conference on Learning Theory (COLT 2026)

Abstract:Offline policy learning has received growing attention in causal inference. The primary objective is to learn a policy (individualized treatment rule) as a mapping from covariates to treatment that maximizes the empirical welfare defined as the mean of scalar-valued potential outcomes. In this paper, we study offline policy learning with distribution-valued outcomes, where each potential outcome is a probability measure on \mathbb{R} and the reward is defined through a utility functional applied to the Wasserstein barycenter of induced outcome distributions. We establish statistical guarantees for the policy learning framework based on both Inverse Probability Weighting (IPW) and Doubly Robust (DR) estimators. By handling the challenging uniform deviation over the product of the combinatorial policy class and the infinite-dimensional quantile domain, we prove that the finite-sample regret has leading dependence \widetilde{\mathcal{O}}(\sqrt{\mathrm{N\text{-}dim}(\Pi)/N}) . In the one-dimensional Wasserstein setting and under the stated regularity conditions, the leading regret rate is still governed by the policy-class complexity. Moreover, we provide a minimax lower bound establishing the sharpness of the leading dependence on N and \mathrm{N\text{-}dim}(\Pi) .

[LG-84] Structure Over Nonlinearity: Explicit Interaction Architectures for Dynamical Learning

Link: https://arxiv.org/abs/2606.19101
Authors: Augusto Sarti
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 11 pages, 2 figures, 2 tables

Abstract:Most learning architectures for dynamical systems rely on generic nonlinear function approximation, often requiring high model complexity to capture structured behaviors. In this work, we propose an alternative paradigm in which modeling capability arises primarily from structure rather than from expressive nonlinearities. We introduce a class of explicit structured dynamical units based on wave-inspired interaction structures with internal state. Inspired by wave-based computational principles, the proposed units adopt a strictly causal organization that eliminates algebraic loops, yielding fully explicit models that can be evaluated without implicit solvers. Stacking such units produces layered dynamical architectures with emergent hierarchical behavior. Through experiments on a nonlinear system identification task, we show that depth improves both representation quality and generalization, even under limited parameter optimization. In particular, the proposed architectures produce informative internal representations even under readout-only fitting, indicating that useful dynamical structure emerges from the organization of interactions prior to substantial parameter optimization. These results suggest that structure-first design provides a viable and effective alternative to conventional black-box approaches for learning dynamical systems, highlighting the role of interaction structure as a primary source of model expressivity.

[LG-85] Context-Aware Optimization of Follow-Up Intervals for Type 2 Diabetes Care Using Markov Decision Processes

Link: https://arxiv.org/abs/2606.19092
Authors: Parisa Lotfibagha,Kristen Miller,William J. Gallagher,Elizabeth B. Selden,Muge Capan
Subjects: Applications (stat.AP); Machine Learning (cs.LG)

Abstract:Chronic disease management relies on regular patient-provider interactions to follow up on disease progression and control. For Type 2 Diabetes (T2D), current guidelines prescribe fixed time intervals between subsequent primary care visits for all patients, overlooking heterogeneity in clinical trajectories and patient characteristics. This study introduces a Contextual Markov Decision Process (CMDP) model to optimize subpopulation-specific follow-up interval decisions using Electronic Health Record (EHR) data from 22,154 T2D patients across 10 primary care clinics. Contexts are identified by: i) dimensionality reduction of variables representing the individual health trajectories utilizing Principal Component Analysis, and ii) assigning patients to contexts via principal components and additional patient-level features using clustering. Two distinct contexts emerged, representing a lower- and a higher-risk subpopulation. CMDP-derived policies recommend: (i) follow-up within 1 month if lab value at current visit is unmeasured; (ii) up to 3 months for elevated lab values or recent hospitalizations; and (iii) 6 to 12 months for sustained glycemic control, with shorter follow-up intervals for patients in the high-risk context. The optimal policies achieved lower expected cumulative cost than benchmarks (e.g., in the higher-comorbidity context, the CMDP policy reduced cost by about 34.8%, and in the lower-comorbidity context by about 6.4%, relative to an American Diabetes Association-like fixed interval follow-up policy). These findings demonstrate how context-aware approaches can inform adaptive follow-up strategies, and have the potential to advance chronic care management in primary care by synthesizing machine learning and probabilistic decision models.

[LG-86] Quantifying and Auditing LLM Evaluation via Positive–Unlabeled Learning

Link: https://arxiv.org/abs/2606.19057
Authors: Zilong Zhang,Yi-Ting Hung,Lei Ding,Chi-Kuang Yeh
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)

Abstract:Large Language Models (LLMs) are increasingly used as judges for scalable evaluation, yet such LLM–as–a–Judge systems exhibit systematic biases that are decoupled from semantic quality, most notably verbosity bias. Meanwhile, human supervision is costly and typically selective, yielding reliable positive judgments but leaving most outputs unlabelled and potentially mixed in quality. We formulate LLM evaluation under selective human supervision as a positive–unlabelled learning problem and propose a geometric auditing framework based on Partial Optimal Transport. By aligning a small set of human–verified positives with a reliable subset of unlabelled outputs in a fixed embedding space, our method identifies human–consistent preferences and corrects biased judges without retraining. Experiments demonstrate improved alignment with human preferences, increased robustness to presentation biases, and interpretable confidence estimates, offering a scalable and statistically grounded alternative to existing LLM–as–a–judge pipelines.

[LG-87] Sequential Kernel-based Conditional Independence Testing via Adaptive Betting ICML2026

Link: https://arxiv.org/abs/2606.18993
Authors: Zheng He,Danica J. Sutherland
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: Published at ICML 2026: this https URL

Abstract:Testing conditional independence is fundamental yet intrinsically difficult: without additional assumptions, Type I error control is impossible in general. The "Model-X" paradigm addresses this difficulty by assuming exact knowledge of a relevant conditional distribution. While small deviations from this assumption can sometimes be tolerated in classical one-shot testing, existing sequential conditional independence tests typically require the Model-X conditional to be known exactly, making them fragile when it must instead be estimated. We propose a new approach that is substantially more robust to such estimation error. Our method applies testing-by-betting to an adaptively optimized Kernel Conditional Independence statistic, together with a normalization scheme and a truncate-and-shift calibration strategy. These modifications greatly reduce Type I error inflation while preserving high power across high-dimensional synthetic benchmarks and real-world fairness tasks, outperforming existing sequential Model-X approaches. Code is available at this https URL.

[LG-88] FOSC-X: An Extended Framework for Optimal Local Cuts and Non-Horizontal Cluster Selection from Clustering Hierarchies

Link: https://arxiv.org/abs/2606.18972
Authors: Connor Simpson,Ricardo J. G. B. Campello
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Extracting a flat clustering solution from a hierarchy is a common task in practical cluster analysis and can be formulated as an optimisation problem. Existing approaches focus on finding a single optimal solution. We introduce FOSC-X, a framework for extracting the top-M globally optimal flat clusterings from local, non-horizontal cuts of a hierarchical cluster tree, while optionally enforcing constraints on the number of clusters. This enables automatic identification of multiple high-quality alternative clusterings that capture different aspects of the hierarchical structure. Without constraints, the top-M problem can be solved in polynomial time using dynamic programming, exploiting the property that locally optimal partial candidates within subtrees can be combined to form globally optimal solutions while automatically determining the number of clusters. However, this can lead to solutions with numbers of clusters that are ultimately undesirable – e.g., too large to be meaningful or practically analysed within a particular application domain. Imposing cluster-count constraints breaks the optimality property underlying the unconstrained dynamic programming approach, since locally optimal partial candidates may no longer combine into feasible globally optimal solutions. FOSC-X addresses this challenge through a dynamic programming strategy that maintains compact sets of feasible candidates using lower and upper feasibility bounds while pruning infeasible or dominated combinations. The resulting method guarantees optimal rankings of the top-M solutions with linear-time complexity in the number of cluster nodes and dataset size, both with and without cluster-count constraints. Experiments show that FOSC-X efficiently reveals alternative clustering structures overlooked by single-solution extraction methods.

[LG-89] Kernel of Partition Paths: A Unified Representation for Tree Ensembles

Link: https://arxiv.org/abs/2606.18853
Authors: Nicolas Mahler
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 31 pages

Abstract:A recent line of work has reframed individual decision trees as linear models on engineered features associated with their splits, opening routes for oracle inequalities and feature-importance reinterpretation, but leaving open the question of what unified geometric object a forest induces when one indexes its feature map by nodes rather than by splits. The present paper studies that object. The Kernel of Partition Paths (KPP) indexes the feature map by the nodes of the forest, weighted by a path metric that turns each coordinate into a component of a squared-Euclidean path-isometric embedding. KPP unifies four pillars under a single non-diagonal Gram that carries a metric: prediction, exact additive attribution, deterministic Lipschitz robust radius in the KPP metric, and uniform Rademacher risk bounds for regression and classification under fixed, honest, or cross-fit conditioning. All probabilistic guarantees are conditional on the representation and are stated under three explicit conditioning regimes; the robust-radius guarantee is deterministic in the KPP metric rather than in a norm on the raw input. Conjectured fast-rate refinements for both regression and classification are stated as open problems and are not claimed as theorems.

[LG-90] Ensuring Trustworthy Online A/B Testing: Addressing Five Key Questions on CUPED

Link: https://arxiv.org/abs/2606.18750
Authors: Yu Zhang,Bokui Wan,Yongli Qin,Jinyong Ma,Yifan Guo
Categories: Applications (stat.AP); Machine Learning (cs.LG)
Comments: 15 pages, 3 figures

Click to view abstract

Abstract:A/B testing has become the gold standard for data-driven decision-making in large-scale online experimentation, providing critical guidance for feature launch, pricing optimization, and user experience enhancement. To maximize statistical sensitivity, many technology companies routinely employ Controlled-experiment Using Pre-Experiment Data (CUPED), a technique that achieves substantial variance reduction while preserving the unbiasedness of estimating the average treatment effect. Despite its widespread adoption, several critical methodological and practical nuances of CUPED remain underexplored. This paper systematically addresses five frequently encountered yet overlooked questions regarding the application of CUPED. First, we provide a comparative analysis of various post-CUPED estimators to identify the optimal adjustment specification. Second, we evaluate the validity of regression-based adjustments and delineate robust variance estimation methods tailored for such frameworks. Finally, we extend our investigation to complex but common scenarios, including multi-arm experiments and two-stage sampling designs. Our findings reveal that in these settings, naive reliance on standard variance estimators can lead to severely misleading inferences. By offering rigorous theoretical insights and extensive experimental validation, this work deepens the conceptual understanding of CUPED. Notably, the recommended methodologies have been successfully deployed and integrated into ByteDance’s experimentation platform.
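
The core CUPED adjustment mentioned in the abstract can be illustrated with a minimal sketch. It uses the textbook form, in which the outcome is shifted by a multiple of a centered pre-experiment covariate, and it is not the estimator comparison or the variance estimation studied in the paper. The function names, the simulated data, and the pooled estimate of `theta` are assumptions made for illustration.

```python
import random
from statistics import fmean, variance


def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes y - theta * (x - mean(x)).

    theta = cov(y, x) / var(x) is estimated on the pooled sample (an
    assumption of this sketch, not a recommendation from the paper).
    """
    x_bar, y_bar = fmean(x), fmean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]


def diff_in_means(values, treated):
    """Plain difference-in-means estimate of the average treatment effect."""
    t = [v for v, flag in zip(values, treated) if flag]
    c = [v for v, flag in zip(values, treated) if not flag]
    return fmean(t) - fmean(c)


def simulate(n=2000, effect=0.5, seed=0):
    """Hypothetical experiment: the outcome is strongly driven by a pre-period metric."""
    rng = random.Random(seed)
    pre = [rng.gauss(10, 2) for _ in range(n)]
    treated = [i % 2 == 0 for i in range(n)]
    post = [p + rng.gauss(0, 1) + (effect if t else 0.0) for p, t in zip(pre, treated)]
    return pre, post, treated


pre, post, treated = simulate()
adjusted = cuped_adjust(post, pre)
print("variance before:", variance(post), "after:", variance(adjusted))
print("ATE raw:", diff_in_means(post, treated), "ATE CUPED:", diff_in_means(adjusted, treated))
```

Because the covariate is centered, the adjustment leaves the overall mean of the outcome unchanged and only removes the variance that the pre-period metric explains.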

[LG-91] Point-Cloud-Assistant Localized Statistical Channel Prediction by Tangent Gaussian Splatting

Link: https://arxiv.org/abs/2606.18734
Authors: Ye Xue,Yiheng Wang,Xinhua Shao,Qi Yan,Shutao Zhang,Tsung-Hui Chang
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Accurate, site-specific channel information is crucial for optimizing next-generation wireless networks. Among various approaches, localized statistical channel modeling (LSCM), which models the channel multipath angular power spectrum (APS) from the reference signal received power (RSRP) measurement, has emerged as a state-of-the-art method tailored for efficient network optimization. However, despite its effectiveness, LSCM cannot predict APS at the vast majority of locations where no measurements are available, which significantly restricts its applicability in large-scale, real-world scenarios. To address this challenge, we present point-cloud-assisted tangent Gaussian splatting (PC-TGS), the first framework to extrapolate APS to unmeasured outdoor grids by integrating sparse radio measurements with dense LiDAR-based geometry. PC-TGS represents environmental scatterers as anisotropic 3D Gaussians, initialized and refined through a relaxed-mean reparameterization of the raw point cloud. A tangent-plane projection accurately maps each Gaussian into the local angular domain, while a depth-aware electromagnetic splatting process aggregates their contributions. To ensure practical deployment, we derive a closed-form Gaussian-weighted average (GWA) for APS bin integration and provide a provable error bound. Evaluations on a LiDAR-scanned city-scale dataset (5M points, 6,310 RSRP samples) demonstrate that PC-TGS achieves better APS and RSRP prediction performance compared to state-of-the-art baselines and faster inference time for APS extrapolation task. These results highlight the potential of PC-TGS to enable geometry-aware and data-efficient channel prediction in large-scale wireless digital twins.

[LG-92] TimeLAVA: Learning-Agnostic Data Valuation for Time Series

Link: https://arxiv.org/abs/2606.18729
Authors: Wenqin Liu,Weizhi Quan,Aoqi Zuo,Erdun Gao,Vu Nguyen,Dino Sejdinovic,Howard Bondell,Mingming Gong
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 34 pages

Click to view abstract

Abstract:Data valuation quantifies the intrinsic quality of individual samples to enable principled data curation, quality control, and robust learning. For time series in critical domains such as healthcare, finance, and industrial monitoring, effective valuation methods are essential yet fundamentally lacking. Existing approaches are either model-dependent, limiting their generalizability, or designed for i.i.d. data and thus fail to capture temporal dependencies, multi-scale patterns, and non-stationary dynamics inherent to sequential data. We introduce TimeLAVA, a learning-agnostic framework that values temporal segments by their marginal contribution to minimizing distributional discrepancy between evaluated and reference data. At its core is a novel Selective Wavelet-based Wasserstein discrepancy combining multi-scale wavelet transforms for temporal localization with unbalanced optimal transport for robustness to distributional shifts. Segment values are efficiently computed via sensitivity analysis without requiring model training and aggregated into point-wise scores. We provide theoretical guarantees linking valuation to model-agnostic generalization and prove bounded sensitivity to outlier contamination. Extensive experiments across anomaly detection, data pruning, and label noise detection demonstrate that TimeLAVA produces significantly more informative value scores than existing methods on diverse real-world datasets.

[LG-93] Bridging Data Gaps in Structural Fragility Modeling through Transfer Learning: Methodology and Case Studies

Link: https://arxiv.org/abs/2606.18567
Authors: Narges Saeednejad,Jamie Ellen Padgett
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
Comments: 24 pages, 12 figures

Click to view abstract

Abstract:This paper presents a methodology-centered transfer learning framework for fragility adaptation under domain shift, class imbalance, and scarce target labels while preserving engineering interpretability and supporting decision-making under uncertainty. Four transfer learning strategies (instance-based, parameter-based, hierarchical Bayesian, and multi-source) are demonstrated through three complementary case studies: (i) instance-based transfer learning via importance weighting, demonstrated on coastal bridge fragility using Hurricane Katrina observations; (ii) parameter-based transfer learning together with hierarchical Bayesian transfer learning, enabling partial pooling across strata and posterior uncertainty quantification, demonstrated on residential building fragility using Hurricane Ian observations; and (iii) multi-source transfer learning that fuses multiple analytical fragility models with learned source weights and regularized target-domain adaptation, demonstrated on seismic bridge fragility using observations from the 2001 Nisqually earthquake. Across these case studies, direct transfer of source models (i.e. using existing state-of-the-art models) fails under domain shift and severe class imbalance, while targeted adaptation substantially improves failure detection and predictive stability in low-data regimes. These findings highlight the need for systematic guidance on diagnostics, strategy selection, and uncertainty reporting when developing and adapting fragility models.

[LG-94] Shrinkage priors for Bayesian Substitute Confounders

Link: https://arxiv.org/abs/2606.18535
Authors: Yordan P. Raykov,Hengrui Luo,Justin D. Strait,Wasiur R. KhudaBukhsh
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments:

Click to view abstract

Abstract:Multi-cause observational studies contain information about unmeasured confounding through the dependence structure among causes. However, literal imputation of the unobserved confounder is often more complex than learning a lower-dimensional substitute score that preserves the shared assignment variation needed for stable causal adjustment. The deconfounder (Wang and Blei, 2019) and related substitute confounder methods exploit this idea, but flexible assignment models can fit the joint distribution of the causes while producing scores that over-encode the treatment vector, collapse overlap, or capture single-cause variation. We develop a Bayesian factor assignment framework for learning sparse substitute confounders that retain coarse multi-cause dependence with shrinkage priors. The theory is stated at the level of posterior concentration, factor score contraction, and overlap-preserving assignment geometry and therefore does not rely on a particular shrinkage prior. Under these conditions, the proposed regression-adjusted estimators are consistent for mean potential outcomes when the corresponding latent variable identification assumptions hold. Shrinkage priors provide a natural tool for latent structural learning: they favour low-dimensional factors supported by multiple causes, discourage effectively single-cause factors, and induce an ordering of the latent factors through progressive shrinkage. Synthetic experiments illustrate the roles of signal strength, outcome validity, and geometry-aware regularization. In an Alzheimer’s Disease Neuroimaging Initiative (ADNI) baseline analysis, sparse substitute scores recover much of the adjustment obtained by directly conditioning on invasive cerebrospinal-fluid biomarkers, while collapse diagnostics identify when fitted factors reduce to individual observed measurements.

[LG-95] When Does Trajectory-Level Supervision Permit Efficient Offline Reinforcement Learning?

Link: https://arxiv.org/abs/2606.18531
Authors: Xuanfei Ren,Tengyang Xie
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 69 pages

Click to view abstract

Abstract:Offline reinforcement learning is typically analyzed under process-level reward supervision, yet many sequential decision datasets record only trajectory-level outcomes. We develop a statistical theory for offline policy optimization from such outcome-level supervision. We first study the canonical setting where the target remains the expected cumulative reward, but each offline trajectory provides only a scalar label whose conditional mean is the cumulative return. We propose OPAC, a pessimistic actor-critic algorithm that learns a latent reward model and optimizes a policy from trajectory-level labels. We prove a high-probability guarantee of order $\widetilde{O}(H^2\sqrt{C_{sa}(\pi^\star)/n})$ and a matching lower bound, characterizing the sharp statistical cost of replacing process-level rewards with one trajectory-level label. We then extend the principle to preference-based feedback, preserving the leading horizon and concentrability dependence up to preference-model constants. Finally, we study generalized outcome-based offline RL, where both the supervision and the objective are trajectory-level quantities induced by a nonlinear aggregation of latent per-step rewards. This problem is not learnable in general: for all-success objectives, any offline learner may require $\Omega(2^H)$ trajectories even with deterministic transitions and constant concentrability. We then identify a tractable regime through two structural coefficients, $\kappa_\mu(\sigma)$ and $\chi_\mu(\sigma)$, capturing information loss in outcome aggregation and generalized Bellman updates, under which generalized OPAC achieves polynomial sample complexity. Together, our results delineate when outcome-level supervision enables sample-efficient offline control and when missing process-level rewards create fundamental statistical barriers.

[LG-96] Toward Simultaneously Optimal Regret in U-Calibration COLT2026

Link: https://arxiv.org/abs/2606.18527
Authors: Rafael Frongillo,Haipeng Luo,Nishant A. Mehta,Jon Schneider
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 30 pages; to appear at COLT 2026

Click to view abstract

Abstract:U-calibration studies online forecasting algorithms whose predictions can be consumed by any unknown downstream agent, guaranteeing sublinear regret simultaneously for all proper loss functions. Existing U-calibration algorithms achieve worst-case optimal $O(\sqrt{T})$ regret for every bounded proper loss, but they fail to adapt to easier losses: as we show, even for smooth losses such as squared loss, they incur $\Omega(\sqrt{T})$ regret instead of the optimal $O(\log T)$ regret. In this work, we show that this limitation is not inherent. Specifically, we design a single forecast algorithm that simultaneously achieves $\tilde{O}(\sqrt{T})$ regret for every bounded proper loss and $O(\log T)$ regret for every bounded smooth proper loss. More generally, our algorithm also attains logarithmic regret for losses that are smooth relative to the log-barrier, which include several non-Lipschitz examples. Our approach is based on a novel variant of Follow-the-Perturbed-Leader (FTPL) in which perturbations are applied directly in the prediction space using self-concordant noise. The resulting analysis also departs substantially from prior FTPL analyses due to the complex nature of this noise and may be of independent interest.

[LG-97] Exponentially many initializations to avoid barren plateaus

Link: https://arxiv.org/abs/2606.18515
Authors: Ankit Kulshrestha,Ricard Puig,Diego García-Martín,Lukasz Cincio,Ilya Safro,Zoë Holmes,M. Cerezo
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 18 + 27 pages, 5+4 figures, 1 Table

Click to view abstract

Abstract:Barren plateaus are stated as an average-case phenomenon: pick an ansatz, initialize it naively, and concentration follows. This has led to the common view that a potential cure for barren plateaus is simply to initialize the parameters more carefully. Here we show that the situation is subtler. We introduce a first-moment framework that gives a simple operator-level diagnostic for when an initialization may escape the fully concentrated barren-plateau fixed point, and for comparing the biases induced by different initialization strategies. Our framework recovers several known initialization schemes such as identity and Gaussian initialization, but also shows that barren-plateau avoidance is highly non-unique. Indeed, many shifted, biased, and non-symmetric parameter distributions can avoid concentration, and these choices need not be equivalent. In fact, our results show that one can generate exponentially many families of inequivalent initialization strategies. Then, our numerics indicate that different first-moment-distinct initializations can lead to different attained minima, suggesting that avoiding barren plateaus via smart initializations can trade the exponential concentration problem for the challenge of selecting the right trainable pocket amongst many options.

[LG-98] ToolChain-CRC: Conformal Risk Control for Agentic AI Under Retrieval and Tool-Use Drift

Link: https://arxiv.org/abs/2606.18467
Authors: Jeffery Opoku,David Banahene
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 26 pages, 11 figures

Click to view abstract

Abstract:Modern AI agents retrieve documents, call tools, check intermediate information, and then produce a final answer or action. This creates a risk-control problem that is not visible from the final answer alone. A final response may look acceptable even when the retrieval was weak, a tool output was wrong, or an earlier step was unsupported. We propose ToolChain-CRC, a conformal risk-control method for retrieval-augmented and tool-using agents under drift. The method treats each agent run as a full trajectory of actions, observations, and final output. It builds step-level risk scores, combines them into a trajectory risk score, calibrates an accept-or-intervene rule, and adds an anytime alarm that can stop risky runs before the final answer. We prove trajectory-level risk control under exchangeable calibration runs, give a drift-aware extension with auditable constants, and prove an anytime escalation rule through a supermartingale construction. Experiments cover synthetic tool-chain drift, RAG/tool-use stress tests, public SQuAD-derived retrieval tasks, an API-free agentic QA case study, ablations, target-risk sensitivity checks, 20-seed robustness checks, a drift-margin audit, and a live RAG/tool-use agent benchmark. Across these settings, final-answer-only calibration can miss retrieval and tool failures, while trajectory-level calibration keeps accepted-trajectory risk below the target.
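
The "calibrate an accept-or-intervene rule" step can be sketched with the generic conformal risk control recipe for a 0/1 loss under exchangeable calibration runs. This is not ToolChain-CRC itself: the step-level scores, the drift-aware extension, and the anytime alarm are not reproduced, and the function name, the "accept if score <= tau" rule, and the toy data are assumptions made for illustration.

```python
def calibrate_threshold(scores, failures, alpha):
    """Pick the largest acceptance threshold tau that passes the CRC check.

    A run is accepted when its trajectory risk score is <= tau. The loss of a
    calibration run is 1 if it is accepted and it failed, else 0, so the loss
    is bounded by 1 and can only grow as tau grows. The check
    (sum of losses + 1) / (n + 1) <= alpha is the standard finite-sample
    conformal risk control condition for a loss bounded by 1.
    Returns None when no threshold passes (intervene on every run).
    """
    n = len(scores)
    best = None
    for tau in sorted(set(scores)):
        accepted_failures = sum(1 for s, bad in zip(scores, failures) if bad and s <= tau)
        if (accepted_failures + 1) / (n + 1) <= alpha:
            best = tau
        else:
            break  # the loss is monotone in tau, so larger thresholds also fail
    return best


# Hypothetical calibration set: higher score means a riskier trajectory.
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
cal_failed = [False, False, False, False, False, False, True, True, True]
tau = calibrate_threshold(cal_scores, cal_failed, alpha=0.25)
print("accept runs with score <=", tau)
```

At test time a new run would be accepted when its score is at most `tau` and escalated otherwise. Returning `None` makes the "no safe threshold" case explicit instead of silently accepting everything.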

[LG-99] Modeling Doppler Shifts in Radial-Velocity Data with Deep Learning toward Earth-mass Exoplanet Detection

Link: https://arxiv.org/abs/2606.18464
Authors: Isidro Gómez-Vargas,Xavier Dumusque,Yinan Zhao,Khaled Al Moulla,Michael Cretignier
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Earth and Planetary Astrophysics (astro-ph.EP); Machine Learning (cs.LG)
Comments: 20 pages, 14 figures. Accepted for publication in Astronomy & Astrophysics

Click to view abstract

Abstract:Detecting the tiny Doppler shifts induced by Earth-mass planets in stellar radial-velocity measurements remains extremely challenging due to stellar activity. Many deep-learning methods performing well on simulated data remain difficult to apply reliably on real stellar spectra. The aim of this work is to develop a deep-learning framework that generalizes to real, unseen spectra and improves the detectability of Earth-mass planets in radial-velocity data. We train artificial neural networks on HARPS-N solar spectra with injected planetary signals, using physics-motivated spectral representations based on flux and line-formation temperature, together with their velocity gradients. Two training strategies are explored: hold-out testing and cross-validation. Model robustness is enhanced through genetic-algorithm-based hyperparameter optimization, and predictive uncertainty is quantified using Monte Carlo dropout. Our most precise neural network model reliably retrieves, under the cross-validation strategy, the amplitudes, phases, and orbital periods of planetary signals with amplitudes greater than or equal to 25 cm/s and periods between 10 and 550 days. In addition, in all cases tested here, the successfully recovered signals correspond to the most significant peaks in the periodograms of the Doppler-shift predictions. Temperature-based spectral-shell representations consistently outperform flux-based shells. We also release doppleriann, a Python package implementing the proposed framework. Our results demonstrate that combining physically motivated spectral representations with deep learning provides a promising pathway toward the detection of Earth-mass planets in radial-velocity data from real observations, supported by a modeling framework that is both physically grounded and statistically rigorous, incorporating uncertainty quantification and optimized training strategies.

[LG-100] Sequential Hiring of Contingent Workers Through Learning-Based Optimization

Link: https://arxiv.org/abs/2606.18438
Authors: Chris Lee,Xiuli Chao,Izak Duenyas
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In this paper, we study a sequential workforce management problem in a contingent labor setting with uncertainty in both worker production and labor supply. A firm seeks to maximize cumulative profit by maintaining an active team of fixed size while learning worker productivity over time. We emphasize two critical operational frictions in this problem: replacing workers is costly, and workers may not be available immediately for hiring because of, for example, prior job commitments, scheduling constraints, or onboarding procedures. Thus, hiring decisions take effect only after a random delay. We formulate this problem as a stochastic multi-play bandit with costly switching and delayed actions, and develop a learning-based hiring policy, DR-UCB (DelayedReplacement-UCB), that makes replacement and hiring decisions sequentially through learning cycles. In each cycle, the policy uses real-time production data to determine when to initiate workforce changes and which workers to replace and hire. We show that the leading-order regret of the proposed policy matches its lower bound in its dependence on the time horizon. Our numerical experiments show that DR-UCB outperforms benchmark policies.
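
For readers unfamiliar with the bandit framing, the sketch below shows the classical UCB1 index applied to picking a fixed-size team. It is only the generic building block: the switching costs, the random hiring delays, and the learning cycles that define DR-UCB are not modeled, and every name and number here is a hypothetical illustration.

```python
import math


def ucb_index(mean_output, observations, t):
    """Classical UCB1 index: empirical mean plus an exploration bonus.

    Workers that have never been observed get an infinite index so that
    they are tried at least once.
    """
    if observations == 0:
        return math.inf
    return mean_output + math.sqrt(2.0 * math.log(t) / observations)


def select_team(mean_outputs, observations, t, team_size):
    """Return the ids of the team_size workers with the highest UCB index."""
    indices = [ucb_index(m, n, t) for m, n in zip(mean_outputs, observations)]
    ranked = sorted(range(len(indices)), key=lambda i: indices[i], reverse=True)
    return sorted(ranked[:team_size])


# Hypothetical pool of four workers; the last one has never been observed.
team = select_team([0.9, 0.5, 0.1, 0.0], [100, 100, 100, 0], t=300, team_size=2)
print(team)
```

A policy of this kind would replace workers every round, which is exactly what costly switching and delayed availability make undesirable in the setting the abstract describes.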

[LG-101] Pointwise is Pointless? A Multimodal Ablation Study for Precipitation Nowcasting with Graph Neural Networks

Link: https://arxiv.org/abs/2606.18436
Authors: Ophélia Miralles,Máté Mile,Christoffer Artturi,Thomas Nipen,Ivar Seierstad
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Sparse point observations are increasingly available for precipitation nowcasting, but it is unclear how much they improve dense radar-field forecasts. We partially address this question with a multimodal graph neural network nowcasting system over the Nordic radar domain. The model predicts rain rate every five minutes up to two hours ahead and is trained with different combinations of radar history, MEPS numerical weather prediction, Netatmo surface observations, MSG satellite channels, stochastic noise, and CRPS-based ensemble losses. The study is designed as an ablation of operationally relevant information sources and training objectives. We compare radar-only, NWP-informed, station-informed, satellite-informed, noise-augmented, and CRPS-based configurations using complementary diagnostics on the radar grid, at station locations, for rain onset, and through oracle, displacement, and amplitude scores. The results show that each source improves a different part of the forecast problem. MEPS stabilises radar-only extrapolation, Netatmo observations improve local station and onset diagnostics, and satellite predictors reduce some station-level biases but may activate rain too early when used deterministically. CRPS-based configurations provide the most consistent radar-grid gains, while the combined satellite and CRPS setup gives the best overall oracle/DAS score. These results do not support the conclusion that point observations are uninformative for nowcasting, but they show that local observational skill and spatially coherent radar-field skill are distinct targets. The practical implication is that sparse observations can provide useful local constraints, but their benefit for radar-like fields depends on the training loss, uncertainty representation, and how observation support is encoded in the model.
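
The CRPS-based ensemble losses mentioned above build on the continuous ranked probability score. A minimal sketch of the standard empirical ensemble estimator for a single scalar forecast follows. How the paper applies it over radar grids and lead times is not specified in the abstract, so the function name and the example values are assumptions.

```python
def ensemble_crps(members, observation):
    """Empirical CRPS of an ensemble forecast for one scalar observation.

    CRPS = mean_i |x_i - y| - 0.5 * mean_{i,j} |x_i - x_j|.
    The first term rewards accuracy; the second credits ensemble spread.
    """
    m = len(members)
    accuracy = sum(abs(x - observation) for x in members) / m
    spread = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return accuracy - spread


# Hypothetical rain-rate ensemble (mm/h) against an observed 1.0 mm/h.
print(ensemble_crps([0.0, 2.0], 1.0))
```

With a single member the spread term vanishes and the score reduces to the absolute error, which is why CRPS is often read as a probabilistic generalization of MAE.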

[LG-102] Structural MRI Synthesis for Alzheimer’s Disease via Conditional Diffusion on Anatomical Masks

Link: https://arxiv.org/abs/2606.18354
Authors: Muge Zhang,Muhammad Ali Khaliq,Jamal Alsakran,Byeong Kil Lee,Jeeho Ryoo
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Recent advances in generative machine learning models have significantly improved medical imaging, offering promising solutions for data augmentation, privacy preservation, and improved model generalization. However, synthesizing high-quality structural MRI data for Alzheimer’s Disease (AD) remains challenging due to the subtle, region-specific, and progressive anatomical changes associated with neurodegeneration. In this paper, we extend the Med-DDPM conditional diffusion model – originally designed for brain tumor synthesis – to generate 3D structural MRIs specifically tailored to AD. We adopted Med-DDPM due to its established stability and structural fidelity compared to other generative models, which makes it particularly suitable for capturing the subtle anatomical changes characteristic of AD. Our approach conditions the diffusion process on anatomical segmentation masks derived from the ADNI dataset, incorporating key AD-relevant brain structures into the generation process. We systematically evaluate the quality and utility of the synthetic images by training segmentation models on real, synthetic, and hybrid (mixed) datasets. Experimental results demonstrate that segmentation models trained exclusively on synthetic data achieve comparable Dice scores (0.6532) to those trained on real data (0.6513), while exhibiting significantly enhanced recall. Notably, models trained on hybrid datasets (mixing real and synthetic images) outperform both real and synthetic-only baselines, achieving a Dice score of 0.7244. These findings underscore the successful use of conditional diffusion models for generating anatomically accurate, AD-specific synthetic MRIs, and highlight their potential for enhancing training data availability, improving diagnostic accuracy, and promoting research reproducibility in neuroimaging studies.
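
The Dice scores quoted in the abstract (0.6513, 0.6532, 0.7244) measure overlap between predicted and reference segmentation masks. A minimal sketch of the standard definition on flattened binary masks is given below. The paper's exact evaluation protocol (per-structure averaging, 3D handling) is not stated in the abstract, and the convention of returning 1.0 for two empty masks is an assumption of this sketch.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flattened binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement here
    return 2.0 * intersection / total


# Hypothetical four-voxel masks sharing one foreground voxel.
print(dice_score([1, 1, 0, 0], [1, 0, 1, 0]))
```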

[LG-103] Protein-Based Fish Species Identification: Dataset, Models, and Insights from Native Bangladeshi Fish

Link: https://arxiv.org/abs/2606.18302
Authors: Md Nasiat Hasan Fahim,Md. Abid Ullah Muhib,Mohammad Shahidur Rahman
Categories: Other Quantitative Biology (q-bio.OT); Machine Learning (cs.LG)
Comments: Published in 2026 IEEE 2nd International Conference on Quantum Photonics, Artificial Intelligence & Networking (QPAIN). © 2026 IEEE. Personal use of this material is permitted

Click to view abstract

Abstract:Correct identification of fish species is highly significant for food security, economic development, and climate resilience in Bangladesh. Protein sequences directly reflect functional and evolutionary constraints which are important for species authentication and biodiversity monitoring. Yet there exists no benchmark for native Bangladeshi fish species identification from protein sequence. In this study, we addressed this gap by introducing the first curated dataset for nine native Bangladeshi fish species of 2845 high quality protein sequences. We also established the first protein sequence classification baseline for this domain through a systematic benchmarking of seven architectural paradigms. Moreover, we propose a realistic deployable novel hybrid architecture of MotifCNN and Transformer with Terminal-Aware Positional-Encoding (MotifCNN-Transformer+TA-PE). Our novel architecture achieves 79.80% accuracy with macro-F1 of 0.80. The highest 83.04% accuracy is achieved by finetuned protein language model ProtBERT that has 420M parameters and requires dual 16GB GPUs for inference. According to McNemar’s test, ProtBERT’s 3.24% accuracy gain over our MotifCNN-Transformer+TA-PE is statistically insignificant (p = 0.1120). Our novel architecture beats it among six of the nine classes in per class identification. Also our MotifCNN-Transformer+TA-PE is approximately 5x faster, 42x smaller, and supports 16x larger batch size than ProtBERT and has GPU free inference, making it more practical for deployment in resources constrained areas such as rural Bangladesh. Beyond this, our foundational work shows effects of phylogenetic relationships on sequence similarity and establishes pathways for fisheries management, food authentication and biodiversity conservation in South Asia’s protein dependent economy.
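
McNemar's test, cited above for the ProtBERT comparison, looks only at the test samples on which the two classifiers disagree. A minimal sketch of the exact (binomial) two-sided variant is given below. The abstract does not say which variant the authors used or what their discordant counts were, so the counts in the example are hypothetical.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: samples model A classified correctly and model B incorrectly.
    c: samples model B classified correctly and model A incorrectly.
    Under the null hypothesis each discordant sample is a fair coin flip,
    so the smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical discordant counts, not taken from the paper.
print(mcnemar_exact(0, 10))
```

Samples that both models classify correctly or both classify incorrectly carry no information about which model is better, which is why they do not enter the statistic.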

[LG-104] Stochastic Thermodynamics and SDE-based Generative Models

Link: https://arxiv.org/abs/2606.18290
Authors: Yaowen Zhang
Categories: Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:

Click to view abstract

Abstract:SDE-based generative models, including diffusion models and the Schrödinger bridge, have found broad applications in signal processing tasks such as speech enhancement, image restoration, and time-series generation. This note presents a modeling framework for such models within the context of stochastic thermodynamics. The main results of this note are trajectory-level definitions of work, heat, and entropy production, along with a generalized Jarzynski identity and a second-law-like inequality. The proposed framework extends the original Jarzynski setup to accommodate time-dependent bath temperature and nonconservative driving forces. This thermodynamic perspective may deepen our understanding of diffusion models and the Schrödinger bridge from a nonequilibrium statistical mechanics viewpoint.
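
For reference, the classical Jarzynski equality that the note generalizes can be written as follows. This is the standard constant-temperature form with conservative driving, not the paper's extended identity, and the notation ($W$ for trajectory work, $\Delta F$ for the free-energy difference, $\beta$ for inverse temperature) is an assumption of this sketch.

```latex
% Classical Jarzynski equality at fixed inverse temperature \beta = 1/(k_B T):
\[
  \left\langle e^{-\beta W} \right\rangle = e^{-\beta \Delta F}
\]
% Jensen's inequality applied to the convex function e^{-x} then gives the
% second-law-like statement for the average work:
\[
  \langle W \rangle \;\ge\; \Delta F
\]
```

The average is taken over trajectories of the driven process, which is what makes trajectory-level definitions of work and heat the natural starting point.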

[LG-105] A Guide to Estimating Conditional Average Treatment Effects in Competing Risks Settings

Link: https://arxiv.org/abs/2606.18281
Authors: Daniel Klippert,Sarah Friedrich,Markus Pauly
Categories: Applications (stat.AP); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Conditional average treatment effects (CATEs) are central to treatment decision-making in personalized medicine. In competing risks settings, estimating CATEs from survival data allows for patient-specific assessments of treatment effectiveness for a specific event of interest while properly accounting for alternative event types. This distinction is essential in the presence of comorbidities, where competing causes of death may otherwise confound the therapeutic benefit. Focusing on right-censored survival times with binary treatment, we examine CATEs defined as covariate-conditional differences in the absolute risk for the event of interest at a fixed time. To this end, we study meta-learners which adapt machine learning algorithms for CATE estimation in competing risks scenarios. We systematically compare six meta-learners, combining Cox regression or random survival forests for risk modeling with elastic net regression or random forests for direct CATE modeling. To provide practical guidance on model selection, we evaluate their performance in multiple simulation settings, that differ in hazard complexity, treatment heterogeneity, treatment assignment, event type distribution and censoring. To facilitate applied use, we provide the R package, crsurvlearners, which implements all considered approaches.
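
The estimand described in words above, a covariate-conditional difference in absolute risk at a fixed time, can be written out as a sketch. The notation is an assumption made for illustration and may differ from the paper's: $T^{(a)}$ and $\varepsilon^{(a)}$ denote the potential event time and event type under treatment $a \in \{0, 1\}$, event type 1 is the event of interest, and $t^\ast$ is the fixed time horizon.

```latex
\[
  \tau(x) \;=\;
  \Pr\!\big(T^{(1)} \le t^\ast,\ \varepsilon^{(1)} = 1 \,\big|\, X = x\big)
  \;-\;
  \Pr\!\big(T^{(0)} \le t^\ast,\ \varepsilon^{(0)} = 1 \,\big|\, X = x\big)
\]
```

Each term is a cumulative incidence function evaluated at $t^\ast$, so competing events lower the attainable risk instead of being treated as censoring.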

[LG-106] Predicting the Neutrino Mass Ordering Using Neural Networks

Link: https://arxiv.org/abs/2606.03745
Authors: T.J.C. Bezerra,L. Asquith,E. Bannister,W. Shorrock
Categories: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 11 pages, 7 figures

Click to view abstract

Abstract:Determining the neutrino mass ordering remains a central open problem in particle physics. While next-generation long-baseline experiments are expected to resolve this question, current data provide limited sensitivity because the spectral differences between normal and inverted ordering are subtle and entangled with parameter degeneracies. We investigate a machine-learning strategy for mass-ordering determination using a feed-forward neural-network classifier trained on synthetic long-baseline datasets generated with three-flavour oscillation probabilities, matter effects, and statistical fluctuations. We evaluate the classifier against standard $\chi^2$ and $\log\mathcal{L}$ approaches using common discrimination metrics, including receiver-operating-characteristic curves, to quantify sensitivity and to illustrate how operating points can be selected to prioritise purity or efficiency. We find that the neural network achieves performance comparable to conventional fits for the scenarios studied, providing a flexible, independent cross-check of established analyses. The framework can be extended to incorporate systematic uncertainties and to explore joint inference of oscillation parameters, and it may also serve as a pedagogical tool for introducing machine-learning methods in neutrino physics.

Attachment download

Click to download today's full paper list