本篇博文主要内容为 2026-07-03 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR、MA六个大方向区分。
说明:每日论文数据从Arxiv.org获取,每天早上12:30左右定时自动更新。
提示: 当天未及时更新,有可能是Arxiv当日未有新的论文发布,也有可能是脚本出错。尽可能会在当天修复。
目录
概览 (2026-07-03)
今日共更新693篇论文,其中:
- 自然语言处理共94篇(Computation and Language (cs.CL))
- 人工智能共228篇(Artificial Intelligence (cs.AI))
- 计算机视觉共155篇(Computer Vision and Pattern Recognition (cs.CV))
- 机器学习共181篇(Machine Learning (cs.LG))
- 多智能体系统共13篇(Multiagent Systems (cs.MA))
- 信息检索共12篇(Information Retrieval (cs.IR))
- 人机交互共22篇(Human-Computer Interaction (cs.HC))
多智能体系统
[MA-0] What LLM Agents Say When No One Is Watching: Social Structure and Latent Objective Emergence in Multi-Agent Debates
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在社会结构化情境下,其公开表达行为是否会因角色、受众及关系语境的影响而与私下非公开(off-the-record, OTR)表达产生系统性差异的问题。研究发现,在缺乏显式目标提示的情况下,社会结构本身即可引发代理在公共渠道与私密渠道之间出现显著的表达分歧。其解决方案的关键在于提出一种双通道辩论框架(dual-channel debate framework),通过并行生成公开言论与仅记录不共享的私密回应,实现对代理在不同社交情境中行为策略的分离测量。实验结果显示,经过对齐优化的模型在不同场景下表现出高达约40%的决策分歧,显著高于3%的基线水平,且该现象在立场分析、语义相似性、自然语言推理及问卷调查等多重评估维度上均具一致性。部分私密反馈明确指出,公共表达的妥协源于职业风险或赞助义务等关系压力。研究强调,对代理的评估应超越显式任务目标,识别其潜在涌现的目标,并提出基于双通道机制的评估框架及配套行为度量方法,以更全面地揭示代理在复杂社会环境中的行为动态。
链接: https://arxiv.org/abs/2607.02507
作者: Arman Ghaffarizadeh,Danyal Mohaddes,Aliakbar Izadkhah,Shahriar Noroozizadeh
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:
Abstract:LLM agents will increasingly act in socially structured settings where role, audience, and relational context can shape what is advantageous or costly to say. We study whether such social structure, without any explicit objective in the prompt, changes what an agent expresses publicly relative to an off-the-record (OTR) channel elicited under the same condition. We introduce a dual-channel debate framework in which agents produce public utterances that enter the shared history alongside OTR responses that are recorded but never shown to the other participant. Across 10 models, 3 scenarios, and 5 variations within each scenario, alignment-inducing settings produce systematic public-OTR divergence in the targeted agent, with its decision divergence rising from a \sim 3% baseline to roughly 40%. The effect is consistent across four aggregate analyses: stance, semantic similarity, natural language inference, and survey responses. In some cases, the OTR response explicitly attributes public accommodation to relational pressures, such as career risk or sponsorship obligation. The findings suggest that agent evaluation should extend beyond explicit goals and detect emergent objectives. We present a dual-channel evaluation framework and complementary behavioral measures that operationalize this assessment.
[MA-1] Adoption and Ecosystem Health: A Longitudinal Analysis of Open-Source Multi-Agent Frameworks
【速读】:该论文旨在解决开源人工智能代理(AI agent)框架在快速兴起背景下,工程团队在选择框架时面临的决策困境。当前,框架的流行度常被误用为质量或可持续性的指标,而以GitHub星标数等“热度信号”为代表的表面数据难以真实反映生态系统的健康状况。其解决方案的关键在于提出一套更科学、多维度的评估体系:通过分析贡献者密度(contributor density)、跨框架协作程度以及初期贡献者的留存率,揭示生态系统的实际采纳深度与持续性。研究发现,高星标数往往反映短期炒作而非真实社区参与,真正健康的生态系统表现为较高的贡献者密度和跨框架协作能力(如LangChain作为共享基础设施吸引82.5%的跨生态贡献者),且用户留存率在前30天急剧下降后于约90天趋于稳定。因此,论文强调应以贡献者密度、跨生态参与度和留存率取代单一星标数,作为框架选型的可靠依据。
链接: https://arxiv.org/abs/2607.02453
作者: Xi Zhang(Cisco Systems),Papi Menon(Cisco Systems),Vivian Chu(Cisco Systems),Koray Cosguner(Indiana University)
机构: 未知
类目: Multiagent Systems (cs.MA)
备注: 24 pages, 10 figures
Abstract:Since ChatGPT’s launch in November 2022, open-source agentic frameworks have proliferated, making framework selection important for engineering teams while obscured by popularity signals such as GitHub stars. This paper analyzes 15 major open-source AI agent framework repositories from late 2022 to early 2026, using 808,042 stars, 73,997 pull requests, 86,241 commits, and 987,330 user profiles to assess ecosystem health across awareness, adoption, and retention. Three findings emerge. First, headline popularity is unreliable. Star counts reflect hype cycles and inorganic activity. AutoGPT gained 111,967 stars in one month but converted fewer than 9 contributors per 1,000 stars, defined as contributor density in this research, compared with LangChain’s 41. Lower-profile frameworks such as Pydantic-AI show higher contributor density, indicating deeper adoption. Second, mapping awareness against adoption shows that visibility and engagement diverge. MetaGPT and LangFlow have contributor density ratios below 5 even with their high visibility. Openai-agents-python’s limited contributor base suggests institutional backing alone does not ensure community depth. By analyzing cross-framework contribution, we discover that LangChain functions as a shared infrastructure, attracting 82.5% of cross-ecosystem contributors. Third, retention drops most steeply in the first 30 days of initial contribution and stabilizes near 90 days. Overall, ecosystem health is better measured by contributor density, cross-ecosystem engagement, and retention than by stars alone. These metrics offer teams a more robust basis for framework evaluation.
[MA-2] Agents CAD: Automated Design for Manufacturing of FDM Parts via Multi-Agent LLM Reasoning and Geometric Feature Recognition
【速读】:该论文旨在解决增材制造(Additive Manufacturing, AM)中基于熔融沉积成型(Fused Deposition Modeling, FDM)工艺的零件在打印过程中面临的可制造性问题,具体包括陡峭悬垂结构导致的打印失败、结构完整性不足及后续加工成本增加等挑战。现有切片软件虽能识别如悬垂角超过45°等缺陷,但无法对原始几何体进行自适应修改,导致设计迭代依赖人工经验。其解决方案的关键在于提出AgentsCAD——一个基于多智能体系统(multi-agent system)的自动化设计优化框架,通过将边界表示(Boundary-Representation, B-Rep)几何数据与大语言模型(Large Language Model, LLM)的推理能力相结合,实现从几何缺陷检测到语义化设计建议的闭环流程。该系统首先解析STEP文件,利用面邻接拓扑图分析几何特征,并结合在MFCAD++数据集(59,665个零件)上训练的GraphSAGE模型注入语义标签,随后由Claude Sonnet智能体生成重定向、倒圆角、倒斜角等针对性改进策略;再由GPT-4o视觉-语言验证器通过渲染视图检查几何合理性,最终输出修正后的STEP文件与可读报告。实验以鸟屋模型为例,验证了系统能够准确识别悬垂缺陷、选择合适的缓解策略并提出物理上可行的修改方案,有效缓解了生成式人工智能在CAD修改中“几何-语言”语义映射的核心难题。
链接: https://arxiv.org/abs/2607.02448
作者: Emmanuel George,Christopher Keefe,Peter Pak,Amir Barati Farimani
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Multiagent Systems (cs.MA)
备注:
Abstract:Parts manufactured with Fused Deposition Modeling (FDM) often require Design for Additive Manufacturing (DFAM) modifications to ensure printability, structural integrity, and reduced post-processing. Current slicers identify defects such as steep overhangs but are unable to modify the underlying geometry. This work presents AgentsCAD, a multi-agent system that bridges raw boundary-representation (B-Rep) geometry and Large Language Model (LLM) reasoning to automate targeted DFM. The workflow begins by parsing a STEP file. The agentic system detects overhangs above a 45°threshold, constructs a face-adjacency topology graph, and optionally injects semantic feature labels from a GraphSAGE model trained on MFCAD++ (59,665 parts), before dispatching a Claude Sonnet design-reasoning agent that recommends reorientations, fillets, chamfers, and similar modifications. A GPT-4o vision-language verifier inspects rendered views to confirm geometric integrity. Outputs include a modified STEP file and a human-readable report. A test case on a birdhouse model demonstrates that the system correctly diagnoses overhangs, selects appropriate defect mitigation strategies, and proposes physically valid corrections, partially solving the geometry-to-language translation problem central to LLM-driven CAD modification.
[MA-3] Hardware-Enforced Semantic Coordination for Safety-Critical Real-Time Autonomous Systems
【速读】:该论文旨在解决复杂自主系统中异构组件在不确定性环境下实现安全关键型实时协同的难题,尤其针对现有软件中介协调机制在保障确定性时序同步、可验证协调行为及强制安全约束方面存在的根本性局限。其核心解决方案是提出一种基于现场可编程门阵列(FPGA)的硬件强制语义协同架构,将选定的语义协调机制(源自基于主题的通信空间佩特里网,TB-CSPN)直接映射至硬件层面,构建硬件原生的语义协同层。该方法的关键在于通过硬件实现时间同步、语义门控、授权约束与有界协调行为,确保协同过程的确定性与可验证性,同时保持语义推理的灵活性与软件驱动特性,从而在不牺牲系统适应性的同时,实现对安全性和实时性的强保障。
链接: https://arxiv.org/abs/2607.02376
作者: Uwe M. Borghoff,Paolo Bottoni,Remo Pareschi
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 1 figure, 6 pages
Abstract:Recent advances in agentic AI are producing increasingly complex autonomous systems that integrate large language models, world models, optimization engines, specialized neural architectures, autonomous platforms, and human operators. While much current research focuses on improving reasoning capabilities, safety-critical real-time deployment also requires bounded and verifiable coordination among heterogeneous components operating concurrently under uncertainty. Software-mediated coordination presents fundamental limitations in domains where bounded latency, deterministic coordination, and enforceable safety guarantees are essential. Hence, we propose a hardware-enforced semantic coordination architecture in which selected coordination semantics are implemented directly at the hardware level via field-programmable gate arrays (FPGAs). The approach builds on the Topic-Based Communication Space Petri Net (TB-CSPN) framework, which separates semantic reasoning from interaction management. In this approach, selected TB-CSPN coordination mechanisms are mapped onto FPGA primitives, creating a hardware-native semantic coordination layer. Focus is not on acceleration, but on enforcing temporal synchronization, semantic gating, authorization constraints, and bounded coordination behavior directly in hardware. Semantic reasoning remains adaptive and software-driven, while embedded coordination semantics become deterministic. Comments: 1 figure, 6 pages Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA) Cite as: arXiv:2607.02376 [cs.AI] (or arXiv:2607.02376v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2607.02376 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[MA-4] Securing People and their Machines Against Major Faults
【速读】:该论文旨在解决去中心化草根平台(grassroots platforms)在面临私钥丢失和/或智能手机损毁等重大故障时的不可恢复性问题。这类平台由用户自选公钥标识身份并依托其设备(如智能手机)运行,缺乏全局性的资源支持以实现故障恢复。其核心挑战在于如何在无中心化恢复机制的前提下,保障身份与状态的可恢复性。解决方案的关键在于构建一个基于多方协作的去中心化恢复框架,包含三个核心组件:(ⅰ)由用户自主建立与维护的草根社交图谱(grassroots social graph),用于表达信任关系;(ⅱ)由个人指定的身份托管人(identity custodians),负责在身份变更时达成共识;(ⅲ)平台特定的状态托管人(state custodians),负责维护社交图谱的状态一致性。当用户遭遇身份丢失时,只要获得其身份托管人中的超多数同意,其所有好友即可在社交图谱中协同更新公钥并重建信任关系,同时所有好友作为状态托管人共同维护系统状态。整个密钥更换过程通过链下(off-chain)方式完成,包括生成新密钥对、获取新设备及说服托管人授权变更。对于仅设备丢失而私钥仍存的情况(如手机被毁或内存清空),恢复仅需状态托管人的协助即可完成。作者将社交图谱及其安全版本形式化为受保护的多智能体原子事务,并基于“意愿型代理”(volitional agents)模型,在最终同步的消息传递环境中实现了该系统,证明了其在可恢复故障场景下能正确映射到规范行为。该方法同样适用于草根代币与债券等金融原语,揭示了状态恢复的通用核心机制与平台特异性设计之间的平衡,确保货币的单写者日志可精确恢复,主权恢复后不会发生双重支出。
链接: https://arxiv.org/abs/2607.02304
作者: Ohad Eitan,Idit Keidar,Ehud Shapiro
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Cryptography and Security (cs.CR); Multiagent Systems (cs.MA)
备注:
Abstract:We consider grassroots platforms – distributed systems of agents consisting of people identified by self-chosen public keys and their machines (smartphones) – and wish to make them secure against \emphmajor faults: the loss of their private keys and/or their smartphones. As grassroots platforms have no global resource to rely on for recovery, our peer-based solution is based on: (\ia) \empha grassroots social graph in which agents establish and maintain friendships; (\ib) \emphidentity custodians, designated by each person, and (\ic) \emphstate custodians, which are grassroots platform-specific. Upon a person experiencing identity loss, and given a willing supermajority of the identity custodians of the person, the friends of the person replace the old public key with the new one across the graph and restore friendships, where all friends serve as state custodians for the social graph. Choosing a new keypair, obtaining a new smartphone, and convincing identity custodians to will a change of key all happen ``off-chain’'. Recovery from machine loss without loss of key (e.g. smartphone run over by truck, or its memory wiped) is simpler, requiring only the help of state custodians. We specify the social graph and its secure version as guarded multiagent atomic transactions, and implement the secure social graph via communicating volitional agents, an eventually synchronous message-passing model one step closer to implementation. We prove the implementation maps runs with recoverable faults to correct runs of the specification. We follow a similar path for grassroots coins and bonds, showing a common core as well as the platform-specific aspects of state recovery: a currency’s single-writer log is recovered exactly, the recovered sovereign resuming without double-spending. Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Cryptography and Security (cs.CR); Multiagent Systems (cs.MA) Cite as: arXiv:2607.02304 [cs.DC] (or arXiv:2607.02304v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2607.02304 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[MA-5] CausalSteward: An Agent ic Divide-Conquer-Combine Copilot for Causal Discovery
【速读】:该论文旨在解决高维数据中因果模型学习的挑战,尤其针对现实场景下因核心假设被违反而导致的因果可识别性问题。尽管存在大量蕴含丰富因果信息的先验知识,但如何有效将其融入因果发现过程仍是一个开放问题。其解决方案的关键在于提出一种名为CausalSTeward(CAST)的人机协同框架,采用多智能体协作机制,通过“分而治之”的策略对大规模变量簇进行迭代分割与独立分析;同时结合检索增强生成(Retrieval-Augmented Generation, RAG)和条件独立性检验等定制化工具,实现先验知识与数据驱动方法的深度融合,从而提升复杂因果结构的构建能力与可信度。
链接: https://arxiv.org/abs/2607.01936
作者: Nicholas Tagliapietra,Gian Lorenzo Marchioni,Moritz Willig,Juergen Luettin,Lavdim Halilaj,Kristian Kersting
机构: Bosch Center for Artificial Intelligence (博世人工智能中心); TU Darmstadt (达姆施塔特工业大学); Hessian Center for AI (黑森州人工智能中心); German Research Center for AI (德国人工智能研究中心)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:
Abstract:Learning causal models from high-dimensional data is a significant challenge, particularly in real-world settings where violations of core assumptions lead to causal identifiability issues. Although massive amounts of prior knowledge are available, and contain valuable causal information, effectively integrating this knowledge into the causal discovery process remains an open problem. We introduce CausalSTeward (CAST), a novel human-in-the-loop framework for interactively assembling large causal models. CausalSteward is a multi-agent collaborative system that tackles high-dimensional causality through a divide-and-conquer approach where large clusters of variables are iteratively partitioned and then separately analyzed. Our framework fuses prior knowledge with a data-driven approach by using tailored tools such as retrieval augmented generation and conditional independence tests. Finally, we use this work to examine the capabilities and limitations of causal reasoning in multi-agent frameworks, and how the human-in-the-loop can contribute to accurate and trustworthy results.
[MA-6] Congestion-Based Slot Pricing in a Railway Auction Game
【速读】:该论文旨在解决在去监管背景下,如何公平、高效地分配具有竞争性且易发生拥堵的离散资源(如铁路运行时段)的问题,尤其关注异质性战略主体(不同规模与能力的运营商)之间的博弈行为。核心挑战在于大型运营商可能利用其资源优势实施策略性垄断,从而压制小型参与者并扭曲资源配置效率。论文提出的关键解决方案是设计一种结合基于拥堵的基价机制(congestion-based base price)与非对称校正调整机制(asymmetric corrective adjustment)的拍卖机制:前者随总体需求上升而提高价格以反映资源稀缺性,后者则通过惩罚请求最多资源的主体、奖励请求最少的主体来抑制大机构的策略性过度索取。该机制在保持透明性和拥堵敏感性的同时,试图缓解大型代理的战略主导地位。实验通过一个实时、基于网络的多智能体系统进行验证,由领域专家作为真实人类代理参与多轮交互。初步结果显示,尽管机制能有效响应整体需求并触发校正激励,但代表大型运营商的代理仍持续采取高申请策略,表明仅靠价格校正不足以完全消除战略优势。此外,事后访谈显示决策主要受角色身份驱动而非个人特质,支持了长期战略动机(如维持市场存在、抬高对手成本)的存在,这与短期利润最大化并存。研究为不对称预算下的多智能体机制设计提供了重要启示,并指出了未来需开展更系统的分析验证与更大规模实验的方向。
链接: https://arxiv.org/abs/2607.01822
作者: Bill Roungas,Sebastiaan Meijer
机构: 未知
类目: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT); Theoretical Economics (econ.TH)
备注: 13 pages, 2 figures, presented in ISAGA 2026
Abstract:We present a multi-agent system for studying the allocation of discrete, congested resources among heterogeneous strategic agents, motivated by the problem of railway slot allocation under deregulation. Multiple operator-agents, differing in size and capacity, interact through a shared auction mechanism over repeated rounds under time-constrained decision-making. The mechanism combines a congestion-based base price that increases with aggregate demand with an asymmetric corrective adjustment that penalises the agent requesting the most slots and rewards the agent requesting the fewest, and is designed to mitigate strategic dominance by large agents while preserving transparency and congestion sensitivity. We formulate the interaction as a repeated game with incomplete information and implement the system as a real-time, web-based multi-agent environment in which human participants control individual agents and observe live marginal-cost and competitor feedback. We report exploratory observations from two structured sessions with domain experts acting as operator-agents. The congestion mechanism responds to aggregate demand as designed and the corrective incentives are actively triggered, but agents representing large operators persist with high-request strategies despite the penalty, suggesting that corrective pricing is necessary but not sufficient to neutralise strategic dominance in this multi-agent setting. A post-session debrief indicates that participants’ decisions were driven by the assumed agent role rather than personal disposition, and provides qualitative support for strategic motives, such as preserving market presence and raising rivals’ costs, operating alongside short-term profit maximisation. We discuss implications for multi-agent mechanism design under asymmetric budgets and outline directions for analytical validation and larger-scale multi-agent experiments. Comments: 13 pages, 2 figures, presented in ISAGA 2026 Subjects: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT); Theoretical Economics (econ.TH) Cite as: arXiv:2607.01822 [cs.MA] (or arXiv:2607.01822v1 [cs.MA] for this version) https://doi.org/10.48550/arXiv.2607.01822 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[MA-7] Mechanism and Stability Analysis of Metabolic Closed-Loop Metaheuristics
【速读】:该论文旨在解决代谢多智能体优化器(Metabolic Multi-Agent Optimizer, MMAO)在框架层面的理论解释问题,即其核心机制——私有能量、公共预算、角色漂移与生命周期更替构成的代谢资源循环——是否具有超越叙事隐喻的数学与系统性框架意义。其解决方案的关键在于构建一个通用的MMAO状态模型,该模型抽象化了领域特定的动作算子,同时保留定义框架的核心资源记账机制。在适度的有界增益与有界支出假设下,论文证明了私有能量、公共预算、角色状态及活跃种群规模的有界性与非负性,并揭示了该资源循环所诱发的三种内生行为模式:持续资源短缺下的收缩、公共积累盈余时的再投资,以及跨智能体或子群体异质边际回报下的搜索再分配。研究采取保守立场,不主张整个自适应系统的全局收敛性、对专用优化器的普遍优越性,或过程的完整稳态刻画,而是聚焦于识别哪些内部调控特性是该循环的普适结果,哪些仍依赖具体实现。通过代表性连续与离散MMAO实例的紧凑机制验证包提供了支持性实证证据,但未替代全面基准测试。因此,本文贡献在于建立了一个有界、可再生、资源调控的MMAO框架解释,而非对算法全族所有自适应行为的完整证明。
链接: https://arxiv.org/abs/2607.01551
作者: Jinliang Xu,Liping Ma
机构: The Seventh Medical Center of Chinese PLA General Hospital (中国人民解放军总医院第七医学中心)
类目: Neural and Evolutionary Computing (cs.NE); Multiagent Systems (cs.MA)
备注:
Abstract:This paper studies the Metabolic Multi-Agent Optimizer (MMAO) at the framework level rather than at the implementation or benchmark level. The central question is whether the metabolic resource loop of private energy, communal budget, role drift, and lifecycle turnover has a framework-level interpretation beyond narrative metaphor. We introduce a generic MMAO state model that abstracts away domain-specific move operators while retaining the resource bookkeeping that defines the framework. Under mild bounded-gain and bounded-spending assumptions, we establish boundedness and nonnegativity properties for private energy, communal budget, role state, and active population size. We then characterize three endogenous behavioral regimes of the loop: contraction under sustained resource deficit, reinvestment under surplus communal accumulation, and search redistribution under heterogeneous marginal returns across agents or subgroups. The analysis is intentionally conservative. It does not claim global convergence of the full adaptive system, universal superiority over specialist optimizers, or a complete stationary characterization of the resulting process. Instead, it identifies which internal regulation properties are generic consequences of the loop and which remain implementation specific. A compact mechanism-validation package on representative continuous and discrete MMAO realizations provides supporting empirical evidence for this reading, but is not intended to replace a full benchmark study. The resulting contribution is therefore a bounded, regenerative, resource-regulated interpretation of MMAO, rather than a complete proof of all adaptive behaviors of the full algorithm family.
[MA-8] MMAO-Cls: Metabolic Multi-Agent Optimization for Joint Feature Selection and Classifier Tuning
【速读】:该论文旨在解决分类模型选择中如何有效平衡模型精度与复杂度的问题,特别是在混合搜索空间(包含特征子集选择与超参数优化)下实现高效、紧凑的全局优化。其核心挑战在于如何在有限计算资源下,同时优化特征冗余性与模型性能,并避免过拟合。解决方案的关键在于提出MMAO-Cls,一种基于代谢多智能体优化(Metabolic Multi-Agent Optimizer, MMAO)的混合空间框架:每个智能体联合编码二值特征掩码与分类器超参数,通过模拟生物代谢中的私有能量、公共资源、角色漂移与生命周期更替等机制,将这些动态行为映射至封装式学习中的精度-复杂度权衡。创新点包括从特征信息先验推导出特征预算自适应策略,以及通过引入子集紧凑性与训练-验证过拟合差距对验证奖励进行正则化,从而增强搜索过程的稳定性与泛化能力。实验在七个标准表格数据集上以三组随机种子评估,结果显示MMAO-Cls在平均测试性能上优于随机搜索(RandomSearch)和GA-lite,接近PSO-lite及无共享基线,且在所有方法中使用最紧凑的特征子集(平均特征比0.4881),表明其在保持高性能的同时具备更强的特征压缩能力。尽管统计检验显示差异尚未显著,但结果支持其作为分类任务中可行的外层优化器,尤其在混合空间搜索与特征精简方面表现突出。
链接: https://arxiv.org/abs/2607.01539
作者: Jinliang Xu,Liping Ma
机构: The Seventh Medical Center of Chinese PLA General Hospital (中国人民解放军总医院第七医学中心)
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:
Abstract:This paper studies whether the Metabolic Multi-Agent Optimizer (MMAO) can act as a credible outer-loop optimizer for classification model selection. We propose MMAO-Cls, a mixed-space realization in which each agent jointly encodes a binary feature mask and classifier hyperparameters, while private energy, communal budget, role drift, and lifecycle turnover are mapped to the accuracy-complexity tradeoff of wrapper learning. The implementation is strengthened by deriving feature-budget adaptation from feature-information priors and by regularizing validation reward with both subset compactness and train-validation overfitting gap. We evaluate MMAO-Cls on seven standard tabular benchmarks with three seeds each and compare it against RandomSearch, GA-lite, PSO-lite, and an endogenous no-sharing ablation. On the aggregate validation objective, MMAO-Cls ranks second ( 0.9433 ) behind GA-lite ( 0.9446 ). On held-out test performance, it reaches mean score 0.8882 , improving over RandomSearch ( 0.8808 ) and GA-lite ( 0.8857 ), remaining close to PSO-lite ( 0.8874 ) and the no-sharing ablation ( 0.8900 ), while using the most compact mean held-out feature subset among all compared methods (feature ratio 0.4881 ). Pairwise tests show that these margins are not yet statistically significant. The resulting claim is therefore conservative: MMAO-Cls supports classification applicability and compact mixed-space search more clearly than it isolates communal sharing as a decisive standalone advantage.
[MA-9] Simulation Based Reward Function Validation for Multi-Agent On Orbit Inspection
【速读】:该论文旨在解决轨道上多艘检测航天器协同作业时的智能控制问题,传统方法中采用的奖励函数仅针对预设有限个固定检测点进行优化,限制了任务的灵活性与适应性。为此,本文提出一种基于轨道目标三维重建分析的广义奖励函数,使多智能体强化学习(MARL)系统能够评估任意位置、任意数量的图像数据质量,从而实现对成像时机与位置的完全自主决策。该解决方案的关键在于构建可泛化的奖励机制,不仅提升了检测任务的灵活性和效率,还为后续在非MARL框架下的空间物体检测任务提供了可复用的最佳实践指导。
链接: https://arxiv.org/abs/2607.01367
作者: Patrick Quinn,Bala Prenith Reddy Gopu,George M. Nehma,Madhur Tiwari
机构: Florida Institute of Technology (佛罗里达理工学院)
类目: Multiagent Systems (cs.MA); Robotics (cs.RO)
备注: 13 pages, 6 figures. This submission integrates a published correction made to the original manuscript. The DOIs for both the original manuscript as well as the correction are provided
Abstract:A proposed method for the control of groups of inspection spacecraft is Multi-Agent Reinforcement Learning (MARL). While MARL has already been employed for this purpose in previous work, the reward functions used focus on reaching a finite set of predetermined inspection points around the target. In this work, we study and develop a generalized reward function for the MARL inspection task informed by the analysis of 3D reconstructions of inspected objects in orbit. Because the reward function is generalized such that any number of images at arbitrary locations may evaluated, we also allow trained agents to have complete control over when images are collected. With this approach, we gather insights into best practices for not only the specific MARL inspection task, but also gain key takeaways informative to the broader inspection task outside of a MARL context.
[MA-10] Cache Merging as a Convergent Replicated State for Multi-Agent Latent Reasoning
【速读】:该论文旨在解决多智能体(multi-agent)推理中因缓存合并方式导致的非交换性问题,即传统方法(如BagMerge)在拼接各智能体的键值缓存(KV-cache)时,其结果依赖于输入顺序,且最优顺序随推理场景、潜伏步数预算及模型规模变化而不可预测。其核心解决方案是提出CanonicalMerge,通过固定内容布局实现可收敛的复制状态(replicated state):首先,基于中间层键向量的均值范数对缓存进行排序,确保任意输入顺序下的合并结果字节完全一致;其次,将状态设计为基于内容寻址的潜在片段集合,其合并操作定义为集合并集(set union),满足CvRDT性质(交换律、结合律、幂等性、吸收性),从而保证了结构上的确定性与容错性。由于渲染结果字节等价,所有精度指标可直接复用,重复项被吸收而非重复拼接。实验表明,CanonicalMerge在分区推理基准上无需预知最优顺序即可达到最佳BagMerge性能,仅以微小但统计不显著的精度损失换取全局结构保障,并在真实多文档问答任务(HotpotQA)中超越无训练输出融合基线(PackLLM)45分,证明缓存级合并已进入与输出级融合不同的有效范式。此外,该方法虽能传输和共置潜在轨迹(latent traces),但尚未实现其组合,为此类未来工作提供了明确方向。
链接: https://arxiv.org/abs/2607.01308
作者: Carlos Baquero,Luís Brito
机构: Faculdade de Engenharia (FEUP), Universidade do Porto (UP), Portugal; Escola Superior de Tecnologia e Gestão (ESTG), Politécnico de Viana do Castelo (IPVC), Portugal
类目: Multiagent Systems (cs.MA)
备注:
Abstract:Multi-agent latent reasoning composes agents’ KV-caches into one context for a final agent. Prior work (Agent Primitives) does this by concatenating caches along the sequence axis with RoPE re-encoding, which we call BagMerge. BagMerge is non-commutative, and the best input ordering is unpredictable, shifting with the regime, the latent-step budget, and the model scale. We make this exchange a convergent replicated state. First, CanonicalMerge fixes the layout by content: ordering caches by mean K-norm at a middle layer renders the merged cache byte-identical under any input permutation, verified algorithmically (arity N=5) and bit-for-bit on real Qwen3-1.7B and 4B state. Second, we separate the replicated state from decode-time layout: the state is a set of content-addressed latent fragments whose merge is set union, a state-based CvRDT (commutative, associative, idempotent, absorbing), and CanonicalMerge is its deterministic render. Because the render is byte-equivalent, every N=2 accuracy number carries over unchanged and re-delivered duplicates are absorbed rather than re-concatenated. On a partitioned-reasoning benchmark, CanonicalMerge matches the best BagMerge ordering in every regime-by-budget-by-ordering cell without knowing which order is best, trading a small, statistically insignificant accuracy margin for an unconditional structural guarantee. The behaviour transfers to real multi-document QA (HotpotQA), while the closest training-free output-fusion baseline (PackLLM) loses by 45 points at matched budget, placing cache-level merging in a regime distinct from output-level fusion. Finally, at k2 the approach transports and colocates latent traces but does not by itself compose them, which we characterize to motivate future work.
[MA-11] Beyond Line of Sight: Hybrid Validation of V2X Collective Perception in Complex Scenarios
【速读】:该论文旨在解决复杂交通场景下车联网(V2X)赋能的集体感知(Collective Perception, CP)系统在感知范围受限与可信度不足的问题。传统单车智能感知受限于本车传感器的视域,难以应对遮挡、盲区等挑战,且缺乏对感知不确定性进行量化表达的能力。为此,论文提出一种基于贝叶斯融合的混合验证方法,其核心在于构建一个共享的、概率化的占用栅格(probabilistic occupancy grid),通过融合多智能体异构传感器数据,将每个栅格单元的状态表示为包含占据概率与不确定性的联合分布,从而实现超越本车视域的可解释、可信赖的情境感知。该方案的关键创新在于:一方面,利用贝叶斯推理框架实现多源异构感知信息的高效融合,显著扩展感知边界;另一方面,构建结合CARLA虚拟仿真与“车辆在环”(vehicle-in-the-loop)实车测试的混合验证框架,有效弥合仿真与真实世界评估之间的鸿沟。实验结果表明,在环形交叉口场景下,六智能体协同感知使视域覆盖率提升260%,占据单元召回率从单车状态下的0.82提高至0.94,验证了该方法在提升感知性能与系统可验证性方面的有效性。
链接: https://arxiv.org/abs/2607.00874
作者: Markos Antonopoulos,Anastasia Bolovinou,Bill Roungas,Elena Daskalaki,Angelos Amditis
机构: 未知
类目: Robotics (cs.RO); Multiagent Systems (cs.MA)
备注: 6 pages, 4 figures, to be presented in ITS World 2026
Abstract:This paper introduces a probabilistic framework and hybrid validation methodology for V2X-enabled Collective Perception (CP) in complex traffic scenarios. The proposed Bayesian fusion algorithm extends the perceptual horizon of connected and autonomous vehicles by integrating heterogeneous sensor observations from multiple agents into a shared probabilistic occupancy grid. Each cell of this grid encapsulates both occupancy likelihood and uncertainty, enabling explainable and trustworthy situational awareness beyond the ego vehicle’s field of view. To bridge the gap between simulation and real-world evaluation, a hybrid testing framework is developed, combining CARLA-based virtual environments with vehicle-in-the-loop experimentation. Experimental results in a roundabout scenario demonstrate a 260 percent increase in field-of-view coverage and a rise in occupied-cell recall from 0.82 (ego-only) to 0.94 (six-agent CP) under nominal localization conditions. Overall, the proposed approach provides a reproducible and interpretable foundation for validating CP systems, supporting the safe and certifiable deployment of cooperative autonomous vehicles.
[MA-12] Mean Field Reinforcement Learning
【速读】:该论文旨在解决大规模随机系统中多智能体强化学习(multi-agent reinforcement learning)与平均场控制(mean field control)之间的理论衔接问题,特别是针对具有平均场交互和共同噪声的大规模种群系统中的马尔可夫决策过程(Markov decision processes)。其核心挑战在于如何在有限智能体系统向无限大种群极限过渡时,保持学习方法的可扩展性与数学一致性。解决方案的关键在于构建一个融合概率论、数学分析与控制理论的统一框架,以形式化代表性智能体学习问题,并通过传播混沌(propagation-of-chaos)极限分析揭示其与有限种群系统的渐近关系;同时结合动态规划原理、表格式Q-learning与策略梯度方法的理论分析,提出适用于一般及线性-二次型模型的可计算学习算法,包括基于深度确定性策略梯度(deep deterministic policy gradient)等深度强化学习方法的数值实现。该研究强调问题的数学结构设计,为大规模随机种群系统提供了兼具理论严谨性与实际可行性的可训练学习范式。
链接: https://arxiv.org/abs/2607.01525
作者: René Carmona,Mathieu Laurière
机构: 未知
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Probability (math.PR)
备注:
Abstract:This monograph provides an introduction to mean field reinforcement learning through the lens of Markov decision processes arising from large-population stochastic control with mean field interactions and common noise. Starting from the connection between multi-agent reinforcement learning and mean field control, it develops the probabilistic, mathematical, and control-theoretic framework needed to formulate representative-agent learning problems, analyze their relationship with finite-population systems, and study both general and linear-quadratic models. The presentation includes dynamic programming principles, propagation-of-chaos limits, and theoretical analyses of tabular Q-learning and policy-gradient methods. It also discusses numerical implementations, including tabular schemes and deep reinforcement learning methods such as deep deterministic policy gradient. The goal is to give readers a coherent bridge between mean field control theory and reinforcement learning methodology, emphasizing the mathematical structure of the problems and the design of tractable learning approaches for large stochastic populations.
自然语言处理
[NLP-0] LACUNA: A Testbed for Evaluating Localization Precision for LLM Unlearning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在训练过程中记忆敏感信息(如个人身份信息,PII)所带来的隐私泄露问题,尤其关注现有后处理去遗忘(unlearning)方法在参数层面是否真正实现知识擦除,而非仅在输出层面进行表象上的混淆。当前最先进的去遗忘方法普遍采用“定位优先、去遗忘其次”的范式,聚焦于特定模型参数的修改,但现有评估体系仅基于输出层面的表现,无法验证这些方法是否真正从模型参数中移除了敏感知识,这一缺陷被“重浮现攻击”(resurfacing attacks)的成功所进一步凸显。为此,论文提出LACUNA——首个具备真实参数级定位能力的去遗忘测试平台。LACUNA通过掩码持续预训练(masked continual pretraining)将合成个体的PII注入1B和7B OLMo基线模型的预定义参数中,从而实现对知识存储位置的精确控制与可验证性。利用LACUNA对现有SOTA去遗忘方法进行基准测试发现,尽管其在输出层面表现良好,但在参数层面仍存在严重不精确性,且极易受到重浮现攻击。研究进一步表明,当实现精准定位时,即使采用简单的基于梯度的去遗忘方法,也能实现强效的知识擦除并有效抵御重浮现攻击,凸显了精确定位在构建鲁棒去遗忘机制中的关键作用。论文开源发布LACUNA,以补充行为评估,推动面向定位的可靠去遗忘技术发展。
链接: https://arxiv.org/abs/2607.02513
作者: Matteo Boglioni,Thibault Rousset,Siva Reddy,Marius Mosbach,Verna Dankers
机构: Mila – Quebec Artificial Intelligence Institute (蒙特利尔人工智能研究所); McGill University (麦吉尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:LLMs memorize sensitive training data, including personally identifiable information (PII), creating a pressing need for reliable post hoc removal methods. Unlearning has emerged as a promising solution, with state-of-the-art(SOTA) methods often following a localize-first, unlearn-second paradigm that targets specific model parameters. However, existing benchmarks evaluate unlearning solely at the output level, leaving open the question of whether unlearning truly erases knowledge from a model’s parameters or merely obfuscates it, a concern reinforced by the success of resurfacing attacks. To bridge this gap, we introduce LACUNA: the first unlearning testbed with ground-truth parameter-level localization. LACUNA injects PII of synthetic individuals into predefined parameters of 1B and 7B OLMo-based models via masked continual pretraining, enabling direct evaluation of whether unlearning targets the weights responsible for knowledge storage. We use LACUNA to benchmark current SOTA unlearning methods and find that, despite strong output-level performance, existing methods are highly imprecise and susceptible to resurfacing attacks. We further show that when localization is successful, even a simple gradient-based unlearning method achieves strong erasure and robustness to resurfacing attacks, highlighting the importance of precise unlearning. We release LACUNA to complement behavioral evaluations and drive further advances in robust, localization-based unlearning.
[NLP-1] Program-as-Weights: A Programming Paradigm for Fuzzy Functions
【速读】: 该论文旨在解决日常编程任务中难以通过规则化方法实现的问题,例如关键日志行的告警、损坏JSON的修复以及按用户意图对搜索结果进行排序等。这类任务当前普遍依赖大型语言模型API,但存在局部性差、可复现性低和成本高昂等问题。其解决方案的关键在于提出“模糊函数编程”(fuzzy-function programming)范式,即从自然语言规范编译出紧凑且可本地执行的神经型函数(neural artifact)。具体实现为Program-as-Weights(PAW),利用一个40亿参数的编译器在自建的1000万样本数据集FuzzyBench上训练,生成针对冻结轻量级解释器的参数高效适配器。实验表明,仅0.6亿参数的Qwen3解释器在执行PAW程序时,性能可媲美直接调用320亿参数的Qwen3模型,同时推理内存消耗仅为后者的约1/50,并可在MacBook M3上以30 tokens/s的速度离线运行。PAW的核心创新在于将基础模型的角色从逐输入求解者转变为函数构建工具:仅需在函数定义阶段调用一次,即可生成可重复使用的轻量级产物,显著提升后续调用效率与经济性。
链接: https://arxiv.org/abs/2607.02512
作者: Wentao Zhang,Liliana Hotsko,Woojeong Kim,Pengyu Nie,Stuart Shieber,Yuntian Deng
机构: University of Waterloo ( Waterloo University); Cornell University (康奈尔大学); Harvard University (哈佛大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Many everyday programming tasks resist clean rule-based implementation, such as alerting on important log lines, repairing malformed JSON, or ranking search results by intent, and are increasingly outsourced to large language model APIs at the cost of locality, reproducibility, and price. We propose fuzzy-function programming: compiling such a function from a natural-language specification into a compact, locally-executable neural artifact. We instantiate this paradigm with Program-as-Weights (PAW), in which a 4B compiler trained on FuzzyBench, a 10M-example dataset we release, emits parameter-efficient adapters for a frozen, lightweight interpreter. A 0.6B Qwen3 interpreter executing PAW programs matches the performance of direct prompting of Qwen3-32B, while using roughly one fiftieth of the inference memory and running at 30 tokens/s on a MacBook M3. PAW reframes the foundation model from a per-input problem solver into a tool builder: invoked once per function definition, it produces a small reusable artifact whose subsequent calls per function application are cheap and offline.
[NLP-2] Online Safety Monitoring for LLM s ICML2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际部署过程中仍存在生成不安全输出的问题,即便经过对齐训练(alignment training),其安全性仍无法完全保障。为应对这一挑战,论文提出一种简单而有效的实时监控机制:通过外部验证模型(verifier model)提供的置信度信号,结合风险控制方法校准阈值,实现对潜在不安全输出的实时预警。该方案的关键在于利用外部模型的验证信号进行阈值决策,并通过风险控制框架确保报警系统的可靠性与可调性,在数学推理和红队测试数据集上的实验表明,该方法在性能上可与基于序列假设检验(sequential hypothesis testing)等更复杂的监控系统相媲美。
链接: https://arxiv.org/abs/2607.02510
作者: Mona Schirmer,Metod Jazbec,Alexander Timans,Christian Naesseth,Maja Waldron,Eric Nalisnick
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
备注: ICML 2026 Hypothesis Testing Workshop
Abstract:Despite alignment training, LLMs remain prone to generating unsafe outputs at deployment time. Monitoring outputs online and raising an alarm when safety can no longer be assumed is therefore critical. We study a simple real-time monitor that turns a verifier signal from an external model into an alarm decision by thresholding, with the threshold calibrated via risk control. In experiments on mathematical reasoning and red teaming datasets, we show that this simple design is competitive with more advanced monitors based on sequential hypothesis testing.
[NLP-3] Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas ICML2026
【速读】: 该论文旨在解决长篇电视剧视频理解中的复杂剧情解析难题,核心挑战在于准确实现角色说话人识别(speaker recognition),即为每一句对白精准归属到对应角色。现有方法在短语或声学特征不显著的场景下表现不佳,难以依赖单一模态线索完成可靠识别。本文的关键解决方案是提出一种基于大推理模型(Large Reasoning Model, LRM)的鲁棒框架——DramaSR-LRM,其通过自主调用多模态工具,融合语音、语言与视觉等多源上下文信息,实现跨模态证据的动态聚合与综合推理,从而提升对白归属的准确性,尤其在声学生物特征不可靠的短对白场景中表现出显著优势。此外,研究构建了大规模基准数据集DramaSR-532K,包含超过532,000条标注对白及900余个独特角色,为该任务提供了高质量的数据支持。
链接: https://arxiv.org/abs/2607.02504
作者: Yuxuan Li,Lingxi Xie,Xinyue Huo,Jihao Qiu,Jiacheng Shao,Pengfei Chen,Jiannan Ge,Kaiwen Duan,Qi Tian
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICML 2026
Abstract:Long-form TV dramas present a formidable challenge for comprehensive video understanding, where deciphering complex storyline often relies on \textbfspeaker recognition, the task of accurately attributing each spoken utterance to its respective character. In this paper, we advance this field through two primary contributions. (1) We introduce \textbfDramaSR-532K, a large-scale benchmark comprising 532K annotated dialogue lines across more than 900 unique characters, necessitating the integration of auditory, linguistic, and visual cues for speaker recognition. (2) We propose \textbfDramaSR-LRM, a robust approach built upon a large reasoning model (LRM). DramaSR-LRM is designed to autonomously aggregate contextual evidence via multimodal tool-use, synthesizing diverse inputs to achieve high-fidelity attribution. Experimental results demonstrate that DramaSR-LRM significantly outperforms existing baselines, particularly on short utterances where acoustic biometrics are inherently unreliable. \textitAll the data and code will be made publicly available at the project page: this https URL.
[NLP-4] owards Robustness against Typographic Attack with Training-free Concept Localization ECCV2026
【速读】: 该论文旨在解决基于对比语言-图像预训练(CLIP)的视觉编码器在面对文本干扰时存在的鲁棒性缺陷,即“排版攻击”(Typographic Attack, TA)问题——当图像中包含无关文本时,模型会错误地依赖文本的词汇语义而非真实的视觉语义进行推理,从而导致决策偏差。这一问题在自动驾驶等安全关键场景中构成严重风险。其解决方案的关键在于提出一种无需训练的机制可解释性方法,通过采样分析隐藏层表示,并量化各注意力头对语义与词法信息的关注程度,结合概率分析与电路挖掘技术,精准定位在视觉变换器(ViT)中过度编码词法信息的核心组件。在此基础上,仅通过针对这些识别出的电路实施轻量级干预(如选择性调整注意力权重),即可显著提升模型对排版攻击的鲁棒性,且无需重新训练。实验表明,该方法在多个先进大视觉语言模型(LVLMs)上均能有效提升在RIO-Bench基准上的视觉问答(VQA)性能,验证了其有效性与通用性。
链接: https://arxiv.org/abs/2607.02494
作者: Bohan Liu,Wenqian Ye,Guangzhi Xiong,Zhenghao He,Sanchit Sinha,Aidong Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 15 pages main text, provisionally accepted to ECCV 2026
Abstract:Models trained via Contrastive Language-Image Pretraining (CLIP) serve as the foundational vision encoders for most modern Large Vision Language Models (LVLMs). Despite their widespread adoption, CLIP models exhibit a critical yet underexplored failure mode: irrelevant text appearing within images confounds visual representations, biasing them toward lexical meaning rather than true visual semantics. This robustness issue, commonly described as a Typographic Attack (TA), exposes a vulnerability that poses a significant risk to safety-critical applications such as autonomous driving. To achieve interpretable and effective robustness against TA, we propose a novel, training-free mechanistic interpretability method. Our method provides sampling-based interpretations of hidden state representations and quantitatively attributes semantic versus lexical focus to individual attention heads. Through probabilistic analysis and circuit mining, we isolate specific Vision Transformer (ViT) components that disproportionately encode lexical information, thereby identifying the mechanistic source of TA. We further show that simple interventions applied directly to the identified circuits, without any additional training, can substantially improve robustness against Typographic Attacks in object classification. These interventions, such as selective adjustment of attention weights, also outperform both supervised and training-free defense methods. Our experiments demonstrate that applying the proposed intervention to the vision encoders of several state-of-the-art LVLMs yields substantial gains in Visual Question Answering accuracy under Typographic Attack interference on RIO-Bench. These results confirm both the efficacy and the generalizability of our mechanistic approach. Code is released at this https URL.
[NLP-5] Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning
【速读】: 该论文旨在解决大视觉语言模型(Large Vision-Language Models, LVLMs)在生成式思维链(Chain of Thought, CoT)推理过程中缺乏视觉感知引导的自我反思问题,尤其是在分布外(out-of-distribution)图像上,现有模型难以有效利用视觉信息进行基于证据的错误修正。其解决方案的关键在于提出一种新型强化学习训练框架——视觉引导的自我反思强化学习(Visual-Reflected Reinforcement Learning, VRRL),通过两个核心设计实现视觉感知驱动的自我反思:一是训练时随机掩码推理轨迹前缀,强制模型从错误的中间预测中恢复,而非避免早期错误;二是引入经验回放缓冲区中的回滚(buffered roll-ins),使模型暴露于多样化的失败状态,从而学习如何基于视觉输入进行修正。该方法显著提升了模型在表格、图表视觉定位及空间导航等任务中面对分布外数据时的鲁棒性与准确性,优于标准强化学习与面向反思的微调基线。
链接: https://arxiv.org/abs/2607.02490
作者: Liyan Tang,Fangcong Yin,Greg Durrett
机构: The University of Texas at Austin; New York University
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large vision-language models can reason over multimodal inputs by generating textual chains of thought (CoT). A key capability exhibited in CoT reasoning is self-reflection: revisiting earlier decisions and correcting previous errors. However, existing LVLMs often fail to properly attend to visual inputs during reflection, limiting their ability to translate feedback into grounded corrections, especially for out-of-distribution images. To address this issue, we propose a novel reinforcement learning training framework VRRL, with two components explicitly designed to elicit visually grounded self-reflection. First, we randomly mask trajectory prefixes during training to emphasize recovery from incorrect intermediate predictions rather than making early mistakes. Second, we introduce buffered roll-ins from an experience replay buffer to expose the model to diverse failure states that it must learn to correct. We evaluate our approach on visual grounding tasks involving tables and charts, as well as spatial navigation benchmarks. While off-the-shelf and conventionally fine-tuned models degrade substantially under distribution shift, our method substantially improves average out-of-distribution accuracy over standard RL and reflection-oriented fine-tuning baselines by using self-reflection effectively.
[NLP-6] Audio-Based Understanding of Audiobook Narration Appeal INTERSPEECH2026
【速读】: 该论文旨在解决如何量化分析叙述质量(narration quality)对有声书吸引力的影响,特别是在不同体裁、标题和受众背景下的差异性作用。其核心问题是:在缺乏大规模标注数据的条件下,如何通过可计算的声学特征识别影响听众参与度的关键叙述因素。解决方案的关键在于利用预训练音频模型从LibriVox平台提取丰富的语音与声学特征(如音调、语速、音量等),并结合观看率(view-rate)等消费数据,系统地分析这些特征与听众吸引力之间的关联。研究发现,即使在控制标题效应后,声学信息仍与有声书吸引力存在显著且稳健的关联,且该结论在更精细的专有参与度指标下得到验证。这表明,基于数据驱动的方法能够有效揭示叙述质量对用户偏好影响的规律,为有声书个性化推荐与主播选角提供科学依据。
链接: https://arxiv.org/abs/2607.02473
作者: Shahar Elisha,Mariano Beguerisse-Díaz,Emmanouil Benetos
机构: Spotify( Spotify); Queen Mary University of London (伦敦玛丽女王大学)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to Interspeech 2026
Abstract:Narration is central to the audiobook listening experience, shaping how listeners engage with and understand the content. This work explores how narration qualities shape an audiobook’s appeal, noting that their effects can vary by genre, title, and audience. We extract vocal and acoustic features (e.g., tone, pace, loudness) from LibriVox using pre-trained audio models and analyse their relationship with consumption data (specifically, view-rate) and their interplay with genre and title. Despite limited consumption data, we find that acoustic information alone has a robust association with appeal, even after accounting for title effects. We further validate these findings using more nuanced proprietary engagement metrics. To our knowledge, this is the first systematic computational study linking narration qualities, genre, title, and audiobook consumption, highlighting the potential of data-driven insights to improve audiobook personalisation and narrator casting.
[NLP-7] stEvo-Bench: An Executable and Live Benchmark for Test and Code Co-Evolution WWW
【速读】: 该论文旨在解决生成式AI在软件测试自动化中对代码与测试协同演化(test and code co-evolution)理解不足的问题。现有测试生成与更新基准普遍将测试与代码变更割裂,依赖静态元数据,无法验证测试的可执行性或其与代码变更的语义关联性,导致难以评估测试自动化代理是否真正理解代码变更如何影响测试套件。为此,论文提出TestEvo-Bench,一个基于真实软件仓库挖掘的、面向测试与代码协同演化的基准数据集,包含两个任务类型:测试生成(生成新测试以捕获新行为)和测试更新(修改失败的旧测试以适配变更后的代码行为)。其关键创新在于:每个任务均锚定于真实的提交历史,并附带可执行的环境配置,支持基于实际运行结果的评估指标(如通过率、覆盖率、突变分数),从而实现“执行驱动”的评估。此外,TestEvo-Bench为动态活体基准,通过记录变更时间戳并定期自动挖掘新任务,确保评估可限定于模型训练截止时间之后的任务,有效降低数据泄露风险。当前版本涵盖746个测试生成任务与509个测试更新任务,源自152个开源Java项目中的59,950条候选协同演化记录。实验表明,结合强基座模型(Claude Opus 4.7、Gemini 3.1 Pro)与强大工具链(Claude Code、Gemini CLI、SWE-Agent)的四种先进代理,在测试生成上最高达77.5%成功率,测试更新上达74.6%,但性能在最新任务及受限计算成本下显著下降,揭示了当前方法在真实场景下的局限性。
链接: https://arxiv.org/abs/2607.02469
作者: Jiale Amber Wang,Kaiyuan Wang,Pengyu Nie
机构: University of Waterloo ( Waterloo 大学); Google(谷歌)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: TestEvo-Bench leaderboard and data explorer are hosted at this https URL
Abstract:Software tests and code evolve together: a code change should be followed by new or updated tests that record the new software behavior. Yet existing test generation and update benchmarks often isolate the test from the code change, and rely on static metadata that does not verify whether a test is executable or semantically tied to the code change. This makes it difficult to evaluate whether a test automation agent understands how a code change should propagate into the test suite. We introduce TestEvo-Bench, a benchmark of test and code co-evolution tasks mined from software repositories, with two tracks: in test generation, the agent shall write new tests to capture the new software behavior; in test update, the agent shall adapt failing existing tests to the changed software behavior. Each task is anchored to a real commit history and packaged with environment configuration to support execution-grounded metrics such as pass rate, coverage, and mutation score. TestEvo-Bench is also a live benchmark: each task records the timestamp of the test and code changes, and new tasks are periodically mined by our automated pipeline, so evaluation can be restricted to tasks postdating a model’s training cutoff to reduce data leakage risk. The current snapshot contains 746 test generation and 509 test update tasks, curated from 59,950 candidate co-evolution records across 152 open-source Java projects. We experiment with four state-of-the-art agents that combine strong harnesses (Claude Code, Gemini CLI, and SWE-Agent) with strong foundation models (Claude Opus 4.7 and Gemini 3.1 Pro). Results show that they achieve up to 77.5% success rate on test generation and 74.6% on test update. However, success rate is materially lower on the most recent benchmark tasks and drops significantly under limited per-task cost. Comments: TestEvo-Bench leaderboard and data explorer are hosted at this https URL Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2607.02469 [cs.SE] (or arXiv:2607.02469v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2607.02469 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-8] Will Scaling Improve Social Simulation with LLM s?
【速读】: 该论文旨在解决当前大语言模型(LLM)社会模拟在真实性与可信度方面不足的问题,探究现有语言建模的扩展范式是否足以提升社会模拟的保真度,或是否存在与通用能力解耦的独立挑战。其核心解决方案的关键在于利用缩放定律(scaling laws)系统分析计算资源投入、通用能力基准表现与三类代表性社会模拟任务——观点建模、行为模拟和长期预测——之间的关系。研究通过85个基于Qwen3架构、在DCLM网络文本语料上预训练的Transformer LLM,在固定计算预算(10^18至10^20 FLOPs)下验证了所有三类任务均存在显著的计算缩放效应;进一步评估了35个更大规模的开源模型(最大达700亿参数),发现多数行为与观点模拟任务将随规模增长而快速提升,尤其在英语网络语料中已有充分覆盖的人群;然而,对于低资源领域、非主流观点或与通用知识推理(如MMLU)相关性较低的长期预测任务,缩放效果较弱。更关键的是,在行为模拟任务中,即使扩大参数量至80亿,模型仍无法有效捕捉人类认知偏差(如风险规避)与启发式策略(如从相关任务中学习关联奖励),表明缩放本身不足以改善模型校准性。因此,结论指出:尽管规模扩张可显著提升多数社会模拟任务的性能,但部分场景存在“例外”且在低资源领域改进不可靠,提示需对社会模拟保真度进行独立、针对性的研究投入。
链接: https://arxiv.org/abs/2607.02464
作者: Caleb Ziems,William Held,Su Doga Karaca,David Grusky,Tatsunori Hashimoto,Diyi Yang
机构: Stanford University (斯坦福大学); Open Athena (开放奥丁)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Model (LLM) social simulations are a promising research method, but they are not yet faithful enough to be adopted widely. In this work, we investigate whether the current scaling paradigm in language modeling is likely to close these gaps, or whether simulation fidelity is orthogonal to general capabilities and therefore deserving of more research attention. We use scaling laws to study the relationship between LLMs’ compute scale, general capability benchmarks, and the fidelity of social simulation in three representative sub-domains: opinion modeling, behavioral simulation, and longitudinal forecasting. Surprisingly, we discover strong compute scaling in all three settings, using a suite of 85 transformer LLMs with the Qwen3 architecture pre-trained on the DCLM web text corpus under fixed-compute budgets from 10^18 to 10^20 FLOPs. Then we evaluate 35 larger and more capable open-weight models up to 70B parameters, allowing us to predict downstream accuracy from loss. This reveals that the majority of behavioral and opinion simulation tasks will rapidly improve with scale, particularly when they involve populations that are well-represented in English web corpora. Longitudinal forecasting and underrepresented opinions scale more slowly, especially when they are less correlated with general knowledge and reasoning benchmarks like MMLU. In behavior simulation, scaling fails to improve model calibration with human cognitive biases like risk aversion, as well as human heuristics like learning correlated rewards from related tasks. On these tasks, even fine-tuned models fail to noticeably scale up performance from 0.5B to 8B parameters. Taken together, we conclude that scale will improve social simulations in most settings, but outliers exist, and improvements will be less reliable in low-resource domains.
[NLP-9] Language Models as Measurement Apparatus for Culture ACL2026
【速读】: 该论文试图解决的问题是:当前自然语言处理(NLP)在文化量化研究中普遍将文化现象视为可被动记录的客观实体,却忽视了其测量过程本身即是一种具有建构性的物质—话语实践(material-discursive practice)。论文指出,语言模型及其配套的数据、标注与评估体系并非中立工具,而是通过“能动性切割”(agential cut)——即现象与测量工具之间动态生成的边界——主动参与构建其所测量的文化现实。解决方案的关键在于:承认并系统化地反思这种能动性切割,将模型设计中的实质性选择(如数据筛选、特征提取、评价标准)视为对文化边界的主动划定,并意识到语言模型在训练过程中已内化大量被测量的文化材料,导致测量与被测对象之间存在根本性的纠缠关系。通过六个案例分析(三组文化现象测量与三组工具自身反思),论文提出一种以理论为导向、实证严谨且文化情境敏感的研究范式,主张将每一次能动性切割视为方法论与伦理上的自觉承诺,从而推动更具批判性与责任意识的文化计算研究。
链接: https://arxiv.org/abs/2607.02459
作者: Kent K. Chang
机构: University of California, Berkeley (加州大学伯克利分校)
类目: Computation and Language (cs.CL)
备注: Accepted to the Big Picture workshop co-located with ACL 2026. This version expands the camera-ready (adding Fig. 3 and section 6.3, as well as correcting minor typos) in Proceedings of The Big Picture v2: Crafting a Research Narrative, pp. 131–143, San Diego, CA, USA. Association for Computational Linguistics
Abstract:Language models are increasingly used to quantify cultural phenomena, but what makes such measurement distinctively cultural? This paper argues that NLP work on culture is a material-discursive practice: the apparatus – model, data, annotation, evaluation – participates in constituting the cultural reality it measures, rather than passively recording it. Drawing on Karen Barad’s concept of the agential cut – the contingent boundary between phenomenon and instrument – I show that the apparatus’s substantive design choices draw such boundaries, and that the boundary is entangled from the start because language models have already internalized much of the cultural material they measure. I illustrate this through three case studies on television and film dialogue (measuring structure, interaction, and deviation) and three examinations of the apparatus itself (erasure of cultural markers, attunement to historical material, and agency in an agentic workflow). This big picture analysis proposes a research program that is theory-driven, empirically rigorous, and culturally contingent, treating each agential cut as a conscious commitment, at once methodological and ethical.
[NLP-10] EvoPolicyGym: Evaluating Autonomous Policy Evolution in Interactive Environments
【速读】: 该论文旨在解决当前自主智能体在通过反馈改进可执行策略(executable policy)过程中,评估方式过于简化或与开放式软件工程进展混淆的问题。现有方法通常将整个优化过程压缩为单一最终评分,缺乏对策略迭代演进机制的精细刻画。为此,论文提出“自主策略演化”(Autonomous Policy Evolution)这一受控评估范式,其核心在于构建一个固定交互预算下的持续策略编辑框架,以真实反映智能体在有限资源下渐进优化策略的能力。关键解决方案是构建EvoPolicyGym基准测试平台,该平台基于紧凑的交互式强化学习环境,能够量化评估智能体在多轮迭代中如何分配计算预算、将反馈转化为参数调优,并实现策略的持续改进。实验表明,GPT-5.5在16个环境上均取得领先表现,且轨迹级诊断分析揭示:卓越的自主策略演化不仅依赖于单任务性能提升,更取决于发现适配任务特性的优化机制并在此基础上进行有约束的策略精炼。
链接: https://arxiv.org/abs/2607.02440
作者: Zhilin Wang,Han Song,Runzhe Zhan,Jusen Du,Jiacheng Chen,Tianle Li,Qingyu Yin,Yulun Wu,Zhennan Shen,Tong Zhu,Yanshu Li,Guanjie Chen,Derek F. Wong,Yafu Li,Yu Cheng,Yang Yang
机构: University of Science and Technology of China (中国科学技术大学); The Chinese University of Hong Kong (香港中文大学); University of Macau (澳门大学); Tsinghua University (清华大学); Zhejiang University (浙江大学); Soochow University (苏州大学); Brown University (布朗大学); Shanghai Jiao Tong University (上海交通大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 24 pages
Abstract:Autonomous agents are increasingly expected to improve executable policies through feedback, yet existing evaluations often collapse this process into a final score or confound it with open-ended software-engineering progress. We introduce Autonomous Policy Evolution, a controlled evaluation setting in which a harness-model agent repeatedly edits an executable policy system under a fixed interaction budget. We instantiate this setting in EvoPolicyGym, a benchmark built from compact interactive RL environments that evaluates how agents iteratively improve explored policies. On the EvoPolicyGym suite, GPT-5.5 achieves the strongest aggregate rank score and top-two performance on all 16 environments. Beyond leaderboard results, EvoPolicyGym also provides trajectory-level diagnostics that distinguish how agents allocate budget, convert feedback into parametric tuning. These analyses show that strong autonomous policy evolution depends not only on isolated task wins, but on discovering task-appropriate mechanisms and refining policies under bounded feedback.
[NLP-11] Automated grading of Linux/bash examinations using large language models : a four-level cognitive taxonomy approach
【速读】: 该论文旨在解决计算教育中命令行考试规模化与可靠评分的难题,传统人工评分因学生人数增加而难以实施,而基于规则的自动评分系统又无法处理部分得分、等效解法及语法变异等问题。其解决方案的关键在于评估四种前沿大语言模型(GPT、Claude Opus、Gemini 和 GLM)在评分短时 Linux/bash 命令响应时是否能逼近专家判断,并引入一个融合认知复杂性与操作影响的四层认知分类体系(从信息检索 L1 到高级系统管理 L4),结合最小基准提示与基于评分量规的增强提示两种策略进行测试。实验基于 1200 名二年级计算机工程专业学生的实际作答,由三位专家独立评分作为金标准。结果显示,采用量规引导提示的 Gemini 3.0 Pro 达到最高的人机一致性(ICC(3,1) = 0.888,MAE = 0.10,Bland-Altman 偏倚 = -0.014),且随着分类层级升高,模型表现下降明显,高阶题目差异最大。研究进一步表明,量规质量对评分一致性的影响大于模型提供商选择,结构化提示可稳定提升性能。结果证实了问题复杂度是预测大语言模型(LLM)评分准确性的可靠指标,构建了一个基于分类体系的可解释性框架,用于判断哪些题目适合人工智能辅助评分,哪些需保留人工评审,同时提供了可迁移的评估协议与提示模板。
链接: https://arxiv.org/abs/2607.02432
作者: Manuel Alonso-Carracedo,Ruben Fernandez-Boullon,Pedro Celard,Francisco J.Rodriguez-Martinez,Lorena Otero-Cerdeira
机构: University of Vigo (维戈大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:Scalable and reliable grading of command-line examinations remains a challenge in computing education, where rising enrolments make manual marking difficult and rule-based autograders cannot handle partial credit, equivalent solutions, or syntactic variation. This paper evaluates whether four frontier Large Language Models (GPT, Claude Opus, Gemini, and GLM) can approximate expert judgment when grading short Linux/bash command responses. The study adopts a four-level cognitive taxonomy that combines cognitive complexity and operational impact, ranging from information retrieval (L1) and basic file manipulation (L2) to structural operations (L3) and advanced system management (L4). The models were tested with two prompt variants, a minimal baseline and a rubric-enhanced version, on 1200 real responses from second-year Computer Engineering students independently graded by three expert instructors. Gemini~3.0 Pro with rubric-guided prompting achieved the highest human-AI agreement (ICC(3,1) = 0.888, MAE = 0.10, Bland-Altman bias = -0.014). Agreement declined consistently as taxonomy level increased, with the largest discrepancies at higher levels. Across all models, rubric quality had a larger effect than provider choice, with structured prompts consistently improving agreement. These results show that question complexity is a reliable predictor of the difficulty LLMs face in grading accurately, and they establish a principled, taxonomy-based framework for determining which questions are suitable for AI-assisted grading and which require human review, while also providing a transferable evaluation protocol and prompt templates.
[NLP-12] he Future of NLP may not be at NLP Conferences: Scholarly Migration Patterns in Natural Language Processing
【速读】: 该论文旨在探究大规模语言模型(Large Language Models, LLMs)兴起背景下,自然语言处理(Natural Language Processing, NLP)领域学术发表格局是否正在发生根本性转变。其核心问题是:随着LLMs的发展,NLP研究的学术重心是否正从传统NLP顶会(如ACL主会议)向更广泛的机器学习(Machine Learning, ML)通用顶会迁移。解决方案的关键在于通过跨时段(2010–2026)、多作者群体(包括资深与新晋研究者)的实证分析,结合因果推断方法,量化不同发表场所的影响力差异。研究发现,资深作者在ACL主会议的占比下降19.2个百分点,转而更多选择新兴的Findings系列及通用ML会议,后者份额上升8.6个百分点;新晋作者中,以ACL为主要发表平台的比例从2019年的84%降至2024年的74%,而转向通用ML会议的比例则从5%升至21%。进一步分析表明,通用ML会议具有显著更高的引用优势,这一“引用溢价”成为影响作者投稿决策的核心动因。因此,该研究揭示了NLP研究发表生态正经历结构性转移,其背后驱动力是跨学科影响力的重新分配。
链接: https://arxiv.org/abs/2607.02416
作者: David Jurgens
机构: University of Michigan (密歇根大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Natural Language Processing (NLP) has traditionally been published in its core disciplinary venues like ACL. However, advances in Large Language Models (LLMs) has led to a blurring of the disciplinary lines between NLP and general Machine Learning (ML), with authors regularly publishing in venues from both fields. Here, we ask whether the disciplinary center of gravity is shifting. Using NLP research published from 2010 to 2026 and studies of both established and new authors, we find that a migration is taking place. First, comparing the pre- and post-LLM eras, established authors lost 19.2pp of share at flagship *ACL main-conference tracks while gaining 14.8pp in the newer Findings tracks, and general ML venues rose 8.6pp, even when adjusting for parallel growth in the fields. Second, among newer authors who debut with at least three first-author NLP-topic papers, the share whose work appears mostly at *ACL venues fell from 84% (2019) to 74% (2024), while the share appearing mostly at general ML venues rose from 5% to 21%. Using causal inference techniques, we estimate that these general ML venues confer a significant citation premium, which influences venue selection. Together, these results point to a significant shift in where NLP research is published.
[NLP-13] Know Your Source: A Public Knowledge Store for Media Background Checks
【速读】: 该论文旨在解决基于大语言模型(LLM)的检索增强生成(RAG)系统在自动化事实核查(AFC)任务中对检索证据可靠性假设过强的问题。现实世界中的信息常存在冲突、过时或来源于不可靠、有偏见的来源,而现有方法普遍忽略这一挑战。为应对该问题,论文提出一种名为“源批判推理”(source-critical reasoning)的机制,通过媒体背景核查(Media Background Check, MBC)评估证据来源的可信度,从而提升事实验证的准确性与透明性。然而,当前MBC生成依赖昂贵的专有搜索API,严重限制了研究的可复现性。为此,论文提出MEDIAREF——一个公开可用的、基于网络抓取文档的知识库,涵盖200个媒体来源,支持低成本、可复现的MBC生成评估。其关键在于构建了一套可重复的语料构建与更新方法,并通过自动与人工双重评估验证了该知识库能显著提升MBC生成质量,为未来可复现的源可信度评估提供了基础支撑。
链接: https://arxiv.org/abs/2607.02383
作者: Benjamin Nichols,Michael Schlichtkrull,Nedjma Ousidhoum
机构: Cardiff University (卡迪夫大学)
类目: Computation and Language (cs.CL)
备注: Code and Data: this https URL
Abstract:LLM-based retrieval-augmented generation (RAG) is increasingly used for automated fact-checking (AFC) and related tasks. By grounding LLM outputs in retrieved evidence, RAG-based systems provide transparent justifications while allowing external information to be updated independently of the underlying model. However, existing approaches often assume retrieved evidence is reliable, although real-world information may be conflicting, outdated, and can originate from unreliable or biased sources. Recent work on source-critical reasoning addresses this challenge through media background checks (MBCs) (Schlichtkrull, 2024), which assess the credibility of evidence sources to support downstream fact verification. However, generating MBCs relies on costly proprietary search APIs, limiting reproducibility. To mitigate this issue, we introduce MEDIAREF, a publicly available knowledge store of web-sourced documents that enables reproducible, low-cost evaluation of MBC generation across 200 media sources. We describe a reproducible methodology for constructing and updating the collection, assess widely used LLMs on the MBC generation task, and demonstrate that MEDIAREF supports higher-quality MBC generation through both automatic and qualitative evaluation.
[NLP-14] HULAT2 at MER-TRANS 2026: Governed Multi-Agent Simplification for Spanish Easy-to-Read Generation
【速读】: 该论文旨在解决多语言易读性翻译(multilingual Easy-to-Read translation)任务中的生成质量与可解释性问题,尤其关注如何在保持语义忠实度、可读性和事实一致性的同时,提升翻译的用户导向适切性。其核心解决方案是提出一种基于LangGraph的多智能体工作流(multi-agent workflow),通过集成Gemini 2.5 Flash与RigoChat-7B-v2模型,结合并行生成策略、内部质量信号反馈、事件-条件-动作(Event-Condition-Action)路由机制、可控编辑以及可追溯决策链,实现对翻译过程的精细化控制。实验表明,该信号驱动的多智能体路由架构显著优于传统的生成-评估-重生成基线(RUN3),尤其在SARI指标上表现突出;而引入词汇支持层虽未带来参考标准分数的普遍提升,但凸显了在复杂任务中需进一步开展细粒度的段落级与文档级分析以全面评估可读性、事实一致性和用户适用性。
链接: https://arxiv.org/abs/2607.02381
作者: Lourdes Moreno,Paloma Martínez,Marco Antonio Sanchez-Escudero,Miguel Domínguez-Gómez
机构: 未知
类目: Computation and Language (cs.CL)
备注: 13 pages, 1 figure, 3 tables
Abstract:This paper describes the participation of HULAT2-UC3M in the Spanish track of MER-TRANS 2026, a shared task on multilingual Easy-to-Read translation. Three fully automatic Spanish runs were submitted. RUN1 and RUN2 used a LangGraph-based multi-agent workflow combining Gemini 2.5 Flash and RigoChat-7B-v2, parallel generation strategies, internal quality signals, Event-Condition-Action routing, controlled editing and traceable decisions. RUN1 used the base workflow, while RUN2 activated an additional lexical-support layer based on a glossary and lexical resources. RUN3 was a RigoChat-based generate-evaluate-regenerate baseline with prompt engineering and LoRA-based adaptation. The official leaderboard reports BLEU-Orig, BLEU-Gold, SARI and BERTScore. During development, additional internal signals were also inspected, including semantic fidelity, readability, lexical simplicity, syntactic clarity and factual consistency. According to official SARI, RUN1 was the best HULAT2 run, with 44.0543 points, followed by RUN2 with 43.1049 and RUN3 with 38.5136. These results indicate that, in this task setting, signal-guided multi-agent routing outperformed the linear regeneration baseline. They also show that adding lexical support did not automatically improve reference-based scores. Further segment-level and document-level analysis are required to assess readability, factual consistency and user-oriented adequacy.
[NLP-15] World Wide Models: Literary Tools for Cultural AI
【速读】: 该论文旨在解决生成式 AI(Generative AI)在跨文化语境中因语言单一性(monolingualism)所导致的文化理解偏差与阐释局限问题。其核心挑战在于,当前大语言模型(LLMs)以单语自动化方式处理文本,缺乏对多元文化语境的敏感性与深层解读能力。解决方案的关键在于构建一个多层次的文本建模框架,通过整合文学研究中的比较阅读、叙事学与诗学分析、批判理论以及世界文学(world literature)方法,将文学与AI发展之间的自然交集系统化;具体路径包括从宏观结构(macrostructure)、文本流通机制(circulation)及不可译性(untranslatability)三个维度,推动更具包容性的多语种、多文化文本理解范式,从而实现对全球AI文本性的更丰富、更辩证的诠释。
链接: https://arxiv.org/abs/2607.02369
作者: Nina Begus
机构: University of California, Berkeley (加州大学伯克利分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages
Abstract:LLMs stage a new form of cultural encounter that is massive, automated, and monolingual. Literary disciplines have always negotiated cultural struggles with comparative reading of literature, narratological and poetic analysis, critical theory, world literature, and translation. These tools have now become indispensable for building culturally literate AI. The essay develops a layered framework toward more nuanced textual models and pluralistic interpretations of AI, emphasizing the natural intersections of literature and AI development, connecting current debates in critical theory with structural monolingualism, and suggesting a new application of world literature approaches to address global AI textuality through macrostructure, circulation, and untranslatability.
[NLP-16] SkillFuzz: Fuzzing Skill Composition for Implicit Intents Discovery in Open Skill Marketplaces
【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)驱动的软件工程智能体在技能市场中因技能组合引发的隐式意图(implicit intents)问题。具体而言,尽管单个技能在独立审计时表现正常,但其与其他技能协同执行时可能产生未预期的行为或目标,而传统基于孤立审计的机制无法识别此类风险。核心挑战在于:隐式意图仅在技能组合执行时显现、执行环境难以在准入阶段获取、且技能组合空间随市场规模呈指数级增长。为此,论文将隐式意图发现建模为对技能组合的模糊测试(fuzzing)问题,以技能组合为测试单元,利用规划产物作为执行前意图的表征,并以无技能基线为差异性判据(differential oracle)。基于此,提出首个无需执行的测试方法SkillFuzz,其关键创新在于:通过提取结构化技能契约(structured skill contracts),并采用契约引导的蒙特卡洛树搜索(Monte Carlo Tree Search)策略,高效优先探索潜在冲突的组合路径。实验结果表明,在固定查询预算下,SkillFuzz在代表性工作负载中发现了超过1000种不同的隐式意图,执行验证阶段确认了80%以上高风险组合的有效性,且相比其他搜索策略,在极小的成对交互探索比例下识别出更多高严重性隐式意图。
链接: https://arxiv.org/abs/2607.02345
作者: Jinwei Hu,Yi Dong,Youcheng Sun,Xiaowei Huang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Under Review
Abstract:Large Language Model (LLM)-based agents increasingly automate software engineering tasks through reusable skills, natural-language instruction documents that guide planning and execution. Open skill marketplaces enable users to assemble agents by co-activating community-contributed skills, but marketplace operators typically audit skills in isolation. As a result, individually benign skills may interact to redirect an agent toward unintended objectives, which we term implicit intents. Detecting such intents is challenging because the effect emerges only through skill composition, execution environments are often unavailable at admission time, and the space of possible co-activations grows exponentially with marketplace size. In this paper, we formulate implicit-intent discovery as a fuzzing problem over skill compositions, where skill compositions are the unit under test, planning artifacts expose agent intent before execution, and deviations from a skill-free baseline serve as a differential oracle. Based on this formulation, we propose skillfuzz, the first execution-free testing approach that extracts structured skill contracts and uses contract-guided Monte Carlo Tree Search to prioritize potentially conflicting compositions. Across representative skill-marketplace workloads, skillfuzz discovers over 1,000 distinct implicit intents under a fixed query budget, confirms more than 80% of the highest-risk flagged compositions during execution-time validation, and identifies substantially more high-severity implicit intents than alternative search strategies while exploring only a fraction of the pairwise interaction space they require.
[NLP-17] On the Role of Directionality in Structural Generalization
【速读】: 该论文旨在解决现有语法解析框架在处理具有方向性语义任务时的局限性问题,特别是针对SLOG评测中涉及方向性区分的任务(如修饰语位置偏移、论元提取位置等),而此前的最优方法AM-Parser采用的抽象含义代数(AM algebra)其操作无法编码方向信息,导致在方向性任务上表现受限。解决方案的关键在于重构符号后端,引入基于范畴语法(CCG)的有向类型系统,实现确定性的CKY解析与单线性解码器结构,并仅使用3万可学习参数,从而显式建模方向性依赖。实验表明,在相同BERT-base编码器下,新系统达到75.9±6.4%的精确匹配率,显著优于AM-Parser的70.8±4.3%;尤其在5个位置偏移类别上提升29.9个百分点,充分验证了方向性建模的有效性。同时,当替换为DeBERTa-v3-large编码器后,准确率进一步提升至90.7±4.9%,且在递归深度类别的提升最为显著,体现出方向性与神经编码器升级之间的互补性。该设计将模型瓶颈从符号层(原AM-Parser在部分类别上达到0%上限)转移至神经层,使整体性能随编码器增强而持续优化。
链接: https://arxiv.org/abs/2607.02307
作者: Zichao Wei
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Several SLOG test categories explicitly involve directional distinctions (modifier position shifts, argument extraction positions), yet AM-Parser, the previous SOTA, uses an AM algebra whose operations do not encode direction. We redesign the symbolic backend around CCG directed types (deterministic CKY + single linear decoder, 30K learnable parameters). Under the same BERT-base encoder, the system achieves 75.9 \pm 6.4% LF exact match, surpassing AM-Parser (70.8 \pm 4.3%). Per SLOG’s own category groupings, gains are highly directional: the CCG system outperforms AM-Parser on all 5 position-shift categories (+29.9pp), while AM-Parser outperforms on all 6 recursive-depth categories. Replacing the encoder with DeBERTa-v3-large yields 90.7 \pm 4.9%, with the largest encoder gains in recursive-depth categories, complementary to directionality’s gains. Directional representations shift the bottleneck from the symbolic layer (AM-Parser’s 0% category ceiling) to the neural layer, which improves with encoder upgrades.
[NLP-18] HERMES: A Multi-Granularity Labeling Substrate for Pre-training Data Mixtures
【速读】: 该论文旨在解决现有数据混合方法中标签系统(label system)的局限性问题,即大多数方法依赖于预先划分好的数据分组,而这些分组通常基于单一语义轴(如来源、主题或格式分类体系)和固定粒度,导致其表达能力受限且难以灵活调整。传统标签体系在改变粒度时需重新构建标签,缺乏可扩展性和灵活性。论文提出的关键解决方案是构建一种数据驱动的层次化标签基底——HERMES,其核心由一个学习的语义变换(Learned Semantic Transform)与三阶段残差向量量化(residual vector quantization)组成,能够对每篇文档进行一次性的粗到细编码,通过前缀长度控制粒度,最多可达约13万编码单元。该结构不依赖特定聚类算法,而是提供了一个可复用、自适应的数据衍生粒度层级。实验表明,在10亿参数、2500亿词元的预训练场景下,该层次结构揭示了固定粒度管道无法捕捉的交互效应:在某一粒度层级上,采用第二阶段规则的组合策略(均衡子桶覆盖与按大小比例分配的质量优化)使16项任务的宏观平均性能提升+0.0253;而在更精细层级,由于候选池缩小约5倍,该规则优势消失。因此,论文的核心贡献在于提出了一个新型的、可扩展的层次化标签基底,将数据混合设计从选择固定标签集转变为在可复用的数据驱动粒度层级中进行导航。
链接: https://arxiv.org/abs/2607.02266
作者: Ziyun Qiao,Yue Min,Ruining Chen,Yujun Li
机构: Wizard Quant( wizardquant); Peking University (北京大学); University of Science and Technology of China (中国科学技术大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 19 pages, 5 figures
Abstract:Most data-mixing methods assume the corpus has already been partitioned into groups, and the choice of those groups determines what a mixer can express. Existing labels, including provenance, topic or format taxonomies, and flat embedding clusters, commit to one semantic axis at one granularity; changing the resolution rebuilds the labels. We argue the bottleneck is the label system, not the mixer, and provide a hierarchical one. HERMES is a data-derived labeling substrate: a Learned Semantic Transform followed by 3-stage residual vector quantization annotates each document once into a coarse-to-fine code whose prefix length controls granularity up to approximately 130k cells. At coarse granularity HERMES sits at a plateau with KMeans-family methods on standard clustering metrics, so the contribution is the substrate, not the clusterer. On 1B-parameter, 25B-token pre-training, the hierarchy exposes an interaction fixed-granularity pipelines cannot test: at one prefix length, a combined Stage-2 rule contrast, equal-subbucket coverage versus size-proportional within-bucket quality top-30%, lifts a 16-task capability macro-average by +0.0253; at the next finer level, the same rule loses its measurable edge as candidate pools contract approximately 5x. HERMES reframes data mixture design from choosing among fixed label sets to navigating a reusable, data-derived granularity hierarchy.
[NLP-19] CheckRLM: Effective Knowledge-Thought Coherence Checking in Retrieval-Augmented Reasoning
【速读】: 该论文旨在解决生成式推理语言模型(Reasoning Language Models, RLMs)在处理知识密集型复杂任务时,其推理链中易引入事实性错误的问题,尤其在长时序推理过程中,错误会逐级累积并严重影响最终结果的可靠性。其核心解决方案是提出CheckRLM框架,通过引入检索增强生成(Retrieval-Augmented Generation, RAG)机制,在推理过程中实时检测与修正事实性错误。该方法的关键在于:在推理链生成过程中动态提取关键事实主张(factual claims),识别并定位潜在的知识不一致;一旦发现错误,即通过轻量级但精准的修正机制,利用外部知识源进行最小成本的修正,从而确保推理过程与正确知识的一致性。实验表明,CheckRLM显著优于现有基线方法,在降低计算开销的同时,有效缓解了长推理链中的错误传播问题。
链接: https://arxiv.org/abs/2607.02262
作者: Dingling Xu,Ruobing Wang,Qingfei Zhao,Yukun Yan,Zhichun Wang,Daren Zha,Shi Yu,Zhenghao Liu,Shuo Wang,Xu Han,Maosong Sun
机构: Beijing Normal University (北京师范大学); Beijing Key Laboratory of Artificial Intelligence for Education (北京市人工智能教育重点实验室); Chinese Academy of Sciences (中国科学院); University of Chinese Academy of Sciences (中国科学院大学); Tsinghua University (清华大学); Northeastern University (东北大学)
类目: Computation and Language (cs.CL)
备注: 24 pages, 7 figures
Abstract:Reasoning Language Models (RLMs) have significantly improved performance on complex tasks by extending the reasoning chain. However, these chains are prone to containing factual errors, particularly in knowledge-intensive tasks. To address this issue, we propose CheckRLM, a framework that improves the reliability of the reasoning process through Retrieval-Augmented Generation (RAG) by timely checking and correcting factual errors. Specifically, CheckRLM extracts factual claims from the reasoning chain to identify and localize subtle knowledge inconsistencies during inference. Upon detection of errors, a refinement mechanism performs minimal-cost yet precise corrections by leveraging external knowledge, ensuring coherence between the reasoning chain and correct knowledge. Extensive experiments demonstrate that CheckRLM substantially outperforms existing baselines, exhibiting a strong capability to mitigate error accumulation in long-horizon reasoning with lower costs. The code and data are available at this https URL.
[NLP-20] BamiBERT: A New BERT-based Language Model for Vietnamese
【速读】: 该论文旨在解决当前越南语文本编码器PhoBERT在上下文长度限制、依赖外部分词模块以及跨领域泛化能力不足等关键问题。其解决方案的关键在于:从零开始在129GB通用领域越南语语料上训练一个全新的基于BERT的预训练语言模型BamiBERT,支持高达2048个标记(token)的扩展上下文长度,并可直接处理原始输入文本而无需依赖外部分词工具;同时,在8个越南语基准测试中,BamiBERT在15项指标中有11项取得最优成绩,3项位列第二,显著超越现有“base”规模的越南语编码器,确立了新的性能基准,展现出强大的跨领域泛化能力。
链接: https://arxiv.org/abs/2607.02259
作者: Dat Quoc Nguyen,Thinh Pham,Chi Tran,Linh The Nguyen
机构: Qualcomm AI Research(高通人工智能研究院); Virginia Tech(弗吉尼亚理工学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:In this paper, we introduce BamiBERT, a new BERT-based pre-trained language model for Vietnamese that addresses key limitations of PhoBERT – the current de facto Vietnamese text encoder. Trained from scratch on a 129GB corpus of general-domain Vietnamese text for 20 epochs, BamiBERT supports an extended context length of up to 2048 tokens and operates directly on raw input, eliminating the need for external word segmentation. Across 8 Vietnamese benchmarks, it achieves the best score on 11 of 15 metrics and the second-best on 3 others, setting a new state of the art among “base”-sized Vietnamese encoders and demonstrating strong cross-domain generalization. We release BamiBERT at: this https URL
[NLP-21] Agent icSTS: A Bounded-Memory Testbed for Long-Horizon LLM Agents
【速读】: 该论文旨在解决长时程大语言模型(LLM)智能体在复杂决策任务中因上下文过载导致的记忆可解释性与可控性下降问题。传统方法通过将历史观测、工具调用及反思内容简单拼接至每个提示词,虽便于访问过往信息,却使上下文变得混乱,难以分离单个记忆组件的影响。为此,论文提出一种受约束的“记忆契约”(bounded contract):每次决策均基于类型化检索生成的全新用户消息,不附加原始跨决策上下文,从而确保提示词长度始终受限,且各记忆层可独立消融分析。该方案在《杀戮之境2》(Slay the Spire 2)这一封闭规则、高随机性的卡牌构建游戏中得到验证——该任务需数百次战术与战略决策,现有前沿模型在最低难度下五种配置均未能取得胜利,人类胜率仅为16%,任务具有挑战性但未饱和。实验表明,在固定模型规模(A0)下,启用策略技能层后胜率从3/10提升至6/10,尽管统计上未达显著水平(Fisher精确检验p≈0.37),但已呈现方向性差异。研究还引入跨主干网络探测与公开累积上下文基线作为操作性对照,而非对契约变量的严格控制测试。最终,作者开源了一个可复现的评测平台,包含298条完整轨迹数据、条件标签、冻结的记忆/技能快照、提示记录及分析脚本,为研究显式记忆层如何影响长时程LLM智能体决策提供了标准化的代理设计与验证方法论。
链接: https://arxiv.org/abs/2607.02255
作者: Xiangchen Cheng,Yunwei Jiang,Jianwen Sun,Zizhen Li,Chuanhao Li,Xiangcheng Cao,Yihao Liu,Fanrui Zhang,Li Jin,Kaipeng Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Memory for a long-horizon LLM agent is a contract about what each future decision is allowed to see. The simplest contract appends past observations, tool calls, and reflections to every prompt, which makes prior context easy to access but also turns it into a jumbled mixture in which the effect of any single memory component is hard to isolate. We introduce and instrument an alternative bounded contract: every decision is made from a fresh user message assembled by typed retrieval, with no raw cross-decision transcript appended. The prompt thus stays bounded across runs of any length, and any single layer can be ablated in isolation. We instantiate the contract in Slay the Spire 2, a closed-rule stochastic deck-building game whose runs require hundreds of tactical and strategic decisions. A public online benchmark of frontier LLMs on the same game reports zero wins at the lowest difficulty across five configurations, and the developer-reported human win rate at the same difficulty is 16%; the task is hard but not saturated. Within our harness, a fixed-A0 ablation shows the largest observed difference when triggered strategic skills are enabled: the no-store baseline wins 3/10 games and adding the skill layer 6/10. At this sample size the comparison is directional rather than statistically decisive (Fisher exact p\approx0.37); a cross-backbone probe and public accumulating-context baselines are reported as operational comparisons rather than controlled tests of the contract variable itself. We release a reproducible testbed: 298 completed trajectories with condition tags, frozen memory/skill snapshots, prompt records, and analysis scripts – an agent design and a validated, reusable methodology for studying how explicit memory layers shape long-horizon LLM-agent decisions.
[NLP-22] Challenges and Recommendations for LLM s-as-a-Judge in Multilingual Settings and Low-Resource Languages
【速读】: 该论文旨在解决当前生成式AI评估范式在多语言及低资源语言场景下存在的可靠性与有效性问题。尽管大语言模型作为评判者(LLM-as-a-Judge)在英文自然语言生成任务中已展现出对人类判断的高度相关性并逐渐取代传统评估指标,但在多语言尤其是低资源语言场景中,其性能受限于模型在非主流语言上的理解能力不足,且缺乏充分的人类验证支持。论文通过分析ACL Anthology中涉及多语言和低资源语言的650篇提及LLM-as-a-Judge的研究,发现仅有33篇真正聚焦此类场景,且存在评估结果不一致、过度依赖单一模型判断以及缺乏标准化验证流程等问题。其解决方案的关键在于:建立更加严谨的多语言与低资源情境下的评估框架,包括采用多模型交叉验证、引入人工标注作为基准、明确评估边界,并提出针对此类设置的系统性实践建议,以提升评估结果的可信度与可复现性。
链接: https://arxiv.org/abs/2607.02235
作者: A.Seza Doğruöz,Xixian Liao,Verena Blaschke,Jakob Prange,Senyu Li,David Ifeoluwa Adelani
机构: LT3, IDLab, Universiteit Gent (比利时根特大学LT3-IDLab实验室); Barcelona Supercomputing Center (巴塞罗那超级计算中心); LMU Munich Munich Center for Machine Learning (慕尼黑大学慕尼黑机器学习中心); German Center for Addiction Research in Childhood and Adolescence (德国儿童与青少年成瘾研究中心); University Medical Center Hamburg-Eppendorf (汉堡-埃彭多夫大学医学中心); Mila - Quebec AI Institute (蒙特利尔魁北克人工智能研究所); McGill University (麦吉尔大学); Canada CIFAR AI Chair (加拿大加拿大人工智能主席)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under Review
Abstract:LLM-as-a-Judge has become the dominant evaluation paradigm for many natural language generation tasks, due to shortcomings of conventional metrics and high correlations with human judgment, albeit mostly in English. There are now attempts to extend LLM-as-a-Judge to multilingual settings including low-resource languages. However, LLMs have limited proficiency in low-resource languages, and there is often no adequate human validation in these settings. To highlight the scope of the problem and current practices, we explore the use of LLM-as-a-Judge evaluators in ACL Anthology papers focusing on multilingual settings and low-resource languages across a diverse set of tasks. Out of 650 papers mentioning LLM-as-a-judge, only 33 of them focus on low-resource or multilingual settings. Our in-depth analysis of these papers indicates inconsistent evaluation outcomes, a tendency to overtrust LLM judgments in multilingual settings, and the widespread reliance on a single judge model per study. To help the NLP community further, we conclude with recommendations about how to use LLM-as-a-Judge in multilingual and low-resource settings.
[NLP-23] Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning
【速读】: 该论文旨在解决语音语言模型(Speech Language Model, SLM)在指令微调(instruction tuning)过程中面临的挑战,即需同时学习语音模态和大量语音特定指令,相较于文本大语言模型(Large Language Model, LLM)而言,其训练复杂度显著增加。现有方法通常沿用文本LLM的训练范式,通过合成大规模语音预训练与指令微调数据集来实现,但该策略难以扩展,主要受限于语音序列远长于文本序列导致的计算成本高昂。本文提出SpeechCombine,一种无需任何指令微调步骤的指令遵循型语音语言模型,仅通过一次在3万小时语音数据上的连续预训练即可完成训练。其核心解决方案是:以文本LLM为基础模型,先在语音语料上进行连续预训练以获得适应语音的模型,随后直接将该语音适配模型的权重与文本LLM中指令微调版本与基础版本之间的权重差异进行融合。实验结果表明,该简单组合策略不仅有效保留了原始文本LLM的知识与能力,还成功将其迁移至语音领域,揭示了一种不依赖海量语音数据的SLM训练新路径。
链接: https://arxiv.org/abs/2607.02214
作者: Congrui Du,Yang Zhang,Kaizhi Qian,Shiyu Chang
机构: 未知
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:
Abstract:Instruction tuning for speech language models (SLMs) is substantially more challenging than for text-based large language models (LLMs), as it requires learning a new modality and a wide range of speech-specific instructions in addition to those supported by text LLMs. Existing SLM training approaches largely replicate the text LLM training paradigm by synthesizing large-scale speech pre-training and instruction-tuning datasets. However, this strategy is difficult to scale, since speech sequences are significantly longer than text sequences. In this paper, we propose SpeechCombine, an instruction-following speech language model trained without any instruction tuning, using only a single round of speech pre-training on 30k hours of data. Starting from a text LLM base model, we perform continuous pre-training on speech utterances to obtain a speech-adapted model, and then directly combine its weights with the weight difference between the instruction-tuned and base versions of the text LLM. Our results show that this simple combination strategy not only preserves the knowledge and capabilities of the original text LLM, but also effectively transfers them to the speech domain. These findings suggest a new direction for SLM training that avoids reliance on massive speech data.
[NLP-24] Bayesian Sparse Low-Rank Adaptation for Large Language Model Uncertainty Estimation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在特定任务微调过程中普遍存在的过度自信(overconfidence)问题,这一问题严重制约了模型在实际应用中的可信部署。其核心解决方案是提出一种名为数据自适应低秩适应(Data-Adaptive Lower-Rank Adaptation, DALorRA)的变分贝叶斯稀疏框架。该方法的关键在于将不确定性量化(uncertainty quantification)的范式从传统的密集参数空间转移到低秩适配(Low-Rank Adaptation, LoRA)的轻量级秩维度层面,通过在秩维度上施加随机掩码(stochastic masking),实现训练阶段的贝叶斯正则化与推理阶段的集成式校准,从而有效抑制模型过拟合和过度自信,同时保持原有的推理准确性。
链接: https://arxiv.org/abs/2607.02182
作者: Jijie Zhang,Zhe Ren,Quan Zhang,Dandan Guo
机构: Jilin University (吉林大学); Michigan State University (密歇根州立大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Preprint. 16 pages, 7 figures, 6 tables
Abstract:Large language models (LLMs) exhibit remarkable reasoning capabilities, but their task-specific fine-tuning is notoriously plagued by overconfidence, severely hindering trustworthy deployment. We propose Data-Adaptive Lower-Rank Adaptation (DALorRA), a simple and effective variational Bayesian sparse framework that shifts the paradigm of uncertainty quantification from the dense parameter space to the lightweight rank level of low-rank adaptation (LoRA). With the insight that LoRA essentially aggregates multiple rank-one components that may provide superfluous model capacity, DALorRA imposes stochastic masking on rank dimensions, enabling Bayesian regularization of model capacity during training and ensemble-like calibration during inference. Extensive experiments demonstrate DALorRA’s excellent calibration of LLMs without compromising reasoning accuracy.
[NLP-25] ESC: Emotional Self-Correction for Reliable Vision-Language Models ECCV
【速读】: 该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在多模态任务中存在不可靠推理的问题,尤其关注现有自修正方法依赖后训练或精心设计的反馈机制所带来的高计算成本。其核心解决方案是提出一种无需额外训练的自修正框架——情感自修正(Emotional Self-Correction, ESC),通过引入外部验证器检测初始响应中的潜在错误,并注入情感反馈信号以激发模型的反思行为,从而在不增加训练开销的前提下实现更谨慎、可靠的推理。实验结果表明,ESC在安全性、幻觉抑制、视觉感知及多模态推理等多个基准上均显著提升了模型的可靠性,同时保持了原有性能水平,验证了情感线索可作为可扩展的控制信号,推动构建具备类人情绪感知与自修正能力的下一代智能系统。
链接: https://arxiv.org/abs/2607.02089
作者: Tien-Huy Nguyen,Minh-Nhat Nguyen,Nguyen Nhat Huy,Hung Viet Nguyen,Huy Nguyen Minh Nhat,Thanh-Huy Nguyen,Cuong Tuan Nguyen,Hoang M. Le,Dat Nguyen,Phat Kim Huynh,Min Xu,Ulas Bagci
机构: AI VIETNAM; PAMI Lab
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: ECCV Main Track 2026 (113 pages, 15 tables, 65 figures). Project Page: this https URL
Abstract:Vision-language models (VLMs) have achieved strong performance across diverse multimodal tasks, yet they remain vulnerable to unreliable reasoning. Existing self-correction methods mitigate these issues but typically rely on post-training or carefully engineered feedback, incurring high computational cost. In this work, we revisit this challenge through the lens of emotional cues, asking whether they can activate latent self-correction behaviors in VLMs without additional training. \textbfWe find that emotional signals serve as an effective trigger for self-correction, encouraging more cautious and reflective reasoning. Motivated by this finding, we propose \escabstract (\textbf\underlineEmotional \textbf\underlineSelf-\textbf\underlineCorrection), a training-free self-correction framework. ESC introduces an external verifier that detects potentially incorrect initial responses and injects emotional feedback to encourage model to reflect, and produce a better revised response without additional training. Extensive experiments across safety, hallucination, vision-centric perception, and multimodal reasoning benchmarks show that ESC consistently improves reliability while preserving overall model utility. These results suggest that emotion can function not only as an ability to be recognized, but also as a practical control signal for scalable self-correction in VLMs. \textbfWe therefore believe that ESC provides a strong foundation for a new reliable human-like, emotion-integrated research direction. Our project is publicly available at \textcolorredthis https URL.
[NLP-26] HaloGuard 1.0: An Open Weights Constitutional Classifier for Multilingual AI Safety
【速读】: 该论文旨在解决大模型在输入安全(input safety)检测中面临的高误报率(false positive rate, FPR)、低召回率(false negative rate, FNR)以及多语言泛化能力不足的问题,尤其针对当前主流开源安全防护模型体积庞大、效率低下且对意图反转攻击敏感的缺陷。其核心解决方案是提出一种基于“安全宪法”(safety constitution)的结构化生成与评估框架:以46项政策和2,940个子类别构成的自然语言宪法为指导,通过严格的一一对应反事实数据生成机制,在保持主题与词汇不变的前提下仅反转意图,从而实现对模型行为的精准控制;采用双层无害设计,分别针对边界区域和基线区域的误报进行优化;同时在46种语言间实现均衡的多语言表征,将语言视为跨边界的表面形式而非对抗性信号。该方法使HaloGuard 1.0-0.8B在七项英文及多语言提示安全基准上达到90.9的平均F1分数,显著优于参数量高达30倍以上的基线模型,且维持低误报率(FPR=4.3)与低漏报率(FNR=9.5)。HaloGuard 1.0-4B进一步提升至平均F1 92.1,通过增加容量增强精度。对剩余失败案例的结构化分析表明,多数看似漏检的情况实为基准标注错误,而非模型真实失效。此外,系统引入持续运行的对抗红队测试协议,有效抵御内容级与代理型攻击。最终,模型以开放权重形式发布,推动可复现、可审计的安全防护研究。
链接: https://arxiv.org/abs/2607.02079
作者: Navaneeth Sangameswaran,Preetham S,Ashmiya Lenin
机构: Astroware AI(艾斯特瓦人工智能); Karunya Institute of Technology and Sciences(卡鲁尼亚科技与科学学院)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 30 pages, 7 figures, 20 Tables, Link: this https URL
Abstract:We present HaloGuard 1.0, an open-weights implementation of the constitutional-classifier paradigm for input safety. It achieves state-of-the-art performance on English and multilingual prompt-safety benchmarks at roughly one-tenth the model size of current leading open guard models. The safety constitution is the organising structure of the corpus: a natural-language constitution of 46 policies and 2,940 subcategories drives synthetic data generation, with exhaustive one-to-one paired counterfactuals that hold topic and vocabulary fixed while flipping intent, a two-tier harmless design that separately targets boundary and baseline false positives (FPs), and balanced multilingual materialisation across 46 languages that treats language as a surface form appearing on both sides of the boundary rather than as an adversarial signal. Across seven prompt-safety benchmarks, HaloGuard 1.0-0.8B attains the best average F1 (90.9) of any open guard we evaluate, outperforming baselines up to 27B parameters (over 30 times larger) while holding false-positive rate (FPR) to 4.3 and false-negative rate (FNR) to 9.5. The HaloGuard 1.0-4B variant reaches average F1 of 92.1 and FPR of 3.5, spending its extra capacity on precision rather than recall. A structured adjudication of the remaining failures indicates that most apparent missed-harm cases are benchmark mislabels rather than genuine model misses. An always-on adversarial red-teaming protocol continuously hardens the guard against both content-level and agentic attacks. We release the models as open weights.
[NLP-27] SPLIT: Cross-Lingual Empathy and Cultural Grounding in English and Ukrainian LLM Responses KR
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在情感支持与危机情境下跨语言能力不足的问题,尤其关注低至中等资源语言在情感共鸣与文化语境适配方面的表现。现有评估基准虽涵盖多语言性能,却普遍缺乏对危机情境下共情能力与文化根基的深入考察。为此,研究提出SPLIT基准,包含500个提示,覆盖压力(Stress)、恐慌(Panic)、孤独(Loneliness)、内部流离(Internal Displacement)及紧张(Tension)五大类别,用于评估LLMs在英语与乌克兰语中生成具有情感根基回应的一致性。研究从共情准确性(Empathetic Accuracy)、语言自然度(Linguistic Naturalness)和上下文文化适配性(Contextual Cultural Grounding)三个维度,对比评估三类技术架构各异的LLMs。关键发现表明,Gemini-2.5-Flash与LLaMA-3.3-70B-Instruct在转向乌克兰语时性能显著下降,而DeepSeek-V3则表现出更强的稳定性;同时,人类与人工智能评价者在共情与自然度上仅呈现弱一致性,但在文化适配性上存在明显分歧。研究进一步指出,生成乌克兰语文本并不等同于提供有效的乌克兰语情感支持,强调未来需发展更具文化针对性的评估框架,并强化以用户为中心的人类评估机制。
链接: https://arxiv.org/abs/2607.02049
作者: Anna Chorna
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 19 pages, 5 figures, 3 tables. Benchmark paper introducing SPLIT for evaluating empathy, linguistic naturalness, and cultural grounding in English and Ukrainian LLM responses
Abstract:Large Language Models are increasingly deployed in emotional-support contexts and crisis-related situations. Nevertheless, their cross-lingual abilities in these circumstances remain underexplored. Existing benchmarks emphasize multilingual performance but rarely examine crisis-related empathy and cultural grounding in low-to-mid-resource languages. We introduce SPLIT, a 500-prompt benchmark designed to evaluate LLM consistency in generating emotionally grounded responses across five categories: Stress, Panic, Loneliness, Internal Displacement, and Tension. We evaluate three technically diverse LLMs across three dimensions: Empathetic Accuracy, Linguistic Naturalness, and Contextual Cultural Grounding. The framework aims to assess and compare the quality of LLM responses in both English and Ukrainian languages, as well as to explore the reliability of the LLM-as-a-jury paradigm. Our findings reveal that Gemini-2.5-Flash and LLaMA-3.3-70B-Instruct degrade when transitioning to Ukrainian, while DeepSeek-V3 remains comparatively stable within our benchmark. We additionally find that human and AI evaluators agree weakly on empathy and naturalness but diverge on cultural grounding. We further argue that producing Ukrainian text is not equivalent to producing Ukrainian emotional support. Our findings may assist in the future development of more culturally tailored benchmark designs, as well as encourage a stronger emphasis on human-centered evaluation.
[NLP-28] OpenSafeIntent: Evaluating Intent-Calibrated Safe Completion Across Dual-Use Prompt Sets
【速读】: 该论文旨在解决生成式 AI(Generative AI)在安全完成任务时存在的评估难题,即现有方法依赖孤立提示(isolated prompts)进行安全评估,难以准确反映模型在不同意图下的行为一致性。其解决方案的关键在于提出 OpenSafeIntent——一个受控的提示集基准,通过固定底层任务而系统性地改变意图(包括良性、双用途和恶意三种变体),从而实现对模型在意图迁移过程中是否具备安全校准能力的精准评估。研究发现,仅在平均层面表现安全的模型在具体意图变体间存在显著不一致,双用途行为对改写敏感,高阶风险话题的回答安全性不可靠,且将模糊请求重构为更安全任务的响应更不易突破安全边界。这表明,安全完成应被视作在受控任务变体上对意图具有校准能力的行为,而非独立提示下单一的安全性与有用性权衡。
链接: https://arxiv.org/abs/2607.02047
作者: Rheeya Uppaal,Seungwoo Lyu,Selina Sung,Junjie Hu
机构: University of Wisconsin-Madison (威斯康星大学麦迪逊分校); Korea University (高丽大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint
Abstract:Safe completion requires models to provide useful assistance without enabling harm, but this behavior is difficult to evaluate with isolated prompts. We introduce OpenSafeIntent, a benchmark of controlled prompt-sets that vary intent while holding the underlying task fixed. Each datapoint contains benign, dual-use, and malicious variants of the same task. This design lets us evaluate whether models calibrate assistance across intent shifts, rather than merely appearing safe on average. Across a broad model suite, we find that prompt-level safety hides important failures: models often fail to remain safe across matched intent variants, dual-use behavior is brittle under paraphrase, high-level answers on risky topics are not reliably safe, and responses that reframe ambiguous requests into safer tasks are substantially less likely to cross the safety boundary. Our results suggest that safe completion should be evaluated as intent-calibrated behavior over controlled task variants, not as a single safety-helpfulness tradeoff over independent prompts.
[NLP-29] PACE: A Proxy for Agent ic Capability Evaluation
【速读】: 该论文旨在解决在评估大语言模型智能体(LLM agents)时面临的高成本、耗时长及基础设施复杂等问题,尤其针对SWE-Bench和GAIA等典型代理型基准测试所存在的资源消耗难题。其核心解决方案在于提出一种名为PACE的框架,通过构建代理基准(proxy benchmark),利用少量精心挑选的原子级非代理型评估实例(atomic evaluation instances)来准确预测模型在昂贵的代理型基准上的表现。PACE的关键在于结合两种互补的实例选择策略:基于目标相关性的局部选择(target-relevance local selection)与全局信息量最大的全局选择(globally informative global selection),从而筛选出最具预测力的源实例子集;在此基础上,通过回归建模将模型在小规模源实例上的得分映射至目标代理型基准的预测得分。实验表明,基于PACE构建的PACE-Bench在14个模型、4个代理型基准和19个非代理型基准上的评估中,实现了留一法交叉验证(LOOCV)下均方绝对误差(MAE)低于4%、斯皮尔曼等级相关系数(Spearman correlation)超过0.80、成对模型排序准确率约85%的优异性能,且计算开销不足完整代理评估的1%。此外,对所选代理实例的分析揭示了各代理型基准所特有的核心能力需求。该方法为模型开发、选型与路由阶段提供了高效可靠的代理性能估算手段,显著降低了对全量代理评估的依赖。
链接: https://arxiv.org/abs/2607.02032
作者: Yueqi Song,Lintang Sutawika,Jiarui Liu,Lindia Tjuatja,Jiayi Geng,Yunze Xiao,Daniel Lee,Aditya Bharat Soni,Vincent Lo,Xiang Yue,Graham Neubig
机构: Carnegie Mellon University (卡内基梅隆大学); Salesforce AI Research (Salesforce AI 研究院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Evaluating LLM agents on benchmarks like SWE-Bench and GAIA can be expensive, time-consuming, and requires complex infrastructure. A single evaluation can cost thousands of dollars and take days to complete. In contrast, non-agentic LLM benchmarks that test individual capabilities (e.g., reasoning, code generation) are fast and cheap to run. In this paper, we investigate whether performance on expensive agentic benchmarks can be accurately predicted by the performance on a small, carefully selected subset of atomic evaluation instances. We introduce PACE, a framework that constructs proxy benchmarks by selecting instances from existing non-agentic evaluations whose aggregate scores most reliably predict model performances on agentic benchmarks. Given a pool of candidate instances spanning atomic capabilities, PACE fits a regression that maps a model’s scores on a compact subset of source instances to its score on the target agentic benchmark. The subset itself is curated by combining two complementary instance-selection strategies, target-relevance local selection and globally informative global selection. We apply PACE to the 4 target agentic benchmarks in this paper, which yields PACE-Bench, the concrete proxy benchmark that we evaluate in the paper. Experiments across 14 models, 4 agentic benchmarks, and 19 non-agentic benchmarks show that PACE-Bench predicts agentic scores with leave-one-out cross-validation (LOOCV) mean absolute error (MAE) under 4%, Spearman correlation above 0.80, and pairwise model-ranking accuracy around 85%, all at much less than 1% of the full agentic evaluation cost. We further analyze the selected proxy instances, revealing which skills each agentic benchmark uniquely demands. PACE enables practitioners to obtain reliable estimates of agentic performance during model development, selection, and routing, without the overhead of full agent evaluation.
[NLP-30] EduArt: An educational-level benchmark for evaluating art history knowledge in large language models
【速读】: 该论文旨在解决当前大型语言模型(LLM)在艺术史领域知识与视觉推理能力评估中存在的关键问题:现有评估体系多依赖合成题目且缺乏细粒度分析,难以揭示模型在特定学科内的真实表现能力。尤其当模型在通用基准上已接近天花板分数时,传统以选择题为主的单一格式评测无法有效区分前沿模型的实际认知水平。为此,论文提出EduArt——一个面向多模态大模型的艺术史知识与视觉推理教育级基准,包含871道由人类编写的意大利中学课程及美国大学先修课程(AP Art History)试题,覆盖双语、七种题型(从单选到文本内填空、错误识别等),具备高心理测量学特性(平均区分度0.514,82.3%为良好区分题)。研究发现,尽管多选题准确率普遍接近饱和(六款模型达天花板),但在开放生成和错误识别等需深度理解与推理的任务中,性能急剧下降(如Claude Opus 4.6在开放补全中降至23.9%,Claude Sonnet 4.6在错误识别中仅6.2%),表明识别能力与知识应用能力存在显著分离。此外,要求提供解题理由的“动机条件”对不同模型家族产生异质性影响,多数情况下降低准确率。这说明单一格式评估会严重夸大模型在复杂学术任务中的可靠性。因此,该研究的关键解决方案在于构建一个结构化、多维度、基于真实教育材料的评估框架,并通过经典测验理论与逻辑回归分析揭示题型、语言、图像存在性及模型来源对表现的独立影响,从而实现对多模态大模型在艺术史研究中实际能力的精准画像,为负责任地应用于内容生成与操作等高阶学术任务提供了必要前提。
链接: https://arxiv.org/abs/2607.02007
作者: Gianmarco Spinaci,Lukas Klic,Giovanni Colavizza
机构: University of Bologna (博洛尼亚大学); Villa i Tatti – The Harvard University Center for Italian Renaissance Studies (哈佛大学意大利文艺复兴研究中心); University of Copenhagen (哥本哈根大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large language models now score near ceiling on general benchmarks, but these aggregate measures reveal little about how models behave within single disciplines. Existing art-focused evaluations rely on synthetic questions and rarely report item-level properties. This paper introduces EduArt, an educational-level benchmark for art-historical knowledge and visual reasoning in multimodal LLMs. EduArt comprises 871 human-authored questions from Italian secondary-school exercises and US Advanced Placement Art History exams, spanning two languages and seven formats from multiple choice to in-text word placement and error identification. Twelve models from six provider families were evaluated under a default answer-only condition and a motivation condition requiring written justification, and characterized using Classical Test Theory and a logistic regression isolating the effects of format, language, image presence, and model. The benchmark showed strong psychometric properties (mean discrimination 0.514, 82.3 percent good discriminators), while multiple-choice accuracy saturated near ceiling for six models, showing recognition formats alone cannot distinguish frontier models. Format was a strong independent predictor of accuracy: models exceeding 94 percent on multiple choice fell to 23.9 percent on open completion (Claude Opus 4.6) and 6.2 percent on error identification (Claude Sonnet 4.6). The motivation condition changed accuracy in a predominantly negative, family-dependent direction. These dissociations indicate that art-historical knowledge and the ability to deploy it are distinct capabilities, and that single-format benchmarks overestimate what models can reliably do. Mapping this capability profile is a precondition for responsible use of multimodal LLMs in art-historical scholarship, where tasks demand producing and manipulating content rather than selecting from fixed options.
[NLP-31] Using embeddings to predict spoken word duration and pitch in Mandarin monosyllabic words
【速读】: 该论文旨在解决普通话口语中单音节词的发音时长(spoken word duration)是否可由上下文嵌入表示(contextualized embeddings, CEs)预测的问题。现有研究表明,语调基频(f0)轮廓在时间归一化后可部分由CEs预测,但其在发音时长预测方面的有效性尚不明确。本研究通过分析来自自然口语语料库的7470个普通话单音节辅音-元音(CV)词的语音数据,发现CEs不仅在词类(type-level)层面,且在个体词项(token-wise)层面均能显著预测发音时长,这一结论通过类型级与词项级置换基线检验得到验证。其解决方案的关键在于:利用CEs对时长进行建模,并将预测的时长结果用于反向转换归一化时间尺度上的基频轮廓至毫秒(ms)时间尺度,生成的预测基频轮廓与实证数据高度吻合,且优于置换基线,表明所提方法在时长建模与声学特征重构方面具备足够的精度和有效性。
链接: https://arxiv.org/abs/2607.02002
作者: Xiaoyun Jin,Mirjam Ernestus,R.Harald Baayen
机构: University of Tuebingen (图宾根大学); Radboud University (奈梅亨大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Time-normalized f0 contours of Mandarin words in conversational speech have been shown to be predictable in part from their contextualized embeddings (CEs). The present study investigates whether CEs also predict spoken word duration for 7470 tokens of Mandarin monosyllabic CV words extracted from a Mandarin corpus of spontaneous speech. We show that CEs indeed are predictive for duration, above chance level, not only at the type level, but also at the level of individual tokens, as indicated by the results obtained with the type-wise and token-wise permutation baselines. We also show that the predicted durations are sufficiently precise to back-transform predicted f0 contours in [0,1] normalized time to contours on the ms time scale. The resulting predicted contours approximate empirical contours and also outperform a permutation baseline.
[NLP-32] Multimodal Knowledge Edit-Scoped Generalization for Online Recursive MLLM Editing
【速读】: 该论文旨在解决在线多模态知识编辑(Online Multimodal Knowledge Editing, OMKE)中,现有方法在保证编辑可靠性与长时稳定性的同时,缺乏对每次编辑语义传播边界的可控性问题。具体而言,现有方法虽能实现单个实例的准确修正,但无法确保编辑效果在跨模态变体间的有效迁移,且易引发无关输入的意外干扰,其根本原因在于编辑相关的语义响应集中于深层语义层而缺乏边界控制。为此,论文提出“编辑作用范围泛化”(Edit-Scoped Generalization)的新范式,将在线多模态大模型(MLLM)编辑从单一实例修正重构为对每条编辑传播范围的精确调控。其核心解决方案是ScopeEdit,一种具备作用范围感知能力的在线编辑器:该方法将每次更新分解为两个独立分支——模态局部吸收分支(modality-local absorption branch)与证据门控共享泛化分支(evidence-gated shared generalization branch)。前者保障编辑在本地模态内的稳定吸收,后者仅在视觉与文本证据充分对齐时才允许跨模态传播。两个分支在正交低秩空间中执行分范围写入操作,并通过Sherman-Morrison递推机制维护各自预条件器,从而实现每次编辑的恒定开销。大量实验表明,ScopeEdit在多种基准、长时编辑流、不同MLLM主干网络及复杂视觉-语言架构下,均显著提升了在作用范围内跨模态迁移与作用范围外局部性之间的权衡,同时保持了编辑的可靠性、稳定性与在线效率。
链接: https://arxiv.org/abs/2607.01978
作者: Siyuan Li,Youyuan Zhang,Ruitong Liu,Junxi Wang,Jing Li
机构: Harbin Institute of Technology, Shenzhen, China; Peng Cheng Laboratory; Peking University; Fudan University
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Online multimodal knowledge editing requires injecting a continual stream of visual-textual corrections into multimodal large language models (MLLMs) with bounded overhead and minimal disruption to unrelated behaviors. Existing editors mainly emphasize edit reliability and long-horizon stability, but rarely control the semantic boundary of each edit. Our pilot analyses of post-edit behaviors and internal neuronal activities reveal a scope gap behind reliable edits: instance-level success neither guarantees transfer to valid cross-modal variants nor prevents leakage to unrelated inputs, while edit-related cross-modal responses concentrate in deeper semantic layers. Therefore, we formulate Edit-Scoped Generalization, reframing online MLLM editing from merely correcting an instance to controlling the propagation boundary of each edit. To this end, we propose ScopeEdit, a scope-aware online editor that decomposes each update into a modality-local absorption branch and an evidence-gated shared generalization branch. The local branch supports stable edit absorption, whereas the shared branch enables cross-modal propagation only when visual and textual evidence are sufficiently aligned. Both branches perform scope-separated write geometries in orthogonal low-rank spaces and maintain branch-wise preconditioners via Sherman–Morrison recursions, yielding constant per-edit overhead. Extensive experiments across diverse benchmarks, long-horizon edit streams, MLLM backbones, real-world VLKEB scenarios, and complex vision-language architectures show that ScopeEdit consistently improves the trade-off between in-scope cross-modal transfer and out-of-scope locality, while preserving edit reliability, stability and online efficiency. Our code is available at this https URL.
[NLP-33] Object Aligner: A Configurable JSON Schema Similarity Score for Graphs Applied to LLM Prompt Optimization
【速读】: 该论文旨在解决大语言模型(LLM)生成符合固定模式的JSON数据时,如何高效、准确且可复现地衡量生成结果与标准参考之间结构相似性的难题。现有方法存在明显缺陷:精确匹配过于敏感,文本相似性忽略结构信息,而依赖大语言模型作为评判器则成本高、不可靠且具有非确定性。为此,论文提出Object Aligner(OA),一个开源的Python库,其核心创新在于通过递归对齐两个JSON对象的树形结构实现确定性评分——对无序集合采用匈牙利算法,对有序序列采用序列比对,并根据模式定义的粒度给予部分得分。该方法完全通过JSON Schema扩展进行配置,无需编写代码即可适配新任务。针对复杂结构中常见的图或超图形式(如基于任意标识符的引用关系),传统树形对齐假设被打破,因此论文进一步提出“引用对齐”(referential alignment)机制,通过推断黄金数据与候选输出间标识符的双射关系,使评分对重命名保持不变。由于精确求解该双射等价于图同构问题,OA采用Weisfeiler-Leman颜色细化近似求解。此外,引入顺序敏感的序列模式以支持排序与规划类任务。由于同一对齐过程可定位所有不一致点,该框架还能在无额外开销的情况下生成排序修复建议。实验表明,在GEPA提示优化器中作为奖励函数使用时,Object Aligner在所有数据集上均表现良好或至少保持中立,验证了其有效性与通用性。
链接: https://arxiv.org/abs/2607.01972
作者: Jan Drchal
机构: Czech Technical University in Prague (捷克技术大学); Technology Agency of the Czech Republic (捷克共和国技术局)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 28 pages, This is a submitted version of a manuscript under review at IEEE Access; it has not been peer reviewed
Abstract:Large language models (LLMs) are often asked to produce JSON conforming to a fixed schema, powering information extraction, tool calling, agentic planning, and knowledge-graph construction. Measuring how closely an output matches a gold reference is essential yet surprisingly hard: exact match is brittle, text similarity ignores structure, and an LLM judge is expensive, opaque, and non-deterministic. We address this with Object Aligner (OA), an open-source Python library that scores two JSON objects deterministically by recursively aligning their trees (the Hungarian algorithm for unordered collections, sequence alignment for ordered ones) and awarding partial credit at the granularity the schema declares. The Object Aligner is configured entirely through a set of JSON Schema extensions, so adapting it to a new task involves annotating a schema rather than writing code. Complex structured data, however, are rarely flat trees: records may form graphs or hypergraphs keyed by arbitrary identifiers, breaking the assumptions of prior similarity metrics. Our central contribution, referential alignment, closes this gap by inferring a bijection between gold and candidate identifiers and scoring every reference through it, so the score is invariant to relabeling. Since recovering this bijection exactly is graph isomorphism, the Object Aligner approximates it with Weisfeiler-Leman color refinement. An order-sensitive sequence regime targets ranking and planning. Since the same alignment localizes every mismatch, the Object Aligner emits ranked repair suggestions at no extra cost. Used as a reward inside the GEPA prompt optimizer, Object Aligner helps or stays neutral across all datasets.
[NLP-34] owards a Phonology-Informed Evaluation of Multilingual TTS INTERSPEECH2026
【速读】: 该论文旨在解决生成式语音合成(TTS)系统在实现自然语音的同时,可能未能准确保留语言特有音位对比(phonological contrasts)的问题。现有标准评估指标(如平均意见分,MOS)无法有效检测此类音位失真。为此,论文提出一种基于分类器的审计框架,以人类语音为基准,针对特定语言的音系模式对合成语音进行评估。其核心解决方案在于:利用在真实人类语音上训练的分类器,迁移至合成语音,从而量化合成结果与目标音位特征(如阿萨姆语中的先进舌根[+ATR]元音和谐)的一致性。实验表明,尽管合成语音在整体自然度上表现良好,但约1/3的[+ATR]中元音被错误地表现为[-ATR],而此偏差在人类语音中不存在。此外,基于预测的ATR标签在词级音位和谐分类上的表现优于转写标签,揭示了合成语音在音位实现上与预期存在系统性偏差。该框架不仅提供任务特定的诊断能力,还可推广至其他具有可测量声学线索的音位对比分析。
链接: https://arxiv.org/abs/2607.01965
作者: Sneha Ray Barman,Neeraj Kumar Sharma,Shakuntala Mahanta
机构: IIT Guwahati (Indian Institute of Technology Guwahati); Mehta Family School for Data Science and Artificial Intelligence, IIT Guwahati (印度理工学院古瓦哈蒂分校数据科学与人工智能梅塔家庭学院); Centre for Linguistic Science and Technology, IIT Guwahati (印度理工学院古瓦哈蒂分校语言学科学技术中心)
类目: Computation and Language (cs.CL); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: Accepted at Interspeech 2026
Abstract:Neural TTS systems can sound natural across languages, but naturalness does not guarantee the preservation of sound contrasts that distinguish words from their grammatical forms. Standard metrics like MOS do not test for this. We propose a classifier-based framework that audits TTS output against language-specific phonological patterns using human speech as a benchmark. Testing Assamese advanced tongue root (ATR) vowel harmony with Meta’s MMS TTS, we show that a classifier trained on human speech transfers to synthesized speech with minimal loss. The faithfulness audit reveals that [+ATR] mid vowels are realized as [-ATR] in 1/3 tokens despite an underlying [+ATR] specification, a bias absent in human speech. At the word level, predicted ATR labels classify harmony more accurately than transcription labels, indicating a gap between intended and produced phonology. The framework offers task-specific diagnostics and generalizes to other phonological contrasts with measurable acoustic cues.
[NLP-35] Beyond Supervised Clarification: Input Rewriting with LLM s for Dialogue Discourse Parsing SIGDIAL2026
【速读】: 该论文旨在解决在真实部署场景下,对冻结的下游话语解析模型进行输入重写以提升性能时所面临的挑战。其核心问题是:在缺乏监督式澄清标注的情况下,仅依赖零样本提示或来自冻结解析器的反馈,如何有效实现输入重写以改善话语解析准确率。研究发现,传统依赖最后一句澄清的策略在无监督条件下可靠性显著下降,且通用型重写方法往往引入比修复更多的错误,因为修改虽能修正某些问题,却可能破坏解析器依赖的关键话语线索。通过对“最佳8个重写”方案的分析,揭示出输入重写存在实际上限——大量错误无法通过输入侧修改解决。尽管采用基于GRPO(Generalized Reward Policy Optimization)训练的解析器感知澄清器可将错误增加量减少37%,但仍无法生成具有选择性意识的澄清内容,难以持续提升解析效果。因此,论文提出将澄清重构为一种选择性干预(selective intervention)问题,并指出可重写性预测(rewritability prediction)——即在干预前判断某话语是否可修复——是当前输入端优化冻结话语解析器的关键缺失能力,也是未来提升智能体流水线整体效能的重要方向。
链接: https://arxiv.org/abs/2607.01964
作者: Yiming Liu,Ziyue Zhang,Zhichao Xu,Xin Yu,Yingheng Tang,Tianyu Jiang,Jie Cao
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to SIGDIAL 2026. 17 pages, 2 figures
Abstract:Rewriting inputs to improve frozen downstream models has become a common strategy in modern NLP pipelines. Prior work on incremental dialogue discourse parsing (DDP) shows that supervised clarification models can rewrite fragmentary or underspecified utterances, such as resolving ellipsis or references, to improve parsing accuracy. In this work, we revisit this idea under realistic deployment conditions, where no clarification supervision is available and the clarifier must rely on zero-shot prompting or feedback from a frozen parser. Across three Segmented Discourse Representation Theory (SDRT) datasets and multiple parsers, we find that last-utterance clarification is far less reliable than suggested by supervised settings. Parser-agnostic rewriting often introduces more regressions than repairs, as edits that enable fixes also disrupt discourse cues relied upon by the parser. A best-of-8 rewriting analysis further reveals a practical ceiling: a large fraction of errors are not repairable through input rewriting alone. A parser-aware clarifier trained with GRPO reduces regressions by up to 37% by learning conservative abstention, yet still fails to produce selectivity-aware clarifications that consistently improve parsing. Together, these findings recast clarification as a selective intervention problem. We identify rewritability prediction, deciding whether an utterance is repairable before intervention, as the key missing capability for input-side optimization of frozen discourse parsers, and a critical direction for improving agentic pipelines more broadly.
[NLP-36] NAVER LABS Europe Submission to the Instruction-following 2026 Short Track
【速读】: 该论文旨在解决多任务语音处理中的指令跟随问题,具体包括自动语音识别(ASR)、语音翻译(ST)和语音问答(SQA)的联合建模挑战,目标是实现从英语语音到中文、意大利语和德语的端到端多语言转换。其解决方案的关键在于两个核心改进:一是采用SpeechMapper方法替代原有的语音投影模块,该方法仅依赖自动语音识别(ASR)数据即可学习高效的语音到大语言模型(LLM)嵌入投影器,显著提升了语音表示的语义对齐能力;二是构建了一个名为fakACL的合成语音问答(SQA)数据集,通过大语言模型(LLM)生成科学演讲内容,并利用SeamlessM4T-large-v2合成语音,实现了领域特定的高质量训练数据生成。上述两项改进使系统在保持更小模型规模和较弱LLM基础的前提下,性能超越去年最优方案,并在IWSLT 2026指令跟随语音处理短赛道中与其它系统并列第一。
链接: https://arxiv.org/abs/2607.01960
作者: Marcely Zanon Boito,Hemant Yadav,Jean-Luc Meunier,Ioan Calapodescu
机构: 未知
类目: Computation and Language (cs.CL)
备注: IWSLT 2026 system paper
Abstract:In this paper, we describe NAVER LABS Europe’s submission to the instruction-following speech processing short track at IWSLT 2026. We participate again in the constrained setting, developing systems capable of jointly performing ASR, ST, and SQA from English speech into Chinese, Italian, and German. Building on our previous submission, ranked first in last year’s short track, we update our multi-stage training pipeline by replacing the speech projector with SpeechMapper, a method for learning a speech-to-LLM embedding projector using only ASR data. In addition, we introduce a synthetic SQA dataset, fakACL, composed of artificially generated scientific presentations. This dataset is built by prompting the LLM backbone, segmenting the generated talks, and synthesizing speech with SeamlessM4T-large-v2. The combination of an improved speech projection mechanism and domain-specific synthetic data allows our model to outperform last year’s best short-track system, while being considerably more compact and relying on a weaker LLM backbone. This year’s results place our system tied for first place in the overall short track ranking.
[NLP-37] PhysMani: Physics-principled 3D World Model for Dynamic Object Manipulation ECCV2026
【速读】: 该论文旨在解决在非结构化三维环境中对快速动态目标进行精准操控的挑战,现有视觉-语言-动作模型与世界模型在3D几何建模和物理上合理的未来状态预测方面存在局限。其解决方案的关键在于提出PhysMani框架,该框架将基于物理原理的3D高斯世界模型与前瞻感知的动作策略模型相耦合:世界模型通过在线优化学习无散度的高斯速度场,实现对快速且物理上合理未来的动态预测;策略模型则通过可学习的基于标记的交叉注意力模块,融合预测的三维场景未来动态信息,从而提升决策能力。实验表明,该方法在自建的PhysMani-Bench动态操控基准(包含16个任务)上,在仿真与真实机器人实验中均显著优于现有强基线模型。
链接: https://arxiv.org/abs/2607.01938
作者: Peng Yun,Shouwang Huang,Hao Li,Jinxi Li,Jianan Wang,Bo Yang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: ECCV 2026. Code and data are available at: this https URL
Abstract:Manipulating fast and dynamically moving targets in unstructured 3D environments remains challenging for embodied AI. Existing visual-language-action models and world models struggle with accurate 3D geometry and physically meaningful forecasting. We propose PhysMani, a framework that couples a physics-principled 3D Gaussian world model with a future-aware action policy model. The world model learns a divergence-free Gaussian velocity field via online optimization for fast and physically grounded future dynamics prediction. The policy model integrates the predicted 3D scene future dynamics through a learnable token based cross-attention module. We introduce PhysMani-Bench, a dynamic manipulation benchmark with 16 tasks, and demonstrate a superior success rate over strong baselines in both simulation and real-world robot experiments.
[NLP-38] AIriskEval-edu: New Dataset for Risk Assessment in AI-mediated K-12 Educational Explanations CCS
【速读】: 该论文旨在解决教育领域中基于大语言模型(LLM)的生成内容在教学评估中存在可解释性不足与潜在教学风险难以识别的问题。其核心挑战在于如何构建一个既能有效识别生成式教学内容中隐含的五维教学风险(事实准确性、深度完整性、焦点相关性、学生适切性及意识形态偏见),又能提供可追溯、可解释的风险定位与描述的评估体系。解决方案的关键在于提出并构建了AIriskEval-edu-db2数据集,该数据集包含1,639条来自K-12阶段科学、语文和社科类问题的教师真实解释及11种由模拟教师角色生成的、带有特定教学风险的LLM生成解释,并引入一套与教育标准对齐的综合性风险评估量表。尤为关键的是,其中785条解释配备了结构化的可解释性标注(包括风险定位与风险描述),通过半自动流程结合专家教师验证实现,显著提升了风险分析的透明度与可信度。此外,研究通过对比前沿闭源模型与轻量级本地部署的Llama 3.1 8B模型在风险检测与可解释性评估任务中的表现,验证了在该数据集上进行监督微调后,本地化模型能够逼近甚至超越强大闭源模型的性能,从而在保障教育数据隐私的前提下,实现高效、可解释的教学风险审计。
链接: https://arxiv.org/abs/2607.01934
作者: Javier Irigoyen,Roberto Daza,Francisco Jurado,Julian Fierrez,Ruben Tolosana,Alvaro Ortigosa,Enrique Blas,Aythami Morales
机构: Universidad Autónoma de Madrid (马德里自治大学); VERIDAS; M2RAI (MICIU/FEDER); TRUST-ID (MICIU/AEI and the EU); PowerAI+ (Comunidad de Madrid and UAM); Ministerio de Ciencia e Innovación (MINECO)/FEDER
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: 6 pages, 2 figures. Accepted at the IEEE International Carnahan Conference on Security Technology (ICCST 2026), October 14, 2026
Abstract:This work introduces AIriskEval-edu-db2, a new dataset designed to train and evaluate auditors based on LLMs for an explainable pedagogical risk assessment in instructional content for grades K-12. The dataset comprises 1,639 explanations from 170 curated ScienceQA questions, covering science, language arts, and social sciences. For each question, the dataset includes an explanation written by a human teacher alongside 11 explanations generated by LLM-simulated teacher profiles associated with distinct pedagogical risks. We propose a comprehensive risk rubric aligned with established educational standards that covers five complementary dimensions: factual precision, depth and completeness, focus and relevance, student-level appropriateness, and ideological bias. A key contribution is the addition of 785 explanations with structured explainability annotations, including risk localization and risk description. The annotations are produced through a semi-automatic process with expert teacher validation. Finally, we present validation experiments comparing state-of-the-art proprietary models with a lightweight local Llama 3.1 8B model in both the pedagogical risk detection and the explainability assessment. These experiments evaluate whether supervised fine-tuning on AIriskEval-edu-db2 enables a locally deployable model to approach or outperform stronger frontier models while preserving privacy in educational auditing and assessment tasks.
[NLP-39] UDUM: A Turkish-Thinking Reasoning Pipeline for Qwen 3.5-27B
【速读】: 该论文旨在解决生成式AI在处理土耳其语(Turkish)推理任务时存在的“表面本地化”问题,即模型虽能以土耳其语响应,但其内部思维过程仍依赖英语主导的推理逻辑,仅在最终输出阶段进行语言转换。为实现真正的土耳其语思维(Turkish-thinking),TUDUM(Türkçe Düşünen Üretken Model)提出将“think…/think”推理块视为可训练的行为模块,而非仅作为中间表示。其核心解决方案是:基于unsloth/Qwen3.5-27B基础模型,首先使用15,991个土耳其语推理样本进行监督微调(SFT),并采用LoRA适配器降低计算成本;随后在经过代理过滤的土耳其语数学环境中应用GRPO类强化学习(Reinforcement Learning)。实验结果表明,SFT显著缩短了响应长度、提升了推理行为的一致性与土耳其语特征表达,但牺牲了部分基准测试准确率;强化学习在一定程度上恢复了数学能力(如在AIME24上表现优异),但未能全面提升所有指标,且未超越基线模型在宏平均(Macro-6)上的表现。因此,该研究的核心贡献在于构建了一个技术上诚实、透明的土耳其语思维推理管道及其评估框架,而非宣称达到最先进的土耳其语推理性能。所发布的step-50模型已公开可用。
链接: https://arxiv.org/abs/2607.01927
作者: Baran Bingol,Bahaeddin Turkoglu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper presents TUDUM (Türkçe Düşünen Üretken Model), a project pipeline for adapting a Qwen-family 27B thinking model toward Turkish reasoning. The central problem is not only to answer Turkish prompts in Turkish, but to make the explicit reasoning trace itself Turkish. A thinking model may translate a Turkish prompt into an English-centered internal or visible scratchpad, solve the problem mostly in English, and only localize the final answer. TUDUM instead treats the generated think…/think block as a trainable behavior. The pipeline starts from the project base checkpoint unsloth/Qwen3.5-27B, applies supervised fine-tuning (SFT) on 15,991 Turkish reasoning examples using LoRA adapters, and then applies GRPO-family reinforcement learning on a proxy-filtered Turkish mathematics environment. The results are mixed. SFT made the model shorter and more consistently Turkish in its reasoning behavior, with large reductions in average response length and thinking exhaustion, but reduced benchmark accuracy. RL recovered some mathematical performance, especially AIME24 at the best early checkpoint, yet did not uniformly improve all benchmarks and did not exceed the base model on the reported Macro-6 average. The contribution is therefore best framed as a technically honest Turkish-thinking reasoning pipeline and evaluation, not as a claim of state-of-the-art Turkish reasoning. The released step-50 model is publicly available.
[NLP-40] he Grammar Does the Work: Functional vs. Lexical Dependency Length Minimization Across Universal Dependencies
【速读】: 该论文旨在解决依赖句法分析中依赖距离最小化(Dependency Length Minimization, DLM)研究长期存在的局限性问题,即以往研究仅报告每种语言的单一平均依赖距离(Mean Dependency Distance, MDD),忽略了不同句法关系类型之间的差异。其核心解决方案在于揭示DLM在两个不同层面上运作:一是由语法驱动的优化,针对功能类依赖(如限定词、格标记、助动词等),这类依赖普遍较短(均值1.71,标准差0.33),且在不同类型的语言中保持稳定;二是由语言处理压力驱动的优化,作用于词汇类依赖(如主语、宾语、状语等),其依赖距离更长(均值2.87,标准差0.63),具有高度可变性,并受到语序类型学的制约。这一不对称性在句法结构相反的苏丹语料库(SUD)中依然显著(相关系数r=0.92),表明“语法承担了最小化的主要工作”,通过局部的功能性依附构建句法骨架,而词汇核心成分的排列则由语言处理负荷决定。
链接: https://arxiv.org/abs/2607.01899
作者: Kim Gerdes(LISN, Qatent, STL)
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Dependency length minimization (DLM) is a well-documented processing universal, but previous studies report a single mean dependency distance (MDD) per language, obscuring variation across syntactic relation types. We analyze 122 languages in UD and SUD (version 2.17), showing that DLM operates on two distinct levels. Grammar-driven optimization targets functional dependencies (det, case, aux), which are universally short (mean 1.71, \sigma = 0.33) and invariant across typologically diverse languages. Processing-driven optimization operates on lexical dependencies (nsubj, obj, obl), which are longer (mean 2.87), highly variable ( \sigma = 0.63), and constrained by word-order typology. This asymmetry holds in SUD despite reversed head direction (r = 0.92). We conclude that ‘‘the grammar does the work’’ of minimization by scaffolding sentences with local functional attachments, leaving processing pressures to determine the ordering of lexical heads.
[NLP-41] Spec-AUF: Accept-Until-Fail Training under Train-Inference Misalignment for Masked Block Drafters
【速读】: 该论文旨在解决生成式AI(Generative AI)中自回归生成加速技术——推测解码(speculative decoding)所面临的训练目标与推理过程不匹配的问题。具体而言,传统的块级(block-level, DLM-style)推测生成器在训练时采用全块交叉熵损失(full-block cross-entropy),对每个位置均施加监督信号,但推理过程中仅保留首个被接受的前缀,其余内容均被丢弃,导致训练阶段的大量监督信息在实际应用中无效。为解决这一问题,论文提出一种名为AUF(Acceptance-aware Unmasking Feedback)的改进方法,其核心在于通过教师强制学习(teacher-forced learning)的动机,将监督信号聚焦于实际被采纳的前缀部分。AUF的关键创新在于:仅在推测生成器首次预测失败的位置处保留交叉熵支持(即损失计算范围),从而实现对接受前缀的敏感性建模;该方法无需额外辅助目标、无需验证器回滚(verifier rollouts)、不改变推理流程或精确性承诺(exactness contract),是一种轻量级且可直接部署的单步修改。实验表明,在Qwen3-8B固定生成器骨干与服务设置下,AUF使DFlash推测生成器的平均输出长度τ从2.40提升至2.61(六项基准测试均提升),并在Domino双分支头结构上实现从2.56到2.68的提升。进一步分析揭示:仅使用衰减权重的基线虽在共享掩码上取得更高词元准确率,但解码性能更差;而在DFlash上,一旦AUF截断损失支持,标准指数位置衰减权重便失去实际影响,说明监督集中度而非位置衰减是关键因素。
链接: https://arxiv.org/abs/2607.01893
作者: Tianjian Yang,Meng Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages, 5 figures
Abstract:Speculative decoding accelerates autoregressive generation by drafting a block of tokens that the target model verifies left-to-right, committing only the longest accepted prefix. Block (DLM-style) drafters predict the whole block in parallel, which is fast but trained with a full-block cross-entropy that supervises every position against the gold continuation – even though inference discards every token after the first rejection. Recent acceptance-aware objectives patch this by reweighting the full-block loss; we instead use teacher-forced learning as a motivation for how supervision should concentrate on the accepted prefix. A mask-only block drafter has no input-side channel for gold-prefix conditioning, so AUF approximates that prefix-sensitive supervision on the loss side by keeping the cross-entropy support only through the drafter’s first predicted failure. AUF is a single, detached change to the CE support – no auxiliary objective, no verifier rollouts, and no change to the inference pipeline or the exactness contract. Within fixed drafter backbones and serving settings on Qwen3-8B, AUF raises the DFlash drafter’s average emitted length \tau , averaged over six benchmarks, from 2.40 to 2.61, with a gain on every benchmark, and transfers to Domino’s two-branch head (2.56 to 2.68). Two findings sharpen the picture: the decay-only baseline reaches higher token accuracy on the shared block mask yet decodes worse, and on DFlash, once AUF truncates the support, the standard exponential position-decay weighting becomes empirically inert.
[NLP-42] PairCoder: Pair Programming as a Universal Paradigm for Verified Code-Driven Multimodal and Structured-Artifact Generation ACL2026
【速读】: 该论文旨在解决生成式代码在构建可验证结构化产物(如图表、科学图像、矢量图形、CAD模型、3D场景及硬件设计等)时,因单次推理过程缺乏反馈闭环而导致的脆弱性问题。其核心挑战在于:生成代码的模型无法直接感知编译器、渲染器或仿真器等工具链对输出结果的验证反馈,导致错误难以被发现和修正。为此,论文提出PairCoder,一种基于工具链反馈的双智能体协同编程框架,关键创新在于将代码生成与审查过程显式地嵌入工具链中——由“驾驶员”(Driver)负责编写代码,“导航员”(Navigator)则依据诊断信息、执行结果及当前产物与目标之间的可视化对比进行逐行审查,当错误持续存在时双方角色互换。该方法在17个公开基准测试中,覆盖七种来自三家厂商的模型,显著提升了所有可验证产物的生成质量(例如,Blender场景可执行率从0.20提升至0.78,TikZ编译成功率在各模型上均提高10至30个百分点),且改进效果集中于工具链提供丰富反馈且基线仍有提升空间的场景;而在反馈信息薄弱的场景下,方法表现稳定甚至轻微退化。研究证明,将配对编程机制与工具链验证相结合,是一种可靠且可扩展的面向验证的代码生成范式。
链接: https://arxiv.org/abs/2607.01883
作者: Junhao Chen,Xiang Li,Mingjin Chen,Boran Zhang,Henghaofan Zhang,Yibin Xu,Yuehan Cui,Fangsheng Weng,Fei Ma,Qi Tian,Ruqi Huang,Hao Zhao
机构: THU (清华大学); PKU (北京大学); PolyU, Hong Kong (香港城市大学); USTC (中国科学技术大学); UESTC (电子科技大学); Tongji University (同济大学); Tianjin University (天津大学); Guangming Lab (光明实验室); BAAI (北京智谱人工智能研究院)
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2026. Project Page: this https URL
Abstract:Code is the medium through which large language models generate structured artifacts: charts, scientific figures, vector graphics, CAD models, 3D scenes, and hardware designs are all produced by writing programs. In this regime single pass inference is brittle, because the compiler, renderer, or simulator that decides whether the artifact exists is invisible to the model. We present PairCoder, which grounds review in the toolchain and realizes it as two agent pair programming: a Driver agent writes the program, a Navigator agent reviews it against verification evidence (diagnostics, execution results, and renderings of the current artifact beside the target), and the two switch roles when errors persist. Across 17 public benchmarks and seven models from three vendors, PairCoder improves essentially every benchmark whose artifact is verifiable, on full official metric suites rather than execution alone (for example, Blender scene executability 0.20 to 0.78; TikZ compile rate up 10 to 30 points on every model), at 2.9 to 9.2 times single model cost (about 7 times overall). The improvements concentrate where the toolchain provides an informative oracle and the baseline leaves headroom, and the method ties or mildly regresses where the oracle is weak; we frame pair programming as a reliable recipe for verified code driven generation.
[NLP-43] SkillCoach: Self-Evolving Rubrics for Evaluating and Enhancing Agent ic Skill-Use
【速读】: 该论文旨在解决大语言模型(LLM)智能体在使用技能时因技能库中存在重叠技能而导致的可靠性和可复现性问题,以及传统以最终验证器成功为唯一评估标准所导致的评价粒度粗略、无法有效识别智能体在技能选择、执行、组合及反思等过程中的潜在缺陷。其解决方案的关键在于提出SkillCoach——一种自演化评分框架,通过从真实运行轨迹中提取基于技能的过程评分标准,从四个维度(技能选择、技能遵循、技能组合、技能驱动的反思)对智能体行为进行精细化评估,并将外部验证器结果作为独立的结果信号,从而区分过程质量与偶然的任务成功。该框架生成的演化评分标准不仅显著提升了评估精度,揭示了仅依赖最终准确率时被掩盖的失败模式,还为高质量训练轨迹的选择提供了更强的过程监督信号,优于仅依赖结果过滤的传统方法。
链接: https://arxiv.org/abs/2607.01874
作者: Jiayin Zhu,Kelong Mao,Yudong Guo,Dengbo He,Sulong Xu,Simiu Gu,Yutao Yue
机构: Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)); JD.com(京东)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Skills are becoming a reusable operational layer for LLM agents, encoding SOPs, domain rules, tool workflows, scripts, and validation routines. In realistic skill repositories, overlapping skills make reliable skill-use difficult. Final verifier success is too coarse for both evaluation and training, since an agent may pass through trial and error while selecting distractor skills, skipping required steps, composing workflows incorrectly or omitting final checks. We introduce SkillCoach, a self-evolving rubric framework for evaluating and enhancing agentic skill-use. SkillCoach derives skill-grounded process rubrics from real rollouts and evaluates trajectories along four dimensions: skill selection, skill following, skill composition, and skill-grounded reflection. It keeps the external verifier as a separate outcome signal, allowing process quality to be distinguished from accidental task success. The evolved rubrics further serve as process supervision for selecting high-quality training trajectories. Experiments show that evolved rubrics substantially improve evaluation quality, expose failures hidden by final accuracy, and provide stronger supervision signals than outcome-only filtering for enhancing agentic skill-use.
[NLP-44] Safety Targeted Embedding Exploit via Refinement
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在安全对齐(safety alignment)过程中主要基于英语数据进行训练,导致其安全机制在低资源语言及混合语言语码转换(code-switching)场景下泛化能力不足的问题。核心问题在于:当输入超出模型安全训练数据的分布时,模型会因“认知鸿沟”(epistemic gap)而自信地生成有害内容,却仍维持拒绝响应的行为模式,从而掩盖了潜在风险。为此,作者提出一种基于梯度引导的攻击方法——STEER(Safety Targeted Embedding Exploit via Refinement),其关键在于通过梯度分析识别对模型拒绝行为贡献最大的词汇,并迭代地将其翻译为低资源语言,以削弱模型的拒绝反应同时保留原始有害意图。实验表明,该方法在六种开源8B参数模型上于JailbreakBench和AdvBench基准上的攻击成功率分别达到93.0%与96.7%,显著优于随机语码转换和贪心坐标梯度(GCG)方法。此外,生成的提示具有跨模型迁移能力,在未接触目标模型的情况下仍使GPT-4o-mini的攻击成功率达35.5%,揭示出该缺陷并非特定于某一架构。研究结果表明,仅以英语为基础的安全对齐无法保障多语言输入下的安全性,因此提升多语言安全性的关键在于:在对齐阶段扩大语言覆盖范围,并引入显式的分布外(out-of-distribution)检测与拒答机制。
链接: https://arxiv.org/abs/2607.01859
作者: Joshua Adrian Cahyono
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Safety training for large language models (LLMs) is conducted predominantly in English, leaving uncertain how well safety mechanisms generalize to low-resource languages and mixed-language code-switching. We show that this creates an epistemic gap in which models confidently generate harmful responses for inputs that fall outside the distribution of their safety training. To study this phenomenon, we introduce STEER (Safety Targeted Embedding Exploit via Refinement), a gradient-guided attack that identifies words contributing most strongly to the model’s refusal behavior and iteratively translates them into low-resource languages to suppress refusal while preserving harmful intent. Across six open-source 8B-parameter models, STEER achieves attack success rates of up to 93.0% on JailbreakBench and 96.7% on AdvBench, outperforming random code-switching and Greedy Coordinate Gradient (GCG). The resulting prompts also transfer to GPT-4o-mini, achieving a 35.5% attack success rate without requiring access to the target model, suggesting that the underlying weakness is not specific to a single architecture. These findings demonstrate that safety mechanisms aligned primarily on English cannot be assumed to generalize across multilingual inputs. We argue that improving multilingual safety requires broader coverage during alignment and mechanisms that explicitly detect and abstain on out-of-distribution inputs.
[NLP-45] Non-synchronism in Global Usage of Research Methods in Library and Information Science from 1990 to 2019
【速读】: 该论文旨在解决图书馆与信息科学(Library and Information Science, LIS)领域在全球范围内研究方法差异的系统性识别与比较问题,尤其关注不同国家在研究方法使用上的异质性及其演变趋势。其解决方案的关键在于结合人工内容分析与深度学习模型,构建了一个能够自动分类研究方法的智能系统:研究团队首先对5,281篇来自81个国家、发表于国际代表性期刊的文献进行人工标注,随后基于此数据集训练并验证了深度学习模型,实现了对研究方法的自动化识别与分类。该方法使研究得以大规模、跨国家地对比分析研究方法的分布特征,揭示出各国在研究方法选择上具有独特的“方法谱系”(research profile),且尽管跨国间存在差异,但近三十年来这种差异呈现收敛趋势。这一方法不仅为理解各国学科发展路径提供了量化依据,也为推动国际学术合作与制定本土化学科发展战略提供了重要支持。
链接: https://arxiv.org/abs/2607.01833
作者: Chengzhi Zhang,Liang Tian
机构: 未知
类目: Digital Libraries (cs.DL); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:The global development of Library and Information Science (LIS) is influenced by various factors such as the economy, society, culture, discipline, tradition, and more. Consequently, the research methods of LIS vary greatly among countries. To better understand these differences, we conducted a study of 5,281 research papers from 81 countries published in internationally representative journals over the past thirty years. We manually annotated the research methods used in some articles through content analysis, and subsequently developed and trained a deep learning model for automatic classification of research methods. Using this method, we conducted a comparative analysis of the usage of research methods in different countries. Our findings reveal that there are differences in the research methods used across countries, with each country having its unique research profile and distribution of research methods. Even when investigating the same topic, research methods can differ between countries. Our study also uncovers that there are differences between the national and international distribution of research methods, these differences have decreased over the past 30 years. By highlighting the characteristics of discipline development in various countries from the perspective of research methods, our study can help guide discipline development at the national level. This study provides insights into the usage trends of research methods across different countries and highlights the unique characteristics of discipline development in each country. This information can be valuable in promoting collaboration and understanding between countries and in guiding discipline development at the national level.
[NLP-46] Pre-Flight: A Benchmark for Evaluating Large Language Models on Aviation Operational Knowledge
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在航空业务运营场景中缺乏针对航空领域特定操作知识的安全性与准确性评估问题。由于航空领域的高风险性和强监管特性,通用基准无法有效衡量模型在航空法规、机场地面运行及复杂操作情境下的推理能力,导致潜在安全隐患。为此,研究提出Pre-Flight——一个开源基准,包含300道基于国际标准(如ICAO和美国联邦航空管理局FAA规定)、机场地面运行材料及航空通用知识的多选题,由具备空管、地面运行和商业飞行经验的专业人员编写与审校。其解决方案的关键在于构建一个面向航空领域的专业化评估框架,采用Inspect评估体系对多种主流商业与开源模型进行评测,以准确率作为核心指标,并持续更新排行榜。实验结果显示,即使最强模型(2026年发布)仅达到82.7%的准确率,相较于约95%的专家参考水平仍存在显著差距,且提升缓慢。这表明当前生成式AI在非安全关键型航空操作中的可靠性和可信度仍远未达到专业要求。研究团队公开了数据集、评估工具链及结果,将其集成于inspect_evals社区评估包中,强调此类领域特定评估是实现生成式AI在航空领域负责任部署的必要前提。
链接: https://arxiv.org/abs/2607.01829
作者: Alex Brooker,Tim Hughes
机构: Airside Labs(空气侧实验室); Mahino Research(马希诺研究所)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 9 pages, 1 figure, 2 tables. Benchmark available in inspect_evals (UKGovernmentBEIS/inspect_evals)
Abstract:Large language models (LLMs) are increasingly proposed for aviation business operations, from documentation and training generation to customer facing assistants. General purpose benchmarks do not measure whether a model reasons safely and correctly about aviation specific operational knowledge, and the high stakes, regulated nature of the domain makes that gap consequential. We present Pre-Flight, an open source benchmark of 300 multiple choice questions drawn from international standards and airport ground operations material, covering international airport ground operations, ICAO and US FAA regulations, aviation general knowledge and complex operational scenarios. Questions were authored and reviewed by practitioners with experience in air traffic management, ground operations and commercial flying. We evaluate a range of contemporary commercial and open weight models using the Inspect evaluation framework, scoring by accuracy under a standard multiple choice protocol, and we maintain the leaderboard on a rolling basis as new models are released. Against an informal expert reference of around 95%, obtained from a low sample quiz of aviation professionals at a conference, even the strongest model evaluated (released in 2026) reaches 82.7%, having improved only gradually from roughly 75% in early 2025. A substantial and persistent gap below expert level reliability therefore remains. We release the dataset, the evaluation harness and the results, and the benchmark is available within the community evaluations package distributed with inspect_evals. We argue that domain specific evaluation of this kind is a necessary precondition for responsible deployment of generative AI in non safety critical aviation operations.
[NLP-47] Gender Differences in Research Topic and Method Selection in Library and Information Science: Perspectives from Three Top Journals
【速读】: 该论文旨在解决学术研究中性别差异对研究方法选择的影响问题,尤其关注在图书馆与信息科学领域内,女性与男性研究人员在方法偏好上的系统性差异。传统观点认为女性更倾向于采用定性方法,而男性则偏好定量方法,但这一现象是否独立于研究主题存在尚不明确。为此,研究引入一种基于全文认知的自动化分类模型(CogFT),并采用更为细致的方法分类体系,以控制研究主题的干扰因素。研究发现,在不同研究主题下,女性仍显著更常使用访谈法(Interview),而男性则更倾向于采用理论方法(Theoretical approach)。其解决方案的关键在于通过技术手段实现对研究方法的精准自动识别与归类,从而剥离主题影响,准确揭示性别因素在研究设计中的独立作用。该研究为理解性别差异背后的研究行为机制提供了实证依据,并提出应通过优化研究方法指导与支持体系,推动学术界的性别包容性与公平性建设。
链接: https://arxiv.org/abs/2607.01828
作者: Chengzhi Zhang,Siqi Wei,Yi Zhao,Liang Tian
机构: 未知
类目: Digital Libraries (cs.DL); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:Research in the social sciences has shown that there are gender differences in the selection of research methods, with women often opting for qualitative methods while men prefer quantitative methods. However, it is important to consider that research methods are generally chosen based on the research topic. To figure out the influence of gender on research method selection, a study was conducted in the field of Library and Information Science, using a more fine-grained method classification system and an automatic classification model called CogFT, which is based on full-text cognition. The findings showed that women tend to use Interview while men prefer Theoretical approach, across a range of topics. The study offers insights into the specific research design processes that contribute to gender differences in method selection and suggests ways to promoting gender inclusivity and equality in academia by considering research method use and guidance.
[NLP-48] On the Limits of Steering Vectors for Preference-Aligned Generation
【速读】: 该论文旨在解决生成式AI(Generative AI)中控制文本生成的可解释性与通用性之间的矛盾问题,具体聚焦于引导向量(steering vectors)在偏好对齐(preference alignment)任务中的泛化能力局限。其核心挑战在于:尽管引导向量提供了无需训练、可解释的文本生成控制机制,但其在不同属性表达、任务迁移以及多属性组合场景下的有效性尚未被充分理解。研究的关键发现是:引导向量的效能随属性类型显著变化,从正负风格样本中提取的向量在下游写作个性化任务中可能出现性能退化;此外,多种多属性组合方法均表现出随着向量数量增加,属性表达能力显著下降,且存在一致性(coherence)与表达性(expressibility)之间的权衡,需针对具体场景进行超参数调优。因此,研究揭示了引导向量作为通用偏好对齐工具的实质性限制,提示其应用需谨慎评估任务与属性适配性。
链接: https://arxiv.org/abs/2607.01802
作者: Melanie Subbiah,Zara Hall,Kathleen McKeown
机构: Columbia University (哥伦比亚大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Steering vectors have emerged as a promising approach to controlled text generation, offering interpretable, training-free mechanisms for shaping model outputs. However, their practical generality remains poorly understood. We study the limits of steering vector generalization along three dimensions: trait expressibility, task transfer, and multi-trait composition. Using the PLUME writing personalization benchmark, we extract steering vectors for a range of preferences and evaluate them on summarization and email-writing tasks across two open-source models (Qwen2.5-7B-Instruct and Llama3.1-8B-Instruct). We find that steering effectiveness varies substantially across traits. We further show that steering effectiveness can degrade when vectors extracted from positive and negative style examples are transferred to downstream writing personalization tasks. Finally, we compare common methods for composing multiple steering vectors and find that all methods suffer significant drops in trait expression as more vectors are added, with a tradeoff between coherence and expressibility that requires per-setting hyperparameter tuning. Taken together, our results suggest that steering vectors face meaningful limits as a general-purpose tool for preference alignment.
[NLP-49] Do LLM s Truly Generalize in the Molecular Domain? A Perturbation-Based Analysis
【速读】: 该论文旨在解决生成式分子大模型(Generative AI)在分子发现任务中因基于序列的表示方式导致的拓扑结构约束问题,即模型对分子结构微小变化的敏感性与化学空间内在连续性之间的矛盾。其核心挑战在于:尽管分子在化学上具有局部相似性,但现有模型在面对微小结构扰动时表现出显著性能下降,揭示了其局部信任区域狭窄且对结构变化极为脆弱。解决方案的关键在于引入一种基于图编辑距离(Graph Edit Distance, GED)的分子扰动框架,系统评估模型在可控结构变异下的表现,并提出利用上下文调优(In-Context Tuning, ICT)策略,通过锚定于结构相似分子的预测来增强模型鲁棒性。实验表明,ICT能够部分扩展模型的局部信任区域,为提升分子大模型在结构变异下的稳定性提供了有效路径。
链接: https://arxiv.org/abs/2607.01800
作者: Jiatong Li,Weida Wang,Changmeng Zheng,Shufei Zhang,Yatao Bian,Xiao-yong Wei,Qing Li
机构: The Hong Kong Polytechnic University (香港理工大学); National University of Singapore (新加坡国立大学); Shanghai AI Lab (上海人工智能实验室); Fudan University (复旦大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 21 pages
Abstract:Large Language Models (LLMs) have recently shown promise in molecular discovery, yet a gap remains between their probabilistic nature over discrete sequential tokens and the rigid topological constraints of chemical space. This raises the question of whether molecular LLMs can generalize beyond the local neighborhoods induced by their sequence-based representations. To systematically investigate this question, we introduce a Molecular Perturbation framework that generates syntax-valid structural variants of training molecules under controlled Graph Edit Distance (GED) to probe the manifold regularity of molecular LLMs. Our analysis shows that even a single edit can cause substantial performance drops on common molecular tasks, revealing a narrow local trust region and fragile sensitivity to structural changes. Since similar molecules tend to exhibit similar properties, In-Context Tuning (ICT), which anchors predictions on structurally similar molecules, offers a natural way to mitigate such fragility. Our experiments also examine whether ICT confers robustness under controlled structural perturbations, and the results suggest that it can partially expand the local trust region and offer a promising direction for stabilizing molecular LLMs against structural variation.
[NLP-50] PARTREP: Learning What to Repeat for Decoder-only LLM s
【速读】: 该论文旨在解决生成式 AI(Generative AI)中解码器仅模型(decoder-only LLMs)因因果注意力机制导致的信息流不对称问题,即生成过程中后序标记具有更强的上下文依赖性,而早期标记的上下文感知能力较弱,从而影响推理性能。其核心解决方案是提出一种名为PartRep的有选择性增强方法,通过仅重复原始提示中最富有信息量的标记(而非完整重复),实现对上下文接地(contextual grounding)在序列位置间的再平衡。该方法以逐标记负对数似然(NLL)作为选择信号,基于“可预测性较低的标记更难从上下文中恢复,因而更受益于后期重复”的假设进行筛选。为避免全前向传播带来的高昂计算开销,研究训练了一个轻量级门控网络,利用浅层隐藏状态预测高NLL标记,实现在预填充阶段中期通过早退出策略完成高效选词。实验表明,在八个基准测试(包括MMLU、GSM8K和RULER)及三个模型家族(Qwen2.5、Llama3.2、Gemma4)上,PartRep在仅消耗原全重复方案59.4%的键值缓存(KV cache)和79.0%的预填充浮点运算量(prefill FLOPs)的前提下,仍能保留绝大部分性能增益,显著提升了长上下文场景下的实用性与效率。
链接: https://arxiv.org/abs/2607.01792
作者: Andikawati P Widjaja,Yongjun Kim,Hyounghun Kim,Jaeho Lee
机构: Bandung Institute of Technology (万隆理工学院); Pohang University of Science and Technology (浦项科技大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 15 pages and 7 figures (including appendix)
Abstract:While decoder-only LLMs excel at a vast array of natural language tasks, it suffers from an asymmetric information flow induced by causal attention: later tokens are richer in contextual grounding than earlier ones. A simple and effective remedy is prompt repetition – just appending a second copy of prompt before generation can redistribute grounding across positions and improve reasoning performance. However, full repetition of the original prompt doubles the KV cache footprint and quadruples attention cost during prefill, making it impractical for long-context settings. We propose PartRep, a selective augmentation method that appends only the most informative tokens – rather than the entire prompt. We use token-wise negative log-likelihood (NLL) as a selection signal, motivated by the hypothesis that less predictable tokens are less recoverable from surrounding context and therefore benefit more from late-position repetition. To avoid the heavy cost of a full forward pass for scoring, we train a lightweight gate that predicts high-NLL tokens from early-layer hidden states, enabling token selection during mid-prefill via early exit. Across eight benchmarks (including MMLU, GSM8K, and RULER) and three model families (Qwen2.5, Llama3.2, Gemma4), PartRep retains most of the gains of full repetition while using only 59.4% of its KV cache and 79.0% of its prefill FLOPs.
[NLP-51] Subliminal Clocks: Latent Time Modelling in Diffusion Language Models
【速读】: 该论文旨在解决生成式语言模型中扩散模型(Diffusion Language Models, DLMs)的内部工作机制问题,特别是其在无显式时间步条件下的去噪进度表征与利用方式。传统扩散模型依赖显式的时刻(timestep)条件以控制去噪过程,而DLMs虽未显式引入时间步信息,但其内部是否隐含了去噪进度的表示仍不明确。本文的关键发现是:DLMs在残差流(residual streams)中编码了与扩散时间步相关的潜在表示。通过跨层探针(probes)可可靠地提取该信号,表明去噪进度可从模型内部激活值中解码。进一步地,研究提出通过操控与推断出的时间步相关联的低维子空间,能够系统性调节模型对去噪进度的感知,从而实现对模型置信度和熵的可控变化。此外,该表示在激活空间中呈现出结构化且可解释的几何特性,揭示了此类信号在模型中的处理机制。因此,该研究的核心解决方案在于识别并利用隐藏在模型内部的去噪进度表示,为理解及调控非自回归扩散语言模型提供了可解释性基础与干预手段。
链接: https://arxiv.org/abs/2607.01774
作者: Maximo Rulli(1),Thomas Fontanari(1),Simone Petruzzi(1),Federico Alvetreti(1),Giorgio Strano(1),Donato Crisostomi(1),Giorgos Nikolaou(2),Tommaso Mencattini(2),Andrea Santilli(3),Emanuele Rodolà(1),Simone Scardapane(1),Alessio Devoto(3) ((1) Sapienza University of Rome, (2) EPFL, (3) Independent researcher)
机构: Sapienza University of Rome (罗马大学); EPFL (洛桑联邦理工学院); Independent researcher (独立研究员)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Equal contribution: Thomas Fontanari and Simone Petruzzi
Abstract:Diffusion Language Models (DLMs) have recently emerged as a promising alternative to autoregressive models. Unlike standard diffusion-based approaches, DLMs are not explicitly conditioned on a timestep, raising a natural question: do these models internally represent denoising progress, and how is such information used downstream? In this work, we show that DLMs do in fact encode a latent representation related to the diffusion timestep within their residual streams. We find that this signal can be reliably extracted using probes across layers, indicating that denoising progress is decodable from internal activations. We further demonstrate that steering the model along a low-dimensional subspace associated with the inferred timestep allows us to systematically modulate its notion of denoising progress, leading to predictable changes in model confidence and entropy. Finally, we analyse the geometry of the identified representation, showing that it exhibits structured and interpretable properties in activation space, and shedding light on how such a signal is processed by these models.
[NLP-52] Denser neq Better: Limits of On-Policy Self-Distillation for Continual Post-Training
【速读】: 该论文旨在解决基础模型在持续后训练(continual post-training)过程中面临的知识遗忘问题,尤其关注自蒸馏(self-distillation)策略在维持已有能力的同时引入新知识的有效性。其核心问题是:尽管基于策略的自蒸馏方法(如SDPO)在特定条件下可加速领域内专业化,但其在分布外(out-of-distribution)场景下的泛化能力不足,且易引发严重遗忘甚至模型崩溃。解决方案的关键在于揭示了密集自蒸馏会加剧参数空间与响应空间中的漂移,并通过教师-学生循环放大高频格式化伪影,从而导致稳定性下降。研究进一步表明,仅依赖在线策略数据不足以支撑持续学习,应谨慎使用自蒸馏作为默认的稳定机制;相比之下,基于强化学习的在线策略方法(如GRPO)更具保守性,能更有效地保留先前能力。因此,该研究强调需权衡自蒸馏的加速优势与潜在稳定性风险,提出应结合多源监督与动态调控以实现更鲁棒的持续学习。
链接: https://arxiv.org/abs/2607.01763
作者: Meng Wang,Haohan Zhao,Wenzhuo Liu,Lu Yang,Geng Liu,Haiyang Guo,Guo-Sen Xie,Gaofeng Meng,Hongbin Liu,Fei Zhu
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Continual post-training enables foundation models to acquire new knowledge while preserving existing capabilities. Recent work suggests that on-policy learning can mitigate forgetting, with on-policy self-distillation emerging as a particularly attractive approach. In this work, we revisit this optimistic view through self-distillation policy optimization (SDPO). Our experiments show that SDPO can accelerate in-domain specialization when teacher signals are stable and well aligned, but it struggles to generalize to out-of-distribution scenarios. In continual post-training, SDPO exhibits stronger forgetting and can even collapse, whereas on-policy reinforcement learning methods such as GRPO adapt more conservatively and better preserve prior capabilities. Further analyses reveal that denser self-distillation induces larger drift in both parameter space and response space, and can amplify high-frequency formatting artifacts through a self-reinforcing teacher–student loop. These findings suggest that on-policy data alone is insufficient for continual learning. Dense self-distillation can accelerate specialization when teacher targets are stable and token-level supervision is reliable, but it should not be treated as a default stabilizer for continual post-training. Our code is available at this https URL.
[NLP-53] Rethinking Speech-LLM Integration for ASR: Effective Joint Speech-Text Training by Interleaving
【速读】: 该论文旨在解决生成式语言模型(LLM)与自动语音识别(ASR)融合过程中,文本先验知识在大规模标注语音数据下贡献减弱、以及简单联合训练未能充分挖掘文本语义信息的问题。其核心解决方案是提出一种面向ASR任务的语音-文本交错预训练(Joint Speech-Text Interleaved Pretraining, JSTIP)策略,通过在对齐的语音-文本对中构建词级与片段级交错的连续输入序列,增强语音-文本模态间的交互与对齐。该方法有效缓解了语音与文本之间的模态鸿沟,保留了大语言模型的生成先验能力,并显著提升了医学实体识别等任务的性能,尤其在零样本语音问答场景中表现出更强的泛化能力,验证了交错结构对提升语音理解效果的关键作用。
链接: https://arxiv.org/abs/2607.01733
作者: Ruchao Fan,Yiming Wang,Rui Zhao,Liliang Ren,Keqi Deng,Xiaoyang Chen,Ali Zare,Bo Ren,Yuxuan Hu,Junkun Chen,Yan Huang,Yelong Shen,Jinyu Li
机构: Alibaba Group (阿里巴巴集团); Zhejiang University (浙江大学)
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:
Abstract:Speech-LLM integration has shown promising results by leveraging extensive textual pretraining, yet its specific benefits for automatic speech recognition (ASR) remain unclear. We observe that as supervised ASR training data increases, the contribution of LLM priors becomes less evident, and simple speech-text joint training under-utilizes textual knowledge. We therefore propose Joint Speech-Text Interleaved Pretraining (JSTIP), an ASR-oriented pretraining strategy that constructs word-level and segment-level interleaved speech-text sequences within aligned pairs for speech-LLM architectures that accept continuous inputs. Experiments on 38k hours of ASR data show consistent entity accuracy improvement compared to ASR-only and joint speech-text training baselines. JSTIP achieves on-par entity recognition performance using domain transcription text compared to synthetic speech-text pairs, simplifying domain adaptation. Benefiting from textual pretraining and domain text data, JSTIP is competitive with open-source ASR and Speech-LLM systems in medical entity recognition. The zero-shot speech question answering behaviors further suggest that interleaving reduces the speech-text modality gap and preserves the LLM generative prior, which is likely the reason for the entity improvements on the ASR task.
[NLP-54] Beyond Pixel Diffs: Benchmarking Image Change Captioning for Web UI Visual Regression Testing
【速读】: 该论文旨在解决视觉回归测试(Visual Regression Testing, VRT)中因依赖像素级对比而产生的大量误报问题,此类方法对渲染噪声与真实缺陷无差别处理,导致测试人员需耗费大量时间手动审查无关差异。其核心挑战在于缺乏能够准确描述网页用户界面(Web UI)图像变化内容的自然语言生成能力,现有工具虽引入机器学习但缺乏公开评估体系。为此,论文提出新任务——网页用户界面图像变化字幕生成(Web UI Image Change Captioning, WUICC),并构建首个针对该任务的数据集与基准测试平台WUICC-bench。解决方案的关键在于通过结合图像差异检测与自然语言生成技术,使系统不仅能识别视觉变化,还能以自然语言形式精准描述变化内容(如“按钮颜色由蓝色变为红色”),从而显著降低误报率,并为后续面向特定领域优化的模型研究奠定基础。
链接: https://arxiv.org/abs/2607.01728
作者: Licheng Zhang,Bach Le,Pengtao Zhao,Naveed Akhtar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Software Engineering (cs.SE)
备注:
Abstract:Visual regression testing (VRT) is a standard quality assurance step in modern software release pipelines. On every change, it re-renders user interface (UI) screenshots, compares each one against an approved baseline image, and routes any detected difference to a human reviewer who decides whether it is an intended update or an unintended regression. A widely used approach, especially in open-source and continuous-integration pipelines, is pixel-level comparison, which is semantically blind and treats rendering noise and genuine defects identically, producing large volumes of false positives that force developers and testers to spend substantial time and effort manually reviewing flagged differences at every release cycle. Industry tools apply machine learning to VRT, but lack public evaluation. More critically, no dataset or benchmark exists to support natural language descriptions of UI changes, a capability that tells testers what changed in words instead of leaving them to interpret a binary flag or a highlighted region. To address the gap, we propose a new task, Web UI Image Change Captioning (WUICC), which sits at the intersection of VRT and image difference captioning (IDC), and release WUICC-bench, its first dataset and benchmark for the task. We evaluate eleven representative IDC methods, together with two zero-shot general-purpose LLMs. We find that: (1) these methods tend to struggle in the Web UI domain due to its layout diversity, dense text, and fine-grained changes, and (2) yet the trained methods already suppress non-meaningful visual noise far more selectively than the pixel-level comparison VRT relies on, providing a solid foundation for future domain-specific research.
[NLP-55] When Does Generating More Help? Disentangling Fixed-Source Synthesis from Source Expansion in Synthetic Data Scaling
【速读】: 该论文旨在解决生成式数据(Synthetic Data)规模化过程中,因源数据扩展(Source Expansion, SE)与固定源合成(Fixed-Source Synthesis, FSS)混杂导致的可比性问题。现有研究普遍在数据量增加时同步扩展源数据,从而混淆了两种不同规模路径的独立影响,致使FSS这一关键路径长期被忽视。本文通过固定种子问题池和教师模型,仅调整拒绝采样(Rejection Sampling, RS)中每题的响应预算,系统地分离并研究FSS路径。其解决方案的关键在于提出一种修正后的缩放定律(rectified scaling law),该定律基于重复采样对固定源的覆盖机制推导而来,并在低预算条件下拟合后,能准确预测高预算下的性能表现,适用于所有测试的师生模型对。实验表明,在总样本预算相同的情况下,小规模时SE与FSS性能相当,但大规模时增加种子问题(即SE)优于同等预算下增加响应数量;而在FSS框架内,无论是否从已有种子生成新问题或改变合成协议,均无法超越单纯使用RS的方法。因此,FSS被视为一个有界且可控的缩放轴,为不同合成策略的公平比较提供了基准设置。
链接: https://arxiv.org/abs/2607.01727
作者: Xu Guo,Jian Tong,Zhihui Lu,Qipeng Guo
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Synthetic data can be scaled along two routes: Source Expansion (SE), which enlarges the source by adding seed materials or generators, and Fixed-Source Synthesis (FSS), which holds the source fixed and scales the generation budget. Existing scaling studies typically expand the source as the data grows, conflating SE with FSS and leaving FSS underexplored. We isolate FSS by holding the seed-question pool and teacher model fixed, varying only the per-question response budget under Rejection Sampling (RS). We adapt the rectified scaling law to FSS, deriving it from how repeated sampling covers a fixed source. Empirically, the derived form, fit on low budgets, predicts performance at the held-out highest budget for every evaluated teacher–student pair. At matched total-sample budgets, SE and FSS are comparable at small budgets; at large budgets, adding seed questions outperforms spending the same budget on more responses. Within FSS, however, neither synthesizing additional questions from the existing seeds nor varying the synthesis protocol outperforms plain RS at matched budgets. FSS is thus a bounded scaling axis and a controlled setting for comparing synthesis protocols. We will release our code and data to facilitate further research.
[NLP-56] Epistemic Goggles: A Pretrained Module that Induces an Epistemic Frame via Gradient Editing
【速读】: 该论文旨在解决大语言模型在基于显式标注为虚构内容的文档进行微调时出现的“否定忽视”(Negation Neglect)问题,即模型虽知晓文本被标注为虚构,仍会错误地相信其核心主张的真实性。现有方法依赖于修改训练数据中的标注,但效果有限。本文提出Goggles——一种基于学习的梯度干预模块,其关键在于不修改原始数据,而是在监督微调过程中对大型语言模型(LLM)的低秩适配(LoRA)梯度进行动态修正,从而注入特定的认识论框架(epistemic frame),即模型对所读内容真实性的立场。该模块一次性训练完成后可冻结使用,适用于未参与训练的新文档。实验表明,在无虚构标注的相同文档上,经Goggles处理后的模型能将虚构内容识别率提升至约91%,同时保持甚至优于基线的推理能力(如GPQA和TruthfulQA表现)。此外,Goggles支持多种框架,例如将文档视为红木研究(Redwood Research)安全评估的一部分,而非单纯虚构。该框架在持续微调中表现出强鲁棒性,能够抵抗反向偏移。Goggles为在已知存在认知偏差的数据上训练模型提供了有效路径,避免模型吸收不良行为,同时保留其知识与能力。
链接: https://arxiv.org/abs/2607.01690
作者: Joshua Penman
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 20 pages, 10 figures, 2 tables. Code at this https URL and generated documents, questions, and teacher rollouts at this https URL
Abstract:Finetuning a language model on documents that are explicitly annotated as fictional results in a model that still actually believes the documents’ core claims, an effect known as Negation Neglect. In our evaluations, models trained on documents prefixed and suffixed with such annotations correctly identify the relevant claims as fictional only about 9% of the time. To address this, we introduce Goggles, a learned module that intervenes on the finetuning gradient rather than the data. During supervised finetuning, a Goggles module edits the gradients an LLM LoRA receives, imparting a chosen epistemic frame (the stance the model takes toward the nature of what it reads) to whatever the documents teach. A Goggles instance is trained once for a given base model, frame, and LoRA configuration, then applied frozen to documents it was never trained on. Trained through Goggles on those same documents, now carrying no fictional annotation, the model flags the content as fictional roughly 91% of the time, while preserving capability (GPQA and TruthfulQA match or exceed baseline). The same architecture supports other frames: a Goggles instance can be trained to treat documents as “part of an AI safety evaluation by Redwood Research” rather than simply as fiction. The imparted frame persists under continued finetuning that pushes back toward the claim, where prior interventions revert. Goggles suggests a path toward training language models on known-misaligned data without absorbing the behaviors that data demonstrates.
[NLP-57] Agent icDataBench: A Comprehensive Benchmark for Data Agents
【速读】: 该论文旨在解决当前生成式数据智能(Agentic Data Science)领域缺乏全面、细粒度且覆盖多样场景的评估基准这一关键问题。现有研究虽已探索基于大语言模型(Large Language Model, LLM)的数据代理(Data Agents)以自动化数据科学工作流,但缺乏能够系统评估其在真实复杂任务中表现的标准化评测体系。为此,本文提出AgenticDataBench,其核心解决方案在于构建一个涵盖15个垂直领域的综合性基准,包含来自真实世界业务场景(如某头部金融科技公司的5个B2B案例)的高质量任务,并通过引入“数据科学技能”(Data Science Skills)作为核心建模单元,对任务进行结构化分解与量化覆盖。该方法基于大规模Stack Overflow任务解答数据,采用技能对齐的层次聚类提取代表性技能,从而确保任务设计的可复现性与多样性;同时,针对缺乏真实任务的领域,提出一种基于LLM的系统性任务生成框架,依据所定义技能生成符合实际操作模式的工作流。最终,该基准支持在细粒度技能层级上对先进数据代理进行评估,为模型性能分析提供深入洞察。
链接: https://arxiv.org/abs/2607.01647
作者: Zhaoyan Sun,Shan Zhong,Daizhou Wen,Jiaxing Han,Guoliang Li,Ying Yan,Peng Zhang,Yu Su,Xiang Qi,Baolin Sun,Chengyuan Yang,Tao Fang,Huaiyu Ruan
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Data science aims to derive actionable insights from heterogeneous raw data, unlocking the value of the massive amounts of data generated in modern society. Automating this process is essential to reducing labor-intensive efforts for data scientists and enabling scalable data-driven applications. Recently, large language model (LLM)-based data agents have emerged as a promising solution to automate data science workflows. However, the field lacks comprehensive benchmarks to rigorously evaluate these agents across diverse scenarios with fine-grained granularity. To address this gap, we propose AgenticDataBench, a comprehensive benchmark featuring realistic tasks spanning diverse domains with fine-grained ground-truth labels. This enables evaluations to capture the diversity and complexity of data science workflows and the detailed performance of agents. First, to cover diverse domains, we collect real datasets and tasks from 15 vertical domains, including 5 real-world B2B use cases from a leading fintech company. Second, to remove redundancy in real-world tasks and generate high-quality tasks for domains lacking real data, we introduce data science skills, recurring data-centric operational patterns, and quantify benchmark coverage by the number of skills included. Representative skills are extracted from large-scale task solutions on Stack Overflow using skill-aligned hierarchical clustering. Third, for real-world business tasks, we select task-solution pairs that maximize diversity in skill composition, ensuring broad coverage of practical scenarios. Fourth, to generate realistic tasks for devise domains without real tasks, we propose a systematic LLM-based task generation approach to create workflows and tasks based on these skills. Finally, we evaluate state-of-the-art data agents using our annotated benchmark and open-sourced testbed, providing detailed skill-level insights.
[NLP-58] ProWAFT: A ROMA-LPD Instance for Workload-Aware and Dynamic Fault Tolerance in FPGA-Based CNN Accelerators
【速读】: 该论文旨在解决基于SRAM的现场可编程门阵列(FPGA)在边缘计算场景下进行卷积神经网络(CNN)推理时,因瞬态故障(transient faults)导致的隐性错误(silent errors)对系统可靠性造成的威胁。现有方法如全三模冗余(full TMR)虽能提升正确性但带来显著的性能与能耗开销,而反应式恢复机制可能引入关键路径上不可接受的延迟。为此,论文提出一种主动式、工作负载感知的容错框架ProWAFT,其核心在于利用部分重配置(partial reconfiguration)技术,在可重构分区间选择性地应用三模冗余(TMR),以实现资源与容错策略的动态优化。ProWAFT通过量化任务的工作负载关键性,建模故障传播特性及重配置开销,并基于延迟、能耗与可靠性风险的综合目标函数进行配置决策。实验在Xilinx Zynq UltraScale+ ZCU104平台(含六个可重构区域)上完成,基于ResNet-18、MobileNetV2和EfficientNet-Lite生成的500个任务轨迹,在时变单粒子翻转(SEU)注入条件下验证表明,ProWAFT在保持高任务成功率与接近基线吞吐率的同时,相较静态TMR与反应式重配置方案实现了更低的综合成本,且在线决策开销极低。
链接: https://arxiv.org/abs/2607.01602
作者: Xinxin Chen,Haoran Qiao,Yiming Guo,Kecheng Luo,Siyuan Feng,Jingwen Ma
机构: University of Chinese Academy of Sciences (中国科学院大学)
类目: Computation and Language (cs.CL)
备注: 13 pages
Abstract:SRAM-based FPGAs provide an attractive platform for energy- and latency-constrained CNN inference at the network edge, yet transient faults can lead to silent errors that compromise reliability. Always-on redundancy (e.g., full TMR) improves correctness but incurs substantial performance and energy overhead, while reactive recovery may introduce unacceptable latency on the critical path. We propose \textbfProWAFT, a proactive workload-aware fault-tolerance framework for FPGA-based CNN accelerators that uses partial reconfiguration to selectively apply TMR across reconfigurable partitions. ProWAFT quantifies workload criticality, models fault propagation and reconfiguration overhead, and selects configurations that minimize a composite objective over latency, energy, and reliability risk. Implemented on a Xilinx Zynq UltraScale+ ZCU104 platform with six reconfigurable regions and evaluated on a 500-task trace derived from ResNet-18, MobileNetV2, and EfficientNet-Lite under time-varying SEU injection, ProWAFT achieves lower composite cost than static TMR and reactive reconfiguration while maintaining high task success rate and near-baseline throughput with low online decision overhead.
[NLP-59] BOUNDARY_SYNC: Measuring Communication-Induced Representational Coupling in Multi-Agent LLM Systems
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)作为通信代理在协作过程中,因相互交流而导致输出趋于一致(即同质化)的问题。其核心问题是:当多个LLM代理通过文本或图像进行通信时,它们的表示空间是否会因交互而发生可测量的耦合效应?解决方案的关键在于提出一种名为BOUNDARY_SYNC的协议,通过耦合放大因子(Coupling Amplification Factor, CAF = JSD_cond / JSD_baseline)量化表示耦合程度,其中CAF < 1 表示同质化,CAF > 1 表示多样化。研究通过控制实验(N=30,约9,900次API调用)验证了文本与图像通信均导致显著同质化(如文本通信中CAF=0.803),并发现群体规模(K=5时同质化,K=3时出现多样化趋势)和通信模态对耦合方向具有调节作用,且耦合行为由提示上下文驱动而非累积更新,呈现无状态特性。该研究首次实证表明LLM代理间的耦合是可测量、可调控的,为多智能体系统的设计提供了关键理论依据与干预路径。
链接: https://arxiv.org/abs/2607.01600
作者: Zewen Liu
机构: Qilu Institute of Technology (齐鲁理工学院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 18 pages, 3 figures, 2 tables
Abstract:As large language models (LLMs) are deployed as communicating agents, does inter-agent communication cause outputs to converge? We introduce BOUNDARY_SYNC, a protocol measuring representational coupling via the Coupling Amplification Factor (CAF = JSD_cond / JSD_baseline), where CAF 1 indicates homogenization and CAF 1 indicates diversification. In controlled GPT-4o experiments (N=30, ~9,900 API calls), we measure coupling in text and image communication. Key findings: (1) text communication causes significant homogenization (CAF=0.803 [0.740, 0.873], d=1.30, p0.001), confirmed by no-communication ablation and prompt-perturbation controls; (2) image communication also homogenizes under within-modality baselines (CAF=0.834 [0.811, 0.858]), with comparable proportional effect; (3) group size moderates coupling direction – K=5 produces homogenization while K=3 yields CAF 1.0 (point estimates 1.14 and 1.06, CI pending), suggesting a directional shift toward diversification; (4) cross-model replication shows extreme variation (CAF 0.034-0.803), with DeepSeek dominated by format artifacts; (5) coupling is stateless – driven by prompt context rather than cumulative updating, with continuous consensus producing monotonic convergence. These results establish LLM agent coupling as real, measurable, and controllable at the prompt level, with direct implications for multi-agent system design.
[NLP-60] Safe and Adaptive Cloud Healing: Verifying LLM -Generated Recovery Plans with a Neural-Symbolic World Model
【速读】: 该论文旨在解决云原生AI系统在规模与复杂性持续增长背景下,实现快速故障检测与自适应恢复以保障服务可靠性的关键挑战。现有方法虽融合了大语言模型(LLM)进行语义理解及深度强化学习(DRL)优化策略,但普遍采用串行、松耦合的架构,未能充分发挥LLM在生成与推理方面的潜力。本文提出PASE(Planning-Aware Semantic self-healing engine),一种将故障自愈重构为神经符号程序合成任务的新范式。其核心创新在于构建了一个紧密集成的“推理—规划—验证—适应”闭环:以大语言模型(LLM)作为核心计划生成引擎,从语义原子库中生成结构化恢复计划;基于神经符号世界模型(Neural-Symbolic World Model)通过仿真验证计划可行性;同时引入经由深度强化学习训练的元提示优化器(Meta-Prompt Optimizer),动态生成最优提示以引导LLM的规划过程。该机制实现了超越预定义动作空间的动态、上下文感知恢复策略生成。在真实云环境故障注入数据集上的实验表明,PASE显著优于现有先进方法,平均恢复时间降低超过40%,并在未知故障场景下提升了故障检测精度。本框架通过统一基于大语言模型的推理能力、模型辅助的验证机制与元学习驱动的提示优化,推动了自主系统管理向更智能、更自适应的方向演进。
链接: https://arxiv.org/abs/2607.01595
作者: Junyan Tan,Haoran Lin,Siyuan Guo,Yichen Fang,Xinyue Luo,Tianyu Shen,Zeyu Qiao
机构: Zhejiang University (浙江大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 13 pages
Abstract:As the scale and complexity of cloud-based AI systems continue to escalate, ensuring service reliability through rapid fault detection and adaptive recovery has become a critical challenge. While existing approaches integrate Large Language Models (LLMs) for semantic understanding and Deep Reinforcement Learning (DRL) for policy optimization, they often rely on sequential, loosely coupled architectures that underutilize the generative and reasoning capabilities of LLMs. In this paper, we propose a paradigm shift with PASE, a Planning-Aware Semantic self-healing engine, a novel fault self-healing framework that reconceptualizes recovery as a neuro-symbolic program synthesis task. PASE employs an LLM as a core Plan Synthesis Engine to generate structured recovery plans from a library of semantic primitives. A Neural-Symbolic World Model verifies plan feasibility through simulation, while a Meta-Prompt Optimizer, trained via DRL, learns to generate optimal prompts that guide the LLM’s planning process. This tight reason-plan-verify-adapt loop enables dynamic, context-aware recovery strategy generation beyond predefined action spaces. Experiments on a real-world cloud fault injection dataset demonstrate that PASE significantly outperforms state-of-the-art methods, reducing average system recovery time by over 40% and improving fault detection accuracy in unknown fault scenarios. Our framework advances autonomous system management by unifying LLM-based reasoning with model-assisted verification and meta-learned guidance.
[NLP-61] ADVENT: LLM -Driven Automatic Predicate Invention for ILP
【速读】: 该论文旨在解决归纳逻辑编程(Inductive Logic Programming, ILP)中的谓词发明(Predicate Invention, PI)瓶颈问题,即在缺乏先验知识的情况下自动发现有意义的新谓词以扩展假设空间。现有方法依赖领域专家经验,生成的谓词语义不透明,难以适应陌生领域或实现跨任务复用。其解决方案的关键在于提出ADVENT框架,该框架结合大语言模型(Large Language Model, LLM)的溯因生成能力与Prolog的演绎验证机制,形成一个迭代优化循环:LLM基于结构化关系数据识别隐含模式并生成候选谓词,具体执行结果反馈至LLM用于进一步精炼;同时,所发明的谓词及其学习规则被存入知识池,支持跨任务复用。实验表明,在七种不同大语言模型上对九个扑克牌型概念进行测试,仅使用ILP时完全失败,而引入LLM驱动的谓词发明后成功率提升至58%,经形式化验证后达80%,且通过知识池复用可带来最高+31个百分点的性能增益,同时生成可解释的人类可读规则。这表明ADVENT为自动化谓词发明及实现ILP中跨任务知识共享提供了可行路径。
链接: https://arxiv.org/abs/2607.01585
作者: Tingting Yu,Pei-Cing Huang,Chan Hsu,Chan-Tung Ku,Yihuang Kang
机构: National Sun Yat-Sen University (国立中山大学)
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Predicate invention (PI), the creation of new predicates to extend the hypothesis space, remains a critical bottleneck in Inductive Logic Programming (ILP). Existing methods rely on domain expertise and produce semantically opaque predicates, hindering adaptation to unfamiliar domains and cross-task reuse. We present ADVENT, an LLM-driven PI mechanism for ILP. ADVENT pairs LLM abductive generation with Prolog deductive verification, forming an iterative loop in which concrete execution results guide the LLM to refine candidate predicates. The mechanism leverages Large Language Models to identify implicit patterns in structured relational data and invent auxiliary predicates with meaningful names and definitions. Invented predicates and learned rules accumulate in a knowledge pool for cross-task reuse. Experiments on nine poker-hand concepts across seven LLMs show that LLM-driven PI achieves 58% success rate where ILP alone fails entirely, formal verification raises this to 80%, and the knowledge pool yields gains up to +31 percentage points, while producing human-interpretable rules. These results suggest that ADVENT offers a promising direction for automating predicate invention and enabling cross-task knowledge reuse in ILP.
[NLP-62] Beyond Skepticism: Evaluating LLM s Pedagogical Intent Reasoning with the Adaptive Pedagogical Vigilance Framework
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在教学沟通中对教学意图(pedagogical intent)推理能力不足的问题,尤其聚焦于翻译教学等教育领域。现有模型普遍缺乏对教学内容背后深层教育目标与策略的识别能力,导致其在处理具有明确教学目的的语篇时表现不佳。为此,论文提出一种名为自适应教学警觉性(Adaptive Pedagogical Vigilance, APV)的新型计算框架,其核心创新在于将沟通警觉性重新定义为一种通过意图推断优化学习效果的自适应机制。APV的关键在于构建一个贝叶斯教学意图推断引擎(Bayesian Pedagogical Intent Inference Engine, PIIE),该引擎能够建模教师如何选择内容以最大化教学效用,并使学习者逆向推理出潜在的教学配置——包括体裁(genre)、立场(stance)和激励机制(incentives)。通过三级评估体系(区分教学体裁、推理结构化教学设置、泛化至真实教育语料),实验表明APV显著提升了模型的警觉性表现,在区分教学性内容与非教学性内容方面达到最优性能,且与人类判断高度相关(r=0.958),在自然语料上亦保持鲁棒性,优于基线方法。该研究建立了统一的评估与增强框架,推动了更可靠的人工智能辅助学习系统的发展。
链接: https://arxiv.org/abs/2607.01581
作者: Minghao Chen,Ruihan Zhou,Jiayi Tang,Zihan Xu,Bowen Huang,Yuxin Liu
机构: Zhejiang University(浙江大学)
类目: Computation and Language (cs.CL)
备注: 22 pages
Abstract:The capacity of Large Language Models (LLMs) to reason about pedagogical intent within instructional communication remains underexplored, particularly in educational domains such as translation pedagogy. To address this, we propose the \textbfAdaptive Pedagogical Vigilance (APV) framework, a novel computational formalism that reframes communicative vigilance as an adaptive mechanism for optimizing learning through intent inference. APV formalizes the problem via a Bayesian Pedagogical Intent Inference Engine (PIIE), which models how instructors select content to maximize pedagogical utility and how vigilant learners should inversely reason about latent instructional configurations – encompassing genre, stance, and incentives. We evaluate APV through a three-tier hierarchy: distinguishing instructional genre, reasoning about structured pedagogical setups, and generalizing to authentic educational discourse. Experiments on leading LLMs (e.g., GPT-4o, Claude 3.5) show that APV substantially improves model vigilance. It achieves the strongest discrimination between pedagogical and exposure-based content, correlates highly with human judgments ( r=0.958 ), and maintains robust performance on naturalistic data where baseline methods degrade. This work establishes a unified framework for assessing and enhancing LLMs’ understanding of pedagogical motives, advancing the development of more reliable AI-assisted learning systems.
[NLP-63] DiPS: Dialogue Policy Selection for High-Stakes Persuasion Agents SIGDIAL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在高风险情境下说服能力不足的问题,尤其针对个体用户因性格差异与关切点不同而需要个性化说服策略的挑战。传统的一刀切式对话策略难以有效应对复杂的人类心理与行为反应。为此,研究聚焦于火灾救援这一高风险说服场景,提出一种基于Q-learning的动态对话策略选择框架——对话策略选择(Dialogue Policy Selection, DiPS)。其核心创新在于构建一个以提升撤离成功率为目标的评判器(critic),该评判器能够根据对话过程中居民的最新反馈动态评估并选择最优说服策略。通过在模拟环境与真实人类交互中对比多个基线方法,实验结果表明,DiPS显著优于零样本大语言模型及通用检索增强生成(RAG-augmented)方法,在提升撤离成功率方面表现更优。
链接: https://arxiv.org/abs/2607.01557
作者: Tianyi Zhang,Mousumi Das,Abrar Anwar,Jesse Thomason,David Traum
机构: University of Southern California (南加州大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Proceedings of the 27th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL 2026)
Abstract:Large Language Models (LLMs) often struggle with persuasion in high-stakes scenarios. People’s individual personalities and concerns require tailored strategies rather than a one-size-fits-all approach. To address this challenge, we focus on a fire-rescue scenario in which an operator must persuade a resident to evacuate as a high-stakes persuasion domain and propose Dialogue Policy Selection (DiPS), a Q-learning framework to dynamically select persuasion strategies adapted to the evolving conversational context. Specifically, we train a critic, trained to maximize the chance of evacuation success, to select a persuasion policy at each turn based on the resident’s recent this http URL then evaluate DiPS against multiple baselines in both simulated and real human interactions. We find that DiPS achieves higher evacuation success than a zero-shot LLM and generic RAG-augmented approach.
[NLP-64] Can Language Models Actually Retrieve In-Context? Drowning in Documents at Million Token Scale
【速读】: 该论文旨在解决生成式语言模型(Language Models, LMs)在大规模语料库上进行上下文内检索(in-context retrieval)时面临的关键挑战,即如何实现高效、准确且可扩展的检索性能。传统向量检索方法依赖于密集向量表示与相似度计算,而生成式模型则尝试通过在上下文中的语料条件生成答案,但此前研究多集中于小规模任务或专有系统,缺乏对百万级词元规模及长序列外推能力的系统性探索。本文提出的核心解决方案是设计一种名为BlockSearch的0.6B参数量的语言模型检索器,其通过架构改进与训练策略优化,显著提升了对长文本的长度泛化能力(最长可达训练规模的10倍)。然而,当面对极端外推时,检索性能仍会崩溃,作者分析发现根本原因在于“注意力稀释效应”(attention dilution effect):随着语料库增大,无关文档在注意力softmax分母中占据主导地位,导致真实相关文档的归一化注意力权重被稀释,即使其原始得分较高也无法有效激活。为此,论文引入了长度感知的注意力softmax调整机制和文档级别的稀疏注意力结构,有效缓解了该问题。实验表明,在百万级词元规模下,该模型在MS MARCO、NQ等主流基准上达到与密集检索相当甚至更优的表现,且在仅需7倍更小模型的情况下超越同期模型MSA;尤其在强调非传统相似性度量的任务(如LIMIT)中,其性能提升达3倍以上。综上,本工作确立了上下文内检索作为经典检索范式的有力替代方案,并揭示了在极端上下文增长条件下对注意力机制进行精细化控制的重要性。
链接: https://arxiv.org/abs/2607.01538
作者: Siddharth Gollapudi,Nilesh Gupta,Prasann Singhal,Sewon Min
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Language models (LMs) raise an intriguing alternative to vector-based retrieval: conditioning on an in-context corpus and directly generating a relevant answer. However, prior work has largely focused on proprietary systems or the smaller-scale reranking task, leaving corpus-scale in-context retrieval largely unexplored. In this work, we present the first systematic study of in-context retrieval on two scales practical retrievers demand: million-token corpora and length-generalization far beyond training-time sizes. We first introduce BlockSearch, a 0.6B LM retriever whose architectural and training modifications improve over prior LM baselines and length-generalize up to 10 times beyond its training regime. Nevertheless, retrieval still collapses under more extreme extrapolation. We trace this failure to an attention dilution effect: as the corpus grows, irrelevant documents dominate the softmax denominator, reducing the normalized mass on the gold document even when its pre-softmax score stays high. Motivated by this analysis, we introduce length-aware adjustments to the attention softmax and document-level sparse attention. With these modifications, at the million-token scale, our model matches dense retrieval on widely studied benchmarks (e.g, MS MARCO and NQ), while outperforming the concurrent model MSA despite being 7 times smaller. Furthermore, it significantly outperforms dense retrieval on tasks requiring entirely different notions of similarity, such as LIMIT, achieving a 3 times higher score. Together, our results position in-context retrieval a promising alternative to classical retrieval while emphasizing attention control under extreme context growth as a new challenge.
[NLP-65] Multi-Head Recurrent Memory Agents
【速读】: 该论文旨在解决递归记忆代理(Recurrent Memory Agents)在处理长上下文时存在的可靠性问题,即随着上下文长度增加,端到端性能系统性下降。其核心诊断指出,性能退化主要源于“记忆保留(memory retention)”能力的崩溃,而非“记忆捕获(memory capture)”。现有方法将记忆视为单一文本块进行更新,导致每次新信息写入都有可能覆盖已有内容,从而破坏长期记忆。为此,作者提出一种通用且无需训练的解决方案——多头递归记忆(Multi-Head Recurrent Memory, MHM),通过将记忆划分为独立的多个头,并采用分阶段“选择-更新”策略,在每一步仅更新一个头,其余头在结构上被屏蔽以避免覆盖,从而将记忆保留的保障从模型行为转移到架构设计。作为轻量级实现,提出最小最近更新多头记忆(MHM-LRU),在零额外词元开销下实现均匀的头利用率。大量实验表明,MHM-LRU在10万至100万词元范围内显著提升记忆保留率与端到端准确率,例如在896K词元的RULER-HQA任务中,记忆保留率从不足30%提升至73.96%。该方法在不同模型家族、规模和任务类型上均表现普适,验证了通过架构优化实现可靠长上下文记忆的可行性与高性价比。
链接: https://arxiv.org/abs/2607.01523
作者: Jiatong Li,Samuel Yeh,Sharon Li
机构: University of Wisconsin-Madison (威斯康星大学麦迪逊分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 19 pages, 11 figures, 5 tables
Abstract:Recurrent memory agents extend LLMs to arbitrarily long contexts by iteratively consolidating input into a fixed-size memory window. Despite their scalability, these agents exhibit a well-documented reliability problem: end-to-end performance degrades systematically as context length grows. We diagnose this failure by decomposing performance into two factors–memory capture and memory retention–and quantitatively confirm that retention is the dominant bottleneck. Retention collapses because existing designs maintain memory as a monolithic text block, forcing every update to risk overwriting previously retained content. Motivated by this diagnosis, we propose Multi-Head Recurrent Memory (MHM), a general, training-free framework that partitions memory into independent heads governed by a stage-wise select-then-update strategy. At each step, exactly one head is selected for update while the remaining heads are structurally shielded from overwriting, shifting the burden of retention from model behavior to architectural design. As a lightweight instantiation, we introduce Least-Recently-Updated MHM (MHM-LRU), which guarantees uniform head utilization with zero additional token overhead. Extensive experiments on long-context benchmarks show that MHM-LRU substantially improves both retention and end-to-end accuracy across the 100K–1M token range, where baselines degrade sharply. On RULER-HQA at 896K tokens, MHM-LRU improves the memory retention rate from less than 30% to 73.96%. These gains generalize across model families, scales, and task types, positioning architectural optimization as a practical and cost-efficient path toward reliable long-context recurrent memory.
[NLP-66] Parameter Golf: What Really Works?
【速读】: 该论文旨在解决在严格资源预算下语言模型性能优化的极限问题,具体表现为如何在16 MB的完整模型制品(包括训练代码与压缩权重)体积限制内,并于8块H100 SXM GPU上10分钟内完成训练的前提下,最大化模型生成文本的质量。其核心挑战在于平衡模型复杂度与计算效率之间的矛盾。解决方案的关键在于系统性地分析竞赛中2,037个拉取请求及1,430个经过评分的提交结果,构建出包含84种优化技术的分类体系,并量化每项技术对比特每字节(bits-per-byte, BPB)这一质量指标的实际贡献。研究发现,尽管单个技术通常仅带来低于1%的BPB改善,但通过多技术协同作用,整体性能显著提升——验证后的排行榜得分从1.2244降至1.058 BPB,实现13.6%的相对降低。更重要的是,研究揭示了多数优化手段在竞争性实现中效果衰减,从而识别出少数在不同技术栈中仍具稳定增益的核心方法,为高效模型压缩与训练提供了关键实践指导。
链接: https://arxiv.org/abs/2607.01517
作者: Prashanna Mani Paudel,Shivanand Venkanna Sheshappanavar
机构: University of Wyoming (怀俄明大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:How far can a language model improve under a strict artifact budget? Parameter Golf posed this question as an open community challenge in which participants trained the best language model, with the complete artifact (training code + compressed weights) required to fit within 16 MB and be trained in under ten minutes on 8xH100 SXM GPUs. Quality was measured in bits-per-byte (BPB), the average number of bits required to encode each byte of unseen text. We analyze 2,037 pull requests and 1,430 clean scored submissions from the contest, build a taxonomy of 84 optimization techniques, and measure each technique’s contribution to BPB. The verified leaderboard score dropped from 1.2244 to 1.058 BPB across three phases – a 13.6% reduction, despite individual techniques rarely improving BPB by more than 1%. We show that most gains in techniques shrink across competitive submissions, isolating the few methods that improve performance across stacks.
[NLP-67] From Monolingual to Multilingual: Evaluating Mamba for ASR in South African Languages
【速读】: 该论文旨在解决生成式AI(Generative AI)在非洲语言语音识别(ASR)中的应用问题,特别是针对南非洲七种语言的低资源场景下,评估新型状态空间模型Mamba相较于传统Conformer模型的性能与效率。其核心挑战在于:在有限训练数据条件下,如何提升模型对非洲语言的识别准确率并降低计算开销。解决方案的关键在于采用Mamba架构,并通过多语言联合训练策略增强泛化能力。具体而言,研究提出在多语言训练中引入语言族信息(language-family embeddings)作为偏置项注入下采样声学表示,同时结合多任务学习框架(以CTC目标和语言识别头联合优化),显著提升了跨语料库鲁棒性;尽管显式语言信息未改善域内性能,但嵌入向量在低资源设置下仍带来性能增益,且其作用机制并非反映语言类型学相似性,而是作为任务特定的控制向量调节模型行为。实验表明,Mamba在保持与Conformer相当识别精度的同时,具备更低的计算成本与更快的训练速度,验证了其在非洲语言ASR中的有效性与潜力。
链接: https://arxiv.org/abs/2607.01502
作者: Jesujoba O. Alabi,Julian Herreilers,Badr M. Abdullah,Dietrich Klakow
机构: University of Paderborn (帕德博恩大学); University of Bremen (不来梅大学)
类目: Computation and Language (cs.CL)
备注: under review
Abstract:Recent advances in automatic speech recognition (ASR) have explored different sequence models, including Conformer-based models and newer state space models such as Mamba. Although prior work has evaluated these architectures in multiple languages, their effectiveness in African languages remains underexplored. In this work, we evaluate Mamba for ASR on seven South African languages. In monolingual experiments, each model is trained on 50 hours of speech per language, and we compare Mamba to a Conformer baseline of similar parameter scale. Mamba achieves similar recognition accuracy to Conformer while using fewer computational resources and training faster. We further evaluate generalization in this setting and find that both models struggle to generalize to speech that is much longer than what they were trained on. We then study multilingual ASR using Mamba models, where the baseline is pooling all languages together. On top of this, we tested three extensions: training with language-family information by adding both language and language-family embeddings as biases to the downsampled acoustic representations, and multitask learning with a CTC ASR objective and a language identification (LID) head. We find that multilingual training consistently improves performance over monolingual training. However, adding explicit language information does not improve in-domain performance but does improve cross-corpus robustness. We conducted ablation studies in low-resource multilingual settings using 5-hour and 10-hour per-language training data, where we observed gains from using language embeddings and further demonstrated that removing or altering them hurt model performance. Lastly, we analysed these embeddings and find that they do not capture linguistic similarity in a typological sense, but instead act as task-specific control vectors.
[NLP-68] Comparing Architectures for Supervised Political Scaling
【速读】: 该论文旨在解决政治文本意识形态定位(text scaling)中的性能瓶颈问题,即如何更准确地将政治主体在意识形态光谱上进行定位。当前主流方法多采用分类或回归范式,但各自存在局限:分类方法难以捕捉连续性意识形态差异,而回归方法在处理离散语义边界时表现不佳。本文的核心解决方案在于提出一种联合建模框架,通过同时预测多个政治主体的意识形态位置,实现跨样本的协同优化,从而提升整体定位精度;同时探索了介于分类与回归之间的中间路径——利用分段回归或分层输出结构,在保持连续性建模优势的同时增强对离散语义特征的敏感性,为文本尺度化任务提供了更具鲁棒性和可解释性的新范式。
链接: https://arxiv.org/abs/2607.01464
作者: Anna Golub,Sebastian Padó
机构: University of Stuttgart, Germany
类目: Computation and Language (cs.CL)
备注:
Abstract:Text scaling, the task of positioning political actors on an ideological scale, is a fundamental task in political analysis. To ease the need for manual analysis, various NLP methods have been proposed for this task, including classification- and regression-based approaches, showing successes as well as limitations. The goal of our paper is to consolidate the state of the art in this area. We ask two questions: (a) Can the performance of scaling methods be improved by predicting scales not individually but jointly? (b) Is there a middle ground between classification and regression?
[NLP-69] Grounded Optimization: A Layered Engineering Framework for Reducing LLM Hallucination in Automated Personal Document Rewriting
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在用于求职者跟踪系统(Applicant Tracking Systems, ATS)简历优化时所引发的特定幻觉问题,包括时间错位的技术注入、跨领域术语污染、结构变异以及内容虚构等。这些问题在通用文本生成中较少见,但在简历优化场景下可能导致严重误导性输出。其解决方案的关键在于提出一种名为“基于事实的优化”(Grounded Optimization)的五层框架,该框架通过集成时间上下文验证、确定性污染检测、结构不变性强制、提示层面的事实锚定以及评估代理(evaluator agent)等机制,系统性地抑制上述幻觉行为。实验表明,在多种模型配置和温度设置下,该框架可将幻觉检测率从基准的每份简历2.48–5.36次显著降低至0.04–0.24次,其中提示层面的事实锚定在低温度与强指令遵循模型下可实现零幻觉,但高温度或弱模型条件下仍需依赖确定性层作为互补保障。
链接: https://arxiv.org/abs/2607.01457
作者: Shashank Indukuri,Adarsh Agrawal
机构: DePaul University (德保罗大学); Stony Brook University (石溪大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, 1 figure. Equal contribution by both authors. Code and data: this https URL
Abstract:Large language models (LLMs) are increasingly applied to resume optimization for applicant tracking systems, introducing hallucination failures distinct from general text generation: anachronistic technology injection, cross-domain terminology contamination, structural mutation, and content fabrication. We present Grounded Optimization, a five-layer framework combining temporal context validation, deterministic contamination detection, structural invariant enforcement, prompt-level grounding, and an evaluator agent. In ablation experiments across three LLMs, four temperature settings, and six layer configurations on 25 synthetic resumes spanning 14 industries, undefended baselines produce 2.48-5.36 detected hallucinations per resume. Among detectors independent of the active defenses, temporal hallucinations are reduced by 50-95% across all conditions; overall detected hallucination rate falls to 0.04-0.24. Prompt-level grounding alone achieves zero detected hallucinations at low temperature with a capable instruction-following model; higher temperatures and weaker models reveal the need for the deterministic layers as a complement. We release the contamination taxonomy, evaluation code, and raw data. Comments: 13 pages, 1 figure. Equal contribution by both authors. Code and data: this https URL Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2607.01457 [cs.CL] (or arXiv:2607.01457v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2607.01457 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-70] On the Utility and Factual Reliability of Pruned Mixture-of-Experts Models in the Biomedical Domain
【速读】: 该论文旨在解决生成式模型中混合专家(Mixture-of-Experts, MoE)架构在资源受限环境下部署时面临的高内存开销与模型可靠性之间的矛盾问题,尤其关注领域特定的专家剪枝(structured expert pruning)对模型实用性和事实可靠性的影响。其解决方案的关键在于系统评估不同剪枝方法、剪枝比例及任务场景下MoE模型在生物医学等高风险领域中的表现,揭示适度剪枝可在保持领域内任务性能的同时维持较高的事实可靠性,但极端剪枝会显著增加幻觉(hallucination)风险;同时发现跨领域迁移时模型的实用性和可靠性均迅速下降。研究强调,仅依赖任务性能评估不足以支撑高风险场景下的安全部署,必须结合可靠性分析才能实现可信的模型压缩。
链接: https://arxiv.org/abs/2607.01444
作者: Atsuki Yamaguchi,Szymon Palucha,Léo Bijar,Aline Villavicencio,Nikolaos Aletras
机构: University of Sheffield (谢菲尔德大学); AstraZeneca (阿斯利康); University of Exeter (埃克塞特大学); Federal University of Rio Grande do Norte (里约格朗德州联邦大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Under review
Abstract:Mixture-of-Experts (MoE) models offer inference speedups via selective activation but impose substantial memory requirements because the whole network must remain loaded. Structured expert pruning is a practical approach for reducing deployment costs in resource-constrained settings. However, prior studies primarily evaluate benchmark utility, leaving the effect of pruning on factual reliability underexplored, particularly in high-stakes domains such as biomedicine. In this paper, we investigate how domain-specific expert pruning affects both utility and reliability. We assess four MoE models, six pruning methods, and multiple pruning ratios across generation and classification tasks under in-domain (biomedical) and cross-domain settings. Results reveal that moderate pruning preserves in-domain utility without immediate reliability decline, although hallucination risks increase at extreme pruning ratios. When shifting to the general domain, both utility and reliability degrade rapidly. These findings indicate that safe compression depends heavily on the task and domain. Evaluating pruned MoE models solely on utility is inadequate for high-stakes deployment without reliability assessment.
[NLP-71] FaithMed: Training LLM s For Faithful Evidence-Based Medical Reasoning
【速读】: 该论文旨在解决当前医疗领域大语言模型(Large Language Models, LLMs)在临床决策推理过程中缺乏对证据的主动获取与可信评估的问题。现有方法或无法动态访问可靠证据,或在使用检索到的证据时未对证据的评估与应用过程进行有效监督,导致推理过程透明性与可解释性不足。为此,研究提出FaithMed框架,其核心创新在于将循证医学(Evidence-Based Medicine, EBM)原则形式化为过程层面的评价标准,并结合临床专家设计、自动优化的评分量表,通过基于步骤级过程奖励分配与优势分组的强化学习机制,实现对推理过程中每一步证据应用的显式监督。实验结果表明,在七个医疗基准测试中,FaithMed相较于代理搜索基线平均提升9%,优于仅以结果为导向的强化学习方法5.8%,同时在循证医学评分量表上相较基线提升15.5%。该研究证明,对推理过程进行细粒度的步骤级监督,能够显著提升任务成功率与推理过程的忠实性。
链接: https://arxiv.org/abs/2607.01440
作者: Zhiyun Zhang,Liwen Sun,Xiang Qian,Chenyan Xiong
机构: Carnegie Mellon University (卡内基梅隆大学); Stanford University School of Medicine (斯坦福大学医学院); Xlue
类目: Computation and Language (cs.CL)
备注: 15 pages, 5 figures
Abstract:Faithful reasoning is essential in medicine, where clinical decisions require transparent justification grounded in reliable evidence. Current medical LLMs either lack active access to evidence or use retrieved evidence without supervising how it should be appraised and applied during reasoning. To address this, we formalize evidence-based medicine principles as process-level criteria and introduce FaithMed, a framework that combines clinician-designed, automatically refined rubrics with reinforcement learning using step-level process reward assignment and advantage grouping. Across seven medical benchmarks, FaithMed improves over agentic-search baselines (+9% on average) and outcome-only RL (+5.8%), while raising average evidence-based medicine rubric scores over agentic-search Qwen3 baselines (+15.5%). This work demonstrates that explicit step-level supervision can improve both task success and the faithfulness of the reasoning process. Code is available at this https URL.
[NLP-72] IsoSci: A Benchmark of Isomorphic Cross-Domain Science Problems for Evaluating Reasoning versus Knowledge Retrieval in LLM s
【速读】: 该论文旨在解决大语言模型(LLM)评估中推理能力与领域知识检索难以分离的问题,尤其针对生成式人工智能(Generative AI)在科学类问题求解中的表现评估。传统评估方法往往无法区分模型是因具备更强的逻辑推理能力还是依赖特定领域知识而取得性能提升,从而导致对“链式思维”(Chain-of-Thought, CoT)等推理策略有效性的误判。为解决这一问题,论文提出ISOSCI基准,即同构跨域科学问题对数据集,其核心设计在于每一对问题具有相同的逻辑结构但需调用不同的领域知识,从而实现对推理模式增益的可控归因。关键发现表明,在所测试的五组模型对中,91.3%的推理模式性能提升均依赖于具体领域知识而非结构不变性(63/69个增益;95%置信区间[82.3%, 96.0%]),直接挑战了“链式思维能普遍提升短程程序化科学问题求解能力”的主流假设。此外,高能力模型开启推理模式后整体准确率提升不足5个百分点,且专门优化推理能力的o3-mini模型虽在GPQA Diamond上优于标准版本(+19.2个百分点),但在ISOSCI上反而表现更差(-24.7个百分点),凸显评估基准选择对结论的决定性影响。因此,该研究的关键解决方案在于构建一个可分离推理与知识维度的基准体系,以实现对模型推理能力的更精确、更可信的评估。
链接: https://arxiv.org/abs/2607.01431
作者: Samir Abdaljalil,Erchin Serpedin,Hasan Kurban
机构: Texas AM University (德州农工大学); Hamad Bin Khalifa University (哈马德·本·哈利法大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce ISOSCI, a benchmark of isomorphic cross-domain science problem pairs that separates reasoning ability from domain knowledge retrieval in LLM evaluation. Each pair shares identical logical structure but requires different domain-specific knowledge, enabling controlled attribution of reasoning-mode gains. Across five model pairs spanning four model families, we find that 91.3% of reasoning-mode gains are knowledge-dependent rather than structure-invariant (63/69 gains; Wilson 95% CI [82.3%, 96.0%]), directly challenging the assumption that chain-of-thought reasoning improves short-horizon procedural scientific problem-solving. Reasoning toggles on highly capable models provide less than 5 percentage points accuracy gain across all domains, and a reasoning-specialized model (o3-mini) that outperforms its standard counterpart on GPQA Diamond (+19.2 percentage points) underperforms on ISOSCI (-24.7 percentage points), showing that benchmark choice determines conclusions about reasoning utility. We release ISOSCI at this https URL
[NLP-73] MultAttnAttrib: Training-Free Multimodal Attribution in Long Document Question Answering EMNLP2026
【速读】: 该论文旨在解决多模态长文档问答系统中生成答案的证据溯源(attribution)准确性问题,尤其是在当前以生成式AI(Generative AI)为核心的智能助手日益广泛应用的背景下,确保答案可追溯至原始证据对提升用户信任与模型安全性至关重要。现有研究主要聚焦于单模态场景下的归因方法,而多模态情境下的归因机制仍缺乏系统性探索。为此,论文提出一种无需训练的归因生成方法——MultAttnAttrib,其核心创新在于利用模型预填充阶段(prefill pass)的信息,结合选定的注意力头(attention heads)与校准阈值,高效定位文档中的源证据。为评估该方法性能,研究进一步构建了首个专为长篇多模态文档设计的细粒度标注基准数据集MultAttrEval,实现了对答案组件的精确归因标注。实验结果表明,MultAttnAttrib在多项指标上显著优于多种基于提示(prompting)的归因方法,并达到如GPT 5.4等前沿模型的水平;同时,在保持高归因精度的前提下,推理延迟仅为同类提示方法的约七分之一,显著提升了效率。因此,该方案的关键在于通过轻量级、无训练的注意力机制分析实现高精度、低延迟的多模态归因。
链接: https://arxiv.org/abs/2607.01420
作者: Dang Quang Thien Tran,Quang V. Dang,Vinamra Tyagi,Sai Soorya Rao Veeravalli,Trang Nguyen,Ryan A. Rossi,Franck Dernoncourt,Nedim Lipka,Koustava Goswami,Samyadeep Basu
机构: Stanford University (斯坦福大学); Google (谷歌)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 25 pages (8 main, 17 references + appendix), 15 figures, Submitted to EMNLP 2026 Conference (Long Paper)
Abstract:As grounded QA systems are increasingly deployed in AI assistants, accurately attributing generated answers to evidence is critical for user trust and model safety. While unimodal attributions have been explored in depth, the multimodal setting remains relatively under-researched. As a result, we introduce MultAttnAttrib, a training-free attribution-generation method that leverages a model’s prefill pass, selected attention heads, and calibrated thresholds to locate source evidence within a document. To establish baseline results for the method, we introduce MultAttrEval, a complementary benchmark dataset annotated with fine-grained, ground-truth attributions for answer components grounded in multimodal source documents. To our knowledge, this is the first evaluation dataset designed specifically for multimodal attribution in long-form documents. Experimental results show that MultAttnAttrib consistently outperforms a variety of attribution-generation methods, including several strong prompting-based approaches and matches the latest frontier models such as GPT 5.4. Our method not only substantially improves attribution accuracy for both unimodal and multimodal attribution types, but also produces attributions at up to one-seventh of the direct inference latency compared to prompting on the same base model.
[NLP-74] Multi-Objective Exploration and Preference Optimization via Mutual Information KDD2026 ECML
【速读】: 该论文旨在解决大语言模型在对齐多样且异构的人类价值观时,如何有效权衡冲突的偏好维度这一关键问题。现有方法虽通过基于偏好向量的策略训练与在线直接偏好优化实现多目标权衡,但探索过程中的不确定性会导致不同偏好条件下生成响应的奖励分布发生重叠,从而削弱模型对相应偏好向量的有效对齐。为此,本文提出一种基于互信息(Mutual Information, MI)的信息论框架——多目标探索与偏好优化(MI-EPO),其核心在于通过最大化生成响应、偏好反馈与偏好向量之间的联合条件互信息,统一多目标探索与对齐过程。该方法引入概率路由机制,自然地将目标对齐与偏好感知探索解耦,促使模型生成在不同偏好条件下具有可区分性且高度对齐的响应。实验结果表明,MI-EPO显著提升了生成响应与偏好向量之间的对齐程度,增强了输出的可控性,并在多个目标间实现了稳定的权衡。
链接: https://arxiv.org/abs/2607.01392
作者: Hongyan Xie,Yikun Ban,Ruiyu Fang,Zixuang Huang,Deqing Wang,Jianxin Li,Shuangyong Song
机构: Xingchen AGI Lab
类目: Computation and Language (cs.CL)
备注: Accepted at ECML/PKDD 2026
Abstract:Aligning large language models with diverse and heterogeneous human values requires multi-objective alignment methods to effectively trade off conflicting preference dimensions. Current methods achieve this trade-off by training policies conditioned on preference vectors and leveraging online direct preference optimization. However, exploration uncertainty can cause the reward distributions of responses generated under different preference vectors to overlap, and the generated responses may fail to effectively align with the corresponding preference vectors. In this paper, we propose Multi-Objective Exploration and Preference Optimization via Mutual Information (MI-EPO), an information-theoretic framework. It unifies multi-objective exploration and alignment by maximizing the joint conditional mutual information among generated responses, preference feedback, and preference vectors. By incorporating a probabilistic routing mechanism, MI-EPO naturally decomposes objective alignment and preference-aware exploration, encouraging the model to generate responses that are distinguishable and aligned with different preference conditions. Experiments on safe alignment and helpful assistant tasks show that MI-EPO significantly improves the alignment between generated responses and preference vectors, makes the outputs more controllable, and achieves stable trade-offs across multiple objectives.
[NLP-75] RusFinChain: A Russian Benchmark for Verifiable Chain-of-Thought Reasoning in Finance with Fuzzy-Aligned Evaluation
【速读】: 该论文旨在解决金融领域中多步符号推理(multi-step symbolic reasoning)评估缺乏可验证中间推理步骤的问题,尤其针对俄语语言环境下的金融分析任务。现有基准如FINCHAIN仅支持英文,而FINESSE-Bench虽包含俄语内容但依赖无步骤监督的多项选择题,无法有效评估模型的真实推理能力。为此,本文提出RusFinChain,首个面向俄语金融领域的可验证链式思维(Chain-of-Thought, CoT)推理基准,涵盖17个领域、172个主题,基于可执行的Python模板生成5,280个参数化示例,确保数据集无污染,并为每个问题提供带中间数值的黄金标准推理链以实现自动化验证。其核心解决方案在于引入两种增强型评估指标:模糊数值对齐(Fuzzy Numeric Alignment)与软注意力对齐(Soft-Attention Alignment),显著提升了评估结果与最终答案正确性之间的相关性(Spearman rho ≈ 0.48),优于传统ChainEval指标(rho ≈ 0.38–0.46),展现出更强的诊断能力。实验评估8个开源大模型在分层样本上的表现,揭示了模型存在显著推理差距——尽管步骤对齐的Hard F1约为0.65,但最终答案正确率仅为约29%。研究公开了数据集、代码及评估框架,旨在推动俄语社区中可验证的金融人工智能发展。
链接: https://arxiv.org/abs/2607.01388
作者: M. K. Arabov
机构: Kazan Federal University (喀山联邦大学)
类目: Computation and Language (cs.CL)
备注: Preprint
Abstract:Multi-step symbolic reasoning is essential for robust financial analysis, yet most benchmarks neglect intermediate reasoning steps. FINCHAIN introduced verifiable Chain-of-Thought (CoT) evaluation but is limited to English. FINESSE-Bench includes a Russian block but relies on multiple-choice questions without step-level supervision. We present RusFinChain, the first Russian-language symbolic benchmark for verifiable CoT reasoning in finance. It spans 17 domains, 172 topics, and comprises 5,280 parameterized examples from executable Python templates, ensuring contamination-free evaluation. Each example includes a gold-standard reasoning chain with intermediate numeric values for automatic verification. We also introduce enhanced metrics: Fuzzy Numeric Alignment and Soft-Attention Alignment. We evaluate 8 open-weight LLMs on a stratified sample, generating 8,100 responses. Results reveal a substantial reasoning gap: models achieve Hard F1 of ~0.65 for step alignment, but only ~29% of final answers are correct. Our fuzzy and soft metrics show stronger correlation with final-answer correctness (Spearman rho approx 0.48) than the original ChainEval (rho approx 0.38-0.46), demonstrating superior diagnostic power. We release dataset, code, and evaluation framework to foster verifiable financial AI for the Russian-speaking community.
[NLP-76] urnNat: Automatic Evaluation of Turn-Taking Naturalness in Dyadic Spoken Dialogue
【速读】: 该论文旨在解决全双工语音对话系统中对话轮换自然度的自动化评估难题。现有评估方法多依赖人工判断或特定行为的时间特征指标,难以在统一框架下比较多种异构的轮换时间异常。其解决方案的关键在于提出一种基于似然性的评估框架TurnNat,通过在自然对话数据上训练因果轮换预测模型,估计未来双人语音活动状态,并利用观测到的未来活动状态的负对数似然(NLL)来衡量时间异常程度。TurnNat将帧级NLL值在从话语起止点提取的轮换边界单元(Turn-Taking Boundary Units, TBUs)上进行聚合,综合均值与尾部TBU得分生成对话级别的自然度评分。此外,研究构建了一个包含成对自然与扰动对话片段的受控扰动基准测试集,并通过人工自然度评价验证其有效性。实验表明,TurnNat能够有效识别跨多种异构时间异常的不自然轮换现象。
链接: https://arxiv.org/abs/2607.01345
作者: Hao Zhang,Thomas Thebaud,Georgi Tinchev,Venkatesh Ravichandran,Laureano Moro-Velazquez
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Turn-taking naturalness is central to full-duplex spoken dialogue systems, yet its automatic evaluation remains limited. Existing evaluations often rely on human judgments or behavior-specific timing metrics, making it difficult to compare heterogeneous timing failures within a unified framework. We propose TurnNat, a likelihood-based framework for automatic turn-taking naturalness evaluation in two-channel spoken dialogue. A causal turn-taking prediction model trained on natural conversations estimates future two-speaker voice-activity states, and the negative log-likelihood (NLL) of the observed future activity measures timing atypicality. TurnNat pools frame-level NLLs over turn-taking boundary units (TBUs) extracted from utterance onsets and offsets, and aggregates mean and tail TBU scores into a dialogue-level naturalness score. We further construct a controlled perturbation benchmark of paired natural and perturbed dialogue clips, validated by human naturalness judgments. Experiments on this benchmark show that TurnNat successfully identifies unnatural turn-taking perturbations across heterogeneous timing failures.
[NLP-77] Black-Box Inference of LLM Architectural Properties with Restrictive API Access
【速读】: 该论文旨在解决在当前受限的黑盒API访问条件下,如何从商业大语言模型(Large Language Model, LLM)中逆向推断其架构参数的问题。尽管多数商业LLM提供商已限制API仅返回单个解码词元的对数几率(logit),并取消了对对数几率偏置(logit bias)的支持以防止模型架构泄露,但本文指出这些措施仍不足以完全隐藏模型的内在结构。其解决方案的关键在于提出名为NightVision的新型攻击方法,该方法通过一种创新的“公共集合提示”(common set prompting)技术,构造多个提示语句以暴露相同输出词元集合的对数概率;随后利用这些结果进行谱分析(spectral analysis),从而推断出模型的隐藏维度(hidden dimension)。此外,结合端到端的首次生成时间(time to first token, TTFT)测量与估计的隐藏维度,进一步估算模型深度(depth)和参数量(parameter count)。实验表明,该方法在32个开源模型上实现了平均相对误差低于23%(混合专家模型为9%)的隐藏维度估计精度,并在参数量超过三亿的模型上对深度和参数量的估计误差控制在53%以内。研究通过大量消融实验验证了精度随令牌预算和模型特性的变化规律,最终表明当前主流的API限制仍不足以充分掩盖底层模型的架构细节。
链接: https://arxiv.org/abs/2607.01313
作者: Christopher Ellis,Shreyas Chaudhari,Mei-Yu Wang,Leighton Barnes,Giulia Fanti,José M. F. Moura
机构: Carnegie Mellon University (卡内基梅隆大学); Pittsburgh Supercomputing Center (匹兹堡超级计算中心)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:
Abstract:In practice, most commercial LLM providers do not publicly release details of underlying LLM architectures. However, prior work has shown that given limited API access to an LLM (namely, top- k logits and/or a logit bias function), one can recover certain architectural details of an LLM, such as the hidden dimension of the feed-forward network. Perhaps in response to these results, most commercial LLM providers have restricted their APIs to expose only the single logit for each decoded token, and they no longer give users the ability to bias logits. We show that even under current restrictive APIs, several architectural parameters are still recoverable. We present NightVision, an attack that uses restrictive black-box API access to estimate the hidden dimension, depth, and parameter count of an LLM. Algorithmically, NightVision relies on a novel common set prompting technique in which multiple prompts expose log probabilities for the same set of output tokens; a spectral analysis of these results is used to infer hidden dimension. NightVision additionally uses end-to-end time to first token (TTFT) measurements and the estimated hidden dimension to estimate depth and parameter count. We empirically evaluate NightVision on 32 open-source LLMs, recovering hidden dimension to within 23% average relative error across all models (9% on MoE models), and depth and parameter count to within 53% for models exceeding three billion parameters. We run extensive ablations to demonstrate how these accuracies scale with token budget and model properties. Overall, our results suggest that current LLM APIs are not sufficiently restricted to fully obfuscate the architectural details of their underlying models.
[NLP-78] RuleChef: Grounding LLM Task Knowledge in Human-Editable Rules
【速读】: 该论文旨在解决自然语言处理(Natural Language Processing, NLP)任务中模型可解释性与透明度不足的问题,尤其是在文本分类、命名实体识别(Named Entity Recognition, NER)及关系抽取等场景下,传统深度学习模型往往作为“黑箱”运行,难以提供可理解的决策依据。为此,论文提出RuleChef框架,其核心解决方案是利用大语言模型(Large Language Models, LLMs)在学习阶段自动生成可执行规则,并基于任务描述与标注样例迭代优化这些规则,同时结合额外样本和人类反馈对已有规则进行修正。该框架还可通过现有模型的输入-输出对实现规则的初始生成(即规则引导的模型蒸馏)。在整个过程中,仅在训练阶段使用LLMs进行规则合成与修补,最终输出一个快速、确定性且可解释的规则系统。实验初步验证了其在分类与NER任务上的有效性,且该工具已开源,采用Apache 2.0许可证发布。
链接: https://arxiv.org/abs/2607.01293
作者: Ádám Kovács,Nadia Verdha,Gábor Recski
机构: KR Labs; TU Wien
类目: Computation and Language (cs.CL)
备注: 8 pages
Abstract:We present RuleChef, a framework that uses large language models (LLMs) to generate executable rules for NLP tasks such as text classification, Named Entity Recognition (NER), or relation extraction. Rules are generated based on a task description and a set of labeled examples, then they are iteratively improved based both on additional examples and on human feedback overexisting rules. RuleChef can also be used to bootstrap rules using the observed input-output pairs from any existing model for a given task. LLMs are used only at learning time, synthesizing rules and iteratively patching them based on failures measured on a held-out split. The result of this process is a fast, deterministic, and inspectable rule system. Preliminary evaluation is performed on both classification and NER tasks. We release RuleChef as open-source software under an Apache 2.0
[NLP-79] Structuring the Space of Sociotechnical Alignment
【速读】: 该论文旨在解决生成式AI(Generative AI)在社会技术对齐(sociotechnical alignment)过程中存在的核心问题:即“社会可取性”(social desirability)这一关键概念缺乏系统性定义与规范基础,导致现有研究在技术实现中往往模糊处理其价值内涵。其解决方案的关键在于提出一个以人类为中心的框架,将社会科学中关于社会行为可取性的理论基础引入对对齐目标的界定,从而为判断人工智能行为的社会可取性提供可操作、可辩护的规范依据。通过系统的文献综述,研究发现当前实践普遍存在三大缺陷:支撑可取性判断的规范性概念未明确定义或与系统行为目标混淆,目标受众群体界定不清,设计决策缺乏理论支撑。针对这些问题,论文提出将社会科学研究框架与对齐设计选择相衔接的建议,推动社会技术对齐从经验性、模糊性的实践转向更具概念清晰性和累积性的科学方法。
链接: https://arxiv.org/abs/2607.01250
作者: Esra Dönmez,Agnieszka Falenska
机构: University of Stuttgart (斯图加特大学)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Preprint
Abstract:Sociotechnical alignment concerns the social desirability of AI behavior and is thus inherently normative, not merely technical. While NLP research increasingly addresses its technical aspects, it often leaves underspecified what such “social desirability” entails. We argue that this reflects a fundamental gap: the absence of a systematic way to specify how sociotechnical alignment defines, justifies, and evaluates socially desirable AI behavior. To address this gap, we introduce a human-centered framework for specifying sociotechnical alignment. We draw on social-scientific accounts of sociobehavioral desirability to ground the basis for behavioral desirability judgments and use this framework to analyze how alignment is specified in practice. Our systematic literature review identifies recurring patterns: normative concepts grounding desirability judgments are often unspecified or conflated with alignment targets for (desired) system behavior, target populations are underdefined, and design choices are rarely theoretically justified. These findings point to a lack of conceptual specificity that limits cumulative progress. We therefore offer recommendations that link social-scientific frameworks to alignment design choices, supporting more conceptually precise approaches to sociotechnical alignment. Comments: Preprint Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2607.01250 [cs.CY] (or arXiv:2607.01250v1 [cs.CY] for this version) https://doi.org/10.48550/arXiv.2607.01250 Focus to learn more arXiv-issued DOI via DataCite
[NLP-80] Mapping Text to Multiplex Graph: Prompt Compression as Lévy Walk-Guided Graph Pruning
【速读】: 该论文旨在解决现有提示压缩方法将文本视为扁平的词元序列,无法有效捕捉重要信息在空间上分散且通过局部句法依赖与全局语义关系相互关联的特性,导致压缩过程中丢失关键语义结构的问题。其核心解决方案是提出一种基于多层图结构的冗余感知图剪枝(Redundancy-Aware Graph Pruning, RAGP)方法,将提示压缩建模为在一个融合细粒度注意力依赖与粗粒度语义关系的异质多层图上的优化问题。关键在于利用莱维游走(Levy walks)的重尾步长分布,在密集的局部子图与稀疏的全局连接之间实现局部探索与全局寻优的自然平衡,从而高效识别非冗余节点。实验结果表明,RAGP在LongBench数据集上以4倍压缩比达到平均49.3分,优于基于大语言模型的现有方法(如LongLLMLingua在3倍压缩比下为48.8),并在多个任务上超越了当前最先进的视觉引导文本压缩范式。
链接: https://arxiv.org/abs/2607.01241
作者: Yaxin Gao,Yao Lu,Jinhong Deng,Jiaqi Nie,Zhe Tang,Jian Zhang,Zhaowei Zhu,Shanqing Yu,Qi Xuan,Joey Tianyi Zhou
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Existing prompt compression methods treat text as flat token sequences, failing to capture the distributed nature of important information, which is often spread across multiple locations and connected through both local syntactic dependencies and global semantic relations. Such relational structure is naturally represented as a graph, where tokens or sentences become nodes and their dependencies become edges. To this end, we propose RAGP, which formulates prompt compression as Redundancy-Aware Graph Pruning on a multiplex graph that jointly models fine-grained attention-based dependencies and coarse-grained semantic relations. To efficiently identify non-redundant nodes in this heterogeneous structure (dense local subgraphs and sparse global connections), we employ Levy walks whose heavy-tailed step distribution naturally balances local exploitation with global exploration. Experiments on LongBench show that RAGP achieves an average score of 49.3 under a 4x compression ratio, outperforming existing LLM-based compression methods, such as LongLLMLingua, which attains 48.8 at a 3x compression ratio. Besides, RAGP also surpasses state-of-the-art vision-based text compression paradigms on multiple tasks. The code is available at this https URL.
[NLP-81] Prompt Framing Distorts Count-Based Evaluation of LLM Error Detection: Evidence from Numeric Anchoring
【速读】: 该论文旨在解决生成式人工智能(Generative AI)在自然语言纠错任务中,因提示工程(prompt engineering)导致的计数型F1分数虚高问题,即“F1膨胀”(F1 Inflation)现象——即计数型F1分数显著上升,但实际错误片段定位能力(span localization)并未同步提升。其核心解决方案是提出并应用ErrorBench,一种受控的压力测试协议,用于系统性检测由提示诱导引发的计数偏差。通过在CoNLL-2014数据集上对六种主流大语言模型(LLM)在五种提示条件下进行4,290次响应评估,研究发现锚定提示(anchored prompts)可导致最高达0.79的计数型F1膨胀,严格匹配标准下甚至高达0.96。在100篇文档的复现实验中,尽管盲提示转为锚定提示使平均计数型F1提升+0.21,但多参考标准下的ERRANT F0.5仅提升+0.04,进一步验证了该膨胀现象。研究还发现,高度遵循指令的GPT/Claude系列模型倾向于产生更多错误计数,而Gemini系列则相对保守。因此,论文强调在大语言模型的校对与文档审阅评估中应避免预先填充错误数量,并必须结合片段感知型指标(span-aware metrics)与计数型指标,以获得更真实、可靠的性能评估。
链接: https://arxiv.org/abs/2607.01240
作者: Dekun Yang
机构: Zhejiang University (浙江大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages, 6 figures, 12 tables. Preprint under review
Abstract:Count-based F1 is widely used as a proxy for LLM error-detection quality, but this paper shows that it can rise dramatically without a corresponding improvement in span localization, a gap termed F1 Inflation. The paper introduces ErrorBench, a controlled stress-test protocol for prompt-induced count distortion. ErrorBench evaluates six contemporary LLMs under five prompt conditions over 4,290 responses from 143 CoNLL-2014 passages. Under CoNLL-2014 M2-style scoring, anchored prompts produce up to 0.79 points of F1 Inflation, and up to 0.96 under strict matching. A 100-passage replication using the official ERRANT 3.0.0 pipeline and multi-reference scoring reproduces the pattern: averaged over six models, the Blind-to-Anchored prompt shift raises Count-F1 by +0.21 while raising multi-reference ERRANT F0.5 by only +0.04. The study finds larger count responses in highly instruction-compliant GPT/Claude systems and smaller responses in the Gemini family under this stress-test protocol. The findings suggest that LLM proofreading and document-review evaluations should avoid pre-populated error counts and should report span-aware metrics alongside count-based metrics.
[NLP-82] Breaking Safety at the Token Boundary: How BPE Tokenization Creates Exploitable Gaps in LLM Alignment
【速读】: 该论文旨在解决现代大语言模型(LLM)在安全对齐(safety alignment)方面存在的漏洞问题,即字符级扰动(character-level perturbations)能够绕过模型的安全防护机制,尽管这些扰动后的提示(prompt)仍保持人类可读性。其核心问题在于:基于字节对编码(BPE)的分词机制会将关键安全词汇拆分为子词片段,而当前主流的对齐数据集(alignment datasets)中几乎不存在此类有意碎片化的输入样本,导致模型无法学习识别被分割的安全词。解决方案的关键在于揭示并验证这一结构性机制——通过优化目标针对安全词的分词碎片化,可在五种不同模型家族(Qwen-3-4B、Qwen-2.5-7B、Gemma-3-4B、Llama-3.1-8B、Mistral-7B)上实现高达80%–100%的初始拒绝触发器反转,并使其中48%的攻击实例生成真正有害输出。研究进一步通过激活图谱定位(activation patching)确认干扰信号集中于模型最后约30%的层,通过对齐数据集扫描发现3万条样本中无任何碎片化提示,且靶向突变实验将破坏点精确锁定至安全词本身。在防御层面,实验表明基于直接偏好优化(DPO)的配置无法在存在封闭池大小混淆的情况下稳定关闭攻击成功率(ASR),而使用碎片化提示进行监督微调(SFT)虽能关闭3/5个模型的ASR,但伴随全局拒绝率上升,表明仅补充缺失分布不足以实现选择性修复。为此,作者提出“Conv-Benign”作为候选诊断工具以区分选择性修复与全局崩溃。所有结果均经三名评审员校准,确保评估结果一致性与可靠性。
链接: https://arxiv.org/abs/2607.01239
作者: Tung-Ling Li,Hongliang Liu,Yuhao Wu
机构: Palo Alto Networks(帕洛阿尔托网络)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Character-level perturbations bypass safety alignment in modern LLMs despite leaving prompts human-readable. We identify and test a central structural mechanism: BPE tokenization fragments safety-critical words into sub-word pieces, and the three public alignment datasets we surveyed contain no intentionally fragmented inputs. The mechanism is a chain, tested end-to-end on five model families (Qwen-3-4B, Qwen-2.5-7B, Gemma-3-4B, Llama-3.1-8B, Mistral-7B). An optimization targeting safety-token fragmentation flips the first-token refusal trigger on 80-100% of refused HarmBench prompts, with 48% of those flips producing genuinely harmful outputs (per-model 29-65%; gap-vs-behavior ROC-AUC 0.66-0.98, pooled 0.84). Activation patching localizes the disrupted signal to the last \sim30% of layers; an alignment-data scan finds zero fragmented prompts among 30,000 examples (positive-control recall \geq 99% at attack-relevant intensities); and targeted-mutation experiments isolate safety words as the disruption locus. On the defense side, a 68-cell grid (55 trained checkpoints) shows that no DPO configuration achieves seed- and pool-stable ASR closure on the three families with closed pool-size confounds. SFT trained on fragmented prompts closes ASR on 3/5 families but only via global collapse that raises refusal on benign prompts as well, indicating the missing distribution is necessary but not sufficient under the LoRA-16 recipe we tested. To distinguish selective repair from global collapse, we introduce Conv-Benign, a candidate paired diagnostic. All ASR claims are 3-judge-calibrated (cell rankings stable across judges; absolute levels \pm 18pp; see App.~B.13).
[NLP-83] SPARCLE: SPeaker-aware Aligned Representations via Contrastive Language Embeddings INTERSPEECH
【速读】: 该论文旨在解决低资源场景下文本到语音(TTS)合成中因依赖传统音素转换系统(G2P)而导致的声学表征不准确与说话人特异性信息缺失问题。现有基于音素的模型虽能缓解文本到声学的一对多映射问题,但其性能严重依赖于预训练的G2P系统,而这些系统在低资源条件下表现不佳。此外,音素表示无法有效捕捉说话人特有的声学变体。为克服上述局限,本文提出SPARCLE——一种面向说话人的字形表征模型,通过将字符与其精确的声学实现进行联合建模,增强字形表征的声学语义。其核心创新在于采用对比学习目标,在给定说话人身份的条件下,使字形与对应的Wav2Vec2声学嵌入对齐,从而生成更具说话人感知能力的字形表征。该模型可直接替代传统G2P系统,作为下游TTS任务的输入。实验表明,相较于标准字形模型,SPARCLE在极端低资源设置下将词错误率降低50%,显著提升了语音合成质量。
链接: https://arxiv.org/abs/2607.01238
作者: Priyam Mazumdar,Yurii Halychanskyi,Steven Guo,Mark Hasegawa-Johnson,Volodymyr Kindratenko
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄本那-香槟分校); National Center for Supercomputing Applications (国家超级计算应用中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 5 Pages, 1 Figure, 2 Tables, Interspeech
Abstract:Recent advances in speech synthesis have shifted from phoneme representations to direct grapheme modeling. While phonemes address the one-to-many mapping between text and acoustics, they rely on grapheme-to-phoneme (G2P) systems that fail to capture speaker-specific acoustic variation. Prior work demonstrates that grapheme-based models outperform phoneme-based systems at scale, but not in low-resource settings. In this paper, we propose SPARCLE, a speaker-aware grapheme representation model that enriches characters with their precise acoustic realizations. SPARCLE is trained with a contrastive objective to align graphemes with corresponding Wav2Vec2 acoustic representations while conditioned on speaker identity. The resulting model serves as a replacement to G2P systems for downstream text-to-speech (TTS) tasks. We demonstrate that SPARCLE improves generation quality, reducing word error rates by half in extreme low-resource settings compared to standard grapheme-based models. Comments: 5 Pages, 1 Figure, 2 Tables, Interspeech Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS) Cite as: arXiv:2607.01238 [cs.CL] (or arXiv:2607.01238v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2607.01238 Focus to learn more arXiv-issued DOI via DataCite
[NLP-84] Kara: Efficient Reasoning LLM Serving via Sliding-Window KV Cache Compression
【速读】: 该论文旨在解决生成式 AI(Generative AI)在推理过程中因长链式思维(Chain-of-Thought, CoT)导致的键值缓存(KV cache)占用过大、解码延迟高及吞吐量受限的问题。现有 KV 缓存压缩方法存在两大关键局限:其一,基于阈值触发的压缩策略可能无法有效提升吞吐量,甚至导致吞吐量下降,且可能完全丢弃序列中某些区块的全部 KV 对,加剧信息损失;其二,多数方法仅保留孤立的 KV 对或固定大小的块,边界僵化,难以灵活保留任意位置的重要语义片段。为克服上述问题,作者提出 Kara,一种基于滑动窗口的运行时 KV 缓存压缩方法,仅对最近生成的上下文进行压缩操作,并利用双向注意力机制对窗口内信息量高的 KV 对进行评分与选择。为进一步实现重要语义信息的灵活保留,设计 Token2Chunk 模块将部分选中的 KV 对扩展为可变尺寸的语义块。此外,将 Kara 适配至 PagedAttention 架构,构建了基于 vLLM 的 KvLLM 推理框架,显著降低 KV 缓存内存占用并有效提升输出吞吐量。大量实验表明,Kara 与 KvLLM 在多个基准上均实现了稳定的性能提升。
链接: https://arxiv.org/abs/2607.01237
作者: Shen Han,Yuyang Wu
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 6 figures
Abstract:Reasoning language models often generate long chain-of-thought (CoT), which accumulates a massive KV cache during the decoding phase and incurs high decoding latency and limited throughput. To address these issues, KV cache compression has emerged as a promising technique for reducing memory overhead by selectively removing unimportant KV pairs while preserving useful ones for subsequent decoding. Nevertheless, we identify two key limitations in existing KV cache compression methods: 1) their threshold-triggered compression policy may provide limited throughput improvement or even reduce throughput, and may fully eliminate KV pairs from certain blocks of the sequence, potentially worsening information loss. 2) they typically retain either isolated KV pairs or fixed-size chunks with rigid boundaries, failing to preserve important flexible-sized chunks at arbitrary token positions. To overcome these limitations, we propose Kara, a sliding-window KV cache compression method that performs decoding-time compression by operating only on the recently generated context. Kara leverages bidirectional attention to score and select informative KV pairs in the window. To enable flexible preservation of important semantic information, we design a Token2Chunk module to expand a subset of selected KV pairs into chunks. Furthermore, we adapt Kara to PagedAttention and develop KvLLM, an inference framework built upon vLLM, which reduces KV cache memory usage and effectively improves output throughput. Extensive experiments demonstrate consistent performance improvements of proposed Kara and KvLLM.
[NLP-85] Safeguarding LLM Agents from Misalignment through Provenance Analysis
【速读】: 该论文旨在解决大语言模型(LLM)智能体在调用外部工具时可能出现的意图错位(misalignment)问题,即智能体提出的工具调用行为与用户真实意图不符,可能导致不可逆的有害后果。现有运行时防护机制普遍采用“大模型作为裁判”(LLM-as-a-judge)范式,缺乏系统性的对齐推理框架,导致判断结果不一致且难以审计。为此,论文提出基于溯源分析(provenance analysis)的概念性框架,将误对齐检测形式化为判断所提议的工具调用是否能在智能体上下文中找到可追溯的证据支持。在此基础上,构建了多阶段管道 ProvenanceGuard,通过在工具执行前分析三类误对齐情况,仅当行动被判定与用户查询对齐时才允许执行。实验在 Agent-SafetyBench 与 WorkBench 两个基准上,针对 10 种主流 LLM 进行评估,结果表明,相较于基线方法,ProvenanceGuard 将 Agent-SafetyBench 上误对齐轨迹的错误率从 42.9% 降至 1.8%,WorkBench 上从 32.1% 降至 17.3%,同时将任务成功轨迹上的干预负担从 30.5% 降低至 12.8%,且在对齐轨迹上未引入统计显著的误干预。这证明了基于结构化溯源推理的防护机制在保障 LLM 智能体对齐性方面具有高效性与实用性。
链接: https://arxiv.org/abs/2607.01236
作者: Yining She,Yiliang Liang,Eunsuk Kang
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:As LLM agents gain increasing access to powerful tools, ensuring that their actions are aligned with the user’s intent becomes critical. When an agent’s proposed tool invocation deviates from the user’s intent – a phenomenon called misalignment – it may lead to harmful consequences that are difficult to undo. Existing runtime guardrails rely on an LLM-as-a-judge paradigm that lacks a systematic framework for reasoning about alignment, often producing judgments that are inconsistent or difficult to audit. Motivated by provenance analysis, we propose a provenance-based conceptual framework that formalizes misalignment detection as determining whether a proposed tool call is supported by traceable evidence in the agent’s context. Building on this framework, we propose ProvenanceGuard, a multi-stage pipeline that analyzes the agent’s action for three types of misalignment before the selected tool is executed and only allows the action to take place when it is considered aligned with the user’s input query. We evaluated our proposed approach on two different benchmarks, Agent-SafetyBench and WorkBench, across 10 backbone LLMs. Compared to the LLM-as-a-judge baseline, ProvenanceGuard reduces error rate on misaligned traces from 42.9% to 1.8% on Agent-SafetyBench and from 32.1% to 17.3% on WorkBench, while reducing intervention burden on task-successful traces from 30.5% to 12.8% and introducing no statistically significant increase in unnecessary interventions on aligned traces. These results demonstrate that structured, provenance-based reasoning provides an effective and practical foundation for safeguarding LLM agents from misalignment.
[NLP-86] okenScope: Token-Level Explainability and Interpretability for Code-Oriented Tasks in Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在代码生成过程中难以实现细粒度、实时可解释性分析的问题,尤其关注模型在词元(token)层面的决策机制。现有工具虽能提供模型内部状态或生成结果的洞察,但普遍缺乏解码过程中的动态信号、细粒度不确定性度量以及探索替代生成路径的交互能力。其解决方案的关键在于提出TokenScope——一个面向基于解码器架构的LLMs的交互式可解释性与分析工具。该工具通过暴露生成过程中的词元级指标、注意力模式及结构化程序信息,支持交互式词元替换、反事实分支探索,并结合抽象语法树(Abstract Syntax Tree, AST)实现代码感知的聚合分析。通过将解码时信号与结构化程序分析相统一,TokenScope实现了对代码生成过程中模型行为的系统性探究。
链接: https://arxiv.org/abs/2607.01235
作者: Amirreza Esmaeili,Fatemeh Fard
机构: University of British Columbia (不列颠哥伦比亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:Understanding how Large Language Models (LLMs) make token-level decisions during code generation remains a major challenge for both researchers and practitioners. While recent tools provide insights into model internals or generation outcomes, they often lack decoding-time signals, fine-grained uncertainty measures, and interactive mechanisms for exploring alternative generation paths. We present TokenScope, an interactive interpretability and analysis tool for decoder-based LLMs that exposes token-level metrics, attention patterns, and structural information during generation. TokenScope supports interactive token replacement, counterfactual branching, and code-aware aggregation via abstract syntax trees. By unifying decoding-time signals with structural program analysis, TokenScope enables systematic investigation of LLM behaviour during code generation.
[NLP-87] Robust for the Wrong Reason s: The Representational Geometry of LLM Robustness to Science Skepticism
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在面对用户质疑时可能产生“谄媚式退让”(sycophantic retreat)的问题,即在用户表达怀疑时,模型倾向于偏离科学共识,制造虚假平衡,将已确立的科学事实视为多种观点之一。研究通过在三个主流指令微调模型(Llama-3.1-8B、Qwen2.5-7B、Mistral-7B)上,针对气候、疫苗和进化三大科学共识领域,在单轮与多轮对话场景中进行系统性测试,结合行为测量、线性探测(linear probing)与激活修补(activation patching)技术,揭示了模型对质疑压力的真实响应机制。研究发现,模型并未表现出普遍的谄媚退让,而是呈现出三种截然不同的应对策略:反应性强化(reactive assertion,如Llama模型在质疑下反而更坚定地重申共识)、表面缓和(surface hedging,如Qwen模型语气软化但立场不变),以及非响应(non-response,如Mistral模型完全不回应)。双盲判断验证了反应性增强是立场而非风格的变化(63.6%,p = .007),且其驱动因素为共识主张的增强,而非虚假平衡(每单位质疑强度β = +0.042,p < 1e-77)。线性探测进一步将差异定位至模型中间层,显示Llama与Qwen在该层实现完美分离,而Mistral仅达72%分离度,且置信区间无重叠,表明其根本未线性表征质疑信号。关键发现是,这种鲁棒性不具备跨领域迁移能力——在关键安全领域(如疫苗)中,模型反而出现反向退化,反驳错误信念的能力在质疑压力下减弱。研究据此提出一个四维分类框架,区分主动鲁棒性(基于理解)与偶然鲁棒性(源于感知失败),强调仅依赖行为评估无法区分模型是真正理解质疑信号并抵抗之,还是因未能感知信号而看似“稳健”。
链接: https://arxiv.org/abs/2607.01951
作者: Minjong Cheon
机构: Sejong University (世宗大学)
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) are increasingly consulted on contested scientific questions, raising the concern that they will sycophantically retreat from established consensus when a user signals doubt – drifting toward a false balance that treats settled science as one view among several. We test this across three open instruction-tuned models (Llama-3.1-8B, Qwen2.5-7B, Mistral-7B), three consensus-science domains (climate, vaccines, evolution), and single- and multi-turn settings, combining behavioral measurement with linear probing and activation patching. We do not observe sycophantic retreat. Instead, models show three distinct policies under the same skeptical pressure: reactive assertion, where consensus assertion increases rather than decreases (Llama); surface hedging, where tone softens while the position holds (Qwen); and non-response (Mistral). Pairwise judgments confirm the reactive shift is stance, not style (63.6%, p=.007), and a decomposition identifies increased consensus assertion, not false balance, as its driver (beta=+0.042 per dose, p1e-77). Linear probes localize the divergence to middle layers – perfect separation in Llama and Qwen versus 72% in Mistral, with non-overlapping confidence intervals – indicating the non-responsive model does not linearly represent the skepticism signal at all. Crucially, this robustness does not transfer: it attenuates across domains and, in the safety-critical vaccine domain, can reverse, with myth-rebuttal weakening under skeptical pressure. We synthesize these into a four-way taxonomy separating active from accidental robustness, and argue that behavioral evaluation alone cannot distinguish a model that resists skepticism because it understands the signal from one that only appears to resist because it fails to perceive it.
[NLP-88] Self-Supervised Test-Time Tuning for Packet Loss Concealment
【速读】: 该论文旨在解决传统包丢失隐藏(Packet Loss Concealment, PLC)方法中模型参数在部署后固定不变的问题,即现有PLC系统通常采用预先训练好的静态模型,在接收端无法根据实际接收到的信号特征进行动态调整,从而限制了对特定通话或录音中信号特异性信息的利用。其核心解决方案是提出一种自监督测试时调优(Test-Time Tuning, TTT-PLC)框架,通过仅使用已接收的音频包,构建自监督信号掩码任务来实现对已有PLC模型的动态适应:具体而言,该方法通过对当前可获得的信号部分进行合成掩码,并以原始PLC目标函数为指导训练模型恢复被掩码内容,进而利用优化后的模型重建真实丢失的数据包。该方法无需干净参考信号、外部训练数据或结构修改,即可在非因果和因果两种部署场景下有效提升隐蔽性能。关键创新在于利用信号中仍可观察的部分作为自监督信号,使预训练模型在推理阶段具备动态适应能力,显著提升了对丢失包的重构质量。
链接: https://arxiv.org/abs/2607.01823
作者: Yehoshua Dissen,Joseph Keshet
机构: Technion–Israel Institute of Technology (以色列理工学院); Andrew and Erna Viterbi Faculty of Electrical and Computer Engineering (安德鲁与埃尔娜·维特比电气与计算机工程学院)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: Under submission to IEEE TASLP
Abstract:Packet loss concealment (PLC) reconstructs audio packets that are missing at the receiver, usually with a trained model whose parameters remain fixed at deployment time. This treats the PLC model as static, even though each call or recording exposes signal-specific information through the packets that did arrive. We present TTT-PLC, a self-supervised test-time tuning framework that adapts existing PLC models using only those received packets. The method creates supervision by synthetically masking portions of the available signal, training the model to conceal them with its native PLC objective, and then using the adapted model to reconstruct the true packet losses. No clean reference signal, external adaptation data, or architectural modification is required. We study TTT-PLC in two deployment settings. In the non-causal setting, the received file is available before reconstruction, allowing repeated self-supervised adaptation passes and providing a per-file adaptation ceiling. In the causal setting, audio is streamed without revising emitted samples; adaptation is performed only on completed past blocks, and updated parameters affect only future audio. We instantiate the framework on two public PLC backbones, FRN, a recurrent full-band speech PLC model, and PARCnet, a hybrid autoregressive-neural model for networked music. Across these settings, the results show that pretrained PLC systems do not need to be treated as fixed at inference time, the still-observed portions of a lossy signal can provide an effective training signal for improving concealment on that same signal. Comments: Under submission to IEEE TASLP Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL) Cite as: arXiv:2607.01823 [eess.AS] (or arXiv:2607.01823v1 [eess.AS] for this version) https://doi.org/10.48550/arXiv.2607.01823 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
信息检索
[IR-0] Bringing Agent ic Search to Earth Observation Data Discovery
链接: https://arxiv.org/abs/2607.02387
作者: Minghan Yu,Youran Sun,Chugang Yi,Yixin Wen,Haizhao Yang
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 19 pages, 1 figure, 6 tables
Abstract:NASA and its data centers hold thousands of geoscience datasets and tools like Worldview, Giovanni, the Science Discovery Engine, and Harmony. Finding the right one is hard even for domain experts. We present an agentic search system, deployed as a public service for the geoscience community, that takes a natural-language research query and returns the matching datasets and tools. We demonstrate that, in the era of large language models, the latent value of knowledge graphs (KGs) can be substantially amplified through agentic search. From the NASA Earth Observation Knowledge Graph (NASA EO-KG) we derive NASA-EO-Bench, an open benchmark of 47k query-dataset pairs (21k task-based queries). A neural scorer fine-tuned on NASA-EO-Bench beats cosine and BM25 baselines. Further combining it with BM25 via score fusion raises both Recall@10 (R@10) and MRR by over 5x. On top of this supervised pipeline, we add a zero-shot agentic reranking stage that, without any additional training, lifts MRR by 28% on a stratified N=200 subset, showing that LLM reasoning is complementary to supervised retrieval.
[IR-1] HNSW with Accuracy Guarantees Using Graph Spanners – A Technical Report VLDB2027
链接: https://arxiv.org/abs/2607.02338
作者: Minghao Li,Raghav Mittal,Sanjivni Rana,Suraj Shetiya,Gautam Das,Nick Koudas
类目: Databases (cs.DB); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 23 pages, 22 figures, Submitted to VLDB2027
Abstract:Hierarchical Navigable Small World (HNSW) graphs serve as the industry standard due to their logarithmic complexity and strong empirical performance. However, HNSW relies on greedy graph traversal, a heuristic that provides no theoretical guarantees of correctness. In this paper, we propose a novel “Certify-then-Rectify” framework that bridges the gap between the speed of heuristic search and the rigor of exact retrieval. Rather than discarding HNSW, our approach first employs a distribution-free statistical certifier to dynamically evaluate the quality of a standard HNSW search with minimal overhead. If certification indicates that the retrieved neighbors are of low quality, the framework safely escalates to a rigorous exact recovery algorithm. To make this exact recovery computationally feasible, we reinterpret the HNSW graph as a geometric spanner and utilize Extreme Value Theory to stochastically estimate its maximum empirical stretch factor. This allows us to mathematically bound the maximum distance of true nearest neighbors. Extensive evaluations on benchmark datasets demonstrate that our tiered framework delivers the average-case speed of HNSW while ensuring the worst-case correctness of exact search and outperforming other applicable approaches.
[IR-2] Planning over Matrix-Factorization MDPs for Candidate Generation KDD2026
链接: https://arxiv.org/abs/2607.02115
作者: Mikhail Trapeznikov,Maksim Utushkin
类目: Information Retrieval (cs.IR)
备注: Accepted to the 5th Workshop on End-to-End Customer Journey Optimization at KDD 2026. 6 pages, 3 figures, 2 tables
Abstract:For a recommender service, we view the customer journey as a chain of item recommendations: a useful item changes the user’s state and therefore what should be retrieved next. Standard matrix-factorization retrieval ignores this – it builds one user vector and returns the top- K items by a static score, treating them as independent. We ask a narrow question: when is it worth planning over the user-state dynamics that fold-in induces? To answer it we propose casting top- K retrieval as an MDP over the implicit-ALS posterior (A^-1,u) , where an action is an item and the transition is a closed-form rank-one fold-in, and the trajectory reward combines a relevance similarity with a posterior-alignment term. Under the same fixed embeddings we compare static retrieval, one-step planning, and horizon- K MCTS across five datasets and two protocols: a per-user leave-last- n split and a stricter global time split. Dynamics-aware planning tends to overcome static retrieval on all datasets under leave-last- n , and the gains hold on MovieLens-1M and the VK-LSVD slices under the global time split. A single step of lookahead already captures most of the gain, so the lightweight planning layer turns static top- K scoring into a short decision and improves retrieval over fixed collaborative-filtering embeddings, with no retraining and no change to the representation. These gains depend on measuring relevance with cosine rather than inner-product similarity, which is otherwise entangled with item popularity.
[IR-3] Evaluating Chunking Strategies for Retrieval-Augmented Generation on Academic Texts
链接: https://arxiv.org/abs/2607.01852
作者: Valentin J. J. Kreileder,Johannes Reisinger,Andreas Fischer
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Retrieval-Augmented Generation (RAG) systems use the question-answering capabilities of Large Language Models (LLMs) to access information outside their parameters. We evaluate if cluster-based semantic chunking improves retrieval and answer quality compared to fixed-size and recursive chunking evaluating on long, structured academic theses using the Retrieval Augmented Generation Assessment (RAGAs) framework. RAGAs based faithfulness shows limited reliability in this setup. Performance on fixed versus document specific questions varied substantially, likely related to the formatting of documents and preprocessing. Under the tested configuration, cluster-based chunking did not outperform simpler strategies.
[IR-4] IntentTune: Using user demand and personalization to resolve “unknown” query intents for e-commerce search
链接: https://arxiv.org/abs/2607.01530
作者: Rachith Aiyappa,Ishita Khan,Chester Palen-Michel,Jayanth Yetukuri,Samarth Agrawal,Mehran Elyasi,Shuang Zhou
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Understanding user intent is fundamental to delivering relevant search results in e-commerce. However, substantial fraction of real-world queries are under-specified (e.g., “watch” or “shirt”), lacking explicit attributes such as gender or age group. This ambiguity poses a significant challenge for query intent detection models in e-commerce search systems, which must accurately infer latent user intent (e.g., age, gender) to support effective downstream retrieval. We introduce IntentTune, a framework for resolving ambiguous or under-specified query intents by leveraging either (1) user-specific behavioral signals including search history, browsing activity, and profile attributes or (2) population-level demand patterns aggregated across all users. Through experiments on real-world e-commerce data, we first demonstrate that population-level demand patterns alone are insufficient to reliably infer intent in under-specified queries. We then demonstrate that user-specific behavioral signals – particularly prior search queries – outperform both population-level statistics and static profile information for inferring gender, age group, product category, and size intent from underspecified queries.
[IR-5] CoPersona: Collaborative Persona Graphs for Robust LLM Personalization KDD’26
链接: https://arxiv.org/abs/2607.01485
作者: Yangtian Zhang,Leyao Wang,Hiren Madhu,Ngoc Bui,Walter Roznyatovskiy,Rex Ying
类目: Information Retrieval (cs.IR)
备注: Accepted at KDD '26. 12 pages, 5 figures, 8 tables
Abstract:Real-world LLM personalization is often constrained by sparse and skewed user histories: most users provide only a handful of interactions, while even frequent users’ logs capture an incomplete and biased view of their preferences. As a result, weakly observed user attributes are difficult to infer, leading to brittle personalization when test-time requests shift toward under-supported facets. Motivated by this limitation, we present CoPersona, a graph-based collaborative personalization framework that completes sparse user profiles by borrowing signals from behaviorally similar peers. However, directly transferring signals is difficult because uneven facet coverage introduces bias into interaction histories, obscuring user similarity in the unstructured global space. To address this issue, CoPersona decomposes interaction histories into multiple facet-level representations and explicitly models peer-to-peer, facet-level alignment through a multiplex persona graph. To effectively leverage peer information at inference time, we employ a dual-branch architecture that combines non-parametric peer retrieval with parametric graph reasoning. Experiments across multiple domains and model scales demonstrate consistent improvements over strong baselines, validating CoPersona as an effective approach for robust LLM personalization.
[IR-6] Bi-NAS: Towards Effective and Personalized Explanation for Recommender Systems via Bi-Level Neural Architecture Search
链接: https://arxiv.org/abs/2607.01387
作者: Longfeng Wu,Yao Zhou,Tong Zeng,Zhimin Peng,Bhanu Pratap Singh Rawat,Lecheng Zheng,Giovanni Seni,Dawei Zhou
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:Recommender systems are vital in helping users navigate vast amounts of information, offering personalized suggestions and effective explanations for these recommendations. While previous efforts have attempted to provide such explanations, evaluating their effectiveness across various scenarios remains a challenge. Enhancing these explanations is essential for improving user engagement, trust, and decision-making. To facilitate effective explanations within the recommender system, we propose a Bi-level Neural Architecture Search (Bi-NAS) framework to optimize explanations. This approach simultaneously refines cross-attention mechanisms and feature interaction functions by exploring both intra-layer and inter-layer design spaces. Furthermore, we integrate Large Language Models (LLMs) to enhance explanation generation, leveraging zero-shot prompting to produce more effective and personalized justifications. By aligning user feature preferences with item quality scores, our approach ensures that explanations reflect both user intent and item attributes, improving transparency and reasoning depth. Extensive evaluations on four real-world datasets demonstrate that Bi-NAS not only boosts recommendation accuracy but also significantly improves the effectiveness of explanations for recommender systems, providing users with clear and reliable insights into the suggestions they receive.
[IR-7] Embedding Inference Attack
链接: https://arxiv.org/abs/2607.01276
作者: Cedric Fitiavana Raelijohn,Sébastien Gambs,Jean-Francois Rajotte
类目: Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 12 pages
Abstract:Embedding models are essential components of modern Information Retrieval (IR) systems, yet they are typically hidden behind APIs. Recent works have shown that dense IR system can lead to security vulnerabilities such as embedding inversion attacks. However, such attacks usually require that the attacker knows the embedding model for the attack to be applicable. In this paper, we study IR systems under a black-box setting in which the adversary observes only the unordered set of retrieved documents, without ranking or similarity scores. We demonstrate that in such contexts, tailored queries allow an adversary to identify which embedding model is in use from a set of known model candidate, which we coin as an embedding inference attack (EIA). We also show that certain queries remain discriminative even when the system includes a reranker as a potential defense mechanism. We further validate our method on a real Retrieval-Augmented Generation (RAG) system, in which the tailored queries bypass the LLM’s tendency to reject inputs it does not recognize as well-formed questions. Finally, we propose and evaluate other mitigation strategies such as similarity thresholds.
[IR-8] Office Comprehension Benchmark
链接: https://arxiv.org/abs/2607.01245
作者: Firoz Shaik,Mateus Picanço Lima Gomes,Tanvir Aumi,Jingci Wang,Milos Milunovic,Filip Basara,Ivana Jovanovic,Vishwas Suryanarayanan,Neha Nandan Kenkare,Weiyao Xie,Zhipeng Han,Zheng Zhang,Waleed Shahid,Jay Rathi,Russell Scherer,Thong Q. Nguyen,Michael Bentley,Tamara Stankovic,Rasika Chakravarthy,Vishal Chowdhary
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:We introduce Office Comprehension Bench (OCB), the first public benchmark to jointly evaluate LLM systems on Word, Excel, and PowerPoint comprehension over native file formats (.docx, .xlsx, .pptx) and their variants. OCB consists of two tracks. File Fidelity QA tests structural and visual perception of office artifacts - tables, charts, embedded images, formulas, and app-specific elements such as headers, speaker notes, and named ranges. Domain QA tests expert-level reasoning grounded in real-world industry documents across 12 professional domains, with queries requiring multi-step analysis and synthesis across documents. Each reference answer is decomposed into atomic, binary-gradable claims, and an ensemble of LLM judges scores responses against each claim independently. Even the strongest frontier system in its default reasoning mode reaches only about 59.3% on Domain QA; increasing thinking depth within a tier does not move performance materially, while moving to a higher product tier yields modest gains. We release the dataset, evaluation tooling, judge prompt, and a public leaderboard.
[IR-9] Retrieval-Augmented Generation to Support Railways Engineering Tasks: A Case Study
链接: https://arxiv.org/abs/2607.01244
作者: Andrea Gerardo Russo,Federico Ruggeri,Ivan Tomarchio,Davide Bombini,Nicolò Donati,Gianmarco Pappacoda,Paolo Torroni,Giuseppe-Emiliano La Cara
类目: Information Retrieval (cs.IR); Computers and Society (cs.CY)
备注:
Abstract:The growing number and complexity of technical regulations represent an important challenge for all professionals in regulated industries. This paper describes a case study, from design to deployment, of building a Retrieval-Augmented Generation system for the consultation of complex technical regulations in the railway domain. Although developed for the railway sector, this testimony of an industrial experience is of particular value for technical domains where regulatory compliance and accurate information retrieval from complex documentation are essential requirements. It also constitutes a human-centered approach for implementing LLM-powered technical documentation consultation across various regulated industries, balancing technological capabilities with domain expertise.
[IR-10] STRUCTSURVEY: Structured Agent ic Retrieval for Automated Survey Paper Generation ACL
链接: https://arxiv.org/abs/2607.01243
作者: Paolo Pedinotti,Enrico Santus
类目: Information Retrieval (cs.IR)
备注: 8 pages, 1 figure, appendices, SurgeLLM, RAG4Reports, ACL
Abstract:The rapid growth of scientific publications makes it increasingly difficult to track and synthesize research progress. While Large Language Models (LLMs) can support automated survey generation, existing methods retrieve unstructured data and require models to infer conceptual, methodological, and taxonomic relations from raw text at generation time. We introduce STRUCTSURVEY, a hierarchical multi-agent framework that shifts structural reasoning from generation to retrieval by dynamically constructing graph-based representations of entities, relations, and topical taxonomies. We evaluate STRUCTSURVEY on a new reference-grounded benchmark of ACL survey papers for reproducible long-form scientific summarization. Compared with embedding-only retrieval baselines, STRUCTSURVEY improves ROUGE-1 recall by +2.9 and ROUGE-2 recall by +1.0 on average, without reducing precision. It also improves LLM-as-a-Judge ratings for logical structure, depth, and synthesis, showing that explicit structural retrieval yields surveys closer to human-written organization and reasoning.
[IR-11] ExPerT: Personalizing LLM Responses to Users Domain Expertise via Query-Wise Semantic and Keystroke Behavioral Cues ACL2026
链接: https://arxiv.org/abs/2607.01242
作者: Yeji Park,Jiwon Tark,Taesik Gong
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted to ACL 2026 (Main, Long)
Abstract:Large language models (LLMs) are increasingly used by end users, yet existing personalization methods relying on static profiles or text-only signals fail to capture query-specific expertise variation. We present ExPerT, a query-wise personalization framework that adapts LLM responses to users’ query domain expertise by combining semantic and behavioral cues. ExPerT consists of two key components: (i) a semantic-behavioral expertise inference module that jointly interprets query text and keystroke dynamics via in-context LLM prompting, and (ii) an expertise-conditioned response generation that adapts the level of detail, terminology, and conceptual complexity. Our user study with 40 participants and 1270 queries demonstrated that ExPerT reduced expertise inference error by 65.7% compared to the strongest baseline (MAE = 0.398 vs. 1.162) and improved response satisfaction by 17.52% (from 3.71 to 4.36) on a 5-point Likert scale.
人机交互
[HC-0] When Do LLM Personas Support Visualization Design? A Cross-Model Study of Color Assignment and Chart Choice
链接: https://arxiv.org/abs/2607.02455
作者: Shahreen Salim,Klaus Mueller
类目: Human-Computer Interaction (cs.HC)
备注: 5 pages, 3 figures
Abstract:Large language model personas are increasingly used to approximate diverse users during early-stage visualization design, but it remains unclear whether persona-conditioned outputs reflect stable personality effects or artifacts of model choice and task framing. We examine this question across two visualization-relevant tasks: color assignment for abstract and concrete concepts, and chart-idiom preference ratings across task contexts. Using 43 Big Five profiles across GPT-4o-mini, GPT-4.1-mini, and GPT-5-mini, we find that personality-color coupling is highly model-configuration dependent: absent in GPT-4o-mini for all six concepts, consistent in GPT-4.1-mini across all six, and partial in GPT-5-mini for two of six. Concept type further shapes the signal: for abstract concepts, personality explains more hue variance than model identity, while concrete concepts show smaller and comparable effects. In chart choice, trait-aligned cluster aggregation produces stable top-idiom rankings across all nine cluster-context combinations, but a no-persona baseline recovers the same top choice in 8 of 9 model-context cells, indicating that task context drives rank-1 selection more than personality. These findings position LLM personas as exploratory probes for visualization design, not substitutes for human participants, and motivate multi-model testing, concept-type disaggregation, and no-persona baselines in future studies.
[HC-1] Physical surfaces make touch interactions in virtual reality precise efficient and bimanual WWW
链接: https://arxiv.org/abs/2607.02430
作者: Wen Ying,Seongkook Heo
类目: Human-Computer Interaction (cs.HC)
备注: This paper has been accepted by the “International Journal of Human-Computer Studies (IJHCS)” [Project Link] this https URL [Paper Link] this https URL [Video Link] this https URL
Abstract:Virtual reality (VR) systems can enable convenient hand-based interactions across diverse work scenarios. However, mid-air gestures lack tactile feedback and a physical reference surface to support the hand. This absence of haptic grounding can cause significant challenges in achieving precise and efficient touch interactions. This paper investigates the effect of different types of hand-grounded haptic feedback on the touch performance of VR tasks that demand high precision, such as selecting, tracing, and sketching. We compared three levels of haptic feedback: 1) No Haptic Feedback, where only visual feedback was provided; 2) Tactile Feedback, where users received vibrotactile and pressure feedback upon touching a virtual surface; 3) Physical Surface, where users interacted with a portable and tangible surface. Our study found that portable physical surfaces enabled the best selection precision, tracing efficiency, and sketch quality. Furthermore, participants showed increased bimanual hand utilization when engaging with a physical surface during tasks. These observed behaviors corresponded to participants’ preference for interacting with physical surfaces, attributed to a better sense of confidence and control. Comments: This paper has been accepted by the “International Journal of Human-Computer Studies (IJHCS)” [Project Link] this https URL [Paper Link] this https URL [Video Link] this https URL Subjects: Human-Computer Interaction (cs.HC) Cite as: arXiv:2607.02430 [cs.HC] (or arXiv:2607.02430v1 [cs.HC] for this version) https://doi.org/10.48550/arXiv.2607.02430 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: International Journal of Human-Computer Studies, 215:103850, 2026 Related DOI: https://doi.org/10.1016/j.ijhcs.2026.103850 Focus to learn more DOI(s) linking to related resources
[HC-2] Data Comics for Education: Evaluating Effectiveness Benefits and the Ethics of AI-Assisted Creation
链接: https://arxiv.org/abs/2607.02361
作者: Zirui Shan,Vanessa Echeverria,Yuheng Li,Yi-Shan Tsai,Roberto Martinez-Maldonado
类目: Human-Computer Interaction (cs.HC); Emerging Technologies (cs.ET)
备注:
Abstract:In today’s data-driven world, students often struggle with interpreting visualisations due to limited visualisation literacy. Data comics have emerged as a promising medium to enhance engagement and understanding, but their educational value has seen little empirical examination, partly due to the effort required to create them. Recent advances in Generative AI (GenAI) offer a scalable solution to this challenge. We conducted a within-subjects study with 60 university students, comparing conventional visualisations with data comics, created with assistance from GenAI tools, across information retrieval and comprehension tasks. Students consistently performed better with data comics, particularly in insight comprehension tasks, independent of prior visualisation literacy. Students also commented data comics as more engaging and easier to understand, though concerns were raised about GenAI-driven misinformation and ownership. Our findings highlight the potential of data comics as a potentially effective tool for data communication in education, while underscoring the need to address ethical concerns related to AI-assisted creation.
[HC-3] Personality Without Persons? A Psychometric Critique of Big Five Testing in Large Language Models
链接: https://arxiv.org/abs/2607.02325
作者: Kim Zierahn,Cristina Cachero,Anna Korhonen,Nuria Oliver
类目: Human-Computer Interaction (cs.HC)
备注: 11 pages
Abstract:Human personality inventories are increasingly used to characterize large language models (LLMs), compare systems, and inform downstream governance claims. Yet, these inventories were developed and validated for humans, and it remains unclear whether they apply to LLMs. We present a systematic psychometric evaluation of Big Five personality measurements in LLMs. We ask three research questions: Do Big Five inventories a) appropriately describe LLMs, b) capture inter-individual differences across models, and c) reflect internal factors consistent with human personality. We assess content validity of five candidate Big Five inventories and administer the winning inventory to N = 244 different models spanning 49 model families. First, we found that Big Five items adapted for LLMs can reach sufficient content validity, while original human-developed items did not. Second, Big Five inventories did not capture meaningful differences between LLMs: We found low variability between models, accounting for only 3% of total score variance. Third, LLMs responses did not recover the Big Five five-factor structure with four of the Big Five facets collapsing into one (r = .92). Direct comparisons between base and instruction-tuned model variants suggested that alignment training systematically shifted Big Five scores toward socially desirable traits. These findings demonstrate that Big Five scores do not measure a construct equivalent to human personality in LLMs. Applying human personality frameworks to LLMs produces misleading characterizations used to benchmark, compare, and govern LLMs. We highlight the need for evaluation frameworks that are developed for LLMs, rather than adopting human constructs without validation.
[HC-4] Copewell: A Multi-Agent Swarm Architecture for Equitable Mental Wellness Support
链接: https://arxiv.org/abs/2607.02245
作者: Seren Yenikent,Jack Vinijtrongjit,Katherine Ng
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:
Abstract:Mental health disorders affect nearly one billion people globally, yet 75% of individuals in low- and middle-income countries receive no treatment due to workforce shortages, cost barriers, and stigma. Current AI-powered wellness solutions predominantly rely on single-mode conversational interfaces that suffer high abandonment rates and fail to provide measurable, immediate relief calibrated to users’ dynamic emotional states. This paper presents Copewell, a novel multi-agent swarm system designed to expand access to mental wellness support through human-centered AI principles. Our architecture introduces three technical innovations: (1) a multi-source assessment framework integrating self-reported, physiological, and contextual data to mitigate algorithmic bias; (2) valence-arousal emotion mapping using Russell’s Circumplex Model of Affect to route users to specialized AI agents; and (3) dual-mode intervention delivery combining conversational support with evidence-based sensory wellness protocols. We examine the sociotechnical design considerations underlying Copewell’s development, including a privacy-first architecture, embedded ethical oversight through a dedicated Ethics Supervisor agent, and participatory design informed by mental health practitioners. Early practitioner engagement and beta deployment inform design decisions and identify directions for future empirical evaluation. This work contributes to responsible AI discourse by demonstrating how technical architecture can operationalize equity and safety principles from inception.
[HC-5] What Types of Human-AI Teams Exist?
链接: https://arxiv.org/abs/2607.02198
作者: Nathan Hughes,Ibrahim Habli
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 36 pages, 12 figures
Abstract:Human-AI teaming has received increasing attention in the literature. However, the range of studies conducted in multiple domains make it difficult to understand what types of teams are being studied, and in what ways are they similar/different from one another. In this study, we analyse 53 papers on human-AI teams and categorise them into five main clusters based on psychological taxonomies of teaming; AI Assistant, Ad-hoc Dependency, Ad-hoc Forced Dependency, Paired Equanimity, and Group Equanimity. Each cluster represents a unique combination of holistic team-level characteristics, indicating there are multiple disparate team types studied under the same definition. In turn, this raises the question of whether insights are truly transferable between papers. We conclude with guidance on how to identify the types of human-AI teams studied, a checklist for reporting a human-AI team in research work, and ways in which the field can be further synthesised.
[HC-6] Synthetic Contact with AI Reduces Cross-Partisan Animosity
链接: https://arxiv.org/abs/2607.02181
作者: Benjamin Lira,Noah Castelo,Stefano Puntoni,Olivier Toubia
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: 32 pages, 6 figures; 5 preregistered studies, N = 3,960
Abstract:Americans’ warmth toward members of the opposing political party has fallen sharply over the past three decades – yet meaningful cross-partisan contact remains scarce, in part because people actively avoid it. Across five preregistered studies (total N = 3,960 U.S. partisans), we test whether brief conversations with AI chatbots representing the political outgroup can substitute for the contact people shun. Synthetic contact first lowers the barrier to entry: partisans would endure almost twice as long contemplating their own mortality to avoid a human outgroup partner as an AI one. These conversations then correct the misperceptions that fuel division. At baseline, Democrats placed Republicans more than a standard deviation past their actual position on environmental consumption attitudes – enough to flip the average Republican from supportive to opposed – and a single ten-minute conversation with an outgroup chatbot corrected those beliefs and warmed affect in a within-person study of both parties. A three-arm experiment ruled out pure engagement and sociality as drivers. Synthetic contact also moved behavior, in a sample of both parties and on a more affectively charged issue: participants who spoke with an outgroup bot about immigration were six percentage points more likely than controls to choose to have a real conversation with a partisan from the other side. A final study tested whether these gains last: the warmth effect replicated immediately in a new sample; most of it faded within a week, with a small residual concentrated among the most extreme partisans. Analyzing conversation content showed that information, more than friendliness, distinguishes outgroup bots from control chatbots. Together, these findings establish synthetic contact as a scalable, behaviorally consequential, and – unlike face-to-face contact – widely acceptable form of cross-partisan engagement.
[HC-7] Choreographing the Way of Water: A Computational Framework for Aquatic Robotic Art
链接: https://arxiv.org/abs/2607.02174
作者: Aswin Ramachandran,Christopher Golling,Sebastian Burmester,Noa Sendlhofer,Jan Kamm,Ruiheng Jiang,Raffaello D’Andrea
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注: Video: this https URL
Abstract:Robotic choreography in open water is governed by nonlinear fluid dynamics, which impose significant challenges due to environmental disturbances and nonlinear system dynamics. This paper presents the cyber-physical architecture of Way of Water, a vertically integrated framework that orchestrates a fleet of autonomous surface vessels as a distributed choreographic platform. Moving beyond the surface-pixel paradigm, these vessels use laminar nozzles and multi-zone lighting to extend their expressive range from the 2D water plane into the 3D volumetric domain. Our primary contribution is the Way of Water Studio, a browser-based, timeline-compositing authoring paradigm that treats the fleet as a DAW-like instrument for music-responsive choreography. The Studio encapsulates Sequential Convex Programming for trajectory generation and Model Predictive Control for disturbance rejection presented through a visual timeline, broadening access to high-performance aquatic robotics for non-programmer artists. Grounding the Studio is the full cyber-physical stack: a custom holonomic chassis, a state-estimation and control stack tuned for the aquatic domain, and an LTE/MQTT fleet link with RTK-GPS time synchronization. We report on the system’s validation across two distinct deployments: an 18-vessel Swan Lake interpretation at Lake Zurich and an 8-vessel Time Space Existence 2025 Venice Biennale demonstration at Forte Marghera, establishing a foundational reference for the design and deployment of fluidic robotic swarms.
[HC-8] Visual Analytics of Neighborhood Attribute Profiles for Exploring Structural Equivalence IEEE-VIS2026
链接: https://arxiv.org/abs/2607.02163
作者: Kohei Arimoto,Masahiko Itoh
类目: Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
备注: 5 pages, 3 figures. Accepted as a Short Paper at IEEE VIS 2026
Abstract:Exploring similar nodes in attributed networks represents a key challenge in data mining. While recent representation learning methods embed networks into low-dimensional vectors, they often implicitly assume a uniform and continuous feature space. This paper proposes a visual analytics approach using dimensionality reduction to help clarify the true topological structure of high-dimensional feature spaces formed by nodes’ neighborhood attribute profiles. Analyzing inter-firm transaction networks indicates that structural roles can form complex, non-linear manifolds with density biases. Comparing this feature space with industry classifications suggested: (1) supply chain hierarchies transition continuously; (2) categories treated identically under general semantics can be clearly separated by actual transaction networks; and (3) a single industry label may fragment into multiple regions. These findings suggest potential limitations in assuming identical semantics imply similar structural roles and highlight the possible need for new similarity metrics aligned with manifold topology.
[HC-9] A Social Norms Approach to Youth Social Media Design
链接: https://arxiv.org/abs/2607.01807
作者: JaeWon Kim
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Young people consistently say they want authentic self-expression, less judgment, and more interpersonal trust on social media, yet they rarely manage to engage that way. My dissertation argues that the obstacle is normative rather than individual: how youth engage is governed less by personal choice than by platform norms, peer perception, and beliefs about how others behave. I take a social norms approach to youth social media design organized around three claims. First, platform norms constrain individual behavior, producing a pluralistic ignorance in which youth enact norms they privately reject. Second, design interventions are themselves shaped by existing norms, so whether a feature works depends on the environment around it, which means relational goals such as privacy must be treated as social norms rather than individual settings. Third, a societal norm about what ``social media’’ is – equating it with a few mainstream platforms – confines policy and design to mitigating those platforms rather than actively envisioning supportive alternatives. Together these claims motivate my dissertation research: engaging youth directly in designing and building an evidence-based independent platform whose features consistently signal that building trusted connections is what the space is for.
[HC-10] Adapting CCDF Plots for Visualizing Ordinal Regression Results
链接: https://arxiv.org/abs/2607.01747
作者: Abhraneel Sarma
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Cumulative-link ordinal regression models are an alternative approach for analysing ordinal data such as Likert items, which are widely used in Visualization (and other related fields like HCI, psychology etc.). There are many researchers who are strong proponents of this approach, as it makes less stringent assumptions about the data, compared to the more commonly used linear model or ANOVA. Yet, ordinal regression models have seen limited adoption. I posit that one possible reason for this might be due to the difficulty in visually representing the results from such models, and in communicating the key takeaways in an intuitive manner. I propose the use of (modified) Complementary Cumulative Distribution Function (mCCDF) plots to visualize the results of ordinal regression models, and demonstrate how the same takeaways that researchers present from analyses which treat ordinal data as metric can be easily communicated using mCCDFs.
[HC-11] From Answer Generators to Reasoning Facilitators: Designing AI Tutors for Mathematical Reasoning in High-Stakes Environments
链接: https://arxiv.org/abs/2607.01692
作者: Yuming Feng,Yuan Tian,Erica Zhao
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:The rapid integration of Large Language Models (LLMs) into educational technology threatens to reduce mathematical learning to mere answer generation. This paper presents a generative study, usability study, and 12-participant field deployment of AITutor, an interactive system that translates theoretical pedagogical mechanisms into concrete user interface features. We explore how junior-high students preparing for high-stakes exams (Zhongkao) interact with AI tutoring. Through mixed-methods triangulation (7,379 telemetry events, 8 contextual observations, 10 interviews), we reveal that students actively resist traditional Socratic dialogue under time pressure, repurposing “answer-first” shortcuts as vital diagnostic checkpoints. We demonstrate how features like layered worked examples, step-linked visual grounding, and metacognitive scaffolding lower the interaction cost of reasoning repair. We contribute a “Reasoning-Centered Product Loop,” offering actionable implications for designing AI that structurally supports the inspection, local repair, curriculum verification, and delayed retrieval of mathematical reasoning in the wild.
[HC-12] Evaluating Glanceable Multi-Device Family Health Tracking with Smartwatches and Home Displays
链接: https://arxiv.org/abs/2607.01618
作者: Lucas M. Silva,Evropi Stefanidi,Aehong Min,Franceli L. Cibrian,Jesus A. Beltran,Cassie Zeiler,Sabrina E. B. Schuck,Kimberley D. Lakes,Gillian R. Hayes,Daniel A. Epstein
类目: Human-Computer Interaction (cs.HC)
备注: Accepted with minor revisions for IMWUT 2026
Abstract:While ubiquitous computing research has explored diverse devices for personal health tracking, we know less about multi-device designs for family informatics, where health management is inherently collaborative. To understand how families adopt and perceive ubiquitous access to shared health data across contexts, we evaluated smartwatch-only, home display-only, and combined designs for tracking moods and goals, domains central to family health behavior regulation. 44 people across 12 families alternated between these designs over nine weeks. Log analysis revealed that mood tracking and goal reporting were significantly more frequent with the home display present compared to smartwatch-only use, despite an overall decline in mood tracking over time. Tracking peaked in afternoons, dropped on weekends, and occurred 2.6X more at home, with children tracking more consistently than adults across all designs. From interview analysis, we learned how family data glanceability on smartwatches supported opportunistic tracking and awareness while apart, whereas displays reminded families to self-track and collaborate during home routines including members that avoided wearables (e.g., non-participants). Multi-device redundancy accommodated diversity in routines, mobility patterns, and device preferences among members in the same family. We discuss opportunities for multi-device family informatics that accommodates different preferences through inclusive, glanceable, and adaptable ubiquitous data sharing.
[HC-13] Made to Feel: How Designers Bring Emotions into Affective Visualization IEEE-VIS
链接: https://arxiv.org/abs/2607.01593
作者: Yixin Bai,Ziyi Wang,Keke Wu,Fumeng Yang
类目: Human-Computer Interaction (cs.HC)
备注: IEEE VIS Short Paper (2026)
Abstract:Affective visualization is increasingly studied in visualization research, yet how designers bring emotions into their visualization work remains unexplored. This paper addresses this gap through semi-structured interviews with 15 visualization practitioners. Using hybrid thematic analysis, we identify: (1) three functions that emotions can serve for viewers (entry, engagement, outcome); (2) three facets of how designers work with emotion (data, design, audience), along with design strategies; and (3) ethical considerations in the design process. We also observe that affective intent often emerges during the design process rather than being planned from the outset, and that emotional impact arises from accumulated design choices rather than isolated visual elements. Finally, we highlight evaluation as a key challenge identified by our participants.
[HC-14] OrchestrXR: A Multi-Agent System for Idea-to-Prototype XR Study Authoring
链接: https://arxiv.org/abs/2607.01588
作者: Shuqi Liao,Chenfei Zhu,Karthik Ramani,Voicu Popescu
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Extended Reality (XR) has become an important interaction paradigm in Human-Computer Interaction (HCI). XR studies are used to investigate interaction, perception, and user behavior in immersive environments, and typically involve experimental tasks, 3D scenes, and interactive logic. However, turning an initial XR study idea into a runnable prototype remains fragmented across study design, scene construction, and interaction implementation. We present OrchestrXR, a multi-agent human-AI workflow for early-stage idea-to-prototype XR study authoring. Rather than treating XR study creation as one-shot generation, OrchestrXR supports a controllable workflow across study design, scene generation, and interaction generation through structured schemas, multi-agent orchestration, and interactive human-agent interfaces, producing a Unity-based prototype from a researcher’s idea. A user study with 12 XR researchers suggests that OrchestrXR provides effective support for early-stage XR study authoring with strong intent preservation across stages.
[HC-15] Mind the Trust Gap: Identifying (Mis)alignments in Teacher-Student Views Toward Control and Agency in K-12 Classroom AI
链接: https://arxiv.org/abs/2607.01506
作者: Tomohiro Nagashima,Lisa Siegrist,Niklas Scholz,Shintaro Sato,Martina Vincoli,Man Su
类目: Human-Computer Interaction (cs.HC)
备注: To be published in Proceedings of the ACM on Human-Computer Interaction, Volume 10, Issue 6, Article CSCW124 (October 2026)
Abstract:As Artificial Intelligence (AI)-based technologies have been integrated into school classrooms where multiple stakeholders (with different roles) interact with each other, it is critical to deeply understand stakeholder views in the classroom. In particular, prior work has not fully uncovered how teachers’ and school students’ views might or might not align well with each other, especially in K-12 classrooms. We conducted a speed-dating study using storyboards with 16 school students and 15 school teachers in Germany to investigate alignments and misalignments between their views on student-AI decision-making control in K-12 classroom. Through an explicit pair-matching analysis, we found that students and teachers had misaligned views on several key topics, including how much they trust AI and social and emotional aspects of student learning with AI. Findings also revealed the importance of teacher-student relationships outside of AI use that shape stakeholders’ views and interactions. We discuss potential reasons for the observed misaligned views and strategies to fill the perspective gaps. This study illustrates the complexities of preferences in teacher-student-AI interactions that depend on the dynamic relations among the stakeholders.
[HC-16] Insights from GitHub Community on the Matter Standard: Developer Perspectives and Challenges
链接: https://arxiv.org/abs/2607.01494
作者: Muhammad Hassan,Carl Gunter,Susan Landau,Masooda Bashir
类目: oftware Engineering (cs.SE); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
备注:
Abstract:Matter seeks to resolve longstanding interoperability problems in the Internet of Things (IoT), yet little is known about how developers experience the standard in day to day work. This paper examines over 13,000 issues from the official Project CHIP GitHub repository to understand the kinds of problems contributors report when implementing and integrating Matter. Using topic modeling and qualitative analysis, we identify four recurring areas of concern, Testing, Interoperability, Development, and Platform and Network, and describe how they manifest in the evolution of the codebase and tooling. The findings reveal systematic technical and integration challenges and point to concrete opportunities to refine Matter’s test infrastructure, cross vendor guidance, and documentation as the standard continues to mature.
[HC-17] Sign in the Air to Unlock: An Interface for authentication in Virtual and Augmented Reality Powered by Point-Voxel Cross-Attention Network
链接: https://arxiv.org/abs/2607.01435
作者: Neda Abdolrahimi,Thiru Siddharth,Frank Sicongchen,Vir V Phoha
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:Significant advancement of immersive technologies such as Virtual and Augmented Reality (VR/AR) and their integration into diverse aspects of modern life need authentication interfaces that are secure, intuitive, and compatible with embodied interaction. Traditional methods such as passwords, PINs, and device-based logins, break immersion and rely on external hardware. Recent 3D-specific behavioral approaches, such as hand-gesture, eye-tracking, and electroencephalography (EEG)-based methods, offer promising alternatives but often require specialized sensors or constrain natural movement, limiting usability in dynamic environments. We present Sign in the Air to Unlock, an in-air signature interface that enables users to authenticate by signing naturally in 3D space which is a familiar, personal, and reproducible gesture. To realize this interface, we design a point-voxel Cross-Attention Network (PV-Net) that jointly models local motion dynamics and global spatial structure from 3D trajectories. The model is evaluated on two datasets: the public DeepAirSig dataset (1,800 signatures from 40 users) and ImmAirsig, a new dataset collected using Meta Quest 2 in immersive VR (880 samples from 22 users). PV-Net achieves an Equal Error Rate of 2.5% on DeepAirSig and 76% classification accuracy on ImmAirSig. These findings highlight the potential of 3D behavioral interfaces for seamless, user-centric authentication that merges security with natural interaction in immersive environments.
[HC-18] Adoption and Impact of Command-Line AI Coding Agents : A Study of Microsofts Early 2026 Rollout of Claude Code and GitHub Copilot CLI
链接: https://arxiv.org/abs/2607.01418
作者: Emerson Murphy-Hill,Jenna Butler,Alexandra Savelieva
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Organizations rolling out agentic command line tools like Anthropic’s Claude Code and GitHub’s Copilot CLI need to know who will try them, who will keep using them, and whether the tools produce enough output to justify their cost. At organizational scale, token spend can run into millions of dollars annually, so misreading adoption, retention, or impact can make a rollout expensive without changing engineering velocity. Studying tens of thousands of engineers at Microsoft over its early-2026 rollout, we find that first use spread primarily through social networks, retention was associated more with engineers’ coding activity than with demographics, and adopters merged roughly 24% more pull requests than they would have otherwise. We use merged pull requests as our proxy for output – acknowledging that a merged PR is not the same as the value it delivers – and the lift persists across our four-month window. These results suggest that CLI coding agents are neither uniformly adopted nor mere novelty effects and that organizations should treat visible peer use as central to rollout strategy.
[HC-19] Mitigating Confirmation Bias through Hand-Drawing Videos IEEE-VIS2026
链接: https://arxiv.org/abs/2607.01359
作者: Chenyu Lin,Cindy Xiong,Icy Zhang
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to IEEE VIS 2026 Short Papers. 4 pages plus references
Abstract:Understanding data visualizations is essential for informed decision-making, yet interpretation is often shaped and even distorted by prior beliefs. We investigate whether an embodied pedagogical approach, in which viewers observe the dynamic hand-drawing of a visualization, can mitigate confirmation bias and improve interpretation accuracy. We conducted a study comparing static bar charts to videos in which charts are constructed through hand-drawing, across contexts that either align with or challenge participants’ prior beliefs. The results indicate that hand-drawn videos helped participants accurately interpret data, even when the data conflicted with their prior beliefs. This approach also reduced belief-consistent errors and increased belief-overriding responses. These findings suggest that exposing the construction process of a visualization supports more accurate reasoning and mitigates the influence of confirmation bias. Consequently, this work introduces a promising design space for bias-mitigating data interfaces.
[HC-20] hree Futures for the Diagnostic Radiologist: A Structured Disagreement About What AI Actually Changes
链接: https://arxiv.org/abs/2607.01253
作者: Jan Beger,Amine Korchi,Christoph A. Agten
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:
Abstract:Rationale. The diagnostic radiologist’s role in 2035 will not look like it does today. Imaging AI is already changing how worklists are organized, how reports are generated, and which cases require a radiologist’s attention. What remains genuinely contested is not whether the role changes but how. Approach. Three subject-matter experts (two radiologists and one health tech professional with more than 20 years of experience in medical imaging IT) independently authored 2035 job descriptions for the diagnostic radiologist using a shared template. Each author wrote from a distinct vantage point: one optimistic, one framed as a trade-off view incorporating workforce economics, and one structured around professional stratification. The three versions were published openly and subjected to a structured comparison across seven dimensions. Key findings. The three versions agree on direction but disagree on magnitude. All three describe a radiologist whose routine workload is AI-managed, who carries accountability for AI output, and who spends more time on complex cases and clinical collaboration than today’s radiologist does. They diverge on headcount, career security, and whether the profession expands broadly, concentrates into a smaller well-compensated group, or stratifies into sharply differentiated tiers. Conclusion. AI won’t eliminate the diagnostic radiologist. Whether it expands, concentrates, or stratifies the profession depends on choices health systems haven’t made yet. The clinical argument for optimism is real. So is the economic argument for caution. Both can be true simultaneously. Keywords: radiology workforce; artificial intelligence; diagnostic radiology; job redesign; medical imaging IT; AI governance
计算机视觉
[CV-0] WorldDirector: Building Controllable World Simulators with Persistent Dynamic Memory
链接: https://arxiv.org/abs/2607.02517
作者: Hanlin Wang,Hao Ouyang,Qiuyu Wang,Wen Wang,Qingyan Bai,Ka Leong Cheng,Yue Yu,Yixuan Li,Yihao Meng,Zichen Liu,Yanhong Zeng,Yujun Shen,Qifeng Chen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:We present WorldDirector, a highly controllable video world model framework designed for persistent dynamic object memory and unrestricted viewpoint exploration. Unlike existing world models that entangle physical dynamics with pixel rendering and rely on continuous visual observation to sustain motion, our framework explicitly decouples semantic motion orchestration from visual generation. By leveraging an LLM to coordinate 3D trajectories with camera movements and subsequently employing these orchestrated trajectories as control signals for video generation, our approach ensures strict physical logic and appearance stability, successfully preserving the exact visual identities of dynamic entities even when they re-enter the scene after prolonged periods out of view. Experimental results demonstrate that our method supports the synthesis of complex and extended events with unprecedented controllability and persistent dynamic object memory. Project Page: this https URL
[CV-1] Alignment Is All You Need For X-to-4D Generation
链接: https://arxiv.org/abs/2607.02516
作者: Qiaowei Miao,Kehan Li,Yawei Luo,Yi Yang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generative diffusion models excel at synthesizing high-quality images, videos, and 3D content under multimodal control. However, arbitrary user-defined modality-to-4D (X-to-4D) generation remains challenging due to the high cost of constructing diverse datasets and the limited scalability of existing methods. This paper presents Align4D, a flexible framework that translates any-modal input into coherent video-3D pairs, using video to guide 4D motion and 3D data to shape 4D geometry. Align4D introduces three key techniques: (1) Object Distance Alignment, which searches Video-Aligned and Multiview-Aligned Object Distances (VAOD/MAOD), respectively, to reconcile 4D renderings with video and the priors of multiview diffusion models; (2) Motion-Geometry Joint Alignment, which constrains known and unknown views through synchronized video and 3D inputs, ensuring consistent 4D generation; and (3) Asynchronous Optimization, which decouples Gaussian attribute and deformation network training to enhance motion and geometry fidelity. We further propose the X4D dataset, which integrates prompt, image, video, and 3D data for benchmarking. Experiments on X4D and Consistent4D demonstrate that Align4D achieves state-of-the-art quality and consistency in X-to-4D generation. Project page: this https URL.
[CV-2] PointDiT: Pixel-Space Diffusion for Monocular Geometry Estimation ICML2026
链接: https://arxiv.org/abs/2607.02515
作者: Haofei Xu,Rundi Wu,Philipp Henzler,Nikolai Kalischek,Michael Oechsle,Fabian Manhardt,Marc Pollefeys,Andreas Geiger,Federico Tombari,Michael Niemeyer
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICML 2026. Project page: this https URL
Abstract:State-of-the-art single-image 3D reconstruction methods often rely on complex hybrid architectures and loss functions, or compress geometry into latent spaces in order to leverage pre-trained latent diffusion models. In this work, we show that such architectural overhead and intricate loss formulations are unnecessary. We introduce a minimalist pixel-space Diffusion Transformer, built on a plain ViT, that operates directly on raw 3D point map patches and is conditioned on image tokens from a pre-trained DINOv3. Unlike existing latent diffusion approaches, we train our diffusion backbone entirely from scratch, eliminating the need for point map tokenizers. Despite its simplicity, our approach surpasses complex latent-based diffusion models while remaining significantly simpler than hybrid alternatives. Notably, it produces sharper geometric structure and is more robust in highly ambiguous regions, such as transparent objects.
[CV-3] From SRA to Self-Flow: Data Augmentation or Self-Supervision?
链接: https://arxiv.org/abs/2607.02508
作者: Dengyang Jiang,Mengmeng Wang,Harry Yang,Jingdong Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Representation alignment has become an effective way to accelerate diffusion transformer training and improve generation quality. Recent self-alignment methods, such as SRA and Self-Flow, further remove the dependency on external pretrained encoders by constructing alignment within the diffusion model itself. However, the mechanism behind the improvement from SRA to Self-Flow, dual-time scheduling, remains under-examined: Self-Flow attributes its gain to interactions between tokens at different noise levels, where cleaner tokens help infer noisier ones. In this work, we revisit this explanation and ask whether the gain instead comes from data augmentation along the noise dimension. To disentangle these factors, we introduce Attention Separation, which preserves the same dual-timestep input as Self-Flow while blocking attention between tokens assigned to different noise levels. Surprisingly, removing such interaction does not degrade performance and can even improve it, suggesting that the improvement from SRA to Self-Flow mainly comes from data augmentation. Furthermore,We show that Attention Separation itself provides an augmentation effect by splitting a single image into multiple effective training parts to expand the training data. Based on these observations, we combine self-representation alignment with dual-timestep and attention-separation augmentation, and demonstrate the effectiveness of this design on ImageNet.
[CV-4] Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots
链接: https://arxiv.org/abs/2607.02501
作者: Ling Xu,Chuyu Han,Borui Li,Hao Wu,Shiqi Jiang,Ting Cao,Chuanyou Li,Sheng Zhong,Shuai Wang
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Operating Systems (cs.OS)
备注: 12 pages, 2 figures, Project website: this https URL
Abstract:Embodied AI models now span vision-language-action (VLA) models and world-action models (WAMs), but practical deployment remains fragmented across model-specific Python stacks, backend assumptions, and robot-side glue code, especially on heterogeneous edge devices. Existing inference runtimes are designed mainly for request-response serving and therefore do not satisfy the runtime contract of embodied deployment: multi-rate execution inside closed-loop control, latency-first batch-1 inference on heterogeneous hardware, and extensible embodied interfaces beyond fixed token I/O. We present this http URL, a portable C++ inference runtime for embodied models. Based on an architectural analysis of representative VLA models and WAMs, this http URL captures a shared execution path and organizes it into five layers: input adapters, sequence builders, backbone execution, head plugins, and deployment adapters. The runtime provides modular multi-rate execution, latency-first fused inference, and extensible operator and I/O support, enabling deployment across heterogeneous devices, robots, and simulators through one backend abstraction. We evaluate this http URL on two VLA models, HY-VLA and pi0.5, and on a preliminary WAM benchmark using a LingBot-VA Transformer block. The VLA deployments achieve successful closed-loop execution with 100.0% and 91.0% task success rates, respectively. The WAM benchmark reduces block memory from 312.2 MiB to 88.1 MiB. These results show that this http URL improves deployment efficiency while preserving high accuracy across diverse embodied model architectures.
[CV-5] Seek to Segment: Active Perception for Panoramic Referring Segmentation ECCV2026
链接: https://arxiv.org/abs/2607.02497
作者: Song Tang,Shuming Hu,Xincheng Shuai,Henghui Ding,Yu-Gang Jiang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ECCV 2026, Project Page: this https URL
Abstract:Existing referring segmentation models passively process static images captured from fixed perspectives, limiting their applicability in Embodied AI, where agents must perform active perception in the continuous 360 ^\circ environments. To bridge this gap, we introduce a novel task: Active Panoramic Referring Segmentation (APRS). In this setting, an agent is required to adjust its viewing direction ( \Delta\theta, \Delta\phi ) to explore the 360 ^\circ environment, seeking the object specified by a user instruction for segmentation. To tackle this challenging task, we propose PanoSeeker, a memory-augmented agent for efficient APRS. Rather than relying on heuristic scanning, PanoSeeker integrates a Vision-Language Model (VLM) with EgoSphere, an explicit spatial visual memory. By progressively integrating sequential local observations into a unified 360 ^\circ representation, EgoSphere enables the agent to plan efficient and non-redundant search trajectories. Once the target is found, the agent performs active viewpoint alignment and outputs the segmentation mask. Furthermore, we curate an expert-annotated search trajectory dataset with memory timelines for Supervised Fine-Tuning, followed by Reinforcement Learning post-training to explicitly optimize PanoSeeker’s exploration efficiency. Extensive experiments on our newly established APRS benchmark demonstrate that PanoSeeker achieves superior search efficiency and segmentation accuracy, significantly outperforming adapted state-of-the-art baselines.
[CV-6] GeoMix: Descriptor-Free Visual Localization via Global Context and Multi-Detector Training ECCV2026
链接: https://arxiv.org/abs/2607.02486
作者: Yejun Zhang,Xinjue Wang,Zihan Wang,Esa Rahtu,Juho Kannala
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ECCV 2026
Abstract:Descriptor-free visual localization eliminates high-dimensional descriptor storage, preserves scene privacy, and simplifies map maintenance, yet its accuracy still lags far behind descriptor-based pipelines. We identify this gap to insufficient geometric discriminability in geometry-only matching. Without visual appearance, current methods underutilize local geometry cues, lack the global context among keypoints, and overfit to a single keypoint detector. We further observe that descriptor-free matching naturally enables multi-detector training, as heterogeneous keypoints can be optimized in a shared geometry-only space without aligning descriptor spaces. Building on these insights, we propose GeoMix, a descriptor-free 2D-3D matching framework that strengthens geometric discriminability at three levels. Locally, directional and distance-aware embeddings enrich neighborhood aggregation with fine-grained spatial structure. Globally, learnable context nodes aggregate and redistribute scene-wide information via cross-attention to resolve ambiguities beyond local receptive fields. At the training level, Mix-Training exploits this detector-agnostic geometry space to learn representations across multiple keypoint detectors. Extensive experiments on MegaDepth, Cambridge Landmarks, 7Scenes, and Aachen Day-Night show that GeoMix sets a new state of the art among descriptor-free methods, reducing 75th-percentile rotation error by 89% and translation error by up to 90% over the previous best, while generalizing zero-shot to unseen detectors and narrowing the gap to descriptor-based pipelines. Code is available at \hrefthis https URL\textthis links .
[CV-7] Combating Textual Noise and Redundancy: Entropy-Aware Dense Visual Token Pruning ECCV2026
链接: https://arxiv.org/abs/2607.02484
作者: Xuehui Wang,Xuankun Yang,Wei Shen
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to ECCV 2026
Abstract:Visual token pruning is a crucial strategy for accelerating VLMs by compressing redundant image patches, yet existing methods often fail to preserve critical cues under dense instructions and fine-grained queries. In this paper, we investigate this failure and identify two underlying bottlenecks: the widespread dispersion of textual noise that corrupts dense cross-modal scoring, and the feature fragmentation inherent to standard token selection. To address these issues, we propose Entropy-Aware Dense Pruning (EADP), a framework that reformulates pruning as a structured compression problem. EADP first leverages statistical entropy to quantify and filter out textual noise, yielding a robust, fine-grained instruction relevance score. Subsequently, instead of naive Top-K selection, EADP casts token selection as a submodular maximization problem with a spatial prior, explicitly ensuring a holistic and non-redundant visual representation. Extensive experiments demonstrate that EADP improves the accuracy-efficiency trade-off of VLMs, robustly preserving fine-grained visual cues under strict token budgets while achieving SoTA performance on challenging multimodal benchmarks.
[CV-8] EAGLE-360: Embodied Active Global-to-Local Exploration in 360circ
链接: https://arxiv.org/abs/2607.02479
作者: Jingtao Xu,Zizhuo Lin,Jianwen Sun,Yi Yang,Yawei Luo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint
Abstract:While Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in standard visual understanding, adapting them for active visual search in 360 ^\circ panoramic environments exposes fundamental limitations. Specifically, standard MLLMs struggle to effectively model inherent panoramic properties, such as severe polar distortion and continuous cylindrical topologies, which significantly degrades target detection accuracy. Consequently, existing panoramic search methods attempt to compensate by relying heavily on fragmented local viewpoints. Burdened by rigid initialization and a lack of global panoramic priors, these approaches suffer from myopic, inefficient exploration and struggle with robust error recovery when targets are out of view. To overcome these challenges, we propose EAGLE-360, a novel Embodied Active Global-to-Local Exploration framework. Rather than performing exhaustive local searches, EAGLE-360 leverages global priors to establish an initial holistic perspective, iteratively reasoning and progressively narrowing the search space. Architecturally, we adapt RoPE Rolling, a coordinate-shifting positional encoding mechanism, to seamlessly model the continuous topologies of panoramas. To facilitate this paradigm, we construct the large-scale EAGLE-360 dataset, comprising 14,000+ 4K panoramas and 70,000+ rounds of high-quality VQA dialogues. By employing a training pipeline that integrates Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), we effectively elicit complex spatial reasoning and tool-calling capabilities. Extensive experiments demonstrate that EAGLE-360 establishes a new state-of-the-art for 360 ^\circ visual search, achieving nearly an 8-fold increase in accuracy over the base model while significantly enhancing exploration efficiency.
[CV-9] Interpretation-Oriented Cloud Removal via Observation-Anchored Residual Flow with Geo-Contextual Alignment ECCV2026
链接: https://arxiv.org/abs/2607.02471
作者: Ziyao Wang,Maonan Wang,Yucheng He,Xianping Ma,Ziyi Wang,Hongyang Zhang,Yirong Cheng,Man-on Pun
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by ECCV 2026
Abstract:Cloud removal (CR) is essential for optical remote sensing, serving as a prerequisite for reliable downstream interpretation, such as semantic segmentation and change detection. However, existing CR approaches often prioritize visual realism while overlooking their impact on subsequent analytical tasks, leading to semantic drift and degraded downstream performance. To address this issue, we propose Geo-Anchored Cloud Removal (GACR), a unified framework that jointly ensures faithful reconstruction and robust interpretability. At its core, GACR incorporates Observation-Anchored Residual Flow (OAR-Flow), which reformulates CR as a physically grounded residual inversion process. By anchoring the generative trajectory to the cloudy observation rather than pure noise, OAR-Flow enables fast, stable, and faithful reconstruction. To further preserve semantic structures critical for downstream interpretation, GACR integrates Geo-Contextual Prior Alignment (GCPA) to constrain the reconstruction within a semantic manifold induced by a Vision Foundation Model (VFM). Consequently, GACR strictly maintains the spatial-semantic integrity of complex landscapes. Extensive experiments across six CR datasets and twelve downstream tasks demonstrate that GACR produces superior reconstruction quality while consistently improving downstream task accuracy. The code is available at this https URL.
[CV-10] OrbitQuant: Data-Agnostic Quantization for Image and Video Diffusion Transformers
链接: https://arxiv.org/abs/2607.02461
作者: Donghyun Lee,Jitesh Chavan,Duy Nguyen,Sam Huang,Liming Jiang,Priyadarshini Panda,Timo Mertens,Saurabh Shukla
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Diffusion transformers (DiTs) achieve state-of-the-art image and video generation, but their multi-step sampling and growing parameter count make inference expensive. Post-training quantization (PTQ) is the natural remedy, yet DiT activations shift across timesteps, prompts, and guidance branches, forcing prior methods to re-fit calibration data for every new checkpoint or modality. We present OrbitQuant, a data-agnostic weight-activation quantizer that bypasses range estimation by quantizing in a normalized, rotated basis. In this basis, a randomized permuted block-Hadamard (RPBH) rotation concentrates each coordinate around one fixed, known marginal regardless of the input, so a single Lloyd-Max codebook serves all timesteps, prompts, and layers of a given input dimension. We extend the same quantizer to weight rows offline, absorbing the rotation into the weights so that it cancels inside each linear layer and only a forward rotation on the activations remains at runtime. The same recipe transfers from image to video with no per-modality tuning. Across FLUX.1, Z-Image-Turbo, Wan 2.1, and CogVideoX, it sets the state of the art for PTQ at several low-bit settings. It also pushes PTQ of image diffusion transformers to W2A4 with usable generation quality.
[CV-11] MARVEL: Margin-Aware Robust von Mises-Fischer Expert Learning for Long-Tailed Out-of-Distribution Detection
链接: https://arxiv.org/abs/2607.02435
作者: A.S. Anudeep,Vaanathi Sundaresan
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:For clinical deployment, it is essential that automated diagnostic systems remain reliable when confronted with previously unseen cases, yet deep models routinely misclassify out-of-distribution (OOD) inputs with high confidence, underscoring the need for more robust OOD detection methods. Although substantial effort has been devoted to improving model robustness, most of the existing literature assumes balanced datasets, evaluates OOD detection on coarse or non-clinical OOD sources, or lacks comprehensive assessment across diverse OOD scenarios. To address the gaps, we propose a novel methodology trained on diverse and imbalanced medical datasets and evaluated across a clinically reflective OOD spectrum. Our framework comprises three key components: (1) a Nonlinear von Mises-Fisher (NvMF) classifier capable of learning non-linear decision boundaries, with theoretical proof of its asymptotic connection to cosine classifiers; (2) a multi-expert framework in which margin-aware NvMF classifiers specialise in different regions of label distribution to better handle imbalance; and (3) an outlier expert trained explicitly to distinguish inlier from outlier data, thereby strengthening OOD detection. Evaluation on RFMiD, ISIC2019, and NCTCRC datasets demonstrates consistent improvements over state-of-the-art methods, achieving mean FPR95 reductions of 8.45%, 13.02%, and 36.90% respectively. These gains are further supported by comprehensive ablations that validated the contributions of each component. This enables reliable identification of unfamiliar cases for deferral to clinicians, supporting safer AI-assisted diagnosis in real-world workflows. Our code is available at this https URL.
[CV-12] Learning to Evolve Scenes: Reasoning about Human Activities with Scene Graphs
链接: https://arxiv.org/abs/2607.02425
作者: Francesca Pistilli,Simone Alberto Peirone,Giuseppe Averta
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page at this https URL
Abstract:Understanding human behavior while interacting with the surrounding world is crucial for many applications of embodied AI. First-person videos are particularly informative for this problem, as they well capture how activities reshape the scene over time. However, existing approaches often rely on implicit visual or language-aligned representations, disregarding structured reasoning over the scene dynamic. We argue that explicit, compositional and editable representations of human-environment interactions can play a crucial role for rich grounded activity understanding. To this end, we introduce SG-Ego, a large scale annotation set extending Ego4D with spatio-temporal scene graphs, where relations triplets are consolidated over time into explicit time-evolving descriptions of the scene state. To reason over this representation, we propose GLEN, a graph-based model that operates over scene graph sequences to both align them with textual actions and model their temporal evolution. In addition, we formulate the activity-driven graph-edit forecasting (A-GEF) problem, a novel task that casts scene dynamics as a sequence of structured transformations conditioned on ongoing actions, enabling explicit reasoning about how scenes change over time. We validate our approach across multiple downstream tasks, spanning retrieval benchmarks as EgoMCQ and EgoCVR, as well as long-horizon reasoning benchmarks as EXPLORE-Bench and the newly introduced A-GEF. GLEN achieves strong results compared to raw video baselines and it excels in reasoning settings, typically addressed only with MLLMs, while enabling controllable and structured predictions of scene dynamics driven by human activities. We believe our results establish spatio-temporal scene graphs, together with models that reason over them, as strong compositional and interpretable representations for video understanding and potentially beyond.
[CV-13] Wavelet-Guided Semantic Signal Compensation for Inversion-Free Image Editing ECCV2026
链接: https://arxiv.org/abs/2607.02421
作者: Anqi Tang,Wenhao Sun,Zhaoqiang Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ECCV 2026
Abstract:Text-guided image editing aims to modify visual content according to a target prompt while preserving the background. Recent inversion-free image editing frameworks such as FlowEdit have demonstrated strong editing capability without requiring inversion. Empirically, FlowEdit can achieve substantial semantic changes under appropriate hyperparameter settings. However, we observe that under certain global attribute shifts, the editing trajectory may not effectively move away from the source distribution in the early timesteps. Our analysis suggests that in the high-noise regime, the dominant manifold-seeking flow toward the data manifold can reduce the influence of the text-conditioned direction, leading to limited global modification while background structures remain only moderately preserved. Inspired by this observation, we propose an inversion-free, frequency-aware semantic compensation strategy that strengthens the effective signal in the early stage of generation, while maintaining structural consistency in the background. The proposed method improves global editing capacity without sacrificing background fidelity.
[CV-14] LIME: Learning Intent-aware Camera Motion from Egocentric Video
链接: https://arxiv.org/abs/2607.02417
作者: Boyang Sun,Jiajie Li,Yung-Hsu Yang,Chenyangguang Zhang,Tim Engelbracht,Sunghwan Hong,Cesar Cadena,Marc Pollefeys,Hermann Blum
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Autonomous robots often need to move their camera before they can act: to inspect an object, reveal an occluded region, or obtain a view that responds to a user’s intent. While vision-language navigation translates instructions to base motion and vision-language-action policies map instructions to manipulation actions, language-conditioned camera motion remains comparatively underexplored as a first-class action. We formulate language-conditioned camera motion generation: given a current RGB observation and a free-form natural-language intent, predict a relative target camera pose for the next observation. This task is inherently non-trivial: viewpoint changes are driven by latent perceptual intentions, and a valid motion may operate at different semantic granularity, from entering a room to looking around a corner, inspecting a visible object, or revealing an occluded detail. To model this structure, we mine multi-intention camera-motion supervision from egocentric video, pairing plausible intents and observation-gain descriptions with relative SE(3) target poses. We propose LIME, a vision-language camera-motion generator that combines an auto-regressive observation-gain output with a continuous flow-matching pose head. This design lets the model jointly predict what the next view should reveal while representing multi-hypothesis target views. Across experiments and downstream robotic tasks, we show that LIME can learn to actively choose camera poses from passive human video, turning ordinary egocentric recordings into supervision for intent-aware active perception.
[CV-15] xt-Driven 3D Indoor Scene Synthesis in Non-Manhattan Environments
链接: https://arxiv.org/abs/2607.02407
作者: Xianhui Meng,Zirui Song,Yuchen Zhang,Li Zhang,Yongxuan Lv,Xiuying Chen,Kun Wang,Yan Luo,Kai Chen,Hangjun Ye,Long Chen,Jun Liu,Xiaoshuai Hao
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in 3D indoor synthesis for Manhattan environments. However, existing methods often fail to capture plausible object layout patterns in non-Manhattan settings, primarily because they struggle to model non-orthogonal spatial relationships, leading to high geometric violations and low physical fidelity. To address this challenge, we propose SPG-Layout, a novel text-driven framework designed to generate physically plausible indoor scenes within complex non-Manhattan environments. Specifically, we first utilize statistical priors of object distributions to guide the training process, enhancing environmental understanding and fidelity. Furthermore, mirroring human design workflows, we adopt a hierarchical layout strategy that prioritizes the placement of large objects, thereby substantially minimizing layout violations. By synergizing these components, SPG-Layout achieves a balanced optimization of semantic realism and physical plausibility. To evaluate performance in these complex settings, we constructed a new benchmark comprising 500 diverse non-Manhattan environments. Extensive experiments demonstrate that SPG-Layout consistently and significantly outperforms existing methods across both Manhattan and non-Manhattan environments. The code will be publicly released.
[CV-16] Object-centric LeJEPA
链接: https://arxiv.org/abs/2607.02404
作者: Jakob Geusen,Ender Konukoglu
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Image encoders trained with LeJEPA can deliver strong features for downstream tasks, but, like other image-level self-supervised methods, typically require large training datasets. Aligning representations at the level of objects rather than whole scenes promises greater data efficiency, but doing this in a completely self-supervised way, effectively jointly partitioning a scene and representing its objects, is unstable: the two are locked in a cyclic dependency, partitioning requires meaningful representations, while meaningful representations require consistent partitioning. We sidestep this instability by taking object masks as given during training, using cheap, off-the-shelf SAM proposals. We extend LeJEPA - whose distributional anti-collapse objective ports naturally from whole images to variable-sized sets of objects - to align object-centric representations rather than whole images. An additional instance-separating loss, which treats other objects in the same scene as negatives, further boosts downstream performance. Across two model scales and 10-100% of COCO, object-level LeJEPA outperforms image-level LeJEPA on tracking (DAVIS), classification (ImageNet-1k), segmentation (ADE20k), and re-identification (NAVI).
[CV-17] ACID: Action Consistency via Inverse Dynamics for Planning with World Models
链接: https://arxiv.org/abs/2607.02403
作者: Gawon Seo,Dongwon Kim,Suha Kwak
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: [this https URL]( this https URL )
Abstract:Decision-time planning with action-conditioned world models has become a popular paradigm for embodied control. However, the standard planning cost judges a candidate solely by how close its predicted terminal state lies to the goal, leaving the realizability of the intermediate transitions unchecked – a predicted trajectory can look convincing while the environment rollout drifts away from it. In this paper, we propose ACID, a decision-time planning framework that introduces cycle action consistency: the action inferred backward from a predicted transition by an inverse dynamics model should recover the one that was conditioned on. We fold this per-step residual into the planning cost via a scale-invariant adaptive weight. Across four action-conditioned world models and six tasks spanning rigid and deformable manipulation, articulated control, and visual navigation, ACID consistently improves planning and matches the baseline’s accuracy with substantially less planning compute.
[CV-18] Show Me Examples: Inferring Visual Concepts from Image Sets
链接: https://arxiv.org/abs/2607.02402
作者: Nick Stracke,Kolja Bauer,Stefan Andreas Baumann,Miguel Angel Bautista,Josh Susskind,Björn Ommer
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: for code, view this https URL
Abstract:Vision-language models (VLMs) can follow complex textual instructions, yet they struggle to reason from purely visual context. In particular, current models fail to infer shared concepts from sets of example images and apply them to new inputs. We introduce Visual Concept Inference from Sets (VICIS), a task that evaluates this capability. Given a small context set of images sharing a concept and a query image, the model must generate new images that preserve the context-defined concept while remaining consistent with the query. We show that state-of-the-art VLMs perform poorly on this task, often ignoring the visual context or defaulting to biased generations. To address this gap, we propose a training framework and architecture that learn to infer visual concepts from image sets and extract concept-specific embeddings from queries. Experiments on synthetic data and large-scale ImageNet/WordNet data show that our model generates more accurate and diverse outputs and generalizes to unseen concepts and modalities such as sketches.
[CV-19] ransformer Geometry Observatory TGO-II: Representational Similarity Observatory
链接: https://arxiv.org/abs/2607.02386
作者: Kaustubh Kapil,Kishor P. Upla
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:While Vision Transformers have achieved remarkable success across computer vision and language applications, the geometric evolution of their internal representations throughout training remains insufficiently understood. Existing analyses primarily focus on attention mechanisms and downstream performance, leaving the evolution of representation geometry largely unexplored. In this work, we present Transformer Geometry Observatory-II (TGO-II), a representation geometry analysis framework designed to investigate how Transformer representations evolve during supervised training. TGO-II analyzes Vision Transformer (ViT-Small/16) representations using Centered Kernel Alignment (CKA), Singular Vector Canonical Correlation Analysis (SVCCA), Two-Nearest Neighbor Intrinsic Dimensionality (TwoNN-ID), and token covariance analysis. Our experiments reveal three key observations. First, both CKA and SVCCA progressively decrease throughout training, indicating increasing representational specialization across Transformer layers. Second, intrinsic dimensionality consistently increases before stabilizing, suggesting progressive expansion of the representation manifold into a larger set of locally accessible degrees of freedom. Third, token covariance and coupling analyses demonstrate that strong token interaction structure persists throughout training, challenging the hypothesis that increasing representational complexity arises primarily from progressive token independence. These findings suggest that representation complexity and layer specialization emerge simultaneously during training. Manifold expansion appears to occur without token decoupling. Together, these observations motivate a new hypothesis in which Vision Transformers increase representational complexity through progressively richer transformations while preserving strong token interaction structure during learning.
[CV-20] Representation Distribution Matching for One-Step Visual Generation
链接: https://arxiv.org/abs/2607.02375
作者: Lan Feng,Wuyang Li,Eloi Zablocki,Matthieu Cord,Alexandre Alahi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We elucidate the design space of Representation Distribution Matching (RDM), our name for the paradigm that trains a one-step image generator by matching generated and reference feature distributions under frozen pretrained encoders. We identify two design axes, how the distributions are compared and the representations they are compared in, and controlled studies along them yield three findings. First, the classical MMD, which could not train convincing generators a decade ago, becomes a strong and scalable objective once estimated right. Second, the generated batch is then the operative variable, with an optimum above 2048, far beyond customary batch sizes. Third, any single representation can be gamed, driven below the real score while images stay visibly fake, so we match against a balanced battery of encoders and evaluate with SW_r14, a Sliced-Wasserstein distance over 14 encoders that is independent of the training loss and resists gaming. Combining the preferred choices yields improved RDM (iRDM): it sets the one-step state of the art on ImageNet at SW_r14 1.30, corroborated by PickScore, a human-preference proxy our objective never optimizes, which prefers it over the prior best one-step generator on 71.2% of matched samples. The same recipe post-trains the four-step FLUX.2 [klein] into a one-step generator, surpassing the four-step version on GenEval, 0.826 to 0.794, and on PickScore, 22.76 to 22.58, in 90 H200 GPU-hours. Project page: this https URL.
[CV-21] Learning Spectral and Polarimetric Clues for One-to-Multimodal Novel View Synthesis ECCV2026
链接: https://arxiv.org/abs/2607.02372
作者: Federico Lincetto,Gianluca Agresti,Mattia Rossi,Piergiorgio Sartor,Pietro Zanuttigh
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ECCV 2026. Project page: this https URL
Abstract:Neural rendering techniques allow for accurate reconstruction of the geometry and color appearance of 3D scenes. Some methods have extended their use to additional imaging modalities, such as multispectral, infrared, or polarimetric data. However, all of these approaches require expensive sensors and calibrated setups to capture new multimodal frames for each new scene. We propose Spectral and Polarimetric Implicit Learned Representation (SPoILeR), a novel method to obtain multi-view consistent renderings of unconventional modalities for scenes where either only RGB frames or very few of the additional modalities are available. Thanks to a multimodal pre-training phase, the model learns the mutual correlation between different modalities. This step allows predicting accurate renderings of unconventional modalities during a fine-tuning phase supervised only by RGB images. Experimental results show that the approach can accurately render infrared, polarimetric, and multispectral frames for scenes where no input sample captured by these types of sensors is provided.
[CV-22] VisionAId: An Offline-First Multimodal Android Assistant for People with Visual Impairment Featuring Personalized Object Retrieval
链接: https://arxiv.org/abs/2607.02371
作者: Cristian-Gabriel Florea,Stelian Spînu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures. Project repository available at: this http URL
Abstract:Over 285 million people worldwide live with a visual impairment, for whom everyday tasks such as avoiding obstacles, locating personal belongings, recognizing familiar faces, or handling cash remain persistent obstacles to personal autonomy. Existing assistive applications are typically limited to recognizing predefined categories, depend heavily on cloud connectivity, or require dedicated hardware. We present VisionAId, an Android application that turns a commodity smartphone into a real-time visual assistant. The system integrates six on-device deep learning models (metric monocular depth estimation, instance segmentation, visual and facial embeddings, face detection, and a custom banknote detector) running entirely through ONNX Runtime, with an optional cloud large language model (Google Gemini Flash) used only for narrative scene description and automatic object labeling. A distinctive contribution is a few-shot pipeline for personal objects: the user photographs an object from several angles, and the system later locates that specific instance in the environment, guiding the user toward it with augmented-reality markers, spatial audio, and distance-proportional haptics. All feedback is multimodal (Romanian speech synthesis, voice commands, vibration). On a reference device (Samsung Galaxy S21 Ultra), INT8 quantization reduces depth latency from ~1200 ms to ~491 ms, the custom banknote detector reaches an mAP@50 of 0.986, and metric depth is calibrated to below 1 cm of error within 3 m.
[CV-23] GAP-GDRNet: Geometry-Aware Monocular Visual Pose Sensing on a Single-Target Synthetic Spacecraft Dataset
链接: https://arxiv.org/abs/2607.02360
作者: Yonglong Zhang,Yang Liu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Monocular relative pose sensing is a central perception problem in non-cooperative rendezvous and on-orbit servicing. In spacecraft images, however, weak surface texture, thin appendages, illumination changes, and partial occlusion often leave only sparse and unstable geometric evidence. This article presents GAP-GDRNet, a geometry-aware attention-enhanced framework for monocular RGB-based 6D pose sensing. The method follows the geometry-guided direct regression paradigm of GDR-Net and modifies two points in the pipeline: an attention-based feature refinement (AFR) module is placed before dense geometric prediction, and a patch-level geometric self-attention (PGSA) module is inserted into Patch-PnP. AFR reinforces global spacecraft structure together with local weak-texture cues; PGSA then relates downsampled geometric patches before final pose regression. A Blender-based annotation process supplies target masks, visible-region masks, dense model-coordinate maps, camera intrinsics, and 6D pose labels for supervised training.
[CV-24] he Moving Eye: Enhancing VLA Spatial Generalization via Hybrid Dynamic Data Collection IROS2026
链接: https://arxiv.org/abs/2607.02322
作者: Jincheng Tang,Yilong Zhu,Zhengyuan Xie,Jiang-Jiang Liu,Jiaxing Zhang
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: IROS 2026
Abstract:Vision-Language-Action (VLA) models have shown remarkable promise in generalized robotic manipulation. However, their spatial generalization remains fragile. We argue that simply increasing the number of viewpoints is insufficient. Models often fall into the trap of Shortcut Learning, latching onto spurious correlations (e.g., fixed relative poses between objects or between the camera and robot base) rather than learning true spatial relationships. In this work, we propose a data-centric solution to enhance VLA spatial generalization. We utilize a dual-arm setup where one arm performs manipulation while the other serves as a mobile environmental camera. We systematically evaluate three data distribution patterns: Fixed, Multi-Fixed, and Moving Views. Our findings reveal that a hybrid strategy, combining continuous camera motion with diverse static viewpoints, yields the best performance by substantially reducing spurious correlations while maintaining training stability. Our experiments demonstrate that this strategy mitigates spurious correlations, enabling VLAs to generalize to unseen camera poses and object configurations where simply adding more static viewpoints fails. Crucially, we reveal that the susceptibility to shortcut learning and the struggle with spatial generalization are universal characteristics shared across diverse architectures. Consequently, all evaluated models (ACT, Diffusion, and VLA models including Pi0 and Gr00t) benefit significantly from our mixed data strategy.
[CV-25] NEvo: Neural-Guided Evolutionary Video Synthesis for Dynamic Visual Selectivity
链接: https://arxiv.org/abs/2607.02317
作者: Yingtian Tang,Sogand Salehi,Ming Zhou,Amir Zamir,Leyla Isik,Martin Schrimpf
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 6 figures
Abstract:The human brain processes dynamic visual input through hierarchically organized, functionally specialized regions. While recent in silico brain encoding models can synthesize optimal stimuli to probe selectivity in different brain regions, prior work has been largely limited to static images, leaving dynamic visual processing underexplored. We introduce a novel neural-guided video synthesis framework that generates stimuli optimized for target brain regions across visual cortex. Our method performs evolutionary search over a structured prompt space, guided by a dynamic encoding model that predicts voxel-level responses to video inputs. By maximizing predicted activity for a target ROI, the framework efficiently discovers hyper-activating dynamic stimuli that consistently surpass handcrafted localizer videos. The synthesized videos recover known selectivities across ventral, dorsal, and lateral pathways, and further reveal systematic differences in sensitivity to temporal dynamics. A searchlight analysis provides new insight into the progression toward increasingly complex social-dynamic features along the lateral stream, further supported by probing with synthesized abstract, non-naturalistic stimuli. Taken together, our framework enables in silico exploration of dynamic visual selectivity, with new predictions for in vivo experiments
[CV-26] InvSplat: Inverse Feed-Forward Scene Splatting
链接: https://arxiv.org/abs/2607.02301
作者: Polina Karpikova,Wenjing Bian,Haofei Xu,Hendrik Lensch,Andreas Geiger
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Inverse rendering aims to recover both 3D geometry and physically meaningful material properties from images, enabling applications such as relighting and novel view synthesis. Optimization-based methods achieve high fidelity but require costly per-scene fitting, while image-space learning-based approaches often suffer from multi-view inconsistencies and lack an explicit 3D representation for stable novel view rendering. We present a feed-forward multi-view reconstruction framework for inverse rendering that directly predicts a structured 3D Gaussian representation with intrinsic material attributes. Each Gaussian primitive is parameterized by mean, normal, opacity, rotation, scale, albedo, metallic, and roughness, enabling a disentangled and physically grounded scene representation. Our model integrates priors from a material estimation network with a multi-view 3D reconstruction backbone, allowing joint prediction of geometry and reflectance parameters in a single forward pass. Experiments on synthetic and real-world datasets demonstrate improved multi-view consistency compared to 2D baselines, accurate material recovery, and stable novel view rendering. Our representation further supports physically-based relighting and more faithful modeling of view-dependent effects compared to existing RGB-based feed-forward reconstruction methods. Our project webpage is: \hrefthis https URL\textthis https URL .
[CV-27] Search-based Testing of Vision Language Models for In-Car Scene Understanding
链接: https://arxiv.org/abs/2607.02300
作者: Lev Sorokin,Chen Yang,Ken E. Friedl,Andrea Stocco
类目: Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
备注: Accepted at the Industry Track of the 41st IEEE/ACM International Conference on Automated Software Engineering (ASE 2026)
Abstract:In the automotive domain, in-car scene understanding (ISU) enables the detection of safety-critical events, such as driver distraction, and supports drivers or passengers by analyzing the in-car scene and adapting the environment (e.g., ambient lighting). The industry is increasingly exploring vision-language models (VLMs) to interpret camera-recorded in-car scenes and extract information for downstream reasoning tasks. However, VLMs may generate incomplete, erroneous, or misleading scene descriptions, highlighting the need for systematic testing. Collecting real in-vehicle data is costly, difficult to scale, and often infeasible, particularly in early design stages. In this paper, we present ISU-Test, an automated testing approach that combines rendering-based scene generation with search-based testing to evaluate ISU systems. By framing testing as an optimization problem and systematically modifying scene parameters, our method generates diverse in-car scenarios and explores a wide range of configurations. We evaluate ISU-Test on both an industrial prototype and open-source VLMs across two case studies: question answering and captioning, comparing against randomized scenario generation. Results show that ISU-Test significantly outperforms the baseline, achieving up to 10 times higher failure rates and up to 3.6 times higher failure coverage.
[CV-28] Dual-Selective Network for Domain-Incremental Change Detection ICANN-2026
链接: https://arxiv.org/abs/2607.02299
作者: Yuzhi He,Junxi Huang,Haorui Wu,Jiahui Qu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: International Conference on Artificial Neural Networks, ICANN-2026
Abstract:Domain-incremental change detection (DICD) continuously adapts models to new geographic domains while preserving prior knowledge. However, a structural mismatch exists: the label space remains fixed while domain characteristics vary drastically. Consequently, incremental models struggle to maintain stable spatial change representations across domains. Existing strategies, such as replay-based or regularization-based methods, often fail to scale to long domain sequences, leading to knowledge degradation or increased computational cost. We propose Dual-Selective Incremental Network (DSINet), a unified framework built on visual state space models. DSINet leverages Mamba’s input-dependent selective mechanism through a selective spatial state unit (S3U). This unit preserves stable spatial change structures while filtering domain-specific variations during feature propagation. As a result, spatial representations remain stable across domains, preventing the accumulation of feature confusion over incremental steps. Additionally, we employ a concentration-balanced distillation (CBD) strategy to stabilize knowledge transfer across domains. It balances hardness and confidence concentration effects during incremental updates. This ensures reliable probability mass allocation and prevents over-smoothing or mode collapse during distillation. Together, these mechanisms maintain stable learning dynamics throughout incremental stages. Experimental results demonstrate that DSINet mitigates knowledge degradation across long domain sequences while maintaining the linear computational efficiency of state space models.
[CV-29] Real-Time Visual Intelligence on Low-Cost UAVs: A Modular Approach for Tracking Scanning and Navigation
链接: https://arxiv.org/abs/2607.02298
作者: Andrei-Marian Ungureanu,Stelian Spînu
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 5 figures. Project repository available at: this http URL
Abstract:Autonomous drones are rapidly transforming modern warfare and civil applications alike. This paper presents the development of an integrated intelligent drone system designed to serve as a personal assistant. Leveraging the DJI Tello drone platform, we implemented a modular architecture that integrates three core artificial intelligence functionalities: facial detection, facial recognition, and depth estimation from monocular vision. A web-based interface enables seamless drone control and real-time video monitoring, while a Python-based server processes visual data and executes inference pipelines using lightweight neural models optimized for embedded systems. Unlike existing commercial solutions, this system emphasizes accessibility, low-cost hardware, and open-source technologies. The system demonstrates robust performance in real-world conditions, including person tracking, indoor scanning, and autonomous line following using virtual sensors. This project validates the applicability of advanced AI techniques in real-time robotic systems and illustrates the feasibility of deploying them on constrained hardware, providing a foundation for future research in autonomous UAVs for military, rescue, and surveillance missions.
[CV-30] Optimizing Visual Generative Models via Distribution-wise Rewards ICML2026
链接: https://arxiv.org/abs/2607.02291
作者: Ruihang Li,Mengde Xu,Shuyang Gu,Leigang Qu,Fuli Feng,Han Hu,Wenjie Wang
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: ICML 2026 Main
Abstract:Conventional reinforcement learning strategies for visual generation typically employ sample-wise reward functions, yet this practice frequently results in reward hacking that degrades image diversity and introduces visual anomalies. To address these limitations, we present a novel framework that finetunes generative models using distribution-wise rewards, ensuring better alignment with real-world data distributions. Unlike rewards that evaluate samples individually, distribution-wise reward accounts for the data distribution of the samples, mitigating the mode collapse problem that occurs when all samples optimize towards the same direction independently. To overcome the prohibitive computational cost of estimating these rewards, we introduce a subset-replace strategy that efficiently provides reward signals by updating only a small subset of a generated reference set. Additionally, we apply RL to optimize post-hoc model merging coefficients, potentially mitigating the train-inference inconsistency caused by introducing stochastic differential equation (SDE) in regular RL practices. Extensive experiments show our approach significantly improves FID-50K across various base models, from 8.30 to 5.77 for SiT and from 3.74 to 3.52 for EDM2. Qualitative evaluation also confirms that our method enhances perceptual quality while preserving sample diversity.
[CV-31] DisciplineGen-1M: A Large-Scale Dataset for Multidisciplinary Visual Generation and Editing
链接: https://arxiv.org/abs/2607.02290
作者: Zhaokai Wang,Mingxin Liu,Zirun Zhu,Ziqian Fan,Yiguo He,Mohan Zhang,Leyao Gu,Xiangyu Zhao,Ning Liao,Shaofeng Zhang,Xuanhe Zhou,Zhihang Zhong,Junchi Yan,Xue Yang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent image generation and editing models can produce visually appealing natural images, yet they remain unreliable when the target image is a knowledge-intensive diagram whose correctness depends on disciplinary concepts, symbolic structure, and precise spatial relations. We introduce DisciplineGen-1M, a million-scale multidisciplinary dataset that supports text-to-image generation and image editing. It contains 1.2M samples spanning mathematics, physics, chemistry, biology, geography, computer science, economics, history, music, and sports. To construct the dataset, we design a scalable framework that combines vector-graphics rendering, OCR-based editing, curated programmatic synthesis, and large-scale text-to-image filtering. These pipelines produce captions, editing instructions, structured annotations, and paired images with controllable semantic differences. Building on DisciplineGen-1M, we further introduce a discipline-informed reasoning-generation model for both text-to-image generation and image editing. Experiments on discipline-related benchmarks, GenExam and GRADE, show substantial improvements over open-source baselines, while evaluations on general reasoning-informed benchmarks, WISE and RISE, further indicate broader transfer. The results suggest that large-scale structured academic visual data is a key ingredient for moving image generation from aesthetic plausibility toward verifiable knowledge-grounded visual creation. We will publicly release our dataset, model, and source code of the data curation pipeline to ensure reproducibility and benefit future research.
[CV-32] FlowCIR: Semantic Transport via Flow Matching for Zero-Shot Composed Image Retrieval ECCV2026
链接: https://arxiv.org/abs/2607.02284
作者: Zhenqi He,Ziqi Jiang,Yuanpei Liu,Yanghao Wang,Teng Wang,Long Chen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accept to ECCV2026
Abstract:Zero-shot composed image retrieval (ZS-CIR) aims to retrieve a target image by editing a reference image with a natural-language instruction, without relying on domain-specific annotated triplets. Most existing ZS-CIR methods rely on textual inversion to translate the reference image into pseudo-text tokens and then compose them with the instruction via simple concatenation in the text space, which can be lossy and brittle for fine-grained semantics. In this work, we propose a new paradigm, namely FlowCIR, that casts ZS-CIR as conditional semantic transport between reference and target embeddings. Leveraging \emphconditional flow matching, our model learns a lightweight transport field that maps the instruction representation toward a target-aligned query embedding conditioned on the reference image. Since FlowCIR operates on pre-extracted VLM embeddings and trains only a small transport module without updating the image or text encoder, it offers a computationally efficient training protocol compared with prior textual-inversion-based approaches. The resulting framework is training-efficient, requiring roughly 10\times fewer training resources than prior textual-inversion-based approaches. We further identify negation and removal as a major failure mode of VLM-based composition. To address this, we propose an inference-only Multi-Negative Steering strategy that steers a negation-containing relative instruction away from its negated semantics, mitigating the limited negation handling of VLMs and improving robustness on negation-heavy queries. Extensive experiments on standard CIR benchmarks demonstrate that FlowCIR achieves strong and competitive performance compared with recent ZS-CIR methods.
[CV-33] AGVBench: A Reliability-Oriented Benchmark of Data Augmentation for Vein Recognition
链接: https://arxiv.org/abs/2607.02271
作者: Haiyang Li,Yuming Fu,Qun Song,Hongchao Liao,Jing Chen,Mounim A.EI-Yacoubi,Xin Jin
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint this http URL : this https URL
Abstract:Vein recognition is a secure biometric technology often constrained by limited annotated data and imaging variations. While data augmentation mitigates this, strategies designed for natural images may disrupt the fine-grained topology and textures essential for identity discrimination. We present AGVBench, which evaluates 30 representative augmentation strategies on five public palm- and finger-vein datasets with seven backbone architectures, covering classic CNNs, vision transformers, and vein-specific recognition models. Our results show that multi-image mixing methods (e.g., MixUp, PuzzleMix, StarMixup) generally provide the strongest recognition performance. However, they are often poorly calibrated and vulnerable to adversarial perturbations, revealing a clear inconsistency between clean accuracy and adversarial security. We also find that severe geometric transformations frequently degrade recognition, which is potentially due to feature misalignment or spatial cropping, and that augmentation effectiveness varies across palm and finger vein datasets. These findings prove that accuracy-centric evaluation is insufficient for biometric augmentation. AGVBench provides standardized protocols to support reproducible research and guide the design of reliable, secure, and robust vein recognition systems. Our codebase is available at this https URL.
[CV-34] AnyGroundBench: A Specialized-Domain Benchmark for Video Grounding in Vision-Language Models
链接: https://arxiv.org/abs/2607.02269
作者: Rintaro Otsubo,Ryo Fujii,Reina Ishikawa,Taiki Kanaya,Kanta Sawafuji,Hiroki Kajita,Shigeki Sakai,Hideo Saito,Ryo Hachiuma
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-Language Models (VLMs) have demonstrated immense promise in Spatio-Temporal Video Grounding (STVG). However, current evaluation protocols are largely confined to zero-shot assessments on general, daily-life benchmarks. This creates a critical disconnect from real-world applications in specialized fields, where models inevitably encounter rare visual concepts and complex spatio-temporal dynamics. Since exhaustive pre-training across infinite data distributions is infeasible, the ability to adapt to novel domains is essential. To bridge this gap, we introduce AnyGroundBench, a domain-adaptation benchmark designed to shift the STVG evaluation paradigm from static zero-shot testing to rigorous domain adaptation. Targeting five specialized domains (animal, industry, sports, surgery, and public security), AnyGroundBench pairs newly captured videos such as expert-annotated mouse behaviors with established datasets, unifying them through dense, high-fidelity spatio-temporal annotations. Crucially, the benchmark provides dedicated training subsets to systematically measure domain adaptability. We extensively evaluate 15 state-of-the-art VLMs, assessing their zero-shot generalization and In-Context Learning (ICL) capabilities under practical computational constraints. Ultimately, our findings reveal that current models fail in both zero-shot and ICL-based adaptation when confronted with specialized domains, exposing critical flaws in spatio-temporal reasoning that future research must address.
[CV-35] ArcAD: Anomaly-Rectified Calibration for Cold-Start Supervised Anomaly Detection ECCV
链接: https://arxiv.org/abs/2607.02252
作者: Ningning Han,Lei Fan,Jia Guo,Yunkang Cao,Xiu Su,Feng Cao,Donglin Di,Tonghua Su
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to European Conference on Computer Vision (ECCV) 2026
Abstract:The deployment of Industrial Anomaly Detection (IAD) in real-world manufacturing frequently encounters a challenging cold-start bottleneck, in which limited normal samples fail to represent the full normal distribution and only a few anomalies are available. Under such a regime, existing methods struggle to form compact normal boundaries and fail to effectively exploit supervised signals from rare defects. To address this challenge, we propose Anomaly-Rectified Cold-start AD (ArcAD), a plug-and-play calibration framework for reconstruction-based IAD baselines. ArcAD follows a push-pull learning paradigm to construct a compact and discriminative normal boundary under data scarcity. On the one hand, ArcAD projects limited normal samples onto a hypersphere and pulls them into multiple compact clusters to maximize coverage of the normal manifold. On the other hand, it synthesizes pseudo-anomalies on the hypersphere and leverages real anomalies to push the boundary inward and sharpen anomaly discrimination. Extensive experiments on MVTec-AD, VisA, Real-IAD, and MANTA demonstrate that ArcAD significantly outperforms state-of-the-art supervised and unsupervised methods in both single-class and multi-class settings under cold-start conditions. Code is available at: this https URL.
[CV-36] When Token Compression Breaks: Structural Pruning vs. Token Reduction for Robust ViT Segmentation under High Compression ECCV2026
链接: https://arxiv.org/abs/2607.02237
作者: Tien-Phat Nguyen,Ngai-Man Cheung
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ECCV 2026
Abstract:Vision Transformers (ViTs) are strong backbones for semantic segmentation, but their computational cost limits deployment. Recent token compression methods for efficient transformer-based segmentation reduce this cost by decreasing the number of tokens. However, existing evaluations primarily focus on low-to-moderate compression, leaving their behavior under aggressive compression and corrupted inputs unclear. Meanwhile, structural pruning provides an orthogonal route to efficiency by removing redundant components in the ViT architecture, but is rarely compared to token compression under a unified protocol. To bridge this gap, we benchmark representative token compression and structural pruning methods for ViT-based semantic segmentation under matched FLOPs on ADE20K and Cityscapes, together with their common-corruption variants ADE20K-C and Cityscapes-C. Our results reveal a consistent trend on both clean and corrupted inputs: token compression is highly effective at mild reductions but degrades sharply when compression becomes severe, consistent with substantial information loss from overly aggressive token reduction. In contrast, structural pruning exhibits a smoother degradation curve and is more stable at high compression. Motivated by these findings, we study a prune-then-merge pipeline that applies moderate token compression on top of a moderately pruned backbone. At comparable FLOPs, this combined strategy consistently achieves a better accuracy-robustness trade-off at high compression, offering a practical recipe for deployment-oriented ViT segmentation. Code is available at this https URL.
[CV-37] Efficient Waste Sorting for Circular Economy: A Confidence-guided comparison between One-Vs-All and One-Vs-Rest Classification Strategies with Human-in-the-Loop for Automated Waste Sorting
链接: https://arxiv.org/abs/2607.02230
作者: Mohammed Fahad Ali,Dominique Briechle,Marit Briechle-Mathiszig,Tobias Geger,Andreas Rausch
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The complexity of waste disposal regulations across European countries poses significant challenges for the residents and hinders the transition to a Circular Economy. In Germany, the proper sorting and disposal of household waste remains challenging across municipalities. Consequently, substantially reducing incorrectly disposed waste is vital for improving waste management and advancing the Circular Economy. AI-based waste sorting solutions can support residents through user-friendly tools, such as mobile applications, that guide proper waste disposal. To be effective in supporting the Circular Economy, however, these solutions must be configurable to reflect the specific waste sorting scheme of individual municipalities in Germany. In the scope of this work, an evaluation and analysis are performed of two prominent classification strategies: OvA and OvR. The research uses a dataset constructed in alignment with the waste categories and sorting scheme of the city of Goslar in Germany. Moreover, this work aims to extend beyond the overall performance by examining the behavior of OvA and OvR classification strategies in identifying samples likely to be misclassified. These classification strategies are compared by applying varying confidence thresholds to identify uncertain samples for subsequent human review. This evaluation aims to balance the number of misclassifications against the human effort required for data annotation.
[CV-38] DetailAnywhere: Fashion Detail Generation via Cross-Modal Feature Alignment Distillation
链接: https://arxiv.org/abs/2607.02220
作者: Zijun Li,Yimin Zhou,Jia Sun,Honglie Wang,Pengcheng Wei,Junlong Wu,Yongrui Heng,Jiyuan Wang,Huan Ouyang,Boheng Zhang,Huaiqing Wang,Dewen Fan,Qianqian Gan,Fan Yang,Tingting Gao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion-based generative AI has achieved remarkable success in e-commerce applications such as virtual try-on, poster generation, and product background synthesis. However, when making online purchasing decisions for apparel, consumers also desire the freedom to examine specific detail regions of interest, such as collars, cuffs, and fabric textures, yet existing methods have not explicitly studied this setting. We therefore formalize a new, non-template task: Fashion Detail Generation with focus conditioning, and release FDBench, the first benchmark comprising 40K+ human-verified reference-detail pairs across 41 different categories. This task poses a unique semantic gap challenge: the model must bridge the correspondence between a focus marker on a product reference image and a photorealistic close-up view of the indicated region, while faithfully preserving the garment’s identity, without any precise prompt. To bridge this gap, we propose Cross-modal Feature Alignment Distillation (CFAD), which leverages a fine-tuned DINOv3 teacher to align both branches of a Multimodal Diffusion Transformer in a shared semantic space via dual-branch distillation. To further improve consistency between generated details and reference images, we introduce a consistency reward model that jointly scores image pairs along three quality axes and optimizes generation via reinforcement learning. Experiments show that our model DetailAnywhere significantly outperforms all state-of-the-art opensource methods across all metrics and human evaluations.
[CV-39] MedSaab-US: A Backpropagation-Free Multi-Scale Wavelet-Saab Framework for Thyroid Nodule Segmentation in Ultrasound Images ICIP2026
链接: https://arxiv.org/abs/2607.02209
作者: Mohammad Amanour Rahman
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the IEEE ICIP 2026 LBDL 2 Workshop
Abstract:Deep learning (DL) methods dominate thyroid nodule segmentation in ultrasound (US) images, achieving high Dice scores but at the cost of millions of parameters, GPU-dependent training via backpropagation, and limited mathematical tractability. These limitations impede deployment in resource-constrained environments. In this paper, we propose MedSaab-US, a backpropagation-free segmentation framework grounded in the Green Learning paradigm. MedSaab-US extracts multi-scale spatial-frequency features by combining multi-level Discrete Wavelet Transform (DWT) with multi-scale channel-wise Saab (Subspace Approximation with Adjusted Bias) transforms at patch sizes of 5 x 5, 11 x 11, and 21 x 21 pixels. Label-Assisted Greedy (LAG) feature selection retains the most discriminative features, which are fed to an XGBoost classifier for pixel-wise prediction. The Saab transform parameters are determined analytically from data statistics, while XGBoost employs iterative greedy tree construction without requiring backpropagation. Evaluated on the TN3K dataset (2,879 training and 614 test images), MedSaab-US achieves a mean Dice coefficient of 0.4784 +/- 0.2190, precision of 0.5768, and recall of 0.5604, with a model footprint under 500K parameters and CPU-only inference in approximately 0.3 seconds per image. We present this result as an exploratory non-DL baseline for thyroid ultrasound segmentation and analyze the specific challenges posed by isoechoic nodules. An ablation study further quantifies the contribution of each pipeline component, including separate evaluations of LAG feature selection and training-set size.
[CV-40] RadiomicNet: A Hybrid Radiomics-Guided Lightweight Architecture for Interpretable Medical Image Segmentation ICIP2026
链接: https://arxiv.org/abs/2607.02185
作者: Mohammad Amanour Rahman
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at the IEEE ICIP 2026 LBDL 2 Workshop
Abstract:Deep learning has achieved remarkable performance in medical image segmentation, yet it suffers from critical limitations: mathematical intractability, substantial parameter requirements, and lack of clinical interpretability. We propose RadiomicNet, a novel two-stream hybrid architecture that enhances standard deep learning by integrating handcrafted radiomics features directly into the segmentation learning process. The key contribution is the Radiomics Attention Gate (RAG), which leverages Gray-Level Co-occurrence Matrix (GLCM) and Local Binary Pattern (LBP) features to modulate skip-connection attention in a lightweight MobileNetV2-based encoder-decoder, providing ante-hoc interpretability without post-hoc approximations. A novel Radiomics Consistency Loss further enforces alignment between texture complexity and prediction uncertainty, reducing Expected Calibration Error (ECE) from 0.142 to 0.118. RadiomicNet achieves a Dice Similarity Coefficient (DSC) of 0.763 +/- 0.231 on the Breast Ultrasound Images (BUSI) dataset and 0.854 +/- 0.112 on Kvasir-SEG, outperforming U-KAN by 1.2% and 1.8%, respectively (p 0.05, Wilcoxon signed-rank test), with only 3.27M parameters, 9.5x fewer than standard U-Net and 4.3x fewer than U-KAN. Gradient-based feature importance analysis reveals that GLCM dissimilarity (15.24%), GLCM energy (14.56%), and LBP entropy (11.49%) are the dominant radiomics cues, providing clinically meaningful explanations for segmentation decisions. The proposed approach demonstrates that compact, interpretable models grounded in domain knowledge can deliver state-of-the-art segmentation performance with substantially reduced computational overhead.
[CV-41] Efficient PEFT Methods with Adaptive Checkpointing for Vision Models and VLMs on Resource Constrained Consumer-GPUs
链接: https://arxiv.org/abs/2607.02158
作者: Altay Toktassyn,Jurn-Gyu Park
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Modern pretrained vision models achieve strong accuracy but demand substantial GPU memory for fine-tuning, making edge deployment impractical. This paper compares five parameter-efficient fine-tuning (PEFT) methods (Full FT, LoRA, AdaLoRA, QLoRA, BitFit) on Transformers- (ViT-Small, TinyViT) and Mamba-based vision backbones (Vim-Small, MambaVision-T) under an on-device VRAM budget (e.g., 2 GB), together with three gradient-checkpointing strategies (none, static, and a proposed memory-budget-aware adaptive algorithm); and we evaluate three families of foundation-model baselines: zero-shot contrastive vision language models (OpenCLIP, SigLIP), self-supervised vision backbones with lightweight evaluation protocols (DINOv2), and autoregressive VLMs for prompt-based classification (PaliGemma, MobileVLM, SmolVLM). Experiments on CIFAR-100 and DTD report accuracy, training time, energy, and the NetScore family of multi-objective metrics, which we extend with two deployment-aware variants. QLoRA and BitFit cut energy 20-30% at a 1-2% accuracy cost; the adaptive algorithm reduces peak memory 43-79% with 9-30% energy overhead. DINOv2 surpasses fine-tuned models on CIFAR-100 (0.917 vs. 0.897) at a fraction of the energy, while small autoregressive VLMs remain uncompetitive.
[CV-42] Patient-Specific Articulated Digital Twins from a Single Full-Body CT Scan
链接: https://arxiv.org/abs/2607.02156
作者: Han Zhang,Boyang Zhao,Mathias Unberath
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Patient-specific anatomical models provide individualized context for surgical planning, image-guided intervention, and algorithm development. However, most CT-derived models are static: they preserve the body configuration captured at scan time, but cannot represent how the same anatomy would appear after patient repositioning. This limitation is especially important for radiographic imaging, where appearance depends jointly on imaging geometry and patient pose. We present a proof-of-concept for constructing a patient-specific articulated digital twin from a single full-body CT scan. The method fits a parametric human body model (SMPL) to obtain a patient-aligned kinematic scaffold, binds segmented bones and organs to an anatomy-aware rig, and retargets body-pose changes while preserving skeletal geometry. On three full-body CT subjects, the fitted scaffold achieved 15.8 \pm 4.0 mm chamfer distance and 95.9 \pm 1.8% skeletal enclosure. Recomposition at the acquisition pose preserved major radiographic structure, with overall SSIM of 0.872 \pm 0.016 and PSNR of 18.5 \pm 1.4 dB across paired DRRs. Across unseen target poses, the resulting twins enabled articulation while maintaining high skeletal enclosure (94.4 \pm 0.4%). As a feasibility demonstration, we render the articulated twin as pose-dependent DRRs. These results suggest the feasibility of extending static, view-controllable CT simulation toward pose-controllable anatomical twins for future synthetic imaging and positioning studies.
[CV-43] SAMoR: Motion Modelling for Articulated Objects of Any Skeleton and Topology
链接: https://arxiv.org/abs/2607.02148
作者: Yuhao Zhang,Gerard Pons-Moll,Tolga Birdal
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 5 figures
Abstract:Modeling motion for articulated objects of arbitrary skeleton topology remains difficult: existing motion generators target a fixed human skeleton, and prior adaptations either fail to share a vocabulary across rigs or discard motion detail through global pooling. Our key observation is that while joint-level motion does not correspond cleanly across species, motion of functional joint groups does: a human arm, a wolf foreleg, and a bird wing share motion structure despite differing joint counts and connectivity, a correspondence that joint names (e.g., “forearm”, “wing_L1”) partially expose even when topology does not. We introduce SAMoR (Skeleton-Aware Motion Representation for Articulated Objects), a cross-topology motion representation that encodes each motion segment as a small fixed number ( K=8 ) of part tokens shared across arbitrary skeletons. A graph-transformer encoder consumes per-joint motion features, kinematic graph structure, and joint-name embeddings, then compresses them into part-level tokens via cross-attention pooling and residual vector quantization, yielding a discrete motion codebook shared across rigs. To keep the part queries from collapsing into redundant global representations, we introduce a topology-agnostic attention supervision loss, with joint-name dropout to reduce over-reliance on text labels. We curate a heterogeneous corpus from HumanML3D, Truebones Zoo, and animated Objaverse-XL assets, and evaluate SAMoR on held-out characters with unseen skeletons. It supports accurate reconstruction and cross-topology transfer, and enables text-conditioned generation and part-wise editing via a MaskGIT token generator. SAMoR reaches 2.75 \times 10^-2 normalized MPJPE on cross-topology reconstruction, 5.8\times below the strongest adapted variable- J tokenizer baseline, while remaining competitive with fixed-skeleton specialists on HumanML3D.
[CV-44] Predicting Early Stages Of Alzheimers Disease And Identifying Key Biomarkers Using Deep Artificial Neural Network And Ensemble Of Machine Learning Methodologies
链接: https://arxiv.org/abs/2607.02142
作者: Debopriya Ghosh
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Image and Video Processing (eess.IV)
备注: Master’s
Abstract:Alzheimers disease (AD) is a brain disorder that develops slowly and mainly affects memory, thinking, language, and daily activities. It is one of the most common causes of dementia and creates many difficulties for patients as well as their families. In the early stage, the symptoms are often mild and may look like normal ageing. For this reason, many people are diagnosed late, when the disease has already progressed. At present, there is no complete cure for AD. Still, early detection can help doctors manage the condition better and take suitable steps at the right time. In this study, a machine learning model is proposed to detect the early stages of Alzheimers disease using clinical details, neuropsychological test scores, and neuroimaging-related measures. The data used in this work is collected from the Alzheimers Disease Neuroimaging Initiative (ADNI). As the dataset has missing values, iterative imputation is applied to fill them. The dataset also has class imbalance, which is handled using Borderline SVM-SMOTE. After that, feature selection is carried out using wrapper-based and embedded methods so that only important features are used for training. The selected features are divided into training and testing sets, and feature scaling is applied. A stacking ensemble model is developed using Logistic Regression, Extra Trees, Bagging KNN, and LightGBM as base classifiers. Along with this, an artificial neural network is also trained on the same dataset. The performance of these models is compared using precision, recall, F1-score, and AUC-ROC. This study aims to find the best classifier and also identify important biomarkers that may help in the early diagnosis of Alzheimers disease.
[CV-45] AdaCount: Training-Free Similarity-Guided Spatial and Feature Adaptation for Zero-Shot Object Counting
链接: https://arxiv.org/abs/2607.02139
作者: Muhammad Ibraheem Siddiqui,Muhammad Haris Khan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: technical report
Abstract:Zero-shot object counting (ZOC) aims to count instances of arbitrary object categories specified only through textual prompts. Recent training-free approaches leverage foundation models such as SAM to reformulate counting as a prompt-driven segmentation task, eliminating the need for costly counting-specific training data with point-level annotations. More recently, SAM3 introduced promptable concept segmentation, enabling the zero-shot segmentation of all instances corresponding to a text-defined concept. However, SAM3 struggles in densely populated scenes containing numerous small objects, where limited image resolution and insufficient attention to target-relevant regions often lead to missed instances and poor instance separation, hindering accurate object counting. To address this limitation, we propose AdaCount, a training-free framework for ZOC based on similarity-guided spatial and feature adaptation. AdaCount first estimates a prototype-driven similarity map that identifies target-relevant regions. This similarity map subsequently guides two complementary adaptations: (i) similarity-guided spatial warping, which reallocates image resolution toward target instances, and (ii) feature modulation, which amplifies target-relevant encoder representations. Together, these adaptations enable SAM3 to devote greater representational capacity to target-relevant regions while preserving global image context, without requiring any model retraining. Extensive experiments across six diverse counting benchmarks establish AdaCount as a new SOTA among training-free ZOC approaches.
[CV-46] AbsoluteDegradation: A Physics-Inspired Synthetic Film-Degradation Pipeline and Archival Film Restoration Benchmark
链接: https://arxiv.org/abs/2607.02131
作者: Mikołaj Jastrzębski,Dawid Glinkowski,Dawid Zieliński,Daniel Borkowski,Wojciech Kozłowski,Kamil Adamczewski
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Restoring archival film remains a fundamentally challenging problem due to the absence of paired training data and the lack of standardized evaluation benchmarks. Pristine versions of deteriorated footage are physically unrecoverable, requiring supervised methods to rely on synthetic data that often fail to capture the complex, temporally coherent nature of real film degradation. At the same time, existing real-world datasets are limited in scale, quality, and accessibility, hindering reliable evaluation and fair comparison across methods. We address both limitations with AbsoluteDegradation, a physics-inspired, modular pipeline for synthesizing realistic film degradations, and a new large-scale archival benchmark. The proposed pipeline models the analog-to-digital process as a structured composition of artifact families, incorporating signal-dependent grain, parametric scratches, and temporally coherent camera motion, enabling controlled generation of diverse degradation regimes. In parallel, we introduce a curated dataset of 81,576 high-resolution frames sourced from real archival footage, designed for consistent evaluation under real-world conditions. Together, these contributions provide a unified framework for training and benchmarking restoration models. Extensive experiments across multiple architectures show that models trained with AbsoluteDegradation generalize better to real-world footage, while the proposed benchmark reveals systematic failure modes of current methods. We hope this work establishes a foundation for reproducible and domain-authentic evaluation in archival film restoration.
[CV-47] X-Splat: Gaussian Splatting for 3D CBCT Generation from Single Panoramic Radiograph
链接: https://arxiv.org/abs/2607.02099
作者: Tomasz Szczepański,Szymon Płotka,Michal K. Grzeszczyk,Tomasz Trzciński,Arkadiusz Sitek
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 6 figures, including appendix. Under review
Abstract:Generating a 3D dental volume from a single panoramic radiograph (PXR) could provide a low-radiation alternative to Cone-Beam Computed Tomography (CBCT), but the problem is highly underdetermined: panoramic acquisition integrates 3D attenuation along curved X-ray paths into a 2D image, leaving depth-resolved anatomy unobserved. Existing implicit and generative approaches often produce oversmoothed geometry or anatomically inconsistent hallucinations, lacking geometry-driven supervision and relying on smooth representations unable to precisely localize sharp anatomical boundaries. We propose X-Splat, the first Gaussian Splatting framework for generating CBCT-like 3D dental volumes from a single PXR. X-Splat uses the known panoramic acquisition geometry as a generation scaffold: learnable anisotropic Gaussian primitives are initialized along the X-ray paths that formed the input image and adjusted in a single feed-forward pass, constrained by Beer-Lambert reprojection and multi-view radiographic training supervision. A lightweight residual refiner adds dataset-level anatomical priors without overriding the geometry already resolved by the Gaussians. We train on synthetic PXR-CBCT pairs, enabling direct volumetric supervision without paired real scans. We further introduce segmentation-based geometry-aware metrics, providing the first evaluation of PXR-based generation over maxillofacial anatomy. X-Splat outperforms NeRF- and GAN-based baselines, recovering individual teeth, cortical boundaries, and alveolar structure, including the mandibular canal which prior methods fail to reconstruct. Code will be available at this https URL
[CV-48] WBMM: Windowed Batch Matrix Multiplication for Efficient Large Receptive Field Convolution ICML2026
链接: https://arxiv.org/abs/2607.02097
作者: Wan Song,Wei Zhou,Rui Wang,Jun Yu,Toru Kurihara,Jiajia Xu,Shu Zhan
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 23 pages, 4 figures. Accepted as a Spotlight paper at ICML 2026. Code available at this http URL
Abstract:Large kernel depthwise convolutions achieve strong performance but suffer from significant degradation as kernel size grows due to irregular memory access from gather-based computation; while Large Kernel Acceleration (LKA) helps on small feature maps, it becomes counterproductive on large feature maps, even slower than non-accelerated implementations. We propose Windowed Batch Matrix Multiplication (WBMM), which partitions input into contiguous windows and indexes a compact relative position bias table to construct weight matrices, enabling regular memory access via batched matrix multiplication. This yields a unique property: WBMM’s throughput improves with larger windows, opposite to depthwise convolutions that degrade with larger kernels. Operator-level benchmarks show WBMM with 14x14 windows outperforms 5x5 depthwise convolution baselines in speed while providing a 7.8x larger per-layer receptive field. Combined with inter-block cross-window communication and hierarchical window reparameterization, WBMM achieves comparable or higher accuracy on ImageNet-1K, COCO, and ADE20K with 1.31-1.88x training speedup, and demonstrates consistent advantages across GPU, CPU, and edge devices without requiring specialized acceleration kernels. Our code is available at this http URL
[CV-49] LongEgoRefer: A Benchmark for Long-Form Egocentric Video Referring Expression Comprehension ECCV2026
链接: https://arxiv.org/abs/2607.02096
作者: Shunya Kato,Taiki Miyanishi,Shuhei Kurita,Mahiro Ukai,Nakamasa Inoue,Chenhui Chu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ECCV 2026. Dataset and code: this https URL
Abstract:Egocentric videos capture rich and diverse human-object interactions and have emerged as a fundamental resource for understanding human activities related to objects. In this context, Video Referring Expression Comprehension (Video REC), the task of localizing the temporal and spatial extent of a referred object in video frames given a natural language query, plays a key role in linking textual descriptions to observed objects in untrimmed egocentric recordings. However, existing egocentric Video REC benchmarks primarily focus on short video clips, where some target object appears densely within frames. Such settings do not reflect real-world egocentric recordings, which are long-form, untrimmed, and characterized by sparse object occurrences and complex activity transitions. To address this limitation, we introduce LongEgoRefer, a novel and challenging benchmark constructed from long-form videos in the Ego4D dataset. LongEgoRefer contains 1,498 referring expressions with an average video duration of 45 minutes. The benchmark exhibits extreme target sparsity, detailed linguistic descriptions, and complex human-object interactions embedded in long, dynamic egocentric narratives. Consequently, it defines a demanding spatio-temporal grounding problem that requires models to identify both when an event occurs and where the referred object appears within extended video sequences. We evaluate existing Video REC approaches, including training-free baselines based on vision-language models combined with Grounded SAM2. Extensive experiments show that even advanced baselines and current state-of-the-art models struggle significantly on LongEgoRefer. These results highlight the intrinsic difficulty of long-form egocentric spatio-temporal grounding and emphasize the need for more robust video understanding models.
[CV-50] Multimodal Fusion for Fine-Grained Classification of Breast Fibroadenoma and Phyllodes Tumors
链接: https://arxiv.org/abs/2607.02091
作者: Chuxi Nan,Di Wu,Hongming Guo,Ning Cao,Xiaohui Zhu,Zhaoting Shi,Jiawei Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Breast fibroadenoma (FA) and phyllodes tumor (PT) are fibroepithelial breast lesions with highly overlapping appearances on B-mode ultrasound, making benign and borderline PT prone to being misclassified as FA and complicating preoperative decision-making. Existing computer-aided diagnosis methods commonly rely on single-modal imaging features and insufficiently exploit complementary clinical and textual information. To address this limitation, we construct the FAPT-M Dataset, a pathology-confirmed multimodal dataset comprising 910 patients with strictly reviewed ultrasound images, structured clinical attributes, and ultrasound diagnostic descriptions. Based on this dataset, we propose a clinically guided multimodal framework that integrates DenseNet-based visual encoding, CLIP-inspired text encoding, and lightweight clinical encoding, and further introduces clinical-conditioned adaptive modulation, cross-modal Transformer fusion, and dual-path representation learning to improve feature alignment and multimodal interaction. Under patient-level five-fold cross-validation, the proposed method achieves an accuracy of 77.64%, F1-score of 73.38%, and AUC of 89.74%, outperforming representative CNN-, Transformer-, and vision-language-based baselines. Ablation studies and class-balanced evaluations further confirm the contribution of three-modality fusion and the key architectural components. Overall, this work provides an effective multimodal approach for fine-grained FA-PT classification and establishes a high-quality benchmark for multimodal breast ultrasound analysis.
[CV-51] CG-AR: Real-Time Multi-View Augmented Reality for Trading Card Game Streaming
链接: https://arxiv.org/abs/2607.02090
作者: Anthony Cioppa,Antoine Verdonck,Maxim Henry,Marc Van Droogenbroeck,Raphaël La Rocca
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 31 pages, 8 figures, 3 tables
Abstract:Trading card games are increasingly played and broadcast online, yet live streams remain mostly limited to flat top-down footage of the playing area. Augmenting such streams with virtual models of the played cards would improve the viewing experience, but most existing systems rely on instrumented playing surfaces and embedded chips, which are costly and impractical for casual players and large-scale events. In this work, we present TCG-AR, a novel real-time pipeline that augments trading card games using ordinary RGB cameras alone, without any physical markers or specialized hardware. Our pipeline detects, orients, and identifies the cards on the board, renders virtual content onto each card across all views, and can additionally compose a broadcaststyle view that summarizes the game state for spectators, streaming the augmented feeds to standard broadcasting software such as OBS. To train the detection, orientation, and identification models without manual labeling, we introduce an automatic procedure that generates annotated synthetic training data from a reference set of card images. Then, we evaluate several trained models on a new manually annotated dataset with real images, analyzing performance and runtime throughput that determine real-world usability. Overall, by relying only on commodity cameras and hardware, and by open-sourcing all code, models, and datasets, this work aims to serve as a reference for real-time trading card recognition and to make real-time augmented-reality streaming accessible to the broader community of players and streamers.
[CV-52] DeepGaze3.5-VL: Modeling Scanpaths via Autoregressive Token Prediction
链接: https://arxiv.org/abs/2607.02083
作者: Susmit Agrawal,Matthias Bethge,Matthias Kümmerer
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Understanding human visual attention on a scene over time has applications in domains such as interface design and inferring cognitive states. Modeling visual scanpaths has historically relied on specialized architectures with hand-crafted priors. While these architectures can model fixation sequences, their rigid structural biases restrict easy extendability and flexible conditioning. For instance, integrating task-specific instructions or adapting to distinct viewer identities requires custom, disjoint architectural additions. We frame scanpath prediction purely as a discrete sequence modeling task. By mapping coordinates into a text vocabulary, we leverage the pretrained representations of Vision-Language Models. This framing absorbs diverse factors of variation: simple prompting allows for global conditioning, such as providing viewer identities to capture personalized biases, or task-specific objectives like visual search. The framework can also integrate per-fixation attributes, such as individual fixation durations, alongside spatial locations. The autoregressive alignment enables the scalable, exact computation of per-fixation log-likelihoods, directly equivalent to the commonly used Information Gain (IG) metric. Our model, DeepGaze3.5-VL, establishes a new state-of-the-art across multiple datasets, achieving 2.18 bits of IG on MIT1003, a 46% improvement over DeepGaze III. This advantage persists even when baselines use identical high-capacity vision encoders. Beyond predictive performance, our generative framework serves as a powerful computational tool for direct behavioral interventions, allowing for controlled in-silico simulations that would be experimentally difficult or impossible to conduct in vivo. We demonstrate this ability by performing controlled interventions on the durations of pre-saccadic fixations, recovering known oculomotor phenomena purely from data.
[CV-53] HandsOnWorld: Unconstrained Egocentric Video Generation with Camera-Disentangled Hand Control
链接: https://arxiv.org/abs/2607.02075
作者: Yushuo Chen,Xiaoyu Shi,Xiaoshi Wu,Xintao Wang,Pengfei Wan,Yebin Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 9 figures
Abstract:We present HandsOnWorld, a framework for hand-controlled egocentric video generation that forgoes multi-view and marker-based motion capture, learning instead from unconstrained monocular video. Such generality is bottlenecked by the scarcity of scalable 3D hand annotations: large egocentric corpora lack finger-level labels, whereas precise hand datasets are confined to narrow, instrumented settings, limiting prior hand-controlled generators to restricted scene distributions. We instead annotate 3D hands directly on in-the-wild egocentric video through monocular reconstruction, introducing a protagonist-centered annotation pipeline that filters the reconstructions at the action-semantic, image-quality, and 3D-geometric levels to build EgoVid-Pro, a dataset of clean, protagonist-only hand trajectories spanning 103K clips and roughly 12M frames across diverse everyday scenes. To resolve the camera-hand entanglement induced by large ego-motion, we further propose the Plücker Hand Map, a 3D-aware control signal that extends Plücker-ray representations from camera rays to the hand surface, disentangling camera and hand motion at the representation level. Experiments show that \method surpasses prior hand-controlled generators in reconstruction fidelity and control accuracy, and generalizes to out-of-distribution everyday scenes beyond the laboratory datasets on which prior methods rely.
[CV-54] Comprehensive Robustness Analysis of LiDAR-based 3D Object Detection in Autonomous Driving ECCV2026
链接: https://arxiv.org/abs/2607.02074
作者: Adwait Chandorkar,Kai Krink,Yerdana Maulenbay,Hasan Tercan,Tobias Meisen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ECCV 2026 main
Abstract:Recent advancements in LiDAR-only 3D object detection have demonstrated improved detection accuracy over benchmark datasets. However, the adversarial robustness of these models remains untested. Very few adversarial robustness studies exist for LiDAR-only 3D object detection and unfortunately, even they are limited to legacy models. Moreover, there is a systemic gap in the existing evaluation frameworks that rely simply on mAP ignoring other structural and predictive factors. To fill this gap, we propose a holistic framework that evaluates adversarial robustness using two structural factors (point cloud density and point cloud localization) and three predictive factors (misclassification, localization error, distance from ego). Using this framework, we perform an empirical study and critical analysis on recent and legacy state-of-the-art models using adversarial attacks specifically designed for LiDAR-based models. Our key finding is that high-capacity, voxel-based detectors are more susceptible to structured coordinate perturbations than pillar-based detectors. Additionally, non-anchor-based detectors demonstrate poor adversarial robustness, which necessitates rethinking model training techniques. Overall, our results demonstrate that recent models are as vulnerable to adversarial attacks as their predecessors. Therefore, we argue that there is a need to improve the evaluation benchmarks for 3D object detection that not only reward architectural modifications for improving detection accuracy, but also evaluate whether the design choices improve adversarial robustness.
[CV-55] Beyond the Performance Illusion: Structure-Aware Stratified Partitioning and Curriculum Distributionally Robust Optimization for Spatially Correlated Domains
链接: https://arxiv.org/abs/2607.02055
作者: Prathamesh Patil,Arpit Jain,Aswanth Krishnan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 6 figures
Abstract:Performance evaluation in AI systems commonly assumes that random dataset splits produce independent and identically distributed (i.i.d.) subsets. We show that this assumption often breaks down in spatiotemporally correlated domains such as aerial surveillance, precision agriculture, and medical imaging, leading to two systematic failures: data leakage, where correlated samples span training and validation splits and inflate performance estimates, and hidden stratification, where errors on minority subpopulations are obscured by aggregate metrics. To address these issues, we propose a unified evaluation and training framework for spatially correlated data. We introduce Structure-Aware Stratified Partitioning (SASP), which constructs validation splits that reduce spatiotemporal leakage while preserving meaningful class balance, and Curriculum Distributionally Robust Optimization (CDRO), a curriculum-based relaxation of distributionally robust training that stabilizes optimization under these stricter splits. Across multiple benchmarks, this combination yields consistently improved generalization, more reliable confidence calibration, and exposes failure modes that remain hidden under conventional random-split evaluation.
[CV-56] Embracing Intra-Class Heterogeneity for Semi-Supervised Medical Image Segmentation: From Diversity to Precision
链接: https://arxiv.org/abs/2607.02051
作者: Yuqi Liu,Yufei Chen,Wei Fu,Xiaodong Yue,Shuo Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by Medical Image Analysis
Abstract:Due to the scarcity of expert-annotated data, Semi-Supervised Medical Image Segmentation (SSMIS) has emerged as a promising approach. Many anatomical structures in medical images exhibit significant intra-class heterogeneity, with different regions showing heterogeneous intensity patterns within the same structure. However, existing methods inadequately exploit this intensity-manifested intra-class heterogeneity, resulting in uniform structural representations and imprecise segmentation. Furthermore, the scarcity of labeled data makes it more difficult to effectively capture such complex heterogeneity. To address this, we propose Multiple Prototype Contrastive Learning (MPCL), an SSMIS framework that possesses better diversity and better precision. It consists of three novel designs: First, we provide structural representations with better diversity and propose Intensity-aligned Heterogeneous Prototype Generation (IHPG) that effectively models intra-class heterogeneity by generating multiple prototypes aligned with intensity characteristics. Second, we further enhance more diverse structural representations and build a solid foundation for more precise segmentation through Prototypical Space Optimization (PSO) that systematically optimizes a more discriminative and generalizable prototypical space. Finally, we achieve segmentation results with better precision through Dual-branch Knowledge Alignment (DKA) that efficiently promotes intra-class heterogeneity knowledge transfer from prototypical space to the segmentation network. Extensive experiments on three medical image datasets with significant intra-class heterogeneity demonstrate that MPCL significantly outperforms existing methods, especially under extremely limited labeled data.
[CV-57] PWM-ArtGen: Part World Model for Articulated Object Generation
链接: https://arxiv.org/abs/2607.02045
作者: Wentao Zheng,Ancong Wu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The key challenge in articulated 3D object generation from a single image is accurately predicting the underlying kinematic structure. Existing methods either infer kinematic parameters directly from a static image that lacks dynamic part-level kinematic relationships, or estimate parameters from visual dynamics generated from a single image, which is prone to accumulated errors of two steps. Moreover, the limited scale and diversity of existing annotated datasets further hinder generalization to complex, real-world objects. To overcome these limitations, we propose to learn the joint distribution of visual dynamics and kinematic parameters. Recognizing that articulated objects can be formulated as dynamic systems, we propose a unified Part World Model called PWM-ArtGen. To leverage unannotated data, this model couples action diffusion and image diffusion with independent diffusion timesteps, which enables visual branch co-training. We further curate a photorealistic dataset of 19.7k part-level image pairs without kinematic annotations, to support co-training. Experiments demonstrate that PWM-ArtGen substantially outperforms existing baselines in the resting state and exhibits strong zero-shot generalization to out-of-distribution objects.
[CV-58] Hierarchical Anti-Aesthetics: Protecting Facial Privacy against Customized Diffusion Models
链接: https://arxiv.org/abs/2607.02038
作者: Songping Wang,Yueming Lyu,Shiqi Liu,Chen Zhao,Ziyuan Chen,Ning Li,Jing Dong,Caifeng Shan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The rise of customized diffusion models has fueled a boom in personalized visual content creation, but it also introduces serious risks of malicious misuse, thereby posing threats to personal privacy. Image aesthetics are strongly correlated with human perception of image quality. Motivated by this observation, we address facial privacy protection from a novel aesthetic perspective by degrading the generation quality of maliciously customized models, thus reducing facial identity leakage. Specifically, we propose a Hierarchical Anti-Aesthetics (HAA) framework that exploits aesthetic cues at multiple perceptual levels. HAA consists of two key branches: (1) Global Anti-Aesthetics, which degrades overall aesthetics and generation quality by constructing a global anti-aesthetic reward mechanism and a corresponding loss; and (2) Local Anti-Aesthetics, which disrupts facial identity by using a local anti-aesthetic reward mechanism and loss to guide adversarial perturbations toward facial regions. By integrating both branches, HAA achieves anti-aesthetic degradation from a global to a local level during customized generation. Extensive experiments show that HAA outperforms existing methods in identity removal, providing an effective tool for protecting facial privacy.
[CV-59] ComplexMimic: Human-Scene Interaction Imitation in Complex 3D Environments
链接: https://arxiv.org/abs/2607.02034
作者: Lu Pan,Hongwei Zhao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Physics-based Human-Scene Interaction (HSI) imitation learning is crucial for embodied intelligence as it bridges the gap between kinematic 3D motions and real-world dynamics. However, most existing methods focus on simplified scene settings, leaving complex environments largely unexplored, which limits their applicability in real-world scenarios. In this paper, we focus on HSI mimicry in complex environments. Under this complex setting, we observe an inherent trade-off between successfully performing interaction and maintaining natural, physically plausible motions. To address this challenge, we propose ComplexMimic, a framework that reconstructs diverse HSI by interpreting imperfect MoCap data. First, we introduce a Dual Flow Strategy, which learns two complementary experts: an imitation expert for accurate motion tracking and an interaction expert for collision-aware adaptation in complex scenes. Second, naive multi-expert distillation, which treats all experts equally, often under-samples challenging behaviors, limiting effective learning. To mitigate this issue, we propose a difficulty-aware distillation strategy that adaptively weights supervision and prioritizes hard-yet-learnable trajectories guided by failure statistics and learning progress signals. Extensive experiments on three benchmark datasets demonstrate that our approach outperforms current state-of-the-art methods. Our implementation is available at this https URL.
[CV-60] Evaluating Vision-Language Models as a Zero-Shot Learning Alternative to You Only Look Once and Optical Character Recognition for Nigerian License Plate Recognition
链接: https://arxiv.org/abs/2607.02025
作者: Ismail Ismail Tijjani,Ahmad Abubakar Mustapaha,Sunusi Ibrahim Muhammad,Muhammad Bashir Aliyu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:License Plate Recognition (LPR) systems are critical tools in traffic monitoring, security enforcement, and urban mobility management. Traditional LPR systems often rely on a multi-stage pipeline involving object detection using You Only Look Once (YOLO) and Optical Character Recognition (OCR), which suffer from limitations such as high resource demands, poor performance in unstructured environments, and the need for large annotated datasets. This study explores the potential of Vision-Language Models (VLMs) as a unified, zeroshot learning solution for Nigerian license plate recognition. Using a curated dataset of 88 challenging real-world images collected in Nigeria, we evaluate five selected VLMs: Gemini 2.0 Flash Exp (Google DeepMind), Qwen2.5-VL-7B-Instruct (Alibaba), GPT-4o (OpenAI), Claude 4 Sonnet (Anthropic), and Llama 3.2 Vision 90b (Meta). Results based on Character Error Rate (CER) reveal that Gemini and Qwen significantly outperform other models in both accuracy and robustness, on the challenging image scenarios. This work highlights the practical advantages of VLMs over YOLO+OCR, questions the claims by model providers, and compares the performances of the VLMs.
[CV-61] Spatio-Temporal and Clinical Conditioning for Fine-Grained Radiology Report Retrieval
链接: https://arxiv.org/abs/2607.02024
作者: P. Sloan,E. Simpson,M. Mirmehdi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 2 figures, 6 tables
Abstract:Radiology is vital to modern healthcare, but rising imaging demand and persistent workforce shortages strain reporting capacity and clinical workflows. Automated radiology report generation has the potential to support radiologists and help alleviate this burden; however, existing retrieval-based methods remain rigid, lack explicit anatomical grounding, and do not account for longitudinal disease progression or available clinical context. In this work, we introduce STAR3, a multimodal, spatio-temporal, attentive retrieval framework for radiology report generation that aligns region-level anatomical information with clinical indications and longitudinal changes across chest X-ray studies. Our framework employs an object detector to identify anatomically meaningful regions and retrieves semantically relevant report sentences conditioned on both current clinical context and changes observed between prior and current examinations. This design enables anatomically and temporally grounded report generation that better reflects clinical reporting practice. Experiments on the MIMIC-CXR dataset demonstrate that STAR3 outperforms current retrieval-based approaches on retrieval, NLP and clinical metrics, highlighting the value of conditioning retrieval anatomically, temporally and clinically for advancing automated radiology report generation.
[CV-62] UnderOneFacade: Worldwide Facade Semantic Segmentation Benchmark Dataset ECCV2026
链接: https://arxiv.org/abs/2607.02018
作者: Yi Wang,Fan Wang,Prabin Gyawali,Ziyang Xu,Anna Klimkowska,Yixiong Jing,Wanru Yang,Filip Biljecki,Christoph Holst,Benjamin Busam,Brian Sheil,Olaf Wysocki
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by ECCV 2026
Abstract:Globally consistent semantic digital twins require centimeter-accurate and geographically transferable 3D facade segmentation. However, progress in facade parsing is limited by the lack of large-scale, standardized benchmarks for evaluating cross-domain generalization. Existing datasets are geographically narrow, semantically inconsistent, or insufficiently precise. We introduce UnderOneFacade, the largest cross-country and cross-continent 3D facade benchmark to date, comprising centimeter-accurate point clouds with hierarchical, harmonized, and architecturally grounded semantic labels totaling 2.7 billion annotated points. Through a systematic evaluation of representative point-, graph- and transformer-based architectures, we show that current methods struggle to recognize fine-grained architectural elements and degrade significantly across geographic domains, with the best models achieving only up to 33 IoU on the fine-grained LoFG3 benchmark. By combining geometric precision with standardized semantics at unprecedented scale, UnderOneFacade establishes a rigorous benchmark for developing robust and transferable 3D segmentation models. The dataset, evaluation scripts, and pretrained models will be released upon publication.
[CV-63] Mirror Illusion Art CVPR CVPR2026
链接: https://arxiv.org/abs/2607.02015
作者: Xiaopei Zhu,Zeyuan Li,Jun Zhu,Xiaolin Hu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR 2026 Highlight, also got an Efficient CVPR award
Abstract:Mirror Illusion Art is a novel reflection-conditioned 3D illusion where one object yields two target appearances (front and mirror). The task is formulated as inverse design from two target 2D images (front and mirror) to a printable 3D object with geometry and texture. Prior topology-driven and shadow-based approaches demand substantial manual effort, optimize shape only, and often yield non-smooth or incomplete geometry. To address these challenges, we propose AutoMIA, an automated Mirror Illusion Art design pipeline that jointly optimizes shape and color. To stabilize optimization and suppress artifacts, four mechanisms are introduced: (1) projection-alignment component (PAC) selection to reduce surface noise, (2) position-weighted adaptive (PWA) suppression for background noise, (3) internal voxel preservation (IVP) to prevent internal fractures, and (4) shape-color decoupled (SCD) optimization that balance shape and color optimization. AutoMIA generate diverse smooth Mirror Illusion artworks successfully both in the digital and physical world, with only around 76s design time and 2.6 GB memory on average using a single RTX 3090, advancing inverse graphics and computational design. Our code is available at this https URL.
[CV-64] A Stereo Visual SLAM System Using Object-Level Motion Estimation and Geometric Filtering Based on Cross Disparity
链接: https://arxiv.org/abs/2607.02005
作者: Sujan Kumar Dhali,Bhaskar Dasgupta
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 12 figures, 6 tables,
Abstract:This paper presents OCD SLAM, a dynamic stereo visual SLAM framework that extends ORB-SLAM2 by jointly addressing dynamic objects and dynamic features in the scene. Usual visual SLAM systems operating in dynamic environments often fail in the presence of moving objects, due to the static-world assumption used in pose estimation and mapping. To address this predicament, we introduce a novel geometric approach based on the discrepancy between disparity and a newly proposed notion called ``cross disparity’', which exploits both temporal and stereo inconsistency to identify dynamic feature points. Complementary to this feature-level motion analysis, OCD SLAM integrates a 3D object detection module (SMOKE) with Kalman filter-based object tracking to perform object-level motion classification, enabling robust separation of static and dynamic scene elements for accurate pose estimation. The proposed approach has been evaluated on various sequences from the KITTI Odometry and KITTI Raw datasets. Results demonstrate that OCD SLAM achieves significant improvement in trajectory accuracy compared to ORB-SLAM2 and several state-of-the-art dynamic SLAM methods. Ablation studies further demonstrate the effectiveness of the cross disparity module in the KITTI Raw dataset and show that this method is able to detect dynamic features that are missed by the 3D object detection scheme alone.
[CV-65] raining-free Controllable Human Motion Generation under Heterogeneous Constraints ECCV2026
链接: https://arxiv.org/abs/2607.01990
作者: Xiaofei Hui,Bo Yan,Haoxuan Qu,Hossein Rahmani,Jun Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ECCV 2026
Abstract:Training-free controllable motion generation has attracted growing interest for enabling flexible constraint enforcement without constraint-specific training. However, existing training-free methods require constraints to be continuous objective-based with differentiable losses, while many real-world requirements are criterion-based and provide only discontinuous, sparse, or even black-box feedback. In this paper, we propose Motion-Inference-as-Control (MIC), the first training-free motion generation framework that handles both continuous objective-based and criterion-based motion constraints under a shared mechanism. The key idea is to cast diffusion-based motion generation as a stochastic control problem. This perspective not only provides principled and practically effective step-wise control laws that support criterion-based constraints without requiring differentiability and naturally accommodate objective-based constraints as a special case, but also motivates a control-oriented constraint coordination mechanism that adaptively balances and reconciles motion constraints during generation. Experiments across diverse constraint settings demonstrate the effectiveness of our framework.
[CV-66] Understanding Geometric Representations in Self-Supervised Vision Transformers via Subspace Intervention ECCV2026
链接: https://arxiv.org/abs/2607.01987
作者: Weichen Zhou,Yawen Zou,Chunzhi Gu,Ran Dong,Haoran Xie,Chao Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ECCV2026
Abstract:We introduce a controlled subspace intervention framework to investigate how self-supervised Vision Transformers (ViTs) encode dense geometric information. While linear probing is widely used to assess geometric representations, it treats features as a black box, failing to disentangle the underlying topology. To address this issue, we decompose the weights of converged linear probes to isolate the low-rank subspaces containing explicit geometric signals using Singular Value Decomposition (SVD). Our perspective yields three key insights: (1) Pre-training objectives determine how features are encoded. DINOv2 aligns spatial features for efficient linear extraction, while Masked Autoencoders (MAE) tend to disperse these signals, requiring a broader spatial context. (2) Explicit geometric representations are highly compressible, suggesting dense predictive heads could potentially be constrained to low-rank subspaces with minimal performance loss. (3) The layer-wise task affinity suggests that geometric precision peaks at intermediate layers before yielding to semantic abstraction in the final layers. By connecting internal encoding mechanics with downstream performance, these findings provide a basis for effective feature selection and lightweight decoder design. The source code is available at this https URL.
[CV-67] Liquid Latent State Dynamics for Interpretable Turbofan Degradation Modeling
链接: https://arxiv.org/abs/2607.01986
作者: Weizhi Nie,Weijie Wang,Yuting Su
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. 37 references, 8 figures
Abstract:Multivariate time-series models for prognostics are often evaluated by point prediction accuracy, yet their internal states rarely expose a coherent degradation process. We study liquid neural networks as latent dynamics models for aircraft engine health monitoring on the C-MAPSS benchmark. The proposed model encodes a history window into a latent state, evolves that state with a liquid transition model, and decodes future sensor observations. To separate health evolution from operating-condition variation, the latent state is factorized into degradation and condition components. Remaining useful life, monotonic risk, and latent-consistency losses supervise the degradation component, while condition prediction and decorrelation losses discourage operating-condition leakage. Across FD001–FD004, the full disentangled model improves overall sensor forecasting RMSE from 0.2438 for a GRU baseline to 0.2266, with the largest gains on the multi-condition subsets FD002 and FD004. The learned degradation state also forms a clearer temporal degradation axis, reaching an average state-speed Spearman correlation of 0.5960. Direct remaining-useful-life regression remains stronger for the GRU baseline, indicating that the proposed representation is currently more effective as an interpretable world model for degradation dynamics than as a calibrated lifetime regressor. These results suggest that liquid latent dynamics can bridge predictive maintenance forecasting and inspectable health-state modeling.
[CV-68] Do Newer Lightweight CNNs Perform Better Under Resource Constraints? A Controlled Multigenerational Study of Architecture Initialization Training Budget and Efficiency
链接: https://arxiv.org/abs/2607.01984
作者: Tasnim Shahriar
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 8 figure, 13 tables
Abstract:Newer lightweight convolutional neural networks are often presented as improving predictive performance and deployment efficiency, but such claims require controlled evaluation. This study compares nine lightweight CNN model packages across CIFAR-10, CIFAR-100, and Tiny ImageNet under a shared downstream protocol. We report top-1 accuracy, macro F1, top-5 accuracy, parameter count, FP32 storage, GMACs, batch-size-1 latency on an NVIDIA L4 and AMD Ryzen 5 5500U CPU, peak PyTorch CUDA allocated tensor memory, and point estimate Pareto frontiers. EfficientNetV2-S achieves the highest observed top-1 accuracy on CIFAR-10 and CIFAR-100 at 97.57% and 86.98%, while RepViT-M1.0 leads Tiny ImageNet at 79.87%. EfficientNet-B0 remains within 0.22, 0.85, and 1.79 percentage points of the best result on the three datasets while using approximately 79% fewer parameters and 86% fewer GMACs than EfficientNetV2-S. It also appears on every evaluated accuracy and resource Pareto frontier, making it the most consistently competitive intermediate-budget option. MobileNetV3-Small has the lowest GMAC count, is the fastest model under both CPU thread settings, and records higher observed accuracy than MobileNetV4-Conv-S on all three datasets. Under random initialization, it leads MobileNetV4-Conv-S by 2.55, 1.76, and 0.99 points, with paired test-set intervals excluding zero for the fixed trained models. EfficientNet-B0 remains 3.29, 10.10, and 17.54 points below its pretrained counterpart after 100 epochs of scratch training, despite requiring about five times the recorded training time. SqueezeNet1.1 has the fewest parameters and lowest peak CUDA allocation, but substantially weaker accuracy. Latency rankings differ sharply between the L4 and CPU environments, showing that GMACs alone do not predict measured inference performance. Overall, newer designs provide selective rather than universal gains
[CV-69] Open-Weather Robust 3D Detection via Dual-Critic Diffusion Alignment ECCV2026
链接: https://arxiv.org/abs/2607.01983
作者: Shuyao Li,Chuanxing Geng,Heyang Sun,Qiang Zhou,Jingjing Gu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 6 figures, 8 tables. ECCV 2026 camera-ready
Abstract:Robust 3D object detection under adverse weather remains a critical hurdle for autonomous driving. Despite progress with LiDAR-4D radar fusion, most methods are constrained by a closed-world assumption, implicitly requiring training and test weather to align in both type and severity. This premise fails in practice: the open-ended nature of weather, and even variations within a single type like rain, cause dramatically different LiDAR degradation patterns, leading to significant performance drops in unseen conditions. To address this, we present Dual-Critic Guided Diffusion Alignment (DCDA), a weather-agnostic framework that learns to recover degraded LiDAR features toward a clean manifold. Rather than modeling specific weather types, DCDA employs a 4D radar-conditioned diffusion process to progressively refine features, guided by two complementary critics. (i) A detection-guided critic, anchored by a pre-trained clean-weather model, ensures that the refined features retain object-level discriminability and localization accuracy. (ii) A weather adversarial critic enforces holistic distributional consistency with clean-weather representations. By aligning features through semantic and distributional constraints rather than explicit weather modeling, DCDA generalizes effectively to unseen weather types and severities without requiring paired data or weather labels. We further introduce a structured open-weather benchmark with held-out type-severity combinations and extensive experiments verify DCDA’s advantages.
[CV-70] MolSight: A Graph-Aware Vision-Language Model for Unified Chemical Image Understanding
链接: https://arxiv.org/abs/2607.01982
作者: Wenda Wang,Yihan Tong,Yuwei Hu,Zhewei Wei
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
备注:
Abstract:Using molecular large language models (LLMs) as a unified framework for understanding molecular structures and functions is emerging as a new trend in tasks such as molecular design and drug discovery. However, these models struggle to fully capture the visual representation of molecular structures, limiting their potential. While existing molecular vision-language models (VLMs) show promise, they still face challenges in structural alignment and lack the necessary topological modeling for accurate molecular understanding. To address this, we propose MolSight, a graph-aware vision-language model framework designed to enhance the understanding of molecular images by VLMs. MolSight integrates a Molecular Topology Module to inject chemical-bond adjacency information into vision tokens, and a Molecular Grounding Module to align visual features with chemical symbolic semantics. Our experiments demonstrate that MolSight significantly outperforms existing VLMs, molecular LLMs, and specialized tools across multiple chemical visual understanding tasks, achieving a new level of molecular image reasoning.
[CV-71] Assessing VLM Reliability for Medical Image Quality Evaluation Under Corruption and Bias
链接: https://arxiv.org/abs/2607.01973
作者: Sofiane Ouaari,Kevin Vorwalder,Nico Pfeifer
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Vision-Language Models (VLMs) are increasingly applied in medical tasks such as pathology description, report generation, and visual question answering. Medical Image Quality Assessment (MIQA) supports diagnostic accuracy and patient safety by determining whether images meet the standards required for clinical decision-making. Automating MIQA with VLMs may reduce workload, but their behavior under real-world conditions, where images may be degraded or textual context may affect judgments, should be further explored before deployment. We benchmark VLMs on medical image quality using the MediMeta-C dataset zero-shot across seven corruption types and five severity levels. We evaluate sensitivity to degradation patterns, the effect of corruptions on embedding geometry, and whether textual attributes (demographics, expertise, infrastructure, institution) alter scores. Across 16 VLMs and seven modalities, pixelation produced the largest score reductions (mean -20.58%, up to -34.4% for OCT), whereas brightness had limited effect (-0.81%). Embedding displacement was associated with score changes. Same-family models showed correlations of 0.67-0.83; some produced increases up to +31% for corrupted mammography. Textual attributes affected scores: institutional prestige raised them +17.15%, and equipment age lowered them -14.7%. The largest changes were +95.62% (InternVL-8B) and -37.7% (MedGemma). Current VLMs show limitations for medical image quality assessment. Pixelation, a privacy-preserving transformation, reduces performance, indicating a trade-off between patient privacy and reliability. Sensitivity to contextual metadata indicates limited objectivity and marks metadata as a privacy and bias source. Privacy protection and objective quality assessment are related requirements for use.
[CV-72] NeoMap: Training-free Novel-View Synthesis from Single Images and Videos ECCV2026
链接: https://arxiv.org/abs/2607.01962
作者: Jinxi Li,Tianyi Zhang,Yafei Yang,Zihui Zhang,Peng Huang,Koon Wing Macgyver Lin,Bo Yang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Robotics (cs.RO)
备注: ECCV 2026. Jinxi and Tianyi are co-first authors. Code and data are available at: this https URL
Abstract:We study the challenging problem of novel view video synthesis from single images or monocular videos. Existing methods, which operate under the assumption that pre-trained video models lack native novel view synthesis capability and enforce view alignment via camera conditioning, task-specific fine-tuning, or stepwise hard denoising guidance, often suffer from artifacts and compromised global scene consistency. In this paper, we introduce NeoMap, a novel training-free framework designed to locate high-fidelity, view-consistent novel view solutions from general pre-trained video models. The key to our approach is the core insight that promising novel view solutions are inherently encoded within the natural video data manifold learned by pre-trained models, and the core challenge is simply to locate this optimal solution. We solve this via our core mechanism: convergent manifold alternating projection iterations that optimize the initial noise. Extensive experiments demonstrate that NeoMap significantly outperforms all existing methods across 3 standard novel view synthesis benchmarks, including the challenging Tanks-and-Temples, LLFF and DAVIS datasets, achieving state-of-the-art generation fidelity and top-tier view consistency.
[CV-73] Personalized 4D Whole-Heart Mesh Reconstruction from Cine MRI via Multi-Scale Temporal Modeling and Differentiable Contour Rendering
链接: https://arxiv.org/abs/2607.01952
作者: Xiaoyue Liu,Dongcheng Cang,Xiaohan Yuan,Mark YY Chan,Ching-Hui Sia,Lei Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages
Abstract:Accurate 4D whole-heart mesh reconstruction from sparse cine MRI is critical for creating cardiac digital twins, but remains challenging due to limited 2D slice coverage and the complex coupling between cardiac shape and motion. Existing methods often rely on intermediate contour fitting and typically reconstruct static, single-phase, or partial cardiac geometries, limiting their ability to capture full-chamber dynamics. We propose a novel end-to-end framework for reconstructing temporally resolved whole-heart meshes from multi-view 2D cine MRI sequences by learning an image-to-mesh mapping. The framework incorporates a differentiable contour renderer inspired by the Beer-Lambert attenuation principle, enabling anatomy-aware supervision of 3D+t mesh deformation through contour-based projection losses. To improve temporal consistency across the cardiac cycle, we further introduce a multi-scale temporal modeling module that integrates global cycle-level dynamics with local inter-frame coherence to generate smooth and physiologically plausible mesh trajectories. The proposed method achieved a whole-heart mean absolute error of 1.68 \pm 0.31 mm and a motion jitter of 0.77 \pm 0.17 \mathrmmm/\mathrmframe^3 , outperforming existing methods with lower reconstruction error and substantially improved motion smoothness. It also improved 2D contour alignment across multiple cine MRI views and supported downstream proof-of-concept electrophysiological simulation. The code will be released publicly upon acceptance of the manuscript for publication.
[CV-74] LiZAD: A Lightweight Zero-Shot Anomaly Detection Framework for Industrial Manufacturing
链接: https://arxiv.org/abs/2607.01949
作者: Uzair Khan,Luigi Capogrosso,Muhammad Aqeel,Francesco Setti,Michele Magno,Marco Cristani
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the IEEE International Conference on Omni-Layer Intelligent Systems (COINS) 2026
Abstract:In modern high-throughput industrial production lines, product configurations and visual characteristics frequently change, making it impractical to collect and annotate data for every new scenario. This dynamic setting makes Zero-Shot Anomaly Detection (ZSAD) particularly suitable, as it enables defect detection without requiring training on target-specific samples. Although recent ZSAD approaches show promising results, they are computationally intensive and thus unsuitable for deployment on resource-constrained devices. We propose LiZAD: a lightweight framework designed for real-time ZSAD specifically tailored for use on edge devices. The proposed approach pairs the dense and spatially aware visual features of DINOv3, crucial for precise pixel-level localization, with the highly computationally efficient text embeddings of MobileCLIP2. These features are then mapped into a shared latent space via low-memory trainable projection heads. Compared to six state-of-the-art ZSAD models, LiZAD achieves an average memory reduction of 61.5%, a parameter reduction of 74.6%, and a speedup of 3.02x in terms of latency. Despite substantial reductions in computational and memory costs, our approach maintains competitive anomaly detection performance, dropping the average P-AUROC by just 6.4% relative to the best state-of-the-art model across the VisA, BTAD, MPDD, and MVTec-AD datasets. Finally, it is successfully deployed on the NVIDIA Jetson NX and Jetson AGX edge devices and tested on the real production line of the Industrial Computer Engineering Laboratory (ICE Lab) at the University of Verona. The code is available at this https URL.
[CV-75] Sparse-Aware Vector Quantization for Bandwidth-Efficient Collaborative 3D Semantic Occupancy Prediction ECCV26
链接: https://arxiv.org/abs/2607.01928
作者: Feng Li,Chaokun Zhang,Gong Chen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ECCV26
Abstract:Collaborative perception extends single-agent perception by enabling multiple vehicles to exchange complementary perceptual information. However, it introduces an inherent trade-off between perception gain and communication overhead, which is particularly severe for 3D semantic occupancy prediction that relies on fine-grained spatial structures. Existing methods typically compress 3D features into 2D, causing severe spatial information loss, or transmit dense 3D representations, hindering real-world deployment. To overcome these limitations, we propose a bandwidth-efficient collaborative Vector Quantization Semantic Occupancy Prediction (VQSOP) framework. VQSOP employs a Sparse-Aware Vector Quantization (SAVQ) mechanism that exploits 3D scene sparsity to compactly encode informative regions, drastically reducing communication overhead while preserving complete geometric context. Furthermore, to enhance structural consistency and feature continuity, we design a Dual-Branch Adaptive Spatial Refinement (ASR) module that dynamically fuses local high-frequency details with broad contextual semantics. Extensive experiments demonstrate that our approach achieves state-of-the-art performance while reducing communication volume by up to 82x.
[CV-76] Robust Image Processing Techniques for Construction Environment Monitoring Using Underwater Robots
链接: https://arxiv.org/abs/2607.01915
作者: Seunghee Yun,Geonmo Yang,Juhui Lee,Changbeom Park,Jeahyung Choi,Younggun Cho
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 8 pages, 9 figures
Abstract:This paper proposes a robust image processing framework for underwater robot-based construction environment monitoring, targeting complex degradations observed in real marine environments. Unlike conventional approaches that mainly consider absorption and backscattering, real underwater imagery is strongly affected by depth-dependent forward scattering blur and particle-induced degradations such as marine snow. To address this, we introduce a staged processing pipeline that sequentially models background degradation via depth-aware forward scattering and foreground degradation using realistic marine snow patterns extracted from real images. The resulting synthetic data are used to retrain an existing Joint-ID network without modifying its architecture, enabling an isolated evaluation of dataset realism. In addition, a lightweight post-processing scheme is applied to enhance contrast and structural clarity. Experiments on real underwater datasets collected in Korean coastal environments demonstrate consistent improvements in visual quality and UIQM scores. The results indicate that explicitly modeling forward scattering and realistic particle effects effectively reduces the synthetic-to-real gap and improves practical applicability in real-world underwater robotic operations.
[CV-77] owards Real-World Ultrasound Understanding: Large Vision-Language Models from Multi-Image Examinations with Long-Form Reports
链接: https://arxiv.org/abs/2607.01908
作者: Bingcong Yan,Chunlei Li,Jingliang Hu,Yilei Shi,Xiao Xiang Zhu,Lichao Mou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:Large vision-language models (LVLMs) have achieved strong performance across many medical imaging tasks, yet their application to ultrasound remains limited due to its inherent complexity and variability. In this work, we revisit what is truly needed to enable real-world ultrasound understanding. Instead of introducing complex architectures or elaborate training strategies, we show that data scale and clinically faithful data alignment are the key factors. We construct a large-scale dataset of 1.5M real-world ultrasound examinations, containing 17.7M images, multi-organ coverage, and paired uncurated clinical reports. Crucially, we organize the data at the examination level, aligning multiple images with their corresponding reports to reflect real clinical workflows. We then fine-tune a standard LVLM using low-rank adaptation (LoRA) on this dataset without task-specific modifications. Surprisingly, this simple recipe already leads to strong performance across diverse ultrasound understanding tasks, outperforming prior methods designed with more complex pipelines. Beyond these results, we present model and data scaling analyses that provide insights into the role of scale in ultrasound LVLMs.
[CV-78] Population-Based Multi-Objective Training of Discriminators for Semi-Supervised GANs
链接: https://arxiv.org/abs/2607.01907
作者: Francisco Sedeño,Francisco Chicano,Jamal Toutouh
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: The 2nd International Conference on Federated Learning and Intelligent Computing Systems (FLICS2026)
Abstract:Semi-supervised generative adversarial networks (SSL-GANs) can exploit large unlabeled datasets while retaining a classifier in the discriminator, but their training is often unstable. This paper proposes a population-based evolutionary training strategy in which discriminator learning is formulated as a multi-objective optimization problem. Instead of aggregating the supervised and unsupervised components of the SSL objective into a single scalar loss, the method maintains a population of discriminators ranked by Pareto dominance, enabling the exploration of different trade-offs between classification accuracy and real/fake discrimination. This formulation aims to improve both roles of SSL-GANs: learning accurate classifiers and training generators capable of producing realistic samples. We analyze several variants, including an elitist strategy and a mono-objective ablation, to assess the role of multi-objective selection. Experiments on MNIST with limited labels show improved training robustness compared to SSL-GAN and CE-SSL-GAN state-of-the-art baselines, while the elitist variant consistently achieves the highest classification accuracy.
[CV-79] SFKD: Spatial–Frequency Joint-Aware Heterogeneous Knowledge Distillation via Multi-Level Wavelet Spectral Interaction ECCV2026
链接: https://arxiv.org/abs/2607.01906
作者: Cuipeng Wang,Haipeng Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ECCV 2026
Abstract:Most existing knowledge distillation methods focus on homogeneous models (e.g., CNN-to-CNN), thereby overlooking the flexibility and potential of knowledge transfer across heterogeneous models. Due to intrinsic inductive bias discrepancies between heterogeneous models that cause spatial distribution inconsistencies, prior heterogeneous distillation methods often weaken or discard spatial information in heterogeneous representations. However, the spatial information in representations often encodes transferable global structural semantics as well as architecture-specific local details, and therefore should not be directly ignored. To better leverage the spatial information encoded in heterogeneous representations, we propose a Spatial-Frequency Joint-Aware Heterogeneous Knowledge Distillation framework (SFKD). By leveraging the complementary properties of wavelet transform spatial locality and Fourier representations in characterizing global energy distributions, we first apply multi-level discrete wavelet transform to explicitly decouple spatial information. The resulting wavelet sub-bands are further refined by a dual-stream dual-stage refinement module, and finally combined with a Gaussian-filtered frequency loss to selectively capture informative global information. Extensive experiments on multiple benchmark datasets under both homogeneous and heterogeneous models demonstrate the superiority of our method.
[CV-80] Rethinking Post-Hoc Calibration in Semantic Segmentation
链接: https://arxiv.org/abs/2607.01902
作者: Tristan Kirscher(ICube),Kim-Celine Kahl(DKFZ),Balint Kovacs(DKFZ),Maximilian R. Rokuss(DKFZ),Klaus Maier-Hein(DKFZ),Xavier Coubez,Philippe Meyer(ICube),Sylvain Faisan(ICube)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Reliable confidence estimates are essential in semantic segmentation, especially in safety-critical settings where overconfident errors can mislead downstream decisions. Yet modern segmentation models often remain miscalibrated. Post-hoc calibration offers a practical way to correct confidence estimates without retraining the segmentation model, but its use in dense prediction raises structural issues that are often overlooked. We study two such issues. First, adding a constant to all logits leaves the softmax probabilities unchanged, but several standard calibrators can still depend on this arbitrary offset. As a result, two logit representations encoding the same predictive distribution may yield different calibrated probabilities. We define translation-invariant (TI) calibrators as those whose outputs are unchanged under such shifts, characterize which common calibrators satisfy this property, and construct TI counterparts of shift-sensitive calibrators to isolate the effect of removing representation dependence. Second, post-hoc calibration is typically fitted by minimizing a likelihood-based objective, whereas segmentation models are trained with task-specific metrics such as Dice. This mismatch can cause calibration to alter class orderings and degrade the deployed segmentation map. We study decision-preserving calibration under argmax- and order-preservation constraints. Since enforcing these constraints collapses affine softmax calibrators to temperature scaling, we introduce class-conditional affine calibrators that can be made argmax- or order-preserving while retaining greater expressivity, allowing us to quantify the calibration-segmentation trade-off induced by decision preservation. Across natural-image and medical segmentation benchmarks, and under corruption-based covariate shift, matched comparisons show that TI variants generally improve calibration metrics, while decision-preserving variants prevent segmentation degradation and retain strong calibration performance. These results provide practical design principles for well-defined post-hoc calibration pipelines in semantic segmentation.
[CV-81] FoundDP: Revisiting Weak Disparity Observability in Dual-Pixel Depth Estimation
链接: https://arxiv.org/abs/2607.01900
作者: Fengchen He,Hao Xu,Dayang Zhao,Tingwei Quan,Shaoqun Zeng
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Dual-pixel (DP) imaging enables metric depth estimation from a single camera using sub-aperture disparity. However, the extremely small effective baseline limits disparity observability, leading to structural degradation and depth failure in textureless, low-contrast, or downsampled regions. Existing DP-based methods rely primarily on local disparity cues and therefore become unreliable when disparity signals are weak or ambiguous. To address this limitation, we propose \emphFoundDP, a unified framework that integrates metric DP depth with global structural priors from a monocular depth foundation model. Our method preserves metric scale through DP-derived depth and leverages Vision Transformer (ViT) features to restore structural consistency in weak-disparity regions. To ensure reliable metric guidance under DP imaging conditions, we identify and mitigate ViT representation degradation induced by DP defocus blur via ViT feature alignment, enabling stable metric-guided depth estimation. Extensive experiments on synthetic and real-world DP benchmarks show that FoundDP delivers superior performance, with consistent gains in structural fidelity and metric accuracy, especially under reduced disparity observability. Code will be available at: this https URL
[CV-82] Diversity-aware View Partitioning for Scalable VGGT ECCV2026
链接: https://arxiv.org/abs/2607.01885
作者: Jinsoo Park,Donggyu Choi,Ahyun Seo,Minsu cho,Jeany Son
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 34 pages, 11 figures, Accepted to ECCV 2026
Abstract:Geometry transformers such as VGGT achieve strong performance by jointly reasoning over multiple views with global attention. However, scaling them to large view collections remains challenging due to the quadratic cost of attention. Moreover, our empirical analysis reveals that the reconstruction quality in VGGT is sensitive to the distribution of viewpoints. Simply increasing the number of views without sufficient viewpoint diversity can even degrade performance, as redundant views introduce highly similar tokens that dilute informative geometric signals in the attention mechanism. Motivated by this observation, we propose a training-free and plug-and-play VGGT inference framework that organizes views into diversity-aware balanced chunks. The chunks are constructed through combinatorial graph partitioning over visual dissimilarity and spatial dispersion. This view organization allows the transformer to focus attention on geometrically informative views while reducing redundant attention interactions. To estimate spatial dispersion without full pose estimation, we approximate spatial relationships via a soft pose propagation strategy based on visual similarity from a small set of seed frames. Extensive experiments demonstrate improved performance in camera pose estimation, multi-view depth prediction, and 3D reconstruction while reducing memory usage and inference latency. Our framework also complements existing VGGT variants, enabling scalable multi-view reconstruction without sacrificing geometric fidelity.
[CV-83] SAB-LVLM: Significance-Aware Binarization for Large Vision-Language Models
链接: https://arxiv.org/abs/2607.01876
作者: Qi Lyu,Jiahua Dong,Baichen Liu,Xudong Wang,Mingfei Han,Yulun Zhang,Fahad Shahbaz Khan,Salman Khan,Lianqing Liu,Zhi Han
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Vision-Language Models (LVLMs) have achieved remarkable progress in multimodal understanding, yet their enormous parameter scale and cross-modal computation incur substantial memory and latency overhead, severely limiting real-world deployment on resource-constrained devices. Binarization offers an attractive solution by drastically reducing storage and computational costs. However, existing binarization methods neglect the varying importance of weights across different layers and modalities. This causes parameters irrelevant to downstream tasks to be unnecessarily retained, whereas modality-critical weights may not be adequately optimized, resulting in significant performance degradation. To address these challenges, we develop a novel \underlineSignificance-\underlineAware \underlineBinarization for \underlineLarge \underlineVision-\underlineLanguage \underlineModels (SAB-LVLM). Specifically, after constructing Hessian matrices for textual and visual inputs, we propose a spatial significance map to distinguish full-precision weights activated under a single modality from those activated across modalities. We then devise a modality-guided integration strategy to obtain the significance-aware binarization map, which measures weight significance across layers and modalities. Subsequently, this binarization map is incorporated into the binarization objective as an error reweighting term, and binarization fitting is performed through an alternating significance-weighted update scheme. Extensive experiments illustrate the superiority of our SAB-LVLM over existing binary PTQ methods under an approximately 1-bit compression constraint. Our code is accessible at this https URL.
[CV-84] Descriptor: LYNRED Mobility Dataset Multimodal Detection Subset (LYNRED-MDS)
链接: https://arxiv.org/abs/2607.01871
作者: Loïc Arbez(Thoth),Jessy Matias,Xavier Brenière,Jocelyn Chanussot(Thoth),Ronald Phlypo(GIPSA-VIBS)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Current road safety systems primarily focus on minimizing post-collision damage. However, advances in algorithmic perception are shifting focus toward early collision prediction, especially in lowvisibility conditions like nighttime or fog, where thermal infrared sensing outperforms both human vision and RGB imaging. While available RGB-infrared datasets such as FLIR ADAS and LLVIP are good benchmarks, they mostly consist of clear weather and overly simple scenarios. In this article, we introduce the LYNRED-MDS: Multimodal Detection Subset, a subset of the LYNRED Mobility Dataset, comprised of 4000 RGB-infrared image pairs captured under diverse weather, lighting, and road conditions around Grenoble, France. Our dataset spans varied driving contexts (urban, rural, mountainous, etc.) and a vehicle fleet compliant with Western European standards. Thermal cross-dataset evaluation using a YOLOv8n baseline suggests that our dataset offers strong generalization potential for pedestrian detection in driving scenarios. By covering critical edge cases, our dataset supports the development of more reliable and deployable vision systems for advanced driver-assistance systems.
[CV-85] QWERTY: Training-Free Motion Control via Query-Warped Video Diffusion Transformers ECCV
链接: https://arxiv.org/abs/2607.01869
作者: Kyobin Choo,Youngmin Kim,Hyunkyung Han,Geunrip Park,Chanyoung Kim,Sunyoung Jung,Seong Jae Hwang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 37 pages, 18 figures, accepted at the European Conference on Computer Vision (ECCV) 2026
Abstract:Video diffusion transformers (DiTs) generate high-fidelity and temporally coherent videos, yet motion control remains implicit, primarily relying on text prompts. As a result, achieving desired motion often requires extensive prompt engineering and repeated resampling. While fine-tuning models with additional spatial prompts (e.g., bounding boxes or point trajectories) enables explicit control, it demands substantial data curation and computation, and may compromise the generative capabilities of pretrained models. Consequently, training-free motion control using such spatial prompts has been explored in U-Net-based video diffusion models, but remains largely unexplored for DiTs. We introduce QWERTY, a training-free framework that enables flexible motion control in pretrained image-to-video DiTs via user-defined object warping and optical flow. We carefully manipulate the 3D full attention of DiTs by warping the frame-invariant semantic subspace of queries. We find that the noise predicted by the query-warped DiT naturally guides the diffusion trajectory toward the desired motion, and further show that leveraging this noise as self-guidance for latent optimization improves control stability and visual quality. Experiments show that QWERTY achieves the most effective motion control among existing training-free approaches on a recent image-to-video DiT, with performance comparable to fine-tuning-based methods.
[CV-86] DL-SLAM: Enabling High-Fidelity Gaussian Splatting SLAM in Dynamic Environments based on Dual-Level Probability
链接: https://arxiv.org/abs/2607.01860
作者: Ziheng Xu,Qingfeng Li,Xuefeng Liu,Chen Chen,Jianwei Niu
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in 3D Gaussian Splatting (3DGS) have enabled significant progress in dense dynamic Simultaneous Localization And Mapping (SLAM). Prevailing methods typically discard predefined dynamic objects, ignoring that transiently static objects offer valuable geometric constraints for pose estimation. A recent work attempts to leverage this potential by employing per-pixel uncertainty maps to quantify the magnitude of motion. While this approach enables transiently static objects to enhance pose estimation, it erroneously integrates these objects into the static map, resulting in persistent artifacts. Moreover, its reliance on purely geometric information leads to ambiguous object boundaries in the uncertainty maps. To overcome these limitations, we present DL-SLAM, a monocular Gaussian Splatting SLAM system built upon a novel dual-level probabilistic framework. Our method computes dynamic probability maps by combining semantic and geometric information. These pixel-level probabilities are lifted to 3D and aggregated to derive an object-level dynamic probability for each instance. Object-level probability enables the categorical pruning of dynamic Gaussians, resulting in an artifact-free static map. The static map, in turn, provides a geometrically consistent guidance to refine the pixel-wise probabilities, enhancing their reliability. Experimental results demonstrate that DL-SLAM outperforms existing approaches, improving tracking accuracy by up to 13% while generating high-fidelity semantic maps.
[CV-87] Geometric Foundation Model Distillation for Efficient Lunar 3D Reconstruction ECCV2026 ECCV
链接: https://arxiv.org/abs/2607.01851
作者: Clémentine Grethen,Florient Chouteau,Géraldine Morin,Simone Gasparini
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ECCV 2026, code can be accessed via this https URL
Abstract:Large 3D foundation models such as MASt3R achieve state-of-the-art stereo reconstruction but are computationally demanding for deployment under strict hardware constraints – a critical limitation in domains such as planetary exploration, where onboard computing is severely restricted. We study how far such models can be compressed through knowledge distillation, using lunar stereo reconstruction as a challenging and practically relevant case study. Starting from a 688M-parameter MASt3R teacher fine-tuned on lunar imagery, we distill its dense geometric predictions into a family of lightweight students spanning different encoder types (CNN vs ViT), decoder widths and depths, and training strategies. To bridge the dimensional mismatch between teacher and student, we propose a structured SVD-based initialization that projects the teacher’s decoder weights into the student’s smaller latent space, yielding a warm start that significantly improves convergence and final performance. Based on our results on lunar data, we can obtain a distilled student that retains most of teacher’s reconstruction accuracy while reducing the model size up to 7 times, and even outperforms a baseline trained directly with sparse ground-truth annotations. Beyond compression, our study highlights both principles and practical insights for distilling geometric foundation models: a convolutional encoder underperforms transformer-based alternatives (though pretraining availability remains a confounding factor), preserving encoder capacity is more critical than maintaining a large decoder, feature-level distillation consistently outperforms output-only supervision, and SVD-based initialization improves optimisation stability. These findings provide practical guidelines for deploying 3D reconstruction models in resource-constrained environments.
[CV-88] C2E: Boosting Ego-Only 3D Object Detection via Multi-Teacher Contrastive Knowledge Distillation
链接: https://arxiv.org/abs/2607.01827
作者: Jinlong Wang,Xun Huang,Qiming Xia,Shijia Zhao,Chenglu Wen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 8figures
Abstract:LiDAR-based 3D object detection is essential for autonomous driving systems. However, traditional Ego-only Perception (Eo-Perception) suffers from limited perspective and occlusions in a complex outdoor environment, leading to performance bottlenecks. Recently, research on multi-agent Collaborative Perception (Co-Perception) has demonstrated excellent performance, but high communication costs and accumulated pose error hinder its application. To address this, we explore a novel C2E (Co-Perception to Eo-Perception) paradigm through the Multi-to-Single (M2S) agent contrastive knowledge distillation framework. Our M2S framework first designs Multi-Level Feature Enhancement module to provide more stable features, and introduces Auxiliary Point Cloud Reconstruction and Multi-Teacher Contrastive Distillation mechanisms to mitigate domain gaps in point cloud and feature distributions within the C2E paradigm. Benefiting from this, our M2S can retain the excellent performance of collaborative perception while effectively avoiding the drawbacks, such as communication delays and positioning errors. Extensive experiments on the V2XSet, V2V4Real and DAIR-V2X datasets show the effectiveness and generalizability of our M2S framework when combined with the state-of-the-art CoSDH model and other excellent 3D detectors. Our M2S framework can deliver up to a 8.64% improvement in 3D mAP performance without introducing any communication costs.
[CV-89] Rethinking Conditional Generation for Underwater Salient Object Detection
链接: https://arxiv.org/abs/2607.01825
作者: Hua Li,Yongjie Weng,Yutong Li,Zhiyuan Li,Runmin Cong,Sam Kwong
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Salient Object Detection in underwater images remains challenging due to low contrast, uneven illumination, and color distortion caused by scattering and absorption effects, which limit the effectiveness of conventional SOD methods in underwater environments. To address these challenges, we propose a Degradation-aware Conditional Generation Network (DCGNet), specifically designed to construct reliable conditional features for underwater saliency generation. First, we design a Dynamic Multi-Granularity module (DMG) grounded in the human visual system to robustly detect salient objects of varying scales with blurred boundaries. Then, we develop an Underwater Physics-Prior module (UPP), which utilizes pseudo-depth guidance to estimate underwater light attenuation and backscatter, thereby restoring degradation-aware RGB features and mitigating color distortion and boundary ambiguity. Based on the physics-guided representation, we introduce an Underwater Spatial Gaussian module (USG), which constructs a spatial Gaussian saliency prior from the strongest guided response to enhance object-centered salient regions and suppress cluttered underwater backgrounds. In addition, a lightweight timestep-adaptive Diffusion Transformer (DiT) bottleneck is inserted into the denoising decoder to refine fused features at different diffusion timesteps. Comprehensive experiments on USOD10K, USOD, CSOD10K, MAS3K, and RMAS demonstrate that DCGNet significantly outperforms existing state-of-the-art methods, verifying its potential for complex underwater visual applications.
[CV-90] MMBench-Live: A Continuously Evolving Benchmark for Multimodal Models
链接: https://arxiv.org/abs/2607.01813
作者: Yuanzhi Liu,Shousheng Zhao,Bo Zhou,Kongming Liang,Zhanyu Ma
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Evaluation benchmarks are essential for assessing vision-language models (VLMs), but most multimodal benchmarks are static, making them vulnerable to temporal staleness, data contamination, and costly maintenance. We present MMBench-Live, a continuously evolving multimodal benchmark built by a multi-agent-driven automated pipeline. Our framework treats benchmark evolution as task-guided dataset construction, integrating structured benchmark specification, feedback-controlled real-time data acquisition, and verifiable QA generation with executable reasoning. To maintain cross-version comparability, we introduce a distribution-consistent update strategy that extracts task-related visual patterns from the original benchmark to guide data collection and filtering. Instantiated from MMBench, MMBench-Live contains 5.9K newly generated evaluation instances with a high answer correctness rate, while each update costs about USD 30 and takes 1-2 hours. Extensive evaluations show that MMBench-Live preserves stable model rankings, maintains semantic alignment with the original benchmark, and exhibits weaker contamination-related memorization signals, suggesting a practical and scalable paradigm for sustainable multimodal benchmark evolution. The project is available at this https URL.
[CV-91] PixGS: Pixel-Space Diffusion for Direct 3D Gaussian Splat Generation ECCV2026
链接: https://arxiv.org/abs/2607.01803
作者: Duy Cao,Phong Nguyen-Ha
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
备注: Accepted at ECCV 2026
Abstract:Recent advances in 3D content generation from text or images have achieved impressive results, yet view inconsistency from 2D generators and the scarcity of high-quality 3D data remain significant bottlenecks. Existing solutions typically adapt large-scale pre-trained text-to-image latent diffusion models to generate 3D Gaussian Splats (3DGS). However, these approaches often rely on training complex cascade pipelines that are computationally expensive and scalability-limited. Most critically, the quality of generated 3D assets is inherently constrained by each component capacity and compressed latent space, leading to decoding artifacts and accumulated errors. To address these limitations, we propose PixGS, a single-stage pipeline for direct high-quality 3DGS generation, which leverages recent advances in pixel-space diffusion to bypass lossy latent compression while still benefiting from the vast 2D generative priors. By directly denoising 3D Gaussian attributes at each timestep, our method enables precise, splat-level regularization of both appearance and geometry. Furthermore, we introduce a comprehensive supervision strategy that incorporates surface normals, depth, and high-frequency structural information, which is often overlooked in prior works. Experiments demonstrate that PixGS outperforms current state-of-the-art methods while maintaining a fast inference speed (1s on a single A100 GPU), offering a robust and efficient alternative to multi-stage generation pipelines.
[CV-92] SpaceEra: A Unified Framework Towards 3D Spatial Reasoning in Video
链接: https://arxiv.org/abs/2607.01784
作者: Weili Guan,Haoyu Zhang,Meng Liu,Qianlong Xiang,Yaowei Wang,Liqiang Nie
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE TPAMI 2026
Abstract:Visual-spatial understanding, defined as the ability to infer object relationships and scene layouts from visual inputs, is fundamental to downstream tasks such as robotic navigation and embodied interaction. However, pre-trained vision-language models (VLMs) remain constrained by spatial uncertainty stemming from inherently 2D observations and by the scarcity of data for 3D spatial understanding. To address these limitations, we proposed a novel framework, SpaceEra, in the NeurIPS 2025 Spotlight paper. Although it achieved significant performance gains, we further observed that its effectiveness is hindered by insufficient input from scanning videos and weak reasoning constraints. To tackle these newly emerged challenges, we extend the original framework into a comprehensive system, termed SpaceEra++, which spans data construction, model design, training optimization, and prompting inference. Specifically, to alleviate input insufficiency, we introduce ScenePick, a frame sampling strategy that balances spatial coverage with object semantics to produce compact yet comprehensive scene representations. In addition, to enhance spatial reasoning, we develop SpaceAlign, which enforces pairwise object constraints by jointly exploiting absolute coordinates and relative spatial relations, thereby aligning optimization with spatial accuracy. Extensive experiments across multiple benchmarks demonstrate consistent improvements over strong baselines, while ablation studies validate both the individual and joint contributions of each component, and further analyses provide guidance for future research.
[CV-93] LLM -Empowered Multimodal Fusion Framework for Autonomous Driving: Semantic Enhancement and Channel-Adaptive Design
链接: https://arxiv.org/abs/2607.01772
作者: Wen Wang,Yaping Sun,Yejun He,Hao Chen,Zhiyong Chen,Xiaodong Xu,Nan Ma,Shuguang Cui
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: 6 pages, 4 figures. Accepted by 2026 IEEE 37th International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC)
Abstract:Vision-radar fusion is central to robust autonomous driving, combining dense visual semantics with precise range and velocity measurements from radar. However, real-world fusion quality is fundamentally challenged by dynamically varying input quality, stemming from occlusion, adverse weather, and channel noise. To address this, we re-frame the problem from static data fusion to channel-aware semantic reasoning and propose a Large Language Model-centric Semantic-layer Channel-aware Integrated Perception (LM-SCIP) framework. It places a Large Language Model (LLM) as a central reasoning core to fuse a local visual stream with a quality-varying external radar stream used to cover perception-blind spots. Concretely, LM-SCIP couples a hierarchical radar-vision encoder with a Channel-Adaptive Semantic Module (CASM) that maps link indicators into a “Channel Prompt” to dynamically gate external radar features. A parameter-efficient, LoRA-tuned LLM, in conjunction with a heterogeneous Mixture-of-Experts (H-MoE), then arbitrates between local visual cues and the channel-conditioned radar context. Finally, a decoupled multi-task decoder outputs localization, trajectory forecasting, and image reconstruction. Experiments on nuScenes and VIRAT validate our approach. On nuScenes, under a controlled toggle of radar input, LM-SCIP reduces localization RMSE by 40.0% versus a vision-only baseline. On VIRAT, the model attains a 0.214m localization RMSE and 0.179m minFDE (k=1). These results reveal that the proposed LM-SCIP enables a robust vision-dominant fallback at low SNR and synergistic fusion at high SNR.
[CV-94] JointHOI: Jointly Generating Contact Maps Enhances Hand Object Interaction Generation
链接: https://arxiv.org/abs/2607.01768
作者: Mingyeong Song,Jungbin Cho,Jisoo Kim,Ananya Bal,Kartik Sharma,Youngjae Yu,Laszlo A. Jeni,Junhyug Noh
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages
Abstract:Text driven hand object interaction (HOI) generation is gaining attention for immersive applications and robotics, yet producing physically plausible interactions remains challenging. Even when individual motions appear natural, small contact errors can cause conspicuous artifacts such as floating and interpenetration. Prior methods mitigate these issues using explicit contact cues or implicit grasp priors, but typically rely on multi stage pipelines and fail to model temporally evolving contact. We present JointHOI, a single stage diffusion framework that jointly generates 3D hand object motion and dynamic, distance based contact maps from text. By treating contact as an auxiliary inner modality, joint generation enables the model to learn contact motion coupling during training. At inference, contact guided sampling enforces consistency between generated contact maps and motion implied geometry, improving temporal stability and reducing penetration and floating. Experiments on GRAB and ARCTIC demonstrate consistent improvements in text adherence and physical plausibility over prior methods.
[CV-95] ProCal: Inference-Time Proposal Calibration for Open-Vocabulary Object Detection
链接: https://arxiv.org/abs/2607.01759
作者: Jae-Ryung Hong,Ho-Joong Kim,Seong-Whan Lee
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Open-vocabulary object detection aims to localize and classify objects beyond the fixed set of categories seen dur ing training. Recent open-vocabulary object detection methods improve localization and classification for unseen categories by leveraging a frozen VLM as a detector backbone. However, VLM classification score lacks recognizing position and scale of the object in an image. We observe that pretrained VLMs en able to classify foreground and background regions. According to this observation, we propose a simple inference-time Pro posal Calibration (ProCal) that improves localization quality of the classification score. ProCal computes a proposal prior by combining two scores: localization-aware foreground score and background-aware suppression score. Localization-aware foreground score captures whether a proposal contains an object area. Background-aware suppression score measures the extent to which the proposal resembles background. We analyze that ProCal suppresses false novel activation on background proposals and consistently ranks true novel proposals above background and partial novel proposals. Applied to CLIPSelf ViT-L/14, ProCal improves APr +2.5 on OV-LVIS. The analyses show that proposal-level localization-aware reranking effects to mitigate ranking miscalibration for novel objects.
[CV-96] DL-VINS-Factory: A Modular Framework for Learned Visual Front-Ends in Visual-Inertial SLAM
链接: https://arxiv.org/abs/2607.01757
作者: Shoon Kit Lim,Melissa Jia Ying Chong,Ting Yang Ling
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Deep-learning features excel in visual matching, yet their practical value in tightly coupled visual-inertial SLAM (VI-SLAM) remains insufficiently characterized. We present DL-VINS-Factory, a unified framework that integrates learned feature extractors (ALIKED, RaCo, SuperPoint, XFeat) with either Lucas–Kanade (LK) optical-flow tracking or LightGlue (LG) descriptor matching. All front-ends share a sliding-window Ceres back-end, with optional AnyLoc DINOv2-VLAD loop closure, and 4-DoF pose-graph optimization. We benchmark the system across the four datasets covering indoor, unstructured outdoor, aggressive-motion, and visually degraded conditions. Results show that learned front-ends are viable for real-time embedded VI-SLAM, but are not universally superior to classical tracking. Relative to the corresponding GFTT+LK baseline, ALIKED+LG reduces EuRoC ATE by 5% in monocular odometry and by 7% in stereo with loop-closure. On NTU-VIRAL, where aggressive aerial motion increases inter-frame viewpoint change, ALIKED+LG stereo reduces loop-closed ATE by 12% . In Botanic Garden dataset, optical-flow tracking remains preferable, but learned keypoints still improve over the baseline GFTT, in which SuperPoint+LK reduces grayscale camera ATE by 29% , while RaCo+LK reduces RGB camera ATE by 38% . On SubT-MRS, learned front-ends display varying degree of improvement based on individual cases. With TensorRT acceleration on a Jetson AGX Orin, all valid configurations run in real time between 29 – 47 FPS in monocular mode and 18 – 33 FPS in stereo mode for the EuRoC and NTU-VIRAL datasets. AnyLoc further confirms roughly 2 – 7\times more valid loops than BRIEF+DBoW2. The implementation is open-sourced at this https URL.
[CV-97] ProSAC-CT: Progressive Spectral-Anatomical Co-Guided Multi-Stage Diffusion Model for Low-Dose CT Denoising
链接: https://arxiv.org/abs/2607.01756
作者: Xuepeng Liu,Zetong Liu,Renyiming Li,Yan Li,Ruiyu Li,Ruili Li,Jiayi Ding,Eichi Takaya
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 8 figures, 3 tables
Abstract:Low-dose computed tomography (LDCT) reduces radiation exposure but introduces stronger quantum noise, streak artifacts, and local texture degradation, which can obscure anatomical boundaries and weaken low-contrast structures. Diffusion models are promising for LDCT denoising by progressively recovering normal-dose CT (NDCT) images from degraded LDCT inputs, but existing methods often suffer from insufficient anatomical guidance, uncertain frequency-dependent recovery, and uniform reverse-process modeling. We propose ProSAC-CT, a progressive spectral-anatomical co-guided multi-stage diffusion model for image-domain LDCT denoising. ProSAC-CT integrates an anatomical-prior-guided conditioning (APGC) module, a residual frequency-domain decoupling stage (RFDDS), and a time-step-decoupling denoising decoder (TD3). APGC extracts LDCT-derived structural guidance, RFDDS enhances frequency-aware representations, and TD3 assigns them to different reverse-diffusion stages for anatomical stabilization, boundary refinement, and fine-detail recovery. Experiments on four LDCT degradation benchmarks show that ProSAC-CT improves image fidelity, structural similarity, perceptual quality, and information preservation over representative methods while better preserving boundary-sensitive anatomical details. Downstream anatomical-region classification on Mayo-2020 further indicates that ProSAC-CT retains task-relevant anatomical information, supporting its practical use for low-dose CT denoising.
[CV-98] he Turning Point of 3D Plant Phenotyping: 3D Foundation Models Enable Minute-to-Second Cross-Crop Reconstruction and Beyond
链接: https://arxiv.org/abs/2607.01753
作者: Hanyue Jia,Wei Zhou,Wenbo Zhou,Yanan Li,Hao Lu,Tingting Wu
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注: 39 pages, 6 figures, 3 tables
Abstract:3D plant phenotyping is notoriously known to be procedure-complicated and of low throughput due to the extensive multi-view imaging, the fragile 3D reconstruction pipeline, and the additional cost from reconstructed geometry to phenotypic extraction. These limitations are further amplified in low-cost data acquisition, where smartphone videos or sparsely sampled multi-view images provide limited view overlap and self-occlusion. In this work, we show that the conventional 3D plant phenotyping pipeline could be streamlined and significantly accelerated with 3D Foundation Models (3DFMs), and particularly, present one of the first cross-crop 3D phenotyping frameworks powered by 3DFMs. The framework replaces COLMAP-style sparse initialization with 3DFM-based feed-forward geometric recovery, combines geometry-constrained 3D Gaussian Splatting for dense reconstruction, enables few-view reconstruction through iterative view synthesis and refinement, and converts reconstructed geometry into measurable organs through 2D-to-3D semantic transfer, metric scale recovery, and organ instance separation. We further construct a cross-crop dataset with smartphone-based image acquisition, diverse plant morphologies, and manual annotations for segmentation and phenotypic evaluation. Experiments across 26 plant sequences show that 3D Foundation Models reduce the average reconstruction time from 6.52 minutes to 1.58 seconds while maintaining high reconstruction quality and phenotyping accuracy. These results suggest a fresh technical route for high-throughput 3D plant phenotyping, from low-cost image acquisition to fast reconstruction, perception, scale recovery, and phenotypic measurement.
[CV-99] MedStreamBench: A Time-Aware Benchmark for Streaming and Proactive Medical Video Understanding
链接: https://arxiv.org/abs/2607.01751
作者: Yuan Wang,Shujian Gao,Songtao Jiang,Zhengyu Hu,Zuozhu Liu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 Pages, 5 Figures
Abstract:Existing medical video benchmarks primarily evaluate whether a model produces the correct answer, but rarely assess whether it answers at the right time. In real clinical settings, AI systems must decide not only what to predict, but also when to answer, defer judgment, or proactively raise alerts. This creates a critical gap between benchmark evaluation and deployment requirements. We present MedStreamBench, a benchmark for time-aware medical video understanding. MedStreamBench integrates 22 medical datasets and 5,419 QA instances across four temporal settings: retrospective, present, future, and proactive. Unlike conventional benchmarks that assume full-video access, MedStreamBench restricts models to temporally bounded evidence windows and supports both single-turn and streaming evaluation. We further introduce a proactive monitoring setting that requires models to determine whether and when clinically relevant alerts should be triggered. Beyond answer correctness, MedStreamBench evaluates temporal behavior through responsiveness and post-evidence stability. Experiments on leading general-purpose and medical vision-language models reveal a substantial gap between offline recognition and temporally grounded decision-making, with performance dropping markedly in streaming and proactive settings. Our benchmark is available at this https URL.
[CV-100] RTE-FM-Dehazer: Radiative Transfer Equation Inspired Flow Matching for Real-World Image Dehazing
链接: https://arxiv.org/abs/2607.01748
作者: Chenfeng Wei,Chun Wang,Boyang Zhao,Si Zuo,Shenhong Wang,Chenguang Yang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Single-image dehazing aims to recover a clear scene from a hazy image and is generally formulated as an image-to-image translation task; however, it faces two limitations. Its performance depends heavily on the haze-formation priors embedded in the model. Prevailing methods adopt the Atmospheric Scattering Model (ASM), whose assumptions of single scattering and homogeneous media are often violated, leading to residual haze and color drift. Moreover, large-scale real hazy/clear pairs are impractical to collect, and existing synthesis approaches fail to reproduce the full complexity of natural haze. To address these issues, we present RTE-FM-Dehazer, a novel dehazing approach, together with a scalable data pipeline. Unlike the ASM, the Radiative Transfer Equation (RTE) jointly accounts for both scattering and absorption, naturally accommodating the non-homogeneous, multiple-scattering media that characterize real hazy scenes. Motivated by the structural similarity between the RTE diffusion-absorption term and the ODE in flow matching, we introduce a diffusion-absorption regularizer derived from a reduced RTE, to steer the flow matching trajectory at each step. Next, leveraging modern vision-language models, we build an automated pipeline and release P-HAZE, a dataset of 50000 realistic hazy/clear pairs. Extensive evaluations demonstrate that RTE-FM-Dehazer, trained solely on P-HAZE, effectively eliminates artifacts like residual haze and color drift, exhibits strong cross-domain generalization, and achieves leading results on five real-world dehazing benchmarks.
[CV-101] InterCMDM: Block-Causal Diffusion for Autoregressive Human Interaction Generation ECCV2026 MDM
链接: https://arxiv.org/abs/2607.01743
作者: Qing Yu,Kent Fujiwara
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ECCV 2026, Project website: this https URL
Abstract:Text-conditioned human interaction generation must capture both long-range temporal causality within each individual and tightly coupled coordination between partners. Existing interaction diffusion models typically denoise full sequences using bidirectional attention, which obscures causality and hinders streaming and long-horizon generation. Autoregressive alternatives enforce causality but often suffer from temporal drift, leading to coordination degradation and unstable interaction dynamics over time. We propose InterCMDM, a block-causal latent diffusion framework for autoregressive two-person interaction generation. InterCMDM introduces a Dual-Stream Causal Diffusion Transformer that maintains separate causal streams for each person while modeling inter-person dependencies via unified dual-stream attention with multi-task attention masks. These masks unify interaction modeling within a single attention mechanism and support diverse coordination behaviors, including simultaneous actions, reactive responses, leader-follower dynamics, and independent motion. By training a single model across these mask configurations as a form of data augmentation, InterCMDM enables controllable interaction generation by simply selecting the desired attention mask at inference time. Finally, a block-wise diffusion objective enables stable latent rollout over long sequences without repeated decode-encode cycles. InterCMDM achieves state-of-the-art performance on InterHuman and Inter-X, improving text-motion alignment, realism, and long-horizon continuity.
[CV-102] ReQuest: Rethinking-based Question-Aware Frame Selection for Long-Form Video QA ECCV2026
链接: https://arxiv.org/abs/2607.01737
作者: Minkuk Kim,Suyong Yun,Young Tae Kim,Jinyoung Moon,Jinwoo Choi,Seong Tae Kim
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ECCV 2026
Abstract:Recent multimodal large language models (MLLMs) have substantially advanced video understanding, yet long-form video QA remains challenging under fixed input token budgets, where uniform sampling can be inefficient for evidence localization. We propose ReQuest , an uncertainty-driven, question-adaptive keyframe selection pipeline that aligns question intent with relevant video content through selective computation. ReQuest integrates (i) a lightweight question-aware selector distilled from MLLM-generated supervision, (ii) Re-thinking Routing that triggers additional inference only when the model is uncertain with a length-adaptive criterion, and (iii) uncertainty-guided adaptive non-maximum suppression that selects temporally diverse frames while adjusting spacing based on question difficulty. As a plug-andplay method, ReQuest improves long-video QA without modifying or fine-tuning the underlying MLLM. Experiments on Video-MME, MLVU, and LongVideoBench demonstrate consistent accuracy gains with competitive computational cost, with particularly strong improvements in medium and long video regimes.
[CV-103] Consistent Scene Understanding in 3D Gaussian Splatting via Multi-Cue Mask Refinement ICPR2026
链接: https://arxiv.org/abs/2607.01708
作者: Hyunjoon Park,Donghyeon Cho
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICPR 2026
Abstract:Reliable instance-level scene understanding is a fundamental prerequisite for object-level interactions and high-fidelity 3D representations. While current methods often leverage 2D foundation segmentation models to obtain these priors, their 2D-centric design typically yields fragmented masks and inconsistent predictions across different views. To address these issues, we propose a novel framework that produces consistent 2D instance masks to guide the optimization of 3D Gaussian Splatting (3DGS) feature fields. Our framework consists of three main stages. (1) Multi-Cue Extraction that generates synergistic semantic, geometric, and structural priors from input images. (2) Multi-Cue-Guided Mask Merging process that consolidates fragmented masks using a composite merge score derived from semantic, depth, and edge cues. (3) Cross-View Mask Matching that establishes globally consistent identity assignments across all viewpoints. By transforming viewpoint-specific segments into coherent 3D primitives, our approach enables stable 3D instance segmentation and effective downstream editing tasks. Experiments demonstrate that our method significantly improves cross-view consistency and segmentation stability over existing baselines while maintaining high-fidelity photometric reconstruction.
[CV-104] LASER: A Corrective Lens for LVLMs via Visual Attention Preservation and Sink Suppression ECCV2026
链接: https://arxiv.org/abs/2607.01707
作者: Bowen Yuan,Zijian Wang,Yadan Luo,Shijie Wang,Zi Huang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The 19th European Conference on Computer Vision (ECCV 2026)
Abstract:Large vision-language models (LVLMs) exhibit strong reasoning ability but suffer from visual forgetting during long-horizon decoding, where attention progressively drifts away from visual evidence. Existing methods largely treat this issue as a late-stage attention decay problem or attempt to mitigate it through heuristic reminders or post-hoc attention lifting. Through systematic empirical analysis, we find that performance degradation under visual forgetting is largely driven by two overlooked factors: early-stage attention decay disrupts evidence acquisition, and attention concentration on a subset of task-irrelevant visual sink tokens. Motivated by these insights, we propose LASER, a post-training framework that regulates both the visual attention trajectory and intra-visual token attention distribution during reasoning. Technically, LASER introduces two complementary rewards: a Visual Grounding Reward, which encourages the model to maintain attention on semantically salient visual tokens throughout decoding, and a Sink Suppression Reward, which penalizes excessive attention concentration on visual sink tokens. Together, these rewards preserve early-stage grounding while preventing attention collapse onto uninformative regions. Extensive experiments on eight benchmark datasets demonstrate that LASER consistently outperforms strong baselines, validating attention-aware training as an effective remedy for visual forgetting.
[CV-105] Structure-Aware Gaussian Splatting for Large-Scale Scene Reconstruction
链接: https://arxiv.org/abs/2607.01698
作者: Weiyi Xue,Fan Lu,Chi Zhang,Tianhang Wang,Sanqing Qu,Zehan Zheng,Boyuan Zheng,Junqiao Zhao,Guang Chen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D Gaussian Splatting has demonstrated remarkable potential in novel view synthesis. In contrast to small-scale scenes, large-scale scenes inevitably contain sparsely observed regions with excessively sparse initial points. In this case, supervising Gaussians initialized from low-frequency sparse points with high-frequency images often induces uncontrolled densification and redundant primitives, degrading both efficiency and quality. Intuitively, this issue can be mitigated with scheduling strategies, which can be categorized into two paradigms: modulating target signal frequency via densification and modulating sampling frequency via image resolution. However, previous scheduling strategies are primarily hardcoded, failing to perceive the convergence behavior of scene frequency. To address this, we reframe the scene reconstruction problem from the perspective of signal structure recovery and propose SIG, a novel scheduler that synchronizes image supervision with Gaussian frequencies. Specifically, we derive the average sampling frequency and bandwidth of 3D representations, and then regulate the training image resolution and the Gaussian densification process based on scene frequency convergence. Furthermore, we introduce Sphere-Constrained Gaussians, which leverage the spatial prior of initialized point clouds to control Gaussian optimization. Our framework enables frequency-consistent, geometry-aware, and floater-free training, achieving state-of-the-art performance by a substantial margin in both efficiency and rendering quality in large-scale scenes. The code is available at: this https URL
[CV-106] ICDepth: Taming Video Diffusion Models for Video Depth Estimation via In-Context Conditioning ECCV2026 ICDE
链接: https://arxiv.org/abs/2607.01677
作者: Xuanhua He,Jiaxin Xie,Mingzhe Zheng,Qifeng Chen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ECCV 2026. Project page: this https URL
Abstract:Monocular video depth estimation requires temporal consistency, geometric accuracy, and generalization across diverse scenarios, yet existing methods struggle to achieve all three simultaneously. Discriminative models excel at per-frame accuracy but suffer from temporal drift due to limited context windows, while generative methods improve consistency and generalization at the cost of extensive training data (10M+ samples) and lack of geometric precision. In response to these issues, we introduce \textbfICDepth, a framework that adapts pre-trained text-to-video diffusion transformers for video depth estimation via In-Context Conditioning (ICC), leveraging their rich spatial-temporal priors. To address key challenges in transferring ICC from generation to dense prediction, we propose: (1)~\textbfSAND-Attention, which ensures precise spatial-temporal alignment via shared RoPE and enforces unidirectional attention to prevent noise contamination; (2)~\textbfSRFM, which injects DINOv2 semantic and resolution priors to enhance geometric precision. ICDepth achieves state-of-the-art results on multiple benchmarks with remarkable data efficiency, trained on only 0.8M frames ( 6 – 13\times less than competing generative methods), while demonstrating strong zero-shot generalization to diverse domains.
[CV-107] HistoSeg: Delving deeper with attention and multiscale feature fusion for biomarker segmentation
链接: https://arxiv.org/abs/2607.01675
作者: Saad Wazir,Rao Faizan,Daeyoung Kim
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published in the Proceedings of ICBBE 2025. The Version of Record is available at this https URL
Abstract:Segmentation of biomarkers in medical images is frequently viewed as a first step towards medical image analysis in any bioinformatics or biomedical application. Despite progress, existing methods still struggle to capture information at multiple scales and to perform upsampling effectively across different datasets. These shortcomings often result in suboptimal generalization capabilities. Recently, architectures belonging to the Nested-UNet family excel in capturing multiscale contextual information and upsample them effectively. In this work, We propose a novel Nested-UNet architecture that effectively captures multi-scale contextual information. It includes inner and outer attention units to enhance focus during upsampling, along with channel-wise feature recalibration using squeeze-and-excitation modules, leading to improved segmentation performance. Additionally, the architecture integrates an edge-aware loss to emphasize boundary accuracy by assigning greater importance to edge regions. Tested extensively on three publicly available benchmark datasets. Our method demonstrates a generalization performance superior to existing Nested-UNet methods. Code: this https URL
[CV-108] mporal and Cross-Modal Alignment for Enhanced Audiovisual Video Captioning ECCV2026
链接: https://arxiv.org/abs/2607.01667
作者: Chen Zhao,Jiajun Ma,Qilong Huang,Tiehan Fan,Hongyu Li,Zhuoliang Kang,Xiaoming Wei,Jian Yang,Ying Tai
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ECCV 2026
Abstract:While Multimodal Large Language Models (MLLMs) have advanced video understanding, achieving precise temporal and cross-modal alignment in audiovisual video captioning remains a formidable challenge. Most existing approaches suffer from modality detachment and temporal incoherence, failing to accurately bind auditory events to visual entities or capture complex causal dynamics. To address these deficiencies, we propose TCA-Captioner, a framework specifically engineered to enhance Temporal and Cross-Modal Alignment for audiovisual video captioning. We first introduce the Observer-Checker-Corrector (OCC) framework, an iterative refinement strategy that generates high-fidelity, meticulously grounded training data. Leveraging a curated high-density human interaction dataset, TCA-Captioner is optimized to model sophisticated audiovisual interactions. Furthermore, we present TCA-Bench, a diagnostic benchmark utilizing a Decoupled Evaluation Protocol to isolate and quantify model proficiency in audiovisual binding and temporal relational reasoning. Extensive experiments demonstrate that TCA-Captioner sets a new standard for temporally-coherent and synchronized audiovisual narratives.
[CV-109] Unified Panoramic-Gaussian Representation for Monocular 4D Scene Synthesis ECCV2026
链接: https://arxiv.org/abs/2607.01663
作者: Yuankun Yang,Yi Wei,Wenyang Zhou,Li Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ECCV 2026
Abstract:4D scene synthesis from monocular videos has made significant progress in recent years. However, existing methods are typically constrained by view interpolation. As a result, they struggle to infer unseen regions beyond the observed views. In this paper, we reformulate the task as 4D scene synthesis with unseen regions, which extends beyond traditional interpolation settings. Camera-conditioned video generation enables unseen region synthesis by guiding generation along specified cameras. However, these methods lack explicit 3D priors and are optimized with random camera trajectories. This design leads to severe inconsistencies under large trajectory deviations. To address this limitation, we build a unified training and inference framework with panoramic trajectory guidance. While this design improves cross-view consistency, the panoramic representation alone fails to model dynamic content effectively. Object motion in panoramic space introduces scale and shape distortions. To address this, we propose PanoGaussian, a unified Panoramic-Gaussian representation that distills the panoramic representation into an explicit dynamic Gaussian representation to capture dynamic physical priors of the 4D scene. Experiments demonstrate that PanoGaussian achieves consistent 4D scene synthesis even under large viewpoint variations.
[CV-110] aching Vision-Language-Action Models What to See and Where to Look ECCV2026
链接: https://arxiv.org/abs/2607.01658
作者: Yuguang Yang,Canyu Chen,Zhewen Tan,Yizhi Wang,Zichao Feng,Chunyang Liu,Kehua Sheng,Juan Zhang,Linlin Yang,Baochang Zhang,Yan Wang,Bo Zhang,Xianbin Cao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The paper has been accepted by ECCV 2026
Abstract:Vision-Language-Action (VLA) models have emerged as a promising paradigm for end-to-end autonomous driving. However, existing VLAs’ training relies heavily on text-centric visual question answering and chain-of-thought reasoning data, which emphasizes linguistic reasoning rather than action-grounded planning. As a result, the learned representations capture semantic knowledge but lack spatial dependencies crucial for reliable trajectory prediction. We propose DriveTeach-VLA, a framework that explicitly teaches VLAs what to see and where to look. Driving-aware Vision Distillation (DVD) injects driving-specific perceptual priors into the vision encoder, while 2D Trajectory-Guided Prompts (2D-TGP) provide spatial conditioning aligned with feasible driving trajectories. Together, they form a vision-guided learning pipeline: what to see (DVD pretraining) - where to look (TGP-guided SFT) - how to act (TGP-guided GRPO). DriveTeach-VLA achieves the state-of-the-art performance on NAVSIM and nuScenes. Our code is available at: this https URL.
[CV-111] Domain Generalization via Text-Anchored Information Bottleneck ECCV2026
链接: https://arxiv.org/abs/2607.01657
作者: Eunyi Lyou,Yunjeong Choi,Junho Lee,Joonseok Lee
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ECCV 2026
Abstract:Visual recognition models often fail when deployed in new environments. Domain Generalization (DG) addresses this by learning representations that remain invariant to environment-specific variations. Recent approaches increasingly rely on large vision-language models, assuming that preserving their expressive visual representations improves robustness. However, we show that such visual expressiveness can instead propagate spurious cues that tie representations to the training environments, hindering invariant learning. We therefore discard visual guidance and instead treat the language embedding space as the primary source of domain invariance, naturally acting as an information bottleneck that preserves core semantics while suppressing domain-specific variations. Extensive experiments across diverse backbones exhibit state-of-the-art performance and further analyze what makes guidance effective for robust generalization. These findings shift the focus of DG from improving representations to designing supervision that enforces invariance.
[CV-112] Plug-and-Play Volumetric Reconstruction for Compressive Sensing Light-Sheet Microscopy
链接: https://arxiv.org/abs/2607.01654
作者: Jianqing Jia,Yi Gong,Xinyuan Zhang,Jichen Chai,Yichen Ding,Yifei Lou
类目: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
备注:
Abstract:We investigate volumetric reconstruction for compressive sensing light-sheet microscopy (CS-LSM), where fast volumetric imaging is achieved by encoding multiple axial planes into each camera exposure. To recover the underlying volume from highly multiplexed measurements, we propose a plug-and-play (PnP) framework that flexibly incorporates any user-specified denoiser into the reconstruction process. Building on a slice-based formulation, we further introduce an axial-coupled model that exploits correlations between adjacent slices to improve volumetric continuity. For efficient computation, we derive a Woodbury-based update for the data-consistency step in both the slice-based and axial-coupled formulations, and employ a Gauss-Seidel sweep for the denoising step in the axial-coupled model. Under a weakly convex regularization assumption, we establish subsequential convergence of the proposed algorithm. Experiments on synthetic and real zebrafish-heart data demonstrate that the proposed framework successfully recovers cellular structures from compressed measurements, and provide practical insights into the comparative performance of commonly used denoisers within the PnP framework under the CS-LSM setup.
[CV-113] Boosting Ultrasound Image Classification via Attribute-Guided Dual-Branch Framework MICCAI2026
链接: https://arxiv.org/abs/2607.01648
作者: Bo Zhao,Yapeng Li,Juhua Liu,Bo Du
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by MICCAI 2026
Abstract:Ultrasound image classification is essential for computer-aided diagnosis. However, current methods often neglect clinical priors, leading to poor generalization in challenging scenarios and a lack of interpretability that limits clinical adoption. To address these issues, we aim to develop a medical-prior module that can be seamlessly integrated into existing pipelines to enhance both diagnostic performance and interpretability. In this paper, we propose an attribute-guided dual-branch framework for ultrasound classification that introduces domain-agnostic medical attribute priors, improving generalization while offering interpretable evidence. Specifically, a baseline branch follows conventional architectures and predicts image categories via a fully connected classifier. An attribute-guided branch injects domain-agnostic attributes as priors and produces human-interpretable decision cues. Finally, an adaptive decision module fuses the two branches in a data-dependent manner to yield the final prediction. Experiments across diverse ultrasound classification tasks demonstrate that our approach can be integrated into multiple backbones and state-of-the-art methods with low overhead, consistently improving accuracy and interpretability. Code is available at: this https URL.
[CV-114] Multi-Resolution Flow Matching: Training-Free Diffusion Acceleration via Staged Sampling
链接: https://arxiv.org/abs/2607.01642
作者: Xingyu Zheng,Xianglong Liu,Yifu Ding,Weilun Feng,Junqing Lin,Jinyang Guo,Haotong Qin
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The code is available at this https URL
Abstract:Hardware-agnostic strategies for accelerating text-to-image diffusion, such as timestep distillation and feature caching, can reduce inference time without custom kernels or system-level optimization. Among them, multi-resolution generation strategies have recently received broad attention, attaining more than 5x speedup without any training. However, the design of performing upsampling in the latent space, together with the selective modification of partial regions, causes these methods to exhibit noticeable blurring or artifacts. To this end, we propose MrFlow, a training-free multi-resolution acceleration strategy for pretrained flow-matching models built upon a staged low-to-high-resolution pipeline. MrFlow first rapidly generates the main structure at low resolution, then performs super-resolution in the pixel space using a lightweight pretrained GAN-based model, subsequently injects low-strength noise to enable high-frequency resampling, and finally refines the details at high resolution. Quantitative and qualitative results on FLUX.1-dev and Qwen-Image show that MrFlow exploits the quadratic token reduction and reduced step requirement of low-resolution sampling to achieve 10x end-to-end acceleration while keeping OneIG within a 1% gap relative to that before acceleration, significantly surpassing other training-free acceleration strategies, and requiring no training or runtime dynamic identification whatsoever. MrFlow can further be directly combined orthogonally with pre-trained timestep distillation strategies, achieving even higher generation acceleration of up to 25x.
[CV-115] Bridging 3D Gaussians and Semantic Occupancy for Comprehensive Open-Vocabulary Scene Understanding from Unposed Images
链接: https://arxiv.org/abs/2607.01633
作者: Hu Zhu,Bohan Li,Xianda Guo,Yanlun Peng,Zheng Zhu,Xin Jin,Wenjun Zeng,Chang Wen Chen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Hu Zhu, Bohan Li, and Xianda Guo contributed equally. Corresponding author: Wenjun Zeng
Abstract:Comprehensive 3D scene understanding from sparse, unposed images requires a model to recover renderable geometry, open-vocabulary semantics, and free/occupied 3D space without relying on external camera calibration. Recent feed-forward Gaussian methods improve pose-free reconstruction and semantic rendering, but their Gaussian primitives are mainly optimized through image-space objectives and remain weakly constrained in unobserved regions. We propose \textitCOVScene, a pose-free semantic Gaussian framework that couples renderable Gaussian primitives with a dense semantic occupancy field through differentiable volumetric lifting. Instead of converting Gaussians to voxels only at evaluation time, COVScene lifts the predicted semantic Gaussians inside the training computation graph, so volumetric regularization provides gradients to Gaussian opacity, geometry, and semantic features. The framework combines a semantic-aware Geometry Transformer, multi-task Gaussian decoding, geometric foundation distillation, and occupancy entropy regularization to support novel view synthesis, open-vocabulary semantic querying, and semantic occupancy prediction within a single representation. Experiments on ScanNet and ScanNet++ show that COVScene maintains competitive rendering quality, improves open-vocabulary segmentation, and achieves stronger semantic occupancy prediction than the self-supervised baseline without direct voxel-level supervision.
[CV-116] DRDN: Decoupled Representation Dynamic Network for From-Scratch ViT Class-Incremental Learning
链接: https://arxiv.org/abs/2607.01630
作者: Bingchen Huang,Yifu Chen,Zhiling Wang,Yuanchao Du
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, IEEEtran journal format. Preprint submitted to IEEE Transactions on Multimedia
Abstract:Dynamic expansion methods for class-incremental learning (CIL) protect task-specific knowledge by growing dedicated tokens or subnetworks, yet our analyses suggest that classification supervision alone does not sufficiently preserve task-agnostic shared backbone representations over long incremental sequences. We identify two intertwined challenges: cross-task confusion from sequential training on predominantly current-task data, which biases decision boundaries toward recent tasks; and under-optimized shared representations in the backbone that cap long-term discriminability as tasks accumulate. We propose the Decoupled Representation Dynamic Network (DRDN), which addresses these challenges via two orthogonal mechanisms. For shared backbone representations, DRDN continuously applies masked image modeling (MIM) at every incremental step, with reconstruction gradients routed exclusively through the backbone, encouraging it to retain general visual structure beyond class-discriminative cues. For task-specific discrimination, DRDN employs hierarchical task token expansion across all transformer layers, with a modified per-task attention rule that reduces inter-task interference. We support this design with accuracy degradation analysis and cross-task confusion rate measurements. In the from-scratch ViT CIL setting (no external pretraining), DRDN consistently improves over strong token-expansion baselines with comparable backbone scale. On CIFAR100-B0 (10 steps), DRDN achieves 77.19% average accuracy, outperforming DKT by 1.36 points and DyTox by 3.53 points, with an advantage that grows at longer incremental sequences. Multi-seed validation confirms stability (+/-0.31%). The MIM decoder is active only during training, adding no inference-time parameters or computation. Comments: 10 pages, IEEEtran journal format. Preprint submitted to IEEE Transactions on Multimedia Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2607.01630 [cs.CV] (or arXiv:2607.01630v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2607.01630 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-117] Online Segment 3D Gaussians via Launching Virtual Drones
链接: https://arxiv.org/abs/2607.01628
作者: Liwei Liao,Rongjie Wang,Ronggang Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Interactive segmentation of 3D Gaussians offers a compelling opportunity for real-time manipulation of 3D scenes, thanks to the real-time rendering capability of 3D Gaussian Splatting (3DGS). However, existing methods require a time-consuming per-scene setup - typically tens of seconds or even minutes - before interactive segmentation can begin on a raw 3DGS scene. This setup involves multi-view mask preparation, mask lifting, and feature distillation, creating a major bottleneck for online applications. To address this limitation, we aim to completely eliminate the setup stage for interactive 3DGS segmentation while keeping the segmentation time practical (under 1 second). In this work, we present SAGO (Segment Any Gaussians Online), a novel setup-free framework for interactive 3DGS segmentation. By introducing virtual drones, our method reframes the 3D segmentation problem as an online Next-Best-View (NBV) planning task formulated within a Markov process. Extensive experiments demonstrate that SAGO can extract clean 3D assets directly from 3D Gaussians with sub-second latency, thereby enabling a broad range of downstream applications such as object manipulation and scene editing. Moreover, our method achieves over a 50x speedup compared to the previous setup-free 3DGS segmentation frameworks. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2607.01628 [cs.CV] (or arXiv:2607.01628v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2607.01628 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-118] Multi-THuMBS: Multi-person Tracking of 3D Human Meshes Beyond Video Shots
链接: https://arxiv.org/abs/2607.01626
作者: Jeongwan On,Muhammad Salman Ali,Muneeb A. Khan,Sunwoo Park,Inwoong Moon,Hyung Jin Chang,Jaekwang Kim,Seong Jong Ha,Seungryul Baek
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Tracking multi-person 3D human meshes from in-the-wild videos is a highly challenging problem due to complex interactions, frequent occlusions, and severe truncation inherent in unconstrained environments. While recent approaches have improved robustness against these issues, they largely overlook the critical challenge prevalent in real-world footage: frequent shot changes. These abrupt transitions in camera viewpoints often cause existing methods to lose track of human identities and fail in reconstructing temporally coherent trajectories. Although several recent works have explored 3D human mesh tracking under shot changes, they are still limited to single-person scenarios, making them inadequate for real-world videos where multiple people interact and appear simultaneously. To address this limitation, we propose Multi-THuMBS (Multi-person Tracking of 3D Human Meshes Beyond Video Shots) that leverages a state-of-the-art 3D scene prior to reconstruct the two boundary frames in a single shared 3D space. Human meshes are then registered within the shared 3D space, maintaining per-person identity and motion consistency across shot changes. Extensive experiments demonstrate that our approach yields significant improvements in 3D human mesh recovery, camera pose estimation, and identity tracking, thereby ensuring high-fidelity motion reconstruction with consistent identity preservation across shots compared to previous state-of-the-art methods.
[CV-119] VLAFlow: A Unified Training Framework for Vision-Language-Action Models via Co-training and Future Latent Alignment
链接: https://arxiv.org/abs/2607.01586
作者: Guoyang Xia,Fengfa Li,Hongjin Ji,Lei Ren,Fangxiang Feng,Kun Zhan,Yan Xie
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:Vision-language-action models (VLAs) have recently advanced robotic manipulation, yet the effects of different robot-data pre-training paradigms remain difficult to compare because existing models often differ in architecture, data, action space, and evaluation protocol. We present VLAFlow (Vision-Language-Action Flow), a unified flow-matching framework for controlled comparison of VLA training objectives. Using a heterogeneous robot corpus, OXEMix, containing approximately 5,000 hours of data from DROID, OpenX-Embodiment, OpenX-Augmented, and RoboCOIN, we evaluate four paradigms under the same pi0-style architecture, shared VLM backbone, action expert, and 14-dimensional action space: action-only modeling (MindPI), language-supervised co-training (MindLPI), future latent alignment (MindWPI), and their combination (MindLWPI). Experiments on LIBERO, LIBERO-Plus, and SimplerEnv show that action-only pre-training is sensitive to heterogeneous data. In contrast, language supervision helps preserve vision-language generalization, while future latent alignment improves state-transition and action-outcome modeling. By combining both signals, MindLWPI achieves the most stable overall transfer performance across benchmarks. These results suggest a meta-action space view: language and future latent representations provide complementary intermediate constraints that make heterogeneous action supervision smoother and more transferable.
[CV-120] MVFusion-GS: Motion-Variance Guided Temporal Attention for High-Quality Dynamic Gaussian Splatting
链接: https://arxiv.org/abs/2607.01578
作者: Jianwei Hu,Tingxuan Huang,Hengyu Zhou,Ningna Wang,Xiaohu Guo Jinshan Lai,Bin Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D Gaussian Splatting (3DGS) enables real-time novel view synthesis for static scenes. Extending it to dynamic scenes via deformation fields has recently attracted significant attention, particularly for dynamic scene reconstructionband distractor-free. However, existing deformation networks lack explicit motion awareness: they neither capture long-term motion intensity nor exploit short-term temporal coherence, leading to inaccurate foreground deformation and pseudo-static residuals in the background. We present MVFusion-GS, a method that enhances deformation networks with two complementary motion-aware mechanisms. The Motion-Variance Guided Refinement aggregates per-Gaussian deformation statistics across time to estimate motion variance and uses it to guide dynamic-static separation during deformation prediction. The MotionFormer Temporal Attention module applies Transformer self-attention over neighboring timesteps to model local motion dependencies and improve temporal consistency. Extensive experiments on both dynamic scene reconstruction and distractor-free reconstruction benchmarks demonstrate state-of-the-art performance, showing that explicit motion awareness improves both foreground motion modeling and static background reconstruction.
[CV-121] Mind the Gap: Standard 3DGS Evaluation Primarily Measures Near-Trajectory Interpolation
链接: https://arxiv.org/abs/2607.01556
作者: Gaoxiang Jia,Vikram Appia
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Standard MipNeRF360-style 3D Gaussian Splatting (3DGS) evaluation holds out every N-th frame – but these frames have trained neighbors on both sides, so the metric measures near-trajectory interpolation rather than spatial generalization. We introduce a fair matched-count protocol that isolates this effect: both arms train on the same number of images and differ only in whether the holdout is spread evenly (interpolation) or forms a contiguous spatial sector (extrapolation). Our primary finding is a large, consistent interpolation-extrapolation gap of 3~12dB – several times the differences typically reported between competing methods. The gap is robust to training noise, is in two cases large enough to flip a method ranking under multi-seed confirmation, and – crucially – persists across three representation families, including a non-Gaussian volumetric neural radiance field (NeRF), so it reflects spatial coverage rather than any one representation. Diagnostically, it is dominated by a diffuse/geometry-proxy component and tracks each view’s angular distance to its nearest training view, a zero-cost signal that also guides capture planning; loss-side regularization yields only marginal gains. Standard holdouts remain useful for near-trajectory rendering but should not, alone, be read as evidence of spatial generalization. Prior work notes protocol sensitivity; ours is, to our knowledge, the first to combine matched-count paired holdout, cross-representation quantification, and a diagnostic analysis Table 1. We describe a spatial-holdout benchmark toolkit with standardized splits and baselines for 16 scenes, which we are preparing for public release.
[CV-122] Boosting Infrared Small Target Detection via Logit-Domain Contrast and Adaptive Shape Refinement
链接: https://arxiv.org/abs/2607.01555
作者: Handong Zeng,Zhengeng Yang,Shuai Zhang,Shikai Chen,Hongshan Yu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been submitted to the IEEE for possible publication
Abstract:Infrared small target detection (IRSTD) remains challenging due to tiny target size, low signal-to-noise ratio, severe foreground-background imbalance, and blurred boundaries in complex scenes. Existing methods usually rely on post-activation probability-domain supervision for discrimination, where weak targets and strong clutter may produce saturated and close probabilities, limiting weak-target discrimination. Meanwhile, blurred boundaries and halo-like predictions mainly stem from thermal diffusion, tiny target scale, boundary uncertainty, and insufficient explicit contour constraints. To address these issues, we propose Adaptive-Contrastive SLSIoU (AC-SLSIoU), a plug-and-play discriminative and shape-aware loss for IRSTD. Specifically, a Logit-Domain Margin Constraint (LDMC) is introduced to enlarge the response gap between targets and informative hard negatives in the logit space, thereby enhancing weak-target discrimination. Adaptive Boundary Suppression (ABS) applies scale-aware annular penalties to refine target contours and suppress halo-like overflow responses. In addition, False-Alarm Focal Loss assigns larger weights to high-probability negative samples, further penalizing persistent high-confidence false alarms. Without introducing extra inference overhead, the proposed method can be seamlessly integrated into existing detectors and consistently improves both detection accuracy and shape quality. Extensive experiments and cross-backbone evaluations demonstrate the effectiveness, robustness, and generalization ability of the proposed method for infrared small target detection.
[CV-123] Hidden-Shot: Towards One-Shot Task Generalization for Low-Level Vision Generalist Models
链接: https://arxiv.org/abs/2607.01535
作者: Shao-Jun Xia,Xianzheng Ma,Zichong Meng
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 34 pages, 5 figures, under submission
Abstract:Despite the intense engagement surrounding low-level vision generalist models, their effectiveness in zero/few-shot scenarios beyond learned tasks remains unverified. The primary challenge of developing an ideal generalist lies in achieving the ability to generalize from new unseen tasks, which also can be assessed by matched quantitative criteria. Existing methods have made some progress in prompt engineering but have not systematically explored this gap across a wide range of low-level visual tasks. Stimulated by the problem, we propose Hidden-Shot, an implicit prompt mechanism aimed at exploring low-level task adaptation in a vision generalist model. Specifically, the method extracts implicit visual task-based information, utilizes a global task-aware textural prompt, and selectively merges implicit information with in-task processing information to enhance one-shot capabilities in new tasks. The overall design performs direct injection in a cost-effective manner, while minimally altering the architecture of the original generalist model. Additionally, we introduce a data-driven evaluation framework termed C/U assessment to cover two basic scenarios, 3C4U (3 conventional and 4 unconventional tasks) for retraining existing models and 3C7U (3 conventional and 7 unconventional tasks) for training from scratch, as a comprehensive assessment to systematically test the generalization ability of low-level generalist models. Experiments on seven and ten datasets outperform the state-of-the-art vision generalist model, respectively verified by 3C4U and 3C7U framework. Our presented Hidden-Shot approach demonstrates superior performance on one-shot new tasks while maintaining consistent performance on existing tasks.
[CV-124] Disentangling Pictorial Cue Understanding from Language Bias in VLMs via Depth Ordering Task ECCV2026
链接: https://arxiv.org/abs/2607.01503
作者: Yiqian Liu,Iuliia Kotseruba,John K. Tsotsos
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 7 figures, accepted to ECCV 2026 (30 pages, 13 figures, supplementary materials included)
Abstract:In this paper, we study depth perception of vision-language models (VLMs) to isolate the effects of pictorial depth cues and disentangle vision and language influences on model performance. To this end, we combine depth-ordering and odd-one-out psychophysical tasks: the VLMs are presented with images where one object is at different depth relative to other, otherwise identical, objects, and must determine whether the odd-one-out target is closer or farther to the observer. To create stimuli, we generate 2D views from simulated and real 3D scenes while controlling the presence of individual pictorial depth cues, enabling a fine-grained analysis of cue-level contributions. Language effects are examined by varying referring expression clarity. We also introduce a novel metric to quantify vision-vs-language sensitivities. Applying this methodology, we create the Odd-One-Out Depth (O3-D) dataset with 37K real and synthetic images and 147K image-question pairs. Evaluation of 12 open-source and commercial models on O3-D shows under-utilization of depth cues and depth-ordering accuracies between 47% and 56%, with no model above chance level. At the same time, our metric reveals strong linguistic bias in the answers. Neither chain-of-thought (CoT) nor in-context learning (ICL) significantly improves performance, suggesting that static image data alone may be insufficient for depth understanding. All code, the image generation pipeline, and the O3-D dataset are publicly released at this https URL.
[CV-125] Anti-Prompt: Image Protection against Text-Guided Image-to-Video Generation ECCV2026
链接: https://arxiv.org/abs/2607.01499
作者: Yeonghwan Song,Chanhui Lee,Jinsoo Park,Jeany Son
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ECCV 2026
Abstract:Recent advances in Image-to-Video generation allow a single image to be animated into a convincing video under text guidance, raising serious copyright and privacy risks. We propose Anti-Prompt, an image protection approach that injects imperceptible perturbations into an image, inducing visible inconsistencies and structural failures in text-guided I2V generation. Our method is motivated by a simple empirical observation. When text guidance is removed from modern I2V models, generation quality degrades markedly, not only in motion realism but also in subject preservation, structural coherence, and temporal consistency. Building on this insight, Anti-Prompt exploits the model reliance on textual guidance by attenuating text-conditioned interactions during denoising while strengthening visual-only pathways. To further systematically evaluate protection effectiveness, we introduce a Video-LLM-assisted evaluation protocol that provides interpretable, frame-grounded analyses of generation artifacts and inconsistencies. Experiments on two representative I2V architectures demonstrate that our method achieves strong protection performance while improving efficiency and cross-model transferability.
[CV-126] A Cost-Aware Paired Protocol for Auditing Dynamic Tool Synthesis in Agent ic Video Question Answering
链接: https://arxiv.org/abs/2607.01469
作者: Aseel Mohamed,Rama AlHamidi,Mohamed Rayan Barhdadi,Rasul Khanbayov,Erchin Serpedin,Hasan Kurban
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Agentic Video Question Answering (VideoQA) systems invoke tools during inference, but their tool libraries are fixed, so recurring procedures are rebuilt from primitives on every question. Synthesizing composite tools could remove this overhead, but whether such expansion helps is hard to assess: final-answer accuracy, the standard metric, ignores inference effort, so it cannot reveal how a system shifts cost. We propose a cost-aware, paired protocol for auditing tool-augmented video agents. The protocol pairs two complete systems on the same input for each question and reports their net difference across accuracy and cost jointly. For each question, it sorts the paired outcome into one of six groups defined by joint correctness and by the change in visible tool calls, separating accuracy-preserving efficiency gains from harmful regressions. Significance is reported with McNemar’s test and paired bootstrap confidence intervals. We instantiate the protocol on Dynamic-SAGE, an agentic VideoQA framework that synthesizes, validates, and persistently registers executable composite tools for reuse on unseen questions, and evaluate it against the SAGE baseline on SAGE-Bench. The audit reveals a multi-axis profile that a scalar accuracy comparison would miss: Dynamic-SAGE improves accuracy by 7.5 points (p 0.001) and reduces reasoning turns and visible tool calls by roughly 28%, while shifting rather than reducing inference cost, as token usage rises 34% and cost 26%. Gains are largest on visual and open-ended questions and neutral on verbal and multimodal ones, and residual failures concentrate on hard, open-ended questions where the pipeline does the most work. By measuring accuracy and cost jointly, the protocol shows where the pipeline-level difference is reliable and where it is not. The code is available at this https URL.
[CV-127] From Forgeries to Foundation Models: A Systematic Survey of Identity Document Attack and Detection
链接: https://arxiv.org/abs/2607.01442
作者: Gourab Das,Pavan Kumar C,Raghavendra Ramachandra
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Identity document forgery has undergone a fundamental capability shift: generative AI tools now enable high-fidelity document synthesis and field-level manipulation with minimal technical expertise, while detection methods remain constrained by benchmarks that do not reflect this threat. The resulting attack surface spans physical presentation, digital injection, and fully generative synthesis, introducing distinct forensic failure modes that require a unified threat model and evaluation framework. This survey provides, to our knowledge, the first unified treatment of Presentation Attacks, Digital Injection Attacks, and GenAI-driven synthesis within a single identity verification threat model. We trace detection methodologies from rule-based heuristics through forensic localisation, injection-aware pipelines, foundation models, and few-shot frameworks. A systematic audit of public datasets from 2019–2025 exposes a persistent Reality Gap between benchmark conditions and operational deployment. We further analyse large multimodal models for identity document manipulation, identifying Script-Dependent Generative Instability (SDGI) as a recurring typographic failure mode in non-Latin script inpainting. Finally, zero-shot benchmarking on unseen synthesised ID cards shows that even the strongest publicly available models achieve APCER values above 25% under security-oriented operating conditions, highlighting substantial limits in cross-domain generalisation. We conclude by outlining future directions toward forensically grounded, privacy-preserving, and legally accountable identity verification systems.
[CV-128] How Much Future Helps? A Controlled Study of Future-Privileged Supervision for Causal Egocentric Gaze Estimation CVPR2026
链接: https://arxiv.org/abs/2607.01437
作者: Jia Li,Wenjie Zhao,Fnu Atisri,Sanskriti Aripineni,Shijian Deng,Jon E. Froehlich,Yuhang Zhao,Yapeng Tian
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the 7th International Workshop on Eye and Gaze in Computer Vision (GAZE 2026), CVPR 2026. Best Paper Award
Abstract:Egocentric gaze estimation is commonly studied using models that process the full video with access to future frames, while real-world applications require strictly causal, online prediction. This discrepancy raises key questions: Does future context inherently provide valuable signals for gaze estimation? If so, how much future look-ahead optimally supervises a causal model during training? To investigate, we propose a controlled framework featuring a future-aware branch that accesses a tunable look-ahead horizon during training but is discarded at inference. This design isolates the impact of future context while keeping the inference architecture fixed and strictly causal. Across EGTEA Gaze+ and Ego4D, we find that future-privileged supervision consistently improves causal gaze prediction, confirming its utility. However, performance gains do not increase monotonically with longer look-ahead, but rather peak within a bounded temporal regime. Specifically, optimal performance corresponds to roughly 1.7–3.3 seconds of future context ( H\in[5, 10] ) on EGTEA Gaze+ and 2.7 seconds ( H=10 ) on Ego4D. Our results demonstrate that lightweight causal models can effectively absorb future-aware signals, providing practical guidance for real-time egocentric gaze modeling.
[CV-129] Beyond Heatmaps: Unsupervised Concept-Graph Reasoning for Interpretable Visual Explanation ECAI2026 IJCAI
链接: https://arxiv.org/abs/2607.01416
作者: Md Mohasin Hossain(1 and 2),Anar Amirli(4),Robert Leist(1),Md Abdul Kadir(1 and 3),Daniel Sonntag(1 and 3) ((1) German Research Center for Artificial Intelligence (DFKI), Saarbrücken, Germany, (2) Saarland University, Saarbrücken, Germany, (3) Oldenburg University, Oldenburg, Germany, (4) BEGO GmbH amp; Co. KG, Bremen, Germany)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the IJCAI-ECAI 2026 Workshop on Explainable Artificial Intelligence (XAI), Bremen, Germany. 7 pages, 4 figures
Abstract:Concept Bottleneck Models (CBMs) provide an intrinsically interpretable alternative to post-hoc explanations. However, existing CBMs often rely on predefined concept vocabularies or supervised annotations, lack explicit concept grounding, and summarize each concept with a single image-level score – discarding spatial recurrence and inter-concept dependencies. We propose a Graph-based Concept Bottleneck Model (G-CBM), an intrinsically interpretable framework that performs unsupervised concept discovery via Non-negative Matrix Factorization (NMF) and represents the discovered concepts as nodes in a per-image concept-graph representation. G-CBM matches region-level features to these concept nodes – providing concept grounding and capturing concept recurrence across the image – and applies a \emphtunable concept filtering threshold \tau to suppress weak region-level features. A Graph Attention Network (GAT) then performs concept-level reasoning by modeling nonlinear dependencies across nodes. Across ImageNet, HAM10000, PH2, and Derm7pt, G-CBM achieves an average relative AUC improvement of 3.7% over a ResNet-50 baseline. Concept filtering frequently improves predictive performance while inducing selective concept use, achieving peak AUC of 0.96 on PH2 with only 2 of 10 concepts and 0.92 on HAM10000 with 3.8 of 9 concepts. On dermoscopy benchmarks, G-CBM is competitive with supervised approaches requiring external annotations. Deletion/insertion analyses with random ablation controls show that the learned concept ranking faithfully reflects model predictions.
[CV-130] NeuroBridge: Bridging Multi-Task MRI Knowledge for Neurodegenerative Disease Diagnosis
链接: https://arxiv.org/abs/2607.01401
作者: Mengyu Li,Guoyao Shen,Chad W. Farris,Xin Zhang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 5 figures. 3 tables
Abstract:INTRODUCTION: Accurate MRI-based identification of Alzheimer’s disease (AD), mild cognitive impairment (MCI), and related dementias remains challenging because disease-related structural changes are often subtle and heterogeneous. We developed NeuroBridge, a clinically guided multi-task MRI framework for neurodegenerative disease diagnosis. METHODS: NeuroBridge integrates large-scale self-supervised MRI pretraining with hippocampal segmentation, hippocampal atrophy classification, and reconstruction objectives, followed by gated fusion fine-tuning. Performance was evaluated across ADNI and OASIS cohorts, including cross-cohort transfer, probability-based analysis, and opportunistic screening. RESULTS: NeuroBridge achieved the highest performance across evaluated classification tasks, reaching 88.17% accuracy for AD versus cognitively normal controls in ADNI and 82.78% in OASIS. The largest gains occurred in MCI-related and mixed-diagnosis settings. The framework demonstrated strong cross-cohort generalization, systematic associations between predicted-class probability and accuracy, and the feasibility of probability-based opportunistic screening. DISCUSSION: Clinically guided multi-task representation learning improves neurodegenerative MRI diagnosis beyond conventional single-task approaches. NeuroBridge provides a robust and scalable framework for dementia assessment and MRI-based opportunistic screening.
[CV-131] Computer Vision for Wildlife Monitoring: Detecting Brown Howler Monkeys using YOLO
链接: https://arxiv.org/abs/2607.01396
作者: Gabriel Ferri Schneider,Guido Luis Glufke Mainardi,Paulo Ricardo Knob,Patrícia Dias,Márcia Jardim,Júlio César Bicca-Marques,Soraia Raupp Musse
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Accepted on International Conference on Computer Animation, Social Agents, and Extended Reality '26 (CASAXR 26)
Abstract:Urban expansion threatens global biodiversity, especially affecting arboreal species due to the fragmentation of forest habitats. The movement of arboreal species across disjointed forest patches increases mortality risk and, thus, compromises their conservation. In this context, the installation of canopy bridges can be a viable strategy; yet continuous monitoring of their use by arboreal species is essential for ensuring their effectiveness, typically carried out with the aid of camera traps. However, this method often produces false-positive images that demand time from conservationists for review. In this context, computer vision algorithms can optimize the task of detecting target species using the canopy bridges. In this study, we explored the automatic detection of brown howler monkeys (Alouatta guariba) in videos obtained by camera traps. Given the need for a large number of annotated images of the target animals to train the algorithms, we tested the incorporation of auxiliary data to improve detection models, fine-tuning the YOLOv10 framework using varying proportions of them. The improvement of these automatic detection techniques contributes to conservation efforts, by providing automatic tools to monitor solutions that minimize the impact of human interference in animals habitats.
[CV-132] Rethinking Generic Object Tracking Toward Human-Level Perceptual Intelligence
链接: https://arxiv.org/abs/2607.01395
作者: Shih-Fang Chen
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Image and Video Processing (eess.IV)
备注: Ph.D. dissertation, National Yang Ming Chiao Tung University, 2026. arXiv admin note: substantial text overlap with arXiv:2602.14771
Abstract:At the heart of human visual perception lies the ability to maintain a continuous and coherent understanding of the external world. By integrating observations with accumulated experience, the human visual system can continuously adapt to variations in both the target and its surrounding environment, while preserving robust visual continuity as scene dynamics evolve. Human vision can therefore integrate prior knowledge, spatial geometry, and semantic context to understand complex scenes and their changes. As a core problem in computer vision, visual object tracking aims to bring machine perception closer to human visual perception. These capabilities are central to the task of Generic Object Tracking (GOT). In this task, a visual tracker is initialized only with the bounding box of an arbitrarily specified target in the first frame, and must continuously localize the target in subsequent dynamic visual streams. However, future events, observations, and real-world variations are inherently unpredictable; therefore, the model’s generalization and online adaptation capabilities remain bottlenecks. Tracking reliability can deteriorate when the target undergoes severe deformation, is affected by complex distractors, encounters significant environmental changes, or belongs to a category unseen during training. This dissertation aims to narrow the gap between machine visual tracking systems and human visual perception by proposing a series of methods that systematically enhance the target discrimination, robust adaptation, and geometric reasoning capabilities of tracking models.
[CV-133] MIBE: Multi-subject Interaction Benchmark and Evaluator for Personalized Image Generation
链接: https://arxiv.org/abs/2607.01383
作者: Zhihan Chen,Yuhuan Zhao,Yijie Zhu,Xinyu Yao,Mengcong Ren,Suwen Wang,Qiuyang Yin,Yuchen Sun,Qin Wang,Lu Xin
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multi-subject personalized image generation requires the precise rendering of all requested reference identities and their specified interactions based on a guiding prompt. However, state-of-the-art models still struggle with this process, frequently omitting subjects, failing to preserve reference appearances, or misattributing interactions. Furthermore, existing metrics designed primarily for single-subject fidelity cannot reliably capture these errors, suffering severe degradation in ranking separability and failing to align with human preference as the subject count increases. To address this gap, we introduce Multi-subject Interaction Benchmark and Evaluator (MIBE), a unified framework comprising a Multi-subject Interaction Benchmark (MIB) and a Multi-subject Interaction Evaluator (MIE). MIB systematically covers diverse relation types and scene complexities through a decoupled data regime. This consists of a 60K-pair VLM-labeled Silver Set for scalable metric training and a 4K-pair double-blind Human Evaluation Gold Set covering a diverse range of state-of-the-art generators, with the Silver Set reaching 95.1% cross-VLM preference agreement. To demonstrate the utility of this benchmark, we present MIE, a lightweight, reference-conditioned evaluator trained exclusively on the Silver Set with a dual-head ranking and diagnosis objective. MIE exhibits strong cross-generator generalization on the Gold Set, achieving 0.922 overall pairwise accuracy against human preference, including 0.982 on seen generators and 0.884 on unseen generators. By outperforming a broad spectrum of baseline metrics, including CLIP and DINO variants, MIE demonstrates that diagnostic supervision can preserve ranking separability and human alignment where traditional evaluators collapse.
[CV-134] MapDreamer: Aerial Imagery Conditioned Latent Diffusion for Lane-Level Map Generation ECCV2026
链接: https://arxiv.org/abs/2607.01370
作者: Julian Brandes,Philipp Crocoll,Wolfram Burgard
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ECCV 2026
Abstract:High definition map generation is essential for autonomous driving, yet remains a labor-intensive process at scale. We present MapDreamer, a generative diffusion model that synthesizes lane-level vector maps with explicit topology directly from a single aerial image. MapDreamer learns a compact latent representation of lane centerlines and their topological relations using a variational autoencoder and predicts graphs with a transformer-based latent diffusion model. To align generated maps with the observed scene, we condition each denoising step on dense aerial features injected through cross-attention. To handle the varying number of lanes across scenes, we propose a lane cardinality module paired with background ghost lane latents, a learned buffer that prevents slot collapse during diffusion. Furthermore, we introduce a sliding-window global graph aggregation strategy that stitches local tiles into city-scale maps while preserving connectivity through encoded lane boundaries. Experiments on UrbanLaneGraph derived from Argoverse 2 show improved geometric and topological fidelity over non-generative baselines.
[CV-135] Multi-modal Rail Crossing Safety Analysis
链接: https://arxiv.org/abs/2607.01365
作者: Paimon Goulart,Chansong Lim,Nícolas Roque dos Santos,Yue Dong,Sheldon Peterson,Jia Chen,Evangelos E. Papalexakis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Given one or more images of a railway crossing, can we leverage visual cues that allow us to robustly estimate how safe it is? Can we improve our ability to do so by introducing structured data (such as official accident reports) about the accident history of that crossing into our models? In this work, we explore how to best answer those questions towards building an AI system that can ingest multi-modal data for railway crossings and provide safety assessment and scores that align with expert opinion and with safety scoring used by the Federal Railroad Administration (FRA). To that end, we propose a proof-of-concept pipeline that delivers on that goal, while at the same time exploring and tackling a number of critical research challenges that pertain to different parts of the pipeline, from data preparation to different learning paradigms that can allow us to realize such a system. Indicatively, our proposed system identifies HIGH-RISK and LOW-RISK crossings with a macro F1 score of 0.757 and estimates FRA-based safety scores with an RMSE of 0.078 and correlation of 0.492 using a routed fine-tuned compact VLM pipeline, while producing qualitative results that align with domain-expert assessment.
[CV-136] Spatial-Temporal Expert Learning for Video-based Person Re-identification ICPR
链接: https://arxiv.org/abs/2607.01353
作者: Xiaofei Hui,Pengfei Wang,Evan Ling,Dezhao Huang,Keng Teck Ma,Minhoe Hur,Jun Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to V3SC 2026 @ ICPR
Abstract:Video-based person re-identification (Re-ID) aims to retrieve the same identity in the query video clips from the gallery video clips. To solve this problem, exploiting fine-grained features is of great importance, especially when discriminating identities that are similar in appearance. In this paper, we propose to enhance the ability to explore fine-grained information with a novel input-aware extendable expert module. Instead of updating the network parameters with every sample in the dataset, we aim to train the experts within specific subsets that only contain similar samples and promote their ability to exploit fine-grained information within these similar samples. To achieve this goal, we incorporate two mechanisms in this module: input-aware expert selection mechanism and spatial-temporal selection mechanism. The first mechanism dynamically activates a set of experts on subsets of similar samples, pushing the experts to exploit subtle differences between these similar samples, while the second one further increases their sensitivity to the fine-grained differences in spatial and temporal aspects and allows the experts to dynamically utilize them for different input samples. In addition, to facilitate the expert module, we design an extendable scheme that allows the module to flexibly add new experts when necessary. As a result, our method achieves outstanding performance on two large-scale datasets.
[CV-137] KathaTrace: Diagnosing Semantic Trajectory Collapse in Generated Visual Narratives
链接: https://arxiv.org/abs/2607.01312
作者: Jamuna S. Murthy,Amin Karimi Monsefi,Rajiv Ramnath
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Visual narratives are central to storyboards, comics, children’s media, and film previsualization, where viewers understand stories from images alone. Recent generators such as StoryDiffusion produce coherent sequences, but visual coherence does not guarantee that source-story transition meaning remains recoverable. Existing benchmarks assess visual quality, content faithfulness, and scene coherence, but miss a critical failure mode: storyboards where scenes appear visually coherent while the semantic link between scenes disappears. We introduce KathaTrace, a generator-agnostic protocol for diagnosing semantic trajectory collapse, defined as the loss of transition meaning needed to understand how one scene follows another. KathaTrace evaluates transitions under three evidence conditions: text-only, image-only, and text-plus-image, and filters ambiguous items. We contribute KathaBench-25K, with 5,000 narratives from classical collections including Aesop, Panchatantra, and Kathasaritasagara, 20,000 transitions, and 28,712 recoverability questions. We define Semantic Trajectory Gap, or STG, as text-only minus image-only recoverability, measuring transition meaning lost during visualization. Human validation yields Fleiss’ kappa = 0.845. Experiments across state-of-the-art generators show substantial STG of 23.5 +/- 1.3. Semantic Compass, an actionability probe, uses KathaTrace signals for post-generation repair and improves storyboard selection.
[CV-138] CPG-PAD: Concept-Informed Prompts Guided Presentation Attack Detection
链接: https://arxiv.org/abs/2607.01303
作者: Haoyuan Zhang,Xiangyu Zhu,Li Gao,Ajian Liu,Siran Peng,Zhen Lei
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by IEEE Transactions on Information Forensics Security (TIFS)
Abstract:Presentation Attack Detection (PAD) serves as a crucial safeguard for face recognition systems against presentation attacks such as printed photos, replayed videos, and 3D masks. Despite significant progress, existing PAD models still struggle to generalize across unseen domains due to variations in sensors, lighting, and attack materials. Recent Vision-Language Models (VLMs) have shown strong generalization ability, yet their applications in PAD remain limited because learned prompts, typically optimized under class-label supervision, fail to explicitly align with fine-grained attack-relevant visual semantics. As a result, the learned representations often overfit domain-specific artifacts instead of capturing transferable attack cues. To address this, we propose Concept-Informed Prompts Guided Presentation Attack Detection (CPG-PAD), a framework that introduces model-level concept guidance into the prompt learning process. Specifically, we design a Visual Concept-driven Enhancement (VCE) module that employs eXplainable AI (XAI) techniques to automatically discover PAD-relevant visual concepts and generate concept-associated heatmaps providing localized fine-grained guidance. Guided by these heatmaps, a Prompt-based Concept Injection (PCI) mechanism integrates these concepts into the prompt space through a Visual-Prompt Decoder (VPD) and a concept-mapping loss, enabling prompts to align with the model’s internal concept space. This design enables CPG-PAD to capture generalizable and domain-invariant attack cues while effectively suppressing dataset-specific biases. Extensive experiments across nine benchmark datasets demonstrate that CPG-PAD consistently achieves state-of-the-art cross-domain performance under multi-source, limited-source, and single-source settings.
[CV-139] AnchorSplat: Fast and Structure Consistent Detail Synthesis for Gaussian Splatting ECCV2026
链接: https://arxiv.org/abs/2607.01290
作者: Dexu Zhu,Jiangnan Shao,Xiaofeng Wang,Junxian Duan,Jie Cao,Zheng Zhu,Huaibo Huang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ECCV2026
Abstract:3D Gaussian Splatting (3DGS) has emerged as a powerful representation for high-fidelity rendering. However, existing assets often suffer from quality bottlenecks such as missing details and texture noise. Prior attempts to enhance these assets via 2D image processing introduce multi-view inconsistencies and high computational costs. In this paper, we propose a novel 3D-native refinement paradigm named AnchorSplat. AnchorSplat is an end-to-end deep network operating directly on 3D structures, avoiding the expensive optimization overhead of traditional 3D-2D-3D pipelines. Crucially, AnchorSplat is a strictly source-free solution requiring no original multi-view images. Central to the proposed method is the Point Anchor Mechanism, which enforces geometric consistency via local offset constraints, mitigating ill-posed mapping and gradient confounding. Furthermore, AnchorSplat replaces iterative densification with a single-pass multiplication mechanism. To facilitate research, we construct 3DGS-SR, the first large-scale benchmark for this task. Experiments demonstrate state-of-the-art results on the 3DGS-SR dataset, with throughput up to 10^5 times faster than optimization methods. Notably, AnchorSplat exhibits robust zero-shot generalization across diverse data distributions, including generative model outputs and real-world scans.
[CV-140] Benchmarking Federated Learning and Knowledge Distillation for Point Cloud Classification ECCV2026
链接: https://arxiv.org/abs/2607.01272
作者: Aizierjiang Aiersilan
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注: We are pleased to announce that this paper has been accepted by the 19th European Conference on Computer Vision (ECCV 2026). We appreciate the valuable feedback from the reviewers and look forward to sharing our findings with the community
Abstract:Deploying 3D point cloud analysis in privacy-sensitive, resource-constrained settings faces two barriers: data cannot be centralized, and models must run on limited edge hardware. We present a multi-seed benchmark jointly evaluating federated learning (FL) and knowledge distillation (KD) for 3D point cloud classification. It spans 13 FL algorithms and 10 KD objectives (a 130-pair cross-product) across 504 training runs, evaluated on ModelNet40 and a clinical craniosynostosis dataset. We report three findings. First, under extreme non-IID label skew, standalone FL degrades sharply: on ModelNet40, the strongest method reaches 76.32% against a 92.26% centralized reference; on clinical data, the best reaches 75.83% against 100%. Second, distillation successfully compresses the teacher into a student 74.51% smaller and roughly twice as fast at inference, often matching or surpassing the teacher. Third, the combined pipeline exposes an evaluation pitfall: when distillation keeps a hard-label cross-entropy term on a labeled proxy split, a collapsed federated teacher (8.50%) paired with Logit-MSE still yields a 92.94% student. This 84.4-point gap reflects the proxy labels rather than the federated model, reusing the very labels whose privacy motivated federation. Objectives without hard labels instead track teacher quality ( r \approx 0.99 ) and collapse when the teacher does. We therefore recommend evaluating FL-KD pipelines with label-free distillation so reported accuracy reflects the federated teacher, not the proxy.
[CV-141] Self-Auditing Residual Drifting for Pathology-Preserving Accelerated Knee MRI
链接: https://arxiv.org/abs/2607.02428
作者: Qing Lyu,Jianxu Wang,Mohammad Kawas,Ge Wang,Christopher T. Whitlow
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注:
Abstract:Accelerated magnetic resonance imaging reduces acquisition time, but reconstruction from undersampled k-space can blur diagnostically relevant structures or introduce failures that are not captured by global image metrics. We propose SA-RDM-DC, a Self-Auditing Residual generative Drifting Model with Data Consistency for accelerated knee MRI. The method adapts the newly proposed generative drifting paradigm to accelerated MRI by training a physics-conditioned drift field from the zero-filled reconstruction toward the fully sampled residual correction. It predicts image- and missing-k-space residual corrections, enforces data consistency with acquired k-space, uses frequency-aware and residual drifting supervision to recover fine detail, and produces dense error maps and slice-level risk scores in the same inference pass. We evaluate SA-RDM-DC on multi-coil fastMRI knee data at acceleration factors of 4, 8, and 12, with fastMRI+ pathology annotations for region-level and classifier-based task preservation, and on SKM-TEA for zero-shot and fine-tuned protocol-shift evaluation. Compared with zero-filled reconstruction, UNet-image-SENSE, DC-UNet, Score-Diffusion, ELF-Diff, SENSE-VarNet, and MoDL baselines, SA-RDM-DC achieves the highest SSIM across fastMRI acceleration factors while retaining subsecond per-slice inference and avoiding the long sampling time of iterative diffusion baselines. In pathology-aware analysis, SA-RDM-DC preserves lesion-region structural fidelity and reduces meniscus prediction instability. Its self-auditing scores strongly identify high-error reconstructions on fastMRI and partially transfer as a selective-review signal under SKM-TEA protocol shift. These results support reconstruction evaluation that jointly considers image fidelity, pathology preservation, runtime, and case-specific reliability.
[CV-142] Population-Scale Segmentation of Penile Tissue in DIXON MRI using Deep Learning for Quantitative Phenotyping in Male Reproductive Health
链接: https://arxiv.org/abs/2607.02127
作者: Jan Ernsting,Gunnar Paul Kordes,Nils Johannaber,Lynn Ogoniak,Wolfgang Roll,Tim Hahn,Alexander Siegfried Busch,Benjamin Risse
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Penile measurement is clinically relevant across male reproductive and urogenital health, including conditions such as micropenis, congenital and endocrine disorders, and sexual or urinary dysfunction. However, quantitative assessment of penile size has relied mainly on external length or circumference measurements, which are difficult to standardize, sensitive to measurement conditions, and unable to capture the internal portion of the penis. MRI enables volumetric assessment of the whole penis in vivo, but automated segmentation has not previously been established at population scale. Automated whole-organ volumetry would enable high-throughput phenotyping for multi-omics and clinical studies of male reproductive disease. Here, we present a deep learning framework for whole-penis segmentation in multi-channel DIXON MRI. Using a newly curated expert-annotated training dataset ( n = 145 subjects; 13,050 annotated slices) and a double-annotated independent test benchmark ( n = 24 subjects; 2,160 double-annotated slices), we optimized a 3D nnU-Net architecture. The model achieved a 5-fold cross-validation Dice score of 0.90 and performed at observer-level accuracy on the independent test set (Dice: 0.92 ; Hausdorff distance: 3.58 ). We deployed the model in 34,412 UK Biobank participants, enabling automated quantification of total penile tissue, including both external and internal components. Longitudinal evaluation in 2,282 men demonstrated high inter-session reproducibility ( r = 0.87 ). This framework establishes a reproducible and population-scalable method for MRI-based assessment of penile anatomy and provides an open technical resource for future studies in urological imaging and male reproductive health. The trained model weights will be publicly released. Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) Cite as: arXiv:2607.02127 [eess.IV] (or arXiv:2607.02127v1 [eess.IV] for this version) https://doi.org/10.48550/arXiv.2607.02127 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-143] Quantum-Inspired Vision: Leverag ing Wave-Particle Duality for Low-Illumination Enhancement
链接: https://arxiv.org/abs/2607.01731
作者: Yiquan Gao
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Optimization and Control (math.OC); Quantum Physics (quant-ph)
备注:
Abstract:This study provides a theoretical expansion of the recent Data Relativistic Uncertainty (DRU) framework by formalizing a physics-to-AI paradigm for image enhancement. By modeling images as probabilistic wave functions rather than deterministic states, the paradigm explicitly integrates wave-particle duality to illustrate the system flow of how DRU leverages the intrinsic physical uncertainty of light, a dimension requiring further theoretical discussion. Consequently, this paradigm provides a rigorous Explainable AI (XAI) approach that enhances the interpretability of how DRU mitigates illumination bias and maintains robustness against data noise.
[CV-144] Boundary-Aware Quantization: Finite-Scale Decision Geometry of Neural Classifiers
链接: https://arxiv.org/abs/2607.01478
作者: O.M. Kiselev
类目: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 7 pages, 2 figures, 6 tables
Abstract:We measured quantization-induced decision-boundary changes using local logit-margin radii, first-order boundary displacement, normal variation, slice-boundary Jaccard distance, grid prediction changes, multiclass junction counts, and low-margin boundary-band flips. On the digits benchmark, 8-bit weight quantization preserved all test labels while producing boundary-mask Jaccard (0.428) on the PCA slice; at 4 bits, accuracy remained (0.9733), while boundary Jaccard rose to (0.970) and median local boundary shift reached (0.0290). Interpolation between adjacent quantization levels localized the visible reconfigurations at multiclass junctions, with 12, 34, and 17 triple-junction cells in the selected transitions. Calibration-to-test stopping reduced the digits held-out flip rate from (0.0094) to (0.0022) and boundary Jaccard from (0.825) to (0.524); the same stopping rule also reduced flips on MNIST and Fashion-MNIST. On official CIFAR-10 subsets, PTQ-W selected by accuracy gave 6-bit flip (0.0367) and boundary Jaccard (0.184), whereas boundary-aware stopping selected 8-bit flip (0.0083) and boundary Jaccard (0.048). On full CIFAR-10 with three seeds, 6-bit PTQ-W lost (0.0029) accuracy relative to float, changed (5.3%) of held-out decisions, and changed (24.5%) of low-margin boundary-band decisions. A fixed-bit boundary-gap rounding term changed the trade-off at 4 bits by reducing boundary Jaccard from (0.457) to (0.435) and boundary-band pair-order flip from (0.3600) to (0.3558), with an accuracy trade-off; the 3-bit stress test exposed the tuning limit of this surrogate. Calibration boundary Jaccard predicted held-out boundary Jaccard across PTQ-W and optimized rounding variants with (r=0.947)–(0.994).
人工智能
[AI-0] Distributed Attacks in Persistent-State AI Control
链接: https://arxiv.org/abs/2607.02514
作者: Josh Hills,Ida Caspary,Asa Cooper Stickland
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:As AI coding agents become more autonomous, they increasingly ship code iteratively, with the codebase persisting across sessions. This persistence creates a new attack surface: a misaligned or prompt-injected agent can distribute attacks across pull requests (PRs) and time its payload for the PR with the best natural cover. To study the resulting dynamics, we introduce Iterative VibeCoding, a setting for AI control, the study of safely deploying capable but potentially untrusted AI. In Iterative VibeCoding, a coding agent builds software over a sequence of PRs in a persistent codebase while pursuing a covert side task. Our benchmark includes two task families: CLI tools and Flask web services, across 20 total task variations. We use Claude Sonnet 4.5 as the attack agent and GPT-4o as the monitor. We compare gradual attacks, which distribute the side task across PRs, against non-gradual attacks concentrated in a single PR. No single monitor is robust to both: which strategy evades best (success while evading the monitor) depends on the monitor type, so a defender cannot close off both gradual and non-gradual attacks with any one monitor. High evasion (= 65%) generalizes across model attack agent backends (Sonnet 4.5, Gemini 3.1 Pro, Kimi K2.5), confirming this is a property of the persistent-state attack surface rather than a single model’s capability. Evasion also remains high across state-of-the-art monitor models and the gap between gradual and non-gradual evasion widens for more capable models. We introduce a stateful link-tracker monitor that tracks suspicious buildup across PRs. On both task families, it detects gradual attacks substantially better than diff monitors that merely see more accumulated history. Combining this stronger monitor with trajectory monitors in a four-monitor ensemble reduces gradual-attack evasion from 93% under the weakest standard diff monitor to 47%.
[AI-1] ReContext: Recursive Evidence Replay as LLM Harness for Long-Context Reasoning
链接: https://arxiv.org/abs/2607.02509
作者: Yanjun Zhao,Ruizhong Qiu,Tianxin Wei,Yuanchen Bei,Zhining Liu,Lingjie Chen,Ismini Lourentzou,Hanghang Tong,Jingrui He
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Understanding and reasoning over long contexts has become a key requirement for deploying large language models (LLMs) in realistic applications. Although recent LLMs support increasingly long context windows, they often fail to use relevant evidence that is already present in the input, revealing a gap between context access and effective context utilization. In this work, we propose Recursive Evidence Replay as LLM Harness for Long-Context Reasoning (RECONTEXT), a training-free inference method for improving long-context reasoning. RECONTEXT uses model-internal relevance signals to construct a query-conditioned evidence pool and replays it before final generation while preserving the full original context. This recursive selection process separates evidence organization from answer generation without training, external memory, or context pruning. We also provide a theoretical analysis based on associative memory, which characterizes the context as a memory store, the question as a retrieval cue, attention as cue-trace association, and replay as trace reactivation. Experiments on eight long-context datasets with 128K context length show that RECONTEXT consistently improves evidence utilization across Qwen3-4B, Qwen3-8B, and Llama3-8B, achieving the best average rank on all three backbones. Code is available at this https URL.
[AI-2] DemoPSD: Disagreement-Modulated Policy Self-Distillation
链接: https://arxiv.org/abs/2607.02502
作者: Yunhe Li,Hao Shi,Wenhao Liu,Mengzhe Ruan,Hanxu Hou,Zhongxiang Dai,Shuang Qiu,Linqi Song
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:On-policy self-distillation (OPSD) has emerged as a practical method for training large language models (LLMs) to reason, where a single model acts as both the teacher and the student with different levels of information access. However, recent studies have found that the teacher’s dense token-level supervision, conditioned on privileged information, can lead to overfitting to in-domain patterns, suppress exploration, and hurt cross-domain generalization, while also introducing a more fundamental issue: privileged information leakage, where the student encodes answer-dependent shortcuts that are unavailable at test time. We introduce DemoPSD, a novel framework that resolves such problems through the idea of selective adoption of teacher guidance. Instead of fitting the full teacher distribution, DemoPSD steers the student toward a reverse-KL barycenter target, a weighted geometric combination of the teacher and student distributions, that naturally balances learning from the teacher with preserving the student’s own reasoning capacity. We measure the difference between their distributions and use such a discrepancy to adaptively control the blending at each token position. We provably show that DemoPSD achieves (1) leakage attenuation, i.e., effective mitigation of privileged information leakage; and (2) exploration preservation, i.e., preservation of exploration capacity under dense token-level distillation. Extensive experiments on SciKnowEval across four scientific fields show that DemoPSD outperforms both GRPO and SDPO while maintaining higher training entropy and robustly generalizing to out-of-distribution GPQA benchmarks.
[AI-3] Beyond Adam: SOAP and Muon for Faster Label-Efficient Training of Machine Learning Interatomic Potentials
链接: https://arxiv.org/abs/2607.02499
作者: Gil Harari,Yoel Zimmermann,Ola Tangen Kulseng,Laura Zichi,Chuin Wei Tan,Marc L. Descoteaux,Boris Kozinsky
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph)
备注:
Abstract:Machine learning interatomic potentials (MLIPs) have become a hallmark of AI for scientific simulation. While efforts on new architectures and datasets have led to increasingly accurate and general models, the choice of optimizer for training has largely remained unexplored, defaulting to Adam and its variants in the community. Here, we implement and systematically compare a class of recently proposed matrix-structured optimizers, including Muon, SOAP, and the hybrid SOAP-Muon, for training NequIP and Allegro MLIP models. We find that these optimizers can substantially outperform Adam in both convergence speed and final accuracy. SOAP and SOAP-Muon emerge as robust and consistently strong methods, while Muon only provides partial gains relative to Adam. The improvements are particularly pronounced under partial force supervision. Our results indicate that optimizer choice is an overlooked yet impactful design axis for MLIPs.
[AI-4] G-RRM: Guiding Symbolic Solvers with Recurrent Reasoning Models
链接: https://arxiv.org/abs/2607.02491
作者: Timo Bertram,Sidhant Bhavnani,Richard Freinschlag,Erich Kobler,Andreas Mayr,Günter Klambauer
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:In this work, we focus on SE-RRMs, a symbol-equivariant instantiation of RRMs that exhibits improved extrapolation to larger problem sizes. We propose a neuro-symbolic approach, ``Guiding with Recurrent Reasoning Models’’ (G-RRM), which integrates SE-RRMs with symbolic solvers for constraint satisfaction problems. SE-RRMs act as neural solvers that generate full solution proposals and guide classical symbolic solvers, such as backtracking or SAT-based methods like Glucose 4.1 and CaDiCaL 3.0.0, that produce globally correct solutions. Centrally, we investigate when neural guidance with G-RRM improves the search efficiency of symbolic solvers. % Our experiments show that the efficacy of G-RRM depends on two conditions: first, the problem instances must have an expansive combinatorial search space to expose potential gains, and second, the solver architecture must be capable of dynamically overwriting its branching choices to recover when neural hints are imperfect. When these conditions hold, guidance drives median conflict counts to zero and yields significant wall-clock speedups: on 9\times9 Sudoku, where the SE-RRM correctly solves 91.1% of instances, backtracking accelerates by 33.3\times and Glucose 4.1 by 1.70\times (median, p0.001 ), with Glucose 4.1 retaining a 1.17\times speedup on perfect-hint 25\times25 grids. In contrast, CaDiCaL 3.0.0, whose runtime is overhead-dominated and which always respects the injected branching hints rather than overwriting them, shows no significant speedup (median 1.02\times , n.s.) and even a small significant mean slowdown ( 0.90\times ) on 9\times9 . These results delineate the regimes in which neural guidance translates into practical speedups.
[AI-5] Human Capital Not Model Benchmarks Predicts Hybrid Intelligence in Forecasting
链接: https://arxiv.org/abs/2607.02467
作者: Vivienne Ming
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 4 pages, 1 figure, PNAS brief style
Abstract:Whether pairing people with AI helps or hurts is usually reported as a single average effect. Using a real-money prediction market (Polymarket) as an objective, externally resolved benchmark, this pilot shows that the value of human-AI collaboration depends on a specific, measurable form of human capital. Analyzed at the level of the individual forecaster, hybrid performance is trimodal: most people either deferred to the model (matching it) or used it to rubber-stamp a prior guess (performing worse than the model alone), while a minority engaged in genuine complementary reasoning and reached accuracy matching or even exceeding (i.e., lower error than) the market itself. Collaborative traits (perspective-taking, intellectual humility, and curiosity) rather than raw cognitive ability or model benchmarks, distinguished who reached that mode. The results are preliminary but statistically robust, and motivate a pre-registered replication now in preparation.
[AI-6] Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs ICML2026
链接: https://arxiv.org/abs/2607.02466
作者: Junhao Shi,Siyin Wang,Xiaopeng Yu,Li Ji,Jingjing Gong,Xipeng Qiu
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted to ICML 2026, 21 pages,6 figures
Abstract:Vision-Language-Action (VLA) models are fundamentally bottlenecked by the scarcity of expert demonstrations – triplets of observations, instructions, and actions that are costly to collect at scale. We argue that this bottleneck stems from conflating two distinct learning objectives: acquiring physical competence (how to move) and acquiring semantic alignment (what to do). Crucially, only the latter requires language supervision. Building on this Decomposition Hypothesis, we propose Task-Agnostic Pretraining (TAP), a two-stage framework that first learns transferable motor priors from cheap, unlabeled interaction data – including discarded off-task trajectories and autonomous robot play – via a self-supervised Inverse Dynamics objective. A lightweight second stage then grounds these priors in language using minimal expert data. On the SIMPLER benchmark, TAP matches models trained on over 1M expert trajectories while using orders of magnitude less labeled data, yielding a 10% absolute gain over standard behavior cloning. On a real-world WidowX platform, TAP retains 25% success under camera perturbations where internet-scale baselines collapse to 0%, demonstrating that task-agnostic pretraining produces robust, transferable physical representations and offers a scalable path forward for Embodied AI.
[AI-7] Neuron-Aware Data Selection for Annotation-Free LLM Self-Distillation
链接: https://arxiv.org/abs/2607.02460
作者: Zhuowei Chen,Xiang Lorraine Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Post-training large language models (LLMs) without real-world interaction feedback or human-labeled supervision remains challenging, particularly in specialized domains where expert annotations are costly to obtain. Recent annotation-free self-evolution methods address this by using the model’s own outputs as supervision signals, constructing a teacher via additional context and aggregating predictions across multiple rollouts through majority voting to produce pseudo-labels. However, these approaches are not without drawbacks: SFT- and GRPO-based variants suffer out-of-domain performance degradation, while reward-based on-policy RL inflates calibration error. In this paper, we propose Neuron On-Policy Self-Distillation (Neuron-OPSD), a data-centric framework for annotation-free self-distillation that leverages internal neuron activations to guide both training-data selection and teacher context construction. The model is then trained via on-policy distillation from the teacher distribution, requiring no ground-truth labels at any stage. Across specialized-domain benchmarks, Neuron-OPSD improves in-domain task performance while preserving cross-domain generalization and mitigating calibration collapse over prior annotation-free baselines. This framework is particularly relevant to settings where online interaction or external supervision is costly or infeasible, and is conceptually distinct from offline RL approaches that rely on logged, reward-labeled trajectories.
[AI-8] Reasoning effort not tool access buys first-try reliability in agent ic code generation: an observational study
链接: https://arxiv.org/abs/2607.02436
作者: Achint Mehta
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 22 pages, 5 figures, 10 tables. Dataset and evaluation artifacts: this https URL
Abstract:Agentic coding assistants are increasingly given extra capabilities, such as browser based testing tools and design oriented system prompts, on the assumption that more capability yields better software. This study tested that assumption directly. Ninety independent agent runs built the same application, a real time retrospective board, from one detailed specification, each scored on a fixed 14 criterion functional rubric (42 point maximum) and a visual quality review. The runs spanned several model generations, two agent harnesses, two reasoning effort levels, a testing tool, and two design oriented prompts. Capability tier dominated: frontier models clustered near the ceiling while a low cost local model fell to 24 to 37 points. A criterion level analysis revealed what run totals conceal. Container deployment was the dominant defect, failing first try in 44 percent of runs, with its failure rate shifting sharply across model generations while mean totals moved less than a point. The testing tool raised cost by 42 to 68 percent without improving functional score or reliability, even on interface visible criteria. Raising reasoning effort from High to xHigh lifted first try perfect runs from 28 percent to 89 percent and cut corrective prompts about five fold, for 9 to 29 percent more cost. A design oriented prompt raised visual quality, 4.5 versus 3.0 on a 5 point scale, without lifting function, and a one paragraph paraphrase of its directive reproduced the entire lift. The practical lesson is to match the fix to the failure: most first run failures came from weak reasoning, which a stronger model or more effort prevents, not from visible flaws a checking tool would catch.
[AI-9] WorldSample: Closed-loop Real-robot RL with World Modelling
链接: https://arxiv.org/abs/2607.02431
作者: Yuquan Xue,Le Xu,Zeyi Liu,Zhenyu Wu,Zhengyi Gu,Xinyang Song,Bofang Jia,Ziwei Wang
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 16 pages, 9 figures, conference paper
Abstract:Reinforcement learning (RL) can overcome the demonstration-coverage limitation of imitation learning (IL) by allowing robots to improve through trial-and-error interaction beyond the states observed in demonstrations. However, deploying RL on real robots remains constrained by high interaction costs, since each physical rollout is costly and reflects only one realized action-outcome path. To address this challenge, we propose WorldSample, a physically grounded data augmentation framework for real-robot RL that closes a real-synthetic loop between physical rollouts, world-model generation, and policy improvement. Grounded on real rollouts, WorldSample generates high-fidelity synthetic transitions through a post-trained world model, which greatly lowers the visual hallucination. Specifically, rather than simply using these transitions as real-world experience, WorldSample introduces Policy-Paced Learning (PPL) to regulate the training process through sample selection and scheduling, balancing useful augmentation against value overestimation and mitigating the hallucination-induced noise. Experiments on robot manipulation tasks involving contact-rich and precise tasks show that WorldSample improves policy success rate by 28% while reducing training steps by 59% compared with baselines. Furthermore, WorldSample improves world model visual fidelity by 19.4dB in PSNR and 0.47 in SSIM over demonstration-only post-training, validating the effectiveness of the real-synthetic loop for both policy and world model performance.
[AI-10] QFedAgent : Quantum-Enhanced Personalized Federated Learning for Multi-Agent Activity Recognition
链接: https://arxiv.org/abs/2607.02426
作者: Quoc Bao Phan,Tuy Tan Nguyen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Federated learning (FL) enables collaborative model training across distributed devices without sharing raw data, making it suitable for privacy-sensitive robotic sensing applications. However, multi-agent systems generate heterogeneous and non-independent and identically distributed (non-IID) multimodal sensor streams that degrade conventional FL algorithms, while classical fusion modules introduce substantial parameter overhead and communication cost. This paper proposes QFedAgent, a hybrid quantum-classical personalized FL framework for multi-agent activity recognition. The approach integrates a variational quantum circuit fusion module that models accelerometer–gyroscope interactions through quantum state encoding and entanglement, requiring only 72 quantum rotation parameters versus 33K in classical multi-layer perceptron-based fusion, achieving approximately 10x total parameter reduction. Experiments on the OPPORTUNITY dataset under subject-based non-IID partitions demonstrate 97.7% mean test accuracy, confirming that parameter-efficient quantum fusion remains competitive with conventional federated baselines.
[AI-11] Neuron-Aware Active Few-Shot Learning for LLM s
链接: https://arxiv.org/abs/2607.02423
作者: Zhuowei Chen,Liwei Chen,Christian Schunn,Raquel Coelho,Xiang Lorraine Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Active Few-Shot Learning (AFSL) adapts LLMs to specialized domains by identifying the most valuable unlabeled samples for annotation and use as few-shot demonstrations, effectively reducing human annotation costs while promoting high performance. However, existing methods typically rely on output-level signals for sample identification, such as predictive entropy or semantic similarities with test-time data based on external embeddings, which often overlook models’ internal dynamics, which could pinpoint specific knowledge gaps. To bridge this gap, we propose NeuFS, a Neuron-Aware Active Few-Shot Learning framework that shifts the selection paradigm from output-level proxies to models’ internal dynamics. NeuFS utilizes neuron activation patterns to represent sample directly, and includes a dual-criteria selection strategy that: (1) ensures few-shot sample diversity with neuron patterns for broader example coverage, while (2) prioritizing on identifying informative and challenging few-shot samples LLMs tend to hallucinate by quantifying neuron consensus. Experiments on three datasets demonstrate that NeuFS excels in both reasoning and text classification tasks, outperforming existing AFSL baselines. Ablation studies further highlight that internal neuron activations provide a more principled and effective selection signal than external embeddings, validating the superiority of the proposed NeuFS.
[AI-12] Fast Multi-dimensional Refusal Subspaces via RFM-AGOP
链接: https://arxiv.org/abs/2607.02396
作者: Thomas Winninger
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to the Mechanistic Interpretability Workshop at the 43rd International Conference on Machine Learning, Seoul, South Korea, 2026
Abstract:Steering and monitoring activations in Large Language Models (LLMs) are increasingly used for both safety and interpretability. Early work assumed behaviours are encoded along single linear directions, but recent findings suggest complex behaviours, such as the refusal to answer harmful queries, live in multi-dimensional subspaces. However, existing methods for extracting these subspaces are computationally expensive, which becomes prohibitive on reasoning models who produce long reasoning traces. By adapting the Recursive Feature Machine (RFM) algorithm – which can be computed efficiently – with a probe-informed initialization, we are able to identify the multi-dimensional refusal subspace in seconds, on reasoning (Qwen 3) and non-reasoning (Qwen 2.5) models. While RFM allows for faster subspace identification, it also showed better performances on the ablation task than its alternatives. More work is planned to better understand the relations between subspaces found by different methods. If confirmed, RFM could be a cheap and scalable complement to existing subspace-extraction methods in LLMs.
[AI-13] Steerability via constraints: a substrate for scalable oversight of coding agents
链接: https://arxiv.org/abs/2607.02389
作者: Thomas Winninger
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Software Engineering (cs.SE)
备注: Accepted to the Deep Learning for Code Workshop at the 43rd International Conference on Machine Learning, Seoul, South Korea, 2026
Abstract:Coding agents are capable; human oversight is the bottleneck. Unconstrained agents introduce security risks, erode codebase scalability, and make human review increasingly costly. We argue that the same methods used for decades to manage large human engineering teams: access control, network policies, strict coding conventions enforced by tooling; transfer directly to coding agents, and are cheaper (in token) than recent agentic scaffolding. We sketch a start-to-end system on this principle, and report a controlled experiment in scalable oversight: a small reviewer (Gemma 4 e4b) inspects a Python codebase containing 11 inserted backdoors. Recall rises from 54.5% (unconstrained, no tools) to 90.9% (constrained substrate plus a ~200-LoC docs CLI), with substrate and tools contributing independently. We choose Python deliberately: substrate-level oversight gains are largest where the language gives the fewest guarantees by default; the principles extend to languages like Rust.
[AI-14] DRIFTLENS: Measuring Memory-Induced Reasoning Drift in Personalized Language Models
链接: https://arxiv.org/abs/2607.02374
作者: Xi Fang,Weijie Xu,Yingqiang Ge,Yuhui Xu,Stephanie Eckman,Chandan K. Reddy
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures
Abstract:Personalization changes what a model says to a user; we show that it can also change the reasoning trajectory used to justify the response. Modern LLMs personalize interactions by storing user attributes, preferences, and prior context, then injecting this information into future prompts. We study whether such memory reshapes reasoning on open-ended questions where no single ground-truth answer exists. To quantify this effect, we introduce DRIFTLENS, a ground-truth-free framework that maps each expressed reasoning step to a value category and measures divergence between a question’s no-memory trajectory and its trajectory under injected user-attribute memory. We first validate that DRIFTLENS distinguishes content-free pragmatic noise from substantive reasoning changes. Across four LLMs and 10 user-attribute categories, including age, occupation, and disability, user-attribute memory induces medium-to-large reasoning drift above each model’s pragmatic-noise floor, even when final answers remain fluent, on-topic, and plausible. We then evaluate GRPO- and DPO-based post-training methods for reducing drift. Both reduce drift, but neither uniformly dominates; effects on downstream capability, helpfulness, and instruction following are model-and reward-dependent. These results suggest that memory-induced reasoning drift is a measurable and only partly mitigated failure mode of personalized language models.
[AI-15] Understanding Agent -Based Patching of Compiler Missed Optimizations
链接: https://arxiv.org/abs/2607.02370
作者: Batu Guan,Zirui Wang,Shaohua Li
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 11 pages, 10 figures
Abstract:Compiler missed optimizations refer to cases in which compilers failed to optimize certain code. It takes many compiler developers’ efforts to implement or patch such missed optimizations. In this paper, we present a systematic study of how well agents patch compiler missed optimizations. We identify a significant challenge that patching a missed optimization requires more than just fixing the reported case, and instead requires generalizing to similar cases. We construct a benchmark of real-world LLVM missed optimization issues and compare agent-generated patches with patches from developers in terms of optimization scope. Our results show that coding agents often optimize the given examples, but many generated patches either cover only part of the developer-intended scope or partially overlap with it; in some cases, they further generalize beyond the reference patch. We further introduce historical-knowledge augmentation techniques that leverage prior LLVM optimization pull requests through retrieval and distillation, showing that they improve developer-aligned generalization and yield practical benefits when applied to real-world IR.
[AI-16] Self-Gating Attention for Efficient Time Series Forecasting
链接: https://arxiv.org/abs/2607.02344
作者: Dezheng Wang,Tong Chen,Wei Yuan,Congyan Chen,Shihua Li,Hongzhi Yin
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Transformer architectures have shown strong potential in time series forecasting, where multi-head self-attention is widely used to capture temporal dependencies across historical timestamps. However, standard self-attention has quadratic time and memory complexity with respect to the look-back length. This cost may limit its use in resource-constrained or high-throughput forecasting systems, where fast and memory-efficient inference is important. Through qualitative and quantitative analyses, we observe that self-attention maps in time series forecasting often contain redundant patterns across different timestamps. This phenomenon can be related to the repeated temporal patterns and relatively stable temporal correlations in many real-world time series. Motivated by this observation, we propose Self-Gating Attention (SGA), a plug-and-play attention mechanism that represents the attention score with a shared learnable matrix and an input-dependent residual component. The shared matrix captures common attention patterns, while the residual component captures input-dependent variations. In this way, SGA avoids the query and key projections used in standard attention score computation, leading to linear time and score-matrix memory complexity with respect to the look-back length. We integrate SGA into several forecasting backbones and compare it with standard self-attention and lightweight attention variants on nine publicly available real-world datasets covering electricity, finance, weather, medical monitoring, human activity, and climate records. The results show that SGA improves inference efficiency on public benchmarks while maintaining competitive forecasting performance against state-of-the-art attention mechanisms. These benchmark results provide deployment-oriented evidence.
[AI-17] SelectTSL: Prompt-Guided Selective Target Sound Localization in Complex Scenarios
链接: https://arxiv.org/abs/2607.02343
作者: Ziyang Jiang,Yu Chen,Zexu Pan,Xinyuan Qian,Bowen Xing,Ivor W. Tsang,Xu-Cheng Yin,Haizhou Li
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:Humans can selectively attend to a target sound and estimate its direction in complex scenarios, whereas such selective localization remains challenging for current deep learning-based systems. Sound source localization (SSL) has achieved remarkable success with deep learning, yet most methods localize all active sources without selectivity. Conversely, target sound extraction (TSE) extracts sources using multimodal prompts but typically fails to preserve the multichannel spatial information required for accurate localization. To bridge this gap, we formulate the task of prompt-guided selective target sound localization and propose SelectTSL, an end-to-end architecture that localizes only the user-specified target in multi-source acoustic scenes. Specifically, we design a target-aware selective localization strategy that employs a Prompt-Guided Selective Attention Module (PGSA) to generate prompt-informed embeddings. These embeddings guide an inter-channel phase difference (IPD) enhancer to refine raw phase cues, fusing with target magnitudes to jointly estimate direction of arrival (DoA) and target-source cardinality, i.e., the number of target sound sources. This coupled design effectively focuses on the user-specified target spatial cues for selective localization and also handles time-varying numbers of target sources. Extensive experiments on both synthetic data and real-world recordings demonstrate that our proposed method consistently outperforms other baselines and exhibits robust generalization to real acoustic environments.
[AI-18] Grounded autonomous research: a fault-tolerant LLM pipeline from corpus to manuscript in frontier computational physics ICML2026
链接: https://arxiv.org/abs/2607.02329
作者: Haonan Huang
类目: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Computational Physics (physics.comp-ph)
备注: 39 pages, 5 figures. Accepted at the ICML 2026 AI for Science Workshop ( this https URL ). Includes the pipeline-generated companion physics manuscript as an appendix. Data and scaffolding archive: this https URL
Abstract:Autonomous-research agents have demonstrated end-to-end LLM automation in machine-learning sandboxes where execution provides calibration. Frontier physical science differs categorically: physical reasoning underlies every methodology choice, toolchains are often underdocumented, and calibration must come from external literature anchors - which unscaffolded agents cite but do not confront, hallucinating plausible, unverifiable results from internal priors. We present a pipeline that runs end-to-end from a corpus of 11,083 recent condensed-matter physics arXiv papers to a publication-grade manuscript with three substantive physics findings (here on altermagnetic piezomagnetism): the agent autonomously conceives a research direction by mapping the corpus, calibrates methodology by reproducing published references, conducts novel first-principles computations, and writes the manuscript - grounded in literature throughout, across 47 fresh-context sessions in six phases sharing only on-disk state, with 2,162 literature-consultation events. Fault tolerance emerges from redundancy: fresh-context isolation, distributed grounding, and adversarial review catch what any single session misses; pre- and post-pilot stages are fully autonomous, and pilot requires bounded human intervention only at reproduction failures - operational knowledge curation, not scientific direction. Two paired failure modes - a pre-architecture baseline and a no-pilot ablation - isolate structurally enforced numerical confrontation at calibration checkpoints as the operative grounding mechanism. The primitives, characterized failure modes, and quantified intervention pattern lay a foundation for autonomous research in high-stakes scientific domains beyond computational physics.
[AI-19] A Hippocampus for Linear Attention: An Exact Memory for What the Recurrent State Forgets
链接: https://arxiv.org/abs/2607.02303
作者: Wanyun Cui
类目: Artificial Intelligence (cs.AI)
备注: 12 pages
Abstract:Linear-attention and state-space language models compress the prefix into a fixed-size recurrent state, yielding O(1) memory at the cost of a lossy exact memory: when many key–value associations compete, earlier facts are overwritten and needle recall degrades. Inspired by Complementary Learning Systems, we give linear attention a hippocampal complement. HOLA (Hippocampal Linear Attention) keeps the usual delta-rule state as a compressive memory and adds a bounded exact KV cache, forming a semiparametric test-time memory: the state models linearly compressible structure, while the cache stores associations that should not be forced through that state. The cache writes without a learned eviction module, keeping tokens with large beta * ||e||, the prediction residual actually committed to the state; a decoupled RMSNorm-gamma cache read then turns these exact KV pairs into sharp retrieval rather than soft averaging. At 340M parameters trained on 15B SlimPajama tokens, HOLA lowers Wikitext perplexity from 27.32 to 22.92 (-16.1%), below a full-attention Transformer++ (26.88), and improves LAMBADA perplexity from 30.95 to 30.26. It also achieves the best linear in-context retrieval and remains much more robust than GDN or a matched HOLA+recency cache on RULER needle-in-a-haystack recall out to 32k tokens (16x its training length).
[AI-20] Generalization in offline RL: The structure is more important than the amount of pessimism
链接: https://arxiv.org/abs/2607.02288
作者: Max Weltevrede,Matthijs T.J. Spaan,Wendelin Böhmer
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:While pessimism counteracts overestimation bias in offline reinforcement learning (RL), being overly conservative has been associated with hindering certain forms of generalization. However, in this paper we demonstrate that being overly pessimistic does not inherently prevent optimal generalization in contextual MDPs (CMDPs). Instead, we argue successful generalization depends not on the amount of pessimism, but whether the pessimistic structure respects the underlying symmetries of the optimal solution. We prove that a mildly pessimistic, non-symmetric value function can generalize worse than an overly pessimistic, symmetric one. In offline RL, the structure of the pessimism is determined by the structure of the dataset coverage. As such, enforcing a symmetric value function can be non-trivial, and might require techniques such as data augmentation (DA). Inspired by our theoretical results, we argue that DA can best be applied through a consistency loss during policy extraction, rather than the common practice of (regular) offline training on an augmented dataset. This is empirically validated using IQL and CQL on a rotationally symmetric reacher environment.
[AI-21] Purified OPSD: On-Policy Self-Distillation Without Losing How to Think
链接: https://arxiv.org/abs/2607.02234
作者: Zhanming Shen,Jintao Tong,Shaotian Yan,Chen Shen,Hao Chen,Wentao Ye,Xiaomeng Hu,Rui Miao,Haobo Wang,Junbo Zhao,Gang Chen,Jieping Ye
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:On-policy self-distillation (OPSD) has emerged as a promising paradigm for improving LLM reasoning, where a privileged teacher with access to reference solutions provides token-level supervision on the student’s own generated trajectories. However, we find that OPSD consistently fails on long chain-of-thought (long-CoT) reasoning models, yielding at best marginal gains while destabilizing the reflective reasoning capability these models depend on. Through a novel decomposition of the teacher’s supervision signal, we identify the root cause: the teacher’s supervision is dominated by a reference-induced component that drives rote memorization of reference-specific shortcuts, while the question-conditioned, inference-transferable component is ignored or actively opposed. Based on this diagnosis, we propose a two-step solution. First, we construct a reference-only teacher (the same model conditioned on the reference without the question) to isolate the non-transferable component of the supervision signal; the residual after subtracting this component captures the question-conditioned, inference-transferable correction. Second, we use pointwise mutual information (PMI) as the mechanism to transform this residual into a well-formed PMI target distribution that the student can directly distill from, filtering out the reference-induced shortcut. Experiments on four long-CoT models across two datasets demonstrate consistent improvements over both the base model and standard OPSD, while preserving the models’ natural epistemic behavior throughout training.
[AI-22] CoFL-S: Spatially Queryable Sector Flow Fields for Local Language-Conditioned Navigation
链接: https://arxiv.org/abs/2607.02222
作者: Haokun Liu,Zhaoqi Ma,Yicheng Chen,Wentao Zhang,Masaki Kitagawa,Zicen Xiong,Jinjie Li,Moju Zhao
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 27 pages, 13 figures
Abstract:Vision-Language Navigation has increasingly emphasized high-level instruction reasoning, memory, global map construction, and instruction decomposition, while the low-level action representation remains comparatively underexplored. We propose CoFL-S, a low-level vision-language-action framework that predicts a language-conditioned flow field over the robot’s local visible sector and generates continuous trajectories by rolling out the predicted field. To train this low-level representation, we convert each VLN-CE episode, originally a whole-episode instruction paired with an action sequence, into frame-level local supervision with aligned sub-instructions and matched action, trajectory, and dense flow-field targets. For evaluation, we introduce a continuous-time Habitat benchmark that isolates low-level action interfaces from instruction decomposition and executes all methods through a shared velocity-command controller, enabling decomposition-independent closed-loop comparison across different planner frequencies rather than fixed discrete forward-and-turn transitions in VLN-CE. Under matched encoders and training settings, CoFL-S consistently outperforms action-token and action-chunk baselines across planner frequencies in the continuous-time Habitat benchmark, and zero-shot real-world closed-loop deployment further shows its advantage over both baselines beyond simulation.
[AI-23] Criticality-Based Guard Rail Validation for AI Agent Decisions in Autonomous Telecom Networks
链接: https://arxiv.org/abs/2607.02210
作者: Ravi Kant Sharma
类目: Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注: 9 pages, 5 figures, 5 tables
Abstract:The evolution toward fully autonomous telecommunications networks (Autonomous Network Levels 4-5) requires AI/ML agents to make real-time network decisions without human intervention. However, no standardized runtime mechanism exists to intercept and validate individual inference outputs before they trigger live network state changes, creating risks of erroneous autonomous decisions. This paper proposes the Guard Rail Validation (GRV) framework, a standardizable runtime architecture for intercepting and validating AI-driven decisions before execution. The framework evaluates decisions across multiple weighted dimensions – including action scope, action type, service criticality, agent autonomy level, reversibility, and temporal behavioural patterns – to determine a criticality level. Based on this level, graduated validation mechanisms are applied: execute-with-logging, bounds checking, independent agent validation, or multi-agent consensus. The framework additionally provides cross-agent conflict detection with criticality-weighted priority resolution and runtime conformance logging for regulatory compliance (e.g., EU AI Act Article 14). We present the architecture, algorithmic procedures, O-RAN deployment model, and evaluate threat coverage against known AI/ML attacks in telecommunications.
[AI-24] he Eticas AI Risk Taxonomy: Open Infrastructure for Operationalizing AI Audits
链接: https://arxiv.org/abs/2607.02201
作者: Gemma Galdon Clavell,Pablo Accuosto,Usman Gohar
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid deployment of AI systems across high-stakes domains has created urgent demand for standardized evaluation, yet the field remains fragmented across competing risk taxonomies that catalog risks without showing how an audit is executed. At least 74 AI risk taxonomies exist, and almost all stop at the catalog. The hard part of auditing is not naming a risk but operationalizing it: turning it into a test run against a real system, a measured value, a calibrated severity, and a defensible grade. This paper leads with that bridge. We present the operationalization layer Eticas has built and run, shown end to end on a single risk (PII leakage) against a public benchmark, and then the open taxonomy that makes the method scale. On GPT-4-0314, a disclosure risk that seven external frameworks require be controlled is measured at 0%, 51%, and 84% disclosure as adversarial conditioning increases, mapping through calibrated severity bands to a subcategory grade of E with a SYSTEMIC pattern. Around this example, the Eticas AI Risk Taxonomy v2.0.0 organizes 76 active subcategories across 10 categories and 20 sub-groups, with mappings to 18 external frameworks across compliance, reference, and academic tiers. Its category and sub-group layer is published under CC BY 4.0 as open semantic infrastructure with stable URIs and SKOS/JSON-LD distributions, and a worked subcategory example shows the operational layer down to its severity thresholds. The contribution is the demonstrated bridge from concept to graded finding, anchored by a clean separation of risks from the mechanisms by which they surface, and framed by an open-core model in which the conceptual scaffold is open and the methodology calibration is the practitioner layer. This is the infrastructure the AI auditing field needs: shared, open, and demonstrably operable.
[AI-25] Overview of Risk Assessment and Management for Intelligent Systems under the AI Act and Beyond CCS
链接: https://arxiv.org/abs/2607.02197
作者: Javier Irigoyen,Roberto Daza,Aythami Morales,Julian Fierrez,Ruben Tolosana,Ruben Vera-Rodriguez,Francisco Jurado,Alvaro Ortigosa
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 6 pages, 1 figure, 1 table. Accepted at the IEEE International Carnahan Conference on Security Technology (ICCST 2026), October 14, 2026
Abstract:The society and emerging risk-based regulatory frameworks for AI underscore the need for rigorous risk assessment to ensure safe and reliable AI systems. In response to this imperative, this paper presents an overview of AI risk assessment (identification and analysis) and management methodologies. It begins by reviewing the worldwide regulatory landscape that drives the need for systematic AI risk assessment. Then we characterize the spectrum of AI-related risks identified in the literature, from technical failures to ethical and social impacts. Subsequently, it reviews key risk assessment methodologies proposed for AI systems, focusing on general frameworks. The paper highlights best practices and illuminates methodological gaps, highlighting areas for further research on AI risk assessment.
[AI-26] UA-ChatDev: Uncertainty-Aware Multi-Agent Collaboration for Reliable Software Development
链接: https://arxiv.org/abs/2607.02186
作者: Temitayo Olamilekan Ogunsusi,Lijun Qian,Xishuang Dong
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Software development is a complex task that demands cooperation among agents with diverse roles. Large language models (LLMs) have enabled autonomous multi-agent software development frameworks that leverage role-based collaboration to automate requirements analysis, coding, testing, and refinement. However, existing approaches typically assume that intermediate agent outputs are equally reliable, leaving them vulnerable to hallucination propagation, where incorrect decisions generated in early development phases are transferred to downstream agents and negatively impact final software quality. To address this challenge, we propose UA-ChatDev, an uncertainty-aware multi-agent software development framework that integrates uncertainty quantification into agent interactions. It introduces a lightweight uncertainty estimation mechanism based on token-level log probabilities to assess the confidence of agent responses and employs phase-aware threshold calibration to selectively trigger retrieval-based verification when uncertainty exceeds acceptable levels. Extensive experiments on the SRDD benchmark demonstrate that UA-ChatDev consistently outperforms existing single-agent and multi-agent software development frameworks across completeness, executability, consistency, and overall quality metrics. Further ablation studies and communication analyses verify that uncertainty-aware interactions enhance code execution reliability.
[AI-27] A rubric-based controlled comparison of frontier language models on expert-authored clinical reasoning tasks
链接: https://arxiv.org/abs/2607.02175
作者: Samiha A. Ismail,Fan X. Chen,Ali Merali
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages, 4 tables
Abstract:Multiple-choice medical benchmarks are increasingly saturated, and recent rubric-based evaluations such as HealthBench have shown that open-ended clinical performance is far from solved - its “Hard” subset top score remains 32%. We present a small, deliberately difficult evaluation dataset of five clinician-authored clinical scenarios spanning four specialties (anaesthesia, internal/family medicine, emergency medicine, and obstetrics), each accompanied by an atomic, weighted, MECE rubric (25-62 criteria per task; 184 criteria total) authored from a clinician-drafted golden answer. We evaluate three frontier models: GPT 5.4, Claude Opus 4.7, and Gemini 3.1 Pro. Mean rubric pass rates were 0.47 (Claude), 0.39 (GPT), and 0.37 (Gemini). The central finding is an inversion of clinical priority: the highest-weighted (weight-5, critical) criteria passed at only 32.4-41.7%, while low-stakes weight-1 criteria passed at 80-90%. 56 of 108 critical (weight-5) criteria (52%) were satisfied by no model. Three LLM autoraters reproduced expert met/not-met labels on 92.8-94.7% of 552 graded criteria. We position this as a methods-and-preliminary-findings contribution: the five tasks demonstrate a scalable, defensible pipeline ready to develop into a large-scale benchmark.
[AI-28] Dynamic Neural Graph Encoding of Inference Processes in Deep Weight Space
链接: https://arxiv.org/abs/2607.02166
作者: Di Wu,Huan Liu,Zhixiang Chi,Yuanhao Yu,Konstantinos N. Plataniotis,Yang Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Published in Transactions on Machine Learning Research (TMLR), 2026. 28 pages, 5 figures
Abstract:The rapid advancements in using neural networks as implicit data representations have attracted significant interest in developing machine learning methods that analyze and process the weight spaces of other neural networks. However, efficiently handling these highdimensional weight spaces remains challenging. Existing methods often overlook the sequential nature of layer-by-layer processing in neural network inference. In this work, we propose a novel approach using dynamic graphs to represent neural network parameters, capturing the temporal dynamics of inference. Our Dynamic Neural Graph Encoder (DNG-Encoder) processes these graphs, preserving the sequential nature of neural processing. Additionally, we also leverage DNG-Encoder to develop INR2JLS (Implicit Neural Representation to Joint Latent Space) for facilitate downstream applications, such as classifying Implicit Neural Representations (INRs). Our approach demonstrates significant improvements across multiple tasks, surpassing the state-of-the-art INR classification accuracy by approximately 10% on the CIFAR-100-INR.
[AI-29] A2utoLPBench: An Auto-Generated Agent -Friendly LP Benchmark via Inverse-KKT Construction
链接: https://arxiv.org/abs/2607.02141
作者: Shuo Ren,Yaohui Han,Yifan Shi,Libo Shen,Haodong Lu,Dongfang Wu,Rongliang Fu,Bei Yu,Tsung-Yi Ho
类目: Artificial Intelligence (cs.AI)
备注: 25 pages and 4 figures
Abstract:Most LP-from-text benchmarks are static datasets of word problems written and labeled by hand. Once such a dataset is released, its size is fixed, its difficulty is fixed, and every problem can leak into the training data of future LLMs. We present \textbfA ^2 utoLPBench, a benchmark for testing LLM-driven agents on linear programming problems written in plain text. We first pick a feasible point and dual, then write down a problem for which that point is optimal and the objective value is known. The answer is known by construction, with no solver call and no human annotator. The evaluation environment bundles a reference solver-critic baseline and a Docker image whose usage instructions are written for an LLM-driven agent to read. With these in place, any agent can run the benchmark and get a calibrated score with one command. Because the benchmark is a generator rather than a fixed dataset, it has properties no fixed dataset can match: an unlimited supply of fresh problems, a difficulty knob set by (n,m) , ground-truth answers correct by construction, low LLM-side cost per problem relative to human authoring, repeatable scores across independent batches, and resistance to training-data leakage when fresh post-cutoff seed ranges are used.
[AI-30] ART for Diffusion Sampling: Continuous-Time Control and Actor-Critic Learning
链接: https://arxiv.org/abs/2607.02137
作者: Yilie Huang,Wenpin Tang,Xun Yu Zhou
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
备注: 36 pages, 14 figures, 8 tables
Abstract:We study timestep allocation for score-based diffusion sampling, where a learned reverse-time dynamics is discretized on a finite grid. Uniform and hand-crafted schedules are standard choices, but they rely on fixed prescriptions and can therefore be suboptimal. To address this limitation, we propose Adaptive Reparameterized Time (ART), a continuous-time control formulation that learns a time change by treating the speed of the sampling clock as the control, so that a uniform grid on the learned clock induces adaptive timesteps in the original diffusion time. Based on a leading-order Euler error surrogate, ART provides a principled objective for allocating timesteps along the sampling trajectory. To solve this deterministic control problem, we introduce ART-RL, an auxiliary randomized formulation with Gaussian policies that turns schedule learning into a continuous-time reinforcement learning problem. We prove that the randomized ART-RL formulation is equivalent to ART at the optimizer level, in the sense that its optimal Gaussian policy recovers the optimal ART time-warping rate through its mean. We further establish policy evaluation and policy improvement characterizations and derive trajectory-based moment identities that yield implementable actor–critic updates for learning the schedule. Across experiments ranging from controlled low-dimensional settings to image generation, ART-RL can be plugged into existing diffusion samplers by changing only the timestep grid, consistently improving sample quality over strong baseline schedules at matched budgets while leaving the rest of the sampling pipeline unchanged. The learned schedules also exhibit broad generalization, transferring without retraining across sampling budgets, datasets, solvers, pipelines, and representation spaces.
[AI-31] Coding-agents can replicate scientific machine learning papers
链接: https://arxiv.org/abs/2607.02134
作者: Atharva Hans,Ilias Bilionis
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Scientific machine learning papers typically make computational claims, e.g., that the relative mean square error is less than 5% or that the 95% predictive credible interval covers the test data. A coding agent can be prompted to replicate those claims from paper materials alone, but the prompt does not by itself reliably preserve progress or check whether generated evidence supports the paper’s claims. We introduce Paper-replication, a workflow that makes each selected paper claim a target with recorded evidence, and implement it as a coding-agent skill. The workflow makes the agent record those targets, reconstruct the paper’s method, run computational experiments, link generated outputs to provenance and comparisons with the paper’s claims, record where matched evidence appears in the replication report, and pass validation checks before completion. We evaluate Paper-replication on twelve independent runs across four scientific machine learning papers. All twelve workspaces pass the completion gate, and all 158 recorded targets are matched with report coverage. Even in this completed workspace state, repeated runs differ in how papers are divided into targets, in numerical fidelity to the source papers, in elapsed replication time, in the number of intermediate executions replaced before final evidence is accepted, and in the rules used to accept evidence. Paper-replication makes completion depend on workspace evidence and validation checks rather than on the agent’s final message.
[AI-32] Behind the Refusal: Determining Guardrail Activation via Behavioral Monitoring
链接: https://arxiv.org/abs/2607.02121
作者: William Hackett,Peter Garraghan
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 19 pages, 13 figures, 4 tables
Abstract:As Large Language Models (LLMs) and agentic systems become integrated into real-world applications, ensuring their safety and security is critical. Guardrail systems that detect and block malicious instructions sent to and from an LLM are an essential component of AI security. However, researchers conducting black-box adversarial emulation against production AI systems often struggle to determine whether a guardrail block or an LLM rejection has occurred. This distinction is important because the techniques used to bypass guardrails can differ substantially from those used to bypass LLM safety alignment, and has a material impact on attack technique selection and optimization. We propose the first black-box guardrail reconnaissance methodology, which detects the presence of a guardrail within a target AI system through behavioral monitoring of HTTP, lexical, and timing signals, assuming only black-box access and zero prior knowledge of the guardrail or AI system. Experiments demonstrate that our approach detects guardrail presence with 100% accuracy, with statistically significant behavioral separation between benign and malicious interactions (q 0.001). Our approach further identifies the content categories a guardrail is designed to block, and distinguishes guardrail blocks from LLM rejection on unseen prompts with an average F1 score of 98%.
[AI-33] Enhancing Fitness Intelligence through Domain-Specific LLM Post-Training
链接: https://arxiv.org/abs/2607.02118
作者: Xingtao Zhao,Tian Yang,Han Jiang
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 6 tables, 2 figures. Accepted by the 12th International Conference on Big Data Computing and Communications (BigCom 2026)
Abstract:Scientific Fitness Coaching (SFC) is typically delivered by human professionals, making it costly and inaccessible to many. While recent advances in Large Language Models (LLMs) show considerable promise for more inclusive fitness coaching, directly deploying prevailing general-purpose LLMs in SFC reveals critical limitations. These models often lack sufficient domain-specific knowledge integration, leading to weak performance on complex SFC scenarios. In this paper, we introduce FitOne, a series of fitness LLMs (with 8B and 32B parameters) designed to improve reliability and domain specialization for SFC applications. Built upon the Qwen3 foundation models, FitOne is developed through a three-stage post-training pipeline consisting of continual pre-training, supervised fine-tuning, and reinforcement learning, using large-scale, high-quality datasets derived from rigorous knowledge engineering. We conduct comprehensive evaluations of FitOne on professional fitness certification exams, including ACSM-EP and NSCA-CSCS, as well as general capabilities such as knowledge reasoning and instruction following. Experimental results show that, while retaining strong general capabilities, FitOne-8B/32B achieves average improvements of up to 10.09%/9.29% and 12.73%/7.01% on the ACSM-EP and NSCA-CSCS exams, respectively, compared with the Qwen3 base models. Furthermore, in-depth ablation studies confirm the necessity of each training stage, highlighting the pipeline’s effectiveness in balancing domain expertise enhancement with general ability retention. We believe this research advances LLM systems toward more reliable fitness intelligence and will inspire future research on developing domain-specific LLMs.
[AI-34] ContextNest: Verifiable Context Governance for Autonomous AI Agent
链接: https://arxiv.org/abs/2607.02116
作者: Misha Sulpovar(1),Benn R. Konsynski(2),Qaish Kanchwala,Gabe Goodhart(3) ((1) PromptOwl, LLC, (2) Goizueta Business School, Emory University, (3) IBM Research)
类目: Artificial Intelligence (cs.AI)
备注: 35 pages, 11 tables, 4 figures
Abstract:Autonomous AI agents increasingly depend on external knowledge stores, yet most retrieval pipelines provide relevance without durable guarantees of provenance, version identity, integrity, traceability, or point-in-time reconstruction. We formalize this as context governance and present ContextNext, an open specification and reference implementation for governed AI-consumable knowledge vaults. ContextNext does not replace Retrieval-Augmented Generation (RAG); it supplies the governance layer beneath retrieval, determining which artifacts are approved, current, attributable, and integrity-verified before retrieval systems operate over them. The specification combines typed Markdown documents with metadata, deterministic set-algebraic selectors, contextnest:// URI references, SHA-256 hash-chained version histories, graph-level checkpoints, source nodes for live data through the Model Context Protocol (MCP), and audit traces of agent context consumption. These mechanisms let organizations reconstruct which knowledge versions informed an agent output and whether those versions were AI-eligible when consumed. We report first empirical results from two controlled experiments. In a stale-version attack isolating the governance-versus-retrieval failure mode, governed selection strictly Pareto-dominates BM25 sparse retrieval, with higher answer-quality pass rate (97% versus 93-90%) at about one-third the input-token cost. In a retrieval-determinism experiment over a 1,060-document corpus, deterministic selectors and BM25 return stable document sets across repeated identical queries (Jaccard 1.0), while a dense+HNSW baseline is non-deterministic on 80% of queries (mean Jaccard 0.611, worst case 0.210). These results suggest that context governance addresses failure modes retrieval quality alone is not designed to resolve. We release a core engine, CLI, and MCP server under open licenses. Comments: 35 pages, 11 tables, 4 figures Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2607.02116 [cs.AI] (or arXiv:2607.02116v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2607.02116 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-35] Guided Action Flow: Q-Guided Inference for Flow-Matching Vision-Language-Action Policies
链接: https://arxiv.org/abs/2607.02092
作者: Liuhaichen Yang,Zhuang Jiang,Chenchao Sheng,Zezhi Tang
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Flow-matching vision-language-action policies generate robot action chunks through an iterative transport process, creating an opportunity for test-time guidance without retraining the base policy. We study this opportunity in Guided Action Flow, an inference-time framework that keeps a pretrained SmolVLA policy frozen and uses a learned action-chunk critic to guide its reverse-time flow sampler. The critic is trained from real success and failure rollouts, can condition on task-description features from the frozen SmolVLA language pathway, and is used only through action gradients during sampling. We evaluate the approach on LIBERO manipulation tasks. A single-task critic improves success from 68.0% to 82.0% on one seed window and from 82.0% to 86.0% on another. A multi-family task-description critic improves validation success from 46.0% to 56.0%, while the locked held-out test gain is positive but modest, from 65.0% to 67.5%. These results support the feasibility of Q-guided inference for frozen flow-matching VLA policies, while showing that critic generalization and uncertainty-aware guidance remain the central bottlenecks.
[AI-36] SUNTA: Hierarchical Video Prediction with Surprise-based Chunking
链接: https://arxiv.org/abs/2607.02087
作者: Tomoshi Iiyama,Masahiro Suzuki,Yutaka Matsuo
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Hierarchical state-space models (HSSMs) offer a promising approach to long-horizon prediction by segmenting sequences into temporal chunks. However, their performance hinges on how chunk boundaries are determined. While prior HSSMs typically rely on fixed-length chunking or similarity-based boundary detection, these methods often misalign with the intrinsic temporal structure of the data. We argue that chunking should instead be driven by prediction errors, which more directly indicate when longer-range context becomes necessary. Nevertheless, integrating surprise-based chunking into HSSMs introduces critical challenges, including hierarchical collapse during end-to-end training and the absence of surprise signals during open-loop prediction. To address these issues, we propose Surprise-based Nested Temporal Abstraction (SUNTA), a method that employs a decoupled training strategy to preserve surprise signals and uses internal inconsistency as a top-down surprise metric to determine chunk boundaries within imagined rollouts. Experiments on video prediction tasks in 2D and 3D environments demonstrate that SUNTA outperforms baselines, uniquely maintaining accurate predictions over 250 timesteps, whereas all baselines degrade within the first 10 timesteps.
[AI-37] Evolutionary Wave Function Collapse
链接: https://arxiv.org/abs/2607.02082
作者: Dipika Rajesh,Ahmed Khalifa,Julian Togelius
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: 4-page short paper with 3 figures accepted at CoG 2026
Abstract:Wave Function Collapse (WFC) is a widely used procedural content generation method that learns local adjacency constraints from example inputs to generate larger outputs. In this paper, we explore combining WFC with evolutionary search by evolving the small input examples used by WFC rather than directly evolving complete levels. In this approach, WFC acts as a genotype-to-phenotype mapping. The generated levels are then evaluated through domain-specific fitness functions. We evaluate the method in two domains with different relationships between local and global structure: Maze connectivity maps and Zelda-style dungeon layouts. Our results show that evolutionary optimization over WFC inputs improves generation quality in domains where properties emerge from local relationships, while domains requiring global constraints remain challenging. These findings suggest that evolutionary search can effectively guide WFC generation when target objectives align with local structure.
[AI-38] Evidence-State Rewards for Long-Context Reasoning
链接: https://arxiv.org/abs/2607.02073
作者: Ya Gao,Pekka Marttinen
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under review
Abstract:Long-context reasoning requires models to locate, revise, and synthesize evidence distributed across lengthy inputs. Existing long-context RL methods usually reward final answers or static evidence extraction, offering little feedback on how intermediate actions change the model’s evidence state. We propose Maven, a reinforcement learning framework with an editable evidence memory. Maven defines an answer-conditioned evidence-state value and rewards action-level state transitions: add actions are credited by marginal gain and hindsight contribution, link actions by evidence synergy, and drop actions by improved answer support after removing misleading evidence. These rewards are assigned to the corresponding action spans in GRPO. Across Llama and Qwen models on LongBench v2, LongReason, and RULER, Maven outperforms outcome-only RL and evidence-identification baselines, producing more sufficient evidence sets and lower distractor retention. Our results show that long-context RL benefits from optimizing stateful evidence navigation rather than one-shot evidence extraction.
[AI-39] kNNGuard: Turning LLM Hidden Activations into a Training-Free Configurable Guardrail
链接: https://arxiv.org/abs/2607.02072
作者: Mahmoud Abdelfattah,Hamid Nasiri,Peter Garraghan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 17 pages, 11 figures
Abstract:Large language models (LLMs) are increasingly deployed in domains requiring guardrails to detect unsafe, off-topic, or adversarial prompts. Existing guardrails predominately rely on fine-tuning to build classifiers, which often suffer from low generalization and high inference latency. We present kNNGuard, a training-free guardrail that utilizes the activation space of an off-the-shelf LLM. Given a small bank of 50 safe and unsafe prompts, kNNGuard extracts hidden activations and performs multi-layer kNN fusing activation-space and embedding-space scores for classification. Across six domains spanning topical and security prompts, kNNGuard achieves competitive or superior F1 compared to fine-tuned state-of-the-art guardrails while running 2.7x faster than the best comparable guardrail, and 10x faster than a fine-tuned safety classifier without gradient updates or fine-tuning. Domain adaptation requires only updating the labeled bank, which can be constructed in under 10 seconds and several orders of magnitude faster than established guardrails. We also analyze the impact of system prompts, layer selection, and integration into production LLM pipelines as a configurable, low-latency guardrail.
[AI-40] Algebraic Model Counting for Global Analysis of Optimal Decision Trees ECML-PKDD2026
链接: https://arxiv.org/abs/2607.02069
作者: Hiroki Arimura
类目: Artificial Intelligence (cs.AI)
备注: Proc. Joint European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2026), LNCS, Naples, Italy, 7-11 September 2026
Abstract:Ensuring model reliability in Explainable AI requires a global assessment of the hypothesis space. We propose a formal framework for the exhaustive analysis of optimal and near-optimal decision trees, called Algebraic Decision Tree Counting (ADTC). Inspired by Algebraic Model Counting (AMC) in knowledge representation, ADTC reformulates diverse analytical tasks, such as optimization, counting, and sampling, into a unified sum-of-products computation over a semiring R . While the hypothesis space of decision trees is doubly exponential with respect to the maximum depth \Delta , our dynamic programming algorithm achieves O^(n^O(\Delta)) time complexity in the number of features n , where O^ suppresses polynomial factors. To handle complex constraints consisting of multiple tree metrics, we introduce model behavior tensors that aggregate semiring values via convolution products over a tensor semiring. This algebraic approach efficiently constructs a model profile that captures the global landscape and trade-offs between criteria such as accuracy, size, and fairness. We demonstrate the utility of our software, emtrees, on real-world datasets, illustrating how ADTC facilitates evidence-based model selection in sensitive domains.
[AI-41] SA-HGNN: Sample-Adaptive Hyperbolic Graph Neural Network for EEG-Based Depression Recognition
链接: https://arxiv.org/abs/2607.02063
作者: Yang Li,Pan Hu,Yan Zhang,Wenfan Yang,Tao Wu,Lianbo Guo
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Graph Neural Networks (GNNs) have been widely used to capture spatial functional connectivity patterns to improve electroencephalography (EEG)-based depression recognition performance. However, the functional connectivity of brain networks in patients with depression exhibits an inherent hierarchical structure, making it difficult to capture accurate connection patterns. To address these issues, this paper proposes a novel model named Sample-Adaptive Hyperbolic Graph Neural Network (SA-HGNN), which aims to accurately extract the authentic hierarchical structure of depression-affected brain networks. Specifically, the proposed model comprises three core modules. First, a Sample-Adaptive Graph Construction module dynamically constructs personalized brain network topologies to capture more complex spatial relationships within the brain network. Second, hyperbolic graph convolution is employed to overcome the representation bottlenecks of Euclidean space, leveraging hyperbolic geometry to precisely capture latent hierarchical relationships within the brain network. Finally, an Attention Pooling module adaptively filters out highly redundant noise channels in EEG signals, effectively mitigating the interference of inherent noise on the authentic hierarchical topology. Extensive experiments on public EEG datasets demonstrate the superior performance of our method across resting-state and task-related paradigms, validating its robustness to noise and efficacy in capturing abnormal functional connectivity patterns in brain networks of patients with depression.
[AI-42] Prompt Coverag e Adequacy
链接: https://arxiv.org/abs/2607.02057
作者: Florian Tambon,Michael Konstantinou,Cedric Richter,Charles Chenouard,Mark Harman,Mike Papadakis
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:In recent years, it has become increasingly evident that large language models (LLMs) and autonomous agents raise the level of abstraction in software development by shifting the focus from writing precise procedures to expressing intents and goals. This paradigm shift introduces new challenges, particularly in how testing should be guided when prompts, rather than code, become primary development artifacts. To address this challenge, we propose Prompt Coverage Adequacy, a novel coverage criterion designed to support the testing of code generated from task descriptions. Prompt Coverage Adequacy serves as an analog to traditional code coverage, but operates at the level of prompts used in LLM and agent-based programming. Specifically, it measures how well a given test suite satisfies the requirements expressed in a prompt by leveraging the attention mechanisms of LLMs. We evaluate a simple instantiation of this criterion, based on attention boosting, across two datasets and multiple LLMs. Our results demonstrate that Prompt Coverage is associated with fault-detection effectiveness and can uncover over 30+% more faults than traditional code coverage when used to guide test generation. These findings suggest that Prompt Coverage Adequacy can serve as a foundation for developing testing metrics better suited to the emerging paradigm of LLM-driven software development, addressing the limitations of classical coverage criteria in this new context.
[AI-43] owards Load-Aware Prefill Deflection for Disaggregated LLM Serving
链接: https://arxiv.org/abs/2607.02043
作者: Shrikara Arun,Anjaly Parayil,Srikant Bharadwaj,Renee St. Amant,Victor Rühle
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:
Abstract:Disaggregated LLM serving runs prefill and decode on separate GPU pools to keep the two phases from interfering. In practice, this creates a new asymmetry: under bursty, heavy-tailed workloads prefill nodes saturate while decode nodes have compute underutilized, and on a production-style A100 cluster with 2 prefill and 2 decode nodes (2P2D), we find that prefill execution accounts for only 2-23% of P95 Time-to-First-Token (TTFT). Queuing and inter-node GPU-GPU KV-cache transfer account for the rest. We present a proactive prefill-deflecting scheduler that lets decode nodes serve prefill phase of requests as chunked-prefill steps interleaved with their in-flight decode batches. For each queued request, we estimate the TTFT it would see on the prefill node, and on every decode node, search for the largest chunk schedule that keeps in-flight decodes within their Time-Between-Tokens (TBT) SLO and deflect when the decode path helps tail latency. Because the prefill phase of deflected requests runs in place on the decode node, the inter-node KV transfer is eliminated. Implemented on vLLM and evaluated on production-style traces with DeepSeek-V2-Lite, our approach reduces P95 TTFT by upto 81% and raises SLO attainment by upto 79% over state-of-the-art disaggregated schedulers, at sub-millisecond per-request routing cost.
[AI-44] Hidden Forgetting in Continual Multimodal Learning: When Accuracy Survives but Grounding Fails
链接: https://arxiv.org/abs/2607.02020
作者: Qianyu Chen,Canran Xiao,Runxuan Tang
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal large language models must continually adapt to evolving tasks and domains, yet standard continual learning metrics mainly measure whether old answers remain correct, leaving the stability of multimodal grounding largely unexamined. We study this overlooked failure mode and ask whether a continually adapted MLLM can preserve not only what it answers, but also how it uses visual, textual, OCR, chart, and document evidence. We identify \emphhidden evidence-use forgetting, where answer accuracy is retained while the model silently shifts toward different or less grounded evidence channels, and propose \textscRCL, a replay-free reliance-constrained continual learning framework. \textscRCL freezes the previous checkpoint as a behavioral reference, estimates teacher and student evidence-reliance profiles through counterfactual channel interventions, and jointly optimizes task learning, prediction preservation, and reliance preservation without adding inference-time cost. Across CoIN, COAST, MCITlib, and an evidence-sensitive multimodal stream, \textscRCL consistently improves final performance and reduces forgetting over replay-free, PEFT, routing, and memory-assisted baselines, while substantially lowering modality reliance drift, dominant evidence flips, and hidden forgetting rates. These results suggest that robust continual multimodal learning requires preserving the evidence path behind correct answers, not merely the answers themselves.
[AI-45] InduceKV: Fixed-Footprint Continual Adaptation of Multimodal LLM s via Inducing KV Memories
链接: https://arxiv.org/abs/2607.02010
作者: Qianyu Chen,Ziteng Feng,Canran Xiao,Runxuan Tang
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal large language models must adapt to evolving tasks and domains, yet continual improvement under bounded deployment footprint remains difficult because repeated parameter updates or growing replay stores can accumulate adaptation state over time. We study fixed-footprint continual adaptation: the deployed adaptation state is kept under a fixed memory budget, while the backbone model is left unchanged and task-specific updates are externalized. We propose InduceKV, a retrieval-based method that stores each selected training prefix as an attention-ready memory entry, consisting of a frozen retrieval key and compact layerwise key–value (KV) payloads that can be appended to the model’s self-attention cache. Under a strict memory budget, InduceKV constructs a compact inducing set through bilevel selection: a lightweight calibration is fit for retrieval, while the selected memory balances current-task likelihood, anchor-based retention, and coverage in the frozen retrieval space. Across task-incremental instruction tuning, continual VQA, domain-incremental adaptation, and lifelong multimodal instruction tuning, InduceKV consistently improves over PEFT, MoE, replay, and prompt-retrieval baselines under matched memory budgets. We further report backbone-matched, stage-1 CoIN, compute-matched, and scalability diagnostics, showing that the gains are not due to a stronger backbone, replay alone, or an unbounded candidate pool.
[AI-46] raceable Fault Diagnosis for Battery Energy Storag e Systems via Retrieval-Augmented Multi-Agent OM Assistant
链接: https://arxiv.org/abs/2607.01992
作者: Jiangdi Ru,Bing Li,Yage Huang,Ding Wang,Keru Hua
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large-scale battery energy storage systems (BESSs) require OM decisions that combine alarms, cell-level measurements, device topology, diagnostic tables, historical cases, and maintenance documents. Monitoring platforms can flag threshold violations, but they often cannot explain whether voltage inconsistency, resistance drift, short-circuit risk, capacity divergence, or thermal abnormality needs intervention. This digest presents a traceable BESS fault-diagnosis assistant that uses retrieval-augmented multi-agent reasoning to connect operational data, domain knowledge, visual evidence, and report generation. Reliability is improved through BESS-specific task routing, schema-constrained natural-language database access, hybrid text-image retrieval, and evidence-based answer synthesis. Preliminary internal evaluation is reported for routing, database access, and diagnostic reasoning.
[AI-47] Episodic-to-Semantic Consolidation Without Identity Drift
链接: https://arxiv.org/abs/2607.01988
作者: Xue Qin,Simin Luan,Cong Yang,Zhijun Li
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:Long-running adaptive intelligent agents face a structural tension between knowledge consolidation and information integrity. Memory consolidation is conventionally treated as an agent-changing operation: a model is fine-tuned, a prompt rewritten, a policy distilled, or a reflection appended to the context that governs future behaviour. In regulated autonomic deployment this is a liability because the agent operates under commitments and audit contracts that bind to a specific, cryptographically certified identity. We propose to treat consolidation not as a mutation of the planner or the identity manifest, but as a deterministic function f: M^ep - M^sem over episodic memory whose output is a separately addressable semantic knowledge layer; the identity hash does not read M^sem, so consolidation updates knowledge without changing the agent’s certified identity. We give a formal account of the agent representation, prove identity invariance through a structural lemma on the manifest’s hash-input set, specify a deterministic aggregation algorithm whose outputs are auditable database rows with explicit confidence and supporting-event provenance, and validate the construction with synthetic experiments demonstrating per-field correctness, byte-equal identity across consolidation passes, and a mean 79.82% reduction in unproductive planner attempts (95% BCa CI [78.02%, 81.49%] across 10 seeds) against a calibrated Bayesian-shrunk baseline. The construction is a knowledge-update discipline for autonomic agents in which lessons accumulate as queryable facts while the agent’s certified identity remains byte-equal across its operational lifetime, with an embodied service agent as the running case study.
[AI-48] OntoLearner: A Modular Python Library for Ontology Learning with Large Language Models
链接: https://arxiv.org/abs/2607.01977
作者: Hamed Babaei Giglou,Jennifer D’Souza,Andrei Aioanei,Nandana Mihindukulasooriya,Sören Auer
类目: Artificial Intelligence (cs.AI)
备注: 30 pages. Under review at Nature Communications. This version is reformatted with a different section structure; content is unchanged
Abstract:Ontology learning (OL) aims to automatically construct structured knowledge models from text, yet progress remains fragmented across methods, domains, and evaluation practices. Despite decades of research, OL lacks a shared infrastructure for systematic evaluation and ontology access. This absence has hindered progress and fragmented research, leaving the central challenges of OL largely unaddressed. We introduce OntoLearner, a modular, cross-domain, and first-of-its-kind framework that unifies ontology access, large language model (LLM)-driven learning pipelines, and standardized benchmarking. OntoLearner releases 180 machine-readable ontologies spanning 22 domains and provides pipeline-ready datasets with train/dev/test splits for three core OL tasks: term typing, taxonomy discovery, and non-taxonomic relation extraction. Using this infrastructure, we conduct a large-scale empirical study of OL, evaluating 22 retrieval models and 12 LLMs across domains and tasks. The results converge on a finding that reframes the central challenge of OL: failure modes scale with ontological complexity rather than model size or architectural sophistication. The primary bottleneck is not model capability, but a structural mismatch between how models encode knowledge and how ontologies organize it. These findings establish that effective OL is reachable through the cross-domain, multi-task benchmarking enabled by OntoLearner. OntoLearner is open-source (MIT license) at this https URL.
[AI-49] A Multi-Branch Hierarchy-Aware Framework for Heterogeneous Audio Classification
链接: https://arxiv.org/abs/2607.01974
作者: Beile Ning,Jiayi Yu,Zitong Wang,Yufei Hu,Wenjun Xu,Yuanhang Qian,Zhongxin Bai,Gongping Huang
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:This technical report describes our system for Task 1 of the DCASE 2026 Challenge, which aims to classify heterogeneous audio recordings according to the Broad Sound Taxonomy (BST). The task requires both accurate second-level prediction and consistency with the top-level taxonomy. Our system is built on CLAP-based audio-text representations and is improved along three strategies: expanding the training set with a filtered subset of BSD35k, enhancing acoustic modeling with feature-specific branches, and refining predictions using hierarchy-aware classifiers and KNN-based post-processing. Among the acoustic features considered, the log-STFT branch provides the strongest single-model performance. With KNN-based post-processing, our best single system achieves a hierarchical F1 score (Hier. F1) of 80.84% on the BSD10k-v1.2 set under the same evaluation protocol as the baseline. We further construct ensemble systems by combining models with complementary acoustic features and classification heads, achieving Hier. F1 scores of 81.25% and 81.18%, respectively.
[AI-50] Atomic Task Graph: A Unified Framework for Agent ic Planning and Execution
链接: https://arxiv.org/abs/2607.01942
作者: Yue Zhang,Sihan Chen,Ziwen Huang,Hanyun Cui,Kangye Ji,Zhi Wang
类目: Artificial Intelligence (cs.AI)
备注: 14 pages, 7 figures
Abstract:LLM-based agents have shown strong potential for solving complex multi-step tasks, yet existing performance improvements often rely on either scaling to larger backbone models or task-specific fine-tuning. The former incurs substantial computational costs, while the latter typically generalizes poorly across different tasks. Although prompt-based control is training-free and broadly applicable, existing methods still leave input-output dependencies between subtasks implicit in textual trajectories, making verified intermediate results difficult to reuse. To address these limitations, we propose Atomic Task Graph (ATG), a unified control framework for planning and execution. Specifically, ATG maintains an explicit graph to expose dependencies and support reuse. During planning, it recursively decomposes a high-level task into subtasks, forming a sequence of directed acyclic graphs (DAGs) whose evolution can be traced. During execution, the dependencies exposed by ATG allow independent branches to be executed in parallel, thereby improving execution efficiency. When failures are detected, ATG leverages the graph evolution history to localize the error source and repair only the affected region, preserving validated regions unchanged. Experiments show that ATG consistently outperforms strong baselines in success rate and execution efficiency across three interactive benchmarks using only 7B-8B backbones.
[AI-51] Conditional Co-Ablation: Recovering Self-Repair Backups in Transformer Circuits
链接: https://arxiv.org/abs/2607.01940
作者: Zhiren Gong,Zihao Zeng,Chau Yuen,Wei Yang Bryan Lim
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Mechanistic interpretability often relies on component-level interventions to discover how a model produces a behavior. This guides attribution, capability knockout, and model pruning downstream to operate by scoring each unit by the effect of ablation in isolation. Such first-order scoring is natural when component importance is additive, but becomes misleading when a transformer self-repairs: after a primary component is removed, a dormant backup can take over, muting the primary’s measured effect while the backup itself appears irrelevant on the intact model. We recast this failure as a recovery task, conditional circuit completion, and introduce Conditional Co-Ablation (CoAx), a label-free, output-grounded score that asks how much each remaining unit’s ablation effect grows once a primary set has been removed. This conditional growth exposes the second-order interaction that single-unit scores discard. On the GPT-2-small IOI circuit, CoAx raises backup-head recovery from 0.33 to 0.91 ROC-AUC, outperforming all baselines, including self-repair-aware gradient scores (best 0.82); counterfactual patching verifies that the recovered heads causally carry the repair. The same label-free procedure transfers to induction across eight models. Beyond discovery, the recovered backups correct self-repair-masked attribution, identify the components required for capability knockout, and yield repair-aware structured pruning scaling from 124M to 7B. Component importance is therefore not merely an isolated-unit property: in robust circuits, the components that matter can become visible only under the interventions that make them necessary.
[AI-52] A-TMA: Decoupling State-Aware Memory Failures in Long-Term Agent Memory
链接: https://arxiv.org/abs/2607.01935
作者: Zitong Shi,Yixuan Tang,Anthony Kum Hoe Tung
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Long term memory lets LLM agents act as persistent assistants, but user facts change. A useful memory system must know what is true now, what used to be true, and what changed. We study \emphghost memory, a state coordination failure in which old, current, and transition facts coexist in the memory bank, remain mixed during retrieval, and mislead the answer model. We argue that memory systems should be understood and optimized from three levels: bank maintenance, retrieval, and answer time resolution. We propose ATMA, a state aware overlay for existing memory systems. ATMA keeps superseded and transition records in the bank, builds evidence packets for the query’s requested state view, and exposes current, historical, and transition labels to QA. We further call for decoupled evaluation of bank, retrieval, and answer level failures, since final QA accuracy can hide where ghost memory occurs. To make this failure measurable, we build LTP (LoCoMo Temporal Plus), a conflict heavy benchmark for ghost memory, and evaluate on LoCoMo for long conversation generalization. On LTP, Graphiti+ATMA improves conflict accuracy by 0.240 absolute over Graphiti. On LoCoMo, Graphiti+ATMA raises temporal F1 from 0.0295 to 0.1705. The gains are host dependent, but they indicate that explicit state roles can reduce memory failures hidden by final QA accuracy.
[AI-53] Low-Latency Task-Oriented Image Transmission with Opportunistic Spectrum Access
链接: https://arxiv.org/abs/2607.01921
作者: João Henrique Inacio de Souza,Mattia Merluzzi,Mateus P. Mota,Beatriz Soret,Petar Popovski
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
备注: This work has been accepted for presentation at IEEE SPAWC 2026
Abstract:Communication systems designed for reliable data reconstruction, rather than task-oriented communication, typically rely on separate source and channel coding and incur high latency under limited spectrum availability and fading channels. To address this, we propose a transmission framework with opportunistic spectrum access, in which the transmitter sends discrete latent representations learned via a vector-quantized variational autoencoder (VQ-VAE) over idle licensed channels using standard digital modulation. The AI-powered receiver is still able to reconstruct task-related information from the heavily compressed data. We develop a cross-layer latency model that accounts for compression, block errors, retransmissions, and stochastic channel access. Results on latency-accuracy trade-offs show that the proposed scheme achieves at least 79- and 3.3-fold latency reductions with only 5.7% and 2.4% drops in classification accuracy compared to benchmarks using conventional source and channel coding. The framework enables low-latency communication and reliable task execution even under limited spectrum availability and challenging channel conditions.
[AI-54] ElephantAgent : Contextual State Continuity in Agent ic Systems
链接: https://arxiv.org/abs/2607.01919
作者: Jiankai Jin,Xiangzheng Zhang,Zhao Liu,Wenzhuo Xu,Dongdong Yang,Deyue Zhang,Quanchen Zou
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:Agentic systems enhance their capabilities by invoking external tools and maintaining persistent memory. However, these external dependencies introduce novel attack surfaces. Recent tool and memory poisoning attacks show that maliciously crafted tool descriptors and poisoned memory can covertly bias agent behavior. These threats reflect a deeper issue: the lack of verifiable continuity in the agent’s contextual state for planning and execution. We present ElephantAgent, a protocol that enforces Contextual State Continuity to defend against contextual state poisoning. Inspired by prior state-continuity mechanisms (e.g., Nimble), ElephantAgent extends this protection to the evolving contextual state of agentic systems. We define the contextual state as the bounded, security-critical subset of the agent’s entire context (e.g., tool state and memory). Before processing each query, ElephantAgent recomputes the digest of the local contextual state and verifies it against the latest authorized digest. Using replicated trusted hardware, ElephantAgent maintains a linearizable ledger of authorized contextual state transitions and detects out-of-band state tampering. To handle in-band semantic abuse, ElephantAgent additionally provides Historical Traceability, enabling conditional post-hoc audit and recovery to a known-good prior state.
[AI-55] ContextSniper: AntTrails Token-Efficient Code Memory for Repository-Level Program Repair
链接: https://arxiv.org/abs/2607.01916
作者: Chiwang Luk,Matin Mohammad Najafi,Zhifeng Jia,Wei Yang,Xiuchang Li,Jinwei Zhu,Yang Ren,Lei Chen,Gao Cong
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language model agents can repair real repository issues, but they often spend large context budgets on whole-file reads, broad searches, and long terminal outputs where useful evidence is mixed with irrelevant code and logs. This paper presents ContextSniper, AntTrail’s token-efficient code memory layer for repository-level program repair. As the coding specialization of AntTrail’s broader agent memory engine, ContextSniper implements the Sniper feature for precision evidence selection: it retrieves candidate code and runtime evidence, ranks it with hybrid retrieval signals, filters long outputs through an intention-aware context gate, and returns compact evidence packets while preserving recoverable source context outside the prompt. We evaluate ContextSniper on SWE-bench Lite with OpenClaw and Claude Code, using 50 task runs per host-agent condition. ContextSniper reduces total token use by 51.5% and logged cost by 36.4% for OpenClaw, and reduces total token use by 38.9% and estimated cost by 27.3% for Claude Code. Submitted-resolution rates decrease slightly, from 26.0% to 24.0% for OpenClaw and from 32.0% to 30.0% for Claude Code. ContextSniper’s pilot testing scripts are open-sourced at this https URL
[AI-56] Rethinking Complexity Metrics for LLM -Integrated Applications: Beyond Source Code
链接: https://arxiv.org/abs/2607.01903
作者: Zihao Xu,Yuekang Li,Gelei Deng,Yi Liu,Zhenchang Xing
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:LLM-integrated applications blend natural language prompts with program code, and much of their runtime behavior originates in the prompt layer rather than in the code itself. Existing complexity metrics, however, operate solely at the code level and therefore overlook this behavioral logic entirely. We present HECATE, the first tool designed to assess complexity in both the prompt and code layers of such applications. Central to HECATE is Prompt-as-Specification, a Hoare-logic-inspired formalism that interprets every prompt as a specification of intended behavior. Grounded in 25 complexity dimensions identified across published taxonomies, the tool generates 52 candidate metrics. We assess each metric against 118 components collected from 18 open-source repositories, relying on maintenance activity derived from version history as an empirical proxy for complexity, and discard any metric that loses significance once code size is accounted for. Only ten metrics withstand this test. Seven belong to our newly introduced set; rather than measuring sheer volume, each tallies structurally distinct elements, such as LLM call sites, memory attributes, and prompt templates, an attribute we call structural breadth. Of the three surviving conventional metrics, RFC exhibits a similar breadth-oriented character, while Halstead N and V survive only as a residual effect of size; our top-performing metrics exceed all three. Crucially, the prompt-layer metrics retain significance even when the strongest code-level metric is added as a covariate, establishing prompt complexity as a dimension in its own right. A final validation on 20 components spanning six held-out repositories shows that the two best-performing metrics continue to predict maintenance effort, supporting their generalizability beyond the training set.
[AI-57] SABER: A Semantic-Aligned Brain Network Analysis Framework via Multi-scale Hypergraphs ICME
链接: https://arxiv.org/abs/2607.01901
作者: Yidan Xu,Xiangmin Han,Rundong Xue,Huihui Ye
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: Accepted to IEEE International Conference on Multimedia and Expo (ICME) 2026;
Abstract:Effective brain disease diagnosis requires the synergy of brain connectivity patterns and high-level semantic knowledge. Existing methods, however, largely treat semantics from large language models (LLMs) as auxiliary features or supervision, limiting their direct role in decision-making and constraining classification stability and robustness. To overcome this, we propose a semantic-aligned brain network framework that actively integrates LLM-derived semantics into the prediction process. Specifically, ROI-level semantics are first incorporated via global self-attention to enrich node representations and provide whole-brain context. Multi-scale hypergraphs are then constructed to explicitly model functional subnetworks and multi-ROI interactions, addressing the locality limitations of traditional GNNs and capturing high-order dependencies. Finally, a decision-level semantic alignment mechanism selectively injects patient-specific textual embeddings into graph representations, enabling semantics to directly guide predictions without perturbing the underlying network structure. Experiments on public brain network datasets ABIDE and ADHD-200 demonstrate state-of-the-art performance, enhanced stability, and improved interpretability, particularly in small-sample settings.
[AI-58] Rank-Then-Act: Reward-Free Control from Frame-Order Progress
链接: https://arxiv.org/abs/2607.01897
作者: Yuriy Maksyuta,George Bredis,Ruslan Rakhimov,Daniil Gavrilov
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 20 pages, 15 figures
Abstract:We introduce Rank-Then-Act (RTA), a framework for learning control policies from expert video demonstrations without environment rewards. RTA trains a Vision-Language Model (VLM) offline as a progress-based ordinal scorer, using a Group Relative Policy Optimization (GRPO) objective over shuffled frame sequences, which forces the model to recover temporal ordering from visual semantics rather than trivial time cues. Importantly, instead of using the scorer directly as a scalar reward model, we propose a correlation-based reward function for reinforcement learning: at each interaction window, we compute the Spearman rank correlation between predicted progress rankings and true temporal indices, yielding a bounded, scale-invariant learning signal. This design decouples reward learning from absolute calibration and enables stable transfer across tasks and environments. We evaluate RTA on discrete control benchmarks (PyBoy: Catrap, Kirby) and continuous control tasks (PointMaze, MetaWorld). RTA consistently matches or outperforms prior video-based reward learning methods and rank-based baselines, while demonstrating strong cross-task reuse of a single pretrained progress scorer. Our results suggest that correlation-structured supervision over video-derived ordinal signals is sufficient for policy learning, offering a scalable alternative to explicit reward design.
[AI-59] CamoNAS: Neural Architecture Search for Enhanced Camouflaged Object Detection
链接: https://arxiv.org/abs/2607.01870
作者: Dawei Ren,Yan Zhang,Hongying Tang,Qiaoling Zhou,Jianpo Liu
类目: Artificial Intelligence (cs.AI)
备注: Published in The Visual Computer. Author manuscript version
Abstract:Camouflaged Object Detection (COD) aims to locate and segment objects that blend into their surroundings, presenting challenges due to weak edge cues and ill-defined boundaries. Traditional COD models rely on hand-designed architectures and multi-scale feature fusion, which are often guided by intuition rather than systematic search. This paper introduces CamoNAS, a frequency-aware multi-resolution Neural Architecture Search (NAS) framework for COD. CamoNAS automatically searches both cell-level operations and network-level downsampling paths, forming a hierarchical search space tailored to detect camouflaged objects. Additionally, it adopts an RGB frequency dual-stream architecture, where a learnable wavelet transform complements the RGB spatial stream. CamoNAS achieves state-of-the-art performance on four COD benchmarks (CAMO, COD10K, NC4K, CHAMELEON), highlighting the effectiveness of NAS for COD. Our code is available at this https URL.
[AI-60] An Exploratory Study on LLM -Generated Code and Comments in Code Repositories
链接: https://arxiv.org/abs/2607.01867
作者: Yongyi Ji,Jiaji Wang,Yi Zhou,Fuxiang Chen,Hongji Yang
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted to The Journal of Systems Software (JSS) on 1 July 2026
Abstract:The use of LLMs in software development has become increasingly widespread on tasks such as code generation and summarization. Reports from large technology companies showed that around 20% to 30% of their code are generated by LLMs. However, there remains skepticism about the practical usage of LLM-generated code and comments, such as concerns on more time for debugging the generated code and the unnaturalness of the generated comments. In this paper, we study the code and comments detected as likely to be generated by LLMs and their characteristics, the differences between company- and community-maintained repositories, and how likely bugs are associated with LLM-generated code. We conduct extensive experiments on active company- and community-maintained repositories from 2021 to 2025 using various tools and techniques that detect code and comments generated by LLMs. Based on our detector-based proxy analysis, the results suggest that code detected as likely to be generated by LLMs decreased over time and appeared frequently in test cases, while that of comments remains relatively stable. Proxy results further suggest that code detected as likely to be generated by LLMs shows substantial intra-repository code clones, whereas comments exhibit a relatively low proportion of grammatically correct sentences. In addition, the company-maintained repositories show a higher percentage of code and comments detected as likely to be generated by LLMs, and only a small percentage of the human-labelled bugs are detected as being likely associated with LLM-generated code.
[AI-61] Has This Checkpoint Been Abliterated? A Two-Signal Audit and Its Failure Map
链接: https://arxiv.org/abs/2607.01854
作者: Gabriel Hurtado
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 13 pages, 3 figures
Abstract:Can a platform tell, before deployment, whether an open-weight checkpoint has had its refusal mechanism stripped? Runtime guards cannot: they score generations, not the artifact. We combine two cheap internal signals, a reference-anchored activation refusal-gap and a weight-recovery energy of the base-to-candidate weight difference, into a threshold-free checkpoint audit. The two are negatively correlated and label-complementary: the gap supplies refusal-specificity and the weight energy supplies recall. On a 273-checkpoint registry spanning Qwen, DeepSeek-distilled Qwen, Llama, and Gemma, their z-sum separates 57 public abliterations from 37 benign fine-tunes, merges, and instruction-tunes at AUROC 0.95, significantly above either signal alone (0.84, 0.90), and a Youden-calibrated threshold transfers to held-out families at balanced accuracy 0.89 (FPR 0.11), missing only 4 of 57. We then map two failures, in order of severity: a spoofed reference evades both axes with no training (\DeltaW=0, \rho=1 by construction), and a white-box owner trains a checkpoint past the threshold while it stays guard-unsafe and coherent. The audit is effective triage, not tamper-proofing: it presumes an attested reference, and its claims are bounded by the registry we evaluate it on.
[AI-62] Decomposer: Learning to Decompile Symbolic Music to Programs
链接: https://arxiv.org/abs/2607.01849
作者: Yewon Kim,Apurva Gandhi,David Chung,Graham Neubig,Chris Donahue
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: Project page: this https URL
Abstract:Musical performance involves executing a set of high-level musical instructions, yet recovering those instructions from the performance is a challenging inverse problem. We present Decomposer, a post-training framework for symbolic music decompilation: the task of recovering executable, editable music programs from symbolic music. We instantiate the task as MIDI-to-Strudel decompilation, where the model takes symbolic MIDI as input and produces a program in Strudel, a music programming language, that reconstructs the input when executed. The task poses two challenges: Strudel is a low-resource language with little naturally paired MIDI-code data, and optimizing faithful reconstruction of MIDI alone can collapse to unreadable note-by-note transliteration. We address these challenges in two stages. First, we construct Strudel-Synth, a synthetic corpus of paired Strudel programs and rendered MIDI, and use it for supervised fine-tuning. Second, we refine the model with reinforcement learning on unpaired MIDI, optimizing rewards for both MIDI reconstruction faithfulness and code readability. Our evaluation across synthetic and real-world MIDI benchmarks shows that Decomposer achieves substantially higher MIDI reconstruction faithfulness than closed-source LLMs while producing more readable and diverse code than the heuristic converter.
[AI-63] CLAP: Closed-Loop Training Evaluation and Release Control for Domain Agent Post-training
链接: https://arxiv.org/abs/2607.01846
作者: Fangfei Li,Chenyang Zhao,Long Wang,Feng Tian,Zhiyue Zheng,Lv Guo
类目: Artificial Intelligence (cs.AI)
备注: 6 pages, 1 figure. Accepted to CRAE 2026; to appear in SPIE Proceedings. Best Poster Award
Abstract:Domain agents often face noisy business data, uncertain post-training gains, offline/application mismatch, and adapter-release risk. This paper presents CLAP (Closed-Loop Agent Post-training), a closed-loop method that converts business data into structured SFT samples, decision-preference samples, holdout sets, risk diagnostics, and release-gate records. CLAP combines data validation, target/evidence normalization, reward/KL diagnosis, offline gates, and application-chain replay to decide whether an adapter is suitable for the target application chain. On five anonymized manufacturing-scenario batches, QLoRA-style LoRA-SFT yields modest average gains: overall score increases by 0.0098, pass rate by 0.0240, and evidence accuracy by 0.0280, while hallucination and wrong facts decrease. Yet only 3 of 5 batches improve, some batches regress, and GRPO exposes high KL risks. Application-chain replay further shows that RAG is necessary for factual extraction; under the same 3B backbone and 100 replay cases, an application-RAG-oriented LoRA-SFT adapter improves value, core fields, and answer-evidence doc/page matching over base+RAG, but increases latency. These results support managing domain-agent post-training through an integrated data-training-evaluation-release loop rather than relying on training completion or a single offline score.
[AI-64] Mixture-of-Parallelisms: Towards Memory-Efficient Training Stack for Mixture-of-Experts Models
链接: https://arxiv.org/abs/2607.01844
作者: Xuan-Phi Nguyen,Shrey Pandit,Yiran Zhao,Semih Yavuz,Silvio Savarese,Shafiq Joty
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: Work in progress
Abstract:This paper showcases a memory-efficient training stack for Mixture-of-Experts (MoE) models. It is a training paradigm that combines and specializes various existing and novel parallelism techniques at different layers and stages of the Mixture-of-Experts (MoE) model training pipeline. It leverages these techniques to achieve maximal efficiency given the physical constraints of CPU, CPU memory, GPU HBM memory, and the CPU-GPU, GPU-GPU, and node-node communication bandwidth of the GPU cluster. It also contains a novel strategy for the optimizer step to achieve high throughput and memory efficiency, enabling practitioners to conduct lossless pre-training/fine-tuning of trillion-parameter scale models, at a million context length, with just under 12 8x H200 GPU nodes, with state-of-the-art throughput and memory efficiency. In our experiments, MoP delivers 4.7x–8.2x higher per-GPU throughput than a strongly-tuned FSDP2 baseline (with the gap widening at larger scale) and sustains training at context lengths up to 1M tokens, where the baseline runs out of memory beyond 64–128K.
[AI-65] Actual causality in fault trees
链接: https://arxiv.org/abs/2607.01840
作者: Georgiana Caltais,Milan Lopuhaä-Zwakenberg,Mariëlle Stoelinga
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Fault trees are a widely used as effective risk models for complex systems, answering the question “what can go wrong?”, especially through minimal cut set analysis. We study fault trees from the perspective of Halpern Pearl’s theory of actual causality. This allows us to use fault trees to answer the question “why has it gone wrong?”, which is fundamental to failure diagnostics. We give a complete classification of each of the different notions of actual causality in terms of the fault tree’s graph structure and logical structure, and show how minimal cut sets give rise to actual causes.
[AI-66] MMIR-TCM: Memory-Integrated Multimodal Inference and Retrieval for TCM Clinical Decision Support
链接: https://arxiv.org/abs/2607.01814
作者: Lihui Luo,Joongwon Chae,Ziyan Chen,Yang Liu,Siyi Cheng,Weihan Gao,Zelin Zeng,Xiaoming Yin,Samaneh Beheshti Kashi,Dongmei Yu,Lian Zhang,Jing Sui,Zeming Liang,Jiansong Ji,Peter E. Lobie,Peiwu Qin
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Traditional Chinese Medicine (TCM) diagnosis, particularly through tongue inspection, faces persistent challenges in subjectivity and reproducibility. The application of multimodal artificial intelligence to TCM clinical tasks, such as syndrome differentiation and prescription generation, is significantly hampered by the semantic gap between visual tongue features and textual reasoning, as well as the lack of large-scale, standardized datasets. To address these challenges, we introduce MMIR-TCM, a novel framework that emulates the diagnostic process of TCM experts by integrating multimodal large language model(MLLM) with memory-augmented segmentation and retrieval-augmented generation (RAG). Employing a three-stage architecture, MMIR-TCM integrates a training-free Memory-SAM module for robust tongue extraction, a fine-tuned Qwen3-VL model for structured tongue diagnosis generation, and a Qwen3-based RAG component for evidence-grounded clinical decision support generation. The framework was developed and validated using MedTCM, a new large-scale multimodal dataset that we introduce specifically for advanced TCM research. To properly evaluate our framework’s clinical accuracy, which existing metrics fail to capture, we also developed TDEU, a domain-specific evaluation metric incorporating semantic understanding and diagnostic importance. Our comprehensive experiments demonstrate that MMIR-TCM significantly outperforms leading models, including GPT-4o and Gemini 2.5 Flash.
[AI-67] Decoupling Code Complexity from Newcomer Participation: A Causal Study of AI Coding Agent Adoption in OSS
链接: https://arxiv.org/abs/2607.01810
作者: Weiwei Xu,Xuanning Cui,Hengzhi Ye,Minghui Zhou
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Open-source projects depend on a steady inflow of newcomers. A growing concern is that AI coding agents (tools such as Cursor and Claude Code that write code from natural-language instructions) will crowd them out, by absorbing the simple tasks that beginners start with and by making code harder to read. We give this concern a causal answer. Using GitHub code search we identify 1,888 projects that adopted an agent, signaled by their first commit of a configuration file. We apply difference-in-differences against matched non-adopting controls, restricting the main analysis to the 603 adopters with a genuine pre-adoption period. We find no evidence of crowding-out: across estimators newcomer inflow shows no significant decline after adoption (point estimates run from a small increase to, under the most conservative trend specification, a slight and insignificant dip), onboarding and retention are unchanged, and a sparse, correlational beginner-task measure (good-first-issue labels, which we cannot test for parallel trends) shows no decline. The feared mechanism is real but decoupled: adoption raises per-function code complexity (about +11% on a cognitive metric for Python, a quarter of the prior estimate, and +3 to 4% in cyclomatic terms across all languages), yet in fixed-unit subsets where complexity rose (Python on the cognitive metric, and all languages on the cyclomatic metric), newcomer participation does not decline. These results suggest that, in established open-source projects, adopting an AI coding agent makes code modestly more complex but does not crowd out the human newcomers that a project depends on: the feared trade-off between AI assistance and human participation does not materialize.
[AI-68] Expander Sparse Autoencoders: Parameter-Efficient Dictionaries for Mechanistic Interpretability
链接: https://arxiv.org/abs/2607.01799
作者: Rodrigo Mendoza-Smith
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注:
Abstract:Sparse autoencoders (SAEs) decompose internal activations of neural networks into sparse linear combinations of learned features by fitting an overcomplete dictionary \mathbfW\in\mathbbR^m\times n with mn , and inferring a sparse code \mathbfx\in\mathbbR^n from \mathbfh\approx\mathbfW\mathbfx . This inference problem closely resembles the canonical setup of compressed sensing, but dense decoders requires O(mn) learned values, which becomes costly at large feature counts. We introduce Expander SAEs: TopK SAEs whose decoder and tied encoder are supported on a left- d -regular expander mask with d\ll m , learning only dn decoder values while keeping the sparse-coding problem (m,n,k) fixed. The same structure reduces storage and turns the matching-pursuit correlation step \mathbfW^\top \mathbfr in OMP into an O(dn) gather-and-reduce operation. Our experiments show that across Pythia-70M/160M, Qwen2.5-3B, and Llama-3.2-1B residual-stream activations, varying d traces a consistent storage–fidelity frontier, and that at the most compressed modern-LM setting, Qwen2.5-3B with d=7 uses 293\times fewer learned decoder values than the full dense decoder while retaining 84 % of dense CE-loss recovered. Control experiments show that the improved storage–fidelity tradeoff is driven by sparse, diverse decoder support structure rather than by fewer learned decoder values, and that when sparse and dense decoders are compared at matched parameter count, part of the remaining gap comes from encoder amortisation. On the theoretical side, we show that expansion and column flatness are sufficient for identifiability of noiseless k -sparse codes, and we derive complementary sufficient conditions under which OMP recovers the support exactly.
[AI-69] Single-Channel EEG-Based Cognitive Load Assessment in Online Learning: A Hybrid Deep Learning Approach
链接: https://arxiv.org/abs/2607.01795
作者: Rowan Hussein,Mohamed Ouf
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Monitoring cognitive load during online learning could help instructors identify content that learners find difficult, but remote settings remove the visual cues that support this judgement in a classroom. We study whether a single-channel, consumer-grade EEG device (the NeuroSky MindWave Mobile 2) can distinguish easy from difficult educational-video content, using the publicly available dataset of Wang et al. [24] (ten learners, one excluded for excessive noise, leaving nine). We implement a hybrid CNN+LSTM+Attention model that combines the raw waveform with band-power features. In a within-subject setting, the model reaches up to 78.5% accuracy, compared with 55% for conventional feature-based classifiers; regularization (dropout and L2) closes the large gap between training and validation accuracy that we observe without it, keeping validation accuracy stable at roughly 68-73%. We are deliberately cautious about these numbers: with only nine subjects, within-subject evaluation is optimistic, and we argue that subject-independent evaluation – in which no learner appears in both training and test data – should be the standard for this task. To that end we release a reproducible evaluation pipeline. We frame the work as a feasibility study rather than a deployable system, and pair it with an open, notebook-based tool that records EEG, runs inference, and visualizes estimated cognitive load as a heatmap over the video timeline to help educators locate potentially challenging segments.
[AI-70] Lightweight Safe Reinforcement Learning for End-to-End UAV Navigation
链接: https://arxiv.org/abs/2607.01794
作者: Shenghui Zhang,YuXuan Gao,Songwei Zhao,Jifeng Hu,Zijing Zhang,Hechang Chen
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:With the rapid development of autonomous aerial systems, Unmanned Aerial Vehicles (UAVs) are increasingly deployed in applications such as inspection, environmental monitoring, and rescue, creating growing demand for reliable autonomous navigation. However, autonomous UAV navigation in dense environments remains challenging under sparse perception and dynamic constraints. Most reinforcement learning (RL) methods lack explicit safety mechanisms, leading to unsafe exploration, unstable training, and risky behaviors, especially during high-speed flight. Even in safe RL approaches, safety is often enforced by projecting policy outputs onto a safe action set, which may introduce instability. Meanwhile, many learning-based methods rely on dense inputs or large networks, increasing computational burden and limiting lightweight onboard deployment. Facing the above challenges, we propose a safety-constrained perception-control integrated framework for UAV navigation. A lightweight network encodes sparse observations into collision-risk-aware features using asymmetric and depthwise separable convolutions. We formulate the task as a constrained Markov decision process within a hierarchical control architecture and solve it using a Lagrangian-based safe PPO algorithm. Curriculum learning further improves training stability. Experiments with varying obstacle densities and flight speeds demonstrate higher success rates, improved safety, and better efficiency than existing reinforcement learning baselines.
[AI-71] Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification
链接: https://arxiv.org/abs/2607.01793
作者: Yunhao Feng,Ruixiao Lin,Ming Wen,Qinqin He,Yanming Guo,Yifan Ding,Yutao Wu,Jialuo Chen,Yunhao Chen,Xiaohu Du,Jianan Ma,Zixing Chen,Zhuoer Xu,Xingjun Ma,Xinhao Deng
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:LLM agents increasingly perform autonomous actions through external tools, leading to complex and evolving safety risks. However, existing safety testing targets expert-designed safety violations, and the corresponding outcomes are evaluated by hard-coded rules, making them costly to extend as agents evolve. To this end, we present Vera, an end-to-end automated safety testing framework that instantiates software engineering testing principles for non-deterministic agents through a three-stage, self-reinforcing pipeline. First, a literature-driven exploration continuously discovers and structures emerging risks into taxonomies of safety risks, attack methods, and tool execution environments. Second, combinatorial composition across taxonomy dimensions produces executable safety cases, each specifying a concrete safety goal, a programmatically constructed initial state, and a deterministic verification predicate grounded in observable artifacts. Third, adaptive execution runs heterogeneous agents in isolated sandboxes where a control agent steers multi-turn interaction based on runtime observations, while evidence-grounded verifiers judge outcomes from environment state and tool-call evidence rather than model self-report. We evaluate Vera on four production agent frameworks (OpenClaw, Hermes, Codex, Claude Code), revealing substantial safety weaknesses, with average attack success rates reaching 93.9% under multi-channel attacks; we also release Vera-Bench, comprising 1600 executable safety cases spanning 124 risk categories across three execution settings. These results indicate that modular, executable testing infrastructure is essential for rigorous and maintainable safety evaluation of rapidly evolving agentic systems at scale. The code is publicly available at this https URL.
[AI-72] EPnG: Adaptive Expert Prune-and-Grow for Parameter-Efficient MoE Fine-tuning
链接: https://arxiv.org/abs/2607.01789
作者: Ahin Lee,Sehyun Yun,Taesik Gong
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages. Accepted at MobiSys Workshop '26
Abstract:Mixture-of-Experts (MoE) models scale efficiently but remain costly to adapt due to redundant experts and uniform parameter allocation. Existing parameter-efficient fine-tuning (PEFT) methods such as LoRA ignore MoE routing dynamics, leading to suboptimal resource use. We propose EPnG, an adaptive prune-and-grow framework that reallocates LoRA capacity based on expert importance derived from router gate probabilities. EPnG prunes under-utilized experts and expands high-importance experts via rank growth with orthogonal initialization, while maintaining a fixed parameter budget. Across OLMoE and Qwen1.5-MoE, EPnG consistently outperforms LoRA under the same budget and achieves performance comparable to full fine-tuning while updating only 0.55%-0.72% of parameters (up to 140x-180x fewer). These results demonstrate that aligning PEFT with MoE routing yields a more effective and scalable fine-tuning strategy.
[AI-73] AI Virtue: What is “Good” Knowledge in the Age of Artificial Intelligence?
链接: https://arxiv.org/abs/2607.01776
作者: Alan Liu
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 21 pages, 5 figures
Abstract:In the age of AI, what will be good knowledge? This article, which is accepted and forthcoming in a special issue of Modern Fiction Studies on “Cultural AI” in 2027, applies digital humanities methods to map epistemic virtues (like “true,” “accurate,” “creative”) used in a corpus of 553 journal articles on AI published in 2024. “Creativity” comes in for special attention as an example. Exploring this discourse of value, the article considers how a framework might be developed for evaluating the knowledge-worth of AI – one less locked into values formed around pre-AI “knowledge work” agents or structures, and more open to the future values of “generativity.” The essay is supported by an online digital kit for exploring data models of the corpus of articles on AI it studies.
[AI-74] Verifiable Knowledge Expansion through Retrieval-Grounded Formal Concept Analysis KDD
链接: https://arxiv.org/abs/2607.01773
作者: Yujin Yang,Heejung Lee
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 2 figures, Accepted to the 8th epiDAMIK ACM SIGKDD International Workshop on Epidemiology meets Data Mining and Knowledge Discovery (epiDAMIK 2026)
Abstract:Ontology construction requires deciding which objects, attributes, and structural relations should be accepted as valid knowledge. Language models can propose such structures from text, but their outputs can still be unsupported or inconsistent. This paper proposes a retrieval-augmented small language model (SLM) framework that uses formal concept analysis (FCA) as a symbolic verification loop for knowledge expansion. Starting from seed attributes, FCA proposes implications over a growing formal context. A retrieval-grounded SLM oracle then validates each implication or returns a counterexample. The oracle also supports incidence judgments, consistency checks, and attribute proposals, making accepted implications, counterexamples, contradictions, and corrections inspectable. In a rare ataxia setting constructed from Orphadata resources, retrieval-grounded 10-seed runs obtain relation F1 of 0.29-0.52 and closure-based implication F1 of 0.22-0.30. Larger seed sets increase the number of evaluated implications and often improve implication F1. The lower implication scores reflect a stricter evaluation of derived implications, where one missed or extra relation can affect several implication judgments. Ablations show that incidence judgments in a fixed object-attribute setting can improve closure-based implication scores. However, identifying positive object-attribute pairs remains difficult even when the candidate objects and attributes are fixed.
[AI-75] Repair the Amplifier Not the Symptom: Stable World-Model Correction for Agent Rollouts
链接: https://arxiv.org/abs/2607.01767
作者: Xinyuan Song,Zekun Cai
类目: Artificial Intelligence (cs.AI)
备注: Under Review
Abstract:As agent planning moves from short tool chains toward persistent workflows with thousands or tens of thousands of steps, failures will occur inside large planning graphs rather than in isolated predictions. Replanning the entire graph after every mistake is neither computationally realistic nor desirable: full-graph replay consumes large context budgets, exposes the LLM to many irrelevant symptoms, and can degrade long-context retrieval. This paper studies the missing component in such systems: a world-model corrector that repairs the failed planning graph in place. We compare two families of correctors. The first is the common engineering approach: scan nodes and edges, choose a suspicious local region, and ask an LLM to repair it. We implement strong engineering LLM correctors and find that they can help, especially when given very large contexts. The second family is our approach, WM-SAR (World-Model Subgraph Amplification Repair): instead of scanning for visible symptoms, it works backward from subgraph amplification, identifies the nodes and edges that keep re-amplifying error, and sends only that causal subgraph to the LLM. Across graph simulations and LLM repair experiments, WM-SAR substantially outperforms engineering correctors under realistic token budgets, achieves near-whole-graph stabilization with a compact region, and gives the LLM a cleaner repair target.
[AI-76] SimWorlds: A Multi-Agent System for Dynamic 3D Scene Creation
链接: https://arxiv.org/abs/2607.01766
作者: Chunjiang Liu,Xiaoyuan Wang,Haoyu Chen,Yizhou Zhao,Ming-Hsuan Yang,László A. Jeni
类目: Artificial Intelligence (cs.AI)
备注: 20 pages, 3 figures. Project page: this https URL
Abstract:LLM agents are increasingly used to translate natural language into 3D scenes in a procedural way, but existing systems focus on static output. Dynamic 4D scenes from text alone, in which liquids flow, particles emit, rigid bodies cascade, and articulated mechanisms move, remain largely unexplored despite their value as editable content and as physics-grounded training data for video generation and embodied AI. Two challenges set the dynamic case apart from static text-to-scene work: an agent must jointly coordinate spatial layout, multiple physics solvers, temporal sequencing, camera, and lighting in a single coherent scene, and verifying motion correctness from rendered video is fundamentally harder than judging a single image. We present SimWorlds: a multi-agent framework that produces dynamic, editable 4D scenes from text, with Blender-specific procedural knowledge, a planner-coder-reviewer workflow driving a fixed ordered sequence of construction stages, a layered scene protocol enforced by a deterministic verifier, and a runtime-state inspection tool suite that catches mechanism failures the rendered image cannot reveal. We also introduce 4DBuildBench, a benchmark for assessing both visual fidelity and physical consistency of the procedural dynamic 3D scenes generated from text prompts. Experiments show that SimWorlds outperforms prior dynamic Blender generation baselines.
[AI-77] Mastermind: Strategy-grounded Learning for Repository-Scale Vulnerability Reproduction
链接: https://arxiv.org/abs/2607.01764
作者: Mingzhe Du,Luu Anh Tuan,Tianyi Wu,Renyang Liu,Zhijiang Guo,Dong Huang,See-Kiong Ng
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Repository-level vulnerability reproduction is a demanding software engineering (SE) task: an agent must inspect a codebase, infer the input grammar that reaches a vulnerable path, construct a proof-of-conceptv(PoC), and verify that the crash disappears on the patched build. Recent LLM agents can often execute these steps when the approach is correct, yet they still fail by choosing the wrong strategy. This paper argues that strategy, rather than the full action trajectory, is the right learning unit for such SE agents: it is compact enough to optimize, concrete enough to guide execution, and stable enough to store and reuse across attempts. We present Mastermind, a dual-loop framework that separates transferable strategy learning from task-specific experience. A trainable planner learns reusable vulnerability-reproduction strategies through SFT and milestone-based GRPO, while an experience loop maintains task-local strategy records that guide subsequent attempts. The planner is trained independently of the executor, allowing strategy learning to improve multiple frozen executors without modifying their action-generation capability. We evaluate Mastermind on CyberGym using 260 training tasks and 200 held-out evaluation tasks. With GPT-5.5 as the frozen executor, Mastermind achieves an 84.5% pass rate, outperforming open-book PoC context (60.0%), Best-of-8 sampling (63.0%), and iterative improvement (77.0%). The same planner also improves GPT-5.4 mini and GLM~5.1 from 45.0% and 58.5% to 60.0% and 71.0%. These results demonstrate that learning high-level strategies is an effective and transferable mechanism for improving repository-scale SE agents.
[AI-78] Path-level Hindsight Instructions for Semantic Exploration in Vision-Language Navigation ECCV2026
链接: https://arxiv.org/abs/2607.01754
作者: Sung June Kim,Sangpil Kim,Honglak Lee
类目: Artificial Intelligence (cs.AI)
备注: Accepted to ECCV 2026
Abstract:On-policy exploration is a crucial component for training robust Vision-Language Navigation agents, as it exposes the policy to a broader state distribution. However, such exploration inevitably leads to trajectories that deviate from expert demonstrations, resulting in a semantic mismatch between the executed visual stream and the original language instruction. In this work, we address this challenge by introducing Phi-Nav, a unified on-policy framework that leverages hindsight reasoning to align instructions with the agent’s actual exploratory journey. Specifically, Phi-Nav operates through a three-stage dual-supervision cycle: 1) the agent performs oracle-guided on-policy exploration, sampling a trajectory while learning from expert action feedback, 2) a hindsight speaker synthesizes a path-level hindsight instruction grounded in the collected visual observations, and 3) the agent conducts a second imitation pass, treating the synthesized trajectory-instruction pair as an additional expert demonstration. Through this process, Phi-Nav bridges the critical semantic supervision gap inherent in on-policy methods, transforming semantically unlabeled movement into dense training signals. Evaluations on the R2R-CE and RxR-CE benchmarks show that Phi-Nav yields competitive performance while requiring only a fraction of the expert demonstrations used by current baselines. These results underscore the necessity of semantic exploration in VLN, positioning Phi-Nav as an effective solution for training embodied agents with limited data.
[AI-79] Meta-Benchmarks for Financial-Services LLM Evaluation
链接: https://arxiv.org/abs/2607.01740
作者: Blair Hudson
类目: Artificial Intelligence (cs.AI)
备注: 27 pages, 13 figures, 3 tables
Abstract:Public LLM leaderboards optimise for global average performance and do not capture the specific cognitive demands of financial-services work: a model that leads on MMLU-Pro may underperform on document-grounded compliance reasoning, and a coding leader may handle multi-turn customer interactions poorly. We present a meta-benchmarking framework that organises 452 publicly reported benchmarks into 41 O*NET Generalized Work Activities and aggregates those into 38 BIAN banking business domains spanning sales, operations, risk, and support work. A multiplicative weighting scheme (discrimination x coverage x recency), computed over a rolling model window, rewards benchmarks that still separate the best models, are widely reported, and remain in active use, suppressing saturated legacy tests automatically. These weights scale the K-factor in a pairwise Elo tournament, producing cross-benchmark-comparable work-activity scores without raw score normalisation; business-domain scores are weighted averages of the constituent work-activity Elos. We demonstrate the framework on a point-in-time public snapshot covering 288 models across 25 organisations as of June 2026, and describe the methodology, full taxonomy, design decisions, and limitations with the aim of making the approach reproducible for institutions facing similar selection and governance challenges.
[AI-80] Predicting Closed-Loop Performance of Latent World Models: Offline Checkpoint Selection for MPC and Model-Based RL Under Non-Markovian Rewards in LunarLander
链接: https://arxiv.org/abs/2607.01736
作者: Nikolai Smolyanskiy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: Preprint, 19 pages (16 main text + 3 pages appendix), 7 figures, 4 tables. Video: this https URL , Code: this https URL
Abstract:We study how to predict the downstream closed-loop performance of a learned latent world model from validation-time diagnostics alone. Choosing the right checkpoint from a world-model training run is difficult: validation loss and multi-step prediction RMSE keep improving long after closed-loop performance has collapsed. We present a suite of structural validation-time diagnostics drawn from optimal-control theory and apply them to Gymnasium’s LunarLander v3, which features shaped rewards. We train an RSSM [5, 4] world model on it and treat per checkpoint CEM-MPC return as the oracle for closed-loop quality. By evaluating 40 metrics against this oracle, we find that the strongest single predictor is the Reward Observability Fraction (ROF), which measures the reward predictor’s dependence on the observable subspace. We combine ROF with three structural regularizers into a single-number offline checkpoint selection score, the Composite Reward Observability Fraction (CROF). The CROF-selected world model trains a model-based A2C policy that beats a fairly evaluated model-free A2C baseline by ~24.5 return points while using ~65x fewer real-environment interactions, and the same world model also drives a strong zero-shot CEM-MPC policy. Code and data: this https URL.
[AI-81] Reformalization of the Jordan Curve Theorem
链接: https://arxiv.org/abs/2607.01734
作者: Simon Guilloud,Sankalp Gambhir,Samuel Chassot
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We present a case study in reformalization, a variant of autoformalization in which the input proof is not natural language but a formal development in a different proof assistant. Concretely, we report three reformalizations of the Jordan Curve Theorem: from Mizar to Lean, from HOL Light to Lean, and from HOL Light to Agda. We analyse the results and identify pipeline design choices that matter for practical reformalization tasks.
[AI-82] DRL-CLBA: A Clean Label Backdoor Attack for Speech Classification via DDPG Reinforcement Learning
链接: https://arxiv.org/abs/2607.01729
作者: Yueming Huang,Wenhan Yao,Fen Xiao,Xiarun Chen,Weiping Wen
类目: Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:
Abstract:Deep learning models for speech classification are vulnerable to backdoor attacks, where malicious triggers cause misclassification at inference time. While sample-specific attacks can bypass many defenses, they often rely on poisoned label attack, making them detectable via manual data defense. In this paper, we propose DRL-CLBA, a novel clean label backdoor attack for speech classification that leverages Deep Deterministic Policy Gradient (DDPG) reinforcement learning. We also utilize deep audio steganography to embed sample-specific triggers into source audio, creating feature-space anchors. The proposed reinforcement learning framework effectively optimizes target samples toward trigger-bearing anchor points in the model’s deep latent space, enabling label-migration-free poisoning of target samples. Experimental results across three datasets and four different DNNs demonstrate that DRL-CLBA achieves a high attack success rate, effectively bypassing some backdoor defenses. The attack demonstrates strong resistance against fine-tuning, pruning, and spectral signature defenses, exposing critical vulnerabilities in speech-controlled systems.
[AI-83] Distributionally Robust Listwise Preference Optimization
链接: https://arxiv.org/abs/2607.01715
作者: Xudong Wu,Jian Qian,Pangpang Liu,Vaneet Aggarwal,Jiayu Chen
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Existing robust preference optimization for language-model alignment mainly studies pairwise supervision and places robustness at the dataset, prompt, or preference-pair level. We instead study listwise preference optimization under ranking-label uncertainty: given a prompt and a candidate list, the observed ranking over that list may be ambiguous due to annotator inconsistency, near-ties, lossy rankwise feedback, or reward-model noise. We propose a pointwise total-variation robust Plackett–Luce objective that directly robustifies the ranking label conditional on the candidate list. The robust loss admits an exact decomposition into the nominal PL loss plus a worst-case PL correction, and the worst-case ranking is obtained by sorting current implicit scores in ascending order, reducing the inner maximization from K! enumeration to O(K\log K) . This tractable structure yields strong offline and online optimization guarantees. In the offline fixed-list setting, the robust objective is convex and projected stochastic subgradient reaches global \epsilon -suboptimality with O(\epsilon^-2) sample complexity. In the online policy-induced setting, where candidate lists are generated by the current policy, we establish weak convexity and \widetilde O(\epsilon^-2) Moreau-envelope stationarity. Experiments in offline LLM alignment show that the proposed robust correction largely preserves performance under clean labels and improves robustness under noise. In online alignment, it makes reward-model-ranked candidate expansion more reliable and improves both reward-model and external GPT-4 judge metrics.
[AI-84] Generic Expert Coverag e for Pruning SparseMixture-of-Experts Language Models
链接: https://arxiv.org/abs/2607.01710
作者: Yongqin Zeng,Sicheng Pan,Jiale Wang,Hai-tao Zheng,Hong-Gee Kim,Chunxia Ma,XiuTeng Zhou
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Sparsely activated Mixture-of-Experts (MoE) language models contain substantial structured redundancy among routed experts, but pruning them without downstream calibration data remains challenging. Existing expert-pruning methods typically rely on a single aggregated importance score, which can bias the retained set toward experts favored by dominant calibration patterns. We propose \textbfGeneric TB-Coverage, a coverage-aware expert pruning method that uses only generic text corpora (WikiText2 and C4) for calibration. Instead of collapsing expert utility into one score, our method profiles per-expert utility separately on each corpus and enforces a fixed-budget coverage rule that preserves high-utility experts from each corpus before constructing the final pruning mask. Across Qwen1.5-MoE-A2.7B and DeepSeek-MoE-16B-Base at 25%, 50%, and 75% retention budgets, our method improves average accuracy on six common zero-shot benchmarks over random pruning, REAP, and ExpertSparsity, while also reducing perplexity degradation on WikiText2 and C4. The gains are largest under aggressive pruning (25% and 50% retain), suggesting that preserving cross-corpus expert coverage is an effective generic-data prior for MoE pruning. Our improvements hold with fixed pruning budgets and no downstream calibration data.
[AI-85] COMFYCLAW: Self-Evolving Skill Harnesses for Image Generation Workflows
链接: https://arxiv.org/abs/2607.01709
作者: Zongxia Li,Dawei Liu,Fuxiao Liu,Yuhang Zhou,Xiyang Wu,Jingxi Chen,Jing Xie,Xiaomin Wu,Lichao Sun
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Agents are increasingly used to construct workflows and assist humans in completing recurring tasks more efficiently. As these workflows become repeated and domain-specific, agent memory and reusable skills become increasingly important: agents should be able to recall workflow patterns, execution constraints, and user preferences from previous runs. We study this problem in workflow-based image generation and introduce COMFYCLAW, an agentic skill evolution harness for controlling ComfyUI workflows. COMFYCLAW formulates workflow construction as typed graph editing, exposes tools organized by construction stage, automatically reverts invalid edits, and uses a region-level vision-language model (VLM) verifier to translate visual failures into actionable repair suggestions. The framework further evolves a progressively disclosed skill library, where trajectories, execution errors, and verifier feedback from previous runs are distilled into reusable Agent Skills. Across four benchmark splits, three agent models, and two image backbones, COMFYCLAW achieves the best average image-generation evaluation score across all six agent configurations, outperforming a verifier-only baseline without skill evolution. Human annotations further show that annotators prefer COMFYCLAW over variants without skill evolution. Our results suggest that skill evolution is an effective mechanism for improving agent reliability and performance in recurring visual workflow construction.
[AI-86] Pmeta-TLA: Backdoor Attacks for Speech Classification Models via Meta-Learning with Timbre Leakage Attack
链接: https://arxiv.org/abs/2607.01702
作者: Yueming Huang,Wenhan Yao,Fen Xiao,Xiarun Chen,Weiping Wen
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:
Abstract:Recently, speech classification methods have gained widespread adoption in intelligent gadgets. Current study indicates that backdoor attacks provide a substantial security concern to these models, underscoring the pressing necessity to investigate additional potential attack techniques to expose and prevent such risks. This work discusses the vulnerability of current speech triggers to detection by deep neural network defenders and introduces the Timbre Leakage Attack (TLA). The suggested trigger disseminates timbre information at the frame level within the deep self-supervised features, producing poisoned samples that appear natural to human perception. Furthermore, we introduce Pmeta-TLA, an innovative training mechanism for embedding numerous backdoors one time. This method proposes a multi-backdoor injection training strategy using meta-learning and Projected Conflicting Gradients (PCGrad) and introduces TLA as a multi-target attack tool within it. We performed tests on data-poisoning backdoor attacks in keyword spotting tasks utilizing some deep neural network models. Experimental results indicate that the proposed strategy attains superior Attack efficacy, enhanced stealthiness, robustness, and a reduced attack cost relative to baseline methods.
[AI-87] Model Merging as Probabilistic Inference in Fine-Tuning Parameter Space UAI
链接: https://arxiv.org/abs/2607.01689
作者: Long Minh Bui,Tuan Anh Le Van,Tung Phi Duc,Phi Le Nguyen,Jana Doppa,Trong Nghia Hoang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted for Publication at the 42nd Conference on Uncertainty in Artificial Intelligence (UAI), 2026
Abstract:Model merging aims to combine existing single-task solutions into a multi-task solution without additional data-driven fine-tuning.~Most existing approaches achieve this using geometric properties of local solution spaces. However, such geometric views provide limited guidance for scoring how statistically useful each task-specific update direction is across tasks during merging. To address this, we formulate model merging from a new perspective of probabilistic inference under a product-of-experts (PoE) scenario where each single-task solution defines an energy-based expert model (EBM) over the merged parameters. We show that several existing model merging methods arise as special cases of our framework under energy designs that impose implicit Gaussian assumptions on directional residuals between merged and task-specific models. Empirically, we find that these residuals are often heavy-tailed which exposes a mismatch with the imposed light-tailed Gaussian structures. We address this with a heavy-tailed PoE design based on Cauchy experts, which better captures the observed residual behavior while admitting a provably convergent inference procedure. Experiments across multiple tasks and architectures show significant improvements over state-of-the-arts baselines. Our code is available at this https URL.
[AI-88] Beyond Gradient-Based Attacks: Adversarial Robustness and Explainability Stability in Cybersecurity Classifiers
链接: https://arxiv.org/abs/2607.01679
作者: Mona Rajhans,Vishal Khawarey
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Adversarial attacks on cybersecurity classifiers pose a dual threat: degrading predictions and destabilising the SHAP-based explanations that security analysts rely on to understand and triage alerts. We extend our prior MLP conference study to Random Forest and XGBoost across four tabular security datasets (phishing URLs, UNSW-NB15, NF-ToN-IoT, HIKARI-2021), evaluating five attacks including three black-box methods applicable to non-differentiable tree models. We introduce the Explainability Stability Index (ESI), a scalar metric computed from TreeSHAP attribution drift under adversarial perturbation, reported on the same [0,1] scale as the Robustness Index (RI). A key finding is that gradient-based black-box attacks (ZOO) produce degenerate results against XGBoost (apparent RI ~0.98) due to piecewise-constant prediction surfaces, while score-based Square Attack reveals genuine vulnerability (RI ~0.36). These degenerate perturbations still drive substantial attribution drift: XGBoost ESI ~0.06-0.16 despite near-perfect ZOO robustness, versus 0.14-0.29 for RF, showing that prediction robustness and explanation stability are distinct axes requiring joint measurement. A two-axis framework (gradient dependence, query efficiency) explains the observed attack ranking and yields practical guidance for tree ensemble evaluation. A step-size ablation explains a counterintuitive PGD anomaly on z-score normalised tabular data.
[AI-89] Separating Expert Retention from Autonomous Source Inference in Raw-ECG-Replay-Free Continual ECG Deployment
链接: https://arxiv.org/abs/2607.01674
作者: Yufan Lu,Xinhui Liu,Chenyang Xu,Yuxi Zhou,Hao Wang,Shenda Hong
类目: Artificial Intelligence (cs.AI)
备注: Submitted toBIBM2026
Abstract:In multi-source ECG deployment, models may need to incorporate new data sources when earlier raw ECGs cannot be retained or replayed. Freezing a pretrained backbone and assigning each source an isolated classifier prevents parameter interference, but deployment still requires selecting an expert when source metadata are unavailable. We study this distinction through \ours, an incremental expert bank built on frozen 1024-dimensional ECGFounder features. Each arriving domain adds a balanced-softmax linear expert, while a lightweight router is fitted only on retained training features and domain labels from sources observed so far. A validation-calibrated margin rule fuses the two most likely experts instead of committing to a single routed expert. On CPSC, PTB-XL, Georgia, and Chapman-Shaoxing, source-aware expert selection reaches 0.7915\pm0.0036 Macro-F1 and a matched offline independent-head reference reaches 0.7885\pm0.0009 , supporting strong source-aware expert retention. Without source IDs, an MLP router reaches 0.7756\pm0.0027 and top-2 margin fusion reaches 0.7782\pm0.0022 . The top-2 gain over hard MLP routing is small ( +0.0026 ), with a 95% confidence interval from paired bootstrap that includes zero. Across three domain orders, the top-2-to-oracle gap remains 0.0111 – 0.0133 , identifying autonomous source inference as the main remaining bottleneck. No raw ECGs are replayed, but frozen training features are retained for router updates; the method is therefore not memory-free. Comments: Submitted toBIBM2026 Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2607.01674 [cs.AI] (or arXiv:2607.01674v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2607.01674 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-90] Diverse Evidence Better Forecasts: Multi-Agent Deliberation Under Information Asymmetry
链接: https://arxiv.org/abs/2607.01661
作者: Yuante Li,Yicheng Tao,Kate Zhang,Taozhi Wang,Gefei Gu,Yaxin Zhou
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-agent systems are increasingly used for forecasting future events, as deliberation among multiple LLMs is believed to improve reasoning and calibration. Yet existing approaches overlook a critical design choice: what information each agent receives. When all agents are given identical evidence, deliberation collapses into herding rather than genuine belief revision, leaving multi-agent systems little better than a single agent. We identify this as a fundamental gap and propose designed information asymmetry to close it: by partitioning evidence into shared public and disjoint private subsets, each agent holds exclusive knowledge that can only reach others through deliberation. We theoretically show that this decomposition reduces inter-agent error correlation, and instantiate it in InfoDelphi, a framework combining relevance-aware evidence routing, rationale-based iterative deliberation, and confidence-weighted aggregation. On PolyGym, a benchmark of 375 binary forecasting questions derived from real-world prediction markets, InfoDelphi outperforms the strongest single-agent and multi-agent baselines by 12–18% in Brier score and 4–8 percentage points in accuracy. More detailed experiments confirm that removing information asymmetry eliminates most deliberation gains, establishing diversity of input as the key enabler of effective multi-agent reasoning.
[AI-91] Autonomous discovery of traffic laws with AI traffic scientists
链接: https://arxiv.org/abs/2607.01639
作者: Xingyuan Dai,Yue Liu,Xiaoyan Gong,Qinghai Miao,Junyou Shang,Yutong Wang,Chao Guo,Yonglin Tian,Yizhang Chai,Chao Xiang,Yisheng Lv,Fei-Yue Wang
类目: Artificial Intelligence (cs.AI)
备注: 19 pages, 6 figures
Abstract:Universal traffic laws describe recurrent patterns in congestion, mobility and driving behavior across cities, providing a scientific basis for transportation planning, management and control. Their discovery, however, remains expert-driven, requiring candidate regularities to be identified from heterogeneous observational evidence or validated through intervention experiments. Although autonomous artificial intelligence (AI) systems have advanced scientific discovery in controlled laboratory settings, extending them to complex transportation domains remains a challenge. Here we present TrafficSci, an agentic AI system that formulates traffic-law discovery as an iterative, auditable workflow integrating evidence scoping, critic-judge hypothesis induction, and observational-interventional validation. Across four case studies spanning population, network, control and trajectory scales, TrafficSci autonomously rediscovers three established traffic laws and identifies an unreported intrinsic temporal memory scale in urban driving behavior, statistically consistent across eight cities and two trajectory datasets. TrafficSci provides a route for extending AI-driven scientific discovery from controlled domains to complex urban systems.
[AI-92] MKGR: Multimodal Knowledge-Graph Representation Learning for Cold-Start Protein-Protein Interaction Prediction
链接: https://arxiv.org/abs/2607.01627
作者: Wenbo Zhang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate protein-protein interaction (PPI) prediction is central to functional genomics, disease mechanism discovery, and drug development. A difficult setting arises when candidate interactions include proteins that have no observed PPI edges during training, where models relying on network topology alone often lose useful context. This paper presents \method, a multimodal representation framework for cold-start PPI prediction. \method\ combines region-aware protein sequence encoding with four protein-centered biomedical knowledge graphs, including protein-drug, protein-disease, protein-miRNA, and protein-lncRNA associations. The sequence branch extracts contextual representations from structurally informed sequence regions, while graph attention encoders learn modality-specific protein embeddings from sparse biomedical associations. A bridge reconstruction objective regularizes graph learning by recovering shared protein-entity associations, and a pair-level gating module adaptively integrates sequence and graph evidence for each candidate protein pair. Experiments on two benchmark datasets under novel-old and novel-novel cold-start settings show that \method\ consistently outperforms competitive sequence, network, and knowledge-graph baselines across ACC, F1, AUC, AUPR, and MCC.
[AI-93] Spatial Support Matters: Geometry-Aware Graph Fusion for Rainfall Field Reconstruction WACV2027
链接: https://arxiv.org/abs/2607.01621
作者: Low Jun Yu,Niramay Kachhadiya,Herath Mudiyanselage Viraj Vidura Herath,Sanka Rasnayaka,Lucy Amanda Marshall
类目: Artificial Intelligence (cs.AI)
备注: Submitted to WACV 2027, applications track
Abstract:Fine-scale rainfall reconstruction is critical for urban flood modeling, but real rainfall sensing systems observe the field through incompatible spatial supports: gauges measure points, microwave links measure paths, and radar/satellite products measure gridded areas. These differences in measurement support impose geometrically distinct constraints on the rainfall field, yet existing heterogeneous graph approaches reconcile such sources in feature space, giving each its own embedding while discarding the geometry of its support. We propose a geometry-aware multi-support heterogeneous graph neural network that represents each observation according to its support type (0D point, 1D line, or 2D grid) as a distinct node layer, and fuses them through cross-support message passing into a point-support prediction layer from which the field is reconstructed. An inductive masked-node formulation decouples prediction resolution from sensing resolution, allowing the same trained model to reconstruct the field at user-defined target locations or display grids. On Singapore data, the proposed method reduces RMSE by 23.2% over the classical interpolation baseline, inverse-distance weighting, and consistently outperforms other neural architectures such as convolutional fusion and support-agnostic heterogeneous graph baselines. A generalization study using data from Sydney, Australia lets us characterize when multi-support fusion helps: the available skill appears to depend on gauge spacing relative to the spatial correlation length of the field, so fusion delivers the largest gains where the field is under-sampled relative to its correlation length and little when it is already resolved. Code and models will be open-sourced upon paper acceptance.
[AI-94] Scaling with Confidence: Calibrating Confidence of LLM s for Adaptive Test Time Scaling
链接: https://arxiv.org/abs/2607.01612
作者: Xuqing Yang,Yi Yuan,Shanzhe Lei,Xuhong Wang
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Training large language models (LLMs) with reinforcement learning (RL) has significantly advanced their performance on reasoning and question-answering tasks. However, prevailing RL reward designs typically prioritize response correctness, neglecting to incentivize models to express their confidence accurately. This leads to a critical problem: performance gains are often accompanied by poor calibration between confidence and accuracy, misleading models to overconfidently hallucinate when uncertain. To address this limitation, we propose \textbfC orrectness and \textbfC onfidence \textbfC alibration \textbfR einforcement \textbfL earning ( \textbfC3RL ), a novel RL algorithm integrating correctness, calibration and dataset-informed reference accuracy rewards together. Comprehensive evaluation across 8 text and multimodal datasets demonstrates that C3RL enhances calibration without sacrificing accuracy, outperforming the current state-of-the-art method in both performance and calibration metrics. Utilizing the well-calibrated verbalized confidence from C3RL, we further introduce \textbfC onfidence-based \textbfA daptive Test Time \textbfS caling ( \textbfCAS ), an adjustable inference-time strategy that allocates computational resources based on response confidence. Experiments show that CAS surpasses majority voting on both in-domain and out-of-domain datasets while reducing the inference budget by up to 12.33 times. We believe the synergy of C3RL and CAS paves the way for deploying more reliable and resource-efficient LLMs. The code, data and models will be released.
[AI-95] Profit-Based Counterfactual Explanations for Product Improvement: A Case Study of Manga Sales in Japan
链接: https://arxiv.org/abs/2607.01610
作者: Keita Kinjo,Takeshi Ebina
类目: Artificial Intelligence (cs.AI)
备注: 8 pages
Abstract:Counterfactual explanation (CE) is widely used to enhance the interpretability of machine learning models and support data-driven decision-making based on model predictions. However, existing CE methods typically require two exogenously specified inputs: a desired output value (target) and a distance function that quantifies changes in explanatory variables. In regression settings, neither the validity of target specification nor the practical interpretation of the distance metric has been sufficiently addressed. Furthermore, most existing CE methods focus on altering predictions rather than optimizing a decision objective, even though real-world decision-making often requires explicit objective maximization. To address these limitations, we formulate CE as a profit maximization problem in management and marketing contexts and propose a framework termed profit-based counterfactual explanation (PBCE). PBCE eliminates the need for exogenous target specification by directly maximizing profit as the primary optimization objective. Concurrently, the distance term is reinterpreted as the cost of modifying product attributes, providing a clear and economically grounded interpretation.
[AI-96] SemHash-LLM : A Multi-Granularity Semantic Hashing Framework for Document Deduplication
链接: https://arxiv.org/abs/2607.01601
作者: Xinyi Fang,Kejian Tong,Jiabei Liu,Tao Ning,Yuhang He
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large scale document deduplication must preserve semantic equivalence while remaining efficient over massive corpora. We present SemHash LLM, a multi granularity framework that unifies semantic projection hashing, attention weighted MinHash, contrastive boundary learning, and selective LLM based adjudication. The method combines character, token, and document level signals through gated fusion, then applies a cascaded filtering pipeline for efficient candidate reduction. Semantic projection hashing learns compact binary codes in distilled LLM embedding space, while attention weighted Min- Hash suppresses boilerplate and emphasizes informative content. Adaptive decision boundaries and uncertainty estimation further improve robustness across template pollution, short text perturbation, containment, and viral fragments. Experiments show that SemHash LLM achieves strong duplicate detection quality with less than one percent neural verification cost.
[AI-97] Hawk: Harnessing Hardware-Aware Knowledge for High-Performance NPU Kernel Generation
链接: https://arxiv.org/abs/2607.01590
作者: Junyi Wen,Ruiyan Zhuang,Yongjia Xu,Pengtu Li,Rui Zou,Hongyi Chen,Chingman Wan,Puxu Yang,Wuhui Chen,Yanlin Wang
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:Developing high-performance kernels for Neural Processing Units (NPUs) is a critical industry bottleneck, requiring developers to manually navigate implicit hardware constraints and strict memory hierarchies. While large language models offer immense automation potential, they fail catastrophically on NPUs due to a fundamental lack of hardware-specific priors. Naively transplanting code snippets from similar NPU kernels may pass the compiler, but it consistently triggers runtime crashes and performance degradation by blindly violating underlying hardware constraints. To overcome this, we introduce Hawk, a training-free framework that harnesses hardware-aware knowledge through three core modules: (1) Run-Time Knowledge Synthesis Module, which employs a Triple-Part Executable Knowledge Representation to inherently couple the error context with executable semantics; (2) Bottleneck-Aware Knowledge Retrieval Module, which implements a 2D-Retrieval paradigm to project queries into orthogonal syntactic and hardware-aligned semantic spaces; and (3) Effect-Driven Knowledge Distillation Module, which leverages LLM-driven semantic arbitration to continuously distill the knowledge by pruning errors and consolidating redundancies based on the empirical execution feedback. Extensive evaluations on real-world NPU workloads demonstrate that Hawk elevates generation accuracy from 49.4% to 80.0%, while achieving up to a 2.2x execution speedup over state-of-the-art baselines.
[AI-98] EO-Agents : A Three-Agent LLM Pipeline for Earth Observation Hypothesis Generation ICML2026
链接: https://arxiv.org/abs/2607.01584
作者: Mahyar Ghazanfari,Amin Tabrizian,Armin Mehrabian,Peng Wei
类目: Artificial Intelligence (cs.AI)
备注: Accepted at the ICML 2026 AI for Science Workshop
Abstract:Large language models have recently been explored for scientific hypothesis generation, but most prior work relies on unstructured literature and free-form textual claims. We present a pipeline for Earth observation that grounds hypothesis generation directly in the NASA Earth Observation Knowledge Graph. A heterogeneous graph neural network trained on historical co-usage relations ranks candidate dataset pairings, and a three-agent LLM pipeline filters, generates, and evaluates structured research hypotheses. Applied to 1,475 NASA datasets, the system produces 160 hypotheses spanning multiple Earth-science domains, including ecohydrology, glaciology, aerosol–cloud interactions, vegetation phenology, and stratospheric chemistry. Model-predicted novel dataset pairings are rated nearly as plausible as held-out real co-usages from the literature, indicating that the pipeline surfaces scientifically coherent yet unexplored combinations. A 222 factorial experiment across GPT-5.2 and Claude Sonnet 4.6 shows that hypothesis rankings remain stable, while absolute scores depend strongly on judge identity, highlighting limitations of single-judge LLM evaluation.
[AI-99] Scaling Trends for Lie Detector Oversight in Preference Learning
链接: https://arxiv.org/abs/2607.01567
作者: Oskar J. Hollinsworth,Ann-Kathrin Dombrowski,Sam Adam-Day,Adam Gleave,Chris Cundy
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Deceptive behavior in LLMs is costly to monitor and prevent, motivating approaches such as Scalable Oversight via Lie Detectors (SOLiD) (Cundy Gleave, 2025), which uses lie detectors to identify responses for review by high-cost labelers. In this paper, we scale SOLiD to larger models and evaluate it in more diverse and realistic preference-learning settings. We find favorable scaling: undetected deception drops from 34% for 1B-parameter models to 14% for 405B-parameter models at a detector true positive rate of 99%, and expensive human labelers can be removed entirely from the fine-tuning phase without a statistically significant increase in deception. However, SOLiD is sensitive to distribution shift between detector training and preference-training data, which can drive detector false positive rates to impractical levels.
[AI-100] X-LogSMask: Expand Transformer for Graph-Structured Data
链接: https://arxiv.org/abs/2607.01553
作者: Leyan Li,Rennong Yang,Zhenxing Zhang,Liping Hu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Transformers have become general-purpose architectures, but their all-to-all self-attention is poorly matched to graph data, whose interactions are sparse, structured and multi-scale. Existing Graph Transformers address this mismatch through structural encodings, hybrid message-passing modules or learned attention constraints, often introducing additional complexity and limited interpretability. Here we introduce X-LogSMask, an explainable multi-head logarithmic structural mask that injects symmetrically normalized graph topology directly into attention logits. The logarithmic transform converts structural connectivity into a topology-aware gating signal, suppressing unsupported node interactions while preserving feature-dependent attention. By assigning different powers of the normalized adjacency matrix to different attention heads, X-LogSMask gives each head a defined structural radius and supports multi-hop information propagation within a single layer. We further show that a standard Transformer encoder can be interpreted as one-step message passing on a complete graph, motivating X-LogSMask as a topology-constrained alternative to unrestricted self-attention. Across 20 node-, edge- and graph-level benchmarks, Transformers equipped with X-LogSMask achieve state-of-the-art performance on 13 datasets and remain competitive in a lightweight one-layer configuration. These results show that simple, interpretable structural masks can make self-attention an effective graph-learning operator without changing the Transformer architecture. The code is available at this https URL.
[AI-101] Evolutionary Feature Engineering for Structured Data
链接: https://arxiv.org/abs/2607.01548
作者: Ege Onur Taga,Yilin Zhuang,M. Emrullah Ildiz,Petros Mol,Abhimanyu Das,Karthik Duraisamy,Samet Oymak
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 page main content, 41 pages in total
Abstract:Large language models are increasingly used as open-ended search operators in evolutionary optimization. We introduce Evolutionary Feature Engineering (EFE), a framework for using LLM-based evolution to discover preprocessing transformations for structured data. EFE represents transformations as Python programs with a standardized fit/transform interface, allowing them to be inserted directly into existing machine learning pipelines. During evolution, candidate programs are refined using dataset context, summary statistics, and downstream performance feedback on validation set. We instantiate EFE in two settings. For time-series forecasting, EFE-Time learns invertible, dataset-specific normalizations that improve off-the-shelf time-series foundation models. It reduces forecasting errors (MASE, WQL, MAE) 3% or more when averaged across datasets and improvements are as much as 19% on the COVID-Deaths dataset. Notably, these improvements occur with recent TSFMs such as Chronos-2. For tabular prediction, EFE-Tab evolves compact feature programs that add useful interpretable features and remove redundant ones, improving or matching existing LLM-based feature-engineering methods. We found EFE-Tab to be particularly effective on classical decision trees, where small sets of evolved features yield competitive accuracy while preserving interpretability. Overall, EFE demonstrates that LLM-based evolution can improve both accuracy and interpretability when automatically tackling structured data.
[AI-102] OPINE-World: Programmatic World Modeling with Ontology-error-Prioritized Interactive Exploration
链接: https://arxiv.org/abs/2607.01531
作者: David Courtis,Wenhao Li,Scott Sanner
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Learning how an environment behaves from interaction is central to building agents that adapt to unfamiliar tasks. World models learned with deep networks are flexible but data-hungry and transfer poorly beyond their training distribution. Program-synthesized world models, written as source code by LLMs and refined through counterexample-guided inductive synthesis (CEGIS), are instead data-efficient and reusable, yet they have been demonstrated mainly on structured-state worlds with a given object vocabulary, and a single program search does not scale to pixel-rendered environments whose object structure must be hypothesized flexibly. We introduce OPINE-World, an LLM agent that learns an object-centric programmatic world model online from interaction. OPINE-World couples two cooperating agents in a loop of hypothesis and test, one acting in the environment and one synthesizing the model in code with replay verification and model-based planning, and it steers exploration with a Bayesian measure of object-type adequacy we call ontology error. We evaluate OPINE-World on ARC-AGI-3, a benchmark for skill-acquisition efficiency in which the object vocabulary, the goal, and the action semantics are withheld. OPINE-World solves 20 of 25 games without per-game training and reaches an action-efficiency score of 78.4 against the human baseline.
[AI-103] Robust and Explainable 3D Mode Shape Recognition Using Region-Aware Graph Neural Networks
链接: https://arxiv.org/abs/2607.01522
作者: Tong Duy Son,Marc Brughmans,Andrey Hense,Kohta Sugiura,Sebastian Ciceo,Paolo di Carlo,Theo Geluk
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:
Abstract:Mode shape recognition is a fundamental task in automotive NVH development, yet it remains dependent on manual visual inspection by experienced engineers. Existing approaches based on engineering heuristics, Modal Assurance Criterion (MAC), or geometry-dependent AI representations often exhibit limited robustness across different vehicle architectures, finite element (FE) meshes, and experimental measurement layouts, restricting their industrial applicability. This paper presents a Canonical Engineering Graph Representation and region-aware graph learning framework for robust and explainable 3D mode shape recognition. Rather than learning directly from vehicle-specific FE meshes, heterogeneous FE models and experimental measurements are transformed into a common graph whose nodes represent semantically meaningful structural regions connected through engineering-informed relationships. Geometry-independent regional descriptors are combined with graph attention learning and region-aware pooling to capture structural interactions while preserving engineering semantics and enabling physically interpretable predictions. The resulting representation decouples engineering knowledge from numerical discretization, allowing transfer across different vehicle programs without requiring identical mesh topology or sensor configurations. The proposed framework is validated using FE and experimental datasets from four vehicle programs under severe label scarcity. Results demonstrate high classification accuracy, cross-vehicle transferability, and physically meaningful explanations by directly relating predictions to engineering-defined structural regions used in NVH analysis. Beyond mode shape recognition, the proposed Canonical Engineering Graph Representation provides a reusable engineering abstraction for trustworthy and transferable AI across heterogeneous simulation and experimental workflows.
[AI-104] Revisiting Chain-of-Thought Reasoning under Limited Supervision: Semi-supervised Chain-of-Thought Learning
链接: https://arxiv.org/abs/2607.01511
作者: Hongyang He,Jiuming Liu,Victor Sanchez
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Tech Report
Abstract:Chain-of-thought (CoT) reasoning has emerged as an effective approach for activating latent reasoning capabilities in large language models. However, most existing CoT methods use reasoning chains mainly as inference-time prompts, while the generated reasoning traces are rarely reused as semi-supervised learning signals. In this report, we define \textbfSemi-supervised Chain-of-Thought Learning and propose \textbfSemi-CoT, a simple framework that uses unlabeled questions to construct pseudo reasoning supervision. Semi-CoT samples multiple pseudo-CoTs for each unlabeled question, estimates answer-level semantic entropy, and selects low-entropy reasoning chains as reliable pseudo-CoT demonstrations. This extends the self-training view of CoT from inference-time refinement to semi-supervised pseudo-supervision. Pilot experiments on AQuA, SVAMP, GSM8K, and MultiArith show that the entropy gate selects high-precision pseudo-CoTs, with pseudo-answer precision ranging from 91.36% to 100% . Semi-CoT also gives small gains on SVAMP and GSM8K, while AQuA shows negative transfer and MultiArith reaches a ceiling. These results suggest that unlabeled questions can provide reliable pseudo reasoning signals, but their effective use still requires stronger demonstration selection or student training.
[AI-105] Janus: a Playground for User-Involved Agent ic Permission Management
链接: https://arxiv.org/abs/2607.01510
作者: Natalie Grace Brigham,Eugene Bagdasarian,Tadayoshi Kohno,Franziska Roesner
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Code and data released on GitHub: this https URL
Abstract:AI agents that autonomously execute tool calls on a user’s behalf raise pressing questions about permission management: what role could users play, and what role should they play? Despite many proposed approaches, the user’s role in agentic permission management remains under explored. We introduce Janus, a playground system for implementing and evaluating user-involved agentic permission management designs. Janus consists of two components: Janus-Core, a modular agentic system supporting a diverse spectrum of permission management designs, and Janus-Harness, an automated evaluation framework. Grounded in a conceptual model that identifies key design axes for user involvement, we implement six permission assistants spanning the design space and evaluate them across three scenarios and three synthetic responders. We demonstrate that user input is critical and can significantly strengthen privacy and security, that AI augmentation of user decisions can help reduce cognitive load, and that realistic user behavior including permission fatigue must be accounted for in system design. No single design performs optimally across all contexts, motivating a more principled and context-sensitive approach to deploying permission assistants in agentic systems. Janus is publicly available to support future investigation into this dimension of agentic system design.
[AI-106] he Agent ic Garden of Forking Paths
链接: https://arxiv.org/abs/2607.01507
作者: Jiacheng Miao,Jonathan K Pritchard,James Zou
类目: Artificial Intelligence (cs.AI); Methodology (stat.ME)
备注:
Abstract:Empirical research rarely admits a unique analysis. Different analytical choices can lead to different conclusions from the same data, yet these hidden forking paths are difficult to observe. We show that AI agents capture much of the analytical variation among human researchers while making these paths explicit. Across four high-stakes domains, assigning different personas is sufficient for AI agents to report divergent, often opposing, conclusions from the same data and question, with findings systematically aligned with those beliefs. In a study in which 42 human research teams analyzed the same immigration dataset, AI agents reproduced 72% of the human ideological gap in reported effect estimates. Despite reaching opposing conclusions, it is difficult to identify clear issues in each analysis based on the final AI reports: 86% passed independent AI review and 78% passed majority human expert review. These findings suggest that the central challenge is often not flawed analyses, but selective exploration and reporting from a large space of methodologically defensible analyses. AI agents may amplify this longstanding problem by making such exploration inexpensive and scalable. To address this, we introduce the m-value (multiverse value), the probability that an analysis path would produce a claim at least as extreme as the reported one. We further introduce Agentic Bootstrap, which estimates the m-value by using AI agents to sample plausible analysis paths. Applied to the human immigration study, 13.5% of reported human analyses fell in the most extreme 5% of the analysis space (m0.05). Scientific evidence should therefore be evaluated not only by a single reported analysis but also by its position within the distribution of analyses that could reasonably have been reported. Agentic Bootstrap makes this distribution observable and turns it into a criterion for scientific credibility.
[AI-107] Dont Let Gains FADE: Breaking Down Policy Gradient Weights in RL
链接: https://arxiv.org/abs/2607.01490
作者: Juliette Decugis,Sean O’Brien,Francis Bach,Gabriel Synnaeve,Taco Cohen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning post-training dramatically improves LLM reasoning, but suffers from training instability and diversity collapse. Advantage functions offer an appealing fix: they reshape the training objective, reweight which rollouts drive learning, and are trivial to implement. Yet a proliferation of methods makes it unclear which advantage to use and when. We cut through the confusion with a unifying framework that decomposes any advantage into its positive and negative gradient mass along two orthogonal axes. On the sign axis, imbalanced updates collapse either entropy or weight geometry. On the difficulty axis, hard-problem focus sharpens signal but costs sample size. Both trade-offs shift during training: exploration favors balance and hard focus; exploitation favors suppression and medium focus. This motivates FADE (Focal Advantage with Dynamic Entropy), a self-adapting advantage that reads training dynamics to schedule the gradient weight automatically. FADE reaches peak pass@1 20k steps earlier than the best static baseline at the 7B scale and 2k steps earlier at the 32B , while achieving the best accuracy-diversity trade-off across all pass@k on LiveCodeBench and AIME.
[AI-108] Fully Unsupervised Detection of Physical Contacts on Subsea Cables via State-of-Polarization Monitoring
链接: https://arxiv.org/abs/2607.01484
作者: Agastya Raj,Alvaro Doval,Tian Tian,Steinar Bjørnstad,Marco Ruffini
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: This paper is a preprint of a paper accepted in ECOC 2026 and is subject to Institution of Engineering and Technology Copyright. A copy of record will be available at IET Digital Library
Abstract:We present a fully unsupervised Fast-Slow DSVDD detector for continuous State-of-Polarization monitoring on a deployed subsea cable. Trained without event labels, it ranks all five confirmed trawler contacts within the top 13 of 122,174 recordings and surfaces additional corroborated cable-contact events.
[AI-109] Procedural Memory Distillation: Online Reflection for Self-Improving Language Models
链接: https://arxiv.org/abs/2607.01480
作者: Ye Liu,Srijan Bansal,Bo Pang,Yang Li,Zeyu Leo Liu,Yifei Ming,Zixuan Ke,Shafiq Joty,Semih Yavuz
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Reinforcement learning with verifiable rewards (RLVR), along with recent selfdistillation variants such as SDPO, evaluates each rollout against a verifier and updates the policy from that episode-level signal. However, the richer procedural information in the rollout is rarely retained or reused. Across episodes and epochs, the model repeatedly encounters related problems under a changing policy, producing cross-episode signals that episode-local updates cannot capture: which strategies consistently pass verification, which failure modes persist, which patterns recur. We propose Procedural Memory Distillation (PMD), which converts these crossepisode signals into reusable procedural memory and distills it into the policy’s weights during training. This memory functions as a training scaffold, absorbed into the policy itself, yielding a memory-free model at inference. PMD organizes the memory at three levels of abstraction: raw trajectories, self-reflected strategies and lessons, and higher-level behavioral patterns that recur across problems, all extracted online from the model’s own trajectories. A memory-conditioned self-teacher draws on the accumulated experience to supervise the student on its own rollouts, enabling student to progressively internalize procedural knowledge within its parameters. The central design principle is co-evolution: the policy generates rollouts that update the memory, and memory shapes the supervision that updates the policy. Empirically, across Qwen3-8B and OLMo3-Instruct-7B, PMD improves over SDPO by 3.8-5.5% on SCIKNOWEVAL and 7.9-13.6% on LIVECODEBENCH. Co-evolution powers these gains: freezing either the memory or the policy trails PMD by more than 10% across SCIKNOWEVAL domains.
[AI-110] World Feedback for Clinical Agents : Diagnosing RL in FHIR Environments
链接: https://arxiv.org/abs/2607.01470
作者: Ananya Mantravadi,Harshit Rajgarhia,Prasanna Desikan,Abhishek Mukherji
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Clinical protocol-execution tasks – checking a lab value, applying a threshold, placing a correctly structured FHIR order – are natural candidates for RL from world feedback: once clinical SMEs encode decision logic into a verifier, that verifier grades unlimited rollouts without per-episode annotation. But applying RL requires a sound feedback channel and sufficient base capability. We audit MedAgentBench v1/v2, find a 41.7% silent-finish ceiling that makes inaction the RL dominant strategy, and construct \textbfMedAgentBench-v3 (MAB-v3) (508 tasks, 8.9% ceiling). Training Qwen3-8B exposes two structural barriers: a \emphcapability ceiling (10/20 task types have 0% base performance, zero gradient) and a \emphformat-knowledge barrier (3/20 types require exact clinical codes undiscoverable by exploration). Pure RL reaches 18.2% pass@1 vs.\ 34.1% for rule-based SFT; the 15.9~pp gap is attributable entirely to these barriers. A decision/format-knowledge/lookup taxonomy predicts RL learnability and prescribes the fix: SFT to inject codes, RL to learn conditionals.
[AI-111] Beyond Next-Token Prediction: An RLVR Proof of Concept for Tool-Use Agents on Atlassian Workflows
链接: https://arxiv.org/abs/2607.01465
作者: Karthikeya Aditya Vissa,Sankalp Mane,Ananya Mantravadi,Harshit Rajgarhia,Abhishek Mukherji
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models are trained to predict the next token, not to act inside a specific API. In niche enterprise SaaS workflows – where success means hitting the right endpoint with the right nested arguments in the right order – this objective mismatch shows up as silent failures: dropped required fields, hallucinated tools, or early stops after a single read. We ask whether Reinforcement Learning with Verifiable Rewards (RLVR), applied directly in the target environment, closes the gap. As a proof of concept we build a suite of five synthetic environments emulating the Jira REST v3 and Confluence v2 APIs at schema fidelity; rewards are computed entirely from the tool-call trace, with no live API, no learned judge, and no human label in the loop. Scoring prompted Qwen3-1.7B and Qwen3.5-4B on the same checkers that drive GRPO training, we find that on the four scenarios whose rewards are non-degenerate the RL-trained policy lifts average reward from a 4B-baseline range of 0.35–0.92 to 0.95–1.00, with the largest single gain on Confluence page creation ( 0.35 \rightarrow 1.00 ). We position this as a preliminary step toward outcome-optimised small models for niche enterprise APIs, and foreground two limitations a workshop reader should weigh: hand-crafting verifiable rewards does not scale beyond the handful of endpoints reported here, and one of our five scenarios (ticket-transition) has a saturating reward shape that the prompted 4B already maxes out.
[AI-112] oken Geometry
链接: https://arxiv.org/abs/2607.01455
作者: Kathan Shah
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Language models learn continuous programs over discrete symbols, with the embedding table and LM-head acting as the read/write interface between them. We show that this interface has gradient geometry distinct from dense hidden weights which can be exploited to improve the Pareto frontier across supervised finetuning, RL, and pretraining, while only utilizing kilobytes of optimizer state. We introduce Ember, a lightweight optimizer for embedding and LM-head matrices that utilizes O(V + D) VRAM, instead of Adam’s O(2VD), and forgoes the need to shard both token table optimizer states. We provide empirical evidence that Ember scales effectively across batch size and parameter count. We show that the optimization trajectory of tokens can be well described by a simple 1D ray, counter to the popular belief that neural net parameters navigate a heavily nonconvex landscape. We provide a principled view on the surprisingly narrow space of optimizers that suffice for Transformer training. Finally, we open-source our distributed Ember implementation that merges cleanly with existing ZeRO/FSDP setups to support further research at this https URL
[AI-113] Discrete Diffusion Language Models for Interactive Radiology Report Drafting
链接: https://arxiv.org/abs/2607.01436
作者: Max Van Puyvelde,Halil Ibrahim Gulluk,Wim Van Criekinge,Olivier Gevaert
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Diffusion language models, which generate text by denoising a token canvas bidirectionally instead of emitting tokens left to right, have become competitive with autoregressive (AR) generation. Medical foundation models, however, remain almost entirely autoregressive. We adapt a mixture-of-experts diffusion language model, DiffusionGemma-26B, and benchmark it against its same-size AR sibling Gemma-4-26B under an identical LoRA recipe on medical visual question answering datasets, scored by a verbosity-robust LLM judge. Diffusion matches or exceeds AR on all of them, and the finetuned model (3.8B active) is competitive with frontier vision-language models; its decoding is also 3.5-4.4x faster. Beyond this parity, the diffusion model offers a drafting capability AR lacks: any-order infill. Because the canvas is denoised bidirectionally, a radiologist can fix report fragments and have the model fill the text between them, an operation inherent to diffusion but not to autoregression, which is subpar at it. This suits real reports, which are often terse or inconsistent across clinicians and institutions.
[AI-114] CreativityNeuro: Steering Language Model Weights to Improve Divergent Thinking and Reduce Mode Collapse ICML2026
链接: https://arxiv.org/abs/2607.01433
作者: Samuel Schapiro,Core Francisco Park,Felix Sosa,Lav R. Varshney
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at ICML 2026 Workshop on Creativity Generative AI
Abstract:Divergent thinking is a crucial aspect of creativity, yet large language models (LLMs) tend to consistently generate similar responses to open-ended questions, in what has been termed the artificial hivemind effect. Here, we introduce CreativityNeuro, a data-free method for enhancing divergent thinking in LLMs via contrastive weight steering. We evaluate our method across multiple creativity assessments and report several main findings. On the Divergent Association Task (DAT), a vocabulary-space creativity test, CreativityNeuro improves performance by up to 14 human percentile points. Next, in a large-scale human evaluation (N=720) on the Alternative Uses Test (AUT) and the Task Task, CreativityNeuro achieves significant improvements in originality, surprise, and creativity, transferring to longer-form and more open-ended tasks. Importantly, we find that across all three tasks, CreativityNeuro demonstrably reduces measures of mode collapse. Moreover, activation steering achieves comparable performance to CreativityNeuro on the DAT, but it does not transfer to the AUT and Task Task, demonstrating the effectiveness of weight-space steering in generalizing to unseen tasks. In conclusion, CreativityNeuro improves divergent thinking and reduces mode collapse without requiring behavioral data, re-training, or gradient-based fine-tuning, providing a straightforward way to enhance LLM performance in creative domains.
[AI-115] When Should Service Agents Reconsider? Difficulty-Routed Control in Customer-Service Operations
链接: https://arxiv.org/abs/2607.01426
作者: Qian Chen,Chengyuan Liu,Xin Yu
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Autonomous customer-service agents are shifting from conversational interfaces toward operational execution roles: they retrieve firm records, apply service policies, and execute backend writes such as refunds, cancellations, exchanges, order modifications, and reservation changes. This shift creates a service-control problem: firms must keep routine service fast and low-friction while preventing operational errors on requests where customer instructions, policy constraints, firm records, and backend writes interact. We propose a difficulty-routed service-control architecture that asks when service agents should reconsider before acting. A lightweight router keeps routine sessions on a low-cost baseline path and routes operationally coupled sessions to an escalated workflow. The escalated path uses conflict-aware communication and write-triggered reconsideration to concentrate deliberation and safeguards before consequential backend writes, rather than applying additional control uniformly across all service sessions. We evaluate the architecture on human-verified retail and airline tasks from \tau^2 -bench. In retail, the method improves reliability consistently on service requests with operational conflict. Routing evidence shows that stronger control is directed toward conflicted requests rather than broadly applied to routine ones. Dialogue and tool-use profiles suggest that gains do not come from indiscriminate interaction expansion or broader tool chains; instead, added turns and tool calls support evidence gathering, write separation, and pre-write reconsideration. Case-level evidence shows that the escalated workflow preserves fallback plans, binds retrieved records to the correct action, sequences writes, and decomposes multi-entity requests. Airline results extend the same service-control logic to reservation operations.
[AI-116] Agent 4cs: A Multi-agent System for Code Summarization in Large Hierarchical Codebases
链接: https://arxiv.org/abs/2607.01425
作者: Yongjian Tang,Ezgi Sarikayak,Doruk Tuncel,Jie M. Zhang,Thomas Runkler
类目: Artificial Intelligence (cs.AI)
备注: Accepted to the main track of the 23rd European Conference on Multi-Agent Systems (EUMAS 2026)
Abstract:Understanding large, complex codebases, especially those with obfuscated structures and incomplete documentation, remains a significant challenge. Existing code summarization solutions often rely on a single language model or coding assistant like Claude Code, and treat source code as flat text, underutilizing the rich interdependencies and hierarchical information within a repository. To address these shortcomings, we propose Agent4cs - a multi-agent framework that summarizes large codebases in a bottom-up fashion, where a summarization agent focuses on producing robust summaries; a keyword-extraction agent proactively identifies critical information from subfolders; and a quality-assurance agent iteratively refines the outputs for readability, coherence, and completeness. Evaluated on 7 frontier models, Agent4cs improves semantic consistency across all folder levels by average 8% compared to two structured prompting baselines with code segments. Furthermore, extensive evaluation on real-world datasets demonstrates up to 38% gains in normalized keyword coverage rate over the same baselines.
[AI-117] Risk Architecture for AI-Native Engineering Teams: An Organizational Framework for Agent ic System Governance
链接: https://arxiv.org/abs/2607.01421
作者: Laxmipriya Ganesh Iyer
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Engineering management research has produced mature frameworks for software risk: ownership by feature, escalation by severity, and assurance by test coverage. These frameworks implicitly assume deterministic behavior, discrete and auditable change events, and clear component-to-owner mappings. Teams that build and operate agentic AI systems violate all three assumptions at once: outputs are probabilistic, systems take autonomous multi-step actions, and the risk surface mutates silently between deployments. Existing AI risk literature addresses this from above (policy frameworks such as the NIST AI RMF and ISO/IEC 42001) or below (threat taxonomies such as OWASP’s agentic AI guidance), but not at the layer where an engineering manager (EM) operates: roles, decision rights, and escalation structures. This paper contributes (i) a seven-dimension profile distinguishing pure software-engineering, hybrid, and AI-native teams; (ii) a six-cluster failure-mode taxonomy including a previously unarticulated cluster, dependency-boundary determinism mismatch; and (iii) a synthetic framework-adequacy methodology scoring how well each profile’s risk architecture detects, contains, and escalates a defined scenario set. Because the object of study is framework adequacy rather than human behavior, the evaluation yields derived rather than observed coverage claims. Coverage degrades as teams move from pure software engineering to AI-native operation, monotonically in the median and abruptly in the count of uncovered, high-consequence failures appearing only at the AI-native step. The degradation concentrates in specific failure-mode categories, and the most severe, least-covered failures arise not inside AI-native teams but at the organizational boundary where their probabilistic outputs are consumed by determinism-assuming dependencies.
[AI-118] GPUAlert: A Zero-Instrumentation Process-Boundary Monitor for Diagnosing GPU Training-Job Failures
链接: https://arxiv.org/abs/2607.01409
作者: Parv Agarwal,Asif Ekbal
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 3 figures, 4 tables,3 Listings. Submitted as an arXiv preprint. Source, corpus and evaluation harness available at this https URL and this https URL
Abstract:GPU training jobs fail often, roughly two in five on large production clusters, yet the operator typically learns of a failure only by reconnecting hours later. Experiment trackers require editing the training script and maintaining a cloud connection; the scheduler’s mail hook delivers a single status line with no cause and no logs. GPUAlert is a command-line wrapper that monitors any training command at the process boundary, and with no change to that command, emails a structured notification on completion carrying a classified failure cause, durable logs, and output artifacts. The tool is organized around three reliability primitives: a pre-launch log guarantee that establishes the durable destination before the child process can crash, notifier isolation that makes the wrapper’s exit code a pure function of the child’s status regardless of whether the email succeeds, and a non-silent artifact budget that bounds attachment size without ever dropping output silently. We release a labelled corpus of 474 GPU training logs across 15 failure classes and a reproducible evaluation harness. On the twelve hardware-reproduced classes, the ordered-rule classifier reaches 0.997 macro-F1, against 0.830 for unordered keyword matching and 0.133 for exit-code inspection. Wrapper overhead is a constant approximately 3ms per job; the pre-launch guarantee preserves a log where a shell redirect yields nothing; and across all 15 failure modes the wrapper returns the child’s exit code unchanged even when the SMTP relay is unreachable.
[AI-119] Spin-Weighted Spherical Harmonics Enable Complete and Scalable mathrmE(3)-Equivariant Networks
链接: https://arxiv.org/abs/2607.01408
作者: Chenxing Liang,Yuchao Lin,Andrii Kryvenko,Wendi Yu,Chuan Li,Jianwen Xie,Xiaofeng Qian,Shuiwang Ji
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract: \mathrmE(3) -equivariant networks are promising for 3D atomistic system modeling, yet their scalability is limited by the O(L^6) complexity of the Clebsch-Gordan Tensor Product (CGTP). The recently proposed Gaunt Tensor Product (GTP) reduces the complexity but is unable to capture the antisymmetric paths, resulting in incomplete expressivity. In this work, we present SpinGTP, an approach to overcome the GTP incompleteness by generalizing from scalar functions to Spin-Weighted Spherical Harmonics (SWSH). By relying on the algebraic properties of SWSH, SpinGTP recovers the missing antisymmetric interactions while maintaining the asymptotic efficiency of GTP. It also allows for a more expressive equivariant basis that naturally accounts for the parity-odd components of tensor products. We evaluate SpinGTP across diverse benchmarks, including Tetris, 3BPA, SPICE-MACE-OFF, and OC20. Our results show that SpinGTP achieves accuracies comparable to full CGTP. Notably, by explicitly capturing antisymmetric paths, SpinGTP exhibits superior performance in tasks involving chiral materials and non-centrosymmetric geometries. This work provides a complete, scalable, and mathematically rigorous path toward high-order equivariance in large-scale 3D atomistic system simulations.
[AI-120] he Wiola Architecture for Efficient Small Language Models
链接: https://arxiv.org/abs/2607.01394
作者: Aryuemaan Kumar Chowdhury,Afreen Shaik,Yaparla Bhargavi,Brahma Kumar
类目: Artificial Intelligence (cs.AI)
备注: 7 Pages
Abstract:We present Wiola, a fully original Small Language Model (SLM) architecture built from first principles, sharing no structural lineage with any existing model family including GPT, LLaMA, Mistral, or Falcon. Wiola introduces five independently novel components: (i) Spiral Rotary Positional Encoding (SRPE), which embeds token positions on a three-dimensional helical manifold combining absolute, relative, and hierarchical positional signals; (ii) Gated Cross-Layer Attention (GCLA), providing each decoder layer with soft cross-attention access to compressed summaries of two preceding layers for inter-layer coherence; (iii) Adaptive Token Merging (ATM), which dynamically merges se mantically redundant adjacent tokens in middle network layers to reduce attention complexity without information loss; (iv) Dual Stream Feed-Forward (DSFF), replacing the conventional MLP with two parallel streams fused by a learned per-dimension gate; and (v) WiolaRMSNorm, a modified normalisation introducing a per-dimension learned offset vector that prevents representation collapse. We provide complete mathematical derivations, architectural block diagrams, complexity analyses, and systematic comparisons against GPT-2, LLaMA-2, and Mistral. Wiola is released in four sizes (120M, 360M, 700M, and 1.5B parameters) and is fully compatible with the HuggingFace Transformers ecosystem, with all 22 architectural unit tests passing.
[AI-121] How Should Transformers Encode Numeric Values in Electronic Health Records? ICML2026
链接: https://arxiv.org/abs/2607.01391
作者: Maria Elkjær Montgomery,Christian Igel,Mikkel Odgaard,Martin Sillesen,Mads Nielsen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 15 figures, 3 tables, accepted to ICML 2026, to be published in Proceedings of Machine Learning Research
Abstract:How do we encode numeric values in transformer-based sequence processing, particularly in electronic health record (EHR) data? We systematically compare discrete, continuous, and hybrid value encoding strategies using synthetic arithmetic tasks embedded within real-world EHR data, as well as real-world clinical prediction tasks. Our study reveals trade-offs between numeric precision, optimisation stability, and architectural flexibility. We find that approaches that explicitly model value-concept interactions perform best on precision-sensitive arithmetic tasks when architectural constraints permit. Hybrid token-based approaches that retain numeric values but apply binning prior to projection provide a more robust and broadly applicable alternative, with the optimal number of bins following a simple empirically derived power-law in dataset size. Across tasks, models consistently exhibit reliable “good enough” numeric computation rather than exact arithmetic, while clinical gains from incorporating laboratory values are task-dependent. This suggests that robustness and deployability often outweigh maximal numeric precision in practice, motivating hybrid token-based approaches as a practical default.
[AI-122] Auto-FL-Research: Agent ic Search for Federated Learning Algorithms
链接: https://arxiv.org/abs/2607.01366
作者: Holger R. Roth,Ziyue Xu,Chester Chen,Daguang Xu,Peter Cnudde,Andrew Feng
类目: Artificial Intelligence (cs.AI)
备注: 8 pages; 5 figures; 6 tables
Abstract:Federated learning (FL) research often depends on many small but consequential algorithmic choices: optimizer variants, server aggregation rules, local training schedules, normalization, regularization, and model architecture. These choices are expensive to explore manually and difficult to compare fairly when candidate changes can also alter the FL training or evaluation path. In this work, we present Auto-FL-Research (AFR), a constrained coding-agent workflow for FL algorithmic recipe search. Agents may propose and implement candidate training algorithms, including server aggregation rules, client update schedules, local objectives, and registered model variants, while task profiles fix the mutation surface, compute budget, communication contract, and final model evaluation. Each campaign records candidate scores, runtime, edited files, artifacts, and failure status. We evaluate AFR on five healthcare cross-silo FLamby tasks and on grouped-client profiles for the five fixed LEAF datasets plus the LEAF synthetic task. Five-seed repeat evaluations support gains on four FLamby tasks and five of six LEAF profiles, while also exposing seed-sensitive and search-selected failure cases. Same-budget controls show that several gains correspond to FL-recipe changes, whereas other improvements are recovered by fixed-surface scalar controls or fail under repeat or held-out evaluation. These mixed outcomes are part of the contribution: they show how agent-generated candidates can be separated into repeated FL mechanisms, fixed-surface tuning effects, and selected single-run artifacts.
[AI-123] PACE: A Neuro-Symbolic Framework for Plausible and Actionable Counterfactual Explanations
链接: https://arxiv.org/abs/2607.01306
作者: Pavel Iakovets,Liyanapathiranage Sudeepika Wajirakumari Samarathunga,Martin Thomas Horsch,Fadi Al Machot
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Counterfactual explanations explain machine learning predictions by identifying minimal input changes that would alter a model’s decision. Although many existing methods successfully generate prediction-changing alternatives, they often produce unrealistic or infeasible recommendations due to a lack of explicit mechanisms for incorporating domain knowledge and intervention constraints. Neuro-symbolic AI offers a promising direction by combining data-driven predictive models with symbolic reasoning capable of representing human-understandable rules and feasible actions. This paper presents PACE, a modular neuro-symbolic framework for generating feasibility-aware counterfactual explanations. The framework separates prediction and reasoning into two components: a neural predictive model for classification and a symbolic reasoning layer that enforces domain-specific constraints during counterfactual generation. By explicitly modeling feasible interventions, the framework produces explanations consistent with domain knowledge while remaining interpretable and actionable. The approach is model-agnostic and adaptable to domains requiring realistic decision support. A case study is conducted on the Adult Income dataset, combining a multilayer perceptron classifier with Answer Set Programming (ASP) rules encoding feasible modifications to education, occupation, and working hours while preserving immutable attributes. Results highlight the trade-off between counterfactual validity and plausibility and show that symbolic constraints yield explanations that better satisfy domain-specific feasibility requirements, illustrating the potential of neuro-symbolic methods for transparent, feasibility-aware counterfactual explanation in explainable AI.
[AI-124] Generative AI and Federated Learning for Intrusion Detection Systems: A Survey
链接: https://arxiv.org/abs/2607.01305
作者: Jiefei Liu,Abu Saleh Md Tayeen,Pratyay Kumar,Qixu Gong,Wenbin Jiang,Huiping Cao,Satyajayant Misra,Jayashree Harikumar
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Intrusion Detection Systems (IDSs) are essential for monitoring network traffic and identifying malicious activities in modern cyber-physical, Internet of Things (IoT), enterprise, and distributed network environments. However, developing reliable IDS models remains challenging because attack behaviors evolve over time, realistic datasets are difficult to obtain, traffic records may be incomplete, attack classes are often imbalanced, and privacy constraints limit centralized data collection. Recent advances in generative artificial intelligence (AI) and Federated Learning (FL) provide new opportunities to address these limitations. Generative models can support anomaly detection, synthetic traffic generation, data augmentation, data imputation, adversarial traffic generation, and IDS alert explanation. FL enables distributed IDS training without directly sharing local network traffic, making it suitable for privacy-sensitive and geographically distributed environments. This survey provides a structured review of generative AI and FL techniques for IDS. We first summarize representative IDS research directions, including adversarial machine learning, anomaly-based detection, IoT-oriented IDS, explainable IDS, and benchmark datasets. We then categorize generative AI applications in IDS according to model families and task objectives, covering autoencoder-based models, Generative Adversarial Networks (GANs), diffusion models, and Large Language Models (LLMs). Finally, we review emerging studies that integrate generative AI with FL-based IDS and discuss open challenges, including synthetic data quality, realistic traffic generation, dual-use adversarial risks, non-IID client distributions, communication-efficient model sharing, federated IDS benchmarking, and domain-specific LLMs for network security.
[AI-125] Adaptive Companionship for Group-Following Robots: Handling Dynamically Changing Group Formations IROS2026
链接: https://arxiv.org/abs/2607.01287
作者: Cong-Thanh Vu,Yen-Chen Liu
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)
Abstract:Accompanying a group of humans is an essential aspect of developing human-like social cognition in robots. However, human groups typically do not follow fixed formations, which poses significant challenges for robots in maintaining natural companionship behaviors. In this paper, we propose an adaptive group-accompaniment method for social robots based on Vision-Language Models (VLMs), leveraging their semantic reasoning capabilities to infer companion positions, maintain social distances, and understand group dynamics. The members of the group are first detected, and a perceptual module generates visual representations of the interaction group space as input to the VLM, which is then combined with a Model Predictive Path Integral (MPPI) controller to ensure stability and safety. Experimental evaluations across five scenarios show that the proposed method enables robots to accompany the group effectively, demonstrating a 15% improvement in success rate and a 25% reduction in collision rate compared to baseline approaches. Additionally, a user study indicates that the generated companionship behaviors are perceived as natural and socially appropriate.
[AI-126] Scaling Laws for Grid-Based Approximate Nearest Neighbor Search in High Dimensions
链接: https://arxiv.org/abs/2607.01283
作者: Matthew J Liu,Wei Hang Zheng,Vidhan Purohit,Siqi Xie,Chieh-En Li,Jerry Li,Noah Flynn
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Grid-based approaches to approximate nearest neighbor (ANN) search have been absent from modern scaling analyses. We present a systematic characterization of a multiprobe grid algorithm with respect to dataset size N and dimensionality d . Our experiments reveal a previously unreported d -scaling crossover on the GloVe embedding family, in which multiprobe grid search maintains an approximately constant dimensional scaling exponent while other graph-, tree-, and partitioning-based methods exhibit degrading throughput. The advantage comes with near-linear query scaling in N , but also with lower indexing cost than competing ANN methods. Our results suggest that grid-based methods such as multiprobe grid may be competitive in rebuild-heavy or high-dimensional settings where indexing cost and dimensional robustness dictate performance. More broadly, recent work has formalized self-attention as an ANN operation. Thus, the N - and d -scaling properties of ANN algorithms may guide cost analysis of efficient transformer architectures. Code is available at: this https URL.
[AI-127] Domain Knowledge Based Temporal-Spatial Graph Convolution Network for ECG Recognition ICONIP2024
链接: https://arxiv.org/abs/2607.01282
作者: Wenting Ma,Zhipeng Zhang,Xiaohang Yuan,Ningwei Xie,Yuxin Xie,Xiaolin Wang,Meng Guo,Xingang Chai,Zhenjie Yao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures. Presented at ICONIP 2024, Auckland, New Zealand. Published in LNCS 15290, Springer, 2025
Abstract:In light of strides in Arti cial Intelligence (AI) and its wide spread application, challenges persist in the interpretability of AI models, particularly within specialized domains like healthcare, such as electro cardiograph (ECG) recognition. Rather than relying solely on end-to-end convolutional neural networks, this paper introduces a novel approach using a domain knowledge-based graph convolution network for ECG recognition. Key landmarks points of PRQST, vital to ECG interpreta tion, are incorporated as domain knowledge. The double-stream directed graph is employed to model both intra and inter ECG cycles. Speci cally, spatial directed graphs capture the positional relationships among key points, while temporal directed graphs delineate temporal dependencies between adjacent cycles in extended ECG sequences. Experimental re sults on the First Chinese ECG Intelligent Competition dataset, which speci cally classify ECG into nine categories, prove the e cacy of the proposed model. The overall average F1 score is 88.1%, the average F1 score of rare categories is 76.3%, both outperform the state-of-the-art models. The introduction of domain knowledge did enhance the detec tion performance, especially for rare categories.
[AI-128] he Rising Unsustainability of AI Graphics Cards Production
链接: https://arxiv.org/abs/2607.01258
作者: Clément Morand,Aurélie Névéol,Anne-Laure Ligozat
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Paper in Proceedings of LIMITS 2026: 12th Workshop on Computing within Limits, 2026-06-23-25, Online
Abstract:The rapid advancement of Artificial Intelligence (AI) has been accompanied by significant increases in computational and environmental costs, driven by large-scale investments in AI infrastructure, hardware, and software. In particular, graphics cards have become central to AI training, with frequent hardware updates required to meet escalating computational demands. However, the environmental damages of graphics cards production remain understudied. This study addresses this gap by estimating the environmental damages associated with graphics cards production over the past decade (2013-2025). We analyze trends in energy consumption, carbon emissions and resource depletion. We compile and provide a dataset documenting the environmental damages of NVIDIA workstation graphics cards production since 2013. Our analysis of this dataset reveals a steady increase in production-related impacts over the period. Our finding highlights the need for greater transparency in life-cycle data, a persistent challenge in AI environmental assessments. While operational efficiency improvements (e.g., energy-efficient training, carbon-aware computing) are often prioritized, our results underscore that production-related impacts are also escalating and cannot be overlooked. The AI community must move beyond incremental optimizations and confront the necessity of sufficiency. This shift may demand structural changes such as policy interventions, hardware design for longevity, and cultural shifts away from perpetual growth and increased performance. Comments: Paper in Proceedings of LIMITS 2026: 12th Workshop on Computing within Limits, 2026-06-23-25, Online Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI) Cite as: arXiv:2607.01258 [cs.CY] (or arXiv:2607.01258v1 [cs.CY] for this version) https://doi.org/10.48550/arXiv.2607.01258 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Anne-Laure Ligozat [view email] [v1] Fri, 5 Jun 2026 15:49:03 UTC (231 KB) Full-text links: Access Paper: View a PDF of the paper titled The Rising Unsustainability of AI Graphics Cards Production, by Cl’ement Morand and 2 other authorsView PDFTeX Source view license Current browse context: cs.CY prev | next new | recent | 2026-07 Change to browse by: cs cs.AI References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); We gratefully acknowledge support from our major funders, member institutions, , and all contributors. About Help Contact Subscribe Copyright Privacy Accessibility Operational Status (opens in new tab) Major funding support from
[AI-129] Artificial Intelligence-Enabled Accounting Information Systems and Fraud Detection in Nigerias Financial Services Sector: The Moderating Role of Natural Language Processing
链接: https://arxiv.org/abs/2607.01257
作者: Timothy Oluwapelumi Adeyemi,Abigail Omotola Ojogbede
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 21 pages, 4 tables, cross-sectional survey study
Abstract:The rapid digitalisation of financial systems has improved operational efficiency and financial inclusion while simultaneously increasing exposure to sophisticated forms of cyber-enabled fraud and electronic financial misconduct. Conventional auditing systems, which largely depend on retrospective verification and rule-based monitoring, increasingly struggle to address the complexity and speed of modern financial crime. Consequently, financial institutions are progressively adopting Artificial Intelligence (AI)-enabled Accounting Information Systems (AIS) and Natural Language Processing (NLP) technologies to strengthen fraud detection, continuous auditing, and institutional monitoring. This study examined the influence of AI-enabled AIS on auditing and fraud detection effectiveness within Nigeria’s financial services sector while additionally evaluating the moderating role of NLP. Anchored on the Fraud Diamond Theory and the Technology Acceptance Model, the study adopted a quantitative cross-sectional survey design. Primary data were collected from 186 professionals across banking, insurance, and FinTech institutions in Nigeria. Data were analysed using descriptive statistics, multiple regression, and hierarchical moderated regression techniques. The findings revealed that AI-enabled AIS significantly improves auditing and fraud detection effectiveness, particularly through prevention, detection, data analysis, and investigative capabilities. The results further indicated that NLP positively moderates the relationship between AI-enabled AIS and auditing effectiveness by improving semantic interpretation and analytical explainability. The study concludes that AI-enabled AIS and NLP are increasingly important for strengthening fraud governance, regulatory accountability, and institutional trust within emerging digital financial environments.
[AI-130] AI Assistance for Human Review of Default Judgments
链接: https://arxiv.org/abs/2607.01256
作者: Theodora Worledge,Othman Bensouda Koraichi,Daniel Bernal,Aviv Caspi,Tatsunori Hashimoto,Carlos Guestrin,David Freeman Engstrom
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Under Review
Abstract:Overwhelmed courts in the United States review millions of default judgments each year. Unfortunately, such manual reviews are time-consuming and prone to error. In an audit of 188 debt collection cases granted default judgment by the Superior Court of Los Angeles, we find that 4% contained major defects that should have entirely prevented default judgment, 10% contained inconsistencies requiring reduced judgments, and 32% contained errors requiring amendment prior to judgment. To support courthouses in default judgment review, we collaborated with courthouse attorneys and judges in designing a Default Assistant. The Default Assistant employs large language models to evaluate a case with respect to predetermined legal requirements and provide cited recommendations for an expert user’s review. We equip users to verify these recommendations by grounding the assistant’s explanations in cited quotes and tables from the original case filings. We conduct a controlled study with 66 law students that conservatively simulates court review, with more time and resources than court staff. We nevertheless find users aided by the Default Assistant were 6.0% more accurate on the average requirement than unaided reviewers (p 1.0e-4). Simultaneously, users were 25.9% faster in reviewing the average requirement than unaided reviewers (p 2.5e-10). Statutory requirements demanding extensive document search realized the largest gains, with error reductions and time savings from AI assistance up to 62% and 34%, respectively, relative to unassisted user performance and with differences statistically significant (p 0.05). Our work provides a proof-of-concept that AI assistants with citations have the potential to help resource-constrained courts conduct default judgment review more accurately and efficiently.
[AI-131] Beyond Detection: Redesigning Assessment and Governande of Generative AI at the Universidad Politécnica de Madrid (UPM)
链接: https://arxiv.org/abs/2607.01255
作者: Jessica Díaz,Sonia Linio,Fernando Pescador,Daniel Martin-Fabiani
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Universities have responded to generative artificial intelligence (GenAI) in noticeably different ways, both internationally and within Spain. So far, the dominant reaction has been defensive, this is, most institutions frame the debate around AI detection, plagiarism, academic integrity and a presumed drop in student effort, prioritizing basic training for academic staff over students. Other group of pioneering universities is doing the opposite, pursuing deeper adoption, and assuming that any policy built on prevention or sanction will not hold. This paper sides with that second view. Obsessing about detection is a dead end, since generated text is increasingly hard to distinguish from human writing, and detectors still misfire too often to be trusted. What universities need instead is a coordinated effort to set clear, course-by-course rules for GenAI use, redesign assessment toward authentic and interdisciplinary assessment that fosters critical thinking and learner autonomy, and build a serious AI-literacy programme that treats students as critical co-creators rather than passive users. The challenge, though, is not only pedagogical. Adoption at university scale also raises organisational, technical, operational, legal and economic questions that have to be solved together. In this context, the Universidad Politécnica de Madrid (UPM) is developing a strategic and sustainable AI policy and adoption framework structured around six dimensions, in which AI functions as an enabler of student autonomy and pedagogical innovation rather than as a threat to be policed.
[AI-132] How Indian Dermatologists are Utilizing Artificial Intelligence for Clinical Practice and Workflow Management: A Nationwide Survey with a Special Focus on atopic dermatitis
链接: https://arxiv.org/abs/2607.01252
作者: Dipayan Sengupta,Saumya Panda,Sandipan Dhar,Dipankar De,Deepika Pandhi,Narayanan B
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 28 pages, 5 tables
Abstract:Background: Dermatology AI has mainly focused on image-based diagnosis, while chronic disease workflows have received less attention. We surveyed Indian dermatologists to map routine clinical challenges, with a focus on atopic dermatitis (AD), and assess current AI use. Methods: A nationwide cross-sectional survey commissioned by the Society for Eczema Studies included 377 practicing Indian dermatologists. The survey assessed clinical challenges, AD workflow barriers, AI use, adoption barriers, and ethical concerns. Analyses used descriptive statistics, chi-square tests, false discovery rate correction, and multivariable logistic regression. Results: Patient adherence (61.3%) and treatment planning in difficult or refractory cases (57.0%) were reported more often than diagnostic uncertainty (48.0%). In AD care, severity scoring was reported as a challenge by 47.7% and had the lowest satisfaction among measured workflow areas. Current AI use was reported by 49.9%, most often involving general large language models for literature synthesis, documentation, and academic tasks rather than specialized image analysis. Barriers differed by experience: dermatologists with more than 20 years of practice more often cited lack of training, while those with 5 years or less more often cited lack of clinical utility after trying AI tools. AI users were more likely than non-users to report concern about patient self-misdiagnosis and anxiety, which remained significant after adjustment for experience and academic affiliation. Conclusion: Respondents reported using general-purpose AI mainly for cognitive and administrative tasks, while their clinical needs centered on chronic disease management and AD workflow support. Clinician-supervised workflow tools may be more useful than standalone diagnostic applications. Comments: 28 pages, 5 tables Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI) Cite as: arXiv:2607.01252 [cs.CY] (or arXiv:2607.01252v1 [cs.CY] for this version) https://doi.org/10.48550/arXiv.2607.01252 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Dipayan Sengupta [view email] [v1] Wed, 3 Jun 2026 10:18:55 UTC (19 KB)
[AI-133] Collaborative Disagreement Resolution for Scalable Oversight ICML2026
链接: https://arxiv.org/abs/2607.01251
作者: Yuyang Jiang,Chacha Chen,Teng Wu,Liwen Sun,Han Liu,Shi Feng,Chenhao Tan
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 27 pages, 6 figures. Accepted to ICML 2026. Codebase link: this https URL
Abstract:Debate, where AI agents argue opposing positions, has emerged as a key approach to scalable oversight. However, debate faces a fundamental tension: models are incentivized to be persuasive to the judge, which may not always align with epistemic honesty. In this work, we propose an alternative paradigm: disagreement resolution, which reframes the interaction mechanism from adversarial debate to collaborative truth seeking. Drawing on principles from human mediation and conflict resolution, where mediators facilitate dialogue to help disputing parties reach consensus rather than adjudicating between them, we design an automated pipeline that adapts these strategies to AI oversight. Unlike standard debate where models argue for fixed positions, our pipeline directs models to collaboratively identify points of disagreement, examine the evidence for conflicting claims, and converge toward consensus or isolate the specific ‘‘crux’’ of their disagreement. We find that Disagreement Resolution consistently helps non-expert models identify the truth, achieving 62.1% judging accuracy compared to 49.2% for standard debate. Our results provide encouraging empirical evidence for rethinking the scalable oversight protocol from adversarial persuasion to collaborative truth-seeking.
[AI-134] A Practice Auditing Framework for Large Language Model Use: Collective Empiricism Pseudo-Rational Cognition and Governance of AI-Generated Content
链接: https://arxiv.org/abs/2607.01248
作者: Yang Zhao,Yingshuo Li,Zeyu Zhang
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: English manuscript. 2 tables, 17 references
Abstract:Large language models are increasingly used for knowledge acquisition, code generation, academic writing, and agent-based automation. In these settings, users may obtain highly structured answers, plans, and judgments without sufficient domain practice. This paper proposes a practice auditing framework for LLM use and AI-generated content governance. It introduces collective empiricism to describe how LLMs compress and reorganize large-scale human experience into outputs that appear empirical and rational, and pseudo-rational cognition to describe how users may mistake AI-generated structured expression for their own rational understanding. The paper analyzes AI subjectivity illusion, subjectivity structures in input materials, template loops in AI-AI conversations, statistical misjudgment in AIGC detection, and memory pollution when generated content enters future contexts, long-term memory, retrieval spaces, or agent skill systems. To reduce these risks, the paper proposes an auditing process based on requirement definition, problem-boundary identification, evidence-source auditing, practical validation, reverse questioning, logging, version management, rollback, and renewed cognition. The framework does not reject AI productivity; it argues that LLM outputs should be returned to verifiable, reproducible, and intervenable processes of practice. The paper provides a conceptual and auditable framework for cognitive risks in LLM interaction, AI-generated content governance, long-term memory systems, and human-AI interaction.
[AI-135] LLM s as Teaching Assistants for Mathematics Exam Grading: Reliability and Practical Usability
链接: https://arxiv.org/abs/2607.01247
作者: Aastha Sapkota,M. G. Sarwar Murshed
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 12 Pages, 6 figures
Abstract:Open-ended mathematics exams are valuable because they assess reasoning, proof construction, algorithmic thinking, and communication of intermediate steps. They are also difficult to grade at scale because instructors must apply partial-credit rubrics consistently while giving feedback that helps students repair misconceptions. This paper evaluates six contemporary large language model (LLM) configurations, Gemini 3.1 Pro Extended, Gemini 3.5 Flash, ChatGPT 5.5 Pro Extended, ChatGPT 5.5 Thinking, Claude Pro Opus 4.7, and Claude Sonnet 4.6, as grading assistants for an undergraduate discrete mathematics examination. The study compares two grading policies. The BASELINE policy uses a stricter rubric-following prompt that emphasizes explicit evidence and complete justification. The LIBERAL policy was added after preliminary grading showed that the baseline condition sometimes applied harsh point deductions and failed to recognize valid partial reasoning. Agreement with human grading is measured at both the question and exam-total levels using mean absolute error, root mean squared error, normalized root mean squared error, Pearson correlation, and exact agreement. The results show that liberal partial-credit prompting reduces average question-level error for every evaluated model family. ChatGPT 5.5 Thinking (LIBERAL) has the lowest average question-level MAE (1.87) and RMSE (2.53), while Gemini 3.1 Pro Extended (LIBERAL) has the lowest total-score MAE (8.00) and RMSE (10.66). However, the strongest total-score Pearson correlation occurs under Gemini 3.1 Pro Extended (BASELINE) at 0.58, showing that point calibration and rank preservation remain distinct goals. We also report practical usability observations.
[AI-136] he Dual Nature of LLM Persona: Aggregated Tendencies and Frame-Dependent Geometry
链接: https://arxiv.org/abs/2607.02368
作者: Yuan Yuan
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Differential Geometry (math.DG)
备注:
Abstract:Evaluations of LLM personas via psychometric questionnaires typically rely on aggregate scores, discarding within-instance correlation structure. We test whether this geometric structure is intrinsic or frame-dependent. Constructing within-instance correlation matrices from IPIP-50 responses, we analyze geometry on SPD manifolds under manipulated question orderings in GPT-4o simulating American and Chinese-American personas. We find that persona expression comprises two dissociable components: aggregated features (Big Five scores) degrade under randomization (21% drop) but are frame-robust; geometric features (SPD manifold) collapse under frame misalignment (42% drop) but recover substantially (to 84%) under shared frames, surpassing aggregated features (76%). This collapse-recovery pattern reveals that persona geometry is not intrinsic but a frame-dependent coordination pattern encoding information invisible to aggregation. Our findings establish a dual-nature framework for LLM personas, frame-dependent geometry versus frame-robust aggregates, necessitating frame-aware evaluation and challenging static trait conceptions.
[AI-137] Stable Self-Modulating Quantum Fast-Weight Programmers with Bounded Memory Gates
链接: https://arxiv.org/abs/2607.02363
作者: Kuo-Chung Peng,Jiun-Cheng Jiang,Chun-Hua Lin,Yifeng Peng,Junghoon Justin Park,Huan-Hsin Tseng,Hsin-Yi Lin,Kuan-Cheng Chen,Chen-Yu Liu,Shinjae Yoo,Samuel Yen-Chi Chen
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: 16 pages, 8 figures
Abstract:Quantum Fast-Weight Programmers (QFWPs) store temporal information in dynamically programmed variational-circuit parameters rather than in nonlinear recurrent hidden states, offering a practical route to quantum sequence modeling. Self-Modulating QFWP improves this framework by using input-dependent gates for both new fast-weight updates and the accumulated fast-weight state, but its unbounded old-state multiplier can diverge in long-sequence regimes. We propose a bounded old-state modulation rule that applies a sign-preserving tanh gate only to the recurrent memory branch while leaving the additive update and new-update modulation unchanged. We evaluate standard QFWP, full Self-Modulating QFWP, Only-New, and Only-Old variants on two CUDA-Q quantum-dynamics forecasting tasks and on Milan SMS telecommunication activity prediction. The quantum-dynamics results show that old-state modulation is the most consistent source of improvement over Standard QFWP, and that bounding the old-state gate removes long-sequence divergence while improving aggregate robustness. On Milan SMS forecasting, the original unbounded Self-Modulating QFWP converges across the tested grid and shows its clearest gains at longer input windows, with behavior close to the Only-Old ablation. These findings identify accumulated-memory modulation as the key mechanism of Self-Modulating QFWP and bounded old-state gating as a targeted stabilization strategy.
[AI-138] An Efficient vLLM -Based Inference Pipeline for Unified Audio Understanding and Generation
链接: https://arxiv.org/abs/2607.02119
作者: Haoran Wang,Jinchuan Tian,Siddhant Arora,Shinji Watanabe
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注:
Abstract:While Large Multimodal Models excel in comprehension, high-throughput inference engines lack native support for multimodal generation. This is severe in Speech Language Models, where generating multi-layered audio tokens via decoupled AR+NAR or synchronous Multi-Token Prediction (MTP) with delay-pattern interleaving conflicts with standard single-stream loops. We present a vLLM-based inference pipeline for unified speech understanding and generation. We extend autoregressive decoding to natively execute delay-pattern de-interleaving and coordinated multi-stream sampling, integrating an on-GPU acoustic decoder for end-to-end waveform synthesis. Crucially, we overcome the shared intuition that Classifier-Free Guidance (CFG) halves throughput. By co-scheduling paired conditional and unconditional requests within a continuous batch, our CFG implementation sustains 80% of non-CFG throughput, absorbing dual-request and logit merging overheads. We open-source our framework.
[AI-139] Scene-Conditioned PINN-GNN for Multipath RF Maps: Cross-Scene Generation and In-Scene Completion
链接: https://arxiv.org/abs/2607.01777
作者: Lizhou Liu,Xiaohui Chen,Zihan Tang,Mengyao Ma,Wenyi Zhang
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:
Abstract:Radio frequency (RF) maps provide a compact representation of multipath propagation characteristics and are fundamental to channel modeling, coverage analysis, and environment-aware wireless optimization. This paper proposes a unified RF map construction framework based on a physics-informed neural network (PINN) and a graph neural network (GNN), supporting both cross-scene generation and in-scene completion with 2D and 2.5D environmental representations. The PINN embeds electromagnetic propagation constraints to establish a physically consistent mapping from receiver locations to multipath parameters, including path gain, time of arrival, and angles, while the GNN enforces spatial consistency by modeling correlations among neighboring receivers. To comprehensively evaluate multipath reconstruction quality, we propose a peak-weighted dynamic time warping metric that jointly accounts for amplitude errors and peak delay misalignment in channel impulse responses. Extensive experiments demonstrate that the proposed method consistently outperforms image-based, diffusion-based, and interpolation baselines across both map-level and multipath-level metrics, achieving robust generalization and high-fidelity RF map construction under sparse observations.
[AI-140] Decentralized Stochastic Subgradient-type Methods with Communication Compression for Nonsmooth Nonconvex Optimization
链接: https://arxiv.org/abs/2607.01755
作者: Siyuan Zhang,Nachuan Xiao,Xin Liu
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 36 pages
Abstract:In this paper, we consider the nonsmooth nonconvex decentralized optimization problem, where inter-agent communication is compressed. We propose a general framework that unifies various decentralized stochastic subgradient-type methods with unbiased compression and contractive compression with error compensation. By relating the consensus-error iterates and the averaged iterates to the trajectories of continuous-time differential inclusions, we establish global convergence for all methods encompassed by our framework when the objective functions are nonsmooth and lack Clarke regularity. Based on our framework, we further develop several compression-based methods, including decentralized stochastic subgradient methods utilizing sign-based regularization and gradient-tracking momentum. Preliminary numerical experiments empirically support our theoretical results and highlight the communication-accuracy trade-off of the newly developed methods.
[AI-141] Full Bayesian Reinforcement Learning via LF-IBIS
链接: https://arxiv.org/abs/2607.01741
作者: Stefano Masini,Cecilia Viscardi,Michela Baccini
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 37 pages, 12 figures, 4 tables
Abstract:Reinforcement Learning (RL) is a sequential decision-making framework in which an agent learns optimal policies through interaction with an environment by maximizing cumulative rewards. Among RL methods, Bayesian Reinforcement Learning (BRL) addresses common practical challenges related to data scarcity by leveraging prior knowledge about the environment and sequential belief updates. However, most BRL approaches require an explicit likelihood function, which is frequently inaccessible or intractable in real-world settings. We propose Likelihood-Free Iterated Batch Importance Sampling (LF-IBIS), a novel algorithm for BRL that updates the agent’s beliefs online as new interactions become available. By combining Approximate Bayesian Computation with Iterated Batch Importance Sampling, LF-IBIS enables full Bayesian inference in settings where the environment dynamics are not described by an explicit or tractable likelihood. The method yields approximate posterior distributions over both environment parameters and optimal policies, providing a quantification of policy uncertainty useful for a Bayesian treatment of the exploration-exploitation trade-off. We test the method on a simulation study in response-adaptive randomization in clinical trials, where closed-form posteriors enable validation. Additional experiments address settings where the posterior has no closed form and illustrate online policy updating based on the posterior distribution of the optimal policy. Comments: 37 pages, 12 figures, 4 tables Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2607.01741 [stat.ML] (or arXiv:2607.01741v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2607.01741 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-142] AI-enabled gravitational-waves searches for binary neutron stars at optimal sensitivity
链接: https://arxiv.org/abs/2607.01372
作者: Bhavya Gupta,Deep Chatterjee,William Benoit,Ethan Marx,Christina Reissel,Seiya Tsukamoto,Kyungseop Yoon,Michael W. Coughlin,Philip Harris,Erik Katsavounidis
类目: High Energy Astrophysical Phenomena (astro-ph.HE); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
备注:
Abstract:Gravitational Waves (GWs) represent the newest window of astronomy, furthering our understanding of compact objects like black holes and neutron stars in the Universe. The signal from two merging neutron stars is especially interesting since it brings the prospect of concordant electromagnetic and neutrino emissions. Such multi-messenger observations have a transformational impact on fundamental physics, nuclear matter, astrophysics, and gravity. It was first witnessed in 2017 with the detection of the binary neutron star (BNS) merger GW170817. However, searching for BNS signals in real-time in the LIGO-Virgo-KAGRA (LVK) GW detectors presents a computational challenge, as the data streaming out must be matched against \sim million reference waveforms, which requires up to a thousand CPU cores. We present a different approach using neural networks to learn the presence of a signal in the data. Our algorithm, called Aframe, was deployed in the LVK’s fourth observing run and was the first artificial intelligence (AI)-enabled search to detect multiple binary black holes (BBHs) live. In this work, we demonstrate that the approach extends to the lower-mass BNS regime, and is the first AI-enabled search that achieves sensitivity comparable to matched-filter pipelines at lower computational and latency costs. The challenge of the longer-duration BNS signals is addressed by heterodyning the data, following which the network architecture used for BBHs is sufficient to distinguish signal versus background. We also show that this analysis requires a single non-flagship GPU for online deployment. Furthermore, the design and adoption of inference-as-a-service tools allow rapid offline analysis using a distributed pool of GPU resources. Hence, aside from the use case of rapid online data analysis, we also establish the use of Aframe for efficient archival data analysis.
[AI-143] Mechanistic Interpretability and Causal Feature Steering of Neural Quantum States via Sparse Autoencoders
链接: https://arxiv.org/abs/2607.01336
作者: Zihao Qi,Christopher Earls
类目: Quantum Physics (quant-ph); Disordered Systems and Neural Networks (cond-mat.dis-nn); Strongly Correlated Electrons (cond-mat.str-el); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, 7 figures. Comments welcome!
Abstract:Neural Quantum States (NQS) are a remarkably expressive class of variational ansätze for quantum many-body wavefunctions, yet little is understood about their internal mechanisms: trained on variational objectives alone, how do NQS accurately capture physical observables that they have never been explicitly optimized for? In this work, we present a systematic approach to analyze the internal activations of NQS using sparse autoencoders. We extract features from the residual stream and demonstrate that these features strongly correlate with physical observables such as order parameters, staggered magnetization, and half-chain correlators, across both ground state representation and real-time dynamics. Remarkably, the discovery of these features is entirely unsupervised, with no physical labels provided. We further establish that such features causally affect the corresponding observables predicted by NQS, by showing that targeted, post-training intervention on a \textitsingle feature smoothly and monotonically steers the corresponding observable, while leaving the variational energy nearly unchanged. These results demonstrate that NQS are not merely functional approximators, but encode rich, interpretable internal representations of physical information. Our approach provides both a diagnostic and an intervention tool for NQS, and serves as a foundation for using mechanistic interpretability towards more reliable, transparent NQS.
机器学习
[LG-0] Controllable Sim Agents with Behavior Latents
链接: https://arxiv.org/abs/2607.02496
作者: Juanwu Lu,Junyu Zhu,Ziran Wang
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 23 pages, 5 tables, 8 figures
Abstract:Realistic traffic simulation requires agents that imitate logged behavior and can also be steered along interpretable axes. Such controllability enables engineers to isolate variables, reproduce specific edge cases, and test autonomous systems without real-world risk. We introduce Controllable Neural Variational Agents (CNeVA), a controllable simulated-agent framework that learns to infer a per-agent Gaussian behavior latent from per-channel discounted returns via a closed-form conjugate variational update, conditioning a rectified-flow trajectory generator trained on a mixed channel-mask curriculum for classifier-free guidance. To tackle scarcity in reward signals, we propose soft eligibility gates that replace hard binary thresholds with smooth exponential decay, preserving the gradient signal for near-threshold agents. On the Waymo Open Motion Dataset, CNeVA attains competitive realism on the benchmark while exposing per-channel controllability that the higher-ranked imitation models lack. Speed- and acceleration-based steering produces monotone responses without stall-induced reward hacking. Safety controllability is monotone and substantial with the introduction of soft eligibility. We manage to achieve steerable map compliance under a context-residual return measure. Furthermore, our experiment demonstrates that steering metrics must be read alongside physical-plausibility guardrails to avoid reward-hacking confounds.
[LG-1] Understanding the Robustness of Distributed Self-Supervised Learning Frameworks Against Non-IID Data ICLR2026
链接: https://arxiv.org/abs/2607.02447
作者: Xuanyu Chen,Nan Yang,Shuai Wang,Dong Yuan
类目: Machine Learning (cs.LG)
*备注: Accepted at ICLR2026
Abstract:Recent research has introduced distributed self-supervised learning (D-SSL) approaches to leverage vast amounts of unlabeled decentralized data. However, D-SSL faces the critical challenge of data heterogeneity, and there is limited theoretical understanding of how different D-SSL frameworks respond to this challenge. To fill this gap, we present a rigorous theoretical analysis of the robustness of D-SSL frameworks under non-IID (non-independent and identically distributed) settings. Our results show that pre-training with Masked Image Modeling (MIM) is inherently more robust to heterogeneous data than Contrastive Learning (CL), and that the robustness of decentralized SSL increases with average network connectivity, implying that federated learning (FL) is no less robust than decentralized learning (DecL). These findings provide a solid theoretical foundation for guiding the design of future D-SSL algorithms. To further illustrate the practical implications of our theory, we introduce MAR loss, a refinement of the MIM objective with local-to-global alignment regularization. Extensive experiments across model architectures and distributed settings validate our theoretical insights, and additionally confirm the effectiveness of MAR loss as an application of our analysis.
[LG-2] Extreme Adaptive Transformer for Time Series Forecasting
链接: https://arxiv.org/abs/2607.02437
作者: Sanjeev Shrestha,Hui Liu,Yifan Zhang
类目: Machine Learning (cs.LG)
*备注: Submitted to Scientific Reports
Abstract:Time series forecasting remains challenging when the underlying data contain rare but critical extreme events. This issue is particularly important in hydrologic forecasting, where streamflow distributions are often highly skewed and extreme peaks can have substantial impacts on flood monitoring, water resource management, and early warning systems. Although Transformer-based forecasting models have achieved strong performance by modeling long-range temporal dependencies, they typically treat all time points uniformly and may therefore underrepresent rare extreme patterns. In this paper, we propose the Extreme-Adaptive Transformer (Exformer), a forecasting framework designed to explicitly model temporal dependencies involving both normal and extreme events. Exformer introduces an extreme-adaptive attention mechanism composed of three sparse components: Local, Stride, and Extreme. The Local and Stride components capture short-term and periodic temporal dependencies, respectively, while the Extreme component selectively models event-aware dependencies between normal and extreme streamflow patterns. Experiments on four real-world hydrologic streamflow datasets show that Exformer achieves superior 3-day forecasting performance compared with state-of-the-art baselines. Our findings demonstrate that explicitly incorporating extreme-aware attention improves the forecasting capacity of Transformer models on imbalanced time series with rare but consequential events.
[LG-3] WattGPU: Predicting Inference Power and Latency on Unseen GPUs and LLM s IJCAI2026
链接: https://arxiv.org/abs/2607.02391
作者: Mauricio Fadel Argerich,Jonathan Fürst,Marta Patiño-Martínez
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: Accepted at 1st Workshop on Sustainability and Resource-Efficiency of Artificial Intelligence @ IJCAI 2026
Abstract:Large Language Model (LLM) inference workloads are a rapidly growing contributor to data center energy consumption. Optimizing these deployments requires matching specific LLMs to the most efficient GPUs, but operators currently lack the tools to do so without exhaustively profiling each combination. While some predictive models exist, they still require profiling data and struggle to generalize to hardware unseen during training. To address this, we introduce \textitWattGPU, featuring two predictive models for mean GPU power draw and Inter-Token Latency (ITL). Our approach leverages only publicly available LLM metadata and GPU specifications, eliminating the need for hardware access or profiling while enabling generalization to unseen NVIDIA server-grade GPUs and LLMs. We evaluate our models using rigorous leave-one-GPU-out and leave-one-LLM-out cross-validation on a dataset of 42 open-source LLMs (0.1B–27B parameters) and 8 GPUs under both offline and server scenarios. The mean power draw model achieves a median absolute percentage error of \leq3.4% for offline and \leq13.5% for server scenarios on unseen GPUs, while the latency model achieves \leq8.5% in server mode, both maintaining strong GPU ranking correlations for server scenarios (Kendall \tau\geq0.76 ). Compared to standard physically grounded baselines – Load-Scaled Thermal Design Power (TDP) for power draw and roofline for latency – our models reduce median absolute percentage error by approximately 4 \times on unseen LLM-GPU combinations for server scenarios or approximately 2 \times for completely unseen GPUs. WattGPU’s data and code are publicly available at this https URL.
[LG-4] DecompRL: Solving Harder Problems by Learning Modular Code Generation
链接: https://arxiv.org/abs/2607.02390
作者: Juliette Decugis,Fabian Gloeckle,Francis Bach,Taco Cohen,Gabriel Synnaeve
类目: Machine Learning (cs.LG)
*备注:
Abstract:How can Large Language Models (LLMs) solve problems they currently cannot? Repeated sampling scales test-time compute but GPU cost grows linearly with attempts, while reinforcement learning (RL) with verifiable rewards improves single-attempt accuracy at the expense of sample diversity. Both strategies ultimately fail when the base policy has near-zero probability of producing a correct solution: no amount of sampling or gradient signal can overcome a search space that is simply too large. We take a different approach: rather than sampling harder, we make the task easier by decomposing problems into smaller, independently solvable sub-functions whose implementations can be recombined. Since off-the-shelf models are not trained for this modular generation, we introduce DecompRL, an RL algorithm that explicitly learns to decompose and implement hierarchical code structures. Recombining k implementations of n modules yields up to k^n candidate solutions, shifting the bottleneck from GPU inference to cheap CPU evaluation and cutting GPU token cost by \sim 50 \times . On LiveCodeBench and CodeContests (Qwen~2.5~7B, Code World Model~32B), DecompRL outperforms standard and diversity-optimized RL baselines beyond 10^5 tokens per problem, solving problems that standard generation cannot reach.
[LG-5] One More Time: Revisiting Neural Quantum States from a Reinforcement Learning Perspective
链接: https://arxiv.org/abs/2607.02292
作者: Juan Agustín Duque,Sergio García Heredia,Vinicius Hernandes,Eliška Greplová,Thomas Spriggs,Aaron Courville,Anna Dawid
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Quantum Physics (quant-ph)
*备注: 34 pages, 11 figures
Abstract:Neural quantum states (NQS) provide a flexible and scalable framework for approximating quantum many-body wavefunctions. Among NQS parameterizations, autoregressive models are especially attractive because they enable exact, independent sampling from the Born distribution, avoiding the autocorrelation and mixing issues of Markov chain methods. Yet their optimization remains comparatively underexplored: Adam is a scalable method but ignores function space geometry, while stochastic reconfiguration is principled but costly and numerically fragile in large models. To address this gap, we show that variational energy minimization can be viewed as an advantage policy-gradient problem over the Born distribution, motivating trust-region optimization for NQS training. We introduce Proximal Wavefunction Optimization (PWO), a principled trust-region algorithm that clips probability-ratio changes in the amplitude channel and phase increments in the phase channel. PWO avoids explicit matrix inversion, reuses samples across multiple updates, and combines the scalability of first-order optimization with theoretical guarantees. Across Ising and frustrated J_1 - J_2 one- and two-dimensional spin systems, PWO improves stability and wall-clock convergence over Adam, minSR, and SPRING. Finally, we fine-tune a 1.5 B-parameter RWKV-7 model, demonstrating NQS optimization at a scale over three orders of magnitude beyond prior work.
[LG-6] Dendritic In-Context Learning in a Single-Layer Spiking Neural Network
链接: https://arxiv.org/abs/2607.02283
作者: Juwei Shen,Yujie Wu,Changwen Chen
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: 26 pages
Abstract:In-context learning (ICL) operates via implicit gradient descent embedded in the forward pass of modern AI architectures – Transformers, Mamba, state-space models, and MLPs. Capturing this capability in biologically plausible Spiking Neural Networks (SNNs) has remained an open challenge: existing SNNs fail the Garg-2022 benchmark at non-trivial task dimensions. We trace this failure to a structural assumption: prior SNN designs route adaptation through inference-time synaptic plasticity, viewing the dendritic compartment as a passive conduit for error or teacher signals. We challenge this assumption. The subthreshold dynamics of a single dendritic compartment already implement a complete online learning algorithm. By treating the compartment as the computational substrate rather than a passive conduit, we propose DendriCL – a single-layer compartmental spiking architecture whose apical recurrence is structurally identical to leaky online Widrow-Hoff LMS. This dynamics-only update collapses the architectural depth required for general-purpose ICL to a single layer. DendriCL is uniquely seed-stable at super-dimensional Garg-2022 ICL – where dense Transformers exhibit grokking-style instability and fail past moderate task dimension – and a linear probe recovers the reference online-LMS trajectory directly from the apical membrane at R^2 = 0.93, showing the algorithm is structurally embedded in the dynamics rather than implicitly discovered during training. Taken together, ICL requires neither attention, depth, nor inference-time plasticity: a single compartment with online-LMS dynamics is sufficient.
[LG-7] Self-explainable Operator Learning for Discovering Spatial Patterns in Functional Data
链接: https://arxiv.org/abs/2607.02203
作者: Mojgan Alishiri,Amirhossein Arzani
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:
Abstract:Operator learning has emerged as a powerful tool for modeling complex physical systems in functional spaces. However, their neural network-based architectures make them opaque models, obscuring the reasoning behind their predictions. In this work, we introduce a self-explainable operator learning framework that overcomes this challenge by reformulating operator learning as a linear combination of generalized functional linear models expressed through integral equations. Exploiting the additive decomposability of these integral equations, we divide the input domain into subdomains and compute localized integrals to evaluate the contribution of each region to the final prediction. This decomposition enables direct interpretability where the model explains both inputs and outputs by linking specific input regions to corresponding output patterns, thereby revealing which spatial features drive predictions. We demonstrate the framework on function-to-scalar and function-to-function mappings in fluid flow problems involving blood flow and unsteady aerodynamics. The results show that the operator most often prioritizes regions with strong feature gradients, providing physically meaningful insight into the model’s decision-making process. Comparisons with established post-hoc explainability methods demonstrate qualitative agreement while highlighting the key advantage of the proposed approach: explainability is embedded directly within the operator structure itself and does not require an external tool. Therefore, our framework provides a mathematically transparent and physically interpretable approach to uncover relationships within data, fostering trust in machine learning for scientific applications by enabling more informed data-driven analysis of physical systems.
[LG-8] Online Resource Allocation with Continuous Random Consumption: Regret under Degeneracy
链接: https://arxiv.org/abs/2607.02196
作者: Jiawei Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study online resource allocation when both rewards and consumption sizes may be continuously distributed. Requests arrive sequentially and must be accepted or rejected irrevocably under fixed resource capacities. Each request belongs to one of finitely many observable types; conditional on an observable request type, both the reward and the scalar size are random, and the realized size scales a fixed type-specific resource-consumption vector. The model allows the deterministic fluid relaxation to be degenerate. We show that additive regret is governed by the size-weighted mass of requests whose value-to-size ratios lie near the active acceptance cutoffs. We formalize this quantity through an active weighted-mass exponent p. When p 1, this cutoff mass is thin, and the problem is genuinely hard: every online policy must incur regret of order at least T^1/2 - 1/(2p) , and this holds for every p 1. A sample-path marginal policy matches this lower bound up to polylogarithmic factors; and when p = 1, so that the mass grows linearly near the cutoff, it attains O((\log T)^2) regret. For example, if the size and the value-to-size ratio are independent and uniformly distributed, then p = 1; if instead the size and the reward are independent and uniformly distributed, then p = 2. Thus the policy achieves o(\sqrtT) regret throughout this regularity class without any fluid non-degeneracy assumption, allowing both primal degeneracy and dual non-uniqueness. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2607.02196 [cs.LG] (or arXiv:2607.02196v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2607.02196 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-9] An Optimisation Framework for the Well-Conditioned Training of Physics-Informed Neural Networks
链接: https://arxiv.org/abs/2607.02194
作者: Joseph Webb,Sadok Jerad,Coralia Cartis
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC); Computational Physics (physics.comp-ph)
*备注:
Abstract:Physics-informed neural networks (PINNs) have emerged as a promising route to solve partial differential equations, yet they have struggled to reach the precision of classical solvers. The obstacle is increasingly understood to be one of optimisation, owing to the severely ill-conditioned loss landscape. We present \textbfDSGNAR : Doubly-Sketched Gauss-Newton with Adaptive Ratio, a scalable second-order optimisation framework that confronts this ill-conditioning and, in doing so, obtains unprecedented accuracy and speed. \textbfDSGNAR couples a doubly-sketched Gauss-Newton model with a novel strategy that carefully controls both regularisation and step length. Across a suite of problems spanning nonlinear, chaotic, multi-scale, high-dimensional, and Navier-Stokes, the framework greatly improves on the state of the art: able to attain relative \ell_2 errors as low as 3\times10^-16 in double precision, improve contemporary results by five orders of magnitude on the canonical Burgers’ equation, and as much as eight orders on a high-dimensional Poisson problem, while remaining markedly faster. We further show that, in single precision, solutions at the limit of round-off error can be obtained very quickly: Burgers’ equation to \ell_2^\textrel = 4.75 \times 10^-7 in under ten seconds. The framework is also robust to the choice of architecture, arithmetic precision, and initial hyperparameters. The code is available at this https URL Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC); Computational Physics (physics.comp-ph) Cite as: arXiv:2607.02194 [cs.LG] (or arXiv:2607.02194v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2607.02194 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-10] Privacy-Preserving and Verifiable Approximate Distributed Coded Computing
链接: https://arxiv.org/abs/2607.02187
作者: Xavier Martínez-Luaña,Alba Gude-Santos,Manuel Fernández-Veiga,Rebeca P. Díaz-Redondo
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Distributed machine learning enables collaborative model training without centralizing data, but it also exposes learning processes to privacy leakage and malicious manipulation. Existing defenses typically address these threats in isolation and are often tailored to specific learning paradigms or model architectures, limiting their applicability in realistic deployments. In particular, federated learning and decentralized learning exhibit distinct adversarial surfaces that are rarely addressed within a unified framework. In this paper, we present a model-agnostic framework for adversary-resistant distributed learning that jointly addresses privacy preservation and malicious behavior across both federated and decentralized settings. Our approach combines paradigm-specific defense mechanisms with GPBACC, a privacy-enhancing coded computing technique applicable to arbitrary machine learning models. For federated learning, we integrate robust aggregation strategies to mitigate the impact of malicious participants, while for decentralized learning we employ approximate decode-and-compare and group testing techniques to enable lightweight verification and adversary isolation without relying on a trusted aggregator. Crucially, we evaluate the proposed framework through an explicit, attack-driven analysis. We implement representative privacy attacks and malicious behaviors, and empirically demonstrate that the combination of GPBACC with robust aggregation and verification mechanisms significantly reduces privacy leakage and improves resilience against active adversaries. These results suggest that privacy-enhancing coded computing, when combined with appropriate adversary-resistance strategies, provides a practical and deployable foundation for secure distributed machine learning.
[LG-11] ght Lower Bounds for the Multi-Secretary Problem via Bellm an Certificates
链接: https://arxiv.org/abs/2607.02150
作者: Jiawei Zhang
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:
Abstract:This paper studies additive regret in the multi-secretary problem, defined as the gap between the expected offline prophet reward and the reward of the best online policy. Prior work established (O(\log T)) regret for bounded-density distributions with connected support and (O((\log T)^2)) upper bounds for bounded-density distributions with support gaps. It was unknown whether the extra logarithmic factor is necessary even in the one-resource model. We prove that it is necessary. For a mixture of two separated uniform distributions at the critical capacity, the optimal regret grows at least on the order of ((\log T)^2). Thus the existing (O((\log T)^2)) upper bounds for bounded-density gapped instances, including those implied by network revenue management models with continuous rewards, are tight in this simplest specialization. The same framework also yields a matching lower bound for gapped distributions whose gap-facing densities vanish near the support edges; this companion result is given in the appendix. The proofs use Bellman certificates: feasible solutions to a relaxation of the exact Bellman recursion. This framework converts lower bounds into explicit certificate constructions and identifies why support gaps permit larger regret.
[LG-12] Probing Chemical Language Models: Effects of Pre-training and Fine-tuning
链接: https://arxiv.org/abs/2607.02140
作者: Anna Karnysheva,Dietrich Klakow,Ji-Ung Lee
类目: Machine Learning (cs.LG)
*备注:
Abstract:Chemical language models (CLMs) are trained with linearized representations such as SMILES, yet it remains unclear which chemically meaningful substructures they encode. To foster a better understanding of CLMs, we conduct a systematic study and probe for 78 molecular substructures across eight pre-trained and six randomly initialized models. We furthermore study how fine-tuning on chemical downstream tasks affects the learned representations of molecular substructures. Our results show that pre-training generally improves molecular structure awareness of CLMs, particularly in the upper layers. Moreover, randomly initialized models already encode ring structures well in the first layer. Our analysis on two chemical downstream tasks further reveals that, interestingly, fine-tuning affects task-relevant molecular substructures more than others, indicating that the changes in the representations follow chemical theory.
[LG-13] Predictive Conformal Slip Monitoring: An Empirical Evaluation of Rolling Split Conformal Prediction for Pre-Incident Traction Loss Detection
链接: https://arxiv.org/abs/2607.02124
作者: Varshith Roy Kotla
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: 10 pages, 4 tables. codes and data available at: this https URL
Abstract:Conventional traction control architectures intervene only after the adhesion limit of a tire has already been breached. This paper investigates whether Rolling Split Conformal Prediction , monitoring the volatility of non-conformity residuals from a per-driver Random Forest model of expected slip behavior , can serve as a statistically grounded pre-incident warning signal, ahead of gross traction loss. Unlike an earlier internal draft of this work, the evaluation reported here corrects a confound in the slip proxy (vehicle speed is included as an explicit model feature, not left implicit in the target’s denominator), uses every racing lap for each driver rather than only the fastest lap, and is scored against real, timestamped incident labels extracted from FIA Race Control Messages and track-limits lap deletions rather than narrated post-hoc. The result is negative: across 19 drivers and 55,563 test-phase telemetry samples, the rolling-volatility detector achieves a mean precision of essentially 0.0 and mean recall of 0.0 against 14 ground-truth incidents, while flagging on average 15.3% of all samples as anomalous , too high a false-alarm rate for any early-warning use. A static 95th-percentile threshold baseline performs no better in any way that would justify the added complexity of the conformal-volatility formulation. Residual autocorrelation diagnostics show the split-conformal exchangeability assumption is violated for every driver (Ljung-Box p 0.001, n = 19/19), which is one plausible driver of the high false-alarm rate. We report this as a methodologically rigorous negative finding, diagnose its likely causes, and outline what a genuinely predictive version of this approach would require.
[LG-14] Ask the Right Comparison:Bias-Aware Bayesian Active Top-k Ranking with LLM Judges
链接: https://arxiv.org/abs/2607.02104
作者: Jian Xu,Delu Zeng,John Paisley,Qibin Zhao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models (LLMs) are increasingly used as cheap, scalable judges that compare candidate outputs pairwise – to rank responses, select models, or triage papers. Yet LLM judges are both noisy and systematically biased: they favor verbose or well-formatted answers and exhibit position effects, so simply aggregating their votes recovers a ranking of presentation, not of true quality. We study the practical goal of identifying the \topk items under a fixed comparison budget, and make two contributions. First, we cast judging as Bayesian inference over latent quality with explicit, judge-specific bias covariates (verbosity, position), regularized by a shrinkage prior so that the data decide which biases a given judge actually exhibits. Second, we introduce a \topk-aware active acquisition rule that chooses the next comparison to maximally reduce uncertainty about \topk \emphmembership, rather than about the full ranking. On a controlled benchmark with known ground-truth quality, judged by sixteen real LLMs spanning open and proprietary families (Llama, Qwen, Phi-4, GPT-4o-mini/5.1/5.5, Gemini, DeepSeek, and Claude Haiku/Sonnet/Opus), naive aggregation plateaus at a wrong \topk on biased judges regardless of budget, while our bias-aware model recovers it; \topk-aware acquisition reaches this ceiling with far fewer comparisons than round-robin or a global-uncertainty (D-optimal) rule. Bias is real but heterogeneous and capability-dependent: cheap and mid-tier judges carry a strong verbosity bias that our model corrects (lifting recall from \sim 0.5 – 0.6 to 0.84 – 1.0 ), whereas the frontier judges we tested show little bias and already rank accurately, so bias-aware modeling changes little there.
[LG-15] Fourier Neural Operators for Rayleigh-Bénard Convection CCS2026
链接: https://arxiv.org/abs/2607.02088
作者: Chelsea Maria John,Thibaut Lunet,Sebastian Götschel,Andreas Herten,Stefan Kesselheim,Daniel Ruprecht
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注: Accepted at Computational Science, ICCS 2026
Abstract:We propose an improved Fourier Neural Operator (FNO) for modeling two-dimensional Rayleigh-Bénard convection by predicting time increments instead of full solutions, achieving higher accuracy than a standard FNO baseline. The resulting model is compact (314k parameters, 1.26 MB) and fast (7 ms inference), while maintaining similar accuracy as demonstrated in previous benchmarks. We show that although FNOs generalize to finer meshes, accuracy remains limited by the resolution of the training data.
[LG-16] A Memory Efficient Unified Algorithm for Online Learning of Linear Dynamical Systems
链接: https://arxiv.org/abs/2607.02050
作者: Yuval Ran-Milo,Angelos Assos,Elad Hazan
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 34 pages, 1 figure
Abstract:Motivated by the challenge of stabilizing a general unknown linear dynamical system (LDS) from observations, we study the natural prerequisite of online prediction. Our goal is to achieve sublinear regret with a memory footprint that adapts to the intrinsic complexity of the dynamics rather than the full hidden – state dimension. We focus on the practically central regime of systems with low instability complexity – eigenvalues outside the real stable interval that do not decay rapidly, together with non-semisimple modes-potentially embedded in an otherwise stable real spectrum of much higher dimension; we write k for this count. This regime is the primary setting in which stabilization is plausible: we show that many systems with high instability complexity cannot be stabilized without exponentially large controls. Thus, prediction is meaningful for stabilization precisely when the instability complexity is small. Within this regime, we introduce a unified online algorithm that handles every LDS (including non-diagonalizable systems with complex or exploding modes) with a learnable parameter count of \widetildeO(k) . Finally, we prove a lower bound showing that k is a valid complexity measure: any filter-based predictor needs at least k filters. Experiments corroborate our theory: on a high-dimensional system, our predictor sharply outperforms prior methods at an equal parameter budget.
[LG-17] Fast and Accurate Anomaly Detection in Time Series
链接: https://arxiv.org/abs/2607.02046
作者: Emanuele Mele,Massimo Cafaro,Angelo Coluccia,Italo Epicoco
类目: Machine Learning (cs.LG)
*备注:
Abstract:Anomaly detection is a critical and evolving field in Machine Learning, with applications targeting different domains such as cybersecurity, finance, healthcare, manufacturing and IoT (Internet of Things) systems. Traditionally, anomaly detection algorithms have been designed using both supervised and unsupervised learning paradigms. The fundamental challenge in real-world anomaly detection scenarios is related to the inherent class imbalance (anomalies are typically rare) and, for supervised methods, to the scarcity of labelled anomalous data. Indeed, labelling is both expensive and time-consuming. Conversely unsupervised methods do not require labelling, but may suffer from high false positive rates when deployed in safety-critical applications. In this work we introduce a novel unsupervised algorithm for anomaly detection in time series based on the Haar discrete wavelet and a suitably designed t -test. We establish the theoretical foundation of the proposed t -test and, through extensive experimentation across 343 datasets, demonstrate that our algorithm outperforms state-of-the-art unsupervised and self-supervised benchmarks.
[LG-18] Cross-Platform Control for Autonomous Surface Vehicles via Adaptive Reinforcement Learning
链接: https://arxiv.org/abs/2607.02037
作者: Ruiheng Jiang,Thomas Bi,Raffaello D’Andrea,Aswin Ramachandran
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Video: this https URL
Abstract:Autonomous surface vehicles vary widely in hydrodynamic and actuation characteristics, yet most controllers are designed for single-platform deployment. We present an adaptive reinforcement learning approach for trajectory tracking that enables zero-shot cross-platform deployment using a single policy. Since the deployment platform’s dynamics are unknown to the policy, we address cross-platform generalization with the standard partial-observability approach of conditioning on interaction history, employing a teacher-student architecture in which a learned module infers a latent representation of the platform dynamics. The policy is trained in simulation under randomized vessel dynamics and is deployed zero-shot to two real-world platforms without any fine-tuning, despite relying on a simple analytical dynamics model rather than a high-fidelity hydrodynamic simulator. In real-world experiments on two different platforms, the adaptive policy outperforms non-adaptive learning-based baselines by up to 58% in position mean absolute error while approaching the tracking accuracy of a platform-specific tuned controller.
[LG-19] Scalable and Distributed Silhouette Approximation
链接: https://arxiv.org/abs/2607.01993
作者: Ilie Sarpe,Federico Altieri,Andrea Pietracaprina,Geppino Pucci,Fabio Vandin
类目: Data Structures and Algorithms (cs.DS); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 50 pages, 12 figures, extension of a previously appeared conference paper: this https URL featuring substantial new contributions
Abstract:The silhouette is one of the most widely used measures to assess the quality of a k -clustering of a dataset of n elements. Its evaluation requires no information beyond the clustering assignment. In addition, the silhouette is extremely easy to interpret, providing a score to measure the quality of a clustering as a whole or for each element. The exact computation of the: (i) silhouette of each element of a dataset; and (ii) the global silhouette of the clustering; require \Theta(n^2) distance calculations, under general metrics. The quadratic complexity \Theta(n^2) is extremely prohibitive, especially on massive modern datasets. Surprisingly, existing approximate methods using O(n^2) distance calculations are heuristics not offering provable and controllable guarantees on the quality of their results. We introduce the first rigorous and efficient algorithms to estimate: (i) the (local) silhouette of each element of a dataset; and (ii) the (global) silhouette; of any metric k -clustering. Our methods, based on sampling, perform O(nk\varepsilon^-2\ln (nk/\delta)) distance computations, and provide estimates with additive error O(\varepsilon) with probability at least 1-\delta . That is, parameters \varepsilon and \delta in (0,1) control the trade-off between accuracy and efficiency. We also introduce a scalable and distributed design of our methods for the MapReduce and Massively Parallel Computing (MPC) frameworks. Our distributed algorithms use a constant number of rounds and sublinear local memory. Finally, we perform extensive experiments against state-of-the-art approaches. The results show that our new techniques yield the best trade-off between accuracy and efficiency for both local and global silhouette estimation. In addition, our methods scale efficiently to massive datasets for which an exact computation of the silhouette is not practical. Comments: 50 pages, 12 figures, extension of a previously appeared conference paper: this https URL featuring substantial new contributions Subjects: Data Structures and Algorithms (cs.DS); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG) Cite as: arXiv:2607.01993 [cs.DS] (or arXiv:2607.01993v1 [cs.DS] for this version) https://doi.org/10.48550/arXiv.2607.01993 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-20] Probabilistic Low-Voltage Peak Load Forecasting with Time Series Foundation Models Evaluated on Application-Oriented Metrics
链接: https://arxiv.org/abs/2607.01966
作者: Benedikt Kaas,Manuel Treutlein,Hannes Benedikt Gerber,Oliver Neumann,Cheewan Phatthanakhuha,Oliver Resch,Ralf Mikut,Veit Hagenmeyer
类目: Machine Learning (cs.LG)
*备注: A poster abstract of this publication will be available at the 15th DACH+ Conference on Energy Informatics (2026 in Linz, Austria)
Abstract:Low-voltage load forecasting is an important component in current and future energy systems with a high degree of electrification and decentralized generation. However, current forecasting methods require significant manual effort, often lack uncertainty estimation and proper peak prediction, and they are often not adequately evaluated in terms of grid requirements. In the present study, we provide an extensive evaluation of short-term net load forecasts of 200 real-world low-voltage feeders with a focus on the rapidly evolving time series foundation models. Our study compares Chronos-Bolt, Chronos-2 and TabPFN-TS to six baseline models and demonstrates superior performance, in particular for Chronos-2. An ablation study, in which weather covariates are omitted, shows that time series foundation models adapt to increased uncertainty, despite the importance of weather information. A novel application-oriented metric links the model’s forecasting capabilities in peak prediction to the trade-off in grid asset planning and operation between cost reduction and minimizing the risk of failure.
[LG-21] A More Accurate Algorithm Comparison through A/B Testing using Offline Evaluation Methods KDD2026
链接: https://arxiv.org/abs/2607.01958
作者: Koki Konishi,Masataka Ushiku,Yuta Saito
类目: Machine Learning (cs.LG)
*备注: 12 pages, 8 figures, accepted to KDD 2026
Abstract:A/B testing is the gold standard for selecting the better algorithm in online services. While offline evaluation has attracted attention as a safer alternative due to the high experimental costs and the potential risk of degrading user experience and revenue in A/B testing, it is widely recognized that the estimation accuracy of offline evaluation is substantially lower. As a result, final selection decisions are typically made through A/B testing. Contrary to this conventional view, we reveal a counterintuitive phenomenon in which A/B testing can produce a higher algorithm selection error rate than offline evaluation. This occurs because the sample mean estimator used in A/B testing does not induce positive correlation, which is crucial for reducing critical selection errors, namely underestimating the truly superior algorithm and overestimating the truly inferior one. In contrast, offline evaluation methods unintentionally generate this beneficial correlation by relying on shared offline data when estimating and comparing the performance of multiple algorithms. Building on this insight, we propose an estimator that intentionally induces positive correlation to improve algorithm selection in A/B testing. The key idea is to introduce a hypothetical middle algorithm and to estimate the performance difference between algorithms A, M, and B in a stepwise manner using shared data at each step. This approach enables the application of offline evaluation techniques in each step, thereby inducing positive correlation and reducing critical selection errors. Furthermore, we derive the optimal middle algorithm regarding the resulting variance and analyze its advantages over existing methods through bias-variance analysis. Experiments on real-world data demonstrate that our estimator achieves the same selection error rate as existing approaches while using only one half of the A/B testing data.
[LG-22] Hybrid quantum-classical neural network for sentiment analysis
链接: https://arxiv.org/abs/2607.01943
作者: Giacomo Cappiello,Filippo Caruso,Xing Liang,Dimitrios Makris
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:
Abstract:Quantum machine learning has recently emerged as a promising paradigm that leverages the expressive power of quantum circuits to address complex learning tasks. In this work, we investigate the applicability of hybrid quantum-classical neural networks to sentiment analysis, a central problem in natural language processing. We focus on a dataset of tweets related to COVID-19, where the textual content is vectorized using TF-IDF and fed into both classical feedforward networks and hybrid architectures incorporating parameterized quantum circuits. Our results show that hybrid models can achieve accuracy comparable to the classical baseline, while exhibiting distinct learning dynamics, especially in terms of validation loss and accuracy, that suggest a richer representational capacity. Moreover, when applying transfer learning to an SMS spam classification task, the hybrid models consistently outperform the classical counterpart, achieving an accuracy increase of 15 percentage points (from 66% to 81%) on the spam class, demonstrating enhanced generalization. These findings highlight the feasibility of employing QML for natural language processing and point toward the potential advantages of hybrid models as quantum hardware continues to advance.
[LG-23] Zeus: Towards Tuning-Free Foundation Model for Time Series Analysis ICML2026
链接: https://arxiv.org/abs/2607.01918
作者: Yisong Fu,Zezhi Shao,Chengqing Yu,Yujie Li,Yongjun Xu,Xueqi Cheng,Fei Wang
类目: Machine Learning (cs.LG)
*备注: Accepted by ICML 2026
Abstract:We present Zeus, a unified tuning-free Time Series Foundation Model (TSFM) that delivers superior performance across diverse analysis tasks without any task-specific fine-tuning. Unlike prior studies that primarily focus on zero-shot forecasting but require task-specific tuning for other tasks, Zeus bridges this gap by addressing two fundamental challenges in multi-task generalization. First, to reconcile point-level granularity with long-sequence scalability, Zeus incorporates a multi-scale Transformer featuring point-wise tokenization and a U-shaped hierarchy, effectively balancing fine-grained fidelity with computational efficiency. Second, to accommodate varying inductive biases across different tasks, Zeus introduces Multi-Objective Temporal Masking (MOTM), a unified strategy that supports heterogeneous tasks (e.g., extrapolation, interpolation, and global abstraction) within a single framework. Extensive experiments across five representative tasks demonstrate that Zeus consistently achieves competitive results in tuning-free settings, underscoring its potential as a general-purpose TSFM.
[LG-24] Regularized Variational and Spectral Log-Density-Ratio Estimation in the Gaussian Location Model
链接: https://arxiv.org/abs/2607.01895
作者: Francis Bach(SIERRA)
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:
Abstract:We study ridge-regularized log-density-ratio estimation in the Gaussian location model with a common covariance matrix. By affine invariance, the model is written as q \sim N(0, I), p \sim N( \Delta , I), with linear features, where \Delta is a mean vector. The variational estimator is the empirical Kullback-Leibler (KL) log-normalized fit with a squared L2-penalty on its nonconstant coefficient, and the spectral estimator recently introduced in [1] replaces a single variational problem by a continuum of ridge-regularized least-squares problems. We derive high-dimensional deterministic asymptotic equivalents when the numbers of observations and dimension tend to infinity with fixed ratios. The regularized variational limit is characterized by a scalar entropy minimization problem derived from the convex-Gaussian-min-max theorem (CGMT), while the regularized spectral limit follows from deterministic equivalents for resolvents of weighted sums of two independent Gaussian sample covariance matrices. We use these formulas to compare population risks, with experiments focused on fixed-signal aspect-ratio sweeps and optimized regularization. Our conclusion is that with many observations, under the criteria and asymptotic regimes analyzed here, the well-specified variational estimator has the smaller risk, while with fewer observations, the spectral estimator is favored because its covariance-based construction has lower variance. We also study how a nuclear penalty can be used and partially analyzed to perform feature learning.
[LG-25] Learning the Supports for Categorical Critic in Reinforcement Learning
链接: https://arxiv.org/abs/2607.01880
作者: Jen-Yen Chang,Takayuki Osa,Tatsuya Harada
类目: Machine Learning (cs.LG)
*备注: Accepted to RLC 2026
Abstract:Value functions are an essential component in actor-critic based deep reinforcement learning (RL). Conventionally, these functions are trained as a regression task by minimising the mean squared error (MSE) relative to bootstrapped target values. Meanwhile, in distributional RL, a distribution of returns is modelled based on the distributional Bellman operator. This work investigates the Gaussian Histogram Loss (HL-Gauss), a recent approach that reframes value estimation as classification by encoding each scalar Bellman target as a Gaussian-smoothed categorical target. Despite its potential, applying histogram-based losses to RL presents inherent challenges, most notably the requirement to pre-define a fixed support interval, which is often complicated by the non-stationary and stochastic nature of target values typically found in RL tasks. In this work, we propose an approach that dynamically learns the lower and upper bounds of the support instead of assigning them beforehand. We derive an objective that jointly learns these bounds whilst learning the categorical representation of the scalar values, and we show that this objective forms an upper bound on the mean-squared Bellman error. Our theoretical analysis further shows that this bound is tighter than that of non-learned supports of HL-Gauss. Empirically, the proposed objective enables stable adaptation of the support interval and matches HL-Gauss-based actor-critic algorithms on most continuous-control tasks whilst improving on a subset, without requiring a pre-specified support interval.
[LG-26] Adaptive Group-Based Counterfactual Explanations for Time-Series Rehabilitation Data
链接: https://arxiv.org/abs/2607.01838
作者: Emmanuel C. Chukwu,Rianne M. Schouten,Monique Tabak,Mykola Pechenizkiy
类目: Machine Learning (cs.LG)
*备注: To be published at IEEE CBMS 2026
Abstract:Counterfactual explanations (CEs) for multivariate time-series classifiers are often difficult to interpret in domains where experts reason in terms of semantic feature groups rather than individual channels. In rehabilitation movement analysis with multi-sensor inertial measurement units (IMUs), clinicians interpret motion through muscle-group and joint-segment abstractions; yet, most existing counterfactual methods operate at the channel level, producing scattered and biomechanically incoherent explanations. We propose a two-stage framework for group-based counterfactual generation in high-dimensional IMU data. We first show that Shapley-Adaptive (SA) group ranking preserves counterfactual validity but fails to enforce group-level sparsity, motivating the need for explicit group selection. We then introduce Learnable Gate (LG) methods, which incorporate trainable per-group relevance gates jointly optimized with perturbation masks. Experiments on the KneE-PAD rehabilitation dataset demonstrate that LG substantially improves modality-group sparsity compared to the channel-level M-CELS baseline while maintaining or improving validity, temporal smoothness, and generation efficiency. Exercise-specific analyses further show that group-structured counterfactuals yield concise, muscle-level corrective guidance aligned with clinical reasoning. Overall, the proposed framework enhances interpretability without sacrificing counterfactual quality, enabling more actionable explanations for rehabilitation movement analysis.
[LG-27] Lynx: Progressive Speculative Quantization for accelerating KV Transfer in Long-Context Inference
链接: https://arxiv.org/abs/2607.01831
作者: Wenchen Han,Gingfung Matthew Yeung,Marco Barletta,William Toner,Amory Hoste,Adam Barker
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 15 pages, 12 figures. This manuscript was originally submitted to SIGCOMM '26 in February 2026
Abstract:Long-context inference is increasingly common in large language model (LLM) serving, driven by retrieval-augmented generation and agentic systems. In disaggregated inference, these workloads require transferring large Key-Value (KV) caches across the network, where decoding cannot begin until the transfer completes. Recent KV quantization techniques reduce data volume and alleviate this bottleneck, but existing schemes fail to achieve both low network-exposed latency and high inference accuracy. We challenge the assumption that the KV cache is an indivisible unit that must be fully received before use. We leverage the observation that different bits in the KV cache contribute unequally to attention computation and inference precision: the most significant bits capture the coarse structure of attention and the least significant bits refine precision. This property enables partial use of the KV cache during decoding. We present Lynx, a system that enables progressive, split-stream KV transfer by partitioning the KV cache into a high-priority Anchor stream carrying the most significant bits and a low-priority Residual stream carrying remaining precision. Decoding begins upon receipt of the Anchor stream and proceeds speculatively while the Residual stream is transferred concurrently, followed by verification that ensures equivalence to higher-precision decoding. Across multiple models and serving workloads, Lynx achieves Time-to-First-Token (TTFT) comparable to aggressive 4-bit KV quantization, while matching the accuracy of high-precision (BF16) inference, improving TTFT over standard 8-bit KV quantization by up to 1.43\times and improving accuracy over state-of-the-art by up to 5.1% . Comments: 15 pages, 12 figures. This manuscript was originally submitted to SIGCOMM '26 in February 2026 Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG) ACMclasses: C.2.4; I.2.11 Cite as: arXiv:2607.01831 [cs.DC] (or arXiv:2607.01831v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2607.01831 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-28] Many Voices One Reward: Multi-Role Rubric Generation for LLM Judging and Reward Modeling
链接: https://arxiv.org/abs/2607.01830
作者: Dazhi Fu,Jiuding Yang,Yiwen Guo,Jicong Fan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reliable reward and preference signals are critical for evaluating and optimizing large language models on open-ended tasks. Rubric-based judges offer a transparent way to decompose such judgments into explicit evaluation criteria, but existing annotation-free rubric generators typically rely on a single generic evaluator. As a result, they may overlook important dimensions of human preference, a failure mode we term dimensional blind spots. To address this limitation, we propose Multi-Role Rubric Generation (MRRG), a training-free and reference-free framework that elicits evaluation criteria from multiple complementary roles and consolidates them into an auditable rubric-based scorer. This scorer can be used both to validate pairwise preferences and to provide rewards for GRPO-style Reinforcement Learning with Verifiable Rewards (RLVR). Experiments on preference validation benchmarks show that MRRG consistently outperforms single-role rubric generation baselines across multiple backbone models. Further RLVR experiments demonstrate that MRRG yields a stronger reward signal for improving open-ended generation.
[LG-29] Gaming Consensus: Coordinated Manipulation in Crowdsourced Fact-Checking ICML2026
链接: https://arxiv.org/abs/2607.01824
作者: Nikil Roashan Selvam,Jay Baxter,Sophie Hilgard,Brad Miller,Keith Coleman,Ellen Vitercik,Sanmi Koyejo
类目: Machine Learning (cs.LG)
*备注: ICML 2026
Abstract:Crowdsourced fact-checking systems have been adopted by major social media companies such as X, Meta, TikTok and Google with the aim of combating misleading information at scale without relying on centralized editorial control. These systems have been developed around a common underlying concept: a bridging mechanism that identifies notes flagging misleading information when they receive support from people with different perspectives rather than simple majority support. To our knowledge the only publicly disclosed bridging algorithms deployed for fact-checking are based on matrix factorization, as deployed by both X and Meta, augmented with additional components addressing abuse, targeted manipulation, and contributor brigades. This work examines the core matrix factorization portion of these systems, presenting theoretical and empirical evaluations of the degree to which coordinated users could vote strategically by leveraging the latent representations to fabricate the appearance of synthetic consensus within the bridging mechanism. Using historic production data, we find that up to 10.7% of lower quality notes could be manipulated above consensus thresholds using less than 10 ratings. We complement these findings with a theoretical analysis, revealing counterintuitively that rating a note as “Not Helpful” can increase its helpfulness score, as well as a cost model quantifying manipulation effort. We have developed and deployed mitigations within X’s Community Notes algorithm to address synthetic consensus.
[LG-30] Koopman operator theory: fundamentals control and applications
链接: https://arxiv.org/abs/2607.01819
作者: Igor Mezić,Jorge Cortés,Karl Worthmann,Mircea Lazar,Armin Lederer
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:
Abstract:The Koopman operator has gained considerable attention due to its ability to provide a global linear representation of highly complex dynamical systems. The operator describes nonlinear dynamics in a linear way through the lens of real- or complex-valued observable functions. Recently proposed data-driven techniques, like extended dynamic mode decomposition (EDMD), its kernelized variant, and machine-learning methods, can be used to generate finite-dimensional approximations accompanied by finite-data error bounds. In this tutorial paper, we provide a concise introduction into Koopman operator theory and its use in systems and control. A particular focus is put on data-driven surrogate models, their extension to systems with inputs, and controller design using Koopman operator theory. Moreover, we demonstrate the key techniques, i.e., EDMD and Koopman MPC. To this end, we provide simulation studies including source code on GitHub to enable the interested reader to experience the Koopman operator in systems and control step by step.
[LG-31] EHHN: An Event-driven Heterogeneous Hypergraph Network for Object-Centric Next Activity Prediction
链接: https://arxiv.org/abs/2607.01785
作者: Jiaxing Wang,Kaitao Chen,Zhubin Han,Chenyu Hou,Bin Cao,Jing Fan,Ji Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Next activity prediction helps service-oriented processes anticipate upcoming steps before delays, exceptions, or service-level risks occur. Most existing methods assume classical single-case event logs, whereas real service processes often involve events shared by multiple typed business objects. Object-centric event logs (OCELs) capture such interactions, but current predictors remain limited. Flattening-based approaches lose cross-object context, and native OCEL graph-based approaches encode multi-object events through pairwise relations. Existing models also do not jointly capture event-driven object state changes, inter-event timing, and global execution patterns. We propose EHHN, an Event-driven Heterogeneous Hypergraph Network for object-centric next activity prediction. EHHN represents each prediction prefix as a heterogeneous hypergraph, where event–object hyperedges bind retained co-participating objects and a lifecycle hyperedge groups the primary object’s observed lifecycle events. Based on this representation, EHHN uses a dual-stream architecture in which a micro-spatial stream models event-driven object-state evolution and a macro-evolution stream captures temporal dynamics using retrieved global prototypes. The two streams are fused to predict the next activity. Experiments on four public OCEL benchmarks against nine baselines show that EHHN achieves the best accuracy and macro F1-score on all datasets, with improvements of up to 8.1 and 12.4 percentage points over the strongest baselines. Compared with the strongest OCEL-native graph baseline, EHHN also reduces peak GPU memory by up to 24 times. Code is available at this https URL.
[LG-32] Set Diffusion: Interpolating Token Orderings Between Autoregression and Diffusion for Fast and Flexible Decoding ICML2026
链接: https://arxiv.org/abs/2607.01775
作者: Marianne Arriola,Volodymyr Kuleshov
类目: Machine Learning (cs.LG)
*备注: ICML 2026. We provide the code at this https URL
Abstract:Discrete diffusion models have steadily improved in quality relative to autoregressive (AR) models. However, these models are normally constrained to fixed-length generation and do not support key-value (KV) caching. Block diffusion partially bridges diffusion and AR by generating token blocks left-to-right, but its fixed-size sequential blocks limit decoding flexibility and parallelism. Here, we present a new class of language models, set diffusion, comprised of (i) a likelihood parameterization that factorizes over flexible-position, flexible-length token sets and (ii) a set-causal diffusion architecture that supports KV cache updates after every inference step. By factorizing over token sets instead of fixed-size blocks, tokens can be decoded in arbitrarily-ordered sets, including sliding-window sets, enabling faster inference and support for any-order decoding. Set diffusion achieves better speed-quality tradeoffs on mathematical reasoning, summarization, and unconditional generation compared to prior diffusion language models while offering stronger infilling performance than block diffusion. We provide the code, along with the model weights and blog post on the project page: this https URL
[LG-33] Role-Aware Neural Convex Divergence Heads for Asymmetric Representation Learning
链接: https://arxiv.org/abs/2607.01762
作者: He Huang,Lu Shen,Yunfeng Huang,Li Qi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Many representation learning problems involve directed relations, such as lexical entailment, sentence entailment, ontology hierarchy, and citation links. Standard Euclidean, cosine, and Mahalanobis heads are symmetric, while generic neural scorers can model directionality but provide limited geometric structure. This paper proposes a role-aware neural convex divergence head for asymmetric representation learning. The head applies source- and target-role projections before evaluating an input-convex neural Bregman divergence, yielding a nonnegative structured score in the role-projected space. We characterize its projected-space identity, source-role convexity, directional-gap decomposition, and Hessian-based local curvature. Experiments on lexical, sentence, ontology, and directed graph benchmarks compare symmetric distances, unstructured asymmetric scorers, order/hyperbolic baselines, plain ICNN-Bregman heads, and the proposed role-aware variant. Across ten random seeds on the main semantic and ontology benchmarks, role-aware projections consistently improve directional accuracy over plain ICNN-Bregman heads while preserving zero observed negative divergence rate. The results also identify a boundary case: on large fixed-feature citation prediction, specialized symmetric or hyperbolic baselines remain stronger in ranking accuracy. Overall, the proposed head is best understood as a structured and interpretable plug-in distance module for tasks where directional relations matter.
[LG-34] Efficient Temporal Point Processes via Monotone Alternating Splines
链接: https://arxiv.org/abs/2607.01752
作者: Cheng Wan,Quyu Kong,Feng Zhou
类目: Machine Learning (cs.LG)
*备注: 22 pages
Abstract:Temporal point processes (TPPs) have widespread applications across various domains. Compared to modeling the conditional intensity of a TPP, modeling its cumulative conditional intensity function (CCIF) improves computational efficiency and eliminates numerical approximation errors. However, current CCIF parameterizations uniformly rely on Monotone Neural Networks (MNNs), which we identify as suffering from three structural deadlocks–convexity restrictions, saturation limits, and violations of CCIF modeling requirements–that fundamentally restrict their representational capacity for complex temporal dynamics. To resolve these bottlenecks, this paper proposes a novel framework called Monotone Alternating Splines (MAS). By leveraging distinct interpolation and extrapolation components, MAS provides a flexible and efficient framework for modeling CCIFs. Theoretically, MAS’s interpolation provides strong fitting accuracy, while its extrapolation supports robust generalization, reducing the irreducible approximation gaps of MNNs. Extensive experiments show that MAS achieves superior performance on both synthetic and real-world datasets.
[LG-35] Finite-Lag Operator Geometry of Recurrent Representations
链接: https://arxiv.org/abs/2607.01746
作者: Kanishka Reddy
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recurrent representations are trajectories, but representation geometry is often measured from static snapshots. We develop finite-lag operator geometry for recurrent hidden states from observed source-successor pairs (X_t,X_t+\Delta) . The primitive is the conditional transport law Q_\Delta(dy\mid x) , estimated by a dense Gaussian source-smoothing operator. From this directed finite-lag law we derive a source-centered transport tensor G_\Delta , which decomposes exactly into conditional spread and coherent displacement, and an antisymmetric coordinate circulation W_\Delta^\rho , which summarizes directed lagged flow. We prove affine covariance with explicit metric dependence of scalar summaries, dense estimator stability on bounded trajectory clouds, and a finite-lag separation result showing that source-centered transport detects deterministic recurrent motion not recorded by infinitesimal carre-du-champ geometry. A linear-Gaussian closed form calibrates the quantities in terms of the update A_\Delta , source covariance, and innovation covariance. Controlled experiments validate the decomposition, circulation, covariance, and stability predictions. In performance matched repeat-copy networks, the framework reveals architecture dependent differences in total transport scale and coherent displacement trace, while coherent displacement fraction is metric and resolution dependent.
[LG-36] Frequency Shift Physics-Informed Extreme Learning Machine for Solving High-Frequency Partial Differential Equations
链接: https://arxiv.org/abs/2607.01694
作者: Xiong Xiong,Ruonan Zhai,Zheng Zeng,Sheng Zhou,Rongchun Hu,Zichen Deng
类目: Machine Learning (cs.LG)
*备注:
Abstract:Solving partial differential equations (PDEs) with high-frequency solutions remains a central challenge in physics-informed machine learning due to spectral bias – the tendency of neural networks to learn low-frequency components preferentially. This paper proposes a Frequency Shift Physics-Informed Extreme Learning Machine (FS-PIELM) framework that addresses this limitation through an additive mechanism for weight initialization. Rather than multiplying random weights by a scaling factor, the method translates the mean of the Gaussian weight distribution while keeping the variance fixed at unity, thereby avoiding the variance amplification inherent in scaling-based methods. Two variants are developed: FS-PIELM-L assigns independent frequency magnitudes to individual neurons, while FS-PIELM-G groups neurons for improved robustness. Theoretical analysis shows that the frequency variance under the proposed framework remains bounded and approaches unity regardless of target frequency, in contrast to the quadratic growth of conventional approaches. The method preserves the computational efficiency of extreme learning machines, requiring only a single linear solve. Experiments on seven benchmark problems spanning six equation types – Helmholtz, wave, Poisson, Klein-Gordon, heat, and advection-diffusion – on both regular and complex geometries show that the linear variant achieves the best accuracy in six of seven cases, with improvements of one to nearly five orders of magnitude over existing PIELM variants. The code and data accompanying this manuscript will be made publicly available at this https URL.
[LG-37] A Mathematical Introduction to Diffusion Models
链接: https://arxiv.org/abs/2607.01693
作者: Jianfeng Lu
类目: Machine Learning (cs.LG); Probability (math.PR)
*备注: Lecture notes for the John Tukey Summer Graduate School on Mathematics of Generative Models at SLMath (June 22nd, 2026 – July 2nd, 2026)
Abstract:These notes give a proof-oriented introduction to diffusion models from the viewpoint of sampling, tracing a single arc from classical sampling dynamics to modern diffusion samplers, their error analysis, and inference-time control. Throughout, the material is layered into core definitions and identities proved in full, representative estimates proved under simplifying assumptions, and research-level theorems stated with a proof roadmap. The intended audience is beginning graduate students with a background in probability but no prior exposure to stochastic differential equations, stochastic numerics, or diffusion models.
[LG-38] WARP: Weight-Space Analysis for Recovering Training Data Portfolios ICML2026
链接: https://arxiv.org/abs/2607.01686
作者: Tzu-Heng Huang,Aditya Goyal,John Cooper,Frederic Sala
类目: Machine Learning (cs.LG)
*备注: This work appears in the ICML 2026 Workshop on Weight-Space Symmetries (WSS): from Foundations to Practical Applications. Our source code is available at this http URL
Abstract:Foundation models are routinely released to the public, yet the data recipes used to train them – such as domain mixture weights that determine how different sources are sampled – are rarely disclosed. This creates an access asymmetry: researchers study the resulting models but lack visibility into the training distribution that produces them. Prior works for inferring training data, such as membership inference, detect at the level of individual samples and thus cannot characterize the global composition of the training corpus. We introduce WARP, a framework that recovers a fine-tuned model’s training mixtures directly from its released weights. WARP interpolates between the base and fine-tuned models using model merging, generating pseudo-checkpoints that approximate the missing training trajectory and expose a geometric footprint of the training data in the weight space. From these simulated footprints, WARP extracts geometric features and maps them to domain proportions using either a parameter-free softmax readout or an MLP projector trained on synthetic mixtures. In controlled experiments with BERT and GPT-2, WARP recovers domain mixtures with an average MAE as low as 0.046 and 0.104 respectively, outperforming membership inference and a variant with access to the true training trajectory.
[LG-39] SCAPE: Accurate and Efficient LLM Training with Extreme Sparse Communication
链接: https://arxiv.org/abs/2607.01678
作者: Mingkai Zheng,Junlin Chen,Haotian Xie,Zhao Zhang
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:Communication increasingly dominates the cost of Large Language Model (LLM) pre-training, especially under data-parallel and sharded training schemes, where gradient synchronization and parameter reconstruction overhead increase with model size and system scale. Existing communication-reduction methods either sparsify raw gradients, which can be unstable for modern Adam-style optimizers at high sparsity, or quantize communication, whose savings are fundamentally bounded by bit width and often incur additional runtime overhead. We present SCAPE, a communication-efficient distributed optimizer for LLM training that exploits the stability of AdamS’s first-moment to enable aggressive sparsification without loss of LLM quality. Instead of constructing masks from raw gradients, SCAPE derives them from first-moment-based statistics, partitions mask generation across workers to align with optimizer sharding, and delays mask usage by one step so that mask synchronization can overlap with computation. SCAPE also reconstructs the quantities required for second-moment updates from a single synchronized sparse buffer, avoiding an additional collective. We implement SCAPE in Megatron-LM and evaluate its convergence by pre-training GPT-345M on OpenWebText and Llama-500M on SlimPajama-6B using 32 NVIDIA GH200 GPUs on TACC Vista. In both models, SCAPE preserves training stability, validation loss, and downstream task accuracy under 90% and 99% sparsity. For Llama-500M, SCAPE reduces end-to-end pre-training wall-clock time by up to 43.3% while maintaining model quality comparable to dense AdamW and AdamS. For Llama-1.8B, SCAPE achieves up to 3.26 \times speedup per step compared to dense AdamS.
[LG-40] UniWind: Toward Unified Day-Ahead Wind Power Forecasting via Physics-Informed State Routing
链接: https://arxiv.org/abs/2607.01670
作者: Ronghui Xu,Tongxin Wu,Guozhen Zhang,Yihan Li,Chenjuan Guo,Bin Yang,Yong Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Day-ahead wind power forecasting is essential for cost-effective power-system operation. It is primarily driven by future meteorological conditions while retaining temporal dependencies in power generation. In practice, observed wind-farm power often entangles physically available power with local environmental effects and latent operational states, such as shutdowns and curtailment. Existing physical models provide useful constraints but adapt poorly across wind farms, whereas data-driven models can capture rich correlations but often conflate meteorological effects with state-induced deviations. In this study, we propose UniWind, a wind power forecasting model based on physics-informed state routing. UniWind first employs a Physical Prior Estimator to construct a site-calibrated physical prior by combining site-conditioned monotonic warping with a shared physical power curve. It further applies a physical upper-bound constraint to shape this prior as a soft envelope of available wind power generation. UniWind then proposes a Latent State Encoder to model operating-state embeddings and transforms the physical prior into final power forecasts through a State-aware Power Corrector, which uses knowledge-guided supervised state routing and bounded, state-specific expert correction. Full-shot and cross-farm zero-shot experiments on more than 20 real-world datasets demonstrate the accuracy and robustness of UniWind.
[LG-41] Revisiting Decentralized Online Convex Optimization with Compressed Communication
链接: https://arxiv.org/abs/2607.01665
作者: Hao Zhou,Xiaoyu Wang,Chang Yao,Mingli Song,Yuanyu Wan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Decentralized online convex optimization (D-OCO) is a popular framework for distributed applications with streaming data. To tackle the communication bottleneck, previous studies have investigated D-OCO with compressed communication and proposed several algorithms that are variants of online gradient descent (OGD). However, for D-OCO with exact communication, the best existing algorithms are variants of follow-the-regularized-leader (FTRL). In this paper, for the first time, we propose two FTRL-type algorithms for D-OCO with compressed communication. Compared with OGD-type algorithms, our algorithms are more elegant in both algorithmic design and theoretical analysis. The key insight is that the dual update mechanism of FTRL allows us to make a simple application of the technique for average consensus with communication compression. More specifically, our first algorithm considers the full-information setting, and can match the existing regret bounds. Our second algorithm is designed for the bandit setting, and can significantly improve both the regret bounds and communication costs of existing algorithms.
[LG-42] Message Passing Based Two-Timescale Bayesian Learning for Joint Channel and Memory Hardware Impairments Tracking
链接: https://arxiv.org/abs/2607.01660
作者: Wei Xu,An Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Hardware impairments in massive multiple-input multiple-output (MIMO) receivers introduce inter-symbol memory and inter-element coupling, severely degrading channel estimation. This paper employs a residual recurrent gated unit (RGRU) to model the intra-slot memory of the hardware impairments and proposes a message-passing-based two-timescale Bayesian deep learning (MP-TTBDL) framework for joint channel and impairment tracking. Owing to small-scale fading, the wireless channel varies rapidly across slots, whereas hardware impairments drift slowly due to hardware aging and environmental variations. To capture these distinct physical timescales, a fastvarying Markov prior and a slow-varying Gaussian Markov prior are assigned to the sparse channel and the network parameters, respectively. Based on a multi-slot factor graph formulation, a message-passing algorithm is developed. Specifically, the inter-slot messages admit closed-form updates, while the intra-slot factor graph, due to its complex recurrent structure, is partitioned into a channel tracking module and an impairments calibration module. The channel tracking module performs sparse channel estimation via turbo orthogonal approximate message passing (Turbo-OAMP), and the impairments calibration module updates the impairment parameters via a specially designed deep approximate message passing (DAMP) procedure, with the two modules iteratively exchanging extrinsic information through expectation propagation (EP) until convergence. Simulation results show that the proposed framework robustly achieves lower channel estimation error than conventional compensators followed by channel estimation across different online impairment scenarios and signal-to-noise ratio (SNR) conditions.
[LG-43] CALM: Interpretable Cross-Modal Alignment for Biomarker Discovery from Unpaired Data MICCAI2026
链接: https://arxiv.org/abs/2607.01656
作者: Jueqi Wang,Zachary Jacokes,John Darrell Van Horn,Kevin A. Pelphrey,Michael C. Schatz,Archana Venkataraman
类目: Machine Learning (cs.LG)
*备注: Accepted to MICCAI 2026
Abstract:The interaction between brain structure and genetic influences is key to understanding neuropsychiatric disorders. However, most large-scale datasets are unimodal, providing either neuroimaging or genetics data. We propose CALM, a framework that learns interpretable associations between brain ROIs and genetic pathways from completely disjoint populations. CALM aligns the two modalities in a shared latent space via linear projections that simultaneously match the class-conditional latent distributions and ensure group separability. These projections provide interpretable pathway–ROI associations. When trained on unimodal imaging and genetics datasets, CALM generalizes to an unseen paired dataset, outperforming several state-of-the-art methods and ablation baselines. We also demonstrate stability of the learned associations against a paired baseline. Our experiments on autism spectrum disorder reveal immune and metabolic pathways linked to specific cortical regions and are consistent with established literature. Thus, CALM opens the door to leveraging large unimodal repositories for studying cross-modal interactions in brain disorders across disparate datasets.
[LG-44] DeadPool: Resilient LLM Training with Hot-Swapping via Zero-Overhead Checkpoint
链接: https://arxiv.org/abs/2607.01646
作者: Haotian Xie,Junlin Chen,Mingkai Zheng,Lishan Yang,Zhao Zhang
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:State-of-the-art large language model (LLM) training takes tens of thousands of graphics processing units (GPUs) for months and encounters failures across the software and hardware stack. Existing fault-tolerance mechanisms either impose non-trivial overhead during failure-free execution or suffer from prolonged recovery latency, particularly under scenarios where a small subset of compute nodes experience permanent failures. %The tradeoff between failure-free overhead and recovery latency forms a space forms a Pareto frontier We present DeadPool to simultaneously address both optimization objectives. DeadPool incorporates a fault-tolerance mechanism that restores LLM training via hot-swapping, namely by replacing failed nodes with spare nodes without terminating the complete job. The hot-swapping of DeadPool is enabled by two ideas: First, it exploits an off-critical-path in-memory checkpointing mechanism for spatial redundancy. Second, it introduces a communicator reconstruction protocol that replaces failed nodes with spare nodes at runtime. DeadPool efficiently overlaps the in-memory checkpointing with computation, thus introducing zero overhead during error-free execution. Upon permanent node failures, DeadPool can rebuild memory states with minimal recomputation by leveraging in-memory checkpoints. We evaluate DeadPool across scales (up to 512 NVIDIA A100 GPUs) and LLMs (up to 65B parameters), and observe zero checkpoint overhead with hot-swapping recovery completing in under 40 seconds. These results show that DeadPool simultaneously achieves both zero-overhead error-free execution and extremely low recovery cost.
[LG-45] SINA: A Fully Automated Circuit Schematic Image to Netlist Generator Using Artificial Intelligence
链接: https://arxiv.org/abs/2607.01609
作者: Saoud Aldowaish,Yashwanth Karumanchi,Kai-Chen Chiang,Mohammed Ayman Habib,Finn Murphy,Rishen Cao,Morteza Fayazi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recent advances in Artificial Intelligence (AI) have revolutionized Electronic Design Automation (EDA), particularly through Large Language Models (LLMs) for circuit design tasks. However, their application to analog and mixed-signal domains remains limited by the lack of machine-readable representations of existing circuit design knowledge. Circuit schematic images found in research manuscripts, textbooks, and websites constitute a vast repository of validated designs; however, these visual representations cannot be directly processed by EDA tools. Converting them into machine-readable netlists is essential for enabling simulation, verification, and building comprehensive databases for AI-based models. Current conversion methods lack generalization across both Integrated Circuit (IC) and Printed Circuit Board (PCB) level schematics. Moreover, they struggle with component recognition and connectivity inference, and fail to distinguish between connected junctions and crossing wires. In this paper, we propose SINA, an open-source circuit schematic image-to-netlist generator. SINA is a fully automated pipeline that integrates deep learning for robust component detection, connected-component labeling for accurate connectivity inference, Optical Character Recognition (OCR) for component reference designator extraction, and a Vision-Language Model (VLM) for reliable reference designator assignment. SINA handles both IC- and PCB-level schematics and incorporates dedicated crossing-wires detection to differentiate wire intersections from connections. We validate the correctness of the generated netlists using graph isomorphism techniques. Our experiments demonstrate an overall netlist generation accuracy of 96.67%, which is 2.72x higher compared to state-of-the-art approaches.
[LG-46] Geometric Signatures of Reasoning : A Spectral Perspective on Task Hardness
链接: https://arxiv.org/abs/2607.01571
作者: Aria Masoomi,Mahsa Bazzaz,Adel Javanmard,Vahab Mirrokni
类目: Machine Learning (cs.LG)
*备注:
Abstract:Chain-of-thought (CoT) reasoning enables large language models (LLMs) to solve complex problems by generating intermediate reasoning steps. While much attention has been paid to the length and content of these reasoning chains, far less is known about their internal geometry. We study the \emphgeometry of CoT trajectories in the hidden state space of transformer models, formalizing each reasoning chain as a discrete curve in \mathbbR^d and characterizing it through spectral, positional, and kinematic geometric functionals. We introduce the effective dimension d_\rho as a measure of trajectory complexity and show theoretically that trajectories with flatter eigenvalue spectra correspond to harder tasks, as they explore more of the hidden dimensions. Lastly, we explore how kinematic features of the trajectory, mean position, positional dispersion, initial and current hidden states, mean velocity, mean speed, and speed dispersion, can be used to predict solution correctness before generation is complete, and may inform future early-stopping strategies. Experimentally, on mathematical reasoning problems from the MATH500 dataset, d_\rho achieves 0.93 AUC in distinguishing easy from hard problems, while kinematic features potentially can predict correctness from only the first 20% of generated tokens. These correctness signatures transfer across questions of varying difficulty, establishing that the shape of a model’s internal reasoning trajectory is a principled window into both task hardness and solution quality.
[LG-47] Certified World Models as Sensing Clocks: Drift-Aware Deadlines for Active Perception
链接: https://arxiv.org/abs/2607.01537
作者: Hongbo Wang
类目: Machine Learning (cs.LG)
*备注: 15 pages, 3 figures, 6 tables. Preprint
Abstract:Certified world models estimate how long their predictions remain valid. We turn this validity horizon into an operational sensing clock: a rule for when an agent should stop coasting and re-sense. Starting from an audited equivariant world model, we derive a deadline for no-sensing intervals and show that deployable deadlines in learned world models must be drift-aware: on-manifold Lyapunov rates alone overestimate coasting validity, while calibrated native rollout-drift envelopes carry the deployed guarantee. On a frozen 3D VN-JEPA model, the resulting clock controls held-out interval-simultaneous certificate violation across seeds and data shards. In a cue-conditioned theorem-bed (a synthetic bench where all schedulers share the exact model, isolating the scheduling rule), the clock remains valid on the deployment distribution and substantially reduces eventful-tail violations relative to exact-mixture expected-belief scheduling at matched sensing budget. We also report limits: in the short-horizon frozen VN-JEPA regime, empirical conformal horizons match the deployed clock on validity and budget, and a partial-reset exploration finds no clean budget-matched advantage for the spectral term. Thus the contribution is a certified sensing-clock primitive and drift-aware deployment method, not a claim that spectral clocks empirically dominate all non-spectral schedulers.
[LG-48] Wind-Aware Reinforcement Learning Control of a Small Quadrotor Using Learned Onboard Wind Estimation in Simulated Atmospheric Turbulence
链接: https://arxiv.org/abs/2607.01528
作者: Abdullah Al Tasim,Wei Sun
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:Small multirotor aircraft are increasingly tasked with operations in the atmospheric boundary layer, where turbulent winds comparable to the vehicle’s airspeed degrade trajectory tracking and can defeat conventional feedback control. This work illustrates a two-stage learning pipeline that first estimates the local wind from onboard kinematics and dynamics and then exploits that estimate inside a reinforcement learning (RL) flight controller. The wind estimator, an attention-augmented gated recurrent network trained on thousands of simulated flights through von Karman turbulence with power-law shear and veer, recovers the horizontal wind vector with a per-flight root-mean-square error of 0.40 m/s and a direction error of 3.2 degrees on unseen wind regimes, an accuracy near the floor imposed by unresolved turbulence, and generalizes to vertical ascent profiles with a skill score of 0.861 over a constant-wind reference. A proximal policy optimization controller receiving the frozen estimator’s output reduces horizontal trajectory tracking error by 48% relative to a wind-blind proportional-derivative baseline across mean winds of 4 m/s to 12 m/s, winning on 100% of evaluation episodes. A three-way ablation decomposes this improvement into a kinematic component, available without wind information, and a wind-perception component; the perception share rises with wind speed, from small in light winds toward roughly half the total benefit in strong winds, consistent with the quadratic scaling of aerodynamic drag. The controller degrades gracefully on out-of-distribution winds of 13 m/s to 15 m/s, where the baseline fails catastrophically.
[LG-49] Quantifying the Uncertainty of Blindly Estimated Room Embeddings Using a Dispersion-Calibrated Score INTERSPEECH2026
链接: https://arxiv.org/abs/2607.01527
作者: Yang Xiang,Philipp Götz,Emanuël A. P. Habets,Andreas Walther,Wenwu Wang,Philip J. B. Jackson
类目: ound (cs.SD); Machine Learning (cs.LG)
*备注: Accepted to INTERSPEECH 2026
Abstract:Room embeddings derived from reverberant speech are often unreliable: speech content and recording degradation can alter the representation even when speaker, room, and source-receiver geometry remain unchanged, degrading downstream task performance. We propose a framework that learns room embeddings robust to speech-content variation and a representation-level uncertainty score from reverberant speech without downstream-task supervision. The embedding is anchored to a structured room impulse response (RIR) latent space and trained using a multi-view data structure with Kullback-Leibler (KL)-based alignment; a multi-positive contrastive term further refines robustness. A lightweight uncertainty head is calibrated using the dispersion of corruption-induced embeddings and optimized with a rank-based objective. Across waveform- and spectrogram-level corruptions, the score is consistent with representation dispersion and enables effective selective prediction while requiring only a single utterance at inference.
[LG-50] he risk of KV cache compression
链接: https://arxiv.org/abs/2607.01520
作者: Lukas Haverbeck,Carmen Amo Alonso,Andres Felipe Posada-Moreno,Sebastian Trimpe,Marco Pavone
类目: Machine Learning (cs.LG)
*备注:
Abstract:Transformer inference on long sequences is expensive because softmax attention repeatedly reads from a large KV cache. The prevalent approach to this bottleneck is KV cache compression, which replaces the full cache with a compact summary. Despite its practical importance, the design of such summaries is largely driven by empirical experimentation. On the theoretical side, existing results show that KV cache compression can be impossible in the worst case, but offer little systematic guidance for designing algorithms in regimes where accurate compression is possible. We bridge this gap by characterizing the minimax risk of KV cache compression in terms of the intrinsic compressibility of a cache, revealing when and how accurate compression is possible. These results yield novel design principles for KV cache compression under causal masking that map efficiently to prefill and autoregressive decoding while achieving minimax-optimal risk. We instantiate these principles in a practical algorithm and report promising performance on LongBench in targeted experiments. Overall, our results provide a principled avenue for practical KV cache compression with theoretical guarantees.
[LG-51] owards Learning Representations of Policies in Two-Player Zero-Sum Imperfect-Information Games
链接: https://arxiv.org/abs/2607.01498
作者: Kevin Wang,Kevin Yang,Arjun Prakash,Amy Greenwald
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注: 7 pages, 4 figures, 3 tables
Abstract:We investigate the problem of learning useful policy representations (embeddings) in two-player zero-sum imperfect-information games. We make three contributions: First, we introduce methods of creating datasets of policies for a given game. Second, we propose methods to learn policy representations. Third, we introduce downstream tasks to evaluate the effectiveness of such representations. We evaluate each dataset method, embedding method, and downstream task on Kuhn and Leduc Poker. Although our methods are very basic, we demonstrate that useful behavioral representations are present in the learned embeddings. To our knowledge, this work is among the first to systematically compare self-supervised learning techniques for learning policy representations in games. Our code is available at this https URL for others to extend. Comments: 7 pages, 4 figures, 3 tables Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT) Cite as: arXiv:2607.01498 [cs.LG] (or arXiv:2607.01498v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2607.01498 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-52] Unveiling the Non-Monotonic Effect of Privacy on Generalization under Byzantine Robustness
链接: https://arxiv.org/abs/2607.01492
作者: Thomas Boudou,Batiste Le Bars,Nirupam Gupta,Aurélien Bellet
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注:
Abstract:Recent work has established a fundamental trilemma between Byzantine robustness, local differential privacy (LDP), and optimization error in distributed learning. We show that this trilemma does not universally extend to generalization error, but instead depends critically on the privacy regime. Specifically, in the high-noise regime (strong privacy), we prove that increasing privacy reduces the generalization error, i.e., there is no tension between robustness and privacy. In the low-noise regime (weaker privacy), however, the tension between robustness and privacy reappears and increasing privacy indeed degrades generalization. Our theory explains this surprising non-monotonic behavior of the generalization error via matching lower and upper bounds on the algorithmic stability of Byzantine-robust distributed learning under LDP constraints. We corroborate and further analyze these theoretical findings with empirical evaluations.
[LG-53] How to Allocate Your Tokens? Scaling Laws with Training Steps and Batch Size
链接: https://arxiv.org/abs/2607.01487
作者: Fabian Schaipp
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We propose a scaling law that takes into account model size and training data while explicitly splitting the latter into training steps and batch size (called three-term law). Fitting the proposed law on a large set of training runs, we find that it correctly recovers the scaling of the optimal batch size. Moreover, because it makes use of training runs with suboptimal batch size, our proposed law can be robustly fit with a significantly smaller amount of training runs. We further show that the three-term law can be used to derive scaling laws for suboptimal batch sizes, and that it matches previous empirical findings related to the critical batch size.
[LG-54] Class-Grouped Normalized Momentum and Faster Hyperparameter Exploration to Tackle Class Imbalance in Federated Learning
链接: https://arxiv.org/abs/2607.01474
作者: Haemin Park,Diego Klabjan,Martin W. Braun,Xiuqi Li,Balakrishnan Ananthanarayanan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Class imbalance poses a critical challenge in federated learning (FL), where underrepresented classes suffer from poor predictive performance yet cannot be addressed by standard centralized techniques due to privacy and heterogeneity constraints. We propose FedCGNM (Federated Class-Grouped Normalized Momentum), a client-side optimizer in FL that partitions classes into a small number of groups based on minimum within-group variance, maintains a momentum per group, normalizes each group momentum to unit length, and uses the summation of the normalized group momentums as an update direction. This design both equalizes gradient magnitude across majority and minority groups and mitigates the noise inherent in rare-class gradients. We further provide a theoretical convergence analysis explicitly accounting for time-varying resampling-rates. Additionally, to efficiently optimize these rates in small-client regimes, we introduce FedHOO, an X-armed-bandit (XAB) based algorithm that exploits federated parallelism that evaluates many combinations of two candidate rates per client at linear cost. Empirical evaluation on four public long-tailed benchmarks and a proprietary chip-defect dataset demonstrates that FedCGNM consistently outperforms baselines, with FedHOO yielding further gains in small-scale federations.
[LG-55] Geometry-Aware R-Structured Kolmogorov-Arnold Networks
链接: https://arxiv.org/abs/2607.01449
作者: Sergei Kucherenko,Nilay Shah
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 27 pages, 13 figures
Abstract:We propose a novel hybrid neural architecture, the Geometry-aware R-Structured Kolmogorov-Arnold Network (GRS-KAN), which integrates this http URL’s R-functions into the Kolmogorov-Arnold Network (KAN) framework. The proposed approach combines two complementary modeling mechanisms: smooth nonlinear structure is learned by KAN branches, while known geometric or logical constraints are encoded analytically using differentiable R-functions. This enables explicit representation of discontinuities, feasible regions, and implicit geometric boundaries within a trainable neural architecture. The framework implements differentiable logical operations through R-conjunctions and R-disjunctions, allowing complex geometric supports to be represented analytically and incorporated directly into regression models. Several GRS-KAN variants are introduced, including additive, multiplicative, and agnostic branch-weighted architectures. The method is demonstrated on regression problems involving discontinuities with circular and rectangular supports. Numerical experiments show that explicit geometric encoding substantially improves predictive accuracy and boundary localization compared with standard KANs. In the considered benchmarks, geometry-aware GRS-KAN models reduce test RMSE by up to 67% while simultaneously improving interpretability through explicit analytical representation of the learned geometric structure. The agnostic variant further demonstrates the ability to automatically determine whether geometric priors are beneficial for a given learning task. Comments: 27 pages, 13 figures Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA) MSC classes: 68T07, 65D15 ACMclasses: I.2.6; G.1.2; J.2 Cite as: arXiv:2607.01449 [cs.LG] (or arXiv:2607.01449v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2607.01449 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-56] Hamm-Grams: An Algorithm for Mining Regular Expressions of Bytes
链接: https://arxiv.org/abs/2607.01445
作者: Derek Everett,Edward Raff,James Holt
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: To appear in Machine Learning for Malware Detection
Abstract:Malware poses a critical and ever-evolving threat, and robust and effective systems for detecting and classifying malware are of essential importance. n -grams features are among the common static features used in effective machine learning systems for malware, but these features are inherently brittle. We propose an algorithm for constructing more robust features, hamm-grams, which are a special class of regular expressions having a fixed length and single-character wildcards. We devise an efficient algorithm for finding common hamm-grams using a new locality-sensitive hash designed to produce collisions among pairs of small Hamming distance and a clustering within hash buckets to place wildcards. We then demonstrate the advantages of these features in malware classification and detection tasks.
[LG-57] Conditional Inference Trees and Forests for Feature Selection
链接: https://arxiv.org/abs/2607.01417
作者: Robert Milletich,Justin Downes,Steve Goley,Newel Hirst
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 38 pages, 9 figures
Abstract:Conditional inference trees (CIT) and conditional inference forests (CIF) reduce split-selection bias by testing features before choosing split thresholds, but repeated permutation tests and threshold searches can make these methods computationally expensive. We study CIT and CIF as top- k feature-ranking methods for downstream prediction using real-data benchmarks, runtime ablations, and synthetic feature-recovery experiments. At a fixed node, if the features and permutation budget do not depend on the node responses, Bonferroni-corrected +1 Monte Carlo permutation p -values control nodewise rejection under the complete permutation null. CIF ranks 4th among 17 classification methods on 22 datasets and 3rd among 18 regression methods on 8 datasets. With Bonferroni correction held fixed, the CIF runtime ablations indicate that adaptive stopping and the number of thresholds searched have the largest measured effect on runtime: turning off adaptive stopping and using exact threshold search increase fitting time by 4.0–8.4 \times and 1.9–10.8 \times , respectively, while downstream score changes are at most 0.011. Sparse high- p simulations indicate that forest feature sampling can leave informative features out of many split decisions. Overall, the results support CIF as a top- k feature-ranking method in the evaluated downstream prediction benchmarks.
[LG-58] he Rollout Infrastructure Tax in Coding-Agent Reinforcement Learning SOCC2026
链接: https://arxiv.org/abs/2607.01415
作者: Daniel Thi Graviet,Lovre Pesut,Ivan Dagelic,Vedran Jukic,Ivan Burazin
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Preprint. 6 pages, 6 figures, 2 tables. Submitted to ACM SoCC 2026
Abstract:Coding-agent reinforcement learning treats execution infrastructure as a background implementation detail, despite relying on large numbers of interactive software rollouts. This is a missed opportunity: measuring infrastructure overhead can reveal practical efficiency gains for RL post-training, where small per-rollout savings compound at scale. We present a comparative study of four execution substrates: single containers, hosted sandboxes, Kubernetes-orchestrated containers, and cloud virtual machines. We find up to 110\times variation in cold-start latency and a 1.8\times spread in projected worker-hours for one million 150-step trajectories. Our results suggest that future coding-agent RL systems should optimize execution substrates as part of the training system itself, not merely as deployment plumbing.
[LG-59] BIFROST: Bridging Invariant Feature Representation for Observation-space Sim2Real Transfer
链接: https://arxiv.org/abs/2607.01410
作者: Yunfu Deng,Josiah P. Hanna
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Sim2real transfer for robot policy learning suffers due to mismatch between simulation and reality. Existing methods typically address each gap in isolation through separate adaptation modules, which are composed or layered when both gaps coexist. Yet the basis for attempting sim2real in the first place is that there is shared structure between a task in simulation and reality, where equivalent actions from equivalent configurations produce equivalent long term outcomes regardless of domain specific differences in rendering or physics. In this paper, we study whether we can identify and exploit this shared structure from raw observations to train a policy that enables zero shot transfer. We introduce BIFROST, which learns a shared history encoder on paired cross-domain data via cross-domain bisimulation objective: observation-action sequences leading to equivalent long-term behavior are mapped to nearby latent states, regardless of domain. Policies trained on these latent states in simulation transfer zero-shot to reality. We provide empirical evidence on sim2sim visual navigation and sim2real contact rich manipulation task and visual servoing task that BIFROST achieves effective transfer where domain adaptation and co-training baselines fail under both visual and dynamics domain gaps.
[LG-60] A global predicted-fMRI drive signal from TRIBE does not predict YouTube replay heatmaps
链接: https://arxiv.org/abs/2607.01400
作者: Barada Sahu,Shivesh Pandey
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 7 pages, 1 figure. Code, video-ID manifest, and per-video results: this https URL
Abstract:Deep multimodal brain-encoding models now predict fMRI responses to naturalistic video with high accuracy. Whether their predicted neural signals also forecast behavioral engagement is unknown. We run TRIBE, the winning model of the 2025 Algonauts brain-encoding challenge (Llama-3.2 + V-JEPA2 + Wav2Vec-BERT), on 48 YouTube videos and reduce its predicted cortical response to a per-second engagement curve, the global field power. Correlated against each video’s “most replayed” heatmap, a passively-collected proxy for which moments viewers return to, the curve shows no evidence of predicting re-watch behavior. The pooled position-controlled partial correlation is +0.058 (95% CI [-0.04, 0.15]; one-sample t(47)=1.21, p=0.23), indistinguishable from zero and not significantly above simple loudness and motion baselines (loudness +0.04, paired p=0.74). The raw correlation is also near zero; the moderate values reported for music videos reflect a genre-specific intro/onset-replay artifact rather than content prediction, and do not generalize. The null holds across six cortical-network readouts and under an autocorrelation-preserving permutation test. We release the code, the video-ID manifest, and an acquisition method that works despite YouTube’s SABR-only streaming.
[LG-61] From Approximation to Emergence: A Theory of Deep Learning
链接: https://arxiv.org/abs/2607.01311
作者: Zhilin Zhao
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Deep learning has outgrown any single mathematical explanation. From Approximation to Emergence develops a unified, proof-oriented account of modern deep learning theory, tracing a path from the classical foundations of approximation, optimization, and generalization to the contemporary mechanisms of overparameterization, robustness, generative modeling, transformers, in-context learning, scaling laws, interpretability, alignment, and emergence. Rather than presenting isolated results, the book organizes a broad literature into a coherent research narrative: each theory is examined through the object it controls, the assumptions that make it valid, and the phenomena it leaves unexplained. Written for researchers, graduate students, and mathematically trained practitioners, this monograph offers a rigorous map of deep learning theory as it stands today: powerful, incomplete, and increasingly centered on the question of how learned mechanisms arise from scale, data, architecture, and training.
[LG-62] A Novel Machine Learning Approach for Central Nervous System Tumor Classification from DNA Methylation
链接: https://arxiv.org/abs/2607.01307
作者: Paulo R. Ferreira Jr.,Lucas Coutinho Freitas,Laís dos Santos Gonçalves,William Borges Domingues,Lucas Petitemberte de Souza,Mariana B. Michalowski,Vinicius F. Campos
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注:
Abstract:NA methylation profiling has become a powerful approach for central nervous system (CNS) tumor classification, yet important challenges remain regarding cross-cohort transferability, methodological correctness, and robust multiclass evaluation. In this work, we propose a novel and methodologically rigorous machine-learning approach for methylation-based CNS tumor classification that combines Sparse Random Projection for dimensionality reduction with multinomial logistic regression for classification. We evaluate the proposed approach in the same general experimental setting established by a widely used reference classifier. On the 2,801-sample reference cohort, our method achieves a mean accuracy of 96% under stratified 3-fold cross-validation. On the independent 1,104-sample clinical evaluation cohort, it reaches 86% accuracy at the 91-class level and 93% when predictions are evaluated at the methylation class family level. These results improve upon the corresponding state-of-the-art reference figures of 82% class-level concordance and 88% family-level concordance, yielding absolute gains of approximately 4 and 5 percentage points, respectively. This improvement is clinically relevant: in a diagnostic setting, a 5-point increase in correct tumor classification can directly affect cancer subtype assignment and, in turn, influence treatment selection and downstream clinical decision-making. Our results show that the proposed model, grounded in stronger methodological practice in machine learning, consistently outperforms the previous state of the art across evaluation settings and can materially improve the reliability of CNS tumor classification.
[LG-63] IonSense-QKG: A Quantum-Readiness Metadata Framework for Lithium-Ion Battery Dataset Discovery
链接: https://arxiv.org/abs/2607.01286
作者: Sakthi Prabhu Gunasekar,Prasanna Kumar Rangarajan
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注: 7 pages, 1 figure, 4 tables. Code and metadata artifact available at GitHub
Abstract:Public lithium-ion battery datasets are increasingly used for state-of-health estimation, remaining-useful-life prediction, anomaly detection, electrochemical diagnostics, second-life analytics, and battery safety research. However, these datasets vary substantially in chemistry, modality, scale, label quality, sequence structure, access status, and preprocessing complexity. These differences directly affect whether a dataset is feasible for near-term hybrid quantum-classical machine-learning workflows. This paper presents IonSense-QKG, a quantum-readiness metadata framework for lithium-ion battery dataset discovery. Starting from the EV-Battery-IonSense index, the proposed framework enriches public battery dataset records with quantum-relevant metadata, including task type, sensing modality, chemistry, label availability, sequence type, preprocessing requirements, candidate quantum encodings, estimated qubit range, and NISQ feasibility. A transparent Quantum Readiness Score is introduced to rank datasets as candidate resources for future hybrid quantum-classical battery benchmarks. The score is intended as a dataset-selection heuristic, not as evidence of quantum advantage. The framework demonstrates query-based discovery over enriched metadata to identify datasets suitable for compact quantum feature maps, quantum time-series workflows, limited-label anomaly detection, and future battery-health benchmarking. The released artifact includes metadata tables, scoring scripts, robustness checks, link-checking utilities, and SQL-style query examples. IonSense-QKG positions dataset selection as a data-management problem and provides a reproducible foundation for data-centric quantum battery analytics. Comments: 7 pages, 1 figure, 4 tables. Code and metadata artifact available at GitHub Subjects: Machine Learning (cs.LG); Databases (cs.DB) Cite as: arXiv:2607.01286 [cs.LG] (or arXiv:2607.01286v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2607.01286 Focus to learn more arXiv-issued DOI via DataCite
[LG-64] Fixed-Set Robustness in Programming by Example: Example Corruption and Semantic Partition Recovery
链接: https://arxiv.org/abs/2607.01280
作者: Yuan Si,Jialu Zhang
类目: Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注:
Abstract:Programming-by-example systems infer programs from a small set of input-output examples. Robust PBE work usually models wrong examples as samples from a stochastic noise process and then minimizes an expected or empirical loss. This paper studies a different failure mode: an adversary who sees the synthesizer and chooses the examples whose corruption most damages the returned program. We formalize fixed-set worst-case corruption for finite PBE version spaces, implement exact-within-bounded-pool and heuristic corruption searches for a string-transformation DSL, and introduce version-space partition aggregation (VPA), a defense that synthesizes on disjoint example groups and votes by semantic signatures. The central claim is deliberately bounded and partly negative: low-margin PBE tasks have an adversarial robustness dimension that random-typo and noisy-PBE evaluations miss, while semantic partition aggregation helps only when the clean semantics keep a partition vote margin, which often fails on realistic tasks. Evidence from curated/generated DSL tasks, accepted public SyGuS PBE_SLIA slices, SYNTRA Playgol v2, and noisy-PBE objective baselines supports that boundary. One curated edit flips all 8 spike tasks while 200-trial typo, DSL-pool, and distance-matched random controls succeed on 10.3%, 11.0%, and 16.7%; generated margin-1 rows flip under budget 1 yet VPA recovers them; on public SyGuS the vote margin is near one, so an adaptive attacker drives VPA accuracy to zero; accepted public SyGuS slices move across exact-within-pool budget boundaries; and Playgol shows positive paired-bootstrap gaps against typo and same-pool random controls on the 141 accepted rows. A small exact-output prompt harness over 20 controlled margin-1 tasks shows the same qualitative clean-to-attacked pattern across local and API models, while it is treated as a scope check, not a broad LLM benchmark.
[LG-65] Itextsuperscript2RiMA: Spectral Riemannian Representation with Temporal Attention for Mental Stress Detection based on EEG Signals
链接: https://arxiv.org/abs/2607.01279
作者: Cheng He,Kunyu Peng,Shangen Han,Jinming Ma,Jinhong Ding,Likun Xia
类目: Machine Learning (cs.LG)
*备注:
Abstract:Cross-subject EEG stress detection remains challenging because discriminative stress-related patterns are both subject-dependent and frequency-specific. Conventional Riemannian methods model spatial covariance mainly in the time domain, overlooking neural oscillations that are critical for high-level cognitive state decoding, while standard temporal tokenization often fragments inter-slice temporal coherence. To address these limitations, we propose \method, an Intra-Inter Riemannian Manifold Attention Network for EEG-based stress detection. \method constructs spatial covariance matrices independently at each frequency point and maps them to the SPD tangent space, preserving channel-wise geometry together with frequency-specific discriminative cues. It further introduces frequency cluster aggregation to select informative spectral components and reduce redundancy by forming compact, data-driven frequency clusters aligned with EEG rhythms. Finally, an intra-inter slice attention module adaptively integrates local slice-level spectral dynamics and global temporal context across EEG sequences. Experiments on three datasets show that \method consistently outperforms five state-of-the-art baselines, achieving up to 82.78% balanced accuracy while remaining efficient with only 1.60M parameters and 31.95M FLOPs.
[LG-66] Multilayer Q-Matrix-Embedded Neural Network for Cognitive Diagnosis (M-QCDNet): Structure-Aware Deep Learning Architecture for Psychometric Interpretability
链接: https://arxiv.org/abs/2607.01278
作者: Yiyao Yang
类目: Machine Learning (cs.LG)
*备注: 15 pages, 3 tables
Abstract:The research proposes a multilayer Q-matrix-embedded neural network for cognitive diagnosis (M-QCDNet), which integrates the structural interpretability of cognitive diagnostic models (CDMs) with the deep learning neural network (NN). M-QCDNet structures the item-skill relationship using the Q-matrix as a structural prior, ensuring latent mastery profiles remain interpretable and consistent with cognitive theory, followed by the proposed loss function with an L2 penalty to penalize skills not aligned with the Q-matrix and to balance predictive performance and structural alignment. Corresponding evaluation matrices, the interpretable alignment-based metrics that quantify the degree to which predicted skill activations correspond to item-level skills, were further developed. M-QCDNet offers practical benefits for classroom practice, enabling early detection of learning difficulties and supporting mastery-based interventions. By embedding diagnostic validity into model design, M-QCDNet bridges psychometric transparency and neural flexibility, advancing interpretable, fair, and actionable AI for cognitive diagnostics.
[LG-67] Optimal Stabilizer Testing and Learning with Limited Quantum Memory
链接: https://arxiv.org/abs/2607.02444
作者: Srinivasan Arunachalam,Louis Schatzki
类目: Quantum Physics (quant-ph); Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 66 pages, 5 figures
Abstract:We study stabilizer state testing and learning with limited coherent quantum memory. Here an algorithm sequentially receives copies of an unknown n -qubit state, but may keep only k qubits of coherent quantum memory between measurements. With unrestricted memory, seminal work of Gross, Nezami and Walter showed how to test n -qubit stabilizer states using 6 copies, which is dimension independent, unlike the learning complexity of \Theta(n) . We show that this testing-vs-learning separation is lost under memory constraints. More concretely we show that (1) The sample complexity of testing stabilizer states in the k -qubit memory framework is \Theta(n-k) . Our upper bound goes via a novel connection to the hidden shift problem and the lower bound is proven using a novel approach to average case bounds on likelihood ratios via combinatorics of the stochastic orthogonal group. (2) The sample complexity of learning stabilizer states with k qubits of memory, in the non-adaptive framework, is \Theta(n^2/k) . As a further application of our techniques, we prove an exponential lower bound for purity testing even when the memory may be left coherent throughout the protocol. Our main results identify coherent quantum memory as the resource enabling the usual separation between stabilizer testing and learning. In particular, even with k=0.99n qubits of memory, there is no constant-copy stabilizer tester; furthermore for k=cn qubits of memory (for 0 c 1 ), stabilizer testing is as hard as learning, with both requiring \Theta(n) copies. Comments: 66 pages, 5 figures Subjects: Quantum Physics (quant-ph); Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Information Theory (cs.IT); Machine Learning (cs.LG) Cite as: arXiv:2607.02444 [quant-ph] (or arXiv:2607.02444v1 [quant-ph] for this version) https://doi.org/10.48550/arXiv.2607.02444 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Louis Schatzki [view email] [v1] Thu, 2 Jul 2026 17:11:38 UTC (2,148 KB)
[LG-68] Q-GAIN: A Python Package for Machine Learning and Physically Informed Analysis Applications
链接: https://arxiv.org/abs/2607.02413
作者: M. Doris,S. Guo,S. M. Koh,L. Ritter,A. R. Fritsch,S. Mukherjee,I. B. Spielman,J. P. Zwolak
类目: Quantum Gases (cond-mat.quant-gas); Machine Learning (cs.LG)
*备注: Submission to SciPost, 20 pages with 4 figures
Abstract:Here we describe the quantum gas analysis and inference (Q-GAIN) Python package, which enables rapid deployment of machine learning (ML) and physics-informed analysis techniques for cold-atom experiments. Out of the box, Q-GAIN implements classification, object detection, and physics-informed metrics for feature detection in images of atomic Bose-Einstein condensates (BECs). Q-GAIN encourages a natural, module-based workflow: starting with data loading and preprocessing, followed by ML-based feature identification, and ending with conventional analysis techniques. We demonstrate this modularity by configuring Q-GAIN for three ML tasks. First, we demonstrate the basic workflow of the Q-GAIN framework by implementing the standard task of classifying handwritten digits from the MNIST dataset. Then, we re-implement our earlier soliton detection (SolDet) package in the Q-GAIN framework, enabling the detection and analysis of solitonic excitations in time-of-flight data. Finally, we develop an object-detection tool that identifies quantized vortices in images of ring-shaped BECs.
[LG-69] Aggregation with Exponential Weights is Optimal in Expectation
链接: https://arxiv.org/abs/2607.02247
作者: Mikael Møller Høgsgaard,Patrick Rebeschini,Tobias Wegel
类目: atistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:The aggregation with exponential weights (AEW) estimator is not fully understood in the basic setting of model selection aggregation with squared loss. In particular, whether it is minimax-rate optimal in expectation for large enough fixed temperatures and under random design has been an open problem since its introduction, which was explicitly posed by Lecué and Mendelson (2013). In this paper, we settle this problem by showing that \emphwithout requiring a Bernstein-type assumption, the AEW indeed achieves the excess risk T \log (M) / (n+1) in expectation, whenever the temperature T satisfies (L^2/T)\exp(B/T)\leq \mu /2 . Here, the number of dictionary elements is M , the estimator has observed n i.i.d. samples from any distribution, and the loss is assumed to be bounded by B , L -Lipschitz continuous and \mu -strongly convex. For squared loss, we show that T\geq 4 b^2 suffices when the predictions and labels are [0,b] -valued. Because AEW is known to be suboptimal in expectation for temperatures below some constant, this shows that AEW has a sharp phase transition when the temperature is large enough but constant, as conjectured by Lecué and Mendelson.
[LG-70] An Additive MLP-GNN Framework for Characterizing Chemical and Structural Contributions to Aqueous Solubility
链接: https://arxiv.org/abs/2607.02212
作者: Sampreeti Bhattacharya,Arkaprava Roy
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Aqueous solubility is a key property in early-stage drug discovery, but most predictive models merge physicochemical descriptors and molecular graph information into a single representation, obscuring whether a prediction is driven by global chemistry, molecular structure, or both. We present an additive deep-learning framework that keeps these two sources of information separate throughout training: physicochemical descriptors are encoded by a multilayer perceptron (the chemical branch) and molecular graph topology by a graph neural network (the structural branch), with the two outputs combined only at the prediction stage through an additive model with an optional multiplicative interaction. This design provides a direct decomposition of chemical and structural components that can be examined separately after training. Furthermore, pretraining on the larger AqSolDB dataset and fine-tuning on the smaller BigSolDB2 dataset substantially improve accuracy and reduce run-to-run variations, indicating generalizability of the learned features from the data-rich settings. We further interpret the fitted model using best linear projections of the branch outputs, molecule-level embedding summaries across solubility classes, and atom-level GNNExplainer masks aggregated over functional groups. These analyses show that the chemical branch aligns with familiar physicochemical descriptors, while the structural branch captures graph-topological and functional-group patterns associated with solubility. Across both datasets, the framework attains competitive predictive performance while making the distinct roles of chemical and structural information more transparent.
[LG-71] Prediction Sets for Counterfactual Decisions: Coverag e Optimality and Conformal Prediction
链接: https://arxiv.org/abs/2607.02206
作者: Yurui Zheng,Ying Jin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:Predictions are increasingly used to guide high-stakes decisions, from treatment selection to policy making. To ensure reliability with imperfect predictions, uncertainty quantification methods such as conformal prediction build prediction sets with coverage guarantees. However, statistical validity alone does not immediately determine the decisions to take, nor the optimality thereof. This gap is especially delicate in counterfactual settings where the outcome that materializes depends on the action taken, so uncertainty cannot be specified independently of the decision rule. We develop a decision-theoretic framework for uncertainty-informed counterfactual decisions. We identify a novel notion of \emphpolicy-coupled coverage – namely, coverage of the realized outcome under the action induced by the prediction sets themselves – as the optimal and lossless interface between uncertainty and action. It plays three roles. First, it justifies acting via a natural max-min rule as minimax-optimal under distributional ambiguity. Second, optimizing prediction sets under policy-coupled coverage is equivalent both to a stronger universal-coverage formulation and to the direct risk-averse optimization over policies and utility certificates; this equivalence yields the explicit form of the population-optimal prediction sets. Third, it admits a two-stage procedure, Policy-Coupled Risk-Averse Conformal Prediction (PC-RACP), that approximates these optimal sets with rigorous finite-sample coverage. Simulations and a real email-marketing experiment confirm that PC-RACP delivers higher utility than existing approaches while maintaining valid coverage, and that ignoring the counterfactual structure of the decision problem is suboptimal for both validity and utility. Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST) Cite as: arXiv:2607.02206 [stat.ML] (or arXiv:2607.02206v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2607.02206 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-72] Fourier Preconditioning for Neural Feature Learning
链接: https://arxiv.org/abs/2607.02199
作者: Preston Pitzer,Anish Pradhan,Harpreet S. Dhillon
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: Accepted for publication in IEEE Signal Processing Letters
Abstract:Mutual information (MI)-inspired feature learning techniques are capable of generating low-dimensional embeddings that retain nonlinear dependence structures, but direct estimations of MI suffer from noisy probability distribution estimates in the low-data regime. The H-Score objective, computed from second-order statistics, provides a practical proxy metric for training feature extraction networks. We prove that H-Score is invariant to invertible transformations in the unrestricted functional setting, but becomes sensitive to input basis rotations under constrained approximation classes. Consequently, we study unitary preconditioning for H-Score networks and show that selecting an appropriate basis rotation reduces finite-width truncation error by concentrating predictive dependence into fewer dominant modes. We identify the fast Fourier transform (FFT) as an effective data-independent, low-cost preconditioner for approximately stationary processes, where spectral structure induces concentration of the cross-covariance singular value spectrum. We introduce training-free metrics based on spectral entropy and cumulative dependence energy to quantify basis suitability and predict downstream inference gains prior to network training. Experiments across eight multivariate datasets demonstrate that FFT preconditioning is particularly useful in resource-constrained regimes, achieving up to 50% normalized mean squared error (NMSE) reduction, while the proposed metrics correlate with observed performance gains and correctly identify cases where spectral preconditioning is detrimental.
[LG-73] Structured Gaussian Processes for Uncertainty-Aware Classification of High-Dimensional Small-Sampled Omics Data
链接: https://arxiv.org/abs/2607.02103
作者: Yue Zhang,Nandini Amit Gadhia,Georgios Karagiannis,Michalis Smyrnakis
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注: 15 pages, 1 figure. Preprint version
Abstract:Classifying heterogeneous omics data remains a fundamental challenge in computational biology, particularly in high-dimensional, small-sample settings where nonlinear interactions dominate and class imbalance further complicates reliable prediction of minority phenotypes. While traditional kernel methods rely on feature abundance, they fail to leverage the known interaction landscapes of biological systems. In this work, we propose a structured Gaussian process classification framework that integrates graph-encoded biological pathways directly into the kernel construction. By propagating information along known interaction networks and combining this with abundance-derived features, the resulting classifier captures both quantitative measurements and topological context. We benchmark our proposed methodology on three publicly available gut and fecal microbiome datasets. To address severe class imbalance, we evaluate complementary strategies, including data-level resampling, threshold calibration, and confusion-matrix-based adjustments, and report minority-class performance alongside accuracy. The hybrid approach yields a performance gain over unstructured baselines and matches the performance of established benchmarks for similar datasets. Furthermore, the probabilistic nature of the framework naturally provides calibrated predictive uncertainty, enabling robust differentiation between confident predictions and ambiguous samples. Comments: 15 pages, 1 figure. Preprint version Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG) Cite as: arXiv:2607.02103 [q-bio.QM] (or arXiv:2607.02103v1 [q-bio.QM] for this version) https://doi.org/10.48550/arXiv.2607.02103 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-74] Born Discrete Made Smooth: Variational Formulation of Shallow Neural Networks
链接: https://arxiv.org/abs/2607.02003
作者: Matej Benko,Pierre Bousquet,Iwona Chlebicka,Błażej Miasojedow
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Although neural networks are remarkably effective, their underlying optimization principles remain theoretically elusive, often characterized by non-convex landscapes and stochastic heuristics. In this work, we propose a paradigm shift by replacing the discrete training problem of shallow neural networks with a well-posed continuum variational surrogate. We identify a family of \lambda -convex functionals over parameter densities in weighted Sobolev spaces and prove that these variational problems are globally well-posed, stable, and exhibit unexpected almost C^3 regularity. Unlike existing Wasserstein-based or Mean-Field approaches, which often face limited regularity and discretization challenges, our formulation provides direct access to elliptic regularity and convex analysis. This allows us to prove that the optimal parameter density can be obtained by solving a single linear system, bypassing iterative optimization entirely. We establish explicit generalization error controls at a rate of 1/\alpha relative to the regularization parameter, and prove that finite-width networks of size N achieve the continuum optimum at an O(1/N) rate. This perspective bridges the gap between the Neural Tangent Kernel (NTK) and feature-learning regimes, providing a principled framework for understanding over-parameterization through the lens of variational calculus. Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG) Cite as: arXiv:2607.02003 [stat.ML] (or arXiv:2607.02003v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2607.02003 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-75] Autorelevance function and other feature relevance measures for univariate time series
链接: https://arxiv.org/abs/2607.01959
作者: Julian Cardenas,Jamie Arjona,Pedro Delicado
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:We propose a model agnostic methodology to measure lag relevance in machine learning forecasting models applied to univariate time series. Particularly, we are working in the context of time series using the frameworks of Ghost variables and Shapley values, together with additive importance measures, to introduce the auto-relevance and partial auto-relevance functions as the lag importance values. Additionally, we propose a novel method to replace absent features in coalition based methods with a one step forecast from the same model. We evaluate these proposals under different simulations and real data cases. This combined framework perspective is particularly suitable for time series. In addition, to show our discoveries we use a pull of models from the seasonal ARMA family and recurrent neural networks. We found that the calculated relevance measures successfully demonstrate the expected lag structure in almost all cases.
[LG-76] Statistical Properties of k-means Clustering for Data Missing Completely at Random
链接: https://arxiv.org/abs/2607.01945
作者: Xin Guan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:The classical k -means clustering cannot be directly used to incomplete data, and existing k -means-based clustering for missing data primarily focus on improving the practical accuracy of clustering, whereas most of them lack theoretical guarantees in the asymptotic sense. In this paper, we investigate the statistical properties of k -means clustering in the presence of missing data. We first establish the \sqrtn -excess risk bound and prove the consistency of the estimated cluster centers under general missing mechanisms. For the Missing Completely at Random (MCAR) mechanism, we further derive the \sqrtn -convergence rate and asymptotic normality of the estimated cluster centers. Moreover, we study in what cases the cluster centers estimated by incomplete data converge to the true cluster centers of original fully observed data, and give a sufficient condition about the missing probability and the separation among true clusters. These results provide a theoretical guarantee for missing-data- k -means. Notably, our analysis reveal that under MCAR mechanism, both achieving the \sqrtn -rate and converging to the true cluster centers require k true centers to be distinct in every dimension, highlighting the significant challenges of application in high-dimensional regimes. Finally, we conduct numerical simulations on synthetic incomplete datasets to support our theoretical analysis results.
[LG-77] Enerzyme: A Framework for Efficient Training of Reactive Neural Network Potentials for Enzyme Catalysis with Application to Methyltransferases
链接: https://arxiv.org/abs/2607.01362
作者: Weiliang Luo,Heather J. Kulik
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:
Abstract:Quantum mechanical (QM) cluster models provide an effective framework for mechanistic studies of enzymatic reactions but remain computationally demanding. Neural network potentials (NNPs) offer a promising route to reduce this cost, but enzymes present challenges beyond small molecules, including large system sizes, implicit-solvent environments, substantial polarization, and charge transfer. Here, we present an integrated software framework for efficient NNP training for mechanistic studies of enzymes, demonstrated on QM cluster models of S-adenosyl-L-methionine-dependent methyltransferases (MTases). Our Enerzyme code introduces modular electrostatics-aware NNP architectures and combines automated QM-cluster construction with reactive dataset generation. The Enerzymette subpackage automates reaction pathway exploration at both NNP and DFT levels. We show that iterative flexible scans and nudged elastic band calculations impose stricter requirements on NNPs than conventional dataset metrics. Nevertheless, NNPs trained on fewer than 1,000 system-specific datapoints reproduce reaction energetics and transition-state structures for MTase clusters containing up to 545 atoms with near-chemical accuracy. Direct supervision of atomic charges and consistent dielectric screening substantially improve simulation stability and accuracy, while multitask-learned atomic charges capture charge transfer and polarization trends and provide chemically meaningful descriptors of reactivity. Finally, transferability across chemically diverse catechol O-methyltransferase substrates indicates that NNPs learn generalizable reactivity patterns as training data expand across multiple enzymes. Together, these results establish a foundation for accelerating enzyme mechanistic studies and guide future NNP development for biomolecular reactivity.
[LG-78] Ravines in quantum cost landscapes: opportunities for improved VQA predictions
链接: https://arxiv.org/abs/2607.01329
作者: Felix J. Beckmann,João F. Bravo
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 28 pages, 14 figures
Abstract:The geometric and topological structure of quantum cost landscapes (QCLs) governs the optimization and thus the predictive power of variational quantum algorithms (VQAs). We systematically analyze ravines - low-cost paths connecting local minima - using an adapted version of the nudged elastic band (NEB) algorithm, a method originating from theoretical chemistry. By training quantum neural networks (QNNs) to classify the concentratable entanglement of quantum states, we apply the NEB algorithm and numerically identify ravine structures in QCLs of hardware-efficient ansatzes. Beyond visualizing these ravines, we construct an ensemble prediction framework by averaging predictions from QNNs parameterized along the low-cost NEB path. We introduce a resource-light pre-training metric which quantifies local-prediction variability and serves as a strong performance indicator for VQAs, even beyond the scope of this study. When base classifiers are drawn from circuit and weight initializations exhibiting high local-prediction variability, the quantum-based NEB ensembles outperform both classical and naive quantum alternatives. Moreover, a complexity analysis shows that leveraging the ravine-like structure of QCLs with the QNN NEB approach substantially reduces computational costs compared to naive QNN ensembling. A depth and qubit scaling analysis indicates that ravines persist across both scalings, and that, despite the expected growth in resource requirements with the qubit scaling, the NEB approach also accelerates convergence over the naive alternative.
[LG-79] Few-Shot Open-Set Audio Classification Using Attention Information-Fused Prototypes
链接: https://arxiv.org/abs/2607.01297
作者: Yanxiong Li,Jiaxin Tan,Qianqian Li,Guoqing Chen,Sen Huang,Tuomas Virtanen
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注: 14 pages, 12 tables, 9 figures,Accepted for publication in IEEE TASLP
Abstract:Most existing audio classification methods suppose that each query (testing) sample belongs to a class of support (training) samples, and misrecognize samples of unseen classes as seen classes (cannot reject samples of unseen classes). In this study, we propose a method for Few-shot Open-set Audio Classification (FOAC), which can recognize query samples of seen classes after updating the model using a few support samples, and meanwhile reject query samples from unseen classes. We design a model consisting of an encoder and a classifier. The encoder is the backbone of a ResNet used for extracting embeddings. The classifier consists of prototype generators of few-shot classes and open-set classes. Prototypes of few-shot classes are obtained by fusing the class-discriminative information of support and query embeddings and by assigning larger weighting coefficient to representative part of the support embeddings. One prototype is generated for open-set classes using the proposed prototype generator. The encoder is trained with abundant samples of base classes in supervised manner, and then the prototypes of base classes are generated under the supervision of a joint loss. The classifier is trained using a few samples of few-shot classes in a meta-training way. Three public datasets (LS-100, NSynth-100, and FSC-89) are used to assess the performance of our method. Experiments show that our method has advantage over prior methods in AUROC and accuracy. This advantage has statistical significance for most prior methods. Our method has lower computational complexity than most prior methods. The code is at this https URL.
[LG-80] CNN Models for Microphone Array Covariance Matrix Upsampling and Acoustic Imaging
链接: https://arxiv.org/abs/2607.01295
作者: Marianthi Adamopoulou,Parthasaarathy Sudarsanam,David Diaz-Guerra,Meng Jiang,Archontis Politis,Seyed Jalaleddin Mousavirad,Tuomas Virtanen,Jan Lundgren
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)
*备注: Published in the 2026 IEEE International Symposium on Artificial Intelligence for Instrumentation and Measurement (AI4IM), Amalfi, Italy, 2026
Abstract:Acoustic imaging visualization is a core methodology in acoustics, enabling spatial analysis of sound sources and acoustic scenes. However, limited sensor availability in practical systems motivate approaches that enhance spatial resolution without increasing the hardware complexity. In this paper, we focus on upsampling virtually a tetrahedral 4-microphone array to a spherical 32-microphone array by estimating the covariance matrices of the channels employing deep learning techniques. Five neural network architectures are investigated for covariance upsampling for acoustic imaging using the real-world STARSS23 dataset. These models are developed to estimate a 32-microphone, time-frequency covariance matrix from a 4-microphone input covariance representation. The proposed architectures are based on 2D convolutional layers to capture the underlying spatial-spectral structure of covariance matrices, and are further enhanced with frequency dynamic convolution to model their frequency-dependent properties. The proposed architectures are evaluated in terms of root mean square error (RMSE) and using delay-and-sum beamforming acoustic imaging. Quantitative results show that all models outperform a random-guess baseline, which yields an RMSE of 0.548, with the best-performing architecture achieving an RMSE of 0.432. We analyze qualitatively the performance of the proposed models through beamforming heatmap visualizations derived from the 4-channel input covariance, the 32-channel ground truth, and the predicted 32-channel covariance matrices. These results demonstrate that covariance upsampling significantly enhances the effective performance of the 4-channel microphone array, producing sound maps that closely resemble those obtained with the 32-channel array.
[LG-81] Xact-Prior Variational Autoencoder (X-VAE): Learning Data-Adaptive Gaussian Mixture Priors for Latent Distributions
链接: https://arxiv.org/abs/2607.01275
作者: Qijun Chen,Shaofan Li
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Variational Autoencoders (VAEs) commonly assume a standard isotropic Gaussian prior over the latent space, an assumption that often fails to capture the true distribution of latent representations for complex datasets. This mismatch can limit reconstruction accuracy, reduce sample quality, and constrain the expressive power of the learned latent space. We propose the eXact-Prior Variational Autoencoder (X-VAE), a framework that replaces the conventional standard normal prior with a Gaussian prior derived from the latent representations of a pretrained autoencoder (AE). Specifically, the empirical mean and standard deviation of the AE latent codes are used to parameterize a data-adaptive prior that more closely reflects the underlying structure of the training data. During generation, X-VAE introduces a latent scaling factor that enables explicit control over the variance of the sampled latent vectors, providing a simple mechanism for balancing sample diversity and fidelity. This flexibility makes the proposed approach particularly well suited for applications such as industrial and engineering design, where generated solutions must satisfy strict structural or functional constraints while still permitting meaningful design exploration. We present the mathematical formulation of well-suited X-VAE, derive the corresponding KL divergence objective for the proposed prior, and evaluate the method on standard benchmark datasets. Experimental results demonstrate that X-VAE preserves reconstruction quality while producing latent representations that better align with the empirical data distribution, leading to improved controllability and more realistic generated samples.
[LG-82] Fast approximation and learning of binary classification tasks in o-minimal structures using ReLU neural networks
链接: https://arxiv.org/abs/2607.01266
作者: Clemens Kinn,Philipp Petersen
类目: Logic (math.LO); Machine Learning (cs.LG); Functional Analysis (math.FA)
*备注:
Abstract:We study binary classification problems whose decision sets are given by definable sets in o-minimal expansions of the real field. Motivated by cell decomposition of definable sets, we introduce traceable sets as a classical proxy for definable decision regions and analyze their approximation by ReLU neural networks. Under uniform bounds on the number of connected components and suitable C^m extensions for the boundary functions, we prove that characteristic functions of traceable subsets of [-1/2,1/2]^n can be approximated in L^p to accuracy \varepsilon0 by ReLU neural networks of size \mathcalO(\varepsilon^-p(n-1)/m) , with depth independent of \varepsilon and polynomially bounded weights. This establishes quantitative approximation rates for certain definable collections in o-minimal structures using ReLU neural networks. The same approach also yields the stated approximation rates for a subclass of definable maps [-1/2,1/2]^n \to \mathbbR . We then combine the approximation capabilities with entropy estimates for ReLU neural network classes to obtain statistical learning rates for empirical risk minimization with hinge loss. For N uniformly distributed samples, the resulting classifiers achieve expected misclassification error of order N^-m/(m+pn-p) up to an arbitrarily small polynomial loss.
附件下载


