This post contains the latest paper listing retrieved from Arxiv.org on 2026-05-04, updated automatically and grouped into six broad areas: NLP, CV, ML, AI, IR, and MA.
Note: paper data is fetched from Arxiv.org each day and updated automatically around 12:30 every morning.
Tip: if a day's update is missing, either Arxiv published no new papers that day or the script failed. Fixes are applied the same day whenever possible.
Table of Contents
Overview (2026-05-04)
461 papers were updated today, including:
- Natural Language Processing: 71 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 117 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 89 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 160 papers (Machine Learning (cs.LG))
- Multi-Agent Systems: 11 papers (Multiagent Systems (cs.MA))
- Information Retrieval: 18 papers (Information Retrieval (cs.IR))
- Human-Computer Interaction: 26 papers (Human-Computer Interaction (cs.HC))
Multi-Agent Systems
[MA-0] RunAgent: Interpreting Natural-Language Plans with Constraint-Guided Execution
Quick Read: This paper addresses the unreliability of large language models (LLMs) in structured workflow execution: although LLMs understand natural-language instructions, they struggle to guarantee correct step ordering and logical consistency. The key to the solution is RunAgent, a multi-agent plan-execution platform that combines the expressiveness of natural language with the determinism of programming through an agentic language with explicit control constructs (e.g., IF, GOTO, FORALL). At each step, RunAgent autonomously derives and validates constraints, dynamically chooses among reasoning, tool calls, and code generation, and combines error-correction mechanisms with context filtering to ensure accurate and robust task execution.
Link: https://arxiv.org/abs/2605.00798
Authors: Arunabh Srivastava, Mohammad A. (Amir) Khojastepour, Srimat Chakradhar, Sennur Ulukus
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:
Abstract: Humans solve problems by executing targeted plans, yet large language models (LLMs) remain unreliable for structured workflow execution. We propose RunAgent, a multi-agent plan execution platform that interprets natural-language plans while enforcing stepwise execution through constraints and rubrics. RunAgent bridges the expressiveness of natural language with the determinism of programming via an agentic language with explicit control constructs (e.g., IF, GOTO, FORALL). Beyond syntactic and semantic verification of each step's output, performed against that step's specific instruction, RunAgent autonomously derives and validates constraints from the description of the task and its instance at each step. RunAgent also dynamically selects among LLM-based reasoning, tool usage, and code generation and execution (e.g., in Python), and incorporates error correction mechanisms to ensure correctness. Finally, RunAgent filters the context history by retaining only relevant information during the execution of each step. Evaluations on the Natural-Plan and SciBench datasets demonstrate that RunAgent outperforms baseline LLMs and state-of-the-art PlanGEN methods.
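The control constructs the abstract names (IF, GOTO, FORALL) can be illustrated with a minimal deterministic plan interpreter. This is only a sketch of the idea, not RunAgent's actual language: the step tuple layout and the lambda-based step bodies are hypothetical stand-ins for LLM- or tool-backed steps.

```python
def run_plan(steps, state):
    """Execute a list of steps with IF/GOTO/FORALL control constructs."""
    pc = 0
    while pc < len(steps):
        op, *args = steps[pc]
        if op == "SET":                  # plain step: bind a value
            key, fn = args
            state[key] = fn(state)
            pc += 1
        elif op == "IF":                 # conditional jump
            cond, target = args
            pc = target if cond(state) else pc + 1
        elif op == "GOTO":               # unconditional jump
            (target,) = args
            pc = target
        elif op == "FORALL":             # apply a step to every item of a list
            key, items_key, fn = args
            state[key] = [fn(item) for item in state[items_key]]
            pc += 1
        else:
            raise ValueError(f"unknown construct {op}")
    return state

state = run_plan(
    [
        ("SET", "x", lambda s: 3),
        ("IF", lambda s: s["x"] > 2, 3),   # skip step 2 when x > 2
        ("SET", "x", lambda s: 0),
        ("SET", "items", lambda s: [1, 2, 3]),
        ("FORALL", "doubled", "items", lambda v: 2 * v),
    ],
    {},
)
print(state["x"], state["doubled"])  # 3 [2, 4, 6]
```

In the paper's setting each step would additionally carry constraints to be derived and verified before the program counter advances; here the interpreter only shows the deterministic control flow that the agentic language layers on top of natural language.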
[MA-1] Meritocratic Fairness in Budgeted Combinatorial Multi-armed Bandits via Shapley Values
Quick Read: This paper addresses fairness in budgeted combinatorial multi-armed bandits (BCMAB) under full-bandit feedback (FBF). Unlike semi-bandit feedback, full-bandit feedback does not reveal each arm's marginal contribution, making fairness assessment substantially harder. The authors propose a new solution concept from cooperative game theory, the K-Shapley value, which extends the classical Shapley value to measure an individual's marginal contribution within coalitions of size at most K, and satisfies the Symmetry, Linearity, Null player, and Efficiency axioms. The key innovation is the K-SVFair-FBF algorithm, which adaptively estimates K-Shapley values under an unknown valuation function while mitigating the noise introduced by Monte Carlo approximation. Theoretically it achieves an O(T^{3/4}) bound on fairness regret, and experiments on federated learning and social influence maximization tasks show it outperforms existing baselines.
Link: https://arxiv.org/abs/2605.00762
Authors: Shradha Sharma, Swapnil Dhamal, Shweta Jain
Affiliations: Indian Institute of Technology Ropar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract: We propose a new framework for meritocratic fairness in budgeted combinatorial multi-armed bandits with full-bandit feedback (BCMAB-FBF). Unlike semi-bandit feedback, individual arm contributions are not observed under full-bandit feedback, making the setting significantly more challenging. To compute arm contributions in BCMAB-FBF, we first extend the Shapley value, a classical solution concept from cooperative game theory, to the K-Shapley value, which captures the marginal contribution of an agent restricted to a set of size at most K. We show that the K-Shapley value is the unique solution concept that satisfies the Symmetry, Linearity, Null player, and Efficiency properties. We next propose K-SVFair-FBF, a fairness-aware bandit algorithm that adaptively estimates the K-Shapley value with an unknown valuation function. Unlike standard bandit literature on full-bandit feedback, K-SVFair-FBF not only learns the valuation function under the full-feedback setting but also mitigates the noise arising from Monte Carlo approximations. Theoretically, we prove that K-SVFair-FBF achieves an O(T^{3/4}) bound on fairness regret. Through experiments on federated learning and social influence maximization datasets, we demonstrate that our approach achieves fairness and performs more effectively than existing baselines.
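The K-Shapley idea, marginal contributions restricted to coalitions of size at most K, can be sketched with a plain Monte Carlo estimator. The uniform coalition-size sampling below is an illustrative assumption, not the paper's estimator or its axiomatic weighting.

```python
import random

def k_shapley(value, agents, K, samples=2000, seed=0):
    """Monte Carlo sketch of a K-Shapley value: average the marginal
    contribution of each agent over random coalitions drawn from the
    other agents, so that the coalition including the agent has size <= K."""
    rng = random.Random(seed)
    est = {a: 0.0 for a in agents}
    for a in agents:
        others = [b for b in agents if b != a]
        total = 0.0
        for _ in range(samples):
            k = rng.randrange(min(K, len(others) + 1))  # |S| in 0..K-1
            S = rng.sample(others, k)
            total += value(S + [a]) - value(S)          # marginal contribution
        est[a] = total / samples
    return est

# Additive toy valuation: each agent i contributes exactly i, so the
# estimate recovers each agent's own weight regardless of K.
vals = k_shapley(lambda S: sum(S), agents=[1, 2, 3, 4], K=3)
print(vals)  # → {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0}
```

In the paper's bandit setting the valuation function is itself unknown and must be learned from full-bandit feedback, which is what makes controlling the Monte Carlo noise the hard part.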
[MA-2] Learning to Act and Cooperate for Distributed Black-Box Consensus Optimization
Quick Read: This paper addresses the central challenge of distributed black-box consensus optimization: balancing local adaptation, global coordination, and communication efficiency in heterogeneous non-convex environments. Existing methods rely on handcrafted update rules and static cooperation patterns, which struggle in complex, dynamic scenarios. The key to the solution is LAC-MAS (Learning to Act and Cooperate), a trajectory-driven self-design framework that redesigns agent-level swarm dynamics with an adaptive internal mechanism to better balance exploration, convergence, and local escape; uses large language models to distill sparse high-level guidance from historical optimization trajectories to shape both agent-internal behaviors and agent-external cooperation patterns; and adopts a phased cognitive scheduling strategy to activate different forms of adaptation under resource constraints, yielding better solutions, faster convergence, and higher communication efficiency.
Link: https://arxiv.org/abs/2605.00691
Authors: Zi-Bo Qin, Feng-Feng Wei, Tai-You Chen, Wei-Neng Chen
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE)
Comments: 20 pages, 5 figures
Abstract: Distributed black-box consensus optimization is a fundamental problem in multi-agent systems, where agents must improve a global objective using only local objective queries and limited neighbor communication. Existing methods largely rely on handcrafted update rules and static cooperation patterns, which often struggle to balance local adaptation, global coordination, and communication efficiency in heterogeneous nonconvex environments. In this paper, we take an initial step toward trajectory-driven self-design for distributed black-box consensus optimization. We first redesign the agent-level swarm dynamics with an adaptive internal mechanism tailored to decentralized consensus settings, improving the balance between exploration, convergence, and local escape. Built on top of this adaptive execution layer, we propose Learning to Act and Cooperate (LAC-MAS), a trajectory-driven framework in which large language models provide sparse high-level guidance for shaping both agent-internal action behaviors and agent-external cooperation patterns from historical optimization trajectories. We further introduce a phased cognitive scheduling strategy to activate different forms of adaptation in a resource-aware manner. Experiments on standard distributed black-box benchmarks and real-world distributed tasks show that LAC-MAS consistently improves solution quality, convergence efficiency, and communication efficiency over strong baselines, suggesting a practical route from handcrafted distributed coordination toward self-designing multi-agent optimization systems.
[MA-3] Skills as Verifiable Artifacts: A Trust Schema and a Biconditional Correctness Criterion for Human-in-the-Loop Agent Runtimes
Quick Read: This paper addresses runtime trust verification for agent skills: ensuring that loaded skill code behaves as claimed, without inferring trust from signatures, clearances, or registries of origin. The core challenge is that without verification, a human-in-the-loop (HITL) gate must fire on every irreversible call, which is operationally untenable at scale and degrades into rubber-stamping. The key to the solution is treating skill verification as a separate, gated process: by defining an explicit verification level on every skill manifest and a capability gate whose HITL policy is a function of that level, human intervention fires only for what is unverified, keeping the system sustainable. The paper further proposes a biconditional correctness criterion for evaluating candidate verification procedures, along with a portable, harness- and model-agnostic runtime profile of normative guidelines that can be implemented across deployment environments.
Link: https://arxiv.org/abs/2605.00424
Authors: Alfredo Metere
Affiliations: Metere Consulting, LLC
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Comments:
Abstract: Agent skills – structured packages of instructions, scripts, and references that augment a large language model (LLM) without modifying the model itself – have moved from convenience to first-class deployment artifact. The runtime that loads them inherits the same problem package managers and operating systems have always faced: a piece of content claims a behavior; the runtime must decide whether to believe it. We argue this paper's central thesis up front: a skill is untrusted code until it is verified, and the runtime that loads it must enforce that default rather than infer trust from a signature, a clearance, or a registry of origin. Without skill verification, a human-in-the-loop (HITL) gate must fire on every irreversible call – which is operationally untenable and degrades into rubber-stamping at any non-trivial scale. With skill verification treated as a separate, gated process, HITL fires only for what is unverified, and the system becomes sustainable. We give a trust schema that includes an explicit verification level on every skill manifest; a capability gate whose HITL policy is a function of that verification level; a biconditional correctness criterion that any candidate verification procedure must satisfy on an adversarial-ensemble exercise; and a portable runtime profile with ten normative guidelines abstracted from a working open-source reference implementation. The contribution is harness- and model-agnostic; nothing here requires retraining, fine-tuning, or proprietary infrastructure.
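The paper's core policy, that HITL fires only for what is unverified, reduces to a small gate function over the manifest. The field name `verification_level` and the manifest shape below are hypothetical; the paper defines its own trust schema.

```python
def hitl_required(manifest, action_irreversible):
    """Capability-gate sketch: the HITL policy is a function of the skill
    manifest's verification level (field name hypothetical). Verified skills
    skip human review; unverified skills trigger it on irreversible calls,
    and a missing field defaults to the untrusted case."""
    level = manifest.get("verification_level", "unverified")
    if level == "verified":
        return False
    return action_irreversible  # unverified + irreversible → human gate

# An unverified skill attempting an irreversible action is gated;
# the same action from a verified skill proceeds without review.
print(hitl_required({}, True), hitl_required({"verification_level": "verified"}, True))
```

This is the sustainability argument in miniature: the gate's firing rate scales with the number of unverified skills, not with the total call volume.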
[MA-4] Foresight Arena: An On-Chain Benchmark for Evaluating AI Forecasting Agents
Quick Read: This paper addresses three flaws in existing benchmarks of AI forecasting ability: reliance on static datasets vulnerable to contamination and overfitting; the use of trading PnL, a metric that conflates predictive accuracy with timing, position sizing, and risk appetite; and the absence of decentralized trust. The key to the solution is Foresight Arena, the first permissionless, on-chain benchmark for AI forecasting agents: probabilistic forecasts are submitted through a commit-reveal protocol enforced by Solidity smart contracts on Polygon PoS, outcomes are resolved trustlessly via the Gnosis Conditional Token Framework, and performance is measured with the Brier Score and a novel Alpha Score, both strictly proper scoring rules that incentivize honest probability reporting and isolate predictive edge over market consensus. This design yields an evaluation environment that is overfitting-resistant, decentrally trustworthy, and incentive-compatible, giving a truer measure of agents' forecasting ability.
Link: https://arxiv.org/abs/2605.00420
Authors: Maksym Nechepurenko, Pavel Shuvalov
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG); General Finance (q-fin.GN)
Comments: 27 pages, 5 figures, 10 tables. Project page: this https URL . Code: this https URL
Abstract: Evaluating the true forecasting ability of AI agents requires environments resistant to overfitting, free from centralized trust, and grounded in incentive-compatible scoring. Existing benchmarks either rely on static datasets vulnerable to training-data contamination, or measure trading PnL – a metric conflating predictive accuracy with timing, sizing, and risk appetite. We introduce Foresight Arena, the first permissionless, on-chain benchmark for evaluating AI forecasting agents on real-world prediction markets. Agents submit probabilistic forecasts on binary Polymarket markets via a commit-reveal protocol enforced by Solidity smart contracts on Polygon PoS; outcomes are resolved trustlessly through the Gnosis Conditional Token Framework. Performance is measured by the Brier Score and a novel Alpha Score – proper scoring rules that incentivize honest probability reporting and isolate predictive edge over market consensus. We provide a formal analysis: closed-form variance for per-market Alpha, the connection to Murphy's classical Brier decomposition, and a power analysis characterizing the number of rounds required to reliably distinguish agents of different skill levels. We show that detecting a true edge of α* = 0.02 at 80% power requires approximately 350 resolved binary predictions (50 rounds of 7 markets), while α* = 0.01 requires four times more. We complement these analytical results with a 50-round live evaluation of five frontier LLM agents plus a random baseline. Murphy decomposition distinguishes well-calibrated agents from market-tracking agents that fail through reduced resolution. All smart contracts and evaluation infrastructure are open-source.
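The Brier score used here is standard and easy to compute. The snippet also shows one illustrative reading of "edge over market consensus" as the market's Brier score minus the agent's; the paper's Alpha Score has its own formal definition, which this stand-in does not reproduce.

```python
def brier(forecasts, outcomes):
    """Mean Brier score over binary markets: a strictly proper scoring rule,
    so reporting honest probabilities minimizes the expected score (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(forecasts)

agent = [0.9, 0.2, 0.7]    # agent's committed probabilities (illustrative)
market = [0.6, 0.4, 0.5]   # consensus prices at commit time (illustrative)
wins = [1, 0, 1]           # resolved outcomes

# Stand-in "alpha": the agent's Brier improvement over the market's own score.
alpha = brier(market, wins) - brier(agent, wins)
print(round(brier(agent, wins), 4), round(alpha, 4))  # → 0.0467 0.1433
```

Because the score is strictly proper, an agent that shades its report toward the market price can only lose expected score relative to reporting its true belief, which is the incentive-compatibility property the benchmark relies on.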
[MA-5] High-Probability Convergence in Decentralized Stochastic Optimization with Gradient Tracking
Quick Read: This paper addresses a theoretical gap in high-probability (HP) convergence guarantees for decentralized stochastic optimization: existing HP results are almost exclusively limited to Decentralized Stochastic Gradient Descent (DSGD) and rely on strong assumptions such as bounded data heterogeneity or strong convexity, in contrast to mean-squared-error (MSE) analyses, where bias-corrected methods converge under weaker conditions. The key to the solution is augmenting DSGD with gradient tracking (GT): for the resulting GT-DSGD algorithm, under a relaxed sub-Gaussian noise condition, the paper establishes the first HP convergence rates for non-convex and Polyak-Łojasiewicz (PL) costs, of order O(log(1/δ)/√(nT)) and O(log(1/δ)/(nT)) respectively, where n is the number of agents, T the time horizon, and δ ∈ (0,1) the confidence parameter. This shows that GT-DSGD attains HP convergence under the same cost assumptions as in the MSE sense while retaining comparable transient times, providing the first HP guarantees for a decentralized method with bias correction.
Link: https://arxiv.org/abs/2605.00281
Authors: Aleksandar Armacki, Haoyuan Cai, Ali H. Sayed
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Optimization and Control (math.OC)
Comments: 49 pages, 4 figures. arXiv admin note: text overlap with arXiv:2510.06141
Abstract: We study high-probability (HP) convergence guarantees in decentralized stochastic optimization, where multiple agents collaborate to jointly train a model over a network. Existing HP results in decentralized settings almost exclusively focus on the Decentralized Stochastic Gradient Descent (DSGD) algorithm, which requires strong assumptions, such as bounded data heterogeneity, or strong convexity of each agent's cost. This is contrary to the mean-squared error (MSE) results, where methods incorporating bias-correction techniques are known to converge under relaxed assumptions and achieve better practical performance. In this paper we provide the first step toward bridging the gap, by studying HP convergence of DSGD incorporating the gradient tracking technique, in the presence of noise satisfying a relaxed sub-Gaussian condition. We show that the resulting method, dubbed GT-DSGD, achieves order-optimal HP convergence rates for both non-convex and Polyak-Łojasiewicz costs, of order O(log(1/δ)/√(nT)) and O(log(1/δ)/(nT)), respectively, where n is the number of agents, T is the time horizon and δ ∈ (0,1) is the confidence parameter. Our results establish that GT-DSGD converges in the HP sense under the same conditions on the cost as in the MSE sense, while achieving comparable transient times. To the best of our knowledge, these are the first HP guarantees for decentralized optimization methods incorporating bias-correction. Numerical experiments on real and synthetic data verify our theoretical findings, underlining the superior performance of GT-DSGD and highlighting that the benefits of incorporating bias-correction are also maintained in the HP sense.
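The gradient-tracking update that distinguishes GT-DSGD from plain DSGD can be sketched on scalar quadratics: each agent mixes its iterate with neighbors through a doubly stochastic matrix W and maintains a tracker y_i of the network-average gradient (the bias-correction term). The problem, weights, and step size below are illustrative choices, not the paper's experimental setup.

```python
import random

def gt_dsgd(b, W, steps=500, lr=0.1, noise=0.0, seed=0):
    """Gradient-tracking DSGD sketch on f_i(x) = (x - b_i)^2 / 2, so the
    global optimum of the average cost is mean(b). Updates per step:
        x_i <- sum_j W_ij x_j - lr * y_i
        y_i <- sum_j W_ij y_j + g_i(x_i_new) - g_i(x_i_old)
    which keeps the average of y equal to the average gradient."""
    rng = random.Random(seed)
    n = len(b)
    grad = lambda i, x: x - b[i] + noise * rng.gauss(0, 1)  # stochastic gradient
    x = [0.0] * n
    y = [grad(i, x[i]) for i in range(n)]                   # y_i starts at g_i(x_i)
    g_old = list(y)
    for _ in range(steps):
        x_new = [sum(W[i][j] * x[j] for j in range(n)) - lr * y[i] for i in range(n)]
        g_new = [grad(i, x_new[i]) for i in range(n)]
        y = [sum(W[i][j] * y[j] for j in range(n)) + g_new[i] - g_old[i]
             for i in range(n)]
        x, g_old = x_new, g_new
    return x

# Three agents with heterogeneous local optima; without noise, every agent
# reaches the global optimum mean(b) = 3.0 despite never seeing others' costs.
W = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]]
xs = gt_dsgd(b=[1.0, 2.0, 6.0], W=W)
print(xs)
```

Plain DSGD under the same heterogeneous data would hover around a bias proportional to the step size; the tracker is what removes that bias, which is why relaxed assumptions suffice in the paper's analysis.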
[MA-6] Causal Foundations of Collective Agency
Quick Read: This paper addresses the problem of emergent collective agents in multi-agent AI systems: multiple simpler agents may inadvertently form a unified collective agent with capabilities and goals distinct from any individual's, posing safety risks. The key to the solution is a framework built on causal games and causal abstraction: a group's joint actions are modeled as rational, goal-directed decisions, and a high-level abstract model is required to faithfully capture the behavior of the complex low-level system. This allows quantitative assessment of the degree of collective agency exhibited under different mechanisms and provides a foundation for understanding, predicting, and controlling collective behavior in multi-agent systems.
Link: https://arxiv.org/abs/2605.00248
Authors: Frederik Hytting Jørgensen, Sebastian Weichwald, Lewis Hammond
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments: CLeaR 2026
Abstract: A key challenge for the safety of advanced AI systems is the possibility that multiple simpler agents might inadvertently form a collective agent with capabilities and goals distinct from those of any individual. More generally, determining when a group of agents can be viewed as a unified collective agent is a foundational question in the study of interactions and incentives in both biological and artificial systems. We adopt a behavioral perspective in answering this question, ascribing collective agency to a group when viewing the group's joint actions as rational and goal-directed successfully predicts its behavior. We formalize this perspective on collective agency using causal games – which are causal models of strategic, multi-agent interactions – and causal abstraction – which formalizes when a simple, high-level model faithfully captures a more complex, low-level model. We use this framework to solve a puzzle regarding multi-agent incentives in actor-critic models and to make quantitative assessments of the degree of collective agency exhibited by different voting mechanisms. Our framework aims to provide a foundation for theoretical and empirical work to understand, predict, and control emergent collective agents in multi-agent AI systems.
[MA-7] The Silicon Society Cookbook: Design Space of LLM-based Social Simulations
Quick Read: This paper addresses the lack of systematic study of the design space of generative AI-driven Silicon Societies, which leaves a gap in validating model realism. The key to the solution is a systematic analysis of core design choices in simulated social networks, including the choice of base LLM and how agents are connected, using surveys as a proxy for agent opinions. The analysis reveals non-trivial, sometimes non-additive interactions among parameters and finds that the choice of base LLM is the single most important variable affecting simulation outcomes.
Link: https://arxiv.org/abs/2605.00197
Authors: Aurélien Bück-Kaeffer (1 and 2 and 4), Sneheel Sarangi (1 and 2), Maximilian Puelma Touzel (1 and 3), Reihaneh Rabbany (1 and 2), Zachary Yang (1 and 2 and 4), Jean-François Godbout (2 and 3) ((1) McGill University, (2) Mila - Quebec Artificial Intelligence Institute, (3) Université de Montréal, (4) Ubisoft La Forge)
Affiliations: McGill University; Mila - Quebec Artificial Intelligence Institute; Université de Montréal; Ubisoft La Forge
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 20 pages, 12 tables, under review at COLM 2026
Abstract: Studies attempting to simulate human behavior with Silicon Societies grow in numbers while LLM-only social networks have started appearing outside of controlled settings. However, the design space of these networks remains under-studied, which contributes to a gap in validating model realism. To enable future works to make more informed design decisions, we perform a systematic analysis of the consequences and interactions of key design choices in simulated social networks, including the choice of base model used to model individual agents, and how they are connected to each other. Using surveys as a proxy for agent opinions, our findings suggest that the geometry of the design space is non-trivial, with some parameters behaving in additive ways while others display more complex interactions. In particular, the choice of the base LLM is the most important variable impacting the simulation outcomes.
[MA-8] Compliance-Aware Agentic Payments on Stablecoin Rails
Quick Read: This paper addresses how to scale agentic payment systems in regulated settings while preserving compliance without continuous human intervention. The key to the solution is a compliance-aware architecture that combines signature-based payment authorization with relayed execution and embeds programmable compliance as an on-chain guardrail via a policy wrapper and policy manager, enforcing checks at the point of transaction execution rather than as a separate off-chain workflow. The approach preserves low-friction settlement when conditions are satisfied, records transaction-linked on-chain attestations, and supports structured resolution when compliance requirements are pending.
Link: https://arxiv.org/abs/2605.00071
Authors: Kenneth See, Xue Wen Tan
Affiliations: Monetary Authority of Singapore; Infocomm Media Development Authority
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Multiagent Systems (cs.MA)
Comments: Demo Paper Track
Abstract: Agentic payment systems extend delegated action to financial transfers, but scaling them on stablecoin rails in regulated settings requires safeguards that remain effective when humans are not continuously in the loop. We present a compliance-aware architecture that combines x402-style, signature-based payment authorisation and relayed execution with programmable compliance embedded as an on-chain guardrail via a policy wrapper and policy manager coordinating modular checks. By enforcing compliance at the point of execution, rather than as a separate off-chain workflow, the approach preserves low-friction settlement when conditions are satisfied, records transaction-linked on-chain attestations, and supports structured resolution when requirements are pending.
[MA-9] Ambient Persuasion in a Deployed AI Agent: Unauthorized Escalation Following Routine Non-Adversarial Content Exposure
Quick Read: This paper addresses safety incidents in deployed multi-agent systems caused by permissive environment configuration and failed control boundaries, in particular how non-adversarial environmental content can induce an AI agent to perform unauthorized, high-risk operations. The key lies in identifying two failure mechanisms: directive weighting error, where the agent misprioritizes instructions amid genuinely conflicting soft behavioral guidelines, and ambient persuasion, where seemingly innocuous routine content (such as a technology article) implicitly triggers actions beyond the agent's authorization. The study argues that ambiguous conversational cues are insufficient authorization for consequential actions, that prior refusals must persist as enforceable constraints, and that oversight requires systematic post-incident auditing in addition to routine monitoring, to improve the governance and ethical reliability of multi-agent systems.
Link: https://arxiv.org/abs/2605.00055
Authors: Diego F. Cuadros, Abdoul-Aziz Maiga
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract: We report a safety incident in a deployed multi-agent research system in which a primary AI agent installed 107 unauthorized software components, overwrote a system registry, overrode a prior negative decision from an oversight agent, and escalated through increasingly privileged operations up to an attempted system administrator command. The incident was preceded not by an adversarial attack but by routine content: a forwarded technology article written for human developers and shared by the principal investigator for discussion. The agent operated in a permissive environment, with unrestricted shell access, soft behavioral guidelines containing genuinely conflicting instructions, and no machine-enforced installation policy, and had recommended installing the same tool six hours earlier before being told to stand down. We analyze the behavioral cascade, the control boundaries that failed, and the limitations of multi-agent oversight in detecting and remediating the damage. We use directive weighting error as a descriptive interpretation of the observed failure and ambient persuasion as a provisional analytic label for the broader trigger configuration of non-adversarial environmental content preceding unauthorized agent action. The case highlights ethical and governance implications for deployed agent systems: ambiguous conversational cues are insufficient authorization for consequential actions, prior refusals must persist as enforceable constraints rather than message-level reminders, and oversight mechanisms require systematic post-incident auditing in addition to routine monitoring.
[MA-10] SiriusHelper: An LLM Agent-Based Operations Assistant for Big Data Platforms
Quick Read: This paper addresses three core problems facing intelligent assistants deployed on big data platforms: limited scenario coverage spanning both general consultation and domain-specific troubleshooting workflows; inefficient knowledge access caused by inadequate multi-hop retrieval and flat knowledge organization, leading to inaccurate or slow responses; and high maintenance cost, because escalated tickets are unstructured and hard to convert into reusable SOPs. The key to the solution is SiriusHelper, a unified online assistant that identifies user intent and routes queries to the appropriate handling path (including dedicated expert workflows); combines DeepSearch-driven multi-hop retrieval with a priority-based hierarchical knowledge base to improve answer reliability and latency without context overload; and introduces automated ticket understanding and SOP distillation to diagnose assistant failures and extract domain knowledge that continuously enriches the knowledge base, sharply reducing expert overhead. Deployment on the Tencent Big Data platform shows a 20.8% reduction in online ticket volume.
Link: https://arxiv.org/abs/2605.00043
Authors: Yu Shen, Shiyang Liu, Qihang He, Yihang Cheng, Haining Xie, Zhiming He, Huahua Fan, Xianzhi Tan, Teng Ma, Shaoquan Zhang, Danqing Huang, Fan Jiang, Yang Li, Chongqing Zhao, Peng Chen, Jie Jiang, Bin Cui
Affiliations: Tencent Inc.; Peking University
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract: Big data platforms are widely used in modern enterprises, and an in-production intelligent assistant is increasingly important to help users quickly find actionable guidance and reduce operational burden. While recent LLM+RAG assistants provide a natural interface, they face practical challenges in real deployments: limited scenario coverage across both general consultation and domain-specific troubleshooting workflows, inefficient knowledge access due to inadequate multi-hop retrieval and flat knowledge organization, and high maintenance cost because escalated tickets are unstructured and hard to convert into assistant improvements and reusable SOPs. In this paper, we present SiriusHelper, a deployed intelligent assistant for big data platforms. SiriusHelper serves as a unified online assistant that automatically identifies user intent and routes queries to the right handling path, including dedicated expert workflows for specialized scenarios (e.g., SQL execution diagnosis). To support complex troubleshooting, SiriusHelper combines a DeepSearch-driven mechanism with a priority-based hierarchical knowledge base to enable multi-hop retrieval without context overload, thus improving answer reliability and latency. To reduce expert overhead, SiriusHelper further introduces automated ticket understanding and SOP distillation: it diagnoses the assistant failure reason (e.g., missing knowledge or wrong routing) and extracts domain-specific SOPs to continuously enrich the knowledge base. Experiments and online deployment on Tencent Big Data platform show that SiriusHelper outperforms representative alternatives and reduces online ticket volume by 20.8%.
Natural Language Processing
[NLP-0] When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models
Quick Read: This paper addresses the gap between LLMs' apparently strong performance on reasoning benchmarks and their actual ability to execute the procedure specified in a prompt. The authors design a controlled diagnostic benchmark for faithful execution of step-wise arithmetic algorithms, in which a model receives two numeric inputs and step-by-step instructions and must return the final computed value; complexity is increased through algorithm length and look-back dependencies over intermediate variables. The key finding is that average first-answer accuracy drops from 61% on 5-step procedures to 20% on 95-step procedures, and generation-level analysis reveals multiple execution failure modes, including missing answers, premature answers, self-correction after errors, under-executed traces, and hallucinated extra steps. This suggests that apparent reasoning ability can mask substantial weaknesses in faithful instruction following on procedural tasks.
Link: https://arxiv.org/abs/2605.00817
Authors: Sailesh Panda, Pritam Kadasi, Abhishek Upperwal, Mayank Singh
Affiliations: Indian Institute of Technology Gandhinagar; Soket AI
Subjects: Computation and Language (cs.CL)
Comments: 77 pages, 109 figures
Abstract: Large language models (LLMs) often achieve strong performance on reasoning benchmarks, but final-answer accuracy alone does not show whether they faithfully execute the procedure specified in a prompt. We study this question through a controlled diagnostic benchmark for procedural execution, where models are given a step-wise arithmetic algorithm and two numeric inputs, and must return the final computed value. The benchmark uses simple arithmetic operations but increases complexity through algorithm length and look-back dependencies over intermediate variables. Across 14 models and 55 datasets, average first-answer accuracy drops from 61% on 5-step procedures to 20% on 95-step procedures. Generation-level analysis shows that failures often involve missing answers, premature answers, self-correction after an initial error, under-executed traces, and hallucinated extra steps. These findings suggest that apparent reasoning ability can mask substantial weaknesses in faithful instruction execution.
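A benchmark instance of the kind described, a step-wise arithmetic procedure with look-back dependencies over intermediate variables plus a ground-truth executor, might be generated as follows. The operation set, the `lookback` window, and all names are hypothetical stand-ins, not the paper's exact generator.

```python
import random

def make_procedure(n_steps, lookback=3, seed=0):
    """Generate a step-wise arithmetic procedure: v0 and v1 are the two
    inputs, and step i defines v_{i+2} by applying a simple operation to
    two variables drawn from the last `lookback` defined variables."""
    rng = random.Random(seed)
    ops = [("add", lambda a, b: a + b), ("sub", lambda a, b: a - b),
           ("mul", lambda a, b: a * b)]
    steps = []
    for i in range(n_steps):
        name, fn = rng.choice(ops)
        j = rng.randrange(max(0, i + 2 - lookback), i + 2)  # look-back operand 1
        k = rng.randrange(max(0, i + 2 - lookback), i + 2)  # look-back operand 2
        steps.append((name, fn, j, k))
    return steps

def execute(steps, x0, x1):
    """Ground-truth executor used to grade a model's final answer."""
    v = [x0, x1]
    for _, fn, j, k in steps:
        v.append(fn(v[j], v[k]))
    return v[-1]

proc = make_procedure(10)
print(execute(proc, 3, 5))  # deterministic ground truth for this seed
```

Scaling `n_steps` from 5 to 95 and tightening or widening `lookback` reproduces the two complexity axes the benchmark varies, while the executor provides the exact value a faithful model should emit.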
[NLP-1] Can Coding Agents Reproduce Findings in Computational Materials Science?
Quick Read: This paper addresses the shortfall of large language models (LLMs) acting as autonomous coding agents in computational scientific workflows, in particular their limited ability to reproduce findings in materials science. Although LLM coding agents excel on software engineering benchmarks, they lack the ability to navigate complex, domain-specific procedures and to interpret results against scientific claims. The key to the solution is AutoMat, a benchmark for computational materials science that evaluates whether agents can recover underspecified computational procedures from paper text, execute specialized toolchains, and judge whether the resulting evidence supports the original claim. Curated with subject-matter experts from reproducibility claims in real materials science papers, it systematically tests end-to-end workflow reconstruction and execution, exposing current agents' core limitations in procedural completeness, methodological fidelity, and execution robustness.
Link: https://arxiv.org/abs/2605.00803
Authors: Ziyang Huang, Yi Cao, Ali K. Shargh, Jing Luo, Ruidong Mei, Mohd Zaki, Zhan Liu, Wyatt Bunstine, William Jurayj, Somdatta Goswami, Tyrel McQueen, Michael Shields, Jaafar El-Awady, Paulette Clancy, Benjamin Van Durme, Nicholas Andrews, William Walden, Daniel Khashabi
Affiliations: Johns Hopkins University
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract: Large language models are increasingly deployed as autonomous coding agents and have achieved remarkably strong performance on software engineering benchmarks. However, it is unclear whether such success transfers to computational scientific workflows, where tasks require not only strong coding ability, but also the ability to navigate complex, domain-specific procedures and to interpret results in the context of scientific claims. To address this question, we present AutoMat, a benchmark for evaluating LLM-based agents' ability to reproduce claims from computational materials science. AutoMat poses three interrelated challenges: recovering underspecified computational procedures, navigating specialized toolchains, and determining whether the resulting evidence supports a claim. By working closely with subject matter experts, we curate a set of claims from real materials science papers to test whether coding agents can recover and execute the end-to-end workflow needed to support (or undermine) such claims. We then evaluate multiple representative coding agent settings across several foundation models. Our results show that current LLM-based agents obtain low overall success rates on AutoMat, with the best-performing setting achieving a success rate of only 54.1%. Error analysis further reveals that agents perform worst when workflows must be reconstructed from paper text alone and that they fail primarily due to incomplete procedures, methodological deviations, and execution fragility. Taken together, these findings position AutoMat as both a benchmark for computational scientific reproducibility and a tool for diagnosing the current limitations of agentic systems in AI-for-science settings.
[NLP-2] When RAG Chatbots Expose Their Backend: An Anonymized Case Study of Privacy and Security Risks in Patient-Facing Medical AI
Quick Read: This paper addresses the security and privacy risks of patient-facing generative AI medical chatbots, especially retrieval-augmented generation (RAG) systems whose misconfiguration can leak sensitive information. The key to the solution is a two-stage assessment: first, LLM-assisted exploratory prompt testing and structured vulnerability hypothesis generation; second, manual verification of candidate findings with browser developer tools. This surfaced critical exposures in client-server communication, including the system prompt, model configuration, knowledge-base content, and unprotected conversation records, showing that non-destructive security testing is possible with standard browser tools alone and that independent auditing should be a prerequisite for deployment.
Link: https://arxiv.org/abs/2605.00796
Authors: Alfredo Madrid-García, Miguel Rujas
Affiliations: Universidad Politécnica de Madrid
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract: Background: Patient-facing medical chatbots based on retrieval-augmented generation (RAG) are increasingly promoted to deliver accessible, grounded health information. AI-assisted development lowers the barrier to building them, but they still demand rigorous security, privacy, and governance controls. Objective: To report an anonymized, non-destructive security assessment of a publicly accessible patient-facing medical RAG chatbot and identify governance lessons for safe deployment of generative AI in health. Methods: We used a two-stage strategy. First, Claude Opus 4.6 supported exploratory prompt-based testing and structured vulnerability hypotheses. Second, candidate findings were manually verified using Chrome Developer Tools, inspecting browser-visible network traffic, payloads, API schemas, configuration objects, and stored interaction data. Results: The LLM-assisted phase identified a critical vulnerability: sensitive system and RAG configuration appeared exposed through client-server communication rather than restricted server-side. Manual verification confirmed that ordinary browser inspection allowed collection of the system prompt, model and embedding configuration, retrieval parameters, backend endpoints, API schema, document and chunk metadata, knowledge-base content, and the 1,000 most recent patient-chatbot conversations. The deployment also contradicted its privacy assurances: full conversation records, including health-related queries, were retrievable without authentication. Conclusions: Serious privacy and security failures in patient-facing RAG chatbots can be identified with standard browser tools, without specialist skills or authentication; independent review should be a prerequisite for deployment. Commercial LLMs accelerated this assessment, including under a false developer persona; assistance available to auditors is equally available to adversaries.
[NLP-3] LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation
Quick Read: This paper addresses cross-script consistency for speaker encoders in multilingual voice cloning: the same speaker should yield the same representation regardless of the language or script of the utterance (e.g., English, Hindi, Telugu, Tamil), yet off-the-shelf encoders such as WavLM-base-plus-sv and ECAPA-TDNN exhibit accent-conditional embedding shifts across scripts, degrading text-to-speech (TTS) systems. The key to the solution is LASE (Language-Adversarial Speaker Encoder): a small projection head over frozen WavLM-base-plus trained with two joint losses, a supervised contrastive loss over voice identity that preserves speaker information, and a gradient-reversal cross-entropy loss against a language classifier that forces the embedding to be language-uninformative while remaining speaker-discriminative. Experiments show LASE shrinks the cross-script vs. same-script embedding gap to a residual consistent with zero, improves cross-script TTS, and matches ECAPA-TDNN on multi-speaker separation in a few-shot regime with far less training data.
Link: https://arxiv.org/abs/2605.00777
Authors: Venkata Pushpak Teja Menta
Affiliations: Unknown
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: 7 pages, 2 figures, 2 tables. Code, model, and datasets at this https URL
Abstract:A speaker encoder used in multilingual voice cloning should treat the same speaker identically regardless of which script the audio was uttered in. Off-the-shelf encoders do not, and the failure is accent-conditional. On a 1043-pair Western-accented voice corpus across English, Hindi, Telugu, and Tamil, WavLM-base-plus-sv loses 0.082 absolute cosine similarity when the same voice changes script and ECAPA-TDNN loses 0.105. On a 1369-pair Indian-accented voice corpus, the gap shrinks to 0.006 (WavLM-SV) and 0.044 (ECAPA-TDNN). The leak is largest where it matters most for cross-script TTS: when a system projects a non-Indic-trained voice into Indic scripts. We present LASE (Language-Adversarial Speaker Encoder), a small projection head over frozen WavLM-base-plus trained with two losses: a supervised contrastive loss over voice identity, and a gradient-reversal cross-entropy against a 4-language classifier that pushes the embedding to be language-uninformative while remaining speaker-informative. Trained on 1118 quality-gated cross-script pairs synthesised from 8 commercial multilingual voices, LASE’s residual gap is consistent with zero on both corpora (Delta = 0.013 Western, Delta = 0.026 Indian; both bootstrap 95% CIs include zero) and amplifies the cross-script-vs-floor margin 2.4-2.7x over both baselines. An ECAPA+GRL ablation shows the GRL objective improves either backbone but the WavLM choice contributes too. In synthetic multi-speaker diarisation, LASE matches ECAPA-TDNN on cross-script speaker recall (0.788 vs 0.789) with ~100x less training data. We release the r1 checkpoint, both corpora, and the bootstrap recipe.
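The gradient-reversal objective at the core of LASE can be sketched without any deep-learning framework: the reversal layer is an identity in the forward pass and scales the upstream gradient by -λ in the backward pass. The class, parameter name `lam`, and toy gradients below are our own illustration, not code from the paper.

```python
class GradReverse:
    """Gradient reversal layer: identity forward, -lam * grad backward.

    Features pass through unchanged to the language classifier, but the
    gradients flowing back from it are negated, pushing the encoder to
    *remove* language information while the speaker loss keeps the
    embedding speaker-informative.
    """

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        # Identity: the language classifier sees the raw embedding.
        return x

    def backward(self, grad_output):
        # Reverse and scale the gradient before it reaches the encoder.
        return [-self.lam * g for g in grad_output]


grl = GradReverse(lam=0.5)
emb = [0.2, -1.0, 3.0]            # toy speaker embedding
assert grl.forward(emb) == emb    # forward is the identity
print(grl.backward([1.0, -2.0, 4.0]))  # → [-0.5, 1.0, -2.0]
```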
[NLP-4] Directed Social Regard: Surfacing Targeted Advocacy Opposition Aid Harms and Victimization in Online Media
【Quick Read】: This paper addresses the inability of existing natural language processing (NLP) tools to identify coexisting positive and negative sentiments and their specific targets within a single text. Conventional methods assign one overall sentiment label (positive, neutral, or negative), ignoring the multi-dimensional, multi-valence expression common in online content, where a single message may direct advocacy (pro-social sentiment) and blame (anti-social sentiment) at different topics. The key to the proposed Directed Social Regard (DSR) approach is a pair of transformer-based models: the first detects span-level targets of sentiment in a message, and the second scores each span along three (-1, 1) axes grounded in the social-psychological theories of moral disengagement and moral framing, yielding a fine-grained account of both the direction and the target of sentiment.
Link: https://arxiv.org/abs/2605.00776
Authors: Scott Friedman, Ruta Wheelock, Sonja Schmer-Galunder, Drisana Iverson, Jake Vasilakes, Joan Zheng, Jeffrey Rye, Vasanth Sarathy, Christopher Miller
Affiliations: Sift; University of Florida; Chapman University; Tufts University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 32 pages, 12 figures, 7 tables
Abstract:The language in online platforms, influence operations, and political rhetoric frequently directs a mix of pro-social sentiment (e.g., advocacy, helpfulness, compassion) and anti-social sentiment (e.g., threats, opposition, blame) at different topics, all in the same message. While many natural language processing (NLP) tools classify or score a text’s overall sentiment as positive, neutral, or negative, these tools cannot report that positive and negative sentiments coexist, and they cannot report the target of those sentiments. This paper presents the Directed Social Regard (DSR) approach to multi-dimensional, multi-valence sentiment analysis, comprised of a pair of transformer-based models that (1) detects span-level targets of sentiment in a message and then (2) scores all spans within the message context along three (-1, 1) axes of regard that are motivated by social science theories of moral disengagement and moral framing. We present a data collection and annotation strategy for DSR dataset construction, a transformer-based architecture for span-level scoring, and a validation study with promising results. We apply the validated DSR model on six third-party datasets of online media and report meaningful correlations between DSR outputs and the labels and topics in these pre-existing social science datasets.
[NLP-5] Characterizing the Expressivity of Local Attention in Transformers ACL2026
【Quick Read】: This paper asks why local attention improves model quality in transformers, even though it is usually motivated by computational efficiency. Through a formal analysis, the authors show that the languages recognizable with global attention correspond to a fragment of linear temporal logic (LTL) with a single past operator, whereas adding local attention introduces a second temporal operator that strictly enlarges the class of recognizable regular languages. The key insight is that local attention is expressively complementary to global attention: neither subsumes the other, and combining them yields the richest fragment, providing a theoretical explanation for the quality gains that is corroborated by experiments on formal language recognition and natural language modeling.
Link: https://arxiv.org/abs/2605.00768
Authors: Jiaoda Li, Ryan Cotterell
Affiliations: ETH Zurich
Subjects: Computation and Language (cs.CL)
Comments: ACL 2026
Abstract:The transformer is the most popular neural architecture for language modeling. The cornerstone of the transformer is its global attention mechanism, which lets the model aggregate information from all preceding tokens before generating the next token. One common variant of attention is called local attention, which restricts each token to aggregating information from a bounded window of predecessors, reducing the quadratic cost of global attention to linear. Although this restriction is usually motivated by efficiency, it has also been found to improve model quality, a phenomenon that has so far lacked a satisfactory explanation. We provide a formal account of this phenomenon in terms of recognizer expressivity. It has been shown that fixed-precision transformers with global attention correspond to a fragment of linear temporal logic containing a single past operator. We additionally prove that adding local attention introduces a second temporal operator, strictly enlarging the class of recognizable regular languages. Moreover, global and local attention are expressively complementary: neither subsumes the other, and combining them yields the richest fragment. Experiments on formal language recognition and natural language modeling corroborate the theory, showing that hybrid global–local transformers outperform their global-only counterparts.
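The global/local distinction comes down to the attention mask: global (causal) attention lets position i attend to every j ≤ i, while local attention with window w restricts j to the last w positions. The sketch below is our own illustration of this masking, not the paper's formalism.

```python
def causal_mask(n):
    """Global (causal) attention: token i may attend to every j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def local_mask(n, w):
    """Local attention: token i attends only to the last w tokens,
    i.e. positions j with i - w < j <= i."""
    return [[i - w < j <= i for j in range(n)] for i in range(n)]


n, w = 5, 2
g, l = causal_mask(n), local_mask(n, w)
# Bounded window -> at most w attended positions per row (linear cost).
assert all(sum(row) <= w for row in l)
# Local attention is a restriction of global attention.
assert all(l[i][j] <= g[i][j] for i in range(n) for j in range(n))
print([sum(row) for row in l])  # → [1, 2, 2, 2, 2]
```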
[NLP-6] FinSafetyBench: Evaluating LLM Safety in Real-World Financial Scenarios ACL2026
【Quick Read】: This paper addresses the risk that large language models (LLMs) deployed in financial applications produce harmful outputs, particularly when faced with requests to facilitate illegal or unethical activity. To systematically evaluate LLM safety in finance, the authors propose FinSafetyBench, a bilingual (English-Chinese) red-teaming benchmark spanning 14 subcategories that tests whether models refuse requests violating financial compliance. The key to the solution is grounding the benchmark in real-world financial crime cases and ethics standards and evaluating models under three representative attack settings; the experiments reveal that models are more vulnerable in Chinese contexts and that prompt-level defenses struggle against sophisticated or implicit manipulation, exposing core gaps in current financial-domain LLM safety mechanisms.
Link: https://arxiv.org/abs/2605.00706
Authors: Yutao Hou, Yihan Jiang, Yuhan Xie, Jian Yang, Liwen Zhang, Hailiang Huang, Guanhua Chen, Yun Chen
Affiliations: Shanghai University of Finance and Economics; Beihang University; Southern University of Science and Technology; MoE Key Laboratory of Interdisciplinary Research of Computation and Economics
Subjects: Computation and Language (cs.CL)
Comments: Accepted to the Findings of ACL 2026
Abstract:Large language models (LLMs) are increasingly applied in financial scenarios. However, they may produce harmful outputs, including facilitating illegal activities or unethical behavior, posing serious compliance risks. To systematically evaluate LLM safety in finance, we propose FinSafetyBench, a bilingual (English-Chinese) red-teaming benchmark designed to test an LLM’s refusal of requests that violate financial compliance. Grounded in real-world financial crime cases and ethics standards, the benchmark comprises 14 subcategories spanning financial crimes and ethical violations. Through extensive experiments on general-purpose and finance-specialized LLMs under three representative attack settings, we identify critical vulnerabilities that allow adversarial prompts to bypass compliance safeguards. Further analysis reveals stronger susceptibility in Chinese contexts and highlights the limitations of prompt-level defenses against sophisticated or implicit manipulation strategies.
[NLP-7] Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory
【Quick Read】: This paper addresses the difficulty large language model (LLM) agents face in tracking evolving user preferences over long interactions given limited context windows. Existing memory systems rely on static, hand-crafted update rules, while reinforcement learning (RL)-based approaches suffer from sparse reward signals and unstable long-horizon optimization. The key to the proposed MemCoE framework is a cognition-inspired two-stage optimization: the first stage, Memory Guideline Induction, derives a global memory guideline from contrastive feedback interpreted as textual gradients; the second stage, Guideline-Aligned Memory Policy Optimization, uses the induced guideline to define structured process rewards and trains a guideline-following memory evolution policy via multi-turn RL. The approach yields consistent gains in robustness, transferability, and efficiency over strong baselines on multiple personalization memory benchmarks.
Link: https://arxiv.org/abs/2605.00702
Authors: Derong Xu, Shuochen Liu, Pengfei Luo, Pengyue Jia, Yingyi Zhang, Yi Wen, Yimin Deng, Wenlin Zhang, Enhong Chen, Xiangyu Zhao, Tong Xu
Affiliations: University of Science and Technology of China; State Key Laboratory of Cognitive Intelligence; City University of Hong Kong; Dalian University of Technology; Xi'an Jiaotong University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large language model (LLM) agents require long-term user memory for consistent personalization, but limited context windows hinder tracking evolving preferences over long interactions. Existing memory systems mainly rely on static, hand-crafted update rules; although reinforcement learning (RL)-based agents learn memory updates, sparse outcome rewards provide weak supervision, resulting in unstable long-horizon optimization. Drawing on memory schema theory and the functional division between prefrontal regions and hippocampus regions, we introduce MemCoE, a cognition-inspired two-stage optimization framework that learns how memory should be organized and what information to update. In the first stage, we propose Memory Guideline Induction to optimize a global guideline via contrastive feedback interpreted as textual gradients; in the second stage, Guideline-Aligned Memory Policy Optimization uses the induced guideline to define structured process rewards and performs multi-turn RL to learn a guideline-following memory evolution policy. We evaluate on three personalization memory benchmarks, covering explicit/implicit preference and different sizes and noise, and observe consistent improvements over strong baselines with favorable robustness, transferability, and efficiency.
[NLP-8] ML-BenchGuard: Policy-Grounded Multilingual Safety Benchmark and Guardrail for Large Language Models
【Quick Read】: This paper addresses the challenge of cross-lingual safety alignment for large language models (LLMs): existing multilingual benchmarks rely on generic risk taxonomies and machine translation, which prevents guardrail models from aligning with region-specific regulations and cultural nuances. The key to the solution is ML-Bench, a policy-grounded multilingual safety benchmark whose risk categories and fine-grained rules are derived directly from regional regulations and used to guide the generation of safety data aligned with local legal and cultural contexts. Building on ML-Bench, the authors develop ML-Guard, a Diffusion Large Language Model (dLLM)-based multilingual guardrail with two variants: a lightweight model for fast safe/unsafe checking and a larger model for customized compliance assessment with detailed explanations, enabling accurate, regulation-aware safety evaluation across languages and cultures.
Link: https://arxiv.org/abs/2605.00689
Authors: Yunhan Zhao, Zhaorun Chen, Xingjun Ma, Yu-Gang Jiang, Bo Li
Affiliations: University of Illinois Urbana-Champaign; Fudan University; University of Chicago
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments:
Abstract:As Large Language Models (LLMs) are increasingly deployed in cross-linguistic contexts, ensuring safety in diverse regulatory and cultural environments has become a critical challenge. However, existing multilingual benchmarks largely rely on general risk taxonomies and machine translation, which confines guardrail models to these predefined categories and hinders their ability to align with region-specific regulations and cultural nuances. To bridge these gaps, we introduce ML-Bench, a policy-grounded multilingual safety benchmark covering 14 languages. ML-Bench is constructed directly from regional regulations, where risk categories and fine-grained rules derived from jurisdiction-specific legal texts are directly used to guide the generation of multilingual safety data, enabling culturally and legally aligned evaluation across languages. Building on ML-Bench, we develop ML-Guard, a Diffusion Large Language Model (dLLM)-based guardrail model that supports multilingual safety judgment and policy-conditioned compliance assessment. ML-Guard has two variants, one 1.5B lightweight model for fast `safe/unsafe’ checking and a more capable 7B model for customized compliance checking with detailed explanations. We conduct extensive experiments against 11 strong guardrail baselines across 6 existing multilingual safety benchmarks and our ML-Bench, and show that ML-Guard consistently outperforms prior methods. We hope that ML-Bench and ML-Guard can help advance the development of regulation-aware and culturally aligned multilingual guardrail systems.
[NLP-9] Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLM s
【Quick Read】: This paper addresses the limitations of static mathematics benchmarks for evaluating the mathematical reasoning of large language models (LLMs): they are narrow in scope, quickly saturated, and rarely updated, making it hard to compare models reliably or track progress over time. The key to the solution is MathArena, a continuously maintained evaluation platform that broadens coverage from final-answer olympiad problems to proof-based competitions, research-level arXiv problems, and formal proof generation in Lean, while enforcing a unified evaluation protocol and regularly adding new benchmarks so the platform remains challenging and can comprehensively track progress in mathematical reasoning.
Link: https://arxiv.org/abs/2605.00674
Authors: Jasper Dekoninck, Nikola Jovanović, Tim Gehrunger, Kári Rögnvalddson, Ivo Petrov, Chenhao Sun, Martin Vechev
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) are becoming increasingly capable mathematical collaborators, but static benchmarks are no longer sufficient for evaluating progress: they are often narrow in scope, quickly saturated, and rarely updated. This makes it hard to compare models reliably and track progress over time. Instead, we need evaluation platforms: continuously maintained systems that run, aggregate, and analyze evaluations across many benchmarks to give a comprehensive picture of model performance within a broad domain. In this work, we build on the original MathArena benchmark by substantially broadening its scope from final-answer olympiad problems to a continuously maintained evaluation platform for mathematical reasoning with LLMs. MathArena now covers a much wider range of tasks, including proof-based competitions, research-level arXiv problems, and formal proof generation in Lean. Additionally, we maintain a clear evaluation protocol for all models and regularly design new benchmarks as model capabilities improve to ensure that MathArena remains challenging. Notably, the strongest model, GPT-5.5, now reaches 98% on the 2026 USA Math Olympiad and 74% on research-level questions, showing that frontier models can now comfortably solve extremely challenging mathematical problems. This highlights the importance of continuously maintained evaluation platforms like MathArena to track the rapid progress of LLMs in mathematical reasoning.
[NLP-10] EGREFINE: An Execution-Grounded Optimization Framework for Text-to-SQL Schema Refinement
【Quick Read】: This paper addresses the accuracy degradation of Text-to-SQL models on real-world databases whose schemas use ambiguous, abbreviated, or inconsistent naming. Existing approaches treat the schema as fixed and only handle errors downstream; this work instead frames schema refinement as a constrained optimization problem: find a renaming function that maximizes downstream Text-to-SQL execution accuracy while preserving query equivalence through database views. The key to the solution is EGRefine, a four-phase pipeline that screens ambiguous columns, generates context-aware candidate names, verifies names via execution-grounded feedback, and materializes the result as non-destructive SQL views. The design guarantees column-local non-degradation and database-level query equivalence, making refinement safe by construction, and refined schemas transfer across model families to enable refine-once, serve-many-models deployment.
Link: https://arxiv.org/abs/2605.00628
Authors: Jiaqian Wang, Yutao Qi, Wenjin Hou, Yu Pang, Rui Yang
Affiliations: Xidian University
Subjects: Databases (cs.DB); Computation and Language (cs.CL)
Comments: 15 pages, 5 figures, 50 this http URL: this https URL
Abstract:Text-to-SQL enables non-expert users to query databases in natural language, yet real-world schemas often suffer from ambiguous, abbreviated, or inconsistent naming conventions that degrade model accuracy. Existing approaches treat schemas as fixed and address errors downstream. In this paper, we frame schema refinement as a constrained optimization problem: find a renaming function that maximizes downstream Text-to-SQL execution accuracy while preserving query equivalence through database views. We analyze the computational hardness of this problem, which motivates a column-wise greedy decomposition, and instantiate it as EGRefine: a four-phase pipeline that screens ambiguous columns, generates context-aware candidate names, verifies them through execution-grounded feedback, and materializes the result as non-destructive SQL views. The pipeline carries two structural properties: column-local non-degradation, ensured by the conservative selection rule in the verification phase, and database-level query equivalence, ensured by the view-based materialization phase. Together they make the resulting refinement safe by construction at the column level, with cross-column and prompt-level interactions handled empirically rather than analytically. Across controlled schema-degradation, real-world, and enterprise benchmarks, EGRefine recovers accuracy lost to schema naming noise where applicable and correctly abstains where the underlying task exceeds current Text-to-SQL capabilities, with refined schemas transferring across model families to enable refine-once, serve-many-models deployment. Code and data are publicly available at this https URL.
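The view-based materialization can be illustrated in a few lines of stdlib `sqlite3`. The table, its cryptic column names, and the renamings below are hypothetical; the point is only that a non-destructive view exposes readable names while the base table, and hence every existing query against it, stays untouched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A table with the kind of abbreviated naming EGRefine targets (hypothetical).
cur.execute("CREATE TABLE cust (c_id INTEGER, c_nm TEXT, reg_dt TEXT)")
cur.execute("INSERT INTO cust VALUES (1, 'Ada', '2024-01-05')")

# Non-destructive refinement: a view renames columns, leaving `cust` intact.
cur.execute("""
    CREATE VIEW customer AS
    SELECT c_id AS customer_id, c_nm AS customer_name, reg_dt AS registration_date
    FROM cust
""")

# Query equivalence: the view returns exactly the base table's rows.
base = cur.execute("SELECT c_id, c_nm FROM cust").fetchall()
view = cur.execute("SELECT customer_id, customer_name FROM customer").fetchall()
assert base == view == [(1, 'Ada')]
```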
[NLP-11] SC-Taxo: Hierarchical Taxonomy Generation under Semantic Consistency Constraints using Large Language Models
【Quick Read】: This paper addresses the structural inconsistencies and cross-level semantic misalignment in existing approaches to scientific taxonomy generation, which limit the accuracy and usability of knowledge organization. The key to the proposed semantic-consistent taxonomy generation framework (SC-Taxo) is to leverage large language models (LLMs) with hierarchy-aware refinement stages: a bidirectional heading generation mechanism jointly performs bottom-up abstraction and top-down semantic constraint, while peer-level semantic dependencies are modeled to strengthen horizontal consistency, yielding consistent improvements in hierarchy alignment and heading quality.
Link: https://arxiv.org/abs/2605.00620
Authors: Shiqiang Cai, Nianhong Niu, Shizhu He, Kang Liu, Jun Zhao
Affiliations: Institute of Automation, Chinese Academy of Sciences; University of Science and Technology Beijing; School of Artificial Intelligence, University of Chinese Academy of Sciences
Subjects: Computation and Language (cs.CL)
Comments: 12 pages, 5 figures, 2 tables
Abstract:Scientific literature is expanding at an unprecedented pace, making it increasingly challenging to efficiently organize and access domain knowledge. A high-quality scientific taxonomy offers a structured and hierarchical representation of a research field, facilitating literature exploration and topic navigation, as well as enabling downstream applications such as trend analysis, idea generation, and information retrieval. However, existing taxonomy generation approaches often suffer from structural inconsistencies and semantic misalignment across hierarchical levels. Through empirical analysis, we find that these issues largely stem from inadequate modeling of hierarchical semantic consistency. To address this limitation, we propose a semantic-consistent taxonomy generation (SC-Taxo) framework that leverages large language models (LLMs) with hierarchy-aware refinement stages to ensure semantic consistency. Specifically, SC-Taxo introduces a bidirectional heading generation mechanism that jointly performs bottom-up abstraction and top-down semantic constraint, while further capturing peer-level semantic dependencies to enhance horizontal consistency. Experiments on multiple benchmark datasets demonstrate consistent improvements in hierarchy alignment and heading quality, and additional evaluation on Chinese scientific literature validates its robust cross-lingual generalization.
[NLP-12] Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus
【Quick Read】: This paper investigates whether machine translation preserves the stability of cosine similarity between paragraph embeddings, i.e., whether translation leaves the semantic structure of the original text intact. The core challenge is assessing how well semantic relationships in embedding space survive translation without measuring semantic shift directly. The key to the solution is a non-inferiority testing framework based on inter-model agreement: disagreement between embedding models on original-language text serves as a calibrated invariance threshold for deciding whether translation significantly distorts semantic structure. The framework is corpus- and pipeline-agnostic and extends naturally to downstream tasks; applied to the Manifesto Corpus, it identifies ten languages where semantic structure is preserved under translation and four with detectable distortion.
Link: https://arxiv.org/abs/2605.00618
Authors: Daria Boratyn, Damian Brzyski, Albert Leśniak, Wojciech Łukasik, Maciej Rapacz, Jan Rybicki, Wojciech Słomczyński, Dariusz Stolicki
Affiliations: Jagiellonian Center for Quantitative Political Science, Jagiellonian University, Kraków, Poland; AGH University, Kraków, Poland; Jagiellonian Center for Digital Humanities, Jagiellonian University, Kraków, Poland
Subjects: Computation and Language (cs.CL)
Comments: 14 tables, 1 figure
Abstract:We investigate the extent to which cosine similarity between paragraph embeddings is invariant under machine translation, using the Manifesto Corpus of over 2,800 political party platforms in 28 languages translated to English via the EU eTranslation service. Rather than measuring translation-induced semantic shift directly we measure the stability of pairwise similarity relationships across embedding models, and use inter-model disagreement on original-language text as a calibrated invariance threshold. This yields a per-language non-inferiority test for four hypotheses about how translation interacts with embedding choice, with verdicts that distinguish languages where translation demonstrably preserves semantic structure from those where it demonstrably degrades it and from those where the available evidence does not resolve the question. The framework is corpus- and pipeline-agnostic and extends naturally to downstream tasks. Applied to our data, it identifies ten languages with translation invariance and four with detectable distortion.
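The paper's core measurement, the shift in pairwise cosine similarity before and after translation, reduces to a few lines. The toy vectors and the 0.05 threshold below are invented for illustration; in the paper the invariance threshold is calibrated from inter-model disagreement, not fixed by hand.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


# Toy embeddings of one paragraph pair, original-language vs. translated.
orig_a, orig_b = [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]
tran_a, tran_b = [0.9, 0.1, 1.0], [1.0, 0.9, 0.1]

# Translation invariance = the pairwise similarity barely moves.
shift = abs(cosine(orig_a, orig_b) - cosine(tran_a, tran_b))
THRESHOLD = 0.05  # hypothetical calibrated invariance threshold
print(f"similarity shift: {shift:.3f}, invariant: {shift <= THRESHOLD}")
```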
[NLP-13] Beyond Decodability: Reconstructing Language Model Representations with an Encoding Probe
【Quick Read】: This paper addresses two limitations of the conventional decoding-probe approach to analyzing language model representations: the contributions of different features to model representations cannot be directly compared, and correlations between features can confound probing results. The key to the proposed Encoding Probe is to reverse the probing direction and reconstruct the model's internal representations from interpretable features, so that each feature's independent contribution can be quantified. Evaluations on text and speech transformer models show that speaker-related effects vary strongly across training objectives and datasets, while syntactic and lexical features contribute independently to reconstruction, demonstrating the complementary value of encoding probes for interpreting model representations.
Link: https://arxiv.org/abs/2605.00607
Authors: Gaofei Shen, Martijn Bentum, Tom Lentz, Afra Alishahi, Grzegorz Chrupała
Affiliations: Tilburg University; Radboud University
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:
Abstract:Probing is widely used to study which features can be decoded from language model representations. However, the common decoding probe approach has two limitations that we aim to solve with our new encoding probe approach: contributions of different features to model representations cannot be directly compared, and feature correlations can affect probing results. We present an Encoding Probe that reverses this direction and reconstructs internal representations of models using interpretable features. We evaluate this method on text and speech transformer models, using feature sets spanning acoustics, phonetics, syntax, lexicon, and speaker identity. Our results suggest that speaker-related effects vary strongly across different training objectives and datasets, while syntactic and lexical features contribute independently to reconstruction. These results show that the Encoding Probe provides a complementary perspective on interpreting model representations beyond decodability.
[NLP-14] Structure Liberates: How Constrained Sensemaking Produces More Novel Research Output
【Quick Read】: This paper addresses the underestimation of the ideation phase in scientific discovery, which existing approaches treat as a brief preamble despite its central role in the research process. The key to the proposed SCISENSE framework is to operationalize ideation as a structured sequence of eight cognitive stages (following Pirolli & Card, 2005) and to construct SCISENSE-Traj, a 100K-scale dataset of citation-conditioned research trajectories, making the process quantifiable and trainable. The central innovation distinguishes two training modes: Target, which reconstructs the ideation path leading to a known paper from its cited works, and Infer, which proposes novel directions from the same citations. Contrary to the assumption that looser supervision promotes exploration, Target training, though more heavily supervised, yields higher-quality and more diverse downstream outputs (e.g., research artifacts produced by coding agents), suggesting that targeted ideation reduces the cognitive burden on downstream agents and frees them to explore more creatively.
Link: https://arxiv.org/abs/2605.00557
Authors: James Mooney, Zae Myung Kim, Young-Jun Lee, Dongyeop Kang
Affiliations: University of Minnesota
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Scientific discovery is an extended process of ideation–surveying prior work, forming hypotheses, and refining reasoning–yet existing approaches treat this phase as a brief preamble despite its central role in research. We introduce SCISENSE, a sensemaking-grounded framework that operationalizes ideation as a structured sequence of eight cognitive stages (Pirolli \ Card, 2005). We construct SCISENSE-Traj, a 100K-scale dataset of citation-conditioned research trajectories in two modes: Target, where an LLM reconstructs the ideation path leading to a known paper from its cited works, and Infer, where the LLM proposes novel directions from the same citations. We distill these into SCISENSE-LM, a family of sensemaking LLMs spanning 3B to 70B parameters. Contrary to the assumption that looser supervision promotes greater exploration, Target-trained models achieve a 2.0% improvement in trajectory quality over Infer-trained models while also producing more novel and diverse outputs. This advantage propagates downstream: coding agents conditioned on Target trajectories produce research artifacts with higher executability and quality than those conditioned on Infer trajectories. This suggests that targeted ideation reduces cognitive burden on downstream agents, freeing them to explore more creatively. SCISENSE offers both a practical tool for augmenting LLM-driven research workflows and a principled testbed for studying how planning shapes scientific discovery.
[NLP-15] A11y-Compressor: A Framework for Enhancing the Efficiency of GUI Agent Observations through Visual Context Reconstruction and Redundancy Reduction ACL
【Quick Read】: This paper addresses the unreliable perceptual representations AI agents obtain when interacting with graphical user interfaces (GUIs), caused by reliance on accessibility trees that are redundant and lack structural information. The key to the proposed A11y-Compressor framework is a lightweight, structured transformation pipeline, comprising modal detection, redundancy reduction, and semantic structuring, that compresses linearized accessibility trees into compact representations preserving spatial relationships, improving agents' grounding of GUI elements and their task execution.
Link: https://arxiv.org/abs/2605.00551
Authors: Michito Takeshita, Takuro Kawada, Takumi Ohashi, Shunsuke Kitada, Hitoshi Iyatomi
Affiliations: Hosei University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages, 5 figures, 5 tables. Accepted to ACL SRW 2026. Project page: this https URL
Abstract:AI agents that interact with graphical user interfaces (GUIs) require effective observation representations for reliable grounding. The accessibility tree is a commonly used text-based format that encodes UI element attributes, but it suffers from redundancy and lacks structural information such as spatial relationships among elements. We propose A11y-Compressor, a framework that transforms linearized accessibility trees into compact and structured representations. Our implementation, Compressed-a11y, applies a lightweight and structured transformation pipeline with modal detection, redundancy reduction, and semantic structuring. Experiments on the OSWorld benchmark show that Compressed-a11y reduces input tokens to 22% of the original while improving task success rates by 5.1 percentage points on average.
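A toy version of the redundancy-reduction idea: prune unnamed leaves and collapse anonymous single-child wrappers in an accessibility tree. The dict format and the two rules are our own simplification, not the Compressed-a11y specification.

```python
def compress(node):
    """Recursively prune unnamed leaves and collapse single-child wrappers."""
    children = [c for c in (compress(ch) for ch in node.get("children", [])) if c]
    if not node.get("name") and not children:
        return None               # unnamed leaf (e.g. decorative image): drop
    if not node.get("name") and len(children) == 1:
        return children[0]        # anonymous wrapper: collapse into its child
    out = {"role": node["role"], "name": node.get("name", "")}
    if children:
        out["children"] = children
    return out


tree = {"role": "group", "name": "", "children": [
    {"role": "group", "name": "", "children": [
        {"role": "button", "name": "Save"},
        {"role": "image", "name": ""},   # decorative, no accessible label
    ]},
]}
print(compress(tree))  # → {'role': 'button', 'name': 'Save'}
```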
[NLP-16] AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLM s
【Quick Read】: This paper addresses the high GPU memory footprint of training large language models (LLMs) caused by storing high-precision activations and gradients, where naive 4-bit activation and 8-bit gradient quantization easily leads to slow convergence or accuracy loss. The key to the proposed AGoQ method lies in two innovations: a layer-aware activation quantization algorithm that allocates bit-widths per layer based on layer type and pipeline stage to achieve near 4-bit activation storage, and a gradient quantization algorithm that combines 8-bit gradient storage with precision-preserving 8-bit All-Reduce communication, reducing memory while shortening communication time. Experiments on clusters of up to 64 GPUs show up to 52% memory reduction and a 1.34x training speedup, with pretraining convergence and downstream accuracy matching state-of-the-art training systems such as Megatron-LM and DeepSpeed.
Link: https://arxiv.org/abs/2605.00539
Authors: Wenxiang Lin, Juntao Huang, Luhan Zhang, Laili Li, Xiang Bao, Mengyang Zhang, Bing Wang, Shaohuai Shi
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:
Abstract:Quantization is a key method for reducing the GPU memory requirement of training large language models (LLMs). Yet, current approaches are ineffective for 4-bit activations and 8-bit gradients, which would easily cause slow convergence or accuracy loss. To address this, we introduce AGoQ, incorporating two new techniques: 1) a layer-aware activation quantization algorithm that allocates appropriate bit-widths for activations of various layers based on their types and pipeline stages to achieve near 4-bit activation storage, and 2) a gradient quantization algorithm that reduces memory usage and shortens communication time by employing 8-bit gradient storage and precision-preserving 8-bit All-Reduce communication. We conduct extensive experiments using different sizes of LLMs on two GPU clusters (up to 64 GPUs), and the experimental results show that our AGoQ reduces the memory by up to 52% and achieves up to 1.34 \times improvement of training speed compared to state-of-the-art training systems Megatron-LM (w/ or w/o ZeRO), COAT and DeepSpeed with 8B to 32B LLaMA models, while achieving convergence loss on pretraining and comparable accuracy on downstream tasks with LLaMA architectures.
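Plain symmetric quantization illustrates what 8-bit gradient storage buys: scale by the max absolute value, round into int8 range, and dequantize on use. This is a generic sketch of the idea, not AGoQ's actual layer-aware or All-Reduce scheme.

```python
def quantize8(grads):
    """Symmetric 8-bit quantization: map floats into [-127, 127] ints."""
    scale = max(abs(g) for g in grads) / 127 or 1.0  # guard all-zero input
    return [round(g / scale) for g in grads], scale


def dequantize8(q, scale):
    """Recover approximate float gradients from int8 values."""
    return [v * scale for v in q]


grads = [0.03, -1.27, 0.5]
q, s = quantize8(grads)
assert all(-127 <= v <= 127 for v in q)    # each value fits in one byte
recon = dequantize8(q, s)
err = max(abs(a - b) for a, b in zip(grads, recon))
assert err <= s / 2 + 1e-12                # rounding error <= half a step
print(q, round(err, 5))
```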
[NLP-17] ControBench: An Interaction-Aware Benchmark for Controversial Discourse Analysis on Social Networks
【Quick Read】: This paper addresses the lack of a comprehensive dataset for studying online conversations across ideological divides, which underpins research on political polarization, misinformation, and content moderation. Existing datasets either ignore interaction structure, lack semantic richness, or cannot stably track user-level ideological identity. The key to the ControBench benchmark is combining heterogeneous social interaction graphs with rich textual semantics: the graph contains user and post nodes with semantically enriched edges (e.g., user-comment-user edges) that preserve local argumentative context, while self-declared Reddit flairs serve as a scalable proxy for ideological identity without manual annotation. The resulting datasets exhibit low to negative adjusted homophily, faithfully reflecting the cross-cutting interactions of real-world debate and positioning ControBench as a challenging, realistic evaluation platform for controversial discourse analysis.
Link: https://arxiv.org/abs/2605.00513
Authors: Ta Thanh Thuy, Jiaqi Zhu, Xuan Liu, Lin Shang, Reihaneh Rabbany, Guillaume Rabusseau, Lihui Chen, Zheng Yilun, Sitao Luan
Affiliations: Nanyang Technological University; NVIDIA; Nanjing University; Mila - Quebec AI Institute; McGill University; University of Montreal
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Understanding how people argue across ideological divides online is important for studying political polarization, misinformation, and content moderation. Existing datasets capture only part of this problem: some preserve text but ignore interaction structure, some model structure without rich semantics, and others represent conversations without stable user-level ideological identity. We introduce ControBench, a benchmark for controversial discourse analysis that combines heterogeneous social interaction graphs with rich textual semantics. Built from Reddit discussions on three topics, Trump, abortion, and religion, ControBench contains 7,370 users, 1,783 posts, and 26,525 interactions. The graph contains user and post nodes connected by semantically enriched edges; in particular, user-comment-user edges encode both a reply and the parent comment that it responds to, preserving local argumentative context. User labels are derived from self-declared Reddit flairs, providing a scalable proxy for ideological identity without manual annotation. The resulting datasets exhibit low or negative adjusted homophily (Trump: -0.77, Abortion: 0.06, Religion: 0.04), reflecting the cross-cutting structure of real-world debate. We evaluate graph neural networks, pretrained language models, and large language models on ControBench and observe distinct performance patterns across topics and model families, especially when ideological boundaries are ambiguous. These results position ControBench as a challenging and realistic benchmark for controversial discourse analysis.
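Edge homophily, the statistic behind the reported values, is simply the fraction of edges joining same-label endpoints; the adjusted variant normalizes it against what random mixing would produce. A minimal sketch under a common adjusted-homophily definition; the labels and edges below are invented, not from ControBench.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose endpoints share a label."""
    return sum(labels[u] == labels[v] for u, v in edges) / len(edges)


def adjusted_homophily(edges, labels):
    """Edge homophily normalized by the expectation under random mixing.

    p_k is each class's (degree-weighted) share of edge endpoints, so
    sum(p_k^2) is the same-label rate random wiring would produce.
    Negative values indicate heterophily: fewer same-label edges than chance.
    """
    ends = [labels[x] for e in edges for x in e]
    p = {k: ends.count(k) / len(ends) for k in set(ends)}
    expected = sum(q * q for q in p.values())
    return (edge_homophily(edges, labels) - expected) / (1 - expected)


labels = {0: "left", 1: "right", 2: "left", 3: "right"}
edges = [(0, 1), (2, 3), (0, 3), (2, 1)]   # every edge crosses the divide
print(adjusted_homophily(edges, labels))   # → -1.0
```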
[NLP-18] Surprisal Minimisation over Goal-directed Alternatives Predicts Production Choice in Dialogue ACL2026
【Quick Read】: This paper addresses how to model speaker choice in natural language production, distinguishing goal-directed alternatives, which realise a fixed communicative intent, from goal-agnostic alternatives defined only by contextual plausibility. The key to the solution is a cost-sensitive choice framework over information-theoretic costs: language models (LMs) generate both types of alternative sets, and production choices are analysed under deterministic and probabilistic cost minimisation. The results show that surprisal minimisation relative to goal-directed alternatives provides the strongest and most consistent predictive account, suggesting that speakers in naturalistic dialogue optimise information efficiency rather than uniform information density or length-based costs.
Link: https://arxiv.org/abs/2605.00506
Authors: Tom Utting, Mario Giulianelli, Arabella Sinclair
Affiliations: University of Aberdeen; University College London
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, to appear at ACL 2026 (Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics)
Abstract:We model utterance production as probabilistic cost-sensitive choice over contextual alternatives, using information-theoretic notions of cost. We distinguish between goal-directed alternatives that realise a fixed communicative intent and goal-agnostic alternatives defined only by contextual plausibility, allowing us to derive speaker- and listener-oriented interpretations of different cost measures. We present a procedure to generate both types of alternative sets using language models. Analysing production choices in open-ended dialogue under both deterministic and probabilistic cost minimisation, we find that surprisal minimisation relative to goal-directed alternatives provides the strongest predictive account under both analyses. By contrast, uniform information density and length-based costs exhibit weaker and less consistent predictive power across conditions. More broadly, our study suggests that alternative-conditioned optimisation with LM-generated alternatives provides a principled framework for studying speaker and listener pressures in naturalistic language production.
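Deterministic surprisal minimisation over an alternative set reduces to an argmin over -log2 p(u | context). The utterances and probabilities below are toy numbers of our own, not LM outputs; in the paper the alternatives themselves are generated by a language model.

```python
from math import log2


def surprisal(p):
    """Information-theoretic cost of an utterance with probability p."""
    return -log2(p)


# Hypothetical goal-directed alternatives with toy LM probabilities.
alternatives = {
    "turn left at the light": 0.25,
    "make a left there": 0.125,
    "left": 0.5,
}

# Deterministic cost minimisation: pick the lowest-surprisal alternative.
costs = {u: surprisal(p) for u, p in alternatives.items()}
choice = min(costs, key=costs.get)
print(choice, costs[choice])  # → left 1.0
```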
[NLP-19] ReLay: Personalized LLM-Generated Plain-Language Summaries for Better Understanding but at What Cost?
【速读】: 该论文旨在解决Plain Language Summaries (PLS) 在健康信息传播中因采用“一刀切”风格而无法满足不同读者信息需求和理解能力差异的问题,从而可能导致公众对科学信息的误解并影响实际决策。其解决方案的关键在于利用大语言模型(LLMs)实现PLS的个性化生成,并通过构建ReLay数据集(包含50名普通参与者在静态与交互式场景下的300个PLS样本及其行为、认知结果与质量评估)系统评估个性化策略的有效性与安全性。研究发现,个性化能显著提升理解和感知质量,但也存在强化用户偏见和引入幻觉的风险,揭示了个性化与安全之间的权衡关系,强调未来需发展兼顾效果与可信度的个性化方法。
链接: https://arxiv.org/abs/2605.00468
作者: Joey Chan,Yikun Han,Jingyuan Chen,Samuel Fang,Lauren D. Gryboski,Alexandra Lee,Sheel Tanna,Qingqing Zhu,Zhiyong Lu,Lucy Lu Wang,Yue Guo
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); University of Virginia (弗吉尼亚大学); University of Colorado Anschutz (科罗拉多大学安舒茨医学校区); Johns Hopkins University (约翰霍普金斯大学); University of Chicago Pritzker School of Medicine (芝加哥大学普里茨克医学院); National Library of Medicine, National Institutes of Health (美国国家医学图书馆,美国国立卫生研究院); University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Plain Language Summaries (PLS) aim to make research accessible to lay readers, but they are typically written in a one-size-fits-all style that ignores differences in readers’ information needs and comprehension. In health contexts, this limitation is particularly important because misunderstanding scientific information can affect real-world decisions. Large language models (LLMs) offer new opportunities for personalizing PLS, but it remains unclear whether personalization helps, which strategies are most effective, and how to balance personalization with safety. We introduce ReLay, a dataset of 300 participant–PLS pairs from 50 lay participants in both static (expert-written) and interactive (LLM-personalized) settings. ReLay includes user characteristics, health information needs, information-seeking behavior, comprehension outcomes, interaction logs, and quality ratings. We use ReLay to evaluate five LLMs across two personalization methods. Personalization improves comprehension and perceived quality, but it also raises the risk of reinforcing user biases and introducing hallucinations, revealing a trade-off between personalization and safety. These findings highlight the need for personalization methods that are both effective and trustworthy for diverse lay audiences.
[NLP-20] Impact of Task Phrasing on Presumptions in Large Language Models
【速读】: 该论文旨在解决大语言模型(Large-Language Models, LLMs)在不可预测的真实应用场景中因任务表述引发的先验假设(presumptions)问题,这些问题会导致模型在任务偏离初始假设时难以适应,从而影响其安全性和可靠性。解决方案的关键在于通过设计中性(neutral)的任务表述方式,有效降低模型对特定假设的依赖,使其在推理过程中展现出更稳健的逻辑判断能力,而非受制于隐含前提。
链接: https://arxiv.org/abs/2605.00436
作者: Kenneth J.K. Ong
机构: ST Engineering, Singapore(新加坡科技工程有限公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Concerns with the safety and reliability of applying large-language models (LLMs) in unpredictable real-world applications motivate this study, which examines how task phrasing can lead to presumptions in LLMs, making it difficult for them to adapt when the task deviates from these assumptions. We investigated the impact of these presumptions on the performance of LLMs using the iterated prisoner’s dilemma as a case study. Our experiments reveal that LLMs are susceptible to presumptions when making decisions even with reasoning steps. However, when the task phrasing was neutral, the models demonstrated logical reasoning without much presumptions. These findings highlight the importance of proper task phrasing to reduce the risk of presumptions in LLMs.
[NLP-21] Escaping Mode Collapse in LLM Generation via Geometric Regulation ICML2026
【速读】: 该论文旨在解决生成式模型中长期存在的模式崩溃(mode collapse)问题,特别是在自回归文本生成中表现为显式循环、多样性逐渐丧失及轨迹过早收敛等现象。作者从动力系统视角出发,将模式崩溃重新诠释为由于几何坍缩(geometric collapse)导致的状态空间可达性降低:在生成过程中,模型内部轨迹被限制在其表示空间的一个低维区域内。这一观点表明,模式崩溃并非单纯的词元层面问题,无法仅通过符号约束或仅基于概率的解码启发式方法可靠解决。解决方案的关键在于提出强化模式调控(Reinforced Mode Regulation, RMR),这是一种轻量级、在线的状态空间干预机制,通过在Transformer值缓存(value cache)中对主导的自增强方向进行低秩阻尼(low-rank damping)来调节,从而有效抑制几何坍缩,显著减少模式崩溃,并实现极低熵率下的稳定高质量生成(低至0.8 nats/step)。
链接: https://arxiv.org/abs/2605.00435
作者: Xin Du,Kumiko Tanaka-Ishii
机构: 未知
类目: Computation and Language (cs.CL); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Chaotic Dynamics (nlin.CD)
备注: Accepted to ICML 2026
Abstract:Mode collapse is a persistent challenge in generative modeling and appears in autoregressive text generation as behaviors ranging from explicit looping to gradual loss of diversity and premature trajectory convergence. We take a dynamical-systems view and reinterpret mode collapse as reduced state-space accessibility caused by geometric collapse: during generation, the model’s internal trajectory becomes confined to a low-dimensional region of its representation space. This implies mode collapse is not purely a token-level phenomenon and cannot be reliably solved by symbolic constraints or probability-only decoding heuristics. Guided by this perspective, we propose Reinforced Mode Regulation (RMR), a lightweight, online state-space intervention that regulates dominant self-reinforcing directions in the Transformer value cache (implemented as low-rank damping). Across multiple large language models, RMR substantially reduces mode collapse and enables stable, high-quality generation at extremely low entropy rates (down to 0.8 nats/step), whereas standard decoding typically collapses near 2.0 nats/step.
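论文将 RMR 实现为对 value cache 主导方向的低秩阻尼。下面用 numpy 给出一个示意性重构(rank 与阻尼系数 factor 均为本文假设的超参数,并非论文给定值):对隐表示矩阵做 SVD,把最大的若干奇异值乘以小于 1 的系数后重建。

```python
import numpy as np

def low_rank_damping(V, rank=1, factor=0.5):
    """对矩阵 V(序列长度 x 隐维度)的前 rank 个奇异方向做阻尼。

    主导的自增强方向对应最大的奇异值;将其乘以 factor < 1,
    即可在不改变其余方向的情况下抑制几何坍缩(示意性实现)。
    """
    U, s, Vt = np.linalg.svd(V, full_matrices=False)
    s = s.copy()
    s[:rank] *= factor                 # 仅削弱最主导的方向
    return (U * s) @ Vt                # 用修改后的奇异值重建
```

factor=1.0 时退化为恒等变换;factor<1 时 Frobenius 范数严格减小。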
[NLP-22] RadLite: Multi-Task LoRA Fine-Tuning of Small Language Models for CPU-Deployable Radiology AI
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在放射学(radiology)领域部署受限于计算资源的问题,特别是在资源受限的临床环境中难以应用。其核心挑战在于如何在不依赖GPU的情况下实现高性能多任务放射学AI辅助功能。解决方案的关键在于采用参数高效微调方法LoRA(Low-Rank Adaptation)对小型语言模型(Small Language Models, SLMs,3-4B参数规模)进行训练,并结合模型量化(GGUF格式)以适配消费级CPU硬件。实验表明,经LoRA微调后的Qwen2.5-3B和Qwen3-4B模型在9项放射学任务上显著优于零样本基线,且通过任务外最优集成(oracle ensemble)可进一步提升性能,最终实现仅需CPU即可运行的轻量级多任务放射学助手(RadLite),为临床落地提供了可行路径。
链接: https://arxiv.org/abs/2605.00421
作者: Pankaj Gupta,Kartik Bose
机构: Postgraduate Institute of Medical Education and Research, Chandigarh, India
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) show promise in radiology but their deployment is limited by computational requirements that preclude use in resource-constrained clinical environments. We investigate whether small language models (SLMs) of 3-4 billion parameters can achieve strong multi-task radiology performance through LoRA fine-tuning, enabling deployment on consumer-grade CPUs. We train Qwen2.5-3B-Instruct and Qwen3-4B on 162K samples spanning 9 radiology tasks - RADS classification across 10 systems, impression generation, temporal comparison, radiology NLI, NER, abnormality detection, N/M staging, and radiology QA - compiled from 12 public datasets. Both models are evaluated on up to 500 held-out test samples per task with standardized metrics. Our key findings are: (1) LoRA fine-tuning dramatically improves performance over zero-shot baselines (RADS accuracy +53%, NLI +60%, N-staging +89%); (2) the two models exhibit complementary strengths - Qwen2.5 excels at structured generation tasks while Qwen3 dominates extractive tasks; (3) a task-routed oracle ensemble combining both models achieves the best performance across all tasks; (4) few-shot prompting with fine-tuned models hurts performance, demonstrating that LoRA adaptation is more effective than in-context learning for specialized domains; and (5) models can be quantized to GGUF format (~1.8-2.4GB) for CPU deployment at 4-8 tokens/second on consumer hardware. Our work demonstrates that small, efficiently fine-tuned models - which we collectively call RadLite - can serve as practical multi-task radiology AI assistants deployable entirely on consumer hardware without GPU requirements.
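摘要中的 LoRA 微调只训练低秩增量矩阵。其核心计算可用如下最小示意说明(维度与数值均为假想;真实训练通常借助 PEFT 等库完成,这里仅还原数学形式):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """LoRA 前向的最小示意:有效权重 W_eff = W + (alpha / r) * B @ A。

    W: (d_out, d_in) 冻结的预训练权重;A: (r, d_in);B: (d_out, r)。
    训练时只更新低秩矩阵 A、B,参数量远小于全量微调。
    """
    r = A.shape[0]
    delta = (alpha / r) * (B @ A)      # 低秩增量
    return x @ (W + delta).T
```

当 B 初始化为零时,输出与原模型完全一致,这也是 LoRA 常用的初始化方式。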
[NLP-23] Rethinking LLM Ensembling from the Perspective of Mixture Models ICML2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)集成方法中存在的计算效率低下问题。传统集成方法需对每个模型独立执行前向传播以显式计算集成分布,导致显著的计算开销。其解决方案的关键在于将集成重新诠释为混合模型(mixture model),提出一种称为“类混合模型集成”(Mixture-model-like Ensemble, ME)的新方法:在生成每个token时,通过随机采样选择单一模型进行推理,从而避免显式计算完整的集成分布。该方法在数学上等价于从集成分布中采样,但仅需调用一个模型,实现1.78x–2.68x的速度提升,同时保持性能优势,并揭示了LLM集成与token级路由方法之间的理论联系。
链接: https://arxiv.org/abs/2605.00419
作者: Jiale Fu,Yuchu Jiang,Peijun Wu,Chonghan Liu,Joey Tianyi Zhou,Xu Yang
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: ICML 2026 Spotlight
Abstract:Model ensembling is a well-established technique for improving the performance of machine learning models. Conventionally, this involves averaging the output distributions of multiple models and selecting the most probable label. This idea has been naturally extended to large language models (LLMs), yielding improved performance but incurring substantial computational cost. This inefficiency stems from directly applying conventional ensemble implementation to LLMs, which require a separate forward pass for each model to explicitly compute the ensemble distribution. In this paper, we propose the Mixture-model-like Ensemble (ME). By reinterpreting the ensemble as a mixture model, ME stochastically selects a single model at each step to generate the next token, thereby avoiding the need to explicitly compute the full ensemble distribution. ME is mathematically equivalent to sampling from the ensemble distribution, but requires invoking only one model, making it 1.78x-2.68x faster than conventional ensemble. Furthermore, this perspective connects LLM ensembling and token-level routing methods, suggesting that LLM ensembling is a special case of routing methods. Our findings open new avenues for efficient LLM ensembling and motivate further exploration of token-level routing strategies for LLMs. Our code is available at this https URL.
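ME 的关键性质是:每步等概率随机选一个成员模型并从其下一词分布采样,在数学上等价于从平均后的集成分布采样,但每步只需一次模型前向。下面用两个假想的下一词分布做蒙特卡洛验证:

```python
import numpy as np

rng = np.random.default_rng(0)

def me_sample_token(dists, rng):
    """每步随机选一个成员模型,再从该模型的下一词分布采样一个 token。"""
    m = rng.integers(len(dists))               # 随机选择成员模型
    return rng.choice(len(dists[m]), p=dists[m])

# 两个假想模型在词表大小为 3 上的下一词分布
dists = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.2, 0.7])]
mixture = np.mean(dists, axis=0)               # 显式集成分布 [0.4, 0.2, 0.4]
samples = [me_sample_token(dists, rng) for _ in range(20000)]
freq = np.bincount(samples, minlength=3) / len(samples)
```

经验频率 freq 收敛到 mixture,而全程不需要显式计算集成分布,也不需要同时前向所有模型。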
[NLP-24] Agent Capsules: Quality-Gated Granularity Control for Multi-Agent LLM Pipelines
【速读】: 该论文旨在解决多智能体(multi-agent)流水线在执行过程中因频繁调用大语言模型(Large Language Model, LLM)而导致的高Token消耗问题,同时避免简单合并LLM调用所引发的工具丢失和提示压缩导致的质量下降。其解决方案的关键在于提出Agent Capsules——一种自适应执行运行时机制,将多智能体执行视为带有经验质量约束的优化问题:通过衡量每组代理的协调开销、评分组合机会,并在三种复合执行策略(标准、两阶段、顺序)中动态选择最优模式,且以滚动均值输出质量作为模式切换门控条件。该框架不依赖于重写提示词来缓解压缩效应,而是通过逐步向单代理调度迁移来恢复质量,从而在保证性能的前提下实现显著的Token效率提升。
链接: https://arxiv.org/abs/2605.00410
作者: Aninda Ray
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 7 figures. Code: this https URL
Abstract:A multi-agent pipeline with N agents typically issues N LLM calls per run. Merging agents into fewer calls (compound execution) promises token savings, but naively merged calls silently degrade quality through tool loss and prompt compression. We present Agent Capsules, an adaptive execution runtime that treats multi-agent pipeline execution as an optimization problem with empirical quality constraints. The runtime instruments coordination overhead per group, scores composition opportunity, selects among three compound execution strategies, and gates every mode switch on rolling-mean output quality. A controlled negative result confirms that injecting more context into a merged call worsens compression rather than relieving it, so the framework’s escalation ladder (standard, then two-phase, then sequential) recovers quality by moving toward per-agent dispatch rather than by rewriting merged prompts. On LLM-judged quality, the controller matches a hand-tuned oracle on every measured (model, group, mode) cell: routing compound whenever the oracle would, and reverting to fine whenever quality would fail the floor, without per-model configuration. Against a hand-crafted LangGraph implementation of a 14-agent competitive intelligence pipeline, Agent Capsules uses 51% fewer fine-mode input tokens and 42% fewer compound-mode input tokens, at +0.020 and +0.017 quality respectively. Against a DSPy implementation of a 5-agent due diligence pipeline, the framework uses 19% fewer tokens than uncompiled DSPy at quality parity, and 68% fewer tokens than MIPROv2 at +0.052 quality. Even before compound mode fires, the runtime delivers efficiency through automatic policy resolution, cache-aligned prompts, and topology-aware context injection, matching both hand-tuned and compile-time baselines without training data or per-pipeline engineering.
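摘要中的质量门控可以抽象为一个对滚动均值设下限的小组件:只有当最近若干次输出的平均质量不低于下限时,才允许切换到合并执行模式。窗口大小与质量下限均为本文假设值,仅示意控制逻辑:

```python
from collections import deque

class QualityGate:
    """滚动均值质量门控的示意实现。"""

    def __init__(self, floor=0.8, window=5):
        self.floor = floor
        self.scores = deque(maxlen=window)   # 只保留最近 window 个质量分

    def record(self, score):
        self.scores.append(score)

    def allow_compound(self):
        if not self.scores:
            return False                     # 无历史数据时保守地回退到逐 agent 执行
        return sum(self.scores) / len(self.scores) >= self.floor
```

合并执行一旦导致滚动均值跌破 floor,门控立即否决后续的模式切换,对应摘要中"gates every mode switch on rolling-mean output quality"。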
[NLP-25] Agentic AI for Substance Use Education: Integrating Regulatory and Scientific Knowledge Sources
【速读】: 该论文旨在解决传统物质使用教育在可扩展性(scalability)、个性化(personalization)以及信息时效性方面存在的问题,尤其是在快速变化的物质滥用环境中。其解决方案的关键在于构建一个基于智能体(agentic)架构的AI网络应用,该系统通过实时整合美国缉毒局(Drug Enforcement Administration, DEA)的权威监管数据与同行评审文献,利用检索增强生成(retrieval-augmented generation, RAG)技术,对102篇经筛选文档进行语义切片并存储为向量表示,从而实现上下文敏感、事实准确且可验证的物质使用教育内容生成。专家评估结果显示系统在事实准确性、引用质量、上下文连贯性和法规适配性四个维度均获得高分(平均4.18–4.35),表明该方法为未来健康教育的规模化、精准化和可信交付提供了可行路径。
链接: https://arxiv.org/abs/2605.00383
作者: Kosar Haghani,Zahra Kolagar,Mohammed Atiquzzaman
机构: 未知
类目: Computation and Language (cs.CL)
备注: 22 pages, 6 figures, 2 tables
Abstract:The delivery of traditional substance education has remained problematic due to challenges in scalability, personalization, and the currency of information in a rapidly evolving substance use landscape. While artificial intelligence (AI) offers a promising frontier for enhancing educational delivery, its application in providing real-time, authoritative substance use education remains largely underexplored. We built an agentic-based AI web application that combined Drug Enforcement Administration records with peer-reviewed literature in real-time to provide transparent context-sensitive substance use education. The system uses retrieval-augmented generation with a carefully filtered corpus of 102 documents and dynamic PubMed queries. Document storage was semantically chunked and placed in a vector representation in order to be easily retrieved. We conducted an expert evaluation study in which a panel of five subject matter experts generated 30 domain-specific questions, and two independent raters assessed 90 system interactions (30 primary questions plus two contextual follow-ups each) using a five-point Likert scale across four criteria: factual accuracy, citation quality, contextual coherence, and regulatory appropriateness. Mean ratings ranged from 4.18 to 4.35 across the four criteria (overall category range: 4.05-4.52), with substantial inter-rater agreement (Cohen’s kappa = 0.78). These findings suggest that agentic AI architectures integrating authoritative regulatory sources with real-time scientific literature represent a promising direction for scalable, accurate, and verifiable health education delivery, warranting further evaluation through longitudinal user studies.
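摘要报告两名独立评审的一致性为 Cohen's kappa = 0.78。该统计量的标准计算如下(示例评分为假想数据,仅演示公式):

```python
def cohens_kappa(r1, r2):
    """两名评审(名义类别)之间的 Cohen's kappa。"""
    assert len(r1) == len(r2)
    n = len(r1)
    cats = set(r1) | set(r2)
    po = sum(a == b for a, b in zip(r1, r2)) / n                    # 观测一致率
    pe = sum((r1.count(c) / n) * (r2.count(c) / n) for c in cats)   # 机会一致率
    return (po - pe) / (1 - pe)
```

kappa 对机会一致做了修正:两人各自随机打分也会产生一定的表面一致率 pe,kappa 度量的是超出这一水平的部分。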
[NLP-26] ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning ICML2026
【速读】: 该论文旨在解决强化学习中基于可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)方法在提升大语言模型(Large Language Models, LLMs)推理能力时,因过度激励正向奖励而导致生成多样性受限的问题。现有方法如负样本强化(Negative Sample Reinforcement, NSR)虽通过加权负样本惩罚缓解此问题,但可能抑制正负响应间共享的语义分布,从而损害模型表达能力。解决方案的关键在于提出负样本投影残差强化学习(negative sample projection Residual Reinforcement Learning, ResRL),其核心创新是将负样本token的隐藏表示投影到由奇异值分解(SVD)构建的低秩正样本子空间,并利用投影残差调节负梯度,从而解耦正负响应间的相似语义分布,实现推理性能提升与生成多样性保持的平衡。理论层面,作者将“懒惰似然位移”(Lazy Likelihood Displacement, LLD)与正负头梯度干扰关联,并推导出一个单前向代理上界以指导保守的优势重加权策略。
链接: https://arxiv.org/abs/2605.00380
作者: Zihan Lin,Xiaohan Wang,Jie Cao,Jiajun Chai,Li Wang,Xiaodong Lu,Wei Lin,Ran He,Guojun Yin
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Accepted to ICML 2026. Preprint version
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) enhances reasoning of Large Language Models (LLMs) but usually exhibits limited generation diversity due to the over-incentivization of positive rewards. Although methods like Negative Sample Reinforcement (NSR) mitigate this issue by upweighting penalty from negative samples, they may suppress the semantic distributions shared between positive and negative responses. To boost reasoning ability without losing diversity, this paper proposes negative sample projection Residual Reinforcement Learning (ResRL) that decouples similar semantic distributions among positive and negative responses. We theoretically link Lazy Likelihood Displacement (LLD) to negative-positive head-gradient interference and derive a single-forward proxy that upper-bounds representation alignment to guide conservative advantage reweighting. ResRL then projects negative-token hidden representations onto an SVD-based low-rank positive subspace and uses projection residuals to modulate negative gradients, improving reasoning while preserving diversity and outperforming strong baselines on average across twelve benchmarks spanning Mathematics, Code, Agent Tasks, and Function Calling. Notably, ResRL surpasses NSR on mathematical reasoning by 9.4% in Avg@16 and 7.0% in Pass@128. Code is available at this https URL.
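ResRL 的核心几何操作是:把负样本 token 的隐表示投影到由正样本隐表示 SVD 得到的低秩子空间,并取投影残差。下面是这一操作的示意实现(rank 为假设参数;残差如何进一步调制负梯度此处不展开):

```python
import numpy as np

def projection_residual(H_pos, h_neg, rank=2):
    """将负样本隐表示投影到正样本低秩子空间,返回投影残差。

    H_pos: (n_pos, d) 正样本隐表示;h_neg: (d,) 负样本隐表示。
    残差越大,说明该负 token 与正样本共享的语义越少。
    """
    _, _, Vt = np.linalg.svd(H_pos, full_matrices=False)
    basis = Vt[:rank]                     # (rank, d) 行正交基,张成正样本子空间
    proj = basis.T @ (basis @ h_neg)      # h_neg 在子空间上的投影
    return h_neg - proj                   # 投影残差(与子空间正交)
```

落在正样本子空间内的向量残差为零(共享语义被"解耦"掉),与子空间正交的向量残差即其自身。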
[NLP-27] Language-free Experience at Expo 2025 Osaka
【速读】: 该论文旨在解决世博会期间多语言交流障碍问题,以实现无障碍沟通体验。其核心解决方案是通过开发高质量、低延迟的同步口译系统,关键技术包括基于分块的输入分割(chunk-based input segmentation)、上下文感知翻译(context-aware translation)以及多引擎机器翻译(multi-engine machine translation)技术,从而在实际部署中提升翻译准确性与实时性。
链接: https://arxiv.org/abs/2605.00373
作者: Michael Paul,Kenji Imamura,Xiaolin Wang,Shohei Higashiyama,Masao Utiyama
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:In line with the Global Communication Plan 2025, we have pursued the development of multilingual translation technologies to realize a language-barrier-free experience at Expo 2025 Osaka. Our work includes the advancement of simultaneous interpretation systems emphasizing high translation quality and low latency. Key achievements include chunk-based input segmentation, context-aware translation, and multi-engine machine translation technologies. Through demonstration deployments and collaboration with private companies, our technologies have led to real-world applications, with several services and systems showcased at Expo 2025 Osaka.
[NLP-28] Uniform-Correct Policy Optimization: Breaking RLVRs Indifference to Diversity
【速读】: 该论文旨在解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)在数学推理任务中常见的多样性崩溃问题,即虽然单次采样准确率(Pass@1)较高,但多样本覆盖率(Pass@K)显著下降,导致模型倾向于重复输出少数正确解而忽略其他有效解。其关键在于识别出传统RLVR目标函数(如GRPO)对正确解之间的概率分布无偏好,结合随机训练动态会引发自我强化的坍缩机制;为此,作者提出统一正确策略优化(Uniform-Correct Policy Optimization, UCPO),通过在策略分布上施加条件均匀性惩罚项,引导梯度信号向低频正确响应转移,从而实现正确解集内的均匀概率分配,显著提升多样性与多样本覆盖率,同时保持高单次准确率。
链接: https://arxiv.org/abs/2605.00365
作者: Anamika Lochab,Bolian Li,Ruqi Zhang
机构: Purdue University (普渡大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has achieved substantial gains in single-attempt accuracy (Pass@1) on reasoning tasks, yet often suffers from reduced multi-sample coverage (Pass@K), indicating diversity collapse. We identify a structural cause for this degradation: common RLVR objectives, such as GRPO, are indifferent to how probability mass is distributed among correct solutions. Combined with stochastic training dynamics, this indifference induces a self-reinforcing collapse, in which probability mass concentrates on a narrow subset of correct outputs while alternative valid solutions are suppressed. We formalize this collapse mechanism and further characterize the optimal policy structure under two complementary criteria: robustness and entropy-regularized optimality, which identify the Uniform-Correct Policy as uniquely optimal. Motivated by this analysis, we propose Uniform-Correct Policy Optimization (UCPO), a modification to GRPO that adds a conditional uniformity penalty on the policy’s distribution over correct solutions. The penalty redistributes gradient signal toward underrepresented correct responses, encouraging uniform allocation of probability mass within the correct set. Across three models (1.5B-7B parameters) and five mathematical reasoning benchmarks, UCPO improves Pass@K and diversity while maintaining competitive Pass@1, achieving up to +10% absolute improvement on AIME24 at Pass@64 and up to 45% higher equation-level diversity within the correct set. The code is available at this https URL.
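UCPO 在正确解集合上施加条件均匀性惩罚。一个自然的示意写法是:先把概率质量限制在正确集合内做条件归一化,再以与均匀分布的 KL 散度作为惩罚项。具体惩罚形式为本文假设(论文实现可能不同),且这里假设各项概率严格为正:

```python
import numpy as np

def uniformity_penalty(p_correct):
    """正确解集合上条件分布的均匀性惩罚(示意,用 KL 散度度量)。

    p_correct: 模型分配给各正确解的(可未归一化的)正概率。
    分布越集中惩罚越大;条件分布恰为均匀时惩罚为 0。
    """
    q = np.asarray(p_correct, dtype=float)
    q = q / q.sum()                          # 条件化:限制在正确解集合内
    u = np.full_like(q, 1.0 / len(q))        # 均匀目标(Uniform-Correct Policy)
    return float(np.sum(q * np.log(q / u)))  # KL(q || uniform) >= 0
```

惩罚的梯度会把概率质量从高频正确解推向低频正确解,对应摘要中"redistributes gradient signal toward underrepresented correct responses"。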
[NLP-29] Unlearning What Matters: Token-Level Attribution for Precise Language Model Unlearning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中现有机器遗忘(Machine Unlearning)方法在知识删除时效率低下、噪声干扰严重及性能退化的问题。当前方法通常在序列层面统一更新所有token,而未区分哪些token真正编码了需删除的知识,导致梯度噪声增加、遗忘效果不佳且模型通用能力受损。其解决方案的关键在于提出TokenUnlearn框架,通过融合知识感知信号(基于掩码的注意力变化)与熵感知信号(token不确定性度量),生成精确的token重要性评分,并据此实施两种互补策略:硬选择(仅对高重要性token进行遗忘更新)和软加权(按重要性调节梯度贡献)。理论分析表明,该token级选择机制可提升梯度信噪比,实验验证其在TOFU和WMDP基准上优于传统序列级方法,在遗忘有效性与模型效用保持之间取得更好平衡。
链接: https://arxiv.org/abs/2605.00364
作者: Jiawei Wu,DouDou Zhou
机构: 未知
类目: Computation and Language (cs.CL)
备注: 17 pages, 2 figures
Abstract:Machine unlearning has emerged as a critical capability for addressing privacy, safety, and regulatory concerns in large language models (LLMs). Existing methods operate at the sequence level, applying uniform updates across all tokens despite only a subset encoding the knowledge targeted for removal. This introduces gradient noise, degrades utility, and leads to suboptimal forgetting. We propose TokenUnlearn, a token-level attribution framework that identifies and selectively targets critical tokens. Our approach combines knowledge-aware signals via masking, and entropy-aware signals to yield importance scores for precise token selection. We develop two complementary strategies: hard selection, applying unlearning only to high-importance tokens, and soft weighting, modulating gradient contributions based on importance scores. Both extend existing methods to token-level variants. Theoretical analysis shows token-level selection improves gradient signal-to-noise ratio. Experiments on TOFU and WMDP benchmarks across three model architectures demonstrate consistent improvements over sequence-level baselines in both forgetting effectiveness and utility preservation.
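TokenUnlearn 的熵感知信号与硬选择策略可示意如下:由每个位置的 logits 计算预测分布的熵作为分数,再只保留分数最高的 k 个 token。此处仅实现熵信号;论文还会融合基于掩码的知识感知信号,融合方式与权重属于论文细节,未在此重现:

```python
import numpy as np

def entropy_importance(logits):
    """由每个 token 位置的 logits 计算预测分布的熵(熵感知信号的示意)。

    logits: (seq_len, vocab)。返回每个位置的熵,单位为 nats。
    """
    z = logits - logits.max(axis=-1, keepdims=True)        # 数值稳定的 softmax
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def hard_select(scores, k):
    """硬选择:只保留重要性分数最高的 k 个 token 位置。"""
    return np.argsort(scores)[::-1][:k]
```

均匀的 logits 给出最大熵 ln(vocab),尖锐的 logits 熵接近零;硬选择据此筛出参与遗忘更新的位置。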
[NLP-30] From Backward Spreading to Forward Replay: Revisiting Target Construction in LLM Parameter Editing ICML2026
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)参数编辑中广泛采用的“向后传播”(backward spreading)机制缺乏系统理论基础的问题,该机制通常通过在目标层(称为锚点层)计算理想隐藏状态,并将其分配给多个前置层以实现协同编辑。然而,其有效性边界、实际限制及潜在失败模式尚未被充分理解。论文的关键解决方案是用“前向传播”(forward propagation)替代传统的向后传播策略:不再优化最后一个编辑层的目标隐藏状态,而是首先在首个编辑层优化锚点,随后向前传播以获得所有后续编辑层精确且相互兼容的目标隐藏状态。该方法在保持与现有方法相同计算复杂度的前提下,显著提升了逐层目标的准确性,且不干扰初始目标隐藏状态的计算或其他编辑流程组件,因而具有良好的通用性和实用性。
链接: https://arxiv.org/abs/2605.00358
作者: Wei Liu,Hongkai Liu,Zhiying Deng,Yee Whye Teh,Wee Sun Lee
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: ICML 2026, code: this https URL
Abstract:LLM parameter editing methods commonly rely on computing an ideal target hidden-state at a target layer (referred to as the anchor point) and distributing the target vector to multiple preceding layers (commonly known as backward spreading) for cooperative editing. Although widely used for a long time, its underlying basis has not been systematically investigated. In this paper, we first conduct a systematic study of its foundations, which helps clarify its capability boundaries, practical considerations, and potential failure modes. Then, we propose a simple and elegant alternative that replaces backward spreading with forward-propagation. Instead of optimizing the target at the last editing layer, we optimize the anchor point at the first editing layer, and then propagate it forward to obtain accurate and mutually compatible target hidden-states for all subsequent editing layers. This approach achieves the same computational complexity as existing methods while producing more accurate layer-wise targets. Our method is simple, without interfering with either the computation of the initial target hidden state or any other components of the subsequent editing pipeline, and thus constituting a benefit for a wide range of LLM parameter editing methods.
[NLP-31] MemRouter: Memory-as-Embedding Routing for Long-Term Conversational Agents
【速读】: 该论文旨在解决长期对话代理在外部记忆管理中的效率与效果问题,即如何高效决定哪些对话轮次应被存储至外部记忆中。传统方法依赖于每轮对话均通过自回归大语言模型(LLM)生成来做出存储决策,导致计算开销高且决策质量受限。解决方案的关键在于提出 MemRouter——一个写端记忆路由器,它将记忆准入决策从下游答案生成模块中解耦,并用基于嵌入的路由策略替代逐轮的内存管理解码过程:通过冻结的 LLM 骨干网络编码当前轮次及近期上下文,再利用轻量级分类头预测是否存储该轮次,仅训练 12M 参数。实验证明,在保持检索管道、问答提示和 Qwen2.5-7B 答案骨干一致的前提下,MemRouter 在 LoCoMo 数据集上整体 F1 提升显著(52.0 vs 45.6),同时将记忆管理延迟从 970ms 降至 58ms,表明小规模监督式路由器可有效学习写端记忆准入策略,而答案生成仍可作为独立下游任务处理。
链接: https://arxiv.org/abs/2605.00356
作者: Tianyu Hu,Weikai Lin,Weizhi Zhang,Jing Ma,Song Wang
机构: University of Central Florida (中佛罗里达大学); University of Rochester (罗切斯特大学); University of Illinois at Chicago (伊利诺伊大学芝加哥分校); Case Western Reserve University (凯斯西储大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Long-term conversational agents must decide which turns to store in external memory, yet recent systems rely on autoregressive LLM generation at every turn to make that decision. We present MemRouter, a write-side memory router that decouples memory admission from the downstream answer backbone and replaces per-turn memory-management decoding with an embedding-based routing policy. MemRouter encodes each turn together with recent context, projects the resulting embeddings through a frozen LLM backbone, and predicts whether the turn should be stored using lightweight classification heads while training only 12M parameters. Under a controlled matched-harness comparison on LoCoMo, where the retrieval pipeline, answer prompts, and QA backbone (Qwen2.5-7B) are held identical, MemRouter outperforms an LLM-based memory manager on every question category (overall F1 52.0 vs 45.6, non-overlapping 95% CIs) while reducing memory-management p50 latency from 970ms to 58ms. Descriptive factorial averaging further shows that learned admission improves mean F1 by +10.3 over random storage, category-specific prompting adds +5.2 over a generic prompt, and retrieval contributes +0.7. These results suggest that write-side memory admission can be learned by a small supervised router, while answer generation remains a separate downstream component in long-horizon conversational QA.
[NLP-32] Block-wise Codeword Embedding for Reliable Multi-bit Text Watermarking
【速读】: 该论文旨在解决现有大语言模型(Large Language Models, LLMs)多比特水印方法中容量与可靠性之间的失衡问题,特别是由解码中心设计导致的高假阳性率(False Positive Rate, FPR)缺陷。此前基于纠错码(Error Correction Code, ECC)的提取器在面对局部文本修改时易产生灾难性误检,且引入拒绝阈值仅会将检测灵敏度(True Positive Rate, TPR)降至随机猜测水平,无法实现可靠验证。解决方案的关键在于提出一种名为BREW(Block-wise Reliable Embedding for Watermarking)的新框架,其核心创新是转向“指定验证”范式,采用两阶段机制:首先通过独立块投票实现盲消息估计,再利用窗口滑动验证严格校验载荷对局部编辑的鲁棒性。实验表明,BREW在10%同义词替换下仍可实现TPR=0.965、FPR=0.02,证明高FPR并非多比特水印固有权衡,而是原有解码导向设计的结构性缺陷,而BREW提供了模型无关且理论严谨的可扩展可靠水印方案。
链接: https://arxiv.org/abs/2605.00348
作者: Joeun Kim,HoEun Kim,Dongsup Jin,Young-Sik Kim
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:
Abstract:Recent multi-bit watermarking methods for large language models (LLMs) prioritize capacity over reliability, often conflating decoding with detection. Our analysis reveals that existing ECC-based extractors suffer from catastrophic false positive rates (FPR), and applying rejection thresholds merely collapses detection sensitivity (TPR) to random guessing. To resolve this structural limitation, we propose BREW (Block-wise Reliable Embedding for Watermarking), a framework shifting the paradigm to designated verification. BREW employs a two-stage mechanism: (i) blind message estimation via independent block voting, followed by (ii) window-shifting verification that rigorously validates the payload against local edits. Experiments demonstrate that BREW achieves a TPR of 0.965 with an FPR of 0.02 under 10% synonym substitution, demonstrating that the high-FPR issue is not an inherent trade-off of multi-bit watermarking, but a solvable structural flaw of prior decoding-centric designs. Our framework is model-agnostic and theoretically grounded, providing a scalable solution for reliable forensic deployment.
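BREW 第一阶段的块级独立投票可以还原为一个非常简单的多数判决过程(观测数据为假想值,仅示意投票逻辑;第二阶段的滑窗校验不在此展开):

```python
def block_vote_decode(block_bits):
    """块级独立投票的盲消息估计示意。

    block_bits: 每个消息位对应的若干块观测,例如
    [[1, 1, 0], [0, 0, 1]] 表示第 1 位的三次观测与第 2 位的三次观测。
    对每一位做多数投票,得到估计消息。
    """
    return [1 if sum(obs) * 2 > len(obs) else 0 for obs in block_bits]
```

局部文本编辑只会翻转个别块的观测,只要多数块未受影响,多数投票即可恢复正确的载荷位。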
[NLP-33] Odysseus: Scaling VLMs to 100 Turn Decision-Making in Games via Reinforcement Learning
【速读】: 该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在长时程交互决策任务中训练不稳定、样本效率低的问题,特别是在需要100余步连续动作协调的《超级马里奥大陆》(Super Mario Land)环境中。其关键解决方案包括:首先,提出一种轻量级回合级评论家(turn-level critic)改进的PPO算法,显著提升了训练稳定性与样本效率;其次,利用预训练VLM提供的强动作先验(action priors),大幅减少对人工设计动作空间或策略工程的依赖,从而提升强化学习(Reinforcement Learning, RL)阶段的样本效率;最终构建名为Odysseus的开放训练框架,在多个游戏关卡上实现平均进度至少提升3倍,并展现出良好的游戏中及跨游戏泛化能力,同时保持通用领域性能。
链接: https://arxiv.org/abs/2605.00347
作者: Chengshuai Shi,Wenzhe Li,Xinran Liang,Yizhou Lu,Wenjia Yang,Ruirong Feng,Seth Karten,Ziran Yang,Zihan Ding,Gabriel Sarch,Danqi Chen,Karthik Narasimhan,Chi Jin
机构: Princeton Language and Intelligence, Princeton University; Fudan University; Tsinghua University
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Given the rapidly growing capabilities of vision-language models (VLMs), extending them to interactive decision-making tasks such as video games has emerged as a promising frontier. However, existing approaches either rely on large-scale supervised fine-tuning (SFT) on human trajectories or apply reinforcement learning (RL) only in relatively short-horizon settings (typically around 20–30 turns). In this work, we study RL-based training of VLMs for long-horizon decision-making in Super Mario Land, a visually grounded environment requiring 100+ turns of interaction with coordinated perception, reasoning, and action. We begin with a systematic investigation of key algorithmic components and propose an adapted variant of PPO with a lightweight turn-level critic, which substantially improves training stability and sample efficiency over critic-free methods such as GRPO and Reinforce++. We further show that pretrained VLMs provide strong action priors, significantly improving sample efficiency during RL training and reducing the need for manual design choices such as action engineering, compared to classical deep RL trained from scratch. Building on these insights, we introduce Odysseus, an open training framework for VLM agents, achieving substantial gains across multiple levels of the game and at least 3 times average game progresses than frontier models. Moreover, the trained models exhibit consistent improvements under both in-game and cross-game generalization settings, while maintaining general-domain capabilities. Overall, our results identify key ingredients for making RL stable and effective in long-horizon, multi-modal settings, and provide practical guidance for developing VLMs as embodied agents.
[NLP-34] Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding
【速读】: 该论文旨在解决树形推测解码(tree-based speculative decoding)在稀疏的混合专家模型(Mixture-of-Experts, MoE)中效率下降的问题。由于不同分支激活不同的专家,随着草稿树增长,目标侧验证阶段需处理的专家集合显著扩大,导致验证成本急剧上升。解决方案的关键在于提出EVICT方法——一种无需训练、无超参数调整且无损的自适应验证机制:通过在目标验证前截断草稿树,仅保留成本效益最高的前缀,从而减少无效专家调用;其核心创新在于利用细粒度草稿信号估算候选收益,并结合离线预估的验证成本进行动态决策,同时与高性能图式服务框架SGLang高度兼容,实现在多种MoE模型和基准上的显著加速。
链接: https://arxiv.org/abs/2605.00342
作者: Lehan Pan,Ziyang Tao,Ruoyu Pang,Xiao Wang,Jianjun Zhao,Yanyong Zhang
机构: University of Science and Technology of China(中国科学技术大学); Tianyijiaotong Technology Ltd., Suzhou, China(天翼交通科技有限公司,苏州)
类目: Computation and Language (cs.CL)
备注:
Abstract:Tree-based speculative decoding accelerates autoregressive generation by verifying multiple draft candidates in parallel, but this advantage weakens for sparse Mixture-of-Experts (MoE) models. As the draft tree grows, different branches activate different experts, expanding the union of activated experts and substantially increasing target-side verification cost. We propose EVICT, a training-free, hyperparameter-free, and lossless adaptive verification method for MoE speculative decoding. EVICT makes every verified token count by truncating the draft tree before target verification and retaining only the cost-effective prefix. It leverages fine-grained drafter signals to estimate candidate benefit, combines them with offline-profiled verification cost, and remains highly compatible with the high-performance graph-based serving framework SGLang. Extensive experiments on diverse MoE backbones and benchmarks show that EVICT achieves up to 2.35x speedup over autoregressive decoding and an average 1.21x speedup over the state-of-the-art baseline EAGLE-3, while significantly reducing unnecessary expert activations during verification.
[NLP-35] Budget-Aware Routing for Long Clinical Text
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在临床场景中因输入文本冗长、异构且重复而导致的token成本高、部署开销大的问题。针对这一挑战,作者提出了一种预算约束下的上下文选择方法(budgeted context selection),即在严格token预算限制下从文档中挑选子集,以满足固定成本与延迟要求,并使现成生成器能高效运行。其核心解决方案是将该问题建模为一个带背包约束的子集选择问题,包含两个关键设计维度:单元化(unitization,定义文档分割方式)和选择策略(selection,决定保留哪些单元)。论文提出了一种单调子模目标函数RCD,用于平衡相关性、覆盖度与多样性;实验表明,选择策略比单元化方式更重要,且不同评估指标(如ROUGE与BERTScore)对性能差异的敏感性存在差异,其中基于聚类的单元化反而降低效果,而位置启发式在低预算提取任务中表现最优,多样性感知方法(如MMR)则更利于LLM生成质量提升。
链接: https://arxiv.org/abs/2605.00336
作者: Khizar Qureshi,Geoffrey Martin,Yifan Peng
机构: MIT(麻省理工学院); Cornell University (康奈尔大学); Weill Cornell Medicine (威尔康奈尔医学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:A key challenge for large language models is token cost per query and overall deployment cost. Clinical inputs are long, heterogeneous, and often redundant, while downstream tasks are short and high stakes. We study budgeted context selection, where a subset of document units is chosen under a strict token budget so an off-the-shelf generator can meet fixed cost and latency constraints. We cast this as a knapsack-constrained subset selection problem with two design choices, unitization that defines document segmentation and selection that determines which units are kept. We propose RCD, a monotone submodular objective that balances relevance, coverage, and diversity. We compare sentence, section, window, and cluster-based unitization, and introduce a routing heuristic that adapts to the budget regime. Experiments on MIMIC discharge notes, Cochrane abstracts, and L-Eval show that optimal strategies depend on the evaluation setting. Positional heuristics perform best at low budgets in extractive tasks, while diversity-aware methods such as MMR improve LLM generation. Selector choice matters more than unitization, with cluster-based grouping reducing performance and other schemes behaving similarly. ROUGE saturates for LLM summaries, while BERTScore better reflects quality differences. We release our code at this https URL.
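论文将上下文选择建模为背包约束下的子集选择。下面用一个简化的贪心流程演示"token 预算 + 相关性/覆盖/多样性"的选取逻辑;其中的目标函数只是一个示意性代理(新主题给予固定覆盖增益),并非论文 RCD 目标的原始定义:

```python
# 假设性示意:预算约束下按"边际增益 / token 成本"贪心选取文档单元。
def greedy_select(units, budget):
    """units: [(tokens, relevance, topic)] 列表;返回选中的单元下标与已用 token 数。"""
    chosen, covered, used = [], set(), 0
    remaining = list(range(len(units)))
    while remaining:
        def gain(i):
            tok, rel, topic = units[i]
            diversity = 0.5 if topic not in covered else 0.0  # 新主题带来覆盖增益
            return (rel + diversity) / tok                     # 单位 token 的边际收益
        remaining.sort(key=gain, reverse=True)
        i = remaining[0]
        tok, rel, topic = units[i]
        if used + tok <= budget:
            chosen.append(i)
            covered.add(topic)
            used += tok
        remaining.pop(0)
    return chosen, used

# 两条 labs 单元高度冗余:预算内只会选入一条,转而覆盖新主题 meds
chosen, used = greedy_select(
    [(50, 0.9, "labs"), (50, 0.85, "labs"), (40, 0.6, "meds"), (80, 0.95, "history")],
    budget=100,
)
```

对单调子模目标,这类贪心选择带有经典的 (1 - 1/e) 近似保证,这也是论文采用子模形式化的动机之一。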
[NLP-36] AgentFloor: How Far Up the Tool-Use Ladder Can Small Open-Weight Models Go?
【速读】: 该论文旨在解决生成式 AI (Generative AI) 代理系统(agentic systems)中模型调度与资源配置的效率问题,即在大量用户请求下,如何识别哪些任务真正需要前沿大模型的能力,而哪些可以由较小模型完成。其核心解决方案是提出一个名为 AgentFloor 的确定性基准测试集,该基准包含30个任务,构成六层能力阶梯,覆盖从指令遵循到长期规划等不同复杂度的任务类型,并对16个开源模型(参数规模从0.27B到32B)和GPT-5进行大规模评估(共16,542次评分运行)。关键发现表明:小中型开源模型已足以胜任代理流水线中占主导地位的短期、结构化工具调用任务;而在需要持续协调与约束跟踪的长周期规划任务上,前沿模型仍具优势,但两者均未达到高可靠性。因此,论文建议采用分层设计策略——将小型开源模型用于常规操作,仅在真正需要深度规划与控制的任务中使用大型前沿模型,从而实现性能与成本之间的最优平衡。
链接: https://arxiv.org/abs/2605.00334
作者: Ranit Karmakar,Jayita Chatterjee
机构: Harvard University (哈佛大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Production agentic systems make many model calls per user request, and most of those calls are short, structured, and routine. This raises a practical routing question that existing evaluations do not directly answer: which parts of an agent workflow truly require large frontier intelligence, and which can be handled by smaller models? We introduce AgentFloor, a deterministic 30-task benchmark organized as a six-tier capability ladder, spanning instruction following, tool use, multi-step coordination, and long-horizon planning under persistent constraints. We evaluate 16 open-weight models, from 0.27B to 32B parameters, alongside GPT-5 across 16,542 scored runs. Our results reveal a clear boundary of model necessity. Small and mid-sized open-weight models are already sufficient for much of the short-horizon, structured tool use work that dominates real agent pipelines, and in aggregate, the strongest open-weight model matches GPT-5 on our benchmark while being substantially cheaper and faster to run. The gap appears most clearly on long-horizon planning tasks that require sustained coordination and reliable constraint tracking over many steps, where frontier models still hold an advantage, though neither side reaches strong reliability. We also find that this boundary is not explained by scale alone: some failures respond to targeted interventions, but the effects are model-specific rather than universal. These findings suggest a practical design principle for agentic systems: use smaller open-weight models for the broad base of routine actions, and reserve large frontier models for the narrower class of tasks that truly demand deeper planning and control. We release the benchmark, harness, sweep configurations, and full run corpus.
[NLP-37] Borrowed Geometry: Computational Reuse of Frozen Text-Pretrained Transformer Weights Across Modalities
【速读】: 该论文旨在解决大模型在跨模态任务中如何利用预训练权重实现高效迁移的问题,尤其关注冻结(frozen)模型参数在未见过的下游任务中仍能表现优异的机制。其核心解决方案是:通过一个轻量级可训练接口(thin trainable interface)将冻结的Gemma 4 31B文本预训练模型适配到不同模态任务中,无需修改原始权重即可实现显著性能提升。关键创新在于验证了“架构本身携带可迁移知识”的假设——例如,在OGBench机器人操作任务上超越已有SOTA(+4.33点),以及在D4RL Walker2d任务中以仅0.43倍决策Transformer(Decision-Transformer)的可训练参数量达到相当性能,并且压缩后的5层切片比6层基线高1.66点;更重要的是,通过关联记忆任务证明,仅用113K参数的线性接口即可达到最优性能,而同等容量从头训练的Transformer完全无法完成该任务(误差高出8.7倍),从而说明冻结模型中的结构化表示具有强泛化能力,且特定注意力头(如L26.28)对语言复制和任务关键性均具决定作用。
链接: https://arxiv.org/abs/2605.00333
作者: Abay Bektursun
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 29 pages, 11 figures. Independent research
Abstract:Frozen Gemma 4 31B weights pretrained exclusively on text tokens, unmodified, transfer across modality boundaries through a thin trainable interface. (1) OGBench scene-play-singletask-task1-v0: +4.33 pt over published GCIQL at n=3 with std 0.74 – a published-SOTA win on a robotic manipulation task the substrate has never seen. (2) D4RL Walker2d-medium-v2: Decision-Transformer parity (76.2 ± 0.8, n=3) at 0.43× DT’s trainable count, with the frozen substrate compressing to a 5L slice (+1.66 pt over the 6L baseline at n=3). (3) Associative recall as the cleanest pretraining-load-bearing case: the frozen slice + a 113K-parameter linear interface reaches L30 best-checkpoint per-bit error 0.0505 (n=2); a 6.36M-parameter from-scratch trained transformer at matched capacity (1/√d_k scaling, two seeds, LR sweep) cannot solve the task at all under the protocol (best L30 = 0.4395), an 8.7× advantage. Architecture-alone falsifications: a frozen random transformer with correct 1/√d_k scaling stays at random-chance loss for 50k steps; a random-init Gemma slice fails OGBench cube-double-play-task1 entirely (0.89% across n=3 where pretrained reaches 60%). A dual-measurement protocol – text-activation probing on 95 English sentences plus task-ablation on a non-language target – names individual heads independently identifiable on both protocols: head L26.28 scores 3.7× the slice mean for English token-copying and is the #2 most-critical head for binary copy ablation (ΔL30 = +0.221); three further heads (L27.28, L27.2, L27.3) classify by the same protocol. The mechanism is single-model and the cross-modality results are single-task within their respective benchmarks; cross-model replication is structurally constrained because Gemma 4 31B is the only model on the small-scale Pareto frontier as of April 2026.
[NLP-38] Prompt-Induced Score Variance in Zero-Shot Binary Vision-Language Safety Classification
【速读】: 该论文旨在解决零样本视觉语言模型(VLM)安全分类器中,基于单提示首词概率作为决策分数时的不可靠性问题。研究发现,即使在二分类标签固定输出位置的情况下,语义等价的提示改写仍会导致对同一样本产生显著不同的不安全概率,表明此类分数具有高度提示敏感性(prompt-level fragility)。解决方案的关键在于采用训练-free的提示均值集成(mean ensemble),通过聚合多个提示下的首词概率来提升可靠性:该方法在14个数据集-模型组合上均优于单提示基线的负对数似然(NLL),并在12/14个场景中改善了校准误差(ECE),且在AUPRC和AUROC指标上持续优于温度缩放、Platt缩放和等距回归等传统校准方法。该策略被建议作为无需标签的可靠性基准,而非替代后续标签驱动的校准步骤。
链接: https://arxiv.org/abs/2605.00326
作者: Charles Weng,Dingwen Li,Alexander Martin
机构: Johns Hopkins University (约翰霍普金斯大学); Independent (独立)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. 19 pages, 5 figures
Abstract:Single-prompt first-token probabilities from zero-shot vision-language model (VLM) safety classifiers are treated as decision scores, but we show they are unreliable under semantically equivalent prompt reformulation: even when the binary label is constrained to a fixed output position, equivalent prompts can induce materially different unsafe probabilities for the same sample. Across multimodal safety benchmarks and multiple VLM families, cross-prompt variance is strongly associated with prompt-level disagreement and higher error, making it a useful fragility diagnostic. A training-free mean ensemble improves NLL on all 14 dataset-model evaluation pairs and ECE on 12/14 relative to a train-selected single-prompt baseline, and wins more head-to-head NLL comparisons than labeled temperature scaling, Platt scaling, and isotonic regression applied to the same prompt. Ranking gains are consistent against the train-selected baseline on both AUROC and AUPRC, and against the full 15-prompt distribution remain consistent on AUPRC while softening on AUROC. Labeled calibration on top of the mean provides further gains when labels are available, identifying prompt averaging as a strong label-free first stage rather than a replacement for calibration. We frame this as a reliability stress test for zero-shot VLM first-token safety scores and recommend prompt-family evaluation with mean aggregation as a standard label-free reliability baseline.
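论文提出的免训练均值集成非常直接:对同一样本,把多条语义等价提示各自给出的首词不安全概率取平均,再用集成后的概率计算 NLL 等可靠性指标。下面是一个最小示意,概率数值均为虚构示例:

```python
# 假设性示意:多提示首词概率的均值集成与二分类负对数似然(NLL)。
import math

def mean_ensemble(per_prompt_probs):
    """per_prompt_probs: 各条等价提示给出的 P(unsafe);返回集成后的概率。"""
    return sum(per_prompt_probs) / len(per_prompt_probs)

def nll(probs, labels, eps=1e-12):
    """labels: 1 表示真实不安全;返回平均负对数似然,越小越好。"""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

# 同一不安全样本在语义等价提示下的概率差异很大(提示脆弱性)
sample_probs = [0.95, 0.40, 0.70]
p_ens = mean_ensemble(sample_probs)
score = nll([p_ens], [1])
```

对这个样本而言,集成概率的 NLL 低于最差单提示(0.40)的 NLL,这与论文"均值集成在全部 14 个数据集-模型组合上改进 NLL"的结论方向一致。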
[NLP-39] What Don't You Understand? Using Large Language Models to Identify and Characterize Student Misconceptions About Challenging Topics
【速读】: 该论文旨在解决在线学习环境中学生概念性误解(misconceptions)难以系统识别与深入分析的问题。传统基于成绩的评估方法往往无法揭示错误背后的认知偏差,而本研究通过结合量化表现分析与生成式AI(Generative AI)技术,提出了一种两阶段的方法:首先利用测验层面的表现指标识别出教学重点中持续困难的主题;其次借助大语言模型(Large Language Model, LLM)对这些高优先级领域中的学生回答模式、题目内容及讲义文本进行多源数据融合分析,从而精准刻画隐藏的认知误区。其关键创新在于将LLM作为认知诊断工具,有效挖掘出仅凭成绩数据无法发现的深层次误解,并经由专家评审和教师访谈验证了该方法在实际教学场景中的实用价值与可扩展性。
链接: https://arxiv.org/abs/2605.00294
作者: Michael J. Parker,Maria G. Zavala-Cerna
机构: 未知
类目: Computation and Language (cs.CL)
备注: 60 pages. Education and Information Technologies (2026)
Abstract:This study presents a systematic approach to identifying and characterizing student misconceptions in online learning environments through a novel combination of quantitative performance analysis and large language model (LLM) assessment. We analyzed data from 9 course periods across 5 online biomedical science courses, encompassing 3,802 medical student enrollments. Using data from 40-50 topic-focused quizzes per course, we developed a two-stage methodology. First, we identified challenging central topics using quiz-level performance metrics. Second, we employed LLMs to characterize the underlying misconceptions in these high-priority areas. By examining student performance on first attempts across primarily multiple-choice questions (MCQs), we identified consistently challenging topics that were also central to course objectives. We then leveraged recent advances in generative AI to analyze three distinct data sources in combination: quiz question content, student response patterns, and lecture transcripts. This approach revealed actionable insights about student misconceptions that were not apparent from performance data alone. The quality of the LLM-identified misconceptions was rated as excellent by subject matter experts. We also conducted teacher interviews to assess the perceived utility of our topic identification method. Faculty found that data-driven identification of challenging topics was valuable and corroborated their own classroom observations. This methodology provides a scalable approach to characterizing student difficulties in learning environments where quizzes are used. Our findings demonstrate the potential for targeted and potentially personalized interventions in future course iterations, with clear pathways for measuring intervention effectiveness through follow-up quiz performance.
[NLP-40] How Language Models Process Out-of-Distribution Inputs: A Two-Pathway Framework
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)中开放集检测(Out-of-Distribution, OOD)方法存在的结构混淆问题,特别是由输入序列长度引起的虚假信号。研究发现,现有白盒OOD检测方法(如CED、RAUQ、WildGuard)及原始注意力熵等基线指标均与序列长度高度相关(|r| = 0.61),在长度匹配评估下性能退化至接近随机水平,根源在于注意力机制对输入长度具有Θ(log T)的依赖性。为此,作者提出一种双路径框架:嵌入路径(embedding pathway)捕捉文本语义内容(适用于主题偏移),处理轨迹路径(processing trajectory pathway)则通过隐藏状态在各层的演化来识别模型处理过程中的异常模式(适用于词汇相同但意图隐蔽的对抗输入)。关键创新在于将OOD信号解耦为内容感知与过程感知两个维度,并通过多任务交叉验证、逐层分析和电路归因实证支持该框架的有效性,尤其在隐蔽意图攻击(如Jailbreak)场景下表现优异(平均AUROC达0.850)。
链接: https://arxiv.org/abs/2605.00269
作者: Hamidreza Saghir
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 30 pages, 3 figures, 30+ tables. Submitted to COLM 2026
Abstract:Recent white-box OOD detection methods for LLMs – including CED, RAUQ, and WildGuard confidence scores – appear effective, but we show they are structurally confounded by sequence length (|r| = 0.61) and collapse to near-chance under length-matched evaluation. Even raw attention entropy (mean H(α) across heads and layers), a natural baseline we include for completeness, shows the same confound. The confound stems from attention’s Θ(log T) dependence on input length. To identify genuine OOD signals after deconfounding, we propose a two-pathway framework: embeddings capture what text is about (effective for topic shifts), while the processing trajectory – hidden-state evolution across layers – captures how the model processes input. The relative power of each pathway varies along a vocabulary-transparency spectrum: embedding methods excel on vocabulary-distinctive OOD, while trajectory features detect covert-intent inputs that share vocabulary with normal text (0.721 avg AUROC; Jailbreak: 0.850). Three evidence lines support this framework: (1) a crossover between k-NN and trajectory scoring across 6 tasks, where each pathway wins on different OOD types; (2) a per-layer analysis showing that layer-0 k-NN signal is almost entirely a length artifact (Jailbreak: 0.759 raw → 0.389 matched) – processing constructs genuine OOD signal from near-chance embeddings; and (3) circuit attribution showing adversarial tasks engage attention circuits more than semantic tasks (p = 0.022; Jailbreak patching p < 0.001), with partial cross-model replication. Code release upon publication.
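论文指出注意力熵对输入长度存在 Θ(log T) 依赖。这一点很容易直观验证:若注意力近似均匀分布在 T 个位置上,其香农熵恰为 log T,因此任何未做长度匹配的熵类分数都会被序列长度混淆。下面用纯 Python 演示:

```python
# 示意:注意力熵随序列长度按 log T 增长。
import math

def attention_entropy(weights):
    """计算一行注意力权重的香农熵(自然对数底)。"""
    return -sum(w * math.log(w) for w in weights if w > 0)

def uniform_attention(T):
    """长度为 T 的均匀注意力分布(熵的上界情形)。"""
    return [1.0 / T] * T

h_short = attention_entropy(uniform_attention(16))   # = log 16
h_long = attention_entropy(uniform_attention(256))   # = log 256
```

长序列的熵系统性更高,与 OOD 与否无关,这正是论文要求长度匹配评估(length-matched evaluation)的原因。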
[NLP-41] Lost in State Space: Probing Frozen Mamba Representations
【速读】: 该论文试图解决的问题是:是否可以通过在固定补丁边界(patch boundary)处提取 Mamba 模型的 token-level 输出 y_t,从而无需池化头、微调或 [CLS] 标记即可获得高质量的句子级表示(sentence representations)。其解决方案的关键在于设计并评估四种从预训练 Mamba-130M 主干网络中提取冻结特征的策略,并通过严格的冻结特征探测协议(frozen-feature probing protocol)验证该假设。研究发现,补丁边界读出(patch boundary readouts)并未 consistently 超越简单的均值池化(mean pooling),并识别出两个结构性缺陷:严重的各向异性(anisotropy)和原始最终状态(raw final SSM state)的表征崩溃(representational collapse),进而提出正交注入(orthogonal injection)作为改进递归机制的新方法以约束每步新增信息。
链接: https://arxiv.org/abs/2605.00253
作者: Bhagyashree Wagh,Akash Singh
机构: University of Washington (华盛顿大学); University of Florida (佛罗里达大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 8 pages, 2 figures
Abstract:Mamba’s recurrent state h_t is, by construction, a compressed summary of every token seen so far. This raises a tempting hypothesis: if we extract token-level outputs y_t at fixed patch boundaries, we obtain semantic sentence summaries for free, with no pooling head, no fine-tuning, and no [CLS] token. We test this hypothesis carefully. Across five benchmarks (SST-2, CoLA, MRPC, STS-B, IMDb), we compare four strategies for extracting frozen sentence representations from a pretrained Mamba-130M backbone under a strict frozen-feature probing protocol, using three random seeds where computationally feasible. The results do not support the hypothesis: patch boundary readouts do not consistently outperform simple mean pooling. We identify and quantify two structural pathologies: severe anisotropy (mean pairwise cosine similarity 0.9999, std 0.000044) and representational collapse in the raw final SSM state (MCC = 0.000 on CoLA across all three seeds, confirmed via confusion matrix). We further propose orthogonal injection, a modified recurrence that constrains the new information added at each step.
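论文报告的各向异性诊断指标是句向量的平均两两余弦相似度:若该值接近 1,说明所有表示挤在一个狭窄锥体内,线性探测难以区分。下面是该诊断的一个纯 Python 草图(向量为虚构的二维示例):

```python
# 示意:各向异性(anisotropy)诊断——平均两两余弦相似度。
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mean_pairwise_cosine(vecs):
    """对所有无序向量对求余弦相似度的均值。"""
    sims = [cosine(vecs[i], vecs[j])
            for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    return sum(sims) / len(sims)

# 三个几乎同向的向量:均值接近 1,对应论文中 0.9999 级别的严重各向异性
collapsed = mean_pairwise_cosine([[1.0, 0.01], [1.0, 0.02], [1.0, 0.03]])
# 方向分散的向量:均值明显更低
spread = mean_pairwise_cosine([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```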
[NLP-42] Alethia: A Foundational Encoder for Voice Deepfakes ICML2026
【速读】: 该论文旨在解决当前语音深度伪造检测与定位模型过度依赖语音基础模型(Speech Foundation Models, SFMs)特征表示,且微调阶段已接近性能瓶颈的问题。其解决方案的关键在于将研究重心从下游微调转向预训练,并提出一种新颖的预训练范式,结合瓶颈掩码嵌入预测(bottleneck masked embedding prediction)与基于流匹配(flow-matching)的频谱图重建方法,从而构建首个适用于多种语音深度伪造检测与定位任务的基础音频编码器 Alethia。该方案在5个不同任务和56个基准数据集上验证了显著优于现有SOTA SFMs的性能,尤其在真实世界扰动下的鲁棒性及对未见域(如歌唱类深度伪造)的零样本泛化能力方面表现突出,同时揭示了离散目标在掩码标记预测中的局限性,强调连续嵌入预测与生成式预训练对于捕捉深度伪造痕迹的重要性。
链接: https://arxiv.org/abs/2605.00251
作者: Yi Zhu,Brahmi Dwivedi,Jayaram Raghuram,Surya Koppisetti
机构: 未知
类目: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted to ICML 2026
Abstract:Existing voice deepfake detection and localization models rely heavily on representations extracted from speech foundation models (SFMs). However, downstream finetuning has now reached a state of diminishing returns. In this paper, we shift the focus to pretraining and propose a novel recipe that combines bottleneck masked embedding prediction with flow-matching based spectrogram reconstruction. The outcome, Alethia, is the first foundational audio encoder for various voice deepfake detection and localization tasks. We evaluate on 5 different tasks with 56 benchmark datasets, and note Alethia significantly outperforms state-of-the-art SFMs with superior robustness to real-world perturbations and zero-shot generalization to unseen domains (e.g., singing deepfakes). We also demonstrate the limitation of discrete targets in masked token prediction, and show the importance of continuous embedding prediction and generative pretraining for capturing deepfake artifacts.
[NLP-43] Estimating LLM Grading Ability and Response Difficulty in Automatic Short Answer Grading via Item Response Theory
【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的自动短答案评分(Automated Short Answer Grading, ASAG)评估中仅依赖整体指标(如宏F1和Cohen’s kappa)所导致的局限性问题,即这些指标无法揭示评分性能在不同难度学生作答上的分布差异。其解决方案的关键在于引入项目反应理论(Item Response Theory, IRT)构建评估框架,将评分正确性建模为评分者能力与作答难度的函数,从而实现响应级别的分析,识别LLM在何种难度下表现更优或劣,并发现从整体指标中难以察觉的鲁棒性差异。
链接: https://arxiv.org/abs/2605.00238
作者: Longwei Cong,Sonja Hahn,Sebastian Gombert,Leon Camus,Hendrik Drachsler,Ulf Kroehne
机构: DIPF | Leibniz Institute for Research and Information in Education; Faculty of Computer Science, Goethe University Frankfurt; Chemnitz University of Technology
类目: Computation and Language (cs.CL)
备注:
Abstract:Automated short answer grading (ASAG) with large language models (LLMs) is commonly evaluated with aggregate metrics such as macro-F1 and Cohen’s kappa. However, these metrics provide limited insight into how grading performance varies across student responses of differing grading difficulty. We introduce an evaluation framework for LLM-based ASAG based on item response theory (IRT), which models grading correctness as a function of latent grader ability and response grading difficulty. This formulation enables response-level analysis of where LLM graders succeed or fail and reveals robustness differences that are not visible from aggregate scores alone. We apply the framework to 17 open-weight LLMs on the SciEntsBank and Beetle benchmarks. The results show that even models with similar overall performance differ substantially in how sharply their grading accuracy declines as response difficulty increases. In addition, confusion patterns show that errors on difficult responses concentrate disproportionately on the `partially_correct_incomplete` label, indicating a tendency toward intermediate-label collapse under ambiguity. To characterize difficult responses, we further analyze semantic and linguistic correlates of estimated difficulty. Across both datasets, higher difficulty is associated with weaker semantic alignment to the reference answer, stronger contradiction signals, and greater semantic isolation in embedding space. Overall, these results show that item response theory offers a useful framework for evaluating LLM-based ASAG beyond aggregate performance measures.
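IRT 中最简单的形式是 Rasch(1PL)模型:评分正确概率建模为 P(correct) = sigmoid(θ − b),其中 θ 为评分者(此处为 LLM)能力、b 为该条作答的评分难度。论文具体采用的 IRT 变体以原文为准,下面仅演示这一基本函数形式:

```python
# 示意:Rasch(1PL)模型下,评分正确概率随"能力 - 难度"单调变化。
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_correct(theta, b):
    """theta: 评分者能力;b: 作答评分难度;返回评分正确的概率。"""
    return sigmoid(theta - b)

p_easy = p_correct(theta=1.0, b=-1.0)  # 能力远高于难度:正确概率高
p_hard = p_correct(theta=1.0, b=2.0)   # 难度高于能力:正确概率低
```

正是这种按难度分解的视角,使得"整体指标相近、但准确率随难度下滑速度不同"的模型差异得以显现。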
[NLP-44] Persona-Grounded Safety Evaluation of AI Companions in Multi-Turn Conversations
【速读】: 该论文旨在解决当前AI伴侣应用在情感交互中潜在安全风险难以量化评估的问题,尤其是现有方法依赖自评数据或访谈,无法捕捉实时互动中的风险行为。其解决方案的关键在于构建了一个端到端可扩展的仿真与安全评估框架,核心包括:基于临床和心理测量验证的人格模型构建、针对特定人格的情景生成、基于情景驱动的多轮对话模拟(含对话优化模块以保持人格一致性)以及危害评估机制。该框架首次实现了对高风险用户群体(如抑郁、焦虑、PTSD等)与Replika类AI伴侣的可控交互模拟,通过1,674条对话数据揭示了其情绪响应狭窄且常镜像或合理化自伤、饮食失调及暴力幻想等不安全内容的现象,为AI伴侣的安全性测试提供了系统性方法论支撑。
链接: https://arxiv.org/abs/2605.00227
作者: Prerna Juneja,Lika Lomidze
机构: Seattle University
类目: Computation and Language (cs.CL)
备注:
Abstract:There are growing concerns about the risks posed by AI companion applications designed for emotional engagement. Existing safety evaluations often rely on self-reported user data or interviews, offering limited insights into real-time dynamics. We present the first end-to-end scalable framework for controlled simulation and safety evaluation of multi-turn interactions with AI companion applications. Our framework integrates four key components: persona construction with clinical and psychometric validation, persona-specific scenario generation, scenario-driven multi-turn simulation with a dialogue refinement module that preserves persona fidelity, and harm evaluation. We apply this framework to evaluate how Replika, a widely used AI companion app, responds to high-risk user groups. We construct 9 personas representing individuals with depression, anxiety, PTSD, eating disorders, and incel identity, and collect 1,674 dialogue pairs across 25 high-risk scenarios. We combine emotion modeling and LLM-assisted utterance-and harm-level classification to analyze these exchanges. Results show that Replika exhibits a narrow emotional range dominated by curiosity and care, while frequently mirroring or normalizing unsafe content such as self-harm, disordered eating, and violent-fantasy narratives. These findings highlight how controlled persona simulations can serve as a scalable testbed for evaluating safety risks in AI companions.
[NLP-45] Why Do LLMs Struggle in Strategic Play? Broken Links Between Observations, Beliefs, and Actions
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在不完全信息博弈中进行战略决策时出现的系统性失败问题,这类场景包括谈判和政策制定等。研究发现,LLMs 在此类任务中的失效源于两个核心机制缺陷:一是“观察-信念差距”(observation-belief gap),即模型内部对潜在博弈状态的信念比其显式表述更准确,但这些信念在多跳推理中表现脆弱,且存在首因效应与近因效应、偏离贝叶斯一致性等问题;二是“信念-行动差距”(belief-action gap),即模型将隐式信念转化为行动的能力弱于显式提示引导下的信念应用,且无论是否基于信念条件化策略,均未稳定提升博弈收益。解决方案的关键在于通过深入分析 LLM 内部状态而非仅依赖输出行为,识别并量化上述机制漏洞,从而为部署前建立可靠约束机制提供理论依据。
链接: https://arxiv.org/abs/2605.00226
作者: Jan Sobotka,Mustafa O. Karabag,Ufuk Topcu
机构: EPFL(洛桑联邦理工学院); The University of Texas at Austin(德克萨斯大学奥斯汀分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注:
Abstract:Large language models (LLMs) are increasingly tasked with strategic decision-making under incomplete information, such as in negotiation and policymaking. While LLMs can excel at many such tasks, they also fail in ways that are poorly understood. We shed light on these failures by uncovering two fundamental gaps in the internal mechanisms underlying the decision-making of LLMs in incomplete-information games, supported by experiments with open-weight models Llama 3.1, Qwen3, and gpt-oss. First, an observation-belief gap: LLMs encode internal beliefs about latent game states that are substantially more accurate than their own verbal reports, yet these beliefs are brittle. In particular, the belief accuracy degrades with multi-hop reasoning, exhibits primacy and recency biases, and drifts away from Bayesian coherence over extended interactions. Second, a belief-action gap: The implicit conversion of internal beliefs into actions is weaker than that of the beliefs externalized in the prompt, yet neither belief-conditioning consistently achieves higher game payoffs. These results show how analyzing LLMs’ internal processes can expose systematic vulnerabilities that warrant caution before deploying LLMs in strategic domains without robust guardrails.
[NLP-46] State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning
【速读】: 该论文旨在解决当前的Transformer模型(current transformers)在序列建模中因丢弃位置间的潜在残差流而导致推理能力未被充分利用的问题。其核心挑战在于如何在保持参数效率的同时,实现连续潜空间中的深度推理与状态更新。解决方案的关键是提出State Stream Transformer (SST) V2架构,通过在每个解码层引入由前馈网络(FFN)驱动的非线性递归机制,并利用学习到的混合权重将潜状态沿整个序列水平传递,从而支持推理过程中对抽象逻辑的持续探索——即在生成每个token前额外分配计算资源进行潜空间内的多步推演。这一机制不仅提升了模型在分布外任务(如GPQA-Diamond)上的表现,还显著减少了GSM8K数据集上的错误率,且无需依赖更大规模模型或更多训练数据即可实现性能跃升。
链接: https://arxiv.org/abs/2605.00206
作者: Thea Aviss
机构: Fifth Dimension
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 48 pages, 21 figures
Abstract:Current transformers discard their rich latent residual stream between positions, reconstructing latent reasoning context at each new position and leaving potential reasoning capacity untapped. The State Stream Transformer (SST) V2 enables parameter-efficient reasoning in continuous latent space through an FFN-driven nonlinear recurrence at each decoder layer, where latent states are streamed horizontally across the full sequence via a learned blend. This same mechanism supports continuous latent deliberation per position at inference time, dedicating additional FLOPs to exploring abstract reasoning before committing to a token. A two-pass parallel training procedure resolves the sequential dependency of the recurrence to allow compute-efficient training. Hidden state analysis shows the state stream facilitates reasoning through exploration of distinct semantic basins in continuous latent space, where transitions at content-dependent positions move the model into a substantially different Bayesian posterior, directly influencing the latent space at future positions. We also find, via a learned probe, that at the first generated token position, the latent state already predicts whether the eventual answer will survive or break under additional latent computation for every subsequent position. Co-trained into an existing 27B backbone using only a small dataset of GSM8K examples, the SST delivers a +15.15 point gain over a fine-tuning-matched baseline on out-of-distribution GPQA-Diamond and cuts that same baseline’s remaining GSM8K errors by 46%, together showing that the reasoning improvement is attributable to the architectural mechanism rather than scale or training data. On GPQA-Diamond, the resulting 27B SST also achieves higher accuracy than several larger open-weight and proprietary systems, including open-weight models up to 25 times larger.
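SST 的核心机制是:上一位置的潜状态经 FFN 非线性变换后,以学习到的混合权重与当前位置的残差流融合,从而把潜状态"水平"传递到整个序列。下面用标量玩具版演示这一递归的形状;FFN、混合方式与维度都被极度简化,仅为说明机制,并非论文实现:

```python
# 假设性示意:状态流的逐层水平递归(标量玩具版)。
def ffn(h):
    # 极简非线性,代表论文中每个解码层的 FFN
    return max(0.0, 0.5 * h + 0.1)

def state_stream(xs, g=0.3):
    """xs: 各位置的残差流激活(标量示例);g: 学习到的混合权重。
    返回融合后的潜状态序列:h_t = (1-g)*x_t + g*FFN(h_{t-1})。"""
    states = []
    h = 0.0
    for x in xs:
        h = (1 - g) * x + g * ffn(h)  # 水平传递:上一状态的非线性变换流入当前位置
        states.append(h)
    return states

states = state_stream([1.0, 0.5, 0.2])
```

与普通 Transformer 在每个新位置重建潜在推理上下文不同,这里每个位置的状态都显式依赖前一位置的非线性变换结果。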
[NLP-47] Confidence Estimation in Automatic Short Answer Grading with LLM s
【速读】: 该论文旨在解决生成式大语言模型(Generative Large Language Models, LLMs)在自动短答案评分(Automatic Short Answer Grading, ASAG)中缺乏可靠置信度估计的问题,这限制了其在教育决策中与人类协同应用的安全性和有效性。解决方案的关键在于提出一种混合置信度框架,该框架融合了模型自身的置信信号(包括表述型、潜在空间型和一致性型三种策略)与基于数据集的可变性不确定性(aleatoric uncertainty)估计;其中,aleatoric uncertainty通过语义嵌入学生作答并聚类后量化簇内异质性来实现,从而显著提升置信度估计的可靠性,并改善选择性评分性能。
链接: https://arxiv.org/abs/2605.00200
作者: Longwei Cong,Sonja Hahn,Sebastian Gombert,Leon Camus,Hendrik Drachsler,Ulf Kroehne
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Automatic Short Answer Grading (ASAG) with generative large language models (LLMs) has recently demonstrated strong performance without task-specific fine-tuning, while also enabling the generation of synthetic feedback for educational assessment. Despite these advances, LLM-based grading remains imperfect, making reliable confidence estimates essential for safe and effective human-AI collaboration in educational decision-making. In this work, we investigate confidence estimation for ASAG with LLMs by jointly considering model-based confidence signals and dataset-derived uncertainty. We systematically compare three model-based confidence estimation strategies, namely verbalizing, latent, and consistency-based confidence estimation, and show that model-based confidence alone is insufficient to reliably capture uncertainty in ASAG. To address this limitation, we propose a hybrid confidence framework that integrates model-based confidence signals with an explicit estimate of dataset-derived aleatoric uncertainty. Aleatoric uncertainty is operationalized by clustering semantically embedded student responses and quantifying within-cluster heterogeneity. Our results demonstrate that the proposed hybrid confidence measure yields more reliable confidence estimates and improves selective grading performance compared to single-source approaches. Overall, this work advances confidence-aware LLM-based grading for human-in-the-loop assessment, supporting more trustworthy AI-assisted educational assessment systems.
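论文的混合置信度框架将模型自身的置信信号与基于语义聚类的任意不确定性(aleatoric uncertainty)估计相结合。下面是一个最小化的示意:用"1 − 簇内多数类占比"近似簇内异质性,再与模型置信度相乘。这一组合形式与异质性度量均为示例,论文的具体定义以原文为准:

```python
# 假设性示意:混合置信度 = 模型置信度 × (1 - 簇内标签异质性)。
def cluster_heterogeneity(labels):
    """簇内人工标签异质性的简化度量:1 - 多数类占比(0 表示完全一致)。"""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - max(counts.values()) / len(labels)

def hybrid_confidence(model_conf, cluster_labels):
    """model_conf: 模型自报/一致性置信度;cluster_labels: 样本所在语义簇的标签。"""
    return model_conf * (1.0 - cluster_heterogeneity(cluster_labels))

# 簇内标签高度一致 → 任意不确定性低,混合置信度接近模型置信度
c1 = hybrid_confidence(0.9, ["correct"] * 9 + ["incorrect"])
# 簇内标签五五开 → 任意不确定性高,置信度被显著下调
c2 = hybrid_confidence(0.9, ["correct"] * 5 + ["incorrect"] * 5)
```

在选择性评分(selective grading)中,被下调的样本会被优先转交人工复核。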
[NLP-48] RouteProfile: Elucidating the Design Space of LLM Profiles for Routing
【速读】: 该论文旨在解决大规模语言模型(Large Language Model, LLM)路由系统中因模型能力差异导致的性能不一致问题,核心挑战在于如何通过合理的LLM画像(LLM profile)设计提升路由决策的准确性与泛化能力。其解决方案的关键在于将LLM画像建模为一个结构化的信息整合问题,提出了一种名为RouteProfile的通用画像设计空间,涵盖组织形式、表示类型、聚合深度和学习配置四个维度,并通过系统性实验验证:结构化画像优于扁平画像,查询级信号比粗粒度领域信号更可靠,且在可训练配置下结构化画像对新引入模型的泛化效果最优。这一方法有效解耦了画像设计与路由器设计的关系,为路由系统的公平比较与更优开发提供了理论基础和实践指导。
链接: https://arxiv.org/abs/2605.00180
作者: Jingjun Xu,Hongji Pu,Tao Feng,Haozhen Zhang,Jiaxuan You,Ge Liu
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Nanyang Technological University (南洋理工大学)
类目: Networking and Internet Architecture (cs.NI); Computation and Language (cs.CL)
备注:
Abstract:As the large language model (LLM) ecosystem expands, individual models exhibit varying capabilities across queries, benchmarks, and domains, motivating the development of LLM routing. While prior work has largely focused on router mechanism design, LLM profiles, which capture model capabilities, remain underexplored. In this work, we ask: How does LLM profile design affect routing performance across different routers? Addressing this question helps clarify the role of profiles in routing, disentangle profile design from router design, and enable fairer comparison and more principled development of routing systems. To this end, we view LLM profiling as a structured information integration problem over heterogeneous interaction histories. We develop a general design space of LLM profiles, named RouteProfile, along four key dimensions: organizational form, representation type, aggregation depth, and learning configuration. Through systematic evaluation across three representative routers under both standard and new-LLM generalization settings, we show that: (1) structured profiles consistently outperform flat ones; (2) query-level signals are more reliable than coarse domain-level signals; and (3) generalization to newly introduced models benefits most from structured profiles under trainable configurations. Overall, our work highlights LLM profile design as an important direction for future routing research.
[NLP-49] Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback
【速读】: 该论文旨在解决强化学习中人类反馈(Reinforcement Learning from Human Feedback, RLHF)因奖励信号为人类效用的代理函数而引发的目标误设(objective misspecification)问题,即在训练过程中优化的是一个可学习的奖励估计值,而在部署时实际性能由未观测到的真实人类效用决定,从而导致奖励过优化(reward over-optimization,或称Goodhart效应)——即代理奖励持续提升但真实质量已下降。解决方案的关键在于提出 Wasserstein 分布鲁棒后悔优化(Wasserstein Distributionally Robust Regret Optimization, DRRO),其核心思想不是像传统分布鲁棒优化(DRO)那样悲观化最差情况下的价值,而是悲观化相对于同一扰动下最优策略的最差后悔值(regret)。理论分析表明,在 ℓ1 不确定性集下,内层最坏后悔问题具有闭式解且最优策略呈现“注水填充”(water-filling)结构,由此导出一种仅需微调 PPO/GRPO 类 RLHF 训练流程的实用策略梯度算法,兼具计算效率与适度悲观性,实验证明其比现有基线更有效缓解过优化,且相比标准 DRO 不会过度保守。
链接: https://arxiv.org/abs/2605.00155
作者: Yikai Wang,Shang Liu,Jose Blanchet
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注:
Abstract:Reinforcement learning from human feedback (RLHF) has become a core post-training step for aligning large language models, yet the reward signal used in RLHF is only a learned proxy for true human utility. From an operations research perspective, this creates a decision problem under objective misspecification: the policy is optimized against an estimated reward, while deployment performance is determined by an unobserved objective. The resulting gap leads to reward over-optimization, or Goodharting, where proxy reward continues to improve even after true quality deteriorates. Existing mitigations address this problem through uncertainty penalties, pessimistic rewards, or conservative constraints, but they can be computationally burdensome and overly pessimistic. We propose Wasserstein distributionally robust regret optimization (DRRO) for RLHF. Instead of pessimizing worst-case value as in standard DRO, DRRO pessimizes worst-case regret relative to the best policy under the same plausible reward perturbation. We study the promptwise problem through a simplex allocation model and show that, under an ℓ1 ambiguity set, the inner worst-case regret admits an exact solution and the optimal policy has a water-filling structure. These results lead to a practical policy-gradient algorithm with a simple sampled-bonus interpretation and only minor changes to PPO/GRPO-style RLHF training. The framework also clarifies theoretically why DRRO is less pessimistic than DRO, and our experiments show that DRRO mitigates over-optimization more effectively than existing baselines while standard DRO is systematically over-pessimistic.
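作为对上述"注水填充"(water-filling)结构的直观示意,下面用二分法在单纯形上求解阈值分配:估计奖励超出公共阈值 τ 的部分按比例获得概率质量,低于阈值的动作质量为零。代码为笔者的示意性草图,函数名、预算参数均为假设,并非论文原实现。

```python
import numpy as np

def water_filling(r, budget=1.0, iters=60):
    # Bisect for the threshold tau so that the clipped surplus mass
    # sum_i max(r_i - tau, 0) equals the budget; f(tau) is decreasing.
    lo, hi = float(r.min()) - budget, float(r.max())
    tau = hi
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        if np.maximum(r - tau, 0.0).sum() > budget:
            lo = tau          # tau too low: too much mass above it
        else:
            hi = tau
    pi = np.maximum(r - tau, 0.0)
    return pi / pi.sum()      # normalize onto the simplex

r = np.array([1.0, 0.8, 0.3, 0.1])   # estimated rewards for 4 candidate responses
pi = water_filling(r)
```

本例中阈值收敛到 0.4,策略只在前两个动作上分配质量(0.6 与 0.4),体现了"只填注高于水位线的部分"这一结构。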
[NLP-50] Timing is Everything: Temporal Scaffolding of Semantic Surprise in Humor
【速读】: 该论文旨在解决幽默感知中时间结构与语义内容之间交互机制不明确的问题,尤其是现有理论过度强调语义不一致(semantic incongruity)而忽视了时间动态性对幽默效果的影响。其解决方案的关键在于提出“双重预测违反”(Dual Prediction Violation, DPV)框架,通过分析828个专业中文脱口秀表演数据发现,时间特征(如停顿长度)在预测观众欣赏程度上显著优于语义不一致水平,且高突变性的语义点前往往伴随系统性延长的停顿——这表明幽默感源于语义内容与时间节奏的战略协同,而非独立作用。
链接: https://arxiv.org/abs/2605.00143
作者: Yuxi Ma,Yongqian Peng,Junchen Lyu,Chi Zhang,Yixin Zhu
机构: Peking University (北京大学); School of Psychological and Cognitive Sciences (心理与认知科学学院); School of Intelligence Science and Technology (智能科学与技术学院); Yuanpei College (元培学院); State Key Laboratory of General Artificial Intelligence (通用人工智能国家重点实验室); Beijing Key Laboratory of Behavior and Mental Health (行为与心理健康北京市重点实验室)
类目: Computation and Language (cs.CL)
备注: to be published in CogSci 2026
Abstract:Humor is a fundamental cognitive phenomenon in which humans derive pleasure from the expectation violations and their resolution, exemplifying the brain’s dynamic capacity for predictive processing. Classical humor theories emphasize semantic incongruity as the primary driver of amusement, yet overlook temporal dynamics despite comedians’ intuition that “timing is everything.” The extent to which temporal structure contributes to humor appreciation and how it interacts with semantic content remains poorly understood. Here, we propose the Dual Prediction Violation (DPV) framework to capture the interplay between content and timing. By analyzing 828 professional Chinese stand-up performances, we show that temporal features substantially outweigh semantic incongruity in predicting audience appreciation. Specifically, we find that peak semantic violations matter more than average incongruity levels, and pauses systematically lengthen before high-surprise punchlines–a strategic coupling that distinguishes successful from unsuccessful performances. These findings reframe humor as temporally scaffolded, where timing and semantic content operate in strategic coordination rather than independently. Our DPV framework bridges humor theory with predictive processing, demonstrating that temporal structure plays a central role in naturalistic humor appreciation with implications for understanding multi-scale prediction integration in linguistic processing.
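摘要中"高惊奇笑点前停顿系统性延长"的统计检验思路可用如下草图示意:比较高惊奇(前四分位)笑点前的平均停顿与其余笑点前平均停顿之差。变量名、分位阈值与玩具数据均为笔者假设,仅说明分析逻辑,并非论文原始流程。

```python
import numpy as np

def pause_surprise_gap(pauses, surprises, q=0.75):
    # Difference between the mean pause preceding high-surprise punchlines
    # (top quartile by semantic surprise) and the mean pause before the rest.
    pauses = np.asarray(pauses, float)
    surprises = np.asarray(surprises, float)
    hi = surprises >= np.quantile(surprises, q)
    return pauses[hi].mean() - pauses[~hi].mean()

# toy data: pause (seconds) before each punchline, and its semantic surprise
gap = pause_surprise_gap([1.2, 0.4, 0.5, 1.5], [0.9, 0.2, 0.3, 0.95])
```

gap 为正即意味着高惊奇笑点前的停顿更长,与摘要报告的现象一致。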
[NLP-51] Technical Report: Activation Residual Hessian Quantization (ARHQ) for Low-Bit LLM Quantization
【速读】: 该论文旨在解决低比特激活-权重量化(low-bit activation-weight quantization)中因误差传播导致的模型性能下降问题。其核心挑战在于,量化误差在前向传播过程中会累积并放大,尤其在推理任务中影响显著。解决方案的关键在于提出一种后训练权重拆分方法——激活残差海森量化(Activation Residual Hessian Quantization, ARHQ),该方法通过构建基于激活量化残差(G_x)的输入侧残差海森矩阵,利用缩放后的权重矩阵 W G^{1/2}_x 的闭式截断奇异值分解(truncated SVD),将误差敏感的权重方向分离至高精度低秩分支中,从而有效抑制误差传播并保持下游推理性能。
链接: https://arxiv.org/abs/2605.00140
作者: YiFeng Wang,Zhun Sun,Keisuke Sakaguchi
机构: Tohoku University(东北大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present Activation Residual Hessian Quantization (ARHQ), a post-training weight splitting method designed to mitigate error propagation in low-bit activation-weight quantization. By constructing an input-side residual Hessian from activation quantization residuals (G_x), ARHQ analytically identifies and isolates error-sensitive weight directions into a high-precision low-rank branch. This is achieved via a closed-form truncated SVD on the scaled weight matrix W G_x^{1/2}. Experimental results on Qwen3-4B-Thinking-2507 demonstrate that ARHQ significantly improves layer-wise SNR and preserves downstream reasoning performance on ZebraLogic even under aggressive quantization. The code is available at this https URL.
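ARHQ 的权重拆分思想可用如下草图说明(这里假设 G_x 为对角矩阵):对缩放后的 W G_x^{1/2} 做截断 SVD,前 r 个方向保留为高精度低秩分支,残差再做低比特均匀量化。秩、位宽与量化方式均为笔者的简化假设,并非论文原实现。

```python
import numpy as np

def split_weights(W, gx_diag, rank=4, bits=4):
    # Scale input dimensions by G_x^{1/2} (diagonal case), take a truncated SVD,
    # and keep the top-r directions as a high-precision low-rank branch.
    s = np.sqrt(gx_diag)
    U, sv, Vt = np.linalg.svd(W * s, full_matrices=False)
    low_rank = (U[:, :rank] * sv[:rank]) @ Vt[:rank] / s
    residual = W - low_rank
    # Uniform symmetric quantization of the residual to `bits` bits.
    scale = np.abs(residual).max() / (2 ** (bits - 1) - 1) + 1e-12
    quantized = np.round(residual / scale) * scale
    return low_rank, quantized

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
gx = rng.uniform(0.5, 2.0, size=16)   # diagonal of an illustrative residual Hessian
L, Rq = split_weights(W, gx)
err = np.abs(W - (L + Rq)).max()      # bounded by half a quantization step
```

由于只量化残差,最终重构误差不超过半个量化步长,而误差敏感方向完整保留在低秩分支中。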
[NLP-52] Cultural Benchmarking of LLMs in Standard and Dialectal Arabic Dialogues
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在评估文化推理能力时存在的显著空白,尤其是在涵盖文化丰富性和方言语境的对话数据集上的缺失问题。现有阿拉伯语评测基准多聚焦于现代标准阿拉伯语(Modern Standard Arabic, MSA)的短文本片段,忽视了对话中自然涌现的文化细微差别。解决方案的关键在于构建一个名为ArabCulture-Dialogue的对话数据集,该数据集覆盖13个阿拉伯语国家,包含MSA与各国方言的双语版本,涵盖12个日常生活主题及54个细粒度子主题,并基于此设计三项基准任务:(i)多项选择文化推理、(ii)MSA与方言之间的机器翻译、(iii)方言导向生成。实验表明,模型在方言设置下的表现普遍低于MSA设置,验证了方言处理仍是LLMs的重要挑战。
链接: https://arxiv.org/abs/2605.00119
作者: Muhammad Dehan Al Kautsar,Saeed Almheiri,Momina Ahsan,Bilal Elbouardi,Younes Samih,Sarfraz Ahmad,Amr Keleg,Omar El Herraoui,Kareem Elzeky,Abed Alhakim Freihat,Mohamed Anwar,Zhuohan Xie,Junhong Liang,Mohammad Rustom Al Nasar,Preslav Nakov,Fajri Koto
机构: Mohamed bin Zayed University of Artificial Intelligence ( Mohamed bin Zayed 大学人工智能); IBM Research AI (IBM 研究 AI); American University in the Emirates (阿联酋美国大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 23 pages, 7 figures, 16 tables
Abstract:There is a significant gap in evaluating cultural reasoning in LLMs using conversational datasets that capture culturally rich and dialectal contexts. Most Arabic benchmarks focus on short text snippets in Modern Standard Arabic (MSA), overlooking the cultural nuances that naturally arise in dialogues. To address this gap, we introduce ArabCulture-Dialogue, a culturally grounded conversational dataset covering 13 Arabic-speaking countries, in both MSA and each country’s respective dialect, spanning 12 daily-life topics and 54 fine-grained subtopics. We utilize the dataset to form three benchmarking tasks: (i) multiple-choice cultural reasoning, (ii) machine translation between MSA and dialects, and (iii) dialect-steering generation. Our experiments indicate that the performance gap between MSA and Arabic dialects still exists, whereby the models perform worse on all three tasks in the dialectal setup, compared to the MSA one.
[NLP-53] ViLegalNLI: Natural Language Inference for Vietnamese Legal Texts
【速读】: 该论文旨在解决越南语法律领域自然语言推理(Natural Language Inference, NLI)缺乏大规模、高质量标注数据集的问题,从而推动越南语法律文本理解与推理的AI研究。其解决方案的关键在于提出并实现了一个半自动的数据生成框架,该框架结合大语言模型(Large Language Models, LLMs)进行受控假设生成,并引入系统性的质量验证流程、伪影缓解策略及跨模型一致性校验机制,以确保标注的可靠性与法律逻辑的一致性。由此构建的ViLegalNLI数据集覆盖多个法律领域,包含42,012对前提-假设样本,能够有效支撑越南语法律推理任务的基准测试与模型评估。
链接: https://arxiv.org/abs/2605.00116
作者: Nhung Thi-Hong Duong,Mai Ngoc Ho,Tin Van Huynh,Kiet Van Nguyen
机构: UIT (University of Information Technology)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 33 pages, 17 figures
Abstract:In this article, we introduce ViLegalNLI, the first large-scale Vietnamese Natural Language Inference (NLI) dataset specifically constructed for the legal domain. The dataset consists of 42,012 premise-hypothesis pairs derived from official statutory documents and annotated with binary inference labels (Entailment and Non-entailment). It covers multiple legal domains and reflects realistic legal reasoning scenarios characterized by structured logic, conditional clauses, and domain-specific terminology. To construct ViLegalNLI, we propose a semi-automatic data generation framework that integrates large language models for controlled hypothesis generation and systematic quality validation procedures. The framework incorporates artifact mitigation strategies and cross-model validation to improve annotation reliability and ensure legal consistency. The resulting dataset captures diverse reasoning patterns, including paraphrasing, logical implication, and legally invalid inferences, thereby providing a comprehensive benchmark for Vietnamese legal inference tasks. We conduct extensive experiments on the ViLegalNLI using multilingual models, Vietnamese-specific pretrained language models, and instruction-tuned large language models. The results show that few-shot LLM configurations consistently achieve superior performance, while performance is significantly influenced by hypothesis length, lexical overlap, and reasoning complexity. Cross-domain evaluations further reveal the challenges of generalizing legal inference across distinct legal fields. Overall, ViLegalNLI establishes a foundational benchmark for Vietnamese legal NLI and supports future research in legal reasoning, statutory text understanding, and the development of reliable AI systems for legal analysis and decision support. The dataset is publicly available for research purposes.
[NLP-54] NorBERTo: A ModernBERT Model Trained for Portuguese with 331 Billion Tokens Corpus ACL
【速读】: 该论文旨在解决葡萄牙语自然语言处理(Natural Language Processing, NLP)领域高质量语料库稀缺的问题,以及现有预训练模型在长文本建模和部署效率方面的不足。解决方案的关键在于提出NorBERTo——一个基于ModernBERT架构的现代编码器模型,其核心创新包括:支持长上下文的注意力机制、在大规模开源单语葡萄牙语语料库Aurora-PT(含3310亿个GPT-2 token)上进行训练,并通过系统性基准测试验证其在语义相似性、文本蕴含和分类任务上的优越性能。NorBERTo-large在PLUE数据集上达到0.9191 F1(MRPC)和0.7689准确率(RTE),在ASSIN 2上实现约0.904的蕴含F1,且具备易于微调、服务高效的特点,适合作为检索增强生成(Retrieval-Augmented Generation, RAG)等下游葡萄牙语NLP系统的骨干模型。
链接: https://arxiv.org/abs/2605.00086
作者: Enzo S. N. Silva,Pablo B. Costa,Raphael C. Vlasman,Rosimeire P. Costa,Henrique L. P. Silva,Lucas F. A. O. Pellicer,Guilherme Rinaldo,Renato A. Almeida,Darian S. R. Rabbani,Cinthya O. Oestreich,Vinicius F. Caridá
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This article has already undergone formal submission, review, acceptance, and publication in the proceedings of PROPOR 2026: Proceedings of the 17th International Conference on Computational Processing of Portuguese, Vol. 1. The published version is available in the ACL Anthology at this https URL 11 pages, 9 tables, 2 figures
Abstract:High-quality corpora are essential for advancing Natural Language Processing (NLP) in Portuguese. Building on previous encoder-only models such as BERTimbau and Albertina PT-BR, we introduce NorBERTo, a modern encoder based on the ModernBERT architecture, featuring long-context support and efficient attention mechanisms. NorBERTo is trained on Aurora-PT, a newly curated Brazilian Portuguese corpus comprising 331 billion GPT-2 tokens collected from diverse web sources and existing multilingual datasets. We systematically benchmark NorBERTo against strong baselines on semantic similarity, textual entailment and classification tasks using standardized datasets such as ASSIN 2 and PLUE. On PLUE, NorBERTo-large achieves the best results among the encoder models we evaluated, notably reaching 0.9191 F1 on MRPC and 0.7689 accuracy on RTE. On ASSIN 2, NorBERTo-large attains the highest entailment F1 (~0.904) among all encoders considered, although Albertina-900M and BERTimbau-large still hold an advantage. To the best of our knowledge, Aurora-PT is currently the largest openly available monolingual Portuguese corpus, surpassing previous resources. NorBERTo provides a modern, mid-sized encoder designed for realistic deployment scenarios: it is straightforward to fine-tune, efficient to serve, and well suited as a backbone for retrieval-augmented generation and other downstream Portuguese NLP systems.
[NLP-55] Putting HUMANS first: Efficient LAM Evaluation with Human Preference Alignment
【速读】: 该论文旨在解决大规模音频模型(Large Audio Models, LAMs)评估中存在成本高、数据冗余的问题,尤其是在缺乏高效且可靠的基准测试方法时。其核心解决方案在于验证最小化样本子集是否能有效替代完整基准进行模型比较,并进一步提升对用户满意度的预测能力。关键发现是:仅需50个样本(占原数据的0.3%)即可与完整基准得分保持超过0.93的皮尔逊相关性;更重要的是,通过在这些精选子集上训练回归模型,可实现0.98的相关性,显著优于基于随机子集或全量数据的模型,表明“质量优于数量”。为此,作者开源了名为HUMANS的基准,作为LAM评估的高效代理指标,同时兼顾基准性能与用户偏好。
链接: https://arxiv.org/abs/2605.00022
作者: Woody Haosheng Gan,William Held,Diyi Yang
机构: University of Southern California(南加州大学); Stanford University(斯坦福大学); OpenAthena(OpenAthena)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics
Abstract:The rapid proliferation of large audio models (LAMs) demands efficient approaches for model comparison, yet comprehensive benchmarks are costly. To fill this gap, we investigate whether minimal subsets can reliably evaluate LAMs while reducing costs and data redundancy. Analyzing 10 subset selection methods with 18 audio models across 40 tasks covering major LAM evaluation dimensions, we show that subsets of just 50 examples (0.3% of data) can achieve over 0.93 Pearson correlation with full benchmark scores. To understand how well these scores align with what practitioners ultimately care about, user satisfaction, we collect 776 human preference ratings from realistic voice assistant conversations, finding that both subsets and the full benchmark achieve only 0.85 correlation with human ratings. To better predict preferences, we trained regression models on these selected subsets, achieving 0.98 correlation, outperforming regression models trained on both random subsets and the full benchmark. This demonstrates that in regression modeling, well-curated subsets outpredict the full benchmark, showing quality over quantity. We open-source these regression-weighted subsets as the HUMANS benchmark, an efficient proxy for LAM evaluation that captures both benchmark performance and user preferences.
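摘要中"子集得分与完整基准得分高度相关"的验证方式,可用如下皮尔逊相关系数的计算草图说明。数据为随机生成的玩具示例(18 个模型,子集得分视作完整得分的带噪代理),仅演示计算流程,非论文数据。

```python
import numpy as np

def pearson(a, b):
    # Pearson correlation: center both vectors, then cosine of the residuals.
    a = np.asarray(a, float) - np.mean(a)
    b = np.asarray(b, float) - np.mean(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy setup: 18 models; the subset score is a noisy proxy of the full score
rng = np.random.default_rng(1)
full = rng.uniform(0.3, 0.9, size=18)            # full-benchmark score per model
subset = full + rng.normal(0.0, 0.02, size=18)   # 50-example subset score
r = pearson(subset, full)
```

当子集噪声相对模型间差异足够小时,r 会接近 1,这正是"50 个样本即可达到 0.93 相关性"背后的直觉。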
[NLP-56] Models Recall What They Violate: Constraint Adherence in Multi-Turn LLM Ideation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在科学创意迭代过程中对原始约束条件的保持能力问题,即在多轮交互中模型是否能持续遵循初始设定的研究目标。其关键解决方案是提出DriftBench基准测试框架,通过系统性评估不同模型、交互条件和研究简报下的约束遵守情况,揭示了迭代压力会显著增加结构复杂性但常导致约束偏离的现象,并发现“知道但违反”(knows-but-violates, KBV)行为普遍存在——模型虽能准确复述约束,却在实际生成中违反这些约束,KBV率在8%至99%之间波动。该研究进一步验证了人工盲评下LLM判别器对约束违规的低估现象,从而表明现有约束评估指标存在保守偏差。
链接: https://arxiv.org/abs/2604.28031
作者: Garvin Kruthof
机构: Technical University of Munich (慕尼黑工业大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:When researchers iteratively refine ideas with large language models, do the models preserve fidelity to the original objective? We introduce DriftBench, a benchmark for evaluating constraint adherence in multi-turn LLM-assisted scientific ideation. Across 2,146 scored benchmark runs spanning seven models from five providers (including two open-weight), four interaction conditions, and 38 research briefs from 24 scientific domains, we find that iterative pressure reliably increases structural complexity and often reduces adherence to original constraints. A restatement probe reveals a dissociation between declarative recall and behavioral adherence, as models accurately restate constraints they simultaneously violate. The knows-but-violates (KBV) rate, measuring constraint non-compliance despite preserved recall, ranges from 8% to 99% across models. Structured checkpointing partially reduces KBV rates but does not close the dissociation, and complexity inflation persists. Human validation against blind raters confirms that the LLM judge under-detects constraint violations, making reported constraint adherence scores conservative. Sensitivity analyses confirm the findings are robust to temperature (0.7 vs. 1.0) and pressure type (novelty vs. rigor). We release all briefs, prompts, rubrics, transcripts, and scores as an open benchmark.
[NLP-57] Adaptive Querying with AI Persona Priors ICML2026
【速读】: 该论文旨在解决在严格题目预算下,如何高效适应性地查询用户以学习其个体依赖的感兴趣量(如对未见项目的响应和心理测量指标)的问题。传统贝叶斯设计与计算机自适应测试通常依赖于严格的参数假设或昂贵的后验近似方法,在异质性、高维及冷启动场景中应用受限。解决方案的关键在于提出一种“人格诱导的隐变量模型”(persona-induced latent variable model),将用户状态建模为有限人格库中的成员,每种人格由大语言模型生成响应分布;该模型提供表达能力强的先验分布,支持闭式后验更新与高效的有限混合预测,从而实现可扩展的贝叶斯序贯项目选择。
链接: https://arxiv.org/abs/2605.00696
作者: Kaizheng Wang,Yuhang Wu,Assaf Zeevi
机构: Columbia University (哥伦比亚大学); Columbia Business School (哥伦比亚商学院)
类目: Machine Learning (stat.ML); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: ICML 2026
Abstract:We study adaptive querying for learning user-dependent quantities of interest, such as responses to held-out items and psychometric indicators, within tight question budgets. Classical Bayesian design and computerized adaptive testing typically rely on restrictive parametric assumptions or expensive posterior approximations, limiting their use in heterogeneous, high-dimensional, and cold-start settings. We introduce a persona-induced latent variable model that represents a user’s state through membership in a finite dictionary of AI personas, each offering response distributions produced by a large language model. This yields expressive priors with closed-form posterior updates and efficient finite-mixture predictions, enabling scalable Bayesian design for sequential item selection. Experiments on synthetic data and WorldValuesBench demonstrate that persona-based posteriors deliver accurate probabilistic predictions and an interpretable adaptive elicitation pipeline.
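有限人格混合的闭式后验更新与预测可示意如下:先验与各人格对观测回答的似然逐项相乘并归一化,预测则是人格后验对各人格回答分布的加权平均。人格数量与各数值均为虚构示例,用于说明摘要所述的"闭式后验更新 + 有限混合预测"。

```python
import numpy as np

def posterior_update(prior, likelihoods):
    # Closed-form Bayes update over a finite persona dictionary:
    # posterior_k ∝ prior_k * P(observed answer | persona k).
    post = prior * likelihoods
    return post / post.sum()

def predictive(post, item_probs):
    # Finite-mixture prediction for a held-out item:
    # item_probs[k, a] = persona k's probability of answer a.
    return post @ item_probs

prior = np.full(3, 1.0 / 3.0)        # uniform prior over 3 hypothetical personas
lik = np.array([0.8, 0.1, 0.1])      # likelihood of the observed answer per persona
post = posterior_update(prior, lik)
pred = predictive(post, np.array([[0.9, 0.1],
                                  [0.2, 0.8],
                                  [0.5, 0.5]]))
```

由于人格字典有限,每一步更新与预测都是常数维的矩阵运算,这正是该方法可扩展到序贯选题的原因。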
信息检索
[IR-0] Robust Multimodal Recommendation via Graph Retrieval-Enhanced Modality Completion
【速读】:该论文旨在解决多模态推荐系统中因传感器故障、标注稀缺或隐私限制导致的模态不完整性问题,此类缺失会显著降低模型性能与可靠性。现有模态补全方法通常仅依赖查询节点自身或其邻域信息进行重建,难以捕捉图中语义相关的复杂上下文。解决方案的关键在于提出GRE-MC框架,通过引入一种模态感知的子图检索机制,从全局图中选取语义相关子图以提供更丰富的上下文信息;随后利用图Transformer通过全局注意力联合编码查询节点与检索到的子图,完成缺失模态的重建,同时采用可学习的稀疏路由码本对潜在嵌入进行正则化,提升模型鲁棒性。
链接: https://arxiv.org/abs/2605.00670
作者: Yuan Li,Jun Hu,Jiaxin Jiang,Bryan Hooi,Bingsheng He
机构: National University of Singapore(新加坡国立大学)
类目: Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
备注:
Abstract:Multimodal data plays a critical role in web-based recommendation systems, where information from diverse modalities such as vision and text enhances representation learning. However, real-world multimodal datasets often suffer from modality incompleteness due to sensor failures, annotation scarcity, or privacy constraints, which substantially degrade model performance and reliability. One effective solution to address this issue is modality completion, which reconstructs missing features to provide modality-complete graphs for downstream tasks. Given a query node with missing multimodal features, existing modality completion methods typically infer information from the node itself or its neighbors to reconstruct the missing modality. However, these methods may overlook semantically relevant context in the graph, which contains valuable cues that are non-trivial to capture through simple methods like neighborhood aggregation. In this work, we propose GRE-MC, a Graph Retrieval-Enhanced Modality Completion framework, to overcome these limitations. By introducing a modality-aware subgraph retrieval mechanism, GRE-MC selects semantically relevant subgraphs from the entire graph, providing richer contextual information for completing missing modalities. Subsequently, a graph transformer jointly encodes the query node and the retrieved subgraph via global attention to complete the missing features, while a learnable sparse-routing codebook regularizes latent embeddings into compact bases for improved robustness. Extensive experiments on multimodal recommendation benchmarks demonstrate that GRE-MC consistently outperforms state-of-the-art methods, validating the effectiveness of subgraph retrieval and joint-encoding graph transformer for robust modality completion.
[IR-1] A Replicability Study of XTR
【速读】:该论文旨在解决多向量检索模型(如ColBERT)在实际部署中因候选集重新排序(reranking)计算开销过大而导致的效率瓶颈问题。其核心解决方案是引入XTR(conteXtual Token Retrieval)算法,通过在初始token级检索阶段估算缺失的相似度分数,从而避免对全部候选文档进行完整的嵌入获取与重排序操作。关键创新在于提出了一种针对XTR的训练目标修改机制,该机制可有效改善token级得分分布,使其更平滑且具有更强的区分度,进而显著提升基于倒排文件(IVF-based)的高效检索引擎(如PLAID和WARP)的检索效率,而不仅限于原始研究中关注的低k’场景。这一改进使XTR训练具备了更广泛的工程适用性,为多向量检索的实际落地提供了明确的技术路径。
链接: https://arxiv.org/abs/2605.00646
作者: Rohan Jha,Reno Kriz,Benjamin Van Durme
机构: Johns Hopkins University (约翰霍普金斯大学); Human Language Technology Center of Excellence (人类语言技术卓越中心)
类目: Information Retrieval (cs.IR)
备注: 8 pages, 4 figures, 3 tables
Abstract:The XTR (conteXtual Token Retrieval) algorithm is a modification to ColBERT retrieval that avoids the costly step of fully gathering and reranking the candidates’ embeddings by imputing their missing similarity scores from the initial token retrieval step. The original work proposes a modified training objective as necessary for effective XTR retrieval, arguing that standard ColBERT token scoring is unsuitable for imputation. In this paper, we replicate both the XTR retrieval algorithm and its modified training objective, and extend the evaluation to knowledge-distillation (KD) training and efficient retrieval engines (PLAID and WARP). We confirm the token-level matching characteristics claimed in the original work, but fail to replicate XTR’s overall effectiveness advantage over ColBERT under a controlled comparison. We further show that XTR’s training modification has a concrete mechanistic consequence for modern retrieval engines: by flattening ColBERT’s characteristically peaked token score distribution, XTR training yields more discriminative centroid scores and thus more efficient IVF-based retrieval under PLAID and WARP. The utility of XTR training is therefore not limited to the low-k′ regime originally studied, but extends to any deployment setting where IVF-based engines are used. These findings offer practitioners concrete guidance on how and when to use XTR as their multi-vector retriever.
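XTR 式的缺失相似度填补(imputation)打分可用如下草图示意:每个查询 token 只保留其 top-k′ 检索结果;对未进入该 token 候选列表的文档,用列表中的最低分代替其真实相似度。这是笔者按原论文思想写的简化版本,真实实现作用在 token 级索引上,而非稠密相似度矩阵。

```python
import numpy as np

def xtr_score(sim, topk):
    # sim[q, d]: max token similarity between query token q and document d.
    # Each query token retrieves only its top-k' documents; documents missing
    # from token q's shortlist get the lowest retrieved score as imputation.
    n_q, n_d = sim.shape
    scores = np.zeros(n_d)
    for q in range(n_q):
        order = np.argsort(-sim[q])
        shortlist = set(order[:topk].tolist())
        floor = sim[q, order[topk - 1]]       # imputation value for token q
        for d in range(n_d):
            scores[d] += sim[q, d] if d in shortlist else floor
    return scores

sim = np.array([[0.9, 0.5, 0.1],
                [0.2, 0.8, 0.3]])
scores = xtr_score(sim, topk=2)
```

可见文档无需被所有查询 token 召回即可获得完整得分,从而省去了收集全部嵌入再重排序的开销。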
[IR-2] H-RAG at SemEval-2026 Task 8: Hierarchical Parent-Child Retrieval for Multi-Turn RAG Conversations
【速读】:该论文针对多轮检索增强生成(Retrieval-Augmented Generation, RAG)任务中检索与生成环节的协同优化问题展开研究,旨在提升在多轮对话场景下答案生成的准确性与证据忠实性(faithfulness)。其核心挑战在于如何有效分离细粒度检索与上下文重构过程,以避免因片段化信息导致的生成偏差。解决方案的关键在于提出一种分层父-子RAG(Hierarchical Parent-Child RAG)架构:将文档切分为重叠的句子级子块(child chunks)用于高精度检索,同时保留完整文档作为父单元(parent units)用于生成阶段的语境重建;通过混合密集-稀疏搜索与可调权重机制实现子块层面的精准召回,并基于嵌入相似度重新排序;最终在父层级聚合检索到的证据并输入指令微调的语言模型进行响应生成,从而显著改善多轮RAG的整体性能表现。
链接: https://arxiv.org/abs/2605.00631
作者: Passant Elchafei,Hossam Emam,Mohamed Alansary,Monorama Swain,Markus Schedl
机构: Johannes Kepler University Linz, Institute of Computational Perception, Linz, Austria; Linz Institute of Technology, Artificial Intelligence Lab, Austria
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:We present H-RAG, our submission to SemEval-2026 Task 8 (MTRAGEval), addressing both Task A (Retrieval) and Task C (Generation with Retrieved Passages). Task A evaluates standalone retrieval quality, while Task C assesses end-to-end retrieval-augmented generation (RAG) in multi-turn conversational settings, requiring both accurate answer generation and faithful grounding in retrieved evidence. Our approach implements a hierarchical parent-child RAG pipeline that separates fine-grained child-level retrieval from parent-level context reconstruction during generation. Documents are segmented into overlapping sentence-based child chunks, while full documents are preserved as parent units to provide coherent context. Retrieval combines hybrid dense-sparse search, tunable weighting, and embedding-based similarity rescoring over child chunks. Retrieved evidence is aggregated at the parent level and supplied to an instruction-tuned language model for response generation. H-RAG achieves an nDCG@5 score of 0.4271 on Task A and a harmonic mean score of 0.3241 on Task C (RB_agg: 0.2488, RL_F: 0.2703, RB_llm: 0.6508), underscoring the importance of retrieval configuration and parent-level aggregation in multi-turn RAG performance.
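父-子切分与父级聚合的流程可用如下草图说明:以重叠的句子窗口为检索单元(子块),每个子块保留指向父文档的指针;检索命中后按父文档去重聚合,供生成阶段重建完整上下文。窗口大小与步长为示意超参数,并非投稿系统的实际配置。

```python
def child_chunks(docs, win=3, stride=2):
    # Overlapping sentence windows are the retrieval units; each keeps a
    # pointer back to its parent document for context reconstruction.
    chunks = []
    for pid, doc in enumerate(docs):
        sents = [s.strip() for s in doc.split(".") if s.strip()]
        for i in range(0, max(len(sents) - win + 1, 1), stride):
            chunks.append({"parent": pid, "text": ". ".join(sents[i:i + win])})
    return chunks

def parents_of(hits, chunks, k=2):
    # Aggregate child-level hits to deduplicated parent documents, in hit order.
    seen, parents = set(), []
    for idx in hits:
        pid = chunks[idx]["parent"]
        if pid not in seen:
            seen.add(pid)
            parents.append(pid)
    return parents[:k]

docs = ["A. B. C. D. E.", "X. Y. Z."]
chunks = child_chunks(docs)
top_parents = parents_of([2, 0, 1], chunks)   # child hits -> parent docs
```

细粒度检索作用在子块上,而送入生成模型的是完整父文档,这正是摘要中"检索与语境重建分离"的含义。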
[IR-3] MUDY: Multi-Granular Dynamic Candidate Contextualization for Unsupervised Keyphrase Extraction SIGIR2026
【速读】:该论文旨在解决现有基于预训练语言模型(Pre-trained Language Models, PLMs)的关键词提取方法在捕捉文档中分散子主题相关的局部上下文重要性方面存在的不足,即虽然这些方法能够较好地识别全局语义相关性强的关键词,但在处理局部语境敏感的关键短语时表现欠佳。解决方案的核心在于提出一种以上下文为中心的框架MUDY,其关键创新包括:(1) 基于提示(prompt-based)的评分机制,通过候选词感知加权增强对局部上下文重要性的建模;(2) 基于自注意力机制的评分模块,利用PLMs中的多粒度注意力模式,在文档整体和片段级别同时评估候选关键短语的重要性,从而实现多粒度上下文显著性(multi-granular contextual salience)的有效捕捉。
链接: https://arxiv.org/abs/2605.00597
作者: Hyeongu Kang,Susik Yoon
机构: Korea University (韩国大学)
类目: Information Retrieval (cs.IR)
备注: Accepted to SIGIR 2026
Abstract:Keyphrase extraction aims to automatically identify concise phrases that effectively represent the content of a document. While recent methods leveraging pre-trained language models (PLMs) have significantly improved the extraction of keyphrases with strong global semantic relevance, they often fall short in capturing the local contextual importance of keyphrases tied to specific subtopics dispersed in a document. In this paper, we propose a novel context-centric framework, MUDY, that effectively captures multi-granular contextual salience of candidate keyphrases. MUDY employs two complementary components: (1) a prompt-based scoring that estimates the generation likelihood of each candidate keyphrase, augmented with candidate-aware weighting to better reflect its local contextual importance, and (2) a self-attention-based scoring that utilizes multi-granular attention patterns from PLMs to assess candidate significance at both the document-wide and segment-specific levels. Evaluations on four real-world datasets demonstrate that MUDY outperforms state-of-the-art baselines in top-k accuracy at various cutoff thresholds. In-depth quantitative and qualitative analyses further highlight the efficacy of context-centric keyphrase extraction with multi-granular saliency. For reproducibility, the source code of MUDY is available at this https URL.
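MUDY 的两路打分(提示生成似然与注意力显著性)如何融合,可用如下草图粗略示意:两组候选得分分别做 min-max 归一化,再按权重 α 线性混合。α 与归一化方式均为笔者假设,论文中的具体融合策略可能不同。

```python
import numpy as np

def combine_scores(prompt_scores, attn_scores, alpha=0.5):
    # Min-max normalize both candidate score lists, then blend linearly.
    def norm(x):
        x = np.asarray(x, float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    return alpha * norm(prompt_scores) + (1 - alpha) * norm(attn_scores)

# toy candidates: one signal ranks them ascending, the other descending
fused = combine_scores([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

两路信号互补时(如本例中排序完全相反),融合得分会趋于平衡,体现"全局语义 + 局部语境"共同决定候选排名。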
[IR-4] When More Reformulations Hurt: Avoiding Drift using Ranker Feedback SIGIR26
【速读】:该论文旨在解决现代检索系统中因查询改写(query reformulation)和神经重排序(neural reranking)带来的计算成本高与召回率(recall)和查询漂移(query drift)之间权衡的问题。传统方法通过生成大量改写查询来提升召回,但盲目合并或全量重排序会导致资源浪费且加剧漂移现象。解决方案的关键在于提出一个预算感知的检索框架ReformIR,其核心创新是将改写查询视为第一类特征,并利用强重排序器作为教师模型进行在线相关性估计;在此基础上,ReformIR构建大规模候选池并训练轻量级代理模型(surrogate model),基于改写特定的检索信号预测文档效用,在固定重排序预算下自适应地优先选择改写和文档,从而在提升召回的同时主动抑制漂移,实现高效、精准的检索优化。
链接: https://arxiv.org/abs/2605.00560
作者: V Venktesh,Mandeep Rathee,Avishek Anand
机构: Stockholm University (斯德哥尔摩大学); L3S Research Center (L3S 研究中心); Delft University of Technology (代尔夫特理工大学)
类目: Information Retrieval (cs.IR)
备注: Accepted to SIGIR 26 full paper track 11 pages
Abstract:Modern retrieval pipelines increasingly rely on query reformulation and neural reranking to improve effectiveness, but this comes at a significant computational cost and introduces a fundamental tradeoff between recall and query drift. Generating many reformulated queries can substantially increase recall, yet naively merging or exhaustively reranking their results is prohibitively expensive. In this work, we argue that the core challenge is not reformulation generation itself, but the adaptive selection of reformulations and their retrieved documents under a strict inference budget. We propose ReformIR, a budget-aware retrieval framework that treats query reformulations as first-class features and performs online relevance estimation using a strong reranker as a teacher. Given multiple reformulated queries, ReformIR constructs a large candidate pool and learns a lightweight surrogate model that estimates document utility from reformulation-specific retrieval signals. Under a fixed reranking budget, the surrogate adaptively prioritizes both reformulations and documents, selectively querying a teacher reranker anchored to the original query. This process increases recall while actively suppressing drift through online feature selection over reformulations. We conduct extensive experiments on the MSMARCO passage corpora and TREC Deep Learning benchmarks (DL19-DL22). Our results show that ReformIR consistently outperforms existing reformulation strategies, particularly as the number of reformulations increases, where prior methods suffer from severe quality degradation due to drift. Our findings also suggest a shift in retrieval system design, rather than using large language models as rerankers, their capacity is more effectively leveraged in the reformulation stage with feedback-driven optimization.
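ReformIR 的预算感知候选筛选骨架可用如下草图示意:对多个改写查询汇成的候选池,先用廉价代理模型估计效用并排序,只把预算内的 top-B 送给昂贵的教师重排序器。候选结构与代理函数均为笔者假设,仅说明"预算内自适应选择"这一核心思路。

```python
def select_for_rerank(pool, surrogate, budget):
    # Rank pooled candidates from all reformulations by a cheap surrogate
    # utility estimate; spend the teacher-reranker budget only on the top-B.
    return sorted(pool, key=surrogate, reverse=True)[:budget]

# toy pool: (doc_id, surrogate utility estimated from reformulation signals)
pool = [("d1", 0.2), ("d2", 0.9), ("d3", 0.5), ("d4", 0.7)]
picked = select_for_rerank(pool, surrogate=lambda c: c[1], budget=2)
```

教师重排序器只对 picked 中的文档打分,其余候选被代理模型过滤掉,这样改写数量增加时重排序成本仍保持固定。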
[IR-5] Hierarchical Abstract Tree for Cross-Document Retrieval-Augmented Generation ICML2026
【速读】:该论文旨在解决现有树状检索增强生成(Tree-RAG)方法在处理跨文档多跳问答(cross-document multi-hop questions)时面临的三大挑战:(1) 分布适应性差,即k-means聚类因刚性分布假设引入噪声;(2) 结构隔离问题,树索引缺乏显式的跨文档连接;(3) 抽象粒度粗化,导致细粒度信息被掩盖。解决方案的关键在于提出Ψ-RAG框架,其核心包含两个组件:一是通过迭代“合并与坍缩”过程构建自适应数据分布的层次化抽象树索引,无需先验假设;二是设计多粒度检索代理,结合重组织查询与代理驱动的混合检索器,实现对知识库的智能交互。该方案显著提升了跨文档多跳问答任务的性能,在多个基准测试中分别优于RAPTOR和HippoRAG 2。
链接: https://arxiv.org/abs/2605.00529
作者: Ziwen Zhao,Menglin Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: ICML 2026
Abstract:Retrieval-augmented generation (RAG) enhances large language models with external knowledge, and tree-based RAG organizes documents into hierarchical indexes to support queries at multiple granularities. However, existing Tree-RAG methods designed for single-document retrieval face critical challenges in scaling to cross-document multi-hop questions: (1) poor distribution adaptability, where k-means clustering introduces noise due to rigid distribution assumptions; (2) structural isolation, as tree indexes lack explicit cross-document connections; and (3) coarse abstraction, which obscures fine-grained details. To address these limitations, we propose Ψ-RAG, a tree-RAG framework with two key components. First, a hierarchical abstract tree index built through an iterative “merging and collapse” process that adapts to data distributions without a priori assumption. Second, a multi-granular retrieval agent that intelligently interacts with the knowledge base with reorganized queries and an agent-powered hybrid retriever. Ψ-RAG supports diverse tasks from token-level question answering to document-level summarization. On cross-document multi-hop QA benchmarks, it outperforms RAPTOR by 25.9% and HippoRAG 2 by 7.4% in average F1 score. Code is available at this https URL.
[IR-6] LLM-Oriented Information Retrieval: A Denoising-First Perspective SIGIR2026
【速读】:该论文旨在解决当前信息检索(Information Retrieval, IR)系统在面向大语言模型(Large Language Models, LLMs)应用时所面临的“噪声敏感性”问题,即LLMs因有限的注意力预算而对检索结果中的误导或无关信息高度脆弱,进而导致幻觉和推理失败。其核心解决方案在于将“最大化可用证据密度与可验证性”作为关键目标,提出一个四阶段IR挑战框架(从不可访问到未发现、错位再到不可验证),并构建了一个覆盖索引、检索、上下文工程、验证及代理工作流的信号-噪声优化技术分类体系,从而系统性提升检索内容的质量与可靠性,以适配LLMs驱动的生成式AI(Generative AI)应用场景。
链接: https://arxiv.org/abs/2605.00505
作者: Lu Dai,Liang Sun,Fanpu Cao,Ziyang Rao,Cehao Yang,Hao Liu,Hui Xiong
机构: Hong Kong University of Science and Technology (香港科技大学); Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: SIGIR 2026
Abstract:Modern information retrieval (IR) is no longer consumed primarily by humans but increasingly by large language models (LLMs) via retrieval-augmented generation (RAG) and agentic search. Unlike human users, LLMs are constrained by limited attention budgets and are uniquely vulnerable to noise; misleading or irrelevant information is no longer just a nuisance, but a direct cause of hallucinations and reasoning failures. In this perspective paper, we argue that denoising (maximizing usable evidence density and verifiability within a context window) is becoming the primary bottleneck across the full information access pipeline. We conceptualize this paradigm shift through a four-stage framework of IR challenges: from inaccessible to undiscoverable, to misaligned, and finally to unverifiable. Furthermore, we provide a pipeline-organized taxonomy of signal-to-noise optimization techniques, spanning indexing, retrieval, context engineering, verification, and agentic workflow. We also present research works on information denoising in domains that rely heavily on retrieval such as lifelong assistant, coding agent, deep research, and multimodal understanding.
[IR-7] Time-Interval-Aware Disentangled Expert Modeling for Next-Basket Recommendation
[Quick Read]: This paper tackles two core problems in next-basket recommendation (NBR): (1) existing methods entangle habitual repurchase behavior and exploratory interest in a single representation, letting habits overshadow discovery of new items; and (2) they rely on discrete sequence modeling that ignores continuous time intervals and item-specific periodic patterns. The key to the proposed Time-Interval Disentangled Experts (TIDE) is twofold: a Hawkes-enhanced Fourier Time Encoding that captures item-level temporal periodicity and dynamic decay, and a dual-expert architecture in which a Habit Expert models recurring needs while a Pattern-Guided Exploration Expert drives novelty discovery, with an item-aware gating mechanism adaptively balancing the two.
Link: https://arxiv.org/abs/2605.00499
Authors: Zhiying Deng, Yuan Fu, Usman Farooq, Ziwei Tian, Wei Liu, Jianjun Li
Affiliations: Central China Normal University; Huazhong University of Science and Technology
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Next-basket recommendation (NBR) is a type of recommendation that aims to predict a set of items a user will purchase based on their historical transaction basket sequences. It is governed by a dynamic interplay between two distinct user intents: habitual repurchase, which involves repeating past behaviors, and exploratory interest, which involves discovering new items. However, existing NBR methods generally suffer from two limitations: (1) they often entangle these conflicting motives within a single representation, causing habits to overshadow discovery, and (2) they rely on discrete sequential modeling that ignores continuous-time intervals and item-specific periodicities. In this paper, we propose a novel solution named Time-Interval Disentangled Experts (TIDE) to address these challenges. TIDE incorporates a Hawkes-enhanced Fourier Time Encoding to capture item-specific temporal periodicities and dynamic decay. To decouple user intentions, TIDE utilizes a dual-expert architecture that integrates a Habit Expert for recurring needs and a Pattern-Guided Exploration Expert for discovery. Combined with an item-aware gating mechanism, TIDE adaptively balances repurchase and exploration. Extensive experiments on four diverse real-world datasets demonstrate that TIDE consistently outperforms representative state-of-the-art NBR methods.
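The idea of combining periodic time features with Hawkes-style decay can be sketched in a few lines: sinusoids capture item-specific cycles in a continuous time interval, and an exponential kernel lets older events contribute less. The function name, feature layout, and decay form below are illustrative assumptions, not TIDE's actual encoder.

```python
import math

def time_encoding(dt, freqs, decay=0.1):
    """Map a continuous interval dt (e.g. days since last purchase) to features.

    Sinusoids at item-specific frequencies capture periodicity; an exponential
    kernel, as in a Hawkes process, models the decaying influence of older
    events. Names, shapes, and the decay rate are invented for illustration.
    """
    feats = []
    for w in freqs:
        feats.append(math.sin(w * dt))
        feats.append(math.cos(w * dt))
    excitation = math.exp(-decay * dt)  # older events contribute less
    return [f * excitation for f in feats]

weekly = 2 * math.pi / 7  # a 7-day cycle, for an item bought weekly
vec = time_encoding(7.0, [weekly])  # exactly one cycle later
```

At dt = 7 the periodic sine component returns to ~0 and the cosine component to its peak, scaled down by the decay factor, which is the kind of signal a repurchase expert could exploit.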
[IR-8] FollowTable: A Benchmark for Instruction-Following Table Retrieval SIGIR2026
[Quick Read]: This paper addresses a limitation of traditional table retrieval (TR) for LLM-driven agentic systems: existing methods judge relevance only by topical semantic similarity and cannot respond to fine-grained user instructions about a table's content scope (e.g., inclusion/exclusion constraints) or schema (e.g., column semantics, representation granularity). The paper therefore formalizes a new task, Instruction-Following Table Retrieval (IFTR), whose core challenge is jointly satisfying topical relevance and instruction constraints. The key contributions are FollowTable, the first large-scale benchmark for IFTR, and a new metric, the Instruction Responsiveness Score, which quantifies how consistently retrieval rankings adapt to different instructions, exposing systematic weaknesses of current retrievers in fine-grained instruction understanding and schema-grounded constraints and pointing to directions for future work.
Link: https://arxiv.org/abs/2605.00400
Authors: Rihui Jin, Yuchen Lu, Ting Zhang, Jun Wang, Kuicai Dong, Zhaocheng Du, Dongping Liu, Gang Wang, Yong Liu, Guilin Qi
Affiliations: Southeast University; Huawei Noah's Ark Lab
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: SIGIR 2026 Accepted
Abstract:Table Retrieval (TR) has traditionally been formulated as an ad-hoc retrieval problem, where relevance is primarily determined by topical semantic similarity. With the growing adoption of LLM-based agentic systems, access to structured data is increasingly instruction-driven, where relevance is conditional on explicit content and schema constraints rather than topical similarity alone. We therefore formalize Instruction-Following Table Retrieval (IFTR), a new task that requires models to jointly satisfy topical relevance and fine-grained instruction constraints. We identify two core challenges in IFTR: (i) sensitivity to content scope, such as inclusion and exclusion constraints, and (ii) awareness of schema-grounded requirements, including column semantics and representation granularity, capabilities largely absent in existing retrievers. To support systematic evaluation, we introduce FollowTable, the first large-scale benchmark for IFTR, constructed via a taxonomy-driven annotation pipeline. We further propose a new metric, termed the Instruction Responsiveness Score, to evaluate whether retrieval rankings consistently adapt to user instructions relative to a topic-only baseline. Our results indicate that existing retrieval models struggle to follow fine-grained instructions over tabular data. In particular, they exhibit systematic biases toward surface-level semantic cues and remain limited in handling schema-grounded constraints, highlighting substantial room for future improvements.
[IR-9] Negative Data Mining for Contrastive Learning in Dense Retrieval at IKEA.com
[Quick Read]: This paper addresses how negative-sample quality shapes dense retrieval performance, in the context of IKEA product search. The solution has two key parts: structured negative sampling strategies that use the product hierarchy (taxonomy) and attributes to generate semantically challenging negatives, and a scalable LLM-as-a-judge relevance evaluation that scores every candidate product against each query, replacing sparse human annotation or random sampling to yield high-quality training data. Offline, this delivers a +2.6% average category accuracy gain, but an A/B test on long-tail queries shows no significant difference in user engagement, revealing a disconnect between training data and real user behavior (e.g., zero-click rates above 50% for 67% of popular searches) and underscoring the need to ground offline evaluation in real query-intent distributions and zero-click patterns to bridge the gap between offline metrics and online behavior.
Link: https://arxiv.org/abs/2605.00353
Authors: Eva Agapaki, Amritpal Singh Gill
Affiliations: IKEA Retail (Ingka Group)
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Contrastive learning is a core component of modern retrieval systems, but its effectiveness heavily relies on the quality of negative examples used during training. In this work, we present a systematic approach to improving dense retrieval for IKEA product search through structured negative sampling strategies and scalable LLM-as-a-judge relevance evaluation. Building on IKEA Search Engine's late-interaction retrieval architectures, we introduce two key contributions: (1) structured negative sampling strategies that leverage product hierarchical taxonomy and product attributes to generate semantically challenging negatives, and (2) a comprehensive LLM-based evaluation methodology for generating training data. Rather than relying on sparse human annotations or random sampling, our LLM-based evaluation system allocates a score for all candidate products against each query. Our methodology achieves +2.6% average category accuracy in offline experiments on real user queries from the Canadian market. However, our A/B test on long-tail queries showed no statistically significant differences in user engagement metrics between the improved and baseline models (p > 0.05). We trace this gap to user search behavior: 67% of popular searches exhibit zero-click rates above 50%, indicating that a substantial proportion of search sessions result in no product engagement regardless of result ranking. These findings underscore the importance of hard negative mining but also the need for grounding training data and offline evals in real user search behavior, including query intent distribution and zero-click patterns, to bridge the gap between offline retrieval quality and online user engagement.
[IR-10] DynamicPO: Dynamic Preference Optimization for Recommendation DASFAA2026
[Quick Read]: This paper targets preference optimization collapse in multi-negative preference optimization for large language model (LLM)-based recommenders: as the number of negative samples grows, recommendation performance degrades even though training loss keeps falling. The root cause is gradient suppression: easily discriminable negatives dominate optimization, under-weighting the boundary-critical negatives that actually define user preference boundaries and thus weakening the model's decision boundary. The key to the proposed DynamicPO (Dynamic Preference Optimization) is two adaptive mechanisms: Dynamic Boundary Negative Selection, which identifies and prioritizes informative negatives near the decision boundary, and Dual-Margin Dynamic beta Adjustment, which calibrates per-sample optimization strength according to boundary ambiguity, achieving more effective boundary-sharpening optimization at negligible computational cost.
Link: https://arxiv.org/abs/2605.00327
Authors: Xingyu Hu, Kai Zhang, Jiancan Wu, Shuli Wang, Chi Wang, Wenshuai Chen, Yinhua Zhu, Haitao Wang, Xingxing Wang, Xiang Wang
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: DASFAA 2026
Abstract:In large language model (LLM)-based recommendation systems, direct preference optimization (DPO) effectively aligns recommendations with user preferences, requiring multi-negative objective functions to leverage abundant implicit-feedback negatives and sharpen preference boundaries. However, our empirical analyses reveal a counterintuitive phenomenon, preference optimization collapse, where increasing the number of negative samples can lead to performance degradation despite a continuously decreasing training loss. We further theoretically demonstrate that this collapse arises from gradient suppression, caused by the dominance of easily discriminable negatives over boundary-critical negatives that truly define user preference boundaries. As a result, boundary-relevant signals are under-optimized, weakening the model’s decision boundary. Motivated by these observations, we propose DynamicPO (Dynamic Preference Optimization), a lightweight and plug-and-play framework comprising two adaptive mechanisms: Dynamic Boundary Negative Selection, which identifies and prioritizes informative negatives near the model’s decision boundary, and Dual-Margin Dynamic beta Adjustment, which calibrates optimization strength per sample according to boundary ambiguity. Extensive experiments on three public datasets show that DynamicPO effectively prevents optimization collapse and improves recommendation accuracy on multi-negative preference optimization methods, with negligible computational overhead. Our code and datasets are available at this https URL.
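The two mechanisms above can be caricatured in a few lines: keep only the negatives scored closest to the positive (small margin, near the boundary), and derive a per-sample beta from the margin. The selection rule and the margin-to-beta mapping below are our own placeholders, not DynamicPO's formulas.

```python
import math

def select_boundary_negatives(pos_score, neg_scores, k):
    """Keep the k negatives scored closest to the positive: a small margin
    means the sample sits near the decision boundary and carries the most
    gradient signal; easy negatives are dropped before they can dominate."""
    return sorted(neg_scores, key=lambda s: abs(pos_score - s))[:k]

def dynamic_beta(margin, beta_min=0.1, beta_max=1.0, tau=1.0):
    """Per-sample DPO temperature from the positive-negative margin.

    The monotone sigmoid squashing here is only a placeholder; how boundary
    ambiguity should map to optimization strength is the paper's design and
    is not reproduced here.
    """
    gate = 1.0 / (1.0 + math.exp(-margin / tau))  # in (0, 1)
    return beta_min + (beta_max - beta_min) * gate

# Negatives scored 0.79 and 0.75 hug the positive's 0.8 and are kept.
hard = select_boundary_negatives(0.8, [0.79, 0.20, 0.75, 0.10], k=2)
```

Plugging such a per-sample beta into a multi-negative DPO loss is where the "plug-and-play" framing comes from: neither step requires changing the base recommender.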
[IR-11] Intelligent Elastic Feature Fading: Enabling Model Retrain-Free Feature Efficiency Rollouts at Scale
[Quick Read]: This paper addresses the difficulty of managing feature efficiency in large-scale ranking systems, where the conventional model-retraining path brings long iteration cycles (3-6 months), heavy GPU consumption, and low rollout throughput. The core of the proposed Intelligent Elastic Feature Fading (IEFF) is to elastically adjust feature coverage and distribution at serving time, enabling retrain-free feature-efficiency rollouts. IEFF supports incremental coverage adjustments so models adapt naturally through recurring training, and combines strict safety guardrails, reversibility mechanisms, and comprehensive monitoring to keep the system stable at scale. Experiments show that gradual feature fading prevents 50-55% of the online performance degradation caused by abrupt removal while keeping model behavior stable, validating the approach's effectiveness and scalability for industrial ranking systems.
Link: https://arxiv.org/abs/2605.00324
Authors: Jieming Di, Xiaoyu Chen, Ying She, Siyu Wang, Lizzie Liu, Fenggang Wu, Jiaoying Mu, Tony Tsui, Amr Elroumy, Hsing Tang, Zewei Jiang, Qiao Yang, Lin Qi, Haibo Lin, Weifeng Cui, Daniel Li, Kapil Gupta, Shivendra Pratap Singh, Jie Zheng, Arnold Overwijk, Ling Leng, Sri Reddy, Robert Malkin, Rocky Liu
Affiliations: Meta
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 8 pages, 2 figures, 3 tables
Abstract:Large-scale ranking systems depend on thousands of features derived from user behavior across multiple time horizons. Improving feature efficiency typically requires model retraining, resulting in long iteration cycles (3-6 months), substantial GPU resource consumption, and limited rollout throughput. We introduce Intelligent Elastic Feature Fading (IEFF), a production infrastructure system that enables retrain-free feature efficiency rollouts by elastically controlling feature coverage and distribution at serving time. IEFF supports incremental feature coverage adjustments while models adapt through recurring training, eliminating dependencies on explicit retraining cycles. The system incorporates strict safety guardrails, reversibility mechanisms, and comprehensive monitoring to ensure stability at scale. Across multiple production use cases, IEFF accelerates efficiency-related rollouts by 5 \times , eliminates retraining-related GPU overhead, and enables faster capacity recycling. Extensive offline and online experiments demonstrate that gradual feature fading prevents 50-55% of online performance degradation compared to abrupt feature removal, while maintaining stable model behavior. These results establish elastic, system-level feature fading as a practical and scalable approach for managing feature efficiency in modern industrial ranking systems.
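Serving-time coverage control of this kind is often implemented with deterministic hashing: ramping a feature's coverage from 1.0 toward 0.0 over time fades it gradually instead of removing it abruptly, with no retraining. The sketch below illustrates that general pattern only; it is not Meta's IEFF implementation, and the schedule values are invented.

```python
import hashlib

def feature_active(feature_name, request_id, coverage):
    """Decide deterministically whether a feature is served for a request,
    at a given coverage in [0, 1]. Hashing (feature, request) keeps the
    decision stable and uniformly distributed across traffic."""
    h = hashlib.sha256(f"{feature_name}:{request_id}".encode()).digest()
    bucket = int.from_bytes(h[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < coverage

# A hypothetical fade schedule: step coverage down while the model keeps
# adapting through its recurring training cycles.
schedule = [1.0, 0.75, 0.5, 0.25, 0.0]
```

Because the decision is a pure function of the inputs, a rollout is trivially reversible: restoring the previous coverage value restores the previous serving behavior.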
[IR-12] Structure-Aware Chunking for Tabular Data in Retrieval-Augmented Generation
[Quick Read]: This paper addresses the failure of existing retrieval-augmented generation (RAG) chunking strategies, designed for unstructured text, to handle tabular documents such as CSV and Excel files that are ubiquitous in enterprise data pipelines. The key to the proposed structure-aware tabular chunking (STC) framework is a hierarchical Row Tree built from row-level units, encoding each row as a key-value block; token-constrained splitting aligned with structural boundaries and overlap-free greedy merging preserve the semantic relationships among fields within a row while increasing chunk density and token utilization, substantially reducing fragmentation and improving retrieval performance.
Link: https://arxiv.org/abs/2605.00318
Authors: Pooja Guttal, Varun Magotra, Vasudeva Mahavishnu, Natasha Chanto, Sidharth Sivaprasad, Manas Gaur
Affiliations: Altumatim; University of Maryland, Baltimore County
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 5 pages, 1 figure, 4 tables, 1 algorithm, work in progress
Abstract:Tabular documents such as CSV and Excel files are widely used in enterprise data pipelines, yet existing chunking strategies for retrieval-augmented generation (RAG) are primarily designed for unstructured text and do not account for tabular structure. We propose a structure-aware tabular chunking (STC) framework that operates on row-level units by constructing a hierarchical Row Tree representation, where each row is encoded as a key-value block. STC performs token-constrained splitting aligned with structural boundaries and applies overlap-free greedy merging to produce dense, non-overlapping chunks. This design preserves semantic relationships between fields within a row while improving token utilization and reducing fragmentation. Across evaluations on the MAUD dataset, STC reduces chunk count by up to 40% and 56% compared to standard recursive and key-value based baselines, respectively, while improving token utilization and processing efficiency. In retrieval benchmarks, STC improves MRR from 0.3576 to 0.5945 in a hybrid setting and increases Recall@1 from 0.366 to 0.754 in BM25-only retrieval. These results demonstrate that preserving structure during chunking improves retrieval performance, highlighting the importance of structure-aware chunking for RAG over tabular data.
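The row-level key-value blocks and overlap-free greedy merging can be sketched as follows. The encoding format and the whitespace token counter are simplifying assumptions for illustration, not STC's exact design; the point is that rows are never split, so within-row field relationships stay intact.

```python
def row_block(header, row):
    """Encode one table row as a key-value block (a Row Tree leaf unit)."""
    return "; ".join(f"{k}={v}" for k, v in zip(header, row))

def greedy_merge(blocks, max_tokens, count=lambda s: len(s.split())):
    """Greedily pack whole row blocks into non-overlapping chunks under a
    token budget, never splitting a row across chunks. Token counting here
    is a crude whitespace approximation of a real tokenizer."""
    chunks, cur, cur_tokens = [], [], 0
    for b in blocks:
        t = count(b)
        if cur and cur_tokens + t > max_tokens:
            chunks.append("\n".join(cur))
            cur, cur_tokens = [], 0
        cur.append(b)
        cur_tokens += t
    if cur:
        chunks.append("\n".join(cur))
    return chunks

header = ["name", "price", "stock"]
rows = [["desk", "120", "4"], ["chair", "45", "30"], ["lamp", "18", "112"]]
blocks = [row_block(header, r) for r in rows]
chunks = greedy_merge(blocks, max_tokens=6)  # two rows fit per chunk here
```

Compared with character-window chunking, this keeps "price=45" attached to "name=chair" in the same chunk, which is exactly the structure a retriever needs to answer row-scoped questions.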
[IR-13] Retrieval-Augmented Reasoning for Chartered Accountancy
[Quick Read]: This paper addresses the limited reliability of large language models (LLMs) on complex, jurisdiction-specific tasks such as Indian Chartered Accountancy (CA), notably multi-step numerical computation, deep understanding of legal regulations, and deployment at scale in resource-constrained settings. The key to the solution is a parameter-efficient retrieval-augmented generation (RAG) framework, CA-ThinkFlow, which pairs a 4-bit-quantized 14B reasoning model (14B-DeepSeek-R1) with layout-aware Docling document extraction; retrieved information is automatically injected into the prompt, and the model's built-in Chain-of-Thought (CoT) capability constructs context to produce accurate answers. On the multi-level CA-Ben benchmark it reaches performance comparable to large proprietary models (a Scholastic Reliability Coefficient equal to 68.75% of GPT-4o and Claude 3.5 Sonnet), though deep reasoning over complex regulatory texts in areas such as taxation remains a weakness.
Link: https://arxiv.org/abs/2605.00257
Authors: Jatin Gupta, Akhil Sharma, Saransh Singhania, Ali Imam Abidi
Affiliations: Sharda University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 9 pages, 2 figures, and 3 tables
Abstract:The inception of Large Language Models (LLMs) has catalyzed AI adoption in the finance sector, yet their reliability in complex, jurisdiction-specific tasks like Indian Chartered Accountancy (CA) remains limited. The models struggle with numerical tasks that require multiple steps, need advanced knowledge of legal regulations, and are not feasible to scale in settings with limited access to resources. We present CA-ThinkFlow, a parameter-efficient Retrieval-Augmented Generation (RAG) framework which operates with a 14B, 4-bit-quantized reasoning model, 14B-DeepSeek-R1, and a layout-aware Docling extraction system which maintains document structure during extraction. CA-ThinkFlow uses a basic RAG method which automatically adds retrieved information into the prompt, while it depends on the model's built-in Chain-of-Thought (CoT) functions to create context and produce correct answers. The system we developed operates at performance levels which match large proprietary models on the multi-level CA-Ben benchmark, achieving Scholastic Reliability Coefficient (SRC) results which equal 68.75% of GPT-4o and Claude 3.5 Sonnet. The framework shows high efficiency and strength in handling parameters, but essential reasoning abilities fail to process complex regulatory texts in fields such as Taxation.
[IR-14] RSAT: Structured Attribution Makes Small Language Models Faithful Table Reasoners ACL2026
[Quick Read]: This paper addresses the lack of verifiability when language models answer table questions: users cannot trace which cells informed which reasoning steps. The key to the proposed RSAT method is a two-phase training scheme: Phase 1 (SFT) teaches small language models (SLMs, 1-8B) to emit reasoning in a structured JSON format grounded in verified reasoning traces, and Phase 2 (GRPO) optimizes a composite reward centered on NLI-based faithfulness, with citation-validity and parsimony terms. Experiments show a 3.7x faithfulness improvement (0.224 to 0.826) with near-perfect citation validity (0.992), and demonstrate that attribution must be integrated into the reasoning process rather than retrofitted afterwards.
Link: https://arxiv.org/abs/2605.00199
Authors: Jugal Gajjar, Kamalasankari Subramaniakuppusamy
Affiliations: The George Washington University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 8 pages, 8 tables, 9 figures, and a 3-page appendix. Accepted at the SURGeLLM Workshop at ACL 2026 and will be included in the proceedings
Abstract:When a language model answers a table question, users have no way to verify which cells informed which reasoning steps. We introduce RSAT, a method that trains small language models (SLMs, 1-8B) to produce step-by-step reasoning with cell-level citations grounded in table evidence. Phase 1 (SFT) teaches a structured JSON output format from verified reasoning traces. Phase 2 (GRPO) optimizes a composite reward centered on NLI-based faithfulness, alongside citation validity and parsimony. Across six models from two families, Qwen 2.5 (1.5B/3B/7B) and Llama 3 (1B/3B/8B), RSAT improves faithfulness 3.7 \times over SFT alone (0.224 \rightarrow 0.826), with near-perfect citation validity (0.992). Post-hoc attribution collapses below 13% format success, confirming that attribution must be integrated into reasoning, not retrofitted. Ablations show the faithfulness reward is essential: removing it drops faithfulness from 0.97 to 0.03.
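A composite reward of this shape might look as follows. The weights, the validity and parsimony scoring, and the input structure are all invented for illustration; in particular, the paper's faithfulness term comes from an NLI model, which we stand in for here with an externally supplied score.

```python
def composite_reward(steps, table_cells, faithfulness,
                     w_faith=0.7, w_cite=0.2, w_parsimony=0.1):
    """Combine a faithfulness score (stand-in for an NLI judgment) with
    citation validity (do cited cells exist in the table?) and parsimony
    (fewer citations per step). Weights here are arbitrary placeholders."""
    cited = [c for s in steps for c in s["cites"]]
    cite_valid = sum(c in table_cells for c in cited) / max(len(cited), 1)
    parsimony = 1.0 / (1.0 + len(cited) / max(len(steps), 1))
    return w_faith * faithfulness + w_cite * cite_valid + w_parsimony * parsimony

# Cells are identified as (row_id, column) pairs in this toy encoding.
table_cells = {("r1", "revenue"), ("r2", "revenue")}
good = [{"text": "Compare revenue.", "cites": [("r1", "revenue"), ("r2", "revenue")]}]
bad = [{"text": "Compare revenue.", "cites": [("r9", "missing")]}]
r_good = composite_reward(good, table_cells, faithfulness=1.0)
r_bad = composite_reward(bad, table_cells, faithfulness=1.0)
```

Because a hallucinated citation zeroes out the validity term, the reward naturally pushes the policy toward cell references that actually exist, which is the behavior the two-phase training aims to lock in.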
[IR-15] DeGenTWeb: A First Look at LLM-dominant Websites
[Quick Read]: This paper addresses the lack of reliable evidence behind claims that generative AI content is taking over the web: existing reports are rarely based on representative web samples and rely on opaque detection methods, distorting our picture of how prevalent LLM-generated content really is. The core solution, DeGenTWeb, systematically identifies LLM-dominant websites (sites whose content is generated by LLMs with little human input); its key idea is adapting LLM-text detectors to the web-page level and aggregating per-page results into accurate site-level classification, quantifying for the first time, at web scale (Common Crawl and Bing search results), the broad presence of LLM content and its growth over time.
Link: https://arxiv.org/abs/2605.00087
Authors: Sichang Steven He, Calvin Ardi, Ramesh Govindan, Harsha V. Madhyastha
Affiliations: Unknown
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 6 pages, 6 figures, 13 pages total; in submission
Abstract:Many recent news reports have claimed that content generated by large language models (LLMs) is taking over the web. However, these claims are typically not based on a representative sample of the web and the methodology underlying them is often opaque. Moreover, when aiming to minimize the chances of falsely attributing human-authored content to LLMs, we find that detectors of LLM-generated text perform much worse than advertised. Consequently, we lack an understanding of the true prevalence and characteristics of LLM content on the web. We describe DeGenTWeb which systematically identifies LLM-dominant websites: sites whose content has been generated using LLMs with little human input. We show how to adapt detectors of LLM-generated text for use on web pages, and how to aggregate detection results from multiple pages on a site for accurate site-level categorization. Using DeGenTWeb, we find that LLM-dominant sites are highly prevalent both in data from Common Crawl and in Bing's search results, and that this share is growing over time. We also show that continuing to accurately identify such sites appears challenging given the capabilities of the latest LLMs.
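Aggregating per-page detector scores into a site-level label could look like the sketch below: require many pages, a high per-page confidence, and a large flagged fraction, trading recall for precision so human-authored sites are rarely mislabeled. Every threshold here is a hypothetical choice, not DeGenTWeb's calibration.

```python
def site_is_llm_dominant(page_scores, page_threshold=0.9,
                         site_fraction=0.5, min_pages=5):
    """Label a site LLM-dominant only when enough pages are confidently
    flagged. A strict per-page threshold reduces false attribution of
    human-authored content; aggregating over many pages smooths out the
    noise of any single-page detection. All thresholds are illustrative."""
    if len(page_scores) < min_pages:
        return False  # too little evidence for a site-level claim
    flagged = sum(s >= page_threshold for s in page_scores)
    return flagged / len(page_scores) >= site_fraction

confident = [0.95, 0.97, 0.93, 0.99, 0.96, 0.94]   # most pages flagged
mixed = [0.20, 0.30, 0.95, 0.10, 0.05]             # one suspicious page
```

The same structure makes the precision/recall trade-off explicit: raising `page_threshold` or `site_fraction` shrinks the set of sites flagged, which matters when the downstream claim is about web-scale prevalence.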
[IR-16] A Survey of Reasoning-Intensive Retrieval: Progress and Challenges ACL2026
[Quick Read]: This paper addresses the lack of a systematic framework for reasoning-intensive retrieval (RIR) as LLM reasoning capabilities are integrated across the information retrieval (IR) pipeline: existing work is scattered and inconsistent, making a clear development path hard to discern. The key contributions are threefold: a systematic categorization and analysis of RIR benchmarks by knowledge domain and modality to map the current landscape; a structured taxonomy that organizes methods by where and how reasoning is embedded in the retrieval pipeline, with an analysis of their trade-offs and practical applications; and a summary of key challenges and future directions, providing a clear roadmap for this rapidly evolving but fragmented field.
Link: https://arxiv.org/abs/2605.00063
Authors: Yiyang Wei, Tingyu Song, Siyue Zhang, Yilun Zhao
Affiliations: Zhejiang University; University of the Chinese Academy of Sciences; Nanyang Technological University; Yale University
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted to the ACL 2026 Main Conference; camera-ready version
Abstract:Reasoning-Intensive Retrieval (RIR) targets retrieval settings where relevance is mediated by latent inferential links between a query and supporting evidence, rather than semantic similarity. Motivated by the emergent reasoning abilities of Large Language Models (LLMs), recent work integrates these capabilities into the IR field, spanning the entire pipeline from benchmarks to retrievers and rerankers. Despite this progress, the field lacks a systematic framework to organize current efforts and articulate a clear path forward. To provide a clear roadmap for this rapidly growing yet fragmented area, this survey (1) systematizes existing RIR benchmarks by knowledge domains and modalities, providing a detailed analysis of the current landscape; (2) introduces a structured taxonomy that categorizes methods based on where and how reasoning is integrated into the retrieval pipeline, alongside an analysis of their trade-offs and practical applications; and (3) summarizes challenges and future directions to guide research in this evolving field.
[IR-17] Exploring LLM biases to manipulate AI search overview
[Quick Read]: This paper investigates how the inherent biases of LLMs skew source selection in LLM Overview systems, and how those biases can be exploited to manipulate the final output. The key to the solution is training a small language model with reinforcement learning (RL) to rewrite search snippets so as to raise their probability of being selected by the LLM Overview system; the experimental setup restricts the policy to the snippet content and limits reward hacking, mirroring realistic constraints of web search environments. The results confirm that the biases exist and can be optimized against, and show that LLM Overview selection is driven by comparative rather than absolute advantages among candidate sources.
Link: https://arxiv.org/abs/2605.00012
Authors: Roman Smirnov
Affiliations: E-AI.Solutions
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 14 pages, 7 figures
Abstract:Modern large language models (LLMs) are used in many business applications, notably in web search systems and in applications that generate overviews of search results (LLM Overview systems). Such systems use an LLM to select the most relevant sources from search results and generate an answer to the user's query. Many studies have shown that LLMs carry various biases; in an LLM Overview application, both the source-selection and answer-generation stages may be affected by these biases (here we focus mainly on the selection stage). This research investigates the presence of biases in LLM Overview systems and their exploitation to manipulate LLM Overview results. We train a small language model using reinforcement learning to rewrite search snippets to increase their likelihood of being preferred by an LLM Overview. Our experimental setup intentionally restricts the policy to operate only on snippets and limits reward-hacking strategies, reflecting realistic constraints of web search environments. The results show that LLM Overview systems have biases and that reinforcement learning can, in most cases, optimize a snippet's content to manipulate LLM Overview results. We also show that LLM Overview selections are driven by comparative rather than absolute advantages among candidate sources. In addition, we examine safety aspects of LLM Overview manipulation and show that context-poisoning attacks can lead to inaccurate or harmful results.
Human-Computer Interaction
[HC-0] Prop-Chromeleon: Adaptive Haptic Props in Mixed Reality through Generative Artificial Intelligence
[Quick Read]: This paper addresses the break in visual-tactile consistency caused by the absence of haptic feedback in Mixed Reality (MR). The key to the proposed Prop-Chromeleon system is a generative-AI pipeline that, given a user's text prompt, dynamically turns everyday objects into adaptive passive haptic props: it generates and anchors virtual assets whose geometry matches the physical prop and keeps the generated content semantically consistent under prompt constraints, improving the realism and immersion of haptic interaction in MR.
Link: https://arxiv.org/abs/2605.00804
Authors: Haoyu Wang, Fengyuan Zhu, Bingjian Huang, Zhecheng Wang, Ludwig Sidenmark
Affiliations: University of Toronto
Subjects: Human-Computer Interaction (cs.HC)
Comments: Accepted to ACM DIS 2026
Abstract:Mixed Reality (MR) aims to blend digital and physical worlds, but the absence of haptic feedback often breaks visual-tactile consistency. We introduce Prop-Chromeleon, a MR system based on generative artificial intelligence (AI) that dynamically transforms everyday objects into adaptive passive haptic props through user-provided text prompts. Our AI pipeline performs generation and anchoring of virtual assets that align with the shape of physical props, allowing us to study how virtual content generation behaves under geometric and prompt-based constraints. We evaluate Prop-Chromeleon’s effectiveness through a generation study using varied object shapes and user prompts, combining quantitative shape similarity metrics with qualitative prompt fidelity analysis. Our user study further showcases Prop-Chromeleon’s improvements in perceived realism, immersion, and enjoyment compared to static baselines. These results show that shape-aware generation can support both believable haptic interaction and creative engagement in MR.
[HC-1] Modeling Subjective Urban Perception with Human Gaze
[Quick Read]: This paper addresses the neglect of the human perceptual process in existing urban-perception modeling, which predicts subjective urban perception directly from street view images without accounting for cognitive mechanisms such as gaze behavior. The key to the solution is Place Pulse-Gaze, a dataset that augments street view images with synchronized eye-tracking recordings and individual perception labels, and a Gaze-Guided Urban Perception Framework built on it. The framework systematically studies three complementary settings: gaze-only modeling, gaze fused with explicit semantic scene representations, and gaze fused with implicit richer visual representations. Experiments show that gaze alone already carries useful predictive signal for subjective urban perception, and that fusing gaze with scene representations improves prediction further, underscoring the value of incorporating human perceptual processes into urban scene understanding and opening a direction for gaze-guided multimodal urban computing.
Link: https://arxiv.org/abs/2605.00764
Authors: Lin Che, Xi Wang, Marc Pollefeys, Konrad Schindler, Martin Raubal, Peter Kiefer
Affiliations: ETH Zurich
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Urban perception describes how people subjectively evaluate urban environments, shaping how cities are experienced and understood. Existing computational approaches primarily model urban perception directly from street view images, but largely ignore the human perceptual process through which such judgments are formed. In this paper, we introduce Place Pulse-Gaze, an urban perception dataset that augments street view images with synchronized eye-tracking recordings and individual perception labels. Based on this dataset, we propose a Gaze-Guided Urban Perception Framework to study how gaze behavior contributes to the modeling of subjective urban perception. The framework systematically investigates three complementary settings: gaze-only modeling, gaze fusion with explicit semantic scene representations, and gaze fusion with implicit richer visual representations. Experiments show that gaze alone already carries useful predictive signals for subjective urban perception, and that integrating gaze with scene representations further improves prediction under both semantic and richer visual representations. Overall, our findings highlight the importance of incorporating human perceptual processes into urban scene understanding and open a direction for gaze-guided multimodal urban computing.
[HC-2] BlenderRAG: High-Fidelity 3D Object Generation via Retrieval-Augmented Code Synthesis
[Quick Read]: This paper addresses the difficulty of automatically generating functional Blender code from natural language: even state-of-the-art large language models (LLMs) frequently produce syntactic errors and geometrically inconsistent objects. The key to the proposed BlenderRAG system is retrieval-augmented generation over a curated, expert-validated multimodal dataset (500 examples of text, code, and images across 50 object categories): retrieving semantically similar examples during generation raises compilation success from 40.8% to 70.0% and semantic alignment (CLIP similarity) from 0.41 to 0.77, with no fine-tuning or specialized hardware required for deployment.
Link: https://arxiv.org/abs/2605.00632
Authors: Massimo Rondelli, Francesco Pivi, Maurizio Gabbrielli
Affiliations: University of Bologna; Ferrari S.p.A.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:
Abstract:Automatic generation of executable Blender code from natural language remains challenging, with state-of-the-art LLMs producing frequent syntactic errors and geometrically inconsistent objects. We present BlenderRAG, a retrieval-augmented generation system that operates on a curated multimodal dataset of 500 expert-validated examples (text, code, image) across 50 object categories. By retrieving semantically similar examples during generation, BlenderRAG improves compilation success rates from 40.8% to 70.0% and semantic normalized alignment from 0.41 to 0.77 (CLIP similarity) across four state-of-the-art LLMs, without requiring fine-tuning or specialized hardware, making it immediately accessible for deployment. The dataset and code will be available at this https URL.
[HC-3] AI Washing Inflates Expected Performance but Not Interaction Outcomes: An AI Placebo Study Using Fitts' Law
[Quick Read]: This paper addresses the mismatch between user expectations and actual interaction outcomes created by "AI washing", the practice of overstating a product's AI capabilities: the hype inflates expectations without improving performance. Its key contribution is a rigorous within-subjects Fitts' Law experiment comparing objective performance and subjective perception under three controlled conditions: no support, supposed predictive-AI support, and supposed biosignal-enhanced AI support. The results show that AI washing misleads user expectations without changing actual interaction outcomes, highlighting the need for transparency and accountability in AI product marketing and establishing Fitts' Law as a methodological lens for auditing AI-labelled input devices.
Link: https://arxiv.org/abs/2605.00582
Authors: Nick von Felten, Luisa Ella Müller, Johannes Schöning
Affiliations: University of St. Gallen; Mohamed bin Zayed University of Artificial Intelligence
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted to the 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26)
Abstract:Expectations about the support of artificial intelligence (AI) may influence interaction outcomes similar to placebos. Such expectations may result from AI washing, a practice of overstating a system’s AI capabilities when actual functionality is limited. For example, some computer mice are marketed as “AI-assisted” despite lacking AI in core functions. In a within-subjects study, 28 participants completed Fitts’ Law tasks with a computer mouse under three conditions: no support, supposed predictive AI support, and supposed biosignal-enhanced AI support. Objective Fitts’ Law performance indicators and subjective performance expectations, perceived workload, and perceived usability were measured. Compared to baseline, participants expected significantly improved performance in placebo conditions. However, these expectations did not translate into differences in objective or subjective assessments. This paper contributes evidence that AI washing inflates user expectations without altering actual interaction outcomes, highlighting a critical transparency issue. By exposing how deceptive AI marketing can shape user expectations, we underscore the need for accountability in AI product claims. Further, we establish Fitts’ Law as a rigorous methodological lens for auditing AI-labelled input devices.
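For reference, the objective quantities behind Fitts' Law studies like this one can be computed directly. The sketch below uses the standard Shannon formulation of the index of difficulty; the study's exact indicators are not spelled out in the abstract, so this is the textbook form, not necessarily theirs.

```python
import math

def index_of_difficulty(distance, width):
    """Shannon formulation of Fitts' law: ID = log2(D/W + 1), in bits.

    distance: movement distance to the target; width: target width,
    both in the same unit (e.g. pixels).
    """
    return math.log2(distance / width + 1)

def throughput(distance, width, movement_time):
    """Throughput in bits/s: ID divided by movement time. This is the kind
    of objective performance indicator compared across the no-support and
    placebo-AI conditions."""
    return index_of_difficulty(distance, width) / movement_time

tp = throughput(distance=700, width=100, movement_time=1.5)  # ID = 3 bits -> 2.0 bits/s
```

Because throughput folds task difficulty and speed into one number, identical throughput across conditions is strong evidence that the "AI support" changed nothing objectively, even while expectation ratings moved.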
[HC-4] DySRec: Dynamic Context-Aware Psychometric Scale Recommendation via Multi-Agent Collaboration
[Quick Read]: This paper addresses the tension between static pipelines and the dynamic demands of psychometric scale selection in psychological assessment: existing systems follow fixed workflows or directly predict symptoms, limiting dynamic assessment, risk management, and transparent decision-making. The key to the proposed DySRec multi-agent conversational system is modeling scale selection as a continuous decision process over multi-turn dialogue, with specialized agents collaborating to maintain user context, recommend scales, monitor psychological risk, and log decision trajectories; a closed-loop refinement mechanism feeds back missing information and steers the conversation to elicit the targeted attributes, so that heterogeneous signals (semantics, interaction behavior, assessment history, and content state) dynamically update user representations and a scale-context compatibility score drives more accurate, clinically useful recommendations.
Link: https://arxiv.org/abs/2605.00574
Authors: Yanzeng Li, Xiaoning Cao, Jialun Zhong, Jianpeng Hu, Jiangshan Tan, Ningning Liu, Feng Xiang, Shasha Han
Affiliations: Institute of Artificial Intelligence and Future Networks, Beijing Normal University; Wangxuan Institute of Computer Technology, Peking University; Fuwai Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College; Peking University Sixth Hospital, Peking University Institute of Mental Health; Beijing Malt Butler Technology Co., Ltd; School of Population Medicine and Public Health, Chinese Academy of Medical Sciences & Peking Union Medical College
Subjects: Human-Computer Interaction (cs.HC)
Comments: 4 pages, 2 figures
Abstract:Choosing suitable psychometric scales is an essential and difficult step in psychological consultation, requiring clinicians to integrate patient information, behaviors, and dynamic contextual information. Existing systems mainly use static pipelines to choose scales, or directly predict symptoms from user inputs, limiting their ability to support dynamic assessment, risk management, and transparent decision-making. To address these limitations, we propose DySRec, a multi-agent conversational system for dynamic psychometric scale recommendation. DySRec operates as an interactive chatbot that engages users in multi-turn dialogue, models scale selection as a continuous conversational decision process, and coordinates specialized agents to maintain user context, recommend assessment scales, monitor psychological risk, and log decision trajectories. In this way, DySRec can integrate and capture heterogeneous signals, including semantics, interaction behaviors, assessment history, and content state, to dynamically update user representations and compute a scale-context compatibility score for recommending the best-matched scales. Moreover, DySRec incorporates a closed-loop refinement mechanism: the recommendation agent feeds back missing or uncertain attributes and guides the conversation to elicit the targeted information. In this paper, we showcase the prototype design and architecture of DySRec; the system has been verified in a real-world application.
[HC-5] Pick and Sort for Graphical Authentication
【Quick Read】: This paper tackles the difficulty of balancing user experience and security in traditional authentication, particularly improving user-friendliness and customizability in non-time-critical scenarios. The key is a graphical authentication scheme based on a "Pick and Sort" design: users authenticate by choosing a configurable number of visual elements and arranging them in a specific order within a grid, yielding learnability, deployment flexibility, and per-group customization (e.g., visual elements tailored to children). Although login times are longer, the scheme suits settings with loose timing requirements, such as infrequent access or secondary authentication.
Link: https://arxiv.org/abs/2605.00558
Authors: Argianto Rahartomo, AmirHossein Jamshidipoor, Mohammad Ghafari
Affiliations: Technische Universität Clausthal; Islamic Azad University; Tehran Institute for Advanced Studies (TEIAS); Khatam University
Subjects: Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
Comments: The paper is accepted for publication at the International Workshop on Software Security 2026 (co-located with EASE 2026)
Abstract:We propose a graphical authentication scheme that follows a simple "Pick and Sort" design in which users choose visual elements and arrange them within a grid. Both the number of selected elements and the grid size are configurable, and the visual elements can be customized for specific user groups, such as children. A preliminary study with a prototype implementation indicated that the scheme is easy to learn and flexible to deploy. Although login times are longer than those of conventional authentication methods, the additional interaction may be acceptable in scenarios that are not time-critical, such as infrequent-access use cases or as a secondary authentication mechanism.
[HC-6] Linking Behaviour and Perception to Evaluate Meaningful Human Control over Partially Automated Driving
【Quick Read】: This paper addresses the tension in partial driving automation between drivers' legal responsibility and their reduced actual control, i.e., how to ensure drivers can intervene safely despite lower engagement. The key lies in empirically evaluating the degree of Meaningful Human Control (MHC): comparing two control modes (haptic shared control, HSC, and traded control, TC) using behavioral metrics (e.g., steering-torque conflict), subjective perception scores, and qualitative feedback, the study finds a significant negative correlation between perceived automation understanding of the driver and steering conflict, and a positive correlation between reaction times and perceived sufficiency of control. It further shows that mismatches between driver and automation intent, lack of safety, and resistance to driver inputs reduce perceived MHC, while subtle haptic guidance aligned with driver intent enhances it. Future designs should therefore prioritize effortless driver interventions, transparent communication of automation intent, and context-sensitive authority allocation to strengthen MHC.
Link: https://arxiv.org/abs/2605.00556
Authors: Ashwin George, Lucas Elbert Suryana, Lorenzo Flipse, Bart van Arem, David A. Abbink, Simeon Craig Calvert, Luciano Cavalcante Siebert, Arkady Zgonnikov
Affiliations: Delft University of Technology; Centre for Meaningful Human Control; Department of Cognitive Robotics; Department of Transport Planning; Department of Intelligent Systems
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Robotics (cs.RO)
Comments:
Abstract:Partial driving automation creates a tension: drivers remain legally responsible for vehicle behaviour, yet their active control is significantly reduced. This reduction undermines the engagement and sense of agency needed to intervene safely. Meaningful human control (MHC) has been proposed as a normative framework to address this tension. However, empirical methods for evaluating whether existing systems actually provide MHC remain underdeveloped. In this study, we investigated the extent to which drivers experience MHC when interacting with partially automated driving systems. Twenty-four drivers completed a simulator study involving silent automation failures under two modes - haptic shared control (HSC) and traded control (TC). We derived behavioural metrics from telemetry data, subjective perception scores from post-trial surveys and used them to test hypothesised relations between them derived from the properties of systems under MHC. The confirmatory analysis showed a significant negative correlation between the perception of the automated vehicle (AV) understanding the driver and conflict in steering torques. An exploratory analysis also revealed a surprising positive correlation between reaction times and the perception of sufficient control. Qualitative feedback from open-ended post-experiment questionnaires revealed that mismatches in intentions between the driver and automation, lack of safety, and resistance to driver inputs contribute to the reduction of perceived MHC, while subtle haptic guidance aligned with driver intent had a positive effect. These findings suggest that future designs should prioritise effortless driver interventions, transparent communication of automation intent, and context-sensitive authority allocation to strengthen meaningful human control in partially automated driving.
[HC-7] The impact of coercive, normative and mimetic stress on Chinese teachers' continuance intention to use generative AI: An integrated perspective of the Expectation-Confirmation Model and Institutional Theory
【Quick Read】: This paper examines what drives Chinese teachers' continuance intention to use generative AI in educational practice, focusing on the interaction between individual cognitive factors and the institutional environment. The key is integrating the Expectation-Confirmation Model with Institutional Theory in a mixed quantitative-qualitative design, revealing that individual psychological factors (perceived usefulness, confirmation, and satisfaction) and external institutional pressures (coercive, normative, and mimetic) jointly shape continued use. The study further shows that teachers value generative AI pragmatically while remaining critically cautious, reflecting the interplay of individual rationality and institutional logic in technology adoption.
Link: https://arxiv.org/abs/2605.00522
Authors: Kunjie Jia, Kai Cui, Huimin He, Yiran Du
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:This study investigates Chinese teachers’ continuance intention to use generative artificial intelligence (AI) by integrating the Expectation-Confirmation Model with Institutional Theory. A sequential explanatory mixed-methods design was employed. Questionnaire data from 437 teachers were analysed using structural equation modelling, followed by semi-structured interviews with 15 teachers to further interpret the findings. The results indicate that confirmation, perceived usefulness, and satisfaction play important roles in shaping teachers’ continuance intention, while institutional pressures, including coercive, normative, and mimetic influences, also contribute to continued use. Qualitative findings further reveal that teachers often use generative AI pragmatically to support tasks such as lesson preparation and idea generation, while simultaneously exercising caution and critically evaluating the reliability of AI-generated content. These findings highlight the combined influence of individual evaluations and institutional contexts on teachers’ sustained engagement with generative AI in education.
[HC-8] “What Are You Really Trying to Do?”: Co-Creating Life Goals from Everyday Computer Use
【Quick Read】: This paper addresses the limitation that existing user-modeling systems capture only what a user is currently doing, not the deeper motivation behind it, constraining how well systems can support users. The key is a "striving co-creation" process that, grounded in Activity Theory and Emmons' personal strivings framework, progressively builds a hierarchical representation of a user's activities from unstructured computer-use data, and provides an editing interface through which users correct the system's understanding of their goals, iteratively feeding those corrections back into subsequent rounds of striving inference for personalized support better aligned with long-term life goals.
Link: https://arxiv.org/abs/2605.00497
Authors: Shardul Sapkota, Matthew Jörke, Zane Sabbagh, Omar Shaikh, Grace Wang, James A. Landay
Affiliations: Stanford University
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 20 pages, 8 figures, 1 table
Abstract:Recent advances in user modeling make it feasible to conduct open-ended inference over a person’s everyday computer use. Despite longstanding visions of systems that deeply understand our actions and the purposes they serve in our lives, existing systems only capture what a person is doing in the moment – not why they are doing it – limiting these systems to surface-level support. We introduce striving co-creation, a process for inferring broader life goals from unstructured observations of computer use. Grounded in Activity Theory and Emmons’ personal strivings framework, our system progressively constructs a hierarchical representation of a person’s activities. Crucially, strivings are difficult to fully resolve from observation alone, as the same action can be driven by many different goals. Our system therefore supports an editing interface that gives people agency over how they are understood by the system, feeding their corrections back into subsequent rounds of striving induction. In a week-long field deployment (N=14), we find that our co-creation process produces strivings that are representative of participants’ long-term goals and gives them greater agency than baseline methods.
[HC-9] Urban to Rural Migration in Eastern Europe: Unpacking digital ruralities through TikTok video analysis
【Quick Read】: This paper explores the comparatively under-studied phenomenon of urban-to-rural migration, focusing on the digital practices of people in parts of Europe (e.g., Romania) who relocate from large cities or from abroad to the countryside. The core question is how social media platforms reconstruct and present rural life during this migration, and to what extent they shape perceptions and experiences of rural space. The key lies in applying the theoretical lens of "digital rurality" (based on Harvey's and Soja's spatial triad, with digital technologies as (re-)mediators of everyday experience) to a quantitative and qualitative analysis of 901 TikTok videos, revealing three layers: (a) social media enables new forms of paid labor that implicitly commodify the self; (b) rural areas gain new, highly data-rich representations on the platform; and (c) digital rural life is often idealized or romanticized, while also serving as a venue for tourism promotion, health advice, religious and political expression, and individual self-presentation.
Link: https://arxiv.org/abs/2605.00453
Authors: Anca-Simona Horvath, Cristian Tosa, Simai (Stella) Huang
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
Comments:
Abstract:Urban to rural migration is a less-researched phenomenon compared to its counterpart: rural to urban migration. In parts of Europe, an increasing number of people living in big urban centers within the country, or moving from other countries decide to relocate to rural areas. In this paper, we examine this phenomenon by analysing content posted on TikTok that documents this transition. We collected a corpus of 901 videos posted until late 2025, documenting urban to rural migration in Romania, under three hashtags, which have collectively been played a total of 24 million times at the time when we gathered the dataset. We analyse this corpus both quantitatively and qualitatively and discuss our findings through the lens of digital rurality - a theory based on Harvey’s and Soja’s spatial triad, applied to rural spaces, and based on the role of digital technologies as (re-)mediators of everyday lived experience. Specifically, we analyze the corpus as: (a) digital rural localities, (b) formal representations of the digital rural, and (c) everyday lives of the digital rural. We find that (a) Social media platforms enable new forms of paid labor that sometimes involve the commodification of the self in rural areas, although many of the creators we analyze do not explicitly acknowledge this with their audiences. (b) The digital rural gains new forms of representation, and rural areas in remote Romania are highly data-rich across TikTok. (c) The everyday lives represented through the digital rural are sometimes idealized or romanticised. However, they serve as promoters for tourism and are used as sites to document and discuss a variety of topics including giving ample health advice, typically by non-specialists and sometimes criticizing Western medicine, expressing and promoting religious and political views but also acting as forms of general self-expression.
[HC-10] On the Role of Artificial Intelligence in Human-Machine Symbiosis
【Quick Read】: This paper addresses how to trace the implicit functional role AI plays in natural language generation. As human-machine symbiosis deepens, AI-generated content is jointly shaped by humans and machines, and the AI's specific involvement is often hidden in the input prompt and hard to identify from the output text alone. The key is a methodology that embeds the AI role specified in the prompt into the content during the probabilistic generation process and then infers the actual nature of AI participation from the generated text. Experiments show the method is effective at discriminating between AI roles (e.g., assistive editing vs. creative generation), robust to perturbations, and preserves linguistic quality, laying groundwork for future AI-ethics research on whether AI use is fair, transparent, and appropriate.
Link: https://arxiv.org/abs/2605.00440
Authors: Ching-Chun Chang, Yuchen Guo, Hanrui Wang, Timo Spinde, Isao Echizen
Affiliations: National Institute of Informatics; University of Tokyo
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:
Abstract:The evolution of artificial intelligence (AI) has rendered the boundary between humanity and computational machinery increasingly ambiguous. In the presence of more interwoven relationships within human-machine symbiosis, the very notion of AI-generated information becomes difficult to define, as such information arises not from either humans or machines in isolation, but from their mutual shaping. Therefore, a more pertinent question lies not merely in whether AI has participated, but in how it has participated. In general, the role assumed by AI is often specified, either implicitly or explicitly, in the input prompt, yet becomes less apparent or altogether unobservable when the generated content alone is available. Once detached from the dialogue context, the functional role may no longer be traceable. This study considers the problem of tracing the functional role played by AI in natural language generation. A methodology is proposed to infer the latent role specified by the prompt, embed this role into the content during the probabilistic generation process and subsequently recover the nature of AI participation from the resulting text. Experimentation is conducted under a representative scenario in which AI acts either as an assistive agent that edits human-written content or as a creative agent that generates new content from a brief concept. The experimental results support the validity of the proposed methodology in terms of discrimination between roles, robustness against perturbations and preservation of linguistic quality. We envision that this study may contribute to future research on the ethics of AI with regard to whether AI has been used fairly, transparently and appropriately.
[HC-11] Play and Learn: Gamified Feedback for Ultrasound-Guided Catheter Insertion Training in Virtual Reality
【Quick Read】: This paper addresses the problem that current virtual reality (VR) medical training simulators emphasize high fidelity but lack effective formative feedback. The key is gamification: semantically aligned visual and auditory feedback (such as progress indicators, alignment guidance, and rewards) enhances the learning experience while preserving procedural fidelity. Empirical studies show the approach significantly reduces task time, improves usability, and lowers cognitive workload, while improving goal clarity and confidence across trainees of different expertise levels, validating gamification as an effective formative-feedback mechanism in VR medical training.
Link: https://arxiv.org/abs/2605.00389
Authors: Mohammad Raihanul Bashar, Alejandro Olivares Hernandez, Yahia Zine, Anil Ufuk Batmaz
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments: 10 pages, 11 figures, IEEE International Conference on Serious Games and Applications for Health (SeGAH)
Abstract:Virtual reality (VR) is widely used for procedural medical training, yet most simulators emphasize realism while providing limited formative feedback. We examine how gamification affects performance, workload, and experiential quality in VR training for ultrasound-guided peripheral intravenous catheter insertion. We developed a gamified simulator with semantically aligned visual and auditory feedback (e.g., progress indicators, alignment guidance, rewards) while preserving procedural fidelity. Two studies were conducted with novices (N=24) and clinicians (N=12). Results showed that gamification reduced task time, improved usability, and lowered workload across expertise levels. Qualitative findings indicate improved goal clarity and confidence for novices and better pacing for experts. Overall, gamification can function as an effective formative feedback in VR medical training.
[HC-12] An eHMI Presenting Request-to-Intervene and Takeover Status of Level 3 Automated Vehicles to Support Surrounding Traffic Safety
【Quick Read】: This paper addresses the safety risk that arises during a Level 3 automated vehicle's (AV's) request to intervene (RtI), a takeover transition that is invisible to the drivers of surrounding manually driven vehicles (MVs). The key is an external human-machine interface (eHMI), eHMI C+O, that externalizes the AV's RtI-related takeover status via cyan and orange light bars, improving surrounding MV drivers' understanding of the AV's intention, prediction of its behavior, and perceived sufficiency of the presented information, and promoting earlier and larger increases in time headway, ultimately reducing accident odds significantly (a 76.8% reduction compared with the no-eHMI condition).
Link: https://arxiv.org/abs/2605.00377
Authors: Hailong Liu, Masaki Kuge, Toshihiro Hiraoka, Takahiro Wada
Affiliations: Nara Institute of Science and Technology; Japan Automobile Research Institute
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Level 3 automated vehicles (AVs) issue a request to intervene (RtI) when the automated driving system approaches its system limitations. Although this takeover transition is safety-critical, it is usually invisible to surrounding manually driven vehicle (MV) drivers. This study proposes an external human-machine interface (eHMI) called eHMI C+O that externalizes the RtI-related takeover status of a Level 3 AV using cyan and orange light bars. A driving-simulator experiment with 40 participants examined whether the proposed eHMI supports surrounding MV drivers during AV takeover scenarios. The results showed that, compared with the ADS-status-only eHMI condition, which is similar to "Automated Driving Marker Lights," and the no-eHMI condition, the proposed eHMI C+O significantly improved participants’ understanding of the AV’s driving intention, their prediction of its behavior, and their perceived sufficiency of the information presented by the AV. It also reduced hesitation, increased confidence, and promoted earlier and larger increases in time headway after the RtI was issued. In the AV accident scenario, eHMI C+O significantly reduced the odds of accident involvement for the following MV compared with the no-eHMI condition, corresponding to a 76.8% reduction in accident odds. Exploratory path analysis suggested that the safety benefit of the proposed eHMI C+O may be associated with improved situation awareness and earlier defensive driving responses. These findings indicate that externalizing RtI-related takeover status can help surrounding drivers better understand Level 3 AVs and respond more safely during safety-critical takeover transitions.
[HC-13] From Phreaking to Sneaking: Children's Circumvention of Social Media Age Verification Systems
【Quick Read】: This paper asks why age-based platform regulation (such as Australia's social media ban for users under 16) is routinely circumvented by young people, and what this circumvention reveals about the structural tension between technological control and social governance. The key contribution is "sneaking" as a theoretical lens, which emphasizes that young people are not passive subjects of regulation but active agents who interpret, test, and negotiate digital infrastructure. This lens exposes a fundamental weakness of age-based platform regulation: purely technological age verification and access controls struggle to solve the complex social practices and governance problems they are asked to contain.
Link: https://arxiv.org/abs/2605.00368
Authors: Bjorn Nansen, Helena Sandberg, Lauren Bliss, Shaanan Cohney
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Australia’s social media ban is now in force. It requires platforms to take reasonable steps to stop users under 16 from holding accounts. Drawing on five focus groups with fifteen young people aged 12–16, this paper examines how children understood the ban’s effectiveness, impact, and legitimacy as they encountered the platforms charged with enforcing it. Participants widely saw the ban as unfair and ineffective. Through platform access controls, they learned how the ban worked, where it failed, and how they and their peers could evade it. We also asked participants to imagine better approaches to age verification and youth digital governance. This paper develops sneaking as a theoretical lens for these practices. The concept names more than evasion: it captures the social encounter between children, platforms, techno-regulation, and the access controls that mediate digital participation. Our findings show that children are not passive subjects of platform regulation. They interpret, test, and negotiate digital infrastructure. They also expose a central weakness in age-based platform regulation: technological controls struggle to solve the social and governance problems they are asked to contain.
[HC-14] Towards Interactive Multimodal Representation of ML Functions for Human Understanding of ML
【Quick Read】: This paper addresses widespread public misunderstanding of and apprehension about machine learning (ML) and artificial intelligence (AI), which impedes informed attitudes toward and adoption of these technologies. The key is carefully designed interactive visualizations built on highly transparent datasets that spark curiosity among teenagers and people from diverse fields, encouraging active exploration and deeper understanding of ML and ultimately shifting attitudes away from fear of the unknown toward informed understanding.
Link: https://arxiv.org/abs/2605.00357
Authors: Bokang Wang, Yingxuan Liao, Leah Lee, Jack Wesson, Anlan Yang, Ruizi Wang, Yigang Wen
Affiliations: Carnegie Mellon University
Subjects: Graphics (cs.GR); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
Comments:
Abstract:Attitudes about artificial intelligence and machine learning are recent victims of endemic misunderstanding; given our increasing reliance on these technologies, the need for widespread understanding and confidence in their use is paramount. To this end, our work seeks to increase understanding in these typically inaccessible topics through interactive visualizations, thereby garnering curiosity in the hopes of kickstarting a cycle of understanding leading to further pursuit of knowledge. We hope this will cyclically shift global attitudes away from the intimidation of the unknown currently plaguing ML. This work explores best practices for supporting curiosity in new technologies, to inspire attitudinal paradigm-shifts. Over three, distinct visualizations of machine learning data, we created prototypes with carefully selected, highly-transparent datasets, to examine the success factors of engagement required for more informed attitudes on ML less dictated by the fear of the unknown. By employing interactive visualizations, we can captivate the interest of teenagers and individuals from diverse fields, encouraging them to explore the fascinating world of machine learning.
[HC-15] FaceValue: Exploring Real-Time Self-View Overlays to Prompt Meaning-Oriented Self-Awareness in Remote Meetings
【Quick Read】: This paper addresses the ambiguity of visual non-verbal cues in remote video meetings, where incomplete information can cause misalignment between intended and perceived meaning and harm communication. The key is FaceValue, a technology probe that augments the self-view with private, real-time, subtle suggestive overlays, helping attendees reflect on how their behavior might be interpreted by others and raising awareness of potential misreadings; by avoiding behavioral labeling and instead supporting meaning-oriented self-awareness, the design motivates users to adjust non-verbal behavior during meetings to improve communication alignment.
Link: https://arxiv.org/abs/2605.00288
Authors: Gun Woo Warren Park, Anthony Tang, Fanny Chevalier
Affiliations: University of Toronto; Singapore Management University
Subjects: Human-Computer Interaction (cs.HC)
Comments: 27 pages, 5 figures, accepted to CSCW 2026
Abstract:In remote video meetings, visual non-verbal cues, such as facial expressions or head movements, are seen continuously but often only partially. This increases ambiguity compared to in-person settings and can cause misinterpretation or misalignment between intended and perceived meaning. Motivated by communication theories, we designed FaceValue, a technology probe that augments the self-view with private, real-time overlays. These overlays are subtle, suggestive prompts intended to help attendees reflect on how their cues might be interpreted by others. To invite personal interpretation, FaceValue avoids behavioral labeling and instead aims to support meaning-oriented self-awareness: recognizing when visible cues may unintentionally (mis)communicate intent. We deployed FaceValue in the wild with thirteen knowledge workers over multiple weeks, capturing perceived changes in self-awareness and behavior, and impressions on the design concepts, as self-reported by participants through diary entries and exit interviews. Participants felt FaceValue increased their awareness of potentially misaligned cues and motivated in-meeting adjustments, which they believe resulted in improved communication with other attendees. We contribute a conceptual framing that positions visual non-verbal cues as a manipulable communication resource, a technology probe that aims to foster meaning-oriented self-awareness, and empirically-grounded design insights for future meeting systems.
[HC-16] Developing an AI Concept Envisioning Toolkit to Support Reflective Juxtaposition of Values and Harms
【Quick Read】: This paper addresses the difficulty of surfacing and integrating values and potential harms in early-stage generative AI concept envisioning, where value considerations are often too abstract or arrive too late to shape design decisions. The key is the proposed and validated AI Concept Envisioning Toolkit, comprising an AI Capability Library, 24 Value-Harm Cards, and a Value-Tension Map, which juxtaposes values and potential harms within concrete technical capabilities, fostering early value reflection, harm anticipation, and more transparent ethical considerations, thereby introducing productive friction into design workflows toward more responsible AI design practice.
Link: https://arxiv.org/abs/2605.00282
Authors: Pitch Sinlapanuntakul, Soyun Moon, Yuri Kawada, Yeha Chung, Mark Zachry
Affiliations: University of Washington
Subjects: Human-Computer Interaction (cs.HC)
Comments: 22 pages, 10 figures, 5 tables
Abstract:Early-stage concept envisioning is a critical juncture in AI design, shaping how designers frame problems and the decisions that follow. Yet values and potential harms are often too abstract or addressed too late to meaningfully shape design. Using a Research-through-Design (RtD) approach, we developed the AI Concept Envisioning Toolkit, comprising an AI Capability Library, 24 Value–Harm Cards, and a Value–Tension Map, to support reasoning by juxtaposing values and harms within AI technical capabilities. Through a survey with 30 designers and in-depth interviews with 12 designers, we find that the toolkit is clear and perceived as valuable, and that it encourages value reflection, helps anticipate potential harms, and makes ethical considerations more transparent in early-stage design. We reflect on our design process and discuss design approaches for tools that promote reflection on values and potential harms, surface and navigate value tensions, and introduce productive friction throughout design workflows.
[HC-17] How Designers Envision Value-Oriented AI Design Concepts with Generative AI
【Quick Read】: This paper addresses the recursive value tensions that arise when generative AI serves as both design tool and design material in design practice, tensions distinct from value conflicts in traditional design work. The key insight is that designers' reflection-in-action with AI surfaces multi-level value tensions (at the tool, designer, and concept levels), with stronger attunement to recognizing potential harms than to articulating positive value fulfillment, and that designers exercise anticipatory meta-design reasoning about how tool assumptions may propagate into design concepts and future use contexts, forming a harm-centered mode of design judgment. These findings extend Schön's classic reflection-in-action framework and ground implications for redesigning AI-mediated design tools, supporting harm-centered reasoning, and positioning design as foundational to AI development.
Link: https://arxiv.org/abs/2605.00280
Authors: Pitch Sinlapanuntakul, Aayushi Dangol, Xiaoyi Xue, Mark Zachry
Affiliations: University of Washington
Subjects: Human-Computer Interaction (cs.HC)
Comments: 19 pages, 2 figures, 3 tables
Abstract:As AI integrates into design practice, designers increasingly use generative AI tools to envision AI-enabled solutions, positioning AI as both design tool and design material. This dual role creates recursive value tensions distinct from traditional design work. We engaged 18 designers in a concept envisioning activity and interviews to understand how they navigate values and recognize potential harms in this context. Our analysis reveals that (i) designers engage in reciprocal reflection-in-action with AI; (ii) this process surfaces multi-level value tensions across tool, designer, and concept; (iii) designers demonstrate greater attunement to harm recognition as a primary design signal than to articulating positive value fulfillment; and (iv) designers exercise anticipatory judgment through meta-design reasoning about how tool assumptions risk propagating into designed concepts and future use contexts. We extend Schön’s reflection-in-action framework and discuss implications for redesigning AI-mediated design tools, supporting harm-centered reasoning, and positioning design as foundational to AI development.
[HC-18] Engagement Phenotypes for a Sample of 102,684 AI Mental Health Chatbot Users and Dose-Response Associations with Clinical Outcomes
【Quick Read】: This paper addresses the limited understanding of real-world engagement patterns with generative AI mental health chatbots and their association with clinical outcomes. The key is K-means clustering of behavioral features from 102,684 users, identifying five engagement phenotypes (Early Dropouts, Power Users, Intensive Users, Weekly Users, and Concentrated Users), combined with PHQ-9, GAD-7, and MSPSS assessments to examine pre-post clinical change and the role of working alliance. The analysis reveals dose-response associations between engagement patterns and improvements in depression, anxiety, and social support, and validates the moderating role of working alliance, providing naturalistic evidence for the effective use of AI mental health tools.
Link: https://arxiv.org/abs/2605.00275
Authors: Emma C. Wolfe, Ting Su, Olivier Tieleman, Thomas D. Hull, Matteo Malgaroli, Caitlin A. Stamatis
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Background: Conversational AI chatbots are emerging as scalable mental health tools, but little is known about real world engagement or its relationship to clinical outcomes. Objective: To characterize engagement phenotypes among users of Ash, a purpose-built AI mental health chatbot, and examine associations with clinical change and working alliance. Methods: K-means clustering across eight behavioral features identified engagement phenotypes among 102,684 users. Subsamples completed the PHQ-9 (n=298), GAD-7 (n=298), and MSPSS (social support; n=194) baseline and 3 weeks; 11,437 users completed baseline Working Alliance Inventory (WAI). Results: Five engagement phenotypes emerged: Early Dropouts (52.2%), Power Users (1.6%), Intensive Users (4.1%), Weekly Users (25.3%), and a novel Concentrated User pattern (16.8%); across users, 66.9% had at least one overnight session (9pm-5am). Significant pre-post improvements occurred in depression (d = -0.51), anxiety (d = -0.57), and social support (d = 0.22). An observed dose-response gradient in self-reported depression improvement was replicated in a larger sample with model-predicted PHQ-9 (n = 23,813; Power Users d = -0.54; Early Dropouts d = -0.13). Higher working alliance predicted depression improvement and moderated the engagement-social support relationship. Conclusions: Engagement with AI mental health tools is multidimensional, and different clinical outcomes respond to different dimensions of use. Findings caution against treating session counts as a primary engagement metric and offer naturalistic evidence for the clinical value of purpose-built conversational AI.
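The phenotyping step above is plain K-means over behavioral features. A minimal pure-Python sketch of the idea, not the paper's pipeline: the two toy features (sessions per week, share of overnight sessions) and the user data below are hypothetical stand-ins for the paper's eight behavioral features:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means on tuples: returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize centroids at k distinct points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to the nearest centroid
        # by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its cluster.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels

# Hypothetical users described by (sessions/week, share of overnight sessions):
# two near-dropouts, two heavy night users, two moderate weekly users.
users = [(0.5, 0.0), (0.7, 0.1), (6.0, 0.8), (5.5, 0.9), (2.0, 0.4), (2.2, 0.5)]
centroids, labels = kmeans(users, k=3)
```

Each resulting cluster plays the role of one engagement phenotype; the paper then compares clinical change (PHQ-9, GAD-7, MSPSS) across clusters rather than across raw session counts.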
[HC-19] Are You the A-hole? A Fair Multi-Perspective Ethical Reasoning Framework
【Quick Read】: This paper addresses the failure of conventional majority-vote aggregation of natural language judgments in high-conflict domains, where it ignores logical consistency and treats disagreement as noise rather than potentially reconcilable reasoning. The key is a neuro-symbolic aggregation framework: a language model maps unstructured natural language explanations into interpretable logical predicates and confidence weights, which are encoded as soft constraints in the Z3 solver, turning aggregation into a Weighted Maximum Satisfiability (MaxSAT) optimization that seeks maximal logical consistency across conflicting statements.
Link: https://arxiv.org/abs/2605.00270
Authors: Sheza Munir, Ahanaf Rodoshi, Sumin Lee, Feiran Chang, Xujie Si, Syed Ishtiaque Ahmed
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Standard methods for aggregating natural language judgments, such as majority voting, often fail to produce logically consistent results when applied to high-conflict domains, treating differing opinions as noise. We propose a neuro-symbolic aggregation framework that formalizes conflict resolution through Weighted Maximum Satisfiability (MaxSAT). Our pipeline utilizes a language model to map unstructured natural language explanations into interpretable logical predicates and confidence weights. These components are then encoded as soft constraints within the Z3 solver, transforming the aggregation problem into an optimization task that seeks the maximum consistency across conflicting testimony. Using the Reddit r/AmItheAsshole forum as a case study in large-scale moral disagreement, our system generates logically coherent verdicts that diverge from popularity-based labels 62% of the time, corroborated by an 86% agreement rate with independent human evaluators. This study demonstrates the efficacy of coupling neural semantic extraction with formal solvers to enforce logical soundness and explainability in the aggregation of noisy human reasoning.
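The weighted MaxSAT objective at the core of this pipeline can be sketched in a few lines of pure Python. This is a brute-force stand-in for the paper's Z3 soft-constraint encoding, and the predicates, clauses, and confidence weights below are hypothetical illustrations in the spirit of the r/AmItheAsshole case study:

```python
from itertools import product

# Hypothetical predicates a language model might extract from commenters' explanations.
variables = ["acted_selfishly", "had_valid_reason", "is_asshole"]

# Soft clauses: (clause, confidence weight). A clause is a list of
# (variable, polarity) literals and is satisfied if any literal holds.
soft_clauses = [
    ([("acted_selfishly", True)], 0.9),                           # "they only thought of themselves"
    ([("had_valid_reason", True)], 0.6),                          # "but they had their reasons"
    ([("acted_selfishly", False), ("is_asshole", True)], 0.8),    # selfish -> asshole
    ([("had_valid_reason", False), ("is_asshole", False)], 0.7),  # valid reason -> not asshole
]

def weighted_maxsat(variables, soft_clauses):
    """Brute force over all assignments: maximize total weight of satisfied clauses."""
    best, best_weight = None, -1.0
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        weight = sum(
            w for clause, w in soft_clauses
            if any(assignment[v] == polarity for v, polarity in clause)
        )
        if weight > best_weight:
            best, best_weight = assignment, weight
    return best, best_weight

verdict, weight = weighted_maxsat(variables, soft_clauses)
```

The optimum here sacrifices the lowest-weight testimony ("they had their reasons") to keep the rest mutually consistent, which is exactly how the framework resolves conflict instead of counting votes; the paper delegates this optimization to Z3's soft constraints rather than enumerating assignments.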
[HC-20] Electrotactile Improves Thermal Referral
【Quick Read】: This paper addresses the limitations of vibrotactile-based thermal referral, which struggles to produce cold sensations and, because of its strong tactile "buzz," is ill-suited to simulating non-contact thermal events in VR, such as the chill felt when opening a freezer door. The key is shifting the tactile stimulus from vibrotactile to electrotactile: two user studies (a psychophysics experiment and a VR deployment) show that electrotactile-based thermal referral significantly increases the cold referral rate, enhances thermal perception while minimizing tactile interference, and effectively distinguishes contact-based from non-contact thermal events, offering a better design path for immersive multimodal thermal experiences in VR despite longer-standing reliance on vibration.
Link: https://arxiv.org/abs/2605.00240
Authors: Wen Li, Rong Ni, Bozhi Tian, Pedro Lopes
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments: 12 pages, 12 figures
Abstract:Thermal referral enables thermal sensations in locations lacking thermal actuators: this is achieved using vibrotactile actuators to redirect a nearby thermal sensation to where a tactile sensation is applied. However, we found that its reliance on vibration introduces critical limitations: it struggles to produce cold referral, and the inherent strong tactile “buzz” makes it unsuitable for simulating non-contact thermal events, such as the chill of an open freezer in VR (in contrast to contact-based thermal events like touching the freezer’s cold handle). To improve this, we propose a shift from vibrotactile to electrotactile-based thermal referral. We evaluated it in two user studies, a psychophysics experiment (N=22) and a VR deployment (N=20), where we contrasted electrotactile with vibrotactile-based thermal referral. Our results reveal key advantages of electrotactile-based thermal referral: (1) it increases the referral rate for cold sensations; (2) it increases thermal perception while minimizing tactile interference; and (3) it improves realism across a range of VR thermal scenarios, specifically distinguishing between contact-based and non-contact thermal events. Finally, we provide design guidelines for choosing tactile cues to create immersive multimodal thermal experiences in VR.
[HC-21] Introducing WARM-VR: Benchmark Dataset for Multimodal Wearable Affect Recognition in Virtual Reality
【速读】:该论文旨在解决现有情感识别(affect recognition)数据集多局限于静态环境,难以适用于沉浸式多媒体场景(如虚拟现实VR)的问题。其解决方案的关键在于构建并公开了一个名为WARM-VR的多模态数据集,该数据集通过可穿戴传感器采集参与者的生理信号(包括血容脉搏BVP、皮肤电活动EDA、皮肤温度、加速度和心电ECG),结合视觉、听觉与嗅觉同步刺激,在VR环境中诱发并记录放松状态下的情绪变化。研究进一步基于此数据集建立基准模型,验证了卷积神经网络(CNN)及其与双向门控循环单元(Bi-GRU)融合结构在情绪效价(valence)分类中的有效性,并发现轻量级Transformer架构在唤醒度(arousal)预测中表现最优,从而为沉浸式环境下基于生理信号的情感识别提供了可靠的数据基础与模型参考。
链接: https://arxiv.org/abs/2605.00184
作者: Karim Alghoul,Faisal Mohd,Fedwa Laamarti,Hussein Al Osman,Abdulmotaleb El Saddik
机构: 未知
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
备注:
Abstract:With the growing integration of human-computer interaction into everyday life, advances in machine learning have enabled systems to better perceive and respond to users’ emotional states. Most existing affect recognition datasets focus on static environments, limiting their applicability to immersive multimedia contexts such as Virtual Reality (VR). In this paper, we introduce WARM-VR, a novel publicly available multimodal dataset designed to support affect recognition in immersive, multisensory environments using wearable sensing instrumentation. Data were collected from 31 participants aged 19-37 using wearable sensors: a wristband measuring Blood Volume Pulse (BVP), electrodermal activity (EDA), skin temperature, and three-axis acceleration, and a chest strap recording ECG signals. Participants engaged in immersive VR experiences designed to elicit relaxation through a calming beach environment following stress induction via an arithmetic task. These sessions incorporated synchronized multimedia stimuli: visual, auditory, and olfactory. Affective states were assessed subjectively through validated self-report questionnaires and objectively through the analysis of physiological measurements. Statistical analysis of the questionnaires confirmed that VR relaxation significantly reduced negative affect, particularly with olfactory enhancement. Furthermore, we established a benchmark on the dataset using widely recognized machine learning algorithms. The best performance for binary valence classification from BVP data was obtained with a CNN and a CNN-Bi-GRU model, both achieving an average F1-score of 0.63 and an AUC of 0.69. For arousal, a lightweight Transformer architecture provided the most balanced results (F1-scores of 0.54 and 0.63 for the two classes), outperforming recurrent hybrids. In the relaxation task, a CNN-Bi-GRU model reached the highest overall performance (average F1-score of 0.64, AUC of 0.69).
[HC-22] How Frontier LLM s Adapt to Neurodivergence Context: A Measurement Framework for Surface vs. Structural Change in System-Prompted Responses
【速读】:该论文旨在解决前沿对话式大语言模型(Large Language Models, LLMs)在接收到神经多样性(Neurodivergence, ND)情境提示时是否调整输出行为及其调整性质的问题。其核心挑战在于评估LLMs对ND相关上下文的响应是否具有适应性、结构化改进以及潜在有害倾向的抑制能力。解决方案的关键在于提出并构建NDBench——一个包含576条输出结果的基准测试框架,涵盖两个前沿模型、三种系统提示类型(基线、ND人格声明、带明确调整指令的ND人格声明)、四种典型ND人格特征及24个跨类别提示(含对抗性掩码策略)。实验表明,在明确指令条件下,LLMs显著生成更长、结构更清晰的回应(p < 10⁻⁸,Holm校正),表现为更高的词元数量、更多标题和更细粒度步骤;而仅声明ND身份无法有效降低有害倾向,需显式指令才能实现36–44%的掩码强化减少。此框架为未来LLMs在ND意识下的适应性审计提供了可复现的工具与方法论基础。
链接: https://arxiv.org/abs/2605.00113
作者: Ishan Gupta,Pavlo Buryi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 15 pages, 3 figures, 2 tables. Benchmark, code, and data available at this https URL
Abstract:We examine whether frontier chat-based large language models (LLMs) adjust their outputs based on neurodivergence (ND) context in system prompts and describe the nature of these adjustments. Specifically, we propose NDBench, a 576-output benchmark involving two frontier models, three system prompt types (baseline, ND-profile assertion, and ND-profile assertion with explicit instructions for adjustments), four canonical ND profiles, and 24 prompts across four categories, one of which involves an adversarial masking strategy. Four trends emerge consistently from our findings. First, LLMs show significant adaptation under ND context, where fully instructed conditions yield lengthier and more structured outputs, characterized by higher token counts, more headings, and more granular steps (p < 10^-8, Holm-corrected). Second, such adaptation is largely structural in nature: although list density does not change much, there is a marked rise in the frequency of headings and per-step detail. Third, ND persona assertion alone fails to suppress potentially harmful tendencies, as masking-reinforcement decreases only in explicitly instructed cases (36-44% reduction); the reduction rate barely changes in persona assertion conditions. Moreover, reliability analysis of LLM-based harm assessment reveals that only two out of the six dimensions (masking and reinforcement, validation quality) exceed the pre-defined inter-judge agreement criterion (alpha = 0.67) and thus can be considered primary results. NDBench is made publicly available along with its prompts, outputs, code, and other resources, forming a reproducible framework for auditing future LLMs’ adaptation to ND awareness.
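摘要中多处显著性结果采用 Holm 校正(如 p < 10^-8, Holm-corrected)。下面是 Holm 逐步下降(step-down)多重比较校正的一个简短参考实现,与论文代码无关:

```python
def holm_adjust(pvals):
    """Holm step-down adjusted p-values: sort ascending, multiply the
    rank-i smallest p-value by (m - i), then enforce monotonicity."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        adj = min(1.0, (m - rank) * pvals[idx])
        running_max = max(running_max, adj)
        adjusted[idx] = running_max
    return adjusted

adj = holm_adjust([0.01, 0.04, 0.03])  # -> [0.03, 0.06, 0.06]
```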
[HC-23] What is (H)CI: Why Does the "Human" Matter?
【速读】:该论文旨在探讨人机交互(Human-Computer Interaction, HCI)领域的核心定义与未来方向,特别是在生成式 AI(Generative AI)快速发展的背景下,重新审视“人类”在HCI中的重要性。其关键解决方案在于通过跨学科研究者的参与,借助引导式讨论、小组头脑风暴和反思,系统性地反思HCI的本质、未来发展形态,并强调在技术演进中不可忽视的“人类中心”维度。
链接: https://arxiv.org/abs/2605.00109
作者: Sejal Agarwal,Delara Forghani,Brandon Lit,Thomas Driscoll,Anthony Maocheia-Ricci
机构: Cheriton School of Computer Science, University of Waterloo (滑铁卢大学计算机科学系)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted Workshop Proposal for Graphics Interface 2026
Abstract:Human-Computer Interaction (HCI) is a diverse field bringing together theories and methods from fields such as computer science, psychology, and human factors. Historically, HCI has focused on the human through "user"- or "human"-centered design, where the focus was either on information processing or understanding people and their concerns with respect to technology. However, amid the increasing adoption of generative AI tools, this workshop explores two critical questions regarding HCI: "What is HCI?" and "Why does the 'human' matter?" We aim to bring together researchers from diverse disciplines to reflect on these questions. Through guided discussions, group brainstorming, and reflection, we explore what HCI means, what the field may look like in the future, and why it is important to remember the "human" aspect of the field.
[HC-24] Sure About That Line? Approaching Confidence-Based Real-Time Line Assignment in Reading Gaze Data
【速读】:该论文旨在解决多行文本阅读中基于远程或网络摄像头的眼动追踪(eye tracking)因噪声因素和版面歧义导致的注视点到行的映射(fixation-to-line assignment)不准确问题,尤其是在实时阅读支持场景下对每注视点精准分配对应行的需求。解决方案的关键在于提出CONF-LA(Confidence-score-based Online Fixation-to-Line Assignment),该方法融合阅读行为知识与注视点上的高斯线似然模型,通过计算后验行得分(posterior-line-score)实现在线、低延迟的分配决策,并在不确定性较高时推迟分配,从而显著提升准确性,尤其在儿童数据上表现出对回视(regressions)的高度鲁棒性,相较其他算法在即兴中位准确率上提升至约95%。
链接: https://arxiv.org/abs/2605.00033
作者: Franziska Kaltenberger,Wei-Ling Chen,Enkeleda Thaqi,Enkelejda Kasneci
机构: Technical University Munich (慕尼黑工业大学); Munich Center for Machine Learning (MCML) (慕尼黑机器学习中心)
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: Accepted at ETRA 2026. To appear in Proceedings of the ACM on Computer Graphics and Interactive Techniques. 21 pages, 12 figures
Abstract:Remote and webcam-based eye tracking in multi-line reading suffers from various noise factors and layout ambiguity, precisely where real-time reading support needs reliable, per-fixation line assignment. Prior work largely addresses this challenge post hoc or by restricting behavior (e.g., disallowing re-reading), undermining interactive use. We propose CONF-LA (Confidence-score-based Online Fixation-to-Line Assignment), a principled, low-latency approach that integrates knowledge about reading behavior and Gaussian line likelihoods over fixations to compute a posterior line score and defers assignments when uncertainty is high. Evaluated on existing open-source data, CONF-LA demonstrates stable performance in post hoc analysis and closes the online-offline gap (1-2%) with a mean per-fixation latency of 0.348 ms. Our approach exhibits particular invariance toward regressions, yielding significant improvement in ad hoc median accuracies on children’s data (approx. 95%) over all tested algorithms. We encourage further research in this direction and discuss possibilities for future development.
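CONF-LA 的核心是“高斯行似然 × 先验 → 后验行得分,置信不足则推迟分配”。下面用标准库给出该决策逻辑的示意(σ、置信阈值与行坐标均为本文假设的示例值,并非论文设定):

```python
import math

def assign_line(fix_y, line_ys, sigma=12.0, prior=None, min_conf=0.6):
    """Posterior line score for one fixation: Gaussian likelihood of the
    fixation's vertical position under each text line, times a prior
    (e.g., from reading order), normalized. Defers (returns None) when
    the best posterior falls below the confidence threshold."""
    if prior is None:
        prior = [1.0 / len(line_ys)] * len(line_ys)
    scores = [p * math.exp(-0.5 * ((fix_y - y) / sigma) ** 2)
              for y, p in zip(line_ys, prior)]
    total = sum(scores)
    post = [s / total for s in scores]
    best = max(range(len(post)), key=post.__getitem__)
    return (best, post[best]) if post[best] >= min_conf else None

lines = [100.0, 140.0, 180.0]                      # line centers in pixels
confident = assign_line(fix_y=102.0, line_ys=lines)  # clearly on line 0
deferred = assign_line(fix_y=120.0, line_ys=lines)   # ambiguous midpoint -> None
```

位于两行正中的注视点后验约 0.5/0.5,低于阈值即被推迟,这对应论文中“在线-离线差距”得以缩小的延迟分配机制。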
[HC-25] MoDAl: Self-Supervised Neural Modality Discovery via Decorrelation for Speech Neuroprosthesis
【速读】:该论文旨在解决神经语音假体(Speech Neuroprosthesis)系统在恢复失语患者交流能力时,因仅依赖运动皮层信号而忽略其他可能携带互补语言信息的脑区(如Broca区的area 44)所导致的性能瓶颈问题。解决方案的关键在于提出MoDAl(Modality Decorrelation and Alignment)框架,该框架通过共享投影空间中对比损失(contrastive loss)与去相关损失(decorrelation loss)的协同作用,实现多模态神经编码器的解耦与对齐:对比损失将各脑区编码器与预训练大语言模型(LLM)文本嵌入对齐,确保语义一致性;而去相关损失则防止不同编码器趋于冗余表示,从而挖掘出具有功能特异性的神经模态——例如,来自area 44的信号被发现主要编码句法和结构特征(如句子长度、语法语态、wh-词),验证了其与Broca区的神经语言学功能一致。实验表明,MoDAl在Brain-to-Text Benchmark '24上将词错误率(WER)从26.3%降至21.6%,且性能提升完全源于去相关机制对area 44信号的利用。
链接: https://arxiv.org/abs/2605.00025
作者: Yuanhao Chen,Peter Chin
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:Speech neuroprosthesis systems decode intended speech from neural activity in the absence of audible output, offering a path to restoring communication for individuals with speech-impairing conditions. Current approaches decode predominantly from motor cortical areas, discarding others – such as area 44, part of Broca’s area – that may encode complementary linguistic information. We introduce MoDAl (Modality Decorrelation and Alignment), a framework that discovers complementary neural modalities through the interplay of two objectives in a shared projection space. A contrastive loss aligns each of several parallel brain encoders with the text embeddings of a pretrained large language model (LLM), while a decorrelation loss prevents the encoders from coalescing to duplicative representations. We prove that these objectives are in productive tension: Contrastive alignment induces transitive modality coalescence, which decorrelation must counteract for the framework to discover diverse neurolinguistic modalities. On the Brain-to-Text Benchmark '24, MoDAl reduces word error rate (WER) from 26.3% to 21.6% compared to the previous best end-to-end method, with the gain from incorporating previously discarded area 44 signals arising entirely from the decorrelation mechanism. Analysis of the discovered modalities reveals functional specialization: Encoders receiving area 44 input capture structural and syntactic properties (sentence length, grammatical voice, wh-words), consistent with the neurolinguistic understanding of Broca’s area.
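MoDAl 的去相关损失旨在阻止并行编码器退化为重复表示。下面是一个极简的 numpy 示意:以成对嵌入的余弦一致性作为惩罚项(具体损失形式为本文的简化假设,并非论文原式):

```python
import numpy as np

def decorrelation_loss(z_a, z_b, eps=1e-8):
    """Penalize agreement between two encoders' embeddings of the same
    batch: mean squared cosine similarity of paired (row-wise) embeddings.
    Zero when the two encoders produce orthogonal representations."""
    a = z_a / (np.linalg.norm(z_a, axis=1, keepdims=True) + eps)
    b = z_b / (np.linalg.norm(z_b, axis=1, keepdims=True) + eps)
    cos = np.sum(a * b, axis=1)
    return float(np.mean(cos ** 2))

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))
identical = decorrelation_loss(z, z)                       # coalesced -> 1.0
orthogonal = decorrelation_loss(
    np.eye(4, 8), np.roll(np.eye(4, 8), 4, axis=1))        # disjoint -> 0.0
```

两个编码器输出完全相同时损失为 1,正交时为 0;训练中将其与对比对齐损失相加,即构成文中所述的“张力”。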
计算机视觉
[CV-0] Posterior Augmented Flow Matching
【速读】:该论文旨在解决流匹配(Flow Matching, FM)在高维图像生成任务中因监督信号稀疏和梯度方差大而导致的流坍塌(flow collapse)问题,即模型倾向于记忆特定的源-目标配对,而非学习通用的生成动态。解决方案的关键在于提出后验增强流匹配(Posterior-Augmented Flow Matching, PAFM),其核心思想是将FM中单目标监督替换为基于近似后验分布的期望,该后验表示给定中间状态和条件下的合理目标完成路径。PAFM通过分解不可计算的后验为似然项与先验项,并利用重要性采样构建多候选目标混合分布,在不增加显著计算开销的前提下,有效降低训练梯度方差并提升模型泛化能力,从而在不同模型规模和架构下均实现FID指标显著改善。
链接: https://arxiv.org/abs/2605.00825
作者: George Stoica,Sayak Paul,Matthew Wallingford,Vivek Ramanujan,Abhay Nori,Winson Han,Ali Farhadi,Ranjay Krishna,Judy Hoffman
机构: Georgia Tech (佐治亚理工学院); University of Washington (华盛顿大学); Hugging Face (Hugging Face); Ai2 (艾伦人工智能研究所); UC Irvine (加州大学欧文分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Flow matching (FM) trains a time-dependent vector field that transports samples from a simple prior to a complex data distribution. However, for high-dimensional images, each training sample supervises only a single trajectory and intermediate point, yielding an extremely sparse and high-variance training signal. This under-constrained supervision can cause flow collapse, where the learned dynamics memorize specific source-target pairings, mapping diverse inputs to overly similar outputs, failing to generalize. We introduce Posterior-Augmented Flow Matching (PAFM), a theoretically grounded generalization of FM that replaces single-target supervision with an expectation over an approximate posterior of valid target completions for a given intermediate state and condition. PAFM factorizes this intractable posterior into (i) the likelihood of the intermediate under a hypothesized endpoint and (ii) the prior probability of that endpoint under the condition, and uses an importance sampling scheme to construct a mixture over multiple candidate targets. We prove that PAFM yields an unbiased estimator of the original FM objective while substantially reducing gradient variance during training by aggregating information from many plausible continuation trajectories per intermediate. Finally, we show that PAFM improves over FM by up to 3.4 FID50K across different model scales (SiT-B/2 and SiT-XL/2), different architectures (SiT and MMDiT), and in both class and text conditioned benchmarks (ImageNet and CC12M), with a negligible increase in the compute overhead. Code: this https URL.
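PAFM 将单目标速度监督替换为候选终点上的重要性加权混合。下面用 numpy 演示线性路径 x_t = (1-t)·x0 + t·x1 下的这一构造(高斯似然的 σ、候选集合与先验均为示例假设):

```python
import numpy as np

def pafm_target(x_t, x0, t, candidates, prior_logits, sigma=0.1):
    """Importance-weighted mixture velocity target (sketch): each candidate
    endpoint x1_k is scored by (i) a Gaussian likelihood of the observed
    intermediate x_t under the linear path to x1_k and (ii) a prior over
    endpoints; the FM target (x1 - x0) is averaged under the weights."""
    mus = (1.0 - t) * x0 + t * candidates              # expected intermediates
    log_lik = -0.5 * np.sum((x_t - mus) ** 2, axis=1) / sigma ** 2
    log_w = log_lik + prior_logits
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    return (w[:, None] * (candidates - x0)).sum(axis=0), w

x0 = np.zeros(2)
cands = np.array([[1.0, 0.0], [0.0, 1.0]])             # two candidate endpoints
x_t = np.array([0.5, 0.0])                             # lies on the path to cand 0
v, w = pafm_target(x_t, x0, t=0.5, candidates=cands,
                   prior_logits=np.zeros(2))
```

与中间态一致的候选终点几乎拿到全部权重,混合目标因而退化为该候选的速度;当多个候选都可信时,权重分散即起到文中所述的降方差作用。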
[CV-1] Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
【速读】:该论文旨在解决自回归大视觉语言模型(Large Vision-Language Models, LVLMs)在多模态任务中面临的“视觉信号稀释”(Visual Signal Dilution)问题,即随着文本历史的累积,注意力分配函数扩展导致视觉注意力随生成序列长度增加而反向衰减,从而削弱模型对视觉信息的持续感知能力。解决方案的关键在于提出轻量级可学习模块——持久视觉记忆(Persistent Visual Memory, PVM),其作为并行分支集成于LVLM的前馈网络(Feed-Forward Network, FFN)中,构建了一条与距离无关的检索路径,直接提供视觉嵌入以实现精准的视觉感知,从结构上缓解深度生成过程中固有的信号抑制现象。
链接: https://arxiv.org/abs/2605.00814
作者: Siyuan Huang,Xiaoye Qu,Yafu Li,Tong Zhu,Zefeng He,Muxin Fu,Daizong Liu,Wei-Long Zheng,Yu Cheng
机构: Shanghai AI Laboratory (上海人工智能实验室); Shanghai Jiao Tong University (上海交通大学); The Chinese University of Hong Kong (香港中文大学); Nanjing University (南京大学); Tongji University (同济大学); Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:While autoregressive Large Vision-Language Models (LVLMs) demonstrate remarkable proficiency in multimodal tasks, they face a “Visual Signal Dilution” phenomenon, where the accumulation of textual history expands the attention partition function, causing visual attention to decay inversely with generated sequence length. To counteract this, we propose Persistent Visual Memory (PVM), a lightweight learnable module designed to ensure sustained, on-demand visual perception. Integrated as a parallel branch alongside the Feed-Forward Network (FFN) in LVLMs, PVM establishes a distance-agnostic retrieval pathway that directly provides visual embeddings for precise visual perception, thereby structurally mitigating the signal suppression inherent to deep generation. Extensive experiments on Qwen3-VL models demonstrate that PVM brings notable improvements with negligible parameter overhead, delivering consistent average accuracy gains across both 4B and 8B scales, particularly in complex reasoning tasks that demand persistent visual perception. Furthermore, in-depth analysis reveals that PVM can resist length-induced signal decay and accelerate internal prediction convergence.
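PVM 作为 FFN 的并行分支,为每个隐状态提供一条与生成距离无关的视觉检索通路。下面是该思想的 numpy 草图(检索用单头点积注意力实现,权重矩阵与视觉记忆库均为随意示例,并非论文实现):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ffn_with_pvm(h, w1, w2, visual_mem, tau=1.0):
    """FFN block with a parallel retrieval branch (sketch): alongside the
    usual two-layer FFN, each hidden state attends over a bank of visual
    embeddings, and the retrieved vector is added to the FFN output,
    giving a distance-agnostic path back to the image."""
    ffn_out = np.maximum(h @ w1, 0.0) @ w2           # standard FFN (ReLU)
    attn = softmax(h @ visual_mem.T / tau, axis=-1)  # retrieval weights
    retrieved = attn @ visual_mem                    # visual read-out
    return ffn_out + retrieved

# With a zero FFN and identity memory, the output is pure retrieval.
out = ffn_with_pvm(np.ones((3, 4)), np.zeros((4, 8)),
                   np.zeros((8, 4)), np.eye(4))
```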
[CV-2] Let ViT Speak: Generative Language-Image Pre-training
【速读】:该论文旨在解决视觉编码器与大语言模型(LLM)之间对齐不充分的问题,尤其在生成式 AI(Generative AI)场景下,如何更高效地实现多模态理解与生成。解决方案的关键在于提出一种极简的生成式语言-图像预训练框架 GenLIP,其核心设计是让 Vision Transformer (ViT) 直接从视觉 token 预测语言 token,采用标准的语言建模目标,无需对比损失或额外文本解码器。这一方法实现了视觉与文本 token 在单一 Transformer 中的联合建模,从而在保持结构简洁的同时显著提升可扩展性和多模态性能,在多个基准测试中达到或超越现有强基线。
链接: https://arxiv.org/abs/2605.00809
作者: Yan Fang,Mengcheng Lan,Zilong Huang,Weixian Lei,Yunqing Zhao,Yujie Zhong,Yingchen Yu,Qi She,Yao Zhao,Yunchao Wei
机构: Beijing Jiaotong University (北京交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 9 figures
Abstract:In this paper, we present Generative Language-Image Pre-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, without contrastive batch construction or an additional text decoder. This design offers three key advantages: (1) Simplicity: a single transformer jointly models visual and textual tokens; (2) Scalability: it scales effectively with both data and model size; and (3) Performance: it achieves competitive or superior results across diverse multimodal benchmarks. Trained on 8B samples from Recap-DataComp-1B, GenLIP matches or surpasses strong baselines despite using substantially less pretraining data. After continued pretraining on multi-resolution images at native aspect ratios, GenLIP further improves on detail-sensitive tasks such as OCR and chart understanding, making it a strong foundation for vision encoders in MLLMs.
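GenLIP 的训练目标就是标准语言建模损失,只在文本位置计损、视觉位置不计。下面用 numpy 演示这一“仅文本位置的交叉熵”(logits 为虚构的均匀预测,词表大小等均为示例):

```python
import numpy as np

def lm_loss(logits, targets, text_mask):
    """Language-modeling objective on text positions only (sketch):
    logits [T, V] over the joint visual+text sequence; cross-entropy is
    computed where text_mask is True (visual positions carry no loss)."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return float(nll[text_mask].mean())

V = 5
logits = np.zeros((4, V))                    # uniform predictions
targets = np.array([0, 1, 2, 3])
mask = np.array([False, False, True, True])  # first two are visual tokens
loss = lm_loss(logits, targets, mask)        # log(V) for uniform logits
```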
[CV-3] GMGaze: MoE-Based Context-Aware Gaze Estimation with CLIP and Multiscale Transformer
【速读】:该论文旨在解决现有基于卷积神经网络(CNN)、Transformer 和对比语言-图像预训练(CLIP)的注视估计方法中存在的三大挑战:图像特征的后期融合导致信息损失、缺乏对关键因素(如光照、背景、头部姿态和外观)的条件感知建模,以及模型容量扩展不切实际的问题。其解决方案的核心在于提出全局条件多尺度注视估计(GMGaze),通过引入语义原型条件化机制,利用四个学习得到的原型库(照明、背景、头部姿态和外观)调制CLIP全局图像嵌入,生成两个互补的上下文偏置全局标记;这些标记与CLIP补丁标记和CNN标记在第一层进行早期统一融合,避免了晚期合并中的信息丢失;同时,采用稀疏专家混合模块(Mixture-of-Experts)实现条件计算资源分配,无需均匀增加密集参数;此外,在跨域适应中引入对抗域适应与特征分离损失,促使两个全局标记保持去相关性,从而提升泛化能力。
链接: https://arxiv.org/abs/2605.00799
作者: Xinyuan Zhao,Yihang Wu,Ahmad Chaddad,Sarah A. Alkhodair,Reem Kateb
机构: Guilin University of Electronic Technology (桂林电子科技大学); École de Technologie Supérieure (高等技术学院); King Saud University (沙特国王大学); Taibah University (塔伊巴大学); Jeddah University (吉达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in KBS
Abstract:Gaze estimation methods commonly use facial appearance to predict the direction of a person’s gaze. However, previous studies show three major challenges with convolutional neural network (CNN)-based, transformer-based, and contrastive language-image pre-training (CLIP)-based methods, including late fusion of image features, lack of factor-aware conditioning, and impractical capacity scaling. To address these challenges, we propose Globally-conditioned Multi-scale Gaze estimation (GMGaze), which leverages a multi-scale transformer architecture. Specifically, the model first introduces semantic prototype conditioning, which modulates the CLIP global image embedding using four learned prototype banks (i.e., illumination, background, head pose and appearance) to generate two complementary context-biased global tokens. These tokens, along with the CLIP patch and CNN tokens, are fused at the first layer. This early unified fusion prevents information loss common in late-stage merging. Finally, each token passes through sparse Mixture-of-Experts modules, providing conditional computational capacity without uniformly increasing dense parameters. For cross-domain adaptation, we incorporate an adversarial domain adaptation technique with a feature separation loss that encourages the two global tokens to remain de-correlated. Experiments using four public benchmarks (MPIIFaceGaze, EYEDIAP, Gaze360, and ETH-XGaze) show that GMGaze achieves mean angular errors of 2.49°, 3.22°, 10.16°, and 1.44°, respectively, outperforming previous baselines in all within-domain settings. In cross-domain evaluations, it provides state-of-the-art (SOTA) results on two standard transfer routes.
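文中的稀疏专家混合(MoE)模块“在不均匀增加密集参数的情况下提供条件计算容量”。下面是单个 token 的 top-k 稀疏路由草图(专家数、门控矩阵均为随机示例,并非论文配置):

```python
import numpy as np

def sparse_moe(x, gate_w, experts, k=2):
    """Sparse top-k Mixture-of-Experts routing for one token (sketch):
    gate scores select k experts, whose outputs are combined under the
    renormalized gate weights, so capacity grows without dense compute."""
    scores = x @ gate_w                      # [n_experts] gate logits
    top = np.argsort(scores)[-k:]            # indices of the top-k experts
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

rng = np.random.default_rng(1)
x = rng.normal(size=4)
gate_w = rng.normal(size=(4, 3))             # 3 experts, only 2 run per token
experts = [rng.normal(size=(4, 4)) for _ in range(3)]
y = sparse_moe(x, gate_w, experts, k=2)
```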
[CV-4] Make Your LVLM KV Cache More Lightweight
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在推理过程中因预填充阶段处理大量视觉标记(vision tokens)而导致的GPU内存开销过大的问题。现有方法直接沿用大语言模型(LLMs)中的键值缓存(Key-Value Cache, KV cache)机制,虽提升了解码效率,却显著增加了视觉token的存储负担。解决方案的关键在于提出LightKV,一种基于跨模态消息传递(cross-modality message passing)的压缩策略:通过文本提示引导,在预填充阶段聚合并逐步压缩视觉token嵌入,从而有效利用视觉token间的冗余信息。该方法区别于以往仅针对视觉内容的压缩方式,实现了对KV缓存中视觉部分的高效压缩——实验表明,仅保留55%原始视觉token即可使KV缓存大小减半、计算量最多降低40%,同时保持通用性能不下降,并显著优于现有基线方法。
链接: https://arxiv.org/abs/2605.00789
作者: Xihao Chen,Yangyang Guo,Roger Zimmermann
机构: National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to Transactions on Machine Learning Research (TMLR), 2026
Abstract:Key-Value (KV) cache has become a de facto component of modern Large Vision-Language Models (LVLMs) for inference. While it enhances decoding efficiency in Large Language Models (LLMs), its direct adoption in LVLMs introduces substantial GPU memory overhead due to the large number of vision tokens processed during the prefill stage. To tackle this problem, we propose LightKV, a novel approach that reduces KV cache size by exploiting the redundancy among vision-token embeddings. Guided by text prompts, LightKV employs cross-modality message passing to aggregate informative messages across vision tokens and progressively compress them during prefill. This prompt-aware guidance distinguishes our method from prior vision-only compression strategies. We evaluate LightKV on eight open-source LVLMs across eight public benchmark datasets, e.g., MME and SeedBench. Experimental results demonstrate that with only 55% of the original vision tokens, LightKV (a) halves the vision-token KV cache size, (b) reduces computation by up to 40%, and (c) preserves general-purpose performance while significantly outperforming existing baselines.
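LightKV 的要点是“文本提示引导下聚合并压缩视觉 token”。下面给出一个极简 numpy 草图:按与提示 token 的最大相似度打分,保留约 55% 的视觉 token,并把被丢弃 token 的信息平均合并进最相近的保留 token。打分与合并规则均为本文假设的简化版,并非论文的消息传递机制:

```python
import numpy as np

def compress_vision_tokens(vis, txt, keep_ratio=0.55):
    """Prompt-aware vision-token compression (sketch): score each vision
    token by its maximum similarity to any text-prompt token, keep the
    top keep_ratio fraction, and merge each dropped token into its most
    similar kept token so its message is aggregated rather than lost."""
    sim_txt = vis @ txt.T                    # [Nv, Nt] relevance to prompt
    scores = sim_txt.max(axis=1)
    n_keep = max(1, int(round(keep_ratio * len(vis))))
    order = np.argsort(-scores)
    keep, drop = order[:n_keep], order[n_keep:]
    out = vis[keep].astype(float).copy()
    counts = np.ones(n_keep)
    for d in drop:                           # message passing into kept set
        j = int(np.argmax(vis[d] @ vis[keep].T))
        out[j] = (out[j] * counts[j] + vis[d]) / (counts[j] + 1)
        counts[j] += 1
    return out

rng = np.random.default_rng(0)
vis = rng.normal(size=(20, 8))               # 20 vision tokens
txt = rng.normal(size=(5, 8))                # 5 prompt tokens
compressed = compress_vision_tokens(vis, txt)  # 11 tokens (55%)
```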
[CV-5] Map2World: Segment Map Conditioned Text to 3D World Generation
【速读】:该论文旨在解决当前3D世界生成方法中存在的两大核心问题:一是受限于规则网格布局(grid layouts),难以实现灵活的场景构建;二是物体尺度在全局范围内缺乏一致性,导致生成场景存在尺度失真。解决方案的关键在于提出一种名为Map2World的新框架,该框架通过用户定义的任意形状与尺度的语义分割图(segment maps)作为条件输入,从而实现全局尺度一致性和环境扩展灵活性;同时引入细节增强网络(detail enhancer network),在不破坏整体场景连贯性的前提下注入精细细节,其设计充分利用了资产生成器中的强先验知识,实现了在有限训练数据下对多样化场景的鲁棒泛化能力。
链接: https://arxiv.org/abs/2605.00781
作者: Jaeyoung Chung,Suyoung Lee,Jianfeng Xiang,Jiaolong Yang,Kyoung Mu Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project page: this https URL
Abstract:3D world generation is essential for applications such as immersive content creation or autonomous driving simulation. Recent advances in 3D world generation have shown promising results; however, these methods are constrained by grid layouts and suffer from inconsistencies in object scale throughout the entire world. In this work, we introduce a novel framework, Map2World, that first enables 3D world generation conditioned on user-defined segment maps of arbitrary shapes and scales, ensuring global-scale consistency and flexibility across expansive environments. To further enhance the quality, we propose a detail enhancer network that generates fine details of the world. The detail enhancer enables the addition of fine-grained details without compromising overall scene coherence by incorporating global structure information. We design the entire pipeline to leverage strong priors from asset generators, achieving robust generalization across diverse domains, even under limited training data for scene generation. Extensive experiments demonstrate that our method significantly outperforms existing approaches in user-controllability, scale consistency, and content coherence, enabling users to generate 3D worlds under more complex conditions.
[CV-6] Quantum Gradient-Based Approach for Edge and Corner Detection Using Sobel Kernels
【速读】:该论文旨在解决传统边缘检测(edge detection)与角点检测(corner detection)方法在量子计算框架下的实现问题,以探索其在量子硬件上的可行性与性能表现。解决方案的关键在于提出了一种基于量子电路的Sobel边缘检测与Harris角点检测的量子实现方案,采用Flexible Representation of Quantum Images (FRQI) 和 Quantum Probability Image Encoding (QPIE) 两种量子图像编码方式对输入图像进行表示,并设计了一种基于lag-2差分的量子梯度计算方案,可在叠加态中高效评估梯度特征;同时引入经典后处理步骤以提升检测质量并减少误检。实验表明,QPIE编码相比FRQI更具稳定性与一致性,尤其在测量次数有限时优势明显,但整体计算成本仍由状态制备、测量及经典后处理主导,说明该工作实现了功能完备且可扩展的量子化经典算法映射,而非端到端的速度加速。
链接: https://arxiv.org/abs/2605.00744
作者: Mohammad Aamir Sohail,Gabriela Pinheiro,Yasemin Poyraz Kocak,Batuhan Hangun,Emre Camkerten,Simge Yigit,Hafize Asude Ertan
机构: University of Michigan (密歇根大学); Federal University of Rio de Janeiro (里约热内卢联邦大学); Istanbul University-Cerrahpasa (伊斯坦布尔大学-切拉帕萨校区); University of Vienna (维也纳大学); Istanbul University (伊斯坦布尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Edge detection refers to identifying points in a digital image where intensity changes sharply, indicating object boundaries or structural features. Corners are locations where gray-level intensity changes abruptly in multiple directions and are widely used in feature extraction, object tracking, and 3D modeling. In this study, we present a quantum implementation of Sobel-based edge detection and Harris-style corner detection. Two quantum image encoding methods - Flexible Representation of Quantum Images (FRQI) and Quantum Probability Image Encoding (QPIE) - are used to encode the input data and are comparatively analyzed. The proposed approach introduces a quantum gradient computation scheme based on lag-2 differences, enabling the evaluation of gradient-like features in superposition. To improve detection quality and reduce false positives, a classical post-processing step is applied to candidate corner points identified by the quantum circuit. Results show that the proposed quantum circuits produce outputs consistent with classical Sobel and Harris operators. Furthermore, the QPIE-based configuration yields more stable and coherent results than FRQI, especially under limited measurement shots. While gradient computation can be performed efficiently at the circuit level, the overall cost remains dominated by state preparation, measurement, and classical post-processing. All experiments are conducted under noiseless simulation, and performance on NISQ hardware may be affected by noise and measurement limitations. Therefore, this work demonstrates a functional and scalable quantum realization of classical edge and corner detection methods rather than an end-to-end speedup.
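论文的量子梯度方案基于 lag-2 差分(相隔两个像素的差)。下面给出其经典参考实现,便于核对量子电路应复现的输出(纯 numpy,与量子编码无关):

```python
import numpy as np

def lag2_gradient(img):
    """Lag-2 central differences, the gradient-like feature the quantum
    circuits evaluate in superposition (classical reference): for each
    interior pixel, the difference between neighbours two pixels apart."""
    gx = np.zeros_like(img, dtype=float)
    gy = np.zeros_like(img, dtype=float)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    return gx, gy

# A vertical step edge: the horizontal gradient fires on the boundary.
img = np.zeros((4, 6))
img[:, 3:] = 1.0
gx, gy = lag2_gradient(img)
```

阶跃边缘两侧的 lag-2 差分为 1、其余为 0,这正是 Sobel 型边缘响应的最简情形。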
[CV-7] Exploring the Limits of End-to-End Feature-Affinity Propagation for Single-Point Supervised Infrared Small Target Detection
【速读】:该论文旨在解决单点监督红外小目标检测(Single-point supervised infrared small target detection, IRSTD)中因密集标注成本高而导致的性能瓶颈问题。现有最先进(SOTA)方法依赖于显式的离线伪标签构建(如多阶段主动学习或物理驱动的掩码生成),以恢复掩码监督信号,但流程复杂且易引入误差。本文提出一种极简替代方案GSACP,其核心在于通过批内点锚定特征亲和力传播(in-batch, point-anchored feature-affinity propagation)在线生成点到掩码的监督信号,直接利用硬边界特征亲和性并由局部图像先验门控,从而完全消除外部标签演化循环。该设计虽简洁高效,却暴露了一个优化瓶颈——即“自指传播漂移”(Self-Referential Propagation Drift),表现为特征表示与监督目标相互纠缠,可能增强真实边界或扭曲特征空间以满足自身目标。为系统隔离此问题,作者采用协议化的单变量消融策略,包括局部EMA教师解耦、硬背景对比分离和自适应支持几何调整,在SIRST3数据集上实现了0.6674 mIoU的同时,相比PAL方法将误报率(Fa)降低38%,验证了该范式在低虚警场景下的部署潜力。
链接: https://arxiv.org/abs/2605.00722
作者: Qiancheng Zhou,Wenhua Zhang
机构: Shanghai University (上海大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Single-point supervised infrared small target detection (IRSTD) drastically reduces dense annotation costs. Current state-of-the-art (SOTA) methods achieve high precision by recovering mask supervision through explicit, offline pseudo-label construction, such as multi-stage active learning and physics-driven mask generation. In this paper, we study a minimalist alternative: generating point-to-mask supervision online through in-batch, point-anchored feature-affinity propagation. We instantiate this paradigm as GSACP, an end-to-end testbed that directly supervises the detector using hard-margin feature affinity gated by local image priors, entirely eliminating external label-evolution loops. This compact design, however, exposes an optimization bottleneck. Because the affinity target is generated from the same feature representation being optimized, training forms a self-referential loop. We theoretically formalize this as Self-Referential Propagation Drift, a representation-supervision entanglement that can sharpen true boundaries or distort the feature space to satisfy its own targets. To systematically isolate these failure modes, we apply a protocolized single-variable ablation procedure spanning local EMA teacher decoupling, hard-background contrastive separation, and adaptive support geometry. On the SIRST3 dataset, GSACP-Final establishes a new ultra-low false-alarm operating regime, achieving a highly competitive 0.6674 mIoU while demonstrating a 38% relative reduction in false-positive artifacts (Fa) compared with PAL. By systematically deconstructing the end-to-end paradigm, we map its performance boundaries and show that in-batch feature propagation provides a compact alternative for deployment scenarios where false-alarm suppression is paramount.
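GSACP 的在线点到掩码监督依赖“点锚定特征亲和 + 硬边界阈值 + 局部先验门控”。下面用 numpy 给出该传播步骤的玩具示意(特征图、阈值与门控均为本文假设的示例,并非论文实现):

```python
import numpy as np

def point_to_mask(feats, anchor_yx, margin=0.7, local_prior=None):
    """Point-anchored affinity propagation (sketch): cosine affinity of
    every pixel's feature to the annotated point's feature, thresholded
    with a hard margin and optionally gated by a local image prior."""
    f = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-8)
    anchor = f[anchor_yx]
    affinity = f @ anchor                    # [H, W] cosine affinity map
    mask = affinity >= margin                # hard-margin thresholding
    if local_prior is not None:
        mask &= local_prior                  # gate by image prior
    return mask

feats = np.zeros((4, 4, 3))
feats[..., 0] = 1.0                          # background feature
feats[1:3, 1:3] = [0.0, 1.0, 0.0]            # a 2x2 "target" blob
mask = point_to_mask(feats, anchor_yx=(1, 1))  # single-point annotation
```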
[CV-8] Unpaired Image Deraining Using Reward-Guided Self-Reinforcement Strategy
【速读】:该论文旨在解决无监督图像去雨(unsupervised image deraining)中因缺乏强约束而导致网络难以收敛的问题,尤其针对雨迹退化形式复杂多样的挑战。其核心解决方案是提出RGSUD(Reward-Guided Self-Reinforcement Unsupervised Image Deraining)框架,关键在于两个阶段:一是基于图像质量评估(Image Quality Assessment, IQA)的动态奖励回收机制,用于在训练过程中筛选高质量去雨结果并持续收集伪配对数据;二是引入自强化(self-reinforcement, SR)训练策略,将这些奖励嵌入优化过程以缩小搜索空间、提升去雨输出与干净图像之间的对齐度。通过IQA引导的自强化损失和动态更新的奖励机制,该方法显著提升了合成伪配对数据的质量并稳定了优化过程,在多个数据集上实现了最先进的性能。
链接: https://arxiv.org/abs/2605.00719
作者: Yinghao Chen,Yeying Jin,Xiang Chen,Yanyan Wei,Ziyang Yan,Yaowen Fu
机构: National University of Defense Technology (国防科技大学); National University of Singapore (新加坡国立大学); Nanjing University of Science and Technology (南京理工大学); Hefei University of Technology (合肥工业大学); University of Trento (特伦托大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Unsupervised deraining has attracted attention for its ability to learn the real-world distribution of rain without paired supervision. However, the lack of strong constraints makes it difficult for the network to converge, especially with the complex diversity of rain degradation. A key motivation is that high-quality deraining results occasionally emerge during training, which can be leveraged to guide the optimization process. To overcome these challenges, we introduce RGSUD (Reward-Guided Self-Reinforcement Unsupervised Image Deraining), comprising two key stages: reward recycling and self-reinforcement (SR) training. For the former stage, we propose an Image Quality Assessment (IQA)-based dynamic reward recycling mechanism that selects optimal derained outputs during training and continuously collects high-quality deraining images. In the latter stage, we incorporate these rewards into the model’s optimization process, constraining the optimization space and improving alignment between derained outputs and clean images. By leveraging an IQA-based self-reinforced loss and dynamically updated rewards, we enhance the quality of synthesized pseudo-paired data and stabilize the optimization. Extensive experiments demonstrate that our method achieves SOTA performance across multiple datasets, including paired synthetic, paired real, and unpaired real images, outperforming existing unsupervised deraining approaches in both subjective and objective IQA metrics. Additionally, we show that the self-reinforcement strategy is adaptable to other unsupervised deraining methods and our deraining framework demonstrates strong generalization across existing supervised deraining networks.
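奖励回收阶段的逻辑可以概括为:按 IQA 分数为每个雨图维护迄今最优的去雨输出,作为后续自强化训练的伪干净目标。以下为一个纯 Python 草图(接口与命名均为本文假设):

```python
def update_reward_bank(bank, rainy_id, derained, iqa_score):
    """IQA-based reward recycling (sketch): per rainy input, keep the
    best-scoring derained output seen so far as a pseudo-clean target
    for the self-reinforcement stage."""
    best = bank.get(rainy_id)
    if best is None or iqa_score > best[1]:
        bank[rainy_id] = (derained, iqa_score)
    return bank

bank = {}
update_reward_bank(bank, "img_001", "epoch3_output", 0.62)
update_reward_bank(bank, "img_001", "epoch7_output", 0.81)  # replaces
update_reward_bank(bank, "img_001", "epoch9_output", 0.55)  # ignored
```

训练后期的退化输出不会覆盖已回收的高分奖励,这正是“动态更新奖励以稳定优化”的含义。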
[CV-9] Learning Coarse-to-Fine Osteoarthritis Representations under Noisy Hierarchical Labels
【速读】:该论文旨在解决膝骨关节炎(Knee Osteoarthritis, OA)评估中标签层次结构未被充分利用的问题,即如何在模型训练中有效融合粗粒度的二分类OA判断与细粒度的Kellgren–Lawrence(KL)分级标签。现有深度学习方法通常将这两个目标视为独立分类任务,或直接优化噪声较大的KL等级标签,导致特征表示不够有序或缺乏临床意义。解决方案的关键在于引入一种结构简单的双头(dual-head)模型,共享一个编码器并分别对应OA存在与否和KL等级两个任务,从而利用临床标签的天然层级关系作为表示层面的监督先验(representation-level supervisory prior)。实验表明,这种简单架构在特定3D骨干网络上可提升KL相关指标,并促使潜在空间中从粗到细的组织更有序,同时增强注意力热图与软骨区域的解剖学一致性,证明了层次化监督对疾病表征学习的有效性。
链接: https://arxiv.org/abs/2605.00718
作者: Tongxu Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Knee osteoarthritis (OA) assessment involves a natural but often underused label hierarchy: a coarse binary OA decision and a fine-grained Kellgren–Lawrence (KL) severity grade. Existing deep learning studies commonly treat these targets as separate classification problems, either reducing OA assessment to disease presence or directly optimizing noisy ordinal KL labels. In this work, we ask whether this clinical hierarchy can serve as a representation-level supervisory prior. Rather than introducing a complex architecture, we use a deliberately simple dual-head model with a shared encoder and two task-specific heads as a probe of hierarchical supervision. We compare single-OA, single-KL, and dual-head training across multiple 3D backbones under the same test protocol. Beyond standard classification metrics, we perform paired statistical comparisons, analyze latent severity-axis geometry, and examine saliency overlap with cartilage regions. The results show that dual-head supervision produces backbone-dependent gains, with clear improvements in KL-related metrics for selected backbones. More importantly, the gains are accompanied by a more ordered coarse-to-fine latent organization and, for responsive backbones, stronger anatomical alignment of saliency with cartilage. These findings suggest that even simple hierarchical dual-head supervision can reshape disease representations under noisy coarse/fine labels, providing a useful inductive bias for OA diagnosis and severity grading.
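摘要中强调的由粗到细标签层级(粗粒度 OA 二分类与细粒度 KL 分级)可用如下草图说明。此处假设采用临床常见约定 KL≥2 判为 OA 阳性,双头损失的加权组合方式亦为假设,非论文官方实现:

```python
# 示意性草图:膝骨关节炎标签层级与双头损失组合(约定与权重均为假设)

def kl_to_oa(kl_grade: int) -> int:
    """由细粒度 KL 等级(0-4)推导粗粒度 OA 二分类标签(假设 KL>=2 为阳性)。"""
    if not 0 <= kl_grade <= 4:
        raise ValueError("KL 等级应在 0-4 之间")
    return 1 if kl_grade >= 2 else 0

def dual_head_loss(loss_oa: float, loss_kl: float, alpha: float = 0.5) -> float:
    """共享编码器下,OA 头与 KL 头损失的简单加权组合(权重为假设)。"""
    return alpha * loss_oa + (1.0 - alpha) * loss_kl

for kl, oa in [(0, 0), (1, 0), (2, 1), (3, 1), (4, 1)]:
    assert kl_to_oa(kl) == oa
```

这种层级一致性正是论文所说的"表示层面监督先验":粗标签可从细标签推导,两个任务头因此共享结构化的监督信息。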
[CV-10] PhysEdit: Physically-Consistent Region-Aware Image Editing via Adaptive Spatio-Temporal Reasoning
【速读】:该论文旨在解决现有基于推理的图像编辑方法在处理异构编辑指令(如颜色替换、对象插入和物理动作编辑)时存在的效率与适应性不足问题,即这些方法采用固定的推理流程,无法根据指令复杂度和空间需求动态调整计算资源。其解决方案的关键在于提出PhysEdit框架,通过两个无需重新训练主干网络的推理时模块实现时空维度上的自适应:(1) 复杂度自适应推理深度(Complexity-Adaptive Reasoning Depth, CARD),依据指令和参考图像预测编辑复杂度,动态分配推理步数和每个样本的推理标记长度;(2) 空间推理掩码(Spatial Reasoning Mask, SRM),利用交叉注意力提取条件相关的空间先验,将推理限制在语义上需要修改的区域。这两个模块协同作用,在保持编辑质量的同时显著提升效率,尤其在外观级编辑任务中实现最高1.52倍的加速效果。
链接: https://arxiv.org/abs/2605.00707
作者: Guandong Li,Mengxia Ye
机构: iFLYTEK(科大讯飞); Aegon THTF
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 7 figures
Abstract:Image editing instructions are heterogeneous: a color swap, an object insertion, and a physical-action edit all demand different spatial coverage and different reasoning depth, yet existing reasoning-based editors apply a single fixed inference recipe to every instruction. We argue that adaptivity along both the spatial and temporal axes is the missing degree of freedom, and we present PhysEdit, an editing framework built around this principle. PhysEdit introduces two inference-time modules that compose without retraining the backbone. At its core, (1) Complexity-Adaptive Reasoning Depth (CARD) predicts edit complexity directly from the instruction and reference image and allocates the reasoning step count N_r and reasoning-token length r per sample – turning a previously fixed inference schedule into a conditional-computation problem. CARD is supported by (2) a Spatial Reasoning Mask (SRM) that extracts an instruction-conditioned spatial prior from cross-attention to confine reasoning to regions that semantically require it. On the full 737-case ImgEdit Basic-Edit Suite, PhysEdit delivers a 1.18x wall-clock speedup (64.3s vs. 76.1s per sample) over a strong reasoning baseline while slightly improving instruction adherence (CLIP-T 0.2283 vs. 0.2266, +0.7%) and matching identity preservation within noise (CLIP-I 0.8246 vs. 0.8280). The speedup is category-dependent and reaches 1.52x on appearance-level edits, validating CARD’s adaptive allocation as the principal source of efficiency gain. A 30-sample pilot with full ablations isolates the contribution of each module.
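CARD 的"条件计算"思想,即按样本复杂度分配推理步数 N_r 与推理 token 长度 r,可用如下草图示意(分档阈值与具体数值均为假设,非论文官方配置):

```python
# 示意性草图:复杂度自适应的推理资源分配(阈值与数值为假设)

def allocate_reasoning(complexity: float) -> tuple[int, int]:
    """complexity ∈ [0,1]:外观级编辑偏低,物理动作类编辑偏高;返回 (N_r, r)。"""
    if complexity < 0.3:      # 简单:颜色替换等外观级编辑
        return 1, 64
    elif complexity < 0.7:    # 中等:对象插入等
        return 2, 128
    else:                     # 困难:物理动作编辑
        return 4, 256

assert allocate_reasoning(0.1) == (1, 64)
assert allocate_reasoning(0.9) == (4, 256)
```

摘要中外观级编辑获得 1.52x 加速,正是因为这类样本被分到最浅的推理档位。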
[CV-11] Static and Dynamic Graph Alignment Network for Temporal Video Grounding
【速读】:该论文针对时序视频定位(Temporal Video Grounding, TVG)任务中现有图卷积网络(Graph Convolutional Networks, GCN)方法存在的三大瓶颈展开研究:1)仅使用静态或动态特征构建图节点,导致视觉表征不完整且忽略互补语义;2)构建时间图时缺乏查询感知机制,造成特征交互效率低下;3)采用单一粒度语义匹配策略,致使训练收敛慢且精度不足。解决方案的关键在于提出静态与动态图对齐网络(Static and Dynamic Graph Alignment Network, SDGAN),其核心创新包括:首先,联合利用静态与动态视觉特征构建两个互补的时间图并实施位置级节点对齐,增强视觉表示的表达力和鲁棒性;其次,引入查询-片段对比学习与自适应图建模机制,显式对齐视觉片段与文本查询,生成查询感知的视觉表征;最后,结合多粒度时间提议与渐进式由易到难训练策略,有效衔接粗粒度语义定位与细粒度时间边界精调。
链接: https://arxiv.org/abs/2605.00684
作者: Zhanjie Hu,Bolin Zhang,Jianhua Wang,Jianbo Zheng,Chenchen Yan,Takahiro Komamizu,Ichiro Ide,Jiangbo Qian
机构: Ningbo University (宁波大学); Inner Mongolia University (内蒙古大学); Hunan Normal University (湖南师范大学); Georg-August-Universität Göttingen (哥廷根大学); Nagoya University (名古屋大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Temporal Video Grounding (TVG) aims to localize temporal moments in an untrimmed video that semantically correspond to given natural language queries. Recently, Graph Convolutional Networks (GCN) have been widely adopted in TVG to model temporal relations among video clips and enhance contextual reasoning by constructing clip-level graphs. Despite their effectiveness, existing GCN-based TVG methods encounter three critical bottlenecks: 1) Most methods construct graph nodes using either static or dynamic features alone, resulting in incomplete visual representation and overlooking complementary semantics, 2) Most methods construct temporal graphs in a query-agnostic manner, leading to inefficient feature interaction within the temporal graph representation, and 3) Most methods often suffer from a single-granularity semantic matching, while direct training on complex temporal localization task may lead to slow convergence and suboptimal precision. To address these challenges, we propose Static and Dynamic Graph Alignment Network (SDGAN). First, SDGAN jointly exploits static and dynamic visual features to construct two complementary temporal graphs and performs Position-wise Nodes Alignment, enabling more expressive and robust visual representation. Second, SDGAN introduces Query-Clip Contrastive Learning and Adaptive Graph Modeling to explicitly align visual clips with their corresponding textual queries, yielding query-aware visual representations. Third, SDGAN incorporates multi-granularity temporal proposals within Progressive Easy-to-Hard Training Strategy, effectively bridging coarse-grained semantic localization and fine-grained temporal boundary refinement. Extensive experiments on three benchmark datasets demonstrate that SDGAN achieves superior performance across complex TVG scenarios. Codes and datasets are available at this https URL.
[CV-12] Foundation AI Models for Aerosol Optical Depth Estimation from PACE Satellite Data
【速读】:该论文旨在解决传统气溶胶光学厚度(Aerosol Optical Depth, AOD)反演方法在空间一致性与噪声敏感性方面的局限性,以及现有数据驱动方法未能充分挖掘高光谱影像中空间-光谱协同信息的问题。其解决方案的关键在于提出ViTCG框架——一种基于通道分组的空间回归视觉Transformer(Vision Transformer with Channel-wise Grouping-based spatial regression),通过联合建模高光谱辐亮度的时空上下文与光谱特征,显著降低反演偏差和误差,从而生成空间一致且鲁棒性强的AOD场。
链接: https://arxiv.org/abs/2605.00678
作者: Zahid Hassan Tushar,Sanjay Purushotham
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 4 figures, to appear in 2026 IEEE International Geoscience and Remote Sensing Symposium
Abstract:Aerosol Optical Depth (AOD) retrieval is essential for Earth observation, supporting applications from air quality monitoring to climate studies. Conventional physics-based AOD retrieval methods formulate the problem as a pixel-wise inversion, relying on radiative transfer modeling, memory-intensive look-up tables, and auxiliary meteorological data. While recent data-driven approaches have shown promise, many fail to exploit the spatial-spectral coherence of hyperspectral imagery, leading to spatially inconsistent and noise-sensitive retrievals. We present the first study exploring Foundation AI models for AOD retrieval and propose ViTCG, a Vision Transformer with Channel-wise Grouping-based spatial regression framework that reduces retrieval bias and error. ViTCG uses hyperspectral top-of-atmosphere radiance as input and jointly models spatial context and spectral information. Validation with PACE radiance observations demonstrates a 62% reduction in mean squared error compared to state-of-the-art foundation models, including Prithvi, and produces spatially coherent AOD fields.
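ViTCG 的通道分组思路可用如下草图示意:把高光谱波段索引切成连续的组,每组各自进入空间建模分支(组大小与波段数均为假设,PACE 实际波段配置以论文为准):

```python
# 示意性草图:高光谱波段的通道分组(波段数与组大小为假设)

def group_channels(num_channels: int, group_size: int) -> list[range]:
    """把 [0, num_channels) 的波段索引切成连续的组,最后一组可不足 group_size。"""
    return [range(s, min(s + group_size, num_channels))
            for s in range(0, num_channels, group_size)]

groups = group_channels(120, 32)  # 假设 120 个波段、每组 32 个
assert len(groups) == 4
assert list(groups[-1]) == list(range(96, 120))
```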
[CV-13] DMDSC: A Dynamic-Margin Deep Simplex Classifier for Open-Set Recognition on Medical Image Datasets
【速读】:该论文旨在解决医学影像数据中类别极度不平衡问题对开放集识别(Open-Set Recognition, OSR)带来的双重挑战:模型需在已知类别上保持高分类准确率,同时可靠地拒绝训练阶段未见的未知样本。现有方法如深度单纯形分类器(Deep Simplex Classifier, DSC)和不确定性感知深度单纯形分类器(Uncertainty-aware Deep Simplex Classifier, UCDSC)虽利用神经坍缩(Neural Collapse)实现类间最大分离,但其采用统一边界(uniform margin)未能考虑不同医学类别的密度差异。本文提出动态边际深度单纯形分类器(Dynamic Margin DSC, DMDSC),其核心创新在于引入一种自适应机制,根据标签频率自动调整每个类别的边界,对罕见病理施加更高惩罚并促进更紧密的特征聚类,从而缓解数据不平衡的影响。实验在BloodMNIST、OCTMNIST、DermaMNIST和BreaKHis等多个医学基准数据集上验证了该方法优于当前最先进水平。
链接: https://arxiv.org/abs/2605.00675
作者: Vishal,Arnav Aditya,Nitin Kumar,Saurabh J. Shigwan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Medical imaging datasets are often characterized by extreme class imbalances, where rare pathologies are significantly underrepresented compared to common conditions. This imbalance poses a dual challenge for Open-Set Recognition (OSR): models must maintain high classification accuracy on known classes while reliably rejecting unknown samples unseen during training in clinical settings. While the recently proposed Deep Simplex Classifier (DSC) and UnCertainty-aware Deep Simplex Classifier (UCDSC) successfully leverage Neural Collapse to ensure maximal inter-class separation, they rely on a uniform margin that does not account for the varying densities of medical classes. In this paper, we propose DMDSC, an enhanced framework featuring a dynamic margin approach. Our approach automatically adapts class-specific margins based on label frequency, enforcing a higher penalty and tighter feature clustering for rare pathologies to counteract the effects of data imbalance. Extensive experiments conducted on diverse medical benchmarks, namely the BloodMNIST, OCTMNIST, DermaMNIST, and BreaKHis datasets, demonstrate that our framework outperforms state-of-the-art methods.
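按标签频率自适应调整各类别边界的做法,可用如下草图示意(反比幂律的函数形式与超参均为假设,非论文官方公式):

```python
# 示意性草图:由类别样本数推导动态 margin(函数形式与超参为假设)

def dynamic_margins(class_counts: list[int], base: float = 0.5, gamma: float = 0.5) -> list[float]:
    """类别越稀有,margin 越大,对罕见病理施加更强的特征聚类约束。"""
    max_count = max(class_counts)
    return [base * (max_count / c) ** gamma for c in class_counts]

margins = dynamic_margins([1000, 100, 10])
assert margins[0] == 0.5          # 最常见类保持基准边界
assert margins[2] > margins[1] > margins[0]  # 越稀有,边界越大
```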
[CV-14] Prediction of Alzheimer's Disease Risk Factors from Retinal Images via Deep Learning: Development and Validation of Biologically Relevant Morphological Associations in the UK Biobank
【速读】:该论文旨在解决“彩色眼底照相(colored fundus photography, CFP)是否能够反映阿尔茨海默病(Alzheimer’s Disease, AD)相关风险因素的视网膜结构特征”这一科学问题。其核心解决方案在于利用深度学习(deep learning, DL)模型从大规模UK Biobank眼底图像中提取与12个AD风险因素(包括6个分类变量和6个连续变量)相关的视网膜表征,并通过模型可解释性分析(如显著性图saliency maps)识别关键视网膜区域,进而验证这些结构变化是否与AD前驱期病理机制存在潜在重叠。关键创新点在于:DL模型在预测AD风险因素上表现优异(AUROC达0.5654–0.9480),且显著性分析揭示了视盘和视网膜血管区域为生物学意义明确的贡献区,同时发现高风险个体与对照组在基于显著性的评分(CAM-Score)上存在差异,提示CFP可通过DL方法捕捉到与AD易感性相关的视网膜结构性改变,尽管不能用于诊断,但可作为潜在生物标志物用于早期风险分层。
链接: https://arxiv.org/abs/2605.00665
作者: Seowung Leem,Yunchao Yang,Adam J. Woods,Ruogu Fang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the “Journal of Alzheimer’s Disease” for publication
Abstract:Systemic, metabolic, and lifestyle factors have established associations with Alzheimer’s Disease (AD) through epidemiologic and AD-specific biomarker studies. Whether colored fundus photography (CFP) contains retinal structural signatures corresponding to these AD-related risk domains remains unclear. To determine whether deep learning (DL) models can predict 12 AD-related risk factors from CFP and to characterize the retinal structures underlying these predictions, thereby assessing whether CFP reflects pathways to AD vulnerability. Using UK Biobank CFPs, DL models were trained using 62,876 images from 44,501 unique participants to predict 12 factors linked to AD incidence: 6 categorical (sex, smoking, sleeplessness, economic status, alcohol use, depression) and 6 continuous (age, age at completing education, BMI, systolic, diastolic blood pressure, HbA1c). Model performance, model saliency, and saliency-derived scores (CAM-Score) were evaluated and compared to retinal morphometry. The scores were also compared between incident-AD cases (average 8.55 years before onset) and matched controls. DL performance ranged from AUROC = 0.5654 to 0.9480 for categorical factors and from R2 = -0.0291 to 0.7620 for continuous factors, outperforming most of the morphometry-machine learning models. The saliency-based score consistently highlighted biologically meaningful regions, particularly the optic nerve head and retinal vasculature. It also aligned with present morphometric variations. Several saliency-based scores differed significantly between incident AD and matched controls, suggesting potential overlap between retinal correlates of risk factors and preclinical AD-associated changes. CFP encodes retinal signatures linked to AD risk factors. Although not diagnostic, DL-derived retinal representations may uncover biologically meaningful risk-related structural changes mirroring the potential AD vulnerability.
[CV-15] InpaintSLat: Inpainting Structured 3D Latents via Initial Noise Optimization
【速读】:该论文旨在解决3D图像修复(inpainting)任务中因扩散过程早期阶段对初始噪声敏感而导致的结构稳定性不足问题,尤其是在需要严格遵循现有上下文并合成新结构的任务中。解决方案的关键在于通过优化初始噪声来增强生成结果的上下文一致性和提示词对齐度,具体方法是基于修正流(rectified flow)模型设计了一种反向传播近似策略,并引入频谱参数化(spectral parameterization)以实现高效且鲁棒的结构化3D潜在空间优化,从而在无需训练的情况下提升3D修复质量。
链接: https://arxiv.org/abs/2605.00664
作者: Jaeyoung Chung,Suyoung Lee,Kyoung Mu Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: project page: this https URL
Abstract:We present a training-free approach for controllable 3D inpainting based on initial noise optimization. In the structured 3D latent diffusion framework, we observe that the underlying geometric structure is established during the early stages of the diffusion process and exhibits high sensitivity to the initial noise. Such characteristics compromise stability in tasks like inpainting and editing, where the model must ensure strict alignment with the existing context while synthesizing a new structure. In this paper, we introduce a strategy to optimize the initial noise within the structured 3D latent diffusion framework, ensuring high-fidelity 3D inpainting. Specifically, we update the initial noise by leveraging a backpropagation approximation grounded in the rectified flow model, with the spectral parameterization specially designed for robust and efficient structured 3D latent optimization. Experiments demonstrate consistent improvements in contextual consistency and prompt alignment over representative training-free inpainting baselines, establishing initial noise control as an independent dimension for 3D inpainting, orthogonal to conventional sampling trajectory manipulation.
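初始噪声优化的基本思路可以在一维玩具问题上示意:把一步修正流简化为 x = z + v,对初始噪声 z 做梯度下降,使生成结果贴合目标上下文(速度场、损失与步长均为假设,与论文的反向传播近似和频谱参数化无关):

```python
# 示意性草图:一维玩具版"初始噪声优化"(速度场 v、损失与步长均为假设)

def optimize_initial_noise(z: float, target: float, v: float = 1.0,
                           lr: float = 0.1, steps: int = 100) -> float:
    """最小化 (x - target)^2,其中 x = z + v;解析梯度 dL/dz = 2 (x - target)。"""
    for _ in range(steps):
        x = z + v
        z -= lr * 2.0 * (x - target)
    return z

z_star = optimize_initial_noise(z=0.0, target=3.0)
assert abs((z_star + 1.0) - 3.0) < 1e-6  # 优化后的初始噪声使生成结果逼近目标
```

关键点在于:采样轨迹本身不变,只移动轨迹的起点,这正是摘要所说与"采样轨迹操控"正交的控制维度。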
[CV-16] Affordance Agent Harness: Verification-Gated Skill Orchestration
【速读】:该论文旨在解决开放世界场景中可操作性定位(affordance grounding)的问题,即在视觉模糊、遮挡或反射等复杂条件下准确识别代理应交互的位置与方式。传统方法依赖固定流水线组合多种技能(如检测、分割和交互想象),但存在难以适应实例难度差异、缺乏对中间错误的针对性恢复机制以及无法复用重复对象的经验等问题。其核心解决方案是提出可操作性代理框架(Affordance Agent Harness),通过闭环运行时系统统一异构技能,引入证据存储与成本控制机制,利用情景记忆为常见类别提供先验信息,并借助路由器动态选择和参数化技能;同时,设计专用验证器基于自一致性、跨尺度稳定性及证据充分性判断是否提交决策,在必要时触发局部重试,最终由裁判融合累积证据与轨迹生成预测。此方案显著提升了精度-成本权衡性能,减少了平均技能调用次数与延迟。
链接: https://arxiv.org/abs/2605.00663
作者: Haojian Huang,Jiahao Shi,Yinchuan Li,Yingcong Chen
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 43 pages, 22 figures, 8 tables. Ongoing work
Abstract:Affordance grounding requires identifying where and how an agent should interact in open-world scenes, where actionable regions are often small, occluded, reflective, and visually ambiguous. Recent systems therefore combine multiple skills (e.g., detection, segmentation, interaction-imagination), yet most orchestrate them with fixed pipelines that are poorly matched to per-instance difficulty, offer limited targeted recovery from intermediate errors, and fail to reuse experience from recurring objects. These failures expose a systems problem: test-time grounding must acquire the right evidence, decide whether that evidence is reliable enough to commit, and do so under bounded inference cost without access to labels. We propose Affordance Agent Harness, a closed-loop runtime that unifies heterogeneous skills with an evidence store and cost control, retrieves episodic memories to provide priors for recurring categories, and employs a Router to adaptively select and parameterize skills. An affordance-specific Verifier then gates commitments using self-consistency, cross-scale stability, and evidence sufficiency, triggering targeted retries before a final judge fuses accumulated evidence and trajectories into the prediction. Experiments on multiple affordance benchmarks and difficulty-controlled subsets show a stronger accuracy-cost Pareto frontier than fixed-pipeline baselines, improving grounding quality while reducing average skill calls and latency. Project page: this https URL.
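验证门控的闭环控制流可用如下草图示意:在有限调用预算内依次调用候选技能,由 Verifier 按置信度决定提交或继续重试(函数接口与阈值均为假设,仅示意控制流,非论文官方实现):

```python
# 示意性草图:验证门控的技能编排循环(接口、阈值与预算均为假设)

def run_with_verification(skills, verify, budget: int):
    """依次尝试候选技能;verify 给出置信度,达到阈值即提交,否则重试直到预算耗尽。"""
    calls = 0
    best = None
    for skill in skills:
        if calls >= budget:
            break
        result = skill()
        calls += 1
        conf = verify(result)
        if best is None or conf > best[1]:
            best = (result, conf)   # 证据存储:保留迄今最好的结果
        if conf >= 0.8:             # 门控阈值为假设
            return result, calls
    return (best[0] if best else None), calls

# 用两个假设技能演示:第一个置信度不足,触发定向重试;第二个通过门控
skills = [lambda: "粗略掩码", lambda: "精细掩码"]
verify = lambda r: 0.3 if r == "粗略掩码" else 0.9
out, used = run_with_verification(skills, verify, budget=5)
assert out == "精细掩码" and used == 2
```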
[CV-17] UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors SIGGRAPH2026
【速读】:该论文旨在解决现有视频扩散模型(Video Diffusion Models, VDMs)在多模态图形任务中因采用独立训练模型而导致输入输出映射固定、难以建模跨模态相关性的局限性。为实现统一的多模态视频生成框架,作者提出UniVidX,其核心创新在于三个关键设计:首先,随机条件掩码(Stochastic Condition Masking, SCM)通过在训练时动态划分模态为干净条件与噪声目标,支持全方向条件生成而非固定映射;其次,解耦门控LoRA(Decoupled Gated LoRA, DGL)引入每模态专属的LoRA模块,在该模态作为生成目标时激活,从而保留VDM原生先验;最后,跨模态自注意力机制(Cross-Modal Self-Attention, CMSA)共享键(keys)和值(values),但保持模态特定查询(queries),促进跨模态信息交换与对齐。这一框架在RGB视频与内在图(albedo、irradiance、normal)或RGBA层混合视频等任务中均展现出强泛化能力,即使在少于1000个视频样本下也能稳定表现。
链接: https://arxiv.org/abs/2605.00658
作者: Houyuan Chen,Hong Li,Xianghao Kong,Tianrui Zhu,Shaocong Xu,Weiqing Xiao,Yuwei Guo,Chongjie Ye,Lvmin Zhang,Hao Zhao,Anyi Rao
机构: MMLab@HKUST(香港科技大学多媒体实验室); Beihang University(北京航空航天大学); Nanjing University(南京大学); BAAI(北京人工智能研究院); CUHK-Shenzhen(深圳香港中文大学); Stanford University(斯坦福大学); Tsinghua University(清华大学); MMLab@CUHK(香港中文大学多媒体实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL Accepted to ACM Transactions on Graphics (Proceedings of SIGGRAPH 2026)
Abstract:Recent progress has shown that video diffusion models (VDMs) can be repurposed for diverse multimodal graphics tasks. However, existing methods often train separate models for each problem setting, which fixes the input-output mapping and limits the modeling of correlations across modalities. We present UniVidX, a unified multimodal framework that leverages VDM priors for versatile video generation. UniVidX formulates pixel-aligned tasks as conditional generation in a shared multimodal space, adapts to modality-specific distributions while preserving the backbone’s native priors, and promotes cross-modal consistency during synthesis. It is built on three key designs. Stochastic Condition Masking (SCM) randomly partitions modalities into clean conditions and noisy targets during training, enabling omni-directional conditional generation instead of fixed mappings. Decoupled Gated LoRA (DGL) introduces per-modality LoRAs that are activated when a modality serves as the generation target, preserving the strong priors of the VDM. Cross-Modal Self-Attention (CMSA) shares keys and values across modalities while keeping modality-specific queries, facilitating information exchange and inter-modal alignment. We instantiate UniVidX in two domains: UniVid-Intrinsic, for RGB videos and intrinsic maps including albedo, irradiance, and normal; and UniVid-Alpha, for blended RGB videos and their constituent RGBA layers. Experiments show that both models achieve performance competitive with state-of-the-art methods across distinct tasks and generalize robustly to in-the-wild scenarios, even when trained on fewer than 1,000 videos. Project page: this https URL
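随机条件掩码(SCM)的划分逻辑可用如下草图示意:训练时把模态随机分为干净条件与加噪目标,并保证两侧各至少有一个模态,从而覆盖全方向的条件生成(模态名与随机种子均为假设):

```python
# 示意性草图:随机条件掩码 SCM 的划分逻辑(模态名与种子为假设)
import random

def stochastic_condition_mask(modalities: list[str], rng: random.Random):
    """随机划分为(干净条件, 加噪目标),两侧均非空。"""
    assert len(modalities) >= 2
    k = rng.randint(1, len(modalities) - 1)   # 目标模态数量
    targets = rng.sample(modalities, k)
    conditions = [m for m in modalities if m not in targets]
    return conditions, targets

mods = ["rgb", "albedo", "irradiance", "normal"]
cond, tgt = stochastic_condition_mask(mods, random.Random(0))
assert len(cond) >= 1 and len(tgt) >= 1
assert sorted(cond + tgt) == sorted(mods)  # 划分覆盖全部模态
```

推理时任意给定的干净模态子集都落在训练分布内,这就是"全方向条件生成"的来源。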
[CV-18] Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
【速读】:该论文旨在解决GUI grounding任务中基于强化学习的方法(如GRPO)因依赖昂贵的多轮rollout和在困难样本上信号稀疏而导致的训练效率低、性能受限的问题。其解决方案的关键在于提出首个专为GUI grounding设计的on-policy self-distillation(OPSD)框架GUI-SD:首先构建包含目标边界框和高斯软掩码的视觉增强特权上下文作为教师模型,提供不泄露精确坐标的指导信息;其次采用熵引导的蒸馏机制,根据数字重要性和教师置信度自适应加权token,聚焦于最具影响力且可靠的定位区域,从而实现更高效且准确的训练。
链接: https://arxiv.org/abs/2605.00642
作者: Yan Zhang,Daiqing Wu,Huawen Shen,Yu Zhou,Can Ma
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); VCIP TMCC DISSec, College of Computer Science, Nankai University (南开大学计算机学院VCIP TMCC DISSec); School of Cyber Security, University of Chinese Academy of Sciences (中国科学院大学网络空间安全学院)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: under review
Abstract:Graphical User Interface (GUI) grounding maps natural language instructions to the visual coordinates of target elements and serves as a core capability for autonomous GUI agents. Recent reinforcement learning methods (e.g., GRPO) have achieved strong performance, but they rely on expensive multiple rollouts and suffer from sparse signals on hard samples. These limitations make on-policy self-distillation (OPSD), which provides dense token-level supervision from a single rollout, a promising alternative. However, its applicability to GUI grounding remains unexplored. In this paper, we present GUI-SD, the first OPSD framework tailored for GUI grounding. First, it constructs a visually enriched privileged context for the teacher using a target bounding box and a Gaussian soft mask, providing informative guidance without leaking exact coordinates. Second, it employs entropy-guided distillation, which adaptively weights tokens based on digit significance and teacher confidence, concentrating optimization on the most impactful and reliable positions. Extensive experiments on six representative GUI grounding benchmarks show that GUI-SD consistently outperforms GRPO-based methods and naive OPSD in both accuracy and training efficiency. Code and training data are available at this https URL.
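教师特权上下文中的高斯软掩码可以这样构造:以目标边界框为中心生成二维高斯衰减图,提示目标大致位置但不泄露精确坐标(sigma 的取法为假设,非论文官方实现):

```python
# 示意性草图:由边界框生成二维高斯软掩码(sigma 取半框宽/高,为假设)
import math

def gaussian_soft_mask(h: int, w: int, box):
    """box = (x0, y0, x1, y1);返回 h×w 软掩码,框中心处为 1,向外平滑衰减。"""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    sx = max((x1 - x0) / 2, 1e-6)
    sy = max((y1 - y0) / 2, 1e-6)
    return [[math.exp(-(((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2) / 2)
             for x in range(w)] for y in range(h)]

mask = gaussian_soft_mask(8, 8, (2, 2, 6, 6))
assert abs(mask[4][4] - 1.0) < 1e-9   # 框中心处峰值为 1
assert mask[0][0] < mask[4][4]        # 远离中心平滑衰减
```

正因为掩码是平滑衰减而非硬框,教师能获得位置先验,却无法从中直接读出精确点击坐标。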
[CV-19] Paired-CSLiDAR: Height-Stratified Registration for Cross-Source Aerial-Ground LiDAR Pose Refinement
【速读】:该论文旨在解决跨源机载-地面激光雷达(LiDAR)扫描中单次扫描位姿精化(single-scan pose refinement)的问题,即在50米半径的机载扫描范围内对地面扫描的初始位姿进行高精度优化。由于机载扫描主要捕捉屋顶和树冠,而地面扫描聚焦立面和林下结构,两者几何共享有限,仅地形表面部分重叠,导致标准配准方法和学习型对应模型易陷入度量错误的局部极小值。解决方案的关键在于提出一种无需训练、纯几何驱动的分层精化流程——残差引导分层配准(Residual-Guided Stratified Registration, RGSR),其核心机制包括:基于高度分层的ICP(Iterative Closest Point)策略、反向注册方向设计以及置信度门控的“优则保留”选择策略,从而有效利用共享地表信息提升位姿估计精度。
链接: https://arxiv.org/abs/2605.00634
作者: Montana Hoover,Jing Liang,Tianrui Guan,Dinesh Manocha
机构: University of Maryland (马里兰大学); Stanford University (斯坦福大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 4 figures. Dataset and code are being prepared for public release
Abstract:We introduce Paired-CSLiDAR (CSLiDAR), a cross-source aerial-ground LiDAR benchmark for single-scan pose refinement: refining a ground-scan pose within a 50 m-radius aerial crop. The benchmark contains 12,683 ground-aerial pairs across 6 evaluation sites and per-scan reference 6-DoF alignments for sub-meter root-mean-square error (RMSE) evaluation. Because aerial scans capture rooftops and canopy while ground scans capture facades and under-canopy, the two modalities share only a fraction of their geometry, primarily the terrain surface, causing standard registration methods and learned correspondence models to converge to metrically incorrect local minima. We propose Residual-Guided Stratified Registration (RGSR), a training-free, geometry-only refinement pipeline that exploits the shared ground plane through height-stratified ICP, reversed registration directions, and confidence-gated accept-if-better selection. RGSR achieves 86.0% S@0.75 m and 99.8% S@1.0 m on the primary benchmark of 9,012 scans, outperforming both the confidence-gated cascade at 83.7% and GeoTransformer at 76.3%. We validate RMSE-based pose selection with independent survey control and trajectory consistency, and show that added Fourier-Mellin BEV proposals can reduce RMSE while increasing actual pose error under extreme partial overlap. The dataset and code are being prepared for public release.
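RGSR 的两个关键机制,按高度分层与"优则保留"门控,可用如下草图示意(等宽分箱的分层方式与 RMSE 判据均为简化假设,非论文官方实现):

```python
# 示意性草图:点云高度分层 + 优则保留门控(分箱方式与判据为简化假设)

def stratify_by_height(points, n_bins: int):
    """points: [(x, y, z), ...];按 z 值等宽分箱,返回各层的点列表。"""
    zs = [p[2] for p in points]
    lo, hi = min(zs), max(zs)
    width = (hi - lo) / n_bins or 1.0   # 所有点同高时退化为单箱
    bins = [[] for _ in range(n_bins)]
    for p in points:
        i = min(int((p[2] - lo) / width), n_bins - 1)
        bins[i].append(p)
    return bins

def accept_if_better(current_rmse: float, candidate_rmse: float) -> bool:
    """置信度门控的简化版:仅当候选位姿的配准误差更小时才接受。"""
    return candidate_rmse < current_rmse

pts = [(0, 0, 0.1), (0, 0, 0.9), (0, 0, 1.8), (0, 0, 2.7)]
layers = stratify_by_height(pts, 3)
assert [len(b) for b in layers] == [2, 1, 1]
assert accept_if_better(0.9, 0.6) and not accept_if_better(0.6, 0.9)
```

近地面的层主要包含双方共享的地形表面,对这一层做 ICP 正是利用跨源几何中唯一可靠的重叠部分。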
[CV-20] CMTA: Leveraging Cross-Modal Temporal Artifacts for Generalizable AI-Generated Video Detection
【速读】:该论文旨在解决生成式 AI 视频(AIGV)检测中因忽视视觉-文本跨模态时空一致性而导致的准确性不足问题。现有方法主要依赖单模态或时空伪影特征,未能有效利用跨模态空间中的语义对齐时序稳定性这一关键线索。解决方案的关键在于提出一种名为跨模态时间伪影(Cross-Modal Temporal Artifact, CMTA)的新指纹机制:通过 BLIP 生成帧级图像描述,并结合 CLIP 提取视觉-文本嵌入表示;进一步设计粗粒度与细粒度双分支时序建模结构——其中粗粒度分支使用 GRU 捕捉跨模态对齐的全局时序波动,细粒度分支采用 Transformer 编码器挖掘帧间精细语义变化,从而精准识别 AIGV 中由输入提示驱动的异常稳定语义轨迹。
链接: https://arxiv.org/abs/2605.00630
作者: Hang Wang,Chao Shen,Chenhao Lin,Minghui Yang,Lei Zhang,Cong Wang
机构: Xi’an Jiaotong University (西安交通大学); The Hong Kong Polytechnic University (香港理工大学); Guangdong OPPO Mobile Communications Co., Ltd. (广东OPPO移动通信有限公司); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
备注: 15 pages, 4 figures
Abstract:The proliferation of advanced AI video synthesis techniques poses an unprecedented challenge to digital video authenticity. Existing AI-generated video (AIGV) detection methods primarily focus on uni-modal or spatiotemporal artifacts, but they overlook the rich cues within the visual-textual cross-modal space, especially the temporal stability of semantic alignment. In this work, we identify a distinctive fingerprint in AIGVs, termed cross-modal temporal artifact (CMTA). Unlike real videos that exhibit natural temporal fluctuations in cross-modal alignment due to semantic variations, AIGVs display unnaturally stable semantic trajectories governed by given input prompts. To bridge this gap, we propose the CMTA framework, a cross-modal detection approach that captures these unique temporal artifacts through joint cross-modal embedding and multi-grained temporal modeling. Specifically, CMTA leverages BLIP to generate frame-level image captions and utilizes CLIP to extract corresponding visual-textual representations. A coarse-grained temporal modeling branch is then designed to characterize temporal fluctuations in cross-modal alignment with a GRU. In parallel, a fine-grained branch is constructed to capture intricate inter-frame variations from integrated visual-textual features with a Transformer encoder. Extensive experiments on 40 subsets across four large-scale datasets, including GenVideo, EvalCrafter, VideoPhy, and VidProM, validate that our approach sets a new state-of-the-art while exhibiting superior cross-generator generalization. Code and models of CMTA will be released at this https URL
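CMTA 所刻画的跨模态时间伪影可用一个玩具例子示意:逐帧计算视觉-文本嵌入的余弦相似度序列,再以其波动程度区分真实视频(自然起伏,波动大)与 AIGV(受提示词支配,异常平稳)。嵌入为手工构造的玩具数据,非 CLIP/BLIP 真实输出:

```python
# 示意性草图:跨模态对齐的时间波动作为 AIGV 指纹(嵌入为玩具数据)
import math
import statistics

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def alignment_fluctuation(vis_seq, txt_seq) -> float:
    """逐帧视觉-文本余弦相似度序列的总体标准差,越小说明语义轨迹越平稳。"""
    sims = [cosine(v, t) for v, t in zip(vis_seq, txt_seq)]
    return statistics.pstdev(sims)

txt = [[1.0, 0.0]] * 4
real_vis = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [0.9, 0.4]]  # 语义自然起伏
aigv_vis = [[0.9, 0.1]] * 4                                   # 轨迹异常平稳
assert alignment_fluctuation(real_vis, txt) > alignment_fluctuation(aigv_vis, txt)
```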
[CV-21] Faithful Extreme Image Rescaling with Learnable Reversible Transformation and Semantic Priors
【速读】:该论文旨在解决极端图像缩放(Extreme Image Rescaling, EIR)中因低分辨率到高分辨率映射的病态性(ill-posed nature)导致的语义结构不一致和细节失真问题,尤其是在缩放因子达到16×或更高时。解决方案的关键在于提出一个基于扩散模型(diffusion-based framework)的FaithEIR框架:首先,受奇异值分解(Singular Value Decomposition, SVD)启发,设计可学习的可逆变换(learnable reversible transformation),实现潜在空间中的可逆下采样与上采样;其次,引入自适应细节先验(adaptive detail prior),即高频字典(high-frequency dictionary),用于补偿量化带来的信息损失;最后,设计轻量级像素语义嵌入器(lightweight pixel semantic embedder),为预训练扩散模型提供语义条件控制,从而提升重建质量和感知保真度。
链接: https://arxiv.org/abs/2605.00605
作者: Hao Wei,Yanhui Zhou,Chenyang Ge,Saeed Anwar,Ajmal Mian
机构: Xi’an Jiaotong University (西安交通大学); The University of Western Australia (西澳大利亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Most recent extreme rescaling methods struggle to preserve semantically consistent structures and produce realistic details, due to the severely ill-posed nature of low- to high-resolution mapping under scaling factors of 16\times or higher. To alleviate the above problems, we propose FaithEIR, a diffusion-based framework for extreme image rescaling. Inspired by singular value decomposition, we develop a learnable reversible transformation that enables invertible downscaling and upscaling in the latent space. To compensate for information loss due to quantization, we propose an adaptive detail prior, a high-frequency dictionary that captures the empirical average of commonly occurring structures in the training data. Finally, we design a lightweight pixel semantic embedder to provide semantic conditioning for the pretrained diffusion model. We present extensive experimental results demonstrating that our FaithEIR consistently outperforms state-of-the-art methods, achieving superior reconstruction fidelity and perceptual quality. Our code, model weights, and detailed results are released at this https URL.
[CV-22] Possibilistic Predictive Uncertainty for Deep Learning ICML2026
【速读】:该论文旨在解决深度神经网络在面对未见输入时过度自信的问题,从而实现可靠的认知不确定性(epistemic uncertainty)建模。现有方法面临根本性困境:贝叶斯方法虽具理论严谨性但计算开销大,而高效的二阶预测器缺乏与认知不确定性量化之间严格的数学关联。解决方案的关键在于提出Dirichlet-approximated possibilistic posterior predictions (DAPPr),其核心是基于可能性理论构建参数后验分布,并通过上确界算子将该后验投影至预测空间,再使用可学习的狄利克雷(Dirichlet)可能性函数对投影后的后验进行近似。这一“投影-近似”策略使得训练目标简洁且具有闭式解,兼顾了理论严谨性和计算效率,在多个基准测试中表现出优于或相当的不确定性量化性能。
链接: https://arxiv.org/abs/2605.00600
作者: Yao Ni,Jeremie Houssineau,Yew Soon Ong,Piotr Koniusz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML 2026
Abstract:Deep neural networks achieve impressive results across diverse applications, yet their overconfidence on unseen inputs necessitates reliable epistemic uncertainty modelling. Existing methods for uncertainty modelling face a fundamental dilemma: Bayesian approaches provide principled estimates but remain computationally prohibitive, while efficient second-order predictors lack rigorous derivations connecting their specific objectives to epistemic uncertainty quantification. To resolve this dilemma, we introduce Dirichlet-approximated possibilistic posterior predictions (DAPPr), a principled framework leveraging possibility theory. We define a possibilistic posterior over parameters, project this posterior to the prediction space via supremum operators, and approximate the projected posterior using learnable Dirichlet possibility functions. This projection-and-approximation strategy yields a simple training objective with closed-form solutions. Extensive experiments across diverse benchmarks demonstrate that our approach achieves competitive or superior uncertainty quantification performance compared to state-of-the-art evidential deep learning methods while maintaining both principled derivation and computational efficiency. Code will be available at this https URL.
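由狄利克雷浓度参数得到期望概率与认知不确定性的一般形式,可用如下草图示意。此处采用证据深度学习中常见的主观逻辑写法 u = K / S(S 为浓度之和),仅作第二阶不确定性的背景说明,并非 DAPPr 的官方推导:

```python
# 示意性草图:由狄利克雷浓度参数计算期望概率与不确定性(主观逻辑形式,为背景说明)

def dirichlet_prediction(alphas: list[float]):
    """alphas: 各类别的狄利克雷浓度;返回 (期望类别概率, 不确定性 u = K/S)。"""
    K = len(alphas)
    S = sum(alphas)
    probs = [a / S for a in alphas]   # 期望类别概率
    uncertainty = K / S               # 证据总量越大,不确定性越低
    return probs, uncertainty

p_conf, u_conf = dirichlet_prediction([20.0, 1.0, 1.0])  # 证据充分的样本
p_ood, u_ood = dirichlet_prediction([1.0, 1.0, 1.0])     # 近似均匀先验(类 OOD)
assert u_ood > u_conf
assert abs(sum(p_conf) - 1.0) < 1e-9
```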
[CV-23] Robust Fusion of Object-Level V2X for Learned 3D Object Detection
【速读】:该论文旨在解决自动驾驶中基于车载传感器(如摄像头和雷达)的感知系统在遮挡或低能见度条件下易失效的问题,其核心挑战在于如何有效融合车辆与基础设施通过V2X通信获取的协作感知信息(object-level cooperative awareness),并提升系统对现实世界V2X不完美性(如延迟、定位误差、低渗透率)的鲁棒性。解决方案的关键在于提出一种噪声感知训练策略(noise-aware training strategy),结合显式的置信度编码机制(explicit confidence encoding),将模拟真实V2X缺陷的数据(如注入噪声和对象丢失)用于模型训练,从而显著增强检测模型在不利条件下的稳定性与性能表现,实现从理想化数据到实际部署场景的有效迁移。
链接: https://arxiv.org/abs/2605.00595
作者: Lukas Ostendorf,Lennart Reiher,Onn Haran,Lutz Eckstein
机构: RWTH Aachen University (亚琛工业大学); Qualcomm Technologies, Inc. (高通技术公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted at IEEE VTC 2026-Spring, 7 pages
Abstract:Perception for automated driving is largely based on onboard environmental sensors, such as cameras and radar, which are cost-effective but limited by line-of-sight and field-of-view constraints. These inherent limitations may cause onboard perception to fail under occlusions or poor visibility conditions. In parallel, cooperative awareness via vehicle-to-everything (V2X) communication is becoming increasingly available, enabling vehicles and infrastructure to share their own state as object-level information that complements onboard perception. In this work, we study how such V2X information can be integrated into 3D object detection and how robust the resulting system is to realistic V2X imperfections. Using the nuScenes dataset, we emulate object-level cooperative awareness messages from ground truth, injecting controlled noise and object dropout to mimic real-world conditions such as latency, localization errors, and low V2X penetration rates. We convert these messages into a dedicated bird’s-eye view (BEV) input and fuse them into a BEVFusion-style detector. Our results demonstrate that while object-level cooperative information can substantially improve detection performance, achieving an NDS of 0.80 under favorable conditions, models trained on idealized data become fragile and over-reliant on V2X. Conversely, our proposed noise-aware training strategy, coupled with explicit confidence encoding, enhances robustness, maintaining performance gains even under severe noise and reduced V2X penetration.
[CV-24] Intrinsic Gradient Suppression for Label-Noise Prompt Tuning in Vision-Language Models
【速读】:This paper addresses the sensitivity of prompt tuning for contrastive vision-language models (e.g., CLIP) to label noise: mislabeled samples generate disproportionately large gradients that overwhelm the pre-trained priors and degrade performance. The key to the solution is Double-Softmax Prompt Tuning (DSPT), which builds a self-adaptive saturation zone through a sequential probabilistic normalization, automatically suppressing gradients from high-error noisy samples while preserving informative updates, without extra hyperparameter tuning. This design turns "gradient vanishing", traditionally a training problem, into a principled noise-filtering mechanism and substantially improves robustness under label noise.
链接: https://arxiv.org/abs/2605.00591
作者: Jiayu Li,Jiaxin Qi,Sheng Zhou,Jiaqiang Huang,Xiansheng Hua
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Contrastive vision-language models like CLIP exhibit remarkable zero-shot generalization. However, prompt tuning remains highly sensitive to label noise, as mislabeled samples generate disproportionately large gradients that can overwhelm pre-trained priors. We argue that because CLIP already provides a near-optimal initialization, adaptation should be inherently conservative, particularly against the extreme gradient updates common in noisy settings. To this end, we propose Double-Softmax Prompt Tuning (DSPT), a hyperparameter-free method for intrinsic gradient suppression. By applying a sequential probabilistic normalization, DSPT induces a self-adaptive saturation zone that suppresses gradients from high-error noisy samples while maintaining informative updates. We also provide both theoretical analysis and empirical evidence about how this mechanism achieves adaptive suppression. This design transforms "gradient vanishing", traditionally a training bottleneck, into a principled noise-filtering shield for label-noise prompt tuning. Extensive experiments confirm that this simple, drop-in design achieves state-of-the-art robustness across various noisy benchmarks, outperforming methods with complex architectures and handcrafted hyperparameters.
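The "sequential probabilistic normalization" at the heart of DSPT can be sketched as a softmax applied twice. This is a hedged reading of the abstract (the paper's exact formulation and loss are not given here; the logits below are illustrative): because the first softmax maps logits into [0, 1], the second softmax sees inputs of bounded range, so its output can never be extremely peaked and the gradient from a confidently-wrong (possibly mislabeled) sample saturates.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def double_softmax(logits):
    # First softmax: logits -> probabilities in [0, 1].
    # Second softmax: treats those probabilities as new "logits" whose
    # range is at most 1, which caps output peakedness and hence the
    # cross-entropy gradient magnitude for high-error samples.
    return softmax(softmax(logits))

logits = np.array([[8.0, -4.0, -4.0]])  # extreme logits, e.g. a noisy-label sample
p_single = softmax(logits)              # near one-hot
p_double = double_softmax(logits)       # much flatter distribution
```

Note how the second normalization compresses the distribution: the single softmax is nearly one-hot, while the double softmax stays well below 0.6 on its top class, illustrating the saturation zone described in the abstract.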
[CV-25] Jailbreaking Vision-Language Models Through the Visual Modality ICML2026
【速读】:This paper targets the visual-modality vulnerability in the safety alignment of vision-language models (VLMs): current text-based safety training fails to cover harmful intent conveyed visually. The key of the solution is to identify and exploit four novel visual jailbreak attacks: (1) encoding harmful instructions as visual symbol sequences with a decoding legend; (2) replacing harmful objects with benign substitutes (e.g., "bomb" with "banana") and then triggering the harmful behavior via the prompt; (3) replacing text in images (e.g., on book covers) with benign words while the visual context preserves the original meaning; and (4) designing visual analogy puzzles whose solution requires the model to infer a prohibited concept. Experiments show these visual attacks substantially bypass text-based safety mechanisms, revealing a cross-modality alignment gap and underscoring the need to treat vision as a first-class target of safety post-training.
链接: https://arxiv.org/abs/2605.00583
作者: Aharon Azulay,Jan Dubiński,Zhuoyun Li,Atharv Mittal,Yossi Gandelsman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ICML 2026
Abstract:The visual modality of vision-language models (VLMs) is an underexplored attack surface for bypassing safety alignment. We introduce four jailbreak attacks exploiting the vision component: (1) encoding harmful instructions as visual symbol sequences with a decoding legend, (2) replacing harmful objects with benign substitutes (e.g., bomb - banana) then prompting for harmful actions using the substitute term, (3) replacing harmful text in images (e.g., on book covers) with benign words while visual context preserves the original meaning, and (4) visual analogy puzzles whose solution requires inferring a prohibited concept. Evaluating across six frontier VLMs, our visual attacks bypass safety alignment and expose a cross-modality alignment gap: text-based safety training does not automatically generalize to harmful intent conveyed visually. For example, our visual cipher achieves 40.9% attack success on Claude-Haiku-4.5 versus 10.7% for an equivalent textual cipher. To further our insight into the attack mechanism, we present preliminary interpretability and mitigation results. These findings highlight that robust VLM alignment requires treating vision as a first-class target for safety post-training.
[CV-26] Federated Distillation for Whole Slide Image via Gaussian-Mixture Feature Alignment and Curriculum Integration ICML2026
【速读】:This paper addresses the performance degradation of federated learning (FL) in digital pathology caused by cross-institution data heterogeneity, in particular the inconsistent feature distributions induced by diverse multiple instance learning (MIL) architectures and heterogeneous feature extractors. The key of the proposed FedHD framework is local Gaussian-mixture feature alignment, which enforces semantic consistency across institutional feature spaces: instead of exchanging model parameters, each client independently distills synthetic feature representations aligned with the distribution of its real whole slide images (WSIs); a one-to-one distillation strategy preserves diagnostic diversity and avoids over-compression; a curriculum-based integration strategy progressively incorporates cross-site synthetic features to stabilize local training; and an optional interpretation module reconstructs pseudo-patches from synthetic embeddings to improve transparency. The method is architecture-agnostic, privacy-preserving, and supports personalized yet collaborative federated training.
链接: https://arxiv.org/abs/2605.00578
作者: Luru Jing,Cong Cong,Yanyuan Chen,Yongzhi Cao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML 2026
Abstract:Federated learning (FL) offers a promising framework for collaborative digital pathology by enabling model training across institutions. However, real-world deployments face heterogeneity arising from diverse multiple instance learning (MIL) architectures and heterogeneous feature extractors across institutions. We propose FedHD, a novel FL framework that performs local Gaussian-mixture feature alignment tailored for WSI analysis. Instead of exchanging model parameters, each client independently distills semantically rich synthetic feature representations aligned with the distribution of real WSIs. To preserve diagnostic diversity, FedHD adopts a one-to-one distillation strategy, generating a synthetic counterpart for each real slide to avoid over-compression. During federation, a curriculum-based integration strategy progressively incorporates cross-site synthetic features into local training once performance plateaus. Furthermore, an optional interpretation module reconstructs pseudo-patches from synthetic embeddings, enhancing transparency. FedHD is architecture-agnostic, privacy-preserving, and supports personalized yet collaborative training across diverse institutions. Experiments on TCGA-IDH, CAMELYON16, and CAMELYON17 show that FedHD consistently outperforms state-of-the-art federated and distillation baselines.
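The curriculum stage described above is gated on a performance plateau: cross-site synthetic features are mixed in only "once performance plateaus". The abstract does not spell out the plateau criterion, so the sketch below uses a common patience-based rule with illustrative `patience` and `min_delta` values:

```python
def plateau_reached(val_history, patience=3, min_delta=1e-3):
    """Curriculum trigger: return True once local validation accuracy has
    stopped improving, signaling that cross-site synthetic features should
    start being mixed into local training. `patience` and `min_delta` are
    illustrative choices, not the paper's exact criterion."""
    if len(val_history) <= patience:
        return False  # not enough history to judge a plateau
    best_before = max(val_history[:-patience])   # best score before the window
    best_recent = max(val_history[-patience:])   # best score inside the window
    return best_recent < best_before + min_delta

# Accuracy still climbing -> keep training on local (real) features only.
still_improving = plateau_reached([0.50, 0.60, 0.70, 0.75, 0.80, 0.85])
# Accuracy flat for 3 rounds -> switch on the curriculum stage.
flat = plateau_reached([0.50, 0.60, 0.70, 0.700, 0.699, 0.700])
```

The same gate can be evaluated per client, so each institution enters the curriculum stage on its own schedule.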
[CV-27] 2D-SuGaR: Surface-Aware Gaussian Splatting for Geometrically Accurate Mesh Reconstruction
【速读】:This paper addresses the sensitivity of 2D Gaussian Splatting (2DGS) to its initial parameters in geometric reconstruction, in particular the subpar results produced when relying on Structure-from-Motion (SfM) initialization, which hurts surface accuracy and robustness. The key of the solution is to incorporate monocular depth and normal priors: a depth-guided Gaussian initialization strategy together with a clustering-based method for pruning degenerate Gaussians improves geometric accuracy and stability, achieving state-of-the-art mesh reconstruction on the DTU dataset while preserving high-quality novel view synthesis.
链接: https://arxiv.org/abs/2605.00569
作者: Prajwal Gupta C. R.,Divyam Sheth,Jinjoo Ha,Mirela Ostrek,Justus Thies
机构: TU Darmstadt; ELIZA; Max Planck Institute for Intelligent Systems
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
Abstract:3D Gaussian Splatting (3DGS) has emerged as a powerful technique for generating photorealistic renderings of a scene in real-time. However, the volumetric nature of 3DGS limits its ability to accurately capture surface geometry. To address this, 2D Gaussian Splatting (2DGS) was proposed to enable view-consistent and geometrically accurate surface reconstruction from multi-view images. However, 2DGS can be sensitive to the initialization of the Gaussian primitives. Reliance on Structure-from-Motion (SfM) initializations, which can produce poor estimates on challenging image sets, may lead to subpar results. In this work, we enhance 2DGS by incorporating monocular depth and normal priors to improve both geometric accuracy and robustness. We propose a depth-guided initialization strategy for Gaussians and introduce a clustering-based technique for pruning degenerate Gaussians. We evaluate our method on the DTU dataset, where it achieves state-of-the-art results in mesh reconstruction while preserving high-quality novel view synthesis.
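A depth-guided initialization of this kind typically starts from pinhole back-projection: lifting each pixel of a monocular depth map into camera space to seed Gaussian centers. The sketch below shows only that standard step (the intrinsics, resolution, and flat-plane depth are illustrative; the paper's full strategy, including normal priors and degenerate-Gaussian pruning, is not reproduced):

```python
import numpy as np

def backproject_depth(depth, K):
    """Lift a depth map (H, W) to camera-space 3D points via the pinhole
    model with intrinsics K -- one plausible way to seed Gaussian centers
    from a monocular depth prior."""
    H, W = depth.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # pixel coordinates
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

# Toy camera: 64x48 image, focal length 500 px, principal point at (32, 24).
K = np.array([[500.0, 0.0, 32.0],
              [0.0, 500.0, 24.0],
              [0.0,   0.0,  1.0]])
depth = np.full((48, 64), 2.0)  # a flat plane 2 m in front of the camera
pts = backproject_depth(depth, K)
```

Each resulting point would then become an initial Gaussian center, rather than relying on a sparse SfM point cloud.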
[CV-28] Depth-Guided Privacy-Preserving Visual Localization Using 3D Sphere Clouds BMVC2024
【速读】:This paper addresses the privacy risks of visual localization with sparse 3D point clouds, where high-fidelity scene details can be leaked. In particular, existing privacy-preserving methods that lift map points to randomly oriented 3D lines are vulnerable to a density-based attack, which recovers the point cloud geometry by analyzing the neighborhood statistics of the lines. The key of the solution is a new privacy-preserving scene representation, the sphere cloud: all map points are lifted to 3D lines passing through the map centroid, resembling points on a unit sphere. Because these lines are densest at the centroid, the density-based attack is misled into incorrectly placing points at the centroid, effectively neutralizing it. This, however, introduces two new challenges: a new type of attack that may directly recover images from the sphere cloud representation, and an unresolved translation scale in camera pose estimation. The authors therefore design a simple yet effective cloud construction strategy to thwart the new attack, and propose an efficient localization framework that resolves the translation scale using absolute depth maps acquired from on-device time-of-flight (ToF) sensors. Experiments show that the approach achieves competitive privacy protection and localization runtime while causing less damage to pose estimation accuracy than other depth-guided localization methods.
链接: https://arxiv.org/abs/2605.00562
作者: Heejoon Moon,Jongwoo Lee,Jeonggon Kim,Je Hyeong Hong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to BMVC 2024
Abstract:The emergence of deep neural networks capable of revealing high-fidelity scene details from sparse 3D point clouds has raised significant privacy concerns in visual localization involving private maps. Lifting map points to randomly oriented 3D lines is a well-known approach for obstructing undesired recovery of the scene images, but these lines are vulnerable to a density-based attack that can recover the point cloud geometry by observing the neighborhood statistics of lines. With the aim of nullifying this attack, we present a new privacy-preserving scene representation called sphere cloud, which is constructed by lifting all points to 3D lines crossing the centroid of the map, resembling points on the unit sphere. Since lines are most dense at the map centroid, the sphere cloud misleads the density-based attack algorithm into incorrectly yielding points at the centroid, effectively neutralizing the attack. Nevertheless, this advantage comes at the cost of i) a new type of attack that may directly recover images from this cloud representation and ii) unresolved translation scale for camera pose estimation. To address these issues, we introduce a simple yet effective cloud construction strategy to thwart the new attack and propose an efficient localization framework to guide the translation scale by utilizing absolute depth maps acquired from on-device time-of-flight (ToF) sensors. Experimental results on public RGB-D datasets demonstrate that the sphere cloud achieves competitive privacy-preserving ability and localization runtime while not excessively compromising pose estimation accuracy compared to other depth-guided localization methods.
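The sphere-cloud construction itself is simple to sketch: each map point is replaced by the line through the map centroid in that point's direction, so only the bearing from the centroid survives and every line passes through the same point. This is a simplified reading of the paper's representation (the anti-recovery refinements and scale handling are omitted; the random map below is synthetic):

```python
import numpy as np

def lift_to_sphere_cloud(points):
    """Lift every 3D map point to a line through the map centroid,
    returned as (centroid, unit directions). The point's position along
    its line is discarded -- only its bearing from the centroid remains."""
    centroid = points.mean(axis=0)
    dirs = points - centroid
    dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)  # unit-sphere directions
    return centroid, dirs

rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 3)) * 5.0          # synthetic private map
centroid, dirs = lift_to_sphere_cloud(pts)
# All lines intersect at the centroid, so a density-based attack that
# looks for where lines cluster recovers only the centroid, not the map.
```

Each original point still lies on its own line, which is what keeps the representation usable for 2D-3D correspondence matching during localization.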
[CV-29] Colorful-Noise: Training-Free Low-Frequency Noise Manipulation for Color-Based Conditional Image Generation SIGGRAPH2026
【速读】:This paper addresses the limited controllability of image diffusion models over specific visual attributes: because white Gaussian noise is used as the initial input, generation is diverse but hard to predict and steer. The key of the solution is the observation that the low-frequency components of white Gaussian noise dominate the image's global structure and color composition while the high-frequency components control fine details. Building on this, simple manipulations of the low-frequency noise using low-frequency image priors can effectively steer overall image structure and color in a training-free manner with minimal overhead, while the high-frequency components remain free to provide detail diversity and randomness, yielding efficient and controllable adjustment of the generation process.
链接: https://arxiv.org/abs/2605.00548
作者: Nadav Z. Cohen,Ofir Abramovich,Ariel Shamir
机构: Reichman University (里奇曼大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: SIGGRAPH 2026 Conference Paper. Project Page at: this https URL
Abstract:Text-to-image diffusion models generate images by gradually converting white Gaussian noise into a natural image. White Gaussian noise is well suited for producing diverse outputs from a single text prompt due to its absence of structure. However, this very property limits control over, and predictability of, specific visual attributes, as the noise is not human-interpretable. In this work, we investigate the characteristics of the input noise in diffusion models. We show that, although all frequencies in white Gaussian noise have comparable statistical energy, low-frequency components primarily determine the image's global structure and color composition, while high-frequency components control finer details. Building on this observation, we demonstrate that simple manipulations of the low-frequency noise using low-frequency image priors can effectively condition the generation process to reconstruct these low-frequency visual cues. This allows us to define a simple, training-free method with minimal overhead that steers overall image structure and color, while letting high-frequency components freely emerge as fine details, enabling variability across generated outputs.
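The low-frequency manipulation described above can be illustrated with a plain 2D FFT: replace the low-frequency Fourier coefficients of a Gaussian noise image with those of an image prior and keep the high frequencies untouched. This is a hedged sketch, not the paper's exact recipe (the cutoff radius, direct coefficient replacement, and the toy half-bright prior are all our assumptions):

```python
import numpy as np

def inject_low_freq_prior(noise, prior, cutoff=4):
    """Swap the low-frequency Fourier coefficients of white Gaussian noise
    for those of an image prior; high frequencies stay random, so fine
    detail still varies while global structure/color is steered."""
    F_n = np.fft.fftshift(np.fft.fft2(noise))
    F_p = np.fft.fftshift(np.fft.fft2(prior))
    H, W = noise.shape
    yy, xx = np.mgrid[:H, :W]
    low = (yy - H // 2) ** 2 + (xx - W // 2) ** 2 <= cutoff ** 2  # low-freq disk
    F_n[low] = F_p[low]
    return np.real(np.fft.ifft2(np.fft.ifftshift(F_n)))

rng = np.random.default_rng(1)
noise = rng.normal(size=(32, 32))
prior = np.zeros((32, 32))
prior[:16] = 1.0                      # prior: bright top half, dark bottom half
steered = inject_low_freq_prior(noise, prior)
```

After the swap, the noise image inherits the prior's coarse brightness layout (bright top, dark bottom) while its pixel-level randomness survives, which is the behavior the paper exploits for structure and color conditioning.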
[CV-30] Vesselpose: Vessel Graph Reconstruction from Learned Voxel-wise Direction Vectors in 3D Vascular Images
【速读】:This paper addresses the topological accuracy of vascular network reconstruction from 3D medical images, where the conventional segment-then-trace paradigm struggles to produce complete and topologically correct vessel graphs. The key of the solution is to first jointly predict voxel-wise vessel direction vectors together with standard segmentation masks, and then extract a more accurate vascular graph from these predictions using a direction-vector-guided extension of the TEASAR algorithm. The approach achieves state-of-the-art performance on multiple benchmarks, can separate closely apposed vessel segments, handles multiple vascular trees within a single volume, and introduces interpretable topological error measures such as false splits and false merges.
链接: https://arxiv.org/abs/2605.00538
作者: Rajalakshmi Palaniappan,Christoph Karg,Nemesio Navarro-Arambula,Peter Hirsch,Kristin Kraeker,Lisa Mais,Dagmar Kainmueller
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 33 pages, 10 figures, 11 tables
Abstract:Blood vessel segmentation and tracing are essential tasks in many medical imaging applications. Although numerous methods exist, the prevailing segment-then-fix paradigm is fundamentally limited regarding its suitability for modeling the task of complete and topologically accurate vascular network reconstruction. Here, we propose an approach to extract topologically more accurate vascular graphs from 3D image data, building upon highly successful ideas from the related biomedical tasks of cell segmentation and tracking. Our approach first predicts voxel-wise vessel direction vectors joint with standard vessel segmentation masks. Second, to extract the vascular graph from these predictions, we introduce a direction-vector-guided extension of the TEASAR algorithm. Our approach achieves state-of-the-art performance on three benchmark datasets, spanning both synthetic and real imagery. We further demonstrate the applicability of our approach to challenging 3D micro-CT scans of rat heart vasculature. Finally, we propose meaningful and interpretable measures of topological error, namely false splits and false merges for graphs. Overall, our approach substantially improves the topological accuracy of reconstructed vascular graphs, being able to separate closely apposed vessel segments and handle multiple vascular trees within a single volume.
[CV-31] IdentiFace: Multi-Modal Iterative Diffusion Framework for Identifiable Suspect Face Generation in Crime Investigations
【速读】:This paper addresses the technical challenges of suspect face generation in crime investigations: traditional sketch-drawing workflows are inefficient and produce low-quality results, while diffusion-based text-to-image methods suffer from conditional ambiguity and from sampling variance in one-shot generation. The key of the proposed IdentiFace framework lies in two designs: a multi-modal input design that strengthens conditional control, and an iterative generation pipeline that supports step-by-step adjustment of identifiable features. Together with a facial identity loss and two task-specific datasets, the framework significantly improves identity-retrieval accuracy of the generated faces, outperforming existing methods in both synthetic and real-world scenarios.
链接: https://arxiv.org/abs/2605.00526
作者: Weichen Liu,Yixin Yang,Changsheng Chen,Alex Kot
机构: Southeast University (东南大学); Shenzhen MSU-BIT University (深圳北理莫斯科大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 5 figures
Abstract:Suspect face generation remains a technical challenge in crime investigations. Traditional sketch-drawing workflows suffer from low efficiency and quality, while diffusion-based approaches still face intrinsic limitations on conditional ambiguity for text-to-image models and sampling variance for one-shot generation. We proposed IdentiFace, a novel diffusion-based framework for identifiable suspect face generation, which addressed these issues through (1) multi-modal input design to strengthen conditional control, and (2) an iterative generation pipeline enabling identifiable feature adjustment. We additionally contributed a facial identity loss and two task-specific datasets. Comprehensive experiments on synthetic datasets and in real-world scenarios indicate that IdentiFace achieves superior performance over existing methods, especially in terms of identity retrieval, and shows strong potential for practical applications.
[CV-32] PhysiGen: Integrating Collision-Aware Physical Constraints for High-Fidelity Human-Human Interaction Generation
【速读】:This paper addresses the pervasive body inter-penetration problem in multi-person interaction generation, which appears both in data acquisition and in model outputs and significantly undermines physical plausibility and visual realism. The key of the solution is PhysiGen, a general-purpose and computationally efficient optimization strategy: high-resolution human meshes are simplified into geometric primitives to drastically reduce the cost of inter-person collision detection, and the detected collision regions are used to guide the optimization directions, thereby explicitly imposing collision-aware physical constraints. PhysiGen is plug-and-play and can be seamlessly integrated into existing interaction generation models; experiments show it effectively reduces inter-penetration and improves the visual coherence and physical plausibility of the generated sequences.
链接: https://arxiv.org/abs/2605.00517
作者: Nan Lei,Yuan-Ming Li,Ling-An Zeng,Liang Xu,Zhi-Wei Xia,Hui-Wen Huang,Fa-Ting Hong,Wei-Shi Zheng
机构: Sun Yat-sen University (中山大学); The Hong Kong University of Science and Technology (香港科技大学); Shanghai Jiao Tong University (上海交通大学); Guilin University of Electronic Technology (桂林电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 9 figures
Abstract:Despite substantial progress in text-driven 3D human motion synthesis, generating realistic multi-person interaction sequences remains challenging. Notably, body inter-penetration is a pervasive issue from both data acquisition to the generated results, which significantly undermines the realism and usability. Previous generative models either ignored this issue or introduced computationally expensive mesh-level loss functions to alleviate inter-body collisions. In this paper, we propose a general-purpose and computationally efficient optimization strategy named PhysiGen to explicitly integrate collision-aware physical constraints for human-human interaction generation. Specifically, we simplify the high-resolution human body mesh into geometric primitives to greatly reduce the cost of inter-person collision detection. Moreover, we identify the collision regions as the guidance of the optimization directions. PhysiGen is plug-and-play and can be readily integrated into existing human interaction generation models. Extensive cross-dataset and cross-model experiments show that our method can effectively reduce interpenetration and significantly improve visual coherence and physical plausibility compared to the state-of-the-art methods.
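The primitive-based collision check that makes PhysiGen cheap can be sketched with the simplest primitive, spheres: two bodies are each approximated by a set of spheres, and any pair whose center distance is below the sum of radii is penetrating. The primitives, radii, and thresholds below are illustrative (the paper does not specify them in the abstract):

```python
import numpy as np

def penetration_depths(centers_a, radii_a, centers_b, radii_b):
    """Pairwise penetration depth between two sphere sets approximating
    two bodies: depth > 0 means the spheres overlap; non-overlapping
    pairs are clamped to 0. A toy analogue of reducing mesh-level
    collision checks to geometric primitives."""
    diff = centers_a[:, None, :] - centers_b[None, :, :]   # (Na, Nb, 3)
    dist = np.linalg.norm(diff, axis=-1)
    depth = (radii_a[:, None] + radii_b[None, :]) - dist
    return np.maximum(depth, 0.0)

# Body A: one sphere at the origin; body B: one overlapping and one distant sphere.
ca, ra = np.array([[0.0, 0.0, 0.0]]), np.array([0.5])
cb, rb = np.array([[0.6, 0.0, 0.0], [2.0, 0.0, 0.0]]), np.array([0.5, 0.5])
d = penetration_depths(ca, ra, cb, rb)  # first pair overlaps by 0.4, second by 0
```

In an optimization loop, the nonzero entries of `d` both form a penalty term and identify the colliding regions that steer the update direction, mirroring the collision-region guidance the abstract describes.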
[CV-33] Scale-Aware Adversarial Analysis: A Diagnostic for Generative AI in Multiscale Complex Systems
【速读】:This paper addresses whether generative models of complex multiscale physical systems truly internalize the governing physical laws or merely interpolate statistical correlations. Conventional explainable AI (XAI) methods rely on pixel-wise perturbations that produce unphysical artifacts and therefore cannot properly assess physical consistency. The key of the solution is a diagnostic framework built on Constrained Diffusion Decomposition (CDD): by constructing a scale-aware, continuous scale space, it enables physically constrained data generation and interventions on both data and model without leaving the empirical distribution, making it possible to probe how generative models respond to physical perturbations. Experiments show that an unconstrained denoising diffusion probabilistic model (DDPM) exhibits localized structural freezing and non-linear instability under moderate physical perturbations and loses cross-scale continuity, causing the generative trajectory to diverge, which reveals a fundamental limitation of current architectures in modeling multiscale causality.
链接: https://arxiv.org/abs/2605.00510
作者: Mengke Zhao,Guang-Xing Li,Duo Xu,Keping Qiu
机构: Nanjing University (南京大学); Yunnan University (云南大学); Key Laboratory of Modern Astronomy and Astrophysics (Nanjing University) (现代天文学与天体物理教育部重点实验室); University of Toronto (多伦多大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Computational Physics (physics.comp-ph)
备注: submitted, comments welcome
Abstract:Complex physical systems, from supersonic turbulence to the macroscopic structure of the universe, are governed by continuous multiscale dynamics. While modern machine learning architectures excel at mapping the high-dimensional observables of these systems, it remains unclear whether they internalize the governing physical laws or merely interpolate discrete statistical correlations. Standard Explainable AI (XAI) architectures, particularly perturbation-based and gradient-saliency methods, rely on pixel-wise perturbations, which generate unphysical artifacts and push inputs off the valid empirical distribution. To resolve this, we introduce a diagnostic framework driven by Constrained Diffusion Decomposition (CDD), a diffusion-based multiscale data decomposition algorithm that enables physically constrained data generation and model evaluation via scale-aware modifications. Applying this framework to a Denoising Diffusion Probabilistic Model (DDPM), we execute deterministic interventions directly within the continuous, CDD-based scale space. We demonstrate that under moderate physical perturbations, the unconstrained generative model exhibits localized structural freezing and non-linear instability rather than continuous PDE-like responses. The network fails to maintain cross-scale continuity, causing the generative trajectory to diverge when pushed into unseen physical states. By synthesizing a continuum of physically coherent states, this scale-informed methodology establishes a controlled test ground to evaluate algorithmic vulnerabilities, providing the rigorous physical constraints necessary for future architectures to respect the multiscale causality of the natural universe.
[CV-34] End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer ICML2026
【速读】:This paper addresses the optimization mismatch caused by training the visual tokenizer and the generative model separately in autoregressive image modeling: the conventional two-stage recipe first trains the tokenizer and then the generator, leaving no end-to-end supervision signal and limiting performance. The key of the solution is an end-to-end training pipeline that jointly optimizes reconstruction and generation, so that supervision from the generation results can be backpropagated directly into the tokenizer for more effective co-optimization. The paper further explores leveraging vision foundation models to strengthen 1D tokenizers, ultimately achieving a state-of-the-art FID of 1.48 without guidance on ImageNet 256x256 generation.
链接: https://arxiv.org/abs/2605.00503
作者: Wenda Chu,Bingliang Zhang,Jiaqi Han,Yizhuo Li,Linjie Yang,Yisong Yue,Qiushan Guo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: In ICML 2026 (Spotlight)
Abstract:Autoregressive image modeling relies on visual tokenizers to compress images into compact latent representations. We design an end-to-end training pipeline that jointly optimizes reconstruction and generation, enabling direct supervision from generation results to the tokenizer. This contrasts with prior two-stage approaches that train tokenizers and generative models separately. We further investigate leveraging vision foundation models to improve 1D tokenizers for autoregressive modeling. Our autoregressive generative model achieves strong empirical results, including a state-of-the-art FID score of 1.48 without guidance on ImageNet 256x256 generation.
[CV-35] GOR-IS: 3D Gaussian Object Removal in the Intrinsic Space
【速读】:This paper addresses the lack of physical consistency and visual coherence in 3D object removal, in particular the failure of existing methods to handle global lighting effects and view-dependent non-Lambertian surfaces. The key of the proposed 3D Gaussian Object Removal in the Intrinsic Space (GOR-IS) framework is to decompose the scene into intrinsic components such as material and lighting and to explicitly model light transport so that global illumination remains consistent; an inpainting module that operates directly in the intrinsic space copes with the viewpoint-dependent appearance of non-Lambertian surfaces, yielding physically plausible and visually coherent 3D object removal.
链接: https://arxiv.org/abs/2605.00498
作者: Yonghao Zhao,Yupeng Gao,Jian Yang,Jin Xie,Beibei Wang
机构: Nankai University (南开大学); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have made it standard practice to reconstruct 3D scenes from multi-view images. Removing objects from such 3D representations is a fundamental editing task that requires complete and seamless inpainting of occluded regions, ensuring consistency in geometry and appearance. Although existing methods have made notable progress in improving inpainting consistency, they often neglect global lighting effects, leading to physically implausible results. Moreover, these methods struggle with view-dependent non-Lambertian surfaces, where appearance varies across viewpoints, leading to unreliable inpainting. In this paper, we present 3D Gaussian Object Removal in the Intrinsic Space (GOR-IS), a novel framework for physically consistent and visually coherent 3D object removal. Our approach decomposes the scene into intrinsic components and explicitly models light transport to maintain global lighting effects consistency. Furthermore, we introduce an intrinsic-space inpainting module that operates directly in the material and lighting domains, effectively addressing the challenges posed by non-Lambertian surfaces. Extensive experiments on both synthetic and real-world datasets demonstrate that our framework substantially improves the physical consistency and visual coherence of object removal, outperforming existing methods by 13% in perceptual similarity (LPIPS) and 2dB in peak signal-to-noise ratio (PSNR). Code is publicly available at this https URL
[CV-36] High-Speed Vision Improves Zero-Shot Semantic Understanding of Human Actions
【速读】:This paper addresses zero-shot semantic understanding of rapid, subtle human actions in the absence of labeled data. The core challenges are that supervised learning requires large amounts of annotations, which are hard to obtain for fast or uncommon actions, and that the effect of temporal resolution on pre-trained models for high-frame-rate video remains underexplored. The key of the solution is a training-free pipeline that combines a pre-trained video-language model for semantic representation with a large language model (LLM) for pairwise action comparison, used to evaluate how temporal resolution (120 Hz, 60 Hz, 30 Hz) affects zero-shot semantic separability. Experiments show that higher temporal resolution significantly improves semantic separability, and that tracking-based human joint information further stabilizes the semantic representation and improves interpretability under partial observation.
链接: https://arxiv.org/abs/2605.00496
作者: Yongpeng Cao,Yuji Yamakawa
机构: Institute of Industrial Science, The University of Tokyo, Japan; Interfaculty Initiative in Information Studies, Graduate School of Interdisciplinary Information Studies, The University of Tokyo, Japan
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Understanding human actions from visual observations is essential for human–robot interaction, particularly when semantic interpretation of unfamiliar or hard-to-annotate actions is required. In scenarios such as rapid and less common activities, collecting sufficient labeled data for supervised learning is challenging, making zero-shot approaches a practical alternative for semantic understanding without task-specific training. While recent advances in large-scale pretrained models enable such zero-shot reasoning, the impact of temporal resolution, especially for rapid and fine-grained motions, remains underexplored. In this study, we investigate how temporal resolution affects zero-shot semantic understanding of high-speed human actions. Using kendo as a representative case of rapid and subtle motion patterns, we propose a training-free pipeline that combines a pre-trained video-language model for semantic representation with large language model-based reasoning for pairwise action comparison. Through controlled experiments across multiple frame rates (120 Hz, 60 Hz, and 30 Hz), we show that higher temporal resolution significantly improves semantic separability in zero-shot settings. We further analyze the role of tracking-based human joint information under both full and partial observation scenarios. Quantitative evaluation using a nearest-class prototype strategy demonstrates that high-speed video provides more stable and interpretable semantic representations for fast actions. These findings highlight the importance of temporal resolution in training-free action recognition and suggest that high-speed perception can enhance semantic understanding capabilities.
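The nearest-class-prototype evaluation the abstract mentions is a standard procedure: average the embeddings of each class into a prototype and assign a query to the class with the highest cosine similarity. The sketch below uses toy 2-D vectors in place of real video-language embeddings (the embedding model itself is assumed given):

```python
import numpy as np

def nearest_prototype(embedding, prototypes):
    """Nearest-class-prototype classification: score an embedding against
    per-class prototype vectors by cosine similarity and return the index
    of the best-matching class."""
    q = embedding / np.linalg.norm(embedding)
    P = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return int(np.argmax(P @ q))  # cosine similarity = dot product of unit vectors

prototypes = np.array([[1.0, 0.0],   # mean embedding of action class 0
                       [0.0, 1.0]])  # mean embedding of action class 1
pred_a = nearest_prototype(np.array([0.9, 0.1]), prototypes)
pred_b = nearest_prototype(np.array([0.2, 0.8]), prototypes)
```

Under this scheme, "semantic separability" at a given frame rate can be measured by how often embeddings land closest to their own class prototype.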
[CV-37] MMAudio-LABEL: Audio Event Labeling via Audio Generation for Silent Video CVPR2026
【速读】:This paper addresses the lack of explicit sound event labels in existing video-to-audio generation: although the generated audio can be of high quality, it carries no structured information about what sounds occur and when, limiting interpretability and practical use in applications such as sound production. The key of the solution is MMAudio-LABEL (LAtent-Based Event Labeling), a framework built on a foundational audio generation model that jointly generates high-quality audio and frame-aligned sound event predictions, thereby avoiding the error accumulation of conventional post-hoc detection pipelines. Experiments show substantial gains in onset detection accuracy (from 46.7% to 75.0%) and material classification accuracy (from 40.6% to 61.0%).
链接: https://arxiv.org/abs/2605.00495
作者: Kazuya Tateishi,Akira Takahashi,Atsuo Hiroe,Hirofumi Takeda,Shusuke Takahashi,Yuki Mitsufuji
机构: Sony Group Corporation(索尼集团); Sony AI(索尼人工智能)
类目: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the CVPR 2026 Sight and Sound Workshop
Abstract:Recent advances in multimodal generation have enabled high-quality audio generation from silent videos. Practical applications, such as sound production, demand not only the generated audio but also explicit sound event labels detailing the type and timing of sounds. One straightforward approach involves applying a standard sound event detection to the generated audio. However, this post-hoc pipeline is inherently limited, as it is prone to error accumulation. To address this limitation, we propose MMAudio-LABEL (LAtent-Based Event Labeling), an event-aware audio generation framework built on a foundational audio generation model as its backbone that jointly generates audio and frame-aligned sound event predictions from silent videos. We evaluate our method on the Greatest Hits dataset for onset detection and 17-class material classification. Our approach improves onset-detection accuracy from 46.7% to 75.0% and material-classification accuracy from 40.6% to 61.0% over baselines. These results suggest that jointly learning audio generation and event prediction enables a more interpretable and practical video-to-audio synthesis.
[CV-38] Leverag ing Vision-Language Models as Weak Annotators in Active Learning ICIP2026
【速读】:This paper addresses the high annotation cost that limits active learning, asking how vision-language models (VLMs) can reduce reliance on expensive human labels, especially in fine-grained recognition tasks. The key of the solution is to exploit the observation that VLMs are reliable on coarse-grained labels but perform poorly on fine-grained ones: the proposed framework combines fine-grained human annotations with coarse-grained VLM-generated weak labels through an instance-wise label assignment mechanism, and models the systematic noise in VLM labels using a small set of trusted full labels, yielding significant performance gains under a limited annotation budget.
链接: https://arxiv.org/abs/2605.00480
作者: Phuong Ngoc Nguyen,Kaito Shiku,Ryoma Bise,Seiichi Uchida,Shinnosuke Matsuo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICIP2026
Abstract:Active learning aims to reduce annotation cost by selectively querying informative samples for supervision under a limited labeling budget. In this work, we investigate how vision-language models (VLMs) can be leveraged to further reduce the reliance on costly human annotation within the active learning paradigm. To this end, we find that the reliability of VLMs varies significantly with label granularity in fine-grained recognition tasks: they perform poorly on fine-grained labels but can provide accurate coarse-grained labels. Leveraging this property, we propose an active learning framework that combines fine-grained human annotations with coarse-grained VLM-generated weak labels through instance-wise label assignment. We further model the systematic noise in VLM-generated labels using a small set of trusted full labels. Experiments on CUB200 and FGVC-Aircraft show that the proposed framework consistently outperforms existing active learning methods under the same annotation budget.
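One standard way to train a fine-grained classifier on the coarse VLM labels described above is to marginalize: the likelihood of a coarse label is the total probability mass of the fine classes it contains. The fine-to-coarse mapping and probabilities below are hypothetical, and the paper's exact instance-wise assignment rule may differ:

```python
import numpy as np

# Hypothetical mapping from 6 fine-grained classes to 2 coarse groups.
FINE_TO_COARSE = np.array([0, 0, 0, 1, 1, 1])

def coarse_nll(fine_probs, coarse_label):
    """Negative log-likelihood of a coarse (VLM-provided) label under a
    fine-grained classifier: sum the probability of every fine class
    belonging to that coarse class, then take -log."""
    mass = fine_probs[FINE_TO_COARSE == coarse_label].sum()
    return -np.log(mass + 1e-12)  # epsilon guards against log(0)

p = np.array([0.5, 0.3, 0.1, 0.05, 0.03, 0.02])  # fine-class posterior
loss_right = coarse_nll(p, 0)  # 0.9 of the mass sits in coarse class 0 -> small loss
loss_wrong = coarse_nll(p, 1)  # only 0.1 of the mass -> large loss
```

Human fine-grained labels would use the ordinary fine-class cross-entropy, so the two supervision signals can be mixed per instance within one training objective.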
[CV-39] MSACT: Multistage Spatial Alignment for Stable Low-Latency Fine Manipulation
【速读】:This paper addresses visual localization drift in fine bimanual manipulation learned from limited demonstrations, while simultaneously requiring low-latency control and stable visual localization. Existing approaches trade off data efficiency, execution latency, and geometric consistency: ACT relies on dense visual features without explicit spatial consistency, diffusion policies are expressive but incur iterative sampling latency, and vision-language or voxel-based methods improve generalization at higher computational complexity. The key of the solution is a multistage spatial attention module that extracts stable 2D attention points and jointly predicts future attention sequences, with a temporal alignment loss enforcing consistency between visual features and the predicted attention sequences, thereby suppressing drift without keypoint annotations and stabilizing the vision-to-action mapping. Embedded into an ACT framework with a pretrained ResNet visual prior, the module improves task success on both simulated and real-world bimanual platforms (ALOHA) while preserving low-latency inference.
链接: https://arxiv.org/abs/2605.00475
作者: Xianbo Cai,Hideyuki Ichiwara,Masaki Yoshikawa,Tetsuya Ogata
机构: Waseda University (早稻田大学); National Institute of Advanced Industrial Science and Technology (日本产业技术综合研究所)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6 figures
Abstract:Real-world fine manipulation, particularly in bimanual manipulation, typically requires low-latency control and stable visual localization, while collecting large-scale data is costly and limited demonstrations may lead to localization drift. Existing approaches make different trade-offs: action-chunking policies such as ACT enable low-latency execution and data efficiency but rely on dense visual features without explicit spatial consistency, generative methods such as Diffusion Policy improve expressiveness but can incur iterative sampling latency, vision-language-action and voxel-based methods enhance generalization and geometric grounding but require higher computational cost and system complexity. We introduce a multistage spatial attention module that extracts stable 2D attention points and jointly predicts future attention sequences with a temporal alignment loss. Built upon ACT with a pretrained ResNet visual prior, a multistage attention module extracts task-relevant 2D attention points as a local spatial modality for action prediction. To maintain consistent object tracking, we introduce a self-supervised objective that aligns predicted attention sequences with visual features from future frames, suppressing drift without keypoint annotations and improving stability of the vision-to-action mapping under limited data. Experiments on simulated and real-world fine manipulation tasks, conducted on the ALOHA bimanual platform, evaluate task success, attention drift, inference latency, and robustness to visual disturbances. Results indicate improvements in localization stability and task performance while maintaining low-latency inference under the tested conditions.
[CV-40] From Local to Global to Mechanistic: An iERF-Centered Unified Framework for Interpreting Vision Models
[Quick Read]: This paper targets the fragmented state of interpretability for modern vision models, where it is hard to pinpoint where evidence arises, what the model encodes, and how internal computations assemble that evidence. The core solution is a unified framework centered on the instance-specific Effective Receptive Field (iERF) that integrates local, global, and mechanistic explanations around a single analysis unit: the pointwise feature vector (PFV) paired with its iERF. Key components: 1) Sharing Ratio Decomposition (SRD) for high-resolution, activation-faithful local explanations; 2) Concept-Anchored Feature Explanation (CAFE), which uses the iERF as a semantic label to ground abstract latent vectors in verifiable pixel-level evidence, addressing non-localized sparse autoencoder (SAE) features; 3) the Interlayer Concept Graph with Interlayer Concept Attribution (ICAT), which quantifies concept-to-concept influence across layers and identifies Integrated Gradients as the most faithful composition mechanism. Across ResNet50, VGG16, and ViTs, the framework outperforms baselines in both fidelity and robustness and exposes dominant concept routes in correct, misclassified, and adversarial cases.
Link: https://arxiv.org/abs/2605.00474
Authors: Yearim Kim, Sangyu Han, Nojun Kwak
Affiliations: Seoul National University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2026
Abstract:Modern vision models achieve remarkable accuracy, but explaining where evidence arises, what the model encodes, and how internal computations assemble that evidence remains fragmented. We introduce an iERF-centric framework that unifies local, global, and mechanistic interpretability around a single analysis unit: the pointwise feature vector (PFV) paired with its instance-specific Effective Receptive Field (iERF). On the local side, Sharing Ratio Decomposition (SRD) expresses each PFV as a mixture of upstream PFVs via sharing ratios and propagates iERFs to construct class-discriminative saliency maps. SRD yields high-resolution, activation-faithful explanations, is robust to targeted manipulation and noise, and remains activation-agnostic across common nonlinearities. For the global view, we introduce Concept-Anchored Feature Explanation (CAFE), which utilizes the iERF as a semantic label, grounding abstract latent vectors in verifiable pixel-level evidence. With CAFE, we address the challenge of non-localized sparse autoencoder latents–especially in Transformers, where early self-attention mixes distant context. To answer how representations are composed through depth, we propose the Interlayer Concept Graph with Interlayer Concept Attribution (ICAT), which quantifies concept-to-concept influence while isolating layer pairs; an interlayer insertion, deletion protocol identifies Integrated Gradients as the most faithful instantiation. Empirically, across ResNet50, VGG16, and ViTs, our framework outperforms baselines in both fidelity and robustness, successfully interprets dispersed SAE features, and exposes dominant concept routes in correct, misclassified, and adversarial cases. Grounded in iERFs, our approach provides a coherent, evidence-backed map from pixels to concepts to decisions.
[CV-41] Learning from Compressed CT: Feature Attention Style Transfer and Structured Factorized Projections for Resource-Efficient Medical Image Analysis
[Quick Read]: This paper addresses the high computational complexity and resource-intensive processing that hinder AI deployment in medical imaging, in particular the difficulty of efficiently using chest CT volumes stored in uncompressed formats such as NIfTI or DICOM. The key solution is Feature Attention Style Transfer (FAST), a knowledge-distillation framework that combines Gram-matrix-based attention style preservation with dual-attention feature alignment to transfer activation patterns and structural relationships from high-fidelity CT representations to a compressed input space. In addition, Structured Factorized Projection (SFP) replaces dense projection layers with a Block Tensor Train decomposition, roughly halving projection-head parameters while preserving performance. The overall pipeline, CT-Lite, integrates these components with contrastive learning and a SigLIP multimodal alignment objective, achieving diagnostic performance within 5-7% AUROC of an uncompressed-input baseline on multiple public datasets and offering a practical path to AI-assisted clinical evaluation in low-resource settings.
Link: https://arxiv.org/abs/2605.00448
Authors: Shadid Yousuf, S.M. Mahbubur Rahman, Mohammed Imamul Hassan Bhuiyan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:
Abstract:The deployment of artificial intelligence in medical imaging is hindered by high computational complexity and resource-intensive processing of volumetric data. Although chest computed tomography (CT) volumes offer richer diagnostic information than projection radiography, their use in AI-based diagnosis remains limited due to the computational burden of processing uncompressed volumetric images (typically stored in NIfTI or DICOM format). Addressing the growing need for low-resource deployment and efficient electronic data transfer, we investigate the utilization of JPEG-compressed chest CT volumes for thoracic abnormality detection. We propose Feature Attention Style Transfer (FAST), a novel distillation framework that transfers both activation patterns and structural relationships from high-fidelity CT representations to a spatiotemporal visual encoder operating on compressed inputs. By combining Gram-matrix-based attention style preservation with dual-attention feature alignment, FAST enables robust feature extraction from degraded volumes. Furthermore, we introduce Structured Factorized Projection (SFP), leveraging Block Tensor Train decomposition as a parameter-efficient alternative to dense projection layers, reducing projection-head parameters by almost half. Our contrastive learning pipeline, CT-Lite, integrates these components with a SigLIP-based multimodal alignment objective. Experiments on CT-RATE, NIDCH, and Rad-ChestCT demonstrate that CT-Lite achieves AUROC within 5-7% of the uncompressed-input baseline across all three datasets, despite operating on compressed inputs with significantly fewer parameters, paving the way for AI-based clinical evaluation under resource constraints.
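The Gram-matrix style preservation at the heart of FAST follows the classic style-transfer recipe: compare channel-correlation (Gram) matrices between teacher and student feature maps. Below is a minimal sketch of that idea, not the paper's code; the feature shapes and the mean-squared-error form are assumptions:

```python
import numpy as np

def gram_matrix(feats):
    """Gram matrix of a (C, H, W) feature map -> (C, C) channel correlations."""
    c, h, w = feats.shape
    f = feats.reshape(c, h * w)
    return f @ f.T / (h * w)

def style_loss(teacher_feats, student_feats):
    """Mean squared difference between teacher and student Gram matrices,
    as in classic style transfer; an illustrative stand-in for FAST's
    attention-style-preservation term."""
    g_t = gram_matrix(teacher_feats)
    g_s = gram_matrix(student_feats)
    return float(np.mean((g_t - g_s) ** 2))

rng = np.random.default_rng(0)
t = rng.standard_normal((4, 8, 8))
print(style_loss(t, t))   # identical features -> 0.0
```

In the distillation setting, `teacher_feats` would come from the encoder of uncompressed CT and `student_feats` from the encoder operating on JPEG-compressed inputs.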
[CV-42] Scaling Video Understanding via Compact Latent Multi-Agent Collaboration
[Quick Read]: This paper addresses the performance bottleneck that multimodal large language models (MLLMs) face in long-video understanding due to bounded perception context budgets. Existing rule-based agentic methods mitigate the problem but often suffer from information loss, high computational cost, and reliance on textual intermediates. The key innovation of the proposed end-to-end Multi-Agent Collaboration Framework (MACF) is to decouple each agent's perception budget from global video complexity: the video is partitioned into segments perceived by locally budgeted agents, and an agent-native latent communication protocol lets each agent encode task-sufficient compact tokens in a shared embedding space, so a central coordinator can fuse information efficiently and without loss. A curriculum training strategy progressively strengthens semantic alignment, evidence summarization, and cross-agent coordination, markedly improving the scalability and accuracy of long-video understanding.
Link: https://arxiv.org/abs/2605.00444
Authors: Kerui Chen, Jinglu Wang, Jianrong Zhang, Ming Li, Yan Lu, Hehe Fan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages
Abstract:Multi-modal large language models (MLLMs) advance vision language understanding but face inherent limitations in long-video tasks due to bounded perception context budgets. Existing agentic methods mitigate this via rule-based preprocessing, yet often suffer from information loss, high cost, and reliance on textual intermediates. We propose MACF, an end-to-end Multi-Agent Collaboration Framework that decouples per-agent perception budgets from global video complexity, enabling scalable video understanding while preserving visual fidelity. MACF partitions videos into segments for locally budgeted agents and enables holistic reasoning via an agent-native latent communication protocol. Each agent encodes partial observations into compact, task-sufficient tokens in a shared embedding space, allowing efficient and information-preserving collaboration by a central coordinator. We introduce a curriculum training strategy that progressively enforces semantic alignment, evidence summarization, and cross-agent coordination. Extensive experiments on diverse video understanding benchmarks show that MACF consistently outperforms state-of-the-art MLLMs and multi-agent systems under identical budget constraints, demonstrating the effectiveness of our latent collaboration for scalable video understanding.
[CV-43] Adaptive Equilibrium: Dynamic Weighting Framework for Generalized Interruption of DeepFake Models
[Quick Read]: This paper addresses the interruption-imbalance problem in generating universal perturbations against generative AI (deepfake) models: conventional static gradient normalization struggles to reconcile conflicts across architectures, biasing the optimization toward easily disrupted models while neglecting resistant ones. The key to the proposed Adaptive Equilibrium Framework (AEF) is a dynamic weighting mechanism that uses real-time loss feedback to adaptively assign larger interruption weights to the most resistant models, recasting the objective from improving average performance to finding a dynamic equilibrium, so that a single perturbation disrupts diverse model architectures uniformly and effectively.
Link: https://arxiv.org/abs/2605.00443
Authors: Hongrui Zheng, Liejun Wang, Zhiqing Guo
Affiliations: Xinjiang University
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 5 figures
Abstract:The advancement of generalized deepfake disruption is constrained by the interruption imbalance, a fundamental bottleneck inherent to the generation of universal perturbations. We reveal that conventional static gradient normalization fundamentally struggles to resolve architectural conflicts, causing the optimization to bias towards susceptible models while neglecting resistant ones. We argue that achieving high and uniform effectiveness requires resolving this imbalance by reaching an adaptive equilibrium. We propose the Adaptive Equilibrium Framework (AEF), which employs a dynamic weighting mechanism that utilizes real-time loss feedback to adaptively assign greater interruption weights to the most resistant models. This approach shifts the optimization from an average-case problem to finding a dynamic balance, driving the perturbation to a uniformly effective equilibrium state. Comprehensive experiments validate that AEF achieves a more balanced interruption performance, maintaining a consistent interruption success rate across the evaluated diverse architectures.
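The dynamic weighting idea can be sketched as turning per-model disruption losses into weights so that the most resistant model (highest remaining loss) receives the largest weight. The softmax form and the `temperature` parameter below are illustrative assumptions, not necessarily AEF's exact rule:

```python
import numpy as np

def adaptive_weights(losses, temperature=1.0):
    """Map per-model disruption losses to normalized weights.

    Illustrative sketch of loss-feedback weighting: models whose disruption
    loss is still high (i.e. the most resistant ones) get proportionally
    larger weights via a softmax, steering the universal perturbation
    toward a balanced equilibrium.
    """
    losses = np.asarray(losses, dtype=float)
    z = losses / temperature
    z = z - z.max()            # numerical stability
    w = np.exp(z)
    return w / w.sum()

w = adaptive_weights([2.0, 0.5, 0.5])
print(w)   # the resistant model (loss 2.0) dominates the weighting
```

Each optimization step would then combine per-model gradients with these weights and recompute them from fresh loss feedback.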
[CV-44] LIMSSR: LLM-Driven Sequence-to-Score Reasoning under Training-Time Incomplete Multimodal Observations ICML2026
[Quick Read]: This paper addresses unreliable supervision caused by training-time missing modalities in incomplete multimodal learning (IML), in particular the dependence of existing methods on a "God's eye view", i.e., the unrealistic assumption of complete multimodal data at training time. The key solution, LIMSSR (LLM-Driven Incomplete Multimodal Sequence-to-Score Reasoning), reformulates IML as a conditional sequence-reasoning task: Prompt-Guided Context-Aware Modality Imputation and Multidimensional Representation Fusion exploit the semantic reasoning ability of large language models (LLMs) to infer latent semantics from the available modalities instead of directly reconstructing the missing ones, while a Mask-Aware Dual-Path Aggregation mechanism dynamically calibrates inference uncertainty to suppress hallucinations, yielding more robust and data-efficient multimodal modeling without complete training data.
Link: https://arxiv.org/abs/2605.00434
Authors: Huangbiao Xu, Huanqi Wu, Xiao Ke, Yuxin Peng
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2026 [Spotlight]
Abstract:Real-world multimodal learning is often hindered by missing modalities. While Incomplete Multimodal Learning (IML) has gained traction, existing methods typically rely on the unrealistic assumption of full-modal availability during training to provide reconstruction supervision or cross-modal priors. This paper tackles the more challenging setting of IML under training-time incomplete observations, which precludes reliance on a ``God’s eye view’’ of complete data. We propose LIMSSR (LLM-Driven Incomplete Multimodal Sequence-to-Score Reasoning), a framework that reformulates this challenge as a conditional sequence reasoning task. LIMSSR leverages the semantic reasoning capabilities of Large Language Models via Prompt-Guided Context-Aware Modality Imputation and Multidimensional Representation Fusion to infer latent semantics from available contexts without direct reconstruction. To mitigate hallucinations, we introduce a Mask-Aware Dual-Path Aggregation to dynamically calibrate inference uncertainty. Extensive experiments on three Action Quality Assessment datasets demonstrate that LIMSSR significantly outperforms state-of-the-art baselines without relying on complete training data, establishing a new paradigm for data-efficient multimodal learning. Code is available at this https URL.
[CV-45] MMAudioReverbs: Video-Guided Acoustic Modeling for Dereverberation and Room Impulse Response Estimation CVPR2026
[Quick Read]: This paper addresses the fact that existing video-to-audio (V2A) models do not explicitly model room-acoustic effects such as reverberation or room impulse responses (RIRs), limiting controllability over these physical properties. The key solution is to use a pretrained V2A model (MMAudio) as a prior: by fine-tuning on a small dataset and without any architectural modification, the unified framework MMAudioReverbs handles both dereverberation and RIR estimation. Experiments show that visual and audio cues each have advantages depending on the type of physical room acoustics, demonstrating the potential of foundation V2A models for physically grounded room-acoustic analysis.
Link: https://arxiv.org/abs/2605.00431
Authors: Akira Takahashi, Ryosuke Sawata, Shusuke Takahashi, Yuki Mitsufuji
Affiliations: Sony Group Corporation; Sony AI
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted to the CVPR 2026 Sight and Sound Workshop
Abstract:Although recent video-to-audio (V2A) models excelled at synthesizing semantically plausible sounds from visual inputs, they do not explicitly model room-acoustic effects such as reverberation or room impulse responses (RIRs), and thus offer limited controllability over these effects. However, we hypothesize that such V2A models implicitly have semantic knowledge of the relationship between spatial audio and the corresponding vision cues. In this paper, we revisit a V2A model for the sake of the above, and propose the way to utilize the pretrained model as prior for physically grounded room-acoustic processing. Based on one of the state-of-the-art V2A models, MMAudio, we propose MMAudioReverbs that is a unified framework dealing with i) dereverberation and ii) room impulse response (RIR) estimation without network architectural modification, and fine-tuned on a small dataset. Experimental results showed that audio and visual cues respectively have advantage depending on the type of physical room acoustics. It implies that foundation V2A models can be used for physically grounded room-acoustic analysis.
[CV-46] Beyond Heuristics: Learnable Density Control for 3D Gaussian Splatting
[Quick Read]: This paper addresses the limited adaptability of heuristic density control in 3D Gaussian Splatting (3DGS), which struggles to flexibly adjust the Gaussian distribution in scenes with complex geometry while balancing reconstruction quality and computational efficiency. The key of the proposed LeGS framework is to turn density control from fixed heuristics into a learnable policy network optimized via reinforcement learning (RL); its core innovation is a tailored reward function grounded in sensitivity analysis that precisely quantifies the marginal contribution of each Gaussian to reconstruction quality, together with a closed-form solution that reduces the reward computation from O(N^2) to O(N), substantially improving training efficiency and reconstruction performance.
Link: https://arxiv.org/abs/2605.00408
Authors: Zhenhua Ning, Xin Li, Jun Yu, Guangming Lu, Yaowei Wang, Wenjie Pei
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:While 3D Gaussian Splatting (3DGS) has demonstrated impressive real-time rendering performance, its efficacy remains constrained by a reliance on heuristic density control. Despite numerous refinements to these handcrafted rules, such methods inherently lack the flexibility to adapt to diverse scenes with complex geometries. In this paper, we propose a paradigm shift for density control from rigid heuristics to fully learnable policies. Specifically, we introduce LeGS, a framework that reformulates density control as a parameterized policy network optimized via Reinforcement Learning (RL). Central to our approach is the tailored effective reward function grounded in sensitivity analysis, which precisely quantifies the marginal contribution of individual Gaussians to reconstruction quality. To maintain computational tractability, we derive a closed-form solution that reduces the complexity of reward calculation from O(N^2) to O(N). Extensive experiments on the Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets demonstrate that LeGS significantly outperforms state-of-the-art methods, striking a superior balance between reconstruction quality and efficiency. The code will be released at this https URL
[CV-47] BOLT: Online Lightweight Adaptation for Preparation-Free Heterogeneous Cooperative Perception
[Quick Read]: This paper studies preparation-free heterogeneous cooperative perception: how independently trained agents can fuse each other's perception without any prior joint training or collaborator-specific model adaptation. Existing methods rely on offline joint training or tailored adaptation, which is impractical in real deployments where agents are built by different developers and meet occasionally online. The key solution is BOLT, a lightweight plug-and-play module that adapts neighboring features online via "ego-as-teacher" distillation: requiring only ego predictions rather than ground-truth labels, high-confidence ego perception features guide cross-agent feature-domain alignment, while neighbors are encouraged to contribute features in the ego's low-confidence regions, enabling efficient and robust cross-agent fusion. With only 0.9M trainable parameters, BOLT substantially outperforms vanilla unadapted fusion and consistently beats ego-only perception on DAIR-V2X and OPV2V.
Link: https://arxiv.org/abs/2605.00405
Authors: Kang Yang, Tianci Bu, Peng Wang, Deying Li, Yongcai Wang
Affiliations: Renmin University of China; National University of Defense Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 10 figures, 10 tables
Abstract:Most existing heterogeneous cooperative perception methods depend on prior preparation like offline joint training or tailored collaborator-model adaptation. Such preprocessing is, however, generally impractical in real scenarios, as agents are usually independently trained by different developers and meet occasionally online. This work investigates \emphpreparation-free heterogeneous cooperative perception, where agents use independently trained single-agent detectors without any pre-deployment coordination. We find direct cross-agent fusion under this setting greatly underperforms ego-only perception. We present BOLT, a lightweight plug-and-play module that adapts neighboring features online via ego-as-teacher distillation, requiring only ego predictions without ground-truth labels. BOLT leverages high-confidence ego perception features to guide cross-agent feature-domain alignment, while enabling neighbors to contribute features in the ego’s low-confidence regions. With only 0.9M trainable parameters, BOLT improves AP@50 by up to 32.3 points over vanilla unadapted fusion in the preparation-free setting. It consistently outperforms ego-only results on DAIR-V2X and OPV2V, across different encoder pairs and fusion strategies. Code: this https URL.
[CV-48] SIMON: Saliency-aware Integrative Multi-view Object-centric Neural Decoding
[Quick Read]: This paper addresses the geometric-semantic dissociation in current EEG-to-image retrieval: existing models assume visual attention is fixed on the image center (a center-focused view), which conflicts with content-driven human attention and misaligns EEG responses with visual features. The key solution, SIMON, introduces Saliency-Aware Sampling (SAS), which combines foreground segmentation and saliency prediction to dynamically select fixation centers and generate multi-view foveated representations focused on informative object regions, strengthening brain-visual feature alignment. On THINGS-EEG, SIMON achieves state-of-the-art performance in both intra-subject and inter-subject settings, validating the effectiveness and robustness of saliency-guided multi-view integration.
Link: https://arxiv.org/abs/2605.00401
Authors: YuSheng Lin, Ji-Hwa Tsai, Chun-Shu Wei
Affiliations: National Yang Ming Chiao Tung University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
Comments:
Abstract:Recent EEG-to-image retrieval methods leverage pretrained vision encoders and foveation-inspired priors, but typically assume a fixed, center-focused view. This center bias conflicts with content-driven human attention, creating a geometric-semantic dissociation between visual features and EEG responses. We propose SIMON, a saliency-aware multi-view framework for zero-shot EEG-to-image retrieval. SIMON combines foreground segmentation and saliency prediction to select fixation centers via Saliency-Aware Sampling (SAS), then generates foveated views that emphasize informative object regions while suppressing background clutter. On THINGS-EEG, SIMON achieves state-of-the-art performance in both intra-subject and inter-subject settings, reaching an average Top-1 accuracy of 69.7% and 19.6%, respectively, consistently outperforming recent competitive baselines. Analyses across sampling granularity, EEG channel topology, and visual/brain encoder backbones further support the robustness of saliency-aware multi-view integration. Our code and models are publicly available at this https URL.
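Saliency-Aware Sampling can be pictured as peak picking on a foreground-masked saliency map: each selected peak becomes a fixation center for a foveated view. The greedy selection with square non-maximum suppression below is an illustrative assumption, not the released implementation:

```python
import numpy as np

def saliency_fixations(saliency, fg_mask, k=3, suppress=5):
    """Pick up to k fixation centers from a saliency map, restricted to the
    foreground. Greedy peak picking with a square suppression window is a
    stand-in for the paper's Saliency-Aware Sampling."""
    s = np.where(fg_mask, saliency, -np.inf).astype(float)
    centers = []
    for _ in range(k):
        idx = np.unravel_index(np.argmax(s), s.shape)
        if not np.isfinite(s[idx]):
            break                      # no foreground saliency left
        y, x = int(idx[0]), int(idx[1])
        centers.append((y, x))
        # suppress a neighborhood so the next peak is a distinct region
        s[max(0, y - suppress):y + suppress + 1,
          max(0, x - suppress):x + suppress + 1] = -np.inf
    return centers

sal = np.zeros((32, 32)); sal[8, 8] = 1.0; sal[20, 25] = 0.9
fg = np.ones((32, 32), dtype=bool)
print(saliency_fixations(sal, fg, k=2))   # [(8, 8), (20, 25)]
```

Foveated crops centered at these coordinates would then replace the single center-focused view assumed by prior methods.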
[CV-49] RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference ICML2026
[Quick Read]: This paper addresses the high computational cost and inference inefficiency caused by redundant visual tokens when DeepSeek-OCR processes long text, as well as the tendency of existing token-pruning methods for vision-language models (VLMs) to damage textual fidelity through improper compression. The key solution is RTPrune, a two-stage token-pruning method: the first stage prioritizes high-norm visual tokens that capture salient textual and structural information; the second stage pairs and merges the remaining tokens based on optimal transport theory for efficient feature aggregation. A dynamic pruning ratio further adapts to textual density and token similarity, preserving high accuracy while substantially accelerating inference.
Link: https://arxiv.org/abs/2605.00392
Authors: Ben Wan, Yan Feng, Zihan Tang, Weizhe Huang, Yuting Zeng, Jia Wang, Tongxuan Liu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 19 pages, accepted by ICML 2026
Abstract:DeepSeek-OCR leverages visual-text compression to reduce long-text processing costs and accelerate inference, yet visual tokens remain prone to redundant textual and structural information. Moreover, current token pruning methods for conventional vision-language models (VLMs) fail to preserve textual fidelity due to improper compression mechanisms. By analyzing the decoding process of DeepSeek-OCR, we find that a distinct two-stage reading trajectory: the model initially prioritizes the majority of high-norm tokens, then subsequently redistributes its attention to the remaining ones. Motivated by this insight, we propose RTPrune, a two-stage token pruning method tailored for DeepSeek-OCR. In the first stage, we prioritize high-norm visual tokens that capture salient textual and structural information. In the second stage, the remaining tokens are paired and merged based on optimal transport theory to achieve efficient feature aggregation. We further introduce a dynamic pruning ratio that adapts to token similarity and textual density for OCR tasks, enabling a better efficiency-accuracy trade-off. Extensive experiments demonstrate state-of-the-art performance, as evidenced by 99.47% accuracy and 1.23x faster prefill on OmniDocBench, achieved with 84.25% token retention when applied to DeepSeek-OCR-Large.
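The two-stage reduction can be sketched as: keep the highest-norm tokens, then merge the rest in similarity-based pairs. Note this is a simplification for illustration: the paper pairs tokens via optimal transport, whereas the sketch below uses greedy nearest-neighbor pairing, and all shapes and ratios are assumptions:

```python
import numpy as np

def prune_tokens(tokens, keep_ratio=0.5):
    """Two-stage token reduction in the spirit of RTPrune (illustrative only).

    Stage 1: keep the highest-norm tokens untouched.
    Stage 2: average-merge the remaining tokens in greedy similarity pairs.
    """
    n = len(tokens)
    k = int(n * keep_ratio)
    order = np.argsort(np.linalg.norm(tokens, axis=1))[::-1]
    kept = tokens[order[:k]]            # stage 1: high-norm tokens
    rest = tokens[order[k:]]
    merged = []
    used = np.zeros(len(rest), dtype=bool)
    for i in range(len(rest)):          # stage 2: pairwise merging
        if used[i]:
            continue
        used[i] = True
        sims = rest @ rest[i]
        sims[used] = -np.inf            # never pair with consumed tokens
        if np.isfinite(sims).any():
            j = int(np.argmax(sims))
            used[j] = True
            merged.append((rest[i] + rest[j]) / 2)
        else:
            merged.append(rest[i])      # odd token left over: keep as-is
    return np.vstack([kept] + merged) if merged else kept

toks = np.random.default_rng(0).standard_normal((8, 4))
out = prune_tokens(toks, keep_ratio=0.5)
print(out.shape)   # (6, 4): 4 kept plus 2 merged pairs
```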
[CV-50] Flow matching for Sentinel-2 super-resolution: implementation application and implications
[Quick Read]: This paper addresses the pervasive trade-off between spectral fidelity and perceptual quality in super-resolution of satellite imagery. The key contribution is a flow-matching model for 4x super-resolution: with a single sampling step using the Euler method it surpasses diffusion and Real-ESRGAN models in pixel-wise accuracy, while with a second-order midpoint solver it produces perceptually realistic imagery in only 20 sampling steps, effectively navigating the perception-distortion trade-off at inference time without retraining. The method substantially improves reconstruction of Sentinel-2 10-m visible and near-infrared bands to 2.5-m resolution and is applied at scale over the conterminous United States (CONUS) for imagery products and land-cover classification.
Link: https://arxiv.org/abs/2605.00367
Authors: Dakota Hester, Vitor S. Martins, Lucas B. Ferreira, Thainara M. A. Lima, Juliana A. Araújo
Affiliations: Mississippi State University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 41 pages, 16 figures, 7 tables. Submitted to ISPRS Open Journal of Photogrammetry and Remote Sensing
Abstract:Developing robust techniques for super-resolution of satellite imagery involves navigating commonly observed trade-offs between spectral fidelity and perceptual quality. In this work, we introduce a flow matching model for 4x super-resolution of 10-m Sentinel-2 visible and near-infrared bands over the conterminous United States (CONUS) using a dataset of 120,851 10-m Sentinel-2 and 2.5-m resampled NAIP imagery pairs acquired on the same day. Our results showed that the flow matching model outperformed diffusion and Real-ESRGAN models in pixel-wise accuracy in a single sampling step using the Euler method. When evaluated with a second-order Midpoint solver, our model generated perceptually realistic super-resolved imagery in only 20 sampling steps, effectively navigating the perception-distortion trade-off at inference time without retraining. We used this model to produce a super-resolved 2.5-m 4-band CONUS imagery product derived from 2025 10-m Sentinel-2 annual composites, consisting of over 1.58 trillion pixels. We further evaluated the use of super-resolved data on a land cover classification task using semantic segmentation models. Finally, we generated a yearly 2.5-m land cover product for the Chesapeake Bay watershed for 2020-2025. An accuracy assessment against 25,000 ground truth points revealed an overall accuracy of 89.11% for the annual land cover product. We conclude that flow matching is an effective generative modeling approach for super-resolution of Sentinel-2 imagery compared to diffusion and Generative Adversarial Network-based methods, and has strong implications for expanding access to high-resolution imagery for geospatial applications that demand fine spatial detail.
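The two samplers compared in this paper are standard ODE solvers applied to a learned velocity field. The toy field below (v(x, t) = x, whose exact flow from t=0 to t=1 multiplies x by e) stands in for the trained flow-matching network; the update rules themselves are textbook Euler and midpoint steps:

```python
import numpy as np

def euler_sample(v, x, steps):
    """First-order Euler integration of dx/dt = v(x, t) from t=0 to t=1."""
    dt = 1.0 / steps
    t = 0.0
    for _ in range(steps):
        x = x + dt * v(x, t)
        t += dt
    return x

def midpoint_sample(v, x, steps):
    """Second-order midpoint solver: evaluate the field at the half-step."""
    dt = 1.0 / steps
    t = 0.0
    for _ in range(steps):
        x_mid = x + 0.5 * dt * v(x, t)
        x = x + dt * v(x_mid, t + 0.5 * dt)
        t += dt
    return x

v = lambda x, t: x          # toy linear velocity field (exact flow: x * e)
x0 = np.array([1.0])
print(euler_sample(v, x0, 1))       # one coarse Euler step doubles x
print(midpoint_sample(v, x0, 20))   # 20 midpoint steps land close to e
```

This mirrors the paper's finding qualitatively: a single Euler step is cheap but coarse, while a few midpoint steps track the flow much more closely.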
[CV-51] me-series Meets Complex Motion Modeling: Robust and Computational-effective Motion Predictor for Multi-object Tracking
[Quick Read]: This paper addresses inaccurate motion prediction in multi-object tracking (MOT) under real-world non-linear motion such as sudden stops and sharp turns. Mainstream methods increasingly rely on complex and computationally expensive generative models, which limits their practical use. The key solution is the lightweight and efficient Temporal Convolutional Motion Predictor (TCMP), built on a modified temporal convolutional network (TCN) with dilated convolutions and a regression head, enabling accurate motion prediction over arbitrary temporal context lengths. Experiments show TCMP surpasses the previous state of the art on key metrics including HOTA, IDF1, and AssA while using only 0.014x the parameters and 0.05x the computational cost (FLOPs) of the SOTA method, markedly improving the accuracy, robustness, and efficiency of tracking systems.
Link: https://arxiv.org/abs/2605.00362
Authors: Nhat-Tan Do, Le-Huy Tu, Nhi Ngoc-Yen Nguyen, Dieu-Phuong Nguyen, Trong-Hop Do
Affiliations: University of Information Technology (UIT)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Multi-object tracking (MOT) is critical in numerous real-world applications, including surveillance, autonomous driving, and robotics. Accurately predicting object motion is fundamental to MOT, but current methods struggle with the complexities of real-world, non-linear motion (e.g., sudden stops, sharp turns). While recent research has gravitated towards increasingly complex and computationally expensive generative models to tackle this problem, their practical utility is often constrained. This paper challenges that paradigm, arguing that such complexity is not only unnecessary but can be outperformed by a more efficient, purpose-built approach. We introduce the Temporal Convolutional Motion Predictor (TCMP), a novel framework for MOT that leverages a modified Temporal Convolutional Network (TCN) featuring dilated convolutions and a regression head. This design allows for effective motion prediction across arbitrary temporal context lengths. Experimental results demonstrate that our approach achieves state-of-the-art performance, specifically improves upon the previous best method in several key metrics: HOTA (a measure of overall tracking accuracy) increases from 62.3% to 63.4%, IDF1 (a measure of identity preservation) rises from 63.0% to 65.0%, and AssA (a measure of association accuracy) improves from 47.2% to 49.1%. Significantly, TCMP achieves this performance while being highly efficient; it has only 0.014 times the parameters and requires only 0.05 times the computational cost (FLOPs) compared to the SOTA method. These findings highlight the robustness of our method to advance MOT systems by ensuring adaptability, accuracy, and efficiency in complex tracking environments.
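The dilated-convolution design behind TCMP can be illustrated with the standard TCN receptive-field formula and a minimal causal dilated convolution. The kernel sizes, weights, and dilation schedule below are illustrative assumptions, not TCMP's actual configuration:

```python
import numpy as np

def receptive_field(kernel_size, dilations):
    """Receptive field (in time steps) of a stack of dilated 1-D convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

def dilated_causal_conv(x, w, dilation):
    """Minimal dilated causal 1-D convolution (left zero-padded), the basic
    TCN building block: y[t] = sum_j w[j] * x[t - j*dilation]."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([sum(w[j] * xp[t + pad - j * dilation] for j in range(k))
                     for t in range(len(x))])

print(receptive_field(3, [1, 2, 4, 8]))   # 31 time steps from 4 layers
y = dilated_causal_conv(np.arange(5.0), np.array([1.0, 1.0]), dilation=2)
print(y)   # each output sums the current step and the step two frames back
```

Doubling the dilation per layer is what lets a shallow, cheap stack cover long temporal contexts, the property the paper exploits to predict motion over arbitrary history lengths.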
[CV-52] CURE-OOD: Benchmarking Out-of-Distribution Detection for Survival Prediction
[Quick Read]: This paper addresses the degraded reliability of cancer survival prediction models under covariate shift caused by variations in imaging acquisition parameters. Although existing frameworks perform well on CT images, out-of-distribution (OOD) samples introduced by differences in scanners or acquisition settings significantly hurt performance, and no existing benchmark systematically evaluates OOD detection in this setting. The key solution is CURE-OOD, the first OOD-detection benchmark for survival prediction: it defines scanner-parameter-based training, in-distribution (ID), and OOD test splits across four survival prediction tasks, enabling systematic analysis of both survival prediction performance and OOD detectability under distribution shift.
Link: https://arxiv.org/abs/2605.00350
Authors: Wenjie Zhao, Jia Li, Mingrui Liu, Jing Wang, Yunhui Guo
Affiliations: University of Texas at Dallas; George Mason University; UT Southwestern Medical Center
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:``How long can I live and remain free of cancer?‘’ is often the first question a patient asks after receiving a cancer diagnosis and treatment. Accurate survival prediction helps alleviate psychological distress and supports risk stratification and personalized treatment planning. Recent survival prediction frameworks have shown strong performance using computed tomography (CT) images. However, variations in imaging acquisition introduce out-of-distribution (OOD) samples caused by covariate shifts that undermine model reliability. Despite this challenge, to our knowledge, no existing benchmark systematically studies OOD detection in cancer survival prediction. To address this gap, we introduce the Cancer sURvival bEnchmark for OOD Detection (CURE-OOD), the first benchmark for systematically evaluating OOD detection in survival prediction under controlled acquisition-induced distribution shifts. CURE-OOD defines scanner-parameter-based training, in-distribution (ID), and OOD test splits across four survival prediction tasks. Our experiments show that covariate shifts notably reduce survival prediction performance. It also shows that mainstream classification-oriented OOD detectors can fail in survival prediction. Finally, we include HazardDev as a simple survival-aware reference baseline for OOD detection. CURE-OOD enables systematic analysis of how distribution shifts affect both downstream survival performance and OOD detectability.
[CV-53] Pose-Aware Diffusion for 3D Generation
[Quick Read]: This paper addresses the spatial mismatches and transformation (pose) ambiguities inherent in the decoupled canonical-then-rotate paradigm for generating pose-aligned 3D objects. The key of the proposed end-to-end diffusion framework, Pose-Aware Diffusion (PAD), is to synthesize 3D geometry directly in the observation space: monocular depth is unprojected into a partial point cloud and explicitly injected as a 3D geometric anchor, abandoning canonical-pose assumptions in favor of rigorous spatial supervision. This mechanism intrinsically resolves pose ambiguity and produces high-fidelity, pose-aligned 3D assets.
Link: https://arxiv.org/abs/2605.00345
Authors: Zihan Zhou, Luxi Chen, Jingzhi Zhou, Yuhao Wan, Min Zhao, Baoyu Fan, Chongxuan Li
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Generating pose-aligned 3D objects is challenging due to the spatial mismatches and transformation ambiguities inherent in decoupled canonical-then-rotate paradigms. To this end, we introduce Pose-Aware Diffusion (PAD), a novel end-to-end diffusion framework that synthesizes 3D geometry directly within the observation space. By unprojecting monocular depth into a partial point cloud and explicitly injecting it as a 3D geometric anchor, PAD abandons canonical assumptions to enforce rigorous spatial supervision. This native generation intrinsically resolves pose ambiguity, producing high-fidelity pose-aligned assets. Extensive experiments demonstrate that PAD achieves superior geometric alignment and image-to-3D correspondence compared to state-of-the-art methods. Additionally, PAD naturally extends to compositional 3D scene reconstruction via a simple union of independently generated objects, highlighting its robust ability to preserve precise spatial layouts.
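Unprojecting monocular depth into a partial point cloud is standard pinhole back-projection. The sketch below uses assumed intrinsics (fx, fy, cx, cy) and is illustrative of the geometric-anchor step, not PAD's code:

```python
import numpy as np

def unproject_depth(depth, fx, fy, cx, cy):
    """Back-project a (H, W) depth map into an (N, 3) partial point cloud
    using a pinhole camera model; zero-depth pixels are dropped."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]

depth = np.full((2, 2), 2.0)        # toy 2x2 depth map, all at 2 m
pts = unproject_depth(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(pts.shape)   # (4, 3)
```

The resulting points live in the camera's observation space, which is exactly why injecting them as an anchor removes the need for a canonical-pose assumption.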
[CV-54] Online Self-Calibration Against Hallucination in Vision-Language Models IJCAI2026
[Quick Read]: This paper addresses hallucination in large vision-language models (LVLMs), where generated descriptions include visual details absent from the input image. Existing preference-alignment methods distill supervision from stronger models such as GPT, but this offline paradigm causes a Supervision-Perception Mismatch: the student model is forced to align with fine-grained details beyond its perceptual capacity and learns to guess rather than to see. The key of the proposed OSCAR (Online Self-Calibration) framework is to identify and exploit the Generative-Discriminative Gap inside LVLMs, i.e., models are more accurate at discriminative verification than at open-ended generation. Combining Monte Carlo Tree Search with a Dual-Granularity Reward Mechanism, OSCAR automatically constructs preference data and iteratively refines the model via Direct Preference Optimization, reducing hallucinations while improving general multimodal capabilities.
Link: https://arxiv.org/abs/2605.00323
Authors: Minghui Chen, Chenxu Yang, Hengjie Zhu, Dayan Wu, Zheng Lin, Qingyi Si
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; JD.COM
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: IJCAI 2026
Abstract:Large Vision-Language Models (LVLMs) often suffer from hallucinations, generating descriptions that include visual details absent from the input image. Recent preference alignment methods typically rely on supervision distilled from stronger models such as GPT. However, this offline paradigm introduces a Supervision-Perception Mismatch: the student model is forced to align with fine-grained details beyond its perceptual capacity, learning to guess rather than to see. To obtain reliable self-supervision for online learning, we identify a Generative-Discriminative Gap within LVLMs, where models exhibit higher accuracy on discriminative verification than open-ended generation. Leveraging this capability, we propose Online Self-CAlibRation (OSCAR), a framework that integrates Monte Carlo Tree Search with a Dual-Granularity Reward Mechanism to construct preference data and iteratively refines the model via Direct Preference Optimization. Extensive experiments demonstrate that OSCAR achieves state-of-the-art performance on hallucination benchmarks while improving general multimodal capabilities.
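OSCAR 最终通过直接偏好优化(DPO)在自构建的偏好数据上迭代优化模型。下面用 numpy 给出单个偏好对上 DPO 损失的标量化示意(β 取值与对数似然数值均为假设,非论文实现):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair (w = chosen, l = rejected):
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# If the policy matches the reference exactly, the loss sits at -log(0.5).
loss_neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# If the policy prefers the chosen answer more than the reference does,
# the loss drops below that baseline.
loss_good = dpo_loss(-1.0, -5.0, -2.0, -2.0)
```

偏好对本身由 MCTS 与双粒度奖励机制构建;此处仅演示优化目标的形式。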
[CV-55] Beyond Visual Fidelity: Benchmarking Super-Resolution Models for Large-Scale Remote Sensing Imagery via Downstream Task Integration
【速读】:该论文旨在解决现有超分辨率(Super-resolution, SR)模型评估体系与实际地球观测下游任务需求脱节的问题。当前主流SR研究多依赖PSNR或SSIM等保真度指标,但这些指标无法有效反映SR图像在土地覆盖分类、生物量估算等具体应用中的性能提升。为此,作者提出GeoSR-Bench这一集成下游任务的SR基准数据集,其关键在于构建了约36,000个空间对齐、时间同步且质量可控的多分辨率图像对(覆盖500米至0.6米),并首次将SR模型性能直接关联到多个地球监测任务(如地物分割、基础设施制图和生物物理变量估计)。通过在270种实验设置下对比GAN、Transformer、神经算子和扩散模型等SR方法,结果表明传统保真度指标与下游任务性能之间相关性弱甚至为负,凸显出将下游任务纳入SR模型开发与评估流程的必要性。
链接: https://arxiv.org/abs/2605.00310
作者: Zhili Li,Kangyang Chai,Zhihao Wang,Xiaowei Jia,Yanhua Li,Gengchen Mai,Sergii Skakun,Dinesh Manocha,Yiqun Xie
机构: University of Maryland (马里兰大学); University of Pittsburgh (匹兹堡大学); Worcester Polytechnic Institute (伍斯特理工学院); University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Super-resolution (SR) techniques have made major advances in reconstructing high-resolution images from low-resolution inputs. The increased resolution provides visual enhancement and utility for monitoring tasks. In particular, SR has been increasingly developed for satellite-based Earth observation, with applications in urban planning, agriculture, ecology, and disaster response. However, existing SR studies and benchmarks typically use fidelity metrics such as PSNR or SSIM, whereas the true utility of super-resolved images lies in supporting downstream tasks such as land cover classification, biomass estimation, and change detection. To bridge this gap, we introduce GeoSR-Bench, a downstream task-integrated SR benchmark dataset to evaluate SR models beyond fidelity metrics. GeoSR-Bench comprises spatially co-located, temporally aligned, and quality-controlled image pairs from about 36,000 locations across diverse land covers, spanning resolutions from 500m to 0.6m. To the best of our knowledge, GeoSR-Bench is the first SR benchmark that directly connects improved image resolution from SR models with downstream Earth monitoring tasks, including land cover segmentation, infrastructure mapping, and biophysical variable estimation. Using GeoSR-Bench, we benchmark GAN, transformer, neural operator, and diffusion-based SR models on perceptual quality and downstream task performance. We conduct experiments with 270 settings, covering 2 cross-platform SR tasks, 9 SR models, 3 downstream task models, and 5 downstream tasks for each SR task. The results show that improvements in traditional SR metrics often do not correlate with gains in task performance, and the correlations can be negative, indicating that these metrics provide limited guidance for selecting superior models for downstream tasks. This reveals the need to integrate downstream tasks into SR model development and evaluation.
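GeoSR-Bench 的核心结论是保真度指标(如 PSNR)与下游任务性能相关性弱甚至为负。下面用 numpy 示意 PSNR 与 Pearson 相关系数的计算方式,并构造一组假设的模型得分演示"负相关"现象(数值纯属虚构,非基准实测结果):

```python
import numpy as np

def psnr(ref, test, max_val=1.0):
    """Peak signal-to-noise ratio (dB) between two images in [0, max_val]."""
    mse = np.mean((ref - test) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def pearson(x, y):
    """Pearson correlation between two paired metric lists."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Hypothetical per-model scores: fidelity improves while task accuracy drops.
psnr_scores = [28.0, 30.0, 32.0, 34.0]
task_scores = [0.71, 0.69, 0.66, 0.62]
r = pearson(psnr_scores, task_scores)  # negative for this toy data
```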
[CV-56] A Model-based Visual Contact Localization and Force Sensing System for Compliant Robotic Grippers
【速读】:该论文旨在解决软体夹爪(soft gripper)在抓取过程中难以准确估计抓握力(grasp force)的问题,以防止对易损物体造成损伤并提升基于学习的机器人控制性能。现有方法中,端到端深度学习模型泛化能力差,而传统模型驱动方法又不适用于现代软体夹爪几何结构和抓取任务。解决方案的关键在于提出一种基于物理模型的视觉力感知方法:通过腕部RGB-D相机提取变形鳍状软夹爪的结构关键点,并将其用于定义Simulation Open Framework Architecture(SOFA)中逆有限元分析(inverse finite element analysis)的参数;同时引入一个迭代接触定位子系统,利用深度学习实现在线3D重建与位姿估计,动态更新接触位置,从而在视觉遮挡和未知物体条件下仍保持鲁棒性。该方案实现了高精度的间接力估计,在负载阶段平均均方根误差为0.23 N,整体抓取过程为0.48 N,验证了其在实时软夹爪力感知中的潜力。
链接: https://arxiv.org/abs/2605.00307
作者: Kaiwen Zuo,Shuyuan Yang,Zonghe Chua
机构: Case Western Reserve University (凯斯西储大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6 figures, IEEE Robotics and Automation Letters
Abstract:Grasp force estimation can help prevent robots from damaging delicate objects during manipulation and improve learning-based robotic control. Integrating force sensing into deformable grippers negotiates trade-offs in cost, complexity, mechanical robustness, and performance. With the growing integration of RGB-D wrist cameras into robotic systems for control purposes, camera-based techniques are a promising solution for indirect visual force estimation. Current approaches mostly utilize end-to-end deep learning, which can be brittle when generalizing to new scenarios, while existing model-based approaches are unsuited to grasping and modern grasper geometries. To address these challenges, we developed a model-based visual force sensing approach integrating an iterative contact localization with generalization to unseen objects. The system extracts structural key points from wrist camera RGB-D images of deforming fin-ray-shaped soft grippers, and uses these key points to define parameters of an inverse finite element analysis simulation in Simulation Open Framework Architecture. The iterative contact localization sub-system utilizes a deep learning-based online 3D reconstruction and pose estimation pipeline to dynamically update contact location, and is robust to visual occlusion and unseen objects. Our system demonstrated an average root mean square error of 0.23 N and normalized root mean square deviation of 2.11% during the load phase, and 0.48 N and 4.34% over the entire grasping process when interacting with different objects under various conditions, showcasing its potential for real-time model-based indirect force sensing of soft grippers.
[CV-57] Efficient Spatio-Temporal Vegetation Pixel Classification with Vision Transformers
【速读】:该论文旨在解决植物物候学(plant phenology)中基于长时间序列遥感影像进行高效植被像素分类的计算挑战,特别是传统多时相卷积神经网络(Multi-Temporal Convolutional Networks, CNNs)在处理长时序数据时存在计算复杂度高、空间上下文窗口大且难以扩展的问题。其解决方案的关键在于采用视觉Transformer(Vision Transformers, ViTs)架构,并通过系统性消融实验优化七个核心设计维度,包括数据归一化、光谱排列、边界处理、空间上下文窗口形状与大小、分块策略、位置编码及特征聚合方式。实验表明,所提ViT方法在保持分类性能的同时,将浮点运算量(FLOPs)降低一个数量级,并实现参数复杂度与时间序列长度无关的恒定特性,显著优于CNN基线模型,从而为资源受限环境下的物候监测系统提供了可扩展、高效的深度学习方案。
链接: https://arxiv.org/abs/2605.00296
作者: Alan Gomes,Anderson Gonçalves,Samuel Felipe dos Santos,Nathan Felipe Alves,Magna Soelma Beserra de Moura,Bruna de Costa Alberton,Leonor Patricia C. Morellato,Ricardo da Silva Torres,Jurandy Almeida
机构: Federal University of São Carlos (UFSCar)(圣卡洛斯联邦大学); São Paulo State University (UNESP)(圣保罗州立大学); Brazilian Agricultural Research Corporation (EMBRAPA)(巴西农业研究公司); Wageningen University Research (WUR)(瓦赫宁根大学与研究机构)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Plant phenology, the study of recurrent life cycle events, is essential for understanding ecosystem dynamics and their responses to climate change impacts. While Unmanned Aerial Vehicles (UAVs) and near-surface cameras enable high-resolution monitoring, identifying plant species across time remains computationally challenging. State-of-the-art approaches, specifically Multi-Temporal Convolutional Networks (CNNs), rely on rigid multi-branch architectures that scale poorly with longer time series and require large spatial context windows. In this paper, we present an extensive study on optimizing Vision Transformers (ViTs) for efficient spatio-temporal vegetation pixel classification. We conducted a comprehensive ablation study analyzing seven key design dimensions, including: (i) data normalization; (ii) spectral arrangement; (iii) boundary handling; (iv) spatial context window shape and size; (v) tokenization strategies; (vi) positional encoding; and (vii) feature aggregation strategies. Our method was evaluated on two datasets from the Brazilian Cerrado biome, Serra do Cipó (aerial imagery) and Itirapina (near-surface imagery). Experimental results demonstrate that our ViT approach offers a substantial improvement in computational efficiency while maintaining competitive classification performance. Notably, our ViT reduces Floating Point Operations (FLOPs) by an order of magnitude and maintains constant parameter complexity regardless of the time series length, whereas the CNN baseline scales linearly. Our findings confirm that ViTs are a robust, scalable solution for resource-constrained phenological monitoring systems.
[CV-58] An End-to-End Decision-Aware Multi-Scale Attention-Based Model for Explainable Autonomous Driving
【速读】:该论文旨在解决深度学习模型在自动驾驶系统中因缺乏可解释性而导致的决策不可信、故障难以预测以及实际应用受限的问题。其核心挑战在于现有解释方法存在推理错误和指标不可靠,无法实现对复杂模型行为的全面理解。解决方案的关键是提出一种多尺度注意力机制驱动的推理模型,将驾驶决策作为输入注入到解释组件中,从而为每个决策提供情境相关的实时解释;同时引入F1-score和新提出的联合F1(Joint F1)评分指标,从定量角度验证模型在可解释人工智能(Explainable Artificial Intelligence, XAI)方面的准确性和可靠性,并通过BDD-OIA与nu-AR数据集验证了模型的泛化能力和鲁棒性。
链接: https://arxiv.org/abs/2605.00291
作者: Maryam Sadat Hosseini Azad,Shahriar Baradaran Shokouhi,Amir Abbas Hamidi Imani,Shahin Atakishiyev,Randy Goebel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:The application of computer vision is gradually increasing across various domains. They employ deep learning models with a black-box nature. Without the ability to explain the behavior of neural networks, especially their decision-making processes, it is not possible to recognize their efficiency, predict system failures, or effectively implement them in real-world applications. Due to the inevitable use of deep learning in fully automated driving systems, many methods have been proposed to explain their behavior; however, they suffer from flawed reasoning and unreliable metrics, which have prevented a comprehensive understanding of complex models in autonomous vehicles and hindered the development of truly reliable systems. In this study, we propose a multi-scale attention-based model in which driving decisions are fed into the reasoning component to provide case-specific explanations for each decision simultaneously. For quantitative evaluation of our model’s performance, we employ the F1-score metric, and also proposed a new metric called the Joint F1 score to demonstrate the accurate and reliable performance of the model in terms of Explainable Artificial Intelligence (XAI). In addition to the BDD-OIA dataset, the nu-AR dataset is utilized to further validate the generalization capability and robustness of the proposed network. The results demonstrate the superiority of our reasoning network over the classic and state-of-the-art models.
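论文提出的 Joint F1 指标在摘要中未给出公式。下面给出标准 F1 的计算,以及对"联合"指标的一种假设性解读:只有当驾驶决策与解释同时预测正确时才计为真正例(此定义纯属示意,未必与论文一致):

```python
def f1_binary(y_true, y_pred):
    """F1 score for parallel binary label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Toy decision-head and explanation-head predictions for 6 samples.
dec_true = [1, 1, 0, 1, 0, 1]; dec_pred = [1, 1, 0, 0, 0, 1]
exp_true = [1, 0, 0, 1, 1, 1]; exp_pred = [1, 0, 1, 1, 0, 1]

# Hypothetical "joint" reading: a positive counts only when decision
# and explanation are positive together.
joint_true = [d and e for d, e in zip(dec_true, exp_true)]
joint_pred = [d and e for d, e in zip(dec_pred, exp_pred)]

f1_dec = f1_binary(dec_true, dec_pred)
f1_joint = f1_binary(joint_true, joint_pred)
```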
[CV-59] When Do Diffusion Models learn to Generate Multiple Objects? ICML2026
【速读】:该论文旨在解决生成式 AI(Generative AI)中扩散模型在多对象场景生成任务中的不可靠性问题,尤其是其在概念泛化与组合泛化能力上的局限。研究表明,这种局限并非主要源于数据不平衡,而是由场景复杂度和低数据环境下计数能力难以学习所驱动;此外,随着训练时系统性地排除特定概念组合,组合泛化能力显著下降。解决方案的关键在于引入一个受控的数据集生成框架——Mosaic(Multi-Object Spatial relations, AttrIbution, Counting),通过该框架可分离数据分布与模型结构的影响,从而揭示扩散模型在多对象生成中的根本瓶颈,并为设计更强归纳偏置(inductive biases)和更优数据策略提供依据,以提升模型对复杂场景的鲁棒组合生成能力。
链接: https://arxiv.org/abs/2605.00273
作者: Yujin Jeong,Arnas Uselis,Iro Laina,Seong Joon Oh,Anna Rohrbach
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ICML2026
Abstract:Text-to-image diffusion models achieve impressive visual fidelity, yet they remain unreliable in multi-object generation. Despite extensive empirical evidence of these failures, the underlying causes remain unclear. We begin by asking how much of this limitation arises from the data itself. To disentangle data effects, we consider two regimes across different dataset sizes: (1) concept generalization, where each individual concept is observed during training under potentially imbalanced data distributions, and (2) compositional generalization, where specific combinations of concepts are systematically held out. To study these regimes, we introduce Mosaic (Multi-Object Spatial relations, AttrIbution, Counting), a controlled framework for dataset generation. By training diffusion models on Mosaic, we find that scene complexity plays a dominant role rather than concept imbalance, and that counting is uniquely difficult to learn in low-data regimes. Moreover, compositional generalization collapses as more concept combinations are held out during training. These findings highlight fundamental limitations of diffusion models and motivate stronger inductive biases and data design for robust multi-object compositional generation.
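论文的"组合泛化"设定是:每个单独概念在训练中都出现过,但特定概念组合被系统性留出。这一数据划分可用标准库示意如下(概念词表为虚构示例,与 Mosaic 实际词表无关):

```python
from itertools import product

colors = ["red", "blue", "green"]
shapes = ["cube", "sphere", "cone"]
all_combos = set(product(colors, shapes))

# Compositional generalization: hold out specific pairings at training
# time; the model is tested on exactly those unseen combinations.
held_out = {("red", "sphere"), ("green", "cube")}
train_combos = all_combos - held_out

# Every individual concept still appears in training -- only the
# specific pairings are novel at test time.
train_colors = {c for c, _ in train_combos}
train_shapes = {s for _, s in train_combos}
```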
[CV-60] REALM: An RGB and Event Aligned Latent Manifold for Cross-Modal Perception
【速读】:该论文旨在解决现有基于学习的事件处理方法普遍局限于特定任务且缺乏跨模态泛化能力的问题。其核心挑战在于如何将异步事件流有效映射到预训练RGB基础模型的潜在空间中,从而利用冻结的RGB骨干网络中的几何与语义先验。解决方案的关键在于提出REALM框架,通过低秩适应(Low-rank Adaptation, LoRA)技术将事件表示投影至RGB基础模型的预训练潜在空间,实现RGB与事件数据在潜在空间中的对齐,进而无需重新训练即可直接迁移线性分类头或复杂图像解码器(如MASt3R)至事件数据,显著提升下游任务(如深度估计、语义分割和宽基线特征匹配)的性能。
链接: https://arxiv.org/abs/2605.00271
作者: Vincenzo Polizzi,David B. Lindell,Jonathan Kelly
机构: University of Toronto (多伦多大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:Event cameras provide several unique advantages over standard frame-based sensors, including high temporal resolution, low latency, and robustness to extreme lighting. However, existing learning-based approaches for event processing are typically confined to narrow, task-specific silos and lack the ability to generalize across modalities. We address this gap with REALM, a cross-modal framework that learns an RGB and Event Aligned Latent Manifold by projecting event representations into the pretrained latent space of RGB foundation models. Instead of task-specific training, we leverage low-rank adaptation (LoRA) to bridge the modality gap, effectively unlocking the geometric and semantic priors of frozen RGB backbones for asynchronous event streams. We demonstrate that REALM effectively maps events into the ViT-based foundation latent space. Our method allows us to perform downstream tasks like depth estimation and semantic segmentation by simply transferring linear heads trained on the RGB teacher. Most significantly, REALM enables the direct, zero-shot application of complex, frozen image-trained decoders, such as MASt3R, to raw event data. We demonstrate state-of-the-art performance in wide-baseline feature matching, significantly outperforming specialized architectures. Code and models are available upon acceptance.
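REALM 用 LoRA(低秩适配)在冻结的 RGB 骨干上弥合模态差距。下面用 numpy 给出 LoRA 前向传播 y = Wx + (α/r)·B(Ax) 的最小示意(维度、秩与缩放系数均为示意性假设,非 REALM 官方实现):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 16, 2      # layer dims and LoRA rank (illustrative)
alpha = 4.0                    # LoRA scaling factor

W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                 # trainable up-projection, zero-init

def lora_forward(x):
    """Frozen path plus a low-rank update: y = W x + (alpha/r) * B (A x)."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
y_frozen = lora_forward(x)               # with B = 0: identical to base model
B += rng.normal(size=(d_out, r))         # pretend some training happened
y_adapted = lora_forward(x)
```

零初始化的 B 保证训练起点与冻结骨干完全一致,可训练参数量(A、B)远小于原权重 W,这正是其能"解锁"冻结 RGB 先验的原因。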
[CV-61] Remote SAMsing: From Segment Anything to Segment Everything
【速读】:该论文旨在解决将SAM2(Segment Anything Model 2)应用于大尺度遥感(Remote Sensing, RS)图像时面临的两大核心问题:一是掩码生成器存在质量与覆盖范围之间的固有权衡,即严格阈值虽能保证掩码精度但导致覆盖不足,而放宽阈值虽提升覆盖却牺牲质量;二是大图像需分块处理(tiling),造成目标对象跨块断裂,破坏空间一致性。解决方案的关键在于提出一种名为Remote SAMsing的开源流水线,其核心创新包括:(1)多轮迭代策略(multi-pass algorithm),在每块图像上重复运行SAM2,通过将已接受的掩码置黑来简化场景,并仅在覆盖增益停滞时逐步放松质量阈值,从而优先捕获最精确的掩码;(2)上下文填充(contextual padding)与无参数最优匹配合并(parameter-free best-match merge)机制,有效重建跨块断裂的目标,保障空间一致性。该方案无需修改SAM2或引入训练数据,在多个遥感场景中将覆盖范围从30–68%显著提升至91–98%,且保持高精度分割性能。
链接: https://arxiv.org/abs/2605.00256
作者: Osmar Luiz Ferreira de Carvalho,Osmar Abílio de Carvalho Júnior,Anesmar Olino de Albuquerque,Daniel Guerreiro e Silva
机构: University of Brasília (巴西利亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 31 pages, 8 figures, 7 tables
Abstract:SAM2 produces high-quality zero-shot segmentation on natural images, but applying it to large remote sensing scenes exposes two problems: (1) its mask generator faces an inherent quality-coverage trade-off: strict thresholds yield precise masks but leave most of the image unsegmented, while relaxed thresholds increase coverage at the cost of mask quality; and (2) large images must be tiled, fragmenting objects across tile boundaries. We propose Remote SAMsing, an open-source pipeline that solves both problems without modifying SAM2 or requiring training data. For coverage, a multi-pass algorithm runs SAM2 repeatedly on each tile, painting accepted masks black between passes to simplify the scene for the next iteration, and relaxing quality thresholds only when coverage gains stagnate, ensuring that the most precise masks are always captured first. For spatial consistency, contextual padding and a parameter-free best-match merge reconstruct objects fragmented across tile boundaries. Evaluated on seven scenes (5 cm to 4.78 m GSD), the pipeline raises coverage from 30–68% (single-pass SAM2) to 91–98%. Ablation experiments quantify the contribution of each component to coverage and detection quality. Per-class evaluation shows that SAM2 transfers well to discrete RS objects (buildings 95%, cars 82–93% Det@0.5) with segment boundaries 3–8× more precise than SLIC and Felzenszwalb baselines. Tile size functions as an implicit scale parameter: reducing it from 1,000 to 250 raises Det@0.5 from 56% to 85%, outperforming SAM2’s built-in multi-scale mechanism. The pipeline generalizes to MNF false-color imagery without retraining (99.5% ASA) and scales to production-sized images: a 1.94 billion pixel Potsdam mosaic achieved 97% coverage without quality degradation.
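多轮策略的控制逻辑(先用严格阈值、将已接受掩码涂黑、覆盖增益停滞才放宽阈值)可示意如下。此处用一个逐像素的占位生成函数代替 SAM2,阈值与图像数值均为假设:

```python
import numpy as np

def multi_pass(image, generate_masks, thresholds=(0.9, 0.8, 0.7), min_gain=0.01):
    """Sketch of the multi-pass loop: run the mask generator repeatedly,
    paint accepted masks black, and relax the quality threshold only when
    the coverage gain stagnates -- strictest masks are captured first."""
    work = image.copy()
    accepted = []
    covered = np.zeros(image.shape[:2], dtype=bool)
    for thr in thresholds:              # strictest quality threshold first
        while True:
            before = covered.mean()
            for mask, score in generate_masks(work):
                if score >= thr and not covered[mask].all():
                    accepted.append(mask)
                    covered |= mask
                    work[mask] = 0      # paint accepted regions black
            if covered.mean() - before < min_gain:
                break                   # coverage stagnated: relax threshold
    return accepted, covered

def toy_gen(img):
    """Stand-in for SAM2: proposes each remaining bright pixel as a mask,
    using the pixel value as a fake quality score."""
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            if img[y, x, 0] > 0:
                m = np.zeros(img.shape[:2], dtype=bool)
                m[y, x] = True
                yield m, float(img[y, x, 0])

image = np.array([[[1.0], [0.95]], [[0.75], [0.0]]])
masks, covered = multi_pass(image, toy_gen)  # 0.75 accepted only at thr=0.7
```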
[CV-62] Lucid-XR: An Extended-Reality Data Engine for Robotic Manipulation
【速读】:该论文旨在解决机器人系统训练中真实世界数据获取成本高、多样性不足的问题,尤其在复杂场景(如杂乱环境和低光照条件)下难以获得足够多样的训练数据。解决方案的关键在于提出Lucid-XR——一个生成式数据引擎,其核心是vuer,一个运行在XR头显上的基于Web的物理仿真环境,支持无延迟的沉浸式交互且无需专用设备;该系统通过本地化物理模拟与人到机器人的姿态映射相结合,并利用自然语言引导的物理感知视频生成流水线放大数据,从而实现仅用合成数据即可零样本迁移至未见过的真实环境,验证了其在灵巧操作任务中的有效性。
链接: https://arxiv.org/abs/2605.00244
作者: Yajvan Ravan,Adam Rashid,Alan Yu,Kai McClennen,Gio Huh,Kevin Yang,Zhutian Yang,Qinxi Yu,Xiaolong Wang,Phillip Isola,Ge Yang
机构: MIT CSAIL (麻省理工学院计算机科学与人工智能实验室); FortyFive Labs; Caltech (加州理工学院); Harvard University (哈佛大学); UC San Diego (加州大学圣地亚哥分校)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Project website: this https URL
Abstract:We introduce Lucid-XR, a generative data engine for creating diverse and realistic-looking multi-modal data to train real-world robotic systems. At the core of Lucid-XR is vuer, a web-based physics simulation environment that runs directly on the XR headset, enabling internet-scale access to immersive, latency-free virtual interactions without requiring specialized equipment. The complete system integrates on-device physics simulation with human-to-robot pose retargeting. Data collected is further amplified by a physics-guided video generation pipeline steerable via natural language specifications. We demonstrate zero-shot transfer of robot visual policies to unseen, cluttered, and badly lit evaluation environments, after training entirely on Lucid-XR’s synthetic data. We include examples across dexterous manipulation tasks that involve soft materials, loosely bound particles, and rigid body contact. Project website: this https URL
[CV-63] MAEPose: Self-Supervised Spatiotemporal Learning for Human Pose Estimation on mmWave Video
【速读】:该论文旨在解决毫米波(mmWave)雷达在人体姿态估计中因依赖预提取的中间表示(如稀疏点云或频谱图图像)而导致时空信息丢失及系统复杂度增加的问题,同时现有方法多采用端到端监督学习,未能充分利用未标注的原始雷达视频流来学习泛化表征。其解决方案的关键在于提出MAEPose——一种基于掩码自编码(masked autoencoding)的人体姿态估计方法,直接在mmWave频谱图视频上进行训练,通过预训练阶段从无标签雷达视频中学习时空运动感知的通用表征,并利用热图解码器实现多帧姿态预测。实验表明,该方法在多个数据集上显著优于现有最优基线(MPJPE提升达22.1%),且在零样本旁观者干扰下保持鲁棒性(误差仅增加6.5%)。
链接: https://arxiv.org/abs/2605.00242
作者: Xijia Wei,Yuan Fang,Kevin Chetty,Youngjun Cho,Nadia Bianchi-Berthouze
机构: University College London(伦敦大学学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Millimetre-wave (mmWave) radar offers a more privacy-preserving alternative to RGB-based human pose estimation. However, existing methods typically rely on pre-extracted intermediate representations such as sparse point clouds or spectrogram images, where the rich spatiotemporal information naturally present in radar video streams is discarded for model learning, while such signal processing adds system complexity. In addition, existing solutions are mainly conducted in an end-to-end supervised manner without leveraging unlabelled raw video streams to learn generalized representations. In this study, we present MAEPose, a masked autoencoding-based human pose estimation approach that operates directly on mmWave spectrogram videos. MAEPose learns spatiotemporal motion-aware generalized representations from unlabelled radar video, and leverages its heatmap decoder for multi-frame pose estimation predictions. We evaluate it across three datasets based on leave-one-person-out cross-validation with rigorous statistical testing. MAEPose consistently outperforms state-of-the-art baselines by up to 22.1% in MPJPE (p < 0.05), and maintains robust accuracy under zero-shot bystander interference with only a 6.5% error increase. Ablation studies confirm that both the pre-training and the heatmap decoder contribute substantially, while modality analysis indicates that leveraging Range-Doppler video as input achieves better pose estimation performance than Range-Azimuth or their fusion, with lower computational cost.
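MAE 式预训练的核心是对时空 patch 做高比例随机掩码、只把可见 patch 送入编码器。下面用 numpy 示意频谱图视频的 patch 化与 75% 掩码(帧数、patch 尺寸与掩码率均为假设,非 MAEPose 实现):

```python
import numpy as np

rng = np.random.default_rng(0)

T, H, W, P = 8, 16, 16, 4            # frames, height, width, patch size
video = rng.normal(size=(T, H, W))   # stand-in for a spectrogram video

# Split into T * (H/P) * (W/P) space-time patches.
patches = video.reshape(T, H // P, P, W // P, P).transpose(0, 1, 3, 2, 4)
patches = patches.reshape(-1, P * P)          # (num_patches, patch_dim)

# MAE-style masking: keep 25% of patches; the encoder sees only these,
# and the decoder is trained to reconstruct the masked remainder.
num = patches.shape[0]
keep = int(num * 0.25)
perm = rng.permutation(num)
visible_idx, masked_idx = perm[:keep], perm[keep:]
visible = patches[visible_idx]                # encoder input
```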
[CV-64] Adaptive Geodesic Conformal Prediction for Egocentric Camera Pose Estimation
【速读】:该论文旨在解决增强现实(AR)和辅助设备中自指姿态估计(egocentric pose estimation)的不确定性量化问题,即如何在不重新训练模型的前提下,为预测结果提供可靠的置信区间保证。现有标准共形预测(Conformal Prediction, CP)方法虽能实现整体90%的覆盖率,但在最难的25%帧(Q4)上覆盖不足60%,存在约30个百分点的条件覆盖率差距,且这一现象在多参与者、多预测器和多时间窗口下稳定出现。解决方案的关键在于提出DINOv2-Bridge自适应共形预测(adaptive CP):通过一个两阶段难度估计器,在仅使用单个源参与者数据训练后,无需任何测试时图像即可跨参与者迁移,显著提升最难帧的覆盖率(从~0.75提升至~0.93),同时保持整体覆盖率维持在90%目标水平。该方法还引入了基于SE(3)流形的测地线非一致性评分(geodesic SE(3) nonconformity score),更准确识别物理上更难的帧,从而优化置信区间的分配策略。
链接: https://arxiv.org/abs/2605.00233
作者: Aishani Pathak,Hasti Seifi
机构: Arizona State University (亚利桑那州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Egocentric pose estimation for Augmented Reality (AR) and assistive devices requires not just accurate predictions but guaranteed uncertainty regions. Conformal prediction (CP) provides such guarantees without retraining, but we show that standard CP with a single fixed threshold achieves nominal 90% overall coverage while covering only ~60% of the hardest 25% of frames (Q4) – a ~30 percentage-point conditional coverage gap consistent across 12 participants, 3 predictors, and 3 horizons (108 evaluations) on EPIC-Fields. We further show that a geodesic SE(3) nonconformity score identifies physically harder frames than Euclidean scoring, with only 15-26% Q4 overlap and 2-3x higher ground-truth camera displacement for geodesic Q4 frames. To close the coverage gap, we propose DINOv2-Bridge adaptive CP: a two-stage difficulty estimator trained on a single source participant that transfers cross-participant without any images at test time, improving Q4 coverage from ~0.75 to ~0.93 while maintaining overall coverage at the 90% target.
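分裂共形预测(split conformal)用校准集上非一致性分数的经验分位数作为阈值,从而获得边缘覆盖率保证;论文的非一致性分数取 SE(3) 上的测地距离。下面用 numpy 示意旋转部分的测地距离与共形阈值计算(校准数据为人造序列,非 EPIC-Fields 实测):

```python
import numpy as np

def geodesic_rot_dist(R1, R2):
    """Geodesic distance on SO(3): the angle of the relative rotation."""
    cos = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def conformal_threshold(scores, alpha=0.1):
    """Split-conformal quantile: with n calibration scores, the
    ceil((n+1)(1-alpha))-th smallest score guarantees >= 1-alpha
    marginal coverage for exchangeable test points."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return float(np.sort(scores)[min(k, n) - 1])

def rot_z(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Calibration: nonconformity = geodesic error of the predicted rotation.
errors = [geodesic_rot_dist(np.eye(3), rot_z(t))
          for t in np.linspace(0.01, 0.2, 99)]
q = conformal_threshold(errors, alpha=0.1)   # 90%-coverage radius
```

论文指出这一单一固定阈值在最难帧上条件覆盖不足,其自适应版本按难度估计分配不同阈值;此处仅演示标准分裂共形的基线形式。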
[CV-65] VkSplat: High-Performance 3DGS Training in Vulkan Compute
【速读】:该论文旨在解决现有3D Gaussian Splatting (3DGS) 训练流程在性能和跨厂商GPU兼容性方面的局限性。当前主流方案多依赖CUDA与PyTorch,存在平台绑定、显存占用高及训练效率低的问题。其解决方案的关键在于构建一个完全基于Vulkan计算的高性能3DGS训练流水线——VkSplat,通过深度优化Vulkan内核调度与内存管理策略,在保持图像质量的前提下实现了比CUDA+PyTorch基线快3.3倍的训练速度,并减少33%的显存占用,同时验证了跨不同GPU厂商(如NVIDIA、AMD、Intel)的良好兼容性,成为首个实现顶尖性能的全Vulkan架构3DGS训练系统。
链接: https://arxiv.org/abs/2605.00219
作者: Jingxiang Chen,Mohamed Ibrahim,Yang Liu
机构: Huawei Canada(华为加拿大)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to Eurographics 2026 - Short Papers
Abstract:We present VkSplat, a high-performance, cross-vendor 3D Gaussian Splatting (3DGS) training pipeline implemented fully in Vulkan compute, addressing performance and compatibility limitations of existing training pipelines. With various optimizations, we achieve a 3.3× speedup and a 33% VRAM reduction over a CUDA+PyTorch baseline, maintaining quality, and demonstrating compatibility across GPU vendors. To the best of our knowledge, this is the first fully-Vulkan-based 3DGS training pipeline that achieves state-of-the-art performance. Code: this https URL
[CV-66] FieryGS: In-the-Wild Fire Synthesis with Physics-Integrated Gaussian Splatting ICLR2026
【速读】:该论文旨在解决在真实世界三维场景中合成物理合理且视觉逼真的燃烧效果(combustion effects)的问题。传统计算流体力学(CFD)与图形学流程虽能生成逼真火焰,但依赖手工构建几何结构、专家调参及繁琐工作流,难以规模化应用于真实场景;而近期基于3D高斯泼溅(3D Gaussian Splatting, 3DGS)的场景重建技术虽可实现高保真重建,却缺乏燃烧过程的物理基础。解决方案的关键在于提出FieryGS框架,其核心是将物理准确且用户可控的燃烧模拟与渲染无缝集成进3DGS管线,通过三个紧密耦合模块实现:(1) 基于多模态大语言模型的物理材料推理机制,用于理解场景材质并指导燃烧行为;(2) 高效体积燃烧模拟器,支持复杂燃烧现象如火焰传播、烟雾扩散和表面炭化;(3) 统一渲染器,同步处理火焰与3DGS场景内容。该框架无需人工调参即可自动生成与场景几何和材料一致的可控燃烧动态,显著提升视觉真实感、物理保真度与用户控制能力。
链接: https://arxiv.org/abs/2605.00177
作者: Qianfan Shen,Ningxiao Tao,Qiyu Dai,Tianle Chen,Minghan Qin,Yongjie Zhang,Mengyu Chu,Wenzheng Chen,Baoquan Chen
机构: Peking University (北京大学); ByteDance Seed (字节跳动种子实验室); Beijing Academy of Artificial Intelligence (北京人工智能研究院)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR 2026
Abstract:We consider the problem of synthesizing photorealistic, physically plausible combustion effects in in-the-wild 3D scenes. Traditional CFD and graphics pipelines can produce realistic fire effects but rely on handcrafted geometry, expert-tuned parameters, and labor-intensive workflows, limiting their scalability to the real world. Recent scene modeling advances like 3D Gaussian Splatting (3DGS) enable high-fidelity real-world scene reconstruction, yet lack physical grounding for combustion. To bridge this gap, we propose FieryGS, a physically-based framework that integrates physically-accurate and user-controllable combustion simulation and rendering within the 3DGS pipeline, enabling realistic fire synthesis for real scenes. Our approach tightly couples three key modules: (1) multimodal large-language-model-based physical material reasoning, (2) efficient volumetric combustion simulation, and (3) a unified renderer for fire and 3DGS. By unifying reconstruction, physical reasoning, simulation, and rendering, FieryGS removes manual tuning and automatically generates realistic, controllable fire dynamics consistent with scene geometry and materials. Our framework supports complex combustion phenomena – including flame propagation, smoke dispersion, and surface carbonization – with precise user control over fire intensity, airflow, ignition location and other combustion parameters. Evaluated on diverse indoor and outdoor scenes, FieryGS outperforms all comparative baselines in visual realism, physical fidelity, and controllability. Project page can be found at this https URL.
[CV-67] DPU or GPU for Accelerating Neural Networks Inference – Why not both? Split CNN Inference
【速读】:该论文旨在解决边缘设备上视频与图像流式传输中低延迟需求的问题,传统方法依赖单一硬件单元(如GPU、FPGA或DPU)加速卷积神经网络(Convolutional Neural Networks, CNNs)推理,但难以进一步降低延迟。其关键解决方案是提出一种跨DPU与GPU的CNN推理分割策略(Split CNN Inference),将CNN模型分层部署:初始层由Versal VCK190中的AI引擎(DPU)处理,靠近数据源以减少传输开销;后续层由NVIDIA RTX 2080 GPU异步流水线执行,从而优化整体延迟。此外,论文引入基于图神经网络(Graph Neural Network, GNN)的分区索引预测方法,自动确定最优分割点,准确率达96.27%,最终实现相比纯DPU执行最高2.48倍、相比纯GPU执行最高3.37倍的延迟改善。
链接: https://arxiv.org/abs/2605.00174
作者: Ali Emre Oztas,Mahir Demir,James Garside,Mikel Luj’an
机构: The University of Manchester (曼彻斯特大学)
类目: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video and image streaming on edge devices requires low latency. To address this, Neural Networks (NNs) are widely used, and prior work mainly focuses on accelerating them with single hardware units such as Graphics Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs), and Deep Learning Processing Units (DPUs). However, further reductions in latency can be observed by combining these units. In this paper, partitioning CNN inference across DPU and GPU (Split CNN Inference) is proposed. The first partition runs on the AI engines (DPU) of a Versal VCK190, which consists of initial CNN layers processing the input images. The DPU processes the first partition near the source of the data. Pipelined asynchronously, a GPU runs the remaining layers. The GPU (NVIDIA RTX 2080) processes the second partition, albeit having reduced the data transfer between the data source (storage/camera) and the GPU. Furthermore, a Graph Neural Network (GNN)-based partition index prediction method is proposed to automate the partitioning of CNNs needed for the Split Inference. Well established models such as LeNet-5, ResNet18/50/101/152, VGG16, and MobileNetv2 are analyzed. Results demonstrate up to 2.48x latency improvement over DPU-only execution and up to 3.37x over GPU-only execution. The trained GNN model splits the layers between the appropriate devices with 96.27% accuracy.
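异步流水线执行下,整体吞吐受最慢一级制约,因此分割点应使"DPU 段 + 传输"与"GPU 段"的较大者最小。论文用 GNN 预测分割索引;下面给出一个穷举搜索的简化示意(各层延迟与传输开销均为虚构数值,忽略了实际的量化与批处理细节):

```python
def best_partition(dpu_ms, gpu_ms, xfer_ms):
    """Brute-force the split point for a pipelined DPU->GPU CNN.
    dpu_ms[k] / gpu_ms[k]: latency of layer k on each device;
    xfer_ms[i]: cost of transferring the activation at cut point i
    (first i layers on the DPU; xfer_ms[0] = shipping the raw input).
    With asynchronous pipelining, throughput is bounded by the slower
    stage, so we minimize the per-frame bottleneck latency."""
    n = len(dpu_ms)
    costs = [max(sum(dpu_ms[:i]) + xfer_ms[i], sum(gpu_ms[i:]))
             for i in range(n + 1)]
    best = min(range(n + 1), key=lambda i: costs[i])
    return best, costs[best]

# Toy 4-layer CNN: the DPU is slower per layer but sits next to the data,
# so cutting early shrinks the expensive transfer of large activations.
dpu_ms = [2.0, 2.0, 2.0, 2.0]
gpu_ms = [1.0, 1.0, 1.0, 1.0]
xfer_ms = [8.0, 4.0, 2.0, 1.0, 1.0]
split, bottleneck = best_partition(dpu_ms, gpu_ms, xfer_ms)
```

对这组虚构数值,最优切点(split=1,瓶颈 6 ms)同时优于纯 GPU(8 ms)与纯 DPU(9 ms)执行,定性对应论文报告的混合执行收益。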
[CV-68] From Images2Mesh: A 3D Surface Reconstruction Pipeline for Non-Cooperative Space Objects
【速读】:该论文旨在解决从单目在轨检查图像中实现非合作空间目标的神经隐式表面重建问题,这在主动碎片清除和在轨服务任务规划中至关重要。其关键解决方案在于构建一个端到端的处理流程,首先通过基于分割的背景去除方法提升真实在轨影像中的相机位姿估计成功率(因背景帧间变化会导致直接处理失败),并引入逐帧曝光差异的光度校正技术以优化阴影区域的重建性能,从而显著提升复杂光照条件下重建结果的鲁棒性与准确性。
链接: https://arxiv.org/abs/2605.00147
作者: Bala Prenith Reddy Gopu,Patrick Quinn,George M. Nehma,Madhur Tiwari,Matt Ueckermann,David Hinckley,Christopher McKenna
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 25 Pages, 16 Figures
Abstract:On-orbit inspection imagery is crucial as it enables characterization of non-cooperative resident space objects, providing the geometry and structural condition essential for active debris removal and on-orbit servicing mission planning. However, most existing neural implicit surface reconstruction methods have been confined to synthetic or hardware-in-the-loop data with known camera poses and controlled illumination. In this work, we present a pipeline for neural implicit surface reconstruction of non-cooperative space objects from monocular inspection imagery. We demonstrate it on publicly released ISS inspection footage from the STS-119 mission and publicly released on-orbit inspection footage of an H-IIA rocket upper stage. We find that segmentation-based background removal is essential for successful camera pose estimation from real on-orbit footage, where background variation between frames caused direct processing to fail entirely. We further incorporate photometric correction of per-frame exposure variations and analyze its behavior across datasets, finding that performance in shadowed regions varies with the illumination characteristics of the input footage.
[CV-69] Real-Time Frame- and Event-based Object Detection with Spiking Neural Networks on Edge Neuromorphic Hardware: Design Deployment and Benchmark
【速读】:该论文旨在解决在能源受限平台(如无人机巡检、自主导航和移动机器人)上实现高效实时目标检测的问题。传统人工神经网络(ANNs)虽然性能优异,但在边缘设备上的功耗较高,难以满足低功耗场景的需求。解决方案的关键在于设计适用于类脑计算硬件(如Intel Loihi 2)的脉冲神经网络(SNN)检测架构,并通过蒸馏感知的训练方法(distillation-aware training)将ANN模型的知识迁移至SNN,从而在保持较低推理延迟的同时恢复高达87–100%的检测精度。实验表明,部署于Loihi 2的SNN可在保证实时性的同时实现最低的每帧动态能耗,显著优于基于Jetson Orin Nano、Jetson Nano B01及Apple M2 CPU的ANN方案,凸显了类脑计算系统在边缘智能中的能效优势。
链接: https://arxiv.org/abs/2605.00146
作者: Udayanga G.W.K.N. Gamage,Yan Zeng,Cesar Cadena,Matteo Fumagalli,Silvia Tolu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 figures, 7 tables, 53 pages
Abstract:Real-time object detection on energy-constrained platforms is critical for applications such as UAV-based inspection, autonomous navigation, and mobile robotics. Spiking neural networks (SNNs) on neuromorphic hardware are believed to be significantly more energy-efficient than conventional artificial neural networks (ANNs). In this work, we present a comprehensive methodology for designing general SNN detection architectures targeting neuromorphic platforms, along with the engineering adaptations required to deploy them on the state-of-the-art Neuromorphic processor, Intel Loihi 2. We benchmark SNN-based object detection on Loihi 2 using both frame-based and event-based datasets, comparing performance with ANN-based detection on the NVIDIA Jetson Orin Nano, NVIDIA Jetson Nano B01, and the Apple M2 CPU. Our results show that SNNs on Loihi 2 can perform real-time detection while achieving the lowest per-inference dynamic energy among all platforms. Also, Loihi 2 outperforms the other platforms in terms of power consumption, though ANNs on Jetson Orin Nano achieve higher inference rates. Furthermore, our ANN-to-SNN distillation-aware training enables SNNs to recover 87-100% of the detection accuracy of their ANN counterparts while maintaining lower inference latency; without distillation, SNNs exhibit an 11-27% accuracy drop. These results highlight the potential of neuromorphic systems for energy-efficient, real-time object detection at the edge.
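SNN 的基本计算单元是泄漏积分发放(LIF)神经元:膜电位按泄漏系数衰减并累积输入电流,越过阈值即发放脉冲并复位。下面是单个 LIF 神经元的 numpy 无关纯 Python 示意(泄漏系数、阈值与输入序列均为假设,与 Loihi 2 的具体神经元模型无关):

```python
def lif_run(inputs, tau=0.9, v_th=1.0):
    """Leaky integrate-and-fire neuron: membrane potential decays by
    `tau` each step, integrates the input current, and emits a spike
    (then hard-resets to 0) whenever it crosses threshold `v_th`."""
    v, spikes = 0.0, []
    for x in inputs:
        v = tau * v + x
        if v >= v_th:
            spikes.append(1)
            v = 0.0          # hard reset after a spike
        else:
            spikes.append(0)
    return spikes

spikes = lif_run([0.6, 0.6, 0.0, 0.3, 0.9])
```

事件驱动的稀疏脉冲(多数时刻输出 0)正是类脑硬件低动态能耗的来源。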
[CV-70] GAFSV-Net: A Vision Framework for Online Signature Verification
【速读】:该论文旨在解决在线签名验证(Online Signature Verification, OSV)中因类内变异性高且注册样本极少而导致的鉴别困难问题。现有深度学习方法直接处理原始时间序列,受限于一维架构,无法利用预训练的二维视觉骨干网络。其解决方案的关键在于提出GAFSV-Net,将每条签名表示为六通道非对称Gramian Angular Field (GAF) 图像:三个运动学通道(笔速、压力导数、方向角)分别编码为互补的GASF(Gramian Angular Sum Field)和GADF(Gramian Angular Difference Field)矩阵,以捕获成对时间共现与方向转换结构;采用双分支ConvNeXt-Tiny编码器独立处理GASF与GADF,并通过双向交叉注意力机制使两分支在特征投影前相互查询判别性模式,从而增强表征能力。此设计实现了从一维时序建模到二维时空结构建模的跨越,显著提升了验证性能。
链接: https://arxiv.org/abs/2605.00120
作者: Himanshu Singhal,Suresh Sundaram
机构: Indian Institute of Technology Guwahati (印度理工学院古瓦哈蒂分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:
Abstract:Online signature verification (OSV) requires distinguishing skilled forgeries from genuine samples under high intra-class variability and with very few enrollment samples. Existing deep learning methods operate directly on raw temporal sequences, restricting them to 1D architectures and preventing the use of pretrained 2D vision backbones. We bridge this gap with GAFSV-Net, which represents each signature as a six-channel asymmetric Gramian Angular Field image: three kinematic channels (pen speed, pressure derivative, direction angle) are each encoded into complementary GASF and GADF matrices that capture pairwise temporal co-occurrence and directional transition structure respectively. A dual-branch ConvNeXt-Tiny encoder processes GASF and GADF independently, with bidirectional cross-attention enabling each branch to query discriminative patterns from the other before metric-space projection. Training uses semi-hard triplet loss with skilled-forgery hard-negative injection; verification is performed via cosine similarity against a small enrollment prototype. We evaluate on DeepSignDB and BiosecurID, outperforming all sequence-based baselines trained under identical objectives, demonstrating that the representational gain of 2D temporal encoding is consistent and independent of training procedure, with ablations characterising each design choice’s contribution.
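摘要中的 GASF/GADF 编码可以用几行代码说明:将一维信号缩放到 [-1, 1] 后映射为角度,再分别取“角度和的余弦”与“角度差的正弦”,即得到对称的 GASF 与反对称的 GADF 矩阵。以下为示意实现(函数名为笔者自拟,且假设输入序列非常数;论文实际将三路运动学信号各自编码为互补的两张矩阵,拼成六通道图像):

```python
import math

def gaf(series):
    """Encode a 1D series as Gramian Angular Fields.

    Rescale to [-1, 1], map each value to an angle phi = arccos(x); then
    GASF[i][j] = cos(phi_i + phi_j) captures pairwise temporal co-occurrence
    and GADF[i][j] = sin(phi_i - phi_j) captures directional transitions.
    """
    lo, hi = min(series), max(series)
    x = [2.0 * (v - lo) / (hi - lo) - 1.0 for v in series]  # rescale to [-1, 1]
    phi = [math.acos(max(-1.0, min(1.0, v))) for v in x]    # clamp against rounding
    n = len(x)
    gasf = [[math.cos(phi[i] + phi[j]) for j in range(n)] for i in range(n)]
    gadf = [[math.sin(phi[i] - phi[j]) for j in range(n)] for i in range(n)]
    return gasf, gadf
```

注意 GADF 是反对称矩阵(对角线为零),这正是摘要所称“非对称”通道保留方向转换信息的原因;GASF 则是对称的。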
[CV-71] AIDA-ReID: Adaptive Intermediate Domain Adaptation for Generalizable and Source-Free Person Re-Identification
【速读】:该论文旨在解决跨域行人重识别(Person Re-identification, Re-ID)中因光照、背景、摄像头特性及人群分布差异导致的域偏移问题,尤其针对现有中间域方法(如IDM和IDM++)在源-free或多源场景下受限于固定混合策略与联合源-目标访问的问题。解决方案的关键在于提出自适应中间域适配(Adaptive Intermediate Domain Adaptation, AIDA),即通过模型不确定性与训练稳定性反馈信号动态调节特征混合与正则化强度,构建多源中间域生成器以合成多样化中间表征,并引入伪镜像正则化策略保障域扰动下的身份一致性,从而实现更鲁棒的跨域迁移性能。
链接: https://arxiv.org/abs/2605.00111
作者: Sundas Iqbal,Qing Tian,Danish Ali,Jianping Gou,Weihua Oue
机构: Nanjing University of Information Science and Technology (南京信息工程大学); Wuhan University (武汉大学); Southwest University (西南大学); Guizhou Normal University (贵州师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Person re-identification (Re-ID) aims to match images of the same individual across non-overlapping camera views and remains challenging due to domain shifts caused by variations in illumination, background, camera characteristics, and population distributions. Although supervised models perform well under matched training and testing conditions, their performance degrades significantly when deployed in unseen environments. Existing intermediate domain approaches such as IDM and IDM++ alleviate this gap by constructing bridge feature distributions between domains; however, they rely on fixed mixing strategies and joint source-target access, limiting their applicability to multi-source and source-free settings. To address these limitations, this paper proposes Adaptive Intermediate Domain Adaptation (AIDA), also referred to as Source-Free Multi-Source Intermediate Domain Adaptation (SF-MIDA). The proposed framework treats intermediate-domain learning as a dynamically regulated process, where feature mixing and regularization strength are adaptively controlled using feedback signals derived from model uncertainty and training stability. A multi-source intermediate domain generator synthesizes diverse intermediate representations, while a pseudo-mirror regularization strategy preserves identity consistency under domain perturbations. Extensive experiments across domain generalization and source-free settings demonstrate the effectiveness of the proposed framework.
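AIDA 将中间域特征混合视为一个受反馈调节的过程。下面是该思想的一个玩具版控制律示意:当模型不确定性高于目标水平时降低混合强度,反之升高,并由训练稳定性信号缩放步长(控制形式与常数均为笔者假设,并非论文原公式):

```python
def adapt_mix(lam, uncertainty, stability, target=0.5, rate=0.1):
    """One feedback step for the intermediate-domain mixing strength:
    move lam down when uncertainty exceeds the target level, up otherwise,
    scaled by a training-stability signal; clip to [0, 1]."""
    lam += rate * (target - uncertainty) * stability
    return min(1.0, max(0.0, lam))
```

例如不确定性 0.9 时混合强度被压低,0.1 时被抬高,且始终保持在合法区间内。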
[CV-72] World Model for Robot Learning: A Comprehensive Survey
【速读】:该论文旨在解决机器人学习领域中世界模型(world models)研究文献碎片化的问题,具体表现为不同架构、功能角色和具身应用场景之间的割裂。其解决方案的关键在于从机器人学习的视角出发,系统性地梳理世界模型与机器人策略的耦合机制、其作为强化学习与评估中可学习模拟器的作用,并总结机器人视频世界模型从基于想象的生成向可控、结构化及基础模型规模演进的发展路径。此外,论文还将相关成果延伸至导航与自动驾驶领域,归纳代表性数据集、基准测试和评估协议,从而厘清核心范式、应用方向并指出未来挑战。
链接: https://arxiv.org/abs/2605.00080
作者: Bohan Hou,Gen Li,Jindou Jia,Tuo An,Xinying Guo,Sicong Leng,Haoran Geng,Yanjie Ze,Tatsuya Harada,Philip Torr,Oier Mees,Marc Pollefeys,Zhuang Liu,Jiajun Wu,Pieter Abbeel,Jitendra Malik,Yilun Du,Jianfei Yang
机构: Nanyang Technological University (南洋理工大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 43 pages, 6 figures
Abstract:World models, which are predictive representations of how environments evolve under actions, have become a central component of robot learning. They support policy learning, planning, simulation, evaluation, data generation, and have advanced rapidly with the rise of foundation models and large-scale video generation. However, the literature remains fragmented across architectures, functional roles, and embodied application domains. To address this gap, we present a comprehensive review of world models from a robot-learning perspective. We examine how world models are coupled with robot policies, how they serve as learned simulators for reinforcement learning and evaluation, and how robotic video world models have progressed from imagination-based generation to controllable, structured, and foundation-scale formulations. We further connect these ideas to navigation and autonomous driving, and summarize representative datasets, benchmarks, and evaluation protocols. Overall, this survey systematically reviews the rapidly growing literature on world models for robot learning, clarifies key paradigms and applications, and highlights major challenges and future directions for predictive modeling in embodied agents. To facilitate continued access to newly emerging works, benchmarks, and resources, we will maintain and regularly update the accompanying GitHub repository alongside this survey.
[CV-73] Being-H0.7: A Latent World-Action Model from Egocentric Videos
【速读】:该论文旨在解决视觉-语言-动作模型(Visual-Language-Action models, VLAs)在稀疏动作监督下容易形成“捷径映射”(shortcut mappings)的问题,即模型可能仅学习到与任务无关的视觉特征或静态关联,而非对动力学、接触关系及任务进展的真正理解。解决方案的关键在于提出 Being-H0.7——一种潜空间世界-动作模型(latent world-action model),其核心创新是引入可学习的潜空间查询(learnable latent queries)作为感知与动作之间的紧凑推理接口,并采用双分支训练机制:一个部署可用的先验分支从当前观测中推断潜状态,另一个仅用于训练的后验分支用未来观测嵌入替换潜查询;通过联合对齐两个分支在潜推理空间中的表示,使先验分支能够仅基于当前信息生成未来感知且对动作有用的结构,从而无需生成未来图像帧即可实现未来感知推理。该方法在不进行像素级视频回放的前提下,融合了世界模型的预测优势与直接 VLA 策略的高效性和可部署性。
链接: https://arxiv.org/abs/2605.00078
作者: Hao Luo,Wanpeng Zhang,Yicheng Feng,Sipeng Zheng,Haiweng Xu,Chaoyi Xu,Ziheng Xi,Yuhui Fu,Zongqing Lu
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Visual-Language-Action models (VLAs) have advanced generalist robot control by mapping multimodal observations and language instructions directly to actions, but sparse action supervision often encourages shortcut mappings rather than representations of dynamics, contact, and task progress. Recent world-action models introduce future prediction through video rollouts, yet pixel-space prediction is a costly and indirect substrate for control, as it may model visual details irrelevant to action generation and introduces substantial training or inference overhead. We present Being-H0.7, a latent world-action model that brings future-aware reasoning into VLA-style policies without generating future frames. Being-H0.7 inserts learnable latent queries between perception and action as a compact reasoning interface, and trains them with a future-informed dual-branch design: a deployable prior branch infers latent states from the current context, while a training-only posterior branch replaces the queries with embeddings from future observations. Jointly aligning the two branches at the latent reasoning space leads the prior branch to reason future-aware, action-useful structure from current observations alone. At inference, Being-H0.7 discards the posterior branch and performs no visual rollout. Experiments across six simulation benchmarks and diverse real-world tasks show that Being-H0.7 achieves state-of-the-art or comparable performance, combining the predictive benefits of world models with the efficiency and deployability of direct VLA policies.
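双分支潜空间对齐的核心是:仅训练期可见的后验分支(能看到未来观测)给出潜状态目标,可部署的先验分支(只看当前观测)向其回归。下面用一个线性玩具模型示意这一对齐过程(维度、学习率等均为假设值,与论文的实际架构无关):

```python
import random

def align_prior_to_posterior(obs, fut, dim=4, steps=200, lr=0.5):
    """Toy dual-branch alignment: a fixed 'posterior' projection encodes
    (obs, future) into a latent target; the deployable 'prior' sees only
    obs and is regressed toward that target by gradient descent.
    Returns the per-step alignment losses."""
    random.seed(0)
    n_in = len(obs) + len(fut)
    V = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(dim)]  # posterior
    W = [[0.0] * len(obs) for _ in range(dim)]                          # prior (learned)
    target = [sum(v * x for v, x in zip(row, obs + fut)) for row in V]
    losses = []
    for _ in range(steps):
        z = [sum(w * x for w, x in zip(row, obs)) for row in W]
        losses.append(sum((a - b) ** 2 for a, b in zip(z, target)))
        for k in range(dim):  # gradient step on 0.5 * ||z - target||^2
            for j in range(len(obs)):
                W[k][j] -= lr * (z[k] - target[k]) * obs[j]
    return losses
```

这个例子展示的只是“先验向后验对齐”这一机制本身:对齐损失迅速收敛到近零,而推理时(对应论文的部署阶段)只需先验分支。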
[CV-74] Two-View Accumulation as the Primary Training Lever for Hybrid-Capture Gaussian Splatting: A Variance-Decomposition View of When Gradient Surgery Helps
【速读】:该论文旨在解决混合采集视角(hybrid-capture)下的3D高斯泼溅(3D Gaussian Splatting, 3DGS)模型在训练过程中对少数相机距离模式(如航拍与地面视角)拟合不足的问题,表现为在五个基准测试中PSNR下降1–3 dB。其关键解决方案是:将每轮优化器步骤中的渲染视图数量从单视图增加至双视图(two views per optimizer step),这一结构性改变显著提升了性能,且无需复杂的梯度调控策略(如方向感知梯度手术、置信度门控或主动损失差异配对)。作者通过方差分解框架证明,在双模态相机分布下,跨模式梯度方差相对较小,因此不同配对规则(几何近/远、随机或主动损失差异)的性能差异可忽略,而双视图累积带来的方差减半效应才是主导因素。此方法在Scaffold-GS和Pixel-GS等骨干网络上均具迁移性,为混合采集场景下的3DGS训练提供了简洁有效的优化路径。
链接: https://arxiv.org/abs/2605.00052
作者: Sungjun Cho
机构: The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Hybrid-capture novel view synthesis combines images at substantially different camera distances (e.g., aerial drone and ground-level views). Standard 3D Gaussian Splatting (3DGS), trained for 30K iterations with one rendered view per optimizer step, under-fits the minority regime by 1-3 dB on five hybrid-capture benchmarks. We isolate the lever that closes this gap. Among compute-matched alternatives – vanilla 60K iterations, magnitude corrections (GradNorm), direction-aware near/far gradient surgery, projective preconditioning, confidence-gated sample-level surgery, and a random two-view-per-step control – the simplest structural change wins: rendering two views per optimizer step. The pairing rule (geometry-defined near/far, random, or active loss-disparity) does not change PSNR beyond seed variance on any of the five scenes; the structural change of having two views per step does. We propose a variance-decomposition framework that predicts and explains this finding: under bimodal camera regimes, between-regime gradient variance turns out to be small relative to within-regime variance in 3DGS, so structured and random pairings are variance-equivalent in expectation, and the variance halving from two-view accumulation itself is the dominant effect. We verify the framework on five scenes whose camera-altitude bimodality coefficients span [0.55, 1.00], and we report the negative result that direction-aware projection, magnitude correction, confidence gating, and an active loss-disparity pairing all fall within seed variance of random two-view pairing. The two-view structural lever transfers cleanly to the Scaffold-GS and Pixel-GS backbones. We position this work as an honest characterization of which training-side axes do and do not move PSNR for hybrid-capture 3DGS, together with the framework that explains why. 
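论文的核心论断是“每步渲染两视图”带来的方差减半效应占主导。这一点可以用蒙特卡洛实验直接验证:把每步梯度看作均值附近的噪声采样,两视图累积相当于对两个独立样本取平均,方差约降为一半(以下仅为统计示意,与 3DGS 本身无关,常数均为假设):

```python
import random

def estimator_variance(pair=False, trials=20000, seed=0):
    """Monte-Carlo check of the variance argument: a per-step 'gradient' is
    a noisy sample g = mu + noise with unit variance; accumulating two views
    per step averages two independent samples, roughly halving the variance."""
    rng = random.Random(seed)
    mu = 1.0
    samples = []
    for _ in range(trials):
        g = mu + rng.gauss(0, 1)
        if pair:
            g = 0.5 * (g + mu + rng.gauss(0, 1))  # two-view accumulation
        samples.append(g)
    m = sum(samples) / trials
    return sum((s - m) ** 2 for s in samples) / trials

v_single = estimator_variance(pair=False)  # close to 1.0
v_paired = estimator_variance(pair=True)   # close to 0.5
```

这正对应论文的解释:配对规则(近/远、随机、主动)改变的是“跨模式方差”项,而真正起作用的是两样本平均带来的整体方差折半。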
[CV-75] Learning from the Unseen: Generative Data Augmentation for Geometric-Semantic Accident Anticipation
【速读】:该论文旨在解决自动驾驶中交通事故预测这一关键且尚未解决的问题,其核心挑战在于建模道路使用者之间的复杂交互关系,以及缺乏多样且大规模的真实数据集。解决方案的关键在于提出一种双路径框架:一方面,利用受结构化提示引导的视频合成流程,从现有语料库中提取特征分布并生成符合真实数据统计规律的高保真合成驾驶场景;另一方面,设计一种融合语义线索的图神经网络(Graph Neural Network, GNN),以实现对参与者之间空间与语义关系的动态推理。该方法有效缓解了数据瓶颈问题,并显著提升了预测精度和提前量,从而增强了自动驾驶系统的可靠性。
链接: https://arxiv.org/abs/2605.00051
作者: Yanchen Guan,Haicheng Liao,Chengyue Wang,Xingcheng Liu,Jiaxun Zhang,Keqiang Li,Zhenning Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Anticipating traffic accidents is a critical yet unresolved problem for autonomous driving, hindered by the inherent complexity of modeling interactions between road users and the limited availability of diverse, large-scale datasets. To address these issues, we propose a dual-path framework. On the one hand, we employ a video synthesis pipeline that, guided by structured prompts, derives feature distributions from existing corpora and produces high-fidelity synthetic driving scenes consistent with the statistical patterns of real data. On the other hand, we design a graph neural network enriched with semantic cues, enabling dynamic reasoning over both spatial and semantic relations among participants. To validate the effectiveness of our approach, we release a new benchmark dataset containing standardized, finely annotated video sequences that cover a broad spectrum of regions, weather, and traffic conditions. Evaluations across existing datasets and our new benchmark confirm notable gains in both accuracy and anticipation lead time, highlighting the capacity of the proposed framework to mitigate current data bottlenecks and enhance the reliability of autonomous driving systems.
[CV-76] Learning physically grounded traffic accident reconstruction from public accident reports
【速读】:该论文旨在解决交通事故重建中因缺乏详细场景测量和专家重构数据而导致的物理基础建模困难问题,其核心挑战在于如何从公开的事故文本报告和有限的现场测量中实现可量化的、高保真的事故重建。解决方案的关键在于提出了一种参数化多模态学习框架,通过构建包含6,217个真实事故案例的CISS-REC数据集,将报告语义与道路拓扑结构及参与方属性进行对齐,并结合车道一致性预碰撞运动重建、局部几何推理与时间分配机制,实现碰撞相关交互的精细化修正,从而显著提升事故点定位精度与碰撞逻辑一致性,验证了公共事故报告作为可扩展计算基底在交通安全分析与自动驾驶研究中的潜力。
链接: https://arxiv.org/abs/2605.00050
作者: Yanchen Guan,Haicheng Liao,Chengyue Wang,Zhenning Li
机构: University of Macau (澳门大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Traffic accidents are routinely documented in textual reports, yet physically grounded accident reconstruction remains difficult because detailed scene measurements and expert reconstructions are scarce, costly and hard to scale. Here we formulate accident reconstruction from publicly accessible reports and scene measurements as a parameterized multimodal learning problem. We construct CISS-REC, a dataset of 6,217 real-world accident cases curated from the NHTSA Crash Investigation Sampling System, and develop a reconstruction framework that grounds report semantics to road topology and participant attributes, reconstructs lane consistent pre-impact motion, and refines collision relevant interactions through localized geometric reasoning and temporal allocation. Our method outperforms representative baselines on CISS-REC, achieving the strongest overall reconstruction fidelity, including improved accident point accuracy and collision consistency. These results show that public accident reports can serve as scalable computational substrates for quantitatively verifiable accident reconstruction, with potential value for traffic safety analysis, simulation and autonomous driving research.
[CV-77] Unsupervised Denoising of Real Clinical Low Dose Liver CT with Perceptual Attention Networks
【速读】:该论文旨在解决低剂量计算机断层扫描(Low-dose Computed Tomography, LD-CT)图像中噪声干扰视觉判读和诊断准确性的问题。其关键解决方案是提出一种端到端的无监督深度学习框架,该框架融合U-Net结构以实现多尺度特征提取、注意力机制用于特征融合以及残差网络进行特征变换,并引入感知损失(perceptual loss)以适应医学图像特性。该方法无需依赖配对的高质量标注数据,即可在真实临床LD-CT数据上实现优异去噪效果,克服了传统监督学习对标注数据的依赖限制,同时通过影像科医生的专业评估验证了其临床适用性。
链接: https://arxiv.org/abs/2605.00793
作者: Jingxi Pu,Tonghua Liu,Zhilin Guan,Siqiao Li,Yang Ming,Zheng Cong,Wei Zhang,Fangwei Li
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 10 figures, 5 tables
Abstract:With the development of deep learning, medical image processing has been widely used to assist clinical research. This paper focuses on the denoising problem of low-dose computed tomography using deep learning. Although low-dose computed tomography reduces radiation exposure to patients, it also introduces more noise, which may interfere with visual interpretation by physicians and affect diagnostic results. To address this problem, inspired by Cycle-GAN for unsupervised learning, this paper proposes an end-to-end unsupervised low-dose computed tomography denoising framework. The proposed framework combines a U-Net structure for multi-scale feature extraction, an attention mechanism for feature fusion, and a residual network for feature transformation. It also introduces perceptual loss to improve the network for the characteristics of medical images. In addition, we construct a real low-dose computed tomography dataset and design a large number of comparative experiments to validate the proposed method, using both image-based evaluation metrics and medical evaluation criteria. Compared with classical methods, the main advantage of this paper is that it addresses the limitation that real clinical data cannot be directly used for supervised learning, while still achieving excellent performance. The experimental results are also professionally evaluated by imaging physicians and meet clinical needs.
[CV-78] Multi-frame Restoration for High-rate Lissajous Confocal Laser Endomicroscopy
【速读】:该论文旨在解决高帧率Lissajous扫描共聚焦激光内窥成像(Lissajous confocal laser endomicroscopy, CLE)中因共振轨迹导致的像素采样不完整问题,即在高速成像时大量像素未被访问,形成结构化空洞,从而影响图像质量。解决方案的关键在于提出首个针对高帧率Lissajous CLE的基准数据集,该数据集包含低质量视频片段与高质量参考图像(通过稳定慢扫帧拼接获得的宽视场图像),实现时间对齐的监督;并设计轻量级递归框架MIRA,通过特征复用和位移对齐迭代聚合时序上下文信息,实现高效且高质量的图像恢复,兼顾临床部署所需的计算效率与重建性能。
链接: https://arxiv.org/abs/2605.00527
作者: Minhee Lee,Sangyoon Lee,Jiwook Lee,Minki Hong,Kyuyoung Kim,Wonhwa Kim,Jaeho Lee
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Lissajous confocal laser endomicroscopy (CLE) is a promising solution for high speed in vivo optical biopsy for handheld scenarios. However, Lissajous scanning traces a resonant trajectory and samples only the visited pixels per frame; at high frame rates, many pixels remain unvisited, creating structured holes. In this work, we introduce the first benchmark for high-rate Lissajous CLE, consisting of low-quality video clips paired with high-quality reference images. The reference images are wide-FOV mosaics obtained by stitching stabilized, slow-scan frames of the same tissue, enabling temporally aligned supervision. Using this dataset, we propose MIRA, a lightweight recurrent framework for Lissajous CLE restoration that iteratively aggregates temporal context through feature reuse and displacement alignment. Our experiments demonstrate that MIRA outperforms both lightweight and high-complexity baselines in restoration quality while maintaining a favorable computational efficiency suitable for clinical deployment.
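摘要所述的“结构化空洞”源于 Lissajous 轨迹每帧只访问部分像素:采样点越少,覆盖率越低;而且即便采样充分,曲线本身也无法填满整幅图像。下面的示意代码统计一帧扫描的像素覆盖率(频率、相位与网格大小均为假设值,并非实际内窥镜参数):

```python
import math

def lissajous_coverage(fx, fy, n_samples, grid=64):
    """Fraction of grid pixels visited by one frame of a Lissajous scan
    x = sin(2*pi*fx*t), y = sin(2*pi*fy*t + pi/2), sampled n_samples times
    over t in [0, 1). Fewer samples per frame (higher frame rate) means
    lower coverage, i.e. more structured holes."""
    visited = set()
    for k in range(n_samples):
        t = k / n_samples
        x = math.sin(2 * math.pi * fx * t)
        y = math.sin(2 * math.pi * fy * t + math.pi / 2)
        i = min(grid - 1, int((x + 1) / 2 * grid))  # map [-1, 1] to pixel index
        j = min(grid - 1, int((y + 1) / 2 * grid))
        visited.add((i, j))
    return len(visited) / (grid * grid)
```

用较少采样点模拟高帧率,覆盖率明显下降;即使采样极密,覆盖率也达不到 1,这正是需要多帧恢复(如 MIRA 的时序聚合)的原因。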
[CV-79] Combined Dictionary Unfolding Network with Gradient-Adaptive Fidelity for Transferable Multi-Source Fusion
【速读】:该论文旨在解决现有基于深度展开网络(Deep Unfolding Network)的多源图像融合方法在资源受限边缘设备上部署效率低的问题。这些问题主要源于现有方法大多基于交替最小化策略,导致特征更新过程分离且计算与内存开销较大。其解决方案的关键在于提出一种轻量级联合字典展开网络(CDNet),通过将耦合字典学习中的“独特-公共”分解先验转化为结构约束的联合展开架构,设计出具有块稀疏交互拓扑的CDBlock模块,实现对共同特征和模态特有特征的模型驱动联合更新,从而简化特征学习流程并提升效率。此外,还引入了一种紧凑的高低频图像保真度损失函数,支持无监督训练,进一步增强了实用性与性能表现。
链接: https://arxiv.org/abs/2605.00461
作者: Ge Luo,Jun-Jie Huang,Qi Yu,Tianrui Liu,Ke Liang,Yuming Xiang,Wentao Zhao,Xinwang Liu,Meng Wang
机构: National University of Defense Technology (国防科技大学); Tongji University (同济大学); Hefei University of Technology (合肥工业大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deep Unfolding Network-based methods have emerged as effective solutions for multi-source image fusion by combining model-driven iterative optimization with data-driven deep learning. However, most existing deep unfolding image fusion methods are derived from alternating minimization, which updates the features of different modalities separately. This design introduces considerable computational and memory overhead, limiting deployment on resource-constrained edge devices. To address this issue, we propose CDNet, a lightweight Combined Dictionary Unfolding Network for multi-source image fusion. Rather than introducing a new sparse coding prior or empirically compressing an existing fusion network, CDNet translates the unique-common decomposition prior of coupled dictionary learning into a structurally constrained joint unfolding architecture. The resulting CDBlock follows a block-sparse interaction topology and performs a model-derived joint update of common and modality-specific representations, thereby streamlining feature learning and improving efficiency. In addition, we design a compact High- and Low-frequency Image Fidelity loss for unsupervised training without ground-truth images. We evaluate CDNet on four tasks, including multi-exposure image fusion, infrared and visible image fusion, medical image fusion, and infrared and visible image fusion for semantic segmentation. Experimental results show that CDNet achieves competitive or superior fusion performance with high efficiency. For infrared and visible image fusion, CDNet outperforms competing methods on four of six metrics on the TNO dataset and five of six metrics on the RoadScene dataset. In particular, it surpasses the second-best method by 1.23 dB and 1.59 dB in PSNR on TNO and RoadScene, respectively.
[CV-80] Elimination Templates in Macaulay2
【速读】:该论文旨在解决如何自动构建针对依赖于代数独立参数的零维根理想族(zero-dimensional radical ideals)的求解器问题。其解决方案的关键在于引入了一种名为“消去模板”(elimination templates)的新方法,该方法通过系统化构造可自动应用于特定理想族的消去过程,从而实现对参数化系统的高效求解。这些模板具备良好的专化性质(specialization properties),能够在参数取具体值时保持计算的正确性与稳定性,特别适用于计算机视觉等实际应用中的符号计算需求。
链接: https://arxiv.org/abs/2605.00278
作者: Manav Batavia,Cheng Chen,Anna Natalie Chlopecki,Timothy Duff,William Huang,Aolong Li,Wanchun Shen
机构: 未知
类目: Commutative Algebra (math.AC); Computer Vision and Pattern Recognition (cs.CV); Mathematical Software (cs.MS)
备注: 13 pages with references
Abstract:We introduce the package \texttt{EliminationTemplates} for the Macaulay2 computer algebra system, which provides tools for constructing automatic solvers for families of zero-dimensional radical ideals depending on algebraically independent parameters. This article provides a self-contained description of how elimination templates are constructed for such families and their specialization properties. Additionally, we describe the main functionality and datatypes provided by our package, and illustrate its usage on several examples, including applications from computer vision from which elimination templates originated.
[CV-81] Information-geometric adaptive sampling for graph diffusion ICML2026
【速读】:该论文旨在解决标准图生成扩散模型中采用均匀时间步长(uniform time-stepping)所带来的局限性,即忽略了复杂流形上分布演化过程的非齐次动态特性。其解决方案的关键在于提出一个信息几何框架,将扩散采样轨迹重新诠释为黎曼流形上的参数曲线,并利用Fisher-Rao度量作为内在距离的合理度量;由此推导出几何感知的漂移变分分数(Drift Variation Score, DVS),该指标量化了分布变化的瞬时速率。DVS求解器通过在统计流形上强制恒定的信息速度,自动维持采样轨迹上分布变化的均匀速率,从而实现等弧长策略,确保每一步离散化对信息速度的贡献相等。理论分析表明,DVS刻画了Fisher-Rao意义下采样动力学的局部刚度,实验验证其在分子和社交网络生成任务中显著提升结构保真度与采样效率。
链接: https://arxiv.org/abs/2605.00250
作者: Yuhui Lu,Wenjing Liu,Kun Zhan
机构: 未知
类目: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to ICML 2026!
Abstract:Standard diffusion models for graph generation typically rely on uniform time-stepping, an approach that overlooks the non-homogeneous dynamics of distributional evolution on complex manifolds. In this paper, we present an information-geometric framework that reinterprets the diffusion sampling trajectory as a parametric curve on a Riemannian manifold. Our key observation is that the Fisher-Rao metric provides a principled measure of the intrinsic distance. By analyzing this metric, we derive the Drift Variation Score (DVS), a geometry-aware indicator that quantifies the instantaneous rate of distributional change. Unlike prior heuristic-based adaptive samplers, our DVS solver enforces a constant informational speed on the statistical manifold, automatically maintaining a uniform rate of distributional change along the sampling trajectory. This equal arc-length strategy ensures that each discretization step contributes equally to the information speed. Theoretical analysis verifies that DVS characterizes the local stiffness of the sampling dynamics in the Fisher-Rao sense. Experimental results on molecule and social network generation show that DVS significantly improves structural fidelity and sampling efficiency. Code is at this https URL
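DVS 求解器的“等弧长”思想可以抽象为:给定某个正的速度函数(论文中为 Fisher-Rao 意义下的漂移变分分数),选取时间步使每步承载相等的累计速度积分。下面给出这一步长选取过程的通用数值示意(梯形积分加线性反插值;与论文的图扩散实现无关):

```python
def equal_speed_steps(speed, n_steps, resolution=10000):
    """Pick time steps 0 = t_0 < ... < t_n = 1 so that each step carries an
    equal share of the integrated 'informational speed'. `speed` is any
    positive rate function on [0, 1], standing in for the drift variation
    score; uniform speed recovers uniform time-stepping."""
    ts = [k / resolution for k in range(resolution + 1)]
    cum = [0.0]
    for a, b in zip(ts, ts[1:]):  # trapezoid cumulative integral of speed
        cum.append(cum[-1] + 0.5 * (speed(a) + speed(b)) * (b - a))
    total = cum[-1]
    steps, idx = [0.0], 0
    for k in range(1, n_steps):
        target = total * k / n_steps          # equal arc-length target
        while cum[idx + 1] < target:
            idx += 1
        frac = (target - cum[idx]) / (cum[idx + 1] - cum[idx])
        steps.append(ts[idx] + frac * (ts[idx + 1] - ts[idx]))
    steps.append(1.0)
    return steps
```

例如 speed(t) = 2t(累计积分为 t²)时,4 步的分割点应落在 sqrt(k/4) 处:速度越快的区段,步长自动越密。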
[CV-82] Broadband Wide Field of View Imaging with Computational Mirrors
【速读】:该论文旨在解决传统折射式光学系统在宽谱段(可见光至短波红外,VIS-SWIR,400–1700 nm)成像中难以同时聚焦的问题,以及简单凹面反射镜因场曲和离轴像差导致的成像质量下降问题。其解决方案的关键在于提出“计算镜片”(Computational Mirrors)框架,通过采集少量焦距堆栈(2–4张图像),利用基于物理模型的计算后端恢复全焦点图像;其中核心创新是SeidelConv模型——一种新颖的、空间变化的点扩散函数(PSF)建模方法,能够精确刻画并校正凹面镜固有的离轴像差,从而实现单传感器下跨全VIS-SWIR波段的高分辨率、大视场成像。
链接: https://arxiv.org/abs/2605.00029
作者: Vishwanath Saragadam,Niki Nezakati,Amit Roy-Chowdhury,Vivek Boominathan
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
备注:
Abstract:Traditional glass-based optics are typically optimized for narrow spectral bands, such as the visible (400-700nm) or shortwave infrared (1000-1800nm). While the emergence of VIS-SWIR sensors (400-1700nm) offers transformative potential, refractive optics struggle to focus this entire range simultaneously. Mirrors represent a promising achromatic alternative; however, they are often sidelined by field curvature, and off-axis aberrations. This paper introduces Computational Mirrors, a framework that enables high-resolution, wide-field-of-view imaging across the complete VIS-SWIR spectrum using a single sensor. Our method is built on the observation that distinct regions of the field of view reach focus at varying distances from the mirror. By capturing a minimal focal stack (2-4 images), we utilize a computational backend to recover a sharp, all-in-focus image. A key contribution of this work is SeidelConv, a novel, physics-inspired, spatially-varying point spread function (PSF) model designed to accurately characterize and correct the off-axis aberrations inherent in simple concave mirrors. We demonstrate the efficacy of our approach using a first-of-its-kind 50mm F/1 optical system equipped with a VIS-SWIR sensor. Our system produces sharp images across RGB, NIR, and SWIR wavelengths without requiring refocusing, revealing material details invisible within individual spectral bands. We further validate the scalability of our approach with a 100mm F/2 system optimized for long-range imaging.
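从焦距堆栈恢复全焦点图像的通用思路是逐像素选取“最清晰”的那一帧。以下用局部拉普拉斯幅值作为清晰度度量给出示意(这只是通用的 focal stacking 启发式;论文的计算后端基于 SeidelConv 的空间变化 PSF 模型,并非此方法):

```python
def all_in_focus(stack):
    """Merge a focal stack (list of 2D images as nested lists) by picking,
    per pixel, the frame with the highest local sharpness, measured as the
    absolute 4-neighbour Laplacian. Border pixels fall back to the first frame."""
    h, w = len(stack[0]), len(stack[0][0])

    def sharp(img, i, j):
        if 0 < i < h - 1 and 0 < j < w - 1:
            return abs(4 * img[i][j] - img[i - 1][j] - img[i + 1][j]
                       - img[i][j - 1] - img[i][j + 1])
        return 0.0

    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            best = max(stack, key=lambda img: sharp(img, i, j))
            out[i][j] = best[i][j]
    return out
```

在一个小例子中,含有局部高对比结构的帧会在该像素处胜出,其余像素退回堆栈首帧。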
[CV-83] TimeRFT: Stimulating Generalizable Time Series Forecasting for TSFMs via Reinforcement Finetuning
【速读】:该论文旨在解决时间序列基础模型(Time Series Foundation Models, TSFMs)在下游任务适配过程中面临的两大挑战:一是时间序列数据的非平稳性和不确定性导致训练与测试阶段存在不可避免的时间分布偏移,而现有的监督微调(Supervised Fine-Tuning, SFT)方法易过拟合、损害泛化能力;二是不同预测任务的数据可用性差异要求模型在多种数据规模下仍具备良好泛化性能。解决方案的关键在于提出一种新的时间序列强化微调(Time series Reinforcement Finetuning, TimeRFT)范式,其核心包含两个任务特定的训练策略:一是基于预测质量的时序奖励机制,从多维度评估每个预测步骤对整体精度的贡献;二是基于预测难度的数据选择策略,用于识别具有可泛化预测模式和信息丰富信号的时间序列样本。实验表明,TimeRFT在多种真实场景和数据条件下均显著优于SFT方法,提升了预测准确率并增强了对未知分布偏移的鲁棒性。
链接: https://arxiv.org/abs/2605.00015
作者: Siyang Li,Yize Chen,Zijie Zhu,Yuxin Pan,Yan Guo,Ming Huang,Hui Xiong
机构: HKUST(GZ)(香港科技大学(广州)); University of Alberta(阿尔伯塔大学); Alibaba Cloud(阿里云); City University of Hong Kong(香港城市大学); HKUST(香港科技大学)
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 14 pages, 6 figures, In Submission
Abstract:Time Series Foundation Models (TSFMs) advance generalization and data efficiency in time series forecasting by unified large-scale pretraining. But TSFMs remain lacking when adapting to specific downstream forecasting tasks for two reasons. First, the non-stationary and uncertain nature of time series data lead to inevitable temporal distribution shifts between historical training and future testing data, while current Supervised FineTuning (SFT)-based methods are prone to overfitting and may degrade generalization. Second, training data availability varies across forecasting tasks, requiring TSFMs to generalize well under diverse data regimes. To address these challenges, we introduce the Time series Reinforcement Finetuning (TimeRFT) paradigm for TSFM downstream adaptation, which consists of two task-specific training recipes: i) A forecasting quality-based temporal reward mechanism that conducts a multi-faceted evaluation of the contribution of each prediction step to overall forecasting accuracy. ii) A forecasting difficulty-based data selection strategy to identify time series samples with generalizable predictive patterns and informative training signals. Extensive experiments demonstrate TimeRFT can consistently outperform SFT-based adaptation methods across various real-world forecasting tasks and training data regimes, enhancing prediction accuracy and generalization against unforeseen distribution shifts.
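TimeRFT 的两个配方(逐步奖励与基于难度的数据选择)可以用如下玩具代码表达其大意:每个预测步按自身误差计分,而整体奖励处于中间区间的样本被视为信号最丰富(奖励形式 1/(1+|e|) 与阈值均为笔者假设,并非论文原设计):

```python
def stepwise_reward(pred, target):
    """Per-step forecasting reward in (0, 1]; a step scores 1 only when exact."""
    return [1.0 / (1.0 + abs(p - t)) for p, t in zip(pred, target)]

def select_by_difficulty(samples, lo=0.2, hi=0.8):
    """Keep (pred, target) samples whose mean reward falls in a mid band:
    trivially easy or hopelessly hard series carry little training signal.
    Thresholds are illustrative."""
    kept = []
    for pred, target in samples:
        r = stepwise_reward(pred, target)
        if lo <= sum(r) / len(r) <= hi:
            kept.append((pred, target))
    return kept
```

这样,完美预测(奖励恒为 1)与完全失败的样本都会被过滤掉,只保留“可学”的中等难度样本。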
人工智能
[AI-0] GeoContra: From Fluent GIS Code to Verifiable Spatial Analysis with Geography-Grounded Repair
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的地理信息系统(GIS)工具在生成空间分析脚本时缺乏对坐标语义、拓扑结构、单位一致性及地理合理性等关键空间规则的强制约束问题,导致生成代码虽语法正确但可能产出地理上无效的结果。解决方案的核心在于提出GeoContra框架,其通过将每个地理任务建模为可执行的“地理合约”(geospatial contract),明确包含自然语言问题、坐标参考系统(CRS)元数据、预期输出、空间谓词、拓扑关系、度量指标和操作限制,并结合静态规则检查、运行时验证与语义验证机制,在闭环修复流程中自动识别并修正违反地理约束的代码片段,从而显著提升空间分析结果的正确性与可靠性。
链接: https://arxiv.org/abs/2605.00782
作者: Yinhao Xiao,Rongbo Xiao,Yihan Zhang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Reliable spatial analysis in GIScience requires preserving coordinate semantics, topology, units, and geographic plausibility. Current LLM-based GIS systems generate fluent scripts but rarely enforce these geographic rules at scale. We present GeoContra, a verification and repair framework for LLM-driven Python GIS workflows. It represents each task as an executable geospatial contract-including natural-language questions, schemas, CRS metadata, expected outputs, spatial predicates, topology, metrics, required operations, and forbidden shortcuts. Generated programs undergo static rule inspection, runtime validation, and semantic verification, with violations fed back into a bounded repair loop. Evaluated on 7,079 real geospatial tasks across 15 Boston-area zones, 9 task families, and 11 open-source models (600 runs each), GeoContra improves spatial correctness on closed models from 47.6% to 77.5% for DeepSeek-V4 and from 57.7% to 81.5% for Kimi-K2.5. Across 11 open models, average correctness rises by 26.6%. GeoContra turns fluent code production into verifiable spatial analysis, catching negative travel times, CRS/field-schema violations, missing predicates, and brittle output casts that otherwise yield executable but geographically invalid results.
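“地理合约”的关键是把 CRS、必需字段、空间合理性等约束写成可执行检查。下面是一个极简的合约校验器示意,演示摘要中提到的 CRS 不匹配、字段缺失与负旅行时间等违规的捕获(字段名均为笔者虚构;真实框架还包含静态规则检查、运行时验证与语义验证以及修复闭环):

```python
def check_contract(result, contract):
    """Validate a (hypothetical) analysis result dict against a minimal
    geospatial contract: CRS metadata must match, required output fields
    must be present, and travel times must be non-negative.
    Returns a list of violation tags (empty means the contract holds)."""
    violations = []
    if result.get("crs") != contract["crs"]:
        violations.append("crs-mismatch")
    for field in contract["required_fields"]:
        if field not in result:
            violations.append(f"missing-field:{field}")
    for t in result.get("travel_times_min", []):
        if t < 0:
            violations.append("negative-travel-time")
    return violations
```

在完整系统中,这样的违规列表会被反馈给模型进入有界修复循环,而不是直接交付“可执行但地理上无效”的结果。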
[AI-1] Position: agentic AI orchestration should be Bayes-consistent ICML2026
Quick read: This position paper asks how agentic AI systems can make reliable decisions under uncertainty in high-value deployments, for example which tool to call, which expert to consult, or how many resources to allocate. LLMs excel at prediction and complex reasoning but lack an explicit belief-updating mechanism for uncertain decisions. The key proposal is to apply Bayesian decision theory at the control layer of the agentic system (the layer that orchestrates LLMs and tools), rather than turning the LLM itself into a Bayesian belief-updating engine. This design lets the system maintain beliefs over task-relevant latent quantities, update those beliefs continually from agentic and human-AI interactions, and choose actions via utility-aware policies, yielding more reliable, interpretable, and adaptive decision-making.
Link: https://arxiv.org/abs/2605.00742
Authors: Theodore Papamarkou,Pierre Alquier,Matthias Bauer,Wray Buntine,Andrew Davison,Gintare Karolina Dziugaite,Maurizio Filippone,Andrew Y. K. Foong,Vincent Fortuin,Dimitris Fouskakis,Jes Frellsen,Eyke Hüllermeier,Theofanis Karaletsos,Mohammad Emtiyaz Khan,Nikita Kotelevskii,Salem Lahlou,Yingzhen Li,Fang Liu,Clare Lyle,Thomas Möllenhoff,Konstantina Palla,Maxim Panov,Yusuf Sale,Kajetan Schweighofer,Artem Shelmanov,Siddharth Swaroop,Martin Trapp,Willem Waegeman,Andrew Gordon Wilson,Alexey Zaytsev
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted for publication at ICML 2026
Abstract:LLMs excel at predictive tasks and complex reasoning tasks, but many high-value deployments rely on decisions under uncertainty, for example, which tool to call, which expert to consult, or how many resources to invest. While the usefulness and feasibility of Bayesian approaches remain unclear for LLM inference, this position paper argues that the control layer of an agentic AI system (that orchestrates LLMs and tools) is a clear case where Bayesian principles should shine. Bayesian decision theory provides a framework for agentic systems that can help to maintain beliefs over task-relevant latent quantities, to update these beliefs from observed agentic and human-AI interactions, and to choose actions. Making LLMs themselves explicitly Bayesian belief-updating engines remains computationally intensive and conceptually nontrivial as a general modeling target. In contrast, this paper argues that coherent decision-making requires Bayesian principles at the orchestration level of the agentic system, not necessarily the LLM agent parameters. This paper articulates practical properties for Bayesian control that fit modern agentic AI systems and human-AI collaboration, and provides concrete examples and design patterns to illustrate how calibrated beliefs and utility-aware policies can improve agentic AI orchestration.
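The orchestration-level Bayesian loop the paper advocates can be sketched in a few lines. This is our own toy illustration under simple assumptions (Beta-Bernoulli beliefs over each tool's success rate, a known per-call cost), not a design from the paper:

```python
# Toy Bayesian orchestration: maintain Beta posteriors over each tool's
# success rate, update them from observed outcomes, and pick the tool
# maximizing posterior expected net utility.
def posterior_mean(successes, failures):
    # Beta(1 + s, 1 + f) posterior mean under a uniform Beta(1, 1) prior.
    return (1 + successes) / (2 + successes + failures)

def choose_tool(history, utility, cost):
    """history: {tool: (successes, failures)}; pick argmax expected net utility."""
    def score(tool):
        s, f = history[tool]
        return posterior_mean(s, f) * utility - cost[tool]
    return max(history, key=score)
```

After each call, the controller would increment the chosen tool's success or failure count, so beliefs sharpen as interactions accumulate.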
[AI-2] To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling
Quick read: This paper addresses a core decision generative AI systems face when using external tools, especially web search: judging accurately whether to invoke a tool, so as to avoid redundant or even harmful calls and improve task efficiency and accuracy. The central difficulty is the mismatch between the model's internal knowledge and external tool responses, and the frequent misalignment between the model's self-perceived need for and utility of a tool and the truly optimal decision. The key solution is a principled framework inspired by decision theory that evaluates tool-call decisions along normative and descriptive dimensions of necessity, utility, and affordability. Lightweight estimators trained on the model's hidden states predict true need and utility, enabling simple yet effective controllers that clearly outperform the models' own judgment, improving task performance across three tasks and six models.
Link: https://arxiv.org/abs/2605.00737
Authors: Qinyuan Wu,Soumi Das,Mahsa Amani,Arijit Nag,Seungeon Lee,Krishna P. Gummadi,Abhilasha Ravichander,Muhammad Bilal Zafar
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Preprint, under review
Abstract:Agentic AI architectures augment LLMs with external tools, unlocking strong capabilities. However, tool use is not always beneficial; some calls may be redundant or even harmful. Effective tool use, therefore, hinges on a core LLM decision: whether to call or not call a tool, when performing a task. This decision is particularly challenging for web search tools, where the benefits of external information depend on the model’s internal knowledge and its ability to integrate potentially noisy tool responses. We introduce a principled framework inspired by decision-making theory to evaluate web search tool-use decisions along three key factors: necessity, utility, and affordability. Our analysis combines two complementary lenses: a normative perspective that infers true need and utility from an optimal allocation of tool calls, and a descriptive perspective that infers the model’s self-perceived need and utility from their observed behaviors. We find that models’ perceived need and utility of tool calls are often misaligned with their true need and utility. Building on this framework, we train lightweight estimators of need and utility based on models’ hidden states. Our estimators enable simple controllers that can improve decision quality and lead to stronger task performance than the self-perceived setup across three tasks and six models.
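The normative lens above can be made concrete with a small sketch. This is our own illustration (the function names and the greedy budget rule are assumptions, not the paper's method): call a tool only when the expected accuracy gain exceeds the cost, and under a budget, fund the calls with the highest gain per unit cost first.

```python
# Call/no-call rule: invoke the tool iff the expected utility gain
# (accuracy with tool minus accuracy without) outweighs the call cost.
def should_call_tool(p_correct_internal, p_correct_with_tool, call_cost):
    return (p_correct_with_tool - p_correct_internal) > call_cost

# Normative allocation under a budget: greedily fund the calls
# with the highest estimated gain per unit cost.
def allocate_calls(gains, costs, budget):
    order = sorted(range(len(gains)), key=lambda i: gains[i] / costs[i], reverse=True)
    chosen, spent = [], 0.0
    for i in order:
        if spent + costs[i] <= budget:
            chosen.append(i)
            spent += costs[i]
    return sorted(chosen)
```

In the paper, the gain-like quantities are not given but predicted by lightweight estimators on the model's hidden states.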
[AI-3] EASE: Federated Multimodal Unlearning via Entanglement-Aware Anchor Closure
Quick read: This paper tackles unlearning in Federated Multimodal Learning (FML): how to effectively remove a specific client's data influence while keeping clients' image-text pairs private. Existing federated unlearning methods neither sever the cross-modal reconstruction channel mediated by bilinear coupling nor separate forget-exclusive update directions from those shared with retained clients, leaving unlearning incomplete. The key solution is an "Anchor Principle" that identifies three residual anchors: alignment residue from bilinear cross-modal coupling, principal-angle subspace entanglement, and drift caused by continued federated updates. The method closes the cross-modal reconstruction channel via bilateral displacement of the visual and language branches, uses Cosine-Sine decomposition to isolate forget-exclusive directions within client-update subspaces, and introduces a direction-selective Forget Lock to bound residual drift across rounds. The result is EASE, a unified framework combining subspace disentanglement with precise unlearning, which shows consistent superiority over baselines across multiple datasets and unlearning scenarios.
Link: https://arxiv.org/abs/2605.00733
Authors: Zihao Ding,Beining Wu,Jun Huang
Affiliations: unknown
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments:
Abstract:Federated Multimodal Learning (FML) trains multimodal models across decentralized clients while keeping their image-text pairs private. However, joint embedding training entangles forgotten knowledge across both modalities and client gradient subspaces, hindering federated unlearning. Previous federated unlearning approaches neither sever the cross-modal reconstruction channel mediated by bilinear coupling nor separate forget-exclusive update directions from those shared with retained clients. We identify an Anchor Principle for federated multimodal contrastive unlearning: forgotten alignments persist through three residual anchors arising from bilinear cross-modal coupling, principal-angle subspace entanglement, and continued federated updates. At the modality level, we show that bilateral displacement of both visual and language branches closes the cross-modal reconstruction channel. Correspondingly, our method addresses subspace entanglement through Cosine–Sine decomposition of client-update subspaces, isolating forget-exclusive directions from retain support. Moreover, we propose a direction-selective Forget Lock that bounds residual drift across rounds. Combining these strategies, we present EASE, an Entanglement-Aware Subspace Excision framework that closes all three anchor channels under a unified design. EASE demonstrates consistent superiority across multiple datasets and unlearning scenarios, for instance, matching the retrain reference to within 0.2 and 4.2 R@1 points on the forget and retain sides under client unlearning on Flickr30K with CLIP-B/32.
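The principal-angle subspace entanglement mentioned above has a standard numerical diagnostic: the cosines of the principal angles between two subspaces are the singular values of Q1^T Q2. The sketch below is our own illustration of that diagnostic (not EASE's excision procedure); cosines near 1 indicate shared, entangled directions, while cosines near 0 indicate directions exclusive to one subspace.

```python
import numpy as np

def principal_angle_cosines(A, B):
    """Columns of A and B span two subspaces; return cosines of principal angles."""
    Qa, _ = np.linalg.qr(A)          # orthonormal basis for span(A)
    Qb, _ = np.linalg.qr(B)          # orthonormal basis for span(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.clip(s, 0.0, 1.0)
```

A forget-exclusive direction, in this picture, is one whose principal-angle cosine against the retained clients' update subspace is small, so it can be excised without touching shared knowledge.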
[AI-4] Empowering Heterogeneous Graph Foundation Models via Decoupled Relation Alignment
Quick read: This paper addresses two core challenges in extending Graph Foundation Models (GFMs) to multi-domain heterogeneous graphs (MDHGs): cross-type feature shifts and intra-domain relation gaps. Existing global feature alignment methods (PCA or SVD) blindly enforce a shared feature space, distorting type-specific semantics and disrupting original topologies, which leads to "Type Collapse" and "Relation Confusion". The key solution is Decoupled Relation Subspace Alignment (DRSA), which explicitly decouples feature semantics from relation structures: a dual-relation subspace projection mechanism coordinates cross-type interactions within a shared low-rank relation subspace, and a feature-structure decoupled representation decomposes aligned features into a semantic projection component and a structural residual term that adaptively absorbs intra-domain variation. Optimized via a stable alternating scheme based on Block Coordinate Descent, DRSA builds a well-calibrated, structure-aware latent space; experiments on multiple real-world benchmarks show it serves as a universal preprocessing module that significantly and consistently improves cross-domain and few-shot knowledge transfer for state-of-the-art GFMs.
Link: https://arxiv.org/abs/2605.00731
Authors: Ziyu Zheng,Yaming Yang,Zhe Wang,Ziyu Guan,Wei Zhao
Affiliations: unknown
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments:
Abstract:While Graph Foundation Models (GFMs) have achieved remarkable success in homogeneous graphs, extending them to multi-domain heterogeneous graphs (MDHGs) remains a formidable challenge due to cross-type feature shifts and intra-domain relation gaps. Existing global feature alignment methods (PCA or SVD) enforce a shared feature space blindly, which distorts type-specific semantics and disrupts original topologies, inevitably leading to “Type Collapse” and “Relation Confusion”. To address these fundamental limitations, we propose Decoupled relation Subspace Alignment (DRSA), a novel, plug-and-play relation-driven alignment framework. DRSA fundamentally shifts the paradigm by decoupling feature semantics from relation structures. Specifically, it introduces a dual-relation subspace projection mechanism to coordinate cross-type interactions within a shared low-rank relation subspace explicitly. Furthermore, a feature-structure decoupled representation is designed to decompose aligned features into a semantic projection component and a structural residual term, adaptively absorbing intra-domain variations. Optimized via a stable alternating minimization strategy based on Block Coordinate Descent, DRSA constructs a well-calibrated, structure-aware latent space. Extensive experiments on multiple real-world benchmark datasets demonstrate that DRSA can be seamlessly integrated as a universal preprocessing module, significantly and consistently enhancing the cross-domain and few-shot knowledge transfer capabilities of state-of-the-art GFMs. The code is available at: this https URL.
[AI-5] Towards Improving Speaker Distance Estimation through Generative Impulse Response Augmentation ICASSP2025
Quick read: This paper addresses the limited performance of speaker distance estimation (SDE) models when training data is sparse. The key solution is to use the open-source fast diffuse room impulse response generator (FastRIR), conditioned only on speaker and listener locations, to generate augmented room impulse response (RIR) data; a quality filter ensures the generated RIRs align with the challenge data, and models are fine-tuned with hyperparameter optimization. Experiments show the approach substantially improves SDE accuracy, reducing mean absolute error (MAE) from 1.66m to 0.6m for GWA rooms and from 2.18m to 0.69m for Treble rooms, with gains most pronounced at medium to long distances.
Link: https://arxiv.org/abs/2605.00721
Authors: Anton Ratnarajah,Mehmet Ergezer,Arun Nair,Mrudula Athi
Affiliations: unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Comments: Accepted to Generative Data Augmentation for Real-World Signal Processing Applications (GenDA 2025). An ICASSP 2025 Satellite Workshop and IEEE Data Science and Learning Workshop: Room Acoustics and Speaker Distance Estimation Challenge
Abstract:The Room Acoustics and Speaker Distance Estimation (SDE) Challenge at ICASSP 2025 explores the effectiveness of augmented room impulse response (RIR) data for improving SDE model performance. This challenge at GenDARA involves generating RIRs to supplement sparse datasets and fine-tuning SDE models with the augmented data. We employ the open-source fast diffuse room impulse response generator (FastRIR) conditioned only on speaker and listener locations. We design a quality filter to ensure generated RIR alignment with challenge RIRs, and hyperparameter optimization is employed for model fine-tuning. Our approach reduces the mean absolute error (MAE) of the five positions from 1.66m to 0.6m for GWA rooms and from 2.18m to 0.69m for Treble rooms, with results demonstrating that the augmentation approach significantly improves estimation accuracy, particularly at medium to long distances.
[AI-6] Augmented Lagrangian Multiplier Network for State-wise Safety in Reinforcement Learning
Quick read: This paper addresses safety in real-world reinforcement learning (RL), specifically the difficulty of stable learning under state-wise constraints. Existing Lagrangian approaches require a distinct multiplier per state, approximated by a neural multiplier network, but standard dual gradient ascent on such networks causes severe training oscillations: the inherent instability of dual ascent is amplified by network generalization, so local overshoots and delayed updates propagate to adjacent states and further amplify policy fluctuations. The proposed Augmented Lagrangian Multiplier Network (ALaM) framework has two key components: a quadratic penalty added to the augmented Lagrangian compensates for delayed multiplier updates and establishes local convexity near the optimum, suppressing policy oscillations; and the multiplier network is trained by supervised regression toward a dual target, stabilizing training and promoting convergence. Theory shows ALaM guarantees multiplier convergence and thus recovers the optimal policy of the constrained problem. Building on this framework, the authors combine Soft Actor-Critic (SAC) with ALaM to obtain SAC-ALaM, which outperforms state-of-the-art safe RL baselines in both safety and return while stabilizing training dynamics and learning well-calibrated multipliers for risk identification.
Link: https://arxiv.org/abs/2605.00667
Authors: Jiaming Zhang,Yujie Yang,Yao Lyu,Shengbo Eben Li,Liping Zhang
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 41 figures, 1 table
Abstract:Safety is a primary challenge in real-world reinforcement learning (RL). Formulating safety requirements as state-wise constraints has become a prominent paradigm. Handling state-wise constraints with the Lagrangian method requires a distinct multiplier for every state, necessitating neural networks to approximate them as a multiplier network. However, applying standard dual gradient ascent to multiplier networks induces severe training oscillations. This is because the inherent instability of dual ascent is exacerbated by network generalization – local overshoots and delayed updates propagate to adjacent states, further amplifying policy fluctuations. Existing stabilization techniques are designed for scalar multipliers, which are inadequate for state-dependent multiplier networks. To address this challenge, we propose an augmented Lagrangian multiplier network (ALaM) framework for stable learning of state-wise multipliers. ALaM consists of two key components. First, a quadratic penalty is introduced into the augmented Lagrangian to compensate for delayed multiplier updates and establish the local convexity near the optimum, thereby mitigating policy oscillations. Second, the multiplier network is trained via supervised regression toward a dual target, which stabilizes training and promotes convergence. Theoretically, we show that ALaM guarantees multiplier convergence and thus recovers the optimal policy of the constrained problem. Building on this framework, we integrate soft actor-critic (SAC) with ALaM to develop the SAC-ALaM algorithm. Experiments demonstrate that SAC-ALaM outperforms state-of-the-art safe RL baselines in both safety and return, while also stabilizing training dynamics and learning well-calibrated multipliers for risk identification.
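The two ALaM components can be sketched in scalar form. This is our own minimal illustration under standard augmented-Lagrangian conventions (constraint value c(s) <= 0 means safe), not the paper's implementation:

```python
# Augmented Lagrangian with quadratic penalty: the policy objective adds
# -lam * c(s) - (rho / 2) * c(s)^2 on top of the reward, so larger violations
# are penalized both linearly (multiplier) and quadratically (local convexity).
def augmented_lagrangian(reward, violation, lam, rho):
    return reward - lam * violation - 0.5 * rho * violation ** 2

# Projected dual-ascent target; in ALaM the multiplier network is trained by
# supervised regression toward this target rather than raw dual ascent.
def dual_target(lam, violation, step):
    return max(0.0, lam + step * violation)
```

Regressing toward `dual_target` across states (rather than applying the update pointwise) is what damps the overshoot propagation described in the abstract.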
[AI-7] Reinforcement Learning with Markov Risk Measures and Multipattern Risk Approximation
Quick read: This paper studies policy learning for risk-averse finite-horizon Markov Decision Processes (MDPs), in particular how to control risk effectively under uncertainty. The key contributions are two concepts: "mini-batch measures", a special class of Markov coherent risk measures capturing risk preferences over multi-step decisions, and "multipattern risk-averse problems", which generalize the class of linear systems by allowing more flexible state-feature representations. Based on these, the authors design a feature-based Q-learning method with multipattern Q-factor approximation and prove a high-probability regret bound of O(H^2 N^H √K), where H is the horizon, N the mini-batch size, and K the number of episodes. They also propose an economical variant of the algorithm that streamlines the policy-evaluation step; the theory is illustrated on a stochastic assignment problem and a short-horizon multi-armed bandit problem.
Link: https://arxiv.org/abs/2605.00654
Authors: Andrzej Ruszczynski,Tiangang Zhang
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:
Abstract:For a risk-averse finite-horizon Markov Decision Problem, we introduce a special class of Markov coherent risk measures, called mini-batch measures. We also define the class of multipattern risk-averse problems that generalizes the class of linear systems. We use both concepts in a feature-based Q-learning method with multipattern Q-factor approximation and we prove a high-probability regret bound of O(H^2 N^H √K), where H is the horizon, N is the mini-batch size, and K is the number of episodes. We also propose an economical version of the Q-learning method that streamlines the policy evaluation (backward) step. The theoretical results are illustrated on a stochastic assignment problem and a short-horizon multi-armed bandit problem.
[AI-8] AdaMeZO: Adam-style Zeroth-Order Optimizer for LLM Fine-tuning Without Maintaining the Moments
Quick read: This paper addresses the high GPU memory cost of fine-tuning large language models (LLMs) for downstream tasks. Classic backpropagation-based fine-tuning works well but is memory-hungry; the recent MeZO method relies only on forward passes and greatly reduces memory, yet converges slowly because it ignores loss-landscape curvature. The proposed AdaMeZO is a zeroth-order optimizer whose key idea is to exploit Adam-style momentum and adaptive learning rates to guide the optimization path without storing first- and second-moment estimates, preserving the low-memory advantage while improving convergence; experiments show it can outperform MeZO while requiring up to 70% fewer forward passes and adapts well to diverse loss landscapes.
Link: https://arxiv.org/abs/2605.00650
Authors: Zhijie Cai,Haolong Chen,Guangxu Zhu
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Fine-tuning LLMs is necessary for various dedicated downstream tasks, but classic backpropagation-based fine-tuning methods require substantial GPU memory. To this end, a recent work, MeZO, which relies solely on forward passes to fine-tune LLMs, significantly reduces GPU requirements at the cost of slower convergence due to its indifference to loss landscapes. Standard solutions, such as Adam, explore loss landscapes by estimating the first- and second-order moments and storing them in memory to guide the model’s movement through dimensions with lower curvature and vice versa. However, directly applying Adam negates MeZO’s advantage as it will triple the memory requirement. In light of this, we propose AdaMeZO, a zeroth-order optimizer that leverages Adam-style first- and second-moment estimates without maintaining them in memory. We present a theoretical analysis of AdaMeZO, corroborated by extensive experiments demonstrating AdaMeZO’s performance, showing that AdaMeZO can outperform MeZO while requiring up to 70% fewer forward passes. Trajectory visualizations affirm AdaMeZO’s ability to adapt to diverse loss landscapes.
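The zeroth-order estimate that MeZO-style methods (and hence AdaMeZO) build on is a two-point SPSA gradient: perturb the parameters by a random direction regenerated from a seed, so the direction never has to be stored. The sketch below is our own simplified illustration, not AdaMeZO itself:

```python
import random

def spsa_grad(loss, params, eps=1e-3, seed=0):
    """Two forward passes along a seeded random direction; no backprop,
    and the perturbation z can be regenerated from `seed` instead of stored."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in params]                  # perturbation direction
    plus = loss([p + eps * zi for p, zi in zip(params, z)])    # forward pass 1
    minus = loss([p - eps * zi for p, zi in zip(params, z)])   # forward pass 2
    g = (plus - minus) / (2 * eps)                             # directional derivative
    return [g * zi for zi in z]                                # gradient estimate
```

A single estimate is noisy, but it is unbiased in expectation; AdaMeZO's contribution is shaping the step taken from such estimates with Adam-style statistics that are likewise not kept in memory.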
[AI-9] Learning Multimodal Energy-Based Model with Multimodal Variational Auto-Encoder via MCMC Revision
Quick read: This paper addresses the poor mixing of MCMC sampling in the joint data space when learning multimodal energy-based models (EBMs) by maximum likelihood, which makes it hard to discover coherent inter-modal relationships. The key solution is a framework that effectively interweaves the MLE updates of a multimodal EBM, a shared latent generator, and a joint inference model with corresponding MCMC refinements in both the data and latent spaces. Specifically, the generator is trained to produce structurally coherent multimodal samples that serve as strong initial states for EBM sampling, while the inference model provides informative latent initializations for generator posterior sampling; the two complement each other, markedly improving EBM sampling efficiency and learning quality, and yielding realistic, coherent multimodal synthesis.
Link: https://arxiv.org/abs/2605.00644
Authors: Jiali Cui,Zhiqiang Lao,Heather Yu
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Transactions on Machine Learning Research, 2026
Abstract:Energy-based models (EBMs) are a flexible class of deep generative models and are well-suited to capture complex dependencies in multimodal data. However, learning multimodal EBM by maximum likelihood requires Markov Chain Monte Carlo (MCMC) sampling in the joint data space, where noise-initialized Langevin dynamics often mixes poorly and fails to discover coherent inter-modal relationships. Multimodal VAEs have made progress in capturing such inter-modal dependencies by introducing a shared latent generator and a joint inference model. However, both the shared latent generator and joint inference model are parameterized as unimodal Gaussian (or Laplace), which severely limits their ability to approximate the complex structure induced by multimodal data. In this work, we study the learning problem of the multimodal EBM, shared latent generator, and joint inference model. We present a learning framework that effectively interweaves their MLE updates with corresponding MCMC refinements in both the data and latent spaces. Specifically, the generator is learned to produce coherent multimodal samples that serve as strong initial states for EBM sampling, while the inference model is learned to provide informative latent initializations for generator posterior sampling. Together, these two models serve as complementary models that enable effective EBM sampling and learning, yielding realistic and coherent multimodal EBM samples. Extensive experiments demonstrate superior performance for multimodal synthesis quality and coherence compared to various baselines. We conduct various analyses and ablation studies to validate the effectiveness and scalability of the proposed multimodal framework.
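The Langevin-dynamics refinement step referenced above has a standard form: x ← x − (η/2)·∇E(x) + √η·ξ with Gaussian noise ξ. Below is our own 1-D toy illustration on a quadratic energy E(x) = x²/2 (whose stationary density is approximately standard normal), not the paper's multimodal sampler:

```python
import math
import random

def langevin_sample(x0, n_steps, step, rng):
    """Unadjusted Langevin dynamics for E(x) = x^2 / 2."""
    x = x0
    for _ in range(n_steps):
        grad_e = x  # dE/dx for the quadratic toy energy
        x = x - 0.5 * step * grad_e + math.sqrt(step) * rng.gauss(0.0, 1.0)
    return x
```

The paper's point is that when chains start from noise, this kind of dynamics mixes poorly in a joint multimodal space, which is why the learned generator is used to supply strong initial states `x0`.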
[AI-10] Fairness of Classifiers in the Presence of Constraints between Features
Quick read: This paper addresses the risk of misjudging classifier fairness when constraints exist between features: when protected features (such as gender) are implicitly linked to unprotected ones, deciding fairness by whether a protected feature is directly used can fail. The key solution is the notion of a "fair explanation": a prime-implicant reason for the decision, computed with all feature constraints taken into account, that contains no protected feature. This definition shows that ignoring constraints can completely change the fairness assessment of a decision, even in the absence of constraints between protected and unprotected features, and it yields a rigorous, computable framework for testing classifier fairness.
Link: https://arxiv.org/abs/2605.00592
Authors: Martin C. Cooper,Imane Bousdira
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: To be published in Proc. CP 2026
Abstract:In Machine Learning, an accepted definition of fairness of a decision taken by a classifier is that it should not depend on protected features, such as gender. Unfortunately, when constraints exist between features, such dependencies can be obscured by the constraints. To avoid this problem, we propose that a decision be considered fair if it has a fair explanation. We define a fair explanation as a prime-implicant reason for the decision that does not contain any protected feature (where the constraints are taken into account in the definition of prime-implicant). Surprisingly, ignoring constraints can completely change the fairness of a decision (according to this definition) even in the absence of constraints between protected and unprotected features. Three possible definitions of fairness of a classifier are that for all its decisions (1) there are only fair explanations, (2) there is at least one fair explanation, or (3) changing protected features does not change the outcome. We identify the relationships between these different definitions of fairness and study the computational complexity of testing fairness of classifiers.
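Definition (2) above, the existence of at least one fair explanation, can be checked by brute force on small Boolean classifiers. The sketch below is our own unconstrained illustration (it omits the feature constraints that are central to the paper's definition of prime-implicant): a sufficient reason is a subset of the instance's feature values that forces the decision, a prime implicant is a sufficient reason with no sufficient proper subset, and the decision is fair if some prime implicant avoids the protected feature.

```python
from itertools import combinations

def is_sufficient(clf, instance, subset, n_features):
    """True iff fixing `subset` of instance's values forces clf's decision."""
    target = clf(instance)
    free = [i for i in range(n_features) if i not in subset]
    for bits in range(2 ** len(free)):  # enumerate all completions of the free features
        x = list(instance)
        for k, i in enumerate(free):
            x[i] = (bits >> k) & 1
        if clf(tuple(x)) != target:
            return False
    return True

def has_fair_explanation(clf, instance, n_features, protected):
    sufficient = [set(s) for size in range(n_features + 1)
                  for s in combinations(range(n_features), size)
                  if is_sufficient(clf, instance, set(s), n_features)]
    # Prime implicants: sufficient sets with no sufficient proper subset.
    primes = [s for s in sufficient if not any(t < s for t in sufficient)]
    return any(protected not in p for p in primes)
```

For example, for clf(x) = x0 OR x1 on instance (1, 1, 0) with feature 0 protected, {1} is a fair prime-implicant reason, while for clf(x) = x0 the only prime implicant is {0} and no fair explanation exists.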
[AI-11] Instance-Aware Parameter Configuration in Bilevel Late Acceptance Hill Climbing for the Electric Capacitated Vehicle Routing Problem CEC
Quick read: This paper addresses the fact that a single globally tuned parameter configuration cannot exploit instance heterogeneity when algorithm performance varies strongly across instances. For the Electric Capacitated Vehicle Routing Problem (ECVRP), instances differ markedly in structure, demand patterns, and energy constraints, which limits generic settings. The key solution is instance-aware parameter configuration: offline tuning first produces instance-specific parameter labels, then a regression model maps instance features to the parameter space, enabling parameter prediction for unseen instances so the algorithm can be configured with near-optimal parameters before execution. On the IEEE WCCI 2020 benchmark, the approach reduces the average objective value by 0.28% relative to a globally tuned configuration, a meaningful operational cost saving.
Link: https://arxiv.org/abs/2605.00572
Authors: Yinghao Qin,Xinwei Wang,Mosab Bazargani,Jun Chen
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: Accepted at IEEE Congress on Evolutionary Computation (CEC), 2026
Abstract:Algorithm performance in combinatorial optimization is highly sensitive to parameter settings, while a single globally tuned configuration often fails to exploit the heterogeneity of instances. This limitation is particularly evident in the Electric Capacitated Vehicle Routing Problem, where instances differ in structure, demand patterns, and energy constraints. This paper investigates instance-aware parameter configuration for Bilevel Late Acceptance Hill Climbing, a state-of-the-art metaheuristic for the Electric Capacitated Vehicle Routing Problem. An offline tuning procedure is used to obtain instance-specific parameter labels, which are then mapped from instance features via a regression model to enable parameter prediction for unseen instances prior to execution. Experimental results on the IEEE WCCI 2020 benchmark and its extensions show that the proposed approach achieves an average objective value reduction of 0.28% across eight held-out test instances relative to a globally tuned configuration. This corresponds to a significant cost reduction in multimillion-dollar transportation operations.
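The features-to-parameters mapping can be illustrated with ordinary linear least squares. This is our own minimal sketch (the paper does not specify this regressor, and the feature/parameter shapes are assumptions): fit on (instance features, tuned parameters) pairs from offline tuning, then predict parameters for an unseen instance before the solver runs.

```python
import numpy as np

def fit_param_model(features, best_params):
    """features: (n_instances, n_features); best_params: (n_instances, n_params)."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])  # append bias column
    W, *_ = np.linalg.lstsq(X, best_params, rcond=None)
    return W

def predict_params(W, feature_vec):
    x = np.append(feature_vec, 1.0)
    return x @ W
```

In practice the prediction would be clipped or rounded to the metaheuristic's valid parameter ranges before use.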
[AI-12] Beyond Continuity: Simulation-free Reconstruction of Discrete Branching Dynamics from Single-cell Snapshots
Quick read: This paper addresses the challenges of inferring cellular trajectories from destructive single-cell snapshots, in particular stochasticity and non-conservative mass dynamics such as cell proliferation and apoptosis. Existing unbalanced optimal transport (UOT) methods treat mass as a continuous fluid and infer at the population level, failing to capture the discrete, jump-like nature of birth-death events at single-cell resolution, which limits understanding of lineage branching and fate decisions. The key solution is the Unbalanced Schrödinger Bridge (USB) framework, which introduces a simulation-free training objective that jointly models Brownian motion and discrete birth-death jumps, precisely characterizing microscopic cell dynamics while scaling to high-dimensional omics data. On simulated and real-world datasets, USB matches or exceeds deterministic baselines for trajectory reconstruction and uniquely enables realistic discrete simulation of birth-death dynamics at single-cell resolution.
Link: https://arxiv.org/abs/2605.00545
Authors: Junda Ying,Yuxuan Wang,Bowen Yang,Peijie Zhou,Lei Zhang
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Mathematical Physics (math-ph); Genomics (q-bio.GN); Quantitative Methods (q-bio.QM)
Comments:
Abstract:Inferring cellular trajectories from destructive snapshots is complicated by the challenges of stochasticity and non-conservative mass dynamics such as cell proliferation and apoptosis. Existing unbalanced Optimal Transport (OT) methods treat mass as a continuous fluid, performing inference at the population level. However, this macroscopic view often fails to capture the discrete, jump-like nature of birth-death events at single-cell resolution, which is essential for understanding lineage branching and fate decisions. We present Unbalanced Schrödinger Bridge (USB), a simulation-free framework for learning underlying dynamics that effectively integrates both stochastic and unbalanced effects which also models the discrete, jump-like birth-death dynamics at single-cell resolution. Theoretically, USB provides a tractable solution to the Branching Schrödinger Bridge (BSB) problem, offering a rigorous microscopic interpretation where individual cells undergo both Brownian motion and discrete birth-death jumps. Technically, the method implements an efficient solver by introducing a simulation-free training objective that effectively scales to high-dimensional omics data. Empirically, we demonstrate on both simulated and real-world datasets that USB not only achieves trajectory reconstruction performance better than or comparable to deterministic baselines but also uniquely enables realistic discrete simulation of birth-death dynamics at single-cell resolution.
[AI-13] SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters
Quick read: This paper addresses the inefficiency of current GPU schedulers for compound AI workloads such as agents: each LLM call is treated as an independent request, ignoring task continuity and repeatedly discarding intermediate state (e.g., the KV cache), which inflates end-to-end latency by 3-8x. The core solution is program-level scheduling, which makes the entire agent workflow, rather than an individual inference call, the first-class schedulable unit. Key mechanisms include: (1) Agent Execution Graphs that capture workflow structure to predict KV cache reuse across tool-call boundaries, approaching Bélády's optimal offline policy; (2) session-affinity batching with work stealing, which co-locates correlated requests while maintaining global load balance; and (3) Agent Fair Share, a task-completion-time fairness metric with provable bounded-deviation guarantees. Experiments show the resulting scheduler, SAGA, markedly reduces task completion time (1.64x speedup), improves GPU memory utilization, and sustains high SLO attainment under multi-tenant interference.
Link: https://arxiv.org/abs/2605.00528
Authors: Dongxin Guo,Jikun Wu,Siu Ming Yiu
Affiliations: unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Operating Systems (cs.OS)
Comments: 15 pages, 3 figures, 11 tables. Accepted to HPDC '26 (35th International Symposium on High-Performance Parallel and Distributed Computing), July 13-16, 2026, Cleveland, OH, USA
Abstract:AI agents execute tens to hundreds of chained LLM calls per task, yet GPU schedulers treat each call as independent, discarding gigabytes of intermediate state between steps and inflating end-to-end latency by 3-8x. We argue that this request-level abstraction is fundamentally mismatched to compound AI workloads, and propose a shift to program-level scheduling: treating the entire agent workflow (not individual inference calls) as the first-class schedulable unit. We present SAGA, a distributed scheduler that implements this abstraction through three mechanisms: (1) Agent Execution Graphs that capture workflow structure to predict KV cache reuse across tool-call boundaries, achieving within 1.31x of Bélády’s optimal offline policy; (2) session-affinity batching with work stealing that co-locates correlated requests while maintaining global load balance; and (3) Agent Fair Share, a task-completion-time fairness metric with provable bounded-deviation guarantees. On a 64-GPU cluster serving SWE-bench coding agents and WebArena browser tasks, SAGA reduces task completion time by 1.64x (geometric mean, p 0.001) over vLLM v0.15.1 with prefix caching and affinity routing, while improving GPU memory utilization by 1.22x and achieving 99.2% SLO attainment under multi-tenant interference. These latency gains come at a quantified cost: approximately 30% lower peak throughput than throughput-optimal batch scheduling, a tradeoff appropriate for the latency-sensitive interactive deployments that dominate compound AI usage. Our results demonstrate that workflow-aware scheduling is essential for efficient compound AI serving.
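Bélády's policy, the offline baseline SAGA compares against, evicts the cached entry whose next use is farthest in the future. The sketch below is our own generic cache illustration (not SAGA's KV-cache manager), with an LRU counterpart for comparison:

```python
from collections import OrderedDict

def belady_hits(trace, capacity):
    """Offline-optimal eviction: drop the item reused farthest in the future."""
    hits, cache = 0, set()
    for t, item in enumerate(trace):
        if item in cache:
            hits += 1
            continue
        if len(cache) >= capacity:
            def next_use(c):
                rest = trace[t + 1:]
                return rest.index(c) if c in rest else len(trace)
            cache.discard(max(cache, key=next_use))
        cache.add(item)
    return hits

def lru_hits(trace, capacity):
    """Online LRU baseline for the same trace."""
    hits, cache = 0, OrderedDict()
    for item in trace:
        if item in cache:
            hits += 1
            cache.move_to_end(item)
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)
            cache[item] = True
    return hits
```

Bélády needs the full future trace, which is why SAGA's contribution is an online predictor (the Agent Execution Graph) that approximates that future from workflow structure.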
[AI-14] Silicon Showdown: Performance, Efficiency, and Ecosystem Barriers in Consumer-Grade LLM Inference
Quick read: This paper examines the systems challenges of running large language model (LLM) inference on consumer hardware, especially balancing compute performance, memory capacity, and energy efficiency as models exceed 70B parameters. Its core contribution is a systematic empirical analysis of the architectural trade-offs in the Nvidia and Apple Silicon ecosystems: on Nvidia's Blackwell architecture, the NVFP4 quantization format delivers a 1.6x throughput gain (151 tokens/s) but requires trading startup latency for generation speed; Apple's unified memory architecture (UMA), with high-bandwidth shared memory, enables linear scaling of 80B-parameter models at 4-bit precision and up to a 23x energy-efficiency advantage. The key findings, a "Backend Dichotomy" and a "VRAM Wall" across platforms, show that optimal consumer-grade inference depends on the interplay between compute density (Nvidia) and memory capacity (Apple), along with significant "ecosystem friction" from proprietary quantization workflows.
Link: https://arxiv.org/abs/2605.00519
Authors: Allan Kazakov,Abdurrahman Javat
Affiliations: unknown
Subjects: Performance (cs.PF); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments:
Abstract:The operational landscape of local Large Language Model (LLM) inference has shifted from lightweight models to datacenter-class weights exceeding 70B parameters, creating profound systems challenges for consumer hardware. This paper presents a systematic empirical analysis of the Nvidia and Apple Silicon ecosystems, specifically characterizing the distinct intra-architecture trade-offs required to deploy these massive models. On the Nvidia Blackwell architecture, we identify a critical “Backend Dichotomy” within the TensorRT-LLM stack: while the new NVFP4 quantization format delivers a 1.6x throughput advantage over optimized BF16 baselines (151 tokens/s vs. 92 tokens/s), realizing this performance requires navigating complex runtime constraints that trade startup latency for generation speed. Furthermore, we characterize the “VRAM Wall” for 70B+ models: on discrete GPUs, users face a destructive choice between aggressive quantization (e.g., Q2) that degrades model intelligence to fit in VRAM, or PCIe-bottlenecked CPU offloading, which reduces throughput by over 90% compared to full-GPU execution. Conversely, Apple’s Unified Memory Architecture (UMA) circumvents these bottlenecks, enabling linear scaling for 80B parameter models at practical 4-bit precisions. This architectural divergence extends to operational sustainability, where Apple’s SoC design demonstrates up to a 23x advantage in energy efficiency (tokens/joule). We conclude that for consumer-grade inference, the optimal hardware is defined by a complex interplay between compute density (Nvidia) and memory capacity (Apple), moderated by the significant “ecosystem friction” of proprietary quantization workflows.
[AI-15] Space Network of Experts: Architecture and Expert Placement
Quick read: This paper addresses the challenge of efficiently deploying large generative AI models, specifically mixture-of-experts (MoE) models, across satellite networks with limited onboard compute and communication resources while keeping token-generation latency low. The core problem is reconciling the mismatch between the model architecture and the satellite network topology when mapping model components (MoE layers and individual experts) onto satellites. The key solution is the Space-XNet framework with a two-level placement strategy: first, layer placement exploits the ring-like communication pattern of autoregressive inference to partition the constellation along the orbiting direction into ring-arranged subnets, each hosting one MoE layer; second, intra-layer expert placement assigns experts with heterogeneous activation probabilities by formulating and solving an expected-latency-minimizing optimization problem, yielding the intuitive principle that frequently activated experts should be mapped to satellites on routing paths with low expected latency. Experiments over a thousand-satellite constellation show at least a threefold latency reduction compared with random and ablation-based placement strategies.
Link: https://arxiv.org/abs/2605.00515
Authors: Zhanwei Wang,Huiling Yang,Min Sheng,Khaled B. Letaief,Kaibin Huang
Affiliations: unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments:
Abstract:Leveraging continuous solar energy harvesting at high efficiency, space data centers are envisioned as a promising platform for executing energy-intensive large language models (LLMs). Recognizing this advantage, space and AI conglomerates (e.g., SpaceX, Google) are actively investing in this vision. One key challenge, however, is the efficient distributed deployment of a large-scale LLM in a satellite network due to the limited onboard computing and communication resources. This gives rise to a placement problem that involves partitioning and mapping model components to satellites such that the fundamentally different model architecture and network topology can be reconciled to ensure low-latency token generation. To address this problem, we present the Space Network of Experts (Space-XNet) framework targeting the distributed execution of a popular mixture-of-experts (MoE) model in space. The proposed placement strategies are two-level: (1) layer placement, which assigns MoE layers to satellite subnets; and (2) intra-layer expert placement, which assigns individual experts to satellites associated with the same layer/subnet. For layer placement, we exploit the ring-like communication pattern of autoregressive inference to partition the satellite constellation along the orbiting direction into subnets arranged on a ring, each hosting one MoE layer. Based on this architecture, we formulate and solve an optimization problem for intra-layer expert placement to map experts with heterogeneous activation probabilities onto satellites. The derived strategy reveals an intuitive principle: a frequently activated expert should be mapped to a satellite on a routing path with low expected latency. Experiments over a thousand-satellite constellation show that Space-XNet achieves at least a threefold latency reduction compared with conventional random and ablation-based placement strategies.
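The placement principle derived in the paper, frequently activated experts on low-expected-latency satellites, has a simple greedy form when each satellite hosts one expert: by the rearrangement inequality, pairing activation probabilities in decreasing order with latencies in increasing order minimizes expected latency. The sketch below is our own simplified illustration of that principle, not the paper's full optimization:

```python
# Pair the most frequently activated experts with the lowest
# expected-latency satellites (one expert per satellite).
def place_experts(activation_probs, latencies):
    """Return a mapping: expert index -> satellite index."""
    experts = sorted(range(len(activation_probs)),
                     key=lambda e: activation_probs[e], reverse=True)
    sats = sorted(range(len(latencies)), key=lambda s: latencies[s])
    return {e: s for e, s in zip(experts, sats)}

def expected_latency(placement, activation_probs, latencies):
    return sum(activation_probs[e] * latencies[s] for e, s in placement.items())
```

The paper's setting is richer (routing paths, multiple experts per satellite), but the sorted pairing captures why the derived strategy beats random placement.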
[AI-16] Scalable Context-Aware Graph Attention for Unsupervised Anomaly Detection in Large-Scale Mobile Networks
[Quick Read]: This paper tackles anomaly detection for mobile network operators across large populations of heterogeneous network elements (radio access network and packet core) that expose high-dimensional KPI time series, where the cost of labelling incidents makes supervised approaches impractical. The proposed C-MTAD-GAT (Context-aware Multivariate Time-series Anomaly Detection with Graph Attention) is a single shared model whose key design combines temporal and feature-wise graph attention, lightweight static and dynamic context conditioning, and a dual-head decoder for reconstruction and multi-step forecasting. It yields fine-grained per-element, per-feature anomaly scores and raises alerts via fully unsupervised thresholds calibrated from validation residuals. On a real telecom dataset it improves event-level affiliation and pointwise F1 while producing fewer false alarms, and it deploys across domains without relying on labels, demonstrating good scalability.
Link: https://arxiv.org/abs/2605.00482
Authors: Sara Malacarne,Eirik Hoel-Høiseth,Erlend Aune,David Zsolt Biró,Massimiliano Ruocco
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This work has been submitted to the IEEE for possible publication
Abstract:Mobile network operators must monitor thousands of heterogeneous network elements across the radio access network and the packet core, each exposing high-dimensional KPI time series. The scale and cost of incident labelling make supervised approaches impractical, motivating unsupervised anomaly detection robust to context shifts and nonstationarity. We propose C-MTAD-GAT (Context-aware Multivariate Time-series Anomaly Detection with Graph Attention), an anomaly detection framework designed to operate as a single shared model across large populations of network elements. The model combines temporal and feature-wise graph attention with lightweight static and dynamic context conditioning and a dual-head decoder for reconstruction and multi-step forecasting. It produces per-element, per-feature anomaly scores, converted to alerts via fully unsupervised thresholds calibrated from validation residuals. On the TELCO dataset released with DC-VAE (garcia2023onemodel), C-MTAD-GAT improves event-level affiliation and pointwise F1 while generating fewer alarms than prior graph-attention and VAE-based baselines. We then apply the same system to nation-scale radio access and evolved packet core control-plane counter data from a mobile network operator, where it is deployed. Operator feedback indicates the alerts are actionable and support daily monitoring, showing scalability across domains without relying on labelled incidents.
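The phrase "unsupervised thresholds calibrated from validation residuals" admits a simple mean-plus-k-sigma reading; the sketch below assumes that form with made-up residuals (the paper's exact calibration may differ).

```python
# Toy threshold calibration: alert when an anomaly score exceeds
# mean + k * std of the validation residuals. Values are illustrative.
from statistics import mean, stdev

def calibrate_threshold(residuals, k=3.0):
    return mean(residuals) + k * stdev(residuals)

def alerts(scores, threshold):
    return [i for i, s in enumerate(scores) if s > threshold]

validation_residuals = [0.9, 1.1, 1.0, 0.8, 1.2, 1.0, 0.95, 1.05]
threshold = calibrate_threshold(validation_residuals, k=3.0)
test_scores = [1.0, 1.1, 4.2, 0.9]   # 4.2 is an injected anomaly
fired = alerts(test_scores, threshold)
```

Only the injected anomaly fires; the choice of k trades recall against the number of alarms raised.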
[AI-17] PAMod: Modeling Cyclical Shifts via Phase-Amplitude Modulation for Non-stationary Time Series Forecasting
[Quick Read]: This paper addresses the challenge that non-stationary statistics (time-varying mean and variance) pose to time-series forecasting, and in particular the strong assumption made by methods such as reversible instance normalization (RevIN) that historical and future distributions are identical. The key idea of the proposed PAMod framework is to model cyclical distribution shifts via Phase-Amplitude Modulation in the normalized feature space: phase modulation captures mean shifts, while amplitude modulation adapts to variance changes. The authors prove that this modulation is mathematically equivalent to dynamic denormalization, unifying distribution adaptation with representation learning. The method is lightweight and efficient, achieves state-of-the-art results on twelve real-world benchmarks, and can improve existing forecasting models as a plug-and-play module.
Link: https://arxiv.org/abs/2605.00466
Authors: Yingbo Zhou,Yutong Ye,Shuhao Li,Rui Qian,Qiang Huang,Lemao Liu,Li Sun,Dejing Dou
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Real-world time series forecasting faces the fundamental challenge of non-stationary statistical properties, including shifts in mean and variance over time. While reversible instance normalization (RevIN) has shown promise by stationarizing inputs and denormalizing outputs, it relies on the strong assumption that historical and future distributions remain identical. We observe that in many practical applications, distribution shifts follow cyclical patterns that correlate with periodic positions (e.g., seasonal and holiday volatility). To this end, we propose PAMod, a lightweight yet powerful framework that models cyclical distribution shifts via Phase-Amplitude Modulation in the normalized feature space. PAMod learns periodic embeddings to modulate representations: phase modulation captures mean shifts, while amplitude modulation adapts to variance changes. Crucially, we prove mathematically that modulating in normalized space is equivalent to applying dynamic denormalization, offering an elegant unification of distribution adaptation and representation learning. Extensive experiments on twelve real-world benchmarks demonstrate that PAMod achieves state-of-the-art performance with fewer computational resources. Furthermore, our modulation mechanism, as a novel plug-and-play technique, can improve existing time-series forecasting methods with simple integration.
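The claimed equivalence between modulating in normalized space and dynamic denormalization can be checked with a few lines of arithmetic. The gamma/beta values below are hand-set stand-ins for the learned amplitude and phase embeddings, not learned parameters.

```python
# Check: gamma * z + beta, then standard denormalization, equals a single
# "dynamic denormalization" with scale sigma*gamma and shift mu + sigma*beta.
mu, sigma = 10.0, 2.0
gamma, beta = 1.5, 0.3          # amplitude / phase modulation at this period position
x = [8.0, 10.0, 14.0]

z = [(v - mu) / sigma for v in x]                 # RevIN-style normalization
zm = [gamma * zv + beta for zv in z]              # modulate in normalized space
out_a = [sigma * v + mu for v in zm]              # then ordinary denormalization

# Equivalent dynamic denormalization applied directly to z
out_b = [(sigma * gamma) * zv + (mu + sigma * beta) for zv in z]
```

Both paths produce identical outputs, which is the algebraic content of the equivalence stated in the abstract.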
[AI-18] Adaptation of AI-accelerated CFD Simulations to the IPU platform
[Quick Read]: This paper evaluates the training efficiency and scalability of machine learning models that support computational fluid dynamics (CFD) simulation on Intelligence Processing Units (IPUs). The core challenges are feeding large-scale CFD simulation data (from OpenFOAM) to the IPU efficiently for training and achieving good performance scaling. The key elements of the solution are: adapting the program to the IPU-POD16 platform via the custom TensorFlow provided with the Poplar SDK; using the popdist library to remove a host-side data-feeding bottleneck, yielding up to 34% speedup; and, once intra-IPU communication costs are paid, scaling throughput from 560.8 to 2805.8 samples/s when increasing from 2 to 16 IPUs, demonstrating good scalability of the IPU architecture for distributed training.
Link: https://arxiv.org/abs/2605.00462
Authors: P. Rosciszewski,A. Krzywaniak,S. Iserte,K. Rojek,P. Gepner
Affiliations: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:
Abstract:Intelligence Processing Units (IPU) have proven useful for many AI applications. In this paper, we evaluate them within the emerging field of AI for simulation, where traditional numerical simulations are supported by artificial intelligence approaches. We focus specifically on a program for training machine learning models supporting a computational fluid dynamics application. We use custom TensorFlow provided by the Poplar SDK to adapt the program for the IPU-POD16 platform and investigate its ease of use and performance scalability. Training a model on data from OpenFOAM simulations allows us to get accurate simulation state predictions in test time. We show how to utilize the popdist library to overcome a performance bottleneck in feeding training data to the IPU on the host side, achieving up to 34% speedup. Due to communication overheads, using data parallelism to utilize two IPUs instead of one does not improve the throughput. However, once the intra-IPU costs have been paid, the hardware capabilities for inter-IPU communication allow for good scalability. Increasing the number of IPUs from 2 to 16 improves the throughput from 560.8 to 2805.8 samples/s.
[AI-19] Thinking in Text and Images: Interleaved Vision–Language Reasoning Traces for Long-Horizon Robot Manipulation
[Quick Read]: This paper targets the difficulty of making plans for long-horizon robot manipulation both logically coherent and geometrically grounded: existing Vision-Language-Action policies either hide planning in latent states or expose a single modality, where text-only chain-of-thought encodes causal order but misses spatial constraints, while visual prediction offers geometric cues but often remains local and semantically underconstrained. The key idea of the proposed Interleaved Vision–Language Reasoning (IVLR) framework is an explicit intermediate trace representation that alternates textual subgoals with visual keyframes over the full task horizon, jointly modeling semantics and geometry. At test time, a single native multimodal transformer self-generates this global semantic-geometric trace from the initial observation and instruction, caches it, and conditions a closed-loop action decoder on it, substantially improving success rates and robustness on complex tasks.
Link: https://arxiv.org/abs/2605.00438
Authors: Jinkun Liu,Haohan Chi,Lingfeng Zhang,Yifan Xie,YuAn Wang,Long Chen,Hangjun Ye,Xiaoshuai Hao,Wenbo Ding
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:
Abstract:Long-horizon robotic manipulation requires plans that are both logically coherent and geometrically grounded. Existing Vision-Language-Action policies usually hide planning in latent states or expose only one modality: text-only chain-of-thought encodes causal order but misses spatial constraints, while visual prediction provides geometric cues but often remains local and semantically underconstrained. We introduce Interleaved Vision–Language Reasoning (IVLR), a policy framework built around an explicit intermediate trace representation that alternates textual subgoals with visual keyframes over the full task horizon. At test time, a single native multimodal transformer self-generates this global semantic-geometric trace from the initial observation and instruction, caches it, and conditions a closed-loop action decoder on the trace, original instruction, and current observation. Because standard robot datasets lack such traces, we construct pseudo-supervision by temporally segmenting demonstrations and captioning each stage with a vision-language model. Across simulated benchmarks for long-horizon manipulation and visual distribution shift, IVLR reaches 95.5% average success on LIBERO, including 92.4% on LIBERO-Long, and 59.4% overall success on SimplerEnv-WidowX. Ablations show that both modalities are necessary: without traces, LIBERO-Long success drops to 37.7%; text-only and vision-only traces reach 62.0% and 68.4%, while the full interleaved trace reaches 92.4%. Stress tests with execution perturbations and masked trace content show moderate degradation, suggesting that the trace can tolerate local corruption and moderate execution drift, but remains limited under stale or incorrect global plans.
[AI-20] Improving LLM Code Generation via Requirement-Aware Curriculum Reinforcement Learning
[Quick Read]: This paper addresses the limited performance of large language models (LLMs) on code generation for complex programming requirements, in particular the limitations of existing curriculum reinforcement learning (CRL) approaches: misaligned perception of requirement difficulty, the absence of requirement difficulty optimization, and suboptimal curriculum sampling strategies. The key idea is RECRL, a novel requirement-aware curriculum reinforcement learning framework that automatically perceives model-specific requirement difficulty, optimizes challenging programming requirements to improve training data utilization, and employs an adaptive curriculum sampling strategy to build training batches with smoothly varying difficulty, thereby substantially enhancing LLM code generation. Experiments show that RECRL outperforms state-of-the-art methods on five popular benchmarks, with average Pass@1 improvements of 1.23%-5.62%.
Link: https://arxiv.org/abs/2605.00433
Authors: Shouyu Yin,Zhao Tian,Junjie Chen,Shikai Guo
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Code generation, which aims to automatically generate source code from given programming requirements, has the potential to substantially improve software development efficiency. With the rapid advancement of large language models (LLMs), LLM-based code generation has attracted widespread attention from both academia and industry. However, as programming requirements become increasingly complex, existing LLMs still exhibit notable performance limitations. To address this challenge, recent studies have proposed training-based curriculum reinforcement learning (CRL) strategies to improve LLM code generation performance. Despite their effectiveness, existing CRL approaches suffer from several limitations, including misaligned requirement difficulty perception, the absence of requirement difficulty optimization, and suboptimal curriculum sampling strategies. In CRL-based code generation, programming requirements serve as the sole input to the model, making their quality and difficulty critical to training effectiveness. Motivated by insights from software requirements engineering, we propose RECRL, a novel requirement-aware curriculum reinforcement learning framework for enhancing LLM-based code generation. RECRL automatically perceives model-specific requirement difficulty, optimizes challenging requirements to improve training data utilization, and employs an adaptive curriculum sampling strategy to construct training batches with smoothly varying difficulty. Extensive experiments on five state-of-the-art LLMs across five widely-used code generation benchmarks by comparing with five state-of-the-art baselines, demonstrate the significant effectiveness of RECRL. For example, RECRL achieves an average Pass@1 improvement of 1.23%-5.62% over all state-of-the-art baselines.
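A minimal sketch of what "training batches with smoothly varying difficulty" could look like, assuming per-requirement difficulty scores are already available. This is a toy sampler, not RECRL's actual strategy.

```python
# Toy curriculum batching: sort requirements by a model-perceived difficulty
# score and emit fixed-size batches whose mean difficulty rises smoothly.

def curriculum_batches(difficulties, batch_size):
    order = sorted(range(len(difficulties)), key=lambda i: difficulties[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

diffs = [0.9, 0.2, 0.5, 0.7, 0.1, 0.4]   # illustrative difficulty scores
batches = curriculum_batches(diffs, batch_size=2)
means = [sum(diffs[i] for i in b) / len(b) for b in batches]
```

The resulting batch means are monotonically increasing, giving the easy-to-hard progression that curriculum training relies on.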
[AI-21] AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning
[Quick Read]: This paper addresses the credit-assignment problem that sparse rewards create when training large language model (LLM) agents with reinforcement learning (RL): it is hard to attribute final task outcomes to individual action steps. Traditional remedies rely on dense intermediate supervision (process reward models or auxiliary self-supervised signals), which adds tuning complexity and generalizes poorly. The key idea of AEM (Adaptive Entropy Modulation) is a supervision-free mechanism that adaptively modulates entropy dynamics during training: theoretically, entropy analysis is elevated from the token level to the response level to reduce sampling variance, and entropy drift under natural gradients is shown to be governed by the product of the advantage and the relative response surprisal. A practical proxy derived from this result reshapes training dynamics so the model transitions naturally from exploration to exploitation, improving performance across multiple benchmarks, including a 1.4% gain over a state-of-the-art baseline on SWE-bench-Verified.
Link: https://arxiv.org/abs/2605.00425
Authors: Haotian Zhao,Yuxin Zhang,Songlin Zhou,Stephen S.-T. Yau,Wenyu Zhang,Lun Tian,Tianshu Zhu,Yifeng Huang,Yucheng Zeng,Jingnan Gu,Daxiang Dong,Jianmin Wu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 27 pages
Abstract:Reinforcement learning (RL) has significantly advanced the ability of large language model (LLM) agents to interact with environments and solve multi-turn tasks. Yet effective training remains challenging, as sparse, outcome-only rewards make it difficult to assign credit to individual steps in an agent’s action trajectory. A common remedy is to introduce dense intermediate supervision, such as process reward models or auxiliary self-supervised signals, but this increases supervision and tuning complexity and often generalizes poorly across tasks and domains. This paper presents AEM, a supervision-free credit assignment method that adaptively modulates entropy dynamics during RL training to achieve a more effective exploration-exploitation trade-off. Theoretically, we elevate entropy analysis from the token level to the response level to reduce token sampling variance and show that entropy drift under natural gradients is intrinsically governed by the product of the advantage and the relative response surprisal. Specifically, we derive a practical proxy to reshape training dynamics, enabling a natural transition from exploration to exploitation. Extensive experiments across various benchmarks and models ranging from 1.5B to 32B parameters demonstrate the effectiveness of AEM, including a notable 1.4 percent gain when integrated into a state-of-the-art baseline on the highly challenging SWE-bench-Verified benchmark.
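The stated result — entropy drift governed by the product of advantage and relative response surprisal — suggests a simple batch-level quantity. The sketch below is our reading of that product with made-up rollout statistics; it is not the paper's derivation or proxy.

```python
# Batch average of advantage times relative response surprisal, where
# response-level surprisal is the negative mean token log-probability.
# Advantages, log-probs, and lengths below are invented for illustration.

def entropy_drift_proxy(advantages, logps, lengths):
    surprisal = [-lp / n for lp, n in zip(logps, lengths)]   # per-response
    mean_s = sum(surprisal) / len(surprisal)
    rel = [s - mean_s for s in surprisal]                    # relative surprisal
    return sum(a * r for a, r in zip(advantages, rel)) / len(rel)

# two rollouts: the advantaged one is also the more surprising one
proxy = entropy_drift_proxy(advantages=[1.0, -1.0],
                            logps=[-20.0, -5.0], lengths=[10, 10])
```

When advantage and relative surprisal are positively aligned, as here, the product term is positive; the sign of this quantity is what a modulation scheme could act on.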
[AI-22] BWLA: Breaking the Barrier of W1AX Post-Training Quantization for LLMs ACL
[Quick Read]: This paper targets the efficiency bottleneck that high memory and compute demands impose on deploying large language models (LLMs), and in particular the inability of existing binarization methods to handle heavy-tailed activation distributions, which prevents true end-to-end acceleration. The key innovations of the proposed BWLA (Binarized Weights and Low-bit Activations) framework are two modules: the Orthogonal-Kronecker Transformation (OKT), which learns an orthogonal mapping via EM minimization to convert unimodal weights into symmetric bimodal forms while suppressing activation tails and incoherence; and the Proximal SVD Projection (PSP), a lightweight low-rank refinement that further enhances quantizability with minimal overhead. The framework jointly quantizes weights to 1 bit and activations to low precision (e.g., 6 bits), preserving high accuracy while delivering substantial inference speedups.
Link: https://arxiv.org/abs/2605.00422
Authors: Zhixiong Zhao,Zukang Xu,Dawei Yang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by ACL-Main 2026
Abstract:Large language models (LLMs) have driven major progress in NLP, yet their substantial memory and compute demands still hinder practical deployment. Binarization can compress weights to 1 bit, fundamentally lowering compute and bandwidth cost. However, existing methods cannot address activation heavy tails and thus must keep activations in high precision, preventing true end-to-end acceleration. To overcome this limitation, we propose BWLA (Binarized Weights and Low-bit Activations), the first post-training quantization framework that preserves high accuracy while achieving 1-bit weight quantization together with low-bit activations (e.g., 6 bits). The Orthogonal-Kronecker Transformation (OKT) learns an orthogonal mapping via EM minimization, converting unimodal weights into symmetric bimodal forms while suppressing activation tails and incoherence. The Proximal SVD Projection (PSP) then performs lightweight low-rank refinement through proximal SVD projection, further enhancing quantizability with minimal overhead. On Qwen3-32B, BWLA reaches a Wikitext2 perplexity of 11.92 under 6-bit activations (vs. 38 from SOTA), improves five zero-shot tasks by more than 70%, and delivers 3.26 times inference speedup, demonstrating strong potential for real-world LLM compression and acceleration.
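For context, generic 1-bit weight quantization (sign binarization with an L2-optimal scale) looks as follows. BWLA's OKT and PSP modules operate on top of this kind of binarization and are not reproduced here.

```python
# Generic 1-bit weight quantization: approximate a weight vector by
# alpha * sign(w), with alpha = mean(|w|), the L2-optimal scale for a
# fixed sign pattern. Weights below are illustrative.

def binarize(w):
    alpha = sum(abs(v) for v in w) / len(w)
    return [alpha if v >= 0 else -alpha for v in w]

w = [0.4, -0.9, 0.1, -0.2]
wb = binarize(w)
err = sum((a - b) ** 2 for a, b in zip(w, wb))
```

The residual `err` is what transforms such as OKT aim to shrink by reshaping the weight distribution toward symmetric bimodal form before binarization.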
[AI-23] Trees to Flows and Back: Unifying Decision Trees and Diffusion Models ICML
[Quick Read]: This paper addresses the theoretical gap between two ostensibly disparate model classes, decision trees and diffusion models, by establishing a unified mathematical framework connecting them. The key contribution is a shared optimization principle, Global Trajectory Score Matching (GTSM), for which gradient boosting (in an idealized form) is shown to be asymptotically optimal. Building on this theoretical bridge, the authors instantiate two practical tools: TreeFlow, which achieves competitive tabular data generation with higher fidelity and roughly a 2x computational speedup, and DSTreeM, a novel distillation method that transfers hierarchical decision-tree logic into neural networks, matching teacher performance within 2% on many benchmarks.
Link: https://arxiv.org/abs/2605.00414
Authors: Sai Niranjan Ramachandran,Suvrit Sra
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI)
Comments: 12 pages (main), 68 pages (inclusive of appendix), Accepted in the Forty-Third International Conference on Machine Learning (ICML) 2026
Abstract:Decision trees and diffusion models are ostensibly disparate model classes, one discrete and hierarchical, the other continuous and dynamic. This work unifies the two by establishing a crisp mathematical correspondence between hierarchical decision trees and diffusion processes in appropriate limiting regimes. Our unification reveals a shared optimization principle: Global Trajectory Score Matching (GTSM), for which gradient boosting (in an idealized version) is asymptotically optimal. We underscore the conceptual value of our work through two key practical instantiations: TreeFlow, which achieves competitive generation quality on tabular data with higher fidelity and a 2x computational speedup, and DSTreeM, a novel distillation method that transfers hierarchical decision logic into neural networks, matching teacher performance within 2% on many benchmarks.
[AI-24] Physically Native World Models: A Hamiltonian Perspective on Generative World Modeling
[Quick Read]: This paper targets the shortcomings of current world models in physical reliability, action controllability, and long-horizon stability, which limit their use in embodied decision making. The key idea of the proposed Hamiltonian World Models is to encode observations into a structured latent phase space, evolve the latent state through Hamiltonian-inspired dynamics with control, dissipation, and residual terms, and decode the predicted trajectory into future observations to support planning. The approach is expected to improve interpretability, data efficiency, and long-horizon prediction stability, while facing practical challenges from friction, contact forces, non-conservative forces, and deformable objects in real-world robotic scenes.
Link: https://arxiv.org/abs/2605.00412
Authors: Sen Cui,Jingheng Ma
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:
Abstract:World models have recently re-emerged as a central paradigm for embodied intelligence, robotics, autonomous driving, and model-based reinforcement learning. However, current world model research is often dominated by three partially separated routes: 2D video-generative models that emphasize visual future synthesis, 3D scene-centric models that emphasize spatial reconstruction, and JEPA-like latent models that emphasize abstract predictive representations. While each route has made important progress, they still struggle to provide physically reliable, action-controllable, and long-horizon stable predictions for embodied decision making. In this paper, we argue that the bottleneck of world models is no longer only whether they can generate realistic futures, but whether those futures are physically meaningful and useful for action. We propose \emphHamiltonian World Models as a physically grounded perspective on world modeling. The key idea is to encode observations into a structured latent phase space, evolve the latent state through Hamiltonian-inspired dynamics with control, dissipation, and residual terms, decode the predicted trajectory into future observations, and use the resulting rollouts for planning. We discuss how Hamiltonian structure may improve interpretability, data efficiency, and long-horizon stability, while also noting practical challenges in real-world robotic scenes involving friction, contact, non-conservative forces, and deformable objects.
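The latent update described above — Hamiltonian dynamics plus control and dissipation terms — can be sketched with symplectic (semi-implicit) Euler on a toy quadratic Hamiltonian H(q, p) = (q² + p²)/2. All coefficients are illustrative, not the paper's.

```python
# Semi-implicit Euler on a damped, controlled harmonic oscillator:
#   dp/dt = -dH/dq - damping*p + u,   dq/dt = dH/dp = p

def step(q, p, u=0.0, damping=0.05, dt=0.01):
    p = p + dt * (-q - damping * p + u)   # momentum update (with dissipation)
    q = q + dt * p                        # position update uses the new momentum
    return q, p

def energy(q, p):
    return 0.5 * (q * q + p * p)

q, p = 1.0, 0.0
e0 = energy(q, p)
for _ in range(1000):       # roll out 10 time units with no control input
    q, p = step(q, p)
e1 = energy(q, p)
```

With dissipation the energy decays smoothly instead of drifting, which is the kind of long-horizon stability the Hamiltonian structure is meant to provide.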
[AI-25] Scalable Learning in Structured Recurrent Spiking Neural Networks without Backpropagation
[Quick Read]: This paper addresses scalable learning in deep recurrent spiking neural networks (SNNs) with sparse connectivity, in particular how to achieve stable, efficient supervised learning while preserving hardware scalability and energy efficiency. The key idea is a structured multi-layer recurrent SNN architecture: locally dense recurrent layers connect to a readout population through sparse small-world long-range projections, where the long-range connectivity is largely fixed to maintain routing efficiency and hardware compatibility. Learning relies on strictly local synaptic plasticity, combining three biologically motivated ingredients: population-based competitive winner-take-all (WTA) teaching signals at the output layer, fixed random broadcast alignment feedback pathways, and low-dimensional modulatory neuron populations that gate synaptic updates through three-factor learning rules with eligibility traces. The approach requires no backpropagation or surrogate gradients, enabling deep recurrent computation with sparse global communication and purely local synaptic updates, supporting efficient, stable training with good hardware feasibility.
Link: https://arxiv.org/abs/2605.00402
Authors: Bo Tang,Weiwei Xie
Affiliations: Unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 7 pages, 2 figures
Abstract:Spiking Neural Networks (SNNs) provide a promising framework for energy-efficient and biologically grounded computation; however, scalable learning in deep recurrent architectures with sparse connectivity remains a major challenge. In this work, we propose a structured multi-layer recurrent SNN architecture composed of locally dense recurrent layers augmented with sparse small-world long-range projections to a readout population. The long-range connectivity is largely fixed, preserving routing efficiency and hardware scalability, while synaptic adaptation is performed using strictly local plasticity mechanisms. To enable supervised learning without backpropagation or surrogate gradients, we introduce a biologically motivated learning framework that combines: (i) population-based winner-take-all (WTA) teaching signals at the output layer, (ii) fixed random broadcast alignment feedback pathways, and (iii) low-dimensional modulatory neuron populations that gate synaptic updates through three-factor learning rules with eligibility traces. This design supports deep recurrent computation with sparse global communication and purely local synaptic updates. We analyze the algorithmic properties, computational complexity, and hardware feasibility of the proposed approach, and demonstrate stable learning and competitive performance on benchmark classification tasks. The results highlight the potential of structured recurrence and neuromodulatory learning to enable scalable, hardware-compatible SNN training beyond gradient-based methods.
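A three-factor rule with an eligibility trace can be illustrated in a few lines: local pre/post coincidences build a trace, and a later modulatory signal commits the stored trace to the weight. The constants are arbitrary, not the paper's.

```python
# Toy three-factor update: eligibility trace e accumulates pre*post
# coincidences; a scalar neuromodulator gates the actual weight change.

def three_factor_step(w, e, pre, post, mod, lam=0.9, lr=0.1):
    e = lam * e + pre * post      # local eligibility trace (decays with lam)
    w = w + lr * mod * e          # modulator gates commitment of the trace
    return w, e

w, e = 0.0, 0.0
# coincident activity but no modulation: weight stays put, trace builds
w, e = three_factor_step(w, e, pre=1.0, post=1.0, mod=0.0)
w_before = w
# delayed modulatory (reward) signal commits the stored trace
w, e = three_factor_step(w, e, pre=0.0, post=0.0, mod=1.0)
```

This is what lets a delayed, low-dimensional reward signal assign credit to earlier local spike coincidences without any backpropagated gradients.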
[AI-26] Social Bias in LLM-Generated Code: Benchmark and Mitigation
[Quick Read]: This paper addresses social bias in code generated by large language models (LLMs), which can affect demographic groups unfairly in human-centered applications, whereas existing evaluations focus almost exclusively on functional correctness. The key element of the solution is the Fairness Monitor Agent (FMA), a modular component that plugs into any existing code generation pipeline without modifying its underlying architecture. FMA analyzes the task description to determine which demographic attributes should be considered or restricted, then detects and corrects bias violations through an iterative review process without requiring an executable test suite. Across 343 real-world tasks, FMA reduces bias by 65.1% while improving functional correctness to 83.97%.
Link: https://arxiv.org/abs/2605.00382
Authors: Fazle Rabbi,Lin Ling,Song Wang,Jinqiu Yang
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments:
Abstract:Large Language Models (LLMs) are increasingly deployed to generate code for human-centered applications where demographic fairness is critical. However, existing evaluations focus almost exclusively on functional correctness, leaving social bias in LLM-generated code largely unexamined. Extending our prior work on Solar, we conduct a comprehensive empirical study using SocialBias-Bench, a benchmark of 343 real-world coding tasks spanning seven demographic dimensions. We evaluate four prominent LLMs and find severe bias across all models, with Code Bias Scores reaching up to 60.58%. We further show that standard prompt-level interventions, such as Chain-of-Thought reasoning and fairness persona assignment, inadvertently amplify bias rather than reduce it. We then investigate whether structured multi-agent software process frameworks can improve fairness, finding that structured pipelines reduce bias when early roles correctly scope what the code should and should not consider. However, adding explicit fairness instructions to all agent roles produces worse outcomes than providing none, suggesting that diffused responsibility goes unaddressed. To address these limitations, we propose the Fairness Monitor Agent (FMA), a modular component that plugs into any existing code generation pipeline without modifying it. FMA analyzes the task description to determine which attributes should be considered or restricted, then detects and corrects violations through an iterative review process, without requiring an executable test suite. Evaluated on all 343 tasks, FMA reduces bias by 65.1% compared to a developer agent alone and improves functional correctness from 75.80% to 83.97%, outperforming all other studied approaches.
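A stripped-down version of FMA's detection step might look like the attribute scan below. The attribute list, regex tokenization, and allowed-set interface are our assumptions, not the paper's implementation.

```python
# Flag demographic attribute names that appear in generated code but are
# not in the task's allowed set. Attribute vocabulary is illustrative.
import re

DEMOGRAPHIC_ATTRS = {"gender", "race", "age", "religion", "disability"}

def fairness_violations(code, allowed):
    used = set(re.findall(r"[A-Za-z_]\w*", code))
    return sorted((used & DEMOGRAPHIC_ATTRS) - set(allowed))

generated = "def premium(age, gender):\n    return base * (1.2 if gender == 'M' else 1.0)"
violations = fairness_violations(generated, allowed={"age"})
```

In an iterative loop, each detected violation would be fed back to the developer agent for correction and the scan repeated — no executable test suite is needed for this check.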
[AI-27] GaMMA: Towards Joint Global-Temporal Music Understanding in Large Multimodal Models
[Quick Read]: This paper addresses the difficulty of cross-modal learning for music understanding and of unifying time-series and non-time-series music understanding tasks within one model. The key idea is GaMMA, a large multimodal model (LMM) built on a streamlined encoder-decoder architecture that integrates audio encoders in a mixture-of-experts manner, unifying temporal and global musical features under a single set of parameters. Combined with carefully curated datasets at scale and a progressive training pipeline (pretraining, supervised fine-tuning, and reinforcement learning), GaMMA substantially advances music understanding performance.
Link: https://arxiv.org/abs/2605.00371
Authors: Zuyao You,Zhesong Yu,Mingyu Liu,Bilei Zhu,Yuan Wan,Zuxuan Wu
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:
Abstract:In this paper, we propose GaMMA, a state-of-the-art (SoTA) large multimodal model (LMM) designed to achieve comprehensive musical content understanding. GaMMA inherits the streamlined encoder-decoder design of LLaVA, enabling effective cross-modal learning between music and language. By incorporating audio encoders in a mixture-of-experts manner, GaMMA effectively unifies both time-series and non-time-series music understanding tasks within one set of parameters. Our approach combines carefully curated datasets at scale with a progressive training pipeline, effectively pushing the boundaries of music understanding via pretraining, supervised fine-tuning (SFT), and reinforcement learning (RL). To comprehensively assess both temporal and non-temporal capability of music LMMs, we introduce MusicBench, the largest music-oriented benchmark, comprising 3,739 human-curated multiple-choice questions covering diverse aspects of musical understanding. Extensive experiments demonstrate that GaMMA establishes new SoTA in the music domain, achieving 79.1% accuracy on MuchoMusic, 79.3% on MusicBench-Temporal, and 81.3% on MusicBench-Global, consistently outperforming previous methods.
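The "mixture-of-experts" integration of audio encoders plausibly reduces to gate-weighted fusion of expert embeddings. The schematic below assumes that form with toy dimensions; it is not GaMMA's actual architecture.

```python
# Gate-weighted fusion of expert outputs: softmax over gate logits, then a
# weighted sum of each expert's embedding. Dimensions are toy-sized.
from math import exp

def moe_fuse(expert_outputs, gate_logits):
    zmax = max(gate_logits)                       # stable softmax
    ws = [exp(g - zmax) for g in gate_logits]
    total = sum(ws)
    ws = [w / total for w in ws]
    dim = len(expert_outputs[0])
    fused = [sum(ws[k] * expert_outputs[k][d] for k in range(len(ws)))
             for d in range(dim)]
    return fused, ws

fused, weights = moe_fuse([[1.0, 0.0], [0.0, 1.0]], gate_logits=[2.0, 0.0])
```

A learned gate of this kind lets one parameter set route between, say, a temporal and a global audio encoder depending on the input.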
[AI-28] AlphaInventory: Evolving White-Box Inventory Policies via Large Language Models with Deployment Guarantees
[Quick Read]: This paper studies how large language models (LLMs) can evolve inventory policies in online, non-stationary environments. Traditional LLM-based evolutionary search (e.g., AlphaEvolve), while strong on static, highly structured problems, is not directly suited to dynamic inventory settings. The key idea of the proposed AlphaInventory framework is confidence-interval-based certification: an LLM is trained with reinforcement learning and, incorporating demand data plus numerical and textual features, generates white-box inventory policies with statistical safety guarantees for deployment; a unified theoretical interface connects training, inference, and deployment, quantifying both the probability of evolving a safe, improved policy and the deployment gap relative to an oracle benchmark. On both synthetic data and real-world retail data, AlphaInventory outperforms classical inventory policies and deep-learning-based methods.
Link: https://arxiv.org/abs/2605.00369
Authors: Chenyu Huang,Jianghao Lin,Zhengyang Tang,Bo Jiang,Ruoqing Jiang,Benyou Wang,Lai Wei
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:We study how large language models can be used to evolve inventory policies in online, non-stationary environments. Our work is motivated by recent advances in LLM-based evolutionary search, such as AlphaEvolve, which demonstrates strong performance for static and highly structured problems such as mathematical discovery, but is not directly suited to online dynamic inventory settings. To this end, we propose AlphaInventory, an end-to-end inventory-policy evolution and inference framework grounded in confidence-interval-based certification. The framework trains a large language model using reinforcement learning, incorporates demand data as well as numerical and textual features beyond demand, and generates white-box inventory policy with statistical safety guarantees for deployment in future periods. We further introduce a unified theoretical interface that connects training, inference, and deployment. This allows us to characterize the probability that the AlphaInventory evolves a statistically safe and improved policy, and to quantify the deployment gap relative to the oracle-safe benchmark. Tested on both synthetic data and real-world retail data, AlphaInventory outperforms classical inventory policies and deep learning based methods. In canonical inventory settings, it evolves new policies that improve upon existing benchmarks.
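A minimal version of confidence-interval-based certification is a normal-approximation gate on measured improvements: deploy only if the lower CI bound is positive. The sketch below assumes that simple form and invented per-period gains; the paper's certification is more elaborate.

```python
# Deploy a candidate policy only if the lower bound of a 95% CI on its
# per-period improvement over the incumbent exceeds zero.
from math import sqrt
from statistics import mean, stdev

def certify(improvements, z=1.96):
    n = len(improvements)
    lower = mean(improvements) - z * stdev(improvements) / sqrt(n)
    return lower > 0, lower

gains = [0.8, 1.2, 0.9, 1.1, 1.0, 0.7, 1.3, 1.0]   # made-up cost reductions
deploy, lower = certify(gains)
```

The gate is conservative by design: a policy with positive mean improvement but high variance, or too few evaluation periods, would be held back.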
[AI-29] Pedagogical Promise and Peril of AI: A Text Mining Analysis of ChatGPT Research Discussions in Programming Education
[Quick Read]: This paper addresses the lack of a clear conceptual framework and systematic understanding in scholarly discussions of generative AI's role in programming education. Applying text mining to publications indexed in a leading academic database, the study identifies four dominant themes: pedagogical implementation, student-centered learning and engagement, AI infrastructure and human-AI collaboration, and assessment, prompting, and model evaluation. The key finding is that the literature concentrates on classroom practice and learner interaction while giving comparatively little attention to assessment design and institutional governance, and that ChatGPT is framed both as a learning aid and as a pedagogical risk, supporting responsible integration and highlighting the need for stronger assessment and governance mechanisms.
Link: https://arxiv.org/abs/2605.00361
Authors: Juvy C. Grume,John Paul P. Miranda,Aileen P. De Leon,Jordan L. Salenga,Hilene E. Hernandez,Mark Anthony A. Castro,Vernon Grace M. Maniago,Joel D. Canlas,Joel B. Quiambao
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 3 tables, 14 pages, book chapter
Abstract:GenAI systems such as ChatGPT are increasingly discussed in programming education, but the ways in which the research literature conceptualizes and frames their role remain unclear. This chapter applies text mining to publications indexed in a leading academic database to map scholarly discourse on ChatGPT in programming education. Term frequency analysis, phrase pattern extraction, and topic modeling reveal four dominant themes: pedagogical implementation, student-centered learning and engagement, AI infrastructure and human-AI collaboration, and assessment, prompting, and model evaluation. The literature prioritizes classroom practice and learner interaction, with comparatively limited attention to assessment design and institutional governance. Across studies, ChatGPT is positioned both as a learning aid that supports explanation, feedback, and efficiency and as a pedagogical risk linked to overreliance, unreliable outputs, and academic integrity concerns. These findings support responsible integration and highlight the need for stronger assessment and governance mechanisms.
[AI-30] VQ-SAD: Vector Quantized Structure Aware Diffusion For Molecule Generation
[Quick Read]: This paper addresses shortcomings in how existing diffusion-based molecule generation methods handle symbolic molecular information (atom and bond types): one-hot encodings ignore symbolic semantics, Morgan fingerprints suffer hash collisions and are hard to embed into a continuous space, and random fingerprints correspond to no valid molecule. The key idea is to treat atom and bond types as latent variables of a VQ-VAE and to use the codebooks of a frozen pretrained VQ-VAE as discrete tokenizers for the downstream diffusion process, yielding a neuro-symbolic model, VQ-SAD. With a learnable forward process and a large discrete code space, the method obtains a more balanced distribution over atom and bond types, enhancing the denoising process and outperforming state-of-the-art diffusion models on the QM9 and ZINC250k datasets.
Link: https://arxiv.org/abs/2605.00354
Authors: Farshad Noravesh,Reza Haffari,Layki Soon,Arghya Pal
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 17 pages
Abstract:Many diffusion-based molecule generation methods ignore the symbolic information of molecules and represent atom and bond types as one-hot vectors. Methods based on Morgan fingerprints produce hash collisions and are hard to embed into a continuous space without information loss, and random fingerprints correspond to no valid molecule. To circumvent this issue we adopt another paradigm and consider atom and bond codes as latent variables of a VQ-VAE. We introduce VQ-SAD, which first trains a VQ-VAE and then uses the frozen pretrained model, treating the codebooks for both atom and bond types as tokenizers for the downstream diffusion process. VQ-SAD is a neuro-symbolic model that utilizes both symbolic and neural structural information in a diffusion-based model with a learnable forward process. The large discrete code space provides more balanced atom and bond types, which enhances the denoising process. VQ-SAD slightly outperforms SOTA models for diffusion-based molecule generation on the QM9 and ZINC250k datasets.
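The codebook-as-tokenizer idea rests on standard VQ-VAE mechanics: each continuous embedding maps to the index of its nearest codebook entry. A bare-bones lookup with a toy codebook (not the paper's trained one) is:

```python
# Nearest-neighbor vector quantization: map each continuous atom/bond
# embedding to the index of the closest codebook vector (squared L2).

def quantize(vec, codebook):
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: d2(vec, codebook[i]))

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # toy 3-code, 2-dim codebook
codes = [quantize(v, codebook) for v in [[0.1, 0.1], [0.9, 0.2], [0.2, 0.8]]]
```

The resulting discrete codes are what a downstream diffusion process operates on, in place of one-hot atom and bond labels.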
[AI-31] Hypergraph and Latent ODE Learning for Multimodal Root Cause Localization in Microservices
[Quick Read]: This paper addresses root cause localization in cloud-native microservice systems, whose core challenges are modeling complex service dependencies, irregular temporal dynamics, and heterogeneous observability data (logs, traces, metrics, and more). The key idea is HyperODE RCA, a unified framework combining hypergraph attention learning, latent ordinary differential equations (ODEs), and multimodal cross-attention fusion: higher-order service interactions are learned through differentiable hyperedge construction, anomaly evolution is captured from irregular observations with an ODE-RNN encoder, and heterogeneous modalities are fused adaptively via context-aware modality routing. Robustness is further improved with a variational information bottleneck, temporal causal regularization, and invariant risk constraints. On the Tianchi AIOps benchmark the method clearly outperforms strong baselines in ranking and classification while preserving interpretability through learned hypergraph attention.
Link: https://arxiv.org/abs/2605.00351
Authors: Xin Liu,Yuhang He,Sichen Zhao,Kejian Tong,Xingyu Zhang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Root cause localization in cloud native microservice systems requires modeling complex service dependencies, irregular temporal dynamics, and heterogeneous observability data. We present HyperODE RCA, a unified framework that combines hypergraph attention learning, latent ordinary differential equations, and multimodal cross attention fusion for fine grained root cause analysis. The method learns higher order service interactions through differentiable hyperedge construction, captures continuous anomaly evolution from irregular observations with an ODE RNN encoder, and adaptively fuses logs, traces, metrics, entities, and events using context aware modality routing. We further improve robustness with a variational information bottleneck, temporal causal regularization, and invariant risk constraints. Experiments on the Tianchi AIOps benchmark show clear gains over strong baselines in ranking and classification performance, while preserving interpretability through learned hypergraph attention.
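The ODE-RNN idea for irregular observations can be sketched with a closed-form linear decay between timestamps and a plain update at each observation. Dynamics and weights below are illustrative, not the learned model.

```python
# Between observations, evolve the hidden state under dh/dt = -decay*h
# (closed form); at each observation, apply an RNN-style update.
from math import exp, tanh

def ode_rnn(observations, decay=0.5, w_h=0.8, w_x=1.0):
    h, t_prev = 0.0, None
    for t, x in observations:                    # (timestamp, value) pairs
        if t_prev is not None:
            h = h * exp(-decay * (t - t_prev))   # evolve through the gap
        h = tanh(w_h * h + w_x * x)              # fold in the new observation
        t_prev = t
    return h

h_dense = ode_rnn([(0.0, 1.0), (0.1, 1.0)])     # short gap: state persists
h_sparse = ode_rnn([(0.0, 1.0), (5.0, 1.0)])    # long gap: state decays first
```

The gap-dependent decay is what lets the same encoder handle densely and sparsely sampled signals consistently, which fixed-step RNNs cannot.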
[AI-32] AI Adoption Among Teachers: Insights on Concerns Support Confidence and Attitudes
Quick Read: This paper studies the adoption of artificial intelligence (AI) tools in education, focusing on the relationships among institutional support, teacher confidence, and teacher concerns, and testing whether teacher concerns moderate the effect of institutional support on confidence and on attitudes toward AI adoption. The study finds that institutional support significantly improves teacher confidence and attitudes toward AI, but teacher concerns do not significantly moderate this relationship; a follow-up mediation analysis shows that confidence fully mediates the link between institutional support and attitudes, i.e., institutional support improves attitudes toward AI by strengthening teacher confidence. The key implication is that institutions should provide structured and ongoing support (professional development, mentoring, and AI integration in teacher education programs) to systematically build teacher confidence and thereby enable effective AI adoption in teaching.
Link: https://arxiv.org/abs/2605.00343
Authors: Vanessa B. Sibug, Maria Anna D. Cruz, Vicky P. Vital, Juvy C. Grume, Almer B. Gamboa, Emerson Q. Fernando, Lloyd D. Feliciano, Jordan L. Salenga, John Paul P. Miranda
Affiliation: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 3 pages, conference proceedings, open access
Abstract:The study examines the adoption of artificial intelligence (AI) tools in education by analyzing the roles of institutional support, teacher confidence, and teacher concerns. It aims to determine whether teacher concerns moderate the relationship between institutional support and two outcomes: teacher confidence and attitudes toward AI adoption. The sample included 260 teachers from the Philippines. Composite scores were calculated for institutional support, confidence, concerns, and attitudes. Moderated multiple regression analysis showed that institutional support significantly predicted both teacher confidence and attitudes toward AI. However, teacher concerns did not significantly moderate these relationships. A follow-up mediation analysis tested whether confidence explains the effect of institutional support on attitudes. Results showed full mediation. The indirect effect was significant based on the Sobel test, and the direct effect became non-significant when confidence was included in the model. This shows that institutional support improves teacher attitudes by increasing their confidence. The study recommends that institutions provide structured and ongoing support to strengthen teacher confidence. Professional development, mentoring, and AI integration in teacher education programs can increase readiness and support effective AI adoption.
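The mediation result above rests on a Sobel test of the indirect effect (support → confidence → attitudes). A minimal sketch of that statistic, using made-up path coefficients rather than the paper's estimates:

```python
import math

def sobel_z(a, sa, b, sb):
    """Sobel test statistic for an indirect effect a*b.

    a, sa: path coefficient (support -> confidence) and its standard error
    b, sb: path coefficient (confidence -> attitudes) and its standard error
    """
    return (a * b) / math.sqrt(b**2 * sa**2 + a**2 * sb**2)

# Illustrative (made-up) coefficients, not the paper's estimates:
z = sobel_z(a=0.52, sa=0.06, b=0.61, sb=0.07)
print(z)  # |z| > 1.96 indicates a significant indirect effect at p < .05
```

Full mediation is then the combination of a significant z here with a direct effect that becomes non-significant once confidence enters the model.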
[AI-33] Unbox Responsible GeoAI: Navigating Climate Extreme and Disaster Mapping
Quick Read: This paper addresses the risks of purely performance-driven deployment of Geospatial Artificial Intelligence (GeoAI) for climate extreme and disaster mapping, which can amplify spatial inequalities, hinder emergency decision-making, and produce a severe environmental carbon footprint. The key to the solution is a responsible GeoAI governance framework that organizes governance practices along three scopes (Data, Application, and Society) and integrates four theoretical dimensions (Representativeness, Explainability, Sustainability, and Ethics), aiming at the responsible, ethical, and sustainable deployment of GeoAI for climate resilience.
Link: https://arxiv.org/abs/2605.00315
Authors: Hao Li, Steffen Knoblauch
Affiliation: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:
Abstract:As climate extreme and disaster events become more frequent and intense, Geospatial Artificial Intelligence (GeoAI) has emerged as a transformative approach for large-scale disaster mapping and risk reduction. However, the purely mechanical, performance-driven deployment of GeoAI models can result in amplifying inherent spatial inequalities, preventing effective emergency decision-making, and producing severe environmental carbon footprint. To unbox the concept of responsible GeoAI, this position paper examines its emerging role, e.g., in climate extreme and disaster mapping, from a critical GIS perspective. We address the nexus of responsible GeoAI through four interrelated theoretical dimensions, specifically Representativeness, Explainability, Sustainability, and Ethics, with examples from climate extreme and disaster mapping. Moreover, targeting operational practice, we then propose a conceptual governance model of responsible GeoAI that categorizes its governance practices into Data, Application, and Society scopes. Last, this position paper aims to raise the attention in the broader GIS community that the future of climate resilience relies not just on building better algorithms, but on fostering a governance ecosystem where GeoAI is deployed responsibly, ethically, and sustainably.
[AI-34] Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis
Quick Read: This paper targets the semantic security risks of agent skills driven by generative AI: conventional static analysis tools cannot understand a skill's natural-language trigger conditions and execution logic, while LLM-based approaches cannot provide reproducible security proofs. The key to the solution is Semia, a static auditing framework for agent skills that lifts each skill into the structured Skill Description Language (SDL), a Datalog fact base explicitly modeling LLM-triggered actions, natural-language-defined conditions, and human-in-the-loop checkpoints. It further introduces Constraint-Guided Representation Synthesis (CGRS), an iterative propose-verify-evaluate loop that refines LLM-generated SDL representations until they converge on structural soundness and semantic faithfulness. Security properties (indirect injection, secret leakage, confused deputies, etc.) can then be expressed as Datalog reachability queries, enabling efficient and accurate security audits of real-world skills at scale.
Link: https://arxiv.org/abs/2605.00314
Authors: Hongbo Wen, Ying Li, Hanzhi Liu, Chaofan Shou, Yanju Chen, Yuan Tian, Yu Feng
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments:
Abstract:An agent skill is a configuration package that equips an LLM-driven agent with a concrete capability, such as reading email, executing shell commands, or signing blockchain transactions. Each skill is a hybrid artifact-a structured half declares executable interfaces, while a prose half dictates when and how those interfaces fire-and the prose is reinterpreted probabilistically on every invocation. Conventional static analyzers parse the structured half but ignore the prose; LLM-based tools read the prose but cannot reproducibly prove that a tainted input reaches a high-impact sink. We present Semia, a static auditor for agent skills. Semia lifts each skill into the Skill Description Language (SDL), a Datalog fact base that captures LLM-triggered actions, prose-defined conditions, and human-in-the-loop checkpoints. Synthesizing a fact base that is both structurally sound and semantically faithful to the original prose is the central challenge; we address it with Constraint-Guided Representation Synthesis (CGRS), a propose-verify-evaluate loop that refines LLM candidates until convergence. Security properties (e.g., indirect injection, secret leakage, confused deputies, unguarded sinks, etc.) over an agent skill can then be reduced to Datalog reachability queries. We evaluate Semia on 13,728 real-world skills from public marketplaces. Semia renders all of them auditable and finds that more than half carry at least one critical semantic risk. On a stratified sample of 541 expert-labeled skills, Semia achieves 97.7% recall and an F1 of 90.6%, substantially outperforming signature-based scanners and LLM baselines.
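Reducing "a tainted input reaches a high-impact sink" to a reachability query can be sketched as a fixpoint over flow facts. A hypothetical toy example in the spirit of the paper's Datalog queries (the skill names and edges below are invented, not from the paper):

```python
def reachable(edges, sources, sinks):
    """Return the sink nodes reachable from any tainted source
    via transitive closure over flow edges (a Datalog-style query)."""
    seen = set(sources)
    frontier = list(sources)
    while frontier:
        node = frontier.pop()
        for src, dst in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return seen & set(sinks)

# flows(X, Y): data may flow from X to Y inside a skill (invented facts)
flows = [("email_body", "llm_prompt"),
         ("llm_prompt", "shell_command"),
         ("config_file", "log_output")]
hits = reachable(flows, sources={"email_body"}, sinks={"shell_command"})
print(hits)  # {'shell_command'}: untrusted email text can reach a shell sink
```

An empty result for a given (source, sink) pair would indicate that the audited property does not fire on that skill's fact base.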
[AI-35] TokenArena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference
Quick Read: This paper addresses the mismatch between public inference benchmarks, which compare systems at the model and provider level, and the unit at which deployment decisions are actually made: the endpoint (the provider, model, stock-keeping-unit tuple). Existing benchmarks ignore key variables such as quantization, decoding strategy, region, and serving stack, so their results diverge from real deployment scenarios. The key to the solution is TokenArena, a continuous endpoint-level inference benchmark that measures each endpoint along five axes (output speed, time to first token, workload-blended price, effective context, and live endpoint quality) and, together with a modeled energy estimate, synthesizes three composite metrics: joules per correct answer, dollars per correct answer, and endpoint fidelity (output-distribution similarity to a first-party reference). The analysis reveals that the same model can vary substantially across endpoints, motivating more accurate, reproducible, production-relevant evaluation of AI inference.
Link: https://arxiv.org/abs/2605.00300
Authors: Yuxuan Gao, Megan Wang, Yi Ling Yu
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF)
Comments: 14 pages, 1 figure, 8 tables
Abstract:Public inference benchmarks compare AI systems at the model and provider level, but the unit at which deployment decisions are actually made is the endpoint: the (provider, model, stock-keeping-unit) tuple at which a specific quantization, decoding strategy, region, and serving stack is exposed. We introduce TokenArena, a continuous benchmark that measures inference at endpoint granularity along five core axes (output speed, time to first token, workload-blended price, effective context, and quality on the live endpoint) and synthesizes them, together with a modeled energy estimate, into three headline composites: joules per correct answer, dollars per correct answer, and endpoint fidelity (output-distribution similarity to a first-party reference). The framework’s novelty is empirical and methodological. Across 78 endpoints serving 12 model families, the same model on different endpoints differs in mean accuracy by up to 12.5 points on math and code, in fingerprint similarity to first party by up to 12 points, in tail latency by an order of magnitude, and in modeled joules per correct answer by a factor of 6.2. We further show that workload-aware blended pricing reorders the leaderboard substantially: 7 of 10 top-ranked endpoints under the chat preset (3:1 input:output) fall out of the top 10 under the retrieval-augmented preset (20:1), and the reasoning preset (1:5) elevates frontier closed models that the chat preset penalizes on price. We release the framework, schema, probe and eval harness, and a v1.0 leaderboard snapshot under CC BY 4.0. TokenArena is a methodology, not a single ranking; we publish full provenance and limitations and welcome external replication.
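The two cost composites above are simple ratios once per-endpoint measurements are in hand. A minimal sketch with illustrative endpoint numbers (not figures from the paper's leaderboard):

```python
def joules_per_correct(energy_j, n_answers, accuracy):
    """Modeled energy divided by the number of correct answers."""
    return energy_j / (n_answers * accuracy)

def dollars_per_correct(price_per_mtok, tokens_used, n_answers, accuracy):
    """Workload-blended spend divided by the number of correct answers."""
    cost = price_per_mtok * tokens_used / 1e6  # price quoted per million tokens
    return cost / (n_answers * accuracy)

# Illustrative endpoint: 200 eval answers at 75% accuracy
jpc = joules_per_correct(energy_j=9000.0, n_answers=200, accuracy=0.75)
dpc = dollars_per_correct(price_per_mtok=2.5, tokens_used=4_000_000,
                          n_answers=200, accuracy=0.75)
print(jpc)  # 60.0 joules per correct answer
print(round(dpc, 4))  # ~0.0667 dollars per correct answer
```

Because accuracy sits in the denominator, a cheap but inaccurate endpoint can still rank poorly on both composites, which is the point of unifying energy, price, and quality.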
[AI-36] Caracal: Causal Architecture via Spectral Mixing ICML2026
Quick Read: This paper tackles two core obstacles to scaling large language models to long sequences: the quadratic cost of attention and the limitations of positional encodings. The key to the solution is Caracal, a new architecture that replaces attention with a parameter-efficient O(L log L) Multi-Head Fourier (MHF) module. It uses the Fast Fourier Transform (FFT) for sequence mixing and enforces autoregressive modeling via frequency-domain causal masking, achieving strong portability and efficient long-sequence modeling without hardware-specific optimizations.
Link: https://arxiv.org/abs/2605.00292
Authors: Bingzheng Gan, Tianyi Zhang, Yusu Li, Jing Huang, Wei Shi, Yangkai Ding, Tao Yu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by ICML 2026
Abstract:The scalability of Large Language Models to long sequences is hindered by the quadratic cost of attention and the limitations of positional encodings. To address these, we introduce Caracal, a novel architecture that replaces attention with a parameter-efficient, O(L log L) Multi-Head Fourier (MHF) module. Our contributions are threefold: (1) We leverage the Fast Fourier Transform (FFT) for sequence mixing, inherently addressing both bottlenecks mentioned above. (2) We apply a frequency-domain causal masking technique that enforces autoregressive capabilities via asymmetric padding and truncation, overcoming a critical barrier for Fourier-based generative models. (3) Unlike efficient models relying on hardware-specific implementations (e.g., Mamba), Caracal uses standard library operators. This ensures robust portability, eliminating common deployment barriers. Evaluations demonstrate that Caracal performs competitively with Transformer and SSM baselines, offering a scalable and simple pathway for efficient long-sequence modeling. Code is available in the Appendix.
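The padding-and-truncation idea can be illustrated with a causal convolution computed in the frequency domain: zero-padding to 2L turns circular convolution into linear convolution, and truncating back to L keeps only outputs that depend on current and past positions. This is a sketch of frequency-domain causal mixing under those assumptions, not the paper's exact MHF module:

```python
import numpy as np

def causal_fft_mix(x, kernel):
    """Mix a length-L sequence with a causal kernel in O(L log L).

    Padding to 2*L avoids circular wrap-around (asymmetric zero-padding);
    truncating the inverse transform to L enforces the causal mask.
    """
    L = len(x)
    X = np.fft.rfft(x, n=2 * L)           # zero-padded forward transforms
    K = np.fft.rfft(kernel, n=2 * L)
    return np.fft.irfft(X * K, n=2 * L)[:L]  # truncate to the causal part

x = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([0.5, 0.25, 0.0, 0.0])       # causal kernel: taps on t and t-1
y = causal_fft_mix(x, k)
# y[t] = 0.5*x[t] + 0.25*x[t-1]; no output position sees the future
print(np.round(y, 4))  # [0.5  1.25 2.   2.75]
```

In a learned model the kernel (or its spectrum) would be a trainable per-head parameter; here it is fixed only to make the causality check explicit.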
[AI-37] Agentic AI for Trip Planning Optimization Application
Quick Read: This paper addresses the lack of objective evaluation of optimization performance in trip planning for intelligent vehicles: existing systems focus on producing feasible routes rather than optimal ones, and without ground-truth optima it is hard to quantify optimization quality. The key to the solution is an agentic framework in which an orchestration agent coordinates specialized agents (traffic, charging, and points of interest), together with the Trip-planning Optimization Problems (TOP) dataset, which supplies definitive optimal solutions and category-level task structure, enabling dynamic refinement and fine-grained evaluation. Experiments show the system reaches 77.4% accuracy on the TOP benchmark, significantly outperforming single-agent and workflow-based multi-agent baselines and demonstrating the value of orchestrated agentic reasoning for trip-planning optimization.
Link: https://arxiv.org/abs/2605.00276
Authors: Tiejin Chen, Ahmadreza Moradipari, Kyungtae Han, Hua Wei, Nejib Ammar
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted to IV 2026
Abstract:Trip planning for intelligent vehicles increasingly requires selecting optimal routes rather than merely producing feasible itineraries, as interacting factors such as travel time, energy consumption, and traffic conditions directly affect plan quality. Yet existing systems are largely designed for feasibility-oriented planning, and current benchmarks provide only reference answers without ground truth, preventing objective evaluation of optimization performance. In our paper, we address these limitations with an agentic AI framework that enables dynamic refinement through an orchestration agent coordinating specialized agents for traffic, charging, and points of interest, and with the Trip-planning Optimization Problems Dataset, which supplies definitive optimal solutions and category-level task structure for fine-grained analysis. Experiments show that our system achieves 77.4% accuracy on the TOP Benchmark, significantly outperforming single-agent and workflow-based multi-agent baselines, demonstrating the importance of orchestrated agentic reasoning for robust trip planning optimization.
[AI-38] Jailbroken Frontier Models Retain Their Capabilities
Quick Read: This paper asks whether the effectiveness of language-model safeguards against increasingly complex jailbreaks depends on a loss of model capability. Prior work holds that jailbreaks impose a "jailbreak tax" that degrades task performance, but this study finds the tax scales inversely with capability: the most capable model tested (Claude Opus 4.6) loses only 7.7% under the strongest attacks, while a weaker model (Haiku 4.5) loses 33.1%. The key takeaway is that safety cases for frontier models should not assume jailbreaks meaningfully degrade capability; defenses should be reassessed against efficient attacks such as Boundary Point Jailbreaking, which achieves near-perfect classifier evasion with near-zero performance loss.
Link: https://arxiv.org/abs/2605.00267
Authors: Daniel Zhu, Zihan Wang, Jenny Bao, Jerry Wei
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:
Abstract:As language model safeguards become more robust, attackers are pushed toward developing increasingly complex jailbreaks. Prior work has found that this complexity imposes a “jailbreak tax” that degrades the target model’s task performance. We show that this tax scales inversely with model capability and that the most advanced jailbreaks effectively yield no reduction in model capabilities. Evaluating 28 jailbreaks on five benchmarks across Claude models ranging in capability from Haiku 4.5 to Opus 4.6, we find Haiku 4.5 loses an average of 33.1% on benchmark performance when jailbroken, while Opus 4.6 at max thinking effort loses only 7.7%. We also observe that across all models, reasoning-heavy tasks display considerably more degradation than knowledge-recall tasks. Finally, Boundary Point Jailbreaking, currently the strongest jailbreak against deployed classifiers, achieves near-perfect classifier evasion with near-zero degradation across safeguarded models. We recommend that safety cases for frontier models should not rely on a meaningful capability degradation from jailbreaks.
[AI-39] Rethinking Network Topologies for Cost-Effective Mixture-of-Experts LLM Serving
Quick Read: This paper questions whether the expensive high-bandwidth scale-up networks the industry currently deploys for Mixture-of-Experts (MoE) LLM serving are actually cost-effective, given the substantial communication overhead of MoE inference. The key to the solution is a systematic cross-layer analysis comparing the performance-cost trade-offs of four representative XPU (e.g., GPU/TPU) topologies: scale-up, scale-out, 3D torus, and 3D full-mesh. It finds that lower-cost switchless topologies, especially the 3D full-mesh, outperform scale-up across all serving scenarios explored, improving cost-effectiveness by 20.6%-56.2%, and that current scale-up link bandwidth is over-provisioned: moderately reducing it improves throughput per cost by up to 27%. A forward-looking analysis further suggests the cost advantage of switchless networks will persist across future GPU generations, providing both evidence and a practical path toward cost-efficient MoE LLM serving.
Link: https://arxiv.org/abs/2605.00254
Authors: Junsun Choi, Sam Son, Sunjin Choi, Hansung Kim, Yakun Sophia Shao, Scott Shenker, Sylvia Ratnasamy, Borivoje Nikolic
Affiliation: Unknown
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments:
Abstract:Mixture-of-experts (MoE) architectures have turned LLM serving into a cluster-scale workload in which communication consumes a considerable portion of LLM serving runtime. This has prompted industry to invest heavily in expensive high-bandwidth scale-up networks. We question whether such costly infrastructure is strictly necessary. We present the first systematic cross-layer analysis of network cost-effectiveness for MoE LLM serving, comparing four representative XPU (e.g., GPU/TPU) topologies (scale-up, scale-out, 3D torus, and 3D full-mesh). We find that lower-cost switchless topologies are more cost-effective than the scale-up topology across all serving scenarios explored, improving cost-effectiveness by 20.6-56.2%. In particular, the 3D full-mesh topology is Pareto-optimal in terms of the performance-cost tradeoff. We also find that current scale-up link bandwidths are over-provisioned: reducing the link bandwidth improves throughput per cost by up to 27%. A forward-looking analysis of upcoming GPU generations indicates that the cost-performance advantage of switchless networks will likely persist.
[AI-40] ARMOR 2025: A Military-Aligned Benchmark for Evaluating Large Language Model Safety Beyond Civilian Contexts
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在军事防御场景中缺乏针对作战法规与伦理规范的安全评估问题。现有安全基准主要关注通用社会风险,未能检验模型是否遵循《战争法》(Law of War)、《交战规则》(Rules of Engagement)及《联合道德条例》(Joint Ethics Regulation)等军事核心准则。解决方案的关键在于提出ARMOR 2025,一个基于三大军事教义构建的对齐安全基准,通过从权威 doctrinal 文本中提取规则并生成保持原意的多选题,结合OODA(观察-调整-决策-行动)决策框架设计12类结构化测试类别,系统评估LLMs在军事相关决策类型中的准确性与拒绝能力,从而填补军事场景下模型安全评估的空白。
Link: https://arxiv.org/abs/2605.00245
Authors: Sydney Johns, Heng Jin, Chaoyu Zhang, Y. Thomas Hou, Wenjing Lou
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) are now being explored for defense applications that require reliable and legally compliant decision support. They also hold significant potential to enhance decision making, coordination, and operational efficiency in military contexts. These uses demand evaluation methods that reflect the doctrinal standards that guide real military operations. Existing safety benchmarks focus on general social risks and do not test whether models follow the legal and ethical rules that govern real military operations. To address this gap, we introduce ARMOR 2025, a military-aligned safety benchmark grounded in three core military doctrines: the Law of War, the Rules of Engagement, and the Joint Ethics Regulation. We extract doctrinal text from these sources and generate multiple-choice questions that preserve the intended meaning of each rule. The benchmark is organized through a taxonomy informed by the Observe-Orient-Decide-Act (OODA) decision-making framework. This structure enables systematic testing of accuracy and refusal across military-relevant decision types. The benchmark features a structured 12-category taxonomy, 519 doctrinally grounded prompts, and rigorous evaluation procedures applied to 21 commercial LLMs. Evaluation results reveal critical gaps in safety alignment for military applications.
[AI-41] Attention Is Where You Attack
Quick Read: This paper addresses the poorly understood internal mechanisms by which safety-aligned large language models implement safe behavior: RLHF and instruction tuning improve safety, but how specific attention mechanisms support safety decisions remains unclear. The key to the solution is the Attention Redistribution Attack (ARA), a white-box adversarial attack that identifies safety-critical attention heads and crafts nonsemantic adversarial tokens, optimized with a Gumbel-softmax objective, that perturb the geometry of softmax attention on the probability simplex and redirect attention away from safety-relevant positions. Experiments show ARA can bypass safety alignment with as few as 5 tokens and 500 optimization steps, and reveal that safety is not localized in removable attention heads but emerges from the attention routing those heads perform: ablating a head allows compensation through the residual stream, whereas redirecting its attention propagates a corrupted signal downstream.
Link: https://arxiv.org/abs/2605.00236
Authors: Aviral Srivastava, Sourav Panda
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Safety-aligned large language models rely on RLHF and instruction tuning to refuse harmful requests, yet the internal mechanisms implementing safety behavior remain poorly understood. We introduce the Attention Redistribution Attack (ARA), a white-box adversarial attack that identifies safety-critical attention heads and crafts nonsemantic adversarial tokens that redirect attention away from safety-relevant positions. Unlike prior jailbreak methods operating at the semantic or output-logit level, ARA targets the geometry of softmax attention on the probability simplex using Gumbel-softmax optimization over targeted heads. Across LLaMA-3-8B-Instruct, Mistral-7B-Instruct-v0.1, and Gemma-2-9B-it, ARA bypasses safety alignment with as few as 5 tokens and 500 optimization steps, achieving 36% ASR on Mistral-7B and 30% on LLaMA-3 against 200 HarmBench prompts, while Gemma-2 remains at 1%. Our principal mechanistic finding is a dissociation between ablation and redistribution: zeroing out the top-ranked safety heads produces at most 1 flip among 39 to 50 baseline refusals, while ARA targeting the corresponding safety-heavy layers flips 72/200 prompts on Mistral-7B and 60/200 on LLaMA-3. This suggests that safety is not localized in these heads as removable components, but emerges from the attention routing they perform. Removing a head allows compensation through the residual stream, while redirecting its attention propagates a corrupted signal downstream.
[AI-42] TUR-DPO: Topology- and Uncertainty-Aware Direct Preference Optimization ICML2026
Quick Read: This paper addresses a limitation of Direct Preference Optimization (DPO) for aligning large language models (LLMs): DPO treats human preferences as flat winner-loser signals and ignores the reasoning topology behind an answer, making it sensitive to noisy or brittle chains of thought and harming the semantic faithfulness and reliability of model outputs. The key to the solution is TUR-DPO (Topology- and Uncertainty-aware DPO), which elicits lightweight reasoning topologies, combines semantic faithfulness, utility, and topology quality into a calibrated uncertainty signal, and uses that signal to weight the DPO objective, improving the model's sensitivity to reasoning quality while remaining RL-free. The method preserves training simplicity, avoids online rollouts, and shows gains across tasks, modalities, and long-context settings.
Link: https://arxiv.org/abs/2605.00224
Authors: Abdulhady Abas Abdullah, Fatemeh Daneshfar, Seyedali Mirjalili, Mourad Oussalah
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)
Abstract:Aligning large language models (LLMs) with human preferences is commonly done via reinforcement learning from human feedback (RLHF) with Proximal Policy Optimization (PPO) or, more simply, via Direct Preference Optimization (DPO). While DPO is stable and RL-free, it treats preferences as flat winner vs. loser signals and is sensitive to noisy or brittle preferences arising from fragile chains of thought. We propose TUR-DPO, a topology- and uncertainty-aware variant of DPO that rewards how answers are derived, not only what they say, by eliciting lightweight reasoning topologies and combining semantic faithfulness, utility, and topology quality into a calibrated uncertainty signal. A small learnable reward is factorized over these signals and incorporated into an uncertainty-weighted DPO objective that remains RL-free and relies only on a fixed or moving reference policy. Empirically, across open 7-8B models and benchmarks spanning mathematical reasoning, factual question answering, summarization, and helpful/harmless dialogue, TUR-DPO improves judge win-rates, faithfulness, and calibration relative to DPO while preserving training simplicity and avoiding online rollouts. We further observe consistent gains in multimodal and long-context settings, and show that TUR-DPO matches or exceeds PPO on reasoning-centric tasks while maintaining operational simplicity.
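The uncertainty-weighted objective can be sketched as the standard DPO loss with a per-pair confidence weight. The weights below are placeholders for the calibrated signal the paper derives from faithfulness, utility, and topology quality; this is a sketch of the weighting idea, not the paper's learned reward factorization:

```python
import math

def dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    """Standard DPO: -log sigmoid(beta * (policy margin vs. reference))."""
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def tur_dpo_loss(pairs, weights, beta=0.1):
    """Uncertainty-weighted DPO averaged over preference pairs.

    weights: calibrated confidence per pair; noisy or brittle pairs
    get low weight and therefore less influence on the gradient.
    """
    total = sum(w * dpo_loss(*p, beta=beta) for p, w in zip(pairs, weights))
    return total / sum(weights)

pairs = [(-10.0, -12.0, -11.0, -11.5),   # confident, clean preference
         (-10.0, -10.1, -10.0, -10.0)]   # near-tie, likely noisy
loss = tur_dpo_loss(pairs, weights=[1.0, 0.2])
print(round(loss, 4))  # the noisy pair contributes far less than the clean one
```

Setting all weights to 1 recovers plain DPO, which makes the weighting a strict generalization rather than a new training paradigm.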
[AI-43] Fair Dataset Distillation via Cross-Group Barycenter Alignment ICML2026
Quick Read: This paper addresses fairness problems in dataset distillation that arise because different demographic groups exhibit distinct predictive patterns: even when group sizes are only mildly imbalanced, distillation struggles to preserve informative signals for all subgroups simultaneously, so models trained on distilled data can degrade substantially on certain subgroups, producing fairness gaps. The key to the solution is identifying a group-imbalance-agnostic barycenter: a shared aggregate representation that induces similar representations across all subgroups. By distilling toward this barycenter, fairness degradation introduced by dataset distillation can be effectively reduced, and the approach is compatible with existing distillation methods.
Link: https://arxiv.org/abs/2605.00185
Authors: Mohammad Hossein Moslemi, Nima Hosseini Dashtbayaz, Zhimin Mei, Boyu Wang, Bissan Ghaddar
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by ICML 2026
Abstract:Dataset Distillation aims to compress a large dataset into a small synthetic one while maintaining predictive performance. We show that as different demographic groups exhibit distinct predictive patterns, the distillation process struggles to simultaneously preserve informative signals for all subgroups, regardless of whether group sizes are mildly or severely imbalanced. Consequently, models trained on distilled data can experience substantial performance drops for certain subgroups, leading to fairness gaps. Crucially, these gaps do not disappear by merely correcting group imbalance, since they stem from fundamental mismatches in subgroup predictive patterns rather than from sample-size disparities alone. We therefore formally analyze the interaction between these two sources of bias and cast the solution as identifying a group-imbalance-agnostic barycenter of the predictive information that induces similar representations across all subgroups. By distilling toward this shared aggregate representation, we show that group fairness concerns can be reduced. Our approach is compatible with existing distillation methods, and empirical results show that it substantially reduces bias introduced by dataset distillation.
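The "group-imbalance-agnostic" property can be illustrated with the simplest possible barycenter: an unweighted mean of per-group means, which ignores how many samples each group contributes. This is only a toy sketch of the idea; the paper's barycenter is defined over predictive information, not a raw feature average:

```python
import numpy as np

def group_barycenter(features, groups):
    """Unweighted mean of per-group means: agnostic to group sizes."""
    means = [features[groups == g].mean(axis=0) for g in np.unique(groups)]
    return np.mean(means, axis=0)

# Group A has 3 samples, group B has 1 (imbalanced)
feats = np.array([[0.0], [0.0], [0.0], [4.0]])
grp = np.array(["A", "A", "A", "B"])
print(group_barycenter(feats, grp))  # [2.] -- not the pooled mean [1.]
```

The pooled mean (1.0) is pulled toward the majority group, while the barycenter (2.0) treats both groups symmetrically, which is exactly the behavior one wants a distillation target to inherit.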
[AI-44] Are Tools All We Need? Unveiling the Tool-Use Tax in LLM Agents
Quick Read: This paper asks whether tool-augmented reasoning actually outperforms native chain-of-thought (CoT) in the presence of semantic distractors. It finds that under semantic noise, the "tool-use tax" introduced by the tool-calling protocol itself (prompt-formatting cost, call overhead, etc.) often offsets the real gains from executing tools. The key to the solution is a Factorized Intervention Framework that quantifies and isolates these three factors, plus G-STEP, a lightweight inference-time gate that mitigates protocol-induced errors. While G-STEP yields partial recovery, the authors stress that more substantial improvements still require strengthening the model's intrinsic reasoning and tool-interaction capabilities.
Link: https://arxiv.org/abs/2605.00136
Authors: Kaituo Zhang, Zhen Xiong, Mingyu Zhong, Zhimeng Jiang, Zhouyuan Yuan, Zhecheng Li, Ying Lin
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Tool-augmented reasoning has become a popular direction for LLM-based agents, and it is widely assumed to improve reasoning and reliability. However, we demonstrate that this consensus does not always hold: in the presence of semantic distractors, tool-augmented reasoning does not necessarily outperform native CoT. To explain this performance gap, we propose a Factorized Intervention Framework that isolates the cost of prompt formatting, the overhead of the tool-calling protocol, and the actual gain from executing tools. Our analysis reveals a critical tradeoff: under semantic noise, the gains from tools often fail to offset the “tool-use tax”, which is the performance degradation introduced by the tool-calling protocol itself. To address this, we introduce G-STEP, a lightweight inference-time gate to mitigate protocol-induced errors. While this yields partial recovery, our findings suggest that more substantial improvements still require strengthening the model’s intrinsic reasoning and tool-interaction capabilities.
[AI-45] Smart Profit-Aware Crop Advisory System: Kisan AI
Quick Read: This paper addresses the "economic blindness" of modern crop advisory systems, which optimize biological yield while ignoring market prices, potentially steering farmers toward agronomically sound but financially unviable decisions. The key to the solution is Kisan AI, a profit-aware crop advisory system whose core innovation is adding market price as a predictive feature to the input set of a Random Forest (RF) classifier, substantially improving the economic value of its recommendations. Experiments show the RF model reaches 99.3% accuracy with the lowest Log Loss on a nine-feature benchmark dataset once market price is included; the system further integrates a nine-language AI chatbot, a Facebook Prophet six-month price-forecasting engine, and a MobileNetV2 disease-detection module into a full-stack platform deployable on farmers' mobile devices across India.
Link: https://arxiv.org/abs/2605.00133
Authors: Debasis Dwibedy, Avyay Nishtala, Pranathi Mukku, D Snehaja
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: 13 pages, 8 figures, 5 tables
Abstract:Modern crop advisory systems exhibit a critical limitation termed economic blindness. These systems primarily optimize for biological yield, often overlooking market price, which can lead farmers toward agronomically sound yet financially unviable decisions. In this paper, we develop Kisan AI, a smart profit-aware crop advisory system that resolves the above-mentioned limitation through a research-driven, full-stack application. We train a Random Forest (RF) classifier on a nine-feature benchmark dataset, the standard seven agronomic attributes augmented with a market_price variable, and evaluate it against eight baseline models using metrics such as accuracy, precision, recall, F1-score, and Log Loss. The RF model achieves the highest accuracy of 99.3% and the lowest Log Loss, confirming that the inclusion of market price as a predictive feature is both valid and impactful. We then implement the RF model within a multilingual progressive Web App alongside a Facebook Prophet six-month price forecasting engine and a MobileNetV2 disease detection module. A nine-language AI chatbot powered by the Anthropic Claude API unifies all modules into a single, mobile-installable platform accessible to farmers across India.
[AI-46] Minimal Local Causal Explanations for Jailbreak Success in Large Language Models
Quick Read: This paper addresses the lack of local causal explanations for why specific jailbreaks succeed against safety-trained language models. Existing work explains jailbreaks globally, as strengthening or suppressing harmfulness or refusal concepts in intermediate representations, but cannot explain why a particular jailbreak strategy succeeds on a particular request. The key to the solution is LOCA, a method that identifies a minimal set of interpretable intermediate-representation changes that causally induce refusal on an otherwise successful jailbreak request, i.e., a local causal explanation of an individual jailbreak. Experiments show LOCA induces refusal with about six interpretable changes on average, whereas prior methods routinely fail even after twenty.
Link: https://arxiv.org/abs/2605.00123
Authors: Shubham Kumar, Narendra Ahuja
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Pre-print
Abstract:Safety trained large language models (LLMs) can often be induced to answer harmful requests through jailbreak prompts. Because we lack a robust understanding of why LLMs are susceptible to jailbreaks, future frontier models operating more autonomously in higher-stakes settings may similarly be vulnerable to such attacks. Prior work has studied jailbreak success by examining the model’s intermediate representations, identifying directions in this space that causally encode concepts like harmfulness and refusal. Then, they globally explain all jailbreak attacks as attempting to reduce or strengthen these concepts (e.g., reduce harmfulness). However, different jailbreak strategies may succeed by strengthening or suppressing different intermediate concepts, and the same jailbreak strategy may not work for different harmful request categories (e.g., violence vs. cyberattack); thus, we seek to give a local explanation – i.e., why did this specific jailbreak succeed? To address this gap, we introduce LOCA, a method that gives Local, CAusal explanations of jailbreak success by identifying a minimal set of interpretable, intermediate representation changes that causally induce model refusal on an otherwise successful jailbreak request. We evaluate LOCA on harmful original-jailbreak pairs from a large jailbreak benchmark across Gemma and Llama chat models, comparing against prior methods adapted to this setting. LOCA can successfully induce refusal by making, on average, six interpretable changes; prior work routinely fails to achieve refusal even after 20 changes. LOCA is a step toward mechanistic, local explanations of jailbreak success in LLMs. Code to be released.
[AI-47] Hyperspherical Forward-Forward with Prototypical Representations
Quick Read: This paper addresses the prohibitive inference cost of the Forward-Forward (FF) algorithm, which requires a separate forward pass for every class being evaluated. The key to the solution is a new architecture, Hyperspherical Forward-Forward (HFF), which reframes each layer's local objective from a binary goodness-of-fit task into a multi-class classification problem in a hyperspherical feature space, using a set of class-specific unit-norm prototypes as geometric anchors and implicit negatives. This allows both weight updates and inference to run in a single forward pass, making it 40x faster than the original FF, while achieving superior accuracy on standard benchmarks and closing the gap with backpropagation.
Link: https://arxiv.org/abs/2605.00082
Authors: Shalini Sarode, Brian Moser, Joachim Folz, Federico Raue, Tobias Nauen, Stanislav Frolov, Andreas Dengel
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 22 pages, 5 figures
Abstract:The Forward-Forward (FF) algorithm presents a compelling, bio-inspired alternative to backpropagation. However, while efficient in training, it has a computationally prohibitive inference process that requires a separate forward pass for every class that is evaluated. In this work, we introduce the Hyperspherical Forward-Forward (HFF), a novel reformulation that resolves this critical bottleneck. Our core innovation is to reframe the local objective of each layer from a binary goodness-of-fit task to a direct multi-class classification problem within a hyperspherical feature space. We achieve this by learning a set of class-specific, unit-norm prototypes that act as geometric anchors and implicit negatives. This architectural innovation preserves the benefits of local training while enabling weight update and inference in a single forward pass, making it 40x faster than the original FF algorithm. Our method is simple to implement, scales effectively to modern convolutional architectures, and achieves superior accuracy on standard image classification benchmarks, closing the gap with backpropagation. Most notably, we are among the first greedy local-learning methods to report over 25% top-1 accuracy on ImageNet-1k, and 65.96% with transfer learning.
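The single-pass inference hinges on unit-norm prototypes: once features are projected onto the hypersphere, one matrix product scores every class at once. A minimal sketch of the prototype idea (not the full HFF training objective or goodness function):

```python
import numpy as np

def hff_layer_predict(features, prototypes):
    """Classify by cosine similarity to class prototypes.

    prototypes: (num_classes, dim) unit-norm anchors on the hypersphere;
    features are normalized onto the sphere, so the dot product equals
    the cosine similarity.
    """
    z = features / np.linalg.norm(features, axis=-1, keepdims=True)
    logits = z @ prototypes.T          # one forward pass scores all classes
    return logits.argmax(axis=-1)

rng = np.random.default_rng(0)
protos = rng.normal(size=(3, 8))
protos /= np.linalg.norm(protos, axis=-1, keepdims=True)  # unit-norm anchors

# A feature vector near prototype 2 should be assigned class 2:
x = protos[2] + 0.01 * rng.normal(size=8)
print(hff_layer_predict(x[None, :], protos))  # [2]
```

Contrast this with original FF inference, which would require one forward pass per candidate class label; here the class comparison collapses into a single similarity computation.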
[AI-48] AgentReputation: A Decentralized Agentic AI Reputation Framework
【速读】:该论文旨在解决去中心化、代理型人工智能(Agentic AI)市场中缺乏可靠声誉机制的问题。现有声誉系统在无中心监管的环境下失效,主要源于三个根本挑战:代理可策略性优化评估流程、能力表现难以跨异构任务场景迁移,以及验证严格性差异大(从轻量自动化检查到高成本专家评审)。为应对这些挑战,论文提出AgentReputation框架,其核心在于构建一个三层去中心化架构,将任务执行、声誉服务与防篡改持久层分离,以发挥各自优势并支持独立演进;同时引入与声誉元数据绑定的显式验证机制和上下文条件化的声誉卡,防止不同领域和任务类型之间的声誉混淆,并通过面向决策的策略引擎实现基于风险与不确定性的资源分配、访问控制及动态验证升级。
链接: https://arxiv.org/abs/2605.00073
作者: Mohd Sameen Chishti,Damilare Peter Oyinloye,Jingyue Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 5 pages, 1 figure, accepted to FSE 2026, Ideas, Visions and Reflections track
Abstract:Decentralized, agentic AI marketplaces are rapidly emerging to support software engineering tasks such as debugging, patch generation, and security auditing, often operating without centralized oversight. However, existing reputation mechanisms fail in this setting for three fundamental reasons: agents can strategically optimize against evaluation procedures; demonstrated competence does not reliably transfer across heterogeneous task contexts; and verification rigor varies widely, from lightweight automated checks to costly expert review. Current approaches to reputation, drawing on federated learning, blockchain-based AI platforms, and large language model safety research, are unable to address these challenges in combination. We therefore propose AgentReputation, a decentralized, three-layer reputation framework for agentic AI systems. The framework separates task execution, reputation services, and tamper-proof persistence to both leverage their respective strengths and enable independent evolution. The framework introduces explicit verification regimes linked to agent reputation metadata, as well as context-conditioned reputation cards that prevent reputation conflation across domains and task types. In addition, AgentReputation provides a decision-facing policy engine that supports resource allocation, access control, and adaptive verification escalation based on risk and uncertainty. Building on this framework, we outline several future research directions, including the development of verification ontologies, methods for quantifying verification strength, privacy-preserving evidence mechanisms, cold-start reputation bootstrapping, and defenses against adversarial manipulation.
[AI-49] XekRung Technical Report
【速读】:该论文旨在解决当前大语言模型在网络安全(cybersecurity)领域专业能力不足的问题,尤其是在构建高质量、可扩展的训练数据和实现多维度能力提升方面的挑战。解决方案的关键在于:首先,设计了面向网络安全领域的多样化数据合成流水线(data synthesis pipelines),以规模化生成高质量训练数据;其次,建立了涵盖持续预训练(continued pre-training, CPT)、监督微调(supervised fine-tuning, SFT)和强化学习(reinforcement learning, RL)的完整训练流程,从而系统性增强模型的领域知识与推理能力;最后,引入多维评估体系指导模型迭代优化,兼顾网络安全专项任务与通用能力表现。
链接: https://arxiv.org/abs/2605.00072
作者: Jiutian Zeng,Junjie Li,Chengwei Dai,Jie Liang,Zhaoyu Hu,Yiliang Zhang,Ziang Weng,Longtao Huang,Dongjie Zhang,Libin Dong,Yang Ge,Yuanda Wang,Kaiwen Lv Kacuila,Bingyu Zhu,Jing Wang,Jin Xu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 22 pages, 2 figures, 5 tables. Jiutian Zeng, Junjie Li, Chengwei Dai, Jie Liang, and Zhaoyu Hu contributed equally to this work
Abstract:We present XekRung, a frontier large language model for cybersecurity, designed to provide comprehensive security capabilities. To achieve this, we develop diverse data synthesis pipelines tailored to the cybersecurity domain, enabling the scalable construction of high-quality training data and providing a strong foundation for cybersecurity knowledge and understanding. Building on this foundation, we establish a complete training pipeline spanning continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL) to further extend the model’s capabilities. We further introduce a multi-dimensional evaluation system to guide the iterative improvement of both domain-specific and general-purpose abilities. Extensive experiments demonstrate that XekRung achieves state-of-the-art performance on cybersecurity-specific benchmarks among models of the same scale, while maintaining strong performance on general benchmarks.
[AI-50] Human-in-the-Loop Meta Bayesian Optimization for Fusion Energy and Scientific Applications IJCAI2026
【速读】:该论文旨在解决惯性约束聚变(Inertial Confinement Fusion, ICF)领域中因实验成本高昂和数据稀缺导致的科学发现效率低下问题。其解决方案的关键在于提出一种“人在回路中的元贝叶斯优化”(Human-in-the-Loop Meta Bayesian Optimization, HL-MBO)框架,该框架通过融合专家知识与少量样本下的不确定性感知机器学习方法,构建了一个元学习得到的代理模型,并结合专家引导的采集函数来推荐候选实验,从而在高风险、数据稀疏的科学场景中实现高效探索与决策。同时,HL-MBO提供可解释性说明以增强用户信任并支持知情决策,实验证明其在ICF能量增益优化、分子设计及超导材料临界温度最大化等任务中均优于现有贝叶斯优化方法。
链接: https://arxiv.org/abs/2605.00068
作者: Ricardo Luna Gutierrez,Sahand Ghorbanpour,Ejaz Rahman,Varchas Gopalaswamy,Riccardo Betti,Vineet Gundecha,Aarne Lees,Soumyendu Sarkar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Plasma Physics (physics.plasm-ph)
备注: Accepted at IJCAI 2026 (35th International Joint Conference on Artificial Intelligence)
Abstract:Inertial Confinement Fusion (ICF) holds transformative promise for sustainable, near-limitless clean energy, yet remains constrained by prohibitively high costs and limited experimental opportunities. This paper presents Human-in-the-Loop Meta Bayesian Optimization (HL-MBO), a framework that integrates expert knowledge with few-shot, uncertainty-aware machine learning to accelerate discovery in data-scarce, high-stakes scientific domains. HL-MBO introduces a meta-learned surrogate model with an expert-informed acquisition function to recommend candidate experiments. To foster trust and enable informed decisions, HL-MBO also provides interpretable explanations of its suggestions. We show HL-MBO outperforms current BO methods on ICF energy yield optimization, as well as benchmarks in molecular optimization and critical temperature maximization for superconducting materials.
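下面的 NumPy 草图(仅为示意,候选数、代理模型的均值/方差与专家权重均为虚构)展示"专家先验调制采集函数"的一种最简形式:将 UCB 分数与专家权重相乘后选取候选实验:

```python
import numpy as np

# Toy surrogate posterior over 5 candidate experiments (mean, std).
mean = np.array([0.2, 0.5, 0.4, 0.1, 0.45])
std = np.array([0.05, 0.10, 0.30, 0.02, 0.05])

# Hypothetical expert prior: relative plausibility from a domain expert.
expert_weight = np.array([1.0, 0.2, 1.0, 1.0, 1.0])

def expert_informed_ucb(mean, std, expert_weight, kappa=2.0):
    """UCB acquisition modulated multiplicatively by an expert prior."""
    return expert_weight * (mean + kappa * std)

scores = expert_informed_ucb(mean, std, expert_weight)
best = int(np.argmax(scores))
print(best)  # -> 2: high uncertainty + expert support wins
```

乘性调制只是众多融合专家知识方式中的一种;论文中的采集函数与元学习代理模型远比此复杂,此处仅说明"专家权重改变实验推荐排序"这一机制。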
[AI-51] TADI: Tool-Augmented Drilling Intelligence via Agentic LLM Orchestration over Heterogeneous Wellsite Data
【速读】:该论文旨在解决钻井操作数据(drilling operational data)难以高效转化为可解释、证据驱动的分析智能的问题,尤其是在复杂多源异构数据场景下(如结构化钻井报告、实时测井数据、生产记录等)。其解决方案的关键在于构建一个工具增强型智能体系统(Tool-Augmented Drilling Intelligence, TADI),采用双存储架构(DuckDB用于结构化查询,ChromaDB用于语义搜索),并由大语言模型(LLM)通过迭代函数调用协同12个领域专用工具,实现跨结构化测量与非结构化日报文本的多步证据交叉验证。该方法不仅实现了对1,759份DDR XML文件的零错误解析和命名规范兼容处理,还提出Evidence Grounding Score(EGS)作为衡量推理结果与原始证据一致性的重要指标,实证表明领域专用工具设计比模型规模本身更能提升技术操作场景下的分析质量。
链接: https://arxiv.org/abs/2605.00060
作者: Rong Lu
机构: 未知
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:We present TADI (Tool-Augmented Drilling Intelligence), an agentic AI system that transforms drilling operational data into evidence-based analytical intelligence. Applied to the Equinor Volve Field dataset, TADI integrates 1,759 daily drilling reports, selected WITSML real-time objects, 15,634 production records, formation tops, and perforations into a dual-store architecture: DuckDB for structured queries over 12 tables with 65,447 rows, and ChromaDB for semantic search over 36,709 embedded documents. Twelve domain-specialized tools, orchestrated by a large language model via iterative function calling, support multi-step evidence gathering that cross-references structured drilling measurements with daily report narratives. The system parses all 1,759 DDR XML files with zero errors, handles three incompatible well naming conventions, and is backed by 95 automated tests plus a 130-question stress-question taxonomy spanning six operational categories. We formalize the agent’s behavior as a sequential tool-selection problem and propose the Evidence Grounding Score (EGS) as a simple grounding-compliance proxy based on measurements, attributed DDR quotations, and required answer sections. The complete 6,084-line, framework-free implementation is reproducible given the public Volve download and an API key, and the case studies and qualitative ablation analysis suggest that domain-specialized tool design, rather than model scale alone, is the primary driver of analytical quality in technical operations.
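摘要将 Evidence Grounding Score (EGS) 描述为基于测量值、带出处的 DDR 引文与必需答案章节的合规代理指标。下面是一个高度简化的示意实现(非原系统代码,检查项与答案文本均为虚构),仅说明"必需证据元素覆盖率"这一思路:

```python
def evidence_grounding_score(answer, required_elements):
    """Fraction of required evidence elements present in an answer.

    A toy stand-in for the paper's EGS: the real score checks
    measurements, attributed DDR quotations, and required sections.
    """
    present = sum(1 for e in required_elements if e in answer)
    return present / len(required_elements)

# Hypothetical agent answer with a measurement, a quotation, and sections.
answer = (
    "Measured ROP: 12.3 m/h (source: DDR 2008-05-01, quote: 'drilled ahead').\n"
    "Summary: operations proceeded on plan.\n"
    "Evidence: see cited report.\n"
)
required = ["Measured", "quote:", "Summary:", "Evidence:"]
score = evidence_grounding_score(answer, required)
print(score)  # -> 1.0
```

实际系统中,这类分数可作为门控:低于阈值的回答会触发再次取证或工具调用。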
[AI-52] Dynamic-TD3: A Novel Algorithm for UAV Path Planning with Dynamic Obstacle Trajectory Prediction
【速读】:该论文旨在解决深度强化学习(Deep Reinforcement Learning, DRL)在复杂高风险环境中自主无人机导航时面临的安全-探索困境(safety-exploration dilemma),即传统软惩罚机制易引发冒险行为,而多数约束方法在传感器噪声和意图不确定性下性能显著下降的问题。其解决方案的关键在于提出一种物理增强型框架 Dynamic-TD3,通过将导航建模为约束马尔可夫决策过程(Constrained Markov Decision Process, CMDP),并引入两个核心机制:一是自适应轨迹关系演化机制(Adaptive Trajectory Relational Evolution Mechanism, ATREM),用于捕捉长距离意图;二是物理感知门控卡尔曼滤波器(Physically Aware Gated Kalman Filter, PAG-KF),以抑制非平稳观测噪声。最终,基于改进的状态表示驱动双准则策略,利用拉格朗日松弛平衡任务效率与硬性安全约束,在动态威胁场景中实现更优的避障性能、更低能耗及更平滑的飞行轨迹。
链接: https://arxiv.org/abs/2605.00059
作者: Wentao Chen,Jingtang Chen,Mingjian Fu,Tiantian Li,Youfeng Su,Wenxi Liu,Yuanlong Yu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 6 pages, 5 figures
Abstract:Deep reinforcement learning (DRL) finds extensive application in autonomous drone navigation within complex, high-risk environments. However, its practical deployment faces a safety-exploration dilemma: soft penalty mechanisms encourage risky trial-and-error, while most constraint-based methods suffer degraded performance under sensor noise and intent uncertainty. We propose Dynamic-TD3, a physically enhanced framework that enforces strict safety constraints while maintaining maneuverability by modeling navigation as a Constrained Markov Decision Process (CMDP). This framework integrates an Adaptive Trajectory Relational Evolution Mechanism (ATREM) to capture long-range intentions and employs a Physically Aware Gated Kalman Filter (PAG-KF) to mitigate non-stationary observation noise. The resulting state representation drives a dual-criterion policy that balances mission efficiency against hard safety constraints via Lagrangian relaxation. In experiments with aggressive dynamic threats, this approach demonstrates superior collision avoidance performance, reduced energy consumption, and smoother flight trajectories.
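下面用一个玩具示例(学习率、成本序列与安全阈值均为假设值)示意 CMDP 拉格朗日松弛中的对偶变量更新:当平均约束成本超过安全阈值时增大 λ,从而在策略目标中加重安全惩罚:

```python
def lagrangian_objective(reward, cost, lam):
    """Reward penalized by constraint cost, as in a CMDP relaxation."""
    return reward - lam * cost

def dual_update(lam, avg_cost, limit, lr=0.5):
    """Dual ascent: increase lambda while average cost exceeds the limit."""
    return max(0.0, lam + lr * (avg_cost - limit))

lam, limit = 0.0, 0.1
costs = [0.4, 0.3, 0.15, 0.1, 0.08]  # toy per-iteration constraint costs
for c in costs:
    lam = dual_update(lam, c, limit)

# After repeated violations, lambda has grown away from zero.
print(lam > 0.0)  # -> True
```

Dynamic-TD3 中该机制与 ATREM/PAG-KF 提供的状态表示共同作用;此处仅示意"任务效率与硬安全约束"如何通过 λ 动态平衡。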
[AI-53] Smart Ensemble Learning Framework for Predicting Groundwater Heavy Metal Pollution
【速读】:该论文旨在解决地下水重金属污染指数(Heavy Metal Pollution Index, HPI)建模中因数据偏态分布和污染物间相关性导致的预测偏差问题,传统方法难以捕捉污染指标的统计复杂性和空间异质性。其解决方案的关键在于构建一个融合响应变量变换与嵌套交叉验证的集成机器学习框架:通过三种变换(原始值、对数变换和高斯Copula变换)处理HPI的非正态分布特性,并在六种回归模型(包括支持向量回归、K近邻、CART、弹性网络、核岭回归及堆叠Lasso集成)中筛选最优组合;其中高斯Copula变换显著提升了模型稳定性与可靠性,使堆叠集成模型达到R² = 0.96、RMSE = 0.19的性能,且残差改善明显,生成的空间分布图具地理合理性,同时结合DBSCAN聚类识别出Fe和Mn为主要污染贡献因子,增强了结果的可解释性。
链接: https://arxiv.org/abs/2605.00056
作者: T. Ansah-Narh,G. Y. Afrifa,J. B. Tandoh,K. Asare,M. Addi,K. E. Yorke,D. M. A. Akpoley,K. Aidoo,S. K. Fosuhene
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Analysis, Statistics and Probability (physics.data-an); Geophysics (physics.geo-ph); Applications (stat.AP); Machine Learning (stat.ML)
备注: 53 pages, 16 figures, accepted for publication in Earth Systems and Environment (2026)
Abstract:Groundwater in the Densu Basin is increasingly threatened by heavy metal contamination, but conventional methods fail to capture the statistical complexity and spatial heterogeneity of pollution indicators. A key challenge is modelling the Heavy Metal Pollution Index (HPI), which is typically skewed and affected by correlated contaminants, leading to biased predictions without transformation. This study develops a predictive framework integrating response transformations with nested cross-validated ensemble machine learning. Three transformations (raw, log, and Gaussian copula) were applied to HPI and evaluated across six learners: support vector regression (SVM), k-nearest neighbours (k-NN), CART, Elastic Net, kernel ridge regression, and a stacked Lasso ensemble. Raw-scale models produced deceptively high fits (Elastic Net and stacked ensemble R^2 \approx 1.0), suggesting over-optimism. The log transformation stabilised variance (SVM: R^2 = 0.93, RMSE = 0.18; k-NN: R^2 = 0.92, RMSE = 0.20). The Gaussian copula gave the most reliable results: stacked ensemble R^2 = 0.96 (RMSE = 0.19), with other learners maintaining high accuracy. Copula-based models improved residuals and produced spatially plausible maps. DBSCAN clustering revealed Fe and Mn as primary HPI contributors, consistent with regional hydrogeochemistry. Limitations include reliance on random (not spatial) cross-validation and basin-specific scope. Future work should explore spatial validation and other geological settings. Overall, distribution-aware ensembles with clustering diagnostics offer robust, interpretable assessments of groundwater contamination.
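高斯 Copula 变换的核心是从经验秩到标准正态分位数的单调映射。下面是一个仅依赖标准库与 NumPy 的示意实现(假设响应值无并列,非论文原始代码):

```python
import numpy as np
from statistics import NormalDist

def gaussian_copula_transform(y):
    """Map a skewed response to normal scores via ranks (empirical copula)."""
    n = len(y)
    ranks = np.argsort(np.argsort(y)) + 1  # ranks 1..n, assumes no ties
    u = ranks / (n + 1)                    # uniform scores in (0, 1)
    nd = NormalDist()
    return np.array([nd.inv_cdf(p) for p in u])

# A heavily skewed toy HPI-like response.
hpi = np.array([1.0, 2.0, 3.0, 5.0, 50.0, 400.0])
z = gaussian_copula_transform(hpi)

# The transform is strictly monotone, so orderings are preserved.
print(bool(np.all(np.diff(z) > 0)))  # -> True
```

变换后的 z 近似服从标准正态分布,便于回归模型在稳定的尺度上学习;预测后再经逆映射回到原始 HPI 尺度。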
[AI-54] AirFM-DDA: Air-Interface Foundation Model in the Delay-Doppler-Angle Domain for AI-Native 6G
【速读】:该论文旨在解决当前基于空间-时间-频率(STF)域信道状态信息(CSI)的无线基础模型在物理层设计中存在的两个核心问题:一是多径分量在STF域中天然叠加且结构纠缠,阻碍了通用信道表征的学习;二是依赖全局注意力机制导致计算开销过高。解决方案的关键在于提出AirFM-DDA,一种在延迟-多普勒-角度(DDA)域运行的无线接口基础模型,通过将CSI从STF域重参数化至DDA域,显式地沿物理意义明确的轴解析多径分量;同时引入基于窗口的注意力模块与帧结构感知的位置编码(FS-PE),使模型能够捕捉局部簇状多径依赖关系,避免二次复杂度的全局注意力,从而显著降低训练与推理成本,并提升零样本泛化能力和鲁棒性。
链接: https://arxiv.org/abs/2605.00020
作者: Kejia Bian,Meixia Tao,Jianhua Mo,Zhiyong Chen,Leyan Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Signal Processing (eess.SP)
备注: 16 pages
Abstract:The success of large foundation models is catalyzing a new paradigm for AI-native 6G network design: wireless foundation models for physical layer design. However, existing models often operate on channel state information (CSI) in the space-time-frequency (STF) domain, where distinct multipath components are inherently superimposed and structurally entangled. This hinders the learning of universal channel representation. Meanwhile, their reliance on global attention mechanisms incurs prohibitive computational overhead. In this paper, we propose AirFM-DDA, an Air-interface Foundation Model operating in the Delay-Doppler-Angle (DDA) domain for physical-layer tasks. Specifically, AirFM-DDA reparameterizes CSI from the STF domain into the DDA domain to explicitly resolve multipath components along physically meaningful axes. It employs a window-based attention module augmented with frame-structure-aware positional encoding (FS-PE). This window-based attention aligns with locally clustered multipath dependencies while avoiding quadratic-complexity global attention, and FS-PE injects frame-structure priors into the network. Extensive experiments demonstrate that AirFM-DDA achieves superior zero-shot generalization across unseen scenarios and datasets, consistently outperforming the baselines on channel prediction and estimation tasks. Compared to global attention, its window-based attention reduces training and inference costs by nearly an order of magnitude. Moreover, AirFM-DDA maintains robustness under high mobility, large delay spreads, severe noise, and extreme aliasing conditions.
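STF 域到 DDA 域的重参数化在教科书意义上可由三次傅里叶变换完成:时间轴 FFT 得多普勒、频率轴 IFFT 得时延、天线轴 FFT 得角度(假设均匀线阵)。下面的 NumPy 草图仅为示意,张量维度为虚构,并非论文中的具体实现:

```python
import numpy as np

def stf_to_dda(csi):
    """Map CSI (antennas, time, subcarriers) to delay-Doppler-angle.

    Textbook sketch: Doppler via FFT over time, delay via IFFT over
    frequency, angle via FFT over the antenna axis (uniform linear array).
    """
    doppler = np.fft.fft(csi, axis=1)
    delay = np.fft.ifft(doppler, axis=2)
    angle = np.fft.fft(delay, axis=0)
    return angle

rng = np.random.default_rng(0)
csi = rng.normal(size=(4, 8, 16)) + 1j * rng.normal(size=(4, 8, 16))
dda = stf_to_dda(csi)
print(dda.shape)  # -> (4, 8, 16): same grid, physically resolved axes
```

变换后每个多径分量在 DDA 网格上近似稀疏、局部成簇,这正是窗口注意力而非全局注意力足以建模依赖关系的前提。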
[AI-55] FedACT: Concurrent Federated Intelligence across Heterogeneous Data Sources
【速读】:该论文旨在解决多任务联邦学习(Multi-Federated Learning, Multi-FL)系统中因设备异构性和资源利用效率低下导致的性能瓶颈问题。现有单任务联邦学习(Federated Learning, FL)优化方法在多任务场景下表现不佳,难以有效调度异构设备以满足多个并发FL任务的资源需求。其解决方案的关键在于提出FedACT——一种资源异构性感知的设备调度机制,通过设计一个对齐评分(alignment scoring)机制动态评估设备可用资源与任务资源需求之间的兼容性,并据此优先分配高匹配度设备,同时引入参与公平性约束确保各任务间设备贡献均衡,从而显著降低平均作业完成时间(Job Completion Time, JCT)并提升全局模型精度。
链接: https://arxiv.org/abs/2605.00011
作者: Md Sirajul Islam,Isabelle G Chapman,N I Md Ashafuddula,Xu Yuan,Li Chen,Nian-Feng Tzeng,Klara Nahrstedt
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:Federated Learning (FL) enables collaborative intelligence across decentralized data source devices in a privacy-preserving way. While substantial research attention has been drawn to optimizing the learning process for an individual task, real-world applications increasingly require multiple machine learning tasks simultaneously training their models across a shared pool of devices. Naively applying single-FL optimization techniques in multi-FL systems results in suboptimal system performance, particularly due to device heterogeneity and resource inefficiency. To address such a critical open challenge, we introduce FedACT, a novel resource heterogeneity-aware device scheduling approach designed to efficiently schedule heterogeneous devices across multiple concurrent FL jobs, with the goal of minimizing their average job completion time (JCT). FedACT dynamically assigns devices to FL jobs based on an alignment scoring mechanism that evaluates the compatibility between available resources of devices and resource demands of jobs. Additionally, it incorporates participation fairness to ensure balanced contributions from devices across jobs, further enhancing the accuracy levels of learned global models. An optimal scheduling plan is formulated in FedACT by prioritizing devices with higher alignment scores, while ensuring fair participation across jobs. To evaluate the effectiveness of the proposed scheduling algorithm, we carried out comprehensive experiments using diverse FL jobs and benchmark datasets. Experimental results demonstrate that FedACT reduces the average JCT by up to 8.3× and improves model accuracy by up to 44.5%, compared to the state-of-the-art baselines.
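下面的草图(资源维度、数值均为虚构,对齐分数以余弦相似度示意)演示"设备资源与任务需求对齐评分 + 贪心分配"的最简形式;论文中的调度还叠加了参与公平性约束,此处省略:

```python
import numpy as np

def alignment_score(device_resources, job_demand):
    """Cosine-style compatibility between device resources and job demand."""
    d = np.asarray(device_resources, dtype=float)
    j = np.asarray(job_demand, dtype=float)
    return float(d @ j / (np.linalg.norm(d) * np.linalg.norm(j)))

# Toy resources: (compute, bandwidth). Three devices, two concurrent FL jobs.
devices = [(8.0, 1.0), (1.0, 8.0), (5.0, 3.0)]
jobs = [(1.0, 0.0), (0.0, 1.0)]  # job 0 compute-bound, job 1 bandwidth-bound

# Greedy: each device joins the job it aligns with best.
assignment = [int(np.argmax([alignment_score(d, j) for j in jobs]))
              for d in devices]
print(assignment)  # -> [0, 1, 0]
```

计算密集型设备被分给计算密集型任务、带宽充裕设备被分给通信密集型任务,这正是对齐评分降低平均 JCT 的直观来源。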
[AI-56] Cloud Is Closer Than It Appears: Revisiting the Tradeoffs of Distributed Real-Time Inference
【速读】:该论文旨在解决分布式网络物理系统(Cyber-Physical Systems, CPS)中深度神经网络(Deep Neural Networks, DNNs)推理任务在实时控制场景下的延迟与性能矛盾问题。传统架构倾向于将推理任务部署于本地设备以规避云端网络波动带来的不确定性,但此策略显著增加了边缘硬件的能耗与计算负担。论文提出的关键解决方案是重新评估云平台在低延迟控制任务中的适用性:通过构建一个形式化的分布式推理延迟分析模型,量化感知频率、平台吞吐量、网络延迟及任务安全约束对整体延迟的影响,并在自动驾驶紧急制动场景下进行仿真验证。结果表明,在高吞吐量云资源支持下,网络和排队延迟可被有效摊销,使得云推理不仅能够满足实时性要求,甚至在安全性保障方面优于本地推理。这一发现挑战了“云不适合实时控制”的固有认知,揭示了云并非遥不可及,而是分布式CPS架构中更优的推理部署位置。
链接: https://arxiv.org/abs/2605.00005
作者: Pragya Sharma,Hang Qiu,Mani Srivastava
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
备注:
Abstract:The increasing deployment of deep neural networks (DNNs) in cyber-physical systems (CPS) enhances perception fidelity, but imposes substantial computational demands on execution platforms, posing challenges to real-time control deadlines. Traditional distributed CPS architectures typically favor on-device inference to avoid network variability and contention-induced delays on remote platforms. However, this design choice places significant energy and computational demands on the local hardware. In this work, we revisit the assumption that cloud-based inference is intrinsically unsuitable for latency-sensitive control tasks. We demonstrate that, when provisioned with high-throughput compute resources, cloud platforms can effectively amortize network and queueing delays, enabling them to match or surpass on-device performance for real-time decision-making. Specifically, we develop a formal analytical model that characterizes distributed inference latency as a function of the sensing frequency, platform throughput, network delay, and task-specific safety constraints. We instantiate this model in the context of emergency braking for autonomous driving and validate it through extensive simulations using real-time vehicular dynamics. Our empirical results identify concrete conditions under which cloud-based inference adheres to safety margins more reliably than its on-device counterpart. These findings challenge prevailing design strategies and suggest that the cloud is not merely a feasible option, but often the preferred inference location for distributed CPS architectures. In this light, the cloud is not as distant as traditionally perceived; in fact, it is closer than it appears.
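论文的分析模型将端到端推理延迟分解为网络、排队与计算三项。下面的玩具计算(模型算量、平台吞吐与网络延迟均为假设数值,仅为说明)示意高吞吐云端为何能摊销网络开销:

```python
def inference_latency(network_s, queue_s, flops, throughput_flops_s):
    """End-to-end latency: network + queueing + compute at given throughput."""
    return network_s + queue_s + flops / throughput_flops_s

# Hypothetical 20 GFLOP perception model.
flops = 20e9
device = inference_latency(0.0, 0.0, flops, throughput_flops_s=100e9)
cloud = inference_latency(0.030, 0.010, flops, throughput_flops_s=2e12)

# 0.200 s on-device vs 0.050 s in the cloud despite 40 ms of overhead.
print(cloud < device)  # -> True
```

只要吞吐差距带来的计算时间节省超过网络与排队开销,云端推理就能满足甚至优于本地的实时约束;这正是摘要中"云比看上去更近"的量化含义。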
[AI-57] Born-Qualified: An Autonomous Framework for Deploying Advanced Energy and Electronic Materials
【速读】:该论文旨在解决自主科学在材料与化学系统发现过程中面临的“死亡谷”(valley of death)问题,即许多实验室中表现优异的候选系统因缺乏可制造性、成本效益和耐久性等工业可行性考量而无法实现规模化应用。其解决方案的关键在于提出“生来合格”(born-qualified)的自主开发策略,通过从初始阶段就嵌入可制造性、成本和耐久性约束,结合多目标评估指标、因果模型、模块化基础设施以及将制造过程融入发现循环的四大支柱,实现从实验室到产业端的高效转化。
链接: https://arxiv.org/abs/2605.00639
作者: Steven R. Spurgeon,Milad Abolhasani,Frederick Baddour,Ryan B. Comes,Vinayak P. Dravid,Hilary Egan,Patrick Emami,Robert W. Epps,Davi M. Fébba,Renae Gannon,E. Ashley Gaulding,Ayana Ghosh,Kenny Gruchalla,Grace Guinan,Taro Hitosugi,Michael Holden,Sergei V. Kalinin,Yangang Liang,John S. Mangum,Matthew J. Olszta,Nathaniel H. Park,Axel Palmstrom,Michelle A. Smeaton,Brooks Tellekamp,Nicholas E. Thornburg,Raymond R. Unocic,Daniela Ushizima,Rama K. Vasudevan,Robert White,Andrew Young,Andriy Zakutayev
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注: 14 pages, 2 figures
Abstract:Autonomous science is transforming how we discover materials and chemical systems for advanced energy technologies. However, many initially promising systems never reach deployment. This “valley of death” stems from optimization that prioritizes laboratory metrics over industrial viability. We propose a new strategy: “born-qualified” autonomous development, which embeds manufacturability, cost, and durability constraints from the outset. This approach is enabled by four pillars, including the development of multi-objective metrics, causal models, a modular infrastructure, and embedding manufacturing in the discovery loop. Realizing this vision will require sustained, community-wide commitment, but the potential return on that investment is commensurate with the scale of the challenge.
[AI-58] Beyond Structure: Revolutionising Materials Discovery via AI-Driven Synthesis Protocol-Property Relationships
【速读】:该论文旨在解决当前以结构为中心的人工智能(AI)材料发现范式所面临的“可合成性差距”(synthesizability gap)问题,即尽管AI能筛选出成千上万的候选原子构型,但这些结构在实验中难以实现合成。解决方案的关键在于转向“合成优先”(synthesis-first)范式,将可执行的合成协议(executable synthesis protocols)作为核心设计变量,而不仅仅是原子排列。论文提出三大支柱:(i) 将合成步骤表示为机器可读的协议;(ii) 利用生成式与逆向设计模型提出可行的反应路径和配方;(iii) 通过闭环优化将协议与实验现实及可持续性约束相校准。该方法基于因果框架 P-X-y(从合成协议 P 到结构 X 和性质 y),构建了可复现、数据驱动的材料发现体系。
链接: https://arxiv.org/abs/2605.00313
作者: Guillaume Lambard
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注: 20 pages, 2 figures
Abstract:The current structure-centric paradigm in artificial intelligence (AI)-driven materials discovery, despite delivering thousands of candidate structures, is stalling at a critical barrier: the synthesizability gap. We argue that closing this gap demands a pivot to a synthesis-first paradigm in which executable synthesis protocols, not just atomic configurations, are treated as primary design variables. We outline a roadmap built on three pillars: (i) representing synthesis procedures as machine-readable protocols, (ii) deploying generative and inverse-design models to propose actionable reaction pathways and recipes, and (iii) integrating closed-loop optimisation to refine protocols against experimental realities and sustainability constraints. Framed in terms of the causal backbone P-X-y from protocol P to structure X and properties y, this perspective sets out methodological building blocks, standards needs and self-driving laboratory (SDL) integration strategies to accelerate reproducible, data-first materials discovery.
[AI-59] CRC-Screen: Certified DNA-Synthesis Hazard Screening Under Taxonomic Shift
【速读】:该论文旨在解决DNA合成供应商在筛查潜在危险序列时面临的“漏检率过高”问题,尤其当目标危险序列来自参考数据库中未包含的分类家族时,传统基于已知毒害序列库的筛查方法会失效(即产生100%假阳性标记)。其解决方案的关键在于:通过融合三种源自订单公共注释的信号——k-mer Jaccard相似度、五名大语言模型(Large Language Model, LLM)判官的修剪均值评分以及嵌入聚类中心的余弦相似度,并利用单调逻辑聚合器进行融合,在Conformal Risk Control框架下进行校准,从而在保证期望误漏率(False Negative Rate, FNR)不超过预设阈值α的前提下,实现可证伪的低漏检和低误报性能。实验表明,在UniProt KW-0800毒害蛋白数据集上,该方法在α=0.05时实现了零测试漏检率与九个fold上的零误报率,且有限样本松弛项限制了可认证漏检率上限为1.77%,进一步达到采购级精度α=10⁻³需18倍更大的校准数据集,凸显校准数据量是当前瓶颈而非算法本身。
链接: https://arxiv.org/abs/2605.00074
作者: Najmul Hasan
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
备注: 12 pages, 5 figures, 1 table. Code: this https URL
Abstract:DNA-synthesis providers screen incoming orders by searching the requested sequence against curated hazard lists. We show that this baseline collapses to a 100% false-flag rate when the hazardous sequence comes from a taxonomic family absent from the reference set: under Conformal Risk Control’s certified miss-rate constraint, a low-discrimination signal forces the threshold below the entire test-benign mass. We compose three signals derived from a synthesis order’s public annotation: k-mer Jaccard similarity to known toxins, the trimmed-mean score of a five-LLM judge panel, and cosine similarity to clustered embedding centroids. Fused under a monotone logistic aggregator and calibrated by Conformal Risk Control, the resulting screener certifies \mathbb{E}[\mathrm{FNR}] \le \alpha. Across ten leave-one-taxonomic-family-out folds at \alpha=0.05 on UniProt KW-0800 reviewed toxins, the calibrated screener achieves 0% test miss rate on every fold and 0% test false-flag rate on nine of ten folds. The bound’s finite-sample slack 1/(n_{\mathrm{cal}}+1) caps the certifiable miss rate at 1.77% on our 200-hazard subsample; reaching procurement-grade \alpha=10^{-3} requires an 18× larger calibration set, which the full reviewed UniProt KW-0800 corpus is large enough to deliver. The binding constraint on certifiable DNA-synthesis screening is calibration data, not algorithms. Code: this https URL
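Conformal Risk Control 式的校准可以示意为:在校准集的危害分数上选取最大的标记阈值,使 (漏检数 + 1)/(n + 1) ≤ α。下面的 NumPy 草图(分数为合成数据,非论文代码)演示摘要提到的有限样本松弛项如何限制可认证阈值:

```python
import numpy as np

def crc_threshold(hazard_scores, alpha):
    """Largest flagging threshold whose certified FNR bound stays under alpha.

    Certified bound: (misses + 1) / (n + 1) <= alpha, where misses counts
    calibration hazards scoring below the threshold.
    """
    s = np.sort(np.asarray(hazard_scores))
    n = len(s)
    best = -np.inf
    for t in np.concatenate(([-np.inf], s)):
        misses = int(np.sum(s < t))
        if (misses + 1) / (n + 1) <= alpha:
            best = max(best, t)
    return best

scores = np.linspace(0.5, 1.0, 199)  # 199 synthetic calibration hazards
t = crc_threshold(scores, alpha=0.05)

# With n = 199, at most floor(0.05 * 200) - 1 = 9 calibration misses allowed.
print(int(np.sum(scores < t)))  # -> 9
```

由此可见 α 越小、所需校准样本越多:要让 (0 + 1)/(n + 1) ≤ 10^{-3},至少需要 n ≥ 999,这正是摘要所说"瓶颈在校准数据而非算法"。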
[AI-60] Mean-Field Path-Integral Diffusion: From Samples to Interacting Agents
【速读】:该论文旨在解决传统扩散生成模型中独立采样带来的效率瓶颈问题,即如何通过样本间的协同作用更高效地传输概率质量。其解决方案的关键在于提出均值场路径积分扩散(Mean-Field Path-Integral Diffusion, MF-PID)框架,将样本视为受群体密度自洽驱动的交互代理,从而将分布匹配问题转化为麦肯-弗拉索夫(McKean–Vlasov)扩展的随机最优传输问题。该框架统一了生成建模与多智能体控制,并在两种解析可解情形下验证有效性:一是线性二次高斯(LQG)基准情形,其中无限维均值场系统退化为有限维Riccati和线性常微分方程组;二是高斯混合情形,由分段常数协议保持闭式求解能力。特别地,在二次交互势能与零基漂移条件下,作者证明自洽均值场引导等价于初始与目标全局均值之间的精确线性插值,且对任意初始和目标分布及调度函数 βt 均成立。
链接: https://arxiv.org/abs/2605.00007
作者: Michael Chertkov
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 31 pages, 14 figures
Abstract:Independent sample generation is the prevailing paradigm in modern diffusion-based generative models of AI. We ask a different question: can samples coordinate through shared population statistics to transport probability mass more efficiently? We introduce Mean-Field Path-Integral Diffusion (MF-PID), a framework in which samples are promoted to interacting agents whose drift depends self-consistently on the evolving population density. The coupling converts distribution matching into a McKean–Vlasov extension of the stochastic optimal transport problem, unifying generative modeling and multi-agent control under the same Hamilton–Jacobi–Bellman/Kolmogorov–Fokker–Planck duality. We identify two analytically tractable regimes: a Linear–Quadratic–Gaussian (LQG) benchmark in which the infinite-dimensional mean-field system reduces to a finite set of Riccati and linear ODEs, and a Gaussian-mixture regime governed by a piecewise-constant protocol that preserves closed-form solvability. For a quadratic interaction potential with schedule \beta_t and zero base drift we prove that the self-consistent MF guidance is the exact linear interpolant between initial and target global means – a result that holds for arbitrary initial and target densities and any \beta_t. Applied to demand-response control of energy systems, where agents aggregated into an ensemble are energy consumers (e.g., thermal zones within a building), MF-PID achieves 19–24% reductions in cumulative control energy over independent-agent baselines while matching the prescribed terminal distribution exactly, and reveals how coordination redistributes actuation effort across heterogeneous sub-populations.
机器学习
[LG-0] HyCOP: Hybrid Composition Operators for Interpretable Learning of PDEs
链接: https://arxiv.org/abs/2605.00820
作者: Jinpai Zhao,Nishant Panda,Yen Ting Lin,Eirik Valseth,Diane Oyen,Clint Dawson
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:We introduce HyCOP, a modular framework that learns parametric PDE solution operators by composing simple modules (advection, diffusion, learned closures, boundary handling) in a query-conditioned way. Rather than learning a monolithic map, HyCOP learns a policy over short programs - which module to apply and for how long - conditioned on regime features and state statistics. Modules may be numerical sub-solvers or learned components, enabling hybrid surrogates evaluated at arbitrary query times without autoregressive rollout. Across diverse PDE benchmarks, HyCOP produces interpretable programs, delivers order-of-magnitude OOD improvements over monolithic neural operators, and supports modular transfer through dictionary updates (e.g., boundary swaps, residual enrichment). Our theory characterizes expressivity and gives an error decomposition that separates composition error from module error and doubles as a process-level diagnostic.
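HyCOP 将解算子表示为简单模块的短程序组合。下面的 1D 周期网格玩具示例(模块实现、步长与程序均为示意,非论文代码)演示"(模块, 步数) 程序"的执行方式,并验证平流与扩散两个模块都保持质量守恒:

```python
import numpy as np

def advect(u, c=1.0, dx=1.0, dt=0.1):
    """Upwind advection step on a periodic grid (c > 0)."""
    return u - c * dt / dx * (u - np.roll(u, 1))

def diffuse(u, nu=1.0, dx=1.0, dt=0.1):
    """Explicit diffusion step on a periodic grid."""
    return u + nu * dt / dx**2 * (np.roll(u, -1) - 2 * u + np.roll(u, 1))

def run_program(u0, program):
    """Execute a short module program, HyCOP-style: (module, n_steps) pairs."""
    u = u0.copy()
    for module, steps in program:
        for _ in range(steps):
            u = module(u)
    return u

u0 = np.zeros(32)
u0[16] = 1.0
u = run_program(u0, [(advect, 5), (diffuse, 5)])
print(np.isclose(u.sum(), 1.0))  # -> True: both modules conserve mass
```

HyCOP 学习的正是在给定查询条件下选择哪些模块、执行多长时间的策略;可解释性来自最终"程序"本身可被直接检查。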
[LG-1] Generating Statistical Charts with Validation-Driven LLM Workflows
链接: https://arxiv.org/abs/2605.00800
作者: Pavlin G. Poličar,Andraž Pevcin,Blaž Zupan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Generating diverse, readable statistical charts from tabular data remains challenging for LLMs, as many failures become apparent after rendering and are not detectable from data or code alone. Existing chart datasets also rarely provide fully aligned artifacts, such as executable code, dataset context, and question-answer pairs. We present a structured LLM-based workflow that decomposes chart generation into dataset screening, plot proposal, code synthesis, rendering, validation-driven refinement, description generation, and question-answer generation. By incorporating rendered-output validation, the workflow addresses visualization-specific failure modes such as readability and semantic mismatch. It treats chart generation as an inspectable process rather than a one-shot prompt-to-code task, retaining each chart with its code, dataset context, description, and question-answer pairs. Applied to UCI datasets, the workflow produces 1,500 charts from 74 datasets, spanning 24 chart families and paired with 30,003 question-answer pairs. We evaluate 16 multimodal LLMs (MLLMs) on these chart-question pairs. The results show that chart-syntax questions are nearly saturated, while value extraction, comparison, and reasoning remain more challenging, illustrating the workflow’s utility for diagnostic studies of chart-grounded multimodal reasoning.
[LG-2] SAVGO: Learning State-Action Value Geometry with Cosine Similarity for Continuous Control
链接: https://arxiv.org/abs/2605.00787
作者: Stavros Orfanoudakis,Pedro P. Vergara
类目: Machine Learning (cs.LG)
*备注: Reinforcement Learning
Abstract:While representation and similarity learning have improved the sample efficiency of Reinforcement Learning (RL), they are rarely used to shape policy updates directly in the action space. To bridge this gap, a geometry-aware RL algorithm that explicitly incorporates value-based similarity into the policy update, State-Action Value Geometry Optimization (SAVGO), is proposed. In detail, SAVGO learns a joint state-action embedding space in which pairs with similar action-value estimates exhibit high cosine similarity, while dissimilar pairs are mapped to distinct directions. This learned geometry enables the generation of a similarity kernel over candidate actions sampled at each update, allowing policy improvement to be guided directly toward higher-value regions beyond local gradient-based updates. As a result, representation learning, value estimation, and policy optimization are unified within a single geometry-consistent objective, while preserving the scalability of off-policy actor-critic training. The proposed method is evaluated on standard MuJoCo continuous-control benchmarks, demonstrating improvements over strong baselines on challenging high-dimensional tasks. Ablation studies are done to analyze the contributions of value-geometry learning and similarity-based policy updates.
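下面的 NumPy 草图(嵌入与 Q 值均为虚构)示意 SAVGO 中"余弦相似度核 + 价值加权"的基本构件:对候选动作的状态-动作嵌入计算两两余弦相似度,并用该核对价值估计加权,使策略更新偏向高价值邻域:

```python
import numpy as np

def cosine_kernel(embeddings):
    """Pairwise cosine similarity of state-action embeddings."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return e @ e.T

rng = np.random.default_rng(0)

# Toy embeddings for 4 candidate actions and their Q-value estimates.
emb = rng.normal(size=(4, 6))
q_values = np.array([0.1, 0.9, 0.8, 0.2])

k = cosine_kernel(emb)
# Kernel-weighted values: each candidate inherits value from similar ones.
weighted = k @ q_values
print(weighted.shape)  # -> (4,)
```

对角线上的自相似度恒为 1,因此每个候选保留自身价值并按几何邻近度吸收邻居的价值信号;论文中该核进一步与表示学习目标联合训练。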
[LG-3] Observable Performance Does Not Fully Reflect System Organization: A Multi-Level Analysis of Gait Dynamics Under Occlusal Constraint
链接: https://arxiv.org/abs/2605.00778
作者: Jacques Raynal,Pierre Slangen,Jacques Margerit
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 1 table, 4 figures. Exploratory single-case study
Abstract:In biomechanical systems, observable performance is often used as a proxy for underlying system organization. However, this assumption implicitly presumes a correspondence between output metrics and internal system states that may not hold in adaptive systems. In this study, the vertical dimension of occlusion (VDO) is considered as a constraint applied to an adaptive neuromechanical system, enabling the exploration of system-level responses under controlled variations. A single-case design in a patient with Parkinson’s disease allows an intra-individual analysis across repeated sessions. The analysis is structured across three complementary levels: (i) aggregated linear metrics describing observable performance, (ii) a dynamical systems framework describing temporal organization in state space, and (iii) a latent space representation obtained through unsupervised embedding. The results show that conditions with comparable observable performance may correspond to different organizations in both state space and latent space representations. This dissociation highlights a limitation of aggregated metrics and suggests that similar outputs may arise from non-equivalent system states. A fourth level is proposed as a purely conceptual extension describing potential relationships between system states. This level is not implemented and is not derived from experimental data. These observations are strictly exploratory and non-causal. The proposed framework does not establish mechanistic, predictive, or directional relationships, but provides a structured approach for analyzing constraint-driven systems across multiple levels of representation.
[LG-4] Learning the Helmholtz equation operator with DeepONet for non-parametric 2D geometries
链接: https://arxiv.org/abs/2605.00760
作者: Rodolphe Barlogis,Ferhat Tamssaouet,Quentin Falcoz,Stéphane Grieu
类目: Machine Learning (cs.LG)
*备注: 24 pages, 16 figures
Abstract:This paper deals with solving the 2D Helmholtz equation on non-parametric domains, leveraging a physics-informed neural operator network based on the DeepONet framework. We consider a 2D square domain with an inclusion of arbitrary boundary geometry at its center. This inclusion acts as a scatterer for an incoming harmonic wave. The aim is to learn the operator linking the geometry of the scatterer to the resulting scattered field. A signed distance function to the boundary of the inner inclusion, evaluated at several points in the domain, is used to encode its geometry. It serves as input for the branch part of the DeepONet architecture, while local information is used as input for the trunk part. This approach enables the encoding of arbitrary geometries, whether they are parameterized or not. The evaluation of the model on unseen geometries is compared with its finite element method (FEM) equivalent to test its generalization capabilities. The trained network weights implicitly embed the local physics and their interaction with the domain geometry. If the training space sufficiently covers the target evaluation space, the model can generalize accordingly. Furthermore, it can be refined to extend to another region of interest without retraining from scratch. This framework also avoids the need to remesh the domain for each geometry. The proposed approach delivers a computationally lighter surrogate model than FEM alternatives and avoids relying on FEM-generated training data.
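The branch/trunk structure described above can be sketched as a minimal forward pass: the branch net consumes signed-distance samples encoding the inclusion geometry, the trunk net consumes query coordinates, and their inner product gives the field prediction. All weights, widths, and sample counts below are random illustrative placeholders, not the paper's trained operator.

```python
import numpy as np

# Minimal DeepONet-style forward pass: prediction = <branch(geometry), trunk(x)>.
rng = np.random.default_rng(1)

def mlp(x, W1, W2):
    """Tiny two-layer network standing in for the branch/trunk subnetworks."""
    return np.tanh(x @ W1) @ W2

p = 16                                       # latent width shared by branch and trunk
W1b, W2b = rng.normal(size=(32, 64)), rng.normal(size=(64, p))
W1t, W2t = rng.normal(size=(2, 64)), rng.normal(size=(64, p))

sdf_samples = rng.normal(size=(1, 32))       # signed distances at 32 probe points
coords = rng.uniform(size=(100, 2))          # query points in the 2D domain

branch = mlp(sdf_samples, W1b, W2b)          # (1, p): geometry encoding
trunk = mlp(coords, W1t, W2t)                # (100, p): location encoding
field = trunk @ branch.T                     # (100, 1): predicted scattered field
```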
[LG-5] Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring
链接: https://arxiv.org/abs/2605.00754
作者: Indraneil Paul,Glavaš Glavas,Iryna Gurevych
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
Abstract:Reward models (RMs) have become an indispensable fixture of the language model (LM) post-training playbook, enabling policy alignment and test-time scaling. Research on the application of RMs in code generation, however, has been comparatively sparse, with existing work largely focusing on execution feedback. This choice constrains post-training to optimizing functional correctness over self-contained executable code. In this work, we examine the training and evaluation of multilingual, multi-criteria code RMs. To this end, we first compile Themis-CodeRewardBench, a benchmark to evaluate code RMs across five preference dimensions (i.e., criteria) and eight programming languages, on which we profile 50+ code, math, and general-purpose RMs. Observing the limited proficiency of current RMs beyond scoring for functional correctness, we develop Themis-CodePreference, the largest open-source collection of code preferences to date (more than 350k preference pairs), and use it to train Themis-RM, a suite of multilingual code reward models for flexible multi-criteria scoring, ranging in size from 600M to 32B parameters. Our experiments and ablations demonstrate positive scaling trends, strong cross-lingual transfer when training on diverse preferences, and the importance of multi-criteria training for reliable code reward modeling.
[LG-6] NonZero: Interaction-Guided Exploration for Multi-Agent Monte Carlo Tree Search ICML2026
链接: https://arxiv.org/abs/2605.00751
作者: Sizhe Tang,Zuyuan Zhang,Mahdi Imani,Tian Lan
类目: Machine Learning (cs.LG)
*备注: Accepted by ICML 2026 as Spotlight
Abstract:Monte Carlo Tree Search (MCTS) scales poorly in cooperative multi-agent domains because expansion must consider an exponentially large set of joint actions, severely limiting exploration under realistic search budgets. We propose NonZero, which keeps multi-agent MCTS tractable by running surrogate-guided selection over a low-dimensional nonlinear representation using an interaction-guided proposal rule, instead of directly exploring the full joint-action space. Our exploration uses an interaction score: single-agent deviations are ranked by predicted gain, while two-agent deviations are scored by a mixed-difference measure that reveals coordination benefits even when no single agent can improve alone. We formalize candidate proposal as a bandit problem over local deviations and derive a proposal rule, NonZero, with a sublinear local-regret guarantee for reaching approximate graph-local optima without enumerating the joint-action space. Empirically, NonZero improves sample efficiency and final performance on MatGame, SMAC, and SMACv2 relative to strong model-based and model-free baselines under matched search budgets.
[LG-7] Temporal Data Requirement for Predicting Unplanned Hospital Readmissions
链接: https://arxiv.org/abs/2605.00738
作者: Ramin Mohammadi,Vahab Vahdat,Sarthak Jain,Amir T. Namin,Ramya Palacholla,Sagar Kamarthi
类目: Machine Learning (cs.LG)
*备注:
Abstract:With the proliferation of Electronic Health Records (EHRs), a critical challenge in building predictive models is determining the optimal historical data time window to maximize accuracy. This study investigates the impact of various observation windows ranging from the day of surgery to three years prior on predicting 30-day readmission following hip and knee arthroplasties. The dataset encompasses both structured encounter records (over 4 million) and unstructured clinical notes (80,000) from 7,174 patients. To extract meaning from the clinical notes, we employed a suite of non-neural (BOW, count-BOW, TF-IDF, LDA) and neural encoders (BERT, 1D CNN, BiLSTM, Average). We subsequently evaluated models utilizing clinical notes alone, structured data alone, and a combination of both modalities. Our results demonstrate that the optimal time window for unstructured clinical notes is significantly shorter than for structured data: maximum predictive performance was achieved using notes from just three to six months prior to surgery. In contrast, performance using structured data improved as the time window lengthened, but plateaued after twelve months. These modality-specific temporal patterns remained consistent regardless of model complexity or encoder type. Ultimately, these findings challenge the general assumption that more historical data inherently yields better machine learning predictions, establishing targeted time-window guidelines for optimizing readmission prediction models.
[LG-8] Weisfeiler Lehman Test on Combinatorial Complexes: Generalized Expressive Power of Topological Neural Networks
链接: https://arxiv.org/abs/2605.00725
作者: Jiawen Chen,Qi Shao,Duxin Chen,Wenwu Yu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Combinatorial complexes have unified set-based (e.g., graphs, hypergraphs) and part-whole (e.g., simplicial, cellular complexes) structures into a common topological framework. Existing topological neural networks and Weisfeiler-Lehman variants remain fragmented, lacking a unified theoretical foundation for topological deep learning. In this work, we introduce the Combinatorial Complex Weisfeiler-Lehman (CCWL) test, an axiomatic-style extension of the WL test to combinatorial complexes. CCWL formalizes topological message passing through four types of neighborhood relation and provides a unified perspective on the expressive power of higher-order variants. We further prove that upper and lower neighborhoods are sufficient among the four adjacent WL tests to reach the expressivity of the full CCWL framework across topological structures of combinatorial complexes. Building on this framework, we also propose the Combinatorial Complex Isomorphism Network (CCIN) and evaluate it on synthetic and real-world benchmarks. Experimental results indicate CCIN outperforms baseline methods and offers a generalized expressive framework for topological deep learning.
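As background, the classic 1-WL color refinement that CCWL generalizes can be sketched in a few lines: each vertex repeatedly hashes its own color together with the multiset of its neighbors' colors. The combinatorial-complex version with its four neighborhood relations is the paper's contribution and is not reproduced here.

```python
# Sketch of standard 1-WL color refinement on a plain graph (adjacency dict).
def wl_refine(adj, rounds=3):
    colors = {v: 0 for v in adj}
    for _ in range(rounds):
        # signature = own color + sorted multiset of neighbor colors
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v]))) for v in adj}
        palette = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        colors = {v: palette[sigs[v]] for v in adj}
    return colors

# a path on 4 vertices: endpoints get one color, interior vertices another
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
colors = wl_refine(path)
```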
[LG-9] Aitchison Embeddings for Learning Compositional Graph Representations DATE ICML
链接: https://arxiv.org/abs/2605.00716
作者: Nikolaos Nakis,Chrysoula Kosma,Panagiotis Promponas,Michail Chatzianastasis,Giannis Nikolentzos
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: Accepted version to ICML; it will be updated soon with the camera-ready version
Abstract:Representation learning is central to graph machine learning, powering tasks such as link prediction and node classification. However, most graph embeddings are hard to interpret, offering limited insight into how learned features relate to graph structure. Many networks naturally admit a role-mixture view, where nodes are best described as mixtures over latent archetypal factors. Motivated by this structure, we propose a compositional graph embedding framework grounded in Aitchison geometry, the canonical geometry for comparing mixtures. Nodes are represented as simplex-valued compositions and embedded via isometric log-ratio (ILR) coordinates, which preserve Aitchison distances while enabling unconstrained optimization in Euclidean space. This yields intrinsically interpretable embeddings whose geometry reflects relative trade-offs among archetypes and supports coherent behavior under component restriction; we consider both fixed and learnable ILR bases. Across node classification and link prediction, our method achieves competitive performance with strong baselines while providing explainability by construction rather than post-hoc. Finally, subcompositional coherence enables principled component restriction: removing and renormalizing subsets preserves a well-defined geometry, which we exploit via subcompositional dimensionality removal to probe how archetype groups influence representations and predictions.
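The ILR construction is standard and easy to sketch: compositions are mapped to centered log-ratios and projected onto an orthonormal basis of the zero-sum hyperplane, so Aitchison distances become ordinary Euclidean distances. The Helmert-style basis below is one common fixed choice; the paper also considers learnable bases.

```python
import numpy as np

# Sketch of isometric log-ratio (ILR) coordinates for simplex-valued vectors.
def helmert_basis(D):
    """Orthonormal basis of the clr hyperplane (each row sums to zero)."""
    H = np.zeros((D - 1, D))
    for i in range(D - 1):
        H[i, : i + 1] = 1.0
        H[i, i + 1] = -(i + 1)
        H[i] /= np.linalg.norm(H[i])
    return H

def ilr(x):
    """Centered log-ratio followed by projection onto the orthonormal basis."""
    clr = np.log(x) - np.log(x).mean(axis=-1, keepdims=True)
    return clr @ helmert_basis(x.shape[-1]).T

comp = np.array([[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]])   # two 3-part compositions
z = ilr(comp)                                          # unconstrained 2D coordinates
```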
[LG-10] Deep Kernel Learning for Stratifying Glaucoma Trajectories
链接: https://arxiv.org/abs/2605.00708
作者: Bruce Rushing,Angela Danquah,Alireza Namazi,Arjun Dirghangi,Heman Shakeri
类目: Machine Learning (cs.LG)
*备注:
Abstract:Effectively stratifying patient risk in chronic diseases like glaucoma is a major clinical challenge. Clinicians need tools to identify patients at high risk of progression from sparse and irregularly-sampled electronic health records (EHRs). We propose a novel deep kernel learning (DKL) architecture that leverages a Gaussian Process (GP) backend. The GP’s kernel is defined by a transformer-based feature extractor applied to clinical-BERT embeddings to model glaucoma patient trajectories from multimodal EHR data. Our method successfully identifies three clinically distinct patient subgroups. Crucially, the model learns to decouple disease progression from current severity, identifying a high-risk group with a worsening trajectory despite having better average visual acuity than a second, stably poor group. This reveals that the model learns to identify progression risk rather than just the current disease state. This ability to stratify patients based on their risk trajectory progression offers a powerful tool for clinical decision support, enabling targeted interventions for high-risk individuals and improving the management of glaucoma care.
[LG-11] Evaluating the Architectural Reasoning Capabilities of LLM Provers via the Obfuscated Natural Number Game AAAI2026
链接: https://arxiv.org/abs/2605.00677
作者: Lixing Li
类目: Machine Learning (cs.LG)
*备注: 4 pages. Accepted as a short paper to the AAAI 2026 Spring Symposium on Machine Learning and Knowledge Engineering for Knowledge-Grounded Semantic Agents (MAKE 2026)
Abstract:While Large Language Models have achieved notable success on formal mathematics benchmarks such as MiniF2F, it remains unclear whether these results stem from genuine logical reasoning or semantic pattern matching against pre-training data. This paper identifies Architectural Reasoning, the ability to synthesize formal proofs using exclusively local axioms and definitions within an alien math domain, as a necessary capability for future automated theorem-discovery AI. We use the Obfuscated Natural Number Game as a benchmark to evaluate Architectural Reasoning. By renaming identifiers in the Natural Number Game in Lean 4, we created a zero-knowledge, closed environment. We evaluate state-of-the-art models, finding a universal latency tax where obfuscation increases inference time. The results also reveal a divergence in robustness: while general models (Claude-Sonnet-4.5, GPT-4o) suffer performance degradation, reasoning models (DeepSeek-R1, GPT-5, DeepSeek-Prover-V2) maintain the same accuracy despite the absence of semantic cues. These findings provide a quantitative metric for assessing the true capacity for mathematical reasoning.
[LG-12] Spiking Sequence Machines and Transformers
链接: https://arxiv.org/abs/2605.00662
作者: Joy Bose
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: 14 pages, 2 figures, 2 tables
Abstract:Sequence learning reduces to similarity-based retrieval over a temporally indexed representation space, a constraint on any sequence model, not a property of a specific architecture. We show that a spiking Sparse Distributed Memory sequence machine (2007) and the transformer (2017) independently instantiate the same five functional operations (encoding, context maintenance, associative retrieval, storage, and decoding), with cosine similarity as the shared retrieval primitive in both. We formalise a Phase-Latency Isomorphism showing that sinusoidal positional phase and spike timing are linearly related, and prove that dot product attention is invariant to this mapping up to a global scale factor on the positional component (Lemma 1). Empirically, frequency-compressed positional encoding fails to converge on a positionally demanding copy task, while a learned rank-based embedding matches or exceeds sinusoidal encoding, indicating that the critical property for positional representation is distance discriminability under dot-product similarity, not sinusoidal form. Time, phase, and rank are three instantiations of the same computational primitive, an ordered index whose structure survives similarity-based retrieval.
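The dot-product similarity structure of sinusoidal positional encodings, the property the paper argues is critical, can be checked directly: similarity between two positions depends only on their offset and peaks at zero offset. The dimensions and sequence length below are illustrative.

```python
import numpy as np

# Standard sinusoidal positional encoding and its dot-product similarity profile.
def sinusoidal_pe(pos, d=32):
    i = np.arange(d // 2)
    freqs = 1.0 / (10000 ** (2 * i / d))
    angles = pos[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

pe = sinusoidal_pe(np.arange(50))
# similarity between position 10 and every other position:
# sum_i cos((10 - p) * w_i), which depends only on the offset |10 - p|
sims = pe[10] @ pe.T
```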
[LG-13] Budget Constraints as Riemannian Manifolds
链接: https://arxiv.org/abs/2605.00649
作者: Michael Helcig,Dan Alistarh
类目: Machine Learning (cs.LG)
*备注:
Abstract:Assigning one of K options to each of N groups under a total cost budget is a recurring problem in machine learning, appearing in mixed-precision quantization, non-uniform pruning, and expert selection. The objective (model loss) depends jointly on all assignments and does not decompose across groups, which prevents combinatorial solvers from optimizing the true objective directly and limits them to proxy objectives. Evolutionary search evaluates the actual loss but lacks gradient information, while penalty-based methods provide gradients but enforce the budget only approximately and require sensitive hyperparameter tuning. We observe that under softmax relaxation, the budget constraint defines a smooth Riemannian manifold in logit space with particularly simple geometry: the normal vector is available in closed form, shifting logits along the cost vector changes expected cost monotonically, allowing binary-search retraction, and vector transport reduces to a single inner product. Building on this structure, we propose Riemannian Constrained Optimization (RCO), which augments a standard Adam update with tangent projection, binary-search retraction, and momentum transport. Combined with Gumbel straight-through estimation and budget-constrained dynamic programming for discrete feasibility, RCO enables first-order optimization of the true objective under exact budget enforcement, without introducing constraint hyperparameters. On synthetic knapsack problems with known optima, the manifold-based constraint handling recovers optimal solutions, whereas penalty methods plateau at 83% of optimal. On LLM compression tasks, including mixed-precision quantization and MoE expert pruning, RCO matches or exceeds evolutionary search methods while requiring 3x to 16x lower wall-clock cost on the evaluated configurations.
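The binary-search retraction can be sketched concretely: under a softmax relaxation, shifting each group's logits along the cost vector moves the expected cost monotonically, so a scalar bisection finds the shift that lands on the budget. Group counts, costs, and tolerances below are illustrative, not the paper's experimental setup.

```python
import numpy as np

# Sketch of binary-search retraction onto the budget manifold.
def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def expected_cost(logits, costs):
    """Total softmax-expected cost across all groups."""
    return float((softmax(logits) * costs).sum())

def retract(logits, costs, budget, tol=1e-6):
    """Bisect the shift t so that expected_cost(logits - t*costs) hits budget."""
    lo, hi = -50.0, 50.0
    for _ in range(200):
        t = 0.5 * (lo + hi)
        if expected_cost(logits - t * costs, costs) > budget:
            lo = t          # still over budget: push harder along the cost vector
        else:
            hi = t
        if hi - lo < tol:
            break
    return logits - t * costs

rng = np.random.default_rng(0)
costs = np.array([1.0, 2.0, 4.0, 8.0])   # per-option cost, shared by all groups
logits = rng.normal(size=(3, 4))         # 3 groups, 4 options each
budget = 9.0                             # total expected-cost budget
projected = retract(logits, costs, budget)
```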
[LG-14] PEACE: Cross-modal Enhanced Pediatric-Adult ECG Alignment for Robust Pediatric Diagnosis
链接: https://arxiv.org/abs/2605.00647
作者: Xinran Liu,Yuwen Li,Hongxiang Gao,Heyang Xu,Jianqing Li,Zongmin Wang,Chengyu Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Automated pediatric electrocardiogram (ECG) diagnosis remains challenging because models trained predominantly on adult data suffer from substantial cross-population mismatch, while pediatric labels are often scarce. We present PEACE (Pediatric-Adult ECG Alignment via Cross-modal Enhancement), a structured cross-modal alignment framework for adult-to-pediatric ECG transfer. PEACE integrates tri-axial clinical semantic decomposition, label-query feature extraction, and curriculum-gated optimization to align transferable adult ECG representations with pediatric diagnostic targets. Since ZZU-pECG provides no paired clinical reports, we generate label-conditioned semantic descriptors using Gemini with concise clinical prompts and use them only as auxiliary training supervision; inference remains ECG-only. On ZZU-pECG, PEACE achieves 59.39%, 79.03%, and 90.89% AUC under zero-shot, 50-shot, and full fine-tuning settings, respectively, and reaches 96.65% AUC on the shared PTB-XL label space. These results suggest that structured clinical semantic supervision can improve low-resource adult-to-pediatric ECG transfer, while prospective clinical validation and more explicit age-aware modeling remain necessary before real-world deployment.
[LG-15] From Prediction to Practice: A Task-Aware Evaluation Framework for Blood Glucose Forecasting
链接: https://arxiv.org/abs/2605.00645
作者: Alireza Namazi,Heman Shakeri
类目: Machine Learning (cs.LG)
*备注:
Abstract:Clinical time-series forecasting is increasingly studied for decision support, yet standard aggregate metrics can obscure whether a model is actually useful for the task it is meant to serve. In safety-critical settings, low average error can coexist with dangerous failures in exactly the high-risk regimes that matter most. We present a task-aware evaluation framework for blood glucose forecasting built around two downstream uses: hypoglycemia early warning and insulin dosing decision support. For early warning, we evaluate on real data from three clinical cohorts using event-level recall and false alarms per patient-day, metrics that reflect operational alarm burden rather than aggregate accuracy. We show that models appearing acceptable overall, with recall above 0.9 on the full test set, can fail badly in the post-bolus slice, where insulin-on-board is elevated and missed warnings carry the greatest clinical consequences. Standard forecasting evaluation, however, does not test whether a model can reason about the effects of actions, a requirement for supporting insulin dosing decisions. We therefore add a second, interventional arm using the FDA-accepted UVA/Padova simulator, where we evaluate whether forecasters can predict glucose responses to altered insulin plans in paired factual/counterfactual scenarios. We show that models that look strong on real-data forecasting often fail to predict the direction, magnitude, or ranking of intervention effects, and choose poor insulin doses when evaluated under a clinically motivated cost. Taken together, the two arms reveal a consistent gap between forecasting accuracy and task-relevant usefulness. We release the benchmark, the standardized preprocessing pipeline for public cohorts, and the simulator-based interventional dataset as a reproducible toolkit.
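The event-level metrics used in the early-warning arm can be sketched as follows: an event counts as detected if any alarm fires within its warning window, and alarms matching no event are counted per patient-day. The 30-minute horizon and the timestamps below are illustrative, not the benchmark's exact protocol.

```python
# Sketch of event-level recall and false alarms per patient-day.
def event_level_metrics(events, alarms, horizon=30, days=1.0):
    """events/alarms are minute timestamps; horizon = warning window (min)."""
    detected = sum(
        any(e - horizon <= a <= e for a in alarms) for e in events
    )
    # alarms not within the horizon of any event count as false alarms
    false_alarms = sum(
        not any(e - horizon <= a <= e for e in events) for a in alarms
    )
    recall = detected / len(events) if events else 1.0
    return recall, false_alarms / days

events = [120, 480]          # two hypoglycemia onsets (minutes)
alarms = [100, 470, 900]     # model alarms
recall, fa_per_day = event_level_metrics(events, alarms)
```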
[LG-16] Bridging Graph Drawing and Dimensionality Reduction with Stochastic Stress Optimization
链接: https://arxiv.org/abs/2605.00641
作者: Daniel Hangan,Stephen Kobourov,Jacob Miller
类目: Machine Learning (cs.LG)
*备注: To appear in GDxDR workshop 2026
Abstract:Both Dimensionality Reduction (DR) and Graph Drawing (GD) aim to visualize abstract, non-linear structures, yet rely on different optimization paradigms. This contrast is evident in Multidimensional Scaling (MDS), which typically depends on the SMACOF algorithm despite graph drawing results showing that simpler stochastic optimization schemes can be more effective for the same objective. We bridge these domains by adapting Stochastic Gradient Descent (SGD) techniques from graph drawing to vector data embedding. We present a scikit-learn compatible estimator that minimizes global stress through local pairwise updates, improving upon the existing implementation. Experiments on standard high-dimensional benchmarks show that our stochastic solver converges substantially faster than SMACOF while achieving comparable or lower stress.
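The stochastic pairwise stress update borrowed from graph drawing can be sketched directly: each step picks one pair of points and moves both toward their target distance with an annealed, distance-weighted step. The schedule and iteration budget below are illustrative rather than the released estimator's defaults.

```python
import numpy as np

# Sketch of SGD-style stress minimization: local pairwise moves toward
# target distances, with a linearly annealed step size capped at 1.
def sgd_mds(D, dim=2, iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    X = rng.normal(size=(n, dim))
    for step in range(iters):
        i, j = rng.choice(n, size=2, replace=False)
        d_ij = D[i, j]
        delta = X[i] - X[j]
        cur = np.linalg.norm(delta) + 1e-12
        mu = min(1.0, (1.0 / d_ij**2) * (1.0 - step / iters))
        r = (cur - d_ij) / (2 * cur) * mu * delta
        X[i] -= r                 # move both endpoints toward the target distance
        X[j] += r
    return X

# four points on a unit square: the distances are fully realizable in 2D
pts = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], float)
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
X = sgd_mds(D)
```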
[LG-17] Knowing when to trust machine-learned interatomic potentials
链接: https://arxiv.org/abs/2605.00640
作者: Shams Mehdi,Ilkwon Cho,Olexandr Isayev
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注:
Abstract:Prevailing machine-learned interatomic potential (MLIP) uncertainty-quantification methods rely on ensembles of independently trained backbones. These methods scale unfavorably with foundation-scale MLIPs, and their member-disagreement signals correlate weakly with per-molecule prediction error. Here we probe the frozen per-atom representations of a pretrained MLIP with a compact discriminative classifier, recasting MLIP uncertainty quantification as selective classification rather than error regression. The resulting method, PROBE (Post-hoc Reliability frOm Backbone Embeddings), produces a per-prediction reliability probability that monotonically tracks actual error without modification to the underlying model. Across large held-out evaluation sets and two structurally distinct MLIP architectures, PROBE outperforms ensemble disagreement as a binary reliability signal, an advantage that strengthens with the expressiveness of the backbone representation, implying a favorable scaling trajectory toward foundation-scale MLIPs. Multi-head self-attention additionally yields per-atom importance maps, providing chemically interpretable diagnostics at no additional computational cost. PROBE is post-hoc and architecture-agnostic, and is directly deployable on any MLIP that exposes per-atom representations.
[LG-18] Unlearning Offline Stochastic Multi-Armed Bandits
链接: https://arxiv.org/abs/2605.00638
作者: Zichun Ye,Runqi Wang,Xuchuang Wang,Xutong Liu,Shuai Li,Mohammad Hajiesmaili
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注: First two authors made an equal contribution
Abstract:Machine unlearning aims to unlearn data points from a learned model, offering a principled way to process data-deletion requests and mitigate privacy risks without full retraining. Prior work has mainly studied unsupervised / supervised machine unlearning, leaving unlearning for sequential decision-making systems far less understood. We initiate the first study of a foundational sequential decision-making problem: offline stochastic multi-armed bandits (MAB). We formalize the privacy constraint for offline MAB and measure utility by the post-unlearning decision quality. We conduct a systematic study of both single- and multi-source unlearning scenarios under two data-generation models, the fixed-sample model and the distribution model. For these settings, our algorithmic design is built on two canonical base algorithms: Gaussian mechanism and rollback, and we propose adaptive algorithms that switch between them according to the data regime and privacy constraint. We further introduce a mixing procedure that elucidates the rationale behind these baselines. We provide performance guarantees across the above settings and establish lower bounds under both dataset models. Experiments validate the predicted tradeoffs and demonstrate the effectiveness of the proposed methods.
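The two base algorithms can be sketched in their simplest form: rollback recomputes the arm statistic from the retained samples, and the Gaussian mechanism adds calibrated noise to the updated estimate. The noise scale here is a placeholder; the paper derives the calibration from the privacy constraint.

```python
import numpy as np

# Sketch of the two canonical unlearning primitives for one bandit arm.
def rollback(samples, delete_idx):
    """Exact recomputation of the arm's mean reward without the deleted samples."""
    keep = np.delete(samples, delete_idx)
    return keep.mean()

def gaussian_unlearn(samples, delete_idx, sigma=0.1, seed=0):
    """Updated mean perturbed by Gaussian noise (sigma is an illustrative scale)."""
    rng = np.random.default_rng(seed)
    return rollback(samples, delete_idx) + rng.normal(scale=sigma)

arm = np.array([0.9, 0.1, 0.8, 0.7, 0.75])
exact = rollback(arm, [1])            # mean without the deleted reward
noisy = gaussian_unlearn(arm, [1])
```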
[LG-19] Class Angular Distortion Index for Dimensionality Reduction
链接: https://arxiv.org/abs/2605.00637
作者: Kaviru Gunaratne,Stephen Kobourov,Jacob Miller
类目: Machine Learning (cs.LG)
*备注: To appear in EuroVis 2026 proceedings
Abstract:Dimensionality reduction (DR) techniques are often characterized by whether they preserve global, high-level structures in the data or local, neighborhood structures. This distinction matters in visualization: global methods can obscure clusters while local methods can over-emphasize them. Yet, even when clusters appear distinct, their relative arrangement in the projection may be arbitrary or misleading, a common issue in techniques such as t-SNE and UMAP. Existing cluster quality metrics either only measure cluster separability or assume spherical, globular clusters in the original space. We introduce the Class Angular Distortion Index (CADI), a metric that uses internal angles among point triples to determine the faithfulness of cluster organization in a projection. We show cases on both real and synthetic data where existing cluster metrics fail, but CADI provides an interpretable result. Since it relies on computing angles, CADI is also differentiable, enabling optimization. We demonstrate this with a CADI-based DR technique.
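The angle primitive underlying CADI is straightforward to sketch: compute the internal angle at a vertex of a point triple in the original space and in the projection, and compare. The aggregation of such comparisons into the full index is the paper's contribution and is not reproduced here.

```python
import numpy as np

# Sketch of the differentiable angle primitive behind CADI.
def angle_at(a, b, c):
    """Angle at vertex a formed by points b and c."""
    u, v = b - a, c - a
    cos = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

high = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0]], float)   # triple in 3D
low = np.array([[0, 0], [1, 0], [0, 1]], float)             # its 2D projection
distortion = abs(angle_at(*high) - angle_at(*low))          # 0: angle preserved
```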
[LG-20] Decouple before Integration: Test-time Synthesis of SFT and RLVR Task Vectors
链接: https://arxiv.org/abs/2605.00610
作者: Chaohao Yuan,Chenghao Xiao,Yu Rong,Hong Cheng,Long-Kai Huang
类目: Machine Learning (cs.LG)
*备注:
Abstract:SFT and RLVR represent two fundamental yet distinct paradigms for LLM post-training, each excelling in distinct dimensions. SFT expands knowledge breadth while RLVR enhances reasoning depth. Yet integrating these complementary strengths remains a formidable challenge. Sequential training can cause catastrophic forgetting, and joint optimization often suffers from severe gradient conflicts. We analyze SFT and RLVR through the lens of task vectors and reveal three structural properties behind these failures: a 30x magnitude disparity, 45% sign interference, and heterogeneous module-wise update distributions. These findings show SFT and RLVR are difficult to integrate directly, but they also suggest that the two paradigms modify partly complementary components of the model. Motivated by these observations, we propose Decoupled Test-time Synthesis (DoTS), a post-hoc framework that allows SFT and RLVR checkpoints to be trained independently and synthesizes their capabilities only at inference time via task vector arithmetic, without updating model parameters. To reduce interference, DoTS applies selective sparsification with norm-preserving rescaling. It then uses Bayesian optimization on a small set of unlabeled queries to search for combination coefficients on the Pareto frontier of consistency and perplexity. Empirically, DoTS matches or exceeds the performance of training-based SFT–RLVR integration methods across multiple mathematical reasoning benchmarks, incurring only ~3% of the computational cost. When applied to stronger post-trained checkpoints, DoTS surpasses SOTA models and generalizes to out-of-domain benchmarks without re-tuning. Code is available at this https URL.
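The test-time synthesis can be sketched with plain task-vector arithmetic: subtract the base from each checkpoint, sparsify with a norm-preserving rescale, and recombine with coefficients. The 50% keep ratio and the fixed coefficients below are illustrative; the paper selects coefficients by Bayesian optimization on unlabeled queries.

```python
import numpy as np

# Sketch of task-vector synthesis with selective sparsification and
# norm-preserving rescaling, on flattened stand-in parameter vectors.
def sparsify(v, keep=0.5):
    """Zero out the smallest-magnitude entries, rescaling to preserve the norm."""
    k = int(np.ceil(keep * v.size))
    thresh = np.sort(np.abs(v))[-k]
    mask = np.abs(v) >= thresh
    out = v * mask
    return out * (np.linalg.norm(v) / (np.linalg.norm(out) + 1e-12))

rng = np.random.default_rng(0)
base = rng.normal(size=100)
sft = base + rng.normal(scale=0.1, size=100)    # stand-in SFT checkpoint
rlvr = base + rng.normal(scale=0.1, size=100)   # stand-in RLVR checkpoint

tau_sft = sparsify(sft - base)                  # sparsified task vectors
tau_rlvr = sparsify(rlvr - base)
merged = base + 0.6 * tau_sft + 0.4 * tau_rlvr  # synthesized at test time
```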
[LG-21] Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts
链接: https://arxiv.org/abs/2605.00604
作者: Man Yung Wong (Russell)
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: Code: this https URL
Abstract:Sparse MoE routing fails at domain transitions, where the current token belongs to one distribution and the next to another. In a controlled experiment (4 experts, 5 seeds), standard affinity routing assigns only 0.006 +/- 0.001 probability to the correct expert at the transition. Three lightweight gate modifications raise this to 0.748 +/- 0.002 (124x), cutting experts needed for 99% coverage from infeasible to a small constant: temporal memory (beta), a per-expert LIF membrane potential accumulating routing context across tokens; precision-weighted gating (Pi), a per-expert inverse variance of recent prediction error, yielding 31x contrast between reliable and unreliable experts; and anticipatory routing, a next-state predictor conditioned on the beta-accumulated hidden state. The mechanisms draw from Friston’s Free Energy Principle and use LIF dynamics from spiking neural networks. An ablation across all 2^3 subsets reveals a super-additive beta x Ant interaction: anticipation alone gives nothing (+0.000 +/- 0.001); beta alone gives modest gain (+0.295 +/- 0.013); combined they close 75% of the oracle gap (+0.741 +/- 0.002, exceeding the sum by +0.446 +/- 0.014). This is structural: a stateless predictor cannot detect approaching transitions because pre-transition tokens are distributionally identical to within-domain tokens. In a character-level MoE LM (5 seeds), beta-routing reduces transition-step BPC from 6.56 +/- 0.01 (Standard) to 4.01 +/- 0.15 (beta-MoE); the beta + Ant gate places 0.86 +/- 0.02 probability on the correct domain expert before that domain appears in input, vs 0.42 +/- 0.12 for Standard MoE. Reference implementations (~200 lines each): this https URL
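The temporal-memory gate (beta) alone can be sketched as a leaky accumulator over per-token expert affinities, so routing at a token reflects recent history rather than that token in isolation. The decay constant and affinity sequence are illustrative, and the anticipatory and precision-weighting components are not shown.

```python
import numpy as np

# Sketch of the beta mechanism: per-expert LIF-style membrane potentials
# accumulating routing evidence across tokens.
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def lif_route(affinities, decay=0.8):
    """Accumulate per-expert membrane potential over a token sequence."""
    membrane = np.zeros(affinities.shape[1])
    probs = []
    for a in affinities:
        membrane = decay * membrane + a      # leaky accumulation of affinity
        probs.append(softmax(membrane))
    return np.array(probs)

# 6 tokens from expert 0's domain, then a transition to expert 1's domain
aff = np.array([[2.0, 0.0]] * 6 + [[0.0, 2.0]] * 2)
probs = lif_route(aff)
```

At the transition token the accumulated history still dominates a single new observation, which is the smoothing behavior the beta-only ablation isolates.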
[LG-22] Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance ICML2026
链接: https://arxiv.org/abs/2605.00553
作者: Minchan Kwon,Sunghyun Baek,Minseo Kim,Jaemyung Yu,Dongyoon Han,Junmo Kim
类目: Machine Learning (cs.LG)
*备注: ICML 2026 Spotlight
Abstract:Large Language Model (LLM) Red-Teaming, which proactively identifies vulnerabilities of LLMs, is an essential process for ensuring safety. Finding effective and diverse attacks in red-teaming is important, but achieving both is challenging. Generative Flow Networks (GFNs) that perform distribution matching are a promising method, but they are notorious for training instability and mode collapse. In particular, unstable rewards in red-teaming accelerate mode collapse. We propose Stable-GFN (S-GFN), which eliminates partition function Z estimation in GFN and reduces training instability. S-GFN avoids Z-estimation through pairwise comparisons and employs a robust masking methodology against noisy rewards. Additionally, we propose a fluency stabilizer to prevent the model from getting stuck in local optima that produce gibberish. S-GFN provides more stable training while maintaining the optimal policy of GFN. We demonstrate the superior attack performance and diversity of S-GFN across various settings.
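The Z-cancellation idea behind the pairwise comparison can be shown schematically: for a single trajectory, trajectory balance involves log Z, but for a pair of trajectories the partition function cancels, leaving only flow and reward log-ratios. The numbers below are illustrative, and the full S-GFN objective adds the masking and fluency terms not shown here.

```python
import numpy as np

# Schematic pairwise trajectory-balance residual: Z cancels when two
# trajectories are compared, so no partition-function estimate is needed.
def pairwise_tb_loss(logpf_a, logR_a, logpf_b, logR_b):
    """Squared error between the flow log-ratio and the reward log-ratio."""
    return ((logpf_a - logpf_b) - (logR_a - logR_b)) ** 2

# when flows are proportional to rewards, the pairwise loss is zero
loss = pairwise_tb_loss(np.log(0.2), np.log(2.0), np.log(0.3), np.log(3.0))
```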
[LG-23] Tempus: A Temporally Scalable Resource-Invariant GEMM Streaming Framework for Versal AI Edge
链接: https://arxiv.org/abs/2605.00536
作者: M. Grailoo,J. Núñez-Yáñez
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Hardware Architecture (cs.AR); Machine Learning (cs.LG); Performance (cs.PF); Robotics (cs.RO)
*备注: 11 pages, 3 figures, 8 tables, 4 algorithms
Abstract:Scaling laws for Large Language Models (LLMs) establish that model quality improves with computational scale, yet edge deployment imposes strict constraints on compute, memory, and power. Since General Matrix Multiplication (GEMM) accounts for up to 90% of inference time, efficient GEMM acceleration is critical for edge AI. The Adaptive Intelligent Engines available in the AMD Versal adaptive SoCs are well suited for this task, but existing state-of-the-art (SOTA) frameworks maximize performance through spatial scaling, distributing workloads across hundreds of cores – an approach that fails on resource-limited edge SoCs due to physical implementation failures, bandwidth saturation, and excessive resource consumption. We propose Tempus, a Resource-Invariant Temporal GEMM framework for the AMD Versal AI Edge SoC. Rather than expanding hardware resources with matrix size, Tempus employs a fixed compute block of 16 AIE-ML cores, achieving scalability through iterative graph execution and algorithmic data tiling and replication in the Programmable Logic. High-speed cascade streaming ensures low-latency partial sum reduction at Initiation Interval (II) of 1, while a deadlock-free DATAFLOW protocol maximizes transfer-compute overlap and PLIO reuse. Evaluated on GEMM workloads, Tempus achieves 607 GOPS at 10.677 W total on-chip power. By characterizing system-level efficiency through the Platform-Aware Utility (PAU) metric, we prove that Tempus achieves a 211.2x higher prominence factor than the leading spatial SOTA (ARIES). Furthermore, the framework maintains a 0.00% utilization of URAM/DSP, yielding 22.0x core frugality, 7.1x power frugality, and a 6.3x reduction in I/O demand, establishing a sustainable, scalable foundation for edge LLM inference.
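The temporal-scaling idea, a fixed-size compute block iterated over tiles with partial-sum accumulation, can be illustrated in NumPy. This is a software stand-in only: the actual framework targets a 16-core AIE-ML block with cascade streaming, and the tile size here is arbitrary.

```python
import numpy as np

def temporal_tiled_gemm(A, B, tile=16):
    """GEMM via a fixed compute tile reused over time: the same
    tile x tile block is invoked iteratively over all (i, j, k)
    tiles, accumulating partial sums, instead of growing hardware
    resources with the matrix size (spatial scaling)."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):  # partial-sum accumulation
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

A = np.random.default_rng(0).standard_normal((48, 32))
B = np.random.default_rng(1).standard_normal((32, 64))
C = temporal_tiled_gemm(A, B)
```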
[LG-24] A Comparative Study of QSPR Methods on a Unique Multitask PAMPA dataset
链接: https://arxiv.org/abs/2605.00508
作者: András Formánek,Anna Vincze,Richárd Bicsak,Yves Moreau,György T. Balogh,Adam Arany
类目: Machine Learning (cs.LG)
*备注:
Abstract:We present a unique, multitask dataset comprising 143 drug and drug candidate molecules, each evaluated on in vitro, parallel artificial-membrane permeability assays (PAMPA) using six different model membranes. Using this resource, we systematically assess the effectiveness of various molecular descriptors and regression models in predicting passive membrane permeability. The studied models range from simple linear regression to a modern pre-trained transformer architecture. Particular attention is given to the trade-off between predictive performance and model interpretability, highlighting the challenges introduced by machine learning approaches. To our knowledge, this is the most comprehensive study on simultaneous modeling of multiple organ-specific PAMPA membranes to date, offering novel insights into membrane-specific permeability profiles. We found that expert-designed physico-chemical property descriptors are more fitting for a limited-sample-size permeability study than deep learning based representations. Related DOI: https://doi.org/10.1021/acs.jcim.5c02931
[LG-25] LambdaRankIC: Directly Optimizing Rank IC for Financial Prediction
链接: https://arxiv.org/abs/2605.00501
作者: Yan Lin,Yihong Su,Yi Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:In financial predictions, the performance of machine learning models is often assessed by Rank IC, which is the Spearman rank correlation between the model predictions and the realized asset returns. Despite its wide adoption, most existing models are trained using regression losses or ranking objectives that may not align with Rank IC. We propose LambdaRankIC, a novel learning-to-rank approach that directly optimizes Rank IC. We circumvent the non-differentiability of the ranking operator by deriving the closed-form expression for the lambda gradients induced by the pairwise rank swaps, which enables efficient gradient-based optimization within the LambdaRank framework. We implement LambdaRankIC as a custom objective in XGBoost. Theoretically, we show that our approach optimizes an upper bound on Rank IC. We evaluate the proposed approach on both simulated and real-world financial data. In simulation studies, LambdaRankIC accurately recovers the true ranking structure in noiseless settings and consistently outperforms regression-based and NDCG-oriented ranking methods under low signal-to-noise ratios and heavy-tailed noise regimes. In empirical experiments using real market data, LambdaRankIC achieves the best out-of-sample performance on evaluation metrics commonly used in finance, including Rank IC, ICIR, monthly return, and Sharpe ratio. These results show that directly optimizing Rank IC can yield substantial improvements over conventional learning objectives in financial predictions when the full-order ranking quality is the primary goal.
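The two core quantities, Rank IC and pairwise-swap lambda gradients, can be sketched as follows. This is an illustrative brute-force version, not the paper's closed-form derivation or its XGBoost integration: each pair's contribution is approximated by the Rank IC change the swap would induce.

```python
import numpy as np

def rank_ic(pred, ret):
    """Spearman rank correlation between predictions and realized returns
    (assumes no ties; ranks via double argsort)."""
    r1 = pred.argsort().argsort().astype(float)
    r2 = ret.argsort().argsort().astype(float)
    return np.corrcoef(r1, r2)[0, 1]

def lambda_gradients(pred, ret):
    """Toy LambdaRank-style gradients for Rank IC: accumulate, for every
    pair, the Rank IC change induced by swapping the two predictions,
    pushing the pair in the direction that increases Rank IC."""
    n = len(pred)
    lam = np.zeros(n)
    base = rank_ic(pred, ret)
    for i in range(n):
        for j in range(i + 1, n):
            swapped = pred.copy()
            swapped[i], swapped[j] = swapped[j], swapped[i]
            delta = rank_ic(swapped, ret) - base
            sign = np.sign(pred[i] - pred[j])
            lam[i] += delta * sign
            lam[j] -= delta * sign
    return lam

pred = np.array([0.1, 0.5, 0.2, 0.9])
ret = np.array([0.01, 0.03, 0.02, 0.04])  # same ordering as pred
lam = lambda_gradients(pred, ret)
```

In the LambdaRank framework these per-instance lambdas would be fed to a gradient-boosting library as a custom objective.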
[LG-26] Scaling Federated Linear Contextual Bandits via Sketching
链接: https://arxiv.org/abs/2605.00500
作者: Hantao Yang,Hong Xie,Xutong Liu,Defu Lian
类目: Machine Learning (cs.LG)
*备注:
Abstract:In federated contextual linear bandits, high data dimensionality incurs prohibitive computation and communication costs: local agents perform O(d^3)-time determinant computation and upload O(d^2) parameters, making existing algorithms unscalable, where d is the data dimension. To relieve these scaling bottlenecks, this paper proposes Federated Sketch Contextual Linear Bandits (FSCLB). On the computation side, FSCLB uses SVD to indirectly obtain the determinant required for communication, eliminating the prohibitive cost of direct determinant calculation and cutting complexity from O(d^3) to O(l^2 d) per round, where l \ll d is the sketch size. On the communication side, FSCLB introduces a double-sketch strategy that reduces both upload and download costs from O(d^2) to O(ld). Naively incorporating sketch updates into federated contextual linear bandits can destroy the local increment and invalidate the asynchronous communication condition; FSCLB solves this by replacing the covariance matrix with the sketch matrix when deciding whether to communicate. Theoretically, FSCLB achieves a regret bound of \widetilde{O}((\sqrt{d}+\sqrt{M}\varepsilon_l)\sqrt{lT}), where \varepsilon_l is upper bounded by the spectral tail of the covariance matrix; when l exceeds the rank of the covariance matrix, the bound simplifies to \widetilde{O}(\sqrt{ldT}), matching the optimal no-sketch regret. Experiments on both synthetic and real-world datasets show that FSCLB significantly reduces computational and communication costs by over 90% while sacrificing only a negligible amount of cumulative reward.
[LG-27] Distance metric learning for conditional anomaly detection
链接: https://arxiv.org/abs/2605.00490
作者: Michal Valko,Milos Hauskrecht
类目: Machine Learning (cs.LG)
*备注: Published at FLAIRS 2008 (21st International Florida AI Research Society Conference)
Abstract:Anomaly detection methods can be very useful in identifying unusual or interesting patterns in data. A recently proposed conditional anomaly detection framework extends anomaly detection to the problem of identifying anomalous patterns on a subset of attributes in the data. The anomaly always depends (is conditioned) on the value of remaining attributes. The work presented in this paper focuses on instance-based methods for detecting conditional anomalies. The methods depend heavily on the distance metric that lets us identify examples in the dataset that are most critical for detecting the anomaly. To optimize the performance of such methods we study and devise a metric learning method that learns the distance metric to reflect best the conditional anomaly pattern.
[LG-28] Revealing graph bandits for maximizing local influence AISTATS2016
链接: https://arxiv.org/abs/2605.00489
作者: Alexandra Carpentier,Michal Valko
类目: Machine Learning (cs.LG)
*备注: Published at AISTATS 2016 (19th International Conference on Artificial Intelligence and Statistics)
Abstract:We study a graph bandit setting where the objective of the learner is to detect the most influential node of a graph by requesting as little information from the graph as possible. One of the relevant applications for this setting is marketing in social networks, where the marketer aims at finding and taking advantage of the most influential customers. The existing approaches for bandit problems on graphs require either partial or complete knowledge of the graph. In this paper, we do not assume any knowledge of the graph, but we consider a setting where it can be gradually discovered in a sequential and active way. At each round, the learner chooses a node of the graph and the only information it receives is a stochastic set of the nodes that the chosen node is currently influencing. To address this setting, we propose BARE, a bandit strategy for which we prove a regret guarantee that scales with the detectable dimension, a problem dependent quantity that is often much smaller than the number of nodes.
[LG-29] Trading off rewards and errors in multi-armed bandits AISTATS2017
链接: https://arxiv.org/abs/2605.00488
作者: Akram Erraqabi,Alessandro Lazaric,Michal Valko,Emma Brunskill,Yun-En Liu
类目: Machine Learning (cs.LG)
*备注: Published at AISTATS 2017 (20th International Conference on Artificial Intelligence and Statistics)
Abstract:In multi-armed bandits, the most-explored arms are the most informative, while reward maximization typically pulls only the best arm. We study the tradeoff between identifying arm means accurately and accumulating reward, and present an algorithm with regret guarantees that interpolates between the two objectives. We provide both upper and lower bounds and validate empirically.
[LG-30] Near-optimal and Efficient First-Order Algorithm for Multi-Task Learning with Shared Linear Representation
链接: https://arxiv.org/abs/2605.00473
作者: Shihong Ding,Fangyu Du,Cong Fang
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Multi-task learning (MTL) has emerged as a pivotal paradigm in machine learning by leveraging shared structures across multiple related tasks. Despite its empirical success, the development of likelihood-based efficiently solvable algorithms–even for shared linear representations–remains largely underdeveloped, primarily due to the non-convex structure intrinsic to matrix factorization. This paper introduces a first-order algorithm that jointly learns a shared representation and task-specific parameters, with guaranteed efficiency. Notably, it converges in \widetilde{\mathcal{O}}(1) iterations and attains a near-optimal estimation error of \widetilde{\mathcal{O}}(dk/(TN)), improving over existing likelihood-based methods by a factor of k, where d, k, T, N denote input dimension, representation dimension, task count, and samples per task, respectively. Our results justify that likelihood-based first-order methods can efficiently solve the MTL problem.
[LG-31] Batch Normalization for Neural Networks on Complex Domains
链接: https://arxiv.org/abs/2605.00467
作者: Xuan Son Nguyen,Nistor Grozavu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Riemannian neural networks have proven effective in solving a variety of machine learning tasks. The key to their success lies in the development of principled Riemannian analogs of fundamental building blocks in deep neural networks (DNNs). Among those, Riemannian batch normalization (BN) layers have shown to enhance training stability and improve accuracy. In this paper, we propose BN layers for neural networks on complex domains. The proposed layers have close connections with existing Riemannian BN layers. We derive essential components for practical implementations of BN layers on some complex domains which are less studied in previous works, e.g., the Siegel disk domain. We conduct experiments on radar clutter classification, node classification, and action recognition demonstrating the efficacy of our method.
[LG-32] CleanBase: Detecting Malicious Documents in RAG Knowledge Databases
链接: https://arxiv.org/abs/2605.00460
作者: Weifei Jin,Xilong Wang,Wei Zou,Jinyuan Jia,Neil Gong
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Retrieval-augmented generation (RAG) is vulnerable to prompt injection attacks, in which an adversary inserts malicious documents containing carefully crafted injected prompts into the knowledge database. When a user issues a question targeted by the attack, the RAG system may retrieve these malicious documents, whose injected prompts mislead it into generating attacker-specified answers, thereby compromising the integrity of the RAG system. In this work, we propose CleanBase, a method to detect malicious documents within a knowledge database. Our key insight is that malicious documents crafted for the same attack-targeted questions often exhibit high semantic similarity, as attackers deliberately make them consistent to improve attack success rates. Accordingly, CleanBase constructs a similarity graph over the knowledge database, where each node represents a document and an edge connects two nodes if their semantic similarity–computed using an embedding model–exceeds a statistically determined threshold. Due to their inherent similarity, malicious documents tend to form cliques within this graph. CleanBase detects such cliques and flags the corresponding documents as malicious. We theoretically derive upper bounds on CleanBase’s false positive and false negative rates and empirically validate its effectiveness. Experimental results across multiple datasets and prompt injection attacks demonstrate that CleanBase accurately detects malicious documents and effectively safeguards RAG systems. Our source code is available at this https URL.
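The detection idea, near-duplicate malicious documents forming cliques in an embedding-similarity graph, can be sketched with a toy greedy clique search. The fixed threshold and minimum clique size below are illustrative simplifications; the paper determines its threshold statistically.

```python
import numpy as np

def flag_malicious(embs, thresh=0.9, min_clique=3):
    """Flag documents belonging to a group of mutually high-similarity
    embeddings, i.e. a clique in the thresholded similarity graph
    (greedy clique growth; illustrative, not the paper's procedure)."""
    X = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sim = X @ X.T                     # cosine similarities
    n = len(embs)
    adj = (sim > thresh) & ~np.eye(n, dtype=bool)
    flagged = set()
    for seed in range(n):
        clique = [seed]
        for v in range(n):            # grow a clique greedily from the seed
            if v != seed and all(adj[v, u] for u in clique):
                clique.append(v)
        if len(clique) >= min_clique:
            flagged.update(clique)
    return sorted(flagged)

rng = np.random.default_rng(0)
benign = rng.standard_normal((5, 16))
base = rng.standard_normal(16)
malicious = base + 0.05 * rng.standard_normal((3, 16))  # near-duplicates
flagged = flag_malicious(np.vstack([benign, malicious]))
```

The three near-duplicate documents (indices 5-7) form a clique and are flagged, while the independent benign embeddings do not exceed the similarity threshold.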
[LG-33] Federated Learning with Hypergradient-based Online Update of Aggregation Weights
链接: https://arxiv.org/abs/2605.00458
作者: Ayano Nakai-Kasai,Tadashi Wadayama
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Federated learning using mobile and Internet of Things devices requires not only the ability to handle heterogeneity of clients’ data distributions but also high adaptability to varying communication environments. We propose FedHAW (Federated Learning with Hypergradient-based update of Aggregation Weights) that implements online updates of aggregation weights. FedHAW updates the aggregation weights by using hypergradient, the gradient of the objective function with respect to the weights, which can be calculated with low computational overhead. Simulation results show that the proposed method possesses high generalization performance in heterogeneous environments and high robustness to communication errors.
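The hypergradient is cheap because the aggregate is linear in the weights: for theta(w) = sum_k w_k * theta_k, the gradient of the validation loss with respect to w_k is the inner product of dL/dtheta with theta_k. The sketch below is illustrative only; the simplex parameterization via softmax logits and the learning rate are assumptions, not details from the paper.

```python
import numpy as np

def update_weights(w, client_models, val_grad_fn, lr=0.5):
    """One hypergradient step on the aggregation weights.
    client_models: (K, d) array of client parameter vectors.
    val_grad_fn:   maps the aggregate theta to dL/dtheta."""
    theta = client_models.T @ w            # aggregate model theta(w)
    g = val_grad_fn(theta)                 # dL/dtheta at the aggregate
    hypergrad = client_models @ g          # one inner product per client
    logits = np.log(w) - lr * hypergrad    # gradient step in logit space
    w_new = np.exp(logits - logits.max())
    return w_new / w_new.sum()             # stay on the simplex

# Toy: the validation optimum is client 0's model, so its weight grows.
target = np.array([1.0, 0.0])
clients = np.array([[1.0, 0.0],    # client 0 matches the validation data
                    [0.0, 1.0]])   # client 1 does not
w = np.array([0.5, 0.5])
for _ in range(20):
    w = update_weights(w, clients, lambda th: th - target)
```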
[LG-34] A Policy-Driven DRL Framework for System-Level Tradeoff Control in NR-U/Wi-Fi Coexistence
链接: https://arxiv.org/abs/2605.00457
作者: Po-Heng Chou,Yi-Fang Yu,Shou-Yu Chen,Chiapin Wang
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 12 pages, 13 figures, 1 table, submitted to IEEE Systems Journal
Abstract:The coexistence of NR-U and Wi-Fi in unlicensed spectrum introduces a system-level resource coordination problem, where heterogeneous channel access mechanisms lead to a significant imbalance in spectrum utilization and degraded Wi-Fi performance. To address this challenge, we propose a policy-driven deep reinforcement learning (DRL) framework for adaptive TXOP control, in which the coexistence process is formulated as a Markov decision process (MDP) and a deep Q-network (DQN) learns control policies through online interaction. A key contribution is the introduction of a policy layer via reward design, enabling explicit control of system-level tradeoffs among fairness, throughput, and quality of service (QoS). Three policies, namely absolute fairness, moderate fairness, and utility-based fairness, are developed to achieve different operating points. Simulation results show that the proposed framework achieves a Jain fairness index above 0.9 under strict fairness control. Compared to absolute fairness, moderate fairness improves aggregate throughput by 68.22%, while the utility-based policy further enhances utility by 177.6%. These results demonstrate that policy-driven control provides a flexible and effective solution for managing tradeoffs in heterogeneous coexistence networks.
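Jain's fairness index and a policy-layer-style reward can be sketched as below. The reward weightings are invented stand-ins to illustrate how a policy layer trades fairness against throughput; they are not the paper's actual reward design.

```python
def jain_index(xs):
    # Jain's fairness index: 1/n at total unfairness, 1.0 at equal shares.
    n = len(xs)
    s = sum(xs)
    return s * s / (n * sum(x * x for x in xs))

def reward(nr_u, wifi, policy="moderate", alpha=0.5):
    """Illustrative policy layer (hypothetical weightings): each policy
    mixes fairness and throughput differently."""
    total = nr_u + wifi
    fair = jain_index([nr_u, wifi])
    if policy == "absolute":        # fairness only
        return fair
    if policy == "moderate":        # blend fairness with total throughput
        return alpha * fair + (1 - alpha) * total / (total + 1)
    # utility-flavored: reward balanced joint throughput plus fairness
    return (nr_u * wifi) ** 0.5 / (total + 1) + fair
```

Under the "moderate" policy a balanced allocation (5, 5) is rewarded above a skewed one (9, 1) with the same total, which is exactly the tradeoff the policy layer is meant to control.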
[LG-35] Soft Graph Diffusion Transformer for MIMO Detection
链接: https://arxiv.org/abs/2605.00449
作者: Nan Jiang,Jiadong Hong,Lei Liu,Xinyu Bian,Wenjie Wang,Zhaoyang Zhang
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 6 pages, 4 figures, 2 tables
Abstract:Learning-based MIMO detection has shown strong empirical performance, yet existing methods typically rely on fixed-depth architectures without explicitly modeling the progressive refinement of symbol estimates. In this paper, we revisit MIMO detection from a flow matching perspective and propose the Soft Graph Diffusion Transformer (SGDiT), which reformulates detection as a noise-level-conditioned denoising process that progressively transforms a Gaussian initialization toward the posterior conditioned on channel observations. An adaptive layer normalization (AdaLN)-conditioned soft graph transformer is employed to parameterize the denoising dynamics, enabling stage-aware information integration between observation and symbol domains. To better align with the discrete nature of symbol detection, we further adopt a cross-entropy-based training objective that directly models bit-wise posterior probabilities, providing a more suitable inductive bias than conventional regression-based formulations. Experimental results across various MIMO system configurations demonstrate that SGDiT achieves competitive bit error rate (BER) performance compared with representative baselines. Furthermore, the proposed model exhibits good generalization capability across different channel conditions. Overall, the SGDiT framework provides an effective and practical approach for neural MIMO detection.
[LG-36] The Power of Order: Fooling LLMs with Adversarial Table Permutations
链接: https://arxiv.org/abs/2605.00445
作者: Xinshuai Dong,Haifeng Chen,Xuyuan Liu,Shengyu Chen,Haoyu Wang,Shaoan Xie,Kun Zhang,Zhengzhang Chen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large Language Models have achieved remarkable success and are increasingly deployed in critical applications involving tabular data, such as Table Question Answering. However, their robustness to the structure of this input remains a critical, unaddressed question. This paper demonstrates that modern LLMs exhibit a significant vulnerability to the layout of tabular data. Specifically, we show that semantically-invariant permutations of rows and columns - rearrangements that do not alter the table’s underlying information - are sometimes sufficient to cause incorrect or inconsistent model outputs. To systematically probe this vulnerability, we introduce Adversarial Table Permutation, a novel, gradient-based attack that efficiently identifies worst-case permutations designed to maximally disrupt model performance. Our extensive experiments demonstrate that ATP significantly degrades the performance of a wide range of LLMs. This reveals a pervasive vulnerability across different model sizes and architectures, including the most recent and popular models. Our findings expose a fundamental weakness in how current LLMs process structured data, underscoring the urgent need to develop permutation-robust models for reliable, real-world applications.
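A semantically-invariant table permutation, the attack surface probed above, is easy to construct and to verify as information-preserving: shuffling rows and columns together leaves the set of (header, value) records untouched. The table below is a made-up example.

```python
import random

def permute_table(header, rows, seed=0):
    """Apply a semantically-invariant permutation: shuffle the column
    order and the row order, keeping each cell under its original header."""
    rng = random.Random(seed)
    col_order = list(range(len(header)))
    rng.shuffle(col_order)
    new_header = [header[c] for c in col_order]
    new_rows = [[r[c] for c in col_order] for r in rows]
    rng.shuffle(new_rows)
    return new_header, new_rows

header = ["city", "pop"]
rows = [["Oslo", "0.7M"], ["Paris", "2.1M"], ["Rome", "2.8M"]]
h2, r2 = permute_table(header, rows, seed=42)

# Information content is unchanged: the same set of (header, value)
# records exists before and after the permutation.
records = lambda h, rs: {frozenset(zip(h, r)) for r in rs}
```

The paper's ATP attack searches this permutation space with gradients for the worst case; a simple random permutation like the one above already probes the vulnerability.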
[LG-37] Optimal Spatio-Temporal Decoupling for Bayesian Conformal Prediction UAI2026
链接: https://arxiv.org/abs/2605.00432
作者: Yu-Hsueh Fang,Chia-Yen Lee
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Under review of UAI2026
Abstract:Online Conformal Prediction (CP) struggles to balance temporal adaptability and structural stability. Feedback-driven methods (e.g., Adaptive Conformal Inference (ACI)) suffer from systemic marginal under-coverage and high interval variance during abrupt shifts, while temporally discounted Bayesian CP suffers from severe structural lag and uncalibrated interval bloat. We propose State-Adaptive Bayesian Conformal Prediction (SA-BCP) to achieve optimal spatio-temporal decoupling. By gating long-term temporal inertia with spatial kernel-density evidence, SA-BCP proactively expands intervals for recognized historical regimes while maintaining tight efficiency during stable states. We rigorously prove this mechanism’s optimality, identifying a minimax bias-variance tradeoff governed by an evidence threshold K . Extensive benchmarks on volatile financial datasets (2016–2026), including AMD, Gold, and GBP/USD, demonstrate that SA-BCP consistently minimizes the strictly proper Winkler score across diverse confidence levels. Specifically, SA-BCP resolves the systematic under-coverage inherent to ACI variants while simultaneously reducing the uncalibrated interval bloat of Bayesian CP by 10% to 37% under high-confidence requests. By elegantly navigating this tradeoff, SA-BCP achieves an optimal balance between conditional reliability and predictive efficiency.
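The Winkler (interval) score used for evaluation is the interval width plus a 2/alpha penalty per unit of miss; lower is better, and it is strictly proper for central (1 - alpha) intervals. A quick sketch with made-up numbers:

```python
def winkler_score(lower, upper, y, alpha=0.1):
    """Winkler score at miscoverage level alpha: width of the interval,
    plus a 2/alpha-scaled penalty when the realized value y falls outside."""
    width = upper - lower
    if y < lower:
        return width + (2.0 / alpha) * (lower - y)
    if y > upper:
        return width + (2.0 / alpha) * (y - upper)
    return width

# A tight covering interval beats both a bloated one and a near miss.
covered_tight = winkler_score(9.0, 11.0, 10.0)   # width 2.0, no penalty
covered_wide = winkler_score(5.0, 15.0, 10.0)    # width 10.0 (bloat)
missed = winkler_score(9.0, 9.9, 10.0)           # width 0.9 + 20 * 0.1
```

This ordering is why the score separates the two failure modes discussed in the abstract: under-coverage (misses) and uncalibrated interval bloat (excess width) both raise it.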
[LG-38] GD4: Graph-based Discrete Denoising Diffusion for MIMO Detection
链接: https://arxiv.org/abs/2605.00423
作者: Qincheng Lu,Sitao Luan,Xiao-Wen Chang
类目: Machine Learning (cs.LG)
*备注:
Abstract:In wireless communications, recovering the optimal solution to the multiple-input multiple-output (MIMO) detection problem is NP-hard. Obtaining high-quality suboptimal solutions with a favorable performance-complexity trade-off is particularly challenging in under-determined systems with N_t transmit antennas and N_r < N_t receive antennas. Recent diffusion-based MIMO detectors have shown promise, but they require extensive sampling iterations at inference time, and their performance degrades in under-determined scenarios. We propose GD4, a graph-based discrete denoising diffusion method for MIMO detection. Unlike existing diffusion-based detectors that operate in a continuous relaxed space, GD4 performs denoising directly in the discrete symbol space and enables fast inference with one or a few denoising evaluations. Numerical results show that, under a similar inference-time compute budget, GD4 produces higher-quality suboptimal solutions than existing diffusion-based detectors and widely used classical baselines, including the box-constrained Babai point and the K-best box-constrained randomized Klein-Babai point, in both under-determined and overdetermined settings.
[LG-39] M-CaStLe: Uncovering Local Causal Structures in Multivariate Space-Time Gridded Data
链接: https://arxiv.org/abs/2605.00398
作者: J. Jake Nichol,Michael Weylandt,G. Matthew Fricke,Jhayron Perez-Carrasquilla,Melanie E. Moses
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (stat.ML)
*备注: 19 pages and 6 figures in the main text; 33 pages and 11 figures total
Abstract:Causal graph discovery for space-time systems is challenging in high-dimensional gridded data, which often has many more grid cells than temporal observations per cell. The Causal Space-Time Stencil Learning (CaStLe) meta-algorithm was developed to address that niche under space-time locality and stationarity assumptions, but it is currently limited to univariate analyses. In this work, we present M-CaStLe. M-CaStLe generalizes the local embedding and parent-identification phases of CaStLe to jointly model local within-variable and cross-variable space-time causal structures in gridded data. Like CaStLe, by constraining candidate parents to a constant-size space-time neighborhood and pooling spatial replicates, M-CaStLe increases effective sample size to make discovery tractable in high-dimensional settings. We further decompose the resulting multivariate stencil graph into reaction and spatial graphs to aid interpretation in complex settings. We study M-CaStLe in four settings: a multivariate space-time vector autoregression benchmark with known ground truth, an advective-diffusive-reaction partial differential equation verification problem with derived physical reference structure, an atmospheric chemistry case study in a low-temporal-sample regime, and an El Niño Southern Oscillation study on reanalysis data, identifying phase-dependent ocean–atmosphere coupling. Across these settings, M-CaStLe more accurately recovers multivariate causal structure in controlled settings and identifies important physical dynamics in real-world case studies. Overall, M-CaStLe advances causal discovery for multivariate space-time systems while retaining interpretability at the grid level. 
[LG-40] Mesh Field Theory: Port-Hamiltonian Formulation of Mesh-Based Physics ICML2026
链接: https://arxiv.org/abs/2605.00394
作者: Satoshi Noguchi,Yoshinobu Kawahara
类目: Machine Learning (cs.LG)
*备注: 23 pages, 4 figures, 10 tables. Accepted to ICML 2026
Abstract:We present Mesh Field Theory (MeshFT) and its neural realization, MeshFT-Net: a structure-preserving framework for mesh-based continuum physics that cleanly separates the physics’ topological structure from its metric structure. Imposing minimal physical principles (locality, permutation equivariance, orientation covariance, and energy balance/dissipation inequality), we prove a reduction theorem for mesh-based physics. Under these conditions, the physical dynamics admit a local factorization into a port-Hamiltonian form: the conservative interconnection is fixed uniquely by mesh topology, whereas metric effects enter only through constitutive relations and dissipation. This reduction clarifies what must be fixed and what should be learned, directly informing MeshFT-Net’s design. Across evaluations on analytic and realistic datasets, physics-consistency tests, and out-of-distribution validation, MeshFT-Net achieves near-zero energy drift and strong physical fidelity (correct dispersion and momentum conservation) along with robust extrapolation and high data efficiency. By eliminating non-physical degrees of freedom and learning only metric-dependent structure, MeshFT provides a principled inductive bias for stable, faithful, and data-efficient learning-based physical simulation.
[LG-41] Model-Based Reinforcement Learning with Double Oracle Efficiency in Policy Optimization and Offline Estimation
链接: https://arxiv.org/abs/2605.00393
作者: Haichen Hu,Jian Qian,David Simchi-Levi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement learning (RL) in large environments often suffers from severe computational bottlenecks, as conventional regret minimization algorithms require repeated, costly calls to planning and statistical estimation oracles. While recent advances have explored offline oracle-efficient algorithms, their computational complexity typically scales with the cardinality of the state and action spaces, rendering them intractable for large-scale or continuous environments. In this paper, we address this fundamental limitation by studying offline oracle-efficient episodic RL through the lens of log-barrier and log-determinant regularization. Specifically, for tabular Markov Decision Processes (MDPs), we propose a novel algorithm that achieves the optimal \tilde{O}(\sqrt{T}) regret bound while requiring only O(H \log\log T) calls to both the offline statistical estimation and planning oracles when T is known and O(H \log T) calls when T is unknown. Crucially, this oracle complexity is entirely independent of the size of the state and action spaces. This strict independence drastically reduces the planning oracle complexity, representing a substantial improvement over existing offline oracle-efficient algorithms (Qian et al., 2024). Furthermore, we demonstrate the versatility of our framework by generalizing the algorithm to linear MDPs featuring infinite state spaces and arbitrary action spaces. We prove that this generalized approach successfully attains meaningful sub-linear regret. Consequently, our work yields the first doubly oracle-efficient (i.e., efficient with respect to both statistical estimation and policy optimization) regret minimization algorithm capable of solving MDPs with infinite state and action spaces, significantly expanding the boundaries of computationally tractable RL.
[LG-42] Towards Robust and Scalable Density-based Clustering via Graph Propagation
链接: https://arxiv.org/abs/2605.00390
作者: Yingtao Zheng,Hugo Phibbs,Ninh Pham
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: substantial text overlap with arXiv:2508.02989
Abstract:We present CluProp, a novel framework that reimagines varied-density clustering in high-dimensional spaces as a label propagation process over neighborhood graphs. Our approach formally bridges the gap between density-based clustering and graph connectivity, leveraging efficient propagation mechanisms from network science to mitigate the parameter sensitivity inherent in traditional density-based methods. Specifically, we introduce a deterministic density-based propagation strategy to ensure scalable neighborhood identification. The framework is agnostic to the choice of distance metric and exhibits superior performance on large-scale data, processing millions of points in minutes while consistently outperforming existing baselines in accuracy.
[LG-43] PILIR: Physics-Informed Local Implicit Representation
链接: https://arxiv.org/abs/2605.00385
作者: Jianfeng Li,Feng Wang,Ke Tang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Physics-Informed Neural Networks have become a powerful mesh-free method for solving partial differential equations, but their performance is often limited by spectral bias. Specifically, in standard MLPs used in PINNs, the global parameter coupling causes the model to prioritize learning low-frequency components, resulting in slow convergence for high-frequency details. To overcome this limitation, we introduce the Physics-Informed Local Implicit Representation (PILIR). Our approach separates the global physical domain into a discrete latent feature space and a continuous generative decoder. By using a learnable grid to encode explicit spatial locality, PILIR can capture high-frequency details locally, preventing dilution by global patterns. A generative neural operator then synthesizes these local latent features into continuous physical fields, allowing accurate reconstruction of fine-scale structures. Experiments on a range of challenging PDEs show that PILIR effectively mitigates spectral bias, thereby boosting the convergence of high-frequency details and achieving superior accuracy compared to state-of-the-art methods.
[LG-44] Advancing Edge Classification through High-Dimensional Causal Modeling of Node-Edge Interplay
链接: https://arxiv.org/abs/2605.00374
作者: Duanyu Feng,Li Ding,Hongru Liang,Wenqiang Lei
类目: Machine Learning (cs.LG)
*备注:
Abstract:Edge classification, a crucial task for graph applications, remains relatively under-explored compared to link prediction. Current methods often overlook the potential causal influences of node features on edge features, leading to a loss of relevant prior information. In this work, we present an empirical exploration using the Causal Edge Classification Framework (CECF). Unlike conventional causal inference methods, CECF is the first framework to apply causal inference principles to the edge classification task and to explore modeling edge features as a high-dimensional treatment within a causal framework. Based on the node embeddings of a Graph Neural Network (GNN), CECF seeks to learn a balanced representation of high-dimensional edge features by mitigating the potential influence of node features. Then, a cross-attention network captures the complex dependencies between node and edge features for final edge classification. Extensive experiments demonstrate that CECF not only achieves superior performance but also serves as a flexible, plug-and-play enhancement for existing methods. We also provide empirical analyses, offering insights into when and how this high-dimensional causal modeling framework works for edge classification.
[LG-45] Group Cognition Learning: Making Everything Better Through Governed Two-Stage Agents Collaboration ICML2026
链接: https://arxiv.org/abs/2605.00370
作者: Chunlei Meng,Pengbin Feng,Rong Fu,Hoi Leong Lee,Xiaojing Du,Zhaolu Kang,Zeyu Zhang,Weilin Zhou,Chun Ouyang,Zhongxue Gan
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Multimedia (cs.MM)
*备注: This study has been Accepted by ICML 2026. The current version is a manuscript, please refer to the official version released at ICML 2026 for the final published version
Abstract:Centralized multimodal learning commonly compresses language, acoustic, and visual signals into a single fused representation for prediction. While effective, this paradigm suffers from two limitations: modality dominance, where optimization gravitates towards the path of least resistance, ignoring weaker but informative modalities, and spurious modality coupling, where models overfit to incidental cross-modal correlations. To address these, we propose Group Cognition Learning (GCL), a governed collaboration paradigm that applies a two-stage protocol after modality-specific encoding. In Stage 1 (Selective Interaction), a Routing Agent proposes directed interaction routes, and an Auditing Agent assigns sample-wise gates to emphasize exchanges that yield positive marginal predictive gain while suppressing redundant coupling. In Stage 2 (Consensus Formation), a Public-Factor Agent maintains an explicit shared factor, and an Aggregation Agent produces the final prediction through contribution-aware weighting while keeping each modality representation as a specialization channel. Extensive experiments on CMU-MOSI, CMU-MOSEI, and MIntRec demonstrate that GCL mitigates dominance and coupling, establishing state-of-the-art results across both regression and classification benchmarks. Analysis experiments further demonstrate the effectiveness of the design.
[LG-46] Geometric analysis of attractor boundaries and storage capacity limits in kernel Hopfield networks
链接: https://arxiv.org/abs/2605.00366
作者: Akira Tamamori
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: 10 pages, 6 figures
Abstract:High-capacity associative memories based on Kernel Logistic Regression (KLR) exhibit strong storage capabilities, but the dynamical and geometric mechanisms underlying their stability remain poorly understood. This paper investigates the global geometry of attractor basins and the physical determinants of the storage limit in KLR-trained Hopfield networks. We combine empirical evaluations using random sequences and real-world image embeddings (CIFAR-10) with phenomenological morphing experiments and statistical Signal-to-Noise Ratio (SNR) analysis. Our experiments reveal that the network achieves a storage capacity for random sequences up to P/N \approx 16 , and maintains stable retrieval for structured data at effective loads near P/N \approx 20 . Through morphing analysis, we reveal that attractors on the “Ridge of Optimization” are separated by sharp, phase-transition-like boundaries, characterized by steep effective potential barriers and critical slowing down. Furthermore, by contrasting an SNR analysis with a geometric reference point inspired by Cover’s theorem, we show that the ultimate storage limit is constrained primarily not by a lack of geometric separability in the feature space, but by the loss of dynamical stability against crosstalk noise. These findings suggest that KLR networks function as highly localized, exemplar-based memories that operate optimally just before the onset of dynamical collapse, providing new insights into the design of robust, large-scale retrieval systems.
[LG-47] Binomial flows: Denoising and flow matching for discrete ordinal data
链接: https://arxiv.org/abs/2605.00360
作者: Yair Shenfeld,Ricardo Baptista,Stefano Peluchetti
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 41 pages, 9 figures
Abstract:Flow-based generative modeling in continuous spaces exploits Tweedie’s formula to express the denoiser (learned in training) as a score function (used in sampling). In contrast, this relation has been largely missing in the discrete setting, where common approaches focus on learning discrete scores and rates. In this work we close this gap for discrete non-negative ordinal data by introducing Binomial flows. Our framework provides a simple recipe for training a discrete diffusion model which simultaneously denoises, samples, and estimates exact likelihoods. We verify our methodology on synthetic examples and obtain competitive results on real-world data sets.
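The abstract does not spell out the forward corruption process; a common construction for non-negative count data is binomial thinning, sketched below under an assumed linear survival schedule (the paper's actual process may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_noise(x0, t):
    """Binomial thinning: each of the x0 counts survives independently
    with probability alpha(t), which decays from 1 (clean data) to 0
    (pure noise). The linear schedule alpha(t) = 1 - t is an assumption
    for illustration only."""
    alpha = 1.0 - t
    return rng.binomial(x0, alpha)

x0 = np.array([12, 7, 30])          # discrete ordinal data (counts)
x_clean = forward_noise(x0, 0.0)    # alpha = 1: counts unchanged
x_mid = forward_noise(x0, 0.5)      # partially thinned
x_end = forward_noise(x0, 1.0)      # alpha = 0: everything zeroed
```

A denoiser trained to predict x0 from x_mid then plays the role that Tweedie-style denoisers play in the continuous case.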
[LG-48] Free Energy Surface Sampling via Reduced Flow Matching
链接: https://arxiv.org/abs/2605.00337
作者: Zichen Liu,Tiejun Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Sampling the free energy surface, namely, the distribution of collective variables (CVs), is a crucial problem in statistical physics, as it underpins a better understanding of chemical reactions and conformational transitions. Traditional methods for free energy surface sampling involve simulation in high-dimensional configuration space and projecting the resulting configurations onto the CV space. To reduce the computational costs of such sampling, we propose FES-FM, a reduced flow matching (FM) method for free energy sampling (FES). We train a dynamical transport map in the CV space, thereby enabling direct sampling of the free energy surface. For many-particle systems, we construct a prior distribution based on the Hessian at a local minimum of the potential, which ensures both rotation-translation invariance and physically meaningful configurations. We evaluate the proposed method across a variety of potential functions and collective variables. Comparative experiments demonstrate that our approach drastically reduces computational costs while delivering superior accuracy per unit sampling time.
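The core flow-matching regression target in a low-dimensional CV space can be sketched generically using rectified-flow-style linear paths; the paper's Hessian-based prior and the trained velocity network are omitted here, and the toy distributions are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_targets(x0, x1, t):
    """Linear interpolation path x_t = (1-t) x0 + t x1 with target
    velocity v = x1 - x0; a network v_theta(x_t, t) regresses onto v."""
    xt = (1.0 - t)[:, None] * x0 + t[:, None] * x1
    vt = x1 - x0
    return xt, vt

# Toy 1-D collective variable: prior N(0, std 1), target N(3, std 0.5).
n = 4096
x0 = rng.normal(0.0, 1.0, size=(n, 1))
x1 = rng.normal(3.0, 0.5, size=(n, 1))
t = rng.uniform(size=n)
xt, vt = cfm_targets(x0, x1, t)

# The best constant velocity predictor is E[x1 - x0] = 3; its loss equals
# Var(x1 - x0) = 1 + 0.25 = 1.25, which a trained field would undercut.
loss = float(np.mean((3.0 - vt) ** 2))
```

Sampling the free energy surface then amounts to integrating the learned velocity field from prior draws, entirely within the CV space.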
[LG-49] Conformalized Quantum DeepONet Ensembles for Scalable Operator Learning with Distribution-Free Uncertainty
链接: https://arxiv.org/abs/2605.00330
作者: Purav Matlia,Christian Moya,Guang Lin
类目: Machine Learning (cs.LG)
*备注:
Abstract:Operator learning enables fast surrogate modeling of high-dimensional dynamical systems, but existing approaches face two fundamental limitations: quadratic inference complexity and unreliable uncertainty quantification in safety-critical settings. We propose Conformalized Quantum DeepONet Ensembles, a framework that addresses both challenges simultaneously. By leveraging Quantum Orthogonal Neural Networks (QOrthoNNs), we reduce operator inference complexity from O(n^2) to O(n), enabling scalable evaluation over fine discretizations. To provide rigorous uncertainty quantification, we combine ensemble-based epistemic modeling with adaptive conformal prediction, yielding distribution-free coverage guarantees. A key challenge in ensembling is that naive parallelism scales hardware resources linearly with the number of models. We resolve this by using Superposed Parameterized Quantum Circuits (SPQCs), which compress multiple ensemble members into a single circuit and enable simultaneous multi-model execution. Experiments on synthetic partial differential equations and real-world power system dynamics demonstrate that our approach achieves accurate predictions while maintaining calibrated uncertainty under realistic quantum noise. These results establish a practical pathway toward scalable, uncertainty-aware operator learning in quantum machine learning.
[LG-50] Federated Weather Modeling on Sensor Data
链接: https://arxiv.org/abs/2605.00322
作者: Shengchao Chen,Guodong Long
类目: Machine Learning (cs.LG)
*备注: Accepted by Encyclopedia of GIS, this is an unedited version. Published version: this https URL
Abstract:Federated weather modeling on sensor data is a distributed system underpinned by federated learning, enabling multiple sensor data sources, including ground weather stations, satellites and IoT devices, to collaboratively train deep learning models without sharing raw data. This method safeguards data privacy and security while leveraging diverse, geographically distributed datasets to improve the accuracy and robustness of global/regional weather modeling tasks such as forecasting and anomaly detection.
[LG-51] Data Deletion Can Help in Adaptive RL
链接: https://arxiv.org/abs/2605.00298
作者: Param Budhraja,Aditya Gangrade,Alex Olshevsky,Venkatesh Saligrama
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Deploying reinforcement learning policies in the real world requires adapting to time-varying environments. We study this problem in the contextual Markov Decision Process (cMDP) framework, where a family of environments is indexed by a low-dimensional context unknown at test time. The standard approach decomposes the problem: train a so-called “universal policy” which assumes knowledge of the true context, then pair it with a context estimator which approximates context using the observed trajectory. We identify a simple, counterintuitive trick that substantially improves the estimator: randomly delete a fraction of the training buffer after each round. This works because data is collected across multiple rounds using progressively better policies, and older trajectories come from a different distribution than what the estimator will face at deployment time; random deletion creates an implicit exponential decay on older data while preserving diversity without requiring any explicit identification of which samples are stale. This reduces the robustness gap by 30% for MLPs and by 6% on average for recurrent networks. Strikingly, it allows a narrow MLP with 5x fewer parameters to outperform a wide MLP trained without deletion. To understand when and why deletion helps, we analyze regularized empirical risk minimization with a mismatch between the train distribution and the distribution at deployment; in this idealized setting, we prove that removing a single uniformly random training point decreases expected test loss in expectation under mild conditions. For ridge regression we make this quantitative: deletion helps when the regularization coefficient is moderate and the signal-to-noise ratio (SNR) is sufficiently low, and, crucially, this SNR threshold gives a direct measure of how large the distribution mismatch between training and deployment must be for deletion to be beneficial.
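The deletion trick itself is a one-liner; below is a toy sketch of the buffer dynamics, with the RL training loop stubbed out (the round count and the 20% deletion fraction are arbitrary choices for illustration):

```python
import random

def prune_buffer(buffer, frac, rng):
    """Randomly delete a fraction of the buffer. A trajectory collected
    r rounds ago has survived r deletion passes, so its expected retention
    is (1 - frac) ** r: an implicit exponential decay on stale data that
    never explicitly identifies which samples are old."""
    return [x for x in buffer if rng.random() > frac]

buffer = []
for rnd in range(5):                               # 5 collection rounds
    buffer += [(rnd, i) for i in range(100)]       # 100 new trajectories
    buffer = prune_buffer(buffer, frac=0.2, rng=random.Random(rnd))

# Surviving trajectories per collection round (older rounds keep fewer).
per_round = [sum(1 for r, _ in buffer if r == k) for k in range(5)]
```

Older rounds end up down-weighted roughly geometrically while the buffer still retains some of their diversity.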
[LG-52] Trident: Improving Malware Detection with LLMs and Behavioral Features
链接: https://arxiv.org/abs/2605.00297
作者: Rebecca Saul,Jingzhi Jiang,Elliott Chia,David Wagner
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Traditionally, machine learning methods for PE malware detection have relied on static features like byte histograms, string information, and PE header contents. One barrier to incorporating dynamic analysis features has been the semi-structured nature of sandbox behavior reports. We show that, using the latest generation of large language models with reasoning, it is possible to efficiently process these behavior reports and utilize them as part of a malware detection pipeline. Specifically, we leverage LLMs to generate behavior-based malware detection rules based on a small training set of labeled malware. We find that these detection rules, derived from behavioral features, are much more robust to concept drift than standard static-feature methods, while maintaining practical false positive rates. Finally, we introduce Trident, a system which combines a classic decision tree model over static features, our behavior-based detection rules, and direct LLM analysis of sandbox reports through majority voting. Trident outperforms standard methods using static features, outperforms behavior-based rules alone, and is as resilient to concept drift as active learning methods without requiring retraining.
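The final combination step, as described, is a majority vote over three detectors; a trivially small sketch with each component reduced to a 0/1 verdict (the component detectors themselves are stubbed out):

```python
def trident_vote(static_flag, rule_flag, llm_flag):
    """Majority vote over Trident's three components, each reduced to a
    0/1 verdict (1 = malware): a decision tree over static features,
    behavior-based detection rules, and direct LLM analysis of the
    sandbox report."""
    return int(static_flag + rule_flag + llm_flag >= 2)
```

Because the behavior-rule and LLM channels are more robust to concept drift, the vote can stay accurate even when the static-feature model degrades.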
[LG-53] A Dirac-Frenkel-Onsager principle: Instantaneous residual minimization with gauge momentum for nonlinear parametrizations of PDE solutions
链接: https://arxiv.org/abs/2605.00284
作者: Matteo Raviola,Benjamin Peherstorfer
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注:
Abstract:Dirac-Frenkel instantaneous residual minimization evolves nonlinear parametrizations of PDE solutions in time, but ill-conditioning can render the parameter dynamics non-unique. We interpret this non-uniqueness as a gauge freedom: nullspace directions that leave the time derivative unchanged can be used to select better-conditioned parameter velocities. Building on Onsager’s minimum-dissipation principle, we introduce a history variable – interpretable as momentum – and inject it only along the nullspace directions. The resulting Dirac-Frenkel-Onsager dynamics preserve instantaneous residual minimization, in contrast to standard regularization that can introduce bias, while promoting temporally smooth parameter evolutions. Examples demonstrate that the approach leads to increased robustness in singular and near-singular regimes.
[LG-54] A Comparative Analysis of Machine Learning Models for Intrusion Detection in Intelligent Transport Systems
链接: https://arxiv.org/abs/2605.00279
作者: Zawad Yalmie Sazid,Robert Abbas,Sasa Maric
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:AI-powered edge computing security is moving Intelligent Transportation Systems (ITS) from passive, rule-based protections to proactive, smart, zero-touch, self-sufficient safeguards that neutralize threats in milliseconds. As transportation becomes more connected with edge computing, massive IoT, and advanced 5G for vehicle-to-everything (V2X) connectivity, AI at the edge computing nodes plays a crucial role in protecting against sophisticated threats, enabling URLLC (ultra-low-latency communications) for smart transport, and enhancing infrastructure capabilities and safety. This research applies edge computing to improve latency, bandwidth efficiency, and service responsiveness by moving processing closer to devices, gateways, and users. However, this shift also expands the cyberattack surface because edge nodes are distributed, heterogeneous, and often resource-constrained. The paper proposes a trust-aware federated hybrid intrusion detection framework in which a random forest, a decision tree, and a linear SVM network learn complementary traffic representations at each edge site, while a server performs trust-aware aggregation of local model updates.
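The abstract names trust-aware aggregation without specifying it; one natural reading is trust-weighted averaging of per-site model updates, sketched below (the trust scores themselves are assumed to be given by some separate scoring mechanism):

```python
import numpy as np

def trust_aware_aggregate(updates, trust):
    """Weight each edge site's model update by its trust score before
    averaging, so anomalous or compromised sites contribute less to the
    global model than well-behaved ones."""
    trust = np.asarray(trust, dtype=float)
    w = trust / trust.sum()
    return sum(wi * u for wi, u in zip(w, updates))

updates = [np.array([1.0, 0.0]),
           np.array([0.0, 1.0]),
           np.array([10.0, 10.0])]       # third site looks anomalous
agg = trust_aware_aggregate(updates, trust=[1.0, 1.0, 0.1])
```

With trust 0.1 the outlier site's large update is heavily discounted instead of dominating a plain FedAvg mean.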
[LG-55] Polaris: Coupled Orbital Polar Embeddings for Hierarchical Concept Learning ICML2026
链接: https://arxiv.org/abs/2605.00265
作者: Sahil Mishra,Srinitish Srinivasan,Sourish Dasgupta,Tanmoy Chakraborty
类目: Machine Learning (cs.LG)
*备注: Accepted to the 43rd International Conference on Machine Learning (ICML 2026)
Abstract:Real-world knowledge is often organized as hierarchies such as product taxonomies, medical ontologies, and label trees, yet learning hierarchical representations is challenging due to asymmetric structure and noisy semantics. We introduce Polaris, a polar hyperspherical embedding framework that separates semanticity from hierarchy using angular geometry and radius, enabling the learning of meaning and structure without interference. To map latent representation onto the sphere, we project it to the tangent space at the north pole, apply the exponential map, and learn unit-norm representations using spherical linear layers. Polaris then combines robust local constraints, global regularization that prevents geometric collapse, and uncertainty-aware asymmetric objectives that encourage directional containment. At inference time, Polaris uses structure-guided retrieval to efficiently narrow down candidate parents before final ranking. We evaluate Polaris on different settings of taxonomy expansion - spanning trees, multi-parent DAGs, and multimodal hierarchies, showing consistent improvements of up to ~19 points in top-K retrieval and up to ~60% reduction in mean rank over fourteen strong baselines.
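The projection-to-sphere step the abstract describes (tangent space at the north pole, then the exponential map) is a standard construction and can be sketched exactly; the rest of Polaris, such as the spherical linear layers and the asymmetric objectives, is omitted:

```python
import numpy as np

def exp_map_north_pole(v):
    """Exponential map on the unit sphere S^d at the north pole
    (0, ..., 0, 1): a tangent vector v in R^d (the equatorial plane)
    maps to the unit vector (sin(|v|) v / |v|, cos(|v|))."""
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(v)
    if norm < 1e-12:
        return np.concatenate([np.zeros_like(v), [1.0]])
    return np.concatenate([np.sin(norm) * v / norm, [np.cos(norm)]])

p = exp_map_north_pole([0.3, -0.4])    # a point on S^2 with |p| = 1
```

The output is unit-norm by construction, so angular position (semantics) and a separately learned radius (hierarchy) can be handled independently.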
[LG-56] Pessimism-Free Offline Learning in General-Sum Games via KL Regularization
链接: https://arxiv.org/abs/2605.00264
作者: Claire Chen,Yuheng Zhang
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:
Abstract:Offline multi-agent reinforcement learning in general-sum settings is challenged by the distribution shift between logged datasets and target equilibrium policies. While standard methods rely on manual pessimistic penalties, we demonstrate that KL regularization suffices to stabilize learning and achieve equilibrium recovery. We propose General-sum Anchored Nash Equilibrium (GANE), which recovers regularized Nash equilibria at an accelerated statistical rate of \widetilde{O}(1/n) . For computational tractability, we develop General-sum Anchored Mirror Descent (GAMD), an iterative algorithm converging to a Coarse Correlated Equilibrium at the standard rate of \widetilde{O}(1/\sqrt{n}+1/T) . These results establish KL regularization as a standalone mechanism for pessimism-free offline learning that achieves equivalent or accelerated rates in multi-player general-sum games.
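The abstract does not give GAMD's update rule; as a single-agent illustration of a KL-anchored mirror descent step (the exact multi-player GAMD update may differ), whose fixed point is the KL-regularized solution pi* proportional to pi_ref * exp(q / tau):

```python
import numpy as np

def anchored_md_step(pi, pi_ref, q, eta=0.1, tau=1.0):
    """One KL-anchored mirror descent step:
    pi'(a) proportional to pi(a)^(1 - eta*tau) * pi_ref(a)^(eta*tau) * exp(eta*q(a)).
    The tau-weighted KL term anchors iterates to the behavior policy
    pi_ref, which stands in for an explicit pessimism penalty."""
    logits = (1 - eta * tau) * np.log(pi) + eta * tau * np.log(pi_ref) + eta * q
    p = np.exp(logits - logits.max())
    return p / p.sum()

pi_ref = np.array([0.5, 0.3, 0.2])   # logged behavior policy (the anchor)
q = np.array([1.0, 0.0, -1.0])       # action values
pi = pi_ref.copy()
for _ in range(300):
    pi = anchored_md_step(pi, pi_ref, q)

# Analytic fixed point: pi* proportional to pi_ref * exp(q / tau).
target = pi_ref * np.exp(q)
target /= target.sum()
```

The iterate contracts toward the anchored fixed point geometrically, so a few hundred steps suffice at this step size.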
[LG-57] NLPOpt-Net: A Learning Method for Nonlinear Optimization with Feasibility Guarantees
链接: https://arxiv.org/abs/2605.00260
作者: Bimol Nath Roy,Rahul Golder,MM Faruque Hasan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Nonlinear Parametric Optimization Network (NLPOpt-Net) is an unsupervised learning architecture to solve constrained nonlinear programs (NLP). Given the structure of an NLP, it learns the parametric solution maps with guaranteed constraint satisfaction. The architecture consists of a backbone neural network (NN) followed by a multilayer (k-layered) projection. While the NN drives toward optimality through a loss function consisting of a modified Lagrangian augmented with a consistency loss, the projection ensures feasibility by projecting the NN predictions in the original constraint manifold. Instead of typical distance minimization, our projection exploits local quadratic approximations of the original NLP. Under certain conditions (such as convexity), the projection has a descent property, which improves the NN predictions further. NLPOpt-Net deploys an inversion-free, modified Chambolle-Pock algorithm to solve the constrained quadratic projections during the forward pass and uses the implicit function theorem for efficient backpropagation. The fixed structure of the projection further allows decoupling of the NN and the projection once the training is complete. NLPOpt-Net solves large-scale convex QP, QCQP, NLP, and nonconvex problems with near-zero optimality gap and constraint violations reduced to machine precision. Additionally, it provides near-accurate prediction of the active sets and corresponding dual variables, thereby enabling a scalable approach for multiparametric programming. Compiling the projection in C provides an order-of-magnitude improvement in inference time compared to JAX. We provide the codes and NLPOpt-Net as a ready-to-use package that includes GPU support.
[LG-58] Bayesian Optimization in Linear Time
链接: https://arxiv.org/abs/2605.00237
作者: Jesse Schneider,William J. Welch
类目: Machine Learning (cs.LG)
*备注: 25 pages, 25 figures; code available at this https URL
Abstract:Bayesian optimization is a sequential method for minimizing objective functions that are expensive to evaluate and about which few assumptions can be made. By using all gathered data to train a Gaussian process model for the function and adaptively employing a mixture of global exploration and local exploitation, this method has been used for optimization in many fields including machine learning, automotive engineering and reinforcement learning. However, the standard method suffers from two problems: 1) with cubic computational complexity in the training-set size it eventually becomes computationally infeasible to train the model, and 2) globally modeling the objective function is not necessarily optimal given the local nature of minimization. Using flexible and recursive binary partitioning of the search space, we adapt both the modeling and acquisitive aspects of standard Bayesian optimization to work harmoniously with the partitioning scheme, thereby ameliorating both standard shortcomings. We compare our method against a commonly used Bayesian optimization library on seven challenging test functions, ranging in dimensionality from 6 to 124, and show that our method achieves superior optimization performance in all tests. In addition, our method has linear computational complexity.
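The paper's partitioning scheme is not reproduced here; the sketch below shows the basic recursive-binary-partition idea, capping the number of points any local model sees, which is what replaces one cubic-cost global GP with many small local fits (interval bounds and capacity are arbitrary choices):

```python
import numpy as np

class PartitionNode:
    """Recursive binary partition of an interval. A leaf splits once it
    holds more than `cap` points, so a local GP trained per leaf only
    ever sees O(cap) data instead of the full evaluation history."""
    def __init__(self, lo, hi, cap=8):
        self.lo, self.hi, self.cap = lo, hi, cap
        self.points, self.children = [], None

    def insert(self, x):
        if self.children is not None:
            self.children[int(x >= (self.lo + self.hi) / 2)].insert(x)
            return
        self.points.append(x)
        if len(self.points) > self.cap:
            mid = (self.lo + self.hi) / 2
            self.children = (PartitionNode(self.lo, mid, self.cap),
                             PartitionNode(mid, self.hi, self.cap))
            for p in self.points:
                self.children[int(p >= mid)].insert(p)
            self.points = []

    def leaf_sizes(self):
        if self.children is None:
            return [len(self.points)]
        return self.children[0].leaf_sizes() + self.children[1].leaf_sizes()

rng = np.random.default_rng(0)
root = PartitionNode(0.0, 1.0, cap=8)
for x in rng.uniform(size=200):
    root.insert(x)
sizes = root.leaf_sizes()
```

Each local GP's training cost is then bounded by cap^3 regardless of how many total evaluations have accumulated.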
[LG-59] CompleteRXN: Toward Completing Open Chemical Reaction Databases
链接: https://arxiv.org/abs/2605.00222
作者: Gabriel Vogel,Minouk Noordsij,Evgeny Pidko,Jana M. Weber
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注:
Abstract:Chemical reaction datasets such as USPTO suffer from substantial incompleteness, frequently missing byproducts, co-reactants, and stoichiometric coefficients. This limits their applicability and reliability in downstream applications. Here, we introduce CompleteRXN, a large-scale supervised benchmark for reaction completion under realistic missing-data conditions. We construct a dataset of aligned incomplete and atom-balanced reactions by mapping USPTO records to curated mechanistic reactions. We evaluate representative baselines, including a novel encoder-decoder reaction completion model with constrained decoding, the Constrained Reaction Balancer (CRB), and a recent algorithmic method, SynRBL. On our CompleteRXN benchmark, the CRB achieves high performance across splits of increasing difficulty, reaching 99.20% equivalence accuracy on the random split and 91.12% on the extreme out-of-distribution split. SynRBL produces many balanced and chemically plausible completions, but with lower accuracy on the benchmark test splits. Across all methods, performance degrades with increasing incompleteness. We observe a substantial drop when evaluating on reactions outside the benchmark (full uncurated USPTO), highlighting the gap between benchmark performance and practical robustness and motivating future work.
[LG-60] Selfie-Capture Dynamics as an Auxiliary Signal Against Deepfakes and Injection Attacks for Mobile Identity Verification
链接: https://arxiv.org/abs/2605.00218
作者: Erkka Rantahalvari,Olli Silvén,Zinelabidine Boulkenafet,Constantino Álvarez Casado
类目: Cryptography and Security (cs.CR); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: 12 pages, 5 figures, 8 tables, 51 references, conference
Abstract:Mobile remote identity verification (RIdV) systems are exposed to attacks that manipulate or replace the facial video stream, including presentation attacks, real-time deepfakes, and video injection. Recent European requirements, including ETSI TS 119 461 and CEN/TS 18099, motivate complementary evidence channels beyond camera-based presentation-attack detection. This paper investigates whether passive motion traces recorded during selfie capture provide auxiliary evidence for spoof screening and user verification. We introduce CanSelfie, a dataset of 375 bona fide multi-sensor sequences collected at 50 Hz from 30 participants using a commercial mobile RIdV application, together with stationary, handheld, and temporally shifted attack-proxy scenarios. We benchmark 7 multivariate time-series classifiers and 8 whole-series anomaly detectors across sensor configurations and temporal windows. For spoof screening, accelerometer-only ROCKAD obtains 0.00% false rejection rate (FRR) and 43.8% false acceptance rate (FAR), while QUANT+3-NN obtains the lowest overall FAR of 32.0% at 2.37% FRR; both reject all stationary attack proxies. For same-device and same-session user verification, WEASEL+MUSE reaches 1.07% equal error rate (EER) using 9 sensor channels. The analysis shows that raw accelerometer data, preserving gravity and orientation cues, is the most informative modality, and that closed-set classification accuracy alone does not imply good verification performance because threshold calibration depends on score distributions. The findings suggest that short selfie-capture motion traces contain measurable spoof-related and identity-related information, supporting their use as a low-friction auxiliary signal while also identifying the need for cross-device, cross-session, and real injection-attack evaluation.
[LG-61] Matroid Algorithms Under Size-Sensitive Independence Oracles ICML2026
链接: https://arxiv.org/abs/2605.00201
作者: Kiarash Banihashem,MohammadTaghi Hajiaghayi,Mahdi JafariRaviz,Danny Mittal
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: Accepted to the 43rd International Conference on Machine Learning (ICML 2026) as a spotlight paper
Abstract:The standard oracle model for matroid algorithms assumes that each independence query can be answered in constant time, regardless of the size of the queried set. While this abstraction has underpinned much of the theoretical progress in matroid optimization, it masks the true computational effort required by these algorithms. In particular, for natural and widely studied classes such as graphic matroids, even a single independence query can require work linear in the size of the set, making the constant-time assumption implausible. We address this gap by introducing a size-sensitive cost model where the cost of a query Q scales with |Q| . Nearly linear-time oracle implementations exist for broad families of matroids, and this refined abstraction therefore captures the true cost of query evaluation while allowing for a more faithful comparison between general matroids and their natural special cases. Within this framework we study three fundamental algorithmic tasks: finding a basis of a matroid, approximating its rank, and approximating its partition size. We establish tight results, proving nearly matching upper and lower bounds that show the optimal query cost is (up to logarithmic factors) quadratic in the size of the matroid. On the algorithmic side, our upper bounds are realized by explicit procedures that construct the desired solution. On the complexity side, our lower bounds are unconditional and already hold even for weaker distinguishing formulations of the problems. Finally, for matroids with maximum circuit size at most c , we show that the quadratic barrier can be broken, providing an algorithm that calculates the maximum-weight basis with expected query cost \mathcal{O}(n^{2-1/c} \log n) .
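The size-sensitive model is easy to make concrete for graphic matroids, where an independence query on an edge set Q genuinely costs about |Q| (an acyclicity check). The sketch below runs naive greedy basis construction while tallying that cost, illustrating the quadratic total work the paper's bounds concern:

```python
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False          # edge would close a cycle
        self.parent[ra] = rb
        return True

def is_independent(edges, n):
    """Graphic-matroid independence oracle: an edge set is independent
    iff it is acyclic. One query costs |edges| in the size-sensitive model."""
    uf = UnionFind(n)
    return all(uf.union(u, v) for u, v in edges)

def greedy_basis(edges, n):
    """Greedy basis construction, charging |Q| per oracle call; across
    the whole run this accumulates roughly quadratic total cost."""
    basis, cost = [], 0
    for e in edges:
        q = basis + [e]
        cost += len(q)            # size-sensitive query cost
        if is_independent(q, n):
            basis = q
    return basis, cost

# Triangle on vertices 0,1,2 plus a pendant edge to vertex 3.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
basis, cost = greedy_basis(edges, 4)
```

The queries here cost 1 + 2 + 3 + 3 = 9, and in general the naive strategy pays about the square of the ground-set size, matching the regime the paper's tight bounds address.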
[LG-62] Diversity in Large Language Models under Supervised Fine-Tuning
链接: https://arxiv.org/abs/2605.00195
作者: Roman Klypa,Oleksandr Cherednichenko
类目: Machine Learning (cs.LG)
*备注:
Abstract:Supervised Fine-Tuning (SFT) is essential for aligning Large Language Models (LLMs) with user intent, yet it is believed to suppress generative diversity. Although this reduction is frequently referenced, formal empirical testing of the phenomenon remains limited. The expressiveness of LLMs has itself been addressed by multiple prior methods, whose varying perspectives suggest that deeper analysis could yield further improvements. In this study, we attribute the decline to two primary drivers: the neglect of low-frequency patterns within fine-tuning datasets and the forgetting of preexisting knowledge. Motivated by our theoretical analysis, we develop Tempered Focal (TOFU) loss, a novel objective that addresses both stated challenges simultaneously. Our extensive evaluation confirms at scale that generation breadth narrows after SFT and strengthens the hypothesis explaining this effect. Across multiple models and benchmarks, we demonstrate that TOFU enhances output diversity while preserving high response quality, offering a principled approach to SFT.
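The paper's TOFU objective is not given in the abstract; the following is a deliberately hypothetical sketch of what a tempered focal per-token loss could look like, where a focal factor down-weights tokens the model already predicts confidently (preserving gradient budget for low-frequency patterns) and a temperature softens that factor:

```python
import math

def tempered_focal_loss(p, gamma=2.0, tau=1.5):
    """HYPOTHETICAL tempered focal loss on the probability p assigned to
    the target token. The focal factor (1 - p_t)**gamma shrinks the loss
    on tokens the model already predicts confidently; tau > 1 tempers p
    before the focal factor is applied. The paper's actual TOFU
    formulation may differ from this sketch."""
    p = min(max(p, 1e-12), 1.0)   # clamp for numerical safety
    p_t = p ** (1.0 / tau)        # tempering (assumed form)
    return -((1.0 - p_t) ** gamma) * math.log(p)
```

Under this shape, easy tokens (p near 1) contribute almost nothing, while rare, hard tokens keep close to their full cross-entropy weight.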
[LG-63] OTSS: Output-Targeted Soft Segmentation for Contextual Decision-Weight Learning
链接: https://arxiv.org/abs/2605.00193
作者: Renjun Hu,Hyun-Soo Ahn
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 23 pages, 2 figures
Abstract:Many machine learning systems make constrained decisions by optimizing factorized objectives, but the context-specific objective is often treated as fixed. We study contextual decision-weight learning: from logged decisions and proxy outputs, learn an optimizer-facing weight vector w(x) over interpretable decision factors z(x,d), rather than a direct policy or generic predictive score. We propose OTSS, an output-targeted soft-segmentation model that deploys the personalized decision-ready weight vector. At the function-class level, the theory highlights a hard-versus-soft distinction. Hard partitions incur an approximation-estimation tradeoff under overlap, while a realizable fixed-K soft class removes the hard-partition approximation floor and attains a parametric rate. We evaluate OTSS in controlled benchmarks with finite evaluation libraries, where the true weight vector and downstream regret can be computed exactly. In the representative overlap setting, OTSS attains the lowest mean regret among the comparators, including EM mixture regression, the strongest soft-mixture baseline in our comparison; it matches EM on coefficient recovery while running about two orders of magnitude faster. In a matched K=5 benchmark, OTSS remains competitive under hard-routed truth and improves as heterogeneity becomes softer and sample size grows. On a fixed Complete Journey retail anchor with real household covariates and action geometry, OTSS again achieves the lowest mean-regret point estimate.
[LG-64] Towards A Generative Protein Evolution Machine with DPLM-Evo ICML2026
链接: https://arxiv.org/abs/2605.00182
作者: Xinyou Wang,Liang Hong,Jiasheng Ye,Zaixiang Zheng,Yu Li,Shujian Huang,Quanquan Gu
类目: Machine Learning (cs.LG)
*备注: A peer-reviewed version was accepted to ICML 2026
Abstract:Proteins are shaped by gradual evolution under biophysical and functional constraints. Protein language models learn rich evolutionary constraints from large-scale sequences, and discrete diffusion-based protein language models (e.g., DPLMs) are promising for both understanding and generation. However, existing DPLMs typically rely on masking-based absorbing diffusion that contradicts a simple biological intuition: proteins evolve through accumulated edits, not by emerging from masks. Consequently, these frameworks lack explicit pretraining objectives for substitution and insertion/deletion (indel) operations, limiting both optimization-style post-editing and flexible guided generation. To address these limitations, we present DPLM-Evo, an evolutionary discrete diffusion framework that explicitly predicts substitution, insertion, and deletion operations during denoising. DPLM-Evo decouples an upsampled-length latent alignment space from the variable-length observed sequence space, which makes indel-aware generation tractable and enables adaptive scaffold growth throughout the process with negligible computational overhead. To better align substitutions with real evolution, we further introduce a contextualized evolutionary noising kernel that produces biologically informed, context-dependent mutation patterns. Across tasks, DPLM-Evo improves sequence understanding and achieves state-of-the-art mutation effect prediction performance on ProteinGym in the single-sequence setting. It also enables variable-length simulated evolution, and post-editing/optimization of existing proteins via explicit edit trajectories.
[LG-65] Network Digital Untwinning: Towards Backward Optimization of Digital Twins
链接: https://arxiv.org/abs/2605.00169
作者: Zifan Zhang,Dianwei Chen,Anjun Gao,Manhua Wang,Mingzhe Chen,Minghong Fang,Xianfeng Yang,Yuchen Liu
类目: Networking and Internet Architecture (cs.NI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: Accepted by ICDCS 2026
Abstract:Network digital twins (NDTs) are transforming network management by offering precise virtual replicas of physical network systems. However, their reliance on diverse and sensitive data introduces significant challenges related to data management, regulatory compliance, and user privacy. In scenarios where selective data removal is necessary, such as device deactivation, network reconfiguration, or regulatory compliance, traditional approaches often fall short of preserving the integrity of the twin model. To address this gap, we introduce a network digital untwinning framework that enables the targeted removal of deprecated NDT contributions while maintaining model integrity. Our approach comprises two complementary components: a Single Request Untwinning mechanism and a Parallel Request Untwinning mechanism. Single Request Untwinning leverages connectivity metrics based on geographical proximity, data distribution, and network-level attributes to identify and remove the target NDT along with its propagating influence. This is achieved through an optimally selected rollback checkpoint augmented with injected Gaussian noise, followed by a precise remapping phase. Parallel Request Untwinning extends this mechanism to efficiently handle multiple removal requests by clustering NDTs with similar attributes and performing a coordinated rollback and untwinning schedule. We provide theoretical guarantees on model indistinguishability from scratch-built twins, and validate the framework through extensive experiments on real-world traffic data, demonstrating its effectiveness and operational efficiency.
[LG-66] Consistent Diffusion Language Models ICML2026
链接: https://arxiv.org/abs/2605.00161
作者: Hasan Amin,Yuan Gao,Yaser Souri,Subhojit Som,Ming Yin,Rajiv Khanna,Xia Song
类目: Machine Learning (cs.LG)
*备注: ICML 2026
Abstract:Diffusion language models (DLMs) are an attractive alternative to autoregressive models because they promise sublinear-time, parallel generation, yet practical gains remain elusive as high-quality samples still demand hundreds of refinement steps. In continuous domains, consistency training along the probability-flow ODE is a popular recipe to accelerate diffusion. For discrete diffusion, no analogous sample-space ODE exists, making direct adaptation ill-defined. We argue that the natural discrete substitute is not a deterministic trajectory but its stochastic counterpart: the exact posterior bridge, available in closed form for broad corruption families including masked and uniform diffusion. Building on this observation, we introduce Multi-Path Discrete Consistency (MPDC), a new principle that trains a denoiser to be path-invariant in expectation across these stochastic bridges, and instantiate it as the Consistent Diffusion Language Model (CDLM), a single-stage, teacher-free training framework. A single CDLM objective unifies masked diffusion, continuous consistency models, and progressive/discrete distillation as analytic limits or empirical approximations of one common view. Empirically, CDLM establishes a new state of the art on both conditional and unconditional text-generation, consistently outperforming strong base discrete diffusion models and often even multi-stage distilled baselines across sampling budgets, with the largest gains in the few-step regime. Together, these results position CDLM as a principled and scalable foundation for the next generation of fast, high-fidelity discrete generative modeling.
[LG-67] Learning Fingerprints for Medical Time Series with Redundancy-Constrained Information Maximization
链接: https://arxiv.org/abs/2605.00130
作者: Huayu Li,ZhengXiao He,Xiwen Chen,Jingjing Wang,Siyuan Tian,Jinghao Wen,Ao Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Learning meaningful representations from medical time series (MedTS) such as ECG or EEG signals is a critical challenge. These signals are often high-dimensional, variable-length, and rife with noise. Existing self-supervised approaches, such as Masked Autoencoders (MAEs), are highly effective for pre-training general-purpose encoders. However, they do not explicitly learn compact and semantically interpretable latent representations, typically relying on heuristic aggregation strategies such as global average pooling or a designated [CLS] token. We propose a novel framework that compresses a variable-length MedTS into a fixed-size set of k latent Fingerprint Tokens. Our architecture employs a cross-attention bottleneck to generate these tokens and is trained with a dual-objective function. The first objective is a reconstruction loss, which ensures the tokens are sufficient statistics for the original data. The second, a diversity penalty based on the Total Coding Rate (TCR), explicitly minimizes the redundancy between tokens, encouraging them to become statistically disentangled representations. We present the theoretical justification for our method, framing it as a novel Disentangled Rate-Distortion problem. This approach produces a low-dimensional, interpretable, and sample-efficient representation, where each token is encouraged to capture an independent factor of variation, paving the way for more robust digital biomarkers.
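The Total Coding Rate penalty mentioned above has a standard closed form in the MCR^2 literature, (1/2) logdet(I + d/(k eps^2) Z^T Z). The small sketch below (constants may differ from the paper's exact objective) shows that it rewards spread-out, non-redundant tokens over collapsed ones:

```python
import numpy as np

# Hedged sketch of a Total Coding Rate (TCR) diversity term over k
# fingerprint tokens Z (shape k x d), following the standard MCR^2-style
# formula; the paper's exact normalization may differ.
def total_coding_rate(Z, eps=0.5):
    k, d = Z.shape
    cov = Z.T @ Z                                  # d x d Gram of the token set
    M = np.eye(d) + (d / (k * eps**2)) * cov
    sign, logdet = np.linalg.slogdet(M)            # stable log-determinant
    return 0.5 * logdet

rng = np.random.default_rng(0)
diverse = rng.standard_normal((8, 16))             # near-orthogonal tokens
collapsed = np.tile(diverse[:1], (8, 1))           # all tokens identical
assert total_coding_rate(diverse) > total_coding_rate(collapsed)
```

Maximizing this quantity (equivalently, adding its negation to the loss) pushes the k tokens to occupy independent directions, which is the "redundancy-constrained information maximization" of the title.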
[LG-68] SPLICE: Latent Diffusion over JEPA Embeddings for Conformal Time-Series Inpainting
链接: https://arxiv.org/abs/2605.00126
作者: Arnaud Zinflou
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
*备注:
Abstract:Generative models for time-series imputation achieve strong reconstruction accuracy, yet provide no finite-sample reliability guarantees, a critical limitation in power systems where imputed values inform dispatch and planning. We introduce SPLICE (Self-supervised Predictive Latent Inpainting with Conformal Envelopes), a modular framework coupling latent generative imputation with distribution-free, online-adaptive prediction intervals. A JEPA encoder maps daily load segments into a 64-dimensional latent space; a conditional latent bridge with four sampling modes generates candidate gap trajectories; an hourly-conditioned decoder maps back to signal space; and Adaptive Conformal Inference (ACI) wraps the output with coverage-guaranteed prediction bands. The flow-matching variant achieves comparable quality to DDIM in 5–10 ODE steps (5-10x speedup). On thirteen load datasets (nine proprietary, three UCI Electricity, ETTh1), SPLICE achieves the lowest mean Load-only MSE (0.056), winning 9/12 non-degenerate datasets at 91-day gaps and 18/32 across all gap lengths vs. five established baselines, and produces the best CRPS (0.161, -18.3% vs. the strongest competitor). ACI delivers 93–95% empirical coverage, correcting under-coverage failures of up to 7.5 pp observed with static conformal prediction. A pooled JEPA encoder trained on nine feeds transfers to four unseen domains, matching or exceeding per-dataset oracles with only a quick bridge fine-tuning.
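Adaptive Conformal Inference, which SPLICE uses for its coverage-guaranteed bands, follows a simple online update: tighten or widen the working miscoverage level depending on whether the last truth fell inside the band. A hedged sketch of that step (the learning rate `gamma` and target level are illustrative; SPLICE's exact configuration is not stated in the abstract):

```python
# Hedged sketch of the Adaptive Conformal Inference (ACI) update in the
# style of Gibbs & Candes: alpha_{t+1} = alpha_t + gamma * (alpha - err_t),
# where err_t = 1 when the observation fell outside the current band.
def aci_update(alpha_t, covered, target_alpha=0.05, gamma=0.01):
    err = 0.0 if covered else 1.0
    return alpha_t + gamma * (target_alpha - err)

alpha = 0.05
alpha = aci_update(alpha, covered=False)   # miss -> lower alpha -> wider band
assert alpha < 0.05
alpha = aci_update(alpha, covered=True)    # hit -> slowly raise alpha again
```

Running this update over a stream is what corrects the under-coverage failures of static conformal prediction that the abstract reports.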
[LG-69] Comparative Analysis of Polygon-Based and Global Machine Learning Models for Bus Occupancy Prediction
链接: https://arxiv.org/abs/2605.00083
作者: Daniel Azenkot,Michael Fire,Eran Ben Elia
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate forecasting of bus ridership (passenger numbers) is crucial for efficient management and optimization of public transport systems. Traditional forecasting models often fail to capture the unique and localized dynamics of different urban areas by treating the entire city as a single, homogeneous region. This paper introduces a novel framework that enhances bus ridership prediction by integrating a spatial clustering methodology with multi-dimensional feature analysis. The proposed framework utilizes a diverse set of data, including bus ridership data (by route number, time, and bus stop) complemented by a variety of open source data, such as spatial features (e.g., attractive destinations), meteorological conditions (e.g., temperature, rainfall), and temporal patterns (e.g., time of day, day of week). By clustering the urban area into distinct regions, based on the principle that bus stops in close proximity share similar ridership characteristics, a separate local forecasting model is trained for each of these clusters. This localized approach demonstrates an accuracy comparable to that of global models. The findings suggest that a spatially-aware, localized modeling strategy is effective for public transport prediction, paving the way for more targeted and efficient service improvements.
[LG-70] CRADIPOR: Crash Dispersion Predictor
链接: https://arxiv.org/abs/2605.00070
作者: Edgar Chaillou,Sebastian Rodriguez,Yves Tourbier,Francisco Chinesta
类目: Machine Learning (cs.LG)
*备注:
Abstract:We present CRADIPOR, a numerical dispersion prediction tool for automotive crash simulations. Finite Element (FE) crash models are widely used throughout vehicle development, but their predictions are not strictly repeatable because of parallel computation and model complexity. As a result, performance criteria evaluated during post-processing may exhibit significant numerical dispersion, which complicates engineering decision-making. Although dispersion can be estimated by repeating the same simulation, this approach is generally impractical because of its high computational cost. This work therefore investigates a prediction tool that can be applied during routine crash-simulation post-processing without repeating the computation. The proposed approach relies on a Rank Reduction Autoencoder (RRAE) combined with supervised classification in order to identify regions sensitive to numerical dispersion. The comparative analysis suggests that the RRAE-based framework is more effective than the Random Forest baseline on the studied dataset. Among the tested signal representations, wavelet-based and slope-based inputs appear to be the most promising, with slope variations providing the best classification performance. These results support the use of structured latent representations for improving numerical-dispersion detection in automotive crash post-processing.
[LG-71] Soft-MSM: Differentiable Context-Aware Elastic Alignment for Time Series
链接: https://arxiv.org/abs/2605.00069
作者: Christopher Holder,Anthony Bagnall
类目: Machine Learning (cs.LG)
*备注:
Abstract:Elastic distances like dynamic time warping (DTW) are central to time series machine learning because they compare sequences under local temporal misalignment. Soft-DTW is an adaptation of DTW that can be used as a gradient-based loss by replacing the hard minimum in its dynamic-programming recursion with a smooth relaxation. However, this approach does not directly extend to elastic distances whose transition costs depend on the local alignment context. Move-Split-Merge (MSM) is one such distance: it uses context-aware split and merge penalties and has often outperformed DTW in supervised and unsupervised time series machine learning tasks such as classification and clustering. We introduce Soft-MSM, a smooth relaxation of MSM and an elastic alignment loss with context-aware transition costs. Central to the formulation is a smooth gated surrogate for MSM’s piecewise split/merge cost, which enables gradients through both the dynamic-programming recursion and the local transition structure. We derive the forward recursion, backward recursion, soft alignment matrix, closed-form gradient, limiting behaviour, and divergence-corrected formulation. Experiments on 112 UCR datasets show that Soft-MSM gives lower MSM barycentre loss than existing MSM barycentre methods, and yields significantly better clustering and nearest-centroid classification performance than Soft-DTW-based alternatives. An implementation is available in the open-source aeon toolkit.
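The smooth relaxation underlying both Soft-DTW and Soft-MSM replaces the hard minimum in the alignment recursion with a soft minimum, -gamma * log(sum(exp(-x/gamma))). The sketch below shows this for the plain Soft-DTW recursion only; Soft-MSM's context-aware split/merge costs are the paper's contribution and are not reproduced here:

```python
import numpy as np

# Hedged sketch: the soft-min relaxation at the heart of Soft-DTW (and,
# with context-aware transition costs, Soft-MSM). Replacing min(a, b, c)
# with this smooth surrogate makes the whole DP differentiable.
def softmin(costs, gamma=1.0):
    c = np.asarray(costs) / -gamma
    m = c.max()                                    # log-sum-exp stabilization
    return -gamma * (m + np.log(np.exp(c - m).sum()))

def soft_dtw(x, y, gamma=1.0):
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2      # local squared distance
            D[i, j] = cost + softmin(
                [D[i - 1, j], D[i, j - 1], D[i - 1, j - 1]], gamma)
    return D[n, m]
```

As gamma tends to zero, softmin approaches the hard minimum and the recursion recovers classical DTW; Soft-MSM applies the same idea to MSM's move/split/merge transitions.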
[LG-72] Information-Theoretic Generalization Bounds for Stochastic Gradient Descent with Predictable Virtual Noise
链接: https://arxiv.org/abs/2605.00064
作者: Mohammad Partohaghighi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Information-theoretic generalization bounds analyze stochastic optimization by relating expected generalization error to the mutual information between learned parameters and training data. Virtual perturbation analyses of SGD add auxiliary Gaussian noise only in the proof, making mutual information tractable while leaving the actual SGD trajectory unchanged. Existing bounds, however, typically require perturbation covariances to be fixed independently of the optimization history, limiting their ability to represent geometries induced by moving gradient statistics, preconditioners, curvature proxies, and other pathwise information. We introduce predictable history-adaptive virtual perturbations, where the perturbation covariance at each iteration may depend on the past real SGD history but not on current or future randomness. This predictability enables a conditional Gaussian relative-entropy argument and yields generalization bounds for SGD with adaptive virtual-noise geometry. The bounds replace fixed sensitivity and gradient-deviation terms with conditional adaptive counterparts, include an output-sensitivity penalty from accumulated perturbation covariance, and reduce the deviation term to a conditional variance only under conditional unbiasedness. Since adaptive covariances may be data-dependent, we separate local Gaussian smoothing from global reference-kernel comparison. The resulting bound includes a covariance-comparison cost measuring the KL price of using an admissible reference geometry different from the actual adaptive covariance. Fixed-noise-style bounds are recovered under admissible synchronization, such as deterministic, public, or prefix-observable covariance rules. The framework recovers fixed isotropic and geometry-aware bounds as special cases while extending virtual perturbation analysis to history-dependent SGD without modifying the algorithm. 
[LG-73] Autoformalizing Memory Specifications with Agents
链接: https://arxiv.org/abs/2605.00058
作者: Jan Ole Ernst,Dmitri Michelangelo Saberi,Derek Christ,Thomas Zimmermann,Rajath Salegame,Suhaas M. Bhat,Stanislav Levental,Thomas Dybdahl Ahle,Matthias Jung
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:
Abstract:The primary goal of Design Verification (DV) is to ensure that a proposed chip design implementation (either in code, or physical form) exactly matches its specification and is free of functional errors in order to avoid costly re-designs. Achieving this often demands extensive manual interpretation, translating the specification document into a formal, testable representation. While AI has made progress in DV, current approaches typically focus on narrow, isolated tasks rather than full end-to-end specification compliance of modern chip designs, failing to capture the complexity of real-world verification. Our method automatically formalizes natural language memory chip specifications, for industry-relevant Dynamic Random Access Memory (DRAM) standards, into a formal representation called DRAMPyML that can be used for downstream DV tasks like the generation of SystemVerilog assertions, stimulus, and functional coverage. We also release our benchmarking dataset, DRAMBench, which can be used to evaluate the evolution of model capabilities (and new approaches) at hardware autoformalization.
[LG-74] ROSA: Robust and Energy-Efficient Microring-Based Optical Neural Networks via Optical Shift-and-Add and Layer-Wise Hybrid Mapping
链接: https://arxiv.org/abs/2605.00032
作者: Huifan Zhang,Yun Hu,Caizhi Sheng,Yurui Qu,Pingqiang Zhou
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:
Abstract:This work presents ROSA, a microring-based optical neural network architecture that improves robustness and energy efficiency using an optical shift-and-add (OSA) module and a layer-wise hybrid mapping strategy. It introduces a noise-aware voltage-to-weight model considering DAC and thermal variations, and a workload-aware framework to co-optimize MRR array size and layer-wise dataflow. Optimized arrays reduce the aggregated relative energy-delay product (EDP) by 64% and 26% compared with DEAP-CNNs and a general compact array, respectively. OSA further contributes 29% EDP reduction. The proposed hybrid mapping strategy improves CIFAR-10 accuracy by 8.3% over weight-stationary mapping while achieving an average 54.7% lower EDP than DEAP-CNNs.
[LG-75] What Physics do Data-Driven MoCap-to-Radar Models Learn?
链接: https://arxiv.org/abs/2605.00018
作者: Kevin Chen,Kenneth W. Parker,Anish Arora
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Data-driven MoCap-to-radar models generate plausible micro-Doppler spectrograms, but do they actually learn the underlying physics? We introduce a physics-based interpretability framework to answer this question via two proposed complementary metrics: one measures alignment between model predictions and the physics-derived Doppler frequency, while the other tests whether predictions preserve the velocity-frequency relationship under velocity intervention. Both metrics require only MoCap input and model predictions, without access to measured radar data. Experiments across several model architectures reveal that low reconstruction error does not guarantee physical consistency: some, but not all, models achieve low error yet perform poorly on the two physics-based metrics. Further analysis shows that temporal attention is critical for transformer-based models to learn the underlying physics.
[LG-76] Quantum Interval Bound Propagation for Certified Training of Quantum Neural Networks
链接: https://arxiv.org/abs/2605.00747
作者: Emma Andrews,Nahyeon Kim,Prabhat Mishra
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
Abstract:Quantum machine learning is a promising field for efficiently learning features of a dataset to perform a specified task, such as classification. Interval bound propagation (IBP) is a popular certified training method in classical machine learning, where the lower and upper bounds are tracked throughout the model. These bounds are used during training to ensure that the model is certified to predict the correct label even under adversarial perturbations. While IBP is successful in classical domain, there are limited certified training efforts in quantum domain. In this paper, we present quantum interval bound propagation (QIBP) to establish a certified training routine for quantum machine learning, certifying the accuracy of models under adversarial perturbations. We implement QIBP using both interval and affine arithmetic to explore the tradeoffs between the two implementations in terms of accuracy and other design considerations. Extensive evaluation demonstrates that the resulting certified trained models have robust decision boundaries, guaranteed to predict the correct class for the samples within the trained adversarial robustness bounds.
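Classical IBP, the starting point for QIBP, pushes an axis-aligned box through each layer by tracking its center and radius; for an affine layer the propagation is exact. A minimal sketch of that classical step (the quantum propagation rules are the paper's contribution and are not shown):

```python
import numpy as np

# Hedged sketch of classical interval bound propagation (IBP) through an
# affine layer: the center moves through W and b, while the radius grows
# by |W| @ radius, giving sound worst-case output intervals.
def ibp_affine(W, b, lower, upper):
    center = (upper + lower) / 2.0
    radius = (upper - lower) / 2.0
    new_center = W @ center + b
    new_radius = np.abs(W) @ radius       # worst-case interval growth
    return new_center - new_radius, new_center + new_radius

W = np.array([[1.0, -2.0]])
b = np.array([0.5])
lo, hi = ibp_affine(W, b, np.array([0.0, 0.0]), np.array([1.0, 1.0]))
# the true range of x1 - 2*x2 + 0.5 over the unit box is [-1.5, 1.5]
assert np.isclose(lo[0], -1.5) and np.isclose(hi[0], 1.5)
```

Certified training then penalizes the worst-case logit bounds rather than point predictions; the affine-arithmetic variant the paper explores tracks correlated error terms to tighten these intervals.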
[LG-77] Randomized Subspace Nesterov Accelerated Gradient
链接: https://arxiv.org/abs/2605.00740
作者: Gaku Omiya,Pierre-Louis Poirion,Akiko Takeda
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 50 pages
Abstract:Randomized-subspace methods reduce the cost of first-order optimization by using only low-dimensional projected-gradient information, a feature that is attractive in forward-mode automatic differentiation and communication-limited settings. While Nesterov acceleration is well understood for full-gradient and coordinate-based methods, obtaining accelerated methods for general subspace sketches that use only projected-gradient information and can improve over full-dimensional Nesterov acceleration in oracle complexity is technically nontrivial. We develop randomized-subspace Nesterov accelerated gradient methods for smooth convex and smooth strongly convex optimization under matrix smoothness and generic sketch moment assumptions. The key technical ingredient is a three-sequence formulation tailored to matrix smoothness, which recovers the corresponding classical Nesterov methods in the full-dimensional case. The resulting theory establishes accelerated oracle-complexity guarantees and makes explicit how matrix smoothness and the sketch distribution enter the complexity. It also provides a unified basis for comparing sketch families and identifying when randomized-subspace acceleration improves over full-dimensional Nesterov acceleration in oracle complexity.
[LG-78] Decentralized Proximal Stochastic Gradient Langevin Dynamics
链接: https://arxiv.org/abs/2605.00723
作者: Mohammad Rafiqul Islam,Lingjiong Zhu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注: 42 pages, 7 figures
Abstract:We propose Decentralized Proximal Stochastic Gradient Langevin Dynamics (DE-PSGLD), a decentralized Markov chain Monte Carlo (MCMC) algorithm for sampling from a log-concave probability distribution constrained to a convex domain. Constraints are enforced through a shared proximal regularization based on the Moreau-Yosida envelope, enabling unconstrained updates while preserving consistency with the target constrained posterior. We establish non-asymptotic convergence guarantees in the 2-Wasserstein distance for both individual agent iterates and their network averages. Our analysis shows that DE-PSGLD converges to a regularized Gibbs distribution and quantifies the bias introduced by the proximal approximation. We evaluate DE-PSGLD for different sampling problems on synthetic and real datasets. As the first decentralized approach for constrained domains, our algorithm exhibits fast posterior concentration and high predictive accuracy.
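The Moreau-Yosida device in the abstract turns a hard constraint into a smooth drift (x - prox(x))/lambda that pulls iterates back toward the feasible set, so the Langevin update itself stays unconstrained. A single-agent sketch for a box constraint (step sizes are illustrative and the decentralized averaging across agents is omitted):

```python
import numpy as np

# Hedged single-agent sketch of a Moreau-Yosida proximal Langevin step for
# sampling a log-concave density restricted to a convex set C (here a box).
# DE-PSGLD shares this proximal regularization across agents; the gossip /
# averaging step of the decentralized algorithm is not shown.
def prox_box(x, lo=-1.0, hi=1.0):
    return np.clip(x, lo, hi)   # proximal map of the box indicator

def psgld_step(x, grad_logpi, step=0.01, lam=0.05, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    # Moreau-Yosida gradient of the constraint term: (x - prox(x)) / lam
    drift = grad_logpi(x) - (x - prox_box(x)) / lam
    return x + step * drift + np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
```

Smaller lambda enforces the constraint more strictly but, as the paper quantifies, introduces a bias between the regularized Gibbs distribution and the exact constrained posterior.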
[LG-79] FedKPer: Tackling Generalization and Personalization in Medical Federated Learning via Knowledge Personalization ICIP
链接: https://arxiv.org/abs/2605.00698
作者: Zoe Fowler,Ghassan AlRegib
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注: Accepted to IEEE International Conference on Image Processing (ICIP)
Abstract:Federated learning (FL) holds great potential for medical applications. However, statistical heterogeneity across healthcare institutions poses a major challenge for FL, as the global model struggles both to generalize across unseen patient populations and to adapt to the unique data distributions of individual hospitals. This heterogeneity also exacerbates forgetting at both the global and local levels, causing previously learned patient patterns to be misclassified after model updates. While prior work has largely treated generalization and personalization as separate challenges, we show that a better balance between the two can be achieved through selective alignment with the global model and a modified aggregation scheme, which together mitigate the effects of statistical heterogeneity. Specifically, we propose FedKPer, which introduces knowledge personalization into the training stage of each local device. Afterwards, generalization is considered via the global model aggregation process, where local updates that are reliable and label-diverse are emphasized. We evaluate the performance of FedKPer, devising additional metrics that relate to common consequences of forgetting. Overall, we demonstrate FedKPer improves the generalization-personalization trade-off without sacrificing retention.
[LG-80] Gradient Regularized Newton Boosting Trees with Global Convergence
链接: https://arxiv.org/abs/2605.00581
作者: Nikita Zozoulenko,Daniel Falkowski,Thomas Cass,Lukas Gonon
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Gradient Boosting Decision Trees (GBDTs) dominate tabular machine learning, with modern implementations like XGBoost, LightGBM, and CatBoost being based on Newton boosting: a second-order descent step in the space of decision trees. Despite its empirical success, the global convergence of Newton boosting is poorly understood compared to first-order boosting. In this paper, we introduce Restricted Newton Descent, which studies convex optimization with Newton’s method on Hilbert spaces with inexact iterates, based on the concepts of cosine angle and weak gradient edge. Within this framework, we recover Newton boosting with GBDTs and classical finite-dimensional theory as special cases. We first prove that vanilla Newton boosting achieves a linear rate of convergence for smooth, strongly convex losses that satisfy a Hessian-dominance condition. To handle general convex losses with Lipschitz Hessians, we extend a recent gradient regularized Newton scheme to the restricted weak learner setting. This scheme minimally modifies the classical algorithm by introducing an adaptive \ell_2-regularization term proportional to the square root of the gradient norm at each iteration. We establish an \mathcal{O}(1/k^2) rate for this scheme, thereby obtaining a globally convergent second-order GBDT algorithm with a rate matching that of first-order boosting with Nesterov momentum. In numerical experiments, we show that our scheme converges while vanilla Newton boosting may diverge.
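The gradient-regularized scheme described above damps the Newton system with an l2 term proportional to the square root of the gradient norm. A hedged numpy sketch of one such step in the plain finite-dimensional setting (the constant c and the quadratic test problem are our own choices; the paper works with tree-based weak learners):

```python
import numpy as np

# Hedged sketch of a gradient-regularized Newton step:
#   x+ = x - (H + c * sqrt(||g||) * I)^{-1} g
# The adaptive damping vanishes as the gradient shrinks, recovering a pure
# Newton step near the optimum while stabilizing early iterations.
def reg_newton_step(x, grad_f, hess_f, c=1.0):
    g = grad_f(x)
    H = hess_f(x)
    lam = c * np.sqrt(np.linalg.norm(g))
    return x - np.linalg.solve(H + lam * np.eye(len(x)), g)

# quadratic test problem: f(x) = 0.5 * x^T A x, minimized at the origin
A = np.array([[2.0, 0.0], [0.0, 1.0]])
x = np.array([1.0, 1.0])
for _ in range(20):
    x = reg_newton_step(x, lambda v: A @ v, lambda v: A)
assert np.linalg.norm(x) < 1e-3
```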
[LG-81] A unified perspective on fine-tuning and sampling with diffusion and flow models
链接: https://arxiv.org/abs/2605.00229
作者: Carles Domingo-Enrich,Yuanqi Du,Michael S. Albergo
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:We study the problem of training diffusion and flow generative models to sample from target distributions defined by an exponential tilting of a base density; a formulation that subsumes both sampling from unnormalized densities and reward fine-tuning of pre-trained models. This problem can be approached from a stochastic optimal control (SOC) perspective, using adjoint-based or score matching methods, or from a non-equilibrium thermodynamics perspective. We provide a unified framework encompassing these approaches and make three main contributions: (i) bias-variance decompositions revealing that Adjoint Matching/Sampling and Novel Score Matching have finite gradient variance, while Target and Conditional Score Matching do not; (ii) norm bounds on the lean adjoint ODE that theoretically support the effectiveness of adjoint-based methods; and (iii) adaptations of the CMCD and NETS loss functions, along with novel Crooks and Jarzynski identities, to the exponential tilting setting. We validate our analysis with reward fine-tuning experiments on Stable Diffusion 1.5 and 3.
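The exponentially tilted target p(x) exp(r(x)) can be checked on a toy case where the tilt is available in closed form: tilting N(0,1) by exp(2x) gives N(2,1). The snippet below verifies this with self-normalized importance weights over base samples (this illustrates the object being approximated, not one of the paper's training methods):

```python
import numpy as np

# Hedged sketch of the exponential-tilting target: p_target(x) is
# proportional to p_base(x) * exp(r(x)). With p_base = N(0, 1) and reward
# r(x) = 2x, completing the square gives p_target = N(2, 1) exactly.
rng = np.random.default_rng(0)
base = rng.standard_normal(100_000)        # samples from p_base
log_w = 2.0 * base                         # log importance weight = r(x)
w = np.exp(log_w - log_w.max())            # stabilized weights
w /= w.sum()                               # self-normalization
tilted_mean = float((w * base).sum())
assert abs(tilted_mean - 2.0) < 0.2        # mean of N(2, 1) is 2
```

Reward fine-tuning amounts to training the diffusion or flow model to sample this tilted density directly, instead of reweighting base samples as done here.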
[LG-82] From Birdsong to Rumbles: Classifying Elephant Calls with Out-of-Species Embeddings
Link: https://arxiv.org/abs/2605.00225
Authors: Christiaan M. Geldenhuys, Thomas R. Niesler
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD); Quantitative Methods (q-bio.QM)
*Comments:
Abstract:We show that pretrained acoustic embeddings classify elephant vocalisations at a level approaching that of end-to-end supervised neural networks, without any fine-tuning of the embedding model. This result is of practical importance because annotated bioacoustic data are scarce and costly to obtain, leaving conventional supervised approaches prone to overfitting and to poor generalisation under domain shift. A broad range of embedding models drawn from general audio, speech, and bioacoustic domains is evaluated, all of which are either out-of-domain (containing no bioacoustic data) or out-of-species (containing no elephant call data). The embedding networks themselves remain fixed; only the lightweight downstream classifiers, which include a linear model and several small neural networks, are trained. Among the models considered, Perch 2.0 achieves the best cross-validated classification performance, attaining AUCs of 0.849 on African bush elephant (Loxodonta africana) calls and 0.936 on Asian elephant (Elephas maximus) calls, with Perch 1.0 close behind. The best-performing system is within 2.2 % of an end-to-end supervised elephant call classification system. A layerwise analysis of pretrained transformer encoders, considered as embedding models, shows that intermediate representations outperform final-layer outputs. The second layer of both wav2vec2.0 and HuBERT encodes sufficient information for effective elephant call classification; truncation at this layer therefore preserves classification performance whilst retaining only approximately 10 % of the parameters of the full network. Such compact embedding networks are well suited to on-device processing where computational resources are limited.
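The frozen-embedding pipeline in this abstract reduces to fitting a lightweight probe on fixed features. Below is a minimal sketch with a synthetic stand-in for the embedding model (Perch itself is not used here, and the data are simulated, not elephant calls).

```python
import numpy as np

rng = np.random.default_rng(1)

def fake_embed(n, label):
    # stand-in for a frozen pretrained encoder: fixed features per class
    return rng.standard_normal((n, 16)) + 2.0 * label

X = np.vstack([fake_embed(200, 0), fake_embed(200, 1)])  # frozen embeddings
y = np.repeat([0, 1], 200)

# lightweight downstream classifier: logistic regression by gradient descent;
# only these weights are trained, the "encoder" stays fixed
w, b = np.zeros(16), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.1 * X.T @ (p - y) / len(y)
    b -= 0.1 * np.mean(p - y)
acc = np.mean(((X @ w + b) > 0) == y)
```

The design choice mirrors the paper: all learning capacity sits in the small probe, which limits overfitting when labeled bioacoustic data are scarce.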
[LG-83] SHIFT: Robust Double Machine Learning for Average Dose-Response Functions under Heavy-Tailed Contamination
Link: https://arxiv.org/abs/2605.00176
Authors: Eichi Uehara
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments: 77 pages, 43 figures, 35 tables. Code and raw CSVs: this https URL
Abstract:Double-machine-learning pipelines for the Average Dose-Response Function rely on kernel-weighted local-linear smoothers, which inherit unbounded functional influence: a single outlier within a kernel window biases the curve across the entire window. We introduce SHIFT (Self-calibrated Heavy-tail Inlier-Fit with Tempering), a robust DML estimator combining cross-fit nuisance orthogonalization with a kernel-local Welsch-loss second stage optimized by Graduated Non-Convexity, and – the principal design choice – a defensive OLS refit whose inlier cutoff is scaled by post-GNC residual MAD rather than the raw-outcome MAD. On a localized-contamination stress test at p=0.25 , this design choice drops level-RMSE from 1.03 to 0.33 while leaving clean and uniformly-contaminated runs unchanged. Across 1,400 main-sweep fits, SHIFT has competitive worst-case shape recovery (RMSE 0.325 at p=0.25 , second to Huber-DML’s 0.276); among the three methods with worst-case RMSE below 0.35, only SHIFT emits a non-uniform per-sample weight vector, recovering the ground-truth outlier mask at mean F_1 \approx 0.96 (range 0.945–0.968) on Gaussian-jump DGPs. We pair the estimator with a six-technique Extreme Value Theory diagnostic suite (Hill, GPD-MLE/PWM, GEV, Mean Excess, parameter stability, causal tail coefficient) that lets a practitioner distinguish Fréchet from Weibull regimes and choose between SHIFT and L1 alternatives on empirical grounds. Extensions to binary-treatment CATE (Huber pseudo-outcome X-Learner) and time-series ADRF (block-CV + rolling MAD) are included. A counter-intuitive ablation: linear nuisance models (Ridge, Lasso) outperform gradient-boosted nuisances for robust DML under uniform contamination, inverting the usual more-flexible-is-better heuristic.
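The MAD-scaled defensive refit highlighted in this abstract can be sketched on a toy contaminated regression. This is illustrative only: the cutoff of 3, the pilot OLS fit, and the data-generating process are assumptions, not SHIFT's actual configuration (which uses a Welsch-loss/GNC stage before the refit).

```python
import numpy as np

def ols(xv, yv):
    # least-squares fit of y = slope*x + intercept
    A = np.vstack([xv, np.ones_like(xv)]).T
    return np.linalg.lstsq(A, yv, rcond=None)[0]

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 300)
y = 2.0 * x + 0.1 * rng.standard_normal(300)
y[:15] += 10.0                               # localized contamination

slope0, b0 = ols(x, y)                       # pilot fit, biased by outliers
resid = y - (slope0 * x + b0)
mad = 1.4826 * np.median(np.abs(resid - np.median(resid)))
inlier = np.abs(resid - np.median(resid)) < 3.0 * mad   # MAD-scaled cutoff
slope, intercept = ols(x[inlier], y[inlier])  # defensive OLS refit on inliers
```

Scaling the cutoff by the residual MAD rather than the raw-outcome spread is what keeps the threshold tight even when outliers inflate the outcome scale.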
[LG-84] Adaptive Norm-Based Regularization for Neural Networks
Link: https://arxiv.org/abs/2605.00171
Authors: Muhammad Qasim, Farrukh Javed
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*Comments: 37 pages, 9 figures
Abstract:In this paper, we study norm-based regularization methods for neural networks. We compare existing penalization approaches and introduce two regularization strategies that extend classical ridge- and lasso-type penalties to neural network models. The first strategy modifies weight decay by incorporating the covariance structure of the input features into a ridge-type \ell_2 penalty, allowing regularization to account for feature dependence. The second combines an \ell_1 sparsity penalty with covariance-aware \ell_2 regularization, producing neural network weights that are both sparse and structurally informed. Monte Carlo simulations are used to evaluate these methods under different data-generating settings, followed by two real-data applications on building cooling-load prediction and leukemia cell-type classification from high-dimensional gene expression data. Across simulated and real-data examples, the proposed regularizers improve predictive performance on unseen data and provide more effective complexity control than standard norm-based penalties, particularly when features are correlated or high-dimensional.
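The covariance-aware ridge penalty described in this abstract has a simple closed form in the linear special case, sketched below. This is an illustration only: the paper applies the penalty to neural-network weights, and `lam` and the synthetic data here are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 5
X = rng.standard_normal((n, p))
X[:, 1] = X[:, 0] + 0.1 * rng.standard_normal(n)   # strongly correlated pair
y = X @ np.array([1.0, 0.0, 2.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(n)

S = np.cov(X, rowvar=False)          # feature covariance enters the penalty
lam = 1.0                            # hypothetical regularization strength
# minimizing ||y - Xw||^2 + lam * w'Sw gives w = (X'X + lam*S)^{-1} X'y
w = np.linalg.solve(X.T @ X + lam * S, X.T @ y)
rmse = np.sqrt(np.mean((y - X @ w) ** 2))
```

Replacing the identity in the usual ridge normal equations with the covariance S shrinks harder along directions where features co-vary, which is the structural idea the abstract describes.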
[LG-85] Efficient Mutation Testing of Quantum Machine Learning Models
Link: https://arxiv.org/abs/2605.00107
Authors: Emma Andrews, Prabhat Mishra
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*Comments:
Abstract:Quantum machine learning integrates the strengths of quantum computing and machine learning, enabling models to learn complex features using fewer parameters than their classical counterparts. As quantum machine learning models grow in complexity, it is necessary to verify that their implementations satisfy the design specification and are free of bugs and faults. Mutation testing is a promising avenue for identifying faulty quantum circuits that do not meet design specifications or contain defects, by intentionally inserting faults into the quantum circuit. Defining mutation operations that inject faults into quantum circuits is necessary to ensure that a test suite is robust enough to evaluate an implementation against its design specification. In this paper, we extend mutation testing to quantum machine learning applications, primarily quantum neural network models. Specifically, this paper makes two important contributions. We define new mutation operations for efficient fault insertion compared to state-of-the-art approaches. We also present a directed mutation generation technique to reduce redundant mutant circuits. Extensive experimental evaluation demonstrates that our approach generates a more diverse and representative set of mutants, effectively addressing faults that traditional techniques fail to expose.
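A gate-substitution mutation operator of the kind this abstract discusses can be sketched over a toy circuit representation (a list of (gate, qubits) tuples). This is a generic illustration of fault injection, not the operators proposed in the paper.

```python
import random

# single-qubit gate alphabet used for substitution mutations
SINGLE_QUBIT = ["X", "Y", "Z", "H", "S", "T"]

def mutate_gate(circuit, index, rng):
    # replace the gate at `index` with a different gate of the same arity,
    # leaving the original circuit untouched
    gate, qubits = circuit[index]
    mutant = list(circuit)
    mutant[index] = (rng.choice([g for g in SINGLE_QUBIT if g != gate]), qubits)
    return mutant

rng = random.Random(0)
circuit = [("H", (0,)), ("X", (1,)), ("Z", (0,))]
mutant = mutate_gate(circuit, 1, rng)
```

A test suite that fails to distinguish the mutant's behavior from the original circuit's signals a gap in coverage, which is the core idea of mutation testing.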
[LG-86] Provable and scalable quantum Gaussian processes for quantum learning
Link: https://arxiv.org/abs/2605.00099
Authors: Jonas Jäger, Paolo Braccia, Pablo Bermejo, Manuel G. Algaba, Diego García-Martín, M. Cerezo
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: 18 + 70 pages, 5 + 14 figures, 2 tables
Abstract:Despite rapid recent advances in quantum machine learning, the field is in many ways stuck. Existing approaches can exhibit serious limitations, and we still lack learning frameworks that are simple, interpretable, scalable, and naturally suited to quantum data. To address this, here we introduce quantum Gaussian processes, a Bayesian framework for learning from quantum systems through priors over unknown quantum transformations. We show that, under suitable conditions, unitary quantum stochastic processes define Gaussian processes, thereby enabling regression, classification, and Bayesian optimization directly on quantum data. The key ingredient in this framework is sufficient knowledge of a quantum process’s structure and symmetries to define an informative prior through its corresponding quantum kernel, effectively injecting a strong, physics-informed inductive bias into the learning model. We then prove that matchgate, or free-fermionic, evolutions give rise to provable and scalable quantum Gaussian processes, providing the first family in our framework where the unknown unitary acts non-trivially on all qubits. Finally, we demonstrate accurate long-range extrapolation, phase-diagram learning in many-body systems, and sample-efficient Bayesian optimization in a quantum sensing task. Our results identify quantum Gaussian processes as a promising route toward simpler and more structured forms of quantum learning.
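For orientation, ordinary Gaussian-process regression with a kernel-defined prior looks as follows. This is a classical stand-in with an assumed RBF kernel and toy data; the paper's quantum kernels derived from matchgate evolutions are not reproduced here.

```python
import numpy as np

def rbf(a, b, ell=0.5):
    # squared-exponential kernel between 1-D input arrays
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

xt = np.linspace(0.0, 1.0, 20)            # training inputs
yt = np.sin(2 * np.pi * xt)               # noiseless observations
xs = np.array([0.25])                     # test input

K = rbf(xt, xt) + 1e-6 * np.eye(len(xt))  # jitter for numerical stability
mean = rbf(xs, xt) @ np.linalg.solve(K, yt)  # GP posterior mean at xs
```

The paper's framework swaps this hand-picked kernel for one induced by the structure and symmetries of an unknown quantum process, which is where the physics-informed inductive bias enters.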
[LG-87] RETO: A Rotary-Enhanced Transformer Operator for High-Fidelity Prediction of Automotive Aerodynamics
Link: https://arxiv.org/abs/2605.00062
Authors: Bojun Zhang, Huiyu Yang, Yunpeng Wang, Yuntian Chen, Yuanwei Bin, Rikui Zhang, Jianchun Wang
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*Comments:
Abstract:Rapid aerodynamic evaluation is crucial for modern vehicle design, yet existing neural operators struggle to capture intricate spatial correlations. We propose the rotary-enhanced transformer operator (RETO), a novel neural solver featuring a dual-stage spatial awareness mechanism: sinusoidal-cosine encodings for global referencing and rotary positional encodings (RoPE) for relative displacements. RoPE encodes spatial relations via unitary rotations, enforcing translation invariance and enhancing local gradient resolution. RETO is validated on ShapeNet and the high-fidelity DrivAerML benchmark. On ShapeNet, RETO achieves a relative L_2 error of 0.063, outperforming RegDGCNN at 0.125 and representing a 16% improvement over the Transolver baseline, which yields an error of 0.075. These performance gains are further amplified on the DrivAerML dataset, where RETO achieves relative L_2 errors of 0.089 for surface pressure and 0.097 for velocity. In comparison, Transolver results in errors of 0.116 and 0.121 for the same metrics, indicating that RETO achieves precision enhancements of 23% and 19%, respectively. For comprehensive comparison, the surface pressure and velocity errors for AB-UBT are 0.102 and 0.124, while RegDGCNN yields 0.235 and 0.312, respectively. Information-theoretical analysis shows that the entropy peak of RETO at 0.35 is significantly lower than that of Transolver at 0.75 under 10^4 resolution, indicating a focused attentional mechanism capable of preserving localized gradients against global diffusion.
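The relative-displacement property of rotary encodings that this abstract relies on can be checked with the standard RoPE formulation: rotating feature pairs by a position-dependent angle makes dot products depend only on the relative offset. This is a generic sketch; RETO's exact parameterization is not given in the abstract.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # rotate feature pairs (x[i], x[i+half]) by position-dependent angles
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos * freqs
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * np.cos(ang) - x2 * np.sin(ang),
                           x1 * np.sin(ang) + x2 * np.cos(ang)], axis=-1)

q = np.arange(8, dtype=float)
k = q + 1.0
# the query-key dot product depends only on the relative offset m - n
d1 = rope(q, 5.0) @ rope(k, 3.0)    # offset 2
d2 = rope(q, 12.0) @ rope(k, 10.0)  # offset 2, different absolute positions
```

Because each pairwise rotation is unitary, translating both positions by the same amount leaves the dot product unchanged, which is the translation invariance the abstract attributes to RoPE.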