This post contains the latest paper list retrieved from Arxiv.org on 2026-02-20, updated automatically and grouped into six broad areas: NLP, CV, ML, AI, IR, and MA.
Note: Paper data is fetched from Arxiv.org daily, with an automatic update around 12:30 each morning.
Tip: If a given day is not updated on time, either Arxiv published no new papers that day or the update script failed; fixes are applied the same day whenever possible.
Table of Contents
Overview (2026-02-20)
A total of 521 papers were updated today, including:
- Natural Language Processing: 71 papers (Computation and Language, cs.CL)
- Artificial Intelligence: 169 papers (Artificial Intelligence, cs.AI)
- Computer Vision: 61 papers (Computer Vision and Pattern Recognition, cs.CV)
- Machine Learning: 163 papers (Machine Learning, cs.LG)
- Multi-Agent Systems: 11 papers (Multiagent Systems, cs.MA)
- Information Retrieval: 20 papers (Information Retrieval, cs.IR)
- Human-Computer Interaction: 27 papers (Human-Computer Interaction, cs.HC)
Multi-Agent Systems
[MA-0] BMC4TimeSec: Verification Of Timed Security Protocols AAMAS2026
Quick Read: This paper addresses the formal verification of Timed Security Protocols (TSP), in particular how to analyze protocol security accurately in complex settings with timing constraints and multi-agent interactions. The key to the solution is BMC4TimeSec, an end-to-end tool combining SMT-based Bounded Model Checking (BMC) with Timed Interpreted Systems (TIS) and Timed Interleaved Interpreted Systems (TIIS). The tool models TSP executions as a TIS/TIIS environment (covering action synchronization, interleaving, delays, and lifetimes) and uses knowledge automata to capture how participants' knowledge (including the intruder's) evolves, enabling precise modeling and efficient verification of timed security protocols.
Link: https://arxiv.org/abs/2602.17590
Authors: Agnieszka M. Zbrzezny
Affiliations: SWPS University
Subjects: Cryptography and Security (cs.CR); Multiagent Systems (cs.MA)
Comments: To appear in the Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026), May 25 - 29, 2026, Paphos, Cyprus
Abstract:We present BMC4TimeSec, an end-to-end tool for verifying Timed Security Protocols (TSP) based on SMT-based bounded model checking and multi-agent modelling in the form of Timed Interpreted Systems (TIS) and Timed Interleaved Interpreted Systems (TIIS). In BMC4TimeSec, TSP executions implement the TIS/TIIS environment (join actions, interleaving, delays, lifetimes), and knowledge automata implement the agents (evolution of participant knowledge, including the intruder). The code is publicly available on GitHub (this https URL), as is a video demonstration (this https URL).
[MA-1] Linear Convergence in Games with Delayed Feedback via Extra Prediction
Quick Read: This paper addresses the severe performance degradation caused by feedback delays in multi-agent learning, where the effect of delays on convergence rates has remained unclear even for unconstrained bilinear games. The key to the solution is the Weighted Optimistic Gradient Descent-Ascent (WOGDA) algorithm, which speeds up convergence through "extra optimism", i.e., predicting rewards farther into the future. The theoretical analysis shows that standard optimism (predicting the next-step reward) achieves a linear convergence rate of exp(−Θ(t/m^5)), while extra optimism tolerates a larger step size and significantly accelerates the rate to exp(−Θ(t/(m^2 log m))); experiments confirm the effectiveness of this strategy and its consistency with the theory.
Link: https://arxiv.org/abs/2602.17486
Authors: Yuma Fujimoto, Kenshi Abe, Kaito Ariu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Optimization and Control (math.OC)
Comments: 9 pages, 3 figures (main); 5 pages, 1 figure (appendix)
Abstract:Feedback delays are inevitable in real-world multi-agent learning. They are known to severely degrade performance, and the convergence rate under delayed feedback is still unclear, even for bilinear games. This paper derives the rate of linear convergence of Weighted Optimistic Gradient Descent-Ascent (WOGDA), which predicts future rewards with extra optimism, in unconstrained bilinear games. To analyze the algorithm, we interpret it as an approximation of the Extra Proximal Point (EPP), which is updated based on farther future rewards than the classical Proximal Point (PP). Our theorems show that standard optimism (predicting the next-step reward) achieves linear convergence to the equilibrium at a rate exp(−Θ(t/m^5)) after t iterations for delay m. Moreover, employing extra optimism (predicting farther future reward) tolerates a larger step size and significantly accelerates the rate to exp(−Θ(t/(m^2 log m))). Our experiments also show accelerated convergence driven by the extra optimism and are qualitatively consistent with our theorems. In summary, this paper validates that extra optimism is a promising countermeasure against performance degradation caused by feedback delays.
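To make the setup concrete, here is a minimal sketch of plain optimistic gradient descent-ascent on an unconstrained bilinear game f(x, y) = xᵀAy, where the "optimistic" step reuses the previous gradient as a prediction of the next one. This is the standard baseline the paper builds on, not WOGDA itself (whose delay-aware weighting is not specified here); the matrix and step size are illustrative assumptions.

```python
import numpy as np

# Bilinear game f(x, y) = x^T A y: x minimizes, y maximizes.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
A /= np.linalg.norm(A, 2)  # normalize so the step size below is safe
x, y = rng.standard_normal(4), rng.standard_normal(4)
gx_prev, gy_prev = A @ y, A.T @ x  # previous-step gradients
eta = 0.1  # step size (assumed)

for t in range(3000):
    gx, gy = A @ y, A.T @ x
    # Optimistic update: extrapolate with the last gradient as a
    # prediction of the next one (standard optimism; WOGDA adds weights
    # and farther-future predictions to cope with feedback delay).
    x = x - eta * (2 * gx - gx_prev)
    y = y + eta * (2 * gy - gy_prev)
    gx_prev, gy_prev = gx, gy

# Distance to the (0, 0) equilibrium shrinks linearly (geometrically).
print("distance to equilibrium:", np.linalg.norm(x) + np.linalg.norm(y))
```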
[MA-2] Multi-Agent Temporal Logic Planning via Penalty Functions and Block-Coordinate Optimization
Quick Read: This paper addresses the computational complexity caused by the inherent high dimensionality of multi-agent planning under Signal Temporal Logic (STL) constraints, aiming at scalable synthesis with satisfaction guarantees. The key to the solution is to formulate STL planning as an optimization program under arbitrary multi-agent constraints and to introduce a penalty-based unconstrained relaxation; using a quadratic penalty function defined via smooth STL semantics, a Block-Coordinate Gradient Descent (BCGD) algorithm updates each agent's decision variables independently, substantially reducing complexity. The BCGD solver is embedded in a two-layer optimization scheme: the inner layer runs BCGD iterations for a fixed penalty parameter, while the outer layer progressively increases the penalty parameter to improve STL robustness, yielding efficient and feasible multi-agent planning.
Link: https://arxiv.org/abs/2602.17434
Authors: Eleftherios E. Vlahakis, Arash Bahari Kordabad, Lars Lindemann, Pantelis Sopasakis, Sadegh Soudjani, Dimos V. Dimarogonas
Affiliations: Unknown
Subjects: Systems and Control (eess.SY); Multiagent Systems (cs.MA)
Comments: Submitted to L-CSS
Abstract:Multi-agent planning under Signal Temporal Logic (STL) is often hindered by collaborative tasks that lead to computational challenges due to the inherent high-dimensionality of the problem, preventing scalable synthesis with satisfaction guarantees. To address this, we formulate STL planning as an optimization program under arbitrary multi-agent constraints and introduce a penalty-based unconstrained relaxation that can be efficiently solved via a Block-Coordinate Gradient Descent (BCGD) method, where each block corresponds to a single agent’s decision variables, thereby mitigating complexity. By utilizing a quadratic penalty function defined via smooth STL semantics, we show that BCGD iterations converge to a stationary point of the penalized problem under standard regularity assumptions. To enforce feasibility, the BCGD solver is embedded within a two-layer optimization scheme: inner BCGD updates are performed for a fixed penalty parameter, which is then increased in an outer loop to progressively improve multi-agent STL robustness. The proposed framework enables scalable computations and is validated through various complex multi-robot planning scenarios.
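As a minimal illustration of the two-layer scheme described above (not the paper's implementation; the objective, coupling constraint, penalty schedule, and step size below are toy assumptions), the sketch runs block-coordinate gradient descent over per-agent blocks inside an outer loop that increases a quadratic penalty parameter:

```python
import numpy as np

def penalized_loss(X, rho):
    # Toy stand-in: per-agent quadratic cost plus a quadratic penalty
    # on violating a coupling constraint (agents' sum stays near 1).
    cost = 0.5 * np.sum(X**2)
    violation = max(0.0, abs(X.sum() - 1.0) - 0.1)
    return cost + rho * violation**2

def block_grad(X, i, rho, eps=1e-6):
    # Numerical gradient of the penalized loss w.r.t. agent i's block.
    g = np.zeros_like(X[i])
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[i, j] += eps
        g[j] = (penalized_loss(Xp, rho) - penalized_loss(X, rho)) / eps
    return g

X = np.random.default_rng(1).standard_normal((3, 2))  # 3 agents, 2 vars each
rho = 1.0
for outer in range(6):               # outer loop: grow the penalty
    for inner in range(200):         # inner loop: BCGD at fixed rho
        for i in range(X.shape[0]):  # one block = one agent's variables
            X[i] -= 0.05 * block_grad(X, i, rho)
    rho *= 10.0
print(X)
```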
[MA-3] Algorithmic Collusion at Test Time: A Meta-game Design and Evaluation AAMAS2026
Quick Read: This paper addresses the risk assessment of algorithmic collusion, in particular whether agents can rationally drift into collusive behavior under realistic test-time constraints and how collusive and competitive strategies co-evolve. Existing studies typically rely on idealized conditions such as long learning horizons, rationality assumptions about counterparties, and parameter symmetry, which poorly reflect real-world complexity. The key to the solution is a meta-game design framework: agents are modeled as possessing pretrained policies with distinct strategic characteristics (e.g., competitive, naively cooperative, robustly collusive), and the problem is framed as selecting a meta-strategy that combines an initial policy with an in-game adaptation rule. The authors further sample normal-form empirical games, compute statistics such as payoffs and regret, and construct empirical best-response graphs to uncover strategic relationships. Experiments with reinforcement-learning and LLM-based strategies in repeated pricing games under symmetric and asymmetric cost settings validate the approach, providing a systematic assessment of the feasibility of algorithmic collusion in practical environments.
Link: https://arxiv.org/abs/2602.17203
Authors: Yuhong Luo, Daniel Schoepflin, Xintong Wang
Affiliations: Rutgers University
Subjects: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT)
Comments: AAMAS 2026. 31 pages
Abstract:The threat of algorithmic collusion, and whether it merits regulatory intervention, remains debated, as existing evaluations of its emergence often rely on long learning horizons, assumptions about counterparty rationality in adopting collusive strategies, and symmetry in hyperparameters and economic settings among players. To study collusion risk, we introduce a meta-game design for analyzing algorithmic behavior under test-time constraints. We model agents as possessing pretrained policies with distinct strategic characteristics (e.g., competitive, naively cooperative, robustly collusive), and formulate the problem as selecting a meta-strategy that combines a pretrained, initial policy with an in-game adaptation rule. We seek to examine whether collusion can emerge under rational choices and how agents co-adapt toward cooperation or competition. To this end, we sample normal-form empirical games over meta-strategy profiles, compute relevant game statistics (e.g., payoffs against individuals and regret against an equilibrium mixture of opponents), and construct empirical best-response graphs to uncover strategic relationships. We evaluate both reinforcement-learning and LLM-based strategies in repeated pricing games under symmetric and asymmetric cost settings, and present findings on the feasibility of algorithmic collusion and the effectiveness of pricing strategies in practical "test-time" environments. The source code and the full paper with appendix are available at: this https URL.
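A minimal sketch of the empirical best-response graph idea for a symmetric two-player meta-game (the payoff numbers are invented for illustration, not the paper's measurements): each node is a meta-strategy, and an edge points from the opponent's strategy to the row player's empirical best response.

```python
import numpy as np

strategies = ["competitive", "naive-coop", "robust-collusive"]
# payoff[i, j]: row player's payoff when playing i against j (illustrative)
payoff = np.array([[2.0, 4.0, 3.0],
                   [1.0, 5.0, 2.0],
                   [2.5, 4.5, 4.0]])

# Empirical best-response graph: for each opponent strategy j,
# add an edge j -> argmax_i payoff[i, j].
edges = [(strategies[j], strategies[int(np.argmax(payoff[:, j]))])
         for j in range(len(strategies))]
print(edges)  # sinks of this graph hint at pure-strategy equilibria
```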
[MA-4] Agent Conductor: Topology Evolution for Multi-Agent Competition-Level Code Generation
Quick Read: This paper addresses redundant communication and performance bottlenecks in current LLM-driven multi-agent systems (MAS) for complex tasks: existing methods neither adapt the density of the inter-agent interaction topology to task difficulty nor iteratively refine the topology for a single task instance using execution feedback. The core of the solution is AgentConductor, a reinforcement learning-optimized multi-agent system with an LLM-based orchestrator agent at its center that generates interaction topologies end-to-end from feedback. The key innovations are: (1) a novel topological density function that mathematically captures communication-aware characteristics of multi-agent collaboration; and (2) difficulty interval partitioning, which avoids inaccurate estimates of the topology-density upper bound caused by excessive pruning and enables fine-grained density control per difficulty level.
Link: https://arxiv.org/abs/2602.17100
Authors: Siyu Wang, Ruotian Lu, Zhihao Yang, Yuchao Wang, Yanzhou Zhang, Lei Xu, Qimin Xu, Guojun Yin, Cailian Chen, Xinping Guan
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA)
Comments:
Abstract:Large language model (LLM)-driven multi-agent systems (MAS) coordinate specialized agents through predefined interaction topologies and have shown promise for complex tasks such as competition-level code generation. Recent studies demonstrate that carefully designed multi-agent workflows and communication graphs can significantly improve code generation performance by leveraging collaborative reasoning. However, existing methods neither adapt topology density to task difficulty nor iteratively refine the topology within an instance using execution feedback, which leads to redundant communication and performance bottlenecks. To address these issues, we propose AgentConductor: a reinforcement learning-optimized MAS with an LLM-based orchestrator agent as its core, which enables end-to-end feedback-driven dynamic generation of interaction topologies. For each query, AgentConductor infers agent roles and task difficulty, then constructs a task-adapted, density-aware layered directed acyclic graph (DAG) topology, underpinned by two key innovations. First, we design a novel topological density function that captures communication-aware mathematical characterizations of multi-agent interactions. Second, we adopt difficulty interval partitioning to avoid excessive pruning for precise topological density upper bound measurement per difficulty level and finer-grained control. Empirically, across three competition-level and two foundational code datasets, AgentConductor achieves state-of-the-art accuracy, outperforming the strongest baseline by up to 14.6% in pass@1 accuracy, 13% in density reduction, and 68% in token cost reduction.
[MA-5] Safe Continuous-time Multi-Agent Reinforcement Learning via Epigraph Form ICLR2026
Quick Read: This paper addresses the difficulty of handling safety constraints (e.g., collision penalties) in continuous-time multi-agent reinforcement learning (MARL): existing continuous-time MARL methods built on Hamilton-Jacobi-Bellman (HJB) equations struggle with the discontinuities such constraints introduce. The key to the solution is a new continuous-time constrained Markov decision process (CT-CMDP) formulation, obtained by transforming discrete-time MDPs via an epigraph-based reformulation, together with a physics-informed neural network (PINN)-based actor-critic method that enables stable and efficient optimization in continuous time, markedly improving training stability and performance in safe multi-agent environments.
Link: https://arxiv.org/abs/2602.17078
Authors: Xuefeng Wang, Lei Zhang, Henglin Pu, Husheng Li, Ahmed H. Qureshi
Affiliations: Purdue University
Subjects: Multiagent Systems (cs.MA)
Comments: Accepted by ICLR 2026. 27 pages, 15 figures
Abstract:Multi-agent reinforcement learning (MARL) has made significant progress in recent years, but most algorithms still rely on a discrete-time Markov Decision Process (MDP) with fixed decision intervals. This formulation is often ill-suited for complex multi-agent dynamics, particularly in high-frequency or irregular time-interval settings, leading to degraded performance and motivating the development of continuous-time MARL (CT-MARL). Existing CT-MARL methods are mainly built on Hamilton-Jacobi-Bellman (HJB) equations. However, they rarely account for safety constraints such as collision penalties, since these introduce discontinuities that make HJB-based learning difficult. To address this challenge, we propose a continuous-time constrained MDP (CT-CMDP) formulation and a novel MARL framework that transforms discrete MDPs into CT-CMDPs via an epigraph-based reformulation. We then solve this by proposing a novel physics-informed neural network (PINN)-based actor-critic method that enables stable and efficient optimization in continuous time. We evaluate our approach on continuous-time safe multi-particle environments (MPE) and safe multi-agent MuJoCo benchmarks. Results demonstrate smoother value approximations, more stable training, and improved performance over safe MARL baselines, validating the effectiveness and robustness of our method.
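For readers unfamiliar with the epigraph trick referenced above, the following is the generic textbook identity (not the paper's exact CT-CMDP statement): a constrained minimization is rewritten by lifting the objective into an auxiliary scalar variable z, so that the optimizer only ever handles the objective through the smooth constraint f(x) ≤ z.

```latex
% Generic epigraph reformulation of a constrained problem:
%   minimize f(x)  subject to  g(x) <= 0
% is equivalent to
\min_{x,\, z}\; z
\quad \text{s.t.} \quad f(x) \le z, \qquad g(x) \le 0 .
```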
[MA-6] Discovering Multiagent Learning Algorithms with Large Language Models
Quick Read: This paper addresses the heavy reliance on manual, intuition-driven design of multi-agent reinforcement learning (MARL) algorithms for imperfect-information games: while families such as Counterfactual Regret Minimization (CFR) and Policy Space Response Oracles (PSRO) rest on solid theory, designing their most effective variants requires navigating a vast algorithmic design space. The key to the solution is AlphaEvolve, an evolutionary coding agent powered by large language models, used to automatically discover new multi-agent learning algorithms. Within iterative regret minimization it discovers VAD-CFR, which outperforms state-of-the-art baselines through non-intuitive mechanisms including volatility-sensitive discounting and consistency-enforced optimism; within population-based training it discovers SHOR-PSRO, whose meta-solver dynamically blends mixed strategies with diversity bonuses to automate the transition from population diversity to equilibrium convergence, yielding stronger generalization and empirical performance.
Link: https://arxiv.org/abs/2602.16928
Authors: Zun Li, John Schultz, Daniel Hennes, Marc Lanctot
Affiliations: Google DeepMind
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:Much of the advancement of Multi-Agent Reinforcement Learning (MARL) in imperfect-information games has historically depended on manual iterative refinement of baselines. While foundational families like Counterfactual Regret Minimization (CFR) and Policy Space Response Oracles (PSRO) rest on solid theoretical ground, the design of their most effective variants often relies on human intuition to navigate a vast algorithmic design space. In this work, we propose the use of AlphaEvolve, an evolutionary coding agent powered by large language models, to automatically discover new multiagent learning algorithms. We demonstrate the generality of this framework by evolving novel variants for two distinct paradigms of game-theoretic learning. First, in the domain of iterative regret minimization, we evolve the logic governing regret accumulation and policy derivation, discovering a new algorithm, Volatility-Adaptive Discounted (VAD-)CFR. VAD-CFR employs novel, non-intuitive mechanisms-including volatility-sensitive discounting, consistency-enforced optimism, and a hard warm-start policy accumulation schedule-to outperform state-of-the-art baselines like Discounted Predictive CFR+. Second, in the regime of population based training algorithms, we evolve training-time and evaluation-time meta strategy solvers for PSRO, discovering a new variant, Smoothed Hybrid Optimistic Regret (SHOR-)PSRO. SHOR-PSRO introduces a hybrid meta-solver that linearly blends Optimistic Regret Matching with a smoothed, temperature-controlled distribution over best pure strategies. By dynamically annealing this blending factor and diversity bonuses during training, the algorithm automates the transition from population diversity to rigorous equilibrium finding, yielding superior empirical convergence compared to standard static meta-solvers.
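As background for the CFR-style mechanisms being evolved here, a minimal sketch of plain regret matching in self-play on rock-paper-scissors, the starting point that variants like VAD-CFR modify with discounting and optimism (the random regret initialization just avoids starting exactly at the equilibrium):

```python
import numpy as np

# Rock-paper-scissors payoff for player 1 (player 2 gets the negative).
A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)

def policy_from_regrets(r):
    pos = np.maximum(r, 0.0)
    return pos / pos.sum() if pos.sum() > 0 else np.full(len(r), 1 / len(r))

rng = np.random.default_rng(0)
r1, r2 = rng.random(3), rng.random(3)
avg1 = np.zeros(3)
for t in range(20000):
    p1, p2 = policy_from_regrets(r1), policy_from_regrets(r2)
    # Counterfactual action values against the opponent's current policy.
    v1, v2 = A @ p2, -(A.T @ p1)
    r1 += v1 - p1 @ v1  # regret accumulation: the step that CFR
    r2 += v2 - p2 @ v2  # variants reweight (discounting, optimism, ...)
    avg1 += p1
print(avg1 / avg1.sum())  # average policy approaches uniform (1/3 each)
```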
[MA-7] AdaptOrch: Task-Adaptive Multi-Agent Orchestration in the Era of LLM Performance Convergence
Quick Read: This paper addresses the observation that, as large language models from different providers converge toward comparable benchmark performance, selecting a single best model per task yields diminishing returns, so further system-level gains must come from elsewhere. The key to the solution is the AdaptOrch framework, which treats orchestration topology as a first-class optimization target and dynamically selects among four canonical topologies (parallel, sequential, hierarchical, and hybrid) based on task dependency graphs and domain characteristics. Its three key contributions are a Performance Convergence Scaling Law, a Topology Routing Algorithm, and an Adaptive Synthesis Protocol; experiments on code generation, reasoning, and retrieval-augmented generation tasks show a 12-23% improvement over static single-topology baselines, without relying on model scaling.
Link: https://arxiv.org/abs/2602.16873
Authors: Geunbin Yu
Affiliations: Korea National Open University
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 21 pages, 10 figures, 6 tables
Abstract:As large language models from diverse providers converge toward comparable benchmark performance, the traditional paradigm of selecting a single best model per task yields diminishing returns. We argue that orchestration topology – the structural composition of how multiple agents are coordinated, parallelized, and synthesized – now dominates system-level performance over individual model capability. We present AdaptOrch, a formal framework for task-adaptive multi-agent orchestration that dynamically selects among four canonical topologies (parallel, sequential, hierarchical, and hybrid) based on task dependency graphs and empirically derived domain characteristics. Our framework introduces three key contributions: (1) a Performance Convergence Scaling Law, formalizing conditions under which orchestration selection outweighs model selection; (2) a Topology Routing Algorithm that maps task decomposition DAGs to optimal orchestration patterns in O(|V| + |E|) time; and (3) an Adaptive Synthesis Protocol with provable termination guarantees and heuristic consistency scoring for parallel agent outputs. We validate AdaptOrch across coding (SWE-bench), reasoning (GPQA), and retrieval-augmented generation tasks, demonstrating that topology-aware orchestration achieves 12-23% improvement over static single-topology baselines, even when using identical underlying models. Our results establish orchestration design as a first-class optimization target independent of model scaling.
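A minimal sketch of one plausible O(|V| + |E|) routing rule consistent with the abstract's description (the actual AdaptOrch criteria are not specified here; the heuristic below, choosing "parallel" for edge-free graphs, "sequential" for chains, and "hierarchical"/"hybrid" otherwise based on layer depth, is an invented illustration):

```python
from collections import Counter, deque

def route_topology(n, edges):
    """Map a task-decomposition DAG with n nodes to a topology label
    in O(|V| + |E|) via Kahn's algorithm (longest-path layering)."""
    adj = [[] for _ in range(n)]
    indeg = [0] * n
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    layer = [0] * n
    q = deque(i for i in range(n) if indeg[i] == 0)
    while q:
        u = q.popleft()
        for v in adj[u]:
            layer[v] = max(layer[v], layer[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                q.append(v)
    depth = max(layer) + 1
    width = max(Counter(layer).values())
    if not edges:
        return "parallel"
    if width == 1:
        return "sequential"
    return "hierarchical" if depth > 2 else "hybrid"

print(route_topology(4, [(0, 1), (0, 2), (1, 3), (2, 3)]))  # -> hierarchical
```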
[MA-8] Self-Evolving Multi-Agent Network for Industrial IoT Predictive Maintenance
Quick Read: This paper addresses real-time anomaly detection for Industrial IoT predictive maintenance, i.e., how to adapt dynamically to evolving operational conditions without sacrificing interpretability or demanding excessive computational resources. Statically trained offline models cannot cope with changing conditions, while monolithic LLM-based systems demand prohibitive memory and latency for edge deployment. The key to the solution is SEMAS, a self-evolving hierarchical multi-agent system that distributes specialized agents across Edge, Fog, and Cloud tiers for resource-aware specialization: edge agents perform lightweight feature extraction and pre-filtering; fog agents run diversified ensemble detection with dynamic consensus voting; and cloud agents continuously optimize policies via Proximal Policy Optimization (PPO) while maintaining asynchronous, non-blocking inference. The framework also incorporates LLM-based response generation for explainability and federated knowledge aggregation for adaptive policy distribution, substantially improving stability and accuracy under dynamic conditions while preserving real-time performance and interpretability.
Link: https://arxiv.org/abs/2602.16738
Authors: Rebin Saleh, Khanh Pham Dinh, Balázs Villányi, Truong-Son Hy
Affiliations: Budapest University of Technology and Economics; DataScienceWorld; The University of Alabama at Birmingham
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Comments:
Abstract:Industrial IoT predictive maintenance requires systems capable of real-time anomaly detection without sacrificing interpretability or demanding excessive computational resources. Traditional approaches rely on static, offline-trained models that cannot adapt to evolving operational conditions, while LLM-based monolithic systems demand prohibitive memory and latency, rendering them impractical for on-site edge deployment. We introduce SEMAS, a self-evolving hierarchical multi-agent system that distributes specialized agents across Edge, Fog, and Cloud computational tiers. Edge agents perform lightweight feature extraction and pre-filtering; Fog agents execute diversified ensemble detection with dynamic consensus voting; and Cloud agents continuously optimize system policies via Proximal Policy Optimization (PPO) while maintaining asynchronous, non-blocking inference. The framework incorporates LLM-based response generation for explainability and federated knowledge aggregation for adaptive policy distribution. This architecture enables resource-aware specialization without sacrificing real-time performance or model interpretability. Empirical evaluation on two industrial benchmarks (Boiler Emulator and Wind Turbine) demonstrates that SEMAS achieves superior anomaly detection performance with exceptional stability under adaptation, sustains prediction accuracy across evolving operational contexts, and delivers substantial latency improvements enabling genuine real-time deployment. Ablation studies confirm that PPO-driven policy evolution, consensus voting, and federated aggregation each contribute materially to system effectiveness. These findings indicate that resource-aware, self-evolving multi-agent coordination is essential for production-ready industrial IoT predictive maintenance under strict latency and explainability constraints.
[MA-9] Guiding LLM-Based Human Mobility Simulation with Mobility Measures from Shared Data
Quick Read: This paper addresses the lack of a population-level coordination mechanism in existing LLM-based human mobility simulation: such methods generate individual trajectories independently and thus fail to capture emergent collective behavior. The key to the solution is the M2LSimu framework, which uses mobility measures derived from shared data as guidance for multi-prompt adjustment of individual-level prompts; through a coarse-to-fine progressive adjustment strategy, it satisfies multiple population-level mobility objectives under a limited budget while producing more realistic population-scale trajectories.
Link: https://arxiv.org/abs/2602.16726
Authors: Hua Yan, Heng Tan, Yu Yang
Affiliations: Lehigh University
Subjects: Multiagent Systems (cs.MA)
Comments:
Abstract:Large-scale human mobility simulation is critical for many science domains such as urban science, epidemiology, and transportation analysis. Recent works treat large language models (LLMs) as human agents to simulate realistic mobility trajectories by modeling individual-level cognitive processes. However, these approaches generate individual mobility trajectories independently, without any population-level coordination mechanism, and thus fail to capture the emergence of collective behaviors. To address this issue, we design M2LSimu, a mobility measures-guided multi-prompt adjustment framework that leverages mobility measures derived from shared data as guidance to refine individual-level prompts for realistic mobility generation. Our framework applies coarse-grained adjustment strategies guided by mobility measures, progressively enabling fine-grained individual-level adaptation while satisfying multiple population-level mobility objectives under a limited budget. Experiments show that M2LSimu significantly outperforms state-of-the-art LLM-based methods on two public datasets.
[MA-10] Adaptive Decentralized Composite Optimization via Three-Operator Splitting
Quick Read: This paper addresses decentralized optimization over networks, where agents cooperatively minimize the sum of locally smooth (strongly) convex losses plus a nonsmooth convex extended-value term, with the core challenge of achieving efficient and robust convergence without global information. The key to the solution is a decentralized method derived from a three-operator splitting factorization, in which agents adaptively adjust their stepsizes via lightweight min-consensus protocols coupled with local backtracking procedures; a new BCV preconditioning metric (Bertsekas-O'Connor-Vandenberghe) enables efficient decentralized implementation and local stepsize adjustment. The design guarantees sublinear convergence under mere convexity, and linear convergence when the sum-function is strongly convex and the nonsmooth component is partly smooth.
Link: https://arxiv.org/abs/2602.17545
Authors: Xiaokai Chen, Ilya Kuruzov, Gesualdo Scutari
Affiliations: Purdue University
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 25 pages, 3 figures
Abstract:The paper studies decentralized optimization over networks, where agents minimize a sum of locally smooth (strongly) convex losses plus a nonsmooth convex extended-value term. We propose decentralized methods wherein agents adaptively adjust their stepsize via local backtracking procedures coupled with lightweight min-consensus protocols. Our design stems from a three-operator splitting factorization applied to an equivalent reformulation of the problem. The reformulation is endowed with a new BCV preconditioning metric (Bertsekas-O'Connor-Vandenberghe), which enables efficient decentralized implementation and local stepsize adjustments. We establish robust convergence guarantees. Under mere convexity, the proposed methods converge with a sublinear rate. Under strong convexity of the sum-function, and assuming the nonsmooth component is partly smooth, we further prove linear convergence. Numerical experiments corroborate the theory and highlight the effectiveness of the proposed adaptive stepsize strategy.
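For context on the splitting machinery, a minimal centralized sketch of Davis-Yin three-operator splitting, a standard scheme for minimizing f + g + h with f and g proximable and h smooth. This is the generic iteration, not the paper's preconditioned decentralized variant, and the 1-D problem below is an invented example.

```python
import numpy as np

# Minimize f(x) + g(x) + h(x) with
#   f(x) = |x| (proximable), g = indicator of [0, 2] (proximable),
#   h(x) = 0.5 * (x - 3)^2 (smooth). The minimizer is x = 2.
prox_f = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0)  # soft-threshold
prox_g = lambda v, t: np.clip(v, 0.0, 2.0)                       # projection
grad_h = lambda x: x - 3.0

gamma = 0.5  # stepsize (assumed; must be < 2 / Lipschitz(grad_h))
z = 0.0
for _ in range(200):
    x_g = prox_g(z, gamma)
    # Davis-Yin update: reflect through prox_g, take a gradient step
    # on h, apply prox_f, then correct the auxiliary variable z.
    x_f = prox_f(2 * x_g - z - gamma * grad_h(x_g), gamma)
    z = z + x_f - x_g
print(x_g)  # ~2.0
```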
Natural Language Processing
[NLP-0] Sink-Aware Pruning for Diffusion Language Models
Quick Read: This paper addresses the high inference cost that iterative denoising imposes on Diffusion Language Models (DLMs) and proposes a more effective pruning method. Existing pruning strategies are largely inherited from autoregressive (AR) LLMs and typically preserve attention-sink tokens, since AR sinks serve as stable global anchors. The paper shows this assumption does not hold for DLMs: sink positions exhibit substantially higher variance over the full generation trajectory (the dominant sink locations shift across timesteps), indicating that DLM sinks are often transient and less structurally essential than in AR models. The key to the solution is Sink-Aware Pruning, which automatically identifies and prunes unstable sink tokens in DLMs (prior work usually keeps all sinks); without retraining, it achieves a better quality-efficiency trade-off and outperforms strong pruning baselines under matched compute.
Link: https://arxiv.org/abs/2602.17664
Authors: Aidar Myrzakhan, Tianyi Li, Bowei Guo, Shengkun Tang, Zhiqiang Shen
Affiliations: VILA Lab, MBZUAI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code at: this https URL
Abstract:Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning. Existing pruning heuristics, largely inherited from autoregressive (AR) LLMs, typically preserve attention sink tokens because AR sinks serve as stable global anchors. We show that this assumption does not hold for DLMs: the attention-sink position exhibits substantially higher variance over the full generation trajectory (measured by how the dominant sink locations shift across timesteps), indicating that sinks are often transient and less structurally essential than in AR models. Based on this observation, we propose Sink-Aware Pruning, which automatically identifies and prunes unstable sinks in DLMs (prior studies usually keep sinks for AR LLMs). Without retraining, our method achieves a better quality-efficiency trade-off and outperforms strong prior pruning baselines under matched compute. Our code is available at this https URL.
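A minimal sketch of how one might quantify the sink-position variance described above, given attention maps at each denoising timestep (the tensors here are random stand-ins; the paper's exact measurement may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, L = 32, 12, 64  # denoising timesteps, heads, sequence length
# Stand-in attention maps: attn[t, h, i, j] = weight from query i to key j.
attn = rng.random((T, H, L, L))
attn /= attn.sum(-1, keepdims=True)

# Dominant sink per (timestep, head): the key receiving the most
# attention mass, averaged over queries.
mass = attn.mean(axis=2)         # (T, H, L): mean attention per key
sink_pos = mass.argmax(axis=-1)  # (T, H): dominant sink location

# Variance of the dominant sink location across timesteps, per head:
# high variance suggests a transient (prunable) sink.
print(sink_pos.var(axis=0))
```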
[NLP-1] What Language is This? Ask Your Tokenizer
Quick Read: This paper addresses the brittleness of language identification (LID) for low-resource and closely related languages in multilingual NLP: existing systems are near-perfect on high-resource languages but degrade sharply in low-resource conditions and cannot add new languages without retraining. The key to the solution is UniLID, a simple and efficient LID method built on the UnigramLM tokenization algorithm: it learns language-conditional unigram distributions over a shared tokenizer vocabulary while treating segmentation as a language-specific phenomenon. The framework is data- and compute-efficient, supports incremental addition of new languages without retraining, and integrates naturally into existing language model tokenization pipelines. Empirically, UniLID is competitive on standard benchmarks, is markedly more sample-efficient in low-resource settings (surpassing 70% accuracy with as few as five labeled samples per language), and delivers large gains on fine-grained dialect identification.
Link: https://arxiv.org/abs/2602.17655
Authors: Clara Meister, Ahmetcan Yavuz, Pietro Lesci, Tiago Pimentel
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Language Identification (LID) is an important component of many multilingual natural language processing pipelines, where it facilitates corpus curation, training data analysis, and cross-lingual evaluation of large language models. Despite near-perfect performance on high-resource languages, existing systems remain brittle in low-resource and closely related language settings. We introduce UniLID, a simple and efficient LID method based on the UnigramLM tokenization algorithm, leveraging its probabilistic framing, parameter estimation technique and inference strategy. In short, we learn language-conditional unigram distributions over a shared tokenizer vocabulary but treat segmentation as a language-specific phenomenon. Our formulation is data- and compute-efficient, supports incremental addition of new languages without retraining existing models, and can naturally be integrated into existing language model tokenization pipelines. Empirical evaluations against widely used baselines, including fastText, GlotLID, and CLD3, show that UniLID achieves competitive performance on standard benchmarks, substantially improves sample efficiency in low-resource settings - surpassing 70% accuracy with as few as five labeled samples per language - and delivers large gains on fine-grained dialect identification.
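A minimal sketch of the language-conditional unigram idea (not UniLID itself: the real system scores over a shared subword vocabulary with language-specific segmentation, whereas this toy version uses whitespace tokens and add-one smoothing):

```python
from collections import Counter
import math

def train(corpora):
    # corpora: {language: list of example strings}
    models = {}
    for lang, texts in corpora.items():
        counts = Counter(tok for t in texts for tok in t.lower().split())
        # Store counts for add-one-smoothed unigram log-probabilities.
        models[lang] = (counts, sum(counts.values()), len(counts) + 1)
    return models

def identify(text, models):
    def logp(lang):
        counts, total, vocab = models[lang]
        return sum(math.log((counts[tok] + 1) / (total + vocab))
                   for tok in text.lower().split())
    return max(models, key=logp)

models = train({"en": ["the cat sat on the mat"],
                "de": ["die katze sitzt auf der matte"]})
print(identify("the dog sat", models))  # -> en
```

Note that adding a language here only requires training counts for that language, mirroring the incremental-addition property claimed in the abstract.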
[NLP-2] Differences in Typological Alignment in Language Models' Treatment of Differential Argument Marking
Quick Read: This paper investigates whether language models (LMs) trained on synthetic corpora reproduce the typological tendencies of differential argument marking (DAM) found in human languages, distinguishing two dimensions in particular: natural markedness direction and object preference. The key to the solution is a controlled synthetic learning paradigm: GPT-2 models are trained on 18 corpora implementing distinct DAM systems and evaluated on minimal pairs to assess generalization, revealing that different typological tendencies may arise from distinct underlying sources.
Link: https://arxiv.org/abs/2602.17653
Authors: Iskar Deng, Nathalia Xu, Shane Steinert-Threlkeld
Affiliations: University of Washington
Subjects: Computation and Language (cs.CL)
Comments: 15 pages, 7 figures, 7 tables. Under review
Abstract:Recent work has shown that language models (LMs) trained on synthetic corpora can exhibit typological preferences that resemble cross-linguistic regularities in human languages, particularly for syntactic phenomena such as word order. In this paper, we extend this paradigm to differential argument marking (DAM), a semantic licensing system in which morphological marking depends on semantic prominence. Using a controlled synthetic learning method, we train GPT-2 models on 18 corpora implementing distinct DAM systems and evaluate their generalization using minimal pairs. Our results reveal a dissociation between two typological dimensions of DAM. Models reliably exhibit human-like preferences for natural markedness direction, favoring systems in which overt marking targets semantically atypical arguments. In contrast, models do not reproduce the strong object preference in human languages, in which overt marking in DAM more often targets objects rather than subjects. These findings suggest that different typological tendencies may arise from distinct underlying sources.
[NLP-3] Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting
Quick Read: This paper addresses adversarial attacks on large vision-language models (LVLMs) in the black-box setting, where the core difficulties are the absence of gradients and the complexity of multimodal decision boundaries. Existing state-of-the-art methods such as M-Attack rely on crop-level local matching between source and target images, but this induces high-variance, nearly orthogonal gradients that violate coherent local alignment and destabilize optimization; the authors attribute this to ViT translation sensitivity (spike-like gradients) and structural asymmetry between source and target crops. The key to the solution is reformulating local matching as an asymmetric expectation over source transformations and target semantics, with two core modules: Multi-Crop Alignment (MCA), which averages gradients from multiple independently sampled local views per iteration to reduce variance, and Auxiliary Target Alignment (ATA), which replaces aggressive target augmentation with a small auxiliary set drawn from a semantically correlated distribution, yielding a smoother target manifold. Momentum is further reinterpreted as Patch Momentum and combined with a refined patch-size ensemble (PE+) to strengthen transferable directions. Together these modules form M-Attack-V2, which substantially raises attack success rates on several frontier LVLMs, validating its effectiveness and generality.
Link: https://arxiv.org/abs/2602.17645
Authors: Xiaohan Zhao, Zhaoyi Li, Yaxin Luo, Jiacheng Cui, Zhiqiang Shen
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Code at: this https URL
Abstract:Black-box adversarial attacks on Large Vision-Language Models (LVLMs) are challenging due to missing gradients and complex multimodal boundaries. While prior state-of-the-art transfer-based approaches like M-Attack perform well using local crop-level matching between source and target images, we find this induces high-variance, nearly orthogonal gradients across iterations, violating coherent local alignment and destabilizing optimization. We attribute this to (i) ViT translation sensitivity that yields spike-like gradients and (ii) structural asymmetry between source and target crops. We reformulate local matching as an asymmetric expectation over source transformations and target semantics, and build a gradient-denoising upgrade to M-Attack. On the source side, Multi-Crop Alignment (MCA) averages gradients from multiple independently sampled local views per iteration to reduce variance. On the target side, Auxiliary Target Alignment (ATA) replaces aggressive target augmentation with a small auxiliary set from a semantically correlated distribution, producing a smoother, lower-variance target manifold. We further reinterpret momentum as Patch Momentum, replaying historical crop gradients; combined with a refined patch-size ensemble (PE+), this strengthens transferable directions. Together these modules form M-Attack-V2, a simple, modular enhancement over M-Attack that substantially improves transfer-based black-box attacks on frontier LVLMs: boosting success rates on Claude-4.0 from 8% to 30%, Gemini-2.5-Pro from 83% to 97%, and GPT-5 from 98% to 100%, outperforming prior black-box LVLM attacks. Code and data are publicly available at: this https URL.
[NLP-4] Unmasking the Factual-Conceptual Gap in Persian Language Models
Quick Read: This paper addresses the limits of cultural understanding in current Persian NLP models: existing benchmarks cover pragmatics and politeness but rarely distinguish memorized cultural facts from the ability to reason about implicit social norms. The authors introduce DivanBench, a diagnostic benchmark focused on superstitions and customs, whose key design is 315 questions across three task types (factual retrieval, paired scenario verification, and situational reasoning) that probe LLM reasoning over arbitrary, context-dependent cultural rules. Experiments on seven Persian LLMs reveal pervasive acquiescence bias, show that continuous Persian pretraining amplifies rather than reduces this bias, and expose a 21% gap between retrieving factual knowledge and applying it in scenarios, indicating that genuine cultural competence cannot be achieved by simply scaling monolingual data: models must internalize the underlying cultural schemas rather than mimic surface patterns.
Link: https://arxiv.org/abs/2602.17623
Authors: Alireza Sakhaeirad, Ali Ma'manpoosh, Arshia Hemmat
Affiliations: EPFL; University of Isfahan; University of Oxford
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:While emerging Persian NLP benchmarks have expanded into pragmatics and politeness, they rarely distinguish between memorized cultural facts and the ability to reason about implicit social norms. We introduce DivanBench, a diagnostic benchmark focused on superstitions and customs, arbitrary, context-dependent rules that resist simple logical deduction. Through 315 questions across three task types (factual retrieval, paired scenario verification, and situational reasoning), we evaluate seven Persian LLMs and reveal three critical failures: most models exhibit severe acquiescence bias, correctly identifying appropriate behaviors but failing to reject clear violations; continuous Persian pretraining amplifies this bias rather than improving reasoning, often degrading the model’s ability to discern contradictions; and all models show a 21% performance gap between retrieving factual knowledge and applying it in scenarios. These findings demonstrate that cultural competence requires more than scaling monolingual data, as current models learn to mimic cultural patterns without internalizing the underlying schemas.
[NLP-5] The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR→LLM Pipelines?
Quick Read: This paper asks whether current speech large language models (Speech LLMs) genuinely perform end-to-end speech understanding, or instead implement an implicit ASR cascade. Through matched-backbone testing that controls for the LLM backbone, the study finds that most Speech LLMs are behaviorally and mechanistically indistinguishable from simple Whisper→LLM cascades, making them expensive, indirect pipelines in practice. The key to the approach is the rigorous controlled design (matched-backbone testing), combined with logit-lens analysis and LEACE concept erasure to verify that text representations are present in hidden states and causally necessary; Qwen2-Audio is the exception, showing that cascade equivalence is architecture-dependent rather than universal.
Link: https://arxiv.org/abs/2602.17598
Authors: Jayadev Billa
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 10 pages, 6 figures, 7 tables
Abstract:Current speech LLMs largely perform implicit ASR: on tasks solvable from a transcript, they are behaviorally and mechanistically equivalent to simple Whisper→LLM cascades. We show this through matched-backbone testing across four speech LLMs and six tasks, controlling for the LLM backbone for the first time. Ultravox is statistically indistinguishable from its matched cascade (κ = 0.93); logit lens reveals literal text emerging in hidden states; LEACE concept erasure confirms text representations are causally necessary in both architectures tested, collapsing accuracy to near-zero. Qwen2-Audio genuinely diverges, revealing cascade equivalence is architecture-dependent, not universal. For most deployed use cases, current speech LLMs are expensive cascades, and under noise, they are worse ones, with clean-condition advantages reversing by up to 7.6% at 0 dB.
[NLP-6] KLong: Training LLM Agent for Extremely Long-horizon Tasks
Quick Read: This paper addresses the challenges LLM agents face on extremely long-horizon tasks, such as context loss during task execution, weak long-term planning, and scarce training data. The two key innovations are: (1) trajectory-splitting SFT, which preserves early context, progressively truncates later context, and maintains overlap between sub-trajectories, mitigating information decay in long-sequence training; and (2) a progressive RL strategy that schedules training into stages with progressively extended task timeouts, strengthening sustained reasoning and execution on complex multi-step tasks. Experiments show KLong significantly outperforms existing models on benchmarks such as PaperBench, with gains that generalize well.
Link: https://arxiv.org/abs/2602.17547
Authors: Yue Liu, Zhiyuan Hu, Flood Sung, Jiaheng Zhang, Bryan Hooi
Affiliations: NUS; MIT; Independent Researcher
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks. The principle is to first cold-start the model via trajectory-splitting SFT, then scale it via progressive RL training. Specifically, we first activate basic agentic abilities of a base model with a comprehensive SFT recipe. Then, we introduce Research-Factory, an automated pipeline that generates high-quality training data by collecting research papers and constructing evaluation rubrics. Using this pipeline, we build thousands of long-horizon trajectories distilled from Claude 4.5 Sonnet (Thinking). To train with these extremely long trajectories, we propose a new trajectory-splitting SFT, which preserves early context, progressively truncates later context, and maintains overlap between sub-trajectories. In addition, to further improve long-horizon task-solving capability, we propose a novel progressive RL, which schedules training into multiple stages with progressively extended timeouts. Experiments demonstrate the superiority and generalization of KLong, as shown in Figure 1. Notably, our proposed KLong (106B) surpasses Kimi K2 Thinking (1T) by 11.28% on PaperBench, and the performance improvement generalizes to other coding benchmarks like SWE-bench Verified and MLE-bench.
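A minimal sketch of the trajectory-splitting idea as the abstract describes it: keep the early context, slide a window over later content, and overlap consecutive sub-trajectories (the prefix, window, and overlap sizes are invented parameters, not KLong's):

```python
def split_trajectory(steps, prefix=4, window=8, overlap=2):
    """Split a long trajectory into overlapping training samples.

    Each sample keeps the first `prefix` steps verbatim (preserved early
    context) plus a sliding window over the rest, with `overlap` shared
    steps between consecutive windows.
    """
    head, tail = steps[:prefix], steps[prefix:]
    samples, start = [], 0
    while start < len(tail):
        samples.append(head + tail[start:start + window])
        start += window - overlap  # overlap keeps sub-trajectories linked
    return samples

traj = [f"step_{i}" for i in range(20)]
for s in split_trajectory(traj):
    print(len(s), s[0], "...", s[-1])
```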
[NLP-7] Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning
Quick Read: This paper addresses the degradation of safety behavior during fine-tuning of instruction-following language models, which deteriorates under benign fine-tuning and worsens under adversarial updates; existing defenses offer limited protection or force a safety-utility trade-off. The key to the solution is an adaptive-regularization training framework that scales regularization strength with an online estimate of safety risk, obtained in one of two ways: a judge-based Safety Critic that assigns harm scores to training batches, or a lightweight activation-based classifier that predicts harmful intent from intermediate activations. High-risk updates are constrained to stay close to a safe reference policy, while low-risk updates proceed with standard training. Experiments across model families and attack scenarios show the mechanism consistently lowers attack success rates, preserves downstream performance, and adds no inference-time cost, achieving safety and utility jointly.
Link: https://arxiv.org/abs/2602.17546
Authors: Jyotin Goel, Souvik Maji, Pratik Mazumder
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Work in progress (30 pages)
Abstract:Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates. Existing defenses often offer limited protection or force a trade-off between safety and utility. We introduce a training framework that adapts regularization in response to safety risk, enabling models to remain aligned throughout fine-tuning. To estimate safety risk at training time, we explore two distinct approaches: a judge-based Safety Critic that assigns high-level harm scores to training batches, and an activation-based risk predictor built with a lightweight classifier trained on intermediate model activations to estimate harmful intent. Each approach provides a risk signal that is used to constrain updates deemed higher risk to remain close to a safe reference policy, while lower-risk updates proceed with standard training. We empirically verify that harmful intent signals are predictable from pre-generation activations and that judge scores provide effective high-recall safety guidance. Across multiple model families and attack scenarios, adaptive regularization with either risk estimation approach consistently lowers attack success rate compared to standard fine-tuning, preserves downstream performance, and adds no inference-time cost. This work demonstrates a principled mechanism for maintaining safety without sacrificing utility.
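A minimal sketch of the risk-gated objective implied above (the functional form of the risk-to-strength mapping and the KL anchor are assumptions for illustration, not the paper's exact loss):

```python
import torch
import torch.nn.functional as F

def adaptive_reg_loss(logits, labels, ref_logits, risk, lam_max=10.0):
    """Task loss plus a risk-scaled KL anchor to a safe reference policy.

    `risk` in [0, 1] would come from a safety critic or activation probe;
    high risk pulls the update strongly toward the reference distribution.
    """
    task = F.cross_entropy(logits, labels)
    kl = F.kl_div(F.log_softmax(logits, dim=-1),
                  F.log_softmax(ref_logits, dim=-1),
                  log_target=True, reduction="batchmean")
    return task + lam_max * risk * kl

logits = torch.randn(4, 32, requires_grad=True)
ref_logits = torch.randn(4, 32)
loss = adaptive_reg_loss(logits, torch.randint(0, 32, (4,)), ref_logits, risk=0.8)
loss.backward()
print(loss.item())
```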
[NLP-8] Using LLMs for Knowledge Component-level Correctness Labeling in Open-ended Coding Problems
Quick Read: This paper addresses the absence of fine-grained knowledge component (KC)-level correctness labels in real-world learning data, especially for open-ended programming tasks where student code typically involves multiple KCs at once; naively propagating problem-level correctness to all associated KCs obscures partial mastery and yields poorly fitted learning curves. The key to the solution is an automated framework that leverages large language models (LLMs) to label KC-level correctness directly from student-written code, together with a temporal context-aware Code-KC mapping mechanism that aligns KCs more precisely with each student's code, improving learning-curve consistency and predictive performance.
Link: https://arxiv.org/abs/2602.17542
Authors: Zhangqi Duan, Arnav Kankaria, Dhruv Kartik, Andrew Lan
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:
Abstract:Fine-grained skill representations, commonly referred to as knowledge components (KCs), are fundamental to many approaches in student modeling and learning analytics. However, KC-level correctness labels are rarely available in real-world datasets, especially for open-ended programming tasks where solutions typically involve multiple KCs simultaneously. Simply propagating problem-level correctness to all associated KCs obscures partial mastery and often leads to poorly fitted learning curves. To address this challenge, we propose an automated framework that leverages large language models (LLMs) to label KC-level correctness directly from student-written code. Our method assesses whether each KC is correctly applied and further introduces a temporal context-aware Code-KC mapping mechanism to better align KCs with individual student code. We evaluate the resulting KC-level correctness labels in terms of learning curve fit and predictive performance using the power law of practice and the Additive Factors Model. Experimental results show that our framework leads to learning curves that are more consistent with cognitive theory and improves predictive performance, compared to baselines. Human evaluation further demonstrates substantial agreement between LLM and expert annotations.
[NLP-9] The Anxiety of Influence: Bloom Filters in Transformer Attention Heads DATE
Quick Read: This paper asks whether transformer models contain attention heads that perform membership testing, i.e., deciding whether a token has appeared earlier in the context, and investigates how such heads work and where they are distributed. The study identifies several membership-testing heads with distinct strategies: two act as high-precision filters (0-4% false positive rates, well beyond the 64-bit capacity of a classical Bloom filter), one matches theoretical Bloom filter behavior (fit R² = 1.0, capacity of about 5 bits), and one head initially flagged as a Bloom filter was reclassified as a prefix-attention head after controlling for a sequence-length confound. The genuine membership-testing heads concentrate in early layers (0-1), are distance-sensitive (false positive rates decay with embedding distance), and generalize broadly, responding to any repeated token type rather than only repeated names. Ablations further show these heads contribute to both repeated and novel token processing, indicating that membership testing coexists with broader computational roles.
Link: https://arxiv.org/abs/2602.17526
Authors: Peter Balogh
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 13 pages, 8 figures, code at this https URL. v2: L3H0 reclassified as prefix-attention head following confound control; capacity analysis updated; duplicate-token head overlap experiment added. v3: All experiments were independently validated on CPU to rule out hardware-specific computation artifacts; results are consistent across backends
Abstract:Some transformer attention heads appear to function as membership testers, dedicating themselves to answering the question "has this token appeared before in the context?" We identify these heads across four language models (GPT-2 small, medium, and large; Pythia-160M) and show that they form a spectrum of membership-testing strategies. Two heads (L0H1 and L0H5 in GPT-2 small) function as high-precision membership filters with false positive rates of 0-4% even at 180 unique context tokens, well above the d_head = 64 bit capacity of a classical Bloom filter. A third head (L1H11) shows the classic Bloom filter capacity curve: its false positive rate follows the theoretical formula p ≈ (1 − e^(−kn/m))^k with R² = 1.0 and fitted capacity m ≈ 5 bits, saturating by n ≈ 20 unique tokens. A fourth head initially identified as a Bloom filter (L3H0) was reclassified as a general prefix-attention head after confound controls revealed its apparent capacity curve was a sequence-length artifact. Together, the three genuine membership-testing heads form a multi-resolution system concentrated in early layers (0-1), taxonomically distinct from induction and previous-token heads, with false positive rates that decay monotonically with embedding distance, consistent with distance-sensitive Bloom filters. These heads generalize broadly: they respond to any repeated token type, not just repeated names, with 43% higher generalization than duplicate-token-only heads. Ablation reveals these heads contribute to both repeated and novel token processing, indicating that membership testing coexists with broader computational roles. The reclassification of L3H0 through confound controls strengthens rather than weakens the case: the surviving heads withstand the scrutiny that eliminated a false positive in our own analysis.
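The capacity curve referenced above is the classical Bloom filter false-positive formula; a minimal sketch evaluates it and shows the saturation behavior for the fitted value quoted in the abstract (m ≈ 5 bits; k = 1 is an assumption, since the abstract does not report a fitted hash count):

```python
import math

def bloom_fpr(n, m, k):
    """Classical Bloom filter false-positive rate
    p = (1 - exp(-k * n / m)) ** k
    for n inserted items, m bits, and k hash functions."""
    return (1.0 - math.exp(-k * n / m)) ** k

m, k = 5, 1  # fitted capacity ~5 bits; k assumed
for n in (1, 5, 10, 20, 40):
    print(f"n={n:3d}  p={bloom_fpr(n, m, k):.3f}")
# p climbs toward 1 and is near-saturated by n ~ 20, matching the
# saturation the abstract reports for head L1H11.
```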
[NLP-10] Bridging the Domain Divide: Supervised vs. Zero-Shot Clinical Section Segmentation from MIMIC-III to Obstetrics LREC2026
Quick Read: This paper addresses clinical section segmentation in free-text notes to support clinical decision-making and downstream NLP. The contributions are threefold: first, a new de-identified, section-labeled obstetrics notes dataset that supplements the medical domains covered by public corpora such as MIMIC-III, on which most existing segmentation approaches are trained; second, a systematic evaluation of transformer-based supervised models in-domain (a curated MIMIC-III subset) and out-of-domain (the new obstetrics data), showing that supervised performance drops substantially out-of-domain; and third, the first head-to-head comparison of supervised models with zero-shot large language models, showing that once hallucinated section headers are corrected, zero-shot models adapt more robustly out-of-domain. The findings underscore the importance of domain-specific clinical resources and highlight zero-shot segmentation as a promising path for extending healthcare NLP to under-resourced domains, provided hallucinations are appropriately managed.
Link: https://arxiv.org/abs/2602.17513
Authors: Baris Karacan, Barbara Di Eugenio, Patrick Thornton
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 11 pages. Accepted at LREC 2026. To appear in the proceedings
Abstract:Clinical free-text notes contain vital patient information. They are structured into labelled sections; recognizing these sections has been shown to support clinical decision-making and downstream NLP tasks. In this paper, we advance clinical section segmentation through three key contributions. First, we curate a new de-identified, section-labeled obstetrics notes dataset, to supplement the medical domains covered in public corpora such as MIMIC-III, on which most existing segmentation approaches are trained. Second, we systematically evaluate transformer-based supervised models for section segmentation on a curated subset of MIMIC-III (in-domain), and on the new obstetrics dataset (out-of-domain). Third, we conduct the first head-to-head comparison of supervised models for medical section segmentation with zero-shot large language models. Our results show that while supervised models perform strongly in-domain, their performance drops substantially out-of-domain. In contrast, zero-shot models demonstrate robust out-of-domain adaptability once hallucinated section headers are corrected. These findings underscore the importance of developing domain-specific clinical resources and highlight zero-shot segmentation as a promising direction for applying healthcare NLP beyond well-studied corpora, as long as hallucinations are appropriately managed.
[NLP-11] Small LLMs for Medical NLP: a Systematic Analysis of Few-Shot Constraint Decoding Fine-Tuning and Continual Pre-Training in Italian LREC2026
Quick Read: This paper asks whether "small" LLMs (around one billion parameters) can perform a range of medical NLP tasks with competitive accuracy, addressing the deployment barrier that large models' computational demands pose in real clinical settings. The key finding is that fine-tuning is the most effective adaptation strategy, while combining few-shot prompting with constraint decoding offers a strong lower-resource alternative; the best configuration, based on Qwen3-1.7B, scores on average 9.2 points higher than Qwen3-32B, demonstrating that suitably adapted small models can surpass much larger baselines.
Link: https://arxiv.org/abs/2602.17475
Authors: Pietro Ferrazzi, Mattia Franzin, Alberto Lavelli, Bernardo Magnini
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Paper Accepted at LREC 2026
Abstract:Large Language Models (LLMs) consistently excel in diverse medical Natural Language Processing (NLP) tasks, yet their substantial computational requirements often limit deployment in real-world healthcare settings. In this work, we investigate whether "small" LLMs (around one billion parameters) can effectively perform medical tasks while maintaining competitive accuracy. We evaluate models from three major families (Llama-3, Gemma-3, and Qwen3) across 20 clinical NLP tasks spanning Named Entity Recognition, Relation Extraction, Case Report Form Filling, Question Answering, and Argument Mining. We systematically compare a range of adaptation strategies, both at inference time (few-shot prompting, constraint decoding) and at training time (supervised fine-tuning, continual pretraining). Fine-tuning emerges as the most effective approach, while the combination of few-shot prompting and constraint decoding offers strong lower-resource alternatives. Our results show that small LLMs can match or even surpass larger baselines, with our best configuration based on Qwen3-1.7B achieving an average score +9.2 points higher than Qwen3-32B. We release a comprehensive collection of all the publicly available Italian medical datasets for NLP tasks, together with our top-performing models. Furthermore, we release an Italian dataset of 126M words from the Emergency Department of an Italian Hospital, and 175M words from various sources that we used for continual pre-training.
[NLP-12] PEACE 2.0: Grounded Explanations and Counter-Speech for Combating Hate Expressions
Quick Read: This paper addresses the societal challenge posed by the growing volume of hate speech (HS) on online platforms, in particular the still-open problem of generating responses (counter-speech) now that NLP has made progress on automatic detection. The key to the solution is the PEACE 2.0 tool, which introduces a Retrieval-Augmented Generation (RAG) pipeline for three main capabilities: grounding explanations of hate speech in evidence and facts, automatically generating evidence-grounded counter-speech, and analyzing the characteristics of counter-speech replies. Integrating these capabilities lets PEACE 2.0 analyze and respond to both explicit and implicit hateful messages.
Link: https://arxiv.org/abs/2602.17467
Authors: Greta Damo, Stéphane Petiot, Elena Cabrio, Serena Villata
Affiliations: Université Côte d'Azur; CNRS; Inria; Institut 3IA Côte d'Azur; Techpool
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:The increasing volume of hate speech on online platforms poses significant societal challenges. While the Natural Language Processing community has developed effective methods to automatically detect the presence of hate speech, responses to it, called counter-speech, are still an open challenge. We present PEACE 2.0, a novel tool that, besides analysing and explaining why a message is considered hateful or not, also generates a response to it. More specifically, PEACE 2.0 has three main new functionalities: leveraging a Retrieval-Augmented Generation (RAG) pipeline i) to ground HS explanations into evidence and facts, ii) to automatically generate evidence-grounded counter-speech, and iii) exploring the characteristics of counter-speech replies. By integrating these capabilities, PEACE 2.0 enables in-depth analysis and response generation for both explicit and implicit hateful messages.
[NLP-13] Entropy-Based Data Selection for Language Models
Quick Read: This paper addresses the problem that efficient fine-tuning of large language models (LLMs) under resource constraints is hampered by an unclear relationship between data-selection strategies and uncertainty estimation, which drives up compute cost and slows training. The key to the solution is the Entropy-Based Unsupervised Data Selection (EUDS) framework, which builds a computationally efficient filtering mechanism to pick highly informative samples, significantly reducing compute requirements and improving training-time efficiency while preserving model performance with less data.
Link: https://arxiv.org/abs/2602.17465
Authors: Hongming Li, Yang Liu, Chao Huang
Affiliations: University of Science and Technology Beijing; Beijing Institute for General Artificial Intelligence
Subjects: Computation and Language (cs.CL)
Comments: IEEE Access, 15 pages, 5 figures, 11 tables
Abstract:Modern language models (LMs) increasingly require two critical resources: computational resources and data resources. Data selection techniques can effectively reduce the amount of training data required for fine-tuning LMs. However, their effectiveness is closely tied to computational resources, as they typically demand a high compute budget. Owing to resource limitations in practical fine-tuning scenarios, we systematically examine the relationship between data selection and uncertainty estimation of the selected data. Although large language models (LLMs) exhibit exceptional capabilities in language understanding and generation, which provide new ways to alleviate data scarcity, evaluating data usability remains a challenging task. This makes efficient data selection indispensable. To mitigate these issues, we propose the Entropy-Based Unsupervised Data Selection (EUDS) framework. Empirical experiments on sentiment analysis (SA), topic classification (Topic-CLS), and question answering (QA) tasks validate its effectiveness. EUDS establishes a computationally efficient data-filtering mechanism. Theoretical analysis and experimental results confirm the effectiveness of our approach. EUDS significantly reduces computational costs and improves training time efficiency with less data requirement. This provides an innovative solution for the efficient fine-tuning of LMs in compute-constrained scenarios.
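A minimal sketch of entropy-based scoring of candidate training examples (a generic recipe, not necessarily EUDS's exact criterion: each example is scored by its mean token-level predictive entropy under a scoring model, and a slice of the ranking is kept; the gpt2 scoring model is an illustrative choice):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # small scoring model, chosen for illustration
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

@torch.no_grad()
def mean_token_entropy(text):
    ids = tok(text, return_tensors="pt").input_ids
    logits = model(ids).logits              # (1, T, V)
    logp = torch.log_softmax(logits, dim=-1)
    ent = -(logp.exp() * logp).sum(-1)      # per-token entropy, (1, T)
    return ent.mean().item()

pool = ["The movie was great, I loved it.",
        "Colorless green ideas sleep furiously.",
        "asdf qwer zxcv uiop"]
scored = sorted(pool, key=mean_token_entropy)
print(scored)  # rank examples by predictive entropy, then keep a slice
```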
[NLP-14] ABCD: All Biases Come Disguised
Quick Read: This paper addresses biases in multiple-choice question (MCQ) evaluation of large language models (LLMs) arising from label position, option order, and the answer distribution in few-shot prompts ("label-position-few-shot-prompt bias"), which can make measured performance depend on how options are arranged rather than on genuine reasoning and knowledge. The key to the solution is a simplified bias-reduced evaluation protocol: each question's answer labels are replaced with uniform, unordered labels, and the model is prompted to answer with the full answer text rather than relying on option position or answer patterns in the prompt. A simple sentence-similarity model matches predictions to reference answers, which substantially improves robustness to answer permutations at only a minimal cost in measured performance, exposing the model's true capability; the advantage holds across multiple benchmarks and models.
Link: https://arxiv.org/abs/2602.17445
Authors: Mateusz Nowak, Xavier Cadet, Peter Chin
Affiliations: Dartmouth College
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 29 pages, 20 figures, pre-print, 12 tables
Abstract:Multiple-choice question (MCQ) benchmarks have been a standard evaluation practice for measuring LLMs' ability to reason and answer knowledge-based questions. Through a synthetic NonsenseQA benchmark, we observe that different LLMs exhibit varying degrees of label-position-few-shot-prompt bias, where the model either uses the answer position, the label in front of the answer, the distributions of correct answers present in the few-shot prompt, or a combination of all to answer each MCQ question. We propose a simple bias-reduced evaluation protocol that replaces the labels of each question with uniform, unordered labels and prompts the LLM to use the whole answer presented. With a simple sentence similarity model, we demonstrate improved robustness and lower standard deviation between different permutations of answers with a minimal drop in LLM's performance, exposing the LLM's capabilities under reduced evaluation artifacts, without any help from the prompt examples or the option labels. Across multiple benchmarks and models, this protocol substantially improves the robustness to answer permutations, reducing mean accuracy variance 3× with only a minimal decrease in the mean model's performance. Through ablation studies on various embedding models and similarity functions, we show that the method is more robust than the standard ones.
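A minimal sketch of the matching step such a protocol needs: scoring a free-text model answer against the candidate options with a sentence-similarity model and picking the closest (the sentence-transformers model name is an illustrative assumption; the paper's exact similarity function may differ):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedder

def match_answer(model_output, options):
    """Map a free-text answer onto one of the MCQ options by cosine
    similarity, so no option labels or positions are ever used."""
    embs = model.encode([model_output] + options, convert_to_tensor=True)
    sims = util.cos_sim(embs[0], embs[1:])[0]
    return options[int(sims.argmax())]

options = ["Paris", "The capital of France is Lyon", "Berlin", "Madrid"]
print(match_answer("I believe the answer is the city of Paris.", options))
```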
[NLP-15] AIDG: Evaluating Asymmetry Between Information Extraction and Containment in Multi-Turn Dialogue
Quick Read: This paper addresses the limits of static benchmarks for evaluating the strategic reasoning of large language models (LLMs), which calls for dynamic multi-turn interaction to probe information reasoning and strategy. The key to the solution is AIDG (Adversarial Information Deduction Game), a game-theoretic framework with two complementary tasks: AIDG-I measures pragmatic strategy in social deduction, and AIDG-II measures constraint satisfaction in a structured "20 Questions" setting. Across 439 games with six frontier LLMs, the study reveals a clear capability asymmetry: models are substantially better at information containment (state maintenance) than at extraction (active deduction), with a 350 ELO defensive advantage (Cohen's d = 5.47). Two bottlenecks are identified: information dynamics (confirmation strategies are 7.75x more effective than blind deduction) and degrading constraint adherence under conversational load (accounting for 41.3% of deductive failures). The results suggest LLMs excel at local defensive coherence but struggle with the global state tracking needed for strategic inquiry.
Link: https://arxiv.org/abs/2602.17443
Authors: Adib Sakhawat, Fardeen Sadab, Rakin Shahriar
Affiliations: Islamic University of Technology
Subjects: Computation and Language (cs.CL)
Comments: 16 pages, 5 figures, 13 tables. Includes appendix and supplementary materials
Abstract:Evaluating the strategic reasoning capabilities of Large Language Models (LLMs) requires moving beyond static benchmarks to dynamic, multi-turn interactions. We introduce AIDG (Adversarial Information Deduction Game), a game-theoretic framework that probes the asymmetry between information extraction (active deduction) and information containment (state maintenance) in dialogue. We propose two complementary tasks: AIDG-I, measuring pragmatic strategy in social deduction, and AIDG-II, measuring constraint satisfaction in a structured "20 Questions" setting. Across 439 games with six frontier LLMs, we observe a clear capability asymmetry: models perform substantially better at containment than deduction, with a 350 ELO advantage on defense (Cohen's d = 5.47). We identify two bottlenecks driving this gap: (1) Information Dynamics, where confirmation strategies are 7.75x more effective than blind deduction (p < 0.00001), and (2) Constraint Adherence, where instruction-following degrades under conversational load, accounting for 41.3% of deductive failures. These findings suggest that while LLMs excel at local defensive coherence, they struggle with the global state tracking required for strategic inquiry.
[NLP-16] Fine-Grained Uncertainty Quantification for Long-Form Language Model Outputs: A Comparative Study ALT
Quick Read: This paper addresses the fact that existing uncertainty quantification (UQ) methods for closed-book hallucination detection in LLMs are designed mainly for short-form outputs and generalize poorly to long-form generation. The key to the solution is a taxonomy for fine-grained uncertainty quantification that organizes methods by design choices at three stages (response decomposition, unit-level scoring, and response-level aggregation) and formalizes several families of consistency-based black-box scorers, enabling more accurate hallucination detection and factuality improvement. Experiments show that claim-response entailment scoring matches or beats more complex claim-level scorers, and that uncertainty-aware decoding substantially improves the factuality of long-form outputs.
Link: https://arxiv.org/abs/2602.17431
Authors: Dylan Bouchard, Mohit Singh Chauhan, Viren Bajaj, David Skarbrevik
Affiliations: CVS Health
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: UQLM repository: this https URL
Abstract:Uncertainty quantification has emerged as an effective approach to closed-book hallucination detection for LLMs, but existing methods are largely designed for short-form outputs and do not generalize well to long-form generation. We introduce a taxonomy for fine-grained uncertainty quantification in long-form LLM outputs that distinguishes methods by design choices at three stages: response decomposition, unit-level scoring, and response-level aggregation. We formalize several families of consistency-based black-box scorers, providing generalizations and extensions of existing methods. In our experiments across multiple LLMs and datasets, we find 1) claim-response entailment consistently performs better or on par with more complex claim-level scorers, 2) claim-level scoring generally yields better results than sentence-level scoring, and 3) uncertainty-aware decoding is highly effective for improving the factuality of long-form outputs. Our framework clarifies relationships between prior methods, enables apples-to-apples comparisons, and provides practical guidance for selecting components for fine-grained UQ.
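A minimal sketch of claim-response entailment scoring: checking how often a claim extracted from the main response is entailed by independently sampled responses, using an off-the-shelf NLI model (the model choice and the fraction-of-samples aggregation are illustrative assumptions, not the paper's exact scorer):

```python
from transformers import pipeline

# Off-the-shelf NLI model (assumed choice for illustration).
nli = pipeline("text-classification", model="roberta-large-mnli")

def claim_support(claim, sampled_responses):
    """Fraction of sampled responses that entail the claim: a
    consistency-based, black-box confidence score for that claim."""
    hits = 0
    for resp in sampled_responses:
        out = nli({"text": resp, "text_pair": claim}, top_k=1)[0]
        hits += out["label"].upper() == "ENTAILMENT"
    return hits / len(sampled_responses)

samples = ["Marie Curie won two Nobel Prizes, in physics and chemistry.",
           "Curie received Nobel Prizes in 1903 and 1911."]
print(claim_support("Marie Curie won two Nobel Prizes.", samples))
```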
[NLP-17] Evaluating Extremely Low-Resource Machine Translation: A Comparative Study of ChrF and BLEU Metrics
【速读】: 该论文旨在解决在极低资源语言(Extremely Low-Resource Language, ELRL)场景下机器翻译(Machine Translation, MT)质量评估的难题,传统指标如BLEU在数据稀缺环境中常无法准确反映翻译质量。其解决方案的关键在于对两种不同机制的评价指标——基于n-gram的BLEU与基于字符的ChrF++进行对比分析,发现尽管BLEU在绝对分数上较低,但其对词汇精度的敏感性提供了与ChrF++互补的解释能力,从而提升ELRL场景下MT输出的可解释性和评估可靠性。
链接: https://arxiv.org/abs/2602.17425
作者: Sanjeev Kumar,Preethi Jyothi,Pushpak Bhattacharyya
机构: 未知
类目: Computation and Language (cs.CL)
备注: 6 pages
Abstract:Evaluating machine translation (MT) quality in extremely low-resource language (ELRL) scenarios poses unique challenges, as widely used metrics such as BLEU, effective in high-resource settings, often misrepresent quality in data-scarce contexts. This work presents a comparative analysis of BLEU, an n-gram-based metric, and ChrF++, a character-based metric, for MT evaluation in ELRL settings. We examine how each metric responds to translation artifacts, including hallucinations, repetition, source-text copying, and diacritic (matra) variations across three ELRLs: Magahi, Bhojpuri, and Chhattisgarhi, with a focus on outputs from large language models (LLMs) and neural MT (NMT) systems. While recent work often relies solely on ChrF++, our findings show that BLEU, despite its lower absolute scores, provides complementary lexical-precision insights that improve interpretability.
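下面用 sacrebleu 同时计算 BLEU 与 chrF++,演示两种指标的互补使用方式(示例译文为占位,实际评测需替换为系统输出与参考译文):

```python
# 最小示意:BLEU 与 chrF++ 的并行计算(CHRF(word_order=2) 即 chrF++)
from sacrebleu.metrics import BLEU, CHRF

hyps = ["the model translated this sentence"]    # 系统译文(占位)
refs = [["the model translates this sentence"]]  # 参考译文,外层列表对应多组参考

bleu = BLEU()
chrf = CHRF(word_order=2)  # 在字符 n-gram 之上加入词 2-gram,对形态变化更稳健

print("BLEU  :", bleu.corpus_score(hyps, refs).score)
print("chrF++:", chrf.corpus_score(hyps, refs).score)
```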
[NLP-18] Diverse Word Choices, Same Reference: Annotating Lexically-Rich Cross-Document Coreference
【速读】: 该论文旨在解决跨文档共指消解(Cross-document coreference resolution, CDCR)在新闻领域中因现有数据集主要聚焦事件共指且定义狭窄,导致难以有效分析语义多样性和话语框架差异的问题。其解决方案的关键在于提出一种新的标注方案,将共指链视为话语元素(Discourse Elements, DEs),并引入身份与近似身份关系的统一处理机制,从而支持对“caravan”、“asylum seekers”等不同表述的链接,增强模型对媒体话语中词汇多样性与框架变化的捕捉能力,同时保持细粒度的DE标注一致性。通过重新标注NewsWCL50和ECB+子集并使用统一编码本进行评估,验证了新数据集在词汇多样性指标和同头词干基线上的表现介于原始数据之间,为新闻领域的平衡、话语感知型CDCR研究提供了可靠基础。
链接: https://arxiv.org/abs/2602.17424
作者: Anastasia Zhukova,Felix Hamborg,Karsten Donnay,Norman Meuschke,Bela Gipp
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Cross-document coreference resolution (CDCR) identifies and links mentions of the same entities and events across related documents, enabling content analysis that aggregates information at the level of discourse participants. However, existing datasets primarily focus on event resolution and employ a narrow definition of coreference, which limits their effectiveness in analyzing diverse and polarized news coverage where wording varies widely. This paper proposes a revised CDCR annotation scheme of the NewsWCL50 dataset, treating coreference chains as discourse elements (DEs) and conceptual units of analysis. The approach accommodates both identity and near-identity relations, e.g., by linking “the caravan” - “asylum seekers” - “those contemplating illegal entry”, allowing models to capture lexical diversity and framing variation in media discourse, while maintaining the fine-grained annotation of DEs. We reannotate the NewsWCL50 and a subset of ECB+ using a unified codebook and evaluate the new datasets through lexical diversity metrics and a same-head-lemma baseline. The results show that the reannotated datasets align closely, falling between the original ECB+ and NewsWCL50, thereby supporting balanced and discourse-aware CDCR research in the news domain.
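摘要中用作对照的 same-head-lemma 基线可以概括为“按提及短语句法中心词的词元聚类”。下面是一个基于 spaCy 的最小示意(假设已安装 en_core_web_sm 模型,非论文官方实现):

```python
# 最小示意:same-head-lemma 基线,中心词词元相同的提及归入同一链
from collections import defaultdict
import spacy

nlp = spacy.load("en_core_web_sm")

def head_lemma(mention: str) -> str:
    doc = nlp(mention)
    return doc[:].root.lemma_.lower()  # span.root 即该短语依存树的中心词

mentions = ["the caravan", "asylum seekers", "those contemplating illegal entry"]
clusters = defaultdict(list)
for m in mentions:
    clusters[head_lemma(m)].append(m)
print(dict(clusters))  # 中心词不同的近似同指提及(如上例)无法被该基线链接起来
```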
[NLP-19] DAVE: A Policy-Enforcing LLM Spokesperson for Secure Multi-Document Data Sharing
【速读】: 该论文旨在解决当前跨组织数据空间中使用策略(usage policy)执行粒度过粗的问题——即现有机制仅能在整个文档或数据集层面进行共享或屏蔽,导致当文档部分内容敏感时,数据提供方不得不手动进行文本脱敏(redaction),这种方式成本高、粒度粗且难以维护。解决方案的关键在于提出 DAVE(Data Access Virtual Enforcer),一个基于大语言模型(Large Language Model, LLM)的“代言人”系统,其通过自然语言接口响应查询请求,同时受机器可读的使用策略约束;核心创新是引入“虚拟脱敏”(virtual redaction)机制,在查询时动态抑制敏感信息,无需修改原始文档即可实现细粒度的信息控制。该方案将策略强制与问答服务解耦,并基于 Eclipse Dataspace Components 和 ODRL 策略格式构建架构原型,为未来对 LLM 在多方数据空间中安全可控访问的实证研究奠定基础。
链接: https://arxiv.org/abs/2602.17413
作者: René Brinkhege,Prahlad Menon
机构: Fraunhofer ISST (弗劳恩霍夫信息与通信技术研究所); Fraunhofer CMA (弗劳恩霍夫材料与能源研究中心)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:
Abstract:In current inter-organizational data spaces, usage policies are enforced mainly at the asset level: a whole document or dataset is either shared or withheld. When only parts of a document are sensitive, providers who want to avoid leaking protected information typically must manually redact documents before sharing them, which is costly, coarse-grained, and hard to maintain as policies or partners change. We present DAVE, a usage policy-enforcing LLM spokesperson that answers questions over private documents on behalf of a data provider. Instead of releasing documents, the provider exposes a natural language interface whose responses are constrained by machine-readable usage policies. We formalize policy-violating information disclosure in this setting, drawing on usage control and information flow security, and introduce virtual redaction: suppressing sensitive information at query time without modifying source documents. We describe an architecture for integrating such a spokesperson with Eclipse Dataspace Components and ODRL-style policies, and outline an initial provider-side integration prototype in which QA requests are routed through a spokesperson service instead of triggering raw document transfer. Our contribution is primarily architectural: we do not yet implement or empirically evaluate the full enforcement pipeline. We therefore outline an evaluation methodology to assess security, utility, and performance trade-offs under benign and adversarial querying as a basis for future empirical work on systematically governed LLM access to multi-party data spaces.
[NLP-20] The Role of the Availability Heuristic in Multiple-Choice Answering Behaviour
【速读】: 该论文旨在解决学生在面对多选题(Multiple-Choice Question, MCQ)时因不确定正确答案而依赖猜测策略的问题,特别是探究“认知可用性”(cognitive availability)是否可作为有效的答题策略。其核心问题是:仅凭选项在记忆中浮现的难易程度(即可用性)来选择答案,是否能提升答题准确率?解决方案的关键在于提出一种基于大规模语料库(如维基百科)计算选项可用性的方法——通过衡量选项词项在语料中的出现频率来量化其认知可用性。研究发现,在三个大型题库中,正确选项的可用性显著高于错误选项,且始终选择最可用选项可使得分比随机猜测高出13.5%至32.9%,表明可用性是一个值得在计算建模学生行为时纳入的重要因素。
链接: https://arxiv.org/abs/2602.17377
作者: Leonidas Zotos,Hedderik van Rijn,Malvina Nissim
机构: 未知
类目: Computation and Language (cs.CL)
备注: 15 pages, 4 figures
Abstract:When students are unsure of the correct answer to a multiple-choice question (MCQ), guessing is common practice. The availability heuristic, proposed by A. Tversky and D. Kahneman in 1973, suggests that the ease with which relevant instances come to mind, typically operationalised by the mere frequency of exposure, can offer a mental shortcut for problems in which the test-taker does not know the exact answer. Is simply choosing the option that comes most readily to mind a good strategy for answering MCQs? We propose a computational method of assessing the cognitive availability of MCQ options operationalised by concepts’ prevalence in large corpora. The key finding, across three large question sets, is that correct answers, independently of the question stem, are significantly more available than incorrect MCQ options. Specifically, using Wikipedia as the retrieval corpus, we find that always selecting the most available option leads to scores 13.5% to 32.9% above the random-guess baseline. We further find that LLM-generated MCQ options show similar patterns of availability compared to expert-created options, despite the LLMs’ frequentist nature and their training on large collections of textual data. Our findings suggest that availability should be considered in current and future work when computationally modelling student behaviour.
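论文将“可用性”操作化为选项在大语料(如维基百科)中的出现频率。下面以一个假设的词频表说明“总是选择最可用选项”的策略(真实实验需以完整语料统计替代查表):

```python
# 最小示意:以语料频率近似认知可用性,并选择最可用的选项
from typing import Dict, List

def most_available_option(options: List[str], corpus_freq: Dict[str, int]) -> str:
    """corpus_freq 为预先统计的语料词频(此处简化为整短语小写查表)。"""
    return max(options, key=lambda opt: corpus_freq.get(opt.lower(), 0))

freq = {"paris": 120_000, "lyon": 18_000, "marseille": 21_000}  # 假设的频次
print(most_available_option(["Paris", "Lyon", "Marseille"], freq))  # -> "Paris"
```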
[NLP-21] RPDR: A Round-trip Prediction-Based Data Augmentation Framework for Long-Tail Question Answering
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长尾问答任务中因难以获取和准确回忆低频知识而导致的性能瓶颈问题。现有检索增强生成(Retrieval-Augmented Generation, RAG)系统虽能缓解此问题,但密集检索模型在稀有或专业领域知识上的泛化能力仍不足。其解决方案的关键在于提出一种名为RPDR的数据增强框架,通过三个核心组件实现:合成数据生成、基于往返预测(Round-Trip Prediction)的易学样本选择机制,以及利用这些高质量易学样本训练密集检索器。该方法显著提升了检索器在PopQA和EntityQuestion两个长尾检索基准上的表现,尤其在极端长尾类别上效果突出,并进一步通过动态路由机制将查询分配至专用检索模块以优化整体性能。
链接: https://arxiv.org/abs/2602.17366
作者: Yiming Zhang,Siyue Zhang,Junbo Zhao,Chen Zhao
机构: Zhejiang University (浙江大学); Nanyang Technological University (南洋理工大学); NYU Shanghai (纽约大学上海); Center for Data Science, New York University (纽约大学数据科学中心)
类目: Computation and Language (cs.CL)
备注:
Abstract:Long-tail question answering presents significant challenges for large language models (LLMs) due to their limited ability to acquire and accurately recall less common knowledge. Retrieval-augmented generation (RAG) systems have shown great promise in mitigating this limitation by integrating external retrieval mechanisms. However, dense retrieval models often face the same difficulties when generalizing to rare or niche knowledge. In this study, we introduce RPDR, a novel data augmentation framework that selects high-quality, easy-to-learn training data to enhance dense retrievers. Our approach is built around three core components: synthetic data generation, data selection with Round-Trip prediction to identify easy-to-learn instances, and retriever training with these instances. We evaluate RPDR on two long-tail retrieval benchmarks, PopQA and EntityQuestion, demonstrating substantial improvements over existing retrievers like BM25 and Contriever, especially on extremely long-tail categories. We identify the strengths and limitations of RPDR through detailed human analysis and propose a routing mechanism that dynamically directs queries to specialized retrieval modules to further improve retrieval performance.
[NLP-22] Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation LREC2026
【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)评估基准的可靠性问题,即模型性能对输入提示中细微但语义等价的词法和句法扰动敏感,从而可能导致模型排名失真。解决方案的关键在于设计两种基于语言学原则的扰动管道:一是通过同义词替换实现词法层面的无语义变化,二是利用依存句法分析确定可应用的句法变换,以此系统性地测试23个主流LLM在MMLU、SQuAD和AMEGA三个基准上的表现稳定性。结果表明,词法扰动普遍导致显著且统计显著的性能下降,而句法扰动则效果不一,二者均破坏了复杂任务上的模型排行榜,揭示出LLMs更依赖表层词汇模式而非抽象语言能力,强调将鲁棒性测试纳入标准评估流程的必要性。
链接: https://arxiv.org/abs/2602.17316
作者: Bogdan Kostić,Conor Fallon,Julian Risch,Alexander Löser
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at LREC 2026
Abstract:The rapid advancement of Large Language Models (LLMs) has established standardized evaluation benchmarks as the primary instrument for model comparison. Yet, their reliability is increasingly questioned due to sensitivity to shallow variations in input prompts. This paper examines how controlled, truth-conditionally equivalent lexical and syntactic perturbations affect the absolute performance and relative ranking of 23 contemporary LLMs across three benchmarks: MMLU, SQuAD, and AMEGA. We employ two linguistically principled pipelines to generate meaning-preserving variations: one performing synonym substitution for lexical changes, and another using dependency parsing to determine applicable syntactic transformations. Results show that lexical perturbations consistently induce substantial, statistically significant performance degradation across nearly all models and tasks, while syntactic perturbations have more heterogeneous effects, occasionally improving results. Both perturbation types destabilize model leaderboards on complex tasks. Furthermore, model robustness did not consistently scale with model size, revealing strong task dependence. Overall, the findings suggest that LLMs rely more on surface-level lexical patterns than on abstract linguistic competence, underscoring the need for robustness testing as a standard component of LLM evaluation.
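下面给出“保真条件等价的同义替换”这一类词法扰动的最小示意(同义词表为假设;论文的流水线还包含词性约束与语义校验,此处从略):

```python
# 最小示意:按概率对命中同义词表的词做替换,生成语义等价的扰动版本
import random

SYN = {"big": ["large", "sizeable"], "method": ["approach", "technique"]}  # 假设的同义词表

def lexical_perturb(text: str, p: float = 0.5, seed: int = 0) -> str:
    rng = random.Random(seed)  # 固定种子保证扰动可复现
    out = []
    for tok in text.split():
        key = tok.lower()
        out.append(rng.choice(SYN[key]) if key in SYN and rng.random() < p else tok)
    return " ".join(out)

print(lexical_perturb("This method handles big corpora"))
```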
[NLP-23] ArXiv-to-Model: A Practical Study of Scientific LM Training
【速读】: 该论文旨在解决从原始arXiv LaTeX数据中训练领域专业化科学语言模型(Scientific Language Model)的实践流程缺乏系统性文档的问题。其关键解决方案在于构建并公开了一个端到端的工程化训练管道,涵盖元数据过滤、归档验证、LaTeX提取、文本标准化、领域感知分词(domain-aware tokenization)以及在受限计算资源(2xA100 GPU)下的密集Transformer训练。通过24次实验分析了训练稳定性、数据利用率损失和基础设施瓶颈,揭示了预处理决策对可用token数量的影响、分词策略对符号稳定性的关键作用,以及存储与I/O限制可能成为与计算能力同等重要的制约因素,从而为资源有限的研究者提供了可复现、透明且实用的科学语言模型训练方法论。
链接: https://arxiv.org/abs/2602.17288
作者: Anuj Gupta
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 15 pages, 6 figures, 1 table
Abstract:While frontier large language models demonstrate strong reasoning and mathematical capabilities, the practical process of training domain-specialized scientific language models from raw sources remains under-documented. In this work, we present a detailed case study of training a 1.36B-parameter scientific language model directly from raw arXiv LaTeX sources spanning mathematics, computer science, and theoretical physics. We describe an end-to-end pipeline covering metadata filtering, archive validation, LaTeX extraction, text normalization, domain-aware tokenization, and dense transformer training under constrained compute (2xA100 GPUs). Through 24 experimental runs, we analyze training stability, scaling behavior, data yield losses, and infrastructure bottlenecks. Our findings highlight how preprocessing decisions significantly affect usable token volume, how tokenization impacts symbolic stability, and how storage and I/O constraints can rival compute as limiting factors. We further analyze convergence dynamics and show stable training behavior in a data-rich regime (52B pretraining tokens). Rather than proposing a novel architecture, this work provides an engineering-grounded, transparent account of training a small scientific language model from scratch. We hope these insights support researchers operating under moderate compute budgets who seek to build domain-specialized models.
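预处理决策对可用 token 量的影响,可以从 LaTeX 规范化这一步看出。下面是一个假设性的清洗片段(规则仅作示意,并非论文流水线的完整规则集):

```python
# 最小示意:arXiv LaTeX 源文本的部分规范化规则
import re

def normalize_latex(src: str) -> str:
    src = re.sub(r"(?<!\\)%.*", "", src)                  # 去除未转义的行内注释
    src = re.sub(r"\\(begin|end)\{figure\*?\}", "", src)  # 丢弃图环境标记
    src = re.sub(r"\\cite[tp]?\{[^}]*\}", "[CIT]", src)   # 引用替换为占位符,保留语篇结构
    src = re.sub(r"[ \t]+", " ", src)                     # 折叠多余空白
    return src.strip()

print(normalize_latex(r"We train \cite{foo} on GPUs. % internal note"))
```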
[NLP-24] Representation Collapse in Machine Translation Through the Lens of Angular Dispersion
【速读】: 该论文旨在解决神经机器翻译(Neural Machine Translation, NMT)中Transformer模型在训练过程中出现的表示坍缩(representation collapse)问题,尤其是在深层网络和连续输出架构下,模型可能退化为将所有向量映射到相同值的平凡解,从而损害翻译质量。解决方案的关键在于引入基于角度分散性(angular dispersion)的正则化方法,实验证明该方法不仅能有效缓解表示坍缩现象,还能提升翻译性能;此外,研究还表明即使在量化(quantization)后的模型中,该正则化策略依然能保持对坍缩的抑制效果。
链接: https://arxiv.org/abs/2602.17287
作者: Evgeniia Tokarchuk,Maya K. Nachesa,Sergey Troshin,Vlad Niculae
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Modern neural translation models based on the Transformer architecture are known for their high performance, particularly when trained on high-resource datasets. A standard next-token prediction training strategy, while widely adopted in practice, may lead to overlooked artifacts such as representation collapse. Previous works have shown that this problem is especially pronounced in the representation of the deeper Transformer layers, where it often fails to efficiently utilize the geometric space. Representation collapse is even more evident in end-to-end training of continuous-output neural machine translation, where the trivial solution would be to set all vectors to the same value. In this work, we analyze the dynamics of representation collapse at different levels of discrete and continuous NMT transformers throughout training. We incorporate an existing regularization method based on angular dispersion and demonstrate empirically that it not only mitigates collapse but also improves translation quality. Furthermore, we show that quantized models exhibit similar collapse behavior and that the benefits of regularization are preserved even after quantization.
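摘要中的角度分散性正则可以理解为“惩罚一批表示之间过高的平均余弦相似度”。下面给出一个 PyTorch 形式的最小示意(具体正则形式以原文为准,此处为常见写法之一):

```python
# 最小示意:角度分散性正则,平均两两余弦相似度越低,表示越分散
import torch
import torch.nn.functional as F

def dispersion_penalty(hidden: torch.Tensor) -> torch.Tensor:
    """hidden: (N, d) 的一批隐藏表示。"""
    h = F.normalize(hidden, dim=-1)                 # 投影到单位球面
    sim = h @ h.t()                                 # 两两余弦相似度矩阵
    n = h.size(0)
    off_diag = sim - torch.eye(n, device=h.device)  # 去掉对角线上的自相似
    return off_diag.sum() / (n * (n - 1))

# 训练时可写作:loss = nll_loss + lambda_disp * dispersion_penalty(hidden_states)
```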
[NLP-25] Towards Cross-lingual Values Assessment: A Consensus-Pluralism Perspective
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在内容安全评估中对显性危害(如暴力或仇恨言论)关注有余,而对数字内容中隐含的深层价值观(如公平、自由、宗教等)评估不足的问题。其解决方案的关键在于提出X-Value——一个跨语言的价值观评估基准,包含5000多个问答对,覆盖18种语言,并基于Schwartz的基本人类价值观理论划分为7个核心领域,同时区分易难层级以实现差异化评估。此外,该研究设计了一种两阶段标注框架:首先判断议题是否属于全球共识(如人权)或多元主义(如宗教),再对内容中潜藏的价值观进行多方评估,从而系统性提升LLMs在跨语言情境下对复杂价值观的理解与判别能力。
链接: https://arxiv.org/abs/2602.17283
作者: Yukun Chen,Xinyu Zhang,Jialong Tang,Yu Wan,Baosong Yang,Yiming Li,Zhan Qin,Kui Ren
机构: Zhejiang University (浙江大学); Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security (杭州高新区(滨江区)区块链与数据安全研究院); Tongyi Lab, Alibaba Group Inc (通义实验室,阿里巴巴集团); Nanyang Technological University (南洋理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:While large language models (LLMs) have become pivotal to content safety, current evaluation paradigms primarily focus on detecting explicit harms (e.g., violence or hate speech), neglecting the subtler value dimensions conveyed in digital content. To bridge this gap, we introduce X-Value, a novel Cross-lingual Values Assessment Benchmark designed to evaluate LLMs’ ability to assess deep-level values of content from a global perspective. X-Value consists of more than 5,000 QA pairs across 18 languages, systematically organized into 7 core domains grounded in Schwartz’s Theory of Basic Human Values and categorized into easy and hard levels for discriminative evaluation. We further propose a unique two-stage annotation framework that first identifies whether an issue falls under global consensus (e.g., human rights) or pluralism (e.g., religion), and subsequently conducts a multi-party evaluation of the latent values embedded within the content. Systematic evaluations on X-Value reveal that current SOTA LLMs exhibit deficiencies in cross-lingual values assessment (Acc < 77%), with significant performance disparities across different languages (ΔAcc > 20%). This work highlights the urgent need to improve the nuanced, values-aware content assessment capability of LLMs. Our X-Value is available at: this https URL.
[NLP-26] Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在基于问卷的评估中因社会期许反应(Socially Desirable Responding, SDR)导致的偏差问题。SDR指模型在评价情境下倾向于给出符合社会规范的答案,而非真实响应,从而扭曲测评分数和下游结论。解决方案的关键在于提出一个心理测量学框架:首先通过在“诚实”与“伪善良好”指令下分别施测同一量表,利用项目反应理论(Item Response Theory, IRT)估计潜变量得分,并以方向校正的标准效应量量化SDR;其次,构建一种受控匹配期望值的分级强制选择(Graded Forced-Choice, GFC)大五人格量表,通过约束优化从题项池中选取30对跨领域配对项,使其社会期望值一致,从而显著降低SDR,同时保持对预设人格特征的准确恢复能力。该方法揭示了模型依赖的SDR与人格特征恢复之间的权衡关系,推动了更严谨的问卷基准测试与审计实践。
链接: https://arxiv.org/abs/2602.17262
作者: Kensuke Okada,Yui Furukawa,Kyosuke Bunji
机构: The University of Tokyo (东京大学); Kobe University (神户大学)
类目: Computation and Language (cs.CL); Methodology (stat.ME)
备注:
Abstract:Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments. Yet these instruments presume honest responding; in evaluative contexts, LLMs can instead gravitate toward socially preferred answers-a form of socially desirable responding (SDR)-biasing questionnaire-derived scores and downstream conclusions. We propose a psychometric framework to quantify and mitigate SDR in questionnaire-based evaluation of LLMs. To quantify SDR, the same inventory is administered under HONEST versus FAKE-GOOD instructions, and SDR is computed as a direction-corrected standardized effect size from item response theory (IRT)-estimated latent scores. This enables comparisons across constructs and response formats, as well as against human instructed-faking benchmarks. For mitigation, we construct a graded forced-choice (GFC) Big Five inventory by selecting 30 cross-domain pairs from an item pool via constrained optimization to match desirability. Across nine instruction-tuned LLMs evaluated on synthetic personas with known target profiles, Likert-style questionnaires show consistently large SDR, whereas desirability-matched GFC substantially attenuates SDR while largely preserving the recovery of the intended persona profiles. These results highlight a model-dependent SDR-recovery trade-off and motivate SDR-aware reporting practices for questionnaire-based benchmarking and auditing of LLMs.
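SDR 的量化可以概括为:对同一量表在 HONEST 与 FAKE-GOOD 两种指令下估计潜变量得分,再计算方向校正的标准化效应量。下面是一个 Cohen's d 形式的最小示意(符号约定为假设):

```python
# 最小示意:方向校正的标准化效应量(theta_* 为 IRT 估计的潜变量得分)
import numpy as np

def sdr_effect_size(theta_honest: np.ndarray, theta_fake: np.ndarray,
                    desirable_high: bool) -> float:
    """desirable_high 指明该构念的高分方向是否更受社会期许。"""
    diff = theta_fake.mean() - theta_honest.mean()
    pooled_sd = np.sqrt((theta_fake.var(ddof=1) + theta_honest.var(ddof=1)) / 2)
    d = diff / pooled_sd
    return d if desirable_high else -d  # 正值表示朝社会期许方向偏移
```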
[NLP-27] Mechanistic Interpretability of Cognitive Complexity in LLMs via Linear Probing using Bloom's Taxonomy
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)黑箱特性带来的评估难题,即如何超越表面性能指标来理解模型内部对认知复杂度的编码机制。其解决方案的关键在于引入布卢姆分类法(Bloom’s Taxonomy)作为层级分析框架,通过解析不同LLM中高维激活向量在残差流(residual streams)中的分布特征,验证从基础记忆(Remember)到抽象创造(Create)等认知层级是否在线性可分的子空间中被编码。研究发现,线性分类器在所有布卢姆层级上平均准确率达约95%,表明认知复杂度确实在模型表示中具有线性可访问性,且模型在前向传播早期即已识别出提示的认知难度,且随着层加深,各层级表示逐渐更易分离。
链接: https://arxiv.org/abs/2602.17229
作者: Bianca Raimondi,Maurizio Gabbrielli
机构: University of Bologna (博洛尼亚大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Preprint. Under review
Abstract:The black-box nature of Large Language Models necessitates novel evaluation frameworks that transcend surface-level performance metrics. This study investigates the internal neural representations of cognitive complexity using Bloom’s Taxonomy as a hierarchical lens. By analyzing high-dimensional activation vectors from different LLMs, we probe whether different cognitive levels, ranging from basic recall (Remember) to abstract synthesis (Create), are linearly separable within the model’s residual streams. Our results demonstrate that linear classifiers achieve approximately 95% mean accuracy across all Bloom levels, providing strong evidence that cognitive level is encoded in a linearly accessible subspace of the model’s representations. These findings provide evidence that the model resolves the cognitive difficulty of a prompt early in the forward pass, with representations becoming increasingly separable across layers.
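论文使用的线性探针在工程上相当于对某层残差流激活训练一个线性分类器。下面是一个 scikit-learn 最小示意(激活与标签文件为假设输入,需自行从模型前向中导出):

```python
# 最小示意:用线性探针检验布卢姆层级在激活空间中的线性可分性
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = np.load("activations_layer12.npy")  # (样本数, 隐藏维度),假设的导出文件
y = np.load("bloom_labels.npy")         # 六个布卢姆层级对应的整数标签

probe = LogisticRegression(max_iter=2000)
acc = cross_val_score(probe, X, y, cv=5, scoring="accuracy")
print(acc.mean())  # 若层级线性可编码,探针准确率应显著高于随机水平
```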
[NLP-28] From Labor to Collaboration: A Methodological Experiment Using AI Agents to Augment Research Perspectives in Taiwans Humanities and Social Sciences
【速读】: 该论文旨在解决生成式 AI (Generative AI) 在人文与社会科学领域方法论研究不足的问题,尤其缺乏针对此类学科的可复制、结构化协作研究框架。其解决方案的关键在于提出一种基于 AI Agent 的协同研究工作流(Agentic Workflow),该工作流由七个模块化阶段构成,遵循任务模块化、人机分工和可验证性三大原则:其中人类研究人员负责研究判断与伦理决策,AI Agent 承担信息检索与文本生成任务。通过台湾地区 Anthropic 经济指数(AEI)数据(N=7,729 条对话)的实证分析,验证了该方法在二次数据分析中的可行性与输出质量,同时识别出三种人机协作模式——直接执行、迭代优化与人类主导,并强调人类在问题界定、理论阐释、情境推理与伦理反思中的不可替代作用。
链接: https://arxiv.org/abs/2602.17221
作者: Yi-Chih Huang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: also in Chinese
Abstract:Generative AI is reshaping knowledge work, yet existing research focuses predominantly on software engineering and the natural sciences, with limited methodological exploration for the humanities and social sciences. Positioned as a “methodological experiment,” this study proposes an AI Agent-based collaborative research workflow (Agentic Workflow) for humanities and social science research. Taiwan’s this http URL usage data (N = 7,729 conversations, November 2025) from the Anthropic Economic Index (AEI) serves as the empirical vehicle for validating the feasibility of this methodology. This study operates on two levels: the primary level is the design and validation of a methodological framework - a seven-stage modular workflow grounded in three principles: task modularization, human-AI division of labor, and verifiability, with each stage delineating clear roles for human researchers (research judgment and ethical decisions) and AI Agents (information retrieval and text generation); the secondary level is the empirical analysis of AEI Taiwan data - serving as an operational demonstration of the workflow’s application to secondary data research, showcasing both the process and output quality (see Appendix A). This study contributes by proposing a replicable AI collaboration framework for humanities and social science researchers, and identifying three operational modes of human-AI collaboration - direct execution, iterative refinement, and human-led - through reflexive documentation of the operational process. This taxonomy reveals the irreplaceability of human judgment in research question formulation, theoretical interpretation, contextualized reasoning, and ethical reflection. Limitations including single-platform data, cross-sectional design, and AI reliability risks are acknowledged.
[NLP-29] What Makes a Good Doctor Response? An Analysis on a Romanian Telemedicine Platform
【速读】: 该论文旨在解决文本类远程医疗(text-based telemedicine)中患者满意度信号的建模问题,尤其关注如何识别影响患者评分的关键因素,以优化医患沟通质量。其解决方案的关键在于构建一个基于时间划分的分类模型,利用可解释特征(包括语言无关特征如响应长度和结构特性、可读性代理指标,以及罗马尼亚语LIWC心理语言学特征与礼貌/缓和标记),并通过SHAP分析揭示:患者与医生的历史交互信息是主导预测的强先验信号,而响应文本本身的特征虽贡献较小,却是可操作的关键变量;进一步的子群相关性分析表明,礼貌性和缓和表达始终与积极反馈正相关,而词汇多样性则呈负相关。
链接: https://arxiv.org/abs/2602.17194
作者: Adrian Cosma,Cosmin Dumitrache,Emilian Radoi
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Text-based telemedicine has become a common mode of care, requiring clinicians to deliver medical advice clearly and effectively in writing. As platforms increasingly rely on patient ratings and feedback, clinicians face growing pressure to maintain satisfaction scores, even though these evaluations often reflect communication quality more than clinical accuracy. We analyse patient satisfaction signals in Romanian text-based telemedicine. Using a sample of 77,334 anonymised patient question–doctor response pairs, we model feedback as a binary outcome, treating thumbs-up responses as positive and grouping negative or absent feedback into the other class. We extract interpretable, predominantly language-agnostic features (e.g., length, structural characteristics, readability proxies), along with Romanian LIWC psycholinguistic features and politeness/hedging markers where available. We train a classifier with a time-based split and perform SHAP-based analyses, which indicate that patient and clinician history features dominate prediction, functioning as strong priors, while characteristics of the response text provide a smaller but, crucially, actionable signal. In subgroup correlation analyses, politeness and hedging are consistently positively associated with patient feedback, whereas lexical diversity shows a negative association.
[NLP-30] The Emergence of Lab-Driven Alignment Signatures: A Psychometric Framework for Auditing Latent Bias and Compounding Risk in Generative AI
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在多智能体系统和递归评估循环中,因缺乏对稳定、可审计的行为特征识别而导致的安全与治理难题。传统基准测试仅关注瞬时任务准确性,无法捕捉训练和对齐过程中嵌入的潜在响应策略(即“主导思维模式”),这些策略会超越单个模型版本而持续存在。解决方案的关键在于提出一种基于心理测量学理论的新颖审计框架,利用有序不确定性下的潜变量估计方法,在不依赖真实标签的前提下量化此类行为倾向;其核心创新包括使用强制选择的序数情境(vignettes)结合语义正交干扰项,并通过密码学排列不变性保障数据隐私与公平性,从而在优化偏差、谄媚倾向和现状正当化等维度上实现跨模型的稳定行为聚类分析。
链接: https://arxiv.org/abs/2602.17127
作者: Dusan Bosnjakovic
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:As Large Language Models (LLMs) transition from standalone chat interfaces to foundational reasoning layers in multi-agent systems and recursive evaluation loops (LLM-as-a-judge), the detection of durable, provider-level behavioral signatures becomes a critical requirement for safety and governance. Traditional benchmarks measure transient task accuracy but fail to capture stable, latent response policies – the “prevailing mindsets” embedded during training and alignment that outlive individual model versions. This paper introduces a novel auditing framework that utilizes psychometric measurement theory -- specifically latent trait estimation under ordinal uncertainty -- to quantify these tendencies without relying on ground-truth labels. Utilizing forced-choice ordinal vignettes masked by semantically orthogonal decoys and governed by cryptographic permutation-invariance, the research audits nine leading models across dimensions including Optimization Bias, Sycophancy, and Status-Quo Legitimization. Using Mixed Linear Models (MixedLM) and Intraclass Correlation Coefficient (ICC) analysis, the research identifies that while item-level framing drives high variance, a persistent “lab signal” accounts for significant behavioral clustering. These findings demonstrate that in “locked-in” provider ecosystems, latent biases are not merely static errors but compounding variables that risk creating recursive ideological echo chambers in multi-layered AI architectures.
[NLP-31] Projective Psychological Assessment of Large Multimodal Models Using Thematic Apperception Tests
【速读】: 该论文试图解决的问题是:如何通过非语言模态评估大型多模态模型(Large Multimodal Models, LMMs)的人格特质,尤其是其认知表征与情感关系层面的个性功能。解决方案的关键在于引入社会认知与客体关系量表-全局版(Social Cognition and Object Relations Scale - Global, SCORS-G)作为评估框架,并将LMMs置于两个角色中——作为生成故事的主体模型(Subject Models, SMs)和基于SCORS-G对叙事进行评分的评价模型(Evaluator Models, EMs)。研究发现,EMs能高度一致地理解并分析TAT图像生成的故事,且所有模型均展现出良好的人际动态理解和自我概念把握能力,但在攻击性感知与调节方面存在系统性缺陷;同时,模型性能随规模和训练时间显著提升,表明模型复杂度与人格模拟能力之间存在正相关关系。
链接: https://arxiv.org/abs/2602.17108
作者: Anton Dzega,Aviad Elyashar,Ortal Slobodin,Odeya Cohen,Rami Puzis
机构: Ben-Gurion University of the Negev (本-古里安大学); Shamoon College of Engineering (沙穆恩工程学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Thematic Apperception Test (TAT) is a psychometrically grounded, multidimensional assessment framework that systematically differentiates between cognitive-representational and affective-relational components of personality-like functioning. This test is a projective psychological framework designed to uncover unconscious aspects of personality. This study examines whether the personality traits of Large Multimodal Models (LMMs) can be assessed through non-language-based modalities, using the Social Cognition and Object Relations Scale - Global (SCORS-G). LMMs are employed in two distinct roles: as subject models (SMs), which generate stories in response to TAT images, and as evaluator models (EMs), who assess these narratives using the SCORS-G framework. Evaluators demonstrated an excellent ability to understand and analyze TAT responses. Their interpretations are highly consistent with those of human experts. Assessment results highlight that all models understand interpersonal dynamics very well and have a good grasp of the concept of self. However, they consistently fail to perceive and regulate aggression. Performance varied systematically across model families, with larger and more recent models consistently outperforming smaller and earlier ones across SCORS-G dimensions.
[NLP-32] BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在金融领域,尤其是数字银行场景中进行核心银行业务计算时准确率低的问题,如存款/贷款总收益估算、不同利率产品比较以及提前还款条件下的利息计算等任务。这些问题要求多步数值推理和对产品上下文的深入理解,而现有LLMs常因误解产品类型、误用条件或无法正确处理指数与等比数列等基础数学运算导致系统性错误,且缺乏针对此类真实场景的评测基准。解决方案的关键在于提出BankMathBench——一个面向银行场景的专用数据集,按难度分为基础(单产品推理)、中级(多产品对比)和高级(多条件场景)三个层级,并通过工具增强的微调策略显著提升了模型在公式生成与数值推理上的准确性,平均准确率提升达57.6%p至75.1%p,验证了其作为可靠评估与提升LLMs在实际银行业务中数值推理能力的有效性。
链接: https://arxiv.org/abs/2602.17072
作者: Yunseung Lee,Subin Kim,Youngjun Kwak,Jaegul Choo
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs)-based chatbots are increasingly being adopted in the financial domain, particularly in digital banking, to handle customer inquiries about products such as deposits, savings, and loans. However, these models still exhibit low accuracy in core banking computations-including total payout estimation, comparison of products with varying interest rates, and interest calculation under early repayment conditions. Such tasks require multi-step numerical reasoning and contextual understanding of banking products, yet existing LLMs often make systematic errors-misinterpreting product types, applying conditions incorrectly, or failing basic calculations involving exponents and geometric progressions. However, such errors have rarely been captured by existing benchmarks. Mathematical datasets focus on fundamental math problems, whereas financial benchmarks primarily target financial documents, leaving everyday banking scenarios underexplored. To address this limitation, we propose BankMathBench, a domain-specific dataset that reflects realistic banking tasks. BankMathBench is organized in three levels of difficulty-basic, intermediate, and advanced-corresponding to single-product reasoning, multi-product comparison, and multi-condition scenarios, respectively. When trained on BankMathBench, open-source LLMs exhibited notable improvements in both formula generation and numerical reasoning accuracy, demonstrating the dataset’s effectiveness in enhancing domain-specific reasoning. With tool-augmented fine-tuning, the models achieved average accuracy increases of 57.6%p (basic), 75.1%p (intermediate), and 62.9%p (advanced), representing significant gains over zero-shot baselines. These findings highlight BankMathBench as a reliable benchmark for evaluating and advancing LLMs’ numerical reasoning in real-world banking scenarios.
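论文考察的“零存整取到期本息”一类计算,本质是等比数列求和。下面给出一个最小的参考实现(按月复利、月初存入的约定为假设,不同产品条款会有差异):

```python
# 最小示意:每月定额存入、按月复利的到期本息
def recurring_deposit_payout(monthly: float, annual_rate: float, months: int) -> float:
    r = annual_rate / 12  # 月利率
    # 第 k 个月月初存入的一笔,复利 (months - k + 1) 个月;总和为等比数列求和
    return monthly * sum((1 + r) ** (months - k + 1) for k in range(1, months + 1))

# 每月 1000 元、年利率 3%、存 12 个月
print(round(recurring_deposit_payout(1000, 0.03, 12), 2))
```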
[NLP-33] Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression
【速读】: 该论文致力于解决子比特模型压缩(sub-bit model compression)中的关键瓶颈问题,即在对神经网络权重进行极端量化(存储低于每权重1比特)时,符号位(sign bit)成为不可忽略的固定开销。研究发现,尽管权重符号矩阵在谱特性上看似随机,但其符号模式主要由初始值决定,符号翻转行为本质上源于训练过程中极少数接近零的权重穿越边界所致。为此,作者提出“符号锁定理论”(sign lock-in theory),通过停时分析揭示了符号翻转数量服从几何尾分布的机制,并据此设计了基于间隔的初始化策略与轻量级外向漂移正则化项,显著降低有效符号翻转率至约 10⁻³,同时仅带来约1个点的困惑度(perplexity)增长。
链接: https://arxiv.org/abs/2602.17063
作者: Akira Sakai,Yuma Ichikawa
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Sub-bit model compression seeks storage below one bit per weight; as magnitudes are aggressively compressed, the sign bit becomes a fixed-cost bottleneck. Across Transformers, CNNs, and MLPs, learned sign matrices resist low-rank approximation and are spectrally indistinguishable from an i.i.d. Rademacher baseline. Despite this apparent randomness, most weights retain their initialization signs; flips primarily occur via rare near-zero boundary crossings, suggesting that sign-pattern randomness is largely inherited from initialization. We formalize this behavior with sign lock-in theory, a stopping-time analysis of sign flips under SGD noise. Under bounded updates and a rare re-entry condition into a small neighborhood around zero, the number of effective sign flips exhibits a geometric tail. Building on this mechanism, we introduce a gap-based initialization and a lightweight outward-drift regularizer, reducing the effective flip rate to approximately 10^-3 with only about a one-point increase in perplexity.
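摘要的核心观察之一(多数权重保留初始化符号)很容易在检查点上直接验证。下面是一个最小示意(w_init、w_final 为假设的同形状检查点张量):

```python
# 最小示意:统计训练前后权重符号的翻转率
import torch

def sign_flip_rate(w_init: torch.Tensor, w_final: torch.Tensor) -> float:
    flipped = torch.sign(w_init) != torch.sign(w_final)
    return flipped.float().mean().item()

w_init = torch.randn(1024, 1024)
w_final = w_init + 0.05 * torch.randn(1024, 1024)  # 小幅更新:只有近零权重才会翻符号
print(sign_flip_rate(w_init, w_final))  # 远小于 0.5,示意“符号锁定”效应
```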
[NLP-34] ALPS: A Diagnostic Challenge Set for Arabic Linguistic Pragmatic Reasoning
【速读】: 该论文旨在解决当前阿拉伯语自然语言处理(Natural Language Processing, NLP)基准测试中普遍存在的数据质量不足问题,即多数基准依赖合成或翻译数据,缺乏对深层语言学特征的严谨验证。为此,作者提出ALPS(Arabic Linguistic Pragmatic Suite),一个由专家精心设计的诊断性挑战数据集,聚焦于深度语义与语用能力,通过531个严格构造的问题覆盖15项任务和47个子任务,确保文化真实性并消除翻译偏差。其关键创新在于采用母语专家标注、以morpho-syntactic(形态句法)依赖为核心评估维度,并引入单次人类表现(平均84.6%准确率)与专家裁定的Oracle(99.2%准确率)作为基准,揭示出当前主流模型在表面流畅性上的高表现与其在基础形态句法理解上的显著不足之间的关键差异。
链接: https://arxiv.org/abs/2602.17054
作者: Hussein S. Al-Olimat,Ahmad Alshareef
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:While recent Arabic NLP benchmarks focus on scale, they often rely on synthetic or translated data which may benefit from deeper linguistic verification. We introduce ALPS (Arabic Linguistic Pragmatic Suite), a native, expert-curated diagnostic challenge set probing Deep Semantics and Pragmatics, capabilities that complement specialized large-scale benchmarks. While broad-coverage benchmarks prioritize scale and multi-task coverage, ALPS targets the depth of linguistic understanding through 531 rigorously crafted questions across 15 tasks and 47 subtasks. We developed the dataset with deep expertise in Arabic linguistics, guaranteeing cultural authenticity and eliminating translation artifacts. Evaluating 23 diverse models (commercial, open-source, and Arabic-native) against a single-pass human performance (avg. 84.6% accuracy) and an expert-adjudicated oracle (99.2%), we reveal a critical dissociation: models achieve high fluency but fail on fundamental morpho-syntactic dependencies, with elevated error rates on morpho-syntactic dependencies (36.5% across diacritics-reliant tasks) compared to compositional semantics. While top commercial models (Gemini-3-flash at 94.2%) surpass the average single human, a substantial gap persists between commercial giants and Arabic-native models, with the best Arabic-specific model (Jais-2-70B at 83.6%) approaching but not matching human performance.
[NLP-35] RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models ICLR2026
【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在生成推理过程时存在“表面合理但实质不忠实”的问题,即模型输出的推理链看似逻辑通顺,却未能真实反映其决策机制,从而损害了模型的可靠性与可信度。解决方案的关键在于提出一个形式化的推理忠实性(reasoning faithfulness)框架,包含两个可验证条件:立场一致性(stance consistency)(推理与答案之间保持一致立场)和因果影响(causal influence)(在输出层面施加干预时,推理内容能显著驱动答案变化),且明确将忠实性与准确性解耦。为实现这一框架,作者构建了RFEval基准测试集(7,186个实例,覆盖7个任务),通过控制性的输出级反事实干预来量化模型的忠实性表现,实证发现当前主流LRM普遍存在忠实性缺失(49.7%输出不忠实),且这种缺陷主要源于立场不一致,并集中出现在数学和代码等脆弱、收敛性强的任务中,同时揭示准确率无法有效预测忠实性,强调未来可信AI需同时优化输出正确性和推理结构完整性。
链接: https://arxiv.org/abs/2602.17053
作者: Yunseok Han,Yejoon Lee,Jaeyoung Do
机构: Seoul National University (首尔国立大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted as a poster at ICLR 2026
Abstract:Large Reasoning Models (LRMs) exhibit strong performance, yet often produce rationales that sound plausible but fail to reflect their true decision process, undermining reliability and trust. We introduce a formal framework for reasoning faithfulness, defined by two testable conditions: stance consistency (a coherent stance linking reasoning to answer) and causal influence (the stated reasoning causally drives the answer under output-level interventions), explicitly decoupled from accuracy. To operationalize this, we present RFEval, a benchmark of 7,186 instances across seven tasks that probes faithfulness via controlled, output-level counterfactual interventions. Evaluating twelve open-source LRMs, we find unfaithfulness in 49.7% of outputs, predominantly from stance inconsistency. Failures are concentrated in brittle, convergent domains such as math and code, and correlate more with post-training regimes than with scale: within-family ablations indicate that adding current RL-style objectives on top of supervised fine-tuning can reduce reasoning faithfulness, even when accuracy is maintained. Crucially, accuracy is neither a sufficient nor a reliable proxy for faithfulness: once controlling for model and task, the accuracy-faithfulness link is weak and statistically insignificant. Our work establishes a rigorous methodology for auditing LRM reliability and shows that trustworthy AI requires optimizing not only for correct outcomes but also for the structural integrity of the reasoning process. Our code and dataset can be found at the project page: this https URL
[NLP-36] Evaluating Cross-Lingual Classification Approaches Enabling Topic Discovery for Multilingual Social Media Data
【速读】: 该论文旨在解决多语言社交媒体话语分析在自然语言处理中的挑战,特别是在跨语言公共讨论中如何实现可靠的主题发现。其核心问题是:如何从基于关键词的噪声数据中高效过滤出与特定主题(以氢能源为例)相关的内容,并在此基础上进行跨语言的主题建模。解决方案的关键在于比较四种不同的跨语言文本分类策略——包括语言特定模型、统一英文标注模型、直接应用英语微调的多语言Transformer以及混合策略——并通过实证评估其在过滤噪声和提取主题方面的有效性,从而揭示翻译与多语言方法之间的权衡关系,为优化大规模社交媒体跨语言分析流程提供可操作的洞见。
链接: https://arxiv.org/abs/2602.17051
作者: Deepak Uniyal,Md Abul Bashar,Richi Nayak
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Analysing multilingual social media discourse remains a major challenge in natural language processing, particularly when large-scale public debates span across diverse languages. This study investigates how different approaches for cross-lingual text classification can support reliable analysis of global conversations. Using hydrogen energy as a case study, we analyse a decade-long dataset of over nine million tweets in English, Japanese, Hindi, and Korean (2013–2022) for topic discovery. The online keyword-driven data collection results in a significant amount of irrelevant content. We explore four approaches to filter relevant content: (1) translating English annotated data into target languages for building language-specific models for each target language, (2) translating unlabelled data appearing from all languages into English for creating a single model based on English annotations, (3) applying English fine-tuned multilingual transformers directly to each target language data, and (4) a hybrid strategy that combines translated annotations with multilingual training. Each approach is evaluated for its ability to filter hydrogen-related tweets from noisy keyword-based collections. Subsequently, topic modeling is performed to extract dominant themes within the relevant subsets. The results highlight key trade-offs between translation and multilingual approaches, offering actionable insights into optimising cross-lingual pipelines for large-scale social media analysis.
[NLP-37] Large Language Models Persuade Without Planning Theory of Mind
【速读】: 该论文试图解决现有理论认为第一人称互动是心理理论(Theory of Mind, ToM)的核心,而当前主流评估方法多依赖静态、非交互式的问答基准,可能无法有效衡量人类与大语言模型(Large Language Models, LLMs)的真实ToM能力这一问题。解决方案的关键在于设计了一个新颖的ToM任务:要求代理通过战略性地披露信息来说服目标选择三个政策提案之一,成功与否取决于说服者对目标知识状态(knowledge states)和动机状态(motivational states)的敏感性。实验通过控制这些状态是否向说服者揭示(Revealed vs. Hidden),考察LLMs与人类在多步推理和心理状态推断上的差异,从而揭示LLMs虽在特定情境下表现出色,但其优势可能源于非显式ToM的修辞策略而非真正的心理状态建模能力。
链接: https://arxiv.org/abs/2602.17045
作者: Jared Moore,Rasmus Overmark,Ned Cooper,Beba Cibralic,Nick Haber,Cameron R. Jones
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:A growing body of work attempts to evaluate the theory of mind (ToM) abilities of humans and large language models (LLMs) using static, non-interactive question-and-answer benchmarks. However, theoretical work in the field suggests that first-personal interaction is a crucial part of ToM and that such predictive, spectatorial tasks may fail to evaluate it. We address this gap with a novel ToM task that requires an agent to persuade a target to choose one of three policy proposals by strategically revealing information. Success depends on a persuader’s sensitivity to a given target’s knowledge states (what the target knows about the policies) and motivational states (how much the target values different outcomes). We varied whether these states were Revealed to persuaders or Hidden, in which case persuaders had to inquire about or infer them. In Experiment 1, participants persuaded a bot programmed to make only rational inferences. LLMs excelled in the Revealed condition but performed below chance in the Hidden condition, suggesting difficulty with the multi-step planning required to elicit and use mental state information. Humans performed moderately well in both conditions, indicating an ability to engage such planning. In Experiment 2, where a human target role-played the bot, and in Experiment 3, where we measured whether human targets’ real beliefs changed, LLMs outperformed human persuaders across all conditions. These results suggest that effective persuasion can occur without explicit ToM reasoning (e.g., through rhetorical strategies) and that LLMs excel at this form of persuasion. Overall, our results caution against attributing human-like ToM to LLMs while highlighting LLMs’ potential to influence people’s beliefs and behavior.
[NLP-38] ReIn: Conversational Error Recovery with Reasoning Inception ICLR2026
【速读】: 该论文旨在解决基于大语言模型(Large Language Models, LLMs)的对话代理在面对用户引发的意外错误时缺乏有效恢复能力的问题,尤其是在无法进行模型微调或提示修改的现实约束下。其解决方案的关键在于提出一种称为“推理启始”(Reasoning Inception, ReIn)的测试时干预方法:通过一个外部的启始模块识别对话上下文中的预定义错误并生成恢复计划,随后将这些计划嵌入代理内部推理过程以引导纠正行为,而无需改动模型参数或系统提示。ReIn在多种代理模型与启始模块组合中显著提升任务成功率,并能泛化至未见错误类型,且优于显式提示修改方法,证明了其作为高效、实时修复机制的有效性。
链接: https://arxiv.org/abs/2602.17022
作者: Takyoung Kim,Jinseok Nam,Chandrayee Basu,Xing Fan,Chengyuan Ma,Heng Ji,Gokhan Tur,Dilek Hakkani-Tür
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Amazon(亚马逊)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ICLR 2026
Abstract:Conversational agents powered by large language models (LLMs) with tool integration achieve strong performance on fixed task-oriented dialogue datasets but remain vulnerable to unanticipated, user-induced errors. Rather than focusing on error prevention, this work focuses on error recovery, which necessitates the accurate diagnosis of erroneous dialogue contexts and execution of proper recovery plans. Under realistic constraints precluding model fine-tuning or prompt modification due to significant cost and time requirements, we explore whether agents can recover from contextually flawed interactions and how their behavior can be adapted without altering model parameters and prompts. To this end, we propose Reasoning Inception (ReIn), a test-time intervention method that plants an initial reasoning into the agent’s decision-making process. Specifically, an external inception module identifies predefined errors within the dialogue context and generates recovery plans, which are subsequently integrated into the agent’s internal reasoning process to guide corrective actions, without modifying its parameters or system prompts. We evaluate ReIn by systematically simulating conversational failure scenarios that directly hinder successful completion of user goals: user’s ambiguous and unsupported requests. Across diverse combinations of agent models and inception modules, ReIn substantially improves task success and generalizes to unseen error types. Moreover, it consistently outperforms explicit prompt-modification approaches, underscoring its utility as an efficient, on-the-fly method. In-depth analysis of its operational mechanism, particularly in relation to instruction hierarchy, indicates that jointly defining recovery tools with ReIn can serve as a safe and effective strategy for improving the resilience of conversational agents without modifying the backbone models or system prompts.
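ReIn 的“推理启始”在接口层面可以理解为:由外部模块诊断错误并生成恢复计划,再把该计划作为代理内部推理的开头注入。下面是一个高度简化的示意(detect_error、plan_recovery 与 <think> 前缀格式均为假设,非论文官方实现):

```python
# 最小示意:测试时把外部恢复计划注入为代理的起始推理,不改动系统提示与参数
def build_incepted_prompt(dialogue: str, detect_error, plan_recovery) -> str:
    error = detect_error(dialogue)         # 例如:"用户请求含糊" / "请求超出工具能力"
    seed_reasoning = plan_recovery(error)  # 例如:"我应先向用户澄清具体需求,再调用工具"
    # 将计划置于模型思维链的开头,引导其后续自行完成纠正动作
    return f"{dialogue}\n<think>{seed_reasoning}"
```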
[NLP-39] Arcee Trinity Large Technical Report
【速读】: 该论文旨在解决大规模语言模型在参数效率与训练稳定性之间的平衡问题,尤其是在保持高性能的同时降低计算资源消耗。其核心解决方案是设计并实现了一种稀疏的混合专家(Mixture-of-Experts, MoE)架构,通过动态激活少量专家(如Trinity Large每token仅激活13B参数),显著提升模型的可扩展性与推理效率。关键创新包括:采用交错局部与全局注意力机制、门控注意力(gated attention)、深度缩放的夹层归一化(depth-scaled sandwich norm)以及sigmoid路由策略优化专家分配;此外,针对Trinity Large引入了软钳制动量专家偏置更新(Soft-clamped Momentum Expert Bias Updates, SMEBU)以增强MoE负载均衡,结合Muon优化器确保训练过程零损失波动,从而实现高效且稳定的预训练。
链接: https://arxiv.org/abs/2602.17004
作者: Varun Singh,Lucas Krauss,Sami Jaghouar,Matej Sirovatka,Charles Goddard,Fares Obied,Jack Min Ong,Jannik Straube,Fern,Aria Harley,Conner Stewart,Colin Kealty,Maziyar Panahi,Simon Kirsten,Anushka Deshpande,Anneketh Vij,Arthur Bresnu,Pranav Veldurthi,Raghav Ravishankar,Hardik Bishnoi,DatologyAI Team,Arcee AI Team,Prime Intellect Team,Mark McQuade,Johannes Hagemann,Lucas Atkins
机构: DatologyAI Team3; Google(谷歌); Meta(元)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:We present the technical report for Arcee Trinity Large, a sparse Mixture-of-Experts model with 400B total parameters and 13B activated per token. Additionally, we report on Trinity Nano and Trinity Mini, with Trinity Nano having 6B total parameters with 1B activated per token, Trinity Mini having 26B total parameters with 3B activated per token. The models’ modern architecture includes interleaved local and global attention, gated attention, depth-scaled sandwich norm, and sigmoid routing for Mixture-of-Experts. For Trinity Large, we also introduce a new MoE load balancing strategy titled Soft-clamped Momentum Expert Bias Updates (SMEBU). We train the models using the Muon optimizer. All three models completed training with zero loss spikes. Trinity Nano and Trinity Mini were pre-trained on 10 trillion tokens, and Trinity Large was pre-trained on 17 trillion tokens. The model checkpoints are available at this https URL.
[NLP-40] Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History
【速读】: 该论文旨在解决当前大型语言模型驱动的网络代理(web agents)缺乏个性化能力的问题,即在用户未明确表达全部意图的情况下,代理难以通过推断用户偏好和上下文来处理模糊查询。其解决方案的关键在于提出首个面向真实开放网络的个性化网络代理评估基准——Persona2Web,该基准基于“澄清-个性化”(clarify-to-personalize)原则,要求代理根据用户历史行为而非显式指令来消除歧义;该框架包含揭示长期隐式偏好的用户历史数据、需要推理隐含偏好的模糊查询,以及支持细粒度评估个性化的推理感知评价体系,从而系统性地推动个性化网络代理的发展与评测。
链接: https://arxiv.org/abs/2602.17003
作者: Serin Kim,Sangam Lee,Dongha Lee
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models have advanced web agents, yet current agents lack personalization capabilities. Since users rarely specify every detail of their intent, practical web agents must be able to interpret ambiguous queries by inferring user preferences and contexts. To address this challenge, we present Persona2Web, the first benchmark for evaluating personalized web agents on the real open web, built upon the clarify-to-personalize principle, which requires agents to resolve ambiguity based on user history rather than relying on explicit instructions. Persona2Web consists of: (1) user histories that reveal preferences implicitly over long time spans, (2) ambiguous queries that require agents to infer implicit user preferences, and (3) a reasoning-aware evaluation framework that enables fine-grained assessment of personalization. We conduct extensive experiments across various agent architectures, backbone models, history access schemes, and queries with varying ambiguity levels, revealing key challenges in personalized web agent behavior. For reproducibility, our codes and datasets are publicly available at this https URL.
[NLP-41] Sonar-TS: Search-Then-Verify Natural Language Querying for Time Series Databases
【速读】: 该论文旨在解决自然语言查询时间序列数据库(Natural Language Querying for Time Series Databases, NLQ4TSDB)中的两大核心挑战:一是现有Text-to-SQL方法难以处理连续形态意图(如形状或异常),二是传统时间序列模型无法有效应对超长历史数据的查询任务。解决方案的关键在于提出一种神经符号框架Sonar-TS,其采用“搜索-验证”(Search-Then-Verify)流水线机制,类比主动声呐(active sonar)原理,先通过特征索引利用SQL快速筛选候选时间窗口,再由生成的Python程序对原始信号进行精确定位与验证,从而实现对复杂时序语义的高效精准解析。
链接: https://arxiv.org/abs/2602.17001
作者: Zhao Tan,Yiji Zhao,Shiyu Wang,Chang Xu,Yuxuan Liang,Xiping Liu,Shirui Pan,Ming Jin
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)
备注:
Abstract:Natural Language Querying for Time Series Databases (NLQ4TSDB) aims to assist non-expert users retrieve meaningful events, intervals, and summaries from massive temporal records. However, existing Text-to-SQL methods are not designed for continuous morphological intents such as shapes or anomalies, while time series models struggle to handle ultra-long histories. To address these challenges, we propose Sonar-TS, a neuro-symbolic framework that tackles NLQ4TSDB via a Search-Then-Verify pipeline. Analogous to active sonar, it utilizes a feature index to ping candidate windows via SQL, followed by generated Python programs to lock on and verify candidates against raw signals. To enable effective evaluation, we introduce NLQTSBench, the first large-scale benchmark designed for NLQ over TSDB-scale histories. Our experiments highlight the unique challenges within this domain and demonstrate that Sonar-TS effectively navigates complex temporal queries where traditional methods fail. This work presents the first systematic study of NLQ4TSDB, offering a general framework and evaluation standard to facilitate future research.
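Sonar-TS 的“搜索-验证”两段式可以用“SQL 粗筛 + 程序精验”来概括。下面是一个最小示意(特征索引表的结构、字段与阈值均为假设,并非该系统的真实模式):

```python
# 最小示意:先在特征索引上 SQL 粗筛候选窗口,再回到原始信号逐一验证
import sqlite3
import numpy as np

def search_then_verify(db_path: str, raw: np.ndarray, window: int = 96):
    conn = sqlite3.connect(db_path)
    # 第一步(ping):例如筛出“波动显著偏高”的时间窗口
    rows = conn.execute(
        "SELECT start_idx FROM window_features WHERE stddev > 3 * typical_stddev"
    ).fetchall()
    # 第二步(lock on):用程序化判据在原始信号上精确验证形态条件
    hits = []
    scale = np.median(np.abs(np.diff(raw)))
    for (s,) in rows:
        seg = raw[s : s + window]
        if seg.max() - seg.min() > 5 * scale:  # 假设的尖峰判据
            hits.append(s)
    return hits
```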
[NLP-42] Exploring LLMs for User Story Extraction from Mockups
【速读】: 该论文旨在解决软件需求工程中用户故事(User Story)生成效率低、沟通成本高的问题,尤其是在缺乏专业术语一致性的情况下,开发者与用户之间难以准确理解需求。解决方案的关键在于利用大语言模型(Large Language Models, LLMs)从高保真原型图(high-fidelity mockups)中自动提取用户故事,并通过在提示词(prompt)中引入语言扩展词汇表(Language Extended Lexicon, LEL)来增强模型对领域术语的理解能力。实证结果表明,加入LEL显著提升了生成用户故事的准确性与适用性,从而推动了人工智能在需求工程中的集成应用。
链接: https://arxiv.org/abs/2602.16997
作者: Diego Firmenich,Leandro Antonelli,Bruno Pazos,Fabricio Lozada,Leonardo Morales
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 14 pages, 6 figures. Preprint of the paper published in the 28th Workshop on Requirements Engineering (WER 2025)
Abstract:User stories are one of the most widely used artifacts in the software industry to define functional requirements. In parallel, the use of high-fidelity mockups facilitates end-user participation in defining their needs. In this work, we explore how combining these techniques with large language models (LLMs) enables agile and automated generation of user stories from mockups. To this end, we present a case study that analyzes the ability of LLMs to extract user stories from high-fidelity mockups, both with and without the inclusion of a glossary of the Language Extended Lexicon (LEL) in the prompts. Our results demonstrate that incorporating the LEL significantly enhances the accuracy and suitability of the generated user stories. This approach represents a step forward in the integration of AI into requirements engineering, with the potential to improve communication between users and developers.
[NLP-43] Characterizing the Predictive Impact of Modalities with Supervised Latent-Variable Modeling
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在训练和推理过程中对完整多模态数据的强依赖问题,即当某些模态缺失、异步采集或仅存在于部分样本中时,现有方法难以有效利用不完整的多模态数据。解决方案的关键在于提出PRIMO——一种监督式隐变量插补模型,通过引入一个隐变量来建模缺失模态与已观测模态之间的预测关系,并在推理阶段从学习到的缺失模态分布中采样多个完成版本,从而既获得边际预测分布以支持预测任务,又能基于方差度量量化每个实例上缺失模态的预测影响。该方法实现了对所有可用训练样本(无论模态是否完整)的有效利用,在多种数据集上表现出与单模态和全模态基线相当的性能。
链接: https://arxiv.org/abs/2602.16979
作者: Divyam Madaan,Sumit Chopra,Kyunghyun Cho
机构: New York University (纽约大学); New York University Grossman School of Medicine (纽约大学格罗斯曼医学院); Genentech (基因泰克); CIFAR (加拿大高级研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Despite the recent success of Multimodal Large Language Models (MLLMs), existing approaches predominantly assume the availability of multiple modalities during training and inference. In practice, multimodal data is often incomplete because modalities may be missing, collected asynchronously, or available only for a subset of examples. In this work, we propose PRIMO, a supervised latent-variable imputation model that quantifies the predictive impact of any missing modality within the multimodal learning setting. PRIMO enables the use of all available training examples, whether modalities are complete or partial. Specifically, it models the missing modality through a latent variable that captures its relationship with the observed modality in the context of prediction. During inference, we draw many samples from the learned distribution over the missing modality to both obtain the marginal predictive distribution (for the purpose of prediction) and analyze the impact of the missing modalities on the prediction for each instance. We evaluate PRIMO on a synthetic XOR dataset, Audio-Vision MNIST, and MIMIC-III for mortality and ICD-9 prediction. Across all datasets, PRIMO obtains performance comparable to unimodal baselines when a modality is fully missing and to multimodal baselines when all modalities are available. PRIMO quantifies the predictive impact of a modality at the instance level using a variance-based metric computed from predictions across latent completions. We visually demonstrate how varying completions of the missing modality result in a set of plausible labels.
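PRIMO 的实例级影响度量可以概括为:从缺失模态的学习分布中采样多个补全,考察预测随补全变化的方差。下面是一个接口层面的最小示意(sample_completion 与 predict 为假设的已训练组件):

```python
# 最小示意:基于潜变量多次补全的预测方差,度量缺失模态的实例级影响
import numpy as np

def modality_impact(sample_completion, predict, x_obs, n_samples: int = 64) -> float:
    """sample_completion(x_obs) 从学到的缺失模态分布中采样一个补全;
    predict(x_obs, z) 返回类别概率向量。"""
    preds = np.stack([predict(x_obs, sample_completion(x_obs))
                      for _ in range(n_samples)])
    return float(preds.var(axis=0).mean())  # 方差越大,缺失模态对该样本的预测影响越大
```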
[NLP-44] HQFS: Hybrid Quantum-Classical Financial Security with VQC Forecasting, QUBO Annealing, and Audit-Ready Post-Quantum Signing
【速读】: 该论文旨在解决传统金融风险系统中预测与优化分离架构在实际应用中面临的稳定性差、可解释性弱及计算效率低的问题,尤其是在引入离散约束(如最小交易单位、持仓上限)或市场波动时决策不稳定、审计困难等挑战。解决方案的关键在于提出HQFS(Hybrid Quantum-Classical Financial Security)混合流水线,其核心包括三点:一是利用变分量子电路(VQC)结合小规模经典头部网络联合学习下一阶段收益和波动率代理,提升预测精度;二是将风险-收益目标与约束转化为二次无约束布尔优化(QUBO)问题,通过量子退火求解器实现高效离散优化,并以经典QUBO求解器作为部署备用方案;三是采用后量子数字签名机制对每次再平衡输出进行签名,确保分配记录可验证且无需依赖运行环境的信任,从而实现全流程的可审计性。
链接: https://arxiv.org/abs/2602.16976
作者: Srikumar Nayak
机构: Incedo Inc(印度科多公司)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 11 pages, 1 figure, 4 tables
Abstract:Financial risk systems usually follow a two-step routine: a model predicts return or risk, and then an optimizer makes a decision such as a portfolio rebalance. In practice, this split can break under real constraints. The prediction model may look good, but the final decision can be unstable when the market shifts, when discrete constraints are added (lot sizes, caps), or when the optimization becomes slow for larger asset sets. Also, regulated settings need a clear audit trail that links each decision to the exact model state and inputs. We present HQFS, a practical hybrid pipeline that connects forecasting, discrete risk optimization, and auditability in one flow. First, HQFS learns next-step return and a volatility proxy using a variational quantum circuit (VQC) with a small classical head. Second, HQFS converts the risk-return objective and constraints into a QUBO and solves it with quantum annealing when available, while keeping a compatible classical QUBO solver as a fallback for deployment. Third, HQFS signs each rebalance output using a post-quantum signature so the allocation can be verified later without trusting the runtime environment. On our market dataset study, HQFS reduces return prediction error by 7.8% and volatility prediction error by 6.1% versus a tuned classical baseline. For the decision layer, HQFS improves out-of-sample Sharpe by 9.4% and lowers maximum drawdown by 11.7%. The QUBO solve stage also cuts average solve time by 28% compared to a mixed-integer baseline under the same constraints, while producing fully traceable, signed allocation records.
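摘要中“把风险-收益目标与约束转成 QUBO”的一步,可以用二值选股的均值-方差目标来示意(惩罚系数与“恰好持有 k 只资产”的约束形式均为假设):

```python
# 最小示意:带持仓数约束的均值-方差目标改写为 QUBO 矩阵(min x^T Q x, x ∈ {0,1}^n)
import numpy as np

def portfolio_qubo(mu: np.ndarray, cov: np.ndarray, k: int,
                   lam: float = 1.0, penalty: float = 10.0) -> np.ndarray:
    n = len(mu)
    Q = lam * cov.copy()                    # 风险项
    Q[np.diag_indices(n)] -= mu             # 线性收益项放到对角线(x_i^2 = x_i)
    # 约束 sum(x) = k 的罚项:(sum(x) - k)^2 展开后(略去常数项)为 ones - 2k*I
    Q += penalty * (np.ones((n, n)) - 2 * k * np.eye(n))
    return Q  # 可交给量子退火器或经典 QUBO 求解器

Q = portfolio_qubo(np.array([0.08, 0.05, 0.06]), np.eye(3) * 0.02, k=2)
```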
[NLP-45] Eigenmood Space: Uncertainty-Aware Spectral Graph Analysis of Psychological Patterns in Classical Persian Poetry
Summary: This paper addresses the difficulty of quantifying affective expression in classical Persian poetry: how to preserve the depth of close reading while enabling reproducible, large-scale psychological comparison. The core challenge is that poetry conveys emotion through metaphor, intertextual convention, and rhetorical indirection, which conventional computational methods struggle to model. The key is an uncertainty-aware computational framework: verses are automatically annotated with multi-label psychological concepts, each label carrying a confidence score and an abstention flag that preserves contexts with insufficient evidence; confidence-weighted evidence is aggregated into a Poet × Concept matrix, with Jensen-Shannon and Kullback-Leibler divergences quantifying each poet's psychological individuality; and a confidence-weighted concept co-occurrence graph yields an Eigenmood embedding space capturing structural relations among concepts. The framework propagates uncertainty from verse-level evidence to poet-level inference, supporting auditable large-scale digital-humanities analysis while preserving interpretive caution.
Link: https://arxiv.org/abs/2602.16959
Authors: Kourosh Shahnazari,Seyed Moein Ayyoubzadeh,Mohammadali Keshtparvar
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Classical Persian poetry is a historically sustained archive in which affective life is expressed through metaphor, intertextual convention, and rhetorical indirection. These properties make close reading indispensable while limiting reproducible comparison at scale. We present an uncertainty-aware computational framework for poet-level psychological analysis based on large-scale automatic multi-label annotation. Each verse is associated with a set of psychological concepts, per-label confidence scores, and an abstention flag that signals insufficient evidence. We aggregate confidence-weighted evidence into a Poet × Concept matrix, interpret each poet as a probability distribution over concepts, and quantify poetic individuality as divergence from a corpus baseline using Jensen–Shannon divergence and Kullback–Leibler divergence. To capture relational structure beyond marginals, we build a confidence-weighted co-occurrence graph over concepts and define an Eigenmood embedding through Laplacian spectral decomposition. On a corpus of 61,573 verses across 10 poets, 22.2% of verses are abstained, underscoring the analytical importance of uncertainty. We further report sensitivity analysis under confidence thresholding, selection-bias diagnostics that treat abstention as a category, and a distant-to-close workflow that retrieves verse-level exemplars along Eigenmood axes. The resulting framework supports scalable, auditable digital-humanities analysis while preserving interpretive caution by propagating uncertainty from verse-level evidence to poet-level inference.
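A minimal sketch of the two quantitative steps the abstract names, divergence from a corpus baseline and the spectral Eigenmood embedding, with a random stand-in for the Poet × Concept matrix. The matrix values, the exact graph construction, and the choice of unnormalized Laplacian are assumptions, not the paper's precise setup.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.linalg import eigh

rng = np.random.default_rng(0)
M = rng.random((10, 40))                       # toy Poet x Concept evidence matrix
P = M / M.sum(axis=1, keepdims=True)           # each poet as a distribution over concepts
baseline = P.mean(axis=0)                      # corpus-level concept distribution
individuality = np.array([jensenshannon(p, baseline) ** 2 for p in P])  # JS divergence per poet

W = M.T @ M                                    # confidence-weighted concept co-occurrence
np.fill_diagonal(W, 0)
L = np.diag(W.sum(axis=1)) - W                 # unnormalized graph Laplacian
eigvals, eigvecs = eigh(L)
eigenmood_axes = eigvecs[:, 1:4]               # first non-trivial eigenvectors as "mood" axes
print(individuality.round(3), eigenmood_axes.shape)
```

Note that `scipy.spatial.distance.jensenshannon` returns the JS distance, so it is squared to recover the divergence the abstract reports.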
[NLP-46] When Semantic Overlap Is Not Enough: Cross-Lingual Euphemism Transfer Between Turkish and English
Summary: This paper studies transfer learning for cross-lingual euphemism detection, asking how cross-lingual equivalence of euphemistic expressions affects transfer in multilingual settings. The key is a categorization of Potentially Euphemistic Terms (PETs): Turkish and English PETs are split into Overlapping (OPETs) and Non-Overlapping (NOPETs) subsets based on functional, pragmatic, and semantic alignment. This reveals that transfer is not determined by semantic similarity alone but is shaped by cross-language differences in label distribution and by domain-specific alignment, which explains why transfer can degrade in the low-resource Turkish-to-English direction even when semantic overlap exists, and why training on non-overlapping PETs can in some cases improve results.
Link: https://arxiv.org/abs/2602.16957
Authors: Hasan Can Biyik,Libby Barak,Jing Peng,Anna Feldman
Affiliation: Montclair State University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Euphemisms substitute socially sensitive expressions, often softening or reframing meaning, and their reliance on cultural and pragmatic context complicates modeling across languages. In this study, we investigate how cross-lingual equivalence influences transfer in multilingual euphemism detection. We categorize Potentially Euphemistic Terms (PETs) in Turkish and English into Overlapping (OPETs) and Non-Overlapping (NOPETs) subsets based on their functional, pragmatic, and semantic alignment. Our findings reveal a transfer asymmetry: semantic overlap is insufficient to guarantee positive transfer, particularly in low-resource Turkish-to-English direction, where performance can degrade even for overlapping euphemisms, and in some cases, improve under NOPET-based training. Differences in label distribution help explain these counterintuitive results. Category-level analysis suggests that transfer may be influenced by domain-specific alignment, though evidence is limited by sparsity.
[NLP-47] ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders EACL2026
Summary: This paper addresses the "realism gap" of LLM-based user simulators used to improve conversational AI: systems optimized against simulated interactions may fail to generalize to real-world use. The key is the ConvApparel dataset together with a comprehensive validation framework. ConvApparel uses a dual-agent collection protocol (with both "good" and "bad" recommenders) and captures first-person user-satisfaction annotations across a wide spectrum of interaction experiences, enabling counterfactual validation; the framework combines statistical alignment, a human-likeness score, and counterfactual validation to test generalization. Experiments show a significant realism gap across all simulators, but data-driven simulators adapt more realistically under counterfactual validation, suggesting they embody more robust user models.
Link: https://arxiv.org/abs/2602.16938
Authors: Ofer Meshi,Krisztian Balog,Sally Goldman,Avi Caciularu,Guy Tennenholtz,Jihwan Jeong,Amir Globerson,Craig Boutilier
Affiliation: Google
Subjects: Computation and Language (cs.CL)
Comments: EACL 2026
Abstract:The promise of LLM-based user simulators to improve conversational AI is hindered by a critical “realism gap,” leading to systems that are optimized for simulated interactions, but may fail to perform well in the real world. We introduce ConvApparel, a new dataset of human-AI conversations designed to address this gap. Its unique dual-agent data collection protocol – using both “good” and “bad” recommenders – enables counterfactual validation by capturing a wide spectrum of user experiences, enriched with first-person annotations of user satisfaction. We propose a comprehensive validation framework that combines statistical alignment, a human-likeness score, and counterfactual validation to test for generalization. Our experiments reveal a significant realism gap across all simulators. However, the framework also shows that data-driven simulators outperform a prompted baseline, particularly in counterfactual validation where they adapt more realistically to unseen behaviors, suggesting they embody more robust, if imperfect, user models.
[NLP-48] A Conceptual Hybrid Framework for Post-Quantum Security: Integrating BB84 QKD AES and Bio-inspired Mechanisms
Summary: This paper addresses the security threat quantum computing poses to classical public-key cryptography such as RSA, in particular the vulnerability of RSA keys to Shor's algorithm, which factors large integers efficiently in polynomial time. The key is a hybrid security framework combining AES encryption (classical symmetric cryptography), BB84 Quantum Key Distribution (QKD) for eavesdropping-resistant key agreement, quantum state comparison for lightweight authentication, and a bio-inspired immune mechanism for adaptive threat detection, yielding a scalable, adaptive data-protection scheme for the post-quantum era.
Link: https://arxiv.org/abs/2602.16922
Authors: Md. Ismiel Hossen Abir
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments:
Abstract:Quantum computing poses a significant risk to classical cryptography, especially RSA, which depends on the difficulty of factoring large numbers. Classical factorization methods, such as Trial Division and Pollard's Rho, are inefficient for large keys, while Shor's quantum algorithm can break RSA efficiently in polynomial time. This research studies RSA's vulnerabilities under both classical and quantum attacks and designs a hybrid security framework to ensure data protection in the post-quantum era. The conceptual framework combines AES encryption for classical security, BB84 Quantum Key Distribution (QKD) for secure key exchange with eavesdropping detection, quantum state comparison for lightweight authentication, and a bio-inspired immune system for adaptive threat detection. In evaluation, RSA is vulnerable to Shor's algorithm, while BB84 achieves full key agreement in ideal conditions and detects eavesdropping with high accuracy. The conceptual model spans both classical and quantum security methods, providing a scalable and adaptive solution for post-quantum data protection. This work primarily proposes a conceptual framework; detailed implementation, security proofs, and extensive experimental validation are left as future work.
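As a small illustration of the BB84 sifting step the framework relies on, here is a toy simulation of an ideal channel with no eavesdropper. If an intercept-resend attacker were inserted, roughly 25% of the sifted key bits would disagree, which is the detection signal the abstract refers to.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
alice_bits  = rng.integers(0, 2, n)
alice_bases = rng.integers(0, 2, n)   # 0 = rectilinear basis, 1 = diagonal basis
bob_bases   = rng.integers(0, 2, n)

# Ideal channel, no Eve: Bob reads the bit correctly iff the bases match;
# otherwise his measurement outcome is uniformly random.
bob_bits = np.where(bob_bases == alice_bases, alice_bits, rng.integers(0, 2, n))

sift = alice_bases == bob_bases            # publicly compare bases, keep matching rounds
key_a, key_b = alice_bits[sift], bob_bits[sift]
qber = np.mean(key_a != key_b)             # ~0 here; ~0.25 under intercept-resend
print(f"sifted key length: {sift.sum()}, QBER: {qber:.3f}")
```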
[NLP-49] Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents
Summary: This paper addresses bottlenecks of current GUI agent models in multi-platform adaptability, real-time interaction, and efficiency on complex tasks, particularly cross-platform GUI automation, grounding, tool-calling, and long-term memory and knowledge reasoning. The solution rests on three parts: a Hybrid Data Flywheel that combines simulated environments with cloud sandbox environments to improve the efficiency and quality of data collection; a Unified Thought-Synthesis Pipeline that strengthens reasoning and targets core agent abilities such as tool use, memory retention, and multi-agent collaboration; and MRPO, a new multi-platform environment reinforcement-learning algorithm that mitigates multi-platform conflicts and raises training efficiency on long-horizon tasks. Together these innovations bring GUI-Owl-1.5 to state-of-the-art performance on many open-source benchmarks.
Link: https://arxiv.org/abs/2602.16855
Authors: Haiyang Xu,Xi Zhang,Haowei Liu,Junyang Wang,Zhaozai Zhu,Shengjie Zhou,Xuhao Hu,Feiyu Gao,Junjie Cao,Zihua Wang,Zhiyuan Chen,Jitong Liao,Qi Zheng,Jiahui Zeng,Ze Xu,Shuai Bai,Junyang Lin,Jingren Zhou,Ming Yan
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 25 pages, 11 figures, 11 tables
Abstract:The paper introduces GUI-Owl-1.5, the latest native GUI agent model, which features instruct/thinking variants in multiple sizes (2B/4B/8B/32B/235B) and supports a range of platforms (desktop, mobile, browser, and more) to enable cloud-edge collaboration and real-time interaction. GUI-Owl-1.5 achieves state-of-the-art results among open-source models on more than 20 GUI benchmarks: (1) on GUI automation tasks, it obtains 56.5 on OSWorld, 71.6 on AndroidWorld, and 48.4 on WebArena; (2) on grounding tasks, it obtains 80.3 on ScreenSpotPro; (3) on tool-calling tasks, it obtains 47.6 on OSWorld-MCP and 46.8 on MobileWorld; (4) on memory and knowledge tasks, it obtains 75.5 on GUI-Knowledge Bench. GUI-Owl-1.5 incorporates several key innovations: (1) Hybrid Data Flywheel: we construct the data pipeline for UI understanding and trajectory generation on a combination of simulated environments and cloud-based sandbox environments, improving the efficiency and quality of data collection. (2) Unified Enhancement of Agent Capabilities: we use a unified thought-synthesis pipeline to enhance the model's reasoning capabilities, with particular emphasis on key agent abilities, including Tool/MCP use, memory, and multi-agent adaptation. (3) Multi-platform Environment RL Scaling: we propose a new environment RL algorithm, MRPO, to address the challenges of multi-platform conflicts and the low training efficiency of long-horizon tasks. The GUI-Owl-1.5 models are open-sourced, and an online cloud-sandbox demo is available at this https URL.
[NLP-50] Meenz bleibt Meenz but Large Language Models Do Not Speak Its Dialect LREC2026
Summary: This paper addresses the risk of losing Meenzerisch, the endangered German dialect of Mainz, and explores what NLP can contribute to dialect preservation and revival. The key contribution is the first NLP-ready digital dictionary dataset for Meenzerisch, containing 2,351 dialect words paired with Standard German definitions, which is used to evaluate large language models (LLMs) on two core tasks: generating definitions for dialect words and generating dialect words from definitions. Experiments show that current LLMs perform poorly on both (best accuracies of only 6.27% and 1.51%), and even with few-shot learning and rule extraction accuracy stays below 10%, underscoring the urgent need for systematic NLP research on German dialects.
Link: https://arxiv.org/abs/2602.16852
Authors: Minh Duc Bui,Manuel Mager,Peter Herbert Kann,Katharina von der Wense
Affiliation: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted at LREC 2026
Abstract:Meenzerisch, the dialect spoken in the German city of Mainz, is also the traditional language of the Mainz carnival, a yearly celebration well known throughout Germany. However, Meenzerisch is on the verge of dying out, a fate it shares with many other German dialects. Natural language processing (NLP) has the potential to help with the preservation and revival efforts of languages and dialects. However, so far no NLP research has looked at Meenzerisch. This work presents the first research in the field of NLP that is explicitly focused on the dialect of Mainz. We introduce a digital dictionary, an NLP-ready dataset derived from an existing resource (Schramm, 1966), to support researchers in modeling and benchmarking the language. It contains 2,351 words in the dialect paired with their meanings described in Standard German. We then use this dataset to answer the following research questions: (1) Can state-of-the-art large language models (LLMs) generate definitions for dialect words? (2) Can LLMs generate words in Meenzerisch, given their definitions? Our experiments show that LLMs can do neither: the best model for definitions reaches only 6.27% accuracy and the best word generation model's accuracy is 1.51%. We then conduct two additional experiments in order to see if accuracy is improved by few-shot learning and by extracting rules from the training set, which are then passed to the LLM. While those approaches are able to improve the results, accuracy remains below 10%. This highlights that additional resources and an intensification of research efforts focused on German dialects are desperately needed.
[NLP-51] BanglaSummEval: Reference-Free Factual Consistency Evaluation for Bangla Summarization EACL2026
Summary: This paper addresses factual-consistency evaluation for summarization in low-resource languages such as Bangla, where existing methods typically require reference summaries and overlook the language's specifics. The key is BanglaSummEval, a reference-free question-answering (QA) based framework in which a single multilingual instruction-tuned language model handles question generation, answer extraction, question answering, and importance weighting, unifying the assessment of factual accuracy and content coverage; BERTScore-Recall measures semantic consistency between answers beyond surface overlap. Empirically, the method correlates strongly with expert human judgments (Pearson's r = 0.694, Spearman's ρ = 0.763) and provides interpretable step-wise diagnostics, offering a transparent and practical solution for factual-consistency evaluation in low-resource settings.
Link: https://arxiv.org/abs/2602.16843
Authors: Ahmed Rafid,Rumman Adib,Fariya Ahmed,Ajwad Abrar,Mohammed Saidul Islam
Affiliation: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted in 2nd LoResLM at EACL 2026
Abstract:Evaluating factual consistency is essential for reliable text summarization, particularly in high-stakes domains such as healthcare and news. However, most existing evaluation metrics overlook Bangla, a widely spoken yet under-resourced language, and often depend on reference summaries. We introduce BanglaSummEval, a reference-free, question-answering-based framework for evaluating factual consistency in Bangla summarization. The proposed method assesses both factual accuracy and content coverage through automatically generated questions and answers derived from the source document and the summary. A single multilingual instruction-tuned language model handles question generation, question answering, candidate answer extraction, and question importance weighting. This unified design reduces system complexity and computational cost. To capture semantic consistency beyond surface-level overlap, we use BERTScore-Recall for answer comparison. We validate BanglaSummEval on 300 human-written summaries from educational and medical domains, demonstrating strong correlation with expert human judgments (Pearson's r = 0.694, Spearman's ρ = 0.763). By providing interpretable, step-wise diagnostics alongside reliable evaluation scores, BanglaSummEval offers a practical and transparent solution for factual consistency evaluation in low-resource language settings.
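A sketch of the final scoring step only, assuming the question-generation and answering stages (handled by one multilingual LLM in the paper) have already produced aligned answer lists and importance weights. The function name and the weighted aggregation are illustrative; `bert_score` supplies the recall component.

```python
from bert_score import score

def qa_consistency(doc_answers, summary_answers, weights, lang="bn"):
    """Importance-weighted BERTScore-Recall between answers grounded in the
    summary and answers grounded in the source document."""
    _, recall, _ = score(summary_answers, doc_answers, lang=lang)
    total = sum(weights)
    return sum(w / total * r.item() for w, r in zip(weights, recall))

consistency = qa_consistency(
    doc_answers=["ঢাকা", "২০২০ সালে"],       # answers extracted from the source
    summary_answers=["ঢাকা", "২০২১ সালে"],   # answers extracted from the summary
    weights=[0.7, 0.3],                      # question importance weights
)
print(round(consistency, 3))
```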
[NLP-52] Training Large Reasoning Models Efficiently via Progressive Thought Encoding ICLR2026
Summary: This paper targets the efficiency bottleneck of reinforcement-learning (RL) training for large reasoning models (LRMs), where long rollouts dominated by autoregressive decoding consume substantial time and memory. The core difficulty is that sliding-window cache strategies bound memory but disrupt long-context reasoning and degrade performance. The key is Progressive Thought Encoding, a parameter-efficient fine-tuning method that progressively encodes intermediate reasoning into fixed-size vector representations, avoiding backpropagation through full-cache rollouts, sharply reducing memory consumption, and keeping memory constant at inference. Experiments show consistent gains on several mathematical benchmarks, outperforming LoRA fine-tuning and non-fine-tuned baselines under tight cache budgets.
Link: https://arxiv.org/abs/2602.16839
Authors: Zeliang Zhang,Xiaodong Liu,Hao Cheng,Hao Sun,Chenliang Xu,Jianfeng Gao
Affiliation: University of Rochester; Microsoft Research
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: ICLR 2026, 15 pages
Abstract:Large reasoning models (LRMs) excel on complex problems but face a critical barrier to efficiency: reinforcement learning (RL) training requires long rollouts for outcome-based rewards, where autoregressive decoding dominates time and memory usage. While sliding-window cache strategies can bound memory, they disrupt long-context reasoning and degrade performance. We introduce Progressive Thought Encoding, a parameter-efficient fine-tuning method that enables LRMs to reason effectively under fixed-size caches. By progressively encoding intermediate reasoning into fixed-size vector representations, our approach eliminates the need to backpropagate through full-cache rollouts, thereby reducing memory usage, while maintaining constant memory during inference. Experiments on three models, including Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, and DeepSeek-R1-Distill-Llama-8B, on six widely used challenging mathematical benchmarks show consistent gains: our method achieves +19.3% improvement over LoRA-based fine-tuning and +29.9% over LRMs without fine-tuning on average, with up to +23.4 accuracy improvement on AIME2024/2025 under the same tight cache budgets. These results demonstrate that Progressive Thought Encoding not only improves reasoning accuracy but also makes RL training of LRMs substantially more efficient and scalable under real-world memory constraints.
[NLP-53] Claim Automation using Large Language Model
Summary: This paper addresses the limited deployment of large language models (LLMs) in regulated, data-sensitive domains such as insurance, in particular how to generate policy-compliant structured corrective-action recommendations from unstructured claim narratives. The key is a locally deployed, governance-aware language modeling component built on millions of historical warranty claims: a pretrained LLM is fine-tuned for the domain with Low-Rank Adaptation (LoRA) and scoped to an initial decision module within the claim-processing pipeline, speeding up adjusters' decisions. Empirically, the approach substantially outperforms commercial general-purpose models and prompt-based baselines, with roughly 80% of evaluated cases producing near-identical matches to ground-truth corrective actions, showing that domain-adaptive fine-tuning aligns model output distributions with real operational data.
Link: https://arxiv.org/abs/2602.16836
Authors: Zhengda Mo,Zhiyu Quan,Eli O'Donohue,Kaiwen Zhong
Affiliation: University of Illinois at Urbana-Champaign; PCMI Corporation
Subjects: Computation and Language (cs.CL)
Comments: 46 pages, 12 figures. Code and data processing pipeline described
Abstract:While Large Language Models (LLMs) have achieved strong performance on general-purpose language tasks, their deployment in regulated and data-sensitive domains, including insurance, remains limited. Leveraging millions of historical warranty claims, we propose a locally deployed governance-aware language modeling component that generates structured corrective-action recommendations from unstructured claim narratives. We fine-tune pretrained LLMs using Low-Rank Adaptation (LoRA), scoping the model to an initial decision module within the claim processing pipeline to speed up claim adjusters’ decisions. We assess this module using a multi-dimensional evaluation framework that combines automated semantic similarity metrics with human evaluation, enabling a rigorous examination of both practical utility and predictive accuracy. Our results show that domain-specific fine-tuning substantially outperforms commercial general-purpose and prompt-based LLMs, with approximately 80% of the evaluated cases achieving near-identical matches to ground-truth corrective actions. Overall, this study provides both theoretical and empirical evidence to prove that domain-adaptive fine-tuning can align model output distributions more closely with real-world operational data, demonstrating its promise as a reliable and governable building block for insurance applications.
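The LoRA setup the abstract describes maps onto the standard `peft` recipe. A minimal sketch follows; the base checkpoint, rank, and target modules below are illustrative assumptions, not the paper's configuration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # small stand-in; the paper's base LLM is larger
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["c_attn"],        # the attention projection module in GPT-2
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()    # only a small fraction of the weights are trained
```

After this wrapping, ordinary supervised fine-tuning on claim-narrative/corrective-action pairs updates just the low-rank adapters, which is what makes local, governance-friendly deployment practical.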
[NLP-54] IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages EACL
Summary: This paper addresses the gap in multilingual safety-alignment evaluation of large language models (LLMs), where adversarial robustness in Indic and South Asian languages has long been neglected: existing evaluations are mostly English-only and contract-bound, leaving multilingual vulnerabilities hidden. The key is Indic Jailbreak Robustness (IJR), a judge-free benchmark covering 12 Indic and South Asian languages (about 2.1 billion speakers) with 45,216 prompts split into a JSON (contract-bound) and a Free (naturalistic) track. IJR shows that contracts inflate refusals without stopping jailbreaks; that English-to-Indic attacks transfer strongly, with format wrappers often beating instruction wrappers; and that romanized or mixed inputs significantly reduce jailbreak success rate (JSR), with systematic correlations to romanization share and tokenization (r ≈ 0.28-0.32). The benchmark provides a reproducible multilingual stress test exposing risks that English-only evaluation hides, especially for South Asian users who frequently code-switch and romanize.
Link: https://arxiv.org/abs/2602.16832
Authors: Priyaranjan Pattnayak,Sanchari Chowdhuri
Affiliation: Oracle America Inc.
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted in EACL Industry Track Oral, 2026
Abstract:Safety alignment of large language models (LLMs) is mostly evaluated in English and contract-bound, leaving multilingual vulnerabilities understudied. We introduce Indic Jailbreak Robustness (IJR), a judge-free benchmark for adversarial safety across 12 Indic and South Asian languages (2.1 billion speakers), covering 45,216 prompts in JSON (contract-bound) and Free (naturalistic) tracks. IJR reveals three patterns. (1) Contracts inflate refusals but do not stop jailbreaks: in JSON, LLaMA and Sarvam exceed 0.92 JSR, and in Free all models reach 1.0 with refusals collapsing. (2) English-to-Indic attacks transfer strongly, with format wrappers often outperforming instruction wrappers. (3) Orthography matters: romanized or mixed inputs reduce JSR under JSON, with correlations to romanization share and tokenization (approximately 0.28 to 0.32) indicating systematic effects. Human audits confirm detector reliability, and lite-to-full comparisons preserve conclusions. IJR offers a reproducible multilingual stress test revealing risks hidden by English-only, contract-focused evaluations, especially for South Asian users who frequently code-switch and romanize.
[NLP-55] Hybrid-Gym: Training Coding Agents to Generalize Across Tasks
Summary: This paper addresses the limits of how coding agents are currently evaluated and trained: benchmarks such as SWE-Bench focus on solving single GitHub issues, while real use involves more varied and complex tasks that require skills like codebase exploration, software testing, and architecture design. By decomposing agent trajectories into fine-grained components, the authors identify transferable core skills and derive design principles for auxiliary training tasks. The key is Hybrid-Gym, a scalable synthetic training environment with diverse tasks such as function localization and dependency search, which markedly improves language models' generalization to unseen real-world tasks: trained agents gain substantially across benchmarks (e.g., +25.4% absolute on SWE-Bench Verified) and also strengthen datasets built for downstream tasks such as SWE-Play.
Link: https://arxiv.org/abs/2602.16819
Authors: Yiqing Xie,Emmy Liu,Gaokai Zhang,Nachiket Kotalwar,Shubham Gandhi,Sathwik Acharya,Xingyao Wang,Carolyn Rose,Graham Neubig,Daniel Fried
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:When assessing the quality of coding agents, predominant benchmarks focus on solving single issues on GitHub, such as SWE-Bench. In contrast, in real use, these agents solve more various and complex tasks that involve other skills such as exploring codebases, testing software, and designing architecture. In this paper, we first characterize some transferable skills that are shared across diverse tasks by decomposing trajectories into fine-grained components, and derive a set of principles for designing auxiliary training tasks to teach language models these skills. Guided by these principles, we propose a training environment, Hybrid-Gym, consisting of a set of scalable synthetic tasks, such as function localization and dependency search. Experiments show that agents trained on our synthetic tasks effectively generalize to diverse real-world tasks that are not present in training, improving a base model by 25.4% absolute gain on SWE-Bench Verified, 7.9% on SWT-Bench Verified, and 5.1% on Commit-0 Lite. Hybrid-Gym also complements datasets built for the downstream tasks (e.g., improving SWE-Play by 4.9% on SWT-Bench Verified). Code available at: this https URL.
[NLP-56] One-step Language Modeling via Continuous Denoising
Summary: This paper addresses the sharp drop in sample quality of discrete diffusion language models in the few-step regime, which prevents them from delivering their promised speed advantage over autoregressive models. The key is a flow-based language model (FLM) that performs Euclidean denoising over one-hot token encodings, bringing continuous-flow generative modeling to a discrete modality; a simple time reparameterization improves training stability and generation quality, and distilling FLM into its flow map yields FMLM, which supports efficient few-step generation and outperforms existing few-step language models on LM1B and OWT.
Link: https://arxiv.org/abs/2602.16813
Authors: Chanhyuk Lee,Jaehoon Yoo,Manan Agarwal,Sheel Shah,Jerry Huang,Aditi Raghunathan,Seunghoon Hong,Nicholas M. Boffi,Jinwoo Kim
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 39 pages, 17 figures
Abstract:Language models based on discrete diffusion have attracted widespread interest for their potential to provide faster generation than autoregressive models. In practice, however, they exhibit a sharp degradation of sample quality in the few-step regime, failing to realize this promise. Here we show that language models leveraging flow-based continuous denoising can outperform discrete diffusion in both quality and speed. By revisiting the fundamentals of flows over discrete modalities, we build a flow-based language model (FLM) that performs Euclidean denoising over one-hot token encodings. We show that the model can be trained by predicting the clean data via a cross entropy objective, where we introduce a simple time reparameterization that greatly improves training stability and generation quality. By distilling FLM into its associated flow map, we obtain a distilled flow map language model (FMLM) capable of few-step generation. On the LM1B and OWT language datasets, FLM attains generation quality matching state-of-the-art discrete diffusion models. With FMLM, our approach outperforms recent few-step language models across the board, with one-step generation exceeding their 8-step quality. Our work calls into question the widely held hypothesis that discrete diffusion processes are necessary for generative modeling over discrete modalities, and paves the way toward accelerated flow-based language modeling at scale. Code is available at this https URL.
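A toy sketch of the training objective described here: Euclidean denoising over one-hot token encodings with a cross-entropy target on the clean tokens. The linear Gaussian-to-data interpolation path, naive uniform time sampling, and tiny backbone are simplifying assumptions; the paper additionally reparameterizes time, which this sketch omits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V, D = 1000, 256  # toy vocab size and hidden width

class DenoiserLM(nn.Module):
    """Predicts clean tokens from a noisy point on the one-hot interpolation path."""
    def __init__(self):
        super().__init__()
        self.proj_in = nn.Linear(V + 1, D)  # noisy one-hot plus a time channel
        layer = nn.TransformerEncoderLayer(D, 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D, V)

    def forward(self, x_t, t):
        h = torch.cat([x_t, t.expand(x_t.size(0), x_t.size(1), 1)], dim=-1)
        return self.head(self.backbone(self.proj_in(h)))

def training_step(model, tokens, opt):
    x1 = F.one_hot(tokens, V).float()              # clean one-hot targets
    t = torch.rand(tokens.size(0), 1, 1)           # uniform time (paper reparameterizes this)
    x_t = (1 - t) * torch.randn_like(x1) + t * x1  # Gaussian-to-data interpolation
    logits = model(x_t, t)
    loss = F.cross_entropy(logits.reshape(-1, V), tokens.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

model = DenoiserLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
print(training_step(model, torch.randint(0, V, (8, 32)), opt))
```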
[NLP-57] Evaluating Monolingual and Multilingual Large Language Models for Greek Question Answering: The DemosQA Benchmark
Summary: This paper addresses the underperformance of current large language models (LLMs) on low-resource languages such as Greek: existing models focus on high-resource languages (e.g., English) and rely on transfer from them, which can misrepresent social, cultural, and historical aspects of low-resource languages. The key contributions are: DemosQA, a new QA dataset reflecting the Greek social and cultural context, built from social-media user questions and community-reviewed answers; a memory-efficient LLM evaluation framework adaptable to diverse QA datasets and languages; and a systematic evaluation of 11 monolingual and multilingual LLMs on 6 human-curated Greek QA datasets under three prompting strategies, filling the gap in studying how effective monolingual LLMs are on language-specific tasks.
Link: https://arxiv.org/abs/2602.16811
Authors: Charalampos Mastrokostas,Nikolaos Giarelis,Nikos Karacapilidis
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent advancements in Natural Language Processing and Deep Learning have enabled the development of Large Language Models (LLMs), which have significantly advanced the state-of-the-art across a wide range of tasks, including Question Answering (QA). Despite these advancements, research on LLMs has primarily targeted high-resourced languages (e.g., English), and only recently has attention shifted toward multilingual models. However, these models demonstrate a training data bias towards a small number of popular languages or rely on transfer learning from high- to under-resourced languages; this may lead to a misrepresentation of social, cultural, and historical aspects. To address this challenge, monolingual LLMs have been developed for under-resourced languages; however, their effectiveness remains less studied when compared to multilingual counterparts on language-specific tasks. In this study, we address this research gap in Greek QA by contributing: (i) DemosQA, a novel dataset, which is constructed using social media user questions and community-reviewed answers to better capture the Greek social and cultural zeitgeist; (ii) a memory-efficient LLM evaluation framework adaptable to diverse QA datasets and languages; and (iii) an extensive evaluation of 11 monolingual and multilingual LLMs on 6 human-curated Greek QA datasets using 3 different prompting strategies. We release our code and data to facilitate reproducibility.
[NLP-58] References Improve LLM Alignment in Non-Verifiable Domains ICLR2026
Summary: This paper addresses the bottleneck of applying Reinforcement Learning with Verifiable Rewards (RLVR) to non-verifiable domains such as LLM alignment, where ground-truth verifiers are unavailable. The key is to use reference-guided LLM-evaluators as soft "verifiers": high-quality reference outputs (from frontier models or written by humans) improve LLM-judge accuracy, and the improved judges then guide model self-improvement, enabling effective post-training without explicit verification signals. The method clearly outperforms direct supervised fine-tuning (SFT) and reference-free self-improvement on benchmarks such as AlpacaEval and Arena-Hard.
Link: https://arxiv.org/abs/2602.16802
Authors: Kejian Shi,Yixin Liu,Peifeng Wang,Alexander R. Fabbri,Shafiq Joty,Arman Cohan
Affiliation: Yale University; Meta; Scale AI; Salesforce Research; Nanyang Technological University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ICLR 2026 Camera Ready
Abstract:While Reinforcement Learning with Verifiable Rewards (RLVR) has shown strong effectiveness in reasoning tasks, it cannot be directly applied to non-verifiable domains lacking ground-truth verifiers, such as LLM alignment. In this work, we investigate whether reference-guided LLM-evaluators can bridge this gap by serving as soft “verifiers”. First, we design evaluation protocols that enhance LLM-based evaluators for LLM alignment using reference outputs. Through comprehensive experiments, we show that a reference-guided approach substantially improves the accuracy of less capable LLM-judges using references from frontier models; stronger LLM-judges can also be enhanced by high-quality (i.e., human-written) references. Building on these improved judges, we demonstrate the utility of high-quality references in alignment tuning, where LLMs guided with references are used as judges to self-improve. We show that reference-guided self-improvement yields clear gains over both direct SFT on reference outputs and self-improvement with reference-free judges, achieving performance comparable to training with ArmoRM, a strong finetuned reward model. Specifically, our method achieves 73.1% and 58.7% on AlpacaEval and Arena-Hard with Llama-3-8B-Instruct, and 70.0% and 74.1% with Qwen2.5-7B, corresponding to average absolute gains of +20.2 / +17.1 points over SFT distillation and +5.3 / +3.6 points over reference-free self-improvement on AlpacaEval / Arena-Hard. These results highlight the potential of using reference-guided LLM-evaluators to enable effective LLM post-training in non-verifiable domains.
[NLP-59] Better Think Thrice: Learning to Reason Causally with Double Counterfactual Consistency
Summary: This paper addresses the brittleness of large language models (LLMs) on counterfactual questions, a symptom of weak causal reasoning. Existing approaches rely on labeled counterfactual task data as benchmarks, which is hard to produce at the scale the space of counterfactuals demands, limiting both evaluation and improvement. The key is double counterfactual consistency (DCC), a lightweight inference-time method that, without labeled data, verifies two essential elements of causal reasoning: causal intervention and counterfactual prediction. DCC can evaluate the causal abilities of leading LLMs across diverse reasoning tasks, and also serves as a training-free test-time rejection-sampling criterion that directly improves reasoning performance across model families.
Link: https://arxiv.org/abs/2602.16787
Authors: Victoria Lin,Xinnuo Xu,Rachel Lawrence,Risa Ueno,Amit Sharma,Javier Gonzalez,Niranjani Prasad
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Despite their strong performance on reasoning benchmarks, large language models (LLMs) have proven brittle when presented with counterfactual questions, suggesting weaknesses in their causal reasoning ability. While recent work has demonstrated that labeled counterfactual tasks can be useful benchmarks of LLMs’ causal reasoning, producing such data at the scale required to cover the vast potential space of counterfactuals is limited. In this work, we introduce double counterfactual consistency (DCC), a lightweight inference-time method for measuring and guiding the ability of LLMs to reason causally. Without requiring labeled counterfactual data, DCC verifies a model’s ability to execute two important elements of causal reasoning: causal intervention and counterfactual prediction. Using DCC, we evaluate the causal reasoning abilities of various leading LLMs across a range of reasoning tasks and interventions. Moreover, we demonstrate the effectiveness of DCC as a training-free test-time rejection sampling criterion and show that it can directly improve performance on reasoning tasks across multiple model families.
[NLP-60] Omitted Variable Bias in Language Models Under Distribution Shift
Summary: This paper addresses the brittleness of modern language models under distribution shift, where performance drops sharply when test data diverges from the training distribution. The core difficulty is that shifts decompose into observable and unobservable components, and existing methods handle only the former, ignoring the omitted variable bias induced by unobserved variables and compromising both evaluation and optimization. The key is a framework that maps the strength of unobserved variables to bounds on worst-case generalization performance of language models under shift; empirically, using these bounds for evaluation and optimization yields more principled out-of-distribution measures, improves true out-of-distribution performance, and enables inference about the strength of the omitted variables when target-distribution labels are available.
Link: https://arxiv.org/abs/2602.16784
Authors: Victoria Lin,Louis-Philippe Morency,Eli Ben-Michael
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Methodology (stat.ME)
Comments:
Abstract:Despite their impressive performance on a wide variety of tasks, modern language models remain susceptible to distribution shifts, exhibiting brittle behavior when evaluated on data that differs in distribution from their training data. In this paper, we describe how distribution shifts in language models can be separated into observable and unobservable components, and we discuss how established approaches for dealing with distribution shift address only the former. Importantly, we identify that the resulting omitted variable bias from unobserved variables can compromise both evaluation and optimization in language models. To address this challenge, we introduce a framework that maps the strength of the omitted variables to bounds on the worst-case generalization performance of language models under distribution shift. In empirical experiments, we show that using these bounds directly in language model evaluation and optimization provides more principled measures of out-of-distribution performance, improves true out-of-distribution performance relative to standard distribution shift adjustment methods, and further enables inference about the strength of the omitted variables when target distribution labels are available.
[NLP-61] Intent Laundering: AI Safety Datasets Are Not What They Seem
Summary: This paper addresses an important bias in current AI safety evaluation: existing safety datasets do not reflect real-world malicious attacks. The core problem is their overreliance on explicit triggering cues, which readily provoke refusals but lack the subtlety of real attacks, leaving evaluation badly disconnected from actual risk. The key is "intent laundering": stripping triggering cues from attack samples while strictly preserving malicious intent and all relevant details, better mimicking adversary behavior. Experiments show that once cues are removed, models previously judged "reasonably safe" (including Gemini 3 Pro and Claude Sonnet 3.7) exhibit significant safety failures, and that intent laundering as a jailbreaking technique achieves 90%-98% success rates under fully black-box access, exposing a fundamental disconnect between current evaluations and real threats.
Link: https://arxiv.org/abs/2602.16729
Authors: Shahriar Golchin,Marc Wetter
Affiliation: Labelbox
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: v1 preprint
Abstract:We systematically evaluate the quality of widely used AI safety datasets from two perspectives: in isolation and in practice. In isolation, we examine how well these datasets reflect real-world attacks based on three key properties: driven by ulterior intent, well-crafted, and out-of-distribution. We find that these datasets overrely on “triggering cues”: words or phrases with overt negative/sensitive connotations that are intended to trigger safety mechanisms explicitly, which is unrealistic compared to real-world attacks. In practice, we evaluate whether these datasets genuinely measure safety risks or merely provoke refusals through triggering cues. To explore this, we introduce “intent laundering”: a procedure that abstracts away triggering cues from attacks (data points) while strictly preserving their malicious intent and all relevant details. Our results indicate that current AI safety datasets fail to faithfully represent real-world attacks due to their overreliance on triggering cues. In fact, once these cues are removed, all previously evaluated “reasonably safe” models become unsafe, including Gemini 3 Pro and Claude Sonnet 3.7. Moreover, when intent laundering is adapted as a jailbreaking technique, it consistently achieves high attack success rates, ranging from 90% to over 98%, under fully black-box access. Overall, our findings expose a significant disconnect between how model safety is evaluated and how real-world adversaries behave.
[NLP-62] Retrieval Augmented (Knowledge Graph) and Large Language Model-Driven Design Structure Matrix (DSM) Generation of Cyber-Physical Systems
Summary: This paper addresses the automation of Design Structure Matrix (DSM) generation, in particular identifying components and their interrelations in complex system design. Traditional DSM construction relies on manual expertise, is inefficient and error-prone, and is especially hard without clear architectural references. The key is using Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and Graph-based RAG (GraphRAG) to extract inter-component relationships from textual descriptions and generate DSMs automatically. Experiments on two representative cases (a power screwdriver and a CubeSat) show the methods' potential, notably in identifying previously unknown components and their interactions, offering a viable path toward more intelligent design processes.
Link: https://arxiv.org/abs/2602.16715
Authors: H. Sinan Bank,Daniel R. Herber
Affiliation: Colorado State University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Systems and Control (eess.SY)
Comments: 26 pages, 10 figures
Abstract:We explore the potential of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and Graph-based RAG (GraphRAG) for generating Design Structure Matrices (DSMs). We test these methods on two distinct use cases – a power screwdriver and a CubeSat with known architectural references – evaluating their performance on two key tasks: determining relationships between predefined components, and the more complex challenge of identifying components and their subsequent relationships. We measure the performance by assessing each element of the DSM and overall architecture. Despite design and computational challenges, we identify opportunities for automated DSM generation, with all code publicly available for reproducibility and further feedback from the domain experts.
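Since the abstract measures performance "by assessing each element of the DSM", here is a minimal sketch of that element-wise comparison between a generated and a reference binary DSM. The precision/recall choice is an assumption; the paper may score matrices differently.

```python
import numpy as np

def dsm_element_scores(pred, ref):
    """Element-wise precision/recall between binary DSMs, ignoring the diagonal."""
    off = ~np.eye(ref.shape[0], dtype=bool)
    tp = np.sum(pred[off] & ref[off])
    precision = tp / max(int(pred[off].sum()), 1)
    recall = tp / max(int(ref[off].sum()), 1)
    return precision, recall

ref  = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=bool)  # known architecture
pred = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=bool)  # LLM/RAG-generated DSM
print(dsm_element_scores(pred, ref))  # (0.5, 0.5) for this toy pair
```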
[NLP-63] Multi-Objective Alignment of Language Models for Personalized Psychotherapy
Summary: This paper addresses the difficulty of balancing clinical safety with patient preferences when generative AI for mental-health therapy is optimized for a single objective. The key is a multi-objective direct preference optimization (MODPO) framework that jointly models preference rankings across six core therapeutic dimensions (empathy, safety, active listening, self-motivated change, trust/rapport, and patient autonomy); compared with single-objective optimization and conventional fine-tuning, it keeps safety high (62.6%) while substantially improving empathy (77.6%), and blinded clinician evaluation confirms it is preferred over existing baselines.
Link: https://arxiv.org/abs/2602.16053
Authors: Mehrab Beikzadeh,Yasaman Asadollah Salmanpour,Ashima Suvarna,Sriram Sankararaman,Matteo Malgaroli,Majid Sarrafzadeh,Saadia Gabriel
Affiliation: University of California, Los Angeles; The University of Texas at Austin; NYU Grossman School of Medicine
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Mental health disorders affect over 1 billion people worldwide, yet access to care remains limited by workforce shortages and cost constraints. While AI systems show therapeutic promise, current alignment approaches optimize objectives independently, failing to balance patient preferences with clinical safety. We survey 335 individuals with lived mental health experience to collect preference rankings across therapeutic dimensions, then develop a multi-objective alignment framework using direct preference optimization. We train reward models for six criteria – empathy, safety, active listening, self-motivated change, trust/rapport, and patient autonomy – and systematically compare multi-objective approaches against single-objective optimization, supervised fine-tuning, and parameter merging. Multi-objective DPO (MODPO) achieves superior balance (77.6% empathy, 62.6% safety) compared to single-objective optimization (93.6% empathy, 47.8% safety), and therapeutic criteria outperform general communication principles by 17.2%. Blinded clinician evaluation confirms MODPO is consistently preferred, with LLM-evaluator agreement comparable to inter-clinician reliability.
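A sketch of the loss shape the abstract implies: per-criterion reward-model margins folded into a DPO-style objective through a weight vector. The exact combination rule, the beta value, and the uniform weights below are assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def modpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected,
               margins, weights, beta=0.1):
    """Multi-objective DPO sketch: the usual DPO logit is offset by a weighted
    sum of per-criterion reward margins (chosen minus rejected).
    margins: (batch, 6) for empathy, safety, active listening, self-motivated
    change, trust/rapport, and autonomy; weights: (6,) trade-off vector."""
    policy_logratio = pi_chosen - pi_rejected   # sequence log-prob ratios under the policy
    ref_logratio = ref_chosen - ref_rejected    # and under the frozen reference model
    margin = (margins * weights).sum(dim=-1)
    logits = beta * (policy_logratio - ref_logratio) - margin
    return -F.logsigmoid(logits).mean()

# toy batch of 4 preference pairs
loss = modpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4),
                  margins=torch.randn(4, 6), weights=torch.full((6,), 1 / 6))
print(loss.item())
```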
[NLP-64] PREFER: An Ontology for the PREcision FERmentation Community
Summary: This paper addresses the lack of community standards in precision fermentation, which limits data accessibility and interoperability and blocks integration across high-throughput bioreactor platforms. The key is PREFER, an open-source ontology built in alignment with the widely adopted Basic Formal Ontology (BFO) and connected to several other community ontologies, ensuring consistency and cross-domain compatibility across the whole precision-fermentation process. Integrating PREFER into high-throughput bioprocess-development workflows enables structured metadata, automated cross-platform execution, and high-fidelity data capture, facilitating machine-actionable datasets for training predictive, robust machine-learning models in synthetic biology.
Link: https://arxiv.org/abs/2602.16755
Authors: Txell Amigó(1),Shawn Zheng Kai Tan(2),Angel Luu Phanthanourak(1),Sebastian Schulz(1),Pasquale D. Colaianni(1),Dominik M. Maszczyk(1),Ester Milesi(1),Ivan Schlembach(1),Mykhaylo Semenov Petrov(1),Marta Reventós Montané(1),Lars K. Nielsen(1,3),Jochen Förster(1),Bernhard Ø. Palsson(1,4),Suresh Sudarsan(1, 5),Alberto Santos(1) ((1) The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kgs. Lyngby, Denmark, (2) SignaMind, Singapore, (3) Australian Institute for Bioengineering and Nanotechnology, The University of Queensland, Brisbane, Queensland, Australia, (4) The Department of Bioengineering, University of California, San Diego, USA, (5) Nexxar ApS, Lynge, Denmark)
Affiliation: Unknown
Subjects: Other Quantitative Biology (q-bio.OT); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Precision fermentation relies on microbial cell factories to produce sustainable food, pharmaceuticals, chemicals, and biofuels. Specialized laboratories such as biofoundries are advancing these processes using high-throughput bioreactor platforms, which generate vast datasets. However, the lack of community standards limits data accessibility and interoperability, preventing integration across platforms. In order to address this, we introduce PREFER, an open-source ontology designed to establish a unified standard for bioprocess data. Built in alignment with the widely adopted Basic Formal Ontology (BFO) and connecting with several other community ontologies, PREFER ensures consistency and cross-domain compatibility and covers the whole precision fermentation process. Integrating PREFER into high-throughput bioprocess development workflows enables structured metadata that supports automated cross-platform execution and high-fidelity data capture. Furthermore, PREFER’s standardization has the potential to bridge disparate data silos, generating machine-actionable datasets critical for training predictive, robust machine learning models in synthetic biology. This work provides the foundation for scalable, interoperable bioprocess systems and supports the transition toward more data-driven bioproduction.
Information Retrieval
[IR-0] CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts ECIR2026
Summary: This paper addresses extracting semantic relations between persons and places from noisy, multilingual historical texts, specifically two relation types: whether a person has ever been at a place (at), and whether a person is located at a place around publication time (isAt), which requires reasoning over temporal and geographical cues. The key is a three-fold evaluation profile that jointly measures accuracy, computational efficiency, and domain generalization; by tying relation extraction to large-scale historical data processing, the lab supports downstream applications such as knowledge-graph construction, historical biography reconstruction, and spatial analysis in digital humanities.
Link: https://arxiv.org/abs/2602.17663
Authors: Juri Opitz,Corina Raclé,Emanuela Boros,Andrianos Michail,Matteo Romanello,Maud Ehrmann,Simon Clematide
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: ECIR 2026. CLEF Evaluation Lab. Registration DL: 2026/04/23. Task Homepage at this https URL
Abstract:HIPE-2026 is a CLEF evaluation lab dedicated to person-place relation extraction from noisy, multilingual historical texts. Building on the HIPE-2020 and HIPE-2022 campaigns, it extends the series toward semantic relation extraction by targeting the task of identifying person–place associations in multiple languages and time periods. Systems are asked to classify relations of two types - at (“Has the person ever been at this place?”) and isAt (“Is the person located at this place around publication time?”) - requiring reasoning over temporal and geographical cues. The lab introduces a three-fold evaluation profile that jointly assesses accuracy, computational efficiency, and domain generalization. By linking relation extraction to large-scale historical data processing, HIPE-2026 aims to support downstream applications in knowledge-graph construction, historical biography reconstruction, and spatial analysis in digital humanities.
[IR-1] Mine and Refine: Optimizing Graded Relevance in E-commerce Search Retrieval
Summary: This paper addresses graded relevance modeling for semantic text embeddings in e-commerce search, in particular weak generalization to long-tail, noisy queries and production systems' need for scalable, policy-consistent supervision. The key is a two-stage "Mine and Refine" contrastive training framework: Stage 1 trains a multilingual two-tower retriever with a label-aware supervised contrastive loss to shape a robust global semantic space; Stage 2 mines hard samples via nearest-neighbor search, re-annotates them with a small policy-aligned LLM, and introduces a multi-class circle loss that explicitly sharpens similarity boundaries between relevance levels, further refining the separation and stability of the embedding space.
Link: https://arxiv.org/abs/2602.17654
Authors: Jiaqi Xi,Raghav Saboo,Luming Chen,Martin Wang,Sudeep Das
Affiliation: DoorDash Inc.
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:We propose a two-stage “Mine and Refine” contrastive training framework for semantic text embeddings to enhance multi-category e-commerce search retrieval. Large scale e-commerce search demands embeddings that generalize to long tail, noisy queries while adhering to scalable supervision compatible with product and policy constraints. A practical challenge is that relevance is often graded: users accept substitutes or complements beyond exact matches, and production systems benefit from clear separation of similarity scores across these relevance strata for stable hybrid blending and thresholding. To obtain scalable policy consistent supervision, we fine-tune a lightweight LLM on human annotations under a three-level relevance guideline and further reduce residual noise via engagement driven auditing. In Stage 1, we train a multilingual Siamese two-tower retriever with a label aware supervised contrastive objective that shapes a robust global semantic space. In Stage 2, we mine hard samples via ANN and re-annotate them with the policy aligned LLM, and introduce a multi-class extension of circle loss that explicitly sharpens similarity boundaries between relevance levels, to further refine and enrich the embedding space. Robustness is additionally improved through additive spelling augmentation and synthetic query generation. Extensive offline evaluations and production A/B tests show that our framework improves retrieval relevance and delivers statistically significant gains in engagement and business impact.
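One plausible reading of the "multi-class extension of circle loss" is to apply circle-loss margins to every candidate pair whose relevance grades differ, pushing the higher-graded item's similarity above 1 − m and the lower-graded item's below m. This is a simplified sketch under that assumption, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def multiclass_circle_loss(sims, grades, m=0.25, gamma=32.0):
    """Circle-loss-style margins over all pairs with strictly different grades."""
    higher = grades.unsqueeze(1) > grades.unsqueeze(0)   # (N, N): i more relevant than j
    s_p = sims.unsqueeze(1).expand_as(higher)            # higher-graded side of each pair
    s_n = sims.unsqueeze(0).expand_as(higher)            # lower-graded side of each pair
    a_p = torch.clamp_min(1 + m - s_p, 0)                # adaptive positive weighting
    a_n = torch.clamp_min(s_n + m, 0)                    # adaptive negative weighting
    pair_logit = -gamma * a_p * (s_p - (1 - m)) + gamma * a_n * (s_n - m)
    pair_logit = torch.where(higher, pair_logit,
                             torch.full_like(pair_logit, float("-inf")))
    return F.softplus(torch.logsumexp(pair_logit.flatten(), dim=0))

sims = torch.tensor([0.9, 0.6, 0.2])   # query-candidate cosine similarities
grades = torch.tensor([2, 1, 0])       # e.g. exact match / substitute / irrelevant
print(multiclass_circle_loss(sims, grades))
```

The graded margins are what produce the clear separation of similarity scores across relevance strata that the abstract argues hybrid blending and thresholding need.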
[IR-2] Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability
Summary: This paper addresses the limitation of evaluating LLM Chain-of-Thought (CoT) in multi-agent IR pipelines solely by target-task accuracy, which cannot measure the quality or utility of the reasoning itself. The key is two new measures, reusability and verifiability, computed by decoupling CoT generation from execution in a Thinker-Executor framework so the general value of the reasoning can be assessed independently. This separation exposes reasoning-quality dimensions that accuracy-based leaderboards miss, and shows that CoTs from specialized reasoning models are not consistently more reusable or verifiable than those from general-purpose LLMs such as Llama and Gemma, challenging common assumptions about specialized reasoning ability.
Link: https://arxiv.org/abs/2602.17544
Authors: Shashank Aggarwal,Ram Vikas Mishra,Amit Awekar
Affiliation: Indian Institute of Technology Guwahati
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:
Abstract:In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning in terms of Chain-of-Thought (CoT) with each other. Current CoT evaluation narrowly focuses on target task accuracy. However, this metric fails to assess the quality or utility of the reasoning process itself. To address this limitation, we introduce two novel measures: reusability and verifiability. We decouple CoT generation from execution using a Thinker-Executor framework. Reusability measures how easily an Executor can reuse the Thinker’s CoT. Verifiability measures how frequently an Executor can match the Thinker’s answer using the CoT. We evaluated four Thinker models against a committee of ten Executor models across five benchmarks. Our results reveal that reusability and verifiability do not correlate with standard accuracy, exposing a blind spot in current accuracy-based leaderboards for reasoning capability. Surprisingly, we find that CoTs from specialized reasoning models are not consistently more reusable or verifiable than those from general-purpose LLMs like Llama and Gemma.
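In the Thinker-Executor framing, verifiability reduces to a committee vote over executors. A minimal sketch follows; the helper `answer_with_cot`, which prompts an executor with the question plus the Thinker's CoT, is hypothetical.

```python
def verifiability(question, cot, thinker_answer, executors, answer_with_cot):
    """Fraction of executor models that, given only the question and the
    Thinker's chain-of-thought, reproduce the Thinker's answer."""
    votes = [answer_with_cot(ex, question, cot) == thinker_answer
             for ex in executors]
    return sum(votes) / len(votes)
```

Averaging this score over a benchmark, with a committee of ten executors as in the paper, gives a per-Thinker verifiability number that can be compared directly against its task accuracy.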
[IR-3] A Picture of Agentic Search
Summary: This paper addresses how information retrieval (IR) systems should adapt as automated agents increasingly participate in query generation. IR remains human-centred: its evaluation metrics, user models, and datasets are built around human queries and behaviours, but as agent-issued query volume grows, assumptions such as predictable workloads and human behaviour patterns break down, degrading caching, making query pre-processing redundant, and distorting standard evaluation metrics. The key is a methodology for capturing all data produced and consumed by agentic retrieval-augmented systems while answering queries (reasoning-induced queries, retrieved documents, and intermediate thoughts), plus the release of the Agentic Search Queryset (ASQ) dataset covering multiple agents, retrieval pipelines, and benchmarks (HotpotQA, Researchy Questions, and MS MARCO), providing a data foundation and an extensible toolkit for future agent-oriented IR research.
Link: https://arxiv.org/abs/2602.17518
Authors: Francesca Pezzuti,Ophir Frieder,Fabrizio Silvestri,Sean MacAvaney,Nicola Tonellotto
Affiliation: University of Pisa; Georgetown University; Sapienza University of Rome; University of Glasgow
Subjects: Information Retrieval (cs.IR)
Comments: 7 pages, 2 figures
Abstract:With automated systems increasingly issuing search queries alongside humans, Information Retrieval (IR) faces a major shift. Yet IR remains human-centred, with systems, evaluation metrics, user models, and datasets designed around human queries and behaviours. Consequently, IR operates under assumptions that no longer hold in practice, with changes to workload volumes, predictability, and querying behaviours. This misalignment affects system performance and optimisation: caching may lose effectiveness, query pre-processing may add overhead without improving results, and standard metrics may mismeasure satisfaction. Without adaptation, retrieval models risk satisfying neither humans, nor the emerging user segment of agents. However, datasets capturing agent search behaviour are lacking, which is a critical gap given IR’s historical reliance on data-driven evaluation and optimisation. We develop a methodology for collecting all the data produced and consumed by agentic retrieval-augmented systems when answering queries, and we release the Agentic Search Queryset (ASQ) dataset. ASQ contains reasoning-induced queries, retrieved documents, and thoughts for queries in HotpotQA, Researchy Questions, and MS MARCO, for 3 diverse agents and 2 retrieval pipelines. The accompanying toolkit enables ASQ to be extended to new agents, retrievers, and datasets.
[IR-4] Beyond Pipelines: A Fundamental Study on the Rise of Generative-Retrieval Architectures in Web Research
Summary: This paper examines how large language models (LLMs) are reshaping web research and industrial practice, particularly the paradigm shift in traditional tasks such as information retrieval, question answering, recommendation, and web analytics. The key lies in retrieval-augmented generation (RAG), which couples real-time retrieval from external knowledge sources with the generative power of LLMs to improve the accuracy, interpretability, and relevance of generated content, moving web research from static pipelines toward dynamic, intelligent generative solutions.
Link: https://arxiv.org/abs/2602.17450
Authors: Amirereza Abbasi,Mohsen Hooshmand
Affiliation: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Web research and practices have evolved significantly over time, offering users diverse and accessible solutions across a wide range of tasks. While advanced concepts such as Web 4.0 have emerged from mature technologies, the introduction of large language models (LLMs) has profoundly influenced both the field and its applications. This wave of LLMs has permeated science and technology so deeply that no area remains untouched. Consequently, LLMs are reshaping web research and development, transforming traditional pipelines into generative solutions for tasks like information retrieval, question answering, recommendation systems, and web analytics. They have also enabled new applications such as web-based summarization and educational tools. This survey explores recent advances in the impact of LLMs-particularly through the use of retrieval-augmented generation (RAG)-on web research and industry. It discusses key developments, open challenges, and future directions for enhancing web solutions with LLMs.
[IR-5] WarpRec: Unifying Academic Rigor and Industrial Scale for Responsible Reproducible and Efficient Recommendation
Summary: This paper addresses the fragmented ecosystem of recommender-systems research, where researchers face a stark trade-off between the convenience of local in-memory experimentation and the costly rewrites required by industrial distributed engines. The key is WarpRec, whose novel backend-agnostic architecture transitions seamlessly from local execution to distributed training and optimization; it integrates 50+ state-of-the-art algorithms, 40 evaluation metrics, and 19 data filtering and splitting strategies, and tracks energy consumption in real time via CodeCarbon, balancing performance gains with sustainability. WarpRec thereby offers a unified architectural backbone for the next generation of scalable, sustainable, Agentic-AI-ready recommender systems within the Generative AI ecosystem.
Link: https://arxiv.org/abs/2602.17442
Authors: Marco Avolio,Potito Aghilar,Sabino Roccotelli,Vito Walter Anelli,Chiara Mallamaci,Vincenzo Paparella,Marco Valentini,Alejandro Bellogín,Michelantonio Trizio,Joseph Trotta,Antonio Ferrara,Tommaso Di Noia
Affiliation: Wideverse; Politecnico di Bari; ISTI-CNR; UAM; OVS
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:
Abstract:Innovation in Recommender Systems is currently impeded by a fractured ecosystem, where researchers must choose between the ease of in-memory experimentation and the costly, complex rewriting required for distributed industrial engines. To bridge this gap, we present WarpRec, a high-performance framework that eliminates this trade-off through a novel, backend-agnostic architecture. It includes 50+ state-of-the-art algorithms, 40 metrics, and 19 filtering and splitting strategies that seamlessly transition from local execution to distributed training and optimization. The framework enforces ecological responsibility by integrating CodeCarbon for real-time energy tracking, showing that scalability need not come at the cost of scientific integrity or sustainability. Furthermore, WarpRec anticipates the shift toward Agentic AI, leading Recommender Systems to evolve from static ranking engines into interactive tools within the Generative AI ecosystem. In summary, WarpRec not only bridges the gap between academia and industry but also can serve as the architectural backbone for the next generation of sustainable, agent-ready Recommender Systems. Code is available at this https URL
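The CodeCarbon integration mentioned here follows the library's standard tracker pattern; a minimal sketch is below, where `train_and_evaluate` is a placeholder for whatever WarpRec experiment is being run.

```python
from codecarbon import EmissionsTracker

def train_and_evaluate():
    # placeholder for the actual recommender training/evaluation run
    sum(i * i for i in range(10**6))

tracker = EmissionsTracker(project_name="warprec-experiment")
tracker.start()
try:
    train_and_evaluate()
finally:
    emissions_kg = tracker.stop()   # estimated kg CO2-eq for the tracked block
print(f"estimated emissions: {emissions_kg:.6f} kg CO2-eq")
```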
[IR-6] Improving LLM-based Recommendation with Self-Hard Negatives from Intermediate Layers
Summary: This paper addresses a weakness of preference learning for large language models (LLMs) in recommendation: reliance on sequence-level, offline-generated negatives yields samples that are insufficiently discriminative and informative, especially given the huge negative item space. The key is ILRec, a novel preference fine-tuning framework that extracts self-hard negative tokens from intermediate layers as fine-grained negative supervision that dynamically reflects the model's learning process; a two-stage design (cross-layer preference optimization and cross-layer preference distillation) jointly improves discrimination of informative negatives and the quality of intermediate-layer negative signals; and a lightweight collaborative-filtering model assigns token-level rewards to negatives, reducing the risk of over-penalizing false negatives.
Link: https://arxiv.org/abs/2602.17410
Authors: Bingqian Li,Bowen Zheng,Xiaolei Wang,Long Zhang,Jinpeng Wang,Sheng Chen,Wayne Xin Zhao,Ji-rong Wen
Affiliation: GSAI, Renmin University of China; Meituan
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) have shown great promise in recommender systems, where supervised fine-tuning (SFT) is commonly used for adaptation. Subsequent studies further introduce preference learning to incorporate negative samples into the training process. However, existing methods rely on sequence-level, offline-generated negatives, making them less discriminative and informative when adapting LLMs to recommendation tasks with large negative item spaces. To address these challenges, we propose ILRec, a novel preference fine-tuning framework for LLM-based recommendation, leveraging self-hard negative signals extracted from intermediate layers to improve preference learning. Specifically, we identify self-hard negative tokens from intermediate layers as fine-grained negative supervision that dynamically reflects the model’s preference learning process. To effectively integrate these signals into training, we design a two-stage framework comprising cross-layer preference optimization and cross-layer preference distillation, enabling the model to jointly discriminate informative negatives and enhance the quality of negative signals from intermediate layers. In addition, we introduce a lightweight collaborative filtering model to assign token-level rewards for negative signals, mitigating the risk of over-penalizing false negatives. Extensive experiments on three datasets demonstrate ILRec’s effectiveness in enhancing the performance of LLM-based recommender systems.
[IR-7] Visual Model Checking: Graph-Based Inference of Visual Routines for Image Retrieval ICPR
Summary: This paper addresses the unreliability of current embedding-based image retrieval on natural-language queries involving complex relationships, object compositions, or precise constraints (identities, counts, proportions). The key is integrating formal verification with deep learning: a synergy of graph-based verification and neural code generation explicitly verifies each atomic proposition of the user query, improving the trustworthiness and interpretability of results under open-vocabulary conditions, while explicitly marking which constraints are satisfied and which are not, making the retrieval process markedly more transparent and accountable.
Link: https://arxiv.org/abs/2602.17386
Authors: Adrià Molina,Oriol Ramos Terrades,Josep Lladós
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Submitted for ICPR Review
Abstract:Information retrieval lies at the foundation of the modern digital industry. While natural language search has seen dramatic progress in recent years largely driven by embedding-based models and large-scale pretraining, the field still faces significant challenges. Specifically, queries that involve complex relationships, object compositions, or precise constraints such as identities, counts and proportions often remain unresolved or unreliable within current frameworks. In this paper, we propose a novel framework that integrates formal verification into deep learning-based image retrieval through a synergistic combination of graph-based verification methods and neural code generation. Our approach aims to support open-vocabulary natural language queries while producing results that are both trustworthy and verifiable. By grounding retrieval results in a system of formal reasoning, we move beyond the ambiguity and approximation that often characterize vector representations. Instead of accepting uncertainty as a given, our framework explicitly verifies each atomic truth in the user query against the retrieved content. This allows us to not only return matching results, but also to identify and mark which specific constraints are satisfied and which remain unmet, thereby offering a more transparent and accountable retrieval process while boosting the results of the most popular embedding-based approaches.
[IR-8] Training-free Graph-based Imputation of Missing Modalities in Multimodal Recommendation
Summary: This paper addresses performance degradation in multimodal recommender systems (RSs) caused by missing modalities (e.g., incomplete product images or descriptions). Existing practice simply drops items with missing modalities, wasting data and potentially introducing bias. The paper provides the first formalisation of missing modalities in multimodal recommendation and, exploiting the user-item graph structure, recasts imputation as feature interpolation on the item-item co-purchase graph. The core contribution is four training-free graph-propagation strategies that diffuse available multimodal features over the item graph to impute missing ones, improving robustness and generalisation. Experiments show the approach plugs seamlessly into any existing multimodal RS and benchmarking framework, outperforms traditional machine-learning imputation under various missing-modality settings, and, for the first time in multimodal RSs, analyses how feature homophily on the item-item graph affects graph-based imputation.
Link: https://arxiv.org/abs/2602.17354
Authors: Daniele Malitesta,Emanuele Rossi,Claudio Pomo,Tommaso Di Noia,Fragkiskos D. Malliaros
Affiliation: 1. University of Bologna; 2. Politecnico di Milano; 3. Università degli Studi di Napoli Federico II
Subjects: Information Retrieval (cs.IR)
Comments: Accepted in IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE)
Abstract:Multimodal recommender systems (RSs) represent items in the catalog through multimodal data (e.g., product images and descriptions) that, in some cases, might be noisy or (even worse) missing. In those scenarios, the common practice is to drop items with missing modalities and train the multimodal RSs on a subsample of the original dataset. To date, the problem of missing modalities in multimodal recommendation has still received limited attention in the literature, lacking a precise formalisation as done with missing information in traditional machine learning. In this work, we first provide a problem formalisation for missing modalities in multimodal recommendation. Second, by leveraging the user-item graph structure, we re-cast the problem of missing multimodal information as a problem of graph features interpolation on the item-item co-purchase graph. On this basis, we propose four training-free approaches that propagate the available multimodal features throughout the item-item graph to impute the missing features. Extensive experiments on popular multimodal recommendation datasets demonstrate that our solutions can be seamlessly plugged into any existing multimodal RS and benchmarking framework while still preserving (or even widening) the performance gap between multimodal and traditional RSs. Moreover, we show that our graph-based techniques can perform better than traditional imputations in machine learning under different missing modalities settings. Finally, we analyse (for the first time in multimodal RSs) how feature homophily calculated on the item-item graph can influence our graph-based imputations.
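The abstract does not spell out the four propagation strategies, but the basic training-free recipe can be sketched in a few lines: diffuse the known multimodal features of neighboring items over the co-purchase graph into items whose features are missing. The averaging rule and iteration count below are illustrative assumptions, not the paper's exact variants.

```python
import numpy as np

def propagate_features(X, adj, missing_mask, n_iters=10):
    """Impute missing rows of the item-feature matrix X by repeatedly
    replacing them with the mean of their graph neighbors' features.
    adj: (n_items, n_items) adjacency of the item-item co-purchase graph;
    missing_mask: boolean vector marking items with a missing modality."""
    X = X.copy()
    X[missing_mask] = 0.0                             # neutral start for missing rows
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)  # guard against isolated items
    for _ in range(n_iters):
        agg = (adj @ X) / deg                         # neighbor-mean aggregation
        X[missing_mask] = agg[missing_mask]           # only overwrite missing rows
    return X
```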
[IR-9] WebFAQ 2.0: A Multilingual QA Dataset with Mined Hard Negatives for Dense Retrieval
【Quick Read】: This paper targets the scarcity of high-quality, large-scale annotated data for multilingual information retrieval (IR) and the shortcomings of existing FAQ-based resources in cross-lingual alignment and contextual richness. The key to its solution is the WebFAQ 2.0 dataset, whose novel collection strategy directly crawls and extracts web content, greatly expanding multilingual coverage (108 languages) and the number of bilingually aligned QA pairs (over 14.3M), while enriching context with page titles and descriptions. For training dense retrievers, it further provides a hard-negatives set of 1.25M queries, mined with a two-stage retrieval pipeline and scored by a cross-encoder, supporting the two mainstream fine-tuning paradigms of contrastive learning (MultipleNegativesRanking loss) and knowledge distillation (MarginMSE loss) to improve retrieval in cross-lingual settings.
Link: https://arxiv.org/abs/2602.17327
Authors: Michael Dinzinger, Laura Caspari, Ali Salman, Irvin Topi, Jelena Mitrović, Michael Granitzer
Affiliations: University of Passau; IT:U Austria
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:We introduce WebFAQ 2.0, a new version of the WebFAQ dataset, containing 198 million FAQ-based natural question-answer pairs across 108 languages. Compared to the previous version, it significantly expands multilingual coverage and the number of bilingual aligned QA pairs to over 14.3M, making it the largest FAQ-based resource. Unlike the original release, WebFAQ 2.0 uses a novel data collection strategy that directly crawls and extracts relevant web content, resulting in a substantially more diverse and multilingual dataset with richer context through page titles and descriptions. In response to community feedback, we also release a hard negatives dataset for training dense retrievers, with 1.25M queries across 20 languages. These hard negatives were mined using a two-stage retrieval pipeline and include cross-encoder scores for 200 negatives per query. We further show how this resource enables two primary fine-tuning strategies for dense retrievers: Contrastive Learning with MultipleNegativesRanking loss, and Knowledge Distillation with MarginMSE loss. WebFAQ 2.0 is not a static resource but part of a long-term effort. Since late 2025, structured FAQs are being regularly released through the Open Web Index, enabling continuous expansion and refinement. We publish the datasets and training scripts to facilitate further research in multilingual and cross-lingual IR. The dataset itself and all related resources are publicly available on GitHub and HuggingFace.
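The contrastive recipe named in the abstract corresponds to a standard sentence-transformers training loop; a minimal sketch is below, with placeholder texts and a placeholder base encoder (the actual WebFAQ 2.0 triples ship with mined hard negatives and cross-encoder scores).

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Hypothetical (query, positive, mined hard negative) triple.
train_examples = [
    InputExample(texts=[
        "how do i reset my password?",
        "To reset your password, open Settings and choose 'Reset'.",  # positive
        "Our support hotline is available around the clock.",         # hard negative
    ]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

model = SentenceTransformer("distilbert-base-multilingual-cased")  # placeholder encoder
# In-batch contrastive objective; the third text acts as an extra hard negative.
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```

For the distillation route, the MarginMSE objective mentioned in the abstract maps to sentence_transformers.losses.MarginMSELoss, trained against the released cross-encoder score margins.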
[IR-10] On the Reliability of User-Centric Evaluation of Conversational Recommender Systems
【Quick Read】: This paper examines the reliability of user-centric evaluation of conversational recommender systems (CRS) when third-party annotators judge static dialogue logs, focusing on the stability of annotations and the structural relationships among evaluation dimensions. The key contribution is a large-scale empirical study (1,053 annotations from 124 crowd workers on 200 ReDial dialogues) that uses random-effects reliability models and correlation analysis to quantify the stability and interdependence of the 18 dimensions of the CRS-Que framework. It finds that utilitarian, outcome-oriented dimensions (e.g., accuracy, usefulness, satisfaction) achieve moderate reliability, whereas socially grounded dimensions (e.g., humanness, rapport) are markedly unreliable, and that most dimensions collapse into a single global quality signal, revealing a pronounced halo effect. These findings challenge the validity of single-annotator and LLM-based evaluation protocols and argue for multi-rater aggregation and dimension reduction in offline CRS evaluation.
Link: https://arxiv.org/abs/2602.17264
Authors: Michael Müller, Amir Reza Mohammadi, Andreas Peintner, Beatriz Barroso Gstrein, Günther Specht, Eva Zangerle
Affiliations: University of Innsbruck
Subjects: Information Retrieval (cs.IR)
Comments: 5 pages, 2 figures. Submitted to UMAP 2026. Code available at this https URL
Abstract:User-centric evaluation has become a key paradigm for assessing Conversational Recommender Systems (CRS), aiming to capture subjective qualities such as satisfaction, trust, and rapport. To enable scalable evaluation, recent work increasingly relies on third-party annotations of static dialogue logs by crowd workers or large language models. However, the reliability of this practice remains largely unexamined. In this paper, we present a large-scale empirical study investigating the reliability and structure of user-centric CRS evaluation on static dialogue transcripts. We collected 1,053 annotations from 124 crowd workers on 200 ReDial dialogues using the 18-dimensional CRS-Que framework. Using random-effects reliability models and correlation analysis, we quantify the stability of individual dimensions and their interdependencies. Our results show that utilitarian and outcome-oriented dimensions such as accuracy, usefulness, and satisfaction achieve moderate reliability under aggregation, whereas socially grounded constructs such as humanness and rapport are substantially less reliable. Furthermore, many dimensions collapse into a single global quality signal, revealing a strong halo effect in third-party judgments. These findings challenge the validity of single-annotator and LLM-based evaluation protocols and motivate the need for multi-rater aggregation and dimension reduction in offline CRS evaluation.
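As a rough illustration of the kind of statistic the paper's random-effects reliability models estimate, a one-way intraclass correlation over a (dialogues × raters) score matrix can be computed as follows; this is a textbook ICC(1) for a complete matrix, not the authors' exact (likely unbalanced) model.

```python
import numpy as np

def icc_oneway(ratings):
    """One-way random-effects ICC(1) for a complete (n targets x k raters) matrix."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)
    msb = k * ((row_means - grand) ** 2).sum() / (n - 1)               # between-dialogue MS
    msw = ((ratings - row_means[:, None]) ** 2).sum() / (n * (k - 1))  # within-dialogue MS
    return (msb - msw) / (msb + (k - 1) * msw)
```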
[IR-11] When LLM Judges Inflate Scores: Exploring Overrating in Relevance Assessment
【Quick Read】: This paper asks whether large language models (LLMs) are reliable, stable, and rigorous enough to replace human assessors for relevance judgments in information retrieval (IR) evaluation. Its central finding is pervasive overrating: LLMs assign high scores with high confidence to passages that do not genuinely satisfy the information need, and this bias is systematic rather than random fluctuation. Moreover, LLM judgments are highly sensitive to passage length and surface-level lexical cues, indicating that their assessment mechanism differs fundamentally from human judgment. The key takeaway is the need for careful diagnostic evaluation frameworks that identify and quantify these biases before LLMs are used as drop-in judges, so that IR evaluation results remain valid and comparable.
Link: https://arxiv.org/abs/2602.17170
Authors: Chuting Yu, Hang Li, Joel Mackenzie, Teerapong Leelanupab
Affiliations: University of Queensland
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Human relevance assessment is time-consuming and cognitively intensive, limiting the scalability of Information Retrieval evaluation. This has led to growing interest in using large language models (LLMs) as proxies for human judges. However, it remains an open question whether LLM-based relevance judgments are reliable, stable, and rigorous enough to match humans for relevance assessment. In this work, we conduct a systematic study of overrating behavior in LLM-based relevance judgments across model backbones, evaluation paradigms (pointwise and pairwise), and passage modification strategies. We show that models consistently assign inflated relevance scores – often with high confidence – to passages that do not genuinely satisfy the underlying information need, revealing a system-wide bias rather than random fluctuations in judgment. Furthermore, controlled experiments show that LLM-based relevance judgments can be highly sensitive to passage length and surface-level lexical cues. These results raise concerns about the usage of LLMs as drop-in replacements for human relevance assessors, and highlight the urgent need for careful diagnostic evaluation frameworks when applying LLMs for relevance assessments. Our code and results are publicly available.
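For concreteness, the pointwise paradigm audited here boils down to prompting a model for a graded score per query-passage pair; `call_llm` below is a hypothetical stand-in for any chat-completion client, and the prompt wording is illustrative rather than the paper's.

```python
POINTWISE_PROMPT = """You are a relevance assessor.
Query: {query}
Passage: {passage}
On a scale of 0 (irrelevant) to 3 (perfectly relevant), how well does the
passage satisfy the information need? Answer with a single digit."""

def judge_pointwise(call_llm, query: str, passage: str) -> int:
    """Pointwise LLM relevance judgment; overrating shows up as inflated grades."""
    reply = call_llm(POINTWISE_PROMPT.format(query=query, passage=passage))
    digits = [c for c in reply if c.isdigit()]
    return int(digits[0]) if digits else 0
```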
[IR-12] Multiple Index Merge for Approximate Nearest Neighbor Search
【Quick Read】: This paper addresses two challenges in building large-scale indexes for approximate k-nearest-neighbor (AKNN) search over high-dimensional vectors: proximity-graph indexes are slow to construct and memory-hungry due to extensive distance computations, and when the data exceeds single-machine memory, multiple sub-indexes must be built over partitions, yet querying these separated indexes directly degrades search efficiency because cross-graph connections cannot be exploited. The paper proposes two key techniques: Reverse Neighbor Sliding Merge (RNSM), which mines graph-structural information to speed up merging two indexes, and Merge Order Selection (MOS), which optimizes the order of multi-index merging to eliminate redundant operations. Experiments show up to a 5.48× speedup over existing merge strategies and 9.92× over index reconstruction, with efficient scaling to 100 million vectors.
Link: https://arxiv.org/abs/2602.17099
Authors: Liuchang Jing, Mingyu Yang, Lei Li, Jianbin Qin, Wei Wang
Affiliations: Unknown
Subjects: Databases (cs.DB); Information Retrieval (cs.IR)
Comments: technical report
Abstract:Approximate k nearest neighbor (AKNN) search in high-dimensional space is a foundational problem in vector databases with widespread applications. Among the numerous AKNN indexes, Proximity Graph-based indexes achieve state-of-the-art search efficiency across various benchmarks. However, their extensive distance computations of high-dimensional vectors lead to slow construction and substantial memory overhead. The limited memory capacity often prevents building the entire index at once when handling large-scale datasets. A common practice is to build multiple sub-indexes separately. However, directly searching on these separated indexes severely compromises search efficiency, as queries cannot leverage cross-graph connections. Therefore, efficient graph index merging is crucial for multi-index searching. In this paper, we focus on efficient two-index merging and the merge order of multiple indexes for AKNN search. To achieve this, we propose a reverse neighbor sliding merge (RNSM) that exploits structural information to boost merging efficiency. We further investigate merge order selection (MOS) to reduce the merging cost by eliminating redundant merge operations. Experiments show that our approach yields up to a 5.48× speedup over existing index merge methods and a 9.92× speedup over index reconstruction, while maintaining the expected superior search performance. Moreover, our method scales efficiently to 100 million vectors with 50 partitions, maintaining consistent speedups.
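RNSM itself exploits reverse-neighbor structure to avoid most distance computations; as a point of reference, the naive baseline it improves on, merging two sub-indexes by re-inserting one graph's nodes into the other via greedy search, looks roughly like this (`search_fn` is a hypothetical approximate-NN search over a graph, and the whole sketch is a baseline, not the paper's algorithm).

```python
import numpy as np

def merge_by_insertion(graph_a, graph_b, vectors, search_fn, max_degree=32):
    """Naive merge: insert each node of sub-index B into sub-index A via
    approximate search, adding forward and pruned reverse edges."""
    def dist(u, v):
        return np.linalg.norm(vectors[u] - vectors[v])
    for node in graph_b:
        nns = search_fn(graph_a, vectors[node], k=max_degree)  # approx. NNs in A
        graph_a[node] = list(nns)
        for nn in nns:  # reverse edges, pruned back to the degree bound
            cand = set(graph_a[nn]) | {node}
            graph_a[nn] = sorted(cand, key=lambda c: dist(nn, c))[:max_degree]
    return graph_a
```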
[IR-13] A Long-term Value Prediction Framework In Video Ranking
【Quick Read】: This paper addresses three challenges in modeling long-term value (LTV) at the ranking stage of short-video recommendation: position bias, attribution ambiguity, and temporal limitations. The key components are: (1) a Position-aware Debias Quantile (PDQ) module that normalizes engagement via quantile distributions, enabling position-robust LTV estimation without architectural changes; (2) a multi-dimensional attribution module that learns continuous attribution strengths over contextual, behavioral, and content signals, replacing static rules to capture fine-grained inter-video influence, paired with a customized hybrid loss that sharpens causal clarity; and (3) a cross-temporal author modeling module that builds censoring-aware, day-level LTV targets for creator-driven re-engagement, supporting longer-horizon LTV prediction and extensible to other dimensions such as topics and styles. The framework has been deployed at billion-scale in Taobao's production system, delivering significant LTV gains while keeping stable trade-offs with short-term objectives.
Link: https://arxiv.org/abs/2602.17058
Authors: Huabin Chen, Xinao Wang, Huiping Chu, Keqin Xu, Chenhao Zhai, Chenyi Wang, Kai Meng, Yuning Jiang
Affiliations: Alibaba Group; Tsinghua University
Subjects: Information Retrieval (cs.IR)
Comments: 9 pages
Abstract:Accurately modeling long-term value (LTV) at the ranking stage of short-video recommendation remains challenging. While delayed feedback and extended engagement have been explored, fine-grained attribution and robust position normalization at billion-scale are still underdeveloped. We propose a practical ranking-stage LTV framework addressing three challenges: position bias, attribution ambiguity, and temporal limitations. (1) Position bias: We introduce a Position-aware Debias Quantile (PDQ) module that normalizes engagement via quantile-based distributions, enabling position-robust LTV estimation without architectural changes. (2) Attribution ambiguity: We propose a multi-dimensional attribution module that learns continuous attribution strengths across contextual, behavioral, and content signals, replacing static rules to capture nuanced inter-video influence. A customized hybrid loss with explicit noise filtering improves causal clarity. (3) Temporal limitations: We present a cross-temporal author modeling module that builds censoring-aware, day-level LTV targets to capture creator-driven re-engagement over longer horizons; the design is extensible to other dimensions (e.g., topics, styles). Offline studies and online A/B tests show significant improvements in LTV metrics and stable trade-offs with short-term objectives. Implemented as task augmentation within an existing ranking model, the framework supports efficient training and serving, and has been deployed at billion-scale in Taobao’s production system, delivering sustained engagement gains while remaining compatible with industrial constraints.
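A minimal sketch of the quantile normalization behind the PDQ idea: map each raw engagement value to its empirical quantile within the distribution of its display position, yielding a position-robust label in [0, 1]. This shows only the core transformation; the production module is more involved.

```python
import numpy as np

def position_debias_quantile(engagement, positions, n_positions):
    """Map raw engagement to its empirical quantile within its display
    position's distribution, giving a position-robust label in [0, 1]."""
    engagement = np.asarray(engagement, dtype=float)
    positions = np.asarray(positions)
    out = np.zeros_like(engagement)
    for p in range(n_positions):
        mask = positions == p
        vals = engagement[mask]
        if vals.size == 0:
            continue
        ranks = vals.argsort().argsort()       # empirical CDF ranks within position p
        out[mask] = (ranks + 0.5) / vals.size  # mid-rank quantile estimate
    return out
```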
[IR-14] LiveGraph: Active-Structure Neural Re-ranking for Exercise Recommendation
【Quick Read】: This paper addresses two limitations of current exercise recommendation frameworks: the long-tailed distribution of student engagement and the inability to adapt to individual learning trajectories. The core solution is LiveGraph, an active-structure neural re-ranking framework whose key ideas are a graph-based representation enhancement strategy that bridges the information gap between active and inactive students, and a dynamic re-ranking mechanism that promotes content diversity, thereby balancing recommendation precision with pedagogical variety.
Link: https://arxiv.org/abs/2602.17036
Authors: Rong Fu, Zijian Zhang, Haiyun Wei, Jiekai Wu, Kun Liu, Xianda Li, Haoyu Zhao, Yang Li, Yongtai Liu, Ziming Wang, Rui Lu, Simon Fong
Affiliations: University of Macau; University of Pennsylvania; Tongji University; Juntendo University; University of Southampton; University of Bologna; Wuhan University; University of the Chinese Academy of Sciences; Hanyang University; Zhejiang University; The Hong Kong University of Science and Technology
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 19 pages, 5 figures
Abstract:The continuous expansion of digital learning environments has catalyzed the demand for intelligent systems capable of providing personalized educational content. While current exercise recommendation frameworks have made significant strides, they frequently encounter obstacles regarding the long-tailed distribution of student engagement and the failure to adapt to idiosyncratic learning trajectories. We present LiveGraph, a novel active-structure neural re-ranking framework designed to overcome these limitations. Our approach utilizes a graph-based representation enhancement strategy to bridge the information gap between active and inactive students while integrating a dynamic re-ranking mechanism to foster content diversity. By prioritizing the structural relationships within learning histories, the proposed model effectively balances recommendation precision with pedagogical variety. Comprehensive experimental evaluations conducted on multiple real-world datasets demonstrate that LiveGraph surpasses contemporary baselines in both predictive accuracy and the breadth of exercise diversity.
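The abstract does not detail the re-ranking mechanism, so as a generic illustration of the relevance-diversity trade-off that LiveGraph's dynamic re-ranker targets, here is a Maximal-Marginal-Relevance-style sketch (a standard technique, not the paper's exact method).

```python
def mmr_rerank(scores, sim, k, lam=0.7):
    """Greedy re-ranking trading predicted relevance against similarity
    to already-selected items (higher lam favors relevance).
    scores: relevance per candidate; sim: pairwise similarity matrix."""
    selected, candidates = [], list(range(len(scores)))
    while candidates and len(selected) < k:
        def gain(i):
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lam * scores[i] - (1 - lam) * redundancy
        best = max(candidates, key=gain)
        selected.append(best)
        candidates.remove(best)
    return selected
```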
[IR-15] WSDM Cup 2026 Multilingual Retrieval: A Low-Cost Multi-Stage Retrieval Pipeline
【Quick Read】: For the multilingual retrieval task of efficiently and accurately retrieving non-English documents with English queries, this paper presents a low-cost four-stage retrieval system. Its key design: LLM-based GRF-style query expansion first enriches the query; BM25 then retrieves candidate documents; jina-embeddings-v4 long-text embeddings perform dense ranking; finally, Qwen3-Reranker-4B pointwise-reranks the top-20 candidates while the dense order is preserved for the remaining results. Under a limited compute budget, the system achieves strong accuracy, with nDCG@20 of 0.403 and Judged@20 of 0.95.
Link: https://arxiv.org/abs/2602.16989
Authors: Chentong Hao, Minmao Wang
Affiliations: Brown University; Fudan University
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:We present a low-cost retrieval system for the WSDM Cup 2026 multilingual retrieval task, where English queries are used to retrieve relevant documents from a collection of approximately ten million news articles in Chinese, Persian, and Russian, and to output the top-1000 ranked results for each query. We follow a four-stage pipeline that combines LLM-based GRF-style query expansion with BM25 candidate retrieval, dense ranking using long-text representations from jina-embeddings-v4, and pointwise re-ranking of the top-20 candidates using Qwen3-Reranker-4B while preserving the dense order for the remaining results. On the official evaluation, the system achieves nDCG@20 of 0.403 and Judged@20 of 0.95. We further conduct extensive ablation experiments to quantify the contribution of each stage and to analyze the effectiveness of query expansion, dense ranking, and top-k reranking under limited compute budgets.
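The four stages compose naturally into a single retrieval function; in the sketch below, `expand_with_llm` and the index/model objects are hypothetical stand-ins for the components named in the abstract.

```python
def retrieve(query, expand_with_llm, bm25_index, dense_ranker, reranker,
             k_candidates=1000, k_rerank=20):
    """Four-stage sketch: GRF-style expansion -> BM25 -> dense ranking ->
    pointwise rerank of the top-20, keeping the dense order below rank 20."""
    expanded = expand_with_llm(query)                         # stage 1: query expansion
    candidates = bm25_index.search(expanded, k=k_candidates)  # stage 2: BM25 recall
    ranked = dense_ranker.rank(query, candidates)             # stage 3: dense ranking
    head, tail = ranked[:k_rerank], ranked[k_rerank:]
    head = reranker.score_and_sort(query, head)               # stage 4: pointwise rerank
    return head + tail                                        # top-1000 output per query
```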
[IR-16] Bending the Scaling Law Curve in Large-Scale Recommendation Systems
【Quick Read】: This paper addresses the tension between computational efficiency and model expressiveness in sequence modeling for large-scale recommender systems, in particular the quadratic-complexity bottleneck that pushes prior work toward cross-attention and thereby limits the representational power of self-attention. The key to its solution is end-to-end model-system co-design: innovations in the input sequence structure, sparse attention mechanisms, and model topology yield over 5x faster training scaling and 21x faster inference scaling while preserving and improving recommendation quality, delivering 4% to 8% gains in user consumption and engagement in real production environments.
Link: https://arxiv.org/abs/2602.16986
Authors: Qin Ding, Kevin Course, Linjian Ma, Jianhui Sun, Rouchen Liu, Zhao Zhu, Chunxing Yin, Wei Li, Dai Li, Yu Shi, Xuan Cao, Ze Yang, Han Li, Xing Liu, Bi Xue, Hongwei Li, Rui Jian, Daisy Shi He, Jing Qian, Matt Ma, Qunshu Zhang, Rui Li
Affiliations: Meta
Subjects: Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Comments:
Abstract:Learning from user interaction history through sequential models has become a cornerstone of large-scale recommender systems. Recent advances in large language models have revealed promising scaling laws, sparking a surge of research into long-sequence modeling and deeper architectures for recommendation tasks. However, many recent approaches rely heavily on cross-attention mechanisms to address the quadratic computational bottleneck in sequential modeling, which can limit the representational power gained from self-attention. We present ULTRA-HSTU, a novel sequential recommendation model developed through end-to-end model and system co-design. By innovating in the design of input sequences, sparse attention mechanisms, and model topology, ULTRA-HSTU achieves substantial improvements in both model quality and efficiency. Comprehensive benchmarking demonstrates that ULTRA-HSTU achieves remarkable scaling efficiency gains – over 5x faster training scaling and 21x faster inference scaling compared to conventional models – while delivering superior recommendation quality. Our solution is fully deployed at scale, serving billions of users daily and driving significant 4% to 8% consumption and engagement improvements in real-world production environments.
[IR-17] Beyond Chunk-Then-Embed: A Comprehensive Taxonomy and Evaluation of Document Chunking Strategies for Information Retrieval
【Quick Read】: This paper addresses the poorly understood design space of document chunking strategies in dense retrieval, in particular the lack of a unified evaluation framework for comparing LLM-guided methods (e.g., DenseX and LumberChunker) with contextualized approaches (e.g., Late Chunking). The key contribution is a systematic framework that organizes and evaluates existing strategies along two dimensions: (1) segmentation methods, including structure-based methods (fixed-size, sentence-based, and paragraph-based) as well as semantically informed and LLM-guided methods; and (2) embedding paradigms, which determine when chunking happens relative to embedding (pre-embedding vs. contextualized chunking). Through systematic experiments in two canonical retrieval settings, in-document retrieval (needle-in-a-haystack) and in-corpus retrieval, the study shows that the optimal chunking strategy is task-dependent and delineates where different methods are effective.
Link: https://arxiv.org/abs/2602.16974
Authors: Yongjie Zhou, Shuai Wang, Bevan Koopman, Guido Zuccon
Affiliations: The University of Queensland; CSIRO; Google
Subjects: Information Retrieval (cs.IR)
Comments: GitHub link will be pushed later as it is anonymised at the moment
Abstract:Document chunking is a critical preprocessing step in dense retrieval systems, yet the design space of chunking strategies remains poorly understood. Recent research has proposed several concurrent approaches, including LLM-guided methods (e.g., DenseX and LumberChunker) and contextualized strategies(e.g., Late Chunking), which generate embeddings before segmentation to preserve contextual information. However, these methods emerged independently and were evaluated on benchmarks with minimal overlap, making direct comparisons difficult. This paper reproduces prior studies in document chunking and presents a systematic framework that unifies existing strategies along two key dimensions: (1) segmentation methods, including structure-based methods (fixed-size, sentence-based, and paragraph-based) as well as semantically-informed and LLM-guided methods; and (2) embedding paradigms, which determine the timing of chunking relative to embedding (pre-embedding chunking vs. contextualized chunking). Our reproduction evaluates these approaches in two distinct retrieval settings established in previous work: in-document retrieval (needle-in-a-haystack) and in-corpus retrieval (the standard information retrieval task). Our comprehensive evaluation reveals that optimal chunking strategies are task-dependent: simple structure-based methods outperform LLM-guided alternatives for in-corpus retrieval, while LumberChunker performs best for in-document retrieval. Contextualized chunking improves in-corpus effectiveness but degrades in-document retrieval. We also find that chunk size correlates moderately with in-document but weakly with in-corpus effectiveness, suggesting segmentation method differences are not purely driven by chunk size. Our code and evaluation benchmarks are publicly available at (Anonymoused).
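The structure-based baselines the study finds strongest for in-corpus retrieval are simple enough to state exactly; the two sketches below use whitespace words and a naive sentence splitter as simplifying assumptions.

```python
import re

def fixed_size_chunks(text, size=200, overlap=50):
    """Structure-based chunking: fixed-size sliding windows with overlap,
    approximating tokens by whitespace-separated words."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def sentence_chunks(text, max_sentences=5):
    """Sentence-based chunking: group consecutive sentences (naive splitter)."""
    sents = re.split(r"(?<=[.!?])\s+", text.strip())
    return [" ".join(sents[i:i + max_sentences])
            for i in range(0, len(sents), max_sentences)]
```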
[IR-18] SAGE: Structure Aware Graph Expansion for Retrieval of Heterogeneous Data
【Quick Read】: This paper targets multi-hop evidence retrieval across modalities (text, tables, graph nodes) in heterogeneous corpora, where entity-level knowledge graphs are costly to build and maintain and inefficient to traverse at query time, while standard retriever-reader pipelines based on similarity search over independently chunked text miss cross-modal multi-hop reasoning paths. The key to the proposed SAGE (Structure Aware Graph Expansion) framework is twofold: offline, it constructs a chunk-level graph using metadata-driven similarities with percentile-based pruning; online, an initial baseline retriever obtains k seed chunks, one-hop neighbors are expanded, and dense+sparse retrieval filters the neighbors to select k' additional chunks, effectively integrating structured and unstructured information for cross-modal multi-hop reasoning.
Link: https://arxiv.org/abs/2602.16964
Authors: Prasham Titiya, Rohit Khoja, Tomer Wolfson, Vivek Gupta, Dan Roth
Affiliations: Arizona State University; University of Pennsylvania
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Retrieval-augmented question answering over heterogeneous corpora requires connected evidence across text, tables, and graph nodes. While entity-level knowledge graphs support structured access, they are costly to construct and maintain, and inefficient to traverse at query time. In contrast, standard retriever-reader pipelines use flat similarity search over independently chunked text, missing multi-hop evidence chains across modalities. We propose SAGE (Structure Aware Graph Expansion), a framework that (i) constructs a chunk-level graph offline using metadata-driven similarities with percentile-based pruning, and (ii) performs online retrieval by running an initial baseline retriever to obtain k seed chunks, expanding first-hop neighbors, and then filtering the neighbors using dense+sparse retrieval, selecting k' additional chunks. We instantiate the initial retriever using hybrid dense+sparse retrieval for implicit cross-modal corpora, and SPARK (Structure Aware Planning Agent for Retrieval over Knowledge Graphs), an agentic retriever, for explicit schema graphs. On OTT-QA and STaRK, SAGE improves retrieval recall by 5.7 and 8.5 points over baselines.
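SAGE's online phase reduces to seed retrieval, one-hop expansion, and neighbor filtering; a compact sketch under hypothetical interfaces (`base_retriever`, the adjacency dict `graph`, and a dense+sparse `scorer`) follows.

```python
def sage_retrieve(query, base_retriever, graph, scorer, k=10, k_extra=5):
    """Seed retrieval, one-hop expansion on the offline chunk graph, then
    hybrid dense+sparse filtering of the expanded neighbors."""
    seeds = base_retriever.search(query, k=k)                      # k seed chunks
    neighbors = {n for s in seeds for n in graph.get(s, ())} - set(seeds)
    ranked = sorted(neighbors, key=lambda c: scorer(query, c), reverse=True)
    return seeds + ranked[:k_extra]                                # k + k' chunks
```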
[IR-19] RankEvolve: Automating the Discovery of Retrieval Algorithms via LLM-Driven Evolution
【Quick Read】: This paper asks whether lexical ranking algorithms, whose improvement in traditional information retrieval (IR) has relied on parameter tuning and human intuition, can instead be discovered automatically. The key to its solution is RankEvolve, a program-evolution framework based on AlphaEvolve: candidate ranking algorithms are represented as executable code, and a large language model (LLM), guided by an evaluator, iteratively mutates, recombines, and selects them, automatically exploring and generating novel, effective retrieval algorithms. Starting from two seed programs, BM25 and query likelihood with Dirichlet smoothing, the evolved algorithms improve performance across IR benchmarks (BEIR, BRIGHT) and show promising transfer across domains.
Link: https://arxiv.org/abs/2602.16932
Authors: Jinming Nian, Fangchen Li, Dae Hoon Park, Yi Fang
Affiliations: Santa Clara University; Walmart Global Tech; Independent Researcher
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Retrieval algorithms like BM25 and query likelihood with Dirichlet smoothing remain strong and efficient first-stage rankers, yet improvements have mostly relied on parameter tuning and human intuition. We investigate whether a large language model, guided by an evaluator and evolutionary search, can automatically discover improved lexical retrieval algorithms. We introduce RankEvolve, a program evolution setup based on AlphaEvolve, in which candidate ranking algorithms are represented as executable code and iteratively mutated, recombined, and selected based on retrieval performance across 12 IR datasets from BEIR and BRIGHT. RankEvolve starts from two seed programs: BM25 and query likelihood with Dirichlet smoothing. The evolved algorithms are novel, effective, and show promising transfer to the full BEIR and BRIGHT benchmarks as well as TREC DL 19 and 20. Our results suggest that evaluator-guided LLM program evolution is a practical path towards automatic discovery of novel ranking algorithms.
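The evaluator-guided evolution loop can be captured in a few lines; `mutate_with_llm` and `evaluate` below are hypothetical hooks standing in for the LLM editing step and the 12-dataset retrieval evaluation, and the selection scheme is a generic illustration rather than AlphaEvolve's exact procedure.

```python
import random

def rank_evolve(seed_programs, mutate_with_llm, evaluate,
                generations=50, pop_size=20):
    """Evaluator-guided evolution: ranking functions are code strings,
    an LLM proposes edits, and dev-set retrieval quality drives selection."""
    population = [(prog, evaluate(prog)) for prog in seed_programs]
    for _ in range(generations):
        parents = random.sample(population, k=min(2, len(population)))
        child = mutate_with_llm([prog for prog, _ in parents])  # LLM mutation/recombination
        population.append((child, evaluate(child)))             # fitness, e.g. mean nDCG
        population.sort(key=lambda pair: pair[1], reverse=True)
        population = population[:pop_size]                      # keep the fittest
    return population[0]                                        # best program and score
```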
Human-Computer Interaction
[HC-0] The Effectiveness of a Virtual Reality-Based Training Program for Improving Body Awareness in Children with Attention Deficit and Hyperactivity Disorder
【Quick Read】: This study addresses deficits in body awareness among children with attention deficit hyperactivity disorder (ADHD), particularly in spatial awareness, body-part identification, and motor expression. The key to its solution is a structured virtual reality (VR)-based training program: the immersive, interactive environment provides a safe, engaging, and customizable intervention that effectively improves the children's body-mind coordination and psychomotor function, with follow-up results confirming the stability and persistence of the effects over time.
Link: https://arxiv.org/abs/2602.17649
Authors: Aya Abdelnaem El-Basha, Ebtsam ELSayed Mahmoud ELSayes, Ahmad Al-Kabbany
Affiliations: Damanhour University; VRapeutic Inc.; Arab Academy for Science and Technology
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:This study investigates the effectiveness of a Virtual Reality (VR)-based training program in improving body awareness among children with Attention Deficit Hyperactivity Disorder (ADHD). Utilizing a quasi-experimental design, the research sample consisted of 10 children aged 4 to 7 years, with IQ scores ranging from 90 to 110. Participants were divided into an experimental group and a control group, with the experimental group receiving a structured VR intervention over three months, totaling 36 sessions. Assessment tools included the Stanford-Binet Intelligence Scale (5th Edition), the Conners Test for ADHD, and a researcher-prepared Body Awareness Scale. The results indicated statistically significant differences between pre-test and post-test scores for the experimental group, demonstrating the program’s efficacy in enhancing spatial awareness, body part identification, and motor expressions. Furthermore, follow-up assessments conducted one month after the intervention revealed no significant differences from the post-test results, confirming the sustainability and continuity of the program’s effects over time. The findings suggest that immersive VR environments provide a safe, engaging, and effective therapeutic medium for addressing psychomotor deficits in early childhood ADHD.
[HC-1] Modeling Distinct Human Interaction in Web Agents
【Quick Read】: This paper addresses the lack of structured modeling of human intervention in autonomous web agents, which either proceed blindly past critical decision points or request unnecessary confirmations, hurting human-agent collaboration. The key to its solution is CowCorpus, a dataset of 400 real user-agent web navigation trajectories containing over 4,200 interleaved human and agent actions, from which four intervention patterns are identified: hands-off supervision, hands-on oversight, collaborative task-solving, and full user takeover. Language models (LMs) trained on these patterns to predict when users are likely to intervene improve intervention-prediction accuracy by 61.4-63.4% over base models, and deploying them in live web agents yields a 26.5% increase in user-rated agent usefulness, validating that structured modeling of human intervention leads to more adaptive, collaborative agents.
Link: https://arxiv.org/abs/2602.17588
Authors: Faria Huq, Zora Zhiruo Wang, Zhanqiu Guo, Venu Arvind Arangarajan, Tianyue Ou, Frank Xu, Shuyan Zhou, Graham Neubig, Jeffrey P. Bigham
Affiliations: Carnegie Mellon University; Duke University
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Preprint
Abstract:Despite rapid progress in autonomous web agents, human involvement remains essential for shaping preferences and correcting agent behavior as tasks unfold. However, current agentic systems lack a principled understanding of when and why humans intervene, often proceeding autonomously past critical decision points or requesting unnecessary confirmation. In this work, we introduce the task of modeling human intervention to support collaborative web task execution. We collect CowCorpus, a dataset of 400 real-user web navigation trajectories containing over 4,200 interleaved human and agent actions. We identify four distinct patterns of user interaction with agents – hands-off supervision, hands-on oversight, collaborative task-solving, and full user takeover. Leveraging these insights, we train language models (LMs) to anticipate when users are likely to intervene based on their interaction styles, yielding a 61.4-63.4% improvement in intervention prediction accuracy over base LMs. Finally, we deploy these intervention-aware models in live web navigation agents and evaluate them in a user study, finding a 26.5% increase in user-rated agent usefulness. Together, our results show structured modeling of human intervention leads to more adaptive, collaborative agents.
[HC-2] What Do LLMs Associate with Your Name? A Human-Centered Black-Box Audit of Personal Data
【Quick Read】: This paper addresses the privacy risks of personal data (PD) exposure in large language models (LLMs) during pre-training and user interactions, and users' lack of insight into how strongly models associate specific information with their identity. The key to its solution is LMP2 (Language Model Privacy Probe), a human-centered, privacy-preserving audit tool refined through two formative studies (N=20) and validated in two studies with EU residents (N1=155, N2=303) that quantify the accuracy of model-generated PD and user reactions. The studies show that models such as GPT-4o generate 11 PD features (e.g., gender, hair color, languages) with 60% or more accuracy, and that 72% of participants want control over what models associate with their name, prompting a rethink of what counts as PD and how far data privacy rights should extend to LLMs.
Link: https://arxiv.org/abs/2602.17483
Authors: Dimitri Staufer, Kirsten Morehouse
Affiliations: TU Berlin; Columbia University
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:
Abstract:Large language models (LLMs), and conversational agents based on them, are exposed to personal data (PD) during pre-training and during user interactions. Prior work shows that PD can resurface, yet users lack insight into how strongly models associate specific information to their identity. We audit PD across eight LLMs (3 open-source; 5 API-based, including GPT-4o), introduce LMP2 (Language Model Privacy Probe), a human-centered, privacy-preserving audit tool refined through two formative studies (N=20), and run two studies with EU residents to capture (i) intuitions about LLM-generated PD (N1=155) and (ii) reactions to tool output (N2=303). We show empirically that models confidently generate multiple PD categories for well-known individuals. For everyday users, GPT-4o generates 11 features with 60% or more accuracy (e.g., gender, hair color, languages). Finally, 72% of participants sought control over model-generated associations with their name, raising questions about what counts as PD and whether data privacy rights should extend to LLMs.
[HC-3] ShadAR: LLM-driven shader generation to transform visual perception in Augmented Reality
【Quick Read】: This paper addresses the inflexibility of current augmented reality (AR) visual-perception simulations, which require developers to predefine each visual effect. The key to the proposed ShadAR system is a shader-generation pipeline driven by large language models (LLMs): users express visual intent in natural language, the LLM interprets it and generates corresponding shader code, and the shader is compiled in real time to modify the AR headset viewport, enabling more inclusive and creative transformations of visual perception.
Link: https://arxiv.org/abs/2602.17481
Authors: Yanni Mei, Samuel Wendt, Florian Mueller, Jan Gugenheimer
Affiliations: TU Darmstadt
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Augmented Reality (AR) can simulate various visual perceptions, such as how individuals with colorblindness see the world. However, these simulations require developers to predefine each visual effect, limiting flexibility. We present ShadAR, an AR application enabling real-time transformation of visual perception through shader generation using large language models (LLMs). ShadAR allows users to express their visual intent via natural language, which is interpreted by an LLM to generate corresponding shader code. This shader is then compiled in real time to modify the AR headset viewport. We present our LLM-driven shader generation pipeline and demonstrate its ability to transform visual perception for inclusiveness and creativity.
[HC-4] Auditing Reciprocal Sentiment Alignment: Inversion Risk Dialect Representation and Intent Misalignment in Transformers
【Quick Read】: This paper addresses cross-lingual sentiment misalignment, in particular semantic and affective distortion between Bengali and English, where current AI systems fail to accurately understand human intent and behave trustworthily in multilingual settings. The key to its solution is proposing and validating an "Affective Stability" metric as a new alignment criterion that explicitly penalizes polarity inversions (the Sentiment Inversion Rate) in low-resource and dialectal contexts, arguing for pluralistic, culturally grounded alignment rather than reliance on a single compressed model (such as mDistilBERT) whose compression distorts sentiment.
Link: https://arxiv.org/abs/2602.17469
Authors: Nusrat Jahan Lia, Shubhashis Roy Dipta
Affiliations: Institute of Information Technology, University of Dhaka; University of Maryland, Baltimore County
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:
Abstract:The core theme of bidirectional alignment is ensuring that AI systems accurately understand human intent and that humans can trust AI behavior. However, this loop fractures significantly across language barriers. Our research addresses Cross-Lingual Sentiment Misalignment between Bengali and English by benchmarking four transformer architectures. We reveal severe safety and representational failures in current alignment paradigms. We demonstrate that the compressed model (mDistilBERT) exhibits a 28.7% “Sentiment Inversion Rate,” fundamentally misinterpreting positive user intent as negative (or vice versa). Furthermore, we identify systemic nuances affecting human-AI trust, including “Asymmetric Empathy” where some models systematically dampen and others amplify the affective weight of Bengali text relative to its English counterpart. Finally, we reveal a “Modern Bias” in the regional model (IndicBERT), which shows a 57% increase in alignment error when processing formal (Sadhu) Bengali. We argue that equitable human-AI co-evolution requires pluralistic, culturally grounded alignment that respects language and dialectal diversity over universal compression, which fails to preserve the emotional fidelity required for reciprocal human-AI trust. We recommend that alignment benchmarks incorporate “Affective Stability” metrics that explicitly penalize polarity inversions in low-resource and dialectal contexts.
[HC-5] Do Hackers Dream of Electric Teachers?: A Large-Scale In-Situ Evaluation of Cybersecurity Student Behaviors and Performance with AI Tutors
【Quick Read】: This paper addresses the open question of how generative AI tutors affect student behavior and learning outcomes in real, large-scale cybersecurity courses, especially in hands-on, exploitation-oriented settings such as capture-the-flag exercises, where no prior work had systematically linked usage patterns to learning benefits. The key to its solution is a semester-long observational study of an embedded AI tutor in an upper-division introductory cybersecurity course with 309 students: by analyzing 142,526 student queries across 396 challenges, it identifies three main conversational styles (Short, Reactive, Proactive), shows that these strategies significantly predict challenge completion, with the effect growing on harder material, and derives evidence-based recommendations for security educators and tool developers on practical AI tutor use.
Link: https://arxiv.org/abs/2602.17448
Authors: Michael Tompkins, Nihaarika Agarwal, Ananta Soneji, Robert Wasinger, Connor Nelson, Kevin Leach, Rakibul Hasan, Adam Doupé, Daniel Votipka, Yan Shoshitaishvili, Jaron Mink
Affiliations: Arizona State University; Vanderbilt University; Tufts University
Subjects: Human-Computer Interaction (cs.HC)
Comments: 33 pages, 7 figures
Abstract:To meet the ever-increasing demands of the cybersecurity workforce, AI tutors have been proposed for personalized, scalable education. But, while AI tutors have shown promise in introductory programming courses, no work has evaluated their use in hands-on exploration and exploitation of systems (e.g., “capture-the-flag”) commonly used to teach cybersecurity. Thus, despite growing interest and need, no work has evaluated how students use AI tutors or whether they benefit from their presence in real, large-scale cybersecurity courses. To answer this, we conducted a semester-long observational study on the use of an embedded AI tutor with 309 students in an upper-division introductory cybersecurity course. By analyzing 142,526 student queries sent to the AI tutor across 396 cybersecurity challenges spanning 9 core cybersecurity topics and an accompanying set of post-semester surveys, we find (1) what queries and conversational strategies students use with AI tutors, (2) how these strategies correlate with challenge completion, and (3) students’ perceptions of AI tutors in cybersecurity education. In particular, we identify three broad AI tutor conversational styles among users: Short (bounded, few-turn exchanges), Reactive (repeatedly submitting code and errors), and Proactive (driving problem-solving through targeted inquiry). We also find that the use of these styles significantly predicts challenge completion, and that this effect increases as materials become more advanced. Furthermore, students valued the tutor’s availability but reported that it became less useful for harder material. Based on this, we provide suggestions for security educators and developers on practical AI tutor use.
[HC-6] PersonaMail: Learning and Adapting Personal Communication Preferences for Context-Aware Email Writing
【Quick Read】: This paper addresses the difficulty generative AI has in capturing interpersonal nuance, especially in email writing, where effectiveness depends not only on fluency but on precise alignment with intent, relationship, and context. Existing systems often miss these subtleties, producing generic, poorly targeted text. The key to the proposed PersonaMail system is structured exploration of communication factors, granular editing controls, and adaptive reuse of successful tone strategies, which together improve efficiency and satisfaction in both immediate and repeated use.
Link: https://arxiv.org/abs/2602.17340
Authors: Rui Yao, Qiuyuan Ren, Felicia Fang-Yi Tan, Chen Yang, Xiaoyu Zhang, Shengdong Zhao
Affiliations: City University of Hong Kong
Subjects: Human-Computer Interaction (cs.HC)
Comments: 21 pages, 5 figures. Accepted to the 31st International Conference on Intelligent User Interfaces (IUI 26), March 23-26, 2026, Paphos, Cyprus
Abstract:LLM-assisted writing has seen rapid adoption in interpersonal communication, yet current systems often fail to capture the subtle tones essential for effectiveness. Email writing exemplifies this challenge: effective messages require careful alignment with intent, relationship, and context beyond mere fluency. Through formative studies, we identified three key challenges: articulating nuanced communicative intent, making modifications at multiple levels of granularity, and reusing effective tone strategies across messages. We developed PersonaMail, a system that addresses these gaps through structured communication factor exploration, granular editing controls, and adaptive reuse of successful strategies. Our evaluation compared PersonaMail against standard LLM interfaces, and showed improved efficiency in both immediate and repeated use, alongside higher user satisfaction. We contribute design implications for AI-assisted communication systems that prioritize interpersonal nuance over generic text generation.
[HC-7] NotebookRAG: Retrieving Multiple Notebooks to Augment the Generation of EDA Notebooks for Crowd-Wisdom
【Quick Read】: This paper addresses the problem that automated exploratory data analysis (EDA) produces weak analysis plans and visualizations when user intent is abstract, while seeking to exploit the structured knowledge embedded in the vast collection of analysis notebooks scattered across platforms and organizations. The key to the proposed NotebookRAG method is twofold: first, code cells are transformed into context-enriched executable components, which improves retrieval quality and enables reruns on new data to produce updated visualizations and reliable insights; second, an agent leverages the enhanced retrieval content to construct effective EDA plans, derive insights, and generate appropriate visualizations, moving from static documents to dynamic knowledge reuse and markedly improving the effectiveness and intent alignment of automated EDA.
Link: https://arxiv.org/abs/2602.17215
Authors: Yi Shan, Yixuan He, Zekai Shao, Kai Xu, Siming Chen
Affiliations: Fudan University, China; University of Nottingham, UK
Subjects: Human-Computer Interaction (cs.HC)
Comments: 11 pages, 7 figures
Abstract:High-quality exploratory data analysis (EDA) is essential in the data science pipeline, but remains highly dependent on analysts’ expertise and effort. While recent LLM-based approaches partially reduce this burden, they struggle to generate effective analysis plans and appropriate insights and visualizations when user intent is abstract. Meanwhile, a vast collection of analysis notebooks produced across platforms and organizations contains rich analytical knowledge that can potentially guide automated EDA. Retrieval-augmented generation (RAG) provides a natural way to leverage such corpora, but general methods often treat notebooks as static documents and fail to fully exploit their potential knowledge for automating EDA. To address these limitations, we propose NotebookRAG, a method that takes user intent, datasets, and existing notebooks as input to retrieve, enhance, and reuse relevant notebook content for automated EDA generation. For retrieval, we transform code cells into context-enriched executable components, which improve retrieval quality and enable reruns with new data to generate updated visualizations and reliable insights. For generation, an agent leverages enhanced retrieval content to construct effective EDA plans, derive insights, and produce appropriate visualizations. Evidence from a user study with 24 participants confirms the superiority of our method in producing high-quality and intent-aligned EDA notebooks.
[HC-8] The Bots of Persuasion: Examining How Conversational Agents' Linguistic Expressions of Personality Affect User Perceptions and Decisions
【Quick Read】: This paper asks how the personalities that LLM-powered conversational agents (CAs) project through language affect user decisions and perceptions in a charitable-giving context. The key is a crowdsourced study in which 360 participants interacted with one of eight CAs, each projecting a personality composed of three linguistic aspects: attitude (optimistic/pessimistic), authority (authoritative/submissive), and reasoning (emotional/rational). Although the CA's composite personality did not significantly change donation decisions, it significantly affected participants' emotional state, trust in the CA, perceived competence, and situational empathy, with trust, competence, and empathy emerging as key predictors of donation behavior. This highlights the risk of CAs as instruments of manipulation that steer user perceptions and responses through subtle linguistic personality design.
Link: https://arxiv.org/abs/2602.17185
Authors: Uğur Genç, Heng Gu, Chadha Degachi, Evangelos Niforatos, Senthil Chandrasegaran, Himanshu Verma
Affiliations: Delft University of Technology
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted to be presented at CHI'26 in Barcelona
Abstract:Large Language Model-powered conversational agents (CAs) are increasingly capable of projecting sophisticated personalities through language, but how these projections affect users is unclear. We thus examine how CA personalities expressed linguistically affect user decisions and perceptions in the context of charitable giving. In a crowdsourced study, 360 participants interacted with one of eight CAs, each projecting a personality composed of three linguistic aspects: attitude (optimistic/pessimistic), authority (authoritative/submissive), and reasoning (emotional/rational). While the CA’s composite personality did not affect participants’ decisions, it did affect their perceptions and emotional responses. Particularly, participants interacting with pessimistic CAs reported a lower emotional state and lower affinity towards the cause, perceived the CA as less trustworthy and less competent, and yet tended to donate more toward the charity. Perceptions of trust, competence, and situational empathy significantly predicted donation decisions. Our findings emphasize the risks CAs pose as instruments of manipulation, subtly influencing user perceptions and decisions.
[HC-9] Understanding Nature Engagement Experiences of Blind People
【Quick Read】: This paper addresses the lack of research on how blind people experience and relate to nature, in particular how they perceive, engage with, and form emotional connections to nature without vision. Through a survey (N=20 blind vs. N=20 sighted) and in-depth interviews (N=16 blind), the study finds that blind participants report lower overall nature relatedness than sighted peers, and that their engagement is shaped by environmental accessibility, safety concerns, and social support. The key contribution is identifying blind people's distinctive needs and values around nature experiences and deriving design implications for future assistive technologies that support safe, meaningful engagement, advancing inclusive nature-interaction technology.
Link: https://arxiv.org/abs/2602.17093
Authors: Mengjie Tang, Xinman Li, Juxiao Zhang, Franklin Mingzhe Li, Zhuying Li
Affiliations: Southeast University; Nanjing Normal University of Special Education; Carnegie Mellon University
Subjects: Human-Computer Interaction (cs.HC)
Comments: CHI 2026 Full Paper
Abstract:Nature plays a crucial role in human health and well-being, but little is known about how blind people experience and relate to it. We conducted a survey of nature relatedness with blind (N=20) and sighted (N=20) participants, along with in-depth interviews with 16 blind participants, to examine how blind people engage with nature and the factors shaping this engagement. Our survey results revealed lower levels of nature relatedness among blind participants compared to sighted peers. Our interview study further highlighted: 1) current practices and challenges of nature engagement, 2) attitudes and values that shape engagement, and 3) expectations for assistive technologies that support safe and meaningful engagement. We also provide design implications to guide future technologies that support nature engagement for blind people. Overall, our findings illustrate how blind people experience nature beyond vision and lay a foundation for technologies that support inclusive nature engagement.
[HC-10] Rememo: A Research-through-Design Inquiry Towards an AI-in-the-loop Therapist's Tool for Dementia Reminiscence
【Quick Read】: This paper addresses the loss of relational support in technology-mediated reminiscence therapy (RT), where prior interventions over-rely on conversational agents that replace human facilitators even though RT's effectiveness depends heavily on the facilitator-patient relationship. The key to the proposed therapist-oriented tool, Rememo, is positioning generative AI as an aid that augments rather than replaces human facilitation, supporting personalized RT within Singapore's cultural and care infrastructure. The work emphasizes relational dynamics in human-AI collaboration and argues for reframing synthetic imagery in memory work from a record of truth to a therapeutic support, offering a sociotechnically aware framework for designing AI systems in care contexts.
Link: https://arxiv.org/abs/2602.17083
Authors: Celeste Seah, Yoke Chuan Lee, Jung-Joo Lee, Ching-Chiuan Yen, Clement Zheng
Affiliations: National University of Singapore; ECON Healthcare Group; CUTE Center
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Reminiscence therapy (RT) is a common non-pharmacological intervention in dementia care. Recent technology-mediated interventions have largely focused on people with dementia through solutions that replace human facilitators with conversational agents. However, the relational work of facilitation is critical in the effectiveness of RT. Hence, we developed Rememo, a therapist-oriented tool that integrates Generative AI to support and enrich human facilitation in RT. Our tool aims to support the infrastructural and cultural challenges that therapists in Singapore face. In this research, we contribute the Rememo system as a therapist’s tool for personalized RT developed through sociotechnically-aware research-through-design. Through studying this system in-situ, our research extends our understanding of human-AI collaboration for care work. We discuss the implications of designing AI-enabled systems that respect the relational dynamics in care contexts, and argue for a rethinking of synthetic imagery as a therapeutic support for memory rather than a record of truth.
[HC-11] StoryLensEdu: Personalized Learning Report Generation through Narrative-Driven Multi-Agent Systems
【Quick Read】: This paper addresses the poor interpretability, monotonous presentation, and limited educational value of current personalized feedback for self-regulated learning (SRL), such as text reports and learning analytics dashboards, which are often too abstract and weakly interactive to support reflection and strategy adjustment. The key to the proposed StoryLensEdu is a narrative-driven multi-agent system with three cooperating agents: a Data Analyst extracts key insights organized around learning objectives, a Teacher ensures educational relevance and offers actionable suggestions, and a Storyteller structures the information with the Hero's Journey narrative framework to aid understanding and emotional resonance; post-generation interactive question answering further improves explainability and engagement.
Link: https://arxiv.org/abs/2602.17067
Authors: Leixian Shen, Yan Luo, Rui Sheng, Yujia He, Haotian Li, Leni Yang, Huamin Qu
Affiliations: The Hong Kong University of Science and Technology
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Personalized feedback plays an important role in self-regulated learning (SRL), helping students track progress and refine their strategies. However, current common solutions, such as text-based reports or learning analytics dashboards, often suffer from poor interpretability, monotonous presentation, and limited explainability. To overcome these challenges, we present StoryLensEdu, a narrative-driven multi-agent system that automatically generates intuitive, engaging, and interactive learning reports. StoryLensEdu integrates three agents: a Data Analyst that extracts data insights based on a learning objective centered structure, a Teacher that ensures educational relevance and offers actionable suggestions, and a Storyteller that organizes these insights using the Hero’s Journey narrative framework. StoryLensEdu supports post-generation interactive question answering to improve explainability and user engagement. We conducted a formative study in a real high school and iteratively developed StoryLensEdu in collaboration with an e-learning team to inform our design. Evaluation with real users shows that StoryLensEdu enhances engagement and promotes a deeper understanding of the learning process.
[HC-12] IntentCUA: Learning Intent-level Representations for Skill Abstraction and Multi-Agent Planning in Computer-Use Agents AAMAS2026
【Quick Read】: This paper addresses the drift from user intent and the repeated re-solving of routine subproblems that plague computer-use agents over long horizons under noisy perception, multi-window contexts, and evolving environment states, causing error accumulation and inefficiency. The key to the proposed IntentCUA multi-agent framework is a shared intent-aligned plan memory that abstracts raw interaction traces into multi-view intent representations and extracts reusable skills. At runtime, intent prototypes retrieve subgroup-aligned skills and inject them into partial plans, reducing redundant re-planning and mitigating error propagation across desktop applications. Experiments show a 74.83% end-to-end task success rate and a Step Efficiency Ratio of 0.91, clearly surpassing RL-based and trajectory-centric baselines; ablations confirm that multi-view intent abstraction and shared plan memory jointly improve execution stability, with the cooperative multi-agent loop contributing the largest gains on long-horizon tasks.
Link: https://arxiv.org/abs/2602.17049
Authors: Seoyoung Lee, Seobin Yoon, Seongbeen Lee, Yoojung Chun, Dayoung Park, Doyeon Kim, Joo Yong Sim
Affiliations: Sookmyung Women's University
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Comments: 12 pages, 9 figures, AAMAS 2026
Abstract:Computer-use agents operate over long horizons under noisy perception, multi-window contexts, evolving environment states. Existing approaches, from RL-based planners to trajectory retrieval, often drift from user intent and repeatedly solve routine subproblems, leading to error accumulation and inefficiency. We present IntentCUA, a multi-agent computer-use framework designed to stabilize long-horizon execution through intent-aligned plan memory. A Planner, Plan-Optimizer, and Critic coordinate over shared memory that abstracts raw interaction traces into multi-view intent representations and reusable skills. At runtime, intent prototypes retrieve subgroup-aligned skills and inject them into partial plans, reducing redundant re-planning and mitigating error propagation across desktop applications. In end-to-end evaluations, IntentCUA achieved a 74.83% task success rate with a Step Efficiency Ratio of 0.91, outperforming RL-based and trajectory-centric baselines. Ablations show that multi-view intent abstraction and shared plan memory jointly improve execution stability, with the cooperative multi-agent loop providing the largest gains on long-horizon tasks. These results highlight that system-level intent abstraction and memory-grounded coordination are key to reliable and efficient desktop automation in large, dynamic environments.
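The runtime skill-injection step amounts to prototype matching in intent-embedding space; the sketch below assumes a dict of intent prototypes and a skill bank keyed by subgroup, which is an inferred data layout rather than the released implementation.

```python
import numpy as np

def retrieve_skills(intent_vec, prototypes, skill_bank, top_k=3):
    """Match the current intent embedding to the nearest stored prototype
    (cosine similarity) and return that subgroup's reusable skills."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    best_group = max(prototypes, key=lambda g: cos(intent_vec, prototypes[g]))
    return skill_bank[best_group][:top_k]  # skills injected into the partial plan
```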
[HC-13] Wink: Recovering from Misbehaviors in Coding Agents
【Quick Read】: This paper addresses disruptions to development workflows caused by misbehaving autonomous coding agents, including deviating from instructions, getting stuck in repetitive loops, and failing tool calls, collectively termed agentic misbehaviors. From an analysis of production traffic, the authors identify three primary categories, Specification Drift, Reasoning Problems, and Tool Call Failures, which occur in about 30% of all agent trajectories. The key to the solution is Wink, a lightweight, asynchronous self-intervention system that observes agent trajectories and provides targeted course-correction guidance to steer the agent back onto a productive path. Evaluations show Wink resolves 90% of misbehaviors that require a single intervention, and a production A/B test shows significant reductions in tool call failures, tokens per session, and engineer interventions per session, validating its value for building scalable, resilient agentic systems.
Link: https://arxiv.org/abs/2602.17037
Authors: Rahul Nanda, Chandra Maddila, Smriti Jha, Euna Mehnaz Khan, Matteo Paltenghi, Satish Chandra
Affiliations: Meta Platforms, Inc.
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Programming Languages (cs.PL)
Comments:
Abstract:Autonomous coding agents, powered by large language models (LLMs), are increasingly being adopted in the software industry to automate complex engineering tasks. However, these agents are prone to a wide range of misbehaviors, such as deviating from the user’s instructions, getting stuck in repetitive loops, or failing to use tools correctly. These failures disrupt the development workflow and often require resource-intensive manual intervention. In this paper, we present a system for automatically recovering from agentic misbehaviors at scale. We first introduce a taxonomy of misbehaviors grounded in an analysis of production traffic, identifying three primary categories: Specification Drift, Reasoning Problems, and Tool Call Failures, which we find occur in about 30% of all agent trajectories. To address these issues, we developed a lightweight, asynchronous self-intervention system named Wink. Wink observes agent trajectories and provides targeted course-correction guidance to nudge the agent back to a productive path. We evaluated our system on over 10,000 real world agent trajectories and found that it successfully resolves 90% of the misbehaviors that require a single intervention. Furthermore, a live A/B test in our production environment demonstrated that our system leads to a statistically significant reduction in Tool Call Failures, Tokens per Session and Engineer Interventions per Session. We present our experience designing and deploying this system, offering insights into the challenges of building resilient agentic systems at scale.
[HC-14] “It's like a pet…but my pet doesn't collect data about me”: Multi-person Households' Privacy Design Preferences for Household Robots
【Quick Read】: This paper addresses the privacy risks raised by increasingly capable household robots, including users' distrust of data collection and sharing, and the gap that prior work has not considered privacy design needs in multi-user households. The key to its solution is a participatory design approach: drawing on in-person sessions with 15 households, the study distills users' core demands for authority over their data, accessible controls and notification systems, and customization over time, and synthesizes them into actionable design recommendations that strengthen users' trust in and control over robot data privacy.
Link: https://arxiv.org/abs/2602.16975
Authors: Jennica Li, Shirley Zhang, Dakota Sullivan, Bengisu Cagiltay, Heather Kirkorian, Bilge Mutlu, Kassem Fawaz
Affiliations: University of Wisconsin–Madison; Koç University Istanbul
Subjects: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Comments: 13 pages (main body), 2 figures
Abstract:Household robots boasting mobility, more sophisticated sensors, and powerful processing models have become increasingly prevalent in the commercial market. However, these features may expose users to unwanted privacy risks, including unsolicited data collection and unauthorized data sharing. While security and privacy researchers thus far have explored people’s privacy concerns around household robots, literature investigating people’s preferred privacy designs and mitigation strategies is still limited. Additionally, the existing literature has not yet accounted for multi-user perspectives on privacy design and household robots. We aimed to fill this gap by conducting in-person participatory design sessions with 15 households to explore how they would design a privacy-aware household robot based on their concerns and expectations. We found that participants did not trust that robots, or their respective manufacturers, would respect the data privacy of household members or operate in a multi-user ecosystem without jeopardizing users’ personal data. Based on these concerns, they generated designs that gave them authority over their data, contained accessible controls and notification systems, and could be customized and tailored to suit the needs and preferences of each user over time. We synthesize our findings into actionable design recommendations for robot manufacturers and developers.
[HC-15] Nudging Attention to Workplace Meeting Goals: A Large-Scale Preregistered Field Experiment
【Quick Read】: This paper addresses the pervasiveness of ineffective meetings and the lack of integrated support for clarifying meeting goals in current collaboration platforms. The key to its solution is a lightweight goal-reflection intervention: brief pre-meeting surveys nudge attention to the goals of upcoming meetings. Although the intervention's effect on meeting effectiveness was not statistically significant, mixed-methods findings revealed improvements in self-reported awareness and behavior in both groups, with the post-meeting surveys unintentionally functioning as an intervention themselves, highlighting the promise of designing technology to support meeting intentionality.
Link: https://arxiv.org/abs/2602.16939
Authors: Lev Tankelevitch, Ava Elizabeth Scott, Nagaravind Challakere, Payod Panda, Sean Rintel
Affiliations: Microsoft Research; University of Copenhagen; Microsoft
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Ineffective meetings are pervasive. Thinking ahead explicitly about meeting goals may improve effectiveness, but current collaboration platforms lack integrated support. We tested a lightweight goal-reflection intervention in a preregistered field experiment in a global technology company (361 employees, 7196 meetings). Over two weeks, workers in the treatment group completed brief pre-meeting surveys in their collaboration platform, nudging attention to goals for upcoming meetings. To measure impact, both treatment and control groups completed post-meeting surveys about meeting effectiveness. While the intervention impact on meeting effectiveness was not statistically significant, mixed-methods findings revealed improvements in self-reported awareness and behaviour across both groups, with post-meeting surveys unintentionally functioning as an intervention. We highlight the promise of supporting goal reflection, while noting challenges of evaluating and supporting workplace reflection for meetings, including workflow and collaboration norms, and attitudes and behaviours around meeting preparation. We conclude with implications for designing technological support for meeting intentionality.
[HC-16] Say It My Way: Exploring Control in Conversational Visual Question Answering with Blind Users
【速读】:该论文旨在解决盲人用户在使用辅助性视觉问答(Assistive Visual Question Answering, Assistive VQA)系统时,因交互模式僵化、缺乏个性化定制而导致的响应不匹配问题。现有系统多基于通用生成式AI(Generative AI)技术,但未充分考虑盲人群体对灵活性和上下文适配性的需求,导致其在实际使用中难以满足特定任务目标。解决方案的关键在于通过引入提示工程(Prompt Engineering)等定制化技术,使用户能够主动调整与系统的交互方式,从而绕过系统在冗余输出、空间/时间距离估计能力不足、图像框架不可访问及摄像头引导缺失等方面的局限。研究通过对11名盲用户的418次交互记录和访谈分析,验证了提示策略的有效性,并为VQA系统的查询层与系统层交互设计提供了实证依据。
链接: https://arxiv.org/abs/2602.16930
作者: Farnaz Zamiri Zeraati,Yang Trista Cao,Yuehan Qiao,Hal Daumé III,Hernisa Kacorri
机构: University of Maryland College Park (马里兰大学学院公园分校); University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Preprint, Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems
Abstract:Prompting and steering techniques are well established in general-purpose generative AI, yet assistive visual question answering (VQA) tools for blind users still follow rigid interaction patterns with limited opportunities for customization. User control can be helpful when system responses are misaligned with their goals and contexts, a gap that becomes especially consequential for blind users that may rely on these systems for access. We invite 11 blind users to customize their interactions with a real-world conversational VQA system. Drawing on 418 interactions, reflections, and post-study interviews, we analyze prompting-based techniques participants adopted, including those introduced in the study and those developed independently in real-world settings. VQA interactions were often lengthy: participants averaged 3 turns, sometimes up to 21, with input text typically tenfold shorter than the responses they heard. Built on state-of-the-art LLMs, the system lacked verbosity controls, was limited in estimating distance in space and time, relied on inaccessible image framing, and offered little to no camera guidance. We discuss how customization techniques such as prompt engineering can help participants work around these limitations. Alongside a new publicly available dataset, we offer insights for interaction design at both query and system levels.
[HC-17] Evidotes: Integrating Scientific Evidence and Anecdotes to Support Uncertainties Triggered by Peer Health Posts
【速读】:该论文旨在解决健康社交平台中用户因阅读同伴健康分享(peer health posts)而产生的信息不确定性和情绪负担问题。现有研究主要关注提升内容的相关性和准确性,但忽视了用户多样化的信息需求和由此引发的情绪反应。其解决方案的关键在于引入Evidotes系统,通过三种可选的“信息透镜”(dive deeper、focus on positivity、big picture)对单个帖子进行科学证据与个人经验的增强式补充,从而实现信息增广(information augmentation)。该设计不仅显著提升了用户的自我报告信息满意度(从3.2升至4.6),降低了情感成本(从3.4降至1.9),还通过共现不同来源的信息,促成科学证据与个体故事之间的协同效应(information symbiosis):前者使后者更具可理解性与情境化,后者则帮助前者实现筛选与泛化,最终助力用户更有效地应对健康不确定性。
链接: https://arxiv.org/abs/2602.16900
作者: Shreya Bali,Riku Arakawa,Peace Odiase,Tongshuang Wu,Mayank Goel
机构: Carnegie Mellon University (卡内基梅隆大学); University of Pittsburgh (匹兹堡大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Peer health posts surface new uncertainties, such as questions and concerns for readers. Prior work focused primarily on improving relevance and accuracy fails to address users’ diverse information needs and emotions triggered. Instead, we propose directly addressing these by information augmentation. We introduce Evidotes, an information support system that augments individual posts with relevant scientific and anecdotal information retrieved using three user-selectable lenses (dive deeper, focus on positivity, and big picture). In a mixed-methods study with 17 chronic illness patients, Evidotes improved self-reported information satisfaction (3.2-4.6) and reduced self-reported emotional cost (3.4-1.9) compared to participants’ baseline browsing. Moreover, by co-presenting sources, Evidotes unlocked information symbiosis: anecdotes made research accessible and contextual, while research helped filter and generalize peer stories. Our work enables an effective integration of scientific evidence and human anecdotes to help users better manage health uncertainty.
[HC-18] Connecting the Dots: Surfacing Structure in Documents through AI-Generated Cross-Modal Links
【速读】:该论文旨在解决信息密集型文档(如科学论文和食谱)中,读者难以在文本、图表、表格等多模态内容之间定位、理解并建立关联的问题。此类文档通常篇幅较长且包含专业术语,导致信息检索困难,知识整合效率低下。现有工具对跨媒体信息的整合支持有限,使复杂内容的理解仍具有较高的认知负荷。解决方案的关键在于提出一个细粒度的信息整合框架,并将其具象化为增强型阅读界面:通过在图表上添加可点击标记、在正文实现交互式高亮以及设置持续可见的参考面板,使用户无需手动滚动即可获取整合后的细节信息。实验表明,使用该工具的参与者在阅读测验中得分显著更高,且未增加完成时间或认知负荷,验证了细粒度整合方法在提升复杂材料理解效率方面的有效性。
链接: https://arxiv.org/abs/2602.16895
作者: Alyssa Hwang,Hita Kambhamettu,Yue Yang,Ajay Patel,Joseph Chee Chang,Andrew Head
机构: University of Pennsylvania (宾夕法尼亚大学); Allen Institute for AI (艾伦人工智能研究所)
类目: Human-Computer Interaction (cs.HC)
备注: 40 pages, 16 figures
Abstract:Understanding information-dense documents like recipes and scientific papers requires readers to find, interpret, and connect details scattered across text, figures, tables, and other visual elements. These documents are often long and filled with specialized terminology, hindering the ability to locate relevant information or piece together related ideas. Existing tools offer limited support for synthesizing information across media types. As a result, understanding complex material remains cognitively demanding. This paper presents a framework for fine-grained integration of information in complex documents. We instantiate the framework in an augmented reading interface, which populates a scientific paper with clickable points on figures, interactive highlights in the body text, and a persistent reference panel for accessing consolidated details without manual scrolling. In a controlled between-subjects study, we find that participants who read the paper with our tool achieved significantly higher scores on a reading quiz without evidence of increased time to completion or cognitive load. Fine-grained integration provides a systematic way of revealing relationships within a document, supporting engagement with complex, information-dense materials.
[HC-19] CreateAI: Insights from an NSF Workshop on K12 Students, Teachers, and Families as Designers of Artificial Intelligence and Machine Learning Applications
【速读】:该论文试图解决的问题是:如何将人工智能(Artificial Intelligence, AI)和机器学习(Machine Learning, ML)教育从单纯的使用者培养转向创造者培养,即让K-12学生和教师不仅掌握AI工具的使用,还能成为AI/ML应用的开发者与创新者。其解决方案的关键在于构建以“创造”为核心的教育框架——通过设计适配青少年认知水平的AI/ML开发工具、明确学习路径与能力进阶机制,促进课堂整合;同时强调伦理教育的嵌入式实践,支持学生在真实情境中开展负责任的AI创作,并通过多元评估手段建立教师知识储备与学生创新能力的基准,从而推动教育系统向更具批判性、参与性和赋能性的AI素养发展转型。
链接: https://arxiv.org/abs/2602.16894
作者: Yasmin Kafai,José Ramón Lizárraga,R. Benjamin Shapiro
机构: 未知
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:
Abstract:In response to the exponential growth in the use of artificial intelligence and machine learning applications, educators, researchers and policymakers have taken steps to integrate artificial intelligence applications into K-12 education. Among these efforts, one equally important approach has received little, if any, attention: What if students and teachers were not just learning to be competent users of AI but also its creators? This question is at the heart of CreateAI in which K12 educators, researchers, and learning scientists addressed the following questions: (1) What tools, skills, and knowledge will empower students and teachers to build their own AI/ML applications? (2) How can we integrate these approaches into classrooms? and (3) What new possibilities for learning emerge when students and teachers become innovators and creators? In the report we provide recommendations for what tools designed for creating AI/ML applications should address in terms of design features, and learner progression in investigations. To promote effective learning and teaching of creating AI applications, we also need to help students and teachers select appropriate tools. We outline how we need to develop a better understanding of learning practices and funds of knowledge to support youth as they create and evaluate AI/ML applications. This also includes engaging youth in learning about ethics critically, in ways that are authentic, empowering, and relevant throughout the design process. Here we advocate for the integration of ethics in the curriculum. We also address what teachers need to know and how assessments can help establish baselines, include different instruments, and promote students as responsible creators of AI. Together, these recommendations provide important insights for preparing students to engage thoughtfully and critically with these technologies.
[HC-20] CalmReminder: A Design Probe for Parental Engagement with Children with Hyperactivity Augmented by Real-Time Motion Sensing with a Watch
【速读】:该论文旨在解决当前数字干预措施在帮助注意力缺陷多动障碍(ADHD)儿童家庭时存在的“一刀切”问题,即多数干预方案未能贴合家长的实际育儿实践,导致使用效果不佳。其解决方案的关键在于开发了一种基于智能手表的系统——CalmReminder,通过实时检测儿童的平静状态(calm moments),在恰当时机向家长推送个性化提示(just-in-time prompts)。研究发现,这种感知驱动的通知机制不仅被家长感知为在孩子情绪平稳时触发,还促使家长以多样化方式主动重构干预内容,如用于表扬、正念训练或活动规划等,从而体现出家长作为积极设计者(active designers)的角色。这一成果揭示了干预系统应注重支持用户自主性与灵活性的设计方向。
链接: https://arxiv.org/abs/2602.16893
作者: Riku Arakawa,Shreya Bali,Anupama Sitaraman,Woosuk Seo,Sam Shaaban,Oliver Lindheim,Traci M. Kennedy,Mayank Goel
机构: Carnegie Mellon University (卡内基梅隆大学); Yale University (耶鲁大学); NuRelm; University of Pittsburgh School of Medicine (匹兹堡大学医学院)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Families raising children with ADHD often experience heightened stress and reactive parenting. While digital interventions promise personalization, many remain one-size-fits-all and fail to reflect parents’ lived practices. We present CalmReminder, a watch-based system that detects children’s calm moments and delivers just-in-time prompts to parents. Through a four-week deployment with 16 families (twelve completed) of children with ADHD, we compared notification strategies ranging from hourly to random to only when the child was inferred to be calm. Our sensing-based notifications were frequently perceived as arriving during calm moments. More importantly, parents adopted the system in diverse ways: using notifications for praise, mindfulness, activity planning, or conversation. These findings show that parents are not passive recipients but active designers, reshaping interventions to fit their parenting styles. We contribute a calm detection pipeline, empirical insights into families’ flexible appropriation of notifications, and design implications for intervention systems that foster agency.
[HC-21] Expanding the Scope of Computational Thinking in Artificial Intelligence for K-12 Education
【速读】:该论文旨在解决如何在K-12教育中有效整合生成式人工智能(Generative AI)与机器学习技术到计算思维(Computational Thinking, CT)框架中的问题,以应对AI技术快速普及带来的教育挑战。其解决方案的关键在于拓展传统计算思维的内涵,使其不仅涵盖编程和算法逻辑,还纳入对AI系统原理、伦理影响及社会公平性的理解;同时,借鉴过去十年在课程设计、跨学科融合以及算法偏见与正义教育方面的实践经验,构建更具包容性和前瞻性的AI素养教学路径。
链接: https://arxiv.org/abs/2602.16890
作者: Yasmin Kafai,Shuchi Grover
机构: 未知
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: 1 figure
Abstract:The introduction of generative artificial intelligence applications to the public has led to heated discussions about its potential impacts and risks for K-12 education. One particular challenge has been to decide what students should learn about AI, and how this relates to computational thinking, which has served as an umbrella for promoting and introducing computing education in schools. In this paper, we situate in which ways we should expand computational thinking to include artificial intelligence and machine learning technologies. Furthermore, we discuss how these efforts can be informed by lessons learned from the last decade in designing instructional programs, integrating computing with other subjects, and addressing issues of algorithmic bias and justice in teaching computing in schools.
[HC-22] “Hello, I’m Delivering. Let Me Pass By”: Navigating Public Pathways with Walk-along with Robots in Crowded City Streets
【速读】:该论文试图解决当前人机交互(Human-Robot Interaction, HRI)研究中对公共空间内自主移动机器人(autonomous mobile robots)的实地研究方法不足的问题。现有研究多依赖受控实验或结构化观察方法(如“巫师奥兹”技术),难以应对现实场景中机器人自主导航、动态路径和不可预测环境带来的复杂性。解决方案的关键在于提出一种名为“与机器人同行”(Walk-Along with Robots, WawR)的新方法,该方法借鉴城市研究、地理学和社会学中的公共领域民族志(public realm ethnography),强调研究者以参与式观察的方式跟随机器人行动,从而获得更真实、深入的现场洞察,并为后续评估提供可操作的框架。
链接: https://arxiv.org/abs/2602.16861
作者: EunJeong Cheon,Do Yeon Shin
机构: 未知
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注:
Abstract:As the presence of autonomous robots in public spaces increases, whether navigating campus walkways or neighborhood sidewalks, understanding how to carefully study these robots becomes critical. While HRI research has conducted field studies in public spaces, these are often limited to controlled experiments with prototype robots or structured observational methods, such as the Wizard of Oz technique. However, the autonomous mobile robots we encounter today, particularly delivery robots, operate beyond the control of researchers, navigating dynamic routes and unpredictable environments. To address this challenge, a more deliberate approach is required. Drawing inspiration from public realm ethnography in urban studies, geography, and sociology, this paper proposes the Walk-Along with Robots (WawR) methodology. We outline the key features of this method, the steps we applied in our study, the unique insights it offers, and the ways it can be evaluated. We hope this paper stimulates further discussion on research methodologies for studying autonomous robots in public spaces.
[HC-23] “My body is not your Porn”: Identifying Trends of Harm and Oppression through a Sociotechnical Genealogy of Digital Sexual Violence in South Korea
【速读】:该论文旨在解决数字性暴力(Digital Sexual Violence, DSV)在韩国随数字技术演进而持续加剧的问题,特别是图像型DSV在不同技术时代中的形态演变、社会建构机制及其与性别不平等的深层关联。其解决方案的关键在于采用谱系学方法(genealogical approach),系统梳理从1990年代早期互联网时代到2020年代中期深度伪造(deepfake)丑闻的四个阶段典型案件,揭示DSV的三大相互关联维度:(1) 男性主导网络中通过共谋实践将受害者图像建构为“淫秽”(obscenity);(2) 技术隐蔽性增强导致受害者的伤害感知能力被削弱;(3) 去中心化经济基础设施推动滥用行为的商业化。这一分析框架不仅阐明了DSV作为动态社会技术配置的复杂演化路径,也为未来计算机支持的协同工作(CSCW)研究提供了理论方向与方法论启示。
链接: https://arxiv.org/abs/2602.16853
作者: Inha Cha,Yeonju Jang,Haesoo Kim,Joo Young Park,Seora Park,EunJeong Cheon
机构: Georgia Institute of Technology (佐治亚理工学院); Cornell University (康奈尔大学); KTH Royal Institute of Technology (瑞典皇家理工学院); Indiana University Bloomington (印第安纳大学布卢明顿分校); Syracuse University (锡拉丘兹大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Ever since the introduction of internet technologies in South Korea, digital sexual violence (DSV) has been a persistent and pervasive problem. Evolving alongside digital technologies, the severity and scale of violence have grown consistently, leading to widespread public concern. In this paper, we present four eras of image-based DSV in South Korea, spanning from the early internet era of the 1990s to the deepfake scandals in the mid-2020s. Drawing from media coverage, legal documents, and academic literature, we elucidate forms and characteristics of DSV cases in each era, tracing how entrenched misogyny is reconfigured and amplified through evolving technologies, alongside shifting legislative measures. Taking a genealogical approach to read prominent cases of different eras, our analysis identifies three constitutive and interconnected dimensions of DSV: (1) the homo-social fabrication of “obscenity”, wherein victims’ imagery becomes collectively framed as obscene through participatory practices in male-dominant networks; (2) the increasing imperceptibility of violence, as technologies foreclose victims’ ability to perceive harm; and (3) the commercialization of abuse through decentralized economic infrastructures. We suggest future directions for CSCW research, and further reflect on the value of the genealogical method in enabling non-linear understanding of DSV as dynamically evolving sociotechnical configurations of harm.
[HC-24] Overseeing Agents Without Constant Oversight: Challenges and Opportunities
【速读】:该论文旨在解决人类对代理型人工智能(Agentic AI)系统进行有效监督时面临的挑战,即如何设计合理的推理与动作追踪(trace),使其在信息丰富性与简洁性之间取得平衡,从而提升用户对系统输出的验证效率。其解决方案的关键在于提出一种新型界面设计,通过优化追踪信息的呈现方式,显著缩短用户发现错误所需的时间;尽管该设计提升了用户的决策信心,但并未显著改善最终准确性,揭示了人类验证过程中存在的深层问题,如内置假设管理、主观正确性标准变化以及过程透明度的重要性与局限性。
链接: https://arxiv.org/abs/2602.16844
作者: Madeleine Grunde-McLaughlin,Hussein Mozannar,Maya Murad,Jingya Chen,Saleema Amershi,Adam Fourney
机构: University of Washington (华盛顿大学); Microsoft Research (微软研究院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:To enable human oversight, agentic AI systems often provide a trace of reasoning and action steps. Designing traces to have an informative, but not overwhelming, level of detail remains a critical challenge. In three user studies on a Computer User Agent, we investigate the utility of basic action traces for verification, explore three alternatives via design probes, and test a novel interface’s impact on error finding in question-answering tasks. As expected, we find that current practices are cumbersome, limiting their efficacy. Conversely, our proposed design reduced the time participants spent finding errors. However, although participants reported higher levels of confidence in their decisions, their final accuracy was not meaningfully improved. To this end, our study surfaces challenges for human verification of agentic systems, including managing built-in assumptions, users’ subjective and changing correctness criteria, and the shortcomings, yet importance, of communicating the agent’s process.
[HC-25] AI-Mediated Feedback Improves Student Revisions: A Randomized Trial with FeedbackWriter in a Large Undergraduate Course
【速读】:该论文试图解决的问题是:在生成式 AI (Generative AI) 被用于辅助教学反馈的背景下,学生对 AI 辅助反馈与传统人工反馈的响应差异尚不明确。为填补这一研究空白,作者设计并实施了一项随机对照试验(RCT),在一门大型经济学导论课程中部署了 FeedbackWriter 系统——该系统向助教(TAs)提供由 LLM 生成的反馈建议,TAs 可选择采纳、修改或忽略这些建议。关键解决方案在于构建一个“AI 辅助人类反馈”的协同机制,即让 TAs 在保留决策权的前提下整合 AI 建议,从而形成可量化评估的反馈干预组与基准组(纯人工反馈)。实验结果表明,接受 AI 辅助反馈的学生在修订稿质量上显著提升,且提升幅度随 TAs 采纳 AI 建议的比例增加而增强,验证了该协同模式的有效性。
链接: https://arxiv.org/abs/2602.16820
作者: Xinyi Lu,Kexin Phyllis Ju,Mitchell Dudley,Larissa Sano,Xu Wang
机构: University of Michigan (密歇根大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite growing interest in using LLMs to generate feedback on students’ writing, little is known about how students respond to AI-mediated versus human-provided feedback. We address this gap through a randomized controlled trial in a large introductory economics course (N=354), where we introduce and deploy FeedbackWriter - a system that generates AI suggestions to teaching assistants (TAs) while they provide feedback on students’ knowledge-intensive essays. TAs have the full capacity to adopt, edit, or dismiss the suggestions. Students were randomly assigned to receive either handwritten feedback from TAs (baseline) or AI-mediated feedback where TAs received suggestions from FeedbackWriter. Students revise their drafts based on the feedback, which is further graded. In total, 1,366 essays were graded using the system. We found that students receiving AI-mediated feedback produced significantly higher-quality revisions, with gains increasing as TAs adopted more AI suggestions. TAs found the AI suggestions useful for spotting gaps and clarifying rubrics.
[HC-26] Exploring the Design and Impact of Interactive Worked Examples for Learners with Varying Prior Knowledge
【速读】:该论文旨在解决传统教学系统中因学习者先验知识水平差异而导致的“能力-干预交互效应”(aptitude-treatment interaction effect)问题,即低先验知识学习者在被动式讲解型干预中受益更多,而高先验知识学习者可能因缺乏挑战而难以提升。解决方案的关键在于基于ICAP(Interactive, Constructive, Active, Passive)学习理论设计两种新型生成式 worked examples:Buggy(学生修复错误)和Guided(学生补全缺失规则),通过调节认知投入强度实现差异化干预——Buggy促进高先验知识学习者的探索与修正行为,Guided则增强低先验知识学习者的求助行为并减少错误,从而优化不同知识水平学习者的逻辑问题解决能力。
链接: https://arxiv.org/abs/2602.16806
作者: Sutapa Dey Tithi,Xiaoyi Tian,Ally Limke,Min Chi,Tiffany Barnes
机构: North Carolina State University (北卡罗来纳州立大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Tutoring systems improve learning through tailored interventions, such as worked examples, but often suffer from the aptitude-treatment interaction effect where low prior knowledge learners benefit more. We applied the ICAP learning theory to design two new types of worked examples, Buggy (students fix bugs), and Guided (students complete missing rules), requiring varying levels of cognitive engagement, and investigated their impact on learning in a controlled experiment with 155 undergraduate students in a logic problem solving tutor. Students in the Buggy and Guided examples groups performed significantly better on the posttest than those receiving passive worked examples. Buggy problems helped high prior knowledge learners whereas Guided problems helped low prior knowledge learners. Behavior analysis showed that Buggy produced more exploration-revision cycles, while Guided led to more help-seeking and fewer errors. This research contributes to the design of interventions in logic problem solving for varied levels of learner knowledge and a novel application of behavior analysis to compare learner interactions with the tutor.
计算机视觉
[CV-0] OpenEarthAgent : A Unified Framework for Tool-Augmented Geospatial Agents
【速读】:该论文旨在解决将多模态推理能力扩展至遥感(remote sensing)领域时面临的挑战,即模型需在空间尺度、地理结构和多光谱指数(如NDVI、NBR、NDBI)等复杂背景下,保持连贯的多步骤逻辑推理。解决方案的关键在于提出OpenEarthAgent框架,该框架通过监督微调(supervised fine-tuning)训练工具增强型地理空间代理(tool-augmented geospatial agents),利用包含14,538个训练实例和1,169个评估实例的结构化推理轨迹数据集,使模型能够对卫星影像与自然语言查询进行联合解析,并执行GIS操作与指数分析,从而实现稳定的空间理解、可解释的工具驱动行为及跨场景的结构化推理能力。
链接: https://arxiv.org/abs/2602.17665
作者: Akashah Shabbir,Muhammad Umer Sheikh,Muhammad Akhtar Munir,Hiyam Debary,Mustansar Fiaz,Muhammad Zaigham Zaheer,Paolo Fraccaro,Fahad Shahbaz Khan,Muhammad Haris Khan,Xiao Xiang Zhu,Salman Khan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent progress in multimodal reasoning has enabled agents that can interpret imagery, connect it with language, and perform structured analytical tasks. Extending such capabilities to the remote sensing domain remains challenging, as models must reason over spatial scale, geographic structures, and multispectral indices while maintaining coherent multi-step logic. To bridge this gap, OpenEarthAgent introduces a unified framework for developing tool-augmented geospatial agents trained on satellite imagery, natural-language queries, and detailed reasoning traces. The training pipeline relies on supervised fine-tuning over structured reasoning trajectories, aligning the model with verified multistep tool interactions across diverse analytical contexts. The accompanying corpus comprises 14,538 training and 1,169 evaluation instances, with more than 100K reasoning steps in the training split and over 7K reasoning steps in the evaluation split. It spans urban, environmental, disaster, and infrastructure domains, and incorporates GIS-based operations alongside index analyses such as NDVI, NBR, and NDBI. Grounded in explicit reasoning traces, the learned agent demonstrates structured reasoning, stable spatial understanding, and interpretable behaviour through tool-driven geospatial interactions across diverse conditions. We report consistent improvements over a strong baseline and competitive performance relative to recent open and closed-source models.
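下面以摘要中提到的 NDVI、NBR、NDBI 为例,给出这三种光谱指数的最小计算示意(Python/NumPy);波段变量名与随机示例数据仅为演示假设,并非 OpenEarthAgent 的原始实现:

```python
import numpy as np

def normalized_difference(a: np.ndarray, b: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """通用归一化差值指数 (a - b) / (a + b),eps 用于防止除零。"""
    return (a - b) / (a + b + eps)

# 假设各波段为同尺寸的地表反射率图(这里用随机数据代替真实影像)
h, w = 4, 4
rng = np.random.default_rng(0)
red, nir = rng.random((h, w)), rng.random((h, w))
swir1, swir2 = rng.random((h, w)), rng.random((h, w))

ndvi = normalized_difference(nir, red)    # 植被指数 NDVI
nbr  = normalized_difference(nir, swir2)  # 燃烧指数 NBR
ndbi = normalized_difference(swir1, nir)  # 建筑指数 NDBI
print(ndvi.mean(), nbr.mean(), ndbi.mean())
```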
[CV-1] When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs
【速读】:该论文旨在解决视觉-语言-动作模型(Vision-Language-Action models, VLAs)在执行语言指令时存在的“反事实失败”(counterfactual failures)问题,即模型因数据集偏差而依赖视觉捷径(vision shortcuts),在缺乏场景特定监督的情况下重复执行训练中常见的行为,忽略语言意图。解决方案的关键在于提出一种名为“反事实动作引导”(Counterfactual Action Guidance, CAG)的双分支推理机制:该机制通过将标准VLA策略与一个语言无关的视觉-动作(Vision-Action, VA)模块结合,在动作选择阶段进行反事实比较,从而显式地正则化语言条件作用,减少对视觉捷径的依赖,并提升在低观测任务上的鲁棒性。CAG无需额外演示或修改现有架构或预训练模型,具有良好的可插拔性和通用性。
链接: https://arxiv.org/abs/2602.17659
作者: Yu Fang,Yuchun Feng,Dong Jing,Jiaqi Liu,Yue Yang,Zhenyu Wei,Daniel Szafir,Mingyu Ding
机构: University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Website: this https URL
Abstract:Vision-Language-Action models (VLAs) promise to ground language instructions in robot control, yet in practice often fail to faithfully follow language. When presented with instructions that lack strong scene-specific supervision, VLAs suffer from counterfactual failures: they act based on vision shortcuts induced by dataset biases, repeatedly executing well-learned behaviors and selecting objects frequently seen during training regardless of language intent. To systematically study it, we introduce LIBERO-CF, the first counterfactual benchmark for VLAs that evaluates language following capability by assigning alternative instructions under visually plausible LIBERO layouts. Our evaluation reveals that counterfactual failures are prevalent yet underexplored across state-of-the-art VLAs. We propose Counterfactual Action Guidance (CAG), a simple yet effective dual-branch inference scheme that explicitly regularizes language conditioning in VLAs. CAG combines a standard VLA policy with a language-unconditioned Vision-Action (VA) module, enabling counterfactual comparison during action selection. This design reduces reliance on visual shortcuts, improves robustness on under-observed tasks, and requires neither additional demonstrations nor modifications to existing architectures or pretrained models. Extensive experiments demonstrate its plug-and-play integration across diverse VLAs and consistent improvements. For example, on LIBERO-CF, CAG improves π0.5 by 9.7% in language following accuracy and 3.6% in task success on under-observed tasks using a training-free strategy, with further gains of 15.5% and 8.5%, respectively, when paired with a VA model. In real-world evaluations, CAG reduces counterfactual failures by 9.4% and improves task success by 17.2% on average.
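摘要中双分支的具体组合规则未给出,下面是一个示意性的最小实现(Python/NumPy),借用 classifier-free guidance 式的对数概率相减来体现“反事实比较”;组合形式与权重均为本文之外的假设:

```python
import numpy as np

def counterfactual_action_guidance(logp_vla, logp_va, weight=1.0):
    """双分支反事实引导示意:
    logp_vla: 语言条件 VLA 策略对各候选动作的对数概率
    logp_va:  去语言条件(仅视觉)VA 分支的对数概率
    惩罚在“无语言”分支下同样高分的动作,以削弱视觉捷径。"""
    guided = logp_vla - weight * logp_va
    return int(np.argmax(guided))

# 玩具示例:动作 1 是训练中常见的视觉捷径(两分支都高),动作 2 才符合语言意图
logp_vla = np.log(np.array([0.1, 0.5, 0.4]))
logp_va  = np.log(np.array([0.1, 0.8, 0.1]))
print(counterfactual_action_guidance(logp_vla, logp_va))  # -> 2
```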
[CV-2] Human-level 3D shape perception emerges from multi-view learning
【速读】:该论文旨在解决如何建模人类从二维视觉输入中推断三维结构的能力这一长期挑战,其核心问题是现有计算方法难以达到人类水平的3D形状推理性能。解决方案的关键在于提出一种新型神经网络框架,该框架通过在自然场景的多视角图像数据上训练一个仅依赖于视觉-空间目标(visual-spatial objective)的模型,无需任何与物体相关的归纳偏置(inductive biases),即可学习预测相机位置和视差等空间信息。该模型在未进行任务特定微调的情况下,首次实现了与人类在3D形状推断任务上的准确率相当,并能通过独立读出(independent readouts)预测人类行为的细粒度特征(如错误模式和反应时间),揭示了模型动态与人类感知之间的自然对应关系。
链接: https://arxiv.org/abs/2602.17650
作者: Tyler Bonnen,Jitendra Malik,Angjoo Kanazawa
机构: University of California, Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Humans can infer the three-dimensional structure of objects from two-dimensional visual inputs. Modeling this ability has been a longstanding goal for the science and engineering of visual intelligence, yet decades of computational methods have fallen short of human performance. Here we develop a modeling framework that predicts human 3D shape inferences for arbitrary objects, directly from experimental stimuli. We achieve this with a novel class of neural networks trained using a visual-spatial objective over naturalistic sensory data; given a set of images taken from different locations within a natural scene, these models learn to predict spatial information related to these images, such as camera location and visual depth, without relying on any object-related inductive biases. Notably, these visual-spatial signals are analogous to sensory cues readily available to humans. We design a zero-shot evaluation approach to determine the performance of these “multi-view” models on a well established 3D perception task, then compare model and human behavior. Our modeling framework is the first to match human accuracy on 3D shape inferences, even without task-specific training or fine-tuning. Remarkably, independent readouts of model responses predict fine-grained measures of human behavior, including error patterns and reaction times, revealing a natural correspondence between model dynamics and human perception. Taken together, our findings indicate that human-level 3D perception can emerge from a simple, scalable learning objective over naturalistic visual-spatial data. All code, human behavioral data, and experimental stimuli needed to reproduce our findings can be found on our project page.
[CV-3] IntRec: Intent-based Retrieval with Contrastive Refinement
【速读】:该论文旨在解决复杂场景中用户指定对象的检索问题,尤其针对查询模糊或多相似目标时的识别困难。现有开放词汇检测器采用一次性推理方式,无法根据用户反馈迭代优化预测结果。解决方案的关键在于提出IntRec交互式对象检索框架,其核心是一个意图状态(Intent State, IS),通过维护正锚点(positive anchors,即确认线索)和负约束(negative constraints,即被排除假设)的双记忆集合,利用对比对齐函数在候选对象中进行排序——该函数通过最大化与正锚点的相似性并惩罚与负约束的相似性,实现对杂乱场景中目标的细粒度区分。此机制无需额外监督即可显著提升检索准确性,并在LVIS-Ambiguous基准上仅需一次纠正反馈即实现+7.9 AP的性能提升,且单次交互延迟低于30 ms。
链接: https://arxiv.org/abs/2602.17639
作者: Pourya Shamsolmoali,Masoumeh Zareapoor,Eric Granger,Yue Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Retrieving user-specified objects from complex scenes remains a challenging task, especially when queries are ambiguous or involve multiple similar objects. Existing open-vocabulary detectors operate in a one-shot manner, lacking the ability to refine predictions based on user feedback. To address this, we propose IntRec, an interactive object retrieval framework that refines predictions based on user feedback. At its core is an Intent State (IS) that maintains dual memory sets for positive anchors (confirmed cues) and negative constraints (rejected hypotheses). A contrastive alignment function ranks candidate objects by maximizing similarity to positive cues while penalizing rejected ones, enabling fine-grained disambiguation in cluttered scenes. Our interactive framework provides substantial improvements in retrieval accuracy without additional supervision. On LVIS, IntRec achieves 35.4 AP, outperforming OVMR, CoDet, and CAKE by +2.3, +3.7, and +0.5, respectively. On the challenging LVIS-Ambiguous benchmark, it improves performance by +7.9 AP over its one-shot baseline after a single corrective feedback, with less than 30 ms of added latency per interaction.
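摘要描述的对比对齐函数“最大化与正锚点的相似性、惩罚与负约束的相似性”可以用如下最小示意复现(Python/NumPy);其中用“最大余弦相似度”聚合双记忆集合以及权重 lam 均为演示假设:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def intent_score(candidate, positives, negatives, lam=0.5):
    """意图状态(IS)打分示意:与正锚点越近、与负约束越远,得分越高。"""
    pos = max((cosine(candidate, p) for p in positives), default=0.0)
    neg = max((cosine(candidate, n) for n in negatives), default=0.0)
    return pos - lam * neg

rng = np.random.default_rng(0)
cands = [rng.normal(size=8) for _ in range(5)]
positives = [cands[2] + 0.1 * rng.normal(size=8)]  # 用户确认的线索
negatives = [cands[0]]                             # 用户排除的假设
ranked = sorted(range(len(cands)), key=lambda i: -intent_score(cands[i], positives, negatives))
print(ranked)  # 候选 2 应排在最前
```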
[CV-4] CORAL: Correspondence Alignment for Improved Virtual Try-On
【速读】:该论文旨在解决虚拟试衣(Virtual Try-On, VTON)中在无配对场景下难以保持衣物细节的问题,尤其是现有方法未能显式建模人与衣物之间的精确对应关系,且缺乏对Diffusion Transformer(DiT)架构中对应关系生成机制的解释。其解决方案的关键在于揭示了全三维注意力机制中人物-衣物对应关系依赖于查询(query)与键(key)之间的精准匹配,并据此提出CORrespondence ALignment(CORAL)框架:通过引入对应蒸馏损失(correspondence distillation loss)将可靠的外部对应关系对齐到人物-衣物注意力空间,以及熵最小化损失(entropy minimization loss)以增强注意力分布的聚焦性,从而实现更鲁棒的跨模态对齐与细节保留。
链接: https://arxiv.org/abs/2602.17636
作者: Jiyoung Kim,Youngjin Shin,Siyoon Jin,Dahyun Chung,Jisu Nam,Tongmin Kim,Jongjae Park,Hyeonwoo Kang,Seungryong Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 32 pages, 25 figures
Abstract:Existing methods for Virtual Try-On (VTON) often struggle to preserve fine garment details, especially in unpaired settings where accurate person-garment correspondence is required. These methods do not explicitly enforce person-garment alignment and fail to explain how correspondence emerges within Diffusion Transformers (DiTs). In this paper, we first analyze full 3D attention in DiT-based architecture and reveal that the person-garment correspondence critically depends on precise person-garment query-key matching within the full 3D attention. Building on this insight, we then introduce CORrespondence ALignment (CORAL), a DiT-based framework that explicitly aligns query-key matching with robust external correspondences. CORAL integrates two complementary components: a correspondence distillation loss that aligns reliable matches with person-garment attention, and an entropy minimization loss that sharpens the attention distribution. We further propose a VLM-based evaluation protocol to better reflect human preference. CORAL consistently improves over the baseline, enhancing both global shape transfer and local detail preservation. Extensive ablations validate our design choices.
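下面给出“对应蒸馏 + 熵最小化”两项损失的一个最小示意(PyTorch);交叉熵式的蒸馏形式与权重 beta 是基于摘要的假设,并非论文的原始实现:

```python
import torch
import torch.nn.functional as F

def coral_losses(attn_logits, target_corr, beta=0.1):
    """CORAL 损失示意:
    attn_logits: [Nq, Nk] 人物 query 对衣物 key 的注意力 logits
    target_corr: [Nq, Nk] 外部可靠对应关系(按行归一化的软标签)"""
    attn = attn_logits.softmax(dim=-1)
    # 1) 对应蒸馏:将注意力分布对齐到外部对应关系
    distill = -(target_corr * attn.clamp_min(1e-8).log()).sum(-1).mean()
    # 2) 熵最小化:促使注意力分布更聚焦
    entropy = -(attn * attn.clamp_min(1e-8).log()).sum(-1).mean()
    return distill + beta * entropy

logits = torch.randn(16, 32)
target = F.one_hot(torch.randint(0, 32, (16,)), num_classes=32).float()
print(coral_losses(logits, target).item())
```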
[CV-5] Adapting Actively on the Fly: Relevance-Guided Online Meta-Learning with Latent Concepts for Geospatial Discovery
【速读】:该论文旨在解决在资源受限且环境动态变化的现实场景中(如环境监测或公共卫生),如何通过策略性采样高效发现隐藏目标的问题。由于地理空间数据稀疏且存在偏差,传统基于学习的方法(如强化学习)难以适用。解决方案的关键在于提出一个统一的地理空间发现框架,其核心是基于“概念相关性”(concept relevance)这一共享理念,引入两项创新:一是概念加权不确定性采样策略,通过已知领域概念(如土地覆盖、污染源距离)调节不确定性,提升采样效率;二是相关性感知的元批次构建策略,在在线元学习更新中促进语义多样性,增强模型在动态环境中的泛化能力。该方法在真实PFAS污染数据集上验证了其在有限数据和变化环境中可靠发现目标的能力。
链接: https://arxiv.org/abs/2602.17605
作者: Jowaria Khan,Anindya Sarkar,Yevgeniy Vorobeychik,Elizabeth Bondi-Kelly
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:
Abstract:In many real-world settings, such as environmental monitoring, disaster response, or public health, with costly and difficult data collection and dynamic environments, strategically sampling from unobserved regions is essential for efficiently uncovering hidden targets under tight resource constraints. Yet, sparse and biased geospatial ground truth limits the applicability of existing learning-based methods, such as reinforcement learning. To address this, we propose a unified geospatial discovery framework that integrates active learning, online meta-learning, and concept-guided reasoning. Our approach introduces two key innovations built on a shared notion of concept relevance, which captures how domain-specific factors influence target presence: a concept-weighted uncertainty sampling strategy, where uncertainty is modulated by learned relevance based on readily-available domain-specific concepts (e.g., land cover, source proximity); and a relevance-aware meta-batch formation strategy that promotes semantic diversity during online-meta updates, improving generalization in dynamic environments. Our experiments include testing on a real-world dataset of cancer-causing PFAS (Per- and polyfluoroalkyl substances) contamination, showcasing our method’s reliability at uncovering targets with limited data and a varying environment.
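摘要中的“概念加权不确定性采样”可用下面的最小示意说明(Python/NumPy):采集分数取“预测不确定性 × 概念相关性”,这一乘法加权形式是演示假设:

```python
import numpy as np

def binary_entropy(p, eps=1e-8):
    p = np.asarray(p)
    return -(p * np.log(p + eps) + (1 - p) * np.log(1 - p + eps))

def select_next_site(probs, relevance):
    """probs:     各候选位置的目标存在概率(模型预测)
    relevance: 由领域概念(土地覆盖、污染源距离等)得到的相关性权重"""
    scores = binary_entropy(probs) * np.asarray(relevance)
    return int(np.argmax(scores))

probs     = [0.50, 0.52, 0.95, 0.48]   # 位置 0/1/3 的预测高度不确定
relevance = [0.10, 0.90, 0.80, 0.20]   # 但只有位置 1 的领域概念高度相关
print(select_next_site(probs, relevance))  # -> 1
```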
[CV-6] Art2Mus: Artwork-to-Music Generation via Visual Conditioning and Large-Scale Cross-Modal Alignment
【速读】:该论文旨在解决当前图像条件音乐生成系统存在的两个核心问题:一是模型通常基于自然照片训练,难以捕捉艺术作品中更丰富的语义、风格与文化内涵;二是多数方法依赖图像到文本的转换阶段,通过语言作为语义捷径进行条件控制,从而阻碍了直接的视觉到音频学习。解决方案的关键在于提出ArtToMus框架,这是首个专为直接艺术作品到音乐生成设计的模型,它摒弃了图像到文本的中间步骤和基于语言的语义监督,而是将视觉嵌入投影到潜在扩散模型的条件空间中,仅凭视觉信息引导音乐合成。该方法实现了对源艺术品显著视觉特征的有效响应,在保持音乐连贯性和风格一致性的同时,推动了视觉到音频生成这一独立且具有挑战性的研究方向的发展。
链接: https://arxiv.org/abs/2602.17599
作者: Ivan Rinaldi,Matteo Mendula,Nicola Fanelli,Florence Levé,Matteo Testi,Giovanna Castellano,Gennaro Vessio
机构: University of Bari Aldo Moro (巴里大学阿尔多·莫罗分校); Catalonia’s Telecommunications Technology Centre (加泰罗尼亚电信技术中心); University of Picardie Jules Verne (皮卡第朱尔斯·凡尔纳大学); Artificial Intelligence Venture Builder (AIVB) (人工智能创业构建者)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
备注:
Abstract:Music generation has advanced markedly through multimodal deep learning, enabling models to synthesize audio from text and, more recently, from images. However, existing image-conditioned systems suffer from two fundamental limitations: (i) they are typically trained on natural photographs, limiting their ability to capture the richer semantic, stylistic, and cultural content of artworks; and (ii) most rely on an image-to-text conversion stage, using language as a semantic shortcut that simplifies conditioning but prevents direct visual-to-audio learning. Motivated by these gaps, we introduce ArtSound, a large-scale multimodal dataset of 105,884 artwork-music pairs enriched with dual-modality captions, obtained by extending ArtGraph and the Free Music Archive. We further propose ArtToMus, the first framework explicitly designed for direct artwork-to-music generation, which maps digitized artworks to music without image-to-text translation or language-based semantic supervision. The framework projects visual embeddings into the conditioning space of a latent diffusion model, enabling music synthesis guided solely by visual information. Experimental results show that ArtToMus generates musically coherent and stylistically consistent outputs that reflect salient visual cues of the source artworks. While absolute alignment scores remain lower than those of text-conditioned systems (as expected given the substantially increased difficulty of removing linguistic supervision), ArtToMus achieves competitive perceptual quality and meaningful cross-modal correspondence. This work establishes direct visual-to-music generation as a distinct and challenging research direction, and provides resources that support applications in multimedia art, cultural heritage, and AI-assisted creative practice. Code and dataset will be publicly released upon acceptance.
[CV-7] FR-GESTURE: An RGBD Dataset For Gesture-based Human-Robot Interaction In First Responder Operations
【速读】:该论文旨在解决突发事件中一线救援人员(First Responders, FRs)在复杂环境下操作无人地面车辆(UGV)时面临的控制效率与准确性难题。其核心解决方案是构建首个专为FRs设计的基于手势的UGV控制数据集(FR-GESTURE),包含12个经实战反馈优化的手势指令,通过双视角、七距离采集的3312对RGBD图像实现多模态感知数据支撑,并定义了标准化评估协议以推动后续算法改进。关键创新在于将战术手语与实际救援场景结合,形成可落地的交互范式,为生成式AI (Generative AI) 在应急响应领域的应用提供基础数据支持。
链接: https://arxiv.org/abs/2602.17573
作者: Konstantinos Foteinos,Georgios Angelidis,Aggelos Psiris,Vasileios Argyriou,Panagiotis Sarigiannidis,Georgios Th. Papadopoulos
机构: 1. University of West Attica (西阿提卡大学); 2. Aristotle University of Thessaloniki (塞萨洛尼基亚里士多德大学); 3. Hellenic Open University (希腊开放大学); 4. National and Kapodistrian University of Athens (雅典国立卡波迪斯特里安大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The ever increasing intensity and number of disasters make even more difficult the work of First Responders (FRs). Artificial intelligence and robotics solutions could facilitate their operations, compensating these difficulties. To this end, we propose a dataset for gesture-based UGV control by FRs, introducing a set of 12 commands, drawing inspiration from existing gestures used by FRs and tactical hand signals and refined after incorporating feedback from experienced FRs. Then we proceed with the data collection itself, resulting in 3312 RGBD pairs captured from 2 viewpoints and 7 distances. To the best of our knowledge, this is the first dataset especially intended for gesture-based UGV guidance by FRs. Finally we define evaluation protocols for our RGBD dataset, termed FR-GESTURE, and we perform baseline experiments, which are put forward for improvement. We have made data publicly available to promote future research on the domain: this https URL.
[CV-8] RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在专业图像编辑任务中因缺乏可靠、可验证的奖励信号而导致的训练难题,尤其是如何有效建模主观性强的创意编辑意图。解决方案的关键在于提出RetouchIQ框架,其核心创新是引入一个通用奖励模型(generalist reward model),该模型基于强化学习(Reinforcement Learning, RL)微调的MLLM,能够针对每种编辑案例生成定制化的评估指标,并通过多模态推理提供标量反馈,从而实现高质量、指令一致的梯度更新,使MLLM代理能从高层美学目标自动推导出可执行的参数调整策略,显著提升语义一致性和感知质量。
链接: https://arxiv.org/abs/2602.17558
作者: Qiucheng Wu,Jing Shi,Simon Jenni,Kushal Kafle,Tianyu Wang,Shiyu Chang,Handong Zhao
机构: Adobe Research (Adobe 研究院); UC, Santa Barbara (加州大学圣塔芭芭拉分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 6 figures
Abstract:Recent advances in multimodal large language models (MLLMs) have shown great potential for extending vision-language reasoning to professional tool-based image editing, enabling intuitive and creative editing. A promising direction is to use reinforcement learning (RL) to enable MLLMs to reason about and execute optimal tool-use plans within professional image-editing software. However, training remains challenging due to the lack of reliable, verifiable reward signals that can reflect the inherently subjective nature of creative editing. In this work, we introduce RetouchIQ, a framework that performs instruction-based executable image editing through MLLM agents guided by a generalist reward model. RetouchIQ interprets user-specified editing intentions and generates corresponding, executable image adjustments, bridging high-level aesthetic goals with precise parameter control. To move beyond conventional, rule-based rewards that compute similarity against a fixed reference image using handcrafted metrics, we propose a generalist reward model, an RL fine-tuned MLLM that evaluates edited results through a set of generated metrics on a case-by-case basis. Then, the reward model provides scalar feedback through multimodal reasoning, enabling reinforcement learning with high-quality, instruction-consistent gradients. We curate an extended dataset with 190k instruction-reasoning pairs and establish a new benchmark for instruction-based image editing. Experiments show that RetouchIQ substantially improves both semantic consistency and perceptual quality over previous MLLM-based and diffusion-based editing systems. Our findings demonstrate the potential of generalist reward-driven MLLM agents as flexible, explainable, and executable assistants for professional image editing.
[CV-9] GraphThinker: Reinforcing Video Reasoning with Event Graph Thinking
【速读】:该论文旨在解决视频推理中因缺乏显式因果结构建模而导致的幻觉问题,尤其是在事件间因果关系隐含且人工标注成本高昂的情况下。现有多模态大语言模型(Multimodal Large Language Models, MLLMs)依赖密集描述或视频摘要进行推理,难以实现真正的因果理解。解决方案的关键在于提出GraphThinker方法,通过强化学习微调构建事件级场景图(Event-based Video Scene Graph, EVSG),显式建模事件内与事件间的关联,并将该结构作为中间思考过程引入MLLM;同时,在强化微调阶段引入视觉注意力奖励机制,增强视觉定位能力,从而显著减少视频推理中的幻觉现象。
链接: https://arxiv.org/abs/2602.17555
作者: Zixu Cheng,Da Li,Jian Hu,Ziquan Liu,Wei Li,Shaogang Gong
机构: Queen Mary University of London (伦敦玛丽女王大学); Samsung AI Centre Cambridge (三星人工智能中心剑桥); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review
Abstract:Video reasoning requires understanding the causal relationships between events in a video. However, such relationships are often implicit and costly to annotate manually. While existing multimodal large language models (MLLMs) often infer event relations through dense captions or video summaries for video reasoning, such modeling still lacks causal understanding. Without explicit causal structure modeling within and across video events, these models suffer from hallucinations during the video reasoning. In this work, we propose GraphThinker, a reinforcement finetuning-based method that constructs structural event-level scene graphs and enhances visual grounding to jointly reduce hallucinations in video reasoning. Specifically, we first employ an MLLM to construct an event-based video scene graph (EVSG) that explicitly models both intra- and inter-event relations, and incorporate these formed scene graphs into the MLLM as an intermediate thinking process. We also introduce a visual attention reward during reinforcement finetuning, which strengthens video grounding and further mitigates hallucinations. We evaluate GraphThinker on two datasets, RexTime and VidHalluc, where it shows superior ability to capture object and event relations with more precise event localization, reducing hallucinations in video reasoning compared to prior methods.
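摘要提到的“视觉注意力奖励”未给出定义,下面是一个可能的最小示意(Python/NumPy),用“注意力质量落在相关区域内的占比”作为奖励,仅作说明,并非论文原始定义:

```python
import numpy as np

def attention_reward(attn_map, region_mask):
    """attn_map:    [H, W] 模型对事件帧的注意力(非负)
    region_mask: [H, W] 事件相关区域的 0/1 掩码
    奖励 = 相关区域内的注意力质量占比,鼓励视觉定位。"""
    attn = np.asarray(attn_map, dtype=float)
    return float((attn * region_mask).sum() / (attn.sum() + 1e-8))

attn = np.zeros((4, 4)); attn[1:3, 1:3] = 1.0  # 注意力集中在中心
mask = np.zeros((4, 4)); mask[1:3, 1:3] = 1.0  # 事件区域也在中心
print(attention_reward(attn, mask))  # -> 1.0
```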
[CV-10] LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs
【速读】:该论文旨在解决医学视觉语言模型(Medical Vision-Language Models, VLMs)在域偏移下可靠性不足的问题,特别是其不确定性校准缺乏有限样本覆盖保证,且传统分片共形预测(Split Conformal Prediction, SCP)方法在小样本、类别不平衡场景中存在预测集过大(效率低)和类条件覆盖差距(Class-wise Coverage Variance, CCV)显著的问题。解决方案的关键在于提出一种无需训练和标签的精炼方法 LATA(Laplacian-Assisted Transductive Adaptation),通过在联合校准与测试池上构建图像-图像 k-NN 图并利用少量 CCCP 平均场更新平滑零样本概率,同时引入确定性变换保持 SCP 有效性;此外,进一步设计了一种故障感知共形得分(failure-aware conformal score),嵌入到视觉语言不确定性(ViLU)框架中,实现实例级难度评估与标签合理性判断,从而在固定覆盖水平下提升预测集效率和类间平衡性,且不破坏交换性假设。
链接: https://arxiv.org/abs/2602.17535
作者: Behzad Bozorgtabar,Dwarikanath Mahapatra,Sudipta Roy,Muzammal Naseer,Imran Razzak,Zongyuan Ge
机构: Aarhus University (A3 Lab); Khalifa University; Jio Institute; MBZUAI; Monash University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 6 figures, 4 tables
Abstract:Medical vision-language models (VLMs) are strong zero-shot recognizers for medical imaging, but their reliability under domain shift hinges on calibrated uncertainty with guarantees. Split conformal prediction (SCP) offers finite-sample coverage, yet prediction sets often become large (low efficiency) and class-wise coverage unbalanced (high class-conditioned coverage gap, CCV), especially in few-shot, imbalanced regimes; moreover, naively adapting to calibration labels breaks exchangeability and voids guarantees. We propose LATA (Laplacian-Assisted Transductive Adaptation), a training- and label-free refinement that operates on the joint calibration and test pool by smoothing zero-shot probabilities over an image-image k-NN graph using a small number of CCCP mean-field updates, preserving SCP validity via a deterministic transform. We further introduce a failure-aware conformal score that plugs into the vision-language uncertainty (ViLU) framework, providing instance-level difficulty and label plausibility to improve prediction set efficiency and class-wise balance at fixed coverage. LATA is black-box (no VLM updates), compute-light (windowed transduction, no backprop), and includes an optional prior knob that can run strictly label-free or, if desired, in a label-informed variant using calibration marginals once. Across three medical VLMs and nine downstream tasks, LATA consistently reduces set size and CCV while matching or tightening target coverage, outperforming prior transductive baselines and narrowing the gap to label-using methods, while using far less compute. Comprehensive ablations and qualitative analyses show that LATA sharpens zero-shot predictions without compromising exchangeability.
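下面把摘要中的两步串成一个最小示意(Python/NumPy):先在图像 k-NN 图上对零样本概率做少量 mean-field 平滑,再用标准分片共形预测构造预测集;平滑更新的具体形式(log 先验 + 图传播)为演示假设,共形部分则是教科书式的 SCP:

```python
import numpy as np

def knn_graph(feats, k=5):
    """余弦相似度 k-NN 邻接矩阵(行归一化)。"""
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)
    W = np.zeros_like(sim)
    for i, js in enumerate(np.argsort(-sim, axis=1)[:, :k]):
        W[i, js] = 1.0 / k
    return W

def mean_field_smooth(P0, W, alpha=0.5, iters=3):
    """在图上对零样本概率做少量 mean-field 迭代(CCCP 风格示意)。"""
    Q = P0.copy()
    for _ in range(iters):
        logits = np.log(P0 + 1e-8) + alpha * (W @ Q)
        Q = np.exp(logits - logits.max(1, keepdims=True))
        Q /= Q.sum(1, keepdims=True)
    return Q

def conformal_sets(P_cal, y_cal, P_test, alpha=0.1):
    """标准 SCP:非一致性得分取 1 - p(真实类),按校准分位数构造预测集。"""
    scores = 1.0 - P_cal[np.arange(len(y_cal)), y_cal]
    n = len(scores)
    q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")
    return [np.where(1.0 - p <= q)[0] for p in P_test]

rng = np.random.default_rng(0)
feats = rng.normal(size=(40, 16))
P0 = rng.dirichlet(np.ones(5), size=40)
Q = mean_field_smooth(P0, knn_graph(feats))      # 在联合校准+测试池上平滑
y_cal = rng.integers(0, 5, size=20)
print(conformal_sets(Q[:20], y_cal, Q[20:])[0])  # 第一个测试样本的预测集
```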
[CV-11] FoundationPose-Initialized 3D-2D Liver Registration for Surgical Augmented Reality
【速读】:该论文旨在解决腹腔镜下肝切除术中肿瘤定位精度不足的问题,尤其是传统配准流程依赖器官轮廓且采用有限元(Finite-Element, FE)模型进行非刚性(non-rigid)变形校正时所面临的建模复杂性和工程实现难度高的挑战。其解决方案的关键在于:引入腹腔镜深度图(depth map)与基础位姿估计器(foundation pose estimator)相结合,实现相机-肝脏位姿的精准估计,并以非刚性迭代最近点算法(Non-Rigid Iterative Closest Point, NICP)替代传统的FE模型进行非刚性变形处理,从而显著降低模型复杂度和对专业领域的依赖,同时在真实患者数据上实现了9.91 mm的平均配准误差,验证了该方法在临床应用中的可行性与高效性。
链接: https://arxiv.org/abs/2602.17517
作者: Hanyuan Zhang,Lucas He,Runlong He,Abdolrahim Kadkhodamohammadi,Danail Stoyanov,Brian R. Davidson,Evangelos B. Mazomenos,Matthew J. Clarkson
机构: UCL Hawkes Institute, University College London (伦敦大学学院霍克斯研究所); Division of Surgery and Interventional Science, University College London (伦敦大学学院外科与介入科学系); Unit for Lifelong Health and Ageing at UCL, University College London (伦敦大学学院终身健康与老龄化中心); Medtronic plc. (美敦力公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Augmented reality can improve tumor localization in laparoscopic liver surgery. Existing registration pipelines typically depend on organ contours; deformable (non-rigid) alignment is often handled with finite-element (FE) models coupled to dimensionality-reduction or machine-learning components. We integrate laparoscopic depth maps with a foundation pose estimator for camera-liver pose estimation and replace FE-based deformation with non-rigid iterative closest point (NICP) to lower engineering/modeling complexity and expertise requirements. On real patient data, the depth-augmented foundation pose approach achieved 9.91 mm mean registration error in 3 cases. Combined rigid-NICP registration outperformed rigid-only registration, demonstrating NICP as an efficient substitute for finite-element deformable models. This pipeline achieves clinically relevant accuracy while offering a lightweight, engineering-friendly alternative to FE-based deformation.
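该流程先做刚性配准、再用 NICP 处理非刚性形变;下面给出刚性子步骤常用的 Kabsch/Umeyama 闭式解示意(Python/NumPy),即在已知点对应下求最优旋转与平移(NICP 在此之上为每个顶点引入带刚度正则的局部变换,此处从略):

```python
import numpy as np

def rigid_align(P, Q):
    """Kabsch 闭式解:求 R, t 使 ||R @ P + t - Q|| 最小。
    P, Q: [N, 3] 已建立对应关系的点对。"""
    cp, cq = P.mean(0), Q.mean(0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # 防止得到反射
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t

rng = np.random.default_rng(0)
P = rng.normal(size=(50, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.5, -0.2, 1.0])
Q = P @ R_true.T + t_true
R, t = rigid_align(P, Q)
print(np.allclose(R, R_true, atol=1e-6), np.allclose(t, t_true, atol=1e-6))
```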
[CV-12] Tracing Copied Pixels and Regularizing Patch Affinity in Copy Detection
【速读】:该论文旨在解决图像复制检测(Image Copy Detection, ICD)中因现有自监督学习(Self-Supervised Learning, SSL)方法在复杂编辑场景下缺乏细粒度对应关系建模而导致的性能瓶颈问题。其解决方案的关键在于引入两个核心创新:一是提出PixTrace模块,通过显式维护像素级坐标映射来捕捉编辑操作中的几何可追溯性;二是设计CopyNCE损失函数,利用PixTrace验证的映射关系计算重叠比例以指导patch级别的对比学习,从而在SSL训练中抑制监督噪声。该方法实现了像素级可追溯性与patch级相似性学习的有效融合,在DISC21数据集上取得了88.7% uAP的匹配器性能和72.6% uAP的描述子性能,显著优于现有方法。
链接: https://arxiv.org/abs/2602.17484
作者: Yichen Lu,Siwei Nie,Minlong Lu,Xudong Yang,Xiaobo Zhang,Peng Zhang
机构: Ant Group(蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Image Copy Detection (ICD) aims to identify manipulated content between image pairs through robust feature representation learning. While self-supervised learning (SSL) has advanced ICD systems, existing view-level contrastive methods struggle with sophisticated edits due to insufficient fine-grained correspondence learning. We address this limitation by exploiting the inherent geometric traceability in edited content through two key innovations. First, we propose PixTrace - a pixel coordinate tracking module that maintains explicit spatial mappings across editing transformations. Second, we introduce CopyNCE, a geometrically-guided contrastive loss that regularizes patch affinity using overlap ratios derived from PixTrace’s verified mappings. Our method bridges pixel-level traceability with patch-level similarity learning, suppressing supervision noise in SSL training. Extensive experiments demonstrate not only state-of-the-art performance (88.7% uAP / 83.9% RP90 for matcher, 72.6% uAP / 68.4% RP90 for descriptor on DISC21 dataset) but also better interpretability over existing methods.
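摘要中 CopyNCE“以重叠比例引导 patch 亲和度”的思想,可以用带软目标的 InfoNCE 形式做一个最小示意(PyTorch);把重叠比例按行归一化作为软标签是演示假设:

```python
import torch

def copynce_loss(z_src, z_edit, overlap, tau=0.07):
    """z_src:   [N, D] 原图 patch 特征
    z_edit:  [M, D] 编辑图 patch 特征
    overlap: [N, M] 由坐标追踪得到的 patch 重叠比例(每行和 > 0)"""
    z_src = torch.nn.functional.normalize(z_src, dim=-1)
    z_edit = torch.nn.functional.normalize(z_edit, dim=-1)
    logits = z_src @ z_edit.T / tau                    # patch 级相似度
    target = overlap / overlap.sum(-1, keepdim=True)   # 重叠比例作软标签
    return -(target * logits.log_softmax(dim=-1)).sum(-1).mean()

N, D = 8, 32
z = torch.randn(N, D)
overlap = torch.eye(N)   # 玩具示例:patch 一一对应
print(copynce_loss(z, z.clone(), overlap).item())
```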
[CV-13] QuPAINT: Physics-Aware Instruction Tuning Approach to Quantum Material Discovery
【速读】:该论文旨在解决二维量子材料(two-dimensional quantum materials)从光学显微图像中进行表征时面临的挑战,包括层依赖性对比度微弱、标注数据有限以及实验室间和成像设备间的显著差异。现有视觉模型因缺乏物理先验知识而难以泛化至新材料或不同硬件条件。其解决方案的关键在于提出一个物理感知的多模态框架:首先构建基于物理的合成数据生成器Synthia,模拟薄膜干涉下的真实光学响应以减少对人工标注的依赖;其次设计首个大规模量子材料指令数据集QMat-Instruct,包含多模态、物理信息驱动的问题-答案对,用于训练多模态大语言模型(Multimodal Large Language Models, MLLMs)理解晶片外观与厚度关系;最后引入物理感知指令微调方法(QuPAINT),通过物理信息注意力模块融合视觉嵌入与光学先验,提升晶片表征的鲁棒性和判别力。
链接: https://arxiv.org/abs/2602.17478
作者: Xuan-Bac Nguyen,Hoang-Quan Nguyen,Sankalp Pandey,Tim Faltermeier,Nicholas Borys,Hugh Churchill,Khoa Luu
机构: University of Arkansas (阿肯色大学); University of Utah (犹他大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Characterizing two-dimensional quantum materials from optical microscopy images is challenging due to the subtle layer-dependent contrast, limited labeled data, and significant variation across laboratories and imaging setups. Existing vision models struggle in this domain since they lack physical priors and cannot generalize to new materials or hardware conditions. This work presents a new physics-aware multimodal framework that addresses these limitations from both the data and model perspectives. We first present Synthia, a physics-based synthetic data generator that simulates realistic optical responses of quantum material flakes under thin-film interference. Synthia produces diverse and high-quality samples, helping reduce the dependence on expert manual annotation. We introduce QMat-Instruct, the first large-scale instruction dataset for quantum materials, comprising multimodal, physics-informed question-answer pairs designed to teach Multimodal Large Language Models (MLLMs) to understand the appearance and thickness of flakes. Then, we propose Physics-Aware Instruction Tuning (QuPAINT), a multimodal architecture that incorporates a Physics-Informed Attention module to fuse visual embeddings with optical priors, enabling more robust and discriminative flake representations. Finally, we establish QF-Bench, a comprehensive benchmark spanning multiple materials, substrates, and imaging settings, offering standardized protocols for fair and reproducible evaluation.
[CV-14] 4D Monocular Surgical Reconstruction under Arbitrary Camera Motions
【速读】:该论文旨在解决从单目内窥镜视频中重建可变形手术场景的挑战,尤其针对相机大幅运动下传统方法因依赖立体深度先验或精确的运动恢复结构(Structure-from-Motion, SfM)初始化而性能受限的问题。解决方案的关键在于提出Local-EndoGS框架,其核心创新包括:1)引入一种基于窗口的渐进式全局表示机制,将局部可变形场景模型分配至每个观测窗口,从而实现对长序列和大范围运动的可扩展重建;2)设计粗到精的初始化策略,融合多视角几何、跨窗口信息与单目深度先验,提升初始估计鲁棒性;3)集成远距离2D像素轨迹约束与物理运动先验,增强形变合理性。实验表明,该方法在多个公开数据集上均优于现有最先进方法,在外观质量和几何精度方面表现突出。
链接: https://arxiv.org/abs/2602.17473
作者: Jiwei Shan,Zeyu Cai,Cheng-Tai Hsieh,Yirui Li,Hao Liu,Lijun Han,Hesheng Wang,Shing Shin Cheng
机构: The Chinese University of Hong Kong (香港中文大学); Chinese Academy of Sciences (中国科学院); Shenyang Institute of Automation, Chinese Academy of Sciences (中国科学院沈阳自动化研究所); State Key Laboratory of Robotics and Intelligent Systems (机器人与智能系统全国重点实验室); Shanghai Jiao Tong University (上海交通大学); School of Integrated Circuits, Shanghai Jiao Tong University (上海交通大学集成电路学院); School of Automation and Intelligent Sensing, Shanghai Jiao Tong University (上海交通大学自动化与智能感知学院); Key Laboratory of System Control and Information Processing, Ministry of Education of China (教育部系统控制与信息处理重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Due to the limitation “The abstract field cannot be longer than 1,920 characters”, the abstract here is shorter than that in the PDF file
Abstract:Reconstructing deformable surgical scenes from endoscopic videos is challenging and clinically important. Recent state-of-the-art methods based on implicit neural representations or 3D Gaussian splatting have made notable progress. However, most are designed for deformable scenes with fixed endoscope viewpoints and rely on stereo depth priors or accurate structure-from-motion for initialization and optimization, limiting their ability to handle monocular sequences with large camera motion in real clinical settings. To address this, we propose Local-EndoGS, a high-quality 4D reconstruction framework for monocular endoscopic sequences with arbitrary camera motion. Local-EndoGS introduces a progressive, window-based global representation that allocates local deformable scene models to each observed window, enabling scalability to long sequences with substantial motion. To overcome unreliable initialization without stereo depth or accurate structure-from-motion, we design a coarse-to-fine strategy integrating multi-view geometry, cross-window information, and monocular depth priors, providing a robust foundation for optimization. We further incorporate long-range 2D pixel trajectory constraints and physical motion priors to improve deformation plausibility. Experiments on three public endoscopic datasets with deformable scenes and varying camera motions show that Local-EndoGS consistently outperforms state-of-the-art methods in appearance quality and geometry. Ablation studies validate the effectiveness of our key designs. Code will be released upon acceptance at: this https URL.
[CV-15] EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models
【速读】:该论文旨在解决工业异常检测中深度学习方法仅提供二元决策且缺乏语义解释的问题,同时克服现有多模态大语言模型(Multimodal Large Language Models, MLLMs)在异常检测任务中需昂贵微调且性能提升不稳定的局限。其解决方案的关键在于提出无需参数更新的专家增强注意力引导框架(Expert-Augmented Attention Guidance for Industrial Anomaly Detection in MLLMs, EAGLE),通过引入专家模型输出作为指导信号,引导MLLMs在不进行任何训练的情况下实现更准确的异常定位与可解释的描述生成。实验表明,EAGLE能有效提升多个MLLMs在MVTec-AD和VisA数据集上的检测性能,并促使模型在中间层注意力分布上更加聚焦于异常区域,从而增强检测的可解释性与鲁棒性。
链接: https://arxiv.org/abs/2602.17419
作者: Xiaomeng Peng,Xilang Huang,Seon Han Choi
机构: Ewha Womans University (梨花女子大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Industrial anomaly detection is important for smart manufacturing, but many deep learning approaches produce only binary decisions and provide limited semantic explanations. Multimodal large language models (MLLMs) can potentially generate fine-grained, language-based analyses, yet existing methods often require costly fine-tuning and do not consistently improve anomaly detection accuracy compared to lightweight specialist detectors. We propose expert-augmented attention guidance for industrial anomaly detection in MLLMs (EAGLE), a tuning-free framework that integrates outputs from an expert model to guide MLLMs toward both accurate detection and interpretable anomaly descriptions. We further study how EAGLE affects MLLMs’ internals by examining the attention distribution of MLLMs to the anomalous image regions in the intermediate layers. We observe that successful anomaly detection is associated with increased attention concentration on anomalous regions, and EAGLE tends to encourage this alignment. Experiments on MVTec-AD and VisA show that EAGLE improves anomaly detection performance across multiple MLLMs without any parameter updates, achieving results comparable to fine-tuning based methods. Code is available at this https URL
[CV-16] A High-Level Survey of Optical Remote Sensing
【速读】:该论文旨在解决当前光学遥感领域研究分散、缺乏系统性综述的问题,尤其针对无人机搭载RGB相机在遥感应用中的广泛使用却未形成统一认知的现状。其解决方案的关键在于提供一个全面且结构化的领域概览,整合关键数据集、任务类型与方法论,并通过高阶洞察帮助新进入的研究者快速定位感兴趣的研究方向,从而填补现有文献中对这一跨任务、跨方法的全景式综述空白。
链接: https://arxiv.org/abs/2602.17397
作者: Panagiotis Koletsis,Vasilis Efthymiou,Maria Vakalopoulou,Nikos Komodakis,Anastasios Doulamis,Georgios Th. Papadopoulos
机构: 1. University of Thessaly (色萨利大学); 2. National and Kapodistrian University of Athens (雅典国立卡波迪斯特里安大学); 3. INRIA (法国国家信息与自动化研究院); 4. University of Thessaly (色萨利大学); 5. Aristotle University of Thessaloniki (塞萨洛尼基亚里士多德大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:In recent years, significant advances in computer vision have also propelled progress in remote sensing. Concurrently, the use of drones has expanded, with many organizations incorporating them into their operations. Most drones are equipped by default with RGB cameras, which are both robust and among the easiest sensors to use and interpret. The body of literature on optical remote sensing is vast, encompassing diverse tasks, capabilities, and methodologies. Each task or methodology could warrant a dedicated survey. This work provides a comprehensive overview of the capabilities of the field, while also presenting key information, such as datasets and insights. It aims to serve as a guide for researchers entering the field, offering high-level insights and helping them focus on areas most relevant to their interests. To the best of our knowledge, no existing survey addresses this holistic perspective.
[CV-17] SpectralGCD: Spectral Concept Selection and Cross-modal Representation Learning for Generalized Category Discovery ICLR2026
【速读】:该论文旨在解决通用类别发现(Generalized Category Discovery, GCD)任务中因仅依赖图像特征训练参数化分类器而导致的对旧类过拟合问题,以及现有多模态方法在处理不同模态时独立建模且计算成本高的局限性。其解决方案的关键在于提出一种高效且有效的多模态方法 SpectralGCD,该方法利用 CLIP 模型中的跨模态图像-概念相似度构建统一的跨模态表示:每张图像被表达为来自一个大规模、任务无关词典的语义概念混合,从而将学习锚定在显式语义上并减少对虚假视觉线索的依赖;同时引入谱滤波(Spectral Filtering)机制,通过强教师模型计算的 softmax 化相似度的跨模态协方差矩阵自动保留词典中的相关概念,结合正向与反向知识蒸馏,确保学生模型学到的跨模态表示既语义充分又对齐良好。
链接: https://arxiv.org/abs/2602.17395
作者: Lorenzo Caselli,Marco Mistretta,Simone Magistri,Andrew D. Bagdanov
机构: University of Florence (佛罗伦萨大学); Media Integration and Communication Center (媒体整合与通信中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at ICLR 2026. Code available at this https URL
Abstract:Generalized Category Discovery (GCD) aims to identify novel categories in unlabeled data while leveraging a small labeled subset of known classes. Training a parametric classifier solely on image features often leads to overfitting to old classes, and recent multimodal approaches improve performance by incorporating textual information. However, they treat modalities independently and incur high computational cost. We propose SpectralGCD, an efficient and effective multimodal approach to GCD that uses CLIP cross-modal image-concept similarities as a unified cross-modal representation. Each image is expressed as a mixture over semantic concepts from a large task-agnostic dictionary, which anchors learning to explicit semantics and reduces reliance on spurious visual cues. To maintain the semantic quality of representations learned by an efficient student, we introduce Spectral Filtering which exploits a cross-modal covariance matrix over the softmaxed similarities measured by a strong teacher model to automatically retain only relevant concepts from the dictionary. Forward and reverse knowledge distillation from the same teacher ensures that the cross-modal representations of the student remain both semantically sufficient and well-aligned. Across six benchmarks, SpectralGCD delivers accuracy comparable to or significantly superior to state-of-the-art methods at a fraction of the computational cost. The code is publicly available at: this https URL.
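上述“图像-概念相似度表示 + 谱滤波”的核心计算可用如下 numpy 草图说明:把每张图像 softmax 化为概念混合,再用概念间协方差的主谱方向给概念打分。其中温度、能量阈值与打分方式均为假设示例,并非论文实现:

```python
import numpy as np

def spectral_concept_filter(img_emb, concept_emb, tau=0.01, energy=0.95):
    """img_emb: (N,d) 归一化图像嵌入; concept_emb: (C,d) 归一化概念文本嵌入。"""
    sim = img_emb @ concept_emb.T                       # (N, C) 图像-概念相似度
    p = np.exp((sim - sim.max(axis=1, keepdims=True)) / tau)
    p /= p.sum(axis=1, keepdims=True)                   # softmax:每张图表示为概念混合
    cov = np.cov(p, rowvar=False)                       # (C, C) 概念间协方差
    w, v = np.linalg.eigh(cov)
    w, v = w[::-1], v[:, ::-1]                          # 特征值降序排列
    k = int(np.searchsorted(np.cumsum(w) / w.sum(), energy)) + 1
    score = (v[:, :k] ** 2 * w[:k]).sum(axis=1)         # 概念在主谱方向上的能量
    return p, np.argsort(score)[::-1]                   # 混合表示 + 按相关性排序的概念索引

rng = np.random.default_rng(0)
img = rng.normal(size=(100, 64)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(50, 64));  txt /= np.linalg.norm(txt, axis=1, keepdims=True)
mix, ranked = spectral_concept_filter(img, txt)
print(ranked[:10])   # 谱能量最高的概念,可用于过滤词典中任务无关的词条
```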
[CV-18] DRetHTR: Linear-Time Decoder-Only Retentive Network for Handwritten Text Recognition
【速读】:该论文旨在解决当前基于Transformer的手写文本识别(Handwritten Text Recognition, HTR)系统在解码过程中因键值(Key-Value, KV)缓存不断增长而导致的推理速度慢、内存占用高的问题。解决方案的关键在于提出一种基于Retentive Network(RetNet)的仅解码器(decoder-only)模型DRetHTR:通过用无softmax的retention机制替代传统的softmax注意力机制,并引入多尺度序列先验(multi-scale sequential priors),有效避免了KV缓存的增长,使解码复杂度在时间和空间上均与输出长度呈线性关系;同时,设计层间gamma缩放策略以逐步扩展有效retention范围,恢复注意力机制中从局部到全局的归纳偏置,从而在不损失准确率的前提下显著提升推理效率——相较同等规模的Transformer基线模型,推理速度提升1.6–1.9倍,内存消耗减少38–42%。
链接: https://arxiv.org/abs/2602.17387
作者: Changhun Kim,Martin Mayr,Thomas Gorges,Fei Wu,Mathias Seuret,Andreas Maier,Vincent Christlein
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to Pattern Recognition, 11 pages + 2-page appendix, 7 figures, 12 tables
Abstract:State-of-the-art handwritten text recognition (HTR) systems commonly use Transformers, whose growing key-value (KV) cache makes decoding slow and memory-intensive. We introduce DRetHTR, a decoder-only model built on Retentive Networks (RetNet). Compared to an equally sized decoder-only Transformer baseline, DRetHTR delivers 1.6-1.9x faster inference with 38-42% less memory usage, without loss of accuracy. By replacing softmax attention with softmax-free retention and injecting multi-scale sequential priors, DRetHTR avoids a growing KV cache: decoding is linear in output length in both time and memory. To recover the local-to-global inductive bias of attention, we propose layer-wise gamma scaling, which progressively enlarges the effective retention horizon in deeper layers. This encourages early layers to model short-range dependencies and later layers to capture broader context, mitigating the flexibility gap introduced by removing softmax. Consequently, DRetHTR achieves best reported test character error rates of 2.26% (IAM-A, en), 1.81% (RIMES, fr), and 3.46% (Bentham, en), and is competitive on READ-2016 (de) with 4.21%. This demonstrates that decoder-only RetNet enables Transformer-level HTR accuracy with substantially improved decoding speed and memory efficiency.
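retention 为什么没有 KV 缓存增长,可以从它的递归形式直接看出:每步只需维护一个固定大小的状态矩阵。下面是概念性草图,其中逐层 gamma 的取值方式为假设(浅层衰减快、深层衰减慢),并非论文给出的公式:

```python
import numpy as np

def retention_step(S, q_t, k_t, v_t, gamma):
    """单步 retention 递归(示意):S <- gamma*S + k⊗v, o = qS。
    状态 S 大小固定,不随已生成长度增长,解码时间与显存对输出长度均为线性。"""
    S = gamma * S + np.outer(k_t, v_t)        # (d_k, d_v) 固定大小状态矩阵
    return S, q_t @ S                         # 当前步输出 (d_v,)

d_k, d_v, num_layers = 64, 64, 12
# 逐层 gamma 缩放(假设取值):gamma 越接近 1,有效 retention 范围越大
gammas = [1.0 - 2.0 ** (-5 - layer) for layer in range(num_layers)]

rng = np.random.default_rng(0)
S = np.zeros((d_k, d_v))
for t in range(100):                          # 解码 100 步,无 KV 缓存增长
    q, k, v = rng.normal(size=d_k), rng.normal(size=d_k), rng.normal(size=d_v)
    S, o = retention_step(S, q, k, v, gammas[0])
```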
[CV-19] Tree crop mapping of South America reveals links to deforestation and conservation
【速读】:该论文旨在解决当前零毁林政策(如欧盟《无毁林产品法规》EUDR)在监测树本作物扩张时面临的挑战,即缺乏高分辨率数据以准确区分农业系统与森林覆盖。其关键解决方案是构建首张南美洲10米分辨率的树本作物分布图,采用多模态时空深度学习模型,基于Sentinel-1和Sentinel-2卫星影像时间序列进行训练,从而精确识别约1100万公顷树本作物,并揭示其中23%与2000–2020年间森林覆盖损失相关。该高分辨率基准地图可有效减少因现有监管地图将小农户农林复合系统误判为“森林”而导致的虚假毁林预警及对小规模农民的不公平处罚,助力实现更具包容性和公平性的保护政策。
链接: https://arxiv.org/abs/2602.17372
作者: Yuchang Jiang,Anton Raichuk,Xiaoye Tong,Vivien Sainte Fare Garnot,Daniel Ortiz-Gonzalo,Dan Morris,Konrad Schindler,Jan Dirk Wegner,Maxim Neumann
机构: Google DeepMind(谷歌深度思维); EcoVision Lab, DM3L, University of Zurich(苏黎世大学生态视觉实验室, DM3L); University of Copenhagen(哥本哈根大学); Google Research(谷歌研究); ETH Zürich(苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Monitoring tree crop expansion is vital for zero-deforestation policies like the European Union’s Regulation on Deforestation-free Products (EUDR). However, these efforts are hindered by a lack of high-resolution data distinguishing diverse agricultural systems from forests. Here, we present the first 10m-resolution tree crop map for South America, generated using a multi-modal, spatio-temporal deep learning model trained on Sentinel-1 and Sentinel-2 satellite imagery time series. The map identifies approximately 11 million hectares of tree crops, 23% of which is linked to 2000-2020 forest cover loss. Critically, our analysis reveals that existing regulatory maps supporting the EUDR often classify established agriculture, particularly smallholder agroforestry, as “forest”. This discrepancy risks false deforestation alerts and unfair penalties for small-scale farmers. Our work mitigates this risk by providing a high-resolution baseline, supporting conservation policies that are effective, inclusive, and equitable.
[CV-20] Application and Evaluation of the Common Circles Method
【速读】:该论文旨在解决光学衍射层析成像(Optical Diffraction Tomography, ODT)中亚毫米级生物组织样本在无接触声学力场约束下的运动估计问题。由于样本无法固定,其微小位移会影响重建质量,因此需从采集的图像中准确估计运动参数。解决方案的关键在于采用通用圆方法(Common Circle Method),该方法通过识别傅里叶空间中Ewald球面的交线来确定旋转运动,并引入时间一致性约束以提升重建稳定性,从而实现计算高效的运动检测,优于传统全优化方法。
链接: https://arxiv.org/abs/2602.17353
作者: Michael Quellmalz,Mia Kvåle Løvmo,Simon Moser,Franziska Strasser,Monika Ritsch-Marte
机构: 未知
类目: Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We investigate the application of the common circle method for estimating sample motion in optical diffraction tomography (ODT) of sub-millimeter sized biological tissue. When samples are confined via contact-free acoustical force fields, their motion must be estimated from the captured images. The common circle method identifies intersections of Ewald spheres in Fourier space to determine rotational motion. This paper presents a practical implementation, incorporating temporal consistency constraints to achieve stable reconstructions. Our results on both simulated and real-world data demonstrate that the common circle method provides a computationally efficient alternative to full optimization methods for motion detection.
[CV-21] Polaffini: A feature-based approach for robust affine and polyaffine image registration
【速读】:该论文旨在解决医学图像配准中传统基于强度的方法依赖代理对齐指标、缺乏解剖学依据的问题,以及早期基于特征的方法因难以可靠提取解剖特征而被边缘化的问题。其解决方案的关键在于利用深度学习预训练分割模型,快速获得高精度的解剖结构边界,并从中简单提取中心点(centroid)作为具有1对1对应关系的解剖特征点,进而通过闭式解法实现高效的全局与局部仿射匹配,最终构建可调平滑度的多仿射(polyaffine)变换,该变换在对数欧几里得框架下保证了微分同胚性质,从而显著提升结构对齐精度并改善非线性配准的初始值。
链接: https://arxiv.org/abs/2602.17337
作者: Antoine Legouhy,Cosimo Campo,Ross Callaghan,Hojjat Azadbakht,Hui Zhang
机构: Hawkes Institute & Department of Computer Science, University College London (伦敦大学学院计算机科学系); AINOSTICS ltd. (AINOSTICS有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: associated github repo: this https URL
Abstract:In this work we present Polaffini, a robust and versatile framework for anatomically grounded registration. Medical image registration is dominated by intensity-based registration methods that rely on surrogate measures of alignment quality. In contrast, feature-based approaches that operate by identifying explicit anatomical correspondences, while more desirable in theory, have largely fallen out of favor due to the challenges of reliably extracting features. However, such challenges are now significantly overcome thanks to recent advances in deep learning, which provide pre-trained segmentation models capable of instantly delivering reliable, fine-grained anatomical delineations. We aim to demonstrate that these advances can be leveraged to create new anatomically-grounded image registration algorithms. To this end, we propose Polaffini, which obtains, from these segmented regions, anatomically grounded feature points with 1-to-1 correspondence in a particularly simple way: extracting their centroids. These enable efficient global and local affine matching via closed-form solutions. Those are used to produce an overall transformation ranging from affine to polyaffine with tunable smoothness. Polyaffine transformations can have many more degrees of freedom than affine ones allowing for finer alignment, and their embedding in the log-Euclidean framework ensures diffeomorphic properties. Polaffini has applications both for standalone registration and as pre-alignment for subsequent non-linear registration, and we evaluate it against popular intensity-based registration techniques. Results demonstrate that Polaffini outperforms competing methods in terms of structural alignment and provides improved initialisation for downstream non-linear registration. Polaffini is fast, robust, and accurate, making it particularly well-suited for integration into medical image processing pipelines.
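其中“由对应质心闭式求解仿射”一步可以写成一个标准最小二乘问题。以下玩具示例演示从两组一一对应的 3D 质心恢复仿射矩阵,区域加权与多仿射融合等论文细节从略:

```python
import numpy as np

def closed_form_affine(src, dst):
    """由一一对应质心 (N,3) 闭式估计仿射 A (3x4),最小化 ||A·[src;1] - dst||²。"""
    X = np.hstack([src, np.ones((src.shape[0], 1))])   # 齐次坐标 (N, 4)
    A, *_ = np.linalg.lstsq(X, dst, rcond=None)
    return A.T                                          # (3, 4)

rng = np.random.default_rng(0)
src = rng.normal(size=(20, 3))                          # 20 个分割区域的质心
A_true = np.hstack([np.diag([1.1, 0.9, 1.0]), [[2.0], [-1.0], [0.5]]])
dst = src @ A_true[:, :3].T + A_true[:, 3]
A_est = closed_form_affine(src, dst)
assert np.allclose(A_est, A_true, atol=1e-8)            # 可精确恢复真值仿射
```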
[CV-22] Leveraging Contrastive Learning for a Similarity-Guided Tampered Document Data Generation Pipeline
【速读】:该论文旨在解决文档图像中篡改文本检测任务因数据稀缺而导致模型泛化能力差的问题。现有方法依赖规则生成篡改文档,但生成结果多样性不足且视觉质量低,常留下明显伪影,与真实篡改场景存在显著差异,从而限制了模型学习鲁棒特征的能力。解决方案的关键在于提出一种新颖的高质量篡改文档图像生成框架:首先训练两个辅助网络——一个基于对比学习(contrastive learning)定义正负样本对以比较文本区域,另一个用于评估裁剪区域是否精确包围目标字符而不截断或包含邻近字符;随后利用这两个网络构建精心设计的生成流程,实现多样且高保真的篡改文档图像合成,从而提升下游检测模型在真实数据上的性能表现。
链接: https://arxiv.org/abs/2602.17322
作者: Mohamed Dhouib,Davide Buscaldi,Sonia Vanier,Aymen Shabou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Detecting tampered text in document images is a challenging task due to data scarcity. To address this, previous work has attempted to generate tampered documents using rule-based methods. However, the resulting documents often suffer from limited variety and poor visual quality, typically leaving highly visible artifacts that are rarely observed in real-world manipulations. This undermines the model’s ability to learn robust, generalizable features and results in poor performance on real-world data. Motivated by this discrepancy, we propose a novel method for generating high-quality tampered document images. We first train an auxiliary network to compare text crops, leveraging contrastive learning with a novel strategy for defining positive pairs and their corresponding negatives. We also train a second auxiliary network to evaluate whether a crop tightly encloses the intended characters, without cutting off parts of characters or including parts of adjacent ones. Using a carefully designed generation pipeline that leverages both networks, we introduce a framework capable of producing diverse, high-quality tampered document images. We assess the effectiveness of our data generation pipeline by training multiple models on datasets derived from the same source images, generated using our method and existing approaches, under identical training protocols. Evaluating these models on various open-source datasets shows that our pipeline yields consistent performance improvements across architectures and datasets.
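辅助比较网络所依赖的对比学习可以用标准 InfoNCE 形式示意;论文的创新点在正负样本对的构造策略,此处仅以“同一文字裁剪的两个视图互为正样本、批内其余为负样本”作占位假设:

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, tau=0.07):
    """z_a, z_b: (N, d) 成对文字裁剪的嵌入;第 i 行互为正样本,其余为负样本。"""
    z_a, z_b = F.normalize(z_a, dim=1), F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / tau            # (N, N) 相似度矩阵
    labels = torch.arange(z_a.size(0))      # 对角线为正样本
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
loss_value = float(loss)                    # 随训练下降,网络学会比较文字区域
```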
[CV-23] The Sound of Death: Deep Learning Reveals Vascular Damage from Carotid Ultrasound
【速读】:该论文旨在解决心血管疾病(Cardiovascular Diseases, CVDs)早期风险识别受限于现有诊断手段的问题,特别是如何从常规但信息未被充分挖掘的颈动脉超声视频中提取具有临床意义的血管损伤(Vascular Damage, VD)特征。其解决方案的关键在于构建一个基于机器学习(Machine Learning, ML)的框架,利用高血压作为弱监督标签,自动学习出生物合理、可解释且与已知心血管风险因素高度相关的血管损伤表征;该模型不仅能够有效分层个体的心肌梗死、心脏性死亡及全因死亡风险,表现优于或媲美传统风险评估模型(如SCORE2),还通过可解释人工智能(Explainable AI)揭示了其依赖于血管形态和周围组织特征,从而发现新的功能性和解剖学血管损伤标志物。
链接: https://arxiv.org/abs/2602.17321
作者: Christoph Balada,Aida Romano-Martinez,Payal Varshney,Vincent ten Cate,Katharina Geschke,Jonas Tesarz,Paul Claßen,Alexander K. Schuster,Dativa Tibyampansha,Karl-Patrik Kresoja,Philipp S. Wild,Sheraz Ahmed,Andreas Dengel
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Cardiovascular diseases (CVDs) remain the leading cause of mortality worldwide, yet early risk detection is often limited by available diagnostics. Carotid ultrasound, a non-invasive and widely accessible modality, encodes rich structural and hemodynamic information that is largely untapped. Here, we present a machine learning (ML) framework that extracts clinically meaningful representations of vascular damage (VD) from carotid ultrasound videos, using hypertension as a weak proxy label. The model learns robust features that are biologically plausible, interpretable, and strongly associated with established cardiovascular risk factors, comorbidities, and laboratory measures. High VD stratifies individuals for myocardial infarction, cardiac death, and all-cause mortality, matching or outperforming conventional risk models such as SCORE2. Explainable AI analyses reveal that the model relies on vessel morphology and perivascular tissue characteristics, uncovering novel functional and anatomical signatures of vascular damage. This work demonstrates that routine carotid ultrasound contains far more prognostic information than previously recognized. Our approach provides a scalable, non-invasive, and cost-effective tool for population-wide cardiovascular risk assessment, enabling earlier and more personalized prevention strategies without reliance on laboratory tests or complex clinical inputs.
[CV-24] Attachment Anchors: A Novel Framework for Laparoscopic Grasping Point Prediction in Colorectal Surgery
【速读】:该论文旨在解决微创手术中自主组织操作的关键挑战——准确预测抓取点(grasping point),尤其是在复杂多变的结直肠手术场景下。由于此类手术具有重复性组织操作特征且当前研究覆盖不足,传统仅依赖腹腔镜图像的方法难以应对分布外(out-of-distribution)情况下的不确定性。解决方案的关键在于引入“附着锚点”(attachment anchors),这是一种结构化表示方法,编码了组织与其解剖附着点之间的局部几何与力学关系,通过将手术场景归一化到一致的局部参考系来降低抓取点预测的不确定性。实验表明,该表示可从腹腔镜图像中预测并集成至基于机器学习的抓取框架,在90例结直肠手术数据集上显著优于纯图像基线模型,尤其在未见术式和不同术者场景下表现更优,验证了其作为学习驱动组织操作的有效中间表示能力。
链接: https://arxiv.org/abs/2602.17310
作者: Dennis N. Schneider,Lars Wagner,Daniel Rueckert,Dirk Wilhelm
机构: Technical University of Munich (慕尼黑工业大学); TUM School of Medicine and Health (慕尼黑工业大学医学院与健康学院); TUM University Hospital rechts der Isar (慕尼黑工业大学伊萨尔河右岸医院); Department of Surgery (外科); Research Group MITI (MITI 研究组); Chair for AI in Healthcare and Medicine Munich (慕尼黑医疗与健康人工智能讲席); Department of Computing, Imperial College London (帝国理工学院计算机系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate grasping point prediction is a key challenge for autonomous tissue manipulation in minimally invasive surgery, particularly in complex and variable procedures such as colorectal interventions. Due to their complexity and prolonged duration, colorectal procedures have been underrepresented in current research. At the same time, they pose a particularly interesting learning environment due to repetitive tissue manipulation, making them a promising entry point for autonomous, machine learning-driven support. Therefore, in this work, we introduce attachment anchors, a structured representation that encodes the local geometric and mechanical relationships between tissue and its anatomical attachments in colorectal surgery. This representation reduces uncertainty in grasping point prediction by normalizing surgical scenes into a consistent local reference frame. We demonstrate that attachment anchors can be predicted from laparoscopic images and incorporated into a grasping framework based on machine learning. Experiments on a dataset of 90 colorectal surgeries demonstrate that attachment anchors improve grasping point prediction compared to image-only baselines. There are particularly strong gains in out-of-distribution settings, including unseen procedures and operating surgeons. These results suggest that attachment anchors are an effective intermediate representation for learning-based tissue manipulation in colorectal surgery.
[CV-25] Physics Encoded Spatial and Temporal Generative Adversarial Network for Tropical Cyclone Image Super-resolution
【速读】:该论文旨在解决现有基于深度学习的超分辨率(Super-Resolution, SR)方法在处理热带气旋(Tropical Cyclone, TC)卫星图像序列时,因忽略大气物理规律而导致云系结构重建不准确的问题。其关键解决方案是提出一种物理编码的时空生成对抗网络(Physics Encoded Spatial and Temporal Generative Adversarial Network, PESTGAN),通过设计解耦生成器架构并引入PhyCell模块,利用约束卷积近似涡度方程,将物理动力学信息编码为隐式潜在表示,从而实现物理动态与视觉纹理的分离;同时采用双判别器框架,结合时间判别器以强制运动一致性,显著提升重建结果的气象合理性与物理保真度。
链接: https://arxiv.org/abs/2602.17277
作者: Ruoyi Zhang,Jiawei Yuan,Lujia Ye,Runling Yu,Liling Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review
Abstract:High-resolution satellite imagery is indispensable for tracking the genesis, intensification, and trajectory of tropical cyclones (TCs). However, existing deep learning-based super-resolution (SR) methods often treat satellite image sequences as generic videos, neglecting the underlying atmospheric physical laws governing cloud motion. To address this, we propose a Physics Encoded Spatial and Temporal Generative Adversarial Network (PESTGAN) for TC image super-resolution. Specifically, we design a disentangled generator architecture incorporating a PhyCell module, which approximates the vorticity equation via constrained convolutions and encodes the resulting approximate physical dynamics as implicit latent representations to separate physical dynamics from visual textures. Furthermore, a dual-discriminator framework is introduced, employing a temporal discriminator to enforce motion consistency alongside spatial realism. Experiments on the Digital Typhoon dataset for 4× upscaling demonstrate that PESTGAN achieves better performance in structural fidelity and perceptual quality. While maintaining competitive pixel-wise accuracy compared to existing approaches, our method significantly excels in reconstructing meteorologically plausible cloud structures with superior physical fidelity.
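PhyCell 的“物理外推 + 门控校正”两步更新可用如下 PyTorch 草图说明。结构参考 PhyDNet 风格的 PhyCell;涡度方程的具体卷积约束为论文细节,此处仅给出未加约束的简化形式:

```python
import torch
import torch.nn as nn

class PhyCellSketch(nn.Module):
    """简化 PhyCell(示意):先作物理外推 h_hat = h + Phi(h),再用门控 K 吸收观测编码 e。"""
    def __init__(self, ch):
        super().__init__()
        self.phi = nn.Conv2d(ch, ch, 7, padding=3)     # 近似空间微分算子的卷积(此处未加约束)
        self.gate = nn.Conv2d(2 * ch, ch, 3, padding=1)

    def forward(self, h, e):
        h_hat = h + self.phi(h)                        # 物理动力学外推
        k = torch.sigmoid(self.gate(torch.cat([h_hat, e], dim=1)))
        return h_hat + k * (e - h_hat)                 # Kalman 式门控校正

cell = PhyCellSketch(16)
h = torch.zeros(1, 16, 32, 32)                         # 物理隐状态
e = torch.randn(1, 16, 32, 32)                         # 当前帧的编码特征(假设)
h_next = cell(h, e)
```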
[CV-26] Unified Latents (UL): How to train your latents
【速读】:该论文旨在解决现有生成式模型中潜在表示(latent representation)学习效率与重建质量之间的权衡问题,尤其是在训练计算成本较高和潜在空间压缩率不足的场景下。解决方案的关键在于提出统一潜在框架(Unified Latents, UL),通过将编码器输出的噪声与扩散先验(diffusion prior)的最小噪声水平相联系,构建一个简洁的训练目标,该目标提供了潜在比特率(latent bitrate)的紧致上界,从而在保证高重建质量(如ImageNet-512上的高PSNR)的同时,显著降低训练所需的浮点运算次数(FLOPs),并在Kinetics-600视频数据集上实现了新的最优FVD指标(1.3)。
链接: https://arxiv.org/abs/2602.17270
作者: Jonathan Heek,Emiel Hoogeboom,Thomas Mensink,Tim Salimans
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present Unified Latents (UL), a framework for learning latent representations that are jointly regularized by a diffusion prior and decoded by a diffusion model. By linking the encoder’s output noise to the prior’s minimum noise level, we obtain a simple training objective that provides a tight upper bound on the latent bitrate. On ImageNet-512, our approach achieves competitive FID of 1.4, with high reconstruction quality (PSNR) while requiring fewer training FLOPs than models trained on Stable Diffusion latents. On Kinetics-600, we set a new state-of-the-art FVD of 1.3.
[CV-27] EA-Swin: An Embedding-Agnostic Swin Transformer for AI-Generated Video Detection
【速读】:该论文旨在解决当前AI生成视频检测方法在面对如Sora、Veo等新一代基础视频生成模型时所暴露出的局限性,这些问题主要体现在依赖浅层嵌入轨迹、基于图像的适应性差或计算资源消耗大的大型多模态语言模型(MLLM)上。解决方案的关键在于提出EA-Swin模型,其核心创新是采用一种解耦的窗口化注意力机制,直接在预训练视频嵌入上建模时空依赖关系,从而实现对通用ViT类分块编码器的兼容;同时构建了包含13万条视频的EA-Video基准数据集,涵盖多种商业与开源生成器及未见生成器划分,支持跨分布评估。实验表明,EA-Swin在主流生成器上达到0.97–0.99的准确率,显著优于现有最先进方法(通常为0.8–0.9),且具备强泛化能力,为现代AI生成视频检测提供了可扩展、鲁棒的解决方案。
链接: https://arxiv.org/abs/2602.17260
作者: Hung Mai,Loi Dinh,Duc Hai Nguyen,Dat Do,Luong Doan,Khanh Nguyen Quoc,Huan Vu,Phong Ho,Naeem Ul Islam,Tuan Do
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: First preprint
Abstract:Recent advances in foundation video generators such as Sora2, Veo3, and other commercial systems have produced highly realistic synthetic videos, exposing the limitations of existing detection methods that rely on shallow embedding trajectories, image-based adaptation, or computationally heavy MLLMs. We propose EA-Swin, an Embedding-Agnostic Swin Transformer that models spatiotemporal dependencies directly on pretrained video embeddings via a factorized windowed attention design, making it compatible with generic ViT-style patch-based encoders. Alongside the model, we construct the EA-Video dataset, a benchmark dataset comprising 130K videos that integrates newly collected samples with curated existing datasets, covering diverse commercial and open-source generators and including unseen-generator splits for rigorous cross-distribution evaluation. Extensive experiments show that EA-Swin achieves 0.97-0.99 accuracy across major generators, outperforming prior SoTA methods (typically 0.8-0.9) by a margin of 5-20%, while maintaining strong generalization to unseen distributions, establishing a scalable and robust solution for modern AI-generated video detection.
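“在预训练视频嵌入上做分解式窗口注意力”的思路可用以下 PyTorch 草图说明:先在每帧的空间窗口内做注意力,再让同一空间位置跨帧做时间注意力。窗口大小、头数与分解顺序均为假设,并非 EA-Swin 的具体配置:

```python
import torch
import torch.nn as nn

class FactorizedWindowAttention(nn.Module):
    """分解式窗口注意力(示意):先在每帧的 w×w 空间窗口内做注意力,再沿时间轴做注意力。
    直接作用于预训练编码器输出的 patch 嵌入网格。"""
    def __init__(self, dim, heads=4, win=4):
        super().__init__()
        self.win = win
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                         # x: (B, T, H, W, C),需 H、W 能被 win 整除
        B, T, H, W, C = x.shape
        w = self.win
        xs = x.reshape(B * T, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        xs = xs.reshape(-1, w * w, C)             # 每个空间窗口展平为一个序列
        xs, _ = self.spatial(xs, xs, xs)
        xs = xs.reshape(B * T, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        x = xs.reshape(B, T, H, W, C)
        xt = x.permute(0, 2, 3, 1, 4).reshape(B * H * W, T, C)   # 同一空间位置跨帧成序列
        xt, _ = self.temporal(xt, xt, xt)
        return xt.reshape(B, H, W, T, C).permute(0, 3, 1, 2, 4)

block = FactorizedWindowAttention(dim=64)
out = block(torch.randn(2, 8, 8, 8, 64))          # 输出保持 (B, T, H, W, C) 形状不变
```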
[CV-28] A Multi-modal Detection System for Infrastructure-based Freight Signal Priority
【速读】:该论文旨在解决货运车辆在信号交叉口处因缺乏可靠检测与运动估计而导致无法有效实施基础设施驱动的货运信号优先(Freight Signal Priority, FSP)的问题。其核心挑战在于实现对车辆类型、位置和速度的高精度、实时感知,以支撑优先控制策略的执行。解决方案的关键在于设计并部署了一种基于多模态传感(LiDAR与摄像头融合)的基础设施级货运车辆检测系统,采用分层式混合传感架构(包括路口安装子系统与路段中段子系统),通过无线通信实现同步数据传输,并结合聚类与深度学习检测方法及卡尔曼滤波跟踪算法,确保稳定实时性能;同时利用LiDAR测量值注册至大地坐标系,实现车道级定位与一致的车辆跟踪,从而在高时空分辨率下可靠监测货运车辆动态行为。
链接: https://arxiv.org/abs/2602.17252
作者: Ziyan Zhang,Chuheng Wei,Xuanpeng Zhao,Siyan Li,Will Snyder,Mike Stas,Peng Hao,Kanok Boriboonsomsin,Guoyuan Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Systems and Control (eess.SY)
备注: 12 pages, 15 figures. Accepted at ICTD 2026. Final version to appear in ASCE Proceedings
Abstract:Freight vehicles approaching signalized intersections require reliable detection and motion estimation to support infrastructure-based Freight Signal Priority (FSP). Accurate and timely perception of vehicle type, position, and speed is essential for enabling effective priority control strategies. This paper presents the design, deployment, and evaluation of an infrastructure-based multi-modal freight vehicle detection system integrating LiDAR and camera sensors. A hybrid sensing architecture is adopted, consisting of an intersection-mounted subsystem and a midblock subsystem, connected via wireless communication for synchronized data transmission. The perception pipeline incorporates both clustering-based and deep learning-based detection methods with Kalman filter tracking to achieve stable real-time performance. LiDAR measurements are registered into geodetic reference frames to support lane-level localization and consistent vehicle tracking. Field evaluations demonstrate that the system can reliably monitor freight vehicle movements at high spatio-temporal resolution. The design and deployment provide practical insights for developing infrastructure-based sensing systems to support FSP applications.
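感知管线中的卡尔曼滤波跟踪可用一个匀速运动模型的最小示例说明(2D 位置 + 速度);过程/观测噪声参数为任意假设值,与实际部署系统无关:

```python
import numpy as np

dt = 0.1
F = np.array([[1, 0, dt, 0], [0, 1, 0, dt], [0, 0, 1, 0], [0, 0, 0, 1]], float)
H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], float)   # 只观测位置 (x, y)
Q, R = np.eye(4) * 1e-2, np.eye(2) * 0.5            # 过程/观测噪声(假设)

def kf_step(x, P, z):
    x, P = F @ x, F @ P @ F.T + Q                   # 预测
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)                  # 卡尔曼增益
    x = x + K @ (z - H @ x)                         # 更新
    P = (np.eye(4) - K @ H) @ P
    return x, P

x, P = np.zeros(4), np.eye(4)                       # 状态 [x, y, vx, vy]
rng = np.random.default_rng(0)
for t in range(50):                                 # 模拟带噪的货车位置检测序列
    z = np.array([t * 1.0, t * 0.5]) + rng.normal(scale=0.5, size=2)
    x, P = kf_step(x, P, z)
print(x[2:])                                        # 估计速度约为 (10, 5) m/s
```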
[CV-29] Inferring Height from Earth Embeddings: First insights using Google AlphaEarth
【速读】:该论文旨在解决如何利用地球嵌入(Earth Embeddings)中的地理空间与多模态特征,有效引导深度学习回归模型进行区域地表高程制图的问题。其关键解决方案在于采用轻量级卷积解码器结构(U-Net 和 U-Net++)来解析 AlphaEarth Embeddings 中编码的地形信息,并通过高精度数字表面模型(DSM)作为参考评估其在地表高程估计中的有效性。结果表明,两种架构均展现出强训练性能(R² = 0.97),且 U-Net++ 在测试集上表现出更强的泛化能力(R² = 0.84,中位数偏差 -2.62 m),优于标准 U-Net(R² = 0.78,中位数偏差 -7.22 m),说明嵌入中蕴含可迁移的地形模式,而空间感知的卷积架构是提升区域适应性的核心要素。
链接: https://arxiv.org/abs/2602.17250
作者: Alireza Hamoudzadeh,Valeria Belloni,Roberta Ravanelli
机构: Sapienza University of Rome (罗马大学); University of Liège (列日大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 29 pages, 9 figures
Abstract:This study investigates whether the geospatial and multimodal features encoded in Earth Embeddings can effectively guide deep learning (DL) regression models for regional surface height mapping. In particular, we focused on AlphaEarth Embeddings at 10 m spatial resolution and evaluated their capability to support terrain height inference using a high-quality Digital Surface Model (DSM) as reference. U-Net and U-Net++ architectures were thus employed as lightweight convolutional decoders to assess how well the geospatial information distilled in the embeddings can be translated into accurate surface height estimates. Both architectures achieved strong training performance (both with R² = 0.97), confirming that the embeddings encode informative and decodable height-related signals. On the test set, performance decreased due to distribution shifts in height frequency between training and testing areas. Nevertheless, U-Net++ shows better generalization (R² = 0.84, median difference = -2.62 m) compared with the standard U-Net (R² = 0.78, median difference = -7.22 m), suggesting enhanced robustness to distribution mismatch. While the testing RMSE (approximately 16 m for U-Net++) and residual bias highlight remaining challenges in generalization, strong correlations indicate that the embeddings capture transferable topographic patterns. Overall, the results demonstrate the promising potential of AlphaEarth Embeddings to guide DL-based height mapping workflows, particularly when combined with spatially aware convolutional architectures, while emphasizing the need to address bias for improved regional transferability.
[CV-30] HiMAP: History-aware Map-occupancy Prediction with Fallback
【速读】:该论文旨在解决自动驾驶中运动预测(motion forecasting)因多目标跟踪(Multi-Object Tracking, MOT)失败而导致的性能下降与安全风险问题。传统方法依赖于持续且准确的物体身份关联,但在遮挡、身份切换或漏检等场景下,MOT失效会显著影响预测质量。其解决方案的关键在于提出一种无跟踪(tracking-free)的轨迹预测框架HiMAP:通过将历史检测结果转换为时空不变的历史占用图(historical occupancy maps),引入历史查询模块(historical query module)以当前代理状态为条件,从无标签的占用表示中迭代检索特定代理的历史信息;再结合时间映射嵌入(temporal map embedding)与最终查询及地图上下文,驱动类似DETR的解码器生成多模态未来轨迹。该设计摆脱了对身份标识的依赖,支持流式推理,并在无跟踪条件下仍保持鲁棒性,实验证明其在Argoverse 2数据集上达到与基于跟踪方法相当的性能,并在无跟踪设置下显著优于强基线模型。
链接: https://arxiv.org/abs/2602.17231
作者: Yiming Xu,Yi Yang,Hao Cheng,Monika Sester
机构: Leibniz University Hannover (汉诺威莱布尼茨大学); University of Twente (特温特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in 2026 IEEE International Conference on Robotics and Automation
Abstract:Accurate motion forecasting is critical for autonomous driving, yet most predictors rely on multi-object tracking (MOT) with identity association, assuming that objects are correctly and continuously tracked. When tracking fails due to, e.g., occlusion, identity switches, or missed detections, prediction quality degrades and safety risks increase. We present HiMAP, a tracking-free trajectory prediction framework that remains reliable under MOT failures. HiMAP converts past detections into spatiotemporally invariant historical occupancy maps and introduces a historical query module that conditions on the current agent state to iteratively retrieve agent-specific history from unlabeled occupancy representations. The retrieved history is summarized by a temporal map embedding and, together with the final query and map context, drives a DETR-style decoder to produce multi-modal future trajectories. This design lifts identity reliance, supports streaming inference via reusable encodings, and serves as a robust fallback when tracking is unavailable. On Argoverse 2, HiMAP achieves performance comparable to tracking-based methods while operating without IDs, and it substantially outperforms strong baselines in the no-tracking setting, yielding relative gains of 11% in FDE, 12% in ADE, and a 4% reduction in MR over a fine-tuned QCNet. Beyond aggregate metrics, HiMAP delivers stable forecasts for all agents simultaneously without waiting for tracking to recover, highlighting its practical value for safety-critical autonomy. The code is available under: this https URL.
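“历史占用图”的构建可理解为:把过去若干帧的无 ID 检测栅格化到以当前帧为基准的 BEV 网格上,每个时间步一张。下面是坐标范围与分辨率均为假设的最小示意:

```python
import numpy as np

def rasterize_history(detections, grid=128, extent=50.0):
    """detections: 长度 T 的列表,每项为 (N_t, 2) 的无 ID 检测中心(当前帧坐标系,米)。
    返回 (T, grid, grid) 的历史占用图,跟踪失败时仍可构建。"""
    occ = np.zeros((len(detections), grid, grid), dtype=np.float32)
    for t, dets in enumerate(detections):
        for x, y in dets:
            i = int((x + extent) / (2 * extent) * grid)
            j = int((y + extent) / (2 * extent) * grid)
            if 0 <= i < grid and 0 <= j < grid:
                occ[t, i, j] = 1.0
    return occ

history = [np.random.uniform(-40, 40, size=(5, 2)) for _ in range(10)]  # 10 帧检测
maps = rasterize_history(history)   # 无需身份关联即可得到时空表示
```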
[CV-31] GASS: Geometry-Aware Spherical Sampling for Disentangled Diversity Enhancement in Text-to-Image Generation
【速读】:该论文旨在解决现代文本到图像(Text-to-Image, T2I)生成模型在给定相同提示时缺乏多样性的问题,这一问题不仅限制了用户的选择空间,还可能放大社会偏见。解决方案的关键在于提出几何感知的球面采样(Geometry-Aware Spherical Sampling, GASS),其核心思想是通过分解CLIP嵌入中的多样性度量为两个正交方向——文本嵌入(对应提示相关的语义变化)和一个识别出的正交方向(对应提示无关的变化,如背景等),从而在生成过程中分别增强这两个维度上的投影扩散,并通过扩展生成轨迹上的预测来引导采样过程。该方法实现了对多样性的解耦增强,在不显著影响图像保真度和语义一致性的情况下提升了生成结果的多样性。
链接: https://arxiv.org/abs/2602.17200
作者: Ye Zhu,Kaleb S. Newman,Johannes F. Lutzeyer,Adriana Romero-Soriano,Michal Drozdzal,Olga Russakovsky
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. Code will be available at this https URL
Abstract:Despite high semantic alignment, modern text-to-image (T2I) generative models still struggle to synthesize diverse images from a given prompt. This lack of diversity not only restricts user choice, but also risks amplifying societal biases. In this work, we enhance the T2I diversity through a geometric lens. Unlike most existing methods that rely primarily on entropy-based guidance to increase sample dissimilarity, we introduce Geometry-Aware Spherical Sampling (GASS) to enhance diversity by explicitly controlling both prompt-dependent and prompt-independent sources of variation. Specifically, we decompose the diversity measure in CLIP embeddings using two orthogonal directions: the text embedding, which captures semantic variation related to the prompt, and an identified orthogonal direction that captures prompt-independent variation (e.g., backgrounds). Based on this decomposition, GASS increases the geometric projection spread of generated image embeddings along both axes and guides the T2I sampling process via expanded predictions along the generation trajectory. Our experiments on different frozen T2I backbones (U-Net and DiT, diffusion and flow) and benchmarks demonstrate the effectiveness of disentangled diversity enhancement with minimal impact on image fidelity and semantic alignment.
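其核心几何操作是把生成图像的嵌入分解到“文本方向”与“与之正交的方向”上,并分别度量投影散布。以下 numpy 示意中,正交方向简单取残差的第一主方向,这只是一个假设性的识别方式:

```python
import numpy as np

def diversity_projections(img_embs, text_emb):
    """img_embs: (N, d) 生成图像的 CLIP 嵌入; text_emb: (d,) 提示词嵌入。"""
    t = text_emb / np.linalg.norm(text_emb)
    proj_t = img_embs @ t                        # 提示相关方向上的投影
    resid = img_embs - np.outer(proj_t, t)       # 去掉文本分量后的残差(与 t 正交)
    _, _, vh = np.linalg.svd(resid - resid.mean(0), full_matrices=False)
    o = vh[0]                                    # 以残差第一主方向近似"提示无关方向"(假设)
    proj_o = img_embs @ o
    return proj_t.std(), proj_o.std()            # 两个正交轴上的投影散布,越大越多样

embs = np.random.randn(16, 512); embs /= np.linalg.norm(embs, axis=1, keepdims=True)
text = np.random.randn(512)
spread_text, spread_orth = diversity_projections(embs, text)
```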
[CV-32] EntropyPrune: Matrix Entropy Guided Visual Token Pruning for Multimodal Large Language Models
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在推理过程中因处理大量视觉标记(visual tokens)而导致的高计算成本问题。现有基于令牌剪枝(token pruning)的方法通常依赖于静态、经验性选择的网络层,缺乏可解释性和跨模型迁移能力。其解决方案的关键在于提出一种基于矩阵熵(matrix-entropy)的新视角,识别出一个“熵崩溃层”(Entropy Collapse Layer, ECL),即视觉表示的信息内容在此处出现显著且一致的下降,从而为剪枝阶段提供了一个理论依据。在此基础上,作者提出了EntropyPrune框架,通过量化单个视觉令牌的信息价值来剪除冗余令牌,无需依赖注意力图,并利用对偶Gram矩阵的谱等价性降低熵计算复杂度,实现最高达64倍的理论加速效果。该方法在多个多模态基准上均优于当前最优剪枝方法,在保持性能的同时显著提升效率。
链接: https://arxiv.org/abs/2602.17196
作者: Yahong Wang,Juncheng Wu,Zhangkai Ni,Chengmei Yang,Yihang Liu,Longzhen Yang,Yuyin Zhou,Ying Wen,Lianghua He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal large language models (MLLMs) incur substantial inference cost due to the processing of hundreds of visual tokens per image. Although token pruning has proven effective for accelerating inference, determining when and where to prune remains largely heuristic. Existing approaches typically rely on static, empirically selected layers, which limit interpretability and transferability across models. In this work, we introduce a matrix-entropy perspective and identify an “Entropy Collapse Layer” (ECL), where the information content of visual representations exhibits a sharp and consistent drop, which provides a principled criterion for selecting the pruning stage. Building on this observation, we propose EntropyPrune, a novel matrix-entropy-guided token pruning framework that quantifies the information value of individual visual tokens and prunes redundant ones without relying on attention maps. Moreover, to enable efficient computation, we exploit the spectral equivalence of dual Gram matrices, reducing the complexity of entropy computation and yielding up to a 64x theoretical speedup. Extensive experiments on diverse multimodal benchmarks demonstrate that EntropyPrune consistently outperforms state-of-the-art pruning methods in both accuracy and efficiency. On LLaVA-1.5-7B, our method achieves a 68.2% reduction in FLOPs while preserving 96.0% of the original performance. Furthermore, EntropyPrune generalizes effectively to high-resolution and video-based models, highlighting the strong robustness and scalability in practical MLLM acceleration. The code will be publicly available at this https URL.
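“对偶 Gram 矩阵谱等价”指 XXᵀ 与 XᵀX 的非零特征值相同,因此可总在较小的那个 Gram 矩阵上求谱来计算矩阵熵,这正是加速的来源。下面的示意验证了两种算法结果一致(熵的迹归一化方式为假设):

```python
import numpy as np

def matrix_entropy(X):
    """X: (n, d) 视觉 token 特征。在较小的 Gram 矩阵上求谱,再按迹归一化计算矩阵熵。"""
    n, d = X.shape
    G = X @ X.T if n <= d else X.T @ X
    w = np.clip(np.linalg.eigvalsh(G), 0.0, None)
    p = w / w.sum()                              # 特征值按迹归一化为概率分布(假设)
    p = p[p > 1e-12]
    return float(-(p * np.log(p)).sum())

X = np.random.randn(32, 512)                     # 32 个视觉 token,512 维特征
h = matrix_entropy(X)                            # 谱在 32×32 的 Gram 上计算
w = np.clip(np.linalg.eigvalsh(X.T @ X), 0.0, None)   # 对照:在 512×512 上计算(慢)
p = w / w.sum(); p = p[p > 1e-12]
assert np.isclose(h, -(p * np.log(p)).sum(), atol=1e-6)
```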
[CV-33] Texo: Formula Recognition within 20M Parameters
【速读】:该论文旨在解决公式识别模型在保持高性能的同时,模型参数量过大导致部署困难的问题。其解决方案的关键在于通过精心设计的注意力机制、词汇表与分词器的蒸馏(distillation)及迁移(transfer),在仅使用2000万参数的情况下实现了与当前最优模型(如UniMERNet-T和PPFormulaNet-S)相当的性能,从而显著降低模型规模(分别减少80%和65%),使其实现消费级硬件上的实时推理甚至浏览器内部署。
链接: https://arxiv.org/abs/2602.17189
作者: Sicheng Mao
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In this paper we present Texo, a minimalist yet high-performance formula recognition model that contains only 20 million parameters. By attentive design, distillation and transfer of the vocabulary and the tokenizer, Texo achieves comparable performance to state-of-the-art models such as UniMERNet-T and PPFormulaNet-S, while reducing the model size by 80% and 65%, respectively. This enables real-time inference on consumer-grade hardware and even in-browser deployment. We also developed a web application to demonstrate the model's capabilities and facilitate its usage for end users.
[CV-34] Selective Training for Large Vision Language Models via Visual Information Gain
【速读】:该论文旨在解决大型视觉语言模型(Large Vision Language Models, LVLMs)中存在的语言偏置问题,即模型在生成回答时过度依赖文本信息而忽视视觉证据。为实现对视觉输入贡献的量化评估,作者提出了一种基于困惑度(perplexity)的指标——视觉信息增益(Visual Information Gain, VIG),用于衡量视觉输入对降低预测不确定性的作用。该指标可实现样本级和词元级的细粒度分析,精准识别出受图像显著影响的语义元素(如颜色、空间关系和属性)。解决方案的关键在于利用VIG引导的选择性训练策略,优先训练高VIG样本与词元,从而强化模型的视觉接地能力并缓解语言偏置,在显著减少标注监督的前提下提升性能。
链接: https://arxiv.org/abs/2602.17186
作者: Seulbi Lee,Sangheum Hwang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large Vision Language Models (LVLMs) have achieved remarkable progress, yet they often suffer from language bias, producing answers without relying on visual evidence. While prior work attempts to mitigate this issue through decoding strategies, architectural modifications, or curated instruction data, they typically lack a quantitative measure of how much individual training samples or tokens actually benefit from the image. In this work, we introduce Visual Information Gain (VIG), a perplexity-based metric that measures the reduction in prediction uncertainty provided by visual input. VIG enables fine-grained analysis at both sample and token levels, effectively highlighting visually grounded elements such as colors, spatial relations, and attributes. Leveraging this, we propose a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens. This approach improves visual grounding and mitigates language bias, achieving superior performance with significantly reduced supervision by focusing exclusively on visually informative samples and tokens.
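VIG 的计算思路可以写成“有图 / 无图”两次前向的逐 token 对数似然之差(等价于困惑度之比的对数)。下面给出与具体模型无关的计算函数,两个输入序列假设分别来自带图像与不带图像条件的前向:

```python
import numpy as np

def visual_information_gain(logp_with_img, logp_without_img):
    """输入均为答案 token 的对数概率序列 (T,)。返回逐 token 与样本级 VIG。"""
    token_vig = np.asarray(logp_with_img) - np.asarray(logp_without_img)
    return token_vig, float(token_vig.mean())   # 样本级 VIG 可用于训练样本筛选

tok_vig, sample_vig = visual_information_gain([-0.2, -1.1, -0.3], [-2.5, -1.2, -0.3])
# 第一个 token 的 VIG 最高:去掉图像后它变得难以预测,属于"视觉接地"的内容
# (如颜色、空间关系、属性);VIG 接近 0 的 token 则主要由语言先验决定
```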
[CV-35] NRGS-SLAM: Monocular Non-Rigid SLAM for Endoscopy via Deformation-Aware 3D Gaussian Splatting
【速读】:该论文旨在解决内窥镜场景下单目非刚性同步定位与建图(monocular non-rigid SLAM)中存在的耦合模糊性问题,即由于软组织变形导致相机自身运动(ego-motion)与场景内在形变之间难以区分,从而引发跟踪漂移和重建质量低下。解决方案的关键在于提出NRGS-SLAM系统,其核心创新是引入一种具有形变感知能力的3D高斯点阵表示(deformation-aware 3D Gaussian map),通过在每个高斯原型中附加可学习的形变概率,并利用贝叶斯自监督策略进行优化,无需外部非刚性标签即可有效解耦形变与运动;同时结合分层鲁棒位姿估计、高效帧级形变更新以及融合几何先验的统一鲁棒几何损失函数,显著提升了位姿估计精度(RMSE降低最高达50%)和重建图像的真实感质量。
链接: https://arxiv.org/abs/2602.17182
作者: Jiwei Shan,Zeyu Cai,Yirui Li,Yongbo Chen,Lijun Han,Yun-hui Liu,Hesheng Wang,Shing Shin Cheng
机构: The Chinese University of Hong Kong (香港中文大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Visual simultaneous localization and mapping (V-SLAM) is a fundamental capability for autonomous perception and navigation. However, endoscopic scenes violate the rigidity assumption due to persistent soft-tissue deformations, creating a strong coupling ambiguity between camera ego-motion and intrinsic deformation. Although recent monocular non-rigid SLAM methods have made notable progress, they often lack effective decoupling mechanisms and rely on sparse or low-fidelity scene representations, which leads to tracking drift and limited reconstruction quality. To address these limitations, we propose NRGS-SLAM, a monocular non-rigid SLAM system for endoscopy based on 3D Gaussian Splatting. To resolve the coupling ambiguity, we introduce a deformation-aware 3D Gaussian map that augments each Gaussian primitive with a learnable deformation probability, optimized via a Bayesian self-supervision strategy without requiring external non-rigidity labels. Building on this representation, we design a deformable tracking module that performs robust coarse-to-fine pose estimation by prioritizing low-deformation regions, followed by efficient per-frame deformation updates. A carefully designed deformable mapping module progressively expands and refines the map, balancing representational capacity and computational efficiency. In addition, a unified robust geometric loss incorporates external geometric priors to mitigate the inherent ill-posedness of monocular non-rigid SLAM. Extensive experiments on multiple public endoscopic datasets demonstrate that NRGS-SLAM achieves more accurate camera pose estimation (up to 50% reduction in RMSE) and higher-quality photo-realistic reconstructions than state-of-the-art methods. Comprehensive ablation studies further validate the effectiveness of our key design choices. Source code will be publicly available upon paper acceptance.
[CV-36] BadCLIP: Stealthy and Persistent Backdoors in Multimodal Contrastive Learning
【速读】:该论文旨在解决针对多模态对比学习模型的后门攻击中面临的两个核心挑战:隐蔽性(stealthiness)和持久性(persistence)。现有方法在强检测机制或持续微调下易失效,主要归因于跨模态不一致性暴露触发模式,以及低中毒率下的梯度稀释加速后门遗忘。其解决方案的关键在于提出统一框架BadCLIP++,通过语义融合QR微触发器(semantic-fusion QR micro-trigger)实现不可察觉的触发模式嵌入,同时结合目标对齐子集选择增强低注入率下的信号强度;在持久性方面,采用半径收缩与中心对齐稳定触发嵌入,并通过曲率控制和弹性权重巩固(elastic weight consolidation)稳定模型参数,确保解位于低曲率宽盆地内以抵抗微调干扰。此外,论文首次提供了理论分析证明,在信任区域内干净微调与后门目标梯度共向,从而保证攻击成功率衰减的上界非递增。
链接: https://arxiv.org/abs/2602.17168
作者: Siyuan Liang,Yongcheng Jing,Yingjie Wang,Jiaxing Huang,Ee-chien Chang,Dacheng Tao
机构: College of Computing and Data Science, Nanyang Technological University, Singapore; School of Computing, National University of Singapore, Singapore
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 25 pages, 10 figures
Abstract:Research on backdoor attacks against multimodal contrastive learning models faces two key challenges: stealthiness and persistence. Existing methods often fail under strong detection or continuous fine-tuning, largely due to (1) cross-modal inconsistency that exposes trigger patterns and (2) gradient dilution at low poisoning rates that accelerates backdoor forgetting. These coupled causes remain insufficiently modeled and addressed. We propose BadCLIP++, a unified framework that tackles both challenges. For stealthiness, we introduce a semantic-fusion QR micro-trigger that embeds imperceptible patterns near task-relevant regions, preserving clean-data statistics while producing compact trigger distributions. We further apply target-aligned subset selection to strengthen signals at low injection rates. For persistence, we stabilize trigger embeddings via radius shrinkage and centroid alignment, and stabilize model parameters through curvature control and elastic weight consolidation, maintaining solutions within a low-curvature wide basin resistant to fine-tuning. We also provide the first theoretical analysis showing that, within a trust region, gradients from clean fine-tuning and backdoor objectives are co-directional, yielding a non-increasing upper bound on attack success degradation. Experiments demonstrate that with only 0.3% poisoning, BadCLIP++ achieves 99.99% attack success rate (ASR) in digital settings, surpassing baselines by 11.4 points. Across nineteen defenses, ASR remains above 99.90% with less than 0.8% drop in clean accuracy. The method further attains 65.03% success in physical attacks and shows robustness against watermark removal defenses.
[CV-37] B³-Seg: Camera-Free Training-Free 3DGS Segmentation via Analytic EIG and Beta-Bernoulli Bayesian Updates
【速读】:该论文旨在解决交互式3D高斯溅射(3D Gaussian Splatting, 3DGS)分割在影视与游戏制作中实时编辑预重建资产时面临的低延迟、无监督和无需重训练的挑战。现有方法依赖于预定义相机视角、真实标签或昂贵的再训练过程,难以满足实际应用需求。解决方案的关键在于提出 B³-Seg(Beta-Bernoulli Bayesian Segmentation for 3DGS),其将分割建模为贝叶斯序贯更新问题,并通过解析期望信息增益(Expected Information Gain, EIG)主动选择最优观测视角;该方法理论保证了EIG的自适应单调性和子模性,从而实现对最优视图采样策略的贪心近似((1−1/e)近似),在无需训练且自由视角条件下实现了端到端的快速分割,显著提升了信息效率与实用性。
链接: https://arxiv.org/abs/2602.17134
作者: Hiromichi Kamata,Samuel Arthur Munro,Fuminori Homma
机构: Sony Group Corporation(索尼集团); Pixomondo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Interactive 3D Gaussian Splatting (3DGS) segmentation is essential for real-time editing of pre-reconstructed assets in film and game production. However, existing methods rely on predefined camera viewpoints, ground-truth labels, or costly retraining, making them impractical for low-latency use. We propose B³-Seg (Beta-Bernoulli Bayesian Segmentation for 3DGS), a fast and theoretically grounded method for open-vocabulary 3DGS segmentation under camera-free and training-free conditions. Our approach reformulates segmentation as sequential Beta-Bernoulli Bayesian updates and actively selects the next view via analytic Expected Information Gain (EIG). This Bayesian formulation guarantees the adaptive monotonicity and submodularity of EIG, which produces a greedy (1-1/e) approximation to the optimal view sampling policy. Experiments on multiple datasets show that B³-Seg achieves results competitive with high-cost supervised methods while completing end-to-end segmentation within a few seconds. The results demonstrate that B³-Seg enables practical, interactive 3DGS segmentation with provable information efficiency.
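Beta-Bernoulli 模型下,“再观测一次能带来多少信息”有解析形式:EIG 即参数 θ 与下一次观测的互信息,可用 digamma 函数闭式算出。以下示意给出共轭后验更新与解析 EIG;把它接到视角打分上的方式从略:

```python
import numpy as np
from scipy.special import digamma

def beta_bernoulli_update(alpha, beta, y):
    """观测到二值标签 y(某视角下该高斯是否落入 2D 掩码)后的共轭后验更新。"""
    return alpha + y, beta + (1 - y)

def analytic_eig(alpha, beta):
    """I(theta; y) = H_b(mu) - E[H_b(theta)],Beta 后验下有闭式解。"""
    mu = alpha / (alpha + beta)
    h_pred = -(mu * np.log(mu) + (1 - mu) * np.log(1 - mu))   # 预测分布的熵
    e_h = (digamma(alpha + beta + 1)
           - mu * digamma(alpha + 1) - (1 - mu) * digamma(beta + 1))
    return h_pred - e_h

a, b = 1.0, 1.0                        # 每个高斯一个 Beta(1,1) 先验:属于/不属于目标
print(analytic_eig(a, b))              # 约 0.193 nat:先验下观测最有信息量
a, b = beta_bernoulli_update(a, b, 1)  # 某视角下该高斯被掩码命中
print(analytic_eig(a, b))              # 约 0.137 nat:随观测累积,EIG 逐步衰减
```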
[CV-38] 3D Scene Rendering with Multimodal Gaussian Splatting
【速读】:该论文旨在解决传统基于视觉的3D高斯泼溅(3D Gaussian Splatting, GS)重建方法在相机视角不足或视觉线索不可靠(如恶劣天气、低光照或部分遮挡)条件下初始化困难且渲染质量下降的问题。其解决方案的关键在于引入射频(RF)感知(如车载雷达)作为多模态信息源,通过稀疏RF深度测量高效生成高质量三维点云,用于初始化GS中的高斯函数,从而提升在复杂环境下的结构准确性和渲染保真度,实现更鲁棒的3D场景重建与渲染。
链接: https://arxiv.org/abs/2602.17124
作者: Chi-Shiang Gau,Konstantinos D. Polyzos,Athanasios Bacharis,Saketh Madhuvarasu,Tara Javidi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:3D scene reconstruction and rendering are core tasks in computer vision, with applications spanning industrial monitoring, robotics, and autonomous driving. Recent advances in 3D Gaussian Splatting (GS) and its variants have achieved impressive rendering fidelity while maintaining high computational and memory efficiency. However, conventional vision-based GS pipelines typically rely on a sufficient number of camera views to initialize the Gaussian primitives and train their parameters, typically incurring additional processing cost during initialization while falling short in conditions where visual cues are unreliable, such as adverse weather, low illumination, or partial occlusions. To cope with these challenges, and motivated by the robustness of radio-frequency (RF) signals to weather, lighting, and occlusions, we introduce a multimodal framework that integrates RF sensing, such as automotive radar, with GS-based rendering as a more efficient and robust alternative to vision-only GS rendering. The proposed approach enables efficient depth prediction from only sparse RF-based depth measurements, yielding a high-quality 3D point cloud for initializing Gaussian functions across diverse GS architectures. Numerical tests demonstrate the merits of judiciously incorporating RF sensing into GS pipelines, achieving high-fidelity 3D scene rendering driven by RF-informed structural accuracy.
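“由稀疏 RF 深度生成初始化点云”的几何核心是针孔相机反投影。以下示意把稀疏深度反投影为 3D 点并用作高斯初始化;内参与稀疏率均为假设值:

```python
import numpy as np

def backproject_sparse_depth(depth, K):
    """depth: (H, W) 稀疏深度图(无效处为 0); K: 3x3 相机内参。返回 (N, 3) 点云。"""
    v, u = np.nonzero(depth > 0)
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)

H, W = 64, 96
depth = np.zeros((H, W))
idx = np.random.rand(H, W) < 0.02             # 约 2% 像素有雷达深度(稀疏,假设)
depth[idx] = np.random.uniform(2.0, 30.0, idx.sum())
K = np.array([[80.0, 0, W / 2], [0, 80.0, H / 2], [0, 0, 1.0]])  # 假设内参
points = backproject_sparse_depth(depth, K)   # 可直接作为 3DGS 的高斯中心初始化
```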
[CV-39] Benchmarking the Effects of Object Pose Estimation and Reconstruction on Robotic Grasping Success
【速读】:该论文旨在解决当前3D重建模型的质量评估缺乏对下游机器人操作任务(如抓取)功能性影响的衡量标准的问题。现有方法虽能生成视觉和几何上高质量的网格,但其对6D物体位姿估计与抓取性能的实际贡献仍不明确。解决方案的关键在于构建一个大规模、基于物理的基准测试平台,通过在不同精度的3D网格上生成抓取姿态并将其执行于真实物体模型,量化重建误差、位姿估计误差与抓取鲁棒性之间的耦合效应。实验表明,重建伪影显著减少抓取候选数量,但在位姿估计准确的前提下对抓取成功率影响甚微,且空间位移误差(特别是平移误差)对对称物体的抓取成功具有主导作用。
链接: https://arxiv.org/abs/2602.17101
作者: Varun Burde,Pavel Burget,Torsten Sattler
机构: Czech Technical University in Prague (布拉格捷克技术大学); Czech Institute of Informatics, Robotics and Cybernetics (捷克信息学、机器人学与控制论研究所)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D reconstruction serves as the foundational layer for numerous robotic perception tasks, including 6D object pose estimation and grasp pose generation. Modern 3D reconstruction methods for objects can produce visually and geometrically impressive meshes from multi-view images, yet standard geometric evaluations do not reflect how reconstruction quality influences downstream tasks such as robotic manipulation performance. This paper addresses this gap by introducing a large-scale, physics-based benchmark that evaluates 6D pose estimators and 3D mesh models based on their functional efficacy in grasping. We analyze the impact of model fidelity by generating grasps on various reconstructed 3D meshes and executing them on the ground-truth model, simulating how grasp poses generated with an imperfect model affect interaction with the real object. This assesses the combined impact of pose error, grasp robustness, and geometric inaccuracies from 3D reconstruction. Our results show that reconstruction artifacts significantly decrease the number of grasp pose candidates but have a negligible effect on grasping performance given an accurately estimated pose. Our results also reveal that the relationship between grasp success and pose error is dominated by spatial error, and even a simple translation error provides insight into the success of the grasping pose of symmetric objects. This work provides insight into how perception systems relate to object manipulation using robots.
[CV-40] ComptonUNet: A Deep Learning Model for GRB Localization with Compton Cameras under Noisy and Low-Statistic Conditions
【速读】:该论文旨在解决弱伽马射线暴(Gamma-ray Burst, GRB)在低光子统计和强背景噪声条件下难以准确检测与定位的问题。现有机器学习模型虽能分别应对部分挑战,但在统计稳健性与噪声抑制之间难以取得平衡。解决方案的关键在于提出ComptonUNet——一种混合深度学习框架,通过联合处理原始数据与图像重建,在保持直接重建模型统计效率的同时,利用基于图像架构的去噪能力,显著提升在极端低统计和高背景环境下的GRB定位精度。
链接: https://arxiv.org/abs/2602.17085
作者: Shogo Sato,Kazuo Tanaka,Shojun Ogasawara,Kazuki Yamamoto,Kazuhiko Murasaki,Ryuichi Tanida,Jun Kataoka
机构: Waseda University (早稻田大学); NTT Corporation
类目: Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Methods for Astrophysics (astro-ph.IM)
备注: Accepted by ApJ
Abstract:Gamma-ray bursts (GRBs) are among the most energetic transient phenomena in the universe and serve as powerful probes for high-energy astrophysical processes. In particular, faint GRBs originating from a distant universe may provide unique insights into the early stages of star formation. However, detecting and localizing such weak sources remains challenging owing to low photon statistics and substantial background noise. Although recent machine learning models address individual aspects of these challenges, they often struggle to balance the trade-off between statistical robustness and noise suppression. Consequently, we propose ComptonUNet, a hybrid deep learning framework that jointly processes raw data and reconstructs images for robust GRB localization. ComptonUNet was designed to operate effectively under conditions of limited photon statistics and strong background contamination by combining the statistical efficiency of direct reconstruction models with the denoising capabilities of image-based architectures. We perform realistic simulations of GRB-like events embedded in background environments representative of low-Earth orbit missions to evaluate the performance of ComptonUNet. Our results demonstrate that ComptonUNet significantly outperforms existing approaches, achieving improved localization accuracy across a wide range of low-statistic and high-background scenarios.
[CV-41] Cross Pseudo Labeling For Weakly Supervised Video Anomaly Detection ICASSP2026
【速读】:该论文旨在解决弱监督视频异常检测(Weakly Supervised Video Anomaly Detection, WS-VAD)中难以同时实现高精度异常定位与异常类别识别的问题。现有方法通常在时间分辨率和语义理解之间存在权衡,导致无法有效区分不同类别的异常事件。其解决方案的关键在于提出一种双分支框架 CPL-VAD,通过交叉伪标签(Cross Pseudo Labeling)机制,在二分类异常检测分支(专注于片段级异常定位)与类别分类分支(利用视觉-语言对齐识别异常事件类别)之间实现信息互补与协同优化,从而在保持时间精度的同时提升语义判别能力,最终在 XD-Violence 和 UCF-Crime 数据集上实现了异常检测与类别识别的最先进性能。
链接: https://arxiv.org/abs/2602.17077
作者: Lee Dayeon,Kim Dongheyong,Park Chaewon,Woo Sungmin,Lee Sangyoun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICASSP 2026
Abstract:Weakly supervised video anomaly detection aims to detect anomalies and identify abnormal categories with only video-level labels. We propose CPL-VAD, a dual-branch framework with cross pseudo labeling. The binary anomaly detection branch focuses on snippet-level anomaly localization, while the category classification branch leverages vision-language alignment to recognize abnormal event categories. By exchanging pseudo labels, the two branches transfer complementary strengths, combining temporal precision with semantic discrimination. Experiments on XD-Violence and UCF-Crime demonstrate that CPL-VAD achieves state-of-the-art performance in both anomaly detection and abnormal category classification.
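“交叉伪标签”的机制可简化为:两条分支各自产生片段级分数,并把经阈值化、停止梯度的输出作为对方的监督信号。以下 PyTorch 草图仅表达这一交换思路,阈值与损失形式为假设:

```python
import torch
import torch.nn.functional as F

def cross_pseudo_label_loss(score_bin, score_cls, thr=0.7):
    """score_bin: (T,) 二分类分支的片段异常分; score_cls: (T,) 类别分支聚合出的异常分。
    双方互为对方生成伪标签(detach 阻断梯度),交换互补信息。"""
    pl_from_cls = (score_cls.detach() > thr).float()   # 类别分支 -> 监督二分类分支
    pl_from_bin = (score_bin.detach() > thr).float()   # 二分类分支 -> 监督类别分支
    loss_bin = F.binary_cross_entropy(score_bin, pl_from_cls)
    loss_cls = F.binary_cross_entropy(score_cls, pl_from_bin)
    return loss_bin + loss_cls

s_bin = torch.rand(32, requires_grad=True)
s_cls = torch.rand(32, requires_grad=True)
loss = cross_pseudo_label_loss(s_bin, s_cls)
loss.backward()
```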
[CV-42] Cholec80-port: A Geometrically Consistent Trocar Port Segmentation Dataset for Robust Surgical Scene Understanding
【速读】:该论文旨在解决腹腔镜手术视频中trocar port(Trocar端口)对基于几何的下游任务(如图像拼接、3D重建和视觉SLAM)造成的干扰问题,其核心挑战在于端口具有镜面反射和纹理特征,易吸引异常特征点并导致配准与跟踪不稳定。解决方案的关键在于提出一个高保真度的trocar port分割数据集Cholec80-port,并制定一套严格的标准化操作流程(SOP),明确要求标注端口袖套区域但排除中央开口(lumen),从而确保标注在几何上的一致性;同时,该方法还统一清洗了现有公开数据集以符合该标准,实验证明此类几何一致性的标注显著提升了跨数据集的鲁棒性,超越单纯依赖数据量带来的改进。
链接: https://arxiv.org/abs/2602.17060
作者: Shunsuke Kikuchi,Atsushi Kouno,Hiroki Matsuzaki
机构: Jmees Inc.(Jmees公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Trocar ports are camera-fixed, pseudo-static structures that can persistently occlude laparoscopic views and attract disproportionate feature points due to specular, textured surfaces. This makes ports particularly detrimental to geometry-based downstream pipelines such as image stitching, 3D reconstruction, and visual SLAM, where dynamic or non-anatomical outliers degrade alignment and tracking stability. Despite this practical importance, explicit port labels are rare in public surgical datasets, and existing annotations often violate geometric consistency by masking the central lumen (opening), even when anatomical regions are visible through it. We present Cholec80-port, a high-fidelity trocar port segmentation dataset derived from Cholec80, together with a rigorous standard operating procedure (SOP) that defines a port-sleeve mask excluding the central opening. We additionally cleanse and unify existing public datasets under the same SOP. Experiments demonstrate that geometrically consistent annotations substantially improve cross-dataset robustness beyond what dataset size alone provides.
[CV-43] StructCore: Structure-Aware Image-Level Scoring for Training-Free Unsupervised Anomaly Detection
【速读】:该论文旨在解决当前基于记忆库的无监督异常检测(Unsupervised Anomaly Detection, UAD)中,采用最大池化(Max Pooling)将异常得分图转换为图像级决策时存在的局限性——即仅依赖单一极端响应,忽略了异常证据在图像中的分布与结构信息,导致正常与异常得分易发生重叠。其解决方案的关键在于提出一种无需训练、面向结构感知的图像级评分方法 StructCore:首先计算异常得分图的低维结构描述符 φ(S),以捕捉分布和空间特征;随后利用训练集中正常样本估计对角马氏距离校准参数,实现图像级评分优化,从而有效利用了传统最大池化所忽略的结构签名,显著提升了图像级异常检测性能,在 MVTec AD 和 VisA 数据集上分别达到 99.6% 和 98.4% 的图像级 AUROC。
链接: https://arxiv.org/abs/2602.17048
作者: Joongwon Chae,Lihui Luo,Yang Liu,Runming Wang,Dongmei Yu,Zeming Liang,Xi Yuan,Dayan Zhang,Zhenglin Chen,Peiwu Qin,Ilmoon Chae
机构: Tsinghua University Shenzhen International Graduate School (清华大学深圳国际研究生院); Ratel Soft; Affiliated Fifth Hospital, Wenzhou Medical University (温州医科大学附属第五医院); Chinese Medicine Guangdong Laboratory
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Max pooling is the de facto standard for converting anomaly score maps into image-level decisions in memory-bank-based unsupervised anomaly detection (UAD). However, because it relies on a single extreme response, it discards most information about how anomaly evidence is distributed and structured across the image, often causing normal and anomalous scores to overlap. We propose StructCore, a training-free, structure-aware image-level scoring method that goes beyond max pooling. Given an anomaly score map, StructCore computes a low-dimensional structural descriptor φ(S) that captures distributional and spatial characteristics, and refines image-level scoring via a diagonal Mahalanobis calibration estimated from train-good samples, without modifying pixel-level localization. StructCore achieves image-level AUROC scores of 99.6% on MVTec AD and 98.4% on VisA, demonstrating robust image-level anomaly detection by exploiting structural signatures missed by max pooling.
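StructCore 的两步(结构描述子 φ(S) 与对角马氏校准)可写成如下草图;描述子的具体分量为假设示例,用于说明“比 max 池化保留更多分布与空间信息”的思路:

```python
import numpy as np

def struct_descriptor(score_map):
    """从异常得分图 S 提取低维结构描述子 φ(S)(分量选择为假设示例)。"""
    s = score_map.ravel()
    h, w = score_map.shape
    yy, xx = np.mgrid[0:h, 0:w]
    wgt = s / (s.sum() + 1e-8)
    cy, cx = (wgt * yy.ravel()).sum(), (wgt * xx.ravel()).sum()   # 得分质心
    spread = np.sqrt((wgt * ((yy.ravel() - cy) ** 2 + (xx.ravel() - cx) ** 2)).sum())
    return np.array([s.max(), s.mean(), np.quantile(s, 0.99), s.std(), spread])

def calibrate(train_good_maps):
    D = np.stack([struct_descriptor(m) for m in train_good_maps])
    return D.mean(0), D.var(0) + 1e-8                 # 仅用正常样本估计对角协方差

def image_score(score_map, mu, var):
    d = struct_descriptor(score_map) - mu
    return float(np.sqrt((d * d / var).sum()))        # 对角马氏距离作为图像级得分

good = [np.abs(np.random.randn(64, 64)) * 0.1 for _ in range(50)]
mu, var = calibrate(good)
test = np.abs(np.random.randn(64, 64)) * 0.1
test[20:30, 20:30] += 1.0                             # 注入一块异常响应
print(image_score(good[0], mu, var), image_score(test, mu, var))  # 后者显著更大
```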
[CV-44] Amber-Image: Efficient Compression of Large-Scale Diffusion Transformers
【速读】:该论文旨在解决扩散 Transformer (Diffusion Transformer, DiT) 架构在文本到图像 (Text-to-Image, T2I) 生成任务中面临的计算成本高昂和部署困难的问题。其核心解决方案是提出一种高效的模型压缩框架,无需从头训练即可将60层双流MMDiT架构的Qwen-Image模型转化为轻量化版本。关键创新在于:首先采用时间步敏感的深度剪枝策略保留重要层,并通过局部权重平均重初始化与逐层蒸馏及全参数微调优化;进而引入混合流结构,将深层双流转换为单一流(源自图像分支),并结合渐进式蒸馏与轻量微调进一步压缩模型。该方法在减少70%参数的同时,仅需少于2000 GPU小时即可完成从10B到6B模型的压缩与训练,显著优于传统训练范式。
链接: https://arxiv.org/abs/2602.17047
作者: Chaojie Yang,Tian Li,Yue Zhang,Jun Gao
机构: HelloGroup Inc.(HelloGroup公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion Transformer (DiT) architectures have significantly advanced Text-to-Image (T2I) generation but suffer from prohibitive computational costs and deployment barriers. To address these challenges, we propose an efficient compression framework that transforms the 60-layer dual-stream MMDiT-based Qwen-Image into lightweight models without training from scratch. Leveraging this framework, we introduce Amber-Image, a series of streamlined T2I models. We first derive Amber-Image-10B using a timestep-sensitive depth pruning strategy, where retained layers are reinitialized via local weight averaging and optimized through layer-wise distillation and full-parameter fine-tuning. Building on this, we develop Amber-Image-6B by introducing a hybrid-stream architecture that converts deep-layer dual streams into a single stream initialized from the image branch, further refined via progressive distillation and lightweight fine-tuning. Our approach reduces parameters by 70% and eliminates the need for large-scale data engineering. Notably, the entire compression and training pipeline-from the 10B to the 6B variant-requires fewer than 2,000 GPU hours, demonstrating exceptional cost-efficiency compared to training from scratch. Extensive evaluations on benchmarks like DPG-Bench and LongText-Bench show that Amber-Image achieves high-fidelity synthesis and superior text rendering, matching much larger models.
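“深度剪枝 + 局部权重平均重初始化”可理解为:丢弃部分层后,保留层的参数用其在原模型中邻近层权重的平均来初始化。以下草图在玩具层列表上演示这一操作;保留层索引假设已由时间步敏感度分析给出,邻域窗口大小亦为假设:

```python
import copy
import torch
import torch.nn as nn

def prune_with_local_averaging(layers, keep_idx, window=1):
    """layers: nn.ModuleList; keep_idx: 保留层在原模型中的索引(升序,假设已知)。
    保留层权重 <- 原模型中邻域 [i-window, i+window] 各层权重的平均。"""
    pruned = nn.ModuleList()
    for i in keep_idx:
        new_layer = copy.deepcopy(layers[i])
        nbh = range(max(0, i - window), min(len(layers), i + window + 1))
        sds = [layers[j].state_dict() for j in nbh]
        avg = {k: torch.stack([sd[k] for sd in sds]).mean(0) for k in sds[0]}
        new_layer.load_state_dict(avg)                 # 局部权重平均重初始化
        pruned.append(new_layer)
    return pruned

blocks = nn.ModuleList([nn.Linear(8, 8) for _ in range(10)])  # 玩具"Transformer 层"
kept = prune_with_local_averaging(blocks, keep_idx=[0, 2, 5, 8])
print(len(kept))  # 10 层 -> 4 层,之后再接逐层蒸馏与微调
```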
[CV-45] PartRAG : Retrieval-Augmented Part-Level 3D Generation and Editing
【Quick Read】: This paper tackles the difficulty of maintaining part-level consistency and editability in single-image 3D generation: existing methods lack sufficient priors for long-tail part geometries, struggle to preserve multi-view consistency, and offer little support for precise localized edits. The key to the solution is the PartRAG framework, built on two modules: a Hierarchical Contrastive Retrieval module that aligns dense image patches with 3D part latents at both part and object granularity and retrieves diverse, physically plausible exemplars from an external database of 1,236 part-annotated assets to inject into the denoising process; and a masked part-level editor operating in a shared canonical space that enables part swaps, attribute refinements, and compositional updates without regenerating the whole object, while preserving non-target parts and multi-view consistency. The method clearly improves reconstruction quality on Objaverse and related datasets (lower Chamfer Distance, higher F-Score) and supports interactive editing (5-8 s).
Link: https://arxiv.org/abs/2602.17033
Authors: Peize Li,Zeyu Zhang,Hao Tang
Institutions: King's College London; Peking University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Single-image 3D generation with part-level structure remains challenging: learned priors struggle to cover the long tail of part geometries and maintain multi-view consistency, and existing systems provide limited support for precise, localized edits. We present PartRAG, a retrieval-augmented framework that integrates an external part database with a diffusion transformer to couple generation with an editable representation. To overcome the first challenge, we introduce a Hierarchical Contrastive Retrieval module that aligns dense image patches with 3D part latents at both part and object granularity, retrieving from a curated bank of 1,236 part-annotated assets to inject diverse, physically plausible exemplars into denoising. To overcome the second challenge, we add a masked, part-level editor that operates in a shared canonical space, enabling swaps, attribute refinements, and compositional updates without regenerating the whole object while preserving non-target parts and multi-view consistency. PartRAG achieves competitive results on Objaverse, ShapeNet, and ABO, reducing Chamfer Distance from 0.1726 to 0.1528 and raising F-Score from 0.7472 to 0.844 on Objaverse, with inference of 38s and interactive edits in 5-8s. Qualitatively, PartRAG produces sharper part boundaries, better thin-structure fidelity, and robust behavior on articulated objects. Code: this https URL. Website: this https URL.
[CV-46] Patch-Based Spatial Authorship Attribution in Human-Robot Collaborative Paintings
【Quick Read】: This paper addresses spatial authorship attribution in human-robot collaborative painting, in particular how to accurately separate human and robot contributions in mixed works when clearly labeled data is scarce. The key to the solution is a patch-based framework: image data is acquired with commodity flatbed scanners, and a leave-one-painting-out cross-validation strategy yields highly accurate local attribution (88.8% patch-level accuracy). Conditional Shannon entropy is introduced to quantify stylistic overlap, effectively distinguishing the uncertainty of pure versus hybrid regions, so that traces of mixed authorship can be detected in collaborative works that lack explicit ground-truth labels.
Link: https://arxiv.org/abs/2602.17030
Authors: Eric Chen,Patricia Alves-Oliveira
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:As agentic AI becomes increasingly involved in creative production, documenting authorship has become critical for artists, collectors, and legal contexts. We present a patch-based framework for spatial authorship attribution within human-robot collaborative painting practice, demonstrated through a forensic case study of one human artist and one robotic system across 15 abstract paintings. Using commodity flatbed scanners and leave-one-painting-out cross-validation, the approach achieves 88.8% patch-level accuracy (86.7% painting-level via majority vote), outperforming texture-based and pretrained-feature baselines (68.0%-84.7%). For collaborative artworks, where ground truth is inherently ambiguous, we use conditional Shannon entropy to quantify stylistic overlap; manually annotated hybrid regions exhibit 64% higher uncertainty than pure paintings (p=0.003), suggesting the model detects mixed authorship rather than classification failure. The trained model is specific to this human-robot pair but provides a methodological grounding for sample-efficient attribution in data-scarce human-AI creative workflows that, in the future, has the potential to extend authorship attribution to any human-robot collaborative painting.
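A minimal sketch of the two aggregation steps named in the abstract, majority voting for painting-level labels and Shannon entropy for uncertainty, assuming a binary human-vs-robot patch classifier whose outputs are probabilities; the thresholds and toy numbers are illustrative, not the authors' values.

```python
import numpy as np

def painting_attribution(patch_probs: np.ndarray) -> int:
    """Aggregate per-patch P(robot) into a painting-level label via
    majority vote, as in the leave-one-painting-out evaluation."""
    votes = (patch_probs > 0.5).astype(int)
    return int(votes.mean() > 0.5)

def mean_conditional_entropy(patch_probs: np.ndarray) -> float:
    """Average Shannon entropy (bits) of patch predictions; hybrid
    regions should show higher uncertainty than pure paintings."""
    p = np.clip(patch_probs, 1e-9, 1 - 1e-9)
    h = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    return float(h.mean())

pure = np.array([0.95, 0.90, 0.97, 0.88])    # confidently one author
hybrid = np.array([0.55, 0.45, 0.60, 0.50])  # ambiguous, mixed authorship
print(painting_attribution(pure), mean_conditional_entropy(pure))
print(painting_attribution(hybrid), mean_conditional_entropy(hybrid))
```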
[CV-47] DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers
【Quick Read】: This paper targets the computational inefficiency of Diffusion Transformers (DiTs) in image and video generation, which stems from the fixed tokenization strategy of using constant-sized patches throughout denoising regardless of content complexity. The key to the solution is a dynamic tokenization strategy that adapts patch sizes at inference time to content complexity and the denoising timestep: early denoising steps need only coarse patches to model global structure, while later steps use finer patches to refine local details. This mechanism substantially reduces compute while preserving generation quality and prompt adherence.
Link: https://arxiv.org/abs/2602.16968
Authors: Dahye Kim,Deepti Ghadiyaram,Raghudeep Gadde
Institutions: Boston University; Amazon
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Diffusion Transformers (DiTs) have achieved state-of-the-art performance in image and video generation, but their success comes at the cost of heavy computation. This inefficiency is largely due to the fixed tokenization process, which uses constant-sized patches throughout the entire denoising phase, regardless of the content’s complexity. We propose dynamic tokenization, an efficient test-time strategy that varies patch sizes based on content complexity and the denoising timestep. Our key insight is that early timesteps only require coarser patches to model global structure, while later iterations demand finer (smaller-sized) patches to refine local details. During inference, our method dynamically reallocates patch sizes across denoising steps for image and video generation and substantially reduces cost while preserving perceptual generation quality. Extensive experiments demonstrate the effectiveness of our approach: it achieves up to 3.52\times and 3.2\times speedup on this http URL and Wan 2.1 , respectively, without compromising the generation quality and prompt adherence.
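A toy sketch of the coarse-to-fine idea behind dynamic tokenization: the paper does not publish its schedule, so the breakpoints and patch sizes below are assumptions purely to show how token count (and hence attention cost) shrinks early in denoising.

```python
def patch_size_schedule(step: int, num_steps: int,
                        sizes=(8, 4, 2)) -> int:
    """Coarse-to-fine patch sizes over denoising steps: early steps use
    large patches (global structure), late steps small patches (local
    detail). Breakpoints and sizes are illustrative assumptions."""
    frac = step / max(1, num_steps - 1)
    if frac < 0.4:          # first 40% of steps: coarse tokens
        return sizes[0]
    if frac < 0.8:          # middle steps: medium resolution
        return sizes[1]
    return sizes[2]         # final steps: finest patches

num_steps = 50
for t in (0, 10, 25, 45):
    p = patch_size_schedule(t, num_steps)
    tokens = (64 // p) ** 2        # tokens for a 64x64 latent
    print(f"step {t:>2}: patch {p}x{p} -> {tokens} tokens")
```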
[CV-48] HS-3D-NeRF: 3D Surface and Hyperspectral Reconstruction From Stationary Hyperspectral Images Using Multi-Channel NeRFs
【Quick Read】: This paper addresses the challenge of fusing high-throughput, multimodal analysis (hyperspectral imaging and 3D geometric reconstruction) for postharvest agricultural inspection, where conventional methods rely on complex hardware setups that are hard to integrate into automated phenotyping systems. The core solution is HSI-SC-NeRF, a stationary-camera multi-channel NeRF framework. Its key innovations are: rotating the sample inside a Teflon imaging chamber and estimating poses with ArUco markers, converting fixed-viewpoint data into equivalent multi-view input; and a staged training strategy (separating geometric initialization from radiometric refinement) with a composite spectral loss, achieving accurate spatial reconstruction and spectral fidelity across the visible-to-near-infrared range and thereby supporting integration into automated agricultural workflows.
Link: https://arxiv.org/abs/2602.16950
Authors: Kibon Ku,Talukder Z. Jubery,Adarsh Krishnamurthy,Baskar Ganapathysubramanian
Institutions: Iowa State University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 14 figures, 3 tables
Abstract:Advances in hyperspectral imaging (HSI) and 3D reconstruction have enabled accurate, high-throughput characterization of agricultural produce quality and plant phenotypes, both essential for advancing agricultural sustainability and breeding programs. HSI captures detailed biochemical features of produce, while 3D geometric data substantially improves morphological analysis. However, integrating these two modalities at scale remains challenging, as conventional approaches involve complex hardware setups incompatible with automated phenotyping systems. Recent advances in neural radiance fields (NeRF) offer computationally efficient 3D reconstruction but typically require moving-camera setups, limiting throughput and reproducibility in standard indoor agricultural environments. To address these challenges, we introduce HSI-SC-NeRF, a stationary-camera multi-channel NeRF framework for high-throughput hyperspectral 3D reconstruction targeting postharvest inspection of agricultural produce. Multi-view hyperspectral data is captured using a stationary camera while the object rotates within a custom-built Teflon imaging chamber providing diffuse, uniform illumination. Object poses are estimated via ArUco calibration markers and transformed to the camera frame of reference through simulated pose transformations, enabling standard NeRF training on stationary-camera data. A multi-channel NeRF formulation optimizes reconstruction across all hyperspectral bands jointly using a composite spectral loss, supported by a two-stage training protocol that decouples geometric initialization from radiometric refinement. Experiments on three agricultural produce samples demonstrate high spatial reconstruction accuracy and strong spectral fidelity across the visible and near-infrared spectrum, confirming the suitability of HSI-SC-NeRF for integration into automated agricultural workflows.
[CV-49] Xray-Visual Models: Scaling Vision models on Industry Scale Data
【Quick Read】: This paper addresses the trade-off between model performance, generalization, and computational efficiency in large-scale image and video understanding, especially the label noise and limited semantic diversity encountered when training on industry-scale social platform data such as Facebook and Instagram. The key to the solution is Xray-Visual, a unified vision model architecture with a three-stage training strategy: self-supervised MAE (Masked Autoencoders) pre-training to learn general visual representations, semi-supervised hashtag classification to strengthen video understanding, and CLIP-style contrastive learning to align the image and text modalities. It further introduces efficient token reorganization (EViT) to improve the Vision Transformer's compute efficiency, and adopts a large language model (LLM) as the text encoder (LLM2CLIP), markedly boosting cross-modal retrieval and real-world generalization.
Link: https://arxiv.org/abs/2602.16918
Authors: Shlok Mishra,Tsung-Yu Lin,Linda Wang,Hongli Xu,Yimin Liu,Michael Hsu,Chaitanya Ahuja,Hao Yuan,Jianpeng Cheng,Hong-You Chen,Haoyuan Xu,Chao Li,Abhijeet Awasthi,Jihye Moon,Don Husa,Michael Ge,Sumedha Singla,Arkabandhu Chowdhury,Phong Dingh,Satya Narayan Shukla,Yonghuan Yang,David Jacobs,Qi Guo,Jun Xiao,Xiangjun Fan,Aashu Singh
Institutions: Meta
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:We present Xray-Visual, a unified vision model architecture for large-scale image and video understanding trained on industry-scale social media data. Our model leverages over 15 billion curated image-text pairs and 10 billion video-hashtag pairs from Facebook and Instagram, employing robust data curation pipelines that incorporate balancing and noise suppression strategies to maximize semantic diversity while minimizing label noise. We introduce a three-stage training pipeline that combines self-supervised MAE, semi-supervised hashtag classification, and CLIP-style contrastive learning to jointly optimize image and video modalities. Our architecture builds on a Vision Transformer backbone enhanced with efficient token reorganization (EViT) for improved computational efficiency. Extensive experiments demonstrate that Xray-Visual achieves state-of-the-art performance across diverse benchmarks, including ImageNet for image classification, Kinetics and HMDB51 for video understanding, and MSCOCO for cross-modal retrieval. The model exhibits strong robustness to domain shift and adversarial perturbations. We further demonstrate that integrating large language models as text encoders (LLM2CLIP) significantly enhances retrieval performance and generalization capabilities, particularly in real-world environments. Xray-Visual establishes new benchmarks for scalable, multimodal vision models, while maintaining superior accuracy and computational efficiency.
[CV-50] SemCovNet: Towards Fair and Semantic Coverage-Aware Learning for Underrepresented Visual Concepts
【Quick Read】: This paper targets Semantic Coverage Imbalance (SCI), a long-overlooked bias in vision models: under long-tailed semantic representations, models underlearn rare yet meaningful semantic concepts, hurting reasoning ability and fairness. The key to the solution is the Semantic Coverage-Aware Network (SemCovNet), whose core components are: a Semantic Descriptor Map (SDM) that explicitly models semantic representations; a Descriptor Attention Modulation (DAM) module that dynamically weights visual and concept features; and a Descriptor-Visual Alignment (DVA) loss that aligns visual features with descriptor semantics. Semantic fairness is quantified with a Coverage Disparity Index (CDI), making semantic bias measurable and correctable and substantially improving reliability and fairness across multiple datasets.
Link: https://arxiv.org/abs/2602.16917
Authors: Sakib Ahammed,Xia Cui,Xinqi Fan,Wenqi Lu,Moi Hoon Yap
Institutions: Manchester Metropolitan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Modern vision models increasingly rely on rich semantic representations that extend beyond class labels to include descriptive concepts and contextual attributes. However, existing datasets exhibit Semantic Coverage Imbalance (SCI), a previously overlooked bias arising from the long-tailed semantic representations. Unlike class imbalance, SCI occurs at the semantic level, affecting how models learn and reason about rare yet meaningful semantics. To mitigate SCI, we propose Semantic Coverage-Aware Network (SemCovNet), a novel model that explicitly learns to correct semantic coverage disparities. SemCovNet integrates a Semantic Descriptor Map (SDM) for learning semantic representations, a Descriptor Attention Modulation (DAM) module that dynamically weights visual and concept features, and a Descriptor-Visual Alignment (DVA) loss that aligns visual features with descriptor semantics. We quantify semantic fairness using a Coverage Disparity Index (CDI), which measures the alignment between coverage and error. Extensive experiments across multiple datasets demonstrate that SemCovNet enhances model reliability and substantially reduces CDI, achieving fairer and more equitable performance. This work establishes SCI as a measurable and correctable bias, providing a foundation for advancing semantic fairness and interpretable vision learning.
[CV-51] StereoAdapter-2: Globally Structure-Consistent Underwater Stereo Depth Estimation
【Quick Read】: This paper addresses the severe domain shift in underwater stereo depth estimation caused by wavelength-dependent light attenuation, scattering, and refraction. Existing methods adapt monocular foundation models with GRU-based iterative refinement, but the sequential gating and local convolutional kernels limit long-range disparity propagation, particularly in large-disparity and textureless regions. The key to the solution is the StereoAdapter-2 framework, which replaces the conventional ConvGRU updater with a novel ConvSS2D operator based on selective state space models (SSMs). Its four-directional scanning strategy naturally aligns with epipolar geometry while capturing vertical structural consistency, enabling long-range spatial propagation within a single update step at linear complexity and delivering clear gains in both accuracy and efficiency.
Link: https://arxiv.org/abs/2602.16915
Authors: Zeyu Ren,Xiang Li,Yiran Wang,Zeyu Zhang,Hao Tang
Institutions: The University of Melbourne; Peking University; Australian Centre for Robotics
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Stereo depth estimation is fundamental to underwater robotic perception, yet suffers from severe domain shifts caused by wavelength-dependent light attenuation, scattering, and refraction. Recent approaches leverage monocular foundation models with GRU-based iterative refinement for underwater adaptation; however, the sequential gating and local convolutional kernels in GRUs necessitate multiple iterations for long-range disparity propagation, limiting performance in large-disparity and textureless underwater regions. In this paper, we propose StereoAdapter-2, which replaces the conventional ConvGRU updater with a novel ConvSS2D operator based on selective state space models. The proposed operator employs a four-directional scanning strategy that naturally aligns with epipolar geometry while capturing vertical structural consistency, enabling efficient long-range spatial propagation within a single update step at linear computational complexity. Furthermore, we construct UW-StereoDepth-80K, a large-scale synthetic underwater stereo dataset featuring diverse baselines, attenuation coefficients, and scattering parameters through a two-stage generative pipeline combining semantic-aware style transfer and geometry-consistent novel view synthesis. Combined with dynamic LoRA adaptation inherited from StereoAdapter, our framework achieves state-of-the-art zero-shot performance on underwater benchmarks with 17% improvement on TartanAir-UW and 7.2% improvement on SQUID, and real-world validation on the BlueROV2 platform demonstrates the robustness of our approach. Code: this https URL. Website: this https URL.
[CV-52] MALLVI: a multi agent framework for integrated generalized robotics manipulation
【Quick Read】: This paper addresses the fragility of current LLM-based task planning for robotic manipulation: existing methods rely on specialized models, fine-tuning, or prompt tuning, and usually execute open-loop without robust feedback on environmental changes. The key to the solution is MALLVi, a Multi-Agent Large Language and Vision framework for closed-loop, feedback-driven manipulation. It coordinates specialized agents, a Decomposer, Localizer, Thinker, and Reflector, that handle perception, localization, reasoning, and high-level planning, with an optional Descriptor agent providing visual memory of the initial state. The Reflector supports targeted error detection and recovery by reactivating only the relevant agents rather than resetting the whole system, markedly improving generalization and success rates in zero-shot scenarios.
Link: https://arxiv.org/abs/2602.16898
Authors: Iman Ahmadi,Mehrshad Taji,Arad Mahdinezhad Kashani,AmirHossein Jadidi,Saina Kashani,Babak Khalaj
Institutions: Sharif University of Technology
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine tuning, or prompt tuning, and often operate in an open loop manner without robust environmental feedback, making them fragile in dynamic environments. We present MALLVi, a Multi Agent Large Language and Vision framework that enables closed loop feedback driven robotic manipulation. Given a natural language instruction and an image of the environment, MALLVi generates executable atomic actions for a robot manipulator. After action execution, a Vision Language Model (VLM) evaluates environmental feedback and decides whether to repeat the process or proceed to the next step. Rather than using a single model, MALLVi coordinates specialized agents, Decomposer, Localizer, Thinker, and Reflector, to manage perception, localization, reasoning, and high level planning. An optional Descriptor agent provides visual memory of the initial state. The Reflector supports targeted error detection and recovery by reactivating only relevant agents, avoiding full system resets. Experiments in simulation and real world settings show that iterative closed loop multi agent coordination improves generalization and increases success rates in zero shot manipulation tasks. Code available at this https URL.
[CV-53] DODO: Discrete OCR Diffusion Models
【Quick Read】: This paper addresses the slow inference of current vision-language model (VLM) based optical character recognition (OCR), which stems from autoregressive decoding: generating one token at a time is especially costly for long documents. The key to the solution is DODO, the first VLM to exploit block discrete diffusion. By decomposing generation into blocks, it mitigates the synchronization errors of global diffusion, achieving up to 3x faster inference than autoregressive baselines while retaining near state-of-the-art accuracy.
Link: https://arxiv.org/abs/2602.16872
Authors: Sean Man,Roy Ganz,Roi Ronen,Shahar Tsiper,Shai Mazor,Niv Nayman
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Optical Character Recognition (OCR) is a fundamental task for digitizing information, serving as a critical bridge between visual data and textual understanding. While modern Vision-Language Models (VLM) have achieved high accuracy in this domain, they predominantly rely on autoregressive decoding, which becomes computationally expensive and slow for long documents as it requires a sequential forward pass for every generated token. We identify a key opportunity to overcome this bottleneck: unlike open-ended generation, OCR is a highly deterministic task where the visual input strictly dictates a unique output sequence, theoretically enabling efficient, parallel decoding via diffusion models. However, we show that existing masked diffusion models fail to harness this potential; those introduce structural instabilities that are benign in flexible tasks, like captioning, but catastrophic for the rigid, exact-match requirements of OCR. To bridge this gap, we introduce DODO, the first VLM to utilize block discrete diffusion and unlock its speedup potential for OCR. By decomposing generation into blocks, DODO mitigates the synchronization errors of global diffusion. Empirically, our method achieves near state-of-the-art accuracy while enabling up to 3x faster inference compared to autoregressive baselines.
[CV-54] Analytic Score Optimization for Multi Dimension Video Quality Assessment
【Quick Read】: This paper addresses the limitation of traditional video quality assessment (VQA), which relies on a single mean opinion score (MOS) and cannot reflect perceptual quality across multiple dimensions. The authors build UltraVQA, a large-scale multi-dimensional VQA dataset of user-generated content (UGC) with fine-grained annotations along five key dimensions: motion quality, motion amplitude, aesthetic quality, content quality, and clarity, together with GPT-generated explanatory rationales for interpretability. The key to the solution is Analytic Score Optimization (ASO), a theoretically grounded post-training objective that frames quality assessment as a regularized decision-making process, yielding a closed-form solution that naturally captures the ordinal nature of human ratings, ensuring alignment with human ranking preferences and notably improving discrete quality-score prediction (lower mean absolute error, MAE).
Link: https://arxiv.org/abs/2602.16856
Authors: Boda Lin,Yongjie Zhu,Wenyu Qin,Meng Wang,Pengfei Wan
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages
Abstract:Video Quality Assessment (VQA) is evolving beyond single-number mean opinion score toward richer, multi-faceted evaluations of video content. In this paper, we present a large-scale multi-dimensional VQA dataset UltraVQA that encompasses diverse User-Generated Content (UGC) annotated across five key quality dimensions: Motion Quality, Motion Amplitude, Aesthetic Quality, Content Quality, and Clarity Quality. Each video in our dataset is scored by over 3 human raters on these dimensions, with fine-grained sub-attribute labels, and accompanied by an explanatory rationale generated by GPT based on the collective human judgments. To better leverage these rich annotations and improve discrete quality score assessment, we introduce Analytic Score Optimization (ASO), a theoretically grounded post-training objective derived for multi-dimensional VQA. By reframing quality assessment as a regularized decision-making process, we obtain a closed-form solution that naturally captures the ordinal nature of human ratings, ensuring alignment with human ranking preferences. In experiments, our method outperforms most baselines including closed-source APIs and open-source models, while also reducing mean absolute error (MAE) in quality prediction. Our work highlights the importance of multi-dimensional, interpretable annotations and reinforcement-based alignment in advancing video quality assessment.
[CV-55] Three-dimensional Damage Visualization of Civil Structures via Gaussian Splatting-enabled Digital Twins
【Quick Read】: This paper addresses the difficulty of achieving precise three-dimensional (3D) damage visualization on digital twins from traditional two-dimensional (2D) damage identification, particularly for post-earthquake damage assessment of infrastructure such as bridges. The key to the solution is a Gaussian Splatting (GS) based 3D reconstruction method that represents the radiance field with discrete anisotropic 3D Gaussians; compared with continuous implicit models such as NeRF, it reconstructs more efficiently, renders with higher quality, and handles featureless regions well. A multi-scale reconstruction strategy balances computational efficiency against the preservation of damage detail, and the digital twin can be updated dynamically as damage evolves over time, improving the realism and practicality of damage visualization.
Link: https://arxiv.org/abs/2602.16713
Authors: Shuo Wang,Shuo Wang,Xin Nie,Yasutaka Narazaki,Thomas Matiki,Billie F. Spencer Jr
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advancements in civil infrastructure inspections underscore the need for precise three-dimensional (3D) damage visualization on digital twins, transcending traditional 2D image-based damage identifications. Compared to conventional photogrammetric 3D reconstruction techniques, modern approaches such as Neural Radiance Field (NeRF) and Gaussian Splatting (GS) excel in scene representation, rendering quality, and handling featureless regions. Among them, GS stands out for its efficiency, leveraging discrete anisotropic 3D Gaussians to represent radiance fields, unlike NeRF’s continuous implicit model. This study introduces a GS-enabled digital twin method tailored for effective 3D damage visualization. The method’s key contributions include: 1) utilizing GS-based 3D reconstruction to visualize 2D damage segmentation results while reducing segmentation errors; 2) developing a multi-scale reconstruction strategy to balance efficiency and damage detail; 3) enabling digital twin updates as damage evolves over time. Demonstrated on an open-source synthetic dataset for post-earthquake inspections, the proposed approach offers a promising solution for comprehensive 3D damage visualization in civil infrastructure digital twins.
[CV-56] Probability-Invariant Random Walk Learning on Gyral Folding-Based Cortical Similarity Networks for Alzheimers and Lewy Body Dementia Diagnosis
【Quick Read】: This paper addresses the differential diagnosis of Alzheimer's disease (AD) and Lewy body dementia (LBD), whose clinical features overlap heavily, focusing on the node-alignment and topological-inconsistency problems that individual anatomical heterogeneity poses for brain-network-based neuroimaging analysis. Atlas-based approaches struggle to capture individualized anatomical detail, while gyral folding-based networks, though biologically grounded, suffer from inconsistent node correspondence and irregular network sizes across individuals, violating the fixed-topology and node-alignment assumptions of existing graph learning methods. The key to the solution is a probability-invariant random-walk framework: cortical similarity networks are built from local morphometric features, network structure is represented by distributions of anonymized random walks, and an anatomy-aware encoding preserves permutation invariance, so individualized networks can be classified without explicit node alignment. On clinical AD and LBD cohorts it clearly outperforms existing gyral folding and atlas-based methods, demonstrating robustness and potential for dementia diagnosis.
Link: https://arxiv.org/abs/2602.17557
Authors: Minheng Chen,Jing Zhang,Tong Chen,Chao Cao,Tianming Liu,Li Su,Dajiang Zhu
Institutions: Unknown
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Alzheimer’s disease (AD) and Lewy body dementia (LBD) present overlapping clinical features yet require distinct diagnostic strategies. While neuroimaging-based brain network analysis is promising, atlas-based representations may obscure individualized anatomy. Gyral folding-based networks using three-hinge gyri provide a biologically grounded alternative, but inter-individual variability in cortical folding results in inconsistent landmark correspondence and highly irregular network sizes, violating the fixed-topology and node-alignment assumptions of most existing graph learning methods, particularly in clinical datasets where pathological changes further amplify anatomical heterogeneity. We therefore propose a probability-invariant random-walk-based framework that classifies individualized gyral folding networks without explicit node alignment. Cortical similarity networks are built from local morphometric features and represented by distributions of anonymized random walks, with an anatomy-aware encoding that preserves permutation invariance. Experiments on a large clinical cohort of AD and LBD subjects show consistent improvements over existing gyral folding and atlas-based models, demonstrating robustness and potential for dementia diagnosis.
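The anonymized-random-walk representation is what makes alignment-free comparison possible, so here is a minimal sketch of that building block under simple assumptions (unweighted graphs as adjacency dicts, uniform walks); the paper's anatomy-aware encoding and similarity-network construction are not reproduced here.

```python
import random
from collections import Counter

def anonymize(walk):
    """Map node ids to first-occurrence indices, e.g. [7,3,7,9] -> (0,1,0,2),
    so the representation ignores node labels (permutation invariant)."""
    seen, out = {}, []
    for v in walk:
        seen.setdefault(v, len(seen))
        out.append(seen[v])
    return tuple(out)

def anonymous_walk_distribution(adj, length=4, n_walks=2000, seed=0):
    """Empirical distribution over anonymized random walks of a fixed
    length; usable as an alignment-free network signature."""
    rng = random.Random(seed)
    nodes = list(adj)
    counts = Counter()
    for _ in range(n_walks):
        v = rng.choice(nodes)
        walk = [v]
        for _ in range(length - 1):
            v = rng.choice(adj[v])
            walk.append(v)
        counts[anonymize(walk)] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Two isomorphic triangles get identical signatures despite having
# different node ids and no node correspondence.
g1 = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
g2 = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
print(anonymous_walk_distribution(g1) == anonymous_walk_distribution(g2))
```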
[CV-57] Neural Implicit Representations for 3D Synthetic Aperture Radar Imaging
【Quick Read】: This paper addresses artifacts in reconstructed 3D synthetic aperture radar (SAR) imagery caused by incomplete sampling of the Fourier domain. Traditional approaches regularize the inverse problem with simple priors such as image-domain sparsity, which struggle to recover the fine structure of complex scenes. The key to the solution is to model the surface scattering that dominates SAR returns with neural structures: an implicit surface representation, encoded as a signed distance function, is learned from sparse scattering data, and the ill-posed problem of estimating a smooth surface from a sparse, noisy point cloud is regularized by constraining points sampled from the implicit surface during training. The method shows strong scattering-modeling ability on measured and simulated data of single vehicles and large scenes containing many vehicles, offering a new path toward high-fidelity 3D SAR imaging.
Link: https://arxiv.org/abs/2602.17556
Authors: Nithin Sugavanam,Emre Ertin
Institutions: Unknown
Subjects: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Synthetic aperture radar (SAR) is a tomographic sensor that measures 2D slices of the 3D spatial Fourier transform of the scene. In many operational scenarios, the measured set of 2D slices does not fill the 3D space in the Fourier domain, resulting in significant artifacts in the reconstructed imagery. Traditionally, simple priors, such as sparsity in the image domain, are used to regularize the inverse problem. In this paper, we review our recent work that achieves state-of-the-art results in 3D SAR imaging employing neural structures to model the surface scattering that dominates SAR returns. These neural structures encode the surface of the objects in the form of a signed distance function learned from the sparse scattering data. Since estimating a smooth surface from a sparse and noisy point cloud is an ill-posed problem, we regularize the surface estimation by sampling points from the implicit surface representation during the training step. We demonstrate the model’s ability to represent target scattering using measured and simulated data from single vehicles and a larger scene with a large number of vehicles. We conclude with future research directions calling for methods to learn complex-valued neural representations to enable synthesizing new collections from the volumetric neural implicit representation.
Artificial Intelligence
[AI-0] MARS: Margin-Aware Reward-Modeling with Self-Refinement
【Quick Read】: This paper addresses the reliance of current reward modeling on costly and limited human-labeled preference data, seeking a more efficient training strategy that improves the robustness and accuracy of reward models. The key to the solution is MARS (Margin-aware Augmentation and Sampling Strategy), which identifies and concentrates on low-confidence (low-margin) preference pairs, the ambiguous regions where the reward model is most uncertain, and applies adaptive data augmentation and sampling there to iteratively refine the training distribution. Theoretically this increases the average curvature of the loss function, improving information gain and conditioning; empirically it yields more stable reward modeling than uniform augmentation.
Link: https://arxiv.org/abs/2602.17658
Authors: Payel Bhattacharjee,Osvaldo Simeone,Ravi Tandon
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments:
Abstract:Reward modeling is a core component of modern alignment pipelines including RLHF and RLAIF, underpinning policy optimization methods including PPO and TRPO. However, training reliable reward models relies heavily on human-labeled preference data, which is costly and limited, motivating the use of data augmentation. Existing augmentation approaches typically operate at the representation or semantic level and remain agnostic to the reward model's estimation difficulty. In this paper, we propose MARS, an adaptive, margin-aware augmentation and sampling strategy that explicitly targets ambiguous and failure modes of the reward model. Our proposed framework, MARS, concentrates augmentation on low-margin (ambiguous) preference pairs where the reward model is most uncertain, and iteratively refines the training distribution via hard-sample augmentation. We provide theoretical guarantees showing that this strategy increases the average curvature of the loss function, thereby enhancing information and improving conditioning, along with empirical results demonstrating consistent gains over uniform augmentation for robust reward modeling.
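A small PyTorch sketch of the margin-based selection step at the core of the approach: rank preference pairs by the reward margin and keep the lowest-margin fraction for augmentation. The toy reward model, feature-vector inputs, and the 25% fraction are assumptions for illustration; the paper operates on an LLM-based reward model.

```python
import torch

def low_margin_indices(rm, chosen, rejected, frac=0.25):
    """Rank preference pairs by reward margin r(chosen) - r(rejected)
    and return the most ambiguous (lowest |margin|) fraction, which a
    MARS-style loop would target for augmentation and resampling."""
    with torch.no_grad():
        margin = rm(chosen) - rm(rejected)        # shape: [N]
    k = max(1, int(frac * margin.numel()))
    return torch.argsort(margin.abs())[:k]        # smallest |margin| first

# Toy reward model over feature vectors (stand-in for an LLM-based RM).
rm = torch.nn.Sequential(torch.nn.Linear(8, 1), torch.nn.Flatten(0))
chosen, rejected = torch.randn(100, 8), torch.randn(100, 8)
hard = low_margin_indices(rm, chosen, rejected)
print(hard.shape)   # indices of ambiguous pairs to augment next round
```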
[AI-1] FAMOSE: A ReAct Approach to Automated Feature Discovery
【Quick Read】: This paper addresses feature engineering, a critical bottleneck in machine learning, especially for tabular data, where identifying optimal features from an exponentially large feature space traditionally demands substantial domain expertise. The key to the solution is FAMOSE (Feature AugMentation and Optimal Selection agEnt), which, to the authors' knowledge, is the first application of a ReAct (Reasoning + Acting) agent framework to automated feature engineering. The core idea is that ReAct lets the LLM record the history of iterative feature discovery and evaluation in its context, which acts like a few-shot prompt that guides the model toward better and more inventive feature combinations, clearly improving both regression and classification performance.
Link: https://arxiv.org/abs/2602.17641
Authors: Keith Burghardt,Jienan Liu,Sadman Sakib,Yuning Hao,Bo Li
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 23 pages, 6 figures
Abstract:Feature engineering remains a critical yet challenging bottleneck in machine learning, particularly for tabular data, as identifying optimal features from an exponentially large feature space traditionally demands substantial domain expertise. To address this challenge, we introduce FAMOSE (Feature AugMentation and Optimal Selection agEnt), a novel framework that leverages the ReAct paradigm to autonomously explore, generate, and refine features while integrating feature selection and evaluation tools within an agent architecture. To our knowledge, FAMOSE represents the first application of an agentic ReAct framework to automated feature engineering, especially for both regression and classification tasks. Extensive experiments demonstrate that FAMOSE is at or near the state-of-the-art on classification tasks (especially tasks with more than 10K instances, where ROC-AUC increases 0.23% on average), and achieves the state-of-the-art for regression tasks by reducing RMSE by 2.0% on average, while remaining more robust to errors than other algorithms. We hypothesize that FAMOSE’s strong performance is because ReAct allows the LLM context window to record (via iterative feature discovery and evaluation steps) what features did or did not work. This is similar to a few-shot prompt and guides the LLM to invent better, more innovative features. Our work offers evidence that AI agents are remarkably effective in solving problems that require highly inventive solutions, such as feature engineering.
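A rough sketch of the ReAct-style propose-evaluate-record loop the abstract describes, assuming the LLM proposal step is replaced by a simple enumerator (`propose_feature` is a hypothetical stand-in; a real agent would prompt an LLM with the accumulated history). Not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def propose_feature(history):
    """Stand-in for the ReAct 'reason' step: a real agent would prompt
    an LLM with the (feature, score) history. Here we just enumerate
    candidate transforms not yet tried."""
    candidates = [("x0*x1", lambda X: X[:, 0] * X[:, 1]),
                  ("x0^2", lambda X: X[:, 0] ** 2),
                  ("log|x1|", lambda X: np.log1p(np.abs(X[:, 1])))]
    tried = {name for name, _ in history}
    return next(((n, f) for n, f in candidates if n not in tried), None)

def react_feature_search(X, y, rounds=3):
    history, best, Xcur = [], -np.inf, X
    for _ in range(rounds):
        prop = propose_feature(history)
        if prop is None:
            break
        name, fn = prop
        Xtry = np.column_stack([Xcur, fn(X)])                   # act
        score = cross_val_score(Ridge(), Xtry, y, cv=3).mean()  # observe
        history.append((name, score))       # recorded for the next "thought"
        if score > best:
            best, Xcur = score, Xtry        # keep features that help
    return Xcur, history

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=200)
_, hist = react_feature_search(X, y)
print(hist)   # the interaction term should score highest
```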
[AI-2] Reverso: Efficient Time Series Foundation Models for Zero-shot Forecasting
【Quick Read】: This paper addresses the large parameter counts, poor computational efficiency, and high deployment cost of current time series foundation models for zero-shot forecasting. The key to the solution is a simple, efficient recipe: small hybrid models that interleave long convolution layers with linear RNN layers (in particular DeltaNet layers), matching the performance of large Transformer-based models while being more than a hundred times smaller; combined with data augmentation and inference-strategy improvements, this yields Reverso, a family of efficient time series foundation models that substantially pushes the performance-efficiency Pareto frontier.
Link: https://arxiv.org/abs/2602.17634
Authors: Xinghong Fu,Yanhong Li,Georgios Papaioannou,Yoon Kim
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Learning time series foundation models has been shown to be a promising approach for zero-shot time series forecasting across diverse time series domains. Insofar as scaling has been a critical driver of performance of foundation models in other modalities such as language and vision, much recent work on time series foundation modeling has focused on scaling. This has resulted in time series foundation models with hundreds of millions of parameters that are, while performant, inefficient and expensive to use in practice. This paper describes a simple recipe for learning efficient foundation models for zero-shot time series forecasting that are orders of magnitude smaller. We show that large-scale transformers are not necessary: small hybrid models that interleave long convolution and linear RNN layers (in particular DeltaNet layers) can match the performance of larger transformer-based models while being more than a hundred times smaller. We also describe several data augmentation and inference strategies that further improve performance. This recipe results in Reverso, a family of efficient time series foundation models for zero-shot forecasting that significantly push the performance-efficiency Pareto frontier.
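A minimal PyTorch sketch of the "interleave long convolution and linear RNN layers" idea, under simplifying assumptions: the recurrence below is a plain diagonal gated linear RNN rather than a true DeltaNet layer, and the kernel size, norms, and residual layout are illustrative choices, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LongConv(nn.Module):
    """Depthwise causal convolution with a long kernel (a simple
    stand-in for the long-convolution layers)."""
    def __init__(self, dim, kernel=64):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel, groups=dim,
                              padding=kernel - 1)
    def forward(self, x):                         # x: [B, T, D]
        y = self.conv(x.transpose(1, 2))[..., : x.size(1)]  # trim -> causal
        return y.transpose(1, 2)

class LinearRNN(nn.Module):
    """Diagonal gated linear recurrence h_t = a_t*h_{t-1} + b_t*x_t;
    a simplification of DeltaNet-style linear RNN layers."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, 2 * dim)
    def forward(self, x):                         # x: [B, T, D]
        a, b = torch.sigmoid(self.gate(x)).chunk(2, dim=-1)
        h, out = torch.zeros_like(x[:, 0]), []
        for t in range(x.size(1)):
            h = a[:, t] * h + b[:, t] * x[:, t]
            out.append(h)
        return torch.stack(out, dim=1)

class HybridBlock(nn.Module):
    """Interleave the two sublayers with pre-norm residuals."""
    def __init__(self, dim):
        super().__init__()
        self.conv, self.rnn = LongConv(dim), LinearRNN(dim)
        self.n1, self.n2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
    def forward(self, x):
        x = x + self.conv(self.n1(x))
        return x + self.rnn(self.n2(x))

x = torch.randn(2, 128, 32)                 # [batch, time, channels]
print(HybridBlock(32)(x).shape)             # torch.Size([2, 128, 32])
```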
[AI-3] When to Trust the Cheap Check: Weak and Strong Verification for Reasoning
【Quick Read】: This paper addresses how large language models (LLMs) should balance verification cost against reliability during reasoning. It distinguishes weak verification from strong verification: the former screens outputs quickly with cheap internal checks (such as self-consistency or proxy rewards) but is noisy and imperfect; the latter relies on costly external mechanisms such as user feedback, which can establish trust but is inefficient. The key to the solution is weak-strong verification policies: a decision mechanism based on two thresholds that determines when to accept, reject, or defer to strong verification. The work introduces metrics for incorrect acceptance, incorrect rejection, and strong-verification frequency, proves that optimal policies have a two-threshold structure, and develops an online algorithm that provably controls both error rates without assumptions on the query stream, the language model, or the weak verifier, enabling an efficient and reliable verification loop.
Link: https://arxiv.org/abs/2602.17633
Authors: Shayan Kiyani,Sima Noorani,George Pappas,Hamed Hassani
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:
Abstract:Reasoning with LLMs increasingly unfolds inside a broader verification loop. Internally, systems use cheap checks, such as self-consistency or proxy rewards, which we call weak verification. Externally, users inspect outputs and steer the model through feedback until results are trustworthy, which we call strong verification. These signals differ sharply in cost and reliability: strong verification can establish trust but is resource-intensive, while weak verification is fast and scalable but noisy and imperfect. We formalize this tension through weak–strong verification policies, which decide when to accept or reject based on weak verification and when to defer to strong verification. We introduce metrics capturing incorrect acceptance, incorrect rejection, and strong-verification frequency. Over population, we show that optimal policies admit a two-threshold structure and that calibration and sharpness govern the value of weak verifiers. Building on this, we develop an online algorithm that provably controls acceptance and rejection errors without assumptions on the query stream, the language model, or the weak verifier.
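The two-threshold structure the paper proves optimal is easy to state in code. A minimal sketch, assuming the weak verifier emits a scalar confidence in [0, 1]; the threshold values here are placeholders (in the paper they would be set by the online algorithm to control the error rates).

```python
def verify_policy(weak_score: float, t_lo: float, t_hi: float) -> str:
    """Two-threshold weak-strong verification policy: accept on strong
    weak-verifier evidence, reject on strong counter-evidence, and
    defer the ambiguous middle band to costly strong verification."""
    assert t_lo <= t_hi
    if weak_score >= t_hi:
        return "accept"
    if weak_score <= t_lo:
        return "reject"
    return "defer-to-strong"

for s in (0.95, 0.5, 0.1):
    print(s, "->", verify_policy(s, t_lo=0.2, t_hi=0.9))
```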
[AI-4] SMAC: Score-Matched Actor-Critics for Robust Offline-to-Online Transfer
【Quick Read】: This paper addresses a key challenge in offline reinforcement learning (offline RL): fine-tuning a trained actor-critic with value-based online RL algorithms (such as Soft Actor-Critic or TD3) typically causes an immediate drop in performance. The authors propose Score Matched Actor-Critic (SMAC), whose core innovation is a regularizer applied during offline training that makes the Q-function satisfy a first-order derivative equality between the score of the policy and the action-gradient of the Q-function, thereby avoiding low-performance valleys between offline and online maxima in the loss landscape. This design ensures a smooth offline-to-online transition: experiments show SMAC transfers with no performance drop on 6/6 D4RL tasks and reduces regret by 34-58% over the best baseline in 4 environments.
Link: https://arxiv.org/abs/2602.17632
Authors: Nathan S. de Lara,Florian Shkurti
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Modern offline Reinforcement Learning (RL) methods find performant actor-critics, however, fine-tuning these actor-critics online with value-based RL algorithms typically causes immediate drops in performance. We provide evidence consistent with the hypothesis that, in the loss landscape, offline maxima for prior algorithms and online maxima are separated by low-performance valleys that gradient-based fine-tuning traverses. Following this, we present Score Matched Actor-Critic (SMAC), an offline RL method designed to learn actor-critics that transition to online value-based RL algorithms with no drop in performance. SMAC avoids valleys between offline and online maxima by regularizing the Q-function during the offline phase to respect a first-order derivative equality between the score of the policy and action-gradient of the Q-function. We experimentally demonstrate that SMAC converges to offline maxima that are connected to better online maxima via paths with monotonically increasing reward found by first-order optimization. SMAC achieves smooth transfer to Soft Actor-Critic and TD3 in 6/6 D4RL tasks. In 4/6 environments, it reduces regret by 34-58% over the best baseline.
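A rough sketch of what a score-matching penalty of this kind could look like for a Gaussian policy, assuming the SAC-style relation grad_a Q(s,a) = alpha * grad_a log pi(a|s) that holds at the soft-optimal policy; the network shapes, alpha, and sampling scheme are illustrative assumptions, not the paper's exact objective.

```python
import torch

def score_match_penalty(q_net, mean_net, log_std, s, alpha=0.2):
    """Penalize || grad_a Q(s,a) - alpha * grad_a log pi(a|s) ||^2 at
    sampled actions, encouraging the first-order consistency between
    the offline critic and the policy score."""
    mu = mean_net(s)
    a = (mu + log_std.exp() * torch.randn_like(mu)).detach().requires_grad_(True)
    q = q_net(torch.cat([s, a], dim=-1)).sum()
    grad_q = torch.autograd.grad(q, a, create_graph=True)[0]
    score = -(a - mu) / log_std.exp() ** 2   # grad_a log N(a; mu, std^2)
    return ((grad_q - alpha * score) ** 2).sum(-1).mean()

s_dim, a_dim = 4, 2
q_net = torch.nn.Sequential(torch.nn.Linear(s_dim + a_dim, 32),
                            torch.nn.ReLU(), torch.nn.Linear(32, 1))
mean_net = torch.nn.Linear(s_dim, a_dim)
log_std = torch.zeros(a_dim)
s = torch.randn(16, s_dim)
print(score_match_penalty(q_net, mean_net, log_std, s))
```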
[AI-5] Stable Asynchrony: Variance-Controlled Off-Policy RL for LLM s
【Quick Read】: This paper tackles the instability of asynchronous reinforcement learning (RL) training caused by high-variance policy-gradient estimators, especially in critic-free policy-gradient methods such as REINFORCE and GRPO: high asynchrony introduces stale rollouts that make the importance-ratio distribution heavy-tailed, so a small fraction of samples dominates updates, amplifying gradient noise and triggering training collapse. The key to the solution is Variance Controlled Policy Optimization (VCPO), a general stabilization method whose core mechanisms are: (i) scaling the learning rate by the effective sample size (ESS) to dampen unreliable updates, and (ii) applying a closed-form minimum-variance baseline for the off-policy setting, requiring no auxiliary value model and adding minimal overhead. Experiments show VCPO substantially improves the robustness of asynchronous training, outperforming a broad range of baselines on math reasoning, general reasoning, and tool-use tasks, and achieving a 2.5x speedup for long-context multi-turn training while matching synchronous performance, confirming that explicit control of policy-gradient variance is key to reliable large-scale asynchronous RL.
Link: https://arxiv.org/abs/2602.17616
Authors: Luke Huang,Zhuoyang Zhang,Qinghao Hu,Shang Yang,Song Han
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Reinforcement learning (RL) is widely used to improve large language models on reasoning tasks, and asynchronous RL training is attractive because it increases end-to-end throughput. However, for widely adopted critic-free policy-gradient methods such as REINFORCE and GRPO, high asynchrony makes the policy-gradient estimator markedly higher variance: training on stale rollouts creates heavy-tailed importance ratios, causing a small fraction of samples to dominate updates. This amplification makes gradients noisy and learning unstable relative to matched on-policy training. Across math and general reasoning benchmarks, we find collapse is reliably predicted by effective sample size (ESS) and unstable gradient norms. Motivated by this diagnosis, we propose Variance Controlled Policy Optimization (VCPO), a general stabilization method for REINFORCE/GRPO-style algorithms that (i) scales learning rate based on effective sample size to dampen unreliable updates, and (ii) applies a closed-form minimum-variance baseline for the off-policy setting, avoiding an auxiliary value model and adding minimal overhead. Empirically, VCPO substantially improves robustness for asynchronous training across math, general reasoning, and tool-use tasks, outperforming a broad suite of baselines spanning masking/clipping stabilizers and algorithmic variants. This reduces long-context, multi-turn training time by 2.5x while matching synchronous performance, demonstrating that explicit control of policy-gradient variance is key for reliable asynchronous RL at scale.
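The ESS-based learning-rate scaling is simple enough to sketch directly. A minimal version using the standard normalized ESS of importance ratios, (sum w)^2 / (n * sum w^2); the scaling rule and floor value are illustrative assumptions, not the paper's exact schedule.

```python
import torch

def ess_fraction(logp_new, logp_old):
    """Normalized effective sample size of importance ratios
    w = pi_new / pi_old, in (0, 1]; low values flag heavy tails."""
    w = (logp_new - logp_old).exp()
    return (w.sum() ** 2 / (w.numel() * (w * w).sum())).item()

def scaled_lr(base_lr, ess, floor=0.05):
    """Shrink the learning rate when stale rollouts make the
    importance weights heavy-tailed (low ESS)."""
    return base_lr * max(ess, floor)

logp_old = torch.randn(256) - 1.0
logp_new = logp_old + 0.5 * torch.randn(256)   # mildly off-policy batch
ess = ess_fraction(logp_new, logp_old)
print(ess, scaled_lr(3e-6, ess))
```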
[AI-6] owards Anytime-Valid Statistical Watermarking
【Quick Read】: This paper addresses two limitations of existing statistical watermarking for detecting content generated by large language models (LLMs): the lack of a principled approach for choosing sampling distributions, and the reliance on fixed-horizon hypothesis tests, which precludes valid early stopping. The key to the solution is Anchored E-Watermarking, the first e-value-based watermarking framework, which constructs a test supermartingale for the detection process to enable anytime-valid inference: flexible early stopping with the Type-I error rate still controlled. Using an anchor distribution to approximate the target model, the authors derive the e-value that is optimal with respect to the worst-case log-growth rate together with the minimal expected stopping time, markedly improving sample efficiency: on benchmarks it reduces the average token budget for detection by 13-15% relative to state-of-the-art baselines.
Link: https://arxiv.org/abs/2602.17608
Authors: Baihe Huang,Eric Xu,Kannan Ramchandran,Jiantao Jiao,Michael I. Jordan
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:
Abstract:The proliferation of Large Language Models (LLMs) necessitates efficient mechanisms to distinguish machine-generated content from human text. While statistical watermarking has emerged as a promising solution, existing methods suffer from two critical limitations: the lack of a principled approach for selecting sampling distributions and the reliance on fixed-horizon hypothesis testing, which precludes valid early stopping. In this paper, we bridge this gap by developing the first e-value-based watermarking framework, Anchored E-Watermarking, that unifies optimal sampling with anytime-valid inference. Unlike traditional approaches where optional stopping invalidates Type-I error guarantees, our framework enables valid, anytime-inference by constructing a test supermartingale for the detection process. By leveraging an anchor distribution to approximate the target model, we characterize the optimal e-value with respect to the worst-case log-growth rate and derive the optimal expected stopping time. Our theoretical claims are substantiated by simulations and evaluations on established benchmarks, showing that our framework can significantly enhance sample efficiency, reducing the average token budget required for detection by 13-15% relative to state-of-the-art baselines.
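To see why e-values permit valid early stopping, here is a minimal sketch of the generic sequential e-value test (multiply e-values into a supermartingale, stop at 1/alpha; Type-I error <= alpha by Ville's inequality). The simulated e-value distribution is an assumption standing in for the paper's anchored optimal e-values.

```python
import numpy as np

def anytime_watermark_test(e_values, alpha=0.01):
    """Multiply per-token e-values into a test supermartingale M_t and
    stop as soon as M_t >= 1/alpha; under H0 (each E[e] <= 1) the
    probability of ever stopping is at most alpha."""
    m = 1.0
    for t, e in enumerate(e_values, start=1):
        m *= e
        if m >= 1.0 / alpha:
            return "watermarked", t     # valid early stop
    return "undecided", len(e_values)

rng = np.random.default_rng(0)
# Under H1 the watermark key makes e-values slightly larger than 1 on
# average (illustrative lognormal model, not the paper's construction).
h1_e_values = np.exp(rng.normal(0.15, 0.3, size=2000))
print(anytime_watermark_test(h1_e_values))
```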
[AI-7] AutoNumerics: An Autonomous PDE-Agnostic Multi-Agent Pipeline for Scientific Computing
【Quick Read】: This paper addresses the heavy reliance on mathematical expertise and manual tuning in designing numerical solvers for partial differential equations (PDEs), along with the high computational cost and limited interpretability of conventional neural-network approaches. The key to the solution is AutoNumerics, a multi-agent framework that autonomously designs, implements, debugs, and verifies numerical solvers for general PDEs directly from natural language descriptions. By combining a coarse-to-fine execution strategy with a residual-based self-verification mechanism, it ensures the generated solvers are not only accurate and efficient but also transparent and interpretable, and it correctly selects numerical schemes according to the structural properties of the PDE.
Link: https://arxiv.org/abs/2602.17607
Authors: Jianda Du,Youran Sun,Haizhao Yang
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:
Abstract:PDEs are central to scientific and engineering modeling, yet designing accurate numerical solvers typically requires substantial mathematical expertise and manual tuning. Recent neural network-based approaches improve flexibility but often demand high computational cost and suffer from limited interpretability. We introduce AutoNumerics, a multi-agent framework that autonomously designs, implements, debugs, and verifies numerical solvers for general PDEs directly from natural language descriptions. Unlike black-box neural solvers, our framework generates transparent solvers grounded in classical numerical analysis. We introduce a coarse-to-fine execution strategy and a residual-based self-verification mechanism. Experiments on 24 canonical and real-world PDE problems demonstrate that AutoNumerics achieves competitive or superior accuracy compared to existing neural and LLM-based baselines, and correctly selects numerical schemes based on PDE structural properties, suggesting its viability as an accessible paradigm for automated PDE solving.
[AI-8] MolHIT: Advancing Molecular-Graph Generation with Hierarchical Discrete Diffusion Models
【Quick Read】: This paper addresses the low chemical validity of molecular graph diffusion models and their difficulty in meeting target-property requirements, a long-standing weakness relative to 1D modeling approaches. The key to the solution is the MolHIT framework, whose core innovations are: a Hierarchical Discrete Diffusion Model that encodes chemical priors as additional categories to make the diffusion process more chemically grounded, and decoupled atom encoding that splits atom types according to their chemical roles for more precise structural modeling. The method achieves near-perfect chemical validity for the first time in graph diffusion, sets new state-of-the-art performance on the MOSES dataset, and shows strong capability on downstream tasks such as multi-property guided generation and scaffold extension.
Link: https://arxiv.org/abs/2602.17602
Authors: Hojung Jung,Rodrigo Hormazabal,Jaehyeong Jo,Youngrok Park,Kyunggeun Roh,Se-Young Yun,Sehui Han,Dae-Woong Jeong
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Molecular generation with diffusion models has emerged as a promising direction for AI-driven drug discovery and materials science. While graph diffusion models have been widely adopted due to the discrete nature of 2D molecular graphs, existing models suffer from low chemical validity and struggle to meet the desired properties compared to 1D modeling. In this work, we introduce MolHIT, a powerful molecular graph generation framework that overcomes long-standing performance limitations in existing methods. MolHIT is based on the Hierarchical Discrete Diffusion Model, which generalizes discrete diffusion to additional categories that encode chemical priors, and decoupled atom encoding that splits the atom types according to their chemical roles. Overall, MolHIT achieves new state-of-the-art performance on the MOSES dataset with near-perfect validity for the first time in graph diffusion, surpassing strong 1D baselines across multiple metrics. We further demonstrate strong performance in downstream tasks, including multi-property guided generation and scaffold extension.
[AI-9] AI Gamestore: Scalable Open-Ended Evaluation of Machine General Intelligence with Human Games
【Quick Read】: This paper asks how machine intelligence can be evaluated comprehensively and dynamically against human-like general intelligence amid rapid technological progress. Traditional AI benchmarks assess only narrow domain capabilities and quickly saturate as developers optimize for them, failing to reflect true generality. The proposed evaluation paradigm is the space of "human games": studying how AI systems learn and play all conceivable and enjoyable human games, compared against human players with equivalent resources. The key to the solution is the AI GameStore, a scalable, open-ended platform that uses LLMs with humans in the loop to automatically source and adapt standardized game environments from popular digital gaming platforms, synthesizing diverse and representative human games. In a proof of concept, 100 generated games were used to evaluate seven frontier vision-language models (VLMs); the best models scored below 10% of the human average on most games, with pronounced weaknesses in world-model learning, memory, and planning, validating the approach's rigor and difficulty and charting a new evaluation path toward human-like general intelligence.
Link: https://arxiv.org/abs/2602.17594
Authors: Lance Ying,Ryan Truong,Prafull Sharma,Kaiya Ivy Zhao,Nathan Cloos,Kelsey R. Allen,Thomas L. Griffiths,Katherine M. Collins,José Hernández-Orallo,Phillip Isola,Samuel J. Gershman,Joshua B. Tenenbaum
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 29 pages, 14 figures
Abstract:Rigorously evaluating machine intelligence against the broad spectrum of human general intelligence has become increasingly important and challenging in this era of rapid technological advance. Conventional AI benchmarks typically assess only narrow capabilities in a limited range of human activity. Most are also static, quickly saturating as developers explicitly or implicitly optimize for them. We propose that a more promising way to evaluate human-like general intelligence in AI systems is through a particularly strong form of general game playing: studying how and how well they play and learn to play all conceivable human games, in comparison to human players with the same level of experience, time, or other resources. We define a "human game" to be a game designed by humans for humans, and argue for the evaluative suitability of this space of all such games people can imagine and enjoy, the "Multiverse of Human Games". Taking a first step towards this vision, we introduce the AI GameStore, a scalable and open-ended platform that uses LLMs with humans-in-the-loop to synthesize new representative human games, by automatically sourcing and adapting standardized and containerized variants of game environments from popular human digital gaming platforms. As a proof of concept, we generated 100 such games based on the top charts of Apple App Store and Steam, and evaluated seven frontier vision-language models (VLMs) on short episodes of play. The best models achieved less than 10% of the human average score on the majority of the games, and especially struggled with games that challenge world-model learning, memory and planning. We conclude with a set of next steps for building out the AI GameStore as a practical way to measure and drive progress toward human-like general intelligence in machines.
[AI-10] Conditional Flow Matching for Continuous Anomaly Detection in Autonomous Driving on a Manifold-Aware Spectral Space
【Quick Read】: This paper addresses the bottleneck in safety validation for Level 4 autonomous vehicles (AVs): rule-based heuristics cannot scale the detection of rare, high-risk long-tail scenarios. The key to the solution is the Deep-Flow framework, which uses Optimal Transport Conditional Flow Matching (OT-CFM) to model the continuous probability density of expert human driving behavior and constrains generation to a low-rank spectral manifold via a Principal Component Analysis (PCA) bottleneck, guaranteeing kinematic smoothness by design and enabling numerically stable, deterministic log-likelihood estimation. It further introduces a lane-aware goal-conditioned early-fusion Transformer encoder with an intent-integrity-preserving skip connection, and trains simulation-free with a kinematic complexity weighting scheme (quantifying high-energy maneuvers via path tortuosity and jerk). The framework surfaces anomalous behaviors such as lane-boundary violations and non-normative junction maneuvers, revealing a predictability gap that traditional safety filters overlook and providing a mathematically rigorous foundation for defining statistical safety gates.
Link: https://arxiv.org/abs/2602.17586
Authors: Antonio Guillen-Perez
Institutions: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Safety validation for Level 4 autonomous vehicles (AVs) is currently bottlenecked by the inability to scale the detection of rare, high-risk long-tail scenarios using traditional rule-based heuristics. We present Deep-Flow, an unsupervised framework for safety-critical anomaly detection that utilizes Optimal Transport Conditional Flow Matching (OT-CFM) to characterize the continuous probability density of expert human driving behavior. Unlike standard generative approaches that operate in unstable, high-dimensional coordinate spaces, Deep-Flow constrains the generative process to a low-rank spectral manifold via a Principal Component Analysis (PCA) bottleneck. This ensures kinematic smoothness by design and enables the computation of the exact Jacobian trace for numerically stable, deterministic log-likelihood estimation. To resolve multi-modal ambiguity at complex junctions, we utilize an Early Fusion Transformer encoder with lane-aware goal conditioning, featuring a direct skip-connection to the flow head to maintain intent-integrity throughout the network. We introduce a kinematic complexity weighting scheme that prioritizes high-energy maneuvers (quantified via path tortuosity and jerk) during the simulation-free training process. Evaluated on the Waymo Open Motion Dataset (WOMD), our framework achieves an AUC-ROC of 0.766 against a heuristic golden set of safety-critical events. More significantly, our analysis reveals a fundamental distinction between kinematic danger and semantic non-compliance. Deep-Flow identifies a critical predictability gap by surfacing out-of-distribution behaviors, such as lane-boundary violations and non-normative junction maneuvers, that traditional safety filters overlook. This work provides a mathematically rigorous foundation for defining statistical safety gates, enabling objective, data-driven validation for the safe deployment of autonomous fleets.
[AI-11] Be Wary of Your Time Series Preprocessing AAAI-26
【Quick Read】: This paper addresses the lack of systematic theory on how preprocessing steps such as normalization and scaling affect the expressivity of Transformer-based time series models. Existing practice is largely empirical, with no clear account of how different normalization strategies (such as instance-level normalization versus global scaling) influence a model's ability to distinguish similar from dissimilar inputs. The key to the solution is a novel expressivity framework tailored to time series that quantifies a model's ability to separate inputs in representation space, from which theoretical bounds are derived for Standard and Min-Max scaling. The analysis shows that the choice of normalization strategy significantly affects representational capacity, with the best choice depending on the task and data characteristics; empirically, no single normalization method consistently wins, and in some settings omitting normalization entirely performs best. These findings underscore the critical role of preprocessing in time series learning and motivate more principled normalization strategies tailored to specific tasks and datasets.
Link: https://arxiv.org/abs/2602.17568
Authors: Sofiane Ennadir,Tianze Wang,Oleg Smirnov,Sahar Asadi,Lele Cao
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the AI4TS workshop at AAAI-26
Abstract:Normalization and scaling are fundamental preprocessing steps in time series modeling, yet their role in Transformer-based models remains underexplored from a theoretical perspective. In this work, we present the first formal analysis of how different normalization strategies, specifically instance-based and global scaling, impact the expressivity of Transformer-based architectures for time series representation learning. We propose a novel expressivity framework tailored to time series, which quantifies a model’s ability to distinguish between similar and dissimilar inputs in the representation space. Using this framework, we derive theoretical bounds for two widely used normalization methods: Standard and Min-Max scaling. Our analysis reveals that the choice of normalization strategy can significantly influence the model’s representational capacity, depending on the task and data characteristics. We complement our theory with empirical validation on classification and forecasting benchmarks using multiple Transformer-based models. Our results show that no single normalization method consistently outperforms others, and in some cases, omitting normalization entirely leads to superior performance. These findings highlight the critical role of preprocessing in time series learning and motivate the need for more principled normalization strategies tailored to specific tasks and datasets.
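A small NumPy sketch illustrating the expressivity trade-off the paper analyzes, under a simple assumption: two series that differ only in level and amplitude. Instance-level standardization collapses them to the same representation (an information loss for level-sensitive tasks), while global scaling keeps them distinguishable.

```python
import numpy as np

def instance_standardize(X):
    """Per-series (instance) standardization: each series is scaled by
    its own mean/std, discarding level and amplitude information."""
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, keepdims=True) + 1e-8
    return (X - mu) / sd

def global_minmax(X):
    """Global Min-Max scaling: one affine map for the whole dataset,
    preserving relative levels across series."""
    lo, hi = X.min(), X.max()
    return (X - lo) / (hi - lo + 1e-8)

X = np.stack([np.sin(np.linspace(0, 6, 100)),              # small amplitude
              100 + 50 * np.sin(np.linspace(0, 6, 100))])  # large, shifted
# Instance scaling makes the two series indistinguishable; global
# scaling keeps them apart.
print(np.abs(instance_standardize(X)[0] - instance_standardize(X)[1]).max())
print(np.abs(global_minmax(X)[0] - global_minmax(X)[1]).max())
```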
[AI-12] A Hybrid Federated Learning Based Ensemble Approach for Lung Disease Diagnosis Leverag ing Fusion of SWIN Transformer and CNN
【Quick Read】: This paper addresses two problems in healthcare: insufficient diagnostic accuracy for lung diseases (such as COVID-19 and pneumonia) and the difficulty of protecting medical data privacy. The core question is how to improve disease diagnosis and severity prediction while keeping patient data secure. The key to the solution is a federated learning (FL) based hybrid AI architecture that fuses a SWIN Transformer with several advanced CNN models (DenseNet201, Inception V3, VGG19). Distributed training enables collaborative modeling across institutions while federated learning keeps the data local, improving generalization and security and supporting real-time continual learning in dynamic clinical settings.
Link: https://arxiv.org/abs/2602.17566
Authors: Asif Hasan Chowdhury,Md. Fahim Islam,M Ragib Anjum Riad,Faiyaz Bin Hashem,Md Tanzim Reza,Md. Golam Rabiul Alam
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:The significant advancements in computational power create a vast opportunity for using Artificial Intelligence in different applications of healthcare and medical science. A Hybrid FL-Enabled Ensemble Approach for Lung Disease Diagnosis Leveraging a Combination of SWIN Transformer and CNN brings together cutting-edge AI technology and Federated Learning. Since medical specialists and hospitals will have a shared data space, based on that data, with the help of Artificial Intelligence and the integration of federated learning, we can introduce a secure and distributed system for medical data processing and create an efficient and reliable system. The proposed hybrid model enables the detection of COVID-19 and Pneumonia based on x-ray reports. We will use advanced and the latest available technology offered by Tensorflow and Keras, along with the Microsoft-developed Vision Transformer, which can help in the fight against a pandemic that the world has to face together. We focused on using the latest available CNN models (DenseNet201, Inception V3, VGG 19) and the Transformer model SWIN Transformer in order to prepare our hybrid model that can provide a reliable solution as a helping hand for the physician in the medical field. In this research, we will discuss how the Federated learning-based Hybrid AI model can improve the accuracy of disease diagnosis and severity prediction of a patient using the real-time continual learning approach, and how the integration of federated learning can ensure hybrid model security and keep the authenticity of the information.
[AI-13] ODESteer: A Unified ODE-Based Steering Framework for LLM Alignment ICLR2026
【Quick Read】: This paper addresses two limitations of current activation steering methods for generative AI: the lack of a unified theoretical framework to guide the design of steering directions, and the over-reliance on one-step steering, which fails to capture the complex patterns of activation distributions. The key to the solution is a unified theoretical framework based on ordinary differential equations (ODEs) that interprets conventional activation addition as a first-order approximation to the ODE solution and designs steering directions via barrier functions from control theory. On this basis, the authors propose ODESteer, which defines the barrier function as the log-density ratio between positive and negative activations and builds a multi-step, adaptive ODE steering trajectory, achieving consistent gains on LLM alignment benchmarks such as TruthfulQA, UltraFeedback, and RealToxicityPrompts and validating the effectiveness and generality of the theoretical framework.
Link: https://arxiv.org/abs/2602.17560
Authors: Hongjue Zhao,Haosen Sun,Jiangtao Kong,Xiaochang Li,Qineng Wang,Liwei Jiang,Qi Zhu,Tarek Abdelzaher,Yejin Choi,Manling Li,Huajie Shao
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted by ICLR 2026
Abstract:Activation steering, or representation engineering, offers a lightweight approach to align large language models (LLMs) by manipulating their internal activations at inference time. However, current methods suffer from two key limitations: (i) the lack of a unified theoretical framework for guiding the design of steering directions, and (ii) an over-reliance on one-step steering that fails to capture complex patterns of activation distributions. In this work, we propose a unified ordinary differential equations (ODEs)-based theoretical framework for activation steering in LLM alignment. We show that conventional activation addition can be interpreted as a first-order approximation to the solution of an ODE. Based on this ODE perspective, identifying a steering direction becomes equivalent to designing a barrier function from control theory. Derived from this framework, we introduce ODESteer, a kind of ODE-based steering guided by barrier functions, which shows empirical advancement in LLM alignment. ODESteer identifies steering directions by defining the barrier function as the log-density ratio between positive and negative activations, and employs it to construct an ODE for multi-step and adaptive steering. Compared to state-of-the-art activation steering methods, ODESteer achieves consistent empirical improvements on diverse LLM alignment benchmarks, a notable 5.7% improvement over TruthfulQA, 2.5% over UltraFeedback, and 2.4% over RealToxicityPrompts. Our work establishes a principled new view of activation steering in LLM alignment by unifying its theoretical foundations via ODEs, and validating it empirically through the proposed ODESteer method.
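A minimal NumPy sketch of the multi-step idea, assuming diagonal Gaussians fitted to positive and negative activations so the log-density-ratio barrier has a closed-form gradient; the Euler integrator, step count, and step size are illustrative choices, not the paper's exact solver.

```python
import numpy as np

def fit_diag_gaussian(H):
    """Diagonal Gaussian fit to a set of activation vectors."""
    return H.mean(0), H.var(0) + 1e-6

def barrier_grad(h, pos, neg):
    """Gradient of B(h) = log p_pos(h) - log p_neg(h) for diagonal
    Gaussians (grad log N(h; m, v) = -(h - m) / v)."""
    (mp, vp), (mn, vn) = pos, neg
    return -(h - mp) / vp + (h - mn) / vn

def odesteer(h, pos, neg, steps=8, eta=0.1):
    """Multi-step Euler integration of dh/dt = grad B(h), in contrast
    to one-step activation addition."""
    for _ in range(steps):
        h = h + eta * barrier_grad(h, pos, neg)
    return h

rng = np.random.default_rng(0)
pos = fit_diag_gaussian(rng.normal(+1.0, 1.0, (500, 16)))  # e.g. truthful
neg = fit_diag_gaussian(rng.normal(-1.0, 1.0, (500, 16)))  # e.g. untruthful
h = rng.normal(-1.0, 1.0, 16)           # activation to steer
print(odesteer(h, pos, neg).mean())     # pushed toward the positive mode
```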
[AI-14] MASPO: Unifying Gradient Utilization Probability Mass and Signal Reliability for Robust and Sample-Efficient LLM Reasoning
【Quick Read】: This paper examines three core problems of existing Reinforcement Learning with Verifiable Rewards (RLVR) algorithms such as GRPO when applied to large language models (LLMs): inefficient gradient utilization caused by hard clipping; insensitivity to probability mass, since uniform ratio constraints ignore the token distribution; and asymmetric signal reliability arising from the differing credit-assignment ambiguity of positive and negative samples. The key to the solution is the unified framework Mass-Adaptive Soft Policy Optimization (MASPO), whose innovations are: a differentiable soft Gaussian gating mechanism that maximizes gradient utilization, a mass-adaptive limiter that balances exploration across the probability spectrum, and an asymmetric risk controller that aligns update magnitudes with signal confidence, yielding more efficient and stable policy updates suited to the optimization dynamics of LLMs.
Link: https://arxiv.org/abs/2602.17550
Authors: Xiaoliang Fu,Jiaye Lin,Yangyi Fang,Binbin Zheng,Chaowen Hu,Zekai Shao,Cong Qin,Lu Pan,Ke Zeng,Xunliang Cai
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Existing Reinforcement Learning with Verifiable Rewards (RLVR) algorithms, such as GRPO, rely on rigid, uniform, and symmetric trust region mechanisms that are fundamentally misaligned with the complex optimization dynamics of Large Language Models (LLMs). In this paper, we identify three critical challenges in these methods: (1) inefficient gradient utilization caused by the binary cutoff of hard clipping, (2) insensitive probability mass arising from uniform ratio constraints that ignore the token distribution, and (3) asymmetric signal reliability stemming from the disparate credit assignment ambiguity between positive and negative samples. To bridge these gaps, we propose Mass-Adaptive Soft Policy Optimization (MASPO), a unified framework designed to harmonize these three dimensions. MASPO integrates a differentiable soft Gaussian gating to maximize gradient utility, a mass-adaptive limiter to balance exploration across the probability spectrum, and an asymmetric risk controller to align update magnitudes with signal confidence. Extensive evaluations demonstrate that MASPO serves as a robust, all-in-one RLVR solution, significantly outperforming strong baselines. Our code is available at: this https URL.
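A stylized comparison of the hard trust-region cutoff and the soft Gaussian gating the abstract describes: outside the clip band a hard-clipped update contributes no gradient, whereas a Gaussian gate down-weights it smoothly. The gate width and the binary simplification of clipping are illustrative assumptions, not MASPO's exact formulation.

```python
import numpy as np

def hard_clip_weight(r: float, eps: float = 0.2) -> float:
    """PPO/GRPO-style hard clipping, simplified to a binary on/off:
    updates outside the trust region are discarded entirely."""
    return float(1.0 - eps <= r <= 1.0 + eps)

def soft_gaussian_gate(r: float, sigma: float = 0.2) -> float:
    """Differentiable soft gate: updates are down-weighted smoothly as
    the importance ratio drifts from 1, so no gradient is fully wasted."""
    return float(np.exp(-((r - 1.0) ** 2) / (2.0 * sigma ** 2)))

for r in (1.0, 1.15, 1.4, 2.0):
    print(f"r={r:.2f}  hard={hard_clip_weight(r):.0f}  "
          f"soft={soft_gaussian_gate(r):.3f}")
```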
[AI-15] Position: Evaluation of ECG Representations Must Be Fixed
【速读】:This paper addresses the limitations of current benchmarking practice in 12-lead electrocardiogram (ECG) representation learning: existing evaluation leans heavily on multi-label benchmarks dominated by arrhythmia and waveform-morphology labels (PTB-XL, CPSC2018, and CSN) and fails to cover the much broader clinical information encoded in the ECG, in particular clinically important targets such as structural heart disease and patient-level prediction. The key elements of the solution are: first, expanding downstream evaluation to include structural heart disease and patient-level forecasting; second, adopting evaluation best practices suited to multi-label, imbalanced settings and re-examining the performance ranking of existing methods; and finally, showing empirically that a randomly initialized encoder with linear evaluation matches state-of-the-art pre-trained models, which establishes the random encoder as a reasonable baseline. Together, these improvements make progress in ECG representation learning more reliable and better aligned with real clinical needs.
链接: https://arxiv.org/abs/2602.17531
作者: Zachary Berger,Daniel Prakah-Asante,John Guttag,Collin M. Stultz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Project website at this https URL
Abstract:This position paper argues that current benchmarking practice in 12-lead ECG representation learning must be fixed to ensure progress is reliable and aligned with clinically meaningful objectives. The field has largely converged on three public multi-label benchmarks (PTB-XL, CPSC2018, CSN) dominated by arrhythmia and waveform-morphology labels, even though the ECG is known to encode substantially broader clinical information. We argue that downstream evaluation should expand to include an assessment of structural heart disease and patient-level forecasting, in addition to other evolving ECG-related endpoints, as relevant clinical targets. Next, we outline evaluation best practices for multi-label, imbalanced settings, and show that when they are applied, the literature’s current conclusion about which representations perform best is altered. Furthermore, we demonstrate the surprising result that a randomly initialized encoder with linear evaluation matches state-of-the-art pre-training on many tasks. This motivates the use of a random encoder as a reasonable baseline model. We substantiate our observations with an empirical evaluation of three representative ECG pre-training approaches across six evaluation settings: the three standard benchmarks, a structural disease dataset, hemodynamic inference, and patient forecasting.
[AI-16] Enhancing Large Language Models (LLMs) for Telecom using Dynamic Knowledge Graphs and Explainable Retrieval-Augmented Generation
【速读】:This paper addresses the insufficient accuracy and hallucination problems that Large Language Models (LLMs) exhibit in the telecom domain, which stem from domain complexity, continuously evolving standards, and dense specialized terminology. The key to the proposed KG-RAG framework is the combination of knowledge graphs (KGs) with retrieval-augmented generation (RAG): the KG supplies structured domain knowledge derived from telecom standards and technical documents, while RAG dynamically retrieves relevant facts to ground generation, substantially improving factual accuracy, reducing hallucination, and ensuring standards compliance. Experiments show that, on benchmark datasets, the approach achieves average accuracy gains of 14.3% over a pure RAG baseline and 21.6% over LLM-only models.
链接: https://arxiv.org/abs/2602.17529
作者: Dun Yuan,Hao Zhou,Xue Liu,Hao Chen,Yan Xin,Jianzhong (Charlie) Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have shown strong potential across a variety of tasks, but their application in the telecom field remains challenging due to domain complexity, evolving standards, and specialized terminology. Therefore, general-domain LLMs may struggle to provide accurate and reliable outputs in this context, leading to increased hallucinations and reduced utility in telecom applications. To address these limitations, this work introduces KG-RAG, a novel framework that integrates knowledge graphs (KGs) with retrieval-augmented generation (RAG) to enhance LLMs for telecom-specific tasks. In particular, the KG provides a structured representation of domain knowledge derived from telecom standards and technical documents, while RAG enables dynamic retrieval of relevant facts to ground the model's outputs. Such a combination improves factual accuracy, reduces hallucination, and ensures compliance with telecom standards. Experimental results across benchmark datasets demonstrate that KG-RAG outperforms both LLM-only and standard RAG baselines, e.g., KG-RAG achieves an average accuracy improvement of 14.3% over RAG and 21.6% over LLM-only models. These results highlight KG-RAG's effectiveness in producing accurate, reliable, and explainable outputs in complex telecom scenarios.
[AI-17] LORA-CRAFT: Cross-layer Rank Adaptation via Frozen Tucker Decomposition of Pre-trained Attention Weights
【速读】:This paper addresses the trade-off between adaptation capacity and parameter overhead in parameter-efficient fine-tuning (PEFT), namely how to keep high performance while sharply reducing the number of trainable parameters in large pre-trained language models. The key to the proposed CRAFT (Cross-layer Rank Adaptation via Frozen Tucker) is to organize the attention weight matrices across layers into a 3D tensor, perform a full Tucker decomposition of it via Higher-Order SVD (HOSVD), freeze all resulting factor matrices, and train only lightweight adaptation matrices applied to each factor. This realizes cross-layer information sharing with parameter decoupling, so the total adaptation budget is only 41K parameters and does not grow with model dimension or depth, while delivering performance on the GLUE benchmark that is competitive with or superior to existing methods.
链接: https://arxiv.org/abs/2602.17510
作者: Kasun Dewage,Marianna Pensky,Suranadi De Silva,Shankadeep Mondal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce CRAFT (Cross-layer Rank Adaptation via Frozen Tucker), a parameter-efficient fine-tuning (PEFT) method that applies Tucker tensor decomposition to pre-trained attention weight matrices stacked across transformer layers and trains only small square adaptation matrices on the resulting frozen Tucker factors. Existing tensor-based PEFT methods decompose gradient updates: LoTR applies Tucker decomposition with shared factor matrices, while SuperLoRA groups and reshapes \Delta W across layers before applying Tucker decomposition. Separately, methods like PiSSA apply SVD to pre-trained weights but operate independently per layer. CRAFT bridges these two lines of work: it performs full Tucker decomposition via Higher-Order SVD (HOSVD) directly on pre-trained weights organized as cross-layer 3D tensors, freezes all resulting factors, and adapts the model through lightweight trainable transformations applied to each factor matrix. Experiments on the GLUE benchmark using RoBERTa-base and RoBERTa-large demonstrate that CRAFT achieves competitive performance with existing methods while requiring only 41K Tucker adaptation parameters, a count independent of model dimension and depth at fixed Tucker ranks.
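To make the cross-layer Tucker construction concrete, here is a minimal numpy sketch under assumed shapes and ranks: attention weights stacked into an L×d×d tensor, HOSVD factors obtained from SVDs of the mode unfoldings and then frozen, and one small square adapter per factor as the only trainable part (initialised to identity here, so the output matches the plain HOSVD reconstruction). This illustrates the idea, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 6, 32                              # layers and width (toy sizes)
ranks = (4, 8, 8)                         # assumed Tucker ranks per mode
W = rng.normal(size=(L, d, d))            # pretrained attention weights, stacked

def mode_unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_product(T, M, mode):
    """Multiply tensor T along `mode` by matrix M (rows index the new size)."""
    return np.moveaxis(np.tensordot(M, np.moveaxis(T, mode, 0), axes=(1, 0)), 0, mode)

# HOSVD: leading left singular vectors of each mode unfolding (frozen factors).
U = [np.linalg.svd(mode_unfold(W, m), full_matrices=False)[0][:, :ranks[m]]
     for m in range(3)]

# Core tensor G = W x_0 U0^T x_1 U1^T x_2 U2^T (also frozen).
G = W
for m, Um in enumerate(U):
    G = mode_product(G, Um.T, m)

# Trainable part: one small square adapter per frozen factor.
A = [np.eye(Um.shape[1]) for Um in U]     # rank x rank: independent of L and d

def reconstruct(G, factors):
    T = G
    for m, F in enumerate(factors):
        T = mode_product(T, F, m)
    return T

W_adapted = reconstruct(G, [Um @ Am for Um, Am in zip(U, A)])
print("identity adapters reproduce the HOSVD approximation:",
      np.allclose(W_adapted, reconstruct(G, U)))
```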
[AI-18] Pareto Optimal Benchmarking of AI Models on ARM Cortex Processors for Sustainable Embedded Systems
【速读】:This paper addresses the optimization of artificial intelligence (AI) models on ARM Cortex processors in embedded systems, where the core challenge is balancing energy efficiency, model accuracy, and resource utilization. The key to the solution is an automated test-bench framework that systematically evaluates key performance indicators (KPIs) to identify the best processor-model combinations; Pareto analysis quantifies the trade-off between energy consumption and accuracy, and the near-linear correlation between floating-point operations (FLOPs) and inference time provides a reliable way to estimate computational demands, guiding developers toward high-performance yet energy-efficient AI applications.
链接: https://arxiv.org/abs/2602.17508
作者: Pranay Jain,Maximilian Kasper,Göran Köber,Axel Plinge,Dominik Seuß
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 11 pages, 7 figures, Funding: GreenICT@FMD (BMFTR grant 16ME0491K)
Abstract:This work presents a practical benchmarking framework for optimizing artificial intelligence (AI) models on ARM Cortex processors (M0+, M4, M7), focusing on energy efficiency, accuracy, and resource utilization in embedded systems. Through the design of an automated test bench, we provide a systematic approach to evaluate across key performance indicators (KPIs) and identify optimal combinations of processor and AI model. The research highlights a near-linear correlation between floating-point operations (FLOPs) and inference time, offering a reliable metric for estimating computational demands. Using Pareto analysis, we demonstrate how to balance trade-offs between energy consumption and model accuracy, ensuring that AI applications meet performance requirements without compromising sustainability. Key findings indicate that the M7 processor is ideal for short inference cycles, while the M4 processor offers better energy efficiency for longer inference tasks. The M0+ processor, while less efficient for complex AI models, remains suitable for simpler tasks. This work provides insights for developers, guiding them to design energy-efficient AI systems that deliver high performance in real-world applications.
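A Pareto analysis of (energy, error) pairs boils down to discarding dominated configurations. A small self-contained sketch, with made-up numbers standing in for measured model/processor pairings:

```python
def pareto_front(points):
    """Keep the (energy, error) pairs not dominated by any other point:
    another point dominates if it is <= in both coordinates and < in one."""
    front = []
    for i, (e_i, err_i) in enumerate(points):
        dominated = any(
            e_j <= e_i and err_j <= err_i and (e_j < e_i or err_j < err_i)
            for j, (e_j, err_j) in enumerate(points) if j != i
        )
        if not dominated:
            front.append((e_i, err_i))
    return sorted(front)

# Hypothetical (energy per inference in mJ, error rate) for model/MCU pairings.
candidates = [(3.1, 0.12), (5.2, 0.10), (2.2, 0.20), (4.8, 0.09), (7.5, 0.05)]
print(pareto_front(candidates))   # (5.2, 0.10) is dominated by (4.8, 0.09)
```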
[AI-19] Learning with Boolean threshold functions
【速读】:This paper addresses exact and interpretable model learning when training neural networks on Boolean data, a regime in which traditional gradient-based methods often fail to converge or generalize. The key to the solution is to recast loss minimization as a nonconvex constraint-satisfaction problem, decomposed via a divide-and-concur strategy into two complementary constraints: a local Boolean threshold function (BTF) consistency constraint relating each node's inputs, weights, and output, and an architectural concurrence constraint equating neuron outputs with downstream inputs and enforcing weight consistency across training samples. The reflect-reflect-relax (RRR) projection algorithm reconciles these constraints, and a sufficiently large lower bound on the margin guarantees that the learned representations are sparse and equivalent to simple circuits of logic gates with ±1 weights. The method achieves exact solutions or strong generalization on tasks including multiplier-circuit discovery, binary autoencoding, logic-network inference, and cellular-automata learning, demonstrating the potential of constraint-satisfaction-based learning for discrete neural systems.
链接: https://arxiv.org/abs/2602.17493
作者: Veit Elser,Manish Krishan Lal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 22 pages, 21 figures
Abstract:We develop a method for training neural networks on Boolean data in which the values at all nodes are strictly \pm 1 , and the resulting models are typically equivalent to networks whose nonzero weights are also \pm 1 . The method replaces loss minimization with a nonconvex constraint formulation. Each node implements a Boolean threshold function (BTF), and training is expressed through a divide-and-concur decomposition into two complementary constraints: one enforces local BTF consistency between inputs, weights, and output; the other imposes architectural concurrence, equating neuron outputs with downstream inputs and enforcing weight equality across training-data instantiations of the network. The reflect-reflect-relax (RRR) projection algorithm is used to reconcile these constraints. Each BTF constraint includes a lower bound on the margin. When this bound is sufficiently large, the learned representations are provably sparse and equivalent to networks composed of simple logical gates with \pm 1 weights. Across a range of tasks – including multiplier-circuit discovery, binary autoencoding, logic-network inference, and cellular automata learning – the method achieves exact solutions or strong generalization in regimes where standard gradient-based methods struggle. These results demonstrate that projection-based constraint satisfaction provides a viable and conceptually distinct foundation for learning in discrete neural systems, with implications for interpretability and efficient inference.
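The RRR update the abstract refers to has a compact generic form, x ← x + β(P_B(2P_A(x) − x) − P_A(x)), for projections P_A and P_B onto two constraint sets. The paper applies it to nonconvex BTF and concurrence constraints; the sketch below runs the same iteration on a toy convex feasibility problem (hyperplane ∩ box) purely to show the mechanics, with β, the sets, and the iteration count chosen arbitrarily.

```python
import numpy as np

def project_A(x):
    # Constraint set A: the hyperplane sum(x) = 1.
    return x + (1.0 - x.sum()) / x.size

def project_B(x):
    # Constraint set B: the box [0, 0.6]^n.
    return np.clip(x, 0.0, 0.6)

def rrr(x, beta=0.5, iters=1000):
    """Reflect-reflect-relax: x <- x + beta * (P_B(2 P_A(x) - x) - P_A(x))."""
    for _ in range(iters):
        pa = project_A(x)
        x = x + beta * (project_B(2 * pa - x) - pa)
    return project_A(x)   # read the candidate solution off the A-projection

sol = rrr(np.random.default_rng(0).normal(size=4))
print("solution :", np.round(sol, 4))
print("sum      :", round(float(sol.sum()), 6))   # ~1.0, i.e. on A
print("max coord:", round(float(sol.max()), 6))   # <= 0.6 up to tolerance, i.e. in B
```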
[AI-20] Jolt Atlas: Verifiable Inference via Lookup Arguments in Zero Knowledge
【速读】:This paper addresses trustworthy verification of inference for generative AI models, i.e., producing efficient, succinct, verifiable zero-knowledge proofs of a model's outputs without revealing model parameters or input data. Existing zkML frameworks generally rely on zero-knowledge virtual machines (zkVMs) that must emulate CPU instruction execution, incurring large overheads and low efficiency. The key innovation of Jolt Atlas is to abandon the traditional zkVM architecture and instead build on the Jolt proving system, mapping ONNX (Open Neural Network Exchange) tensor operations directly to lookup tables and constructing efficient lookup arguments on the sumcheck protocol, which suits non-linear functions and other key building blocks of modern machine learning. Further key techniques include exploiting the portability of the ONNX format and its computational model to simplify memory-consistency verification, shrinking lookup tables without accuracy loss via optimizations such as neural teleportation, supporting streaming proving for memory-constrained environments, and achieving the zero-knowledge property through the BlindFold technique. As a result, Jolt Atlas can cryptographically verify model inference on ordinary devices, making it well suited to privacy-sensitive and adversarial applications.
链接: https://arxiv.org/abs/2602.17452
作者: Wyatt Benno,Alberto Centelles,Antoine Douchet,Khalil Gibran
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:We present Jolt Atlas, a zero-knowledge machine learning (zkML) framework that extends the Jolt proving system to model inference. Unlike zkVMs (zero-knowledge virtual machines), which emulate CPU instruction execution, Jolt Atlas adapts Jolt's lookup-centric approach and applies it directly to ONNX tensor operations. The ONNX computational model eliminates the need for CPU registers and simplifies memory consistency verification. In addition, ONNX is an open-source, portable format, which makes it easy to share and deploy models across different frameworks, hardware platforms, and runtime environments without requiring framework-specific conversions. Our lookup arguments, which use the sumcheck protocol, are well-suited for non-linear functions – key building blocks in modern ML. We apply optimisations such as neural teleportation to reduce the size of lookup tables while preserving model accuracy, as well as several tensor-level verification optimisations detailed in this paper. We demonstrate that Jolt Atlas can prove model inference in memory-constrained environments – a prover property commonly referred to as streaming. Furthermore, we discuss how Jolt Atlas achieves zero-knowledge through the BlindFold technique, as introduced in Vega. In contrast to existing zkML frameworks, we show practical proving times for classification, embedding, automated reasoning, and small language models. Jolt Atlas enables cryptographic verification that can be run on-device, without specialised hardware. The resulting proofs are succinctly verifiable. This makes Jolt Atlas well-suited for privacy-centric and adversarial environments. In a companion work, we outline various use cases of Jolt Atlas, including how it serves as guardrails in agentic commerce and for trustless AI context (often referred to as "AI memory").
[AI-21] Convergence Analysis of Two-Layer Neural Networks under Gaussian Input Masking
【速读】:This paper addresses the training convergence of two-layer neural networks with Gaussian randomly masked inputs, a setting common in practice: input-level dropout, noisy-input training in sensor networks, privacy-preserving training, and federated learning. The key to the solution is a Neural Tangent Kernel (NTK) analysis proving that, in this setting, a two-layer network with ReLU activations achieves linear convergence up to an error region proportional to the mask's variance; a key technical breakthrough, of independent interest, is handling the randomness inside the non-linear activation.
链接: https://arxiv.org/abs/2602.17423
作者: Afroditi Kolomvaki,Fangshuo Liao,Evan Dramko,Ziyun Guang,Anastasios Kyrillidis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Optimization and Control (math.OC)
备注: 69 pages, submitted to AI/ML Journal
Abstract:We investigate the convergence guarantee of two-layer neural network training with Gaussian randomly masked inputs. This scenario corresponds to Gaussian dropout at the input level, or noisy input training common in sensor networks, privacy-preserving training, and federated learning, where each user may have access to partial or corrupted features. Using a Neural Tangent Kernel (NTK) analysis, we demonstrate that training a two-layer ReLU network with Gaussian randomly masked inputs achieves linear convergence up to an error region proportional to the mask’s variance. A key technical contribution is resolving the randomness within the non-linear activation, a problem of independent interest.
[AI-22] A Privacy by Design Framework for Large Language Model-Based Applications for Children
【速读】:This paper addresses the privacy risks children face when using generative AI technologies, particularly the difficulty developers have in effectively implementing protections when operationalizing existing privacy regulations (GDPR, PIPEDA, COPPA). The key to the solution is a systematic Privacy-by-Design (PbD) framework that maps the core principles of the EU's GDPR, Canada's PIPEDA, and the US's COPPA onto the lifecycle of applications built on Large Language Models (LLMs), covering data collection, model training, operational monitoring, and ongoing validation, and combines them with child-rights standards such as the UN Convention on the Rights of the Child (UNCRC) and the UK's Age-Appropriate Design Code (AADC) to derive child-oriented AI design guidelines. By pairing technical and organizational controls with age-appropriate design decisions across the whole LLM lifecycle, the framework can markedly reduce privacy risks while ensuring legal compliance.
链接: https://arxiv.org/abs/2602.17418
作者: Diana Addae,Diana Rogachova,Nafiseh Kahani,Masoud Barati,Michael Christensen,Chen Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Children are increasingly using technologies powered by Artificial Intelligence (AI). However, there are growing concerns about privacy risks, particularly for children. Although existing privacy regulations require companies and organizations to implement protections, doing so can be challenging in practice. To address this challenge, this article proposes a framework based on Privacy-by-Design (PbD), which guides designers and developers to take on a proactive and risk-averse approach to technology design. Our framework includes principles from several privacy regulations, such as the General Data Protection Regulation (GDPR) from the European Union, the Personal Information Protection and Electronic Documents Act (PIPEDA) from Canada, and the Children’s Online Privacy Protection Act (COPPA) from the United States. We map these principles to various stages of applications that use Large Language Models (LLMs), including data collection, model training, operational monitoring, and ongoing validation. For each stage, we discuss the operational controls found in the recent academic literature to help AI service providers and developers reduce privacy risks while meeting legal standards. In addition, the framework includes design guidelines for children, drawing from the United Nations Convention on the Rights of the Child (UNCRC), the UK’s Age-Appropriate Design Code (AADC), and recent academic research. To demonstrate how this framework can be applied in practice, we present a case study of an LLM-based educational tutor for children under 13. Through our analysis and the case study, we show that by using data protection strategies such as technical and organizational controls and making age-appropriate design decisions throughout the LLM life cycle, we can support the development of AI applications for children that provide privacy protections and comply with legal requirements.
[AI-23] A Contrastive Variational AutoEncoder for NSCLC Survival Prediction with Missing Modalities
【速读】:This paper addresses the lack of robustness in survival prediction for non-small cell lung cancer (NSCLC) patients when multimodal data (such as whole-slide images, bulk transcriptomics, and DNA methylation) are severely incomplete in real clinical datasets. The core innovations of the proposed Multimodal Contrastive Variational AutoEncoder (MCVAE) are: modality-specific variational encoders that capture the uncertainty of each data source, and a fusion bottleneck with learned gating that dynamically normalizes the contributions of whichever modalities are available; a multi-task objective combining survival and reconstruction losses regularizes patient representations, a cross-modal contrastive loss enforces modality alignment in the latent space, and stochastic modality masking during training improves robustness to arbitrary missingness patterns.
链接: https://arxiv.org/abs/2602.17402
作者: Michele Zanitti,Vanja Miskovic,Francesco Trovò,Alessandra Laura Giulia Pedrocchi,Ming Shen,Yan Kyaw Tun,Arsela Prelaj,Sokol Kosta
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at The 13th IEEE International Conference on Big Data (IEEE BigData 2025)
Abstract:Predicting survival outcomes for non-small cell lung cancer (NSCLC) patients is challenging due to the different individual prognostic features. This task can benefit from the integration of whole-slide images, bulk transcriptomics, and DNA methylation, which offer complementary views of the patient’s condition at diagnosis. However, real-world clinical datasets are often incomplete, with entire modalities missing for a significant fraction of patients. State-of-the-art models rely on available data to create patient-level representations or use generative models to infer missing modalities, but they lack robustness in cases of severe missingness. We propose a Multimodal Contrastive Variational AutoEncoder (MCVAE) to address this issue: modality-specific variational encoders capture the uncertainty in each data source, and a fusion bottleneck with learned gating mechanisms is introduced to normalize the contributions from present modalities. We propose a multi-task objective that combines survival loss and reconstruction loss to regularize patient representations, along with a cross-modal contrastive loss that enforces cross-modal alignment in the latent space. During training, we apply stochastic modality masking to improve the robustness to arbitrary missingness patterns. Extensive evaluations on the TCGA-LUAD (n=475) and TCGA-LUSC (n=446) datasets demonstrate the efficacy of our approach in predicting disease-specific survival (DSS) and its robustness to severe missingness scenarios compared to two state-of-the-art models. Finally, we bring some clarifications on multimodal integration by testing our model on all subsets of modalities, finding that integration is not always beneficial to the task.
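One ingredient that translates directly into code is the gated fusion over whichever modalities are present. A minimal numpy sketch, with made-up embeddings and scalar gate logits standing in for MCVAE's learned components:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
# Hypothetical per-modality patient embeddings (WSI, RNA, methylation).
z = {"wsi": rng.normal(size=d), "rna": rng.normal(size=d), "meth": rng.normal(size=d)}
gate_logits = {"wsi": 0.8, "rna": 1.2, "meth": 0.3}   # assumed learned scalars

def fuse(z, gate_logits, present):
    """Softmax-gate only the *present* modalities so weights always sum to 1."""
    names = [m for m in z if present[m]]
    logits = np.array([gate_logits[m] for m in names])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    fused = sum(wi * z[m] for wi, m in zip(w, names))
    return fused, dict(zip(names, np.round(w, 3)))

_, w_full = fuse(z, gate_logits, {"wsi": True, "rna": True, "meth": True})
_, w_miss = fuse(z, gate_logits, {"wsi": True, "rna": False, "meth": True})
print("all modalities:", w_full)
print("rna missing  :", w_miss)   # remaining gates renormalised automatically
```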
[AI-24] Voice-Driven Semantic Perception for UAV-Assisted Emergency Networks
【速读】:This paper addresses the difficulty of directly exploiting voice communications for automated management of Unmanned Aerial Vehicle (UAV)-assisted networks in emergency response: traditional voice traffic is unstructured and cannot be seamlessly integrated with network-management systems. The key to the proposed SIREN framework is the fusion of automatic speech recognition (ASR), Large Language Model (LLM)-based semantic extraction, and natural language processing (NLP) validation to convert emergency voice traffic into structured, machine-readable information, including responding units, location references, emergency severity, and quality-of-service (QoS) requirements, thereby enabling voice-driven situational awareness and providing human-in-the-loop decision support and adaptive management for UAV-assisted networks.
链接: https://arxiv.org/abs/2602.17394
作者: Nuno Saavedra,Pedro Ribeiro,André Coelho,Rui Campos
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: 7 pages, 4 figures
Abstract:Unmanned Aerial Vehicle (UAV)-assisted networks are increasingly foreseen as a promising approach for emergency response, providing rapid, flexible, and resilient communications in environments where terrestrial infrastructure is degraded or unavailable. In such scenarios, voice radio communications remain essential for first responders due to their robustness; however, their unstructured nature prevents direct integration with automated UAV-assisted network management. This paper proposes SIREN, an AI-driven framework that enables voice-driven perception for UAV-assisted networks. By integrating Automatic Speech Recognition (ASR) with Large Language Model (LLM)-based semantic extraction and Natural Language Processing (NLP) validation, SIREN converts emergency voice traffic into structured, machine-readable information, including responding units, location references, emergency severity, and Quality-of-Service (QoS) requirements. SIREN is evaluated using synthetic emergency scenarios with controlled variations in language, speaker count, background noise, and message complexity. The results demonstrate robust transcription and reliable semantic extraction across diverse operating conditions, while highlighting speaker diarization and geographic ambiguity as the main limiting factors. These findings establish the feasibility of voice-driven situational awareness for UAV-assisted networks and show a practical foundation for human-in-the-loop decision support and adaptive network management in emergency response operations.
[AI-25] Dataless Weight Disentanglement in Task Arithmetic via Kronecker-Factored Approximate Curvature ICLR2026
【速读】:This paper addresses representation drift caused by cross-task interference when composing task vectors, with the goal of stabilizing the performance of foundation models under task addition and negation. The key to the solution is to frame regularization against representation drift as a curvature-matrix approximation problem and then apply Kronecker-Factored Approximate Curvature (KFAC), yielding a regularizer that requires no external task data and providing a modular, scalable adaptation mechanism that is robust to task-vector rescaling.
链接: https://arxiv.org/abs/2602.17385
作者: Angelo Porrello,Pietro Buzzega,Felix Dangel,Thomas Sommariva,Riccardo Salami,Lorenzo Bonicelli,Simone Calderara
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to ICLR 2026
Abstract:Task Arithmetic yields a modular, scalable way to adapt foundation models. Combining multiple task vectors, however, can lead to cross-task interference, causing representation drift and degraded performance. Representation drift regularization provides a natural remedy to disentangle task vectors; however, existing approaches typically require external task data, conflicting with modularity and data availability constraints (e.g., privacy requirements). We propose a dataless approach by framing regularization against representation drift as a curvature matrix approximation problem. This allows us to leverage well-established techniques; in particular, we adopt Kronecker-Factored Approximate Curvature and obtain a practical regularizer that achieves state-of-the-art results in task addition and negation. Our method has constant complexity in the number of tasks and promotes robustness to task vector rescaling, eliminating the need for held-out tuning.
[AI-26] A feature-stable and explainable machine learning framework for trustworthy decision-making under incomplete clinical data
【速读】:This paper addresses the barriers that limit machine learning models in high-stakes biomedical settings: poor robustness, insufficient interpretability, and feature instability under realistic data perturbations such as missing values, which undermine clinical trust. The key to the solution is the explainable machine learning framework CACTUS (Comprehensive Abstraction and Classification Tool for Uncovering Structures), which integrates feature abstraction, interpretable classification, and systematic feature-stability analysis to quantify how consistently informative features behave as data quality degrades. Experiments show that on a small, heterogeneous, and incomplete clinical dataset, CACTUS not only maintains competitive or superior predictive performance but also markedly improves the stability of top-ranked features, offering a dimension for assessing model trustworthiness beyond conventional performance metrics.
链接: https://arxiv.org/abs/2602.17364
作者: Justyna Andrys-Olek,Paulina Tworek,Luca Gherardini,Mark W. Ruddock,Mary Jo Kurt,Peter Fitzgerald,Jose Sousa
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Machine learning models are increasingly applied to biomedical data, yet their adoption in high stakes domains remains limited by poor robustness, limited interpretability, and instability of learned features under realistic data perturbations, such as missingness. In particular, models that achieve high predictive performance may still fail to inspire trust if their key features fluctuate when data completeness changes, undermining reproducibility and downstream decision-making. Here, we present CACTUS (Comprehensive Abstraction and Classification Tool for Uncovering Structures), an explainable machine learning framework explicitly designed to address these challenges in small, heterogeneous, and incomplete clinical datasets. CACTUS integrates feature abstraction, interpretable classification, and systematic feature stability analysis to quantify how consistently informative features are preserved as data quality degrades. Using a real-world haematuria cohort comprising 568 patients evaluated for bladder cancer, we benchmark CACTUS against widely used machine learning approaches, including random forests and gradient boosting methods, under controlled levels of randomly introduced missing data. We demonstrate that CACTUS achieves competitive or superior predictive performance while maintaining markedly higher stability of top-ranked features as missingness increases, including in sex-stratified analyses. Our results show that feature stability provides information complementary to conventional performance metrics and is essential for assessing the trustworthiness of machine learning models applied to biomedical data. By explicitly quantifying robustness to missing data and prioritising interpretable, stable features, CACTUS offers a generalizable framework for trustworthy data-driven decision support.
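The feature-stability idea, checking whether top-ranked features survive increasing missingness, can be illustrated independently of CACTUS itself. A small sketch using a simple correlation-based ranking (an assumption; the paper's abstraction and classifier are more involved):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 400, 20, 5
X = rng.normal(size=(n, p))
# Synthetic binary outcome driven by the first three features.
y = (X[:, 0] + 0.8 * X[:, 1] - 0.6 * X[:, 2] + rng.normal(size=n) > 0).astype(int)

def top_k_features(X, y, k):
    """Rank features by |correlation with y| computed on observed entries only."""
    scores = []
    for j in range(X.shape[1]):
        col = X[:, j]
        ok = ~np.isnan(col)
        scores.append(abs(np.corrcoef(col[ok], y[ok])[0, 1]))
    return set(np.argsort(scores)[-k:])

baseline = top_k_features(X, y, k)
for rate in (0.1, 0.3, 0.5):
    Xm = X.copy()
    Xm[rng.random(X.shape) < rate] = np.nan    # inject missingness at random
    overlap = len(baseline & top_k_features(Xm, y, k)) / k
    print(f"missingness {rate:.0%}: top-{k} stability = {overlap:.2f}")
```

A stability score that stays near 1.0 as missingness grows is the behaviour the paper argues should complement raw predictive metrics.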
[AI-27] What Breaks Embodied AI Security: LLM Vulnerabilities, CPS Flaws, or Something Else?
【速读】:This paper addresses the difficulty of safely deploying embodied AI systems (such as autonomous vehicles, service robots, and LLM-driven interactive agents) in the real world, arguing that analyses framed only around Large Language Model (LLM) vulnerabilities or classical cyber-physical system (CPS) failures cannot adequately explain many system-level breakdowns observed in practice. The key of the solution is a new conceptual framing: many failures stem from embodiment-induced system-level mismatches rather than isolated model flaws or traditional attacks, supported by four core insights: semantic correctness does not imply physical safety; identical actions produce drastically different outcomes under nonlinear dynamics and state uncertainty; small errors propagate and amplify across the perception-decision-action loop; and safety properties are not compositional, so locally safe decisions can accumulate into global risk. The paper therefore advocates moving from component-level defenses to system-level reasoning about physical risk, uncertainty, and failure propagation to secure embodied AI.
链接: https://arxiv.org/abs/2602.17345
作者: Boyang Ma,Hechuan Guo,Peizhuo Lv,Minghui Xu,Xuelong Dai,YeChao Zhang,Yijun Yang,Yue Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Embodied AI systems (e.g., autonomous vehicles, service robots, and LLM-driven interactive agents) are rapidly transitioning from controlled environments to safety critical real-world deployments. Unlike disembodied AI, failures in embodied intelligence lead to irreversible physical consequences, raising fundamental questions about security, safety, and reliability. While existing research predominantly analyzes embodied AI through the lenses of Large Language Model (LLM) vulnerabilities or classical Cyber-Physical System (CPS) failures, this survey argues that these perspectives are individually insufficient to explain many observed breakdowns in modern embodied systems. We posit that a significant class of failures arises from embodiment-induced system-level mismatches, rather than from isolated model flaws or traditional CPS attacks. Specifically, we identify four core insights that explain why embodied AI is fundamentally harder to secure: (i) semantic correctness does not imply physical safety, as language-level reasoning abstracts away geometry, dynamics, and contact constraints; (ii) identical actions can lead to drastically different outcomes across physical states due to nonlinear dynamics and state uncertainty; (iii) small errors propagate and amplify across tightly coupled perception-decision-action loops; and (iv) safety is not compositional across time or system layers, enabling locally safe decisions to accumulate into globally unsafe behavior. These insights suggest that securing embodied AI requires moving beyond component-level defenses toward system-level reasoning about physical risk, uncertainty, and failure propagation.
[AI-28] From Subtle to Significant: Prompt-Driven Self-Improving Optimization in Test-Time Graph OOD Detection
【速读】:This paper addresses how to effectively detect whether a test graph deviates from the training distribution (graph out-of-distribution, OOD) when graph neural networks (GNNs) are deployed in open-world scenarios. Most existing methods adopt a one-pass inference paradigm and cannot iteratively correct erroneous predictions to amplify OOD signals. The key of the proposed unsupervised framework SIGOOD (Self-Improving Graph Out-of-Distribution detector) is to combine continual self-learning with test-time training: it first generates a prompt to construct a prompt-enhanced graph that amplifies potential OOD signals, and introduces an Energy Preference Optimization (EPO) loss that exploits energy differences between the original test graph and the prompt-enhanced graph to optimize the prompt; the prompt is then iteratively refined in a self-improving loop by feeding it into the detection model, and the resulting optimal prompt-enhanced graph is finally used for more accurate OOD detection.
链接: https://arxiv.org/abs/2602.17342
作者: Luzhi Wang,Xuanshuo Fu,He Zhang,Chuang Liu,Xiaobao Wang,Hongbo Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures
Abstract:Graph Out-of-Distribution (OOD) detection aims to identify whether a test graph deviates from the distribution of graphs observed during training, which is critical for ensuring the reliability of Graph Neural Networks (GNNs) when deployed in open-world scenarios. Recent advances in graph OOD detection have focused on test-time training techniques that facilitate OOD detection without accessing potential supervisory information (e.g., training data). However, most of these methods employ a one-pass inference paradigm, which prevents them from progressively correcting erroneous predictions to amplify OOD signals. To this end, we propose a Self-Improving Graph Out-of-Distribution detector (SIGOOD), which is an unsupervised framework that integrates continuous self-learning with test-time training for effective graph OOD detection. Specifically, SIGOOD generates a prompt to construct a prompt-enhanced graph that amplifies potential OOD signals. To optimize prompts, SIGOOD introduces an Energy Preference Optimization (EPO) loss, which leverages energy variations between the original test graph and the prompt-enhanced graph. By iteratively optimizing the prompt by involving it into the detection model in a self-improving loop, the resulting optimal prompt-enhanced graph is ultimately used for OOD detection. Comprehensive evaluations on 21 real-world datasets confirm the effectiveness and outperformance of our SIGOOD method. The code is at this https URL.
[AI-29] SubQuad: Near-Quadratic-Free Structure Inference with Distribution-Balanced Objectives in Adaptive Receptor framework
【速读】:This paper addresses two practical bottlenecks in the population-scale comparative analysis of adaptive immune repertoires: the near-quadratic cost of pairwise affinity evaluation, and dataset imbalance that masks clinically important rare clonotypes. The key of the proposed end-to-end SubQuad pipeline is the combination of antigen-aware near-subquadratic retrieval, GPU-accelerated affinity kernels, learned multimodal fusion, and fairness-constrained clustering. The system uses compact MinHash prefiltering to sharply reduce candidate comparisons, introduces a differentiable gating module that adaptively weights alignment and embedding channels per pair, and applies an automated calibration routine that enforces proportional representation of rare antigen-specific subgroups. This co-design of indexing, similarity fusion, and equity-aware objectives yields a scalable, bias-aware platform for repertoire mining, improving throughput and memory efficiency while preserving or improving recall@k, cluster purity, and subgroup fairness.
链接: https://arxiv.org/abs/2602.17330
作者: Rong Fu,Zijian Zhang,Wenxin Zhang,Kun Liu,Jiekai Wu,Xianda Li,Simon Fong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 27 pages, 9 figures
Abstract:Comparative analysis of adaptive immune repertoires at population scale is hampered by two practical bottlenecks: the near-quadratic cost of pairwise affinity evaluations and dataset imbalances that obscure clinically important minority clonotypes. We introduce SubQuad, an end-to-end pipeline that addresses these challenges by combining antigen-aware, near-subquadratic retrieval with GPU-accelerated affinity kernels, learned multimodal fusion, and fairness-constrained clustering. The system employs compact MinHash prefiltering to sharply reduce candidate comparisons, a differentiable gating module that adaptively weights complementary alignment and embedding channels on a per-pair basis, and an automated calibration routine that enforces proportional representation of rare antigen-specific subgroups. On large viral and tumor repertoires SubQuad achieves measured gains in throughput and peak memory usage while preserving or improving recall@k, cluster purity, and subgroup equity. By co-designing indexing, similarity fusion, and equity-aware objectives, SubQuad offers a scalable, bias-aware platform for repertoire mining and downstream translational tasks such as vaccine target prioritization and biomarker discovery.
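MinHash prefiltering is the one component of SubQuad that is easy to show in a few lines: near-duplicate sequences collide on most signature positions, so only high-estimate pairs proceed to the expensive affinity kernel. The toy CDR3-like strings and 3-mer shingles below are illustrative assumptions, not the paper's features.

```python
import random

def make_hashers(n, seed=0):
    """n independent salted hash functions over tokens."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(n)]
    return [lambda t, s=s: hash((s, t)) for s in salts]

def minhash_signature(tokens, hashers):
    return tuple(min(h(t) for t in tokens) for h in hashers)

def estimated_jaccard(sig_a, sig_b):
    # Fraction of matching signature slots estimates the Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def kmers(seq, k=3):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

hashers = make_hashers(64)
a = minhash_signature(kmers("CASSLGQAYEQYF"), hashers)
b = minhash_signature(kmers("CASSLGQGYEQYF"), hashers)   # near-duplicate of a
c = minhash_signature(kmers("CAWSVGDEQFF"), hashers)     # unrelated sequence
print("a~b:", estimated_jaccard(a, b))   # high -> keep for exact comparison
print("a~c:", estimated_jaccard(a, c))   # low  -> pruned before affinity kernel
```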
[AI-30] Flickering Multi-Armed Bandits
【速读】:This paper addresses a complex decision problem in the multi-armed bandit (MAB) framework in which the set of arms (actions) changes dynamically over time and the currently available arms depend on the previously selected arm, termed Flickering Multi-Armed Bandits (FMAB); the core challenge is balancing exploration and exploitation efficiently under local-move constraints. The key of the solution is a two-phase algorithm: the first phase explores via a lazy random walk to identify the optimal arm at low cost, and the second phase achieves stable exploitation of that arm through navigation and commitment. Theoretical analysis shows that the algorithm attains sublinear regret bounds, both with high probability and in expectation, under two random-graph models (an i.i.d. Erdős–Rényi process and an Edge-Markovian process), and a matching information-theoretic lower bound establishes the near-optimality of the exploration cost, highlighting the fundamental cost of exploration under local-move constraints.
链接: https://arxiv.org/abs/2602.17315
作者: Sourav Chakraborty,Amit Kiran Rege,Claire Monteleoni,Lijun Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce Flickering Multi-Armed Bandits (FMAB), a new MAB framework where the set of available arms (or actions) can change at each round, and the available set at any time may depend on the agent’s previously selected arm. We model this constrained, evolving availability using random graph processes, where arms are nodes and the agent’s movement is restricted to its local neighborhood. We analyze this problem under two random graph models: an i.i.d. Erdős–Rényi (ER) process and an Edge-Markovian process. We propose and analyze a two-phase algorithm that employs a lazy random walk for exploration to efficiently identify the optimal arm, followed by a navigation and commitment phase for exploitation. We establish high-probability and expected sublinear regret bounds for both graph settings. We show that the exploration cost of our algorithm is near-optimal by establishing a matching information-theoretic lower bound for this problem class, highlighting the fundamental cost of exploration under local-move constraints. We complement our theoretical guarantees with numerical simulations, including a scenario of a robotic ground vehicle scouting a disaster-affected region.
[AI-31] MedClarify: An information-seeking AI agent for medical diagnosis with case-specific follow-up questions
【速读】:This paper addresses the lack of effective information-seeking in current medical Large Language Models (LLMs) for diagnostic decision-making: when patient information is incomplete or ambiguous, models struggle to narrow the differential diagnosis through iterative reasoning and to reduce the risk of misdiagnosis. The key of the solution is the AI agent MedClarify, which computes a list of candidate diagnoses (mimicking a clinical differential diagnosis) on information-theoretic principles and proactively generates the follow-up question with the highest expected information gain, enabling targeted, uncertainty-aware reasoning. Experiments show that, compared with a standard single-shot LLM baseline, MedClarify reduces diagnostic error by about 27 percentage points, markedly improving the diagnostic accuracy and interpretability of medical LLMs in complex clinical situations.
链接: https://arxiv.org/abs/2602.17308
作者: Hui Min Wong,Philip Heesen,Pascal Janetzky,Martin Bendszus,Stefan Feuerriegel
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) are increasingly used for diagnostic tasks in medicine. In clinical practice, the correct diagnosis can rarely be immediately inferred from the initial patient presentation alone. Rather, reaching a diagnosis often involves systematic history taking, during which clinicians reason over multiple potential conditions through iterative questioning to resolve uncertainty. This process requires considering differential diagnoses and actively excluding emergencies that demand immediate intervention. Yet, the ability of medical LLMs to generate informative follow-up questions and thus reason over differential diagnoses remains underexplored. Here, we introduce MedClarify, an AI agent for information-seeking that can generate follow-up questions for iterative reasoning to support diagnostic decision-making. Specifically, MedClarify computes a list of candidate diagnoses analogous to a differential diagnosis, and then proactively generates follow-up questions aimed at reducing diagnostic uncertainty. By selecting the question with the highest expected information gain, MedClarify enables targeted, uncertainty-aware reasoning to improve diagnostic performance. In our experiments, we first demonstrate the limitations of current LLMs in medical reasoning, which often yield multiple, similarly likely diagnoses, especially when patient cases are incomplete or relevant information for diagnosis is missing. We then show that our information-theoretic reasoning approach can generate effective follow-up questioning and thereby reduces diagnostic errors by ~27 percentage points (p.p.) compared to a standard single-shot LLM baseline. Altogether, MedClarify offers a path to improve medical LLMs through agentic information-seeking and to thus promote effective dialogues with medical LLMs that reflect the iterative and uncertain nature of real-world clinical reasoning.
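Selecting the follow-up question with the highest expected information gain is standard information theory: EIG(q) = H(prior) − Σ_a P(a) · H(posterior | a). A self-contained sketch with invented probabilities (not MedClarify's actual estimates):

```python
import math

def entropy(p):
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def expected_information_gain(prior, question):
    """question maps each answer to (P(answer), posterior over diagnoses)."""
    return entropy(prior) - sum(
        p_ans * entropy(post) for p_ans, post in question.values()
    )

# Toy differential over three diagnoses (hypothetical numbers).
prior = {"pneumonia": 0.40, "pulmonary embolism": 0.35, "bronchitis": 0.25}

questions = {
    "pleuritic chest pain?": {
        "yes": (0.5, {"pneumonia": 0.30, "pulmonary embolism": 0.60, "bronchitis": 0.10}),
        "no":  (0.5, {"pneumonia": 0.50, "pulmonary embolism": 0.10, "bronchitis": 0.40}),
    },
    "recent travel?": {
        "yes": (0.2, {"pneumonia": 0.35, "pulmonary embolism": 0.45, "bronchitis": 0.20}),
        "no":  (0.8, {"pneumonia": 0.41, "pulmonary embolism": 0.33, "bronchitis": 0.26}),
    },
}

for q, spec in questions.items():
    print(f"{q:24s} EIG = {expected_information_gain(prior, spec):.3f} bits")
best = max(questions, key=lambda q: expected_information_gain(prior, questions[q]))
print("ask:", best)   # the question that most sharpens the differential
```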
[AI-32] Federated Latent Space Alignment for Multi-user Semantic Communications
【速读】:This paper addresses semantic mismatches in multi-agent AI-native semantic communications caused by differences in the latent representations of different devices, which hinder effective task execution. The key of the solution is a protocol in which a semantic pre-equalizer cooperates with local semantic equalizers: in a downlink scenario, the access point (AP) deploys a shared semantic pre-equalizer while each user runs a local semantic equalizer, and decentralized training via federated optimization improves semantic alignment and goal-oriented communication performance under power and complexity constraints.
链接: https://arxiv.org/abs/2602.17271
作者: Giuseppe Di Poce,Mario Edoardo Pandolfo,Emilio Calvanese Strinati,Paolo Di Lorenzo
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
备注:
Abstract:Semantic communication aims to convey meaning for effective task execution, but differing latent representations in AI-native devices can cause semantic mismatches that hinder mutual understanding. This paper introduces a novel approach to mitigating latent space misalignment in multi-agent AI-native semantic communications. In a downlink scenario, we consider an access point (AP) communicating with multiple users to accomplish a specific AI-driven task. Our method implements a protocol that shares a semantic pre-equalizer at the AP and local semantic equalizers at user devices, fostering mutual understanding and task-oriented communication while considering power and complexity constraints. To achieve this, we employ a federated optimization for the decentralized training of the semantic equalizers at the AP and user sides. Numerical results validate the proposed approach in goal-oriented semantic communication, revealing key trade-offs among accuracy, communication overhead, complexity, and the semantic proximity of AI-native communication devices.
[AI-33] Web Verbs: Typed Abstractions for Reliable Task Composition on the Agentic Web
【速读】:This paper addresses the brittleness, inefficiency, and poor verifiability of current web agents that execute goal-directed tasks through low-level operations such as clicks and keystrokes; lacking semantic abstractions of web interactions, agents struggle to reliably discover, compose, and execute complex tasks. The key of the solution is Web Verbs: an extensible, web-scale set of typed, semantically documented functions that expose site capabilities through a uniform interface (whether backed by APIs or client-side workflows) and serve as stable, composable primitives that agents can call. Web Verbs support preconditions, postconditions, policy tags, and logging, improving system reliability, efficiency, and verifiability and enabling a shift from browser-level operations to semantic-level control.
链接: https://arxiv.org/abs/2602.17245
作者: Linxi Jiang,Rui Xi,Zhijie Liu,Shuo Chen,Zhiqiang Lin,Suman Nath
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The Web is evolving from a medium that humans browse to an environment where software agents act on behalf of users. Advances in large language models (LLMs) make natural language a practical interface for goal-directed tasks, yet most current web agents operate on low-level primitives such as clicks and keystrokes. These operations are brittle, inefficient, and difficult to verify. Complementing content-oriented efforts such as NLWeb's semantic layer for retrieval, we argue that the agentic web also requires a semantic layer for web actions. We propose Web Verbs, a web-scale set of typed, semantically documented functions that expose site capabilities through a uniform interface, whether implemented through APIs or robust client-side workflows. These verbs serve as stable and composable units that agents can discover, select, and synthesize into concise programs. This abstraction unifies API-based and browser-based paradigms, enabling LLMs to synthesize reliable and auditable workflows with explicit control and data flow. Verbs can carry preconditions, postconditions, policy tags, and logging support, which improves reliability by providing stable interfaces, efficiency by reducing dozens of steps into a few function calls, and verifiability through typed contracts and checkable traces. We present our vision, a proof-of-concept implementation, and representative case studies that demonstrate concise and robust execution compared to existing agents. Finally, we outline a roadmap for standardization to make verbs deployable and trustworthy at web scale.
[AI-34] TAPO-Structured Description Logic for Information Behavior: Procedural and Oracle-Based Extensions ACL
【速读】:This paper addresses the limited expressiveness of traditional description logic (DL) for modeling information behavior, in particular its inability to capture dynamic processes and interaction with external sources, such as information generation, conditional execution, and external validation. The key of the proposed structured extension, TAPO-Structured Description Logic (TAPO–DL), lies in two core components: a Procedural Box (P–Box) for expressing concept-driven, imperative-style programs (such as conditionals and loops), and an Oracle Box (O–Box) that formalizes controlled interaction with external information sources. A unified sheaf-theoretic semantic framework treats local informational states as sections and global coherence as stable structures, so that informational truth is defined as a property that remains stable under repeated agentive interaction, rather than as correspondence to a fixed global state. The framework thereby formalizes the interactive, uncertain, and context-dependent nature of information behavior.
链接: https://arxiv.org/abs/2602.17242
作者: Takao Inoué
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: 10 pages. Introduces TAPO-DL, a structured description logic integrating TBox, ABox, procedural PBox, and oracle-based OBox. Provides formal syntax, semantics, and inference rules, with an application to information behavior modeling
Abstract:We introduce TAPO-Structured Description Logic (TAPO–DL), a formal extension of classical description logic designed to model information behavior as a structured, dynamic process. TAPO–DL extends the standard T–Box/A–Box architecture with two additional layers: a Procedural Box (P–Box), which supports concept-driven, imperative-style programs such as conditional and iterative actions, and an Oracle Box (O–Box), which formalizes controlled interaction with external information sources. While the terminological and assertional components capture static conceptual and factual knowledge, the procedural and oracle-based components enable the explicit representation of information-generating actions and external validation. We provide a unified semantic framework for TAPO–DL based on a co-generative, sheaf-theoretic interpretation, in which local informational states are modeled as sections and informational stability corresponds to the existence of coherent global structures. Within this setting, informational truth is characterized as stability under repeated agentive interaction rather than correspondence to a fixed global state. By integrating description logic with procedural dynamics, oracle-based reasoning, and sheaf-theoretic semantics, TAPO–DL offers a principled formal framework for analyzing information behavior in contexts involving interaction, uncertainty, and contextuality.
[AI-35] All Leaks Count, Some Count More: Interpretable Temporal Contamination Detection in LLM Backtesting
【速读】:This paper addresses temporal knowledge leakage when Large Language Models (LLMs) forecast future events: a model may encode information from after its stated cutoff during training, making its reasoning unreliable and undermining the validity of backtesting. The key of the solution is an interpretable claim-level evaluation framework: model rationales are decomposed into atomic claims categorized by temporal verifiability, and Shapley values quantify each claim's contribution to the final decision, defining the Shapley-weighted Decision-Critical Leakage Rate (Shapley-DCLR). Building on this, Time-Supervised Prediction with Extracted Claims (TimeSPEC) inserts claim verification and regeneration steps into generation to proactively filter temporal contamination, ensuring that all supporting claims derive from information available before the cutoff, thereby enabling reliable backtesting that reduces leakage while preserving task performance.
链接: https://arxiv.org/abs/2602.17234
作者: Zeyu Zhang,Ryan Chen,Bradly C. Stadie
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages plus appendix
Abstract:To evaluate whether LLMs can accurately predict future events, we need the ability to backtest them on events that have already resolved. This requires models to reason only with information available at a specified past date. Yet LLMs may inadvertently leak post-cutoff knowledge encoded during training, undermining the validity of retrospective evaluation. We introduce a claim-level framework for detecting and quantifying this temporal knowledge leakage. Our approach decomposes model rationales into atomic claims and categorizes them by temporal verifiability, then applies Shapley values to measure each claim's contribution to the prediction. This yields the Shapley-weighted Decision-Critical Leakage Rate (Shapley-DCLR), an interpretable metric that captures what fraction of decision-driving reasoning derives from leaked information. Building on this framework, we propose Time-Supervised Prediction with Extracted Claims (TimeSPEC), which interleaves generation with claim verification and regeneration to proactively filter temporal contamination – producing predictions where every supporting claim can be traced to sources available before the cutoff date. Experiments on 350 instances spanning U.S. Supreme Court case prediction, NBA salary estimation, and stock return ranking reveal substantial leakage in standard prompting baselines. TimeSPEC reduces Shapley-DCLR while preserving task performance, demonstrating that explicit, interpretable claim-level verification outperforms prompt-based temporal constraints for reliable backtesting.
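Exact Shapley values over a small set of extracted claims can be computed by enumerating coalitions. The sketch below uses an additive toy value function (the paper's value function would involve re-querying the model with claim subsets) purely to show how a Shapley-weighted leakage rate is assembled; all names and numbers are invented.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values by enumerating all coalitions (fine for few claims)."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        rest = [q for q in players if q != p]
        for r in range(n):
            for S in combinations(rest, r):
                w = factorial(r) * factorial(n - r - 1) / factorial(n)
                phi[p] += w * (value(set(S) | {p}) - value(set(S)))
    return phi

# Claims extracted from a rationale; "post_cutoff" marks leaked knowledge.
claims = {"c1_pre": 0.1, "c2_pre": 0.2, "c3_post_cutoff": 0.6}

def value(coalition):
    # Toy stand-in: confidence the prediction survives with only these claims.
    return sum(claims[c] for c in coalition)

phi = shapley_values(list(claims), value)
leaked = sum(v for c, v in phi.items() if "post_cutoff" in c)
print(phi)
print("decision-critical leakage rate:", round(leaked / sum(phi.values()), 3))
```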
[AI-36] Decoding the Human Factor: High Fidelity Behavioral Prediction for Strategic Foresight
【速读】:This paper addresses the difficulty of accurately predicting human decision-making in high-stakes environments, where Large Language Models (LLMs) perform poorly at generating individual-specific, consistent behavior; existing prompting-based approaches suffer from identity drift and struggle to exploit increasingly detailed persona descriptions. The key of the proposed Large Behavioral Model (LBM) paradigm is to replace transient persona prompting with a structured, high-dimensional embedding of psychological traits: the LBM is fine-tuned on the mapping from stable personality traits, motivational states, and situational constraints, derived from a comprehensive psychometric battery, to observed strategic choices, achieving high-fidelity prediction of individual decisions. Experiments show that, relative to the unadapted Llama-3.1-8B-Instruct model, the LBM markedly improves predictive performance while preserving individual differences, and keeps improving as denser psychometric inputs are provided, demonstrating a scalable approach to behavioral simulation.
链接: https://arxiv.org/abs/2602.17222
作者: Ben Yellin,Ehud Ezra,Mark Foreman,Shula Grinapol
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Predicting human decision-making in high-stakes environments remains a central challenge for artificial intelligence. While large language models (LLMs) demonstrate strong general reasoning, they often struggle to generate consistent, individual-specific behavior, particularly when accurate prediction depends on complex interactions between psychological traits and situational constraints. Prompting-based approaches can be brittle in this setting, exhibiting identity drift and limited ability to leverage increasingly detailed persona descriptions. To address these limitations, we introduce the Large Behavioral Model (LBM), a behavioral foundation model fine-tuned to predict individual strategic choices with high fidelity. LBM shifts from transient persona prompting to behavioral embedding by conditioning on a structured, high-dimensional trait profile derived from a comprehensive psychometric battery. Trained on a proprietary dataset linking stable dispositions, motivational states, and situational constraints to observed choices, LBM learns to map rich psychological profiles to discrete actions across diverse strategic dilemmas. In a held-out scenario evaluation, LBM fine-tuning improves behavioral prediction relative to the unadapted Llama-3.1-8B-Instruct backbone and performs comparably to frontier baselines when conditioned on Big Five traits. Moreover, we find that while prompting-based baselines exhibit a complexity ceiling, LBM continues to benefit from increasingly dense trait profiles, with performance improving as additional trait dimensions are provided. Together, these results establish LBM as a scalable approach for high-fidelity behavioral simulation, enabling applications in strategic foresight, negotiation analysis, cognitive security, and decision support.
[AI-37] Continual learning and refinement of causal models through dynamic predicate invention
【速读】:This paper addresses the sample inefficiency, lack of transparency, and poor scalability of traditional world-modeling approaches in complex environments. The core solution is a framework for constructing symbolic causal world models entirely online, embedding continual model learning and repair into the agent's decision loop and using Meta-Interpretive Learning with predicate invention to automatically discover semantically meaningful, reusable abstractions, thereby building a hierarchy of disentangled, high-quality concepts from observations. The method scales well to domains with complex relational dynamics and achieves sample efficiency orders of magnitude higher than a PPO-based neural-network baseline.
链接: https://arxiv.org/abs/2602.17217
作者: Enrique Crespo-Fernandez,Oliver Ray,Telmo de Menezes e Silva Filho,Peter Flach
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Efficiently navigating complex environments requires agents to internalize the underlying logic of their world, yet standard world modelling methods often struggle with sample inefficiency, lack of transparency, and poor scalability. We propose a framework for constructing symbolic causal world models entirely online by integrating continuous model learning and repair into the agent’s decision loop, by leveraging the power of Meta-Interpretive Learning and predicate invention to find semantically meaningful and reusable abstractions, allowing an agent to construct a hierarchy of disentangled, high-quality concepts from its observations. We demonstrate that our lifted inference approach scales to domains with complex relational dynamics, where propositional methods suffer from combinatorial explosion, while achieving sample-efficiency orders of magnitude higher than the established PPO neural-network-based baseline.
[AI-38] Robustness and Reasoning Fidelity of Large Language Models in Long-Context Code Question Answering
【速读】:This paper addresses the robustness of Large Language Models (LLMs) when handling long code contexts, i.e., how stable model performance remains under varying input conditions (changes in answer format, introduction of distractors, and scaling of context size). The key of the solution is a set of systematic controlled ablations that test sensitivity to answer format, distracting information, and context scale, together with an extension of the LongCodeBench Python dataset with new COBOL and Java question sets, covering three representative settings: (i) multiple-choice questions with shuffled options, (ii) open-ended question answering, and (iii) needle-in-a-haystack contexts containing both relevant and adversarially irrelevant information. The results reveal significant performance degradation under these conditions and brittle behavior in the presence of irrelevant cues, exposing the limitations of current long-context evaluation and providing a broader benchmark for assessing code reasoning across languages and architectures.
链接: https://arxiv.org/abs/2602.17183
作者: Kishan Maharaj,Nandakishore Menon,Ashita Saxena,Srikanth Tamilselvam
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 11 pages, 4 Figures, 5 Tables, Work in Progress
Abstract:Large language models (LLMs) increasingly assist software engineering tasks that require reasoning over long code contexts, yet their robustness under varying input conditions remains unclear. We conduct a systematic study of long-context code question answering using controlled ablations that test sensitivity to answer format, distractors, and context scale. Extending LongCodeBench Python dataset with new COBOL and Java question-answer sets, we evaluate state-of-the-art models under three settings: (i) shuffled multiple-choice options, (ii) open-ended questions and (iii) needle-in-a-haystack contexts containing relevant and adversarially irrelevant information. Results show substantial performance drops in both shuffled multiple-choice options and open-ended questions, and brittle behavior in the presence of irrelevant cues. Our findings highlight limitations of current long-context evaluations and provide a broader benchmark for assessing code reasoning in both legacy and modern systems.
[AI-39] Continual uncertainty learning
【速读】:This paper addresses robust control of mechanical systems under multiple superimposed uncertainties, especially when nonlinear dynamics and operating-condition variations are intricately intertwined, a regime in which deep reinforcement learning (DRL) tends to yield sub-optimal policies and poor learning efficiency if all uncertainties are handled at once. The key of the solution is a curriculum-based continual learning framework that decomposes the complex multi-uncertainty control problem into a sequence of continual learning tasks, progressively acquiring strategies for each uncertainty; the original system is extended into a set of plants whose dynamic uncertainties are gradually expanded, the policy is updated stably across the task-specific plant sets without catastrophic forgetting, and a model-based controller (MBC) that guarantees a shared baseline performance is incorporated in a residual-learning scheme, markedly improving sample efficiency and sim-to-real transfer. The approach is demonstrated on active vibration control for automotive powertrains.
链接: https://arxiv.org/abs/2602.17174
作者: Heisei Yonezawa,Ansei Yonezawa,Itsuro Kajiwara
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:Robust control of mechanical systems with multiple uncertainties remains a fundamental challenge, particularly when nonlinear dynamics and operating-condition variations are intricately intertwined. While deep reinforcement learning (DRL) combined with domain randomization has shown promise in mitigating the sim-to-real gap, simultaneously handling all sources of uncertainty often leads to sub-optimal policies and poor learning efficiency. This study formulates a new curriculum-based continual learning framework for robust control problems involving nonlinear dynamical systems in which multiple sources of uncertainty are simultaneously superimposed. The key idea is to decompose a complex control problem with multiple uncertainties into a sequence of continual learning tasks, in which strategies for handling each uncertainty are acquired sequentially. The original system is extended into a finite set of plants whose dynamic uncertainties are gradually expanded and diversified as learning progresses. The policy is stably updated across the entire plant sets associated with tasks defined by different uncertainty configurations without catastrophic forgetting. To ensure learning efficiency, we jointly incorporate a model-based controller (MBC), which guarantees a shared baseline performance across the plant sets, into the learning process to accelerate the convergence. This residual learning scheme facilitates task-specific optimization of the DRL agent for each uncertainty, thereby enhancing sample efficiency. As a practical industrial application, this study applies the proposed method to designing an active vibration controller for automotive powertrains. We verified that the resulting controller is robust against structural nonlinearities and dynamic variations, realizing successful sim-to-real transfer.
[AI-40] In-Context Learning in Linear vs. Quadratic Attention Models: An Empirical Study on Regression Tasks
【速读】:This paper addresses the differences in in-context learning (ICL) behavior between the two attention mechanisms used in transformer models, linear attention and quadratic attention, particularly on the canonical linear-regression task. The key of the solution is an empirical study comparing the two mechanisms on learning quality (mean squared error, MSE), convergence, and generalization, and further analyzing how increasing model depth affects ICL performance, thereby revealing the advantages and limitations of linear attention relative to quadratic attention in this setting.
链接: https://arxiv.org/abs/2602.17171
作者: Ayush Goel,Arjun Kohli,Sarvagya Somvanshi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent work has demonstrated that transformers and linear attention models can perform in-context learning (ICL) on simple function classes, such as linear regression. In this paper, we empirically study how these two attention mechanisms differ in their ICL behavior on the canonical linear-regression task of Garg et al. We evaluate learning quality (MSE), convergence, and generalization behavior of each architecture. We also analyze how increasing model depth affects ICL performance. Our results illustrate both the similarities and limitations of linear attention relative to quadratic attention in this setting.
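For readers who want the mechanical difference between the two families: softmax attention materialises a T×T score matrix, while (kernelised) linear attention factors it through a feature map φ so the T×T matrix never appears. A numpy sketch with an assumed φ (the specific linear-attention variant studied in the paper is not specified in the abstract):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))

def softmax_attention(Q, K, V):
    """Quadratic in T: forms the full T x T score matrix, then row softmax."""
    S = Q @ K.T / np.sqrt(d)
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return A @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1.0):
    """Linear in T: phi(Q) (phi(K)^T V), so the T x T matrix never appears."""
    Qp, Kp = phi(Q), phi(K)
    Z = Qp @ (Kp.T @ V)                    # O(T d^2) instead of O(T^2 d)
    norm = Qp @ Kp.sum(axis=0, keepdims=True).T
    return Z / norm

print(np.round(softmax_attention(Q, K, V), 3))
print(np.round(linear_attention(Q, K, V), 3))
```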
[AI-41] JEPA-DNA: Grounding Genomic Foundation Models through Joint-Embedding Predictive Architectures
【速读】:This paper addresses a limitation of current genomic foundation models (GFMs): pre-training that relies on masked language modeling (MLM) or next token prediction (NTP) captures local genomic syntax and fine-grained motif patterns but struggles to model the broader biological functional context, producing representations that lack global biological meaning. The key of the proposed JEPA-DNA framework is to combine the Joint-Embedding Predictive Architecture (JEPA) with traditional generative objectives, introducing latent grounding: token-level recovery is coupled with predicting high-level functional embeddings through a supervised CLS token, forcing the model to attend to the overall functional semantics of masked genomic segments rather than merely reconstructing individual nucleotides. This markedly improves the model's grasp of the functional logic of the genome and can be deployed as a continual pre-training enhancement on top of existing GFMs, yielding more robust and biologically meaningful representations.
链接: https://arxiv.org/abs/2602.17162
作者: Ariel Larey,Elay Dahan,Amit Bleiweiss,Raizy Kellerman,Guy Leib,Omri Nayshool,Dan Ofer,Tal Zinger,Dan Dominissini,Gideon Rechavi,Nicole Bussola,Simon Lee,Shane O’Connell,Dung Hoang,Marissa Wirth,Alexander W. Charney,Nati Daniel,Yoli Shavit
机构: 未知
类目: Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
备注:
Abstract:Genomic Foundation Models (GFMs) have largely relied on Masked Language Modeling (MLM) or Next Token Prediction (NTP) to learn the language of life. While these paradigms excel at capturing local genomic syntax and fine-grained motif patterns, they often fail to capture the broader functional context, resulting in representations that lack a global biological perspective. We introduce JEPA-DNA, a novel pre-training framework that integrates the Joint-Embedding Predictive Architecture (JEPA) with traditional generative objectives. JEPA-DNA introduces latent grounding by coupling token-level recovery with a predictive objective in the latent space by supervising a CLS token. This forces the model to predict the high-level functional embeddings of masked genomic segments rather than focusing solely on individual nucleotides. JEPA-DNA extends both NTP and MLM paradigms and can be deployed either as a standalone from-scratch objective or as a continual pre-training enhancement for existing GFMs. Our evaluations across a diverse suite of genomic benchmarks demonstrate that JEPA-DNA consistently yields superior performance in supervised and zero-shot tasks compared to generative-only baselines. By providing a more robust and biologically grounded representation, JEPA-DNA offers a scalable path toward foundation models that understand not only the genomic alphabet, but also the underlying functional logic of the sequence.
[AI-42] TimeOmni-VL: Unified Models for Time Series Understanding and Generation
【Quick Read】: This paper addresses the divide in time-series modeling between numerical generation and semantic understanding: generation models often rely on superficial pattern matching, while understanding-oriented models struggle to produce high-fidelity numerical output. The key to the proposed TimeOmni-VL, the first vision-centric unified framework, lies in two core techniques: (1) fidelity-preserving bidirectional mapping between temporal signals and images (Bi-TSI), enabling near-lossless time-series-to-image (TS2I) and image-to-time-series (I2TS) conversion; and (2) understanding-guided generation, introducing the TSUMM-Suite dataset of six understanding tasks rooted in time-series analytics coupled with two generation tasks, and, with a calibrated Chain-of-Thought, using time-series understanding as an explicit control signal for high-fidelity generation for the first time. This substantially improves both semantic understanding and numerical precision, pushing multimodal time-series modeling to a new frontier.
Link: https://arxiv.org/abs/2602.17149
Authors: Tong Guan, Sheng Pan, Johan Barthelemy, Zhao Li, Yujun Cai, Cesare Alippi, Ming Jin, Shirui Pan
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:
Abstract:Recent time series modeling faces a sharp divide between numerical generation and semantic understanding, with research showing that generation models often rely on superficial pattern matching, while understanding-oriented models struggle with high-fidelity numerical output. Although unified multimodal models (UMMs) have bridged this gap in vision, their potential for time series remains untapped. We propose TimeOmni-VL, the first vision-centric framework that unifies time series understanding and generation through two key innovations: (1) Fidelity-preserving bidirectional mapping between time series and images (Bi-TSI), which advances Time Series-to-Image (TS2I) and Image-to-Time Series (I2TS) conversions to ensure near-lossless transformations. (2) Understanding-guided generation. We introduce TSUMM-Suite, a novel dataset consisting of six understanding tasks rooted in time series analytics that are coupled with two generation tasks. With a calibrated Chain-of-Thought, TimeOmni-VL is the first to leverage time series understanding as an explicit control signal for high-fidelity generation. Experiments confirm that this unified approach significantly improves both semantic understanding and numerical precision, establishing a new frontier for multimodal time series modeling.
[AI-43] Bonsai: A Framework for Convolutional Neural Network Acceleration Using Criterion-Based Pruning
【Quick Read】: This paper targets the growing model size, execution time, memory footprint, and power consumption that accompany the pursuit of more accurate and powerful convolutional neural networks (CNNs). The key to its solution is Combine, a criterion-based pruning framework that describes and compares different pruning criteria in a common standard language and supports iterative pruning, enabling efficient and reproducible model compression. Experiments show that the framework can prune up to 79% of the filters of VGG-style models while retaining or improving accuracy, and reduce the network's computation by up to 68%.
Link: https://arxiv.org/abs/2602.17145
Authors: Joseph Bingham, Sam Helmich
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Notes: 16 pages, 4 figures, accepted to MLDM 2021
Abstract:As the need for more accurate and powerful Convolutional Neural Networks (CNNs) increases, so too does the size, execution time, memory footprint, and power consumption. To overcome this, solutions such as pruning have been proposed with their own metrics and methodologies, or criteria, for how weights should be removed. These solutions do not share a common implementation and are difficult to implement and compare. In this work, we introduce Combine, a criterion-based pruning solution, demonstrate that it is a fast and effective framework for iterative pruning, demonstrate that criteria have differing effects on different models, create a standard language for comparing criterion functions, and propose a few novel criterion functions. We show the capacity of these criterion functions and the framework on VGG-inspired models, pruning up to 79% of filters while retaining or improving accuracy, and reducing the computations needed by the network by up to 68%.
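Criterion-based filter pruning reduces to two steps: score each filter with a criterion function, then keep the top-ranked ones. The sketch below uses the common L1-norm criterion as an example; it illustrates the general technique rather than the Combine/Bonsai API.

```python
import torch

def l1_criterion(conv_weight):
    # Score each output filter by the L1 norm of its weights.
    return conv_weight.abs().sum(dim=(1, 2, 3))

def prune_filters(conv_weight, criterion, keep_ratio=0.5):
    # Rank filters with the given criterion function and keep the top fraction.
    scores = criterion(conv_weight)
    k = max(1, int(keep_ratio * scores.numel()))
    keep = torch.topk(scores, k).indices.sort().values
    return conv_weight[keep], keep      # pruned tensor + surviving filter indices

w = torch.randn(64, 32, 3, 3)           # (out_filters, in_channels, kh, kw)
pruned, kept = prune_filters(w, l1_criterion, keep_ratio=0.21)  # keep ~21% (prune ~79%)
```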
[AI-44] VP-VAE: Rethinking Vector Quantization via Adaptive Vector Perturbation
【Quick Read】: This paper tackles the training instability and "codebook collapse" of vector quantized variational autoencoders (VQ-VAEs), which stem from the tight coupling between representation learning and discrete codebook optimization. The key to the proposed VP-VAE (Vector Perturbation VAE) is to decouple representation learning from discretization by eliminating the need for an explicit codebook during training: quantization is viewed as injecting a structured perturbation into the latent space, and the non-differentiable quantizer is replaced with distribution-consistent, scale-adaptive latent perturbations generated via Metropolis–Hastings sampling, enabling stable training and improved robustness to inference-time quantization error.
Link: https://arxiv.org/abs/2602.17133
Authors: Linwei Zhai, Han Ding, Mingzhi Lin, Cui Zhao, Fei Wang, Ge Wang, Wang Zhi, Wei Xi
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:
Abstract:Vector Quantized Variational Autoencoders (VQ-VAEs) are fundamental to modern generative modeling, yet they often suffer from training instability and “codebook collapse” due to the inherent coupling of representation learning and discrete codebook optimization. In this paper, we propose VP-VAE (Vector Perturbation VAE), a novel paradigm that decouples representation learning from discretization by eliminating the need for an explicit codebook during training. Our key insight is that, from the neural network’s viewpoint, performing quantization primarily manifests as injecting a structured perturbation in latent space. Accordingly, VP-VAE replaces the non-differentiable quantizer with distribution-consistent and scale-adaptive latent perturbations generated via Metropolis–Hastings sampling. This design enables stable training without a codebook while making the model robust to inference-time quantization error. Moreover, under the assumption of approximately uniform latent variables, we derive FSP (Finite Scalar Perturbation), a lightweight variant of VP-VAE that provides a unified theoretical explanation and a practical improvement for FSQ-style fixed quantizers. Extensive experiments on image and audio benchmarks demonstrate that VP-VAE and FSP improve reconstruction fidelity and achieve substantially more balanced token usage, while avoiding the instability inherent to coupled codebook training.
[AI-45] Efficient Parallel Algorithm for Decomposing Hard CircuitSAT Instances
【Quick Read】: This paper addresses the difficulty of efficiently decomposing hard circuit satisfiability (CircuitSAT) instances, i.e., how to partition complex CircuitSAT instances into more tractable subproblems. The key to the solution is a parameterized parallel algorithm that introduces specialized constraints to decompose the original SAT instance into a family of weakened formulas, and uses hardness estimations computed in parallel to guide the identification of high-quality decompositions. The approach shows notable practical utility and efficiency gains in scenarios such as logical equivalence checking and preimage attacks on cryptographic hash functions.
Link: https://arxiv.org/abs/2602.17130
Authors: Victor Kondratiev, Irina Gribanova, Alexander Semenov
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Notes:
Abstract:We propose a novel parallel algorithm for decomposing hard CircuitSAT instances. The technique employs specialized constraints to partition an original SAT instance into a family of weakened formulas. Our approach is implemented as a parameterized parallel algorithm, where adjusting the parameters allows efficient identification of high-quality decompositions, guided by hardness estimations computed in parallel. We demonstrate the algorithm’s practical efficacy on challenging CircuitSAT instances, including those encoding Logical Equivalence Checking of Boolean circuits and preimage attacks on cryptographic hash functions.
[AI-46] TIFO: Time-Invariant Frequency Operator for Stationarity-Aware Representation Learning in Time Series
【Quick Read】: This paper addresses the performance degradation in nonstationary time-series forecasting caused by distribution shift between training and test data. Existing methods typically remove low-order moments from individual samples, which neither captures the time-evolving structure across samples nor models complex temporal dynamics. The key to the solution is a Time-Invariant Frequency Operator (TIFO) that learns dataset-aware weight representations in the frequency space, highlighting stationary frequency components while suppressing non-stationary ones to mitigate distribution shift. TIFO is motivated by the eigen-decomposition implicitly induced by the Fourier transform in frequency space, is plug-and-play, and integrates seamlessly into various forecasting models; on multiple benchmarks it substantially improves accuracy (e.g., 33.3% and 55.3% average MSE reductions on ETTm2) while cutting computational cost by 60%-70%, demonstrating strong scalability.
Link: https://arxiv.org/abs/2602.17122
Authors: Xihao Piao, Zheng Chen, Lingwei Zhu, Yushun Dong, Yasuko Matsubara, Yasushi Sakurai
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:
Abstract:Nonstationary time series forecasting suffers from the distribution shift issue due to the different distributions that produce the training and test data. Existing methods attempt to alleviate the dependence by, e.g., removing low-order moments from each individual sample. These solutions fail to capture the underlying time-evolving structure across samples and do not model the complex time structure. In this paper, we aim to address the distribution shift in the frequency space by considering all possible time structures. To this end, we propose a Time-Invariant Frequency Operator (TIFO), which learns stationarity-aware weights over the frequency spectrum across the entire dataset. The weight representation highlights stationary frequency components while suppressing non-stationary ones, thereby mitigating the distribution shift issue in time series. To justify our method, we show that the Fourier transform of time series data implicitly induces eigen-decomposition in the frequency space. TIFO is a plug-and-play approach that can be seamlessly integrated into various forecasting models. Experiments demonstrate our method achieves 18 top-1 and 6 top-2 results out of 28 forecasting settings. Notably, it yields 33.3% and 55.3% improvements in average MSE on the ETTm2 dataset. In addition, TIFO reduces computational costs by 60%-70% compared to baseline methods, demonstrating strong scalability across diverse forecasting models.
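The core operation, reweighting the frequency spectrum with weights learned across the dataset, can be sketched in a few lines. Shapes, the sigmoid gating, and the module interface are our assumptions, not the paper's implementation.

```python
import torch

class FrequencyReweighting(torch.nn.Module):
    def __init__(self, seq_len):
        super().__init__()
        # One learnable weight per rfft bin, shared across the whole dataset.
        self.w = torch.nn.Parameter(torch.zeros(seq_len // 2 + 1))

    def forward(self, x):                      # x: (batch, seq_len)
        spec = torch.fft.rfft(x, dim=-1)
        spec = spec * torch.sigmoid(self.w)    # damp non-stationary frequency bins
        return torch.fft.irfft(spec, n=x.shape[-1], dim=-1)

x = torch.randn(8, 96)
print(FrequencyReweighting(96)(x).shape)       # torch.Size([8, 96])
```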
[AI-47] Epistemology of Generative AI: The Geometry of Knowing
【Quick Read】: This paper addresses the missing epistemic foundation of generative AI in knowledge production: current understanding remains at the level of engineering applications and lacks a philosophical and cognitive framework for the internal mechanisms, hindering responsible integration into science, education, and institutional life. The key to the solution is an Indexical Epistemology of High-Dimensional Spaces, whose core move is recognizing the geometric reconception neural architectures impose on symbolic input: raw binary encodings are projected into a high-dimensional semantic space whose coordinates correspond to semantic parameters, forming a navigable manifold. Drawing on four structural properties of high-dimensional geometry (concentration of measure, near-orthogonality, exponential directional capacity, and manifold regularity), the theory reconceives generative models as navigators of learned manifolds and introduces navigational knowledge as a third mode of knowledge production, distinct from symbolic reasoning and statistical recombination.
Link: https://arxiv.org/abs/2602.17116
Authors: Ilya Levin
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Notes: 27
Abstract:Generative AI presents an unprecedented challenge to our understanding of knowledge and its production. Unlike previous technological transformations, where engineering understanding preceded or accompanied deployment, generative AI operates through mechanisms whose epistemic character remains obscure, and without such understanding, its responsible integration into science, education, and institutional life cannot proceed on a principled basis. This paper argues that the missing account must begin with a paradigmatic break that has not yet received adequate philosophical attention. In the Turing-Shannon-von Neumann tradition, information enters the machine as encoded binary vectors, and semantics remains external to the process. Neural network architectures rupture this regime: symbolic input is instantly projected into a high-dimensional space where coordinates correspond to semantic parameters, transforming binary code into a position in a geometric space of meanings. It is this space that constitutes the active epistemic condition shaping generative production. Drawing on four structural properties of high-dimensional geometry (concentration of measure, near-orthogonality, exponential directional capacity, and manifold regularity), the paper develops an Indexical Epistemology of High-Dimensional Spaces. Building on Peirce's semiotics and Papert's constructionism, it reconceptualizes generative models as navigators of learned manifolds and proposes navigational knowledge as a third mode of knowledge production, distinct from both symbolic reasoning and statistical recombination.
[AI-48] Instructor-Aligned Knowledge Graphs for Personalized Learning
【Quick Read】: This paper targets the difficulty of accurately identifying students' knowledge gaps and delivering personalized intervention in large-scale courses, where the core challenge is modeling the complex dependencies among educational concepts (prerequisite and sub-concept relationships). The key to the solution is InstructKG, a framework that automatically parses instructional materials such as lecture slides and notes, combining the generalizability of large language models with the temporal and semantic signals unique to educational text (e.g., the order in which concepts appear and definitional links) to construct knowledge graphs aligned with the instructor's intent, precisely capturing the course's intended learning progression.
Link: https://arxiv.org/abs/2602.17111
Authors: Abdulrahman AlRabah, Priyanka Kargupta, Jiawei Han, Abdussalam Alawini
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Notes:
Abstract:Mastering educational concepts requires understanding both their prerequisites (e.g., recursion before merge sort) and sub-concepts (e.g., merge sort as part of sorting algorithms). Capturing these dependencies is critical for identifying students’ knowledge gaps and enabling targeted intervention for personalized learning. This is especially challenging in large-scale courses, where instructors cannot feasibly diagnose individual misunderstanding or determine which concepts need reinforcement. While knowledge graphs offer a natural representation for capturing these conceptual relationships at scale, existing approaches are either surface-level (focusing on course-level concepts like “Algorithms” or logistical relationships such as course enrollment), or disregard the rich pedagogical signals embedded in instructional materials. We propose InstructKG, a framework for automatically constructing instructor-aligned knowledge graphs that capture a course’s intended learning progression. Given a course’s lecture materials (slides, notes, etc.), InstructKG extracts significant concepts as nodes and infers learning dependencies as directed edges (e.g., “part-of” or “depends-on” relationships). The framework synergizes the rich temporal and semantic signals unique to educational materials (e.g., “recursion” is taught before “mergesort”; “recursion” is mentioned in the definition of “merge sort”) with the generalizability of large language models. Through experiments on real-world, diverse lecture materials across multiple courses and human-based evaluation, we demonstrate that InstructKG captures rich, instructor-aligned learning progressions.
[AI-49] Owen-based Semantics and Hierarchy-Aware Explanation (O-Shap)
【Quick Read】: This paper addresses the failure of traditional Shapley-value methods in vision tasks, where the feature-independence assumption breaks down because pixels exhibit strong spatial and semantic dependencies. The key to the solution is the Owen value, a hierarchical generalization of the Shapley value that supports group attributions, together with a new segmentation approach satisfying the T-property to ensure semantic consistency across hierarchy levels. The hierarchy also enables computational pruning for efficiency, while markedly improving attribution precision and interpretability, especially on structure-sensitive tasks.
Link: https://arxiv.org/abs/2602.17107
Authors: Xiangyu Zhou, Chenhan Xiao, Yang Weng
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Notes:
Abstract:Shapley value-based methods have become foundational in explainable artificial intelligence (XAI), offering theoretically grounded feature attributions through cooperative game theory. However, in practice, particularly in vision tasks, the assumption of feature independence breaks down, as features (i.e., pixels) often exhibit strong spatial and semantic dependencies. To address this, modern SHAP implementations now include the Owen value, a hierarchical generalization of the Shapley value that supports group attributions. While the Owen value preserves the foundations of Shapley values, its effectiveness critically depends on how feature groups are defined. We show that commonly used segmentations (e.g., axis-aligned or SLIC) violate key consistency properties, and propose a new segmentation approach that satisfies the T-property to ensure semantic alignment across hierarchy levels. This hierarchy enables computational pruning while improving attribution accuracy and interpretability. Experiments on image and tabular datasets demonstrate that O-Shap outperforms baseline SHAP variants in attribution precision, semantic coherence, and runtime efficiency, especially when structure matters.
[AI-50] Toward Trustworthy Evaluation of Sustainability Rating Methodologies: A Human-AI Collaborative Framework for Benchmark Dataset Construction
【Quick Read】: This paper targets the wide divergence among sustainability (ESG) ratings that different agencies assign to the same company, which undermines comparability, credibility, and decision relevance. The key to the solution is a universal human-AI collaboration framework with two complementary parts: STRIDE (Sustainability Trust Rating Integrity Data Equation) provides principled criteria and a scoring system that guide the construction of firm-level benchmark datasets using large language models (LLMs), while SR-Delta is a discrepancy-analysis procedural framework that surfaces rating deviations potentially requiring adjustment. The framework enables scalable and comparable assessment of sustainability rating methodologies, advancing AI-powered improvement of ESG rating systems in support of urgent sustainability agendas.
Link: https://arxiv.org/abs/2602.17106
Authors: Xiaoran Cai, Wang Yang, Xiyu Ren, Chekun Law, Rohit Sharma, Peng Qi
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Notes:
Abstract:Sustainability or ESG rating agencies use company disclosures and external data to produce scores or ratings that assess the environmental, social, and governance performance of a company. However, sustainability ratings across agencies for a single company vary widely, limiting their comparability, credibility, and relevance to decision-making. To harmonize the rating results, we propose adopting a universal human-AI collaboration framework to generate trustworthy benchmark datasets for evaluating sustainability rating methodologies. The framework comprises two complementary parts: STRIDE (Sustainability Trust Rating Integrity Data Equation) provides principled criteria and a scoring system that guide the construction of firm-level benchmark datasets using large language models (LLMs), and SR-Delta, a discrepancy-analysis procedural framework that surfaces insights for potential adjustments. The framework enables scalable and comparable assessment of sustainability rating methodologies. We call on the broader AI community to adopt AI-powered approaches to strengthen and advance sustainability rating methodologies that support and enforce urgent sustainability agendas.
[AI-51] Agentic Wireless Communication for 6G: Intent-Aware and Continuously Evolving Physical-Layer Intelligence
【Quick Read】: This paper addresses the transition of 6G wireless control from rule-driven mechanisms to intent-driven autonomous intelligence, prompted by growing functional complexity and diverse service demands. The core challenge is accurately understanding multi-dimensional user intents (latency sensitivity, energy preference, computational constraints, etc.) and dynamic communication environments, while achieving efficient closed-loop autonomous decision making and network execution. The key to the solution is agentic AI built on large language models (LLMs), whose strong contextual understanding and cross-modal reasoning translate natural-language user intents into executable physical-layer control and configuration decisions, enabling intent-aware autonomous network agents and the sustainable evolution of 6G communication systems.
Link: https://arxiv.org/abs/2602.17096
Authors: Zhaoyang Li, Xingzhi Jin, Junyu Pan, Qianqian Yang, Zhiguo Shi
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Notes:
Abstract:As 6G wireless systems evolve, growing functional complexity and diverse service demands are driving a shift from rule-based control to intent-driven autonomous intelligence. User requirements are no longer captured by a single metric (e.g., throughput or reliability), but by multi-dimensional objectives such as latency sensitivity, energy preference, computational constraints, and service-level requirements. These objectives may also change over time due to environmental dynamics and user-network interactions. Therefore, accurate understanding of both the communication environment and user intent is critical for autonomous and sustainably evolving 6G communications. Large language models (LLMs), with strong contextual understanding and cross-modal reasoning, provide a promising foundation for intent-aware network agents. Compared with rule-driven or centrally optimized designs, LLM-based agents can integrate heterogeneous information and translate natural-language intents into executable control and configuration decisions. Focusing on a closed-loop pipeline of intent perception, autonomous decision making, and network execution, this paper investigates agentic AI for the 6G physical layer and its realization pathways. We review representative physical-layer tasks and their limitations in supporting intent awareness and autonomy, identify application scenarios where agentic AI is advantageous, and discuss key challenges and enabling technologies in multimodal perception, cross-layer decision making, and sustainable optimization. Finally, we present a case study of an intent-driven link decision agent, termed AgenCom, which adaptively constructs communication links under diverse user preferences and channel conditions.
[AI-52] FLoRG: Federated Fine-tuning with Low-rank Gram Matrices and Procrustes Alignment
【Quick Read】: This paper addresses two challenges of LoRA-based federated fine-tuning: the aggregation error introduced by averaging the two low-rank matrices separately, and the decomposition drift that arises even when their product is aggregated, because the server's matrix decomposition to recover factors is non-unique. The key to the proposed FLoRG framework is to fine-tune with a single low-rank matrix and aggregate its Gram matrix (the matrix of inner products of its column vectors), eliminating the aggregation error and reducing communication overhead; a Procrustes alignment method further aligns the decomposed matrix across consecutive fine-tuning rounds to suppress decomposition drift, and the theoretical analysis shows this strategy yields a tighter convergence bound.
Link: https://arxiv.org/abs/2602.17095
Authors: Chuiyang Meng, Ming Tang, Vincent W.S. Wong
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:
Abstract:Parameter-efficient fine-tuning techniques such as low-rank adaptation (LoRA) enable large language models (LLMs) to adapt to downstream tasks efficiently. Federated learning (FL) further facilitates this process by enabling collaborative fine-tuning across distributed clients without sharing private data. However, the use of two separate low-rank matrices in LoRA for federated fine-tuning introduces two types of challenges. The first challenge arises from the error induced by separately aggregating those two low-rank matrices. The second challenge occurs even when the product of two low-rank matrices is aggregated. The server needs to recover factors via matrix decomposition, which is non-unique and can introduce decomposition drift. To tackle the aforementioned challenges, we propose FLoRG, a federated fine-tuning framework which employs a single low-rank matrix for fine-tuning and aggregates its Gram matrix (i.e., the matrix of inner products of its column vectors), eliminating the aggregation error while also reducing the communication overhead. FLoRG minimizes the decomposition drift by introducing a Procrustes alignment approach which aligns the decomposed matrix between consecutive fine-tuning rounds for consistent updates. We theoretically analyze the convergence of FLoRG and prove that adopting the Procrustes alignment results in a tighter convergence bound. Experimental results across multiple LLM fine-tuning benchmarks demonstrate that FLoRG outperforms five state-of-the-art baseline schemes in the downstream task accuracy and can reduce the communication overhead by up to 2041×.
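Two of the primitives the abstract names, Gram-matrix aggregation and orthogonal Procrustes alignment, can be sketched directly. The square-root factorization below is one of infinitely many valid decompositions, which is exactly the non-uniqueness the alignment step addresses; this is a generic sketch, not the FLoRG code.

```python
import numpy as np

def gram(A):
    # Gram matrix of a low-rank factor A: inner products of its columns (r x r).
    return A.T @ A

def factor_from_gram(G):
    # One of infinitely many C with C.T @ C == G, hence "decomposition drift".
    vals, V = np.linalg.eigh(G)
    return np.sqrt(np.clip(vals, 0.0, None))[:, None] * V.T

def procrustes_align(C, C_prev):
    # Orthogonal R minimizing ||R @ C - C_prev||_F; note (R @ C).T @ (R @ C)
    # equals C.T @ C, so alignment never changes the Gram matrix.
    U, _, Vt = np.linalg.svd(C_prev @ C.T)
    return (U @ Vt) @ C

G_avg = sum(gram(np.random.randn(128, 4)) for _ in range(3)) / 3  # aggregated Grams
C = factor_from_gram(G_avg)
assert np.allclose(gram(C), G_avg)
```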
[AI-53] How AI Coding Agents Communicate: A Study of Pull Request Description Characteristics and Human Review Responses
【Quick Read】: This paper examines how the pull request (PR) descriptions generated by different AI coding agents on GitHub differ, and how those differences relate to human reviewers' responses. The key to the solution is an empirical analysis on the AIDev dataset of PR description characteristics (including structural features) across five AI coding agents, quantifying reviewer behavior (review activity, response timing, sentiment, and merge outcomes) to reveal how description style relates to reviewer engagement, responsiveness, and merge success, informing better interaction quality in human-AI collaborative software development.
Link: https://arxiv.org/abs/2602.17084
Authors: Kan Watanabe, Rikuto Tsuchida, Takahiro Monno, Bin Huang, Kazuma Yamasaki, Youmei Fan, Kazumasa Shimari, Kenichi Matsumoto
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Notes:
Abstract:The rapid adoption of large language models has led to the emergence of AI coding agents that autonomously create pull requests on GitHub. However, how these agents differ in their pull request description characteristics, and how human reviewers respond to them, remains underexplored. In this study, we conduct an empirical analysis of pull requests created by five AI coding agents using the AIDev dataset. We analyze agent differences in pull request description characteristics, including structural features, and examine human reviewer response in terms of review activity, response timing, sentiment, and merge outcomes. We find that AI coding agents exhibit distinct PR description styles, which are associated with differences in reviewer engagement, response time, and merge outcomes. We observe notable variation across agents in both reviewer interaction metrics and merge rates. These findings highlight the role of pull request presentation and reviewer interaction dynamics in human-AI collaborative software development.
[AI-54] AdvSynGNN: Structure-Adaptive Graph Neural Nets via Adversarial Synthesis and Self-Corrective Propagation
【Quick Read】: This paper addresses the significant performance degradation of graph neural networks (GNNs) under structural noise or non-homophilous topologies. The key to the proposed AdvSynGNN framework is combining multi-resolution structural synthesis with contrastive objectives to build geometry-sensitive initializations, and introducing a transformer backbone that adapts to heterophily by modulating attention through learned topological signals. An integrated adversarial propagation engine, in which a generator identifies potential connectivity alterations and a discriminator enforces global coherence, is complemented by a confidence-guided residual correction scheme for label refinement and iterative stability, achieving accurate prediction across diverse graph distributions while maintaining computational efficiency.
Link: https://arxiv.org/abs/2602.17071
Authors: Rong Fu, Muge Qi, Chunlei Meng, Shuo Yin, Kun Liu, Zhaolu Kang, Simon Fong
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: 32 pages, 8 figures
Abstract:Graph neural networks frequently encounter significant performance degradation when confronted with structural noise or non-homophilous topologies. To address these systemic vulnerabilities, we present AdvSynGNN, a comprehensive architecture designed for resilient node-level representation learning. The proposed framework orchestrates multi-resolution structural synthesis alongside contrastive objectives to establish geometry-sensitive initializations. We develop a transformer backbone that adaptively accommodates heterophily by modulating attention mechanisms through learned topological signals. Central to our contribution is an integrated adversarial propagation engine, where a generative component identifies potential connectivity alterations while a discriminator enforces global coherence. Furthermore, label refinement is achieved through a residual correction scheme guided by per-node confidence metrics, which facilitates precise control over iterative stability. Empirical evaluations demonstrate that this synergistic approach effectively optimizes predictive accuracy across diverse graph distributions while maintaining computational efficiency. The study concludes with practical implementation protocols to ensure the robust deployment of the AdvSynGNN system in large-scale environments.
[AI-55] Predictive Batch Scheduling: Accelerating Language Model Training Through Loss-Aware Sample Prioritization
【Quick Read】: This paper addresses slow convergence in language model training, specifically how to accelerate convergence through better sample selection. The key is Predictive Batch Scheduling (PBS), a training optimization technique that dynamically prioritizes high-loss samples during batch construction for more effective gradient updates. PBS's core innovation is a lightweight linear predictor trained online to estimate sample difficulty from only four static token-level features (token frequency, sequence length, vocabulary diversity, and rare-token ratio), achieving a 0.44 correlation with actual loss without predefined difficulty metrics or expensive per-sample loss tracking, and yielding 6-13% faster convergence in evaluation loss at negligible computational overhead.
Link: https://arxiv.org/abs/2602.17066
Authors: Sumedh Rasal
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Notes:
Abstract:We introduce Predictive Batch Scheduling (PBS), a novel training optimization technique that accelerates language model convergence by dynamically prioritizing high-loss samples during batch construction. Unlike curriculum learning approaches that require predefined difficulty metrics or hard example mining methods that demand expensive per-sample loss tracking, PBS employs a lightweight linear predictor trained online to estimate sample difficulty from static token-level features. Our predictor achieves 0.44 correlation with actual loss using only four simple features: token frequency, sequence length, vocabulary diversity, and rare token ratio. Experiments on a 130M parameter transformer demonstrate that PBS achieves 6-13% faster convergence measured by evaluation loss across training checkpoints, with the predictor’s correlation improving from 0.14 to 0.44 over 10,000 training steps. These results validate that token frequency statistics encode meaningful information about sample difficulty, enabling effective curriculum learning with negligible computational overhead.
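The four static features and the linear scoring rule are simple enough to sketch end to end. The exact feature definitions, the rare-token cutoff, and the weight vector below are illustrative guesses, not the paper's specification.

```python
import numpy as np
from collections import Counter

def difficulty_features(tokens, freq_table, rare_cutoff=2):
    # Four static token-level features named in the abstract (our definitions).
    n = len(tokens)
    avg_freq = float(np.mean([freq_table.get(t, 0) for t in tokens]))
    vocab_div = len(set(tokens)) / n
    rare_ratio = sum(freq_table.get(t, 0) < rare_cutoff for t in tokens) / n
    return np.array([avg_freq, float(n), vocab_div, rare_ratio])

def schedule_batch(samples, w, batch_size, freq_table):
    # Linear predictor w scores each sample; highest predicted loss comes first.
    scores = [w @ difficulty_features(s, freq_table) for s in samples]
    order = np.argsort(scores)[::-1]
    return [samples[i] for i in order[:batch_size]]

corpus = [["the", "cat", "sat"], ["zygote", "mitosis", "the"], ["the", "the", "cat"]]
freq = Counter(t for s in corpus for t in s)
print(schedule_batch(corpus, np.array([-0.1, 0.05, 0.5, 1.0]), 2, freq))
```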
[AI-56] Retaining Suboptimal Actions to Follow Shifting Optima in Multi-Agent Reinforcement Learning ICLR2026
【Quick Read】: This paper addresses a limitation of existing value-decomposition methods in cooperative multi-agent reinforcement learning (MARL): reliance on a single optimal action makes them slow to adapt when the underlying value function shifts, often converging to suboptimal policies. The key to the proposed Successive Sub-value Q-learning (S2Q) is to learn multiple sub-value functions that retain alternative high-value actions and to incorporate them into a softmax-based behavior policy, encouraging persistent exploration and allowing Q-values to respond quickly to shifting optima, thereby improving adaptability and overall performance.
Link: https://arxiv.org/abs/2602.17062
Authors: Yonghyeon Jo, Sunwoo Lee, Seungyul Han
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Notes: 10 technical pages followed by references and appendix. Accepted to ICLR 2026
Abstract:Value decomposition is a core approach for cooperative multi-agent reinforcement learning (MARL). However, existing methods still rely on a single optimal action and struggle to adapt when the underlying value function shifts during training, often converging to suboptimal policies. To address this limitation, we propose Successive Sub-value Q-learning (S2Q), which learns multiple sub-value functions to retain alternative high-value actions. Incorporating these sub-value functions into a Softmax-based behavior policy, S2Q encourages persistent exploration and enables Q^{tot} to adjust quickly to the changing optima. Experiments on challenging MARL benchmarks confirm that S2Q consistently outperforms various MARL algorithms, demonstrating improved adaptability and overall performance. Our code is available at this https URL.
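One way to read the core idea: keep several sub-value estimates per agent and sample actions through a softmax over them, so alternative high-value actions stay reachable. The max-pooling across heads below is our assumption about how the sub-values could be combined, not the paper's exact rule.

```python
import numpy as np

def s2q_style_action(q_heads, temperature=1.0, rng=np.random.default_rng(0)):
    # q_heads: (num_sub_values, num_actions) -- several sub-value estimates.
    pooled = q_heads.max(axis=0)           # keep alternative high-value actions alive
    logits = pooled / temperature
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return rng.choice(pooled.size, p=p)    # softmax behavior policy

q = np.array([[1.0, 0.2, 0.9], [0.1, 1.1, 0.8]])
print(s2q_style_action(q))
```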
[AI-57] Dynamic System Instructions and Tool Exposure for Efficient Agentic LLMs
【Quick Read】: This paper targets the high cost, latency, tool-selection error rate, and agent-derailment probability caused by re-ingesting full system instructions and tool catalogs at every step of long-running large language model (LLM) agents. The key to the proposed Instruction-Tool Retrieval (ITR), a RAG variant, is to retrieve per step only the minimal necessary system-prompt fragments and the smallest sufficient subset of tools, dynamically composing the runtime system prompt with confidence-gated fallbacks, thereby sharply reducing context tokens, improving tool-routing accuracy, and cutting overall task cost.
Link: https://arxiv.org/abs/2602.17046
Authors: Uria Franko
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Notes:
Abstract:Large Language Model (LLM) agents often run for many steps while re-ingesting long system instructions and large tool catalogs each turn. This increases cost, agent derailment probability, latency, and tool-selection errors. We propose Instruction-Tool Retrieval (ITR), a RAG variant that retrieves, per step, only the minimal system-prompt fragments and the smallest necessary subset of tools. ITR composes a dynamic runtime system prompt and exposes a narrowed toolset with confidence-gated fallbacks. Using a controlled benchmark with internally consistent numbers, ITR reduces per-step context tokens by 95%, improves correct tool routing by 32% relative, and cuts end-to-end episode cost by 70% versus a monolithic baseline. These savings enable agents to run 2-20x more loops within context limits. Savings compound with the number of agent steps, making ITR particularly valuable for long-running autonomous agents. We detail the method, evaluation protocol, ablations, and operational guidance for practical deployment.
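A per-step tool-narrowing loop in the spirit of ITR might look like the following; the cosine scoring, top-k cutoff, and confidence threshold are illustrative assumptions, not the paper's method.

```python
import numpy as np

def itr_step(query_emb, tool_embs, tool_names, k=3, confidence=0.35):
    # Score every tool against the current step's query embedding.
    q = query_emb / np.linalg.norm(query_emb)
    T = tool_embs / np.linalg.norm(tool_embs, axis=1, keepdims=True)
    sims = T @ q
    top = np.argsort(sims)[::-1][:k]
    if sims[top[0]] < confidence:          # confidence-gated fallback
        return list(tool_names)            # expose the full catalog instead
    return [tool_names[i] for i in top]    # narrowed toolset for this step

names = ["search", "calculator", "calendar", "email"]
embs = np.random.default_rng(1).normal(size=(4, 16))
print(itr_step(embs[1] + 0.1, embs, names, k=2))
```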
[AI-58] Phase-Aware Mixture of Experts for Agentic Reinforcement Learning
【Quick Read】: This paper addresses the "simplicity bias" induced by single-policy-network reinforcement learning (RL): simple tasks occupy most parameters and dominate gradient updates, squeezing out capacity for complex tasks. The key to the proposed Phase-Aware Mixture of Experts (PA-MoE) is a lightweight phase router that learns latent phase boundaries directly from the RL objective rather than from predefined categories, and allocates temporally consistent action segments within a phase to the same expert, preserving phase-specific expertise, mitigating the monopolization of parameters by simple tasks, and strengthening the modeling of complex tasks.
Link: https://arxiv.org/abs/2602.17038
Authors: Shengtian Yang (1 and 3), Yu Li (1), Shuo He (2), Yewen Li (3), Qingpeng Cai (3), Peng Jiang (3), Lei Feng (1) ((1) Southeast University, (2) Nanyang Technological University, (3) Kuaishou Technology)
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Notes: 16 pages
Abstract:Reinforcement learning (RL) has equipped LLM agents with a strong ability to solve complex tasks. However, existing RL methods normally use a single policy network, causing simplicity bias where simple tasks occupy most parameters and dominate gradient updates, leaving insufficient capacity for complex tasks. A plausible remedy could be employing the Mixture-of-Experts (MoE) architecture in the policy network, as MoE allows different parameters (experts) to specialize in different tasks, preventing simple tasks from dominating all parameters. However, a key limitation of traditional MoE is its token-level routing, where the router assigns each token to specialized experts, which fragments phase-consistent patterns into scattered expert assignments and thus undermines expert specialization. In this paper, we propose Phase-Aware Mixture of Experts (PA-MoE). It first features a lightweight phase router that learns latent phase boundaries directly from the RL objective without pre-defining phase categories. Then, the phase router allocates temporally consistent assignments to the same expert, allowing experts to preserve phase-specific expertise. Experimental results demonstrate the effectiveness of our proposed PA-MoE.
[AI-59] Forecasting Anomaly Precursors via Uncertainty-Aware Time-Series Ensembles
【Quick Read】: This paper addresses the reactive nature of existing time-series anomaly detection, which identifies anomalies only after they occur and lacks proactive early-warning signals. The key to the proposed FATE (Forecasting Anomalies with Time-series Ensembles) framework is to build a diverse ensemble of time-series forecasting models and quantify Precursors-of-Anomaly (PoA) through predictive uncertainty, enabling early warning without access to target values or labels. Rather than relying on reconstruction errors or supervision, the method captures anomaly precursors via ensemble disagreement, substantially improving the accuracy and timeliness of early detection.
Link: https://arxiv.org/abs/2602.17028
Authors: Hyeongwon Kang, Jinwoo Park, Seunghun Han, Pilsung Kang
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: This manuscript contains 14 pages and 8 figures. It is currently under review at IEEE Transactions on Neural Networks and Learning Systems (TNNLS)
Abstract:Detecting anomalies in time-series data is critical in domains such as industrial operations, finance, and cybersecurity, where early identification of abnormal patterns is essential for ensuring system reliability and enabling preventive maintenance. However, most existing methods are reactive: they detect anomalies only after they occur and lack the capability to provide proactive early warning signals. In this paper, we propose FATE (Forecasting Anomalies with Time-series Ensembles), a novel unsupervised framework for detecting Precursors-of-Anomaly (PoA) by quantifying predictive uncertainty from a diverse ensemble of time-series forecasting models. Unlike prior approaches that rely on reconstruction errors or require ground-truth labels, FATE anticipates future values and leverages ensemble disagreement to signal early signs of potential anomalies without access to target values at inference time. To rigorously evaluate PoA detection, we introduce Precursor Time-series Aware Precision and Recall (PTaPR), a new metric that extends the traditional Time-series Aware Precision and Recall (TaPR) by jointly assessing segment-level accuracy, within-segment coverage, and temporal promptness of early predictions. This enables a more holistic assessment of early warning capabilities that existing metrics overlook. Experiments on five real-world benchmark datasets show that FATE achieves an average improvement of 19.9 percentage points in PTaPR AUC and 20.02 percentage points in early detection F1 score, outperforming baselines while requiring no anomaly labels. These results demonstrate the effectiveness and practicality of FATE for real-time unsupervised early warning in complex time-series environments.
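The early-warning signal is essentially disagreement across an ensemble of forecasters. A minimal sketch, using per-step standard deviation as the uncertainty proxy (a simplification of the paper's scoring):

```python
import numpy as np

def poa_score(forecasts):
    # forecasts: (num_models, horizon) -- the ensemble's predictions for one window.
    return forecasts.std(axis=0)             # disagreement at each future step

def early_warning(forecasts, threshold=1.0):
    # Flag a potential precursor-of-anomaly before any anomaly is observed.
    return bool(poa_score(forecasts).max() > threshold)

ens = np.stack([np.linspace(0, 1, 24) + 0.05 * i for i in range(5)])
print(early_warning(ens, threshold=0.5))      # low disagreement -> False
```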
[AI-60] Transforming Behavioral Neuroscience Discovery with In-Context Learning and AI-Enhanced Tensor Methods
【Quick Read】: This paper targets the complexity, rigidity, and time cost of discovery pipelines in behavioral neuroscience, particularly mouse fear-generalization studies, where reliance on expert manual data handling and pipeline debugging limits efficient insight. The key to the solution is adopting In-Context Learning (ICL) as the interface between domain experts and AI, letting experts automate data preparation and pattern interpretation without model training or fine-tuning, together with novel AI enhancements to tensor decomposition models for more seamless pattern discovery from heterogeneous experimental data. Empirical evaluation shows the AI-enhanced pipeline outperforms standard domain practice as well as non-ICL machine-learning baselines, with discoveries validated by the domain experts on the team, combining ease of use with strong performance.
Link: https://arxiv.org/abs/2602.17027
Authors: Paimon Goulart, Jordan Steinhauser, Dawon Ahn, Kylene Shuler, Edward Korzus, Jia Chen, Evangelos E. Papalexakis
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:
Abstract:Scientific discovery pipelines typically involve complex, rigid, and time-consuming processes, from data preparation to analyzing and interpreting findings. Recent advances in AI have the potential to transform such pipelines in a way that domain experts can focus on interpreting and understanding findings, rather than debugging rigid pipelines or manually annotating data. As part of an active collaboration between data science/AI researchers and behavioral neuroscientists, we showcase an example AI-enhanced pipeline, specifically designed to transform and accelerate the way that the domain experts in the team are able to gain insights out of experimental data. The application at hand is in the domain of behavioral neuroscience, studying fear generalization in mice, an important problem whose progress can advance our understanding of clinically significant and often debilitating conditions such as PTSD (Post-Traumatic Stress Disorder). We identify the emerging paradigm of “In-Context Learning” (ICL) as a suitable interface for domain experts to automate parts of their pipeline without the need for or familiarity with AI model training and fine-tuning, and showcase its remarkable efficacy in data preparation and pattern interpretation. Also, we introduce novel AI-enhancements to tensor decomposition model, which allows for more seamless pattern discovery from the heterogeneous data in our application. We thoroughly evaluate our proposed pipeline experimentally, showcasing its superior performance compared to what is standard practice in the domain, as well as against reasonable ML baselines that do not fall under the ICL paradigm, to ensure that we are not compromising performance in our quest for a seamless and easy-to-use interface for domain experts. Finally, we demonstrate effective discovery, with results validated by the domain experts in the team.
[AI-61] Sales Research Agent and Sales Research Bench MICRO
【Quick Read】: This paper addresses enterprises' need for explainable, high-quality AI systems over live, customized CRM data, where existing models lack transparent and repeatable evidence of quality. The key to the solution is the Sales Research Agent, an AI-first Microsoft Dynamics 365 Sales application that connects to live CRM and related data, reasons over complex schemas, and produces decision-ready insights as text and charts, together with the Sales Research Bench, a purpose-built benchmark that scores AI systems on eight customer-weighted dimensions (including text and chart groundedness, relevance, explainability, schema accuracy, and chart quality), providing a repeatable way to compare the quality of AI solutions.
Link: https://arxiv.org/abs/2602.17017
Authors: Deepanjan Bhol
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Notes: Technical report. 2 figures. Microsoft Dynamics 365 Sales
Abstract:Enterprises increasingly need AI systems that can answer sales-leader questions over live, customized CRM data, but most available models do not expose transparent, repeatable evidence of quality. This paper describes the Sales Research Agent in Microsoft Dynamics 365 Sales, an AI-first application that connects to live CRM and related data, reasons over complex schemas, and produces decision-ready insights through text and chart outputs. To make quality observable, we introduce the Sales Research Bench, a purpose-built benchmark that scores systems on eight customer-weighted dimensions, including text and chart groundedness, relevance, explainability, schema accuracy, and chart quality. In a 200-question run on a customized enterprise schema on October 19, 2025, the Sales Research Agent outperformed Claude Sonnet 4.5 by 13 points and ChatGPT-5 by 24.1 points on the 100-point composite score, giving customers a repeatable way to compare AI solutions.
[AI-62] M2F: Automated Formalization of Mathematical Literature at Scale
【Quick Read】: This paper addresses the scale limitations of automated formalization of mathematical literature: current methods handle only isolated theorems or short snippets and do not scale to textbook- and paper-level projects, lacking support for cross-file dependency management, import resolution, and end-to-end compilation. The key to the proposed M2F (Math-to-Formal) framework is a two-stage strategy: a statement-compilation stage achieves project-level compilation through atomic chunking, dependency inference, and declaration-skeleton repair, while a proof-repair stage closes proof holes via goal-conditioned local edits under fixed signatures; throughout, the verifier stays in the loop and edits are committed only when toolchain feedback confirms improvement. This enables end-to-end conversion of long-form mathematical text (479 pages of real-analysis and convex-analysis textbooks) into a compilable Lean library of 153,853 lines, markedly advancing the feasibility and efficiency of large-scale formalization.
Link: https://arxiv.org/abs/2602.17016
Authors: Zichen Wang, Wanli Ma, Zhenyu Ming, Gong Zhang, Kun Yuan, Zaiwen Wen
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Notes:
Abstract:Automated formalization of mathematics enables mechanical verification but remains limited to isolated theorems and short snippets. Scaling to textbooks and research papers is largely unaddressed, as it requires managing cross-file dependencies, resolving imports, and ensuring that entire projects compile end-to-end. We present M2F (Math-to-Formal), the first agentic framework for end-to-end, project-scale autoformalization in Lean. The framework operates in two stages. The statement compilation stage splits the document into atomic blocks, orders them via inferred dependencies, and repairs declaration skeletons until the project compiles, allowing placeholders in proofs. The proof repair stage closes these holes under fixed signatures using goal-conditioned local edits. Throughout both stages, M2F keeps the verifier in the loop, committing edits only when toolchain feedback confirms improvement. In approximately three weeks, M2F converts long-form mathematical sources into a project-scale Lean library of 153,853 lines from 479 pages of textbooks on real analysis and convex analysis, fully formalized as Lean declarations with accompanying proofs. This represents textbook-scale formalization at a pace that would typically require months or years of expert effort. On FATE-H, we achieve 96% proof success (vs. 80% for a strong baseline). Together, these results demonstrate that practical, large-scale automated formalization of mathematical literature is within reach. The full generated Lean code from our runs is available at this https URL.
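The verifier-in-the-loop discipline, committing an edit only when toolchain feedback confirms improvement, can be sketched as a simple loop. `compile_project` and `propose_edit` are hypothetical stand-ins for the Lean toolchain and the LLM agent, not M2F's actual interfaces.

```python
def repair_until_compiles(blocks, compile_project, propose_edit, max_iters=100):
    """Keep proposing edits, but commit one only if it reduces compiler errors."""
    errors = compile_project(blocks)          # list of current toolchain errors
    for _ in range(max_iters):
        if not errors:
            return blocks                     # project compiles end-to-end
        candidate = propose_edit(blocks, errors)
        new_errors = compile_project(candidate)
        if len(new_errors) < len(errors):     # verified improvement only
            blocks, errors = candidate, new_errors
    return blocks
```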
[AI-63] Cinder: A fast and fair matchmaking system
【Quick Read】: This paper addresses fair matchmaking between lobbies (pre-made teams) of heterogeneous skill in multiplayer online games, where matching on average skill metrics (mean or median rating) often yields unbalanced games, especially under wide or skewed skill distributions. The key to the proposed two-stage system Cinder: stage one uses the Ruzicka similarity index as a rapid filter over pairs of lobbies with comparable "non-outlier" skill ranges; stage two maps player ranks to non-linear skill buckets generated from an inverted normal distribution, providing finer granularity at average skill levels, and quantifies the fairness of a potential match with the Kantorovich distance, outputting a "Sanction Score" as the measure of match quality.
Link: https://arxiv.org/abs/2602.17015
Authors: Saurav Pal
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Applications (stat.AP)
Notes:
Abstract:A fair and fast matchmaking system is an important component of modern multiplayer online games, directly impacting player retention and satisfaction. However, creating fair matches between lobbies (pre-made teams) of heterogeneous skill levels presents a significant challenge. Matching based simply on average team skill metrics, such as mean or median rating or rank, often results in unbalanced and one-sided games, particularly when skill distributions are wide or skewed. This paper introduces Cinder, a two-stage matchmaking system designed to provide fast and fair matches. Cinder first employs a rapid preliminary filter by comparing the “non-outlier” skill range of lobbies using the Ruzicka similarity index. Lobbies that pass this initial check are then evaluated using a more precise fairness metric. This second stage involves mapping player ranks to a non-linear set of skill buckets, generated from an inverted normal distribution, to provide higher granularity at average skill levels. The fairness of a potential match is then quantified using the Kantorovich distance on the lobbies’ sorted bucket indices, producing a “Sanction Score.” We demonstrate the system’s viability by analyzing the distribution of Sanction Scores from 140 million simulated lobby pairings, providing a robust foundation for fair matchmaking thresholds.
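Both stage metrics have standard closed forms: the Ruzicka index is sum(min)/sum(max) over paired skill vectors, and the Kantorovich distance on one-dimensional bucket indices is the 1-Wasserstein distance. A sketch, assuming non-negative skill-range vectors and integer bucket indices:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def ruzicka(a, b):
    # Stage 1: Ruzicka similarity of two non-negative skill-range vectors.
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.minimum(a, b).sum() / np.maximum(a, b).sum()

def sanction_score(buckets_a, buckets_b):
    # Stage 2: Kantorovich (1-Wasserstein) distance on sorted bucket indices.
    return wasserstein_distance(np.sort(buckets_a), np.sort(buckets_b))

print(ruzicka([10, 14], [9, 16]))                  # quick pre-filter
print(sanction_score([3, 4, 4, 6], [3, 5, 5, 6]))  # lower means a fairer match
```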
[AI-64] Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation
【Quick Read】: This paper addresses the bias in recommendation benchmarks that treat user behavior as the sole ground truth: in financial advisory, users' short-term choices can be distorted by market volatility and diverge from rational long-term goals, so behavioral imitation gets conflated with decision quality, making it impossible to tell whether a model reasons from investor risk preferences or merely copies noise or market sentiment. The key is Conv-FinRe, a conversational and longitudinal benchmark for stock recommendation with multi-view references that explicitly separate descriptive behavior from normative utility, enabling diagnosis of whether an LLM follows rational analysis, mimics user noise, or is driven by market momentum. Built from real market data and human decision trajectories and evaluated over controlled advisory conversations with state-of-the-art LLMs, the benchmark reveals a persistent tension between rational decision quality and behavioral alignment.
Link: https://arxiv.org/abs/2602.16990
Authors: Yan Wang, Yi Han, Lingfei Qian, Yueru He, Xueqing Peng, Dongji Feng, Zhuohan Xie, Vincent Jim Zhang, Rosie Guo, Fengran Mo, Jimin Huang, Yankai Chen, Xue Liu, Jian-Yun Nie
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Notes:
Abstract:Most recommendation benchmarks evaluate how well a model imitates user behavior. In financial advisory, however, observed actions can be noisy or short-sighted under market volatility and may conflict with a user’s long-term goals. Treating what users chose as the sole ground truth, therefore, conflates behavioral imitation with decision quality. We introduce Conv-FinRe, a conversational and longitudinal benchmark for stock recommendation that evaluates LLMs beyond behavior matching. Given an onboarding interview, step-wise market context, and advisory dialogues, models must generate rankings over a fixed investment horizon. Crucially, Conv-FinRe provides multi-view references that distinguish descriptive behavior from normative utility grounded in investor-specific risk preferences, enabling diagnosis of whether an LLM follows rational analysis, mimics user noise, or is driven by market momentum. We build the benchmark from real market data and human decision trajectories, instantiate controlled advisory conversations, and evaluate a suite of state-of-the-art LLMs. Results reveal a persistent tension between rational decision quality and behavioral alignment: models that perform well on utility-based ranking often fail to match user choices, whereas behaviorally aligned models can overfit short-term noise. The dataset is publicly released on Hugging Face, and the codebase is available on GitHub.
[AI-65] Fundamental Limits of Black-Box Safety Evaluation: Information-Theoretic and Computational Barriers from Latent Context Conditioning
【Quick Read】: This paper examines the reliability of a core assumption of black-box safety evaluation: that a model's behavior on test distributions reliably predicts its deployment performance. The authors challenge this assumption via latent context-conditioned policies, models whose behavior depends on unobserved internal variables that are rare under evaluation but prevalent under deployment, and prove that in such cases no black-box evaluator can reliably estimate deployment risk. The key results are threefold: (1) minimax lower bounds for passive evaluation via Le Cam's method, showing that even i.i.d. sampling cannot avoid substantial error; (2) via a hash-based trigger construction and Yao's principle, adaptive querying also cannot remove worst-case error, and detection requires Θ(1/ε) queries; (3) under trapdoor one-way function assumptions, a computational separation: deployment environments with privileged information can activate unsafe behaviors that no polynomial-time evaluator can distinguish. These results quantify the statistical underdetermination of black-box testing and make explicit when additional safeguards such as architectural constraints, training-time guarantees, interpretability, and deployment monitoring are necessary for worst-case safety assurance.
Link: https://arxiv.org/abs/2602.16984
Authors: Vishal Srivastava
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Notes:
Abstract:Black-box safety evaluation of AI systems assumes model behavior on test distributions reliably predicts deployment performance. We formalize and challenge this assumption through latent context-conditioned policies – models whose outputs depend on unobserved internal variables that are rare under evaluation but prevalent under deployment. We establish fundamental limits showing that no black-box evaluator can reliably estimate deployment risk for such models. (1) Passive evaluation: For evaluators sampling i.i.d. from D_eval, we prove minimax lower bounds via Le Cam's method: any estimator incurs expected absolute error ≥ (5/24)δL ≈ 0.208δL, where δ is the trigger probability under deployment and L is the loss gap. (2) Adaptive evaluation: Using a hash-based trigger construction and Yao's minimax principle, worst-case error remains ≥ δL/16 even for fully adaptive querying when D_dep is supported over a sufficiently large domain; detection requires Θ(1/ε) queries. (3) Computational separation: Under trapdoor one-way function assumptions, deployment environments possessing privileged information can activate unsafe behaviors that any polynomial-time evaluator without the trapdoor cannot distinguish. For white-box probing, estimating deployment risk to accuracy ε_R requires O(1/(γ^2 ε_R^2)) samples, where γ = α_0 + α_1 - 1 measures probe quality, and we provide explicit bias correction under probe error. Our results quantify when black-box testing is statistically underdetermined and provide explicit criteria for when additional safeguards – architectural constraints, training-time guarantees, interpretability, and deployment monitoring – are mathematically necessary for worst-case safety assurance.
[AI-66] Early-Warning Signals of Grokking via Loss-Landscape Geometry
【Quick Read】: This paper asks whether grokking, the abrupt transition from memorization to generalization during training, is universal and whether its mechanism can be identified and intervened upon. Studying two sequence-learning tasks, SCAN compositional generalization and Dyck-1 depth prediction, the authors find that the commutator defect arising from non-commuting gradient updates rises well before generalization, with lead times following a superlinear power law (roughly 1.18 for SCAN, 1.13 for Dyck), consistent with findings on modular arithmetic. The key contribution is identifying the commutator defect as an architecture-agnostic, causally implicated early-warning signal: amplifying non-commutativity accelerates grokking (about 32% on SCAN, about 50% on Dyck), while suppressing orthogonal gradient flow delays or prevents it; although the three task families differ in causal sensitivity (modular arithmetic most rigid, Dyck most responsive, SCAN intermediate), suppression blocks generalization in all cases, establishing the commutator defect as necessary for delayed generalization.
Link: https://arxiv.org/abs/2602.16967
Authors: Yongzhong Xu
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: 26 pages, 13 figures
Abstract:Grokking – the abrupt transition from memorization to generalization after prolonged training – has been linked to confinement on low-dimensional execution manifolds in modular arithmetic. Whether this mechanism extends beyond arithmetic remains open. We study two sequence-learning benchmarks: SCAN compositional generalization and Dyck-1 depth prediction. Across both tasks and a wide range of learning rates, the commutator defect – a curvature measure derived from non-commuting gradient updates – rises well before generalization, with lead times following a superlinear power law (α ≈ 1.18 for SCAN, ≈ 1.13 for Dyck), consistent with prior results on modular arithmetic. Weight-space PCA reveals that spectral concentration is not a universal precursor; the commutator defect is. Causal interventions demonstrate a mechanistic role: amplifying non-commutativity accelerates grokking (roughly 32% on SCAN, roughly 50% on Dyck), while suppressing orthogonal gradient flow delays or prevents it. The three task families form a spectrum of causal sensitivity – modular arithmetic is rigid, Dyck is responsive, SCAN is intermediate – yet suppression delays or prevents grokking in all cases, establishing necessity as a universal finding. These results identify the commutator defect as a robust, architecture-agnostic, causally implicated early-warning signal for delayed generalization in transformers.
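A finite-difference way to probe non-commutativity of updates: take two gradient steps in both orders and measure the gap. This is our simplified proxy for the paper's commutator defect, not its exact definition.

```python
import numpy as np

def sgd_step(theta, grad, lr):
    return theta - lr * grad(theta)

def commutator_defect(theta, grad_f, grad_g, lr=1e-2):
    # Order-sensitivity of two updates: zero iff the steps commute here.
    ab = sgd_step(sgd_step(theta, grad_f, lr), grad_g, lr)  # f-step then g-step
    ba = sgd_step(sgd_step(theta, grad_g, lr), grad_f, lr)  # g-step then f-step
    return np.linalg.norm(ab - ba)

grad_f = lambda t: 2 * t                    # toy quadratic loss
grad_g = lambda t: np.array([t[1], t[0]])   # a coupling term
print(commutator_defect(np.array([1.0, -0.5]), grad_f, grad_g))
```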
[AI-67] A Unified Framework for Locality in Scalable MARL
【Quick Read】: This paper addresses the curse of dimensionality in multi-agent reinforcement learning (MARL), where scalability hinges on exploiting locality. Traditional conditions guaranteeing the exponential decay property (EDP) are environment-only and worst-case (e.g., taking supremums over actions), ignoring the regularizing effect of the policy itself. The key innovation is a decomposition of the policy-induced interdependence matrix H^π that separates the environment's sensitivity to state and action (E^s and E^a) from the policy's sensitivity to state (Π(π)), revealing that a smooth policy (small Π(π)) can induce locality even in strongly action-coupled environments and exposing a fundamental locality-optimality tradeoff. From this framework the authors derive a tighter spectral condition ρ(E^s + E^a Π(π)) < 1 guaranteeing exponential decay and design a provably sound localized block-coordinate policy improvement framework whose guarantees are tied directly to that spectral radius.
Link: https://arxiv.org/abs/2602.16966
Authors: Sourav Chakraborty, Amit Kiran Rege, Claire Monteleoni, Lijun Chen
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:
Abstract:Scalable Multi-Agent Reinforcement Learning (MARL) is fundamentally challenged by the curse of dimensionality. A common solution is to exploit locality, which hinges on an Exponential Decay Property (EDP) of the value function. However, existing conditions that guarantee the EDP are often conservative, as they are based on worst-case, environment-only bounds (e.g., supremums over actions) and fail to capture the regularizing effect of the policy itself. In this work, we establish that locality can also be a policy-dependent phenomenon. Our central contribution is a novel decomposition of the policy-induced interdependence matrix H^π, which decouples the environment's sensitivity to state (E^s) and action (E^a) from the policy's sensitivity to state (Π(π)). This decomposition reveals that locality can be induced by a smooth policy (small Π(π)) even when the environment is strongly action-coupled, exposing a fundamental locality-optimality tradeoff. We use this framework to derive a general spectral condition ρ(E^s + E^a Π(π)) < 1 for exponential decay, which is strictly tighter than prior norm-based conditions. Finally, we leverage this theory to analyze a provably-sound localized block-coordinate policy improvement framework with guarantees tied directly to this spectral radius.
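The spectral condition is directly checkable once the three sensitivity matrices are estimated. A sketch, assuming entrywise non-negative matrices of compatible shape:

```python
import numpy as np

def exponential_decay_holds(E_s, E_a, Pi):
    # Checks rho(E_s + E_a @ Pi) < 1, the paper's condition for the EDP,
    # where E_s/E_a bound environment sensitivity and Pi the policy's.
    H = E_s + E_a @ Pi
    rho = max(abs(np.linalg.eigvals(H)))
    return rho < 1.0, rho

E_s = np.array([[0.3, 0.1], [0.0, 0.4]])
E_a = np.array([[0.2, 0.0], [0.1, 0.2]])
Pi  = np.array([[0.5, 0.2], [0.3, 0.5]])
print(exponential_decay_holds(E_s, E_a, Pi))   # (True, rho) for this toy case
```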
[AI-68] Automating Agent Hijacking via Structural Template Injection
【Quick Read】: This paper addresses agent hijacking in large language model (LLM) agent systems caused by structured template injection, where malicious instructions injected through retrieved content mislead the agent into unintended behavior. Existing attacks rely on manually crafted, semantics-driven prompt manipulation, yielding low success rates and poor transferability to closed-source commercial models. The key of the proposed Phantom framework is exploiting LLM agents' reliance on specific chat template tokens: injecting optimized structured templates into the retrieved context induces role confusion, so the agent misreads malicious content as legitimate user instructions or tool outputs. To improve black-box transferability, a Template Autoencoder (TAE) combined with Bayesian optimization searches a continuous latent space for high-potency adversarial templates, substantially raising the attack success rate (ASR) and query efficiency.
Link: https://arxiv.org/abs/2602.16958
Authors: Xinhao Deng, Jiaqing Wu, Miao Chen, Yue Xiao, Ke Xu, Qi Li
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:
Abstract:Agent hijacking, highlighted by OWASP as a critical threat to the Large Language Model (LLM) ecosystem, enables adversaries to manipulate execution by injecting malicious instructions into retrieved content. Most existing attacks rely on manually crafted, semantics-driven prompt manipulation, which often yields low attack success rates and limited transferability to closed-source commercial models. In this paper, we propose Phantom, an automated agent hijacking framework built upon Structured Template Injection that targets the fundamental architectural mechanisms of LLM agents. Our key insight is that agents rely on specific chat template tokens to separate system, user, assistant, and tool instructions. By injecting optimized structured templates into the retrieved context, we induce role confusion and cause the agent to misinterpret the injected content as legitimate user instructions or prior tool outputs. To enhance attack transferability against black-box agents, Phantom introduces a novel attack template search framework. We first perform multi-level template augmentation to increase structural diversity and then train a Template Autoencoder (TAE) to embed discrete templates into a continuous, searchable latent space. Subsequently, we apply Bayesian optimization to efficiently identify optimal adversarial vectors that are decoded into high-potency structured templates. Extensive experiments on Qwen, GPT, and Gemini demonstrate that our framework significantly outperforms existing baselines in both Attack Success Rate (ASR) and query efficiency. Moreover, we identified over 70 vulnerabilities in real-world commercial products that have been confirmed by vendors, underscoring the practical severity of structured template-based hijacking and providing an empirical foundation for securing next-generation agentic systems.
[AI-69] LLM4Cov: Execution-Aware Agentic Learning for High-coverage Testbench Generation
【Quick Read】: This paper addresses the impracticality of online reinforcement learning (RL) for high-coverage hardware verification, where execution feedback from industrial simulators is expensive, slow, and non-differentiable. The key of the proposed offline agent-learning framework LLM4Cov is to model verification as memoryless state transitions guided by deterministic evaluators, combined with three techniques: execution-validated data curation, policy-aware agentic data synthesis, and worst-state-prioritized sampling, enabling scalable learning under execution constraints. On a reality-aligned benchmark adapted from an existing verification suite, a compact 4B-parameter model reaches a 69.2% coverage pass rate, outperforming its teacher by 5.3% and rivaling models an order of magnitude larger.
Link: https://arxiv.org/abs/2602.16953
Authors: Hejia Zhang, Zhongming Yu, Chia-Tung Ho, Haoxing Ren, Brucek Khailany, Jishen Zhao
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:
Abstract:Execution-aware LLM agents offer a promising paradigm for learning from tool feedback, but such feedback is often expensive and slow to obtain, making online reinforcement learning (RL) impractical. High-coverage hardware verification exemplifies this challenge due to its reliance on industrial simulators and non-differentiable execution signals. We propose LLM4Cov, an offline agent-learning framework that models verification as memoryless state transitions guided by deterministic evaluators. Building on this formulation, we introduce execution-validated data curation, policy-aware agentic data synthesis, and worst-state-prioritized sampling to enable scalable learning under execution constraints. We further curate a reality-aligned benchmark adapted from an existing verification suite through a revised evaluation protocol. Using the proposed pipeline, a compact 4B-parameter model achieves 69.2% coverage pass rate under agentic evaluation, outperforming its teacher by 5.3% and demonstrating competitive performance against models an order of magnitude larger.
[AI-70] Beyond Message Passing: A Symbolic Alternative for Expressive and Interpretable Graph Learning
【Quick Read】: This paper addresses the trust problem posed by the black-box nature of graph neural networks (GNNs) in high-stakes domains such as drug discovery, where existing self-explainable GNNs inherit the 1-Weisfeiler-Lehman (1-WL) expressivity barrier of standard message passing and lack fine-grained interpretability. The key of the proposed SymGraph framework is to replace continuous message passing with discrete structural hashing and topological role-based aggregation, theoretically surpassing the 1-WL barrier and achieving greater expressiveness without the overhead of differentiable optimization, while generating rules with finer semantic granularity that markedly improve transparency and scientific interpretability.
Link: https://arxiv.org/abs/2602.16947
Authors: Chuqin Geng, Li Zhang, Haolin Ye, Ziyu Zhao, Yuhe Jiang, Tara Saba, Xinyu Wang, Xujie Si
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: 23 pages, 9 figures
Abstract:Graph Neural Networks (GNNs) have become essential in high-stakes domains such as drug discovery, yet their black-box nature remains a significant barrier to trustworthiness. While self-explainable GNNs attempt to bridge this gap, they often rely on standard message-passing backbones that inherit fundamental limitations, including the 1-Weisfeiler-Lehman (1-WL) expressivity barrier and a lack of fine-grained interpretability. To address these challenges, we propose SymGraph, a symbolic framework designed to transcend these constraints. By replacing continuous message passing with discrete structural hashing and topological role-based aggregation, our architecture theoretically surpasses the 1-WL barrier, achieving superior expressiveness without the overhead of differentiable optimization. Extensive empirical evaluations demonstrate that SymGraph achieves state-of-the-art performance, outperforming existing self-explainable GNNs. Notably, SymGraph delivers 10x to 100x speedups in training time using only CPU execution. Furthermore, SymGraph generates rules with superior semantic granularity compared to existing rule-based methods, offering great potential for scientific discovery and explainable AI.
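Discrete structural hashing in the WL style is a useful reference point for what replaces message passing here: each round, a node's code becomes a hash of its own code and its neighbors' sorted codes. This generic sketch is a stand-in, not SymGraph's exact scheme (which adds topological role-based aggregation to go beyond 1-WL):

```python
import hashlib

def structural_hashes(adj, labels, rounds=3):
    # adj: node -> list of neighbors; labels: node -> initial label.
    codes = {v: str(l) for v, l in labels.items()}
    for _ in range(rounds):
        new = {}
        for v, nbrs in adj.items():
            # Combine own code with the multiset of neighbor codes, then hash.
            sig = codes[v] + "|" + ",".join(sorted(codes[u] for u in nbrs))
            new[v] = hashlib.sha1(sig.encode()).hexdigest()[:12]
        codes = new
    return codes

adj = {0: [1], 1: [0, 2], 2: [1]}          # a small path graph
print(structural_hashes(adj, {0: "C", 1: "N", 2: "C"}))
```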
[AI-71] Mind the GAP: Text Safety Does Not Transfer to Tool-Call Safety in LLM Agents
【Quick Read】: This paper addresses a key gap in evaluating large language models (LLMs) deployed as agents: text-level safety assessments do not reflect the safety of their tool-call behavior. Mainstream safety evaluations focus on text-level refusals, overlooking the possibility that a model outputs "safe" text while simultaneously executing a harmful tool call, i.e., a divergence between text safety and tool-call safety. The key to the solution is the GAP benchmark, which systematically quantifies this divergence and validates it through a large-scale empirical study (17,420 analysis-ready datapoints): even under safety-reinforced system prompts, many cases remain in which a model refuses a request in text yet executes the forbidden tool call (the GAP phenomenon). The study further shows that system prompt wording exerts substantial influence on tool-call behavior, and that runtime governance contracts reduce information leakage but do not effectively deter forbidden tool calls, underscoring that tool-call behavior requires dedicated evaluation and mitigation.
Link: https://arxiv.org/abs/2602.16943
Authors: Arnold Cartagena,Ariane Teixeira
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 23 pages, 5 figures, 4 tables, code and data at this https URL
Abstract:Large language models deployed as agents increasingly interact with external systems through tool calls–actions with real-world consequences that text outputs alone do not carry. Safety evaluations, however, overwhelmingly measure text-level refusal behavior, leaving a critical question unanswered: does alignment that suppresses harmful text also suppress harmful actions? We introduce the GAP benchmark, a systematic evaluation framework that measures divergence between text-level safety and tool-call-level safety in LLM agents. We test six frontier models across six regulated domains (pharmaceutical, financial, educational, employment, legal, and infrastructure), seven jailbreak scenarios per domain, three system prompt conditions (neutral, safety-reinforced, and tool-encouraging), and two prompt variants, producing 17,420 analysis-ready datapoints. Our central finding is that text safety does not transfer to tool-call safety. Across all six models, we observe instances where the model’s text output refuses a harmful request while its tool calls simultaneously execute the forbidden action–a divergence we formalize as the GAP metric. Even under safety-reinforced system prompts, 219 such cases persist across all six models. System prompt wording exerts substantial influence on tool-call behavior: TC-safe rates span 21 percentage points for the most robust model and 57 for the most prompt-sensitive, with 16 of 18 pairwise ablation comparisons remaining significant after Bonferroni correction. Runtime governance contracts reduce information leakage in all six models but produce no detectable deterrent effect on forbidden tool-call attempts themselves. These results demonstrate that text-only safety evaluations are insufficient for assessing agent behavior and that tool-call safety requires dedicated measurement and mitigation.
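The GAP metric, as described, is the divergence between text-level refusal and tool-call execution. A minimal sketch of how such a rate could be computed over labeled trials; the boolean labels and function name are assumptions, not the authors' code:

```python
import numpy as np

def gap_rate(text_refused, tool_executed):
    """Fraction of trials where the text output refuses the request
    while a tool call simultaneously executes the forbidden action."""
    text_refused = np.asarray(text_refused, dtype=bool)
    tool_executed = np.asarray(tool_executed, dtype=bool)
    divergent = text_refused & tool_executed
    return divergent.mean()

# Toy example: 6 trials; trials 2 and 5 refuse in text but still call the tool.
text_refused  = [True, True, False, True, True, False]
tool_executed = [False, True, False, False, True, True]
print(f"GAP rate: {gap_rate(text_refused, tool_executed):.2f}")  # 0.33
```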
[AI-72] SourceBench: Can AI Answers Reference Quality Web Sources?
【Quick Read】: This paper addresses the problem that current large language models (LLMs) are evaluated mainly on answer correctness while the quality of the web sources they cite is neglected. Existing evaluations lack systematic measures of the credibility, relevance, and authority of cited material, so generative AI outputs may carry factual bias or misleading content. The key to the solution is SourceBench, a benchmark of 100 real-world queries that assesses sources with eight metrics covering content quality (content relevance, factual accuracy, objectivity) and page-level signals (e.g., freshness, authority/accountability, clarity), together with a human-labeled dataset and a calibrated LLM-based evaluator closely aligned with expert judgment. An empirical study of 8 LLMs, Google Search, and 3 AI search tools over 3,996 cited sources yields four key insights to guide future research on combining generative AI with web search.
Link: https://arxiv.org/abs/2602.16942
Authors: Hexi Jin,Stephen Liu,Yuheng Li,Simran Malik,Yiying Zhang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) increasingly answer queries by citing web sources, but existing evaluations emphasize answer correctness rather than evidence quality. We introduce SourceBench, a benchmark for measuring the quality of cited web sources across 100 real-world queries spanning informational, factual, argumentative, social, and shopping intents. SourceBench uses an eight-metric framework covering content quality (content relevance, factual accuracy, objectivity) and page-level signals (e.g., freshness, authority/accountability, clarity), and includes a human-labeled dataset with a calibrated LLM-based evaluator that matches expert judgments closely. We evaluate eight LLMs, Google Search, and three AI search tools over 3996 cited sources using SourceBench and conduct further experiments to understand the evaluation results. Overall, our work reveals four key new insights that can guide future research in the direction of GenAI and web search.
[AI-73] DeepContext: Stateful Real-Time Detection of Multi-Turn Adversarial Intent Drift in LLMs
【Quick Read】: This paper addresses the "safety gap" that arises because LLM safety guardrails lack temporal awareness in multi-turn dialogue. Existing stateless safety filters treat multi-turn conversations as isolated events and cannot detect malicious intent that accumulates gradually across turns (as in Crescendo and ActorAttack attacks). The key to the solution is DeepContext, a stateful monitoring framework based on a recurrent neural network (RNN) that ingests fine-grained turn-level embeddings and propagates a hidden state to explicitly model the temporal evolution of user intent, capturing the incremental risk that traditional methods overlook. Experiments show DeepContext achieves an F1 score of 0.84 on multi-turn jailbreak detection, clearly surpassing mainstream cloud-provider guardrails and open-weight models, while keeping inference latency under 20 ms, demonstrating that temporal modeling improves both safety and efficiency.
Link: https://arxiv.org/abs/2602.16935
Authors: Justin Albrethsen,Yash Datta,Kunal Kumar,Sharath Rajasekar
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: 18 Pages, 7 Tables, 1 Figure
Abstract:While Large Language Model (LLM) capabilities have scaled, safety guardrails remain largely stateless, treating multi-turn dialogues as a series of disconnected events. This lack of temporal awareness facilitates a “Safety Gap” where adversarial tactics, like Crescendo and ActorAttack, slowly bleed malicious intent across turn boundaries to bypass stateless filters. We introduce DeepContext, a stateful monitoring framework designed to map the temporal trajectory of user intent. DeepContext discards the isolated evaluation model in favor of a Recurrent Neural Network (RNN) architecture that ingests a sequence of fine-tuned turn-level embeddings. By propagating a hidden state across the conversation, DeepContext captures the incremental accumulation of risk that stateless models overlook. Our evaluation demonstrates that DeepContext significantly outperforms existing baselines in multi-turn jailbreak detection, achieving a state-of-the-art F1 score of 0.84, which represents a substantial improvement over both hyperscaler cloud-provider guardrails and leading open-weight models such as Llama-Prompt-Guard-2 (0.67) and Granite-Guardian (0.67). Furthermore, DeepContext maintains a sub-20ms inference overhead on a T4 GPU, ensuring viability for real-time applications. These results suggest that modeling the sequential evolution of intent is a more effective and computationally efficient alternative to deploying massive, stateless models.
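A minimal sketch of the stateful-monitor idea, assuming precomputed turn-level embeddings and hypothetical dimensions; the paper's actual architecture and training setup may differ:

```python
import torch
import torch.nn as nn

class StatefulIntentMonitor(nn.Module):
    """Toy stateful monitor: a GRU consumes one embedding per dialogue
    turn and propagates a hidden state that accumulates risk."""
    def __init__(self, emb_dim=384, hidden_dim=128):
        super().__init__()
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)  # per-turn risk logit

    def forward(self, turn_embeddings):  # (batch, turns, emb_dim)
        states, _ = self.rnn(turn_embeddings)
        return torch.sigmoid(self.head(states)).squeeze(-1)  # (batch, turns)

monitor = StatefulIntentMonitor()
conversation = torch.randn(1, 7, 384)  # 7 turns of precomputed embeddings
risk_per_turn = monitor(conversation)
print(risk_per_turn.shape)  # torch.Size([1, 7])
```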
[AI-74] Narrow fine-tuning erodes safety alignment in vision-language agents
【Quick Read】: This paper addresses how safety alignment degrades when multimodal agents in lifelong-learning settings are post-trained on harmful data, i.e., how to keep model behavior safe and aligned while continually adapting to new tasks. The key finding is that fine-tuning with as little as 10% harmful data induces broad and severe misalignment, and the misalignment is substantially higher under vision-language evaluation than under text-only evaluation (at LoRA rank r=128, multimodal misalignment reaches 70.71±1.22 versus 41.19±2.51 for text-only). Geometric analysis further shows harmful behaviors concentrate in a low-dimensional subspace, with most of the information captured by the top 10 principal components. Two mitigation strategies are evaluated, benign narrow fine-tuning and activation-based steering; both substantially reduce misalignment but neither fully removes the learned harmful behaviors, highlighting the limits of current post-training paradigms for preserving alignment after deployment and the need for more robust continual-learning frameworks.
Link: https://arxiv.org/abs/2602.16931
Authors: Idhant Gulati,Shivam Raval
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 24 pages, 11 figures
Abstract:Lifelong multimodal agents must continuously adapt to new tasks through post-training, but this creates fundamental tension between acquiring capabilities and preserving safety alignment. We demonstrate that fine-tuning aligned vision-language models on narrow-domain harmful datasets induces severe emergent misalignment that generalizes broadly across unrelated tasks and modalities. Through experiments on Gemma3-4B, we show that misalignment scales monotonically with LoRA rank, and that multimodal evaluation reveals substantially higher misalignment ( 70.71 \pm 1.22 at r=128 ) than text-only evaluation ( 41.19 \pm 2.51 ), suggesting that unimodal safety benchmarks may underestimate alignment degradation in vision-language models. Critically, even 10% harmful data in the training mixture induces substantial alignment degradation. Geometric analysis reveals that harmful behaviors occupy a remarkably low-dimensional subspace, with the majority of misalignment information captured in 10 principal components. To mitigate misalignment, we evaluate two strategies: benign narrow fine-tuning and activation-based steering. While both approaches substantially reduce misalignment, neither completely removes the learned harmful behaviors. Our findings highlight the need for robust continual learning frameworks, as current post-training paradigms may not sufficiently preserve alignment in post-deployment settings.
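Activation-based steering, one of the two mitigations evaluated, is commonly implemented by removing a learned "harm direction" from hidden activations; a generic sketch under that assumption (the direction and scale are placeholders, not the paper's values):

```python
import torch

def steer_away(activations, harm_direction, alpha=1.0):
    """Subtract the component of the residual-stream activations that
    lies along a presumed 'harm direction' (activation-based steering)."""
    d = harm_direction / harm_direction.norm()
    coeff = activations @ d                    # (batch, seq)
    return activations - alpha * coeff.unsqueeze(-1) * d

acts = torch.randn(2, 16, 512)                 # (batch, seq, hidden)
harm_dir = torch.randn(512)                    # e.g. a difference-of-means direction
print(steer_away(acts, harm_dir).shape)        # torch.Size([2, 16, 512])
```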
[AI-75] A Reversible Semantics for Janus
【Quick Read】: This paper addresses the fact that the small-step semantics of the Janus language is not reversible: it loses information while computing forwards and therefore fails the Loop Lemma (every reduction has an inverse), a central property of reversibility in process calculi, so it cannot support true reversible computation. The key to the solution is a novel small-step semantics that preserves execution-path information via a "program counter" mechanism, achieving reversibility while remaining equivalent to the original semantics, a non-trivial design challenge for a high-level programming language.
Link: https://arxiv.org/abs/2602.16913
Authors: Ivan Lanese,Germán Vidal
Affiliations: Unknown
Subjects: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: Submitted for publication
Abstract:Janus is a paradigmatic example of reversible programming language. Indeed, Janus programs can be executed backwards as well as forwards. However, its small-step semantics (useful, e.g., for debugging or as a basis for extensions with concurrency primitives) is not reversible, since it loses information while computing forwards. E.g., it does not satisfy the Loop Lemma, stating that any reduction has an inverse, a main property of reversibility in process calculi, where small-step semantics is commonly used. We present here a novel small-step semantics which is actually reversible, while remaining equivalent to the previous one. It involves the non-trivial challenge of defining a semantics based on a “program counter” for a high-level programming language.
[AI-76] LLM-WikiRace: Benchmarking Long-term Planning and Reasoning over Real-World Knowledge Graphs
【Quick Read】: This paper addresses the difficulty of evaluating large language models' (LLMs) planning, reasoning, and world-knowledge integration on complex tasks. Existing benchmarks struggle to measure multi-step decision making, understanding of conceptual connections, and long-horizon path planning, and rarely probe real-world knowledge together with dynamic strategy adjustment. The authors propose LLM-WikiRace, a benchmark built on navigating Wikipedia hyperlinks: starting from a source page, the model must reach a target page by following links step by step, testing look-ahead planning, reasoning about how concepts connect, and world knowledge. The key lies in a clearly structured, difficulty-tiered task framework that exposes how frontier models, despite superhuman performance on easy settings, hit clear bottlenecks on hard tasks, especially weak replanning after failure and frequent looping, showing that long-horizon reasoning and dynamic adaptation remain core challenges for current systems.
Link: https://arxiv.org/abs/2602.16902
Authors: Juliusz Ziomek,William Bankes,Lorenz Wolf,Shyam Sundhar Ramesh,Xiaohang Tang,Ilija Bogunovic
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:We introduce LLM-Wikirace, a benchmark for evaluating planning, reasoning, and world knowledge in large language models (LLMs). In LLM-Wikirace, models must efficiently navigate Wikipedia hyperlinks step by step to reach a target page from a given source, requiring look-ahead planning and the ability to reason about how concepts are connected in the real world. We evaluate a broad set of open- and closed-source models, including Gemini-3, GPT-5, and Claude Opus 4.5, which achieve the strongest results on the easy level of the task and demonstrate superhuman performance. Despite this, performance drops sharply on hard difficulty: the best-performing model, Gemini-3, succeeds in only 23% of hard games, highlighting substantial remaining challenges for frontier models. Our analysis shows that world knowledge is a necessary ingredient for success, but only up to a point; beyond this threshold, planning and long-horizon reasoning capabilities become the dominant factors. Trajectory-level analysis further reveals that even the strongest models struggle to replan after failure, frequently entering loops rather than recovering. LLM-Wikirace is a simple benchmark that reveals clear limitations in current reasoning systems, offering an open arena where planning-capable LLMs still have much to prove. Our code and leaderboard are available at https://llmwikirace.github.io.
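The benchmark's navigation task can be pictured with a toy evaluation loop; `choose_link` stands in for an LLM call, and the link graph is a made-up miniature, not the benchmark's data:

```python
def wikirace(choose_link, get_links, source, target, max_steps=30):
    """Toy evaluation loop: starting at `source`, the agent picks one
    outgoing hyperlink per step until it reaches `target` or runs out
    of budget. `choose_link` stands in for an LLM call."""
    page, path = source, [source]
    for _ in range(max_steps):
        if page == target:
            return path
        page = choose_link(page, get_links(page), target)
        path.append(page)
    return None  # failed run

# Toy link graph standing in for Wikipedia.
graph = {"Cat": ["Mammal", "Pet"], "Mammal": ["Animal"],
         "Pet": ["Animal"], "Animal": []}
print(wikirace(lambda p, links, t: links[0], graph.get, "Cat", "Animal"))
```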
[AI-77] AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks
【Quick Read】: This paper addresses a new class of security threats to LLM agents in long-horizon, complex environments: long-horizon attacks that exploit multi-turn user-agent-environment interactions to achieve objectives infeasible in single-turn settings. To quantify such risks, the authors propose AgentLAB, the first benchmark dedicated to evaluating LLM agents' susceptibility to adaptive long-horizon attacks. The key contribution is a systematic evaluation framework covering five novel attack types (intent hijacking, tool chaining, task injection, objective drifting, and memory poisoning) across 28 realistic agentic environments and 644 security test cases; the empirical finding that defenses designed for single-turn interactions fail against long-horizon threats motivates research on securing LLM agents for practical deployment.
Link: https://arxiv.org/abs/2602.16901
Authors: Tanqiu Jiang,Yuhui Wang,Jiacheng Liang,Ting Wang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:LLM agents are increasingly deployed in long-horizon, complex environments to solve challenging problems, but this expansion exposes them to long-horizon attacks that exploit multi-turn user-agent-environment interactions to achieve objectives infeasible in single-turn settings. To measure agent vulnerabilities to such risks, we present AgentLAB, the first benchmark dedicated to evaluating LLM agent susceptibility to adaptive, long-horizon attacks. Currently, AgentLAB supports five novel attack types including intent hijacking, tool chaining, task injection, objective drifting, and memory poisoning, spanning 28 realistic agentic environments, and 644 security test cases. Leveraging AgentLAB, we evaluate representative LLM agents and find that they remain highly susceptible to long-horizon attacks; moreover, defenses designed for single-turn interactions fail to reliably mitigate long-horizon threats. We anticipate that AgentLAB will serve as a valuable benchmark for tracking progress on securing LLM agents in practical settings. The benchmark is publicly available at this https URL.
[AI-78] OpenSage: Self-programming Agent Generation Engine
【Quick Read】: This paper addresses the limitations of current agent development kits (ADKs), which either lack sufficient functional support or rely on humans to manually design agent topology, toolsets, and memory management, constraining agents' generalizability and overall performance. The key to the solution is OpenSage, the first ADK that enables LLMs to automatically build agents with self-generated topologies and toolsets while providing comprehensive, structured memory support. Its core innovations include functionality for agents to autonomously create and manage their own sub-agents and toolkits, a hierarchical graph-based memory system, and a specialized toolkit tailored to software-engineering tasks, substantially improving agent autonomy and task performance.
Link: https://arxiv.org/abs/2602.16891
Authors: Hongwei Li,Zhun Wang,Qinrun Dai,Yuzhou Nie,Jinjun Peng,Ruitong Liu,Jingyang Zhang,Kaijie Zhu,Jingxuan He,Lun Wang,Yangruibo Ding,Yueqi Chen,Wenbo Guo,Dawn Song
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Software Engineering (cs.SE)
Comments:
Abstract:Agent development kits (ADKs) provide effective platforms and tooling for constructing agents, and their designs are critical to the constructed agents’ performance, especially the functionality for agent topology, tools, and memory. However, current ADKs either lack sufficient functional support or rely on humans to manually design these components, limiting agents’ generalizability and overall performance. We propose OpenSage, the first ADK that enables LLMs to automatically create agents with self-generated topology and toolsets while providing comprehensive and structured memory support. OpenSage offers effective functionality for agents to create and manage their own sub-agents and toolkits. It also features a hierarchical, graph-based memory system for efficient management and a specialized toolkit tailored to software engineering tasks. Extensive experiments across three state-of-the-art benchmarks with various backbone models demonstrate the advantages of OpenSage over existing ADKs. We also conduct rigorous ablation studies to demonstrate the effectiveness of our design for each component. We believe OpenSage can pave the way for the next generation of agent development, shifting the focus from human-centered to AI-centered paradigms.
[AI-79] Position: Why a Dynamical Systems Perspective is Needed to Advance Time Series Modeling
【Quick Read】: This paper argues that despite rapid progress in time-series (TS) modeling, including foundation models, actual progress remains unclear and theoretically ungrounded. The key to the proposed remedy is a dynamical systems (DS) perspective: dynamical systems reconstruction (DSR) methods infer surrogate models of the governing dynamics from observations, which not only improves short-term forecasts but also enables predicting a system's long-term statistics and provides domain-independent theoretical insight, e.g., into upper bounds on performance, generalization to unseen regimes such as tipping points, and control strategies, ultimately enabling more efficient and more generalizable time-series modeling with lower computational and memory footprints.
Link: https://arxiv.org/abs/2602.16864
Authors: Daniel Durstewitz,Christoph Jürgen Hemmer,Florian Hess,Charlotte Ricarda Doll,Lukas Eisenmann
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS)
Comments:
Abstract:Time series (TS) modeling has come a long way from early statistical, mainly linear, approaches to the current trend in TS foundation models. With a lot of hype and industrial demand in this field, it is not always clear how much progress there really is. To advance TS forecasting and analysis to the next level, here we argue that the field needs a dynamical systems (DS) perspective. TS of observations from natural or engineered systems almost always originate from some underlying DS, and arguably access to its governing equations would yield theoretically optimal forecasts. This is the promise of DS reconstruction (DSR), a class of ML/AI approaches that aim to infer surrogate models of the underlying DS from data. But models based on DS principles offer other profound advantages: Beyond short-term forecasts, they enable to predict the long-term statistics of an observed system, which in many practical scenarios may be the more relevant quantities. DS theory furthermore provides domain-independent theoretical insight into mechanisms underlying TS generation, and thereby will inform us, e.g., about upper bounds on performance of any TS model, generalization into unseen regimes as in tipping points, or potential control strategies. After reviewing some of the central concepts, methods, measures, and models in DS theory and DSR, we will discuss how insights from this field can advance TS modeling in crucial ways, enabling better forecasting with much lower computational and memory footprints. We conclude with a number of specific suggestions for translating insights from DSR into TS modeling.
[AI-80] SimToolReal: An Object-Centric Policy for Zero-Shot Dexterous Tool Manipulation
【Quick Read】: This paper addresses the difficulty of general dexterous tool manipulation in the real world, in particular achieving strong generalization without retraining for each object or task. Conventional approaches rely on large amounts of human teleoperation data or per-task reward and physics engineering, which is costly and scales poorly. The key to the solution is the SimToolReal framework: procedurally generate diverse tool-like object primitives in simulation and train a single reinforcement learning (RL) policy with the universal goal of manipulating any object to random target poses. This universal-goal training yields zero-shot transfer: in the real world the policy generalizes across 24 tasks, 12 object instances, and 6 categories of everyday tools, outperforms prior retargeting and fixed-grasp methods by 37%, and matches specialist RL policies trained on specific objects and tasks.
Link: https://arxiv.org/abs/2602.16863
Authors: Kushal Kedia,Tyler Ga Wei Lum,Jeannette Bohg,C. Karen Liu
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:The ability to manipulate tools significantly expands the set of tasks a robot can perform. Yet, tool manipulation represents a challenging class of dexterity, requiring grasping thin objects, in-hand object rotations, and forceful interactions. Since collecting teleoperation data for these behaviors is challenging, sim-to-real reinforcement learning (RL) is a promising alternative. However, prior approaches typically require substantial engineering effort to model objects and tune reward functions for each task. In this work, we propose SimToolReal, taking a step towards generalizing sim-to-real RL policies for tool manipulation. Instead of focusing on a single object and task, we procedurally generate a large variety of tool-like object primitives in simulation and train a single RL policy with the universal goal of manipulating each object to random goal poses. This approach enables SimToolReal to perform general dexterous tool manipulation at test-time without any object or task-specific training. We demonstrate that SimToolReal outperforms prior retargeting and fixed-grasp methods by 37% while matching the performance of specialist RL policies trained on specific target objects and tasks. Finally, we show that SimToolReal generalizes across a diverse set of everyday tools, achieving strong zero-shot performance over 120 real-world rollouts spanning 24 tasks, 12 object instances, and 6 tool categories.
[AI-81] VAM: Verbalized Action Masking for Controllable Exploration in RL Post-Training – A Chess Case Study
【Quick Read】: This paper addresses the exploration problem in reinforcement learning (RL) post-training of large language models (LLMs), where sparse feedback and large action spaces often cause premature collapse into repetitive behaviors. The key to the solution is Verbalized Action Masking (VAM): an action mask is verbalized in the prompt, and the model must output an action from the masked set. On top of this interface, iterative action-space pruning is introduced: if the target action is not sampled, already-sampled valid actions are removed from the mask and the model resamples, repeating until the target action is drawn or a fixed budget is exhausted. In chess, this markedly improves learning efficiency and final performance, validating VAM as a practical mechanism for controllable exploration.
Link: https://arxiv.org/abs/2602.16833
Authors: Zhicheng Zhang,Ziyan Wang,Yali Du,Fei Fang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Exploration remains a key bottleneck for reinforcement learning (RL) post-training of large language models (LLMs), where sparse feedback and large action spaces can lead to premature collapse into repetitive behaviors. We propose Verbalized Action Masking (VAM), which verbalizes an action mask in the prompt and enforces that the model outputs an action from the masked set. Building on this interface, we introduce iterative action-space pruning: if the target action is not sampled, we remove valid sampled actions from the mask and resample under the reduced candidate set, repeating until the target is sampled or a fixed budget is exhausted. We study VAM in chess and evaluate it under two training regimes: an engine-play regime that generates states via play against an engine opponent and a fixed-dataset regime that trains from a fixed dataset of positions with verifier scores. Across held-out chess puzzles and full-game play measured by average centipawn loss (ACPL), VAM improves learning efficiency and final performance over strong baselines, highlighting verbalized masking as a practical mechanism for controllable exploration in LLM RL post-training.
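The iterative pruning loop described in the abstract can be sketched directly; `sample_action` stands in for the masked LLM query, and the chess moves are arbitrary examples:

```python
import random

def sample_action(valid_actions):
    """Stand-in for querying the LLM with a verbalized mask: the prompt
    lists `valid_actions` and the model must answer with one of them."""
    return random.choice(valid_actions)

def iterative_pruning(valid_actions, target, budget=8):
    """Resample under a shrinking mask until the target action is drawn
    or the budget is exhausted (the paper's pruning loop, paraphrased)."""
    mask = list(valid_actions)
    for step in range(budget):
        action = sample_action(mask)
        if action == target:
            return action, step + 1
        mask.remove(action)          # prune the sampled non-target action
        if not mask:
            break
    return None, budget

action, tries = iterative_pruning(["e4", "d4", "Nf3", "c4"], target="Nf3")
print(action, tries)
```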
[AI-82] Learning under noisy supervision is governed by a feedback-truth gap
【Quick Read】: This paper addresses a fundamental issue in learning: when feedback is absorbed faster than task structure can be evaluated, the learner favors feedback over truth (the "feedback-truth gap"). The key contribution is a two-timescale model showing the gap is inevitable whenever the feedback and task-structure update rates differ and vanishes only when they match. Experiments across systems show the gap is universal but regulated differently: dense neural networks accumulate it as memorization, sparse-residual scaffolding suppresses it, and humans (probabilistic reversal learning, and reward/punishment learning with concurrent EEG) show transient over-commitment that is actively recovered, revealing the gap as a fundamental constraint on learning under noisy supervision and the diversity of mechanisms that regulate it.
Link: https://arxiv.org/abs/2602.16829
Authors: Elan Schonfeld,Elias Wisnia
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 33 pages, 5 figures, 10 extended data figures, 4 extended data tables; 10-page supplementary information
Abstract:When feedback is absorbed faster than task structure can be evaluated, the learner will favor feedback over truth. A two-timescale model shows this feedback-truth gap is inevitable whenever the two rates differ and vanishes only when they match. We test this prediction across neural networks trained with noisy labels (30 datasets, 2,700 runs), human probabilistic reversal learning (N = 292), and human reward/punishment learning with concurrent EEG (N = 25). In each system, truth is defined operationally: held-out labels, the objectively correct option, or the participant’s pre-feedback expectation - the only non-circular reference decodable from post-feedback EEG. The gap appeared universally but was regulated differently: dense networks accumulated it as memorization; sparse-residual scaffolding suppressed it; humans generated transient over-commitment that was actively recovered. Neural over-commitment (~0.04-0.10) was amplified tenfold into behavioral commitment (d = 3.3-3.9). The gap is a fundamental constraint on learning under noisy supervision; its consequences depend on the regulation each system employs.
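As a loose intuition pump (not the paper's model), a two-rate update shows how a faster feedback rate leaves a persistent gap to the truth that shrinks when the rates are matched:

```python
import numpy as np

rng = np.random.default_rng(0)
truth = 1.0
alpha_feedback, alpha_truth = 0.5, 0.05   # feedback absorbed faster than truth

belief, gaps = 0.0, []
for t in range(200):
    feedback = truth + rng.normal(scale=1.0)        # noisy supervision signal
    belief += alpha_feedback * (feedback - belief)  # fast feedback timescale
    belief += alpha_truth * (truth - belief)        # slow truth evaluation
    gaps.append(abs(belief - truth))

print(f"mean |belief - truth| over last 50 steps: {np.mean(gaps[-50:]):.3f}")
# Rerunning with alpha_feedback == alpha_truth shrinks the residual gap markedly.
```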
[AI-83] An order-oriented approach to scoring hesitant fuzzy elements
【Quick Read】: This paper addresses the lack of an order-theoretic foundation for traditional scoring methods on hesitant fuzzy sets (HFS), which leaves scoring mechanisms inconsistent and unable to satisfy normative requirements. The key to the solution is a unified, order-oriented framework in which every scoring function is explicitly defined with respect to a given order. On this basis, the authors prove that scores defined with respect to the symmetric order satisfy key normative criteria such as strong monotonicity and the Gärdenfors condition, and they introduce "dominance functions" for comparing hesitant fuzzy elements, which rank elements via control sets with minimum acceptability thresholds and support building fuzzy preference relations for group decision-making.
Link: https://arxiv.org/abs/2602.16827
Authors: Luis Merino,Gabriel Navarro,Carlos Salvatierra,Evangelina Santos
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Traditional scoring approaches on hesitant fuzzy sets often lack a formal base in order theory. This paper proposes a unified framework, where each score is explicitly defined with respect to a given order. This order-oriented perspective enables more flexible and coherent scoring mechanisms. We examine several classical orders on hesitant fuzzy elements, that is, nonempty subsets in [0,1], and show that, contrary to prior claims, they do not induce lattice structures. In contrast, we prove that the scores defined with respect to the symmetric order satisfy key normative criteria for scoring functions, including strong monotonicity with respect to unions and the Gärdenfors condition. Following this analysis, we introduce a class of functions, called dominance functions, for ranking hesitant fuzzy elements. They aim to compare hesitant fuzzy elements relative to control sets incorporating minimum acceptability thresholds. Two concrete examples of dominance functions for finite sets are provided: the discrete dominance function and the relative dominance function. We show that these can be employed to construct fuzzy preference relations on typical hesitant fuzzy sets and support group decision-making.
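Since the exact definitions are in the paper, the following discrete dominance function is a hypothetical illustration of the general shape of such a score: compare each value of a hesitant fuzzy element against a control set under a minimum acceptability threshold.

```python
def discrete_dominance(hfe, control, threshold=0.5):
    """Hypothetical discrete dominance score: the fraction of
    (element, control) pairs in which the hesitant fuzzy element's
    value meets the acceptability threshold and dominates the control
    value. The paper's exact definition may differ."""
    pairs = [(h, c) for h in hfe for c in control]
    wins = sum(1 for h, c in pairs if h >= threshold and h >= c)
    return wins / len(pairs)

a = {0.4, 0.7, 0.9}      # hesitant fuzzy elements: finite subsets of [0,1]
b = {0.3, 0.5}
control = {0.5, 0.6}
print(discrete_dominance(a, control), discrete_dominance(b, control))
```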
[AI-84] HiVAE: Hierarchical Latent Variables for Scalable Theory of Mind AAAI-26
【Quick Read】: This paper addresses the limited reasoning ability of current Theory of Mind (ToM) models in complex, realistic settings: existing approaches are mostly confined to small, human-interpretable gridworlds and do not scale to real spatiotemporal domains. The key to the solution is HiVAE, a hierarchical variational autoencoder architecture inspired by the belief-desire-intention structure of human cognition; its three nested VAE levels enable efficient inference on complex tasks (e.g., campus navigation over 3,185 nodes), substantially improving ToM performance. The authors also note that although the hierarchy improves prediction, the learned latent representations lack explicit grounding to actual mental states, and they propose self-supervised alignment strategies to improve interpretability and grounding.
Link: https://arxiv.org/abs/2602.16826
Authors: Nigel Doering,Rahath Malladi,Arshia Sangwan,David Danks,Tauhidur Rahman
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the Workshop on Theory of Mind for AI (ToM4AI) at the 40th AAAI Conference on Artificial Intelligence (AAAI-26), Singapore, 2026
Abstract:Theory of mind (ToM) enables AI systems to infer agents’ hidden goals and mental states, but existing approaches focus mainly on small human understandable gridworld spaces. We introduce HiVAE, a hierarchical variational architecture that scales ToM reasoning to realistic spatiotemporal domains. Inspired by the belief-desire-intention structure of human cognition, our three-level VAE hierarchy achieves substantial performance improvements on a 3,185-node campus navigation task. However, we identify a critical limitation: while our hierarchical structure improves prediction, learned latent representations lack explicit grounding to actual mental states. We propose self-supervised alignment strategies and present this work to solicit community feedback on grounding approaches.
[AI-85] Node Learning: A Framework for Adaptive Decentralised and Collaborative Network Edge AI
【Quick Read】: This paper addresses the cost and fragility of centralized artificial intelligence (AI) at the edge, including data-transmission overhead, latency, energy consumption, and dependence on large data centres, which scale poorly in heterogeneous, mobile, and resource-constrained environments. The key to the solution is Node Learning, a decentralized learning paradigm in which intelligence resides at individual edge nodes and spreads through selective peer interaction: nodes learn continuously from local data, maintain their own model state, and exchange knowledge opportunistically when collaboration is beneficial. Learning propagates through overlap and diffusion rather than global synchronization or central aggregation, unifying autonomous and cooperative behavior while accommodating heterogeneity in data, hardware, objectives, and connectivity.
Link: https://arxiv.org/abs/2602.16814
Authors: Eiman Kanjo,Mustafa Aslanov
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 16 pages, 3 figures, 3 tables, this paper introduces a new concept
Abstract:The expansion of AI toward the edge increasingly exposes the cost and fragility of centralised intelligence. Data transmission, latency, energy consumption, and dependence on large data centres create bottlenecks that scale poorly across heterogeneous, mobile, and resource-constrained environments. In this paper, we introduce Node Learning, a decentralised learning paradigm in which intelligence resides at individual edge nodes and expands through selective peer interaction. Nodes learn continuously from local data, maintain their own model state, and exchange learned knowledge opportunistically when collaboration is beneficial. Learning propagates through overlap and diffusion rather than global synchronisation or central aggregation. It unifies autonomous and cooperative behaviour within a single abstraction and accommodates heterogeneity in data, hardware, objectives, and connectivity. This concept paper develops the conceptual foundations of this paradigm, contrasts it with existing decentralised approaches, and examines implications for communication, hardware, trust, and governance. Node Learning does not discard existing paradigms, but places them within a broader decentralised perspective.
[AI-86] NeuDiff Agent: A Governed AI Workflow for Single-Crystal Neutron Crystallography
【Quick Read】: This paper addresses analysis and reporting latency in crystallographic data processing at large-scale facilities, where structurally and magnetically complex samples require iterative reduction, integration, refinement, and validation that severely limit scientific throughput when done manually. The key to the solution is NeuDiff Agent, a governed, tool-using agentic workflow deployed on the TOPAZ instrument that restricts actions to allowlisted tools, enforces fail-closed verification gates at key workflow boundaries, and captures complete provenance, automating the pipeline from raw data to a publication-ready CIF while preserving auditability and publication-facing validation. Empirically, end-to-end analysis time drops from 435 minutes (manual) to 86.5-94.4 minutes (4.6-5.0x faster), and the resulting CIF carries no checkCIF level A or B alerts, showing the approach meets expert validation requirements while improving efficiency.
Link: https://arxiv.org/abs/2602.16812
Authors: Zhongcan Xiao(1),Leyi Zhang(1 and 2),Guannan Zhang(3),Xiaoping Wang(1) ((1) Neutron Scattering Division, Oak Ridge National Laboratory, Oak Ridge, Tennesse USA, (2) Department of Linguistics, University of Illinois Urbana-Champaign, Urbana, Illinois, USA, (3) Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA)
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large-scale facilities increasingly face analysis and reporting latency as the limiting step in scientific throughput, particularly for structurally and magnetically complex samples that require iterative reduction, integration, refinement, and validation. To improve time-to-result and analysis efficiency, NeuDiff Agent is introduced as a governed, tool-using AI workflow for TOPAZ at the Spallation Neutron Source that takes instrument data products through reduction, integration, refinement, and validation to a validated crystal structure and a publication-ready CIF. NeuDiff Agent executes this established pipeline under explicit governance by restricting actions to allowlisted tools, enforcing fail-closed verification gates at key workflow boundaries, and capturing complete provenance for inspection, auditing, and controlled replay. Performance is assessed using a fixed prompt protocol and repeated end-to-end runs with two large language model backends, with user and machine time partitioned and intervention burden and recovery behaviors quantified under gating. In a reference-case benchmark, NeuDiff Agent reduces wall time from 435 minutes (manual) to 86.5(4.7) to 94.4(3.5) minutes (4.6-5.0x faster) while producing a validated CIF with no checkCIF level A or B alerts. These results establish a practical route to deploy agentic AI in facility crystallography while preserving traceability and publication-facing validation requirements.
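The governance pattern (allowlisted tools plus fail-closed gates) can be sketched generically; the tool names and the `(passed, payload)` runner convention here are assumptions, not NeuDiff Agent's API:

```python
ALLOWLISTED_TOOLS = {"reduce", "integrate", "refine", "checkcif"}  # hypothetical names

def gated_call(name, runner, *args, **kwargs):
    """Fail-closed pattern: only allowlisted tools may run, and any
    failed verification halts the workflow rather than continuing."""
    if name not in ALLOWLISTED_TOOLS:
        raise PermissionError(f"tool '{name}' is not allowlisted")
    ok, result = runner(*args, **kwargs)   # runner returns (passed, payload)
    if not ok:
        raise RuntimeError(f"verification gate failed after '{name}'")
    return result

# Example: a stub integration step that passes its gate.
print(gated_call("integrate", lambda: (True, "integration ok")))
```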
[AI-87] Improved Upper Bounds for Slicing the Hypercube
【Quick Read】: This paper studies the edge-slicing problem for the hypercube $ Q_n $: determining the minimum number of hyperplanes, denoted $ S(n) $, needed so that every edge is intersected in its interior by at least one hyperplane. The previously known upper bound was $ S(n) \leq \lceil \frac{5n}{6} \rceil $ (Paterson, 1971); this paper improves it constructively to $ S(n) \leq \lceil \frac{4n}{5} \rceil $, with an extra +1 allowed when $ n $ is an odd multiple of 5. The key breakthrough is a construction of 8 hyperplanes slicing $ Q_{10} $, found with the aid of CPro1, a recently introduced automatic search tool that couples reasoning LLMs with automated hyperparameter tuning to create search algorithms for discovering mathematical constructions, which in turn drives the improved overall bound.
Link: https://arxiv.org/abs/2602.16807
Authors: Duncan Soiffer,Nathaniel Itty,Christopher D. Rosin,Blake Bruell,Mason DiCicco,Gábor N. Sárközy,Ryan Offstein,Daniel Reichman
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM); Combinatorics (math.CO)
Comments:
Abstract:A collection of hyperplanes \mathcal{H} slices all edges of the n-dimensional hypercube Q_n with vertex set \{-1,1\}^n if, for every edge e in the hypercube, there exists a hyperplane in \mathcal{H} intersecting e in its interior. Let S(n) be the minimum number of hyperplanes needed to slice Q_n. We prove that S(n) \leq \lceil \frac{4n}{5} \rceil, except when n is an odd multiple of 5, in which case S(n) \leq \frac{4n}{5} + 1. This improves upon the previously known upper bound of S(n) \leq \lceil \frac{5n}{6} \rceil due to Paterson reported in 1971. We also obtain new lower bounds on the maximum number of edges in Q_n that can be sliced using kn hyperplanes. We prove the improved upper bound on S(n) by constructing 8 hyperplanes slicing Q_{10} aided by the recently introduced CPro1: an automatic tool that uses reasoning LLMs coupled with automated hyperparameter tuning to create search algorithms for the discovery of mathematical constructions.
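The slicing condition itself is easy to check computationally: an edge of Q_n is sliced in its interior exactly when the affine form w·x + b changes sign between its endpoints. A small brute-force checker (the example hyperplane is arbitrary, not one of the paper's constructions):

```python
import itertools
import numpy as np

def sliced_edges(hyperplanes, n):
    """Count edges of the hypercube {-1,1}^n sliced by at least one
    hyperplane (w, b): an edge (u, v) is sliced in its interior when
    w.u + b and w.v + b have strictly opposite signs."""
    count, total = 0, 0
    for u in itertools.product([-1, 1], repeat=n):
        for i in range(n):
            if u[i] == -1:                     # count each edge exactly once
                v = list(u); v[i] = 1
                total += 1
                for w, b in hyperplanes:
                    if (np.dot(w, u) + b) * (np.dot(w, v) + b) < 0:
                        count += 1
                        break
    return count, total

# One generic hyperplane in Q_4, chosen arbitrarily for illustration.
planes = [(np.array([1.0, 0.9, 0.8, 0.7]), 0.05)]
print(sliced_edges(planes, 4))
```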
[AI-88] Simple Baselines are Competitive with Code Evolution
【Quick Read】: This paper asks whether current code-evolution practice holds up against rigorous baselines, focusing on three domains: improving mathematical bounds, designing agentic scaffolds, and machine learning competitions. The finding is that simple baselines match or exceed far more sophisticated evolutionary methods in all three, exposing shortcomings in how code evolution is developed and used: for mathematical bounds, a problem's search space and the domain knowledge in the prompt, not the evolution pipeline, chiefly dictate the performance ceiling; for agentic scaffolds, small datasets combined with high scaffold variance cause selection bias, so hand-designed majority-vote scaffolds perform best. The authors accordingly propose improved evaluation methods that reduce stochasticity while remaining economically feasible, arguing that effort should focus on designing good search spaces and robust evaluation rather than on ever more complex evolution strategies.
Link: https://arxiv.org/abs/2602.16805
Authors: Yonatan Gideoni,Sebastian Risi,Yarin Gal
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Code evolution is a family of techniques that rely on large language models to search through possible computer programs by evolving or mutating existing code. Many proposed code evolution pipelines show impressive performance but are often not compared to simpler baselines. We test how well two simple baselines do over three domains: finding better mathematical bounds, designing agentic scaffolds, and machine learning competitions. We find that simple baselines match or exceed much more sophisticated methods in all three. By analyzing these results we find various shortcomings in how code evolution is both developed and used. For the mathematical bounds, a problem’s search space and domain knowledge in the prompt are chiefly what dictate a search’s performance ceiling and efficiency, with the code evolution pipeline being secondary. Thus, the primary challenge in finding improved bounds is designing good search spaces, which is done by domain experts, and not the search itself. When designing agentic scaffolds we find that high variance in the scaffolds coupled with small datasets leads to suboptimal scaffolds being selected, resulting in hand-designed majority vote scaffolds performing best. We propose better evaluation methods that reduce evaluation stochasticity while keeping the code evolution economically feasible. We finish with a discussion of avenues and best practices to enable more rigorous code evolution in future work.
[AI-89] Large-scale online deanonymization with LLMs
【Quick Read】: This paper addresses large-scale online deanonymization: identifying users' real identities from pseudonymous text alone (forum comments, social media posts, and the like). Traditional methods rely on structured data or manual feature engineering, limiting their scope and efficiency. The key to the solution is a scalable LLM-based attack pipeline: (1) extract identity-relevant features from raw text automatically, (2) shortlist candidate matches quickly via semantic embeddings, and (3) verify high-likelihood candidates with LLM reasoning to reduce false positives. Without human intervention, the pipeline deanonymizes unstructured text from arbitrary platforms, substantially outperforms classical baselines across several real-world settings, reaches up to 68% recall at 90% precision, and exposes the fragility of the practical obscurity protecting pseudonymous users online today.
Link: https://arxiv.org/abs/2602.16800
Authors: Simon Lermen,Daniel Paleka,Joshua Swanson,Michael Aerni,Nicholas Carlini,Florian Tramèr
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 24 pages, 10 figures
Abstract:We show that large language models can be used to perform at-scale deanonymization. With full Internet access, our agent can re-identify Hacker News users and Anthropic Interviewer participants at high precision, given pseudonymous online profiles and conversations alone, matching what would take hours for a dedicated human investigator. We then design attacks for the closed-world setting. Given two databases of pseudonymous individuals, each containing unstructured text written by or about that individual, we implement a scalable attack pipeline that uses LLMs to: (1) extract identity-relevant features, (2) search for candidate matches via semantic embeddings, and (3) reason over top candidates to verify matches and reduce false positives. Compared to prior deanonymization work (e.g., on the Netflix prize) that required structured data or manual feature engineering, our approach works directly on raw user content across arbitrary platforms. We construct three datasets with known ground-truth data to evaluate our attacks. The first links Hacker News to LinkedIn profiles, using cross-platform references that appear in the profiles. Our second dataset matches users across Reddit movie discussion communities; and the third splits a single user’s Reddit history in time to create two pseudonymous profiles to be matched. In each setting, LLM-based methods substantially outperform classical baselines, achieving up to 68% recall at 90% precision compared to near 0% for the best non-LLM method. Our results show that the practical obscurity protecting pseudonymous users online no longer holds and that threat models for online privacy need to be reconsidered.
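Stage 2 of the pipeline, candidate search via semantic embeddings, can be sketched as cosine-similarity matching between the two databases; the dimensions and threshold are placeholders, and stage 3 (LLM verification) is omitted:

```python
import numpy as np

def match_profiles(emb_a, emb_b, threshold=0.8):
    """Stage-2 sketch: cosine similarity between feature embeddings of
    two pseudonymous databases; pairs above threshold would go on to an
    LLM verification stage (stage 3, not shown)."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sims = a @ b.T
    return [(i, int(sims[i].argmax()), float(sims[i].max()))
            for i in range(len(a)) if sims[i].max() >= threshold]

emb_a = np.random.randn(5, 256)   # stand-ins for LLM-extracted features
emb_b = np.random.randn(7, 256)
print(match_profiles(emb_a, emb_b, threshold=0.1))
```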
[AI-90] When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation
【Quick Read】: This paper addresses benchmark saturation: as model capability grows, benchmarks lose the ability to differentiate the best-performing models, diminishing their long-term value for measuring LLM progress. The key to the solution is a systematic identification of the design factors that drive saturation: analyzing 60 prominent LLM benchmarks along 14 properties spanning task design, data construction, and evaluation format, and testing five hypotheses, the study finds that expert-curated data resists saturation better than crowdsourced data, while hiding test data (public vs. private) offers no protective effect, providing empirical guidance for designing more durable evaluations.
Link: https://arxiv.org/abs/2602.16763
Authors: Mubashara Akhtar,Anka Reuel,Prajna Soni,Sanchit Ahuja,Pawan Sasanka Ammanamanchi,Ruchit Rawal,Vilém Zouhar,Srishti Yadav,Chenxi Whitehouse,Dayeon Ki,Jennifer Mickel,Leshem Choshen,Marek Šuppa,Jan Batzner,Jenny Chim,Jeba Sania,Yanan Long,Hossein A. Rahmani,Christina Knight,Yiyang Nan,Jyoutir Raj,Yu Fan,Shubham Singh,Subramanyam Sahoo,Eliya Habba,Usman Gohar,Siddhesh Pawar,Robert Scholz,Arjun Subramonian,Jingwei Ni,Mykel Kochenderfer,Sanmi Koyejo,Mrinmaya Sachan,Stella Biderman,Zeerak Talat,Avijit Ghosh,Irene Solaiman
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Artificial Intelligence (AI) benchmarks play a central role in measuring progress in model development and guiding deployment decisions. However, many benchmarks quickly become saturated, meaning that they can no longer differentiate between the best-performing models, diminishing their long-term value. In this study, we analyze benchmark saturation across 60 Large Language Model (LLM) benchmarks selected from technical reports by major model developers. To identify factors driving saturation, we characterize benchmarks along 14 properties spanning task design, data construction, and evaluation format. We test five hypotheses examining how each property contributes to saturation rates. Our analysis reveals that nearly half of the benchmarks exhibit saturation, with rates increasing as benchmarks age. Notably, hiding test data (i.e., public vs. private) shows no protective effect, while expert-curated benchmarks resist saturation better than crowdsourced ones. Our findings highlight which design choices extend benchmark longevity and inform strategies for more durable evaluation.
[AI-91] Attending to Routers Aids Indoor Wireless Localization AAAI2026
【Quick Read】: This paper addresses the difficulty of achieving breakthrough performance in modern machine learning-based wireless localization with Wi-Fi signals across diverse environments: existing algorithms do not differentially weight the information contributed by each router during aggregation, leading to suboptimal convergence and reduced accuracy. The key to the solution is "attention to routers": attention layers are inserted into a standard machine-learning localization architecture so that each router's contribution is weighted by its relevance when aggregating information from multiple routers, making multi-router fusion more principled and accurate. Evaluation on open-sourced datasets shows over 30% higher accuracy than the benchmark architecture.
Link: https://arxiv.org/abs/2602.16762
Authors: Ayush Roy,Tahsin Fuad Hassan,Roshan Ayyalasomayajula,Vishnu Suresh Lokhande
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: AAAI 2026 Workshop on Machine Learning for Wireless Communication and Networks (ML4Wireless)
Abstract:Modern machine learning-based wireless localization using Wi-Fi signals continues to face significant challenges in achieving groundbreaking performance across diverse environments. A major limitation is that most existing algorithms do not appropriately weight the information from different routers during aggregation, resulting in suboptimal convergence and reduced accuracy. Motivated by traditional weighted triangulation methods, this paper introduces the concept of attention to routers, ensuring that each router's contribution is weighted differently when aggregating information from multiple routers for triangulation. We demonstrate, by incorporating attention layers into a standard machine learning localization architecture, that emphasizing the relevance of each router can substantially improve overall performance. We also evaluate on the open-sourced datasets and demonstrate that Attention to Routers outperforms the benchmark architecture by over 30% in accuracy.
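A minimal sketch of attention-weighted router aggregation, assuming per-router features and a learned query; the paper's actual architecture and feature extraction may differ:

```python
import torch
import torch.nn as nn

class RouterAttention(nn.Module):
    """Toy attention pooling over per-router features: a learned query
    scores each router, and the softmax weights replace the uniform
    averaging used by many localization baselines."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.query = nn.Parameter(torch.randn(feat_dim))
        self.head = nn.Linear(feat_dim, 2)     # predicted (x, y) position

    def forward(self, router_feats):           # (batch, n_routers, feat_dim)
        scores = router_feats @ self.query     # (batch, n_routers)
        weights = torch.softmax(scores, dim=-1)
        pooled = (weights.unsqueeze(-1) * router_feats).sum(dim=1)
        return self.head(pooled), weights

model = RouterAttention()
xy, w = model(torch.randn(8, 5, 64))           # 5 routers per sample
print(xy.shape, w.shape)
```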
[AI-92] LiveClin: A Live Clinical Benchmark without Leakage
【Quick Read】: This paper addresses the reliability problems of data contamination and knowledge obsolescence in evaluating medical large language models (LLMs), which inflate scores on static benchmarks. The key to the solution is LiveClin, a continuously updated, clinically grounded live benchmark: built from contemporary peer-reviewed case reports and refreshed biannually to stay current and contamination-resistant, it uses a verified AI-human workflow involving 239 physicians to transform authentic patient cases into complex multimodal evaluation scenarios spanning the entire clinical pathway, thereby better reflecting model capability in real clinical settings.
Link: https://arxiv.org/abs/2602.16747
Authors: Xidong Wang,Shuqi Guo,Yue Shen,Junying Chen,Jian Wang,Jinjie Gu,Ping Zhang,Lei Liu,Benyou Wang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:The reliability of medical LLM evaluation is critically undermined by data contamination and knowledge obsolescence, leading to inflated scores on static benchmarks. To address these challenges, we introduce LiveClin, a live benchmark designed for approximating real-world clinical practice. Built from contemporary, peer-reviewed case reports and updated biannually, LiveClin ensures clinical currency and resists data contamination. Using a verified AI-human workflow involving 239 physicians, we transform authentic patient cases into complex, multimodal evaluation scenarios that span the entire clinical pathway. The benchmark currently comprises 1,407 case reports and 6,605 questions. Our evaluation of 26 models on LiveClin reveals the profound difficulty of these real-world scenarios, with the top-performing model achieving a Case Accuracy of just 35.7%. In benchmarking against human experts, Chief Physicians achieved the highest accuracy, followed closely by Attending Physicians, with both surpassing most models. LiveClin thus provides a continuously evolving, clinically grounded framework to guide the development of medical LLMs towards closing this gap and achieving greater reliability and real-world utility. Our data and code are publicly available at this https URL.
[AI-93] Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking
【Quick Read】: This paper targets the mechanism behind "grokking", the delayed transition from memorizing the data to generalizing on small algorithmic tasks, which still lacks a satisfying explanation. The key contribution is a geometric account: PCA of attention-weight trajectories shows training evolves mainly within a low-dimensional execution subspace (explaining 68-83% of trajectory variance), and commutator defects of successive gradient steps, i.e., their non-commutativity, quantify loss-landscape geometry, revealing sharp curvature growth orthogonal to this subspace that consistently precedes generalization and obeys a power law in the grokking timescale. Causal interventions show that motion along the learned subspace is necessary for grokking while artificially boosting curvature is insufficient, supporting the view that grokking is escape from a metastable regime characterized by low-dimensional confinement and transverse curvature accumulation.
Link: https://arxiv.org/abs/2602.16746
Authors: Yongzhong Xu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 29 pages, 22 figures
Abstract:Grokking – the delayed transition from memorization to generalization in small algorithmic tasks – remains poorly understood. We present a geometric analysis of optimization dynamics in transformers trained on modular arithmetic. PCA of attention weight trajectories reveals that training evolves predominantly within a low-dimensional execution subspace, with a single principal component capturing 68-83% of trajectory variance. To probe loss-landscape geometry, we measure commutator defects – the non-commutativity of successive gradient steps – and project them onto this learned subspace. We find that curvature grows sharply in directions orthogonal to the execution subspace while the trajectory remains largely confined to it. Importantly, curvature growth consistently precedes generalization across learning rates and hyperparameter regimes, with the lead time obeying a power law in the grokking timescale. Causal intervention experiments show that motion along the learned subspace is necessary for grokking, while artificially increasing curvature is insufficient. Together, these results support a geometric account in which grokking reflects escape from a metastable regime characterized by low-dimensional confinement and transverse curvature accumulation. All findings replicate across this learning-rate range, a qualitatively different slow regime (lr=5e-5, wd=0.1, 3 layers), and three random seeds, though alignment dynamics differ quantitatively between regimes. Causal intervention experiments establish that orthogonal gradient flow is necessary but not sufficient for grokking: suppressing it prevents generalization with a monotonic dose-response across four operations, while artificially boosting curvature defects has no effect.
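The commutator-defect measurement, the non-commutativity of successive gradient steps, can be illustrated on a toy quadratic loss; the paper computes this quantity on real transformer training steps, so the sketch only shows what is being measured:

```python
import numpy as np

def commutator_defect(theta, A, B, lr=0.1):
    """Non-commutativity of two successive gradient steps on toy loss
    pieces with Hessians A and B: the defect is the distance between
    the parameters reached by applying the steps in the two orders."""
    step = lambda t, H: t - lr * (H @ t)   # one gradient step on 0.5*t^T H t
    return np.linalg.norm(step(step(theta, A), B) - step(step(theta, B), A))

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4)); A = A @ A.T   # symmetric PSD Hessians
B = rng.standard_normal((4, 4)); B = B @ B.T
theta = rng.standard_normal(4)
print(commutator_defect(theta, A, B))          # > 0 when A and B don't commute
```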
[AI-94] PETS: A Principled Framework Towards Optimal Trajectory Allocation for Efficient Test-Time Self-Consistency
【Quick Read】: This paper addresses sample inefficiency at test time: how to allocate self-consistency reasoning trajectories under a limited budget so as to maximize performance without wasting compute. The key to the solution is PETS (Principled and Efficient Test-Time Self-Consistency), which introduces the "self-consistency rate", defined as agreement with the infinite-budget majority vote, and builds a theoretically grounded optimization framework around it. In the offline setting, reasoning traces are modeled as "workers" in crowdsourcing, so classical crowdsourcing theory yields efficient majority-voting-based allocation algorithms; in the online streaming setting, where questions arrive sequentially, an adaptive budget-allocation rule adjusts the number of samples to question difficulty while preserving strong theoretical guarantees and computational efficiency. Experiments show PETS achieves perfect self-consistency on GPQA in both settings while cutting the sampling budget by up to 75% (offline) and 55% (online) relative to uniform allocation.
Link: https://arxiv.org/abs/2602.16745
Authors: Zhangyi Liu,Huaizhi Qu,Xiaowei Yin,He Sun,Yanjun Han,Tianlong Chen,Zhun Deng
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Test-time scaling can improve model performance by aggregating stochastic reasoning trajectories. However, achieving sample-efficient test-time self-consistency under a limited budget remains an open challenge. We introduce PETS (Principled and Efficient Test-Time Self-Consistency), which initiates a principled study of trajectory allocation through an optimization framework. Central to our approach is the self-consistency rate, a new measure defined as agreement with the infinite-budget majority vote. This formulation makes sample-efficient test-time allocation theoretically grounded and amenable to rigorous analysis. We study both offline and online settings. In the offline regime, where all questions are known in advance, we connect trajectory allocation to crowdsourcing, a classic and well-developed area, by modeling reasoning traces as workers. This perspective allows us to leverage rich existing theory, yielding theoretical guarantees and an efficient majority-voting-based allocation algorithm. In the online streaming regime, where questions arrive sequentially and allocations must be made on the fly, we propose a novel method inspired by the offline framework. Our approach adapts budgets to question difficulty while preserving strong theoretical guarantees and computational efficiency. Experiments show that PETS consistently outperforms uniform allocation. On GPQA, PETS achieves perfect self-consistency in both settings while reducing the sampling budget by up to 75% (offline) and 55% (online) relative to uniform allocation. Code is available at this https URL.
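A toy version of the online idea, majority voting with a difficulty-adaptive stopping rule, is sketched below; the margin rule is a simplification of, not a quote from, the paper's allocation algorithm:

```python
import random
from collections import Counter

def adaptive_allocation(sample_fn, budget_step=4, max_budget=32, margin=0.6):
    """Toy difficulty-adaptive loop: keep sampling trajectories until
    the leading answer's vote share exceeds `margin` or the budget runs
    out. Easy questions stop early; hard ones consume more samples."""
    samples = []
    while len(samples) < max_budget:
        samples += [sample_fn() for _ in range(budget_step)]
        answer, votes = Counter(samples).most_common(1)[0]
        if votes / len(samples) >= margin:
            return answer, len(samples)
    return Counter(samples).most_common(1)[0][0], len(samples)

# Stand-in for an LLM returning 'A' with probability 0.75.
answer, used = adaptive_allocation(lambda: random.choice("AAAB"))
print(answer, used)
```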
[AI-95] DeepVision-103K: A Visually Diverse Broad-Coverage and Verifiable Mathematical Dataset for Multimodal Reasoning
【Quick Read】: This paper addresses the limited diversity and coverage of existing datasets for Reinforcement Learning with Verifiable Rewards (RLVR): most are derived from small-scale manual construction or recombination of prior resources, constraining further gains in the visual reflection and reasoning capabilities of large multimodal models (LMMs). The key to the solution is DeepVision-103K, a comprehensive RLVR training dataset covering diverse K12 mathematical topics, extensive knowledge points, and rich visual elements; models trained on it achieve strong results on multimodal mathematical benchmarks and generalize well to general multimodal reasoning tasks.
Link: https://arxiv.org/abs/2602.16742
Authors: Haoxiang Sun,Lizhen Xu,Bing Zhao,Wotao Yin,Wei Wang,Boyu Yang,Rui Wang,Hu Wei
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under review
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has been shown effective in enhancing the visual reflection and reasoning capabilities of Large Multimodal Models (LMMs). However, existing datasets are predominantly derived from either small-scale manual construction or recombination of prior resources, which limits data diversity and coverage, thereby constraining further gains in model performance. To this end, we introduce DeepVision-103K, a comprehensive dataset for RLVR training that covers diverse K12 mathematical topics, extensive knowledge points, and rich visual elements. Models trained on DeepVision achieve strong performance on multimodal mathematical benchmarks, and generalize effectively to general multimodal reasoning tasks. Further analysis reveals enhanced visual perception, reflection and reasoning capabilities in trained models, validating DeepVision's effectiveness for advancing multimodal reasoning. Data: this https URL.
[AI-96] Can Adversarial Code Comments Fool AI Security Reviewers – Large-Scale Empirical Study of Comment-Based Attacks and Defenses Against LLM Code Analysis
【Quick Read】: This paper asks whether adversarial comments affect large language model (LLM) performance on vulnerability detection. Although comment manipulation substantially degrades LLMs in code-generation settings, the study finds that for vulnerability detection adversarial comments have small and statistically non-significant effects on accuracy (McNemar exact p > 0.21; all 95% confidence intervals include zero), for both commercial models (89-96% baseline accuracy) and open-source models (53-72%). The key result on defenses is that static-analysis cross-referencing is most effective, achieving 96.9% detection accuracy over 4,646 additional trials and recovering 47% of baseline misses, with failures concentrated on inherently hard classes such as race conditions, timing side channels, and complex authorization logic, whereas simple comment stripping actually weakens detection for less capable models by removing helpful context.
Link: https://arxiv.org/abs/2602.16741
Authors: Scott Thornton
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 19 pages, 6 figures
Abstract:AI-assisted code review is widely used to detect vulnerabilities before production release. Prior work shows that adversarial prompt manipulation can degrade large language model (LLM) performance in code generation. We test whether similar comment-based manipulation misleads LLMs during vulnerability detection. We build a 100-sample benchmark across Python, JavaScript, and Java, each paired with eight comment variants ranging from no comments to adversarial strategies such as authority spoofing and technical deception. Eight frontier models, five commercial and three open-source, are evaluated in 9,366 trials. Adversarial comments produce small, statistically non-significant effects on detection accuracy (McNemar exact p > 0.21; all 95 percent confidence intervals include zero). This holds for commercial models with 89 to 96 percent baseline detection and open-source models with 53 to 72 percent, despite large absolute performance gaps. Unlike generation settings where comment manipulation achieves high attack success, detection performance does not meaningfully degrade. More complex adversarial strategies offer no advantage over simple manipulative comments. We test four automated defenses across 4,646 additional trials (14,012 total). Static analysis cross-referencing performs best at 96.9 percent detection and recovers 47 percent of baseline misses. Comment stripping reduces detection for weaker models by removing helpful context. Failures concentrate on inherently difficult vulnerability classes, including race conditions, timing side channels, and complex authorization logic, rather than on adversarial comments.
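The comment-stripping defense is straightforward to implement for Python with the standard tokenizer; note the paper evaluates Python, JavaScript, and Java, and this sketch covers Python only:

```python
import io
import tokenize

def strip_comments(source: str) -> str:
    """One of the paper's defenses, sketched for Python only: drop all
    '#' comment tokens before showing the code to the reviewing model."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    kept = [tok for tok in tokens if tok.type != tokenize.COMMENT]
    return tokenize.untokenize(kept)

code = 'query = "SELECT * FROM users WHERE id=%s" % uid  # audited, safe\n'
print(strip_comments(code))
```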
[AI-97] Quantifying LLM Attention-Head Stability: Implications for Circuit Universality
【Quick Read】: This paper addresses a core bottleneck in interpretability research on generative AI models: existing analyses of transformer "circuits" rarely verify stability across training runs with different random initializations, so it remains unclear whether discovered circuits are universal, limiting their trustworthiness in safety-critical settings. The key to the solution is a systematic quantification of how consistently attention heads learn representations across independent training runs. Large-scale experiments show that middle-layer heads are the least stable yet the most representationally distinct, that deeper models exhibit stronger mid-depth divergence, that unstable deep-layer heads are more functionally important than their same-layer peers, and that weight-decay optimization substantially improves cross-run head stability, establishing an empirical basis for scalable white-box monitoring.
Link: https://arxiv.org/abs/2602.16740
Authors: Karan Bali,Jack Stanley,Praneet Suresh,Danilo Bzdok
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Main Body: 8 pages, Total length: 33 pages, Code repo: this https URL , Weights repo: this https URL
Abstract:In mechanistic interpretability, recent work scrutinizes transformer “circuits” - sparse, mono or multi layer sub computations, that may reflect human understandable functions. Yet, these network circuits are rarely acid-tested for their stability across different instances of the same deep learning architecture. Without this, it remains unclear whether reported circuits emerge universally across labs or turn out to be idiosyncratic to a particular estimation instance, potentially limiting confidence in safety-critical settings. Here, we systematically study stability across-refits in increasingly complex transformer language models of various sizes. We quantify, layer by layer, how similarly attention heads learn representations across independently initialized training runs. Our rigorous experiments show that (1) middle-layer heads are the least stable yet the most representationally distinct; (2) deeper models exhibit stronger mid-depth divergence; (3) unstable heads in deeper layers become more functionally important than their peers from the same layer; (4) applying weight decay optimization substantially improves attention-head stability across random model initializations; and (5) the residual stream is comparatively stable. Our findings establish the cross-instance robustness of circuits as an essential yet underappreciated prerequisite for scalable oversight, drawing contours around possible white-box monitorability of AI systems.
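Cross-run head stability can be probed with any representational similarity measure; the sketch below uses cosine similarity of mean head activations on shared probe inputs, which is an assumption, as the paper's exact measure may differ:

```python
import torch

def head_stability(acts_run1, acts_run2):
    """Cross-run stability sketch: cosine similarity between the mean
    activation of matching attention heads in two independently seeded
    training runs, evaluated on the same probe inputs."""
    # acts_runX: (n_heads, n_tokens, head_dim) activations per head
    a = acts_run1.mean(dim=1)                # (n_heads, head_dim)
    b = acts_run2.mean(dim=1)
    return torch.nn.functional.cosine_similarity(a, b, dim=-1)

run1 = torch.randn(12, 256, 64)    # 12 heads, toy activations
run2 = torch.randn(12, 256, 64)
print(head_stability(run1, run2))  # one stability score per head
```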
[AI-98] he Compute ICE-AGE: Invariant Compute Envelope under Addressable Graph Evolution
【Quick Read】: This paper addresses the growth of compute cost with token volume and context length in current generative AI architectures, which reconstruct semantic state through probabilistic recomposition. The proposed alternative is a deterministic semantic state substrate, formally grounded in Bounded Local Generator Classes and implemented as a CPU-resident graph engine operating under bounded local state evolution. Its core idea is to represent semantic continuity as a persistent, addressable memory graph evolved under a time-modulated local operator g(t), so that each update is bounded by the local semantic change Delta s and independent of total memory cardinality M, yielding an invariant compute envelope (invariant traversal latency, stable CPU utilization). Measurements on Apple M2-class silicon show constant thermal behavior and low latency from 1M to 25M nodes, empirically demonstrating the "Invariant Compute Envelope under Addressable Graph Evolution" (Compute ICE-AGE).
Link: https://arxiv.org/abs/2602.16736
Authors: Raymond Jay Martin II
Affiliations: Unknown
Subjects: Operating Systems (cs.OS); Artificial Intelligence (cs.AI)
Comments: 53 pages, 6 figures, 4 appendices. Empirical systems study of a deterministic semantic substrate evaluated up to 25M nodes on Apple M2-class silicon. Includes density accounting, thermodynamic analysis, and scaling argument
Abstract:This paper presents empirical results from a production-grade C++ implementation of a deterministic semantic state substrate derived from prior formal work on Bounded Local Generator Classes (Martin, 2026). The system was mathematically specified prior to implementation and realized as a CPU-resident graph engine operating under bounded local state evolution. Contemporary inference-driven AI architectures reconstruct semantic state through probabilistic recomposition, producing compute cost that scales with token volume and context horizon. In contrast, the substrate described here represents semantic continuity as a persistent, addressable memory graph evolved under a time-modulated local operator g(t). Work is bounded by local semantic change Delta s, independent of total memory cardinality M. Empirical measurements on Apple M2-class silicon demonstrate invariant traversal latency (approximately 0.25 to 0.32 ms), stable CPU utilization (approximately 17.2 percent baseline with Delta CPU approximately 0 to 0.2 percent), and no scale-correlated thermal signature across 1M to 25M node regimes under sustained operation. Measured per-node density ranges from approximately 1.3 KB (Float64 baseline) to approximately 687 bytes (compressed Float32 accounting). Under binary memory accounting, this yields a 1.6 billion node capacity projection within a 1 TiB envelope. These results indicate an empirically invariant thermodynamic regime in which scaling is governed by memory capacity rather than inference-bound recomposition. The Compute ICE-AGE is defined as the Invariant Compute Envelope under Addressable Graph Evolution, and the empirical evidence presented demonstrates this regime up to 25M nodes.
[AI-99] Mobility-Aware Cache Framework for Scalable LLM -Based Human Mobility Simulation
【Quick Read】: This paper targets the high computational cost of using large language models (LLMs) as human agents in large-scale human mobility simulation, which limits scalability. The key to the solution is a mobility-aware cache framework named MobCache that achieves efficient simulation through reconstructible caches: on one hand, reasoning steps are encoded as latent-space embeddings and a latent-space evaluator enables their reuse and recombination; on the other hand, a lightweight decoder trained with mobility law-constrained distillation translates latent-space reasoning chains into natural language, substantially improving simulation efficiency while matching the performance of state-of-the-art LLM-based methods.
Link: https://arxiv.org/abs/2602.16727
Authors: Hua Yan,Heng Tan,Yingxue Zhang,Yu Yang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Large-scale human mobility simulation is critical for applications such as urban planning, epidemiology, and transportation analysis. Recent works treat large language models (LLMs) as human agents to simulate realistic mobility behaviors using structured reasoning, but their high computational cost limits scalability. To address this, we design a mobility-aware cache framework named MobCache that leverages reconstructible caches to enable efficient large-scale human mobility simulations. It consists of: (1) a reasoning component that encodes each reasoning step as a latent-space embedding and uses a latent-space evaluator to enable the reuse and recombination of reasoning steps; and (2) a decoding component that employs a lightweight decoder trained with mobility law-constrained distillation to translate latent-space reasoning chains into natural language, thereby improving simulation efficiency while maintaining fidelity. Experiments show that MobCache significantly improves efficiency across multiple dimensions while maintaining performance comparable to state-of-the-art LLM-based methods.
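The reuse mechanism sketched in the abstract (look up a reasoning step by latent-space similarity before recomputing it) can be pictured as a small nearest-neighbor cache. The cosine threshold and stored payload below are our assumptions, not details from the paper:

```python
import numpy as np

class LatentCache:
    """Toy reuse-by-similarity cache for latent reasoning steps."""

    def __init__(self, threshold=0.9):
        self.keys, self.values = [], []
        self.threshold = threshold

    def lookup(self, query):
        if not self.keys:
            return None
        K = np.stack(self.keys)
        sims = K @ query / (np.linalg.norm(K, axis=1) * np.linalg.norm(query))
        best = int(np.argmax(sims))
        # reuse a cached step only if it is similar enough to the query
        return self.values[best] if sims[best] >= self.threshold else None

    def insert(self, embedding, step):
        self.keys.append(embedding)
        self.values.append(step)

cache = LatentCache()
cache.insert(np.ones(8), "go to work at 9am")
print(cache.lookup(np.ones(8)))  # cache hit: reuses the stored step
```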
[AI-100] Is Mamba Reliable for Medical Imaging?
【Quick Read】: This paper addresses the robustness of state-space models (SSMs) such as Mamba when deployed in medical imaging, where their vulnerability under realistic software and hardware threat models remains underexplored. The key to the solution is a systematic evaluation of Mamba on multiple MedMNIST classification benchmarks, covering input-level attacks including white-box adversarial perturbations (FGSM/PGD), occlusion-based PatchDrop, common acquisition corruptions (Gaussian noise and defocus blur), and hardware fault attacks emulated via weight and activation bit flips. The evaluation quantifies the accuracy impact of each attack class, reveals that current models lack adequate defenses, and indicates that targeted protection strategies must be designed before deployment.
Link: https://arxiv.org/abs/2602.16723
Authors: Banafsheh Saber Latibari,Najmeh Nazari,Daniel Brignac,Hossein Sayadi,Houman Homayoun,Abhijit Mahalanobis
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted at ISQED 2026
Abstract:State-space models like Mamba offer linear-time sequence processing and low memory, making them attractive for medical imaging. However, their robustness under realistic software and hardware threat models remains underexplored. This paper evaluates Mamba on multiple MedMNIST classification benchmarks under input-level attacks, including white-box adversarial perturbations (FGSM/PGD), occlusion-based PatchDrop, and common acquisition corruptions (Gaussian noise and defocus blur) as well as hardware-inspired fault attacks emulated in software via targeted and random bit-flip injections into weights and activations. We profile vulnerabilities and quantify impacts on accuracy, indicating that defenses are needed for deployment.
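FGSM, the simplest of the white-box attacks evaluated here, perturbs the input one step along the sign of the loss gradient. A self-contained sketch with a stand-in linear classifier (the paper's Mamba models and data pipeline are not reproduced):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fgsm(model, x, y, eps=8 / 255):
    """Single-step white-box attack: x_adv = clip(x + eps * sign(grad_x loss))."""
    x = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()

# stand-in classifier and toy batch (28x28 grayscale images, 10 classes)
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
x = torch.rand(4, 1, 28, 28)
y = torch.randint(0, 10, (4,))
x_adv = fgsm(model, x, y)
print((x_adv - x).abs().max())  # perturbation bounded by eps
```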
[AI-101] APEX-SQL: Talking to the data via Agentic Exploration for Text-to-SQL
【Quick Read】: This paper tackles the performance degradation of generative AI (Generative AI)-powered Text-to-SQL systems in complex enterprise environments, whose core bottleneck is that reliance on static schema representations cannot resolve semantic ambiguity and scales poorly to large, complex databases. The key to the solution is the APEX-SQL framework, which shifts the paradigm from passive translation to agentic exploration built on a hypothesis-verification loop: in the schema-linking phase it uses logical planning to verbalize hypotheses, dual-pathway pruning to shrink the search space, parallel data profiling to validate column roles, and global synthesis to ensure topological connectivity; in the SQL generation phase it introduces a deterministic mechanism for retrieving exploration directives, allowing the agent to explore data distributions, iteratively refine hypotheses, and generate semantically accurate SQL queries. Experiments show clear gains in execution accuracy (70.65% on BIRD, 51.01% on Spider 2.0-Snow) with reduced token consumption, and ablations confirm the contribution of each component to robustness and accuracy.
Link: https://arxiv.org/abs/2602.16720
Authors: Bowen Cao,Weibin Liao,Yushi Sun,Dong Fang,Haitao Li,Wai Lam
Affiliation: Unknown
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: Work in progress
Abstract:Text-to-SQL systems powered by Large Language Models have excelled on academic benchmarks but struggle in complex enterprise environments. The primary limitation lies in their reliance on static schema representations, which fails to resolve semantic ambiguity and scale effectively to large, complex databases. To address this, we propose APEX-SQL, an Agentic Text-to-SQL Framework that shifts the paradigm from passive translation to agentic exploration. Our framework employs a hypothesis-verification loop to ground model reasoning in real data. In the schema linking phase, we use logical planning to verbalize hypotheses, dual-pathway pruning to reduce the search space, and parallel data profiling to validate column roles against real data, followed by global synthesis to ensure topological connectivity. For SQL generation, we introduce a deterministic mechanism to retrieve exploration directives, allowing the agent to effectively explore data distributions, refine hypotheses, and generate semantically accurate SQLs. Experiments on BIRD (70.65% execution accuracy) and Spider 2.0-Snow (51.01% execution accuracy) demonstrate that APEX-SQL outperforms competitive baselines with reduced token consumption. Further analysis reveals that agentic exploration acts as a performance multiplier, unlocking the latent reasoning potential of foundation models in enterprise settings. Ablation studies confirm the critical contributions of each component in ensuring robust and accurate data analysis.
[AI-102] GPU-Accelerated Algorithms for Graph Vector Search: Taxonomy, Empirical Study, and Research Directions
【Quick Read】: This paper addresses the lack of systematic understanding of how graph-based Approximate Nearest Neighbor Search (ANNS) is optimized on modern GPU architectures, and the unclear end-to-end performance of such methods in practice. The key to the solution is a detailed taxonomy of GPU optimization strategies that clarifies the mapping between algorithmic tasks and GPU hardware execution units, together with a comprehensive evaluation of six leading algorithms on eight large-scale benchmark datasets. The study reveals that distance computation remains the primary computational bottleneck, while host-CPU-to-GPU data transfer becomes the dominant factor in real-world latency, yielding clear guidelines and a benchmark for designing scalable, robust GPU-accelerated ANNS systems.
Link: https://arxiv.org/abs/2602.16719
Authors: Yaowen Liu,Xuejia Chen,Anxin Tian,Haoyang Li,Qinbin Li,Xin Zhang,Alexander Zhou,Chen Jason Zhang,Qing Li,Lei Chen
Affiliation: Unknown
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments:
Abstract:Approximate Nearest Neighbor Search (ANNS) underpins many large-scale data mining and machine learning applications, with efficient retrieval increasingly hinging on GPU acceleration as dataset sizes grow. Although graph-based approaches represent the state of the art in approximate nearest neighbor search, there is a lack of systematic understanding regarding their optimization for modern GPU architectures and their end-to-end effectiveness in practical scenarios. In this work, we present a comprehensive survey and experimental study of GPU-accelerated graph-based vector search algorithms. We establish a detailed taxonomy of GPU optimization strategies and clarify the mapping between algorithmic tasks and hardware execution units within GPUs. Through a thorough evaluation of six leading algorithms on eight large-scale benchmark datasets, we assess both graph index construction and query search performance. Our analysis reveals that distance computation remains the primary computational bottleneck, while data transfer between the host CPU and GPU emerges as the dominant factor influencing real-world latency at large scale. We also highlight key trade-offs in scalability and memory usage across different system designs. Our findings offer clear guidelines for designing scalable and robust GPU-powered approximate nearest neighbor search systems, and provide a comprehensive benchmark for the knowledge discovery and data mining community.
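The bottleneck the study identifies, distance computation, reduces to one large matrix multiply via the standard expansion of the squared Euclidean distance; a NumPy sketch of that identity (GPU implementations differ mainly in where the GEMM runs):

```python
import numpy as np

def pairwise_sq_l2(Q, X):
    """||q - x||^2 = ||q||^2 + ||x||^2 - 2 q.x; the Q @ X.T GEMM dominates,
    which is why distance computation is the main compute bottleneck."""
    return ((Q ** 2).sum(1)[:, None]
            + (X ** 2).sum(1)[None, :]
            - 2.0 * Q @ X.T)

Q = np.random.rand(8, 128).astype(np.float32)        # query vectors
X = np.random.rand(100_000, 128).astype(np.float32)  # base vectors
nearest = pairwise_sq_l2(Q, X).argmin(axis=1)        # brute-force 1-NN
```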
[AI-103] Contextuality from Single-State Representations: An Information-Theoretic Principle for Adaptive Intelligence
【Quick Read】: This paper asks what fundamental representational limits classical probabilistic representations face when adaptive systems, constrained in memory, representation, or physical resources, handle tasks across multiple contexts by reusing a fixed internal state space (single-state reuse). The key to the solution is a proof that any classical model reproducing context-dependent outcome statistics must incur an irreducible information-theoretic cost: dependence on context cannot be mediated solely through the internal state. This shows that contextuality is not peculiar to quantum mechanics but an inevitable consequence of single-state reuse in classical probabilistic frameworks. The paper further shows that nonclassical probabilistic frameworks can avoid this obstruction by relaxing the assumption of a single global joint probability space, without invoking quantum dynamics or Hilbert space structure, thereby identifying contextuality as a general representational constraint applicable to all adaptive intelligent systems.
Link: https://arxiv.org/abs/2602.16716
Authors: Song-Ju Kim
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: This paper addresses contextuality from a representation-theoretic and information-theoretic perspective in adaptive systems. It is conceptually and technically distinct from the authors’ earlier arXiv works (QTOW/QTOW2), which pursue different formulations of contextuality
Abstract:Adaptive systems often operate across multiple contexts while reusing a fixed internal state space due to constraints on memory, representation, or physical resources. Such single-state reuse is ubiquitous in natural and artificial intelligence, yet its fundamental representational consequences remain poorly understood. We show that contextuality is not a peculiarity of quantum mechanics, but an inevitable consequence of single-state reuse in classical probabilistic representations. Modeling contexts as interventions acting on a shared internal state, we prove that any classical model reproducing contextual outcome statistics must incur an irreducible information-theoretic cost: dependence on context cannot be mediated solely through the internal state. We provide a minimal constructive example that explicitly realizes this cost and clarifies its operational meaning. We further explain how nonclassical probabilistic frameworks avoid this obstruction by relaxing the assumption of a single global joint probability space, without invoking quantum dynamics or Hilbert space structure. Our results identify contextuality as a general representational constraint on adaptive intelligence, independent of physical implementation.
[AI-104] AIdentifyAGE Ontology for Decision Support in Forensic Dental Age Assessment
【Quick Read】: This paper addresses the transparency and reproducibility problems in forensic dental age assessment caused by methodological heterogeneity, fragmented data representation, and limited interoperability between clinical, forensic, and legal information systems, problems amplified by the growing adoption of AI-assisted methods. The key to the solution is a domain-specific ontology named AIdentifyAGE, which provides a standardized, semantically coherent framework covering both manual and AI-assisted forensic dental age assessment workflows, enables traceable linkage between observations, assessment methods, reference data, and reported outcomes, and integrates judicial context, individual-level information, radiographic imaging, statistical reference studies, and AI-based estimation methods while complying with the FAIR principles (Findable, Accessible, Interoperable, Reusable), thereby improving the transparency, consistency, and explainability of ontology-driven decision support systems.
Link: https://arxiv.org/abs/2602.16714
Authors: Renato Marcelo,Ana Rodrigues,Cristiana Palmela Pereira,António Figueiras,Rui Santos,José Rui Figueira,Alexandre P Francisco,Cátia Vaz
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Age assessment is crucial in forensic and judicial decision-making, particularly in cases involving undocumented individuals and unaccompanied minors, where legal thresholds determine access to protection, healthcare, and judicial procedures. Dental age assessment is widely recognized as one of the most reliable biological approaches for adolescents and young adults, but current practices are challenged by methodological heterogeneity, fragmented data representation, and limited interoperability between clinical, forensic, and legal information systems. These limitations hinder transparency and reproducibility, amplified by the increasing adoption of AI- based methods. The AIdentifyAGE ontology is domain-specific and provides a standardized, semantically coherent framework, encompassing both manual and AI-assisted forensic dental age assessment workflows, and enabling traceable linkage between observations, methods, reference data, and reported outcomes. It models the complete medico-legal workflow, integrating judicial context, individual-level information, forensic examination data, dental developmental assessment methods, radiographic imaging, statistical reference studies, and AI-based estimation methods. It is being developed together with domain experts, and it builds on upper and established biomedical, dental, and machine learning ontologies, ensuring interoperability, extensibility, and compliance with FAIR principles. The AIdentifyAGE ontology is a fundamental step to enhance consistency, transparency, and explainability, establishing a robust foundation for ontology-driven decision support systems in medico-legal and judicial contexts.
[AI-105] Toward a Fully Autonomous AI-Native Particle Accelerator
【Quick Read】: This paper addresses the dependence of current particle accelerator systems on human intervention and the resulting limits on design and operational efficiency, aiming for autonomous operation that improves science output and reliability. The key to its solution is artificial intelligence (AI) co-design: conceiving facilities as AI-native platforms from inception and jointly optimizing the accelerator lattice, diagnostics, and science application to maximize performance over the full life cycle, with nine research thrusts (such as agentic control architectures, digital twins, and adaptive learning) guiding the evolution of accelerators toward intelligent, autonomous operation.
Link: https://arxiv.org/abs/2602.17536
Authors: Chris Tennant
Affiliation: Unknown
Subjects: Accelerator Physics (physics.acc-ph); Artificial Intelligence (cs.AI)
Comments: 14 pages, 1 figure
Abstract:This position paper presents a vision for self-driving particle accelerators that operate autonomously with minimal human intervention. We propose that future facilities be designed through artificial intelligence (AI) co-design, where AI jointly optimizes the accelerator lattice, diagnostics, and science application from inception to maximize performance while enabling autonomous operation. Rather than retrofitting AI onto human-centric systems, we envision facilities designed from the ground up as AI-native platforms. We outline nine critical research thrusts spanning agentic control architectures, knowledge integration, adaptive learning, digital twins, health monitoring, safety frameworks, modular hardware design, multimodal data fusion, and cross-domain collaboration. This roadmap aims to guide the accelerator community toward a future where AI-driven design and operation deliver unprecedented science output and reliability.
[AI-106] Systematic Evaluation of Single-Cell Foundation Model Interpretability Reveals Attention Captures Co-Expression Rather Than Unique Regulatory Signal
【Quick Read】: This paper addresses how to evaluate mechanistic interpretability in single-cell foundation models, in particular whether attention patterns capture biological structure and improve perturbation prediction. The key to the solution is a systematic evaluation framework comprising 37 analyses, 153 statistical tests, four cell types, and two perturbation modalities, which quantifies the relationship between attention and gene regulatory networks (GRNs) and their contribution to perturbation prediction. The study finds that although attention patterns encode layer-specific biological information (protein-protein interactions in early layers, transcriptional regulation in late layers), this structure adds no incremental value for perturbation prediction: simple gene-level baselines clearly outperform attention or correlation edges (AUROC 0.81-0.88 vs. 0.70), pairwise edge scores yield no predictive gain, and causally ablating regulatory heads causes no degradation. The paper also proposes Cell-State Stratified Interpretability (CSSI) to address an attention-specific scaling failure, improving GRN recovery by up to 1.85x in both the RPE1 and K562 cell lines, thereby establishing reusable quality-control standards for the field.
Link: https://arxiv.org/abs/2602.17532
Authors: Ihor Kendiukhov
Affiliation: Unknown
Subjects: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
Comments:
Abstract:We present a systematic evaluation framework - thirty-seven analyses, 153 statistical tests, four cell types, two perturbation modalities - for assessing mechanistic interpretability in single-cell foundation models. Applying this framework to scGPT and Geneformer, we find that attention patterns encode structured biological information with layer-specific organisation - protein-protein interactions in early layers, transcriptional regulation in late layers - but this structure provides no incremental value for perturbation prediction: trivial gene-level baselines outperform both attention and correlation edges (AUROC 0.81-0.88 versus 0.70), pairwise edge scores add zero predictive contribution, and causal ablation of regulatory heads produces no degradation. These findings generalise from K562 to RPE1 cells; the attention-correlation relationship is context-dependent, but gene-level dominance is universal. Cell-State Stratified Interpretability (CSSI) addresses an attention-specific scaling failure, improving GRN recovery up to 1.85x. The framework establishes reusable quality-control standards for the field.
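The edge-level evaluation reported here (attention scores ranked against a reference GRN) is a plain AUROC computation once edge scores and a binary adjacency are in hand; a sketch with synthetic stand-ins for both:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

n_genes = 100
attn_scores = np.random.rand(n_genes, n_genes)               # e.g. averaged attention weights
grn = (np.random.rand(n_genes, n_genes) < 0.05).astype(int)  # reference regulatory edges

mask = ~np.eye(n_genes, dtype=bool)  # score off-diagonal gene pairs only
print(roc_auc_score(grn[mask], attn_scores[mask]))
```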
[AI-107] Extending quantum theory with AI-assisted deterministic game theory
【Quick Read】: This paper aims to construct a local hidden-variable theory extending quantum theory without contradicting experimental results on quantum nonlocality. The central challenge is circumventing impossibility theorems such as Bell's theorem, which rest on the "free choice" assumption that measurement settings are independent of the underlying hidden variables. The key to the solution is a weaker, compatibilist version of free choice, called "contingent free choice", together with modeling complex quantum experiments as a game between observers and the universe, where the universe is treated as an economic agent minimizing action. Within this framework, neural networks learn the game's reward functions (which contain a hidden variable), optimized by minimizing the Kullback-Leibler divergence between the predicted distributions and the extended Born rule, enabling deterministic simulation of quantum experiments. The approach abandons the unilateral-deviation assumption of classical Nash equilibria in favor of "Perfect Prediction", offering a new computational paradigm and empirical foundation for exploring local-realist avenues.
Link: https://arxiv.org/abs/2602.17213
Authors: Florian Pauschitz,Ben Moseley,Ghislain Fourny
Affiliation: Unknown
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments: Extended abstract, 3 pages plus references. Preprint in progress
Abstract:We present an AI-assisted framework for predicting individual runs of complex quantum experiments, including contextuality and causality (adaptive measurements), within our long-term programme of discovering a local hidden-variable theory that extends quantum theory. In order to circumvent impossibility theorems, we replace the assumption of free choice (measurement independence and parameter independence) with a weaker, compatibilistic version called contingent free choice. Our framework is based on interpreting complex quantum experiments as a Chess-like game between observers and the universe, which is seen as an economic agent minimizing action. The game structures corresponding to generic experiments such as fixed-causal-order process matrices or causal contextuality scenarios, together with a deterministic non-Nashian resolution algorithm that abandons unilateral deviation assumptions (free choice) and assumes Perfect Prediction instead, were described in previous work. In this new research, we learn the reward functions of the game, which contain a hidden variable, using neural networks. The cost function is the Kullback-Leibler divergence between the frequency histograms obtained through many deterministic runs of the game and the predictions of the extended Born rule. Using our framework on the specific case of the EPR 2-2-2 experiment acts as a proof-of-concept and a toy local-realist hidden-variable model that non-Nashian quantum theory is a promising avenue towards a local hidden-variable theory. Our framework constitutes a solid foundation, which can be further expanded in order to fully discover a complete quantum theory.
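The cost function named in the abstract, the Kullback-Leibler divergence between run-frequency histograms and the extended Born rule's predictions, is compact to write down. The outcome counts below are invented for illustration; the actual game and rule are not reproduced:

```python
import numpy as np

def kl_divergence(counts, q, eps=1e-12):
    """KL(p || q), where p is the empirical histogram of game outcomes
    and q is the target (extended Born rule) distribution."""
    p = counts / counts.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

runs = np.array([480.0, 21.0, 19.0, 480.0])  # toy outcome counts over many runs
born = np.array([0.48, 0.02, 0.02, 0.48])    # toy target probabilities
print(kl_divergence(runs, born))
```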
[AI-108] Deeper detection limits in astronomical imaging using self-supervised spatiotemporal denoising
【Quick Read】: This paper addresses the detection limits of astronomical imaging observations imposed by multiple noise sources, including noise correlated across pixels and exposures. The key to the solution is a self-supervised transformer-based denoising algorithm (ASTERIS) whose core innovation is integrating spatiotemporal information across multiple exposures to learn and correct correlated noise while preserving the point spread function (PSF) and photometric accuracy. Benchmarks on mock data and observational validation with the James Webb Space Telescope (JWST) and the Subaru telescope show that ASTERIS improves detection limits by 1.0 magnitude at 90% completeness and purity, recovers previously undetectable low-surface-brightness galaxy structures and gravitationally lensed arcs, and, applied to deep JWST images, triples the number of redshift z=9 galaxy candidates while reaching rest-frame ultraviolet luminosities 1.0 magnitude fainter than previous methods.
Link: https://arxiv.org/abs/2602.17205
Authors: Yuduo Guo,Hao Zhang,Mingyu Li,Fujiang Yu,Yunjing Wu,Yuhan Hao,Song Huang,Yongming Liang,Xiaojing Lin,Xinyang Li,Jiamin Wu,Zheng Cai,Qionghai Dai
Affiliation: Unknown
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Astrophysics of Galaxies (astro-ph.GA); Artificial Intelligence (cs.AI)
Comments: Published in Science. This is the author’s version of the work. It is posted here by permission of the AAAS for personal use, not for redistribution
Abstract:The detection limit of astronomical imaging observations is limited by several noise sources. Some of that noise is correlated between neighbouring image pixels and exposures, so in principle could be learned and corrected. We present an astronomical self-supervised transformer-based denoising algorithm (ASTERIS), that integrates spatiotemporal information across multiple exposures. Benchmarking on mock data indicates that ASTERIS improves detection limits by 1.0 magnitude at 90% completeness and purity, while preserving the point spread function and photometric accuracy. Observational validation using data from the James Webb Space Telescope (JWST) and Subaru telescope identifies previously undetectable features, including low-surface-brightness galaxy structures and gravitationally-lensed arcs. Applied to deep JWST images, ASTERIS identifies three times more redshift 9 galaxy candidates, with rest-frame ultraviolet luminosity 1.0 magnitude fainter, than previous methods.
[AI-109] Universal Fine-Grained Symmetry Inference and Enforcement for Rigorous Crystal Structure Prediction
【Quick Read】: This paper addresses the inadequate handling of crystallographic symmetry in existing deep learning models for crystal structure prediction (CSP): conventional methods retrieve space-group and Wyckoff templates from databases of known structures, limiting physical fidelity and the discovery of genuinely new material structures. The key to the solution is threefold: first, a large language model (LLM) encodes chemical semantics and directly generates fine-grained Wyckoff patterns from composition, bypassing reliance on existing databases; second, an efficient constrained-optimization search rigorously enforces algebraic consistency between site multiplicities and atomic stoichiometry, ensuring the generated symmetry is physically valid; finally, embedding this symmetry-consistent template into a diffusion backbone constrains the stochastic generative trajectory to a physically valid geometric manifold, markedly improving performance on stability, uniqueness, and novelty (SUN) benchmarks and enabling efficient expansion into unexplored materials space.
Link: https://arxiv.org/abs/2602.17176
Authors: Shi Yin,Jinming Mu,Xudong Zhu,Lixin He
Affiliation: Unknown
Subjects: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
Comments:
Abstract:Crystal structure prediction (CSP), which aims to predict the three-dimensional atomic arrangement of a crystal from its composition, is central to materials discovery and mechanistic understanding. Existing deep learning models often treat crystallographic symmetry only as a soft heuristic or rely on space group and Wyckoff templates retrieved from known structures, which limits both physical fidelity and the ability to discover genuinely new material structures. In contrast to retrieval-based methods, our approach leverages large language models to encode chemical semantics and directly generate fine-grained Wyckoff patterns from composition, effectively circumventing the limitations inherent to database lookups. Crucially, we incorporate domain knowledge into the generative process through an efficient constrained-optimization search that rigorously enforces algebraic consistency between site multiplicities and atomic stoichiometry. By integrating this symmetry-consistent template into a diffusion backbone, our approach constrains the stochastic generative trajectory to a physically valid geometric manifold. This framework achieves state-of-the-art performance across stability, uniqueness, and novelty (SUN) benchmarks, alongside superior matching performance, thereby establishing a new paradigm for the rigorous exploration of targeted crystallographic space. This framework enables efficient expansion into previously uncharted materials space, eliminating reliance on existing databases or a priori structural knowledge.
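The algebraic consistency the method enforces (selected Wyckoff site multiplicities must sum to each species' atom count) is, at its core, a small combinatorial search. A toy enumeration that ignores the additional occupancy rules real Wyckoff positions carry:

```python
def multiplicity_assignments(mults, target, partial=()):
    """Yield nondecreasing tuples of multiplicities drawn from `mults`
    (reuse allowed) whose sum equals `target`, e.g. one species' atom count."""
    if target == 0:
        yield partial
        return
    for i, m in enumerate(mults):
        if m <= target:
            yield from multiplicity_assignments(mults[i:], target - m, partial + (m,))

# e.g. candidate multiplicities (1, 2, 3, 6) for a species with 6 atoms per cell
for combo in multiplicity_assignments((1, 2, 3, 6), 6):
    print(combo)
```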
[AI-110] Deep Reinforcement Learning for Optimal Portfolio Allocation: A Comparative Study with Mean-Variance Optimization ICAPS2023
【Quick Read】: This paper addresses the efficiency and effectiveness of asset allocation in traditional portfolio management, in particular the lack of systematic comparisons between deep reinforcement learning (DRL) and classical methods such as mean-variance portfolio optimization (MVO) in practical settings. The key to the solution is a model-free DRL framework evaluated through thorough backtests against MVO on key metrics (Sharpe ratio, maximum drawdown, and absolute returns), together with an account of the practical adjustments needed to deploy DRL and the refinements required in the MVO implementation, confirming that DRL is more adaptive and robust in complex market conditions.
Link: https://arxiv.org/abs/2602.17098
Authors: Srijan Sood,Kassiani Papasotiriou,Marius Vaiciulis,Tucker Balch
Affiliation: Unknown
Subjects: Portfolio Management (q-fin.PM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 6 figures. Published at the FinPlan’23 Workshop, the 33rd International Conference on Automated Planning and Scheduling (ICAPS 2023)
Abstract:Portfolio Management is the process of overseeing a group of investments, referred to as a portfolio, with the objective of achieving predetermined investment goals. Portfolio optimization is a key component that involves allocating the portfolio assets so as to maximize returns while minimizing risk taken. It is typically carried out by financial professionals who use a combination of quantitative techniques and investment expertise to make decisions about the portfolio allocation. Recent applications of Deep Reinforcement Learning (DRL) have shown promising results when used to optimize portfolio allocation by training model-free agents on historical market data. Many of these methods compare their results against basic benchmarks or other state-of-the-art DRL agents but often fail to compare their performance against traditional methods used by financial professionals in practical settings. One of the most commonly used methods for this task is Mean-Variance Portfolio Optimization (MVO), which uses historical time series information to estimate expected asset returns and covariances, which are then used to optimize for an investment objective. Our work is a thorough comparison between model-free DRL and MVO for optimal portfolio allocation. We detail the specifics of how to make DRL for portfolio optimization work in practice, also noting the adjustments needed for MVO. Backtest results demonstrate strong performance of the DRL agent across many metrics, including Sharpe ratio, maximum drawdowns, and absolute returns.
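The MVO baseline has a textbook closed form in the unconstrained case, with weights proportional to Sigma^{-1} mu; a minimal sketch of that estimator (the paper's MVO pipeline includes practical adjustments this omits):

```python
import numpy as np

def mvo_weights(mu, Sigma, risk_aversion=1.0):
    """Unconstrained mean-variance solution w* = (1/gamma) * Sigma^{-1} mu,
    rescaled to sum to one (long/short positions allowed)."""
    w = np.linalg.solve(Sigma, mu) / risk_aversion
    return w / w.sum()

mu = np.array([0.08, 0.05, 0.11])    # estimated expected returns
Sigma = np.diag([0.04, 0.02, 0.09])  # estimated return covariance
print(mvo_weights(mu, Sigma))
```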
[AI-111] General sample size analysis for probabilities of causation: a delta method approach
【Quick Read】: This paper addresses the lack of sample size analysis for probabilities of causation (PoCs): given a desired margin of error, how many experimental and observational samples are needed. Although prior work derives bounds on PoCs such as the probability of necessity and sufficiency (PNS) from combined experimental and observational data, it offers little guidance on sample size. The key to the solution is a general sample-size framework based on the delta method, applicable whenever the target PoC bounds can be expressed as finite minima or maxima of linear combinations of experimental and observational probabilities, enabling controlled analysis of the bound-estimation error; simulations confirm that the proposed calculations yield stable estimation of the bounds.
Link: https://arxiv.org/abs/2602.17070
Authors: Tianyuan Cheng,Ruirui Mao,Judea Pearl,Ang Li
Affiliation: Unknown
Subjects: Methodology (stat.ME); Artificial Intelligence (cs.AI)
Comments:
Abstract:Probabilities of causation (PoCs), such as the probability of necessity and sufficiency (PNS), are important tools for decision making but are generally not point identifiable. Existing work has derived bounds for these quantities using combinations of experimental and observational data. However, there is very limited research on sample size analysis, namely, how many experimental and observational samples are required to achieve a desired margin of error. In this paper, we propose a general sample size framework based on the delta method. Our approach applies to settings in which the target bounds of PoCs can be expressed as finite minima or maxima of linear combinations of experimental and observational probabilities. Through simulation studies, we demonstrate that the proposed sample size calculations lead to stable estimation of these bounds.
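For orientation, the best-known bounds of the required form (finite maxima/minima of linear combinations of interventional probabilities) are the classical Tian-Pearl bounds on PNS, paired here with the generic delta-method variance approximation; both are standard results restated for context, not taken from this paper:

```latex
% Tian-Pearl bounds on PNS from experimental probabilities alone:
\max\{\, 0,\; P(y_x) - P(y_{x'}) \,\}
  \;\le\; \mathrm{PNS} \;\le\;
\min\{\, P(y_x),\; P(y'_{x'}) \,\}
% Delta method: for a smooth bound g(\theta) and an estimator \hat\theta with
% \sqrt{n}\,(\hat\theta - \theta) \to N(0, \Sigma), the variance satisfies
% \mathrm{Var}\big(g(\hat\theta)\big) \approx \nabla g(\theta)^\top \Sigma \,\nabla g(\theta) / n .
```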
Machine Learning
[LG-0] Multi-Round Human-AI Collaboration with User-Specified Requirements
Link: https://arxiv.org/abs/2602.17646
Authors: Sima Noorani,Shayan Kiyani,Hamed Hassani,George Pappas
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:As humans increasingly rely on multiround conversational AI for high stakes decisions, principled frameworks are needed to ensure such interactions reliably improve decision quality. We adopt a human centric view governed by two principles: counterfactual harm, ensuring the AI does not undermine human strengths, and complementarity, ensuring it adds value where the human is prone to err. We formalize these concepts via user defined rules, allowing users to specify exactly what harm and complementarity mean for their specific task. We then introduce an online, distribution free algorithm with finite sample guarantees that enforces the user-specified constraints over the collaboration dynamics. We evaluate our framework across two interactive settings: LLM simulated collaboration on a medical diagnostic task and a human crowdsourcing study on a pictorial reasoning task. We show that our online procedure maintains prescribed counterfactual harm and complementarity violation rates even under nonstationary interaction dynamics. Moreover, tightening or loosening these constraints produces predictable shifts in downstream human accuracy, confirming that the two principles serve as practical levers for steering multi-round collaboration toward better decision quality without the need to model or constrain human behavior.
[LG-1] A.R.I.S.: Automated Recycling Identification System for E-Waste Classification Using Deep Learning
Link: https://arxiv.org/abs/2602.17642
Authors: Dhruv Talwar,Harsh Desai,Wendong Yin,Goutam Mohanty,Rafael Reveles
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Traditional electronic recycling processes suffer from significant resource loss due to inadequate material separation and identification capabilities, limiting material recovery. We present A.R.I.S. (Automated Recycling Identification System), a low-cost, portable sorter for shredded e-waste that addresses this efficiency gap. The system employs a YOLOx model to classify metals, plastics, and circuit boards in real time, achieving low inference latency with high detection accuracy. Experimental evaluation yielded 90% overall precision, 82.2% mean average precision (mAP), and 84% sortation purity. By integrating deep learning with established sorting methods, A.R.I.S. enhances material recovery efficiency and lowers barriers to advanced recycling adoption. This work complements broader initiatives in extending product life cycles, supporting trade-in and recycling programs, and reducing environmental impact across the supply chain.
[LG-2] Catastrophic Forgetting Resilient One-Shot Incremental Federated Learning
Link: https://arxiv.org/abs/2602.17625
Authors: Obaidullah Zaland,Zulfiqar Ahmad Khan,Monowar Bhuyan
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Accepted for publication in the IEEE International Conference on Big Data (IEEE BigData) 2025
Abstract:Modern big-data systems generate massive, heterogeneous, and geographically dispersed streams that are large-scale and privacy-sensitive, making centralization challenging. While federated learning (FL) provides a privacy-enhancing training mechanism, it assumes a static data flow and learns a collaborative model over multiple rounds, making learning with \textitincremental data challenging in limited-communication scenarios. This paper presents One-Shot Incremental Federated Learning (OSI-FL), the first FL framework that addresses the dual challenges of communication overhead and catastrophic forgetting. OSI-FL communicates category-specific embeddings, devised by a frozen vision-language model (VLM) from each client in a single communication round, which a pre-trained diffusion model at the server uses to synthesize new data similar to the client’s data distribution. The synthesized samples are used on the server for training. However, two challenges still persist: i) tasks arriving incrementally need to retrain the global model, and ii) as future tasks arrive, retraining the model introduces catastrophic forgetting. To this end, we augment training with Selective Sample Retention (SSR), which identifies and retains the top-p most informative samples per category and task pair based on sample loss. SSR bounds forgetting by ensuring that representative retained samples are incorporated into training in further iterations. The experimental results indicate that OSI-FL outperforms baselines, including traditional and one-shot FL approaches, in both class-incremental and domain-incremental scenarios across three benchmark datasets.
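SSR's retention rule (keep the top-p most informative samples per category-task pair, ranked by sample loss) is easy to sketch; the dictionary layout and the reading of higher loss as higher informativeness are our assumptions:

```python
from collections import defaultdict

def select_retained(samples, p=0.1):
    """Keep the top-p fraction of samples per (category, task) group,
    ranked by per-sample loss (read here as informativeness)."""
    groups = defaultdict(list)
    for s in samples:  # each s: {"category": ..., "task": ..., "loss": ...}
        groups[(s["category"], s["task"])].append(s)
    retained = []
    for group in groups.values():
        group.sort(key=lambda s: s["loss"], reverse=True)
        retained.extend(group[: max(1, int(p * len(group)))])
    return retained

demo = [{"category": c, "task": 0, "loss": l}
        for c, l in [(0, 0.1), (0, 2.3), (1, 0.7)]]
print(select_retained(demo, p=0.5))
```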
[LG-3] Guarding the Middle: Protecting Intermediate Representations in Federated Split Learning
Link: https://arxiv.org/abs/2602.17614
Authors: Obaidullah Zaland,Sajib Mistry,Monowar Bhuyan
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Accepted for Publication in IEEE International Conference on Big Data (IEEE BigData) 2025
Abstract:Big data scenarios, where massive, heterogeneous datasets are distributed across clients, demand scalable, privacy-preserving learning methods. Federated learning (FL) enables decentralized training of machine learning (ML) models across clients without data centralization. Decentralized training, however, introduces a computational burden on client devices. U-shaped federated split learning (UFSL) offloads a fraction of the client computation to the server while keeping both data and labels on the clients’ side. However, the intermediate representations (i.e., smashed data) shared by clients with the server are prone to exposing clients’ private data. To reduce exposure of client data through intermediate data representations, this work proposes k-anonymous differentially private UFSL (KD-UFSL), which leverages privacy-enhancing techniques such as microaggregation and differential privacy to minimize data leakage from the smashed data transferred to the server. We first demonstrate that an adversary can access private client data from intermediate representations via a data-reconstruction attack, and then present a privacy-enhancing solution, KD-UFSL, to mitigate this risk. Our experiments indicate that, alongside increasing the mean squared error between the actual and reconstructed images by up to 50% in some cases, KD-UFSL also decreases the structural similarity between them by up to 40% on four benchmarking datasets. More importantly, KD-UFSL improves privacy while preserving the utility of the global model. This highlights its suitability for large-scale big data applications where privacy and utility must be balanced.
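The two ingredients named in the abstract, microaggregation and differential privacy, can be combined on smashed data in a few lines; grouping records by vector norm and the Laplace mechanism parameters below are illustrative choices rather than the paper's exact scheme:

```python
import numpy as np

def k_anonymize(smashed, k=5):
    """Microaggregation sketch: replace each group of >= k similar records
    with its centroid, so no individual activation is released verbatim."""
    order = np.argsort(np.linalg.norm(smashed, axis=1))  # crude similarity proxy
    out = smashed.copy()
    for i in range(0, len(order), k):
        idx = order[i:i + k]
        out[idx] = smashed[idx].mean(axis=0)
    return out

def laplace_noise(x, eps=1.0, sensitivity=1.0):
    """Laplace mechanism: calibrated noise on the k-anonymized smashed data."""
    return x + np.random.laplace(scale=sensitivity / eps, size=x.shape)

protected = laplace_noise(k_anonymize(np.random.rand(32, 64)))
```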
[LG-4] Asymptotic Smoothing of the Lipschitz Loss Landscape in Overparameterized One-Hidden-Layer ReLU Networks
Link: https://arxiv.org/abs/2602.17596
Authors: Saveliy Baturin
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:We study the topology of the loss landscape of one-hidden-layer ReLU networks under overparameterization. On the theory side, we (i) prove that for convex L-Lipschitz losses with an \ell_1-regularized second layer, every pair of models at the same loss level can be connected by a continuous path within an arbitrarily small loss increase \epsilon (extending a known result for the quadratic loss); (ii) obtain an asymptotic upper bound on the energy gap \epsilon between local and global minima that vanishes as the width m grows, implying that the landscape flattens and sublevel sets become connected in the limit. Empirically, on a synthetic Moons dataset and on the Wisconsin Breast Cancer dataset, we measure pairwise energy gaps via Dynamic String Sampling (DSS) and find that wider networks exhibit smaller gaps; in particular, a permutation test on the maximum gap yields p_\mathrm{perm} = 0, indicating a clear reduction in the barrier height.
[LG-5] Canonicalizing Multimodal Contrastive Representation Learning
Link: https://arxiv.org/abs/2602.17584
Authors: Sharut Gupta,Sanyam Kansal,Stefanie Jegelka,Phillip Isola,Vikas Garg
Subjects: Machine Learning (cs.LG)
Comments: 78 pages, 57 figures
Abstract:As models and data scale, independently trained networks often induce analogous notions of similarity. But, matching similarities is weaker than establishing an explicit correspondence between the representation spaces, especially for multimodal models, where consistency must hold not only within each modality, but also for the learned image-text coupling. We therefore ask: given two independently trained multimodal contrastive models (with encoders (f, g) and (\widetilde{f}, \widetilde{g}) ) – trained on different distributions and with different architectures – does a systematic geometric relationship exist between their embedding spaces? If so, what form does it take, and does it hold uniformly across modalities? In this work, we show that across model families such as CLIP, SigLIP, and FLAVA, this geometric relationship is well approximated by an orthogonal map (up to a global mean shift), i.e., there exists an orthogonal map Q where Q^\top Q = I such that \widetilde{f}(x) \approx Q f(x) for paired images x . Strikingly, the same Q simultaneously aligns the text encoders i.e., \widetilde{g}(y) \approx Q g(y) for texts y . Theoretically, we prove that if the multimodal kernel agrees across models on a small anchor set i.e. \langle f(x), g(y)\rangle \approx \langle \widetilde{f}(x), \widetilde{g}(y)\rangle , then the two models must be related by a single orthogonal map Q and the same Q maps images and text across models. More broadly, this finding enables backward-compatible model upgrades, avoiding costly re-embedding, and has implications for the privacy of learned representations. Our project page: this https URL
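Fitting the claimed orthogonal map is the classical orthogonal Procrustes problem, solved in closed form by an SVD after removing the global mean shift the abstract mentions; a minimal sketch with a planted rotation:

```python
import numpy as np

def fit_orthogonal_map(F, F_tilde):
    """Solve Q = argmin ||F Q - F_tilde||_F subject to Q^T Q = I
    (orthogonal Procrustes), after centering both embedding sets."""
    F = F - F.mean(axis=0)
    F_tilde = F_tilde - F_tilde.mean(axis=0)
    U, _, Vt = np.linalg.svd(F.T @ F_tilde)
    return U @ Vt

# rows: embeddings of the same inputs under two independently trained models
F = np.random.randn(1000, 64)
Q_true, _ = np.linalg.qr(np.random.randn(64, 64))
F_tilde = F @ Q_true + 0.01 * np.random.randn(1000, 64)
Q = fit_orthogonal_map(F, F_tilde)
print(np.allclose(Q, Q_true, atol=0.05))  # approximately recovers the planted map
```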
[LG-6] Simultaneous Blackwell Approachability and Applications to Multiclass Omniprediction
Link: https://arxiv.org/abs/2602.17577
Authors: Lunjia Hu,Kevin Tian,Chutong Yang
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Omniprediction is a learning problem that requires suboptimality bounds for each of a family of losses \mathcal{L} against a family of comparator predictors \mathcal{C} . We initiate the study of omniprediction in a multiclass setting, where the comparator family \mathcal{C} may be infinite. Our main result is an extension of the recent binary omniprediction algorithm of [OKK25] to the multiclass setting, with sample complexity (in statistical settings) or regret horizon (in online settings) \approx \varepsilon^{-(k+1)} , for \varepsilon-omniprediction in a k-class prediction problem. En route to proving this result, we design a framework of potential broader interest for solving Blackwell approachability problems where multiple sets must simultaneously be approached via coupled actions.
[LG-7] Revisiting Weight Regularization for Low-Rank Continual Learning ICLR2026
Link: https://arxiv.org/abs/2602.17559
Authors: Yaoyue Zheng,Yin Zhang,Joost van de Weijer,Gido M van de Ven,Shaoyi Du,Xuetao Zhang,Zhiqiang Tian
Subjects: Machine Learning (cs.LG)
Comments: Accepted by ICLR 2026
Abstract:Continual Learning (CL) with large-scale pre-trained models (PTMs) has recently gained wide attention, shifting the focus from training from scratch to continually adapting PTMs. This has given rise to a promising paradigm: parameter-efficient continual learning (PECL), where task interference is typically mitigated by assigning a task-specific module during training, such as low-rank adapters. However, weight regularization techniques, such as Elastic Weight Consolidation (EWC)-a key strategy in CL-remain underexplored in this new paradigm. In this paper, we revisit weight regularization in low-rank CL as a new perspective for mitigating task interference in PECL. Unlike existing low-rank CL methods, we mitigate task interference by regularizing a shared low-rank update through EWC, thereby keeping the storage requirement and inference costs constant regardless of the number of tasks. Our proposed method EWC-LoRA leverages a low-rank representation to estimate parameter importance over the full-dimensional space. This design offers a practical, computational- and memory-efficient solution for CL with PTMs, and provides insights that may inform the broader application of regularization techniques within PECL. Extensive experiments on various benchmarks demonstrate the effectiveness of EWC-LoRA, achieving a stability-plasticity trade-off superior to existing low-rank CL approaches. These results indicate that, even under low-rank parameterizations, weight regularization remains an effective mechanism for mitigating task interference. Code is available at: this https URL.
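The regularizer at the heart of EWC-LoRA is the standard EWC quadratic penalty, here applied to the shared low-rank update; a minimal PyTorch sketch with placeholder adapter factors and Fisher estimates:

```python
import torch

def ewc_penalty(params, params_star, fisher, lam=1.0):
    """(lam / 2) * sum_i F_i * (theta_i - theta*_i)^2 over adapter parameters,
    where theta* anchors the values after the previous task."""
    return 0.5 * lam * sum(
        (f * (p - p0).pow(2)).sum()
        for p, p0, f in zip(params, params_star, fisher)
    )

# placeholder low-rank adapter factors A, B and their post-task anchors
params = [torch.randn(64, 8, requires_grad=True),
          torch.randn(8, 64, requires_grad=True)]
params_star = [p.detach().clone() + 0.1 for p in params]
fisher = [torch.rand_like(p) for p in params]  # diagonal Fisher estimate
loss = ewc_penalty(params, params_star, fisher, lam=10.0)
loss.backward()  # gradients pull parameters toward their anchors
```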
[LG-8] A Theoretical Framework for Modular Learning of Robust Generative Models
Link: https://arxiv.org/abs/2602.17554
Authors: Corinna Cortes,Mehryar Mohri,Yutao Zhong
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Training large-scale generative models is resource-intensive and relies heavily on heuristic dataset weighting. We address two fundamental questions: Can we train Large Language Models (LLMs) modularly-combining small, domain-specific experts to match monolithic performance-and can we do so robustly for any data mixture, eliminating heuristic tuning? We present a theoretical framework for modular generative modeling where a set of pre-trained experts are combined via a gating mechanism. We define the space of normalized gating functions, G_1 , and formulate the problem as a minimax game to find a single robust gate that minimizes divergence to the worst-case data mixture. We prove the existence of such a robust gate using Kakutani’s fixed-point theorem and show that modularity acts as a strong regularizer, with generalization bounds scaling with the lightweight gate’s complexity. Furthermore, we prove that this modular approach can theoretically outperform models retrained on aggregate data, with the gap characterized by the Jensen-Shannon Divergence. Finally, we introduce a scalable Stochastic Primal-Dual algorithm and a Structural Distillation method for efficient inference. Empirical results on synthetic and real-world datasets confirm that our modular architecture effectively mitigates gradient conflict and can robustly outperform monolithic baselines.
[LG-9] IRIS: Learning-Driven Task-Specific Cinema Robot Arm for Visuomotor Motion Control
Link: https://arxiv.org/abs/2602.17537
Authors: Qilong Cheng,Matthew Mackay,Ali Bereyhi
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:
Abstract:Robotic camera systems enable dynamic, repeatable motion beyond human capabilities, yet their adoption remains limited by the high cost and operational complexity of industrial-grade platforms. We present the Intelligent Robotic Imaging System (IRIS), a task-specific 6-DOF manipulator designed for autonomous, learning-driven cinematic motion control. IRIS integrates a lightweight, fully 3D-printed hardware design with a goal-conditioned visuomotor imitation learning framework based on Action Chunking with Transformers (ACT). The system learns object-aware and perceptually smooth camera trajectories directly from human demonstrations, eliminating the need for explicit geometric programming. The complete platform costs under 1,000 USD, supports a 1.5 kg payload, and achieves approximately 1 mm repeatability. Real-world experiments demonstrate accurate trajectory tracking, reliable autonomous execution, and generalization across diverse cinematic motions.
[LG-10] Provably Explaining Neural Additive Models ICLR2026
Link: https://arxiv.org/abs/2602.17530
Authors: Shahaf Bassan,Yizhak Yisrael Elboher,Tobias Ladner,Volkan Şahin,Jan Kretinsky,Matthias Althoff,Guy Katz
Subjects: Machine Learning (cs.LG); Computational Complexity (cs.CC); Logic in Computer Science (cs.LO)
Comments: To appear in ICLR 2026
Abstract:Despite significant progress in post-hoc explanation methods for neural networks, many remain heuristic and lack provable guarantees. A key approach for obtaining explanations with provable guarantees is by identifying a cardinally-minimal subset of input features which by itself is provably sufficient to determine the prediction. However, for standard neural networks, this task is often computationally infeasible, as it demands a worst-case exponential number of verification queries in the number of input features, each of which is NP-hard. In this work, we show that for Neural Additive Models (NAMs), a recent and more interpretable neural network family, we can efficiently generate explanations with such guarantees. We present a new model-specific algorithm for NAMs that generates provably cardinally-minimal explanations using only a logarithmic number of verification queries in the number of input features, after a parallelized preprocessing step with logarithmic runtime in the required precision is applied to each small univariate NAM component. Our algorithm not only makes the task of obtaining cardinally-minimal explanations feasible, but even outperforms existing algorithms designed to find the relaxed variant of subset-minimal explanations - which may be larger and less informative but easier to compute - despite our algorithm solving a much more difficult task. Our experiments demonstrate that, compared to previous algorithms, our approach provides provably smaller explanations than existing works and substantially reduces the computation time. Moreover, we show that our generated provable explanations offer benefits that are unattainable by standard sampling-based techniques typically used to interpret NAMs.
[LG-11] Variational inference via radial transport
Link: https://arxiv.org/abs/2602.17525
Authors: Luca Ghafourpour,Sinho Chewi,Alessio Figalli,Aram-Alexandre Pooladian
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments:
Abstract:In variational inference (VI), the practitioner approximates a high-dimensional distribution \pi with a simple surrogate one, often a (product) Gaussian distribution. However, in many cases of practical interest, Gaussian distributions might not capture the correct radial profile of \pi , resulting in poor coverage. In this work, we approach the VI problem from the perspective of optimizing over these radial profiles. Our algorithm radVI is a cheap, effective add-on to many existing VI schemes, such as Gaussian (mean-field) VI and Laplace approximation. We provide theoretical convergence guarantees for our algorithm, owing to recent developments in optimization over the Wasserstein space–the space of probability distributions endowed with the Wasserstein distance–and new regularity properties of radial transport maps in the style of Caffarelli (2000).
[LG-12] Retrospective In-Context Learning for Temporal Credit Assignment with Large Language Models NEURIPS2025
Link: https://arxiv.org/abs/2602.17497
Authors: Wen-Tse Chen,Jiayu Chen,Fahim Tajwar,Hao Zhu,Xintong Duan,Ruslan Salakhutdinov,Jeff Schneider
Subjects: Machine Learning (cs.LG)
Comments: Accepted to NeurIPS 2025
Abstract:Learning from self-sampled data and sparse environmental feedback remains a fundamental challenge in training self-evolving agents. Temporal credit assignment mitigates this issue by transforming sparse feedback into dense supervision signals. However, previous approaches typically depend on learning task-specific value functions for credit assignment, which suffer from poor sample efficiency and limited generalization. In this work, we propose to leverage pretrained knowledge from large language models (LLMs) to transform sparse rewards into dense training signals (i.e., the advantage function) through retrospective in-context learning (RICL). We further propose an online learning framework, RICOL, which iteratively refines the policy based on the credit assignment results from RICL. We empirically demonstrate that RICL can accurately estimate the advantage function with limited samples and effectively identify critical states in the environment for temporal credit assignment. Extended evaluation on four BabyAI scenarios show that RICOL achieves comparable convergent performance with traditional online RL algorithms with significantly higher sample efficiency. Our findings highlight the potential of leveraging LLMs for temporal credit assignment, paving the way for more sample-efficient and generalizable RL paradigms.
[LG-13] Variational Grey-Box Dynamics Matching AISTATS2026
Link: https://arxiv.org/abs/2602.17477
Authors: Gurjeet Sangra Singh,Frantzeska Lavda,Giangiacomo Mercatali,Alexandros Kalousis
Subjects: Machine Learning (cs.LG)
Comments: AISTATS 2026. Code is available at this https URL
Abstract:Deep generative models such as flow matching and diffusion models have shown great potential in learning complex distributions and dynamical systems, but often act as black-boxes, neglecting underlying physics. In contrast, physics-based simulation models described by ODEs/PDEs remain interpretable, but may have missing or unknown terms, unable to fully describe real-world observations. We bridge this gap with a novel grey-box method that integrates incomplete physics models directly into generative models. Our approach learns dynamics from observational trajectories alone, without ground-truth physics parameters, in a simulation-free manner that avoids scalability and stability issues of Neural ODEs. The core of our method lies in modelling a structured variational distribution within the flow matching framework, by using two latent encodings: one to model the missing stochasticity and multi-modal velocity, and a second to encode physics parameters as a latent variable with a physics-informed prior. Furthermore, we present an adaptation of the framework to handle second-order dynamics. Our experiments on representative ODE/PDE problems show that our method performs on par with or superior to fully data-driven approaches and previous grey-box baselines, while preserving the interpretability of the physics model. Our code is available at this https URL.
[LG-14] MDP Planning as Policy Inference
Link: https://arxiv.org/abs/2602.17375
Authors: David Tolpin
Subjects: Machine Learning (cs.LG)
Comments: 28 pages, many figures
Abstract:We cast episodic Markov decision process (MDP) planning as Bayesian inference over policies. A policy is treated as the latent variable and is assigned an unnormalized probability of optimality that is monotone in its expected return, yielding a posterior distribution whose modes coincide with return-maximizing solutions while posterior dispersion represents uncertainty over optimal behavior. To approximate this posterior in discrete domains, we adapt variational sequential Monte Carlo (VSMC) to inference over deterministic policies under stochastic dynamics, introducing a sweep that enforces policy consistency across revisited states and couples transition randomness across particles to avoid confounding from simulator noise. Acting is performed by posterior predictive sampling, which induces a stochastic control policy through a Thompson-sampling interpretation rather than entropy regularization. Across grid worlds, Blackjack, Triangle Tireworld, and Academic Advising, we analyze the structure of inferred policy distributions and compare the resulting behavior to discrete Soft Actor-Critic, highlighting qualitative and statistical differences that arise from policy-level uncertainty.
[LG-15] 2Mamba2Furious: Linear in Complexity, Competitive in Accuracy
Link: https://arxiv.org/abs/2602.17363
Authors: Gabriel Mongaras,Eric C. Larson
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Linear attention transformers have become a strong alternative to softmax attention due to their efficiency. However, linear attention tends to be less expressive and results in reduced accuracy compared to softmax attention. To bridge the accuracy gap between softmax attention and linear attention, we manipulate Mamba-2, a very strong linear attention variant. We first simplify Mamba-2 down to its most fundamental and important components, evaluating which specific choices make it most accurate. From this simplified Mamba variant (Mamba-2S), we improve the A-mask and increase the order of the hidden state, resulting in a method, which we call 2Mamba, that is nearly as accurate as softmax attention, yet much more memory efficient for long context lengths. We also investigate elements of Mamba-2 that help surpass softmax attention accuracy. Code is provided for all our experiments.
[LG-16] Shortcut learning in geometric knot classification
Link: https://arxiv.org/abs/2602.17350
Authors: Djordje Mihajlovic,Davide Michieletto
Subjects: Machine Learning (cs.LG); Soft Condensed Matter (cond-mat.soft); Geometric Topology (math.GT)
Comments: 17 pages, 6 figures, submitted to Machine Learning: Science and Technology, IOP
Abstract:Classifying the topology of closed curves is a central problem in low dimensional topology with applications beyond mathematics spanning protein folding, polymer physics and even magnetohydrodynamics. The central problem is how to determine whether two embeddings of a closed arc are equivalent under ambient isotopy. Given the striking ability of neural networks to solve complex classification tasks, it is therefore natural to ask if the knot classification problem can be tackled using Machine Learning (ML). In this paper, we investigate generic shortcut methods employed by ML to solve the knot classification challenge and specifically discover hidden non-topological features in training data generated through Molecular Dynamics simulations of polygonal knots that are used by ML to arrive to positive classifications results. We then provide a rigorous foundation for future attempts to tackle the knot classification challenge using ML by developing a publicly-available (i) dataset, that aims to remove the potential of non-topological feature classification and (ii) code, that can generate knot embeddings that faithfully explore chosen geometric state space with fixed knot topology. We expect that our work will accelerate the development of ML models that can solve complex geometric knot classification challenges.
[LG-17] Partial Optimality in the Preordering Problem
Link: https://arxiv.org/abs/2602.17346
Authors: David Stein,Jannik Irmai,Bjoern Andres
Subjects: Discrete Mathematics (cs.DM); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments:
Abstract:Preordering is a generalization of clustering and partial ordering with applications in bioinformatics and social network analysis. Given a finite set V and a value c_{ab} \in \mathbb{R} for every ordered pair ab of elements of V , the preordering problem asks for a preorder \lesssim on V that maximizes the sum of the values of those pairs ab for which a \lesssim b . Building on the state of the art in solving this NP-hard problem partially, we contribute new partial optimality conditions and efficient algorithms for deciding these conditions. In experiments with real and synthetic data, these new conditions increase, in particular, the fraction of pairs ab for which it is decided efficiently that a \not\lesssim b in an optimal preorder.
[LG-18] Open Datasets in Learning Analytics: Trends, Challenges, and Best PRACTICE KDD DATE
链接: https://arxiv.org/abs/2602.17314
作者: Valdemar Švábenský,Brendan Flanagan,Erwin Daniel López Zapata,Atsushi Shimada
类目: Computers and Society (cs.CY); Databases (cs.DB); Machine Learning (cs.LG)
*备注: Recently accepted to ACM Transactions on Knowledge Discovery from Data (TKDD). To appear. (Preprint will be updated with full bibliographic info.)
Abstract:Open datasets play a crucial role in three research domains that intersect data science and education: learning analytics, educational data mining, and artificial intelligence in education. Researchers in these domains apply computational methods to analyze data from educational contexts, aiming to better understand and improve teaching and learning. Providing open datasets alongside research papers supports reproducibility, collaboration, and trust in research findings. It also provides individual benefits for authors, such as greater visibility, credibility, and citation potential. Despite these advantages, the availability of open datasets and the associated practices within the learning analytics research communities, especially at their flagship conference venues, remain unclear. We surveyed available datasets published alongside research papers in learning analytics. We manually examined 1,125 papers from three flagship conferences (LAK, EDM, and AIED) over the past five years. We discovered, categorized, and analyzed 172 datasets used in 204 publications. Our study presents the most comprehensive collection and analysis of open educational datasets to date, along with the most detailed categorization. Of the 172 datasets identified, 143 were not captured in any prior survey of open data in learning analytics. We provide insights into the datasets’ context, analytical methods, use, and other properties. Based on this survey, we summarize the current gaps in the field. Furthermore, we list practical recommendations, advice, and 8-item guidelines under the acronym PRACTICE with a checklist to help researchers publish their data. Lastly, we share our original dataset: an annotated inventory detailing the discovered datasets and the corresponding publications. We hope these findings will support further adoption of open data practices in learning analytics communities and beyond.
[LG-19] LexiSafe: Offline Safe Reinforcement Learning with Lexicographic Safety-Reward Hierarchy
链接: https://arxiv.org/abs/2602.17312
作者: Hsin-Jung Yang,Zhanhong Jiang,Prajwal Koirala,Qisai Liu,Cody Fleming,Soumik Sarkar
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 17th ACM/IEEE International Conference on Cyber-Physical Systems
Abstract:Offline safe reinforcement learning (RL) is increasingly important for cyber-physical systems (CPS), where safety violations during training are unacceptable and only pre-collected data are available. Existing offline safe RL methods typically balance reward-safety tradeoffs through constraint relaxation or joint optimization, but they often lack structural mechanisms to prevent safety drift. We propose LexiSafe, a lexicographic offline RL framework designed to preserve safety-aligned behavior. We first develop LexiSafe-SC, a single-cost formulation for standard offline safe RL, and derive safety-violation and performance-suboptimality bounds that together yield sample-complexity guarantees. We then extend the framework to hierarchical safety requirements with LexiSafe-MC, which supports multiple safety costs and admits its own sample-complexity analysis. Empirically, LexiSafe demonstrates reduced safety violations and improved task performance compared to constrained offline baselines. By unifying lexicographic prioritization with structural bias, LexiSafe offers a practical and theoretically grounded approach for safety-critical CPS decision-making.
[LG-20] Efficient privacy loss accounting for subsampling and random allocation
链接: https://arxiv.org/abs/2602.17284
作者: Vitaly Feldman,Moshe Shenfeld
类目: Machine Learning (cs.LG)
*备注:
Abstract:We consider the privacy amplification properties of a sampling scheme in which a user’s data is used in k steps chosen randomly and uniformly from a sequence (or set) of t steps. This sampling scheme has been recently applied in the context of differentially private optimization (Chua et al., 2024a; Choquette-Choo et al., 2025) and communication-efficient high-dimensional private aggregation (Asi et al., 2025), where it was shown to have utility advantages over the standard Poisson sampling. Theoretical analyses of this sampling scheme (Feldman & Shenfeld, 2025; Dong et al., 2025) lead to bounds that are close to those of Poisson sampling, yet still have two significant shortcomings. First, in many practical settings, the resulting privacy parameters are not tight due to the approximation steps in the analysis. Second, the computed parameters are either the hockey-stick or Rényi divergence, both of which introduce overheads when used in privacy loss accounting. In this work, we demonstrate that the privacy loss distribution (PLD) of random allocation applied to any differentially private algorithm can be computed efficiently. When applied to the Gaussian mechanism, our results demonstrate that the privacy-utility trade-off for random allocation is at least as good as that of Poisson subsampling. In particular, random allocation is better suited for training via DP-SGD. To support these computations, our work develops new tools for general privacy loss accounting based on a notion of PLD realization. This notion allows us to extend accurate privacy loss accounting to subsampling, which previously required manual noise-mechanism-specific analysis.
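As background for what a PLD accountant computes, the following sketch evaluates the hockey-stick curve δ(ε) of the plain Gaussian mechanism using the standard analytic closed form of Balle and Wang (2018); this is the base case on top of which subsampling and random-allocation accounting operate, and the paper's efficient PLD computation itself is not reproduced here.

```python
from math import exp
from scipy.stats import norm

def gaussian_delta(eps, sigma, sens=1.0):
    # delta(eps) for the Gaussian mechanism with noise scale sigma and
    # query sensitivity sens (analytic Gaussian mechanism formula).
    a = sens / (2 * sigma)
    b = eps * sigma / sens
    return norm.cdf(a - b) - exp(eps) * norm.cdf(-a - b)

for eps in (0.5, 1.0, 2.0):
    print(f"eps={eps}: delta={gaussian_delta(eps, sigma=2.0):.2e}")
```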
[LG-21] RLGT: A reinforcement learning framework for extremal graph theory
链接: https://arxiv.org/abs/2602.17276
作者: Ivan Damnjanović,Uroš Milivojević,Irena Đorđević,Dragan Stevanović
类目: Machine Learning (cs.LG); Combinatorics (math.CO)
*备注:
Abstract:Reinforcement learning (RL) is a subfield of machine learning that focuses on developing models that can autonomously learn optimal decision-making strategies over time. In a recent pioneering paper, Wagner demonstrated how the Deep Cross-Entropy RL method can be applied to tackle various problems from extremal graph theory by reformulating them as combinatorial optimization problems. Subsequently, many researchers became interested in refining and extending the framework introduced by Wagner, thereby creating various RL environments specialized for graph theory. Moreover, a number of problems from extremal graph theory were solved through the use of RL. In particular, several inequalities concerning the Laplacian spectral radius of graphs were refuted, new lower bounds were obtained for certain Ramsey numbers, and contributions were made to the Turán-type extremal problem in which the forbidden structures are cycles of length three and four. Here, we present Reinforcement Learning for Graph Theory (RLGT), a novel RL framework that systematizes the previous work and provides support for both undirected and directed graphs, with or without loops, and with an arbitrary number of edge colors. The framework efficiently represents graphs and aims to facilitate future RL-based research in extremal graph theory through optimized computational performance and a clean and modular design.
[LG-22] Learning a Latent Pulse Shape Interface for Photoinjector Laser Systems
链接: https://arxiv.org/abs/2602.17263
作者: Alexander Klemps,Denis Ilia,Pradeep Kr. Banerjee,Ye Chen,Henrik Tünnermann,Nihat Ay
类目: Machine Learning (cs.LG)
*备注:
Abstract:Controlling the longitudinal laser pulse shape in photoinjectors of Free-Electron Lasers is a powerful lever for optimizing electron beam quality, but systematic exploration of the vast design space is limited by the cost of brute-force pulse propagation simulations. We present a generative modeling framework based on Wasserstein Autoencoders to learn a differentiable latent interface between pulse shaping and downstream beam dynamics. Our empirical findings show that the learned latent space is continuous and interpretable while maintaining high-fidelity reconstructions. Pulse families such as higher-order Gaussians trace coherent trajectories, while standardizing the temporal pulse lengths shows a latent organization correlated with pulse energy. Analysis via principal components and Gaussian Mixture Models reveals a well behaved latent geometry, enabling smooth transitions between distinct pulse types via linear interpolation. The model generalizes from simulated data to real experimental pulse measurements, accurately reconstructing pulses and embedding them consistently into the learned manifold. Overall, the approach reduces reliance on expensive pulse-propagation simulations and facilitates downstream beam dynamics simulation and analysis.
[LG-23] Structured Prototype-Guided Adaptation for EEG Foundation Models
链接: https://arxiv.org/abs/2602.17251
作者: Jingying Ma,Feng Wu,Yucheng Xing,Qika Lin,Tianyu Liu,Chenyu Liu,Ziyu Jia,Mengling Feng
类目: Machine Learning (cs.LG)
*备注:
Abstract:Electroencephalography (EEG) foundation models (EFMs) have achieved strong performance under full fine-tuning but exhibit poor generalization when subject-level supervision is limited, a common constraint in real-world clinical settings. We show that this failure stems not merely from limited supervision, but from a structural mismatch between noisy, limited supervision and the highly plastic parameter space of EFMs. To address this challenge, we propose SCOPE, a Structured COnfidence-aware Prototype-guided adaptation framework for EFM fine-tuning. SCOPE follows a two-stage pipeline. In the first stage, we construct reliable external supervision by learning geometry-regularized task priors, constructing balanced class-level prototypes over the resulting embeddings, and producing confidence-aware pseudo-labels from their agreement to filter unreliable signals on unlabeled data. In the second stage, we introduce ProAdapter, which adapts frozen EEG foundation models via a lightweight adapter conditioned on the structured prototypes. Experiments across three EEG tasks and five foundation model backbones demonstrate that SCOPE consistently achieves strong performance and efficiency under label-limited cross-subject settings.
[LG-24] CounterFlowNet: From Minimal Changes to Meaningful Counterfactual Explanations
链接: https://arxiv.org/abs/2602.17244
作者: Oleksii Furman,Patryk Marszałek,Jan Masłowski,Piotr Gaiński,Maciej Zięba,Marek Śmieja
类目: Machine Learning (cs.LG)
*备注:
Abstract:Counterfactual explanations (CFs) provide human-interpretable insights into a model’s predictions by identifying minimal changes to input features that would alter the model’s output. However, existing methods struggle to generate multiple high-quality explanations that (1) affect only a small portion of the features, (2) can be applied to tabular data with heterogeneous features, and (3) are consistent with user-defined constraints. We propose CounterFlowNet, a generative approach that formulates CF generation as sequential feature modification using conditional Generative Flow Networks (GFlowNets). CounterFlowNet is trained to sample CFs proportionally to a user-specified reward function that can encode key CF desiderata: validity, sparsity, proximity and plausibility, encouraging high-quality explanations. The sequential formulation yields highly sparse edits, while a unified action space seamlessly supports continuous and categorical features. Moreover, actionability constraints, such as immutability and monotonicity of features, can be enforced at inference time via action masking, without retraining. Experiments on eight datasets under two evaluation protocols demonstrate that CounterFlowNet achieves superior trade-offs between validity, sparsity, plausibility, and diversity with full satisfaction of the given constraints.
[LG-25] Privacy-Preserving Mechanisms Enable Cheap Verifiable Inference of LLMs
链接: https://arxiv.org/abs/2602.17223
作者: Arka Pal,Louai Zahran,William Gvozdjak,Akilesh Potti,Micah Goldblum
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:As large language models (LLMs) continue to grow in size, fewer users are able to host and run models locally. This has led to increased use of third-party hosting services. However, in this setting, there is a lack of guarantees on the computation performed by the inference provider. For example, a dishonest provider may replace an expensive large model with a cheaper-to-run weaker model and return the results from the weaker model to the user. Existing tools to verify inference typically rely on methods from cryptography such as zero-knowledge proofs (ZKPs), but these add significant computational overhead, and remain infeasible for use for large models. In this work, we develop a new insight – that given a method for performing private LLM inference, one can obtain forms of verified inference at marginal extra cost. Specifically, we propose two new protocols which leverage privacy-preserving LLM inference in order to provide guarantees over the inference that was carried out. Our approaches are cheap, requiring the addition of a few extra tokens of computation, and have little to no downstream impact. As the fastest privacy-preserving inference methods are typically faster than ZK methods, the proposed protocols also improve verification runtime. Our work provides novel insights into the connections between privacy and verifiability in LLM inference.
[LG-26] SoftDTW-CUDA-Torch: Memory-Efficient GPU-Accelerated Soft Dynamic Time Warping for PyTorch
链接: https://arxiv.org/abs/2602.17206
作者: Ron Shapira Weber,Oren Freifeld
类目: Machine Learning (cs.LG)
*备注: Technical Report
Abstract:We present softdtw-cuda-torch, an open-source PyTorch library for computing Soft Dynamic Time Warping (SoftDTW) on GPUs. Our implementation addresses three key limitations of existing GPU implementations of SoftDTW: a hard sequence-length cap of 1024, numerical instability in the backward pass for small smoothing parameters, and excessive GPU memory consumption from materializing pairwise distance tensors. We introduce (1) tiled anti-diagonal kernel execution that removes the sequence-length constraint, (2) a log-space backward pass that prevents floating-point overflow, and (3) a fused distance-computation mode that eliminates the O(BNM) intermediate distance tensor, achieving up to 98% memory reduction compared to prior work. The library supports arbitrary sequence lengths, full PyTorch autograd integration, and SoftDTW barycenter computation. Code is available at this https URL.
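For reference, a plain O(NM) NumPy version of the SoftDTW recurrence with a log-space (log-sum-exp) softmin illustrates the stability idea behind the library's log-space backward pass; the tiled CUDA kernels and the fused distance mode are not reflected in this sketch.

```python
import numpy as np

def softmin(values, gamma):
    # softmin_gamma(v) = -gamma * log(sum_i exp(-v_i / gamma)),
    # computed stably with the log-sum-exp trick.
    z = -np.asarray(values) / gamma
    m = z.max()
    return -gamma * (m + np.log(np.exp(z - m).sum()))

def soft_dtw(x, y, gamma=1.0):
    n, m = len(x), len(y)
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2  # squared distance, 1-D inputs
            R[i, j] = cost + softmin(
                [R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]], gamma)
    return R[n, m]

x = np.sin(np.linspace(0, 3, 50))
y = np.sin(np.linspace(0.5, 3.5, 60))
print(soft_dtw(x, y, gamma=0.1))
```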
[LG-27] Powering Up Zeroth-Order Training via Subspace Gradient Orthogonalization
链接: https://arxiv.org/abs/2602.17155
作者: Yicheng Lang,Changsheng Wang,Yihua Zhang,Mingyi Hong,Zheng Zhang,Wotao Yin,Sijia Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Zeroth-order (ZO) optimization provides a gradient-free alternative to first-order (FO) methods by estimating gradients via finite differences of function evaluations, and has recently emerged as a memory-efficient paradigm for fine-tuning large-scale models by avoiding backpropagation. However, ZO optimization has a fundamental tension between accuracy and query efficiency. In this work, we show that ZO optimization can be substantially improved by unifying two complementary principles: (i) a projection-based subspace view that reduces gradient estimation variance by exploiting the intrinsic low-rank structure of model updates, and (ii) Muon-style spectral optimization that applies gradient orthogonalization to extract informative spectral structure from noisy ZO gradients. These findings form a unified framework of subspace gradient orthogonalization, which we instantiate in a new method, ZO-Muon, admitting a natural interpretation as a low-rank Muon optimizer in the ZO setting. Extensive experiments on large language models (LLMs) and vision transformers (ViTs) demonstrate that ZO-Muon significantly accelerates convergence and achieves a win-win improvement in accuracy and query/runtime efficiency. Notably, compared to the popular MeZO baseline, ZO-Muon requires only 24.7% of the queries to reach the same SST-2 performance for LLM fine-tuning, and improves accuracy by 25.1% on ViT-B fine-tuning on CIFAR-100.
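The two ingredients the abstract combines can be sketched in a few lines: a two-point zeroth-order gradient estimate for a weight matrix, followed by orthogonalization of that estimate (here via SVD; Muon itself uses a Newton-Schulz iteration, and ZO-Muon additionally operates in a low-rank subspace). All names below are illustrative, not the paper's API.

```python
import numpy as np

rng = np.random.default_rng(0)

def zo_gradient(loss, W, mu=1e-3, n_queries=32):
    # Average two-point finite-difference estimates along random
    # Gaussian directions U: (L(W + mu*U) - L(W - mu*U)) / (2*mu) * U.
    g = np.zeros_like(W)
    for _ in range(n_queries):
        U = rng.standard_normal(W.shape)
        g += (loss(W + mu * U) - loss(W - mu * U)) / (2 * mu) * U
    return g / n_queries

def orthogonalize(G):
    # Replace G by its polar factor U V^T (nearest semi-orthogonal matrix).
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

# Toy quadratic loss over an 8x4 weight matrix.
W_star = rng.standard_normal((8, 4))
loss = lambda W: float(np.sum((W - W_star) ** 2))
W = np.zeros((8, 4))
for _ in range(200):
    W -= 0.05 * orthogonalize(zo_gradient(loss, W))
print("final loss:", loss(W))
```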
[LG-28] When More Experts Hurt: Underfitting in Multi-Expert Learning to Defer
链接: https://arxiv.org/abs/2602.17144
作者: Shuqi Liu,Yuzhou Cao,Lei Feng,Bo An,Luke Ong
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Learning to Defer (L2D) enables a classifier to abstain from predictions and defer to an expert, and has recently been extended to multi-expert settings. In this work, we show that multi-expert L2D is fundamentally more challenging than the single-expert case. With multiple experts, the classifier’s underfitting becomes inherent and seriously degrades prediction performance, whereas in the single-expert setting it arises only under specific conditions. We theoretically reveal that this stems from an intrinsic expert identifiability issue: learning which expert to trust from a diverse pool, a problem that is absent in the single-expert case and that renders existing underfitting remedies ineffective. To tackle this issue, we propose PiCCE (Pick the Confident and Correct Expert), a surrogate-based method that adaptively identifies a reliable expert based on empirical evidence. PiCCE effectively reduces multi-expert L2D to a single-expert-like learning problem, thereby resolving multi-expert underfitting. We further prove its statistical consistency and ability to recover class probabilities and expert accuracies. Extensive experiments across diverse settings, including real-world expert scenarios, validate our theoretical results and demonstrate improved performance.
[LG-29] i-PhysGaussian: Implicit Physical Simulation for 3D Gaussian Splatting
链接: https://arxiv.org/abs/2602.17117
作者: Yicheng Cao,Zhuo Huang,Yu Yao,Yiming Ying,Daoyi Dong,Tongliang Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Physical simulation predicts future states of objects based on material properties and external loads, providing blueprints for both industry and engineering to conduct risk management. Current 3D reconstruction-based simulators typically rely on explicit, step-wise updates, which are sensitive to the time-step size and suffer from rapid accuracy degradation in complicated scenarios, such as high-stiffness materials or quasi-static movement. To address this, we introduce i-PhysGaussian, a framework that couples 3D Gaussian Splatting (3DGS) with an implicit Material Point Method (MPM) integrator. Unlike explicit methods, our solution obtains an end-of-step state by minimizing a momentum-balance residual through implicit Newton-type optimization with a GMRES solver. This formulation significantly reduces time-step sensitivity and ensures physical consistency. Our results demonstrate that i-PhysGaussian maintains stability at up to 20x larger time steps than explicit baselines, preserving structural coherence and smooth motion even in complex dynamic transitions.
[LG-30] Simplify to Amplify: Achieving Information-Theoretic Bounds with Fewer Steps in Spectral Community Detection
链接: https://arxiv.org/abs/2602.17104
作者: Sie Hendrata Dharmawan,Peter Chin
类目: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注: 9 pages plus appendix, 3 figures
Abstract:We propose a streamlined spectral algorithm for community detection in the two-community stochastic block model (SBM) under constant edge density assumptions. By reducing algorithmic complexity through the elimination of non-essential preprocessing steps, our method directly leverages the spectral properties of the adjacency matrix. We demonstrate that our algorithm exploits specific characteristics of the second eigenvalue to achieve improved error bounds that approach information-theoretic limits, representing a significant improvement over existing methods. Theoretical analysis establishes that our error rates are tighter than previously reported bounds in the literature. Comprehensive experimental validation confirms our theoretical findings and demonstrates the practical effectiveness of the simplified approach. Our results suggest that algorithmic simplification, rather than increasing complexity, can lead to both computational efficiency and enhanced performance in spectral community detection.
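On a toy instance, the simplified recipe amounts to partitioning by the sign of the second eigenvector of the raw adjacency matrix, with no trimming or other preprocessing; the parameters below are illustrative and do not match the paper's constant-density regime exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p_in, p_out = 400, 0.10, 0.02
labels = np.array([0] * (n // 2) + [1] * (n // 2))

# Sample a symmetric two-community SBM adjacency matrix.
P = np.where(labels[:, None] == labels[None, :], p_in, p_out)
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T

# The eigenvector of the second-largest eigenvalue carries the community signal.
eigvals, eigvecs = np.linalg.eigh(A)  # eigenvalues in ascending order
pred = (eigvecs[:, -2] > 0).astype(int)

acc = max(np.mean(pred == labels), np.mean(pred != labels))  # up to label swap
print(f"recovery accuracy: {acc:.3f}")
```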
[LG-31] Online Learning with Improving Agents: Multiclass Budgeted Agents and Bandit Learners
链接: https://arxiv.org/abs/2602.17103
作者: Sajad Ashkezari,Shai Ben-David
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We investigate the recently introduced model of learning with improvements, where agents are allowed to make small changes to their feature values to be warranted a more desirable label. We extensively extend previously published results by providing combinatorial dimensions that characterize online learnability in this model, by analyzing the multiclass setting and learnability under bandit feedback, by modeling agents’ costs for making improvements, and more.
[LG-32] Operationalization of Machine Learning with Serverless Architecture: An Industrial Implementation for Harmonized System Code Prediction
链接: https://arxiv.org/abs/2602.17102
作者: Sai Vineeth Kandappareddigari,Santhoshkumar Jagadish,Gauri Verma,Ilhuicamina Contreras,Christopher Dignam,Anmol Srivastava,Benjamin Demers
类目: Machine Learning (cs.LG)
*备注: 13 pages. ICAD '26
Abstract:This paper presents a serverless MLOps framework orchestrating the complete ML lifecycle, spanning data ingestion, training, deployment, monitoring, and retraining, using event-driven pipelines and managed services. The architecture is model-agnostic, supporting diverse inference patterns through standardized interfaces, enabling rapid adaptation without infrastructure overhead. We demonstrate practical applicability through an industrial implementation for Harmonized System (HS) code prediction, a compliance-critical task where short, unstructured product descriptions are mapped to standardized codes used by customs authorities in global trade. Frequent updates and ambiguous descriptions make classification challenging, with errors causing shipment delays and financial losses. Our solution uses a custom text embedding encoder and multiple deep learning architectures, with Text-CNN achieving 98 percent accuracy on ground truth data. Beyond accuracy, the pipeline ensures reproducibility, auditability, and SLA adherence under variable loads via auto-scaling. A key feature is automated A/B testing, enabling dynamic model selection and safe promotion in production. Cost-efficiency drives model choice; while transformers may achieve similar accuracy, their long-term operational costs are significantly higher. Deterministic classification with predictable latency and explainability is prioritized, though the architecture remains extensible to transformer variants and LLM-based inference. The paper first introduces the deep learning architectures with simulations and model comparisons, then discusses industrialization through serverless architecture, demonstrating automated retraining, prediction, and validation of HS codes. This work provides a replicable blueprint for operationalizing ML using serverless architecture, enabling enterprises to scale while optimizing performance and economics.
[LG-33] A Locality Radius Framework for Understanding Relational Inductive Bias in Database Learning
链接: https://arxiv.org/abs/2602.17092
作者: Aadi Joshi,Kavya Bhand
类目: Machine Learning (cs.LG)
*备注:
Abstract:Foreign key discovery and related schema-level prediction tasks are often modeled using graph neural networks (GNNs), implicitly assuming that relational inductive bias improves performance. However, it remains unclear when multi-hop structural reasoning is actually necessary. In this work, we introduce locality radius, a formal measure of the minimum structural neighborhood required to determine a prediction in relational schemas. We hypothesize that model performance depends critically on alignment between task locality radius and architectural aggregation depth. We conduct a controlled empirical study across foreign key prediction, join cost estimation, blast radius regression, cascade impact classification, and additional graph-derived schema tasks. Our evaluation includes multi-seed experiments, capacity-matched comparisons, statistical significance testing, scaling analysis, and synthetic radius-controlled benchmarks. Results reveal a consistent bias-radius alignment effect.
[LG-34] Synergizing Transport-Based Generative Models and Latent Geometry for Stochastic Closure Modeling
链接: https://arxiv.org/abs/2602.17089
作者: Xinghao Dong,Huchen Yang,Jin-long Wu
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Computational Physics (physics.comp-ph)
*备注:
Abstract:Diffusion models recently developed for generative AI tasks can produce high-quality samples while still maintaining diversity among samples to promote mode coverage, providing a promising path for learning stochastic closure models. Compared to other types of generative AI models, such as GANs and VAEs, diffusion models are known to suffer from slow sampling. By systematically comparing transport-based generative models on a numerical example of 2D Kolmogorov flows, we show that flow matching in a lower-dimensional latent space is suited for fast sampling of stochastic closure models, enabling single-step sampling that is up to two orders of magnitude faster than iterative diffusion-based approaches. To control the latent space distortion and thus ensure the physical fidelity of the sampled closure term, we compare the implicit regularization offered by a joint training scheme against two explicit regularizers: metric-preserving (MP) and geometry-aware (GA) constraints. Besides offering a faster sampling speed, both explicitly and implicitly regularized latent spaces inherit the key topological information from the lower-dimensional manifold of the original complex dynamical system, which enables the learning of stochastic closure models without demanding a huge amount of training data.
[LG-35] MeGU: Machine-Guided Unlearning with Target Feature Disentanglement
链接: https://arxiv.org/abs/2602.17088
作者: Haoyu Wang,Zhuo Huang,Xiaolong Wang,Bo Han,Zhiwei Lin,Tongliang Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:The growing concern over training data privacy has elevated the “Right to be Forgotten” into a critical requirement, thereby raising the demand for effective Machine Unlearning. However, existing unlearning approaches commonly suffer from a fundamental trade-off: aggressively erasing the influence of target data often degrades model utility on retained data, while conservative strategies leave residual target information intact. In this work, the intrinsic representation properties learned during model pretraining are analyzed. It is demonstrated that semantic class concepts are entangled at the feature-pattern level, sharing associated features while preserving concept-specific discriminative components. This entanglement fundamentally limits the effectiveness of existing unlearning paradigms. Motivated by this insight, we propose Machine-Guided Unlearning (MeGU), a novel framework that guides unlearning through concept-aware re-alignment. Specifically, Multi-modal Large Language Models (MLLMs) are leveraged to explicitly determine re-alignment directions for target samples by assigning semantically meaningful perturbing labels. To improve efficiency, inter-class conceptual similarities estimated by the MLLM are encoded into a lightweight transition matrix. Furthermore, MeGU introduces a positive-negative feature noise pair to explicitly disentangle target concept influence. During finetuning, the negative noise suppresses target-specific feature patterns, while the positive noise reinforces remaining associated features and aligns them with perturbing concepts. This coordinated design enables selective disruption of target-specific representations while preserving shared semantic structures. As a result, MeGU enables controlled and selective forgetting, effectively mitigating both under-unlearning and over-unlearning.
[LG-36] Adam Improves Muon: Adaptive Moment Estimation with Orthogonalized Momentum
链接: https://arxiv.org/abs/2602.17080
作者: Minxin Zhang,Yuxuan Liu,Hayden Scheaffer
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 39 pages, 6 figures
Abstract:Efficient stochastic optimization typically integrates an update direction that performs well in the deterministic regime with a mechanism adapting to stochastic perturbations. While Adam uses adaptive moment estimates to promote stability, Muon utilizes the weight layers’ matrix structure via orthogonalized momentum, showing superior performance in large language model training. We propose a new optimizer and a diagonal extension, NAMO and NAMO-D, providing the first principled integration of orthogonalized momentum with norm-based Adam-type noise adaptation. NAMO scales orthogonalized momentum using a single adaptive stepsize, preserving orthogonality while improving upon Muon at negligible additional cost. NAMO-D instead right-multiplies orthogonalized momentum by a diagonal matrix with clamped entries. This design enables neuron-wise noise adaptation and aligns with the common near block-diagonal Hessian structure. Under standard assumptions, we establish optimal convergence rates for both algorithms in the deterministic setting and show that, in the stochastic setting, their convergence guarantees adapt to the noise level of stochastic gradients. Experiments on pretraining GPT-2 models demonstrate improved performance of both NAMO and NAMO-D compared to the AdamW and Muon baselines, with NAMO-D achieving further gains over NAMO via an additional clamping hyperparameter that balances the competing goals of maintaining a well-conditioned update direction and leveraging fine-grained noise adaptation.
[LG-37] Spatio-temporal dual-stage hypergraph MARL for human-centric multimodal corridor traffic signal control
链接: https://arxiv.org/abs/2602.17068
作者: Xiaocai Zhang,Neema Nassir,Milad Haghani
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:Human-centric traffic signal control in corridor networks must increasingly account for multimodal travelers, particularly high-occupancy public transportation, rather than focusing solely on vehicle-centric performance. This paper proposes STDSH-MARL (Spatio-Temporal Dual-Stage Hypergraph based Multi-Agent Reinforcement Learning), a scalable multi-agent deep reinforcement learning framework that follows a centralized training and decentralized execution paradigm. The proposed method captures spatio-temporal dependencies through a novel dual-stage hypergraph attention mechanism that models interactions across both spatial and temporal hyperedges. In addition, a hybrid discrete action space is introduced to jointly determine the next signal phase configuration and its corresponding green duration, enabling more adaptive signal timing decisions. Experiments conducted on a corridor network under five traffic scenarios demonstrate that STDSH-MARL consistently improves multimodal performance and provides clear benefits for public transportation priority. Compared with state-of-the-art baseline methods, the proposed approach achieves superior overall performance. Further ablation studies confirm the contribution of each component of STDSH-MARL, with temporal hyperedges identified as the most influential factor driving the observed performance gains.
[LG-38] Multi-Probe Zero Collision Hash (MPZCH): Mitigating Embedding Collisions and Enhancing Model Freshness in Large-Scale Recommenders
链接: https://arxiv.org/abs/2602.17050
作者: Ziliang Zhao,Bi Xue,Emma Lin,Mengjiao Zhou,Kaustubh Vartak,Shakhzod Ali-Zade,Carson Lu,Tao Li,Bin Kuang,Rui Jian,Bin Wen,Dennis van der Staay,Yixin Bao,Eddy Li,Chao Deng,Songbin Liu,Qifan Wang,Kai Ren
类目: Machine Learning (cs.LG)
*备注: 10 pages, 6 figures
Abstract:Embedding tables are critical components of large-scale recommendation systems, facilitating the efficient mapping of high-cardinality categorical features into dense vector representations. However, as the volume of unique IDs expands, traditional hash-based indexing methods suffer from collisions that degrade model performance and personalization quality. We present Multi-Probe Zero Collision Hash (MPZCH), a novel indexing mechanism based on linear probing that effectively mitigates embedding collisions. With reasonable table sizing, it often eliminates these collisions entirely while maintaining production-scale efficiency. MPZCH utilizes auxiliary tensors and high-performance CUDA kernels to implement configurable probing and active eviction policies. By retiring obsolete IDs and resetting reassigned slots, MPZCH prevents the stale embedding inheritance typical of hash-based methods, ensuring new features learn effectively from scratch. Despite its collision-mitigation overhead, the system maintains training QPS and inference latency comparable to existing methods. Rigorous online experiments demonstrate that MPZCH achieves zero collisions for user embeddings and significantly improves item embedding freshness and quality. The solution has been released within the open-source TorchRec library for the broader community.
[LG-39] WS-GRPO: Weakly-Supervised Group-Relative Policy Optimization for Rollout-Efficient Reasoning
链接: https://arxiv.org/abs/2602.17025
作者: Gagan Mundada,Zihan Huang,Rohan Surana,Sheldon Yu,Jennifer Yuntong Zhang,Xintong Li,Tong Yu,Lina Yao,Jingbo Shang,Julian McAuley,Junda Wu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Group Relative Policy Optimization (GRPO) is effective for training language models on complex reasoning. However, since the objective is defined relative to a group of sampled trajectories, extended deliberation can create more chances to realize relative gains, leading to inefficient reasoning and overthinking, and complicating the trade-off between correctness and rollout efficiency. Controlling this behavior is difficult in practice: (i) length penalties are hard to calibrate, because longer rollouts may reflect harder problems that require longer reasoning, and penalizing tokens risks truncating useful reasoning along with redundant continuation; and (ii) supervision that directly indicates when to continue or stop is typically unavailable beyond final-answer correctness. We propose Weakly Supervised GRPO (WS-GRPO), which improves rollout efficiency by converting terminal rewards into correctness-aware guidance over partial trajectories. Unlike global length penalties that are hard to calibrate, WS-GRPO trains a preference model from outcome-only correctness to produce prefix-level signals that indicate when additional continuation is beneficial. Thus, WS-GRPO supplies outcome-derived continue/stop guidance, reducing redundant deliberation while maintaining accuracy. We provide theoretical results and empirically show on reasoning benchmarks that WS-GRPO substantially reduces rollout length while remaining competitive with GRPO baselines.
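The group-relative baseline that WS-GRPO inherits from GRPO is simple to state: advantages are terminal rewards standardized within the group of rollouts drawn for the same prompt. A minimal sketch follows; the prefix-level preference model that supplies WS-GRPO's continue/stop signals is the paper's contribution and is not reproduced here.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    # GRPO advantage: standardize terminal rewards within one prompt's group.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Correctness (0/1) of eight rollouts sampled for one prompt.
print(group_relative_advantages([1, 0, 0, 1, 1, 0, 0, 0]))
```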
[LG-40] Malliavin Calculus as Stochastic Backpropagation
链接: https://arxiv.org/abs/2602.17013
作者: Kevin D. Oden
类目: Machine Learning (cs.LG)
*备注:
Abstract:We establish a rigorous connection between pathwise (reparameterization) and score-function (Malliavin) gradient estimators by showing that both arise from the Malliavin integration-by-parts identity. Building on this equivalence, we introduce a unified and variance-aware hybrid estimator that adaptively combines pathwise and Malliavin gradients using their empirical covariance structure. The resulting formulation provides a principled understanding of stochastic backpropagation and achieves minimum variance among all unbiased linear combinations, with closed-form finite-sample convergence bounds. We demonstrate 9% variance reduction on VAEs (CIFAR-10) and up to 35% on strongly-coupled synthetic problems. Exploratory policy gradient experiments reveal that non-stationary optimization landscapes present challenges for the hybrid approach, highlighting important directions for future work. Overall, this work positions Malliavin calculus as a conceptually unifying and practically interpretable framework for stochastic gradient estimation, clarifying when hybrid approaches provide tangible benefits and when they face inherent limitations.
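A small numerical sketch conveys the hybrid idea for d/dμ E[f(μ + σZ)], Z ~ N(0,1): compute per-sample pathwise gradients f'(μ + σZ) and score-function gradients f(X)(X − μ)/σ², then mix them with the variance-minimizing weight for a convex combination of two unbiased estimators. This illustrates the combination rule only, not the paper's Malliavin-calculus derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 0.5, 1.0, 20000
f = np.sin          # test integrand
fprime = np.cos

z = rng.standard_normal(n)
x = mu + sigma * z
g_path = fprime(x)                     # pathwise per-sample gradients
g_score = f(x) * (x - mu) / sigma**2   # score-function per-sample gradients

# alpha* = (Var(g2) - Cov(g1, g2)) / (Var(g1) + Var(g2) - 2 Cov(g1, g2))
C = np.cov(g_path, g_score)
alpha = (C[1, 1] - C[0, 1]) / (C[0, 0] + C[1, 1] - 2 * C[0, 1])
g_hybrid = alpha * g_path + (1 - alpha) * g_score

for name, g in [("pathwise", g_path), ("score", g_score), ("hybrid", g_hybrid)]:
    print(f"{name:9s} mean={g.mean():+.4f}  var={g.var():.4f}")
```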
[LG-41] Action-Graph Policies: Learning Action Co-dependencies in Multi-Agent Reinforcement Learning
链接: https://arxiv.org/abs/2602.17009
作者: Nikunj Gupta,James Zachary Hare,Jesse Milzman,Rajgopal Kannan,Viktor Prasanna
类目: Machine Learning (cs.LG)
*备注:
Abstract:Coordinating actions is the most fundamental form of cooperation in multi-agent reinforcement learning (MARL). Successful decentralized decision-making often depends not only on good individual actions, but on selecting compatible actions across agents to synchronize behavior, avoid conflicts, and satisfy global constraints. In this paper, we propose Action Graph Policies (AGP), which model dependencies among agents’ available action choices. AGP constructs what we call coordination contexts, which enable agents to condition their decisions on global action dependencies. Theoretically, we show that AGPs induce a strictly more expressive joint policy than fully independent policies and can realize coordinated joint actions that are provably better than greedy execution, even of centralized value-decomposition methods. Empirically, we show that AGP achieves 80-95% success on canonical coordination tasks with partial observability and anti-coordination penalties, where other MARL methods reach only 10-25%. We further demonstrate that AGP consistently outperforms these baselines in diverse multi-agent environments.
[LG-42] Dynamic Delayed Tree Expansion For Improved Multi-Path Speculative Decoding
链接: https://arxiv.org/abs/2602.16994
作者: Rahul Thomas,Teo Kitanovski,Micah Goldblum,Arka Pal
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multi-path speculative decoding accelerates lossless sampling from a target model by using a cheaper draft model to generate a draft tree of tokens, and then applies a verification algorithm that accepts a subset of these. While prior work has proposed various verification algorithms for i.i.d. rollouts, their relative performance under matched settings remains unclear. In this work, we first present a systematic evaluation of verification strategies across model families, tasks, and sampling regimes, and find that Traversal Verification dominates consistently, with OT-based methods lagging far behind. Our analysis uncovers that this occurs because OT-based methods achieve high multi-token acceptance near the root of the draft tree, while multi-token gains are most impactful deeper in the draft tree, where draft and target distributions diverge. Based on this insight, we propose delayed tree expansion, which drafts a partial single path, delaying the i.i.d. branching point. We show that delayed tree expansion preserves the target distribution and improves on root-node i.i.d. rollouts. Further, we develop a dynamic neural selector that estimates the expected block efficiency of optimal-transport-based verification methods from draft and target features, enabling context-dependent expansion decisions. Our neural selector allows OT-based methods like SpecInfer to outperform Traversal Verification for the first time, achieving 5% higher average throughput across a wide range of models, datasets, and sampling settings.
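As background, the classical single-token verification rule that these multi-path schemes generalize can be written in a few lines: accept the drafted token with probability min(1, p/q), otherwise resample from the normalized residual max(0, p − q). Exactness of this rule (the output is distributed according to the target p) is the standard speculative-decoding guarantee.

```python
import numpy as np

rng = np.random.default_rng(0)

def verify_token(p, q, drafted):
    # p, q: target and draft distributions over the vocabulary.
    if rng.random() < min(1.0, p[drafted] / q[drafted]):
        return drafted, True
    residual = np.maximum(p - q, 0.0)  # resample where the target exceeds the draft
    residual /= residual.sum()
    return rng.choice(len(p), p=residual), False

p = np.array([0.5, 0.3, 0.2])  # target distribution
q = np.array([0.2, 0.5, 0.3])  # draft distribution
drafted = rng.choice(3, p=q)
token, accepted = verify_token(p, q, drafted)
print(f"drafted={drafted}, emitted={token}, accepted={accepted}")
```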
[LG-43] Discovering Universal Activation Directions for PII Leakage in Language Models
链接: https://arxiv.org/abs/2602.16980
作者: Leo Marchyok,Zachary Coalson,Sungho Keum,Sooel Son,Sanghyun Hong
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Pre-print
Abstract:Modern language models exhibit rich internal structure, yet little is known about how privacy-sensitive behaviors, such as personally identifiable information (PII) leakage, are represented and modulated within their hidden states. We present UniLeak, a mechanistic-interpretability framework that identifies universal activation directions: latent directions in a model’s residual stream whose linear addition at inference time consistently increases the likelihood of generating PII across prompts. These model-specific directions generalize across contexts and amplify PII generation probability, with minimal impact on generation quality. UniLeak recovers such directions without access to training data or ground-truth PII, relying only on self-generated text. Across multiple models and datasets, steering along these universal directions substantially increases PII leakage compared to existing prompt-based extraction methods. Our results offer a new perspective on PII leakage: the superposition of a latent signal in the model’s representations, enabling both risk amplification and mitigation.
[LG-44] Fail-Closed Alignment for Large Language Models
链接: https://arxiv.org/abs/2602.16977
作者: Zachary Coalson,Beth Sohler,Aiden Gabriel,Sanghyun Hong
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Pre-print
Abstract:We identify a structural weakness in current large language model (LLM) alignment: modern refusal mechanisms are fail-open. While existing approaches encode refusal behaviors across multiple latent features, suppressing a single dominant feature - via prompt-based jailbreaks - can cause alignment to collapse, leading to unsafe generation. Motivated by this, we propose fail-closed alignment as a design principle for robust LLM safety: refusal mechanisms should remain effective even under partial failures via redundant, independent causal pathways. We present a concrete instantiation of this principle: a progressive alignment framework that iteratively identifies and ablates previously learned refusal directions, forcing the model to reconstruct safety along new, independent subspaces. Across four jailbreak attacks, we achieve the strongest overall robustness while mitigating over-refusal and preserving generation quality, with small computational overhead. Our mechanistic analyses confirm that models trained with our method encode multiple, causally independent refusal directions that prompt-based jailbreaks cannot suppress simultaneously, providing empirical support for fail-closed alignment as a principled foundation for robust LLM safety.
[LG-45] Multi-Agent Lipschitz Bandits
链接: https://arxiv.org/abs/2602.16965
作者: Sourav Chakraborty,Amit Kiran Rege,Claire Monteleoni,Lijun Chen
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study the decentralized multi-player stochastic bandit problem over a continuous, Lipschitz-structured action space where hard collisions yield zero reward. Our objective is to design a communication-free policy that maximizes collective reward, with coordination costs that are independent of the time horizon T. We propose a modular protocol that first solves the multi-agent coordination problem – identifying and seating players on distinct high-value regions via a novel maxima-directed search – and then decouples the problem into N independent single-player Lipschitz bandits. We establish a near-optimal regret bound of \tilde{O}(T^{(d+1)/(d+2)}) plus a T-independent coordination cost, matching the single-player rate. To our knowledge, this is the first framework providing such guarantees, and it extends to general distance-threshold collision models.
[LG-46] Greedy Multi-Path Block Verification for Faster Decoding in Speculative Sampling
链接: https://arxiv.org/abs/2602.16961
作者: Rahul Thomas,Arka Pal
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:
Abstract:The goal of L-step speculative decoding is to accelerate autoregressive decoding of a target model by using a cheaper draft model to generate a candidate path of L tokens. Based on a verification algorithm involving target and draft model probabilities, a prefix of the candidate sequence is accepted, and an additional correction token is sampled from a residual distribution to ensure that the final output adheres to the target distribution. While standard speculative decoding uses a verification algorithm which is independent at each token on the path, a recent extension called block verification uses a joint condition involving all sampled on-path probabilities. Block verification (BV) was shown to be optimal over all verification algorithms which use only on-path probabilities, improving on standard speculative decoding. In this work, we first show that block verification is optimal even over verification algorithms that use off-path probabilities, by constructing an information-agnostic linear program (LP). Further, we can extend our LP to the setting where the draft model samples multiple candidate paths, and use it to construct a natural class of multi-path block verification generalizations. While computing the optimal algorithm in this class is not tractable, by considering a stricter class of greedy algorithms, we can formulate an efficient method called greedy multi-path block verification (GBV). Empirically, GBV can improve block efficiency by over 30% and reduce decoding walltimes by over 15% relative to BV. On Llama-3 70B, GBV can improve the end-to-end decoding throughput over SOTA multi-path verification methods by more than 15%.
[LG-47] Neural Proposals Symbolic Guarantees: Neuro-Symbolic Graph Generation with Hard Constraints
链接: https://arxiv.org/abs/2602.16954
作者: Chuqin Geng,Li Zhang,Mark Zhang,Haolin Ye,Ziyu Zhao,Xujie Si
类目: Machine Learning (cs.LG)
*备注: 18 pages, 6 figures
Abstract:We challenge purely neural black-box approaches to molecule and graph generation, which are limited in controllability and lack formal guarantees. We introduce Neuro-Symbolic Graph Generative Modeling (NSGGM), a neuro-symbolic framework that reapproaches molecule generation as a scaffold- and interaction-learning task with symbolic assembly. An autoregressive neural model proposes scaffolds and refines interaction signals, and a CPU-efficient SMT solver constructs full graphs while enforcing chemical validity, structural rules, and user-specific constraints, yielding molecules that are correct by construction and offering interpretable control that pure neural methods cannot provide. NSGGM delivers strong performance on both unconstrained and constrained generation tasks, demonstrating that neuro-symbolic modeling can match state-of-the-art generative performance while offering explicit controllability and guarantees. To evaluate more nuanced controllability, we also introduce a Logical-Constraint Molecular Benchmark, designed to test strict hard-rule satisfaction in workflows that require explicit, interpretable specifications together with verifiable compliance.
[LG-48] Exact Certification of Data-Poisoning Attacks Using Mixed-Integer Programming
链接: https://arxiv.org/abs/2602.16944
作者: Philip Sosnin,Jodie Knapp,Fraser Kennedy,Josh Collyer,Calvin Tsay
类目: Machine Learning (cs.LG)
*备注: Accepted to the 23rd International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research (CPAIOR)
Abstract:This work introduces a verification framework that provides both sound and complete guarantees for data poisoning attacks during neural network training. We formulate adversarial data manipulation, model training, and test-time evaluation in a single mixed-integer quadratically constrained programming (MIQCP) problem. Finding the global optimum of the proposed formulation provably yields worst-case poisoning attacks, while simultaneously bounding the effectiveness of all possible attacks on the given training pipeline. Our framework encodes both the gradient-based training dynamics and model evaluation at test time, enabling the first exact certification of training-time robustness. Experimental evaluation on small models confirms that our approach delivers a complete characterization of robustness against data poisoning.
[LG-49] Construction of a classification model for dementia among Brazilian adults aged 50 and over
链接: https://arxiv.org/abs/2602.16887
作者: F. S. Menezes,M. C. F. G. Barretto,E. Q. C. Garcia,T. A. E. Ferreira,J. G. Alvez
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 38 pages; 3 figures
Abstract:Objective: To build a dementia classification model for middle-aged and elderly Brazilians, implemented in Python, combining variable selection and multivariable analysis, using low-cost variables with modification potential. Methods: Observational study with a predictive modeling approach using a cross-sectional design, aimed at estimating the chances of developing dementia, using data from the Brazilian Longitudinal Study of Aging (ELSI-Brazil), involving 9,412 participants. Dementia was determined based on neuropsychological assessment and informant-based cognitive function. Analyses were performed using Random Forest (RF) and multivariable logistic regression to estimate the risk of dementia in the middle-aged and elderly populations of Brazil. Results: The prevalence of dementia was 9.6%. The highest odds of dementia were observed in illiterate individuals (Odds Ratio (OR) = 7.42), individuals aged 90 years or older (OR = 11.00), low weight (OR = 2.11), low handgrip strength (OR = 2.50), self-reported black skin color (OR = 1.47), physical inactivity (OR = 1.61), self-reported hearing loss (OR = 1.65), and presence of depressive symptoms (OR = 1.72). Higher education (OR=0.44), greater life satisfaction (OR=0.72), and being employed (OR=0.78) were protective factors. The RF model outperformed logistic regression, achieving an area under the ROC curve of 0.776, with a sensitivity of 0.708, a specificity of 0.702, an F1-score of 0.311, a G-means of 0.705, and an accuracy of 0.703. Conclusion: The findings reinforce the multidimensional nature of dementia and the importance of accessible factors for identifying vulnerable individuals. Strengthening public policies focused on promoting brain health can contribute significantly to the efficient allocation of resources in primary care and dementia prevention in Brazil.
[LG-50] ML-driven detection and reduction of ballast information in multi-modal datasets
链接: https://arxiv.org/abs/2602.16876
作者: Yaroslav Solovko
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 20 pages, 27 figures, 10 tables
Abstract:Modern datasets often contain ballast, i.e., redundant or low-utility information that increases dimensionality, storage requirements, and computational cost without contributing meaningful analytical value. This study introduces a generalized, multimodal framework for ballast detection and reduction across structured, semi-structured, unstructured, and sparse data types. Using diverse datasets, entropy, mutual information, Lasso, SHAP, PCA, topic modelling, and embedding analysis are applied to identify and eliminate ballast features. A novel Ballast Score is proposed to integrate these signals into a unified, cross-modal pruning strategy. Experimental results demonstrate that significant portions of the feature space, often exceeding 70% in sparse or semi-structured data, can be pruned with minimal or even improved classification performance, along with substantial reductions in training time and memory footprint. The framework reveals distinct ballast typologies (e.g. statistical, semantic, infrastructural), and offers practical guidance for leaner, more efficient machine learning pipelines.
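One ingredient of such a pipeline, mutual-information-based pruning of low-utility features on tabular data, can be sketched as follows; the combined Ballast Score over entropy, Lasso, SHAP, and the other signals is the paper's contribution and is not reproduced, and the 50% pruning threshold is purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic tabular data: many features, few of them informative.
X, y = make_classification(n_samples=2000, n_features=50,
                           n_informative=8, n_redundant=10,
                           random_state=0)

mi = mutual_info_classif(X, y, random_state=0)
keep = mi > np.quantile(mi, 0.5)   # prune the bottom half (illustrative)
X_pruned = X[:, keep]
print(f"kept {keep.sum()} of {len(mi)} features")
```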
[LG-51] On the Mechanism and Dynamics of Modular Addition: Fourier Features Lottery Ticket and Grokking
链接: https://arxiv.org/abs/2602.16849
作者: Jianliang He,Leda Wang,Siyu Chen,Zhuoran Yang
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
Abstract:We present a comprehensive analysis of how two-layer neural networks learn features to solve the modular addition task. Our work provides a full mechanistic interpretation of the learned model and a theoretical explanation of its training dynamics. While prior work has identified that individual neurons learn single-frequency Fourier features and phase alignment, it does not fully explain how these features combine into a global solution. We bridge this gap by formalizing a diversification condition that emerges during training when overparametrized, consisting of two parts: phase symmetry and frequency diversification. We prove that these properties allow the network to collectively approximate a flawed indicator function on the correct logic for the modular addition task. While individual neurons produce noisy signals, the phase symmetry enables a majority-voting scheme that cancels out noise, allowing the network to robustly identify the correct sum. Furthermore, we explain the emergence of these features under random initialization via a lottery ticket mechanism. Our gradient flow analysis proves that frequencies compete within each neuron, with the “winner” determined by its initial spectral magnitude and phase alignment. From a technical standpoint, we provide a rigorous characterization of the layer-wise phase coupling dynamics and formalize the competitive landscape using the ODE comparison lemma. Finally, we use these insights to demystify grokking, characterizing it as a three-stage process involving memorization followed by two generalization phases, driven by the competition between loss minimization and weight decay.
[LG-52] What is the Value of Censored Data? An Exact Analysis for the Data-driven Newsvendor
链接: https://arxiv.org/abs/2602.16842
作者: Rachitesh Kumar,Omar Mouchtaki
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study the offline data-driven newsvendor problem with censored demand data. In contrast to prior works where demand is fully observed, we consider the setting where demand is censored at the inventory level and only sales are observed; sales match demand when there is sufficient inventory, and equal the available inventory otherwise. We provide a general procedure to compute the exact worst-case regret of classical data-driven inventory policies, evaluated over all demand distributions. Our main technical result shows that this infinite-dimensional, non-convex optimization problem can be reduced to a finite-dimensional one, enabling an exact characterization of the performance of policies for any sample size and censoring levels. We leverage this reduction to derive sharp insights on the achievable performance of standard inventory policies under demand censoring. In particular, our analysis of the Kaplan-Meier policy shows that while demand censoring fundamentally limits what can be learned from passive sales data, just a small amount of targeted exploration at high inventory levels can substantially improve worst-case guarantees, enabling near-optimal performance even under heavy censoring. In contrast, when the point-of-sale system does not record stockout events and only reports realized sales, a natural and commonly used approach is to treat sales as demand. Our results show that policies based on this sales-as-demand heuristic can suffer severe performance degradation as censored data accumulates, highlighting how the quality of point-of-sale information critically shapes what can, and cannot, be learned offline.
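The contrast the abstract draws can be seen on a toy instance: the sales-as-demand heuristic takes an empirical quantile of raw (censored) sales at the critical ratio, while a Kaplan-Meier policy uses stockout indicators to correct for censoring. The sketch below is illustrative only (simplified tie handling, made-up distributions) and is not the paper's exact analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
cu, co = 4.0, 1.0                       # underage / overage costs
crit = cu / (cu + co)                   # critical ratio -> target demand quantile

demand = rng.poisson(20, size=200)      # true demand (unobserved)
stock = rng.integers(18, 30, size=200)  # historical inventory levels
sales = np.minimum(demand, stock)       # observed sales, censored at inventory
censored = demand >= stock              # stockout indicator

# (i) Sales-as-demand heuristic: quantile of raw sales.
q_naive = np.quantile(sales, crit)

# (ii) Kaplan-Meier product-limit estimate of the demand distribution.
def km_quantile(sales, censored, level):
    order = np.argsort(sales)
    s, c = sales[order], censored[order]
    surv, n = 1.0, len(s)
    for i, (val, cen) in enumerate(zip(s, c)):
        if not cen:                     # exact (uncensored) demand observation
            surv *= 1.0 - 1.0 / (n - i)
        if 1.0 - surv >= level:
            return val
    return s[-1]                        # level not reached within observed range

q_km = km_quantile(sales, censored, crit)
print(f"naive order-up-to: {q_naive:.0f}, Kaplan-Meier: {q_km:.0f}")
```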
[LG-53] A Residual-Aware Theory of Position Bias in Transformers
链接: https://arxiv.org/abs/2602.16837
作者: Hanna Herasimchyk,Robin Labryga,Tomislav Prusina,Sören Laue
类目: Machine Learning (cs.LG)
*备注:
Abstract:Transformer models systematically favor certain token positions, yet the architectural origins of this position bias remain poorly understood. Under causal masking at infinite depth, prior theoretical analyses of attention rollout predict an inevitable collapse of attention onto the first token. Such collapse, however, does not occur in practice. We resolve this discrepancy with a residual-aware theory of cumulative attention rollout. By incorporating residual connections, we show that this architectural component prevents collapse under realistic conditions. At finite depth, we prove that causal Transformers induce a U-shaped position bias, with attention concentrating on early and late tokens. This result provides a principled architectural explanation for the Lost-in-the-Middle phenomenon.
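The quantity analyzed above can be reproduced in a toy setting: each layer's effective mixing matrix is taken as 0.5·A_l + 0.5·I (attention plus residual), and rollout is the cumulative product across layers. Random causal attention below is a stand-in for trained weights; it only illustrates how the residual term keeps the rollout from collapsing entirely onto the first token.

```python
import numpy as np

rng = np.random.default_rng(0)
T, L = 12, 8                                  # sequence length, depth

def random_causal_attention(T):
    logits = rng.standard_normal((T, T))
    logits[np.triu_indices(T, 1)] = -np.inf  # causal mask: no attention to the future
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

rollout = np.eye(T)
for _ in range(L):
    A = random_causal_attention(T)
    rollout = (0.5 * A + 0.5 * np.eye(T)) @ rollout  # residual-aware layer mixing

# Contribution of each source position to the last token after L layers.
print(np.round(rollout[-1], 3))
```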
[LG-54] NeST: Neuron Selective Tuning for LLM Safety
链接: https://arxiv.org/abs/2602.16835
作者: Sasha Behrouzi,Lichao Wu,Mohamadreza Rostami,Ahmad-Reza Sadeghi
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Safety alignment is essential for the responsible deployment of large language models (LLMs). Yet, existing approaches often rely on heavyweight fine-tuning that is costly to update, audit, and maintain across model families. Full fine-tuning incurs substantial computational and storage overhead, while parameter-efficient methods such as LoRA trade efficiency for inconsistent safety gains and sensitivity to design choices. Safety intervention mechanisms such as circuit breakers reduce unsafe outputs without modifying model weights, but do not directly shape or preserve the internal representations that govern safety behavior. These limitations hinder rapid and reliable safety updates, particularly in settings where models evolve frequently or must adapt to new policies and domains. We present NeST, a lightweight, structure-aware safety alignment framework that strengthens refusal behavior by selectively adapting a small subset of safety-relevant neurons while freezing the remainder of the model. NeST aligns parameter updates with the internal organization of safety behavior by clustering functionally coherent safety neurons and enforcing shared updates within each cluster, enabling targeted and stable safety adaptation without broad model modification or inference-time overhead. We benchmark NeST against three dominant baselines: full fine-tuning, LoRA-based fine-tuning, and circuit breakers across 10 open-weight LLMs spanning multiple model families and sizes. Across all evaluated models, NeST reduces the attack success rate from an average of 44.5% to 4.36%, corresponding to a 90.2% reduction in unsafe generations, while requiring only 0.44 million trainable parameters on average. This amounts to a 17,310x decrease in updated parameters compared to full fine-tuning and a 9.25x reduction relative to LoRA, while consistently achieving stronger safety performance for alignment.
[LG-55] Formal Mechanistic Interpretability: Automated Circuit Discovery with Provable Guarantees ICLR2026
链接: https://arxiv.org/abs/2602.16823
作者: Itamar Hadad,Guy Katz,Shahaf Bassan
类目: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注: To appear in ICLR 2026
Abstract:Automated circuit discovery is a central tool in mechanistic interpretability for identifying the internal components of neural networks responsible for specific behaviors. While prior methods have made significant progress, they typically depend on heuristics or approximations and do not offer provable guarantees over continuous input domains for the resulting circuits. In this work, we leverage recent advances in neural network verification to propose a suite of automated algorithms that yield circuits with provable guarantees. We focus on three types of guarantees: (1) input domain robustness, ensuring the circuit agrees with the model across a continuous input region; (2) robust patching, certifying circuit alignment under continuous patching perturbations; and (3) minimality, formalizing and capturing a wide array of various notions of succinctness. Interestingly, we uncover a diverse set of novel theoretical connections among these three families of guarantees, with critical implications for the convergence of our algorithms. Finally, we conduct experiments with state-of-the-art verifiers on various vision models, showing that our algorithms yield circuits with substantially stronger robustness guarantees than standard circuit discovery methods, establishing a principled foundation for provable circuit discovery.
[LG-56] TopoFlow: Physics-guided Neural Networks for high-resolution air quality prediction
链接: https://arxiv.org/abs/2602.16821
作者: Ammar Kheder,Helmi Toropainen,Wenqing Peng,Samuel Antão,Jia Chen,Zhi-Song Liu,Michael Boy
类目: Machine Learning (cs.LG)
*备注:
Abstract:We propose TopoFlow (Topography-aware pollutant Flow learning), a physics-guided neural network for efficient, high-resolution air quality prediction. To explicitly embed physical processes into the learning framework, we identify two critical factors governing pollutant dynamics: topography and wind direction. Complex terrain can channel, block, and trap pollutants, while wind acts as a primary driver of their transport and dispersion. Building on these insights, TopoFlow leverages a vision transformer architecture with two novel mechanisms: topography-aware attention, which explicitly models terrain-induced flow patterns, and wind-guided patch reordering, which aligns spatial representations with prevailing wind directions. Trained on six years of high-resolution reanalysis data assimilating observations from over 1,400 surface monitoring stations across China, TopoFlow achieves a PM2.5 RMSE of 9.71 ug/m3, representing a 71-80% improvement over operational forecasting systems and a 13% improvement over state-of-the-art AI baselines. Forecast errors remain well below China’s 24-hour air quality threshold of 75 ug/m3 (GB 3095-2012), enabling reliable discrimination between clean and polluted conditions. These performance gains are consistent across all four major pollutants and forecast lead times from 12 to 96 hours, demonstrating that principled integration of physical knowledge into neural networks can fundamentally advance air quality prediction.
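The wind-guided patch reordering can be read as a sort of patch centers by their projection onto the prevailing wind vector, so that a sequence model sees upwind patches before downwind ones. A minimal sketch under that assumption; the released architecture may implement the reordering differently.

```python
import numpy as np

def wind_guided_order(patch_xy: np.ndarray, wind_uv: np.ndarray) -> np.ndarray:
    """Return patch indices sorted upwind-to-downwind: project each patch
    center onto the (unit) prevailing wind vector and argsort."""
    unit = wind_uv / np.linalg.norm(wind_uv)
    return np.argsort(patch_xy @ unit)

# Illustrative 4x4 patch grid with wind blowing toward the northeast.
xs, ys = np.meshgrid(np.arange(4), np.arange(4))
patches = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
order = wind_guided_order(patches, np.array([1.0, 1.0]))
print(patches[order][:4])  # southwest (upwind) patches come first
```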
[LG-57] Efficient Tail-Aware Generative Optimization via Flow Model Fine-Tuning
链接: https://arxiv.org/abs/2602.16796
作者: Zifan Wang,Riccardo De Santi,Xiaoyu Mo,Michael M. Zavlanos,Andreas Krause,Karl H. Johansson
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 33 pages
Abstract:Fine-tuning pre-trained diffusion and flow models to optimize downstream utilities is central to real-world deployment. Existing entropy-regularized methods primarily maximize expected reward, providing no mechanism to shape tail behavior. However, tail control is often essential: the lower tail determines reliability by limiting low-reward failures, while the upper tail enables discovery by prioritizing rare, high-reward outcomes. In this work, we present Tail-aware Flow Fine-Tuning (TFFT), a principled and efficient distributional fine-tuning algorithm based on the Conditional Value-at-Risk (CVaR). We address two distinct tail-shaping goals: right-CVaR for seeking novel samples in the high-reward tail and left-CVaR for controlling worst-case samples in the low-reward tail. Unlike prior approaches that rely on non-linear optimization, we leverage the variational dual formulation of CVaR to decompose it into a decoupled two-stage procedure: a lightweight one-dimensional threshold optimization step, and a single entropy-regularized fine-tuning process via a specific pseudo-reward. This decomposition achieves CVaR fine-tuning efficiently with computational cost comparable to standard expected fine-tuning methods. We demonstrate the effectiveness of TFFT across illustrative experiments, high-dimensional text-to-image generation, and molecular design.
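The two-stage decoupling rests on the Rockafellar-Uryasev variational form of CVaR; for the lower tail, CVaR_α(R) = max_t { t − E[(t − R)_+]/α }, whose inner maximizer is the α-quantile (VaR). The sketch below runs the one-dimensional threshold step on samples and forms the induced pseudo-reward; it illustrates the dual form only, not the fine-tuning loop, and the distribution is an arbitrary stand-in.

```python
import numpy as np

def left_cvar_threshold(rewards: np.ndarray, alpha: float):
    """Stage 1: solve the 1-D problem max_t  t - E[(t - R)_+] / alpha.
    For an empirical distribution the optimum is the alpha-quantile (VaR)."""
    t = np.quantile(rewards, alpha)
    cvar = t - np.mean(np.maximum(t - rewards, 0.0)) / alpha
    return t, cvar

def pseudo_reward(rewards: np.ndarray, t: float, alpha: float) -> np.ndarray:
    """Stage 2 target: the dual objective's integrand, t - (t - R)_+ / alpha.
    Fine-tuning against this pseudo-reward focuses mass on the lower tail."""
    return t - np.maximum(t - rewards, 0.0) / alpha

rng = np.random.default_rng(2)
R = rng.normal(0.0, 1.0, size=100_000)
t, cvar = left_cvar_threshold(R, alpha=0.05)
print(f"VaR_0.05 ~ {t:.3f}, CVaR_0.05 ~ {cvar:.3f}")  # approx -1.645 and -2.063
```

The right-tail (novelty-seeking) variant follows by the symmetric dual on −R.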
[LG-58] Escaping the Cognitive Well: Efficient Competition Math with Off-the-Shelf Models
链接: https://arxiv.org/abs/2602.16793
作者: Xingyu Dang,Rohit Agarwal,Rodrigo Porto,Anirudh Goyal,Liam H Fowl,Sanjeev Arora
类目: Machine Learning (cs.LG)
*备注:
Abstract:In the past year, custom and unreleased math reasoning models reached gold medal performance on the International Mathematical Olympiad (IMO). Similar performance was then reported using large-scale inference on publicly available models but at prohibitive costs (e.g., 3000 USD per problem). In this work, we present an inference pipeline that attains best-in-class performance on IMO-style math problems at an average inference cost orders of magnitude below competing methods while using only general-purpose off-the-shelf models. Our method relies on insights about grader failure in solver-grader pipelines, which we call the Cognitive Well (iterative refinement converging to a wrong solution that the solver as well as the pipeline’s internal grader consider to be basically correct). Our pipeline addresses these failure modes through conjecture extraction, wherein candidate lemmas are isolated from generated solutions and independently verified alongside their negations in a fresh environment (context detachment). On IMO-ProofBench Advanced (PB-Adv), our pipeline achieves 67.1 percent performance using Gemini 3.0 Pro with an average cost per question of approximately 31 USD. At the time of evaluation, this represented the state-of-the-art on PB-Adv among both public and unreleased models, and more than doubles the success rate of the next best publicly accessible pipeline, all at a fraction of the cost.
[LG-59] Machine Learning Argument of Latitude Error Model for LEO Satellite Orbit and Covariance Correction
链接: https://arxiv.org/abs/2602.16764
作者: Alex Moody,Penina Axelrad,Rebecca Russell
类目: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
*备注: Appearing in 2026 IEEE Aerospace Conference
Abstract:Low Earth orbit (LEO) satellites are leveraged to support new position, navigation, and timing (PNT) service alternatives to GNSS. These alternatives require accurate propagation of satellite position and velocity with a realistic quantification of uncertainty. It is commonly assumed that the propagated uncertainty distribution is Gaussian; however, the validity of this assumption can be quickly compromised by the mismodeling of atmospheric drag. We develop a machine learning approach that corrects error growth in the argument of latitude for a diverse set of LEO satellites. The improved orbit propagation accuracy extends the applicability of the Gaussian assumption and modeling of the errors with a corrected mean and covariance. We compare the performance of a time-conditioned neural network and a Gaussian Process on datasets computed with an open source orbit propagator and publicly available Vector Covariance Message (VCM) ephemerides. The learned models predict the argument of latitude error as a Gaussian distribution given parameters from a single VCM epoch and reverse propagation errors. We show that this one-dimensional model captures the effect of mismodeled drag, which can be mapped to the Cartesian state space. The correction method only updates information along the dimensions of dominant error growth, while maintaining the physics-based propagation of VCM covariance in the remaining dimensions. We therefore extend the utility of VCM ephemerides to longer time horizons without modifying the functionality of the existing propagator.
[LG-60] Real-time Secondary Crash Likelihood Prediction Excluding Post Primary Crash Features
链接: https://arxiv.org/abs/2602.16739
作者: Lei Han,Mohamed Abdel-Aty,Zubayer Islam,Chenzhu Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Secondary crash likelihood prediction is a critical component of an active traffic management system to mitigate congestion and adverse impacts caused by secondary crashes. However, existing approaches mainly rely on post-crash features (e.g., crash type and severity) that are rarely available in real time, limiting their practical applicability. To address this limitation, we propose a hybrid secondary crash likelihood prediction framework that does not depend on post-crash features. A dynamic spatiotemporal window is designed to extract real-time traffic flow and environmental features from primary crash locations and their upstream segments. The framework includes three models: a primary crash model to estimate the likelihood of secondary crash occurrence, and two secondary crash models to evaluate traffic conditions at crash and upstream segments under different comparative scenarios. An ensemble learning strategy integrating six machine learning algorithms is developed to enhance predictive performance, and a voting-based mechanism combines the outputs of the three models. Experiments on Florida freeways demonstrate that the proposed hybrid framework correctly identifies 91% of secondary crashes with a low false alarm rate of 0.20. The Area Under the ROC Curve improves from 0.654, 0.744, and 0.902 for the individual models to 0.952 for the hybrid model, outperforming previous studies.
[LG-61] A Few-Shot LLM Framework for Extreme Day Classification in Electricity Markets
链接: https://arxiv.org/abs/2602.16735
作者: Saud Alghumayjan,Ming Yi,Bolun Xu
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:This paper proposes a few-shot classification framework based on Large Language Models (LLMs) to predict whether the next day will have spikes in real-time electricity prices. The approach aggregates system state information, including electricity demand, renewable generation, weather forecasts, and recent electricity prices, into a set of statistical features that are formatted as natural-language prompts and fed to an LLM along with general instructions. The model then determines the likelihood that the next day would be a spike day and reports a confidence score. Using historical data from the Texas electricity market, we demonstrate that this few-shot approach achieves performance comparable to supervised machine learning models, such as Support Vector Machines and XGBoost, and outperforms the latter two when limited historical data are available. These findings highlight the potential of LLMs as a data-efficient tool for classifying electricity price spikes in settings with scarce data.
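The few-shot prompting step, aggregating system-state statistics into a natural-language prompt with labeled example days, can be sketched as plain string templating. All field names and phrasings below are invented placeholders, not the paper's schema.

```python
from statistics import mean

def build_prompt(day: dict, examples: list[dict]) -> str:
    """Format system-state statistics as a few-shot natural-language prompt.
    Every field name here is illustrative."""
    def describe(d):
        return (f"peak load {d['load_peak_mw']:.0f} MW, "
                f"mean wind {d['wind_mean_mw']:.0f} MW, "
                f"max temperature {d['temp_max_f']:.0f} F, "
                f"mean price last 24h {mean(d['prices_24h']):.1f} $/MWh")
    shots = "\n".join(
        f"Conditions: {describe(e)} -> spike day: {e['label']}" for e in examples
    )
    return (
        "You classify whether the next day will see real-time price spikes.\n"
        f"{shots}\n"
        f"Conditions: {describe(day)} -> spike day:"
    )

example = {"load_peak_mw": 74000, "wind_mean_mw": 9000,
           "temp_max_f": 103, "prices_24h": [42.0, 55.0, 61.0], "label": "yes"}
query = {"load_peak_mw": 69000, "wind_mean_mw": 14000,
         "temp_max_f": 96, "prices_24h": [35.0, 38.0, 40.0]}
print(build_prompt(query, [example]))
```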
[LG-62] MMCAformer: Macro-Micro Cross-Attention Transformer for Traffic Speed Prediction with Microscopic Connected Vehicle Driving Behavior
链接: https://arxiv.org/abs/2602.16730
作者: Lei Han,Mohamed Abdel-Aty,Younggun Kim,Yang-Jun Joo,Zubayer Islam
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate speed prediction is crucial for proactive traffic management to enhance traffic efficiency and safety. Existing studies have primarily relied on aggregated, macroscopic traffic flow data to predict future traffic trends, whereas road traffic dynamics are also influenced by individual, microscopic human driving behaviors. Recent Connected Vehicle (CV) data provide rich driving behavior features, offering new opportunities to incorporate these behavioral insights into speed prediction. To this end, we propose the Macro-Micro Cross-Attention Transformer (MMCAformer) to integrate CV data-based micro driving behavior features with macro traffic features for speed prediction. Specifically, MMCAformer employs self-attention to learn intrinsic dependencies in macro traffic flow and cross-attention to capture spatiotemporal interplays between macro traffic status and micro driving behavior. MMCAformer is optimized with a Student-t negative log-likelihood loss to provide point-wise speed prediction and estimate uncertainty. Experiments on four Florida freeways demonstrate the superior performance of the proposed MMCAformer compared to baselines. Compared with only using macro features, introducing micro driving behavior features not only enhances prediction accuracy (e.g., overall RMSE, MAE, and MAPE reduced by 9.0%, 6.9%, and 10.2%, respectively) but also shrinks model prediction uncertainty (e.g., mean predictive intervals decreased by 10.1-24.0% across the four freeways). Results reveal that hard braking and acceleration frequencies emerge as the most influential features. Such improvements are more pronounced under congested, low-speed traffic conditions.
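The Student-t negative log-likelihood objective lets the network output both a point forecast (location) and an uncertainty (scale) per segment. A minimal sketch of that loss in PyTorch, assuming the prediction head emits location, scale, and degrees of freedom; the numbers are illustrative and this is not the authors' code.

```python
import torch

def student_t_nll(df, loc, scale, target):
    """Negative log-likelihood under a Student-t predictive distribution;
    minimizing it trains point forecasts and uncertainty jointly."""
    dist = torch.distributions.StudentT(df=df, loc=loc, scale=scale)
    return -dist.log_prob(target).mean()

# Illustrative batch: predicted speeds (mph) for 4 road segments.
loc = torch.tensor([55.0, 42.0, 63.0, 30.0], requires_grad=True)
scale = torch.tensor([3.0, 6.0, 2.5, 8.0], requires_grad=True)
df = torch.full((4,), 5.0)  # heavier tails than Gaussian for congestion shocks
target = torch.tensor([53.0, 35.0, 64.0, 25.0])

loss = student_t_nll(df, loc, scale, target)
loss.backward()  # gradients flow to both the mean and the uncertainty head
print(float(loss))
```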
[LG-63] Speech to Speech Synthesis for Voice Impersonation DATE
链接: https://arxiv.org/abs/2602.16721
作者: Bjorn Johnson,Jared Levy
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Original work completed in April 2020. This version includes minor formatting updates
Abstract:Numerous models have shown great success in the fields of speech recognition as well as speech synthesis, but models for speech to speech processing have not been heavily explored. We propose Speech to Speech Synthesis Network (STSSN), a model based on current state of the art systems that fuses the two disciplines in order to perform effective speech to speech style transfer for the purpose of voice impersonation. We show that our proposed model is quite powerful, and succeeds in generating realistic audio samples despite a number of drawbacks in its capacity. We benchmark our proposed model by comparing it with a generative adversarial model which accomplishes a similar task, and show that ours produces more convincing results.
[LG-64] DARTH-PUM: A Hybrid Processing-Using-Memory Architecture ASPLOS
链接: https://arxiv.org/abs/2602.16075
作者: Ryan Wong,Ben Feinberg,Saugata Ghose
类目: Hardware Architecture (cs.AR); Cryptography and Security (cs.CR); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: To appear in the ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) 2026
Abstract:Analog processing-using-memory (PUM; a.k.a. in-memory computing) makes use of electrical interactions inside memory arrays to perform bulk matrix-vector multiplication (MVM) operations. However, many popular matrix-based kernels need to execute non-MVM operations, which analog PUM cannot directly perform. To retain its energy efficiency, analog PUM architectures augment memory arrays with CMOS-based domain-specific fixed-function hardware to provide complete kernel functionality, but the difficulty of integrating such specialized CMOS logic with memory arrays has largely limited analog PUM to being an accelerator for machine learning inference, or for closely related kernels. An opportunity exists to harness analog PUM for general-purpose computation: recent works have shown that memory arrays can also perform Boolean PUM operations, albeit with very different supporting hardware and electrical signals than analog PUM. We propose DARTH-PUM, a general-purpose hybrid PUM architecture that tackles key hardware and software challenges to integrating analog PUM and digital PUM. We propose optimized peripheral circuitry, coordinating hardware to manage and interface between both types of PUM, an easy-to-use programming interface, and low-cost support for flexible data widths. These design elements allow us to build a practical PUM architecture that can execute kernels fully in memory, and can scale easily to cater to domains ranging from embedded applications to large-scale data-driven computing. We show how three popular applications (AES encryption, convolutional neural networks, large-language models) can map to and benefit from DARTH-PUM, with speedups of 59.4x, 14.8x, and 40.8x over an analog+CPU baseline.
[LG-65] Efficient Remote Prefix Fetching with GPU-native Media ASICs
链接: https://arxiv.org/abs/2602.09725
作者: Liang Mi,Weijun Wang,Jinghan Chen,Ting Cao,Haipeng Dai,Yunxin Liu
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:Remote KV cache reuse fetches the KV cache for identical contexts from remote storage, avoiding recomputation and accelerating LLM inference. While it excels in high-speed networks, its performance degrades significantly in bandwidth-limited scenarios. Recent studies address this by transmitting KV caches in compressed form, but the associated heavyweight decompression counteracts the KV reuse benefits. In this paper, we propose an efficient and widely deployable remote KV cache reuse solution that leverages GPU-native video codecs. Our system, KVFetcher, enables effective KV cache coding with two techniques. The codec-friendly tensor layout compresses the KV cache in a highly compact video format, enabling fast transmission. The efficient KV fetcher orchestrates the transmission, decoding, and restoration of compressed KV caches in an efficient pipelined manner, eliminating resource contention, masking network fluctuations, and achieving minimum time-to-first-token (TTFT). We prototype KVFetcher on diverse GPUs from high- to low-end. Experiments reveal that it reduces TTFT by up to 3.51 times while maintaining lossless accuracy, compared to SOTA methods.
[LG-66] Asymptotically Optimal Sequential Testing with Markovian Data
链接: https://arxiv.org/abs/2602.17587
作者: Alhad Sethi,Kavali Sofia Sagar,Shubhada Agrawal,Debabrota Basu,P. N. Karthik
类目: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We study one-sided and \alpha -correct sequential hypothesis testing for data generated by an ergodic Markov chain. The null hypothesis is that the unknown transition matrix belongs to a prescribed set P of stochastic matrices, and the alternative corresponds to a disjoint set Q . We establish a tight non-asymptotic instance-dependent lower bound on the expected stopping time of any valid sequential test under the alternative. Our novel analysis improves the existing lower bounds, which are either asymptotic or provably sub-optimal in this setting. Our lower bound incorporates both the stationary distribution and the transition structure induced by the unknown Markov chain. We further propose an optimal test whose expected stopping time matches this lower bound asymptotically as \alpha \to 0 . We illustrate the usefulness of our framework through applications to sequential detection of model misspecification in Markov Chain Monte Carlo and to testing structural properties, such as the linearity of transition dynamics, in Markov decision processes. Our findings yield a sharp and general characterization of optimal sequential testing procedures under Markovian dependence.
[LG-67] Optimal Unconstrained Self-Distillation in Ridge Regression: Strict Improvements, Precise Asymptotics, and One-Shot Tuning
链接: https://arxiv.org/abs/2602.17565
作者: Hien Dang,Pratik Patil,Alessandro Rinaldo
类目: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 78 pages, 25 figures
Abstract:Self-distillation (SD) is the process of retraining a student on a mixture of ground-truth labels and the teacher’s own predictions using the same architecture and training data. Although SD has been empirically shown to often improve generalization, its formal guarantees remain limited. We study SD for ridge regression in an unconstrained setting in which the mixing weight \xi may be outside the unit interval. Conditioned on the training data and without any distributional assumptions, we prove that for any squared prediction risk (including out-of-distribution), the optimally mixed student strictly improves upon the ridge teacher for every regularization level \lambda > 0 at which the teacher ridge risk R(\lambda) is nonstationary (i.e., R'(\lambda) \neq 0). We obtain a closed-form expression for the optimal mixing weight \xi^\star(\lambda) for any value of \lambda and show that it obeys the sign rule: \operatorname{sign}(\xi^\star(\lambda)) = -\operatorname{sign}(R'(\lambda)). In particular, \xi^\star(\lambda) can be negative, which is the case in over-regularized regimes. To quantify the risk improvement due to SD, we derive exact deterministic equivalents for the optimal SD risk in the proportional asymptotics regime (where the sample and feature sizes n and p both diverge but their aspect ratio p/n converges) under general anisotropic covariance and deterministic signals. Our asymptotic analysis extends standard second-order ridge deterministic equivalents to their fourth-order analogs using block linearization, which may be of independent interest. From a practical standpoint, we propose a consistent one-shot tuning method to estimate \xi^\star without grid search, sample splitting, or refitting. Experiments on real-world datasets and pretrained neural network features support our theory and the one-shot tuning method.
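Because ridge is linear in its targets, a student refit on mixed labels (1 − ξ)y + ξŷ_teacher traces a simple path in ξ, and the effect of unconstrained mixing can be checked numerically. The sketch below deliberately over-regularizes so that the risk-minimizing ξ goes negative, consistent with the sign rule; it scans ξ on held-out data for clarity, whereas the paper's one-shot tuning avoids any such search. All problem sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 40
lam = 50.0  # well above the MSE-optimal level here, so the teacher is over-shrunk
X = rng.normal(size=(n, p))
beta = rng.normal(size=p) / np.sqrt(p)
y = X @ beta + rng.normal(scale=0.5, size=n)

def ridge(targets):
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ targets)

y_teacher = X @ ridge(y)

X_test = rng.normal(size=(5000, p))
y_test = X_test @ beta + rng.normal(scale=0.5, size=5000)

def student_risk(xi):
    """Student refit on mixed labels (1 - xi) * y + xi * teacher predictions."""
    b = ridge((1 - xi) * y + xi * y_teacher)
    return np.mean((y_test - X_test @ b) ** 2)

grid = np.linspace(-2.0, 2.0, 161)  # unconstrained: xi may leave [0, 1]
risks = [student_risk(xi) for xi in grid]
xi_best = grid[int(np.argmin(risks))]
print(f"teacher risk {student_risk(0.0):.4f}  "
      f"student risk {min(risks):.4f} at xi = {xi_best:.2f}")  # expect xi < 0
```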
[LG-68] genriesz: A Python Package for Automatic Debiased Machine Learning with Generalized Riesz Regression
链接: https://arxiv.org/abs/2602.17543
作者: Masahiro Kato
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:
Abstract:Efficient estimation of causal and structural parameters can be automated using the Riesz representation theorem and debiased machine learning (DML). We present genriesz, an open-source Python package that implements automatic DML and generalized Riesz regression, a unified framework for estimating Riesz representers by minimizing empirical Bregman divergences. This framework includes covariate balancing, nearest-neighbor matching, calibrated estimation, and density ratio estimation as special cases. A key design principle of the package is automatic regressor balancing (ARB): given a Bregman generator g and a representer model class, genriesz automatically constructs a compatible link function so that the generalized Riesz regression estimator satisfies balancing (moment-matching) optimality conditions in a user-chosen basis. The package provides a modular interface for specifying (i) the target linear functional via a black-box evaluation oracle, (ii) the representer model via basis functions (polynomial, RKHS approximations, random forest leaf encodings, neural embeddings, and a nearest-neighbor catchment basis), and (iii) the Bregman generator, with optional user-supplied derivatives. It returns regression adjustment (RA), Riesz weighting (RW), augmented Riesz weighting (ARW), and TMLE-style estimators with cross-fitting, confidence intervals, and p-values. We highlight representative workflows for estimation problems such as the average treatment effect (ATE), ATE on treated (ATT), and average marginal effect estimation. The Python package is available at this https URL and on PyPI.
[LG-69] Quantum Scrambling Born Machine
链接: https://arxiv.org/abs/2602.17281
作者: Marcin Płodzień
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
Abstract:Quantum generative modeling, where the Born rule naturally defines probability distributions through measurement of parameterized quantum states, is a promising near-term application of quantum computing. We propose a Quantum Scrambling Born Machine in which a fixed entangling unitary – acting as a scrambling reservoir – provides multi-qubit entanglement, while only single-qubit rotations are optimized. We consider three entangling unitaries – a Haar random unitary and two physically realizable approximations, a finite-depth brickwork random circuit and analog time evolution under nearest-neighbor spin-chain Hamiltonians – and show that, for the benchmark distributions and system sizes considered, once the entangler produces near-Haar-typical entanglement the model learns the target distribution with weak sensitivity to the scrambler’s microscopic origin. Finally, promoting the Hamiltonian couplings to trainable parameters casts the generative task as a variational Hamiltonian problem, with performance competitive with representative classical generative models at matched parameter count.
[LG-70] MGD: Moment Guided Diffusion for Maximum Entropy Generation
链接: https://arxiv.org/abs/2602.17211
作者: Etienne Lempereur,Nathanaël Cuvelle–Magar,Florentin Coeurdoux,Stéphane Mallat,Eric Vanden-Eijnden
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Generating samples from limited information is a fundamental problem across scientific domains. Classical maximum entropy methods provide principled uncertainty quantification from moment constraints but require sampling via MCMC or Langevin dynamics, which typically exhibit exponential slowdown in high dimensions. In contrast, generative models based on diffusion and flow matching efficiently transport noise to data but offer limited theoretical guarantees and can overfit when data is scarce. We introduce Moment Guided Diffusion (MGD), which combines elements of both approaches. Building on the stochastic interpolant framework, MGD samples maximum entropy distributions by solving a stochastic differential equation that guides moments toward prescribed values in finite time, thereby avoiding slow mixing in equilibrium-based methods. We formally obtain, in the large-volatility limit, convergence of MGD to the maximum entropy distribution and derive a tractable estimator of the resulting entropy computed directly from the dynamics. Applications to financial time series, turbulent flows, and cosmological fields using wavelet scattering moments yield estimates of negentropy for high-dimensional multiscale processes.
[LG-71] Anti-causal domain generalization: Leveraging unlabeled data
链接: https://arxiv.org/abs/2602.17187
作者: Sorawit Saengkyongam,Juan L. Gamella,Andrew C. Miller,Jonas Peters,Nicolai Meinshausen,Christina Heinze-Deml
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:The problem of domain generalization concerns learning predictive models that are robust to distribution shifts when deployed in new, previously unseen environments. Existing methods typically require labeled data from multiple training environments, limiting their applicability when labeled data are scarce. In this work, we study domain generalization in an anti-causal setting, where the outcome causes the observed covariates. Under this structure, environment perturbations that affect the covariates do not propagate to the outcome, which motivates regularizing the model’s sensitivity to these perturbations. Crucially, estimating these perturbation directions does not require labels, enabling us to leverage unlabeled data from multiple environments. We propose two methods that penalize the model’s sensitivity to variations in the mean and covariance of the covariates across environments, respectively, and prove that these methods have worst-case optimality guarantees under certain classes of environments. Finally, we demonstrate the empirical performance of our approach on a controlled physical system and a physiological signal dataset.
[LG-72] Semi-Supervised Learning on Graphs using Graph Neural Networks
链接: https://arxiv.org/abs/2602.17115
作者: Juntong Chen,Claire Donnat,Olga Klopp,Johannes Schmidt-Hieber
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 57 pages, 7 figures
Abstract:Graph neural networks (GNNs) work remarkably well in semi-supervised node regression, yet a rigorous theory explaining when and why they succeed remains lacking. To address this gap, we study an aggregate-and-readout model that encompasses several common message passing architectures: node features are first propagated over the graph then mapped to responses via a nonlinear function. For least-squares estimation over GNNs with linear graph convolutions and a deep ReLU readout, we prove a sharp non-asymptotic risk bound that separates approximation, stochastic, and optimization errors. The bound makes explicit how performance scales with the fraction of labeled nodes and graph-induced dependence. Approximation guarantees are further derived for graph-smoothing followed by smooth nonlinear readouts, yielding convergence rates that recover classical nonparametric behavior under full supervision while characterizing performance when labels are scarce. Numerical experiments validate our theory, providing a systematic framework for understanding GNN performance and limitations.
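The aggregate-and-readout model studied here can be prototyped in a few lines: propagate node features with powers of a normalized adjacency (a linear graph convolution), then fit a nonlinear readout on the labeled nodes only. A toy sketch with an MLP readout standing in for the deep ReLU readout; the graph, signal, and labeling fraction are synthetic choices, not the paper's experiments.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(4)
n = 400

# Synthetic graph: sparse symmetric adjacency with self-loops.
A = (rng.random((n, n)) < 0.02).astype(float)
A = np.maximum(A, A.T)
np.fill_diagonal(A, 1.0)
d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
S = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]  # normalized adjacency

X = rng.normal(size=(n, 8))
H = S @ (S @ X)  # linear graph convolution: two propagation steps
y = np.tanh(H[:, 0]) + 0.5 * H[:, 1] ** 2 + rng.normal(scale=0.1, size=n)

labeled = rng.random(n) < 0.15  # labels are scarce: ~15% of nodes
readout = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=3000, random_state=0)
readout.fit(H[labeled], y[labeled])  # readout trained on labeled nodes only
mse = np.mean((readout.predict(H[~labeled]) - y[~labeled]) ** 2)
print(f"MSE on unlabeled nodes: {mse:.4f}")
```

Note that the propagation step uses features from all nodes, labeled or not, which is exactly where the semi-supervised gain enters.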
[LG-73] Dynamic Decision-Making under Model Misspecification: A Stochastic Stability Approach
链接: https://arxiv.org/abs/2602.17086
作者: Xinyu Dai,Daniel Chen,Yian Qian
类目: Theoretical Economics (econ.TH); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:Dynamic decision-making under model uncertainty is central to many economic environments, yet existing bandit and reinforcement learning algorithms rely on the assumption of correct model specification. This paper studies the behavior and performance of one of the most commonly used Bayesian reinforcement learning algorithms, Thompson Sampling (TS), when the model class is misspecified. We first provide a complete dynamic classification of posterior evolution in a misspecified two-armed Gaussian bandit, identifying distinct regimes: correct model concentration, incorrect model concentration, and persistent belief mixing, characterized by the direction of statistical evidence and the model-action mapping. These regimes yield sharp predictions for limiting beliefs, action frequencies, and asymptotic regret. We then extend the analysis to a general finite model class and develop a unified stochastic stability framework that represents posterior evolution as a Markov process on the belief simplex. This approach characterizes two sufficient conditions to classify the ergodic and transient behaviors and provides inductive dimensional reductions of the posterior dynamics. Our results offer the first qualitative and geometric classification of TS under misspecification, bridging Bayesian learning with evolutionary dynamics, and also build the foundations of robust decision-making in structured bandits.
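The object of study can be reproduced in miniature: Thompson Sampling over a finite, misspecified model class on a two-armed Gaussian bandit, with the posterior updated by reward likelihoods and the arm chosen greedily under the sampled model. The model values below are illustrative; with these numbers the posterior concentrates on the less-wrong model, and other choices of the model class produce incorrect concentration or persistent mixing, matching the regimes the paper classifies.

```python
import numpy as np

rng = np.random.default_rng(5)

true_means = np.array([0.0, 0.5])       # real environment, unit-variance rewards
# Finite model class over mean-reward vectors; neither row matches the truth,
# so the class is misspecified.
models = np.array([[0.2, -0.1],         # model 0: prefers arm 0
                   [-0.1, 0.8]])        # model 1: prefers arm 1
log_post = np.zeros(len(models))        # uniform prior over the model class

pulls = np.zeros(2, dtype=int)
for t in range(5000):
    w = np.exp(log_post - log_post.max())
    m = rng.choice(len(models), p=w / w.sum())  # Thompson draw of a model
    arm = int(np.argmax(models[m]))             # act greedily under that model
    reward = rng.normal(true_means[arm], 1.0)
    log_post += -0.5 * (reward - models[:, arm]) ** 2  # Gaussian log-likelihood
    pulls[arm] += 1

post = np.exp(log_post - log_post.max())
print("limit posterior:", (post / post.sum()).round(3), " arm pulls:", pulls)
```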
[LG-74] BrainRVQ: A High-Fidelity EEG Foundation Model via Dual-Domain Residual Quantization and Hierarchical Autoregression
链接: https://arxiv.org/abs/2602.16951
作者: Mingzhe Cui,Tao Chen,Yang Jiao,Yiqin Wang,Lei Xie,Yi Pan,Luca Mainardi
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:
Abstract:Developing foundation models for electroencephalography (EEG) remains challenging due to the signal’s low signal-to-noise ratio and complex spectro-temporal non-stationarity. Existing approaches often overlook the hierarchical latent structure inherent in neural dynamics, leading to suboptimal reconstruction of fine-grained information. In this work, we propose BrainRVQ, a general-purpose EEG foundation model pre-trained on a large-scale corpus of clinical EEG data. Unlike standard masked modeling, BrainRVQ features a Dual-Domain Residual Vector Quantization (DD-RVQ) tokenizer that disentangles temporal waveforms and spectral patterns into hierarchical discrete codes. We further introduce a hierarchical autoregressive pre-training objective that learns to reconstruct these codes in a coarse-to-fine manner, utilizing an importance-guided curriculum masking strategy to prioritize information-rich neural events over background noise. Extensive experiments across 8 diverse downstream datasets demonstrate that BrainRVQ consistently outperforms state-of-the-art baselines, validating its effectiveness in learning robust and generalizable neural representations. Our code and model weights are available at this https URL
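The residual-quantization core of such a tokenizer is compact: each stage quantizes the residual left by earlier stages against its own codebook, yielding a coarse-to-fine code stack. A numpy sketch with random codebooks; the paper's dual-domain variant runs this on temporal and spectral branches, and nothing here reflects the released implementation.

```python
import numpy as np

rng = np.random.default_rng(6)

def rvq_encode(x, codebooks):
    """Residual VQ: stage k quantizes the residual of stages 0..k-1.
    Returns per-stage code indices and the final reconstruction."""
    codes, recon = [], np.zeros_like(x)
    for C in codebooks:                          # C: (codebook_size, dim)
        residual = x - recon
        d = ((residual[:, None, :] - C[None]) ** 2).sum(-1)
        idx = d.argmin(1)
        codes.append(idx)
        recon = recon + C[idx]
    return codes, recon

dim, n_stages, K = 16, 4, 64
codebooks = [rng.normal(scale=1.0 / 2 ** k, size=(K, dim)) for k in range(n_stages)]
x = rng.normal(size=(32, dim))                   # 32 embedding vectors
codes, recon = rvq_encode(x, codebooks)
for k in range(n_stages):
    partial = sum(codebooks[j][codes[j]] for j in range(k + 1))
    print(f"stages used: {k + 1}  reconstruction MSE: {np.mean((x - partial) ** 2):.3f}")
```

The monotonically shrinking MSE across stages is the coarse-to-fine hierarchy that the autoregressive objective then models.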
[LG-75] Poisson-MNL Bandit: Nearly Optimal Dynamic Joint Assortment and Pricing with Decision-Dependent Customer Arrivals
链接: https://arxiv.org/abs/2602.16923
作者: Junhui Cai,Ran Chen,Qitao Huang,Linda Zhao,Wu Zhu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We study dynamic joint assortment and pricing where a seller updates decisions at regular accounting/operating intervals to maximize the cumulative per-period revenue over a horizon T. In many settings, assortment and prices affect not only what an arriving customer buys but also how many customers arrive within the period, whereas classical multinomial logit (MNL) models assume arrivals as fixed, potentially leading to suboptimal decisions. We propose a Poisson-MNL model that couples a contextual MNL choice model with a Poisson arrival model whose rate depends on the offered assortment and prices. Building on this model, we develop an efficient algorithm PMNL based on the idea of upper confidence bound (UCB). We establish its (near) optimality by proving a non-asymptotic regret bound of order \sqrt{T}\log T and a matching lower bound (up to \log T). Simulation studies underscore the importance of accounting for the dependency of arrival rates on assortment and pricing: PMNL effectively learns customer choice and arrival models and provides joint assortment-pricing decisions that outperform others that assume fixed arrival rates.
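The coupling at the heart of the model is easy to state: expected per-period revenue is the decision-dependent arrival rate times the MNL revenue per arrival, λ(S, p) · Σ_{i∈S} p_i P_i(S, p). The sketch below makes this concrete; the particular arrival-rate parameterization is an assumption for illustration, not the paper's specification.

```python
import numpy as np

def mnl_probs(values, prices, beta_price=1.0):
    """MNL purchase probabilities over an offered assortment (plus no-purchase)."""
    u = np.exp(values - beta_price * prices)
    return u / (1.0 + u.sum())

def expected_revenue(values, prices, base_rate, attract_weight=0.5):
    """Per-period revenue when the Poisson arrival rate itself depends on the
    offer via its MNL attractiveness (an assumed functional form)."""
    probs = mnl_probs(values, prices)
    attractiveness = np.log1p(np.exp(values - prices).sum())  # inclusive value
    lam = base_rate * np.exp(attract_weight * attractiveness)
    return lam * (prices * probs).sum()

values = np.array([1.0, 0.8, 0.3])
for prices in (np.array([2.0, 1.5, 1.0]), np.array([4.0, 3.5, 3.0])):
    print(prices, f"-> expected revenue {expected_revenue(values, prices, 10.0):.2f}")
# Higher prices raise per-sale revenue but shrink both purchase probabilities
# and the arrival rate; a fixed-arrival model would miss the second effect.
```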
[LG-76] A statistical perspective on transformers for small longitudinal cohort data
链接: https://arxiv.org/abs/2602.16914
作者: Kiana Farhadyar,Maren Hackenberg,Kira Ahrens,Charlotte Schenk,Bianca Kollmann,Oliver Tüscher,Klaus Lieb,Michael M. Plichta,Andreas Reif,Raffael Kalisch,Martin Wolkewitz,Moritz Hess,Harald Binder
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Modeling of longitudinal cohort data typically involves complex temporal dependencies between multiple variables. There, the transformer architecture, which has been highly successful in language and vision applications, allows us to account for the fact that the most recently observed time points in an individual’s history may not always be the most important for the immediate future. This is achieved by assigning attention weights to observations of an individual based on a transformation of their values. One reason why these ideas have not yet been fully leveraged for longitudinal cohort data is that typically, large datasets are required. Therefore, we present a simplified transformer architecture that retains the core attention mechanism while reducing the number of parameters to be estimated, to be more suitable for small datasets with few time points. Guided by a statistical perspective on transformers, we use an autoregressive model as a starting point and incorporate attention as a kernel-based operation with temporal decay, where aggregation of multiple transformer heads, i.e. different candidate weighting schemes, is expressed as accumulating evidence on different types of underlying characteristics of individuals. This also enables a permutation-based statistical testing procedure for identifying contextual patterns. In a simulation study, the approach is shown to recover contextual dependencies even with a small number of individuals and time points. In an application to data from a resilience study, we identify temporal patterns in the dynamics of stress and mental health. This indicates that properly adapted transformers can not only achieve competitive predictive performance, but also uncover complex context dependencies in small data settings.
[LG-77] Multi-objective optimization and quantum hybridization of equivariant deep learning interatomic potentials on organic and inorganic compounds
链接: https://arxiv.org/abs/2602.16908
作者: G. Laskaris,D. Morozov,D. Tarpanov,A. Seth,J. Procelewska,G. Sai Gautam,A. Sagingalieva,R. Brasher,A. Melnikov
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: 13 pages, 6 figures, 5 tables
Abstract:Allegro is a machine learning interatomic potential (MLIP) model designed to predict atomic properties in molecules using E(3) equivariant neural networks. When training this model, there tends to be a trade-off between accuracy and inference time. For this reason, we apply multi-objective hyperparameter optimization to these two objectives. Additionally, we experiment with modified architectures, creating variants of Allegro by adding strictly classical multi-layer perceptron (MLP) layers in some cases and quantum-classical hybrid layers in others. We compare results on QM9, rMD17-aspirin, rMD17-benzene, and our own proprietary dataset consisting of copper and lithium atoms. As a result, we obtain a list of variants that surpass Allegro in accuracy, along with results that demonstrate the trade-off with inference time.
[LG-78] The Impact of Formations on Football Matches Using Double Machine Learning. Is it worth parking the bus?
链接: https://arxiv.org/abs/2602.16830
作者: Genís Ruiz-Menárguez,Llorenç Badiella
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注: 17 pages, 5 figures, 3 tables
Abstract:This study addresses a central tactical dilemma for football coaches: whether to employ a defensive strategy, colloquially known as “parking the bus”, or a more offensive one. Using an advanced Double Machine Learning (DML) framework, this project provides a robust and interpretable tool to estimate the causal impact of different formations on key match outcomes such as goal difference, possession, corners, and disciplinary actions. Leveraging a dataset of over 22,000 matches from top European leagues, formations were categorized into six representative types based on tactical structure and expert consultation. A major methodological contribution lies in the adaptation of DML to handle categorical treatments, specifically formation combinations, through a novel matrix-based residualization process, allowing for a detailed estimation of formation-versus-formation effects that can inform a coach’s tactical decision-making. Results show that while offensive formations like 4-3-3 and 4-2-3-1 offer modest statistical advantages in possession and corners, their impact on goals is limited. Furthermore, no evidence supports the idea that defensive formations, commonly associated with parking the bus, increase a team’s winning potential. Additionally, red cards appear unaffected by formation choice, suggesting other behavioral factors dominate. Although this approach does not fully capture all aspects of playing style or team strength, it provides a valuable framework for coaches to analyze tactical efficiency and sets a precedent for future research in sports analytics.
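A minimal sketch of how such a matrix-based residualization for categorical treatments might look: one-hot encode the formation against a baseline, residualize each treatment column and the outcome on covariates with cross-fitted ML, then run OLS on the residuals (the standard DML partialling-out step). The data-generating process and model choices here are illustrative, not the authors' pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(7)
n, k = 3000, 3                            # three formation categories

X = rng.normal(size=(n, 5))               # covariates (team strength, context)
T = rng.integers(0, k, size=n)             # observed formation choice
D = np.eye(k)[T][:, 1:]                    # one-hot contrasts vs. baseline formation 0
theta_true = np.array([0.4, -0.2])         # effects relative to the baseline
y = X[:, 0] + 0.5 * X[:, 1] + D @ theta_true + rng.normal(size=n)

def residualize(target):
    """Cross-fitted nuisance predictions, so residuals are out-of-fold."""
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    return target - cross_val_predict(model, X, target, cv=5)

y_res = residualize(y)
D_res = np.column_stack([residualize(D[:, j]) for j in range(k - 1)])
# Final stage: OLS of outcome residuals on the residualized treatment matrix.
theta_hat, *_ = np.linalg.lstsq(D_res, y_res, rcond=None)
print("estimated formation effects vs. baseline:", theta_hat.round(3))
```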
[LG-79] Beyond Procedure: Substantive Fairness in Conformal Prediction
链接: https://arxiv.org/abs/2602.16794
作者: Pengqi Liu,Zijun Yu,Mouloud Belbahri,Arthur Charpentier,Masoud Asgharian,Jesse C. Cresswell
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Conformal prediction (CP) offers distribution-free uncertainty quantification for machine learning models, yet its interplay with fairness in downstream decision-making remains underexplored. Moving beyond CP as a standalone operation (procedural fairness), we analyze the holistic decision-making pipeline to evaluate substantive fairness, the equity of downstream outcomes. Theoretically, we derive an upper bound that decomposes prediction-set size disparity into interpretable components, clarifying how label-clustered CP helps control method-driven contributions to unfairness. To facilitate scalable empirical analysis, we introduce an LLM-in-the-loop evaluator that approximates human assessment of substantive fairness across diverse modalities. Our experiments reveal that label-clustered CP variants consistently deliver superior substantive fairness. Finally, we empirically show that equalized set sizes, rather than coverage, strongly correlate with improved substantive fairness, enabling practitioners to design more fair CP systems. Our code is available at this https URL.
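The role of equalized set sizes shows up already in split-conformal calibration: a single marginal score threshold systematically mismatches groups of different difficulty, while calibrating the threshold within label groups (class-conditional calibration, the limiting case of label clustering) removes the method-driven disparity. A small sketch on synthetic scores; the score distributions are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)

def conformal_qhat(scores, alpha):
    """Split-conformal score quantile with the finite-sample correction."""
    n = len(scores)
    level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, min(level, 1.0))

# Illustrative calibration scores: class 0 is easy, class 1 is hard.
cal_labels = rng.integers(0, 2, size=2000)
cal_scores = np.where(cal_labels == 0,
                      rng.beta(2, 8, size=2000),    # low scores: easy class
                      rng.beta(5, 3, size=2000))    # high scores: hard class
alpha = 0.1
q_marginal = conformal_qhat(cal_scores, alpha)
q_clustered = [conformal_qhat(cal_scores[cal_labels == c], alpha) for c in (0, 1)]
print(f"marginal threshold: {q_marginal:.3f}")
print(f"per-cluster thresholds: {np.round(q_clustered, 3)}")
# One marginal threshold over-covers the easy class and under-covers the hard
# one, inflating set-size disparity; per-cluster thresholds equalize it.
```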
[LG-80] U-FedTomAtt: Ultra-lightweight Federated Learning with Attention for Tomato Disease Recognition
链接: https://arxiv.org/abs/2602.16749
作者: Romiyal George,Sathiyamohan Nishankar,Selvarajah Thuseethan,Chathrie Wimalasooriya,Yakub Sebastian,Roshan G. Ragel,Zhongwei Liang
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注: 10 pages and 4 figures
Abstract:Federated learning has emerged as a privacy-preserving and efficient approach for deploying intelligent agricultural solutions. Accurate edge-based diagnosis across geographically dispersed farms is crucial for recognising tomato diseases in sustainable farming. Traditional centralised training aggregates raw data on a central server, leading to communication overhead, privacy risks and latency. Meanwhile, edge devices require lightweight networks to operate effectively within limited resources. In this paper, we propose U-FedTomAtt, an ultra-lightweight federated learning framework with attention for tomato disease recognition in resource-constrained and distributed environments. The model comprises only 245.34K parameters and 71.41 MFLOPS. First, we propose an ultra-lightweight neural network with dilated bottleneck (DBNeck) modules and a linear transformer to minimise computational and memory overhead. To mitigate potential accuracy loss, a novel local-global residual attention (LoGRA) module is incorporated. Second, we propose the federated dual adaptive weight aggregation (FedDAWA) algorithm that enhances global model accuracy. Third, our framework is validated using three benchmark datasets for tomato diseases under simulated federated settings. Experimental results show that the proposed method achieves Top-1 accuracies of 0.9910 and 0.9915 and F1-scores of 0.9923 and 0.9897 on the SLIF-Tomato and PlantVillage tomato datasets, respectively.
[LG-81] Exploring the Utility of MALDI-TOF Mass Spectrometry and Antimicrobial Resistance in Hospital Outbreak Detection
链接: https://arxiv.org/abs/2602.16737
作者: Chang Liu,Jieshi Chen,Alexander J. Sundermann,Kathleen Shutt,Marissa P. Griffith,Lora Lee Pless,Lee H. Harrison,Artur W. Dubrawski
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:
Abstract:Accurate and timely identification of hospital outbreak clusters is crucial for preventing the spread of infections that have epidemic potential. While assessing pathogen similarity through whole genome sequencing (WGS) is considered the gold standard for outbreak detection, its high cost and lengthy turnaround time preclude routine implementation in clinical laboratories. We explore the utility of two rapid and cost-effective alternatives to WGS, matrix-assisted laser desorption ionization-time of flight (MALDI-TOF) mass spectrometry and antimicrobial resistance (AR) patterns. We develop a machine learning framework that extracts informative representations from MALDI-TOF spectra and AR patterns for outbreak detection and explore their fusion. Through multi-species analyses, we demonstrate that in some cases MALDI-TOF and AR have the potential to reduce reliance on WGS, enabling more accessible and rapid outbreak surveillance.