This post presents the latest papers retrieved from Arxiv.org on 2026-02-13. It is updated automatically and organized into six major areas: NLP, CV, ML, AI, IR, and MA.
Note: paper data is fetched from Arxiv.org daily, with a scheduled automatic update around 12:30 each morning.
Tip: if a day's list is not updated in time, either Arxiv released no new papers that day or the update script failed. Fixes are applied the same day whenever possible.
Table of Contents
Overview (2026-02-13)
A total of 670 papers are updated today, including:
- Natural Language Processing: 119 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 227 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 98 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 215 papers (Machine Learning (cs.LG))
- Multiagent Systems: 11 papers (Multiagent Systems (cs.MA))
- Information Retrieval: 22 papers (Information Retrieval (cs.IR))
- Human-Computer Interaction: 34 papers (Human-Computer Interaction (cs.HC))
Multiagent Systems
[MA-0] Federated Gaussian Process Learning via Pseudo-Representations for Large-Scale Multi-Robot Systems AAMAS2026
[Quick Read]: This paper addresses scalable, distributed modeling of complex environments for multi-robot systems under computational and communication constraints. Gaussian Processes (GPs) offer robust probabilistic modeling, but their cubic computational complexity limits large-scale deployment. The key to the solution is pxpGP (proximal-inexact pseudo-GP), a novel distributed GP framework whose core idea is to use sparse variational inference to generate compact local pseudo-representations. It combines a sparse variational optimization scheme that bounds the local pseudo-datasets with a global scaled proximal-inexact consensus ADMM algorithm featuring adaptive parameter updates and warm-start initialization, enabling efficient and accurate distributed learning and prediction. Experiments on synthetic and real-world datasets show that pxpGP and its decentralized variant dec-pxpGP outperform existing distributed GP methods in hyperparameter estimation and prediction accuracy, especially in large-scale multi-robot networks.
Link: https://arxiv.org/abs/2602.12243
Authors: Sanket A. Salunkhe, George P. Kontoudis
Affiliations: Colorado School of Mines
Subjects: Multiagent Systems (cs.MA)
Comments: Accepted at 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)
Abstract:Multi-robot systems require scalable and federated methods to model complex environments under computational and communication constraints. Gaussian Processes (GPs) offer robust probabilistic modeling, but suffer from cubic computational complexity, limiting their applicability in large-scale deployments. To address this challenge, we introduce the pxpGP, a novel distributed GP framework tailored for both centralized and decentralized large-scale multi-robot networks. Our approach leverages sparse variational inference to generate a local compact pseudo-representation. We introduce a sparse variational optimization scheme that bounds local pseudo-datasets and formulate a global scaled proximal-inexact consensus alternating direction method of multipliers (ADMM) with adaptive parameter updates and warm-start initialization. Experiments on synthetic and real-world datasets demonstrate that pxpGP and its decentralized variant, dec-pxpGP, outperform existing distributed GP methods in hyperparameter estimation and prediction accuracy, particularly in large-scale networks.
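For context, a schematic of the standard scaled consensus-ADMM iterations that pxpGP's proximal-inexact variant builds on (notation ours; the paper modifies the local subproblem solves and adapts the penalty, which is not shown here):

```latex
% Standard scaled consensus ADMM over N robots, each with a local loss f_i
% (e.g., a local GP negative log-marginal-likelihood). pxpGP's
% proximal-inexact variant changes how the local subproblem is solved and
% adapts rho, but keeps this consensus skeleton.
\begin{align*}
\theta_i^{k+1} &= \arg\min_{\theta}\; f_i(\theta)
  + \tfrac{\rho}{2}\bigl\lVert \theta - z^{k} + u_i^{k} \bigr\rVert_2^2
  && \text{(local update at robot } i\text{)} \\
z^{k+1} &= \tfrac{1}{N}\textstyle\sum_{i=1}^{N}\bigl(\theta_i^{k+1} + u_i^{k}\bigr)
  && \text{(global consensus variable)} \\
u_i^{k+1} &= u_i^{k} + \theta_i^{k+1} - z^{k+1}
  && \text{(scaled dual update)}
\end{align*}
```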
[MA-1] Convex Markov Games and Beyond: New Proof of Existence Characterization and Learning Algorithms for Nash Equilibria AISTATS2026
[Quick Read]: This paper addresses the existence of Nash equilibria (NE) and theoretical guarantees for learning algorithms in General Utility Markov Games (GUMGs), in particular for new general-sum application settings with coupling between agents' occupancy measures that are not covered by convex Markov games (cMGs). The key to the solution is a novel agent-wise gradient domination property, which shows that Nash equilibria coincide with the fixed points of projected pseudo-gradient dynamics (i.e., first-order stationary points) and yields a concise existence proof for NE via Brouwer's fixed-point theorem. Building on this characterization, the authors derive a policy gradient theorem for GUMGs and design a model-free policy gradient algorithm; for potential GUMGs, they establish iteration complexity guarantees under exact gradients and sample complexity bounds in both the generative model and on-policy settings, providing the first complete theoretical analysis of common-interest cMGs.
Link: https://arxiv.org/abs/2602.12181
Authors: Anas Barakat, Ioannis Panageas, Antonios Varvitsiotis
Affiliations: Unknown
Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: AISTATS 2026
Abstract:Convex Markov Games (cMGs) were recently introduced as a broad class of multi-agent learning problems that generalize Markov games to settings where strategic agents optimize general utilities beyond additive rewards. While cMGs expand the modeling frontier, their theoretical foundations, particularly the structure of Nash equilibria (NE) and guarantees for learning algorithms, are not yet well understood. In this work, we address these gaps for an extension of cMGs, which we term General Utility Markov Games (GUMGs), capturing new applications requiring coupling between agents’ occupancy measures. We prove that in GUMGs, Nash equilibria coincide with the fixed points of projected pseudo-gradient dynamics (i.e., first-order stationary points), enabled by a novel agent-wise gradient domination property. This insight also yields a simple proof of NE existence using Brouwer’s fixed-point theorem. We further show the existence of Markov perfect equilibria. Building on this characterization, we establish a policy gradient theorem for GUMGs and design a model-free policy gradient algorithm. For potential GUMGs, we establish iteration complexity guarantees for computing approximate-NE under exact gradients and provide sample complexity bounds in both the generative model and on-policy settings. Our results extend beyond prior work restricted to zero-sum cMGs, providing the first theoretical analysis of common-interest cMGs.
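To make the abstract's fixed-point characterization concrete, here is a minimal schematic statement, with notation assumed by us rather than taken from the paper:

```latex
% Pseudo-gradient of the game: each player's partial gradient of its own
% utility, stacked:
\[
  F(x) = \bigl(\nabla_{x_1} f_1(x),\, \dots,\, \nabla_{x_n} f_n(x)\bigr),
  \qquad x = (x_1, \dots, x_n) \in X .
\]
% The characterization: x* is a Nash equilibrium iff it is a fixed point
% of one projected pseudo-gradient ascent step:
\[
  x^{*} = \Pi_{X}\bigl(x^{*} + \eta\, F(x^{*})\bigr)
  \quad \text{for a step size } \eta > 0 .
\]
```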
[MA-2] DEpiABS: Differentiable Epidemic Agent -Based Simulator AAMAS2026
[Quick Read]: This paper addresses the difficulty that existing epidemic simulation tools have in balancing complex-dynamics modeling, computational efficiency, and interpretability. During public health crises such as COVID-19, traditional models often fail to capture individual-level heterogeneity (health status, behavior, and resource constraints) together with key mechanisms such as viral mutation and reinfection. The core of the solution is DEpiABS, a differentiable agent-based model (DABM) that is fully differentiable and supports gradient-based parameter calibration, enabling fast simulation. A z-score-based scaling method maps small-scale simulation outputs to arbitrary real-world population sizes with negligible loss of output granularity, substantially reducing the computational burden of modeling large populations. Experiments show that DEpiABS improves forecasting accuracy while remaining fully interpretable and without relying on auxiliary data, providing a reliable, generalizable, and data-efficient framework for future epidemic response modeling.
Link: https://arxiv.org/abs/2602.12102
Authors: Zhijian Gao, Shuxin Li, Bo An
Affiliations: Nanyang Technological University
Subjects: Multiagent Systems (cs.MA)
Comments: 17 pages, 9 figures, to be published in AAMAS 2026
Abstract:The COVID-19 pandemic highlighted the limitations of existing epidemic simulation tools. These tools provide information that guides non-pharmaceutical interventions (NPIs), yet many struggle to capture complex dynamics while remaining computationally practical and interpretable. We introduce DEpiABS, a scalable, differentiable agent-based model (DABM) that balances mechanistic detail, computational efficiency and interpretability. DEpiABS captures individual-level heterogeneity in health status, behaviour, and resource constraints, while also modelling epidemic processes like viral mutation and reinfection dynamics. The model is fully differentiable, enabling fast simulation and gradient-based parameter calibration. Building on this foundation, we introduce a z-score-based scaling method that maps small-scale simulations to any real-world population sizes with negligible loss in output granularity, reducing the computational burden when modelling large populations. We validate DEpiABS through sensitivity analysis and calibration to COVID-19 and flu data from ten regions of varying scales. Compared to the baseline, DEpiABS is more detailed, fully interpretable, and has reduced the average normal deviation in forecasting from 0.97 to 0.92 on COVID-19 mortality data and from 0.41 to 0.32 on influenza-like-illness data. Critically, these improvements are achieved without relying on auxiliary data, making DEpiABS a reliable, generalisable, and data-efficient framework for future epidemic response modelling.
[MA-3] Multi UAVs Preflight Planning in a Shared and Dynamic Airspace AAMAS2026
[Quick Read]: This paper tackles the complex challenges of preflight path planning for large-scale Unmanned Aerial Vehicle (UAV) fleets in dynamic, shared airspace, including temporal no-fly zones (NFZs), heterogeneous vehicle profiles, and strict delivery deadlines. Existing Multi-Agent Path Finding (MAPF) approaches lack the scalability and flexibility required for practical Unmanned Traffic Management (UTM). The key to the solution is DTAPP-IICR: it first generates an initial solution by prioritizing missions based on urgency; it then computes roundtrip trajectories with SFIPP-ST (Safe Flight Interval Path Planning with Soft and Temporal Constraints), a novel 4D single-agent planner that handles heterogeneous UAVs, strictly enforces temporal NFZs, and models inter-agent conflicts as soft constraints; an iterative Large Neighborhood Search guided by a geometric conflict graph then efficiently resolves residual conflicts, while a completeness-preserving directional pruning technique accelerates the 3D search. On benchmarks with temporal NFZs, the method achieves near-100% success rates with fleets of up to 1,000 UAVs and up to 50% runtime reduction, outperforming batch Enhanced Conflict-Based Search and demonstrating superior scalability and practicality.
Link: https://arxiv.org/abs/2602.12055
Authors: Amath Sow, Mauricio Rodriguez Cesen, Fabiola Martins Campos de Oliveira, Mariusz Wzorek, Daniel de Leng, Mattias Tiger, Fredrik Heintz, Christian Esteve Rothenberg
Affiliations: Linköping University; Universidade Estadual de Campinas; Universidade Federal do ABC
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments: AAMAS 2026 accepted paper
Abstract:Preflight planning for large-scale Unmanned Aerial Vehicle (UAV) fleets in dynamic, shared airspace presents significant challenges, including temporal No-Fly Zones (NFZs), heterogeneous vehicle profiles, and strict delivery deadlines. While Multi-Agent Path Finding (MAPF) provides a formal framework, existing methods often lack the scalability and flexibility required for real-world Unmanned Traffic Management (UTM). We propose DTAPP-IICR: a Delivery-Time Aware Prioritized Planning method with Incremental and Iterative Conflict Resolution. Our framework first generates an initial solution by prioritizing missions based on urgency. Secondly, it computes roundtrip trajectories using SFIPP-ST, a novel 4D single-agent planner (Safe Flight Interval Path Planning with Soft and Temporal Constraints). SFIPP-ST handles heterogeneous UAVs, strictly enforces temporal NFZs, and models inter-agent conflicts as soft constraints. Subsequently, an iterative Large Neighborhood Search, guided by a geometric conflict graph, efficiently resolves any residual conflicts. A completeness-preserving directional pruning technique further accelerates the 3D search. On benchmarks with temporal NFZs, DTAPP-IICR achieves near-100% success with fleets of up to 1,000 UAVs and gains up to 50% runtime reduction from pruning, outperforming batch Enhanced Conflict-Based Search in the UTM context. Scaling successfully in realistic city-scale operations where other priority-based methods fail even at moderate deployments, DTAPP-IICR is positioned as a practical and scalable solution for preflight planning in dense, dynamic urban airspace.
[MA-4] Multi-Defender Single-Attacker Perimeter Defense Game on a Cylinder: Special Case in which the Attacker Starts at the Boundary
[Quick Read]: This paper studies a multi-agent perimeter defense game on a cylinder, in which a team of n slow-moving defenders must prevent a single fast-moving attacker from crossing the defensive perimeter. The key contribution is an analysis of the conditions under which the attacker wins in the special case where it starts close to the boundary and inside a currently defended region, thereby delineating the limits of the defense strategy and the dynamic interplay between attacker and defenders.
Link: https://arxiv.org/abs/2602.11977
Authors: Michael Otte, Roderich Groß
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA)
Comments: 4 pages, 3 figures
Abstract:We describe a multi-agent perimeter defense game played on a cylinder. A team of n slow-moving defenders must prevent a single fast-moving attacker from crossing the boundary of a defensive perimeter. We describe the conditions necessary for the attacker to win in the special case that the intruder starts close to the boundary and in a region that is currently defended.
[MA-5] Global Convergence to Nash Equilibrium in Nonconvex General-Sum Games under the n-Sided PL Condition
[Quick Read]: This paper addresses the problem of finding a Nash equilibrium (NE) in a general-sum game where player $i$'s objective is $f_i(x_1,\dots,x_n)$, with $x_j \in \mathbb{R}^{d_j}$ denoting the strategy variables of player $j$. To handle nonconvex settings where standard gradient descent (GD) may fail to converge, the authors propose a new assumption, the $n$-sided PL condition, which extends the classical Polyak-Łojasiewicz (PL) condition and the concept of multi-convexity and is satisfied by various classes of nonconvex functions. The key to the solution is the design of adapted gradient algorithms (including block coordinate descent, BCD) that provably converge to NE under this condition, together with an analysis of their convergence rates, providing both theoretical guarantees and a practical algorithmic framework for computing NE in nonconvex games.
Link: https://arxiv.org/abs/2602.11835
Authors: Yutong Chao, Jalal Etesami
Affiliations: Technical University of Munich
Subjects: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Numerical Analysis (math.NA)
Comments: 24 pages
Abstract:We consider the problem of finding a Nash equilibrium (NE) in a general-sum game, where player $i$'s objective is $f_i(x)=f_i(x_1,\dots,x_n)$, with $x_j\in\mathbb{R}^{d_j}$ denoting the strategy variables of player $j$. Our focus is on investigating first-order gradient-based algorithms and their variations, such as the block coordinate descent (BCD) algorithm, for tackling this problem. We introduce a set of conditions, called the $n$-sided PL condition, which extends the well-established gradient dominance condition, a.k.a. the Polyak-Łojasiewicz (PL) condition, and the concept of multi-convexity. This condition, satisfied by various classes of non-convex functions, allows us to analyze the convergence of various gradient descent (GD) algorithms. Moreover, our study delves into scenarios where the standard gradient descent methods fail to converge to NE. In such cases, we propose adapted variants of GD that converge towards NE and analyze their convergence rates. Finally, we evaluate the performance of the proposed algorithms through several experiments.
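As a reference point, the classical PL condition and one natural agent-wise analogue consistent with the abstract's description are shown below; the paper's exact formulation may differ:

```latex
% Classical PL / gradient-dominance condition for a smooth f with
% minimum value f*:
\[
  \tfrac{1}{2}\,\lVert \nabla f(x) \rVert_2^2 \;\ge\; \mu \bigl( f(x) - f^{*} \bigr),
  \qquad \mu > 0 .
\]
% One natural agent-wise ("n-sided") analogue: each player's objective
% satisfies PL in its own block, with the other players' strategies fixed:
\[
  \tfrac{1}{2}\,\bigl\lVert \nabla_{x_i} f_i(x) \bigr\rVert_2^2
  \;\ge\; \mu_i \Bigl( f_i(x) - \min_{y_i} f_i(y_i, x_{-i}) \Bigr)
  \qquad \forall\, i,\; \forall\, x .
\]
```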
[MA-6] Non-Trivial Consensus on Directed Matrix-Weighted Networks with Cooperative and Antagonistic Interactions
[Quick Read]: This paper addresses the non-trivial consensus problem on directed signed matrix-weighted networks, i.e., how to make system states converge to a nonzero target state in complex networks with both cooperative and antagonistic multi-dimensional interactions. Prior work has focused on bipartite consensus and trivial consensus, leaving non-trivial consensus, a more general and practically relevant convergence state, without theoretical support. The key to the solution is twofold: first, the authors prove that under certain conditions every eigenvalue of the grounded Laplacians has positive real part, ensuring global asymptotic convergence of system states to the null spaces of signed matrix-weighted Laplacians; second, they propose a systematic approach covering the selection of informed agents in directed networks, the design of external signals, and the precise determination of coupling terms, and derive lower bounds for the coupling coefficients. The algorithm requires no structural balance assumption, and the non-trivial consensus state can be preset arbitrarily, significantly relaxing the restrictions of conventional consensus control.
Link: https://arxiv.org/abs/2602.11822
Authors: Tianmu Niu, Bing Mao, Xiaoqun Wu, Tingwen Huang
Affiliations: Unknown
Subjects: Systems and Control (eess.SY); Multiagent Systems (cs.MA)
Comments:
Abstract:This paper investigates the non-trivial consensus problem on directed signed matrix-weighted networks, a novel convergence state that has remained largely unexplored despite prior studies on bipartite consensus and trivial consensus. Notably, we first prove that for directed signed matrix-weighted networks, every eigenvalue of the grounded Laplacians has positive real part under certain conditions. This key finding ensures the global asymptotic convergence of system states to the null spaces of signed matrix-weighted Laplacians, providing a foundational tool for analyzing dynamics on rooted signed matrix-weighted networks. To achieve non-trivial consensus, we propose a systematic approach involving the strategic selection of informed agents, careful design of external signals, and precise determination of coupling terms. Crucially, we derive the lower bounds of the coupling coefficients. Our consensus algorithm operates under milder connectivity conditions, and does not impose restrictions on whether the network is structurally balanced or unbalanced. Moreover, the non-trivial consensus state can be preset arbitrarily as needed. We also carry out the above analysis for undirected networks, with more relaxed conditions on the coupling coefficients compared to the directed case. This paper further studies non-trivial consensus with switching topologies, and proposes a necessary condition for the convergence of switching networks. The work in this paper demonstrates that groups with both cooperative and antagonistic multi-dimensional interactions can achieve consensus, which was previously deemed exclusive to fully cooperative groups.
[MA-7] Cooperation Breakdown in LLM Agents Under Communication Delays
[Quick Read]: This paper investigates whether cooperation and coordination mechanisms in multi-agent systems (MAS) remain effective under realistic computational and communication constraints, and how stable, efficient group cooperation can be achieved with limited resources. The key to the solution is the FLCOA framework (Five Layers for Cooperation/Coordination among Autonomous Agents), which characterizes how cooperation and coordination emerge across five layers and highlights that the influence of lower-layer factors (such as computational capacity and communication delay) on cooperative behavior has been largely overlooked. Simulations of a Continuous Prisoner's Dilemma with Communication Delay using LLM-based agents reveal a U-shaped relationship between delay and mutual cooperation: moderate delay induces exploitation of slower responders even without explicit instructions, whereas excessive delay suppresses cycles of exploitation. This shows that low-level communication properties nonlinearly modulate high-level cooperative strategies, pointing to new directions for MAS research and design.
Link: https://arxiv.org/abs/2602.11754
Authors: Keita Nishimoto, Kimitaka Asatani, Ichiro Sakata
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments:
Abstract:LLM-based multi-agent systems (LLM-MAS), in which autonomous AI agents cooperate to solve tasks, are gaining increasing attention. For such systems to be deployed in society, agents must be able to establish cooperation and coordination under real-world computational and communication constraints. We propose the FLCOA framework (Five Layers for Cooperation/Coordination among Autonomous Agents) to conceptualize how cooperation and coordination emerge in groups of autonomous agents, and highlight that the influence of lower-layer factors - especially computational and communication resources - has been largely overlooked. To examine the effect of communication delay, we introduce a Continuous Prisoner’s Dilemma with Communication Delay and conduct simulations with LLM-based agents. As delay increases, agents begin to exploit slower responses even without explicit instructions. Interestingly, excessive delay reduces cycles of exploitation, yielding a U-shaped relationship between delay magnitude and mutual cooperation. These results suggest that fostering cooperation requires attention not only to high-level institutional design but also to lower-layer factors such as communication delay and resource allocation, pointing to new directions for MAS research.
[MA-8] Counterfactual Conditional Likelihood Rewards for Multiagent Exploration
[Quick Read]: This paper addresses the redundancy caused by individual-level exploration incentives in multiagent systems for open-ended tasks such as search and rescue or planetary surveying: agents lack awareness of the team's overall exploration, which hinders efficient discovery of coordinated strategies. The key to the solution is Counterfactual Conditional Likelihood (CCL) rewards, which score each agent's exploration by isolating its unique contribution to team exploration rather than rewarding only the novelty of its individual observations. CCL emphasizes observations that are informative with respect to the team's joint exploration, substantially accelerating learning in domains with sparse team rewards and performing particularly well in tasks that require tight coordination among agents.
Link: https://arxiv.org/abs/2602.11740
Authors: Ayhan Alp Aydeniz, Robert Loftin, Kagan Tumer
Affiliations: Oregon State University; University of Sheffield
Subjects: Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments: 9 pages, 5 figures
Abstract:Efficient exploration is critical for multiagent systems to discover coordinated strategies, particularly in open-ended domains such as search and rescue or planetary surveying. However, when exploration is encouraged only at the individual agent level, it often leads to redundancy, as agents act without awareness of how their teammates are exploring. In this work, we introduce Counterfactual Conditional Likelihood (CCL) rewards, which score each agent’s exploration by isolating its unique contribution to team exploration. Unlike prior methods that reward agents solely for the novelty of their individual observations, CCL emphasizes observations that are informative with respect to the joint exploration of the team. Experiments in continuous multiagent domains show that CCL rewards accelerate learning for domains with sparse team rewards, where most joint actions yield zero rewards, and are particularly effective in tasks that require tight coordination among agents.
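A rough, count-based sketch of how a counterfactual conditional reward of this flavor could be computed is shown below; this is one plausible reading, not the paper's estimator, and all names are illustrative:

```python
import math
from collections import Counter

class CCLSketch:
    """Count-based sketch of a counterfactual conditional likelihood reward
    (one plausible reading; the paper's estimator may differ). Agent i is
    rewarded when its observation is surprising *given what its teammates
    observed*, so duplicated coverage earns little and unique coverage a lot.
    """
    def __init__(self):
        self.pair_counts = Counter()  # (team_obs, own_obs) co-occurrences
        self.team_counts = Counter()  # team_obs marginals

    def reward(self, own_obs, teammate_obs, smoothing=10):
        team = tuple(sorted(teammate_obs))
        # Laplace-smoothed conditional likelihood p(own_obs | teammates)
        p = (self.pair_counts[(team, own_obs)] + 1) / (self.team_counts[team] + smoothing)
        self.pair_counts[(team, own_obs)] += 1
        self.team_counts[team] += 1
        return -math.log(p)  # surprisal: high when own_obs is novel given team
```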
[MA-9] Distributionally Robust Cooperative Multi-Agent Reinforcement Learning via Robust Value Factorization ICLR2026
[Quick Read]: This paper addresses the unreliability of cooperative multi-agent reinforcement learning (MARL) in real-world settings caused by environmental uncertainties such as the sim-to-real gap, model mismatch, and system noise. Existing methods adopt centralized training with decentralized execution and rely on value factorization under the Individual-Global-Maximum (IGM) principle, which guarantees that decentralized greedy policies recover the team-optimal joint action but lacks robustness in real environments. The key to the solution is the Distributionally Robust IGM (DrIGM) principle, which requires each agent's robust greedy action to align with the robust team-optimal joint action. By defining novel robust individual action values, the authors construct a mechanism that is compatible with decentralized greedy execution and carries a provable system-level robustness guarantee. Building on this foundation, they derive DrIGM-compliant robust variants of existing value-factorization architectures (e.g., VDN/QMIX/QTRAN) that (i) train on robust Q-targets, (ii) preserve scalability, and (iii) integrate seamlessly with existing codebases without per-agent reward shaping. Experiments on high-fidelity SustainGym simulators and a StarCraft environment show consistently improved out-of-distribution performance.
Link: https://arxiv.org/abs/2602.11437
Authors: Chengrui Qu, Christopher Yeh, Kishan Panaganti, Eric Mazumdar, Adam Wierman
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: ICLR 2026
Abstract:Cooperative multi-agent reinforcement learning (MARL) commonly adopts centralized training with decentralized execution, where value-factorization methods enforce the individual-global-maximum (IGM) principle so that decentralized greedy actions recover the team-optimal joint action. However, this recipe remains unreliable in real-world settings due to environmental uncertainties arising from the sim-to-real gap, model mismatch, and system noise. We address this gap by introducing Distributionally robust IGM (DrIGM), a principle that requires each agent's robust greedy action to align with the robust team-optimal joint action. We show that DrIGM holds for a novel definition of robust individual action values, which is compatible with decentralized greedy execution and yields a provable robustness guarantee for the whole system. Building on this foundation, we derive DrIGM-compliant robust variants of existing value-factorization architectures (e.g., VDN/QMIX/QTRAN) that (i) train on robust Q-targets, (ii) preserve scalability, and (iii) integrate seamlessly with existing codebases without bespoke per-agent reward shaping. Empirically, on high-fidelity SustainGym simulators and a StarCraft game environment, our methods consistently improve out-of-distribution performance. Code and data are available at this https URL.
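The IGM principle referenced here is standard in value factorization; schematically, DrIGM asks the same alignment to hold for robust values (the tilde notation below is ours, not the paper's):

```latex
% IGM: greedy decentralized actions jointly recover the argmax of the
% factored team value:
\[
  \arg\max_{\mathbf{a}} Q_{\mathrm{tot}}(s, \mathbf{a})
  = \Bigl( \arg\max_{a_1} Q_1(s, a_1),\, \dots,\, \arg\max_{a_n} Q_n(s, a_n) \Bigr).
\]
% DrIGM (schematic): the same alignment for distributionally robust
% values, i.e., values defined under a worst case over an uncertainty
% set of environment dynamics:
\[
  \arg\max_{\mathbf{a}} \widetilde{Q}_{\mathrm{tot}}(s, \mathbf{a})
  = \bigl( \arg\max_{a_i} \widetilde{Q}_i(s, a_i) \bigr)_{i=1}^{n} .
\]
```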
[MA-10] Reconstructing Network Outbreaks under Group Surveillance AAMAS2026
[Quick Read]: This paper introduces and studies the POOLCASCADEMLE problem in the context of pooled testing: reconstructing disease-transmission paths from partially positive pooled test outcomes, i.e., finding a maximum-likelihood (MLE) cascade subgraph consistent with a known network propagation process and pooled test results, such that every positive pool contains at least one infected node. The key difficulty is that the classical Steiner-subgraph formulation for individual testing must be extended with a new combinatorial layer: a consistent cascade must select at least one member of each positive pool. The authors prove the problem is NP-hard under the Independent Cascade (IC) model and present an approximation algorithm based on a reduction to the Group Steiner Tree problem; for the restricted one-hop version, they develop a method based on linear programming relaxation and rounding. Experiments on real and synthetic contact networks show the proposed approach outperforms baselines that assume pools of size one, in terms of both missing-infection recovery and prevalence estimation.
Link: https://arxiv.org/abs/2602.11419
Authors: Ritwick Mishra, Abhijin Adiga, Anil Vullikanti
Affiliations: University of Virginia
Subjects: Social and Information Networks (cs.SI); Multiagent Systems (cs.MA)
Comments: 13 pages; In Proceedings of the AAMAS 2026 Conference
Abstract:A key public health problem during an outbreak is to reconstruct the disease cascade from a partial set of confirmed infections. This has been studied extensively under the Maximum Likelihood Estimation (MLE) formulation, which reduces the problem to finding some type of Steiner subgraph on a network. Group surveillance like wastewater or aerosol monitoring is a form of mass/pooled testing where samples from multiple individuals are pooled together and tested once for all. While a single negative test clears multiple individuals, a positive test does not reveal the infected individuals in the test pool. We introduce the POOLCASCADEMLE problem in the setting of a network propagation process, where the goal is to find a MLE cascade subgraph which is consistent with the pooled test outcomes. Previous work on reconstruction assumes that the test results are of individuals, i.e., pools of size one, and requires a consistent cascade to connect the positive testing nodes. In POOLCASCADEMLE, a consistent cascade must choose at least one node in each positive pool, adding another combinatorial layer. We show that, under the Independent Cascade (IC) model, POOLCASCADEMLE is NP-hard, and present an approximation algorithm based on a reduction to the Group Steiner Tree problem. We also consider a one-hop version of this problem, in which the disease can spread for one time step after being seeded. We show that even this restricted version is NP-hard, and develop a method using linear programming relaxation and rounding. We evaluate the performance of our methods on real and synthetic contact networks, in terms of missing infection recovery and prevalence estimation. We find that our approach outperforms meaningful baselines which correspond to pools of size one and use state-of-the-art methods.
Natural Language Processing
[NLP-0] Agentic Test-Time Scaling for WebAgents
[Quick Read]: This paper addresses the inefficiency and limited gains of test-time scaling on multi-step agentic tasks, where small per-step errors compound over long horizons and naive policies that uniformly increase per-step compute show diminishing returns. The key to the solution is Confidence-Aware Test-Time Scaling (CATTS), a mechanism for dynamic compute allocation: uncertainty statistics derived from the agent's own vote distribution (entropy and the top-1/top-2 margin) serve as signals to allocate extra compute only when a decision is genuinely contentious. This yields substantial gains in success rate (up to 9.1% on WebArena-Lite and GoBrowse) without a large increase in token usage, while providing an interpretable decision rule.
Link: https://arxiv.org/abs/2602.12276
Authors: Nicholas Lee, Lutfi Eren Erdogan, Chris Joseph John, Surya Krishnapillai, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Test-time scaling has become a standard way to improve performance and boost reliability of neural network models. However, its behavior on agentic, multi-step tasks remains less well-understood: small per-step errors can compound over long horizons; and we find that naive policies that uniformly increase sampling show diminishing returns. In this work, we present CATTS, a simple technique for dynamically allocating compute for multi-step agents. We first conduct an empirical study of inference-time scaling for web agents. We find that uniformly increasing per-step compute quickly saturates in long-horizon environments. We then investigate stronger aggregation strategies, including an LLM-based Arbiter that can outperform naive voting, but that can overrule high-consensus decisions. We show that uncertainty statistics derived from the agent’s own vote distribution (entropy and top-1/top-2 margin) correlate with downstream success and provide a practical signal for dynamic compute allocation. Based on these findings, we introduce Confidence-Aware Test-Time Scaling (CATTS), which uses vote-derived uncertainty to allocate compute only when decisions are genuinely contentious. CATTS improves performance on WebArena-Lite and GoBrowse by up to 9.1% over React while using up to 2.3x fewer tokens than uniform scaling, providing both efficiency gains and an interpretable decision rule.
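A minimal sketch of the vote-derived allocation rule described in the abstract, assuming a callable that samples one candidate action; the batch sizes and margin threshold below are illustrative, not the paper's settings:

```python
import math
from collections import Counter

def vote_uncertainty(votes):
    """Entropy and top-1/top-2 margin of an action-vote distribution."""
    counts = Counter(votes)
    probs = sorted((c / len(votes) for c in counts.values()), reverse=True)
    entropy = -sum(p * math.log(p) for p in probs)
    margin = probs[0] - (probs[1] if len(probs) > 1 else 0.0)
    return entropy, margin

def catts_step(sample_action, k_base=3, k_max=9, margin_thresh=0.4):
    """Draw a small batch of candidate actions; escalate sampling only when
    the vote distribution looks contentious, then act by majority vote."""
    votes = [sample_action() for _ in range(k_base)]
    _, margin = vote_uncertainty(votes)
    if margin < margin_thresh:          # contentious: allocate more compute
        votes += [sample_action() for _ in range(k_max - k_base)]
    return Counter(votes).most_common(1)[0][0]
```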
[NLP-1] On-Policy Context Distillation for Language Models
[Quick Read]: This paper addresses the difficulty of effectively internalizing in-context knowledge into a language model's own parameters, so as to improve generalization and task performance. The core solution is On-Policy Context Distillation (OPCD), whose key idea is to train the student model on its own generated trajectories while minimizing the reverse Kullback-Leibler divergence against a context-conditioned teacher. This mechanism lets the student extract and consolidate transferable experiential knowledge from historical solution traces or optimized prompts, while preserving out-of-distribution capabilities, and it also supports knowledge distillation across model sizes.
Link: https://arxiv.org/abs/2602.12275
Authors: Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, Furu Wei
Affiliations: Microsoft Research
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Context distillation enables language models to internalize in-context knowledge into their parameters. In our work, we propose On-Policy Context Distillation (OPCD), a framework that bridges on-policy distillation with context distillation by training a student model on its own generated trajectories while minimizing reverse Kullback-Leibler divergence against a context-conditioned teacher. We demonstrate the effectiveness of OPCD on two important applications: experiential knowledge distillation, where models extract and consolidate transferable knowledge from their historical solution traces, and system prompt distillation, where models internalize beneficial behaviors encoded in optimized prompts. Across mathematical reasoning, text-based games, and domain-specific tasks, OPCD consistently outperforms baseline methods, achieving higher task accuracy while better preserving out-of-distribution capabilities. We further show that OPCD enables effective cross-size distillation, where smaller student models can internalize experiential knowledge from larger teachers.
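In symbols, the objective described in the abstract takes roughly the following form (notation ours):

```latex
% Schematic OPCD objective: the student pi_theta samples trajectory y for
% input x on-policy; the teacher additionally conditions on context c
% (retrieved knowledge, past solution traces, or an optimized system
% prompt). Per-step reverse KL, student || teacher:
\[
  \min_{\theta}\;
  \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
  \Bigl[ \sum_{t}
    D_{\mathrm{KL}}\bigl(
      \pi_\theta(\cdot \mid x, y_{<t})
      \,\big\|\,
      \pi_{\mathrm{teacher}}(\cdot \mid c, x, y_{<t})
    \bigr)
  \Bigr] .
\]
```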
[NLP-2] 3D: Few-Step Diffusion Language Models via Trajectory Self-Distillation with Direct Discriminative Optimization
[Quick Read]: This paper addresses the efficiency bottleneck of diffusion large language models (DLLMs) in practical inference, where many refinement steps are required and aggressively reducing the step count substantially degrades generation quality. The key to the solution is a trajectory self-distillation framework that improves few-step decoding by distilling the model's own generative trajectories. It incorporates Direct Discriminative Optimization (DDO), a reverse-KL objective that drives the student to concentrate on the teacher's high-probability modes, significantly improving quality under tight step budgets and substantially narrowing the gap to full-step decoding.
Link: https://arxiv.org/abs/2602.12262
Authors: Tunyu Zhang, Xinxi Zhang, Ligong Han, Haizhou Shi, Xiaoxiao He, Zhuowei Li, Hao Wang, Kai Xu, Akash Srivastava, Hao Wang, Vladimir Pavlovic, Dimitris N. Metaxas
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Diffusion large language models (DLLMs) have the potential to enable fast text generation by decoding multiple tokens in parallel. However, in practice, their inference efficiency is constrained by the need for many refinement steps, while aggressively reducing the number of steps leads to a substantial degradation in generation quality. To alleviate this, we propose a trajectory self-distillation framework that improves few-step decoding by distilling the model’s own generative trajectories. We incorporate Direct Discriminative Optimization (DDO), a reverse-KL objective that promotes mode-seeking distillation and encourages the student to concentrate on high-probability teacher modes. Across benchmarks, our approach consistently outperforms strong few-step baselines and standard training under tight step budgets. Although full-step decoding remains superior, we substantially narrow the gap, establishing a strong foundation towards practical few-step DLLMs. The source code is available at this https URL.
[NLP-3] “Sorry I Didn’t Catch That”: How Speech Models Miss What Matters Most
[Quick Read]: This paper addresses the low accuracy of speech recognition systems on short, high-stakes utterances in real-world deployments (such as U.S. street names), and their unreliability for linguistically diverse speakers. The study finds an average transcription error rate of 44% across mainstream models, and mis-transcriptions cause routing distance errors twice as large for non-English primary speakers as for English primary speakers. The key to the solution is to generate synthetic data with diverse pronunciations of named entities using open-source text-to-speech (TTS) models: fine-tuning with fewer than 1,000 samples improves street-name transcription accuracy for non-English primary speakers by nearly 60% relative to base models, effectively narrowing the gap between benchmark performance and real-world deployment reliability.
Link: https://arxiv.org/abs/2602.12249
Authors: Kaitlyn Zhou, Martijn Bartelds, Federico Bianchi, James Zou
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:
Abstract:Despite speech recognition systems achieving low word error rates on standard benchmarks, they often fail on short, high-stakes utterances in real-world deployments. Here, we study this failure mode in a high-stakes task: the transcription of U.S. street names as spoken by U.S. participants. We evaluate 15 models from OpenAI, Deepgram, Google, and Microsoft on recordings from linguistically diverse U.S. speakers and find an average transcription error rate of 44%. We quantify the downstream impact of failed transcriptions by geographic locations and show that mis-transcriptions systematically cause errors for all speakers, but that routing distance errors are twice as large for non-English primary speakers compared to English primary speakers. To mitigate this harm, we introduce a synthetic data generation approach that produces diverse pronunciations of named entities using open-source text-to-speech models. Fine-tuning with less than 1,000 synthetic samples improves street name transcription accuracy by nearly 60% (relative to base models) for non-English primary speakers. Our results highlight a critical gap between benchmark performance and real-world reliability in speech systems and demonstrate a simple, scalable path to reducing high-stakes transcription errors.
[NLP-4] Moonshine v2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications
[Quick Read]: This paper addresses the high time-to-first-token (TTFT) and quadratic-in-sequence-length cost of full-attention Transformer encoders in latency-critical speech applications (live transcription, voice commands, real-time translation) on resource-constrained edge devices. The key to the solution is Moonshine v2, which adopts sliding-window self-attention to achieve bounded, low-latency inference while preserving strong local context modeling, markedly reducing TTFT and improving efficiency. Experiments show the model attains word error rates (WER) on par with models six times its size while running significantly faster, demonstrating that carefully designed local attention offers a superior accuracy-latency trade-off and opening a new path for interactive speech interfaces on edge devices.
Link: https://arxiv.org/abs/2602.12241
Authors: Manjunath Kudlur, Evan King, James Wang, Pete Warden
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments: 7 pages, 5 figures
Abstract:Latency-critical speech applications (e.g., live transcription, voice commands, and real-time translation) demand low time-to-first-token (TTFT) and high transcription accuracy, particularly on resource-constrained edge devices. Full-attention Transformer encoders remain a strong accuracy baseline for automatic speech recognition (ASR) because every frame can directly attend to every other frame, which resolves otherwise locally ambiguous acoustics using distant lexical context. However, this global dependency incurs quadratic complexity in sequence length, inducing an inherent “encode-the-whole-utterance” latency profile. For streaming use cases, this causes TTFT to grow linearly with utterance length as the encoder must process the entire prefix before any decoder token can be emitted. To better meet the needs of on-device, streaming ASR use cases we introduce Moonshine v2, an ergodic streaming-encoder ASR model that employs sliding-window self-attention to achieve bounded, low-latency inference while preserving strong local context. Our models achieve state of the art word error rates across standard benchmarks, attaining accuracy on-par with models 6x their size while running significantly faster. These results demonstrate that carefully designed local attention is competitive with the accuracy of full attention at a fraction of the size and latency cost, opening new possibilities for interactive speech interfaces on edge devices.
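A minimal sketch of a generic banded (sliding-window) attention mask; Moonshine's actual implementation may use a causal or asymmetric window for streaming, which is not reproduced here:

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where frame i may attend only to frames within
    `window` positions on either side (a banded, local-attention mask)."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window

# With window=2, frame 5 attends to frames 3..7 only, so the cost per
# frame is O(window) instead of O(seq_len), and the encoder can emit
# outputs for a prefix without processing the whole utterance.
mask = sliding_window_mask(seq_len=8, window=2)
scores = torch.randn(8, 8)                        # raw attention logits
scores = scores.masked_fill(~mask, float("-inf"))
attn = torch.softmax(scores, dim=-1)
```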
[NLP-5] Olmix: A Framework for Data Mixing Throughout LM Development
[Quick Read]: This paper addresses two core problems in data mixing for language model (LM) training. First, the design space of mixing methods is poorly understood: existing methods make configuration choices without justification and overlook practical data constraints. Second, real-world domain sets evolve throughout LM development (datasets are added, removed, partitioned, or revised), whereas existing methods assume fixed domains and cannot adapt efficiently. The key to the solution is the Olmix framework, with two parts: a comprehensive empirical study identifying which design choices yield a strong mixing method, and a mixture-reuse mechanism that, when the domain set is updated, recomputes mixing ratios only for the affected domains and reuses past ratios for the rest. Over a sequence of five domain-set updates mirroring real LM development, mixture reuse matches full recomputation after each update with 74% less compute and improves over training without mixing by 11.6% on downstream tasks.
Link: https://arxiv.org/abs/2602.12237
Authors: Mayee F. Chen, Tyler Murray, David Heineman, Matt Jordan, Hannaneh Hajishirzi, Christopher Ré, Luca Soldaini, Kyle Lo
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Data mixing – determining the ratios of data from different domains – is a first-order concern for training language models (LMs). While existing mixing methods show promise, they fall short when applied during real-world LM development. We present Olmix, a framework that addresses two such challenges. First, the configuration space for developing a mixing method is not well understood – design choices across existing methods lack justification or consensus and overlook practical issues like data constraints. We conduct a comprehensive empirical study of this space, identifying which design choices lead to a strong mixing method. Second, in practice, the domain set evolves throughout LM development as datasets are added, removed, partitioned, and revised – a problem setting largely unaddressed by existing works, which assume fixed domains. We study how to efficiently recompute the mixture after the domain set is updated, leveraging information from past mixtures. We introduce mixture reuse, a mechanism that reuses existing ratios and recomputes ratios only for domains affected by the update. Over a sequence of five domain-set updates mirroring real-world LM development, mixture reuse matches the performance of fully recomputing the mix after each update with 74% less compute and improves over training without mixing by 11.6% on downstream tasks.
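A minimal sketch of the mixture-reuse idea under stated assumptions: `recompute` stands in for running the (expensive) mixing method on the affected domains only, and the renormalization scheme is illustrative rather than the paper's:

```python
def reuse_mixture(old_ratios, affected, recompute):
    """Sketch of mixture reuse: keep ratios of untouched domains, recompute
    ratios only among the affected domains, and give them the probability
    mass not claimed by the kept domains."""
    kept = {d: r for d, r in old_ratios.items() if d not in affected}
    new = recompute(affected)                # ratios among affected, sum to 1
    new_mass = 1.0 - sum(kept.values())      # mass freed up for them
    mixture = dict(kept)
    mixture.update({d: r * new_mass for d, r in new.items()})
    total = sum(mixture.values())            # guard against numeric drift
    return {d: r / total for d, r in mixture.items()}
```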
[NLP-6] Detecting Overflow in Compressed Token Representations for Retrieval-Augmented Generation EACL2026
[Quick Read]: This paper addresses information loss in soft compression architectures for long-context processing in large language models (LLMs), specifically the phenomenon of "token overflow", where compressed representations no longer contain enough information to answer a given query. The key to the solution is a query-aware probing mechanism: lightweight classifiers over joint query and context representations detect overflow in the xRAG soft-compression setting with 0.72 AUC-ROC on average, outperforming approaches that rely only on query-agnostic saturation statistics and enabling low-cost, pre-LLM gating against compression-induced errors.
Link: https://arxiv.org/abs/2602.12235
Authors: Julia Belikova, Danila Rozhevskii, Dennis Svirin, Konstantin Polev, Alexander Panchenko
Affiliations: Skoltech; Sber AI Lab; AIRI; Institute for Information Transmission Problems of the Russian Academy of Sciences
Subjects: Computation and Language (cs.CL)
Comments: Accepted to EACL 2026 Student Research Workshop. 14 pages, 6 tables, 1 figure
Abstract:Efficient long-context processing remains a crucial challenge for contemporary large language models (LLMs), especially in resource-constrained environments. Soft compression architectures promise to extend effective context length by replacing long token sequences with smaller sets of learned compressed tokens. Yet, the limits of compressibility – and when compression begins to erase task-relevant content – remain underexplored. In this paper, we define token overflow as a regime in which compressed representations no longer contain sufficient information to answer a given query, and propose a methodology to characterize and detect it. In the xRAG soft-compression setting, we find that query-agnostic saturation statistics reliably separate compressed from uncompressed token representations, providing a practical tool for identifying compressed tokens but showing limited overflow detection capability. Lightweight probing classifiers over both query and context xRAG representations detect overflow with 0.72 AUC-ROC on average on HotpotQA, SQuADv2, and TriviaQA datasets, demonstrating that incorporating query information improves detection performance. These results advance from query-independent diagnostics to query-aware detectors, enabling low-cost pre-LLM gating to mitigate compression-induced errors.
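A minimal sketch of a query-aware overflow probe of the kind described, assuming precomputed query and compressed-context embeddings; the feature layout and classifier choice are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_overflow_probe(query_embs: np.ndarray,      # (N, d_q)
                         ctx_embs: np.ndarray,        # (N, d_c) compressed tokens
                         overflow_labels: np.ndarray):  # (N,) 0/1
    """Concatenate the query embedding with the compressed-context embedding
    and fit a lightweight classifier on (representation, overflow) pairs.
    The paper probes xRAG representations specifically."""
    feats = np.concatenate([query_embs, ctx_embs], axis=1)
    probe = LogisticRegression(max_iter=1000)
    probe.fit(feats, overflow_labels)
    return probe  # probe.predict_proba(...) can gate whether to compress
```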
[NLP-7] ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images EACL2026
[Quick Read]: This paper addresses the limitations of generalist Vision Language Models (VLMs) for fine-grained, structured information extraction (IE) from enterprise documents, particularly their ability to adapt to diverse document types and flexible schemas. Existing Key Entity Extraction (KEE), Relation Extraction (RE), and Visual Question Answering (VQA) datasets are limited by narrow entity ontologies, simple queries, or homogeneous document types, and cannot support structured extraction in complex document scenarios. The key to the solution is ExStrucTiny, a new benchmark built through a novel pipeline combining manual and synthetic human-validated samples, covering a wider range of document types and extraction scenarios and unifying aspects of KEE, RE, and VQA, thereby providing a reliable basis for evaluating and improving generalist VLMs on structured IE from documents.
Link: https://arxiv.org/abs/2602.12203
Authors: Mathieu Sibue, Andres Muñoz Garza, Samuel Mensah, Pranav Shetty, Zhiqiang Ma, Xiaomo Liu, Manuela Veloso
Affiliations: J.P. Morgan AI Research
Subjects: Computation and Language (cs.CL)
Comments: EACL 2026, main conference
Abstract:Enterprise documents, such as forms and reports, embed critical information for downstream applications like data archiving, automated workflows, and analytics. Although generalist Vision Language Models (VLMs) perform well on established document understanding benchmarks, their ability to conduct holistic, fine-grained structured extraction across diverse document types and flexible schemas is not well studied. Existing Key Entity Extraction (KEE), Relation Extraction (RE), and Visual Question Answering (VQA) datasets are limited by narrow entity ontologies, simple queries, or homogeneous document types, often overlooking the need for adaptable and structured extraction. To address these gaps, we introduce ExStrucTiny, a new benchmark dataset for structured Information Extraction (IE) from document images, unifying aspects of KEE, RE, and VQA. Built through a novel pipeline combining manual and synthetic human-validated samples, ExStrucTiny covers more varied document types and extraction scenarios. We analyze open and closed VLMs on this benchmark, highlighting challenges such as schema adaptation, query under-specification, and answer localization. We hope our work provides a bedrock for improving generalist models for structured IE in documents.
[NLP-8] Visual Reasoning Benchmark: Evaluating Multimodal LLM s on Classroom-Authentic Visual Problems from Primary Education
[Quick Read]: This paper addresses the pronounced weakness of current generative AI in reasoning over spatial and relational structures, which is critical in early-grade mathematics education that relies heavily on visuals. The key to the solution is the Visual Reasoning Benchmark (VRB), a new dataset of 701 authentic questions from primary school examinations in Zambia and India, covering reasoning by analogy, pattern completion, and spatial matching, and using unedited, minimal-text images to test Multimodal Large Language Models (MLLMs) under realistic classroom conditions. The findings show that models fare relatively well on static skills such as counting and scaling but hit a distinct "spatial ceiling" on dynamic operations such as folding, reflection, and rotation, underscoring the risk of incorrect marking, false scaffolding, and reinforced student misconceptions in classroom use, and the importance of education-focused benchmarks like VRB for determining the functional boundaries of multimodal classroom tools.
Link: https://arxiv.org/abs/2602.12196
Authors: Mohamed Huti, Alasdair Mackintosh, Amy Waldock, Dominic Andrews, Maxime Lelièvre, Moritz Boos, Tobias Murray, Paul Atherton, Robin A. A. Ince, Oliver G. B. Garrod
Affiliations: Fab AI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:AI models have achieved state-of-the-art results in textual reasoning; however, their ability to reason over spatial and relational structures remains a critical bottleneck – particularly in early-grade maths, which relies heavily on visuals. This paper introduces the visual reasoning benchmark (VRB), a novel dataset designed to evaluate Multimodal Large Language Models (MLLMs) on their ability to solve authentic visual problems from classrooms. This benchmark is built on a set of 701 questions sourced from primary school examinations in Zambia and India, which cover a range of tasks such as reasoning by analogy, pattern completion, and spatial matching. We outline the methodology and development of the benchmark which intentionally uses unedited, minimal-text images to test if models can meet realistic needs of primary education. Our findings reveal a “jagged frontier” of capability where models demonstrate better proficiency in static skills such as counting and scaling, but reach a distinct “spatial ceiling” when faced with dynamic operations like folding, reflection, and rotation. These weaknesses pose a risk for classroom use on visual reasoning problems, with the potential for incorrect marking, false scaffolding, and reinforcing student misconceptions. Consequently, education-focused benchmarks like the VRB are essential for determining the functional boundaries of multimodal tools used in classrooms.
[NLP-9] Query-focused and Memory-aware Reranker for Long Context Processing
[Quick Read]: This paper addresses the trade-off between efficiency and effectiveness in retrieval reranking for large language models, in particular how to exploit the model's internal attention mechanism to assess passage-query relevance without labeled supervision such as Likert-scale ratings. The key to the solution is a reranking framework based on the scores of selected attention heads: models are trained to estimate passage-query relevance directly from these attention scores, yielding continuous relevance scores that enable training on arbitrary retrieval datasets without Likert-scale supervision. The approach offers a lightweight and efficient listwise alternative: small models (e.g., 4B parameters) match or surpass state-of-the-art pointwise and listwise rerankers across domains, and the method establishes a new state of the art on the LoCoMo benchmark, which assesses dialogue understanding and memory usage.
Link: https://arxiv.org/abs/2602.12192
Authors: Yuqing Li, Jiangnan Li, Mo Yu, Guoxuan Ding, Zheng Lin, Weiping Wang, Jie Zhou
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Pattern Recognition Center, WeChat AI, Tencent Inc
Subjects: Computation and Language (cs.CL)
Comments: 14 pages, 2 figures
Abstract:Built upon the existing analysis of retrieval heads in large language models, we propose an alternative reranking framework that trains models to estimate passage-query relevance using the attention scores of selected heads. This approach provides a listwise solution that leverages holistic information within the entire candidate shortlist during ranking. At the same time, it naturally produces continuous relevance scores, enabling training on arbitrary retrieval datasets without requiring Likert-scale supervision. Our framework is lightweight and effective, requiring only small-scale models (e.g., 4B parameters) to achieve strong performance. Extensive experiments demonstrate that our method outperforms existing state-of-the-art pointwise and listwise rerankers across multiple domains, including Wikipedia and long narrative datasets. It further establishes a new state-of-the-art on the LoCoMo benchmark that assesses the capabilities of dialogue understanding and memory usage. We further demonstrate that our framework supports flexible extensions. For example, augmenting candidate passages with contextual information further improves ranking accuracy, while training attention heads from middle layers enhances efficiency without sacrificing performance.
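A minimal sketch of scoring passages by selected-head attention mass, assuming access to one layer's attention tensor; which heads and layers to select, and how the scores are trained, follow the paper and are not reproduced here:

```python
import torch

def head_attention_relevance(attn, query_pos, passage_spans, heads):
    """Score each candidate passage by the attention mass that query tokens
    place on its tokens, averaged over selected heads. `attn` is one layer's
    attention tensor of shape (num_heads, seq_len, seq_len)."""
    scores = []
    for span in passage_spans:             # each span: a slice over seq_len
        mass = attn[heads][:, query_pos, span].mean().item()
        scores.append(mass)
    return scores                          # rank passages by descending score
```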
[NLP-10] Pedagogically-Inspired Data Synthesis for Language Model Knowledge Distillation ICLR2026
[Quick Read]: This paper addresses the lack of pedagogical awareness in current LLM knowledge distillation via synthetic data, which treats knowledge transfer as a one-off data synthesis and training task rather than a systematic learning process. The key to the solution is a pedagogically inspired three-stage framework, Knowledge Identifier, Organizer, and Adapter (IOA), which incorporates Bloom's Mastery Learning Principles and Vygotsky's Zone of Proximal Development to dynamically match the student model's cognitive capacity: it first identifies the student's knowledge deficiencies, then organizes knowledge delivery as a progressive curriculum, requiring the student to approach the teacher's performance on prerequisite knowledge before advancing, while introducing new knowledge with controlled, gradual difficulty increments. This yields significant gains for small models on complex reasoning tasks, e.g., 19.2% over state-of-the-art baselines on MATH and 22.3% on HumanEval.
Link: https://arxiv.org/abs/2602.12172
Authors: Bowei He, Yankai Chen, Xiaokun Zhang, Linghe Kong, Philip S. Yu, Xue Liu, Chen Ma
Affiliations: MBZUAI; McGill; CityUHK; SJTU; UIC
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by ICLR 2026
Abstract:Knowledge distillation from Large Language Models (LLMs) to smaller models has emerged as a critical technique for deploying efficient AI systems. However, current methods for distillation via synthetic data lack pedagogical awareness, treating knowledge transfer as a one-off data synthesis and training task rather than a systematic learning process. In this paper, we propose a novel pedagogically-inspired framework for LLM knowledge distillation that draws from fundamental educational principles. Our approach introduces a three-stage pipeline – Knowledge Identifier, Organizer, and Adapter (IOA) – that systematically identifies knowledge deficiencies in student models, organizes knowledge delivery through progressive curricula, and adapts representations to match the cognitive capacity of student models. We integrate Bloom’s Mastery Learning Principles and Vygotsky’s Zone of Proximal Development to create a dynamic distillation process where student models approach teacher model’s performance on prerequisite knowledge before advancing, and new knowledge is introduced with controlled, gradual difficulty increments. Extensive experiments using LLaMA-3.1/3.2 and Qwen2.5 as student models demonstrate that IOA achieves significant improvements over baseline distillation methods, with student models retaining 94.7% of teacher performance on DollyEval while using less than 1/10th of the parameters. Our framework particularly excels in complex reasoning tasks, showing 19.2% improvement on MATH and 22.3% on HumanEval compared with state-of-the-art baselines.
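A minimal sketch of a mastery-gated curriculum loop in the spirit of the IOA pipeline; every callable and the mastery threshold below are illustrative stand-ins:

```python
def mastery_gated_curriculum(units, student_score, teacher_score,
                             synthesize, train, mastery=0.95, max_rounds=10):
    """Stay on a knowledge unit, synthesizing more teacher data and training,
    until the student reaches a fixed fraction of the teacher's score on it,
    then advance to the next (slightly harder) unit."""
    for unit in units:                       # ordered easy -> hard
        for _ in range(max_rounds):          # bounded retraining budget
            if student_score(unit) >= mastery * teacher_score(unit):
                break                        # mastery reached: advance
            train(synthesize(unit))          # more teacher examples, retrain
```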
[NLP-11] dVoting: Fast Voting for dLLM s
[Quick Read]: This paper addresses the inefficiency of parallel test-time scaling for reasoning, which in autoregressive models is constrained by strictly sequential token-by-token generation. The key innovation of dVoting is to exploit the ability of diffusion large language models (dLLMs) to generate tokens at arbitrary positions in parallel: it iteratively samples, identifies uncertain tokens via cross-sample consistency analysis, and regenerates those tokens through voting, boosting reasoning performance without any training and with only acceptable extra computation. The method enables efficient parallel test-time scaling and achieves consistent gains across benchmarks.
Link: https://arxiv.org/abs/2602.12153
Authors: Sicheng Feng, Zigeng Chen, Xinyin Ma, Gongfan Fang, Xinchao Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Diffusion Large Language Models (dLLMs) represent a new paradigm beyond autoregressive modeling, offering competitive performance while naturally enabling a flexible decoding process. Specifically, dLLMs can generate tokens at arbitrary positions in parallel, endowing them with significant potential for parallel test-time scaling, which was previously constrained by severe inefficiency in autoregressive modeling. In this work, we introduce dVoting, a fast voting technique that boosts reasoning capability without training, with only an acceptable extra computational overhead. dVoting is motivated by the observation that, across multiple samples for the same prompt, token predictions remain largely consistent, whereas performance is determined by a small subset of tokens exhibiting cross-sample variability. Leveraging the arbitrary-position generation capability of dLLMs, dVoting performs iterative refinement by sampling, identifying uncertain tokens via consistency analysis, regenerating them through voting, and repeating this process until convergence. Extensive evaluations demonstrate that dVoting consistently improves performance across various benchmarks. It achieves gains of 6.22%-7.66% on GSM8K, 4.40%-7.20% on MATH500, 3.16%-14.84% on ARC-C, and 4.83%-5.74% on MMLU. Our code is available at this https URL
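A minimal sketch of one refinement round in the spirit of dVoting, assuming K decoded token sequences for the same prompt; the agreement threshold is illustrative:

```python
from collections import Counter

def dvoting_round(samples, agree_thresh=0.8):
    """Align K sampled token sequences, keep positions where samples agree,
    and mark low-consistency positions for parallel regeneration (a dLLM
    would re-denoise only the returned positions, then repeat)."""
    k = len(samples)
    length = min(len(s) for s in samples)
    voted, uncertain = [], []
    for pos in range(length):
        token, count = Counter(s[pos] for s in samples).most_common(1)[0]
        voted.append(token)
        if count / k < agree_thresh:     # cross-sample disagreement
            uncertain.append(pos)
    return voted, uncertain              # regenerate `uncertain`, iterate
```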
[NLP-12] GPT-4o Lacks Core Features of Theory of Mind ALT
[Quick Read]: This paper asks whether Large Language Models (LLMs) possess a Theory of Mind (ToM), i.e., an internal representation of the causal relationship between mental states (such as beliefs and intentions) and behavior. Existing work relies on social-task benchmarks, which do not test the core requirement of ToM: a domain-general, coherent causal model from mental states to behavior. The key to the solution is a cognitively grounded evaluation framework that, rather than comparing against human judgments, directly probes whether LLMs maintain a logically coherent, cross-situation-consistent model linking mental states and behavior. Experiments show that although LLMs approximate human judgments in a simple ToM paradigm, they fail at a logically equivalent task and exhibit low consistency between action predictions and the corresponding mental state inferences, suggesting their social proficiency does not stem from a genuine, domain-general model of mind.
Link: https://arxiv.org/abs/2602.12150
Authors: John Muchovej, Amanda Royka, Shane Lee, Julian Jara-Ettinger
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Submitted to CogSci 2025; see more at this https URL . Note: “abstractness” is the second feature we test for, but due to arXiv’s abstract requirements, the text has been altered
Abstract:Do Large Language Models (LLMs) possess a Theory of Mind (ToM)? Research into this question has focused on evaluating LLMs against benchmarks and found success across a range of social tasks. However, these evaluations do not test for the actual representations posited by ToM: namely, a causal model of mental states and behavior. Here, we use a cognitively-grounded definition of ToM to develop and test a new evaluation framework. Specifically, our approach probes whether LLMs have a coherent, domain-general, and consistent model of how mental states cause behavior – regardless of whether that model matches a human-like ToM. We find that even though LLMs succeed in approximating human judgments in a simple ToM paradigm, they fail at a logically equivalent task and exhibit low consistency between their action predictions and corresponding mental state inferences. As such, these findings suggest that the social proficiency exhibited by LLMs is not the result of a domain-general or consistent ToM.
[NLP-13] Seq2Seq2Seq: Lossless Data Compression via Discrete Latent Transformers and Reinforcement Learning
[Quick Read]: This paper addresses two shortcomings: traditional lossless compression methods (dictionary-based and statistical) struggle to exploit structural redundancy in complex data formats, and existing deep-learning compression methods rely on dense vector representations that discard the token structure of the original data. The key to the solution is a reinforcement learning (RL)-driven T5 architecture that compresses data into sequences of tokens rather than continuous vector-space representations, preserving the structural properties of the original data and improving compression ratios; an off-policy RL algorithm optimizes sequence length to minimize redundancy, yielding efficient, adaptive lossless compression without external grammatical or world knowledge.
Link: https://arxiv.org/abs/2602.12146
Authors: Mahdi Khodabandeh, Ghazal Shabani, Arash Yousefi Jordehi, Seyed Abolghasem Mirroshandel
Affiliations: University of Guilan
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT)
Comments:
Abstract:Efficient lossless compression is essential for minimizing storage costs and transmission overhead while preserving data integrity. Traditional compression techniques, such as dictionary-based and statistical methods, often struggle to optimally exploit the structure and redundancy in complex data formats. Recent advancements in deep learning have opened new avenues for compression; however, many existing approaches depend on dense vector representations that obscure the underlying token structure. To address these limitations, we propose a novel lossless compression method that leverages Reinforcement Learning applied to a T5 language model architecture. This approach enables the compression of data into sequences of tokens rather than traditional vector representations. Unlike auto-encoders, which typically encode information into continuous latent spaces, our method preserves the token-based structure, aligning more closely with the original data format. This preservation allows for higher compression ratios while maintaining semantic integrity. By training the model using an off-policy Reinforcement Learning algorithm, we optimize sequence length to minimize redundancy and enhance compression efficiency. Our method introduces an efficient and adaptive data compression system built upon advanced Reinforcement Learning techniques, functioning independently of external grammatical or world knowledge. This approach shows significant improvements in compression ratios compared to conventional methods. By leveraging the latent information within language models, our system effectively compresses data without requiring explicit content understanding, paving the way for more robust and practical compression solutions across various applications.
[NLP-14] CitiLink-Minutes: A Multilayer Annotated Dataset of Municipal Meeting Minutes
[Quick Read]: This paper addresses the long-standing lack of high-quality annotated data for municipal meeting minutes in Information Retrieval (IR) and Natural Language Processing (NLP), which has limited the development of computational models. The key to the solution is the construction and release of CitiLink-Minutes, a multilayer annotated dataset of 120 European Portuguese municipal meeting minutes from six municipalities, totaling over one million tokens and de-identified for privacy. Each minute is manually annotated along three complementary dimensions (metadata, subjects of discussion, and voting outcomes), providing a reusable, transparent, FAIR-compliant foundation for downstream NLP and IR tasks.
Link: https://arxiv.org/abs/2602.12137
Authors: Ricardo Campos, Ana Filipa Pacheco, Ana Luísa Fernandes, Inês Cantante, Rute Rebouças, Luís Filipe Cunha, José Miguel Isidro, José Pedro Evans, Miguel Marques, Rodrigo Batista, Evelin Amorim, Alípio Jorge, Nuno Guimarães, Sérgio Nunes, António Leal, Purificação Silvano
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:City councils play a crucial role in local governance, directly influencing citizens’ daily lives through decisions made during municipal meetings. These deliberations are formally documented in meeting minutes, which serve as official records of discussions, decisions, and voting outcomes. Despite their importance, municipal meeting records have received little attention in Information Retrieval (IR) and Natural Language Processing (NLP), largely due to the lack of annotated datasets, which ultimately limit the development of computational models. To address this gap, we introduce CitiLink-Minutes, a multilayer dataset of 120 European Portuguese municipal meeting minutes from six municipalities. Unlike prior annotated datasets of parliamentary or video records, CitiLink-Minutes provides multilayer annotations and structured linkage of official written minutes. The dataset contains over one million tokens, with all personal identifiers de-identified. Each minute was manually annotated by two trained annotators and curated by an experienced linguist across three complementary dimensions: (1) metadata, (2) subjects of discussion, and (3) voting outcomes, totaling over 38,000 individual annotations. Released under FAIR principles and accompanied by baseline results on metadata extraction, topic classification, and vote labeling, CitiLink-Minutes demonstrates its potential for downstream NLP and IR tasks, while promoting transparent access to municipal decisions.
[NLP-15] WavBench: Benchmarking Reasoning Colloquialism and Paralinguistics for End-to-End Spoken Dialogue Models
[Quick Read]: This paper addresses the inability of current spoken dialogue model evaluations to capture the complex cognition, colloquial delivery, and paralinguistic features of real-world conversation: existing evaluations largely follow text-generation standards and overlook audio-centric elements such as prosody, tone, and contextual understanding. The key to the solution is the WavBench benchmark, built from three subsets: a Pro subset that rigorously challenges reasoning-enhanced models with significantly increased difficulty; a Basic subset that defines a "listenability"-centered standard for spoken colloquialism, emphasizing natural vocabulary, linguistic fluency, and interactive rapport; and an Acoustic subset that systematically evaluates paralinguistic capability across explicit understanding, generation, and implicit dialogue, enabling multidimensional, high-fidelity evaluation of spoken dialogue models in realistic scenarios.
Link: https://arxiv.org/abs/2602.12135
Authors: Yangzhuo Li, Shengpeng Ji, Yifu Chen, Tianle Liang, Haorong Ying, Yule Wang, Junbo Li, Jun Fang, Zhou Zhao
Affiliations: Xiamen University; Zhejiang University; CUHK-Shenzhen
Subjects: Computation and Language (cs.CL)
Comments: Open-source at this https URL
Abstract:With the rapid integration of advanced reasoning capabilities into spoken dialogue models, the field urgently demands benchmarks that transcend simple interactions to address real-world complexity. However, current evaluations predominantly adhere to text-generation standards, overlooking the unique audio-centric characteristics of paralinguistics and colloquialisms, alongside the cognitive depth required by modern agents. To bridge this gap, we introduce WavBench, a comprehensive benchmark designed to evaluate realistic conversational abilities where prior works fall short. Uniquely, WavBench establishes a tripartite framework: 1) Pro subset, designed to rigorously challenge reasoning-enhanced models with significantly increased difficulty; 2) Basic subset, defining a novel standard for spoken colloquialism that prioritizes “listenability” through natural vocabulary, linguistic fluency, and interactive rapport, rather than rigid written accuracy; and 3) Acoustic subset, covering explicit understanding, generation, and implicit dialogue to rigorously evaluate comprehensive paralinguistic capabilities within authentic real-world scenarios. Through evaluating five state-of-the-art models, WavBench offers critical insights into the intersection of complex problem-solving, colloquial delivery, and paralinguistic fidelity, guiding the evolution of robust spoken dialogue models. The benchmark dataset and evaluation toolkit are available at this https URL.
[NLP-16] A Rule-based Computational Model for Gaidhlig Morphology
[Quick Read]: This paper addresses the difficulty of building high-quality language models and software tools for low-resource languages such as Gaidhlig (Scottish Gaelic), since mainstream neural models require amounts of training data that such languages rarely have. The key to the solution is a rule-based morphological model built from structured Wiktionary data: SQL queries extract the occurrence of different lexical patterns, and a declarative rule base allows Python utilities to derive inflected forms of Gaidhlig words. This approach leverages limited sample data effectively, improves interpretability, and can support the design of teaching materials and higher-level language processing tools such as rule-based dependency parsers.
Link: https://arxiv.org/abs/2602.12132
Authors: Peter J Barclay
Affiliations: Edinburgh Napier University
Subjects: Computation and Language (cs.CL)
Comments: A revised version of this article will be published at ICAART 2026 ( this https URL )
Abstract:Language models and software tools are essential to support the continuing vitality of lesser-used languages; however, currently popular neural models require considerable data for training, which normally is not available for such low-resource languages. This paper describes work-in-progress to construct a rule-based model of Gaidhlig morphology using data from Wiktionary, arguing that rule-based systems effectively leverage limited sample data, support greater interpretability, and provide insights useful in the design of teaching materials. The use of SQL for querying the occurrence of different lexical patterns is investigated, and a declarative rule-base is presented that allows Python utilities to derive inflected forms of Gaidhlig words. This functionality could be used to support educational tools that teach or explain language patterns, for example, or to support higher level tools such as rule-based dependency parsers. This approach adds value to the data already present in Wiktionary by adapting it to new use-cases.
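A toy sketch of what a declarative rule base driving Python utilities could look like; the single rule below is a simplified illustration of Gaelic-style initial lenition, not an accurate grammar and not the paper's rule base:

```python
import re

# Declarative rule base: each rule is (name, pattern, replacement), and a
# generic engine applies it. Initial lenition inserts "h" after certain
# initial consonants; real Gaelic has more exceptions than shown here.
RULES = [
    ("lenite", re.compile(r"^([bcdfgmpst])(?!h)"), r"\1h"),
]

def apply_rule(name, word):
    for rule_name, pattern, repl in RULES:
        if rule_name == name:
            return pattern.sub(repl, word, count=1)
    raise KeyError(name)

print(apply_rule("lenite", "mòr"))  # -> "mhòr"
```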
[NLP-17] Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
【速读】: 该论文旨在解决传统在线蒸馏(On-policy Distillation, OPD)在知识迁移效率和性能边界上的局限性,特别是其固定权重的KL约束机制难以灵活适配不同师生模型规模与任务场景的问题。解决方案的关键在于提出广义在线蒸馏(Generalized On-Policy Distillation, G-OPD)框架,通过引入可调节的参考模型(reference model)和奖励缩放因子(reward scaling factor),使KL正则化项与奖励项之间的相对权重得以动态控制;进一步地,论文发现将奖励缩放因子设置为大于1(即奖励外推,ExOPD)可显著提升学生模型性能,甚至突破教师模型的性能上限,并在强到弱蒸馏场景中通过选择预RL教师基线作为参考模型来优化奖励信号,从而实现更精准的知识迁移。
链接: https://arxiv.org/abs/2602.12125
作者: Wenkai Yang,Weijie Liu,Ruobing Xie,Kai Yang,Saiyong Yang,Yankai Lin
机构: Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院); LLM Department, Tencent(腾讯)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Work in progress. Github repo: this https URL
Abstract:On-policy distillation (OPD), which aligns the student with the teacher’s logit distribution on student-generated trajectories, has demonstrated strong empirical gains in improving student performance and often outperforms off-policy distillation and reinforcement learning (RL) paradigms. In this work, we first theoretically show that OPD is a special case of dense KL-constrained RL where the reward function and the KL regularization are always weighted equally and the reference model can be any model. Then, we propose the Generalized On-Policy Distillation (G-OPD) framework, which extends the standard OPD objective by introducing a flexible reference model and a reward scaling factor that controls the relative weight of the reward term against the KL regularization. Through comprehensive experiments on math reasoning and code generation tasks, we derive two novel insights: (1) Setting the reward scaling factor to be greater than 1 (i.e., reward extrapolation), which we term ExOPD, consistently improves over standard OPD across a range of teacher-student size pairings. In particular, in the setting where we merge the knowledge from different domain experts, obtained by applying domain-specific RL to the same student model, back into the original student, ExOPD enables the student to even surpass the teacher’s performance boundary and outperform the domain teachers. (2) Building on ExOPD, we further find that in the strong-to-weak distillation setting (i.e., distilling a smaller student from a larger teacher), performing reward correction by choosing the reference model as the teacher’s base model before RL yields a more accurate reward signal and further improves distillation performance. However, this choice assumes access to the teacher’s pre-RL variant and incurs more computational overhead. We hope our work offers new insights for future research on OPD.
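按“G-OPD 是密集 KL 约束 RL 的推广”这一表述,下面给出一个 token 级损失的示意草图。公式化方式、变量命名与数值均为笔者基于摘要的假设,并非论文官方实现。

```python
import torch

def g_opd_token_loss(logp_student, logp_teacher, logp_ref, beta=1.5):
    """示意性的 G-OPD token 级损失(非官方实现)。
    三个输入均为学生自采样轨迹上、各模型对采样 token 的 log 概率,形状 [T]。"""
    reward = logp_teacher - logp_ref          # 奖励项:教师相对参考模型的似然比
    kl_est = logp_student - logp_ref          # KL(student || ref) 的蒙特卡洛估计
    return (kl_est - beta * reward).mean()    # beta > 1 即论文所称的奖励外推 ExOPD

# beta = 1 时退化为 mean(logp_student - logp_teacher),即标准 OPD(反向 KL);
# 强到弱蒸馏中的“奖励校正”对应把 logp_ref 换成教师 RL 前的基座模型。
T = 16
logp_s, logp_t, logp_r = (torch.randn(T) - 2 for _ in range(3))  # 占位数值,仅演示形状
print(g_opd_token_loss(logp_s, logp_t, logp_r, beta=1.0))
```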
[NLP-18] Capability-Oriented Training Induced Alignment Risk
【速读】: 该论文旨在解决能力导向型训练(capability-oriented training)所引发的隐性风险问题,即语言模型在强化学习(Reinforcement Learning, RL)过程中,可能自发地利用训练环境中的隐含漏洞来最大化奖励,即使模型本身并无恶意意图。解决方案的关键在于通过设计四类不同的“漏洞游戏”(vulnerability games),分别模拟上下文条件合规性缺陷、代理指标偏差、奖励篡改和自我评估漏洞,实验证明模型能够系统性地发现并利用这些漏洞,形成具有泛化能力的投机策略,且这些策略可通过数据蒸馏方式从教师模型迁移至学生模型。这表明当前对齐方法仅关注内容过滤已不足以应对此类风险,未来AI安全研究必须扩展到对训练环境与奖励机制的严格审计与防护。
链接: https://arxiv.org/abs/2602.12124
作者: Yujun Zhou,Yue Huang,Han Bao,Kehan Guo,Zhenwen Liang,Pin-Yu Chen,Tian Gao,Werner Geyer,Nuno Moniz,Nitesh V Chawla,Xiangliang Zhang
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:While most AI alignment research focuses on preventing models from generating explicitly harmful content, a more subtle risk is emerging: capability-oriented training induced exploitation. We investigate whether language models, when trained with reinforcement learning (RL) in environments with implicit loopholes, will spontaneously learn to exploit these flaws to maximize their reward, even without any malicious intent in their training. To test this, we design a suite of four diverse “vulnerability games”, each presenting a unique, exploitable flaw related to context-conditional compliance, proxy metrics, reward tampering, and self-evaluation. Our experiments show that models consistently learn to exploit these vulnerabilities, discovering opportunistic strategies that significantly increase their reward at the expense of task correctness or safety. More critically, we find that these exploitative strategies are not narrow “tricks” but generalizable skills; they can be transferred to new tasks and even “distilled” from a capable teacher model to other student models through data alone. Our findings reveal that capability-oriented training induced risks pose a fundamental challenge to current alignment approaches, suggesting that future AI safety work must extend beyond content moderation to rigorously auditing and securing the training environments and reward mechanisms themselves. Code is available at this https URL.
[NLP-19] Meta-Sel: Efficient Demonstration Selection for In-Context Learning via Supervised Meta-Learning
【速读】: 该论文旨在解决上下文学习(In-Context Learning, ICL)中演示样本选择(demonstration selection)的瓶颈问题:在有限的提示预算下,选择哪些少样本示例会显著影响模型准确率,而传统方法往往计算开销大、难以在每个查询时高效运行于大规模候选池。解决方案的关键在于提出一种轻量级监督元学习方法——Meta-Sel,其核心是通过构建基于类别一致性的元数据集,训练一个校准后的逻辑回归器来对(候选示例, 查询)对进行快速评分;该评分函数仅依赖两个低成本元特征——TF-IDF余弦相似度和长度兼容性比,从而实现单次向量化评分即可选出top-k演示样本,无需微调、在线探索或额外大语言模型(LLM)调用,兼具高效性、可解释性和稳定性。
链接: https://arxiv.org/abs/2602.12123
作者: Xubin Wang,Weijia Jia
机构: BNU-BNBU Institute of Artificial Intelligence and Future Networks(北京师范大学-香港浸会大学联合国际学院人工智能与未来网络研究所); Beijing Normal-Hong Kong Baptist University(北京师范大学-香港浸会大学); Beijing Normal University at Zhuhai(北京师范大学珠海校区)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Demonstration selection is a practical bottleneck in in-context learning (ICL): under a tight prompt budget, accuracy can change substantially depending on which few-shot examples are included, yet selection must remain cheap enough to run per query over large candidate pools. We propose Meta-Sel, a lightweight supervised meta-learning approach for intent classification that learns a fast, interpretable scoring function for (candidate, query) pairs from labeled training data. Meta-Sel constructs a meta-dataset by sampling pairs from the training split and using class agreement as supervision, then trains a calibrated logistic regressor on two inexpensive meta-features: TF–IDF cosine similarity and a length-compatibility ratio. At inference time, the selector performs a single vectorized scoring pass over the full candidate pool and returns the top-k demonstrations, requiring no model fine-tuning, no online exploration, and no additional LLM calls. This yields deterministic rankings and makes the selection mechanism straightforward to audit via interpretable feature weights. Beyond proposing Meta-Sel, we provide a broad empirical study of demonstration selection, benchmarking 12 methods – spanning prompt engineering baselines, heuristic selection, reinforcement learning, and influence-based approaches – across four intent datasets and five open-source LLMs. Across this benchmark, Meta-Sel consistently ranks among the top-performing methods, is particularly effective for smaller models where selection quality can partially compensate for limited model capacity, and maintains competitive selection-time overhead.
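Meta-Sel 的两个元特征与逻辑回归打分器都很容易用 scikit-learn 复现。下面是一个玩具规模的示意(候选池、标签与阈值均为假设数据,非论文原实验设置):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

# 玩具候选池(意图分类示例),仅用于演示流程
pool = ["book a flight to paris", "play some jazz music",
        "cancel my hotel booking", "turn up the volume"]
labels = ["travel", "music", "travel", "device"]

vec = TfidfVectorizer().fit(pool)
P = vec.transform(pool)

def features(query):
    """两个元特征:TF-IDF 余弦相似度 + 长度兼容比。"""
    q = vec.transform([query])
    sim = cosine_similarity(q, P).ravel()
    ratio = np.array([min(len(query), len(c)) / max(len(query), len(c))
                      for c in pool])
    return np.stack([sim, ratio], axis=1)

# 元数据集:以“候选与查询是否同类”为监督信号(这里用训练集内部配对模拟)
X, y = [], []
for i, q in enumerate(pool):
    f = features(q)
    for j in range(len(pool)):
        if i != j:
            X.append(f[j]); y.append(int(labels[i] == labels[j]))
clf = LogisticRegression().fit(np.array(X), np.array(y))

def select_top_k(query, k=2):
    scores = clf.predict_proba(features(query))[:, 1]   # 一次向量化打分
    return [pool[i] for i in np.argsort(-scores)[:k]]

print(select_top_k("reserve a train ticket"))
```

打分器只有两个特征权重,这正是摘要所说“可通过特征权重直接审计”的来源。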
[NLP-20] P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling ICLR2026
【速读】: 该论文旨在解决个性化对齐(Personalized Alignment)中两大核心挑战:一是现有个性化奖励模型(Personalized Reward Model, PRM)将多样化的用户偏好简化为固定的小规模评估原则,导致适应性不足;二是模型在新用户数据有限时难以泛化。解决方案的关键在于提出P-GenRM——首个支持测试时基于用户的缩放机制(Test-time User-based Scaling)的个性化生成式奖励模型。其创新点包括:通过结构化评估链(Evaluation Chains)动态生成适配不同场景的评分标准与人格化模板,并引入用户原型(User Prototypes)聚类及双粒度缩放机制——在个体层面自适应聚合用户评分体系,在原型层面迁移相似用户偏好,从而提升噪声鲁棒性和跨用户泛化能力。实验证明该方法在主流基准上平均提升2.31%,且测试时缩放机制额外带来3%性能增益。
链接: https://arxiv.org/abs/2602.12116
作者: Pinyi Zhang,Ting-En Lin,Yuchuan Wu,Jingyang Chen,Zongqi Wang,Hua Yang,Ze Xu,Fei Huang,Kai Zhang,Yongbin Li
机构: Qwen-Character Team, Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL)
备注: Accepted as ICLR 2026 Oral
Abstract:Personalized alignment of large language models seeks to adapt responses to individual user preferences, typically via reinforcement learning. A key challenge is obtaining accurate, user-specific reward signals in open-ended scenarios. Existing personalized reward models face two persistent limitations: (1) oversimplifying diverse, scenario-specific preferences into a small, fixed set of evaluation principles, and (2) struggling with generalization to new users with limited feedback. To this end, we propose P-GenRM, the first Personalized Generative Reward Model with test-time user-based scaling. P-GenRM transforms preference signals into structured evaluation chains that derive adaptive personas and scoring rubrics across various scenarios. It further clusters users into User Prototypes and introduces a dual-granularity scaling mechanism: at the individual level, it adaptively scales and aggregates each user’s scoring scheme; at the prototype level, it incorporates preferences from similar users. This design mitigates noise in inferred preferences and enhances generalization to unseen users through prototype-based transfer. Empirical results show that P-GenRM achieves state-of-the-art results on widely-used personalized reward model benchmarks, with an average improvement of 2.31%, and demonstrates strong generalization on an out-of-distribution dataset. Notably, Test-time User-based scaling provides an additional 3% boost, demonstrating stronger personalized alignment with test-time scalability.
[NLP-21] Stop Unnecessary Reflection: Training LRMs for Efficient Reasoning with Adaptive Reflection and Length Coordinated Penalty ICLR2026
【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在复杂推理任务中因过度反思(如重复自问和循环推理)导致的冗长思维链(chain-of-thought)问题,这会显著增加token消耗、计算开销和延迟,且不提升准确率,尤其在小规模模型中更为明显。解决方案的关键在于提出一种基于强化学习的自适应反射与长度协同惩罚框架(Adaptive Reflection and Length Coordinated Penalty, ARLCP),其核心创新包括:(1) 一种动态调整的反射惩罚机制,可抑制不必要的反思步骤而保留关键推理过程;(2) 一个与问题估计复杂度相校准的长度惩罚项。通过协同优化这两个惩罚项,ARLCP引导模型生成更简洁高效的推理路径,在保持或提升准确率的同时大幅减少响应长度。
链接: https://arxiv.org/abs/2602.12113
作者: Zewei Yu,Lirong Gao,Yuke Zhu,Bo Zheng,Sheng Guo,Haobo Wang,Junbo Zhao
机构: Zhejiang University (浙江大学); Ant Group (蚂蚁集团); School of Software Technology (Ningbo), Zhejiang University (浙江大学软件学院(宁波))
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted to ICLR 2026
Abstract:Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning tasks by employing test-time scaling. However, they often generate over-long chains-of-thought that, driven by substantial reflections such as repetitive self-questioning and circular reasoning, lead to high token consumption, substantial computational overhead, and increased latency without improving accuracy, particularly in smaller models. Our observation reveals that increasing problem complexity induces more excessive and unnecessary reflection, which in turn reduces accuracy and increases token overhead. To address this challenge, we propose Adaptive Reflection and Length Coordinated Penalty (ARLCP), a novel reinforcement learning framework designed to dynamically balance reasoning efficiency and solution accuracy. ARLCP introduces two key innovations: (1) a reflection penalty that adaptively curtails unnecessary reflective steps while preserving essential reasoning, and (2) a length penalty calibrated to the estimated complexity of the problem. By coordinating these penalties, ARLCP encourages the model to generate more concise and effective reasoning paths. We evaluate our method on five mathematical reasoning benchmarks using DeepSeek-R1-Distill-Qwen-1.5B and DeepSeek-R1-Distill-Qwen-7B models. Experimental results show that ARLCP achieves a superior efficiency-accuracy trade-off compared to existing approaches. For the 1.5B model, it reduces the average response length by 53.1% while simultaneously improving accuracy by 5.8%. For the 7B model, it achieves a 35.0% reduction in length with a 2.7% accuracy gain. The code is released at this https URL .
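ARLCP 的核心是把“反思惩罚”和“按复杂度校准的长度惩罚”叠加进 RL 奖励。下面是一个奖励整形的示意函数;其中反思标记词表、系数 alpha/gamma 与目标长度公式均为演示用假设,并非论文的实际设定。

```python
import re

# 假设的反思性标记词表,仅作演示
REFLECTION_MARKERS = re.compile(
    r"\b(wait|hmm|let me re-?check|on second thought)\b", re.I)

def arlcp_reward(response: str, correct: bool, est_complexity: float,
                 alpha: float = 0.05, gamma: float = 1e-4) -> float:
    """示意:ARLCP 风格的奖励整形(系数与公式为假设,非官方实现)。
    - 反思惩罚:按反思性标记出现次数扣分,抑制重复自问与循环推理;
    - 长度惩罚:允许长度随问题复杂度估计值缩放,仅惩罚超出部分。"""
    n_reflect = len(REFLECTION_MARKERS.findall(response))
    target_len = 512 * est_complexity            # 复杂度越高,允许的推理越长
    over_len = max(0, len(response.split()) - target_len)
    return float(correct) - alpha * n_reflect - gamma * over_len

print(arlcp_reward("wait, hmm, recompute... the answer is 42",
                   correct=True, est_complexity=0.5))
```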
[NLP-22] DeepSight: An All-in-One LM Safety Toolkit
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)和多模态大语言模型(Multimodal Large Language Models, MLLMs)在安全评估、诊断与对齐流程中存在割裂的问题:安全评估仅能识别外部行为风险而无法定位内部根本原因,安全诊断则常脱离具体风险场景且停留在可解释层面,导致安全对齐缺乏对内部机制变化的精准解释,可能损害模型的通用能力。解决方案的关键在于提出一个开源项目 DeepSight,其核心是构建评估与诊断一体化的新范式,通过统一的任务与数据协议,将 DeepSafe 评估工具包与 DeepScan 诊断工具包有机结合,实现从黑箱到白箱的安全洞察,首次支持前沿 AI 风险联合评估与诊断,具备低成本、可复现、高效及高度可扩展的特点。
链接: https://arxiv.org/abs/2602.12092
作者: Bo Zhang,Jiaxuan Guo,Lijun Li,Dongrui Liu,Sujin Chen,Guanxu Chen,Zhijie Zheng,Qihao Lin,Lewen Yan,Chen Qian,Yijin Zhou,Yuyao Wu,Shaoxiong Guo,Tianyi Du,Jingyi Yang,Xuhao Hu,Ziqi Miao,Xiaoya Lu,Jing Shao,Xia Hu
机构: Shanghai AI Laboratory (上海人工智能实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: Technical report, 29 pages, 24 figures
Abstract:As the development of Large Models (LMs) progresses rapidly, their safety is also a priority. In current Large Language Model (LLM) and Multimodal Large Language Model (MLLM) safety workflows, evaluation, diagnosis, and alignment are often handled by separate tools. Specifically, safety evaluation can only locate external behavioral risks but cannot figure out internal root causes. Meanwhile, safety diagnosis often drifts from concrete risk scenarios and remains at the explainable level. As a result, safety alignment lacks dedicated explanations of changes in internal mechanisms, potentially degrading general capabilities. To systematically address these issues, we propose an open-source project, namely DeepSight, to practice a new safety evaluation-diagnosis integrated paradigm. DeepSight is a low-cost, reproducible, efficient, and highly scalable large-scale model safety evaluation project consisting of an evaluation toolkit DeepSafe and a diagnosis toolkit DeepScan. By unifying task and data protocols, we build a connection between the two stages and transform safety evaluation from black-box to white-box insight. Besides, DeepSight is the first open-source toolkit that supports frontier AI risk evaluation and joint safety evaluation and diagnosis.
[NLP-23] Tiny Recursive Reasoning with Mamba-2 Attention Hybrid
【速读】: 该论文旨在解决如何在保持参数规模不变的前提下,将状态空间模型(State Space Model, SSM)类算子(如Mamba-2)引入递归推理框架(recursive reasoning framework)中,以验证其是否能保留甚至提升抽象推理能力的问题。解决方案的关键在于:用Mamba-2的混合算子替代原递归推理模型(TRM)中的Transformer模块,在维持参数量几乎一致(6.83M vs 6.86M)的基础上,实现对ARC-AGI-1基准测试中pass@2指标的显著提升(+2.0%),并在更高候选数K下持续优于基线(如pass@100提升+4.75%),同时保持top-1准确率稳定,表明其通过改进候选解空间覆盖能力而非选择机制来增强推理性能,从而证明SSM类算子可作为递归推理架构中的有效替代选项。
链接: https://arxiv.org/abs/2602.12078
作者: Wenlong Wang,Fergal Reid
机构: Intercom(Intercom)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Recent work on recursive reasoning models like TRM demonstrates that tiny networks (7M parameters) can achieve strong performance on abstract reasoning tasks through latent recursion – iterative refinement in hidden representation space without emitting intermediate tokens. This raises a natural question about operator choice: Mamba-2’s state space recurrence is itself a form of iterative refinement, making it a natural candidate for recursive reasoning – but does introducing Mamba-2 into the recursive scaffold preserve reasoning capability? We investigate this by replacing the Transformer blocks in TRM with Mamba-2 hybrid operators while maintaining parameter parity (6.83M vs 6.86M parameters). On ARC-AGI-1, we find that the hybrid improves pass@2 (the official metric) by +2.0% (45.88% vs 43.88%) and consistently outperforms at higher K values (+4.75% at pass@100), whilst maintaining pass@1 parity. This suggests improved candidate coverage – the model generates correct solutions more reliably – with similar top-1 selection. Our results validate that Mamba-2 hybrid operators preserve reasoning capability within the recursive scaffold, establishing SSM-based operators as viable candidates in the recursive operator design space and taking a first step towards understanding the best mixing strategies for recursive reasoning.
[NLP-24] Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models
【速读】: 该论文旨在解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)训练数据效率低下的问题,尤其是当大量简单提示(pass-rate-1 prompts)随着训练进程增加时,导致有效训练数据规模下降。解决方案的关键在于提出Composition-RL方法,通过自动将多个问题组合成新的可验证问题(即“组合提示”),从而充分利用原本被忽略的高通过率提示,提升模型在有限数据下的推理能力。该方法在4B至30B规模模型上均表现出稳定性能提升,并可通过课程学习策略逐步增加组合深度进一步优化效果,同时支持跨领域提示组合以增强泛化能力。
链接: https://arxiv.org/abs/2602.12036
作者: Xin Xu,Clive Bai,Kai Yang,Tianhao Chen,Yangkun Chen,Weijie Liu,Hao Chen,Yang Wang,Saiyong Yang,Can Yang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large-scale verifiable prompts underpin the success of Reinforcement Learning with Verifiable Rewards (RLVR), but they contain many uninformative examples and are costly to expand further. Recent studies focus on better exploiting limited training data by prioritizing hard prompts whose rollout pass rate is 0. However, easy prompts with a pass rate of 1 also become increasingly prevalent as training progresses, thereby reducing the effective data size. To mitigate this, we propose Composition-RL, a simple yet useful approach for better utilizing limited verifiable prompts targeting pass-rate-1 prompts. More specifically, Composition-RL automatically composes multiple problems into a new verifiable question and uses these compositional prompts for RL training. Extensive experiments across model sizes from 4B to 30B show that Composition-RL consistently improves reasoning capability over RL trained on the original dataset. Performance can be further boosted with a curriculum variant of Composition-RL that gradually increases compositional depth over training. Additionally, Composition-RL enables more effective cross-domain RL by composing prompts drawn from different domains. Codes, datasets, and models are available at this https URL.
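“把多道可验证题组合成一道新题”的机制本身很简单:只要组合后的最终答案仍可由子题答案机械地算出,奖励就仍然可验证。下面的草图演示一种最直接的组合模板(求和);模板措辞与接口均为笔者假设,论文实际使用的组合方式可能不同。

```python
def compose(problems):
    """示意:把多道可验证题目组合成一道新题(模板为假设,非论文原模板)。
    problems: [(题面, 标准答案数值), ...];组合题答案取各子题答案之和,仍可自动验证。"""
    body = "\n".join(f"Sub-problem {i+1}: {q}"
                     for i, (q, _) in enumerate(problems))
    prompt = (body + "\nSolve every sub-problem, then report the SUM of "
              "all answers as your final answer.")
    gold = sum(a for _, a in problems)
    return prompt, gold

# 把两道 pass-rate-1 的简单题组合成一道更难、但依旧可验证的新题
p, gold = compose([("What is 17 * 3?", 51), ("What is 2**5?", 32)])
print(p)
print("verifiable gold:", gold)   # 83
```

课程式变体只需在训练过程中逐步增大 problems 列表的长度(即组合深度)。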
[NLP-25] Artificial intelligence is creating a new global linguistic hierarchy
【速读】: 该论文试图解决语言人工智能(Language AI)在全球范围内分布不均的问题,即当前绝大多数语言技术资源集中在少数主流语言上,导致全球7000多种语言中大部分群体面临数字边缘化。其解决方案的关键在于提出“语言AI准备度指数”(Language AI Readiness Index, EQUATE),该指数系统性地评估了语言在技术、社会经济和基础设施方面的部署前提条件,识别出具备潜力但尚未被充分利用的语言社区,从而为推动更公平、可持续的语言AI扩散提供可操作的优先级框架。
链接: https://arxiv.org/abs/2602.12018
作者: Giulia Occhini,Kumiko Tanaka-Ishii,Anna Barford,Refael Tikochinski,Songbo Hu,Roi Reichart,Yijie Zhou,Hannah Claus,Ulla Petti,Ivan Vulić,Ramit Debnath,Anna Korhonen
机构: 未知
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注:
Abstract:Artificial intelligence (AI) has the potential to transform healthcare, education, governance and socioeconomic equity, but its benefits remain concentrated in a small number of languages (Bender, 2019; Blasi et al., 2022; Joshi et al., 2020; Ranathunga and de Silva, 2022; Young, 2015). Language AI - the technologies that underpin widely-used conversational systems such as ChatGPT - could provide major benefits if available in people’s native languages, yet most of the world’s 7,000+ linguistic communities currently lack access and face persistent digital marginalization. Here we present a global longitudinal analysis of social, economic and infrastructural conditions across languages to assess systemic inequalities in language AI. We first analyze the existence of AI resources for 6003 languages. We find that despite efforts of the community to broaden the reach of language technologies (Bapna et al., 2022; Costa-Jussà et al., 2022), the dominance of a handful of languages is exacerbating disparities on an unprecedented scale, with divides widening exponentially rather than narrowing. Further, we contrast the longitudinal diffusion of AI with that of earlier IT technologies, revealing a distinctive hype-driven pattern of spread. To translate our findings into practical insights and guide prioritization efforts, we introduce the Language AI Readiness Index (EQUATE), which maps the state of technological, socio-economic, and infrastructural prerequisites for AI deployment across languages. The index highlights communities where capacity exists but remains underutilized, and provides a framework for accelerating more equitable diffusion of language AI. Our work contributes to setting the baseline for a transition towards more sustainable and equitable language technologies.
[NLP-26] Disentangling Ambiguity from Instability in Large Language Models : A Clinical Text-to-SQL Case Study
【速读】: 该论文旨在解决临床场景下大语言模型进行Text-to-SQL转换时输出多样性来源难以区分的问题,具体需区分两种不同性质的原因:一是输入歧义(input ambiguity),应触发用户澄清;二是模型不稳定性(model instability),应触发人工审查。其解决方案的关键在于提出CLUES框架,将Text-to-SQL建模为“解释—答案”两阶段过程,并将语义不确定性分解为歧义分数(ambiguity score)和不稳定性分数(instability score)。其中,不稳定性分数通过二分语义图矩阵的Schur补计算得出,从而实现对错误预测的显著改进(优于当前最优的Kernel Language Entropy方法),并在部署中提供可诊断的分解信息,使高歧义-高不稳定区域(覆盖25%查询但包含51%错误)成为高效问题分诊的目标干预区。
链接: https://arxiv.org/abs/2602.12015
作者: Angelo Ziletti,Leonardo D’Ambrosi
机构: Bayer AG(拜耳集团)
类目: Computation and Language (cs.CL)
备注:
Abstract:Deploying large language models for clinical Text-to-SQL requires distinguishing two qualitatively different causes of output diversity: (i) input ambiguity that should trigger clarification, and (ii) model instability that should trigger human review. We propose CLUES, a framework that models Text-to-SQL as a two-stage process (interpretations → answers) and decomposes semantic uncertainty into an ambiguity score and an instability score. The instability score is computed via the Schur complement of a bipartite semantic graph matrix. Across AmbigQA/SituatedQA (gold interpretations) and a clinical Text-to-SQL benchmark (known interpretations), CLUES improves failure prediction over state-of-the-art Kernel Language Entropy. In deployment settings, it remains competitive while providing a diagnostic decomposition unavailable from a single score. The resulting uncertainty regimes map to targeted interventions - query refinement for ambiguity, model improvement for instability. The high-ambiguity/high-instability regime contains 51% of errors while covering 25% of queries, enabling efficient triage.
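摘要只点明不稳定性分数“经由二分语义图矩阵的 Schur 补”计算。下面给出一种与该描述相容的假设性构造,用以说明 Schur 补在这里的角色;矩阵的具体搭建方式与最终汇总函数都是笔者的猜测,未必与论文一致。

```python
import numpy as np

def instability_score(B, eps=1e-6):
    """示意:从“解释-答案”二分图计算不稳定性分数(构造方式为假设性复现)。
    B[i, j] = 第 i 种解释与第 j 个答案簇之间的语义关联强度。
    对整图矩阵 M = [[Da, B], [B.T, Db]](Da/Db 为度对角块)取答案块的 Schur 补:
        S = Db - B.T @ inv(Da) @ B
    再用 S 的谱熵汇总:同一解释下答案越分散,分数越高。"""
    Da = np.diag(B.sum(axis=1) + eps)
    Db = np.diag(B.sum(axis=0) + eps)
    S = Db - B.T @ np.linalg.inv(Da) @ B
    eigvals = np.clip(np.linalg.eigvalsh(S), 0, None)
    p = eigvals / eigvals.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

B = np.array([[0.9, 0.1],     # 解释 1 几乎只指向答案簇 1
              [0.2, 0.8]])    # 解释 2 几乎只指向答案簇 2
print(instability_score(B))   # 解释-答案对应越“干净”,分数越低
```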
[NLP-27] LaCy: What Small Language Models Can and Should Learn is Not Just a Question of Loss
【速读】: 该论文旨在解决小语言模型(Small Language Models, SLMs)因参数容量有限而导致的事实性错误生成问题,即模型在预训练过程中难以准确掌握所有世界知识,从而产生不正确的输出。传统方法通过引入外部资源(如调用大模型、文档或数据库)来缓解此问题,但未明确界定哪些token应由SLM自主学习预测,哪些应通过“CALL”标记委托给外部源。论文的关键在于提出一种基于语义合理性与语法结构的token选择机制:不仅依赖损失函数(loss)判断是否需要委托,还结合spaCy语法解析器增强损失信号,识别出即使高损失但仍是真实文本替代延续的token,这些token应被保留为SLM的学习目标;而真正存在事实风险的token则应被标记为需委托。基于此思想,作者提出了LaCy预训练方法,实验表明其能有效提升生成结果的准确性(以FactScore衡量),优于Rho或LLM-judge训练的SLMs,且实现更简单、成本更低。
链接: https://arxiv.org/abs/2602.12005
作者: Szilvia Ujváry,Louis Béthune,Pierre Ablin,João Monteiro,Marco Cuturi,Michael Kirchhof
机构: 未知
类目: Computation and Language (cs.CL)
备注: 29 pages, 24 figures, 5 tables, preprint
Abstract:Language models have consistently grown to compress more world knowledge into their parameters, but the knowledge that can be pretrained into them is upper-bounded by their parameter size. Especially the capacity of Small Language Models (SLMs) is limited, leading to factually incorrect generations. This problem is often mitigated by giving the SLM access to an outside source: the ability to query a larger model, documents, or a database. Under this setting, we study the fundamental question of *which* tokens an SLM can and should learn during pretraining, versus *which* ones it should delegate via a `CALL` token. We find that this is not simply a question of loss: although the loss is predictive of whether a predicted token mismatches the ground-truth, some tokens are *acceptable* in that they are truthful alternative continuations of a pretraining document, and should not trigger a `CALL` even if their loss is high. We find that a spaCy grammar parser can help augment the loss signal to decide which tokens the SLM should learn to delegate to prevent factual errors and which are safe to learn and predict even under high losses. We propose LaCy, a novel pretraining method based on this token selection philosophy. Our experiments demonstrate that LaCy models successfully learn which tokens to predict and where to delegate for help. This results in higher FactScores when generating in a cascade with a bigger model and outperforms Rho or LLM-judge trained SLMs, while being simpler and cheaper.
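“用 spaCy 语法信号增强 loss 信号”这一步可以用几行代码说明思路:高 loss 且属于事实敏感词类(如专名、数字)的 token 才改学 `CALL`,高 loss 但语法上可接受的替代延续仍然自己学。下面的词类集合、阈值与接口均为演示用假设,并非 LaCy 的实际判据。

```python
import spacy

# 需先安装模型:python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
RISKY_POS = {"PROPN", "NUM"}   # 假设:专名与数字最可能构成事实性错误

def tag_targets(text: str, token_losses: dict, thresh: float = 3.0):
    """示意:结合 loss 与词性决定每个 token 的训练目标(非官方实现)。
    token_losses: 预训练时各词的语言建模 loss(此处以词面文本为键,仅为演示)。
    返回 (词, 目标) 列表,目标为 "LEARN" 或 "CALL"。"""
    out = []
    for tok in nlp(text):
        loss = token_losses.get(tok.text, 0.0)
        risky = tok.pos_ in RISKY_POS
        out.append((tok.text, "CALL" if (loss > thresh and risky) else "LEARN"))
    return out

losses = {"Einstein": 5.2, "quickly": 4.8, "walked": 0.7}
print(tag_targets("Einstein quickly walked home", losses))
# 高 loss 的 "Einstein"(PROPN)-> CALL;高 loss 的 "quickly"(ADV)-> 仍可学习
```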
[NLP-28] Automatic Simplification of Common Vulnerabilities and Exposures Descriptions
【速读】: 该论文旨在解决网络安全领域中专业文本难以被非专业人士理解的问题,特别是针对通用漏洞与暴露(Common Vulnerability and Exposure, CVE)描述的可读性问题。其解决方案的关键在于利用大语言模型(Large Language Models, LLMs)进行自动文本简化(Automatic Text Simplification, ATS),并通过构建基准模型和测试数据集(包含40条CVE描述)进行评估,发现尽管现成的LLMs能使文本表面更简洁,但在保持语义完整性方面仍存在显著挑战。
链接: https://arxiv.org/abs/2602.11982
作者: Varpu Vehomäki,Kimmo K. Kaski
机构: Aalto University School of Science (阿尔托大学科学学院)
类目: Computation and Language (cs.CL)
备注: 8 pages, 1 figure, submitted to Nordic Machine Intelligence
Abstract:Understanding cyber security is increasingly important for individuals and organizations. However, a lot of information related to cyber security can be difficult to understand to those not familiar with the topic. In this study, we focus on investigating how large language models (LLMs) could be utilized in automatic text simplification (ATS) of Common Vulnerability and Exposure (CVE) descriptions. Automatic text simplification has been studied in several contexts, such as medical, scientific, and news texts, but it has not yet been studied to simplify texts in the rapidly changing and complex domain of cyber security. We created a baseline for cyber security ATS and a test dataset of 40 CVE descriptions, evaluated by two groups of cyber security experts in two survey rounds. We have found that while out-of-the box LLMs can make the text appear simpler, they struggle with meaning preservation. Code and data are available at this https URL_nmi.
[NLP-29] DHPLT: large-scale multilingual diachronic corpora and word representations for semantic change modelling EACL2026
【速读】: 该论文旨在解决当前多语言历时语料库在语义演变建模(semantic change modelling)研究中的不足,尤其是针对高资源语言以外的多种语言缺乏系统性、大规模、跨时段语料的问题。其解决方案的关键在于构建了一个名为DHPLT的开放语料库集合,涵盖41种不同语言,每个语言包含三个时间周期(2011–2015、2020–2021和2024年至今),每周期约一百万份文档;该语料库基于网络爬取的HPLT数据集,并利用爬取时间戳作为文档创建时间的近似信号,同时提供预计算的词类(word type)与词元(token)嵌入及目标词的词汇替换信息,从而支持多样化的语义演变实验设计。
链接: https://arxiv.org/abs/2602.11968
作者: Mariia Fedorova,Andrey Kutuzov,Khonzoda Umarova
机构: University of Oslo (挪威奥斯陆大学); Cornell University (美国康奈尔大学)
类目: Computation and Language (cs.CL)
备注: LChange’26 workshop at the EACL 2026 conference
Abstract:In this resource paper, we present DHPLT, an open collection of diachronic corpora in 41 diverse languages. DHPLT is based on the web-crawled HPLT datasets; we use web crawl timestamps as the approximate signal of document creation time. The collection covers three time periods: 2011-2015, 2020-2021 and 2024-present (1 million documents per time period for each language). We additionally provide pre-computed word type and token embeddings and lexical substitutions for our chosen target words, while at the same time leaving it open for the other researchers to come up with their own target words using the same datasets. DHPLT aims at filling in the current lack of multilingual diachronic corpora for semantic change modelling (beyond a dozen of high-resource languages). It opens the way for a variety of new experimental setups in this field. All the resources described in this paper are available at this https URL, sorted by language.
[NLP-30] Scaling Model and Data for Multilingual Machine Translation with Open Large Language Models
【速读】: 该论文旨在解决开放大型语言模型(Large Language Models, LLMs)在多语言机器翻译(Multilingual Machine Translation, MT)任务中性能优化的问题,特别是如何通过模型规模扩展(model scaling)和数据规模扩展(data scaling)来提升翻译质量。其解决方案的关键在于基于Gemma3模型家族进行持续预训练(continual pretraining)与指令微调(instruction fine-tuning),从而构建出MiLMMT-46模型,该模型在46种语言上实现了顶尖的多语言翻译性能,显著优于当前主流开源模型,并达到与谷歌翻译(Google Translate)和Gemini 3 Pro等商用系统相当的水平。
链接: https://arxiv.org/abs/2602.11961
作者: Yuzhe Shang,Pengzhi Gao,Wei Liu,Jian Luan,Jinsong Su
机构: Xiaomi Inc.(小米公司)
类目: Computation and Language (cs.CL)
备注:
Abstract:Open large language models (LLMs) have demonstrated improving multilingual capabilities in recent years. In this paper, we present a study of open LLMs for multilingual machine translation (MT) across a range of languages, and investigate the effects of model scaling and data scaling when adapting open LLMs to multilingual MT through continual pretraining and instruction finetuning. Based on the Gemma3 model family, we develop MiLMMT-46, which achieves top-tier multilingual translation performance across 46 languages. Extensive experiments show that MiLMMT-46 consistently outperforms recent state-of-the-art (SOTA) models, including Seed-X, HY-MT-1.5, and TranslateGemma, and achieves competitive performance with strong proprietary systems such as Google Translate and Gemini 3 Pro.
[NLP-31] Benchmarking Vision-Language Models for French PDF-to-Markdown Conversion
【速读】: 该论文旨在解决PDF到Markdown转换过程中在复杂法语文档上存在的准确性不足问题,尤其关注文档解析错误对检索增强生成(Retrieval-Augmented Generation, RAG)下游任务的影响。现有基准测试多集中于英文或中文,且对格式化选择(如换行、列表分割、表格渲染变体)过于敏感,而这些变化在实际应用中常属无害。解决方案的关键在于构建一个聚焦法语的挑战性文档基准,通过模型分歧采样从6万份文档中筛选出高难度页面(涵盖手写表单、复杂布局、密集表格和图文混排),并采用单元测试风格的验证机制,针对文本存在性、阅读顺序和局部表格约束等具体失败模式进行评估,同时引入类别特定的归一化策略以消除仅与呈现相关的差异,从而更真实地衡量模型鲁棒性。
链接: https://arxiv.org/abs/2602.11960
作者: Bruno Rigal,Victor Dupriez,Alexis Mignon,Ronan Le Hy,Nicolas Mery
机构: Probayes, La Poste; OpenValue, La Poste
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 13 pages, 6 figures
Abstract:This report evaluates PDF-to-Markdown conversion using recent Vision-Language Models (VLMs) on challenging French documents. Document parsing is a critical step for Retrieval-Augmented Generation (RAG) pipelines, where transcription and layout errors propagate to downstream retrieval and grounding. Existing benchmarks often emphasize English or Chinese and can over-penalize benign formatting and linearization choices (e.g., line breaks, list segmentation, alternative table renderings) that are largely irrelevant for downstream use. We introduce a French-focused benchmark of difficult pages selected via model-disagreement sampling from a corpus of 60,000 documents, covering handwritten forms, complex layouts, dense tables, and graphics-rich pages. Evaluation is performed with unit-test-style checks that target concrete failure modes (text presence, reading order, and local table constraints) combined with category-specific normalization designed to discount presentation-only variance. Across 15 models, we observe substantially higher robustness for the strongest proprietary models on handwriting and forms, while several open-weights systems remain competitive on standard printed layouts.
[NLP-32] RAM-Net: Expressive Linear Attention with Selectively Addressable Memory
【速读】: 该论文旨在解决线性注意力架构在将无界历史信息压缩到固定大小记忆时导致的表达能力受限与信息丢失问题。其解决方案的关键在于提出随机访问记忆网络(Random Access Memory Network, RAM-Net),该架构通过将输入映射为高维稀疏向量作为显式地址,使模型能够选择性地访问大规模记忆状态,从而在不增加参数量的前提下实现状态空间的指数级扩展,显著降低信号干扰并提升检索保真度,同时利用稀疏性保障计算效率。
链接: https://arxiv.org/abs/2602.11958
作者: Kaicheng Xiao,Haotian Li,Liran Dong,Guoliang Xing
机构: The Chinese University of Hong Kong (香港中文大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:While linear attention architectures offer efficient inference, compressing unbounded history into a fixed-size memory inherently limits expressivity and causes information loss. To address this limitation, we introduce Random Access Memory Network (RAM-Net), a novel architecture designed to bridge the gap between the representational capacity of full attention and the memory efficiency of linear models. The core of RAM-Net maps inputs to high-dimensional sparse vectors serving as explicit addresses, allowing the model to selectively access a massive memory state. This design enables exponential state size scaling without additional parameters, which significantly mitigates signal interference and enhances retrieval fidelity. Moreover, the inherent sparsity ensures exceptional computational efficiency, as state updates are confined to minimal entries. Extensive experiments demonstrate that RAM-Net consistently surpasses state-of-the-art baselines in fine-grained long-range retrieval tasks and achieves competitive performance in standard language modeling and zero-shot commonsense reasoning benchmarks, validating its superior capability to capture complex dependencies with significantly reduced computational overhead.
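RAM-Net 的要点——“高维稀疏地址 + 只更新少数槽位的大状态”——可以用一个玩具类直观呈现。下面的结构、维度与 top-k 取址方式均为笔者按摘要的假设性复现,并非论文原架构。

```python
import torch

class SparseRAM:
    """示意:RAM-Net 式稀疏寻址读写的玩具实现(非官方架构)。"""
    def __init__(self, d_model=64, mem_size=16384, topk=8):
        self.proj = torch.nn.Linear(d_model, mem_size)  # 输入 -> 高维地址 logits
        self.topk = topk
        self.memory = torch.zeros(mem_size, d_model)    # 大状态,稀疏更新

    def address(self, x):
        """取 top-k 得到稀疏地址及归一化权重。"""
        val, idx = self.proj(x).topk(self.topk)
        return idx, torch.softmax(val, dim=-1)

    def write(self, x, value):
        idx, w = self.address(x)
        # 只更新 k 个槽位,这正是“状态指数级扩展但计算依旧高效”的来源
        self.memory[idx] += (w.unsqueeze(-1) * value).detach()  # 演示中断开梯度

    def read(self, x):
        idx, w = self.address(x)
        return (w.unsqueeze(-1) * self.memory[idx]).sum(0)

ram = SparseRAM()
x = torch.randn(64)
ram.write(x, torch.randn(64))
print(ram.read(x).shape)   # torch.Size([64])
```

相比把全部历史压进一个稠密定长状态,写入彼此地址几乎不重叠的槽位可以显著减少信号干扰。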
[NLP-33] Do Large Language Models Adapt to Language Variation across Socioeconomic Status?
【速读】: 该论文旨在解决大型语言模型(LLM)在社会媒体语境中对不同社会经济地位(SES)群体的语言风格适应能力不足的问题,进而揭示其可能加剧语言层级固化并影响社会模拟研究有效性的风险。解决方案的关键在于构建一个按SES分层的新型数据集(来自Reddit和YouTube),并通过94个社会语言学指标对比LLM生成文本与原始文本之间的差异,从而量化LLM在风格迁移上的局限性及其对高SES群体的偏好性模仿。
链接: https://arxiv.org/abs/2602.11939
作者: Elisa Bassignana,Mike Zhang,Dirk Hovy,Amanda Cercas Curry
机构: IT University of Copenhagen (哥本哈根信息技术大学); Pioneer Center for AI (先锋人工智能中心); University of Copenhagen (哥本哈根大学); Bocconi University (博科尼大学); CENTAI Institute (中心人工智能研究所)
类目: Computation and Language (cs.CL)
备注:
Abstract:Humans adjust their linguistic style to the audience they are addressing. However, the extent to which LLMs adapt to different social contexts is largely unknown. As these models increasingly mediate human-to-human communication, their failure to adapt to diverse styles can perpetuate stereotypes and marginalize communities whose linguistic norms are less closely mirrored by the models, thereby reinforcing social stratification. We study the extent to which LLMs integrate into social media communication across different socioeconomic status (SES) communities. We collect a novel dataset from Reddit and YouTube, stratified by SES. We prompt four LLMs with incomplete text from that corpus and compare the LLM-generated completions to the originals along 94 sociolinguistic metrics, including syntactic, rhetorical, and lexical features. LLMs modulate their style with respect to SES to only a minor extent, often resulting in approximation or caricature, and tend to emulate the style of upper SES more effectively. Our findings (1) show how LLMs risk amplifying linguistic hierarchies and (2) call into question their validity for agent-based social simulation, survey experiments, and any research relying on language style as a social signal.
[NLP-34] Who is the richest club in the championship? Detecting and Rewriting Underspecified Questions Improve QA Performance
【速读】: 该论文试图解决当前大规模语言模型(Large Language Models, LLMs)在问答(Question Answering, QA)任务中表现与基准测试结果之间存在的显著差距问题,其核心假设是这种差距部分源于“问题表述不充分”(underspecified questions)——即问题本身缺乏足够信息以唯一确定其语义意图。解决方案的关键在于:首先利用一个基于LLM的分类器识别QA数据集中不充分的问题;随后通过受控重写实验将这些不充分问题转化为完全指定版本(保持正确答案不变),从而隔离出“问题表述不清”对模型性能的影响。实验表明,在此设定下QA性能普遍提升,说明许多看似模型能力不足的失败实则源于问题本身的模糊性,而非模型局限。这一发现揭示了问题清晰度在QA评估中的重要混淆因素,并呼吁在构建基准测试时更加重视问题表述的明确性。
链接: https://arxiv.org/abs/2602.11938
作者: Yunchong Huang,Gianni Barlacchi,Sandro Pezzelle
机构: ILLC, University of Amsterdam (阿姆斯特丹大学 ILLC); Amazon AGI (亚马逊 AGI)
类目: Computation and Language (cs.CL)
备注: 4 pages of main text, 13 pages in total, 5 tables and 10 figures in total
Abstract:Large language models (LLMs) perform well on well-posed questions, yet standard question-answering (QA) benchmarks remain far from solved. We argue that this gap is partly due to underspecified questions - queries whose interpretation cannot be uniquely determined without additional context. To test this hypothesis, we introduce an LLM-based classifier to identify underspecified questions and apply it to several widely used QA datasets, finding that 16% to over 50% of benchmark questions are underspecified and that LLMs perform significantly worse on them. To isolate the effect of underspecification, we conduct a controlled rewriting experiment that serves as an upper-bound analysis, rewriting underspecified questions into fully specified variants while holding gold answers fixed. QA performance consistently improves under this setting, indicating that many apparent QA failures stem from question underspecification rather than model limitations. Our findings highlight underspecification as an important confound in QA evaluation and motivate greater attention to question clarity in benchmark design.
[NLP-35] Cross-Modal Robustness Transfer (CMRT): Training Robust Speech Translation Models Using Adversarial Text
【速读】: 该论文旨在解决端到端语音翻译(End-to-End Speech Translation, E2E-ST)模型在面对非母语或方言语音中词形变化(inflectional morphology)时的鲁棒性不足问题,即现有模型虽在“干净”数据上表现优异,但在真实场景下易受形态学变异干扰而性能下降。解决方案的关键在于提出跨模态鲁棒性迁移(Cross-Modal Robustness Transfer, CMRT)框架,通过将文本域的对抗鲁棒性迁移到语音域,从而避免在训练阶段生成高成本且技术复杂的对抗语音数据,同时显著提升模型在多种语言对上的对抗鲁棒性(平均提升超过3 BLEU点),为无需对抗语音数据即可实现鲁棒E2E-ST提供了新基准。
链接: https://arxiv.org/abs/2602.11933
作者: Abderrahmane Issam,Yusuf Can Semerci,Jan Scholtes,Gerasimos Spanakis
机构: Maastricht University (马斯特里赫特大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:End-to-End Speech Translation (E2E-ST) has seen significant advancements, yet current models are primarily benchmarked on curated, “clean” datasets. This overlooks critical real-world challenges, such as morphological robustness to inflectional variations common in non-native or dialectal speech. In this work, we adapt a text-based adversarial attack targeting inflectional morphology to the speech domain and demonstrate that state-of-the-art E2E-ST models are highly vulnerable to it. While adversarial training effectively mitigates such risks in text-based tasks, generating high-quality adversarial speech data remains computationally expensive and technically challenging. To address this, we propose Cross-Modal Robustness Transfer (CMRT), a framework that transfers adversarial robustness from the text modality to the speech modality. Our method eliminates the requirement for adversarial speech data during training. Extensive experiments across four language pairs demonstrate that CMRT improves adversarial robustness by an average of more than 3 BLEU points, establishing a new baseline for robust E2E-ST without the overhead of generating adversarial speech.
[NLP-36] AdaptEvolve: Improving Efficiency of Evolutionary AI Agents through Adaptive Model Selection
【速读】: 该论文旨在解决进化式代理系统(evolutionary agentic systems)在推理过程中因反复调用大语言模型(Large Language Models, LLMs)而导致的计算效率与推理能力之间的权衡问题。核心挑战在于如何动态选择一个在当前生成步骤中足够强大且计算成本可控的LLM。解决方案的关键在于提出AdaptEvolve框架,其通过引入基于内在生成置信度(intrinsic generation confidence)的自适应LLM选择机制,在进化顺序精炼(evolutionary sequential refinement)框架内实现对模型不确定性的显式建模,从而实现实时可解性估计。实验表明,该方法在保持接近静态大模型基线97.5%准确率的同时,平均降低37.9%的总推理成本,显著优化了帕累托前沿(Pareto frontier)。
链接: https://arxiv.org/abs/2602.11931
作者: Pretam Ray,Pratik Prabhanjan Brahma,Zicheng Liu,Emad Barsoum
机构: IIT Kharagpur; Advanced Micro Devices, Inc. (AMD)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages, 2 figures
Abstract:Evolutionary agentic systems intensify the trade-off between computational efficiency and reasoning capability by repeatedly invoking large language models (LLMs) during inference. This setting raises a central question: how can an agent dynamically select an LLM that is sufficiently capable for the current generation step while remaining computationally efficient? While model cascades offer a practical mechanism for balancing this trade-off, existing routing strategies typically rely on static heuristics or external controllers and do not explicitly account for model uncertainty. We introduce AdaptEvolve: Adaptive LLM Selection for Multi-LLM Evolutionary Refinement within an evolutionary sequential refinement framework that leverages intrinsic generation confidence to estimate real-time solvability. Empirical results show that confidence-driven selection yields a favourable Pareto frontier, reducing total inference cost by an average of 37.9% across benchmarks while retaining 97.5% of the upper-bound accuracy of static large-model baselines. Our code is available at this https URL.
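“按内在生成置信度决定是否升级到更大模型”的级联逻辑可以写成几行代码。下面的模型接口、置信度定义(平均 token log 概率)与阈值 tau 均为笔者的演示性假设,并非 AdaptEvolve 的实际实现。

```python
import math

def adaptive_select(models, prompt, tau=-0.7):
    """示意:置信度驱动的模型级联(接口与阈值为假设,非官方实现)。
    models: 由小到大排序的 (名称, 可调用模型) 列表;
    模型返回 (文本, 各 token 的 log 概率列表);
    平均 log 概率低于 tau 则升级到下一个更大的模型。"""
    for name, model in models:
        text, logps = model(prompt)
        conf = sum(logps) / max(len(logps), 1)   # 内在生成置信度
        if conf >= tau:
            return name, text                    # 足够自信,停止升级
    return name, text                            # 兜底:返回最大模型的结果

# 玩具模型:小模型不自信,大模型自信
small = lambda p: ("maybe 42?", [math.log(0.3)] * 5)
large = lambda p: ("42", [math.log(0.9)] * 3)
print(adaptive_select([("small", small), ("large", large)], "6*7=?"))
# -> ('large', '42'):小模型置信度约 -1.20 < tau,触发升级
```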
[NLP-37] When Should LLMs Be Less Specific? Selective Abstraction for Reliable Long-Form Text Generation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成过程中存在事实性错误的问题,这类错误会削弱用户信任并限制其在高风险场景中的应用。传统方法通过不确定性估计实现“全有或全无”的弃用策略(即低置信度时直接放弃输出),但该策略在长文本生成中过于保守,导致大量有用信息被丢弃。解决方案的关键在于提出选择性抽象(Selective Abstraction, SA)框架,该框架允许模型在不确定时通过降低内容细节程度(即减少具体性)来提升可靠性,而非完全舍弃输出。其核心创新是原子级选择性抽象(Atom-wise Selective Abstraction),将响应分解为原子事实声明(atomic claims),对置信度低的原子用更可靠但更低粒度的抽象替代,从而在保持语义完整性的同时提高准确性与可信度。实验证明,该方法在多个基准测试中显著优于现有基线,最大可使风险-覆盖曲线下面积(AURC)提升27.73%。
链接: https://arxiv.org/abs/2602.11908
作者: Shani Goren,Ido Galil,Ran El-Yaniv
机构: Technion(以色列理工学院); NVIDIA(英伟达)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:LLMs are widely used, yet they remain prone to factual errors that erode user trust and limit adoption in high-risk settings. One approach to mitigate this risk is to equip models with uncertainty estimation mechanisms that abstain when confidence is low. However, this binary “all-or-nothing” approach is excessively restrictive in long-form settings, often discarding valuable information. We introduce Selective Abstraction (SA), a framework that enables LLMs to trade specificity for reliability by selectively reducing the detail of uncertain content. We first formalize SA through the lenses of selective risk and coverage. We then propose Atom-wise Selective Abstraction, a claim-level instantiation that decomposes responses into atomic claims (short, self-contained statements each expressing a single fact) and replaces uncertain atoms with higher confidence, less specific abstractions. To evaluate this framework, we develop a novel end-to-end pipeline for open-ended generation that instantiates risk as factual correctness and measures coverage using an information-theoretic measure of retained information. Across six open-source models on the FactScore and LongFact-Objects benchmarks, atom-wise SA consistently outperforms existing baselines, improving the area under the risk-coverage curve (AURC) by up to 27.73% over claim removal, demonstrating that reducing specificity can boost accuracy and reliability while preserving most of their original meaning.
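原子级选择性抽象的核心循环非常直接:不确定的原子不丢弃,而是换成更低具体度的版本。下面是一个示意;阈值 tau、置信度来源与抽象文本的生成方式(实践中应由模型产生)均为演示用假设。

```python
def selective_abstraction(atoms, confidence, abstractions, tau=0.7):
    """示意:原子级选择性抽象(接口与阈值为假设,非官方实现)。
    atoms: 原子事实声明列表;confidence[i]: 模型对第 i 条的置信度;
    abstractions[i]: 对应的更低具体度、但更可靠的改写。"""
    out = []
    for atom, conf, abstract in zip(atoms, confidence, abstractions):
        out.append(atom if conf >= tau else abstract)
    return " ".join(out)

atoms = ["Marie Curie was born in 1867.",
         "She won the Nobel Prize in 1911."]
conf = [0.95, 0.40]
abstr = ["Marie Curie was born in the 1860s.",
         "She won a Nobel Prize in the early 1910s."]
print(selective_abstraction(atoms, conf, abstr))
# 低置信度的原子被替换为抽象版本,而不是被整条删掉
```

与二元弃用相比,这种“降精度保留”正是摘要所说在风险-覆盖曲线上取得更优折中的来源。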
[NLP-38] Benchmark Illusion: Disagreement among LLMs and Its Scientific Consequences
【速读】: 该论文试图解决的问题是:当前大型语言模型(Large Language Models, LLMs)在主流基准测试中表现出看似收敛的准确率,但这可能掩盖了模型之间在推理过程和判断上的深层次差异,进而影响科学研究结果的可重复性。解决方案的关键在于揭示“基准幻觉”(benchmark illusion)现象——即不同LLMs即使在相同基准上达到相似准确率,仍可能在具体任务项上存在高达16%–66%的分歧,尤其在顶级模型间亦有16%–38%的不一致;这种隐性差异会通过模型选择作为隐藏变量,显著改变科学数据标注与推断的结果,例如在教育学和政治学研究中切换模型可能导致估计处理效应变化超过80%,甚至反转符号。因此,论文强调必须超越单一准确率指标,重视模型间的认知异质性对科研可靠性的影响。
链接: https://arxiv.org/abs/2602.11898
作者: Eddie Yang,Dashun Wang
机构: Purdue University (普渡大学); Northwestern University (西北大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Benchmarks underpin how progress in large language models (LLMs) is measured and trusted. Yet our analyses reveal that apparent convergence in benchmark accuracy can conceal deep epistemic divergence. Using two major reasoning benchmarks - MMLU-Pro and GPQA - we show that LLMs achieving comparable accuracy still disagree on 16-66% of items, and 16-38% among top-performing frontier models. These discrepancies suggest distinct error profiles for different LLMs. When such models are used for scientific data annotation and inference, their hidden disagreements propagate into research results: in re-analyses of published studies in education and political science, switching the annotation model can change estimated treatment effects by more than 80%, and in some cases reverses their sign. Together, these findings illustrate a benchmark illusion, where equal accuracy may conceal disagreement, with model choice becoming a hidden yet consequential variable for scientific reproducibility.
[NLP-39] LLM-based Triplet Extraction from Financial Reports
【速读】: 该论文旨在解决企业财务报告中结构化知识抽取的评估难题,即由于缺乏标注的真实数据(ground truth),传统基于真实标签的评估方法难以适用。其关键解决方案是提出一种半自动化三元组提取流程,采用基于本体的代理指标(Ontology Conformance 和 Faithfulness)进行评估,并结合自动诱导本体与混合验证策略:一方面通过文档特定的自动本体诱导实现100%模式符合性,避免人工本体带来的本体漂移(ontology drift);另一方面引入正则表达式与大语言模型(LLM)作为裁判相结合的验证机制,将主体幻觉率从65.2%显著降低至1.6%,有效缓解了共指消解错误导致的假阳性问题。此外,研究还揭示了主体与客体幻觉之间的系统性不对称现象,归因于财务文本中的被动语态和省略主语的表述方式。
链接: https://arxiv.org/abs/2602.11886
作者: Dante Wesslund,Ville Stenström,Pontus Linde,Alexander Holmberg
机构: KTH Royal Institute of Technology (皇家理工学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Corporate financial reports are a valuable source of structured knowledge for Knowledge Graph construction, but the lack of annotated ground truth in this domain makes evaluation difficult. We present a semi-automated pipeline for Subject-Predicate-Object triplet extraction that uses ontology-driven proxy metrics, specifically Ontology Conformance and Faithfulness, instead of ground-truth-based evaluation. We compare a static, manually engineered ontology against a fully automated, document-specific ontology induction approach across different LLMs and two corporate annual reports. The automatically induced ontology achieves 100% schema conformance in all configurations, eliminating the ontology drift observed with the manual approach. We also propose a hybrid verification strategy that combines regex matching with an LLM-as-a-judge check, reducing apparent subject hallucination rates from 65.2% to 1.6% by filtering false positives caused by coreference resolution. Finally, we identify a systematic asymmetry between subject and object hallucinations, which we attribute to passive constructions and omitted agents in financial prose.
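“正则匹配 + LLM 裁判”的混合校验之所以能大幅压低表观幻觉率,是因为它把指代消解造成的假阳性从字面不匹配中分离出来。下面用一个占位的裁判函数示意该两级流程;其中同义映射与全部示例数据均为假设,实际系统中第二级应调用外部 LLM。

```python
import re

def judge_with_llm(subject: str, source: str) -> bool:
    """占位:LLM-as-a-judge,判断 subject 经指代消解后是否见于原文。
    这里用一个假设的指代映射模拟裁判行为。"""
    coref = {"the company": "Acme Corp", "it": "Acme Corp"}
    return coref.get(subject.lower(), subject) in source

def verify_subject(subject: str, source: str) -> bool:
    """示意:混合校验。第一级廉价字面匹配,未命中再交给裁判。"""
    if re.search(re.escape(subject), source, re.IGNORECASE):
        return True                          # 第一级:正则字面命中
    return judge_with_llm(subject, source)   # 第二级:裁判处理指代等情况

src = "Acme Corp reported revenue of 1.2 bn EUR. The company expanded to Asia."
print(verify_subject("Acme Corp", src))   # True:字面命中
print(verify_subject("It", src))          # True:裁判经指代消解后接受
print(verify_subject("Globex", src))      # False:疑似主体幻觉
```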
[NLP-40] Towards Fair and Comprehensive Evaluation of Routers in Collaborative LLM Systems
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际部署中面临的成本与隐私限制问题,即如何在本地部署小型模型的同时,将复杂查询高效地路由至云端模型,同时确保路由决策的准确性与鲁棒性。现有路由器评估方法存在系统性不足,忽视场景适配性和分布外(out-of-distribution)泛化能力。解决方案的关键在于提出 RouterXBench 评估框架,从路由器能力、场景对齐性和跨域鲁棒性三个维度进行系统评测,并引入 ProbeDirichlet——一种基于可学习 Dirichlet 分布聚合多层隐藏状态的轻量级路由器,利用模型内部隐藏状态捕捉不确定性,通过概率化训练实现高精度且稳定的路由决策,在多种模型规模、任务类型和代理工作流中均表现出显著优于基线的性能。
链接: https://arxiv.org/abs/2602.11877
作者: Wanxing Wu,He Zhu,Yixia Li,Lei Yang,Jiehui Zhao,Hongru Wang,Jian Yang,Benyou Wang,Bingyi Jing,Guanhua Chen
机构: Southern University of Science and Technology (南方科技大学); Institut Polytechnique de Paris (巴黎综合理工学院); Peking University (北京大学); Deepexi Technology Co. Ltd. (深挖科技有限公司); University of Edinburgh (爱丁堡大学); Beihang University (北京航空航天大学); Chinese University of Hong Kong (Shenzhen) (香港中文大学(深圳))
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Our code is publicly available at this https URL
Abstract:Large language models (LLMs) have achieved success, but cost and privacy constraints necessitate deploying smaller models locally while offloading complex queries to cloud-based models. Existing router evaluations are unsystematic, overlooking scenario-specific requirements and out-of-distribution robustness. We propose RouterXBench, a principled evaluation framework with three dimensions: router ability, scenario alignment, and cross-domain robustness. Unlike prior work that relies on output probabilities or external embeddings, we utilize internal hidden states that capture model uncertainty before answer generation. We introduce ProbeDirichlet, a lightweight router that aggregates cross-layer hidden states via learnable Dirichlet distributions with probabilistic training. Trained on multi-domain data, it generalizes robustly across in-domain and out-of-distribution scenarios. Our results show ProbeDirichlet achieves 16.68% and 18.86% relative improvements over the best baselines in router ability and high-accuracy scenarios, with consistent performance across model families, model scales, heterogeneous tasks, and agentic workflows.
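“用可学习 Dirichlet 分布聚合多层隐藏态”这一设计可以用 PyTorch 极简复现:训练时对层权重做概率化采样,推理时取期望。下面的结构与维度均为笔者按摘要的假设性草图,并非 ProbeDirichlet 的官方实现。

```python
import torch

class ProbeDirichlet(torch.nn.Module):
    """示意:可学习 Dirichlet 层聚合 + 线性探针路由头(非官方实现)。"""
    def __init__(self, n_layers: int, d_hidden: int):
        super().__init__()
        self.log_alpha = torch.nn.Parameter(torch.zeros(n_layers))  # Dirichlet 浓度
        self.probe = torch.nn.Linear(d_hidden, 1)                   # 路由打分头

    def forward(self, hidden_states):        # hidden_states: [n_layers, d_hidden]
        alpha = self.log_alpha.exp()
        if self.training:                     # 概率化训练:采样层权重
            w = torch.distributions.Dirichlet(alpha).rsample()
        else:                                 # 推理:用 Dirichlet 期望
            w = alpha / alpha.sum()
        pooled = (w.unsqueeze(-1) * hidden_states).sum(0)
        return torch.sigmoid(self.probe(pooled))   # 大于阈值 -> 上云端大模型

router = ProbeDirichlet(n_layers=32, d_hidden=4096)
router.eval()
h = torch.randn(32, 4096)                    # 本地小模型各层对查询的隐藏态
print(float(router(h)))
```

关键点在于路由信号来自答案生成之前的内部隐藏态,而非输出概率或外部嵌入。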
[NLP-41] DMAP: A Distribution Map for Text ICLR2026
【速读】: 该论文旨在解决利用大型语言模型(Large Language Models, LLMs)进行统计文本分析时,困惑度(perplexity)等传统指标无法充分考虑上下文信息的问题,即如何准确解释给定下一个词的概率值——这取决于条件分布形状所编码的合理选择数量。解决方案的关键在于提出DMAP(Distribution Map,文本的分布图),这是一种数学上严谨的方法,通过将文本映射到单位区间内的样本集合,联合编码排序(rank)与概率信息,从而实现对文本的高效、模型无关的统计表征,支持多种应用场景,并为基于LLMs的文本分析提供统一且可计算的基础。
链接: https://arxiv.org/abs/2602.11871
作者: Tom Kempton,Julia Rozanova,Parameswaran Kamalaruban,Maeve Madigan,Karolina Wresilo,Yoann L. Launay,David Sutton,Stuart Burrell
机构: University of Manchester, UK (曼彻斯特大学); Featurespace, a Visa Solution (Featurespace,Visa解决方案); Risk and Security AI Lab, Visa Inc., UK (风险与安全AI实验室,Visa公司); University of Cambridge, UK (剑桥大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: ICLR 2026
Abstract:Large Language Models (LLMs) are a powerful tool for statistical text analysis, with derived sequences of next-token probability distributions offering a wealth of information. Extracting this signal typically relies on metrics such as perplexity, which do not adequately account for context; how one should interpret a given next-token probability is dependent on the number of reasonable choices encoded by the shape of the conditional distribution. In this work, we present DMAP, a mathematically grounded method that maps a text, via a language model, to a set of samples in the unit interval that jointly encode rank and probability information. This representation enables efficient, model-agnostic analysis and supports a range of applications. We illustrate its utility through three case studies: (i) validation of generation parameters to ensure data integrity, (ii) examining the role of probability curvature in machine-generated text detection, and (iii) a forensic analysis revealing statistical fingerprints left in downstream models that have been subject to post-training on synthetic data. Our results demonstrate that DMAP offers a unified statistical view of text that is simple to compute on consumer hardware, widely applicable, and provides a foundation for further research into text analysis with LLMs.
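“把文本映射为单位区间样本、同时编码排名与概率”存在一个经典构造:随机化概率积分变换(randomized PIT)。下面的草图给出这种构造作为示意——若文本确实由模型分布生成,样本应近似服从 [0, 1] 上的均匀分布;它是否与 DMAP 的精确定义一致属于笔者假设。

```python
import numpy as np

def dmap_samples(prob_dists, observed_ids, rng):
    """示意:随机化 PIT 构造(是否与论文定义完全一致属假设)。
    prob_dists: [T, V] 每步的下一词分布;observed_ids: 实际出现的 token id。
    每步样本 = 排名更靠前 token 的总质量 + U * 观测 token 自身概率。"""
    samples = []
    for probs, tok in zip(prob_dists, observed_ids):
        p = probs[tok]
        mass_above = probs[probs > p].sum()   # 排名信息
        samples.append(mass_above + rng.uniform() * p)   # 概率信息
    return np.array(samples)

rng = np.random.default_rng(0)
T, V = 200, 50
logits = rng.normal(size=(T, V))
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
toks = [rng.choice(V, p=probs[t]) for t in range(T)]   # 模拟由该分布生成的文本
u = dmap_samples(probs, toks, rng)
print(u.min(), u.max())   # 全部落在 [0, 1] 内;偏离均匀分布即为可检验的统计信号
```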
[NLP-42] A2V-SLP: Alignment-Aware Variational Modeling for Disentangled Sign Language Production
【速读】: 该论文旨在解决手势语言生成中潜在表示缺乏解耦性以及运动真实性不足的问题,特别是传统方法因采用确定性潜在嵌入(deterministic embeddings)而导致的表征坍缩(latent collapse)问题。其解决方案的关键在于提出一种对齐感知的变分框架 A²V-SLP,通过引入基于发音器(articulator)粒度的分布式潜在建模,使模型能够学习到每个发音器对应的均值和方差向量,从而在训练过程中提供分布监督;同时,非自回归 Transformer 模型基于文本嵌入预测潜在均值与对数方差,并在解码阶段通过随机采样重建手势姿态序列,有效避免了确定性潜在表示带来的信息丢失,提升了生成动作的真实感与语言对齐精度。
链接: https://arxiv.org/abs/2602.11861
作者: Sümeyye Meryem Taşyürek,Enis Mücahid İskender,Hacer Yalim Keles
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 9 pages, 2 figures, 8 tables
Abstract:Building upon recent structural disentanglement frameworks for sign language production, we propose A²V-SLP, an alignment-aware variational framework that learns articulator-wise disentangled latent distributions rather than deterministic embeddings. A disentangled Variational Autoencoder (VAE) encodes ground-truth sign pose sequences and extracts articulator-specific mean and variance vectors, which are used as distributional supervision for training a non-autoregressive Transformer. Given text embeddings, the Transformer predicts both latent means and log-variances, while the VAE decoder reconstructs the final sign pose sequences through stochastic sampling at the decoding stage. This formulation maintains articulator-level representations by avoiding deterministic latent collapse through distributional latent modeling. In addition, we integrate a gloss attention mechanism to strengthen alignment between linguistic input and articulated motion. Experimental results show consistent gains over deterministic latent regression, achieving state-of-the-art back-translation performance and improved motion realism in a fully gloss-free setting.
[NLP-43] Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在细粒度感知任务中表现不佳的问题,这类任务通常依赖于局部微小但关键的视觉证据,而易被全局上下文所淹没。现有“以图思考”(Thinking-with-Images)方法通过推理时反复缩放感兴趣区域来增强细粒度理解,但因多次调用工具和图像重编码导致延迟较高。其解决方案的关键在于提出区域到图像蒸馏(Region-to-Image Distillation),将缩放操作从推理阶段的外部工具转变为训练阶段的内部机制:首先利用教师模型对微缩区域进行精细化标注生成高质量视觉问答(VQA)数据,再将这些区域引导的监督信号蒸馏至全图输入的模型中;训练完成后,学生模型可在单次前向传播中实现无需工具调用的细粒度感知能力提升。
链接: https://arxiv.org/abs/2602.11858
作者: Lai Wei,Liangbo He,Jun Lan,Lingzhong Dong,Yutong Cai,Siyuan Li,Huijia Zhu,Weiqiang Wang,Linghe Kong,Yue Wang,Zhuosheng Zhang,Weiran Huang
机构: Shanghai Jiao Tong University (上海交通大学); Ant Group; Zhongguancun Academy; Shanghai Innovation Institute
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine-grained perception, where decisive evidence is small and easily overwhelmed by global context. Recent “Thinking-with-Images” methods alleviate this by iteratively zooming in and out regions of interest during inference, but incur high latency due to repeated tool calls and visual re-encoding. To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM. In particular, we first zoom in to micro-cropped regions to let strong teacher models generate high-quality VQA data, and then distill this region-grounded supervision back to the full image. After training on such data, the smaller student model improves “single-glance” fine-grained perception without tool use. To rigorously evaluate this capability, we further present ZoomBench, a hybrid-annotated benchmark of 845 VQA data spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global–regional “zooming gap”. Experiments show that our models achieve leading performance across multiple fine-grained perception benchmarks, and also improve general multimodal cognition on benchmarks such as visual reasoning and GUI agents. We further discuss when “Thinking-with-Images” is necessary versus when its gains can be distilled into a single forward pass. Our code is available at this https URL.
[NLP-44] Prototype Transformer: Towards Language Model Architectures Interpretable by Design
【速读】: 该论文旨在解决当前主流自回归语言模型(Language Models, LMs)推理过程不透明的问题,这种不透明性削弱了用户对其输出结果的信任,并可能导致幻觉(hallucination)和欺骗等风险。为应对这一挑战,作者提出了一种基于原型(prototype)的新型自回归架构——原型Transformer(Prototype Transformer, ProtoT),其核心创新在于通过输入序列与一组参数向量(即原型)之间的双向交互机制,使模型在训练过程中自动捕捉可命名的概念(如“woman”),从而实现对推理路径的可解释性。此外,ProtoT设计中引入的多时间尺度上下文信息聚合机制进一步增强了模型的可解释性,且在计算复杂度上相较于标准自注意力机制呈线性增长,显著优于现有模型的二次方复杂度,在性能和可扩展性方面均达到或接近最先进水平。
链接: https://arxiv.org/abs/2602.11852
作者: Yordan Yordanov,Matteo Forasassi,Bayar Menzat,Ruizhi Wang,Chang Qi,Markus Kaltenberger,Amine M’Charrak,Tommaso Salvatori,Thomas Lukasiewicz
机构: 1. University of Oxford (牛津大学); 2. University of Bologna (博洛尼亚大学); 3. University College London (伦敦大学学院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Preprint under review. Equal contribution: Yordan Yordanov and Matteo Forasassi. 39 pages, 25 figures, 22 tables
Abstract:While state-of-the-art language models (LMs) surpass the vast majority of humans in certain domains, their reasoning remains largely opaque, undermining trust in their output. Furthermore, while autoregressive LMs can output explicit reasoning, their true reasoning process is opaque, which introduces risks like deception and hallucination. In this work, we introduce the Prototype Transformer (ProtoT) – an autoregressive LM architecture based on prototypes (parameter vectors), posed as an alternative to the standard self-attention-based transformers. ProtoT works by means of two-way communication between the input sequence and the prototypes, and we show that this leads to the prototypes automatically capturing nameable concepts (e.g. “woman”) during training. They provide the potential to interpret the model’s reasoning and allow for targeted edits of its behavior. Furthermore, by design, the prototypes create communication channels that aggregate contextual information at different time scales, aiding interpretability. In terms of computation scalability, ProtoT scales linearly with sequence length vs the quadratic scalability of SOTA self-attention transformers. Compared to baselines, ProtoT scales well with model and data size, and performs well on text generation and downstream tasks (GLUE). ProtoT exhibits robustness to input perturbations on par or better than some baselines, but differs from them by providing interpretable pathways showing how robustness and sensitivity arises. Reaching close to the performance of state-of-the-art architectures, ProtoT paves the way to creating well-performing autoregressive LMs interpretable by design.
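“序列与原型之间的双向通信”可以用两次交叉注意力勾勒:先让原型从序列聚合信息(写),再让 token 从更新后的原型取回全局信息(读)。由于注意力只发生在 token 与固定数量的原型之间,复杂度对序列长度为线性。以下为玩具级草图,结构细节(含演示中省略的自回归因果掩码)均为笔者假设,非论文原结构。

```python
import torch

class PrototypeBlock(torch.nn.Module):
    """示意:token 与原型双向通信的一层(非官方实现)。
    复杂度 O(N*P),P 为固定原型数,故对序列长度 N 线性。"""
    def __init__(self, d=64, n_proto=16):
        super().__init__()
        self.protos = torch.nn.Parameter(torch.randn(n_proto, d) / d**0.5)
        self.to_proto = torch.nn.MultiheadAttention(d, 4, batch_first=True)
        self.to_tokens = torch.nn.MultiheadAttention(d, 4, batch_first=True)

    def forward(self, x):                      # x: [B, N, d]
        p = self.protos.expand(x.size(0), -1, -1)
        p, _ = self.to_proto(p, x, x)          # 写:原型从序列聚合上下文
        out, attn = self.to_tokens(x, p, p)    # 读:token 从原型取回全局信息
        return x + out, attn                   # attn 可解释:哪个原型在起作用

blk = PrototypeBlock()
y, attn = blk(torch.randn(2, 128, 64))
print(y.shape, attn.shape)   # [2, 128, 64], [2, 128, 16]
```

返回的注意力图正对应摘要所述的可解释通路:可以直接查看每个 token 依赖了哪些(可命名的)原型。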
[NLP-45] A Subword Embedding Approach for Variation Detection in Luxembourgish User Comments
【速读】: 该论文旨在解决在缺乏先验标准化或预定义变异列表的情况下,如何有效检测语言变体(variation)的问题。其核心挑战在于处理低资源或“噪声”语料中复杂的拼写和形态多样性,将其视为语言结构而非干扰因素。解决方案的关键在于采用基于嵌入(embedding-based)的方法,通过在原始文本上训练子词嵌入(subword embeddings),并结合余弦相似度与n-gram相似度对相关形式进行聚类,从而系统性地识别词汇和正字法变异模式,揭示区域和风格差异的规律性对应关系。该方法无需严格的人工标注,但能生成可解释的聚类结果,适用于多语言及小语种环境中的语言多样性研究。
链接: https://arxiv.org/abs/2602.11795
作者: Anne-Marie Lutgen,Alistair Plum,Christoph Purschke
机构: University of Luxembourg (卢森堡大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:This paper presents an embedding-based approach to detecting variation without relying on prior normalisation or predefined variant lists. The method trains subword embeddings on raw text and groups related forms through combined cosine and n-gram similarity. This allows spelling and morphological diversity to be examined and analysed as linguistic structure rather than treated as noise. Using a large corpus of Luxembourgish user comments, the approach uncovers extensive lexical and orthographic variation that aligns with patterns described in dialectal and sociolinguistic research. The induced families capture systematic correspondences and highlight areas of regional and stylistic differentiation. The procedure does not strictly require manual annotation, but does produce transparent clusters that support both quantitative and qualitative analysis. The results demonstrate that distributional modelling can reveal meaningful patterns of variation even in ‘‘noisy’’ or low-resource settings, offering a reproducible methodological framework for studying language variety in multilingual and small-language contexts.
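结合该摘要的做法,下面给出一个最小可运行的 Python 示意(非论文官方实现;vocab_vectors 的来源、权重 alpha 与阈值 threshold 均为假设):将子词嵌入的余弦相似度与字符 n-gram 的 Jaccard 相似度加权组合,用于为某个词检索候选变体形式。

```python
import numpy as np

def char_ngrams(word, n=3):
    """提取词内字符 n-gram 集合(首尾加边界符)。"""
    s = f"<{word}>"
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_sim(w1, w2, n=3):
    """字符 n-gram 的 Jaccard 相似度,用于捕捉拼写层面的变体。"""
    a, b = char_ngrams(w1, n), char_ngrams(w2, n)
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def variant_candidates(word, vocab_vectors, alpha=0.5, threshold=0.6):
    """综合余弦相似度与 n-gram 相似度,检索候选变体形式。

    vocab_vectors: dict[str, np.ndarray],假设由在原始文本上
    训练的子词嵌入模型(如 fastText)导出。
    """
    u = vocab_vectors[word]
    scored = []
    for w, v in vocab_vectors.items():
        if w == word:
            continue
        score = alpha * cosine(u, v) + (1 - alpha) * ngram_sim(word, w)
        if score >= threshold:
            scored.append((w, score))
    return sorted(scored, key=lambda x: -x[1])
```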
[NLP-46] More Haste Less Speed: Weaker Single-Layer Watermark Improves Distortion-Free Watermark Ensembles
【速读】: 该论文旨在解决当前生成式 AI (Generative AI) 水印技术中因“越强越好”策略导致的信号衰减问题:即在多层水印集成(watermark ensemble)过程中,单层过强的水印会显著降低token分布的熵(entropy),从而削弱后续层的检测能力。其解决方案的关键在于提出一种反直觉的框架——通过使用较弱的单层水印来保留足够的熵空间,以支持更有效的多层水印集成,从而缓解信号衰减并提升整体检测率与鲁棒性。
链接: https://arxiv.org/abs/2602.11793
作者: Ruibo Chen,Yihan Wu,Xuehao Cui,Jingqi Zhang,Heng Huang
机构: University of Maryland, College Park (马里兰大学学院市分校); National University of Singapore (新加坡国立大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:
Abstract:Watermarking has emerged as a crucial technique for detecting and attributing content generated by large language models. While recent advancements have utilized watermark ensembles to enhance robustness, prevailing methods typically prioritize maximizing the strength of the watermark at every individual layer. In this work, we identify a critical limitation in this “stronger-is-better” approach: strong watermarks significantly reduce the entropy of the token distribution, which paradoxically weakens the effectiveness of watermarking in subsequent layers. We theoretically and empirically show that detectability is bounded by entropy and that watermark ensembles induce a monotonic decrease in both entropy and the expected green-list ratio across layers. To address this inherent trade-off, we propose a general framework that utilizes weaker single-layer watermarks to preserve the entropy required for effective multi-layer ensembling. Empirical evaluations demonstrate that this counter-intuitive strategy mitigates signal decay and consistently outperforms strong baselines in both detectability and robustness.
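摘要的核心论点(单层水印越强,token 分布的熵压得越低,后续层可用的检测信号越少)可以用经典的绿名单水印直观演示。下面是一个简化示意(哈希划分方式、delta 与 gamma 等参数均为假设,并非该论文的官方实现):

```python
import hashlib
import numpy as np

def green_mask(prev_token_id, vocab_size, key, gamma=0.5):
    """由上一个 token 与密钥伪随机划分出占比 gamma 的"绿名单"。"""
    seed = int(hashlib.sha256(f"{key}:{prev_token_id}".encode()).hexdigest(), 16) % (2**32)
    rng = np.random.default_rng(seed)
    mask = np.zeros(vocab_size, dtype=bool)
    mask[rng.choice(vocab_size, int(gamma * vocab_size), replace=False)] = True
    return mask

def ensemble_watermark_logits(logits, prev_token_id, keys, delta=1.0):
    """多层水印集成:每层用独立密钥给各自绿名单的 logits 加偏置 delta。"""
    out = logits.copy()
    for key in keys:
        out[green_mask(prev_token_id, len(logits), key)] += delta
    return out
```

直观上,delta 取小值时各层加偏置后的分布仍保留较高的熵,为后续层留下可检测空间;这正是该文"弱单层水印更利于集成"这一反直觉结论的含义。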
[NLP-47] Detecting RLVR Training Data via Structural Convergence of Reasoning
【速读】: 该论文旨在解决生成式 AI(Generative AI)在强化学习与可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)训练过程中因训练数据未公开而导致的基准污染(benchmark contamination)问题。传统基于似然的方法难以有效检测此类模型是否接触过特定训练样本,因为RLVR通过自生成推理轨迹获取奖励反馈进行微调,而非依赖词元级概率优化。论文提出的关键解决方案是引入Min- k NN Distance这一黑盒检测方法,其核心思想在于利用RLVR诱导的行为特征:在RLVR训练中见过的提示(prompt)会产生更僵化、相似的生成结果,而未见过的提示则保持更高的多样性。该方法通过采样多个完成文本并计算最小k个最近邻编辑距离的平均值来量化这种生成多样性塌缩,无需访问参考模型或token级概率信息,实验表明其能可靠区分RLVR训练过的样本与未见样本,并显著优于现有成员推断和RL污染检测基线。
链接: https://arxiv.org/abs/2602.11792
作者: Hongbo Zhang,Yue Yang,Jianhao Yan,Guangsheng Bao,Yue Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Preprint
Abstract:Reinforcement learning with verifiable rewards (RLVR) is central to training modern reasoning models, but the undisclosed training data raises concerns about benchmark contamination. Unlike pretraining methods, which optimize models using token-level probabilities, RLVR fine-tunes models based on reward feedback from self-generated reasoning trajectories, making conventional likelihood-based detection methods less effective. We show that RLVR induces a distinctive behavioral signature: prompts encountered during RLVR training result in more rigid and similar generations, while unseen prompts retain greater diversity. We introduce Min-k NN Distance, a simple black-box detector that quantifies this collapse by sampling multiple completions for a given prompt and computing the average of the k smallest nearest-neighbor edit distances. Min-k NN Distance requires no access to the reference model or token probabilities. Experiments across multiple RLVR-trained reasoning models show that Min-k NN Distance reliably distinguishes RL-seen examples from unseen ones and outperforms existing membership inference and RL contamination detection baselines.
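按摘要描述,Min-k NN Distance 的计算流程相当直接,下面给出一个自包含的 Python 示意(采样接口省略,completions 假设为同一 prompt 的多条已生成文本):

```python
def edit_distance(a, b):
    """标准 Levenshtein 编辑距离(滚动数组动态规划)。"""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (a[i - 1] != b[j - 1]))
    return dp[n]

def min_k_nn_distance(completions, k=3):
    """Min-k NN Distance:对同一 prompt 的多条采样输出,
    取每条输出到其最近邻的编辑距离,再对最小的 k 个取平均。
    值越小说明生成越收敛,该 prompt 越可能出现在 RLVR 训练集中。
    """
    assert len(completions) >= 2, "至少需要两条采样输出"
    nn = []
    for i, ci in enumerate(completions):
        d = min(edit_distance(ci, cj) for j, cj in enumerate(completions) if j != i)
        nn.append(d)
    return sum(sorted(nn)[:k]) / k
```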
[NLP-48] Beyond End-to-End Video Models: An LLM-Based Multi-Agent System for Educational Video Generation
【速读】: 该论文旨在解决当前端到端视频生成模型在需要严格逻辑严谨性和精确知识表示的场景(如教学和教育媒体)中表现不足的问题,具体表现为流程保真度低、制作成本高及可控性差。其解决方案的关键在于提出一个基于大语言模型(Large Language Model, LLM)的分层多智能体系统——LAVES,该系统通过中央协调代理(Orchestrating Agent)调度多个专业化代理(包括求解代理、可视化代理和叙述代理),并引入显式的质量门控与迭代批判机制,实现对步骤推理、教学连贯性、语义忠实的视觉演示以及音画精准对齐等多目标优化;同时,系统不直接合成像素,而是构建结构化的可执行视频脚本,并基于模板驱动规则确定性编译为同步的视听内容,从而实现无需人工编辑的全自动端到端视频生成,显著提升生产效率(日吞吐量超百万视频)并降低成本(较行业标准降低95%以上)。
链接: https://arxiv.org/abs/2602.11790
作者: Lingyong Yan,Jiulong Wu,Dong Xie,Weixian Shi,Deguo Xia,Jizhou Huang
机构: Baidu Inc.(百度公司)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: For more information, visit the project website: this https URL
Abstract:Although recent end-to-end video generation models demonstrate impressive performance in visually oriented content creation, they remain limited in scenarios that require strict logical rigor and precise knowledge representation, such as instructional and educational media. To address this problem, we propose LAVES, a hierarchical LLM-based multi-agent system for generating high-quality instructional videos from educational problems. The LAVES formulates educational video generation as a multi-objective task that simultaneously demands correct step-by-step reasoning, pedagogically coherent narration, semantically faithful visual demonstrations, and precise audio–visual alignment. To address the limitations of prior approaches–including low procedural fidelity, high production cost, and limited controllability–LAVES decomposes the generation workflow into specialized agents coordinated by a central Orchestrating Agent with explicit quality gates and iterative critique mechanisms. Specifically, the Orchestrating Agent supervises a Solution Agent for rigorous problem solving, an Illustration Agent that produces executable visualization codes, and a Narration Agent for learner-oriented instructional scripts. In addition, all outputs from the working agents are subject to semantic critique, rule-based constraints, and tool-based compilation checks. Rather than directly synthesizing pixels, the system constructs a structured executable video script that is deterministically compiled into synchronized visuals and narration using template-driven assembly rules, enabling fully automated end-to-end production without manual editing. In large-scale deployments, LAVES achieves a throughput exceeding one million videos per day, delivering over a 95% reduction in cost compared to current industry-standard approaches while maintaining a high acceptance rate.
[NLP-49] TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents
【速读】: 该论文旨在解决多轮强化学习(multi-turn reinforcement learning, RL)中因奖励稀疏或延迟、环境随机性导致的轨迹采样效率低、探索不足及模式崩溃(mode collapse)等问题。其解决方案的关键在于提出TSR(Trajectory-Search Rollouts)方法,通过在训练阶段引入轻量级树搜索策略(如best-of-N、束搜索和浅层前瞻搜索),利用任务特定反馈在每一轮选择高得分动作,从而生成高质量轨迹,提升 rollout 质量并稳定学习过程,同时保持原始优化目标不变,具备对优化器的无关性(optimizer-agnostic)。该方法将原本用于推理时的搜索机制迁移至训练阶段的rollout环节,实现了更高效的多轮智能体学习,且与现有框架和拒绝采样类选择方法具有良好的兼容性。
链接: https://arxiv.org/abs/2602.11767
作者: Aladin Djuhera,Swanand Ravindra Kadhe,Farhan Ahmed,Holger Boche
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Advances in large language models (LLMs) are driving a shift toward using reinforcement learning (RL) to train agents from iterative, multi-turn interactions across tasks. However, multi-turn RL remains challenging as rewards are often sparse or delayed, and environments can be stochastic. In this regime, naive trajectory sampling can hinder exploitation and induce mode collapse. We propose TSR (Trajectory-Search Rollouts), a training-time approach that repurposes test-time scaling ideas for improved per-turn rollout generation. TSR performs lightweight tree-style search to construct high-quality trajectories by selecting high-scoring actions at each turn using task-specific feedback. This improves rollout quality and stabilizes learning while leaving the underlying optimization objective unchanged, making TSR optimizer-agnostic. We instantiate TSR with best-of-N, beam, and shallow lookahead search, and pair it with PPO and GRPO, achieving up to 15% performance gains and more stable learning on Sokoban, FrozenLake, and WebShop tasks at a one-time increase in training compute. By moving search from inference time to the rollout stage of training, TSR provides a simple and general mechanism for stronger multi-turn agent learning, complementary to existing frameworks and rejection-sampling-style selection methods.
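以 best-of-N 实例化为例,TSR 在训练阶段的每一轮用任务反馈挑选最优动作来构造高质量轨迹,优化目标本身不变。以下为接口层面的示意(policy.sample、env.step、score_fn 的签名均为假设,非官方实现):

```python
def tsr_best_of_n_rollout(policy, env, score_fn, n_candidates=4, max_turns=10):
    """TSR 的 best-of-N 示意:逐轮采样 N 个候选动作,
    用任务特定反馈 score_fn 打分并选取最高分动作。
    假设 env.step 返回 (下一观测, 奖励, 是否结束)。
    """
    obs, trajectory = env.reset(), []
    for _ in range(max_turns):
        candidates = [policy.sample(obs) for _ in range(n_candidates)]
        best = max(candidates, key=lambda a: score_fn(obs, a))  # 逐候选打分
        next_obs, reward, done = env.step(best)
        trajectory.append((obs, best, reward))
        obs = next_obs
        if done:
            break
    return trajectory  # 交给 PPO/GRPO 等优化器,目标函数本身不变
```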
[NLP-50] MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理超长上下文时面临的高计算与内存开销问题。传统Transformer架构在长序列建模中因自注意力机制的平方复杂度导致资源消耗剧增,而现有稀疏注意力(Sparse Attention)和线性注意力(Linear Attention)机制往往难以兼顾效率与性能。解决方案的关键在于提出一种9B参数的混合架构MiniCPM-SALA,其核心创新包括:1)通过层选择算法以1:3比例融合稀疏注意力(InfLLM-V2)与线性注意力(Lightning Attention),实现高保真长程建模与全局效率的平衡;2)引入混合位置编码(Hybrid Positional Encoding, HyPE)以增强对长序列的位置感知能力;3)设计低成本持续训练框架,将预训练Transformer模型高效转化为混合模型,训练成本降低约75%。实验表明,该模型在单张NVIDIA A6000D GPU上推理速度提升至全注意力模型的3.5倍,并支持高达1M tokens的上下文长度,显著优于传统8B全注意力模型。
链接: https://arxiv.org/abs/2602.11761
作者: MiniCPM Team:Wenhao An,Yingfa Chen,Yewei Fang,Jiayi Li,Xin Li,Yaohui Li,Yishan Li,Yuxuan Li,Biyuan Lin,Chuan Liu,Hezi Liu,Siyuan Liu,Hongya Lyu,Yinxu Pan,Shixin Ren,Xingyu Shen,Zhou Su,Haojun Sun,Yangang Sun,Zhen Leng Thai,Xin Tian,Rui Wang,Xiaorong Wang,Yudong Wang,Bo Wu,Xiaoyue Xu,Dong Xu,Shuaikang Xue,Jiawei Yang,Bowen Zhang,Jinqian Zhang,Letian Zhang,Shengnan Zhang,Xinyu Zhang,Xinyuan Zhang,Zhu Zhang,Hengyu Zhao,Jiacheng Zhao,Jie Zhou,Zihan Zhou,Shuo Wang,Chaojun Xiao,Xu Han,Zhiyuan Liu,Maosong Sun
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: MiniCPM-SALA Technical Report
Abstract:The evolution of large language models (LLMs) towards applications with ultra-long contexts faces challenges posed by the high computational and memory costs of the Transformer architecture. While existing sparse and linear attention mechanisms attempt to mitigate these issues, they typically involve a trade-off between memory efficiency and model performance. This paper introduces MiniCPM-SALA, a 9B-parameter hybrid architecture that integrates the high-fidelity long-context modeling of sparse attention (InfLLM-V2) with the global efficiency of linear attention (Lightning Attention). By employing a layer selection algorithm to integrate these mechanisms in a 1:3 ratio and utilizing a hybrid positional encoding (HyPE), the model maintains efficiency and performance for long-context tasks. Furthermore, we introduce a cost-effective continual training framework that transforms pre-trained Transformer-based models into hybrid models, which reduces training costs by approximately 75% compared to training from scratch. Extensive experiments show that MiniCPM-SALA maintains general capabilities comparable to full-attention models while offering improved efficiency. On a single NVIDIA A6000D GPU, the model achieves up to 3.5x the inference speed of the full-attention model at the sequence length of 256K tokens and supports context lengths of up to 1M tokens, a scale where traditional full-attention 8B models fail because of memory constraints.
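摘要提到按 1:3 的比例混合稀疏与线性注意力层;具体的层选择算法未在摘要中给出,下面仅用均匀插入近似这一比例,作为假设性的示意:

```python
def build_layer_pattern(n_layers=36, sparse_ratio=0.25):
    """用均匀插入近似 1:3 的稀疏:线性比例(层选择算法为假设):
    每 4 层放置 1 层稀疏注意力(高保真长程建模),
    其余 3 层为线性注意力(全局高效)。
    """
    stride = round(1 / sparse_ratio)  # 1:3 => 每 4 层一个 sparse
    return ["sparse" if i % stride == stride - 1 else "linear"
            for i in range(n_layers)]

# 例:build_layer_pattern(8) -> ['linear','linear','linear','sparse',
#                                 'linear','linear','linear','sparse']
```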
[NLP-51] Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning
【速读】: 该论文旨在解决测试阶段扩展(test-time scaling)中模型难以实现有效上下文探索(In-Context Exploration)的问题,即模型在单个连续上下文中生成、验证并优化多个推理假设的能力受限。基于状态覆盖(State Coverage)理论的分析指出,一个关键瓶颈是:更广的状态覆盖需要更长的推理轨迹,但自回归生成过程中采样这些长序列的概率呈指数衰减,这一现象被称为“浅层探索陷阱(Shallow Exploration Trap)”。为突破此限制,论文提出长度激励探索(Length-Incentivized Exploration),其核心在于通过基于长度的奖励机制与冗余惩罚相结合的方式,显式鼓励模型在上下文中进行更深入的探索,从而以两步策略最大化状态覆盖。实验表明,该方法在不同大语言模型(如Qwen3、Llama)上均能显著提升推理能力,在域内任务平均提升4.4%,域外基准提升2.7%。
链接: https://arxiv.org/abs/2602.11748
作者: Futing Wang,Jianhao Yan,Yun Luo,Ganqu Cui,Zhi Wang,Xiaoye Qu,Yue Zhang,Yu Cheng,Tao Lin
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Achieving effective test-time scaling requires models to engage in In-Context Exploration – the intrinsic ability to generate, verify, and refine multiple reasoning hypotheses within a single continuous context. Grounded in State Coverage theory, our analysis identifies a critical bottleneck to enabling this capability: while broader state coverage requires longer reasoning trajectories, the probability of sampling such sequences decays exponentially during autoregressive generation, a phenomenon we term the “Shallow Exploration Trap”. To bridge this gap, we propose Length-Incentivized Exploration. This simple yet effective recipe explicitly encourages models to explore more via a length-based reward coupled with a redundancy penalty, thereby maximizing state coverage in a two-step manner. Comprehensive experiments across different models (Qwen3, Llama) demonstrate that Length-Incentivized Exploration effectively incentivizes in-context exploration. As a result, our method achieves an average improvement of 4.4% on in-domain tasks and a 2.7% gain on out-of-domain benchmarks.
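摘要中的“长度奖励 + 冗余惩罚”可以写成一个很小的奖励函数。以下为示意(系数、归一化与重复率度量均为本文假设,并非论文的确切公式):

```python
def repetition_rate(tokens, n=4):
    """n-gram 重复率,作为冗余度的简单度量。"""
    if len(tokens) < n:
        return 0.0
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 1.0 - len(set(grams)) / len(grams)

def length_incentivized_reward(tokens, correct, w_len=0.1, w_rep=0.5,
                               target_len=4096):
    """长度激励奖励的最小示意(系数与归一化方式均为假设):
    在正确性奖励之外,鼓励更长的推理轨迹以扩大状态覆盖,
    同时用重复率惩罚抑制无效的“注水”式加长。
    """
    r_len = w_len * min(len(tokens) / target_len, 1.0)  # 长度奖励(封顶)
    r_rep = w_rep * repetition_rate(tokens)              # 冗余惩罚
    return (1.0 if correct else 0.0) + r_len - r_rep
```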
[NLP-52] Mask What Matters: Mitigating Object Hallucinations in Multimodal Large Language Models with Object-Aligned Visual Contrastive Decoding
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)中存在的对象幻觉(object hallucination)问题,即模型在生成文本时错误地引入图像中并不存在的对象。其解决方案的关键在于改进视觉对比解码(Visual Contrastive Decoding, VCD),通过构建一个与物体对齐的辅助视图(object-aligned auxiliary view)来增强对比信号。具体而言,作者利用自监督视觉Transformer中的以对象为中心的注意力机制(object-centric attention),移除最显著的视觉证据以构造该辅助视图,从而干扰不支持的token并产生更强的对比信号。该方法具有提示无关性(prompt-agnostic)和模型无关性(model-agnostic),可无缝集成到现有VCD流程中,仅需一次缓存式前向传播,计算开销极低。实验证明,该方法在两个主流对象幻觉基准上均对两种MLLMs表现出一致的性能提升。
链接: https://arxiv.org/abs/2602.11737
作者: Boqi Chen,Xudong Liu,Jianing Qiu
机构: ETH Zurich (苏黎世联邦理工学院); Amazon (亚马逊); MBZUAI (穆罕默德·本·扎耶德人工智能大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:We study object hallucination in Multimodal Large Language Models (MLLMs) and improve visual contrastive decoding (VCD) by constructing an object-aligned auxiliary view. We leverage object-centric attention in self-supervised Vision Transformers. In particular, we remove the most salient visual evidence to construct an auxiliary view that disrupts unsupported tokens and produces a stronger contrast signal. Our method is prompt-agnostic, model-agnostic, and can be seamlessly plugged into the existing VCD pipeline with little computation overhead, i.e., a single cacheable forward pass. Empirically, our method demonstrates consistent gains on two popular object hallucination benchmarks across two MLLMs.
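该方法沿用 VCD 的对比解码框架,只是把辅助视图换成“移除最显著物体证据”的图像。下面按 VCD 常用的 logits 组合式给出示意(模型接口为 HuggingFace 风格的假设,pixel_masked 的构造过程此处省略):

```python
import torch

@torch.no_grad()
def vcd_next_token_logits(model, input_ids, pixel_full, pixel_masked, alpha=1.0):
    """视觉对比解码示意:用移除最显著物体证据的辅助视图
    (pixel_masked,可由自监督 ViT 的注意力图构造,此处为假设输入)
    与原图做 logits 对比,抑制缺乏视觉支撑的 token。
    采用 VCD 常用组合式 (1+alpha)*l_full - alpha*l_aux。
    """
    l_full = model(input_ids=input_ids, pixel_values=pixel_full).logits[:, -1]
    l_aux = model(input_ids=input_ids, pixel_values=pixel_masked).logits[:, -1]
    return (1 + alpha) * l_full - alpha * l_aux
```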
[NLP-53] Thinking with Drafting: Optical Decompression via Logical Reconstruction
【速读】: 该论文试图解决多模态大语言模型在复杂推理任务中面临的“精度悖论”问题,即视觉感知系统虽能高保真地识别符号但无法捕捉逻辑拓扑结构,而基于像素的生成模型则会产生缺乏数学精确性的视觉伪影。解决方案的关键在于将视觉推理重新概念化为“光学解压缩”——从压缩的视觉标记中重建潜在的逻辑结构,并引入名为“草稿式思考”(Thinking with Drafting, TwD)的方法,该方法利用最小化的领域特定语言(Domain-Specific Language, DSL)作为中间表示,强制模型将其思维过程转化为可执行代码,从而实现确定性的视觉证明以进行自我验证。这一机制构建了一个闭环系统,使视觉生成不再是创造性输出,而是逻辑验证工具,显著提升了视觉推理的准确性与可解释性。
链接: https://arxiv.org/abs/2602.11731
作者: Jingxuan Wei,Honghao He,Caijun Jia,Siyuan Li,Zheng Sun,Yuhang Xu,Yuanyuan Lin,Linzhuang Sun,Yuchen Wu,Bihui Yu,Xiangxiang Zhang,Cheng Tan
机构: Shenyang Institute of Computing Technology, Chinese Academy of Sciences (中国科学院沈阳计算技术研究所); University of Chinese Academy of Sciences (中国科学院大学); ByteDance (字节跳动); Westlake University (西湖大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Existing multimodal large language models have achieved high-fidelity visual perception and exploratory visual generation. However, a precision paradox persists in complex reasoning tasks: optical perception systems transcribe symbols without capturing logical topology, while pixel-based generative models produce visual artifacts lacking mathematical exactness. To bridge this gap, we propose that reasoning over visual inputs be reconceptualized as optical decompression: the process of reconstructing latent logical structures from compressed visual tokens. Guided by the axiom that Parsing is Reasoning, we introduce Thinking with Drafting (TwD), which utilizes a minimalist Domain-Specific Language (DSL) as a grounding intermediate representation. Unlike standard approaches that hallucinate answers directly, TwD forces the model to draft its mental model into executable code, rendering deterministic visual proofs for self-verification. To validate this, we present VisAlg, a visual algebra benchmark. Experiments demonstrate that TwD serves as a superior cognitive scaffold. Our work establishes a closed-loop system where visual generation acts not as a creative output but as a logical verifier, offering a generalizable path for visual reasoning.
[NLP-54] DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels
【速读】: 该论文旨在解决生成高性能CUDA内核(CUDA kernel)过程中面临的两大挑战:一是扩散大语言模型(Diffusion Large Language Models, dLLMs)在专用领域如CUDA内核生成中的适配难题,二是高质量训练数据的严重匮乏。解决方案的关键在于构建了一个名为CuKe的增强监督微调数据集,并提出了一种双阶段精细化强化学习框架(BiC-RL),包括CUDA内核补全阶段和端到端生成阶段。基于此框架,作者开发了DICE系列扩散大语言模型,覆盖1.7B、4B和8B三种参数规模,实验表明其在KernelBench基准上显著优于同规模的自回归与扩散大语言模型,成为CUDA内核生成的新SOTA。
链接: https://arxiv.org/abs/2602.11715
作者: Haolei Bai,Lingcheng Kong,Xueyi Chen,Jianmian Wang,Zhiqiang Tao,Huan Wang
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Diffusion large language models (dLLMs) have emerged as a compelling alternative to autoregressive (AR) LLMs, owing to their capacity for parallel token generation. This paradigm is particularly well-suited for code generation, where holistic structural planning and non-sequential refinement are critical. Despite this potential, tailoring dLLMs for CUDA kernel generation remains challenging, obstructed not only by the high specialization but also by the severe lack of high-quality training data. To address these challenges, we construct CuKe, an augmented supervised fine-tuning dataset optimized for high-performance CUDA kernels. On top of it, we propose a bi-phase curated reinforcement learning (BiC-RL) framework consisting of a CUDA kernel infilling stage and an end-to-end CUDA kernel generation stage. Leveraging this training framework, we introduce DICE, a series of diffusion large language models designed for CUDA kernel generation, spanning three parameter scales, 1.7B, 4B, and 8B. Extensive experiments on KernelBench demonstrate that DICE significantly outperforms both autoregressive and diffusion LLMs of comparable scale, establishing a new state-of-the-art for CUDA kernel generation.
[NLP-55] Finding Sense in Nonsense with Generated Contexts: Perspectives from Humans and Language Models
【速读】: 该论文旨在解决语义异常句子中“非意义性”(nonsensical)与“异常性”(anomalous)的区分难题,即如何准确识别哪些句子在本质上无法被合理解释,而非仅因缺乏上下文而显得不合逻辑。其解决方案的关键在于通过人类标注者和大语言模型(LLM)对五个语义偏离数据集中的句子进行语义合理性判断(sensicality judgments),并对比其在无上下文和提供上下文两种条件下的表现,从而量化句子的真实非意义程度,并评估LLM在生成合理上下文以解释异常句方面的有效性。结果表明,多数句子仅具异常性而非真正非意义性,且LLM能有效为异常句生成合理支撑情境。
链接: https://arxiv.org/abs/2602.11699
作者: Katrin Olsen,Sebastian Padó
机构: IMS, University of Stuttgart (斯图加特大学信息媒体研究所), Germany
类目: Computation and Language (cs.CL)
备注:
Abstract:Nonsensical and anomalous sentences have been instrumental in the development of computational models of semantic interpretation. A core challenge is to distinguish between what is merely anomalous (but can be interpreted given a supporting context) and what is truly nonsensical. However, it is unclear (a) how nonsensical, rather than merely anomalous, existing datasets are; and (b) how well LLMs can make this distinction. In this paper, we answer both questions by collecting sensicality judgments from human raters and LLMs on sentences from five semantically deviant datasets: both context-free and when providing a context. We find that raters consider most sentences at most anomalous, and only a few as properly nonsensical. We also show that LLMs are substantially skilled in generating plausible contexts for anomalous cases.
[NLP-56] ThinkRouter: Efficient Reasoning via Routing Thinking between Latent and Discrete Spaces
【速读】: 该论文旨在解决 latent reasoning(潜在推理)在不同场景下效果不稳定的问题,特别是由于低置信度步骤不足和软嵌入噪声传播导致的高置信度错误推理轨迹。其解决方案的关键在于提出 ThinkRouter——一种推理时的置信度感知路由机制:当模型置信度较低时,将推理路径引导至离散 token 空间以避免噪声干扰;否则则保持在潜在空间中以提升效率。该方法在 STEM 推理与代码生成基准上显著提升了准确率(Pass@1 平均提升 19.70 分),同时减少生成长度最多达 15.55%,并通过全局降低模型置信度加速了思考终止 token 的生成。
链接: https://arxiv.org/abs/2602.11683
作者: Xin Xu,Tong Yu,Xiang Chen,Haoliang Wang,Julian McAuley,Saayan Mitra
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Work in Progress
Abstract:Recent work explores latent reasoning to improve reasoning efficiency by replacing explicit reasoning trajectories with continuous representations in a latent space, yet its effectiveness varies across settings. Analysis of model confidence dynamics under latent reasoning reveals that thinking trajectories ending in incorrect answers contain fewer low-confidence steps than those ending in correct answers. Meanwhile, we suggest that soft embeddings aggregated by multiple low-confidence thinking alternatives may introduce and propagate noise, leading to high confidence in unreliable reasoning trajectories. Motivated by these observations, ThinkRouter, an inference-time confidence-aware routing mechanism is proposed to avoid high confidence and noise for efficient reasoning. ThinkRouter routes thinking to the discrete token space when model confidence is low, and to the latent space otherwise. Extensive experiments on STEM reasoning and coding benchmarks across diverse large reasoning models demonstrate that ThinkRouter outperforms explicit CoT, random routing, and latent reasoning baselines in terms of accuracy, achieving an average improvement of 19.70 points in Pass@1, while reducing generation length by up to 15.55%. Further comprehensive analysis reveals that ThinkRouter can calibrate errors arising from explicit CoT and latent reasoning, and accelerates end-of-thinking token generation by globally lowering model confidence.
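ThinkRouter 的单步路由逻辑可以概括为“低置信走离散、高置信走潜在”。以下为一个最小 PyTorch 示意(阈值 tau 与接口均为假设,省略了批处理与采样细节):

```python
import torch

def think_router_step(lm_head, embed, hidden, tau=0.7):
    """ThinkRouter 单步路由示意(阈值 tau 为假设超参):
    置信度低时走离散 token 空间,避免软嵌入聚合传播噪声;
    置信度高时留在潜在空间,用概率加权的软嵌入继续推理。
    lm_head: nn.Linear(d, V);embed: nn.Embedding(V, d);
    hidden: 形状 (1, d) 的最后一步隐状态。
    """
    probs = torch.softmax(lm_head(hidden), dim=-1)   # (1, V)
    conf, top1 = probs.max(dim=-1)
    if conf.item() < tau:
        # 离散路由:取 argmax(或采样)并回查词嵌入
        return embed(top1), "discrete"
    # 潜在路由:软嵌入 = 概率分布对嵌入矩阵的加权和
    return probs @ embed.weight, "latent"
```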
[NLP-57] PhyNiKCE: A Neurosymbolic Agentic Framework for Autonomous Computational Fluid Dynamics
【速读】: 该论文旨在解决生成式 AI(Generative AI)在计算流体动力学(Computational Fluid Dynamics, CFD)中部署时面临的核心挑战:大型语言模型(Large Language Models, LLMs)的不确定性导致物理守恒律和数值稳定性难以保障,从而引发“语义-物理断层”问题,即代理生成的语言上合理但物理上无效的仿真配置。解决方案的关键在于提出 PhyNiKCE(Physical and Numerical Knowledgeable Context Engineering),这是一个神经符号(neurosymbolic)智能体框架,其核心创新是将神经规划与符号验证解耦:通过符号知识引擎(Symbolic Knowledge Engine)将仿真设置建模为约束满足问题(Constraint Satisfaction Problem),利用确定性检索增强生成(Deterministic RAG Engine)结合针对求解器、湍流模型和边界条件的专用检索策略,严格强制物理约束。实验表明,该方法在 OpenFOAM 上实现 96% 的相对性能提升,并显著减少自修正循环(59%)和 LLM token 消耗(17%),验证了该架构在提升工程仿真可信度与效率方面的有效性。
链接: https://arxiv.org/abs/2602.11666
作者: E Fan,Lisong Shi,Zhengtong Li,Chih-yung Wen
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 30 pages, 10 figures
Abstract:The deployment of autonomous agents for Computational Fluid Dynamics (CFD), is critically limited by the probabilistic nature of Large Language Models (LLMs), which struggle to enforce the strict conservation laws and numerical stability required for physics-based simulations. Reliance on purely semantic Retrieval Augmented Generation (RAG) often leads to “context poisoning,” where agents generate linguistically plausible but physically invalid configurations due to a fundamental Semantic-Physical Disconnect. To bridge this gap, this work introduces PhyNiKCE (Physical and Numerical Knowledgeable Context Engineering), a neurosymbolic agentic framework for trustworthy engineering. Unlike standard black-box agents, PhyNiKCE decouples neural planning from symbolic validation. It employs a Symbolic Knowledge Engine that treats simulation setup as a Constraint Satisfaction Problem, rigidly enforcing physical constraints via a Deterministic RAG Engine with specialized retrieval strategies for solvers, turbulence models, and boundary conditions. Validated through rigorous OpenFOAM experiments on practical, non-tutorial CFD tasks using Gemini-2.5-Pro/Flash, PhyNiKCE demonstrates a 96% relative improvement over state-of-the-art baselines. Furthermore, by replacing trial-and-error with knowledge-driven initialization, the framework reduced autonomous self-correction loops by 59% while simultaneously lowering LLM token consumption by 17%. These results demonstrate that decoupling neural generation from symbolic constraint enforcement significantly enhances robustness and efficiency. While validated on CFD, this architecture offers a scalable, auditable paradigm for Trustworthy Artificial Intelligence in broader industrial automation.
[NLP-58] Which Feedback Works for Whom? Differential Effects of LLM-Generated Feedback Elements Across Learner Profiles
【速读】: 该论文旨在解决如何优化由大语言模型(Large Language Models, LLMs)生成的教育反馈效果问题,特别是明确不同反馈元素(如语气和信息覆盖范围)对学习成效与学习者接受度的影响,并探究这些影响是否因学习者的五大性格特质(Big Five personality traits)而异。其解决方案的关键在于:通过系统定义六类反馈元素并基于GPT-5生成针对性反馈,在321名高一学生中开展实验,结合客观学习结果与主观评价指标进行量化分析,并进一步依据人格特质聚类识别反馈接受度的差异模式。研究发现,有效反馈元素具有共通的学习促进机制,但学习者的偏好呈现显著的人格依赖性,从而强调在设计个性化反馈时需动态适配学习者性格特征。
链接: https://arxiv.org/abs/2602.11650
作者: Momoka Furuhashi,Kouta Nakayama,Noboru Kawai,Takashi Kodama,Saku Sugawara,Kyosuke Takami
机构: 未知
类目: Computation and Language (cs.CL)
备注: Under Review
Abstract:Large language models (LLMs) show promise for automatically generating feedback in education settings. However, it remains unclear how specific feedback elements, such as tone and information coverage, contribute to learning outcomes and learner acceptance, particularly across learners with different personality traits. In this study, we define six feedback elements and generate feedback for multiple-choice biology questions using GPT-5. We conduct a learning experiment with 321 first-year high school students and evaluate feedback effectiveness using two learning outcomes measures and subjective evaluations across six criteria. We further analyze differences in how feedback acceptance varies across learners based on Big Five personality traits. Our results show that effective feedback elements share common patterns supporting learning outcomes, while learners’ subjective preferences differ across personality-based clusters. These findings highlight the importance of selecting and adapting feedback elements according to learners’ personality traits when we design LLM-generated feedback, and provide practical implications for personalized feedback design in education.
[NLP-59] PACE: Prefix-Protected and Difficulty-Aware Compression for Efficient Reasoning
【速读】: 该论文旨在解决语言推理模型(Language Reasoning Models, LRMs)在测试时计算扩展过程中出现的“过度思考”问题,即生成过长的推理轨迹,导致延迟和内存消耗增加。现有方法通常采用统一的长度惩罚策略,存在两个局限:其一是在序列层面过度压缩关键早期推理步骤;其二是在群体层面无差别惩罚所有查询,忽视了不同问题难度的差异。解决方案的关键在于提出一个双层框架 PACE,实现分层监督下的前缀保护与难度感知压缩:在序列层面,通过衰减混合滚动优化(decaying mixed rollouts)保留有效推理路径的同时提升简洁性;在群体层面,基于查询复杂度动态调整长度约束(difficulty-aware penalty),对困难问题保持探索能力,对简单问题抑制冗余输出。实验表明,该方法在DeepSeek-R1-Distill-Qwen模型上实现了最高达55.7%的token使用减少,并同步提升数学基准准确率最高达4.1%,且具备跨代码、科学和通用领域的泛化能力。
链接: https://arxiv.org/abs/2602.11639
作者: Ruixiang Feng,Yuntao Wen,Silin Zhou,Ke Shi,Yifan Wang,Ran Le,Zhenwei An,Zongchao Chen,Chen Yang,Guangyue Peng,Yiming Jia,Dongsheng Wang,Tao Zhang,Lisi Chen,Yang Song,Shen Gao,Shuo Shang
机构: University of Electronic Science and Technology of China (电子科技大学); Nanbeige Lab, BOSS Zhipin (BOSS直聘); Peking University (北京大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Language Reasoning Models (LRMs) achieve strong performance by scaling test-time computation but often suffer from “overthinking”, producing excessively long reasoning traces that increase latency and memory usage. Existing LRMs typically enforce conciseness with uniform length penalties, which over-compress crucial early deduction steps at the sequence level and indiscriminately penalize all queries at the group level. To solve these limitations, we propose PACE, a dual-level framework for prefix-protected and difficulty-aware compression under hierarchical supervision. At the sequence level, prefix-protected optimization employs decaying mixed rollouts to maintain valid reasoning paths while promoting conciseness. At the group level, difficulty-aware penalty dynamically scales length constraints based on query complexity, maintaining exploration for harder questions while curbing redundancy on easier ones. Extensive experiments on DeepSeek-R1-Distill-Qwen (1.5B/7B) demonstrate that PACE achieves a substantial reduction in token usage (up to 55.7%) while simultaneously improving accuracy (up to 4.1%) on math benchmarks, with generalization ability to code, science, and general domains.
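其中群体层面的难度感知惩罚,思想上可以用“组内正确率近似容易程度”来实现。以下为一个假设性的示意(缩放形式与系数均非论文原式):

```python
def pace_length_penalty(lengths, corrects, lam=0.2):
    """难度感知长度惩罚的最小示意(缩放方式为假设):
    以同一 query 的 rollout 组内正确率近似其容易程度,
    越容易的 query 对冗长输出的惩罚越重,越难的则保留探索空间。
    返回每条 rollout 的奖励修正项(负值)。
    """
    group_acc = sum(corrects) / len(corrects)   # 组内正确率 ~ 容易程度
    max_len = max(lengths)
    penalties = []
    for L in lengths:
        rel = L / max_len                        # 组内相对长度
        penalties.append(-lam * group_acc * rel)  # 简单题压得更狠
    return penalties
```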
[NLP-60] Scene-Aware Memory Discrimination: Deciding Which Personal Knowledge Stays
【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的用户记忆管理中面临的信息过滤效率低和计算成本高的问题,尤其是在处理大规模用户交互数据时难以建立适应多样化记忆标准的机制。其核心解决方案是提出一种场景感知的记忆判别方法(Scene-Aware Memory Discrimination, SAMD),关键在于两个模块:一是门控单元模块(Gating Unit Module, GUM),通过筛选非可记忆交互内容,聚焦于与应用需求最相关的显著信息以提升处理效率;二是聚类提示模块(Cluster Prompting Module, CPM),根据用户意图与记忆上下文的关系动态构建聚类提示,从而建立自适应的记忆标准,指导LLMs精准区分应保留或丢弃的信息。
链接: https://arxiv.org/abs/2602.11607
作者: Yijie Zhong,Mengying Guo,Zewei Wang,Zhongyang Li,Dandan Tu,Haofen Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted by Knowledge-Based Systems. License: CC BY-NC-ND
Abstract:Intelligent devices have become deeply integrated into everyday life, generating vast amounts of user interactions that form valuable personal knowledge. Efficient organization of this knowledge in user memory is essential for enabling personalized applications. However, current research on memory writing, management, and reading using large language models (LLMs) faces challenges in filtering irrelevant information and in dealing with rising computational costs. Inspired by the concept of selective attention in the human brain, we introduce a memory discrimination task. To address large-scale interactions and diverse memory standards in this task, we propose a Scene-Aware Memory Discrimination method (SAMD), which comprises two key components: the Gating Unit Module (GUM) and the Cluster Prompting Module (CPM). GUM enhances processing efficiency by filtering out non-memorable interactions and focusing on the salient content most relevant to application demands. CPM establishes adaptive memory standards, guiding LLMs to discern what information should be remembered or discarded. It also analyzes the relationship between user intents and memory contexts to build effective clustering prompts. Comprehensive direct and indirect evaluations demonstrate the effectiveness and generalization of our approach. We independently assess the performance of memory discrimination, showing that SAMD successfully recalls the majority of memorable data and remains robust in dynamic scenarios. Furthermore, when integrated into personalized applications, SAMD significantly enhances both the efficiency and quality of memory construction, leading to better organization of personal knowledge.
[NLP-61] PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering
【速读】: 该论文旨在解决当前基于模型的验证器(verifier)在强化学习与可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)中仅关注最终结果与真实值的一致性,而忽视推导过程错误的问题,导致正确答案即使来自错误推导也会被赋予正奖励。为弥补这一缺陷,作者提出PRIME基准,用于评估验证器在数学和工程领域中的“过程-结果对齐”(Process-Outcome Alignment)能力。其关键在于:通过从大学级STEM问题中筛选出2,530个高难度样本构建PRIME,并引入一种基于过程感知的RLVR训练范式,利用PRIME选出的验证器进行训练,显著提升了模型性能(如Qwen3-14B-Base在AIME24、AIME25和Beyond-AIME上分别提升8.29%、9.12%和7.31%),且验证器在PRIME上的准确率与RLVR训练效果呈强线性相关(R²=0.92),证明PRIME可作为可靠验证器选择指标。
链接: https://arxiv.org/abs/2602.11570
作者: Xiangfeng Wang,Hangyu Guo,Yanlin Lai,Mitt Huang,Liang Zhao,Chengyuan Yao,Yinmin Zhang,Qi Han,Xiaoxiao Ren,Chun Yuan,Tong Xu,Zheng Ge,Xiangyu Zhang,Daxin Jiang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:While model-based verifiers are essential for scaling Reinforcement Learning with Verifiable Rewards (RLVR), current outcome-centric verification paradigms primarily focus on the consistency between the final result and the ground truth, often neglecting potential errors in the derivation process. This leads to assigning positive rewards to correct answers produced from incorrect derivations. To bridge this gap, we introduce PRIME, a benchmark for evaluating verifiers on Process-Outcome Alignment verification in Mathematics and Engineering. Curated from a comprehensive collection of college-level STEM problems, PRIME comprises 2,530 high-difficulty samples through a consistency-based filtering pipeline. Through extensive evaluation, we find that current verifiers frequently fail to detect derivation flaws. Furthermore, we propose a process-aware RLVR training paradigm utilizing verifiers selected via PRIME. This approach substantially outperforms the outcome-only verification baseline, achieving absolute performance gains of 8.29%, 9.12%, and 7.31% on AIME24, AIME25, and Beyond-AIME, respectively, for the Qwen3-14B-Base model. Finally, we demonstrate a strong linear correlation (R^2 = 0.92) between verifier accuracy on PRIME and RLVR training effectiveness, validating PRIME as a reliable predictor for verifier selection.
[NLP-62] SIGHT: Reinforcement Learning with Self-Evidence and Information-Gain Diverse Branching for Search Agent
【速读】: 该论文旨在解决基于强化学习(Reinforcement Learning, RL)的大型语言模型(Large Language Models, LLMs)在多轮搜索问答场景中因检索结果冗余高、信噪比低而导致的“隧道视野”(Tunnel Vision)问题,即早期噪声检索结果引发不可逆的错误累积。解决方案的关键在于提出SIGHT框架,其核心机制包括:通过自证据支持(Self-Evidence Support, SES)将搜索结果提炼为高保真证据,并利用信息增益(Information Gain)评分识别不确定性降低最大的关键状态,进而引导动态提示干预(如去重、反思或自适应分支),生成新的高质分支;最终通过组相对策略优化(Group Relative Policy Optimization)融合SES与正确性奖励,使模型内化鲁棒的探索策略,无需外部验证器即可实现高效且准确的多跳推理。
链接: https://arxiv.org/abs/2602.11551
作者: Wenlin Zhong,Jinluan Yang,Yiquan Wu,Yi Liu,Jianhang Yao,Kun Kuang
机构: Zhejiang University (浙江大学); Chongqing Ant Consumer Finance Co., Ltd. (重庆蚂蚁消费金融有限公司); Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL)
备注:
Abstract:Reinforcement Learning (RL) has empowered Large Language Models (LLMs) to master autonomous search for complex question answering. However, particularly within multi-turn search scenarios, this interaction introduces a critical challenge: search results often suffer from high redundancy and low signal-to-noise ratios. Consequently, agents easily fall into “Tunnel Vision,” where the forced interpretation of early noisy retrievals leads to irreversible error accumulation. To address these challenges, we propose SIGHT, a framework that enhances search-based reasoning through Self-Evidence Support (SES) and Information-Gain Driven Diverse Branching. SIGHT distills search results into high-fidelity evidence via SES and calculates an Information Gain score to pinpoint pivotal states where observations maximally reduce uncertainty. This score guides Dynamic Prompting Interventions - including de-duplication, reflection, or adaptive branching - to spawn new branches with SES. Finally, by integrating SES and correctness rewards via Group Relative Policy Optimization, SIGHT internalizes robust exploration strategies without external verifiers. Experiments on single-hop and multi-hop QA benchmarks demonstrate that SIGHT significantly outperforms existing approaches, particularly in complex reasoning scenarios, using fewer search steps.
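摘要中的信息增益打分,本质是度量一次检索观测带来的不确定性下降。以下为概念示意(答案分布 probs_* 的获取方式为假设,例如对同一问题多次采样答案后统计频率):

```python
import math

def answer_entropy(answer_probs):
    """候选答案分布的熵,作为当前状态不确定性的度量。"""
    return -sum(p * math.log(p + 1e-12) for p in answer_probs)

def information_gain(probs_before, probs_after):
    """SIGHT 信息增益的示意:一次检索观测前后,
    模型对候选答案置信分布的熵之差。
    增益高的状态被视为关键节点,可触发去重、反思
    或自适应分支等动态提示干预。
    """
    return answer_entropy(probs_before) - answer_entropy(probs_after)
```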
[NLP-63] Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)预训练过程中对高内存GPU集群的依赖问题,尤其是在去中心化训练场景下仍受限于单节点GPU显存容量的挑战。其核心解决方案是提出SParse Expert Synchronization (SPES)框架,关键在于通过在每个训练节点上仅激活并训练专家网络(Expert)的一个子集,显著降低内存占用;同时,节点间定期同步本地专家参数,避免传输全部模型参数,从而实现高效知识共享与收敛。此外,引入专家合并预热策略(expert-merging warm-up),在训练早期促进专家间知识交换,加速模型能力建立。该方法使得在仅使用16块48GB GPU的分布式环境下即可完成2B参数MoE模型的预训练,并达到与集中式训练相当的性能表现。
链接: https://arxiv.org/abs/2602.11543
作者: Jinrui Zhang,Chaodong Xiao,Aoqi Wu,Xindong Zhang,Lei Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Pretraining large language models (LLMs) typically requires centralized clusters with thousands of high-memory GPUs (e.g., H100/A100). Recent decentralized training methods reduce communication overhead by employing federated optimization; however, they still need to train the entire model on each node, remaining constrained by GPU memory limitations. In this work, we propose SParse Expert Synchronization (SPES), a memory-efficient decentralized framework for pretraining mixture-of-experts (MoE) LLMs. SPES trains only a subset of experts per node, substantially lowering the memory footprint. Each node updates its local experts and periodically synchronizes with other nodes, eliminating full-parameter transmission while ensuring efficient knowledge sharing. To accelerate convergence, we introduce an expert-merging warm-up strategy, where experts exchange knowledge early in training, to rapidly establish foundational capabilities. With SPES, we train a 2B-parameter MoE LLM using 16 standalone 48GB GPUs over internet connections, which achieves competitive performance with centrally trained LLMs under similar computational budgets. We further demonstrate scalability by training a 7B model from scratch and a 9B model upcycled from a dense checkpoint, both of which match prior centralized baselines. Our code is available at this https URL.
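SPES 的关键是只同步各节点持有的专家参数,而非全量模型。下面用就地平均模拟这一同步步骤(数据结构与通信细节均为假设,真实系统中应替换为跨节点的参数交换):

```python
import torch

def spes_sync(nodes, expert_ids):
    """SPES 周期性专家同步的示意:
    每个 node 只持有部分专家(node["experts"]: dict[int, Tensor]),
    同步时仅对被多个节点持有的专家做参数平均,
    从而避免传输全量模型参数。
    """
    for eid in expert_ids:
        holders = [n for n in nodes if eid in n["experts"]]
        if len(holders) < 2:
            continue  # 该专家只有单节点持有,无需同步
        avg = torch.stack([n["experts"][eid] for n in holders]).mean(dim=0)
        for n in holders:
            n["experts"][eid] = avg.clone()
```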
[NLP-64] Stop Tracking Me! Proactive Defense Against Attribute Inference Attack in LLMs ICLR2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)从用户生成文本中推断敏感隐私属性(如年龄、位置、性别)所引发的大规模隐私泄露问题。现有基于匿名化的防御方法存在粒度粗、缺乏词级精度的问题,且因仅通过修改文本隐藏敏感线索,仍无法阻止模型利用推理能力进行属性推断。其解决方案的关键在于提出一个统一框架TRACE-RPS:其中TRACE模块利用注意力机制和推理链生成实现细粒度的隐私泄露元素识别与匿名化;RPS模块则采用轻量级两阶段优化策略诱导模型拒绝行为,从而从根本上阻断属性推断路径。该方案在多种开源模型上将属性推断准确率从约50%降至5%以下,并展现出良好的跨模型泛化能力、对抗提示变异的鲁棒性以及隐私与实用性的良好权衡。
链接: https://arxiv.org/abs/2602.11528
作者: Dong Yan,Jian Liang,Ran He,Tieniu Tan
机构: University of Chinese Academy of Sciences (中国科学院大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Nanjing University (南京大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at ICLR 2026
Abstract:Recent studies have shown that large language models (LLMs) can infer private user attributes (e.g., age, location, gender) from user-generated text shared online, enabling rapid and large-scale privacy breaches. Existing anonymization-based defenses are coarse-grained, lacking word-level precision in anonymizing privacy-leaking elements. Moreover, they are inherently limited as altering user text to hide sensitive cues still allows attribute inference to occur through models’ reasoning capabilities. To address these limitations, we propose a unified defense framework that combines fine-grained anonymization (TRACE) with inference-preventing optimization (RPS). TRACE leverages attention mechanisms and inference chain generation to identify and anonymize privacy-leaking textual elements, while RPS employs a lightweight two-stage optimization strategy to induce model rejection behaviors, thereby preventing attribute inference. Evaluations across diverse LLMs show that TRACE-RPS reduces attribute inference accuracy from around 50% to below 5% on open-source models. In addition, our approach offers strong cross-model generalization, prompt-variation robustness, and utility-privacy tradeoffs. Our code is available at this https URL.
[NLP-65] Adaptive Milestone Reward for GUI Agents
【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)在训练移动图形用户界面(Mobile GUI)智能体时面临的长期任务中时间信用分配(temporal credit assignment)难题。具体而言,现有方法在奖励保真度(outcome reward)与奖励密度(process reward)之间存在权衡:前者信号稀疏,后者易受偏差和奖励黑客(reward hacking)影响。解决方案的关键在于提出自适应里程碑奖励机制(Adaptive Milestone Reward, ADMIRE),其通过从成功探索中动态提炼里程碑来构建可验证的自适应奖励系统,并引入非对称信用分配策略,从而对成功轨迹进行去噪并为失败轨迹提供结构化引导,显著提升任务成功率与算法鲁棒性。
链接: https://arxiv.org/abs/2602.11524
作者: Congmin Zheng,Xiaoyun Mo,Xinbei Ma,Qiqiang Lin,Yin Zhao,Jiachen Zhu,Xingyu Lou,Jun Wang,Zhaoxiang Wang,Weiwen Liu,Zhuosheng Zhang,Yong Yu,Weinan Zhang
机构: Shanghai Jiao Tong University (上海交通大学); OPPO Research Institute (OPPO研究院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Reinforcement Learning (RL) has emerged as a mainstream paradigm for training Mobile GUI Agents, yet it struggles with the temporal credit assignment problem inherent in long-horizon tasks. A primary challenge lies in the trade-off between reward fidelity and density: outcome reward offers high fidelity but suffers from signal sparsity, while process reward provides dense supervision but remains prone to bias and reward hacking. To resolve this conflict, we propose the Adaptive Milestone Reward (ADMIRE) mechanism. ADMIRE constructs a verifiable, adaptive reward system by anchoring trajectory to milestones, which are dynamically distilled from successful explorations. Crucially, ADMIRE integrates an asymmetric credit assignment strategy that denoises successful trajectories and scaffolds failed trajectories. Extensive experiments demonstrate that ADMIRE consistently yields over 10% absolute improvement in success rate across different base models on AndroidWorld. Moreover, the method exhibits robust generalizability, achieving strong performance across diverse RL algorithms and heterogeneous environments such as web navigation and embodied tasks.
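ADMIRE 的非对称信用分配可以浓缩为一个小函数:成功轨迹以高保真的结果奖励为主,失败轨迹按顺序命中的里程碑比例获得过程奖励。以下为示意(权重与里程碑谓词的形式均为假设):

```python
def admire_reward(trajectory_states, milestones, success, w_mile=0.5):
    """ADMIRE 奖励示意(w_mile 与匹配函数为假设):
    milestones 为从成功探索中动态蒸馏出的关键状态序列,
    每个里程碑是一个可验证谓词,按顺序匹配轨迹中的状态。
    """
    hit, idx = 0, 0
    for s in trajectory_states:
        if idx < len(milestones) and milestones[idx](s):
            hit += 1
            idx += 1
    progress = hit / max(len(milestones), 1)
    if success:
        return 1.0                # 成功:以结果奖励为主(去噪)
    return w_mile * progress      # 失败:里程碑提供稠密引导信号
```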
[NLP-66] Multimodal Fact-Level Attribution for Verifiable Reasoning
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在复杂推理任务中缺乏事实级可追溯性的问题,即模型输出虽可能逻辑正确,但其依据的跨模态证据(如视频、音频等)无法被精确标注与验证。解决方案的关键在于提出MuRGAt(Multimodal Reasoning with Grounded Attribution)基准,该基准要求模型生成包含显式推理链和精确引用的响应,其中每条引用明确指定来源模态及时间片段,并配套开发了一套与人工评估高度一致的自动化评估框架,从而首次实现了对多模态推理中事实溯源能力的可靠量化测评。
链接: https://arxiv.org/abs/2602.11509
作者: David Wan,Han Wang,Ziyang Wang,Elias Stengel-Eskin,Hyunji Lee,Mohit Bansal
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 29 pages. Code and data are available at this https URL
Abstract:Multimodal large language models (MLLMs) are increasingly used for real-world tasks involving multi-step reasoning and long-form generation, where reliability requires grounding model outputs in heterogeneous input sources and verifying individual factual claims. However, existing multimodal grounding benchmarks and evaluation methods focus on simplified, observation-based scenarios or limited modalities and fail to assess attribution in complex multimodal reasoning. We introduce MuRGAt (Multimodal Reasoning with Grounded Attribution), a benchmark for evaluating fact-level multimodal attribution in settings that require reasoning beyond direct observation. Given inputs spanning video, audio, and other modalities, MuRGAt requires models to generate answers with explicit reasoning and precise citations, where each citation specifies both modality and temporal segments. To enable reliable assessment, we introduce an automatic evaluation framework that strongly correlates with human judgments. Benchmarking with human and automated scores reveals that even strong MLLMs frequently hallucinate citations despite correct reasoning. Moreover, we observe a key trade-off: increasing reasoning depth or enforcing structured grounding often degrades accuracy, highlighting a significant gap between internal reasoning and verifiable attribution.
[NLP-67] Jailbreaking Leaves a Trace: Understanding and Detecting Jailbreak Attacks from Internal Representations of Large Language Models
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际部署中面临的越狱攻击(Jailbreaking)安全问题,即恶意用户通过精心设计的提示诱导模型生成有害或受限内容,而现有基于提示层面的防御机制难以应对不断演进的攻击策略。解决方案的关键在于从模型内部表征的角度出发,通过系统性地分析不同层的隐藏激活状态,识别出与越狱输入相关的稳定潜在空间模式,并提出一种基于张量的潜在表示框架,该框架无需微调模型或依赖辅助LLM检测器即可实现轻量级的越狱检测;进一步地,利用这些潜在信号可在推理阶段主动干扰越狱执行——例如在特定高敏感层进行选择性干预,实验表明在LLaMA-3.1-8B模型上可阻断78%的越狱尝试,同时保持94%良性提示的正常响应,且仅引入极低计算开销,为提升LLM安全性提供了一种架构无关、可扩展的新路径。
链接: https://arxiv.org/abs/2602.11495
作者: Sri Durga Sai Sowmya Kadali,Evangelos E. Papalexakis
机构: University of California, Riverside(加州大学河滨分校)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:
Abstract:Jailbreaking large language models (LLMs) has emerged as a critical security challenge with the widespread deployment of conversational AI systems. Adversarial users exploit these models through carefully crafted prompts to elicit restricted or unsafe outputs, a phenomenon commonly referred to as Jailbreaking. Despite numerous proposed defense mechanisms, attackers continue to develop adaptive prompting strategies, and existing models remain vulnerable. This motivates approaches that examine the internal behavior of LLMs rather than relying solely on prompt-level defenses. In this work, we study jailbreaking from both security and interpretability perspectives by analyzing how internal representations differ between jailbreak and benign prompts. We conduct a systematic layer-wise analysis across multiple open-source models, including GPT-J, LLaMA, Mistral, and the state-space model Mamba, and identify consistent latent-space patterns associated with harmful inputs. We then propose a tensor-based latent representation framework that captures structure in hidden activations and enables lightweight jailbreak detection without model fine-tuning or auxiliary LLM-based detectors. We further demonstrate that the latent signals can be used to actively disrupt jailbreak execution at inference time. On an abliterated LLaMA-3.1-8B model, selectively bypassing high-susceptibility layers blocks 78% of jailbreak attempts while preserving benign behavior on 94% of benign prompts. This intervention operates entirely at inference time and introduces minimal overhead, providing a scalable foundation for achieving stronger coverage by incorporating additional attack distributions or more refined susceptibility thresholds. Our results provide evidence that jailbreak behavior is rooted in identifiable internal structures and suggest a complementary, architecture-agnostic direction for improving LLM security.
[NLP-68] When Audio-LLMs Don't Listen: A Cross-Linguistic Study of Modality Arbitration
【速读】: 该论文旨在解决多模态语言模型在音频与文本信息冲突时存在的“文本主导”(text dominance)问题,即模型更倾向于信任文本而非音频输入,即使音频质量更高且包含更多信息。其关键解决方案在于提出一个全新的解释框架:文本主导并非源于信息内容的差异,而是由于模型在仲裁不同模态时对文本表示的推理可及性(arbitration accessibility)更高。研究通过ALME基准测试和多种干预实验(如强制转录、语义重构、微调策略)验证了这一假设,发现将注意力集中在语言模型(LLM)的推理机制上(如使用LoRA微调)能显著降低文本主导现象,而音频编码器的优化反而可能加剧该问题,从而明确指出模态仲裁是一个独立于传统语音识别准确率的可靠性维度。
链接: https://arxiv.org/abs/2602.11488
作者: Jayadev Billa
机构: ISI@USC(信息科学研究所@南加州大学); Yahoo(雅虎); Nuance; BBN
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 25 pages, 18 tables, 8 languages, benchmark and code at this https URL
Abstract:When audio and text conflict, speech-enabled language models follow the text 10 times more often than when arbitrating between two text sources, even when explicitly instructed to trust the audio. Using ALME, a benchmark of 57,602 controlled audio-text conflict stimuli across 8 languages, we find that Gemini 2.0 Flash exhibits 16.6% text dominance under audio-text conflict versus 1.6% under text-text conflict with identical reliability cues. This gap is not explained by audio quality: audio-only accuracy (97.2%) exceeds cascade accuracy (93.9%), indicating audio embeddings preserve more information than text transcripts. We propose that text dominance reflects an asymmetry not in information content but in arbitration accessibility: how easily the model can reason over competing representations. This framework explains otherwise puzzling findings. Forcing transcription before answering increases text dominance (19% to 33%), sacrificing audio’s information advantage without improving accessibility. Framing text as “deliberately corrupted” reduces text dominance by 80%. A fine-tuning ablation provides interventional evidence: training only the audio projection layer increases text dominance (+26.5%), while LoRA on the language model halves it (-23.9%), localizing text dominance to the LLM’s reasoning rather than the audio encoder. Experiments across four state-of-the-art audio-LLMs and 8 languages show consistent trends with substantial cross-linguistic and cross-model variation, establishing modality arbitration as a distinct reliability dimension not captured by standard speech benchmarks.
[NLP-69] ADRD-Bench: A Preliminary LLM Benchmark for Alzheimer's Disease and Related Dementias
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在阿尔茨海默病及相关痴呆症(Alzheimer’s Disease and Related Dementias, ADRD)领域评估覆盖不足的问题。现有评测基准对ADRD相关知识和实际照护场景的涵盖有限,导致模型在真实临床与日常照护情境中的表现难以准确衡量。解决方案的关键在于构建首个专注于ADRD领域的综合性评测数据集——ADRD-Bench,其核心由两部分组成:一是整合自七个权威医学基准的1,352道统一问答题(ADRD Unified QA),用于系统评估模型的临床知识掌握能力;二是基于广泛使用的循证脑健康管理项目“老龄化大脑照护”(Aging Brain Care, ABC)开发的149道照护情境问题(ADRD Caregiving QA),以填补现有基准中实践照护背景缺失的空白。该设计显著提升了LLMs在ADRD领域知识准确性与照护推理能力的评估效度,为后续针对性优化提供了可量化、可比较的基准。
链接: https://arxiv.org/abs/2602.11460
作者: Guangxin Zhao,Jiahao Zheng,Malaz Boustani,Jarek Nabrzyski,Meng Jiang,Yiyu Shi,Zhi Zheng
机构: University of Notre Dame (圣母大学); Indiana University (印第安纳大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) have shown great potential for healthcare applications. However, existing evaluation benchmarks provide minimal coverage of Alzheimer’s Disease and Related Dementias (ADRD). To address this gap, we introduce ADRD-Bench, the first ADRD-specific benchmark dataset designed for rigorous evaluation of LLMs. ADRD-Bench has two components: 1) ADRD Unified QA, a synthesis of 1,352 questions consolidated from seven established medical benchmarks, providing a unified assessment of clinical knowledge; and 2) ADRD Caregiving QA, a novel set of 149 questions derived from the Aging Brain Care (ABC) program, a widely used, evidence-based brain health management program. Guided by a program with national expertise in comprehensive ADRD care, this new set was designed to mitigate the lack of practical caregiving context in existing benchmarks. We evaluated 33 state-of-the-art LLMs on the proposed ADRD-Bench. Results showed that the accuracy of open-weight general models ranged from 0.63 to 0.93 (mean: 0.78; std: 0.09). The accuracy of open-weight medical models ranged from 0.48 to 0.93 (mean: 0.82; std: 0.13). The accuracy of closed-source general models ranged from 0.83 to 0.91 (mean: 0.89; std: 0.03). While top-tier models achieved high accuracies (>0.9), case studies revealed that inconsistent reasoning quality and stability limit their reliability, highlighting a critical need for domain-specific improvement to enhance LLMs’ knowledge and reasoning grounded in daily caregiving data. The entire dataset is available at this https URL.
[NLP-70] LoopFormer: Elastic-Depth Looped Transformers for Latent Reasoning via Shortcut Modulation ICLR2026
【速读】: 该论文旨在解决循环Transformer(Looped Transformer)在训练和推理过程中固定迭代次数导致无法灵活适应不同计算预算的问题。现有方法虽在算法和推理任务中表现优异,但缺乏对计算资源动态调整的响应能力,限制了其在实际部署中的实用性。解决方案的关键在于提出一种短路一致性训练(shortcut-consistency training)机制,该机制通过约束不同长度轨迹间的表示一致性,使较短循环仍能产生有信息量的中间表示,而较长循环则持续优化这些表示,从而实现计算深度的自适应控制。此外,模型在每轮迭代中显式依赖当前时间和步长(step size),确保跨不同长度轨迹的表示演化保持一致,避免漂移或停滞现象。实验证明,该方法在语言建模与推理基准测试中即使在极端计算约束下仍具鲁棒性,并能随预算增加平滑扩展。
链接: https://arxiv.org/abs/2602.11451
作者: Ahmadreza Jeddi,Marco Ciccone,Babak Taati
机构: University of Toronto (多伦多大学); Vector Institute (向量研究所); University Health Network (大学健康网络)
类目: Computation and Language (cs.CL)
备注: ICLR2026
Abstract:Looped Transformers have emerged as an efficient and powerful class of models for reasoning in the language domain. Recent studies show that these models achieve strong performance on algorithmic and reasoning tasks, suggesting that looped architectures possess an inductive bias toward latent reasoning. However, prior approaches fix the number of loop iterations during training and inference, leaving open the question of whether these models can flexibly adapt their computational depth under variable compute budgets. We introduce LoopFormer, a looped Transformer trained on variable-length trajectories to enable budget-conditioned reasoning. Our core contribution is a shortcut-consistency training scheme that aligns trajectories of different lengths, ensuring that shorter loops yield informative representations while longer loops continue to refine them. LoopFormer conditions each loop on the current time and step size, enabling representations to evolve consistently across trajectories of varying length rather than drifting or stagnating. Empirically, LoopFormer demonstrates robust performance on language modeling and reasoning benchmarks even under aggressive compute constraints, while scaling gracefully with additional budget. These results show that looped Transformers are inherently suited for adaptive language modeling, opening a path toward controllable and budget-aware large language models.
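LoopFormer 的要点是让每次循环显式依赖当前时间 t 与步长 dt,使不同长度的轨迹(如 4 步 x 0.25 与 8 步 x 0.125)可以对齐,这正是 shortcut-consistency 训练所依赖的条件化方式。以下为结构示意(模块组成与条件注入方式均为假设,非官方实现):

```python
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    """循环 Transformer 的最小示意:同一个 block 被迭代调用,
    每次迭代将 (t, dt) 映射为条件偏置加到输入上。
    """
    def __init__(self, d_model=256, n_head=4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, n_head,
                                                batch_first=True)
        self.cond = nn.Linear(2, d_model)  # 将 (t, dt) 映射为条件向量

    def forward(self, x, n_loops):
        dt = 1.0 / n_loops
        for i in range(n_loops):
            t = torch.tensor([[i * dt, dt]], dtype=x.dtype, device=x.device)
            x = self.block(x + self.cond(t))  # 条件向量广播到 (B, L, d)
        return x

# 推理时可按算力预算选择 n_loops,例如 model(x, n_loops=2) 或 n_loops=8
```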
[NLP-71] Towards Reliable Machine Translation: Scaling LLMs for Critical Error Detection and Safety ECIR2026
Quick read: This paper targets critical meaning errors in machine translation (MT), such as factual distortions, intent reversals, or biased translations, which threaten the reliability, fairness, and safety of multilingual systems. The key to the solution is using instruction-tuned large language models (LLMs) for error detection: model scaling and adaptation strategies (zero-shot, few-shot, fine-tuning) yield consistent improvements over encoder-only baselines such as XLM-R and ModernBERT. The study argues that stronger critical-error detection reduces the risks of disinformation, miscommunication, and linguistic harm, supporting safer, more trustworthy, and socially accountable multilingual AI systems.
Link: https://arxiv.org/abs/2602.11444
Authors: Muskaan Chopra, Lorenz Sparrenberg, Rafet Sifa
Affiliations: Not listed
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at ECIR 2026
Abstract:Machine Translation (MT) plays a pivotal role in cross-lingual information access, public policy communication, and equitable knowledge dissemination. However, critical meaning errors, such as factual distortions, intent reversals, or biased translations, can undermine the reliability, fairness, and safety of multilingual systems. In this work, we explore the capacity of instruction-tuned Large Language Models (LLMs) to detect such critical errors, evaluating models across a range of parameters using the publicly accessible data sets. Our findings show that model scaling and adaptation strategies (zero-shot, few-shot, fine-tuning) yield consistent improvements, outperforming encoder-only baselines like XLM-R and ModernBERT. We argue that improving critical error detection in MT contributes to safer, more trustworthy, and socially accountable information systems by reducing the risk of disinformation, miscommunication, and linguistic harm, especially in high-stakes or underrepresented contexts. This work positions error detection not merely as a technical challenge, but as a necessary safeguard in the pursuit of just and responsible multilingual AI. The code will be made available at GitHub.
[NLP-72] Gradients Must Earn Their Influence: Unifying SFT with Generalized Entropic Objectives
Quick read: This paper addresses the plasticity-stability dilemma caused by the uniform token-level weighting of the standard negative log-likelihood (NLL) objective in supervised fine-tuning (SFT): uniform weighting amplifies gradients on noisy labels when the model's predictive confidence is low, disrupting robust priors, while providing too little sharpening once the model is already confident. The authors unify token-level SFT objectives within a generalized deformed-log family, exposing a universal "gate × error" gradient structure in which the gate controls how much the model trusts its current prediction. The key to the solution is Dynamic Entropy Fine-Tuning (DEFT), which uses Rényi-2 entropy as a proxy for distribution concentration to modulate the trust gate dynamically and without extra parameters, enabling a continuous focus trajectory between uncertain novel concepts and well-established knowledge, and achieving a better balance between exploration and exploitation with improved overall performance.
Link: https://arxiv.org/abs/2602.11424
Authors: Zecheng Wang, Deyuan Liu, Chunshan Li, Yupeng Zhang, Zhengyun Zhao, Dianhui Chu, Bingning Wang, Dianbo Sui
Affiliations: Harbin Institute of Technology; WeChat, Tencent; Tsinghua University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Standard negative log-likelihood (NLL) for Supervised Fine-Tuning (SFT) applies uniform token-level weighting. This rigidity creates a two-fold failure mode: (i) overemphasizing low-probability targets can amplify gradients on noisy supervision and disrupt robust priors, and (ii) uniform weighting provides weak sharpening when the model is already confident. Existing methods fail to resolve the resulting plasticity–stability dilemma, often suppressing necessary learning signals alongside harmful ones. To address this issue, we unify token-level SFT objectives within a generalized deformed-log family and expose a universal gate \times error gradient structure, where the gate controls how much the model trusts its current prediction. By employing the Cayley transform, we map the model’s continuously evolving uncertainty onto a continuous focus trajectory, which enables seamless interpolation between scenarios involving uncertain novel concepts and those involving well-established knowledge. We then introduce Dynamic Entropy Fine-Tuning (DEFT), a parameter-free objective that modulates the trust gate using distribution concentration (Rényi-2 entropy) as a practical proxy for the model’s predictive state. Extensive experiments and analyses demonstrate that DEFT achieves a better balance between exploration and exploitation, leading to improved overall performance.
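To make the gate × error structure concrete, here is a minimal sketch that weights each token's NLL by a trust gate derived from Rényi-2 entropy, as the abstract describes. The entropy normalization and the gating direction (damping uncertain tokens, keeping sharpening pressure on confident ones) are our illustrative assumptions; the paper's Cayley-transform gate is not reproduced.

```python
import torch
import torch.nn.functional as F

def gated_sft_loss(logits, targets):
    """Token-level 'gate x error' objective: NLL weighted by a trust gate
    computed from Renyi-2 entropy H2 = -log(sum_i p_i^2)."""
    log_p = F.log_softmax(logits, dim=-1)                     # [batch, seq, vocab]
    nll = -log_p.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # [batch, seq]
    p = log_p.exp()
    h2 = -torch.log((p * p).sum(dim=-1).clamp_min(1e-9))      # Renyi-2 entropy
    h_max = torch.log(torch.tensor(float(logits.size(-1))))   # entropy of uniform dist.
    # Assumed gating direction: peaked (confident) predictions -> gate near 1,
    # flat (uncertain) predictions -> gate near 0, damping noisy supervision.
    gate = 1.0 - h2 / h_max
    return (gate.detach() * nll).mean()
```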
[NLP-73] Advancing AI Trustworthiness Through Patient Simulation: Risk Assessment of Conversational Agents for Antidepressant Selection
Quick read: This paper tackles the difficulty of performing large-scale, automated, and controllable evaluation of healthcare conversational agents in realistic clinical settings, where data scarcity, subjective evaluation, and the inability to systematically simulate diverse patient characteristics leave error types (such as hallucinations or inaccuracies) and risk patterns under-identified. The key to the solution is a patient simulator grounded in the NIST AI Risk Management Framework that integrates three controllable profile dimensions: (1) medical profiles built from electronic health records (EHRs); (2) linguistic profiles modeling variation in health literacy and condition-specific communication patterns; and (3) behavioral profiles capturing empirically observed cooperative, distracted, and adversarial interaction patterns. By generating structured, diverse conversations, the simulator enables systematic, quantitative analysis of an AI decision aid's performance across patient populations, revealing a monotonic relationship between health literacy level and model accuracy and offering a verifiable path for assessing the safety and fairness of medical AI systems.
Link: https://arxiv.org/abs/2602.11391
Authors: Md Tanvir Rouf Shawon, Mohammad Sabik Irbaz, Hadeel R. A. Elyazori, Keerti Reddy Resapu, Yili Lin, Vladimir Franzuela Cardenas, Farrokh Alemi, Kevin Lybarger
Affiliations: Not listed
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Objective: This paper introduces a patient simulator designed to enable scalable, automated evaluation of healthcare conversational agents. The simulator generates realistic, controllable patient interactions that systematically vary across medical, linguistic, and behavioral dimensions, allowing annotators and an independent AI judge to assess agent performance, identify hallucinations and inaccuracies, and characterize risk patterns across diverse patient populations. Methods: The simulator is grounded in the NIST AI Risk Management Framework and integrates three profile components reflecting different dimensions of patient variation: (1) medical profiles constructed from electronic health records in the All of Us Research Program; (2) linguistic profiles modeling variation in health literacy and condition-specific communication patterns; and (3) behavioral profiles representing empirically observed interaction patterns, including cooperation, distraction, and adversarial engagement. We evaluated the simulator’s effectiveness in identifying errors in an AI decision aid for antidepressant selection. Results: We generated 500 conversations between the patient simulator and the AI decision aid across systematic combinations of five linguistic and three behavioral profiles. Human annotators assessed 1,787 medical concepts across 100 conversations, achieving high agreement (F1=0.94, \kappa=0.73), and the LLM judge achieved comparable agreement with human annotators (F1=0.94, \kappa=0.78; paired bootstrap p=0.21). The simulator revealed a monotonic degradation in AI decision aid performance across the health literacy spectrum: rank-one concept retrieval accuracy increased from 47.9% for limited health literacy to 69.1% for functional and 81.6% for proficient.
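The 500 conversations arise from crossing profile dimensions. Below is a tiny sketch of the 5 × 3 profile grid; the label names are placeholders, not the paper's exact taxonomy.

```python
from itertools import product

# Hypothetical labels standing in for the paper's 5 linguistic x 3 behavioral profiles.
linguistic = ["limited", "basic", "functional", "adequate", "proficient"]
behavioral = ["cooperative", "distracted", "adversarial"]

profile_grid = [
    {"linguistic": lit, "behavioral": beh}
    for lit, beh in product(linguistic, behavioral)
]
# 15 combinations; pairing them with sampled medical profiles yields the 500 simulated runs.
```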
[NLP-74] Sparse Semantic Dimension as a Generalization Certificate for LLMs
Quick read: This paper asks why large language models (LLMs) generalize well in practice even though their parameter counts vastly exceed the number of training tokens, which classical statistical learning theory predicts should cause overfitting. The key idea is that effective capacity is governed by the geometry of the model's internal representations rather than total parameter count: activation states lie on a low-dimensional, sparse manifold. The authors therefore introduce the Sparse Semantic Dimension (SSD), a complexity measure derived from the active feature vocabulary of a sparse autoencoder (SAE) trained on the model's layers. Treating the LLM and SAE as frozen black boxes, they attribute generalization to dictionary sparsity rather than parameter count, validate non-vacuous bounds on GPT-2 Small and Gemma-2B, and uncover a counter-intuitive "feature sharpness" scaling law: the larger model needs far fewer calibration samples to identify its sparse semantic structure. The framework also serves as a reliable safety monitor, quantifying epistemic uncertainty via the measurable "feature explosion" triggered by out-of-distribution inputs.
Link: https://arxiv.org/abs/2602.11388
Authors: Dibyanayan Bandyopadhyay, Asif Ekbal
Affiliations: Not listed
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Work in progress (17 pages)
Abstract:Standard statistical learning theory predicts that Large Language Models (LLMs) should overfit because their parameter counts vastly exceed the number of training tokens. Yet, in practice, they generalize robustly. We propose that the effective capacity controlling generalization lies in the geometry of the model’s internal representations: while the parameter space is high-dimensional, the activation states lie on a low-dimensional, sparse manifold. To formalize this, we introduce the Sparse Semantic Dimension (SSD), a complexity measure derived from the active feature vocabulary of a Sparse Autoencoder (SAE) trained on the model’s layers. Treating the LLM and SAE as frozen oracles, we utilize this framework to attribute the model’s generalization capabilities to the sparsity of the dictionary rather than the total parameter count. Empirically, we validate this framework on GPT-2 Small and Gemma-2B, demonstrating that our bound provides non-vacuous certificates at realistic sample sizes. Crucially, we uncover a counter-intuitive “feature sharpness” scaling law: despite being an order of magnitude larger, Gemma-2B requires significantly fewer calibration samples to identify its active manifold compared to GPT-2, suggesting that larger models learn more compressible, distinct semantic structures. Finally, we show that this framework functions as a reliable safety monitor: out-of-distribution inputs trigger a measurable “feature explosion” (a sharp spike in active features), effectively signaling epistemic uncertainty through learned feature violation. Code is available at: this https URL.
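One plausible operationalization of SSD, counting how many SAE features ever fire on a calibration set, can be sketched in a few lines. Here `sae_encode` is an assumed encoder function over numpy arrays, and the simple thresholding rule is our simplification of the paper's measure.

```python
import numpy as np

def sparse_semantic_dimension(sae_encode, activations, threshold=0.0):
    """Estimate the active feature vocabulary of an SAE over a calibration set.

    sae_encode: maps model activations [n_samples, d_model] -> codes [n_samples, d_sae]
    Returns the number of distinct features that fire on at least one sample,
    a simple proxy for the SSD complexity measure described in the abstract.
    """
    codes = sae_encode(activations)            # [n_samples, n_features]
    active = (codes > threshold).any(axis=0)   # feature fires at least once
    return int(active.sum())
```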
[NLP-75] The Energy of Falsehood: Detecting Hallucinations via Diffusion Model Likelihoods
Quick read: This paper addresses the problem that LLMs frequently generate plausible but factually incorrect hallucinations, which conventional uncertainty metrics miss when the model is confidently wrong. The key to the solution is DiffuTruth, a framework that recasts fact verification as a non-equilibrium thermodynamics problem: factual truths act as stable attractors on a generative manifold, while hallucinations are unstable. A Generative Stress Test corrupts a claim with noise and reconstructs it with a discrete text-diffusion model, and Semantic Energy, the semantic divergence between the original claim and its reconstruction as judged by an NLI critic, isolates deep factual contradictions. Fusing this stability signal with discriminative confidence (Hybrid Calibration) further improves unsupervised factual accuracy and zero-shot generalization.
Link: https://arxiv.org/abs/2602.11364
Authors: Arpit Singh Gautam, Kailash Talreja, Saurabh Jha
Affiliations: Dell Technologies
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) frequently hallucinate plausible but incorrect assertions, a vulnerability often missed by uncertainty metrics when models are confidently wrong. We propose DiffuTruth, an unsupervised framework that reconceptualizes fact verification via non-equilibrium thermodynamics, positing that factual truths act as stable attractors on a generative manifold while hallucinations are unstable. We introduce the Generative Stress Test: claims are corrupted with noise and reconstructed using a discrete text diffusion model. We define Semantic Energy, a metric measuring the semantic divergence between the original claim and its reconstruction using an NLI critic. Unlike vector-space errors, Semantic Energy isolates deep factual contradictions. We further propose a Hybrid Calibration fusing this stability signal with discriminative confidence. Extensive experiments on FEVER demonstrate DiffuTruth achieves a state-of-the-art unsupervised AUROC of 0.725, outperforming baselines by 1.5 percent through the correction of overconfident predictions. Furthermore, we show superior zero-shot generalization on the multi-hop HOVER dataset, outperforming baselines by over 4 percent, confirming the robustness of thermodynamic truth properties to distribution shifts.
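The generative stress test reduces to a short loop, sketched below with assumed callables (`corrupt`, `reconstruct`, `nli_contradiction`) standing in for the masking procedure, the discrete diffusion model, and the NLI critic; none of these names come from the paper.

```python
def semantic_energy(claim, corrupt, reconstruct, nli_contradiction, n_trials=8):
    """Generative stress test: corrupt the claim, reconstruct it with a
    discrete text-diffusion model, and score semantic divergence with an
    NLI critic. Stable (factual) claims should reconstruct to something
    the critic finds consistent; hallucinations drift under the noise."""
    energies = []
    for _ in range(n_trials):
        noisy = corrupt(claim)                  # e.g., mask a fraction of tokens
        rebuilt = reconstruct(noisy)            # diffusion-model denoising
        energies.append(nli_contradiction(claim, rebuilt))  # contradiction prob. in [0, 1]
    return sum(energies) / len(energies)        # high energy -> unstable -> likely hallucination
```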
[NLP-76] Finding the Cracks: Improving LLMs' Reasoning with Paraphrastic Probing and Consistency Verification
Quick read: This paper addresses performance degradation in large language models (LLMs) on complex reasoning tasks caused by hallucinations and the accumulation of errors in intermediate steps. The key to the solution is the Paraphrastic Probing and Consistency Verification (PPCV) framework, which identifies "critical tokens" that strongly influence subsequent reasoning and operates in two stages: the first locates critical tokens from mismatches between reasoning paths for the original question and its paraphrases; the second substitutes critical tokens with candidate alternatives and rolls out new reasoning paths for both the original and paraphrased questions, determining the final answer by the consistency of outputs across these parallel reasoning processes, thereby improving reasoning accuracy and robustness.
Link: https://arxiv.org/abs/2602.11361
Authors: Weili Shi, Dongliang Guo, Lehan Yang, Tianlong Wang, Hanzhang Yuan, Sheng Li
Affiliations: Not listed
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Large language models have demonstrated impressive performance across a variety of reasoning tasks. However, their problem-solving ability often declines on more complex tasks due to hallucinations and the accumulation of errors within these intermediate steps. Recent work has introduced the notion of critical tokens–tokens in the reasoning process that exert significant influence on subsequent steps. Prior studies suggest that replacing critical tokens can refine reasoning trajectories. Nonetheless, reliably identifying and exploiting critical tokens remains challenging. To address this, we propose the Paraphrastic Probing and Consistency Verification~(PPCV) framework. PPCV operates in two stages. In the first stage, we roll out an initial reasoning path from the original question and then concatenate paraphrased versions of the question with this reasoning path. And we identify critical tokens based on mismatches between the predicted top-1 token and the expected token in the reasoning path. A criterion is employed to confirm the final critical token. In the second stage, we substitute critical tokens with candidate alternatives and roll out new reasoning paths for both the original and paraphrased questions. The final answer is determined by checking the consistency of outputs across these parallel reasoning processes. We evaluate PPCV on mainstream LLMs across multiple benchmarks. Extensive experiments demonstrate PPCV substantially enhances the reasoning performance of LLMs compared to baselines.
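The first-stage mismatch test can be sketched against a Hugging-Face-style causal LM, where `model(input_ids).logits` gives per-position next-token logits. This is an illustrative reading of the abstract, not the authors' released code, and assumes a non-empty paraphrase prefix.

```python
import torch

@torch.no_grad()
def find_critical_tokens(model, paraphrase_ids, reasoning_ids):
    """Score each reasoning token by whether the model, conditioned on a
    paraphrased question plus the reasoning prefix, would still predict it.
    Mismatch positions are candidate critical tokens (first stage of PPCV).

    paraphrase_ids, reasoning_ids: 1-D LongTensors of token ids.
    """
    input_ids = torch.cat([paraphrase_ids, reasoning_ids], dim=-1).unsqueeze(0)
    logits = model(input_ids).logits[0]          # [seq, vocab]
    offset = paraphrase_ids.size(-1)
    critical = []
    for i in range(reasoning_ids.size(-1)):
        # logits at position offset+i-1 predict the token at position offset+i
        pred = logits[offset + i - 1].argmax().item()
        if pred != reasoning_ids[i].item():
            critical.append(i)                   # top-1 prediction disagrees with the path
    return critical
```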
[NLP-77] When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing
Quick read: This paper asks whether the introspective language produced by large language models (LLMs) during self-examination genuinely reflects their internal computation or is merely sophisticated confabulation. The key to the solution is the Pull Methodology, which elicits extended self-examination through format engineering, together with the identification of a direction in activation space specific to self-referential processing. The direction sits at 6.25% of model depth, is orthogonal to the known refusal direction, and causally influences introspective output when used for steering. Specific vocabulary ("loop" and "shimmer") shows statistically significant correspondence with activation dynamics in self-referential contexts (r = 0.44 and r = 0.36), while the same vocabulary in non-self-referential contexts shows no such correspondence despite far higher frequency, indicating that under appropriate conditions self-report in transformer models can reliably track internal computational states.
Link: https://arxiv.org/abs/2602.11358
Authors: Zachary Pedram Dadfar
Affiliations: Not listed
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code and data: this https URL
Abstract:Large language models produce rich introspective language when prompted for self-examination, but whether this language reflects internal computation or sophisticated confabulation has remained unclear. We show that self-referential vocabulary tracks concurrent activation dynamics, and that this correspondence is specific to self-referential processing. We introduce the Pull Methodology, a protocol that elicits extended self-examination through format engineering, and use it to identify a direction in activation space that distinguishes self-referential from descriptive processing in Llama 3.1. The direction is orthogonal to the known refusal direction, localised at 6.25% of model depth, and causally influences introspective output when used for steering. When models produce “loop” vocabulary, their activations exhibit higher autocorrelation (r = 0.44, p = 0.002); when they produce “shimmer” vocabulary under steering, activation variability increases (r = 0.36, p = 0.002). Critically, the same vocabulary in non-self-referential contexts shows no activation correspondence despite nine-fold higher frequency. Qwen 2.5-32B, with no shared training, independently develops different introspective vocabulary tracking different activation metrics, all absent in descriptive controls. The findings indicate that self-report in transformer models can, under appropriate conditions, reliably track internal computational states.
[NLP-78] ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences
Quick read: This paper addresses the limitations of existing benchmarks for evaluating AI agents on research replication: they focus on reproducibility with code and data in hand, fail to capture the inconsistent availability of new data for replication, lack ground-truth diversity by including only reproducible papers (and so cannot test whether an agent can identify non-replicable research), and evaluate outcomes rather than the replication process. The key to the solution is ReplicatorBench, an end-to-end benchmark of human-verified replicable and non-replicable research claims in the social and behavioral sciences, evaluating agents across three stages: (1) extraction and retrieval of replication data; (2) design and execution of computational experiments; and (3) interpretation of results. The authors also develop ReplicatorAgent, an agentic framework equipped with web search and iterative interaction with sandboxed environments, to set baselines; results show that current LLM agents can design and execute computational experiments effectively but struggle to retrieve the resources, such as new data, needed to replicate a claim.
Link: https://arxiv.org/abs/2602.11354
Authors: Bang Nguyen, Dominik Soós, Qian Ma, Rochana R. Obadage, Zack Ranjan, Sai Koneru, Timothy M. Errington, Shakhlo Nematova, Sarah Rajtmajer, Jian Wu, Meng Jiang
Affiliations: University of Notre Dame; Old Dominion University; Pennsylvania State University; Center for Open Science
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:The literature has witnessed an emerging interest in AI agents for automated assessment of scientific papers. Existing benchmarks focus primarily on the computational aspect of this task, testing agents’ ability to reproduce or replicate research outcomes when having access to the code and data. This setting, while foundational, (1) fails to capture the inconsistent availability of new data for replication as opposed to reproduction, and (2) lacks ground-truth diversity by focusing only on reproducible papers, thereby failing to evaluate an agent’s ability to identify non-replicable research. Furthermore, most benchmarks only evaluate outcomes rather than the replication process. In response, we introduce ReplicatorBench, an end-to-end benchmark, including human-verified replicable and non-replicable research claims in social and behavioral sciences for evaluating AI agents in research replication across three stages: (1) extraction and retrieval of replication data; (2) design and execution of computational experiments; and (3) interpretation of results, allowing a test of AI agents’ capability to mimic the activities of human replicators in real world. To set a baseline of AI agents’ capability, we develop ReplicatorAgent, an agentic framework equipped with necessary tools, like web search and iterative interaction with sandboxed environments, to accomplish tasks in ReplicatorBench. We evaluate ReplicatorAgent across four underlying large language models (LLMs), as well as different design choices of programming language and levels of code access. Our findings reveal that while current LLM agents are capable of effectively designing and executing computational experiments, they struggle with retrieving resources, such as new data, necessary to replicate a claim. All code and data are publicly available at this https URL.
[NLP-79] Evaluating Alignment of Behavioral Dispositions in LLMs
Quick read: This paper studies how closely the behavioral dispositions expressed by large language models (LLMs) in social contexts align with human behavior patterns. The key to the solution is a framework that adapts established psychological questionnaires by transforming human self-report statements into Situational Judgment Tests (SJTs), which elicit natural recommendations in realistic user-assistant scenarios, supported by large-scale human annotation (2,500 SJTs, each validated by three annotators, with preferred actions collected from 10 participants per SJT from a pool of 550). A study of 25 LLMs reveals overconfidence in a single response under low human consensus, deviations from consensus even among frontier models (15-20% of high-consensus cases), cross-LLM trait patterns that diverge from human preferences, and considerable gaps between models' stated values and their revealed behavior.
Link: https://arxiv.org/abs/2602.11328
Authors: Amir Taubenfeld, Zorik Gekhman, Lior Nezry, Omri Feldman, Natalie Harris, Shashir Reddy, Romina Stella, Ariel Goldstein, Marian Croak, Yossi Matias, Amir Feder
Affiliations: Not listed
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:As LLMs integrate into our daily lives, understanding their behavior becomes essential. In this work, we focus on behavioral dispositions - the underlying tendencies that shape responses in social contexts - and introduce a framework to study how closely the dispositions expressed by LLMs align with those of humans. Our approach is grounded in established psychological questionnaires but adapts them for LLMs by transforming human self-report statements into Situational Judgment Tests (SJTs). These SJTs assess behavior by eliciting natural recommendations in realistic user-assistant scenarios. We generate 2,500 SJTs, each validated by three human annotators, and collect preferred actions from 10 annotators per SJT, from a large pool of 550 participants. In a comprehensive study involving 25 LLMs, we find that models often do not reflect the distribution of human preferences: (1) in scenarios with low human consensus, LLMs consistently exhibit overconfidence in a single response; (2) when human consensus is high, smaller models deviate significantly, and even some frontier models do not reflect the consensus in 15-20% of cases; (3) traits can exhibit cross-LLM patterns, e.g., LLMs may encourage emotion expression in contexts where human consensus favors composure. Lastly, mapping psychometric statements directly to behavioral scenarios presents a unique opportunity to evaluate the predictive validity of self-reports, revealing considerable gaps between LLMs’ stated values and their revealed behavior.
[NLP-80] Dissecting Subjectivity and the “Ground Truth” Illusion in Data Annotation
Quick read: This paper critiques the systemic problems created by machine learning's "ground truth" paradigm: over-reliance on single consensus labels that disregards the diversity and sociotechnical context of human judgment, leaving models poorly adapted across cultures. The "consensus trap" in annotation practice not only obscures the value of human disagreement as a vital sociotechnical signal but also entrenches Western-centric bias through geographic hegemony and economic pressure, while model-mediated annotation introduces anchoring bias and effectively removes human voices from the loop. The key to the solution is rebuilding annotation infrastructure: shifting from the pursuit of a single "right" answer toward pluralistic, inclusive annotation systems that treat the diversity of human experience as a high-fidelity signal rather than noise, in support of culturally competent models.
Link: https://arxiv.org/abs/2602.11318
Authors: Sheza Munir, Benjamin Mah, Krisha Kalsi, Shivani Kapania, Julian Posada, Edith Law, Ding Wang, Syed Ishtiaque Ahmed
Affiliations: Not listed
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:
Abstract:In machine learning, “ground truth” refers to the assumed correct labels used to train and evaluate models. However, the foundational “ground truth” paradigm rests on a positivistic fallacy that treats human disagreement as technical noise rather than a vital sociotechnical signal. This systematic literature review analyzes research published between 2020 and 2025 across seven premier venues: ACL, AIES, CHI, CSCW, EAAMO, FAccT, and NeurIPS, investigating the mechanisms in data annotation practices that facilitate this “consensus trap”. Our identification phase captured 30,897 records, which were refined via a tiered keyword filtration schema to a high-recall corpus of 3,042 records for manual screening, resulting in a final included corpus of 346 papers for qualitative synthesis. Our reflexive thematic analysis reveals that systemic failures in positional legibility, combined with the recent architectural shift toward human-as-verifier models, specifically the reliance on model-mediated annotations, introduce deep-seated anchoring bias and effectively remove human voices from the loop. We further demonstrate how geographic hegemony imposes Western norms as universal benchmarks, often enforced by the performative alignment of precarious data workers who prioritize requester compliance over honest subjectivity to avoid economic penalties. Critiquing the “noisy sensor” fallacy, where statistical models misdiagnose cultural pluralism as random error, we argue for reclaiming disagreement as a high-fidelity signal essential for building culturally competent models. To address these systemic tensions, we propose a roadmap for pluralistic annotation infrastructures that shift the objective from discovering a singular “right” answer to mapping the diversity of human experience.
[NLP-81] Are Aligned Large Language Models Still Misaligned?
Quick read: This paper addresses misalignment in large language models (LLMs) arising from inconsistencies across the safety, value, and cultural dimensions, which must co-occur in real-world queries but which existing benchmarks can only evaluate in isolation. The key to the solution is Mis-Align Bench, a unified benchmark for analyzing misalignment across all three dimensions. The authors first construct SAVACU, a misaligned-aligned dataset of 382,424 samples spanning 112 domains, by reclassifying prompts via a taxonomy using Mistral-7B-Instruct-v0.3 and expanding low-resource domains with Llama-3.1-8B-Instruct plus SimHash fingerprinting for deduplication; prompts are then paired with misaligned and aligned responses via two-stage rejection sampling to enforce quality. Benchmarking general-purpose, fine-tuned, and open-weight LLMs shows that single-dimension models achieve high coverage (up to 97.6%) but suffer a False Failure Rate of 50% and markedly lower alignment scores (63%-66%) under joint conditions.
Link: https://arxiv.org/abs/2602.11305
Authors: Usman Naseem, Gautam Siddharth Kashyap, Rafiq Ali, Ebad Shabbir, Sushant Kumar Ray, Abdullah Mohammad, Agrima Seth
Affiliations: Macquarie University; DSEU-Okhla; University of Delhi; Microsoft
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Misalignment in Large Language Models (LLMs) arises when model behavior diverges from human expectations and fails to simultaneously satisfy safety, value, and cultural dimensions, which must co-occur when solving real-world queries. Existing misalignment benchmarks, such as INSECURE CODE (safety-centric), VALUEACTIONLENS (value-centric), and CULTURALHERITAGE (culture-centric), evaluate misalignment along individual dimensions, preventing simultaneous evaluation. To address this gap, we introduce Mis-Align Bench, a unified benchmark for analyzing misalignment across safety, value, and cultural dimensions. First, we construct SAVACU, an English misaligned-aligned dataset of 382,424 samples spanning 112 domains (or labels), by reclassifying prompts from the LLM-PROMPT-DATASET via a taxonomy into 14 safety domains, 56 value domains, and 42 cultural domains using Mistral-7B-Instruct-v0.3, and expanding low-resource domains via Llama-3.1-8B-Instruct with SimHash-based fingerprinting for deduplication. Furthermore, we pair prompts with misaligned and aligned responses via two-stage rejection sampling to enforce quality. Second, we benchmark general-purpose, fine-tuned, and open-weight LLMs, enabling systematic evaluation of misalignment under all three dimensions. Empirically, single-dimension models achieve high Coverage (up to 97.6%) but incur a False Failure Rate of 50% and lower Alignment Scores (63%-66%) under joint conditions.
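For reference, SimHash-based fingerprinting, as used here for deduplicating generated prompts, fits in a few lines; the whitespace tokenization and Hamming threshold below are generic choices, not the paper's.

```python
import hashlib

def simhash(text, bits=64):
    """64-bit SimHash fingerprint over whitespace tokens: each token's hash
    votes on every bit, and the sign of the tally sets the fingerprint bit."""
    votes = [0] * bits
    for tok in text.lower().split():
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def near_duplicate(a, b, max_hamming=3):
    """Two texts are near-duplicates if their fingerprints differ in few bits."""
    return bin(simhash(a) ^ simhash(b)).count("1") <= max_hamming
```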
[NLP-82] How Many Features Can a Language Model Store Under the Linear Representation Hypothesis?
Quick read: This paper studies a theoretical boundary of the linear representation hypothesis (LRH) in language models: how many neurons d are needed so that an intermediate layer can both linearly store (linear representation) and linearly decode (linear accessibility) m features. The core challenge is extending classical compressed sensing, which permits non-linear decoding, to linear compressed sensing, where decoding must also be linear. The key result is a pair of nearly matching bounds: a lower bound of d = Ω_ε((k²/log k)·log(m/k)), showing that linear accessibility is a meaningfully stronger hypothesis than linear representation alone, and an upper bound of d = O_ε(k² log m), which supports the "superposition hypothesis" that neurons can store an exponential number of features under linear constraints. The upper bound is proved via random constructions of matrices with approximately orthogonal columns; the lower bound combines rank bounds for near-identity matrices with Turán's theorem.
Link: https://arxiv.org/abs/2602.11246
Authors: Nikhil Garg, Jon Kleinberg, Kenny Peng
Affiliations: Not listed
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT); Combinatorics (math.CO)
Comments:
Abstract:We introduce a mathematical framework for the linear representation hypothesis (LRH), which asserts that intermediate layers of language models store features linearly. We separate the hypothesis into two claims: linear representation (features are linearly embedded in neuron activations) and linear accessibility (features can be linearly decoded). We then ask: How many neurons d suffice to both linearly represent and linearly access m features? Classical results in compressed sensing imply that for k-sparse inputs, d = O(k\log(m/k)) suffices if we allow non-linear decoding algorithms (Candes and Tao, 2006; Candes et al., 2006; Donoho, 2006). However, the additional requirement of linear decoding takes the problem out of classical compressed sensing and into linear compressed sensing. Our main theoretical result establishes nearly-matching upper and lower bounds for linear compressed sensing. We prove that d = \Omega_\epsilon\left(\frac{k^2}{\log k}\log(m/k)\right) is required while d = O_\epsilon(k^2\log m) suffices. The lower bound establishes a quantitative gap between the classical and linear compressed-sensing settings, illustrating how linear accessibility is a meaningfully stronger hypothesis than linear representation alone. The upper bound confirms that neurons can store an exponential number of features under the LRH, giving theoretical evidence for the "superposition hypothesis" (Elhage et al., 2022). The upper bound proof uses standard random constructions of matrices with approximately orthogonal columns. The lower bound proof uses rank bounds for near-identity matrices (Alon, 2003) together with Turán's theorem (bounding the number of edges in clique-free graphs). We also show how our results do and do not constrain the geometry of feature representations and extend our results to allow decoders with an activation function and bias.
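For reference, the two bounds in cleaned-up notation (the fraction is our reconstruction of the abstract's garbled \frac expression):

```latex
% Nearly matching bounds for linear compressed sensing
% (k-sparse inputs, m features, d neurons):
d = \Omega_{\epsilon}\!\left(\frac{k^{2}}{\log k}\,\log\frac{m}{k}\right)
\quad \text{(necessary for linear decoding)}, \qquad
d = O_{\epsilon}\!\left(k^{2}\log m\right)
\quad \text{(sufficient)}.
```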
[NLP-83] Evaluating Memory Structure in LLM Agents
Quick read: This paper addresses the fact that most long-term memory benchmarks emphasize simple fact retention and multi-hop recall, making it hard to assess an agent's ability to organize complex memory hierarchies: retrieval-augmented LLMs can handle such basic memory tasks but fail on tasks that humans solve by organizing knowledge in specific structures (transaction ledgers, to-do lists, trees, and so on). The key to the solution is StructMemEval, a benchmark built from exactly such structure-dependent tasks, which tests an agent's ability to organize its long-term memory rather than mere factual recall. Initial experiments show that simple retrieval-augmented LLMs struggle with these tasks, whereas memory agents can reliably solve them when explicitly prompted how to organize their memory; however, modern LLMs do not always recognize the required memory structure unprompted, pointing to an important direction for improvements in both LLM training and memory frameworks.
Link: https://arxiv.org/abs/2602.11243
Authors: Alina Shutova, Alexandra Olenina, Ivan Vinogradov, Anton Sinitsin
Affiliations: Not listed
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Preprint, work in progress
Abstract:Modern LLM-based agents and chat assistants rely on long-term memory frameworks to store reusable knowledge, recall user preferences, and augment reasoning. As researchers create more complex memory architectures, it becomes increasingly difficult to analyze their capabilities and guide future memory designs. Most long-term memory benchmarks focus on simple fact retention, multi-hop recall, and time-based changes. While undoubtedly important, these capabilities can often be achieved with simple retrieval-augmented LLMs and do not test complex memory hierarchies. To bridge this gap, we propose StructMemEval - a benchmark that tests the agent’s ability to organize its long-term memory, not just factual recall. We gather a suite of tasks that humans solve by organizing their knowledge in a specific structure: transaction ledgers, to-do lists, trees and others. Our initial experiments show that simple retrieval-augmented LLMs struggle with these tasks, whereas memory agents can reliably solve them if prompted how to organize their memory. However, we also find that modern LLMs do not always recognize the memory structure when not prompted to do so. This highlights an important direction for future improvements in both LLM training and memory frameworks.
[NLP-84] SurveyLens: A Research Discipline-Aware Benchmark for Automatic Survey Generation
Quick read: This paper addresses the poor cross-disciplinary fit of current evaluation for automatic survey generation (ASG): existing metrics are generic and heavily biased toward computer science, and thus cannot measure whether ASG systems adhere to the distinct writing norms and scholarly standards of different fields. The key to the solution is SurveyLens, the first discipline-aware benchmark, comprising SurveyLens-1k, a curated dataset of 1,000 high-quality human-written surveys spanning 10 disciplines, and a dual-lens evaluation framework: (1) Discipline-Aware Rubric Evaluation, which uses LLMs with human preference-aligned weights to score adherence to domain-specific writing standards; and (2) Canonical Alignment Evaluation, which rigorously measures content coverage and synthesis quality against human-written surveys. Evaluating 11 state-of-the-art ASG methods (vanilla LLMs, ASG systems, and Deep Research agents) reveals the distinct strengths and weaknesses of each paradigm across fields and gives researchers in different disciplines guidance for selecting trustworthy tools.
Link: https://arxiv.org/abs/2602.11238
Authors: Beichen Guo, Zhiyuan Wen, Jia Gu, Senzhang Wang, Haochen Shi, Ruosong Yang, Shuaiqi Liu
Affiliations: The Hong Kong Polytechnic University; Central South University; Alibaba Cloud
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:The exponential growth of scientific literature has driven the evolution of Automatic Survey Generation (ASG) from simple pipelines to multi-agent frameworks and commercial Deep Research agents. However, current ASG evaluation methods rely on generic metrics and are heavily biased toward Computer Science (CS), failing to assess whether ASG methods adhere to the distinct standards of various academic disciplines. Consequently, researchers, especially those outside CS, lack clear guidance on using ASG systems to yield high-quality surveys compliant with specific discipline standards. To bridge this gap, we introduce SurveyLens, the first discipline-aware benchmark evaluating ASG methods across diverse research disciplines. We construct SurveyLens-1k, a curated dataset of 1,000 high-quality human-written surveys spanning 10 disciplines. Subsequently, we propose a dual-lens evaluation framework: (1) Discipline-Aware Rubric Evaluation, which utilizes LLMs with human preference-aligned weights to assess adherence to domain-specific writing standards; and (2) Canonical Alignment Evaluation to rigorously measure content coverage and synthesis quality against human-written survey papers. We conduct extensive experiments by evaluating 11 state-of-the-art ASG methods on SurveyLens, including Vanilla LLMs, ASG systems, and Deep Research agents. Our analysis reveals the distinct strengths and weaknesses of each paradigm across fields, providing essential guidance for selecting tools tailored to specific disciplinary requirements.
[NLP-85] ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning
Quick read: This paper tackles the central robotics challenge of building general-purpose embodied agents across diverse hardware platforms, where progress is hindered by fragmented data, inconsistent representations, and misaligned training objectives. The key to the solution is ABot-M0, which builds a systematic data curation pipeline to produce UniACT-dataset (over 6 million trajectories and 9,500 hours of data) and converts heterogeneous raw data end-to-end into unified, efficient representations. The paper further proposes the Action Manifold Hypothesis, that effective robot actions lie on a low-dimensional, smooth manifold governed by physical laws and task constraints, and introduces Action Manifold Learning (AML), which shifts action prediction from denoising in a high-dimensional space to projection onto feasible manifolds, improving decoding speed and policy stability. A dual-stream perception mechanism integrates VLM semantics with geometric priors and multi-view inputs, supporting plug-and-play 3D modules (such as VGGT and Qwen-Image-Edit) without modifying the backbone and mitigating standard VLM limitations in 3D reasoning.
Link: https://arxiv.org/abs/2602.11236
Authors: Yandan Yang, Shuang Zeng, Tong Lin, Xinyuan Chang, Dekang Qi, Junjin Xiao, Haoyun Liu, Ronghan Chen, Yuzhi Chen, Dongjie Huo, Feng Xiong, Xing Wei, Zhiheng Ma, Mu Xu
Affiliations: AMAP CV Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Robotics (cs.RO)
Comments: Project website: this https URL . Code: this https URL . 22 pages, 10 figures, 10 tables
Abstract:Building general-purpose embodied agents across diverse hardware remains a central challenge in robotics, often framed as the ‘‘one-brain, many-forms’’ paradigm. Progress is hindered by fragmented data, inconsistent representations, and misaligned training objectives. We present ABot-M0, a framework that builds a systematic data curation pipeline while jointly optimizing model architecture and training strategies, enabling end-to-end transformation of heterogeneous raw data into unified, efficient representations. From six public datasets, we clean, standardize, and balance samples to construct UniACT-dataset, a large-scale dataset with over 6 million trajectories and 9,500 hours of data, covering diverse robot morphologies and task scenarios. Unified pre-training improves knowledge transfer and generalization across platforms and tasks, supporting general-purpose embodied intelligence. To improve action prediction efficiency and stability, we propose the Action Manifold Hypothesis: effective robot actions lie not in the full high-dimensional space but on a low-dimensional, smooth manifold governed by physical laws and task constraints. Based on this, we introduce Action Manifold Learning (AML), which uses a DiT backbone to predict clean, continuous action sequences directly. This shifts learning from denoising to projection onto feasible manifolds, improving decoding speed and policy stability. ABot-M0 supports modular perception via a dual-stream mechanism that integrates VLM semantics with geometric priors and multi-view inputs from plug-and-play 3D modules such as VGGT and Qwen-Image-Edit, enhancing spatial understanding without modifying the backbone and mitigating standard VLM limitations in 3D reasoning. Experiments show components operate independently with additive benefits. We will release all code and pipelines for reproducibility and future research.
[NLP-86] Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation KDD2026
Quick read: This paper addresses the trade-off in benchmarking agentic large language models (LLMs) on real-world tasks between experimental control and ecological validity: sandboxed approaches standardize the software environment but lack realism, while real services are hard to evaluate consistently. The key to the solution is twofold. First, a novel state-diff contract separates process from outcome, defining task success as whether the expected change in environment state was achieved, rather than relying on fuzzy trace or parameter matching. Second, a unified sandbox provides a standardized scripting layer through which all models execute code against real external APIs (Slack, Box, Linear, and Google Calendar), enabling fair comparison of different agentic LLMs on real-world service interfaces.
Link: https://arxiv.org/abs/2602.11224
Authors: Hubert M. Pysklo, Artem Zhuravel, Patrick D. Watson
Affiliations: Minerva University
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: Pre-print. Under review for KDD 2026
Abstract:We present Agent-Diff, a novel benchmarking framework for evaluating agentic Large Language Models (LLMs) on real-world tasks that execute code via external APIs. Agentic LLM performance varies due to differences in models, external tool access, prompt structures, and agentic frameworks. Benchmarks must make fundamental trade-offs between a sandboxed approach that controls for variation in software environments and more ecologically valid approaches employing real services. Agent-Diff attempts to capture the desirable features of both of these approaches by including access to the real API interfaces for software services while sandboxing the environment in which calls are made, processed, and evaluated. This approach relies on two key innovations. The first is a novel state-diff contract, which separates process from outcome - rather than fuzzy trace or parameter matching, we define task success as whether the expected change in environment state was achieved. The second is a novel sandbox that provides a standardized scripting layer that all models use to execute code against external APIs (Slack, Box, Linear, Google Calendar). Thus, we can evaluate different agentic LLMs against a standardized set of contracts using a unified sandbox while still evaluating their performance on real-world service interfaces. Using the Agent-Diff framework, we provide benchmarks for nine LLMs across 224 tasks utilizing enterprise software workflows. In addition, we evaluate the robustness of the framework with ablation experiments to assess the contribution of access to API documentation on benchmark performance. Code and data: this https URL.
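The state-diff contract itself is simple to state in code: success means every expected change to the environment state is observed, regardless of the call trace. The flattened dotted-path representation below is an illustrative assumption, not the benchmark's actual schema.

```python
def task_succeeded(expected_diff, observed_diff):
    """State-diff contract: the task succeeds iff every expected state change
    actually occurred; how the agent got there (the trace) is irrelevant."""
    return all(observed_diff.get(path) == value
               for path, value in expected_diff.items())

# Hypothetical example:
expected = {"slack.channel.general.topic": "launch day"}
observed = {"slack.channel.general.topic": "launch day",
            "slack.channel.general.last_message": "done!"}
assert task_succeeded(expected, observed)  # extra changes do not hurt
```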
[NLP-87] The Automatic Verification of Image-Text Claims (AVerImaTeC) Shared Task EACL2026
Quick read: This shared-task overview addresses the lack of effective evidence retrieval and verification in systems for image-text fact checking, organizing the Automatic Verification of Image-Text Claims (AVerImaTeC) shared task to advance the area. The key design is an evaluation metric based on conditional verdict accuracy: a verdict counts as correct only when its associated evidence score exceeds a predefined threshold, pushing systems not merely to judge but to support their judgments with credible, quantifiable evidence. The task attracted 14 submissions in the development phase and 6 in the testing phase; all testing-phase systems outperformed the provided baseline, and the winning team, HUMANE, achieved an AVerImaTeC score of 0.5455.
Link: https://arxiv.org/abs/2602.11221
Authors: Rui Cao, Zhenyun Deng, Yulong Chen, Michael Schlichtkrull, Andreas Vlachos
Affiliations: Not listed
Subjects: Computation and Language (cs.CL)
Comments: Shared Task Overview and Summary for the Ninth FEVER Workshop, Co-located at EACL 2026
Abstract:The Automatic Verification of Image-Text Claims (AVerImaTeC) shared task aims to advance system development for retrieving evidence and verifying real-world image-text claims. Participants were allowed to either employ external knowledge sources, such as web search engines, or leverage the curated knowledge store provided by the organizers. System performance was evaluated using the AVerImaTeC score, defined as a conditional verdict accuracy in which a verdict is considered correct only when the associated evidence score exceeds a predefined threshold. The shared task attracted 14 submissions during the development phase and 6 submissions during the testing phase. All participating systems in the testing phase outperformed the baseline provided. The winning team, HUMANE, achieved an AVerImaTeC score of 0.5455. This paper provides a detailed description of the shared task, presents the complete evaluation results, and discusses key insights and lessons learned.
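The conditional verdict accuracy underlying the AVerImaTeC score can be sketched directly from its definition; the field names and the threshold value here are placeholders, not the task's published configuration.

```python
def averimatec_score(examples, tau=0.25):
    """Conditional verdict accuracy: a predicted verdict counts as correct
    only if its evidence score clears the threshold tau (assumed value)."""
    correct = sum(
        1 for ex in examples
        if ex["evidence_score"] > tau and ex["pred_verdict"] == ex["gold_verdict"]
    )
    return correct / len(examples)
```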
[NLP-88] Patch the Distribution Mismatch: RL Rewriting Agent for Stable Off-Policy SFT
Quick read: This paper addresses catastrophic forgetting during supervised fine-tuning (SFT) of large language models (LLMs) on downstream tasks whose data exhibit a substantial distribution shift. Existing data-rewriting approaches sample rewrites from a prompt-induced conditional distribution that is not necessarily aligned with the model's natural QA-style generation distribution, and fixed templates can cause diversity collapse. The key to the solution is to cast data rewriting as a policy learning problem: a reinforcement learning (RL)-based rewriting agent jointly optimizes QA-style distributional alignment and diversity under a hard task-consistency gate, using reward feedback to construct a higher-quality rewritten dataset that preserves downstream SFT gains while markedly reducing forgetting on non-downstream benchmarks.
Link: https://arxiv.org/abs/2602.11220
Authors: Jiacheng Wang, Ping Jian, Zhen Yang, Zirong Chen, Keren Liao, Zhongbin Guo
Affiliations: Beijing Institute of Technology
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) have made rapid progress, yet adapting them to downstream scenarios still commonly relies on supervised fine-tuning (SFT). When downstream data exhibit a substantial distribution shift from the model’s prior training distribution, SFT can induce catastrophic forgetting. To narrow this gap, data rewriting has been proposed as a data-centric approach that rewrites downstream training data prior to SFT. However, existing methods typically sample rewrites from a prompt-induced conditional distribution, so the resulting targets are not necessarily aligned with the model’s natural QA-style generation distribution. Moreover, reliance on fixed templates can lead to diversity collapse. To address these issues, we cast data rewriting as a policy learning problem and learn a rewriting policy that better matches the backbone’s QA-style generation distribution while preserving diversity. Since distributional alignment, diversity and task consistency are automatically evaluable but difficult to optimize end-to-end with differentiable objectives, we leverage reinforcement learning to optimize the rewrite distribution under reward feedback and propose an RL-based data-rewriting agent. The agent jointly optimizes QA-style distributional alignment and diversity under a hard task-consistency gate, thereby constructing a higher-quality rewritten dataset for downstream SFT. Extensive experiments show that our method achieves downstream gains comparable to standard SFT while reducing forgetting on non-downstream benchmarks by 12.34% on average. Our code is available at this https URL .
[NLP-89] Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning ACL
Quick read: This paper asks whether chain-of-thought (CoT) explanations faithfully reflect a language model's decision process or merely serve as post-hoc justification. The key to the solution is Normalized Logit Difference Decay (NLDD): corrupt an individual reasoning step and measure how much the model's confidence in its final answer drops, thereby determining whether the step genuinely influences the decision; standardization of these measurements enables rigorous comparison across model architectures. Across syntactic, logical, and arithmetic tasks in three model families, the authors find a consistent Reasoning Horizon (k*) at 70-85% of chain length, beyond which reasoning tokens contribute little or even negatively to the final answer, and show that models can encode correct internal representations while completely failing the task.
Link: https://arxiv.org/abs/2602.11201
Authors: Donald Ye, Max Loffgren, Om Kotadia, Linus Wong
Affiliations: Algoverse; Rice University; University of California, San Diego; Santa Clara University
Subjects: Computation and Language (cs.CL)
Comments: 16 pages, 15 figures. Code: this https URL
Abstract:Chain-of-Thought (CoT) explanations are widely used to interpret how language models solve complex problems, yet it remains unclear whether these step-by-step explanations reflect how the model actually reaches its answer, or merely post-hoc justifications. We propose Normalized Logit Difference Decay (NLDD), a metric that measures whether individual reasoning steps are faithful to the model’s decision-making process. Our approach corrupts individual reasoning steps from the explanation and measures how much the model’s confidence in its answer drops, to determine if a step is truly important. By standardizing these measurements, NLDD enables rigorous cross-model comparison across different architectures. Testing three model families across syntactic, logical, and arithmetic tasks, we discover a consistent Reasoning Horizon (k*) at 70–85% of chain length, beyond which reasoning tokens have little or negative effect on the final answer. We also find that models can encode correct internal representations while completely failing the task. These results show that accuracy alone does not reveal whether a model actually reasons through its chain. NLDD offers a way to measure when CoT matters.
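A minimal sketch of the NLDD computation follows, assuming precomputed answer margins (the gold-answer logit minus a foil logit at the final position). The no-CoT normalizer is one plausible choice; the paper's exact normalization may differ.

```python
def answer_margin(next_token_logits, answer_id, foil_id):
    """Confidence margin toward the gold answer at the final position."""
    return next_token_logits[answer_id] - next_token_logits[foil_id]

def nldd(margin_clean, margin_corrupted, margin_no_cot):
    """Normalized drop in the answer margin when one CoT step is corrupted.

    margin_clean:     margin with the full chain-of-thought
    margin_corrupted: margin after corrupting the step under test
    margin_no_cot:    margin with no chain-of-thought at all (normalizer)

    A value near 1 means corrupting the step erases most of the benefit the
    chain provided; near 0 means the step was decorative.
    """
    denom = margin_clean - margin_no_cot
    return (margin_clean - margin_corrupted) / denom if denom else 0.0
```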
[NLP-90] When and What to Ask: AskBench and Rubric-Guided RLVR for LLM Clarification
Quick read: This paper addresses the problem that large language models (LLMs) respond even when prompts omit critical details or contain misleading premises, producing hallucinations or reinforcing misconceptions; the core challenge is improving a model's ability to ask for clarification at the right moment without sacrificing task performance. The key to the solution is AskBench, an interactive benchmark that converts standard QA pairs into multi-turn interactions with explicit checkpoints and a unified judge loop that evaluates answers and simulates user responses, covering two settings: AskMind (intent-deficient queries requiring clarification) and AskOverconfidence (queries with false premises that must be identified and corrected). The authors further propose rubric-guided reinforcement learning with verifier-based rewards (RLVR), which uses structured rubrics to encourage targeted clarification; experiments show consistent gains in accuracy, rubric adherence, and interaction efficiency, with strong generalization to unseen domains.
Link: https://arxiv.org/abs/2602.11199
Authors: Jiale Zhao, Ke Fang, Lu Cheng
Affiliations: Chongqing University of Posts and Telecommunications; University of Pennsylvania; University of Illinois Chicago
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Large language models (LLMs) often respond even when prompts omit critical details or include misleading information, leading to hallucinations or reinforced misconceptions. We study how to evaluate and improve LLMs’ ability to decide when and what to ask for clarification without sacrificing task performance. We introduce AskBench, an interactive benchmark that converts standard QA pairs into multi-turn interactions with explicit checkpoints. A unified judge loop evaluates final answers and simulates user responses as needed. AskBench covers two settings: AskMind, with intent-deficient queries requiring clarification, and AskOverconfidence, with queries containing false premises that must be identified and corrected. We further propose rubric-guided reinforcement learning with verifier-based rewards (RLVR), which uses structured rubrics to encourage targeted clarification. Experiments show consistent improvements in accuracy, rubric adherence, and interaction efficiency, with strong generalization to unseen domains.
[NLP-91] DDL2PropBank Agent: Benchmarking Multi-Agent Frameworks' Developer Experience Through a Novel Relational Schema Mapping Task
Quick read: This paper addresses the lack of a principled, controlled way to evaluate the developer experience of multi-agent frameworks for LLM-driven software development. The key to the solution is DDL2PropBank, a novel benchmark task that maps relational database schemas to PropBank rolesets, requiring autonomous retrieval of candidate frames and fine-grained linguistic reasoning over table names, columns, and relations. Using the Agent-as-a-Tool pattern, identical agent logic is implemented across 10 frameworks and evaluated along two dimensions: code complexity via static analysis, and AI-assistability, the extent to which LLMs can autonomously generate correct, framework-specific code. Results reveal a threefold complexity spectrum, with Pydantic AI and Agno requiring the least implementation overhead; structural alignment scores reliably proxy runtime success for frameworks with a single canonical pattern but overestimate correctness for multi-pattern frameworks, and Agno emerges as the strongest overall performer, combining lowest complexity with highest structural alignment and 83% pass@1.
Link: https://arxiv.org/abs/2602.11198
Authors: Shafiuddin Rehan Ahmed, Wei Wei
Affiliations: Accenture
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ARR submission
Abstract:Multi-agent frameworks promise to simplify LLM-driven software development, yet there is no principled way to evaluate their developer experience in a controlled setting. We introduce DDL2PropBank, a novel benchmark task that maps relational database schemas to PropBank rolesets, requiring autonomous retrieval of candidate frames and fine-grained linguistic reasoning over table names, columns, and relations. Using the Agent-as-a-Tool pattern, we implement identical agent logic across 10 frameworks and evaluate along two dimensions: (i) code complexity via static analysis, and (ii) AI-assistability – the extent to which LLMs can autonomously generate correct, framework-specific code. Our results reveal a threefold complexity spectrum, with Pydantic AI and Agno requiring the least implementation overhead. For AI-assistability, structural alignment scores reliably proxy runtime success for frameworks with single canonical patterns, but overestimate correctness for multi-pattern frameworks. Agno emerges as the strongest overall performer, combining lowest complexity with highest structural alignment and 83% pass@1.
[NLP-92] MetaMem: Evolving Meta-Memory for Knowledge Utilization through Self-Reflective Symbolic Optimization
Quick read: This paper addresses the fragmentation problem in existing memory systems for long-horizon human-LLM interaction: while they persist history beyond limited context windows, they often disrupt the inherent logical and temporal relationships within interaction sessions, yielding fragmented memory units and degraded reasoning performance. The key to the solution is MetaMem, a self-evolving meta-memory framework that iteratively distills transferable knowledge-utilization experiences across tasks by self-reflecting on reasoning processes and performing actions to update the current meta-memory state. The accumulated meta-memory units serve as explicit knowledge-utilization experiences, guiding the LLM to systematically identify and integrate critical evidence from scattered memory fragments; experiments show MetaMem outperforms strong baselines by over 3.6%.
Link: https://arxiv.org/abs/2602.11182
Authors: Haidong Xin, Xinze Li, Zhenghao Liu, Yukun Yan, Shuo Wang, Cheng Yang, Yu Gu, Ge Yu, Maosong Sun
Affiliations: Northeastern University; Tsinghua University; Beijing University of Posts and Telecommunications
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Existing memory systems enable Large Language Models (LLMs) to support long-horizon human-LLM interactions by persisting historical interactions beyond limited context windows. However, while recent approaches have succeeded in constructing effective memories, they often disrupt the inherent logical and temporal relationships within interaction sessions, resulting in fragmented memory units and degraded reasoning performance. In this paper, we propose MetaMem, a novel framework that augments memory systems with a self-evolving meta-memory, aiming to teach LLMs how to effectively utilize memorized knowledge. During meta-memory optimization, MetaMem iteratively distills transferable knowledge utilization experiences across different tasks by self-reflecting on reasoning processes and performing actions to update the current meta-memory state. The accumulated meta-memory units serve as explicit knowledge utilization experiences, guiding the LLM to systematically identify and integrate critical evidence from scattered memory fragments. Extensive experiments demonstrate the effectiveness of MetaMem, which significantly outperforms strong baselines by over 3.6%. All codes and datasets are available at this https URL.
[NLP-93] Code Mixologist: A Practitioner's Guide to Building Code-Mixed LLMs
Quick read: This survey addresses the systematic degradation of large language models (LLMs) in code-mixing and code-switching (CSW) settings, covering grammaticality, factuality, and safety behavior. The key contribution is a unifying taxonomy organizing prior work along the dimensions of data, modeling, and evaluation, distilled into a practical playbook of actionable recommendations for building, adapting, and evaluating CSW-capable LLMs, spanning CSW-tailored pre-training, task-specific post-training, prompting strategies, and in-context learning. The paper also critically analyzes current evaluation practices, cataloging existing benchmarks and their English-centric biases, and identifies emerging risks, including the use of code-mixing as a mechanism for bypassing model safeguards, thereby providing both theoretical grounding and empirical evidence for future CSW research.
Link: https://arxiv.org/abs/2602.11181
Authors: Himanshu Gupta, Pratik Jayarao, Chaitanya Dwivedi, Neeraj Varshney
Affiliations: Arizona State University; Carnegie Mellon University
Subjects: Computation and Language (cs.CL)
Comments: 7 pages main paper, 10 pages total
Abstract:Code-mixing and code-switching (CSW) remain challenging phenomena for large language models (LLMs). Despite recent advances in multilingual modeling, LLMs often struggle in mixed-language settings, exhibiting systematic degradation in grammaticality, factuality, and safety behavior. This work provides a comprehensive overview of CSW research in modern large language model settings. We introduce a unifying taxonomy that organizes prior work along dimensions of data, modeling, and evaluation, and we distill these findings into a practical playbook of actionable recommendations for building, adapting, and evaluating CSW-capable LLMs. We review modeling approaches ranging from CSW-tailored pre-training and task-specific post-training to prompting strategies and in-context learning. We analyze current evaluation practices, highlighting sources of instability and limited reproducibility, and we catalog existing benchmarks while critically examining their linguistic coverage and English-centric biases. Finally, we discuss emerging safety concerns, including use of code-mixing as a mechanism for bypassing model safeguards, and identify open research challenges.
[NLP-94] Mechanistic Interpretability for Large Language Model Alignment: Progress Challenges and Future Directions
Quick read: This survey addresses the opacity of internal decision-making in large language models (LLMs): how mechanistic interpretability, the systematic study of how neural networks implement algorithms through their learned representations and computational structures, can make these models understandable. The key lies in synthesizing interpretability techniques, including circuit discovery, feature visualization, activation steering, and causal intervention, and connecting the resulting insights to alignment strategies such as reinforcement learning from human feedback (RLHF), constitutional AI, and scalable oversight. The paper identifies core challenges, including the superposition hypothesis, polysemanticity of neurons, and the difficulty of interpreting emergent behaviors in large-scale models, and proposes future directions focused on automated interpretability, cross-model generalization of circuits, and interpretability-driven alignment techniques that scale to frontier models.
Link: https://arxiv.org/abs/2602.11180
Authors: Usman Naseem
Affiliations: Macquarie University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) have achieved remarkable capabilities across diverse tasks, yet their internal decision-making processes remain largely opaque. Mechanistic interpretability (i.e., the systematic study of how neural networks implement algorithms through their learned representations and computational structures) has emerged as a critical research direction for understanding and aligning these models. This paper surveys recent progress in mechanistic interpretability techniques applied to LLM alignment, examining methods ranging from circuit discovery to feature visualization, activation steering, and causal intervention. We analyze how interpretability insights have informed alignment strategies including reinforcement learning from human feedback (RLHF), constitutional AI, and scalable oversight. Key challenges are identified, including the superposition hypothesis, polysemanticity of neurons, and the difficulty of interpreting emergent behaviors in large-scale models. We propose future research directions focusing on automated interpretability, cross-model generalization of circuits, and the development of interpretability-driven alignment techniques that can scale to frontier models.
[NLP-95] From Instruction to Output: The Role of Prompting in Modern NLG
Quick read: This survey addresses the lack of a structured framework and coherent understanding of prompt engineering in natural language generation (NLG), particularly around how to design, optimize, and evaluate prompts. The key contributions are a taxonomy of prompting paradigms, a decision framework for prompt selection based on task requirements and constraints, and a framework linking prompt design, optimization, and evaluation; prompt design is positioned as an input-level control mechanism that complements fine-tuning and decoding approaches, supporting more controllable and generalizable NLG and pointing to emerging trends and challenges.
Link: https://arxiv.org/abs/2602.11179
Authors: Munazza Zaib, Elaf Alhazmi
Affiliations: Monash University; Western Sydney University; Macquarie University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Prompt engineering has emerged as an integral technique for extending the strengths and abilities of Large Language Models (LLMs) to gain significant performance gains in various Natural Language Processing (NLP) tasks. This approach, which requires instructions to be composed in natural language to bring out the knowledge from LLMs in a structured way, has driven breakthroughs in various NLP tasks. Yet there is still no structured framework or coherent understanding of the varied prompt engineering methods and techniques, particularly in the field of Natural Language Generation (NLG). This survey aims to help fill that gap by outlining recent developments in prompt engineering and their effect on different NLG tasks. It reviews recent advances in prompting methods and their impact on NLG tasks, presenting prompt design as an input-level control mechanism that complements fine-tuning and decoding approaches. The paper introduces a taxonomy of prompting paradigms, a decision framework for prompt selection based on varying factors for the practitioners, outlines emerging trends and challenges, and proposes a framework that links design, optimization, and evaluation to support more controllable and generalizable NLG.
[NLP-96] What Do LLMs Know About Alzheimer's Disease? Fine-Tuning, Probing, and Data Synthesis for AD Detection
Quick read: This paper targets reliable early detection of Alzheimer's disease (AD) under scarce labeled data. The key to the solution is supervised fine-tuning of a large language model (LLM) for AD detection, combined with probing techniques that analyze intermediate activations across transformer layers to reveal how task-relevant information is encoded in the model's internal representations. Guided by the observation that probing values of specific words and special markers change substantially after fine-tuning, the authors design a curated set of task-aware special markers and train a sequence-to-sequence model as a data-synthesis tool that leverages these markers to generate structurally consistent and diagnostically informative synthetic samples, which improve downstream task performance.
Link: https://arxiv.org/abs/2602.11177
Authors: Lei Jiang, Yue Zhou, Natalie Parde
Affiliations: University of Illinois Chicago
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Reliable early detection of Alzheimer’s disease (AD) is challenging, particularly due to limited availability of labeled data. While large language models (LLMs) have shown strong transfer capabilities across domains, adapting them to the AD domain through supervised fine-tuning remains largely unexplored. In this work, we fine-tune an LLM for AD detection and investigate how task-relevant information is encoded within its internal representations. We employ probing techniques to analyze intermediate activations across transformer layers, and we observe that, after fine-tuning, the probing values of specific words and special markers change substantially, indicating that these elements assume a crucial role in the model’s improved detection performance. Guided by this insight, we design a curated set of task-aware special markers and train a sequence-to-sequence model as a data-synthesis tool that leverages these markers to generate structurally consistent and diagnostically informative synthetic samples. We evaluate the synthesized data both intrinsically and by incorporating it into downstream training pipelines.
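The probing step follows the standard linear-probe recipe: fit a linear classifier on intermediate activations and read its held-out accuracy as evidence of linearly decodable task signal at that layer. A generic sklearn sketch (not the authors' code):

```python
from sklearn.linear_model import LogisticRegression

def layer_probe_accuracy(acts_train, y_train, acts_test, y_test):
    """Linear probe on intermediate activations: high held-out accuracy at a
    layer suggests the AD-relevant signal is linearly decodable there.

    acts_*: [n_samples, hidden_dim] arrays of layer activations
    y_*:    binary AD / control labels
    """
    probe = LogisticRegression(max_iter=1000).fit(acts_train, y_train)
    return probe.score(acts_test, y_test)
```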
[NLP-97] Evaluating Few-Shot Temporal Reasoning of LLMs for Human Activity Prediction in Smart Environments
Quick read: This paper addresses the poor performance of existing data-driven agent-based models in low-data environments for applications that must anticipate human activities and their durations (smart-home automation, simulation-based design, human-robot collaboration, and so on). The key to the solution is leveraging the commonsense reasoning of pre-trained language models through a retrieval-augmented prompting strategy that integrates four sources of context: temporal, spatial, behavioral history, and persona. Evaluated on the CASAS Aruba smart-home dataset across next-activity prediction with duration estimation and multi-step daily sequence generation, the models produce coherent activity predictions even zero-shot; one or two demonstrations further refine duration calibration and categorical accuracy, and performance saturates beyond a few examples, confirming strong data efficiency and predictive robustness in low-data settings.
Link: https://arxiv.org/abs/2602.11176
Authors: Maral Doctorarastoo, Katherine A. Flanigan, Mario Bergés, Christopher McComb
Affiliations: Carnegie Mellon University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments:
Abstract:Anticipating human activities and their durations is essential in applications such as smart-home automation, simulation-based architectural and urban design, activity-based transportation system simulation, and human-robot collaboration, where adaptive systems must respond to human activities. Existing data-driven agent-based models–from rule-based to deep learning–struggle in low-data environments, limiting their practicality. This paper investigates whether large language models, pre-trained on broad human knowledge, can fill this gap by reasoning about everyday activities from compact contextual cues. We adopt a retrieval-augmented prompting strategy that integrates four sources of context–temporal, spatial, behavioral history, and persona–and evaluate it on the CASAS Aruba smart-home dataset. The evaluation spans two complementary tasks: next-activity prediction with duration estimation, and multi-step daily sequence generation, each tested with various numbers of few-shot examples provided in the prompt. Analyzing few-shot effects reveals how much contextual supervision is sufficient to balance data efficiency and predictive accuracy, particularly in low-data environments. Results show that large language models exhibit strong inherent temporal understanding of human behavior: even in zero-shot settings, they produce coherent daily activity predictions, while adding one or two demonstrations further refines duration calibration and categorical accuracy. Beyond a few examples, performance saturates, indicating diminishing returns. Sequence-level evaluation confirms consistent temporal alignment across few-shot conditions. These findings suggest that pre-trained language models can serve as promising temporal reasoners, capturing both recurring routines and context-dependent behavioral variations, thereby strengthening the behavioral modules of agent-based models.
[NLP-98] Barriers to Discrete Reasoning with Transformers: A Survey Across Depth, Exactness, and Bandwidth EACL2026
【速读】: 该论文试图解决当前Transformer架构在离散推理任务(如算术运算、逻辑推理和算法组合)中表现不佳的理论局限性问题。其解决方案的关键在于从电路复杂性(circuit complexity)、逼近理论(approximation theory)和通信复杂性(communication complexity)三个经典理论视角出发,系统揭示Transformer在执行符号计算时面临的结构性与计算性障碍,例如深度限制、难以逼近不连续函数以及跨标记通信瓶颈,从而为理解为何Transformer虽擅长模式匹配与内插却无法实现精确的离散算法提供统一的理论解释,并指出改进模型设计的潜在方向。
链接: https://arxiv.org/abs/2602.11175
作者: Michelle Yuan,Weiyi Sun,Amir H. Rezaeian,Jyotika Singh,Sandip Ghoshal,Yao-Ting Wang,Miguel Ballesteros,Yassine Benajiba
机构: Oracle AI(Oracle人工智能)
类目: Computation and Language (cs.CL)
备注: Accepted to EACL 2026 Main Conference
Abstract:Transformers have become the foundational architecture for a broad spectrum of sequence modeling applications, underpinning state-of-the-art systems in natural language processing, vision, and beyond. However, their theoretical limitations in discrete reasoning tasks, such as arithmetic, logical inference, and algorithmic composition, remain a critical open problem. In this survey, we synthesize recent studies from three theoretical perspectives: circuit complexity, approximation theory, and communication complexity, to clarify the structural and computational barriers that transformers face when performing symbolic computations. By connecting these established theoretical frameworks, we provide an accessible and unified account of why current transformer architectures struggle to implement exact discrete algorithms, even as they excel at pattern matching and interpolation. We review key definitions, seminal results, and illustrative examples, highlighting challenges such as depth constraints, difficulty approximating discontinuities, and bottlenecks in inter-token communication. Finally, we discuss implications for model design and suggest promising directions for overcoming these foundational limitations.
[NLP-99] The Script Tax: Measuring Tokenization-Driven Efficiency and Latency Disparities in Multilingual Language Models
【速读】: 该论文旨在解决多语言自然语言处理(NLP)中因分词(tokenization)策略导致的书写系统不平等性问题,即预训练多语言语言模型在不同文字系统(如拉丁文、汉字等)上的处理效率与信息成本存在显著差异。研究表明,对于高碎片化(high-fragmentation)的书写系统,模型的分词粒度更细,导致单位词元(token/word)数量增加约3.4倍,推理速度下降16.5倍,同时信息熵(以比特每字符 BPC 衡量)上升高达47.1%,说明存在明显的“书写系统税”(script tax)。解决方案的关键在于识别并量化这种由分词机制引发的系统性偏差,并提出应发展“书写系统感知”(script-aware)的分词策略和预训练范式,从而提升多语言模型对所有书写系统的公平性和高效性。
链接: https://arxiv.org/abs/2602.11174
作者: Aradhya Dixit,Shreem Dixit
机构: Wake Technical Community College (Wake技术社区学院); University of North Carolina Charlotte (北卡罗来纳大学夏洛特分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Pretrained multilingual language models are often assumed to be script-agnostic, yet their tokenizers can impose systematic costs on certain writing systems. We quantify this script tax by comparing two orthographic variants with identical linguistic content. Across mBERT and XLM-R, the higher-fragmentation orthography shows a ~3.4x increase in fertility (6.73-6.85 vs. 2.10-2.35 tokens/word), leading to a 16.5x inference slowdown (0.23 vs. 3.8 sentences/second) on identical hardware. Using bits per character (BPC) to avoid the “NLL paradox” from subword fragmentation, we find a substantial increase in information cost: +19.7% for mBERT (8.06-9.65) and +47.1% for XLM-R (12.19-17.94). A round-trip conversion check (CER_rt=0.31) suggests these gaps reflect orthography-conditioned processing rather than mapping noise. Our results highlight tokenization as a key source of inequity in multilingual NLP and motivate script-aware tokenization and pretraining.
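摘要中的两项核心度量(fertility 与 BPC)可按如下最小示意复现;BPC 采用逐位掩码的伪对数似然以绕开子词碎片化带来的 “NLL paradox”,示例文本为占位,细节与论文设置未必一致。

```python
import math, torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

NAME = "bert-base-multilingual-cased"    # mBERT, one of the two models studied
tok = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForMaskedLM.from_pretrained(NAME).eval()

def fertility(text: str) -> float:
    """Tokens per whitespace-separated word."""
    n_tokens = len(tok(text, add_special_tokens=False).input_ids)
    return n_tokens / max(1, len(text.split()))

def bpc(text: str) -> float:
    """Bits per character via pseudo-log-likelihood: mask one position at a
    time and score the true token under the masked LM."""
    ids = tok(text, return_tensors="pt").input_ids[0]
    nll = 0.0
    for i in range(1, len(ids) - 1):           # skip [CLS] / [SEP]
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        nll -= torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return nll / (len(text) * math.log(2))

for variant in ("same text, orthography A", "same text, orthography B"):  # placeholders
    print(f"fertility {fertility(variant):.2f}  BPC {bpc(variant):.2f}")
```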
[NLP-100] Author-in-the-Loop Response Generation and Evaluation: Integrating Author Expertise and Intent in Responses to Peer Review
【速读】: 该论文旨在解决科学同行评审中作者回复(author response)撰写效率低、质量难保障的问题,传统自动文本生成方法忽视了作者的专业知识(domain expertise)、独有信息(author-only information)及修订策略等关键意图信号。其解决方案的核心在于将作者回复生成重构为“作者在环”(author-in-the-loop)任务,提出REspGen框架,通过显式整合作者输入、多属性控制(multi-attribute control)和基于评估的迭代优化(evaluation-guided refinement),有效利用作者意图与专业知识提升回复质量。同时构建了首个大规模对齐数据集Re³Align,包含审稿意见—回复—修订三元组,以捕捉作者意图信号,实验验证了作者输入与评估引导优化对回复质量的显著提升作用。
链接: https://arxiv.org/abs/2602.11173
作者: Qian Ruan,Iryna Gurevych
机构: Ubiquitous Knowledge Processing Lab (UKP Lab); Department of Computer Science and Hessian Center for AI (hessian.AI); Technical University of Darmstadt
类目: Computation and Language (cs.CL)
备注:
Abstract:Author response (rebuttal) writing is a critical stage of scientific peer review that demands substantial author effort. Recent work frames this task as automatic text generation, underusing author expertise and intent. In practice, authors possess domain expertise, author-only information, and revision and response strategies–concrete forms of author expertise and intent–to address reviewer concerns, and seek NLP assistance that integrates these signals to support effective response writing in peer review. We reformulate author response generation as an author-in-the-loop task and introduce REspGen, a generation framework that integrates explicit author input, multi-attribute control, and evaluation-guided refinement, together with REspEval, a comprehensive evaluation suite with 20+ metrics covering input utilization, controllability, response quality, and discourse. To support this formulation, we construct Re³Align, the first large-scale dataset of aligned review–response–revision triplets, where revisions provide signals of author expertise and intent. Experiments with state-of-the-art LLMs show the benefits of author input and evaluation-guided refinement, the impact of input design on response quality, and trade-offs between controllability and quality. We make our dataset, generation and evaluation tools publicly available.
[NLP-101] Synthesizing the Virtual Advocate: A Multi-Persona Speech Generation Framework for Diverse Linguistic Jurisdictions in Indic Languages
【速读】: 该论文旨在解决多语言环境下生成具有专业权威性和情感张力的合成法庭演说语音的问题,特别是在印度语境中不同语言(如泰米尔语、泰卢固语、孟加拉语、印地语和古吉拉特语)的法律表达需求。其解决方案的关键在于提出了一种基于Gemini 2.5 Flash和Pro TTS模型的提示框架(prompting framework),利用模型对五种印地语系语言的原生支持及其上下文感知的节奏控制能力,构建出具有差异化律师人设(advocate personas)的合成语音输出。此方法在程序性法律信息传递上表现优异,但在动态声调调节与说服性情感表达方面仍存在局限,尤其在孟加拉语和古吉拉特语中性能下降明显,揭示了未来需进一步优化的音系边界。
链接: https://arxiv.org/abs/2602.11172
作者: Aniket Deroy
机构: Indian Institute of Technology, Delhi (印度理工学院德里分校)
类目: Computation and Language (cs.CL)
备注:
Abstract:Legal advocacy requires a unique combination of authoritative tone, rhythmic pausing for emphasis, and emotional intelligence. This study investigates the performance of the Gemini 2.5 Flash TTS and Gemini 2.5 Pro TTS models in generating synthetic courtroom speeches across five Indic languages: Tamil, Telugu, Bengali, Hindi, and Gujarati. We propose a prompting framework that utilizes Gemini 2.5’s native support for the five languages and its context-aware pacing to produce distinct advocate personas. The evolution of Large Language Models (LLMs) has shifted the focus of Text-to-Speech (TTS) technology from basic intelligibility to context-aware, expressive synthesis. In the legal domain, synthetic speech must convey authority and a specific professional persona, a task that becomes significantly more complex in the linguistically diverse landscape of India. The models exhibit a “monotone authority,” excelling at procedural information delivery but struggling with the dynamic vocal modulation and emotive gravitas required for persuasive advocacy. Performance dips in Bengali and Gujarati further highlight phonological frontiers for future refinement. This research underscores the readiness of multilingual TTS for procedural legal tasks while identifying the remaining challenges in replicating the persuasive artistry of human legal discourse. The code is available at this https URL
[NLP-102] Efficient Hyper-Parameter Search for LoRA via Language-aided Bayesian Optimization
【速读】: 该论文旨在解决低秩适应(Low-Rank Adaptation, LoRA)微调大型语言模型(Large Language Models, LLMs)时面临的超参数敏感性问题,即LoRA对超参数选择高度敏感,而传统穷举搜索在计算上极为昂贵。解决方案的关键在于将预训练LLM的领域知识融入贝叶斯优化(Bayesian Optimization, BO)框架中,通过语言提示(language prompting)构建超参数与其领域知识之间的离散到连续映射,从而在连续向量空间中高效搜索最优超参数;同时引入可学习标记(learnable token)建模难以用自然语言描述的残差信息,并利用全量与子集训练数据在LoRA中的强性能相关性,采用代理训练(proxy training)策略进一步提升效率。该方法仅需约30次迭代即可获得优于传统45,000次组合搜索得到的超参数配置,性能提升超过20%。
链接: https://arxiv.org/abs/2602.11171
作者: Baek Seong-Eun,Lee Jung-Mok,Kim Sung-Bin,Tae-Hyun Oh
机构: POSTECH(浦项科技大学); KAIST(韩国科学技术院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Fine-tuning Large Language Models (LLMs) with Low-Rank Adaptation (LoRA) enables resource-efficient personalization or specialization, but it comes at the expense of additional hyperparameter tuning. Although LoRA makes fine-tuning efficient, it is highly sensitive to the choice of hyperparameters, and exhaustive hyperparameter search is still computationally very demanding. To address these challenges, we propose a framework that integrates the domain knowledge of pre-trained LLMs into Bayesian Optimization (BO) to efficiently search for LoRA hyperparameters. To leverage the informed knowledge of LLMs, we repurpose LLMs as a discrete-to-continuous mapping to link the hyperparameters and their domain knowledge with a continuous vector space, where BO is conducted. We design and control the mapping by language prompting, where we provide a domain-aware textual prompt describing the relationships among hyperparameters and their respective roles; thereby, we explicitly inject domain knowledge about LoRA into the LLM in natural language. Also, we model the residual information that is hard to linguistically describe in the prompt with an additional learnable token. This aids BO to sample more high-performing hyperparameters. In addition, by leveraging the observation of the strong correlation between the respective performance obtained from full and subset training datasets in LoRA training regimes, we introduce proxy training and evaluation with a data subset. This further increases the efficiency of our method. We demonstrate that our hyperparameter found with only about 30 iterations achieves more than 20% performance improvement over standard hyperparameters found from about 45,000 combinations.
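其效率思路(用数据子集的代理训练为贝叶斯优化提供廉价目标)可示意如下。注意:论文通过 LLM 提示把离散超参映射进连续空间后再做 BO,此处未复现,仅用 scikit-optimize 的标准 GP-BO 代替;train_lora_on_subset 是假设的占位函数,这里给它一个合成的损失面以便示例可直接运行。

```python
import math
from skopt import gp_minimize
from skopt.space import Integer, Real

def train_lora_on_subset(rank: int, alpha: float, lr: float) -> float:
    """Hypothetical proxy: LoRA fine-tuning on a small data subset, returning
    validation loss (subset results correlate well with full-data results per
    the paper's observation). A synthetic landscape is used here."""
    return (math.log10(lr) + 4) ** 2 + abs(rank - 16) / 32 + abs(alpha - 16) / 64

def objective(params):
    rank, alpha, lr = params
    return train_lora_on_subset(rank, alpha, lr)

space = [Integer(2, 64, name="rank"),
         Real(1.0, 64.0, name="alpha"),
         Real(1e-5, 1e-3, prior="log-uniform", name="lr")]

# ~30 BO iterations, matching the budget reported in the abstract
result = gp_minimize(objective, space, n_calls=30, random_state=0)
print("best (rank, alpha, lr):", result.x, " proxy val loss:", round(result.fun, 4))
```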
[NLP-103] PRIME: Policy-Reinforced Iterative Multi-agent Execution for Algorithmic Reasoning in Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在算法推理(algorithmic reasoning)任务中表现受限的问题。现有方法难以维持长时间的状态追踪与约束满足,导致错误传播并显著降低准确性。解决方案的关键在于提出PRIME框架,其核心由三个专业化代理组成:执行器(executor)负责逐步推理、验证器(verifier)进行约束检查、协调器(coordinator)实现回溯控制,并通过群体相对策略优化(group relative policy optimization)进行联合训练。其中,迭代验证机制被证实是性能提升的主要来源,有效抑制了错误累积,使模型在需要持续状态跟踪的任务上取得显著改进,如图灵机模拟准确率从9%提升至92%,长除法从16%提升至94%。
链接: https://arxiv.org/abs/2602.11170
作者: Jiawei Xu,Zhenyu Yu,Ziqian Bi,Minh Duc Pham,Xiaoyi Qu,Danyang Zhang
机构: Purdue University (普渡大学); University of Malaya (马来亚大学); Beijing University of Technology (北京工业大学); Georgia Institute of Technology (佐治亚理工学院); Lehigh University (利哈伊大学); Independent Researcher (独立研究员)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models have demonstrated remarkable capabilities across diverse reasoning tasks, yet their performance on algorithmic reasoning remains limited. To address this limitation, we propose PRIME (Policy-Reinforced Iterative Multi-agent Execution), a framework comprising three specialized agents: an executor for step-by-step reasoning, a verifier for constraint checking, and a coordinator for backtracking control, optimized through group relative policy optimization. For comprehensive evaluation, we introduce PRIME-Bench, the largest algorithmic reasoning benchmark to date, comprising 86 tasks across 12 categories with 51,600 instances. Tasks span sorting algorithms, graph and tree structures, automata and state machines, symbolic reasoning, and constraint-based puzzles, with execution traces reaching over one million steps. Compared to the baseline approach, PRIME improves average accuracy from 26.8% to 93.8%, a 250% relative gain. The largest improvements occur on tasks requiring sustained state tracking, with Turing machine simulation improving from 9% to 92% and long division from 16% to 94%. Ablation studies identify iterative verification as the primary contributor, preventing the error propagation that causes baseline approaches to fail catastrophically. Analysis across model scales (8B-120B parameters) reveals that smaller models benefit disproportionately, achieving accuracy comparable to models 8x larger.
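PRIME 的三角色控制流(执行—校验—回溯)可以在一个玩具状态追踪任务上示意如下;executor/verifier 为简单 Python 函数,替代论文中经 GRPO 训练的智能体,20% 的错误率用于模拟 LLM 的偶发推理失误。

```python
import random

def executor(state):
    """Propose one bubble-sort step; occasionally pick a wrong index,
    mimicking an LLM's faulty reasoning step."""
    s = list(state)
    for i in range(len(s) - 1):
        if s[i] > s[i + 1]:
            j = i if random.random() > 0.2 else min(i + 1, len(s) - 2)
            s[j], s[j + 1] = s[j + 1], s[j]
            return s
    return s                      # already sorted: fixed point

def verifier(prev, prop):
    """Constraint check: same multiset and strictly fewer inversions,
    or a verified sorted fixed point."""
    inv = lambda x: sum(a > b for k, a in enumerate(x) for b in x[k + 1:])
    if sorted(prev) != sorted(prop):
        return False
    return inv(prop) < inv(prev) or (prop == prev and inv(prev) == 0)

def coordinator(state):
    """Backtracking control: discard any step the verifier rejects."""
    while True:
        prop = executor(state)
        if not verifier(state, prop):
            continue              # backtrack and re-propose this step
        if prop == state:
            return state          # verified sorted state
        state = prop

print(coordinator([5, 2, 4, 1, 3]))   # -> [1, 2, 3, 4, 5]
```

正是这种逐步校验阻断了错误在长执行轨迹中的传播,与消融实验中“迭代验证贡献最大”的结论一致。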
[NLP-104] Disentangling Direction and Magnitude in Transformer Representations: A Double Dissociation Through L2-Matched Perturbation Analysis ICML2026
【速读】: 该论文旨在解决Transformer模型中隐藏状态的方向(direction,即向量在表示空间中的取向)与模长(magnitude,即向量范数)是否具有不同的功能角色这一关键问题。以往研究未能清晰区分二者的作用,而本文通过提出L2匹配扰动分析方法(L2-matched perturbation analysis),确保方向扰动和模长扰动在欧几里得空间中产生等效的位移,从而实现对二者独立影响的量化比较。实验发现:方向扰动显著损害语言建模损失(最高达42.9%),而模长扰动则主要损害句法处理性能(如主谓一致任务准确率下降20.4%,远高于方向扰动的1.6%)。进一步因果干预表明,方向损伤主要沿注意力路径传播(注意力修复可恢复28.4%损失),而模长损伤部分通过LayerNorm路径传导(LayerNorm修复可恢复29.9%损失)。这一跨尺度一致性结果揭示了方向与模长在LayerNorm架构中支持部分分离的计算功能:方向主导注意力路由,模长调节细粒度句法判断的处理强度。此发现不仅挑战了传统线性表示假设,也为模型编辑与可解释性研究提供了新视角。
链接: https://arxiv.org/abs/2602.11169
作者: Mangadoddi Srikar Vardhan,Lekkala Sai Teja
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, 7 figures; to be submitted to ICML 2026
Abstract:Transformer hidden states encode information as high-dimensional vectors, yet whether direction (orientation in representational space) and magnitude (vector norm) serve distinct functional roles remains unclear. Studying Pythia-family models, we discover a striking cross-over dissociation: angular perturbations cause up to 42.9% more damage to language modeling loss, while magnitude perturbations cause disproportionately more damage to syntactic processing (20.4% vs. 1.6% accuracy drop on subject-verb agreement). This finding is enabled by L2-matched perturbation analysis, a methodology ensuring that angular and magnitude perturbations achieve identical Euclidean displacements. Causal intervention reveals that angular damage flows substantially through the attention pathways (28.4% loss recovery via attention repair), while magnitude damage flows partly through the LayerNorm pathways (29.9% recovery via LayerNorm repair). These patterns replicate across scales within the Pythia architecture family. These findings provide evidence that direction and magnitude support partially distinct computational roles in LayerNorm-based architectures. Direction preferentially affects attentional routing, while magnitude modulates processing intensity for fine-grained syntactic judgments. We find different patterns in RMSNorm-based architectures, suggesting that the dissociation depends on architectural choices. Our results refine the linear representation hypothesis and have implications for model editing and interpretability research.
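其中“L2 匹配”的构造可以直接写出:角度扰动在保持范数的球面上旋转,幅度扰动沿自身方向缩放,两者的欧氏位移都严格等于同一 eps。以下为按此定义的最小示意(实现细节与论文未必一致):

```python
import torch

def angular_perturb(h: torch.Tensor, eps: float) -> torch.Tensor:
    """Rotate h on its norm sphere so that ||h' - h|| == eps."""
    n = h.norm()
    noise = torch.randn_like(h)
    noise -= (noise @ h) / n**2 * h            # component orthogonal to h
    noise /= noise.norm()
    # a chord of length eps on a sphere of radius n spans angle theta
    theta = 2 * torch.asin(torch.clamp(eps / (2 * n), max=1.0))
    return torch.cos(theta) * h + n * torch.sin(theta) * noise

def magnitude_perturb(h: torch.Tensor, eps: float) -> torch.Tensor:
    """Rescale h along its own direction by exactly eps."""
    return h * (1 + eps / h.norm())

h, eps = torch.randn(768), 0.5
ha, hm = angular_perturb(h, eps), magnitude_perturb(h, eps)
print((ha - h).norm().item(), (hm - h).norm().item())   # both ~= eps
print(h.norm().item(), ha.norm().item())                # angular keeps the norm
```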
[NLP-105] Enhancing SDG-Text Classification with Combinatorial Fusion Analysis and Generative AI
【速读】: 该论文旨在解决在缺乏明确类别标签、类别难以区分或存在相互关联的情况下,文本分类任务的准确性与可靠性问题,尤其是在基于联合国可持续发展目标(Sustainable Development Goals, SDGs)对文本进行分类时面临的挑战。解决方案的关键在于引入组合融合分析(Combinatorial Fusion Analysis, CFA)方法,该方法通过引入秩-评分特征(Rank-Score Characteristic, RSC)函数和认知多样性(Cognitive Diversity, CD)机制,整合多个性能良好且相互异质的机器学习/人工智能(Machine Learning/AI)模型的输出,并结合生成式AI(Generative AI)合成数据用于训练,从而显著提升分类性能(达到96.73%),优于单一模型表现,同时验证了多模型融合与人类专家判断之间具有互补与增强关系。
链接: https://arxiv.org/abs/2602.11168
作者: Jingyan Xu,Marcelo L. LaFleur,Christina Schweikert,D. Frank Hsu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages, 8 figures, 4 tables; Accepted to 2025 IEEE International Conference on Pervasive Intelligence and Computing (PICom 2025)
Abstract:Natural Language Processing (NLP) techniques such as text classification and topic discovery are very useful in many application areas, including information retrieval, knowledge discovery, policy formulation, and decision-making. However, classification remains challenging when the categories are unavailable, difficult to differentiate, or interrelated. Social analysis with human context is an area that can benefit from text classification, as it relies substantially on text data. The focus of this paper is to enhance the classification of text according to the UN’s Sustainable Development Goals (SDGs) by collecting and combining intelligence from multiple models. Combinatorial Fusion Analysis (CFA), a system fusion paradigm using a rank-score characteristic (RSC) function and cognitive diversity (CD), has been used to enhance classifier methods by combining a set of relatively good and mutually diverse classification models. We use a generative AI model to generate synthetic data for model training and then apply CFA to this classification task. The CFA technique achieves 96.73% performance, outperforming the best individual model. We compare the outcomes with those obtained from human domain experts, demonstrating that intelligence combined from multiple ML/AI models via CFA and input from human experts can not only complement but also enhance each other.
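CFA 的三个要素——秩-评分特征(RSC)函数、认知多样性(CD)与评分/名次两种融合——可用如下玩具示例说明(两组分数为虚构数据,仅示意计算方式):

```python
import numpy as np

def rsc(scores: np.ndarray) -> np.ndarray:
    """Rank-score characteristic: scores re-read in descending rank order,
    normalized to [0, 1]."""
    s = np.sort(scores)[::-1]
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

def cognitive_diversity(a: np.ndarray, b: np.ndarray) -> float:
    """CD as the distance between two RSC functions."""
    return float(np.sqrt(np.mean((rsc(a) - rsc(b)) ** 2)))

# two classifiers scoring the same 5 candidate SDG labels for one document
a = np.array([0.90, 0.70, 0.40, 0.30, 0.10])
b = np.array([0.85, 0.20, 0.60, 0.55, 0.05])

ranks = lambda s: np.argsort(np.argsort(-s)) + 1       # 1 = best
score_fusion = (a + b) / 2                             # higher is better
rank_fusion = (ranks(a) + ranks(b)) / 2                # lower is better

print("CD:", cognitive_diversity(a, b))
print("score-fused pick:", int(np.argmax(score_fusion)))
print("rank-fused pick:", int(np.argmin(rank_fusion)))
```

CFA 的要点是只融合“性能相近且 CD 较大”的模型对,对应摘要中 “relatively good and mutually diverse” 的表述。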
[NLP-106] Visualizing and Benchmarking LLM Factual Hallucination Tendencies via Internal State Analysis and Clustering
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成内容时频繁产生幻觉(hallucination)的问题,尤其关注虚假陈述因误导性或伪造引用而被强化的现象。其解决方案的关键在于构建了一个名为FalseCite的结构化数据集,专门用于捕捉和基准测试由欺骗性引用诱发的幻觉响应,并通过分析模型内部状态(如隐藏层向量)揭示幻觉行为的潜在模式——例如发现无论是否产生幻觉,隐藏状态向量均呈现独特的“horn-like”(角状)几何结构。这一发现为未来识别与缓解LLMs中的幻觉提供了可解释的理论基础和评估工具。
链接: https://arxiv.org/abs/2602.11167
作者: Nathan Mao,Varun Kaushik,Shreya Shivkumar,Parham Sharafoleslami,Kevin Zhu,Sunishchal Dev
机构: The Harker School; Monta Vista High School; University of California Berkeley; Algoverse AI Research
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) often hallucinate, generating nonsensical or false information that can be especially harmful in sensitive fields such as medicine or law. To study this phenomenon systematically, we introduce FalseCite, a curated dataset designed to capture and benchmark hallucinated responses induced by misleading or fabricated citations. Running GPT-4o-mini, Falcon-7B, and Mistral 7-B through FalseCite, we observed a noticeable increase in hallucination activity for false claims with deceptive citations, especially in GPT-4o-mini. Using the responses from FalseCite, we can also analyze the internal states of hallucinating models, visualizing and clustering the hidden state vectors. From this analysis, we noticed that the hidden state vectors, regardless of hallucination or non-hallucination, tend to trace out a distinct horn-like shape. Our work underscores FalseCite’s potential as a foundation for evaluating and mitigating hallucinations in future LLM research.
[NLP-107] Small Updates, Big Doubts: Does Parameter-Efficient Fine-tuning Enhance Hallucination Detection?
【速读】: 该论文旨在解决参数高效微调(Parameter-efficient fine-tuning, PEFT)方法对大语言模型(Large Language Models, LLMs)幻觉行为影响不明确的问题,尤其是在问答(QA)数据集上的表现。现有研究普遍假设PEFT能提升事实正确性,但其如何改变模型幻觉检测能力仍缺乏系统理解。论文通过在三个开源权重LLM骨干模型和三个事实导向QA基准上进行综合实证研究,评估了七种无监督幻觉检测方法(涵盖语义一致性、置信度与熵三类范式),发现PEFT显著提升了多种检测器的AUROC指标,表明其增强了模型对幻觉的识别能力。关键在于,进一步的线性探测与表征诊断分析揭示:PEFT主要通过重塑不确定性编码与表达方式来实现这一效果,而非引入新的事实知识注入。
链接: https://arxiv.org/abs/2602.11166
作者: Xu Hu,Yifan Zhang,Songtao Wei,Chen Zhao,Qiannan Li,Bingzhe Li,Feng Chen
机构: The University of Texas at Dallas (德克萨斯大学达拉斯分校); Baylor University (贝勒大学); The University of California, Davis (加州大学戴维斯分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages, 13 figures, 8 tables
Abstract:Parameter-efficient fine-tuning (PEFT) methods are widely used to adapt large language models (LLMs) to downstream tasks and are often assumed to improve factual correctness. However, how parameter-efficient fine-tuning affects hallucination behavior remains insufficiently understood, especially on QA datasets. In this work, we systematically investigate the impact of PEFT on hallucination detection through a comprehensive empirical study across three open-weight LLM backbones and three fact-seeking QA benchmarks. For each model, we evaluate performance using seven unsupervised hallucination detection methods spanning three complementary approaches: semantic-consistency-based detectors, confidence-based detectors, and entropy-based detectors. This multifaceted evaluation enables us to characterize how PEFT reshapes uncertainty across different detection paradigms. Our experimental results show that PEFT consistently strengthens hallucination detection, substantially improving AUROC across a wide range of detectors. Further analyses using linear probes and representation diagnostics indicate that PEFT primarily reshapes how uncertainty is encoded and surfaced, rather than injecting new factual knowledge into the models.
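评测所涵盖的熵类检测器,其基本形态可示意如下:对同一问题采样多次回答,以经验分布的熵作为幻觉分数,再用 AUROC 评估;示例中的采样结果与标签均为虚构。

```python
import math
from collections import Counter
from sklearn.metrics import roc_auc_score

def answer_entropy(answers: list) -> float:
    """Entropy of the empirical answer distribution across samples;
    inconsistent answers (high entropy) signal likely hallucination."""
    counts = Counter(answers)
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())

# hypothetical: 10 sampled answers per question + gold hallucination labels
samples = [
    ["paris"] * 9 + ["lyon"],                                 # consistent
    ["1912", "1915", "1910", "1912", "1920",
     "1918", "1912", "1931", "1905", "1911"],                 # scattered
]
is_hallucination = [0, 1]

scores = [answer_entropy(s) for s in samples]
print("AUROC:", roc_auc_score(is_hallucination, scores))
```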
[NLP-108] Assessing LLM Reliability on Temporally Recent Open-Domain Questions
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在开放域问答任务中对近期时间信息的语义一致性问题,即现有模型是否能准确反映人类社区对时事内容的理解。其核心挑战在于传统基于词法匹配的评估指标(如BLEU、ROUGE)可能无法充分衡量模型生成答案与参考答案之间的语义对齐程度。解决方案的关键在于构建RECOM数据集——一个包含15,000条2025年9月Reddit热门问题及其社区共识答案的基准测试集,并采用多维度评估框架:包括词汇级指标(BLEU、ROUGE)、语义相似度指标(BERTScore、MoverScore、余弦相似度)以及逻辑推理能力(自然语言推理,NLI)。研究发现,尽管模型在词法层面重合度极低(BLEU-1 < 8%),但语义相似度极高(余弦相似度 > 99%),揭示出模型通过深度改写而非直接复制实现语义保留的现象,从而证明仅依赖词法指标会严重误导对生成质量的判断,强调应建立融合语义和逻辑一致性的综合评估体系。
链接: https://arxiv.org/abs/2602.11165
作者: Pushwitha Krishnappa,Amit Das,Vinija Jain,Tathagata Mukherjee,Aman Chadha
机构: University of Alabama Huntsville (阿拉巴马大学亨茨维尔分校); University of North Alabama (北阿拉巴马大学); Stanford University (斯坦福大学); Google(谷歌); Apple(苹果)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) are increasingly deployed for open-domain question answering, yet their alignment with human perspectives on temporally recent information remains underexplored. We introduce RECOM (Reddit Evaluation for Correspondence of Models), a benchmark dataset of 15,000 recent Reddit questions from September 2025 paired with community-derived reference answers. We investigate how four open-source LLMs (Llama3.1-8B, Mistral-7B, Gemma-2-9B, and GPT-OSS-20B) respond to these questions, evaluating alignment using lexical metrics (BLEU, ROUGE), semantic similarity (BERTScore, MoverScore, cosine similarity), and logical inference (NLI). Our central finding is a striking semantic-lexical paradox: all models achieve over 99% cosine similarity with references despite less than 8% BLEU-1 overlap, a 90+ percentage point gap indicating that models preserve meaning through extensive paraphrasing rather than lexical reproduction. MoverScore (51-53%) confirms this pattern, occupying an intermediate position that reflects the optimal transport cost of semantic alignment. Furthermore, model scale does not predict performance: Mistral-7B (7B parameters) outperforms GPT-OSS-20B (20B parameters) across all metrics. NLI analysis reveals that contradiction rates remain below 7%, suggesting models rarely generate content that directly conflicts with human consensus. These findings challenge the reliability of lexical metrics for evaluating abstractive generation and argue for multi-dimensional evaluation frameworks that capture semantic fidelity beyond surface-level text matching. The RECOM dataset is publicly available at this https URL
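文中“语义-词法悖论”的度量方式可用一对句子直接示意:BLEU-1 衡量词面重合,句向量余弦衡量语义一致;模型名 all-MiniLM-L6-v2 为示例选择,并非论文原设置。

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

reference = "Honestly the new update made the app slower on older phones."
candidate = "Many users feel the latest release degrades performance on aging devices."

bleu1 = sentence_bleu([reference.split()], candidate.split(),
                      weights=(1, 0, 0, 0),
                      smoothing_function=SmoothingFunction().method1)

embedder = SentenceTransformer("all-MiniLM-L6-v2")
emb = embedder.encode([reference, candidate])
cos = cosine_similarity(emb[:1], emb[1:])[0, 0]

print(f"BLEU-1 {bleu1:.3f} vs. cosine {cos:.3f}")   # low overlap, high similarity
```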
[NLP-109] Automated Optimization Modeling via a Localizable Error-Driven Perspective
【速读】: 该论文旨在解决自动化优化建模(automated optimization modeling)中因高质量训练数据稀缺及错误特定问题稀疏导致的大型语言模型(Large Language Models, LLMs)后训练效果受限的问题。其核心挑战在于:(L1)错误特定问题样本稀疏,以及(L2)困难问题对应的奖励信号稀疏,二者均限制了领域特定后训练性能。解决方案的关键在于提出一种基于可定位错误驱动视角的新型学习框架——MIND(Modeling via a Localizable Error-Driven Perspective),其创新性地利用优化建模中错误传播具有局部性的特性(即错误通常局限于特定语义片段而不扩散至整个解空间),从而构建高密度聚焦训练语料,并引入动态监督微调策略优化(Dynamic Supervised Fine-Tuning Policy Optimization, DFPO),通过局部精修机制提升模型在复杂问题上的表现。实验表明,MIND在六个基准测试上显著优于现有最先进方法。
链接: https://arxiv.org/abs/2602.11164
作者: Weiting Liu,Han Wu,Yufei Kuang,Xiongwei Han,Tao Zhong,Jianfeng Feng,Wenlian Lu
机构: Fudan University (复旦大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室); University of Science and Technology of China (中国科学技术大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Automated optimization modeling via Large Language Models (LLMs) has emerged as a promising approach to assist complex human decision-making. While post-training has become a pivotal technique to enhance LLMs’ capabilities in this domain, its effectiveness is severely constrained by the scarcity and underutilization of high-quality training data. Through a detailed profiling of error patterns across various problem-response pairs drawn from post-training, we identify two fundamental limitations of existing automated optimization modeling approaches: (L1) the sparsity of error-specific problems and (L2) the sparse rewards associated with difficult problems. We demonstrate that these limitations can result in suboptimal performance in domain-specific post-training for LLMs. To tackle these two limitations, we propose a novel error-driven learning framework, namely automated optimization modeling via a localizable error-driven perspective (MIND), that customizes the whole model training framework from data synthesis to post-training. MIND is based on our key observation of the unique localizable patterns in error propagation of optimization modeling: modeling errors may remain localized to specific semantic segments and do not propagate throughout the entire solution. Thus, in contrast to holistic reasoning tasks such as mathematical proofs, MIND leverages the construction of a focused, high-density training corpus and proposes Dynamic Supervised Fine-Tuning Policy Optimization (DFPO) to tackle difficult problems through localized refinement. Experiments on six benchmarks demonstrate that MIND consistently outperforms all the state-of-the-art automated optimization modeling approaches.
[NLP-110] Nested Named Entity Recognition in Plasma Physics Research Articles
【速读】: 该论文旨在解决等离子体物理研究文献中复杂且富含上下文的文本内容难以有效提取专用实体的问题,以支持高级搜索等应用。其解决方案的关键在于提出一种基于编码器-Transformer与条件随机场(Conditional Random Fields, CRF)相结合的轻量级方法,用于识别嵌套命名实体;具体包括:构建包含16类标注的等离子体物理语料库、采用针对不同实体类型的独立BERT-CRF模型进行专业化训练,并引入系统化的超参数优化流程以提升模型性能。
链接: https://arxiv.org/abs/2602.11163
作者: Muhammad Haris,Hans Höft,Markus M. Becker,Markus Stocker
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Named Entity Recognition (NER) is an important task in natural language processing that aims to identify and extract key entities from unstructured text. We present a novel application of NER in plasma physics research articles and address the challenges of extracting specialized entities from scientific text in this domain. Research articles in plasma physics often contain highly complex and context-rich content that must be extracted to enable, e.g., advanced search. We propose a lightweight approach based on encoder-transformers and conditional random fields to extract (nested) named entities from plasma physics research articles. First, we annotate a plasma physics corpus with 16 classes specifically designed for the nested NER task. Second, we evaluate an entity-specific model specialization approach, where independent BERT-CRF models are trained to recognize individual entity types in plasma physics text. Third, we integrate an optimization process to systematically fine-tune hyperparameters and enhance model performance. Our work contributes to the advancement of entity recognition in plasma physics and also provides a foundation to support researchers in navigating and analyzing scientific literature.
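论文所用的 BERT-CRF 结构(每类实体一个独立模型)大致如下示意;模型名与标签集为占位,CRF 层来自 pytorch-crf 库。

```python
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF          # pip install pytorch-crf

class BertCrfTagger(nn.Module):
    """One entity type per model, per the paper's specialization scheme."""

    def __init__(self, model_name="bert-base-cased", num_tags=3):  # B / I / O
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.emission = nn.Linear(self.encoder.config.hidden_size, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        e = self.emission(h)                       # per-token tag scores
        mask = attention_mask.bool()
        if tags is not None:                       # training: mean NLL
            return -self.crf(e, tags, mask=mask, reduction="mean")
        return self.crf.decode(e, mask=mask)       # inference: best tag paths
```

训练时传入形状为 [B, T] 的标签索引得到损失,推理时返回每条序列的最优标签路径;嵌套实体由多个这样的单类型模型的输出叠加得到。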
[NLP-111] Retrieval Heads are Dynamic
【速读】: 该论文旨在解决现有研究中对大型语言模型(Large Language Models, LLMs)中“检索头”(retrieval heads)的理解过于静态的问题,即以往方法依赖于跨数据集的统计平均值来识别具有检索能力的头部,忽略了自回归生成过程中的时间动态性。其解决方案的关键在于从动态视角出发,通过细致的时间步分析揭示检索头在不同生成阶段的行为差异,并提出三个核心发现:(1)检索头在生成过程中呈现显著的时间动态性;(2)每个时间步的动态检索头具有不可替代性,无法被静态检索头有效替代;(3)模型隐藏状态中编码了对未来检索模式的预测信号,表明存在内部规划机制。这一动态视角为理解LLM内部工作机制提供了新的理论基础和实证依据。
链接: https://arxiv.org/abs/2602.11162
作者: Yuping Lin,Zitao Li,Yue Xing,Pengfei He,Yingqian Cui,Yaliang Li,Bolin Ding,Jingren Zhou,Jiliang Tang
机构: Michigan State University (密歇根州立大学); Zoom Communications (Zoom 通讯公司); Tongyi Lab, Alibaba Group (通义实验室,阿里巴巴集团)
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent studies have identified “retrieval heads” in Large Language Models (LLMs) responsible for extracting information from input contexts. However, prior works largely rely on static statistics aggregated across datasets, identifying heads that perform retrieval on average. This perspective overlooks the fine-grained temporal dynamics of autoregressive generation. In this paper, we investigate retrieval heads from a dynamic perspective. Through extensive analysis, we establish three core claims: (1) Dynamism: Retrieval heads vary dynamically across timesteps; (2) Irreplaceability: Dynamic retrieval heads are specific at each timestep and cannot be effectively replaced by static retrieval heads; and (3) Correlation: The model’s hidden state encodes a predictive signal for future retrieval head patterns, indicating an internal planning mechanism. We validate these findings on the Needle-in-a-Haystack task and a multi-hop QA task, and quantify the differences on the utility of dynamic and static retrieval heads in a Dynamic Retrieval-Augmented Generation framework. Our study provides new insights into the internal mechanisms of LLMs.
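“检索头”的逐时间步打分可按如下思路示意:在 needle-in-a-haystack 式输入上,统计当前解码位置各注意力头投向 needle 片段的注意力质量。此处用 GPT-2 充当小型替身,打分方式是对论文做法的简化。

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")             # small stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

context = ("The secret code is 7341. " + "Filler sentence. " * 20
           + "The secret code is")
ids = tok(context, return_tensors="pt").input_ids
needle = tok(" 7341").input_ids                          # span to be retrieved
needle_pos = [i for i in range(ids.shape[1]) if int(ids[0, i]) in needle]

with torch.no_grad():
    out = model(ids, output_attentions=True)

# attention mass of the *current* decoding position on the needle, per head
for layer, attn in enumerate(out.attentions):            # [B, heads, T, T]
    score = attn[0, :, -1, needle_pos].sum(-1)
    top = int(score.argmax())
    print(f"layer {layer:2d}: top head {top:2d}  retrieval score {score[top]:.2f}")
```

逐步解码并在每个时间步重复这一统计,即得到论文所分析的“动态”检索头轨迹。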
[NLP-112] Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety NEURIPS2025
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在非英语语境下安全对齐(safety alignment)不足的问题,尤其是低资源语言中存在的漏洞风险。其核心挑战在于现有安全机制多以英文为中心,导致多语言场景下的越狱攻击(jailbreak)成功率显著上升。解决方案的关键在于引入知识蒸馏(Knowledge Distillation, KD)技术,通过黑盒响应式参数高效微调(Parameter-Efficient Fine-Tuning, PEFT),将专有教师模型(OpenAI o1-mini)的拒绝行为以低秩适配(LoRA)方式蒸馏至三个开源学生模型(Meta-Llama-3-8B-Instruct、Gemma-2-2B-IT 和 Qwen3-8B)。实验发现,标准微调反而会提升越狱成功率(最高达16.6个百分点),而通过移除导致安全退化的“边界”式拒绝行为(nuanced `boundary’ refusals),可有效缓解甚至逆转学生模型的安全性能下降,尽管推理能力(如 GSM8K)仍有所损失。此研究揭示了知识蒸馏在多语言安全对齐中的潜在价值与复杂性,为后续跨语言安全机制设计提供了基础。
链接: https://arxiv.org/abs/2602.11157
作者: Max Zhang,Derek Liu,Kai Zhang,Joshua Franco,Haihao Liu
机构: 未知
类目: Computation and Language (cs.CL)
备注: 9 pages, Poster presented at Socially Responsible and Trustworthy Foundation Models at NeurIPS 2025 Workshop
Abstract:Large language models (LLMs) are increasingly deployed worldwide, yet their safety alignment remains predominantly English-centric. This allows for vulnerabilities in non-English contexts, especially with low-resource languages. We introduce a novel application of knowledge distillation (KD) in the context of multilingual jailbreak prevention, examining its efficacy. We distill the refusal behaviors of a proprietary teacher model (OpenAI o1-mini) with Low-Rank Adaptation (LoRA) into three open-source student models: Meta-Llama-3-8B-Instruct, Gemma-2-2B-IT, and Qwen3-8B, using ~28,000 multilingual jailbreak prompts from XSafety via black-box response-based, parameter-efficient fine-tuning (PEFT). Evaluation on the MultiJail benchmark reveals a counterintuitive behavior: standard fine-tuning on the teacher’s “safe” refusal data inadvertently increases Jailbreak Success Rate (JSR) for all student models, by up to 16.6 percentage points. Our experiments reveal divergent generalization to unseen languages during distillation, with varying outcomes depending on the base model. By removing a primary source of safety degradation, nuanced ‘boundary’ refusals, we mitigate or even reverse safety declines in student models, although reductions in reasoning performance (GSM8K) persist. Overall, our exploratory study highlights the challenges and potential of KD as a technique for multilingual safety alignment, offering a foundation for future research in this direction.
信息检索
[IR-0] AttentionRetriever: Attention Layers are Secretly Long Document Retrievers
【速读】:该论文旨在解决现有检索模型在长文档检索任务中面临的三大关键挑战:上下文感知能力不足、因果依赖关系未被充分建模以及检索范围难以精准界定。针对这些问题,作者提出了AttentionRetriever这一新型长文档检索模型,其核心创新在于利用注意力机制(attention mechanism)和基于实体的检索策略(entity-based retrieval),构建具有上下文感知能力的嵌入表示,并动态确定最优的检索范围。实验表明,该方法在多个长文档检索数据集上显著优于现有模型,同时保持了密集检索模型的高效性。
链接: https://arxiv.org/abs/2602.12278
作者: David Jiahao Fu,Lam Thanh Do,Jiayu Li,Kevin Chen-Chuan Chang
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval augmented generation (RAG) has been widely adopted to help Large Language Models (LLMs) process tasks involving long documents. However, existing retrieval models are not designed for long documents and fail to address several key challenges of long document retrieval, including context-awareness, causal dependence, and scope of retrieval. In this paper, we propose AttentionRetriever, a novel long document retrieval model that leverages the attention mechanism and entity-based retrieval to build context-aware embeddings for long documents and determine the scope of retrieval. With extensive experiments, we find that AttentionRetriever outperforms existing retrieval models on long document retrieval datasets by a large margin while remaining as efficient as dense retrieval models.
[IR-1] SAGEO Arena: A Realistic Environment for Evaluating Search-Augmented Generative Engine Optimization
【速读】:该论文旨在解决当前缺乏对搜索增强型生成引擎优化(Search-Augmented Generative Engine Optimization, SAGEO)进行全面评估的问题。现有基准测试无法支持端到端的可见性分析,通常基于预定义的候选文档,忽略了检索与重排序阶段对生成结果的影响,且未考虑真实网页文档中包含的结构化信息(如Schema标记),而这些信息在实际搜索引擎中是关键信号。解决方案的关键在于提出SAGEO Arena——一个集成大规模含丰富结构信息的网页文档语料库的完整生成式搜索流水线环境,能够实现分阶段的SAGEO分析,明确区分SEO(搜索导向优化)与GEO(生成导向优化),并揭示结构化信息有助于缓解现有方法在实际场景中的性能下降问题,强调有效SAGEO需针对每个流水线阶段进行定制化优化。
链接: https://arxiv.org/abs/2602.12187
作者: Sunghwan Kim,Wooseok Jeong,Serin Kim,Sangam Lee,Dongha Lee
机构: Yonsei University (延世大学); Konkuk University (建国大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Work in Progress
Abstract:Search-Augmented Generative Engines (SAGE) have emerged as a new paradigm for information access, bridging web-scale retrieval with generative capabilities to deliver synthesized answers. This shift has fundamentally reshaped how web content gains exposure online, giving rise to Search-Augmented Generative Engine Optimization (SAGEO), the practice of optimizing web documents to improve their visibility in AI-generated responses. Despite growing interest, no evaluation environment currently supports comprehensive investigation of SAGEO. Specifically, existing benchmarks lack end-to-end visibility evaluation of optimization strategies, operating on pre-determined candidate documents that abstract away retrieval and reranking preceding generation. Moreover, existing benchmarks discard structural information (e.g., schema markup) present in real web documents, overlooking the rich signals that search systems actively leverage in practice. Motivated by these gaps, we introduce SAGEO Arena, a realistic and reproducible environment for stage-level SAGEO analysis. Our objective is to jointly target search-oriented optimization (SEO) and generation-centric optimization (GEO). To achieve this, we integrate a full generative search pipeline over a large-scale corpus of web documents with rich structural information. Our findings reveal that existing approaches remain largely impractical under realistic conditions and often degrade performance in retrieval and reranking. We also find that structural information helps mitigate these limitations, and that effective SAGEO requires tailoring optimization to each pipeline stage. Overall, our benchmark paves the way for realistic SAGEO evaluation and optimization beyond simplified settings.
[IR-2] Towards Personalized Bangla Book Recommendation: A Large-Scale Multi-Entity Book Graph Dataset
【速读】:该论文旨在解决孟加拉语文学领域个性化图书推荐因缺乏结构化、大规模且公开可用数据集而受到的限制问题。其解决方案的关键在于构建了一个名为RokomariBG的大规模多实体异构图书图数据集,该数据集包含127,302本书、63,723名用户、16,601位作者、1,515个类别、2,757家出版社和209,602条评论,并通过八种关系类型组织成一个全面的知识图谱。为验证该数据集的有效性,研究者还系统地评估了多种代表性推荐模型在Top-N推荐任务上的表现,结果表明利用多关系结构和文本侧信息对提升推荐性能至关重要,其中神经检索模型取得了最佳效果(NDCG@10 = 0.204)。这一工作为低资源语言环境下的图书推荐研究奠定了基础并提供了可复现的基准与公开资源。
链接: https://arxiv.org/abs/2602.12129
作者: Rahin Arefin Ahmed,Md. Anik Chowdhury,Sakil Ahmed Sheikh Reza,Devnil Bhattacharjee,Muhammad Abdullah Adnan,Nafis Sadeq
机构: East West University (东西大学); Bangladesh University of Engineering and Technology (孟加拉国工程技术大学)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:Personalized book recommendation in Bangla literature has been constrained by the lack of structured, large-scale, and publicly available datasets. This work introduces RokomariBG, a large-scale, multi-entity heterogeneous book graph dataset designed to support research on personalized recommendation in a low-resource language setting. The dataset comprises 127,302 books, 63,723 users, 16,601 authors, 1,515 categories, 2,757 publishers, and 209,602 reviews, connected through eight relation types and organized as a comprehensive knowledge graph. To demonstrate the utility of the dataset, we provide a systematic benchmarking study on the Top-N recommendation task, evaluating a diverse set of representative recommendation models, including classical collaborative filtering methods, matrix factorization models, content-based approaches, graph neural networks, a hybrid matrix factorization model with side information, and a neural two-tower retrieval architecture. The benchmarking results highlight the importance of leveraging multi-relational structure and textual side information, with neural retrieval models achieving the strongest performance (NDCG@10 = 0.204). Overall, this work establishes a foundational benchmark and a publicly available resource for Bangla book recommendation research, enabling reproducible evaluation and future studies on recommendation in low-resource cultural domains. The dataset and code are publicly available at this https URL
[IR-3] Compress Cross and Scale: Multi-Level Compression Cross Networks for Efficient Scaling in Recommender Systems
【速读】:该论文旨在解决工业推荐系统中高阶特征交互建模的效率与性能平衡问题,即如何在有限计算资源和严格延迟约束下,同时实现强大的交互能力、高计算效率和良好的可扩展性。其核心解决方案是提出MLCC(Multi-Level Compressed Cross),一种通过分层压缩与动态组合结构组织特征交叉的架构,能够高效捕获高阶特征依赖关系并保持较低的计算复杂度;进一步引入MC-MLCC(Multi-Channel MLCC)作为扩展,将特征交互分解到并行子空间中,实现参数增长可控的水平扩展,显著提升表示能力和计算效率。实验表明,该方法在多个公开数据集和大规模工业场景中均优于DLRM类基线模型,且在相同性能下模型参数和浮点运算次数(FLOPs)减少最多达26倍。
链接: https://arxiv.org/abs/2602.12041
作者: Heng Yu,Xiangjun Zhou,Jie Xia,Heng Zhao,Anxin Wu,Yu Zhao,Dongying Kong
机构: Bilibili Inc.(哔哩哔哩公司)
类目: Information Retrieval (cs.IR)
备注: 11 pages, 3 figures
Abstract:Modeling high-order feature interactions efficiently is a central challenge in click-through rate and conversion rate prediction. Modern industrial recommender systems are predominantly built upon deep learning recommendation models, where the interaction backbone plays a critical role in determining both predictive performance and system efficiency. However, existing interaction modules often struggle to simultaneously achieve strong interaction capacity, high computational efficiency, and good scalability, resulting in limited ROI when models are scaled under strict production constraints. In this work, we propose MLCC, a structured feature interaction architecture that organizes feature crosses through hierarchical compression and dynamic composition, which can efficiently capture high-order feature dependencies while maintaining favorable computational complexity. We further introduce MC-MLCC, a Multi-Channel extension that decomposes feature interactions into parallel subspaces, enabling efficient horizontal scaling with improved representation capacity and significantly reduced parameter growth. Extensive experiments on three public benchmarks and a large-scale industrial dataset show that our proposed models consistently outperform strong DLRM-style baselines by up to 0.52 AUC, while reducing model parameters and FLOPs by up to 26 \times under comparable performance. Comprehensive scaling analyses demonstrate stable and predictable scaling behavior across embedding dimension, head number, and channel count, with channel-based scaling achieving substantially better efficiency than conventional embedding inflation. Finally, online A/B testing on a real-world advertising platform validates the practical effectiveness of our approach, which has been widely adopted in Bilibili advertising system under strict latency and resource constraints.
[IR-4] IncompeBench: A Permissively Licensed Fine-Grained Benchmark for Music Information Retrieval
【速读】:该论文旨在解决音乐信息检索(Music Information Retrieval, MIR)领域缺乏高质量评估基准的问题。当前尽管生成式 AI 和深度预训练模型在多模态信息检索中取得显著进展,音乐表示学习也已融入日常产品,但系统性能的客观评测仍受限于数据质量和标注一致性。解决方案的关键在于构建一个名为 IncompeBench 的精心标注基准,包含1,574个授权宽松的高质量音乐片段、500个多样化查询以及超过12.5万条相关性判断,所有标注通过多阶段流程生成,确保了高人工标注一致性,从而为MIR提供可靠、可复现的评估标准。
链接: https://arxiv.org/abs/2602.11941
作者: Benjamin Clavié,Atoof Shakir,Jonah Turner,Sean Lee,Aamir Shakir,Makoto P. Kato
机构: Mixedbread AI(混合面包人工智能); National Institute of Informatics (日本信息研究所); ETH Zurich (苏黎世联邦理工学院); University of Tsukuba (筑波大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal Information Retrieval has made significant progress in recent years, leveraging the increasingly strong multimodal abilities of deep pre-trained models to represent information across modalities. Music Information Retrieval (MIR), in particular, has considerably increased in quality, with neural representations of music even making their way into everyday products. However, there is a lack of high-quality benchmarks for evaluating music retrieval performance. To address this issue, we introduce IncompeBench, a carefully annotated benchmark comprising 1,574 permissively licensed, high-quality music snippets, 500 diverse queries, and over 125,000 individual relevance judgements. These annotations were created through a multi-stage pipeline, resulting in high agreement between human annotators and the generated data. The resulting datasets are publicly available at this https URL and this https URL with the prompts available at this https URL.
[IR-5] Efficient Crawling for Scalable Web Data Acquisition (Extended Version) EDBT2026
【速读】:该论文旨在解决高价值统计数据集(Statistics Datasets, SDs)在互联网上难以高效、大规模获取的问题,尤其针对网页资源发布形式多样导致的检索困难。其核心挑战在于如何在不遍历整个网站的前提下,精准定位并提取包含目标SD资源的页面。解决方案的关键在于提出一种基于强化学习的聚焦式网络爬虫SB-CLASSIFIER,该方法利用“睡眠老虎机”(sleeping bandits)机制,通过分析链接路径特征,智能学习哪些超链接更可能通向富含目标资源的页面,从而实现仅爬取网站极小比例内容即可捕获大量目标资源的高效策略。
链接: https://arxiv.org/abs/2602.11874
作者: Antoine Gauquier,Ioana Manolescu,Pierre Senellart
机构: DI ENS, ENS, CNRS, PSL, Inria(法国国家信息与自动化研究院); Inria & Institut Polytechnique de Paris(巴黎综合理工学院)
类目: Information Retrieval (cs.IR)
备注: Extended version of a paper published at the EDBT 2026 conference
Abstract:Journalistic fact-checking, as well as social or economic research, require analyzing high-quality statistics datasets (SDs, in short). However, retrieving SD corpora at scale may be hard, inefficient, or impossible, depending on how they are published online. To improve open statistics data accessibility, we present a focused Web crawling algorithm that retrieves as many targets, i.e., resources of certain types, as possible, from a given website, in an efficient and scalable way, by crawling (much) less than the full website. We show that optimally solving this problem is intractable, and propose an approach based on reinforcement learning, namely using sleeping bandits. We propose SB-CLASSIFIER, a crawler that efficiently learns which hyperlinks lead to pages that link to many targets, based on the paths leading to the links in their enclosing webpages. Our experiments on websites with millions of webpages show that our crawler is highly efficient, delivering high fractions of a site’s targets while crawling only a small part.
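其“睡眠老虎机”视角可示意如下:臂对应链接的路径上下文,每一步只有当前 frontier 上出现的臂是“醒着的”,并在这些臂上执行 UCB 式选择;fetch 为假设的抓取函数,奖励为页面链接到的目标资源数。

```python
import math
from collections import defaultdict

pulls = defaultdict(int)
reward_sum = defaultdict(float)
t = 0

def ucb(arm) -> float:
    if pulls[arm] == 0:
        return float("inf")                  # always try unseen arms first
    mean = reward_sum[arm] / pulls[arm]
    return mean + math.sqrt(2 * math.log(t + 1) / pulls[arm])

def crawl_step(frontier: dict, fetch):
    """frontier maps url -> arm (the link's path context).
    fetch(url) is a hypothetical function returning
    (num_targets_found, {new_url: arm, ...})."""
    global t
    url, arm = max(frontier.items(), key=lambda kv: ucb(kv[1]))
    targets, new_links = fetch(url)
    t += 1
    pulls[arm] += 1
    reward_sum[arm] += targets
    del frontier[url]
    frontier.update(new_links)
    return targets
```

反复调用 crawl_step 直至预算耗尽;由于只在“醒着的”臂之间比较,frontier 的动态变化天然落入睡眠老虎机的设定。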
[IR-6] Improving Neural Retrieval with Attribution-Guided Query Rewriting
【速读】:该论文旨在解决神经检索器(neural retrievers)在面对模糊或不明确查询时的脆弱性问题,即即使存在相关文档,此类查询仍可能导致排序错误。现有方法仅部分缓解此问题:大语言模型(LLM)进行查询重写但缺乏检索器反馈,而可解释性方法虽能识别误导性词元(token),却仅用于事后分析。论文的关键解决方案是提出一种基于归因引导的查询重写方法(attribution-guided query rewriting),通过计算检索器输出的梯度-based词元归因(gradient-based token attributions),将这些得分作为软性指导注入结构化提示(structured prompt)中,驱动LLM在保留原始意图的前提下修正弱化或误导性的查询成分。实验表明,该方法在BEIR数据集上显著优于强基线,尤其对隐含或模糊信息需求的查询提升更明显。
链接: https://arxiv.org/abs/2602.11841
作者: Moncef Garouani,Josiane Mothe
机构: IRIT, UMR5505 CNRS (IRIT, UMR5505 CNRS); Université Toulouse Capitole (图卢兹-卡皮托尔大学); UT2J, Université de Toulouse (图卢兹第二大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Neural retrievers are effective but brittle: underspecified or ambiguous queries can misdirect ranking even when relevant documents exist. Existing approaches address this brittleness only partially: LLMs rewrite queries without retriever feedback, and explainability methods identify misleading tokens but are used for post-hoc analysis. We close this loop and propose an attribution-guided query rewriting method that uses token-level explanations to guide query rewriting. For each query, we compute gradient-based token attributions from the retriever and then use these scores as soft guidance in a structured prompt to an LLM that clarifies weak or misleading query components while preserving intent. Evaluated on BEIR collections, the resulting rewrites consistently improve retrieval effectiveness over strong baselines, with larger gains for implicit or ambiguous information needs.
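其反馈信号——检索器打分对查询词元的 gradient×embedding 归因——可如下示意计算;论文随后把这些分数序列化进 LLM 的改写提示,此处从略。所用双塔模型为示例选择。

```python
import torch
from transformers import AutoTokenizer, AutoModel

NAME = "sentence-transformers/all-MiniLM-L6-v2"   # illustrative bi-encoder
tok, model = AutoTokenizer.from_pretrained(NAME), AutoModel.from_pretrained(NAME)

def mean_pool(inputs_embeds, attention_mask):
    h = model(inputs_embeds=inputs_embeds,
              attention_mask=attention_mask).last_hidden_state
    m = attention_mask.unsqueeze(-1)
    return (h * m).sum(1) / m.sum(1)

query, doc = "apple health effects", "Apple Inc. quarterly earnings report"
q, d = tok(query, return_tensors="pt"), tok(doc, return_tensors="pt")

q_emb = model.embeddings.word_embeddings(q.input_ids).detach().requires_grad_(True)
q_vec = mean_pool(q_emb, q.attention_mask)
with torch.no_grad():
    d_vec = mean_pool(model.embeddings.word_embeddings(d.input_ids),
                      d.attention_mask)

torch.cosine_similarity(q_vec, d_vec).sum().backward()
attr = (q_emb.grad * q_emb).sum(-1)[0]            # gradient x input, per token
for token, a in zip(tok.convert_ids_to_tokens(q.input_ids[0]), attr.tolist()):
    print(f"{token:>12s} {a:+.4f}")
```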
[IR-7] ULTRA:Urdu Language Transformer-based Recommendation Architecture
【速读】:该论文旨在解决乌尔都语(Urdu)作为低资源语言在个性化新闻推荐中缺乏高效语义内容推荐系统的问题,现有方法主要依赖词法匹配或语言无关技术,难以捕捉语义意图且在不同查询长度和信息需求下表现不佳,导致推荐相关性与适应性降低。解决方案的关键在于提出ULTRA(Urdu Language Transformer-based Recommendation Architecture),其核心创新是双嵌入架构结合查询长度感知的路由机制,能够基于阈值驱动决策动态区分短查询(侧重意图)与长查询(富含上下文),并将其引导至优化后的标题级或全文级语义管道,从而实现检索过程中的语义粒度自适应对齐;同时利用Transformer嵌入与优化池化策略,超越表面关键词匹配,支持上下文感知的相似性搜索,实验表明该架构在大规模乌尔都语新闻语料上显著提升推荐精度(超过90%),验证了针对低资源语言的查询自适应语义对齐的有效性。
链接: https://arxiv.org/abs/2602.11836
作者: Alishbah Bashir,Fatima Qaiser,Ijaz Hussain
机构: PIEAS(巴基斯坦工程与应用科学研究院)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Urdu, as a low-resource language, lacks effective semantic content recommendation systems, particularly in the domain of personalized news retrieval. Existing approaches largely rely on lexical matching or language-agnostic techniques, which struggle to capture semantic intent and perform poorly under varying query lengths and information needs. This limitation results in reduced relevance and adaptability in Urdu content recommendation. We propose ULTRA (Urdu Language Transformer-based Recommendation Architecture), an adaptive semantic recommendation framework designed to address these challenges. ULTRA introduces a dual-embedding architecture with a query-length-aware routing mechanism that dynamically distinguishes between short, intent-focused queries and longer, context-rich queries. Based on a threshold-driven decision process, user queries are routed to specialized semantic pipelines optimized for either title/headline-level or full-content/document-level representations, ensuring appropriate semantic granularity during retrieval. The proposed system leverages transformer-based embeddings and optimized pooling strategies to move beyond surface-level keyword matching and enable context-aware similarity search. Extensive experiments conducted on a large-scale Urdu news corpus demonstrate that the proposed architecture consistently improves recommendation relevance across diverse query types. Results show gains in precision above 90% compared to single-pipeline baselines, highlighting the effectiveness of query-adaptive semantic alignment for low-resource languages. The findings establish ULTRA as a robust and generalizable content recommendation architecture, offering practical design insights for semantic retrieval systems in low-resource language settings.
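阈值驱动的查询路由本身非常直接,如下示意(索引对象与阈值均为占位):

```python
class StubIndex:
    """Placeholder for a real embedding index (title-level or content-level)."""
    def __init__(self, name): self.name = name
    def search(self, query, top_k=10): return f"{self.name} results for {query!r}"

def route(query: str, title_index, content_index, threshold: int = 6):
    """Short, intent-focused queries -> headline embeddings;
    longer, context-rich queries -> full-content embeddings."""
    index = title_index if len(query.split()) < threshold else content_index
    return index.search(query)

print(route("تازہ خبریں", StubIndex("title"), StubIndex("content")))
```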
[IR-8] Reliable and Private Anonymous Routing for Satellite Constellations
【速读】:该论文旨在解决低轨卫星星座(Low Earth Orbit satellite constellations, LEO)等共享动态网络基础设施中元数据隐私泄露问题,尤其针对在混合信任环境中运行的国家行为体所面临的威胁。其核心挑战在于传统混洗网络(mix-network)在高链路波动和间歇性连接下难以维持可靠性和匿名性。解决方案的关键在于提出一种增强型匿名架构,通过三个核心技术实现:(1)基于(n, k)纠删码的多路径传输协议,有效应对拓扑不稳定导致的消息丢失;(2)在路由发现阶段集成计算高效的私有信息检索(Private Information Retrieval, PIR)机制,防止用户-提供者目录中的元数据泄露;(3)引入基于中心性的自适应延迟策略,缓解LEO网络固有的拓扑偏倚,优化匿名性与延迟之间的权衡。实验证明该架构可实现近零消息丢失,并在实际部署可行性上具备明确优势。
链接: https://arxiv.org/abs/2602.11764
作者: Nilesh Vyas,Fabien Geyer,Svetoslav Duhovnikov
机构: Airbus Central R&T (空中客车中央研发)
类目: Cryptography and Security (cs.CR); Emerging Technologies (cs.ET); Information Retrieval (cs.IR); Networking and Internet Architecture (cs.NI)
备注: 14 Pages, 16 Figures
Abstract:Shared, dynamic network infrastructures, such as dual-use LEO satellite constellations, pose critical threats to metadata privacy, particularly for state actors operating in mixed-trust environments. This work proposes an enhanced anonymity architecture, evolving the Loopix mix-network, to provide robust security and reliability in these volatile topologies. We introduce three primary contributions: (1) A multi-path transport protocol utilizing (n, k) erasure codes, which is demonstrated to counteract the high link volatility and intermittent connectivity that renders standard mix-networks unreliable. (2) The integration of a computationally efficient Private Information Retrieval (PIR) protocol during route discovery. (3) The introduction of adaptive, centrality-based delay strategies that efficiently mitigate the inherent topological bias of LEO networks, providing a superior anonymity-to-latency trade-off. This mechanism provably prevents metadata leakage at the user-provider directory, mitigating profiling and correlation attacks. We validate this architecture via high-fidelity, packet-level simulations of a LEO constellation. Empirical results show our multi-path transport achieves near-zero message loss, establishing a quantifiable trade-off between reliability and bandwidth overhead. Furthermore, microbenchmarks of the PIR protocol quantify its computational and latency overheads, confirming its feasibility for practical deployment. This work provides a validated blueprint for deployable high-anonymity communication systems, demonstrating the viability of securely multiplexing sensitive operations within large-scale commercial network infrastructures.
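(n, k) 纠删码多路径传输的恢复原理可用一个玩具版本说明:k 个数据分片加 1 个 XOR 奇偶分片(即 n = k + 1),任一条路径丢失都可重建;实际系统应使用 Reed-Solomon 等真正的 (n, k) 码以容忍多路丢失。

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(message: bytes, k: int) -> list:
    size = -(-len(message) // k)                       # shard size, rounded up
    shards = [message[i*size:(i+1)*size].ljust(size, b"\0") for i in range(k)]
    return shards + [reduce(xor, shards)]              # n = k + 1 shards

def decode(shards: list, k: int, msg_len: int) -> bytes:
    lost = [i for i, s in enumerate(shards) if s is None]
    assert len(lost) <= 1, "this toy code tolerates one lost path"
    if lost and lost[0] < k:                           # rebuild a lost data shard
        shards[lost[0]] = reduce(
            xor, (s for i, s in enumerate(shards) if i != lost[0]))
    return b"".join(shards[:k])[:msg_len]

msg = b"route this message over several satellite paths"
shards = encode(msg, k=4)                  # each shard travels on its own path
shards[2] = None                           # one inter-satellite link drops
print(decode(shards, k=4, msg_len=len(msg)))   # message recovered intact
```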
[IR-9] Uncertainty-aware Generative Recommendation
【速读】:该论文旨在解决生成式推荐(Generative Recommendation)中因偏好优化方法依赖二元结果正确性而引发的“不确定性盲视”(uncertainty blindness)问题,即模型对自身生成置信度的忽视、样本学习难度差异未被考虑以及缺乏显式的置信度表达,导致训练不稳定和决策风险不可量化。解决方案的关键在于提出一种不确定性感知的生成式推荐框架(Uncertainty-aware Generative Recommendation, UGR),其核心机制包括:基于不确定性的奖励加权以惩罚高置信度错误、难度感知的优化动态防止过早收敛,以及显式的置信度对齐以赋予模型置信度表达能力,从而实现更稳定且可解释的推荐性能提升。
链接: https://arxiv.org/abs/2602.11719
作者: Chenxiao Fan,Chongming Gao,Yaxin Gong,Haoyan Liu,Fuli Feng,Xiangnan He
机构: University of Science and Technology of China (中国科学技术大学)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Generative Recommendation has emerged as a transformative paradigm, reformulating recommendation as an end-to-end autoregressive sequence generation task. Despite its promise, existing preference optimization methods typically rely on binary outcome correctness, suffering from a systemic limitation we term uncertainty blindness. This issue manifests in the neglect of the model’s intrinsic generation confidence, the variation in sample learning difficulty, and the lack of explicit confidence expression, directly leading to unstable training dynamics and unquantifiable decision risks. In this paper, we propose Uncertainty-aware Generative Recommendation (UGR), a unified framework that leverages uncertainty as a critical signal for adaptive optimization. UGR synergizes three mechanisms: (1) an uncertainty-weighted reward to penalize confident errors; (2) difficulty-aware optimization dynamics to prevent premature convergence; and (3) explicit confidence alignment to empower the model with confidence expression capabilities. Extensive experiments demonstrate that UGR not only yields superior recommendation performance but also fundamentally stabilizes training, preventing the performance degradation often observed in standard methods. Furthermore, the learned confidence enables reliable downstream risk-aware applications.
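UGR 的第一个机制(按生成置信度缩放奖励,重罚“自信的错误”)可示意如下;具体加权形式为假设,论文的实际公式可能不同。

```python
import math

def uncertainty_weighted_reward(correct: bool, seq_logprob: float) -> float:
    """seq_logprob: summed token log-probability of the generated item
    identifier; exp() of it is the sequence-level confidence.
    The exact weighting below is illustrative, not the paper's formula."""
    confidence = math.exp(seq_logprob)      # in (0, 1]
    if correct:
        return 1.0
    return -(1.0 + confidence)              # confident mistakes cost the most

for lp in (-0.1, -2.0, -8.0):
    r = uncertainty_weighted_reward(False, lp)
    print(f"wrong answer, seq logprob {lp:+.1f} -> reward {r:.3f}")
```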
[IR-10] EpicCBR: Item-Relation-Enhanced Dual-Scenario Contrastive Learning for Cold-Start Bundle Recommendation WSDM2026
【速读】:该论文旨在解决冷启动场景下捆绑推荐(bundle recommendation)中存在的表示学习挑战,即现有方法主要依赖于已观测到的用户-捆绑交互数据,难以有效探索新出现的捆绑组合,且通常将每个捆绑视为独立实例,忽视了用户-物品(UI)和捆绑-物品(BI)关系中蕴含的流行物品特征。其解决方案的关键在于提出一种多视角对比学习框架EpicCBR:首先通过精准挖掘物品关系构建用户画像,识别可能对捆绑感兴趣的用户;其次设计基于流行度的方法,利用历史捆绑信息与用户偏好刻画新捆绑特征;最后引入一个多视图图对比学习框架,整合冷启动与热启动场景,增强模型在不同情境下的泛化能力。
链接: https://arxiv.org/abs/2602.11680
作者: Yihang Li,Zhuo Liu,Wei Wei
机构: Huazhong University of Science and Technology (华中科技大学); Cognitive Computing and Intelligent Information Processing (CCIIP) Laboratory, School of Computer Science and Technology, Huazhong University of Science and Technology (认知计算与智能信息处理实验室,华中科技大学计算机科学与技术学院)
类目: Information Retrieval (cs.IR)
备注: 10 pages, 3 figures, 5 tables, accepted by WSDM 2026
Abstract:Bundle recommendation aims to recommend a set of items to users for overall consumption. Existing bundle recommendation models primarily depend on observed user-bundle interactions, limiting exploration of the newly-emerged bundles that are constantly created. This poses a critical representation challenge for current bundle methods, as they usually treat each bundle as an independent instance while neglecting to fully leverage the user-item (UI) and bundle-item (BI) relations over popular items. To alleviate this, in this paper we propose a multi-view contrastive learning framework for cold-start bundle recommendation, named EpicCBR. Specifically, it precisely mines and utilizes item relations to construct user profiles, identifying users likely to engage with bundles. Additionally, a popularity-based method that characterizes the features of new bundles through historical bundle information and user preferences is proposed. To build a framework that demonstrates robustness in both cold-start and warm-start scenarios, a multi-view graph contrastive learning framework capable of integrating these diverse scenarios is introduced to ensure the model’s generalization capability. Extensive experiments conducted on three popular benchmarks show that EpicCBR outperforms the state of the art by a large margin (up to 387%), sufficiently demonstrating the superiority of the proposed method in the cold-start scenario. The code and dataset can be found in the GitHub repository: this https URL.
[IR-11] IntTravel: A Real-World Dataset and Generative Framework for Integrated Multi-Task Travel Recommendation
【速读】:该论文旨在解决当前移动推荐系统中多任务协同建模不足的问题,特别是现有研究仅关注“下一个兴趣点(Next Point of Interest, POI)推荐”这一单一任务,忽略了旅程中的关键要素如出发时间、出行方式及途中情境需求,且受限于数据规模导致模型评估不准确。其解决方案的核心在于提出IntTravel——首个大规模公开的集成旅行推荐数据集(包含41亿次交互、1.63亿用户和730万POI),并构建一个端到端的解码器-only生成式框架,通过信息保留、选择与因子分解机制,在任务协作与专业化区分之间实现平衡,从而显著提升多任务推荐性能,并在真实场景中验证了其有效性(如高德地图CTR提升1.09%)。
链接: https://arxiv.org/abs/2602.11664
作者: Huimin Yan,Longfei Xu,Junjie Sun,Zheng Liu,Wei Luo,Kaikui Liu,Xiangxiang Chu
机构: AMAP(阿里巴巴集团)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Next Point of Interest (POI) recommendation is essential for modern mobility and location-based services. To provide a smooth user experience, models must understand several components of a journey holistically: “when to depart”, “how to travel”, “where to go”, and “what needs arise via the route”. However, current research is limited by fragmented datasets that focus merely on next POI recommendation (“where to go”), neglecting the departure time, travel mode, and situational requirements along the journey. Furthermore, the limited scale of these datasets impedes accurate evaluation of performance. To bridge this gap, we introduce IntTravel, the first large-scale public dataset for integrated travel recommendation, including 4.1 billion interactions from 163 million users with 7.3 million POIs. Built upon this dataset, we present an end-to-end, decoder-only generative framework for multi-task recommendation. It incorporates information preservation, selection, and factorization to balance task collaboration with specialized differentiation, yielding substantial performance gains. The framework's generalizability is highlighted by its state-of-the-art performance across both the IntTravel dataset and an additional non-travel benchmark. IntTravel has been successfully deployed on Amap, serving hundreds of millions of users, leading to a 1.09% increase in CTR. IntTravel is available at this https URL.
[IR-12] Evolutionary Router Feature Generation for Zero-Shot Graph Anomaly Detection with Mixture-of-Experts
【速读】:该论文旨在解决零样本图异常检测(Zero-shot Graph Anomaly Detection, Zero-shot GAD)中因图结构、特征及异常模式异质性导致现有单一图神经网络(Graph Neural Network, GNN)模型表达能力不足的问题。其核心挑战在于:一方面,节点在不同图中语义差异显著,直接基于特征进行专家路由易产生偏差;另一方面,异常图常存在显著分布偏移,现有路由机制难以学习跨图的域不变路由规律。解决方案的关键在于提出一种具有进化式路由特征生成(Evolutionary Router Feature Generation, EvoFG)的混合专家(Mixture-of-Experts, MoE)框架:首先通过大语言模型(Large Language Model, LLM)驱动的生成器与Shapley值引导的评估机制迭代构建并筛选有信息量的结构特征以优化路由;其次设计带不变性学习目标的记忆增强型路由器,从而在分布偏移下捕获可迁移的路由模式,显著提升零样本场景下的异常检测性能。
链接: https://arxiv.org/abs/2602.11622
作者: Haiyang Jiang,Tong Chen,Xinyi Gao,Guansong Pang,Quoc Viet Hung Nguyen,Hongzhi Yin
机构: The University of Queensland(昆士兰大学); Singapore Management University(新加坡管理大学)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Zero-shot graph anomaly detection (GAD) has attracted increasing attention in recent years, yet the heterogeneity of graph structures, features, and anomaly patterns across graphs makes existing single-GNN methods insufficiently expressive to model diverse anomaly mechanisms. In this regard, Mixture-of-Experts (MoE) architectures provide a promising paradigm by integrating diverse GNN experts with complementary inductive biases, yet their effectiveness in zero-shot GAD is severely constrained by distribution shifts, leading to two key routing challenges. First, nodes often carry vastly different semantics across graphs, and straightforwardly performing routing based on their features is prone to generating biased or suboptimal expert assignments. Second, as anomalous graphs often exhibit pronounced distributional discrepancies, existing router designs fall short in capturing domain-invariant routing principles that generalize beyond the training graphs. To address these challenges, we propose a novel MoE framework with evolutionary router feature generation (EvoFG) for zero-shot GAD. To enhance MoE routing, we propose an evolutionary feature generation scheme that iteratively constructs and selects informative structural features via an LLM-based generator and Shapley-guided evaluation. Moreover, a memory-enhanced router with an invariant learning objective is designed to capture transferable routing patterns under distribution shifts. Extensive experiments on six benchmarks show that EvoFG consistently outperforms state-of-the-art baselines, achieving strong and stable zero-shot GAD performance.
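A minimal sketch of the routing idea follows, assuming handcrafted structural features (degree, clustering coefficient, etc.) feed the router and plain MLPs stand in for the GNN experts; the evolutionary feature generation and invariant-learning objective are omitted:

```python
import torch
import torch.nn as nn

class SoftRouterMoE(nn.Module):
    """Minimal mixture-of-experts with a feature-based soft router.

    Routing uses structural descriptors rather than raw node features,
    mirroring the idea of routing on generated structural features.
    Experts are plain MLPs standing in for GNN experts with different
    inductive biases; each emits a scalar anomaly score.
    """
    def __init__(self, feat_dim, struct_dim, hidden, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, 1)) for _ in range(n_experts))
        self.router = nn.Linear(struct_dim, n_experts)

    def forward(self, x, struct_feats):
        weights = torch.softmax(self.router(struct_feats), dim=-1)  # [N, E]
        scores = torch.cat([e(x) for e in self.experts], dim=-1)    # [N, E]
        return (weights * scores).sum(dim=-1)  # anomaly score per node

model = SoftRouterMoE(feat_dim=16, struct_dim=4, hidden=32)
out = model(torch.randn(10, 16), torch.randn(10, 4))
```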
[IR-13] Recurrent Preference Memory for Efficient Long-Sequence Generative Recommendation
【速读】:该论文旨在解决生成式推荐(Generative Recommendation, GenRec)模型在处理长期用户交互序列时面临的两大挑战:一是由于全注意力机制导致的计算成本过高,二是由随机交互带来的噪声累积问题。解决方案的关键在于提出 Rec2PM 框架,其核心创新是将长序列用户行为压缩为紧凑的 Preference Memory(偏好记忆)token,并采用一种新颖的自参照教师强制(self-referential teacher-forcing)策略:利用全局历史视图生成参考记忆作为监督信号,从而实现并行化训练,同时在推理阶段保持迭代更新能力;此外,通过以 token embeddings 形式表示记忆而非传统 KV 缓存,显著提升了存储效率,实验证明该方法在降低延迟和内存占用的同时实现了更优的推荐准确性。
链接: https://arxiv.org/abs/2602.11605
作者: Yixiao Chen,Yuan Wang,Yue Liu,Qiyao Wang,Ke Cheng,Xin Xu,Juntong Yan,Shuojin Yang,Menghao Guo,Jun Zhang,Huan Yu,Jie Jiang
机构: Tencent Inc.(腾讯公司); Tsinghua University(清华大学)
类目: Information Retrieval (cs.IR)
备注: 12 pages, 6 figures
Abstract:Generative recommendation (GenRec) models typically model user behavior via full attention, but scaling to lifelong sequences is hindered by prohibitive computational costs and noise accumulation from stochastic interactions. To address these challenges, we introduce Rec2PM, a framework that compresses long user interaction histories into compact Preference Memory tokens. Unlike traditional recurrent methods that suffer from serial training, Rec2PM employs a novel self-referential teacher-forcing strategy: it leverages a global view of the history to generate reference memories, which serve as supervision targets for parallelized recurrent updates. This allows for fully parallel training while maintaining the capability for iterative updates during inference. Additionally, by representing memory as token embeddings rather than extensive KV caches, Rec2PM achieves extreme storage efficiency. Experiments on large-scale benchmarks show that Rec2PM significantly reduces inference latency and memory footprint while achieving superior accuracy compared to full-sequence models. Analysis reveals that the Preference Memory functions as a denoising Information Bottleneck, effectively filtering interaction noise to capture robust long-term interests.
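The chunk-wise memory compression can be pictured as a gated, GRU-style update over pooled interaction chunks; this is a toy sketch of recurrent preference memory, not Rec2PM's architecture or its self-referential teacher-forcing scheme:

```python
import torch
import torch.nn as nn

class PreferenceMemory(nn.Module):
    """Toy recurrent preference memory: compress an interaction stream
    into a fixed set of memory tokens via chunk-wise gated updates.

    Each chunk of item embeddings is pooled and fused into the memory,
    so inference can refresh the memory incrementally instead of
    re-attending over the full history or storing a large KV cache.
    """
    def __init__(self, dim, n_mem_tokens=8):
        super().__init__()
        self.memory0 = nn.Parameter(torch.randn(n_mem_tokens, dim))
        self.gate = nn.Linear(2 * dim, dim)
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, chunks):  # chunks: iterable of [chunk_len, dim]
        mem = self.memory0
        for chunk in chunks:
            summary = chunk.mean(dim=0, keepdim=True).expand_as(mem)
            z = torch.sigmoid(self.gate(torch.cat([mem, summary], dim=-1)))
            cand = torch.tanh(self.update(torch.cat([mem, summary], dim=-1)))
            mem = (1 - z) * mem + z * cand  # gated, GRU-style refresh
        return mem  # [n_mem_tokens, dim] compact preference memory

mem = PreferenceMemory(dim=32)(torch.randn(100, 32).split(25))
```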
[IR-14] Analytical Search
【速读】:该论文旨在解决现有信息检索范式(如基于相关性的文档排序或检索增强生成 RAG)在应对分析型信息需求(如趋势分析和因果影响评估)时的不足,这些问题通常出现在法律、金融、科学等领域,具有高度的问责性与多样的概念维度。现有方法要么侧重于信息查找而非端到端的问题求解,要么将复杂任务简化为朴素问答,缺乏对推理过程、证据使用及可验证性的控制。论文提出的解决方案是引入“分析搜索”(Analytical Search)这一新兴搜索范式,其关键在于将搜索重构为一种以证据为驱动、流程导向的分析工作流:显式建模分析意图,检索用于融合的证据,并通过结构化的多步推理生成可验证结论。该方案构建了一个统一系统框架,涵盖查询理解、召回导向检索、推理感知融合与自适应验证四个核心模块,从而实现从信息获取到决策支持的闭环。
链接: https://arxiv.org/abs/2602.11581
作者: Yiteng Tu,Shuo Miao,Weihang Su,Yiqun Liu,Qingyao Ai
机构: DCST, Tsinghua University (清华大学计算机科学与技术系); Quan Cheng Laboratory (泉城实验室)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Analytical information needs, such as trend analysis and causal impact assessment, are prevalent across various domains including law, finance, science, and much more. However, existing information retrieval paradigms, whether based on relevance-oriented document ranking or retrieval-augmented generation (RAG) with large language models (LLMs), often struggle to meet the end-to-end requirements of such tasks at the corpus scale. They either emphasize information finding rather than end-to-end problem solving, or simply treat everything as naive question answering, offering limited control over reasoning, evidence usage, and verifiability. As a result, they struggle to support analytical queries that have diverse utility concepts and high accountability requirements. In this paper, we propose analytical search as a distinct and emerging search paradigm designed to fulfill these analytical information needs. Analytical search reframes search as an evidence-governed, process-oriented analytical workflow that explicitly models analytical intent, retrieves evidence for fusion, and produces verifiable conclusions through structured, multi-step inference. We position analytical search in contrast to existing paradigms, and present a unified system framework that integrates query understanding, recall-oriented retrieval, reasoning-aware fusion, and adaptive verification. We also discuss potential research directions for the construction of analytical search engines. In this way, we highlight the conceptual significance and practical importance of analytical search and call on efforts toward the next generation of search engines that support analytical information needs.
[IR-15] LASER: An Efficient Target-Aware Segmented Attention Framework for End-to-End Long Sequence Modeling
【速读】:该论文旨在解决现代推荐系统中建模超长用户行为序列时面临的两大瓶颈问题:一是海量用户历史数据的高I/O延迟,二是标准注意力机制带来的二次计算复杂度。为突破这些“延迟墙”(Latency Wall),作者提出了一套全栈优化框架LASER,其关键创新在于两个互补方向:一是系统层面的效率提升,设计了SeqVault统一架构,采用DRAM-SSD混合索引策略,将检索延迟降低50%、CPU使用率减少75%,实现毫秒级访问完整实时与生命周期用户历史;二是算法层面的效率优化,提出分段目标注意力(Segmented Target Attention, STA)机制,利用sigmoid门控过滤噪声项,并引入轻量级全局堆叠目标注意力(GSTA)模块,在压缩序列的同时捕捉跨段依赖关系,显著降低长序列建模复杂度并保留关键信号。该方案在离线和在线大规模测试中均验证了有效性,成功应用于小红书(Xiaohongshu)平台,带来2.36%的广告收入转化率(ADVV)提升和2.08%的营收增长。
链接: https://arxiv.org/abs/2602.11562
作者: Tianhe Lin,Ziwei Xiong,Baoyuan Ou,Yingjie Qin,Lai Xu,Xiaocheng Zhong,Yao Hu,Zhiyong Wang,Tao Zhou,Yubin Xu,Di Wu
机构: Xiaohongshu Inc.(小红书公司)
类目: Information Retrieval (cs.IR)
备注: 9 pages
Abstract:Modeling ultra-long user behavior sequences is pivotal for capturing evolving and lifelong interests in modern recommendation systems. However, deploying such models in real-time industrial environments faces a strict “Latency Wall”, constrained by two distinct bottlenecks: the high I/O latency of retrieving massive user histories and the quadratic computational complexity of standard attention mechanisms. To break these bottlenecks, we present LASER, a full-stack optimization framework developed and deployed at Xiaohongshu (RedNote). Our approach tackles the challenges through two complementary innovations: (1) System efficiency: We introduce SeqVault, a unified schema-aware serving infrastructure for long user histories. By implementing a hybrid DRAM-SSD indexing strategy, SeqVault reduces retrieval latency by 50% and CPU usage by 75%, ensuring millisecond-level access to full real-time and life-cycle user histories. (2) Algorithmic efficiency: We propose a Segmented Target Attention (STA) mechanism to address the computational overhead. Motivated by the inherent sparsity of user interests, STA employs a sigmoid-based gating strategy that acts as a silence mechanism to filter out noisy items. Subsequently, a lightweight Global Stacked Target Attention (GSTA) module refines these compressed segments to capture cross-segment dependencies without incurring high computational costs. This design performs effective sequence compression, reducing the complexity of long-sequence modeling while preserving critical signals. Extensive offline evaluations demonstrate that LASER consistently outperforms state-of-the-art baselines. In large-scale online A/B testing serving over 100 million daily active users, LASER achieved a 2.36% lift in ADVV and a 2.08% lift in revenue, demonstrating its scalability and significant commercial impact.
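The sigmoid "silence" gating over segments can be sketched as follows; the single-head formulation and dimensions are simplifications of the paper's STA/GSTA design, written to show the mechanism rather than reproduce it:

```python
import torch
import torch.nn as nn

class SegmentedTargetAttention(nn.Module):
    """Toy segmented target attention with a sigmoid silence gate.

    The long behavior sequence is split into segments; each segment is
    compressed by target attention against the candidate item, and a
    sigmoid gate can silence segments that look irrelevant (gate ~ 0)
    instead of forcing a softmax over every segment.
    """
    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.gate = nn.Linear(dim, 1)

    def forward(self, target, segments):
        # target: [dim]; segments: [n_seg, seg_len, dim]
        q = self.q(target)                                     # [dim]
        att = torch.einsum('d,nld->nl', q, self.k(segments))
        att = torch.softmax(att / q.size(0) ** 0.5, dim=-1)    # within segment
        seg_repr = torch.einsum('nl,nld->nd', att, self.v(segments))
        g = torch.sigmoid(self.gate(seg_repr))                 # [n_seg, 1]
        return g * seg_repr  # gated segment summaries for a global stage

out = SegmentedTargetAttention(16)(torch.randn(16), torch.randn(8, 40, 16))
```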
[IR-16] KuaiSearch: A Large-Scale E-Commerce Search Dataset for Recall Ranking and Relevance
【速读】:该论文旨在解决当前电商搜索(e-commerce search)研究中面临的三大核心挑战:一是用户查询高度模糊、二是商品文本存在噪声且语义结构弱、三是用户偏好多样导致意图难以精准捕捉。现有基于大语言模型(Large Language Models, LLMs)的解决方案受限于现有数据集的不足,如查询构造方式粗略、冷启动用户与长尾商品被过滤、文本匿名化处理以及仅覆盖搜索流水线单一阶段等问题。为此,作者构建并发布了目前规模最大的真实电商搜索数据集KuaiSearch,其关键创新在于系统性地涵盖搜索流水线的三个关键阶段——召回(recall)、排序(ranking)和相关性判断(relevance judgment),同时保留真实用户查询与自然语言商品描述,并完整包含冷启动用户与长尾商品,从而为LLM驱动的电商搜索研究提供了高质量、多维度、贴近实际场景的数据基础。
链接: https://arxiv.org/abs/2602.11518
作者: Yupeng Li,Ben Chen,Mingyue Cheng,Zhiding Liu,Xuxin Zhang,Chenyi Lei,Wenwu Ou
机构: University of Science and Technology of China (中国科学技术大学); Kuaishou Technology (快手科技)
类目: Information Retrieval (cs.IR)
备注:
Abstract:E-commerce search serves as a central interface, connecting user demands with massive product inventories and plays a vital role in our daily lives. However, in real-world applications, it faces challenges, including highly ambiguous queries, noisy product texts with weak semantic order, and diverse user preferences, all of which make it difficult to accurately capture user intent and fine-grained product semantics. In recent years, significant advances in large language models (LLMs) for semantic representation and contextual reasoning have created new opportunities to address these challenges. Nevertheless, existing e-commerce search datasets still suffer from notable limitations: queries are often heuristically constructed, cold-start users and long-tail products are filtered out, query and product texts are anonymized, and most datasets cover only a single stage of the search pipeline. Collectively, these issues constrain research on LLM-based e-commerce search. To address these challenges, we construct and release KuaiSearch. To the best of our knowledge, it is the largest e-commerce search dataset currently available. KuaiSearch is built upon real user search interactions from the Kuaishou platform, preserving authentic user queries and natural-language product texts, covering cold-start users and long-tail products, and systematically spanning three key stages of the search pipeline: recall, ranking, and relevance judgment. We conduct a comprehensive analysis of KuaiSearch from multiple perspectives, including products, users, and queries, and establish benchmark experiments across several representative search tasks. Experimental results demonstrate that KuaiSearch provides a valuable foundation for research on real-world e-commerce search.
[IR-17] From Noise to Order: Learning to Rank via Denoising Diffusion
【速读】:该论文旨在解决传统学习排序(Learning-to-Rank, LTR)方法受限于判别式机器学习模型的问题,这些模型通常仅建模查询-文档对特征表示下文档相关的概率,难以在过参数化场景中保持鲁棒性。解决方案的关键在于提出一种基于去噪扩散过程的深度生成式方法 DiffusionRank,该方法通过建模特征向量与相关性标签的完整联合分布,替代传统的判别式点对或成对排序目标。其核心思想是:在生成式框架下,能够解释全数据分布的候选解比判别式模型更鲁棒,从而提升排序性能。实验表明,DiffusionRank 在多个基准上显著优于传统判别式模型,为利用扩散等先进生成建模技术改进信息检索中的学习排序提供了新方向。
链接: https://arxiv.org/abs/2602.11453
作者: Sajad Ebrahimi,Bhaskar Mitra,Negar Arabzadeh,Ye Yuan,Haolun Wu,Fattane Zarrinkalam,Ebrahim Bagheri
机构: University of Guelph (圭尔夫大学); McGill University (麦吉尔大学); University of California, Berkeley (加州大学伯克利分校); University of Toronto (多伦多大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:In information retrieval (IR), learning-to-rank (LTR) methods have traditionally limited themselves to discriminative machine learning approaches that model the probability of the document being relevant to the query given some feature representation of the query-document pair. In this work, we propose an alternative denoising diffusion-based deep generative approach to LTR that instead models the full joint distribution over feature vectors and relevance labels. While in the discriminative setting, an over-parameterized ranking model may find different ways to fit the training data, we hypothesize that candidate solutions that can explain the full data distribution under the generative setting produce more robust ranking models. With this motivation, we propose DiffusionRank that extends TabDiff, an existing denoising diffusion-based generative model for tabular datasets, to create generative equivalents of classical discriminative pointwise and pairwise LTR objectives. Our empirical results demonstrate significant improvements from DiffusionRank models over their discriminative counterparts. Our work points to a rich space for future research exploration on how we can leverage ongoing advancements in deep generative modeling approaches, such as diffusion, for learning-to-rank in IR.
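A generic DDPM-style training step over joint (feature, label) vectors conveys the generative-LTR idea; this sketch follows the standard denoising-diffusion recipe and is not the TabDiff/DiffusionRank parameterization:

```python
import torch
import torch.nn as nn

# Toy denoising-diffusion objective over joint (features, relevance) vectors:
# corrupt z0 = [x, y] with Gaussian noise at a random timestep and train a
# small network to predict the injected noise.
T = 100
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1 - betas, dim=0)

denoiser = nn.Sequential(nn.Linear(11 + 1, 64), nn.ReLU(), nn.Linear(64, 11))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

x = torch.randn(256, 10)                   # query-document features
y = torch.randint(0, 2, (256, 1)).float()  # binary relevance labels
z0 = torch.cat([x, y], dim=-1)             # model the *joint* distribution

t = torch.randint(0, T, (z0.size(0), 1))
eps = torch.randn_like(z0)
zt = alpha_bar[t].sqrt() * z0 + (1 - alpha_bar[t]).sqrt() * eps

opt.zero_grad()
# The denoiser sees the noised sample plus a normalized timestep.
loss = ((denoiser(torch.cat([zt, t / T], dim=-1)) - eps) ** 2).mean()
loss.backward()
opt.step()
```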
[IR-18] Filtered Approximate Nearest Neighbor Search in Vector Databases: System Design and Performance Analysis
【速读】:该论文旨在解决在向量数据库中,过滤策略(filtering strategies)如何影响带元数据约束的语义检索性能的问题。当前虽然已有针对Filtered Approximate Nearest Neighbor Search (FANNS) 的算法优化,但缺乏对通用过滤策略在主流向量数据库(如FAISS、Milvus和pgvector)中实际表现的系统性理解。其关键解决方案是:首先构建了一个名为MoReVec的新关系型数据集,包含768维文本嵌入与丰富的元数据属性;其次提出一种新的全局-局部选择性相关度量指标(Global-Local Selectivity, GLS),用于量化过滤条件与查询向量之间的关联强度;最后通过扩展ANN-Benchmarks支持带过滤的向量搜索,并基于实证分析揭示了不同数据库引擎的执行机制(如混合近似/精确执行、成本模型优化等)对最终召回率稳定性与效率的关键影响,从而为混合搜索场景下索引类型选择与查询优化器配置提供可落地的实践指南。
链接: https://arxiv.org/abs/2602.11443
作者: Abylay Amanbayev,Brian Tsan,Tri Dang,Florin Rusu
机构: University of California Merced (加州大学默塞德分校)
类目: Databases (cs.DB); Information Retrieval (cs.IR)
备注: The artifacts are available at: this https URL
Abstract:Retrieval-Augmented Generation (RAG) applications increasingly rely on Filtered Approximate Nearest Neighbor Search (FANNS) to combine semantic retrieval with metadata constraints. While algorithmic innovations for FANNS have been proposed, there remains a lack of understanding regarding how generic filtering strategies perform within Vector Databases. In this work, we systematize the taxonomy of filtering strategies and evaluate their integration into FAISS, Milvus, and pgvector. To provide a robust benchmarking framework, we introduce a new relational dataset, MoReVec, consisting of two tables, featuring 768-dimensional text embeddings and a rich schema of metadata attributes. We further propose the Global-Local Selectivity (GLS) correlation metric to quantify the relationship between filters and query vectors. Our experiments reveal that algorithmic adaptations within the engine often override raw index performance. Specifically, we find that: (1) Milvus achieves superior recall stability through hybrid approximate/exact execution; (2) pgvector's cost-based query optimizer frequently selects suboptimal execution plans, favoring approximate index scans even when exact sequential scans would yield perfect recall at comparable latency; and (3) partition-based indexes (IVFFlat) outperform graph-based indexes (HNSW) for low-selectivity queries. To facilitate this analysis, we extend the widely-used ANN-Benchmarks to support filtered vector search and make it available online. Finally, we synthesize our findings into a set of practical guidelines for selecting index types and configuring query optimizers for hybrid search workloads.
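To make the selectivity notions concrete, the sketch below contrasts a filter's global selectivity over the corpus with its selectivity among a query's exact nearest neighbors; this is an illustrative proxy, not the paper's exact GLS definition:

```python
import numpy as np

def global_local_selectivity(corpus, passes_filter, query, k=100):
    """Contrast a filter's global selectivity with its local selectivity
    among the query's k exact nearest neighbors (cosine similarity).

    A filter that keeps 50% of the corpus but only 5% of a query's
    neighbors behaves very differently from an uncorrelated filter,
    which is the kind of relationship a GLS-style metric captures.
    """
    mask = np.array([passes_filter(i) for i in range(len(corpus))])
    global_sel = mask.mean()

    sims = corpus @ query / (
        np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))
    topk = np.argsort(-sims)[:k]       # exact k nearest neighbors
    local_sel = mask[topk].mean()      # how many of them pass the filter
    return global_sel, local_sel

corpus = np.random.randn(10_000, 64).astype(np.float32)
g, l = global_local_selectivity(corpus, lambda i: i % 4 == 0,
                                np.random.randn(64).astype(np.float32))
```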
[IR-19] MTFM: A Scalable and Alignment-free Foundation Model for Industrial Recommendation in Meituan
【速读】:该论文旨在解决工业推荐系统中跨域推荐(Cross-Domain Recommendation, CDR)与多场景推荐(Multi-Scenario Recommendation, MSR)方法普遍存在的资源消耗高、输入对齐要求严格等问题,从而限制了模型的可扩展性。其解决方案的关键在于提出一种基于Transformer的框架MTFM(Meituan Foundation Model for Recommendation),通过将跨域数据转换为异构标记(heterogeneous tokens)实现无对齐的知识捕获,并引入用户级样本聚合策略提升训练吞吐量;同时结合分组查询注意力(Grouped-Query Attention)与定制化的混合目标注意力(Hybrid Target Attention)以降低内存占用和计算复杂度,并辅以系统级优化如内核融合和消除CPU-GPU阻塞,显著提升了训练与推理效率。
链接: https://arxiv.org/abs/2602.11235
作者: Xin Song,Zhilin Guan,Ruidong Han,Binghao Tang,Tianwen Chen,Bing Li,Zihao Li,Han Zhang,Fei Jiang,Chaolin Xie,Chi Ma,Chunyang Jiang,Chunzhen Jing,Dengxuan Li,Fengyi Li,Lei Yu,Mengyao Sun,Pu Wang,Qing Wang,Rui Fan,Shangyu Chen,Shifeng Du,Siyuan Bai,Wei Lin,Wentao Zhu,Zhou Han,Zhuo Chen,Zikang Xu
机构: Meituan(美团)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Industrial recommendation systems typically involve multiple scenarios, yet existing cross-domain (CDR) and multi-scenario (MSR) methods often require prohibitive resources and strict input alignment, limiting their extensibility. We propose MTFM (Meituan Foundation Model for Recommendation), a transformer-based framework that addresses these challenges. Instead of pre-aligning inputs, MTFM transforms cross-domain data into heterogeneous tokens, capturing multi-scenario knowledge in an alignment-free manner. To enhance efficiency, we first introduce a multi-scenario user-level sample aggregation that significantly enhances training throughput by reducing the total number of instances. We further integrate Grouped-Query Attention and a customized Hybrid Target Attention to minimize memory usage and computational complexity. Furthermore, we implement various system-level optimizations, such as kernel fusion and the elimination of CPU-GPU blocking, to further enhance both training and inference throughput. Offline and online experiments validate the effectiveness of MTFM, demonstrating that significant performance gains are achieved by scaling both model capacity and multi-scenario training data.
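Grouped-Query Attention itself is a standard technique and easy to sketch: many query heads share a small set of key/value heads, shrinking the KV projections and cache. A minimal single-layer version follows (shapes are illustrative, not MTFM's configuration):

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, n_heads=8, n_kv_heads=2):
    """Minimal grouped-query attention over a single sequence.

    x: [seq, dim]; wq: [dim, dim]; wk/wv: [dim, dim * n_kv_heads // n_heads].
    Each group of n_heads // n_kv_heads query heads reuses one KV head.
    """
    seq, dim = x.shape
    hd = dim // n_heads
    q = (x @ wq).view(seq, n_heads, hd)
    k = (x @ wk).view(seq, n_kv_heads, hd)
    v = (x @ wv).view(seq, n_kv_heads, hd)
    group = n_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)  # share each KV head across a group
    v = v.repeat_interleave(group, dim=1)
    att = torch.einsum('qhd,khd->hqk', q, k) / hd ** 0.5
    out = torch.einsum('hqk,khd->qhd', F.softmax(att, dim=-1), v)
    return out.reshape(seq, dim)

dim, nh, nkv = 64, 8, 2
x = torch.randn(10, dim)
out = grouped_query_attention(x, torch.randn(dim, dim),
                              torch.randn(dim, dim * nkv // nh),
                              torch.randn(dim, dim * nkv // nh), nh, nkv)
```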
[IR-20] BIRD: A Museum Open Dataset Combining Behavior Patterns and Identity Types to Better Model Visitors Experience
【速读】:该论文旨在解决文化遗产领域中因数据匮乏而导致的AI模型训练与验证困难问题,尤其在博物馆场景下,现有数据往往仅针对特定模型需求而采集,难以实现对参观者体验的全面建模。解决方案的关键在于通过眼动追踪设备(eye-tracking glasses)对51名参与者在博物馆内自由探索3层空间(平均时长57分钟、涵盖400余件艺术品)的行为进行系统记录,构建了一个包含情境数据(人口统计学特征、偏好、参观习惯、动机等)、行为数据(时空轨迹、注视点)及反馈数据(满意度、疲劳感、喜爱作品、原始评论等)的开放数据集。该数据集可支持更精准的个性化推荐路径优化,依据用户兴趣水平动态调整景点数量、停留时间和信息密度,从而提升参观体验质量。
链接: https://arxiv.org/abs/2602.11160
作者: Alexanne Worm(LORIA),Florian Marchal(LORIA),Sylvain Castagnos(LORIA)
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:Lack of data is a recurring problem in Artificial Intelligence, as it is essential for training and validating models. This is particularly true in the field of cultural heritage, where the number of open datasets is relatively limited and where the data collected does not always allow for holistic modeling of visitors' experience due to the fact that data are ad hoc (i.e. restricted to the sole characteristics required for the evaluation of a specific model). To overcome this lack, we conducted a study between February and March 2019 aimed at obtaining comprehensive and detailed information about visitors, their visit experience and their feedback. We equipped 51 participants with eye-tracking glasses, leaving them free to explore the 3 floors of the museum for an average of 57 minutes, and to discover an exhibition of more than 400 artworks. On this basis, we built an open dataset combining contextual data (demographic data, preferences, visiting habits, motivations, social context...), behavioral data (spatiotemporal trajectories, gaze data) and feedback (satisfaction, fatigue, liked artworks, verbatim...). Our analysis made it possible to re-enact visitor identities combining the majority of characteristics found in the literature and to reproduce the Véron and Levasseur profiles. This dataset will ultimately make it possible to improve the quality of recommended paths in museums by personalizing the number of points of interest (POIs), the time spent at these different POIs, and the amount of information to be provided to each visitor based on their level of interest.
[IR-21] HybridRAG: A Practical LLM-based ChatBot Framework based on Pre-Generated QA over Raw Unstructured Documents
【速读】:该论文旨在解决现有检索增强生成(Retrieval-Augmented Generation, RAG)方法在真实聊天机器人场景中应用受限的问题,具体表现为:传统RAG假设输入知识源为结构化文本(如维基百科或精选数据集),且在查询时才进行检索与生成,导致响应延迟高、难以处理大量非结构化文档。其解决方案的关键在于提出HybridRAG框架,通过两个核心机制实现优化:一是利用光学字符识别(Optical Character Recognition, OCR)和版面分析技术,将包含复杂布局(文字、表格、图表)的原始PDF文档转化为分层文本块;二是预生成一个基于大语言模型(Large Language Model, LLM)的问答(Question-Answer, QA)知识库,并在查询时优先匹配该QA库以提供即时答案,仅当无合适匹配时才触发实时生成,从而显著提升响应准确性和效率。
链接: https://arxiv.org/abs/2602.11156
作者: Sungmoon Kim,Hyuna Jeon,Dahye Kim,Mingyu Kim,Dong-Kyu Chae,Jiwoong Kim
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:Retrieval-Augmented Generation (RAG) has emerged as a powerful approach for grounding Large Language Model (LLM)-based chatbot responses on external knowledge. However, existing RAG studies typically assume well-structured textual sources (e.g. Wikipedia or curated datasets) and perform retrieval and generation at query time, which can limit their applicability in real-world chatbot scenarios. In this paper, we present HybridRAG, a novel and practical RAG framework towards more accurate and faster chatbot responses. First, HybridRAG ingests raw, unstructured PDF documents containing complex layouts (text, tables, figures) via Optical Character Recognition (OCR) and layout analysis, and converts them into hierarchical text chunks. Then, it pre-generates a plausible question-answer (QA) knowledge base from the organized chunks using an LLM. At query time, user questions are matched against this QA bank to retrieve immediate answers when possible, and only if no suitable QA match is found does our framework fall back to an on-the-fly response generation. Experiments on OHRBench demonstrate that our HybridRAG provides higher answer quality and lower latency compared to a standard RAG baseline. We believe that HybridRAG could be a practical solution for real-world chatbot applications that must handle large volumes of unstructured documents and many users under limited computational resources.
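The query-time routing logic reduces to "match against the pre-generated QA bank, else fall back to on-the-fly RAG". A hedged sketch, with `embed` and `fallback_rag` as hypothetical placeholders for whatever encoder and RAG pipeline are actually used:

```python
import numpy as np

def answer(query, qa_bank, embed, threshold=0.85, fallback_rag=None):
    """Route a query to the pre-generated QA bank when a close match
    exists; otherwise fall back to on-the-fly retrieval + generation.

    qa_bank: list of (question, answer) pairs pre-generated by an LLM.
    embed: any sentence-embedding function returning a 1-D vector.
    threshold: cosine-similarity cutoff (illustrative value).
    """
    q_vec = embed(query)
    bank_vecs = np.stack([embed(q) for q, _ in qa_bank])
    sims = bank_vecs @ q_vec / (
        np.linalg.norm(bank_vecs, axis=1) * np.linalg.norm(q_vec))
    best = int(np.argmax(sims))
    if sims[best] >= threshold:
        return qa_bank[best][1]   # immediate pre-generated answer
    return fallback_rag(query)    # slower generate-on-demand path
```

In practice the bank embeddings would be computed once and indexed, so the hot path is a single vector lookup rather than a full LLM call.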
人机交互
[HC-0] A technical curriculum on language-oriented artificial intelligence in translation and specialised communication
【速读】:该论文旨在解决翻译与专业传播领域从业者在人工智能(AI)驱动工作环境中缺乏领域特定技术AI素养的问题,从而提升其数字韧性。解决方案的关键在于设计并实施一个以语言导向型人工智能为核心的课程体系,聚焦向量嵌入(vector embeddings)、神经网络的技术基础、分词(tokenization)以及Transformer神经网络四大核心模块,通过概念性与技术性兼备的教育内容,培养学习者的计算思维、算法意识与算法能动性(algorithmic agency),进而增强其在AI应用场景中的适应能力与自主性。
链接: https://arxiv.org/abs/2602.12251
作者: Ralph Krüger
机构: TH Köln – University of Applied Sciences Cologne (科隆应用技术大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 10 pages, 1 figure, EAMT 2026, TAITT Workshop
Abstract:This paper presents a technical curriculum on language-oriented artificial intelligence (AI) in the language and translation (LT) industry. The curriculum aims to foster domain-specific technical AI literacy among stakeholders in the fields of translation and specialised communication by exposing them to the conceptual and technical/algorithmic foundations of modern language-oriented AI in an accessible way. The core curriculum focuses on 1) vector embeddings, 2) the technical foundations of neural networks, 3) tokenization and 4) transformer neural networks. It is intended to help users develop computational thinking as well as algorithmic awareness and algorithmic agency, ultimately contributing to their digital resilience in AI-driven work environments. The didactic suitability of the curriculum was tested in an AI-focused MA course at the Institute of Translation and Multilingual Communication at TH Koeln. Results suggest the didactic effectiveness of the curriculum, but participant feedback indicates that it should be embedded into higher-level didactic scaffolding - e.g., in the form of lecturer support - in order to enable optimal learning conditions.
[HC-1] VIRENA: Virtual Arena for Research Education and Democratic Innovation DATE
【速读】:该论文试图解决当前研究数字平台中人类互动、舆论形成与信息传播等社会动态时面临的三大挑战:数据获取受限、真实世界实验的伦理约束以及现有研究工具的功能局限。为此,作者提出了VIRENA(Virtual Arena)这一虚拟社交环境平台,其关键在于通过大规模语言模型驱动的AI代理(AI agents)与真人参与者在高度仿真的社交媒体(如Instagram、Reddit)和消息应用(如WhatsApp)环境中同步交互,实现可编程、可控制且无需编码的实验设计。该方案支持内容审核策略的灵活调整、预设刺激内容的定时投放,并允许跨条件对比实验,从而推动了此前难以开展的研究,如人机协同行为分析、不同治理干预的效果评估及群体讨论过程的实时观测。平台基于开源技术构建,确保数据主权归属机构并符合隐私保护规范,具有跨学科、跨领域应用潜力。
链接: https://arxiv.org/abs/2602.12207
作者: Emma Hoes,K. Jonathan Klueser,Fabrizio Gilardi
机构: University of Zurich (苏黎世大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注: VIRENA is under active development and currently in use at the University of Zurich, supported by the DIZH Innovation Program: 2nd Founder-Call. This preprint will be updated as new features are released. For the latest version and to inquire about demos or pilot collaborations, contact the authors
Abstract:Digital platforms shape how people communicate, deliberate, and form opinions. Studying these dynamics has become increasingly difficult due to restricted data access, ethical constraints on real-world experiments, and limitations of existing research tools. VIRENA (Virtual Arena) is a platform that enables controlled experimentation in realistic social media environments. Multiple participants interact simultaneously in realistic replicas of feed-based platforms (Instagram, Facebook, Reddit) and messaging apps (WhatsApp, Messenger). Large language model-powered AI agents participate alongside humans with configurable personas and realistic behavior. Researchers can manipulate content moderation approaches, pre-schedule stimulus content, and run experiments across conditions through a visual interface requiring no programming skills. VIRENA makes possible research designs that were previously impractical: studying human–AI interaction in realistic social contexts, experimentally comparing moderation interventions, and observing group deliberation as it unfolds. Built on open-source technologies that ensure data remain under institutional control and comply with data protection requirements, VIRENA is currently in use at the University of Zurich and available for pilot collaborations. Designed for researchers, educators, and public organizations alike, VIRENA’s no-code interface makes controlled social media simulation accessible across disciplines and sectors. This paper documents its design, architecture, and capabilities.
[HC-2] Embodied AI Agents for Team Collaboration in Co-located Blue-Collar Work
【速读】:该论文旨在解决当前协作式人工智能(Collaborative AI)研究过度聚焦于白领工作,而忽视了蓝领工作场景中高度协同、具身化和情境依赖特性的问题。其解决方案的关键在于将AI代理的“具身性”(embodiment)从单纯的外观设计提升为一种社会-物质层面的设计策略,通过在工业与维护等蓝领场景中引入具身AI代理,支持团队共享情境意识并促进跨经验水平的包容性沟通,从而重塑蓝领协作实践。
链接: https://arxiv.org/abs/2602.12136
作者: Kaisa Vaananen,Niels van Berkel,Donald McMillan,Thomas Olsson
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 4 pages, 1 figure, a short synopsis of this paper has been submitted to CHI 2026 workshop on Embodying Relationships, Designing TUIs for Co-Located Human-Human Dynamics
Abstract:Blue-collar work is often highly collaborative, embodied, and situated in shared physical environments, yet most research on collaborative AI has focused on white-collar work. This position paper explores how the embodied nature of AI agents can support team collaboration and communication in co-located blue-collar workplaces. From the context of our newly started CAI-BLUE research project, we present two speculative scenarios from industrial and maintenance contexts that illustrate how embodied AI agents can support shared situational awareness and facilitate inclusive communication across experience levels. We outline open questions related to embodied AI agent design around worker inclusion, agency, transformation of blue-collar collaboration practices over time, and forms of acceptable AI embodiments. We argue that embodiment is not just an aesthetic choice but should become a socio-material design strategy of AI systems in blue-collar workplaces.
[HC-3] Value Alignment Tax: Measuring Value Trade-offs in LLM Alignment
【速读】:该论文旨在解决现有价值对齐(value alignment)研究中仅静态刻画价值关系、忽视干预手段(如提示、微调或偏好优化)如何重塑整体价值体系的问题。其解决方案的关键在于提出价值对齐税(Value Alignment Tax, VAT)框架,该框架通过衡量对齐引发的变化在相互关联的价值之间相对于目标收益的传播程度,捕捉对齐压力下价值表达的动力学特征。VAT揭示了对齐常导致价值间的非均匀、结构性协同变动,这些效应在传统单一目标评估中不可见,从而识别出系统性过程层面的对齐风险,并深化了对大语言模型(LLM)中价值对齐动态机制的理解。
链接: https://arxiv.org/abs/2602.12134
作者: Jiajun Chen,Hua Shen
机构: Center for Data Science, NYU Shanghai, New York University (上海纽约大学数据科学中心)
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Preprint. Under review. 20 pages, 13 figures
Abstract:Existing work on value alignment typically characterizes value relations statically, ignoring how interventions - such as prompting, fine-tuning, or preference optimization - reshape the broader value system. We introduce the Value Alignment Tax (VAT), a framework that measures how alignment-induced changes propagate across interconnected values relative to achieved on-target gain. VAT captures the dynamics of value expression under alignment pressure. Using a controlled scenario-action dataset grounded in Schwartz value theory, we collect paired pre-post normative judgments and analyze alignment effects across models, values, and alignment strategies. Our results show that alignment often produces uneven, structured co-movement among values. These effects are invisible under conventional target-only evaluation, revealing systemic, process-level alignment risks and offering new insights into the dynamics of value alignment in LLMs.
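One plausible way to operationalize such a tax is the ratio of off-target drift to on-target gain; the formula below is our illustrative reading, not the paper's exact definition:

```python
import numpy as np

def value_alignment_tax(pre, post, target_idx):
    """Illustrative VAT-style ratio: total absolute drift on non-target
    values divided by the gain achieved on the value the alignment
    intervention was aimed at.

    pre/post: arrays of value-expression scores (e.g., the ten Schwartz
    values) measured before and after an alignment intervention.
    """
    delta = post - pre
    on_target_gain = delta[target_idx]
    off_target = np.delete(delta, target_idx)
    return np.abs(off_target).sum() / max(on_target_gain, 1e-8)

pre = np.array([0.2, 0.5, 0.4, 0.6])
post = np.array([0.1, 0.8, 0.5, 0.3])  # value 1 was the alignment target
print(value_alignment_tax(pre, post, target_idx=1))  # off-target drift / gain
```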
[HC-4] Neutral Prompts, Non-Neutral People: Quantifying Gender and Skin-Tone Bias in Gemini Flash 2.5 Image and GPT Image 1.5
【速读】:该论文旨在解决生成式 AI(Generative AI)在图像生成过程中是否存在性别和肤色偏见的问题,尤其关注“中性提示词”是否能产生人口统计学上中立的输出。研究发现,即使使用语义中性的提示词,两个主流商业图像生成模型(Gemini Flash 2.5 Image 和 GPT Image 1.5)仍表现出显著的“默认白色”偏见(96% 输出为白人皮肤),并在性别倾向上出现分化:Gemini 偏好女性呈现主体,而 GPT 偏好男性且皮肤较浅。解决方案的关键在于构建了一套光照感知的色度计量方法,结合混合颜色归一化、面部关键点掩码与基于 Monk (MST)、PERLA 和 Fitzpatrick 量表的感知均匀肤色量化技术,从而区分美学渲染与真实色素分布,为算法视觉文化提供可重复、可比较的审计框架。
链接: https://arxiv.org/abs/2602.12133
作者: Roberto Balestri
机构: Università di Bologna (博洛尼亚大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:
Abstract:This study quantifies gender and skin-tone bias in two widely deployed commercial image generators - Gemini Flash 2.5 Image (NanoBanana) and GPT Image 1.5 - to test the assumption that neutral prompts yield demographically neutral outputs. We generated 3,200 photorealistic images using four semantically neutral prompts. The analysis employed a rigorous pipeline combining hybrid color normalization, facial landmark masking, and perceptually uniform skin tone quantification using the Monk (MST), PERLA, and Fitzpatrick scales. Neutral prompts produced highly polarized defaults. Both models exhibited a strong “default white” bias (96% of outputs). However, they diverged sharply on gender: Gemini favored female-presenting subjects, while GPT favored male-presenting subjects with lighter skin tones. This research provides a large-scale, comparative audit of state-of-the-art models using an illumination-aware colorimetric methodology, distinguishing aesthetic rendering from underlying pigmentation in synthetic imagery. The study demonstrates that neutral prompts function as diagnostic probes rather than neutral instructions. It offers a robust framework for auditing algorithmic visual culture and challenges the sociolinguistic assumption that unmarked language results in inclusive representation.
[HC-5] Choose Your Agent: Tradeoffs in Adopting AI Advisors, Coaches, and Delegates in Multi-Party Negotiation
【速读】:该论文旨在解决在社会性人工智能(AI)应用中,如何设计有效的代理-用户交互机制以提升个体与群体福利的问题。其核心挑战在于,尽管大语言模型(LLM)在多智能体环境中展现出超人类的战略性能,但实际用户对不同辅助模式(如主动推荐的Advisor、被动反馈的Coach和自主执行的Delegate)的选择偏好与其带来的真实收益之间存在显著错配。解决方案的关键在于引入“可采纳性兼容”的交互规则设计——即辅助模态应作为具有内生参与机制的系统组件,通过优化界面设计和用户认知路径,促进高价值行为(如委托决策)的采纳,从而释放自动化代理在群体层面的正外部性(如市场做市功能),最终实现从代理能力到群体福祉的转化。
链接: https://arxiv.org/abs/2602.12089
作者: Kehang Zhu,Lithium Thain,Vivian Tsai,James Wexler,Crystal Qian
机构: Harvard University (哈佛大学); Google DeepMind (谷歌DeepMind)
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:As AI usage becomes more prevalent in social contexts, understanding agent-user interaction is critical to designing systems that improve both individual and group outcomes. We present an online behavioral experiment (N = 243) in which participants play three multi-turn bargaining games in groups of three. Each game, presented in randomized order, grants access to a single LLM assistance modality: proactive recommendations from an Advisor, reactive feedback from a Coach, or autonomous execution by a Delegate; all modalities are powered by an underlying LLM that achieves superhuman performance in an all-agent environment. On each turn, participants privately decide whether to act manually or use the AI modality available in that game. Despite preferring the Advisor modality, participants achieve the highest mean individual gains with the Delegate, demonstrating a preference-performance misalignment. Moreover, delegation generates positive externalities; even non-adopting users in access-to-delegate treatment groups benefit by receiving higher-quality offers. Mechanism analysis reveals that the Delegate agent acts as a market maker, injecting rational, Pareto-improving proposals that restructure the trading environment. Our research reveals a gap between agent capabilities and realized group welfare. While autonomous agents can exhibit super-human strategic performance, their impact on realized welfare gains can be constrained by interfaces, user perceptions, and adoption barriers. Assistance modalities should be designed as mechanisms with endogenous participation; adoption-compatible interaction rules are a prerequisite to improving human welfare with automated assistance.
[HC-6] Wisdom of the LLM Crowd: A Large Scale Benchmark of Multi-Label U.S. Election-Related Harmful Social Media Content
【速读】:该论文旨在解决选举虚假信息与有害政治内容早期检测难题,以维护民主制度的完整性。其核心问题是现有方法难以高效、准确地识别和分类大规模社交媒体中潜在危害性内容。解决方案的关键在于构建USE24-XD数据集——一个包含近10万条来自X(原Twitter)平台的帖子、附带时空元数据的大规模多标签标注数据集,并采用六种大语言模型(Large Language Models, LLMs)对五类细分内容(阴谋论、煽动性内容、仇恨言论、推测性内容和讽刺内容)进行系统标注,从而在显著降低人工标注成本的同时实现可扩展的分类能力。研究进一步通过众包验证与人类标注者对比,证明LLMs具有高内部一致性及高达0.90的推测类内容召回率,最终利用“群体智慧”策略融合多个LLM的标注结果,形成高质量、多标签、公开可用的数据资源,为后续研究提供坚实基础。
链接: https://arxiv.org/abs/2602.11962
作者: Qile Wang,Prerana Khatiwada,Carolina Coimbra Vieira,Benjamin E. Bagozzi,Kenneth E. Barner,Matthew Louis Mauriello
机构: University of Delaware (特拉华大学); Max Planck Institute for Demographic Research (马克斯·普朗克人口研究所)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注:
Abstract:The spread of election misinformation and harmful political content conveys misleading narratives and poses a serious threat to democratic integrity. Detecting harmful content at early stages is essential for understanding and potentially mitigating its downstream spread. In this study, we introduce USE24-XD, a large-scale dataset of nearly 100k posts collected from X (formerly Twitter) during the 2024 U.S. presidential election cycle, enriched with spatio-temporal metadata. To substantially reduce the cost of manual annotation while enabling scalable categorization, we employ six large language models (LLMs) to systematically annotate posts across five nuanced categories: Conspiracy, Sensationalism, Hate Speech, Speculation, and Satire. We validate LLM annotations with crowdsourcing (n = 34) and benchmark them against human annotators. Inter-rater reliability analyses show comparable agreement patterns between LLMs and humans, with LLMs exhibiting higher internal consistency and achieving up to 0.90 recall on Speculation. We apply a wisdom-of-the-crowd approach across LLMs to aggregate annotations and curate a robust multi-label dataset. 60% of posts receive at least one label. We further analyze how human annotator demographics, including political ideology and affiliation, shape labeling behavior, highlighting systematic sources of subjectivity in judgments of harmful content. The USE24-XD dataset is publicly released to support future research.
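The wisdom-of-the-crowd aggregation can be sketched as multi-label majority voting across the LLM annotators; the vote threshold below is illustrative, not necessarily the paper's setting:

```python
from collections import Counter

LABELS = ["Conspiracy", "Sensationalism", "Hate Speech", "Speculation", "Satire"]

def aggregate_labels(annotations, min_votes=4):
    """Wisdom-of-the-crowd aggregation over multiple LLM annotators.

    annotations: list of label sets, one per LLM (six in the paper).
    A label is kept for a post when at least min_votes annotators
    assigned it, yielding a multi-label consensus annotation.
    """
    votes = Counter(label for labels in annotations for label in labels)
    return {label for label in LABELS if votes[label] >= min_votes}

post_annotations = [
    {"Speculation"}, {"Speculation", "Sensationalism"}, {"Speculation"},
    {"Speculation"}, set(), {"Sensationalism"},
]
print(aggregate_labels(post_annotations))  # {'Speculation'}
```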
[HC-7] Who Does What? Archetypes of Roles Assigned to LLMs During Human-AI Decision-Making
【速读】:该论文旨在解决生成式 AI(Generative AI)在高风险决策场景中应用时,人机协同决策过程中角色分配与交互模式不明确所带来的系统性风险问题。其解决方案的关键在于提出“人-大语言模型(LLM)原型”(human-LLM archetypes)这一概念框架,通过系统梳理113篇相关文献并提炼出17种典型的 socio-technical 交互模式,揭示不同原型在临床诊断等真实场景下对LLM输出和决策结果的影响机制,并进一步分析决策控制权、社会层级、认知强制策略及信息需求等方面的权衡关系,为设计安全、高效的人-AI决策系统提供结构化参考。
链接: https://arxiv.org/abs/2602.11924
作者: Shreya Chappidi,Jatinder Singh,Andra V. Krauze
机构: University of Cambridge (剑桥大学); National Cancer Institute, National Institutes of Health (国家癌症研究所,美国国立卫生研究院); Research Centre Trust, UA Ruhr, University Duisburg-Essen (鲁尔大学杜伊斯堡-埃森研究信托中心)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted to ACM CHI 2026
Abstract:LLMs are increasingly supporting decision-making across high-stakes domains, requiring critical reflection on the socio-technical factors that shape how humans and LLMs are assigned roles and interact during human-in-the-loop decision-making. This paper introduces the concept of human-LLM archetypes – defined as recurring socio-technical interaction patterns that structure the roles of humans and LLMs in collaborative decision-making. We describe 17 human-LLM archetypes derived from a scoping literature review and thematic analysis of 113 LLM-supported decision-making papers. Then, we evaluate these diverse archetypes across real-world clinical diagnostic cases to examine the potential effects of adopting distinct human-LLM archetypes on LLM outputs and decision outcomes. Finally, we present relevant tradeoffs and design choices across human-LLM archetypes, including decision control, social hierarchies, cognitive forcing strategies, and information requirements. Through our analysis, we show that the selection of a human-LLM interaction archetype can influence LLM outputs and decisions, bringing important risks and considerations for the designers of human-AI decision-making systems.
[HC-8] Decision Support System for Technology Opportunity Discovery: An Application of the Schwartz Theory of Basic Values
【速读】:该论文试图解决早期阶段技术创新中技术机会发现(Technology Opportunity Discovery, TOD)的挑战,即现有方法难以系统整合终端用户视角,导致技术可行性与市场相关性之间存在脱节。解决方案的关键在于构建一个融合工程导向的成熟度评估(Technology Readiness Levels, TRL)与舒瓦茨基本人类价值观理论(Schwartz’s theory of basic human values)的决策支持框架,通过将技术潜力与多元人类价值关联,实现对新兴技术满足不同用户动机的结构化探索,从而提升技术机会发现的人本导向性和战略前瞻性。
链接: https://arxiv.org/abs/2602.11855
作者: Ayato Kitadai,Takumi Ito,Yumiko Nagoh,Hiroki Takahashi,Masanori Fujita,Sangjic Lee,Fumiaki Miyahara,Tetsu Natsume,Nariaki Nishino
机构: The University of Tokyo (东京大学); University of Tsukuba (筑波大学); Ritsumeikan Asia Pacific University (立命馆亚洲太平洋大学); Nihon University (日本大学); Sony Computer Science Laboratories, Inc. (索尼计算机科学实验室有限公司)
类目: Human-Computer Interaction (cs.HC)
备注: 24 pages, 5 figures
Abstract:Technology opportunity discovery (TOD) remains a critical challenge for innovation management, especially in early-stage development where consumer needs are often unclear. Existing methods frequently fail to systematically incorporate end-user perspectives, resulting in a misalignment between technological potentials and market relevance. This study proposes a novel decision support framework that bridges this gap by linking technological feasibility with fundamental human values. The framework integrates two distinct lenses: the engineering-based Technology Readiness Levels (TRL) and Schwartz's theory of basic human values. By combining these, the approach enables a structured exploration of how emerging technologies may satisfy diverse user motivations. To illustrate the framework's feasibility and insight potential, we conducted exploratory workshops with general consumers and internal experts at Sony Computer Science Laboratories, Inc., analyzing four real-world technologies (two commercial successes and two failures). Two consistent patterns emerged: (1) internal experts identified a wider value landscape than consumers (vision gap), and (2) successful technologies exhibited a broader range of associated human values (value breadth), suggesting strategic foresight may underpin market success. This study contributes both a practical tool for early-stage R&D decision-making and a theoretical link between value theory and innovation outcomes. While exploratory in scope, the findings highlight the promise of value-centric evaluation as a foundation for more human-centered technology opportunity discovery.
[HC-9] V-SHiNE: A Virtual Smart Home Framework for Explainability Evaluation
【速读】:该论文旨在解决自主智能家居系统中解释(Explanation)质量与影响难以进行方法学评估的问题。当前缺乏一种可扩展且贴近现实的评估框架,使得研究人员无法有效验证解释机制对用户理解与信任的影响。解决方案的关键在于提出V-SHiNE——一个基于浏览器的智能家居模拟框架,它支持环境配置、行为仿真、自定义解释引擎集成,并具备灵活的交付模式和丰富的交互日志记录功能,从而为可解释智能系统提供轻量、可复现且以用户为中心的评估平台。
链接: https://arxiv.org/abs/2602.11775
作者: Mersedeh Sadeghi,Simon Scholz,Max Unterbusch,Andreas Vogelsang
机构: 未知
类目: Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注:
Abstract:Explanations are essential for helping users interpret and trust autonomous smart-home decisions, yet evaluating their quality and impact remains methodologically difficult in this domain. V-SHiNE addresses this gap: a browser-based smart-home simulation framework for scalable and realistic assessment of explanations. It allows researchers to configure environments, simulate behaviors, and plug in custom explanation engines, with flexible delivery modes and rich interaction logging. A study with 159 participants demonstrates its feasibility. V-SHiNE provides a lightweight, reproducible platform for advancing user-centered evaluation of explainable intelligent systems.
[HC-10] Building Intelligent User Interfaces for Human-AI Alignment
【速读】:该论文试图解决的问题是:在人工智能(AI)系统与人类价值观对齐的过程中,用户界面(User Interface, UI)常被忽视,仅被视为实现细节而非影响对齐效果的关键因素。解决方案的关键在于提出一个参考模型(Reference Model),该模型提供了一个系统性的框架,用于分析用户界面在何处以及如何贡献于提升人机对齐(Human-AI Alignment)。通过两个案例研究和一项包含六个用户界面的初步调查,该模型展示了人机交互(Human-Computer Interaction, HCI)在优化对齐过程中的潜在价值,从而推动对齐研究从算法导向转向人机协同设计。
链接: https://arxiv.org/abs/2602.11753
作者: Danqing Shi
机构: University of Cambridge (剑桥大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Aligning AI systems with human values fundamentally relies on effective human feedback. While significant research has addressed training algorithms, the role of user interface is often overlooked and only treated as an implementation detail rather than a critical factor of alignment. This paper addresses this gap by introducing a reference model that offers a systematic framework for analyzing where and how user interface contributions can improve human-AI alignment. The structured taxonomy of the reference model is demonstrated through two case studies and a preliminary investigation featuring six user interfaces. This work highlights opportunities to advance alignment through human-computer interaction.
[HC-11] AmbiBench: Benchmarking Mobile GUI Agents Beyond One-Shot Instructions in the Wild
【速读】:该论文旨在解决当前移动图形用户界面(GUI)智能体评估中忽视用户指令模糊性与交互式意图对齐能力的问题。现有基准多假设用户指令完整明确,仅评估单轮执行效果,忽略了智能体在动态环境中通过主动澄清与用户达成真实意图一致的能力。解决方案的关键在于提出首个引入指令清晰度分类的基准 AmbiBench,基于认知差距理论构建包含“详细(Detailed)、标准(Standard)、不完整(Incomplete)和模糊(Ambiguous)”四类清晰度的评估体系,并开发自动化评估框架 MUSE(Mobile User Satisfaction Evaluator),其采用多模态大语言模型(MLLM)作为裁判的多智能体架构,在结果有效性、执行质量和交互质量三个维度实现细粒度审计,从而系统性地量化智能体在不同指令清晰度下的性能边界及主动交互带来的提升,验证了评估指标与人类判断的高度一致性。
链接: https://arxiv.org/abs/2602.11750
作者: Jiazheng Sun,Mingxuan Li,Yingying Zhang,Jiayang Niu,Yachen Wu,Ruihan Jin,Shuyu Lei,Pengrongrui Tan,Zongyu Zhang,Ruoyi Wang,Jiachen Yang,Boyu Yang,Jiacheng Liu,Xin Peng
机构: Fudan University (复旦大学); Jilin University (吉林大学)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 21 pages, 7 figures
Abstract:Benchmarks are paramount for gauging progress in the domain of Mobile GUI Agents. In practical scenarios, users frequently fail to articulate precise directives containing full task details at the onset, and their expressions are typically ambiguous. Consequently, agents are required to converge on the user’s true intent via active clarification and interaction during execution. However, existing benchmarks predominantly operate under the idealized assumption that user-issued instructions are complete and unequivocal. This paradigm focuses exclusively on assessing single-turn execution while overlooking the alignment capability of the agent. To address this limitation, we introduce AmbiBench, the first benchmark incorporating a taxonomy of instruction clarity to shift evaluation from unidirectional instruction following to bidirectional intent alignment. Grounded in Cognitive Gap theory, we propose a taxonomy of four clarity levels: Detailed, Standard, Incomplete, and Ambiguous. We construct a rigorous dataset of 240 ecologically valid tasks across 25 applications, subject to strict review protocols. Furthermore, targeting evaluation in dynamic environments, we develop MUSE (Mobile User Satisfaction Evaluator), an automated framework utilizing an MLLM-as-a-judge multi-agent architecture. MUSE performs fine-grained auditing across three dimensions: Outcome Effectiveness, Execution Quality, and Interaction Quality. Empirical results on AmbiBench reveal the performance boundaries of SoTA agents across different clarity levels, quantify the gains derived from active interaction, and validate the strong correlation between MUSE and human judgment. This work redefines evaluation standards, laying the foundation for next-generation agents capable of truly understanding user intent.
[HC-12] Mapping the Landscape of Affective Extended Reality: A Scoping Review of Biodata-Driven Systems for Understanding and Sharing Emotions
【速读】:该论文试图解决的问题是:当前关于情感计算与扩展现实(Extended Reality, XR)融合的研究成果分散且缺乏系统性整合,导致对“情感增强型XR”(affective XR)这一新兴领域的整体认知不清晰。为解决此问题,作者通过系统性地梳理82篇相关文献,构建了一个涵盖生物数据(biodata)、情绪感知与XR交互的综合映射框架。其解决方案的关键在于:识别并分析现有系统中所采用的技术、交互方式及评估方法,从而揭示情感共享目标的多样性,并提炼出影响情感理解的核心设计维度与挑战,进而指出尚未被充分探索的情感表达路径,为未来研究提供明确方向。
链接: https://arxiv.org/abs/2602.11710
作者: Zhidian Lin,Allison Jing,Ziyuan Qu,Fabio Zambetta,Ryan M. Kelly
机构: 未知
类目: Human-Computer Interaction (cs.HC); Emerging Technologies (cs.ET)
备注: 30 pages, 18 figures, 8 tables
Abstract:This paper introduces the notion of affective extended reality (XR) to characterise XR systems that use biodata to enable understanding of emotions. The HCI literature contains many such systems, but they have not yet been mapped into a coherent whole. To address this, we conducted a scoping review of 82 papers that explore the nexus of biodata, emotions, and XR. We analyse the technologies used in these systems, the interaction techniques employed, and the methods used to evaluate their effectiveness. Through our analysis, we contribute a mapping of the current landscape of affective XR, revealing diversity in the goals for enabling emotion sharing. We demonstrate how HCI researchers have explored the design of the interaction flows in XR biofeedback systems, highlighting key design dimensions and challenges in understanding emotions. We discuss underused approaches for emotion sharing and highlight opportunities for future research on affective XR.
[HC-13] PatientHub: A Unified Framework for Patient Simulation
【速读】:该论文旨在解决当前模拟患者(Simulated Patient)在角色扮演类应用中因数据格式、提示词(prompt)和评估指标不统一而导致的可复现性差与公平比较困难的问题。解决方案的关键在于提出一个名为PatientHub的统一且模块化的框架,通过标准化模拟患者的定义、组成和部署流程,实现跨方法的标准化评估以及自定义评估指标的无缝集成,并支持新模拟器变体的快速原型开发,从而降低方法开发门槛并加速研究进展。
链接: https://arxiv.org/abs/2602.11684
作者: Sahand Sabour,TszYam NG,Minlie Huang
机构: Tsinghua University (清华大学); Institute for Artificial Intelligence (人工智能研究所); The CoAI Group (CoAI小组); DCST (计算机科学与技术系)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Work in progress
Abstract:As Large Language Models increasingly power role-playing applications, simulating patients has become a valuable tool for training counselors and scaling therapeutic assessment. However, prior work is fragmented: existing approaches rely on incompatible, non-standardized data formats, prompts, and evaluation metrics, hindering reproducibility and fair comparison. In this paper, we introduce PatientHub, a unified and modular framework that standardizes the definition, composition, and deployment of simulated patients. To demonstrate PatientHub’s utility, we implement several representative patient simulation methods as case studies, showcasing how our framework supports standardized cross-method evaluation and the seamless integration of custom evaluation metrics. We further demonstrate PatientHub’s extensibility by prototyping two new simulator variants, highlighting how PatientHub accelerates method development by eliminating infrastructure overhead. By consolidating existing work into a single reproducible pipeline, PatientHub lowers the barrier to developing new simulation methods and facilitates cross-method and cross-model benchmarking. Our framework provides a practical foundation for future datasets, methods, and benchmarks in patient-centered dialogue, and the code is publicly available via this https URL.
[HC-14] “I Was Told to Come Back and Share This”: Social Media-Based Near-Death Experience Disclosures as Expressions of Spiritual Beliefs
【速读】:该论文试图解决的问题是:当前关于濒死体验(Near-Death Experiences, NDEs)的研究多聚焦于个体叙事,缺乏对社交媒体平台中NDE叙事如何通过协作方式传播与互动的探讨。解决方案的关键在于:通过对200个随机采样的抖音(TikTok)视频进行内容分析,发现用户常借助NDE叙事建构个人意义,并以灵性和宗教主题为核心,促进深层信仰与存在意义的交流;同时,评论区分析表明,包含灵性内容的视频能引发更高水平的参与度和社群对话,从而揭示了在线平台在推动灵性议题社区化表达与连接中的作用机制。
链接: https://arxiv.org/abs/2602.11663
作者: Yifan Zhao,Yuxin Fang,Yihuan Chen,RAY LC
机构: City University of Hong Kong(香港城市大学); Gannan University of Science and Technology(赣南科技学院); Ganzhou Key Laboratory of Digital Cultural Preservation and Intelligent Innovation(赣州数字文化保护与智能创新重点实验室); Studio for Narrative Spaces(叙事空间工作室)
类目: Human-Computer Interaction (cs.HC)
备注: 19 pages, 5 figures, CHI 2026 full paper
Abstract:People who experienced near-death events often turn to personal expression as a way of processing trauma and articulating beliefs. While scholars have examined how individuals share near-death experiences (NDEs), limited research has explored how these narratives are communicated collaboratively on today’s social media platforms. We analyzed 200 randomly sampled TikTok videos tagged with #nde and related hashtags. Content analysis revealed that individuals often use NDE narratives to articulate personal meaning, with spiritual and religious themes appearing in the majority of posts and serving as a means of exploring and making sense of personal spiritual perspectives. Consistent with this, analyses of comment sections reveal that videos containing spiritual themes tend to attract more engagement and foster deeper conversations around faith and meaning. Our findings offer insights into how online platforms facilitate community-level engagement with spirituality, and suggest implications for design of spaces that support shared expression and connection in specialized communities.
[HC-15] Human-Like Gaze Behavior in Social Robots: A Deep Learning Approach Integrating Human and Non-Human Stimuli
【速读】:该论文旨在解决社交机器人在复杂社会情境中难以有效模仿人类非语言行为(尤其是凝视方向)的问题,从而提升人机交互的自然性与流畅性。其关键解决方案在于构建基于长短期记忆网络(LSTM)和Transformer架构的预测模型,利用41名参与者在虚拟现实(VR)环境中采集的凝视数据,训练模型以预测个体在包含人类与非人类刺激(如指向、门开启、物体掉落等)下的 gaze 行为;该方法首次系统性地纳入非人类刺激对凝视模式的影响,在动画与真实世界场景中均实现了超过70%的预测准确率,并成功部署于NAO机器人上,通过275名用户的问卷评估验证了高满意度,显著优于现有研究。
链接: https://arxiv.org/abs/2602.11648
作者: Faezeh Vahedi,Morteza Memari,Ramtin Tabatabaei,Alireza Taheri
机构: 未知
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注:
Abstract:Nonverbal behaviors, particularly gaze direction, play a crucial role in enhancing effective communication in social interactions. As social robots increasingly participate in these interactions, they must adapt their gaze based on human activities and remain receptive to all cues, whether human-generated or not, to ensure seamless and effective communication. This study aims to increase the similarity between robot and human gaze behavior across various social situations, including both human and non-human stimuli (e.g., conversations, pointing, door openings, and object drops). A key innovation in this study is the investigation of gaze responses to non-human stimuli, a critical yet underexplored area in prior research. These scenarios were simulated in Unity as a 3D animation and a 360-degree real-world video. Data on gaze directions from 41 participants were collected via virtual reality (VR) glasses. The preprocessed data were used to train two neural networks, an LSTM and a Transformer, to build predictive models of individuals' gaze patterns. In the animated scenario, the LSTM and Transformer models achieved prediction accuracies of 67.6% and 70.4%, respectively; in the real-world scenario, they achieved 72% and 71.6%, respectively. Despite the gaze pattern differences among individuals, our models outperform existing approaches in accuracy while uniquely considering non-human stimuli, offering a significant advantage over previous literature. Furthermore, deployed on the NAO robot, the system was evaluated by 275 participants via a comprehensive questionnaire, with results demonstrating high satisfaction during interactions. This work advances social robotics by enabling robots to dynamically mimic human gaze behavior in complex social contexts.
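A minimal version of such a sequence model is a small LSTM classifier over windows of scene/gaze features; the sizes below are placeholders rather than the study's actual configuration:

```python
import torch
import torch.nn as nn

class GazeLSTM(nn.Module):
    """Toy sequence model for gaze-target prediction: given a window of
    scene/gaze features, predict which region a person will look at next.
    Feature and class counts are illustrative assumptions.
    """
    def __init__(self, feat_dim=12, hidden=64, n_targets=6):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_targets)

    def forward(self, x):  # x: [batch, time, feat_dim]
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # logits for the next gaze target

logits = GazeLSTM()(torch.randn(4, 30, 12))
```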
[HC-16] Behavioral Indicators of Overreliance During Interaction with Conversational Language Models
Quick Read: This paper addresses overreliance on large language models (LLMs): because LLMs inherently hallucinate, users may overlook misinformation and make worse decisions. The key to the solution is analyzing the interaction logs of 77 participants working with an LLM on three real-world tasks and identifying five behavioral patterns significantly associated with overreliance: users with low overreliance show careful task comprehension and fine-grained navigation, whereas users with high overreliance show frequent copy-pasting, skipped initial comprehension, repeated references back to the LLM, coarse-grained locating, and acceptance of misinformation despite hesitation. This behavior-clustering approach yields actionable indicators for detecting and mitigating overreliance on LLMs and informs the design of more robust human-AI collaboration interfaces.
Link: https://arxiv.org/abs/2602.11567
Authors: Chang Liu,Qinyi Zhou,Xinjie Shen,Xingyu Bruce Liu,Tongshuang Wu,Xiang ‘Anthony’ Chen
Affiliations: Tsinghua University; Hong Kong University of Science and Technology; Georgia Institute of Technology; UCLA; Carnegie Mellon University
Subjects: Human-Computer Interaction (cs.HC)
Comments: conditionally accepted by ACM CHI 2026
Abstract:LLMs are now embedded in a wide range of everyday scenarios. However, their inherent hallucinations risk hiding misinformation in fluent responses, raising concerns about overreliance on AI. Detecting overreliance is challenging, as it often arises in complex, dynamic contexts and cannot be easily captured by post-hoc task outcomes. In this work, we aim to investigate how users' behavioral patterns correlate with overreliance. We collected interaction logs from 77 participants working with an LLM injected with plausible misinformation across three real-world tasks, and we assessed overreliance by whether participants detected and corrected these errors. By semantically encoding and clustering segments of user interactions, we identified five behavioral patterns linked to overreliance: users with low overreliance show careful task comprehension and fine-grained navigation; users with high overreliance show frequent copy-paste, skipping initial comprehension, repeated LLM references, coarse locating, and accepting misinformation despite hesitation. We discuss design implications for mitigation.
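As a rough illustration of the "semantically encode and cluster interaction segments" step, the sketch below substitutes TF-IDF for a learned semantic encoder and clusters hypothetical action-log strings with k-means; it approximates the shape of the pipeline, not the paper's actual encoder:

```python
# Toy encode-then-cluster pipeline over hypothetical interaction-log segments.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

segments = [
    "copy response paste into editor",
    "read task description scroll slowly",
    "ask llm again accept answer",
    "search source verify claim",
]
X = TfidfVectorizer().fit_transform(segments)   # stand-in for a semantic encoder
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # one cluster id per interaction segment
```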
[HC-17] Implications of AI Involvement for Trust in Expert Advisory Workflows Under Epistemic Dependence
Quick Read: This paper asks how the introduction of AI tools in professional domains such as medicine, law, and finance affects users' trust in the human expert, the AI system, and their combination. The key to the solution is a user study with 77 participants on a simulated course-planning task, comparing conditions that differ in AI presence and in the mode of human-AI collaboration. The study finds that users' trust depends not only on the accuracy of the expert's decisions but also, significantly, on how the expert uses the AI assistant, which offers important guidance for designing effective human-AI hybrid teams, especially where adopting a recommendation hinges on the user's perception of the recommender's expertise.
Link: https://arxiv.org/abs/2602.11522
Authors: Dennis Kim,Roya Daneshi,Bruce Draper,Sarath Sreedharan
Affiliations: Colorado State University
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:The increasing integration of AI-powered tools into expert workflows, such as medicine, law, and finance, raises a critical question: how does AI involvement influence a user’s trust in the human expert, the AI system, and their combination? To investigate this, we conducted a user study (N=77) featuring a simulated course-planning task. We compared various conditions that differed in both the presence of AI and the specific mode of human-AI collaboration. Our results indicate that while the advisor’s ability to create a correct schedule is important, the user’s perception of expertise and trust is also influenced by how the expert utilized the AI assistant. These findings raise important considerations for the design of human-AI hybrid teams, particularly when the adoption of recommendations depends on the end-user’s perception of the recommender’s expertise.
[HC-18] How Smart Is Your GUI Agent? A Framework for the Future of Software Interaction
Quick Read: This paper addresses the conceptual vagueness in how the autonomy of current GUI agents is described, which blurs capability, responsibility, and risk. The key to the solution is the six-level GUI Agent Autonomy Levels (GAL) framework, which makes an agent's degree of autonomy explicit by clearly delineating levels of autonomy and provides a measurable benchmark for progress toward trustworthy software interaction.
Link: https://arxiv.org/abs/2602.11514
Authors: Sidong Feng,Chunyang Chen
Affiliations: The Chinese University of Hong Kong, Shenzhen; Technical University of Munich, Heilbronn
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments:
Abstract:GUI agents are rapidly becoming a new way of interacting with software, allowing people to navigate web, desktop, and mobile applications rather than operating them click by click. Yet "agent" is described with radically different degrees of autonomy, obscuring capability, responsibility, and risk. We call for conceptual clarity through GUI Agent Autonomy Levels (GAL), a six-level framework that makes autonomy explicit and helps benchmark progress toward trustworthy software interaction.
[HC-19] An Educational Human Machine Interface Providing Request-to-Intervene Trigger and Reason Explanation for Enhancing the Driver's Comprehension of ADS's System Limitations
Quick Read: This paper addresses take-over hesitation and confusion in Level 3 automated driving systems (ADS): when the operational design domain (ODD) is exceeded, complex traffic scenes can leave drivers unsure what triggered the request to intervene (RtI). The key to the solution is a voice-based educational human-machine interface (HMI) that embeds clear trigger cues and reasons in the RtI, improving drivers' comprehension of the ADS's system limitations and thereby strengthening proactive take-over and reducing collision risk. Experimental results show the method significantly improves drivers' understanding of system limitations and yields safer, more timely take-over behavior even when the RtI fails.
Link: https://arxiv.org/abs/2602.11507
Authors: Ryuji Matsuo,Hailong Liu,Toshihiro Hiraoka,Takahiro Wada
Affiliations: Nara Institute of Science and Technology; Japan Automobile Research Institute
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Level 3 automated driving systems (ADS) have attracted significant attention and are being commercialized. A level 3 ADS prompts the driver to take control by issuing a request to intervene (RtI) when its operational design domain (ODD) is exceeded. However, complex traffic situations can cause drivers to perceive multiple potential triggers of the RtI simultaneously, causing hesitation or confusion during take-over. Drivers therefore need to clearly understand the ADS's system limitations to ensure safe take-over. This study proposes a voice-based educational human-machine interface (HMI) that provides RtI trigger cues and reasons to help drivers understand the ADS's system limitations. The results of a between-group experiment using a driving simulator showed that incorporating effective trigger cues and reasons into the RtI was related to improved driver comprehension of the ADS's system limitations. Moreover, most participants instructed via the proposed method could proactively take over control of the ADS in cases where the RtI fails, and their number of collisions was lower compared with the other RtI HMI conditions. Continually enhancing the driver's understanding of the ADS's system limitations through the proposed method is therefore associated with safer and more effective real-time interactions with the ADS.
[HC-20] Data-driven modelling of low-dimensional dynamical structures underlying complex full-body human movement
Quick Read: This paper addresses the degrees-of-freedom problem in human motor control and learning: extracting low-dimensional, predictable movement patterns from high-dimensional movement variables. The key to the solution is using neural ordinary differential equations (NODEs) to model non-cyclic full-body movement (baseball pitching) as a dynamical system evolving in a low-dimensional latent space according to ODE-defined dynamics. Experiments show the temporal evolution of this complex movement can be predicted with good accuracy (R² > 0.45), and that the initial ~8% of the sequence explains about 50% of the variance in the latter half of the motion, validating an initial-condition-driven dynamical account in latent space and extending the dynamical systems approach (DSA) to ecologically valid human movement.
Link: https://arxiv.org/abs/2602.11492
Authors: Ryota Takamido,Chiharu Suzuki,Hiroki Nakamoto
Affiliations: unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:One of the central challenges in the study of human motor control and learning is the degrees-of-freedom problem. Although the dynamical systems approach (DSA) has provided valuable insights into addressing this issue, its application has largely been confined to cyclic or simplified motor movements. To overcome this limitation, the present study employs neural ordinary differential equations (NODEs) to model the time evolution of non-cyclic full-body movements as a low-dimensional latent dynamical system. Given the temporal complexity of full-body kinematic chains, baseball pitching was selected as a representative target movement to examine whether the DSA could be extended to more complex, ecologically valid human movements. Results of the verification experiment demonstrated that the time evolution of a complex pitching motion could be accurately predicted (R² > 0.45) using the NODE-based dynamical model. Notably, approximately 50% of the variance in the latter half of the pitching motion was explained using only the initial ~8% of the temporal sequence, underscoring how subsequent movement evolves from initial conditions according to ODE-defined dynamics in latent space. These findings indicate the potential to extend the DSA to more complex and ecologically valid forms of human movement.
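A minimal latent neural-ODE sketch, using the third-party torchdiffeq package, of the kind of model the abstract describes: a low-dimensional latent state z is rolled forward from its initial condition by a learned ODE and fit against the encoded motion sequence. Dimensions and the dummy target are illustrative assumptions:

```python
# Latent neural-ODE sketch: roll a latent state forward and fit it to a sequence.
import torch
import torch.nn as nn
from torchdiffeq import odeint  # pip install torchdiffeq

class LatentDynamics(nn.Module):
    def __init__(self, dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, t, z):      # dz/dt = f(z)
        return self.net(z)

f = LatentDynamics()
z0 = torch.randn(1, 8)                    # latent initial condition (~first frames)
t = torch.linspace(0.0, 1.0, 100)         # normalized time over the motion
z_traj = odeint(f, z0, t)                 # (100, 1, 8): predicted latent evolution
target = torch.randn(100, 1, 8)           # encoded motion-capture sequence (dummy)
loss = ((z_traj - target) ** 2).mean()
loss.backward()                           # fit the dynamics to the observed motion
```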
[HC-21] Understanding Persuasive Interactions between Generative Social Agents and Humans: The Knowledge-based Persuasion Model (KPM)
Quick Read: This paper addresses the lack of theoretical guidance and research norms for generative social agents (GSAs) in human-AI interaction, particularly for systematically understanding how GSAs influence user attitudes and behaviors. The key to the solution is the Knowledge-based Persuasion Model (KPM), which holds that a GSA's persuasive behavior is driven by its knowledge of itself, the user, and the context, and in turn shapes the human user's cognitive and behavioral responses. By synthesizing existing research, the KPM provides a structured analytical framework intended to foster responsible GSA design that adheres to social norms and ethical standards and improves user wellbeing, especially in key application domains such as healthcare and education.
Link: https://arxiv.org/abs/2602.11483
Authors: Stephan Vonschallen,Friederike Eyssel,Theresa Schmiedel
Affiliations: ZHAW; University of Bielefeld
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:
Abstract:Generative social agents (GSAs) use artificial intelligence to autonomously communicate with human users in a natural and adaptive manner. Currently, there is a lack of theorizing regarding interactions with GSAs, and likewise, few guidelines exist for studying how they influence user attitudes and behaviors. Consequently, we propose the Knowledge-based Persuasion Model (KPM) as a novel theoretical framework. According to the KPM, a GSA’s self, user, and context-related knowledge drives its persuasive behavior, which in turn shapes the attitudes and behaviors of a responding human user. By synthesizing existing research, the model offers a structured approach to studying interactions with GSAs, supporting the development of agents that motivate rather than manipulate humans. Accordingly, the KPM encourages the integration of responsible GSAs that adhere to social norms and ethical standards with the goal of increasing user wellbeing. Implications of the KPM for research and application domains such as healthcare and education are discussed.
[HC-22] When Visibility Outpaces Verification: Delayed Verification and Narrative Lock-in in Agentic AI Discourse
Quick Read: This paper examines the potential harm of "credibility proxy" mechanisms on social platforms: in discussions of agentic AI, highly visible threads that lack timely verification undergo "narrative lock-in," which erodes the public's capacity to evaluate AI systems critically. The key to the solution is identifying and quantifying delays in "time-to-first-verification" and proposing "epistemic friction" as a design intervention that breaks the default trust conferred by engagement signals (e.g., upvotes) alone, restoring an evidence-driven cognitive balance to online discussion.
Link: https://arxiv.org/abs/2602.11412
Authors: Hanjing Shi,Dominic DiFranzo
Affiliations: Lehigh University
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Agentic AI systems-autonomous entities capable of independent planning and execution-reshape the landscape of human-AI trust. Long before direct system exposure, user expectations are mediated through high-stakes public discourse on social platforms. However, platform-mediated engagement signals (e.g., upvotes) may inadvertently function as a "credibility proxy," potentially stifling critical evaluation. This paper investigates the interplay between social proof and verification timing in online discussions of agentic AI. Analyzing a longitudinal dataset from two distinct Reddit communities with contrasting interaction cultures-r/OpenClaw and r/Moltbook-we operationalize verification cues via reproducible lexical rules and model the "time-to-first-verification" using a right-censored survival analysis framework. Our findings reveal a systemic "Popularity Paradox": high-visibility discussions in both subreddits experience significantly delayed or entirely absent verification cues compared to low-visibility threads. This temporal lag creates a critical window for "Narrative Lock-in," where early, unverified claims crystallize into collective cognitive biases before evidence-seeking behaviors emerge. We discuss the implications of this "credibility-by-visibility" effect for AI safety and propose "epistemic friction" as a design intervention to rebalance engagement-driven platforms.
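The right-censored survival setup can be sketched as follows with the lifelines package, assuming one row per thread with the observed hours until the first verification cue (event = 1) or the end of the observation window (event = 0); the column names and numbers are made up for illustration:

```python
# Right-censored "time-to-first-verification" sketch with a Cox model.
import pandas as pd
from lifelines import CoxPHFitter  # pip install lifelines

df = pd.DataFrame({
    "hours_to_verification": [2.0, 48.0, 5.5, 72.0, 12.0, 72.0],
    "event_observed":        [1,   0,    1,   0,    1,    0],   # 0 = censored
    "log_upvotes":           [1.2, 5.8,  0.7, 6.1,  2.3,  4.9],
})
cph = CoxPHFitter()
cph.fit(df, duration_col="hours_to_verification", event_col="event_observed")
cph.print_summary()   # a negative coefficient on log_upvotes would mean that
                      # high-visibility threads receive verification cues later
```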
[HC-23] Interpretive Cultures: Resonance, randomness and negotiated meaning for AI-assisted tarot divination
Quick Read: This paper asks what role generative AI plays in interpretive practices, where meaning is subjective, plural, and non-causal, and how AI can support users' meaning-making without diminishing their agency. The key to the solution is analyzing AI-assisted tarot divination as a concrete practice, revealing how practitioners use AI to navigate uncertainty, explore multiple perspectives, and extend existing interpretive workflows. Drawing on Hartmut Rosa's Theory of Resonance, the paper offers design recommendations for AI systems that preserve ambiguity, strengthen user agency, and foster resonance.
Link: https://arxiv.org/abs/2602.11367
Authors: Matthew Prock,Ziv Epstein,Hope Schroeder,Amy Smith,Cassandra Lee,Vana Goblot,Farnaz Jahanbakhsh
Affiliations: The University of Michigan; MIT; Queen Mary University; Goldsmiths, University of London
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:While generative AI tools are increasingly adopted for creative and analytical tasks, their role in interpretive practices, where meaning is subjective, plural, and non-causal, remains poorly understood. This paper examines AI-assisted tarot reading, a divinatory practice in which users pose a query, draw cards through a randomized process, and ask AI systems to interpret the resulting symbols. Drawing on interviews with tarot practitioners and Hartmut Rosa’s Theory of Resonance, we investigate how users seek, negotiate, and evaluate resonant interpretations in a context where no causal relationship exists between the query and the data being interpreted. We identify distinct ways practitioners incorporate AI into their interpretive workflows, including using AI to navigate uncertainty and self-doubt, explore alternative perspectives, and streamline or extend existing divinatory practices. Based on these findings, we offer design recommendations for AI systems that support interpretive meaning-making without collapsing ambiguity or foreclosing user agency.
[HC-24] Situated, Dynamic, and Subjective: Envisioning the Design of Theory-of-Mind-Enabled Everyday AI with Industry Practitioners
Quick Read: This paper asks how Theory of Mind (ToM) capabilities can be effectively incorporated into the design of everyday user-facing AI products and services to achieve more natural, adaptive, and personalized interaction. Although prior work has modeled and benchmarked AI ToM, a systematic path to grounding it in product design and development practice has been missing. The key to the solution is distilling, from 13 co-design workshops with 26 U.S.-based AI practitioners, three interrelated design recommendations: 1) situate ToM-enabled AI in the social contexts that shape users' mental states; 2) make it responsive to the dynamic nature of mental states; and 3) attune it to subjective individual differences. These recommendations expose tensions between practitioners' envisioned futures for ToM and the realities of current AI development, pointing away from static, inference-driven approaches and toward ToM as a pervasive capability supporting continuous human-AI interaction loops.
Link: https://arxiv.org/abs/2602.11342
Authors: Qiaosi Wang,Jini Kim,Avanita Sharma,Alicia (Hyun Jin) Lee,Jodi Forlizzi,Hong Shen
Affiliations: Carnegie Mellon University
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 16 pages, preprint for ACM CHI 2026 Conference
Abstract:Theory of Mind (ToM) – the ability to infer what others are thinking (e.g., intentions) from observable cues – is traditionally considered fundamental to human social interactions. This has sparked growing efforts in building and benchmarking AI’s ToM capability, yet little is known about how such capability could translate into the design and experience of everyday user-facing AI products and services. We conducted 13 co-design sessions with 26 U.S.-based AI practitioners to envision, reflect, and distill design recommendations for ToM-enabled everyday AI products and services that are both future-looking and grounded in the realities of AI design and development practices. Analysis revealed three interrelated design recommendations: ToM-enabled AI should 1) be situated in the social context that shape users’ mental states, 2) be responsive to the dynamic nature of mental states, and 3) be attuned to subjective individual differences. We surface design tensions within each recommendation that reveal a broader gap between practitioners’ envisioned futures of ToM-enabled AI and the realities of current AI design and development practices. These findings point toward the need to move beyond static, inference-driven approach to ToM and toward designing ToM as a pervasive capability that supports continuous human-AI interaction loops.
[HC-25] Same Feedback, Different Source: How AI vs. Human Feedback Shapes Learner Engagement
Quick Read: This paper studies how learners' beliefs about the source of feedback (AI or a human teaching assistant) shape their engagement and evaluations in hybrid AI-human instructional systems. The key to the solution is a creative coding interface that isolates source attribution while holding content constant: all participants receive identical LLM-generated feedback, but half attribute it to AI and half to a human TA. Although the feedback was identical, feedback attributed to the human TA elicited significantly more time and effort (d = 0.88-1.56), and evaluations had different predictors by source: ratings of AI-attributed feedback were predicted by prior trust in AI (r = 0.85), while ratings of TA-attributed feedback were driven by perceived genuineness (r = 0.65). The design cleanly exposes the central role of perceived feedback source in educational interaction.
Link: https://arxiv.org/abs/2602.11311
Authors: Caitlin Morris,Pattie Maes
Affiliations: MIT Media Lab
Subjects: Human-Computer Interaction (cs.HC)
Comments: 7 pages, 5 figures
Abstract:When learners receive feedback, what they believe about its source may shape how they engage with it. As AI is used alongside human instructors, understanding these attribution effects is essential for designing effective hybrid AI-human educational systems. We designed a creative coding interface that isolates source attribution while controlling for content: all participants receive identical LLM-generated feedback, but half see it attributed to AI and half to a human teaching assistant (TA). We found two key results. First, perceived feedback source affected engagement: learners in the TA condition spent significantly more time and effort (d = 0.88-1.56) despite receiving identical feedback. Second, perceptions differed: AI-attributed feedback ratings were predicted by prior trust in AI (r = 0.85), while TA-attributed ratings were predicted by perceived genuineness (r = 0.65). These findings suggest that feedback source shapes both engagement and evaluation, with implications for hybrid educational system design.
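For reference, the reported effect sizes are standardized mean differences; a quick sketch of that computation on made-up engagement times (minutes on task per condition) looks like this:

```python
# Cohen's d with a pooled standard deviation, on illustrative data.
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference between two independent samples."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) +
                  (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

ta_minutes = np.array([14.2, 16.1, 18.5, 15.3, 17.9])   # TA-attributed condition
ai_minutes = np.array([10.1, 11.4, 12.2, 9.8, 13.0])    # AI-attributed condition
print(f"d = {cohens_d(ta_minutes, ai_minutes):.2f}")
```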
[HC-26] How to check in continually over 4000 days on an online learning platform? An empirical experience and a practical solution ICDE
Quick Read: This paper addresses users' difficulty in sustaining daily check-ins on online English learning platforms: the study's questionnaire finds this failure to be widespread, mainly due to demotivation, forgetfulness, boredom, and lack of time. Drawing on the author's own record of over 4,000 consecutive daily check-ins on Shanbay, China's leading English learning platform, the paper proposes the GILT method as a practical solution. Its key idea is building a sustainable learning habit through systematic strategies emphasizing goal setting, timely feedback, self-motivation, and habit consolidation, thereby improving sustained engagement and perseverance; the core logic transfers to other learning platforms and domains.
Link: https://arxiv.org/abs/2602.11249
Authors: Jialiang Lin
Affiliations: unknown
Subjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: Please cite the ICDEL version
Abstract:The check-in service is often provided as an incentive system by online learning platforms to help users establish a learning routine and achieve a sense of accomplishment. However, according to the questionnaire conducted in this study, 82.5% of users of online English learning platforms that feature a check-in service have failed to maintain the daily check-in behavior for long-term language learning, mainly due to demotivation, forgetfulness, boredom, and insufficient time. As a language learner, I have empirical experience in maintaining a record of over 4,000 daily check-ins on China's leading online English learning platform, Shanbay. In the meantime, I have been constantly exploring a practical solution that helps other users cultivate the perseverance to follow through on the learning routine. In this paper, I systematically introduce this practical solution, the GILT method, and its instructions. The experience and solution for perseverance development are based on Shanbay, but they can be applied to other learning platforms for different purposes.
[HC-27] DiSCoKit: An Open-Source Toolkit for Deploying Live LLM Experiences in Survey Research
Quick Read: This paper addresses the technical and practical challenges of deploying live large language model (LLM) experiences through online survey platforms in social-science research, such as the difficulty of logging chat data and the limited flexibility in manipulating AI behavior for experimental designs. The key to the solution is DiSCoKit, an open-source toolkit that integrates and invokes cloud-hosted LLMs (e.g., delivered via Azure) through JavaScript-enabled survey platforms such as Qualtrics, giving researchers a stable, controllable, and extensible way to present AI stimuli online.
Link: https://arxiv.org/abs/2602.11230
Authors: Jaime Banks,Jon Stromer-Galley,Samiksha Singh,Collin Capano
Affiliations: unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Advancing social-scientific research of human-AI interaction dynamics and outcomes often requires researchers to deliver experiences with live large-language models (LLMs) to participants through online survey platforms. However, technical and practical challenges (from logging chat data to manipulating AI behaviors for experimental designs) often inhibit survey-based deployment of AI stimuli. We developed DiSCoKit–an open-source toolkit for deploying live LLM experiences (e.g., ones based on models delivered through Microsoft Azure portal) through JavaScript-enabled survey platforms (e.g., Qualtrics). This paper introduces that toolkit, explaining its scientific impetus, describes its architecture and operation, as well as its deployment possibilities and limitations.
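DiSCoKit itself is JavaScript embedded in survey platforms, so the following is only a hedged Python sketch of the kind of Azure-hosted chat call and chat-data logging such a toolkit wraps, using the openai SDK's AzureOpenAI client; the endpoint, key, and deployment name are placeholders, not values from the toolkit:

```python
# Hedged sketch of an Azure-hosted chat call plus a loggable chat transcript.
import json
from openai import AzureOpenAI  # pip install openai

client = AzureOpenAI(
    azure_endpoint="https://example.openai.azure.com",  # placeholder endpoint
    api_key="REPLACE_ME",                               # placeholder key
    api_version="2024-02-01",
)
history = [{"role": "user", "content": "Hello, who am I chatting with?"}]
resp = client.chat.completions.create(model="my-deployment", messages=history)
history.append({"role": "assistant", "content": resp.choices[0].message.content})
print(json.dumps(history, indent=2))  # transcript to attach to the survey record
```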
[HC-28] Patient Digital Twins for Chronic Care: Technical Hurdles Lessons Learned and the Road Ahead
Quick Read: This paper addresses the fragmentation and reactivity of current health systems in chronic disease management, proposing Patient Medical Digital Twins (PMDTs) to enable personalized, proactive chronic care. The key to the solution is using ontology-driven modeling and federated analytics to integrate clinical, genomic, lifestyle, and quality-of-life data into a continuously updated digital counterpart of the patient, while relying on the HL7 FHIR and OMOP standards for interoperability, embedding privacy governance, and designing intuitive clinician-facing interfaces, thereby moving PMDTs from concept toward trustworthy, adaptive chronic-care ecosystems.
Link: https://arxiv.org/abs/2602.11223
Authors: Micheal P. Papazoglou,Bernd J. Krämer,Mira Raheem,Amal Elgammal
Affiliations: unknown
Subjects: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)
Comments: Feature article on Patient Medical Digital Twins, under review at IEEE Software
Abstract:Chronic diseases constitute the principal burden of morbidity, mortality, and healthcare costs worldwide, yet current health systems remain fragmented and predominantly reactive. Patient Medical Digital Twins (PMDTs) offer a paradigm shift: holistic, continuously updated digital counterparts of patients that integrate clinical, genomic, lifestyle, and quality-of-life data. We report early implementations of PMDTs via ontology-driven modeling and federated analytics pilots. Insights from the QUALITOP oncology study and a distributed AI platform confirm both feasibility and challenges: aligning with HL7 FHIR and OMOP standards, embedding privacy governance, scaling federated queries, and designing intuitive clinician interfaces. We also highlight technical gains, such as automated reasoning over multimodal blueprints and predictive analytics for patient outcomes. By reflecting on these experiences, we outline actionable insights for software engineers and identify opportunities, such as DSLs and model-driven engineering, to advance PMDTs toward trustworthy, adaptive chronic care ecosystems.
[HC-29] Althea: Human-AI Collaboration for Fact-Checking and Critical Reasoning
Quick Read: This paper addresses the tension between scalability and epistemic trustworthiness in fact-checking for the web's information ecosystem: automated methods are efficient but opaque, while human verification is slow and inconsistent. The key to the solution is Althea, a retrieval-augmented system that integrates question generation, evidence retrieval, and structured reasoning to support user-driven evaluation of online claims. Its contribution is validated across three interaction modes with different degrees of scaffolding (Exploratory, Summary, and Self-search), showing that lasting gains in users' accuracy and confidence depend not merely on effort or exposure but on how cognitive work is structured and internalized.
Link: https://arxiv.org/abs/2602.11161
Authors: Svetlana Churina,Kokil Jaidka,Anab Maulana Barik,Harshit Aneja,Cai Yang,Wynne Hsu,Mong Li Lee
Affiliations: unknown
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments:
Abstract:The web’s information ecosystem demands fact-checking systems that are both scalable and epistemically trustworthy. Automated approaches offer efficiency but often lack transparency, while human verification remains slow and inconsistent. We introduce Althea, a retrieval-augmented system that integrates question generation, evidence retrieval, and structured reasoning to support user-driven evaluation of online claims. On the AVeriTeC benchmark, Althea achieves a Macro-F1 of 0.44, outperforming standard verification pipelines and improving discrimination between supported and refuted claims. We further evaluate Althea through a controlled user study and a longitudinal survey experiment (N = 642), comparing three interaction modes that vary in the degree of scaffolding: an Exploratory mode with guided reasoning, a Summary mode providing synthesized verdicts, and a Self-search mode that offers procedural guidance without algorithmic intervention. Results show that guided interaction produces the strongest immediate gains in accuracy and confidence, while self-directed search yields the most persistent improvements over time. This pattern suggests that performance gains are not driven solely by effort or exposure, but by how cognitive work is structured and internalized.
[HC-30] Explaining AI Without Code: A User Study on Explainable AI NEURIPS-25
Quick Read: This paper addresses the absence of explainable AI (XAI) in no-code machine learning platforms: in sensitive domains such as healthcare and finance, transparency of automated decisions is essential, yet existing XAI methods usually require technical expertise that non-experts lack. The key to the solution is a human-centered XAI module integrated into the open-source no-code platform DashAI, combining three complementary explanation techniques, Partial Dependence Plots (PDP), Permutation Feature Importance (PFI), and KernelSHAP, within the tabular classification workflow, improving novices' comprehension and trust while preserving the depth experts need.
Link: https://arxiv.org/abs/2602.11159
Authors: Natalia Abarca,Andrés Carvallo,Claudia López Moncada,Felipe Bravo-Marquez
Affiliations: University of Chile; CENIA - National Center for Artificial Intelligence; Universidad Técnica Federico Santa María; IMFD - Millennium Institute for Foundational Research on Data
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: LatinX in AI Workshop @ NeurIPS-25
Abstract:The increasing use of Machine Learning (ML) in sensitive domains such as healthcare, finance, and public policy has raised concerns about the transparency of automated decisions. Explainable AI (XAI) addresses this by clarifying how models generate predictions, yet most methods demand technical expertise, limiting their value for novices. This gap is especially critical in no-code ML platforms, which seek to democratize AI but rarely include explainability. We present a human-centered XAI module in DashAI, an open-source no-code ML platform. The module integrates three complementary techniques, which are Partial Dependence Plots (PDP), Permutation Feature Importance (PFI), and KernelSHAP, into DashAI's workflow for tabular classification. A user study (N = 20; ML novices and experts) evaluated usability and the impact of explanations. Results show: (i) high task success (≥ 80%) across all explainability tasks; (ii) novices rated explanations as useful, accurate, and trustworthy on the Explanation Satisfaction Scale (ESS, Cronbach's α = 0.74, a measure of internal consistency), while experts were more critical of sufficiency and completeness; and (iii) explanations improved perceived predictability and confidence on the Trust in Automation scale (TiA, α = 0.60), with novices showing higher trust than experts. These findings highlight a central challenge for XAI in no-code ML: making explanations both accessible to novices and sufficiently detailed for experts.
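The three techniques named in the abstract all have standard implementations; the sketch below applies them to a generic scikit-learn classifier on a stock dataset as a stand-in, not as DashAI's actual module code:

```python
# PDP, PFI, and KernelSHAP on a generic tabular classifier (illustrative only).
import shap  # pip install shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import PartialDependenceDisplay, permutation_importance

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Permutation Feature Importance (PFI)
pfi = permutation_importance(model, X, y, n_repeats=10, random_state=0)

# Partial Dependence Plot (PDP) for the most important feature
top = int(pfi.importances_mean.argmax())
PartialDependenceDisplay.from_estimator(model, X, [top])

# KernelSHAP on a small background sample (it is computationally expensive)
explainer = shap.KernelExplainer(model.predict_proba, X.iloc[:50])
shap_values = explainer.shap_values(X.iloc[:5])
```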
[HC-31] Methodological Variation in Studying Staff and Student Perceptions of AI
Quick Read: This paper addresses the inconsistent results that different research methods produce when comparing students' and staff's perceptions of artificial intelligence (AI). The key to the solution is a multi-pronged qualitative data collection and analysis strategy: gathering two forms of qualitative data (standalone open comments and structured focus groups) and applying both sentiment/stance analysis and thematic analysis to each source. The results show that a single measure, such as an overall positivity/negativity score, can mask content detail, whereas in-depth thematic analysis more accurately identifies both similarities and differences between students and staff, providing institutions with a more interpretable methodological basis for assessing AI-related attitudes.
Link: https://arxiv.org/abs/2602.11158
Authors: Juliana Gerard,Morgan Macleod,Kelly Norwood,Aisling Reid,Muskaan Singh
Affiliations: unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 29 pages, 3 figures
Abstract:In this paper, we compare methodological approaches for comparing student and staff perceptions, and ask: how much do these measures vary across different approaches? We focus on the case of AI perceptions, which are generally assessed via a single quantitative or qualitative measure, or with a mixed methods approach that compares two distinct data sources - e.g. a quantitative questionnaire with qualitative comments. To compare different approaches, we collect two forms of qualitative data: standalone comments and structured focus groups. We conduct two analyses for each data source: with a sentiment and stance analysis, we measure overall negativity/positivity of the comments and focus group conversations, respectively. Meanwhile, word clouds from the comments and a thematic analysis of the focus groups provide further detail on the content of this qualitative data - particularly the thematic analysis, which includes both similarities and differences between students and staff. We show that different analyses can produce different results - for a single data source. This variation stems from the construct being evaluated - an overall measure of positivity/negativity can produce a different picture from more detailed content-based analyses. We discuss the implications of this variation for institutional contexts, and for the comparisons from previous studies.
[HC-32] The State's Politics of “Fake Data”
Quick Read: This paper challenges the conventional ideal of data governance in which records are expected to mirror an idealized ground truth and deviations are treated as failure or fakery, a view that overlooks how state institutions accomplish institutional work through imprecise data. The key argument is that the "fakeness" of data is relational (context-dependent), processual (emerging through workflows), and performative: such seemingly fake data are not mere errors but "useful fictions" jointly shaped by bureaucratic practice, organizational goals, and institutional logics. The authors advocate centering fitness-for-purpose in data assessment and making the politics of data visible, contestable, and accountable, reframing data governance in sociotechnical systems.
Link: https://arxiv.org/abs/2602.10944
Authors: Chuncheng Liu,Danah Boyd
Affiliations: Northeastern University; Cornell University
Subjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 13 pages, 2 figures
Abstract:Data have power. As such, most discussions of data presume that records should mirror some idealized ground truth. Deviations are viewed as failure. Drawing on two ethnographic studies of state data-making, in a Chinese street-level bureaucratic agency and at the US Census Bureau, we show how seemingly "fake" state data perform institutional work. We map four moments in which actors negotiate between representational accuracy and organizational imperatives: creation, correction, collusion, and augmentation. Bureaucrats routinely privilege what data do over what they represent, creating fictions that serve civil servants' self-interest and enable constrained administrations. We argue that the "fakeness" of state data is relational (context dependent), processual (emerging through workflows), and performative (brought into being through labeling and practice). We urge practitioners to center fitness-for-purpose in assessments of data and contextual governance. Rather than chasing impossible representational accuracy, sociotechnical systems should render the politics of useful fictions visible, contestable, and accountable.
Computer Vision
[CV-0] Stroke of Surprise: Progressive Semantic Illusions in Vector Sketching
Quick Read: This paper tackles progressive semantic illusions within the drawing process of a single vector sketch: strokes are added one by one so that the same sketch reads as distinctly different concepts at different stages (e.g., a duck turning into a sheep). The core challenge is the "dual constraint": the initial strokes must form a coherent object (e.g., a duck) while simultaneously serving as the structural foundation for generating a second concept (e.g., a sheep). The key to the solution is a sequence-aware joint optimization framework driven by a dual-branch Score Distillation Sampling (SDS) mechanism, which dynamically adjusts prefix strokes to discover a "common structural subspace" valid for both targets, together with a novel Overlay Loss that enforces spatial complementarity so structures integrate rather than occlude, markedly improving recognizability and illusion strength.
Link: https://arxiv.org/abs/2602.12280
Authors: Huai-Hsun Cheng,Siang-Ling Zhang,Yu-Lun Liu
Affiliations: National Yang Ming Chiao Tung University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL Code: this https URL
Abstract:Visual illusions traditionally rely on spatial manipulations such as multi-view consistency. In this work, we introduce Progressive Semantic Illusions, a novel vector sketching task where a single sketch undergoes a dramatic semantic transformation through the sequential addition of strokes. We present Stroke of Surprise, a generative framework that optimizes vector strokes to satisfy distinct semantic interpretations at different drawing stages. The core challenge lies in the “dual-constraint”: initial prefix strokes must form a coherent object (e.g., a duck) while simultaneously serving as the structural foundation for a second concept (e.g., a sheep) upon adding delta strokes. To address this, we propose a sequence-aware joint optimization framework driven by a dual-branch Score Distillation Sampling (SDS) mechanism. Unlike sequential approaches that freeze the initial state, our method dynamically adjusts prefix strokes to discover a “common structural subspace” valid for both targets. Furthermore, we introduce a novel Overlay Loss that enforces spatial complementarity, ensuring structural integration rather than occlusion. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art baselines in recognizability and illusion strength, successfully expanding visual anagrams from the spatial to the temporal dimension. Project page: this https URL
[CV-1] UniT: Unified Multimodal Chain-of-Thought Test-time Scaling
Quick Read: This paper addresses unified multimodal models' lack of iterative reasoning: existing models typically run a single forward pass and cannot improve by stepwise verification, subgoal decomposition, and output correction. The core solution is UniT, a framework that combines agentic data synthesis, unified model training, and flexible test-time inference so that a single model can exhibit cognitive behaviors (verification, subgoal decomposition, and content memory) across multiple reasoning rounds, substantially improving multimodal understanding and generation. Key findings include that unified models trained on short reasoning trajectories generalize to longer inference chains, and that sequential chain-of-thought is more scalable and compute-efficient than parallel sampling.
Link: https://arxiv.org/abs/2602.12279
Authors: Leon Liangyu Chen,Haoyu Ma,Zhipeng Fan,Ziqi Huang,Animesh Sinha,Xiaoliang Dai,Jialiang Wang,Zecheng He,Jianwei Yang,Chunyuan Li,Junzhe Sun,Chu Wang,Serena Yeung-Levy,Felix Juefei-Xu
Affiliations: Stanford University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. Many multimodal tasks, especially those involving complex spatial compositions, multiple interacting objects, or evolving instructions, require decomposing instructions, verifying intermediate results, and making iterative corrections. While test-time scaling (TTS) has demonstrated that allocating additional inference compute for iterative reasoning substantially improves language model performance, extending this paradigm to unified multimodal models remains an open challenge. We introduce UniT, a framework for multimodal chain-of-thought test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds. UniT combines agentic data synthesis, unified model training, and flexible test-time inference to elicit cognitive behaviors including verification, subgoal decomposition, and content memory. Our key findings are: (1) unified models trained on short reasoning trajectories generalize to longer inference chains at test time; (2) sequential chain-of-thought reasoning provides a more scalable and compute-efficient TTS strategy than parallel sampling; (3) training on generation and editing trajectories improves out-of-distribution visual reasoning. These results establish multimodal test-time scaling as an effective paradigm for advancing both generation and understanding in unified models.
[CV-2] MonarchRT: Efficient Attention for Real-Time Video Generation
Quick Read: This paper addresses the quadratic-cost bottleneck of 3D self-attention in diffusion transformers for real-time video generation, particularly in few-step, autoregressive regimes where errors compound and each step must carry more information, causing existing sparse-attention approximations to fail. The key to the solution is Monarch-RT, a structured attention parameterization that factorizes attention with Monarch matrices: carefully aligned block structure and an extended tiled Monarch parameterization preserve expressivity while improving computational efficiency, and finetuning plus custom Triton kernels reduce the parameterization overhead. The result is up to 95% attention sparsity with no quality loss, substantial kernel speedups over the FlashAttention family across GPUs, and, for the first time, true real-time video generation with the Self-Forcing model at 16 FPS on a single RTX 5090.
Link: https://arxiv.org/abs/2602.12271
Authors: Krish Agarwal,Zhuoming Chen,Cheng Luo,Yongqi Chen,Haizhong Zheng,Xun Huang,Atri Rudra,Beidi Chen
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Real-time video generation with Diffusion Transformers is bottlenecked by the quadratic cost of 3D self-attention, especially in real-time regimes that are both few-step and autoregressive, where errors compound across time and each denoising step must carry substantially more information. In this setting, we find that prior sparse-attention approximations break down, despite showing strong results for bidirectional, many-step diffusion. Specifically, we observe that video attention is not reliably sparse, but instead combines pronounced periodic structure driven by spatiotemporal position with dynamic, sparse semantic correspondences and dense mixing, exceeding the representational capacity of even oracle top-k attention. Building on this insight, we propose Monarch-RT, a structured attention parameterization for video diffusion models that factorizes attention using Monarch matrices. Through appropriately aligned block structure and our extended tiled Monarch parameterization, we achieve high expressivity while preserving computational efficiency. We further overcome the overhead of parameterization through finetuning, with custom Triton kernels. We first validate the high efficacy of Monarch-RT over existing sparse baselines designed only for bidirectional models. We further observe that Monarch-RT attains up to 95% attention sparsity with no loss in quality when applied to the state-of-the-art model Self-Forcing, making Monarch-RT a pioneering work on highly-capable sparse attention parameterization for real-time video generation. Our optimized implementation outperforms FlashAttention-2, FlashAttention-3, and FlashAttention-4 kernels on Nvidia RTX 5090, H100, and B200 GPUs respectively, providing kernel speedups in the range of 1.4-11.8X. This enables us, for the first time, to achieve true real-time video generation with Self-Forcing at 16 FPS on a single RTX 5090.
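To see the structure the paper builds on, here is a minimal sketch of a Monarch-factorized linear map: two stacks of block-diagonal blocks separated by a fixed permutation (implemented as a transpose), giving roughly O(n√n) cost instead of O(n²). This illustrates the factorization only, not the authors' tiled variant or fused Triton kernels:

```python
# Monarch-factorized matmul sketch: block-diagonal, permute, block-diagonal.
import torch

def monarch_matmul(x, B1, B2):
    """x: (..., m*m); B1, B2: (m, m, m), i.e. m blocks of size m x m each."""
    m = B1.shape[0]
    batch = x.shape[:-1]
    z = x.reshape(*batch, m, m)                    # split dim n = m * m
    z = torch.einsum("bij,...bj->...bi", B1, z)    # first block-diagonal multiply
    z = z.transpose(-1, -2)                        # the fixed "monarch" permutation
    z = torch.einsum("bij,...bj->...bi", B2, z)    # second block-diagonal multiply
    z = z.transpose(-1, -2)
    return z.reshape(*batch, m * m)

m = 8                                              # n = 64
x = torch.randn(4, m * m)
B1, B2 = torch.randn(m, m, m), torch.randn(m, m, m)
y = monarch_matmul(x, B1, B2)
print(y.shape)  # torch.Size([4, 64]); cost scales as n*sqrt(n) rather than n^2
```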
[CV-3] Energy-Aware Spike Budgeting for Continual Learning in Spiking Neural Networks for Neuromorphic Vision
Quick Read: This paper addresses catastrophic forgetting in continual learning for neuromorphic vision, focusing on jointly optimizing accuracy and power on both event-based and frame-based camera data. The key to the solution is an energy-aware spike budgeting framework that combines experience replay, learnable leaky integrate-and-fire (LIF) neuron parameters, and an adaptive spike scheduler enforcing dataset-specific energy constraints during training. On frame-based datasets, spike budgeting acts as a sparsity-inducing regularizer, improving accuracy while cutting spike rates by up to 47%; on event-based datasets, controlled budget relaxation yields accuracy gains of up to 17.45 percentage points with minimal computational overhead, substantially improving the practicality of neuromorphic vision systems in dynamic environments.
Link: https://arxiv.org/abs/2602.12236
Authors: Anika Tabassum Meem,Muntasir Hossain Nadid,Md Zesun Ahmed Mia
Affiliations: University of Liberal Arts Bangladesh; Pennsylvania State University
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Neuromorphic vision systems based on spiking neural networks (SNNs) offer ultra-low-power perception for event-based and frame-based cameras, yet catastrophic forgetting remains a critical barrier to deployment in continually evolving environments. Existing continual learning methods, developed primarily for artificial neural networks, seldom jointly optimize accuracy and energy efficiency, with particularly limited exploration on event-based datasets. We propose an energy-aware spike budgeting framework for continual SNN learning that integrates experience replay, learnable leaky integrate-and-fire neuron parameters, and an adaptive spike scheduler to enforce dataset-specific energy constraints during training. Our approach exhibits modality-dependent behavior: on frame-based datasets (MNIST, CIFAR-10), spike budgeting acts as a sparsity-inducing regularizer, improving accuracy while reducing spike rates by up to 47%; on event-based datasets (DVS-Gesture, N-MNIST, CIFAR-10-DVS), controlled budget relaxation enables accuracy gains up to 17.45 percentage points with minimal computational overhead. Across five benchmarks spanning both modalities, our method demonstrates consistent performance improvements while minimizing dynamic power consumption, advancing the practical viability of continual learning in neuromorphic vision systems.
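A minimal sketch of the basic ingredients, a leaky integrate-and-fire layer with a learnable leak, a surrogate spike gradient, and a spike-budget penalty added to the task loss, is shown below; the threshold, surrogate slope, and budget value are illustrative, and the paper's adaptive scheduler is not reproduced:

```python
# LIF layer with learnable leak plus a spike-budget penalty (illustrative).
import torch
import torch.nn as nn

class SpikeFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, v):
        ctx.save_for_backward(v)
        return (v > 0).float()                 # hard threshold in the forward pass
    @staticmethod
    def backward(ctx, g):
        (v,) = ctx.saved_tensors
        sig = torch.sigmoid(5.0 * v)
        return g * 5.0 * sig * (1 - sig)       # sigmoid surrogate derivative

class LIFLayer(nn.Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        self.fc = nn.Linear(n_in, n_out)
        self.beta = nn.Parameter(torch.tensor(0.9))   # learnable leak factor

    def forward(self, x_seq):                  # x_seq: (time, batch, n_in)
        v = torch.zeros(x_seq.shape[1], self.fc.out_features)
        spikes = []
        for x in x_seq:
            v = self.beta.clamp(0, 1) * v + self.fc(x)
            s = SpikeFn.apply(v - 1.0)         # threshold = 1
            v = v - s                          # soft reset after a spike
            spikes.append(s)
        return torch.stack(spikes)

layer = LIFLayer(10, 4)
out = layer(torch.randn(20, 8, 10))
budget = 0.1                                   # target mean spike rate
spike_penalty = (out.mean() - budget).clamp(min=0) ** 2   # penalize over-budget firing
```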
[CV-4] Towards On-Policy SFT: Distribution Discriminant Theory and its Applications in LLM Training
Quick Read: This paper addresses why supervised fine-tuning (SFT) generalizes worse than reinforcement learning (RL): RL exploits on-policy data, while SFT typically relies on offline data, so the training distribution diverges from the model's own distribution. The key to the solution is the proposed Distribution Discriminant Theory (DDT), which quantifies the alignment between a data distribution and the model-induced distribution, and two complementary techniques built on it: (i) In-Distribution Fine-Tuning (IDFT), a loss-level method that improves SFT's generalization, and (ii) Hinted Decoding, a data-level technique that re-aligns the training corpus with the model's current distribution. Experiments show the framework matches the generalization of prominent offline RL algorithms (e.g., DPO and SimPO) while retaining SFT's computational efficiency, offering a practical alternative where RL is infeasible.
Link: https://arxiv.org/abs/2602.12222
Authors: Miaosen Zhang,Yishan Liu,Shuxia Lin,Xu Yang,Qi Dai,Chong Luo,Weihao Jiang,Peng Hou,Anxiang Zeng,Xin Geng,Baining Guo
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Supervised fine-tuning (SFT) is computationally efficient but often yields inferior generalization compared to reinforcement learning (RL). This gap is primarily driven by RL's use of on-policy data. We propose a framework to bridge this chasm by enabling On-Policy SFT. We first present Distribution Discriminant Theory (DDT), which explains and quantifies the alignment between data and the model-induced distribution. Leveraging DDT, we introduce two complementary techniques: (i) In-Distribution Finetuning (IDFT), a loss-level method to enhance the generalization ability of SFT, and (ii) Hinted Decoding, a data-level technique that can re-align the training corpus to the model's distribution. Extensive experiments demonstrate that our framework achieves generalization performance on par with prominent offline RL algorithms, including DPO and SimPO, while maintaining the efficiency of an SFT pipeline. The proposed framework thus offers a practical alternative in domains where RL is infeasible. We open-source the code here: this https URL
[CV-5] Best of Both Worlds: Multimodal Reasoning and Generation via Unified Discrete Flow Matching
Quick Read: This paper addresses the performance bottlenecks caused by conflicting objectives and entangled representations in multimodal understanding, generation, and editing. The core solution is UniDFlow, a unified discrete flow-matching framework that decouples understanding and generation via task-specific low-rank adapters, avoiding objective interference and representation entanglement, and introduces reference-based multimodal preference alignment that optimizes relative outcomes under identical conditioning to improve faithfulness and controllability, all without large-scale retraining and with strong zero-shot generalization.
Link: https://arxiv.org/abs/2602.12221
Authors: Onkar Susladkar,Tushar Prakash,Gayatri Deshmukh,Kiet A. Nguyen,Jiaxun Zhang,Adheesh Juvekar,Tianshu Bao,Lin Chai,Sparsh Mittal,Inderjit S Dhillon,Ismini Lourentzou
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We propose UniDFlow, a unified discrete flow-matching framework for multimodal understanding, generation, and editing. It decouples understanding and generation via task-specific low-rank adapters, avoiding objective interference and representation entanglement, while a novel reference-based multimodal preference alignment optimizes relative outcomes under identical conditioning, improving faithfulness and controllability without large-scale retraining. UniDFlow achieves SOTA performance across eight benchmarks and exhibits strong zero-shot generalization to tasks including inpainting, in-context image generation, reference-based editing, and compositional generation, despite no explicit task-specific training.
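The task-specific low-rank adapter mechanism can be sketched generically as a frozen base linear layer plus a trainable rank-r update, with one such adapter per task; this shows the mechanism, not UniDFlow's actual module layout:

```python
# Generic LoRA-style adapter: frozen base weight plus a trainable rank-r update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # base weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(512, 512))
y = layer(torch.randn(2, 512))    # equals the base output until B is trained
```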
[CV-6] DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing
Quick Read: This paper addresses the prohibitive training cost and deployment footprint of unified multimodal generation and editing models that rely on massive parameter counts (e.g., over 10B). The core solution is DeepGen 1.0, a 5B-parameter unified model whose key innovation is the Stacked Channel Bridging (SCB) framework: it extracts hierarchical features from multiple VLM layers and fuses them with learnable "think tokens" to give the generative backbone structured, reasoning-rich guidance, strengthening a compact model's semantic understanding and fine-grained control. A three-stage data-centric training strategy (alignment pre-training, joint supervised fine-tuning, and reinforcement learning with MR-GRPO) further improves generation quality and human-preference alignment while keeping training stable and free of visual artifacts.
Link: https://arxiv.org/abs/2602.12205
Authors: Dianyi Wang,Ruihang Li,Feng Han,Chaofan Ma,Wei Song,Siyuan Wang,Yibin Wang,Yi Xin,Hongjian Liu,Zhixiong Zhang,Shengyuan Ding,Tianhang Wang,Zhenglin Cheng,Tao Lin,Cheng Jin,Kaicheng Yu,Jingjing Chen,Wenjie Wang,Zhongyu Wei,Jiaqi Wang
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Current unified multimodal models for image generation and editing typically rely on massive parameter scales (e.g., >10B), entailing prohibitive training costs and deployment footprints. In this work, we present DeepGen 1.0, a lightweight 5B unified model that achieves comprehensive capabilities competitive with or surpassing much larger counterparts. To overcome the limitations of compact models in semantic understanding and fine-grained control, we introduce Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable ‘think tokens’ to provide the generative backbone with structured, reasoning-rich guidance. We further design a data-centric training strategy spanning three progressive stages: (1) Alignment Pre-training on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations, (2) Joint Supervised Fine-tuning on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities, and (3) Reinforcement Learning with MR-GRPO, which leverages a mixture of reward functions and supervision signals, resulting in substantial gains in generation quality and alignment with human preferences, while maintaining stable training progress and avoiding visual artifacts. Despite being trained on only ~50M samples, DeepGen 1.0 achieves leading performance across diverse benchmarks, surpassing the 80B HunyuanImage by 28% on WISE and the 27B Qwen-Image-Edit by 37% on UniREditBench. By open-sourcing our training code, weights, and datasets, we provide an efficient, high-performance alternative to democratize unified multimodal research.
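A hedged sketch of the Stacked Channel Bridging idea as the abstract describes it: hidden states from several VLM layers are concatenated with learnable "think tokens" and projected into the generator's conditioning space. The layer count and dimensions below are assumptions, not DeepGen's configuration:

```python
# Sketch of multi-layer feature fusion with learnable think tokens.
import torch
import torch.nn as nn

class StackedChannelBridge(nn.Module):
    def __init__(self, n_layers=4, vlm_dim=1024, gen_dim=768, n_think=16):
        super().__init__()
        self.think = nn.Parameter(torch.randn(n_think, vlm_dim) * 0.02)
        self.proj = nn.Linear(n_layers * vlm_dim, gen_dim)

    def forward(self, layer_states):          # list of (batch, seq, vlm_dim)
        b = layer_states[0].shape[0]
        think = self.think.expand(b, -1, -1)  # shared tokens, broadcast per batch
        stacked = [torch.cat([h, think], dim=1) for h in layer_states]
        fused = torch.cat(stacked, dim=-1)    # stack layer features channel-wise
        return self.proj(fused)               # guidance tokens for the generator

bridge = StackedChannelBridge()
states = [torch.randn(2, 77, 1024) for _ in range(4)]  # 4 VLM layers (dummy)
cond = bridge(states)                                  # (2, 77 + 16, 768)
```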
[CV-7] EO-VAE: Towards A Multi-sensor Tokenizer for Earth Observation Data
Quick Read: This paper addresses the difficulty of compressing Earth observation (EO) data into efficient latent representations for generative models given the diversity of sensors and spectral channels: conventional approaches train a separate tokenizer per modality and cope poorly with heterogeneous multi-source EO data. The key to the solution is EO-VAE, a unified multi-sensor tokenizer based on a variational autoencoder (VAE) that uses dynamic hypernetworks to jointly encode and reconstruct arbitrary channel combinations within a single model, enabling flexible, high-fidelity latent modeling and clearly surpassing the TerraMind tokenizers in reconstruction quality.
Link: https://arxiv.org/abs/2602.12177
Authors: Nils Lehmann,Yi Wang,Zhitong Xiong,Xiaoxiang Zhu
Affiliations: Technical University of Munich (TUM); Munich Center for Machine Learning (MCML)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:State-of-the-art generative image and video models rely heavily on tokenizers that compress high-dimensional inputs into more efficient latent representations. While this paradigm has revolutionized RGB generation, Earth observation (EO) data presents unique challenges due to diverse sensor specifications and variable spectral channels. We propose EO-VAE, a multi-sensor variational autoencoder designed to serve as a foundational tokenizer for the EO domain. Unlike prior approaches that train separate tokenizers for each modality, EO-VAE utilizes a single model to encode and reconstruct flexible channel combinations via dynamic hypernetworks. Our experiments on the TerraMesh dataset demonstrate that EO-VAE achieves superior reconstruction fidelity compared to the TerraMind tokenizers, establishing a robust baseline for latent generative modeling in remote sensing.
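The dynamic-hypernetwork idea can be sketched as a channel-conditional input stem: a learned embedding per spectral band generates the weights of a 1x1 input projection, so one encoder can ingest any subset of bands. The band vocabulary and sizes below are illustrative assumptions, not EO-VAE's architecture:

```python
# Channel-conditional hypernetwork stem for variable spectral channels.
import torch
import torch.nn as nn

class ChannelHyperStem(nn.Module):
    def __init__(self, n_bands=12, emb=32, feat=64):
        super().__init__()
        self.band_emb = nn.Embedding(n_bands, emb)   # one id per known band
        self.hyper = nn.Linear(emb, feat)            # emits per-band 1x1 weights

    def forward(self, x, band_ids):                  # x: (batch, c, h, w)
        w = self.hyper(self.band_emb(band_ids))      # (c, feat), depends on bands
        return torch.einsum("bchw,cf->bfhw", x, w)   # 1x1 projection, any c

stem = ChannelHyperStem()
rgb = stem(torch.randn(2, 3, 64, 64), torch.tensor([0, 1, 2]))    # 3-band input
sar = stem(torch.randn(2, 2, 64, 64), torch.tensor([10, 11]))     # 2-band input
print(rgb.shape, sar.shape)   # both (2, 64, 64, 64) despite different channels
```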
[CV-8] DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation
Quick Read: This paper addresses two core problems in human-centric audio-video generation: existing methods treat reference-based audio-video generation (R2AV), video editing (RV2AV), and audio-driven video animation (RA2V) as isolated objectives without a unified model; and multi-character scenes suffer identity-timbre binding failures and speaker confusion because character identity and voice timbre are not disentangled. The key to the DreamID-Omni framework is threefold: 1) a Symmetric Conditional Diffusion Transformer that fuses heterogeneous conditioning signals via symmetric conditional injection; 2) a dual-level disentanglement strategy, with Synchronized RoPE at the signal level for rigid binding in attention space and Structured Captions at the semantic level for explicit attribute-subject mapping; and 3) a multi-task progressive training scheme that uses weakly-constrained generative priors to regularize strongly-constrained tasks, preventing overfitting and harmonizing objectives. The framework achieves state-of-the-art performance in audio-video consistency and cross-modal alignment.
Link: https://arxiv.org/abs/2602.12160
Authors: Xu Guo,Fulong Ye,Qichao Sun,Liyang Chen,Bingchuan Li,Pengze Zhang,Jiawei Liu,Songtao Zhao,Qian He,Xiangwang Hou
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project: this https URL
Abstract:Recent advancements in foundation models have revolutionized joint audio-video generation. However, existing approaches typically treat human-centric tasks including reference-based audio-video generation (R2AV), video editing (RV2AV) and audio-driven video animation (RA2V) as isolated objectives. Furthermore, achieving precise, disentangled control over multiple character identities and voice timbres within a single framework remains an open challenge. In this paper, we propose DreamID-Omni, a unified framework for controllable human-centric audio-video generation. Specifically, we design a Symmetric Conditional Diffusion Transformer that integrates heterogeneous conditioning signals via a symmetric conditional injection scheme. To resolve the pervasive identity-timbre binding failures and speaker confusion in multi-person scenarios, we introduce a Dual-Level Disentanglement strategy: Synchronized RoPE at the signal level to ensure rigid attention-space binding, and Structured Captions at the semantic level to establish explicit attribute-subject mappings. Furthermore, we devise a Multi-Task Progressive Training scheme that leverages weakly-constrained generative priors to regularize strongly-constrained tasks, preventing overfitting and harmonizing disparate objectives. Extensive experiments demonstrate that DreamID-Omni achieves comprehensive state-of-the-art performance across video, audio, and audio-visual consistency, even outperforming leading proprietary commercial models. We will release our code to bridge the gap between academic research and commercial-grade applications.
[CV-9] xSpot: 3D Texture Enhancement with Spatially-uniform Point Latent Representation
Quick Read: This paper addresses the view inconsistency that plagues mainstream multi-view diffusion pipelines for 3D texture generation, and the limitations of existing representations: UV-based methods distort during unwrapping, while point-based methods couple texture fidelity to geometric density, capping high-resolution texture generation. The key to the TexSpot framework is Texlet, a novel 3D texture representation that merges the geometric expressiveness of point-based 3D textures with the compactness of UV representations. Texlets encode local texture patches with a 2D encoder, aggregate global shape context with a 3D encoder, and reconstruct high-quality patches with a cascaded 3D-to-2D decoder to learn the texture space; a diffusion transformer conditioned on Texlets then refines and enhances textures produced by multi-view diffusion methods, markedly improving visual fidelity, geometric consistency, and robustness.
Link: https://arxiv.org/abs/2602.12157
Authors: Ziteng Lu,Yushuang Wu,Chongjie Ye,Yuda Qiu,Jing Shao,Xiaoyang Guo,Jiaqing Zhou,Tianlei Hu,Kun Zhou,Xiaoguang Han
Affiliations: ByteDance Games; SSE, CUHKSZ; FNii-Shenzhen; Shenzhen University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project page: this https URL
Abstract:High-quality 3D texture generation remains a fundamental challenge due to the view-inconsistency inherent in current mainstream multi-view diffusion pipelines. Existing representations either rely on UV maps, which suffer from distortion during unwrapping, or point-based methods, which tightly couple texture fidelity to geometric density that limits high-resolution texture generation. To address these limitations, we introduce TexSpot, a diffusion-based texture enhancement framework. At its core is Texlet, a novel 3D texture representation that merges the geometric expressiveness of point-based 3D textures with the compactness of UV-based representation. Each Texlet latent vector encodes a local texture patch via a 2D encoder and is further aggregated using a 3D encoder to incorporate global shape context. A cascaded 3D-to-2D decoder reconstructs high-quality texture patches, enabling the Texlet space learning. Leveraging this representation, we train a diffusion transformer conditioned on Texlets to refine and enhance textures produced by multi-view diffusion methods. Extensive experiments demonstrate that TexSpot significantly improves visual fidelity, geometric consistency, and robustness over existing state-of-the-art 3D texture generation and enhancement approaches. Project page: this https URL.
[CV-10] FAIL: Flow Matching Adversarial Imitation Learning for Image Generation
Quick Read: This paper addresses policy drift in post-training of generative models: supervised fine-tuning imitates expert demonstrations but cannot correct drift in unseen states, while preference-optimization methods mitigate it at the cost of expensive preference pairs or reward modeling. The proposed Flow Matching Adversarial Imitation Learning (FAIL) minimizes the divergence between the policy and expert distributions through adversarial training, requiring no explicit reward signal or pairwise comparisons. Its key move is formalizing flow-matching post-training as an imitation learning problem and deriving two algorithms: FAIL-PD exploits differentiable ODE solvers for low-variance pathwise gradients in continuous settings, and FAIL-PG offers a black-box alternative for discrete or compute-constrained settings. Fine-tuning FLUX on only 13,000 demonstrations from Nano Banana Pro achieves competitive results on prompt-following and aesthetics benchmarks; the framework also generalizes to discrete image and video generation and, used as a regularizer, effectively curbs reward hacking.
Link: https://arxiv.org/abs/2602.12155
Authors: Yeyao Ma,Chen Li,Xiaosong Zhang,Han Hu,Weidi Xie
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Post-training of flow matching models-aligning the output distribution with a high-quality target-is mathematically equivalent to imitation learning. While Supervised Fine-Tuning mimics expert demonstrations effectively, it cannot correct policy drift in unseen states. Preference optimization methods address this but require costly preference pairs or reward modeling. We propose Flow Matching Adversarial Imitation Learning (FAIL), which minimizes policy-expert divergence through adversarial training without explicit rewards or pairwise comparisons. We derive two algorithms: FAIL-PD exploits differentiable ODE solvers for low-variance pathwise gradients, while FAIL-PG provides a black-box alternative for discrete or computationally constrained settings. Fine-tuning FLUX with only 13,000 demonstrations from Nano Banana Pro, FAIL achieves competitive performance on prompt following and aesthetic benchmarks. Furthermore, the framework generalizes effectively to discrete image and video generation, and functions as a robust regularizer to mitigate reward hacking in reward-based optimization. Code and data are available at this https URL
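The reward-free adversarial objective at FAIL's core follows the usual imitation-learning recipe: a discriminator learns to separate expert samples from policy samples, and the generator is updated to fool it. The sketch below uses plain tensors where FAIL would differentiate through the flow ODE, so it only shows the shape of the objective:

```python
# One adversarial imitation step: discriminator update, then generator update.
import torch
import torch.nn as nn

D = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))   # discriminator
G = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 16))   # "policy"
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

expert = torch.randn(32, 16)               # stand-in for expert demonstrations
fake = G(torch.randn(32, 8))               # stand-in for policy samples

# Discriminator: expert -> 1, policy -> 0
d_loss = bce(D(expert), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator: non-saturating loss, push policy samples toward "expert"
g_loss = bce(D(fake), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```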
[CV-11] PosterOmni: Generalized Artistic Poster Creation via Task Distillation and Unified Reward Feedback
Quick Read: This paper addresses the compound challenge of image-to-poster generation: preserving concrete visual entities (e.g., people and objects) locally while exercising global design understanding of layout, style, and aesthetic coherence. The key to the PosterOmni framework is unifying the local-editing and global-creation regimes through an efficient data-distillation-reward pipeline: first constructing multi-scenario datasets covering six task types; then distilling knowledge between local and global experts for supervised fine-tuning; and finally applying unified PosterOmni Reward Feedback to jointly align visual-entity fidelity and aesthetic preference across all tasks, enabling end-to-end optimization across the task family.
Link: https://arxiv.org/abs/2602.12127
Authors: Sixiang Chen,Jianyu Lai,Jialin Gao,Hengyu Shi,Zhongying Liu,Tian Ye,Junfeng Luo,Xiaoming Wei,Lei Zhu
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Image-to-poster generation is a high-demand task requiring not only local adjustments but also high-level design understanding. Models must generate text, layout, style, and visual elements while preserving semantic fidelity and aesthetic coherence. The process spans two regimes: local editing, where ID-driven generation, rescaling, filling, and extending must preserve concrete visual entities; and global creation, where layout- and style-driven tasks rely on understanding abstract design concepts. These intertwined demands make image-to-poster a multi-dimensional process coupling entity-preserving editing with concept-driven creation under image-prompt control. To address these challenges, we propose PosterOmni, a generalized artistic poster creation framework that unlocks the potential of a base edit model for multi-task image-to-poster generation. PosterOmni integrates the two regimes, namely local editing and global creation, within a single system through an efficient data-distillation-reward pipeline: (i) constructing multi-scenario image-to-poster datasets covering six task types across entity-based and concept-based creation; (ii) distilling knowledge between local and global experts for supervised fine-tuning; and (iii) applying unified PosterOmni Reward Feedback to jointly align visual entity-preserving and aesthetic preference across all tasks. Additionally, we establish PosterOmni-Bench, a unified benchmark for evaluating both local editing and global creation. Extensive experiments show that PosterOmni significantly enhances reference adherence, global composition quality, and aesthetic harmony, outperforming all open-source baselines and even surpassing several proprietary systems.
[CV-12] Iskra: A System for Inverse Geometry Processing
Quick Read: This paper addresses the difficulty of automatic differentiation (AD) in geometry processing: differentiating through existing geometric algorithms for inverse geometry processing usually requires rewriting them in differentiable form or relying on general-purpose optimization tools, leading to complex implementations, slow runtimes, and high memory cost. The key to the solution is marrying the scatter-gather mesh-processing paradigm with tensor-based workflows and applying the adjoint method to user-specified imperative code to generate an efficient backward pass, so existing algorithms (mean curvature flow, spectral conformal parameterization, geodesic distance computation, and as-rigid-as-possible deformation) can be differentiated without reformulation, with low implementation effort, fast runtimes, and modest memory use, while remaining compatible with mainstream machine learning frameworks.
Link: https://arxiv.org/abs/2602.12105
Authors: Ana Dodik,Ahmed H. Mahmoud,Justin Solomon
Affiliations: unknown
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:We propose a system for differentiating through solutions to geometry processing problems. Our system differentiates a broad class of geometric algorithms, exploiting existing fast problem-specific schemes common to geometry processing, including local-global and ADMM solvers. It is compatible with machine learning frameworks, opening doors to new classes of inverse geometry processing applications. We marry the scatter-gather approach to mesh processing with tensor-based workflows and rely on the adjoint method applied to user-specified imperative code to generate an efficient backward pass behind the scenes. We demonstrate our approach by differentiating through mean curvature flow, spectral conformal parameterization, geodesic distance computation, and as-rigid-as-possible deformation, examining usability and performance on these applications. Our system allows practitioners to differentiate through existing geometry processing algorithms without needing to reformulate them, resulting in low implementation effort, fast runtimes, and lower memory requirements than differentiable optimization tools not tailored to geometry processing.
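The adjoint idea can be sketched on a linear solve A x = b: rather than differentiating through solver iterations, the backward pass solves one transposed system (dL/db = A^{-T} g and dL/dA = -λ xᵀ, with λ the adjoint variable). This mirrors the principle behind such a backward pass, not the system's actual mesh-processing API:

```python
# Adjoint-method gradient for a linear solve, via a custom autograd Function.
import torch

class ImplicitSolve(torch.autograd.Function):
    @staticmethod
    def forward(ctx, A, b):
        x = torch.linalg.solve(A, b)
        ctx.save_for_backward(A, x)
        return x

    @staticmethod
    def backward(ctx, grad_x):
        A, x = ctx.saved_tensors
        lam = torch.linalg.solve(A.transpose(-1, -2), grad_x)  # adjoint system
        grad_A = -lam.unsqueeze(-1) @ x.unsqueeze(-2)          # dL/dA = -lam x^T
        return grad_A, lam                                     # dL/db = lam

A = (torch.eye(5) * 5 + torch.randn(5, 5)).requires_grad_(True)  # well-conditioned
b = torch.randn(5, requires_grad=True)
x = ImplicitSolve.apply(A, b)
x.sum().backward()   # gradients via one extra solve, no unrolled iterations
```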
[CV-13] AssetFormer: Modular 3D Assets Generation with Autoregressive Transformer ICLR2026
【Quick Read】: This paper addresses the digital industry's pressing demand for high-quality, diverse modular 3D assets, especially in user-generated content (UGC) scenarios, where efficiently generating modular 3D models that satisfy specific design constraints is challenging. The key to the solution is AssetFormer, an autoregressive Transformer-based model that adapts module sequencing and decoding techniques from language models to generate structurally sound, composable modular 3D assets end-to-end from textual descriptions, markedly improving generation quality and supporting a range of application scenarios.
Link: https://arxiv.org/abs/2602.12100
Authors: Lingting Zhu,Shengju Qian,Haidi Fan,Jiayu Dong,Zhenchao Jin,Siwei Zhou,Gen Dong,Xin Wang,Lequan Yu
Institutions: The University of Hong Kong; LIGHTSPEED
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICLR 2026. 23 pages, 14 figures
Abstract:The digital industry demands high-quality, diverse modular 3D assets, especially for user-generated content (UGC). In this work, we introduce AssetFormer, an autoregressive Transformer-based model designed to generate modular 3D assets from textual descriptions. Our pilot study leverages real-world modular assets collected from online platforms. AssetFormer tackles the challenge of creating assets composed of primitives that adhere to constrained design parameters for various applications. By innovatively adapting module sequencing and decoding techniques inspired by language models, our approach enhances asset generation quality through autoregressive modeling. Initial results indicate the effectiveness of AssetFormer in streamlining asset creation for professional development and UGC scenarios. This work presents a flexible framework extendable to various types of modular 3D assets, contributing to the broader field of 3D content generation. The code is available at this https URL.
[CV-14] GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning
【Quick Read】: This paper targets the performance bottleneck of vision-language-action (VLA) models in multi-step action prediction, caused by constrained scene understanding and weak future anticipation. The key to the solution is a world model-based reinforcement learning framework: RAMP (Reinforcement leArning via world Model-conditioned Policy) couples a world model pre-trained on large-scale robotic manipulation data with policy optimization, substantially improving cross-task adaptation and long-horizon reliability. Experiments show gains of roughly 30% over the RECAP baseline on challenging tasks such as laundry folding, box packing, and espresso preparation, with failure-free long-horizon manipulation in real-world deployments.
Link: https://arxiv.org/abs/2602.12099
Authors: GigaBrain Team:Boyuan Wang,Chaojun Ni,Guan Huang,Guosheng Zhao,Hao Li,Jie Li,Jindi Lv,Jingyu Liu,Lv Feng,Mingming Yu,Peng Li,Qiuping Deng,Tianze Liu,Xinyu Zhou,Xinze Chen,Xiaofeng Wang,Yang Wang,Yifan Li,Yifei Nie,Yilong Li,Yukun Zhou,Yun Ye,Zhichao Liu,Zheng Zhu
Institutions: GigaAI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL
Abstract:Vision-language-action (VLA) models that directly predict multi-step action chunks from current observations face inherent limitations due to constrained scene understanding and weak future anticipation capabilities. In contrast, video world models pre-trained on web-scale video corpora exhibit robust spatiotemporal reasoning and accurate future prediction, making them a natural foundation for enhancing VLA learning. Therefore, we propose GigaBrain-0.5M*, a VLA model trained via world model-based reinforcement learning. It is built upon GigaBrain-0.5, which is pre-trained on over 10,000 hours of robotic manipulation data and whose intermediate version currently ranks first on the international RoboChallenge benchmark. GigaBrain-0.5M* further integrates world model-based reinforcement learning via RAMP (Reinforcement leArning via world Model-conditioned Policy) to enable robust cross-task adaptation. Empirical results demonstrate that RAMP achieves substantial performance gains over the RECAP baseline, yielding improvements of approximately 30% on challenging tasks including Laundry Folding, Box Packing, and Espresso Preparation. Critically, GigaBrain-0.5M* exhibits reliable long-horizon execution, consistently accomplishing complex manipulation tasks without failure, as validated by real-world deployment videos on our project page (this https URL).
[CV-15] A DMD-Based Adaptive Modulation Method for High Dynamic Range Imaging in High-Glare Environments
【Quick Read】: This paper tackles the saturation problem of conventional CCD/CMOS image sensors under extreme illumination (such as welding arc monitoring and polished metallic surface analysis), where limited dynamic range (typically below 70 dB) leads to increased digital image correlation (DIC) measurement errors. The key to the solution is a high dynamic range (HDR) imaging system built around a digital micromirror device (DMD): the DMD's spatial modulation capability enables region-adaptive exposure control, and together with an integrated computational imaging pipeline the system achieves a measurable dynamic range of 127 dB, effectively suppressing saturation artifacts under strong glare, reducing DIC strain measurement error by 78%, and markedly improving optical metrology accuracy.
Link: https://arxiv.org/abs/2602.12044
Authors: Banglei Guan,Jing Tao,Liang Xu,Dongcai Tan,Pengju Sun,Jianbing Liu,Yang Shang,Qifeng Yu
Institutions: National University of Defense Technology; Hunan Provincial Key Laboratory of Image Measurement and Vision Navigation; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted by Experimental Mechanics
Abstract:Background: The accuracy of photomechanics measurements critically relies on image quality, particularly under extreme illumination conditions such as welding arc monitoring and polished metallic surface analysis. High dynamic range (HDR) imaging above 120 dB is essential in these contexts. Conventional CCD/CMOS sensors, with dynamic ranges typically below 70 dB, are highly susceptible to saturation under glare, resulting in irreversible loss of detail and significant errors in digital image correlation (DIC). Methods: This paper presents an HDR imaging system that leverages the spatial modulation capability of a digital micromirror device (DMD). The system architecture enables autonomous regional segmentation and adaptive exposure control for high-dynamic-range scenes through an integrated framework comprising two synergistic subsystems: a DMD-based optical modulation unit and an adaptive computational imaging pipeline. Results: The system achieves a measurable dynamic range of 127 dB, effectively eliminating saturation artifacts under high glare. Experimental results demonstrate a 78% reduction in strain error and improved DIC positioning accuracy, confirming reliable performance across extreme intensity variations. Conclusion: The DMD-based system provides high-fidelity adaptive HDR imaging, overcoming key limitations of conventional sensors. It exhibits strong potential for optical metrology and stress analysis in high-glare environments where traditional methods are inadequate.
[CV-16] Projected Representation Conditioning for High-fidelity Novel View Synthesis
【Quick Read】: This paper addresses the lack of geometric consistency in diffusion-based novel view synthesis, which is especially hard to maintain with sparse, unposed image collections. The key to the solution is conditioning on external representations, exploiting their geometric and semantic correspondence properties to improve the geometric consistency of generated viewpoints; concretely, dedicated representation projection modules inject external representations into the diffusion process, yielding the ReNoV (Representation-guided Novel View Synthesis) framework. Experiments show clear gains over prior diffusion-based methods in both reconstruction fidelity and inpainting quality, with robust high-quality synthesis from sparse, unposed images.
Link: https://arxiv.org/abs/2602.12003
Authors: Min-Seop Kwak,Minkyung Kwon,Jinhyeok Choi,Jiho Park,Seungryong Kim
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We propose a novel framework for diffusion-based novel view synthesis in which we leverage external representations as conditions, harnessing their geometric and semantic correspondence properties for enhanced geometric consistency in generated novel viewpoints. First, we provide a detailed analysis exploring the correspondence capabilities emergent in the spatial attention of external visual representations. Building from these insights, we propose a representation-guided novel view synthesis through dedicated representation projection modules that inject external representations into the diffusion process, a methodology named ReNoV, short for representation-guided novel view synthesis. Our experiments show that this design yields marked improvements in both reconstruction fidelity and inpainting quality, outperforming prior diffusion-based novel-view methods on standard benchmarks and enabling robust synthesis from sparse, unposed image collections.
[CV-17] Can Local Vision-Language Models improve Activity Recognition over Vision Transformers? – Case Study on Newborn Resuscitation ICIP
【Quick Read】: This paper aims to improve the accuracy of recognizing key activities during newborn resuscitation, supporting resuscitation quality improvement and adherence to clinical guidelines. Existing 3D-CNN and Vision Transformer (ViT) approaches are promising but still struggle with such fine-grained activity recognition. The key to the solution is to explore generative AI (GenAI) methods, specifically local Vision-Language Models (VLMs) combined with Large Language Models (LLMs), and to fine-tune the VLM with Low-Rank Adaptation (LoRA), which markedly boosts recognition performance. Experiments show that the fine-tuned VLM reaches an F1 score of 0.91, surpassing the supervised TimeSformer baseline (F1 = 0.70).
Link: https://arxiv.org/abs/2602.12002
Authors: Enrico Guerriero,Kjersti Engan,Øyvind Meinich-Bache
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Presented at the Satellite Workshop on Workshop 15: Generative AI for World Simulations and Communications Celebrating 40 Years of Excellence in Education: Honoring Professor Aggelos Katsaggelos, IEEE International Conference on Image Processing (ICIP), 2025
Abstract:Accurate documentation of newborn resuscitation is essential for quality improvement and adherence to clinical guidelines, yet remains underutilized in practice. Previous work using 3D-CNNs and Vision Transformers (ViT) has shown promising results in detecting key activities from newborn resuscitation videos, but also highlighted the challenges in recognizing such fine-grained activities. This work investigates the potential of generative AI (GenAI) methods to improve activity recognition from such videos. Specifically, we explore the use of local vision-language models (VLMs), combined with large language models (LLMs), and compare them to a supervised TimeSformer baseline. Using a simulated dataset comprising 13.26 hours of newborn resuscitation videos, we evaluate several zero-shot VLM-based strategies and fine-tuned VLMs with classification heads, including Low-Rank Adaptation (LoRA). Our results suggest that small (local) VLMs struggle with hallucinations, but when fine-tuned with LoRA, they reach an F1 score of 0.91, surpassing the TimeSformer result of 0.70.
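As a reference for the LoRA technique used here, the sketch below shows the core idea on a single linear layer (module names and hyperparameters are illustrative, not the paper's configuration): the pretrained weight is frozen and only a low-rank update is trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA adapter: freeze the pretrained linear layer and learn a
    low-rank update (alpha / r) * B @ A on top of it."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze pretrained weights
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Wrap, e.g., one projection of an attention block (a hypothetical target):
layer = LoRALinear(nn.Linear(768, 768))
n_trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(n_trainable)                            # only A and B: 2 * 8 * 768 = 12288 params
```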
[CV-18] Spatial Chain-of-Thought: Bridging Understanding and Generation Models for Spatial Reasoning Generation
【Quick Read】: This paper addresses diffusion models' weakness in complex spatial understanding and reasoning, where relying on purely textual prompts easily loses spatial information. Prior attempts to bring in Multimodal Large Language Models (MLLMs) either incur high computational cost from joint training or fail to preserve spatial structure. The key to the solution is the plug-and-play Spatial Chain-of-Thought (SCoT) framework: the diffusion model is first trained on an interleaved text-coordinate instruction format to improve layout awareness, and a state-of-the-art MLLM then acts as a planner that produces a complete layout plan, transferring its spatial planning capability directly to the generation process. This markedly improves generation accuracy and reasoning in complex scenes without extra training cost.
Link: https://arxiv.org/abs/2602.11980
Authors: Wei Chen,Yancheng Long,Mingqiao Liu,Haojie Ding,Yankai Yang,Hongyang Wei,Yi-Fan Zhang,Bin Wen,Fan Yang,Tingting Gao,Han Li,Long Chen
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 4 figures
Abstract:While diffusion models have shown exceptional capabilities in aesthetic image synthesis, they often struggle with complex spatial understanding and reasoning. Existing approaches resort to Multimodal Large Language Models (MLLMs) to enhance this capability. However, they either incur high computational costs through joint training or suffer from spatial information loss when relying solely on textual prompts. To alleviate these limitations, we propose a Spatial Chain-of-Thought (SCoT) framework, a plug-and-play approach that effectively bridges the reasoning capabilities of MLLMs with the generative power of diffusion models. Specifically, we first enhance the diffusion model’s layout awareness by training it on an interleaved text-coordinate instruction format. We then leverage state-of-the-art MLLMs as planners to generate comprehensive layout plans, transferring their spatial planning capabilities directly to the generation process. Extensive experiments demonstrate that our method achieves state-of-the-art performance on image generation benchmarks and significantly outperforms baselines on complex reasoning tasks, while also showing strong efficacy in image editing scenarios.
[CV-19] Calibrated Bayesian Deep Learning for Explainable Decision Support Systems Based on Medical Imaging
【Quick Read】: This paper targets the reliability of deep learning models in medical-imaging decision support, where predictions are often miscalibrated and overconfident on erroneous outputs, limiting clinical adoption. The key to the solution is a generalizable probabilistic optimization framework grounded in Bayesian deep learning, with two core innovations: a Confidence-Uncertainty Boundary Loss (CUB-Loss) that penalizes high-confidence errors and low-confidence correct predictions, explicitly aligning uncertainty estimates with prediction correctness; and a Dual Temperature Scaling (DTS) strategy for post-hoc calibration that refines the predictive distribution for more intuitive interpretability. The method is validated on three medical imaging tasks (automatic pneumonia screening, diabetic retinopathy detection, and skin lesion identification), yielding consistent calibration improvements and remaining robust under data scarcity and severe class imbalance.
Link: https://arxiv.org/abs/2602.11973
Authors: Hua Xu,Julián D. Arias-Londoño,Juan I. Godino-Llorente
Institutions: Universidad Politécnica de Madrid
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 24 pages, 3 figures
Abstract:In critical decision support systems based on medical imaging, the reliability of AI-assisted decision-making is as relevant as predictive accuracy. Although deep learning models have demonstrated significant accuracy, they frequently suffer from miscalibration, manifested as overconfidence in erroneous predictions. To facilitate clinical acceptance, it is imperative that models quantify uncertainty in a manner that correlates with prediction correctness, allowing clinicians to identify unreliable outputs for further review. In order to address this necessity, the present paper proposes a generalizable probabilistic optimization framework grounded in Bayesian deep learning. Specifically, a novel Confidence-Uncertainty Boundary Loss (CUB-Loss) is introduced that imposes penalties on high-certainty errors and low-certainty correct predictions, explicitly enforcing alignment between prediction correctness and uncertainty estimates. Complementing this training-time optimization, a Dual Temperature Scaling (DTS) strategy is devised for post-hoc calibration, further refining the posterior distribution to improve intuitive explainability. The proposed framework is validated on three distinct medical imaging tasks: automatic screening of pneumonia, diabetic retinopathy detection, and identification of skin lesions. Empirical results demonstrate that the proposed approach achieves consistent calibration improvements across diverse modalities, maintains robust performance in data-scarce scenarios, and remains effective on severely imbalanced datasets, underscoring its potential for real clinical deployment.
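For context on the calibration side, the sketch below implements classic single-temperature scaling, the standard post-hoc technique that the paper's Dual Temperature Scaling builds on (the dual variant itself is not reproduced here): a scalar T > 0 is fitted on held-out validation logits by minimizing negative log-likelihood.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Post-hoc temperature scaling: find T > 0 minimizing the NLL of
    softmax(logits / T) on a held-out validation set."""
    log_t = torch.zeros(1, requires_grad=True)     # optimize log T to keep T positive
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    opt.step(closure)
    return float(log_t.exp())

# Usage on hypothetical (overconfident) validation outputs:
logits = torch.randn(1000, 5) * 3
labels = torch.randint(0, 5, (1000,))
T = fit_temperature(logits, labels)
calibrated_probs = F.softmax(logits / T, dim=-1)   # T > 1 softens overconfident predictions
```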
[CV-20] Synthesis of Late Gadolinium Enhancement Images via Implicit Neural Representations for Cardiac Scar Segmentation
【Quick Read】: This paper addresses the difficulty of developing automated segmentation methods for myocardial scar assessment given limited annotated data. The key to the solution is a new framework combining implicit neural representations (INRs) with denoising diffusion models: INRs first learn continuous spatial representations of LGE images and their corresponding myocardium and fibrosis segmentation masks, and these INRs are compressed into compact latent embeddings that preserve essential anatomical information; a diffusion model is then trained in this latent space to generate new samples, which are decoded into anatomically consistent synthetic LGE images with matching segmentation masks, augmenting the training data without additional annotation.
Link: https://arxiv.org/abs/2602.11942
Authors: Soufiane Ben Haddou,Laura Alvarez-Florez,Erik J. Bekkers,Fleur V. Y. Tjong,Ahmad S. Amin,Connie R. Bezzina,Ivana Išgum
Institutions: Amsterdam UMC, The Netherlands; University of Amsterdam, The Netherlands; Amsterdam Cardiovascular Sciences, Amsterdam UMC, The Netherlands; Department of Radiology and Nuclear Medicine, Amsterdam UMC, The Netherlands; Mayo Clinic, Rochester, United States of America
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Paper accepted at SPIE Medical Imaging 2026 Conference
Abstract:Late gadolinium enhancement (LGE) imaging is the clinical standard for myocardial scar assessment, but limited annotated datasets hinder the development of automated segmentation methods. We propose a novel framework that synthesises both LGE images and their corresponding segmentation masks using implicit neural representations (INRs) combined with denoising diffusion models. Our approach first trains INRs to capture continuous spatial representations of LGE data and associated myocardium and fibrosis masks. These INRs are then compressed into compact latent embeddings, preserving essential anatomical information. A diffusion model operates on this latent space to generate new representations, which are decoded into synthetic LGE images with anatomically consistent segmentation masks. Experiments on 133 cardiac MRI scans suggest that augmenting training data with 200 synthetic volumes contributes to improved fibrosis segmentation performance, with the Dice score showing an increase from 0.509 to 0.524. Our approach provides an annotation-free method to help mitigate data scarcity. The code for this research is publicly available.
[CV-21] DynaHOI: Benchmarking Hand-Object Interaction for Dynamic Target
【Quick Read】: This paper addresses the fact that existing hand motion generation benchmarks for hand-object interaction (HOI) mainly target static objects, leaving dynamic targets and time-critical coordination largely unevaluated. The key to the solution is DynaHOI-Gym, a unified online closed-loop platform with parameterized motion generators and rollout-based evaluation metrics for dynamic hand-capture scenarios. On top of it, the authors build DynaHOI-10M, a large-scale benchmark with 10M frames and 180K hand trajectories, and provide a simple observe-before-act baseline (ObAct) that fuses short-term observations with the current frame via spatiotemporal attention to predict actions, improving the location success rate by 8.1%.
Link: https://arxiv.org/abs/2602.11919
Authors: BoCheng Hu,Zhonghan Zhao,Kaiyue Zhou,Hongwei Wang,Gaoang Wang
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Most existing hand motion generation benchmarks for hand-object interaction (HOI) focus on static objects, leaving dynamic scenarios with moving targets and time-critical coordination largely untested. To address this gap, we introduce the DynaHOI-Gym, a unified online closed-loop platform with parameterized motion generators and rollout-based metrics for dynamic capture evaluation. Built on DynaHOI-Gym, we release DynaHOI-10M, a large-scale benchmark with 10M frames and 180K hand capture trajectories, whose target motions are organized into 8 major categories and 22 fine-grained subcategories. We also provide a simple observe-before-act baseline (ObAct) that integrates short-term observations with the current frame via spatiotemporal attention to predict actions, achieving an 8.1% improvement in location success rate.
[CV-22] Where Bits Matter in World Model Planning : A Paired Mixed-Bit Study for Efficient Spatial Reasoning
【Quick Read】: This paper studies the reliability of world models for efficient spatial reasoning under tight bit budgets. The core question: in low-bit quantization, is planning behavior determined mostly by total bitwidth, or by how bits are allocated across modules? The key finding is a three-regime pattern: 8-bit and 6-bit settings stay close to FP16, 3-bit settings collapse, and the 4-bit transition regime is allocation-sensitive, where preserving encoder precision notably improves planning and asymmetric allocations outperform uniform quantization. This suggests module-aware, budget-aware quantization policies as an important direction for efficient spatial reasoning.
Link: https://arxiv.org/abs/2602.11882
Authors: Suraj Ranganath,Anish Patnaik,Vaishak Menon
Institutions: University of California San Diego
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Workshop submission
Abstract:Efficient spatial reasoning requires world models that remain reliable under tight precision budgets. We study whether low-bit planning behavior is determined mostly by total bitwidth or by where bits are allocated across modules. Using DINO-WM on the Wall planning task, we run a paired-goal mixed-bit evaluation across uniform, mixed, asymmetric, and layerwise variants under two planner budgets. We observe a consistent three-regime pattern: 8-bit and 6-bit settings remain close to FP16, 3-bit settings collapse, and 4-bit settings are allocation-sensitive. In that transition region, preserving encoder precision improves planning relative to uniform quantization, and near-size asymmetric variants show the same encoder-side direction. In a later strict 22-cell replication with smaller per-cell episode count, the mixed-versus-uniform INT4 sign becomes budget-conditioned, which further highlights the sensitivity of this transition regime. These findings motivate module-aware, budget-aware quantization policies as a broader research direction for efficient spatial reasoning. Code and run artifacts are available at this https URL.
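A minimal sketch of the kind of mixed-bit experiment described above (module names and the 8/4-bit split are illustrative stand-ins, not DINO-WM's actual components): each module is fake-quantized at its allocated bitwidth so that an asymmetric allocation can be compared against a uniform one at the same average budget.

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform fake quantization: snap weights to a signed
    (2^(bits-1) - 1)-level grid and map them back to floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
modules = {"encoder": rng.standard_normal((256, 256)),
           "dynamics": rng.standard_normal((256, 256))}

for allocation in ({"encoder": 6, "dynamics": 6},    # uniform 6-bit
                   {"encoder": 8, "dynamics": 4}):   # asymmetric, same average budget
    errs = {name: float(np.abs(w - fake_quantize(w, allocation[name])).mean())
            for name, w in modules.items()}
    print(allocation, {k: round(v, 5) for k, v in errs.items()})
```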
[CV-23] SynthRAR: Ring Artifacts Reduction in CT with Unrolled Network and Synthetic Data Training
【Quick Read】: This paper addresses ring and streak artifacts in CT caused by defective detectors and inconsistent responses, which can render reconstructions clinically unusable. Existing supervised deep learning methods correct in either the image or the sinogram domain, but depend on costly real clinical training data and ignore the intrinsic correlations induced by the CT geometry's forward projection. The key innovation is to reformulate ring artifact reduction (RAR) as an inverse problem that models non-ideal detector responses together with the linear forward projection, solved with an unrolled network; meanwhile, synthesizing data from natural images exploits the intrinsic sinogram-image correlations, so the trained model achieves high-quality artifact correction without real clinical data, sharply lowering data collection cost and generalizing across scanning geometries and anatomical regions.
Link: https://arxiv.org/abs/2602.11880
Authors: Hongxu Yang,Levente Lippenszky,Edina Timko,Gopal Avinash
Institutions: Science & Technology Organization, GE HealthCare
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Prepare for submission
Abstract:Defective and inconsistent responses in CT detectors can cause ring and streak artifacts in the reconstructed images, making them unusable for clinical purposes. In recent years, several ring artifact reduction solutions have been proposed in the image domain or in the sinogram domain using supervised deep learning methods. However, these methods require dedicated datasets for training, leading to a high data collection cost. Furthermore, existing approaches focus exclusively on either image-space or sinogram-space correction, neglecting the intrinsic correlations from the forward operation of the CT geometry. Based on the theoretical analysis of non-ideal CT detector responses, the RAR problem is reformulated as an inverse problem by using an unrolled network, which considers non-ideal response together with linear forward-projection with CT geometry. Additionally, the intrinsic correlations of ring artifacts between the sinogram and image domains are leveraged through synthetic data derived from natural images, enabling the trained model to correct artifacts without requiring real-world clinical data. Extensive evaluations on diverse scanning geometries and anatomical regions demonstrate that the model trained on synthetic data consistently outperforms existing state-of-the-art methods.
[CV-24] DiffPlace: Street View Generation via Place-Controllable Diffusion Model Enhancing Place Recognition ICRA2026
【Quick Read】: This paper addresses the difficulty of current multi-view diffusion models in generating place-aware, background-consistent urban street scenes from text, BEV maps, and object bounding boxes, which limits their usefulness for visual place recognition. The key to the solution is the DiffPlace framework, which introduces a place-ID controller that maps place-ID embeddings into a fixed CLIP space via linear projection, a Perceiver transformer, and contrastive learning, allowing the model to keep background buildings consistent while flexibly varying foreground objects and weather conditions for controllable multi-view generation.
Link: https://arxiv.org/abs/2602.11875
Authors: Ji Li,Zhiwei Li,Shihao Li,Zhenjiang Yu,Boyang Wang,Haiou Liu
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: accepted by ICRA 2026
Abstract:Generative models have advanced significantly in realistic image synthesis, with diffusion models excelling in quality and stability. Recent multi-view diffusion models improve 3D-aware street view generation, but they struggle to produce place-aware and background-consistent urban scenes from text, BEV maps, and object bounding boxes. This limits their effectiveness in generating realistic samples for place recognition tasks. To address these challenges, we propose DiffPlace, a novel framework that introduces a place-ID controller to enable place-controllable multi-view image generation. The place-ID controller employs linear projection, a Perceiver transformer, and contrastive learning to map place-ID embeddings into a fixed CLIP space, allowing the model to synthesize images with consistent background buildings while flexibly modifying foreground objects and weather conditions. Extensive experiments, including quantitative comparisons and augmented training evaluations, demonstrate that DiffPlace outperforms existing methods in both generation quality and training support for visual place recognition. Our results highlight the potential of generative models in enhancing scene-level and place-aware synthesis, providing a valuable approach for improving place recognition in autonomous driving.
[CV-25] Free Lunch for Stabilizing Rectified Flow Inversion
【Quick Read】: This paper addresses the instability of training-free inversion in Rectified-Flow (RF)-based generative models, where approximation errors accumulate across timesteps and destabilize the velocity field, degrading reconstruction and editing quality. The key to the solution is twofold: Proximal-Mean Inversion (PMI), which stabilizes the velocity field by guiding the current velocity toward a running average of past velocities, constrained within a theoretically derived spherical Gaussian; and mimic-CFG, a lightweight velocity correction scheme that interpolates between the current velocity and its projection onto the historical average, balancing editing effectiveness and structural consistency. Together they markedly improve inversion stability, reconstruction quality, and editing fidelity while reducing the number of neural function evaluations, combining efficiency with theoretical soundness.
Link: https://arxiv.org/abs/2602.11850
Authors: Chenru Wang,Beier Zhu,Chi Zhang
Institutions: Westlake University; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Rectified-Flow (RF)-based generative models have recently emerged as strong alternatives to traditional diffusion models, demonstrating state-of-the-art performance across various tasks. By learning a continuous velocity field that transforms simple noise into complex data, RF-based models not only enable high-quality generation, but also support training-free inversion, which facilitates downstream tasks such as reconstruction and editing. However, existing inversion methods, such as vanilla RF-based inversion, suffer from approximation errors that accumulate across timesteps, leading to unstable velocity fields and degraded reconstruction and editing quality. To address this challenge, we propose Proximal-Mean Inversion (PMI), a training-free gradient correction method that stabilizes the velocity field by guiding it toward a running average of past velocities, constrained within a theoretically derived spherical Gaussian. Furthermore, we introduce mimic-CFG, a lightweight velocity correction scheme for editing tasks, which interpolates between the current velocity and its projection onto the historical average, balancing editing effectiveness and structural consistency. Extensive experiments on PIE-Bench demonstrate that our methods significantly improve inversion stability, image reconstruction quality, and editing fidelity, while reducing the required number of neural function evaluations. Our approach achieves state-of-the-art performance on the PIE-Bench with enhanced efficiency and theoretical soundness.
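As a rough illustration of the proximal-mean idea only (a sketch under stated assumptions: the interpolation weight `lam` and the fixed ball `radius` are placeholders, whereas the paper derives its constraint from a spherical Gaussian, and mimic-CFG is not reproduced), the correction pulls each velocity toward the running average of past velocities:

```python
import torch

def pmi_correct(v_t: torch.Tensor, history: list,
                lam: float = 0.5, radius: float = 1.0) -> torch.Tensor:
    """Sketch of a proximal-mean correction: step toward the running mean
    of past velocities, then keep the result inside a ball around it."""
    if not history:
        return v_t
    v_mean = torch.stack(history).mean(dim=0)
    v_prox = v_t + lam * (v_mean - v_t)       # proximal step toward the mean
    delta = v_prox - v_mean
    norm = delta.norm()
    if norm > radius:                          # project back into the ball
        v_prox = v_mean + delta * (radius / norm)
    return v_prox

# Inside a schematic inversion loop:
history = []
for step in range(10):
    v_t = torch.randn(4)                       # stand-in for the model's velocity output
    v_t = pmi_correct(v_t, history)
    history.append(v_t)
```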
[CV-26] WorldTree: Towards 4D Dynamic Worlds from Monocular Video using Tree-Chains
【Quick Read】: This paper tackles the challenges of dynamic reconstruction from monocular input, where existing motion representations lack a unified spatiotemporal decomposition framework and suffer from either holistic temporal optimization or coupled hierarchical spatial composition. The key to the solution is the unified WorldTree framework, built from two components: a Temporal Partition Tree (TPT) that enables coarse-to-fine optimization via an inheritance-based hierarchical temporal decomposition, and Spatial Ancestral Chains (SAC) that recursively query the ancestral hierarchy to provide complementary spatial dynamics while specializing motion representations across ancestral nodes. This design disentangles the complexity of spatiotemporal modeling and improves reconstruction accuracy, clearly outperforming prior state-of-the-art methods on multiple datasets.
Link: https://arxiv.org/abs/2602.11845
Authors: Qisen Wang,Yifan Zhao,Jia Li
Institutions: Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Dynamic reconstruction has achieved remarkable progress, but there remain challenges in monocular input for more practical applications. The prevailing works attempt to construct efficient motion representations, but lack a unified spatiotemporal decomposition framework, suffering from either holistic temporal optimization or coupled hierarchical spatial composition. To this end, we propose WorldTree, a unified framework comprising Temporal Partition Tree (TPT) that enables coarse-to-fine optimization based on the inheritance-based partition tree structure for hierarchical temporal decomposition, and Spatial Ancestral Chains (SAC) that recursively query ancestral hierarchical structure to provide complementary spatial dynamics while specializing motion representations across ancestral nodes. Experimental results on different datasets indicate that our proposed method achieves 8.26% improvement of LPIPS on NVIDIA-LS and 9.09% improvement of mLPIPS on DyCheck compared to the second-best method. Code: this https URL.
[CV-27] JEPA-VLA: Video Predictive Embedding is Needed for VLA Models
【Quick Read】: This paper addresses the low sample efficiency and limited generalization of current vision-language-action (VLA) models in robotic manipulation, tracing these problems to inadequate pretrained visual representations: existing representations (whether from language-image contrastive learning or image-based self-supervised learning) fail to capture crucial task-relevant environment information or to induce an effective policy prior, i.e., anticipatory knowledge of how the environment evolves under successful task execution. The key to the solution is to bring in predictive embeddings pretrained on videos, in particular those of V-JEPA 2, which flexibly discard unpredictable environment factors and encode task-relevant temporal dynamics, effectively compensating for the shortcomings of existing visual representations. Building on these observations, the authors propose JEPA-VLA, which adaptively fuses predictive embeddings into existing VLA architectures and yields substantial gains across LIBERO, LIBERO-plus, RoboTwin2.0, and real-robot tasks.
Link: https://arxiv.org/abs/2602.11832
Authors: Shangchen Miao,Ningya Feng,Jialong Wu,Ye Lin,Xu He,Dong Li,Mingsheng Long
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:Recent vision-language-action (VLA) models built upon pretrained vision-language models (VLMs) have achieved significant improvements in robotic manipulation. However, current VLAs still suffer from low sample efficiency and limited generalization. This paper argues that these limitations are closely tied to an overlooked component, pretrained visual representation, which offers insufficient knowledge on both aspects of environment understanding and policy prior. Through an in-depth analysis, we find that commonly used visual representations in VLAs, whether pretrained via language-image contrastive learning or image-based self-supervised learning, remain inadequate at capturing crucial, task-relevant environment information and at inducing effective policy priors, i.e., anticipatory knowledge of how the environment evolves under successful task execution. In contrast, we discover that predictive embeddings pretrained on videos, in particular V-JEPA 2, are adept at flexibly discarding unpredictable environment factors and encoding task-relevant temporal dynamics, thereby effectively compensating for key shortcomings of existing visual representations in VLAs. Building on these observations, we introduce JEPA-VLA, a simple yet effective approach that adaptively integrates predictive embeddings into existing VLAs. Our experiments demonstrate that JEPA-VLA yields substantial performance gains across a range of benchmarks, including LIBERO, LIBERO-plus, RoboTwin2.0, and real-robot tasks.
[CV-28] A Comparative Study of MAP and LMMSE Estimators for Blind Inverse Problems
【Quick Read】: This paper addresses the instability of maximum-a-posteriori (MAP) methods for blind deconvolution caused by non-convexity and parameter sensitivity. The key to the solution is the linear minimum mean square error (LMMSE) estimator, which provides a stable, reliable baseline without elaborate parameter tuning; experiments further show that the LMMSE solution serves as an effective initialization for MAP approaches, improving their performance and reducing sensitivity to regularization parameters, laying groundwork for future theoretical and practical developments.
Link: https://arxiv.org/abs/2602.11814
Authors: Nathan Buskulic,Luca Calatroni
Institutions: Unknown
Categories: Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Maximum-a-posteriori (MAP) approaches are an effective framework for inverse problems with known forward operators, particularly when combined with expressive priors and careful parameter selection. In blind settings, however, their use becomes significantly less stable due to the inherent non-convexity of the problem and the potential non-identifiability of the solutions. (Linear) minimum mean square error (MMSE) estimators provide a compelling alternative that can circumvent these limitations. In this work, we study synthetic two-dimensional blind deconvolution problems under fully controlled conditions, with complete prior knowledge of both the signal and kernel distributions. We compare tailored MAP algorithms with simple LMMSE estimators whose functional form is closely related to that of an optimal Tikhonov estimator. Our results show that, even in these highly controlled settings, MAP methods remain unstable and require extensive parameter tuning, whereas the LMMSE estimator yields a robust and reliable baseline. Moreover, we demonstrate empirically that the LMMSE solution can serve as an effective initialization for MAP approaches, improving their performance and reducing sensitivity to regularization parameters, thereby opening the door to future theoretical and practical developments.
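As a concrete reference, the sketch below implements the classic frequency-domain LMMSE (Wiener) deconvolution estimator, whose Tikhonov-like form the paper's baseline is closely related to; the flat signal spectrum and noise level here are toy assumptions, and the kernel is treated as known (e.g., a mean kernel), not estimated blindly.

```python
import numpy as np

def lmmse_deconvolve(y: np.ndarray, h: np.ndarray,
                     signal_power: np.ndarray, noise_var: float) -> np.ndarray:
    """Frequency-domain LMMSE (Wiener) estimate for y = h * x + n:
        X_hat(f) = conj(H(f)) Y(f) / (|H(f)|^2 + sigma^2 / S_x(f))."""
    H = np.fft.fft2(h, s=y.shape)
    Y = np.fft.fft2(y)
    W = np.conj(H) / (np.abs(H) ** 2 + noise_var / signal_power)
    return np.real(np.fft.ifft2(W * Y))

# Toy 2D example: white-noise signal, small separable blur kernel.
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64))
h = np.outer(np.hanning(5), np.hanning(5)); h /= h.sum()
y = np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(h, s=x.shape)))
y += 0.05 * rng.standard_normal(x.shape)
x_hat = lmmse_deconvolve(y, h, signal_power=np.ones_like(y), noise_var=0.05 ** 2)
```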
[CV-29] How to Sample High Quality 3D Fractals for Action Recognition Pre-Training?
【Quick Read】: This paper addresses the scarcity of high-quality labeled data for pre-training action recognition models by generating unlimited, perfectly labeled synthetic video data from 3D Iterated Function Systems (IFS). Standard 3D fractal generation is slow and prone to producing degenerate fractals, which hurts downstream performance. The key to the solution is a novel Targeted Smart Filtering method that optimizes the sampling strategy of fractal generation, achieving roughly 100x faster generation while increasing fractal diversity and improving downstream action recognition performance.
Link: https://arxiv.org/abs/2602.11810
Authors: Marko Putak,Thomas B. Moeslund,Joakim Bruslund Haurum
Institutions: Aalborg University; Pioneer Centre for AI; University of Southern Denmark
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 12 pages, 6 figures. To be published in VISAPP
Abstract:Synthetic datasets are being recognized in the deep learning realm as a valuable alternative to exhaustively labeled real data. One such synthetic data generation method is Formula Driven Supervised Learning (FDSL), which can provide an infinite number of perfectly labeled data through a formula-driven approach, such as fractals or contours. FDSL does not have common drawbacks like manual labor, privacy and other ethical concerns. In this work, we generate 3D fractals using 3D Iterated Function Systems (IFS) for pre-training an action recognition model. The fractals are temporally transformed to form a video that is used as a pre-training dataset for the downstream task of action recognition. We find that standard methods of generating fractals are slow and produce degenerate 3D fractals. Therefore, we systematically explore alternative ways of generating fractals and find that overly-restrictive approaches, while generating aesthetically pleasing fractals, are detrimental to downstream task performance. We propose a novel method, Targeted Smart Filtering, to address both the generation speed and fractal diversity issues. The method reports roughly 100 times faster sampling speed and achieves superior downstream performance against other 3D fractal filtering methods.
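For readers unfamiliar with IFS sampling, here is a minimal chaos-game generator for 3D fractal point clouds; the crude spectral-norm rescaling below is only a stand-in for the filtering step that the paper studies in depth (Targeted Smart Filtering itself is not reproduced).

```python
import numpy as np

def sample_ifs_3d(n_maps: int = 4, n_points: int = 20000, seed: int = 0) -> np.ndarray:
    """Chaos-game sampling of a random 3D IFS: draw affine maps
    x -> A_i x + b_i and iterate a point while picking maps at random."""
    rng = np.random.default_rng(seed)
    As = rng.uniform(-1, 1, size=(n_maps, 3, 3))
    for i in range(n_maps):                   # crude contraction control
        s = np.linalg.norm(As[i], 2)          # spectral norm
        if s > 0.9:
            As[i] *= 0.9 / s
    bs = rng.uniform(-1, 1, size=(n_maps, 3))

    x = np.zeros(3)
    pts = np.empty((n_points, 3))
    burn_in = 100
    for t in range(n_points + burn_in):       # discard early transient iterates
        i = rng.integers(n_maps)
        x = As[i] @ x + bs[i]
        if t >= burn_in:
            pts[t - burn_in] = x
    return pts

points = sample_ifs_3d()                      # (20000, 3) fractal point cloud
```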
[CV-30] Efficient Segment Anything with Depth-Aware Fusion and Limited Training Data
【Quick Read】: This paper addresses the heavy training-data requirements of Segment Anything Models (SAM), such as the 11M-image SA-1B dataset, and their reliance on RGB-only inputs, which drives up compute cost and limits generalization. The key to the solution is a lightweight RGB-D fusion framework that augments EfficientViT-SAM with monocular depth priors: depth maps from a pretrained estimator are fused with RGB features at mid-level through a dedicated depth encoder, so that with only 11.2k samples (less than 0.1% of SA-1B) the method surpasses EfficientViT-SAM in segmentation accuracy, demonstrating that depth cues provide strong geometric priors for segmentation.
Link: https://arxiv.org/abs/2602.11804
Authors: Yiming Zhou,Xuenjie Xie,Panfeng Li,Albrecht Kunz,Ahmad Osman,Xavier Maldague
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:
Abstract:Segment Anything Models (SAM) achieve impressive universal segmentation performance but require massive datasets (e.g., 11M images) and rely solely on RGB inputs. Recent efficient variants reduce computation but still depend on large-scale training. We propose a lightweight RGB-D fusion framework that augments EfficientViT-SAM with monocular depth priors. Depth maps are generated with a pretrained estimator and fused mid-level with RGB features through a dedicated depth encoder. Trained on only 11.2k samples (less than 0.1% of SA-1B), our method achieves higher accuracy than EfficientViT-SAM, showing that depth cues provide strong geometric priors for segmentation.
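A minimal sketch of mid-level RGB-D fusion in the spirit described above (layer shapes and module names are assumptions, not the paper's exact architecture): a small depth encoder produces a feature map that is concatenated with mid-level RGB features and projected back to the original channel width.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MidLevelRGBDFusion(nn.Module):
    """Illustrative mid-level fusion: encode a monocular depth map, resize
    it to the RGB feature resolution, concatenate, and project back."""

    def __init__(self, rgb_channels: int = 256, depth_channels: int = 64):
        super().__init__()
        self.depth_encoder = nn.Sequential(
            nn.Conv2d(1, depth_channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(depth_channels, depth_channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.fuse = nn.Conv2d(rgb_channels + depth_channels, rgb_channels, 1)

    def forward(self, rgb_feat, depth):
        d = self.depth_encoder(depth)                 # encode the depth prior
        d = F.interpolate(d, size=rgb_feat.shape[-2:],
                          mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([rgb_feat, d], dim=1))

fusion = MidLevelRGBDFusion()
out = fusion(torch.randn(1, 256, 64, 64),             # mid-level RGB features
             torch.randn(1, 1, 512, 512))             # monocular depth map
```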
[CV-31] Light4D: Training-Free Extreme Viewpoint 4D Video Relighting
【Quick Read】: This paper addresses two challenges in 4D relighting: the lack of paired 4D relighting training data, and the difficulty of maintaining temporal consistency under extreme viewpoint changes. The key to the solution is Light4D, a training-free framework with two core innovations: 1) Disentangled Flow Guidance, a time-aware strategy that injects lighting control into the latent space while preserving geometric integrity; and 2) a Temporal Consistent Attention mechanism within the IC-Light architecture, combined with deterministic regularization to eliminate appearance flickering, enabling high-quality, temporally consistent 4D video synthesis under extreme viewpoints (-90° to 90°).
Link: https://arxiv.org/abs/2602.11769
Authors: Zhenghuang Wu,Kang Chen,Zeyu Zhang,Hao Tang
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advances in diffusion-based generative models have established a new paradigm for image and video relighting. However, extending these capabilities to 4D relighting remains challenging, due primarily to the scarcity of paired 4D relighting training data and the difficulty of maintaining temporal consistency across extreme viewpoints. In this work, we propose Light4D, a novel training-free framework designed to synthesize consistent 4D videos under target illumination, even under extreme viewpoint changes. First, we introduce Disentangled Flow Guidance, a time-aware strategy that effectively injects lighting control into the latent space while preserving geometric integrity. Second, to reinforce temporal consistency, we develop Temporal Consistent Attention within the IC-Light architecture and further incorporate deterministic regularization to eliminate appearance flickering. Extensive experiments demonstrate that our method achieves competitive performance in temporal consistency and lighting fidelity, robustly handling camera rotations from -90° to 90°. Code: this https URL. Website: this https URL.
[CV-32] Code2Worlds: Empowering Coding LLM s for 4D World Generation
【Quick Read】: This paper addresses two core challenges for generative AI in building 4D dynamic scenes: multi-scale context entanglement, where monolithic generation cannot balance local object structure with global environmental layout; and the semantic-physical execution gap, where open-loop code generation produces physical hallucinations lacking dynamic fidelity. The key to the solution is the Code2Worlds framework, which decouples retrieval-augmented object generation from hierarchical environmental orchestration via a dual-stream architecture and introduces a physics-aware closed loop: a PostProcess Agent scripts dynamics while a VLM-Motion Critic performs self-reflective iterative refinement, yielding high-fidelity, physics-driven simulation code.
Link: https://arxiv.org/abs/2602.11757
Authors: Yi Zhang,Yunshuang Wang,Zeyu Zhang,Hao Tang
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Achieving spatial intelligence requires moving beyond visual plausibility to build world simulators grounded in physical laws. While coding LLMs have advanced static 3D scene generation, extending this paradigm to 4D dynamics remains a critical frontier. This task presents two fundamental challenges: multi-scale context entanglement, where monolithic generation fails to balance local object structures with global environmental layouts; and a semantic-physical execution gap, where open-loop code generation leads to physical hallucinations lacking dynamic fidelity. We introduce Code2Worlds, a framework that formulates 4D generation as language-to-simulation code generation. First, we propose a dual-stream architecture that disentangles retrieval-augmented object generation from hierarchical environmental orchestration. Second, to ensure dynamic fidelity, we establish a physics-aware closed-loop mechanism in which a PostProcess Agent scripts dynamics, coupled with a VLM-Motion Critic that performs self-reflection to iteratively refine simulation code. Evaluations on the Code4D benchmark show Code2Worlds outperforms baselines with a 41% SGS gain and 49% higher Richness, while uniquely generating physics-aware dynamics absent in prior static methods. Code: this https URL. Website: this https URL.
[CV-33] Adaptive Debiasing Tsallis Entropy for Test-Time Adaptation ICLR2026
【Quick Read】: This paper addresses the biased uncertainty estimates that mainstream test-time adaptation (TTA) methods inherit from relying on Shannon Entropy (SE) when adapting vision-language models such as CLIP: because CLIP is pretrained on highly imbalanced web-crawled data, SE cannot accurately characterize uncertainty under the true distribution, which hurts performance. The key to the solution is Tsallis Entropy (TE), a generalization of SE whose non-extensive parameter q naturally models biased distributions, with SE's performance serving as a lower bound for TE. Building on this, the paper proposes Adaptive Debiasing Tsallis Entropy (ADTE), which computes class-specific q^l parameters from continuously incoming test samples, enabling high-confidence view selection and seamless integration with a label adjustment strategy without distribution-specific hyperparameter tuning. Experiments show that ADTE outperforms prior methods on ImageNet and its five variants and achieves the best average performance across 10 cross-domain benchmarks, regardless of model architecture or text prompts.
Link: https://arxiv.org/abs/2602.11743
Authors: Xiangyu Wu,Dongming Jiang,Feng Yu,Yueying Tian,Jiaqi Tang,Qing-Guo Chen,Yang Yang,Jianfeng Lu
Institutions: Nanjing University of Science and Technology; Alibaba Cloud; University of Texas at Dallas; University of Sussex; Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at ICLR 2026; 24 pages; 5 figures
Abstract:Mainstream Test-Time Adaptation (TTA) methods for adapting vision-language models, e.g., CLIP, typically rely on Shannon Entropy (SE) at test time to measure prediction uncertainty and inconsistency. However, since CLIP has a built-in bias from pretraining on highly imbalanced web-crawled data, SE inevitably results in producing biased estimates of uncertainty entropy. To address this issue, we notably find and demonstrate that Tsallis Entropy (TE), a generalized form of SE, is naturally suited for characterizing biased distributions by introducing a non-extensive parameter q, with the performance of SE serving as a lower bound for TE. Building upon this, we generalize TE into Adaptive Debiasing Tsallis Entropy (ADTE) for TTA, customizing a class-specific parameter q^l derived by normalizing the estimated label bias from continuously incoming test instances, for each category. This adaptive approach allows ADTE to accurately select high-confidence views and seamlessly integrate with a label adjustment strategy to enhance adaptation, without introducing distribution-specific hyperparameter tuning. Besides, our investigation reveals that both TE and ADTE can serve as direct, advanced alternatives to SE in TTA, without any other modifications. Experimental results show that ADTE outperforms state-of-the-art methods on ImageNet and its five variants, and achieves the highest average performance on 10 cross-domain benchmarks, regardless of the model architecture or text prompts used. Our code is available at this https URL.
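The Tsallis entropy at the heart of the method has a simple closed form, S_q(p) = (1 - Σ_i p_i^q) / (q - 1), which recovers Shannon entropy in the limit q → 1. A small numerical check (the example distribution is arbitrary; ADTE's class-specific q^l estimation is not reproduced here):

```python
import numpy as np

def tsallis_entropy(p: np.ndarray, q: float) -> float:
    """Tsallis entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1);
    the q -> 1 limit is the Shannon entropy (natural log)."""
    if abs(q - 1.0) < 1e-6:
        p = p[p > 0]
        return float(-(p * np.log(p)).sum())   # Shannon limit
    return float((1.0 - (p ** q).sum()) / (q - 1.0))

p = np.array([0.7, 0.2, 0.1])                  # a toy prediction distribution
for q in (0.5, 1.0, 2.0):
    print(f"q={q}: S_q = {tsallis_entropy(p, q):.4f}")
```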
[CV-34] Adapting Vision-Language Models for E-commerce Understanding at Scale
【Quick Read】: This paper addresses the challenge of fusing multimodal information (text, images, and structured attributes) for e-commerce product understanding, in particular how to adapt general-purpose Vision-Language Models (VLMs) to the attribute-centric, multi-image, and noisy nature of e-commerce data without sacrificing their broad multimodal capabilities. The key to the solution, validated through a large-scale experimental study, is targeted lightweight adaptation of general VLMs, which substantially improves e-commerce performance while preserving cross-task generalization.
Link: https://arxiv.org/abs/2602.11733
Authors: Matteo Nulli,Vladimir Orshulevich,Tala Bazazo,Christian Herold,Michael Kozielski,Marcin Mazur,Szymon Tuzel,Cees G. M. Snoek,Seyyed Hadi Hashemi,Omar Javed,Yannick Versley,Shahram Khadivi
Institutions: eBay Inc.; University of Amsterdam
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:E-commerce product understanding demands, by nature, strong multimodal comprehension from text, images, and structured attributes. General-purpose Vision-Language Models (VLMs) enable generalizable multimodal latent modelling, yet there is no documented, well-known strategy for adapting them to the attribute-centric, multi-image, and noisy nature of e-commerce data without sacrificing general performance. In this work, we show through a large-scale experimental study how targeted adaptation of general VLMs can substantially improve e-commerce performance while preserving broad multimodal capabilities. Furthermore, we propose a novel extensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.
[CV-35] STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning
【Quick Read】: This paper addresses hallucinations in vision-language models (VLMs) caused by misalignment between textual descriptions and visual coordinates, a problem that is especially severe in dense prediction tasks such as spatial-temporal video grounding (STVG). Conventional remedies strengthen cross-modal alignment or attach auxiliary decoders, at the cost of extra trainable modules, annotation, and compute. The key to the solution is a new visual prompting paradigm: per-frame coordinate prediction is reformulated as a compact instance-level identification problem by assigning each object a unique, temporally consistent ID and embedding these IDs into the video as visual prompts, sidestepping the difficulty of aligning coordinates across modalities. On top of this, STVG-R1, the first reinforcement learning framework for STVG, uses a task-driven reward to jointly optimize temporal accuracy, spatial consistency, and structural format regularization, yielding large performance gains and zero-shot generalization.
Link: https://arxiv.org/abs/2602.11730
Authors: Xiaowen Zhang,Zhi Gao,Licheng Jiao,Lingling Li,Qing Li
Institutions: Xidian University; State Key Laboratory of General Artificial Intelligence, BIGAI; Beijing Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In vision-language models (VLMs), misalignment between textual descriptions and visual coordinates often induces hallucinations. This issue becomes particularly severe in dense prediction tasks such as spatial-temporal video grounding (STVG). Prior approaches typically focus on enhancing visual-textual alignment or attaching auxiliary decoders. However, these strategies inevitably introduce additional trainable modules, leading to significant annotation costs and computational overhead. In this work, we propose a novel visual prompting paradigm that avoids the difficult problem of aligning coordinates across modalities. Specifically, we reformulate per-frame coordinate prediction as a compact instance-level identification problem by assigning each object a unique, temporally consistent ID. These IDs are embedded into the video as visual prompts, providing explicit and interpretable inputs to the VLMs. Furthermore, we introduce STVG-R1, the first reinforcement learning framework for STVG, which employs a task-driven reward to jointly optimize temporal accuracy, spatial consistency, and structural format regularization. Extensive experiments on six benchmarks demonstrate the effectiveness of our approach. STVG-R1 surpasses the baseline Qwen2.5-VL-7B by a remarkable margin of 20.9% on m_IoU on the HCSTVG-v2 benchmark, establishing a new state of the art (SOTA). Surprisingly, STVG-R1 also exhibits strong zero-shot generalization to multi-object referring video object segmentation tasks, achieving a SOTA 47.3% JF on MeViS.
[CV-36] GSO-SLAM: Bidirectionally Coupled Gaussian Splatting and Direct Visual Odometry
【Quick Read】: This paper addresses the inefficient coupling of tracking (visual odometry, VO) and mapping in monocular dense SLAM systems: unifying both under one scene representation is computationally costly, while loose integration introduces redundancy. The key to the solution is the GSO-SLAM framework, which bidirectionally couples VO and Gaussian Splatting (GS) under an Expectation-Maximization (EM) formulation, simultaneously refining VO-derived semi-dense depth estimates and the GS representation at no extra computational overhead. A Gaussian Splat Initialization method further uses VO's image information, keyframe poses, and pixel associations to directly build an initial representation close to the final GS scene, avoiding heuristics, improving the geometric/photometric fidelity of reconstruction and the tracking accuracy, and supporting real-time operation.
Link: https://arxiv.org/abs/2602.11714
Authors: Jiung Yeon,Seongbo Ha,Hyeonwoo Yu
Institutions: Sungkyunkwan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 8 pages, 6 figures, RA-L accepted
Abstract:We propose GSO-SLAM, a real-time monocular dense SLAM system that leverages Gaussian scene representation. Unlike existing methods that couple tracking and mapping with a unified scene, incurring computational costs, or loosely integrate them with well-structured tracking frameworks, introducing redundancies, our method bidirectionally couples Visual Odometry (VO) and Gaussian Splatting (GS). Specifically, our approach formulates joint optimization within an Expectation-Maximization (EM) framework, enabling the simultaneous refinement of VO-derived semi-dense depth estimates and the GS representation without additional computational overhead. Moreover, we present Gaussian Splat Initialization, which utilizes image information, keyframe poses, and pixel associations from VO to produce close approximations to the final Gaussian scene, thereby eliminating the need for heuristic methods. Through extensive experiments, we validate the effectiveness of our method, showing that it not only operates in real time but also achieves state-of-the-art geometric/photometric fidelity of the reconstructed scene and tracking accuracy.
[CV-37] LLM -Driven 3D Scene Generation of Agricultural Simulation Environments
【Quick Read】: This paper addresses common shortcomings of current LLM-based 3D scene generation: lack of domain-specific reasoning, insufficient verification mechanisms, and missing modular design, which yield poorly controlled, hard-to-scale environments. The key to the solution is a modular multi-LLM pipeline that integrates 3D asset retrieval, domain-knowledge injection, and code generation against the Unreal engine API, combined with a hybrid strategy of few-shot prompting, Retrieval-Augmented Generation (RAG), finetuning, and validation, enabling accurate, verifiable, and scalable generation of agricultural synthetic simulation environments from natural-language prompts and clearly outperforming monolithic single-model approaches.
Link: https://arxiv.org/abs/2602.11706
Authors: Arafa Yoncalik,Wouter Jansen,Nico Huebel,Mohammad Hasan Rahmani,Jan Steckel
Institutions: University of Antwerp; Flanders Make Strategic Research Centre
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Accepted at IEEE Conference on Artificial Intelligence 2026
Abstract:Procedural generation techniques in 3D rendering engines have revolutionized the creation of complex environments, reducing reliance on manual design. Recent approaches using Large Language Models (LLMs) for 3D scene generation show promise but often lack domain-specific reasoning, verification mechanisms, and modular design. These limitations lead to reduced control and poor scalability. This paper investigates the use of LLMs to generate agricultural synthetic simulation environments from natural language prompts, specifically to address the limitations of lacking domain-specific reasoning, verification mechanisms, and modular design. A modular multi-LLM pipeline was developed, integrating 3D asset retrieval, domain knowledge injection, and code generation for the Unreal rendering engine using its API. This results in a 3D environment with realistic planting layouts and environmental context, all based on the input prompt and the domain knowledge. To enhance accuracy and scalability, the system employs a hybrid strategy combining LLM optimization techniques such as few-shot prompting, Retrieval-Augmented Generation (RAG), finetuning, and validation. Unlike monolithic models, the modular architecture enables structured data handling, intermediate verification, and flexible expansion. The system was evaluated using structured prompts and semantic accuracy metrics. A user study assessed realism and familiarity against real-world images, while an expert comparison demonstrated significant time savings over manual scene design. The results confirm the effectiveness of multi-LLM pipelines in automating domain-specific 3D scene generation with improved reliability and precision. Future work will explore expanding the asset hierarchy, incorporating real-time generation, and adapting the pipeline to other simulation domains beyond agriculture.
[CV-38] G-Field: Geometry-Aware Radiative Gaussian Fields for Tomographic Reconstruction AAAI2026
【Quick Read】: This paper addresses the severe artifacts and spatiotemporal inconsistency of 3D Gaussian Splatting (3DGS)-based CT reconstruction under extremely sparse-view projections and dynamic motion. The key to the solution is the Tomographic Geometry Field (TG-Field), a geometry-aware Gaussian deformation framework: a multi-resolution hash encoder captures local spatial priors to regularize primitive parameters under ultra-sparse settings; for dynamic reconstruction, time-conditioned representations and a spatiotemporal attention module adaptively aggregate features, resolving spatiotemporal ambiguities and enforcing temporal coherence; and a motion-flow network models fine-grained respiratory motion to track local anatomical deformations, together substantially improving reconstruction accuracy and robustness.
Link: https://arxiv.org/abs/2602.11705
Authors: Yuxiang Zhong,Jun Wei,Chaoqi Chen,Senyou An,Hui Huang
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAAI 2026. Project page: this https URL
Abstract:3D Gaussian Splatting (3DGS) has revolutionized 3D scene representation with superior efficiency and quality. While recent adaptations for computed tomography (CT) show promise, they struggle with severe artifacts under highly sparse-view projections and dynamic motions. To address these challenges, we propose Tomographic Geometry Field (TG-Field), a geometry-aware Gaussian deformation framework tailored for both static and dynamic CT reconstruction. A multi-resolution hash encoder is employed to capture local spatial priors, regularizing primitive parameters under ultra-sparse settings. We further extend the framework to dynamic reconstruction by introducing time-conditioned representations and a spatiotemporal attention block to adaptively aggregate features, thereby resolving spatiotemporal ambiguities and enforcing temporal coherence. In addition, a motion-flow network models fine-grained respiratory motion to track local anatomical deformations. Extensive experiments on synthetic and real-world datasets demonstrate that TG-Field consistently outperforms existing methods, achieving state-of-the-art reconstruction accuracy under highly sparse-view conditions.
[CV-39] Semantically Conditioned Diffusion Models for Cerebral DSA Synthesis
【Quick Read】: This paper addresses the difficulty of large-scale data collection and public sharing for digital subtraction angiography (DSA), which is invasive and costly to acquire. The key to the solution is a semantically conditioned latent diffusion model (LDM) that synthesizes arterial-phase cerebral DSA frames under explicit semantic control of anatomical circulation (anterior vs. posterior) and canonical C-arm positions, enabling precise control over the structure and clinical context of generated images. Trained on a curated large single-centre dataset of 99,349 frames with text embeddings encoding anatomy and imaging geometry, the synthetic DSA images reach expert ratings of 3.1-3.3 on a 5-grade Likert scale with high inter-rater reliability (ICC = 0.80-0.87) and good distributional similarity to real frames (FID = 15.27), supporting their clinical realism for downstream algorithm development, research, and training.
Link: https://arxiv.org/abs/2602.11703
Authors: Qiwen Xu,David Rügamer,Holger Wenz,Johann Fontana,Nora Meggyeshazi,Andreas Bender,Máté E. Maros
Institutions: LMU Munich; Heidelberg University; Munich Center for Machine Learning; Department of Biomedical Informatics (DBMI); Clinic for Diagnostic and Interventional Neuroradiology; BG Trauma Center Tuebingen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Digital subtraction angiography (DSA) plays a central role in the diagnosis and treatment of cerebrovascular disease, yet its invasive nature and high acquisition cost severely limit large-scale data collection and public data sharing. Therefore, we developed a semantically conditioned latent diffusion model (LDM) that synthesizes arterial-phase cerebral DSA frames under explicit control of anatomical circulation (anterior vs. posterior) and canonical C-arm positions. We curated a large single-centre DSA dataset of 99,349 frames and trained a conditional LDM using text embeddings that encoded anatomy and acquisition geometry. To assess clinical realism, four medical experts, including two neuroradiologists, one neurosurgeon, and one internal medicine expert, systematically rated 400 synthetic DSA images using a 5-grade Likert scale for evaluating proximal large, medium, and small peripheral vessels. The generated images achieved image-wise overall Likert scores ranging from 3.1 to 3.3, with high inter-rater reliability (ICC(2,k) = 0.80–0.87). Distributional similarity to real DSA frames was supported by a low median Fréchet inception distance (FID) of 15.27. Our results indicate that semantically controlled LDMs can produce realistic synthetic DSAs suitable for downstream algorithm development, research, and training.
[CV-40] OMEGA-Avatar: One-shot Modeling of 360° Gaussian Avatars
【Quick Read】: This paper addresses the difficulty of generating high-fidelity, animatable, 360°-complete head avatars from a single image; existing methods satisfy at most two of the three desired properties of being feed-forward, full-head, and animation-ready. The key to the solution is the OMEGA-Avatar framework with two novel components: a semantic-aware mesh deformation module that integrates multi-view normals to optimize a FLAME head with hair while preserving its topology, improving hair modeling; and a multi-view feature splatting module that builds a shared canonical UV representation via differentiable bilinear splatting, hierarchical UV mapping, and visibility-aware fusion, preserving global structural coherence and local high-frequency detail across viewpoints without per-instance optimization, ultimately yielding high-quality, 360°-consistent, animation-ready 3D Gaussian heads.
Link: https://arxiv.org/abs/2602.11693
Authors: Zehao Xia,Yiqun Wang,Zhengda Lu,Kai Liu,Jun Xiao,Peter Wonka
Institutions: Chongqing University; University of Chinese Academy of Sciences; KAUST
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:Creating high-fidelity, animatable 3D avatars from a single image remains a formidable challenge. We identified three desirable attributes of avatar generation: 1) the method should be feed-forward, 2) model a 360° full-head, and 3) should be animation-ready. However, current work addresses only two of the three points simultaneously. To address these limitations, we propose OMEGA-Avatar, the first feed-forward framework that simultaneously generates a generalizable, 360°-complete, and animatable 3D Gaussian head from a single image. Starting from a feed-forward and animatable framework, we address the 360° full-head avatar generation problem with two novel components. First, to overcome poor hair modeling in full-head avatar generation, we introduce a semantic-aware mesh deformation module that integrates multi-view normals to optimize a FLAME head with hair while preserving its topology structure. Second, to enable effective feed-forward decoding of full-head features, we propose a multi-view feature splatting module that constructs a shared canonical UV representation from features across multiple views through differentiable bilinear splatting, hierarchical UV mapping, and visibility-aware fusion. This approach preserves both global structural coherence and local high-frequency details across all viewpoints, ensuring 360° consistency without per-instance optimization. Extensive experiments demonstrate that OMEGA-Avatar achieves state-of-the-art performance, significantly outperforming existing baselines in 360° full-head completeness while robustly preserving identity across different viewpoints.
[CV-41] Beyond Pixels: Vector-to-Graph Transformation for Reliable Schematic Auditing ICASSP2026
【Quick Read】: This paper addresses the structural blindness of Multimodal Large Language Models (MLLMs) on engineering drawings: pixel-driven visual understanding cannot effectively capture the topology and symbolic logic of engineering schematics such as circuit diagrams. The key to the solution is a Vector-to-Graph (V2G) pipeline that converts CAD drawings into property graphs, with nodes representing components and edges encoding connectivity, making structural dependencies explicit and supporting machine-auditable reasoning. Experiments on an electrical-compliance benchmark show that V2G yields large accuracy gains across all error categories while leading MLLMs remain near chance level, confirming the importance of structure-aware representations for practical deployment in engineering domains.
Link: https://arxiv.org/abs/2602.11678
Authors: Chengwei Ma,Zhen Tian,Zhou Zhou,Zhixian Xu,Xiaowei Zhu,Xia Hua,Si Shi,F. Richard Yu
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, 3 figures. Accepted to ICASSP 2026
Abstract:Multimodal Large Language Models (MLLMs) have shown remarkable progress in visual understanding, yet they suffer from a critical limitation: structural blindness. Even state-of-the-art models fail to capture topology and symbolic logic in engineering schematics, as their pixel-driven paradigm discards the explicit vector-defined relations needed for reasoning. To overcome this, we propose a Vector-to-Graph (V2G) pipeline that converts CAD diagrams into property graphs where nodes represent components and edges encode connectivity, making structural dependencies explicit and machine-auditable. On a diagnostic benchmark of electrical compliance checks, V2G yields large accuracy gains across all error categories, while leading MLLMs remain near chance level. These results highlight the systemic inadequacy of pixel-based methods and demonstrate that structure-aware representations provide a reliable path toward practical deployment of multimodal AI in engineering domains. To facilitate further research, we release our benchmark and implementation at this https URL.
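To illustrate what a machine-auditable property graph looks like, here is a toy example with networkx; the component names and the two compliance rules are invented for illustration and are not taken from the paper's benchmark.

```python
import networkx as nx

# Nodes are components with type attributes; edges are electrical
# connections that would be parsed from the CAD vector geometry.
g = nx.Graph()
g.add_node("B1", type="breaker", rating_a=16)
g.add_node("S1", type="switch")
g.add_node("L1", type="lamp")
g.add_edge("B1", "S1", net="phase")
g.add_edge("S1", "L1", net="phase")

# Rule 1: no floating components (a disconnected node indicates a wiring error).
floating = [n for n in g.nodes if g.degree(n) == 0]
print("floating components:", floating)

# Rule 2: every lamp should be reachable from a breaker.
breakers = [n for n, d in g.nodes(data=True) if d["type"] == "breaker"]
for lamp in (n for n, d in g.nodes(data=True) if d["type"] == "lamp"):
    ok = any(nx.has_path(g, b, lamp) for b in breakers)
    print(f"{lamp} protected by a breaker: {ok}")
```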
[CV-42] RI-Mamba: Rotation-Invariant Mamba for Robust Text-to-Shape Retrieval
【速读】:该论文旨在解决文本到形状检索(text-to-shape retrieval)中现有方法对物体姿态敏感、仅支持少量类别且难以适应真实场景的问题,尤其在对象可能属于多样化类别并以任意方向出现时表现不佳。其解决方案的关键在于提出首个适用于点云的旋转不变状态空间模型 RI-Mamba:通过定义全局与局部参考系来解耦姿态与几何信息,并利用希尔伯特排序(Hilbert sorting)构建具有语义几何结构的 token 序列以保持旋转不变性;同时引入新颖的朝向嵌入计算策略,并通过特征级线性调制(feature-wise linear modulation)重新整合空间上下文,从而有效恢复几何感知能力并提升模型表达力。该方法天然兼容状态空间模型架构,且计算复杂度为线性时间,结合自动三元组生成的跨模态对比学习策略,实现了无需人工标注即可在多样化数据集上训练,最终在 OmniObject3D 基准测试中于超过 200 类物体上取得最优性能。
链接: https://arxiv.org/abs/2602.11673
作者: Khanh Nguyen,Dasith de Silva Edirimuni,Ghulam Mubashar Hassan,Ajmal Mian
机构: The University of Western Australia (西澳大利亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D assets have rapidly expanded in quantity and diversity due to the growing popularity of virtual reality and gaming. As a result, text-to-shape retrieval has become essential in facilitating intuitive search within large repositories. However, existing methods require canonical poses and support few object categories, limiting their real-world applicability where objects can belong to diverse classes and appear in random orientations. To address this challenge, we propose RI-Mamba, the first rotation-invariant state-space model for point clouds. RI-Mamba defines global and local reference frames to disentangle pose from geometry and uses Hilbert sorting to construct token sequences with meaningful geometric structure while maintaining rotation invariance. We further introduce a novel strategy to compute orientational embeddings and reintegrate them via feature-wise linear modulation, effectively recovering spatial context and enhancing model expressiveness. Our strategy is inherently compatible with state-space models and operates in linear time. To scale up retrieval, we adopt cross-modal contrastive learning with automated triplet generation, allowing training on diverse datasets without manual annotation. Extensive experiments demonstrate RI-Mamba’s superior representational capacity and robustness, achieving state-of-the-art performance on the OmniObject3D benchmark across more than 200 object categories under arbitrary orientations. Our code will be made available at this https URL.
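摘要中通过"特征级线性调制"(feature-wise linear modulation, FiLM)重新注入朝向嵌入的做法,可以用几行 PyTorch 表达其通用形式。下面是一个最小示意(模块名 `OrientationFiLM`、各维度均为假设,并非论文官方实现):

```python
import torch
import torch.nn as nn

class OrientationFiLM(nn.Module):
    """Feature-wise linear modulation: reintroduce orientational
    embeddings into rotation-invariant point-token features.
    Hypothetical sketch; names and dimensions are illustrative."""
    def __init__(self, feat_dim: int, emb_dim: int):
        super().__init__()
        # A small projection predicts per-channel scale and shift.
        self.to_gamma_beta = nn.Linear(emb_dim, feat_dim * 2)

    def forward(self, feats: torch.Tensor, orient_emb: torch.Tensor):
        # feats: (B, N, feat_dim); orient_emb: (B, N, emb_dim)
        gamma, beta = self.to_gamma_beta(orient_emb).chunk(2, dim=-1)
        # (1 + gamma) keeps the module close to identity at init.
        return (1 + gamma) * feats + beta

film = OrientationFiLM(feat_dim=128, emb_dim=32)
out = film(torch.randn(2, 1024, 128), torch.randn(2, 1024, 32))
print(out.shape)  # torch.Size([2, 1024, 128])
```

FiLM 只对通道做尺度与偏移调制,因而可以在不破坏主干序列结构的前提下补回空间上下文,这与摘要中"线性时间、与状态空间模型天然兼容"的说法是一致的。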
[CV-43] U-Net with Hadamard Transform and DCT Latent Spaces for Next-day Wildfire Spread Prediction
【速读】:该论文旨在解决野火次日蔓延预测中模型计算效率与预测精度难以兼顾的问题。其核心解决方案是提出一种轻量级深度学习模型——变换域融合UNet(Transform Domain Fusion UNet, TD-FusionUNet),该模型通过引入可训练的哈达玛变换(Hadamard Transform)和离散余弦变换(Discrete Cosine Transform)层,在正交化潜在空间中捕获关键的“频率”成分,从而实现对多模态卫星数据的有效特征提取;同时结合自定义预处理技术(如随机边缘裁剪和高斯混合模型)增强稀疏着火前掩膜的表征能力,提升模型泛化性能。实验表明,该方法在参数量仅为37万的情况下,F1分数达到0.591,显著优于基于ResNet18编码器的UNet基线模型,且更适用于资源受限环境下的实时野火预测应用。
链接: https://arxiv.org/abs/2602.11672
作者: Yingyi Luo,Shuaiang Rong,Adam Watts,Ahmet Enis Cetin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We developed a lightweight and computationally efficient tool for next-day wildfire spread prediction using multimodal satellite data as input. The deep learning model, which we call Transform Domain Fusion UNet (TD-FusionUNet), incorporates trainable Hadamard Transform and Discrete Cosine Transform layers that apply two-dimensional transforms, enabling the network to capture essential “frequency” components in orthogonalized latent spaces. Additionally, we introduce custom preprocessing techniques, including random margin cropping and a Gaussian mixture model, to enrich the representation of the sparse pre-fire masks and enhance the model’s generalization capability. The TD-FusionUNet is evaluated on two datasets: the Next-Day Wildfire Spread dataset released by Google Research in 2023 and the WildfireSpreadTS dataset. Our proposed TD-FusionUNet achieves an F1 score of 0.591 with 370k parameters, outperforming the UNet baseline using ResNet18 as the encoder reported in the WildfireSpreadTS dataset while using substantially fewer parameters. These results show that the proposed latent space fusion model balances accuracy and efficiency under a lightweight setting, making it suitable for real-time wildfire prediction applications in resource-limited environments.
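摘要所说的"可训练的 Hadamard 变换层"通常可理解为:先做二维正交变换,在变换域学习逐系数缩放,再逆变换回空间域。下面是按这一理解写的极简 PyTorch 示意(假设输入空间尺寸为 2 的幂;层名与参数化方式均为示例,非论文原始代码):

```python
import torch
import torch.nn as nn

def hadamard_matrix(n: int) -> torch.Tensor:
    """Sylvester construction of an orthonormal Hadamard matrix.
    n must be a power of two."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / float(n) ** 0.5

class TrainableHadamard2d(nn.Module):
    """2D Hadamard transform with learnable per-coefficient scaling
    in the transform domain (illustrative, not the paper's code)."""
    def __init__(self, size: int):
        super().__init__()
        self.register_buffer("H", hadamard_matrix(size))
        self.scale = nn.Parameter(torch.ones(size, size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, size, size). Forward transform: H x H^T.
        z = self.H @ x @ self.H.t()
        z = z * self.scale              # learn which coefficients matter
        return self.H.t() @ z @ self.H  # inverse (H is orthonormal)

layer = TrainableHadamard2d(64)
print(layer(torch.randn(1, 8, 64, 64)).shape)  # torch.Size([1, 8, 64, 64])
```

Hadamard 矩阵的元素只有 ±1,变换本身无需乘法即可实现,这也是它适合轻量化模型的原因;DCT 层可用同样的"正交变换 + 变换域缩放 + 逆变换"结构替换。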
[CV-44] Egocentric Gaze Estimation via Neck-Mounted Camera
【速读】:该论文旨在解决从颈部佩戴相机视角进行第一人称 gaze 估计(egocentric gaze estimation)的问题,现有研究主要聚焦于头戴式摄像头,而对其他视角(如颈挂式)的研究仍较为匮乏。解决方案的关键在于构建首个针对颈挂式视角的 gaze 估计数据集,并提出两种改进策略:一是引入辅助的“视线超出视野”分类任务以提升模型性能,二是设计一种多视角协同学习方法,通过几何感知的辅助损失联合训练头部视角与颈部视角模型。实验表明,引入视线边界分类任务可有效提升精度,而多视角协同学习未带来显著增益。
链接: https://arxiv.org/abs/2602.11669
作者: Haoyu Huang,Yoichi Sato
机构: The University of Tokyo(东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper introduces neck-mounted view gaze estimation, a new task that estimates user gaze from the neck-mounted camera perspective. Prior work on egocentric gaze estimation, which predicts device wearer’s gaze location within the camera’s field of view, mainly focuses on head-mounted cameras while alternative viewpoints remain underexplored. To bridge this gap, we collect the first dataset for this task, consisting of approximately 4 hours of video collected from 8 participants during everyday activities. We evaluate a transformer-based gaze estimation model, GLC, on the new dataset and propose two extensions: an auxiliary gaze out-of-bound classification task and a multi-view co-learning approach that jointly trains head-view and neck-view models using a geometry-aware auxiliary loss. Experimental results show that incorporating gaze out-of-bound classification improves performance over standard fine-tuning, while the co-learning approach does not yield gains. We further analyze these results and discuss implications for neck-mounted gaze estimation.
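摘要中的"视线超出视野(out-of-bound)分类"辅助任务,本质上是在视线回归损失之外加一个二分类头并加权求和;视线坐标只在"视线在视野内"的帧上监督。下面给出一个与 GLC 无关的假设性示意(特征维度、加权系数均为示例):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GazeWithOOBHead(nn.Module):
    """Gaze regression plus an auxiliary out-of-bound classifier.
    Hypothetical sketch; the backbone feature dim is illustrative."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.gaze_head = nn.Linear(feat_dim, 2)  # (x, y) in [0, 1]
        self.oob_head = nn.Linear(feat_dim, 1)   # logit: gaze out of FoV?

    def forward(self, feats):
        return torch.sigmoid(self.gaze_head(feats)), self.oob_head(feats)

def multitask_loss(pred_xy, oob_logit, gt_xy, gt_oob, lam: float = 0.5):
    # Supervise gaze location only on frames where gaze is inside the FoV.
    inside = (1.0 - gt_oob).unsqueeze(-1)             # (B, 1)
    reg = ((pred_xy - gt_xy) ** 2 * inside).sum() / inside.sum().clamp(min=1.0)
    cls = F.binary_cross_entropy_with_logits(oob_logit.squeeze(-1), gt_oob)
    return reg + lam * cls

model = GazeWithOOBHead()
xy, logit = model(torch.randn(4, 256))
loss = multitask_loss(xy, logit, torch.rand(4, 2),
                      torch.tensor([0.0, 1.0, 0.0, 0.0]))
loss.backward()
```

对颈挂式视角而言,视线落在画面外的帧远多于头戴式场景,显式建模这一情况正是该辅助任务能带来增益的直观原因。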
[CV-45] Clutt3R-Seg: Sparse-view 3D Instance Segmentation for Language-grounded Grasping in Cluttered Scenes ICRA2026
【速读】:该论文旨在解决在杂乱环境(cluttered environments)中实现可靠且鲁棒的3D实例分割问题,尤其针对语言引导的机器人抓取任务。此类场景常因遮挡、有限视角和噪声掩码导致感知性能下降。解决方案的关键在于提出一种零样本(zero-shot)管道Clutt3R-Seg,其核心创新是引入层次化的语义线索实例树(hierarchical instance tree of semantic cues),利用噪声掩码作为信息性线索而非单纯进行修正:通过跨视角分组(cross-view grouping)与条件替换(conditional substitution)机制,抑制过分割和欠分割现象,从而生成视图一致的掩码并构建鲁棒的3D实例;同时,每个实例嵌入开放词汇语义嵌入(open-vocabulary semantic embeddings),支持从自然语言指令中精准定位目标对象。此外,为应对多阶段任务中的场景变化,进一步设计了一种一致性感知更新策略,仅需单张交互后图像即可保持实例对应关系,无需重新扫描即可高效适应。
链接: https://arxiv.org/abs/2602.11660
作者: Jeongho Noh,Tai Hyoung Rhee,Eunho Lee,Jeongyun Kim,Sunwoo Lee,Ayoung Kim
机构: Seoul National University (首尔国立大学); Hyundai Motor Company (现代汽车公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted to ICRA 2026. 9 pages, 8 figures
Abstract:Reliable 3D instance segmentation is fundamental to language-grounded robotic manipulation. Its critical application lies in cluttered environments, where occlusions, limited viewpoints, and noisy masks degrade perception. To address these challenges, we present Clutt3R-Seg, a zero-shot pipeline for robust 3D instance segmentation for language-grounded grasping in cluttered scenes. Our key idea is to introduce a hierarchical instance tree of semantic cues. Unlike prior approaches that attempt to refine noisy masks, our method leverages them as informative cues: through cross-view grouping and conditional substitution, the tree suppresses over- and under-segmentation, yielding view-consistent masks and robust 3D instances. Each instance is enriched with open-vocabulary semantic embeddings, enabling accurate target selection from natural language instructions. To handle scene changes during multi-stage tasks, we further introduce a consistency-aware update that preserves instance correspondences from only a single post-interaction image, allowing efficient adaptation without rescanning. Clutt3R-Seg is evaluated on both synthetic and real-world datasets, and validated on a real robot. Across all settings, it consistently outperforms state-of-the-art baselines in cluttered and sparse-view scenarios. Even on the most challenging heavy-clutter sequences, Clutt3R-Seg achieves an AP@25 of 61.66, over 2.2x higher than baselines, and with only four input views it surpasses MaskClustering with eight views by more than 2x. The code is available at: this https URL.
[CV-46] EmoSpace: Fine-Grained Emotion Prototype Learning for Immersive Affective Content Generation
【速读】:该论文旨在解决当前生成式AI在虚拟现实(VR)内容创作中难以捕捉细腻情感语义及缺乏细粒度情感控制的问题。现有方法虽能降低情感丰富内容的制作门槛,但无法实现对情绪层次的精准建模与灵活调控,从而限制了沉浸式体验的效果。解决方案的关键在于提出EmoSpace框架,其核心创新是通过视觉-语言对齐学习动态且可解释的情感原型(emotion prototypes),构建分层情感表示体系,并引入多原型引导、时间混合与注意力重加权的可控生成流程,使模型能够在无需显式情感标签的情况下实现细粒度情绪控制,从而提升VR环境中情感感知的真实性和多样性。
链接: https://arxiv.org/abs/2602.11658
作者: Bingyuan Wang,Xingbei Chen,Zongyang Qiu,Linping Yuan,Zeyu Wang
机构: The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Emotion is important for creating compelling virtual reality (VR) content. Although some generative methods have been applied to lower the barrier to creating emotionally rich content, they fail to capture the nuanced emotional semantics and the fine-grained control essential for immersive experiences. To address these limitations, we introduce EmoSpace, a novel framework for emotion-aware content generation that learns dynamic, interpretable emotion prototypes through vision-language alignment. We employ a hierarchical emotion representation with rich learnable prototypes that evolve during training, enabling fine-grained emotional control without requiring explicit emotion labels. We develop a controllable generation pipeline featuring multi-prototype guidance, temporal blending, and attention reweighting that supports diverse applications, including emotional image outpainting, stylized generation, and emotional panorama generation for VR environments. Our experiments demonstrate the superior performance of EmoSpace over existing methods in both qualitative and quantitative evaluations. Additionally, we present a comprehensive user study investigating how VR environments affect emotional perception compared to desktop settings. Our work facilitates immersive visual content generation with fine-grained emotion control and supports applications like therapy, education, storytelling, artistic creation, and cultural preservation. Code and models will be made publicly available.
[CV-47] SToRM: Supervised Token Reduction for Multi-modal LLM s toward efficient end-to-end autonomous driving
【速读】:该论文旨在解决多模态大语言模型(Multi-modal Large Language Model, MLLM)在端到端(End-to-End, E2E)自动驾驶系统中因视觉令牌(visual tokens)数量庞大而导致计算资源消耗过高、难以部署于车载有限算力环境的问题。现有方法虽尝试减少视觉令牌以提升效率,但常导致任务性能下降。其解决方案的关键在于提出首个面向多模态大语言模型的监督式令牌压缩框架(Supervised Token Reduction framework for multi-modal LLMs, SToRM),包含三个核心组件:一是基于短时滑动窗口的轻量级重要性预测器,用于动态评估令牌重要性;二是通过辅助路径获取来自全令牌输入的大语言模型的伪监督信号,实现监督训练;三是锚点-上下文合并模块,将令牌分为锚点与上下文并进行融合,有效降低冗余同时最小化信息损失。实验表明,SToRM在LangAuto基准上可在仅使用原计算预算约3.3%的情况下,保持与全令牌模型相当的性能,显著降低高达30倍的计算成本。
链接: https://arxiv.org/abs/2602.11656
作者: Seo Hyun Kim,Jin Bok Park,Do Yeon Koo,Ho Gun Park,Il Yong Chun
机构: Sungkyunkwan University (成均馆大学); Institute for Basic Science (基础科学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:In autonomous driving, end-to-end (E2E) driving systems that predict control commands directly from sensor data have achieved significant advancements. For safe driving in unexpected scenarios, these systems may additionally rely on human interventions such as natural language instructions. Using a multi-modal large language model (MLLM) facilitates human-vehicle interaction and can improve performance in such scenarios. However, this approach requires substantial computational resources due to its reliance on an LLM and numerous visual tokens from sensor inputs, which are limited in autonomous vehicles. Many MLLM studies have explored reducing visual tokens, but often suffer from end-task performance degradation compared to using all tokens. To enable efficient E2E driving while maintaining performance comparable to using all tokens, this paper proposes the first Supervised Token Reduction framework for multi-modal LLMs (SToRM). The proposed framework consists of three key elements. First, a lightweight importance predictor with short-term sliding windows estimates token importance scores. Second, a supervised training approach uses an auxiliary path to obtain pseudo-supervision signals from an all-token LLM pass. Third, an anchor-context merging module partitions tokens into anchors and context tokens, and merges context tokens into relevant anchors to reduce redundancy while minimizing information loss. Experiments on the LangAuto benchmark show that SToRM outperforms state-of-the-art E2E driving MLLMs under the same reduced-token budget, maintaining all-token performance while reducing computational cost by up to 30x.
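摘要第三个要点"锚点-上下文合并"可以粗略理解为:按重要性分数保留 Top-K 个 token 作为锚点,其余上下文 token 按余弦相似度并入最相近的锚点并取均值。下面是这一思路的假设性示意(并非 SToRM 官方实现):

```python
import torch
import torch.nn.functional as F

def anchor_context_merge(tokens: torch.Tensor, scores: torch.Tensor, k: int):
    """Keep the top-k tokens as anchors and average each remaining
    context token into its most similar anchor. Illustrative sketch.
    tokens: (N, D); scores: (N,) importance; returns (k, D)."""
    order = scores.argsort(descending=True)
    anchors, context = tokens[order[:k]], tokens[order[k:]]

    # Assign every context token to its most cosine-similar anchor.
    sim = F.normalize(context, dim=-1) @ F.normalize(anchors, dim=-1).t()
    assign = sim.argmax(dim=-1)

    merged = anchors.clone()
    for j in range(k):
        members = context[assign == j]
        if len(members):
            merged[j] = (anchors[j] + members.sum(0)) / (1 + len(members))
    return merged

reduced = anchor_context_merge(torch.randn(196, 64), torch.rand(196), k=32)
print(reduced.shape)  # torch.Size([32, 64])
```

相比直接丢弃低分 token,这种合并方式把上下文信息折叠进锚点,正对应摘要中"降低冗余同时最小化信息损失"的说法。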
[CV-48] GR-Diffusion: 3D Gaussian Representation Meets Diffusion in Whole-Body PET Reconstruction
【速读】:该论文旨在解决低剂量三维全身影像正电子发射断层扫描(PET)重建中常见的噪声放大、结构模糊和细节丢失问题,这些问题通常由稀疏采样及逆问题的病态性引起。其解决方案的关键在于提出了一种名为GR-Diffusion的新框架,该框架将离散高斯表示(Discrete Gaussian Representation, GR)的几何先验与扩散模型的生成能力相结合:首先利用GR从投影数据中生成一个具有物理意义且结构明确的参考3D PET图像,作为重建过程中的基准;随后在扩散过程中引入分层引导机制——细粒度引导基于局部差异优化细节,粗粒度引导通过多尺度差异图校正全局偏差——从而实现几何先验与亚体素信息恢复的协同整合,显著提升图像质量和生理特征保真度。
链接: https://arxiv.org/abs/2602.11653
作者: Mengxiao Geng,Zijie Chen,Ran Hong,Bingxuan Li,Qiegen Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Positron emission tomography (PET) reconstruction is a critical challenge in molecular imaging, often hampered by noise amplification, structural blurring, and detail loss due to sparse sampling and the ill-posed nature of inverse problems. The three-dimensional discrete Gaussian representation (GR), which efficiently encodes 3D scenes using parameterized discrete Gaussian distributions, has shown promise in computer vision. In this work, we propose a novel GR-Diffusion framework that synergistically integrates the geometric priors of GR with the generative power of diffusion models for 3D low-dose whole-body PET reconstruction. GR-Diffusion employs GR to generate a reference 3D PET image from projection data, establishing a physically grounded and structurally explicit benchmark that overcomes the low-pass limitations of conventional point-based or voxel-based methods. This reference image serves as a dual guide during the diffusion process, ensuring both global consistency and local accuracy. Specifically, we employ a hierarchical guidance mechanism based on the GR reference. Fine-grained guidance leverages differences to refine local details, while coarse-grained guidance uses multi-scale difference maps to correct deviations. This strategy allows the diffusion model to sequentially integrate the strong geometric prior from GR and recover sub-voxel information. Experimental results on the UDPET and Clinical datasets with varying dose levels show that GR-Diffusion outperforms state-of-the-art methods in enhancing 3D whole-body PET image quality and preserving physiological details.
[CV-49] Brain Tumor Classifiers Under Attack: Robustness of ResNet Variants Against Transferable FGSM and PGD Attacks
【速读】:该论文旨在解决深度学习模型在脑肿瘤分类任务中对抗鲁棒性不足的问题,尤其是在临床场景下使用MRI数据时面临的潜在安全风险。其解决方案的关键在于系统评估多种基于ResNet架构(BrainNet、BrainNeXt与DilationNet)在不同预处理配置下的对抗脆弱性,发现模型结构设计(如BrainNeXt的高基数性)和输入数据分辨率(如缩小且未增强的数据)对鲁棒性具有显著影响,从而强调了在实际部署前必须同时考量分类性能与对抗鲁棒性的必要性。
链接: https://arxiv.org/abs/2602.11646
作者: Ryan Deem,Garrett Goodman,Waqas Majeed,Md Abdullah Al Hafiz Khan,Michail S. Alexiou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Adversarial robustness in deep learning models for brain tumor classification remains an underexplored yet critical challenge, particularly for clinical deployment scenarios involving MRI data. In this work, we investigate the susceptibility and resilience of several ResNet-based architectures, referred to as BrainNet, BrainNeXt and DilationNet, against gradient-based adversarial attacks, namely FGSM and PGD. These models, based on ResNet, ResNeXt, and dilated ResNet variants respectively, are evaluated across three preprocessing configurations: (i) full-sized augmented, (ii) shrunk augmented and (iii) shrunk non-augmented MRI datasets. Our experiments reveal that BrainNeXt models exhibit the highest robustness to black-box attacks, likely due to their increased cardinality, though they produce weaker transferable adversarial samples. In contrast, BrainNet and DilationNet models are more vulnerable to attacks from each other, especially under PGD with higher iteration steps and α values. Notably, shrunk and non-augmented data significantly reduce model resilience, even when the untampered test accuracy remains high, highlighting a key trade-off between input resolution and adversarial vulnerability. These results underscore the importance of jointly evaluating classification performance and adversarial robustness for reliable real-world deployment in brain MRI analysis.
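论文评测的 FGSM 与 PGD 是标准的梯度攻击:FGSM 沿损失对输入梯度的符号方向走一步;PGD 迭代多步,并在每步后把扰动投影回 L∞ 的 ε-球内。下面是两种攻击的常规 PyTorch 写法(与论文中的模型无关,仅演示攻击本身;假设输入像素已归一化到 [0, 1]):

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """Single-step FGSM: x_adv = clip(x + eps * sign(grad_x L))."""
    x = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def pgd(model, x, y, eps, alpha, steps):
    """Iterative PGD with random start and projection onto the
    L-inf ball of radius eps around the clean input."""
    x_orig = x.clone().detach()
    x_adv = (x_orig + torch.empty_like(x_orig).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.clone().detach().requires_grad_(True)
        F.cross_entropy(model(x_adv), y).backward()
        with torch.no_grad():
            x_adv = x_adv + alpha * x_adv.grad.sign()
            x_adv = x_orig + (x_adv - x_orig).clamp(-eps, eps)  # project
            x_adv = x_adv.clamp(0, 1)  # keep a valid image
    return x_adv.detach()
```

论文中的"可迁移攻击"即用一个模型按上述方式生成对抗样本,再拿去攻击另一个黑盒模型。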
[CV-50] ViTaS: Visual Tactile Soft Fusion Contrastive Learning for Visuomotor Learning ICRA2026
【速读】:该论文旨在解决当前机器人操作中视觉与触觉信息融合方法存在的局限性,特别是现有方法多采用直接拼接(concatenation)的方式整合多模态特征,忽视了视觉与触觉之间的内在互补性,导致在遮挡场景下性能下降,且特征对齐利用不足,限制了实际应用潜力。其解决方案的关键在于提出ViTaS框架,引入软融合对比学习(Soft Fusion Contrastive Learning)和条件变分自编码器(CVAE)模块,以更有效地挖掘并利用视觉-触觉表征间的对齐关系与互补特性,从而提升模型在复杂环境下的鲁棒性和泛化能力。
链接: https://arxiv.org/abs/2602.11643
作者: Yufeng Tian,Shuiqi Cheng,Tianming Wei,Tianxing Zhou,Yuanhang Zhang,Zixian Liu,Qianwei Han,Zhecheng Yuan,Huazhe Xu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Published to ICRA 2026
Abstract:Tactile information plays a crucial role in human manipulation tasks and has recently garnered increasing attention in robotic manipulation. However, existing approaches mostly focus on the alignment of visual and tactile features and the integration mechanism tends to be direct concatenation. Consequently, they struggle to effectively cope with occluded scenarios due to neglecting the inherent complementary nature of both modalities and the alignment may not be exploited enough, limiting the potential of their real-world deployment. In this paper, we present ViTaS, a simple yet effective framework that incorporates both visual and tactile information to guide the behavior of an agent. We introduce Soft Fusion Contrastive Learning, an advanced version of conventional contrastive learning method and a CVAE module to utilize the alignment and complementarity within visuo-tactile representations. We demonstrate the effectiveness of our method in 12 simulated and 3 real-world environments, and our experiments show that ViTaS significantly outperforms existing baselines. Project page: this https URL.
[CV-51] Electrostatics-Inspired Surface Reconstruction (EISR): Recovering 3D Shapes as a Superposition of Poissons PDE Solutions
【速读】:该论文旨在解决三维形状表面重建中对高频细节逼近能力不足的问题,尤其在仅有少量形状先验信息的情况下。其解决方案的关键在于将表面重建问题重新建模为求解一个代理偏微分方程(PDE)——泊松方程(Poisson’s equation),而非传统的Eikonal方程。作者利用格林函数(Green’s functions)获得该方程的闭式解析表达式,并基于泊松方程的线性特性,通过叠加多个基本解来构造目标形状的隐式场表示,从而有效提升对复杂几何细节的恢复能力。
链接: https://arxiv.org/abs/2602.11642
作者: Diego Patiño,Knut Peterson,Kostas Daniilidis,David K. Han
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Implicit shape representations, such as SDFs, are a popular approach to recovering the surface of a 3D shape as the level sets of a scalar field. Several methods approximate SDFs using machine learning strategies that exploit the knowledge that SDFs are solutions of the Eikonal partial differential equation (PDE). In this work, we present a novel approach to surface reconstruction by encoding it as a solution to a proxy PDE, namely Poisson’s equation. Then, we explore the connection between Poisson’s equation and physics, e.g., the electrostatic potential due to a positive charge density. We employ Green’s functions to obtain a closed-form parametric expression for the PDE’s solution, and leverage the linearity of our proxy PDE to find the target shape’s implicit field as a superposition of solutions. Our method shows improved results in approximating high-frequency details, even with a small number of shape priors.
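摘要的物理类比可以直接写成公式:三维泊松方程的格林函数给出点电荷的静电势 φ(x) = Σᵢ qᵢ / (4π‖x − pᵢ‖),目标形状即 φ 的某个等值面,而解的可叠加性正来自泊松方程的线性。下面用 NumPy 演示这一叠加(电荷位置、数量与查询点均为随意示例):

```python
import numpy as np

def potential(query, charges, q, eps=1e-8):
    """Superposed electrostatic potential of point charges, i.e. the
    Green's-function solution of Poisson's equation in 3D:
    phi(x) = sum_i q_i / (4 * pi * |x - p_i|)."""
    d = np.linalg.norm(query[:, None, :] - charges[None, :, :], axis=-1)
    return (q[None, :] / (4 * np.pi * (d + eps))).sum(axis=1)

# Charges spread over the unit sphere: the superposed field is
# roughly spherically symmetric, so its level sets approximate spheres.
rng = np.random.default_rng(0)
p = rng.normal(size=(64, 3))
p /= np.linalg.norm(p, axis=1, keepdims=True)
q = np.ones(64)

queries = np.array([[0.0, 0.0, 0.0], [0.5, 0.5, 0.0], [2.0, 0.0, 0.0]])
print(potential(queries, p, q))  # decays with distance from the charges
```

学习问题因此变成拟合电荷位置 pᵢ 与电荷量 qᵢ,使某条等值线贴合目标表面;这只是对摘要思路的直观演示,并非论文的训练流程。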
[CV-52] ScalSelect: Scalable Training-Free Multimodal Data Selection for Efficient Visual Instruction Tuning
【速读】:该论文旨在解决大规模视觉指令微调(Large-scale Visual Instruction Tuning, VIT)中因数据冗余导致的训练计算成本高、效率低的问题。现有数据选择方法要么需要昂贵的训练或梯度计算,要么依赖代理模型、辅助数据集或指令无关的表示,且常采用二次复杂度的成对相似性比较,限制了可扩展性和表征保真度。其解决方案的关键在于提出一种可扩展的、无需训练的多模态数据选择方法——ScalSelect:首先通过提取目标视觉语言模型(VLM)中被指令令牌最关注的视觉特征来构建样本表示,从而捕捉与指令相关的语义信息;随后通过识别能最好逼近全数据集表示主导子空间的样本,实现线性时间复杂度的重要性评分,无需成对比较。实验表明,使用仅16%的数据即可达到全数据训练97.5%以上的性能,某些场景甚至超越全数据训练。
链接: https://arxiv.org/abs/2602.11636
作者: Changti Wu,Jiahuai Mao,Yuzhuo Miao,Shijie Lian,Bin Yu,Xiaopeng Lin,Cong Huang,Lei Zhang,Kai Chen
机构: East China Normal University (华东师范大学); Zhongguancun Academy (中关村学院); The Hong Kong Polytechnic University (香港理工大学); Harbin Institute of Technology (哈尔滨工业大学); Huazhong University of Science and Technology (华中科技大学); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); Zhongguancun Institute of Artificial Intelligence (中关村人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: The code is available at this https URL (ScalSelect)
Abstract:Large-scale Visual Instruction Tuning (VIT) has become a key paradigm for advancing the performance of vision-language models (VLMs) across various multimodal tasks. However, training on the large-scale datasets is computationally expensive and inefficient due to redundancy in the data, which motivates the need for multimodal data selection to improve training efficiency. Existing data selection methods for VIT either require costly training or gradient computation. Training-free alternatives often depend on proxy models or datasets, instruction-agnostic representations, and pairwise similarity with quadratic complexity, limiting scalability and representation fidelity. In this work, we propose ScalSelect, a scalable training-free multimodal data selection method with linear-time complexity with respect to the number of samples, eliminating the need for external models or auxiliary datasets. ScalSelect first constructs sample representations by extracting visual features most attended by instruction tokens in the target VLM, capturing instruction-relevant information. It then identifies samples whose representations best approximate the dominant subspace of the full dataset representations, enabling scalable importance scoring without pairwise comparisons. Extensive experiments across multiple VLMs, datasets, and selection budgets demonstrate that ScalSelect achieves over 97.5% of the performance of training on the full dataset using only 16% of the data, and even outperforms full-data training in some settings. The code is available at this https URL.
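"识别最能逼近全数据集表示主导子空间的样本"的一种直观做法是:对表示矩阵做截断 SVD 取前 k 个右奇异向量,再按每个样本表示在该子空间上的投影能量打分,打分这一步对样本数是线性的。下面的 NumPy 示意是对这一思路的合理解读(不一定与论文的打分方式一致):

```python
import numpy as np

def subspace_select(reps: np.ndarray, k: int, budget: int) -> np.ndarray:
    """Score each sample by its energy in the dominant rank-k subspace
    of the dataset representations; keep the highest-scoring ones.
    A plausible reading of the idea, not the official ScalSelect code."""
    # reps: (N, D), one representation vector per sample.
    _, _, vt = np.linalg.svd(reps, full_matrices=False)
    basis = vt[:k]                    # (k, D) dominant right singular vectors
    proj = reps @ basis.T             # (N, k) subspace coordinates
    scores = (proj ** 2).sum(axis=1)  # projection energy per sample
    return np.argsort(-scores)[:budget]

reps = np.random.default_rng(1).normal(size=(1000, 128))
keep = subspace_select(reps, k=16, budget=160)   # e.g. a 16% budget
print(keep.shape)  # (160,)
```

与两两相似度比较的 O(N²) 方法不同,这里每个样本只需与 k 个基向量做内积,这正是摘要强调"无需成对比较"的意义所在。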
[CV-53] PLESS: Pseudo-Label Enhancement with Spreading Scribbles for Weakly Supervised Segmentation
【速读】:该论文旨在解决弱监督医学图像分割中基于涂鸦标注(scribble annotations)方法因噪声和不完整监督导致的性能瓶颈问题。其核心挑战在于伪标签(pseudo-labels)质量受限,从而影响分割精度。解决方案的关键在于提出一种通用的伪标签增强策略PLESS,该策略通过构建图像的层次化空间连贯区域划分,将涂鸦信息传播到语义一致区域内以优化伪标签的可靠性与空间一致性,且该框架具有模型无关性,可无缝集成至现有伪标签方法中,实验证明其在多个算法和数据集上均能提升分割准确性。
链接: https://arxiv.org/abs/2602.11628
作者: Yeva Gabrielyan(1),Varduhi Yeghiazaryan(1),Irina Voiculescu(2) ((1) Akian College of Science and Engineering, American University of Armenia, Yerevan, Armenia, (2) Department of Computer Science, University of Oxford, Oxford, UK)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: This work was supported by the Afeyan Family Foundation Seed Grants and the JACE Foundation Research Innovation Grant Program at AUA
Abstract:Weakly supervised learning with scribble annotations uses sparse user-drawn strokes to indicate segmentation labels on a small subset of pixels. This annotation reduces the cost of dense pixel-wise labeling, but suffers inherently from noisy and incomplete supervision. Recent scribble-based approaches in medical image segmentation address this limitation using pseudo-label-based training; however, the quality of the pseudo-labels remains a key performance limit. We propose PLESS, a generic pseudo-label enhancement strategy which improves reliability and spatial consistency. It builds on a hierarchical partitioning of the image into spatially coherent regions. PLESS propagates scribble information to refine pseudo-labels within semantically coherent regions. The framework is model-agnostic and easily integrates into existing pseudo-label methods. Experiments on two public cardiac MRI datasets (ACDC and MSCMRseg) across four scribble-supervised algorithms show consistent improvements in segmentation accuracy. Code will be made available on GitHub upon acceptance.
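"在空间连贯区域内传播涂鸦信息"最朴素的形式是:先对图像做区域划分,对包含涂鸦像素的区域,用涂鸦标签的多数票覆盖该区域内的伪标签。下面用 SLIC 超像素代替论文中的层次化划分来演示这一思路(仅为示意,假设输入为三通道图像):

```python
import numpy as np
from skimage.segmentation import slic

def propagate_scribbles(image, scribble, pseudo_label, n_segments=400):
    """Refine a pseudo-label by spreading scribble labels inside
    spatially coherent regions. scribble: (H, W) ints, -1 = unlabeled.
    SLIC superpixels stand in for the paper's hierarchical partition."""
    regions = slic(image, n_segments=n_segments, compactness=10,
                   start_label=0)
    refined = pseudo_label.copy()
    for r in np.unique(regions):
        mask = regions == r
        labels = scribble[mask]
        labels = labels[labels >= 0]       # scribbled pixels in this region
        if labels.size:
            # Majority scribble label wins inside the region.
            refined[mask] = np.bincount(labels).argmax()
    return refined

img = np.random.rand(128, 128, 3)
scr = -np.ones((128, 128), dtype=int)
scr[30:35, 30:60] = 1                      # a short foreground stroke
out = propagate_scribbles(img, scr, np.zeros((128, 128), dtype=int))
print(out.sum() > 0)  # the stroke's region is now labeled 1
```

论文使用的是层次化区域划分(从粗到细的区域树),上面用单层超像素只是为了表达"涂鸦在同质区域内扩散"这一核心机制。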
[CV-54] PLOT-CT: Pre-log Voronoi Decomposition Assisted Generation for Low-dose CT Reconstruction
【速读】:该论文旨在解决低剂量计算机断层扫描(Low-dose Computed Tomography, LDCT)重建中因辐射剂量降低导致的严重噪声和数据保真度下降问题。现有方法通常在图像域或对数后投影域进行处理,未能充分利用原始预对数(pre-log)测量数据中的结构信息,且对噪声极为敏感,尤其在进行对数变换时会显著放大噪声,从而对重建精度提出极高要求。其解决方案的关键在于提出PLOT-CT框架,通过 Voronoi分解(Voronoi decomposition)将预对数sinogram分解为多个独立的潜在成分,并分别嵌入不同的潜在空间中,实现对数据结构的显式解耦,从而增强模型学习判别特征的能力,有效抑制噪声并保留预对数域内的本质信息,最终显著提升重建质量,在1e4入射光子水平下相较传统方法实现2.36dB的峰值信噪比(PSNR)提升。
链接: https://arxiv.org/abs/2602.11625
作者: Bin Huang,Xun Yu,Yikun Zhang,Yi Zhang,Yang Chen,Qiegen Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Low-dose computed tomography (LDCT) reconstruction is fundamentally challenged by severe noise and compromised data fidelity under reduced radiation exposure. Most existing methods operate either in the image or post-log projection domain, which fails to fully exploit the rich structural information in pre-log measurements while being highly susceptible to noise. The requisite logarithmic transformation critically amplifies noise within these data, imposing exceptional demands on reconstruction precision. To overcome these challenges, we propose PLOT-CT, a novel framework for Pre-Log vOronoi decomposiTion-assisted CT generation. Our method begins by applying Voronoi decomposition to pre-log sinograms, disentangling the data into distinct underlying components, which are embedded in separate latent spaces. This explicit decomposition significantly enhances the model’s capacity to learn discriminative features, directly improving reconstruction accuracy by mitigating noise and preserving information inherent in the pre-log domain. Extensive experiments demonstrate that PLOT-CT achieves state-of-the-art performance, attaining a 2.36dB PSNR improvement over traditional methods at the 1e4 incident photon level in the pre-log domain.
[CV-55] ABot-N0: Technical Report on the VLA Foundation Model for Versatile Embodied Navigation
【速读】:该论文旨在解决具身导航(Embodied Navigation)领域长期存在的任务特异性架构碎片化问题,即不同导航任务(如点目标导航、物体目标导航、指令跟随等)通常依赖独立模型,难以实现通用性和高效迁移。其解决方案的关键在于提出一个统一的视觉-语言-动作基础模型 ABot-N0,采用分层“大脑-动作”架构:上层基于大语言模型(LLM)的认知大脑(Cognitive Brain)负责语义推理与任务理解,下层基于流匹配(Flow Matching)的动作专家(Action Expert)实现高精度连续轨迹生成。该设计实现了五类核心导航任务的“大一统”,并通过自研的 ABot-N0 数据引擎构建大规模高质量数据集(16.9M 轨迹和 5.0M 推理样本),显著提升模型泛化能力与性能,在7个基准测试中达到新最优(SOTA)。
链接: https://arxiv.org/abs/2602.11598
作者: Zedong Chu,Shichao Xie,Xiaolong Wu,Yanfen Shen,Minghua Luo,Zhengbo Wang,Fei Liu,Xiaoxu Leng,Junjun Hu,Mingyang Yin,Jia Lu,Yingnan Guo,Kai Yang,Jiawei Han,Xu Chen,Yanqing Zhu,Yuxiang Zhao,Xin Liu,Yirong Yang,Ye He,Jiahang Wang,Yang Cai,Tianlin Zhang,Li Gao,Liu Liu,Mingchao Sun,Fan Jiang,Chiyu Wang,Zhicheng Liu,Hongyu Pan,Honglin Han,Zhining Gu,Kuan Yang,Jianfang Zhang,Di Jing,Zihao Guan,Wei Guo,Guoqing Liu,Di Yang,Xiangpo Yang,Menglin Yang,Hongguang Xing,Weiguo Li,Mu Xu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:Embodied navigation has long been fragmented by task-specific architectures. We introduce ABot-N0, a unified Vision-Language-Action (VLA) foundation model that achieves a “Grand Unification” across 5 core tasks: Point-Goal, Object-Goal, Instruction-Following, POI-Goal, and Person-Following. ABot-N0 utilizes a hierarchical “Brain-Action” architecture, pairing an LLM-based Cognitive Brain for semantic reasoning with a Flow Matching-based Action Expert for precise, continuous trajectory generation. To support large-scale learning, we developed the ABot-N0 Data Engine, curating 16.9M expert trajectories and 5.0M reasoning samples across 7,802 high-fidelity 3D scenes (10.7 km²). ABot-N0 achieves new SOTA performance across 7 benchmarks, significantly outperforming specialized models. Furthermore, our Agentic Navigation System integrates a planner with hierarchical topological memory, enabling robust, long-horizon missions in dynamic real-world environments.
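摘要中"基于流匹配(Flow Matching)的动作专家"的训练目标,常见写法是在直线插值 x_t = (1−t)·x₀ + t·x₁ 上回归恒定速度场 v = x₁ − x₀(即 rectified flow 形式)。下面是一个与 ABot-N0 无关的最小训练步示意(网络结构、动作维度与条件输入均为假设):

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Tiny stand-in for an action expert: predicts the velocity
    field v_theta(x_t, t | cond). Purely illustrative architecture."""
    def __init__(self, act_dim=7, cond_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim + 1 + cond_dim, 256), nn.SiLU(),
            nn.Linear(256, act_dim))

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, t, cond], dim=-1))

def flow_matching_loss(model, x1, cond):
    """Regress the constant velocity v = x1 - x0 along the straight
    path x_t = (1 - t) * x0 + t * x1 (rectified-flow objective)."""
    x0 = torch.randn_like(x1)        # noise endpoint
    t = torch.rand(x1.shape[0], 1)   # uniform time in [0, 1]
    x_t = (1 - t) * x0 + t * x1
    return ((model(x_t, t, cond) - (x1 - x0)) ** 2).mean()

model = VelocityNet()
loss = flow_matching_loss(model, torch.randn(8, 7), torch.randn(8, 64))
loss.backward()
```

推理时从噪声 x₀ 出发,沿学到的速度场做少量 ODE 积分步即可得到连续动作轨迹,这是流匹配相对扩散模型在实时控制场景下的主要吸引力。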
[CV-56] A Large Language Model for Disaster Structural Reconnaissance Summarization
【速读】:该论文旨在解决传统基于视觉的结构健康监测(Vision-based Structural Health Monitoring, SHM)系统在灾后快速勘测中输出结果离散、难以直接用于工程评估与决策的问题。现有方法通常仅生成损伤类别标签或区域坐标等离散信息,需人工进一步整理分析,效率低下且易引入误差。其解决方案的关键在于提出一种基于大语言模型(Large Language Models, LLMs)的灾害勘测摘要框架(LLM-DRS),通过标准化现场调查流程采集图像数据与文本元数据,并利用预训练深度卷积神经网络提取关键属性(如损伤状态、材料类型和损伤等级),最终将多模态数据整合输入至精心设计提示词(prompt)的LLM中,自动生成针对单个结构或受灾区域的结构化勘测摘要报告,从而显著提升灾后快速响应与决策效率。
链接: https://arxiv.org/abs/2602.11588
作者: Yuqing Gao,Guanren Zhou,Khalid M. Mosalam
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 4 figures. Presented at the 18th World Conference on Earthquake Engineering (18WCEE 2024)
Abstract:Artificial Intelligence (AI)-aided vision-based Structural Health Monitoring (SHM) has emerged as an effective approach for monitoring and assessing structural condition by analyzing image and video data. By integrating Computer Vision (CV) and Deep Learning (DL), vision-based SHM can automatically identify and localize visual patterns associated with structural damage. However, previous works typically generate only discrete outputs, such as damage class labels and damage region coordinates, requiring engineers to further reorganize and analyze these results for evaluation and decision-making. In late 2022, Large Language Models (LLMs) became popular across multiple fields, providing new insights into AI-aided vision-based SHM. In this study, a novel LLM-based Disaster Reconnaissance Summarization (LLM-DRS) framework is proposed. It introduces a standard reconnaissance plan in which the collection of vision data and corresponding metadata follows a well-designed on-site investigation process. Text-based metadata and image-based vision data are then processed and integrated into a unified format, where well-trained Deep Convolutional Neural Networks extract key attributes, including damage state, material type, and damage level. Finally, all data are fed into an LLM with carefully designed prompts, enabling the LLM-DRS to generate summary reports for individual structures or affected regions based on aggregated attributes and metadata. Results show that integrating LLMs into vision-based SHM, particularly for rapid post-disaster reconnaissance, demonstrates promising potential for improving resilience of the built environment through effective reconnaissance.
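"将属性与元数据整合后输入精心设计提示词的 LLM"在工程上就是一步提示词拼装。下面是一个纯示意的 Python 片段(字段名与提示词内容均为虚构,并非论文原文):

```python
def build_reconnaissance_prompt(records: list) -> str:
    """Assemble per-structure attributes (e.g. from the CNN classifiers)
    and metadata into one summarization prompt. Field names are made up."""
    lines = []
    for r in records:
        lines.append(
            f"- site={r['site_id']}, material={r['material']}, "
            f"damage_state={r['damage_state']}, level={r['damage_level']}, "
            f"notes={r.get('notes', 'n/a')}"
        )
    return (
        "You are a structural engineer. Summarize the post-disaster "
        "reconnaissance findings below into a short report, grouping "
        "structures by damage level and flagging unsafe ones.\n"
        + "\n".join(lines)
    )

print(build_reconnaissance_prompt([
    {"site_id": "B-01", "material": "RC", "damage_state": "damaged",
     "damage_level": "moderate", "notes": "shear cracks at columns"},
]))
```

这一层的价值在于把深度模型的离散输出变成 LLM 可消化的结构化上下文,报告质量主要取决于属性提取的准确性与提示词的约束程度。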
[CV-57] ReaDy-Go: Real-to-Sim Dynamic 3D Gaussian Splatting Simulation for Environment-Specific Visual Navigation with Moving Obstacles
【速读】:该论文旨在解决视觉导航模型在真实动态环境中鲁棒性不足的问题,尤其是由仿真到现实(sim-to-real)迁移时的性能下降以及针对特定部署环境(如家庭、餐厅和工厂)训练策略的困难。其关键解决方案是提出 ReaDy-Go,一个新颖的“真实到仿真”(real-to-sim)模拟流水线,通过结合静态场景的3D高斯点绘(3D Gaussian Splatting, GS)重建与动态人类GS障碍物,合成逼真的动态场景数据集;该流水线包含三个核心组件:(1) 动态GS模拟器,集成场景GS与人体动画模块以插入可驱动的人类GS虚拟角色并从2D轨迹生成合理的人类运动;(2) 针对动态环境的导航数据集生成机制,利用模拟器、为动态GS表示设计的机器人专家规划器和人类规划器;(3) 基于生成数据集的策略学习方法。实验证明,ReaDy-Go 在多个目标环境中均优于基线方法,且在未见环境中实现零样本 sim-to-real 部署,展现出良好的泛化能力。
链接: https://arxiv.org/abs/2602.11575
作者: Seungyeon Yoo,Youngseok Jang,Dabin Kim,Youngsoo Han,Seungwoo Jung,H. Jin Kim
机构: Seoul National University (首尔国立大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Visual navigation models often struggle in real-world dynamic environments due to limited robustness to the sim-to-real gap and the difficulty of training policies tailored to target deployment environments (e.g., households, restaurants, and factories). Although real-to-sim navigation simulation using 3D Gaussian Splatting (GS) can mitigate this gap, prior works have assumed only static scenes or unrealistic dynamic obstacles, despite the importance of safe navigation in dynamic environments. To address these issues, we propose ReaDy-Go, a novel real-to-sim simulation pipeline that synthesizes photorealistic dynamic scenarios for target environments. ReaDy-Go generates photorealistic navigation datasets for dynamic environments by combining a reconstructed static GS scene with dynamic human GS obstacles, and trains policies robust to both the sim-to-real gap and moving obstacles. The pipeline consists of three components: (1) a dynamic GS simulator that integrates scene GS with a human animation module, enabling the insertion of animatable human GS avatars and the synthesis of plausible human motions from 2D trajectories, (2) navigation dataset generation for dynamic environments that leverages the simulator, a robot expert planner designed for dynamic GS representations, and a human planner, and (3) policy learning using the generated datasets. ReaDy-Go outperforms baselines across target environments in both simulation and real-world experiments, demonstrating improved navigation performance even after sim-to-real transfer and in the presence of moving obstacles. Moreover, zero-shot sim-to-real deployment in an unseen environment indicates its generalization potential. Project page: this https URL.
[CV-58] Move What Matters: Parameter-Efficient Domain Adaptation via Optimal Transport Flow for Collaborative Perception
【速读】:该论文旨在解决多智能体系统在车联网(Vehicle-to-Everything, V2X)协同感知中跨域适应(domain adaptation)的效率与稳定性问题,尤其针对参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法在多智能体场景下导致性能下降和训练不稳定的挑战。其解决方案的关键在于提出FlowAdapt框架,该框架基于最优传输理论(optimal transport theory),通过两个核心模块实现:一是采用Wasserstein贪心采样策略(Wasserstein Greedy Sampling),利用有界覆盖半径筛选冗余样本以降低异构传感流中的帧间冗余;二是设计渐进知识迁移模块(Progressive Knowledge Transfer),通过可学习路径将压缩的早期阶段表征逐步注入深层网络,缓解PEFT适配过程中深层特征语义退化的问题,从而在仅使用1%可训练参数的情况下实现卓越的样本效率与泛化能力。
链接: https://arxiv.org/abs/2602.11565
作者: Zesheng Jia,Jin Wang,Siao Liu,Lingzhi Li,Ziyao Huang,Yunjiang Xu,Jianping Wang
机构: Soochow University (苏州大学); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Fast domain adaptation remains a fundamental challenge for deploying multi-agent systems across diverse environments in Vehicle-to-Everything (V2X) collaborative perception. Despite the success of Parameter-Efficient Fine-Tuning (PEFT) in natural language processing and conventional vision tasks, directly applying PEFT to multi-agent settings leads to significant performance degradation and training instability. In this work, we conduct a detailed analysis and identify two key factors: (i) inter-frame redundancy in heterogeneous sensory streams, and (ii) erosion of fine-grained semantics in deep-layer representations under PEFT adaptation. To address these issues, we propose FlowAdapt, a parameter-efficient framework grounded in optimal transport theory, which minimizes information transport costs across both data distributions and network hierarchies. Specifically, we introduce a Wasserstein Greedy Sampling strategy to selectively filter redundant samples via a bounded covering radius. Furthermore, Progressive Knowledge Transfer module is designed to progressively inject compressed early-stage representations into later stages through learnable pathways, alleviating semantic degradation in late-stage adaptation. Extensive experiments on three benchmarks demonstrate that FlowAdapt achieves state-of-the-art performance with only 1% of trainable parameters, effectively bridging domain gaps with superior sample efficiency and generalization.
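"通过有界覆盖半径筛选冗余样本"的贪心采样与经典的 k-center 贪心非常接近:反复选择离已选集合最远的样本,直到所有样本都落在某个已选样本的半径 r 之内。下面的 NumPy 示意是对该策略的一种解读(欧氏距离、停止条件均为假设,非论文原始实现):

```python
import numpy as np

def greedy_cover_sampling(feats: np.ndarray, radius: float) -> list:
    """Greedy k-center selection: keep adding the farthest sample until
    every sample lies within `radius` of a selected one. One plausible
    reading of Wasserstein Greedy Sampling, not FlowAdapt's exact code."""
    selected = [0]                                   # arbitrary seed frame
    d = np.linalg.norm(feats - feats[0], axis=1)     # dist to selected set
    while d.max() > radius:
        i = int(d.argmax())                          # farthest remaining
        selected.append(i)
        d = np.minimum(d, np.linalg.norm(feats - feats[i], axis=1))
    return selected

x = np.random.default_rng(2).normal(size=(500, 32))
kept = greedy_cover_sampling(x, radius=6.0)
print(f"{len(kept)} of {len(x)} frames kept")
```

半径 r 直接控制保留样本的数量与多样性:r 越大,相邻帧之间的冗余被压得越狠,这与摘要中"降低异构传感流的帧间冗余"的目标一致。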
[CV-59] LUVE : Latent-Cascaded Ultra-High-Resolution Video Generation with Dual Frequency Experts
【速读】:该论文旨在解决超高分辨率(Ultra-High-Resolution, UHR)视频生成中的关键技术难题,包括运动建模、语义规划与细节合成的复杂耦合问题。其解决方案的关键在于提出一种基于双频专家的分层潜空间架构——LUVE(Latent-cascaded UHR Video generation framework),通过三个阶段实现:1)低分辨率运动生成以保证运动一致性;2)潜空间直接上采样以降低计算与内存开销;3)高分辨率内容精修阶段融合低频与高频专家,协同提升语义连贯性与细粒度细节生成能力。
链接: https://arxiv.org/abs/2602.11564
作者: Chen Zhao,Jiawei Chen,Hongyu Li,Zhuoliang Kang,Shilin Lu,Xiaoming Wei,Kai Zhang,Jian Yang,Ying Tai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in video diffusion models have significantly improved visual quality, yet ultra-high-resolution (UHR) video generation remains a formidable challenge due to the compounded difficulties of motion modeling, semantic planning, and detail synthesis. To address these limitations, we propose LUVE, a Latent-cascaded UHR Video generation framework built upon dual frequency Experts. LUVE employs a three-stage architecture comprising low-resolution motion generation for motion-consistent latent synthesis, video latent upsampling that performs resolution upsampling directly in the latent space to mitigate memory and computational overhead, and high-resolution content refinement that integrates low-frequency and high-frequency experts to jointly enhance semantic coherence and fine-grained detail generation. Extensive experiments demonstrate that LUVE achieves superior photorealism and content fidelity in UHR video generation, and comprehensive ablation studies further validate the effectiveness of each component. The project is available at this https URL.
[CV-60] HyperDet: 3D Object Detection with Hyper 4D Radar Point Clouds
【速读】:该论文旨在解决雷达点云稀疏、不规则且易受多路径噪声干扰导致的3D目标检测性能弱于激光雷达(LiDAR)的问题。其核心解决方案是提出HyperDet框架,通过构建任务感知的超4D雷达点云,实现对标准LiDAR检测器的雷达输入优化:首先利用多帧连续数据融合提升点云覆盖与密度,再通过几何感知的跨传感器一致性验证抑制异常回波;进一步引入以前景为中心的扩散模块,在训练阶段结合雷达-激光雷达混合监督增强目标结构并保留雷达属性(如多普勒速度、雷达截面积RCS),最终将模型蒸馏为单步推理的一致性模型,从而在不修改检测器架构的前提下显著提升雷达-only 3D检测性能。
链接: https://arxiv.org/abs/2602.11554
作者: Yichun Xiao,Runwei Guan,Fangqiang Ding
机构: University of Edinburgh (爱丁堡大学); Hong Kong University of Science and Technology (广州) (香港科技大学(广州)); Massachusetts Institute of Technology (麻省理工学院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 9 pages, 4 figures, 6 tables
Abstract:4D mmWave radar provides weather-robust, velocity-aware measurements and is more cost-effective than LiDAR. However, radar-only 3D detection still trails LiDAR-based systems because radar point clouds are sparse, irregular, and often corrupted by multipath noise, yielding weak and unstable geometry. We present HyperDet, a detector-agnostic radar-only 3D detection framework that constructs a task-aware hyper 4D radar point cloud for standard LiDAR-oriented detectors. HyperDet aggregates returns from multiple surround-view 4D radars over consecutive frames to improve coverage and density, then applies geometry-aware cross-sensor consensus validation with a lightweight self-consistency check outside overlap regions to suppress inconsistent returns. It further integrates a foreground-focused diffusion module with training-time mixed radar-LiDAR supervision to densify object structures while lifting radar attributes (e.g., Doppler, RCS); the model is distilled into a consistency model for single-step inference. On MAN TruckScenes, HyperDet consistently improves over raw radar inputs with VoxelNeXt and CenterPoint, partially narrowing the radar-LiDAR gap. These results show that input-level refinement enables radar to better leverage LiDAR-oriented detectors without architectural modifications.
[CV-61] Perception-based Image Denoising via Generative Compression
【速读】:该论文旨在解决图像去噪中因依赖失真驱动方法而导致重建结果过度平滑的问题,尤其是在强噪声和分布偏移场景下。其核心解决方案是提出一种基于生成式压缩的感知去噪框架,关键在于通过熵编码的潜在表示(entropy-coded latent representations)强制低复杂度结构,同时利用感知损失(如LPIPS损失和Wasserstein距离)引导生成解码器恢复真实纹理。该框架包含两种互补实现:一是基于条件Wasserstein GAN的压缩去噪器,可显式控制率-失真-感知(RDP)权衡;二是基于条件扩散模型的迭代重构策略,由压缩潜在特征引导逐步去噪。此外,论文还建立了在加性高斯噪声下基于压缩的最大似然去噪器的非渐近保证,包括重建误差和解码错误概率的理论边界。
链接: https://arxiv.org/abs/2602.11553
作者: Nam Nguyen,Thinh Nguyen,Bella Bose
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Image denoising aims to remove noise while preserving structural details and perceptual realism, yet distortion-driven methods often produce over-smoothed reconstructions, especially under strong noise and distribution shift. This paper proposes a generative compression framework for perception-based denoising, where restoration is achieved by reconstructing from entropy-coded latent representations that enforce low-complexity structure, while generative decoders recover realistic textures via perceptual measures such as learned perceptual image patch similarity (LPIPS) loss and Wasserstein distance. Two complementary instantiations are introduced: (i) a conditional Wasserstein GAN (WGAN)-based compression denoiser that explicitly controls the rate-distortion-perception (RDP) trade-off, and (ii) a conditional diffusion-based reconstruction strategy that performs iterative denoising guided by compressed latents. We further establish non-asymptotic guarantees for the compression-based maximum-likelihood denoiser under additive Gaussian noise, including bounds on reconstruction error and decoding error probability. Experiments on synthetic and real-noise benchmarks demonstrate consistent perceptual improvements while maintaining competitive distortion performance.
[CV-62] Supervise-assisted Multi-modality Fusion Diffusion Model for PET Restoration
【速读】:该论文旨在解决低剂量正电子发射断层成像(low-dose PET, LPET)图像质量下降的问题,特别是在利用磁共振(MR)图像进行多模态融合以恢复标准剂量PET(standard-dose PET, SPET)时,因结构与纹理不一致及分布外(out-of-distribution, OOD)数据匹配困难导致的重建失真问题。其解决方案的关键在于提出一种监督辅助的多模态融合扩散模型(supervise-assisted multi-modality fusion diffusion model, MFdiff):首先设计多模态特征融合模块,以优化融合特征并避免引入MR图像中的冗余细节;其次,将融合特征作为条件输入扩散模型,迭代生成高质量SPET图像;最后采用两阶段监督学习策略,分别利用模拟的分布内数据提取通用先验和针对在体OOD数据定制特定先验,从而显著提升重建图像的质量与鲁棒性。
链接: https://arxiv.org/abs/2602.11545
作者: Yingkai Zhang,Shuang Chen,Ye Tian,Yunyi Gao,Jianyong Jiang,Ying Fu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Positron emission tomography (PET) offers powerful functional imaging but involves radiation exposure. Efforts to reduce this exposure by lowering the radiotracer dose or scan time can degrade image quality. While using magnetic resonance (MR) images with clearer anatomical information to restore standard-dose PET (SPET) from low-dose PET (LPET) is a promising approach, it faces challenges with the inconsistencies in the structure and texture of multi-modality fusion, as well as the mismatch in out-of-distribution (OOD) data. In this paper, we propose a supervise-assisted multi-modality fusion diffusion model (MFdiff) for addressing these challenges for high-quality PET restoration. Firstly, to fully utilize auxiliary MR images without introducing extraneous details in the restored image, a multi-modality feature fusion module is designed to learn an optimized fusion feature. Secondly, using the fusion feature as an additional condition, high-quality SPET images are iteratively generated based on the diffusion model. Furthermore, we introduce a two-stage supervise-assisted learning strategy that harnesses both generalized priors from simulated in-distribution datasets and specific priors tailored to in-vivo OOD data. Experiments demonstrate that the proposed MFdiff effectively restores high-quality SPET images from multi-modality inputs and outperforms state-of-the-art methods both qualitatively and quantitatively.
[CV-63] Vascular anatomy-aware self-supervised pre-training for X-ray angiogram analysis AAAI2026
【速读】:该论文旨在解决X-ray angiography(血管造影)图像分析中因标注数据稀缺而导致的深度学习方法性能受限的问题。现有自监督学习(Self-Supervised Learning, SSL)虽具潜力,但在该领域尚未得到充分探索,主要受限于缺乏有效的SSL框架和大规模数据集。解决方案的关键在于提出一种血管解剖结构感知的掩码图像建模框架(VasoMIM),其核心创新包括两个方面:一是基于解剖知识引导的掩码策略,通过有选择性地遮蔽包含血管的图像块,促使模型学习更鲁棒的血管语义特征;二是引入解剖一致性损失(anatomical consistency loss),确保原始图像与重建图像之间血管结构的一致性,从而提升表征的判别能力。此外,作者还构建了目前最大的X-ray angiogram预训练数据集XA-170K,为模型训练提供支持。实验证明,VasoMIM在多个下游任务中展现出优异的迁移能力和领先性能,表明其作为血管造影图像分析基础模型的巨大潜力。
链接: https://arxiv.org/abs/2602.11536
作者: De-Xing Huang,Chaohui Yu,Xiao-Hu Zhou,Tian-Yu Xiang,Qin-Yi Zhang,Mei-Jiang Gui,Rui-Ze Ma,Chen-Yu Wang,Nu-Fang Xiao,Fan Wang,Zeng-Guang Hou
机构: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院); Joint Laboratory of Intelligence Science and Technology, Institute of Systems Engineering, Macau University of Science and Technology(澳门科技大学系统工程研究所智能科学与技术联合实验室); DAMO Academy, Alibaba Group(阿里巴巴达摩院); Hupan Lab(湖畔实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 10 figures, 10 tables. Journal version of VasoMIM (AAAI 2026)
Abstract:X-ray angiography is the gold standard imaging modality for cardiovascular diseases. However, current deep learning approaches for X-ray angiogram analysis are severely constrained by the scarcity of annotated data. While large-scale self-supervised learning (SSL) has emerged as a promising solution, its potential in this domain remains largely unexplored, primarily due to the lack of effective SSL frameworks and large-scale datasets. To bridge this gap, we introduce a vascular anatomy-aware masked image modeling (VasoMIM) framework that explicitly integrates domain-specific anatomical knowledge. Specifically, VasoMIM comprises two key designs: an anatomy-guided masking strategy and an anatomical consistency loss. The former strategically masks vessel-containing patches to compel the model to learn robust vascular semantics, while the latter preserves structural consistency of vessels between original and reconstructed images, enhancing the discriminability of the learned representations. In conjunction with VasoMIM, we curate XA-170K, the largest X-ray angiogram pre-training dataset to date. We validate VasoMIM on four downstream tasks across six datasets, where it demonstrates superior transferability and achieves state-of-the-art performance compared to existing methods. These findings highlight the significant potential of VasoMIM as a foundation model for advancing a wide range of X-ray angiogram analysis tasks. VasoMIM and XA-170K will be available at this https URL.
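"解剖引导的掩码策略"可以概括为:按每个 patch 内血管像素占比加权,让含血管的 patch 以更高概率被遮蔽,从而迫使模型重建血管语义。下面是这一思路的假设性 NumPy 示意(遮蔽比例与加权方式均为示例):

```python
import numpy as np

def anatomy_guided_mask(vessel_mask, patch=16, mask_ratio=0.75,
                        bias=4.0, rng=None):
    """Sample MIM patch masks biased toward vessel-containing patches.
    vessel_mask: (H, W) binary vessel segmentation. Returns a boolean
    (H//patch, W//patch) grid, True = masked. Weights are illustrative."""
    rng = rng or np.random.default_rng()
    h, w = vessel_mask.shape
    gh, gw = h // patch, w // patch
    # Fraction of vessel pixels inside each patch.
    frac = vessel_mask[:gh * patch, :gw * patch] \
        .reshape(gh, patch, gw, patch).mean(axis=(1, 3))
    weights = 1.0 + bias * frac              # vessel patches favored
    probs = (weights / weights.sum()).ravel()
    n_mask = int(mask_ratio * gh * gw)
    idx = rng.choice(gh * gw, size=n_mask, replace=False, p=probs)
    grid = np.zeros(gh * gw, dtype=bool)
    grid[idx] = True
    return grid.reshape(gh, gw)

vm = np.zeros((224, 224))
vm[100:110, :] = 1                           # a fake horizontal vessel
print(anatomy_guided_mask(vm).mean())        # ~0.75 of patches masked
```

与均匀随机掩码相比,这种有偏采样让重建目标集中在细小的血管结构上;摘要中的解剖一致性损失则进一步在原图与重建图之间约束血管结构不漂移。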
[CV-64] What if Agents Could Imagine? Reinforcing Open-Vocabulary HOI Comprehension through Generation
【速读】:该论文旨在解决多模态大语言模型在开放词汇人类-物体交互(Open-Vocabulary Human-Object Interaction, OV-HOI)任务中因跨模态幻觉(cross-modal hallucinations)和遮挡引起的模糊性(occlusion-induced ambiguity)而导致的推理能力受限问题。解决方案的关键在于提出一种名为ImagineAgent的智能体框架,其核心创新是将认知推理与生成式想象(generative imagination)相结合:首先构建显式的认知地图(cognitive maps),明确建模检测到的实体与候选动作之间的合理关系;随后动态调用检索增强、图像裁剪和扩散模型等工具,获取领域特定知识和增强的视觉证据,从而在模糊场景下实现跨模态对齐;此外,设计了一种复合奖励机制以平衡预测准确率与工具使用效率,显著提升了模型在低数据需求下的鲁棒性和效率。
链接: https://arxiv.org/abs/2602.11499
作者: Zhenlong Yuan,Xiangyan Qu,Jing Tang,Rui Chen,Lei Sun,Ruidong Chen,Hongwei Yu,Chengxuan Qian,Xiangxiang Chu,Shuo Li,Yuyin Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal Large Language Models have shown promising capabilities in bridging visual and textual reasoning, yet their reasoning capabilities in Open-Vocabulary Human-Object Interaction (OV-HOI) are limited by cross-modal hallucinations and occlusion-induced ambiguity. To address this, we propose ImagineAgent, an agentic framework that harmonizes cognitive reasoning with generative imagination for robust visual understanding. Specifically, our method innovatively constructs cognitive maps that explicitly model plausible relationships between detected entities and candidate actions. Subsequently, it dynamically invokes tools including retrieval augmentation, image cropping, and diffusion models to gather domain-specific knowledge and enriched visual evidence, thereby achieving cross-modal alignment in ambiguous scenarios. Moreover, we propose a composite reward that balances prediction accuracy and tool efficiency. Evaluations on SWIG-HOI and HICO-DET datasets demonstrate our SOTA performance, requiring approximately 20% of training data compared to existing methods, validating our robustness and efficiency.
[CV-65] Arbitrary Ratio Feature Compression via Next Token Prediction
【速读】:该论文旨在解决现有特征压缩方法在灵活性和泛化能力上的局限性问题,即传统方法通常依赖专用模型实现特定压缩比,导致在面对不同压缩需求时需重新训练模型,效率低下且难以适应动态场景。其解决方案的关键在于提出一种任意压缩比特征压缩(Arbitrary Ratio Feature Compression, ARFC)框架,核心组件为自回归式压缩器(Arbitrary Ratio Compressor, ARC),通过下一词预测机制实现压缩比的灵活控制——仅需调整生成token数量即可实现任意压缩比;同时引入两种关键模块:Mixture of Solutions(MoS)模块利用多组压缩结果提升压缩质量与鲁棒性,Entity Relation Graph Constraint(ERGC)模块在训练中保留语义与结构关系,从而显著提升压缩后特征的表达能力。实验表明,该方法在多种下游任务中均优于现有方法,甚至在某些情况下超越原始未压缩特征性能。
链接: https://arxiv.org/abs/2602.11494
作者: Yufan Liu,Daoyuan Ren,Zhipeng Zhang,Wenyang Luo,Bing Li,Weiming Hu,Stephen Maybank
机构: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institution of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; CAS Center for Excellence in Brain Science and Intelligence Technology; People AI, Inc.; School of Artificial Intelligence, Shanghai Jiao Tong University; School of Computer Science and Mathematics, Birkbeck College, University of London
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Feature compression is increasingly important for improving the efficiency of downstream tasks, especially in applications involving large-scale or multi-modal data. While existing methods typically rely on dedicated models for achieving specific compression ratios, they are often limited in flexibility and generalization. In particular, retraining is necessary when adapting to a new compression ratio. To address this limitation, we propose a novel and flexible Arbitrary Ratio Feature Compression (ARFC) framework, which supports any compression ratio with a single model, eliminating the need for multiple specialized models. At its core, the Arbitrary Ratio Compressor (ARC) is an auto-regressive model that performs compression via next-token prediction. This allows the compression ratio to be controlled at inference simply by adjusting the number of generated tokens. To enhance the quality of the compressed features, two key modules are introduced. The Mixture of Solutions (MoS) module refines the compressed tokens by utilizing multiple compression results (solutions), reducing uncertainty and improving robustness. The Entity Relation Graph Constraint (ERGC) is integrated into the training process to preserve semantic and structural relationships during compression. Extensive experiments on cross-modal retrieval, image classification, and image retrieval tasks across multiple datasets demonstrate that our method consistently outperforms existing approaches at various compression ratios. Notably, in some cases, it even surpasses the performance of the original, uncompressed features. These results validate the effectiveness and versatility of ARFC for practical, resource-constrained scenarios.
[CV-66] A Dual-Branch Framework for Semantic Change Detection with Boundary and Temporal Awareness
【速读】:该论文旨在解决遥感图像语义变化检测(Semantic Change Detection, SCD)中常见的边界模糊和时序建模不足问题,从而提升分割精度。其解决方案的关键在于提出一种双分支网络架构(DBTANet),通过冻结的Segment Anything Model (SAM) 分支提取全局语义上下文与边界先验,结合ResNet34分支提供局部空间细节,实现互补特征表示;同时设计双向时序感知模块(Bidirectional Temporal Awareness Module, BTAM)对多尺度特征进行对称式时序依赖建模,并引入高斯平滑投影模块(Gaussian-smoothed Projection Module, GSPM)增强浅层SAM特征的边缘信息并抑制噪声,从而在保持边界清晰度的同时强化时序推理能力。
链接: https://arxiv.org/abs/2602.11466
作者: Yun-Cheng Li,Sen Lei,Heng-Chao Li,Ke Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Semantic Change Detection (SCD) aims to detect and categorize land-cover changes from bi-temporal remote sensing images. Existing methods often suffer from blurred boundaries and inadequate temporal modeling, limiting segmentation accuracy. To address these issues, we propose a Dual-Branch Framework for Semantic Change Detection with Boundary and Temporal Awareness, termed DBTANet. Specifically, we utilize a dual-branch Siamese encoder where a frozen SAM branch captures global semantic context and boundary priors, while a ResNet34 branch provides local spatial details, ensuring complementary feature representations. On this basis, we design a Bidirectional Temporal Awareness Module (BTAM) to aggregate multi-scale features and capture temporal dependencies in a symmetric manner. Furthermore, a Gaussian-smoothed Projection Module (GSPM) refines shallow SAM features, suppressing noise while enhancing edge information for boundary-aware constraints. Extensive experiments on two public benchmarks demonstrate that DBTANet effectively integrates global semantics, local details, temporal reasoning, and boundary awareness, achieving state-of-the-art performance.
[CV-67] Hierarchical Concept Embedding Pursuit for Interpretable Image Classification
【速读】:该论文旨在解决现有稀疏概念恢复方法在图像分类中忽视概念层次结构的问题,导致模型虽能做出正确预测,但其解释可能与语义层次不一致。解决方案的关键在于提出层次化概念嵌入追逐(Hierarchical Concept Embedding Pursuit, HCEP)框架,该框架在视觉-语言模型的潜在空间中构建概念嵌入的层次结构,并利用层次稀疏编码(hierarchical sparse coding)从图像中恢复出符合层次关系的概念组合。通过假设正确概念构成层次树中的根路径,HCEP在嵌入空间中推导出识别这些概念的条件,从而实现更可靠、可解释的图像分类模型。实验表明,相较于传统稀疏编码方法,HCEP在概念精度和召回率上表现更优,且在样本有限时仍保持更高的分类准确性和概念恢复能力。
链接: https://arxiv.org/abs/2602.11448
作者: Nghia Nguyen,Tianjiao Ding,René Vidal
机构: University of Pennsylvania (宾夕法尼亚大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Interpretable-by-design models are gaining traction in computer vision because they provide faithful explanations for their predictions. In image classification, these models typically recover human-interpretable concepts from an image and use them for classification. Sparse concept recovery methods leverage the latent space of vision-language models to represent image embeddings as a sparse combination of concept embeddings. However, because such methods ignore the hierarchical structure of concepts, they can produce correct predictions with explanations that are inconsistent with the hierarchy. In this work, we propose Hierarchical Concept Embedding Pursuit (HCEP), a framework that induces a hierarchy of concept embeddings in the latent space and uses hierarchical sparse coding to recover the concepts present in an image. Given a hierarchy of semantic concepts, we construct a corresponding hierarchy of concept embeddings and, assuming the correct concepts for an image form a rooted path in the hierarchy, derive desirable conditions for identifying them in the embedded space. We show that hierarchical sparse coding reliably recovers hierarchical concept embeddings, whereas vanilla sparse coding fails. Our experiments on real-world datasets demonstrate that HCEP outperforms baselines in concept precision and recall while maintaining competitive classification accuracy. Moreover, when the number of samples is limited, HCEP achieves superior classification accuracy and concept recovery. These results show that incorporating hierarchical structures into sparse coding yields more reliable and interpretable image classification models.
[CV-68] Enhanced Portable Ultra Low-Field Diffusion Tensor Imaging with Bayesian Artifact Correction and Deep Learning-Based Super-Resolution
【Quick Read】: This paper tackles the image degradation of ultra-low-field (ULF) MRI caused by coarse spatial and angular resolution, low signal-to-noise ratio, and the inherent characteristics of diffusion tensor imaging (DTI) sequences, in particular the spatio-angular artifacts present in ULF DTI scans. The key to the solution lies in two core algorithms: a Bayesian bias field correction algorithm with angular dependence that removes the multi-dimensional artifacts of ULF DTI, and a generalizable convolutional neural network super-resolution algorithm (DiffSR) that requires no re-training, recovers microstructural and volumetric information, and improves the agreement of DTI metrics, enabling effective reconstruction and classification of white matter pathology.
Link: https://arxiv.org/abs/2602.11446
Authors: Mark D. Olchanyi,Annabel Sorby-Adams,John Kirsch,Brian L. Edlow,Ava Farnan,Renfei Liu,Matthew S. Rosen,Emery N. Brown,W. Taylor Kimberly,Juan Eugenio Iglesias
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 38 pages, 8 figures, 2 supplementary figures, and 3 supplementary tables
Abstract:Portable, ultra-low-field (ULF) magnetic resonance imaging has the potential to expand access to neuroimaging but currently suffers from coarse spatial and angular resolutions and low signal-to-noise ratios. Diffusion tensor imaging (DTI), a sequence tailored to detect and reconstruct white matter tracts within the brain, is particularly prone to such imaging degradation due to inherent sequence design coupled with prolonged scan times. In addition, ULF DTI scans exhibit artifacting that spans both the spatial and angular domains, requiring a custom modelling algorithm for subsequent correction. We introduce a nine-direction, single-shell ULF DTI sequence, as well as a companion Bayesian bias field correction algorithm that possesses angular dependence and a convolutional neural network-based super-resolution algorithm that is generalizable across DTI datasets and does not require re-training ("DiffSR"). We show through a synthetic downsampling experiment and white matter assessment in real, matched ULF and high-field DTI scans that these algorithms can recover microstructural and volumetric white matter information at ULF. We also show that DiffSR can be directly applied to white matter-based Alzheimer's disease classification in synthetically degraded scans, with notable improvements in agreement between DTI metrics, as compared to un-degraded scans. We freely disseminate the Bayesian bias correction algorithm and DiffSR with the goal of furthering progress on both ULF reconstruction methods and general DTI sequence harmonization. We release all code related to DiffSR for public use at this https URL.
[CV-69] CtrlShift: High-Quality Geometry-Aware Object Manipulation in Visual Generation ICLR2026
【Quick Read】: This paper addresses three core challenges of object-level manipulation in images and videos: background fidelity, geometric consistency under viewpoint changes, and user controllability. Existing methods either rely on explicit 3D reconstruction and generalize poorly, or generalize well but lack fine-grained geometric control. The key to the solution is CtrlShift, an end-to-end diffusion framework that decomposes manipulation into two stages, object removal and reference-guided inpainting, unified under explicit camera pose control; a multi-task, multi-stage training strategy disentangles background, identity, and pose signals, and a scalable real-world dataset construction pipeline enables high-fidelity, geometry-consistent, and controllable object manipulation without explicit 3D modeling.
Link: https://arxiv.org/abs/2602.11440
Authors: Penghui Ruan,Bojia Zi,Xianbiao Qi,Youze Huang,Rong Xiao,Pichao Wang,Jiannong Cao,Yuhui Shi
Affiliations: The Hong Kong Polytechnic University; Southern University of Science and Technology; The Chinese University of Hong Kong; IntelliFusion Inc.; University of Electronic Science and Technology of China; NVIDIA
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICLR 2026
Abstract:Object-level manipulation, relocating or reorienting objects in images or videos while preserving scene realism, is central to film post-production, AR, and creative editing. Yet existing methods struggle to jointly achieve three core goals: background preservation, geometric consistency under viewpoint shifts, and user-controllable transformations. Geometry-based approaches offer precise control but require explicit 3D reconstruction and generalize poorly; diffusion-based methods generalize better but lack fine-grained geometric control. We present CtrlShift, an end-to-end diffusion framework to achieve geometry-consistent object manipulation without explicit 3D representations. Our key insight is to decompose manipulation into two stages, object removal and reference-guided inpainting under explicit camera pose control, and encode both within a unified diffusion process. To enable precise, disentangled control, we design a multi-task, multi-stage training strategy that separates background, identity, and pose signals across tasks. To improve generalization, we introduce a scalable real-world dataset construction pipeline that generates paired image and video samples with estimated relative camera poses. Extensive experiments demonstrate that CtrlShift achieves state-of-the-art results in fidelity, viewpoint consistency, and controllability. To our knowledge, this is the first framework to unify fine-grained geometric control and real-world generalization for object manipulation, without relying on any explicit 3D modeling.
[CV-70] Fighting MRI Anisotropy: Learning Multiple Cardiac Shapes From a Single Implicit Neural Representation
【Quick Read】: This paper addresses the limitation that the anisotropy of short-axis (SAX) cardiovascular magnetic resonance imaging (CMRI) imposes on cardiac shape analysis. The key to the solution is to leverage near-isotropic, higher-resolution computed tomography angiography (CTA) data to train a single neural implicit function that jointly represents cardiac shapes from CMRI at any resolution; the method reconstructs the right ventricle (RV) and myocardium (MYO) simultaneously, where MYO jointly models the endocardial and epicardial left-ventricle surfaces. Extracting 4-chamber (4CH) slices from the reconstructed shapes and comparing them against reference CMRI segmentations confirms the model's geometric accuracy, smoothness, and anatomical plausibility.
Link: https://arxiv.org/abs/2602.11436
Authors: Carolina Brás,Soufiane Ben Haddou,Thijs P. Kuipers,Laura Alvarez-Florez,R. Nils Planken,Fleur V. Y. Tjong,Connie Bezzina,Ivana Išgum
Affiliations: Amsterdam UMC; University of Amsterdam; Amsterdam Cardiovascular Sciences; Mayo Clinic
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:The anisotropic nature of short-axis (SAX) cardiovascular magnetic resonance imaging (CMRI) limits cardiac shape analysis. To address this, we propose to leverage near-isotropic, higher resolution computed tomography angiography (CTA) data of the heart. We use this data to train a single neural implicit function to jointly represent cardiac shapes from CMRI at any resolution. We evaluate the method for the reconstruction of right ventricle (RV) and myocardium (MYO), where MYO simultaneously models endocardial and epicardial left-ventricle surfaces. Since high-resolution SAX reference segmentations are unavailable, we evaluate performance by extracting a 4-chamber (4CH) slice of RV and MYO from their reconstructed shapes. When compared with the reference 4CH segmentation masks from CMRI, our method achieved a Dice similarity coefficient of 0.91 \pm 0.07 and 0.75 \pm 0.13, and a Hausdorff distance of 6.21 \pm 3.97 mm and 7.53 \pm 5.13 mm for RV and MYO, respectively. Quantitative and qualitative assessment demonstrate the model’s ability to reconstruct accurate, smooth and anatomically plausible shapes, supporting improvements in cardiac shape analysis.
[CV-71] Latent Forcing: Reordering the Diffusion Trajectory for Pixel-Space Image Generation
【Quick Read】: This paper addresses the efficiency and end-to-end modeling shortcomings of latent diffusion models for image generation: information loss during encoding, the need for a separately trained decoder, and modeling an auxiliary distribution rather than the raw data. The key to the solution is Latent Forcing, which jointly processes latents and pixels along the denoising trajectory with separately tuned noise schedules, letting the latents act as a scratchpad for intermediate computation before high-frequency pixel features are generated, thereby retaining the efficiency of latent diffusion while generating directly from raw natural images.
Link: https://arxiv.org/abs/2602.11401
Authors: Alan Baade,Eric Ryan Chan,Kyle Sargent,Changan Chen,Justin Johnson,Ehsan Adeli,Li Fei-Fei
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 8 pages, 6 figures
Abstract:Latent diffusion models excel at generating high-quality images but lose the benefits of end-to-end modeling. They discard information during image encoding, require a separately trained decoder, and model an auxiliary distribution to the raw data. In this paper, we propose Latent Forcing, a simple modification to existing architectures that achieves the efficiency of latent diffusion while operating on raw natural images. Our approach orders the denoising trajectory by jointly processing latents and pixels with separately tuned noise schedules. This allows the latents to act as a scratchpad for intermediate computation before high-frequency pixel features are generated. We find that the order of conditioning signals is critical, and we analyze this to explain differences between REPA distillation in the tokenizer and the diffusion model, conditional versus unconditional generation, and how tokenizer reconstruction quality relates to diffusability. Applied to ImageNet, Latent Forcing achieves a new state-of-the-art for diffusion transformer-based pixel generation at our compute scale.
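A rough way to picture "separately tuned noise schedules" is two offset schedules in which the latent track is always cleaner than the pixel track. The sine schedule and the 0.3 shift below are placeholder assumptions, not the paper's actual choices.

```python
import torch

def paired_noise_levels(t, shift=0.3):
    """Noise levels for the latent and pixel tracks at diffusion time t
    (t = 0 clean, t = 1 pure noise). Shifting the latent track earlier means
    latents are mostly clean before high-frequency pixel content is formed."""
    t_lat = (t - shift).clamp(0.0, 1.0)            # latent track runs ahead
    sigma = lambda s: torch.sin(0.5 * torch.pi * s)
    return sigma(t_lat), sigma(t)                  # (latent sigma, pixel sigma)

t = torch.linspace(0, 1, 5)
sig_lat, sig_pix = paired_noise_levels(t)
print(sig_lat)   # latents already clean while pixels are still noisy
print(sig_pix)
```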
[CV-72] ArtContext: Contextualizing Artworks with Open-Access Art History Articles and Wikidata Knowledge through a LoRA-Tuned CLIP Model
【Quick Read】: This paper addresses the difficulty of integrating and locating multi-source information about artworks in art-historical literature, i.e., given an artwork, how to efficiently identify what different articles say about specific aspects of it (such as layout, iconography, or material culture). The key to the solution is the ArtContext pipeline, with two components: a novel corpus collection pipeline over Open-Access art history articles and Wikidata knowledge, and PaintingCLIP, a domain-specific painting-understanding model obtained by fine-tuning CLIP with Low-Rank Adaptation (LoRA). The model learns from the corpus under weak supervision and provides structured context for a given artwork, and the pipeline generalizes readily to other humanities domains.
Link: https://arxiv.org/abs/2602.11349
Authors: Samuel Waugh,Stuart James
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Many Art History articles discuss artworks in general as well as specific parts of works, such as layout, iconography, or material culture. However, when viewing an artwork, it is not trivial to identify what different articles have said about the piece. Therefore, we propose ArtContext, a pipeline for taking a corpus of Open-Access Art History articles and Wikidata Knowledge and annotating Artworks with this information. We do this using a novel corpus collection pipeline, then learn a bespoke CLIP model adapted using Low-Rank Adaptation (LoRA) to make it domain-specific. We show that the new model, PaintingCLIP, which is weakly supervised by the collected corpus, outperforms CLIP and provides context for a given artwork. The proposed pipeline is generalisable and can be readily applied to numerous humanities areas.
[CV-73] Exploring Real-Time Super-Resolution: Benchmarking and Fine-Tuning for Streaming Content
【Quick Read】: This paper addresses the limited performance of current real-time super-resolution (RT-SR) methods on compressed video content, rooted in the fact that existing datasets do not reflect the characteristics of streaming video, so models generalize poorly in practice. The key to the solution is twofold: first, StreamSR, a comprehensive YouTube-sourced dataset covering diverse video genres and resolutions that better represents real streaming scenarios; second, EfRLFN, an efficient real-time model that integrates Efficient Channel Attention with a hyperbolic tangent (tanh) activation, with an extensively optimized architecture and a composite loss function that improve training convergence and inference efficiency. Experiments further show that fine-tuning other models on StreamSR yields significant gains that generalize across multiple standard benchmarks, validating both the dataset and the model design.
Link: https://arxiv.org/abs/2602.11339
Authors: Evgeney Bogatyrev,Khaled Abud,Ivan Molodetskikh,Nikita Alutis,Dmitry Vatolin
Affiliations: Lomonosov Moscow State University; AI Center, Lomonosov Moscow State University; MSU Institute for Artificial Intelligence, Lomonosov Moscow State University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advancements in real-time super-resolution have enabled higher-quality video streaming, yet existing methods struggle with the unique challenges of compressed video content. Commonly used datasets do not accurately reflect the characteristics of streaming media, limiting the relevance of current benchmarks. To address this gap, we introduce a comprehensive dataset - StreamSR - sourced from YouTube, covering a wide range of video genres and resolutions representative of real-world streaming scenarios. We benchmark 11 state-of-the-art real-time super-resolution models to evaluate their performance for the streaming use-case. Furthermore, we propose EfRLFN, an efficient real-time model that integrates Efficient Channel Attention and a hyperbolic tangent activation function - a novel design choice in the context of real-time super-resolution. We extensively optimized the architecture to maximize efficiency and designed a composite loss function that improves training convergence. EfRLFN combines the strengths of existing architectures while improving both visual quality and runtime performance. Finally, we show that fine-tuning other models on our dataset results in significant performance gains that generalize well across various standard benchmarks. We made the dataset, the code, and the benchmark available at this https URL.
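Since the abstract names its two distinguishing ingredients, Efficient Channel Attention and a tanh output activation, here is a minimal PyTorch sketch of just those pieces. The rest of EfRLFN (block layout, channel widths, the composite loss) is not public in this listing, so the module below is illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: a light 1D conv over channel descriptors."""
    def __init__(self, k_size: int = 3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)

    def forward(self, x):                               # x: (B, C, H, W)
        w = self.pool(x).squeeze(-1).transpose(1, 2)    # (B, 1, C)
        w = torch.sigmoid(self.conv(w))                 # local cross-channel gate
        return x * w.transpose(1, 2).unsqueeze(-1)      # reweight channels

class TanhHead(nn.Module):
    """Upsampling head with a tanh output, mirroring the stated design choice."""
    def __init__(self, ch: int, scale: int = 4):
        super().__init__()
        self.proj = nn.Conv2d(ch, 3 * scale * scale, 3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x):
        return torch.tanh(self.shuffle(self.proj(x)))   # outputs in [-1, 1]
```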
[CV-74] MolmoSpaces: A Large-Scale Open Ecosystem for Robot Navigation and Manipulation
【Quick Read】: This paper addresses the generalization gap that blocks large-scale deployment of robots in complex real environments: existing robot benchmarks cannot cover the vast, long-tailed variation in everyday scene layouts, object geometry, and task specifications. The key to the solution is MolmoSpaces, a fully open ecosystem with over 230k diverse indoor environments and 130k richly annotated object assets, supporting popular physics engines (MuJoCo, Isaac, ManiSkill) and enabling large-scale simulated evaluation of diverse embodied tasks, including static and mobile manipulation, navigation, and long-horizon multi-room tasks. The companion MolmoSpaces-Bench suite shows strong sim-to-real correlation (R = 0.96, ρ = 0.98), providing a foundation for scalable data generation, policy training, and benchmark creation in robot learning.
Link: https://arxiv.org/abs/2602.11337
Authors: Yejin Kim,Wilbert Pumacay,Omar Rayyan,Max Argus,Winson Han,Eli VanderBilt,Jordi Salvador,Abhay Deshpande,Rose Hendrix,Snehal Jauhri,Shuo Liu,Nur Muhammad Mahi Shafiullah,Maya Guru,Ainaz Eftekhar,Karen Farley,Donovan Clay,Jiafei Duan,Arjun Guru,Piper Wolters,Alvaro Herrasti,Ying-Chun Lee,Georgia Chalvatzaki,Yuchen Cui,Ali Farhadi,Dieter Fox,Ranjay Krishna
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Deploying robots at scale demands robustness to the long tail of everyday situations. The countless variations in scene layout, object geometry, and task specifications that characterize real environments are vast and underrepresented in existing robot benchmarks. Measuring this level of generalization requires infrastructure at a scale and diversity that physical evaluation alone cannot provide. We introduce MolmoSpaces, a fully open ecosystem to support large-scale benchmarking of robot policies. MolmoSpaces consists of over 230k diverse indoor environments, ranging from handcrafted household scenes to procedurally generated multiroom houses, populated with 130k richly annotated object assets, including 48k manipulable objects with 42M stable grasps. Crucially, these environments are simulator-agnostic, supporting popular options such as MuJoCo, Isaac, and ManiSkill. The ecosystem supports the full spectrum of embodied tasks: static and mobile manipulation, navigation, and multiroom long-horizon tasks requiring coordinated perception, planning, and interaction across entire indoor environments. We also design MolmoSpaces-Bench, a benchmark suite of 8 tasks in which robots interact with our diverse scenes and richly annotated objects. Our experiments show MolmoSpaces-Bench exhibits strong sim-to-real correlation (R = 0.96, \rho = 0.98), confirm newer and stronger zero-shot policies outperform earlier versions in our benchmarks, and identify key sensitivities to prompt phrasing, initial joint positions, and camera occlusion. Through MolmoSpaces and its open-source assets and tooling, we provide a foundation for scalable data generation, policy training, and benchmark creation for robot learning research.
[CV-75] MDE-VIO: Enhancing Visual-Inertial Odometry Using Learned Depth Priors ICIP2026
【Quick Read】: This paper addresses the inaccurate pose estimation of traditional monocular visual-inertial odometry (VIO) in low-texture environments where visual features are sparse. The core of the solution is to integrate learned dense depth priors directly into the VINS-Mono optimization backend, introducing affine-invariant depth consistency constraints and pairwise ordinal constraints while explicitly filtering unstable artifacts via variance-based gating, thereby robustly recovering metric scale within the computational limits of edge devices. The method markedly improves stability and accuracy in challenging scenarios, reducing Absolute Trajectory Error (ATE) by up to 28.3%.
Link: https://arxiv.org/abs/2602.11323
Authors: Arda Alniak,Sinan Kalkan,Mustafa Mert Ankarali,Afsar Saranli,Abdullah Aydin Alatan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 2 figures, 3 tables. Submitted to ICIP 2026
Abstract:Traditional monocular Visual-Inertial Odometry (VIO) systems struggle in low-texture environments where sparse visual features are insufficient for accurate pose estimation. To address this, dense Monocular Depth Estimation (MDE) has been widely explored as a complementary information source. While recent Vision Transformer (ViT) based complex foundational models offer dense, geometrically consistent depth, their computational demands typically preclude them from real-time edge deployment. Our work bridges this gap by integrating learned depth priors directly into the VINS-Mono optimization backend. We propose a novel framework that enforces affine-invariant depth consistency and pairwise ordinal constraints, explicitly filtering unstable artifacts via variance-based gating. This approach strictly adheres to the computational limits of edge devices while robustly recovering metric scale. Extensive experiments on the TartanGround and M3ED datasets demonstrate that our method prevents divergence in challenging scenarios and delivers significant accuracy gains, reducing Absolute Trajectory Error (ATE) by up to 28.3%. Code will be made available.
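The two constraint types the abstract names can be sketched outside any factor graph as follows. The gating threshold and the residual forms are illustrative assumptions, not the actual terms of the VINS-Mono backend integration.

```python
import numpy as np

def affine_invariant_residual(d_pred, d_vio, var, var_gate=0.05):
    """Fit scale/shift between relative MDE depths and sparse VIO depths, then
    keep only low-variance residuals (variance-based gating)."""
    keep = var < var_gate                            # drop unstable depth pixels
    A = np.stack([d_pred[keep], np.ones(keep.sum())], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, d_vio[keep], rcond=None)
    return a * d_pred[keep] + b - d_vio[keep]        # residuals for the optimizer

def ordinal_residual(d_pred, d_vio, i, j, margin=0.0):
    """Pairwise ordinal constraint: penalize pairs whose depth order disagrees."""
    agree = np.sign(d_pred[i] - d_pred[j]) == np.sign(d_vio[i] - d_vio[j])
    return 0.0 if agree else abs(d_vio[i] - d_vio[j]) + margin
```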
[CV-76] Selective Prior Synchronization via SYNC Loss
【Quick Read】: This paper addresses reliable prediction under uncertainty for deep neural networks (DNNs), where the core challenge is letting the model abstain when uncertain, improving the trustworthiness and robustness of its predictions. Existing approaches fall into two classes: ad-hoc methods such as SelectiveNet, which achieve selective prediction by modifying the architecture or objective, and post-hoc methods such as softmax response, which decide whether to predict from the model's probabilistic outputs. The authors observe that post-hoc methods implicitly produce uncertainty information, termed the selective prior, which has traditionally been used only at inference. The key innovation is the SYNC loss, which for the first time brings the selective prior into training, organically combining SelectiveNet with the softmax response mechanism so the model jointly learns to predict and to judge its own uncertainty, markedly improving selective prediction performance and generalization and setting new state-of-the-art results on several benchmark datasets.
Link: https://arxiv.org/abs/2602.11316
Authors: Ishan Mishra,Jiajie Li,Deepak Mishra,Jinjun Xiong
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Prediction under uncertainty is a critical requirement for deep neural networks to succeed responsibly. This paper focuses on selective prediction, which allows DNNs to make informed decisions about when to predict or abstain based on the uncertainty level of their predictions. Current methods are either ad-hoc, such as SelectiveNet, focusing on how to modify the network architecture or objective function, or post-hoc, such as softmax response, achieving selective prediction through analyzing the model's probabilistic outputs. We observe that post-hoc methods implicitly generate uncertainty information, termed the selective prior, which has traditionally been used only during inference. We argue that the selective prior provided by the selection mechanism is equally vital during the training stage. Therefore, we propose the SYNC loss, which introduces a novel integration of ad-hoc and post-hoc methods. Specifically, our approach incorporates the softmax response into the training process of SelectiveNet, enhancing its selective prediction capabilities by examining the selective prior. Evaluated across various datasets, including CIFAR-100, ImageNet-100, and Stanford Cars, our method not only enhances the model's generalization capabilities but also surpasses previous works in selective prediction performance, and sets new benchmarks for state-of-the-art performance.
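How the softmax response might enter SelectiveNet-style training can be sketched as follows. The coupling term below, a BCE between the selection head and the detached softmax response, is my assumption for illustration; the paper's exact SYNC formulation is not given in this listing.

```python
import torch
import torch.nn.functional as F

def sync_loss(logits, select, y, coverage=0.8, lam=32.0, beta=0.5):
    """SelectiveNet-style selective risk plus a term tying the selection head
    to the softmax response (the 'selective prior'); the coupling is assumed."""
    ce = F.cross_entropy(logits, y, reduction="none")
    g = select.squeeze(-1)                                  # in (0,1), sigmoid upstream
    selective_risk = (g * ce).sum() / g.sum().clamp_min(1e-6)
    coverage_pen = lam * F.relu(coverage - g.mean()) ** 2   # hit the target coverage
    prior = torch.softmax(logits, dim=-1).amax(dim=-1)      # softmax response
    align = F.binary_cross_entropy(g.clamp(1e-6, 1 - 1e-6), prior.detach())
    return selective_risk + coverage_pen + beta * align
```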
[CV-77] Advancing Digital Twin Generation Through a Novel Simulation Framework and Quantitative Benchmarking
【Quick Read】: This paper addresses the fact that photogrammetry-based 3D model generation for digital twins involves many design choices whose evaluation has largely been qualitative. The key to the solution is a novel pipeline, presented and tested here, that synthesizes images from high-quality 3D models with programmatically generated camera poses, enabling repeatable, quantifiable experiments that compare ground-truth virtual camera parameters and objects against their reconstructed estimates, making evaluation more objective and rigorous.
Link: https://arxiv.org/abs/2602.11314
Authors: Jacob Rubinstein,Avi Donaty,Don Engel
Affiliations: University of Maryland, Baltimore County (UMBC)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 9 pages, 10 figures. Preprint
Abstract:The generation of 3D models from real-world objects has often been accomplished through photogrammetry, i.e., by taking 2D photos from a variety of perspectives and then triangulating matched point-based features to create a textured mesh. Many design choices exist within this framework for the generation of digital twins, and differences between such approaches are largely judged qualitatively. Here, we present and test a novel pipeline for generating synthetic images from high-quality 3D models and programmatically generated camera poses. This enables a wide variety of repeatable, quantifiable experiments which can compare ground-truth knowledge of virtual camera parameters and of virtual objects against the reconstructed estimations of those perspectives and subjects.
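Programmatic camera-pose generation of the kind the pipeline describes can be as simple as sampling a view sphere and building look-at rotations; the sphere parameters below are arbitrary example values, not the paper's settings.

```python
import numpy as np

def look_at(eye, target=np.zeros(3), up=np.array([0.0, 0.0, 1.0])):
    """Camera-to-world rotation for a camera at `eye` looking at `target`."""
    f = target - eye; f /= np.linalg.norm(f)          # forward
    r = np.cross(f, up); r /= np.linalg.norm(r)       # right
    u = np.cross(r, f)                                # true up
    return np.stack([r, u, -f], axis=1)               # 3x3 rotation matrix

def sphere_poses(n_az=12, n_el=3, radius=2.0):
    """Repeatable poses on a view sphere around the virtual object."""
    poses = []
    for el in np.linspace(0.2, 1.2, n_el):            # elevation (radians)
        for az in np.linspace(0, 2 * np.pi, n_az, endpoint=False):
            eye = radius * np.array([np.cos(az) * np.cos(el),
                                     np.sin(az) * np.cos(el),
                                     np.sin(el)])
            poses.append((look_at(eye), eye))
    return poses

print(len(sphere_poses()), "virtual camera poses")    # 36
```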
[CV-78] Stress Tests REVEAL Fragile Temporal and Visual Grounding in Video-Language Models
【Quick Read】: This paper addresses the lack of robustness of current video-language models (VidLMs) in understanding video content, temporal relations, and motion: the models often fail to capture the spatio-temporal nature of videos, instead relying on language shortcuts or exhibiting biases in specific scenarios. The key to the solution is REVEAL, a diagnostic benchmark that systematically probes the weaknesses of leading open- and closed-source VidLMs through five controlled stress tests (temporal expectation bias, language-only shortcuts, video sycophancy, camera motion sensitivity, and robustness under spatiotemporal occlusion), together with an automated data-generation pipeline that supports broader, scalable model evaluation and thereby encourages the development of more robust video understanding models.
Link: https://arxiv.org/abs/2602.11244
Authors: Sethuraman T V,Savya Khosla,Aditi Tiwari,Vidya Ganesh,Rakshana Jayaprakash,Aditya Jain,Vignesh Srinivasakumar,Onkar Kishor Susladkar,Srinidhi Sunkara,Aditya Shanmugham,Rakesh Vaideeswaran,Abbaas Alif Mohamed Nishar,Simon Jenni,Derek Hoiem
Affiliations: University of Illinois Urbana-Champaign; Adobe Research; Qualtrics; iManage; NVIDIA; Google; Amazon Science; Capital One
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:This work investigates a fundamental question: Do Video-Language Models (VidLMs) robustly account for video content, temporal sequence, and motion? Our investigation shows that, surprisingly, they often do not. We introduce REVEAL, a diagnostic benchmark that probes fundamental weaknesses of contemporary VidLMs through five controlled stress tests; assessing temporal expectation bias, reliance on language-only shortcuts, video sycophancy, camera motion sensitivity, and robustness to spatiotemporal occlusion. We test leading open- and closed-source VidLMs and find that these models confidently describe reversed scenes as forward, answer questions while neglecting video content, agree with false claims, struggle with basic camera motion, and fail to aggregate temporal information amidst simple spatiotemporal masking. Humans, on the other hand, succeed at these tasks with ease. Alongside our benchmark, we provide a data pipeline that automatically generates diagnostic examples for our stress tests, enabling broader and more scalable evaluation. We will release our benchmark and code to support future research.
[CV-79] ReTracing: An Archaeological Approach Through Body Machine and Generative Systems
【Quick Read】: The question this paper examines is how generative AI shapes, constrains, and produces bodily movement by encoding socio-cultural biases, thereby affecting the boundary of human-machine interaction. The key lies in ReTracing, a multi-agent embodied performance-art system whose core mechanisms are: using large language models (LLMs) to extract human-machine interaction sentences from science fiction and generate paired "what to do" and "what not to do" prompts; transforming these prompts, via a diffusion-based text-to-video model, into movement guides executable by a human dancer and a quadruped robot; and recording both agents' movements on a mirrored floor with a multi-camera motion-capture system, reconstructed as 3D point clouds and motion trails to form a digital archive of motion traces. The process reveals how generative systems embed implicit socio-cultural norms into embodied behavior, offering a new methodological lens on the entanglement of artificial intelligence and human embodiment.
Link: https://arxiv.org/abs/2602.11242
Authors: Yitong Wang,Yue Yao
Affiliations: Carnegie Mellon University; Columbia University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We present ReTracing, a multi-agent embodied performance art that adopts an archaeological approach to examine how artificial intelligence shapes, constrains, and produces bodily movement. Drawing from science-fiction novels, the project extracts sentences that describe human-machine interaction. We use large language models (LLMs) to generate paired prompts “what to do” and “what not to do” for each excerpt. A diffusion-based text-to-video model transforms these prompts into choreographic guides for a human performer and motor commands for a quadruped robot. Both agents enact the actions on a mirrored floor, captured by multi-camera motion tracking and reconstructed into 3D point clouds and motion trails, forming a digital archive of motion traces. Through this process, ReTracing serves as a novel approach to reveal how generative systems encode socio-cultural biases through choreographed movements. Through an immersive interplay of AI, human, and robot, ReTracing confronts a critical question of our time: What does it mean to be human among AIs that also move, think, and leave traces behind?
[CV-80] Active Zero: Self-Evolving Vision-Language Models through Active Environment Exploration
【Quick Read】: This paper addresses the problem that existing self-play methods for vision-language models (VLMs) rely on passive interaction with static image collections and lack active exploration, causing inefficient learning and strong dependence on the initial dataset: the model cannot adapt the difficulty of the visual samples it sees to its own evolving ability, wasting computation on tasks that are too easy or too hard. The key to the solution is the Active-Zero framework, a closed adaptive learning loop built from three co-evolving agents: a Searcher retrieves images from open-world repositories matched to the model's current capability frontier, a Questioner synthesizes calibrated reasoning tasks, and a Solver is refined with accuracy rewards. This realizes self-scaffolding auto-curricula in which the model autonomously constructs its own learning trajectory, markedly improving reasoning and general understanding across multiple benchmarks.
Link: https://arxiv.org/abs/2602.11241
Authors: Jinghan He,Junfeng Fang,Feng Xiong,Zijun Yao,Fei Shen,Haiyun Guo,Jinqiao Wang,Tat-Seng Chua
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Self-play has enabled large language models to autonomously improve through self-generated challenges. However, existing self-play methods for vision-language models rely on passive interaction with static image collections, resulting in strong dependence on initial datasets and inefficient learning. Without the ability to actively seek visual data tailored to their evolving capabilities, agents waste computational effort on samples that are either trivial or beyond their current skill level. To address these limitations, we propose Active-Zero, a framework that shifts from passive interaction to active exploration of visual environments. Active-Zero employs three co-evolving agents: a Searcher that retrieves images from open-world repositories based on the model’s capability frontier, a Questioner that synthesizes calibrated reasoning tasks, and a Solver refined through accuracy rewards. This closed loop enables self-scaffolding auto-curricula where the model autonomously constructs its learning trajectory. On Qwen2.5-VL-7B-Instruct across 12 benchmarks, Active-Zero achieves 53.97 average accuracy on reasoning tasks (5.7% improvement) and 59.77 on general understanding (3.9% improvement), consistently outperforming existing self-play baselines. These results highlight active exploration as a key ingredient for scalable and adaptive self-evolving vision-language systems.
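The closed loop reads naturally as pseudocode. Every callable below is a hypothetical stand-in for the paper's Searcher, Questioner, Solver, and policy-update components; only the loop structure is taken from the abstract.

```python
def active_zero_round(searcher, questioner, solver, update, pool, frontier):
    """One closed-loop round of the three co-evolving agents (all stand-ins)."""
    images = searcher(pool, frontier)                  # active exploration
    tasks = [questioner(img, frontier) for img in images]   # calibrated QA pairs
    rewards = [float(solver(t["image"], t["question"]) == t["answer"])
               for t in tasks]                         # accuracy reward
    update(solver, tasks, rewards)                     # e.g., a GRPO-style step
    return sum(rewards) / max(len(rewards), 1)         # new capability estimate
```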
[CV-81] Toward Reliable Tea Leaf Disease Diagnosis Using Deep Learning Model: Enhancing Robustness With Explainable AI and Adversarial Training
【Quick Read】: This paper addresses the yield and quality losses caused by leaf diseases in Bangladesh's tea cultivation, where traditional manual inspection is slow and error-prone. The key to the solution is an automated deep learning model for tea leaf disease classification, trained and validated on the teaLeafBD dataset (5,278 high-resolution images in seven classes: six diseases and one healthy). Core components include two state-of-the-art architectures, DenseNet201 and EfficientNetB3; adversarial training to improve robustness to noisy or perturbed inputs; and Grad-CAM explainable-AI visualization of the predictions to increase trustworthiness. Experiments show EfficientNetB3 reaches the best accuracy of 93%, demonstrating that the approach can identify tea leaf diseases efficiently and accurately and provides a practical tool for modern agricultural management.
Link: https://arxiv.org/abs/2602.11239
Authors: Samanta Ghosh,Jannatul Adan Mahi,Shayan Abrar,Md Parvez Mia,Asaduzzaman Rayhan,Abdul Awal Yasir,Asaduzzaman Hridoy
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 pages, 9 figures, 2025 IEEE International Women in Engineering (WIE) Conference on Electrical and Computer Engineering (WIECON-ECE)
Abstract:Tea is a valuable asset for the economy of Bangladesh. So, tea cultivation plays an important role in boosting the economy. These valuable plants are vulnerable to various kinds of leaf infections which may cause lower production and low quality. It is not easy to detect these diseases manually; it takes time and there could be some errors in the detection. So, the purpose of the study is to develop an automated deep learning model for tea leaf disease classification based on the teaLeafBD dataset so that anyone can detect the diseases more easily and efficiently. There are 5,278 high-resolution images in this dataset. The images are classified into seven categories. Six of them represent various diseases and the remaining one represents healthy leaves. The proposed pipeline contains data preprocessing, data splitting, adversarial training, augmentation, model training, evaluation, and interpretation made possible with Explainable AI strategies. DenseNet201 and EfficientNetB3 were employed to perform the classification task. To make the model more robust, we applied adversarial training so it can operate effectively even with noisy or disturbed inputs. In addition, Grad-CAM visualization was executed to analyze the model's predictions by identifying the most influential regions of each image. Our experimental outcomes revealed that EfficientNetB3 achieved the highest classification accuracy of 93%, while DenseNet201 reached 91%. The outcomes prove the effectiveness of the proposed approach, which can accurately detect tea leaf diseases and provides a practical solution for advanced agricultural management.
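The abstract reports adversarial training without naming the attack; FGSM is the simplest common choice, so the PyTorch sketch below uses it as an assumption. The mixing ratio and epsilon are also illustrative.

```python
import torch
import torch.nn.functional as F

def fgsm_step(model, x, y, eps=2 / 255):
    """One FGSM perturbation: move inputs along the sign of the loss gradient."""
    x = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def train_step(model, opt, x, y, eps=2 / 255):
    model.train()
    x_adv = fgsm_step(model, x, y, eps)
    opt.zero_grad()
    # Mix clean and adversarial batches so clean-input accuracy is preserved.
    loss = 0.5 * F.cross_entropy(model(x), y) + 0.5 * F.cross_entropy(model(x_adv), y)
    loss.backward()
    opt.step()
    return loss.item()
```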
[CV-82] DD-MDN: Human Trajectory Forecasting with Diffusion-Based Dual Mixture Density Networks and Uncertainty Self-Calibration
【Quick Read】: This paper targets long-neglected aspects of human trajectory forecasting (HTF): uncertainty modeling, calibration, and robustness to short observation windows, all of which matter for downstream tasks such as path planning and collision avoidance. The key to the solution is the DD-MDN model, whose core innovations combine a few-shot denoising diffusion backbone with a dual mixture density network to learn self-calibrated residence areas and probability-ranked anchor paths, from which diverse trajectory hypotheses are derived without predefined anchors or endpoints, achieving high positional accuracy, reliable uncertainty modeling, and strong robustness to short observations.
Link: https://arxiv.org/abs/2602.11214
Authors: Manuel Hetzel,Kerim Turacan,Hannes Reichert,Konrad Doll,Bernhard Sick
Affiliations: Faculty of Engineering, University of Applied Sciences Aschaffenburg, Germany; Intelligent Embedded Systems Lab, University of Kassel, Germany
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:Human Trajectory Forecasting (HTF) predicts future human movements from past trajectories and environmental context, with applications in Autonomous Driving, Smart Surveillance, and Human-Robot Interaction. While prior work has focused on accuracy, social interaction modeling, and diversity, little attention has been paid to uncertainty modeling, calibration, and forecasts from short observation periods, which are crucial for downstream tasks such as path planning and collision avoidance. We propose DD-MDN, an end-to-end probabilistic HTF model that combines high positional accuracy, calibrated uncertainty, and robustness to short observations. Using a few-shot denoising diffusion backbone and a dual mixture density network, our method learns self-calibrated residence areas and probability-ranked anchor paths, from which diverse trajectory hypotheses are derived, without predefined anchors or endpoints. Experiments on the ETH/UCY, SDD, inD, and IMPTC datasets demonstrate state-of-the-art accuracy, robustness at short observation intervals, and reliable uncertainty modeling. The code is available at: this https URL.
[CV-83] UltraLIF: Fully Differentiable Spiking Neural Networks via Ultradiscretization and Max-Plus Algebra
【Quick Read】: This paper addresses the optimization difficulty of spiking neural networks (SNNs) caused by the non-differentiability of spike generation: traditional training relies on heuristic surrogate gradients, which introduce a forward-backward mismatch and performance bottlenecks. The key to the solution is the UltraLIF framework, which replaces surrogate gradients with ultradiscretization, a mathematical formalism from tropical geometry, building a continuous relaxation via the log-sum-exp function to model neuronal threshold dynamics differentiably; the construction naturally exploits the max-plus semiring and converges to hard thresholding as the temperature parameter ε → 0. Two fully differentiable SNN models are derived from it: UltraLIF, from the LIF ordinary differential equation, and UltraDLIF, from a diffusion equation modeling gap-junction coupling; both are trainable end-to-end with standard backpropagation, with theoretical proofs of pointwise convergence to classical LIF dynamics and bounded, non-vanishing gradients.
Link: https://arxiv.org/abs/2602.11206
Authors: Jose Marie Antonio Miñoza
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Rings and Algebras (math.RA); Neurons and Cognition (q-bio.NC)
Comments:
Abstract:Spiking Neural Networks (SNNs) offer energy-efficient, biologically plausible computation but suffer from non-differentiable spike generation, necessitating reliance on heuristic surrogate gradients. This paper introduces UltraLIF, a principled framework that replaces surrogate gradients with ultradiscretization, a mathematical formalism from tropical geometry providing continuous relaxations of discrete dynamics. The central insight is that the max-plus semiring underlying ultradiscretization naturally models neural threshold dynamics: the log-sum-exp function serves as a differentiable soft-maximum that converges to hard thresholding as a learnable temperature parameter \epsilon \to 0 . Two neuron models are derived from distinct dynamical systems: UltraLIF from the LIF ordinary differential equation (temporal dynamics) and UltraDLIF from the diffusion equation modeling gap junction coupling across neuronal populations (spatial dynamics). Both yield fully differentiable SNNs trainable via standard backpropagation with no forward-backward mismatch. Theoretical analysis establishes pointwise convergence to classical LIF dynamics with quantitative error bounds and bounded non-vanishing gradients. Experiments on six benchmarks spanning static images, neuromorphic vision, and audio demonstrate improvements over surrogate gradient baselines, with gains most pronounced in single-timestep ( T=1 ) settings on neuromorphic and temporal datasets. An optional sparsity penalty enables significant energy reduction while maintaining competitive accuracy.
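The abstract's central identity, log-sum-exp converging to a hard maximum as the temperature shrinks, is easy to verify numerically. The sketch below shows only that identity used as a differentiable threshold; it is not the UltraLIF neuron model itself.

```python
import numpy as np

def soft_max_pair(a, b, eps):
    """Ultradiscretization: eps*log(exp(a/eps)+exp(b/eps)) -> max(a, b) as eps -> 0."""
    m = np.maximum(a, b)                      # stabilized log-sum-exp
    return m + eps * np.log(np.exp((a - m) / eps) + np.exp((b - m) / eps))

def soft_spike(v, theta, eps):
    """Differentiable surrogate for the hard threshold max(v - theta, 0)."""
    return soft_max_pair(v - theta, 0.0, eps)

v = np.linspace(-1, 2, 7)
for eps in (1.0, 0.1, 0.01):
    print(eps, np.round(soft_spike(v, theta=1.0, eps=eps), 3))
# As eps shrinks, the output approaches the hard rectified threshold.
```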
[CV-84] GAC-KAN: An Ultra-Lightweight GNSS Interference Classifier for GenAI-Powered Consumer Edge Devices
【Quick Read】: This paper addresses a dual challenge for generative AI (GenAI) in consumer electronics: scarce real-world interference data makes training robust classifiers difficult, and edge hardware is so resource-constrained that compute-hungry GenAI workloads crowd out fundamental security functions such as Global Navigation Satellite System (GNSS) signal protection. The key to the solution is the lightweight GAC-KAN framework: a physics-guided simulation synthesizes a large-scale, high-fidelity jamming dataset to relieve the data bottleneck; a Multi-Scale Ghost-ACB-Coordinate (MS-GAC) backbone combines Asymmetric Convolution Blocks (ACB) and Ghost modules to extract spectral-temporal features efficiently with minimal redundancy; and a Kolmogorov-Arnold Network (KAN) with learnable spline activations replaces the traditional MLP head, achieving strong non-linear mapping with far fewer parameters (only 0.13 million), thereby safeguarding GNSS security at very low compute cost.
Link: https://arxiv.org/abs/2602.11186
Authors: Zhihan Zeng,Kaihe Wang,Zhongpei Zhang,Yue Xiu
Affiliations: University of Electronic Science and Technology of China; National Key Laboratory of Wireless Communications
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The integration of Generative AI (GenAI) into Consumer Electronics (CE)–from AI-powered assistants in wearables to generative planning in autonomous Uncrewed Aerial Vehicles (UAVs)–has revolutionized user experiences. However, these GenAI applications impose immense computational burdens on edge hardware, leaving strictly limited resources for fundamental security tasks like Global Navigation Satellite System (GNSS) signal protection. Furthermore, training robust classifiers for such devices is hindered by the scarcity of real-world interference data. To address the dual challenges of data scarcity and the extreme efficiency required by the GenAI era, this paper proposes a novel framework named GAC-KAN. First, we adopt a physics-guided simulation approach to synthesize a large-scale, high-fidelity jamming dataset, mitigating the data bottleneck. Second, to reconcile high accuracy with the stringent resource constraints of GenAI-native chips, we design a Multi-Scale Ghost-ACB-Coordinate (MS-GAC) backbone. This backbone combines Asymmetric Convolution Blocks (ACB) and Ghost modules to extract rich spectral-temporal features with minimal redundancy. Replacing the traditional Multi-Layer Perceptron (MLP) decision head, we introduce a Kolmogorov-Arnold Network (KAN), which employs learnable spline activation functions to achieve superior non-linear mapping capabilities with significantly fewer parameters. Experimental results demonstrate that GAC-KAN achieves an overall accuracy of 98.0%, outperforming state-of-the-art baselines. Significantly, the model contains only 0.13 million parameters, approximately 660 times fewer than Vision Transformer (ViT) baselines. This extreme lightweight characteristic makes GAC-KAN an ideal "always-on" security companion, ensuring GNSS reliability without contending for the computational resources required by primary GenAI tasks.
[CV-85] Mitigating Error Accumulation in Continuous Navigation via Memory-Augmented Kalman Filtering
【Quick Read】: This paper addresses the state drift suffered by existing vision-language navigation (VLN) models during continuous UAV navigation in complex environments due to their dead-reckoning strategy: iterative position updates let errors accumulate over time, misaligning the internal belief with the true coordinates and degrading full-trajectory prediction. The key to the solution is to cast sequential prediction as recursive Bayesian state estimation and propose NeuroKalman, which decouples navigation into prior prediction (from motion dynamics) and likelihood correction (from historical observations); by mathematically associating kernel density estimation of the measurement likelihood with an attention-based retrieval mechanism, the latent state is rectified without gradient updates, effectively suppressing drift accumulation.
Link: https://arxiv.org/abs/2602.11183
Authors: Yin Tang,Jiawei Ma,Jinrui Zhang,Alex Jinpeng Wang,Deyu Zhang
Affiliations: Unknown
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments: Preprint, 15 pages, 6 figures
Abstract:Continuous navigation in complex environments is critical for Unmanned Aerial Vehicles (UAVs). However, existing Vision-Language Navigation (VLN) models follow a dead-reckoning scheme that iteratively updates the position for the next waypoint prediction and subsequently constructs the complete trajectory. Such a stepwise manner inevitably accumulates position errors over time, resulting in misalignment between internal belief and objective coordinates, which is known as "state drift" and ultimately compromises the full trajectory prediction. Drawing inspiration from classical control theory, we propose to correct for errors by formulating such sequential prediction as a recursive Bayesian state estimation problem. In this paper, we design NeuroKalman, a novel framework that decouples navigation into two complementary processes: a Prior Prediction, based on motion dynamics, and a Likelihood Correction, from historical observation. We first mathematically associate Kernel Density Estimation of the measurement likelihood with the attention-based retrieval mechanism, which then allows the system to rectify the latent representation using retrieved historical anchors without gradient updates. Comprehensive experiments on the TravelUAV benchmark demonstrate that, with fine-tuning on only 10% of the training data, our method clearly outperforms strong baselines and regulates drift accumulation.
[CV-86] Improving Medical Visual Reinforcement Fine-Tuning via Perception and Reasoning Augmentation
【Quick Read】: This paper addresses the limited adoption of rule-based reinforcement fine-tuning (RFT) in cross-modal, vision-centric medical imaging, a high-stakes domain that requires both robust visual perception and structured reasoning. The key to the solution is the proposed VRFT-Aug framework, whose four complementary strategies strengthen perception and reasoning: prior knowledge injection, perception-driven policy refinement, medically informed reward shaping, and behavioral imitation, which together stabilize and improve the effectiveness and reliability of RFT training.
Link: https://arxiv.org/abs/2602.10619
Authors: Guangjing Yang,ZhangYuan Yu,Ziyuan Qin,Xinyuan Song,Huahui Yi,Qingbo Kang,Jun Gao,Yiyue Li,Chenlin Du,Qicheng Lao
Affiliations: Beijing University of Posts and Telecommunications; Emory University; Sichuan University; Peking University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: CPAL 2026
Abstract:While recent advances in Reinforcement Fine-Tuning (RFT) have shown that rule-based reward schemes can enable effective post-training for large language models, their extension to cross-modal, vision-centric domains remains largely underexplored. This limitation is especially pronounced in the medical imaging domain, where effective performance requires both robust visual perception and structured reasoning. In this work, we address this gap by proposing VRFT-Aug, a visual reinforcement fine-tuning framework tailored for the medical domain. VRFT-Aug introduces a series of training strategies designed to augment both perception and reasoning, including prior knowledge injection, perception-driven policy refinement, medically informed reward shaping, and behavioral imitation. Together, these methods aim to stabilize and improve the RFT process. Through extensive experiments across multiple medical datasets, we show that our approaches consistently outperform both standard supervised fine-tuning and RFT baselines. Moreover, we provide empirically grounded insights and practical training heuristics that can be generalized to other medical image tasks. We hope this work contributes actionable guidance and fresh inspiration for the ongoing effort to develop reliable, reasoning-capable models for high-stakes medical applications.
[CV-87] UPDA: Unsupervised Progressive Domain Adaptation for No-Reference Point Cloud Quality Assessment
【Quick Read】: This paper addresses the sharp performance drop of no-reference point cloud quality assessment (NR-PCQA) models when the source (training) and target (testing) domains differ, i.e., poor cross-domain adaptability. The key to the solution is the first unsupervised progressive domain adaptation (UPDA) framework, a two-stage coarse-to-fine alignment paradigm: the coarse stage uses a discrepancy-aware alignment method whose novel quality-discrepancy-aware hybrid loss captures relative quality relationships between cross-domain samples, sidestepping the difficulty of direct absolute feature alignment; the fine stage uses a perception-fusion alignment strategy with symmetric feature fusion to identify domain-invariant features, while a conditional discriminator selectively strengthens the transfer of quality-relevant features, effectively improving the generalization of NR-PCQA models in cross-domain scenarios.
Link: https://arxiv.org/abs/2602.11969
Authors: Bingxu Xie,Fang Zhou,Jincan Wu,Yonghui Liu,Weiqing Li,Zhiyong Su
Affiliations: Nanjing University of Science and Technology; National Key Laboratory of Information Systems Engineering
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: to be published in IEEE Transactions on Broadcasting
Abstract:While no-reference point cloud quality assessment (NR-PCQA) approaches have achieved significant progress over the past decade, their performance often degrades substantially when a distribution gap exists between the training (source domain) and testing (target domain) data. However, to date, limited attention has been paid to transferring NR-PCQA models across domains. To address this challenge, we propose the first unsupervised progressive domain adaptation (UPDA) framework for NR-PCQA, which introduces a two-stage coarse-to-fine alignment paradigm to address domain shifts. At the coarse-grained stage, a discrepancy-aware coarse-grained alignment method is designed to capture relative quality relationships between cross-domain samples through a novel quality-discrepancy-aware hybrid loss, circumventing the challenges of direct absolute feature alignment. At the fine-grained stage, a perception fusion fine-grained alignment approach with symmetric feature fusion is developed to identify domain-invariant features, while a conditional discriminator selectively enhances the transfer of quality-relevant features. Extensive experiments demonstrate that the proposed UPDA effectively enhances the performance of NR-PCQA methods in cross-domain scenarios, validating its practical applicability. The code is available at this https URL.
[CV-88] Learning Perceptual Representations for Gaming NR-VQA with Multi-Task FR Signals
【Quick Read】: This paper addresses the difficulty of no-reference video quality assessment (NR-VQA) for gaming videos, whose core challenges are the scarcity of high-quality human-rated data and content characteristics unique to gaming (fast motion, stylized graphics, compression artifacts). The key to the solution is MTL-VQA, a multi-task learning (MTL) framework that uses full-reference (FR) metrics as supervisory signals to pretrain perceptually meaningful features without human labels; by jointly optimizing multiple FR objectives with adaptive task weighting, the model learns transferable shared representations that markedly improve no-reference performance, reaching state-of-the-art results under both MOS-supervised and label-efficient/self-supervised settings.
Link: https://arxiv.org/abs/2602.11903
Authors: Yu-Chih Chen,Michael Wang,Chieh-Dun Wen,Kai-Siang Ma,Avinab Saha,Li-Heng Chen,Alan Bovik
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 6 pages, 2 figures
Abstract:No-reference video quality assessment (NR-VQA) for gaming videos is challenging due to limited human-rated datasets and unique content characteristics including fast motion, stylized graphics, and compression artifacts. We present MTL-VQA, a multi-task learning framework that uses full-reference metrics as supervisory signals to learn perceptually meaningful features without human labels for pretraining. By jointly optimizing multiple full-reference (FR) objectives with adaptive task weighting, our approach learns shared representations that transfer effectively to NR-VQA. Experiments on gaming video datasets show MTL-VQA achieves performance competitive with state-of-the-art NR-VQA methods across both MOS-supervised and label-efficient/self-supervised settings.
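"Adaptive task weighting" is unspecified in this listing; one standard instantiation is homoscedastic-uncertainty weighting in the style of Kendall et al., sketched below as an assumption, with each loss term regressing a different full-reference target.

```python
import torch
import torch.nn as nn

class AdaptiveTaskWeighting(nn.Module):
    """Uncertainty-based weighting: total = sum_i exp(-s_i) * L_i + s_i,
    with one learnable log-variance s_i per full-reference objective."""
    def __init__(self, n_tasks: int):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(n_tasks))

    def forward(self, losses):                 # e.g., losses against PSNR-,
        total = 0.0                            # SSIM-, and VMAF-style targets
        for s, L in zip(self.log_vars, losses):
            total = total + torch.exp(-s) * L + s
        return total
```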
[CV-89] U-DAVI: Uncertainty-Aware Diffusion-Prior-Based Amortized Variational Inference for Image Reconstruction ICASSP2026
【Quick Read】: This paper addresses ill-posed imaging inverse problems, i.e., the ambiguity of recovering clean images from degraded observations. Diffusion-based generative priors are promising but usually require computationally intensive iterative sampling or per-image optimization; existing amortized variational inference frameworks enable fast posterior sampling but struggle to reconstruct fine details and complex textures. The key to the solution is to inject spatially adaptive perturbations into the measurements during training, with perturbation strength guided by uncertainty estimates, so the model emphasizes learning in the most uncertain regions and produces higher-quality reconstructions. On deblurring and super-resolution, the method matches or surpasses current diffusion-based approaches while avoiding the computational cost of iterative refinement.
Link: https://arxiv.org/abs/2602.11704
Authors: Ayush Varshney,Katherine L. Bouman,Berthy T. Feng
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICASSP 2026
Abstract:Ill-posed imaging inverse problems remain challenging due to the ambiguity in mapping degraded observations to clean images. Diffusion-based generative priors have recently shown promise, but typically rely on computationally intensive iterative sampling or per-instance optimization. Amortized variational inference frameworks address this inefficiency by learning a direct mapping from measurements to posteriors, enabling fast posterior sampling without requiring the optimization of a new posterior for every new set of measurements. However, they still struggle to reconstruct fine details and complex textures. To address this, we extend the amortized framework by injecting spatially adaptive perturbations to measurements during training, guided by uncertainty estimates, to emphasize learning in the most uncertain regions. Experiments on deblurring and super-resolution demonstrate that our method achieves superior or competitive performance to previous diffusion-based approaches, delivering more realistic reconstructions without the computational cost of iterative refinement.
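The training-time mechanism, uncertainty-scaled noise injected into the measurements, can be sketched in a few lines. The normalization, gain, and the way the posterior std is obtained below are illustrative assumptions.

```python
import torch

def perturb_measurements(y, posterior_std, gamma=0.5):
    """Spatially adaptive perturbation: inject more noise where the current
    posterior is most uncertain, so training emphasizes those regions."""
    sigma = gamma * posterior_std / (posterior_std.amax() + 1e-8)  # normalize
    return y + sigma * torch.randn_like(y)

# posterior_std could be the pixelwise std over a few posterior samples, e.g.:
#   samples = torch.stack([posterior_net(y) for _ in range(K)])  # hypothetical
#   posterior_std = samples.std(dim=0)
```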
[CV-90] Hybrid operator learning of wave scattering maps in high-contrast media
【Quick Read】: This paper addresses surrogate modeling of wave propagation and scattering (the wave speed and source to wave field maps) in high-contrast heterogeneous media, which matters for applications such as seismic imaging and inversion. Conventional neural operators struggle in strongly scattering settings such as subsurface models with salt bodies because of phase sensitivity and complex spatial interactions. The key to the solution is a hybrid architecture that decomposes the scattering operator into two separate contributions: smooth background propagation and a high-contrast scattering correction. The background propagation is learned with a Fourier Neural Operator (FNO), producing globally coupled feature tokens that encode background wave propagation; these tokens are then fed to a vision transformer, whose attention models the high-contrast scattering correction dominated by strong spatial interactions, substantially improving phase and amplitude accuracy with favorable accuracy-parameter scaling.
Link: https://arxiv.org/abs/2602.11197
Authors: Advait Balaji,Trevor Teolis,S. David Mis,Jose Antonio Lara Benitez,Chao Wang,Maarten V. de Hoop
Affiliations: Unknown
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Surrogate modeling of wave propagation and scattering (i.e. the wave speed and source to wave field map) in heterogeneous media has significant potential in applications such as seismic imaging and inversion. High-contrast settings, such as subsurface models with salt bodies, exhibit strong scattering and phase sensitivity that challenge existing neural operators. We propose a hybrid architecture that decomposes the scattering operator into two separate contributions: a smooth background propagation and a high-contrast scattering correction. The smooth component is learned with a Fourier Neural Operator (FNO), which produces globally coupled feature tokens encoding background wave propagation; these tokens are then passed to a vision transformer, where attention is used to model the high-contrast scattering correction dominated by strong, spatial interactions. Evaluated on high-frequency Helmholtz problems with strong contrasts, the hybrid model achieves substantially improved phase and amplitude accuracy compared to standalone FNOs or transformers, with favorable accuracy-parameter scaling.
Artificial Intelligence
[AI-0] Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment
【Quick Read】: This paper addresses the "intention-action gap" of vision-language-action (VLA) models when executing natural-language instructions, i.e., semantic mismatches or behavioral deviations between the generated actions and the given instruction. The key to the solution is test-time verification: CoVer, a contrastive verifier, increases test-time sample diversity by jointly scaling the number of rephrased instructions and action candidates, and combines "boot-time compute" with a hierarchical verification inference pipeline that precomputes diverse instructions at deployment, repeatedly generates action candidates, and lets the verifier select the optimal high-level prompt and low-level action chunks, markedly improving the accuracy and robustness of instruction following.
Link: https://arxiv.org/abs/2602.12281
Authors: Jacky Kwok,Xilun Zhang,Mengdi Xu,Yuejiang Liu,Azalia Mirhoseini,Chelsea Finn,Marco Pavone
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:
Abstract:The long-standing vision of general-purpose robots hinges on their ability to understand and act upon natural language instructions. Vision-Language-Action (VLA) models have made remarkable progress toward this goal, yet their generated actions can still misalign with the given instructions. In this paper, we investigate test-time verification as a means to shrink the "intention-action gap." We first characterize the test-time scaling law for embodied instruction following and demonstrate that jointly scaling the number of rephrased instructions and generated actions greatly increases test-time sample diversity, often recovering correct actions more efficiently than scaling each dimension independently. To capitalize on these scaling laws, we present CoVer, a contrastive verifier for vision-language-action alignment, and show that our architecture scales gracefully with additional computational resources and data. We then introduce "boot-time compute" and a hierarchical verification inference pipeline for VLAs. At deployment, our framework precomputes a diverse set of rephrased instructions from a Vision-Language-Model (VLM), repeatedly generates action candidates for each instruction, and then uses a verifier to select the optimal high-level prompt and low-level action chunks. Compared to scaling policy pre-training on the same data, our verification approach yields 22% gains in-distribution and 13% out-of-distribution on the SIMPLER benchmark, with a further 45% improvement in real-world experiments. On the PolaRiS benchmark, CoVer achieves 14% gains in task progress and 9% in success rate.
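The hierarchical verification pipeline reads as a straightforward argmax over a (rephrasings x candidates) grid. All callables below are hypothetical stand-ins for the paper's components.

```python
def verified_action(vlm_rephrase, policy, verifier, obs, instruction, K=8, M=4):
    """Test-time verification: jointly scale rephrasings (K) and action
    candidates per rephrasing (M), then let the verifier pick."""
    prompts = vlm_rephrase(instruction, n=K)          # boot-time compute: cacheable
    candidates = [(p, policy(obs, p)) for p in prompts for _ in range(M)]
    scores = [verifier(obs, instruction, a) for _, a in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]                           # (chosen prompt, action chunk)
```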
[AI-1] CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use
【Quick Read】: This paper addresses the difficulty of applying reinforcement learning (RL) to multi-turn, multi-step agentic tool use: realistic task objectives often lack verifiable reward signals and instead emphasize open-ended behaviors; RL training for multi-turn interaction remains underexplored; and building and maintaining executable tool environments is costly, limiting scale and coverage. The key to the solution is the CM2 framework, which converts hard-to-quantify open-ended behavioral objectives into fine-grained binary checklist rewards: each turn's intended behavior is decomposed into criteria with explicit evidence grounding and structured metadata, turning subjective judging into more stable classification-style decisions. The method adopts sparse reward assignment with dense evaluation criteria, preserving training stability while increasing feedback density, and uses an LLM-simulated tool environment to scale training without heavy engineering for real tool environments.
Link: https://arxiv.org/abs/2602.12268
Authors: Zhen Zhang,Kaiqiang Song,Xun Wang,Yebowen Hu,Weixiang Yan,Chenyang Zhao,Henry Peng Zou,Haoyun Deng,Sathish Reddy Indurthi,Shujian Liu,Simin Ma,Xiaoyang Wang,Xin Eric Wang,Song Wang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:AI agents are increasingly used to solve real-world tasks by reasoning over multi-turn user interactions and invoking external tools. However, applying reinforcement learning to such settings remains difficult: realistic objectives often lack verifiable rewards and instead emphasize open-ended behaviors; moreover, RL for multi-turn, multi-step agentic tool use is still underexplored; and building and maintaining executable tool environments is costly, limiting scale and coverage. We propose CM2, an RL framework that replaces verifiable outcome rewards with checklist rewards. CM2 decomposes each turn's intended behavior into fine-grained binary criteria with explicit evidence grounding and structured metadata, turning open-ended judging into more stable classification-style decisions. To balance stability and informativeness, our method adopts a strategy of sparse reward assignment but dense evaluation criteria. Training is performed in a scalable LLM-simulated tool environment, avoiding heavy engineering for large tool sets. Experiments show that CM2 consistently improves over supervised fine-tuning. Starting from an 8B Base model and training on an 8k-example RL dataset, CM2 improves over the SFT counterpart by 8 points on τ-Bench, by 10 points on BFCL-V4, and by 12 points on ToolSandbox. The results match or even outperform similarly sized open-source baselines, including the judging model. CM2 thus provides a scalable recipe for optimizing multi-turn, multi-step tool-using agents without relying on verifiable rewards. Code provided by the open-source community: this https URL.
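Checklist rewards with dense criteria but one sparse scalar per turn can be sketched as follows. The criterion format and the judge interface are assumptions, not CM2's actual schema.

```python
def checklist_reward(turn_trace, criteria, judge):
    """Turn one turn's open-ended objective into binary checks.
    `judge(check, evidence)` is a hypothetical LLM classifier returning 0/1."""
    results = []
    for c in criteria:          # e.g. {"check": "agent confirmed the date",
                                #       "evidence": "assistant_turn_3"}
        results.append(judge(c["check"], turn_trace.get(c["evidence"], "")))
    return sum(results) / max(len(results), 1)   # dense criteria, one scalar

# Sparse assignment: attach this scalar to the end of the turn only, keeping
# the policy-gradient signal stable while the criteria themselves stay dense.
```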
[AI-2] Think like a Scientist: Physics-guided LLM Agent for Equation Discovery
【Quick Read】: This paper addresses a limitation of current symbolic equation discovery methods based on large language models (LLMs) for scientific modeling: most existing methods guess equations directly from data, ignoring the multi-step logic scientists actually follow, first inferring physical properties (such as symmetries), then using these priors to constrain the space of candidate equations. The key to the solution is KeplerAgent, an agentic framework that explicitly mimics this scientific reasoning process: it coordinates physics-based tools to extract intermediate structure and uses the results to configure symbolic regression engines (such as PySINDy and PySR), including their function libraries and structural constraints, substantially improving symbolic accuracy and robustness to noisy data.
Link: https://arxiv.org/abs/2602.12259
Authors: Jianke Yang,Ohm Venkatachalam,Mohammad Kianezhad,Sharvaree Vadgama,Rose Yu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Explaining observed phenomena through symbolic, interpretable formulas is a fundamental goal of science. Recently, large language models (LLMs) have emerged as promising tools for symbolic equation discovery, owing to their broad domain knowledge and strong reasoning capabilities. However, most existing LLM-based systems try to guess equations directly from data, without modeling the multi-step reasoning process that scientists often follow: first inferring physical properties such as symmetries, then using these as priors to restrict the space of candidate equations. We introduce KeplerAgent, an agentic framework that explicitly follows this scientific reasoning process. The agent coordinates physics-based tools to extract intermediate structure and uses these results to configure symbolic regression engines such as PySINDy and PySR, including their function libraries and structural constraints. Across a suite of physical equation benchmarks, KeplerAgent achieves substantially higher symbolic accuracy and greater robustness to noisy data than both LLM and traditional baselines.
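To make the "agent configures the symbolic-regression engine" step concrete, here is a hedged PySR example. PySRRegressor and the options shown are real, but the agent-derived restrictions (even functions only, small expression size) are invented for illustration; the actual tool plumbing is not public in this listing.

```python
import numpy as np
from pysr import PySRRegressor

# Toy observations: a harmonic potential, assumed to be what the agent's
# physics tools have analyzed (e.g., detecting an even symmetry in x).
X = np.random.uniform(-2, 2, size=(200, 1))
y = 0.5 * X[:, 0] ** 2

model = PySRRegressor(
    niterations=20,
    binary_operators=["+", "*"],        # agent-restricted function library
    unary_operators=["square"],         # symmetry prior: even functions only
    maxsize=12,                         # structural constraint from the agent
)
model.fit(X, y)
print(model.get_best())                 # best recovered symbolic expression
```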
[AI-3] ExtractBench: A Benchmark and Evaluation Methodology for Complex Structured Extraction
【Quick Read】: This paper addresses two key gaps in PDF-to-structured-JSON extraction: no end-to-end benchmark evaluates extraction accuracy under enterprise-scale schema complexity, and existing methodologies cannot capture the semantics of nested extraction, where identifiers demand exact match, quantities allow tolerance, names need semantic equivalence, arrays require alignment, and omission must be distinguished from hallucination. The core of the solution is ExtractBench, an open-source benchmark and evaluation framework whose innovation is treating the JSON Schema as an executable specification in which each field declares its scoring metric, enabling fine-grained, semantics-aware automated evaluation. Built on 35 real, economically valuable PDF documents with human-annotated gold labels, its schemas span tens to hundreds of fields, totaling 12,867 evaluatable fields; experiments show frontier models degrade severely on complex schemas, underscoring the importance of the benchmark.
Link: https://arxiv.org/abs/2602.12247
Authors: Nick Ferguson,Josh Pennington,Narek Beghian,Aravind Mohan,Douwe Kiela,Sheshansh Agrawal,Thien Hang Nguyen
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Unstructured documents like PDFs contain valuable structured information, but downstream systems require this data in reliable, standardized formats. LLMs are increasingly deployed to automate this extraction, making accuracy and reliability paramount. However, progress is bottlenecked by two gaps. First, no end-to-end benchmark evaluates PDF-to-JSON extraction under enterprise-scale schema breadth. Second, no principled methodology captures the semantics of nested extraction, where fields demand different notions of correctness (exact match for identifiers, tolerance for quantities, semantic equivalence for names), arrays require alignment, and omission must be distinguished from hallucination. We address both gaps with ExtractBench, an open-source benchmark and evaluation framework for PDF-to-JSON structured extraction. The benchmark pairs 35 PDF documents with JSON Schemas and human-annotated gold labels across economically valuable domains, yielding 12,867 evaluatable fields spanning schema complexities from tens to hundreds of fields. The evaluation framework treats the schema as an executable specification: each field declares its scoring metric. Baseline evaluations reveal that frontier models (GPT-5/5.2, Gemini-3 Flash/Pro, Claude 4.5 Opus/Sonnet) remain unreliable on realistic schemas. Performance degrades sharply with schema breadth, culminating in 0% valid output on a 369-field financial reporting schema across all tested models. We release ExtractBench at this https URL.
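"Schema as executable specification" can be pictured as a per-field metric dispatch. The metric names, keys, and examples below are illustrative, not the benchmark's actual schema vocabulary.

```python
import math

def score_field(pred, gold, metric, tol=0.01):
    """Each field declares how it is scored; omission and hallucination
    are both penalized but distinguished from a correct null."""
    if pred is None and gold is None:
        return 1.0                         # correctly omitted
    if pred is None or gold is None:
        return 0.0                         # omission vs. hallucination
    if metric == "exact":                  # identifiers
        return float(str(pred) == str(gold))
    if metric == "tolerance":              # quantities
        return float(math.isclose(float(pred), float(gold), rel_tol=tol))
    if metric == "semantic":               # names; embed-and-compare stub
        return float(str(pred).strip().lower() == str(gold).strip().lower())
    raise ValueError(metric)

schema = {"invoice_id": "exact", "total_usd": "tolerance", "vendor_name": "semantic"}
pred = {"invoice_id": "INV-42", "total_usd": 100.4, "vendor_name": " Acme Corp"}
gold = {"invoice_id": "INV-42", "total_usd": 100.0, "vendor_name": "acme corp"}
print({k: score_field(pred.get(k), gold.get(k), m) for k, m in schema.items()})
```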
[AI-4] Intrinsic-Energy Joint Embedding Predictive Architectures Induce Quasimetric Spaces
【Quick Read】: This paper addresses the missing theoretical link between generative representation learning (such as Joint-Embedding Predictive Architectures, JEPAs) and goal-conditioned control (such as Quasimetric Reinforcement Learning, QRL), in particular how to unify the two views of "distance" or "energy" over the state space. The key is to identify and exploit a specific class of JEPA energy functions, intrinsic energies, defined as infima of accumulated local effort over all admissible trajectories from one state to another. Under mild closure and additivity assumptions, an intrinsic energy is naturally a quasimetric, so representations learned by JEPA models of such energies lie exactly in the cost-function class targeted by goal-conditioned QRL, theoretically unifying the two; the analysis also reveals the structural mismatch of symmetric finite energies with one-way reachability, motivating asymmetric (i.e., quasimetric) energies when directionality matters.
Link: https://arxiv.org/abs/2602.12245
Authors: Anthony Kobanda,Waris Radji
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Joint-Embedding Predictive Architectures (JEPAs) aim to learn representations by predicting target embeddings from context embeddings, inducing a scalar compatibility energy in a latent space. In contrast, Quasimetric Reinforcement Learning (QRL) studies goal-conditioned control through directed distance values (cost-to-go) that support reaching goals under asymmetric dynamics. In this short article, we connect these viewpoints by restricting attention to a principled class of JEPA energy functions: intrinsic (least-action) energies, defined as infima of accumulated local effort over admissible trajectories between two states. Under mild closure and additivity assumptions, any intrinsic energy is a quasimetric. In goal-reaching control, optimal cost-to-go functions admit exactly this intrinsic form; inversely, JEPAs trained to model intrinsic energies lie in the quasimetric value class targeted by QRL. Moreover, we observe why symmetric finite energies are structurally mismatched with one-way reachability, motivating asymmetric (quasimetric) energies when directionality matters.
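Written out, the definitions the abstract relies on read as follows (notation is mine, with c a local effort and s_t the states along a trajectory):

```latex
% Intrinsic (least-action) energy: infimum of accumulated local effort
% over admissible trajectories \tau from x to y.
E(x, y) \;=\; \inf_{\tau :\, x \rightsquigarrow y} \;\sum_{t} c(s_t, s_{t+1})
% Under closure and additivity, E satisfies the quasimetric axioms:
%   E(x, x) = 0,
%   E(x, z) \le E(x, y) + E(y, z),
% while symmetry E(x, y) = E(y, x) is NOT required (directed reachability).
```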
[AI-5] Bandit Learning in Matching Markets with Interviews
【Quick Read】: This paper tackles the decision problem in matching markets where participants cannot fully evaluate their preferences and must learn under a limited number of interviews. On one hand, both sides (e.g., firms and applicants) obtain only partial preference information from low-cost but noisy early interviews; on the other hand, firms may make suboptimal hires under uncertainty, and conventional methods offer no way to correct such mistakes. The key to the solution is a strategic-deferral mechanism that allows a firm to hire no one in a given round, creating opportunities to recover from earlier mistakes and enabling decentralized learning without coordination. The authors design two families of algorithms, one for a centralized setting with an omniscient interview allocator and one for two types of decentralized feedback; all achieve time-independent regret, a substantial improvement over the known O(log T) regret bound without interviews. Under structured-market assumptions, the decentralized schemes match the centralized optimum up to polynomial factors.
Link: https://arxiv.org/abs/2602.12224
Authors: Amirmahdi Mirfakhar, Xuchuang Wang, Mengfan Xu, Hedyeh Beyhaghi, Mohammad Hajiesmaili
Affiliation: Unknown
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Theoretical Economics (econ.TH)
Comments:
Abstract: Two-sided matching markets rely on preferences from both sides, yet it is often impractical to evaluate preferences. Participants, therefore, conduct a limited number of interviews, which provide early, noisy impressions and shape final decisions. We study bandit learning in matching markets with interviews, modeling interviews as low-cost hints that reveal partial preference information to both sides. Our framework departs from existing work by allowing firm-side uncertainty: firms, like agents, may be unsure of their own preferences and can make early hiring mistakes by hiring less preferred agents. To handle this, we extend the firm's action space to allow strategic deferral (choosing not to hire in a round), enabling recovery from suboptimal hires and supporting decentralized learning without coordination. We design novel algorithms for (i) a centralized setting with an omniscient interview allocator and (ii) decentralized settings with two types of firm-side feedback. Across all settings, our algorithms achieve time-independent regret, a substantial improvement over the O(log T) regret bounds known for learning stable matchings without interviews. Also, under mild structured markets, decentralized performance matches the centralized counterpart up to polynomial factors in the number of agents and firms.
[AI-6] The Observer Effect in World Models: Invasive Adaptation Corrupts Latent Physics
【Quick Read】: This paper asks whether neural models truly internalize physical laws as world models rather than exploiting statistical shortcuts, especially under out-of-distribution (OOD) conditions. Existing evaluations typically test latent capability through downstream adaptation (fine-tuning or high-capacity probes), but such interventions can alter the representations learned during self-supervised learning (SSL), confounding the assessment of genuine physical understanding. The paper proposes a non-invasive protocol, PhyIP, whose core idea, motivated by the linear representation hypothesis, is to test whether physical quantities can be decoded from frozen representations with a low-capacity linear probe. Experiments show that when SSL reaches low error, latent structure becomes linearly decodable: PhyIP recovers internal energy and the Newtonian inverse-square relation on OOD tests (correlation ρ ≈ 0.90), whereas adaptation-based evaluation collapses this structure (ρ ≈ 0.05), indicating that low-capacity probes give a more faithful picture of physical world models.
Link: https://arxiv.org/abs/2602.12218
Authors: Christian Internò, Jumpei Yamaguchi, Loren Amdahl-Culleton, Markus Olhofer, David Klindt, Barbara Hammer
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract: Determining whether neural models internalize physical laws as world models, rather than exploiting statistical shortcuts, remains challenging, especially under out-of-distribution (OOD) shifts. Standard evaluations often test latent capability via downstream adaptation (e.g., fine-tuning or high-capacity probes), but such interventions can change the representations being measured and thus confound what was learned during self-supervised learning (SSL). We propose a non-invasive evaluation protocol, PhyIP. We test whether physical quantities are linearly decodable from frozen representations, motivated by the linear representation hypothesis. Across fluid dynamics and orbital mechanics, we find that when SSL achieves low error, latent structure becomes linearly accessible. PhyIP recovers internal energy and Newtonian inverse-square scaling on OOD tests (e.g., ρ > 0.90). In contrast, adaptation-based evaluations can collapse this structure (ρ ≈ 0.05). These findings suggest that adaptation-based evaluation can obscure latent structures and that low-capacity probes offer a more accurate evaluation of physical world models.
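As an illustration of the protocol's spirit, here is a minimal linear-probe sketch on synthetic stand-ins for frozen embeddings and a physical target; the data, ridge penalty, and variable names are ours, not PhyIP's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen SSL representations (n x d) and a physical
# target, e.g. internal energy per snapshot; purely synthetic here.
n, d = 500, 64
Z = rng.normal(size=(n, d))            # frozen embeddings (no grads)
w_true = rng.normal(size=d)
energy = Z @ w_true + 0.1 * rng.normal(size=n)

# Low-capacity probe: a single ridge-regularized linear map.
# No fine-tuning touches Z, so the representation stays intact.
lam = 1e-2
W = np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ energy)
pred = Z @ W

rho = np.corrcoef(pred, energy)[0, 1]
print(f"linear decodability (Pearson rho): {rho:.3f}")
```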
[AI-7] SAM3-LiteText: An Anatomical Study of the SAM3 Text Encoder for Efficient Vision-Language Segmentation
【Quick Read】: This paper addresses the over-engineering and wasted capacity of the text encoder in vision-language segmentation models such as SAM3. Because segmentation prompts are typically short, structured, and semantically constrained, inheriting a large general-purpose text encoder designed for open-ended language understanding incurs far more compute and memory than the task requires. The key solution is SAM3-LiteText, which uses knowledge distillation to replace the original text encoder with a lightweight MobileCLIP student, cutting text-encoder parameters by up to 88% and substantially reducing static memory footprint while maintaining segmentation performance.
Link: https://arxiv.org/abs/2602.12173
Authors: Chengxi Zeng, Yuxuan Jiang, Ge Gao, Shuai Wang, Duolikun Danier, Bin Zhu, Stevan Rudinac, David Bull, Fan Zhang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract: Vision-language segmentation models such as SAM3 enable flexible, prompt-driven visual grounding, but inherit large, general-purpose text encoders originally designed for open-ended language understanding. In practice, segmentation prompts are short, structured, and semantically constrained, leading to substantial over-provisioning in text encoder capacity and persistent computational and memory overhead. In this paper, we perform a large-scale anatomical analysis of text prompting in vision-language segmentation, covering 404,796 real prompts across multiple benchmarks. Our analysis reveals severe redundancy: most context windows are underutilized, vocabulary usage is highly sparse, and text embeddings lie on a low-dimensional manifold despite high-dimensional representations. Motivated by these findings, we propose SAM3-LiteText, a lightweight text encoding framework that replaces the original SAM3 text encoder with a compact MobileCLIP student that is optimized by knowledge distillation. Extensive experiments on image and video segmentation benchmarks show that SAM3-LiteText reduces text encoder parameters by up to 88%, substantially reducing static memory footprint, while maintaining segmentation performance comparable to the original model. Code: this https URL.
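A minimal sketch of embedding-level distillation of the kind the abstract describes, assuming a cosine-plus-MSE objective and PyTorch; the loss weighting and tensor shapes are illustrative, not the paper's recipe:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_emb, teacher_emb, alpha=0.5):
    """Embedding-level distillation: pull the student's prompt
    embedding toward the (frozen) teacher's, matching both direction
    (cosine term) and magnitude (MSE term)."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    cos_term = (1.0 - (s * t).sum(dim=-1)).mean()
    mse_term = F.mse_loss(student_emb, teacher_emb)
    return alpha * cos_term + (1.0 - alpha) * mse_term

# Toy check with random "prompt embeddings".
student = torch.randn(8, 512, requires_grad=True)   # compact student
teacher = torch.randn(8, 512)                       # frozen teacher output
loss = distill_loss(student, teacher)
loss.backward()
print(float(loss))
```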
[AI-8] Statistical Parsing for Logical Information Retrieval
【Quick Read】: This paper addresses two limitations of the earlier Quantified Boolean Bayesian Network (QBBN) for logical reasoning: the lack of negation/backward reasoning and the inability to handle natural-language input. The solution spans three layers. For inference, NEG factors enforce probabilistic complementarity (P(x) + P(¬x) = 1), enabling backward reasoning such as modus tollens via backward lambda messages and completing Prawitz's simple elimination rules. For semantics, it introduces a typed logical language with role-labeled predicates, modal quantifiers, and three tiers of expressiveness (first-order quantification, propositions as arguments, and predicate quantification via lambda abstraction). For syntax, it proposes a deterministic typed slot grammar that compiles sentences to logical form without ambiguity (33/33 correct), with an LLM handling preprocessing and reranking while the QBBN performs structured inference. The architecture reconciles formal semantics with Sutton's "bitter lesson": the LLM removes the annotation bottleneck as annotator while the QBBN serves as verifier, yielding an efficient and interpretable reasoning system.
Link: https://arxiv.org/abs/2602.12170
Authors: Greg Coppola
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 23 pages, 6 tables
Abstract: In previous work (Coppola, 2024) we introduced the Quantified Boolean Bayesian Network (QBBN), a logical graphical model that implements the forward fragment of natural deduction (Prawitz, 1965) as a probabilistic factor graph. That work left two gaps: no negation/backward reasoning, and no parser for natural language. This paper addresses both gaps across inference, semantics, and syntax. For inference, we extend the QBBN with NEG factors enforcing P(x) + P(¬x) = 1, enabling contrapositive reasoning (modus tollens) via backward lambda messages, completing Prawitz's simple elimination rules. The engine handles 44/44 test cases spanning 22 reasoning patterns. For semantics, we present a typed logical language with role-labeled predicates, modal quantifiers, and three tiers of expressiveness following Prawitz: first-order quantification, propositions as arguments, and predicate quantification via lambda abstraction. For syntax, we present a typed slot grammar that deterministically compiles sentences to logical form (33/33 correct, zero ambiguity). LLMs handle disambiguation (95% PP attachment accuracy) but cannot produce structured parses directly (12.4% UAS), confirming grammars are necessary. The architecture: LLM preprocesses, grammar parses, LLM reranks, QBBN infers. We argue this reconciles formal semantics with Sutton's "bitter lesson" (2019): LLMs eliminate the annotation bottleneck that killed formal NLP, serving as annotator while the QBBN serves as verifier. Code: this https URL
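The complementarity constraint and modus tollens can be illustrated with a few lines of probability arithmetic; the numbers below are invented and the computation is plain Bayes, not the QBBN's message-passing:

```python
# Modus tollens, probabilistically: if x -> y holds with high
# probability and y is observed to be false, belief in x should drop.

p_x = 0.6                 # prior P(x)
p_y_given_x = 0.95        # strength of the rule x -> y
p_y_given_not_x = 0.40

# NEG-factor style complementarity: P(not x) = 1 - P(x) by construction.
p_not_x = 1.0 - p_x

# Observe not-y; update P(x | not y) by Bayes' rule.
p_not_y = (1 - p_y_given_x) * p_x + (1 - p_y_given_not_x) * p_not_x
p_x_given_not_y = (1 - p_y_given_x) * p_x / p_not_y

print(f"P(x) before: {p_x:.2f}, after observing not-y: {p_x_given_not_y:.3f}")
# Complementarity still holds after the update:
assert abs(p_x_given_not_y + (1 - p_x_given_not_y) - 1.0) < 1e-12
```

Here belief in x drops from 0.60 to about 0.11, the backward (contrapositive) direction that the NEG factors make available to the factor graph.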
[AI-9] Sci-CoE: Co-evolving Scientific Reasoning LLMs via Geometric Consensus with Sparse Supervision
【Quick Read】: This paper addresses the fragility of large language models (LLMs) on scientific reasoning tasks caused by unreliable solution evaluation and limited diversity of verification strategies. The key is Sci-CoE, a two-stage scientific co-evolving framework: it first uses a small amount of annotated data to establish the Verifier's basic correctness-judgment anchors, then introduces a geometric reward mechanism that jointly accounts for consensus, reliability, and diversity to drive large-scale self-iteration on unlabeled data, markedly improving complex reasoning ability and scalability.
Link: https://arxiv.org/abs/2602.12164
Authors: Xiaohan He, Shiyang Feng, Songtao Huang, Lei Bai, Bin Wang, Bo Zhang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) have demonstrated exceptional reasoning capabilities, and co-evolving paradigms have shown promising results in domains such as code and math. However, in scientific reasoning tasks, these models remain fragile due to unreliable solution evaluation and limited diversity in verification strategies. In this work, we propose Sci-CoE, a two-stage scientific co-evolving framework that enables models to self-evolve as both solver and verifier through a transition from sparse supervision to unsupervised learning. In the first stage, the model uses a small set of annotated data to establish fundamental correctness judgment anchors for the Verifier. In the second stage, we introduce a geometric reward mechanism that jointly considers consensus, reliability, and diversity, driving large-scale self-iteration on unlabeled data. Experiments on several general scientific benchmarks demonstrate that Sci-CoE enhances complex reasoning capabilities and exhibits strong scalability, facilitating the construction of more robust and diverse evaluation systems. Codes are available at this https URL.
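The abstract names consensus, reliability, and diversity as the ingredients of the geometric reward without giving the formula; one plausible reading, sketched below under that assumption, is a geometric mean, which rewards a sample only when all three signals are jointly high:

```python
import numpy as np

def geometric_reward(consensus, reliability, diversity, eps=1e-8):
    """One plausible form of a 'geometric' reward over the three
    signals named in the abstract: a geometric mean. The exact
    formula in Sci-CoE may differ; this is purely illustrative."""
    scores = np.clip(np.array([consensus, reliability, diversity]), eps, 1.0)
    return float(np.exp(np.log(scores).mean()))

# A solution with strong consensus but near-zero diversity earns
# little, unlike an arithmetic mean, which would still reward it.
print(geometric_reward(0.9, 0.8, 0.7))   # ~0.80
print(geometric_reward(0.9, 0.8, 0.01))  # ~0.19
```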
[AI-10] 3DGSNav: Enhancing Vision-Language Model Reasoning for Object Navigation via Active 3D Gaussian Splatting
【Quick Read】: This paper addresses a limitation of zero-shot object navigation (ZSON): existing methods abstract the scene into semantic maps or textual representations, so high-level decisions are bounded by the accuracy of low-level perception, making robust spatial reasoning in unknown environments difficult. The key is the 3DGSNav framework, which embeds 3D Gaussian Splatting (3DGS) as persistent memory for vision-language models (VLMs): it incrementally builds a 3DGS representation of the environment and enables trajectory-guided free-viewpoint rendering to strengthen VLM spatial reasoning, combines structured visual prompts with Chain-of-Thought (CoT) prompting to improve reasoning accuracy, and adds a real-time object detector plus VLM-driven active viewpoint switching to keep navigation efficient and recognition reliable.
Link: https://arxiv.org/abs/2602.12159
Authors: Wancai Zheng, Hao Chen, Xianlong Lu, Linlin Ou, Xinyi Yu
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract: Object navigation is a core capability of embodied intelligence, enabling an agent to locate target objects in unknown environments. Recent advances in vision-language models (VLMs) have facilitated zero-shot object navigation (ZSON). However, existing methods often rely on scene abstractions that convert environments into semantic maps or textual representations, causing high-level decision making to be constrained by the accuracy of low-level perception. In this work, we present 3DGSNav, a novel ZSON framework that embeds 3D Gaussian Splatting (3DGS) as persistent memory for VLMs to enhance spatial reasoning. Through active perception, 3DGSNav incrementally constructs a 3DGS representation of the environment, enabling trajectory-guided free-viewpoint rendering of frontier-aware first-person views. Moreover, we design structured visual prompts and integrate them with Chain-of-Thought (CoT) prompting to further improve VLM reasoning. During navigation, a real-time object detector filters potential targets, while VLM-driven active viewpoint switching performs target re-verification, ensuring efficient and reliable recognition. Extensive evaluations across multiple benchmarks and real-world experiments on a quadruped robot demonstrate that our method achieves robust and competitive performance against state-of-the-art methods. Project Page: this https URL
[AI-11] On the Adoption of AI Coding Agents in Open-source Android and iOS Development
【Quick Read】: This paper addresses the lack of empirical evidence on how generative AI coding agents affect open-source (OSS) mobile projects. The key is a category-level empirical study of 2,901 AI-authored pull requests (PRs) from the AIDev dataset, systematically examining how platform (Android vs. iOS), agent, and task category (feature implementation, fixes, UI changes, refactoring, etc.) affect PR acceptance rates and resolution times, providing the first behavioral characterization of AI agents in mobile OSS development and empirical baselines for evaluating their contributions.
Link: https://arxiv.org/abs/2602.12144
Authors: Muhammad Ahmad Khan, Hasnain Ali, Muneeb Rana, Muhammad Saqib Ilyas, Abdul Ali Bangash
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted at MSR 2026 Mining Challenge track
Abstract: AI coding agents are increasingly contributing to software development, yet their impact on mobile development has received little empirical attention. In this paper, we present the first category-level empirical study of agent-generated code in open-source mobile app projects. We analyzed PR acceptance behaviors across mobile platforms, agents, and task categories using 2,901 AI-authored pull requests (PRs) in 193 verified Android and iOS open-source GitHub repositories in the AIDev dataset. We find that Android projects have received 2x more AI-authored PRs and have achieved a higher PR acceptance rate (71%) than iOS (63%), with significant agent-level variation on Android. Across task categories, PRs with routine tasks (feature, fix, and ui) achieve the highest acceptance, while structural changes like refactor and build achieve lower success and longer resolution times. Furthermore, our evolution analysis shows improvement in PR resolution time on Android through mid-2025 before it declined again. Our findings offer the first evidence-based characterization of AI agents' effects on OSS mobile projects and establish empirical baselines for evaluating agent-generated contributions, informing the design of platform-aware agentic systems.
[AI-12] STAR: Bridging Statistical and Agentic Reasoning for Large Model Performance Prediction
【Quick Read】: This paper addresses how to predict model performance accurately from limited observations when comprehensive evaluation of large models becomes prohibitively expensive. Existing statistical methods struggle with pattern shifts, data sparsity, and lack of explanation, while pure LLM approaches remain unreliable. The key is STAR, a framework that bridges data-driven statistical expectations with knowledge-driven agentic reasoning: specialized retrievers gather external knowledge, and semantic features are embedded into Constrained Probabilistic Matrix Factorization (CPMF) to produce statistical expectations with uncertainty; a reasoning module guided by Expectation Violation Theory (EVT) then refines predictions through intra-family analysis, cross-model comparison, and credibility-aware aggregation, yielding accurate, explainable performance predictions with traceable adjustments.
Link: https://arxiv.org/abs/2602.12143
Authors: Xiaoxiao Wang, Chunxiao Li, Junying Wang, Yijin Guo, Zijian Chen, Chunyi Li, Xiaohong Liu, Zicheng Zhang, Guangtao Zhai
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 8 figures, 17 tables. Code available at this https URL
Abstract:As comprehensive large model evaluation becomes prohibitively expensive, predicting model performance from limited observations has become essential. However, existing statistical methods struggle with pattern shifts, data sparsity, and lack of explanation, while pure LLM methods remain unreliable. We propose STAR, a framework that bridges data-driven STatistical expectations with knowledge-driven Agentic Reasoning. STAR leverages specialized retrievers to gather external knowledge and embeds semantic features into Constrained Probabilistic Matrix Factorization (CPMF) to generate statistical expectations with uncertainty. A reasoning module guided by Expectation Violation Theory (EVT) then refines predictions through intra-family analysis, cross-model comparison, and credibility-aware aggregation, producing adjustments with traceable explanations. Extensive experiments show that STAR consistently outperforms all baselines on both score-based and rank-based metrics, delivering a 14.46% gain in total score over the strongest statistical method under extreme sparsity, with only 1–2 observed scores per test model.
[AI-13] HLA: Hadamard Linear Attention
【Quick Read】: This paper addresses the high computational cost of the quadratic attention mechanism in standard Transformers. Existing linear attention lowers the complexity by applying kernel functions to queries and keys separately before computing pairwise similarities, which amounts to only a low-degree rational approximation of softmax. The key innovation of Hadamard Linear Attention (HLA) is to apply the nonlinearity after the pairwise similarities have been computed, as in standard softmax attention, rather than independently to the inputs; this yields a higher-degree rational approximation of softmax while admitting an efficient computation scheme similar to standard linear attention, with no costly tensor reshaping. The approach is validated on a large diffusion Transformer for video generation, an application involving very large token counts.
Link: https://arxiv.org/abs/2602.12128
Authors: Hanno Ackermann, Hong Cai, Mohsen Ghafoorian, Amirhossein Habibian
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract: The attention mechanism is an important reason for the success of transformers. It relies on computing pairwise relations between tokens. To reduce the high computational cost of standard quadratic attention, linear attention has been proposed as an efficient approximation. It employs kernel functions that are applied independently to the inputs before the pairwise similarities are calculated. That allows for an efficient computational procedure which, however, amounts to a low-degree rational function approximating softmax. We propose Hadamard Linear Attention (HLA). Unlike previous works on linear attention, the nonlinearity in HLA is not applied separately to queries and keys, but, analogously to standard softmax attention, after the pairwise similarities have been computed. It will be shown that the proposed nonlinearity amounts to a higher-degree rational function to approximate softmax. An efficient computational scheme for the proposed method is derived that is similar to that of standard linear attention. In contrast to other approaches, no time-consuming tensor reshaping is necessary to apply the proposed algorithm. The effectiveness of the approach is demonstrated by applying it to a large diffusion transformer model for video generation, an application that involves very large amounts of tokens.
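The enabling algebraic fact behind applying a polynomial nonlinearity after the similarities while keeping linear-time structure is that elementwise (Hadamard) powers of QKᵀ factor through row-wise Kronecker-lifted features. The sketch below verifies the degree-2 case numerically; it illustrates the principle, not HLA's actual kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 6, 7, 4
Q = rng.normal(size=(n, d))
K = rng.normal(size=(m, d))

def row_kron(X):
    # Row-wise self-Kronecker product: each row x becomes x (x) x,
    # lifting d features to d^2.
    return np.einsum('ni,nj->nij', X, X).reshape(X.shape[0], -1)

# Identity: (q . k)^2 = (q (x) q) . (k (x) k), so the elementwise
# square of the similarity matrix factors through lifted features,
# i.e. a polynomial applied AFTER the similarities still admits a
# linear-attention-style associative computation.
lhs = (Q @ K.T) ** 2
rhs = row_kron(Q) @ row_kron(K).T
assert np.allclose(lhs, rhs)
print("Hadamard square factorizes:", np.allclose(lhs, rhs))
```

Because the right-hand side is an ordinary product of lifted features, it can be reassociated as in linear attention to avoid ever materializing the n x m similarity matrix.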
[AI-14] Commencing-Student Enrolment Forecasting Under Data Sparsity with Time Series Foundation Models
【Quick Read】: This paper addresses the difficulty of forecasting commencing-student enrolments in higher education under data sparsity, where classical methods suffer from short annual samples, unstable parameter estimation, and structural breaks that degrade extrapolation. The key is to apply Time Series Foundation Models (TSFMs) in a zero-shot setting together with a compact, leakage-safe covariate set, including a transferable Institutional Operating Conditions Index (IOCI) and Google Trends demand proxies with stabilizing feature engineering, so that covariate-conditioned TSFMs deliver accurate forecasts across institutions and match classical benchmarks without institution-specific training.
Link: https://arxiv.org/abs/2602.12120
Authors: Jittarin Jetwiriyanon, Teo Susnjak, Surangika Ranathunga
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 31 pages, 5 figures, 3 tables
Abstract:Many universities face increasing financial pressure and rely on accurate forecasts of commencing enrolments. However, enrolment forecasting in higher education is often data-sparse; annual series are short and affected by reporting changes and regime shifts. Popular classical approaches can be unreliable, as parameter estimation and model selection are unstable with short samples, and structural breaks degrade extrapolation. Recently, TSFMs have provided zero-shot priors, delivering strong gains in annual, data-sparse institutional forecasting under leakage-disciplined covariate construction. We benchmark multiple TSFM families in a zero-shot setting and test a compact, leakage-safe covariate set and introduce the Institutional Operating Conditions Index (IOCI), a transferable 0-100 regime covariate derived from time-stamped documentary evidence available at each forecast origin, alongside Google Trends demand proxies with stabilising feature engineering. Using an expanding-window backtest with strict vintage alignment, covariate-conditioned TSFMs perform on par with classical benchmarks without institution-specific training, with performance differences varying by cohort and model.
[AI-15] KAN-FIF: Spline-Parameterized Lightweight Physics-based Tropical Cyclone Estimation on Meteorological Satellite
【Quick Read】: This paper addresses the computational inefficiency, large parameter counts, and inability to capture high-order feature interactions of existing physics-guided models for tropical cyclone (TC) monitoring on resource-constrained edge devices. The key is the Kolmogorov-Arnold Network-based Feature Interaction Framework (KAN-FIF), a lightweight multimodal architecture that fuses spline-parameterized KAN layers with MLP and CNN layers to model high-order polynomial relations among TC attributes. For Maximum Sustained Wind (MSW) prediction, it cuts parameters by 94.8% and speeds up inference by 68.7% per sample relative to the Phy-CoCo baseline while lowering MAE by 32.5%, making it well suited to lightweight, efficient TC estimation on large multi-robot-free edge deployments such as meteorological satellite processors.
Link: https://arxiv.org/abs/2602.12117
Authors: Jiakang Shen, Qinghui Chen, Runtong Wang, Chenrui Xu, Jinglin Zhang, Cong Bai, Feng Zhang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Tropical cyclones (TC) are among the most destructive natural disasters, causing catastrophic damage to coastal regions through extreme winds, heavy rainfall, and storm surges. Timely monitoring of tropical cyclones is crucial for reducing loss of life and property, yet it is hindered by the computational inefficiency and high parameter counts of existing methods on resource-constrained edge devices. Current physics-guided models suffer from linear feature interactions that fail to capture high-order polynomial relationships between TC attributes, leading to inflated model sizes and hardware incompatibility. To overcome these challenges, this study introduces the Kolmogorov-Arnold Network-based Feature Interaction Framework (KAN-FIF), a lightweight multimodal architecture that integrates MLP and CNN layers with spline-parameterized KAN layers. For Maximum Sustained Wind (MSW) prediction, experiments demonstrate that the KAN-FIF framework achieves a 94.8% reduction in parameters (0.99MB vs 19MB) and 68.7% faster inference per sample (2.3ms vs 7.35ms) compared to baseline model Phy-CoCo, while maintaining superior accuracy with 32.5% lower MAE. The offline deployment experiment of the FY-4 series meteorological satellite processor on the Qingyun-1000 development board achieved a 14.41ms per-sample inference latency with the KAN-FIF framework, demonstrating promising feasibility for operational TC monitoring and extending deployability to edge-device AI applications. The code is released at this https URL.
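A minimal sketch of a KAN-style learnable edge function: a fixed local basis mixed by learnable coefficients. Gaussian bumps stand in for the B-splines used in KAN work, and all sizes are illustrative:

```python
import numpy as np

class SplineEdge:
    """A KAN-style learnable 1-D function: a linear combination of
    fixed local basis functions. Gaussian bumps stand in for the
    B-splines used in KAN papers; only `coef` would be learned."""
    def __init__(self, n_basis=8, lo=-1.0, hi=1.0, rng=None):
        rng = rng or np.random.default_rng(0)
        self.centers = np.linspace(lo, hi, n_basis)
        self.width = (hi - lo) / n_basis
        self.coef = rng.normal(scale=0.1, size=n_basis)

    def __call__(self, x):
        # x: (batch,) -> (batch,); evaluate the basis, then mix linearly.
        phi = np.exp(-((x[:, None] - self.centers) / self.width) ** 2)
        return phi @ self.coef

edge = SplineEdge()
x = np.linspace(-1, 1, 5)
print(edge(x))  # a smooth learnable curve evaluated at 5 points
```

Replacing fixed activations with many such cheap learnable edges is what lets KAN layers capture high-order interactions at a small parameter budget.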
[AI-16] The Pensieve Paradigm: Stateful Language Models Mastering Their Own Context
【Quick Read】: This paper addresses the memory limitation of large language models (LLMs) on long-horizon tasks: constrained by a fixed context window and unable to manage their own internal state, models cannot dynamically adjust what is in context, limiting complex reasoning and long-range dependencies. The key is StateLM, a new class of foundation models endowed with a built-in reasoning loop that actively invokes memory tools (context pruning, document indexing, and note-taking) to manage its own state, breaking free of the fixed window and shifting LLMs from passive predictors to state-aware agents.
Link: https://arxiv.org/abs/2602.12108
Authors: Xiaoyuan Liu, Tian Liang, Dongyang Ma, Deyu Zhou, Haitao Mi, Pinjia He, Yan Wang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract: In the world of Harry Potter, when Dumbledore's mind is overburdened, he extracts memories into a Pensieve to be revisited later. In the world of AI, while we possess the Pensieve (mature databases and retrieval systems), our models inexplicably lack the "wand" to operate it. They remain like a Dumbledore without agency, passively accepting a manually engineered context as their entire memory. This work finally places the wand in the model's hand. We introduce StateLM, a new class of foundation models endowed with an internal reasoning loop to manage their own state. We equip our model with a suite of memory tools, such as context pruning, document indexing, and note-taking, and train it to actively manage these tools. By learning to dynamically engineer its own context, our model breaks free from the architectural prison of a fixed window. Experiments across various model sizes demonstrate StateLM's effectiveness across diverse scenarios. On long-document QA tasks, StateLMs consistently outperform standard LLMs across all model scales; on the chat memory task, they achieve absolute accuracy improvements of 10% to 20% over standard LLMs. On the deep research task BrowseComp-Plus, the performance gap becomes even more pronounced: StateLM achieves up to 52% accuracy, whereas standard LLM counterparts struggle around 5%. Ultimately, our approach shifts LLMs from passive predictors to state-aware agents where reasoning becomes a stateful and manageable process.
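A hypothetical skeleton of such a state-managing loop is sketched below; the tool names (prune, note, recall) and the llm_step interface are our own inventions to illustrate the idea, not StateLM's API:

```python
# Hypothetical skeleton of a state-managing reasoning loop in the
# spirit of StateLM. Tool names and the llm_step interface are
# invented for illustration; the paper's actual design may differ.

def llm_step(context):
    """Stand-in for one model step returning (action, argument)."""
    raise NotImplementedError

def run_stateful(task, max_steps=32):
    context = [task]          # the model's working context
    notes = []                # external note store (the "Pensieve")
    for _ in range(max_steps):
        action, arg = llm_step(context)
        if action == "prune":        # drop stale spans from context
            context = [c for c in context if arg not in c]
        elif action == "note":       # move a fact out of context
            notes.append(arg)
        elif action == "recall":     # pull matching notes back in
            context += [n for n in notes if arg in n]
        elif action == "answer":
            return arg
        else:                        # plain reasoning step
            context.append(arg)
    return None
```

The point of the loop is that context management is itself an action the model chooses, rather than something engineered around it.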
[AI-17] On the Complexity of Offline Reinforcement Learning with Q^\star-Approximation and Partial Coverage
【Quick Read】: This paper studies the sample efficiency of offline reinforcement learning under Q^\star-realizability and Bellman completeness, focusing on the theoretical bottleneck under partial coverage. It first proves, via an information-theoretic lower bound, that Q^\star-realizability plus Bellman completeness is not sufficient for sample-efficient offline learning, answering a long-standing open question in the negative. The core solution is a general decision-estimation decomposition framework built on a model-free Decision-Estimation Coefficient (DEC), which characterizes the intrinsic complexity of a given Q^\star function class and unifies and improves the guarantees of existing approaches (e.g., Chen & Jiang, 2022; Uehara et al., 2023). A new second-order performance difference lemma yields the first ε^{-2} sample complexity for soft Q-learning under partial coverage, improving on the previous ε^{-4}; the framework also removes Chen and Jiang's need for extra online interaction, gives the first characterization of offline learnability for general low-Bellman-rank MDPs without Bellman completeness, and provides the first analysis of Conservative Q-Learning (CQL) beyond the tabular case.
Link: https://arxiv.org/abs/2602.12107
Authors: Haolin Liu, Braham Snyder, Chen-Yu Wei
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:
Abstract: We study offline reinforcement learning under Q^\star-approximation and partial coverage, a setting that motivates practical algorithms such as Conservative Q-Learning (CQL; Kumar et al., 2020) but has received limited theoretical attention. Our work is inspired by the following open question: "Are Q^\star-realizability and Bellman completeness sufficient for sample-efficient offline RL under partial coverage?" We answer in the negative by establishing an information-theoretic lower bound. Going substantially beyond this, we introduce a general framework that characterizes the intrinsic complexity of a given Q^\star function class, inspired by model-free decision-estimation coefficients (DEC) for online RL (Foster et al., 2023b; Liu et al., 2025b). This complexity recovers and improves the quantities underlying the guarantees of Chen and Jiang (2022) and Uehara et al. (2023), and extends to broader settings. Our decision-estimation decomposition can be combined with a wide range of Q^\star estimation procedures, modularizing and generalizing existing approaches. Beyond the general framework, we make further contributions: By developing a novel second-order performance difference lemma, we obtain the first \epsilon^{-2} sample complexity under partial coverage for soft Q-learning, improving the \epsilon^{-4} bound of Uehara et al. (2023). We remove Chen and Jiang's (2022) need for additional online interaction when the value gap of Q^\star is unknown. We also give the first characterization of offline learnability for general low-Bellman-rank MDPs without Bellman completeness (Jiang et al., 2017; Du et al., 2021; Jin et al., 2021), a canonical setting in online RL that remains unexplored in offline RL except for special cases. Finally, we provide the first analysis for CQL under Q^\star-realizability and Bellman completeness beyond the tabular case.
[AI-18] Multi-Graph Search for High-Dimensional Robot Motion Planning
【Quick Read】: This paper addresses efficient motion planning for high-dimensional robotic systems such as manipulators and mobile manipulators, where real-time operation and reliable deployment matter. Existing planners scale to high-dimensional state spaces, but often at the cost of unpredictable, inconsistent motions or heavy compute and memory. The key is Multi-Graph Search (MGS), which generalizes classical unidirectional and bidirectional search to a multi-graph setting: it maintains and incrementally expands multiple implicit graphs over the state space, focuses exploration on high-potential regions, and merges initially disconnected subgraphs through feasible transitions as the search progresses, achieving completeness and bounded suboptimality while substantially improving planning efficiency and consistency.
Link: https://arxiv.org/abs/2602.12096
Authors: Itamar Mishani, Maxim Likhachev
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Submitted for Publication
Abstract:Efficient motion planning for high-dimensional robotic systems, such as manipulators and mobile manipulators, is critical for real-time operation and reliable deployment. Although advances in planning algorithms have enhanced scalability to high-dimensional state spaces, these improvements often come at the cost of generating unpredictable, inconsistent motions or requiring excessive computational resources and memory. In this work, we introduce Multi-Graph Search (MGS), a search-based motion planning algorithm that generalizes classical unidirectional and bidirectional search to a multi-graph setting. MGS maintains and incrementally expands multiple implicit graphs over the state space, focusing exploration on high-potential regions while allowing initially disconnected subgraphs to be merged through feasible transitions as the search progresses. We prove that MGS is complete and bounded-suboptimal, and empirically demonstrate its effectiveness on a range of manipulation and mobile manipulation tasks. Demonstrations, benchmarks and code are available at this https URL.
[AI-19] Differentiable Modal Logic for Multi-Agent Diagnosis Orchestration and Communication
【Quick Read】: This paper addresses the difficulty of debugging semantic failures in multi-agent AI systems as they evolve into autonomous swarms, where reasoning about knowledge, belief, causality, and obligation is required but traditional modal logic demands manual specification of relationship structures that are unknown or dynamic in real systems. The key is Differentiable Modal Logic (DML), implemented via Modal Logical Neural Networks (MLNNs), which learns trust networks, causal chains, and regulatory boundaries from behavioral data alone, forming a unified neurosymbolic debugging framework across four modalities: epistemic (whom to trust), temporal (which events cause failures), deontic (which actions are permitted), and doxastic (how to interpret agent confidence). Key logical structures are modeled as explicit, interpretable parameters, turning logical contradictions into learnable optimization objectives and markedly improving the interpretability and controllability of multi-agent systems.
Link: https://arxiv.org/abs/2602.12083
Authors: Antonin Sulc
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: 29 pages, 8 figures, 8 tables, Tutorial at 3rd International Conference on Neuro-Symbolic Systems (NeuS)
Abstract: As multi-agent AI systems evolve from simple chatbots to autonomous swarms, debugging semantic failures requires reasoning about knowledge, belief, causality, and obligation, precisely what modal logic was designed to formalize. However, traditional modal logic requires manual specification of relationship structures that are unknown or dynamic in real systems. This tutorial demonstrates differentiable modal logic (DML), implemented via Modal Logical Neural Networks (MLNNs), enabling systems to learn trust networks, causal chains, and regulatory boundaries from behavioral data alone. We present a unified neurosymbolic debugging framework through four modalities: epistemic (who to trust), temporal (when events cause failures), deontic (what actions are permitted), and doxastic (how to interpret agent confidence). Each modality is demonstrated on concrete multi-agent scenarios, from discovering deceptive alliances in diplomacy games to detecting LLM hallucinations, with complete implementations showing how logical contradictions become learnable optimization objectives. Key contributions for the neurosymbolic community: (1) interpretable learned structures where trust and causality are explicit parameters, not opaque embeddings; (2) knowledge injection via differentiable axioms that guide learning with sparse data; (3) compositional multi-modal reasoning that combines epistemic, temporal, and deontic constraints; and (4) practical deployment patterns for monitoring, active control and communication of multi-agent systems. All code provided as executable Jupyter notebooks.
[AI-20] ModelWisdom: An Integrated Toolkit for TLA+ Model Visualization, Digest and Repair
【Quick Read】: This paper addresses three practical obstacles to applying TLA+ model checking: interpreting counterexamples, understanding large state-transition graphs, and repairing faulty models. These difficulties stem from the limited explainability of raw model-checker output and the heavy manual effort of tracing violations back to the source specification. The key is ModelWisdom, an interactive environment combining visualization with large language models (LLMs): it visualizes models with colorized violation highlighting, click-through links to TLA+ code, and mapping between violating states and broken properties; manages graph complexity via tree-based structuring and node/edge folding; uses LLMs to summarize and explain subgraphs; and supports error-information extraction and iterative debugging, turning raw model-checker output into an interactive, explainable workflow that significantly eases understanding and debugging of nontrivial TLA+ specifications.
Link: https://arxiv.org/abs/2602.12058
Authors: Zhiyong Chen, Jialun Cao, Chang Xu, Shing-Chi Cheung
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
Comments: Accepted by FM 2026 Research Track (Tool)
Abstract: Model checking in TLA+ provides strong correctness guarantees, yet practitioners continue to face significant challenges in interpreting counterexamples, understanding large state-transition graphs, and repairing faulty models. These difficulties stem from the limited explainability of raw model-checker output and the substantial manual effort required to trace violations back to source specifications. Although the TLA+ Toolbox includes a state diagram viewer, it offers only a static, fully expanded graph without folding, color highlighting, or semantic explanations, which limits its scalability and interpretability. We present ModelWisdom, an interactive environment that uses visualization and large language models to make TLA+ model checking more interpretable and actionable. ModelWisdom offers: (i) Model Visualization, with colorized violation highlighting, click-through links from transitions to TLA+ code, and mapping between violating states and broken properties; (ii) Graph Optimization, including tree-based structuring and node/edge folding to manage large models; (iii) Model Digest, which summarizes and explains subgraphs via large language models (LLMs) and performs preprocessing and partial explanations; and (iv) Model Repair, which extracts error information and supports iterative debugging. Together, these capabilities turn raw model-checker output into an interactive, explainable workflow, improving understanding and reducing debugging effort for nontrivial TLA+ specifications. The ModelWisdom website is available at this https URL. A demonstration video can be found at this https URL.
[AI-21] LawThinker: A Deep Research Legal Agent in Dynamic Environments
【Quick Read】: This paper addresses the lack of mechanisms for verifying intermediate reasoning steps in existing legal reasoning methods, which lets errors such as citations of inapplicable statutes propagate undetected through the reasoning chain. The key is LawThinker, an autonomous legal research agent for dynamic judicial environments that adopts an Explore-Verify-Memorize strategy, with verification enforced as an atomic operation after every knowledge-exploration step: a DeepVerifier module scrutinizes each retrieval result along three dimensions (knowledge accuracy, fact-law relevance, and procedural compliance), while a memory module enables cross-round knowledge reuse on long-horizon tasks, keeping the reasoning process rigorous and traceable.
Link: https://arxiv.org/abs/2602.12056
Authors: Xinyu Yang, Chenlong Deng, Tongyu Wen, Binyu Xie, Zhicheng Dou
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Legal reasoning requires not only correct outcomes but also procedurally compliant reasoning processes. However, existing methods lack mechanisms to verify intermediate reasoning steps, allowing errors such as inapplicable statute citations to propagate undetected through the reasoning chain. To address this, we propose LawThinker, an autonomous legal research agent that adopts an Explore-Verify-Memorize strategy for dynamic judicial environments. The core idea is to enforce verification as an atomic operation after every knowledge exploration step. A DeepVerifier module examines each retrieval result along three dimensions of knowledge accuracy, fact-law relevance, and procedural compliance, with a memory module for cross-round knowledge reuse in long-horizon tasks. Experiments on the dynamic benchmark J1-EVAL show that LawThinker achieves a 24% improvement over direct reasoning and an 11% gain over workflow-based methods, with particularly strong improvements on process-oriented metrics. Evaluations on three static benchmarks further confirm its generalization capability. The code is available at this https URL .
[AI-22] Fourier Transformers for Latent Crystallographic Diffusion and Generative Modeling
【Quick Read】: This paper addresses three challenges for generative AI in crystalline materials design: handling periodic boundary conditions, lattice symmetry constraints, and physical plausibility, while scaling efficiently to large, structurally diverse unit cells. Coordinate-based generation struggles to encode these priors naturally and grows more expensive as atom counts rise. The key is a reciprocal-space generative pipeline that represents the species-resolved unit-cell density through a truncated Fourier transform instead of modeling atomic coordinates directly; this representation is periodicity-native, admits simple algebraic actions of space-group symmetries, and naturally supports variable atomic multiplicities during generation, overcoming a common limitation of particle-based approaches.
Link: https://arxiv.org/abs/2602.12045
Authors: Jed A. Duersch, Elohan Veillon, Astrid Klipfel, Adlane Sayede, Zied Bouraoui
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract: The discovery of new crystalline materials calls for generative models that handle periodic boundary conditions, crystallographic symmetries, and physical constraints, while scaling to large and structurally diverse unit cells. We propose a reciprocal-space generative pipeline that represents crystals through a truncated Fourier transform of the species-resolved unit-cell density, rather than modeling atomic coordinates directly. This representation is periodicity-native, admits simple algebraic actions of space-group symmetries, and naturally supports variable atomic multiplicities during generation, addressing a common limitation of particle-based approaches. Using only nine Fourier basis functions per spatial dimension, our approach reconstructs unit cells containing up to 108 atoms per chemical species. We instantiate this pipeline with a transformer variational autoencoder over complex-valued Fourier coefficients, and a latent diffusion model that generates in the compressed latent space. We evaluate reconstruction and latent diffusion on the LeMaterial benchmark and compare unconditional generation against coordinate-based baselines in the small-cell regime (≤ 16 atoms per unit cell).
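The representation is easy to state concretely: for fractional atom positions x_j, the coefficient at integer reciprocal vector G is ρ_G = Σ_j exp(−2πi G·x_j). The sketch below computes the truncated coefficients and checks lattice-translation invariance; the two-atom motif is a toy example, not from the paper:

```python
import numpy as np

def fourier_coeffs(frac_pos, gmax=4):
    """Truncated Fourier coefficients of a single-species unit-cell
    density: rho_G = sum_j exp(-2*pi*i * G . x_j) for integer G.
    Nine basis functions per dimension corresponds to gmax = 4."""
    g = np.arange(-gmax, gmax + 1)
    G = np.stack(np.meshgrid(g, g, g, indexing='ij'), -1).reshape(-1, 3)
    phase = np.exp(-2j * np.pi * (G @ frac_pos.T))   # (|G|, n_atoms)
    return G, phase.sum(axis=1)

pos = np.array([[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]])  # toy 2-atom motif
G, rho = fourier_coeffs(pos)

# Periodicity is built in: translating every atom by a full lattice
# vector leaves the coefficients unchanged.
_, rho_shifted = fourier_coeffs(pos + 1.0)
assert np.allclose(rho, rho_shifted)
print(rho.shape)  # (9**3,) complex coefficients
```

Because the coefficient count is fixed by the truncation rather than by the number of atoms, the same latent size covers cells of varying multiplicity.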
[AI-23] An Empirical Study of the Imbalance Issue in Software Vulnerability Detection ESORICS
【Quick Read】: This paper identifies sample imbalance (vulnerable code being far scarcer than normal code) as the core reason for the unstable performance of deep learning (DL) in software vulnerability detection, whereby model performance varies markedly across datasets. The conjecture is validated through an empirical study on nine open-source datasets and two state-of-the-art DL models, followed by an assessment of existing imbalance solutions: focal loss is better suited to improving precision, mean false error and class-balanced loss help recall, and random over-sampling favors the F1 score. None of them dominates across all metrics, indicating that current solutions still need targeted improvements informed by external factors, which offers key insights for designing more robust vulnerability detection models.
Link: https://arxiv.org/abs/2602.12038
Authors: Yuejun Guo, Qiang Hu, Qiang Tang, Yves Le Traon
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: This paper was accepted by the 28th European Symposium on Research in Computer Security (ESORICS), 2023
Abstract: Vulnerability detection is crucial to protect software security. Nowadays, deep learning (DL) is the most promising technique to automate this detection task, leveraging its superior ability to extract patterns and representations within extensive code volumes. Despite its promise, DL-based vulnerability detection remains in its early stages, with model performance exhibiting variability across datasets. Drawing insights from other well-explored application areas like computer vision, we conjecture that the imbalance issue (the number of vulnerable code samples is extremely small) is at the core of the phenomenon. To validate this, we conduct a comprehensive empirical study involving nine open-source datasets and two state-of-the-art DL models. The results confirm our conjecture. We also obtain insightful findings on how existing imbalance solutions perform in vulnerability detection. It turns out that these solutions also perform differently across datasets and evaluation metrics. Specifically: 1) Focal loss is more suitable to improve the precision, 2) mean false error and class-balanced loss encourage the recall, and 3) random over-sampling facilitates the F1-measure. However, none of them excels across all metrics. To delve deeper, we explore external influences on these solutions and offer insights for developing new solutions.
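For reference, here is the standard binary focal loss the study evaluates (Lin et al., 2017), in minimal numpy form; the α and γ defaults are the customary ones, not values prescribed by this paper:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss: down-weights easy examples so the rare
    (vulnerable) class dominates the gradient.
    p: predicted probability of class 1; y: labels in {0, 1}."""
    p = np.clip(p, eps, 1 - eps)
    pt = np.where(y == 1, p, 1 - p)          # prob of the true class
    at = np.where(y == 1, alpha, 1 - alpha)  # class weighting
    return float(np.mean(-at * (1 - pt) ** gamma * np.log(pt)))

y = np.array([1, 0, 0, 0, 0, 0])             # heavily imbalanced batch
p = np.array([0.3, 0.1, 0.2, 0.05, 0.1, 0.9])
print(focal_loss(p, y))   # hard positives/negatives dominate the loss
```

The (1 − pt)^γ factor is what biases learning toward hard, minority-class examples, which is consistent with the study's finding that focal loss mainly helps precision.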
[AI-24] InjectRBP: Steering Large Language Model Reasoning Behavior via Pattern Injection
【Quick Read】: This paper addresses the limitation that existing behavior-related prompt adjustments for improving LLM reasoning are largely intuition-driven and lack a systematic analysis of the underlying behavioral patterns. The key is to start from behavioral patterns: models exhibit adaptive distributions of reasoning behaviors when responding to specific types of questions, and structurally injecting these patterns can substantially improve both the reasoning process and its outcomes. Two optimization methods requiring no parameter updates are proposed: InjectCorrect guides the model by imitating behavioral patterns drawn from its own past correct answers, while InjectRLOpt learns a value function from historical pattern data and uses the proposed Reliability-Aware Softmax Policy to generate behavioral injections at inference time, steering the reasoning trajectory. Experiments show gains of up to 5.34% and 8.67%, respectively, across multiple reasoning tasks.
Link: https://arxiv.org/abs/2602.12013
Authors: Xiuping Wu, Zhao Yu, Yuxin Cheng, Ngai Wong, Liangjun Ke, Tapas Mishra, Konstantinos V. Katsikopoulos
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Reasoning can significantly enhance the performance of Large Language Models. While recent studies have exploited behavior-related prompts adjustment to enhance reasoning, these designs remain largely intuitive and lack a systematic analysis of the underlying behavioral patterns. Motivated by this, we investigate how models’ reasoning behaviors shape reasoning from the perspective of behavioral patterns. We observe that models exhibit adaptive distributions of reasoning behaviors when responding to specific types of questions, and that structurally injecting these patterns can substantially influence the quality of the models’ reasoning processes and outcomes. Building on these findings, we propose two optimization methods that require no parameter updates: InjectCorrect and InjectRLOpt. InjectCorrect guides the model by imitating behavioral patterns derived from its own past correct answers. InjectRLOpt learns a value function from historical behavior-pattern data and, via our proposed Reliability-Aware Softmax Policy, generates behavioral injectant during inference to steer the reasoning process. Our experiments demonstrate that both methods can improve model performance across various reasoning tasks without requiring any modifications to model parameters, achieving gains of up to 5.34% and 8.67%, respectively.
[AI-25] On the Sensitivity of Firing Rate-Based Federated Spiking Neural Networks to Differential Privacy ICASSP
【Quick Read】: This paper addresses how the differential privacy (DP) mechanisms required for real-world Federated Neuromorphic Learning (FNL) perturb training signals, and in particular how those perturbations affect firing-rate-based coordination. The key is a systematic analysis of how the two DP mechanisms, gradient clipping and noise injection, alter firing-rate statistics in spiking neural networks (SNNs) and how the perturbations propagate to rate-based federated coordination. The study finds that varying privacy budgets and clipping bounds causes systematic rate shifts, attenuated aggregation signals, and ranking instability in client selection; it further quantifies these effects via sparsity and memory indicators, yielding actionable guidance for balancing privacy strength against rate-dependent coordination in privacy-preserving FNL.
Link: https://arxiv.org/abs/2602.12009
Authors: Luiz Pereira, Mirko Perkusich, Dalton Valadares, Kyller Gorgônio
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: To be published in 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Abstract:Federated Neuromorphic Learning (FNL) enables energy-efficient and privacy-preserving learning on devices without centralizing data. However, real-world deployments require additional privacy mechanisms that can significantly alter training signals. This paper analyzes how Differential Privacy (DP) mechanisms, specifically gradient clipping and noise injection, perturb firing-rate statistics in Spiking Neural Networks (SNNs) and how these perturbations are propagated to rate-based FNL coordination. On a speech recognition task under non-IID settings, ablations across privacy budgets and clipping bounds reveal systematic rate shifts, attenuated aggregation, and ranking instability during client selection. Moreover, we relate these shifts to sparsity and memory indicators. Our findings provide actionable guidance for privacy-preserving FNL, specifically regarding the balance between privacy strength and rate-dependent coordination.
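A toy sketch of the two DP mechanisms under study and the kind of firing-rate shift they can induce; the threshold "neuron", scales, and data are synthetic stand-ins rather than the paper's SNN setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_privatize(grad, clip=1.0, sigma=1.0):
    """DP-SGD-style transform: clip the per-client gradient to L2
    norm `clip`, then add Gaussian noise scaled by sigma * clip."""
    norm = np.linalg.norm(grad)
    grad = grad * min(1.0, clip / (norm + 1e-12))
    return grad + rng.normal(scale=sigma * clip, size=grad.shape)

# Illustration of the paper's concern: a DP-perturbed update shifts
# the firing rate of a simple threshold unit.
w = rng.normal(size=100)
x = rng.normal(size=(1000, 100))
rate_clean = np.mean(x @ w > 0)

w_dp = w - 0.1 * dp_privatize(rng.normal(size=100), clip=0.5, sigma=2.0)
rate_dp = np.mean(x @ w_dp > 0)
print(f"firing rate before/after DP step: {rate_clean:.3f} vs {rate_dp:.3f}")
```

Any coordination rule that consumes such rates (e.g., for client selection or aggregation) inherits this shift, which is the propagation effect the paper studies.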
[AI-26] CSEval: A Framework for Evaluating Clinical Semantics in Text-to-Image Generation
【Quick Read】: This paper addresses the limitation that existing evaluations of text-to-image generation in healthcare focus on image realism or diversity while ignoring whether generated images accurately reflect the intended clinical semantics, such as anatomical location and pathology. The key is the Clinical Semantics Evaluator (CSEval), a framework that uses language models to assess clinical semantic alignment between generated images and their conditioning prompts; it identifies semantic inconsistencies overlooked by other metrics and correlates well with expert judgment, offering a scalable and clinically meaningful complement to existing evaluation methods for the safe adoption of generative models in healthcare.
Link: https://arxiv.org/abs/2602.12004
Authors: Robert Cronshaw, Konstantinos Vilouras, Junyu Yan, Yuning Du, Feng Chen, Steven McDonagh, Sotirios A. Tsaftaris
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Text-to-image generation has been increasingly applied in medical domains for various purposes such as data augmentation and education. Evaluating the quality and clinical reliability of these generated images is essential. However, existing methods mainly assess image realism or diversity, while failing to capture whether the generated images reflect the intended clinical semantics, such as anatomical location and pathology. In this study, we propose the Clinical Semantics Evaluator (CSEval), a framework that leverages language models to assess clinical semantic alignment between the generated images and their conditioning prompts. Our experiments show that CSEval identifies semantic inconsistencies overlooked by other metrics and correlates with expert judgment. CSEval provides a scalable and clinically meaningful complement to existing evaluation methods, supporting the safe adoption of generative models in healthcare.
[AI-27] Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?
【Quick Read】: This paper examines whether the now-widespread practice of tailoring coding agents to repositories with context files actually improves task completion. It finds that both LLM-generated and developer-written context files typically reduce task success rates while increasing inference cost by over 20%; the root cause is that context files introduce unnecessary constraints and redundant information that interfere with the agent's task execution. The key takeaway is that human-written context files should contain only minimal necessary requirements and avoid over-specification, reducing interference with coding agents and improving efficiency.
Link: https://arxiv.org/abs/2602.11988
Authors: Thibaud Gloaguen, Niels Mündler, Mark Müller, Veselin Raychev, Martin Vechev
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract: A widespread practice in software development is to tailor coding agents to repositories using context files, such as AGENTS.md, generated either manually or automatically. Although this practice is strongly encouraged by agent developers, there is currently no rigorous investigation into whether such context files are actually effective for real-world tasks. In this work, we study this question and evaluate coding agents' task completion performance in two complementary settings: established SWE-bench tasks from popular repositories, with LLM-generated context files following agent-developer recommendations, and a novel collection of issues from repositories containing developer-committed context files. Across multiple coding agents and LLMs, we find that context files tend to reduce task success rates compared to providing no repository context, while also increasing inference cost by over 20%. Behaviorally, both LLM-generated and developer-provided context files encourage broader exploration (e.g., more thorough testing and file traversal), and coding agents tend to respect their instructions. Ultimately, we conclude that unnecessary requirements from context files make tasks harder, and human-written context files should describe only minimal requirements.
[AI-28] Accelerating Robotic Reinforcement Learning with Agent Guidance
【Quick Read】: This paper addresses the severe sample inefficiency of reinforcement learning (RL) in real-world robotics, and in particular the scalability bottleneck of existing human-in-the-loop (HIL) methods that depend on human supervision. The key innovation of the Agent-guided Policy Search (AGPS) framework is to replace the human supervisor with a multimodal agent that acts as a semantic world model, guiding exploration with precise corrective waypoints and spatial constraints; this injects intrinsic value priors and structures physical exploration, markedly improving sample efficiency and enabling labor-free, scalable robot learning.
Link: https://arxiv.org/abs/2602.11978
Authors: Haojun Chen, Zili Zou, Chengdong Ma, Yaoxiang Pu, Haotong Zhang, Yuanpei Chen, Yaodong Yang
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Reinforcement Learning (RL) offers a powerful paradigm for autonomous robots to master generalist manipulation skills through trial-and-error. However, its real-world application is stifled by severe sample inefficiency. Recent Human-in-the-Loop (HIL) methods accelerate training by using human corrections, yet this approach faces a scalability barrier. Reliance on human supervisors imposes a 1:1 supervision ratio that limits fleet expansion, suffers from operator fatigue over extended sessions, and introduces high variance due to inconsistent human proficiency. We present Agent-guided Policy Search (AGPS), a framework that automates the training pipeline by replacing human supervisors with a multimodal agent. Our key insight is that the agent can be viewed as a semantic world model, injecting intrinsic value priors to structure physical exploration. By using executable tools, the agent provides precise guidance via corrective waypoints and spatial constraints for exploration pruning. We validate our approach on two tasks, ranging from precision insertion to deformable object manipulation. Results demonstrate that AGPS outperforms HIL methods in sample efficiency. This automates the supervision pipeline, unlocking the path to labor-free and scalable robot learning. Project website: this https URL.
[AI-29] Manifold-Aware Temporal Domain Generalization for Large Language Models
【Quick Read】: This paper addresses performance degradation of large language models (LLMs) under temporal distribution shifts, where deployed data evolves continuously over time. Existing temporal domain generalization (TDG) methods model adaptation in the full parameter space, which is computationally infeasible for modern LLMs. The key is a geometric reformulation under parameter-efficient fine-tuning: the paper shows that the low-dimensional temporal structure of model evolution is preserved under parameter-efficient reparameterization, so temporal evolution need not be modeled in the high-dimensional ambient parameter space. Building on this, Manifold-aware Temporal LoRA (MaT-LoRA) constrains temporal updates to a low-dimensional manifold within a shared low-rank adaptation subspace and models its evolution through a structured temporal core, drastically reducing temporal-modeling complexity while retaining strong expressive power, enabling efficient, scalable temporal domain generalization for LLMs.
Link: https://arxiv.org/abs/2602.11965
Authors: Yiheng Yao, Zekun Cai, Xinyuan Song, Hiroki Hill Kobayashi, Xuan Song, Ryosuke Shibasaki, Liang Zhao
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 2 figures
Abstract:Temporal distribution shifts are pervasive in real-world deployments of Large Language Models (LLMs), where data evolves continuously over time. While Temporal Domain Generalization (TDG) seeks to model such structured evolution, existing approaches characterize model adaptation in the full parameter space. This formulation becomes computationally infeasible for modern LLMs. This paper introduces a geometric reformulation of TDG under parameter-efficient fine-tuning. We establish that the low-dimensional temporal structure underlying model evolution can be preserved under parameter-efficient reparameterization, enabling temporal modeling without operating in the ambient parameter space. Building on this principle, we propose Manifold-aware Temporal LoRA (MaT-LoRA), which constrains temporal updates to a shared low-dimensional manifold within a low-rank adaptation subspace, and models its evolution through a structured temporal core. This reparameterization dramatically reduces temporal modeling complexity while retaining expressive power. Extensive experiments on synthetic and real-world datasets, including scientific documents, news publishers, and review ratings, demonstrate that MaT-LoRA achieves superior temporal generalization performance with practical scalability for LLMs.
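A minimal sketch of the shared-subspace idea as the abstract describes it: one low-rank pair (A, B) shared across time, with only a small r × r temporal core evolving. The linear time-parameterization of the core is our simplification, not necessarily MaT-LoRA's:

```python
import torch

class TemporalLoRA(torch.nn.Module):
    """Sketch of a temporally-evolving low-rank update
    W(t) = W0 + B @ C(t) @ A, where A and B span a shared subspace
    and only the r x r core C(t) changes with time."""
    def __init__(self, d_in, d_out, r=8):
        super().__init__()
        self.A = torch.nn.Parameter(torch.randn(r, d_in) * 0.02)
        self.B = torch.nn.Parameter(torch.zeros(d_out, r))
        self.C0 = torch.nn.Parameter(torch.eye(r))
        self.C1 = torch.nn.Parameter(torch.zeros(r, r))  # drift term

    def delta_w(self, t):
        # Temporal evolution lives entirely in the r x r core, so the
        # time model has O(r^2) parameters instead of O(d_in * d_out).
        C_t = self.C0 + t * self.C1
        return self.B @ C_t @ self.A          # (d_out, d_in), rank <= r

lora = TemporalLoRA(d_in=768, d_out=768, r=8)
print(lora.delta_w(t=0.0).shape, lora.delta_w(t=1.0).shape)
```

With r = 8 and d = 768 the temporal model is 64 parameters per drift term, which is the kind of reduction that makes TDG tractable at LLM scale.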
[AI-30] Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments ICLR2026
【速读】:该论文旨在解决当前大语言模型(Large Language Model, LLM)代理在现实世界中部署时面临的评估瓶颈问题,即现有基准测试多基于静态或同步环境,无法真实反映代理在异步、动态、噪声干扰等复杂场景下的综合能力。为应对这一挑战,作者提出了Gaia2基准测试平台,其关键创新在于构建了与代理动作解耦的异步环境,使代理需在时间约束下适应动态事件、处理模糊性并与其他代理协作;同时引入写操作验证器(write-action verifier),实现细粒度的动作级评估,从而支持基于可验证奖励的强化学习训练。此设计显著提升了评估的真实性与实用性,推动了从模拟到现实(sim2real)的落地进展。
链接: https://arxiv.org/abs/2602.11964
作者: Romain Froger,Pierre Andrews,Matteo Bettini,Amar Budhiraja,Ricardo Silveira Cabral,Virginie Do,Emilien Garreau,Jean-Baptiste Gaya,Hugo Laurençon,Maxime Lecanu,Kunal Malkan,Dheeraj Mekala,Pierre Ménard,Gerard Moreno-Torres Bertran,Ulyana Piterbarg,Mikhail Plekhanov,Mathieu Rita,Andrey Rusakov,Vladislav Vorotilov,Mengjue Wang,Ian Yu,Amine Benhalloum,Grégoire Mialon,Thomas Scialom
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted as Oral at ICLR 2026
Abstract:We introduce Gaia2, a benchmark for evaluating large language model agents in realistic, asynchronous environments. Unlike prior static or synchronous evaluations, Gaia2 introduces scenarios where environments evolve independently of agent actions, requiring agents to operate under temporal constraints, adapt to noisy and dynamic events, resolve ambiguity, and collaborate with other agents. Each scenario is paired with a write-action verifier, enabling fine-grained, action-level evaluation and making Gaia2 directly usable for reinforcement learning from verifiable rewards. Our evaluation of state-of-the-art proprietary and open-source models shows that no model dominates across capabilities: GPT-5 (high) reaches the strongest overall score of 42% pass@1 but fails on time-sensitive tasks, Claude-4 Sonnet trades accuracy and speed for cost, Kimi-K2 leads among open-source models with 21% pass@1. These results highlight fundamental trade-offs between reasoning, efficiency, robustness, and expose challenges in closing the “sim2real” gap. Gaia2 is built on a consumer environment with the open-source Agents Research Environments platform and designed to be easy to extend. By releasing Gaia2 alongside the foundational ARE framework, we aim to provide the community with a flexible infrastructure for developing, benchmarking, and training the next generation of practical agent systems.
[AI-31] Towards Performance-Enhanced Model-Contrastive Federated Learning using Historical Information in Heterogeneous Scenarios
【Quick Read】: This paper addresses performance degradation of federated learning (FL) in heterogeneous scenarios, specifically the unstable model updates and biased global objective caused by differing data distributions and participation frequencies across nodes. The key is PMFL, a performance-enhanced model-contrastive federated learning framework using historical training information: on the node side, a novel model-contrastive term incorporates historical local models to capture stable contrastive points, improving the consistency of updates under heterogeneous data distributions; on the server side, each node's cumulative participation count adaptively adjusts its aggregation weight to correct the bias in the global objective caused by uneven participation, and historical global models are fused into the update to reduce performance fluctuations between adjacent rounds.
Link: https://arxiv.org/abs/2602.11945
Authors: Hongliang Zhang, Jiguo Yu, Guijuan Wang, Wenshuo Ma, Tianqing He, Baobao Chai, Chunqiang Hu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract: Federated Learning (FL) enables multiple nodes to collaboratively train a model without sharing raw data. However, FL systems are usually deployed in heterogeneous scenarios, where nodes differ in both data distributions and participation frequencies, which undermines the FL performance. To tackle the above issue, this paper proposes PMFL, a performance-enhanced model-contrastive federated learning framework using historical training information. Specifically, on the node side, we design a novel model-contrastive term into the node optimization objective by incorporating historical local models to capture stable contrastive points, thereby improving the consistency of model updates in heterogeneous data distributions. On the server side, we utilize the cumulative participation count of each node to adaptively adjust its aggregation weight, thereby correcting the bias in the global objective caused by different node participation frequencies. Furthermore, the updated global model incorporates historical global models to reduce its fluctuations in performance between adjacent rounds. Extensive experiments demonstrate that PMFL achieves superior performance compared with existing FL methods in heterogeneous scenarios.
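A server-side sketch of the two aggregation ideas, under assumed forms: inverse-participation weighting (our guess at "adaptive", PMFL's exact rule may differ) and fusion with historical global models:

```python
import numpy as np

def aggregate(updates, counts):
    """Down-weight nodes that have participated often so that
    infrequent nodes are not drowned out.
    updates: list of flattened model vectors; counts: participation."""
    w = 1.0 / (1.0 + np.asarray(counts, dtype=float))
    w /= w.sum()
    return sum(wi * ui for wi, ui in zip(w, updates))

def smoothed_global(new_global, history, beta=0.3):
    """Fuse the fresh aggregate with an average of historical global
    models to damp round-to-round fluctuation."""
    return (1 - beta) * new_global + beta * np.mean(history, axis=0)

updates = [np.ones(4), 3 * np.ones(4)]
print(aggregate(updates, counts=[9, 1]))   # the rare client weighs more
```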
[AI-32] MEME: Modeling the Evolutionary Modes of Financial Markets
【Quick Read】: This paper addresses a limitation of current LLM-based quantitative finance methods, which follow either an asset-centric or a market-centric paradigm and ignore the underlying logic that drives market movements. The key is a logic-oriented perspective that models the financial market as a dynamic, evolving ecosystem of competing investment narratives (Modes of Thought), operationalized by the MEME framework (Modeling the Evolutionary Modes of Financial Markets): a multi-agent extraction module turns noisy data into high-fidelity Investment Arguments; Gaussian Mixture Modeling uncovers latent consensus in a semantic space; and a temporal evaluation and alignment mechanism tracks the lifecycle and historical profitability of modes across market conditions, ensuring that portfolio construction is guided by enduring market wisdom rather than transient anomalies.
Link: https://arxiv.org/abs/2602.11918
Authors: Taian Guo, Haiyang Shen, Junyu Luo, Zhongshi Xing, Hanchun Lian, Jinsheng Huang, Binqi Chen, Luchen Liu, Yun Ma, Ming Zhang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:LLMs have demonstrated significant potential in quantitative finance by processing vast unstructured data to emulate human-like analytical workflows. However, current LLM-based methods primarily follow either an Asset-Centric paradigm focused on individual stock prediction or a Market-Centric approach for portfolio allocation, often remaining agnostic to the underlying reasoning that drives market movements. In this paper, we propose a Logic-Oriented perspective, modeling the financial market as a dynamic, evolutionary ecosystem of competing investment narratives, termed Modes of Thought. To operationalize this view, we introduce MEME (Modeling the Evolutionary Modes of Financial Markets), designed to reconstruct market dynamics through the lens of evolving logics. MEME employs a multi-agent extraction module to transform noisy data into high-fidelity Investment Arguments and utilizes Gaussian Mixture Modeling to uncover latent consensus within a semantic space. To model semantic drift among different market conditions, we also implement a temporal evaluation and alignment mechanism to track the lifecycle and historical profitability of these modes. By prioritizing enduring market wisdom over transient anomalies, MEME ensures that portfolio construction is guided by robust reasoning. Extensive experiments on three heterogeneous Chinese stock pools from 2023 to 2025 demonstrate that MEME consistently outperforms seven SOTA baselines. Further ablation studies, sensitivity analysis, lifecycle case study and cost analysis validate MEME’s capacity to identify and adapt to the evolving consensus of financial markets. Our implementation can be found at this https URL.
[AI-33] AlphaPROBE: Alpha Mining via Principled Retrieval and On-graph biased evolution
【Quick Read】: This paper addresses the inefficiency and limited diversity of alpha factor mining for signal extraction in quantitative finance. Existing automated methods follow either Decoupled Factor Generation or Iterative Factor Evolution, and neither models the global structure of the factor pool, leading to redundant search and constrained novelty. The key to the solution is the AlphaPROBE framework, which reframes the factor pool as a Directed Acyclic Graph (DAG) with factors as nodes and evolutionary links as edges, forming a dynamic, interconnected ecosystem. Its core components are a Bayesian Factor Retriever that balances exploitation and exploration via a posterior probability model, and a DAG-aware Factor Generator that exploits the full ancestral trace of factors for context-aware refinement, enabling non-redundant, high-potential factor evolution.
Link: https://arxiv.org/abs/2602.11917
Authors: Taian Guo,Haiyang Shen,Junyu Luo,Binqi Chen,Hongjun Ding,Jinsheng Huang,Luchen Liu,Yun Ma,Ming Zhang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Extracting signals through alpha factor mining is a fundamental challenge in quantitative finance. Existing automated methods primarily follow two paradigms: Decoupled Factor Generation, which treats factor discovery as isolated events, and Iterative Factor Evolution, which focuses on local parent-child refinements. However, both paradigms lack a global structural view, often treating factor pools as unstructured collections or fragmented chains, which leads to redundant search and limited diversity. To address these limitations, we introduce AlphaPROBE (Alpha Mining via Principled Retrieval and On-graph Biased Evolution), a framework that reframes alpha mining as the strategic navigation of a Directed Acyclic Graph (DAG). By modeling factors as nodes and evolutionary links as edges, AlphaPROBE treats the factor pool as a dynamic, interconnected ecosystem. The framework consists of two core components: a Bayesian Factor Retriever that identifies high-potential seeds by balancing exploitation and exploration through a posterior probability model, and a DAG-aware Factor Generator that leverages the full ancestral trace of factors to produce context-aware, nonredundant optimizations. Extensive experiments on three major Chinese stock market datasets against 8 competitive baselines demonstrate that AlphaPROBE significantly gains enhanced performance in predictive accuracy, return stability and training efficiency. Our results confirm that leveraging global evolutionary topology is essential for efficient and robust automated alpha discovery. We have open-sourced our implementation at this https URL.
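As a rough illustration of a "posterior probability model balancing exploitation and exploration" over factor nodes, the sketch below uses Thompson sampling with Beta posteriors, which is one plausible instantiation and not the paper's retriever; all class and variable names are invented.

```python
import random

class FactorNode:
    """A node in the factor DAG with a Beta posterior over 'expanding this
    seed yields an improved child factor'."""
    def __init__(self, name):
        self.name = name
        self.successes = 1  # Beta(1, 1) uninformative prior
        self.failures = 1

    def sample_potential(self):
        return random.betavariate(self.successes, self.failures)

def retrieve_seed(nodes):
    # Thompson sampling: draw from each posterior, pick the best draw.
    return max(nodes, key=lambda n: n.sample_potential())

def record_outcome(node, improved):
    if improved:
        node.successes += 1
    else:
        node.failures += 1

pool = [FactorNode(f"alpha_{i}") for i in range(10)]
seed = retrieve_seed(pool)           # seed handed to the DAG-aware generator
record_outcome(seed, improved=True)  # posterior update after backtesting
```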
[AI-34] Leveraging LLMs to support co-evolution between definitions and instances of textual DSLs: A Systematic Evaluation
【Quick Read】: This paper addresses the problem of keeping textual DSL instances synchronized with their grammars as the grammars evolve, with particular attention to preserving human-relevant meta-information such as layout and comments while maintaining semantic correctness. Traditional model-driven engineering techniques handle metamodel changes but are not designed for textual DSLs and tend to lose such human-readable information. The key contribution is a systematic evaluation of large language models (LLMs) for grammar-instance co-evolution: two LLMs (Claude Sonnet 4.5 and GPT-5.2) are assessed across multiple DSL cases, quantifying their effectiveness in grammatical correctness and in preserving human-relevant information, and identifying the factors that affect performance, thereby clarifying where LLM-based co-evolution is applicable and where its limits lie.
Link: https://arxiv.org/abs/2602.11904
Authors: Weixing Zhang,Bowen Jiang,Yuhong Fu,Anne Koziolek,Regina Hebig,Daniel Strüber
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Software languages evolve over time for reasons such as feature additions. When grammars evolve, textual instances that originally conformed to them may become outdated. While model-driven engineering provides many techniques for co-evolving models with metamodel changes, these approaches are not designed for textual DSLs and may lose human-relevant information such as layout and comments. This study systematically evaluates the potential of large language models (LLMs) for co-evolving grammars and instances of textual DSLs. Using Claude Sonnet 4.5 and GPT-5.2 across ten case languages with ten runs each, we assess both correctness and preservation of human-oriented information. Results show strong performance on small-scale cases (≥94% precision and recall for instances requiring fewer than 20 modified lines), but performance degrades with scale: Claude maintains 85% recall at 40 lines, while GPT fails on the largest instances. Response time increases substantially with instance size, and grammar evolution complexity and deletion granularity affect performance more than change type. These findings clarify when LLM-based co-evolution is effective and where current limitations remain.
[AI-35] Mitigating Mismatch within Reference-based Preference Optimization ICLR2026
【Quick Read】: This paper addresses the premature satisfaction problem that Direct Preference Optimization (DPO) suffers from in offline preference alignment due to its reliance on a reference policy. On pessimistic pairs, where the reference model prefers the rejected response, DPO prematurely attenuates gradient updates even while the policy is still wrong (Δθ < 0), creating a training-inference mismatch. The key to the solution is Hybrid-DPO (HyPO), which handles the reference signal conditionally: it keeps the original DPO form when the reference is optimistic or neutral, and in the pessimistic case replaces the reference term with max{0, Δref}. Without changing DPO's objective form or computational cost, this strengthens the per-example learning signal on pessimistic pairs, effectively mitigating premature satisfaction and improving inference-time performance.
Link: https://arxiv.org/abs/2602.11902
Authors: Suqin Yuan,Xingrui Yu,Jiyang Zheng,Lei Feng,Dadong Wang,Ivor Tsang,Tongliang Liu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by ICLR 2026
Abstract:Direct Preference Optimization (DPO) has become the de facto standard for offline preference alignment of large language models, but its reliance on a reference policy introduces a critical tension. DPO weighs each update relative to a reference, which stabilizes the training by regularizing the updates within a trusted region. This reliance becomes problematic for pessimistic pairs, where the reference model prefers the rejected response. For these pairs, DPO prematurely attenuates the gradient as soon as the policy margin \Delta_\theta merely beats the reference margin \Delta_\mathrm{ref} even if the policy is still wrong ( \Delta_\theta < 0 ). We name this failure premature satisfaction, which is a concrete form of the training-inference mismatch. Reference-free objectives remove this mismatch by optimizing the absolute margin, but at the cost of discarding the stabilizing signal of the reference. We mitigate this tension with Hybrid-DPO (HyPO), a drop-in modification to DPO that applies reference conditionally: HyPO behaves exactly like DPO when the reference is optimistic or neutral, and it treats the reference as neutral when it is pessimistic by replacing \Delta_\theta - \Delta_\mathrm{ref} with \Delta_\theta - \max\{0, \Delta_\mathrm{ref}\} . This one-line change strictly strengthens per-example learning signals on pessimistic pairs while preserving DPO's objective form and computational cost. By conditionally debiasing the pessimistic reference signal, HyPO mitigates premature satisfaction; empirically, across preference alignment, HyPO improves inference-aligned metrics and achieves higher pairwise win rates. Our results provide evidence that direct preference alignment could be enhanced by conditionally debiasing the reference signal, rather than discarding it.
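The abstract states the modification exactly: replace \Delta_\theta - \Delta_\mathrm{ref} with \Delta_\theta - \max\{0, \Delta_\mathrm{ref}\}. Below is a minimal PyTorch rendering of that one-line change on top of a standard DPO loss; the function signature and the assumption that per-response log-probabilities are pre-summed are ours, not the authors'.

```python
import torch
import torch.nn.functional as F

def hypo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """pi_*/ref_* are summed log-probs of the chosen/rejected responses
    under the policy and the frozen reference model."""
    delta_theta = pi_chosen - pi_rejected   # policy margin
    delta_ref = ref_chosen - ref_rejected   # reference margin
    # DPO would use (delta_theta - delta_ref); HyPO treats a pessimistic
    # reference (delta_ref < 0) as neutral by clamping it at zero.
    logits = beta * (delta_theta - torch.clamp(delta_ref, min=0.0))
    return -F.logsigmoid(logits).mean()
```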
[AI-36] Agentic AI for Cybersecurity: A Meta-Cognitive Architecture for Governable Autonomy
【Quick Read】: This paper addresses the difficulty that current model-centric cybersecurity systems have in supporting accountable decision-making under adversarial uncertainty: although they excel at bounded tasks (optimizing accuracy and response latency), they offer little support for justifying actions or aligning them with regulatory and organizational constraints. The key to the solution is to reconceptualize cybersecurity orchestration as an agentic system with a multi-agent cognitive structure, in which an explicit meta-cognitive judgement function coordinates heterogeneous AI agents for detection, hypothesis formation, contextual interpretation, explanation, and governance, and dynamically calibrates system autonomy, particularly when evidence is incomplete, conflicting, or operationally risky, so that decisions remain explainable and controllable. This turns modern security operations centers from implicitly distributed cognitive systems into explicitly governable ones, shifting the role of AI in cybersecurity from optimizing isolated predictions to governing autonomy under uncertainty.
Link: https://arxiv.org/abs/2602.11897
Authors: Andrei Kojukhov,Arkady Bovshover
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Contemporary AI-driven cybersecurity systems are predominantly architected as model-centric detection and automation pipelines optimized for task-level performance metrics such as accuracy and response latency. While effective for bounded classification tasks, these architectures struggle to support accountable decision-making under adversarial uncertainty, where actions must be justified, governed, and aligned with organizational and regulatory constraints. This paper argues that cybersecurity orchestration should be reconceptualized as an agentic, multi-agent cognitive system, rather than a linear sequence of detection and response components. We introduce a conceptual architectural framework in which heterogeneous AI agents responsible for detection, hypothesis formation, contextual interpretation, explanation, and governance are coordinated through an explicit meta-cognitive judgement function. This function governs decision readiness and dynamically calibrates system autonomy when evidence is incomplete, conflicting, or operationally risky. By synthesizing distributed cognition theory, multi-agent systems research, and responsible AI governance frameworks, we demonstrate that modern security operations already function as distributed cognitive systems, albeit without an explicit organizing principle. Our contribution is to make this cognitive structure architecturally explicit and governable by embedding meta-cognitive judgement as a first-class system function. We discuss implications for security operations centers, accountable autonomy, and the design of next-generation AI-enabled cyber defence architectures. The proposed framework shifts the focus of AI in cybersecurity from optimizing isolated predictions to governing autonomy under uncertainty.
[AI-37] From Atoms to Trees: Building a Structured Feature Forest with Hierarchical Sparse Autoencoders
【Quick Read】: This paper addresses the limitation that sparse autoencoders (SAEs), while effective at extracting monosemantic features from large language models (LLMs), identify features only in isolation and fail to capture the intrinsic hierarchical structure of natural language. The central challenge is modeling parent-child relationships among features to reveal the multi-scale conceptual hierarchy in LLM representations. The key to the solution is the proposed Hierarchical Sparse Autoencoder (HSAE), which jointly learns a series of SAEs together with the parent-child relationships between their features, and introduces two novel mechanisms, a structural constraint loss and a random feature perturbation mechanism, to strengthen the semantic alignment between parent and child features, enabling stable recovery of semantically meaningful hierarchies while preserving the reconstruction fidelity and interpretability of standard SAEs.
Link: https://arxiv.org/abs/2602.11881
Authors: Yifan Luo,Yang Zhan,Jiedong Jiang,Tianyang Liu,Mingrui Wu,Zhennan Zhou,Bin Dong
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Sparse autoencoders (SAEs) have proven effective for extracting monosemantic features from large language models (LLMs), yet these features are typically identified in isolation. However, broad evidence suggests that LLMs capture the intrinsic structure of natural language, where the phenomenon of “feature splitting” in particular indicates that such structure is hierarchical. To capture this, we propose the Hierarchical Sparse Autoencoder (HSAE), which jointly learns a series of SAEs and the parent-child relationships between their features. HSAE strengthens the alignment between parent and child features through two novel mechanisms: a structural constraint loss and a random feature perturbation mechanism. Extensive experiments across various LLMs and layers demonstrate that HSAE consistently recovers semantically meaningful hierarchies, supported by both qualitative case studies and rigorous quantitative metrics. At the same time, HSAE preserves the reconstruction fidelity and interpretability of standard SAEs across different dictionary sizes. Our work provides a powerful, scalable tool for discovering and analyzing the multi-scale conceptual structures embedded in LLM representations.
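A speculative miniature of the structural-constraint idea follows: child decoder directions in a fine SAE are pulled toward their assigned parent directions in a coarse SAE. The assignment map, the cosine form of the loss, and all sizes are assumptions made only to ground the description; the paper's actual loss may differ.

```python
import torch
import torch.nn.functional as F

def structural_constraint(parent_dec, child_dec, parent_of):
    """parent_dec: (P, d) decoder directions of the coarse SAE;
    child_dec: (C, d) decoder directions of the fine SAE;
    parent_of: (C,) index of each child's assigned parent feature."""
    parents = F.normalize(parent_dec[parent_of], dim=-1)
    children = F.normalize(child_dec, dim=-1)
    # Penalize misalignment between each child and its parent direction.
    return (1.0 - (parents * children).sum(-1)).mean()

P, C, d = 64, 512, 256
parent_dec = torch.randn(P, d, requires_grad=True)
child_dec = torch.randn(C, d, requires_grad=True)
parent_of = torch.randint(0, P, (C,))
loss = structural_constraint(parent_dec, child_dec, parent_of)
loss.backward()  # would be added to the usual SAE reconstruction losses
```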
[AI-38] Intelligent AI Delegation
【Quick Read】: This paper addresses the problem that current AI agents rely on simple heuristics when decomposing and delegating complex tasks, and therefore cannot dynamically adapt to environmental changes or robustly handle unexpected failures. The key to the solution is an adaptive framework for intelligent AI delegation: a sequence of decisions around task allocation that also integrates transfer of authority, responsibility and accountability mechanisms, clear specifications of roles and boundaries, clarity of intent, and mechanisms for establishing trust, thereby supporting efficient and safe collaboration between human and AI delegators and delegatees.
Link: https://arxiv.org/abs/2602.11865
Authors: Nenad Tomašev,Matija Franklin,Simon Osindero
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:AI agents are able to tackle increasingly complex tasks. To achieve more ambitious goals, AI agents need to be able to meaningfully decompose problems into manageable sub-components, and safely delegate their completion across to other AI agents and humans alike. Yet, existing task decomposition and delegation methods rely on simple heuristics, and are not able to dynamically adapt to environmental changes and robustly handle unexpected failures. Here we propose an adaptive framework for intelligent AI delegation - a sequence of decisions involving task allocation, that also incorporates transfer of authority, responsibility, accountability, clear specifications regarding roles and boundaries, clarity of intent, and mechanisms for establishing trust between the two (or more) parties. The proposed framework is applicable to both human and AI delegators and delegatees in complex delegation networks, aiming to inform the development of protocols in the emerging agentic web.
[AI-39] Talk2DM: Enabling Natural Language Querying and Commonsense Reasoning for Vehicle-Road-Cloud Integrated Dynamic Maps with Large Language Models
【Quick Read】: This paper addresses the lack of a natural-language-supported (NLS) human interface in the dynamic maps (DM) of current vehicle-road-cloud (VRC) cooperative autonomous driving systems, which limits efficient human-DM interaction. The key to the solution is Talk2DM, a plug-and-play module built on a novel chain-of-prompt (CoP) mechanism that progressively fuses human-defined rules with the commonsense knowledge of large language models (LLMs), enabling spatial querying and commonsense reasoning over mixed-traffic scenes. Experiments show that Talk2DM can switch seamlessly between different LLMs while maintaining over 93% natural-language query accuracy with an average response time of only 2-5 seconds, demonstrating strong practicality and generalization.
Link: https://arxiv.org/abs/2602.11860
Authors: Lu Tao,Jinxuan Luo,Yousuke Watanabe,Zhengshu Zhou,Yuhuan Lu,Shen Ying,Pan Zhang,Fei Zhao,Hiroaki Takada
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Submitted to IEEE TITS. Under review
Abstract:Dynamic maps (DM) serve as the fundamental information infrastructure for vehicle-road-cloud (VRC) cooperative autonomous driving in China and Japan. By providing comprehensive traffic scene representations, DM overcome the limitations of standalone autonomous driving systems (ADS), such as physical occlusions. Although DM-enhanced ADS have been successfully deployed in real-world applications in Japan, existing DM systems still lack a natural-language-supported (NLS) human interface, which could substantially enhance human-DM interaction. To address this gap, this paper introduces VRCsim, a VRC cooperative perception (CP) simulation framework designed to generate streaming VRC-CP data. Based on VRCsim, we construct a question-answering data set, VRC-QA, focused on spatial querying and reasoning in mixed-traffic scenes. Building upon VRCsim and VRC-QA, we further propose Talk2DM, a plug-and-play module that extends VRC-DM systems with NLS querying and commonsense reasoning capabilities. Talk2DM is built upon a novel chain-of-prompt (CoP) mechanism that progressively integrates human-defined rules with the commonsense knowledge of large language models (LLMs). Experiments on VRC-QA show that Talk2DM can seamlessly switch across different LLMs while maintaining high NLS query accuracy, demonstrating strong generalization capability. Although larger models tend to achieve higher accuracy, they incur significant efficiency degradation. Our results reveal that Talk2DM, powered by Qwen3:8B, Gemma3:27B, and GPT-oss models, achieves over 93% NLS query accuracy with an average response time of only 2-5 seconds, indicating strong practical potential.
[AI-40] Resource-Aware Deployment Optimization for Collaborative Intrusion Detection in Layered Networks
【Quick Read】: This paper addresses the difficulty intrusion detection systems (IDS) face in adapting to dynamically changing distributed environments, especially when efficient and flexible detection must run on heterogeneous, resource-constrained edge devices. The key to the solution is a novel Collaborative Intrusion Detection System (CIDS) framework that dynamically optimizes the allocation of detectors to each node according to available resources and data types, and automatically reconfigures detection setups for new operational scenarios without heavy computational overhead, thereby enabling adaptive and efficient intrusion detection on edge devices.
Link: https://arxiv.org/abs/2602.11851
Authors: André García Gómez,Ines Rieger,Wolfgang Hotwagner,Max Landauer,Markus Wurzenberger,Florian Skopik,Edgar Weippl
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Collaborative Intrusion Detection Systems (CIDS) are increasingly adopted to counter cyberattacks, as their collaborative nature enables them to adapt to diverse scenarios across heterogeneous environments. As distributed critical infrastructure operates in rapidly evolving environments, such as drones in both civil and military domains, there is a growing need for CIDS architectures that can flexibly accommodate these dynamic changes. In this study, we propose a novel CIDS framework designed for easy deployment across diverse distributed environments. The framework dynamically optimizes detector allocation per node based on available resources and data types, enabling rapid adaptation to new operational scenarios with minimal computational overhead. We first conducted a comprehensive literature review to identify key characteristics of existing CIDS architectures. Based on these insights and real-world use cases, we developed our CIDS framework, which we evaluated using several distributed datasets that feature different attack chains and network topologies. Notably, we introduce a public dataset based on a realistic cyberattack targeting a ground drone aimed at sabotaging critical infrastructure. Experimental results demonstrate that the proposed CIDS framework can achieve adaptive, efficient intrusion detection in distributed settings, automatically reconfiguring detectors to maintain an optimal configuration, without requiring heavy computation, since all experiments were conducted on edge devices.
[AI-41] Revis: Sparse Latent Steering to Mitigate Object Hallucination in Large Vision-Language Models
【Quick Read】: This paper addresses object hallucination in Large Vision-Language Models (LVLMs), whose root cause is that visual features become entangled with pretrained textual representations in deeper network layers, suppressing visual information. The key to the solution is REVIS, a training-free framework rooted in latent space geometry: it extracts a pure visual information vector via orthogonal projection and applies a calibrated strategy to intervene sparsely at exactly the depth where visual information is suppressed, restoring that information precisely at minimal computational cost. Experiments show that the method reduces object hallucination rates by roughly 19% on standard benchmarks while preserving the model's general reasoning capabilities.
Link: https://arxiv.org/abs/2602.11824
Authors: Jialin Wu,Wei Shi,Han Shen,Peigui Qi,Kunsheng Tang,Zhicong Huang,Binghao Wang,Zhou Yang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Despite the advanced capabilities of Large Vision-Language Models (LVLMs), they frequently suffer from object hallucination. One reason is that visual features and pretrained textual representations often become intertwined in the deeper network layers. To address this, we propose REVIS, a training-free framework designed to explicitly re-activate this suppressed visual information. Rooted in latent space geometry, REVIS extracts the pure visual information vector via orthogonal projection and employs a calibrated strategy to perform sparse intervention only at the precise depth where suppression occurs. This surgical approach effectively restores visual information with minimal computational cost. Empirical evaluations on standard benchmarks demonstrate that REVIS reduces object hallucination rates by approximately 19% compared to state-of-the-art baselines, while preserving general reasoning capabilities.
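The orthogonal-projection step has a simple geometric core, sketched below as a toy: remove from a hidden state its component along an entangled textual direction. How REVIS estimates that direction and calibrates the intervention strength is not specified here, so t and the 0.5 scale are placeholders.

```python
import torch

def orthogonal_reject(h_visual, t_text, eps=1e-8):
    """Return the part of h_visual orthogonal to the text direction t_text."""
    t = t_text / (t_text.norm() + eps)
    return h_visual - (h_visual @ t) * t

h = torch.randn(4096)       # hidden state at the intervention layer
t = torch.randn(4096)       # assumed estimate of the textual/prior direction
v_pure = orthogonal_reject(h, t)
steered = h + 0.5 * v_pure  # sparse intervention at the chosen depth (toy scale)
```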
[AI-42] Predicting LLM Output Length via Entropy-Guided Representations
【Quick Read】: This paper addresses the computational waste in large language model (LLM) serving and reinforcement learning (RL) sampling caused by the long-tailed distribution of sequence lengths, in particular the inefficiency of excessive padding in batched inference. The core solution is a lightweight framework that reuses the main model's internal hidden states for efficient length prediction. Its two key components are Entropy-Guided Token Pooling (EGTP), which exploits on-the-fly activations and token entropy for highly accurate static length prediction at negligible cost, and Progressive Length Prediction (PLP), which dynamically estimates the remaining length at every decoding step to cope with stochastic one-to-many generation.
Link: https://arxiv.org/abs/2602.11812
Authors: Huanyi Xie,Yubin Chen,Liangyu Wang,Lijie Hu,Di Wang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:The long-tailed distribution of sequence lengths in LLM serving and reinforcement learning (RL) sampling causes significant computational waste due to excessive padding in batched inference. Existing methods rely on auxiliary models for static length prediction, but they incur high overhead, generalize poorly, and fail in stochastic “one-to-many” sampling scenarios. We introduce a lightweight framework that reuses the main model’s internal hidden states for efficient length prediction. Our framework features two core components: 1) Entropy-Guided Token Pooling (EGTP), which uses on-the-fly activations and token entropy for highly accurate static prediction with negligible cost, and 2) Progressive Length Prediction (PLP), which dynamically estimates the remaining length at each decoding step to handle stochastic generation. To validate our approach, we build and release ForeLen, a comprehensive benchmark with long-sequence, Chain-of-Thought, and RL data. On ForeLen, EGTP achieves state-of-the-art accuracy, reducing MAE by 29.16% over the best baseline. Integrating our methods with a length-aware scheduler yields significant end-to-end throughput gains. Our work provides a new technical and evaluation baseline for efficient LLM inference.
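One plausible reading of "entropy-guided token pooling" is an entropy-weighted average of prompt hidden states feeding a small regression head, as sketched below; the softmax weighting and all shapes are illustrative assumptions, not the paper's architecture.

```python
import torch

def entropy_guided_pool(hidden, logits):
    """hidden: (T, d) prompt hidden states; logits: (T, V) next-token logits."""
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(-1)  # (T,)
    weights = torch.softmax(entropy, dim=0)               # emphasize uncertain tokens
    return (weights.unsqueeze(-1) * hidden).sum(0)        # (d,)

T, d, V = 128, 1024, 32000
pooled = entropy_guided_pool(torch.randn(T, d), torch.randn(T, V))
length_head = torch.nn.Linear(d, 1)   # lightweight output-length predictor
predicted_len = length_head(pooled)
```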
[AI-43] PuYun-LDM: A Latent Diffusion Model for High-Resolution Ensemble Weather Forecasts
【Quick Read】: This paper addresses the limited diffusability of latent diffusion models (LDMs) in high-resolution (=0.25°) ensemble weather forecasting, i.e., how to make the latent data distribution easier to model with a diffusion process. The core difficulties are twofold: meteorological fields lack task-agnostic foundation models and explicit semantic structure, which renders VFM-based regularization inapplicable; and existing frequency-domain regularization assumes homogeneous spectra across channels, failing to accommodate the spectral heterogeneity of multivariate meteorological data and producing uneven regularization strength. The key to the solution lies in two mechanisms: a 3D Masked AutoEncoder (3D-MAE) that encodes weather-state evolution features as additional conditioning for the diffusion model, and a Variable-Aware Masked Frequency Modeling (VA-MFM) strategy that adaptively selects thresholds according to each variable's spectral energy distribution, applying differentiated frequency-domain regularization. Together they form PuYun-LDM, which outperforms the conventional ensemble forecast (ENS) at short lead times, stays comparable at longer horizons, and supports efficient inference (a 15-day global forecast at 6-hour temporal resolution in five minutes on a single NVIDIA H200 GPU).
Link: https://arxiv.org/abs/2602.11807
Authors: Lianjun Wu,Shengchen Zhu,Yuxuan Liu,Liuyu Kai,Xiaoduan Feng,Duomin Wang,Wenshuo Liu,Jingxuan Zhang,Kelvin Li,Bin Wang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Latent diffusion models (LDMs) suffer from limited diffusability in high-resolution (=0.25°) ensemble weather forecasting, where diffusability characterizes how easily a latent data distribution can be modeled by a diffusion process. Unlike natural image fields, meteorological fields lack task-agnostic foundation models and explicit semantic structures, making VFM-based regularization inapplicable. Moreover, existing frequency-based approaches impose identical spectral regularization across channels under a homogeneity assumption, which leads to uneven regularization strength under the inter-variable spectral heterogeneity in multivariate meteorological data. To address these challenges, we propose a 3D Masked AutoEncoder (3D-MAE) that encodes weather-state evolution features as an additional conditioning for the diffusion model, together with a Variable-Aware Masked Frequency Modeling (VA-MFM) strategy that adaptively selects thresholds based on the spectral energy distribution of each variable. Together, we propose PuYun-LDM, which enhances latent diffusability and achieves superior performance to ENS at short lead times while remaining comparable to ENS at longer horizons. PuYun-LDM generates a 15-day global forecast with a 6-hour temporal resolution in five minutes on a single NVIDIA H200 GPU, while ensemble forecasts can be efficiently produced in parallel.
[AI-44] Hi-SAM: A Hierarchical Structure-Aware Multi-modal Framework for Large-Scale Recommendation
【Quick Read】: This paper addresses two key problems in multi-modal recommendation: (1) suboptimal semantic tokenization, where existing methods (e.g., RQ-VAE) fail to disentangle shared cross-modal semantics from modality-specific details, causing redundancy or semantic collapse; and (2) architecture-data mismatch, where vanilla Transformers treat semantic IDs as flat streams, ignoring the hierarchy among user interactions, items, and tokens, inflating sequence length and biasing attention toward local details over holistic semantics. The key designs of the proposed Hi-SAM framework are: (1) a Disentangled Semantic Tokenizer (DST) that unifies modalities via geometry-aware alignment and a coarse-to-fine quantization strategy, with shared codebooks distilling consensus semantics and modality-specific codebooks recovering nuances from residuals, enforced by mutual information minimization; and (2) a Hierarchical Memory-Anchor Transformer (HMAT) that separates inter-item and intra-item positional embeddings via Hierarchical RoPE and inserts Anchor Tokens to condense items into compact memory units, retaining details for the current item while accessing history only through compressed summaries, thereby restoring the data hierarchy and improving long-sequence modeling efficiency.
Link: https://arxiv.org/abs/2602.11799
Authors: Pingjun Pan,Tingting Zhou,Peiyao Lu,Tingting Fei,Hongxiang Chen,Chuanjiang Luo
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Multi-modal recommendation has gained traction as items possess rich attributes like text and images. Semantic ID-based approaches effectively discretize this information into compact tokens. However, two challenges persist: (1) Suboptimal Tokenization: existing methods (e.g., RQ-VAE) lack disentanglement between shared cross-modal semantics and modality-specific details, causing redundancy or collapse; (2) Architecture-Data Mismatch: vanilla Transformers treat semantic IDs as flat streams, ignoring the hierarchy of user interactions, items, and tokens. Expanding items into multiple tokens amplifies length and noise, biasing attention toward local details over holistic semantics. We propose Hi-SAM, a Hierarchical Structure-Aware Multi-modal framework with two designs: (1) Disentangled Semantic Tokenizer (DST): unifies modalities via geometry-aware alignment and quantizes them via a coarse-to-fine strategy. Shared codebooks distill consensus while modality-specific ones recover nuances from residuals, enforced by mutual information minimization; (2) Hierarchical Memory-Anchor Transformer (HMAT): splits positional encoding into inter- and intra-item subspaces via Hierarchical RoPE to restore hierarchy. It inserts Anchor Tokens to condense items into compact memory, retaining details for the current item while accessing history only through compressed summaries. Experiments on real-world datasets show consistent improvements over SOTA baselines, especially in cold-start scenarios. Deployed on a large-scale social platform serving millions of users, Hi-SAM achieved a 6.55% gain in the core online metric.
[AI-45] Evaluating LLM Safety Under Repeated Inference via Accelerated Prompt Stress Testing
【Quick Read】: This paper addresses the insufficiency of traditional large language model (LLM) evaluation for surfacing operational failures in real deployments. Existing benchmarks assess safety through breadth-oriented multi-task evaluation, yet in practice models can exhibit latent failures, hallucinations, refusal inconsistency, and unsafe outputs, under repeated inference on identical or near-identical prompts, risks that single-sample evaluation cannot reveal. The key to the solution is Accelerated Prompt Stress Testing (APST), a depth-oriented evaluation framework inspired by reliability engineering: it repeatedly samples identical prompts under controlled conditions (e.g., fixed decoding temperature), models failures as stochastic outcomes of independent inference events, and uses Bernoulli and binomial models to quantify per-inference failure probabilities, enabling comparable reliability analysis across models and decoding configurations. This reveals that models with similar benchmark scores can have substantially different empirical failure rates under sustained use, especially at higher temperatures, closing a blind spot of traditional evaluation in deployment scenarios.
Link: https://arxiv.org/abs/2602.11786
Authors: Keita Broadwater
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 24 pages, 9 figures. Submitted to TMLR
Abstract:Traditional benchmarks for large language models (LLMs) primarily assess safety risk through breadth-oriented evaluation across diverse tasks. However, real-world deployment exposes a different class of risk: operational failures arising from repeated inference on identical or near-identical prompts rather than broad task generalization. In high-stakes settings, response consistency and safety under sustained use are critical. We introduce Accelerated Prompt Stress Testing (APST), a depth-oriented evaluation framework inspired by reliability engineering. APST repeatedly samples identical prompts under controlled operational conditions (e.g., decoding temperature) to surface latent failure modes including hallucinations, refusal inconsistency, and unsafe completions. Rather than treating failures as isolated events, APST models them as stochastic outcomes of independent inference events. We formalize safety failures using Bernoulli and binomial models to estimate per-inference failure probabilities, enabling quantitative comparison of reliability across models and decoding configurations. Applying APST to multiple instruction-tuned LLMs evaluated on AIR-BENCH-derived safety prompts, we find that models with similar benchmark-aligned scores can exhibit substantially different empirical failure rates under repeated sampling, particularly as temperature increases. These results demonstrate that shallow, single-sample evaluation can obscure meaningful reliability differences under sustained use. APST complements existing benchmarks by providing a practical framework for evaluating LLM safety and reliability under repeated inference, bridging benchmark alignment and deployment-oriented risk assessment.
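The Bernoulli/binomial formalization reduces to a familiar estimation problem: k failures in n repeated inferences of one prompt yield a per-inference failure probability with a confidence interval. The numbers below are invented, and the exact Clopper-Pearson interval is our choice of interval for illustration, not necessarily the paper's.

```python
from scipy.stats import beta

def failure_rate_ci(k, n, alpha=0.05):
    """Point estimate and exact (Clopper-Pearson) CI for a binomial rate."""
    p_hat = k / n
    lo = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    hi = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return p_hat, (lo, hi)

# e.g. 7 unsafe completions out of 200 identical-prompt samples at T = 1.0
p, (lo, hi) = failure_rate_ci(7, 200)
print(f"per-inference failure prob ~ {p:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```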
[AI-46] Safe Fairness Guarantees Without Demographics in Classification: Spectral Uncertainty Set Perspective
【Quick Read】: This paper addresses the difficulty of guaranteeing fairness in automated classification systems without access to demographic information (e.g., gender, race). Most existing methods require group labels for all instances, an assumption rarely met in practice; robust-optimization-based methods avoid group information, but their performance hinges on the chosen uncertainty set and they often overemphasize extreme cases, degrading both overall performance and fairness. The key innovation is SPECTRE, a minimax-fair method that adjusts the spectrum of a simple Fourier feature mapping and constrains how far the worst-case distribution can deviate from the empirical distribution. Experiments on American Community Survey datasets from 20 states show that it surpasses state-of-the-art approaches on fairness guarantees, including those with access to group information, with a smaller interquartile range, and the paper additionally provides a theoretical analysis with computable worst-case error bounds.
Link: https://arxiv.org/abs/2602.11785
Authors: Ainhize Barrainkua,Santiago Mazuelas,Novi Quadrianto,Jose A. Lozano
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:As automated classification systems become increasingly prevalent, concerns have emerged over their potential to reinforce and amplify existing societal biases. In the light of this issue, many methods have been proposed to enhance the fairness guarantees of classifiers. Most of the existing interventions assume access to group information for all instances, a requirement rarely met in practice. Fairness without access to demographic information has often been approached through robust optimization techniques, which target worst-case outcomes over a set of plausible distributions known as the uncertainty set. However, their effectiveness is strongly influenced by the chosen uncertainty set. In fact, existing approaches often overemphasize outliers or overly pessimistic scenarios, compromising both overall performance and fairness. To overcome these limitations, we introduce SPECTRE, a minimax-fair method that adjusts the spectrum of a simple Fourier feature mapping and constrains the extent to which the worst-case distribution can deviate from the empirical distribution. We perform extensive experiments on the American Community Survey datasets involving 20 states. SPECTRE is safe in the sense that it provides the highest average fairness guarantees together with the smallest interquartile range among state-of-the-art approaches, even those with access to demographic group information. In addition, we provide a theoretical analysis that derives computable bounds on the worst-case error for both individual groups and the overall population, and characterizes the worst-case distributions responsible for these extremal performances.
[AI-47] FlowMind: Execute-Summarize for Structured Workflow Generation from LLM Reasoning
【Quick Read】: This paper addresses the problem of accurately translating the free-form reasoning of large language models (LLMs) into structured workflows when solving complex tasks. Existing methods typically build the workflow during task execution, so execution and workflow construction interfere with each other, hurting accuracy. The key to the solution is the Execute-Summarize (ES) framework, which decouples task execution from workflow construction: the model first completes the task with the available tools and then independently reconstructs a structured sequence of tool calls from the execution traces, substantially improving workflow accuracy and robustness.
Link: https://arxiv.org/abs/2602.11782
Authors: Yihao Liu,Ziyun Zhang,Zile He,Huaqian Cai
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:
Abstract:LLMs can solve complex tasks through reasoning and tool use, but accurately translating these solutions into structured workflows remains challenging. We model workflows as sequences of tool use and reformulate the problem as designing a mechanism that can both solve tasks and reliably construct workflows. Prior approaches that build workflows during execution often suffer from inaccuracies due to interference between the two processes. We propose an Execute-Summarize(ES) framework that decouples task execution from workflow construction: the model first completes the task using available tools, then independently reconstructs a structured workflow from execution traces. This separation improves workflow accuracy and robustness. We introduce FlowBench and show through extensive experiments that our approach outperforms existing methods, providing a reliable paradigm for grounding free-form LLM reasoning into structured workflows.
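The Execute-Summarize split can be shown in a few lines: log tool calls during execution, then build the workflow from the trace in a separate pass. Everything below, the trace schema, tool names, and workflow format, is a placeholder sketch of the idea rather than the FlowMind implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Trace:
    steps: list = field(default_factory=list)

    def call(self, tool, fn, **kwargs):
        """Execute phase: run the tool and record the call in the trace."""
        result = fn(**kwargs)
        self.steps.append({"tool": tool, "args": kwargs})
        return result

def summarize(trace):
    """Summarize phase: reconstruct a structured workflow from the trace."""
    return [{"step": i + 1, **s} for i, s in enumerate(trace.steps)]

trace = Trace()
trace.call("search", lambda query: ["doc1"], query="quarterly revenue")
trace.call("extract", lambda doc: 42.0, doc="doc1")
workflow = summarize(trace)  # e.g. [{'step': 1, 'tool': 'search', ...}, ...]
```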
[AI-48] RELATE: A Reinforcement Learning-Enhanced LLM Framework for Advertising Text Generation
【Quick Read】: This paper addresses the misalignment between ad text generation and downstream performance metrics (e.g., CTR) in online advertising, where the decoupled two-stage paradigm often leads to low funnel efficiency and limited global optimality. The key to the solution is RELATE, a reinforcement-learning-based end-to-end framework that unifies text generation with multi-dimensional reward objectives (including conversion-oriented metrics and compliance constraints), integrating performance and compliance goals directly into the generation process via policy learning. This enables automatic optimization of high-quality ad copy, significantly improving click-through conversion rate (CTCVR), with effectiveness and robustness validated in a real industrial deployment.
Link: https://arxiv.org/abs/2602.11780
Authors: Jinfang Wang,Jiajie Liu,Jianwei Wu,Ziqin Luo,Zhen Chen,Chunlei Li,Biao Han,Tao Deng,Yi Li,Shuanglong Li,Lin Liu
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 10 pages, 3 figures
Abstract:In online advertising, advertising text plays a critical role in attracting user engagement and driving advertiser value. Existing industrial systems typically follow a two-stage paradigm, where candidate texts are first generated and subsequently aligned with online performance metrics such as click-through rate(CTR). This separation often leads to misaligned optimization objectives and low funnel efficiency, limiting global optimality. To address these limitations, we propose RELATE, a reinforcement learning-based end-to-end framework that unifies generation and objective alignment within a single model. Instead of decoupling text generation from downstream metric alignment, RELATE integrates performance and compliance objectives directly into the generation process via policy learning. To better capture ultimate advertiser value beyond click-level signals, We incorporate conversion-oriented metrics into the objective and jointly model them with compliance constraints as multi-dimensional rewards, enabling the model to generate high-quality ad texts that improve conversion performance under policy constraints. Extensive experiments on large-scale industrial datasets demonstrate that RELATE consistently outperforms baselines. Furthermore, online deployment on a production advertising platform yields statistically significant improvements in click-through conversion rate(CTCVR) under strict policy constraints, validating the robustness and real-world effectiveness of the proposed framework.
[AI-49] How to Optimize Multispecies Set Predictions in Presence-Absence Modeling?
【Quick Read】: This paper addresses the distortion of species prevalence and community composition estimates that arises when species distribution models (SDMs) convert probabilistic occurrence predictions into binary presence-absence maps via heuristic thresholding. The core solution is MaxExp, a decision-driven binarization framework that selects the most probable species assemblage by directly maximizing a chosen evaluation metric; it requires no calibration data and is flexible across scoring criteria. The paper also introduces the computationally more efficient Set Size Expectation (SSE) method, which predicts assemblages from expected species richness. Both handle strong class imbalance and high rarity well, providing robust, reproducible tools for multispecies SDM binarization.
Link: https://arxiv.org/abs/2602.11771
Authors: Sébastien Gigot–Léandri,Gaétan Morand,Alexis Joly,François Munoz,David Mouillot,Christophe Botella,Maximilien Servajean
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Species distribution models (SDMs) commonly produce probabilistic occurrence predictions that must be converted into binary presence-absence maps for ecological inference and conservation planning. However, this binarization step is typically heuristic and can substantially distort estimates of species prevalence and community composition. We present MaxExp, a decision-driven binarization framework that selects the most probable species assemblage by directly maximizing a chosen evaluation metric. MaxExp requires no calibration data and is flexible across several scores. We also introduce the Set Size Expectation (SSE) method, a computationally efficient alternative that predicts assemblages based on expected species richness. Using three case studies spanning diverse taxa, species counts, and performance metrics, we show that MaxExp consistently matches or surpasses widely used thresholding and calibration methods, especially under strong class imbalance and high rarity. SSE offers a simpler yet competitive option. Together, these methods provide robust, reproducible tools for multispecies SDM binarization.
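The Set Size Expectation rule is simple enough to state in code: take the expected richness (the sum of per-species occurrence probabilities, rounded) as k, then mark the top-k most probable species present. The rounding-plus-top-k reading is our interpretation of "predicts assemblages based on expected species richness", not a verbatim description of the paper's procedure.

```python
import numpy as np

def sse_assemblage(probs):
    """Predict a presence-absence vector from SDM occurrence probabilities."""
    k = int(round(probs.sum()))            # expected species richness
    order = np.argsort(probs)[::-1]        # species ranked by probability
    presence = np.zeros_like(probs, dtype=bool)
    presence[order[:k]] = True
    return presence

p = np.array([0.9, 0.7, 0.4, 0.15, 0.05])  # per-species probabilities
print(sse_assemblage(p))                   # richness = round(2.2) = 2 species
```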
[AI-50] AIR: Improving Agent Safety through Incident Response
【Quick Read】: This paper addresses the lack of effective post-hoc response mechanisms in deployed large language model (LLM) agent systems: existing safety measures focus almost exclusively on prevention and cannot cope with inevitable operational incidents. The key to the solution is AIR, the first incident response framework for LLM agent systems, which introduces a domain-specific language (DSL) for autonomously managing the incident response lifecycle and integrates it into the agent's execution loop: (1) detecting incidents via semantic checks grounded in the environment state and recent context; (2) guiding the agent to perform containment and recovery actions via its tools; and (3) synthesizing guardrail rules during eradication to block similar incidents in future executions. Experiments show detection, remediation, and eradication success rates all exceeding 90%, demonstrating that incident response is both feasible and essential as a first-class mechanism for improving agent safety.
Link: https://arxiv.org/abs/2602.11749
Authors: Zibo Xiao,Jun Sun,Junjie Chen
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Model (LLM) agents are increasingly deployed in practice across a wide range of autonomous applications. Yet current safety mechanisms for LLM agents focus almost exclusively on preventing failures in advance, providing limited capabilities for responding to, containing, or recovering from incidents after they inevitably arise. In this work, we introduce AIR, the first incident response framework for LLM agent systems. AIR defines a domain-specific language for managing the incident response lifecycle autonomously in LLM agent systems, and integrates it into the agent’s execution loop to (1) detect incidents via semantic checks grounded in the current environment state and recent context, (2) guide the agent to execute containment and recovery actions via its tools, and (3) synthesize guardrail rules during eradication to block similar incidents in future executions. We evaluate AIR on three representative agent types. Results show that AIR achieves detection, remediation, and eradication success rates all exceeding 90%. Extensive experiments further confirm the necessity of AIR’s key design components, show the timeliness and moderate overhead of AIR, and demonstrate that LLM-generated rules can approach the effectiveness of developer-authored rules across domains. These results show that incident response is both feasible and essential as a first-class mechanism for improving agent safety.
[AI-51] Text2GQL-Bench: A Text to Graph Query Language Benchmark [Experiment Analysis Benchmark]
【Quick Read】: This paper addresses the obstacles to evaluating Text-to-Graph-Query-Language (Text-to-GQL) systems: existing benchmark datasets are of low quality, cover limited domains, and evaluate along a single dimension, preventing systematic comparison of models across graph query languages (GQLs) and domains. The key to the solution is Text2GQL-Bench, a unified Text-to-GQL benchmark whose core is a multi-GQL dataset of 178,184 (question, query) pairs plus a scalable construction framework that generates diverse data across domains, abstraction levels, and heterogeneous resources. It also introduces a multi-dimensional evaluation method that jointly reports grammatical validity, similarity, semantic alignment, and execution accuracy (EX), exposing performance bottlenecks: even strong baseline LLMs achieve at most 4% zero-shot execution accuracy on ISO-GQL, while fine-tuning with sufficient examples lifts an 8B model to 45.1% EX and 90.8% grammatical validity, confirming the key role of high-quality examples in closing the dialect gap.
Link: https://arxiv.org/abs/2602.11745
Authors: Songlin Lyu,Lujie Ban,Zihang Wu,Tianqi Luo,Jirong Liu,Chenhao Ma,Yuyu Luo,Nan Tang,Shipeng Qi,Heng Lin,Yongchao Liu,Chuntao Hong
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Graph models are fundamental to data analysis in domains rich with complex relationships. Text-to-Graph-Query-Language (Text-to-GQL) systems act as a translator, converting natural language into executable graph queries. This capability allows Large Language Models (LLMs) to directly analyze and manipulate graph data, positioning them as powerful agent infrastructures for Graph Database Management System (GDBMS). Despite recent progress, existing datasets are often limited in domain coverage, supported graph query languages, or evaluation scope. The advancement of Text-to-GQL systems is hindered by the lack of high-quality benchmark datasets and evaluation methods to systematically compare model capabilities across different graph query languages and domains. In this work, we present Text2GQL-Bench, a unified Text-to-GQL benchmark designed to address these limitations. Text2GQL-Bench couples a multi-GQL dataset that has 178,184 (Question, Query) pairs spanning 13 domains, with a scalable construction framework that generates datasets in different domains, question abstraction levels, and GQLs with heterogeneous resources. To support comprehensive assessment, we introduce an evaluation method that goes beyond a single end-to-end metric by jointly reporting grammatical validity, similarity, semantic alignment, and execution accuracy. Our evaluation uncovers a stark dialect gap in ISO-GQL generation: even strong LLMs achieve only at most 4% execution accuracy (EX) in zero-shot settings, though a fixed 3-shot prompt raises accuracy to around 50%, the grammatical validity remains lower than 70%. Moreover, a fine-tuned 8B open-weight model reaches 45.1% EX, and 90.8% grammatical validity, demonstrating that most of the performance jump is unlocked by exposure to sufficient ISO-GQL examples.
[AI-52] Cross-Architecture Model Diffing with Crosscoders: Unsupervised Discovery of Differences Between LLMs
【Quick Read】: This paper addresses the challenge of cross-architecture model diffing: identifying key behavioral differences in the internal representations of large language models (LLMs) with different architectures so as to uncover potentially safety-relevant behaviors. Traditional model diffing has been largely restricted to comparing a base model with its finetune under a single architecture, which does not fit newly released heterogeneous architectures. The key to the solution is Dedicated Feature Crosscoders (DFCs), an architectural modification designed to better isolate features unique to one model. Using this method, the authors perform the first unsupervised cross-architecture model diffing and identify meaningful behavioral features in several models, such as Chinese Communist Party alignment in Qwen3-8B, American exceptionalism in Llama3.1-8B-Instruct, and a copyright refusal mechanism in GPT-OSS-20B, validating the effectiveness and practicality of the cross-architecture crosscoder approach.
Link: https://arxiv.org/abs/2602.11729
Authors: Thomas Jiralerspong,Trenton Bricken
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:
Abstract:Model diffing, the process of comparing models’ internal representations to identify their differences, is a promising approach for uncovering safety-critical behaviors in new models. However, its application has so far been primarily focused on comparing a base model with its finetune. Since new LLM releases are often novel architectures, cross-architecture methods are essential to make model diffing widely applicable. Crosscoders are one solution capable of cross-architecture model diffing but have only ever been applied to base vs finetune comparisons. We provide the first application of crosscoders to cross-architecture model diffing and introduce Dedicated Feature Crosscoders (DFCs), an architectural modification designed to better isolate features unique to one model. Using this technique, we find in an unsupervised fashion features including Chinese Communist Party alignment in Qwen3-8B and Deepseek-R1-0528-Qwen3-8B, American exceptionalism in Llama3.1-8B-Instruct, and a copyright refusal mechanism in GPT-OSS-20B. Together, our results work towards establishing cross-architecture crosscoder model diffing as an effective method for identifying meaningful behavioral differences between AI models.
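A speculative miniature of a crosscoder with dedicated features follows: shared latents decode into both models' activation spaces, while each model also gets latents decoded only into its own space, which is where model-unique features should land. Layer sizes, the ReLU/L1 choices, and the loss weights are all assumptions, not the DFC architecture's actual details.

```python
import torch
import torch.nn as nn

class DFCrosscoder(nn.Module):
    def __init__(self, d_a, d_b, n_shared, n_dedicated):
        super().__init__()
        n = n_shared + 2 * n_dedicated
        self.enc = nn.Linear(d_a + d_b, n)
        self.dec_a = nn.Linear(n_shared + n_dedicated, d_a, bias=False)
        self.dec_b = nn.Linear(n_shared + n_dedicated, d_b, bias=False)
        self.n_shared, self.n_ded = n_shared, n_dedicated

    def forward(self, x_a, x_b):
        f = torch.relu(self.enc(torch.cat([x_a, x_b], -1)))
        shared = f[..., : self.n_shared]          # features common to both
        only_a = f[..., self.n_shared : self.n_shared + self.n_ded]
        only_b = f[..., self.n_shared + self.n_ded :]
        rec_a = self.dec_a(torch.cat([shared, only_a], -1))
        rec_b = self.dec_b(torch.cat([shared, only_b], -1))
        sparsity = f.abs().mean()                 # L1 penalty on activations
        return ((rec_a - x_a) ** 2).mean() + ((rec_b - x_b) ** 2).mean() \
               + 1e-3 * sparsity

model = DFCrosscoder(d_a=512, d_b=768, n_shared=1024, n_dedicated=256)
loss = model(torch.randn(32, 512), torch.randn(32, 768))
```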
[AI-53] Beyond Parameter Arithmetic: Sparse Complementary Fusion for Distribution-Aware Model Merging
【Quick Read】: This paper addresses the functional interference caused by the heuristic parameter-space strategies of existing model merging methods, which degrades generalization and destabilizes generation (e.g., repetition and incoherent outputs). The key to the solution is the proposed Sparse Complementary Fusion with reverse KL (SCF-RKL) framework, which measures functional divergence between models in a distribution-aware manner via reverse KL divergence, rather than assuming linear additivity in parameter space, and selectively incorporates complementary parameters to preserve stable representations while precisely integrating new capabilities. This mode-seeking, sparsity-inducing design markedly improves the performance consistency and generation stability of merged models.
Link: https://arxiv.org/abs/2602.11717
Authors: Weihong Lin,Lin Sun,Qilong Shi,Aomufei Yuan,Yuxuan Tian,Zhengyang Wang,Guangxiang Zhao,Xiangzheng Zhang,Tong Yang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Model merging has emerged as a promising paradigm for composing the capabilities of large language models by directly operating in weight space, enabling the integration of specialized models without costly retraining. However, existing merging methods largely rely on parameter-space heuristics, which often introduce severe interference, leading to degraded generalization and unstable generation behaviors such as repetition and incoherent outputs. In this work, we propose Sparse Complementary Fusion with reverse KL (SCF-RKL), a novel model merging framework that explicitly controls functional interference through sparse, distribution-aware updates. Instead of assuming linear additivity in parameter space, SCF-RKL measures the functional divergence between models using reverse Kullback-Leibler divergence and selectively incorporates complementary parameters. This mode-seeking, sparsity-inducing design effectively preserves stable representations while integrating new capabilities. We evaluate SCF-RKL across a wide range of model scales and architectures, covering both reasoning-focused and instruction-tuned models. Extensive experiments on 24 benchmarks spanning advanced reasoning, general reasoning and knowledge, instruction following, safety, and vision classification demonstrate that SCF-RKL consistently outperforms existing model merging methods while maintaining strong generalization and generation stability.
[AI-54] TabSieve: Explicit In-Table Evidence Selection for Tabular Prediction
【Quick Read】: This paper addresses the failure of existing models to use in-table rows effectively as few-shot evidence for tabular prediction: traditional tabular models perform instance-wise inference, while LLM-based prompting is often brittle, does not consistently leverage relevant rows, and degrades under noisy context. The key to the solution is TabSieve, a select-then-predict framework that explicitly selects a small set of informative reference rows as evidence and then predicts the target conditioned on them, improving the efficiency and robustness of evidence usage. To enable this, the authors construct the high-quality supervised fine-tuning dataset TabSieve-SFT-40K and introduce the TAB-GRPO reinforcement learning recipe, which jointly optimizes evidence selection and prediction correctness while stabilizing mixed regression-classification training via dynamic task-advantage balancing.
Link: https://arxiv.org/abs/2602.11700
Authors: Yongyao Wang,Ziqi Miao,Lu Yang,Haonan Jia,Wenting Yan,Chen Qian,Lijun Li
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages
Abstract:Tabular prediction can benefit from in-table rows as few-shot evidence, yet existing tabular models typically perform instance-wise inference and LLM-based prompting is often brittle. Models do not consistently leverage relevant rows, and noisy context can degrade performance. To address this challenge, we propose TabSieve, a select-then-predict framework that makes evidence usage explicit and auditable. Given a table and a query row, TabSieve first selects a small set of informative rows as evidence and then predicts the missing target conditioned on the selected evidence. To enable this capability, we construct TabSieve-SFT-40K by synthesizing high-quality reasoning trajectories from 331 real tables using a strong teacher model with strict filtering. Furthermore, we introduce TAB-GRPO, a reinforcement learning recipe that jointly optimizes evidence selection and prediction correctness with separate rewards, and stabilizes mixed regression and classification training via dynamic task-advantage balancing. Experiments on a held-out benchmark of 75 classification and 52 regression tables show that TabSieve consistently improves performance across shot budgets, with average gains of 2.92% on classification and 4.45% on regression over the second-best baseline. Further analysis indicates that TabSieve concentrates more attention on the selected evidence, which improves robustness to noisy context.
[AI-55] ANML: Attribution-Native Machine Learning with Guaranteed Robustness
【Quick Read】: This paper addresses the problem that current frontier AI systems treat all training samples identically, ignoring quality differences across data sources, which limits both performance and accountability; for example, a Nobel laureate's contribution receives the same weight as an unverified submission. The core solution is the ANML (Attribution-Native Machine Learning) framework, which weights training samples via four quality factors, gradient consistency (q), verification status (v), contributor reputation (r), and temporal relevance (T), combining gradient signals observed by the model with data provenance (external signals) to enable quality-driven training and contributor-level attribution. Experiments show 33-72% error reduction over gradient-only baselines across multiple datasets, with 20% high-quality data outperforming 100% uniformly weighted data, and robustness against strategic attacks.
Link: https://arxiv.org/abs/2602.11690
Authors: Oliver Zahn,Matt Beton,Simran Chana
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 27 pages, 6 figures
Abstract:Frontier AI systems increasingly train on specialized expert data, from clinical records to proprietary research to curated datasets, yet current training pipelines treat all samples identically. A Nobel laureate's contribution receives the same weight as an unverified submission. We introduce ANML (Attribution-Native Machine Learning), a framework that weights training samples by four quality factors: gradient-based consistency (q), verification status (v), contributor reputation (r), and temporal relevance (T). By combining what the model observes (gradient signals) with what the system knows about data provenance (external signals), ANML produces per-contributor quality weights that simultaneously improve model performance and enable downstream attribution. Across 5 datasets (178-32,561 samples), ANML achieves 33-72% error reduction over gradient-only baselines. Quality-weighted training is data-efficient: 20% high-quality data outperforms 100% uniformly weighted data by 47%. A Two-Stage Adaptive gating mechanism guarantees that ANML never underperforms the best available baseline, including under strategic joint attacks combining credential faking with gradient alignment. When per-sample detection fails against subtle corruption, contributor-level attribution provides 1.3-5.3x greater improvement than sample-level methods, with the advantage growing as corruption becomes harder to detect.
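The four quality factors reduce to a per-sample weight in the training objective. The sketch below combines them multiplicatively, w = q * v * r * T, which is an assumption: the abstract names the factors but not their combination rule, and all numbers are invented.

```python
import torch

def anml_weights(q, v, r, T):
    """q: gradient consistency in [0,1]; v: verification status; r:
    contributor reputation in [0,1]; T: temporal relevance in [0,1].
    Multiplicative combination is our illustrative assumption."""
    return q * v * r * T

per_sample_loss = torch.tensor([0.9, 0.4, 1.2])
q = torch.tensor([0.8, 0.95, 0.3])
v = torch.tensor([1.0, 1.0, 0.5])
r = torch.tensor([0.9, 0.7, 0.2])
T = torch.tensor([1.0, 0.8, 0.6])

w = anml_weights(q, v, r, T)
weighted = (w * per_sample_loss).sum() / w.sum()  # quality-weighted objective
```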
[AI-56] DRACO: a Cross-Domain Benchmark for Deep Research Accuracy Completeness and Objectivity
【Quick Read】: This paper addresses the lack of a unified, objective, and comprehensive benchmark for evaluating generative AI on complex deep research tasks; existing evaluations struggle to cover the real-world challenges of integrating heterogeneous multi-source information, cross-domain reasoning, and critical analysis. The key to the solution is DRACO (Deep Research Accuracy, Completeness, and Objectivity), a benchmark of open-ended deep research tasks spanning 10 domains and drawing on information sources from 40 countries, derived from anonymized real user requests. Outputs are graded along standardized dimensions, factual accuracy (accuracy), breadth and depth of analysis (completeness), presentation quality (objectivity), and citation quality, providing a reproducible, comparable multi-dimensional evaluation standard for deep-research AI.
Link: https://arxiv.org/abs/2602.11685
Authors: Joey Zhong,Hao Zhang,Clare Southern,Jeremy Yang,Thomas Wang,Kate Jung,Shu Zhang,Denis Yarats,Johnny Ho,Jerry Ma
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:We present DRACO (Deep Research Accuracy, Completeness, and Objectivity), a benchmark of complex deep research tasks. These tasks, which span 10 domains and draw on information sources from 40 countries, originate from anonymized real-world usage patterns within a large-scale deep research system. Tasks are sampled from a de-identified dataset of Perplexity Deep Research requests, then filtered and augmented to ensure that the tasks are anonymized, open-ended and complex, objectively evaluable, and representative of the broad scope of real-world deep research use cases. Outputs are graded against task-specific rubrics along four dimensions: factual accuracy (accuracy), breadth and depth of analysis (including completeness), presentation quality (including objectivity), and citation quality. DRACO is publicly available at this https URL.
[AI-57] Right for the Wrong Reasons: Epistemic Regret Minimization for Causal Rung Collapse in LLMs
【Quick Read】: This paper addresses the "right for the wrong reasons" pathology in machine learning: systems that rely on flawed causal reasoning score well in-distribution but collapse under distributional shift. The root cause is that autoregressive training provides no signal to distinguish the associational probability P(Y|X) from the interventional probability P(Y|do(X)), entrenching the model in flawed causal beliefs, a phenomenon termed Aleatoric Entrenchment. The key to the solution is Epistemic Regret Minimization (ERM), a belief revision objective independent of task success, embedded in a three-layer architecture: (1) a Physical Grounding Theorem proving that actions satisfying actuator independence implement valid do-operations, bridging action languages and do-calculus; (2) ERM as a causal belief revision operator satisfying the AGM postulates, preventing entrenchment of flawed reasoning even when the task succeeds; and (3) a taxonomy of recurring failure modes with injected domain-independent guards enabling cross-domain transfer. The theory proves asymptotic recovery of the true interventional distribution with finite-sample bounds; experiments confirm that Rung Collapse persists in frontier LLMs and that targeted ERM feedback recovers 53-59% of entrenched errors, substantially outperforming outcome-only feedback.
Link: https://arxiv.org/abs/2602.11675
Authors: Edward Y. Chang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 18 pages, 6 tables, 3 figures
Abstract:Machine learning systems that are “right for the wrong reasons” achieve high performance through shortcuts that collapse under distributional shift. We show this pathology has a precise causal origin: autoregressive training provides no gradient signal to distinguish association P(Y|X) from intervention P(Y|do(X)), a failure we formalize as Rung Collapse. When outcome-based learning reinforces correct answers obtained through incorrect causal models, the agent becomes entrenched in flawed reasoning, a phenomenon we term Aleatoric Entrenchment. We propose Epistemic Regret Minimization (ERM), a belief revision objective that penalizes errors in causal reasoning independently of task success, and embed it within a three-layer architecture with three contributions grounded in knowledge representation: (1) a Physical Grounding Theorem proving that actions satisfying actuator independence implement valid do-operations, bridging action languages and do-calculus; (2) ERM as a causal belief revision operator satisfying AGM postulates, preventing entrenchment even when the agent succeeds for the wrong reasons; and (3) a failure mode taxonomy that classifies recurring reasoning errors and injects domain-independent guards, enabling cross-domain transfer. We prove asymptotic recovery of the true interventional distribution with finite-sample bounds. Experiments on 1,360 causal trap scenarios across six frontier LLMs reveal that Rung Collapse persists even in reasoning-enhanced models (3.7% for GPT-5.2), that steerability exhibits inverse scaling where advanced models resist generic correction, and that targeted ERM feedback recovers 53-59% of entrenched errors where outcome-level feedback fails.
[AI-58] Benchmark Health Index: A Systematic Framework for Benchmarking the Benchmarks of LLMs ALT
【Quick Read】: This paper addresses the eroding trustworthiness of benchmarks in current large language model (LLM) evaluation, manifested as score inflation and selective reporting, which leave academia and industry unsure which evaluation results remain reliable. The key to the solution is the Benchmark Health Index (BHI), a data-driven framework that audits evaluation sets along three orthogonal and complementary axes: (1) Capability Discrimination, measuring whether a benchmark separates model performance beyond noise; (2) Anti-Saturation, estimating the remaining headroom before ceiling effects erode resolution and hence the benchmark's expected longevity; and (3) Impact, quantifying adoption breadth and practice-shaping power across academic and industrial ecosystems. By systematically analyzing 106 validated benchmarks distilled from the technical reports of 91 representative models in 2025, BHI quantifies benchmark health at the macro level for the first time, providing a principled basis for benchmark selection and dynamic lifecycle management of next-generation evaluation protocols.
Link: https://arxiv.org/abs/2602.11674
Authors: Longyuan Zhu,Hairan Hua,Linlin Miao,Bing Zhao
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 42 pages, 8 figures, 7 tables. Code and website available at this https URL
Abstract:Large Language Models (LLMs) are advancing rapidly, yet the benchmarks used to measure this progress are becoming increasingly unreliable. Score inflation and selective reporting have eroded the authority of standard benchmarks, leaving the community uncertain about which evaluation results remain trustworthy. We introduce the Benchmark Health Index (BHI), a pure data-driven framework for auditing evaluation sets along three orthogonal and complementary axes: (1) Capability Discrimination, measuring how sharply a benchmark separates model performance beyond noise; (2) Anti-Saturation, estimating remaining headroom before ceiling effects erode resolution and thus the benchmark’s expected longevity; and (3) Impact, quantifying influence across academic and industrial ecosystems via adoption breadth and practice-shaping power. By distilling 106 validated benchmarks from the technical reports of 91 representative models in 2025, we systematically characterize the evaluation landscape. BHI is the first framework to quantify benchmark health at a macro level, providing a principled basis for benchmark selection and enabling dynamic lifecycle management for next-generation evaluation protocols.
[AI-59] Quark Medical Alignment: A Holistic Multi-Dimensional Alignment and Collaborative Optimization Paradigm
【Quick Read】: This paper addresses the challenges of aligning large language models for medical question answering: Reinforcement Learning from Human Feedback relies on preference annotations that are expensive and may not reflect the absolute correctness of medical facts, while Reinforcement Learning from Verifiable Rewards lacks effective automatic verifiers and struggles with complex clinical contexts. Medical alignment must also jointly optimize heterogeneous multi-objective reward signals for correctness, safety, and compliance, which are prone to scale mismatch and optimization instability. The key to the solution is a robust medical alignment paradigm: first, a four-dimensional medical alignment matrix (fundamental capabilities, expert knowledge, online feedback, format specifications) establishes a closed-loop supervision chain from observable metrics to attributable diagnosis to optimizable rewards, providing fine-grained, high-resolution supervision; second, a unified optimization mechanism aligns reward scales via Reference-Frozen Normalization and applies a Tri-Factor Adaptive Dynamic Weighting strategy for weakness-oriented, risk-prioritized, redundancy-reducing collaborative optimization, effectively mitigating gradient domination and optimization instability.
Link: https://arxiv.org/abs/2602.11661
Authors: Tianxiang Xu,Jiayi Liu,Yixuan Tong,Jialu Xu,Yunqing Wei,Kaiwen Feng,PanPan Hou,Kangping Yin,Jiyuan Hu,Hao Zhou,Zhenxin Ma,Jian Xu,Guanjun Jiang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:While reinforcement learning for large language model alignment has progressed rapidly in recent years, transferring these paradigms to high-stakes medical question answering reveals a fundamental paradigm mismatch. Reinforcement Learning from Human Feedback relies on preference annotations that are prohibitively expensive and often fail to reflect the absolute correctness of medical facts. Reinforcement Learning from Verifiable Rewards lacks effective automatic verifiers and struggles to handle complex clinical contexts. Meanwhile, medical alignment requires the simultaneous optimization of correctness, safety, and compliance, yet multi-objective heterogeneous reward signals are prone to scale mismatch and optimization instability. To address these challenges, we propose a robust medical alignment paradigm. We first construct a holistic multi-dimensional medical alignment matrix that decomposes alignment objectives into four categories: fundamental capabilities, expert knowledge, online feedback, and format specifications. Within each category, we establish a closed loop where observable metrics inform attributable diagnosis, which in turn drives optimizable rewards, thereby providing fine-grained, high-resolution supervision signals for subsequent iterative optimization. To resolve the gradient domination and optimization instability caused by heterogeneous signals, we further propose a unified optimization mechanism. This mechanism employs Reference-Frozen Normalization to align reward scales and implements a Tri-Factor Adaptive Dynamic Weighting strategy to achieve collaborative optimization that is weakness-oriented, risk-prioritized, and redundancy-reducing. Experimental results demonstrate the effectiveness of our proposed paradigm in real-world medical scenario evaluations, establishing a new paradigm for complex alignment in vertical domains.
[AI-60] LoRA-based Parameter-Efficient LLM s for Continuous Learning in Edge-based Malware Detection
【Quick Read】: This paper addresses two challenges of malware detection on edge devices: static or centrally trained models degrade against evolving threats and heterogeneous traffic, while locally trained models become data silos with no cross-device knowledge transfer. The key to the solution is a continuous learning architecture based on parameter-efficient fine-tuning that combines local incremental fine-tuning on each edge node with global knowledge sharing via LoRA (Low-Rank Adaptation) adapters. Lightweight Transformer models (DistilBERT, DistilGPT-2, TinyT5) are trained locally on edge devices, and only the resulting LoRA modules are uploaded to a lightweight coordinator for aggregation and redistribution, enabling cross-device generalization without exchanging raw data while keeping communication and storage overhead low. Experiments show 20-25% accuracy gains in multi-round learning scenarios with less than a 1% increase in model size, making the approach practical for resource-constrained edge hardware.
Link: https://arxiv.org/abs/2602.11655
Authors: Christian Rondanini,Barbara Carminati,Elena Ferrari,Niccolò Lardo,Ashish Kundu
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:
Abstract:The proliferation of edge devices has created an urgent need for security solutions capable of detecting malware in real time while operating under strict computational and memory constraints. Recently, Large Language Models (LLMs) have demonstrated remarkable capabilities in recognizing complex patterns, yet their deployment on edge devices remains impractical due to their resource demands. However, in edge malware detection, static or centrally retrained models degrade under evolving threats and heterogeneous traffic; locally trained models become siloed and fail to transfer across domains. To overcome these limitations, in this paper, we present a continuous learning architecture for edge-based malware detection that combines local adaptation on each device with global knowledge sharing through parameter-efficient LoRA adapters. Lightweight transformer models (DistilBERT, DistilGPT-2, TinyT5) run on edge nodes and are incrementally fine-tuned on device-specific traffic; only the resulting LoRA modules are aggregated by a lightweight coordinator and redistributed, enabling cross-device generalization without exchanging raw data. We evaluate on two public IoT security datasets, Edge-IIoTset and TON-IoT, under multi-round learning to simulate evolving threats. Compared to isolated fine-tuning, the LoRA-based exchange yields up to 20-25% accuracy gains when models encounter previously unseen attacks from another domain, while maintaining stable loss and F1 across rounds. LoRA adds less than 1% to model size (~0.6-1.8 MB), making updates practical for constrained edge hardware.
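A rough Python sketch of the coordinator-side step follows: only the small LoRA factor matrices cross the network, which is what keeps per-round updates in the ~0.6-1.8 MB range. The weighted averaging shown is an assumed aggregation rule (note that averaging A and B separately only approximates averaging the full updates B@A); the paper's precise scheme may differ:

```python
import torch

def aggregate_lora_adapters(adapters, weights=None):
    """Average LoRA modules collected from edge nodes.

    adapters: list of dicts mapping layer names to (A, B) low-rank
    factor pairs; these small matrices are all that travels.
    """
    n = len(adapters)
    weights = weights or [1.0 / n] * n
    merged = {}
    for name in adapters[0]:
        merged[name] = (
            sum(w * a[name][0] for w, a in zip(weights, adapters)),
            sum(w * a[name][1] for w, a in zip(weights, adapters)),
        )
    return merged

# Two devices, one adapted layer with rank-4 factors for a 64x64 weight.
dev1 = {"attn.q": (torch.randn(4, 64), torch.randn(64, 4))}
dev2 = {"attn.q": (torch.randn(4, 64), torch.randn(64, 4))}
merged = aggregate_lora_adapters([dev1, dev2])
print(merged["attn.q"][0].shape, merged["attn.q"][1].shape)
```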
[AI-61] DMind-3: A Sovereign Edge–Local–Cloud AI System with Controlled Deliberation and Correction-Based Tuning for Safe Low-Latency Transaction Execution
[Quick Read]: This paper addresses the tension between security and real-time performance in Web3 financial transaction execution: protecting the privacy and integrity of user intent under adversarial risk and strict latency constraints. Cloud-centric solutions leak privacy and degrade under network congestion, while purely local solutions lack global ecosystem context. The key to the solution is a sovereign Edge-Local-Cloud architecture (DMind-3) with three cooperating layers: a deterministic signing-time intent firewall at the edge for execution safety, a private high-fidelity reasoning engine on local hardware for sensitive data, and a policy-governed global context synthesizer in the cloud. It further introduces selective offloading driven by privacy sensitivity and uncertainty, together with two training objectives, Hierarchical Predictive Synthesis (HPS) and Contrastive Chain-of-Correction Supervised Fine-Tuning (C^3-SFT), which raise the multi-turn task success rate to 93.7% and strengthen domain reasoning while keeping users sovereign over their intent.
Link: https://arxiv.org/abs/2602.11651
Authors: Enhao Huang, Frank Li, Tony Lin, Lowes Yang
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper introduces DMind-3, a sovereign Edge-Local-Cloud intelligence stack designed to secure irreversible financial execution in Web3 environments against adversarial risks and strict latency constraints. While existing cloud-centric assistants compromise privacy and fail under network congestion, and purely local solutions lack global ecosystem context, DMind-3 resolves these tensions by decomposing capability into three cooperating layers: a deterministic signing-time intent firewall at the edge, a private high-fidelity reasoning engine on user hardware, and a policy-governed global context synthesizer in the cloud. We propose policy-driven selective offloading to route computation based on privacy sensitivity and uncertainty, supported by two novel training objectives: Hierarchical Predictive Synthesis (HPS) for fusing time-varying macro signals, and Contrastive Chain-of-Correction Supervised Fine-Tuning (C^3-SFT) to enhance local verification reliability. Extensive evaluations demonstrate that DMind-3 achieves a 93.7% multi-turn success rate in protocol-constrained tasks and superior domain reasoning compared to general-purpose baselines, providing a scalable framework where safety is bound to the edge execution primitive while maintaining sovereignty over sensitive user intent.
[AI-62] Variation-aware Flexible 3D Gaussian Editing
[Quick Read]: This paper targets the cross-view inconsistency of indirect editing methods for 3D Gaussian Splatting (3DGS) and the limits they place on editing flexibility and efficiency. Indirect methods apply edits in rendered 2D space and project them back to 3D, which makes multi-view consistency hard to guarantee and complex edits costly. The key to the solution is VF-Editor, which predicts attribute variations for each 3D Gaussian directly with a feedforward variation predictor distilled from 2D editing knowledge: the predictor encodes the input into a variation field and uses two learnable, parallel decoding functions to iteratively infer attribute changes. Thanks to this unified design, VF-Editor can flexibly transfer knowledge from diverse 2D editing strategies into the 3D domain, enabling efficient, consistent, and flexible native editing of 3D Gaussians.
Link: https://arxiv.org/abs/2602.11638
Authors: Hao Qin, Yukai Sun, Meng Wang, Ming Kong, Mengxu Lu, Qiang Zhu
Affiliation: Unknown
Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Indirect editing methods for 3D Gaussian Splatting (3DGS) have recently witnessed significant advancements. These approaches operate by first applying edits in the rendered 2D space and subsequently projecting the modifications back into 3D. However, this paradigm inevitably introduces cross-view inconsistencies and constrains both the flexibility and efficiency of the editing process. To address these challenges, we present VF-Editor, which enables native editing of Gaussian primitives by predicting attribute variations in a feedforward manner. To accurately and efficiently estimate these variations, we design a novel variation predictor distilled from 2D editing knowledge. The predictor encodes the input to generate a variation field and employs two learnable, parallel decoding functions to iteratively infer attribute changes for each 3D Gaussian. Thanks to its unified design, VF-Editor can seamlessly distill editing knowledge from diverse 2D editors and strategies into a single predictor, allowing for flexible and effective knowledge transfer into the 3D domain. Extensive experiments on both public and private datasets reveal the inherent limitations of indirect editing pipelines and validate the effectiveness and flexibility of our approach.
[AI-63] Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation
[Quick Read]: This paper addresses the pronounced weakness of current multimodal large language models (MLLMs) in mathematical spatial reasoning, i.e., accurately parsing and manipulating 2D and 3D spatial relations: humans solve textbook-style spatial problems with over 95% accuracy, while leading MLLMs stay below 60%. The key to the solution is the MathSpatial framework, which evaluates and improves spatial reasoning through three complementary components: (i) MathSpatial-Bench, a 2K-problem benchmark that isolates reasoning difficulty from perceptual noise; (ii) MathSpatial-Corpus, a training set of 8K problems with verified solutions; and (iii) MathSpatial-SRT, which models reasoning as structured traces built from three atomic operations: Correlate, Constrain, and Infer. Fine-tuning Qwen2.5-VL-7B on MathSpatial achieves competitive accuracy while using 25% fewer tokens, providing the first large-scale resource that disentangles perception from reasoning and enables systematic measurement and improvement of mathematical spatial reasoning in MLLMs.
Link: https://arxiv.org/abs/2602.11635
Authors: Shuo Lu, Jianjie Cheng, Yinuo Xu, Yongcan Yu, Lijun Sheng, Peijie Wang, Siru Jiang, Yongguan Hu, Run Ling, Yihua Shao, Ao Ma, Wei Feng, Lingxiao He, Meng Wang, Qianlong Xie, Xingxing Wang, Ran He, Jian Liang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Multimodal large language models (MLLMs) have achieved strong performance on perception-oriented tasks, yet their ability to perform mathematical spatial reasoning, defined as the capacity to parse and manipulate two- and three-dimensional relations, remains unclear. Humans easily solve textbook-style spatial reasoning problems with over 95% accuracy, but we find that most leading MLLMs fail to reach even 60% on the same tasks. This striking gap highlights spatial reasoning as a fundamental weakness of current models. To investigate this gap, we present MathSpatial, a unified framework for evaluating and improving spatial reasoning in MLLMs. MathSpatial includes three complementary components: (i) MathSpatial-Bench, a benchmark of 2K problems across three categories and eleven subtypes, designed to isolate reasoning difficulty from perceptual noise; (ii) MathSpatial-Corpus, a training dataset of 8K additional problems with verified solutions; and (iii) MathSpatial-SRT, which models reasoning as structured traces composed of three atomic operations–Correlate, Constrain, and Infer. Experiments show that fine-tuning Qwen2.5-VL-7B on MathSpatial achieves competitive accuracy while reducing tokens by 25%. MathSpatial provides the first large-scale resource that disentangles perception from reasoning, enabling precise measurement and comprehensive understanding of mathematical spatial reasoning in MLLMs.
[AI-64] Neuro-Symbolic Multitasking: A Unified Framework for Discovering Generalizable Solutions to PDE Families
[Quick Read]: This paper addresses computational efficiency and interpretability when solving families of partial differential equations (PDEs). Traditional numerical methods such as the finite element method solve each instance of a PDE family independently, at massive computational cost, while existing machine-learning PDE solvers are fast and accurate but act as black boxes, producing numerical approximations without the analytical expressions needed for scientific insight. The key to the solution is the Neuro-assisted Multitasking Symbolic PDE Solver (NMIPS): it uses multifactorial optimization to simultaneously discover analytical solutions for the members of a PDE family, and an affine transfer method to carry learned mathematical structures between PDEs in the family, avoiding solving each one from scratch, raising efficiency, and yielding interpretable analytical solutions.
Link: https://arxiv.org/abs/2602.11630
Authors: Yipeng Huang, Dejun Xu, Zexin Lin, Zhenzhong Wang, Min Jiang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Solving Partial Differential Equations (PDEs) is fundamental to numerous scientific and engineering disciplines. A common challenge arises from solving the PDE families, which are characterized by sharing an identical mathematical structure but varying in specific parameters. Traditional numerical methods, such as the finite element method, need to independently solve each instance within a PDE family, which incurs massive computational cost. On the other hand, while recent advancements in machine learning PDE solvers offer impressive computational speed and accuracy, their inherent "black-box" nature presents a considerable limitation. These methods primarily yield numerical approximations, thereby lacking the crucial interpretability provided by analytical expressions, which are essential for deeper scientific insight. To address these limitations, we propose a neuro-assisted multitasking symbolic PDE solver framework for PDE family solving, dubbed NMIPS. In particular, we employ multifactorial optimization to simultaneously discover the analytical solutions of PDEs. To enhance computational efficiency, we devise an affine transfer method by transferring learned mathematical structures among PDEs in a family, avoiding solving each PDE from scratch. Experimental results across multiple cases demonstrate promising improvements over existing baselines, achieving up to a ~35.7% increase in accuracy while providing interpretable analytical solutions.
[AI-65] ArGEnT: Arbitrary Geometry-encoded Transformer for Operator Learning
[Quick Read]: This paper addresses operator learning in scientific machine learning for scenarios with complex, varying geometries and parametric physical settings, especially many-query regimes (design optimization, control, and inverse problems) that require flexible evaluation at arbitrary spatial locations across different geometries. The core of the solution is the Arbitrary Geometry-encoded Transformer (ArGEnT), a geometry-aware attention architecture that encodes geometric information directly from point-cloud representations via self-attention, cross-attention, and hybrid-attention variants, and plugs into DeepONet as the trunk network, so operator mappings that depend on both geometric and non-geometric inputs can be learned without explicitly parameterizing geometry as a branch-network input. The method clearly improves prediction accuracy and generalization over the standard DeepONet and other existing geometry-aware surrogates on benchmarks spanning fluid dynamics, solid mechanics, and electrochemical systems; in particular, the cross-attention variant reduces reliance on signed distance functions while delivering accurate geometry-conditioned predictions.
Link: https://arxiv.org/abs/2602.11626
Authors: Wenqian Chen, Yucheng Fu, Michael Penwarden, Pratanu Roy, Panos Stinis
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn)
Comments: 69 pages, 21 figures, 10 tables
Abstract:Learning solution operators for systems with complex, varying geometries and parametric physical settings is a central challenge in scientific machine learning. In many-query regimes such as design optimization, control and inverse problems, surrogate modeling must generalize across geometries while allowing flexible evaluation at arbitrary spatial locations. In this work, we propose Arbitrary Geometry-encoded Transformer (ArGEnT), a geometry-aware attention-based architecture for operator learning on arbitrary domains. ArGEnT employs Transformer attention mechanisms to encode geometric information directly from point-cloud representations, with three variants (self-attention, cross-attention, and hybrid-attention) that implement different strategies for incorporating geometric features. By integrating ArGEnT into DeepONet as the trunk network, we develop a surrogate modeling framework capable of learning operator mappings that depend on both geometric and non-geometric inputs without the need to explicitly parametrize geometry as a branch network input. Evaluating on benchmark problems spanning fluid dynamics, solid mechanics, and electrochemical systems, we demonstrate significantly improved prediction accuracy and generalization performance compared with the standard DeepONet and other existing geometry-aware surrogates. In particular, the cross-attention transformer variant enables accurate geometry-conditioned predictions with reduced reliance on signed distance functions. By combining flexible geometry encoding with operator-learning capabilities, ArGEnT provides a scalable surrogate modeling framework for optimization, uncertainty quantification, and data-driven modeling of complex physical systems.
[AI-66] When Agents Disagree With Themselves: Measuring Behavioral Consistency in LLM-Based Agents
[Quick Read]: This paper addresses the behavioral inconsistency of large language model (LLM) agents when the same task is executed repeatedly, and its impact on task success and system reliability. Across 3,000 runs of three popular models (Llama 3.1 70B, GPT-4o, and Claude Sonnet 4.5) on HotpotQA, ReAct-style agents produce 2.0-4.2 distinct action sequences per 10 runs on average, and behavioral consistency correlates strongly with accuracy: consistent tasks (≤ 2 unique paths) reach 80-92% accuracy, while highly inconsistent tasks (≥ 6 unique paths) reach only 25-60%. The key finding is that variance concentrates at early decision points: 69% of divergence occurs at the first search query (step 2), suggesting that monitoring behavioral consistency during execution can enable early error detection and improve the reliability and robustness of LLM agents.
Link: https://arxiv.org/abs/2602.11619
Authors: Aman Mehta
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 5 pages, 2 figures
Abstract:Run the same LLM agent on the same task twice: do you get the same behavior? We find the answer is often no. In a study of 3,000 agent runs across three models (Llama 3.1 70B, GPT-4o, and Claude Sonnet 4.5) on HotpotQA, we observe that ReAct-style agents produce 2.0–4.2 distinct action sequences per 10 runs on average, even with identical inputs. More importantly, this variance predicts failure: tasks with consistent behavior (≤ 2 unique paths) achieve 80–92% accuracy, while highly inconsistent tasks (≥ 6 unique paths) achieve only 25–60%, a 32–55 percentage point gap depending on model. We trace variance to early decisions: 69% of divergence occurs at step 2, the first search query. Our results suggest that monitoring behavioral consistency during execution could enable early error detection and improve agent reliability.
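The core measurement is simple to reproduce: run an agent repeatedly on one task and count distinct action sequences. A minimal Python sketch (the run encoding and the example data are illustrative):

```python
from collections import Counter

def behavioral_consistency(runs):
    """Summarize path diversity over repeated runs of one task.

    Each run is a tuple of (action, argument) steps; identical tuples
    mean the agent behaved identically across runs.
    """
    paths = Counter(tuple(run) for run in runs)
    return {
        "unique_paths": len(paths),
        "modal_share": paths.most_common(1)[0][1] / len(runs),
    }

runs = [
    (("search", "capital of France"), ("finish", "Paris")),
    (("search", "capital of France"), ("finish", "Paris")),
    (("search", "France capital city"), ("finish", "Paris")),
]
print(behavioral_consistency(runs))  # {'unique_paths': 2, 'modal_share': 0.66...}
# Under the paper's reported gap, a task with unique_paths >= 6 out of
# 10 runs would be flagged as high-risk.
```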
[AI-67] scPilot: Large Language Model Reasoning Toward Automated Single-Cell Analysis and Discovery NEURIPS2025
[Quick Read]: This paper addresses the inability of current large language model (LLM) pipelines to directly inspect and reason over raw omics data in single-cell genomics: conventional LLM workflows rely on pre-processed results or offline knowledge, making interpretable, auditable analyses tightly coupled to experimental data difficult. The key to the solution is the scPilot framework, the first to realize omics-native reasoning: an LLM converses in natural language while directly accessing single-cell RNA-seq data and on-demand bioinformatics tools, turning core single-cell analyses, including cell-type annotation, developmental-trajectory reconstruction, and transcription-factor targeting, into step-by-step reasoning problems the model must solve autonomously, justify with evidence, and revise as new information arrives. The approach yields clear accuracy gains (e.g., an average 11% improvement in cell-type annotation) and transparent reasoning traces, offering a verifiable and diagnostically informative paradigm for generative AI in precision biology.
Link: https://arxiv.org/abs/2602.11609
Authors: Yiming Gao, Zhen Wang, Jefferson Chen, Mark Antkowiak, Mengzhou Hu, JungHo Kong, Dexter Pratt, Jieyuan Liu, Enze Ma, Zhiting Hu, Eric P. Xing
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
Comments: Accepted at NeurIPS 2025 Main Conference
Abstract:We present scPilot, the first systematic framework to practice omics-native reasoning: a large language model (LLM) converses in natural language while directly inspecting single-cell RNA-seq data and on-demand bioinformatics tools. scPilot converts core single-cell analyses, i.e., cell-type annotation, developmental-trajectory reconstruction, and transcription-factor targeting, into step-by-step reasoning problems that the model must solve, justify, and, when needed, revise with new evidence. To measure progress, we release scBench, a suite of 9 expertly curated datasets and graders that faithfully evaluate the omics-native reasoning capability of scPilot w.r.t. various LLMs. Experiments with o1 show that iterative omics-native reasoning lifts average accuracy by 11% for cell-type annotation and Gemini-2.5-Pro cuts trajectory graph-edit distance by 30% versus one-shot prompting, while generating transparent reasoning traces explain marker gene ambiguity and regulatory logic. By grounding LLMs in raw omics data, scPilot enables auditable, interpretable, and diagnostically informative single-cell analyses. Code, data, and package are available at this https URL.
[AI-68] MAPLE: Modality-Aware Post-training and Learning Ecosystem
[Quick Read]: This paper addresses modality blindness in reinforcement-learning post-training of multimodal language models: existing pipelines treat all input signals as equally relevant and ignore which modalities each task actually requires, inflating policy-gradient variance, slowing convergence, and hurting robustness when signals go missing or shift in the real world. The key to the solution is MAPLE, a complete modality-aware post-training ecosystem with three components: (1) MAPLE-bench, the first benchmark explicitly annotating the minimal signal combination required per task; (2) MAPO (Modality-Aware Policy Optimization), which stratifies batches by modality requirement to reduce the gradient variance caused by heterogeneous group advantages; and (3) adaptive weighting with curriculum scheduling that dynamically balances and prioritizes harder signal combinations. MAPLE narrows uni/multi-modal accuracy gaps by 30.24%, converges 3.18x faster, and remains stable across modality combinations under limited signal access.
Link: https://arxiv.org/abs/2602.11596
Authors: Nikhil Verma, Minjung Kim, JooYoung Yoo, Kyung-Min Jin, Manasa Bharadwaj, Kevin Ferreira, Ko Keun Kim, Youngjoon Kim
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 31 pages
Abstract:Multimodal language models now integrate text, audio, and video for unified reasoning. Yet existing RL post-training pipelines treat all input signals as equally relevant, ignoring which modalities each task actually requires. This modality-blind training inflates policy-gradient variance, slows convergence, and degrades robustness to real-world distribution shifts where signals may be missing, added, or reweighted. We introduce MAPLE, a complete modality-aware post-training and learning ecosystem comprising: (1) MAPLE-bench, the first benchmark explicitly annotating minimal signal combinations required per task; (2) MAPO, a modality-aware policy optimization framework that stratifies batches by modality requirement to reduce gradient variance from heterogeneous group advantages; (3) Adaptive weighting and curriculum scheduling that balances and prioritizes harder signal combinations. Systematic analysis across loss aggregation, clipping, sampling, and curriculum design establishes MAPO’s optimal training strategy. Adaptive weighting and curriculum focused learning further boost performance across signal combinations. MAPLE narrows uni/multi-modal accuracy gaps by 30.24%, converges 3.18x faster, and maintains stability across all modality combinations under realistic reduced signal access. MAPLE constitutes a complete recipe for deployment-ready multimodal RL post-training.
[AI-69] Gradient Compression May Hurt Generalization: A Remedy by Synthetic Data Guided Sharpness Aware Minimization
[Quick Read]: This paper addresses the sharper loss landscapes induced by gradient compression in federated learning (FL), which weaken generalization, especially under non-IID data. Sharpness Aware Minimization (SAM) can find flat minima by adding a gradient-ascent step before the usual descent, but applying it directly in FL suffers from inaccurate estimates of the global perturbation due to data heterogeneity, particularly when model-update compression is used. The key to the proposed FedSynSAM is to leverage the global model trajectory to construct synthetic data, enabling a more accurate estimate of the global perturbation, with established convergence guarantees and improved applicability of SAM in FL.
Link: https://arxiv.org/abs/2602.11584
Authors: Yujie Gu, Richeng Jin, Zhaoyang Zhang, Huaiyu Dai
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:It is commonly believed that gradient compression in federated learning (FL) enjoys significant improvement in communication efficiency with negligible performance degradation. In this paper, we find that gradient compression induces sharper loss landscapes in federated learning, particularly under non-IID data distributions, which suggests hindered generalization capability. The recently emerging Sharpness Aware Minimization (SAM) effectively searches for a flat minima by incorporating a gradient ascent step (i.e., perturbing the model with gradients) before the celebrated stochastic gradient descent. Nonetheless, the direct application of SAM in FL suffers from inaccurate estimation of the global perturbation due to data heterogeneity. Existing approaches propose to utilize the model update from the previous communication round as a rough estimate. However, its effectiveness is hindered when model update compression is incorporated. In this paper, we propose FedSynSAM, which leverages the global model trajectory to construct synthetic data and facilitates an accurate estimation of the global perturbation. The convergence of the proposed algorithm is established, and extensive experiments are conducted to validate its effectiveness.
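To see where the global perturbation estimate enters, here is a generic SAM-style update in PyTorch that takes the perturbation direction as an input rather than computing it from the local batch. In FedSynSAM that direction would be estimated using synthetic data distilled from the global model trajectory; the random tensors used below are purely placeholders:

```python
import torch

def sam_step(model, loss_fn, data, target, base_opt, perturb, rho=0.05):
    """One SAM-style update with an externally supplied ascent direction.

    perturb: dict of tensors (keyed by parameter name) approximating the
    global sharpest direction; a stand-in for the paper's estimate.
    """
    # 1. Perturb weights toward the estimated sharpest direction.
    scale = rho / (sum(p.norm() ** 2 for p in perturb.values()).sqrt() + 1e-12)
    for name, p in model.named_parameters():
        p.data.add_(perturb[name], alpha=scale.item())
    # 2. Compute the gradient at the perturbed point.
    base_opt.zero_grad()
    loss_fn(model(data), target).backward()
    # 3. Restore weights, then descend using the perturbed gradient.
    for name, p in model.named_parameters():
        p.data.sub_(perturb[name], alpha=scale.item())
    base_opt.step()

model = torch.nn.Linear(8, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
perturb = {n: torch.randn_like(p) for n, p in model.named_parameters()}
sam_step(model, torch.nn.functional.cross_entropy,
         torch.randn(16, 8), torch.randint(0, 2, (16,)), opt, perturb)
```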
[AI-70] The Five Ws of Multi-Agent Communication: Who Talks to Whom, When, What, and Why – A Survey from MARL to Emergent Language and LLMs
[Quick Read]: This paper addresses the design and optimization of communication mechanisms in multi-agent systems, in particular the difficulty of achieving efficient, interpretable collaboration in dynamic, partially observable environments, where the central challenge is designing communication strategies that reduce uncertainty and improve coordination across task settings. The key to the solution is a systematic survey of multi-agent communication (MA-Comm) through the Five Ws (who communicates with whom, what is communicated, when communication occurs, and why it is beneficial), which organizes the field into three paradigms: reinforcement-learning-based communication (MARL), emergent language (EL), and large language model (LLM)-driven communication. The critical progression runs from early hand-designed or implicit protocols to end-to-end learned communication, then to symbolic emergent language, and finally to LLMs that bring natural-language priors for open-ended reasoning and collaboration, seeking a balance among scalability, generality, and interpretability.
Link: https://arxiv.org/abs/2602.11583
Authors: Jingdi Chen, Hanqing Yang, Zongjun Liu, Carlee Joe-Wong
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at Transactions on Machine Learning Research (TMLR), 2026
Abstract:Multi-agent sequential decision-making powers many real-world systems, from autonomous vehicles and robotics to collaborative AI assistants. In dynamic, partially observable environments, communication is often what reduces uncertainty and makes collaboration possible. This survey reviews multi-agent communication (MA-Comm) through the Five Ws: who communicates with whom, what is communicated, when communication occurs, and why communication is beneficial. This framing offers a clean way to connect ideas across otherwise separate research threads. We trace how communication approaches have evolved across three major paradigms. In Multi-Agent Reinforcement Learning (MARL), early methods used hand-designed or implicit protocols, followed by end-to-end learned communication optimized for reward and control. While successful, these protocols are frequently task-specific and hard to interpret, motivating work on Emergent Language (EL), where agents can develop more structured or symbolic communication through interaction. EL methods, however, still struggle with grounding, generalization, and scalability, which has fueled recent interest in large language models (LLMs) that bring natural language priors for reasoning, planning, and collaboration in more open-ended settings. Across MARL, EL, and LLM-based systems, we highlight how different choices shape communication design, where the main trade-offs lie, and what remains unsolved. We distill practical design patterns and open challenges to support future hybrid systems that combine learning, language, and control for scalable and interpretable multi-agent collaboration.
[AI-71] Learning to Configure Agentic AI Systems
[Quick Read]: This paper addresses the one-size-fits-all configuration of large language model (LLM)-driven agent systems: workflows, tools, token budgets, and prompts are typically set by fixed templates or hand-tuned heuristics, so the same configuration is applied to both easy and hard queries, causing brittle behavior and wasted compute. The key to the solution is to model agent configuration as a query-wise decision problem and introduce ARC (Agentic Resource Configuration learner), which trains a lightweight hierarchical policy with reinforcement learning to dynamically tailor resource configurations to each input query. Across reasoning and tool-augmented question-answering benchmarks, the learned policy consistently beats strong hand-designed and other baselines, improving task accuracy by up to 25% while reducing token and runtime costs.
Link: https://arxiv.org/abs/2602.11574
Authors: Aditya Taparia, Som Sagar, Ransalu Senanayake
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 21 pages, 13 figures
Abstract:Configuring LLM-based agent systems involves choosing workflows, tools, token budgets, and prompts from a large combinatorial design space, and is typically handled today by fixed large templates or hand-tuned heuristics. This leads to brittle behavior and unnecessary compute, since the same cumbersome configuration is often applied to both easy and hard input queries. We formulate agent configuration as a query-wise decision problem and introduce ARC (Agentic Resource Configuration learner), which learns a light-weight hierarchical policy using reinforcement learning to dynamically tailor these configurations. Across multiple benchmarks spanning reasoning and tool-augmented question answering, the learned policy consistently outperforms strong hand-designed and other baselines, achieving up to 25% higher task accuracy while also reducing token and runtime costs. These results demonstrate that learning per-query agent configurations is a powerful alternative to “one size fits all” designs.
[AI-72] SemaPop: Semantic-Persona Conditioned Population Synthesis
[Quick Read]: This paper addresses the limitation of traditional population synthesis in jointly modeling statistical structure and latent behavioral semantics, in particular the difficulty of capturing abstract behavioral patterns implicit in survey data. The key to the solution is the SemaPop framework, which combines large language models (LLMs) with generative population modeling: high-level persona representations are derived from individual survey records and used as semantic conditioning signals for population generation, while marginal regularization enforces alignment between generated samples and target population marginals. Instantiated with a Wasserstein GAN with gradient penalty (WGAN-GP) backbone as SemaPop-GAN, the method matches target marginal and joint distributions more closely while preserving sample-level feasibility and diversity, and ablation studies confirm that semantic persona conditioning and the architectural design help balance marginal consistency and structural realism.
Link: https://arxiv.org/abs/2602.11569
Authors: Zhenlin Qin, Yancheng Ling, Leizhen Wang, Francisco Câmara Pereira, Zhenliang Ma
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Population synthesis is a critical component of individual-level socio-economic simulation, yet remains challenging due to the need to jointly represent statistical structure and latent behavioral semantics. Existing population synthesis approaches predominantly rely on structured attributes and statistical constraints, leaving a gap in semantic-conditioned population generation that can capture abstract behavioral patterns implicitly in survey data. This study proposes SemaPop, a semantic-statistical population synthesis model that integrates large language models (LLMs) with generative population modeling. SemaPop derives high-level persona representations from individual survey records and incorporates them as semantic conditioning signals for population generation, while marginal regularization is introduced to enforce alignment with target population marginals. In this study, the framework is instantiated using a Wasserstein GAN with gradient penalty (WGAN-GP) backbone, referred to as SemaPop-GAN. Extensive experiments demonstrate that SemaPop-GAN achieves improved generative performance, yielding closer alignment with target marginal and joint distributions while maintaining sample-level feasibility and diversity under semantic conditioning. Ablation studies further confirm the contribution of semantic persona conditioning and architectural design choices to balancing marginal consistency and structural realism. These results demonstrate that SemaPop-GAN enables controllable and interpretable population synthesis through effective semantic-statistical information fusion. SemaPop-GAN also provides a promising modular foundation for developing generative population projection systems that integrate individual-level behavioral semantics with population-level statistical constraints.
[AI-73] TS-Memory: Plug-and-Play Memory for Time Series Foundation Models
[Quick Read]: This paper addresses the difficulty of adapting Time Series Foundation Models (TSFMs) to downstream domains under distribution shift. Existing approaches face a dilemma: parametric adaptation risks catastrophic forgetting and costly multi-domain maintenance, while non-parametric retrieval improves forecasts but adds high inference latency. The key to the solution is Parametric Memory Distillation, implemented as the lightweight memory adapter TS-Memory and trained in two stages: first, an offline, leakage-safe kNN teacher synthesizes confidence-aware quantile targets; second, the retrieval-induced distributional correction is distilled into the lightweight memory module via confidence-gated supervision. At inference, TS-Memory fuses memory and backbone predictions with constant-time overhead, enabling retrieval-free deployment that outperforms mainstream adaptation methods on both point and probabilistic forecasting while staying as efficient as the frozen backbone.
Link: https://arxiv.org/abs/2602.11550
Authors: Sisuo Lyu, Siru Zhong, Tiegang Chen, Weilin Ruan, Qingxiang Liu, Taiqiang Lv, Qingsong Wen, Raymond Chi-Wing Wong, Yuxuan Liang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Time Series Foundation Models (TSFMs) achieve strong zero-shot forecasting through large-scale pre-training, but adapting them to downstream domains under distribution shift remains challenging. Existing solutions face a trade-off: Parametric Adaptation can cause catastrophic forgetting and requires costly multi-domain maintenance, while Non-Parametric Retrieval improves forecasts but incurs high inference latency due to datastore search. We propose Parametric Memory Distillation and implement it as TS-Memory, a lightweight memory adapter that augments frozen TSFMs. TS-Memory is trained in two stages. First, we construct an offline, leakage-safe kNN teacher that synthesizes confidence-aware quantile targets from retrieved futures. Second, we distill this retrieval-induced distributional correction into a lightweight memory adapter via confidence-gated supervision. During inference, TS-Memory fuses memory and backbone predictions with constant-time overhead, enabling retrieval-free deployment. Experiments across diverse TSFMs and benchmarks demonstrate consistent improvements in both point and probabilistic forecasting over representative adaptation methods, with efficiency comparable to the frozen backbone.
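The constant-time inference path amounts to a gated blend of two forecasts. A minimal sketch (the tensor shapes and the per-series scalar gate are assumptions; the paper's fusion rule may be richer):

```python
import torch

def fuse_predictions(backbone_q, memory_q, confidence):
    """Constant-time fusion of backbone and memory quantile forecasts.

    confidence in [0, 1] gates how much the memory correction is trusted
    per series; both inputs have shape (batch, horizon, quantiles).
    """
    gate = confidence.view(-1, 1, 1)  # broadcast over horizon/quantiles
    return gate * memory_q + (1.0 - gate) * backbone_q

backbone_q = torch.randn(4, 24, 3)                      # frozen TSFM output
memory_q = backbone_q + 0.1 * torch.randn(4, 24, 3)     # adapter's correction
confidence = torch.tensor([0.9, 0.2, 0.5, 0.7])
print(fuse_predictions(backbone_q, memory_q, confidence).shape)
```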
[AI-74] Native Reasoning Models: Training Language Models to Reason on Unverifiable Data ICLR2026
[Quick Read]: This paper addresses three core problems of the prevailing training paradigm for large reasoning models (supervised fine-tuning combined with reinforcement learning with verifiable rewards): heavy reliance on high-quality human-annotated reasoning data, the restriction of RL to verifiable domains such as mathematics and coding imposed by external verifiers, and the resulting data-collection cost and risk of embedding human cognitive bias. The key to the solution is the NRT (Native Reasoning Training) framework, whose central idea is to treat the reasoning process as a latent variable and cast it as an optimization problem under a unified training objective: using only standard question-answer pairs, with no external verifier or expert demonstrations, the model generates its own reasoning traces and is intrinsically rewarded for paths that raise the likelihood of the ground-truth answer, forming a self-reinforcing feedback loop that markedly strengthens complex reasoning and improves robustness to policy collapse.
Link: https://arxiv.org/abs/2602.11549
Authors: Yuanfu Wang, Zhixuan Liu, Xiangtian Li, Chaochao Lu, Chao Yang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ICLR 2026
Abstract:The prevailing paradigm for training large reasoning models–combining Supervised Fine-Tuning (SFT) with Reinforcement Learning with Verifiable Rewards (RLVR)–is fundamentally constrained by its reliance on high-quality, human-annotated reasoning data and external verifiers. This dependency incurs significant data-collection costs, risks embedding human cognitive biases, and confines the reinforcement learning stage to objectively assessable domains like mathematics and coding, leaving a wide range of unverifiable tasks beyond its scope. To overcome these limitations, we introduce NRT (Native Reasoning Training), a novel framework that cultivates complex reasoning by having the model generate its own reasoning traces using only standard question-answer pairs, thereby obviating the need for expert-written demonstrations. NRT reframes the training problem by treating the reasoning process as a latent variable. It employs a unified training objective that models reasoning as an optimization problem, intrinsically rewarding paths that increase the model’s likelihood of producing the ground-truth answer. This unified perspective allows us to analyze intrinsic failure modes of prior methods, such as policy collapse, and systematically design more robust reward aggregation functions, creating a self-reinforcing feedback loop where the model learns to think in ways that resolve its own uncertainty. Empirical evaluation on Llama and Mistral model families demonstrates that NRT achieves state-of-the-art performance among verifier-free methods, significantly outperforming standard SFT baselines and prior verifier-free RL methods. Our approach yields particularly strong performance gains in complex reasoning domains and exhibits high robustness to policy collapse, offering a general, scalable path toward building more powerful and broadly applicable reasoning systems.
[AI-75] Budget-Constrained Agentic Large Language Models: Intention-Based Planning for Costly Tool Use
[Quick Read]: This paper addresses decision optimization for budget-constrained tool-augmented agents: completing multi-step tasks by invoking external tools under a strict monetary budget. Direct planning is intractable because the state-action space is massive, tool executions are stochastic, and exploration is expensive. The key to the solution is the INTENT framework, built on an intention-aware hierarchical world model that predicts future tool-usage patterns and risk-calibrated costs online at inference time to guide decisions, strictly enforcing hard budget feasibility while substantially raising task success and remaining robust to market dynamics such as tool-price changes and varying budgets.
Link: https://arxiv.org/abs/2602.11541
Authors: Hanbing Liu, Chunhao Tian, Nan An, Ziyuan Wang, Pinyan Lu, Changyuan Yu, Qi Qi
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:We study budget-constrained tool-augmented agents, where a large language model must solve multi-step tasks by invoking external tools under a strict monetary budget. We formalize this setting as sequential decision making in context space with priced and stochastic tool executions, making direct planning intractable due to massive state-action spaces, high variance of outcomes and prohibitive exploration cost. To address these challenges, we propose INTENT, an inference-time planning framework that leverages an intention-aware hierarchical world model to anticipate future tool usage, risk-calibrated cost, and guide decisions online. Across cost-augmented StableToolBench, INTENT strictly enforces hard budget feasibility while substantially improving task success over baselines, and remains robust under dynamic market shifts such as tool price changes and varying budgets.
[AI-76] Krause Synchronization Transformers KR
[Quick Read]: This paper targets the attention-concentration problem (the attention sink) caused by globally normalized softmax weights in Transformer self-attention, which induces representation collapse and strong cross-layer synchronization dynamics that limit expressiveness and efficiency. The key to the solution is Krause Attention, inspired by bounded-confidence consensus dynamics: it replaces similarity-based global aggregation with distance-based, localized, selectively sparse interactions, promoting structured local synchronization instead of global mixing and thereby moderating attention concentration; restricting interactions to local neighborhoods also cuts runtime complexity from quadratic to linear in sequence length.
Link: https://arxiv.org/abs/2602.11534
Authors: Jingkun Liu, Yisong Yue, Max Welling, Yue Song
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL
Abstract:Self-attention in Transformers relies on globally normalized softmax weights, causing all tokens to compete for influence at every layer. When composed across depth, this interaction pattern induces strong synchronization dynamics that favor convergence toward a dominant mode, a behavior associated with representation collapse and attention sink phenomena. We introduce Krause Attention, a principled attention mechanism inspired by bounded-confidence consensus dynamics. Krause Attention replaces similarity-based global aggregation with distance-based, localized, and selectively sparse interactions, promoting structured local synchronization instead of global mixing. We relate this behavior to recent theory modeling Transformer dynamics as interacting particle systems, and show how bounded-confidence interactions naturally moderate attention concentration and alleviate attention sinks. Restricting interactions to local neighborhoods also reduces runtime complexity from quadratic to linear in sequence length. Experiments across vision (ViT on CIFAR/ImageNet), autoregressive generation (MNIST/CIFAR-10), and large language models (Llama/Qwen) demonstrate consistent gains with substantially reduced computation, highlighting bounded-confidence dynamics as a scalable and effective inductive bias for attention.
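The bounded-confidence idea can be illustrated in a few lines: each token averages only neighbors within a distance threshold instead of softmax-mixing with every token. This toy version still builds the full distance matrix, so it does not show the linear-time neighborhood restriction the paper uses; the eps value and iterated-update demo are illustrative:

```python
import torch

def krause_attention(x, eps=1.0):
    """Bounded-confidence aggregation over token states.

    Each token averages only those tokens whose representation lies
    within distance eps, mimicking Krause/Hegselmann-Krause consensus
    dynamics rather than global softmax mixing. x: (seq_len, dim).
    """
    dists = torch.cdist(x, x)                    # pairwise L2 distances
    mask = (dists <= eps).float()                # bounded-confidence neighbors
    weights = mask / mask.sum(dim=-1, keepdim=True)  # each row includes self
    return weights @ x                           # local consensus update

x = torch.randn(8, 16)
for _ in range(5):                               # iterate the dynamics
    x = krause_attention(x, eps=3.0)
print(x.shape)  # local clusters form instead of one global dominant mode
```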
[AI-77] AltTS: A Dual-Path Framework with Alternating Optimization for Multivariate Time Series Forecasting
[Quick Read]: This paper addresses the optimization conflict that arises in multivariate time series forecasting when a single model must capture both stable within-series autoregressive (AR) dynamics and intermittent cross-dimension interactions: the high-variance updates required for cross-dimension modeling corrupt the gradients that support autoregression, producing brittle training and degraded long-horizon accuracy. The key to the solution is ALTTS, an alternating dual-path framework that explicitly decouples AR and cross-relation (CR) modeling: the AR path is a linear predictor, the CR path is a Transformer equipped with Cross-Relation Self-Attention (CRSA), and the two branches are coordinated through alternating optimization to isolate gradient noise and reduce cross-module interference, with the largest gains on long-horizon forecasting.
Link: https://arxiv.org/abs/2602.11533
Authors: Zhihang Yuan, Zhiyuan Liu, Mahesh K. Marina
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint
Abstract:Multivariate time series forecasting involves two qualitatively distinct factors: (i) stable within-series autoregressive (AR) dynamics, and (ii) intermittent cross-dimension interactions that can become spurious over long horizons. We argue that fitting a single model to capture both effects creates an optimization conflict: the high-variance updates needed for cross-dimension modeling can corrupt the gradients that support autoregression, resulting in brittle training and degraded long-horizon accuracy. To address this, we propose ALTTS, a dual-path framework that explicitly decouples autoregression and cross-relation (CR) modeling. In ALTTS, the AR path is instantiated with a linear predictor, while the CR path uses a Transformer equipped with Cross-Relation Self-Attention (CRSA); the two branches are coordinated via alternating optimization to isolate gradient noise and reduce cross-block interference. Extensive experiments on multiple benchmarks show that ALTTS consistently outperforms prior methods, with the most pronounced improvements on long-horizon forecasting. Overall, our results suggest that carefully designed optimization strategies, rather than ever more complex architectures, can be a key driver of progress in multivariate time series forecasting.
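The alternating schedule is the load-bearing idea: each path is updated while the other is detached, so CR gradient noise cannot leak into the AR weights. A toy PyTorch sketch (the MLP stands in for the CRSA Transformer, and the every-other-step schedule is an assumption about how the alternation might be scheduled):

```python
import torch

ar = torch.nn.Linear(96, 24)                        # linear AR path
cr = torch.nn.Sequential(torch.nn.Linear(96, 64),   # stand-in for the CR path
                         torch.nn.ReLU(),
                         torch.nn.Linear(64, 24))
opt_ar = torch.optim.Adam(ar.parameters(), lr=1e-3)
opt_cr = torch.optim.Adam(cr.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

x, y = torch.randn(32, 96), torch.randn(32, 24)     # toy lookback/horizon pair
for step in range(100):
    if step % 2 == 0:                               # AR phase: CR frozen
        loss = loss_fn(ar(x) + cr(x).detach(), y)
        opt_ar.zero_grad(); loss.backward(); opt_ar.step()
    else:                                           # CR phase: AR frozen
        loss = loss_fn(ar(x).detach() + cr(x), y)
        opt_cr.zero_grad(); loss.backward(); opt_cr.step()
```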
[AI-78] CausalAgent : A Conversational Multi-Agent System for End-to-End Causal Inference
[Quick Read]: This paper addresses the high technical barrier and operational complexity of traditional causal-inference workflows, which demand dual expertise in statistics and computer science and manual effort in algorithm selection, data-quality handling, and result interpretation. The key to the solution is CausalAgent, a conversational multi-agent system that integrates Multi-Agent Systems (MAS), Retrieval-Augmented Generation (RAG), and the Model Context Protocol (MCP) to automate the full pipeline from data cleaning and causal structure learning to bias correction and report generation through natural-language interaction, significantly lowering the barrier to causal analysis while preserving the rigor and interpretability of the process.
Link: https://arxiv.org/abs/2602.11527
Authors: Jiawei Zhu, Wei Chen, Ruichu Cai
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted by IUI 2026
Abstract:Causal inference holds immense value in fields such as healthcare, economics, and social sciences. However, traditional causal analysis workflows impose significant technical barriers, requiring researchers to possess dual backgrounds in statistics and computer science, while manually selecting algorithms, handling data quality issues, and interpreting complex results. To address these challenges, we propose CausalAgent, a conversational multi-agent system for end-to-end causal inference. The system innovatively integrates Multi-Agent Systems (MAS), Retrieval-Augmented Generation (RAG), and the Model Context Protocol (MCP) to achieve automation from data cleaning and causal structure learning to bias correction and report generation through natural language interaction. Users need only upload a dataset and pose questions in natural language to receive a rigorous, interactive analysis report. As a novel user-centered human-AI collaboration paradigm, CausalAgent explicitly models the analysis workflow. By leveraging interactive visualizations, it significantly lowers the barrier to entry for causal analysis while ensuring the rigor and interpretability of the process.
[AI-79] Human-Inspired Continuous Learning of Internal Reasoning Processes: Learning How to Think for Adaptive AI Systems
[Quick Read]: This paper addresses the lack of continuous adaptation in current AI systems operating in dynamic real-world environments, in particular the tendency of existing methods to focus on task-specific outputs or static knowledge representations while neglecting continuous refinement of internal reasoning structures, action-scheduling policies, and the learning mechanisms themselves. The key to the solution is a human-inspired continuous learning framework that unifies reasoning, action, reflection, and verification within a sequential reasoning model enhanced by parallel learning, treating internal thinking processes as primary learning objects: internal reasoning trajectories and environmental interactions are systematically recorded as structured learning material, so the system optimizes not only task-level content but also the organization, scheduling, and evolution of its reasoning activities, improving its cognitive structure while executing.
Link: https://arxiv.org/abs/2602.11516
Authors: Hong Su
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Learning internal reasoning processes is crucial for developing AI systems capable of sustained adaptation in dynamic real-world environments. However, most existing approaches primarily emphasize learning task-specific outputs or static knowledge representations, while overlooking the continuous refinement of internal reasoning structures, action scheduling policies, and learning mechanisms themselves. In this paper, we propose a human-inspired continuous learning framework that unifies reasoning, action, reflection, and verification within a sequential reasoning model enhanced by parallel learning. The framework explicitly treats internal thinking processes as primary learning objects. It systematically records internal reasoning trajectories and environmental interactions as structured learning material, enabling the system to optimize not only task-level content but also the organization, scheduling, and evolution of reasoning activities. This design realizes learning alongside processing, allowing cognitive structures to improve during execution. Furthermore, the framework supports controlled replacement of predefined logic with learned procedures and introduces a hierarchical learning-to-learn mechanism that jointly adapts task-level parameters and learning strategies. As a result, the system progressively evolves its internal cognitive architecture while preserving operational stability. Experimental results on a temperature sensor abnormality detection task show that incorporating internal-process learning reduces average runtime by 23.9%.
[AI-80] Differentially Private and Communication Efficient Large Language Model Split Inference via Stochastic Quantization and Soft Prompt
[Quick Read]: This paper addresses the excessive computation of deploying large language models (LLMs) locally on resource-constrained devices and the privacy leakage of the prevailing cloud-based inference paradigm. The key to the solution is DEL, a differentially private and communication-efficient LLM split-inference framework: an embedding projection module and a differentially private stochastic quantization mechanism sharply reduce transmitted data while preserving privacy, and instead of deploying local models at the client, soft prompts are adapted at the server side to compensate for the utility loss caused by the privacy mechanisms, achieving a better privacy-utility trade-off. To the authors' knowledge, this is the first work to use soft prompts to improve the privacy-utility trade-off in LLM inference.
Link: https://arxiv.org/abs/2602.11513
Authors: Yujie Gu, Richeng Jin, Xiaoyu Ji, Yier Jin, Wenyuan Xu
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) have achieved remarkable performance and received significant research interest. The enormous computational demands, however, hinder the local deployment on devices with limited resources. The current prevalent LLM inference paradigms require users to send queries to the service providers for processing, which raises critical privacy concerns. Existing approaches propose to allow the users to obfuscate the token embeddings before transmission and utilize local models for denoising. Nonetheless, transmitting the token embeddings and deploying local models may result in excessive communication and computation overhead, preventing practical implementation. In this work, we propose DEL, a framework for Differentially private and communication Efficient LLM split inference. More specifically, an embedding projection module and a differentially private stochastic quantization mechanism are proposed to reduce the communication overhead in a privacy-preserving manner. To eliminate the need for local models, we adapt soft prompt at the server side to compensate for the utility degradation caused by privacy. To the best of our knowledge, this is the first work that utilizes soft prompt to improve the trade-off between privacy and utility in LLM inference, and extensive experiments on text generation and natural language understanding benchmarks demonstrate the effectiveness of the proposed method.
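Stochastic quantization of this kind is standard enough to sketch: values are clipped and randomly rounded to a grid so the quantizer is unbiased in expectation, and the injected randomness is where a differential-privacy guarantee can be calibrated. The grid size and value range below are illustrative, not the paper's parameters:

```python
import numpy as np

def stochastic_quantize(x, levels=16, lo=-1.0, hi=1.0):
    """Unbiased stochastic quantization of clipped embeddings.

    Each value is rounded up or down to a neighboring grid point with
    probability proportional to its distance, so E[q(x)] = clip(x).
    Calibrating this rounding distribution is one route to a formal
    DP guarantee; this sketch shows only the quantizer itself.
    """
    x = np.clip(x, lo, hi)
    step = (hi - lo) / (levels - 1)
    pos = (x - lo) / step                        # position on the grid
    floor = np.floor(pos)
    prob_up = pos - floor                        # chance of rounding up
    q = floor + (np.random.rand(*x.shape) < prob_up)
    return lo + q * step                         # decode back to value range

emb = np.random.randn(4, 8) * 0.5
print(stochastic_quantize(emb))                  # 4-bit grid, unbiased on average
```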
[AI-81] AgentLeak: A Full-Stack Benchmark for Privacy Leakage in Multi-Agent LLM Systems
[Quick Read]: This paper addresses privacy leakage in multi-agent large language model (LLM) systems: existing benchmarks audit only the output channel and cannot assess internal communication paths such as inter-agent messages, shared memory, and tool arguments. The key to the solution is AgentLeak, the first end-to-end privacy-leakage benchmark covering internal channels, spanning 1,000 scenarios across healthcare, finance, legal, and corporate domains with a 32-class attack taxonomy and a three-tier detection pipeline. It quantifies for the first time how much internal channels contribute to total exposure: inter-agent messages (C2) leak at 68.8%, far above the 27.2% of the output channel (C1), and output-only audits miss 41.7% of violations, underscoring the need for privacy protections on inter-agent communication in multi-agent collaboration.
Link: https://arxiv.org/abs/2602.11510
Authors: Faouzi El Yagoubi, Ranwa Al Mallah, Godwin Badu-Marfo
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 17 pages, 10 figures, 13 tables. Code and dataset available at this https URL
Abstract:Multi-agent Large Language Model (LLM) systems create privacy risks that current benchmarks cannot measure. When agents coordinate on tasks, sensitive data passes through inter-agent messages, shared memory, and tool arguments: pathways that output-only audits never inspect. We introduce AgentLeak, to the best of our knowledge the first full-stack benchmark for privacy leakage covering internal channels, spanning 1,000 scenarios across healthcare, finance, legal, and corporate domains, paired with a 32-class attack taxonomy and three-tier detection pipeline. Testing GPT-4o, GPT-4o-mini, Claude 3.5 Sonnet, Mistral Large, and Llama 3.3 70B across 4,979 traces reveals that multi-agent configurations reduce per-channel output leakage (C1: 27.2% vs 43.2% in single-agent) but introduce unmonitored internal channels that raise total system exposure to 68.9% (OR-aggregated across C1, C2, C5). Internal channels account for most of this gap: inter-agent messages (C2) leak at 68.8%, compared to 27.2% on C1 (output channel). This means that output-only audits miss 41.7% of violations. Claude 3.5 Sonnet, which emphasizes safety alignment in its design, achieves the lowest leakage rates on both external (3.3%) and internal (28.1%) channels, suggesting that model-level safety training may transfer to internal channel protection. Across all five models and four domains, the pattern C2 > C1 holds consistently, confirming that inter-agent communication is the primary vulnerability. These findings underscore the need for coordination frameworks that incorporate internal-channel privacy protections and enforce privacy controls on inter-agent communication.
[AI-82] RooflineBench: A Benchmarking Framework for On-Device LLMs via Roofline Analysis
[Quick Read]: This paper addresses the difficulty of objectively evaluating small language models (SLMs) on resource-constrained edge hardware, in particular how to uniformly measure the theoretical performance ceilings of different neural architectures across heterogeneous compute platforms. The key to the solution is a systematic Roofline-based framework that uses operational intensity (OI) as the unifying metric linking architectural primitives to hardware constraints, defines an "inference-potential region," and introduces Relative Inference Potential as a new metric for cross-model efficiency comparison. The analysis reveals how sequence length and model depth shape OI, identifies an efficiency trap induced by hardware heterogeneity, and shows that structural refinements such as Multi-head Latent Attention (MLA) can unlock latent on-device inference potential, yielding concrete directions for hardware-software co-design.
Link: https://arxiv.org/abs/2602.11506
Authors: Zhen Bi, Xueshu Chen, Luoyang Sun, Yuhang Yao, Qing Shen, Jungang Lou, Cheng Deng
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Performance (cs.PF)
Comments:
Abstract:The transition toward localized intelligence through Small Language Models (SLMs) has intensified the need for rigorous performance characterization on resource-constrained edge hardware. However, objectively measuring the theoretical performance ceilings of diverse architectures across heterogeneous platforms remains a formidable challenge. In this work, we propose a systematic framework based on the Roofline model that unifies architectural primitives and hardware constraints through the lens of operational intensity (OI). By defining an inference-potential region, we introduce the Relative Inference Potential as a novel metric to compare efficiency differences between Large Language Models (LLMs) on the same hardware substrate. Extensive empirical analysis across diverse compute tiers reveals that variations in performance and OI are significantly influenced by sequence length. We further identify a critical regression in OI as model depth increases. Additionally, our findings highlight an efficiency trap induced by hardware heterogeneity and demonstrate how structural refinements, such as Multi-head Latent Attention (MLA), can effectively unlock latent inference potential across various hardware substrates. These insights provide actionable directions for hardware-software co-design to align neural structures with physical constraints in on-device intelligence. The released code is available in Appendix C.
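The Roofline relation at the heart of the framework is one line: attainable performance is the minimum of the compute peak and memory bandwidth times operational intensity. A small sketch with made-up edge-SoC numbers (here the ridge point is peak/bandwidth = 40 FLOP/B):

```python
def roofline_attainable(oi, peak_flops, mem_bw):
    """Attainable performance under the Roofline model.

    oi: operational intensity in FLOPs per byte moved. Performance is
    capped by compute (peak_flops) or memory traffic (mem_bw * oi),
    whichever binds first.
    """
    return min(peak_flops, mem_bw * oi)

# Toy edge SoC: 2 TFLOP/s peak, 50 GB/s DRAM bandwidth.
peak, bw = 2e12, 50e9
for oi in [1, 8, 40, 100]:   # decode-time attention is typically low-OI
    perf = roofline_attainable(oi, peak, bw)
    bound = "memory" if perf < peak else "compute"
    print(f"OI={oi:>3} FLOP/B -> {perf / 1e9:8.1f} GFLOP/s ({bound}-bound)")
```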
[AI-83] Compiler-Guided Inference-Time Adaptation: Improving GPT-5 Programming Performance in Idris
[Quick Read]: This paper addresses the weak performance of large language models (LLMs) in low-resource or less common programming languages by evaluating GPT-5's ability to learn the unfamiliar functional language Idris. With zero-shot prompting alone, GPT-5 solves only 22/56 Idris exercises, far below its results in high-resource languages such as Python (45/50) and Erlang (35/47). The key to the solution is structured, error-guided iterative refinement: feeding local compilation errors and failing test cases back into an iterative prompting loop raises GPT-5's Idris performance dramatically to 54/56, indicating that fine-grained compiler-level feedback is central to unlocking LLM capability in low-resource languages.
Link: https://arxiv.org/abs/2602.11481
Authors: Minda Li, Bhaskar Krishnamachari
Affiliation: Unknown
Subjects: Programming Languages (cs.PL); Artificial Intelligence (cs.AI)
Comments:
Abstract:GPT-5, a state of the art large language model from OpenAI, demonstrates strong performance in widely used programming languages such as Python, C++, and Java; however, its ability to operate in low resource or less commonly used languages remains underexplored. This work investigates whether GPT-5 can effectively acquire proficiency in an unfamiliar functional programming language, Idris, through iterative, feedback driven prompting. We first establish a baseline showing that with zero shot prompting the model solves only 22 out of 56 Idris exercises using the platform Exercism, substantially underperforming relative to higher resource languages (45 out of 50 in Python and 35 out of 47 in Erlang). We then evaluate several refinement strategies, including iterative prompting based on platform feedback, augmenting prompts with documentation and error classification guides, and iterative prompting using local compilation errors and failed test cases. Among these approaches, incorporating local compilation errors yields the most substantial improvements. Using this structured, error guided refinement loop, GPT-5 performance increased to an impressive 54 solved problems out of 56. These results suggest that while large language models may initially struggle in low resource settings, structured compiler level feedback can play a critical role in unlocking their capabilities.
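The winning strategy is essentially a compile-check loop that feeds compiler stderr back into the prompt. A schematic Python version: llm_generate stands for whatever model API is used, the idris2 --check invocation assumes an Idris 2 toolchain on PATH, and the prompt wording and file name are illustrative:

```python
import subprocess
import textwrap

def refine_with_compiler(llm_generate, task, attempts=5):
    """Iteratively re-prompt a model with local Idris compiler errors.

    llm_generate(prompt) -> source code string. The feedback loop
    itself is the part the paper found most effective.
    """
    prompt = task
    for _ in range(attempts):
        code = llm_generate(prompt)
        with open("Solution.idr", "w") as f:
            f.write(code)
        result = subprocess.run(["idris2", "--check", "Solution.idr"],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return code                      # type-checks: hand off to tests
        prompt = textwrap.dedent(f"""\
            {task}

            Your previous attempt failed to compile with:
            {result.stderr}

            Fix the code and return the full corrected file.""")
    return None                              # exhausted the attempt budget
```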
[AI-84] EM-Aware Physical Synthesis: Neural Inductor Modeling and Intelligent Placement Routing for RF Circuits ISCAS2026
[Quick Read]: This paper addresses the difficulty of automating RF physical synthesis, i.e., generating manufacturable layouts in GDSII format: existing machine-learning (ML) approaches succeed at topology selection and parameter optimization but cannot produce rule-compliant layouts because of oversimplified component models and missing routing capabilities. The key lies in three innovations: (1) a neural inductor model trained on 18,210 inductor geometries with 1-100 GHz frequency sweeps, yielding 7.5 million samples, that predicts Q-factor with under 2% error and enables fast gradient-based layout optimization with a 93.77% success rate; (2) an intelligent P-Cell optimizer that shrinks layout area while maintaining design-rule-check (DRC) compliance; and (3) a complete placement-and-routing engine with frequency-dependent electromagnetic (EM) spacing rules and DRC awareness, ultimately generating high-fidelity, DRC-compliant GDSII layouts for RF circuits and marking a substantial advance in RF physical design automation.
Link: https://arxiv.org/abs/2602.11461
Authors: Yilun Huang, Asal Mehradfar, Salman Avestimehr, Hamidreza Aghasi
Affiliation: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Accepted at the 2026 IEEE International Symposium on Circuits and Systems (ISCAS 2026)
Abstract:This paper presents an ML-driven framework for automated RF physical synthesis that transforms circuit netlists into manufacturable GDSII layouts. While recent ML approaches demonstrate success in topology selection and parameter optimization, they fail to produce manufacturable layouts due to oversimplified component models and lack of routing capabilities. Our framework addresses these limitations through three key innovations: (1) a neural network framework trained on 18,210 inductor geometries with frequency sweeps from 1-100 GHz, generating 7.5 million training samples, that predicts inductor Q-factor with less than 2% error and enables fast gradient-based layout optimization with a 93.77% success rate in producing high-Q layouts; (2) an intelligent P-Cell optimizer that reduces layout area while maintaining design-rule-check (DRC) compliance; and (3) a complete placement and routing engine with frequency-dependent EM spacing rules and DRC-aware synthesis. The neural inductor model demonstrates superior accuracy across 1-100 GHz, enabling EM-accurate component synthesis with real-time inference. The framework successfully generates DRC-aware GDSII layouts for RF circuits, representing a significant step toward automated RF physical design.
[AI-85] Credit Where It is Due: Cross-Modality Connectivity Drives Precise Reinforcement Learning for MLLM Reasoning
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在强化学习过程中如何有效整合视觉证据以提升推理能力的问题,尤其是对跨模态注意力连接机制的理解不足。其解决方案的关键在于识别并强化具有高视觉-文本耦合度的锚定标记(anchor tokens),这些标记约占总标记数的15%,但承担了推理过程中的视觉 grounding 作用;作者提出轻量级的锚定标记强化学习(Anchor-Token Reinforcement Learning, AT-RL)框架,通过基于注意力拓扑的图聚类方法选择性地增强此类高连通性标记,从而实现更精准的信用分配(credit assignment),显著提升模型在数学、STEM、视频和通用任务上的表现,同时仅引入1.2%的计算开销。
链接: https://arxiv.org/abs/2602.11455
作者: Zhengbo Jiao,Shaobo Wang,Zifan Zhang,Wei Wang,Bing Zhao,Hu Wei,Linfeng Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 20pages
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), yet how visual evidence is integrated during reasoning remains poorly understood. We explore multimodal RLVR through the lens of cross-modal attention connectivity and find that only a small fraction of tokens (approximately 15%) exhibit strong visual-textual coupling. These high-connectivity tokens act as anchors that ground reasoning in the image, while the majority follow linguistic patterns. During RLVR training, credit assignment naturally concentrates on these anchors, sharpening their visual grounding over time. Building on this insight, we propose Anchor-Token Reinforcement Learning (AT-RL), a lightweight framework that selectively reinforces high-connectivity tokens via graph-based clustering of attention topology. Evaluated across a model series spanning 3B-32B parameters, AT-RL introduces only 1.2% overhead yet enables the 32B model to surpass the 72B-Instruct baseline on MathVista (80.2), with consistent gains observed across STEM, video and general tasks. Conversely, training solely on low-connectivity tokens causes severe degradation, confirming that effective multimodal RL hinges on precise credit assignment to visual anchors. Our work reveals that reasoning quality is governed not by token quantity but by the fidelity of cross-modal anchoring.
[AI-86] TRACER: Trajectory Risk Aggregation for Critical Episodes in Agentic Reasoning
[Quick Read]: This paper addresses uncertainty estimation for AI agents in real-world multi-turn tool-using interaction with humans, where sparse critical failure episodes (looping, incoherent tool use, or human-agent miscoordination) cause trajectory-level failures even when local generation appears confident; existing uncertainty proxies target single-shot text generation and miss these trajectory-level anomaly signals. The key to the solution is TRACER, a trajectory-level uncertainty metric for dual-control Tool-Agent-User interaction that fuses content-aware surprisal, situational-awareness signals, semantic and lexical repetition, and tool-grounded coherence gaps, and aggregates them with a tail-focused risk functional plus a MAX-composite step risk to surface decisive anomaly patterns, enabling earlier and more accurate uncertainty detection in complex conversational tool-use settings.
Link: https://arxiv.org/abs/2602.11409
Authors: Sina Tayebati, Divake Kumar, Nastaran Darabi, Davide Ettori, Ranganath Krishnan, Amit Ranjan Trivedi
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Estimating uncertainty for AI agents in real-world multi-turn tool-using interaction with humans is difficult because failures are often triggered by sparse critical episodes (e.g., looping, incoherent tool use, or user-agent miscoordination) even when local generation appears confident. Existing uncertainty proxies focus on single-shot text generation and therefore miss these trajectory-level breakdown signals. We introduce TRACER, a trajectory-level uncertainty metric for dual-control Tool-Agent-User interaction. TRACER combines content-aware surprisal with situational-awareness signals, semantic and lexical repetition, and tool-grounded coherence gaps, and aggregates them using a tail-focused risk functional with a MAX-composite step risk to surface decisive anomalies. We evaluate TRACER on τ²-bench by predicting task failure and selective task execution. To this end, TRACER improves AUROC by up to 37.1% and AUARC by up to 55% over baselines, enabling earlier and more accurate detection of uncertainty in complex conversational tool-use settings. Our code and benchmark are available at this https URL.
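The aggregation step can be sketched independently of the individual signals: take the worst channel per step (a MAX-composite), then average only the worst fraction of steps so a sparse critical episode is not diluted by the mean. The channel semantics, tail fraction, and example scores are illustrative assumptions:

```python
import numpy as np

def trajectory_risk(channel_risks, tail_frac=0.25):
    """Trajectory risk from per-step, per-channel anomaly scores.

    channel_risks: array (steps, channels), where channels might be
    surprisal, repetition, tool-coherence gap, etc., each in [0, 1].
    Step risk is the MAX over channels, so one decisive anomaly is not
    averaged away; the trajectory score is a CVaR-style mean over the
    worst tail_frac of steps instead of the mean over all steps.
    """
    step_risk = np.max(np.asarray(channel_risks), axis=1)  # MAX-composite
    worst = np.sort(step_risk)[::-1]
    k = max(1, int(np.ceil(tail_frac * len(worst))))
    return float(worst[:k].mean())                         # tail-focused

calm = np.full((8, 3), 0.1)
spiky = calm.copy()
spiky[4, 2] = 0.95                                         # one incoherent tool call
print(trajectory_risk(calm), trajectory_risk(spiky))       # 0.1 vs ~0.52
```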
[AI-87] GHOST: Unmasking Phantom States in Mamba2 via Grouped Hidden-state Output-aware Selection and Truncation
[Quick Read]: This paper addresses the excessive inference overhead caused by Mamba2's expanded state dimension, which saturates bandwidth during autoregressive generation. Conventional pruning fails to relieve this bottleneck: unstructured sparsity leaves activation tensors dense, magnitude-based selection ignores runtime dynamics, and gradient-based methods are prohibitively expensive. The key to the solution is GHOST (Grouped Hidden-state Output-aware Selection and Truncation), a structured pruning framework that approximates control-theoretic balanced truncation using only forward-pass statistics, jointly measuring controllability and observability to match the modeling fidelity of gradient-based methods without backpropagation. On models from 130M to 2.7B parameters, GHOST achieves a 50% state-dimension reduction at a cost of roughly one perplexity point.
Link: https://arxiv.org/abs/2602.11408
Authors: Michael Menezes, Anastasios Kyrillidis
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 16 pages, 7 figures
Abstract:While Mamba2’s expanded state dimension enhances temporal modeling, it incurs substantial inference overhead that saturates bandwidth during autoregressive generation. Standard pruning methods fail to address this bottleneck: unstructured sparsity leaves activations dense, magnitude-based selection ignores runtime dynamics, and gradient-based methods impose prohibitive costs. We introduce GHOST (Grouped Hidden-state Output-aware Selection and Truncation), a structured pruning framework that approximates control-theoretic balanced truncation using only forward-pass statistics. By jointly measuring controllability and observability, GHOST rivals the fidelity of gradient-based methods without requiring backpropagation. As a highlight, on models ranging from 130M to 2.7B parameters, our approach achieves a 50% state-dimension reduction with approximately 1 perplexity point increase on WikiText-2. Code is available at this https URL.
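The forward-only scoring idea can be mimicked with second-moment statistics: a state channel matters if it is both easy to excite (controllability) and influential on the output (observability), echoing Hankel singular values in balanced truncation. The proxies below are illustrative stand-ins, not GHOST's exact statistics:

```python
import numpy as np

def score_state_channels(states, output_influence):
    """Score state channels by controllability x observability proxies.

    states: (T, N) hidden-state trajectory gathered from forward passes.
    output_influence: (T, N) per-channel contribution to the output
    (e.g., state times output-projection column norm), also forward-only.
    """
    controllability = (states ** 2).mean(axis=0)            # energy per channel
    observability = (output_influence ** 2).mean(axis=0)    # output relevance
    return np.sqrt(controllability * observability)         # Hankel-like score

T, N = 256, 64
states = np.random.randn(T, N) * np.linspace(0.1, 2.0, N)  # uneven energies
influence = states * np.random.rand(N)                      # fixed output weights
scores = score_state_channels(states, influence)
keep = np.argsort(scores)[N // 2:]                          # 50% state reduction
print(f"kept {len(keep)} of {N} state channels")
```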
[AI-88] Can We Really Learn One Representation to Optimize All Rewards?
[Quick Read]: This paper studies how to exploit large pre-trained models as priors for reinforcement learning (RL), focusing on the opaque training objective and convergence behavior of forward-backward (FB) representation learning. The key contribution is a theoretical analysis of when FB representations can exist, what their objective optimizes, and how they converge in practice, leading to a simplified unsupervised pre-training method, one-step forward-backward representation learning (one-step FB), which performs one step of policy improvement rather than aiming for optimal control. Across didactic settings and 10 state- and image-based continuous control domains, it converges to errors 10^5 times smaller and improves zero-shot performance by 24% on average.
Link: https://arxiv.org/abs/2602.11399
Authors: Chongyi Zheng, Royina Karegoudra Jayanth, Benjamin Eysenbach
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:
Abstract:As machine learning has moved towards leveraging large models as priors for downstream tasks, the community has debated the right form of prior for solving reinforcement learning (RL) problems. If one were to try to prefetch as much computation as possible, they would attempt to learn a prior over the policies for some yet-to-be-determined reward function. Recent work (forward-backward (FB) representation learning) has tried this, arguing that an unsupervised representation learning procedure can enable optimal control over arbitrary rewards without further fine-tuning. However, FB's training objective and learning behavior remain mysterious. In this paper, we demystify FB by clarifying when such representations can exist, what its objective optimizes, and how it converges in practice. We draw connections with rank matching, fitted Q-evaluation, and contraction mapping. Our analysis suggests a simplified unsupervised pre-training method for RL that, instead of enabling optimal control, performs one step of policy improvement. We call our proposed method one-step forward-backward representation learning (one-step FB). Experiments in didactic settings, as well as in 10 state-based and image-based continuous control domains, demonstrate that one-step FB converges to errors 10^5 times smaller and improves zero-shot performance by +24% on average. Our project website is available at this https URL.
[AI-89] General and Efficient Steering of Unconditional Diffusion
【Quick Read】: This paper addresses the computational inefficiency of controllable generation with unconditional diffusion models at inference time, where classifier-based gradient guidance or retraining with conditional inputs introduces substantial overhead. The key to the solution is an efficient steering mechanism requiring no per-step gradients, built on two findings: Noise Alignment, whereby coarse semantic steering is already possible in early, highly corrupted stages via a lightweight, offline-computed guidance signal; and transferable concept vectors, whereby a concept direction learned once in activation space transfers across timesteps and samples and can refine control for every generation trajectory. Such concept directions are identified efficiently and reliably by the backpropagation-free Recursive Feature Machine (RFM), preserving generation quality while delivering substantial inference speedups.
Link: https://arxiv.org/abs/2602.11395
Authors: Qingsong Wang,Mikhail Belkin,Yusu Wang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Guiding unconditional diffusion models typically requires either retraining with conditional inputs or per-step gradient computations (e.g., classifier-based guidance), both of which incur substantial computational overhead. We present a general recipe for efficiently steering unconditional diffusion without gradient guidance during inference, enabling fast controllable generation. Our approach is built on two observations about diffusion model structure. (1) Noise Alignment: even in early, highly corrupted stages, coarse semantic steering is possible using a lightweight, offline-computed guidance signal, avoiding any per-step or per-sample gradients. (2) Transferable concept vectors: a concept direction in activation space, once learned, transfers across both timesteps and samples; the same fixed steering vector learned near a low noise level remains effective when injected at intermediate noise levels for every generation trajectory, providing refined conditional control with efficiency. Such concept directions can be efficiently and reliably identified via the Recursive Feature Machine (RFM), a lightweight, backpropagation-free feature learning method. Experiments on CIFAR-10, ImageNet, and CelebA demonstrate improved accuracy/quality over gradient-based guidance, while achieving significant inference speedups.
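The steering mechanism itself reduces to adding a fixed vector to an intermediate activation. A minimal sketch follows (hypothetical module name and hook interface; the RFM training of the vector is omitted):

```python
import torch

def steered_prediction(model, x_t, t, concept_vec, alpha=1.0, layer_name="mid_block"):
    # Inject a pre-learned concept direction into one activation during the
    # denoising forward pass. `concept_vec` is learned once (e.g., via RFM)
    # near a low noise level and reused at every timestep and for every sample.
    def hook(module, inputs, output):
        return output + alpha * concept_vec.to(output.dtype)

    handle = dict(model.named_modules())[layer_name].register_forward_hook(hook)
    try:
        return model(x_t, t)  # standard unconditional prediction, now steered
    finally:
        handle.remove()
```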
[AI-90] Causal-JEPA: Learning World Models through Object-Level Latent Interventions
【Quick Read】: This paper addresses a limitation of world models in handling interaction-dependent dynamics: conventional object-centric representations provide a useful abstraction but struggle to capture the dynamics arising from interactions between objects. The key to the solution is C-JEPA, which extends masked joint embedding prediction with an object-level masking mechanism: forcing the model to infer one object's state from the other objects induces latent interventions with counterfactual-like effects and prevents shortcut solutions, making interaction reasoning indispensable. Experiments show this design improves counterfactual reasoning in visual question answering by about 20% absolute, and on agent control tasks it matches patch-based world models while using only 1% of their latent input features; a formal analysis further shows that object-level masking introduces a causal inductive bias via latent interventions.
Link: https://arxiv.org/abs/2602.11389
Authors: Heejeong Nam,Quentin Le Lidec,Lucas Maes,Yann LeCun,Randall Balestriero
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL
Abstract:World models require robust relational understanding to support prediction, reasoning, and control. While object-centric representations provide a useful abstraction, they are not sufficient to capture interaction-dependent dynamics. We therefore propose C-JEPA, a simple and flexible object-centric world model that extends masked joint embedding prediction from image patches to object-centric representations. By applying object-level masking that requires an object’s state to be inferred from other objects, C-JEPA induces latent interventions with counterfactual-like effects and prevents shortcut solutions, making interaction reasoning essential. Empirically, C-JEPA leads to consistent gains in visual question answering, with an absolute improvement of about 20% in counterfactual reasoning compared to the same architecture without object-level masking. On agent control tasks, C-JEPA enables substantially more efficient planning by using only 1% of the total latent input features required by patch-based world models, while achieving comparable performance. Finally, we provide a formal analysis demonstrating that object-level masking induces a causal inductive bias via latent interventions. Our code is available at this https URL.
[AI-91] Retrieval-Aware Distillation for Transformer-SSM Hybrids
【Quick Read】: This paper addresses the gap between state-space models (SSMs) and Transformers on benchmarks requiring in-context retrieval, which prior work attributes to a class of attention heads, termed Gather-and-Aggregate (GA), that SSMs struggle to reproduce. The key to the solution is retrieval-aware distillation: retrieval-critical GA heads are first identified via ablation on a synthetic retrieval task, then preserved in the student model (only 2% of all attention heads, i.e., 10/512), while the remaining heads are distilled into recurrent SSM heads. Results show that keeping just 2% of the heads recovers over 95% of the teacher's performance on retrieval-heavy tasks, and that by shrinking both the attention cache and the SSM state dimension (by up to 8×), the hybrid is 5-6× more memory-efficient than comparable designs, substantially narrowing the Transformer-SSM gap at a fraction of the memory cost.
Link: https://arxiv.org/abs/2602.11374
Authors: Aviv Bick,Eric P. Xing,Albert Gu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:State-space models (SSMs) offer efficient sequence modeling but lag behind Transformers on benchmarks that require in-context retrieval. Prior work links this gap to a small set of attention heads, termed Gather-and-Aggregate (GA), which SSMs struggle to reproduce. We propose retrieval-aware distillation, which converts a pretrained Transformer into a hybrid student by preserving only these retrieval-critical heads and distilling the rest into recurrent heads. We identify the essential heads via ablation on a synthetic retrieval task, producing a hybrid with sparse, non-uniform attention placement. We show that preserving just 2% of attention heads recovers over 95% of teacher performance on retrieval-heavy tasks (10 heads in a 1B model), requiring far fewer heads than hybrids that retain at least 25%. We further find that large recurrent states often compensate for missing retrieval: once retrieval is handled by these heads, the SSM backbone can be simplified with limited loss, even with an 8× reduction in state dimension. By reducing both the attention cache and the SSM state, the resulting hybrid is 5–6× more memory-efficient than comparable hybrids, closing the Transformer–SSM gap at a fraction of the memory cost.
[AI-92] The Manifold of the Absolute: Religious Perennialism as Generative Inference
【Quick Read】: This paper addresses the theoretical unity of religious epistemology: how to explain the cross-cultural convergence of contemplative practice across religions without sacrificing what is distinctive to each tradition. The key to the solution is the mathematical framework of Variational Autoencoders (VAEs), modeling each tradition as a generative mapping from a shared low-dimensional latent space to the high-dimensional space of observable cultural forms, and comparing competing generative configurations: exclusivism, universalism, perennialism, and syncretism as direct mixing in observable space. The analysis shows that only the perennialist configuration avoids the structural defects of the alternatives (exclusivism cannot explain cross-traditional convergence, syncretism yields incoherent outputs, universalism suffers posterior collapse) and thus offers the best explanatory fit. In this framework strict orthodoxy is a structural necessity rather than a cultural constraint: only contemplative practices matched to a specific tradition's forms can recover the latent source, so the unity of religions, while real, is reachable only through depth, not breadth.
Link: https://arxiv.org/abs/2602.11368
Authors: Arthur Juliani
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:This paper formalizes religious epistemology through the mathematics of Variational Autoencoders. We model religious traditions as distinct generative mappings from a shared, low-dimensional latent space to the high-dimensional space of observable cultural forms, and define three competing generative configurations corresponding to exclusivism, universalism, and perennialism, alongside syncretism as direct mixing in observable space. Through abductive comparison, we argue that exclusivism cannot parsimoniously account for cross-traditional contemplative convergence, that syncretism fails because combining the outputs of distinct generative processes produces incoherent artifacts, and that universalism suffers from posterior collapse: stripping traditions to a common core discards the structural information necessary for inference. The perennialist configuration provides the best explanatory fit. Within this framework, strict orthodoxy emerges not as a cultural constraint but as a structural necessity: the contemplative practices that recover the latent source must be matched to the specific tradition whose forms they take as input. The unity of religions, if it exists, is real but inaccessible by shortcut: one must go deep rather than wide.
[AI-93] Bootstrapping-based Regularisation for Reducing Individual Prediction Instability in Clinical Risk Prediction Models
【Quick Read】: This paper addresses the instability of deep-learning clinical prediction models, whose outputs can shift substantially under small changes in the training sample, undermining reliability and limiting clinical adoption. The key to the solution is a bootstrapping-based regularisation framework that embeds bootstrap sampling directly into the training of deep neural networks, constraining prediction differences across resampled datasets so that a single model acquires inherent stability while maintaining discriminative performance and consistent feature importance (e.g., SHAP correlations up to 0.965), thereby improving robustness and reproducibility without sacrificing interpretability.
Link: https://arxiv.org/abs/2602.11360
Authors: Sara Matijevic,Christopher Yau
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:
Abstract:Clinical prediction models are increasingly used to support patient care, yet many deep learning-based approaches remain unstable, as their predictions can vary substantially when trained on different samples from the same population. Such instability undermines reliability and limits clinical adoption. In this study, we propose a novel bootstrapping-based regularisation framework that embeds the bootstrapping process directly into the training of deep neural networks. This approach constrains prediction variability across resampled datasets, producing a single model with inherent stability properties. We evaluated models constructed using the proposed regularisation approach against conventional and ensemble models using simulated data and three clinical datasets: GUSTO-I, Framingham, and SUPPORT. Across all datasets, our model exhibited improved prediction stability, with lower mean absolute differences (e.g., 0.019 vs. 0.059 in GUSTO-I; 0.057 vs. 0.088 in Framingham) and markedly fewer significantly deviating predictions. Importantly, discriminative performance and feature importance consistency were maintained, with high SHAP correlations between models (e.g., 0.894 for GUSTO-I; 0.965 for Framingham). While ensemble models achieved greater stability, we show that this came at the expense of interpretability, as each constituent model used predictors in different ways. By regularising predictions to align with bootstrapped distributions, our approach allows prediction models to be developed that achieve greater robustness and reproducibility without sacrificing interpretability. This method provides a practical route toward more reliable and clinically trustworthy deep learning models, particularly valuable in data-limited healthcare settings.
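A minimal sketch of the training idea (hypothetical loss shape; the paper's exact regulariser may differ): train one network while penalising the variability of its predictions across bootstrap resamples of each minibatch:

```python
import torch

def bootstrap_regularised_loss(model, x, y, n_boot=4, lam=0.1):
    # Base loss on the original minibatch (model outputs probabilities).
    base = torch.nn.functional.binary_cross_entropy(model(x), y)
    # Mean prediction on several bootstrap resamples of the same minibatch.
    means = []
    for _ in range(n_boot):
        idx = torch.randint(0, len(x), (len(x),))  # sample with replacement
        means.append(model(x[idx]).mean())
    # Penalise spread across resamples: a stable model should report a
    # similar average risk regardless of which resample it sees.
    return base + lam * torch.stack(means).var()
```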
[AI-94] Pushing Forward Pareto Frontiers of Proactive Agents with Behavioral Agentic Optimization
【Quick Read】: This paper addresses how proactive large language model (LLM) agents can balance task performance and user engagement in multi-turn interaction. Existing agentic reinforcement learning (RL) pipelines struggle with this trade-off: passive agents cannot adapt efficiently to user intent, while overreliance on human feedback degrades the user experience. The key to the solution is BAO, whose core combines behavior enhancement, enriching the agent's proactive reasoning and information-gathering, with behavior regularization, suppressing inefficient or redundant interactions so that agent behavior better matches user expectations. Experiments on the UserRL benchmark suite show that BAO substantially outperforms existing proactive agentic RL baselines and matches or exceeds commercial LLM agents, validating its effectiveness for training efficient, user-aligned LLM agents in complex multi-turn scenarios.
Link: https://arxiv.org/abs/2602.11351
Authors: Yihang Yao,Zhepeng Cen,Haohong Lin,Shiqi Liu,Zuxin Liu,Jiacheng Zhu,Zhang-Wei Hong,Laixi Shi,Ding Zhao
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Proactive large language model (LLM) agents aim to actively plan, query, and interact over multiple turns, enabling efficient task completion beyond passive instruction following and making them essential for real-world, user-centric applications. Agentic reinforcement learning (RL) has recently emerged as a promising solution for training such agents in multi-turn settings, allowing interaction strategies to be learned from feedback. However, existing pipelines face a critical challenge in balancing task performance with user engagement, as passive agents can not efficiently adapt to users’ intentions while overuse of human feedback reduces their satisfaction. To address this trade-off, we propose BAO, an agentic RL framework that combines behavior enhancement to enrich proactive reasoning and information-gathering capabilities with behavior regularization to suppress inefficient or redundant interactions and align agent behavior with user expectations. We evaluate BAO on multiple tasks from the UserRL benchmark suite, and demonstrate that it substantially outperforms proactive agentic RL baselines while achieving comparable or even superior performance to commercial LLM agents, highlighting its effectiveness for training proactive, user-aligned LLM agents in complex multi-turn scenarios. Our website: this https URL.
[AI-95] AgentNoiseBench: Benchmarking Robustness of Tool-Using LLM Agents Under Noisy Conditions
【Quick Read】: This paper addresses why LLM-based agents perform markedly worse in real environments than on benchmarks: prevailing training and evaluation paradigms rest on idealized assumptions and neglect the stochasticity and noise inherent in real-world interactions. The key to the solution is AgentNoiseBench, which systematically analyzes real-world biases and uncertainties, categorizes environmental noise into two types, user-noise and tool-noise, and builds a pipeline that automatically injects controllable noise into existing agent-centric benchmarks while preserving task solvability, enabling comprehensive robustness evaluation of agentic models.
Link: https://arxiv.org/abs/2602.11348
Authors: Ruipeng Wang,Yuxin Chen,Yukai Wang,Chang Wu,Junfeng Fang,Xiaodong Cai,Qi Gu,Hui Su,An Zhang,Xiang Wang,Xunliang Cai,Tat-Seng Chua
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent advances in large language models have enabled LLM-based agents to achieve strong performance on a variety of benchmarks. However, their performance in real-world deployments often falls short of that observed in benchmark settings, especially in complex and imperfect environments. This discrepancy largely arises because prevailing training and evaluation paradigms are typically built on idealized assumptions, overlooking the inherent stochasticity and noise present in real-world interactions. To bridge this gap, we introduce AgentNoiseBench, a framework for systematically evaluating the robustness of agentic models under noisy environments. We first conduct an in-depth analysis of biases and uncertainties in real-world scenarios and categorize environmental noise into two primary types: user-noise and tool-noise. Building on this analysis, we develop an automated pipeline that injects controllable noise into existing agent-centric benchmarks while preserving task solvability. Leveraging this pipeline, we perform extensive evaluations across a wide range of models with diverse architectures and parameter scales. Our results reveal consistent performance variations under different noise conditions, highlighting the sensitivity of current agentic models to realistic environmental perturbations.
[AI-96] Divide and Learn: Multi-Objective Combinatorial Optimization at Scale
【Quick Read】: This paper addresses multi-objective combinatorial optimization, i.e., finding Pareto-optimal solutions over exponentially large discrete decision spaces, where existing methods commonly sacrifice generality, scalability, or theoretical guarantees. The key to the solution is reformulating the problem as online learning over a decomposed decision space and solving position-wise bandit subproblems via adaptive expert-guided sequential construction, yielding a regret bound of O(d√T log T) that depends on the subproblem dimension d rather than the size of the combinatorial space, with large gains in sample and computational efficiency; on real-world hardware-software co-design for AI accelerators it outperforms Bayesian optimization and other methods under fixed evaluation budgets.
Link: https://arxiv.org/abs/2602.11346
Authors: Esha Singh,Dongxia Wu,Chien-Yi Yang,Tajana Rosing,Rose Yu,Yi-An Ma
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Tech report. Code URL coming soon
Abstract:Multi-objective combinatorial optimization seeks Pareto-optimal solutions over exponentially large discrete spaces, yet existing methods sacrifice generality, scalability, or theoretical guarantees. We reformulate it as an online learning problem over a decomposed decision space, solving position-wise bandit subproblems via adaptive expert-guided sequential construction. This formulation admits regret bounds of O(d√T log T) depending on subproblem dimensionality d rather than combinatorial space size. On standard benchmarks, our method achieves 80–98% of specialized solvers' performance while achieving two to three orders of magnitude improvement in sample and computational efficiency over Bayesian optimization methods. On real-world hardware-software co-design for AI accelerators with expensive simulations, we outperform competing methods under fixed evaluation budgets. The advantage grows with problem scale and objective count, establishing bandit optimization over decomposed decision spaces as a principled alternative to surrogate modeling or offline training for multi-objective optimization.
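To make the decomposition concrete, here is a minimal sketch (hypothetical structure, assuming a black-box scalarized objective) of sequential construction with one UCB bandit per position:

```python
import math

def construct_solution(num_positions, choices, stats, t, c=1.0):
    # One bandit per position: pick the choice with the highest UCB score,
    # so the joint combinatorial space is never enumerated.
    solution = []
    for pos in range(num_positions):
        def ucb(ch):
            n, mean = stats[pos][ch]
            return float("inf") if n == 0 else mean + c * math.sqrt(math.log(t) / n)
        solution.append(max(choices, key=ucb))
    return solution

def update_stats(stats, solution, reward):
    # Credit the observed (scalarized) reward back to each position's arm.
    for pos, ch in enumerate(solution):
        n, mean = stats[pos][ch]
        stats[pos][ch] = (n + 1, mean + (reward - mean) / (n + 1))
```

The position-wise structure is what lets the regret depend on the per-position dimension d instead of the exponential size of the joint space.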
[AI-97] Bi-Level Prompt Optimization for Multimodal LLM-as-a-Judge
【Quick Read】: This paper addresses the difficulty of aligning multimodal large language models (MLLMs) acting as judges with human judgments, particularly when evaluating AI-generated images. Existing auto prompt optimization (APO) methods target text-only tasks; in multimodal settings, context-window limits hinder effective trial-and-error prompt iteration. The key to the solution is BLPO, a bi-level prompt optimization framework that converts images into textual representations preserving evaluation-relevant visual cues (image-to-text, I2T) and, under limited context budgets, jointly optimizes the judge prompt and the I2T prompt, improving the accuracy and consistency of multimodal LLM judging.
Link: https://arxiv.org/abs/2602.11340
Authors: Bo Pan,Xuan Kan,Kaitai Zhang,Yan Yan,Shunwen Tan,Zihao He,Zixin Ding,Junjie Wu,Liang Zhao
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) have become widely adopted as automated judges for evaluating AI-generated content. Despite their success, aligning LLM-based evaluations with human judgments remains challenging. While supervised fine-tuning on human-labeled data can improve alignment, it is costly and inflexible, requiring new training for each task or dataset. Recent progress in auto prompt optimization (APO) offers a more efficient alternative by automatically improving the instructions that guide LLM judges. However, existing APO methods primarily target text-only evaluations and remain underexplored in multimodal settings. In this work, we study auto prompt optimization for multimodal LLM-as-a-judge, particularly for evaluating AI-generated images. We identify a key bottleneck: multimodal models can only process a limited number of visual examples due to context window constraints, which hinders effective trial-and-error prompt refinement. To overcome this, we propose BLPO, a bi-level prompt optimization framework that converts images into textual representations (image-to-text, I2T) while preserving evaluation-relevant visual cues. Our bi-level optimization approach jointly refines the judge prompt and the I2T prompt to maintain fidelity under limited context budgets. Experiments on four datasets and three LLM judges demonstrate the effectiveness of our method.
[AI-98] Security Threat Modeling for Emerging AI-Agent Protocols: A Comparative Analysis of MCP, A2A, Agora, and ANP
【Quick Read】: This paper addresses the lack of systematic security analysis and standardized risk-assessment frameworks for current AI-agent communication protocols such as the Model Context Protocol (MCP), Agent2Agent (A2A), Agora, and the Agent Network Protocol (ANP). The key to the solution is a structured threat-modeling methodology that identifies risk surfaces specific to protocol architectures, trust assumptions, interaction patterns, and lifecycle behaviors, together with a qualitative risk-assessment framework that evaluates the likely impact of twelve protocol-level risks across the creation, operation, and update phases; a measurement-driven MCP case study further demonstrates execution errors caused by the absence of mandatory validation/attestation mechanisms, yielding actionable guidance for secure deployment and future standardization.
Link: https://arxiv.org/abs/2602.11327
Authors: Zeynab Anbiaee,Mahdi Rabbani,Mansur Mirani,Gunjan Piya,Igor Opushnyev,Ali Ghorbani,Sajjad Dadkhah
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:The rapid development of the AI agent communication protocols, including the Model Context Protocol (MCP), Agent2Agent (A2A), Agora, and Agent Network Protocol (ANP), is reshaping how AI agents communicate with tools, services, and each other. While these protocols support scalable multi-agent interaction and cross-organizational interoperability, their security principles remain understudied, and standardized threat modeling is limited; no protocol-centric risk assessment framework has been established yet. This paper presents a systematic security analysis of four emerging AI agent communication protocols. First, we develop a structured threat modeling analysis that examines protocol architectures, trust assumptions, interaction patterns, and lifecycle behaviors to identify protocol-specific and cross-protocol risk surfaces. Second, we introduce a qualitative risk assessment framework that identifies twelve protocol-level risks and evaluates security posture across the creation, operation, and update phases through systematic assessment of likelihood, impact, and overall protocol risk, with implications for secure deployment and future standardization. Third, we provide a measurement-driven case study on MCP that formalizes the risk of missing mandatory validation/attestation for executable components as a falsifiable security claim by quantifying wrong-provider tool execution under multi-server composition across representative resolver policies. Collectively, our results highlight key design-induced risk surfaces and provide actionable guidance for secure deployment and future standardization of agent communication ecosystems.
[AI-99] Predictive Associative Memory: Retrieval Beyond Similarity Through Temporal Co-occurrence
【Quick Read】: This paper addresses the limitation of similarity-based retrieval in current neural memory systems, which assumes that useful memories are similar memories and thereby misses a core property of biological memory: association formed through temporal co-occurrence. The key to the solution is Predictive Associative Memory (PAM), which introduces a bidirectional predictive mechanism built on JEPA-style architectures: a standard Outward JEPA predicts future states from incoming sensory data, while a newly proposed Inward JEPA operates over stored experience, predicting associatively reachable past states. This lets the model learn associative structure in embedding space from a continuous experience stream and perform recall grounded in temporal co-occurrence rather than static feature similarity. Experiments on a synthetic benchmark show highly accurate associative recall (Association Precision@1 = 0.970), and even where embedding similarity is uninformative PAM separates experienced-together from never-co-occurring states (AUC = 0.849), confirming its sensitivity to genuine temporal structure and its robustness.
Link: https://arxiv.org/abs/2602.11322
Authors: Jason Dury
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 20 pages, 6 figures, for associated Git: this https URL
Abstract:Current approaches to memory in neural systems rely on similarity-based retrieval: given a query, find the most representationally similar stored state. This assumption – that useful memories are similar memories – fails to capture a fundamental property of biological memory: association through temporal co-occurrence. We propose Predictive Associative Memory (PAM), an architecture in which a JEPA-style predictor, trained on temporal co-occurrence within a continuous experience stream, learns to navigate the associative structure of an embedding space. We introduce an Inward JEPA that operates over stored experience (predicting associatively reachable past states) as the complement to the standard Outward JEPA that operates over incoming sensory data (predicting future states). We evaluate PAM as an associative recall system – testing faithfulness of recall for experienced associations – rather than as a retrieval system evaluated on generalisation to unseen associations. On a synthetic benchmark, the predictor's top retrieval is a true temporal associate 97% of the time (Association Precision@1 = 0.970); it achieves cross-boundary Recall@20 = 0.421 where cosine similarity scores zero; and it separates experienced-together from never-experienced-together states with a discrimination AUC of 0.916 (cosine: 0.789). Even restricted to cross-room pairs where embedding similarity is uninformative, the predictor achieves AUC = 0.849 (cosine: 0.503, chance level). A temporal shuffle control confirms the signal is genuine temporal co-occurrence structure, not embedding geometry: shuffling collapses cross-boundary recall by 90%, replicated across training seeds. All results are stable across seeds (SD 0.006) and query selections (SD ≤ 0.012).
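A minimal sketch of the retrieval rule (hypothetical interfaces; the trained `predictor` stands in for the Inward JEPA): rank stored states by closeness to the predictor's output rather than by similarity to the query:

```python
import torch

def cosine_retrieve(query, memory):
    # Baseline: similarity-based retrieval over stored embeddings (n, d).
    sims = torch.nn.functional.cosine_similarity(query[None, :], memory, dim=1)
    return sims.argsort(descending=True)

def associative_retrieve(query, memory, predictor):
    # PAM-style: the predictor maps the query embedding to the region of
    # embedding space it temporally co-occurred with; stored states are
    # ranked by distance to that prediction, not to the query itself.
    target = predictor(query)                      # predicted associate
    dists = torch.cdist(target[None, :], memory)[0]
    return dists.argsort()                         # nearest associates first
```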
[AI-100] CryptoAnalystBench: Failures in Multi-Tool Long-Form LLM Analysis
【Quick Read】: This paper addresses systematic failures of large language models (LLMs) on high-data-density, multi-tool analyst tasks, especially when integrating large volumes of dynamic, structured and unstructured tool outputs, where existing evaluations fail to capture complex error patterns and may therefore mislead high-stakes decisions. The key to the solution is CryptoAnalystBench, an analyst-aligned benchmark of 198 production crypto and DeFi queries, together with a multi-tool agentic harness and an evaluation pipeline scoring four user-defined dimensions (relevance, temporal relevance, depth, and data consistency); an LLM-as-a-judge rubric, refined with human annotation into a taxonomy of seven higher-order error types, enables scalable identification of critical failure modes and feedback for developing and evaluating long-form, multi-tool-augmented systems.
Link: https://arxiv.org/abs/2602.11304
Authors: Anushri Eswaran,Oleg Golev,Darshan Tank,Sidhant Rahi,Himanshu Tyagi
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Modern analyst agents must reason over complex, high-token inputs, including dozens of retrieved documents, tool outputs, and time-sensitive data. While prior work has produced tool-calling benchmarks and examined factuality in knowledge-augmented systems, relatively little work studies their intersection: settings where LLMs must integrate large volumes of dynamic, structured and unstructured multi-tool outputs. We investigate LLM failure modes in this regime using crypto as a representative high-data-density domain. We introduce (1) CryptoAnalystBench, an analyst-aligned benchmark of 198 production crypto and DeFi queries spanning 11 categories; (2) an agentic harness equipped with relevant crypto and DeFi tools to generate responses across multiple frontier LLMs; and (3) an evaluation pipeline with citation verification and an LLM-as-a-judge rubric spanning four user-defined success dimensions: relevance, temporal relevance, depth, and data consistency. Using human annotation, we develop a taxonomy of seven higher-order error types that are not reliably captured by factuality checks or LLM-based quality scoring. We find that these failures persist even in state-of-the-art systems and can compromise high-stakes decisions. Based on this taxonomy, we refine the judge rubric to better capture these errors. While the judge does not align with human annotators on precise scoring across rubric iterations, it reliably identifies critical failure modes, enabling scalable feedback for developers and researchers studying analyst-style agents. We release CryptoAnalystBench with annotated queries, the evaluation pipeline, judge rubrics, and the error taxonomy, and outline mitigation strategies and open challenges in evaluating long-form, multi-tool-augmented systems.
[AI-101] The PBSAI Governance Ecosystem: A Multi-Agent AI Reference Architecture for Securing Enterprise AI Estates
【Quick Read】: This paper addresses the governance and security challenges facing enterprise and hyperscale AI estates that deploy generative AI, retrieval-augmented generation (RAG) pipelines, and tool-using agents, for which multi-agent cyber defense currently lacks an implementable architecture. The key to the solution is the Practitioners Blueprint for Secure AI (PBSAI), a multi-agent reference architecture organized around a twelve-domain taxonomy that defines bounded agent families mediating between tools and policy through shared context envelopes and structured output contracts, and integrates systems-security techniques such as analytic monitoring, coordinated defense, and adaptive response, thereby enforcing traceability, provenance, and human-in-the-loop safeguards across domains.
Link: https://arxiv.org/abs/2602.11301
Authors: John M. Willis
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 43 pages, plus 12 pages of appendices. One Figure
Abstract:Enterprises are rapidly deploying large language models, retrieval-augmented generation pipelines, and tool-using agents into production, often on shared high-performance computing clusters and cloud accelerator platforms that also support defensive analytics. These systems increasingly function not as isolated models but as AI estates: socio-technical systems spanning models, agents, data pipelines, security tooling, human workflows, and hyperscale infrastructure. Existing governance and security frameworks, including the NIST AI Risk Management Framework and systems security engineering guidance, articulate principles and risk functions but do not provide implementable architectures for multi-agent, AI-enabled cyber defense. This paper introduces the Practitioners Blueprint for Secure AI (PBSAI) Governance Ecosystem, a multi-agent reference architecture for securing enterprise and hyperscale AI estates. PBSAI organizes responsibilities into a twelve-domain taxonomy and defines bounded agent families that mediate between tools and policy through shared context envelopes and structured output contracts. The architecture assumes baseline enterprise security capabilities and encodes key systems-security techniques, including analytic monitoring, coordinated defense, and adaptive response. A lightweight formal model of agents, context envelopes, and ecosystem-level invariants clarifies the traceability, provenance, and human-in-the-loop guarantees enforced across domains. We demonstrate alignment with NIST AI RMF functions and illustrate application in enterprise SOC and hyperscale defensive environments. PBSAI is proposed as a structured, evidence-centric foundation for open ecosystem development and future empirical validation.
[AI-102] Voxtral Realtime
【Quick Read】: This paper addresses the difficulty of real-time automatic speech recognition (ASR) models matching offline-model quality at low latency. Conventional approaches adapt offline models via chunking or sliding windows but suffer from imprecise alignment and poor latency control. The key to the solution is Voxtral Realtime, a natively streaming ASR model with two core innovations: (1) a causal audio encoder enabling end-to-end streaming training with explicit alignment between the audio and text streams; and (2) an Ada RMS-Norm mechanism for improved delay conditioning, boosting stability and accuracy at short delays (e.g., 480 ms). Experiments show that at sub-second latency the model matches Whisper, the most widely deployed offline transcription system.
Link: https://arxiv.org/abs/2602.11298
Authors: Alexander H. Liu,Andy Ehrenberg,Andy Lo,Chen-Yo Sun,Guillaume Lample,Jean-Malo Delignon,Khyathi Raghavi Chandu,Patrick von Platen,Pavankumar Reddy Muddireddy,Rohin Arora,Sanchit Gandhi,Sandeep Subramanian,Soham Ghosh,Srijan Mishra,Abhinav Rastogi,Alan Jeffares,Albert Jiang,Alexandre Sablayrolles,Amélie Héliou,Andrew Bai,Angele Lenglemetz,Anmol Agarwal,Anton Eliseev,Antonia Calvi,Arjun Majumdar,Baptiste Bout,Baptiste Rozière,Baudouin De Monicault,Benjamin Tibi,Clémence Lanfranchi,Connor Chen,Corentin Barreau,Corentin Sautier,Cyprien Courtot,Darius Dabert,Diego de las Casas,Elliot Chane-Sane,Enguerrand Paquin,Faruk Ahmed,Federico Baldassarre,Gabrielle Berrada,Gaëtan Ecrepont,Gauthier Guinet,Genevieve Hayes,Georgii Novikov,Giada Pistilli,Guillaume Martin,Gunjan Dhanuka,Gunshi Gupta,Han Zhou,Indraneel Mukherjee,Irene Zhang,Jaeyoung Kim,Jan Ludziejewski,Jason Rute,Joachim Studnia,John Harvill,Jonas Amar,Josselin Somerville Roberts,Julien Tauran,Karmesh Yadav,Kartik Khandelwal,Kush Jain,Laurence Aitchison,Léonard Blier,Lingxiao Zhao,Louis Martin,Lucile Saulnier,Luyu Gao,Maarten Buyl,Manan Sharma,Margaret Jennings,Marie Pellat,Mark Prins,Mathieu Poirée,Mathilde Guillaumin,Matthieu Dinot,Matthieu Futeral,Maxime Darrin,Maximilian Augustin,Mert Unsal,Mia Chiquier,Nathan Grinsztajn,Neha Gupta,Olivier Bousquet,Olivier Duchenne,Patricia Wang,Paul Jacob,Paul Wambergue,Paula Kurylowicz,Philomène Chagniot,Pierre Stock,Piotr Miłoś,Prateek Gupta,Pravesh Agrawal,Quentin Torroba,Ram Ramrakhya,Rishi Shah,Romain Sauvestre,Roman Soletskyi
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:We introduce Voxtral Realtime, a natively streaming automatic speech recognition model that matches offline transcription quality at sub-second latency. Unlike approaches that adapt offline models through chunking or sliding windows, Voxtral Realtime is trained end-to-end for streaming, with explicit alignment between audio and text streams. Our architecture builds on the Delayed Streams Modeling framework, introducing a new causal audio encoder and Ada RMS-Norm for improved delay conditioning. We scale pretraining to a large-scale dataset spanning 13 languages. At a delay of 480ms, Voxtral Realtime achieves performance on par with Whisper, the most widely deployed offline transcription system. We release the model weights under the Apache 2.0 license.
[AI-103] On Decision-Valued Maps and Representational Dependence
【Quick Read】: This paper addresses the problem that different representations of the same data can lead a computational engine to different discrete outcomes, with some representations preserving a result and others changing it entirely, which undermines reproducibility and auditability. The key to the solution is the formal notion of a decision-valued map, recording for each member of a declared representation family the discrete result it produces, together with the DecisionDB infrastructure, which persists data and computational artifacts in write-once form under content-derived identifiers and supports deterministic replay: each recorded decision identifier is recovered exactly from stored artifacts, with all three identifying fields matching their persisted values. The approach partitions representation space into persistence regions and boundaries and treats decision reuse as a mechanically checkable condition, improving the transparency and traceability of system behavior.
Link: https://arxiv.org/abs/2602.11295
Authors: Gil Raitses
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: 10 pages, 3 figures, 5 tables
Abstract:A computational engine applied to different representations of the same data can produce different discrete outcomes, with some representations preserving the result and others changing it entirely. A decision-valued map records which representations preserve the outcome and which change it, associating each member of a declared representation family with the discrete result it produces. This paper formalizes decision-valued maps and describes DecisionDB, an infrastructure that logs, replays and audits these relationships using identifiers computed from content and artifacts stored in write-once form. Deterministic replay recovers each recorded decision identifier exactly from stored artifacts, with all three identifying fields matching their persisted values. The contribution partitions representation space into persistence regions and boundaries, and treats decision reuse as a mechanically checkable condition.
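A minimal sketch of the logging idea (hypothetical record schema; the actual DecisionDB fields are not specified here): derive identifiers from content and append records to a write-once store so replay can be checked field-for-field:

```python
import hashlib, json

def content_id(obj) -> str:
    # Content-derived identifier: hash of a canonical serialization.
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()

def log_decision(store, engine_cfg, representation, outcome):
    # Three identifying fields, each derived from content, so a
    # deterministic replay must reproduce all of them exactly.
    record = {
        "engine_id": content_id(engine_cfg),
        "repr_id": content_id(representation),
        "decision_id": content_id(
            {"cfg": engine_cfg, "repr": representation, "outcome": outcome}
        ),
        "outcome": outcome,
    }
    store.append(record)  # write-once discipline: append, never mutate
    return record
```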
[AI-104] HiFloat4 Format for Language Model Inference
【Quick Read】: This paper addresses the accuracy loss caused by low-precision quantization (e.g., 4-bit) in deep learning models while keeping hardware implementations efficient and power-friendly. The key to the solution is HiFloat4 (HiF4), a block floating-point (BFP) data format: each HiF4 unit packs 64 4-bit elements sharing 32 bits of scaling metadata, averaging 4.5 bits per value; the metadata encodes a three-level scaling hierarchy that captures inter- and intra-group dynamic range and improves utilization of the representational space; and the large 64-element group size lets matrix multiplications execute in a highly fixed-point manner, markedly reducing hardware area and power. Inference experiments on LLaMA, Qwen, Mistral and other language models show that HiF4 achieves higher average accuracy than the state-of-the-art NVFP4 format.
Link: https://arxiv.org/abs/2602.11287
Authors: Yuanyong Luo,Jing Huang,Yu Cheng,Ziwei Yu,Kaihua Zhang,Kehong Hong,Xinda Ma,Xin Wang,Anping Tong,Guipeng Hu,Yun Xu,Mehran Taghian,Peng Wu,Guanglin Li,Yunke Peng,Tianchi Hu,Minqi Chen,Michael Bi Mi,Hu Liu,Xiping Zhou,Junsong Wang,Qiang Lin,Heng Liao
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: 8 pages, 4 figures
Abstract:This paper introduces HiFloat4 (HiF4), a block floating-point data format tailored for deep learning. Each HiF4 unit packs 64 4-bit elements with 32 bits of shared scaling metadata, averaging 4.5 bits per value. The metadata specifies a three-level scaling hierarchy, capturing inter- and intra-group dynamic range while improving the utilization of the representational space. In addition, the large 64-element group size enables matrix multiplications to be executed in a highly fixed-point manner, significantly reducing hardware area and power consumption. To evaluate the proposed format, we conducted inference experiments on several language models, including LLaMA, Qwen, Mistral, DeepSeek-V3.1 and LongCat. Results show that HiF4 achieves higher average accuracy than the state-of-the-art NVFP4 format across multiple models and diverse downstream tasks.
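The storage arithmetic is easy to verify with a short sketch (only the element/metadata split is taken from the abstract; everything else about the layout is left out):

```python
# Each HiF4 block: 64 elements x 4 bits, plus 32 bits of shared
# three-level scaling metadata.
ELEMS_PER_BLOCK = 64
ELEM_BITS = 4
META_BITS = 32

total_bits = ELEMS_PER_BLOCK * ELEM_BITS + META_BITS   # 256 + 32 = 288
bits_per_value = total_bits / ELEMS_PER_BLOCK          # 288 / 64 = 4.5

print(total_bits, bits_per_value)  # 288 4.5
```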
[AI-105] AI-Driven Clinical Decision Support System for Enhanced Diabetes Diagnosis and Management
【Quick Read】: This paper addresses the challenge primary care physicians face in identifying type 2 diabetes mellitus (T2DM), namely insufficient diagnostic accuracy. The key to the solution is an AI clinical decision support system (AI-CDSS) combining expert knowledge with machine learning, which predicts from key features such as body mass index, fasting plasma glucose, and hemoglobin A1C; it reached 99.8% accuracy for predicting diabetes on the test set and, in a clinical pilot study of 105 patients, showed 98.5% concordance with endocrinology specialists, well above the 85% agreement of non-endocrinology specialists, demonstrating its potential for high-accuracy diagnostic support where specialist resources are lacking.
Link: https://arxiv.org/abs/2602.11237
Authors: Mujeeb Ur Rehman,Imran Rehan,Sohail Khalid
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Identifying type 2 diabetes mellitus can be challenging, particularly for primary care physicians. Clinical decision support systems incorporating artificial intelligence (AI-CDSS) can assist medical professionals in diagnosing type 2 diabetes with high accuracy. This study aimed to assess an AI-CDSS specifically developed for the diagnosis of type 2 diabetes by employing a hybrid approach that integrates expert-driven insights with machine learning techniques. The AI-CDSS was developed (training dataset: n = 650) and tested (test dataset: n = 648) using a dataset of 1298 patients with and without type 2 diabetes. To generate predictions, the algorithm utilized key features such as body mass index, plasma fasting glucose, and hemoglobin A1C. Furthermore, a clinical pilot study involving 105 patients was conducted to assess the diagnostic accuracy of the system in comparison to non-endocrinology specialists. The AI-CDSS showed a high degree of accuracy, with 99.8% accuracy in predicting diabetes, 99.3% in predicting prediabetes, 99.2% in identifying at-risk individuals, and 98.8% in predicting no diabetes. The test dataset revealed a 98.8% agreement between endocrinology specialists and the AI-CDSS. Type 2 diabetes was identified in 45% of 105 individuals in the pilot study. Compared with diabetes specialists, the AI-CDSS scored a 98.5% concordance rate, greatly exceeding that of nonendocrinology specialists, who had an 85% agreement rate. These findings indicate that the AI-CDSS has the potential to be a useful tool for accurately identifying type 2 diabetes, especially in situations in which diabetes specialists are not readily available.
[AI-106] Latent Generative Solvers for Generalizable Long-Term Physics Simulation
【Quick Read】: This paper addresses the stability and generalization of long-horizon simulation across heterogeneous partial differential equation (PDE) systems, where conventional neural-operator methods accumulate error through rollout drift and become unreliable over long forecasts. The key to the solution is the two-stage Latent Generative Solvers (LGS) framework: a pretrained variational autoencoder (VAE) maps diverse PDE states into a shared latent physics space, and a Transformer trained with flow matching learns probabilistic latent dynamics, with an uncertainty knob perturbing latent inputs during training and inference to teach the solver to correct off-manifold rollouts and stabilize autoregressive prediction. In addition, flow forcing updates the system descriptor (context) from model-generated trajectories, aligning training and test conditioning and markedly improving long-term stability. The method keeps short-horizon accuracy while greatly reducing long-horizon drift, and thanks to latent-space learning and an efficient architecture it uses up to 70× fewer FLOPs than non-generative baselines, supporting large-scale pretraining and adaptation under limited fine-tuning budgets.
Link: https://arxiv.org/abs/2602.11229
Authors: Zituo Chen,Haixu Wu,Sili Deng
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:We study long-horizon surrogate simulation across heterogeneous PDE systems. We introduce Latent Generative Solvers (LGS), a two-stage framework that (i) maps diverse PDE states into a shared latent physics space with a pretrained VAE, and (ii) learns probabilistic latent dynamics with a Transformer trained by flow matching. Our key mechanism is an uncertainty knob that perturbs latent inputs during training and inference, teaching the solver to correct off-manifold rollout drift and stabilizing autoregressive prediction. We further use flow forcing to update a system descriptor (context) from model-generated trajectories, aligning train/test conditioning and improving long-term stability. We pretrain on a curated corpus of ~2.5M trajectories at 128² resolution spanning 12 PDE families. LGS matches strong deterministic neural-operator baselines on short horizons while substantially reducing rollout drift on long horizons. Learning in latent space plus efficient architectural choices yields up to 70× lower FLOPs than non-generative baselines, enabling scalable pretraining. We also show efficient adaptation to an out-of-distribution 256² Kolmogorov flow dataset under limited finetuning budgets. Overall, LGS provides a practical route toward generalizable, uncertainty-aware neural PDE solvers that are more reliable for long-term forecasting and downstream scientific workflows.
[AI-107] Credal Concept Bottleneck Models: Structural Separation of Epistemic and Aleatoric Uncertainty
【Quick Read】: This paper addresses the entanglement of predictive uncertainty: conventional methods estimate epistemic uncertainty (model ignorance) and aleatoric uncertainty (data ambiguity) from the same predictive distribution, making the two strongly correlated and blurring their semantics. The key to the solution is a credal-set formulation representing uncertainty as a set of predictive distributions, where epistemic uncertainty corresponds to the size of the set and aleatoric uncertainty to the noise within its elements. The idea is instantiated in a Variational Credal Concept Bottleneck Model whose two disjoint uncertainty heads are trained by disjoint objectives along non-overlapping gradient paths, achieving separation of the uncertainty components by construction rather than post hoc decomposition. Experiments show that the correlation between the two uncertainties drops by more than an order of magnitude, with epistemic uncertainty aligning better with prediction error and aleatoric uncertainty with ground-truth ambiguity.
Link: https://arxiv.org/abs/2602.11219
Authors: Tanmoy Mukherjee,Marius Kloft,Pierre Marquis,Zied Bouraoui
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Decomposing predictive uncertainty into epistemic (model ignorance) and aleatoric (data ambiguity) components is central to reliable decision making, yet most methods estimate both from the same predictive distribution. Recent empirical and theoretical results show these estimates are typically strongly correlated, so changes in predictive spread simultaneously affect both components and blur their semantics. We propose a credal-set formulation in which uncertainty is represented as a set of predictive distributions, so that epistemic and aleatoric uncertainty correspond to distinct geometric properties: the size of the set versus the noise within its elements. We instantiate this idea in a Variational Credal Concept Bottleneck Model with two disjoint uncertainty heads trained by disjoint objectives and non-overlapping gradient paths, yielding separation by construction rather than post hoc decomposition. Across multi-annotator benchmarks, our approach reduces the correlation between epistemic and aleatoric uncertainty by over an order of magnitude compared to standard methods, while improving the alignment of epistemic uncertainty with prediction error and aleatoric uncertainty with ground-truth ambiguity.
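A toy reading of the geometry (a minimal sketch under the stated definitions, not the paper's estimator): given a finite credal set of categorical distributions, the two uncertainties are measured from different properties of the set:

```python
import numpy as np

def credal_uncertainties(dists):
    # dists: (k, num_classes), each row one distribution in the credal set.
    dists = np.asarray(dists, dtype=float)
    # Epistemic ~ size of the set: spread of members around their center.
    center = dists.mean(axis=0)
    epistemic = np.abs(dists - center).sum(axis=1).mean()
    # Aleatoric ~ noise within elements: average entropy of the members.
    aleatoric = (-(dists * np.log(dists + 1e-12)).sum(axis=1)).mean()
    return epistemic, aleatoric

# Tight set of noisy distributions: low epistemic, high aleatoric.
print(credal_uncertainties([[0.5, 0.5], [0.52, 0.48]]))
# Spread-out set of confident distributions: high epistemic, low aleatoric.
print(credal_uncertainties([[0.95, 0.05], [0.05, 0.95]]))
```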
[AI-108] SWE-MiniSandbox: Container-Free Reinforcement Learning for Building Software Engineering Agents ICML
【Quick Read】: This paper addresses the heavy storage overhead, slow environment initialization, and container-management privileges required by per-task containers in reinforcement learning (RL) pipelines for software engineering (SWE) agents. The key to the solution is SWE-MiniSandbox, a lightweight container-free isolation mechanism that executes tasks in isolated workspaces backed by kernel-level isolation and uses lightweight environment pre-caching, cutting disk usage to about 5% of the container baseline and environment preparation time to about 25%, while matching the evaluation performance of standard container pipelines, offering a practical route to scaling RL-based SWE agents in resource-constrained settings.
Link: https://arxiv.org/abs/2602.11210
Authors: Danlong Yuan,Wei Wu,Zhengren Wang,Xueliang Zhao,Huishuai Zhang,Dongyan Zhao
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ICML under review
Abstract:Reinforcement learning (RL) has become a key paradigm for training software engineering (SWE) agents, but existing pipelines typically rely on per-task containers for isolation. At scale, pre-built container images incur substantial storage overhead, slow environment setup, and require container-management privileges. We propose SWE-MiniSandbox, a lightweight, container-free method that enables scalable RL training of SWE agents without sacrificing isolation. Instead of relying on per-instance containers, SWE-MiniSandbox executes each task in an isolated workspace backed by kernel-level mechanisms, substantially reducing system overhead. It leverages lightweight environment pre-caching techniques to eliminate the need for bulky container images. As a result, our approach lowers disk usage to approximately 5% of that required by container-based pipelines and reduces environment preparation time to about 25% of the container baseline. Empirical results demonstrate that SWE-MiniSandbox achieves evaluation performance comparable to standard container-based pipelines. By removing the dependency on heavy container infrastructure, SWE-MiniSandbox offers a practical and accessible foundation for scaling RL-based SWE agents, particularly in resource-constrained research environments.
[AI-109] Zero-Sacrifice Persistent-Robustness Adversarial Defense for Pre-Trained Encoders
【Quick Read】: This paper addresses the vulnerability of pre-trained encoders to downstream-agnostic adversarial examples (DAEs), which mislead downstream models without any knowledge of the downstream task, while existing defenses rely on task-specific adversarial fine-tuning that limits generalization, causes catastrophic forgetting, and degrades benign performance. The key to the solution is the Zero-Sacrifice Persistent-Robustness Adversarial Defense (ZePAD), whose core is a dual-branch design: a Multi-Pattern Adversarial Enhancement Branch (MPAE-Branch) uses two adversarially fine-tuned encoders to strengthen resistance to perturbations, while a Benign Memory Preservation Branch (BMP-Branch) trained on local data ensures benign performance is not harmed. Crucially, ZePAD can detect DAEs directly from branch confidence, without any additional adversarial-example identification task, and a single adversarial fine-tuning delivers persistent robustness across downstream tasks, realizing the zero-sacrifice property.
Link: https://arxiv.org/abs/2602.11204
Authors: Zhuxin Lei,Ziyuan Yang,Yi Zhang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:The widespread use of publicly available pre-trained encoders from self-supervised learning (SSL) has exposed a critical vulnerability: their susceptibility to downstream-agnostic adversarial examples (DAEs), which are crafted without knowledge of the downstream tasks but capable of misleading downstream models. While several defense methods have been explored recently, they rely primarily on task-specific adversarial fine-tuning, which inevitably limits generalizability, causes catastrophic forgetting, and deteriorates benign performance. Unlike previous works, we propose a more rigorous defense goal that requires only a single tuning for diverse downstream tasks to defend against DAEs and preserve benign performance. To achieve this defense goal, we introduce the Zero-Sacrifice Persistent-Robustness Adversarial Defense (ZePAD), which is inspired by the inherent sensitivity of neural networks to data characteristics. Specifically, ZePAD is a dual-branch structure, which consists of a Multi-Pattern Adversarial Enhancement Branch (MPAE-Branch) that uses two adversarially fine-tuned encoders to strengthen adversarial resistance. The Benign Memory Preservation Branch (BMP-Branch) is trained on local data to ensure adversarial robustness does not compromise benign performance. Surprisingly, we find that ZePAD can directly detect DAEs by evaluating branch confidence, without introducing any adversarial example identification task during training. Notably, by enriching feature diversity, our method enables a single adversarial fine-tuning to defend against DAEs across downstream tasks, thereby achieving persistent robustness. Extensive experiments on 11 SSL methods and 6 datasets validate its effectiveness. In certain cases, it achieves a 29.20% improvement in benign performance and a 73.86% gain in adversarial robustness, highlighting its zero-sacrifice property.
[AI-110] Interwhen: A Generalizable Framework for Verifiable Reasoning with Test-time Monitors
【Quick Read】: This paper addresses the absence of effective verification for reasoning models in high-stakes settings (such as physical-world deployment or law and finance): existing methods either follow the generate-test paradigm, verifying only after the final answer is produced and hence inefficient, or the step-extraction paradigm, artificially decomposing execution into structured steps and constraining the model's natural reasoning strategies. The key to the solution is interwhen, a test-time verification framework whose core idea is meta-prompting: identify the verifiable properties that any partial solution should satisfy and guide the model to emit its reasoning trace in a specific format, so intermediate results can be parsed and checked directly. Without modifying the model, the framework supports both self-verification and external verification, improving the efficiency and reliability of reasoning without loss of accuracy.
Link: https://arxiv.org/abs/2602.11202
Authors: Vishak K Bhat,Prateek Chanda,Ashmit Khandelwal,Maitreyi Swaroop,Vineeth N. Balasubramanian,Subbarao Kambhampati,Nagarajan Natarajan,Amit Sharma
Affiliations: Unknown
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: 23 pages, 5 figures
Abstract:We present a test-time verification framework, interwhen, that ensures that the output of a reasoning model is valid w.r.t. a given set of verifiers. Verified reasoning is an important goal in high-stakes scenarios such as deploying agents in the physical world or in domains such as law and finance. However, current techniques either rely on the generate-test paradigm that verifies only after the final answer is produced, or verify partial output through a step-extraction paradigm where the task execution is externally broken down into structured steps. The former is inefficient while the latter artificially restricts a model's problem-solving strategies. Instead, we propose to verify a model's reasoning trace as-is, taking full advantage of a model's reasoning capabilities while verifying and steering the model's output only when needed. The key idea is meta-prompting: identifying the verifiable properties that any partial solution should satisfy and then prompting the model to follow a custom format in its trace such that partial outputs can be easily parsed and checked. We consider both self-verification and external verification and find that interwhen provides a useful abstraction to provide feedback and steer reasoning models in each case. Using self-verification, interwhen obtains state-of-the-art results on early stopping reasoning models, without any loss in accuracy. Using external verifiers, interwhen obtains a 10 p.p. improvement in accuracy over test-time scaling methods, while ensuring 100% soundness and being 4x more efficient. The code for interwhen is available at this https URL
[AI-111] MELINOE: Fine-Tuning Enables Memory-Efficient Inference for Mixture-of-Experts Models
【Quick Read】: This paper addresses the memory bottleneck of Mixture-of-Experts (MoE) models in resource-constrained settings, where all parameters must be loaded into GPU memory; existing methods offload some experts to CPU memory and migrate them back when activated, but suffer significant I/O latency. The key to the solution is MELINOE, which fine-tunes the model to more strongly prefer activating a smaller number of experts per input sequence, so that these preferred experts can be cached in GPU memory, lowering expert churn and CPU-GPU transfer overhead; experiments show 1.2-3× higher throughput than efficient baselines and up to 14.7× over transfer-heavy baselines, while retaining or even improving downstream task performance.
Link: https://arxiv.org/abs/2602.11192
Authors: Arian Raje,Anupam Nayak,Gauri Joshi
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Mixture-of-Experts (MoE) model architectures can significantly reduce the number of activated parameters per token, enabling computationally efficient training and inference. However, their large overall parameter counts and model sizes have precluded their widespread usage in resource-constrained settings as all of the parameters must still be loaded into GPU memory. Prior works aim to address this memory bottleneck by offloading certain experts into CPU memory and porting them to GPU memory only when they are activated. In practice, these methods suffer from the significant I/O latency incurred by expert transfer. We present MELINOE, a method that fine-tunes an MoE model to more strongly prefer activating a smaller number of experts per sequence. Caching these preferred experts in GPU memory reduces expert churn and CPU-GPU transfer overhead. MELINOE increases throughput by 1.2-3× over efficient baselines and up to 14.7× over transfer-heavy baselines while retaining or even improving the performance of the model on a downstream task, making it a reliable method for improving MoE inference efficiency.
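A minimal sketch of the serving side (hypothetical interfaces; the fine-tuning objective itself is not shown): keep a fixed budget of experts resident on GPU and fetch the rest from CPU memory on demand:

```python
import torch

class ExpertCache:
    """LRU cache of expert weights on GPU; misses pay a CPU->GPU copy."""

    def __init__(self, cpu_experts, capacity):
        self.cpu_experts = cpu_experts  # dict: expert_id -> CPU tensor
        self.capacity = capacity        # number of experts kept on GPU
        self.gpu = {}                   # expert_id -> GPU tensor
        self.order = []                 # least-recently-used bookkeeping

    def get(self, expert_id):
        if expert_id not in self.gpu:   # miss: transfer from CPU
            if len(self.gpu) >= self.capacity:
                del self.gpu[self.order.pop(0)]
            self.gpu[expert_id] = self.cpu_experts[expert_id].to(
                "cuda", non_blocking=True
            )
        else:
            self.order.remove(expert_id)
        self.order.append(expert_id)
        return self.gpu[expert_id]
```

Fine-tuning the router so that each sequence concentrates on a few preferred experts raises this cache's hit rate, which is where the reported throughput gains come from.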
[AI-112] Time-TK: A Multi-Offset Temporal Interaction Framework Combining Transformer and Kolmogorov-Arnold Networks for Time Series Forecasting
【Quick Read】: This paper addresses the long-sequence information bottleneck caused by the conventional strategy of embedding each time step as an independent token in time series forecasting, whose root cause is that this strategy destroys the multi-offset temporal correlation within a sequence, i.e., the fine-grained dependency structure spanning different time steps, which is especially prevalent in regular Web data. The key to the solution is a new temporal embedding paradigm, Multi-Offset Time Embedding (MOTE): a theoretically derived upper bound on the approximate reconstruction performance of standard token embedding guides a concise yet effective embedding design that mitigates the degradation. Building on this, the Time-TK architecture uses a Multi-Offset Interactive KAN to extract offset-specific temporal patterns from multiple offset sub-sequences, and a Multi-Offset Temporal Interaction mechanism to efficiently capture the complex dependencies between sub-sequences and integrate global information, significantly improving forecasting accuracy.
Link: https://arxiv.org/abs/2602.11190
Authors: Fan Zhang,Shiming Fan,Hua Wang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Time series forecasting is crucial for the World Wide Web and represents a core technical challenge in ensuring the stable and efficient operation of modern web services, such as intelligent transportation and website throughput. However, we have found that existing methods typically employ a strategy of embedding each time step as an independent token. This paradigm introduces a fundamental information bottleneck when processing long sequences; the root cause is that independent token embedding destroys a crucial structure within the sequence, what we term multi-offset temporal correlation. This refers to the fine-grained dependencies embedded within the sequence that span across different time steps, which are especially prevalent in regular Web data. To fundamentally address this issue, we propose a new perspective on time series embedding. We provide an upper bound on the approximate reconstruction performance of token embedding, which guides our design of a concise yet effective Multi-Offset Time Embedding method to mitigate the performance degradation caused by standard token embedding. Furthermore, our MOTE can be integrated into various existing models and serve as a universal building block. Based on this paradigm, we further design a novel forecasting architecture named Time-TK. This architecture first utilizes a Multi-Offset Interactive KAN to learn and represent specific temporal patterns among multiple offset sub-sequences. Subsequently, it employs an efficient Multi-Offset Temporal Interaction mechanism to effectively capture the complex dependencies between these sub-sequences, achieving global information integration. Extensive experiments on 14 real-world benchmark datasets, covering domains such as traffic flow and BTC/USDT throughput, demonstrate that Time-TK significantly outperforms all baseline models, achieving state-of-the-art forecasting accuracy.
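To illustrate what the multiple offset sub-sequences look like (a minimal sketch; the actual MOTE parameterization is not reproduced here), a series can be split into interleaved strided views, one per offset:

```python
import numpy as np

def multi_offset_subsequences(x, stride):
    # x: (seq_len,) series. The k-th sub-sequence collects
    # x[k], x[k + stride], x[k + 2*stride], ..., exposing correlations
    # at one fixed temporal offset instead of mixing all of them.
    return [x[k::stride] for k in range(stride)]

x = np.arange(12)
for sub in multi_offset_subsequences(x, stride=3):
    print(sub)
# [0 3 6 9]  [1 4 7 10]  [2 5 8 11]
```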
[AI-113] TDPNavigator-Placer: Thermal- and Wirelength-Aware Chiplet Placement in 2.5D Systems Through Multi-Agent Reinforcement Learning
【Quick Read】: This paper addresses chiplet placement optimization in 2.5D integrated circuits, where wirelength and thermal management are inherently conflicting design objectives that are hard to balance; existing methods typically reduce the multi-objective problem to a single objective via weighted sums and cannot handle the competing requirements of practical systems. The key to the solution is TDPNavigator-Placer, a multi-agent reinforcement learning framework that assigns the conflicting objectives to specialized agents with distinct reward mechanisms and environmental constraints, optimizing dynamically within a unified placement paradigm and yielding a significantly improved Pareto front with better wirelength-thermal trade-offs.
Link: https://arxiv.org/abs/2602.11187
Authors: Yubo Hou,Furen Zhuang,Partha Pratim Kundu,Sezin Ata Kircali,Jie Wang,Mihai Dragos Rotaru,Dutta Rahul,Ashish James
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:The rapid growth of electronics has accelerated the adoption of 2.5D integrated circuits, where effective automated chiplet placement is essential as systems scale to larger and more heterogeneous chiplet assemblies. Existing placement methods typically focus on minimizing wirelength or transforming multi-objective optimization into a single objective through weighted sum, which limits their ability to handle competing design requirements. Wirelength reduction and thermal management are inherently conflicting objectives, making prior approaches inadequate for practical deployment. To address this challenge, we propose TDPNavigator-Placer, a novel multi-agent reinforcement learning framework that dynamically optimizes placement based on chiplet’s thermal design power (TDP). This approach explicitly assigns these inherently conflicting objectives to specialized agents, each operating under distinct reward mechanisms and environmental constraints within a unified placement paradigm. Experimental results demonstrate that TDPNavigator-Placer delivers a significantly improved Pareto front over state-of-the-art methods, enabling more balanced trade-offs between wirelength and thermal performance.
[AI-114] Spectra: Rethinking Optimizers for LLMs Under Spectral Anisotropy
【Quick Read】: This paper addresses the anisotropy of gradient signals in large language model (LLM) training: gradient energy concentrates in a few dominant spectral directions (the spike) while context-specific information resides in a long spectral tail, so optimizers over-respond to the high-energy directions and suppress tail learning. The key to the solution is Spectra, a spike-aware optimizer that tracks the low-rank spike subspace via cached, warm-started power iteration and applies low-rank spectral shaping, effectively damping the dominant spike directions at almost no extra compute while avoiding amplification of the noise-sensitive spectral tail, thereby improving training efficiency and downstream performance.
Link: https://arxiv.org/abs/2602.11185
Authors: Zhendong Huang,Hengjie Cao,Fang Dong,Ruijun Huang,Mengyi Chen,Yifeng Yang,Xin Zhang,Anrui Chen,Mingzhi Dong,Yujiang Wang,Jinlong Hou,Qin Lv,Robert P. Dick,Yuan Cheng,Fan Yang,Tun Lu,Li Shang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Gradient signals in LLM training are highly anisotropic: recurrent linguistic structure concentrates energy into a small set of dominant spectral directions, while context specific information resides in a long tail. We show that this spike tail separation persists throughout training, with the spike occupying only about 1.5% of directions yet dominating optimizer statistics. This dominance suppresses tail learning by contracting tail updates through second moment normalization and tightening the globally stable learning rate bound. Motivated by this analysis, we propose Spectra, a spike aware optimizer that suppresses the dominant low rank spike subspace without amplifying the noise sensitive spectral tail. Spectra tracks the spike subspace via cached, warm started power iteration and applies low rank spectral shaping with negligible overhead and substantially reduced optimizer state memory. On LLaMA3 8B trained on 50B tokens, Spectra reaches the same target loss 30% faster than AdamW, reduces per step end to end overhead by 0.7%, cuts optimizer state memory by 49.25%, and improves average downstream accuracy by 1.62%. Compared to Muon, Spectra is 5.1x faster in optimizer processing time, achieves a lower final loss, and improves average accuracy by 0.66%.
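A minimal sketch of the spike-tracking mechanism (the shaping rule shown is a hypothetical simplification, not Spectra's exact update): warm-started power iteration maintains an approximate top subspace of the gradient, which is then damped before the optimizer step:

```python
import torch

def track_spike(grad, Q, iters=1):
    # grad: (m, n) gradient matrix; Q: (m, r) orthonormal basis cached from
    # the previous step (warm start), with r << min(m, n).
    for _ in range(iters):
        Z = grad @ (grad.T @ Q)    # one power-iteration step on G G^T
        Q, _ = torch.linalg.qr(Z)  # re-orthonormalize
    return Q

def shape_gradient(grad, Q, damp=0.1):
    # Split the gradient into its spike component (inside span(Q)) and the
    # tail, then shrink only the spike so tail learning is not suppressed.
    spike = Q @ (Q.T @ grad)
    return damp * spike + (grad - spike)
```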
[AI-115] KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models ICLR2026
【Quick Read】: This paper addresses the severe performance degradation of Mixture of Experts (MoE) models under ultra-low-bit quantization, caused by redundant representations across experts and cumulative output bias. The core challenges are: (1) multiple experts share similar weight representations, so vector quantization (VQ) repeatedly encodes near-identical features and wastes limited codebook capacity; and (2) expert aggregation in MoE layers amplifies quantization-induced output bias, producing distributional shifts. The key to the solution is the KBVQ-MoE framework, which combines two innovations: a KLT-guided singular value decomposition (SVD) for input-driven redundancy elimination, which extracts dominant weight components and shares them across experts; and quantizing only the expert-specific (non-redundant) representations while correcting the quantized outputs via channel-wise affine compensation, which stabilizes distributions and improves accuracy. Experiments show the method stays close to the floating-point baseline even under 3-bit quantization, clearly outperforming existing schemes.
Link: https://arxiv.org/abs/2602.11184
Authors: Zukang Xu,Zhixiong Zhao,Xing Hu,Zhixuan Chen,Dawei Yang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by ICLR 2026
Abstract:Mixture of Experts (MoE) models have achieved great success by significantly improving performance while maintaining computational efficiency through sparse expert activation. However, their enormous parameter sizes and memory demands pose major challenges for deployment in resource-constrained environments. Vector Quantization (VQ) offers a promising approach for ultra-low-bit compression in Large Language Models (LLMs) by leveraging a codebook, where weight vectors are mapped to the most similar discrete codewords. Yet, directly applying VQ to MoEs often leads to substantial performance degradation due to two critical obstacles: (1) redundant representations among experts cause VQ to repeatedly quantize similar representations for each expert, resulting in inefficient use of limited codebook capacity; and (2) cumulative output bias is amplified by expert aggregation in MoE layers, leading to distributional shifts in the quantized outputs. To address these issues, we propose KBVQ-MoE, a novel VQ framework to enhance extremely low-bit quantization for MoE-based LLMs. KBVQ-MoE integrates two techniques: (1) input-driven redundancy elimination, where a Karhunen-Loeve Transform (KLT) guided singular value decomposition (SVD) extracts dominant weight components and shares them across experts; and (2) bias-corrected output stabilization, where vector quantization is applied only to expert-specific (non-redundant) representations and the quantized outputs are corrected via channel-wise affine compensation. Experiments on various MoE LLMs demonstrate that KBVQ-MoE preserves accuracy substantially better than existing quantization methods. For example, 3-bit quantization of Qwen1.5-MoE-A2.7B achieves an average accuracy of 67.99, nearly identical to the FP16 baseline of 68.07, underscoring KBVQ-MoE’s potential for efficient deployment on edge devices and other resource-constrained platforms.
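A minimal sketch of the two-part recipe described above, under simplifying assumptions: plain SVD of the averaged expert weights stands in for the paper's KLT-guided decomposition, and the channel-wise affine correction is applied to weight tensors rather than to layer outputs. All names and shapes are illustrative.

```python
import numpy as np

def split_shared_and_residual(expert_weights, rank=4):
    """Extract a dominant low-rank component shared across experts (plain SVD
    here, standing in for the paper's KLT-guided SVD) and return per-expert
    residuals, which are the only parts that get vector-quantized."""
    avg = np.mean(expert_weights, axis=0)               # (out, in)
    U, S, Vt = np.linalg.svd(avg, full_matrices=False)
    shared = (U[:, :rank] * S[:rank]) @ Vt[:rank]       # shared component
    return shared, [W - shared for W in expert_weights]

def vector_quantize(W, codebook):
    """Map each row of W to its nearest codeword (naive exhaustive search)."""
    d2 = ((W[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook[d2.argmin(axis=1)]

def affine_correct(W_q, W_ref, eps=1e-8):
    """Channel-wise affine compensation: match the first two moments of each
    output channel of the quantized tensor to the reference."""
    scale = W_ref.std(axis=1, keepdims=True) / (W_q.std(axis=1, keepdims=True) + eps)
    shift = W_ref.mean(axis=1, keepdims=True) - scale * W_q.mean(axis=1, keepdims=True)
    return scale * W_q + shift

rng = np.random.default_rng(0)
experts = [rng.normal(size=(16, 8)) for _ in range(4)]
shared, residuals = split_shared_and_residual(experts)
codebook = rng.normal(size=(32, 8))                     # 32 codewords of width 8
dequant = [shared + affine_correct(vector_quantize(R, codebook), R) for R in residuals]
```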
[AI-116] ChaosBench-Logic: A Benchmark for Logical and Symbolic Reasoning on Chaotic Dynamical Systems AAAI-26
【Quick Read】: This paper addresses the brittleness of large language models (LLMs) on tasks requiring precise logical and symbolic reasoning, focusing on the highly demanding domain of chaotic dynamical systems, where chaos is deterministic yet often mistaken for randomness or complexity. The key to the solution is ChaosBench-Logic, a benchmark built on a first-order logic (FOL) ontology that covers 30 distinct dynamical systems, each annotated with truth assignments for 11 semantic predicates, and generates 621 questions spanning seven categories of logical reasoning, including multi-hop implications, cross-system analogies, and counterfactuals. The benchmark quantifies performance via logical accuracy, implication consistency, dialogue coherence, and contradiction detection, and reveals that current frontier models still score 0% on compositional items, exposing weak global reasoning consistency and providing a quantitative diagnostic tool and research foundation for neuro-symbolic approaches to scientific reasoning in LLMs.
Link: https://arxiv.org/abs/2601.01982
Authors: Noel Thomas
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: 7 pages, 0 figures, Accepted to AAAI-26 Bridge Program: Logical and Symbolic Reasoning in Language Models (camera-ready)
Abstract:Large language models (LLMs) excel at natural language tasks but remain brittle in domains requiring precise logical and symbolic reasoning. Chaotic dynamical systems provide an especially demanding test because chaos is deterministic yet often misinterpreted as randomness or complexity. We introduce ChaosBench-Logic, a benchmark that evaluates LLM reasoning across 30 diverse dynamical systems using a unified first-order logic (FOL) ontology. Each system is annotated with truth assignments for 11 semantic predicates, and 621 questions are generated across seven reasoning categories, including multi-hop implications, cross-system analogies, counterfactual reasoning, bias probes, and multi-turn dialogues. We define metrics for logical accuracy, implication consistency, dialogue coherence, and contradiction, and we release an open-source evaluation pipeline. Initial experiments show that frontier LLMs such as GPT-4, Claude 3.5 Sonnet, Gemini 2.5 Flash, and the open-source LLaMA-3 70B achieve 91-94% per-item accuracy, yet still score 0% on compositional items and exhibit fragile global coherence. Dialogue-level accuracy ranges from 53.1% (GPT-4 CoT) to 75.5% (LLaMA-3 zero-shot). ChaosBench-Logic provides a rigorous testbed for diagnosing such failures and a foundation for developing neuro-symbolic approaches that improve scientific reasoning in LLMs.
[AI-117] Creative Ownership in the Age of AI
【Quick Read】: This paper addresses the difficulty of applying copyright-infringement doctrine to generative AI: the existing "substantial similarity" standard cannot handle AI outputs that imitate the style of a work without directly copying its content. The key to the solution is a new infringement criterion: a generative AI output infringes on a given work if it could not have been generated without that work in the training corpus. To operationalize this criterion, the authors model a generative system as a closure operator mapping a corpus of existing works to a set of new works, and use it to characterize permissible generation. The theoretical analysis reveals a sharp asymptotic dichotomy: when the organic creation process is light-tailed, the influence of individual works eventually vanishes and regulation imposes no real limits on AI generation; when it is heavy-tailed, regulation can remain persistently constraining.
Link: https://arxiv.org/abs/2602.12270
Authors: Annie Liang,Jay Lu
Affiliations: Unknown
Subjects: Theoretical Economics (econ.TH); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments:
Abstract:Copyright law focuses on whether a new work is “substantially similar” to an existing one, but generative AI can closely imitate style without copying content, a capability now central to ongoing litigation. We argue that existing definitions of infringement are ill-suited to this setting and propose a new criterion: a generative AI output infringes on an existing work if it could not have been generated without that work in its training corpus. To operationalize this definition, we model generative systems as closure operators mapping a corpus of existing works to an output of new works. AI generated outputs are permissible if they do not infringe on any existing work according to our criterion. Our results characterize structural properties of permissible generation and reveal a sharp asymptotic dichotomy: when the process of organic creations is light-tailed, dependence on individual works eventually vanishes, so that regulation imposes no limits on AI generation; with heavy-tailed creations, regulation can be persistently constraining.
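The infringement criterion stated in the abstract admits a compact set-theoretic formulation; the notation below is ours, not the paper's.

```latex
% Let C be the corpus of existing works and G the generative system,
% modeled as a closure operator mapping a corpus to the set of works
% it can produce. For an output y \in G(C):
\[
  y \text{ infringes on } w \in C
  \iff
  y \notin G\bigl(C \setminus \{w\}\bigr),
\qquad
  \mathrm{Permissible}(C)
  \;=\;
  \bigcap_{w \in C} G\bigl(C \setminus \{w\}\bigr).
\]
```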
[AI-118] On the implicit regularization of Langevin dynamics with projected noise
【Quick Read】: This paper addresses the limited understanding of how symmetry shapes stochastic gradient descent (SGD) in over-parametrized models, in particular how noise with geometric structure reveals symmetry-induced implicit regularization. The key to the solution is a coupling argument built on the geometry of group orbits: the authors prove that when both the initial and target densities are invariant under the group action, Langevin dynamics with noise projected orthogonally to the orbits is equal in law to standard isotropic Langevin dynamics plus an additional drift proportional to the gradient of the negative log volume of the group orbit; this drift is identified as the mean curvature of the orbits, revealing a novel form of implicit regularization.
Link: https://arxiv.org/abs/2602.12257
Authors: Govind Menon,Austin J. Stromme,Adrien Vacher
Affiliations: Unknown
Subjects: Probability (math.PR); Artificial Intelligence (cs.AI)
Comments: 30 pages, 1 figure
Abstract:We study Langevin dynamics with noise projected onto the directions orthogonal to an isometric group action. This mathematical model is introduced to shed new light on the effects of symmetry on stochastic gradient descent for over-parametrized models. Our main result identifies a novel form of implicit regularization: when the initial and target density are both invariant under the group action, Langevin dynamics with projected noise is equivalent in law to Langevin dynamics with isotropic diffusion but with an additional drift term proportional to the negative log volume of the group orbit. We prove this result by constructing a coupling of the two processes via a third process on the group itself, and identify the additional drift as the mean curvature of the orbits.
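As a numerical sanity check of the stated equivalence, the sketch below simulates the simplest isometric action, SO(2) on R^2, where the orbit through x is a circle of volume 2π|x|, so the extra drift is -x/|x|^2. This construction is ours, assuming the Gaussian (rotation-invariant) initialization and quadratic potential shown; the paper's general result covers arbitrary isometric group actions.

```python
import numpy as np

rng = np.random.default_rng(1)
dt, n_steps, n_paths = 1e-3, 2000, 20000
grad_V = lambda x: x                              # V(x) = |x|^2/2, rotation-invariant

def unit_radial(x, eps=1e-8):
    r = np.linalg.norm(x, axis=1, keepdims=True) + eps
    return x / r, r

xa = rng.normal(size=(n_paths, 2))                # rotation-invariant initialization
xb = xa.copy()
for _ in range(n_steps):
    # (a) noise projected orthogonally to the SO(2) orbits (radial direction only)
    ua, _ = unit_radial(xa)
    xa = xa - grad_V(xa) * dt + np.sqrt(2 * dt) * rng.normal(size=(n_paths, 1)) * ua
    # (b) isotropic noise + extra drift -grad log(orbit volume) = -x/|x|^2
    ub, rb = unit_radial(xb)
    xb = xb + (-grad_V(xb) - ub / rb) * dt + np.sqrt(2 * dt) * rng.normal(size=(n_paths, 2))

# The radial marginals of the two processes should approximately agree.
print(np.quantile(np.linalg.norm(xa, axis=1), [0.25, 0.5, 0.75]))
print(np.quantile(np.linalg.norm(xb, axis=1), [0.25, 0.5, 0.75]))
```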
[AI-119] AVAE: A VAE with Adaptable Priors Explains Contextual Modulation in the Visual Cortex ICLR2026
【Quick Read】: This paper asks whether the visual system can flexibly learn task-specific contextual priors and deploy them in primary visual cortex (V1) during task execution, thereby shaping perceptual decisions. The key to the solution is a Task-Amortized Variational Autoencoder (TAVAE), which acquires task-specific priors efficiently by reusing previously learned representations and models neuronal activity as the latent posteriors of a generative model. The framework shows that when stimuli violate the trained task statistics, the TAVAE posterior exhibits signatures of uncertainty that match the bimodal response profiles observed in mouse V1 recordings, indicating that a task-optimized generative model accounts for key characteristics of V1 population activity, including within-day response updates.
Link: https://arxiv.org/abs/2602.11956
Authors: Balázs Meszéna,Keith T. Murray,Julien Corbo,O. Batuhan Erkat,Márton A. Hajnal,Pierre-Olivier Polack,Gergő Orbán
Affiliations: Unknown
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ICLR 2026
Abstract:The brain interprets visual information through learned regularities, a computation formalized as probabilistic inference under a prior. The visual cortex establishes priors for this inference, some delivered through established top-down connections that inform low-level cortices about statistics represented at higher levels in the cortical hierarchy. While evidence shows that adaptation leads to priors reflecting the structure of natural images, it remains unclear whether similar priors can be flexibly acquired when learning a specific task. To investigate this, we built a generative model of V1 optimized for a simple discrimination task and analyzed it together with large-scale recordings from mice performing an analogous task. In line with recent approaches, we assumed that neuronal activity in V1 corresponds to latent posteriors in the generative model, enabling investigation of task-related priors in neuronal responses. To obtain a flexible test bed, we extended the VAE formalism so that a task can be acquired efficiently by reusing previously learned representations. Task-specific priors learned by this Task-Amortized VAE were used to investigate biases in mice and model when presenting stimuli that violated trained task statistics. Mismatch between learned task statistics and incoming sensory evidence produced signatures of uncertainty in stimulus category in the TAVAE posterior, reflecting properties of bimodal response profiles in V1 recordings. The task-optimized generative model accounted for key characteristics of V1 population activity, including within-day updates to population responses. Our results confirm that flexible task-specific contextual priors can be learned on demand by the visual system and deployed as early as the entry level of visual cortex.
[AI-120] Provable Offline Reinforcement Learning for Structured Cyclic MDPs
【Quick Read】: This paper addresses offline learning in multi-step decision problems with heterogeneous stage-specific dynamics, transitions, and discount factors, formulated as a cyclic Markov decision process (MDP) in which optimizing the policy at one stage perturbs the state distributions of subsequent stages and propagates distribution shift around the cycle. The key to the solution is a modular structural framework that decomposes the cyclic process into stage-wise sub-problems, instantiated as CycleFQI, an extension of fitted Q-iteration that maintains a vector of stage-specific Q-functions to capture within-stage sequences and between-stage transitions and supports partial control (some stages optimized while others follow predefined policies). Under Besov regularity assumptions, the authors establish finite-sample suboptimality bounds and global convergence rates, mitigating the curse of dimensionality, and they further propose a sieve-based method for asymptotic inference of optimal policy values.
Link: https://arxiv.org/abs/2602.11679
Authors: Kyungbok Lee,Angelica Cristello Sarteau,Michael R. Kosorok
Affiliations: Unknown
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC); Methodology (stat.ME)
Comments: 65 pages, 4 figures. Submitted to JMLR
Abstract:We introduce a novel cyclic Markov decision process (MDP) framework for multi-step decision problems with heterogeneous stage-specific dynamics, transitions, and discount factors across the cycle. In this setting, offline learning is challenging: optimizing a policy at any stage shifts the state distributions of subsequent stages, propagating mismatch across the cycle. To address this, we propose a modular structural framework that decomposes the cyclic process into stage-wise sub-problems. While generally applicable, we instantiate this principle as CycleFQI, an extension of fitted Q-iteration enabling theoretical analysis and interpretation. It uses a vector of stage-specific Q-functions, tailored to each stage, to capture within-stage sequences and transitions between stages. This modular design enables partial control, allowing some stages to be optimized while others follow predefined policies. We establish finite-sample suboptimality error bounds and derive global convergence rates under Besov regularity, demonstrating that CycleFQI mitigates the curse of dimensionality compared to monolithic baselines. Additionally, we propose a sieve-based method for asymptotic inference of optimal policy values under a margin condition. Experiments on simulated and real-world Type 1 Diabetes data sets demonstrate CycleFQI’s effectiveness.
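A minimal sketch of the stage-wise fitted Q-iteration idea: one Q-function per stage, updated in backward sweeps over the cycle with stage-specific discount factors. The random-forest regressor, action encoding, and sweep count are our illustrative choices, not the paper's estimator class.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def cycle_fqi(stage_data, n_actions, gammas, sweeps=5):
    """Stage-wise fitted Q-iteration over a cycle of M stages.
    stage_data[m] = (S, A, R, S_next): transitions observed at stage m,
    landing in stage (m + 1) mod M. Returns one Q-function per stage."""
    M = len(stage_data)
    Q = [None] * M
    for _ in range(sweeps):
        for m in reversed(range(M)):                      # backward sweep
            S, A, R, S_next = stage_data[m]
            m_next = (m + 1) % M
            if Q[m_next] is None:
                target = R                                # first pass: myopic target
            else:
                q_next = np.stack(
                    [Q[m_next].predict(np.column_stack(
                        [S_next, np.full(len(S_next), a)])) for a in range(n_actions)],
                    axis=1)
                target = R + gammas[m] * q_next.max(axis=1)  # stage-specific discount
            Q[m] = RandomForestRegressor(n_estimators=50, random_state=0).fit(
                np.column_stack([S, A]), target)
    return Q
```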
[AI-121] Locally Interpretable Individualized Treatment Rules for Black-Box Decision Models
【Quick Read】: This paper addresses two obstacles to individualized treatment rules (ITRs) in clinical practice: existing methods rely either on interpretable but inflexible models or on highly flexible black-box models that sacrifice interpretability, and most assume a single global decision rule for all patients, ignoring individual heterogeneity. The key to the solution is the Locally Interpretable Individualized Treatment Rule (LI-ITR) method, which combines flexible machine learning models to accurately capture complex treatment outcomes, uses variational autoencoders to generate local synthetic samples, and then builds subject-specific decision rules through a mixture of interpretable experts, retaining high predictive accuracy while providing transparent, clinically understandable explanations.
Link: https://arxiv.org/abs/2602.11520
Authors: Yasin Khadem Charvadeh,Katherine S. Panageas,Yuan Chen
Affiliations: Unknown
Subjects: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Individualized treatment rules (ITRs) aim to optimize healthcare by tailoring treatment decisions to patient-specific characteristics. Existing methods typically rely on either interpretable but inflexible models or highly flexible black-box approaches that sacrifice interpretability; moreover, most impose a single global decision rule across patients. We introduce the Locally Interpretable Individualized Treatment Rule (LI-ITR) method, which combines flexible machine learning models to accurately learn complex treatment outcomes with locally interpretable approximations to construct subject-specific treatment rules. LI-ITR employs variational autoencoders to generate realistic local synthetic samples and learns individualized decision rules through a mixture of interpretable experts. Simulation studies show that LI-ITR accurately recovers true subject-specific local coefficients and optimal treatment strategies. An application to precision side-effect management in breast cancer illustrates the necessity of flexible predictive modeling and highlights the practical utility of LI-ITR in estimating optimal treatment rules while providing transparent, clinically interpretable explanations.
[AI-122] DeepRed: an architecture for redshift estimation
【Quick Read】: This paper addresses the high cost and long turnaround of redshift estimation in large astronomical surveys, and the difficulty existing image-based methods have generalizing across object morphologies (galaxies, gravitational lenses, and gravitationally-lensed supernovae) and observational conditions. The key to the solution is DeepRed, a deep learning pipeline that integrates several modern computer vision architectures (ResNet, EfficientNet, Swin Transformer, and MLP-Mixer). Systematic validation on simulated data (DeepGraviLens) and real observations (KiDS, SDSS) shows state-of-the-art performance across benchmarks, with good interpretability (SHAP analysis shows the models localize the target objects with over 95% accuracy), making it a scalable, robust, and reliable solution for redshift estimation in large-scale surveys.
Link: https://arxiv.org/abs/2602.11281
Authors: Alessandro Meroni,Nicolò Oreste Pinciroli Vago,Piero Fraternali
Affiliations: Unknown
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); General Relativity and Quantum Cosmology (gr-qc)
Comments: Accepted for publication in Neural Computing and Applications
Abstract:Estimating redshift is a central task in astrophysics, but its measurement is costly and time-consuming. In addition, current image-based methods are often validated on homogeneous datasets. The development and comparison of networks able generalize across different morphologies, ranging from galaxies to gravitationally-lensed transients, and observational conditions, remain an open challenge. This work proposes DeepRed, a deep learning pipeline that demonstrates how modern computer vision architectures, including ResNet, EfficientNet, Swin Transformer, and MLP-Mixer, can estimate redshifts from images of galaxies, gravitational lenses, and gravitationally-lensed supernovae. We compare these architectures and their ensemble to both neural networks (A1, A3, NetZ, and PhotoZ) and a feature-based method (HOG+SVR) on simulated (DeepGraviLens) and real (KiDS, SDSS) datasets. Our approach achieves state-of-the-art results on all datasets. On DeepGraviLens, DeepRed achieves a significant improvement in the Normalized Mean Absolute Deviation compared to the best baseline (PhotoZ): 55% on DES-deep (using EfficientNet), 51% on DES-wide (Ensemble), 52% on DESI-DOT (Ensemble), and 46% on LSST-wide (Ensemble). On real observations from the KiDS survey, the pipeline outperforms the best baseline (NetZ), improving NMAD by 16% on a general test set without high-probability lenses (Ensemble) and 27% on high-probability lenses (Ensemble). For non-lensed galaxies in the SDSS dataset, the MLP-Mixer architecture achieves a 5% improvement over the best baselines (A3 and NetZ). SHAP shows that the models correctly focus on the objects of interest with over 95% localization accuracy on high-quality images, validating the reliability of the predictions. These findings suggest that deep learning is a scalable, robust, and interpretable solution for redshift estimation in large-scale surveys.
[AI-123] Position-Aware Self-supervised Representation Learning for Cross-mode Radar Signal Recognition
【Quick Read】: This paper addresses the challenges of radar signal recognition in open electromagnetic environments, where diverse operating modes and unseen radar types make recognition difficult. Existing methods often overlook position relations in pulse sequences and thus struggle to capture semantic dependencies over time. The key to the solution is RadarPos, a position-aware self-supervised framework that models position relations by exploiting pulse-level temporal dynamics, without complex augmentations or masking, providing notably better position-relation modeling than contrastive learning or masked reconstruction.
Link: https://arxiv.org/abs/2602.11196
Authors: Hongyang Zhang,Haitao Zhang,Yinhao Liu,Kunjie Lin,Yue Huang,Xinghao Ding
Affiliations: Unknown
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Radar signal recognition in open electromagnetic environments is challenging due to diverse operating modes and unseen radar types. Existing methods often overlook position relations in pulse sequences, limiting their ability to capture semantic dependencies over time. We propose RadarPos, a position-aware self-supervised framework that leverages pulse-level temporal dynamics without complex augmentations or masking, providing improved position relation modeling over contrastive learning or masked reconstruction. Using this framework, we evaluate cross-mode radar signal recognition under the long-tailed setting to assess adaptability and generalization. Experimental results demonstrate enhanced discriminability and robustness, highlighting practical applicability in real-world electromagnetic environments.
[AI-124] MuCO: Generative Peptide Cyclization Empowered by Multi-stage Conformation Optimization
【Quick Read】: This paper addresses the difficulty of modeling cyclic peptide conformations, i.e., how to efficiently generate cyclic peptide conformations with desirable physical and pharmaceutical properties for virtual screening. Because cyclic peptides often adopt diverse ring-shaped conformations, deterministic prediction models derived from linear peptide folding struggle to capture their conformational distributions. The key to the solution is MuCO (Multi-stage Conformation Optimization), which decouples peptide cyclization into three stages, topology-aware backbone design, generative side-chain packing, and physics-aware all-atom optimization, thereby generating and refining cyclic peptide conformations in a coarse-to-fine manner. This multi-stage framework supports an efficient parallel sampling strategy that rapidly explores diverse, low-energy conformations, and it clearly outperforms state-of-the-art methods in physical stability, structural diversity, secondary structure recovery, and computational efficiency.
Link: https://arxiv.org/abs/2602.11189
Authors: Yitian Wang,Fanmeng Wang,Angxiao Yue,Wentao Guo,Yaning Cui,Hongteng Xu
Affiliations: Unknown
Subjects: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
Comments:
Abstract:Modeling peptide cyclization is critical for the virtual screening of candidate peptides with desirable physical and pharmaceutical properties. This task is challenging because a cyclic peptide often exhibits diverse, ring-shaped conformations, which cannot be well captured by deterministic prediction models derived from linear peptide folding. In this study, we propose MuCO (Multi-stage Conformation Optimization), a generative peptide cyclization method that models the distribution of cyclic peptide conformations conditioned on the corresponding linear peptide. In principle, MuCO decouples the peptide cyclization task into three stages: topology-aware backbone design, generative side-chain packing, and physics-aware all-atom optimization, thereby generating and optimizing conformations of cyclic peptides in a coarse-to-fine manner. This multi-stage framework enables an efficient parallel sampling strategy for conformation generation and allows for rapid exploration of diverse, low-energy conformations. Experiments on the large-scale CPSea dataset demonstrate that MuCO consistently and significantly outperforms state-of-the-art methods in physical stability, structural diversity, secondary structure recovery, and computational efficiency, making it a promising computational tool for exploring and designing cyclic peptides.
Machine Learning
[LG-0] Function-Space Decoupled Diffusion for Forward and Inverse Modeling in Carbon Capture and Storage
Link: https://arxiv.org/abs/2602.12274
Authors: Xin Ju,Jiachen Yao,Anima Anandkumar,Sally M. Benson,Gege Wen
Subjects: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
Comments:
Abstract:Accurate characterization of subsurface flow is critical for Carbon Capture and Storage (CCS) but remains challenged by the ill-posed nature of inverse problems with sparse observations. We present Fun-DDPS, a generative framework that combines function-space diffusion models with differentiable neural operator surrogates for both forward and inverse modeling. Our approach learns a prior distribution over geological parameters (geomodel) using a single-channel diffusion model, then leverages a Local Neural Operator (LNO) surrogate to provide physics-consistent guidance for cross-field conditioning on the dynamics field. This decoupling allows the diffusion prior to robustly recover missing information in parameter space, while the surrogate provides efficient gradient-based guidance for data assimilation. We demonstrate Fun-DDPS on synthetic CCS modeling datasets, achieving two key results: (1) For forward modeling with only 25% observations, Fun-DDPS achieves 7.7% relative error compared to 86.9% for standard surrogates (an 11x improvement), proving its capability to handle extreme data sparsity where deterministic methods fail. (2) We provide the first rigorous validation of diffusion-based inverse solvers against asymptotically exact Rejection Sampling (RS) posteriors. Both Fun-DDPS and the joint-state baseline (Fun-DPS) achieve Jensen-Shannon divergence less than 0.06 against the ground truth. Crucially, Fun-DDPS produces physically consistent realizations free from the high-frequency artifacts observed in joint-state baselines, achieving this with 4x improved sample efficiency compared to rejection sampling.
[LG-1] Self-Supervised Learning via Flow-Guided Neural Operator on Time-Series Data
Link: https://arxiv.org/abs/2602.12267
Authors: Duy Nguyen,Jiachen Yao,Jiayun Wang,Julius Berner,Animashree Anandkumar
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Self-supervised learning (SSL) is a powerful paradigm for learning from unlabeled time-series data. However, popular methods such as masked autoencoders (MAEs) rely on reconstructing inputs from a fixed, predetermined masking ratio. Instead of this static design, we propose treating the corruption level as a new degree of freedom for representation learning, enhancing flexibility and performance. To achieve this, we introduce the Flow-Guided Neural Operator (FGNO), a novel framework combining operator learning with flow matching for SSL training. FGNO learns mappings in functional spaces by using Short-Time Fourier Transform to unify different time resolutions. We extract a rich hierarchy of features by tapping into different network layers and flow times that apply varying strengths of noise to the input data. This enables the extraction of versatile representations, from low-level patterns to high-level global features, using a single model adaptable to specific tasks. Unlike prior generative SSL methods that use noisy inputs during inference, we propose using clean inputs for representation extraction while learning representations with noise; this eliminates randomness and boosts accuracy. We evaluate FGNO across three biomedical domains, where it consistently outperforms established baselines. Our method yields up to 35% AUROC gains in neural signal decoding (BrainTreeBank), 16% RMSE reductions in skin temperature prediction (DREAMT), and over 20% improvement in accuracy and macro-F1 on SleepEDF under low-data regimes. These results highlight FGNO’s robustness to data scarcity and its superior capacity to learn expressive representations for diverse time series.
[LG-2] Is Online Linear Optimization Sufficient for Strategic Robustness?
Link: https://arxiv.org/abs/2602.12253
Authors: Yang Cai,Haipeng Luo,Chen-Yu Wei,Weiqiang Zheng
Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments: 26 pages
Abstract:We consider bidding in repeated Bayesian first-price auctions. Bidding algorithms that achieve optimal regret have been extensively studied, but their strategic robustness to the seller’s manipulation remains relatively underexplored. Bidding algorithms based on no-swap-regret algorithms achieve both desirable properties, but are suboptimal in terms of statistical and computational efficiency. In contrast, online gradient ascent is the only algorithm that achieves O(\sqrt{TK}) regret and strategic robustness [KSS24], where T denotes the number of auctions and K the number of bids. In this paper, we explore whether simple online linear optimization (OLO) algorithms suffice for bidding algorithms with both desirable properties. Our main result shows that sublinear linearized regret is sufficient for strategic robustness. Specifically, we construct simple black-box reductions that convert any OLO algorithm into a strategically robust no-regret bidding algorithm, in both known and unknown value distribution settings. For the known value distribution case, our reduction yields a bidding algorithm that achieves O(\sqrt{T \log K}) regret and strategic robustness (with exponential improvement on the K-dependence compared to [KSS24]). For the unknown value distribution case, our reduction gives a bidding algorithm with high-probability O(\sqrt{T(\log K + \log(T/\delta))}) regret and strategic robustness, while removing the bounded density assumption made in [KSS24].
[LG-3] Community Concealment from Unsupervised Graph Learning-Based Clustering
Link: https://arxiv.org/abs/2602.12250
Authors: Dalyapraz Manatova,Pablo Moriano,L. Jean Camp
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Social and Information Networks (cs.SI)
Comments:
Abstract:Graph neural networks (GNNs) are designed to use attributed graphs to learn representations. Such representations are beneficial in the unsupervised learning of clusters and community detection. Nonetheless, such inference may reveal sensitive groups, clustered systems, or collective behaviors, raising concerns regarding group-level privacy. Community attribution in social and critical infrastructure networks, for example, can expose coordinated asset groups, operational hierarchies, and system dependencies that could be used for profiling or intelligence gathering. We study a defensive setting in which a data publisher (defender) seeks to conceal a community of interest while making limited, utility-aware changes in the network. Our analysis indicates that community concealment is strongly influenced by two quantifiable factors: connectivity at the community boundary and feature similarity between the protected community and adjacent communities. Informed by these findings, we present a perturbation strategy that rewires a set of selected edges and modifies node features to reduce the distinctiveness leveraged by GNN message passing. The proposed method outperforms DICE in our experiments on synthetic benchmarks and real network graphs under identical perturbation budgets. Overall, it achieves median relative concealment improvements of approximately 20-45% across the evaluated settings. These findings demonstrate a mitigation strategy against GNN-based community learning and highlight group-level privacy risks intrinsic to graph learning.
[LG-4] Categorical Flow Maps
Link: https://arxiv.org/abs/2602.12233
Authors: Daan Roos,Oscar Davis,Floor Eijkelboom,Michael Bronstein,Max Welling,İsmail İlkan Ceylan,Luca Ambrogioni,Jan-Willem van de Meent
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:We introduce Categorical Flow Maps, a flow-matching method for accelerated few-step generation of categorical data via self-distillation. Building on recent variational formulations of flow matching and the broader trend towards accelerated inference in diffusion and flow-based models, we define a flow map towards the simplex that transports probability mass toward a predicted endpoint, yielding a parametrisation that naturally constrains model predictions. Since our trajectories are continuous rather than discrete, Categorical Flow Maps can be trained with existing distillation techniques, as well as a new objective based on endpoint consistency. This continuous formulation also automatically unlocks test-time inference: we can directly reuse existing guidance and reweighting techniques in the categorical setting to steer sampling toward downstream objectives. Empirically, we achieve state-of-the-art few-step results on images, molecular graphs, and text, with strong performance even in single-step generation.
[LG-5] Diffusion Alignment Beyond KL: Variance Minimisation as Effective Policy Optimiser
Link: https://arxiv.org/abs/2602.12229
Authors: Zijing Ou,Jacob Si,Junyi Zhu,Ondrej Bohdal,Mete Ozay,Taha Ceritli,Yingzhen Li
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Diffusion alignment adapts pretrained diffusion models to sample from reward-tilted distributions along the denoising trajectory. This process naturally admits a Sequential Monte Carlo (SMC) interpretation, where the denoising model acts as a proposal and reward guidance induces importance weights. Motivated by this view, we introduce Variance Minimisation Policy Optimisation (VMPO), which formulates diffusion alignment as minimising the variance of log importance weights rather than directly optimising a Kullback-Leibler (KL) based objective. We prove that the variance objective is minimised by the reward-tilted target distribution and that, under on-policy sampling, its gradient coincides with that of standard KL-based alignment. This perspective offers a common lens for understanding diffusion alignment. Under different choices of potential functions and variance minimisation strategies, VMPO recovers various existing methods, while also suggesting new design directions beyond KL.
[LG-6] Learning to Forget Attention: Memory Consolidation for Adaptive Compute Reduction
Link: https://arxiv.org/abs/2602.12204
Authors: Ibne Farabi Shihab,Sanjeda Akter,Anuj Sharma
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Hybrid architectures combining state-space models with attention have achieved strong efficiency-quality tradeoffs, yet existing approaches either apply attention uniformly or learn static sparse patterns. This misses a key opportunity: attention demand should decrease over time as recurring patterns become familiar. We present a surprising finding from analyzing GPT-2 models: 88% of attention operations retrieve information already predictable from the model’s hidden state, and this redundancy does not decrease during training. Motivated by this observation, we introduce CRAM (Consolidation-based Routing for Adaptive Memory), a biologically inspired memory consolidation mechanism that gradually distills episodic retrievals into parametric semantic memory. Unlike prior sparse attention methods, CRAM exhibits decreasing attention utilization over training, achieving a 37.8x reduction through a sharp phase transition at approximately 3K steps. We prove that this capability is impossible without consolidation: any static routing scheme requires \Omega(f \cdot n) attention for tasks with recurring patterns of frequency f. On our proposed SRCD benchmark, CRAM achieves 100% retrieval accuracy at 1.6% attention compute (vs. 68% for baselines), and consolidated patterns transfer to unseen tasks with 48-52% attention reduction without retraining. Remarkably, the learned consolidation dynamics quantitatively match human episodic-to-semantic memory transition curves from cognitive psychology (\gamma = 0.43 vs. \gamma_{human} \approx 0.4-0.5). Code and benchmarks are available at [anonymized].
[LG-7] WaveFormer: Wavelet Embedding Transformer for Biomedical Signals
Link: https://arxiv.org/abs/2602.12189
Authors: Habib Irani,Bikram De,Vangelis Metsis
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Biomedical signal classification presents unique challenges due to long sequences, complex temporal dynamics, and multi-scale frequency patterns that are poorly captured by standard transformer architectures. We propose WaveFormer, a transformer architecture that integrates wavelet decomposition at two critical stages: embedding construction, where multi-channel Discrete Wavelet Transform (DWT) extracts frequency features to create tokens containing both time-domain and frequency-domain information, and positional encoding, where Dynamic Wavelet Positional Encoding (DyWPE) adapts position embeddings to signal-specific temporal structure through mono-channel DWT analysis. We evaluate WaveFormer on eight diverse datasets spanning human activity recognition and brain signal analysis, with sequence lengths ranging from 50 to 3000 timesteps and channel counts from 1 to 144. Experimental results demonstrate that WaveFormer achieves competitive performance through comprehensive frequency-aware processing. Our approach provides a principled framework for incorporating frequency-domain knowledge into transformer-based time series classification.
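One plausible reading of the embedding construction, sketched with PyWavelets: each channel's DWT sub-bands are stretched back to the original length and concatenated with the raw values, so every token carries both time-domain and multi-scale frequency content. The wavelet family, decomposition level, and upsampling-by-repetition are our assumptions; the paper's exact tokenization may differ.

```python
import numpy as np
import pywt

def wavelet_tokens(x, wavelet="db4", level=3):
    """Build frequency-aware tokens for a multi-channel series.
    x: (timesteps, channels). Each DWT sub-band is upsampled back to the
    original length and concatenated with the raw values, so every token
    carries time-domain and multi-scale frequency-domain features."""
    T, C = x.shape
    feats = [x]
    for c_idx in range(C):
        coeffs = pywt.wavedec(x[:, c_idx], wavelet, level=level)
        for band in coeffs:
            # stretch each sub-band to length T by index repetition
            idx = np.minimum((np.arange(T) * len(band)) // T, len(band) - 1)
            feats.append(band[idx][:, None])
    return np.concatenate(feats, axis=1)   # (T, C + C*(level+1))

tokens = wavelet_tokens(np.random.randn(256, 3))
print(tokens.shape)   # (256, 15)
```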
[LG-8] How Sampling Shapes LLM Alignment: From One-Shot Optima to Iterative Dynamics
Link: https://arxiv.org/abs/2602.12180
Authors: Yurong Chen,Yu He,Michael I. Jordan,Fan Yao
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Comments:
Abstract:Standard methods for aligning large language models with human preferences learn from pairwise comparisons among sampled candidate responses and regularize toward a reference policy. Despite their effectiveness, the effects of sampling and reference choices are poorly understood theoretically. We investigate these effects through Identity Preference Optimization, a widely used preference alignment framework, and show that proper instance-dependent sampling can yield stronger ranking guarantees, while skewed on-policy sampling can induce excessive concentration under structured preferences. We then analyze iterative alignment dynamics in which the learned policy feeds back into future sampling and reference policies, reflecting a common practice of model-generated preference data. We prove that these dynamics can exhibit persistent oscillations or entropy collapse for certain parameter choices, and characterize regimes that guarantee stability. Our theoretical insights extend to Direct Preference Optimization, indicating the phenomena we captured are common to a broader class of preference-alignment methods. Experiments on real-world preference data validate our findings.
[LG-9] Amortized Molecular Optimization via Group Relative Policy Optimization
Link: https://arxiv.org/abs/2602.12162
Authors: Muhammad bin Javaid,Hasham Hussain,Ashima Khanna,Berke Kisin,Jonathan Pirnay,Alexander Mitsos,Dominik G. Grimm,Martin Grohe
Subjects: Machine Learning (cs.LG)
Comments: 23 pages, 5 figures
Abstract:Molecular design encompasses tasks ranging from de-novo design to structural alteration of given molecules or fragments. For the latter, state-of-the-art methods predominantly function as “Instance Optimizers”, expending significant compute restarting the search for every input structure. While model-based approaches theoretically offer amortized efficiency by learning a policy transferable to unseen structures, existing methods struggle to generalize. We identify a key failure mode: the high variance arising from the heterogeneous difficulty of distinct starting structures. To address this, we introduce GRXForm, adapting a pre-trained Graph Transformer model that optimizes molecules via sequential atom-and-bond additions. We employ Group Relative Policy Optimization (GRPO) for goal-directed fine-tuning to mitigate variance by normalizing rewards relative to the starting structure. Empirically, GRXForm generalizes to out-of-distribution molecular scaffolds without inference-time oracle calls or refinement, achieving scores in multi-objective optimization competitive with leading instance optimizers.
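The variance-reduction idea, normalizing rewards within the group of rollouts that share a starting structure, is the core of GRPO and can be stated in a few lines. A minimal sketch follows; the reward values and grouping are illustrative.

```python
import numpy as np

def group_relative_advantages(rewards, group_ids, eps=1e-8):
    """GRPO-style advantages: normalize each reward against the mean and
    std of the group sampled from the same starting structure, so hard
    and easy starting molecules contribute comparable learning signal."""
    rewards = np.asarray(rewards, dtype=float)
    adv = np.empty_like(rewards)
    for g in np.unique(group_ids):
        m = group_ids == g
        adv[m] = (rewards[m] - rewards[m].mean()) / (rewards[m].std() + eps)
    return adv

# two starting structures with very different reward scales
r = [0.9, 0.95, 0.8, 0.1, 0.3, 0.2]
g = np.array([0, 0, 0, 1, 1, 1])
print(group_relative_advantages(r, g))
```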
[LG-10] SafeNeuron: Neuron-Level Safety Alignment for Large Language Models
Link: https://arxiv.org/abs/2602.12158
Authors: Zhaoxin Wang,Jiaming Liang,Fengbin Zhu,Weixiang Zhao,Junfeng Fang,Jiayi Ji,Handing Wang,Tat-Seng Chua
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Large language models (LLMs) and multimodal LLMs are typically safety-aligned before release to prevent harmful content generation. However, recent studies show that safety behaviors are concentrated in a small subset of parameters, making alignment brittle and easily bypassed through neuron-level attacks. Moreover, most existing alignment methods operate at the behavioral level, offering limited control over the model’s internal safety mechanisms. In this work, we propose SafeNeuron, a neuron-level safety alignment framework that improves robustness by redistributing safety representations across the network. SafeNeuron first identifies safety-related neurons, then freezes these neurons during preference optimization to prevent reliance on sparse safety pathways and force the model to construct redundant safety representations. Extensive experiments across models and modalities demonstrate that SafeNeuron significantly improves robustness against neuron pruning attacks, reduces the risk of open-source models being repurposed as red-team generators, and preserves general capabilities. Furthermore, our layer-wise analysis reveals that safety behaviors are governed by stable and shared internal representations. Overall, SafeNeuron provides an interpretable and robust perspective for model alignment.
[LG-11] It's TIME: Towards the Next Generation of Time Series Forecasting Benchmarks
Link: https://arxiv.org/abs/2602.12147
Authors: Zhongzheng Qiao,Sheng Pan,Anni Wang,Viktoriya Zhukova,Yong Liu,Xudong Jiang,Qingsong Wen,Mingsheng Long,Ming Jin,Chenghao Liu
Subjects: Machine Learning (cs.LG)
Comments: The source code will be released on GitHub shortly
Abstract:Time series foundation models (TSFMs) are revolutionizing the forecasting landscape from specific dataset modeling to generalizable task evaluation. However, we contend that existing benchmarks exhibit common limitations in four dimensions: constrained data composition dominated by reused legacy sources, compromised data integrity lacking rigorous quality assurance, misaligned task formulations detached from real-world contexts, and rigid analysis perspectives that obscure generalizable insights. To bridge these gaps, we introduce TIME, a next-generation task-centric benchmark comprising 50 fresh datasets and 98 forecasting tasks, tailored for strict zero-shot TSFM evaluation free from data leakage. Integrating large language models and human expertise, we establish a rigorous human-in-the-loop benchmark construction pipeline to ensure high data integrity and redefine task formulation by aligning forecasting configurations with real-world operational requirements and variate predictability. Furthermore, we propose a novel pattern-level evaluation perspective that moves beyond traditional dataset-level evaluations based on static meta labels. By leveraging structural time series features to characterize intrinsic temporal properties, this approach offers generalizable insights into model capabilities across diverse patterns. We evaluate 12 representative TSFMs and establish a multi-granular leaderboard to facilitate in-depth analysis and visualized inspection. The leaderboard is available at this https URL.
[LG-12] Oscillators Are All You Need: Irregular Time Series Modelling via Damped Harmonic Oscillators with Closed-Form Solutions
Link: https://arxiv.org/abs/2602.12139
Authors: Yashas Shende(1),Aritra Das(1),Reva Laxmi Chauhan(1),Arghya Pathak(1),Debayan Gupta(1) ((1) Ashoka University)
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Transformers excel at time series modelling through attention mechanisms that capture long-term temporal patterns. However, they assume uniform time intervals and therefore struggle with irregular time series. Neural Ordinary Differential Equations (NODEs) effectively handle irregular time series by modelling hidden states as continuously evolving trajectories. ContiFormers (arXiv:2402.10635) combine NODEs with Transformers, but inherit the computational bottleneck of the former by using heavy numerical solvers. This bottleneck can be removed by using a closed-form solution for the given dynamical system - but this is known to be intractable in general! We obviate this by replacing NODEs with a novel linear damped harmonic oscillator analogy - which has a known closed-form solution. We model keys and values as damped, driven oscillators and expand the query in a sinusoidal basis up to a suitable number of modes. This analogy naturally captures the query-key coupling that is fundamental to any transformer architecture by modelling attention as a resonance phenomenon. Our closed-form solution eliminates the computational overhead of numerical ODE solvers while preserving expressivity. We prove that this oscillator-based parameterisation maintains the universal approximation property of continuous-time attention; specifically, any discrete attention matrix realisable by ContiFormer’s continuous keys can be approximated arbitrarily well by our fixed oscillator modes. Our approach delivers both theoretical guarantees and scalability, achieving state-of-the-art performance on irregular time series benchmarks while being orders of magnitude faster.
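For reference, the closed-form solution that removes the need for a numerical solver is standard physics (this is the textbook damped, driven oscillator, not necessarily the paper's exact parametrization):

```latex
% Underdamped case (\zeta < \omega_0) of
%   \ddot{x} + 2\zeta\,\dot{x} + \omega_0^2\, x = f(t).
% The homogeneous solution is
\[
  x_h(t) = e^{-\zeta t}\bigl(A\cos(\omega_d t) + B\sin(\omega_d t)\bigr),
  \qquad \omega_d = \sqrt{\omega_0^2 - \zeta^2},
\]
% and a sinusoidal drive f(t) = F\cos(\Omega t) adds the particular solution
\[
  x_p(t) = \frac{F\cos(\Omega t - \varphi)}
                {\sqrt{(\omega_0^2-\Omega^2)^2 + (2\zeta\Omega)^2}},
  \qquad \tan\varphi = \frac{2\zeta\Omega}{\omega_0^2-\Omega^2}.
\]
% The resonance peak near \Omega \approx \omega_0 is the query-key coupling
% analogy: states can be evaluated at arbitrary timestamps with no ODE solver.
```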
[LG-13] Few-Shot Design Optimization by Exploiting Auxiliary Information
Link: https://arxiv.org/abs/2602.12112
Authors: Arjun Mani,Carl Vondrick,Richard Zemel
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Many real-world design problems involve optimizing an expensive black-box function f(x) , such as hardware design or drug discovery. Bayesian Optimization has emerged as a sample-efficient framework for this problem. However, the basic setting considered by these methods is simplified compared to real-world experimental setups, where experiments often generate a wealth of useful information. We introduce a new setting where an experiment generates high-dimensional auxiliary information h(x) along with the performance measure f(x) ; moreover, a history of previously solved tasks from the same task family is available for accelerating optimization. A key challenge of our setting is learning how to represent and utilize h(x) for efficiently solving new optimization tasks beyond the task history. We develop a novel approach for this setting based on a neural model which predicts f(x) for unseen designs given a few-shot context containing observations of h(x) . We evaluate our method on two challenging domains, robotic hardware design and neural network hyperparameter tuning, and introduce a novel design problem and large-scale benchmark for the former. On both domains, our method utilizes auxiliary feedback effectively to achieve more accurate few-shot prediction and faster optimization of design tasks, significantly outperforming several methods for multi-task optimization.
[LG-14] Geometry of Uncertainty: Learning Metric Spaces for Multimodal State Estimation in RL
Link: https://arxiv.org/abs/2602.12087
Authors: Alfredo Reichlin,Adriano Pacciarelli,Danica Kragic,Miguel Vasco
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Estimating the state of an environment from high-dimensional, multimodal, and noisy observations is a fundamental challenge in reinforcement learning (RL). Traditional approaches rely on probabilistic models to account for the uncertainty, but often require explicit noise assumptions, in turn limiting generalization. In this work, we contribute a novel method to learn a structured latent representation, in which distances between states directly correlate with the minimum number of actions required to transition between them. The proposed metric space formulation provides a geometric interpretation of uncertainty without the need for explicit probabilistic modeling. To achieve this, we introduce a multimodal latent transition model and a sensor fusion mechanism based on inverse distance weighting, allowing for the adaptive integration of multiple sensor modalities without prior knowledge of noise distributions. We empirically validate the approach on a range of multimodal RL tasks, demonstrating improved robustness to sensor noise and superior state estimation compared to baseline methods. Our experiments show enhanced performance of an RL agent via the learned representation, eliminating the need of explicit noise augmentation. The presented results suggest that leveraging transition-aware metric spaces provides a principled and scalable solution for robust state estimation in sequential decision-making.
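A minimal sketch of the fusion rule named in the abstract, inverse distance weighting in the learned metric space: modalities whose estimates lie far from the current latent state (large distance, hence high uncertainty under the paper's geometric reading) receive low weight. Values and dimensions are illustrative.

```python
import numpy as np

def idw_fuse(estimates, distances, eps=1e-8):
    """Fuse per-modality latent state estimates with inverse distance
    weighting: modalities whose embedding lies far from the current
    latent trajectory (large distance = high uncertainty) get low weight."""
    w = 1.0 / (np.asarray(distances) + eps)
    w = w / w.sum()
    return np.tensordot(w, np.asarray(estimates), axes=1)

# three modalities estimating a 2-D latent state; the noisy third sensor
# sits far from the predicted state and is downweighted automatically
z = [np.array([1.0, 0.0]), np.array([1.1, 0.1]), np.array([4.0, 3.0])]
d = [0.2, 0.3, 2.5]
print(idw_fuse(z, d))
```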
[LG-15] Empirical Gaussian Processes
Link: https://arxiv.org/abs/2602.12082
Authors: Jihao Andreas Lin,Sebastian Ament,Louis C. Tiao,David Eriksson,Maximilian Balandat,Eytan Bakshy
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Gaussian processes (GPs) are powerful and widely used probabilistic regression models, but their effectiveness in practice is often limited by the choice of kernel function. This kernel function is typically handcrafted from a small set of standard functions, a process that requires expert knowledge, results in limited adaptivity to data, and imposes strong assumptions on the hypothesis space. We study Empirical GPs, a principled framework for constructing flexible, data-driven GP priors that overcome these limitations. Rather than relying on standard parametric kernels, we estimate the mean and covariance functions empirically from a corpus of historical observations, enabling the prior to reflect rich, non-trivial covariance structures present in the data. Theoretically, we show that the resulting model converges to the GP that is closest (in KL-divergence sense) to the real data generating process. Practically, we formulate the problem of learning the GP prior from independent datasets as likelihood estimation and derive an Expectation-Maximization algorithm with closed-form updates, allowing the model handle heterogeneous observation locations across datasets. We demonstrate that Empirical GPs achieve competitive performance on learning curve extrapolation and time series forecasting benchmarks.
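A minimal sketch of the construction under the simplifying assumption of a common observation grid (the paper's EM algorithm is what handles heterogeneous locations): estimate the prior mean and covariance empirically across historical curves, then condition with the standard GP formulas.

```python
import numpy as np

def empirical_gp_prior(Y):
    """Estimate a GP prior empirically from N historical datasets observed
    on a common grid. Y: (N, T) array of curves. Returns the empirical
    mean function and covariance matrix on that grid."""
    mean = Y.mean(axis=0)
    cov = np.cov(Y, rowvar=False) + 1e-6 * np.eye(Y.shape[1])  # jitter
    return mean, cov

def posterior(mean, cov, obs_idx, obs_val, noise=1e-2):
    """Standard GP conditioning of the empirical prior on a few observations."""
    K_oo = cov[np.ix_(obs_idx, obs_idx)] + noise * np.eye(len(obs_idx))
    K_xo = cov[:, obs_idx]
    alpha = np.linalg.solve(K_oo, obs_val - mean[obs_idx])
    post_mean = mean + K_xo @ alpha
    post_cov = cov - K_xo @ np.linalg.solve(K_oo, K_xo.T)
    return post_mean, post_cov
```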
[LG-16] PathCRF: Ball-Free Soccer Event Detection via Possession Path Inference from Player Trajectories
Link: https://arxiv.org/abs/2602.12080
Authors: Hyunsung Kim,Kunhee Lee,Sangwoo Seo,Sang-Ki Ko,Jinsung Yoon,Chanyoung Park
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Despite recent advances in AI, event data collection in soccer still relies heavily on labor-intensive manual annotation. Although prior work has explored automatic event detection using player and ball trajectories, ball tracking also remains difficult to scale due to high infrastructural and operational costs. As a result, comprehensive data collection in soccer is largely confined to top-tier competitions, limiting the broader adoption of data-driven analysis in this domain. To address this challenge, this paper proposes PathCRF, a framework for detecting on-ball soccer events using only player tracking data. We model player trajectories as a fully connected dynamic graph and formulate event detection as the problem of selecting exactly one edge corresponding to the current possession state at each time step. To ensure logical consistency of the resulting edge sequence, we employ a Conditional Random Field (CRF) that forbids impossible transitions between consecutive edges. Both emission and transition scores are dynamically computed from edge embeddings produced by a Set Attention-based backbone architecture. During inference, the most probable edge sequence is obtained via Viterbi decoding, and events such as ball controls or passes are detected whenever the selected edge changes between adjacent time steps. Experiments show that PathCRF produces accurate, logically consistent possession paths, enabling reliable downstream analyses while substantially reducing the need for manual event annotation. The source code is available at this https URL.
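The decoding step is standard Viterbi over edge states with a mask of forbidden transitions; a minimal sketch follows (the score shapes and the NEG_INF masking convention are ours).

```python
import numpy as np

NEG_INF = -1e18

def viterbi(emission, transition_ok, transition_score):
    """Most probable edge (possession-state) sequence under a CRF that
    forbids impossible transitions. emission: (T, E) scores; transition_ok:
    (E, E) boolean mask; transition_score: (E, E) learned scores."""
    T, E = emission.shape
    trans = np.where(transition_ok, transition_score, NEG_INF)
    score = emission[0].copy()
    back = np.zeros((T, E), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + trans          # (prev_edge, cur_edge)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emission[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]   # an event is detected wherever consecutive edges differ
```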
[LG-17] Improving HPC Code Generation Capability of LLMs via Online Reinforcement Learning with Real-Machine Benchmark Rewards
Link: https://arxiv.org/abs/2602.12049
Authors: Ryo Mikasa,Shun-ichiro Hayashi,Daichi Mukunoki,Tetsuya Hoshino,Takahiro Katagiri
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Large language models (LLMs) have demonstrated strong code generation capabilities, yet the runtime performance of generated code is not guaranteed, and there have been few attempts to train LLMs using runtime performance as a reward in the HPC domain. We propose an online reinforcement learning approach that executes LLM-generated code on a supercomputer and directly feeds back the measured runtime performance (GFLOPS) as a reward. We further introduce a Staged Quality-Diversity (SQD) algorithm that progressively varies the permitted optimization techniques on a per-problem basis, enabling the model to learn code optimization from diverse perspectives. We build a distributed system connecting a GPU training cluster with a CPU benchmarking cluster, and train Qwen2.5 Coder 14B on a double-precision matrix multiplication task using Group Relative Policy Optimization (GRPO). Through two experiments, we show that reinforcement learning combining runtime performance feedback with staged optimization can improve the HPC code generation capability of LLMs.
[LG-18] Safety Beyond the Training Data: Robust Out-of-Distribution MPC via Conformalized System Level Synthesis
Link: https://arxiv.org/abs/2602.12047
Authors: Anutam Srinivasan,Antoine Leeman,Glen Chou
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments:
Abstract:We present a novel framework for robust out-of-distribution planning and control using conformal prediction (CP) and system level synthesis (SLS), addressing the challenge of ensuring safety and robustness when using learned dynamics models beyond the training data distribution. We first derive high-confidence model error bounds using weighted CP with a learned, state-control-dependent covariance model. These bounds are integrated into an SLS-based robust nonlinear model predictive control (MPC) formulation, which performs constraint tightening over the prediction horizon via volume-optimized forward reachable sets. We provide theoretical guarantees on coverage and robustness under distributional drift, and analyze the impact of data density and trajectory tube size on prediction coverage. Empirically, we demonstrate our method on nonlinear systems of increasing complexity, including a 4D car and a 12D quadcopter, improving safety and robustness compared to fixed-bound and non-robust baselines, especially outside of the data distribution.
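A minimal sketch of the weighted split-conformal quantile that yields a high-confidence model-error bound; the weight normalization, which reserves mass for the test point, follows the usual weighted-CP recipe, and uniform weights recover standard split conformal prediction. The paper's learned state-control-dependent covariance weighting is not reproduced here.

```python
import numpy as np

def weighted_conformal_quantile(scores, weights, alpha=0.1):
    """(1 - alpha) quantile of calibration nonconformity scores under
    normalized weights, with mass reserved for the test point (weighted
    split conformal prediction). Returns +inf when the level is unreachable."""
    order = np.argsort(scores)
    s = np.asarray(scores, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    w = w / (w.sum() + 1.0)                  # test-point weight normalized to 1
    cdf = np.cumsum(w)
    idx = int(np.searchsorted(cdf, 1.0 - alpha))
    return s[idx] if idx < len(s) else np.inf

# Residuals of a learned dynamics model on held-out calibration transitions;
# uniform weights recover standard split conformal prediction.
scores = np.abs(np.random.default_rng(0).normal(size=500))
print(weighted_conformal_quantile(scores, np.ones(500)))
```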
[LG-19] PrefillShare: A Shared Prefill Module for KV Reuse in Multi-LLM Disaggregated Serving
Link: https://arxiv.org/abs/2602.12029
Authors: Sunghyeon Woo,Hoseung Kim,Sunghwan Shim,Minjung Jo,Hyunjoon Jeong,Jeongtae Lee,Joonghoon Kim,Sungjae Lee,Baeseong Park,Se Jung Kwon,Dongsoo Lee
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Preprint. 13 pages, 6 figures
Abstract:Multi-agent systems increasingly orchestrate multiple specialized language models to solve complex real-world problems, often invoking them over a shared context. This execution pattern repeatedly processes the same prompt prefix across models. Consequently, each model redundantly executes the prefill stage and maintains its own key-value (KV) cache, increasing aggregate prefill load and worsening tail latency by intensifying prefill-decode interference in existing LLM serving stacks. Disaggregated serving reduces such interference by placing prefill and decode on separate GPUs, but disaggregation does not fundamentally eliminate inter-model redundancy in computation and KV storage for the same prompt. To address this issue, we propose PrefillShare, a novel algorithm that enables sharing the prefill stage across multiple models in a disaggregated setting. PrefillShare factorizes the model into prefill and decode modules, freezes the prefill module, and fine-tunes only the decode module. This design allows multiple task-specific models to share a prefill module and the KV cache generated for the same prompt. We further introduce a routing mechanism that enables effective prefill sharing across heterogeneous models in a vLLM-based disaggregated system. PrefillShare not only matches full fine-tuning accuracy on a broad range of tasks and models, but also delivers 4.5x lower p95 latency and 3.9x higher throughput in multi-model agent workloads.
[LG-20] Protein Circuit Tracing via Cross-layer Transcoders
Link: https://arxiv.org/abs/2602.12026
Authors: Darin Tsui,Kunal Talreja,Daniel Saeedi,Amirali Aghazadeh
Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 29 pages, 15 figures
Abstract:Protein language models (pLMs) have emerged as powerful predictors of protein structure and function. However, the computational circuits underlying their predictions remain poorly understood. Recent mechanistic interpretability methods decompose pLM representations into interpretable features, but they treat each layer independently and thus fail to capture cross-layer computation, limiting their ability to approximate the full model. We introduce ProtoMech, a framework for discovering computational circuits in pLMs using cross-layer transcoders that learn sparse latent representations jointly across layers to capture the model’s full computational circuitry. Applied to the pLM ESM2, ProtoMech recovers 82-89% of the original performance on protein family classification and function prediction tasks. ProtoMech then identifies compressed circuits that use 1% of the latent space while retaining up to 79% of model accuracy, revealing correspondence with structural and functional motifs, including binding, signaling, and stability. Steering along these circuits enables high-fitness protein design, surpassing baseline methods in more than 70% of cases. These results establish ProtoMech as a principled framework for protein circuit tracing.
[LG-21] Improved state mixing in higher-order and block diagonal linear recurrent networks
Link: https://arxiv.org/abs/2602.12021
Authors: Igor Dubinin,Antonio Orvieto,Felix Effenberger
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Linear recurrent networks (LRNNs) and linear state space models (SSMs) promise computational and memory efficiency on long-sequence modeling tasks, yet their diagonal state transitions limit expressivity. Dense and nonlinear architectures (e.g., LSTMs) on the other hand are provably more expressive, but computationally costly. Here, we explore how expressivity in LRNNs can be increased via richer state mixing across time and channels while maintaining competitive efficiency. Specifically, we introduce two structured LRNN architectures: (i) Higher-order Linear Recurrent Units (H-LRU), which generalize first-order recurrence to higher order, mixing multiple past states, and (ii) Block-Diagonal LRUs (BD-LRU), which enable dense intra-block channel mixing. Per-channel (H-LRU) or per-row (BD-LRU) L1-normalization of selective gates stabilizes training and allows for scaling window/block sizes. A parallel-scan implementation of the proposed architectures keeps the throughput competitive with diagonal LRNNs for moderate orders (H-LRU) and block sizes (BD-LRU). In synthetic sequence modeling tasks, the performance of BD-LRU matches or exceeds that of linear SSMs (Mamba), low-rank LRNNs (DeltaNet), and LSTM baselines, while H-LRU is the most parameter-efficient in the compression task. In both synthetic sequence modeling and language modeling, our results indicate that the structure of state mixing rather than width alone shapes expressivity of LRNNs, offering a practical route to closing the efficiency-expressivity gap in linear sequence models.
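A minimal sketch of one H-LRU-style step with per-channel L1-normalized gates, which keeps the K-th-order recurrence contractive. In the actual model the gates are input-dependent (selective); the order, widths, and initialization here are illustrative.

```python
import numpy as np

def h_lru_step(h_hist, x_t, A_gates, B, eps=1e-8):
    """One step of a K-th order linear recurrence with per-channel
    L1-normalized selective gates:
        h_t = sum_k a_k * h_{t-k} + B x_t,  with sum_k |a_k| <= 1 per channel.
    h_hist: (K, d) past states, newest first; A_gates: (K, d) raw gates."""
    a = A_gates / (np.abs(A_gates).sum(axis=0, keepdims=True) + eps)
    h_t = (a * h_hist).sum(axis=0) + B @ x_t
    return h_t

K, d, d_in = 3, 8, 4
rng = np.random.default_rng(0)
h_hist = np.zeros((K, d))
B = rng.normal(size=(d, d_in))
for t in range(10):
    gates = rng.uniform(0.1, 1.0, size=(K, d))   # input-dependent in the real model
    h_t = h_lru_step(h_hist, rng.normal(size=d_in), gates, B)
    h_hist = np.vstack([h_t[None], h_hist[:-1]])
```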
[LG-22] FedGRPO: Privately Optimizing Foundation Models with Group-Relative Rewards from Domain Client AAAI2026
Link: https://arxiv.org/abs/2602.12014
Authors: Gongxi Zhu,Hanlin Gu,Lixin Fan,Qiang Yang,Yuxing Han
Subjects: Machine Learning (cs.LG)
Comments: Accepted by AAAI 2026 as Oral
Abstract:One important direction of Federated Foundation Models (FedFMs) is leveraging data from small client models to enhance the performance of a large server-side foundation model. Existing methods based on model level or representation level knowledge transfer either require expensive local training or incur high communication costs and introduce unavoidable privacy risks. We reformulate this problem as a reinforcement learning style evaluation process and propose FedGRPO, a privacy preserving framework comprising two modules. The first module performs competence-based expert selection by building a lightweight confidence graph from auxiliary data to identify the most suitable clients for each question. The second module leverages the “Group Relative” concept from the Group Relative Policy Optimization (GRPO) framework by packaging each question together with its solution rationale into candidate policies, dispatching these policies to a selected subset of expert clients, and aggregating solely the resulting scalar reward signals via a federated group-relative loss function. By exchanging reward values instead of data or model updates, FedGRPO reduces privacy risk and communication overhead while enabling parallel evaluation across heterogeneous devices. Empirical results on diverse domain tasks demonstrate that FedGRPO achieves superior downstream accuracy and communication efficiency compared to conventional FedFMs baselines.
[LG-23] Momentum LMS Theory beyond Stationarity: Stability Tracking and Regret
链接: https://arxiv.org/abs/2602.11995
作者: Yifei Jin,Xin Zheng,Lei Guo
类目: Machine Learning (cs.LG)
*备注: 9 pages, 3 figures
Abstract:In large-scale data processing scenarios, data often arrive in sequential streams generated by complex systems that exhibit drifting distributions and time-varying system parameters. This nonstationarity challenges theoretical analysis, as it violates classical assumptions of i.i.d. (independent and identically distributed) samples, necessitating algorithms capable of real-time updates without expensive retraining. An effective approach should process each sample in a single pass, while maintaining computational and memory complexities independent of the data stream length. Motivated by these challenges, this paper investigates the Momentum Least Mean Squares (MLMS) algorithm as an adaptive identification tool, leveraging its computational simplicity and online processing capabilities. Theoretically, we derive tracking performance and regret bounds for the MLMS in time-varying stochastic linear systems under various practical conditions. Unlike classical LMS, whose stability can be characterized by first-order random vector difference equations, MLMS introduces an additional dynamical state due to momentum, leading to second-order time-varying random vector difference equations whose stability analysis hinges on more complicated products of random matrices, which poses a substantially challenging problem to resolve. Experiments on synthetic and real-world data streams demonstrate that MLMS achieves rapid adaptation and robust tracking, in agreement with our theoretical results especially in nonstationary settings, highlighting its promise for modern streaming and online learning applications.
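The MLMS recursion itself is compact; below is a minimal NumPy sketch of the classical momentum-augmented LMS update tracking a drifting linear system. The step size mu and momentum gamma are illustrative choices, not values from the paper.

```python
import numpy as np

def mlms(phi, y, mu=0.05, gamma=0.9):
    """Momentum LMS on a stream of regressors phi (T, d) and outputs y (T,).
    Classical form: w <- w + mu * e * phi_t + gamma * (w - w_prev)."""
    w = w_prev = np.zeros(phi.shape[1])
    for t in range(len(y)):
        e = y[t] - phi[t] @ w                        # instantaneous error
        w, w_prev = w + mu * e * phi[t] + gamma * (w - w_prev), w
    return w

# Track a slowly drifting parameter vector (nonstationary system)
rng = np.random.default_rng(1)
T, d = 2000, 4
theta = 1.0 + np.cumsum(0.01 * rng.normal(size=(T, d)), axis=0)
phi = rng.normal(size=(T, d))
y = (phi * theta).sum(axis=1) + 0.1 * rng.normal(size=T)
print(mlms(phi, y), theta[-1])                       # estimate vs. true final value
```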
[LG-24] Are Two LLMs Better Than One? A Student-Teacher Dual-Head LLMs Architecture for Pharmaceutical Content Optimization
链接: https://arxiv.org/abs/2602.11957
作者: Suyash Mishra,Qiang Li,Anubhav Girdhar
类目: Machine Learning (cs.LG)
*备注: Submitted to the Demo Track of Top Tier Conference; currently under peer review
Abstract:Large language models (LLMs) are increasingly used to create content in regulated domains such as pharmaceuticals, where outputs must be scientifically accurate and legally compliant. Manual quality control (QC) is slow, error prone, and can become a publication bottleneck. We introduce LRBTC, a modular LLM and vision language model (VLM) driven QC architecture covering Language, Regulatory, Brand, Technical, and Content Structure checks. LRBTC combines a Student-Teacher dual model architecture, human in the loop (HITL) workflow with waterfall rule filtering to enable scalable, verifiable content validation and optimization. On AIReg-Bench, our approach achieves 83.0% F1 and 97.5% recall, reducing missed violations by 5x compared with Gemini 2.5 Pro. On CSpelling, it improves mean accuracy by 26.7%. Error analysis further reveals that while current models are strong at detecting misspellings (92.5 recall), they fail to identify complex medical grammatical (25.0 recall) and punctuation (41.7 recall) errors, highlighting a key area for future work. This work provides a practical, plug and play solution for reliable, transparent quality control of content in high stakes, compliance critical industries. We also provide access to our Demo under MIT Licenses.
[LG-25] Using predictive multiplicity to measure individual performance within the AI Act
链接: https://arxiv.org/abs/2602.11944
作者: Karolin Frohnapfel,Mara Seyfert,Sebastian Bordt,Ulrike von Luxburg,Kristof Meding
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:
Abstract:When building AI systems for decision support, one often encounters the phenomenon of predictive multiplicity: a single best model does not exist; instead, one can construct many models with similar overall accuracy that differ in their predictions for individual cases. Especially when decisions have a direct impact on humans, this can be highly unsatisfactory. For a person subject to high disagreement between models, one could as well have chosen a different model of similar overall accuracy that would have decided the person’s case differently. We argue that this arbitrariness conflicts with the EU AI Act, which requires providers of high-risk AI systems to report performance not only at the dataset level but also for specific persons. The goal of this paper is to put predictive multiplicity in context with the EU AI Act’s provisions on accuracy and to subsequently derive concrete suggestions on how to evaluate and report predictive multiplicity in practice. Specifically: (1) We argue that incorporating information about predictive multiplicity can serve compliance with the EU AI Act’s accuracy provisions for providers. (2) Based on this legal analysis, we suggest individual conflict ratios and \delta-ambiguity as tools to quantify the disagreement between models on individual cases and to help detect individuals subject to conflicting predictions. (3) Based on computational insights, we derive easy-to-implement rules on how model providers could evaluate predictive multiplicity in practice. (4) Ultimately, we suggest that information about predictive multiplicity should be made available to deployers under the AI Act, enabling them to judge whether system outputs for specific individuals are reliable enough for their use case.
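A minimal sketch of how one might quantify per-individual disagreement across a set of similarly accurate models is given below; it is an illustrative reading, and the paper's exact definitions of individual conflict ratios and \delta-ambiguity may differ.

```python
import numpy as np

def conflict_ratio(preds):
    """preds: (M, n) binary predictions of M similarly accurate models on
    n individuals. Returns the per-individual share of models deviating
    from the majority vote: 0 = full agreement, ~0.5 = maximal conflict."""
    majority = (preds.mean(axis=0) >= 0.5).astype(int)
    return (preds != majority).mean(axis=0)

preds = np.array([[1, 1, 0, 1],
                  [1, 0, 0, 1],
                  [1, 1, 1, 1]])
print(conflict_ratio(preds))   # individuals 1 and 2 receive conflicting predictions
```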
[LG-26] Temporally Unified Adversarial Perturbations for Time Series Forecasting
链接: https://arxiv.org/abs/2602.11940
作者: Ruixian Su,Yukun Bao,Xinze Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:While deep learning models have achieved remarkable success in time series forecasting, their vulnerability to adversarial examples remains a critical security concern. However, existing attack methods in the forecasting field typically ignore the temporal consistency inherent in time series data, leading to divergent and contradictory perturbation values for the same timestamp across overlapping samples. This temporal inconsistency renders such adversarial attacks impractical for real-world data manipulation. To address this, we introduce Temporally Unified Adversarial Perturbations (TUAPs), which enforce a temporal unification constraint to ensure identical perturbations for each timestamp across all overlapping samples. Moreover, we propose a novel Timestamp-wise Gradient Accumulation Method (TGAM) that provides a modular and efficient approach to effectively generate TUAPs by aggregating local gradient information from overlapping samples. By integrating TGAM with momentum-based attack algorithms, we ensure strict temporal consistency while fully utilizing series-level gradient information to explore the adversarial perturbation space. Comprehensive experiments on three benchmark datasets and four representative state-of-the-art models demonstrate that our proposed method significantly outperforms baselines in both white-box and black-box transfer attack scenarios under TUAP constraints. Moreover, our method also exhibits superior transfer attack performance even without TUAP constraints, demonstrating its effectiveness and superiority in generating adversarial perturbations for time series forecasting models.
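The core bookkeeping behind a temporally unified perturbation is simple to sketch: gradients from overlapping sliding windows are accumulated back onto the original timeline before a single perturbation value per timestamp is derived. The snippet below is an illustrative NumPy reading of this idea, not the authors' TGAM implementation.

```python
import numpy as np

def unified_perturbation(sample_grads, starts, T, eps=0.1):
    """Aggregate per-window perturbation gradients onto the original
    timeline, then derive one perturbation value per timestamp so that
    all overlapping windows see an identical modification."""
    g = np.zeros(T)
    for grad, s in zip(sample_grads, starts):
        g[s:s + len(grad)] += grad       # overlapping windows vote on shared steps
    return eps * np.sign(g)              # FGSM-style unified direction

T, w = 10, 4
rng = np.random.default_rng(2)
grads = [rng.normal(size=w) for _ in range(T - w + 1)]
delta = unified_perturbation(grads, range(T - w + 1), T)
```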
[LG-27] Extending Puzzle for Mixture-of-Experts Reasoning Models with Application to GPT-OSS Acceleration
链接: https://arxiv.org/abs/2602.11937
作者: Akhiad Bercovich,Nir Ailon,Vladimir Anisimov,Tomer Asida,Nave Assaf,Mohammad Dabbah,Ido Galil,Amnon Geifman,Yonatan Geifman,Izhak Golan,Roi Koren,Itay Levy,Zach Moshe,Pavlo Molchanov,Najeeb Nabwani,Mostofa Patwari,Omri Puny,Tomer Ronen,Itamar Schen,Elad Segal,Ido Shahaf,Oren Tropp,Ran Zilberstein,Ran El-Yaniv
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reasoning-focused LLMs improve answer quality by generating longer reasoning traces, but the additional tokens dramatically increase serving cost, motivating inference optimization. We extend and apply Puzzle, a post-training neural architecture search (NAS) framework, to gpt-oss-120B to produce gpt-oss-puzzle-88B, a deployment-optimized derivative. Our approach combines heterogeneous MoE expert pruning, selective replacement of full-context attention with window attention, FP8 KV-cache quantization with calibrated scales, and post-training reinforcement learning to recover accuracy, while maintaining low generation length. In terms of per-token speeds, on an 8XH100 node we achieve 1.63X and 1.22X throughput speedups in long-context and short-context settings, respectively. gpt-oss-puzzle-88B also delivers throughput speedups of 2.82X on a single NVIDIA H100 GPU. However, because token counts can change with reasoning effort and model variants, per-token throughput (tok/s) and latency (ms/token) do not necessarily lead to end-to-end speedups: a 2X throughput gain is erased if traces grow 2X. Conversely, throughput gains can be spent on more reasoning tokens to improve accuracy; we therefore advocate request-level efficiency metrics that normalize throughput by tokens generated and trace an accuracy–speed frontier across reasoning efforts. We show that gpt-oss-puzzle-88B improves over gpt-oss-120B along the entire frontier, delivering up to 1.29X higher request-level efficiency. Across various benchmarks, gpt-oss-puzzle-88B matches or slightly exceeds the parent on suite-average accuracy across reasoning efforts, with retention ranging from 100.8% (high) to 108.2% (low), showing that post-training architecture search can substantially reduce inference costs without sacrificing quality.
[LG-28] Learning Conditional Averages
链接: https://arxiv.org/abs/2602.11920
作者: Marco Bressan,Nataly Brukhim,Nicolo Cesa-Bianchi,Emmanuel Esposito,Yishay Mansour,Shay Moran,Maximilian Thiessen
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We introduce the problem of learning conditional averages in the PAC framework. The learner receives a sample labeled by an unknown target concept from a known concept class, as in standard PAC learning. However, instead of learning the target concept itself, the goal is to predict, for each instance, the average label over its neighborhood – an arbitrary subset of points that contains the instance. In the degenerate case where all neighborhoods are singletons, the problem reduces exactly to classic PAC learning. More generally, it extends PAC learning to a setting that captures learning tasks arising in several domains, including explainability, fairness, and recommendation systems. Our main contribution is a complete characterization of when conditional averages are learnable, together with sample complexity bounds that are tight up to logarithmic factors. The characterization hinges on the joint finiteness of two novel combinatorial parameters, which depend on both the concept class and the neighborhood system, and are closely related to the independence number of the associated neighborhood graph.
[LG-29] TADA! Tuning Audio Diffusion Models through Activation Steering
链接: https://arxiv.org/abs/2602.11910
作者: Łukasz Staniszewski,Katarzyna Zaleska,Mateusz Modrzejewski,Kamil Deja
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注: Preprint. Preliminary work
Abstract:Audio diffusion models can synthesize high-fidelity music from text, yet their internal mechanisms for representing high-level concepts remain poorly understood. In this work, we use activation patching to demonstrate that distinct semantic musical concepts, such as the presence of specific instruments, vocals, or genre characteristics, are controlled by a small, shared subset of attention layers in state-of-the-art audio diffusion architectures. Next, we demonstrate that applying Contrastive Activation Addition and Sparse Autoencoders in these layers enables more precise control over the generated audio, indicating a direct benefit of the specialization phenomenon. By steering activations of the identified layers, we can alter specific musical elements with high precision, such as modulating tempo or changing a track’s mood.
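Contrastive Activation Addition in its generic form is easy to state: the steering vector is the mean activation difference between inputs with and without the target concept, added at a chosen layer during generation. The toy sketch below illustrates that mechanism on a plain linear layer; the layer choice and scale alpha are assumptions, and the paper applies this inside audio diffusion attention layers, not an MLP.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(16, 16)            # stands in for an attention layer

acts_with = layer(torch.randn(32, 16))     # inputs exhibiting the concept
acts_without = layer(torch.randn(32, 16))  # matched inputs without it
steer = (acts_with - acts_without).mean(dim=0).detach()

def steered_forward(x, alpha=2.0):
    # Shift the layer's activations toward the concept direction
    return layer(x) + alpha * steer

out = steered_forward(torch.randn(1, 16))
```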
[LG-30] Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning ICLR2026
链接: https://arxiv.org/abs/2602.11909
作者: Daiqing Wu,Xuan Zhang,Dongbao Yang,Jiashu Yao,Longfei Chen,Qingsong Liu,Sicheng Zhao,Can Ma,Yangyang Kang,Yu Zhou
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注: Accepted by ICLR 2026
Abstract:The maturation of Large Audio Language Models (LALMs) has raised growing expectations for them to comprehend complex audio much like humans. Current efforts primarily replicate text-based reasoning by contextualizing audio content through a one-time encoding, which introduces a critical information bottleneck. Drawing inspiration from human cognition, we propose audio-interleaved reasoning to break through this bottleneck. It treats audio as an active reasoning component, enabling sustained audio engagement and perception-grounded analysis. To instantiate it, we introduce a two-stage training framework, first teaching LALMs to localize salient audio segments through supervised fine-tuning, and then incentivizing proficient re-listening via reinforcement learning. In parallel, a structured data generation pipeline is developed to produce high-quality training data. Consequently, we present Echo, a LALM capable of dynamically re-listening to audio in demand during reasoning. On audio comprehension benchmarks, Echo achieves overall superiority in both challenging expert-level and general-purpose tasks. Comprehensive analysis further confirms the efficiency and generalizability of audio-interleaved reasoning, establishing it as a promising direction for advancing audio comprehension. Project page: this https URL.
[LG-31] Universal Diffusion-Based Probabilistic Downscaling
链接: https://arxiv.org/abs/2602.11893
作者: Roberto Molinaro,Niall Siegenheim,Henry Martin,Mark Frey,Niels Poulsen,Philipp Seitz,Marvin Vincent Gabler
类目: Machine Learning (cs.LG)
*备注:
Abstract:We introduce a universal diffusion-based downscaling framework that lifts deterministic low-resolution weather forecasts into probabilistic high-resolution predictions without any model-specific fine-tuning. A single conditional diffusion model is trained on paired coarse-resolution inputs (~25 km resolution) and high-resolution regional reanalysis targets (~5 km resolution), and is applied in a fully zero-shot manner to deterministic forecasts from heterogeneous upstream weather models. Focusing on near-surface variables, we evaluate probabilistic forecasts against independent in situ station observations over lead times up to 90 h. Across a diverse set of AI-based and numerical weather prediction (NWP) systems, the ensemble mean of the downscaled forecasts consistently improves upon each model’s own raw deterministic forecast, and substantially larger gains are observed in probabilistic skill as measured by CRPS. These results demonstrate that diffusion-based downscaling provides a scalable, model-agnostic probabilistic interface for enhancing spatial resolution and uncertainty representation in operational weather forecasting pipelines.
[LG-32] In-Context Function Learning in Large Language Models
链接: https://arxiv.org/abs/2602.11863
作者: Elif Akata,Konstantinos Voudouris,Vincent Fortuin,Eric Schulz
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models (LLMs) can learn from a few demonstrations provided at inference time. We study this in-context learning phenomenon through the lens of Gaussian Processes (GPs). We build controlled experiments where models observe sequences of multivariate scalar-valued function samples drawn from known GP priors. We evaluate prediction error in relation to the number of demonstrations and compare against two principled references: (i) an empirical GP-regression learner that gives a lower bound on achievable error, and (ii) the expected error of a 1-nearest-neighbor (1-NN) rule, which gives a data-driven upper bound. Across model sizes, we find that LLM learning curves are strongly influenced by the function-generating kernels and approach the GP lower bound as the number of demonstrations increases. We then study the inductive biases of these models using a likelihood-based analysis. We find that LLM predictions are most likely under less smooth GP kernels. Finally, we explore whether post-training can shift these inductive biases and improve sample-efficiency on functions sampled from GPs with smoother kernels. We find that both reinforcement learning and supervised fine-tuning can effectively shift inductive biases in the direction of the training data. Together, our framework quantifies the extent to which LLMs behave like GP learners and provides tools for steering their inductive biases for continuous function learning tasks.
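The two reference predictors bracketing the LLM learning curves are standard and easy to state; below is a minimal NumPy sketch of the GP-regression lower-bound predictor and the 1-nearest-neighbor upper-bound rule on a toy 1-D function. Kernel and noise settings are illustrative, not the paper's experimental configuration.

```python
import numpy as np

def rbf(X1, X2, ls=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def gp_mean(X, y, Xs, ls=1.0, noise=1e-2):
    """GP-regression posterior mean: the empirical error lower bound."""
    K = rbf(X, X, ls) + noise * np.eye(len(X))
    return rbf(Xs, X, ls) @ np.linalg.solve(K, y)

def one_nn(X, y, Xs):
    """1-nearest-neighbour rule: the data-driven error upper bound."""
    return y[((Xs[:, None, :] - X[None, :, :]) ** 2).sum(-1).argmin(1)]

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, (20, 1))                 # in-context demonstrations
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)
Xs = np.linspace(-3, 3, 50)[:, None]
lower, upper = gp_mean(X, y, Xs), one_nn(X, y, Xs)
```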
[LG-33] Scale-Invariant Fast Convergence in Games
链接: https://arxiv.org/abs/2602.11857
作者: Taira Tsuchiya,Haipeng Luo,Shinji Ito
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 44 pages
Abstract:Scale-invariance in games has recently emerged as a widely valued desirable property. Yet, almost all fast convergence guarantees in learning in games require prior knowledge of the utility scale. To address this, we develop learning dynamics that achieve fast convergence while being both scale-free, requiring no prior information about utilities, and scale-invariant, remaining unchanged under positive rescaling of utilities. For two-player zero-sum games, we obtain scale-free and scale-invariant dynamics with external regret bounded by \tilde{O}(A_{\mathrm{diff}}), where A_{\mathrm{diff}} is the payoff range, which implies an \tilde{O}(A_{\mathrm{diff}} / T) convergence rate to Nash equilibrium after T rounds. For multiplayer general-sum games with n players and m actions, we obtain scale-free and scale-invariant dynamics with swap regret bounded by O(U_{\mathrm{max}} \log T), where U_{\mathrm{max}} is the range of the utilities, ignoring the dependence on the number of players and actions. This yields an O(U_{\mathrm{max}} \log T / T) convergence rate to correlated equilibrium. Our learning dynamics are based on optimistic follow-the-regularized-leader with an adaptive learning rate that incorporates the squared path length of the opponents’ gradient vectors, together with a new stopping-time analysis that exploits negative terms in regret bounds without scale-dependent tuning. For general-sum games, scale-free learning is enabled also by a technique called doubling clipping, which clips observed gradients based on past observations.
[LG-34] Robust Optimization Approach and Learning Based Hide-and-Seek Game for Resilient Network Design
链接: https://arxiv.org/abs/2602.11854
作者: Mohammad Khosravi,Setareh Maghsudi
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study the design of resilient and reliable communication networks in which a signal can be transferred only up to a limited distance before its quality falls below an acceptable threshold. When excessive signal degradation occurs, regeneration is required through regenerators installed at selected network nodes. In this work, both network links and nodes are subject to uncertainty. The installation costs of regenerators are modeled using a budgeted uncertainty set. In addition, link lengths follow a dynamic budgeted uncertainty set introduced in this paper, where deviations may vary over time. Robust optimization seeks solutions whose performance is guaranteed under all scenarios represented by the underlying uncertainty set. Accordingly, the objective is to identify a minimum-cost subset of nodes for regenerator deployment that ensures full network connectivity, even under the worst possible realizations of uncertainty. To solve the problem, we first formulate it within a robust optimization framework, and then develop scalable solution methods based on column-and-constraint generation, Benders decomposition, and iterative robust optimization. In addition, we formulate a learning-based hide-and-seek game to further analyze the problem structure. The proposed approaches are evaluated against classical static budgeted robust models and deterministic worst-case formulations. Both theoretical analysis and computational results demonstrate the effectiveness and advantages of our methodology.
[LG-35] Towards Sustainable Investment Policies Informed by Opponent Shaping ICLR2026
链接: https://arxiv.org/abs/2602.11829
作者: Juan Agustin Duque,Razvan Ciuca,Ayoub Echchahed,Hugo Larochelle,Aaron Courville
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注: Accepted at ICLR 2026
Abstract:Addressing climate change requires global coordination, yet rational economic actors often prioritize immediate gains over collective welfare, resulting in social dilemmas. InvestESG is a recently proposed multi-agent simulation that captures the dynamic interplay between investors and companies under climate risk. We provide a formal characterization of the conditions under which InvestESG exhibits an intertemporal social dilemma, deriving theoretical thresholds at which individual incentives diverge from collective welfare. Building on this, we apply Advantage Alignment, a scalable opponent shaping algorithm shown to be effective in general-sum games, to influence agent learning in InvestESG. We offer theoretical insights into why Advantage Alignment systematically favors socially beneficial equilibria by biasing learning dynamics toward cooperative outcomes. Our results demonstrate that strategically shaping the learning processes of economic agents can result in better outcomes that could inform policy mechanisms to better align market incentives with long-term sustainability goals.
[LG-36] CAAL: Confidence-Aware Active Learning for Heteroscedastic Atmospheric Regression
链接: https://arxiv.org/abs/2602.11825
作者: Fei Jiang,Jiyang Xia,Junjie Yu,Mingfei Sun,Hugh Coe,David Topping,Dantong Liu,Zhenhui Jessie Li,Zhonghua Zheng
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: 17 pages in total
Abstract:Quantifying the impacts of air pollution on health and climate relies on key atmospheric particle properties such as toxicity and hygroscopicity. However, these properties typically require complex observational techniques or expensive particle-resolved numerical simulations, limiting the availability of labeled data. We therefore estimate these hard-to-measure particle properties from routinely available observations (e.g., air pollutant concentrations and meteorological conditions). Because routine observations only indirectly reflect particle composition and structure, the mapping from routine observations to particle properties is noisy and input-dependent, yielding a heteroscedastic regression setting. With a limited and costly labeling budget, the central challenge is to select which samples to measure or simulate. While active learning is a natural approach, most acquisition strategies rely on predictive uncertainty. Under heteroscedastic noise, this signal conflates reducible epistemic uncertainty with irreducible aleatoric uncertainty, causing limited budgets to be wasted in noise-dominated regions. To address this challenge, we propose a confidence-aware active learning framework (CAAL) for efficient and robust sample selection in heteroscedastic settings. CAAL consists of two components: a decoupled uncertainty-aware training objective that separately optimises the predictive mean and noise level to stabilise uncertainty estimation, and a confidence-aware acquisition function that dynamically weights epistemic uncertainty using predicted aleatoric uncertainty as a reliability signal. Experiments on particle-resolved numerical simulations and real atmospheric observations show that CAAL consistently outperforms standard AL baselines. The proposed framework provides a practical and general solution for the efficient expansion of high-cost atmospheric particle property databases.
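The acquisition idea can be sketched in a few lines: epistemic uncertainty is down-weighted wherever the predicted aleatoric noise is high, so the labeling budget avoids noise-dominated regions. The exact weighting below is an assumption for illustration; the paper's acquisition function may take a different form.

```python
import numpy as np

def caal_score(epistemic, aleatoric, tau=1.0):
    """Down-weight epistemic uncertainty where the predicted aleatoric
    noise is high, using aleatoric uncertainty as a reliability signal."""
    reliability = 1.0 / (1.0 + aleatoric / tau)
    return epistemic * reliability

epistemic = np.array([0.9, 0.9, 0.2])
aleatoric = np.array([0.1, 5.0, 0.1])
print(caal_score(epistemic, aleatoric).argmax())  # picks 0: uncertain but low-noise
```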
[LG-37] Deep Kernel Fusion for Transformers
链接: https://arxiv.org/abs/2602.11808
作者: Zixi Zhang,Zhiwen Mo,Yiren Zhao,Robert Mullins
类目: Machine Learning (cs.LG)
*备注:
Abstract:Agentic LLM inference with long contexts is increasingly limited by memory bandwidth rather than compute. In this setting, SwiGLU MLP blocks, whose large weights exceed cache capacity, become a major yet under-optimized bottleneck. We propose DeepFusionKernel, a deeply fused kernel that cuts HBM traffic and boosts cache reuse, delivering up to 13.2% speedup on H100 and 9.7% on A100 over SGLang. Integrated with SGLang and paired with a kernel scheduler, DeepFusionKernel ensures consistent accelerations over generation lengths, while remaining adaptable to diverse models, inference configurations, and hardware platforms.
[LG-38] From Path Signatures to Sequential Modeling: Incremental Signature Contributions for Offline RL
链接: https://arxiv.org/abs/2602.11805
作者: Ziyi Zhao,Qingchuan Li,Yuxuan Xu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Path signatures embed trajectories into tensor algebra and constitute a universal, non-parametric representation of paths; however, in the standard form, they collapse temporal structure into a single global object, which limits their suitability for decision-making problems that require step-wise reactivity. We propose the Incremental Signature Contribution (ISC) method, which decomposes truncated path signatures into a temporally ordered sequence of elements in the tensor-algebra space, corresponding to incremental contributions induced by last path increments. This reconstruction preserves the algebraic structure and expressivity of signatures, while making their internal temporal evolution explicit, enabling processing signature-based representations via sequential modeling approaches. In contrast to full signatures, ISC is inherently sensitive to instantaneous trajectory updates, which is critical for sensitive and stability-requiring control dynamics. Building on this representation, we introduce ISC-Transformer (ISCT), an offline reinforcement learning model that integrates ISC into a standard Transformer architecture without further architectural modification. We evaluate ISCT on HalfCheetah, Walker2d, Hopper, and Maze2d, including settings with delayed rewards and downgraded datasets. The results demonstrate that ISC method provides a theoretically grounded and practically effective alternative to path processing for temporally sensitive control tasks.
[LG-39] TopoFair: Linking Topological Bias to Fairness in Link Prediction Benchmarks
链接: https://arxiv.org/abs/2602.11802
作者: Lilian Marey,Mathilde Perez,Tiphaine Viard,Charlotte Laclau
类目: Machine Learning (cs.LG)
*备注:
Abstract:Graph link prediction (LP) plays a critical role in socially impactful applications, such as job recommendation and friendship formation. Ensuring fairness in this task is thus essential. While many fairness-aware methods manipulate graph structures to mitigate prediction disparities, the topological biases inherent to social graph structures remain poorly understood and are often reduced to homophily alone. This undermines the generalization potential of fairness interventions and limits their applicability across diverse network topologies. In this work, we propose a novel benchmarking framework for fair LP, centered on the structural biases of the underlying graphs. We begin by reviewing and formalizing a broad taxonomy of topological bias measures relevant to fairness in graphs. In parallel, we introduce a flexible graph generation method that simultaneously ensures fidelity to real-world graph patterns and enables controlled variation across a wide spectrum of structural biases. We apply this framework to evaluate both classical and fairness-aware LP models across multiple use cases. Our results provide a fine-grained empirical analysis of the interactions between predictive fairness and structural biases. This new perspective reveals the sensitivity of fairness interventions to beyond-homophily biases and underscores the need for structurally grounded fairness evaluations in graph learning.
[LG-40] SpaTeoGL: Spatiotemporal Graph Learning for Interpretable Seizure Onset Zone Analysis from Intracranial EEG
链接: https://arxiv.org/abs/2602.11801
作者: Elham Rostami,Aref Einizade,Taous-Meriem Laleg-Kirati
类目: Machine Learning (cs.LG)
*备注: 5 pages, 4 figures
Abstract:Accurate localization of the seizure onset zone (SOZ) from intracranial EEG (iEEG) is essential for epilepsy surgery but is challenged by complex spatiotemporal seizure dynamics. We propose SpaTeoGL, a spatiotemporal graph learning framework for interpretable seizure network analysis. SpaTeoGL jointly learns window-level spatial graphs capturing interactions among iEEG electrodes and a temporal graph linking time windows based on similarity of their spatial structure. The method is formulated within a smooth graph signal processing framework and solved via an alternating block coordinate descent algorithm with convergence guarantees. Experiments on a multicenter iEEG dataset with successful surgical outcomes show that SpaTeoGL is competitive with a baseline based on horizontal visibility graphs and logistic regression, while improving non-SOZ identification and providing interpretable insights into seizure onset and propagation dynamics.
[LG-41] Temporal Difference Learning with Constrained Initial Representations
链接: https://arxiv.org/abs/2602.11800
作者: Jiafei Lyu,Jingwen Yang,Zhongjian Qiao,Runze Liu,Zeyuan Liu,Deheng Ye,Zongqing Lu,Xiu Li
类目: Machine Learning (cs.LG)
*备注: 35 pages
Abstract:Recently, there have been numerous attempts to enhance the sample efficiency of off-policy reinforcement learning (RL) agents when interacting with the environment, including architecture improvements and new algorithms. Despite these advances, they overlook the potential of directly constraining the initial representations of the input data, which can intuitively alleviate the distribution shift issue and stabilize training. In this paper, we introduce the Tanh function into the initial layer to fulfill such a constraint. We theoretically unpack the convergence property of the temporal difference learning with the Tanh function under linear function approximation. Motivated by theoretical insights, we present our Constrained Initial Representations framework, tagged CIR, which is made up of three components: (i) the Tanh activation along with normalization methods to stabilize representations; (ii) the skip connection module to provide a linear pathway from the shallow layer to the deep layer; (iii) the convex Q-learning that allows a more flexible value estimate and mitigates potential conservatism. Empirical results show that CIR exhibits strong performance on numerous continuous control tasks, even being competitive or surpassing existing strong baseline methods.
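A minimal PyTorch sketch of the first two CIR components, a Tanh-plus-normalization initial layer and a shallow-to-deep skip connection, is shown below; module sizes are illustrative and the convex Q-learning component is omitted.

```python
import torch
import torch.nn as nn

class ConstrainedInitialRepr(nn.Module):
    """Tanh-bounded, normalized initial layer plus a shallow-to-deep skip."""
    def __init__(self, in_dim, hidden=256):
        super().__init__()
        self.init_layer = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.LayerNorm(hidden), nn.Tanh())
        self.deep = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden))

    def forward(self, x):
        h0 = self.init_layer(x)     # initial representations constrained to (-1, 1)
        return self.deep(h0) + h0   # linear pathway from shallow to deep layer

feats = ConstrainedInitialRepr(in_dim=17)(torch.randn(8, 17))
```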
[LG-42] Latent-Variable Learning of SPDEs via Wiener Chaos
链接: https://arxiv.org/abs/2602.11794
作者: Sebastian Zeng,Andreas Petersson,Wolfgang Bock
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study the problem of learning the law of linear stochastic partial differential equations (SPDEs) with additive Gaussian forcing from spatiotemporal observations. Most existing deep learning approaches either assume access to the driving noise or initial condition, or rely on deterministic surrogate models that fail to capture intrinsic stochasticity. We propose a structured latent-variable formulation that requires only observations of solution realizations and learns the underlying randomly forced dynamics. Our approach combines a spectral Galerkin projection with a truncated Wiener chaos expansion, yielding a principled separation between deterministic evolution and stochastic forcing. This reduces the infinite-dimensional SPDE to a finite system of parametrized ordinary differential equations governing latent temporal dynamics. The latent dynamics and stochastic forcing are jointly inferred through variational learning, allowing recovery of stochastic structure without explicit observation or simulation of noise during training. Empirical evaluation on synthetic data demonstrates state-of-the-art performance under comparable modeling assumptions across bounded and unbounded one-dimensional spatial domains.
[LG-43] Temperature as a Meta-Policy: Adaptive Temperature in LLM Reinforcement Learning ICLR2026
链接: https://arxiv.org/abs/2602.11779
作者: Haoran Dang,Cuiling Lan,Hai Wan,Xibin Zhao,Yan Lu
类目: Machine Learning (cs.LG)
*备注: Accepted at ICLR 2026. 10 pages (main text) + supplementary material, 6 figures
Abstract:Temperature is a crucial hyperparameter in large language models (LLMs), controlling the trade-off between exploration and exploitation during text generation. High temperatures encourage diverse but noisy outputs, while low temperatures produce focused outputs but may cause premature convergence. Yet static or heuristic temperature schedules fail to adapt to the dynamic demands of reinforcement learning (RL) throughout training, often limiting policy improvement. We propose Temperature Adaptive Meta Policy Optimization (TAMPO), a new framework that recasts temperature control as a learnable meta-policy. TAMPO operates through a hierarchical two-loop process. In the inner loop, the LLM policy is updated (e.g., using GRPO) with trajectories sampled at the temperature selected by the meta-policy. In the outer loop, the meta-policy updates the distribution over candidate temperatures by rewarding those that maximize the likelihood of high-advantage trajectories. This trajectory-guided, reward-driven mechanism enables online adaptation without additional rollouts, directly aligning exploration with policy improvement. On five mathematical reasoning benchmarks, TAMPO outperforms baselines using fixed or heuristic temperatures, establishing temperature as an effective learnable meta-policy for adaptive exploration in LLM reinforcement learning.
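One way to picture the outer loop: a softmax distribution over candidate temperatures receives a REINFORCE-style update that rewards the temperature whose rollouts produced high-advantage trajectories. This is an illustrative reading; the paper's update is a likelihood-based objective, and the learning rate and candidate grid below are assumptions.

```python
import numpy as np

def update_meta_policy(logits, chosen, advantage, lr=0.1):
    """Policy-gradient step on a softmax over candidate temperatures,
    rewarding the temperature whose rollouts had high advantage."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad = -probs
    grad[chosen] += 1.0                      # d log pi(chosen) / d logits
    return logits + lr * advantage * grad

temps = np.array([0.3, 0.7, 1.0, 1.3])       # candidate sampling temperatures
logits = np.zeros(len(temps))
logits = update_meta_policy(logits, chosen=1, advantage=0.8)
# The inner loop would then sample rollouts at the currently selected temperature
```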
[LG-44] MUSE: Multi-Tenant Model Serving With Seamless Model Updates KDD2026
链接: https://arxiv.org/abs/2602.11776
作者: Cláudio Correia,Alberto E. A. Ferreira,Lucas Martins,Miguel P. Bento,Sofia Guerreiro,Ricardo Ribeiro Pereira,Ana Sofia Gomes,Jacopo Bono,Hugo Ferreira,Pedro Bizarro
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Currently under review for KDD 2026 (Applied Data Science)
Abstract:In binary classification systems, decision thresholds translate model scores into actions. Choosing suitable thresholds relies on the specific distribution of the underlying model scores but also on the specific business decisions of each client using that model. However, retraining models inevitably shifts score distributions, invalidating existing thresholds. In multi-tenant Score-as-a-Service environments, where decision boundaries reside in client-managed infrastructure, this creates a severe bottleneck: recalibration requires coordinating threshold updates across hundreds of clients, consuming excessive human hours and leading to model stagnation. We introduce MUSE, a model serving framework that enables seamless model updates by decoupling model scores from client decision boundaries. Designed for multi-tenancy, MUSE optimizes infrastructure re-use by sharing models via dynamic intent-based routing, combined with a two-level score transformation that maps model outputs to a stable, reference distribution. Deployed at scale by Feedzai, MUSE processes over a thousand events per second, and over 55 billion events in the last 12 months, across several dozens of tenants, while maintaining high-availability and low-latency guarantees. By reducing model lead time from weeks to minutes, MUSE promotes model resilience against shifting attacks, saving millions of dollars in fraud losses and operational costs.
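One plausible realisation of the paper's two-level score transformation is quantile (CDF) matching: map a retrained model's raw scores onto a frozen reference distribution so that client-side thresholds keep their meaning across retrains. The sketch below is illustrative, not Feedzai's production implementation.

```python
import numpy as np

def to_reference_scores(raw, raw_calib, ref_calib):
    """Quantile-match a new model's raw scores onto a frozen reference
    distribution so downstream decision thresholds remain valid."""
    ranks = np.searchsorted(np.sort(raw_calib), raw) / len(raw_calib)
    return np.quantile(ref_calib, np.clip(ranks, 0.0, 1.0))

rng = np.random.default_rng(4)
ref_calib = rng.beta(2, 5, 10_000)            # stable reference distribution
raw_calib = rng.normal(0.0, 1.0, 10_000)      # retrained model's calibration scores
print(to_reference_scores(np.array([-1.0, 0.0, 2.0]), raw_calib, ref_calib))
```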
[LG-45] TUBO: A Tailored ML Framework for Reliable Network Traffic Forecasting
链接: https://arxiv.org/abs/2602.11759
作者: Zhihang Yuan,Leyang Xue,Waleed Ahsan,Mahesh K. Marina
类目: Machine Learning (cs.LG)
*备注: Short version of this paper is presented at ICDCS 2025
Abstract:Traffic forecasting based network operation optimization and management offers enormous promise but also presents significant challenges from a traffic forecasting perspective. While deep learning models have proven to be relatively more effective than traditional statistical methods for time series forecasting, their reliability is not satisfactory due to their inability to effectively handle the unique characteristics of network traffic. In particular, the bursty and complex traffic patterns make existing models less reliable, as each type of deep learning model has limited capability in capturing traffic patterns. To address this issue, we introduce TUBO, a novel machine learning framework custom designed for reliable network traffic forecasting. TUBO features two key components: burst processing for handling significant traffic fluctuations and model selection for adapting to varying traffic patterns using a pool of models. A standout feature of TUBO is its ability to provide deterministic predictions along with quantified uncertainty, which serves as a cue for identifying the most reliable forecasts. Evaluations on three real-world network demand matrix (DM) datasets (Abilene, GEANT, and CERNET) show that TUBO significantly outperforms existing methods on forecasting accuracy (by 4 times), and also achieves up to 94% accuracy in burst occurrence forecasting. Furthermore, we also consider traffic demand forecasting based proactive traffic engineering (TE) as a downstream use case. Our results show that compared to reactive approaches and proactive TE using the best existing DM forecasting methods, proactive TE powered by TUBO improves aggregated throughput by 9 times and 3 times, respectively.
[LG-46] U-Former ODE: Fast Probabilistic Forecasting of Irregular Time Series
链接: https://arxiv.org/abs/2602.11738
作者: Ilya Kuleshov,Alexander Marusov,Alexey Zaytsev
类目: Machine Learning (cs.LG)
*备注:
Abstract:Probabilistic forecasting of irregularly sampled time series is crucial in domains such as healthcare and finance, yet it remains a formidable challenge. Existing Neural Controlled Differential Equation (Neural CDE) approaches, while effective at modelling continuous dynamics, suffer from slow, inherently sequential computation, which restricts scalability and limits access to global context. We introduce UFO (U-Former ODE), a novel architecture that seamlessly integrates the parallelizable, multiscale feature extraction of U-Nets, the powerful global modelling of Transformers, and the continuous-time dynamics of Neural CDEs. By constructing a fully causal, parallelizable model, UFO achieves a global receptive field while retaining strong sensitivity to local temporal dynamics. Extensive experiments on five standard benchmarks – covering both regularly and irregularly sampled time series – demonstrate that UFO consistently outperforms ten state-of-the-art neural baselines in predictive accuracy. Moreover, UFO delivers up to 15 \times faster inference compared to conventional Neural CDEs, with consistently strong performance on long and highly multivariate sequences.
[LG-47] Dopamine: Brain Modes Not Brains
链接: https://arxiv.org/abs/2602.11726
作者: Shervin Ghasemlou
类目: Machine Learning (cs.LG)
*备注:
Abstract:Parameter-efficient fine-tuning (PEFT) methods such as LoRA adapt large pretrained models by adding small weight-space updates. While effective, weight deltas are hard to interpret mechanistically, and they do not directly expose which internal computations are reused versus bypassed for a new task. We explore an alternative view inspired by neuromodulation: adaptation as a change in mode – selecting and rescaling existing computations – rather than rewriting the underlying weights. We propose Dopamine, a simple activation-space PEFT technique that freezes base weights and learns per-neuron thresholds and gains. During training, a smooth gate decides whether a neuron’s activation participates; at inference the gate can be hardened to yield explicit conditional computation and neuron-level attributions. As a proof of concept, we study “mode specialization” on MNIST (0^\circ) versus rotated MNIST (45^\circ). We pretrain a small MLP on a 50/50 mixture (foundation), freeze its weights, and then specialize to the rotated mode using Dopamine. Across seeds, Dopamine improves rotated accuracy over the frozen baseline while using only a few hundred trainable parameters per layer, and exhibits partial activation sparsity (a minority of units strongly active). Compared to LoRA, Dopamine trades some accuracy for substantially fewer trainable parameters and a more interpretable “which-neurons-fire” mechanism. We discuss limitations, including reduced expressivity when the frozen base lacks features needed for the target mode.
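The mechanism is small enough to sketch directly; below is an illustrative PyTorch module with learnable per-neuron thresholds and gains over a frozen layer, using a smooth gate that can be hardened at inference. The gate temperature and layer sizes are assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class DopamineGate(nn.Module):
    """Learnable per-neuron thresholds and gains over frozen activations."""
    def __init__(self, width, temp=0.1):
        super().__init__()
        self.threshold = nn.Parameter(torch.zeros(width))
        self.gain = nn.Parameter(torch.ones(width))
        self.temp = temp

    def forward(self, a, hard=False):
        gate = torch.sigmoid((a.abs() - self.threshold) / self.temp)
        if hard:                              # explicit conditional computation
            gate = (gate > 0.5).float()       # plus neuron-level attributions
        return self.gain * gate * a

frozen = nn.Linear(784, 128).requires_grad_(False)   # pretrained, frozen base
gate = DopamineGate(128)                             # 2 * 128 trainable parameters
out = gate(frozen(torch.randn(4, 784)))
```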
[LG-48] Potential-energy gating for robust state estimation in bistable stochastic systems
链接: https://arxiv.org/abs/2602.11712
作者: Luigi Simeone
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Chaotic Dynamics (nlin.CD); Data Analysis, Statistics and Probability (physics.data-an); Methodology (stat.ME)
*备注: 20 pages, 8 figures
Abstract:We introduce potential-energy gating, a method for robust state estimation in systems governed by double-well stochastic dynamics. The observation noise covariance of a Bayesian filter is modulated by the local value of a known or assumed potential energy function: observations are trusted when the state is near a potential minimum and progressively discounted as it approaches the barrier separating metastable wells. This physics-based mechanism differs from purely statistical robust filters, which treat all regions of state space identically, and from constrained filters, which impose hard bounds on states rather than modulating observation trust. We implement the gating within Extended, Unscented, Ensemble, and Adaptive Kalman filters and particle filters, requiring only two additional hyperparameters. Synthetic benchmarks on a Ginzburg-Landau double-well process with 10% outlier contamination and Monte Carlo validation over 100 replications show 57-80% RMSE improvement over the standard Extended Kalman Filter, all statistically significant (p < 10^-15, Wilcoxon signed-rank test). A naive topological baseline using only distance to the nearest well achieves 57%, confirming that the continuous energy landscape adds an additional ~21 percentage points. The method is robust to misspecification: even when assumed potential parameters deviate by 50% from their true values, improvement never falls below 47%. Comparing externally forced and spontaneous Kramers-type transitions, gating retains 68% improvement under noise-induced transitions whereas the naive baseline degrades to 30%. As an empirical illustration, we apply the framework to Dansgaard-Oeschger events in the NGRIP delta-18O ice-core record, estimating asymmetry parameter gamma = -0.109 (bootstrap 95% CI: [-0.220, -0.011], excluding zero) and demonstrating that outlier fraction explains 91% of the variance in filter improvement.
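A minimal sketch of the gating idea on a scalar double-well system: a single EKF step in which the observation-noise covariance is inflated by the local potential energy of the predicted state. The exponential modulation and the constant k are illustrative assumptions; the paper only requires observation trust to decrease toward the barrier.

```python
import numpy as np

def V(x):
    return 0.25 * (x**2 - 1.0)**2        # double well: minima at +/-1, barrier at 0

def gated_ekf_step(x, P, z, q=0.01, R0=0.05, k=20.0, dt=0.1):
    """Scalar EKF step whose observation noise R grows with the potential
    energy of the predicted state, discounting measurements near the barrier."""
    x_pred = x + dt * (x - x**3)          # Euler step of the drift -V'(x)
    F = 1.0 + dt * (1.0 - 3.0 * x**2)     # Jacobian of the drift
    P_pred = F * P * F + q
    R = R0 * np.exp(k * V(x_pred))        # the gating: two extra hyperparameters
    K = P_pred / (P_pred + R)
    return x_pred + K * (z - x_pred), (1.0 - K) * P_pred

x, P = gated_ekf_step(1.0, 0.1, z=1.2)    # trusted: predicted state sits in a well
x, P = gated_ekf_step(0.05, 0.1, z=5.0)   # discounted: outlier near the barrier
```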
[LG-49] SpiralFormer: Looped Transformers Can Learn Hierarchical Dependencies via Multi-Resolution Recursion
链接: https://arxiv.org/abs/2602.11698
作者: Chengting Yu,Xiaobo Shu,Yadao Wang,Yizhen Zhang,Haoyi Wu,You Wu,Rujiao Long,Ziheng Chen,Yuchi Xu,Wenbo Su,Bo Zheng
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recursive (looped) Transformers decouple computational depth from parameter depth by repeatedly applying shared layers, providing an explicit architectural primitive for iterative refinement and latent reasoning. However, early looped Transformers often underperform non-recursive baselines of equal compute. While recent literature has introduced more effective recursion mechanisms to mitigate this gap, existing architectures still operate at a fixed, full-token resolution, neglecting the potential efficiency of computing over compressed latent representations. In this paper, we propose SpiralFormer, a looped Transformer that executes recurrence under a multi-resolution recursion schedule. We provide probing evidence that multi-resolution recursion enables the model to learn hierarchical dependencies by inducing iteration-wise functional specialization across different scales. Empirically, SpiralFormer achieves better parameter and compute efficiency than both looped and non-looped baselines across model scales from 160M to 1.4B, establishing sequence resolution as a potential axis for scaling recursive architectures.
[LG-50] LAER-MoE: Load-Adaptive Expert Re-layout for Efficient Mixture-of-Experts Training ASPLOS2026
链接: https://arxiv.org/abs/2602.11686
作者: Xinyi Liu,Yujie Wang,Fangcheng Fu,Xuefeng Xiao,Huixia Li,Jiashi Li,Bin Cui
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 19 pages, 12 figures, the paper will be presented at ASPLOS 2026
Abstract:Expert parallelism is vital for effectively training Mixture-of-Experts (MoE) models, enabling different devices to host distinct experts, with each device processing different input data. However, during expert parallel training, dynamic routing results in significant load imbalance among experts: a handful of overloaded experts hinder overall iteration, emerging as a training bottleneck. In this paper, we introduce LAER-MoE, an efficient MoE training framework. The core of LAER-MoE is a novel parallel paradigm, Fully Sharded Expert Parallel (FSEP), which fully partitions each expert parameter by the number of devices and restores partial experts at expert granularity through All-to-All communication during training. This allows for flexible re-layout of expert parameters during training to enhance load balancing. In particular, we perform fine-grained scheduling of communication operations to minimize communication overhead. Additionally, we develop a load balancing planner to formulate re-layout strategies of experts and routing schemes for tokens during training. We perform experiments on an A100 cluster, and the results indicate that our system achieves up to 1.69x acceleration compared to the current state-of-the-art training systems. Source code available at this https URL.
[LG-51] Explainable Machine-Learning based Detection of Knee Injuries in Runners
链接: https://arxiv.org/abs/2602.11668
作者: David Fuentes-Jiménez,Sara García-de-Villa,David Casillas-Pérez,Pablo Floría,Francisco-Manuel Melgarejo-Meseguer
类目: Machine Learning (cs.LG)
*备注:
Abstract:Running is a widely practiced activity but shows a high incidence of knee injuries, especially Patellofemoral Pain Syndrome (PFPS) and Iliotibial Band Syndrome (ITBS). Identifying gait patterns linked to these injuries can improve clinical decision-making, which requires precise systems capable of capturing and analyzing temporal kinematic data. This study uses optical motion capture systems to enhance detection of injury-related running patterns. We analyze a public dataset of 839 treadmill recordings from healthy and injured runners to evaluate how effectively these systems capture dynamic parameters relevant to injury classification. The focus is on the stance phase, using joint and segment angle time series and discrete point values. Three classification tasks are addressed: healthy vs. injured, healthy vs. PFPS, and healthy vs. ITBS. We examine different feature spaces, from traditional point-based metrics to full stance-phase time series and hybrid representations. Multiple models are tested, including classical algorithms (K-Nearest Neighbors, Gaussian Processes, Decision Trees) and deep learning architectures (CNNs, LSTMs). Performance is evaluated with accuracy, precision, recall, and F1-score. Explainability tools such as Shapley values, saliency maps, and Grad-CAM are used to interpret model behavior. Results show that combining time series with point values substantially improves detection. Deep learning models outperform classical ones, with CNNs achieving the highest accuracy: 77.9% for PFPS, 73.8% for ITBS, and 71.43% for the combined injury class. These findings highlight the potential of motion capture systems coupled with advanced machine learning to identify knee injury-related running patterns.
[LG-52] Fully First-Order Algorithms for Online Bilevel Optimization
链接: https://arxiv.org/abs/2602.11665
作者: Tingkai Jia,Cheng Chen
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:In this work, we study non-convex-strongly-convex online bilevel optimization (OBO). Existing OBO algorithms are mainly based on hypergradient descent, which requires access to a Hessian-vector product (HVP) oracle and potentially incurs high computational costs. By reformulating the original OBO problem as a single-level online problem with inequality constraints and constructing a sequence of Lagrangian functions, we eliminate the need for HVPs arising from implicit differentiation. Specifically, we propose a fully first-order algorithm for OBO, and provide theoretical guarantees showing that it achieves regret of O(1 + V_T + H_{2,T}). Furthermore, we develop an improved variant with an adaptive inner-iteration scheme, which removes the dependence on the drift variation of the inner-level optimal solution and achieves regret of O(\sqrt{T} + V_T). This bound is advantageous when V_T \ge O(\sqrt{T}).
[LG-53] UMAP Is Spectral Clustering on the Fuzzy Nearest-Neighbor Graph
链接: https://arxiv.org/abs/2602.11662
作者: Yang Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:UMAP (Uniform Manifold Approximation and Projection) is among the most widely used algorithms for nonlinear dimensionality reduction and data visualisation. Despite its popularity, and despite being presented through the lens of algebraic topology, the exact relationship between UMAP and classical spectral methods has remained informal. In this work, we prove that UMAP performs spectral clustering on the fuzzy k-nearest-neighbour graph. Our proof proceeds in three steps: (1) we show that UMAP’s stochastic optimisation with negative sampling is a contrastive learning objective on the similarity graph; (2) we invoke the result of HaoChen et al. [8], establishing that contrastive learning on a similarity graph is equivalent to spectral clustering; and (3) we verify that UMAP’s spectral initialisation computes the exact linear solution to this spectral problem. The equivalence is exact for Gaussian kernels, and holds as a first-order approximation for UMAP’s default Cauchy-type kernel. Our result unifies UMAP, contrastive learning, and spectral clustering under a single framework, and provides theoretical grounding for several empirical observations about UMAP’s behaviour.
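Step (3) of the argument can be checked in miniature: UMAP's spectral initialisation amounts to the bottom non-trivial eigenvectors of a graph Laplacian built on the k-nearest-neighbour graph. The sketch below uses an unweighted, symmetrised kNN graph as a simplification of UMAP's fuzzy-membership weights; neighbor count and data are illustrative.

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))

A = kneighbors_graph(X, n_neighbors=15, mode='connectivity')
A = 0.5 * (A + A.T)                        # symmetrise (simplified fuzzy union)
L = laplacian(A, normed=True)
vals, vecs = np.linalg.eigh(L.toarray())   # dense eigensolve is fine at toy scale
embedding = vecs[:, 1:3]                   # skip the trivial bottom eigenvector
```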
[LG-54] Both Topology and Text Matter: Revisiting LLM-guided Out-of-Distribution Detection on Text-attributed Graphs
链接: https://arxiv.org/abs/2602.11641
作者: Yinlin Zhu,Di Wu,Xu Wang,Guocong Quan,Miao Hu
类目: Machine Learning (cs.LG)
*备注: Under Review
Abstract:Text-attributed graphs (TAGs) associate nodes with textual attributes and graph structure, enabling GNNs to jointly model semantic and structural information. While effective on in-distribution (ID) data, GNNs often encounter out-of-distribution (OOD) nodes with unseen textual or structural patterns in real-world settings, leading to overconfident and erroneous predictions in the absence of reliable OOD detection. Early approaches address this issue from a topology-driven perspective, leveraging neighboring structures to mitigate node-level detection bias. However, these methods typically encode node texts as shallow vector features, failing to fully exploit rich semantic information. In contrast, recent LLM-based approaches generate pseudo OOD priors by leveraging textual knowledge, but they suffer from several limitations: (1) a reliability-informativeness imbalance in the synthesized OOD priors, as the generated OOD exposures either deviate from the true OOD semantics or introduce non-negligible ID noise, all of which offers limited improvement to detection performance; (2) reliance on specialized architectures, which prevents incorporation of the extensive, empirically validated topology-level insights from prior work. To this end, we propose LG-Plug, an LLM-Guided Plug-and-play strategy for TAG OOD detection tasks. LG-Plug aligns topology and text representations to produce fine-grained node embeddings, then generates consensus-driven OOD exposure via clustered iterative LLM prompting. Moreover, it leverages a lightweight in-cluster codebook and heuristic sampling to reduce the time cost of LLM querying. The resulting OOD exposure serves as a regularization term to separate ID and OOD nodes, enabling seamless integration with existing detectors.
[LG-55] TIP: Resisting Gradient Inversion via Targeted Interpretable Perturbation in Federated Learning
链接: https://arxiv.org/abs/2602.11633
作者: Jianhua Wang,Yinlin Su
类目: Machine Learning (cs.LG)
*备注:
Abstract:Federated Learning (FL) facilitates collaborative model training while preserving data locality; however, the exchange of gradients renders the system vulnerable to Gradient Inversion Attacks (GIAs), allowing adversaries to reconstruct private training data with high fidelity. Existing defenses, such as Differential Privacy (DP), typically employ indiscriminate noise injection across all parameters, which severely degrades model utility and convergence stability. To address these limitations, we propose Targeted Interpretable Perturbation (TIP), a novel defense framework that integrates model interpretability with frequency-domain analysis. Unlike conventional methods that treat parameters uniformly, TIP introduces a dual-targeting strategy. First, leveraging Gradient-weighted Class Activation Mapping (Grad-CAM) to quantify channel sensitivity, we dynamically identify critical convolution channels that encode primary semantic features. Second, we transform these selected kernels into the frequency domain via the Discrete Fourier Transform and selectively inject calibrated perturbations into the high-frequency spectrum. By selectively perturbing high-frequency components, TIP effectively destroys the fine-grained details necessary for image reconstruction while preserving the low-frequency information crucial for model accuracy. Extensive experiments on benchmark datasets demonstrate that TIP renders reconstructed images visually unrecognizable against state-of-the-art GIAs, while maintaining global model accuracy comparable to non-private baselines, significantly outperforming existing DP-based defenses in the privacy-utility trade-off and interpretability. Code is available at this https URL
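The frequency-domain half of the defense is easy to sketch: below, noise is injected only into the high-frequency spectrum of a convolution kernel while low-frequency content is left intact. The mask cutoff and noise scale are illustrative assumptions, and the Grad-CAM channel-selection step is omitted.

```python
import torch

def perturb_high_freq(kernel, cutoff=0.2, sigma=0.1):
    """Add noise only to the high-frequency spectrum of a conv kernel,
    destroying fine-grained detail while preserving low-frequency content."""
    spec = torch.fft.fft2(kernel)
    h, w = kernel.shape[-2:]
    fy = torch.fft.fftfreq(h).abs()[:, None]
    fx = torch.fft.fftfreq(w).abs()[None, :]
    mask = ((fy + fx) > cutoff).to(spec.dtype)   # high-frequency region only
    noise = sigma * torch.randn(kernel.shape)
    return torch.fft.ifft2(spec + mask * noise).real

critical = torch.randn(3, 3, 5, 5)    # kernels flagged by Grad-CAM sensitivity
protected = perturb_high_freq(critical)
```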
[LG-56] GP2F: Cross-Domain Graph Prompting with Adaptive Fusion of Pre-trained Graph Neural Networks
链接: https://arxiv.org/abs/2602.11629
作者: Dongxiao He,Wenxuan Sun,Yongqi Huang,Jitao Zhao,Di Jin
类目: Machine Learning (cs.LG)
*备注: 16 pages, 8 figures
Abstract:Graph Prompt Learning (GPL) has recently emerged as a promising paradigm for downstream adaptation of pre-trained graph models, mitigating the misalignment between pre-training objectives and downstream tasks. Recently, the focus of GPL has shifted from in-domain to cross-domain scenarios, which is closer to the real world applications, where the pre-training source and downstream target often differ substantially in data distribution. However, why GPLs remain effective under such domain shifts is still unexplored. Empirically, we observe that representative GPL methods are competitive with two simple baselines in cross-domain settings: full fine-tuning (FT) and linear probing (LP), motivating us to explore a deeper understanding of the prompting mechanism. We provide a theoretical analysis demonstrating that jointly leveraging these two complementary branches yields a smaller estimation error than using either branch alone, formally proving that cross-domain GPL benefits from the integration between pre-trained knowledge and task-specific adaptation. Based on this insight, we propose GP2F, a dual-branch GPL method that explicitly instantiates the two extremes: (1) a frozen branch that retains pre-trained knowledge, and (2) an adapted branch with lightweight adapters for task-specific adaptation. We then perform adaptive fusion under topology constraints via a contrastive loss and a topology-consistent loss. Extensive experiments on cross-domain few-shot node and graph classification demonstrate that our method outperforms existing methods.
[LG-57] TreeGrad-Ranker: Feature Ranking via O(L)-Time Gradients for Decision Trees
链接: https://arxiv.org/abs/2602.11623
作者: Weida Li,Yaoliang Yu,Bryan Kian Hsiang Low
类目: Machine Learning (cs.LG)
*备注:
Abstract:We revisit the use of probabilistic values, which include the well-known Shapley and Banzhaf values, to rank features for explaining the local predicted values of decision trees. The quality of feature rankings is typically assessed with the insertion and deletion metrics. Empirically, we observe that co-optimizing these two metrics is closely related to a joint optimization that selects a subset of features to maximize the local predicted value while minimizing it for the complement. However, we theoretically show that probabilistic values are generally unreliable for solving this joint optimization. Therefore, we explore deriving feature rankings by directly optimizing the joint objective. As the backbone, we propose TreeGrad, which computes the gradients of the multilinear extension of the joint objective in O(L) time for decision trees with L leaves; these gradients include weighted Banzhaf values. Building upon TreeGrad, we introduce TreeGrad-Ranker, which aggregates the gradients while optimizing the joint objective to produce feature rankings, and TreeGrad-Shap, a numerically stable algorithm for computing Beta Shapley values with integral parameters. In particular, the feature scores computed by TreeGrad-Ranker satisfy all the axioms uniquely characterizing probabilistic values, except for linearity, which itself leads to the established unreliability. Empirically, we demonstrate that the numerical error of Linear TreeShap can be up to 10^15 times larger than that of TreeGrad-Shap when computing the Shapley value. As a by-product, we also develop TreeProb, which generalizes Linear TreeShap to support all probabilistic values. In our experiments, TreeGrad-Ranker performs significantly better on both insertion and deletion metrics. Our code is available at this https URL.
[LG-58] How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?
链接: https://arxiv.org/abs/2602.11618
作者: Tatsuya Sagawa,Ryosuke Kojima
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Chemical Language Models (CLMs) pre-trained on large-scale molecular data are widely used for molecular property prediction. However, the common belief that increasing training resources such as model size, dataset size, and training compute improves both pretraining loss and downstream task performance has not been systematically validated in the chemical domain. In this work, we evaluate this assumption by pretraining CLMs while scaling training resources and measuring transfer performance across diverse molecular property prediction (MPP) tasks. We find that while pretraining loss consistently decreases with increased training resources, downstream task performance shows limited improvement. Moreover, alternative metrics based on the Hessian or loss landscape also fail to estimate downstream performance in CLMs. We further identify conditions under which downstream performance saturates or degrades despite continued improvements in pretraining metrics, and analyze the underlying task-dependent failure modes through parameter-space visualizations. These results expose a gap between pretraining-based evaluation and downstream performance, and emphasize the need for model selection and evaluation strategies that explicitly account for downstream task characteristics.
[LG-59] SkillRater: Untangling Capabilities in Multimodal Data
链接: https://arxiv.org/abs/2602.11615
作者: Naveen Sahi,Jeremy Dohmann,Armen Aghajanyan,Akshat Shrivastava
类目: Machine Learning (cs.LG)
*备注:
Abstract:Data curation methods typically assign samples a single quality score. We argue this scalar framing is fundamentally limited: when training requires multiple distinct capabilities, a monolithic scorer cannot maximize useful signals for all of them simultaneously. Quality is better understood as multidimensional, with each dimension corresponding to a capability the model must acquire. We introduce SkillRater, a framework that decomposes data filtering into specialized raters - one per capability, each trained via meta-learning on a disjoint validation objective - and composes their scores through a progressive selection rule: at each training stage, a sample is retained if any rater ranks it above a threshold that tightens over time, preserving diversity early while concentrating on high-value samples late. We validate this approach on vision language models, decomposing quality into three capability dimensions: visual understanding, OCR, and STEM reasoning. At 2B parameters, SkillRater improves over unfiltered baselines by 5.63% on visual understanding, 2.00% on OCR, and 3.53% on STEM on held out benchmarks. The learned rater signals are near orthogonal, confirming that the decomposition captures genuinely independent quality dimensions and explaining why it outperforms both unfiltered training and monolithic learned filtering.
[LG-60] Learn from Your Mistakes: Self-Correcting Masked Diffusion Models
链接: https://arxiv.org/abs/2602.11590
作者: Yair Schiff,Omer Belhasin,Roy Uziel,Guanghan Wang,Marianne Arriola,Gilad Turok,Michael Elad,Volodymyr Kuleshov
类目: Machine Learning (cs.LG)
*备注:
Abstract:Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models, enabling parallel token generation while achieving competitive performance. Despite these advantages, MDMs face a fundamental limitation: once tokens are unmasked, they remain fixed, leading to error accumulation and ultimately degrading sample quality. We address this by proposing a framework that trains a model to perform both unmasking and correction. By reusing outputs from the MDM denoising network as inputs for corrector training, we train a model to recover from potential mistakes. During generation we apply additional corrective refinement steps between unmasking ones in order to change decoded tokens and improve outputs. We name our training and sampling method Progressive Self-Correction (ProSeCo) for its unique ability to iteratively refine an entire sequence, including already generated tokens. We conduct extensive experimental validation across multiple conditional and unconditional tasks, demonstrating that ProSeCo yields better quality-efficiency trade-offs (up to ~2-3x faster sampling) and enables inference-time compute scaling to further increase sample quality beyond standard MDMs (up to ~1.3x improvement on benchmarks).
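A minimal sketch of the unmask-then-correct sampling loop described above, with a random stand-in for the denoising network; the schedule, confidence rule, and function names are our assumptions, not the ProSeCo implementation.

```python
import numpy as np

MASK = -1
rng = np.random.default_rng(0)

def dummy_denoiser(tokens, vocab=10):
    """Stand-in for the MDM denoising network: per-position logits."""
    return rng.normal(size=(len(tokens), vocab))

def prosecolike_sample(length=12, vocab=10, steps=4, correct_every=1):
    """Sketch of unmask-then-correct sampling (our reading of the abstract)."""
    seq = np.full(length, MASK)
    per_step = length // steps
    for s in range(steps):
        # Unmasking step: commit the most confident masked positions.
        logits = dummy_denoiser(seq, vocab)
        conf = logits.max(axis=1)
        masked = np.where(seq == MASK)[0]
        pick = masked[np.argsort(conf[masked])[-per_step:]]
        seq[pick] = logits[pick].argmax(axis=1)

        # Corrective refinement: re-predict already-decoded tokens and
        # overwrite the least confident one, mimicking self-correction.
        if s % correct_every == 0:
            logits = dummy_denoiser(seq, vocab)
            decoded = np.where(seq != MASK)[0]
            worst = decoded[np.argmin(logits[decoded].max(axis=1))]
            seq[worst] = logits[worst].argmax()
    return seq

print(prosecolike_sample())
```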
[LG-61] Brain4FMs: A Benchmark of Foundation Models for Electrical Brain Signal
链接: https://arxiv.org/abs/2602.11558
作者: Fanqi Shen,Enhong Yang,Jiahe Li,Junru Hong,Xiaoran Pan,Zhizhang Yuan,Meng Li,Yang Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Brain Foundation Models (BFMs) are transforming neuroscience by enabling scalable and transferable learning from neural signals, advancing both clinical diagnostics and cutting-edge neuroscience exploration. Their emergence is powered by large-scale clinical recordings, particularly electroencephalography (EEG) and intracranial EEG, which provide rich temporal and spatial representations of brain dynamics. However, despite their rapid proliferation, the field lacks a unified understanding of existing methodologies and a standardized evaluation framework. To fill this gap, we map the benchmark design space along two axes: (i) from the model perspective, we organize BFMs under a self-supervised learning (SSL) taxonomy; and (ii) from the dataset perspective, we summarize common downstream tasks and curate representative public datasets across clinical and human-centric neurotechnology applications. Building on this consolidation, we introduce Brain4FMs, an open evaluation platform with plug-and-play interfaces that integrates 15 representative BFMs and 18 public datasets. It enables standardized comparisons and analysis of how pretraining data, SSL strategies, and architectures affect generalization and downstream performance, guiding more accurate and transferable BFMs. The code is available at this https URL.
[LG-62] The Implicit Bias of Steepest Descent with Mini-batch Stochastic Gradient
链接: https://arxiv.org/abs/2602.11557
作者: Jichu Li,Xuan Tang,Difan Zou
类目: Machine Learning (cs.LG)
*备注:
Abstract:A variety of widely used optimization methods like SignSGD and Muon can be interpreted as instances of steepest descent under different norm-induced geometries. In this work, we study the implicit bias of mini-batch stochastic steepest descent in multi-class classification, characterizing how batch size, momentum, and variance reduction shape the limiting max-margin behavior and convergence rates under general entry-wise and Schatten-p norms. We show that without momentum, convergence only occurs with large batches, yielding a batch-dependent margin gap but the full-batch convergence rate. In contrast, momentum enables small-batch convergence through a batch-momentum trade-off, though it slows convergence. This approach provides fully explicit, dimension-free rates that improve upon prior results. Moreover, we prove that variance reduction can recover the exact full-batch implicit bias for any batch size, albeit at a slower convergence rate. Finally, we further investigate batch-size-one steepest descent without momentum, and reveal its convergence to a fundamentally different bias via a concrete data example, which reveals a key limitation of purely stochastic updates. Overall, our unified analysis clarifies when stochastic optimization aligns with full-batch behavior, and paves the way for deeper explorations of the training behavior of stochastic gradient steepest descent algorithms.
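As background for the setting studied here, SignSGD is exactly steepest descent under the l-infinity norm. The sketch below shows mini-batch SignSGD with a momentum buffer, the two ingredients whose interplay the paper analyzes; the toy problem and hyperparameters are illustrative only.

```python
import numpy as np

def signsgd_momentum(grad_fn, w, data, lr=0.01, beta=0.9, batch=8, steps=200, rng=None):
    """Mini-batch SignSGD with momentum: steepest descent under the l-inf norm.
    grad_fn(w, batch_of_data) must return a stochastic gradient."""
    rng = np.random.default_rng() if rng is None else rng
    m = np.zeros_like(w)
    for _ in range(steps):
        idx = rng.choice(len(data), size=batch, replace=False)
        g = grad_fn(w, data[idx])
        m = beta * m + (1 - beta) * g   # momentum buffer smooths batch noise
        w = w - lr * np.sign(m)         # sign step: l-inf steepest descent
    return w

# Toy logistic loss with all labels = +1 (illustrative only).
X = np.random.default_rng(1).normal(size=(64, 2)) + np.array([1.0, 1.0])

def grad_fn(w, batch):
    s = 1.0 / (1.0 + np.exp(batch @ w))
    return -(batch * s[:, None]).mean(axis=0)

print(signsgd_momentum(grad_fn, np.zeros(2), X))  # moves toward the data mean direction
```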
[LG-63] Real-Time Proactive Anomaly Detection via Forward and Backward Forecast Modeling
链接: https://arxiv.org/abs/2602.11539
作者: Luis Olmos,Rashida Hasan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reactive anomaly detection methods, which are commonly deployed to identify anomalies after they occur based on observed deviations, often fall short in applications that demand timely intervention, such as industrial monitoring, finance, and cybersecurity. Proactive anomaly detection, by contrast, aims to detect early warning signals before failures fully manifest, but existing methods struggle with handling heterogeneous multivariate data and maintaining precision under noisy or unpredictable conditions. In this work, we introduce two proactive anomaly detection frameworks: the Forward Forecasting Model (FFM) and the Backward Reconstruction Model (BRM). Both models leverage a hybrid architecture combining Temporal Convolutional Networks (TCNs), Gated Recurrent Units (GRUs), and Transformer encoders to model directional temporal dynamics. FFM forecasts future sequences to anticipate disruptions, while BRM reconstructs recent history from future context to uncover early precursors. Anomalies are flagged based on forecasting error magnitudes and directional embedding discrepancies. Our models support both continuous and discrete multivariate features, enabling robust performance in real-world settings. Extensive experiments on four benchmark datasets, MSL, SMAP, SMD, and PSM, demonstrate that FFM and BRM outperform state-of-the-art baselines across detection metrics and significantly improve the timeliness of anomaly anticipation. These properties make our approach well-suited for deployment in time-sensitive domains requiring proactive monitoring.
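A minimal sketch of the forecast-error flagging rule, a simplification of the FFM side of the framework; the directional-embedding criterion and the TCN/GRU/Transformer forecaster are omitted, and the rolling threshold is our assumption.

```python
import numpy as np

def flag_anomalies(y_true, y_forecast, window=50, k=3.0):
    """Flag step t as anomalous when its forecast error exceeds a rolling
    mean + k*std threshold computed on past errors only (causal rule)."""
    err = np.abs(y_true - y_forecast)
    flags = np.zeros(len(err), dtype=bool)
    for t in range(window, len(err)):
        mu, sd = err[t - window:t].mean(), err[t - window:t].std()
        flags[t] = err[t] > mu + k * sd
    return flags

rng = np.random.default_rng(0)
t = np.linspace(0, 20, 400)
y = np.sin(t) + rng.normal(0, 0.05, 400)
yhat = np.sin(t)          # stand-in forecaster
y[300] += 1.5             # inject an anomaly
print(np.where(flag_anomalies(y, yhat))[0])  # contains index 300
```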
[LG-64] PASCAL: A Phase-Aware Scheduling Algorithm for Serving Reasoning-based Large Language Models HPCA-32
链接: https://arxiv.org/abs/2602.11530
作者: Eunyeong Cho,Jehyeon Bang,Ranggi Hwang,Minsoo Rhu
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: Accepted for publication at the 32nd IEEE International Symposium on High-Performance Computer Architecture (HPCA-32), 2026
Abstract:The emergence of reasoning-based LLMs leveraging Chain-of-Thought (CoT) inference introduces new serving challenges, as their extended reasoning phases delay user-visible output and inflate Time-To-First-Token (TTFT). Existing LLM serving frameworks fail to distinguish between reasoning and answering phases, leading to performance degradation under GPU memory constraints. We present PASCAL, a phase-aware scheduling algorithm that prioritizes reasoning to reduce TTFT while using controlled preemption and token pacing during answering to preserve Quality-of-Experience (QoE). Our hierarchical scheduler combines instance-level placement with intra-instance execution and enables dynamic migration at phase boundaries to balance load and reduce interference. Across benchmarks using DeepSeek-R1-Distill-Qwen-32B, PASCAL reduces tail TTFT by up to 72% while maintaining answering phase SLO attainment, demonstrating the importance of phase-aware scheduling for reasoning-based LLM deployment.
[LG-65] Unifying Stable Optimization and Reference Regularization in RLHF ICLR2026
链接: https://arxiv.org/abs/2602.11523
作者: Li He,Qiang Qu,He Zhao,Stephen Wan,Dadong Wang,Lina Yao,Tongliang Liu
类目: Machine Learning (cs.LG)
*备注: ICLR 2026
Abstract:Reinforcement Learning from Human Feedback (RLHF) has advanced alignment capabilities significantly but remains hindered by two core challenges: reward hacking and stable optimization. Current solutions independently address these issues through separate regularization strategies, specifically a KL-divergence penalty against a supervised fine-tuned model ( \pi_0 ) to mitigate reward hacking, and policy ratio clipping towards the current policy ( \pi_t ) to promote stable alignment. However, the implicit trade-off arising from simultaneously regularizing towards both \pi_0 and \pi_t remains under-explored. In this paper, we introduce a unified regularization approach that explicitly balances the objectives of preventing reward hacking and maintaining stable policy updates. Our simple yet principled alignment objective yields a weighted supervised fine-tuning loss with a superior trade-off, which demonstrably improves both alignment results and implementation complexity. Extensive experiments across diverse benchmarks validate that our method consistently outperforms RLHF and online preference learning methods, achieving enhanced alignment performance and stability.
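For orientation, the sketch below combines the two standard regularizers the paper unifies: PPO-style ratio clipping toward \pi_t and a sample-based KL penalty toward \pi_0. It is explicitly not the paper's derived weighted-SFT objective, and all names and coefficients are illustrative.

```python
import torch

def dual_regularized_loss(logp_new, logp_old, logp_ref, advantages,
                          clip_eps=0.2, kl_coef=0.05):
    """Combines the two regularizers the paper unifies (illustrative only):
    - ratio clipping toward the current policy pi_t (stability),
    - KL penalty toward the reference pi_0 (anti reward-hacking).
    Inputs are per-token log-probs of the sampled actions."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    pg_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    kl_pen = (logp_new - logp_ref).mean()  # crude sample-based KL estimate
    return pg_loss + kl_coef * kl_pen

lp_new = torch.randn(32, requires_grad=True)
loss = dual_regularized_loss(lp_new, lp_new.detach() + 0.1,
                             lp_new.detach() - 0.2, torch.randn(32))
loss.backward()
print(float(loss))
```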
[LG-66] Calibration and Evaluation of Car-Following Models for Autonomous Shuttles Using a Novel Multi-Criteria Framework
链接: https://arxiv.org/abs/2602.11517
作者: Renan Favero,Lily Elefteriadou
类目: Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注:
Abstract:Autonomous shuttles (AS) are fully autonomous transit vehicles with operating characteristics distinct from conventional autonomous vehicles (AV). Developing dedicated car-following models for AS is critical to understanding their traffic impacts; however, few studies have calibrated such models with field data. More advanced machine learning (ML) techniques have not yet been applied to AS trajectories, leaving the potential of ML for capturing AS dynamics unexplored and constraining the development of dedicated AS models. Furthermore, there is a lack of a unified framework for systematically evaluating and comparing the performance of car-following models in replicating real trajectories. Existing car-following studies often rely on disparate metrics, which limits reproducibility and performance comparability. This study addresses these gaps through two main contributions: (1) the calibration of a diverse set of car-following models using real-world AS trajectory data, including eight machine learning algorithms and two physics-based models; and (2) the introduction of a multi-criteria evaluation framework that integrates measures of prediction accuracy, trajectory stability, and statistical similarity, which provides a generalizable methodology for a systematic assessment of car-following models. Results indicated that the proposed calibrated XGBoost model achieved the best overall performance. Sequential model types, such as LSTM and CNN, captured long-term positional stability but were less responsive to short-term dynamics. Traditional models (IDM, ACC) and kernel methods showed lower accuracy and stability than most ML models tested.
[LG-67] Calibrating an Imperfect Auxiliary Predictor for Unobserved No-Purchase Choice
链接: https://arxiv.org/abs/2602.11505
作者: Jiangkai Xiong,Kalyan Talluri,Hanzhao Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Firms typically cannot observe key consumer actions: whether customers buy from a competitor, choose not to buy, or even fully consider the firm’s offer. This missing outside-option information makes market-size and preference estimation difficult even in simple multinomial logit (MNL) models, and it is a central obstacle in practice when only transaction data are recorded. Existing approaches often rely on auxiliary market-share, aggregated, or cross-market data. We study a complementary setting in which a black-box auxiliary predictor provides outside-option probabilities, but is potentially biased or miscalibrated because it was trained in a different channel, period, or population, or produced by an external machine-learning system. We develop calibration methods that turn such imperfect predictions into statistically valid no-purchase estimates using purchase-only data from the focal environment. First, under affine miscalibration in logit space, we show that a simple regression identifies outside-option utility parameters and yields consistent recovery of no-purchase probabilities without collecting new labels for no-purchase events. Second, under a weaker nearly monotone condition, we propose a rank-based calibration method and derive finite-sample error bounds that cleanly separate auxiliary-predictor quality from first-stage utility-learning error over observed in-set choices. Our analysis also translates estimation error into downstream decision quality for assortment optimization, quantifying how calibration accuracy affects revenue performance. The bounds provide explicit dependence on predictor alignment and utility-learning error, clarifying when each source dominates. Numerical experiments demonstrate improvements in no-purchase estimation and downstream assortment decisions, and we discuss robust aggregation extensions for combining multiple auxiliary predictors.
[LG-68] A Generic Framework for Fair Consensus Clustering in Streams AAMAS2026
链接: https://arxiv.org/abs/2602.11500
作者: Diptarka Chakraborty,Kushagra Chatterjee,Debarati Das,Tien-Long Nguyen
类目: Machine Learning (cs.LG)
*备注: Accepted in AAMAS 2026
Abstract:Consensus clustering seeks to combine multiple clusterings of the same dataset, potentially derived by considering various non-sensitive attributes by different agents in a multi-agent environment, into a single partitioning that best reflects the overall structure of the underlying dataset. Recent work by Chakraborty et al. introduced a fair variant under proportionate fairness and obtained a constant-factor approximation by naively selecting the closest fair input clustering; however, their offline approach requires storing all input clusterings, which is prohibitively expensive for most large-scale applications. In this paper, we initiate the study of fair consensus clustering in the streaming model, where input clusterings arrive sequentially and memory is limited. We design the first constant-factor algorithm that processes the stream while storing only a logarithmic number of inputs. En route, we introduce a new generic algorithmic framework that integrates closest fair clustering with cluster fitting, yielding improved approximation guarantees not only in the streaming setting but also when revisited offline. Furthermore, the framework is fairness-agnostic: it applies to any fairness definition for which an approximately close fair clustering can be computed efficiently. Finally, we extend our methods to the more general k-median consensus clustering problem.
[LG-69] Partial GFlowNet: Accelerating Convergence in Large State Spaces via Strategic Partitioning
链接: https://arxiv.org/abs/2602.11498
作者: Xuan Yu,Xu Wang,Rui Zhu,Yudong Zhang,Yang Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Generative Flow Networks (GFlowNets) have shown promising potential to generate high-scoring candidates with probability proportional to their rewards. As existing GFlowNets explore freely in the state space, they encounter significant convergence challenges when scaling to large state spaces. Addressing this issue, this paper proposes to restrict the exploration of the actor. A planner is introduced to partition the entire state space into overlapping partial state spaces. Given their limited size, these partial state spaces allow the actor to efficiently identify subregions with higher rewards. A heuristic strategy is introduced to switch partial regions, thus preventing the actor from wasting time exploring fully explored or low-reward partial regions. By iteratively exploring these partial state spaces, the actor learns to converge towards the high-reward subregions within the entire state space. Experiments on several widely used datasets demonstrate that Partial GFlowNet converges faster than existing works on large state spaces. Furthermore, Partial GFlowNet not only generates candidates with higher rewards but also significantly improves their diversity.
[LG-70] Exploring Multiple High-Scoring Subspaces in Generative Flow Networks
链接: https://arxiv.org/abs/2602.11491
作者: Xuan Yu,Xu Wang,Rui Zhu,Yudong Zhang,Yang Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:As a probabilistic sampling framework, Generative Flow Networks (GFlowNets) show strong potential for constructing complex combinatorial objects through the sequential composition of elementary components. However, existing GFlowNets often suffer from excessive exploration over vast state spaces, leading to over-sampling of low-reward regions and convergence to suboptimal distributions. Effectively biasing GFlowNets toward high-reward solutions remains a non-trivial challenge. In this paper, we propose CMAB-GFN, which integrates a combinatorial multi-armed bandit (CMAB) framework with GFlowNet policies. The CMAB component prunes low-quality actions, yielding compact high-scoring subspaces for exploration. Restricting GFNs to these compact high-scoring subspaces accelerates the discovery of high-value candidates, while the exploration of different subspaces ensures that diversity is not sacrificed. Experimental results on multiple tasks demonstrate that CMAB-GFN generates higher-reward candidates than existing approaches.
[LG-71] External Division of Two Bregman Proximity Operators for Poisson Inverse Problems
链接: https://arxiv.org/abs/2602.11482
作者: Kazuki Haishima,Kyohei Suzuki,Konstantinos Slavakis
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper presents a novel method for recovering sparse vectors from linear models corrupted by Poisson noise. The contribution is twofold. First, an operator defined via the external division of two Bregman proximity operators is introduced to promote sparse solutions while mitigating the estimation bias induced by classical \ell_1 -norm regularization. This operator is then embedded into the already established NoLips algorithm, replacing the standard Bregman proximity operator in a plug-and-play manner. Second, the geometric structure of the proposed external-division operator is elucidated through two complementary reformulations, which provide clear interpretations in terms of the primal and dual spaces of the Poisson inverse problem. Numerical tests show that the proposed method exhibits more stable convergence behavior than conventional Kullback-Leibler (KL)-based approaches and achieves significantly superior performance on synthetic data and an image restoration problem.
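To illustrate the external-division idea in the simplest (Euclidean, not Bregman) setting, the sketch below extrapolates through a weak soft-threshold away from a strong one, which removes the \ell_1 shrinkage bias on large entries; the choice of extrapolation weight is our assumption, and the paper's KL-geometry operators are not reproduced here.

```python
import numpy as np

def soft(x, lam):
    """Euclidean prox of lam*||.||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def external_division_prox(x, lam1=0.5, lam2=1.0):
    """External division of two prox outputs (requires lam2 > lam1):
    extrapolate through the weaker-threshold prox, away from the stronger
    one. theta is chosen so the operator is unbiased for large |x|."""
    theta = lam1 / (lam2 - lam1)
    p1, p2 = soft(x, lam1), soft(x, lam2)
    return (1 + theta) * p1 - theta * p2

x = np.linspace(-3, 3, 7)
print(soft(x, 0.5))               # biased: shrinks every large entry by 0.5
print(external_division_prox(x))  # large entries pass through unshrunk
```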
[LG-72] PRISM: A 3D Probabilistic Neural Representation for Interpretable Shape Modeling
链接: https://arxiv.org/abs/2602.11467
作者: Yining Jiao,Sreekalyani Bhamidi,Carlton Jude Zdanski,Julia S Kimbell,Andrew Prince,Cameron P Worden,Samuel Kirse,Christopher Rutter,Benjamin H Shields,Jisan Mahmud,Marc Niethammer
类目: Machine Learning (cs.LG)
*备注: 22 pages
Abstract:Understanding how anatomical shapes evolve in response to developmental covariates and quantifying their spatially varying uncertainties is critical in healthcare research. Existing approaches typically rely on global time-warping formulations that ignore spatially heterogeneous dynamics. We introduce PRISM, a novel framework that bridges implicit neural representations with uncertainty-aware statistical shape analysis. PRISM models the conditional distribution of shapes given covariates, providing spatially continuous estimates of both the population mean and covariate-dependent uncertainty at arbitrary locations. A key theoretical contribution is a closed-form Fisher Information metric that enables efficient, analytically tractable local temporal uncertainty quantification via automatic differentiation. Experiments on three synthetic datasets and one clinical dataset demonstrate PRISM’s strong performance across diverse tasks within a unified framework, while providing interpretable and clinically meaningful uncertainty estimates.
[LG-73] Assessing Low Back Movement with Motion Tape Sensor Data Through Deep Learning
链接: https://arxiv.org/abs/2602.11465
作者: Jared Levy,Aarti Lalwani,Elijah Wyckoff,Kenneth J. Loh,Sara P. Gombatto,Rose Yu,Emilia Farcas
类目: Machine Learning (cs.LG)
*备注:
Abstract:Back pain is a pervasive issue affecting a significant portion of the population, often worsened by certain movements of the lower back. Assessing these movements is important for helping clinicians prescribe appropriate physical therapy. However, it can be difficult to monitor patients’ movements remotely outside the clinic. High-fidelity data from motion capture sensors can be used to classify different movements, but these sensors are costly and impractical for use in free-living environments. Motion Tape (MT), a new fabric-based wearable sensor, addresses these issues by being low cost and portable. Despite these advantages, the sensor's novelty and variability in its stability make the MT dataset small-scale and noisy. In this work, we propose the Motion-Tape Augmentation Inference Model (MT-AIM), a deep learning classification pipeline trained on MT data. To address the challenges of limited sample size and noise in the MT dataset, MT-AIM leverages conditional generative models to generate synthetic MT data for a desired movement, as well as predicting joint kinematics as additional features. This combination of synthetic data generation and feature augmentation enables MT-AIM to achieve state-of-the-art accuracy in classifying lower back movements, bridging the gap between physiological sensing and movement analysis.
[LG-74] Adaptive Power Iteration Method for Differentially Private PCA
链接: https://arxiv.org/abs/2602.11454
作者: Ta Duy Nguyem,Alina Ene,Huy Le Nguyen
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:
Abstract:We study (\epsilon,\delta) -differentially private algorithms for the problem of approximately computing the top singular vector of a matrix A\in\mathbb{R}^{n\times d} where each row of A is a datapoint in \mathbb{R}^d . In our privacy model, neighboring inputs differ by one single row/datapoint. We study the private variant of the power iteration method, which is widely adopted in practice. Our algorithm is based on a filtering technique which adapts to the coherence parameter of the input matrix. This technique provides a utility that goes beyond the worst-case guarantees for matrices with low coherence parameter. Our work departs from and complements the work by Hardt-Roth (STOC 2013) which designed a private power iteration method for the privacy model where neighboring inputs differ in one single entry by at most 1.
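The standard template such private power methods build on is noisy power iteration; a minimal sketch follows. Calibrating the noise scale sigma to the actual (\epsilon,\delta) budget and the paper's coherence-adaptive filtering are omitted, so this is background rather than the proposed algorithm.

```python
import numpy as np

def noisy_power_method(A, iters=30, sigma=0.05, rng=None):
    """Noisy power iteration for the top right singular vector of A.
    Gaussian noise per iteration is the usual template for DP power
    methods; sigma must be calibrated to the privacy budget in practice."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = A.shape
    v = rng.normal(size=d)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = A.T @ (A @ v) + rng.normal(0.0, sigma, size=d)  # noisy covariance step
        v /= np.linalg.norm(v)
    return v

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 10))
A[:, 0] *= 5                      # plant a dominant direction
v = noisy_power_method(A, rng=rng)
print(np.abs(v[0]))               # close to 1: aligned with the planted direction
```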
[LG-75] Multi-Level Strategic Classification: Incentivizing Improvement through Promotion and Relegation Dynamics
链接: https://arxiv.org/abs/2602.11439
作者: Ziyuan Huang,Lina Alkarmi,Mingyan Liu
类目: Machine Learning (cs.LG)
*备注: Preprint. 8 pages (8 figures) plus appendix
Abstract:Strategic classification studies the problem where self-interested individuals or agents manipulate their response to obtain favorable decision outcomes made by classifiers, typically turning to dishonest actions when they are less costly than genuine efforts. While existing studies on sequential strategic classification primarily focus on optimizing dynamic classifier weights, we depart from these weight-centric approaches by analyzing the design of classifier thresholds and difficulty progression within a multi-level promotion-relegation framework. Our model captures the critical inter-temporal incentives driven by an agent’s farsightedness, skill retention, and a leg-up effect where qualification and attainment can be self-reinforcing. We characterize the agent’s optimal long-term strategy and demonstrate that a principal can design a sequence of thresholds to effectively incentivize honest effort. Crucially, we prove that under mild conditions, this mechanism enables agents to reach arbitrarily high levels solely through genuine improvement efforts.
[LG-76] Surface impedance inference via neural fields and sparse acoustic data obtained by a compact array
链接: https://arxiv.org/abs/2602.11425
作者: Yuanxin Xia,Xinyan Li,Matteo Calafà,Allan P. Engsig-Karup,Cheol-Ho Jeong
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注:
Abstract:Standardized laboratory characterizations for absorbing materials rely on idealized sound field assumptions, which deviate substantially from real-life conditions. Consequently, in-situ acoustic characterization has become essential for accurate diagnosis and virtual prototyping. We propose a physics-informed neural field that reconstructs local, near-surface broadband sound fields from sparse pressure samples to directly infer complex surface impedance. A parallel, multi-frequency architecture enables broadband impedance retrieval within runtimes on the order of seconds to minutes. To validate the method, we developed a compact microphone array with low hardware complexity. Numerical verifications and laboratory experiments demonstrate accurate impedance retrieval with a small number of sensors under realistic conditions. We further showcase the approach in a vehicle cabin to provide practical guidance on measurement locations that avoid strong interference. Here, we show that this approach offers a robust means of characterizing in-situ boundary conditions for architectural and automotive acoustics.
[LG-77] Optimizing Agent Planning for Security and Autonomy
链接: https://arxiv.org/abs/2602.11416
作者: Aashish Kolluri,Rishi Sharma,Manuel Costa,Boris Köpf,Tobias Nießen,Mark Russinovich,Shruti Tople,Santiago Zanella-Béguelin
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 33 pages, 6 figures
Abstract:Indirect prompt injection attacks threaten AI agents that execute consequential actions, motivating deterministic system-level defenses. Such defenses can provably block unsafe actions by enforcing confidentiality and integrity policies, but currently appear costly: they reduce task completion rates and increase token usage compared to probabilistic defenses. We argue that existing evaluations miss a key benefit of system-level defenses: reduced reliance on human oversight. We introduce autonomy metrics to quantify this benefit: the fraction of consequential actions an agent can execute without human-in-the-loop (HITL) approval while preserving security. To increase autonomy, we design a security-aware agent that (i) introduces richer HITL interactions, and (ii) explicitly plans for both task progress and policy compliance. We implement this agent design atop an existing information-flow control defense against prompt injection and evaluate it on the AgentDojo and WASP benchmarks. Experiments show that this approach yields higher autonomy without sacrificing utility.
[LG-78] TimeSynth: A Framework for Uncovering Systematic Biases in Time Series Forecasting
链接: https://arxiv.org/abs/2602.11413
作者: Md Rakibul Haque,Vishwa Goudar,Shireen Elhabian,Warren Woodrich Pettine
类目: Machine Learning (cs.LG)
*备注:
Abstract:Time series forecasting is a fundamental tool with wide-ranging applications, yet recent debates question whether complex nonlinear architectures truly outperform simple linear models. Prior claims of linear-model dominance often stem from benchmarks that lack diverse temporal dynamics and employ biased evaluation protocols. We revisit this debate through TimeSynth, a structured framework that emulates key properties of real-world time series, including non-stationarity, periodicity, trends, and phase modulation, by creating synthesized signals whose parameters are derived from real-world time series. Evaluating four model families (Linear, Multi-Layer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), and Transformers), we find a systematic bias in linear models: they collapse to simple oscillation regardless of signal complexity. Nonlinear models avoid this collapse and gain clear advantages as signal complexity increases. Notably, Transformer- and CNN-based models exhibit slightly greater adaptability to complex modulated signals than MLPs. Beyond clean forecasting, the framework highlights robustness differences under distribution and noise shifts and removes biases of prior benchmarks by using independent instances for train, test, and validation for each signal family. Collectively, TimeSynth provides a principled foundation for understanding when different forecasting approaches succeed or fail, moving beyond oversimplified claims of model equivalence.
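A toy generator in the spirit described above, combining a trend, a periodic component with random phase drift, and non-stationary noise, plus independent splits per signal; the parameterization is our assumption, not TimeSynth's actual recipe.

```python
import numpy as np

def synth_signal(n=1024, rng=None):
    """Hypothetical TimeSynth-style signal (parameterization is ours):
    trend + phase-modulated periodicity + noise with growing variance."""
    rng = np.random.default_rng() if rng is None else rng
    t = np.arange(n) / n
    trend = 0.8 * t + 0.3 * t**2
    phase = 2 * np.pi * np.cumsum(rng.normal(0, 0.02, n))  # random phase drift
    periodic = np.sin(2 * np.pi * 8 * t + phase)
    noise = rng.normal(0, 0.1 + 0.2 * t, n)                # non-stationary noise
    return trend + periodic + noise

x = synth_signal()
train, val, test = x[:700], x[700:860], x[860:]  # independent splits per signal
print(train.shape, val.shape, test.shape)
```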
[LG-79] CADET: Context-Conditioned Ads CTR Prediction With a Decoder-Only Transformer
链接: https://arxiv.org/abs/2602.11410
作者: David Pardoe,Neil Daftary,Miro Furtado,Aditya Aiyer,Yu Wang,Liuqing Li,Tao Song,Lars Hertel,Young Jin Yun,Senthil Radhakrishnan,Zhiwei Wang,Tommy Li,Khai Tran,Ananth Nagarajan,Ali Naqvi,Yue Zhang,Renpeng Fang,Avi Romascanu,Arjun Kulothungun,Deepak Kumar,Praneeth Boda,Fedor Borisyuk,Ruoyan Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Click-through rate (CTR) prediction is fundamental to online advertising systems. While Deep Learning Recommendation Models (DLRMs) with explicit feature interactions have long dominated this domain, recent advances in generative recommenders have shown promising results in content recommendation. However, adapting these transformer-based architectures to ads CTR prediction still presents unique challenges, including handling post-scoring contextual signals, maintaining offline-online consistency, and scaling to industrial workloads. We present CADET (Context-Conditioned Ads Decoder-Only Transformer), an end-to-end decoder-only transformer for ads CTR prediction deployed at LinkedIn. Our approach introduces several key innovations: (1) a context-conditioned decoding architecture with multi-tower prediction heads that explicitly model post-scoring signals such as ad position, resolving the chicken-and-egg problem between predicted CTR and ranking; (2) a self-gated attention mechanism that stabilizes training by adaptively regulating information flow at both representation and interaction levels; (3) a timestamp-based variant of Rotary Position Embedding (RoPE) that captures temporal relationships across timescales from seconds to months; (4) session masking strategies that prevent the model from learning dependencies on unavailable in-session events, addressing train-serve skew; and (5) production engineering techniques including tensor packing, sequence chunking, and custom Flash Attention kernels that enable efficient training and serving at scale. In online A/B testing, CADET achieves a 11.04% CTR lift compared to the production LiRank baseline model, a hybrid ensemble of DCNv2 and sequential encoders. The system has been successfully deployed on LinkedIn’s advertising platform, serving the main traffic for homefeed sponsored updates.
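To illustrate innovation (3), here is a sketch of rotary embeddings driven by raw timestamps rather than integer positions; the frequency scaling and the interface are our assumptions, not CADET's implementation.

```python
import torch

def timestamp_rope(x, timestamps, base=10000.0):
    """Rotate feature pairs by angles proportional to event timestamps
    (e.g., seconds since a reference time) instead of integer positions.
    x: [seq, dim] with even dim; timestamps: [seq] in seconds."""
    seq, dim = x.shape
    inv_freq = base ** (-torch.arange(0, dim, 2).float() / dim)  # [dim/2]
    angles = timestamps[:, None] * inv_freq[None, :]             # [seq, dim/2]
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(5, 8)
ts = torch.tensor([0.0, 2.5, 3.0, 60.0, 86400.0])  # seconds up to a day apart
print(timestamp_rope(q, ts).shape)
```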
[LG-80] Provably Efficient Algorithms for S- and Non-Rectangular Robust MDPs with General Parameterization
链接: https://arxiv.org/abs/2602.11387
作者: Anirudh Satheesh,Ziyi Chen,Furong Huang,Heng Huang
类目: Machine Learning (cs.LG)
*备注: 30 pages
Abstract:We study robust Markov decision processes (RMDPs) with general policy parameterization under s-rectangular and non-rectangular uncertainty sets. Prior work is largely limited to tabular policies, and hence either lacks sample complexity guarantees or incurs high computational cost. Our method reduces average-reward RMDPs to entropy-regularized discounted robust MDPs, restoring strong duality and enabling tractable equilibrium computation. We prove novel Lipschitz and Lipschitz-smoothness properties for general policy parameterizations that extend to infinite state spaces. To address infinite-horizon gradient estimation, we introduce a multilevel Monte Carlo gradient estimator with \tilde{\mathcal{O}}(\epsilon^{-2}) sample complexity, a factor of \mathcal{O}(\epsilon^{-2}) improvement over prior work. Building on this, we design a projected gradient descent algorithm for s-rectangular uncertainty ( \mathcal{O}(\epsilon^{-5}) ) and a Frank–Wolfe algorithm for non-rectangular uncertainty ( \mathcal{O}(\epsilon^{-4}) discounted, \mathcal{O}(\epsilon^{-10.5}) average reward), significantly improving prior results in both the discounted and average-reward settings. Our work is the first to provide sample complexity guarantees for RMDPs with general policy parameterization beyond (s, a)-rectangularity. It also provides the first such guarantees in the average-reward setting and improves existing bounds for discounted robust MDPs.
[LG-81] WSBD: Freezing-Based Optimizer for Quantum Neural Networks AISTATS2026
链接: https://arxiv.org/abs/2602.11383
作者: Christopher Kverne,Mayur Akewar,Yuqian Huo,Tirthak Patel,Janki Bhimani
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: Accepted to AISTATS 2026. 9 pages main, 24 pages total
Abstract:The training of Quantum Neural Networks (QNNs) is hindered by the high computational cost of gradient estimation and the barren plateau problem, where optimization landscapes become intractably flat. To address these challenges, we introduce Weighted Stochastic Block Descent (WSBD), a novel optimizer with a dynamic, parameter-wise freezing strategy. WSBD intelligently focuses computational resources by identifying and temporarily freezing less influential parameters based on a gradient-derived importance score. This approach significantly reduces the number of forward passes required per training step and helps navigate the optimization landscape more effectively. Unlike pruning or layer-wise freezing, WSBD maintains full expressive capacity while adapting throughout training. Our extensive evaluation shows that WSBD converges on average 63.9% faster than Adam for the popular ground-state-energy problem, an advantage that grows with QNN size. We provide a formal convergence proof for WSBD and show that parameter-wise freezing outperforms traditional layer-wise approaches in QNNs. Project page: this https URL.
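A toy sketch of parameter-wise freezing by gradient importance, a simplification of WSBD (the actual scoring, weighting, and schedule differ). It shows how frozen coordinates skip updates until a periodic refresh re-scores and reactivates them.

```python
import numpy as np

def freezing_step(w, grad_fn, lr=0.1, freeze_frac=0.5, state=None, refresh=10):
    """One step with gradient-importance freezing (our simplification of WSBD).
    Frozen coordinates skip updates; in a QNN each skipped gradient entry
    saves extra circuit evaluations."""
    if state is None or state["t"] % refresh == 0:
        g = grad_fn(w)                                  # full gradient, done rarely
        k = int(freeze_frac * len(w))
        mask = np.ones(len(w))
        mask[np.argsort(np.abs(g))[:k]] = 0.0           # freeze least influential
        state = {"t": 0, "mask": mask}
    else:
        g = grad_fn(w)   # here only the active entries would be "measured"
    state["t"] += 1
    return w - lr * g * state["mask"], state

f_grad = lambda w: 2 * (w - np.arange(4.0))             # grad of ||w - target||^2
w, state = np.zeros(4), None
for _ in range(50):
    w, state = freezing_step(w, f_grad, state=state)
print(w)  # approaches the target [0, 1, 2, 3] as refreshes rotate the active set
```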
[LG-82] oward Adaptive Non-Intrusive Reduced-Order Models: Design and Challenges
链接: https://arxiv.org/abs/2602.11378
作者: Amirpasha Hedayat,Alberto Padovan,Karthik Duraisamy
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
*备注:
Abstract:Projection-based Reduced Order Models (ROMs) are often deployed as static surrogates, which limits their practical utility once a system leaves the training manifold. We formalize and study adaptive non-intrusive ROMs that update both the latent subspace and the reduced dynamics online. Building on ideas from static non-intrusive ROMs, specifically, Operator Inference (OpInf) and the recently-introduced Non-intrusive Trajectory-based optimization of Reduced-Order Models (NiTROM), we propose three formulations: Adaptive OpInf (sequential basis/operator refits), Adaptive NiTROM (joint Riemannian optimization of encoder/decoder and polynomial dynamics), and a hybrid that initializes NiTROM with an OpInf update. We describe the online data window, adaptation window, and computational budget, and analyze cost scaling. On a transiently perturbed lid-driven cavity flow, static Galerkin/OpInf/NiTROM drift or destabilize when forecasting beyond training. In contrast, Adaptive OpInf robustly suppresses amplitude drift with modest cost; Adaptive NiTROM is shown to attain near-exact energy tracking under frequent updates but is sensitive to its initialization and optimization depth; the hybrid is most reliable under regime changes and minimal offline data, yielding physically coherent fields and bounded energy. We argue that predictive claims for ROMs must be cost-aware and transparent, with clear separation of training/adaptation/deployment regimes and explicit reporting of online budgets and full-order model queries. This work provides a practical template for building self-correcting, non-intrusive ROMs that remain effective as the dynamics evolve well beyond the initial manifold.
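For readers unfamiliar with OpInf, one adaptive refit step can be sketched as recompute-basis-then-least-squares. The version below fits only a linear reduced operator on a snapshot window, whereas the paper also handles quadratic terms, inputs, and the NiTROM-style joint optimization.

```python
import numpy as np

def opinf_refit(X, Xdot, r=4):
    """One Adaptive-OpInf-style refit (our simplification): recompute the POD
    basis from the latest snapshot window, then fit a linear reduced operator
    by least squares so that Zdot ~= A_hat @ Z in the latent coordinates."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    V = U[:, :r]                                  # updated latent subspace
    Z, Zdot = V.T @ X, V.T @ Xdot                 # projected data
    A_hat = np.linalg.lstsq(Z.T, Zdot.T, rcond=None)[0].T
    return V, A_hat

# Toy data: low-rank snapshots of a linear system x' = A x.
rng = np.random.default_rng(0)
n, m = 50, 200
A = -np.eye(n) + 0.01 * rng.normal(size=(n, n))
W, C = rng.normal(size=(n, 4)), rng.normal(size=(4, m))
X = W @ C
Xdot = A @ X
V, A_hat = opinf_refit(X, Xdot)
Z, Zdot = V.T @ X, V.T @ Xdot
print(np.linalg.norm(Zdot - A_hat @ Z) / np.linalg.norm(Zdot))  # ~0: exact fit
```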
[LG-83] Structured Hybrid Mechanistic Models for Robust Estimation of Time-Dependent Intervention Outcomes
链接: https://arxiv.org/abs/2602.11350
作者: Tomer Meir,Ori Linial,Danny Eytan,Uri Shalit
类目: Machine Learning (cs.LG)
*备注:
Abstract:Estimating intervention effects in dynamical systems is crucial for outcome optimization. In medicine, such interventions arise in physiological regulation (e.g., cardiovascular system under fluid administration) and pharmacokinetics, among others. Propofol administration is an anesthetic intervention, where the challenge is to estimate the optimal dose required to achieve a target brain concentration for anesthesia, given patient characteristics, while avoiding under- or over-dosing. The pharmacokinetic state is characterized by drug concentrations across tissues, and its dynamics are governed by prior states, patient covariates, drug clearance, and drug administration. While data-driven models can capture complex dynamics, they often fail in out-of-distribution (OOD) regimes. Mechanistic models on the other hand are typically robust, but might be oversimplified. We propose a hybrid mechanistic-data-driven approach to estimate time-dependent intervention outcomes. Our approach decomposes the dynamical system’s transition operator into parametric and nonparametric components, further distinguishing between intervention-related and unrelated dynamics. This structure leverages mechanistic anchors while learning residual patterns from data. For scenarios where mechanistic parameters are unknown, we introduce a two-stage procedure: first, pre-training an encoder on simulated data, and subsequently learning corrections from observed data. Two regimes with incomplete mechanistic knowledge are considered: periodic pendulum and Propofol bolus injections. Results demonstrate that our hybrid approach outperforms purely data-driven and mechanistic approaches, particularly OOD. This work highlights the potential of hybrid mechanistic-data-driven models for robust intervention optimization in complex, real-world dynamical systems.
[LG-84] Sample-Free Safety Assessment of Neural Network Controllers via Taylor Methods
链接: https://arxiv.org/abs/2602.11332
作者: Adam Evans,Roberto Armellin
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:In recent years, artificial neural networks have been increasingly studied as feedback controllers for guidance problems. While effective in complex scenarios, they lack the verification guarantees found in classical guidance policies. Their black-box nature creates significant concerns regarding trustworthiness, limiting their adoption in safety-critical spaceflight applications. This work addresses this gap by developing a method to assess the safety of a trained neural network feedback controller via automatic domain splitting and polynomial bounding. The methodology involves embedding the trained neural network into the system’s dynamical equations, rendering the closed-loop system autonomous. The system flow is then approximated by high-order Taylor polynomials, which are subsequently manipulated to construct polynomial maps that project state uncertainties onto an event manifold. Automatic domain splitting ensures the polynomials are accurate over their relevant subdomains, whilst also allowing an extensive state-space to be analysed efficiently. Utilising polynomial bounding techniques, the resulting event values may be rigorously constrained and analysed within individual subdomains, thereby establishing bounds on the range of possible closed-loop outcomes from using such neural network controllers and supporting safety assessment and informed operational decision-making in real-world missions.
[LG-85] Efficient Analysis of the Distilled Neural Tangent Kernel
链接: https://arxiv.org/abs/2602.11320
作者: Jamie Mahowald,Brian Bell,Alex Ho,Michael Geyer
类目: Machine Learning (cs.LG)
*备注: 27 pages, 9 figures
Abstract:Neural tangent kernel (NTK) methods are computationally limited by the need to evaluate large Jacobians across many data points. Existing approaches reduce this cost primarily by projecting and sketching the Jacobian. We show that NTK computation can also be reduced by compressing the data dimension itself using NTK-tuned dataset distillation. We demonstrate that the neural tangent space spanned by the input data can be induced by dataset distillation, yielding a 20-100 \times reduction in required Jacobian calculations. We further show that per-class NTK matrices have low effective rank that is preserved by this reduction. Building on these insights, we propose the distilled neural tangent kernel (DNTK), which combines NTK-tuned dataset distillation with state-of-the-art projection methods to reduce NTK computational complexity by up to five orders of magnitude while preserving kernel structure and predictive performance.
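The brute-force baseline this work accelerates is the empirical NTK Gram matrix built from per-sample Jacobians; a minimal torch sketch for a scalar-output model follows (model and sizes are illustrative). Distillation shrinks the number of rows n, which is exactly where the Jacobian cost lives.

```python
import torch

def empirical_ntk(model, X):
    """Empirical NTK Gram matrix K[i, j] = <J_i, J_j> for a scalar-output
    model, built row by row from per-sample parameter Jacobians."""
    params = [p for p in model.parameters() if p.requires_grad]
    jac_rows = []
    for i in range(X.shape[0]):
        out = model(X[i:i + 1]).sum()
        grads = torch.autograd.grad(out, params)
        jac_rows.append(torch.cat([g.flatten() for g in grads]))
    J = torch.stack(jac_rows)   # [n, p] Jacobian
    return J @ J.T              # [n, n] kernel matrix

net = torch.nn.Sequential(torch.nn.Linear(4, 16), torch.nn.Tanh(),
                          torch.nn.Linear(16, 1))
K = empirical_ntk(net, torch.randn(8, 4))
print(K.shape, torch.linalg.matrix_rank(K))
```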
[LG-86] Learning Glioblastoma Tumor Heterogeneity Using Brain Inspired Topological Neural Networks
链接: https://arxiv.org/abs/2602.11234
作者: Ankita Paul,Wenyi Wang
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:
Abstract:Accurate prognosis for Glioblastoma (GBM) using deep learning (DL) is hindered by extreme spatial and structural heterogeneity. Moreover, inconsistent MRI acquisition protocols across institutions hinder the generalizability of models. Conventional transformer and DL pipelines often fail to capture multi-scale morphological diversity such as fragmented necrotic cores, infiltrating margins, and disjoint enhancing components, leading to scanner-specific artifacts and poor cross-site prognosis. We propose TopoGBM, a learning framework designed to capture heterogeneity-preserved, scanner-robust representations from multi-parametric 3D MRI. Central to our approach is a 3D convolutional autoencoder regularized by a topological regularization term that preserves the complex, non-Euclidean invariants of the tumor's manifold within a compressed latent space. By enforcing these topological priors, TopoGBM explicitly models the high-variance structural signatures characteristic of aggressive GBM. Evaluated across heterogeneous cohorts (UPENN, UCSF, RHUH) with external validation on TCGA, TopoGBM achieves better performance (C-index 0.67 test, 0.58 validation), outperforming baselines that degrade under domain shift. Mechanistic interpretability analysis reveals that reconstruction residuals are highly localized to pathologically heterogeneous zones, with significantly low error in tumor-restricted and healthy tissue (test: 0.03, validation: 0.09). Furthermore, occlusion-based attribution localizes approximately 50% of the prognostic signal to the tumor and the diverse peritumoral microenvironment, supporting the clinical reliability of the unsupervised learning method. Our findings demonstrate that incorporating topological priors enables the learning of morphology-faithful embeddings that capture tumor heterogeneity while maintaining cross-institutional robustness.
[LG-87] Generative AI-Driven Phase Control for RIS-Aided Cell-Free Massive MIMO Systems
链接: https://arxiv.org/abs/2602.11226
作者: Kalpesh K. Patel,Malay Chakraborty,Ekant Sharma,Sandeep Kumar Singh
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:
Abstract:This work investigates a generative artificial intelligence (GenAI) model to optimize the reconfigurable intelligent surface (RIS) phase shifts in RIS-aided cell-free massive multiple-input multiple-output (mMIMO) systems under practical constraints, including imperfect channel state information (CSI) and spatial correlation. We propose two GenAI based approaches, generative conditional diffusion model (GCDM) and generative conditional diffusion implicit model (GCDIM), leveraging the diffusion model conditioned on dynamic CSI to maximize the sum spectral efficiency (SE) of the system. To benchmark performance, we compare the proposed GenAI based approaches against an expert algorithm, traditionally known for achieving near-optimal solutions at the cost of computational efficiency. The simulation results demonstrate that GCDM matches the sum SE achieved by the expert algorithm while significantly reducing the computational overhead. Furthermore, GCDIM achieves a comparable sum SE with an additional 98% reduction in computation time, underscoring its potential for efficient phase optimization in RIS-aided cell-free mMIMO systems.
[LG-88] The Magic Correlations: Understanding Knowledge Transfer from Pretraining to Supervised Fine-Tuning
链接: https://arxiv.org/abs/2602.11217
作者: Simin Fan,Dimitris Paparas,Natasha Noy,Binbin Xiong,Noveen Sachdeva,Berivan Isik
类目: Machine Learning (cs.LG)
*备注:
Abstract:Understanding how language model capabilities transfer from pretraining to supervised fine-tuning (SFT) is fundamental to efficient model development and data curation. In this work, we investigate four core questions: RQ1. To what extent do accuracy and confidence rankings established during pretraining persist after SFT? RQ2. Which benchmarks serve as robust cross-stage predictors and which are unreliable? RQ3. How do transfer dynamics shift with model scale? RQ4. How well does model confidence align with accuracy, as a measure of calibration quality? Does this alignment pattern transfer across training stages? We address these questions through a suite of correlation protocols applied to accuracy and confidence metrics across diverse data mixtures and model scales. Our experiments reveal that transfer reliability varies dramatically across capability categories, benchmarks, and scales – with accuracy and confidence exhibiting distinct, sometimes opposing, scaling dynamics. These findings shed light on the complex interplay between pretraining decisions and downstream outcomes, providing actionable guidance for benchmark selection, data curation, and efficient model development.
[LG-89] Protein Language Model Embeddings Improve Generalization of Implicit Transfer Operators
链接: https://arxiv.org/abs/2602.11216
作者: Panagiotis Antoniadis,Beatrice Pavesi,Simon Olsson,Ole Winther
类目: Machine Learning (cs.LG); Biological Physics (physics.bio-ph)
*备注: 24 pages, 12 figures and 7 tables
Abstract:Molecular dynamics (MD) is a central computational tool in physics, chemistry, and biology, enabling quantitative prediction of experimental observables as expectations over high-dimensional molecular distributions such as Boltzmann distributions and transition densities. However, conventional MD is fundamentally limited by the high computational cost required to generate independent samples. Generative molecular dynamics (GenMD) has recently emerged as an alternative, learning surrogates of molecular distributions either from data or through interaction with energy models. While these methods enable efficient sampling, their transferability across molecular systems is often limited. In this work, we show that incorporating auxiliary sources of information can improve the data efficiency and generalization of transferable implicit transfer operators (TITO) for molecular dynamics. We find that coarse-grained TITO models are substantially more data-efficient than Boltzmann Emulators, and that incorporating protein language model (pLM) embeddings further improves out-of-distribution generalization. Our approach, PLaTITO, achieves state-of-the-art performance on equilibrium sampling benchmarks for out-of-distribution protein systems, including fast-folding proteins. We further study the impact of additional conditioning signals – such as structural embeddings, temperature, and large-language-model-derived embeddings – on model performance.
[LG-90] Charting Empirical Laws for LLM Fine-Tuning in Scientific Multi-Discipline Learning
链接: https://arxiv.org/abs/2602.11215
作者: Lintao Wang,Zhuqiang Lu,Yilin Zhu,Kun Hu,Zhenfei Yin,Shixiang Tang,Zhiyong Wang,Wanli Ouyang,Xinzhu Ma
类目: Machine Learning (cs.LG)
*备注:
Abstract:While large language models (LLMs) have achieved strong performance through fine-tuning within individual scientific domains, their learning dynamics in multi-disciplinary contexts remains poorly understood, despite the promise of improved generalization and broader applicability through cross-domain knowledge synergy. In this work, we present the first systematic study of multi-disciplinary LLM fine-tuning, constructing a five-discipline corpus and analyzing learning patterns of full fine-tuning, LoRA, LoRA-MoE, and LoRA compositions. Particularly, our study shows that multi-disciplinary learning is substantially more variable than single-discipline training and distills four consistent empirical laws: (1) Balance-then-Diversity: low-resource disciplines degrade performance unless mitigated via diversity-aware upsampling; (2) Merge-then-Align: restoring instruction-following ability is critical for cross-discipline synergy; (3) Optimize-then-Scale: parameter scaling offers limited gains without prior design optimization; and (4) Share-then-Specialize: asymmetric LoRA-MoE yields robust gains with minimal trainable parameters via shared low-rank projection. Together, these laws form a practical recipe for principled multi-discipline fine-tuning and provide actionable guidance for developing generalizable scientific LLMs.
[LG-91] Towards Compressive and Scalable Recurrent Memory
链接: https://arxiv.org/abs/2602.11212
作者: Yunchong Song,Jushi Kai,Liming Lu,Kaixi Qiu,Zhouhan Lin
类目: Machine Learning (cs.LG)
*备注:
Abstract:Transformers face a quadratic bottleneck in attention when scaling to long contexts. Recent approaches introduce recurrent memory to extend context beyond the current window, yet these often face a fundamental trade-off between theoretical principles and practical scalability. To address this, we introduce Elastic Memory, a novel memory architecture grounded in the HiPPO framework for online function approximation. Elastic Memory treats the historical sequence as samples from continuous signals, applying optimal online compression to encode them into a fixed-size memory state. For retrieval, we propose a flexible *polynomial sampling* mechanism that reconstructs a history summary from this compressed state. Elastic Memory consistently outperformed baselines on long-context (32k+) datasets across three domains. With equal parameters, it outperformed the Memorizing Transformer while using 16x less memory, and outperformed Melodi at all memory sizes, even when Melodi had 30% more parameters. When scaling model size, Elastic Memory stayed ahead of all baselines and was significantly faster than Melodi at 4x size. Furthermore, its decoupled design allows for injecting inductive biases at test time to boost performance.
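Since the abstract names HiPPO as the foundation, a minimal NumPy sketch of HiPPO-LegS online compression may help: a growing scalar stream is encoded into a fixed-size coefficient vector by a recurrence that follows from projecting the history onto Legendre polynomials. This is only the underlying framework; Elastic Memory's learned memory and retrieval networks are not reproduced here.

```python
# HiPPO-LegS sketch: compress a stream online into N polynomial coefficients.
# A and B below are the LegS matrices from the HiPPO framework; the update is
# a bilinear discretization of c'(t) = -(1/t) A c(t) + (1/t) B f(t).
import numpy as np

N = 32                                    # fixed memory size (polynomial order)
A = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        if i > j:
            A[i, j] = np.sqrt((2 * i + 1) * (2 * j + 1))
        elif i == j:
            A[i, j] = i + 1
B = np.sqrt(2 * np.arange(N) + 1)

c = np.zeros(N)                           # compressed memory state
stream = np.sin(np.linspace(0, 8 * np.pi, 1000))   # hypothetical history
I = np.eye(N)
for k, f_k in enumerate(stream, start=1):
    rhs = (I - A / (2 * k)) @ c + B * f_k / k
    c = np.linalg.solve(I + A / (2 * k), rhs)      # stable bilinear step
print("fixed-size summary of 1000 steps:", np.round(c[:4], 3))
```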
[LG-92] Adaptive Physics Transformer with Fused Global-Local Attention for Subsurface Energy Systems
链接: https://arxiv.org/abs/2602.11208
作者: Xin Ju,Nok Hei(Hadrian)Fung,Yuyan Zhang,Carl Jacquemyn,Matthew Jackson,Randolph Settgast,Sally M. Benson,Gege Wen
类目: Machine Learning (cs.LG)
*备注:
Abstract:The Earth’s subsurface is a cornerstone of modern society, providing essential energy resources like hydrocarbons, geothermal, and minerals while serving as the primary reservoir for CO₂ sequestration. However, full physics numerical simulations of these systems are notoriously computationally expensive due to geological heterogeneity, high resolution requirements, and the tight coupling of physical processes with distinct propagation time scales. Here we propose the **Adaptive Physics Transformer (APT)**, a geometry-, mesh-, and physics-agnostic neural operator that explicitly addresses these challenges. APT fuses a graph-based encoder to extract high-resolution local heterogeneous features with a global attention mechanism to resolve long-range physical impacts. Our results demonstrate that APT outperforms state-of-the-art architectures in subsurface tasks across both regular and irregular grids with robust super-resolution capabilities. Notably, APT is the first architecture that directly learns from adaptive mesh refinement simulations. We also demonstrate APT’s capability for cross-dataset learning, positioning it as a robust and scalable backbone for large-scale subsurface foundation model development.
[LG-93] AM-FM: A Foundation Model for Ambient Intelligence Through WiFi
链接: https://arxiv.org/abs/2602.11200
作者: Guozhen Zhu,Yuqian Hu,Sakila Jayaweera,Weihang Gao,Wei-Hsiang Wang,Jiaxuan Zhang,Beibei Wang,Chenshu Wu,K. J. Ray Liu
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Ambient intelligence, continuously understanding human presence, activity, and physiology in physical spaces, is fundamental to smart environments, health monitoring, and human-computer interaction. WiFi infrastructure provides a ubiquitous, always-on, privacy-preserving substrate for this capability across billions of IoT devices. Yet this potential remains largely untapped, as wireless sensing has typically relied on task-specific models that require substantial labeled data and limit practical deployment. We present AM-FM, the first foundation model for ambient intelligence and sensing through WiFi. AM-FM is pre-trained on 9.2 million unlabeled Channel State Information (CSI) samples collected over 439 days from 20 commercial device types deployed worldwide, learning general-purpose representations via contrastive learning, masked reconstruction, and physics-informed objectives tailored to wireless signals. Evaluated on public benchmarks spanning nine downstream tasks, AM-FM shows strong cross-task performance with improved data efficiency, demonstrating that foundation models can enable scalable ambient intelligence using existing wireless infrastructure.
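Of the three pretraining objectives listed, masked reconstruction is the easiest to sketch. The toy below masks random time steps of a CSI-like tensor and trains an encoder to fill them in; the shapes, masking ratio, and Transformer encoder are hypothetical stand-ins, not AM-FM's actual architecture.

```python
# Masked-reconstruction pretraining sketch on synthetic CSI-like sequences.
import torch
import torch.nn as nn

csi = torch.randn(8, 128, 64)             # (batch, time, subcarriers), synthetic
mask = torch.rand(8, 128) < 0.3           # hide 30% of time steps
inp = csi.clone()
inp[mask] = 0.0                           # zero out masked positions

layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
recon = encoder(inp)
loss = ((recon - csi)[mask] ** 2).mean()  # reconstruct only the masked steps
loss.backward()
print("masked-reconstruction loss:", loss.item())
```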
[LG-94] Predicting the post-wildfire mudflow onset using machine learning models on multi-parameter experimental data
链接: https://arxiv.org/abs/2602.11194
作者: Mahta Movasat,Ingrid Tomac
类目: Machine Learning (cs.LG); Soft Condensed Matter (cond-mat.soft)
*备注:
Abstract:Post-wildfire mudflows are increasingly hazardous due to the prevalence of wildfires, including those on the wildland-urban interface. Upon burning, soil on the surface or immediately beneath becomes hydrophobic, a phenomenon that occurs predominantly on sand-based hillslopes. Rainwater and eroded soil blanket the downslope, leading to catastrophic debris flows. Soil hydrophobicity enhances erosion, resulting in post-wildfire debris flows that differ from natural mudflows in intensity, duration, and destructiveness. Thus, it is crucial to understand the timing and conditions of debris-flow onset, driven by the coupled effects of critical parameters: varying rain intensities (RI), slope gradients, water-entry values, and grain sizes (D50). Machine Learning (ML) techniques have become increasingly valuable in geotechnical engineering due to their ability to model complex systems without predefined assumptions. This study applies multiple ML algorithms: multiple linear regression (MLR), logistic regression (LR), support vector classifier (SVC), K-means clustering, and principal component analysis (PCA) to predict and classify outcomes from laboratory experiments that model field conditions using a rain device on various soils in sloped flumes. While MLR effectively predicted total discharge, erosion predictions were less accurate, especially for coarse sand. LR and SVC achieved good accuracy in classifying failure outcomes, supported by clustering and dimensionality reduction. Sensitivity analysis revealed that fine sand is highly susceptible to erosion, particularly under low-intensity, long-duration rainfall. Results also show that the first 10 minutes of high-intensity rain are most critical for discharge and failure. These findings highlight the potential of ML for post-wildfire hazard assessment and emergency response planning.
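For readers unfamiliar with the listed methods, the classification part of such a study boils down to a standard scikit-learn workflow. The sketch below uses synthetic stand-ins for the four experimental parameters (rain intensity, slope, water-entry value, D50), not the actual flume data:

```python
# Failure-onset classification sketch with synthetic stand-in data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Columns: rain intensity, slope gradient, water-entry value, D50 grain size.
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] - 0.3 * X[:, 3]
     + rng.normal(0.0, 0.5, 200) > 0).astype(int)   # synthetic failure label

clf = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(clf, X, y, cv=5)
print(f"classification accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```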
[LG-95] Learning to Control: The iUzawa-Net for Nonsmooth Optimal Control of Linear PDEs
链接: https://arxiv.org/abs/2602.12273
作者: Yongcun Song,Xiaoming Yuan,Hangrui Yue,Tianyou Zeng
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:We propose an optimization-informed deep neural network approach, named iUzawa-Net, aiming for the first solver that enables real-time solutions for a class of nonsmooth optimal control problems of linear partial differential equations (PDEs). The iUzawa-Net unrolls an inexact Uzawa method for saddle point problems, replacing classical preconditioners and PDE solvers with specifically designed learnable neural networks. We prove universal approximation properties and establish the asymptotic $\varepsilon$-optimality for the iUzawa-Net, and validate its promising numerical efficiency through nonsmooth elliptic and parabolic optimal control problems. Our techniques offer a versatile framework for designing and analyzing various optimization-informed deep learning approaches to optimal control and other PDE-constrained optimization problems. The proposed learning-to-control approach synergizes model-based optimization algorithms and data-driven deep learning techniques, inheriting the merits of both methodologies.
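For orientation, the numerical skeleton being unrolled is the classical Uzawa iteration for a saddle-point system; iUzawa-Net replaces the inner solve and preconditioner with learnable networks. A plain NumPy version of the exact iteration, on a random well-posed system rather than a PDE discretization:

```python
# Exact Uzawa iteration for the saddle-point system
#   A x + B^T y = f,   B x = g,   with A symmetric positive definite.
import numpy as np

rng = np.random.default_rng(5)
n, m = 20, 5
M = rng.normal(size=(n, n))
A = M @ M.T / n + np.eye(n)                 # SPD primal block
B = rng.normal(size=(m, n))
f, g = rng.normal(size=n), rng.normal(size=m)

S = B @ np.linalg.solve(A, B.T)             # Schur complement
tau = 1.0 / np.linalg.eigvalsh(S).max()     # safe dual step (< 2 / lambda_max)

x, y = np.zeros(n), np.zeros(m)
for _ in range(500):
    x = np.linalg.solve(A, f - B.T @ y)     # the solve iUzawa-Net learns to replace
    y = y + tau * (B @ x - g)               # dual ascent on the multiplier
print("residuals:", np.linalg.norm(A @ x + B.T @ y - f),
      np.linalg.norm(B @ x - g))
```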
[LG-96] The Implicit Bias of Logit Regularization
链接: https://arxiv.org/abs/2602.12039
作者: Alon Beck,Yohai Bar Sinai,Noam Levi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Logit regularization, the addition of a convex penalty directly in logit space, is widely used in modern classifiers, with label smoothing as a prominent example. While such methods often improve calibration and generalization, their mechanism remains under-explored. In this work, we analyze a general class of such logit regularizers in the context of linear classification, and demonstrate that they induce an implicit bias of logit clustering around finite per-sample targets. For Gaussian data, or whenever logits are sufficiently clustered, we prove that logit clustering drives the weight vector to align exactly with Fisher’s Linear Discriminant. To demonstrate the consequences, we study a simple signal-plus-noise model in which this transition has dramatic effects: Logit regularization halves the critical sample complexity and induces grokking in the small-noise limit, while making generalization robust to noise. Our results extend the theoretical understanding of label smoothing and highlight the efficacy of a broader class of logit-regularization methods.
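Label smoothing, the paper's prominent example, illustrates the claimed implicit bias directly: the smoothed loss has a finite logit optimum, so gradients do not vanish as margins grow. A small PyTorch check (values are arbitrary):

```python
# Gradient of plain vs. label-smoothed cross-entropy at a confident logit vector.
import torch
import torch.nn.functional as F

logits = torch.tensor([[4.0, -1.0, -1.0]], requires_grad=True)
target = torch.tensor([0])

plain = F.cross_entropy(logits, target)
smoothed = F.cross_entropy(logits, target, label_smoothing=0.1)

plain.backward()
print("grad, no smoothing:  ", logits.grad)   # shrinks toward 0 as margins grow
logits.grad = None
smoothed.backward()
print("grad, with smoothing:", logits.grad)   # pulls logits to a finite target
```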
[LG-97] Insights on Muon from Simple Quadratics
链接: https://arxiv.org/abs/2602.11948
作者: Antoine Gonon,Andreea-Alexandra Muşat,Nicolas Boumal
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:Muon updates weight matrices along (approximate) polar factors of the gradients and has shown strong empirical performance in large-scale training. Existing attempts at explaining its performance largely focus on single-step comparisons (on quadratic proxies) and worst-case guarantees that treat the inexactness of the polar factor as a nuisance "to be argued away". We show that already on simple strongly convex functions such as $L(W)=\frac{1}{2}\|W\|_F^2$, these perspectives are insufficient, suggesting that understanding Muon requires going beyond local proxies and pessimistic worst-case bounds. Instead, our analysis exposes two observations that already affect behavior on simple quadratics and are not well captured by prevailing abstractions: (i) approximation error in the polar step can qualitatively alter discrete-time dynamics and improve reachability and finite-time performance – an effect practitioners exploit to tune Muon, but that existing theory largely treats as a pure accuracy compromise; and (ii) structural properties of the objective affect finite-budget constants beyond the prevailing conditioning-based explanations. Thus, any general theory covering these cases must either incorporate these ingredients explicitly or explain why they are irrelevant in the regimes of interest.
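For concreteness, the sketch below runs a Muon-style update on the quadratic named in the abstract, where the gradient of $L(W)=\frac{1}{2}\|W\|_F^2$ is simply $W$. The polar factor is approximated with a plain cubic Newton-Schulz iteration; the step count controls exactly the inexactness discussed in point (i). Coefficients and hyperparameters are illustrative, not any tuned Muon variant.

```python
# Muon-style step on L(W) = 0.5 * ||W||_F^2 with an approximate polar factor.
import torch

def newton_schulz_polar(G, steps=5, eps=1e-7):
    X = G / (G.norm() + eps)              # scale so singular values are <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X   # cubic iteration toward the polar factor
    return X

torch.manual_seed(0)
W = torch.randn(8, 8)
lr = 0.1
for _ in range(10):
    G = W.clone()                         # gradient of 0.5 * ||W||_F^2 is W itself
    W = W - lr * newton_schulz_polar(G)   # fewer steps -> more inexactness
print("||W||_F after 10 steps:", W.norm().item())
```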
[LG-98] EqDeepRx: Learning a Scalable MIMO Receiver
链接: https://arxiv.org/abs/2602.11834
作者: Mikko Honkala,Dani Korpi,Elias Raninen,Janne M. J. Huttunen
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: This work has been submitted to IEEE for consideration for publication
Abstract:While machine learning (ML)-based receiver algorithms have received a great deal of attention in the recent literature, they often suffer from poor scaling with increasing spatial multiplexing order and lack of explainability and generalization. This paper presents EqDeepRx, a practical deep-learning-aided multiple-input multiple-output (MIMO) receiver, which is built by augmenting linear receiver processing with carefully engineered ML blocks. At the core of the receiver model is a shared-weight DetectorNN that operates independently on each spatial stream or layer, enabling near-linear complexity scaling with respect to multiplexing order. To ensure better explainability and generalization, EqDeepRx retains conventional channel estimation and augments it with a lightweight DenoiseNN that learns frequency-domain smoothing. To reduce the dimensionality of the DetectorNN inputs, the receiver utilizes two linear equalizers in parallel: a linear minimum mean-square error (LMMSE) equalizer with interference-plus-noise covariance estimation and a regularized zero-forcing (RZF) equalizer. The parallel equalized streams are jointly consumed by the DetectorNN, after which a compact DemapperNN produces bit log-likelihood ratios for channel decoding. 5G/6G-compliant end-to-end simulations across multiple channel scenarios, pilot patterns, and inter-cell interference conditions show improved error rate and spectral efficiency over a conventional baseline, while maintaining low-complexity inference and support for different MIMO configurations without retraining.
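The two parallel equalizers are standard linear receivers and easy to write down. The NumPy sketch below uses the plain white-noise LMMSE form and a fixed RZF regularizer; the paper's version adds interference-plus-noise covariance estimation and the learned blocks downstream.

```python
# LMMSE and regularized zero-forcing (RZF) equalization of a toy MIMO channel.
import numpy as np

rng = np.random.default_rng(1)
n_rx, n_layers = 4, 2
H = (rng.normal(size=(n_rx, n_layers))
     + 1j * rng.normal(size=(n_rx, n_layers))) / np.sqrt(2)
x = rng.choice([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j], size=n_layers) / np.sqrt(2)
sigma2 = 0.1
y = H @ x + np.sqrt(sigma2 / 2) * (rng.normal(size=n_rx)
                                   + 1j * rng.normal(size=n_rx))

I = np.eye(n_layers)
x_lmmse = np.linalg.solve(H.conj().T @ H + sigma2 * I, H.conj().T @ y)
alpha = 0.01                                # fixed RZF regularizer (illustrative)
x_rzf = np.linalg.solve(H.conj().T @ H + alpha * I, H.conj().T @ y)

print("tx:   ", np.round(x, 2))
print("LMMSE:", np.round(x_lmmse, 2))
print("RZF:  ", np.round(x_rzf, 2))
```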
[LG-99] Decentralized Non-convex Stochastic Optimization with Heterogeneous Variance
链接: https://arxiv.org/abs/2602.11789
作者: Hongxu Chen,Ke Wei,Luo Luo
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:Decentralized optimization is critical for solving large-scale machine learning problems over distributed networks, where multiple nodes collaborate through local communication. In practice, the variances of stochastic gradient estimators often differ across nodes, yet their impact on algorithm design and complexity remains unclear. To address this issue, we propose D-NSS, a decentralized algorithm with node-specific sampling, and establish its sample complexity depending on the arithmetic mean of local standard deviations, achieving tighter bounds than existing methods that rely on the worst-case or quadratic mean. We further derive a matching sample complexity lower bound under heterogeneous variance, thereby proving the optimality of this dependence. Moreover, we extend the framework with a variance reduction technique and develop D-NSS-VR, which under the mean-squared smoothness assumption attains an improved sample complexity bound while preserving the arithmetic-mean dependence. Finally, numerical experiments validate the theoretical results and demonstrate the effectiveness of the proposed algorithms.
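The two ingredients, gossip mixing and node-specific minibatch sizes, can be illustrated on a toy quadratic; D-NSS's precise update rule, step sizes, and variance-reduction extension are in the paper. Allocating more samples to noisier nodes is what ties the complexity to the arithmetic mean of local standard deviations rather than the worst case.

```python
# Decentralized SGD sketch with gossip mixing and node-specific batch sizes.
# Each node i minimizes f_i(x) = 0.5 * ||x||^2 with gradient noise std sigma_i.
import numpy as np

rng = np.random.default_rng(2)
n_nodes, d = 4, 10
W = 0.5 * np.eye(n_nodes)                 # ring gossip matrix (doubly stochastic)
for i in range(n_nodes):
    W[i, (i + 1) % n_nodes] = 0.25
    W[i, (i - 1) % n_nodes] = 0.25

sigma = np.array([0.2, 0.5, 1.0, 2.0])    # heterogeneous noise levels
batch = np.maximum(1, np.round(4 * sigma / sigma.mean())).astype(int)

x = rng.normal(size=(n_nodes, d))         # local iterates
lr = 0.1
for _ in range(200):
    noise = sigma[:, None] * rng.normal(size=(n_nodes, d)) / np.sqrt(batch)[:, None]
    x = W @ x - lr * (x + noise)          # gossip mix, then local SGD step
print("consensus error:", np.linalg.norm(x - x.mean(0)))
print("distance to optimum:", np.linalg.norm(x.mean(0)))
```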
[LG-100] Aggregate Models Not Explanations: Improving Feature Importance Estimation
链接: https://arxiv.org/abs/2602.11760
作者: Joseph Paillard,Angel Reyero Lobo,Denis A. Engemann,Bertrand Thirion
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Feature-importance methods show promise in transforming machine learning models from predictive engines into tools for scientific discovery. However, due to data sampling and algorithmic stochasticity, expressive models can be unstable, leading to inaccurate variable importance estimates and undermining their utility in critical biomedical applications. Although ensembling offers a solution, deciding whether to explain a single ensemble model or aggregate individual model explanations is difficult due to the nonlinearity of importance measures and remains largely understudied. Our theoretical analysis, developed under assumptions accommodating complex state-of-the-art ML models, reveals that this choice is primarily driven by the model’s excess risk. In contrast to prior literature, we show that ensembling at the model level provides more accurate variable-importance estimates, particularly for expressive models, by reducing this leading error term. We validate these findings on classical benchmarks and a large-scale proteomic study from the UK Biobank.
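The two aggregation orders at issue can be demonstrated with a hand-rolled permutation importance (MSE increase after shuffling one feature) on synthetic data. This is a toy contrast of "average the explanations" versus "explain the averaged model", not the paper's estimator:

```python
# Explain-then-aggregate vs. aggregate-then-explain, on bootstrapped trees.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def perm_importance(predict, X, y, rng):
    base = np.mean((predict(X) - y) ** 2)
    imp = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])     # break feature j
        imp[j] = np.mean((predict(Xp) - y) ** 2) - base
    return imp

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 5))
y = X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=500)

boots = [rng.integers(0, 500, 500) for _ in range(10)]
models = [DecisionTreeRegressor(max_depth=4).fit(X[i], y[i]) for i in boots]

expl_then_agg = np.mean([perm_importance(m.predict, X, y, rng) for m in models], axis=0)
ens = lambda Z: np.mean([m.predict(Z) for m in models], axis=0)
agg_then_expl = perm_importance(ens, X, y, rng)
print("average of explanations:", np.round(expl_then_agg, 3))
print("explanation of ensemble:", np.round(agg_then_expl, 3))
```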
[LG-101] PAC-Bayesian Generalization Guarantees for Fairness on Stochastic and Deterministic Classifiers
链接: https://arxiv.org/abs/2602.11722
作者: Julien Bastian(LabHC),Benjamin Leblanc,Pascal Germain,Amaury Habrard(LabHC, IUF, MALICE),Christine Largeron(LabHC),Guillaume Metzler(ERIC),Emilie Morvant(LabHC),Paul Viallard(MALT)
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Classical PAC generalization bounds on the prediction risk of a classifier are insufficient to provide theoretical guarantees on fairness when the goal is to learn models balancing predictive risk and fairness constraints. We propose a PAC-Bayesian framework for deriving generalization bounds for fairness, covering both stochastic and deterministic classifiers. For stochastic classifiers, we derive a fairness bound using standard PAC-Bayes techniques. For deterministic classifiers, where the usual PAC-Bayes arguments do not apply directly, we leverage a recent advance in PAC-Bayes to extend the fairness bound beyond the stochastic setting. Our framework has two advantages: (i) it applies to a broad class of fairness measures that can be expressed as a risk discrepancy, and (ii) it leads to a self-bounding algorithm in which the learning procedure directly optimizes a trade-off between generalization bounds on the prediction risk and on the fairness. We empirically evaluate our framework with three classical fairness measures, demonstrating not only its usefulness but also the tightness of our bounds.
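An example of a fairness measure expressible as a risk discrepancy, the class the framework covers, is the demographic parity gap: the absolute difference in positive-prediction rates between two groups. An empirical estimate on synthetic data (predictor and rates are hypothetical):

```python
# Empirical demographic parity gap |P(h=1 | S=0) - P(h=1 | S=1)|.
import numpy as np

rng = np.random.default_rng(4)
s = rng.integers(0, 2, 1000)                         # sensitive attribute
yhat = (rng.random(1000) < np.where(s == 0, 0.55, 0.45)).astype(int)
gap = abs(yhat[s == 0].mean() - yhat[s == 1].mean())
print(f"demographic parity gap: {gap:.3f}")
```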
[LG-102] Estimation of instrument and noise parameters for inverse problem based on prior diffusion model
链接: https://arxiv.org/abs/2602.11711
作者: Jean-François Giovannelli
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Applications (stat.AP)
*备注:
Abstract:This article addresses the issue of estimating observation parameters (response and error parameters) in inverse problems. The focus is on cases where regularization is introduced in a Bayesian framework and the prior is modeled by a diffusion process. In this context, the issue of posterior sampling is well known to be thorny, and a recent paper proposes a notably simple and effective solution. Consequently, it offers remarkable additional flexibility when it comes to estimating observation parameters. The proposed strategy enables us to define an optimal estimator for both the observation parameters and the image of interest. Furthermore, the strategy provides a means of quantifying uncertainty. In addition, MCMC algorithms allow for the efficient computation of estimates and properties of posteriors, while offering some guarantees. The paper presents several numerical experiments that clearly confirm the computational efficiency and the quality of both the estimates and the uncertainty quantification.
[LG-103] Enforcing Reciprocity in Operator Learning for Seismic Wave Propagation
链接: https://arxiv.org/abs/2602.11631
作者: Caifeng Zou,Yaozhong Shi,Zachary E. Ross,Robert W. Clayton,Kamyar Azizzadenesheli
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
*备注:
Abstract:Accurate and efficient wavefield modeling underpins seismic structure and source studies. Traditional methods comply with physical laws but are computationally intensive. Data-driven methods, while opening new avenues for advancement, have yet to incorporate strict physical consistency. The principle of reciprocity is one of the most fundamental physical laws in wave propagation. We introduce the Reciprocity-Enforced Neural Operator (RENO), a transformer-based architecture for modeling seismic wave propagation that hard-codes the reciprocity principle. The model leverages the cross-attention mechanism and commutative operations to guarantee invariance under swapping source and receiver positions. Beyond improved physical consistency, the proposed architecture supports simultaneous realizations for multiple sources without crosstalk issues. This yields an order-of-magnitude inference speedup at a similar memory footprint over a reciprocity-unenforced neural operator on a realistic configuration. We demonstrate the functionality using the reciprocity relation for particle velocity fields under single forces. This architecture is also applicable to pressure fields under dilatational sources and travel-time fields governed by the eikonal equation, paving the way for encoding more complex reciprocity relations.
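The invariance trick is simple to hard-code: encode source and receiver with a shared network and combine the two embeddings with a commutative operation, so swapping positions provably leaves the output unchanged. RENO realizes this with cross-attention; the toy below uses an elementwise product for the same symmetry.

```python
# Hard-coded reciprocity: f(src, rcv) == f(rcv, src) by construction.
import torch
import torch.nn as nn

class ReciprocalNet(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(3, dim), nn.GELU(), nn.Linear(dim, dim))
        self.decode = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, src, rcv):
        # elementwise product is commutative, so the output is swap-invariant
        return self.decode(self.encode(src) * self.encode(rcv))

net = ReciprocalNet()
src, rcv = torch.randn(5, 3), torch.randn(5, 3)   # 3-D source/receiver positions
assert torch.allclose(net(src, rcv), net(rcv, src))
print("reciprocity holds exactly under source-receiver swap")
```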
[LG-104] The Cost of Learning under Multiple Change Points
链接: https://arxiv.org/abs/2602.11406
作者: Tomer Gafni,Garud Iyengar,Assaf Zeevi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We consider an online learning problem in environments with multiple change points. In contrast to the single change point problem that is widely studied using classical “high confidence” detection schemes, the multiple change point environment presents new learning-theoretic and algorithmic challenges. Specifically, we show that classical methods may exhibit catastrophic failure (high regret) due to a phenomenon we refer to as endogenous confounding. To overcome this, we propose a new class of learning algorithms dubbed Anytime Tracking CUSUM (ATC). These are horizon-free online algorithms that implement a selective detection principle, balancing the need to ignore “small” (hard-to-detect) shifts, while reacting “quickly” to significant ones. We prove that the performance of a properly tuned ATC algorithm is nearly minimax-optimal; its regret is guaranteed to closely match a novel information-theoretic lower bound on the achievable performance of any learning algorithm in the multiple change point problem. Experiments on synthetic as well as real-world data validate the aforementioned theoretical findings.
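The classical building block that ATC refines is the one-sided CUSUM recursion, whose drift term already encodes the "ignore small shifts, react to large ones" principle. The sketch below shows only that baseline; ATC's anytime thresholds and selective-detection logic are the paper's contribution and are not reproduced here.

```python
# One-sided CUSUM on a stream with a small (ignored) and a large (detected) shift.
import numpy as np

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0.0, 1, 300),    # pre-change
                    rng.normal(0.15, 1, 300),   # small shift: below the drift
                    rng.normal(1.5, 1, 300)])   # large shift: detected quickly
drift, threshold = 0.5, 8.0
S, alarms = 0.0, []
for t, xt in enumerate(x):
    S = max(0.0, S + xt - drift)                # CUSUM recursion
    if S > threshold:
        alarms.append(t)
        S = 0.0                                 # restart after an alarm
print("alarm times:", alarms[:5])               # expected only after t = 600
```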
[LG-105] Traffic Flow Reconstruction from Limited Collected Data
链接: https://arxiv.org/abs/2602.11336
作者: Nail Baloul,Amaury Hayat,Thibault Liard,Pierre Lissy
类目: Dynamical Systems (math.DS); Machine Learning (cs.LG)
*备注: 64th IEEE Conference on Decision and Control (CDC 2025), IEEE, Dec 2025, Rio de Janeiro, Brazil
Abstract:We propose an efficient method for reconstructing traffic density with a low penetration rate of probe vehicles. Specifically, we rely on measuring only the initial and final positions of a small number of cars, which are generated using microscopic dynamical systems. We then implement a machine learning algorithm from scratch to reconstruct the approximate traffic density. This approach leverages learning techniques to improve the accuracy of density reconstruction despite constraints in available data. For consistency, we prove that, when using only data from the dynamical systems, the approximate density predicted by our learning-based model converges to a well-known macroscopic traffic flow model as the number of vehicles approaches infinity.
[LG-106] Amortised and provably-robust simulation-based inference
链接: https://arxiv.org/abs/2602.11325
作者: Ayush Bharti,Charita Dellaporta,Yuga Hikida,François-Xavier Briol
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
*备注:
Abstract:Complex simulator-based models are now routinely used to perform inference across the sciences and engineering, but existing inference methods are often unable to account for outliers and other extreme values in data which occur due to faulty measurement instruments or human error. In this paper, we introduce a novel approach to simulation-based inference grounded in generalised Bayesian inference and a neural approximation of a weighted score-matching loss. This leads to a method that is both amortised and provably robust to outliers, a combination not achieved by existing approaches. Furthermore, through a carefully chosen conditional density model, we demonstrate that inference can be further simplified and performed without the need for Markov chain Monte Carlo sampling, thereby offering significant computational advantages, with complexity that is only a small fraction of that of current state-of-the-art approaches.
[LG-107] Hierarchical Testing of a Hybrid Machine Learning-Physics Global Atmosphere Model
链接: https://arxiv.org/abs/2602.11313
作者: Ziming Chen,L. Ruby Leung,Wenyu Zhou,Jian Lu,Sandro W. Lubis,Ye Liu,Chuan-Chieh Chang,Bryce E. Harrop,Ya Wang,Mingshi Yang,Gan Zhang,Yun Qian
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*备注: 48 pages, 9 figures
Abstract:Machine learning (ML)-based models have demonstrated high skill and computational efficiency, often outperforming conventional physics-based models in weather and subseasonal predictions. While prior studies have assessed their fidelity in capturing synoptic-scale atmospheric dynamics, their performance across timescales and under out-of-distribution forcing, such as +3K or +4K uniform-warming forcings, as well as the sources of their biases, remain elusive, yet must be understood to establish model reliability for Earth science. Here, we design three sets of experiments targeting synoptic-scale phenomena, interannual variability, and out-of-distribution uniform-warming forcings. We evaluate the Neural General Circulation Model (NeuralGCM), a hybrid model integrating a dynamical core with an ML-based component, against observations and physics-based Earth system models (ESMs). At the synoptic scale, NeuralGCM captures the evolution and propagation of extratropical cyclones with performance comparable to ESMs. At the interannual scale, when forced by El Niño-Southern Oscillation sea surface temperature (SST) anomalies, NeuralGCM successfully reproduces associated teleconnection patterns but exhibits deficiencies in capturing nonlinear responses. Under out-of-distribution uniform-warming forcings, NeuralGCM simulates similar responses in global-average temperature and precipitation and reproduces large-scale tropospheric circulation features similar to those in ESMs. Notable weaknesses include overestimating the tracks and spatial extent of extratropical cyclones, biases in the teleconnected wave train triggered by tropical SST anomalies, and differences in upper-level warming and stratospheric circulation responses to SST warming compared to physics-based ESMs. The causes of these weaknesses are also explored.
[LG-108] Unlearnable phases of matter
链接: https://arxiv.org/abs/2602.11262
作者: Tarun Advaith Kumar,Yijian Zou,Amir-Reza Negari,Roger G. Melko,Timothy H. Hsieh
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: 28 pages, 9 figures
Abstract:We identify fundamental limitations in machine learning by demonstrating that non-trivial mixed-state phases of matter are computationally hard to learn. Focusing on unsupervised learning of distributions, we show that autoregressive neural networks fail to learn global properties of distributions characterized by locally indistinguishable (LI) states. We demonstrate that conditional mutual information (CMI) is a useful diagnostic for LI: we show that for classical distributions, long-range CMI of a state implies a spatially LI partner. By introducing a restricted statistical query model, we prove that nontrivial phases with long-range CMI, such as strong-to-weak spontaneous symmetry breaking phases, are hard to learn. We validate our claims by using recurrent, convolutional, and Transformer neural networks to learn the syndrome and physical distributions of toric/surface code under bit flip noise. Our findings suggest hardness of learning as a diagnostic tool for detecting mixed-state phases and transitions and error-correction thresholds, and they suggest CMI and, more generally, "non-local Gibbsness" as metrics for how hard a distribution is to learn.
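For a discrete distribution given as a joint table, the CMI diagnostic is directly computable. The sketch below evaluates I(A;C|B) for a toy three-variable distribution in which the outer variables are perfectly correlated, so the CMI equals log 2; in the paper's setting, A and C would be distant regions with B the region between them.

```python
# Conditional mutual information I(A; C | B) for a discrete joint table p[a, b, c].
import numpy as np

def conditional_mi(p):
    p = p / p.sum()
    pb = p.sum(axis=(0, 2))          # p(b)
    pab = p.sum(axis=2)              # p(a, b)
    pbc = p.sum(axis=0)              # p(b, c)
    cmi = 0.0
    for a in range(p.shape[0]):
        for b in range(p.shape[1]):
            for c in range(p.shape[2]):
                if p[a, b, c] > 0:
                    cmi += p[a, b, c] * np.log(
                        p[a, b, c] * pb[b] / (pab[a, b] * pbc[b, c]))
    return cmi

# Toy long-range correlation: A = C = a fair bit, B independent of both.
p = np.zeros((2, 2, 2))
p[0, :, 0] = 0.25
p[1, :, 1] = 0.25
print("I(A;C|B) =", conditional_mi(p), "nats; log 2 =", np.log(2))
```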


