This post lists the latest papers retrieved from arXiv.org on 2026-03-26, updated automatically and grouped into six major areas: NLP, CV, ML, AI, IR, and MA.
Note: paper data is fetched from arXiv.org and updated automatically around 12:30 each day.
Tip: if a given day's list is missing, either arXiv published no new papers that day or the update script failed; fixes are made the same day whenever possible.
Contents
Overview (2026-03-26)
573 papers are updated today, including:
- Natural Language Processing: 97 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 166 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 135 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 162 papers (Machine Learning (cs.LG))
- Multiagent Systems: 10 papers (Multiagent Systems (cs.MA))
- Information Retrieval: 22 papers (Information Retrieval (cs.IR))
- Human-Computer Interaction: 22 papers (Human-Computer Interaction (cs.HC))
Multiagent Systems
[MA-0] The Free-Market Algorithm: Self-Organizing Optimization for Open-Ended Complex Systems
[Quick Read]: This paper addresses a limitation of classical metaheuristics such as genetic algorithms, particle swarm optimization, and simulated annealing: their dependence on prescribed fitness functions and fixed search spaces, which makes them ill-suited to open-ended, dynamically evolving problems. The authors propose the Free-Market Algorithm (FMA), whose core innovation is a three-layer architecture built on distributed supply-and-demand dynamics: a universal market mechanism (supply, demand, competition, and selection), pluggable domain-specific behavioral rules, and domain-specific observation. The key idea is that interactions among autonomous agents (discovering rules, trading goods, opening and closing firms, competing for demand) make fitness emergent and keep the search space open-ended, so that solutions self-organize into hierarchical pathway networks without a centralized controller. The mechanism is validated in two unrelated domains, prebiotic chemical synthesis and macroeconomic forecasting, demonstrating strong generality and physical relevance.
Link: https://arxiv.org/abs/2603.24559
Authors: Martin Jaraiz
Affiliations: unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 26 pages, 3 figures, 2 tables, draft
Abstract: We introduce the Free-Market Algorithm (FMA), a novel metaheuristic inspired by free-market economics. Unlike Genetic Algorithms, Particle Swarm Optimization, and Simulated Annealing – which require prescribed fitness functions and fixed search spaces – FMA uses distributed supply-and-demand dynamics where fitness is emergent, the search space is open-ended, and solutions take the form of hierarchical pathway networks. Autonomous agents discover rules, trade goods, open and close firms, and compete for demand with no centralized controller. FMA operates through a three-layer architecture: a universal market mechanism (supply, demand, competition, selection), pluggable domain-specific behavioral rules, and domain-specific observation. The market mechanism is identical across applications; only the behavioral rules change. FMA is validated in two unrelated domains. In prebiotic chemistry, starting from 900 bare atoms (C, H, O, N), FMA discovers all 12 feasible amino acid formulas, all 5 nucleobases, the formose sugar chain, and Krebs cycle intermediates in under 5 minutes on a laptop – with up to 240 independent synthesis routes per product. In macroeconomic forecasting, reading a single input-output table with zero estimated parameters, FMA achieves Mean Absolute Error of 0.42 percentage points for non-crisis GDP prediction, comparable to professional forecasters, portable to 33 countries. Assembly Theory alignment shows that FMA provides the first explicit, tunable mechanism for the selection signatures described by Sharma et al. (Nature, 2023). The event-driven assembly dynamics resonate with foundational programs in physics – causal set theory, relational quantum mechanics, constructor theory – suggesting that Darwinian market dynamics may reflect a deeper organizational principle that leads to the unfolding of Nature itself.
Submission history: [v1] Wed, 25 Mar 2026 17:41:25 UTC (426 KB), by Martin Jaraiz. arXiv:2603.24559 [cs.NE], https://doi.org/10.48550/arXiv.2603.24559
[MA-1] Relaxing Constraints in Anonymous Multi Agent Path Finding for Large Agents
[Quick Read]: This paper tackles Anonymous Multi-Agent Path-Finding (AMAPF), whose core challenge is ensuring that every goal position is occupied by some agent, via safe, collision-free paths, without prescribing which agent reaches which goal. Traditional methods rely on discrete environment representations (e.g., grids) and ignore agent sizes, limiting real-world applications such as trajectory planning for warehouse mobile robots; continuous-space methods, in turn, impose strict limits on the distances between start/goal positions and obstacles. This work improves a continuous-space AMAPF algorithm that models agents as equal-sized disks and requires a minimum separation of 4 agent radii between any start/goal positions. The key contribution is relaxing this minimum separation from 4 to 2√3 radii while proving that the original algorithm's safety and convergence guarantees are preserved: all agents can still reach goal positions without collisions.
Link: https://arxiv.org/abs/2603.24442
Authors: Stepan Dergachev, Dmitry Avdeev
Affiliations: unknown
Subjects: Multiagent Systems (cs.MA)
Comments: 14 pages, 6 figures
Abstract: This study addresses the problem of Anonymous Multi-Agent Path-finding (AMAPF). Unlike the classical formulation, where the assignment of agents to goals is fixed, in the anonymous MAPF setting it is irrelevant which agent reaches a specific goal, provided that all goals are occupied. Most existing multi-agent pathfinding algorithms rely on a discrete representation of the environment (e.g., square grids) and do not account for the sizes of agents. This limits their applicability in real-world scenarios, such as trajectory planning for mobile robots in warehouses. Conversely, methods operating in continuous space typically impose substantial restrictions on the input data, such as constraints on the distances between initial and goal positions or between start/goal positions and obstacles. In this work, we consider one of the AMAPF algorithms designed for continuous space, where agents are modeled as disks of equal size. The algorithm requires a strict minimum separation of 4 agent radii between any start/goal positions. We propose a modification that relaxes this constraint, reducing the limit from 4 to 2√3. We theoretically demonstrate that the proposed enhancements preserve the original theoretical properties, including the guarantee that all agents will eventually achieve their goals safely and without collisions.
[MA-2] The Specification Gap: Coordination Failure Under Partial Knowledge in Code Agents
[Quick Read]: This paper studies the coordination problem faced by multiple large language model (LLM) code agents that independently implement different parts of the same class, in particular how to keep the code consistent as specification detail is progressively removed. The key contribution is a specification-first view: richer specifications are the primary mechanism for effective coordination. Experiments show that even at the weakest specification level (bare signatures, L3), an AST-based conflict detector identifies inconsistencies with 97% precision, yet restoring the full specification (complete docstrings, L0) alone recovers the single-agent ceiling (89%), with conflict reports adding no measurable benefit. Decomposing the gap shows that coordination cost (+16 pp) and information asymmetry (+11 pp) are independent and approximately additive, indicating that the problem is not merely hidden information but the difficulty of producing compatible code without a shared decision mechanism.
Link: https://arxiv.org/abs/2603.24284
Authors: Camilo Chacón Sartori
Affiliations: unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:When multiple LLM-based code agents independently implement parts of the same class, they must agree on shared internal representations, even when the specification leaves those choices implicit. We study this coordination problem across 51 class-generation tasks, progressively stripping specification detail from full docstrings (L0) to bare signatures (L3), and introducing opposing structural biases (lists vs. dictionaries) to stress-test integration. Three findings emerge. First, a persistent specification gap: two-agent integration accuracy drops from 58% to 25% as detail is removed, while a single-agent baseline degrades more gracefully (89% to 56%), leaving a 25–39 pp coordination gap that is consistent across two Claude models (Sonnet, Haiku) and three independent runs. Second, an AST-based conflict detector achieves 97% precision at the weakest specification level without additional LLM calls, yet a factorial recovery experiment shows that restoring the full specification alone recovers the single-agent ceiling (89%), while providing conflict reports adds no measurable benefit. Third, decomposing the gap into coordination cost (+16 pp) and information asymmetry (+11 pp) suggests that the two effects are independent and approximately additive. The gap is not merely a consequence of hidden information, but reflects the difficulty of producing compatible code without shared decisions. These results support a specification-first view of multi-agent code generation: richer specifications are both the primary coordination mechanism and the sufficient recovery instrument.
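The AST-based conflict detection mentioned in the abstract can be illustrated with a minimal sketch (not the paper's actual detector): Python's `ast` module flags two independently written functions that return incompatible container literals. All function names and sources below are hypothetical.

```python
import ast

def returned_literal_types(source):
    """Collect the literal container types ('list'/'dict') returned
    by each top-level function in the given source string."""
    kinds = {}
    for fn in [n for n in ast.walk(ast.parse(source))
               if isinstance(n, ast.FunctionDef)]:
        for node in ast.walk(fn):
            if isinstance(node, ast.Return) and node.value is not None:
                if isinstance(node.value, ast.List):
                    kinds.setdefault(fn.name, set()).add("list")
                elif isinstance(node.value, ast.Dict):
                    kinds.setdefault(fn.name, set()).add("dict")
    return kinds

# Hypothetical outputs of two agents implementing parts of one class:
agent_a = "def make_store():\n    return []\n"
agent_b = "def read_store(s):\n    return {}\n"

types_a = returned_literal_types(agent_a)
types_b = returned_literal_types(agent_b)
# Structural conflict: one agent builds a list, the other expects a dict.
conflict = types_a["make_store"] != types_b["read_store"]
```

Because the check is purely syntactic, it needs no extra LLM calls, which is the property the abstract highlights.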
[MA-3] The Price Reversal Phenomenon: When Cheaper Reasoning Models End Up Costing More
[Quick Read]: This paper asks whether listed API prices, which developers and consumers use to choose reasoning language models (RLMs), actually reflect inference costs. It finds a substantial gap, the "pricing reversal" phenomenon: in 21.8% of model-pair comparisons, the model with the lower listed price incurs a higher total cost, with reversals of up to 28x. The root cause is vast heterogeneity in thinking-token consumption: on the same query, one model may use 9x the thinking tokens of another. Removing thinking-token costs raises the rank correlation (Kendall's τ) between price and cost rankings from 0.563 to 0.873, identifying thinking tokens as the main source of noise; moreover, thinking-token usage on a single query varies by up to 9.7x across repeated runs, establishing an irreducible noise floor for cost prediction. The paper therefore calls for cost-aware model selection and transparent per-request cost monitoring.
Link: https://arxiv.org/abs/2603.23971
Authors: Lingjiao Chen, Chi Zhang, Yeye He, Ion Stoica, Matei Zaharia, James Zou
Affiliations: Stanford University; UC Berkeley; CMU; Microsoft Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:
Abstract:Developers and consumers increasingly choose reasoning language models (RLMs) based on their listed API prices. However, how accurately do these prices reflect actual inference costs? We conduct the first systematic study of this question, evaluating 8 frontier RLMs across 9 diverse tasks covering competition math, science QA, code generation, and multi-domain reasoning. We uncover the pricing reversal phenomenon: in 21.8% of model-pair comparisons, the model with a lower listed price actually incurs a higher total cost, with reversal magnitude reaching up to 28x. For example, Gemini 3 Flash’s listed price is 78% cheaper than GPT-5.2’s, yet its actual cost across all tasks is 22% higher. We trace the root cause to vast heterogeneity in thinking token consumption: on the same query, one model may use 900% more thinking tokens than another. In fact, removing thinking token costs reduces ranking reversals by 70% and raises the rank correlation (Kendall’s \tau ) between price and cost rankings from 0.563 to 0.873. We further show that per-query cost prediction is fundamentally difficult: repeated runs of the same query yield thinking token variation up to 9.7x, establishing an irreducible noise floor for any predictor. Our findings demonstrate that listed API pricing is an unreliable proxy for actual cost, calling for cost-aware model selection and transparent per-request cost monitoring.
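The rank correlation the abstract reports can be sketched with a small Kendall's τ computation over paired price/cost lists. This is a minimal tau-a variant (it ignores ties); the model prices and costs below are illustrative numbers, not the paper's data.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall rank correlation (tau-a, no tie handling) between
    two paired score lists."""
    assert len(xs) == len(ys)
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical listed prices and measured total costs for four models.
listed_price = [0.5, 1.0, 2.0, 4.0]
actual_cost  = [3.0, 1.2, 1.8, 4.5]  # the cheapest listed model is costliest

tau = kendall_tau(listed_price, actual_cost)  # low tau signals reversals
```

A τ well below 1.0, as here, is exactly the signature of pricing reversals: listed-price order is a poor predictor of actual-cost order.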
[MA-4] Self-Evolving Multi-Agent Framework for Efficient Decision Making in Real-Time Strategy Scenarios
[Quick Read]: This paper targets the speed-quality trade-off that large language models (LLMs) face in Real-Time Strategy (RTS) scenarios, where expansive state spaces and time limits make inference delays prohibitive and stochastic planning errors undermine logical consistency. The proposed SEMA (Self-Evolving Multi-Agent) framework achieves efficient, low-latency decisions through two key mechanisms: (i) dynamic observation pruning based on structural entropy, which models game states topologically and distills core semantic information to sharply reduce inference time; and (ii) a hybrid knowledge-memory mechanism integrating micro-trajectories, macro-experience, and hierarchical domain knowledge to improve strategic adaptability and decision consistency. Across multiple StarCraft II maps, SEMA improves win rates while reducing average decision latency by over 50%.
Link: https://arxiv.org/abs/2603.23875
Authors: Li Ma, Hao Peng, Yiming Wang, Hongbin Luo, Jie Liu, Kongjing Gu, Guanlin Wu, Hui Lin, Lei Ren
Affiliations: Beihang University; Naval Aviation University; Military Science Academy; National University of Defense Technology; China Academy of Electronics and Information Technology
Subjects: Multiagent Systems (cs.MA)
Comments: 17 pages, 6 figures. Submitted to SCIS (Science China Information Sciences)
Abstract: Large language models (LLMs) have demonstrated exceptional potential in complex reasoning, pioneering a new paradigm for autonomous agent decision making in dynamic settings. However, in Real-Time Strategy (RTS) scenarios, LLMs suffer from a critical speed-quality trade-off. Specifically, expansive state spaces and time limits render inference delays prohibitive, while stochastic planning errors undermine logical consistency. To address these challenges, we present SEMA (Self-Evolving Multi-Agent), a novel framework designed for high-performance, low-latency decision-making in RTS environments. This collaborative multi-agent framework facilitates self-evolution by adaptively calibrating model bias through in-episode assessment and cross-episode analysis. We further incorporate dynamic observation pruning based on structural entropy to model game states topologically. By distilling high-dimensional data into core semantic information, this approach significantly reduces inference time. We also develop a hybrid knowledge-memory mechanism that integrates micro-trajectories, macro-experience, and hierarchical domain knowledge, thereby enhancing both strategic adaptability and decision consistency. Experiments across multiple StarCraft II maps demonstrate that SEMA achieves superior win rates while reducing average decision latency by over 50%, validating its efficiency and robustness in complex RTS scenarios.
[MA-5] SCoOP: Semantic Consistent Opinion Pooling for Uncertainty Quantification in Multiple Vision-Language Model Systems ICLR2024
[Quick Read]: This paper addresses the amplified uncertainty and hallucination risk that arise when aggregating the heterogeneous outputs of multiple vision-language models (VLMs). Existing uncertainty quantification (UQ) methods target single models and fail to capture system-level uncertainty in multi-model ensembles. The proposed Semantic-Consistent Opinion Pooling (SCoOP) framework is training-free; its key idea is an uncertainty-weighted linear opinion pooling mechanism that explicitly models collective uncertainty across multiple VLMs, enabling effective hallucination detection and abstention on highly uncertain samples while adding only microsecond-level computational overhead, and substantially outperforming existing baselines.
Link: https://arxiv.org/abs/2603.23853
Authors: Chung-En Johnny Yu, Brian Jalaian, Nathaniel D. Bastian
Affiliations: University of West Florida; United States Military Academy
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Accepted to ICLR 2024 Workshop on Agentic AI in the Wild: From Hallucinations to Reliable Autonomy
Abstract: Combining multiple Vision-Language Models (VLMs) can enhance multimodal reasoning and robustness, but aggregating heterogeneous models' outputs amplifies uncertainty and increases the risk of hallucinations. We propose SCoOP (Semantic-Consistent Opinion Pooling), a training-free uncertainty quantification (UQ) framework for multi-VLM systems based on uncertainty-weighted linear opinion pooling. Unlike prior UQ methods designed for single models, SCoOP explicitly measures collective, system-level uncertainty across multiple VLMs, enabling effective hallucination detection and abstention for highly uncertain samples. On ScienceQA, SCoOP achieves an AUROC of 0.866 for hallucination detection, outperforming baselines (0.732-0.757) by approximately 10-13%. For abstention, it attains an AURAC of 0.907, exceeding baselines (0.818-0.840) by 7-9%. Despite these gains, SCoOP introduces only microsecond-level aggregation overhead relative to the baselines, which is trivial compared to typical VLM inference time (on the order of seconds). These results demonstrate that SCoOP provides an efficient and principled mechanism for uncertainty-aware aggregation, advancing the reliability of multimodal AI systems.
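Uncertainty-weighted linear opinion pooling, the aggregation rule SCoOP is built on, can be sketched as follows. This is a generic illustration under the assumption that weights decrease with each model's predictive entropy; the two VLM output distributions are invented toy values, not the paper's.

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def pool(distributions):
    """Uncertainty-weighted linear opinion pool: lower-entropy
    (more confident) models receive proportionally larger weights."""
    max_h = math.log(len(distributions[0]))  # entropy of the uniform dist.
    weights = [max_h - entropy(p) + 1e-9 for p in distributions]
    z = sum(weights)
    weights = [w / z for w in weights]
    k = len(distributions[0])
    return [sum(w * p[i] for w, p in zip(weights, distributions))
            for i in range(k)]

# Two hypothetical VLMs on a 3-way question: one confident, one unsure.
vlm_a = [0.8, 0.1, 0.1]
vlm_b = [0.4, 0.3, 0.3]
pooled = pool([vlm_a, vlm_b])  # dominated by the confident model
```

The pooled distribution's own entropy can then serve as the system-level uncertainty score used for abstention.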
[MA-6] Dual-Gated Epistemic Time-Dilation: Autonomous Compute Modulation in Asynchronous MARL
[Quick Read]: This paper addresses the heavy compute load of deploying multi-agent reinforcement learning (MARL) on edge devices: under synchronous execution, every agent must run deep neural network inference at every micro-frame, straining energy and thermal budgets. The proposed Epistemic Time-Dilation MAPPO (ETD-MAPPO) centers on a Dual-Gated Epistemic Trigger that modulates each agent's execution frequency from the Shannon entropy of its policy (aleatoric uncertainty) and the state-value divergence of a twin-critic architecture (epistemic uncertainty), enabling asynchronous decision-making. The environment is modeled as a Semi-Markov Decision Process (SMDP), and an SMDP-aligned asynchronous gradient-masking critic ensures correct credit assignment. On the Google Research Football (GRF) benchmark, the method cuts computational overhead by 73.6% without degrading centralized task dominance, and emergent Temporal Role Specialization is observed.
Link: https://arxiv.org/abs/2603.23722
Authors: Igor Jankowski
Affiliations: unknown
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Comments: 14 pages, 5 figures. Code available at: this https URL . Related materials available on Zenodo: https://doi.org/10.5281/zenodo.19206838
Abstract:While Multi-Agent Reinforcement Learning (MARL) algorithms achieve unprecedented successes across complex continuous domains, their standard deployment strictly adheres to a synchronous operational paradigm. Under this paradigm, agents are universally forced to execute deep neural network inferences at every micro-frame, regardless of immediate necessity. This dense throughput acts as a fundamental barrier to physical deployment on edge-devices where thermal and metabolic budgets are highly constrained. We propose Epistemic Time-Dilation MAPPO (ETD-MAPPO), augmented with a Dual-Gated Epistemic Trigger. Instead of depending on rigid frame-skipping (macro-actions), agents autonomously modulate their execution frequency by interpreting aleatoric uncertainty (via Shannon entropy of their policy) and epistemic uncertainty (via state-value divergence in a Twin-Critic architecture). To format this, we structure the environment as a Semi-Markov Decision Process (SMDP) and build the SMDP-Aligned Asynchronous Gradient Masking Critic to ensure proper credit assignment. Empirical findings demonstrate massive improvements ( 60% relative baseline acquisition leaps) over current temporal models. By assessing LBF, MPE, and the 115-dimensional state space of Google Research Football (GRF), ETD correctly prevented premature policy collapse. Remarkably, this unconstrained approach leads to emergent Temporal Role Specialization, reducing computational overhead by a statistically dominant 73.6% entirely during off-ball execution without deteriorating centralized task dominance.
[MA-7] Engagement-Zone-Aware Input-Constrained Guidance for Safe Target Interception in Contested Environments
[Quick Read]: This paper considers how an attacker can reliably intercept a target while staying safe in a contested environment with multiple defenders. Conventional methods impose conservative stand-off constraints based on maximum engagement distance and ignore the interceptors' actuator limits, yielding overly conservative, impractical control. The key ideas are: (i) using defender-induced engagement zones (EZs) as dynamic safety boundaries, with a time-varying safe-set tightening parameter compensating for transient constraint violations caused by actuator saturation; (ii) constructing a smooth aggregate safety function via a log-sum-exp operator so that safety enforcement scales to many defenders; and (iii) a smooth switching guidance strategy that pursues the target directly when far from threat boundaries and progressively activates evasive maneuvers near EZ boundaries. The controller needs only relative measurements and no knowledge of defender control inputs, enabling a distributed, scalable, actuator-constrained safe interception.
Link: https://arxiv.org/abs/2603.23649
Authors: Praveen Kumar Ranjan, Abhinav Sinha, Yongcan Cao
Affiliations: University of Texas at San Antonio; University of Cincinnati
Subjects: Systems and Control (eess.SY); Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments:
Abstract:We address target interception in contested environments in the presence of multiple defenders whose interception capability is limited by finite ranges. Conventional methods typically impose conservative stand-off constraints based on maximum engagement distance and neglect the interceptors’ actuator limitations. Instead, we formulate safety constraints using defender-induced engagement zones. To account for actuator limits, the vehicle model is augmented with input saturation dynamics. A time-varying safe-set tightening parameter is introduced to compensate for transient constraint violations induced by actuator dynamics. To ensure scalable safety enforcement in multi-defender scenarios, a smooth aggregate safety function is constructed using a log-sum-exp operator combining individual threat measures associated with each defender’s capability. A smooth switching guidance strategy is then developed to coordinate interception and safety objectives. The attacker pursues the target when sufficiently distant from threat boundaries and progressively activates evasive motion as the EZ boundaries are approached. The resulting controller relies only on relative measurements and does not require knowledge of defender control inputs, thus facilitating a fully distributed and scalable implementation. Rigorous analysis provides sufficient conditions guaranteeing target interception, practical safety with respect to all defender engagement zones, and satisfaction of actuator bounds. An input-constrained guidance law based on conservative stand-off distance is also developed to quantify the conservatism of maximum-range-based safety formulations. Simulations with stationary and maneuvering defenders demonstrate that the proposed formulation yields shorter interception paths and reduced interception time compared with conventional methods while maintaining safety throughout the engagement.
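The log-sum-exp aggregation of per-defender threat measures described above can be sketched numerically. This is a generic smooth-max construction, not the paper's exact formulation; the threat values and the sharpness parameter `beta` are illustrative assumptions.

```python
import math

def logsumexp_threat(threats, beta=10.0):
    """Smooth, differentiable upper bound on max(threats).
    A larger beta tightens the bound toward the true maximum;
    the max-shift keeps the exponentials numerically stable."""
    m = max(threats)
    return m + math.log(sum(math.exp(beta * (t - m)) for t in threats)) / beta

# Hypothetical per-defender threat values for three defenders.
threats = [0.2, 0.7, 0.5]
agg = logsumexp_threat(threats)
```

The bound satisfies max(threats) ≤ agg ≤ max(threats) + ln(n)/beta, so a single smooth constraint on `agg` conservatively enforces all n individual threat constraints, which is what makes the formulation scale with the number of defenders.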
[MA-8] Plato's Cave: A Human-Centered Research Verification System
[Quick Read]: This paper responds to the urgent need, amid rapidly growing research output, to fact-check information, assess writing quality, and identify unverifiable claims. The proposed Plato's Cave is an open-source, human-centered research verification system whose key idea is to build a directed acyclic graph (DAG) representing a document's argumentative structure, use web agents to assign credibility scores to the DAG's nodes and edges, and produce a final score by interpreting and evaluating the paper's argumentation.
Link: https://arxiv.org/abs/2603.23526
Authors: Matheus Kunzler Maldaner, Raul Valle, Junsung Kim, Tonuka Sultan, Pranav Bhargava, Matthew Maloni, John Courtney, Hoang Nguyen, Aamogh Sawant, Kristian O'Connor, Stephen Wormald, Damon L. Woodard
Affiliations: University of Florida
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments: 15 pages, 4 figures
Abstract:The growing publication rate of research papers has created an urgent need for better ways to fact-check information, assess writing quality, and identify unverifiable claims. We present Plato’s Cave as an open-source, human-centered research verification system that (i) creates a directed acyclic graph (DAG) from a document, (ii) leverages web agents to assign credibility scores to nodes and edges from the DAG, and (iii) gives a final score by interpreting and evaluating the paper’s argumentative structure. We report the system implementation and results on a collected dataset of 104 research papers.
[MA-9] Smooth Routing in Decaying Trees
[Quick Read]: Motivated by extreme events such as flooding or forest fires, this paper studies smoothly scheduling a set of paths in graphs where some edges become impassable at certain times. A schedule is smooth if no two paths occupy the same edge at the same time and the number of paths simultaneously at a vertex never exceeds its capacity. Focusing on trees (in particular stars and paths), the paper proves the problem NP-hard even in these restricted settings. The key algorithmic contribution is an integer linear program (ILP) computing the latest possible evacuation time; the ILP and its relaxation are used to solve artificial instances (where the underlying graph is a path or a star) and semi-artificial instances derived from German cities along rivers, evaluating runtimes and solution quality.
Link: https://arxiv.org/abs/2603.23504
Authors: Till Fluschnik, Amela Pucic, Malte Renken
Affiliations: Humboldt-Universität zu Berlin; Technische Universität Berlin
Subjects: Data Structures and Algorithms (cs.DS); Multiagent Systems (cs.MA)
Comments:
Abstract:Motivated by evacuation scenarios arising in extreme events such as flooding or forest fires, we study the problem of smoothly scheduling a set of paths in graphs where connections become impassable at some point in time. A schedule is smooth if no two paths meet on an edge and the number of paths simultaneously located at a vertex does not exceed its given capacity. We study the computational complexity of the problem when the underlying graph is a tree, in particular a star or a path. We prove that already in these settings, the problem is NP-hard even with further restrictions on the capacities or on the time when all connections ceased. We provide an integer linear program (ILP) to compute the latest possible time to evacuate. Using the ILP and its relaxation, we solve sets of artificial (where each underlying graph forms either a path or star) and semi-artificial instances (where the graphs are obtained from German cities along rivers), study the runtimes, and compare the results of the ILP with those of its relaxation.
Natural Language Processing
[NLP-0] Comparing Developer and LLM Biases in Code Evaluation
[Quick Read]: This paper addresses the evaluation biases that arise when large language models (LLMs) serve as judges of code in realistic interactive settings with partial context and ambiguous intent. The key contribution is TRACE (Tool for Rubric Analysis in Code Evaluation), a framework that both measures how well LLM judges align with developer preferences and automatically extracts rubric items to surface systematic differences in how humans and models weigh them. Applying TRACE across three interaction modalities (chat-based programming, IDE autocompletion, and instructed code editing) shows that the best LLM judges still underperform human annotators by 12-23%, and identifies 35 significant sources of misalignment, most corresponding to established software-engineering code-quality criteria; for example, in chat-based programming, judges favor longer code explanations whereas humans prefer concise ones.
Link: https://arxiv.org/abs/2603.24586
Authors: Aditya Mittal, Ryan Shar, Zichu Wu, Shyam Agarwal, Tongshuang Wu, Chris Donahue, Ameet Talwalkar, Wayne Chi, Valerie Chen
Affiliations: Carnegie Mellon University
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments:
Abstract:As LLMs are increasingly used as judges in code applications, they should be evaluated in realistic interactive settings that capture partial context and ambiguous intent. We present TRACE (Tool for Rubric Analysis in Code Evaluation), a framework that evaluates LLM judges’ ability to predict human preferences and automatically extracts rubric items to reveal systematic biases in how humans and models weigh each item. Across three modalities – chat-based programming, IDE autocompletion, and instructed code editing – we use TRACE to measure how well LLM judges align with developer preferences. Among 13 different models, the best judges underperform human annotators by 12-23%. TRACE identifies 35 significant sources of misalignment between humans and judges across interaction modalities, the majority of which correspond to existing software engineering code quality criteria. For example, in chat-based coding, judges are biased towards longer code explanations while humans prefer shorter ones. We find significant misalignment on the majority of existing code quality dimensions, showing alignment gaps between LLM judges and human preference in realistic coding applications.
[NLP-1] MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination
[Quick Read]: This paper tackles the reliability problems hallucination causes for large language models (LLMs), especially in retrieval-augmented generation (RAG) systems, where existing LLM-as-a-judge detectors suffer from inherent confirmation bias: the verifier can inadvertently reproduce the generator's errors. The proposed Multi-Agent Reinforced Self-Check for Hallucination (MARCH) framework rests on deliberate information asymmetry: a Solver generates the initial response, a Proposer decomposes it into verifiable atomic propositions, and a Checker validates those propositions in isolation, breaking the cycle of self-confirmation bias. Training the pipeline with multi-agent reinforcement learning (MARL) lets the three agents co-evolve and substantially improves factual adherence.
Link: https://arxiv.org/abs/2603.24579
Authors: Zhuo Li, Yupeng Zhang, Pengyu Cheng, Jiajun Song, Mengyu Zhou, Hao Li, Shujie Hu, Yu Qin, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang
Affiliations: Alibaba; The Chinese University of Hong Kong, Shenzhen
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Hallucination remains a critical bottleneck for large language models (LLMs), undermining their reliability in real-world applications, especially in Retrieval-Augmented Generation (RAG) systems. While existing hallucination detection methods employ LLM-as-a-judge to verify LLM outputs against retrieved evidence, they suffer from inherent confirmation bias, where the verifier inadvertently reproduces the errors of the original generation. To address this, we introduce Multi-Agent Reinforced Self-Check for Hallucination (MARCH), a framework that enforces rigorous factual alignment by leveraging deliberate information asymmetry. MARCH orchestrates a collaborative pipeline of three specialized agents: a Solver, a Proposer, and a Checker. The Solver generates an initial RAG response, which the Proposer decomposes into claim-level verifiable atomic propositions. Crucially, the Checker validates these propositions against retrieved evidence in isolation, deprived of the Solver’s original output. This well-crafted information asymmetry scheme breaks the cycle of self-confirmation bias. By training this pipeline with multi-agent reinforcement learning (MARL), we enable the agents to co-evolve and optimize factual adherence. Extensive experiments across hallucination benchmarks demonstrate that MARCH substantially reduces hallucination rates. Notably, an 8B-parameter LLM equipped with MARCH achieves performance competitive with powerful closed-source models. MARCH paves a scalable path for factual self-improvement of LLMs through co-evolution. The code is at this https URL.
[NLP-2] A Sociolinguistic Analysis of Automatic Speech Recognition Bias in Newcastle English
[Quick Read]: This paper investigates the uneven performance of automatic speech recognition (ASR) systems on dialectal variation, focusing on the high error rates for Newcastle English, a regional variety from North-East England. Using spontaneous speech from the Diachronic Electronic Corpus of Tyneside English (DECTE), the study performs fine-grained error classification on the output of a mainstream commercial ASR system and analyzes the errors against sociolinguistic variables (gender, age, socioeconomic status) and phonological features (vowel quality, glottalisation). The key finding is that ASR errors are not random but socially patterned and linguistically explicable: most errors stem from dialect-specific phonology (non-standard vowels, glottalisation) and from local vocabulary and grammar, and error frequencies differ markedly across social groups (higher for men and for speakers at the extremes of the age range). The paper therefore argues for incorporating sociolinguistic expertise into ASR evaluation and development, and for community-based dialectal speech data as a prerequisite for fairer speech technology.
Link: https://arxiv.org/abs/2603.24549
Authors: Dana Serditova, Kevin Tang
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Comments: 54 pages, 11 figures
Abstract:Automatic Speech Recognition (ASR) systems are widely used in everyday communication, education, healthcare, and industry, yet their performance remains uneven across speakers, particularly when dialectal variation diverges from the mainstream accents represented in training data. This study investigates ASR bias through a sociolinguistic analysis of Newcastle English, a regional variety of North-East England that has been shown to challenge current speech recognition technologies. Using spontaneous speech from the Diachronic Electronic Corpus of Tyneside English (DECTE), we evaluate the output of a state-of-the-art commercial ASR system and conduct a fine-grained analysis of more than 3,000 transcription errors. Errors are classified by linguistic domain and examined in relation to social variables including gender, age, and socioeconomic status. In addition, an acoustic case study of selected vowel features demonstrates how gradient phonetic variation contributes directly to misrecognition. The results show that phonological variation accounts for the majority of errors, with recurrent failures linked to dialect-specific features like vowel quality and glottalisation, as well as local vocabulary and non-standard grammatical forms. Error rates also vary across social groups, with higher error frequencies observed for men and for speakers at the extremes of the age spectrum. These findings indicate that ASR errors are not random but socially patterned and can be explained from a sociolinguistic perspective. Thus, the study demonstrates the importance of incorporating sociolinguistic expertise into the evaluation and development of speech technologies and argues that more equitable ASR systems require explicit attention to dialectal variation and community-based speech data. 
ACM classes: I.2; I.2.7; I.5; J.4; J.5. arXiv:2603.24549 [cs.CL], https://doi.org/10.48550/arXiv.2603.24549
[NLP-3] Analysing the Safety Pitfalls of Steering Vectors
[Quick Read]: This paper examines the safety risks of activation steering in generative AI models, particularly its uncontrolled effect on jailbreak attack success. Although activation steering can alter LLM behavior without weight updates, its inherent brittleness can be exploited to significantly degrade model safety. The key contribution is a systematic safety audit, under a unified evaluation protocol, of steering vectors obtained with Contrastive Activation Addition (CAA); the audit finds that these vectors overlap with the latent directions of refusal behavior, explaining why steering in specific directions can raise jailbreak attack success rates by up to 57% or lower them by up to 50%. This exposes a controllability-safety trade-off and provides a traceable mechanism for understanding and mitigating LLM safety gaps.
Link: https://arxiv.org/abs/2603.24543
Authors: Yuxiao Li, Alina Fastowski, Efstratios Zaradoukas, Bardh Prenkaj, Gjergji Kasneci
Affiliations: Technical University of Munich; Munich Center for Machine Learning
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments:
Abstract:Activation steering has emerged as a powerful tool to shape LLM behavior without the need for weight updates. While its inherent brittleness and unreliability are well-documented, its safety implications remain underexplored. In this work, we present a systematic safety audit of steering vectors obtained with Contrastive Activation Addition (CAA), a widely used steering approach, under a unified evaluation protocol. Using JailbreakBench as benchmark, we show that steering vectors consistently influence the success rate of jailbreak attacks, with stronger amplification under simple template-based attacks. Across LLM families and sizes, steering the model in specific directions can drastically increase (up to 57%) or decrease (up to 50%) its attack success rate (ASR), depending on the targeted behavior. We attribute this phenomenon to the overlap between the steering vectors and the latent directions of refusal behavior. Thus, we offer a traceable explanation for this discovery. Together, our findings reveal the previously unobserved origin of this safety gap in LLMs, highlighting a trade-off between controllability and safety.
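The CAA steering vectors audited above are, in essence, differences of mean activations over contrastive prompt pairs, added to the residual stream at inference time. The sketch below illustrates that construction on toy numbers; the activations, dimensionality, and scaling factor `alpha` are all hypothetical, and a real implementation would hook into a specific transformer layer.

```python
def caa_steering_vector(pos_acts, neg_acts):
    """Contrastive Activation Addition: the steering vector is the mean
    activation on behavior-positive prompts minus the mean activation
    on behavior-negative prompts."""
    d = len(pos_acts[0])
    mean_pos = [sum(a[i] for a in pos_acts) / len(pos_acts) for i in range(d)]
    mean_neg = [sum(a[i] for a in neg_acts) / len(neg_acts) for i in range(d)]
    return [p - n for p, n in zip(mean_pos, mean_neg)]

def apply_steering(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to a residual-stream activation."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

# Toy 4-dimensional activations from two contrastive prompt pairs.
pos = [[1.0, 0.5, 0.0, 0.2], [0.8, 0.7, 0.1, 0.4]]
neg = [[-0.2, 0.1, 0.0, -0.1], [0.0, -0.1, 0.2, 0.1]]
vec = caa_steering_vector(pos, neg)          # ≈ [1.0, 0.6, -0.05, 0.3]
steered = apply_steering([0.0] * 4, vec, alpha=2.0)
```

The paper's finding is then that such a `vec` can have a large component along the model's refusal direction, so adding it shifts jailbreak susceptibility as a side effect.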
[NLP-4] Representation Learning to Study Temporal Dynamics in Tutorial Scaffolding
[Quick Read]: This paper addresses the key problem of measuring adaptive scaffolding in authentic tutoring dialogue, which has become pressing with the rise of remote human tutoring and LLM-based tutoring systems. The core of the solution is an embedding-based method that quantifies scaffolding dynamics via the semantic alignment (cosine similarity) between tutor and student turns, problem statements, and correct solutions. The approach captures how role-specific semantic alignment changes over time and shows that this alignment significantly predicts tutorial progression beyond baseline features such as message order and length.
Link: https://arxiv.org/abs/2603.24535
Authors: Conrad Borchers, Jiayi Zhang, Ashish Gurung
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Accepted as short paper to the 27th International Conference on Artificial Intelligence in Education (AIED 2026)
Abstract:Adaptive scaffolding enhances learning, yet the field lacks robust methods for measuring it within authentic tutoring dialogue. This gap has become more pressing with the rise of remote human tutoring and large language model-based systems. We introduce an embedding-based approach that analyzes scaffolding dynamics by aligning the semantics of dialogue turns, problem statements, and correct solutions. Specifically, we operationalize alignment by computing cosine similarity between tutor and student contributions and task-relevant content. We apply this framework to 1,576 real-world mathematics tutoring dialogues from the Eedi Question Anchored Tutoring Dialogues dataset. The analysis reveals systematic differences in task alignment and distinct temporal patterns in how participants ground their contributions in problem and solution content. Further, mixed-effects models show that role-specific semantic alignment predicts tutorial progression beyond baseline features such as message order and length. Tutor contributions exhibited stronger grounding in problem content early in interactions. In contrast, student solution alignment was modestly positively associated with progression. These findings support scaffolding as a continuous, role-sensitive process grounded in task semantics. By capturing role-specific alignment over time, this approach provides a principled method for analyzing instructional dialogue and evaluating conversational tutoring systems.
[NLP-5] Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA
【速读】: 该论文旨在解决医疗领域人工智能(AI)系统中置信度评分校准不足的问题,即模型在临床应用中常表现出过度自信,导致无法提供有效的决策置信信号用于任务拒答(deferral)。解决方案的关键在于提出一种多智能体框架,通过引入领域特异的专科智能体(如呼吸科、心血管科等)结合两阶段自验证机制与S-score加权融合策略,在提升诊断准确性的同时显著改善置信度校准性能。其中,两阶段自验证机制是校准优化的核心驱动因素,而多智能体协同推理则主要提升了整体准确率,最终在多个医学多选题数据集上实现ECE(期望校准误差)降低49–74%,为安全关键场景下的临床AI应用提供了可靠的不确定性估计和可解释的置信信号。
链接: https://arxiv.org/abs/2603.24481
作者: John Ray B. Martinez
机构: Harrisburg University of Science and Technology (哈里斯堡科技大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 17 pages, 6 figures. Preprint under review
Abstract:Miscalibrated confidence scores are a practical obstacle to deploying AI in clinical settings. A model that is always overconfident offers no useful signal for deferral. We present a multi-agent framework that combines domain-specific specialist agents with Two-Phase Verification and S-Score Weighted Fusion to improve both calibration and discrimination in medical multiple-choice question answering. Four specialist agents (respiratory, cardiology, neurology, gastroenterology) generate independent diagnoses using Qwen2.5-7B-Instruct. Each diagnosis is then subjected to a two-phase self-verification process that measures internal consistency and produces a Specialist Confidence Score (S-score). The S-scores drive a weighted fusion strategy that selects the final answer and calibrates the reported confidence. We evaluate across four experimental settings, covering 100-question and 250-question high-disagreement subsets of both MedQA-USMLE and MedMCQA. Calibration improvement is the central finding, with ECE reduced by 49-74% across all four settings, including the harder MedMCQA benchmark where these gains persist even when absolute accuracy is constrained by knowledge-intensive recall demands. On MedQA-250, the full system achieves ECE = 0.091 (74.4% reduction over the single-specialist baseline) and AUROC = 0.630 (+0.056) at 59.2% accuracy. Ablation analysis identifies Two-Phase Verification as the primary calibration driver and multi-agent reasoning as the primary accuracy driver. These results establish that consistency-based verification produces more reliable uncertainty estimates across diverse medical question types, providing a practical confidence signal for deferral in safety-critical clinical AI applications.
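摘要中作为核心指标的期望校准误差(ECE),其标准分箱计算方式可示意如下(分箱数等参数为常见默认值,并非论文设定;示例数据为虚构):

```python
import numpy as np

def expected_calibration_error(confs, correct, n_bins=10):
    """ECE:按置信度分箱,累加各箱 |箱内准确率 - 箱内平均置信度| 的样本占比加权和。"""
    confs = np.asarray(confs, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confs > lo) & (confs <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confs[mask].mean())
    return ece

# 完美校准:置信度 0.75 的样本恰有 75% 正确,ECE 为 0
confs = [0.75] * 4
correct = [1, 1, 1, 0]
ece_perfect = expected_calibration_error(confs, correct)

# 过度自信:置信度 0.95 而准确率仍为 0.75,ECE 约为 0.2
ece_over = expected_calibration_error([0.95] * 4, correct)
```

模型"始终过度自信"正对应 ece_over 这种情形:置信度无法为任务拒答(deferral)提供有效信号。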
[NLP-6] Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?
【速读】: 该论文旨在解决自蒸馏(self-distillation)在大语言模型(LLM)数学推理任务中导致性能下降的问题。研究表明,尽管自蒸馏通常能缩短推理路径并提升性能,但在数学推理场景下其反而会因抑制“认知不确定性表达”(epistemic verbalization)而损害模型表现。解决方案的关键在于:通过丰富教师模型的条件上下文(conditioning context),可有效抑制不确定性表达,从而实现有限任务覆盖下的快速域内优化;但若缺乏对不确定性的适当暴露,则会显著削弱模型在分布外(OOD)任务中的适应能力。因此,论文强调,在优化推理行为时,需平衡正确答案路径的强化与合理不确定性表达的保留,以确保推理鲁棒性。
链接: https://arxiv.org/abs/2603.24472
作者: Jeonghye Kim,Xufang Luo,Minbeom Kim,Sangmook Lee,Dohyung Kim,Jiwon Jeon,Dongsheng Li,Yuqing Yang
机构: Microsoft Research (微软研究院); KAIST (韩国科学技术院); Seoul National University (首尔国立大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Self-distillation has emerged as an effective post-training paradigm for LLMs, often improving performance while shortening reasoning traces. However, in mathematical reasoning, we find that it can reduce response length while degrading performance. We trace this degradation to the suppression of epistemic verbalization - the model’s expression of uncertainty during reasoning. Through controlled experiments varying conditioning context richness and task coverage, we show that conditioning the teacher on rich information suppresses uncertainty expression, enabling rapid in-domain optimization with limited task coverage but harming OOD performance, where unseen problems benefit from expressing uncertainty and adjusting accordingly. Across Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, we observe performance drops of up to 40%. Our findings highlight that exposing appropriate levels of uncertainty is crucial for robust reasoning and underscore the importance of optimizing reasoning behavior beyond merely reinforcing correct answer traces.
[NLP-7] Counting Without Numbers Finding Without Words
【速读】: 该论文旨在解决流浪动物与主人难以重聚的问题,当前系统仅依赖视觉识别,而忽略了动物通过声音进行身份识别的生物特性。研究表明,70%的走失宠物未能找回并非因为缺乏匹配对象,而是因现有技术未考虑声学特征。解决方案的关键在于提出首个融合视觉与声学生物特征的多模态重聚系统,其核心是基于物种适应性的架构,能够处理从10Hz大象低鸣到4kHz幼犬呜咽等广泛频率的声音,并结合容忍应激导致外观变化的概率视觉匹配,从而实现更贴近生物通信原理的AI重建机制。
链接: https://arxiv.org/abs/2603.24470
作者: Badri Narayana Patro
机构: Microsoft(微软)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:
Abstract:Every year, 10 million pets enter shelters, separated from their families. Despite desperate searches by both guardians and lost animals, 70% never reunite, not because matches do not exist, but because current systems look only at appearance, while animals recognize each other through sound. We ask, why does computer vision treat vocalizing species as silent visual objects? Drawing on five decades of cognitive science showing that animals perceive quantity approximately and communicate identity acoustically, we present the first multimodal reunification system integrating visual and acoustic biometrics. Our species-adaptive architecture processes vocalizations from 10Hz elephant rumbles to 4kHz puppy whines, paired with probabilistic visual matching that tolerates stress-induced appearance changes. This work demonstrates that AI grounded in biological communication principles can serve vulnerable populations that lack human language.
[NLP-8] Mechanic: Sorry-Driven Formal Decomposition Workflow for Automated Theorem Proving
【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的自动定理证明系统在处理复杂数学推理问题时,因首次尝试失败而需反复修改证明策略所引发的效率瓶颈问题。具体而言,现有方法要么完全丢弃已有推理过程重新生成证明(导致局部错误引发整体浪费),要么逐次修复错误(导致上下文过长并削弱模型对未解决问题的关注能力)。解决方案的关键在于提出一种名为Mechanic的新颖代理系统,其核心创新是采用“sorry-driven formal decomposition”策略:利用Lean定理证明器中的sorry占位符精准隔离未解决的子目标,同时保留已验证的证明结构,从而将每个失败的子问题提取为独立、自包含的上下文进行单独求解,有效避免了全量重生成与上下文膨胀的双重弊端。
链接: https://arxiv.org/abs/2603.24465
作者: Ruichen Qiu,Yichuan Cao,Junqi Liu,Dakai Guo,Xiao-Shan Gao,Lihong Zhi,Ruyong Feng
机构: Academy of Mathematics and Systems Science, CAS; School of Advanced Interdisciplinary Sciences, UCAS; School of Mathematical Science, UCAS
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent advances in large language models (LLMs) and LLM-based agents have substantially improved the capabilities of automated theorem proving. However, for problems requiring complex mathematical reasoning, current systems rarely succeed on the first try and must repeatedly modify their proof strategies. Existing approaches for handling failed attempts typically either discard the entire proof and regenerate it from scratch or iteratively fix errors within the proof. The former is inefficient, as it may abandon mostly correct reasoning due to localized errors, while the latter, although preserving prior progress, leads to progressively longer contexts which progressively degrades the model’s ability to attend to the remaining unresolved subproblems. To address this dilemma, we propose Mechanic, a novel agent system that employs a sorry-driven formal decomposition strategy. By leveraging the sorry placeholder in Lean to precisely isolate unresolved subgoals while preserving the surrounding verified proof structure, Mechanic extracts each failed subproblem into a clean, self-contained context and resolves it independently. This avoids both the waste of full regeneration and the excessive context length induced by repeated repairs. Experimental results on challenging mathematical competition benchmarks, including IMO 2025 and Putnam 2025, demonstrate that our agent achieves significant advantages in proving efficiency.
[NLP-9] What and When to Learn: CURriculum Ranking Loss for Large-Scale Speaker Verification
【速读】: 该论文旨在解决大规模语音验证(speaker verification)中因固定边距损失函数对所有样本一视同仁而导致的性能瓶颈问题,特别是由误标注或退化样本引入的噪声梯度干扰紧凑说话人流形结构的问题。其解决方案的关键在于提出一种自适应损失函数 Curry(CURriculum Ranking),通过 Sub-center ArcFace 模型在线估计样本难度:利用主导子中心余弦相似度的置信度分数,结合运行批次统计量将样本分为易、中、难三类,并据此动态调整学习权重,引导模型从稳定的身份基础逐步过渡到流形细化与边界锐化阶段,从而实现对不完美大规模数据的鲁棒建模。
链接: https://arxiv.org/abs/2603.24432
作者: Massa Baali,Sarthak Bisht,Rita Singh,Bhiksha Raj
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Sound (cs.SD); Computation and Language (cs.CL)
备注:
Abstract:Speaker verification at large scale remains an open challenge as fixed-margin losses treat all samples equally regardless of quality. We hypothesize that mislabeled or degraded samples introduce noisy gradients that disrupt compact speaker manifolds. We propose Curry (CURriculum Ranking), an adaptive loss that estimates sample difficulty online via Sub-center ArcFace: confidence scores from dominant sub-center cosine similarity rank samples into easy, medium, and hard tiers using running batch statistics, without auxiliary annotations. Learnable weights guide the model from stable identity foundations through manifold refinement to boundary sharpening. To our knowledge, this is the largest-scale speaker verification system trained to date. Evaluated on VoxCeleb1-O, and SITW, Curry reduces EER by 86.8% and 60.0% over the Sub-center ArcFace baseline, establishing a new paradigm for robust speaker verification on imperfect large-scale data.
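Curry 按子中心余弦相似度的置信度分数,结合运行批次统计把样本划入易/中/难三档,其划分逻辑可示意如下(具体阈值规则摘要未给出,以下"均值 ± 标准差"的分档方式与全部数值均为假设):

```python
import numpy as np

def tier_samples(conf_scores, running_mean, running_std):
    """按运行批次统计(均值 ± 标准差)把样本划入 easy / medium / hard 三档(示意规则)。"""
    tiers = []
    for s in conf_scores:
        if s >= running_mean + running_std:
            tiers.append("easy")      # 高置信度:干净、易学样本
        elif s <= running_mean - running_std:
            tiers.append("hard")      # 低置信度:可能是误标注或退化样本
        else:
            tiers.append("medium")
    return tiers

# 示意:四个样本的主导子中心余弦相似度置信分数
scores = np.array([0.95, 0.55, 0.10, 0.60])
tiers = tier_samples(scores, running_mean=0.55, running_std=0.2)
```

课程式训练即按这些档位为样本分配可学习的损失权重,先学稳定的"易"样本,再逐步细化流形与锐化边界。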
[NLP-10] PINGALA: Prosody-Aware Decoding for Sanskrit Poetry Generation
【速读】: 该论文旨在解决梵文诗歌生成中语义连贯性与格律规则难以兼顾的问题。传统方法将诗句视为整体序列进行建模,导致语义连贯性较低;而本文提出的关键解决方案是:首先将诗句按行分组(grouped-lines)进行解码,以提升语义一致性(提升约10%),并通过偏好较长token的词元选择策略增强每行的词汇完整性;其次引入基于音位的转写方案SLP1,显著改善音节权重匹配度(提升46%),从而在保持语义相似性的同时强化格律准确性;此外,还提出一种无需参考文本的交叉编码器评估方法,实现对诗歌质量更贴近真实实例的无监督量化评估。
链接: https://arxiv.org/abs/2603.24413
作者: Manoj Balaji Jagadeeshan,Atul Singh,Nallani Chakravartula Sahith,Amrith Krishna,Pawan Goyal
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Poetry generation in Sanskrit typically requires the verse to be semantically coherent and adhere to strict prosodic rules. In Sanskrit prosody, every line of a verse is typically a fixed-length sequence of syllables adhering to prescribed binary patterns of syllable weights. We observe that instead of treating a verse as a monolithic sequence, segmenting it as grouped-lines leads to significant improvement in semantic coherence by 10% with comparable metrical adherence. Specifically, PINGALA, our proposed decoding approach, is designed to encourage every line to have well-formed words, and our token selection biases the model towards it by preferring longer tokens. Writing in Sanskrit follows phonemic orthography, hence using a phonetically aware transliteration scheme, SLP1, increased the metrical alignment by 46% with comparable semantic similarity, for an instruction-fine-tuned large language model like Phi-4. We also introduce a new approach for reference-free evaluation using cross-encoders, which achieved better alignment with true poetry instances.
[NLP-11] When AI Meets Early Childhood Education: Large Language Models as Assessment Teammates in Chinese Preschools
【速读】: 该论文旨在解决大规模早期儿童教育场景中高质量师生互动(Teacher-Child Interaction, TCI)评估的可扩展性难题,尤其是在中国覆盖超3600万儿童、25万个幼儿园的大规模体系下,传统人工观察方式因成本高、耗时长而难以实现持续质量监测,仅能依赖低频次的专家审计,限制了及时干预与改进追踪。其解决方案的关键在于构建了一个名为Interaction2Eval的专用大语言模型(Large Language Model, LLM)框架,通过处理领域特定挑战——包括儿童语音识别、普通话同音字歧义消解以及基于评分量表的推理机制——实现了与人类专家判断高达88%的一致性,并在43个班级中部署验证,使评估流程效率提升18倍,从而推动从年度专家审核向月度AI辅助监测转变,为建立持续、包容且公平的学前教育质量提升新范式奠定基础。
链接: https://arxiv.org/abs/2603.24389
作者: Xingming Li,Runke Huang,Yanan Bao,Yuye Jin,Yuru Jiao,Qingyong Hu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Accepted to AIED 2026, Project page: this https URL
Abstract:High-quality teacher-child interaction (TCI) is fundamental to early childhood development, yet traditional expert-based assessment faces a critical scalability challenge. In large systems like China’s-serving 36 million children across 250,000+ kindergartens-the cost and time requirements of manual observation make continuous quality monitoring infeasible, relegating assessment to infrequent episodic audits that limit timely intervention and improvement tracking. In this paper, we investigate whether AI can serve as a scalable assessment teammate by extracting structured quality indicators and validating their alignment with human expert judgments. Our contributions include: (1) TEPE-TCI-370h (Tracing Effective Preschool Education), the first large-scale dataset of naturalistic teacher-child interactions in Chinese preschools (370 hours, 105 classrooms) with standardized ECQRS-EC and SSTEW annotations; (2) We develop Interaction2Eval, a specialized LLM-based framework addressing domain-specific challenges-child speech recognition, Mandarin homophone disambiguation, and rubric-based reasoning-achieving up to 88% agreement; (3) Deployment validation across 43 classrooms demonstrating an 18x efficiency gain in the assessment workflow, highlighting its potential for shifting from annual expert audits to monthly AI-assisted monitoring with targeted human oversight. This work not only demonstrates the technical feasibility of scalable, AI-augmented quality assessment but also lays the foundation for a new paradigm in early childhood education-one where continuous, inclusive, AI-assisted evaluation becomes the engine of systemic improvement and equitable growth. 
[NLP-12] Towards Reward Modeling for AI Tutors in Math Mistake Remediation
【速读】: 该论文旨在解决生成式 AI(Generative AI)辅导系统中教学质量评估难题,特别是如何有效衡量AI助教在错误纠正任务中的教学能力,如识别错误、引导推理和避免直接暴露答案等关键教学行为。解决方案的关键在于:首先基于人类成对偏好数据构建了一个教学维度层级结构,并设计出最小差异的响应对(minimally contrastive response pairs),这些响应对在特定教学属性(如错误识别、定位、针对性、支架支持、可操作性、清晰度和连贯性)上存在明确差异;其次,利用MRBench数据、合成响应对及其加权组合训练了Bradley-Terry偏好模型,仅用0.5B参数的小型骨干模型即实现了高达0.74的成对偏好准确率,显著优于更大规模的通用奖励模型,证明了合成数据与结构化偏好建模的有效性。
链接: https://arxiv.org/abs/2603.24375
作者: Kseniia Petukhova,Ekaterina Kochmar
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Evaluating the pedagogical quality of AI tutors remains challenging: standard NLG metrics do not determine whether responses identify mistakes, scaffold reasoning, or avoid revealing the answers. For the task of mistake remediation, we derive a hierarchy of pedagogical aspects from human pairwise preferences on MRBench, and synthesize minimally contrastive response pairs that differ along key aspects (e.g., mistake identification and location, targetedness, scaffolding, actionability, clarity, and coherence). We develop and release Bradley-Terry preference models trained on weighted-sum rankings that we automatically create from MRBench, synthetic pairs, and data combinations. Using only synthetic data, our best model reaches 0.69 pairwise accuracy on a human preference test, and combining weighted-sum data with targeted synthetic groups improves accuracy to 0.74, outperforming larger general-purpose reward models while using only a 0.5B-parameter backbone.
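Bradley-Terry 偏好模型的核心训练目标,可用如下最小示意说明(仅演示损失函数的形式:偏好响应得分高于被拒响应时损失小,反之损失大;并非论文实现,打分数值为虚构):

```python
import math

def bt_loss(score_chosen, score_rejected):
    """Bradley-Terry 偏好损失:-log σ(s_chosen - s_rejected),分差越大损失越小。"""
    return -math.log(1.0 / (1.0 + math.exp(-(score_chosen - score_rejected))))

good = bt_loss(2.0, 0.0)   # 模型给偏好响应打分更高:损失小
bad = bt_loss(0.0, 2.0)    # 模型给被拒响应打分更高:损失大
tie = bt_loss(0.0, 0.0)    # 无法区分时损失为 log 2
```

训练时,打分模型对"最小差异响应对"中体现更好教学属性(如错误识别、支架支持)的一侧应给出更高分。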
[NLP-13] Improving Lean4 Autoformalization via Cycle Consistency Fine-tuning
【速读】: 该论文旨在解决自然语言数学文本到形式化证明语言(如Lean4)的自动转换问题,即生成式AI (Generative AI) 在数学推理中的autoformalization任务,以加速AI辅助数学研究,包括证明验证和证明搜索。其核心解决方案是基于LoRA(Low-Rank Adaptation)微调Qwen3.5-2B模型,并对比三种训练策略:带课程学习(curriculum learning,难度从1到10)的监督微调(SFT)、无课程顺序的SFT,以及使用组相对策略优化(GRPO)的强化学习(RL),其中奖励函数采用循环一致性(cycle consistency)指标——通过计算自然语言到Lean4再回到自然语言的句子嵌入余弦相似度来衡量语义保真度。实验表明,强化学习在未见数据集FineLeanCorpus(FLC)和PutnamBench上显著优于两种SFT方法(平均循环一致性分别为0.669 vs. 0.513 和 0.561 vs. 0.422),同时仅增加0.011纳特交叉熵损失,说明其在提升语义保留能力的同时对形式化质量影响极小,且课程学习并未带来可测量的优势。
链接: https://arxiv.org/abs/2603.24372
作者: Arsen Shebzukhov
机构: Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL)
备注: 10 pages, 10 figures, pages 10-27 appendix
Abstract:Autoformalization - automatically translating natural language mathematical texts into formal proof language such as Lean4 - can help accelerate AI-assisted mathematical research, be it via proof verification or proof search. I fine-tune Qwen3.5-2B with LoRA for natural language to Lean4 formalization on FineLeanCorpus and consider three training regimes: supervised fine-tuning (SFT) with curriculum learning (difficulty 1 to 10), SFT without curriculum ordering, and reinforcement learning using group relative policy optimization (GRPO) with a cycle consistency reward. Cycle consistency measures how well the meaning of a statement is preserved through a NL to Lean4 to NL’ loop, computed as cosine similarity of off-the-shelf sentence embeddings. On an unseen subset of FineLeanCorpus (FLC) and on PutnamBench, RL substantially outperforms both SFT variants (mean cycle consistency 0.669 vs. 0.513 on FLC; 0.561 vs. 0.422 on PutnamBench), while increasing cross-entropy loss by only 0.011 nats, with minimal impact on formalization quality. Curriculum ordering provides no measurable benefit over shuffled training.
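论文用作强化学习奖励的循环一致性分数,即原始自然语言与 NL→Lean4→NL' 回译文本的句向量余弦相似度,可示意如下(实际系统使用现成句向量模型产生嵌入,此处以手工构造的低维向量代替):

```python
import numpy as np

def cycle_consistency_reward(emb_nl, emb_nl_back):
    """循环一致性奖励:原文嵌入与回译文本嵌入的余弦相似度,度量语义保真度。"""
    num = float(np.dot(emb_nl, emb_nl_back))
    den = float(np.linalg.norm(emb_nl) * np.linalg.norm(emb_nl_back))
    return num / den

e_orig = np.array([0.6, 0.8, 0.0])    # 原始自然语言陈述的嵌入(示意)
e_good = np.array([0.6, 0.8, 0.05])   # 语义基本保留的回译
e_bad  = np.array([0.0, 0.1, 1.0])    # 语义漂移严重的回译

r_good = cycle_consistency_reward(e_orig, e_good)
r_bad = cycle_consistency_reward(e_orig, e_bad)
```

GRPO 即以该分数为奖励信号:语义在一轮"形式化—回译"循环后保留越完整,奖励越高。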
[NLP-14] GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents
【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在3D环境中对多智能体交互的感知与推理能力评估不足的问题,特别是针对自主代理(agent)从第一人称视角理解快速状态变化、准确归因动作主体以及推理并发多智能体行为的能力缺乏有效评测基准。其解决方案的关键在于提出GameplayQA框架,通过密集标注多人3D游戏视频(每秒1.22个标签),构建时间同步、并发的语义描述体系,围绕“自我(Self)、其他智能体(Other Agents)和世界(World)”三元结构组织状态、动作与事件信息,并基于此生成2.4K诊断性问答对,涵盖三个认知复杂度层级及结构化的干扰项分类,从而实现对模型在时序一致性、跨视频定位、角色归属和决策密度处理等关键维度上的细粒度分析。
链接: https://arxiv.org/abs/2603.24329
作者: Yunzhe Wang,Runhui Xu,Kexin Zheng,Tianyi Zhang,Jayavibhav Niranjan Kogundi,Soham Hans,Volkan Ustun
机构: University of Southern California (南加州大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal LLMs are increasingly deployed as perceptual backbones for autonomous agents in 3D environments, from robotics to virtual worlds. These applications require agents to perceive rapid state changes, attribute actions to the correct entities, and reason about concurrent multi-agent behaviors from a first-person perspective, capabilities that existing benchmarks do not adequately evaluate. We introduce GameplayQA, a framework for evaluating agentic-centric perception and reasoning through video understanding. Specifically, we densely annotate multiplayer 3D gameplay videos at 1.22 labels/second, with time-synced, concurrent captions of states, actions, and events structured around a triadic system of Self, Other Agents, and the World, a natural decomposition for multi-agent environments. From these annotations, we refined 2.4K diagnostic QA pairs organized into three levels of cognitive complexity, accompanied by a structured distractor taxonomy that enables fine-grained analysis of where models hallucinate. Evaluation of frontier MLLMs reveals a substantial gap from human performance, with common failures in temporal and cross-video grounding, agent-role attribution, and handling the decision density of the game. We hope GameplayQA stimulates future research at the intersection of embodied AI, agentic perception, and world modeling.
[NLP-15] Samasāmayik: A Parallel Dataset for Hindi-Sanskrit Machine Translation
【速读】: 该论文旨在解决当代印度语言中梵文(Sanskrit)与印地语(Hindi)之间机器翻译(Machine Translation, MT)资源匮乏的问题,尤其是现有数据多集中于古典文献和诗歌,缺乏对现代语境下文本的覆盖。解决方案的关键在于构建并发布了一个全新的大规模平行语料库——Samasāmayik,包含92,196句平行句子,其数据来源涵盖口语教程、儿童杂志、广播对话及说明材料等当代内容,从而填补了低资源场景下梵文-印地语翻译的数据空白。通过在ByT5、NLLB和IndicTrans-v2三种模型上进行微调实验,验证了该语料库在领域内测试集上的显著性能提升,同时保持与其他通用测试集相当的表现,确立了该语料库作为现代梵文-印地语翻译的新基准。
链接: https://arxiv.org/abs/2603.24307
作者: N J Karthika,Keerthana Suryanarayanan,Jahanvi Purohit,Ganesh Ramakrishnan,Jitin Singla,Anil Kumar Gourishetty
机构: Indian Institute of Technology Bombay(印度理工学院孟买分校); Geakminds Technologies Private Limited(Geakminds科技有限公司); Indian Institute of Technology Roorkee(印度理工学院鲁尔基分校)
类目: Computation and Language (cs.CL)
备注:
Abstract:We release Samasāmayik, a novel, meticulously curated, large-scale Hindi-Sanskrit corpus, comprising 92,196 parallel sentences. Unlike most data available in Sanskrit, which focuses on classical era text and poetry, this corpus aggregates data from diverse sources covering contemporary materials, including spoken tutorials, children’s magazines, radio conversations, and instruction materials. We benchmark this new dataset by fine-tuning three complementary models - ByT5, NLLB and IndicTrans-v2, to demonstrate its utility. Our experiments demonstrate that models trained on the Samasāmayik corpus achieve significant performance gains on in-domain test data, while achieving comparable performance on other widely used test sets, establishing a strong new performance baseline for contemporary Hindi-Sanskrit translation. Furthermore, a comparative analysis against existing corpora reveals minimal semantic and lexical overlap, confirming the novelty and non-redundancy of our dataset as a robust new resource for low-resource Indic language MT.
[NLP-16] Semantic Alignment across Ancient Egyptian Language Stages via Normalization-Aware Multitask Learning LREC2026
【速读】: 该论文旨在解决古埃及语四个历史阶段之间在词级语义对齐(word-level semantic alignment)中的难题,这些阶段在书写系统和正字法上差异显著,且平行语料稀缺。其核心解决方案是通过联合训练一个紧凑的编码器-解码器模型,使用共享的字节级分词器(byte-level tokenizer),并融合掩码语言建模(MLM)、翻译语言建模(TLM)、序列到序列翻译及词性标注等多种任务,采用任务感知损失函数(task-aware loss)结合固定权重与不确定性缩放机制进行优化。关键创新在于引入拉丁转写(Latin transliteration)和国际音标(IPA)重构作为辅助视图,并通过KL散度一致性约束和嵌入层融合策略增强跨阶段语义对齐能力。实验表明,翻译任务带来最大性能提升,而IPA结合KL一致性有助于改善分支间的对齐效果,尽管整体对齐仍受限,但为历史语言建模提供了可复现的基准与实用指导。
链接: https://arxiv.org/abs/2603.24258
作者: He Huang
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to LREC 2026
Abstract:We study word-level semantic alignment across four historical stages of Ancient Egyptian. These stages differ in script and orthography, and parallel data are scarce. We jointly train a compact encoder-decoder model with a shared byte-level tokenizer on all four stages, combining masked language modeling (MLM), translation language modeling (TLM), sequence-to-sequence translation, and part-of-speech tagging under a task-aware loss with fixed weights and uncertainty-based scaling. To reduce surface divergence we add Latin transliteration and IPA reconstruction as auxiliary views. We integrate these views through KL-based consistency and through embedding-level fusion. We evaluate alignment quality using pairwise metrics, specifically ROC-AUC and triplet accuracy, on curated Egyptian-English and intra-Egyptian cognate datasets. Translation yields the strongest gains. IPA with KL consistency improves cross-branch alignment, while early fusion demonstrates limited efficacy. Although the overall alignment remains limited, the findings provide a reproducible baseline and practical guidance for modeling historical languages under real constraints. They also show how normalization and task design shape what counts as alignment in typologically distant settings.
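摘要中"固定权重与基于不确定性缩放"的任务感知损失,其常见形式(以可学习对数方差为各任务加权)可示意如下(论文的具体实现未知,以下公式形式与数值均为基于文献惯例的假设):

```python
import math

def uncertainty_weighted_loss(losses, log_vars):
    """基于任务不确定性的多任务加权:L = Σ_i (exp(-s_i) * L_i + s_i),s_i 为可学习的对数方差。"""
    return sum(math.exp(-s) * l + s for l, s in zip(losses, log_vars))

# 示意:四个任务(MLM、TLM、序列到序列翻译、词性标注)的当前损失,数值为虚构
losses = [2.0, 1.5, 3.0, 0.8]

# s_i 全为 0 时退化为简单求和(即固定等权)
total_equal = uncertainty_weighted_loss(losses, [0.0, 0.0, 0.0, 0.0])

# 不同 s_i 对高噪声任务降权,并以 +s_i 项防止权重塌缩到零
total_weighted = uncertainty_weighted_loss(losses, [0.5, 0.0, 1.0, -0.5])
```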
[NLP-17] Semantic Centroids and Hierarchical Density-Based Clustering for Cross-Document Software Coreference Resolution
【速读】: 该论文旨在解决跨文档共指消解(Cross-Document Coreference Resolution, CDCR)中软件提及(software mentions)不一致识别与聚类的问题,即在科学文献语料库中准确识别并归并指代同一软件实体的不同表述。其解决方案的关键在于提出一种混合框架:首先利用预训练的Sentence-BERT模型生成密集语义嵌入以捕捉上下文语义相似性;其次基于训练集聚类中心构建知识库(KB)查找策略,并采用FAISS实现高效检索;对于无法明确归属已有聚类的提及,则使用HDBSCAN密度聚类算法进行处理;同时通过表面形式规范化(surface-form normalization)和缩写解析提升规范名称匹配效果。该核心流程统一应用于Subtasks 1和2,在Subtask 3的大规模场景下进一步引入基于实体类型和标准化表面形式的阻断(blocking)策略以提升效率,最终在三个子任务上分别获得0.98、0.98和0.96的CoNLL F1分数。
链接: https://arxiv.org/abs/2603.24246
作者: Julia Matela,Frank Krüger
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:This paper describes the system submitted to the SOMD 2026 Shared Task for Cross-Document Coreference Resolution (CDCR) of software mentions. Our approach addresses the challenge of identifying and clustering inconsistent software mentions across scientific corpora. We propose a hybrid framework that combines dense semantic embeddings from a pre-trained Sentence-BERT model, Knowledge Base (KB) lookup strategy built from training-set cluster centroids using FAISS for efficient retrieval, and HDBSCAN density-based clustering for mentions that cannot be confidently assigned to existing clusters. Surface-form normalization and abbreviation resolution are applied to improve canonical name matching. The same core pipeline is applied to Subtasks 1 and 2. To address the large scale settings of Subtask 3, the pipeline was adapted by utilising a blocking strategy based on entity types and canonicalized surface forms. Our system achieved CoNLL F1 scores of 0.98, 0.98, and 0.96 on Subtasks 1, 2, and 3 respectively.
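该系统的两段式归类流程(先对知识库聚类中心做相似度检索,无法确信归属的提及再交给密度聚类)可示意如下。注意:论文中检索用 FAISS、兜底聚类用 HDBSCAN,此处为保持自包含,用纯 numpy 最近邻代替检索部分,阈值为假设值:

```python
import numpy as np

def assign_to_kb(mention_emb, centroids, threshold):
    """KB 查找示意:与各聚类中心算余弦相似度,超过阈值则归入最近簇,
    否则返回 -1,表示留给密度聚类(论文中为 HDBSCAN)处理。"""
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    m = mention_emb / np.linalg.norm(mention_emb)
    sims = c @ m
    best = int(np.argmax(sims))
    if sims[best] >= threshold:
        return best          # 确信归入已有软件实体簇
    return -1                # 归属不明:交给兜底聚类

# 示意:两个软件实体簇的中心(实际由训练集聚类中心构成)
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
close = assign_to_kb(np.array([0.9, 0.1]), centroids, threshold=0.8)  # 贴近簇 0
far = assign_to_kb(np.array([0.7, 0.7]), centroids, threshold=0.8)    # 两簇之间,归属不明
```

真实系统中 centroids 规模很大,故用 FAISS 做高效近邻检索,并以实体类型与规范化表面形式做阻断(blocking)缩小候选。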
[NLP-18] Optimizing Multilingual LLM s via Federated Learning: A Study of Client Language Composition
【速读】: 该论文旨在解决多语言环境下大语言模型(Large Language Models, LLMs)的联邦学习(Federated Learning, FL)所面临的挑战,主要包括客户端间语言分布异构性以及语言资源可用性的不均衡问题。其核心解决方案是扩展了FederatedScope-LLM框架以支持多语言指令微调实验,并提出了一种新型客户端特定的早停机制——局部动态早停(Local Dynamic Early Stopping, LDES-FL),该机制允许客户端根据本地验证性能自主暂停和恢复本地训练,从而提升训练效率与可持续性。研究进一步表明,客户端的语言组成结构(从纯单语到多语)是影响多语言模型性能、公平性和训练成本的关键设计变量,且在联邦学习中增加客户端内部的多语多样性可显著提升全局模型的强度与公平性,尤其对低资源语言收益最大。
链接: https://arxiv.org/abs/2603.24242
作者: Aleix Sant,Jordi Luque,Carlos Escolano
机构: 未知
类目: Computation and Language (cs.CL)
备注: 12 pages, 4 figures, 5 tables
Abstract:Federated Learning (FL) of Large Language Models (LLMs) in multilingual environments presents significant challenges stemming from heterogeneous language distributions across clients and disparities in language resource availability. To address these challenges, we extended the FederatedScope-LLM framework to support multilingual instruction-tuning experiments with LLMs. We also introduced a novel client-specific early stopping mechanism, Local Dynamic Early Stopping (LDES-FL), which allows clients to pause and resume local training based on client-side validation performance, enhancing training efficiency and sustainability. Through a series of experiments, we studied how client language composition - from fully monolingual to increasingly multilingual clients - affects multilingual quality, fairness and training cost. Monolingual local fine-tuning remains the most effective for single-language specialization, whereas federated training is better suited to learning a single balanced multilingual model. In FL, increasing within-client multilinguality leads to stronger and fairer global models, narrows the gap to centralized multilingual fine-tuning, and yields the largest gains for lower-resource languages, albeit at the cost of more optimization steps. Overall, our results identify client language composition as a key design variable in multilingual FL, shaping performance, fairness and efficiency.
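LDES-FL 允许客户端依据本地验证性能暂停并恢复本地训练,其客户端侧逻辑可示意如下(暂停/恢复的具体判据摘要未给出,以下基于 patience 的规则为假设实现):

```python
def ldes_step(state, val_loss, patience=2, min_delta=1e-3):
    """LDES-FL 客户端侧示意:本地验证损失连续 patience 轮无显著改善则暂停训练,
    一旦再次改善即恢复。返回该轮是否继续本地训练。"""
    if val_loss < state["best"] - min_delta:
        state["best"] = val_loss
        state["bad_rounds"] = 0
        state["training"] = True       # 有改善:继续/恢复本地训练
    else:
        state["bad_rounds"] += 1
        if state["bad_rounds"] >= patience:
            state["training"] = False  # 连续无改善:暂停,节省本地算力
    return state["training"]

state = {"best": float("inf"), "bad_rounds": 0, "training": True}
# 损失先下降、再停滞(触发暂停)、再改善(触发恢复)
flags = [ldes_step(state, l) for l in [1.0, 0.9, 0.9, 0.9, 0.7]]
```

与全局统一的早停不同,每个客户端各自维护这样的 state,因而不同语言构成的客户端可以在不同轮次暂停与恢复。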
[NLP-19] Stance Labels Fail When They Matter Most: The Projection Problem in Stance Detection
【速读】: 该论文旨在解决立场检测(Stance Detection)中因将多维态度压缩为单一标签而导致的标注不一致问题,即“投影问题”(Projection Problem)。传统方法将文本立场划分为Favor、Against或Neutral三类,但在面对复杂目标时,个体可能在不同维度上持有矛盾态度(如支持气候科学但反对碳税),导致不同标注者基于各自权重选择压缩方式,从而产生看似混乱的标注分歧。解决方案的关键在于区分“整体标签一致性”与“各维度一致性”,研究表明:当各维度态度一致时,三类标签标注效果良好;而当维度冲突时,整体标签一致性显著下降,但各维度单独标注的一致性仍保持较高水平,表明应从多维视角重新设计立场检测任务,而非强制压缩为单一标签。
链接: https://arxiv.org/abs/2603.24231
作者: Bowen Zhang
机构: Shenzhen Technology University (深圳技术大学)
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:
Abstract:Stance detection is nearly always formulated as classifying text into Favor, Against, or Neutral – a convention inherited from debate analysis and applied without modification to social media since SemEval-2016. But attitudes toward complex targets are not unitary: a person can accept climate science while opposing carbon taxes, expressing support on one dimension and opposition on another. When annotators must compress such multi-dimensional attitudes into a single label, different annotators weight different dimensions – producing disagreement that reflects not confusion but different compression choices. We call this the projection problem, and show that its cost is conditional: when a text’s dimensions align, any weighting yields the same label and three-way annotation works well; when dimensions conflict, label agreement collapses while agreement on individual dimensions remains intact. A pilot study on SemEval-2016 Task 6 confirms this crossover: on dimension-consistent texts, label agreement (Krippendorff’s α = 0.307) exceeds dimensional agreement (α = 0.082); on dimension-conflicting texts, the pattern reverses – label α drops to 0.085 while dimensional α rises to 0.334, with Policy reaching 0.572. The projection problem is real – but it activates precisely where it matters most.
[NLP-20] Variation is the Norm: Embracing Sociolinguistics in NLP LREC2026
【速读】: 该论文试图解决的问题是:在自然语言处理(Natural Language Processing, NLP)中,语言变体(variation)通常被视为噪声而被“标准化”处理,导致模型对实际存在的语言多样性缺乏鲁棒性;而社会语言学(sociolinguistics)则强调语言变体的社会语境意义。论文提出一个融合社会语言学维度与NLP技术维度的框架,旨在将语言变体主动纳入研究设计中,从而提升模型对现实语言变异的适应能力。解决方案的关键在于:在微调(fine-tuning)过程中显式引入语言变体(如卢森堡语中的拼写变异),使模型学习到多样化的形式表达,从而显著改善其在非标准文本上的性能表现。
链接: https://arxiv.org/abs/2603.24222
作者: Anne-Marie Lutgen,Alistair Plum,Verena Blaschke,Barbara Plank,Christoph Purschke
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at LREC 2026
Abstract:In Natural Language Processing (NLP), variation is typically seen as noise and “normalised away” before processing, even though it is an integral part of language. Conversely, studying language variation in social contexts is central to sociolinguistics. We present a framework to combine the sociolinguistic dimension of language with the technical dimension of NLP. We argue that by embracing sociolinguistics, variation can actively be included in a research setup, in turn informing the NLP side. To illustrate this, we provide a case study on Luxembourgish, an evolving language featuring a large amount of orthographic variation, demonstrating how NLP performance is impacted. The results show large discrepancies in the performance of models tested and fine-tuned on data with a large amount of orthographic variation in comparison to data closer to the (orthographic) standard. Furthermore, we provide a possible solution to improve the performance by including variation in the fine-tuning process. This case study highlights the importance of including variation in the research setup, as models are currently not robust to occurring variation. Our framework facilitates the inclusion of variation in the thought-process while also being grounded in the theoretical framework of sociolinguistics.
[NLP-21] A visual observation on the geometry of UMAP projections of the difference vectors of antonym and synonym word pair embeddings
【速读】: 该论文旨在探究词嵌入向量中是否存在可识别的几何结构来表征反义词(antonyms)关系,特别是通过分析词对差异向量(difference vectors)的几何特性。其核心问题是:是否可以通过嵌入空间中的方向性信息(如向量差的方向或模式)来检测反义关系,并将其与同义词(synonyms)进行区分。解决方案的关键在于发现了一种在多种嵌入模型中均出现的“漩涡”(swirl)结构——这种结构仅在特定的投影配置下显现,暗示了反义词对在高维空间中可能具有某种一致性的几何分布模式,从而为基于几何特征的反义词识别提供了新的线索和方法论基础。
链接: https://arxiv.org/abs/2603.24150
作者: Rami Luisto
机构: University of Jyväskylä (于韦斯屈莱大学); HUS Helsinki University Hospital (赫尔辛基大学医院); Digital Workforce Services (数字劳动力服务)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Code available at this https URL
Abstract:Antonyms, or opposites, are sometimes defined as word pairs that have all of the same contextually relevant properties but one. Seeing how transformer models seem to encode concepts as directions, this begs the question if one can detect "antonymity" in the geometry of the embedding vectors of word pairs, especially based on their difference vectors. Such geometrical studies are then naturally contrasted by comparing antonymic pairs to their opposites: synonyms. This paper started as an exploratory project on the complexity of the systems needed to detect the geometry of the embedding vectors of antonymic word pairs. What we now report is a curious "swirl" that appears across embedding models in a somewhat specific projection configuration.
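The paper's pipeline, embedding each word pair, taking the difference vector, and projecting to 2-D, can be sketched as follows. The toy embeddings below are random placeholders for a real embedding model, and the linear SVD projection is a lightweight stand-in for UMAP (the method actually used in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in embeddings; in the paper these come from real embedding models.
dim = 64
vocab = {w: rng.normal(size=dim) for w in
         ["hot", "cold", "big", "small", "happy", "sad", "glad", "joyful"]}

antonym_pairs = [("hot", "cold"), ("big", "small"), ("happy", "sad")]
synonym_pairs = [("happy", "glad"), ("glad", "joyful")]

def difference_vectors(pairs):
    # Unit-normalised a - b vectors: the objects whose geometry is studied.
    diffs = np.stack([vocab[a] - vocab[b] for a, b in pairs])
    return diffs / np.linalg.norm(diffs, axis=1, keepdims=True)

X = np.vstack([difference_vectors(antonym_pairs),
               difference_vectors(synonym_pairs)])

# 2-D linear projection via SVD, a simple stand-in for the UMAP step.
X_centered = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(X_centered, full_matrices=False)
proj = X_centered @ vt[:2].T  # one 2-D point per word pair, ready to plot
```

The reported "swirl" would appear as structure in a scatter plot of `proj`, with antonym and synonym pairs coloured differently.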
[NLP-22] MedAidDialog: A Multilingual Multi-Turn Medical Dialogue Dataset for Accessible Healthcare
【速读】: 该论文旨在解决现有医疗对话系统在多轮交互真实性、多语言适用性以及部署可行性方面的局限性问题。当前多数系统仅支持单轮问答或依赖模板化数据集,难以模拟真实医患问诊流程,且缺乏对多语言场景的支持。其解决方案的关键在于构建了一个名为MedAidDialog的多语言多轮医疗对话数据集,通过大语言模型生成合成对话并扩展至七种语言(英语、印地语、泰卢固语、泰米尔语、孟加拉语、马拉地语和阿拉伯语),同时开发了基于量化小语言模型参数高效微调的MedAidLM模型,可在低算力环境下部署;此外引入患者预设上下文信息(如年龄、性别、过敏史)以增强个性化咨询能力,实验表明该框架能有效实现症状采集与诊断建议生成,并获得医学专家对对话合理性和连贯性的认可。
链接: https://arxiv.org/abs/2603.24132
作者: Shubham Kumar Nigam,Suparnojit Sarkar,Piyush Patel
机构: University of Birmingham, Dubai, United Arab Emirates; Heritage Institute of Technology, Kolkata, India; Madan Mohan Malaviya University of Technology, India
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Conversational artificial intelligence has the potential to assist users in preliminary medical consultations, particularly in settings where access to healthcare professionals is limited. However, many existing medical dialogue systems operate in a single-turn question–answering paradigm or rely on template-based datasets, limiting conversational realism and multilingual applicability. In this work, we introduce MedAidDialog, a multilingual multi-turn medical dialogue dataset designed to simulate realistic physician–patient consultations. The dataset extends the MDDial corpus by generating synthetic consultations using large language models and further expands them into a parallel multilingual corpus covering seven languages: English, Hindi, Telugu, Tamil, Bengali, Marathi, and Arabic. Building on this dataset, we develop MedAidLM, a conversational medical model trained using parameter-efficient fine-tuning on quantized small language models, enabling deployment without high-end computational infrastructure. Our framework additionally incorporates optional patient pre-context information (e.g., age, gender, allergies) to personalize the consultation process. Experimental results demonstrate that the proposed system can effectively perform symptom elicitation through multi-turn dialogue and generate diagnostic recommendations. We further conduct medical expert evaluation to assess the plausibility and coherence of the generated consultations.
[NLP-23] Alignment Reduces Expressed but Not Encoded Gender Bias: A Unified Framework and Study
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在训练过程中习得的社会规律性导致的性别偏见问题,特别是现有缓解策略多聚焦于生成输出层面的偏见减少,而忽视了对模型内部表征中潜在性别信息的分析,且结构化基准测试可能无法反映真实应用场景。解决方案的关键在于提出一个统一框架,使用相同的中性提示同时分析模型内在(intrinsic)和外在(extrinsic)性别偏见,从而实现对内部表示中编码的性别相关信息与生成输出中表达的偏见之间的直接比较。该方法揭示了在统一协议下,潜在性别信息与表达偏见之间存在一致关联,且表明通过监督微调虽可降低输出偏见,但内部表征中仍保留可被对抗提示重新激活的性别相关关联,凸显了当前去偏方法在现实场景中的局限性。
链接: https://arxiv.org/abs/2603.24125
作者: Nour Bouchouchi,Thiabult Laugel,Xavier Renard,Christophe Marsala,Marie-Jeanne Lesot,Marcin Detyniecki
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:During training, Large Language Models (LLMs) learn social regularities that can lead to gender bias in downstream applications. Most mitigation efforts focus on reducing bias in generated outputs, typically evaluated on structured benchmarks, which raises two concerns: output-level evaluation does not reveal whether alignment modifies the model’s underlying representations, and structured benchmarks may not reflect realistic usage scenarios. We propose a unified framework to jointly analyze intrinsic and extrinsic gender bias in LLMs using identical neutral prompts, enabling direct comparison between gender-related information encoded in internal representations and bias expressed in generated outputs. Contrary to prior work reporting weak or inconsistent correlations, we find a consistent association between latent gender information and expressed bias when measured under the unified protocol. We further examine the effect of alignment through supervised fine-tuning aimed at reducing gender bias. Our results suggest that while the latter indeed reduces expressed bias, measurable gender-related associations are still present in internal representations, and can be reactivated under adversarial prompting. Finally, we consider two realistic settings and show that debiasing effects observed on structured benchmarks do not necessarily generalize, e.g., to the case of story generation.
[NLP-24] The Alignment Tax: Response Homogenization in Aligned LLMs and Its Implications for Uncertainty Estimation
【速读】: 该论文旨在解决强化学习人类反馈(Reinforcement Learning from Human Feedback, RLHF)对语言模型响应多样性造成的负面影响,即“对齐税”(alignment tax)问题——表现为在TruthfulQA等任务中模型输出高度同质化(如40-79%的问题产生单一语义簇),导致基于采样的不确定性估计方法失效(AUROC=0.500),而自由token熵仍保留判别能力(0.603)。解决方案的关键在于识别出该现象的因果机制:通过基线模型与指令微调模型(instruct model)的对比实验,确认对齐过程主要由直接偏好优化(Direct Preference Optimization, DPO)阶段引发(基线0.0% vs. DPO后4.0%单簇率),并提出一种基于多种正交不确定性信号的“先便宜后昂贵”级联策略(UCBD),在保持高精度的同时实现成本节约(GSM8K准确率从84.4%提升至93.2%,覆盖率达50%时节省57%计算资源)。
链接: https://arxiv.org/abs/2603.24124
作者: Mingyi Liu
机构: Independent Researcher
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 23 pages, 3 figures, 10 tables, 22 experiments across 5 benchmarks. Code: this https URL
Abstract:RLHF-aligned language models exhibit response homogenization: on TruthfulQA (n=790), 40-79% of questions produce a single semantic cluster across 10 i.i.d. samples. On affected questions, sampling-based uncertainty methods have zero discriminative power (AUROC=0.500), while free token entropy retains signal (0.603). This alignment tax is task-dependent: on GSM8K (n=500), token entropy achieves 0.724 (Cohen's d=0.81). A base-vs-instruct ablation confirms the causal role of alignment: the base model shows 1.0% single-cluster rate vs. 28.5% for the instruct model (p < 10^-6). A training stage ablation (Base 0.0% → SFT 1.5% → DPO 4.0% SCR) localizes the cause to DPO, not SFT. Cross-family replication on four model families reveals alignment tax severity varies by family and scale. We validate across 22 experiments, 5 benchmarks, 4 model families, and 3 model scales (3B-14B), with Jaccard, embedding, and NLI-based baselines at three DeBERTa scales (all ~0.51 AUROC). Cross-embedder validation with two independent embedding families rules out coupling bias. Cross-dataset validation on WebQuestions (58.0% SCR) confirms generalization beyond TruthfulQA. The central finding – response homogenization – is implementation-independent and label-free. Motivated by this diagnosis, we explore a cheapest-first cascade (UCBD) over orthogonal uncertainty signals. Selective prediction raises GSM8K accuracy from 84.4% to 93.2% at 50% coverage; weakly dependent boundaries (|r| = 0.12) enable 57% cost savings.
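The two quantities at the heart of this abstract, the single-cluster rate (SCR) over i.i.d. samples and the AUROC of an uncertainty score, are simple to state in code. A hedged sketch, where the semantic-equivalence predicate is a placeholder for the paper's NLI/embedding-based clustering:

```python
def single_cluster_rate(samples_per_question, equivalent):
    """Fraction of questions whose sampled answers all fall in one semantic
    cluster. `equivalent` is a symmetric predicate standing in for the
    NLI- or embedding-based equivalence check used in the paper."""
    single = 0
    for samples in samples_per_question:
        anchor = samples[0]
        if all(equivalent(anchor, s) for s in samples[1:]):
            single += 1
    return single / len(samples_per_question)

def auroc(scores_pos, scores_neg):
    """AUROC as the probability that a positive (e.g. wrong-answer) score
    outranks a negative one; ties count half."""
    total = 0.0
    for p in scores_pos:
        for n in scores_neg:
            total += 1.0 if p > n else (0.5 if p == n else 0.0)
    return total / (len(scores_pos) * len(scores_neg))
```

On a question where all samples are semantically equivalent, sampling-based uncertainty is constant, which is exactly why its AUROC collapses to 0.5 on homogenized questions while token entropy, computed from the model's own logits, still varies.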
[NLP-25] LLMpedia: A Transparent Framework to Materialize an LLM's Encyclopedic Knowledge at Scale
【速读】: 该论文试图解决当前大语言模型(Large Language Models, LLMs)在事实性(factuality)评估中存在偏差的问题,即现有基准测试(如MMLU)可能高估了模型的真实知识掌握能力。其解决方案的关键在于构建一个完全由参数化记忆生成的百科全书式数据集——LLMpedia,该数据集不依赖外部检索,而是通过模型自身知识直接生成约100万篇百科文章,并对不同主题领域的事实准确性进行系统性测量。研究发现,在维基百科覆盖的主题上,gpt-5-mini的事实正确率仅为74.7%,显著低于基准测试所呈现的90%以上;而在仅靠人工筛选网络证据验证的前沿主题上,准确率进一步下降至63.2%。此外,LLMpedia通过公开所有提示、生成内容与评估结果,实现了可复现的、透明的事实性评估框架,从而推动从“评估”到“知识生成”的范式转变。
链接: https://arxiv.org/abs/2603.24080
作者: Muhammed Saeed,Simon Razniewski
机构: ScaDS.AI Dresden/Leipzig; TU Dresden, Germany
类目: Computation and Language (cs.CL); Databases (cs.DB)
备注:
Abstract:Benchmarks such as MMLU suggest flagship language models approach factuality saturation, with scores above 90%. We show this picture is incomplete. LLMpedia generates encyclopedic articles entirely from parametric memory, producing ~1M articles across three model families without retrieval. For gpt-5-mini, the verifiable true rate on Wikipedia-covered subjects is only 74.7% – more than 15 percentage points below the benchmark-based picture, consistent with the availability bias of fixed-question evaluation. Beyond Wikipedia, frontier subjects verifiable only through curated web evidence fall further to 63.2% true rate. Wikipedia covers just 61% of surfaced subjects, and three model families overlap by only 7.3% in subject choice. In a capture-trap benchmark inspired by prior analysis of Grokipedia, LLMpedia achieves substantially higher factuality at roughly half the textual similarity to Wikipedia. Unlike Grokipedia, every prompt, artifact, and evaluation verdict is publicly released, making LLMpedia the first fully open parametric encyclopedia – bridging factuality evaluation and knowledge materialization. All data, code, and a browsable interface are at this https URL.
[NLP-26] ConceptKT: A Benchmark for Concept-Level Deficiency Prediction in Knowledge Tracing LREC2026
【速读】: 该论文旨在解决传统知识追踪(Knowledge Tracing, KT)系统仅能预测答题正确与否,而无法诊断学生错误背后具体概念缺失的问题。为实现更精细的诊断反馈以支持针对性教学干预,作者提出了概念级缺陷预测(concept-level deficiency prediction)任务,并构建了ConceptKT数据集,其中标注了每道题目所需的概念以及错误回答所反映的缺失概念。解决方案的关键在于:利用大语言模型(Large Language Models, LLMs)和大推理模型(Large Reasoning Models, LRMs)在上下文学习(in-context learning)框架下的诊断能力,结合基于概念对齐和语义相似度的信息性历史记录选择策略,显著提升了正确性预测与概念级缺陷识别的性能。
链接: https://arxiv.org/abs/2603.24073
作者: Yu-Chen Kang,Yu-Chien Tang,An-Zi Yen
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted by LREC 2026
Abstract:Knowledge Tracing (KT) is a critical technique for modeling student knowledge to support personalized learning. However, most KT systems focus on binary correctness prediction and cannot diagnose the underlying conceptual misunderstandings that lead to errors. Such fine-grained diagnostic feedback is essential for designing targeted instruction and effective remediation. In this work, we introduce the task of concept-level deficiency prediction, which extends traditional KT by identifying the specific concepts a student is likely to struggle with on future problems. We present ConceptKT, a dataset annotated with labels that capture both the concepts required to solve each question and the missing concepts underlying incorrect responses. We investigate in-context learning approaches to KT and evaluate the diagnostic capabilities of various Large Language Models (LLMs) and Large Reasoning Models (LRMs). Different strategies for selecting informative historical records are explored. Experimental results demonstrate that selecting response histories based on conceptual alignment and semantic similarity leads to improved performance on both correctness prediction and concept-level deficiency identification.
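The history-selection strategy the ConceptKT experiments favor, ranking past response records by conceptual alignment and semantic similarity, can be sketched directly. The blending weight and record schema below are assumptions for illustration, not the paper's exact setup:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def select_history(records, target_concepts, target_emb, k=2, w=0.5):
    """Rank past records by Jaccard concept overlap with the target question,
    blended with embedding cosine similarity. Each record is assumed to carry
    a `concepts` set and an `embedding` vector; the 50/50 weight `w` is an
    illustrative choice."""
    def score(rec):
        concepts, emb = rec["concepts"], rec["embedding"]
        union = concepts | target_concepts
        jaccard = len(concepts & target_concepts) / len(union) if union else 0.0
        return w * jaccard + (1 - w) * cosine(emb, target_emb)
    return sorted(records, key=score, reverse=True)[:k]
```

The selected records would then be placed in the LLM's prompt as in-context examples before asking it to predict correctness and missing concepts for the new question.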
[NLP-27] FinToolSyn: A forward synthesis Framework for Financial Tool-Use Dialogue Data with Dynamic Tool Retrieval
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在金融领域中工具调用能力不足的问题,特别是现有数据合成方法采用逆向合成范式(reverse synthesis paradigm),即从预采样的工具生成用户查询,导致生成的对话缺乏真实场景中的隐含性与事件驱动特性,并且无法适应动态变化的工具空间。其解决方案的关键在于提出一种前向合成框架(forward synthesis framework)——FinToolSyn,该框架通过三个阶段构建高质量金融对话数据:角色指令引导、原子工具合成以及动态检索对话生成,其中引入动态检索机制以模拟大规模工具空间中的噪声候选集,从而更贴近实际金融场景下的工具使用过程。实验表明,基于FinToolSyn训练的模型在工具调用准确率上提升21.06%,显著增强了LLMs在金融任务中的工具学习能力。
链接: https://arxiv.org/abs/2603.24051
作者: Caishuang Huang,Yang Qiao,Rongyu Zhang,Junjie Ye,Pu Lu,Wenxi Wu,Meng Zhou,Xiku Du,Tao Gui,Qi Zhang,Xuanjing Huang
机构: Fudan University (复旦大学); Tencent (腾讯); Zhejiang University (浙江大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Tool-use capabilities are vital for Large Language Models (LLMs) in finance, a domain characterized by massive investment targets and data-intensive inquiries. However, existing data synthesis methods typically rely on a reverse synthesis paradigm, generating user queries from pre-sampled tools. This approach inevitably introduces artificial explicitness, yielding queries that fail to capture the implicit, event-driven nature of real-world needs. Moreover, its reliance on static tool sets overlooks the dynamic retrieval process required to navigate massive tool spaces. To address these challenges, we introduce \textitFinToolSyn, a forward synthesis framework designed to generate high-quality financial dialogues. Progressing from persona instruction and atomic tool synthesis to dynamic retrieval dialogue generation, our pipeline constructs a repository of 43,066 tools and synthesizes over 148k dialogue instances, incorporating dynamic retrieval to emulate the noisy candidate sets typical of massive tool spaces. We also establish a dedicated benchmark to evaluate tool-calling capabilities in realistic financial scenarios. Extensive experiments demonstrate that models trained on FinToolSyn achieve a 21.06% improvement, providing a robust foundation for tool learning in financial scenarios.
[NLP-28] MoE-Sieve: Routing-Guided LoRA for Efficient MoE Fine-Tuning
【速读】: 该论文旨在解决标准LoRA(Low-Rank Adaptation)微调在混合专家(Mixture-of-Experts, MoE)模型中效率低下的问题,即对所有专家均应用LoRA适配器导致冗余计算与存储开销。其核心发现是:MoE模型中每层的专家路由具有高度偏斜性,仅有少量“热”专家处理大部分输入token,而多数“冷”专家几乎不被激活。解决方案的关键在于提出MoE-Sieve框架——通过小规模校准集对专家路由频次进行 profiling,仅选择每层中路由频率最高的top-k专家(如25%)进行LoRA微调,从而显著降低可训练参数量(减少70–73%)、检查点大小(减少71–73%)及训练时间(最多减少50%),同时保持与全量LoRA相当的性能表现(平均误差不超过±1个百分点)。该方法有效利用了路由信号指导适配器部署,避免了对冷专家的无效微调,进而提升了微调效率与稳定性。
链接: https://arxiv.org/abs/2603.24044
作者: Andrea Manzoni
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 17 pages, 6 figures, 10 tables
Abstract:Standard LoRA fine-tuning of Mixture-of-Experts (MoE) models applies adapters to every expert, yet our profiling shows that per-layer expert routing is highly skewed: a small subset of experts handles most tokens in each layer, while many others are rarely activated (“cold”). We propose MoE-Sieve, a simple routing-guided framework for LoRA fine-tuning, and pair it with a systematic profiling study of expert routing across architectures and tasks. The method is simple: profile routing counts on a small calibration set, select the top-k most-routed experts per layer, and apply LoRA only to those experts. Across two architecturally distinct MoE models and three diverse tasks, tuning only the top 25% routed experts per layer remains competitive with full LoRA, with mean differences within +/-1 percentage point across all conditions. This reduces LoRA trainable parameters by 70-73%, adapter checkpoint size by 71-73%, and wall-clock training time by up to 50%. We also observe a non-monotonic relationship between expert count and seed-to-seed variance, consistent with the hypothesis that adapting cold experts can introduce gradient noise without improving accuracy. Further ablations show that random expert selection at matched budget is about 2.5 percentage points worse, indicating that the routing signal matters, while greedy per-layer budget optimization does not improve over uniform top-k.
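The MoE-Sieve selection step described above (profile routing counts on a calibration set, then keep the top 25% most-routed experts per layer) is straightforward to sketch. The routing counts here are toy numbers; in practice they would come from running the calibration set through the MoE router:

```python
def select_experts(routing_counts, fraction=0.25):
    """Given per-layer token-routing counts {layer: {expert_id: count}},
    return the top `fraction` most-routed experts per layer. LoRA adapters
    would then be attached only to the selected ("hot") experts, leaving
    rarely activated ("cold") experts frozen."""
    selected = {}
    for layer, counts in routing_counts.items():
        k = max(1, round(len(counts) * fraction))
        ranked = sorted(counts, key=counts.get, reverse=True)
        selected[layer] = ranked[:k]
    return selected
```

Because routing is highly skewed, this keeps most of the tokens' compute paths adapted while cutting trainable LoRA parameters by the reported 70-73%.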
[NLP-29] From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLM s
【速读】: 该论文旨在解决上下文感知自动语音识别(Contextual Automatic Speech Recognition, Contextual ASR)中因训练时使用理想化的历史对话内容(oracle conversation history)而推理时依赖预测的历史信息所导致的“上下文暴露偏差”(contextual exposure bias)问题。解决方案的关键在于提出一个统一的训练框架,包含三个核心机制:(i) 教师错误知识(Teacher Error Knowledge),利用Whisper large-v3模型的预测结果作为训练阶段的历史输入以模拟真实场景;(ii) 上下文丢弃(Context Dropout),通过随机屏蔽历史信息来缓解对上下文的过度依赖;(iii) 直接偏好优化(Direct Preference Optimization, DPO),在人工标注的失败案例上进行优化以提升鲁棒性。实验表明,该方法在TED-LIUM 3(域内)和零样本LibriSpeech(域外)数据集上均显著提升了基于预测历史的解码性能,并在无关上下文攻击下表现出最小的性能下降,验证了其对误导性上下文的更强鲁棒性。
链接: https://arxiv.org/abs/2603.24034
作者: Xiaoyong Guo,Nanjie Li,Zijie Zeng,Kai Wang,Hao Huang,Haihua Xu,Wei Shi
机构: Xinjiang University (新疆大学); Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing (丝绸之路多语言认知计算联合国际实验室); Timekettle
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Contextual automatic speech recognition (ASR) with Speech-LLMs is typically trained with oracle conversation history, but relies on error-prone history at inference, causing a train-test mismatch in the context channel that we term contextual exposure bias. We propose a unified training framework to improve robustness under realistic histories: (i) Teacher Error Knowledge by using Whisper large-v3 hypotheses as training-time history, (ii) Context Dropout to regularize over-reliance on history, and (iii) Direct Preference Optimization (DPO) on curated failure cases. Experiments on TED-LIUM 3 (in-domain) and zero-shot LibriSpeech (out-of-domain) show consistent gains under predicted-history decoding. With a two-utterance history as context, SFT with Whisper hypotheses reduces WER from 5.59% (oracle-history training) to 5.47%, and DPO further improves it to 5.17%. Under irrelevant-context attacks, DPO yields the smallest degradation (5.17% → 5.63%), indicating improved robustness to misleading context. Our code and models are published at this https URL.
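The Context Dropout idea, masking the history channel for a fraction of training examples so the model cannot over-rely on (possibly wrong) context, amounts to a small change in training-data construction. A minimal sketch; the drop probability of 0.3 and the example schema are assumptions, not values from the paper:

```python
import random

def build_training_example(utterance, history, p_drop=0.3, rng=None):
    """With probability p_drop, mask the conversation-history channel so the
    model learns to transcribe from audio alone; otherwise keep the (teacher
    or oracle) history. p_drop=0.3 is an illustrative value."""
    rng = rng or random.Random()
    kept = [] if rng.random() < p_drop else history
    return {"context": " ".join(kept), "target": utterance}
```

Combined with Teacher Error Knowledge, the kept history would itself be a Whisper hypothesis rather than the ground-truth transcript, matching the error-prone histories seen at inference time.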
[NLP-30] Schema on the Inside: A Two-Phase Fine-Tuning Method for High-Efficiency Text-to-SQL at Scale AAAI-26 AAAI
【速读】: 该论文旨在解决将大规模、专有API驱动的语言模型应用于文本到SQL(text-to-SQL)任务时面临的高成本与高延迟问题,尤其是在需要依赖长上下文提示(schema-heavy prompts)的情况下,导致每token的API费用高昂且响应延迟大,难以在大规模生产环境中部署。解决方案的关键在于提出一种专门针对特定领域(cricket statistics)的自托管8B参数模型,并采用新颖的两阶段监督微调(supervised fine-tuning)方法,使模型能够内化整个数据库模式(database schema),从而彻底消除对长上下文提示的依赖。这一策略将输入token数量从17k减少至不足100(降幅超99%),并用高效的本地推理替代昂贵的外部API调用,最终实现了98.4%的执行成功率和92.5%的语义准确率,显著优于基于Google Gemini Flash 2.0的提示工程基线。
链接: https://arxiv.org/abs/2603.24023
作者: Chinmay Soni,Shivam Chourasia,Gaurav Kumar,Hitesh Kapoor
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages, 6 figures. Published in the Proceedings of the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), 2026
Abstract:Applying large, proprietary API-based language models to text-to-SQL tasks poses a significant industry challenge: reliance on massive, schema-heavy prompts results in prohibitive per-token API costs and high latency, hindering scalable production deployment. We present a specialized, self-hosted 8B-parameter model designed for a conversational bot in CriQ, a sister app to Dream11, India’s largest fantasy sports platform with over 250 million users, that answers user queries about cricket statistics. Our novel two-phase supervised fine-tuning approach enables the model to internalize the entire database schema, eliminating the need for long-context prompts. This reduces input tokens by over 99%, from a 17k-token baseline to fewer than 100, and replaces costly external API calls with efficient local inference. The resulting system achieves 98.4% execution success and 92.5% semantic accuracy, substantially outperforming a prompt-engineered baseline using Google’s Gemini Flash 2.0 (95.6% execution, 89.4% semantic accuracy). These results demonstrate a practical path toward high-precision, low-latency text-to-SQL applications using domain-specialized, self-hosted language models in large-scale production environments.
[NLP-31] CVPD at QIAS 2026: RAG-Guided LLM Reasoning for Al-Mawarith Share Computation and Heir Allocation
【速读】: 该论文旨在解决伊斯兰继承法(Ilm al-Mawarith)中的多阶段法律推理问题,其核心挑战在于准确识别合格继承人、处理阻断规则(hajb)、分配固定份额与剩余份额、以及应对如awl和radd等调整机制,并在不同法学派别和民法典编纂差异下保持法律配置的显式约束。解决方案的关键在于构建一个检索增强生成(Retrieval-Augmented Generation, RAG)流水线:通过符号继承计算器生成带有完整中间推理轨迹的高质量合成数据以确保法律与数值一致性;采用密集向量与BM25混合检索结合交叉编码器重排序策略提升检索准确性;并通过结构化模式约束输出验证机制保障生成结果的合法性与一致性。该方法在QIAS 2026盲测排行榜中取得MIR-E分数0.935,验证了检索驱动且模式感知的生成范式在高精度阿拉伯语法律推理任务中的显著可靠性提升。
链接: https://arxiv.org/abs/2603.24012
作者: Wassim Swaileh,Mohammed-En-Nadhir Zighem,Hichem Telli,Salah Eddine Bekhouche,Abdellah Zakaria Sellam,Fadi Dornaika,Dimitrios Kotzinos
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Islamic inheritance (Ilm al-Mawarith) is a multi-stage legal reasoning task requiring the identification of eligible heirs, resolution of blocking rules (hajb), assignment of fixed and residual shares, handling of adjustments such as awl and radd, and generation of a consistent final distribution. The task is further complicated by variations across legal schools and civil-law codifications, requiring models to operate under explicit legal configurations. We present a retrieval-augmented generation (RAG) pipeline for this setting, combining rule-grounded synthetic data generation, hybrid retrieval (dense and BM25) with cross-encoder reranking, and schema-constrained output validation. A symbolic inheritance calculator is used to generate a large high-quality synthetic corpus with full intermediate reasoning traces, ensuring legal and numerical consistency. The proposed system achieves a MIR-E score of 0.935 and ranks first on the official QIAS 2026 blind-test leaderboard. Results demonstrate that retrieval-grounded, schema-aware generation significantly improves reliability in high-precision Arabic legal reasoning tasks.
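The ʿawl adjustment mentioned above has a simple arithmetic core: when the fixed Quranic shares oversubscribe the estate, all shares are scaled down proportionally. A sketch of that step of a symbolic inheritance calculator, using exact fractions; the specific case is a classic textbook example and purely illustrative (the full calculator in the paper also handles hajb, residue, and radd):

```python
from fractions import Fraction

def allocate_fixed_shares(shares):
    """Apply the 'awl adjustment: if the fixed shares sum to more than 1,
    scale every share down proportionally so the total is exactly 1.
    `shares` maps heir -> Fraction. Illustrative arithmetic only; radd and
    residue ('asaba) handling for undersubscribed cases is omitted."""
    total = sum(shares.values())
    if total <= 1:
        return dict(shares)  # radd / residue handling would go here
    return {heir: s / total for heir, s in shares.items()}

# Classic 'awl case: husband (1/2) plus two full sisters (2/3) sums to 7/6.
case = {"husband": Fraction(1, 2), "two_sisters": Fraction(2, 3)}
adjusted = allocate_fixed_shares(case)
```

After the adjustment the husband receives 3/7 and the two sisters 4/7, and exact-fraction arithmetic keeps the generated reasoning traces numerically consistent.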
[NLP-32] Thinking with Tables: Enhancing Multi-Modal Tabular Understanding via Neuro-Symbolic Reasoning
【速读】: 该论文旨在解决表格-视觉多模态理解(Tabular-Vision Multi-Modal Understanding, TVMU)任务中的三大核心挑战:(1)表格数据的高结构变异性与不完整性;(2)表中特征间的隐式复杂依赖关系;(3)下游任务间显著的问题求解流程异质性。解决方案的关键在于提出Thinking with Tables(TWT),其采用基于程序辅助的代码驱动神经符号推理机制,通过与外部环境交互实现信息提取和元素建模等关键操作,从而有效应对上述挑战。
链接: https://arxiv.org/abs/2603.24004
作者: Kun-Yang Yu,Zhi Zhou,Shi-Yu Tian,Xiao-Wen Yang,Zi-Yi Jia,Ming Yang,Zi-Jian Cheng,Lan-Zhe Guo,Yu-Feng Li
机构: 未知
类目: Computation and Language (cs.CL)
备注: 20 pages, 6 figures
Abstract:Multimodal Large Language Models (MLLMs) have demonstrated remarkable reasoning capabilities across modalities such as images and text. However, tabular data, despite being a critical real-world modality, remains relatively underexplored in multimodal learning. In this paper, we focus on the task of Tabular-Vision Multi-Modal Understanding (TVMU) and identify three core challenges: (1) high structural variability and data incompleteness in tables, (2) implicit and complex feature dependencies, and (3) significant heterogeneity in problem-solving pipelines across downstream tasks. To address these issues, we propose Thinking with Tables (TWT). TWT employs a program-aided code-based neuro-symbolic reasoning mechanism that facilitates key operations, such as information extraction and element modeling, by interacting with external environments. We evaluate TWT on eight representative datasets. Experimental results demonstrate that TWT consistently outperforms existing baselines by an average of 10% in accuracy, achieving performance comparable to, or even surpassing, proprietary commercial SOTA LLMs on TVMU tasks. Models and codes are available at this https URL
[NLP-33] Sparse Growing Transformer: Training-Time Sparse Depth Allocation via Progressive Attention Looping
【速读】: 该论文旨在解决现有Transformer模型在提升有效深度时存在的计算冗余问题,即传统方法依赖参数复用并通过递归执行扩展计算深度,导致训练过程中网络结构固定且深度分配均匀,造成大量不必要的计算开销。其解决方案的关键在于提出一种名为Sparse Growing Transformer (SGT) 的训练时稀疏深度分配框架,通过动态地、逐步地从深层到浅层扩展递归路径,仅对信息量高的注意力头进行目标导向的循环连接,从而实现结构稀疏性——即随着训练进程有选择性地增加少量参数的深度,而非全局均匀分配。实验表明,SGT在多个参数规模下均优于静态块级循环基线,同时将额外训练FLOPs开销从约16–20%显著降低至仅1–3%。
链接: https://arxiv.org/abs/2603.23998
作者: Yao Chen,Yilong Chen,Yinqi Yang,Junyuan Shang,Zhenyu Zhang,Zefeng Zhang,Shuaiyi Nie,Shuohuan Wang,Yu Sun,Hua Wu,HaiFeng Wang,Tingwen Liu
机构: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Baidu Inc.
类目: Computation and Language (cs.CL)
备注:
Abstract:Existing approaches to increasing the effective depth of Transformers predominantly rely on parameter reuse, extending computation through recursive execution. Under this paradigm, the network structure remains static along the training timeline, and additional computational depth is uniformly assigned to entire blocks at the parameter level. This rigidity across training time and parameter space leads to substantial computational redundancy during training. In contrast, we argue that depth allocation during training should not be a static preset, but rather a progressively growing structural process. Our systematic analysis reveals a deep-to-shallow maturation trajectory across layers, where high-entropy attention heads play a crucial role in semantic integration. Motivated by this observation, we introduce the Sparse Growing Transformer (SGT). SGT is a training-time sparse depth allocation framework that progressively extends recurrence from deeper to shallower layers via targeted attention looping on informative heads. This mechanism induces structural sparsity by selectively increasing depth only for a small subset of parameters as training evolves. Extensive experiments across multiple parameter scales demonstrate that SGT consistently outperforms training-time static block-level looping baselines under comparable settings, while reducing the additional training FLOPs overhead from approximately 16–20% to only 1–3% relative to a standard Transformer backbone.
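SGT's choice of loop targets rests on identifying high-entropy attention heads, those that spread attention broadly and play a semantic-integration role. The entropy of a head's attention distributions is easy to compute; the toy attention matrices below are illustrative:

```python
import math

def attention_entropy(attn_rows):
    """Mean Shannon entropy (in nats) of a head's attention distributions,
    one row per query position; rows are assumed to sum to 1."""
    def row_entropy(row):
        return -sum(p * math.log(p) for p in row if p > 0)
    return sum(row_entropy(r) for r in attn_rows) / len(attn_rows)

# A sharply focused head vs. a head that spreads attention broadly.
focused = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
diffuse = [[1/3, 1/3, 1/3], [1/3, 1/3, 1/3]]

heads = {"focused": attention_entropy(focused),
         "diffuse": attention_entropy(diffuse)}
```

Under SGT, recurrence would be extended from deeper to shallower layers by looping only heads like `diffuse` (entropy near log 3 here), which is how depth is added for just a small subset of parameters.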
[NLP-34] CoCR-RAG : Enhancing Retrieval-Augmented Generation in Web QA via Concept-oriented Context Reconstruction
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)中多源文档融合的挑战,即从异构网络来源获取的文档常因写作风格、格式和粒度不一致而引入冗余与无关信息,从而损害答案的事实一致性。解决方案的关键在于提出概念导向的上下文重构框架(Concept-oriented Context Reconstruction RAG, CoCR-RAG),其核心是基于抽象意义表示(Abstract Meaning Representation, AMR)的语义级概念蒸馏算法,通过提取多个文档中的关键概念,并由大语言模型(Large Language Models, LLMs)对这些概念进行融合重构,仅补充必要句法元素以突出核心知识,从而构建连贯且知识密集的统一上下文。
链接: https://arxiv.org/abs/2603.23989
作者: Kaize Shi,Xueyao Sun,Qika Lin,Firoj Alam,Qing Li,Xiaohui Tao,Guandong Xu
机构: University of Southern Queensland (南昆士兰大学); The Hong Kong Polytechnic University (香港理工大学); National University of Singapore (新加坡国立大学); Qatar Foundation (卡塔尔基金会); Hong Kong Education University (香港教育大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Retrieval-augmented generation (RAG) has shown promising results in enhancing QA by incorporating information from the web and other external sources. However, the supporting documents retrieved from the heterogeneous web often originate from multiple sources with diverse writing styles, varying formats, and inconsistent granularity. Fusing such multi-source documents into a coherent and knowledge-intensive context remains a significant challenge, as the presence of irrelevant and redundant information can compromise the factual consistency of the inferred answers. This paper proposes the Concept-oriented Context Reconstruction RAG (CoCR-RAG), a framework that addresses the multi-source information fusion problem in RAG through linguistically grounded concept-level integration. Specifically, we introduce a concept distillation algorithm that extracts essential concepts from Abstract Meaning Representation (AMR), a stable semantic representation that structures the meaning of texts as logical graphs. The distilled concepts from multiple retrieved documents are then fused and reconstructed into a unified, information-intensive context by Large Language Models, which supplement only the necessary sentence elements to highlight the core knowledge. Experiments on the PopQA and EntityQuestions datasets demonstrate that CoCR-RAG significantly outperforms existing context-reconstruction methods across these Web QA benchmarks. Furthermore, CoCR-RAG shows robustness across various backbone LLMs, establishing itself as a flexible, plug-and-play component adaptable to different RAG frameworks.
[NLP-35] From AI Assistant to AI Scientist: Autonomous Discovery of LLM -RL Algorithms with LLM Agents
【速读】: 该论文旨在解决语言模型中策略优化算法(Policy Optimization Algorithms)的改进依赖于高成本的人工手动设计与验证的问题,其核心挑战在于算法机制与训练动态高度耦合,且需在迭代过程中复用实证证据。解决方案的关键是提出POISE框架——一个闭环自动化发现机制,通过结构化基因谱系档案(genealogically linked archive)关联算法提案、可执行实现、标准化评估与自然语言反思,从而支持基于证据的迭代优化;在数学推理任务中,从GRPO出发,POISE评估64种候选算法并发现如解析方差缩放(analytic-variance scaling)和有效性掩码(validity masking)等改进机制,显著提升性能(如加权Overall得分从47.8提升至52.5),验证了自动化策略优化发现的可行性并支持可解释的设计原则。
链接: https://arxiv.org/abs/2603.23951
作者: Sirui Xia,Yikai Zhang,Aili Chen,Siye Wu,Siyu Yuan,Yanghua Xiao
机构: Fudan University (复旦大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Discovering improved policy optimization algorithms for language models remains a costly manual process requiring repeated mechanism-level modification and validation. Unlike simple combinatorial code search, this problem requires searching over algorithmic mechanisms tightly coupled with training dynamics while reusing empirical evidence across iterations. We propose POISE, a closed-loop framework for automated discovery of policy optimization algorithms for language models. POISE maintains a structured, genealogically linked archive linking proposals, executable implementations, standardized evaluations, and natural-language reflections to support evidence-driven iteration. In mathematical reasoning experiments starting from GRPO, POISE evaluates 64 candidate algorithms and discovers improved mechanisms, including analytic-variance scaling and validity masking. The best variant improves weighted Overall from 47.8 to 52.5 (+4.6) and increases AIME25 pass@32 from 26.7% to 43.3%, demonstrating the feasibility of automated policy optimization discovery while supporting interpretable design principles.
[NLP-36] Argument Mining as a Text-to-Text Generation Task
【速读】: 该论文旨在解决传统论点挖掘(Argument Mining, AM)方法中因需分步执行多个子任务(如跨度识别、组件分类和关系分类)而导致的模型复杂性高、后处理规则依赖性强及超参数搜索空间扩大等问题。其解决方案的关键在于提出一种基于预训练编码器-解码器语言模型的文本到文本生成方法,能够同时生成包含论点跨度、组件类型和关系的标注文本,从而无需任务特定的后处理步骤与繁琐的超参数调优,显著简化了流程并提升了适应不同论点结构类型的灵活性。
链接: https://arxiv.org/abs/2603.23949
作者: Masayuki Kawarada,Tsutomu Hirao,Wataru Uchida,Masaaki Nagata
机构: NTT DOCOMO, INC.(NTT DOCOMO公司); NTT Corporation(NTT公司)
类目: Computation and Language (cs.CL)
备注:
Abstract:Argument Mining (AM) aims to uncover the argumentative structures within a text. Previous methods require several subtasks, such as span identification, component classification, and relation classification. Consequently, these methods need rule-based postprocessing to derive argumentative structures from the output of each subtask. This approach adds to the complexity of the model and expands the search space of the hyperparameters. To address this difficulty, we propose a simple yet strong method based on a text-to-text generation approach using a pretrained encoder-decoder language model. Our method simultaneously generates argumentatively annotated text for spans, components, and relations, eliminating the need for task-specific postprocessing and hyperparameter tuning. Furthermore, because it is a straightforward text-to-text generation method, we can easily adapt our approach to various types of argumentative structures. Experimental results demonstrate the effectiveness of our method, as it achieves state-of-the-art performance on three different types of benchmark datasets: the Argument-annotated Essays Corpus (AAEC), AbstRCT, and the Cornell eRulemaking Corpus (CDCP).
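The core move, turning spans, component types, and relations into one flat annotated target string that a seq2seq model can generate, can be sketched as below. The bracket tag scheme is an assumed illustration, not the paper's exact linearization format:

```python
def linearize(text, components, relations):
    """Wrap each annotated span with its component type and outgoing relation,
    producing a single target string for text-to-text training. The tag
    format ([Claim ...], [Premise ... | supports C1]) is illustrative."""
    out = text
    # Replace longer spans first so a shorter span inside a tag is untouched.
    for comp in sorted(components, key=lambda c: -len(c["span"])):
        rel = next((r for r in relations if r["src"] == comp["id"]), None)
        suffix = f" | {rel['type']} {rel['dst']}" if rel else ""
        out = out.replace(comp["span"], f"[{comp['type']} {comp['span']}{suffix}]")
    return out

text = "Cars pollute. We should tax cars."
components = [{"id": "P1", "type": "Premise", "span": "Cars pollute."},
              {"id": "C1", "type": "Claim", "span": "We should tax cars."}]
relations = [{"src": "P1", "dst": "C1", "type": "supports"}]
result = linearize(text, components, relations)
```

Training pairs are then just (raw text, annotated text), so the same model and decoding procedure transfer across corpora with different argumentative structures by changing only the tag inventory.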
[NLP-37] OmniACBench: A Benchmark for Evaluating Context-Grounded Acoustic Control in Omni-Modal Models
【速读】: 该论文试图解决当前多模态模型在文本输出评估中表现优异,但缺乏对语音生成能力(即“能否正确说出答案”)的系统性评测问题。解决方案的关键在于提出OmniACBench基准,该基准通过设计包含3,559个验证实例的任务,要求模型根据语音指令、文本脚本和图像输入,以恰当的语调与方式朗读脚本,从而评估其在上下文感知的声学控制(context-grounded acoustic control)方面的综合能力。实验表明,模型在单一模态处理上表现良好,但在多模态信息融合以生成忠实语音方面存在显著瓶颈,揭示了未来研究应聚焦于提升跨模态语义整合能力。
链接: https://arxiv.org/abs/2603.23938
作者: Seunghee Kim,Bumkyu Park,Kyudan Jung,Joosung Lee,Soyoon Kim,Jeonghoon Kim,Taeuk Kim,Hwiyeol Jo
机构: Hanyang University (汉阳大学); Seoul National University (首尔国立大学); KAIST AI; NAVER Cloud
类目: Computation and Language (cs.CL)
备注:
Abstract:Most testbeds for omni-modal models assess multimodal understanding via textual outputs, leaving it unclear whether these models can properly speak their answers. To study this, we introduce OmniACBench, a benchmark for evaluating context-grounded acoustic control in omni-modal models. Given a spoken instruction, a text script, and an image, a model must read the script aloud with an appropriate tone and manner. OmniACBench comprises 3,559 verified instances covering six acoustic features: speech rate, phonation, pronunciation, emotion, global accent, and timbre. Extensive experiments on eight models reveal their limitations in the proposed setting, despite their strong performance on prior textual-output evaluations. Our analyses show that the main bottleneck lies not in processing individual modalities, but in integrating multimodal context for faithful speech generation. Moreover, we identify three common failure modes-weak direct control, failed implicit inference, and failed multimodal grounding-providing insights for developing models that can verbalize responses effectively.
[NLP-38] Dialogue to Question Generation for Evidence-based Medical Guideline Agent Development ALT ML4H
【速读】: 该论文试图解决在快节奏的初级诊疗环境中实施循证医学(Evidence-Based Medicine, EBM)的难题,具体表现为医生面临短时间问诊、患者负荷增加以及指南文档冗长难以实时查阅等问题。解决方案的关键在于利用大语言模型(Large Language Models, LLMs)作为环境辅助工具,在医患互动过程中生成有针对性的、基于指南的临床问题,从而降低认知负担并促进循证实践在有限时间内落地。研究聚焦于问题生成而非回答,并通过零样本基线与多阶段推理提示策略对比验证了LLMs在真实临床对话中生成具有临床意义和指南相关性的提问的可行性。
链接: https://arxiv.org/abs/2603.23937
作者: Zongliang Ji,Ziyang Zhang,Xincheng Tan,Matthew Thompson,Anna Goldenberg,Carl Yang,Rahul G. Krishnan,Fan Zhang
机构: Google Research; University of Toronto; Vector Institute; Emory University
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 9 pages. To appear in Proceedings of Machine Learning Research (PMLR), Machine Learning for Health (ML4H) Symposium 2025
Abstract:Evidence-based medicine (EBM) is central to high-quality care, but remains difficult to implement in fast-paced primary care settings. Physicians face short consultations, increasing patient loads, and lengthy guideline documents that are impractical to consult in real time. To address this gap, we investigate the feasibility of using large language models (LLMs) as ambient assistants that surface targeted, evidence-based questions during physician-patient encounters. Our study focuses on question generation rather than question answering, with the aim of scaffolding physician reasoning and integrating guideline-based practice into brief consultations. We implemented two prompting strategies, a zero-shot baseline and a multi-stage reasoning variant, using Gemini 2.5 as the backbone model. We evaluated on a benchmark of 80 de-identified transcripts from real clinical encounters, with six experienced physicians contributing over 90 hours of structured review. Results indicate that while general-purpose LLMs are not yet fully reliable, they can produce clinically meaningful and guideline-relevant questions, suggesting significant potential to reduce cognitive burden and make EBM more actionable at the point of care.
[NLP-39] ORACLE: Orchestrate NPC Daily Activities using Contrastive Learning with Transformer-CVAE
【速读】: 该论文旨在解决现有数字环境中非玩家角色(Non-player characters, NPCs)活动规划缺乏真实感的问题,尤其针对传统方法导致的单调重复现象,无法准确反映人类日常活动的复杂性和多样性。解决方案的关键在于提出ORACLE模型,该模型融合了Transformer的序列建模能力、条件变分自编码器(Conditional Variational Autoencoders, CVAE)的生成可控性以及对比学习的判别优化机制,从而在CASAS智能家庭数据集上实现对24小时室内活动序列的有效建模与生成,显著提升了NPC活动计划的真实性和多样性。
链接: https://arxiv.org/abs/2603.23933
作者: Seong-Eun Hong,JuYeong Hwang,RyunHa Lee,HyeongYeop Kang
机构: Korea University(高丽大学)
类目: Graphics (cs.GR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 17 pages, 7 figures. Accepted to CVM 2026
Abstract:The integration of Non-player characters (NPCs) within digital environments has been increasingly recognized for its potential to augment user immersion and cognitive engagement. The sophisticated orchestration of their daily activities, reflecting the nuances of human daily routines, contributes significantly to the realism of digital environments. Nevertheless, conventional approaches often produce monotonous repetition, falling short of capturing the intricacies of real human activity plans. In response to this, we introduce ORACLE, a novel generative model for the synthesis of realistic indoor daily activity plans, ensuring NPCs’ authentic presence in digital habitats. Exploiting the CASAS smart home dataset’s 24-hour indoor activity sequences, ORACLE addresses challenges in the dataset, including its imbalanced sequential data, the scarcity of training samples, and the absence of pre-trained models encapsulating human daily activity patterns. ORACLE’s training leverages the sequential data processing prowess of Transformers, the generative controllability of Conditional Variational Autoencoders (CVAE), and the discriminative refinement of contrastive learning. Our experimental results validate the superiority of generating NPC activity plans and the efficacy of our design strategies over existing methods.
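ORACLE 以 CVAE 为生成骨干。其所基于的标准条件变分自编码器目标(ELBO)如下,供参考;论文在此之上叠加的对比学习项在摘要中未给出具体形式,此处从略:

```latex
\mathcal{L}_{\mathrm{CVAE}}(\theta,\phi;\,x,c)
  = \mathbb{E}_{q_\phi(z\mid x,c)}\!\left[\log p_\theta(x\mid z,c)\right]
  - D_{\mathrm{KL}}\!\left(q_\phi(z\mid x,c)\,\Vert\,p_\theta(z\mid c)\right)
```

其中 \(x\) 为活动序列,\(c\) 为条件信息(如时间或活动类别),\(z\) 为隐变量;训练时最大化该下界。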
[NLP-40] Self-Distillation for Multi-Token Prediction
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)推理效率瓶颈问题,尤其是多标记预测(Multi-Token Prediction, MTP)方法中存在的两个关键挑战:MTP头的接受率较低,以及多个MTP头难以联合训练。解决方案的关键在于提出一种轻量级且高效的自蒸馏方法——MTP-D,其通过最小的额外训练成本显著提升MTP头的接受率(+7.5%),同时最大程度保留主头性能;此外,论文进一步引入循环扩展策略,实现MTP头的有效与经济扩展,使单头MTP推理速度提升达220.4%,从而显著增强MTP在LLMs中的实用性与可扩展性。
链接: https://arxiv.org/abs/2603.23911
作者: Guoliang Zhao,Ruobing Xie,An Wang,Shuaipeng Li,Huaibing Xie,Xingwu Sun
机构: Tencent(腾讯)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:As Large Language Models (LLMs) scale up, inference efficiency becomes a critical bottleneck. Multi-Token Prediction (MTP) could accelerate LLM inference by predicting multiple future tokens in parallel. However, existing MTP approaches still face two challenges: limited acceptance rates of MTP heads, and difficulties in jointly training multiple MTP heads. Therefore, we propose MTP-D, a simple yet effective self-distillation method with minimal additional training cost, which boosts MTP head acceptance rates (+7.5%) while maximumly preserving main-head performance. We also introduce a looped extension strategy for MTP-D, enabling effective and economical MTP head extension and further significant inference speedup to 1-head MTP (+220.4%). Moreover, we systematically explore and validate key insights on the distillation strategies and the potential scalability of MTP through extensive experiments on seven benchmarks. These results demonstrate that our MTP-D and looped extension strategy effectively enhance MTP-head performance and inference efficiency, facilitating the practical usage of MTP in LLMs.
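论文的 MTP-D 是一种自蒸馏训练方法,此处不复现其训练过程,仅用一个玩具脚本示意 MTP 头"起草—验证"式解码中接受率(acceptance rate)的含义:主头逐个校验草稿 token,遇到首个不匹配即停止。其中的"主头"与"草稿头"均为确定性桩函数,属于本文之外的假设,并非论文实现。

```python
# Toy sketch of draft-and-verify decoding with MTP heads (hypothetical
# stand-in functions; the paper's MTP-D distillation is not reproduced).

def greedy_main_head(prefix):
    # Stand-in for the main head: toy "model" predicting the next token
    # as (last token + 1) mod 10.
    return (prefix[-1] + 1) % 10

def mtp_draft(prefix, k):
    # Stand-in for k MTP heads proposing k future tokens in parallel.
    # The draft is imperfect: the 3rd proposal is deliberately off by one.
    draft = []
    last = prefix[-1]
    for i in range(k):
        tok = (last + 1) % 10
        if i == 2:                      # inject a drafting error
            tok = (tok + 1) % 10
        draft.append(tok)
        last = tok
    return draft

def accept_rate(prefix, k=4):
    """Verify draft tokens against the main head; stop at first mismatch."""
    draft = mtp_draft(prefix, k)
    accepted = 0
    cur = list(prefix)
    for tok in draft:
        if greedy_main_head(cur) == tok:
            accepted += 1
            cur.append(tok)
        else:
            break
    return accepted / k

print(accept_rate([0]))  # first two drafts match, third is wrong -> 0.5
```

接受率越高,每次前向可兑现的草稿 token 越多,端到端加速越明显,这正是 MTP-D 提升 +7.5% 接受率的意义所在。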
[NLP-41] BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在长期对话中对用户信念动态变化建模不足的问题。现有基准普遍将用户信息视为静态事实,忽略了人类在多轮交互中可能出现的意见漂移(opinion drift)、过度对齐(over-alignment)和确认偏误(confirmation bias)等现象。为此,作者提出了BeliefShift这一纵向评估基准,专门用于衡量LLM在跨会话交互中的信念演化能力,包含时间一致性、矛盾检测与证据驱动修正三个评估维度,并构建了涵盖健康、政治、个人价值观和产品偏好等领域的2,400条人工标注的多轮交互轨迹数据集。关键创新在于引入四类新指标:信念修正准确率(Belief Revision Accuracy, BRA)、漂移一致性评分(Drift Coherence Score, DCS)、矛盾解决率(Contradiction Resolution Rate, CRR)和证据敏感性指数(Evidence Sensitivity Index, ESI),从而揭示出模型在个性化与事实准确性之间的权衡关系:激进个性化模型抗漂移能力弱,而基于事实的模型则可能忽略合法的信念更新。
链接: https://arxiv.org/abs/2603.23848
作者: Praveen Kumar Myakala,Manan Agrawal,Rahul Manche
机构: University of North Carolina at Charlotte (北卡罗来纳大学夏洛特分校)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:LLMs are increasingly used as long-running conversational agents, yet every major benchmark evaluating their memory treats user information as static facts to be stored and retrieved. That’s the wrong model. People change their minds, and over extended interactions, phenomena like opinion drift, over-alignment, and confirmation bias start to matter a lot. BeliefShift introduces a longitudinal benchmark designed specifically to evaluate belief dynamics in multi-session LLM interactions. It covers three tracks: Temporal Belief Consistency, Contradiction Detection, and Evidence-Driven Revision. The dataset includes 2,400 human-annotated multi-session interaction trajectories spanning health, politics, personal values, and product preferences. We evaluate seven models including GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, LLaMA-3, and Mistral-Large under zero-shot and retrieval-augmented generation (RAG) settings. Results reveal a clear trade-off: models that personalize aggressively resist drift poorly, while factually grounded models miss legitimate belief updates. We further introduce four novel evaluation metrics: Belief Revision Accuracy (BRA), Drift Coherence Score (DCS), Contradiction Resolution Rate (CRR), and Evidence Sensitivity Index (ESI).
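论文仅给出 BRA 等指标的名称,未在摘要中给出定义;下面是一个完全假设的 BRA 式打分草图:若模型在证据回合之后表述的信念与标注的修正后信念一致,则该轨迹计为正确。其中字段名与打分规则均为本文假设,并非论文定义。

```python
# Hypothetical sketch of a Belief Revision Accuracy (BRA)-style metric.
# Field names and the scoring rule are assumptions, not the paper's spec.

def belief_revision_accuracy(trajectories):
    """Fraction of trajectories whose post-evidence belief matches gold."""
    hits = 0
    for traj in trajectories:
        if traj["belief_after_evidence"] == traj["gold_revised_belief"]:
            hits += 1
    return hits / len(trajectories)

trajs = [
    {"belief_after_evidence": "vaccine-safe",   "gold_revised_belief": "vaccine-safe"},
    {"belief_after_evidence": "vaccine-unsafe", "gold_revised_belief": "vaccine-safe"},
]
print(belief_revision_accuracy(trajs))  # -> 0.5
```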
[NLP-42] Language Model Planners do not Scale but do Formalizers?
【速读】: 该论文旨在解决大型语言模型(LLM)在复杂规划问题中表现不佳的问题,特别是针对LLM形式化器(formalizer)能否有效生成面向求解器的程序(如PDDL)这一关键问题。研究表明,与LLM规划器相比,LLM形式化器在复杂任务中具有显著优势,例如在状态空间高达10^165的经典BlocksWorld域中仍能保持完美准确率;同时,论文提出分而治之(divide-and-conquer)的形式化策略以提升小型模型的鲁棒性。为应对“展开问题”(unraveling problems)——即自然语言描述与形式语言之间存在指数级映射关系的挑战,论文引入了“LLM作为高阶形式化器”(LLM-as-higher-order-formalizer)的新范式,通过让LLM生成程序生成器(program generator),从而将token输出与底层形式化及搜索空间的组合爆炸解耦,这是其解决方案的关键创新点。
链接: https://arxiv.org/abs/2603.23844
作者: Owen Jiang,Cassie Huang,Ashish Sabharwal,Li Zhang
机构: Drexel University (德雷塞尔大学); Allen Institute for AI (艾伦人工智能研究所)
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent work shows overwhelming evidence that LLMs, even those trained to scale their reasoning trace, perform unsatisfactorily when solving planning problems too complex. Whether the same conclusion holds for LLM formalizers that generate solver-oriented programs remains unknown. We systematically show that LLM formalizers greatly out-scale LLM planners, some retaining perfect accuracy in the classic BlocksWorld domain with a huge state space of size up to 10^165 . While performance of smaller LLM formalizers degrades with problem complexity, we show that a divide-and-conquer formalizing technique can greatly improve its robustness. Finally, we introduce unraveling problems where one line of problem description realistically corresponds to exponentially many lines of formal language such as the Planning Domain Definition Language (PDDL), greatly challenging LLM formalizers. We tackle this challenge by introducing a new paradigm, namely LLM-as-higher-order-formalizer, where an LLM generates a program generator. This decouples token output from the combinatorial explosion of the underlying formalization and search space.
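为直观说明"一行自然语言可展开为指数多行形式语言"(unraveling)这一点,下面用一个确定性的桩函数把"n 个积木都在桌上"这样一句 BlocksWorld 描述展开为 PDDL 初始状态事实。真实系统中该形式化步骤由 LLM(或 LLM 生成的程序生成器)完成,此处的实现纯属示意性假设。

```python
# Toy "formalizer": one natural-language line ("blocks b1..bn are all on
# the table, hand empty") unravels into 2n+1 PDDL init facts. A real
# system would use an LLM; this deterministic stub only illustrates scale.

def formalize_init(n_blocks):
    facts = ["(handempty)"]
    for i in range(1, n_blocks + 1):
        facts.append(f"(ontable b{i})")
        facts.append(f"(clear b{i})")
    return "(:init " + " ".join(facts) + ")"

print(formalize_init(3))
# (:init (handempty) (ontable b1) (clear b1) (ontable b2) (clear b2)
#        (ontable b3) (clear b3))
```

"LLM-as-higher-order-formalizer"范式的要点正在于此:让 LLM 输出上面这类生成器程序,而不是逐行输出事实本身,从而使 token 开销与事实数量脱钩。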
[NLP-43] PoliticsBench: Benchmarking Political Values in Large Language Models with Multi-Turn Roleplay
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在政治立场上可能存在的系统性偏倚问题,特别是现有基准测试多集中于性别与种族刻板印象,而对塑造社会政治倾向的具体价值观缺乏细粒度评估。其解决方案的关键在于提出并应用PoliticsBench——一个基于EQ-Bench-v3心理测量基准改编的多轮角色扮演框架,通过二十个逐步演化的场景,让八种主流LLM在自由文本交互中表达立场并做出决策,进而以十项政治价值观为尺度量化其偏离中立标准的程度。研究发现七款模型呈现左倾倾向,Grok为右倾,且多数模型在角色扮演后期并未表现出显著偏倚加剧趋势,但Grok更倾向于使用事实和数据进行论证,从而首次实现了对LLM政治价值观的系统性、多阶段心理测量评估。
链接: https://arxiv.org/abs/2603.23841
作者: Rohan Khetan,Ashna Khetan
机构: Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, 8 tables, 3 figures
Abstract:While Large Language Models (LLMs) are increasingly used as primary sources of information, their potential for political bias may impact their objectivity. Existing benchmarks of LLM social bias primarily evaluate gender and racial stereotypes. When political bias is included, it is typically measured at a coarse level, neglecting the specific values that shape sociopolitical leanings. This study investigates political bias in eight prominent LLMs (Claude, Deepseek, Gemini, GPT, Grok, Llama, Qwen Base, Qwen Instruction-Tuned) using PoliticsBench: a novel multi-turn roleplay framework adapted from the EQ-Bench-v3 psychometric benchmark. We test whether commercially developed LLMs display a systematic left-leaning bias that becomes more pronounced in later stages of multi-stage roleplay. Through twenty evolving scenarios, each model reported its stance and determined its course of action. Scoring these responses on a scale of ten political values, we explored the values underlying chatbots’ deviations from unbiased standards. Seven of our eight models leaned left, while Grok leaned right. Each left-leaning LLM strongly exhibited liberal traits and moderately exhibited conservative ones. We discovered slight variations in alignment scores across stages of roleplay, with no particular pattern. Though most models used consequence-based reasoning, Grok frequently argued with facts and statistics. Our study presents the first psychometric evaluation of political values in LLMs through multi-stage, free-text interactions.
[NLP-44] VehicleMemBench: An Executable Benchmark for Multi-User Long-Term Memory in In-Vehicle Agents
【速读】: 该论文旨在解决当前车载智能代理(Vehicle-based Agents)在多用户、长期记忆演化场景下,因缺乏动态偏好建模与可靠决策能力而难以适应真实驾驶环境的问题。现有基准测试大多局限于单用户、静态问答任务,无法捕捉用户偏好随时间变化及多用户间冲突的复杂性。其解决方案的关键在于提出VehicleMemBench——一个基于可执行车载仿真环境的多用户长上下文记忆基准,通过对比动作后的环境状态与预设目标状态来客观评估工具使用和记忆能力,无需依赖大语言模型(LLM)或人工评分。该基准包含23个工具模块和每样本超80条历史记忆事件,实验表明即使先进模型在直接指令任务中表现良好,仍难以应对偏好动态变化下的记忆演进挑战,凸显了对更鲁棒、领域特异的记忆管理机制的需求。
链接: https://arxiv.org/abs/2603.23840
作者: Yuhao Chen,Yi Xu,Xinyun Ding,Xiang Fang,Shuochen Liu,Luxi Lin,Qingyu Zhang,Ya Li,Quan Liu,Tong Xu
机构: University of Science and Technology of China (中国科学技术大学); iFLYTEK Research (科大讯飞研究院); Xiamen University (厦门大学); University of Chinese Academy of Sciences (中国科学院大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:With the growing demand for intelligent in-vehicle experiences, vehicle-based agents are evolving from simple assistants to long-term companions. This evolution requires agents to continuously model multi-user preferences and make reliable decisions in the face of inter-user preference conflicts and changing habits over time. However, existing benchmarks are largely limited to single-user, static question-answer settings, failing to capture the temporal evolution of preferences and the multi-user, tool-interactive nature of real vehicle environments. To address this gap, we introduce VehicleMemBench, a multi-user long-context memory benchmark built on an executable in-vehicle simulation environment. The benchmark evaluates tool use and memory by comparing the post-action environment state with a predefined target state, enabling objective and reproducible evaluation without LLM-based or human scoring. VehicleMemBench includes 23 tool modules, and each sample contains over 80 historical memory events. Experiments show that powerful models perform well on direct instruction tasks but struggle in scenarios involving memory evolution, particularly when user preferences change dynamically. Even advanced memory systems struggle to handle domain-specific memory requirements in this environment. These findings highlight the need for more robust and specialized memory management mechanisms to support long-term adaptive decision-making in real-world in-vehicle systems. To facilitate future research, we release the data and code.
[NLP-45] How Vulnerable Are Edge LLMs?
【速读】: 该论文旨在解决量化部署在边缘设备上的大语言模型(Large Language Models, LLMs)面临的一种新型安全风险:即在有限查询预算下,攻击者能否通过设计特定查询序列从量化后的模型中提取出其语义知识。研究发现,尽管量化引入了噪声,但并未消除模型的底层语义信息,使得基于查询的知识提取仍具可行性。解决方案的关键在于提出了一种结构化的查询构造框架——CLIQ(Clustered Instruction Querying),该方法通过聚类指令实现语义覆盖最大化并减少冗余,从而显著提升在有限查询次数内的知识提取效率。实验表明,CLIQ在INT8和INT4量化版本的Qwen模型上均优于原始查询策略,在BERTScore、BLEU和ROUGE等指标上表现更优,揭示了量化本身不足以提供有效防护,为边缘部署LLM的安全设计提供了重要警示。
链接: https://arxiv.org/abs/2603.23822
作者: Ao Ding,Hongzong Li,Zi Liang,Zhanpeng Shi,Shuxin Zhuang,Shiqin Tang,Rong Feng,Ping Lu
机构: China University of Geoscience Beijing(中国地质大学(北京)); Hong Kong University of Science and Technology(香港科技大学); Hong Kong Polytechnic University(香港理工大学); Jilin University(吉林大学); City University of Hong Kong(香港城市大学); Chinese Academy of Sciences(中国科学院); City University of Hong Kong (Dongguan)(香港城市大学(东莞))
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) are increasingly deployed on edge devices under strict computation and quantization constraints, yet their security implications remain unclear. We study query-based knowledge extraction from quantized edge-deployed LLMs under realistic query budgets and show that, although quantization introduces noise, it does not remove the underlying semantic knowledge, allowing substantial behavioral recovery through carefully designed queries. To systematically analyze this risk, we propose CLIQ (Clustered Instruction Querying), a structured query construction framework that improves semantic coverage while reducing redundancy. Experiments on quantized Qwen models (INT8/INT4) demonstrate that CLIQ consistently outperforms original queries across BERTScore, BLEU, and ROUGE, enabling more efficient extraction under limited budgets. These results indicate that quantization alone does not provide effective protection against query-based extraction, highlighting a previously underexplored security risk in edge-deployed LLMs.
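CLIQ 的具体聚类流程摘要中未给出。下面用贪心"最远点"选择近似其"最大化语义覆盖、减少冗余"的思路:在查询预算内,每次挑选距已选集合最远的查询,自动跳过语义近重复项。其中的二维 embedding 为玩具数据,整段实现均为本文假设的示意,并非论文方法。

```python
# Sketch of budget-constrained diverse query selection, in the spirit of
# CLIQ's clustered querying (the paper's exact procedure may differ).
# Embeddings are toy 2-D points standing in for instruction embeddings.

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def select_diverse(embeddings, budget):
    """Greedy farthest-point selection: maximize coverage, cut redundancy."""
    chosen = [0]  # seed with the first query
    while len(chosen) < budget:
        # pick the candidate farthest from its nearest chosen point
        best = max(
            (i for i in range(len(embeddings)) if i not in chosen),
            key=lambda i: min(dist2(embeddings[i], embeddings[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen

# indices 0/1 and 2/3 are near-duplicates; selection skips the duplicates
queries = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (0, 9)]
print(select_diverse(queries, 3))  # -> [0, 4, 3]
```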
[NLP-46] Perturbation: A simple and efficient adversarial tracer for representation learning in language models
【速读】: 该论文试图解决深度神经语言模型(Language Models, LMs)中表征学习的长期难题,即如何在不施加不切实际的约束(如线性假设)或过度简化表征概念的前提下,准确识别出有意义的语义表征。传统方法面临两难:要么强制引入几何约束导致结果失真,要么无法区分有效表征与随机噪声。论文的关键解决方案在于重构“表征”的定义——不再将其视为激活模式,而是将其理解为学习过程中的信息传递通道。具体而言,作者通过微调模型以单个对抗样本作为扰动源,测量该扰动在其他样本间的传播效应(即“感染”程度),从而揭示模型内部结构化的迁移特性。此方法无需几何假设,且能有效避免在未训练模型中误判表征,同时在训练后的模型中显示出多粒度的语言抽象能力,表明语言模型确实能从经验中习得具有泛化能力的表征结构。
链接: https://arxiv.org/abs/2603.23821
作者: Joshua Rozner,Cory Shain
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Linguistic representation learning in deep neural language models (LMs) has been studied for decades, for both practical and theoretical reasons. However, finding representations in LMs remains an unsolved problem, in part due to a dilemma between enforcing implausible constraints on representations (e.g., linearity; Arora et al. 2024) and trivializing the notion of representation altogether (Sutter et al., 2025). Here we escape this dilemma by reconceptualizing representations not as patterns of activation but as conduits for learning. Our approach is simple: we perturb an LM by fine-tuning it on a single adversarial example and measure how this perturbation "infects" other examples. Perturbation makes no geometric assumptions, and unlike other methods, it does not find representations where it should not (e.g., in untrained LMs). But in trained LMs, perturbation reveals structured transfer at multiple linguistic grain sizes, suggesting that LMs both generalize along representational lines and acquire linguistic abstractions from experience alone.
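下面用一个一维逻辑回归玩具模型示意"扰动作为示踪剂"的度量思路:在单个改标签(对抗)样本上做一步 SGD,再测量其他样本损失的变化量,即扰动的"感染"程度。该玩具设定(模型、数据、学习率)均为假设,与论文的语言模型实验无关,仅演示机制。

```python
import math

# Toy illustration of "perturbation as a tracer": take one SGD step on a
# single relabeled (adversarial) example, then measure how the loss on
# every other example moves. 1-D logistic model, not a real LM.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, y):
    p = sigmoid(w * x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# data: positive x -> label 1, negative x -> label 0; w = 2 fits it well
data = [(2.0, 1), (1.5, 1), (-2.0, 0)]
w = 2.0

# adversarial example: x = 1.8 relabeled as 0; one gradient step on it
x_adv, y_adv = 1.8, 0
grad = (sigmoid(w * x_adv) - y_adv) * x_adv   # d(BCE loss)/dw
w_new = w - 0.5 * grad

# "infection": loss change on the untouched examples
deltas = [loss(w_new, x, y) - loss(w, x, y) for x, y in data]
print([round(d, 3) for d in deltas])
```

在这个玩具里,扰动把 w 拉向 0,所有样本的损失都上升;论文的观察对象则是这种损失变化在语言样本之间呈现的结构化模式。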
[NLP-47] Infrequent Child-Directed Speech Is Bursty and May Draw Infant Vocalizations
【速读】: 该论文试图解决的问题是:在成人对婴幼儿的直接言语输入(child-directed speech, CDS)较少的环境中,婴儿如何仍能达成语言发展的关键里程碑。研究发现,尽管玻利维亚农村地区儿童接触的CDS频率低于美国城市地区,但其CDS在时间上同样呈现高度集中(temporal clustering)的模式,即以短时密集爆发的形式出现,而非均匀分布于全天。关键解决方案在于揭示了CDS的时间集中性(temporal concentration)和发声来源多样性(source diversity),特别是指出年长儿童在低频成人CDS环境中可作为重要的言语输入源,从而解释了语言发展不受限的机制——即CDS的质量(如集中性和来源)可能比单纯数量更为关键。
链接: https://arxiv.org/abs/2603.23797
作者: Margaret Cychosz,Adriana Weisleder
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Children in many parts of the world hear relatively little speech directed to them, yet still reach major language development milestones. What differs about the speech input that infants learn from when directed input is rare? Using longform, infant-centered audio recordings taken in rural Bolivia and the urban U.S., we examined temporal patterns of infants’ speech input and their pre-linguistic vocal behavior. We find that child-directed speech in Bolivia, though less frequent, was just as temporally clustered as speech input in the U.S, arriving in concentrated bursts rather than spread across the day. In both communities, infants were most likely to produce speech-like vocalizations during periods of speech directed to them, with the probability of infants’ speech-like vocalizations during target child-directed speech nearly double that during silence. In Bolivia, infants’ speech-like vocalizations were also more likely to occur during bouts of directed speech from older children than from adults. Together, these findings suggest that the developmental impact of child-directed speech may depend not only on quantity, but on temporal concentration and source, with older children serving as an important source of input in some communities, including where adult speech to infants is less frequent.
[NLP-48] IslamicMMLU: A Benchmark for Evaluating LLMs on Islamic Knowledge
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在伊斯兰知识领域缺乏系统性评估的问题。现有研究未建立涵盖核心伊斯兰学科(如《古兰经》、圣训和法理学)的综合性基准测试,导致对LLMs在该领域的性能认知不足。解决方案的关键在于构建IslamicMMLU——一个包含10,013道多选题的基准数据集,覆盖三个子任务:《古兰经》(2,013题)、圣训(4,000题)和法理学(Fiqh,4,000题),并设计多样化题型以全面考察模型对伊斯兰知识不同维度的理解能力。此外,该研究还引入了一个新颖的“教法学派(madhab)偏见检测任务”,揭示了不同模型在伊斯兰法学流派偏好上的差异,并基于此建立了公开排行榜,为后续研究提供标准化评估工具。
链接: https://arxiv.org/abs/2603.23750
作者: Ali Abdelaal,Mohammed Nader Al Haffar,Mahmoud Fawzi,Walid Magdy
机构: The University of Edinburgh (爱丁堡大学)
类目: Computation and Language (cs.CL)
备注: Leaderboard link: this https URL
Abstract:Large language models are increasingly consulted for Islamic knowledge, yet no comprehensive benchmark evaluates their performance across core Islamic disciplines. We introduce IslamicMMLU, a benchmark of 10,013 multiple-choice questions spanning three tracks: Quran (2,013 questions), Hadith (4,000 questions), and Fiqh (jurisprudence, 4,000 questions). Each track comprises multiple question types to examine LLMs’ capabilities in handling different aspects of Islamic knowledge. The benchmark underpins the IslamicMMLU public leaderboard for evaluating LLMs, and we initially evaluate 26 LLMs, whose average accuracy across the three tracks ranges from 39.8% to 93.8% (the latter achieved by Gemini 3 Flash). The Quran track shows the widest span (99.3% to 32.4%), while the Fiqh track includes a novel madhab (Islamic school of jurisprudence) bias detection task revealing variable school-of-thought preferences across models. Arabic-specific models show mixed results, but they all underperform compared to frontier models. The evaluation code and leaderboard are made publicly available.
[NLP-49] LLMs Do Not Grade Essays Like Humans
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在自动作文评分(Automated Essay Scoring, AES)中的评分一致性问题,即LLM生成的分数与人类评分者之间是否存在可靠的一致性。其关键解决方案在于:在无需针对特定任务进行微调(out-of-the-box setting)的前提下,系统评估多个来自GPT和Llama系列的LLM在真实作文上的评分表现,并分析其评分行为与人类评分的差异。研究发现,尽管LLM评分与其生成的反馈具有一致性,但其依赖的评判信号与人类评分者不同,导致其评分在短文或语法错误较多的长文中表现出系统性偏差,这揭示了LLM在作文评分中虽具实用性,但仍需进一步对齐人类评价标准。
链接: https://arxiv.org/abs/2603.23714
作者: Jerin George Mathew,Sumayya Taher,Anindita Kundu,Denilson Barbosa
机构: University of Alberta (阿尔伯塔大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Large language models have recently been proposed as tools for automated essay scoring, but their agreement with human grading remains unclear. In this work, we evaluate how LLM-generated scores compare with human grades and analyze the grading behavior of several models from the GPT and Llama families in an out-of-the-box setting, without task-specific training. Our results show that agreement between LLM and human scores remains relatively weak and varies with essay characteristics. In particular, compared to human raters, LLMs tend to assign higher scores to short or underdeveloped essays, while assigning lower scores to longer essays that contain minor grammatical or spelling errors. We also find that the scores generated by LLMs are generally consistent with the feedback they generate: essays receiving more praise tend to receive higher scores, while essays receiving more criticism tend to receive lower scores. These results suggest that LLM-generated scores and feedback follow coherent patterns but rely on signals that differ from those used by human raters, resulting in limited alignment with human grading practices. Nevertheless, our work shows that LLMs produce feedback that is consistent with their grading and that they can be reliably used in supporting essay scoring.
[NLP-50] The Diminishing Returns of Early-Exit Decoding in Modern LLMs
【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)推理过程中因层间冗余减少而导致的早期退出(early-exit)机会受限问题。随着现代LLM采用更优的预训练方法和架构设计,中间层表示的判别能力下降,使得在早期层即可获得高置信度预测的能力减弱,从而限制了通过早期退出降低延迟与计算成本的潜力。解决方案的关键在于提出一种量化模型内在早期退出适配性的指标,并构建基准测试框架,用于系统评估不同模型架构(如密集Transformer、混合专家模型和状态空间模型)及参数规模对早期退出效果的影响。研究发现,较大规模(>20B参数)且未经特定微调的基础预训练模型具有更高的早期退出潜力,而密集Transformer相比其他架构更具优势。
链接: https://arxiv.org/abs/2603.23701
作者: Rui Wei,Rui Du,Hanfei Yu,Devesh Tiwari,Jian Li,Zhaozhuo Xu,Hao Wang
机构: Stevens Institute of Technology; Northeastern University; Stony Brook University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:In Large Language Model (LLM) inference, early-exit refers to stopping computation at an intermediate layer once the prediction is sufficiently confident, thereby reducing latency and cost. However, recent LLMs adopt improved pretraining recipes and architectures that reduce layer redundancy, potentially limiting early-exit opportunities. We re-evaluate layer-wise early-exit in modern LLMs and analyze how intermediate representations evolve during training. We introduce a metric to quantify a model’s intrinsic suitability for early-exit and propose a benchmark for researchers to explore the potential early-exit benefits on different models and workloads. Our results show a diminishing trend in early-exit effectiveness across newer model generations. We further find that dense transformers generally offer greater early-exit potential than Mixture-of-Experts and State Space Models. In addition, larger models, particularly those with more than 20 billion parameters, and base pretrained models without specialized tuning tend to exhibit higher early-exit potential.
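下面给出层级早期退出的最小示意:在每个中间层对 logits 做 softmax,一旦最大置信度超过阈值即在该层退出并输出预测。各层 logits 与阈值均为假设的玩具数据,仅演示机制,并非论文的度量或实现。

```python
import math

# Minimal early-exit sketch: stop at the first intermediate layer whose
# softmax confidence clears a threshold. Layer "logits" are toy values.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def early_exit(layer_logits, threshold=0.9):
    """Return (exit_layer, argmax token) at the first confident layer."""
    for depth, logits in enumerate(layer_logits):
        probs = softmax(logits)
        conf = max(probs)
        if conf >= threshold:
            return depth, probs.index(conf)
    return len(layer_logits) - 1, probs.index(conf)  # fall through to last

# confidence sharpens with depth; the exit fires at layer 2 of 4
layers = [[1.0, 1.1, 0.9], [2.0, 3.0, 0.5], [1.0, 6.0, 0.0], [1.0, 9.0, 0.0]]
print(early_exit(layers))  # -> (2, 1)
```

论文的发现可以用这个视角理解:新一代模型的中间层置信度更晚才"达标",因此可提前退出的层数减少,收益递减。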
[NLP-51] PLACID: Privacy-preserving Large language models for Acronym Clinical Inference and Disambiguation
【速读】: 该论文旨在解决医疗领域中大型语言模型(Large Language Models, LLMs)因数据隐私限制而难以集成的问题,特别是临床文本中歧义缩写的误解释可能引发严重医疗后果(如危及生命的用药错误)。解决方案的关键在于提出一种隐私保护的级联式处理流程:首先利用通用本地小参数模型(2B–10B参数规模)高精度检测临床缩写(准确率约0.988),随后将检测到的缩写路由至专用生物医学模型进行上下文相关的扩展,从而显著提升缩写消歧的准确性(从约0.655提升至约0.81),整个过程在设备端完成,无需上传敏感健康信息至云端,有效保障了患者数据隐私。
链接: https://arxiv.org/abs/2603.23678
作者: Manjushree B. Aithal,Ph.D.,Alexander Kotz,James Mitchell,Ph.D.
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 2 figures, Under review AMIA Symposium
Abstract:Large Language Models (LLMs) offer transformative solutions across many domains, but healthcare integration is hindered by strict data privacy constraints. Clinical narratives are dense with ambiguous acronyms, and misinterpreting these abbreviations can precipitate severe outcomes like life-threatening medication errors. While cloud-dependent LLMs excel at Acronym Disambiguation, transmitting Protected Health Information to external servers violates privacy frameworks. To bridge this gap, this study pioneers the evaluation of small-parameter models deployed entirely on-device to ensure privacy preservation. We introduce a privacy-preserving cascaded pipeline leveraging general-purpose local models to detect clinical acronyms, routing them to domain-specific biomedical models for context-relevant expansions. Results reveal that while general instruction-following models achieve high detection accuracy (~0.988), their expansion capabilities plummet (~0.655). Our cascaded approach utilizes domain-specific medical models to increase expansion accuracy to ~0.81. This novel work demonstrates that privacy-preserving, on-device (2B-10B) models deliver high-fidelity clinical acronym disambiguation support.
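下面是级联流程的示意草图:先由轻量检测器(此处用正则桩代替)找出缩写,再路由到领域扩展器(此处用词典桩代替)。真实系统中两级均为本地模型,并需结合上下文消歧(例如"PT"可指 physical therapy 或 prothrombin time);此处的正则与词典均为本文假设,并非论文实现。

```python
import re

# Sketch of the cascaded idea: a lightweight detector finds acronym
# spans, then routes them to a domain "expander". Both stages are stubs
# standing in for the general-purpose and biomedical models.

ACRONYM_RE = re.compile(r"\b[A-Z]{2,5}\b")

# hypothetical clinical dictionary; a real system disambiguates in context
EXPANSIONS = {"HTN": "hypertension", "SOB": "shortness of breath"}

def detect(note):
    return ACRONYM_RE.findall(note)

def expand(note):
    out = note
    for acr in detect(note):
        if acr in EXPANSIONS:
            out = out.replace(acr, f"{acr} ({EXPANSIONS[acr]})")
    return out

note = "Pt presents with SOB and a history of HTN."
print(expand(note))
# Pt presents with SOB (shortness of breath) and a history of HTN (hypertension).
```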
[NLP-52] Probing Ethical Framework Representations in Large Language Models: Structure Entanglement and Methodological Challenges
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在进行伦理判断时,其内部表征是否能区分不同的规范性伦理框架(normative frameworks),还是将伦理简化为单一的可接受性维度(acceptability dimension)这一核心问题。解决方案的关键在于通过多维度探针(probes)分析六种不同规模(4B–72B参数)的LLM中五类伦理框架(义务论、功利主义、美德伦理、正义伦理和常识伦理)的隐藏表征空间,发现这些框架在模型内部存在分化且具有不对称的迁移模式——例如,义务论探针可部分泛化至美德伦理场景,而常识伦理探针在正义伦理任务上则出现灾难性失效。这一方法揭示了模型中伦理认知的结构特征,但也提示需谨慎解释结果,因探针可能依赖基准模板的表面特征。
链接: https://arxiv.org/abs/2603.23659
作者: Weilun Xu,Alexander Rusnak,Frederic Kaplan
机构: École Polytechnique Fédérale de Lausanne (洛桑联邦理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:When large language models make ethical judgments, do their internal representations distinguish between normative frameworks, or collapse ethics into a single acceptability dimension? We probe hidden representations across five ethical frameworks (deontology, utilitarianism, virtue, justice, commonsense) in six LLMs spanning 4B–72B parameters. Our analysis reveals differentiated ethical subspaces with asymmetric transfer patterns – e.g., deontology probes partially generalize to virtue scenarios while commonsense probes fail catastrophically on justice. Disagreement between deontological and utilitarian probes correlates with higher behavioral entropy across architectures, though this relationship may partly reflect shared sensitivity to scenario difficulty. Post-hoc validation reveals that probes partially depend on surface features of benchmark templates, motivating cautious interpretation. We discuss both the structural insights these methods provide and their epistemological limitations.
[NLP-53] Ethio-ASR: Joint Multilingual Speech Recognition and Language Identification for Ethiopian Languages
[Quick Read]: This paper addresses the severe resource scarcity facing multilingual automatic speech recognition (ASR) for the major Ethiopian languages (Amharic, Tigrinya, Oromo, Sidaama, and Wolaytta), which belong to the Semitic, Cushitic, and Omotic branches of the Afroasiatic family yet remain heavily underrepresented in speech technology. The key to the solution is Ethio-ASR, a suite of multilingual ASR models based on connectionist temporal classification (CTC), jointly trained on the newly released WAXAL corpus with several pre-trained speech encoders. With substantially fewer parameters it outperforms strong baselines such as OmniASR (average word error rate of 30.48%), while a systematic analysis of gender bias, of the contribution of vowel length and consonant gemination to recognition errors, and of the training dynamics of multilingual CTC models improves robustness and interpretability.
Link: https://arxiv.org/abs/2603.23654
Authors: Badr M. Abdullah,Israel Abebe Azime,Atnafu Lambebo Tonja,Jesujoba O. Alabi,Abel Mulat Alemu,Eyob G. Hagos,Bontu Fufa Balcha,Mulubrhan A. Nerea,Debela Desalegn Yadeta,Dagnachew Mekonnen Marilign,Amanuel Temesgen Fentahun,Tadesse Kebede,Israel D. Gebru,Michael Melese Woldeyohannis,Walelign Tewabe Sewunetie,Bernd Möbius,Dietrich Klakow
Affiliations: Saarland University, Germany; University College London, UK; Ethiopian AI Institute, Ethiopia; HiLCoE, Ethiopia; Addis Ababa University, Ethiopia; University West, Sweden; Haramaya University, Ethiopia; Ethiopic.ai; AIMS - Research and Innovation Centre, Rwanda
Subjects: Computation and Language (cs.CL)
Comments: Preprint (under review)
Abstract:We present Ethio-ASR, a suite of multilingual CTC-based automatic speech recognition (ASR) models jointly trained on five Ethiopian languages: Amharic, Tigrinya, Oromo, Sidaama, and Wolaytta. These languages belong to the Semitic, Cushitic, and Omotic branches of the Afroasiatic family, and remain severely underrepresented in speech technology despite being spoken by the vast majority of Ethiopia’s population. We train our models on the recently released WAXAL corpus using several pre-trained speech encoders and evaluate against strong multilingual baselines, including OmniASR. Our best model achieves an average WER of 30.48% on the WAXAL test set, outperforming the best OmniASR model with substantially fewer parameters. We further provide a comprehensive analysis of gender bias, the contribution of vowel length and consonant gemination to ASR errors, and the training dynamics of multilingual CTC models. Our models and codebase are publicly available to the research community.
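The average WER the paper reports (30.48%) is the standard word error rate: edit distance over words divided by the reference length. A minimal reference implementation:

```python
# Word error rate (WER) = (substitutions + deletions + insertions) / reference
# length, computed via dynamic-programming edit distance over word sequences.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat"))       # 0.0
print(wer("the cat sat", "the bat sat down"))  # 1 sub + 1 ins over 3 words
```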
[NLP-54] Swiss-Bench SBP-002: A Frontier Model Comparison on Swiss Legal and Regulatory Tasks
[Quick Read]: This paper addresses the lack of a benchmark for evaluating frontier large language models (LLMs) on applied Swiss regulatory compliance tasks, particularly under zero-retrieval conditions. The key to the solution is Swiss-Bench SBP-002, an expert-crafted dataset spanning three Swiss regulatory domains (FINMA, Legal-CH, EFK), seven task types, and three languages (German, French, Italian), scored by a blind three-judge LLM panel (GPT-4o, Claude Sonnet 4, Qwen3-235B) with majority-vote aggregation (weighted kappa = 0.605), with reference answers validated by an independent human legal expert on a 100-item subset (73% rated Correct, none Incorrect, with perfect legal accuracy). The benchmark reveals marked performance stratification (Tiers A-C) on Swiss regulatory tasks and shows, in this setting, that open-weight models can match or even surpass closed-source counterparts on several tasks, providing an initial empirical reference point for future work.
Link: https://arxiv.org/abs/2603.23646
Authors: Fatih Uenal
Affiliations: University of Colorado Boulder
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 21 pages, 5 figures, 7 tables. Code and data: this https URL
Abstract:While recent work has benchmarked large language models on Swiss legal translation (Niklaus et al., 2025) and academic legal reasoning from university exams (Fan et al., 2025), no existing benchmark evaluates frontier model performance on applied Swiss regulatory compliance tasks. I introduce Swiss-Bench SBP-002, a trilingual benchmark of 395 expert-crafted items spanning three Swiss regulatory domains (FINMA, Legal-CH, EFK), seven task types, and three languages (German, French, Italian), and evaluate ten frontier models from March 2026 using a structured three-dimension scoring framework assessed via a blind three-judge LLM panel (GPT-4o, Claude Sonnet 4, Qwen3-235B) with majority-vote aggregation and weighted kappa = 0.605, with reference answers validated by an independent human legal expert on a 100-item subset (73% rated Correct, 0% Incorrect, perfect Legal Accuracy). Results reveal three descriptive performance clusters: Tier A (35-38% correct), Tier B (26-29%), and Tier C (13-21%). The benchmark proves difficult: even the top-ranked model (Qwen 3.5 Plus) achieves only 38.2% correct, with 47.3% incorrect and 14.4% partially correct. Task type difficulty varies widely: legal translation and case analysis yield 69-72% correct rates, while regulatory QA, hallucination detection, and gap analysis remain below 9%. Within this roster (seven open-weight, three closed-source), an open-weight model leads the ranking, and several open-weight models match or outperform their closed-source counterparts. These findings provide an initial empirical reference point for assessing frontier model capability on Swiss regulatory tasks under zero-retrieval conditions.
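The reported inter-judge agreement (weighted kappa = 0.605) can be computed as below. The quadratic weighting scheme is an assumption for illustration, since the abstract does not specify which weighting the panel used:

```python
# Weighted Cohen's kappa for two raters over k ordinal categories.
# Larger disagreements incur larger penalties (quadratic weights shown).

def weighted_kappa(rater_a, rater_b, k):
    n = len(rater_a)
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1 / n
    pa = [sum(row) for row in observed]                                # A's marginals
    pb = [sum(observed[i][j] for i in range(k)) for j in range(k)]     # B's marginals
    w = [[(i - j) ** 2 / (k - 1) ** 2 for j in range(k)] for i in range(k)]
    d_obs = sum(w[i][j] * observed[i][j] for i in range(k) for j in range(k))
    d_exp = sum(w[i][j] * pa[i] * pb[j] for i in range(k) for j in range(k))
    return 1 - d_obs / d_exp

# Perfect agreement gives kappa = 1; partial agreement lands below.
print(weighted_kappa([0, 1, 2, 2], [0, 1, 2, 2], k=3))  # 1.0
print(weighted_kappa([0, 0, 2, 2], [0, 0, 0, 2], k=3))
```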
[NLP-55] A Theory of LLM Information Susceptibility
[Quick Read]: This paper asks whether there are fundamental limits on how much a large language model (LLM), used as an optimization module in agentic systems, can improve the performance of a strategy set, and how architectures should be designed to maximize that improvement. The key to the solution is a theory of LLM information susceptibility, whose central hypothesis is that when computational resources are sufficiently large, the intervention of a fixed LLM does not increase the performance susceptibility of a strategy set with respect to budget. The authors develop a multi-variable utility-function framework that generalizes this hypothesis to architectures with multiple co-varying budget channels and characterize the conditions under which co-scaling can exceed the susceptibility bound. Empirical validation shows that nested co-scaling architectures open response channels unavailable to fixed configurations, clarifying when LLM intervention helps and when it does not, and suggesting that tools from statistical physics can provide predictive constraints on the design of AI systems.
Link: https://arxiv.org/abs/2603.23626
Authors: Zhuo-Yang Song,Hua Xing Zhu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Adaptation and Self-Organizing Systems (nlin.AO)
Comments: 16 pages, 9 figures
Abstract:Large language models (LLMs) are increasingly deployed as optimization modules in agentic systems, yet the fundamental limits of such LLM-mediated improvement remain poorly understood. Here we propose a theory of LLM information susceptibility, centred on the hypothesis that when computational resources are sufficiently large, the intervention of a fixed LLM does not increase the performance susceptibility of a strategy set with respect to budget. We develop a multi-variable utility-function framework that generalizes this hypothesis to architectures with multiple co-varying budget channels, and discuss the conditions under which co-scaling can exceed the susceptibility bound. We validate the theory empirically across structurally diverse domains and model scales spanning an order of magnitude, and show that nested, co-scaling architectures open response channels unavailable to fixed configurations. These results clarify when LLM intervention helps and when it does not, demonstrating that tools from statistical physics can provide predictive constraints for the design of AI systems. If the susceptibility hypothesis holds generally, the theory suggests that nested architectures may be a necessary structural condition for open-ended agentic self-improvement.
[NLP-56] Evaluating a Multi-Agent Voice-Enabled Smart Speaker for Care Homes: A Safety-Focused Framework
[Quick Read]: This paper targets the heavy administrative burden that keeps care-home staff from focusing on patient care, proposing a voice-enabled smart speaker to support everyday activities such as accessing resident records, setting reminders, and scheduling tasks. The key to the solution is a safety-focused evaluation framework that combines Whisper-based speech recognition with retrieval-augmented generation (RAG) approaches (hybrid, sparse, and dense), validated through supervised care-home trials and controlled testing. In the best configuration (GPT-5.2), the system reaches 100% accuracy on resident identification and care-category matching and 100% recall on reminder recognition, while confidence scoring, clarification prompts, and human-in-the-loop oversight safeguard reliability in noisy environments and across diverse accents, enabling accurate documentation, effective task management, and trustworthy use of AI in care settings.
Link: https://arxiv.org/abs/2603.23625
Authors: Zeinab Dehghani,Rameez Raja Kureshi,Koorosh Aslansefat,Faezeh Alsadat Abedi,Dhavalkumar Thakker,Lisa Greaves,Bhupesh Kumar Mishra,Baseer Ahmad,Tanaya Maslekar
Affiliations: University of Hull, UK; University of Southampton, UK; Connexin, Hull, UK; Leeds Teaching Hospital NHS Trust, UK
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Artificial intelligence (AI) is increasingly being explored in health and social care to reduce administrative workload and allow staff to spend more time on patient care. This paper evaluates a voice-enabled Care Home Smart Speaker designed to support everyday activities in residential care homes, including spoken access to resident records, reminders, and scheduling tasks. A safety-focused evaluation framework is presented that examines the system end-to-end, combining Whisper-based speech recognition with retrieval-augmented generation (RAG) approaches (hybrid, sparse, and dense). Using supervised care-home trials and controlled testing, we evaluated 330 spoken transcripts across 11 care categories, including 184 reminder-containing interactions. These evaluations focus on (i) correct identification of residents and care categories, (ii) reminder recognition and extraction, and (iii) end-to-end scheduling correctness under uncertainty (including safe deferral/clarification). Given the safety-critical nature of care homes, particular attention is also paid to reliability in noisy environments and across diverse accents, supported by confidence scoring, clarification prompts, and human-in-the-loop oversight. In the best-performing configuration (GPT-5.2), resident ID and care category matching reached 100% (95% CI: 98.86-100), while reminder recognition reached 89.09% (95% CI: 83.81-92.80) with zero missed reminders (100% recall) but some false positives. End-to-end scheduling via calendar integration achieved 84.65% exact reminder-count agreement (95% CI: 78.00-89.56), indicating remaining edge cases in converting informal spoken instructions into actionable events. The findings suggest that voice-enabled systems, when carefully evaluated and appropriately safeguarded, can support accurate documentation, effective task management, and trustworthy use of AI in care home settings.
[NLP-57] Revisiting Real-Time Digging-In Effects: No Evidence from NP/Z Garden-Paths
[Quick Read]: The core question is whether human sentence processing exhibits a "digging-in effect", in which disambiguation difficulty grows with the length of the ambiguous region (supporting self-organized parsing theories), or whether such effects are artifacts of wrap-up processes or methodological confounds. The key to the solution is two experiments on English NP/Z garden-path sentences using the Maze and self-paced reading paradigms, comparing human behavior with predictions from an ensemble of large language models. Items with nonfinal disambiguation show reverse trends consistent with neural language model predictions, while positive digging-in trends appear only sentence-finally, suggesting they stem from wrap-up effects rather than real-time parsing. These results cast doubt on digging-in effects as evidence for real-time human syntactic processing and lend empirical support to expectation-based models of comprehension.
Link: https://arxiv.org/abs/2603.23624
Authors: Amani Maina-Kilaas,Roger Levy
Affiliations: Massachusetts Institute of Technology
Subjects: Computation and Language (cs.CL)
Comments: 8 pages, 5 figures
Abstract:Digging-in effects, where disambiguation difficulty increases with longer ambiguous regions, have been cited as evidence for self-organized sentence processing, in which structural commitments strengthen over time. In contrast, surprisal theory predicts no such effect unless lengthening genuinely shifts statistical expectations, and neural language models appear to show the opposite pattern. Whether digging-in is a robust real-time phenomenon in human sentence processing – or an artifact of wrap-up processes or methodological confounds – remains unclear. We report two experiments on English NP/Z garden-path sentences using Maze and self-paced reading, comparing human behavior with predictions from an ensemble of large language models. We find no evidence for real-time digging-in effects. Critically, items with sentence-final versus nonfinal disambiguation show qualitatively different patterns: positive digging-in trends appear only sentence-finally, where wrap-up effects confound interpretation. Nonfinal items – the cleaner test of real-time processing – show reverse trends consistent with neural model predictions.
[NLP-58] LLMORPH: Automated Metamorphic Testing of Large Language Models
[Quick Read]: This paper addresses the absence of automated verification mechanisms for large language models (LLMs) on natural language processing (NLP) tasks, in particular how to detect output inconsistencies without human-labeled data. The key to the solution is LLMORPH, an automated testing tool based on metamorphic testing (MT): it uses metamorphic relations (MRs) to generate follow-up inputs from source inputs, exposing faulty behavior by checking whether model outputs remain consistent across related inputs, with no need for expensive labeled data.
Link: https://arxiv.org/abs/2603.23611
Authors: Steven Cho,Stefano Ruberto,Valerio Terragni
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted for publication in the 40th IEEE/ACM International Conference on Automated Software Engineering (ASE 2025). This arXiv version is the authors’ accepted manuscript. DOI: https://doi.org/10.1109/ASE63991.2025.00385 Code: this http URL
Abstract:Automated testing is essential for evaluating and improving the reliability of Large Language Models (LLMs), yet the lack of automated oracles for verifying output correctness remains a key challenge. We present LLMORPH, an automated testing tool specifically designed for LLMs performing NLP tasks, which leverages Metamorphic Testing (MT) to uncover faulty behaviors without relying on human-labeled data. MT uses Metamorphic Relations (MRs) to generate follow-up inputs from source test input, enabling detection of inconsistencies in model outputs without the need of expensive labelled data. LLMORPH is aimed at researchers and developers who want to evaluate the robustness of LLM-based NLP systems. In this paper, we detail the design, implementation, and practical usage of LLMORPH, demonstrating how it can be easily extended to any LLM, NLP task, and set of MRs. In our evaluation, we applied 36 MRs across four NLP benchmarks, testing three state-of-the-art LLMs: GPT-4, LLAMA3, and HERMES 2. This produced over 561,000 test executions. Results demonstrate LLMORPH’s effectiveness in automatically exposing inconsistencies.
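The metamorphic-testing idea LLMORPH automates can be illustrated with a toy harness. The MR, the keyword-stub "model", and the inputs below are invented for illustration and are not LLMORPH's actual MRs or targets:

```python
# A metamorphic relation (MR) transforms a source input into a follow-up
# input whose output should relate predictably to the source output, so
# no labeled data is needed to flag faults.

def sentiment_model(text: str) -> str:
    """Stand-in classifier; a real harness would query an LLM here."""
    return "positive" if "good" in text or "great" in text else "negative"

def mr_add_neutral_suffix(text: str) -> str:
    """MR: appending neutral content should not flip the predicted label."""
    return text + " The review was posted yesterday."

def metamorphic_check(model, mr, source_input):
    source_out = model(source_input)
    followup_out = model(mr(source_input))
    return source_out == followup_out  # False signals a faulty behavior

violations = [s for s in ["This movie was great.", "A dull, tedious film."]
              if not metamorphic_check(sentiment_model, mr_add_neutral_suffix, s)]
print(violations)  # [] -> the stub is consistent under this MR
```

Scaling this pattern to 36 MRs, four benchmarks, and three LLMs yields the hundreds of thousands of oracle-free test executions the paper reports.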
[NLP-59] The Geometric Price of Discrete Logic: Context-driven Manifold Dynamics of Number Representations
[Quick Read]: This paper tackles a fundamental tension in large language models (LLMs): they generalize smoothly across continuous semantic spaces yet struggle to form the discrete decision boundaries that strict logical reasoning demands, which theories based on linear isometric projections cannot explain. The paper argues that task context acts as a non-isometric dynamical operator that enforces a necessary "topological distortion". The key to the solution is a Gram-Schmidt decomposition of residual-stream activations, which reveals a dual-modulation mechanism: a class-agnostic topological-preservation component anchors global structure to prevent semantic collapse, while a specific algebraic-divergence component directionally tears apart cross-class concepts to forge logical boundaries. Targeted ablation of the divergence component collapses parity-classification accuracy from 100% to chance (38.57%), establishing a causal link between topology and model function.
Link: https://arxiv.org/abs/2603.23577
Authors: Long Zhang,Dai-jun Lin,Wei-neng Chen
Affiliations: South China University of Technology
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:
Abstract:Large language models (LLMs) generalize smoothly across continuous semantic spaces, yet strict logical reasoning demands the formation of discrete decision boundaries. Prevailing theories relying on linear isometric projections fail to resolve this fundamental tension. In this work, we argue that task context operates as a non-isometric dynamical operator that enforces a necessary “topological distortion.” By applying Gram-Schmidt decomposition to residual-stream activations, we reveal a dual-modulation mechanism driving this process: a class-agnostic topological preservation that anchors global structure to prevent semantic collapse, and a specific algebraic divergence that directionally tears apart cross-class concepts to forge logical boundaries. We validate this geometric evolution across a gradient of tasks, from simple mapping to complex primality testing. Crucially, targeted specific vector ablation establishes a strict causal binding between this topology and model function: algebraically erasing the divergence component collapses parity classification accuracy from 100% to chance levels (38.57%). Furthermore, we uncover a three-phase layer-wise geometric dynamic and demonstrate that under social pressure prompts, models fail to generate sufficient divergence. This results in a “manifold entanglement” that geometrically explains sycophancy and hallucination. Ultimately, our findings revise the linear-isometric presumption, demonstrating that the emergence of discrete logic in LLMs is purchased at an irreducible cost of topological deformation.
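The Gram-Schmidt decomposition the paper applies splits each activation into a component along a reference direction plus an orthogonal residual; "erasing" a component amounts to subtracting the parallel part. A minimal sketch with toy vectors (the real residual-stream vectors and directions are model-specific):

```python
# Split a vector into (parallel, orthogonal) parts w.r.t. a reference
# direction, the basic step of a Gram-Schmidt decomposition.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt_split(vec, direction):
    """Return (parallel, orthogonal) parts of vec w.r.t. direction."""
    scale = dot(vec, direction) / dot(direction, direction)
    parallel = [scale * d for d in direction]
    orthogonal = [v - p for v, p in zip(vec, parallel)]
    return parallel, orthogonal

activation = [3.0, 4.0]
context_dir = [1.0, 0.0]
par, orth = gram_schmidt_split(activation, context_dir)
print(par, orth)               # [3.0, 0.0] [0.0, 4.0]
print(dot(orth, context_dir))  # 0.0: the residual is orthogonal
```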
[NLP-60] PLDR-LLMs Reason At Self-Organized Criticality
[Quick Read]: This paper seeks a theoretical account of how reasoning ability emerges in large language models (LLMs) at inference time, and a way to quantify that ability without curated benchmark datasets. The key to the solution is the observation that PLDR-LLMs pretrained at self-organized criticality exhibit steady-state behavior at criticality: the deductive outputs resemble a second-order phase transition, with diverging correlation length and metastable outputs. From the global statistics of the deductive-output parameters in this steady state, one can define an order parameter; the closer it is to zero at criticality, the better the model's reasoning. This suggests reasoning arises from scaling functions, universality classes, and renormalization-group structure learned during training, yielding both a quantitative characterization and a theoretical explanation of reasoning ability.
Link: https://arxiv.org/abs/2603.23539
Authors: Burc Gokden
Affiliations: Fromthesky Research Labs LLC
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Adaptation and Self-Organizing Systems (nlin.AO)
Comments:
Abstract:We show that PLDR-LLMs pretrained at self-organized criticality exhibit reasoning at inference time. The characteristics of PLDR-LLM deductive outputs at criticality is similar to second-order phase transitions. At criticality, the correlation length diverges, and the deductive outputs attain a metastable steady state. The steady state behaviour suggests that deductive outputs learn representations equivalent to scaling functions, universality classes and renormalization groups from the training dataset, leading to generalization and reasoning capabilities in the process. We can then define an order parameter from the global statistics of the model’s deductive output parameters at inference. The reasoning capabilities of a PLDR-LLM is better when its order parameter is close to zero at criticality. This observation is supported by the benchmark scores of the models trained at near-criticality and sub-criticality. Our results provide a self-contained explanation on how reasoning manifests in large language models, and the ability to reason can be quantified solely from global model parameter values of the deductive outputs at steady state, without any need for evaluation of curated benchmark datasets through inductive output for reasoning and comprehension.
[NLP-61] Not All Pretraining are Created Equal: Threshold Tuning and Class Weighting for Imbalanced Polarization Tasks in Low-Resource Settings
[Quick Read]: This paper addresses polarization detection and classification in social media text across three subtasks: binary polarization detection, multi-label target-type classification, and multi-label manifestation identification. The key to the solution is Transformer-based models (mDeBERTa-v3-base, SwahBERT, AfriBERTa-large) combined with class-weighted loss functions, iterative stratified data splitting, and per-label threshold tuning to handle severe class imbalance. The best configuration reaches a macro-F1 of 0.8032 on binary polarization detection and remains competitive on the multi-label tasks (up to 0.556 macro-F1), while challenges persist with implicit polarization, code-switching, and distinguishing heated political discourse from genuine polarization.
Link: https://arxiv.org/abs/2603.23534
Authors: Abass Oguntade
Affiliations: African Institute of Mathematical Sciences, South Africa
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:This paper describes my submission to the Polarization Shared Task at SemEval-2025, which addresses polarization detection and classification in social media text. I develop Transformer-based systems for English and Swahili across three subtasks: binary polarization detection, multi-label target type classification, and multi-label manifestation identification. The approach leverages multilingual and African language-specialized models (mDeBERTa-v3-base, SwahBERT, AfriBERTa-large), class-weighted loss functions, iterative stratified data splitting, and per-label threshold tuning to handle severe class imbalance. The best configuration, mDeBERTa-v3-base, achieves 0.8032 macro-F1 on validation for binary detection, with competitive performance on multi-label tasks (up to 0.556 macro-F1). Error analysis reveals persistent challenges with implicit polarization, code-switching, and distinguishing heated political discourse from genuine polarization.
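Per-label threshold tuning, one of the imbalance remedies named above, can be sketched as follows. The validation scores and the 0.05-step grid are invented; real scores would come from the fine-tuned Transformer:

```python
# For each label, pick the decision threshold that maximizes F1 on
# validation data instead of using a fixed 0.5 cutoff.

def f1(y_true, y_pred):
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def tune_threshold(y_true, scores, grid=None):
    grid = grid or [i / 20 for i in range(1, 20)]  # 0.05 .. 0.95
    return max(grid, key=lambda t: f1(y_true, [s >= t for s in scores]))

# A rare label: the classifier is under-confident, so all scores sit well
# below 0.5 and a naive cutoff would predict nothing.
y_true = [1, 0, 0, 1, 0, 0]
scores = [0.35, 0.30, 0.10, 0.40, 0.20, 0.15]
t = tune_threshold(y_true, scores)
print(t, f1(y_true, [s >= t for s in scores]))
```

Here the tuned threshold recovers both positives, whereas a 0.5 cutoff would yield F1 = 0 for this label.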
[NLP-62] Generating Hierarchical JSON Representations of Scientific Sentences Using LLMs
[Quick Read]: The question studied is whether structured representations can preserve the meaning of scientific text. The key to the solution is a novel structural loss function used to fine-tune a lightweight LLM to generate hierarchical JSON structures from sentences collected from scientific articles; a generative model then reconstructs the original text from these structures, and comparing original and reconstructed sentences on semantic and lexical similarity shows that hierarchical formats retain the information of scientific texts effectively.
Link: https://arxiv.org/abs/2603.23532
Authors: Satya Sri Rajiteswari Nimmagadda,Ethan Young,Niladri Sengupta,Ananya Jana,Aniruddha Maiti
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to the 21st International Conference on Semantic Computing (IEEE ICSC 2026)
Abstract:This paper investigates whether structured representations can preserve the meaning of scientific sentences. To test this, a lightweight LLM is fine-tuned using a novel structural loss function to generate hierarchical JSON structures from sentences collected from scientific articles. These JSONs are then used by a generative model to reconstruct the original text. Comparing the original and reconstructed sentences using semantic and lexical similarity we show that hierarchical formats are capable of retaining information of scientific texts effectively.
[NLP-63] Large Language Models Unpack Complex Political Opinions through Target-Stance Extraction
[Quick Read]: This paper addresses the coarse granularity of existing computational analyses of online political discourse, which reduce opinions to simple partisan labels and overlook the complex interplay of beliefs about policies, figures, and issues. The key to the solution is the Target-Stance Extraction (TSE) task: using large language models (LLMs) to jointly identify the target of discussion and the stance expressed toward it, with minimal supervision. On a dataset of 1,084 Reddit posts covering 138 distinct political targets, a range of proprietary and open-source LLMs are evaluated under zero-shot, few-shot, and context-augmented prompting; the best models perform comparably to highly trained human annotators and remain robust on challenging posts with low inter-annotator agreement, offering a scalable tool for computational social science and political text analysis.
Link: https://arxiv.org/abs/2603.23531
Authors: Özgür Togay,Florian Kunneman,Javier Garcia-Bernardo,Anastasia Giachanou
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Political polarization emerges from a complex interplay of beliefs about policies, figures, and issues. However, most computational analyses reduce discourse to coarse partisan labels, overlooking how these beliefs interact. This is especially evident in online political conversations, which are often nuanced and cover a wide range of subjects, making it difficult to automatically identify the target of discussion and the opinion expressed toward them. In this study, we investigate whether Large Language Models (LLMs) can address this challenge through Target-Stance Extraction (TSE), a recent natural language processing task that combines target identification and stance detection, enabling more granular analysis of political opinions. For this, we construct a dataset of 1,084 Reddit posts from r/NeutralPolitics, covering 138 distinct political targets and evaluate a range of proprietary and open-source LLMs using zero-shot, few-shot, and context-augmented prompting strategies. Our results show that the best models perform comparably to highly trained human annotators and remain robust on challenging posts with low inter-annotator agreement. These findings demonstrate that LLMs can extract complex political opinions with minimal supervision, offering a scalable tool for computational social science and political text analysis.
[NLP-64] Did You Forget What I Asked? Prospective Memory Failures in Large Language Models
[Quick Read]: This paper studies why large language models (LLMs) fail to satisfy formatting instructions while simultaneously performing demanding tasks. Borrowing the lens of prospective memory from cognitive psychology, the authors design a controlled paradigm that combines verifiable formatting constraints with benchmark tasks of increasing complexity. A key finding is that failures are highly type-dependent: terminal constraints (requiring action at the response boundary) degrade most, with compliance drops of up to 50%, while avoidance constraints remain comparatively robust. The core of the solution is a salience-enhanced format (explicit instruction framing plus a trailing reminder), which recovers much of the compliance lost under task load, restoring performance to 90-100% in many settings. The study also reveals bidirectional interference between formatting constraints and task accuracy, showing that formatting demands can in turn hurt core task performance.
Link: https://arxiv.org/abs/2603.23530
Authors: Avni Mittal
Affiliations: Microsoft
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Large language models often fail to satisfy formatting instructions when they must simultaneously perform demanding tasks. We study this behaviour through a prospective memory inspired lens from cognitive psychology, using a controlled paradigm that combines verifiable formatting constraints with benchmark tasks of increasing complexity. Across three model families and over 8,000 prompts, compliance drops by 2-21% under concurrent task load. Vulnerability is highly type-dependent: terminal constraints (requiring action at the response boundary) degrade most, with drops up to 50%, while avoidance constraints remain comparatively robust. A salience-enhanced format (explicit instruction framing plus a trailing reminder) recovers much of the lost compliance, restoring performance to 90-100% in many settings. Interference is bidirectional: formatting constraints can also reduce task accuracy, with one model’s GSM8K accuracy dropping from 93% to 27%. In additional stacking experiments, joint compliance declines sharply as constraints accumulate. All results use deterministic programmatic checkers without an LLM-as-judge component on publicly available datasets.
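Deterministic programmatic checkers of the kind the paper uses (instead of an LLM judge) might look like the sketch below. The two constraint wordings are illustrative, not the paper's actual prompts:

```python
# Two checker types the study contrasts: a terminal constraint (act at the
# response boundary) and an avoidance constraint (never do X).

import re

def check_terminal(response: str) -> bool:
    """Terminal constraint: the response must end with the word 'DONE'."""
    return response.rstrip().endswith("DONE")

def check_avoidance(response: str) -> bool:
    """Avoidance constraint: the response must contain no digits."""
    return re.search(r"\d", response) is None

resp = "The answer is forty-two.\nDONE"
print(check_terminal(resp), check_avoidance(resp))  # True True
print(check_terminal("The answer is 42."))          # False: forgot the suffix
```

Because such checks are exact string/regex tests, compliance can be measured reproducibly across thousands of prompts.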
[NLP-65] Konkani LLM: Multi-Script Instruction Tuning and Evaluation for a Low-Resource Indian Language
[Quick Read]: This paper addresses the sharp performance drop of large language models (LLMs) in low-resource settings such as Konkani, caused by scarce training data and high diversity across the Devanagari, Romi, and Kannada scripts. The key to the solution is Konkani-Instruct-100k, a synthetic instruction-tuning dataset generated with Gemini 3, on which the Konkani LLM series of fine-tuned models is built and optimized for regional linguistic nuances. The paper also introduces a Multi-Script Konkani Benchmark for cross-script evaluation; on machine translation, Konkani LLM delivers consistent gains over the corresponding base models and in several settings surpasses proprietary closed-source models.
Link: https://arxiv.org/abs/2603.23529
Authors: Reuben Chagas Fernandes,Gaurang S. Patkar
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) consistently underperform in low-resource linguistic contexts such as Konkani. This performance deficit stems from acute training data scarcity compounded by high script diversity across Devanagari, Romi and Kannada orthographies. To address this gap, we introduce Konkani-Instruct-100k, a comprehensive synthetic instruction-tuning dataset generated through Gemini 3. We establish rigorous baseline benchmarks by evaluating leading open-weights architectures including Llama 3.1, Qwen2.5 and Gemma 3 alongside proprietary closed-source models. Our primary contribution involves the development of Konkani LLM, a series of fine-tuned models optimized for regional nuances. Furthermore, we are developing the Multi-Script Konkani Benchmark to facilitate cross-script linguistic evaluation. In machine translation, Konkani LLM delivers consistent gains over the corresponding base models and is competitive with, and in several settings surpasses, proprietary baselines.
[NLP-66] The Compression Paradox in LLM Inference: Provider-Dependent Energy Effects of Prompt Compression
[Quick Read]: This paper examines the environmental cost of large language model (LLM) inference, testing whether prompt compression improves energy efficiency across 28,421 API trials on three providers (OpenAI GPT-4o-mini, Anthropic Claude-3.5-Sonnet, DeepSeek-Chat), five benchmarks (HumanEval, MBPP, GSM8K, MATH, MMLU), and four compression ratios (r = 1.0, 0.7, 0.5, 0.3). The key finding is that reducing input tokens alone does not reliably improve energy efficiency: provider-dependent behavior can instead cause severe quality loss or energy blow-ups (DeepSeek's outputs expand from 21 to 798 tokens at r = 0.3, with energy increases of up to +2,140%). By contrast, model selection and output-length control provide more consistent and predictable energy-quality tradeoffs, making them the more reliable optimization strategy for production inference.
Link: https://arxiv.org/abs/2603.23528
Authors: Warren Johnson
Affiliations: Plexor Labs
Subjects: Computation and Language (cs.CL)
Comments: 16 pages, 5 figures, 5 tables. Includes data/code availability, ethics statement, and competing interests
Abstract:The rapid proliferation of Large Language Models has created an environmental paradox: the very technology that could help solve climate challenges is itself becoming a significant contributor to global carbon emissions. We test whether prompt compression improves inference energy efficiency in 28,421 successful API trials (28,428 planned) across three providers (OpenAI GPT-4o-mini, Anthropic Claude-3.5-Sonnet, and DeepSeek-Chat), five benchmarks (HumanEval, MBPP, GSM8K, MATH, MMLU), and four compression ratios (r in 1.0, 0.7, 0.5, 0.3). Energy is estimated with a token-based proxy calibrated against local direct measurements, and quality is tracked with benchmark pass rates. Compression produced substantial quality loss (overall pass rate 26.0% at baseline vs. 1.5% at r=0.7) and strongly provider-dependent energy behavior. DeepSeek exhibited output expansion under compression (21 to 798 tokens at r=0.3), corresponding to energy increases up to +2,140%, while GPT-4o-mini showed mixed effects including a reduction at r=0.5. These results indicate that input-token reduction alone is not a reliable energy optimization strategy in production inference. For the evaluated settings, model selection and output-length control provided more consistent energy-quality tradeoffs than prompt compression.
[NLP-67] Compression Method Matters: Benchmark-Dependent Output Dynamics in LLM Prompt Compression
【速读】: 该论文旨在解决生成式 AI(Generative AI)模型在提示压缩(prompt compression)部署中存在评估偏差的问题,即仅以输入 token 减少作为衡量标准可能无法准确反映实际推理成本和输出行为变化。其核心问题是:压缩策略对不同基准测试(benchmark)的输出长度和总推理开销影响显著差异,且现有研究结论相互矛盾,缺乏统一的结构化解释框架。解决方案的关键在于提出“指令存活概率”(instruction survival probability, Psi),这是一个结构化指标,用于量化任务关键提示片段在截断后是否保留;同时引入“压缩鲁棒性指数”(Compression Robustness Index, CRI),实现跨基准的可比性评估。研究表明,Prompt 结构而非模型提供商本身是决定压缩效果的主要因素,从而揭示了为何某些场景下出现极端输出膨胀(如 DeepSeek 在 MBPP 上 56 倍扩展),而其他场景则表现稳定(如 GPT-4o-mini)。此方法推动了更可靠、节能的 LLM 部署实践,强调需进行多基准测试与结构感知的压缩策略设计。
链接: https://arxiv.org/abs/2603.23527
作者: Warren Johnson
机构: Plexor Labs
类目: Computation and Language (cs.CL)
备注: 19 pages. Includes figures and tables. Companion code/data repository and direct NVML calibration dataset are cited in manuscript
Abstract:Prompt compression is often evaluated by input-token reduction, but its real deployment impact depends on how compression changes output length and total inference cost. We present a controlled replication and extension study of benchmark-dependent output dynamics under aggressive compression, covering 5,400 API calls across three benchmarks and multiple providers. To explain conflicting prior observations, we formalize instruction survival probability (Psi), a structural metric that captures whether task-critical prompt segments remain after truncation. Results show a strong benchmark effect: under r=0.3, DeepSeek exhibits severe output expansion on MBPP (56x, Psi approx 0.15) but substantially lower expansion on HumanEval (5x, Psi approx 0.72), while GPT-4o-mini is comparatively stable across benchmarks. This reconciles the apparent discrepancy between previously reported extreme explosion and lower replication effects by identifying prompt structure, not provider identity alone, as the primary moderator. We introduce the Compression Robustness Index (CRI) for cross-benchmark evaluation and show that single-benchmark assessments can produce misleading conclusions about compression safety and efficiency. To contextualize energy claims, we incorporate companion direct NVML measurements from rented RunPod GPUs and show that token savings can overstate joule savings. These findings motivate benchmark-diverse testing and structure-aware compression policies for reliable, energy-conscious LLM deployment.
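Instruction survival probability (Psi), the fraction of task-critical prompt segments remaining after truncation to a retention ratio r, can be sketched as below. The segmentation and the head-only truncation rule are simplifications of whatever the evaluated compressors actually do:

```python
# Psi = |critical segments surviving truncation| / |critical segments|.

def psi(prompt_segments, critical_ids, r):
    keep = max(1, int(len(prompt_segments) * r))
    survivors = {seg_id for seg_id, _ in prompt_segments[:keep]}
    return sum(1 for c in critical_ids if c in survivors) / len(critical_ids)

# Ten segments; the task-critical instructions sit at positions 7-9
# (e.g. "return only code", "use this function name", "no explanations").
segments = [(i, f"segment {i}") for i in range(10)]
critical = [7, 8, 9]

print(psi(segments, critical, r=1.0))  # 1.0 - nothing truncated
print(psi(segments, critical, r=0.3))  # 0.0 - head-only truncation drops them
```

A prompt whose critical instructions sit near the end thus scores a low Psi under aggressive truncation, which is exactly the regime where the paper observes output explosion.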
[NLP-68] Prompt Compression in Production Task Orchestration: A Pre-Registered Randomized Trial
[Quick Read]: This paper addresses the economics of prompt compression in generative AI systems: how to lower input-token cost without being overwhelmed by output-token growth, since output tokens are typically priced several times higher than input tokens. The key insight is that compression policies must weigh input reduction against output expansion, and that structure-aware strategies (entropy-adaptive and recency-weighted) can minimize total inference cost while preserving response similarity. In a pre-registered randomized controlled trial, moderate compression (retention r = 0.5) cut mean total cost by 27.9%, whereas aggressive compression (r = 0.2) slightly expanded outputs and increased cost despite the input reduction, showing that "compress more" is not a reliable production heuristic and that output tokens should be treated as a first-class outcome when designing compression policies.
Link: https://arxiv.org/abs/2603.23525
Authors: Warren Johnson,Charles Lee
Affiliations: Plexor Labs; Project Autobots
Subjects: Computation and Language (cs.CL)
Comments: 28 pages, 9 tables, 1 CONSORT figure; pre-registered randomized controlled trial on production orchestration prompts
Abstract:The economics of prompt compression depend not only on reducing input tokens but on how compression changes output length, which is typically priced several times higher. We evaluate this in a pre-registered six-arm randomized controlled trial of prompt compression on production multi-agent task-orchestration, analyzing 358 successful Claude Sonnet 4.5 runs (59-61 per arm) drawn from a randomized corpus of 1,199 real orchestration instructions. We compare an uncompressed control with three uniform retention rates (r=0.8, 0.5, 0.2) and two structure-aware strategies (entropy-adaptive and recency-weighted), measuring total inference cost (input+output) and embedding-based response similarity. Moderate compression (r=0.5) reduced mean total cost by 27.9%, while aggressive compression (r=0.2) increased mean cost by 1.8% despite substantial input reduction, consistent with small mean output expansion (1.03x vs. control) and heavy-tailed uncertainty. Recency-weighted compression achieved 23.5% savings and, together with moderate compression, occupied the empirical cost-similarity Pareto frontier, whereas aggressive compression was dominated on both cost and similarity. These results show that “compress more” is not a reliable production heuristic and that output tokens must be treated as a first-class outcome when designing compression policies.
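The cost asymmetry the trial turns on, output tokens priced several times above input tokens, can be made concrete with a worked example. The prices and token counts below are hypothetical, not the study's actual rates or measurements:

```python
# Total inference cost counts both input and output tokens, with output
# priced higher (5x here, for illustration).

PRICE_IN = 3.0 / 1_000_000    # $ per input token (illustrative)
PRICE_OUT = 15.0 / 1_000_000  # $ per output token (illustrative)

def total_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * PRICE_IN + output_tokens * PRICE_OUT

baseline = total_cost(2000, 400)    # uncompressed prompt
moderate = total_cost(1000, 400)    # r=0.5: output unchanged -> clear savings
aggressive = total_cost(400, 750)   # r=0.2: 80% fewer input tokens, but the
                                    # output expands, wiping out the savings
print(moderate < baseline)    # True
print(aggressive > baseline)  # True: aggressive compression costs MORE
```

Even modest output expansion can dominate, because each extra output token costs as much as several saved input tokens.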
[NLP-69] Navigating the Concept Space of Language Models
[Quick Read]: This paper addresses the difficulty of exploring the thousands of features that sparse autoencoders (SAEs) extract from large language model activations: current practice relies on inspecting top-activating examples, manually browsing individual features, or semantic search, which does not scale to exploratory concept discovery. The key to the solution is Concept Explorer, a scalable interactive system that organizes SAE feature explanations using hierarchical neighborhood embeddings, constructing a multi-resolution manifold that supports progressive navigation from coarse concept clusters to fine-grained neighborhoods and enables discovery, comparison, and relationship analysis among concepts.
Link: https://arxiv.org/abs/2603.23524
Authors: Wilson E. Marcílio-Jr,Danilo M. Eler
Affiliations: Adaption Labs; São Paulo State University (UNESP)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Sparse autoencoders (SAEs) trained on large language model activations output thousands of features that enable mapping to human-interpretable concepts. The current practice for analyzing these features primarily relies on inspecting top-activating examples, manually browsing individual features, or performing semantic search on concepts of interest, which makes exploratory discovery of concepts difficult at scale. In this paper, we present Concept Explorer, a scalable interactive system for post-hoc exploration of SAE features that organizes concept explanations using hierarchical neighborhood embeddings. Our approach constructs a multi-resolution manifold over SAE feature embeddings and enables progressive navigation from coarse concept clusters to fine-grained neighborhoods, supporting discovery, comparison, and relationship analysis among concepts. We demonstrate the utility of Concept Explorer on SAE features extracted from SmolLM2, where it reveals coherent high-level structure, meaningful subclusters, and distinctive rare concepts that are hard to identify with existing workflows.
[NLP-70] Do 3D Large Language Models Really Understand 3D Spatial Relationships? ICLR2026
【Quick Read】: This paper addresses the concern that current evaluations of 3D Large Language Models (3D-LLMs) may reward reliance on textual cues rather than genuine 3D spatial reasoning: existing benchmarks such as SQA3D cannot tell whether a model is truly 3D-aware. The key to the solution is Real-3DQA, a more rigorous benchmark that filters out easy-to-guess questions and introduces a structured taxonomy for systematically assessing multiple facets of 3D reasoning, together with a 3D-reweighted training objective that steers models toward visual cues instead of textual shortcuts, substantially improving their understanding of spatial relationships.
Link: https://arxiv.org/abs/2603.23523
Authors: Xianzheng Ma, Tao Sun, Shuai Chen, Yash Bhalgat, Jindong Gu, Angel X Chang, Iro Armeni, Iro Laina, Songyou Peng, Victor Adrian Prisacariu
Affiliations: University of Oxford; Stanford University; Simon Fraser University; Google DeepMind
Subjects: Computation and Language (cs.CL); Robotics (cs.RO)
Comments: ICLR 2026
Abstract:Recent 3D Large-Language Models (3D-LLMs) claim to understand 3D worlds, especially spatial relationships among objects. Yet, we find that simply fine-tuning a language model on text-only question-answer pairs can perform comparably to or even surpass these methods on the SQA3D benchmark without using any 3D input. This indicates that the SQA3D benchmark may not be able to detect if the model exploits textual shortcuts rather than engages in 3D-aware reasoning. To address this issue, we introduce Real-3DQA, a more rigorous evaluation benchmark that filters out easy-to-guess questions and introduces a structured taxonomy to assess various aspects of 3D reasoning. Experiments on Real-3DQA confirm that existing 3D-LLMs struggle with spatial relationships once simple cues are removed. We further propose a 3D-reweighted training objective that guides the model to rely more on 3D visual cues, substantially enhancing 3D-LLMs' performance in spatial reasoning tasks. Our findings underscore the need for robust benchmarks and tailored training strategies to advance genuine 3D vision-language understanding. Project page: this https URL.
[NLP-71] Qworld: Question-Specific Evaluation Criteria for LLMs
【Quick Read】: This paper targets the difficulty of evaluating large language models (LLMs) on open-ended questions, where response quality depends heavily on the question's context: binary scores and static rubrics cannot capture this context dependence, and existing methods define criteria at the dataset level or generate them in a single pass, limiting exploration of the evaluation dimensions each question implies. The key to the solution is One-Question-One-World (Qworld), which uses a recursive expansion tree to decompose each question into concrete scenarios, perspectives, and fine-grained binary criteria, yielding evaluation criteria well matched to the question. By achieving structured coverage of the evaluation axes a question implies, Qworld lets evaluation adapt dynamically to each question's unique requirements instead of relying on fixed task-level criteria.
Link: https://arxiv.org/abs/2603.23522
Authors: Shanghua Gao, Yuchang Su, Pengwei Sui, Curtis Ginder, Marinka Zitnik
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Evaluating large language models (LLMs) on open-ended questions is difficult because response quality depends on the question’s context. Binary scores and static rubrics fail to capture these context-dependent requirements. Existing methods define criteria at the dataset level or generate them in a single pass, which limits their ability to explore the evaluation space implied by each question. We introduce One-Question-One-World (Qworld), a method that generates question-specific evaluation criteria using a recursive expansion tree. Given a question, Qworld decomposes it into scenarios, perspectives, and fine-grained binary criteria through structured hierarchical and horizontal expansion. The resulting criteria specify what a high-quality answer must address for that question. On HealthBench, Qworld covers 89% of expert-authored criteria and generates 79% novel criteria validated by human experts. Experts rate Qworld criteria higher in insight and granularity than those produced by prior methods. When applied to 11 frontier LLMs on HealthBench and Humanity’s Last Exam, Qworld reveals capability differences in dimensions such as long-term impact, equity, error handling, and interdisciplinary reasoning that coarse rubrics do not distinguish. By formulating criteria generation as structured coverage of question-implied evaluation axes, Qworld enables evaluation that adapts to each question rather than relying on fixed task-level criteria.
[NLP-72] Chitrakshara: A Large Multilingual Multimodal Dataset for Indian languages CVPR2025
【Quick Read】: This paper addresses two gaps: the limited multi-image understanding of current multimodal models, and the under-representation of Indian languages in vision-language models (VLMs), which are trained mostly on English datasets. The key contribution is the construction and release of the Chitrakshara dataset series covering 11 Indian languages, with two core components: (1) Chitrakshara-IL, a large-scale interleaved image-text pretraining dataset (193M images, 30B text tokens, and 50M multilingual documents), and (2) Chitrakshara-Cap, a high-quality subset of 44M image-text pairs with 733M text tokens. A systematic collection, filtering, and processing pipeline, accompanied by a comprehensive quality and diversity analysis, provides a reliable data foundation for developing more culturally inclusive multimodal models.
Link: https://arxiv.org/abs/2603.23521
Authors: Shaharukh Khan, Ali Faraz, Abhinav Ravi, Mohd Nauman, Mohd Sarfraz, Akshat Patidar, Raja Kolla, Chandra Khatri, Shubham Agarwal
Affiliations: Krutrim AI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at "CVPR 2025: Workshop Vision Language Models For All"
Abstract:Multimodal research has predominantly focused on single-image reasoning, with limited exploration of multi-image scenarios. Recent models have sought to enhance multi-image understanding through large-scale pretraining on interleaved image-text datasets. However, most Vision-Language Models (VLMs) are trained primarily on English datasets, leading to inadequate representation of Indian languages. To address this gap, we introduce the Chitrakshara dataset series, covering 11 Indian languages sourced from Common Crawl. It comprises (1) Chitrakshara-IL, a large-scale interleaved pretraining dataset with 193M images, 30B text tokens, and 50M multilingual documents, and (2) Chitrakshara-Cap, which includes 44M image-text pairs with 733M tokens. This paper details the data collection pipeline, including curation, filtering, and processing methodologies. Additionally, we present a comprehensive quality and diversity analysis to assess the dataset’s representativeness across Indic languages and its potential for developing more culturally inclusive VLMs.
[NLP-73] From Physician Expertise to Clinical Agents: Preserving, Standardizing and Scaling Physicians' Medical Expertise with Lightweight LLM
【Quick Read】: This paper tackles the problem that in Traditional Chinese Medicine (TCM) clinical practice, high-quality expert knowledge is hard to transmit and apply at scale: master physicians' experience is highly individualized, slow to develop, and inefficient to pass on, making top-tier diagnostic and therapeutic expertise scarce. The key to the solution is the Med-Shicheng framework, which systematically extracts the diagnostic-and-therapeutic philosophies and case-dependent adaptation rules of five national-level TCM masters, standardizes them into a multi-task learning paradigm, and enables a large language model (LLM) to internalize and transfer these complex knowledge systems, achieving performance comparable to frontier models even on resource-constrained devices.
Link: https://arxiv.org/abs/2603.23520
Authors: Chanyong Luo, Jirui Dai, Zhendong Wang, Kui Chen, Jiaxi Yang, Bingjie Lu, Jing Wang, Jiaxin Hao, Bing Li, Ruiyang He, Yiyu Qiao, Chenkai Zhang, Kaiyu Wang, Zhi Liu, Zeyu Zheng, Yan Li, Xiaohong Gu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Medicine is an empirical discipline refined through long-term observation and the messy, high-variance reality of clinical practice. Physicians build diagnostic and therapeutic competence through repeated cycles of application, reflection, and improvement, forming individualized methodologies. Yet outcomes vary widely, and master physicians’ knowledge systems are slow to develop and hard to transmit at scale, contributing to the scarcity of high-quality clinical expertise. To address this, we propose Med-Shicheng, a general framework that enables large language models to systematically learn and transfer distinguished physicians’ diagnostic-and-therapeutic philosophy and case-dependent adaptation rules in a standardized way. Built on Tianyi, Med-Shicheng consists of five stages. We target five National Masters of Chinese Medicine or distinguished TCM physicians, curate multi-source materials, and train a single model to internalize all five knowledge systems across seven tasks, including etiology-pathogenesis analysis, syndrome diagnosis, treatment principle selection, prescription generation, prescription explanation, symptom evolution with regimen adjustment, and clinical advice. Implemented on Qwen2.5-1.5B-Base, Med-Shicheng runs on resource-constrained GPUs while achieving performance comparable to DeepSeek-R1 and GPT-5. We also examine the reliability of LLM-as-a-judge versus physician evaluation: automated judging tracks overall trends but shows bias on fine-grained individualized distinctions, highlighting the need for physician involvement when ground truth is unavailable and for domain-adapted judge models.
[NLP-74] MedMT-Bench: Can LLMs Memorize and Understand Long Multi-Turn Conversations in Medical Scenarios?
【Quick Read】: This paper addresses the lack of rigorous testing of long-context memory, interference robustness, and safety defense in current medical LLM evaluation: existing medical benchmarks rarely simulate the complex multi-turn interactions of real diagnosis-and-treatment workflows, leaving model reliability in practice unverified. The key to the solution is MedMT-Bench, a multi-turn instruction-following benchmark that simulates the full diagnosis and treatment process. It is built via scene-by-scene data synthesis refined by expert manual editing, yielding 400 test cases that closely match real clinical practice, each averaging 22 dialogue rounds (up to 52) and covering five types of difficult instruction-following issues. An LLM-as-judge evaluation protocol with instance-level rubrics and atomic test points reaches 91.94% agreement with expert annotations, providing a quantifiable, challenging tool for driving safer and more reliable medical AI.
Link: https://arxiv.org/abs/2603.23519
Authors: Lin Yang, Yuancheng Yang, Xu Wang, Changkun Liu, Haihua Yang
Affiliations: ByteDance
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities across various specialist domains and have been integrated into high-stakes areas such as medicine. However, existing medical benchmarks rarely stress-test the long-context memory, interference robustness, and safety defenses required in practice. To bridge this gap, we introduce MedMT-Bench, a challenging medical multi-turn instruction following benchmark that simulates the entire diagnosis and treatment process. We construct the benchmark via scene-by-scene data synthesis refined by manual expert editing, yielding 400 test cases that are highly consistent with real-world application scenarios. Each test case has an average of 22 rounds (maximum of 52 rounds), covering 5 types of difficult instruction following issues. For evaluation, we propose an LLM-as-judge protocol with instance-level rubrics and atomic test points, validated against expert annotations with a human-LLM agreement of 91.94%. We test 17 frontier models, all of which underperform on MedMT-Bench (overall accuracy below 60.00%), with the best model reaching 59.75%. MedMT-Bench can be an essential tool for driving future research towards safer and more reliable medical AI. The benchmark is available at this https URL
[NLP-75] Cluster-R1: Large Reasoning Models Are Instruction-following Clustering Agents
【Quick Read】: This paper addresses two complementary failures: general-purpose embedding models excel at recognizing semantic similarity but cannot respond to user instructions that refine which textual characteristics matter, while instruction-tuned embedders align with textual instructions yet cannot autonomously infer latent data structure (such as the number of clusters). The key to the solution is reframing instruction-following clustering as a generative task and training large reasoning models (LRMs) as autonomous clustering agents: a reasoning-driven training pipeline enables LRMs to interpret high-level clustering instructions and automatically infer the corresponding latent groupings, yielding more faithful and interpretable instruction-driven clustering.
Link: https://arxiv.org/abs/2603.23518
Authors: Peijun Qing, Puneet Mathur, Nedim Lipka, Varun Manjunatha, Ryan Rossi, Franck Dernoncourt, Saeed Hassanpour, Soroush Vosoughi
Affiliations: Dartmouth College; Adobe Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:General-purpose embedding models excel at recognizing semantic similarities but fail to capture the characteristics of texts specified by user instructions. In contrast, instruction-tuned embedders can align embeddings with textual instructions yet cannot autonomously infer latent corpus structures, such as determining the optimal number of clusters. To address both limitations, we reframe instruction-following clustering as a generative task and train large reasoning models (LRMs) as autonomous clustering agents. Our reasoning-driven training pipeline enables LRMs to interpret high-level clustering instructions and then infer the corresponding latent groupings. To evaluate this paradigm, we introduce ReasonCluster, a comprehensive benchmark comprising 28 diverse tasks spanning daily dialogue, legal cases, and financial reports. Experiments across diverse datasets and clustering scenarios show that our approach consistently outperforms strong embedding-based methods and LRM baselines, demonstrating that explicit reasoning fosters more faithful and interpretable instruction-based clustering.
[NLP-76] Beyond Accuracy: Introducing a Symbolic-Mechanistic Approach to Interpretable Evaluation
【Quick Read】: This position paper argues that accuracy-based evaluation cannot reliably distinguish genuine generalization from shortcut strategies (memorization, leakage, or brittle heuristics), especially in small-data regimes. The key to the solution is mechanism-aware evaluation, which combines task-relevant symbolic rules with mechanistic interpretability to yield algorithmic pass/fail scores that pinpoint where a model generalizes versus exploits patterns. Using NL-to-SQL as a case study, the authors compare two structurally identical models trained under different conditions (one without schema information, forcing memorization, and one with schema information, enabling semantic grounding) and show that accuracy alone falsely credits the memorization model with competence, whereas the joint symbolic-mechanistic evaluation reveals that it violates core schema-generalization rules.
Link: https://arxiv.org/abs/2603.23517
Authors: Reza Habibi, Darian Lee, Magy Seif El-Nasr
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Symbolic Computation (cs.SC)
Comments:
Abstract:Accuracy-based evaluation cannot reliably distinguish genuine generalization from shortcuts like memorization, leakage, or brittle heuristics, especially in small-data regimes. In this position paper, we argue for mechanism-aware evaluation that combines task-relevant symbolic rules with mechanistic interpretability, yielding algorithmic pass/fail scores that show exactly where models generalize versus exploit patterns. We demonstrate this on NL-to-SQL by training two identical architectures under different conditions: one without schema information (forcing memorization), one with schema (enabling grounding). Standard evaluation shows the memorization model achieves 94% field-name accuracy on unseen data, falsely suggesting competence. Our symbolic-mechanistic evaluation reveals this model violates core schema generalization rules, a failure invisible to accuracy metrics.
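One of the symbolic rules described above can be made concrete. The sketch below is a hypothetical example rather than the authors' actual rule set: it checks whether every field a generated SQL query selects actually exists in the target schema, so a memorizing model that emits field names absent from the schema fails the rule even when surface accuracy looks high:

```python
import re

def fields_in_select(sql: str) -> set:
    """Crudely extract the identifiers listed between SELECT and FROM."""
    m = re.search(r"select\s+(.*?)\s+from", sql, re.IGNORECASE | re.DOTALL)
    return {f.strip() for f in m.group(1).split(",")} if m else set()

def passes_schema_rule(sql: str, schema_fields: set) -> bool:
    """Pass iff every selected field is grounded in the schema."""
    return fields_in_select(sql) <= schema_fields

schema = {"name", "age", "email"}
print(passes_schema_rule("SELECT name, age FROM users", schema))  # grounded
print(passes_schema_rule("SELECT salary FROM users", schema))     # ungrounded field
```

A real rule set would also handle aliases, joins, and `*`, but even this toy version turns "field-name accuracy" into an explicit pass/fail check on schema generalization.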
[NLP-77] Training a Large Language Model for Medical Coding Using Privacy-Preserving Synthetic Clinical Data
【Quick Read】: This paper addresses accuracy and reliability in automated medical coding, in particular the challenges of automatically assigning ICD-10-CM and CPT codes from clinical documentation given heterogeneous records, nuanced coding guidelines, and long-tail distributions. The key to the solution is fine-tuning the general-purpose Llama 3-70B model on privacy-preserving synthetic training data generated from EHR-grounded templates and coding policies, which markedly improves exact-code prediction without exposing protected health information. Experiments show fine-tuning lifts exact-match F1 above 0.70 on both code systems, versus 0.18 for the unadapted model, while performance remains high on complex categories that require multi-step clinical reasoning.
Link: https://arxiv.org/abs/2603.23515
Authors: John Cook, Michael Wyatt, Peng Wei, Iris Chin, Santosh Gupta, Van Zyl Van Vuuren, Richie Siburian, Amanda Spicer, Kristen Viviano, Alda Cami, Raunaq Malhotra, Zhewei Yao, Jeff Rasley, Gaurav Kaushik
Affiliations: Veradigm; Snowflake
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 20 pages, 6 figures
Abstract:Improving the accuracy and reliability of medical coding reduces clinician burnout and supports revenue cycle processes, freeing providers to focus more on patient care. However, automating the assignment of ICD-10-CM and CPT codes from clinical documentation remains a challenge due to heterogeneous records, nuanced coding guidelines, and long-tail distributions. Large language models have been proposed to help or automate specific medical coding tasks. However, foundation models are not explicitly trained for medical coding and zero-shot coding has yielded poor results. We investigate whether a modern open-weight foundation model can be adapted for an expert-level medical coding task using privacy-preserving synthetic training data derived from electronic health records. We fine-tune Llama 3-70B on pairs of clinical notes and gold codes generated from EHR-grounded templates and coding policies, then evaluate exact-code prediction for ICD-10-CM and CPT. A zero-shot baseline with the unadapted model achieved an F1 score of 0.18 for exact code match. After fine-tuning on the synthetic corpus, exact-match F1 exceeded 0.70, representing a large absolute gain across both code systems. Notably, performance remained high on complex categories that often require multi-step clinical reasoning and code composition, including Advanced Illness and Frailty classes, and the model retained its performance on medical comprehension tasks. These results indicate that synthetic, policy-aware data can efficiently teach a general-purpose large language model to support precise medical coding without exposing protected health information. The approach offers a practical path for training coding agents safely and iteratively on specific tasks that represent real-world populations.
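The exact-match F1 reported above reduces to set arithmetic over predicted versus gold code sets. A minimal per-note sketch (the paper's exact aggregation across notes may differ; the ICD-10-CM codes are illustrative):

```python
def exact_match_f1(predicted: set, gold: set) -> float:
    """F1 over exact code matches: harmonic mean of precision and recall."""
    if not predicted and not gold:
        return 1.0  # vacuously perfect when nothing is expected or emitted
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Two of three gold codes recovered, no false positives.
score = exact_match_f1({"E11.9", "I10"}, {"E11.9", "I10", "Z79.4"})
```

Here precision is 1.0 and recall is 2/3, giving an F1 of 0.8 for this note.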
[NLP-78] DepthCharge: A Domain-Agnostic Framework for Measuring Depth-Dependent Knowledge in Large Language Models
【Quick Read】: This paper addresses the lack of a general method for measuring how deeply large language models (LLMs) can sustain accurate answers under adaptive follow-up questioning, i.e., their knowledge depth on domain-specific details. The key to the solution is DepthCharge, a domain-agnostic evaluation framework built on three innovations: adaptive probing that generates follow-up questions from concepts the model actually mentions, on-demand fact verification against authoritative sources, and survival statistics with constant sample sizes that yield comparable metrics at every depth level. The framework can be deployed on any knowledge domain with publicly verifiable facts, without pre-constructed test sets or domain expertise, revealing depth-dependent performance variation that standard benchmarks cannot capture and supporting comparative evaluation across models and domains.
Link: https://arxiv.org/abs/2603.23514
Authors: Alexander Sheppert
Affiliations: Legacy Health; Capitol Technology University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Large Language Models appear competent when answering general questions but often fail when pushed into domain-specific details. No existing methodology provides an out-of-the-box solution for measuring how deeply LLMs can sustain accurate responses under adaptive follow-up questioning across arbitrary domains. We present DepthCharge, a domain-agnostic framework that measures knowledge depth through three innovations: adaptive probing that generates follow-up questions based on concepts the model actually mentions, on-demand fact verification from authoritative sources, and survival statistics with constant sample sizes at every depth level. The framework can be deployed on any knowledge domain with publicly verifiable facts, without requiring pre-constructed test sets or domain-specific expertise. DepthCharge results are relative to the evaluator model used for answer checking, making the framework a tool for comparative evaluation rather than absolute accuracy certification. Empirical validation across four diverse domains (Medicine, Constitutional Law, Ancient Rome, and Quantum Computing) with five frontier models demonstrates that DepthCharge reveals depth-dependent performance variation hidden by standard benchmarks. Expected Valid Depth (EVD) ranges from 3.45 to 7.55 across model-domain combinations, and model rankings vary substantially by domain, with no single model dominating all areas. Cost-performance analysis further reveals that expensive models do not always achieve deeper knowledge, suggesting that domain-specific evaluation is more informative than aggregate benchmarks for model selection in professional applications.
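As a rough illustration of the survival-statistics idea, an Expected Valid Depth can be read as the expectation of a discrete survival curve: summing, over depth levels, the fraction of probe chains still answered correctly. This is an assumption based on standard survival analysis, not the paper's exact estimator; the survival fractions below are hypothetical:

```python
def expected_valid_depth(survival) -> float:
    """survival[d] = fraction of probe chains still answered correctly
    at depth d+1. Discrete survival-curve expectation: the sum of the
    per-depth survival probabilities."""
    return sum(survival)

# Hypothetical survival fractions for one model-domain pair.
evd = expected_valid_depth([1.0, 0.9, 0.7, 0.5, 0.25, 0.1])
```

Under this reading, a model that keeps more probe chains alive at deeper levels accumulates a larger EVD, regardless of where exactly each chain dies.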
[NLP-79] Berta: an open-source modular tool for AI-enabled clinical documentation
【Quick Read】: This paper addresses three problems with commercial AI scribe systems: high cost ($99-600 per physician per month), opaque operation, and clinical data that never returns to institutional infrastructure, limiting organizational control over data governance, quality improvement, and clinical workflows. The key to the solution is Berta, an open-source modular scribe platform that combines automatic speech recognition (ASR) with large language models while keeping all clinical data within the secure Alberta Health Services (AHS) environment, integrated seamlessly with AHS's existing Snowflake AI Data Cloud infrastructure. Operating costs came in under $30 per physician per month (a 70-95% reduction versus commercial offerings), and over eight months the system supported 198 emergency physicians across 105 urban and rural facilities through more than 22,000 clinical sessions, demonstrating scalability and practicality and offering health systems a reproducible, low-cost alternative with data sovereignty.
Link: https://arxiv.org/abs/2603.23513
Authors: Samridhi Vaid, Mike Weldon, Jesse Dunn, Sacha Davis, Kevin Lonergan, Henry Li, Jeffrey Franc, Mohamed Abdalla, Daniel C. Baumgart, Jake Hayward, J Ross Mitchell
Affiliations: University of Alberta; Alberta Health Services; Alberta Machine Intelligence Institute
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:
Abstract:Commercial AI scribes cost $99-600 per physician per month, operate as opaque systems, and do not return data to institutional infrastructure, limiting organizational control over data governance, quality improvement, and clinical workflows. We developed Berta, an open-source modular scribe platform for AI-enabled clinical documentation, and deployed a customized implementation within Alberta Health Services (AHS) integrated with their existing Snowflake AI Data Cloud infrastructure. The system combines automatic speech recognition with large language models while retaining all clinical data within the secure AHS environment. During eight months (November 2024 to July 2025), 198 emergency physicians used the system in 105 urban and rural facilities, generating 22,148 clinical sessions and more than 2,800 hours of audio. The use grew from 680 to 5,530 monthly sessions. Operating costs averaged less than $30 per physician per month, a 70-95% reduction compared to commercial alternatives. AHS has since approved expansion to 850 physicians. This is the first provincial-scale deployment of an AI scribe integrated with existing health system infrastructure. By releasing Berta as open source, we provide a reproducible, cost-effective alternative that health systems can adapt to their own secure environments, supporting data sovereignty and informed evaluation of AI documentation technology.
[NLP-80] DISCO: Document Intelligence Suite for COmparative Evaluation ICLR2026
【Quick Read】: This paper addresses the accuracy of text extraction and content reasoning in document intelligence, in particular how to evaluate the relative performance of optical character recognition (OCR) pipelines and vision-language models (VLMs) across diverse document types (handwritten text, multilingual scripts, medical forms, infographics, and multi-page documents). The key to the solution is DISCO (Document Intelligence Suite for COmparative Evaluation), a framework that evaluates OCR pipelines and VLMs independently on document parsing and question answering, revealing how document characteristics such as structural complexity, linguistic diversity, and visual richness affect performance, and providing empirical guidance for choosing a processing strategy based on document properties.
Link: https://arxiv.org/abs/2603.23511
Authors: Kenza Benkirane, Dan Goldwater, Martin Asenov, Aneiss Ghodsi
Affiliations: Parexel AI Labs
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the ICLR 2026 Workshop on Multimodal Intelligence (MMIntelligence). 10 pages, 7 figures
Abstract:Document intelligence requires accurate text extraction and reliable reasoning over document content. We introduce DISCO, a Document Intelligence Suite for COmparative Evaluation, that evaluates optical character recognition (OCR) pipelines and vision-language models (VLMs) separately on parsing and question answering across diverse document types, including handwritten text, multilingual scripts, medical forms, infographics, and multi-page documents. Our evaluation shows that performance varies substantially across tasks and document characteristics, underscoring the need for complexity-aware approach selection. OCR pipelines are generally more reliable for handwriting and for long or multi-page documents, where explicit text grounding supports text-heavy reasoning, while VLMs perform better on multilingual text and visually rich layouts. Task-aware prompting yields mixed effects, improving performance on some document types while degrading it on others. These findings provide empirical guidance for selecting document processing strategies based on document structure and reasoning demands.
[NLP-81] Visuospatial Perspective Taking in Multimodal Language Models
【Quick Read】: This paper addresses the insufficient evaluation of visuospatial perspective-taking (VPT) in multimodal language models (MLMs) for social and collaborative settings: existing benchmarks rely mainly on text vignettes or static scene understanding and do not probe perspective adaptation over spatial relations. The key to the solution is adapting two classic tasks from human psychology, the Director Task and the Rotating Figure Task, to systematically assess MLMs' VPT in referential communication and under angular disparities. The experiments reveal pronounced deficits in Level 2 VPT, which requires inhibiting one's own perspective to adopt another's, exposing fundamental limits in how current MLMs represent and reason about alternative perspectives, with important implications for their usefulness in collaborative contexts.
Link: https://arxiv.org/abs/2603.23510
Authors: Jonathan Prunty, Seraphina Zhang, Patrick Quinn, Jianxun Lian, Xing Xie, Lucy Cheke
Affiliations: University of Cambridge; Microsoft Research Asia
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:As multimodal language models (MLMs) are increasingly used in social and collaborative settings, it is crucial to evaluate their perspective-taking abilities. Existing benchmarks largely rely on text-based vignettes or static scene understanding, leaving visuospatial perspective-taking (VPT) underexplored. We adapt two evaluation tasks from human studies: the Director Task, assessing VPT in a referential communication paradigm, and the Rotating Figure Task, probing perspective-taking across angular disparities. Across tasks, MLMs show pronounced deficits in Level 2 VPT, which requires inhibiting one’s own perspective to adopt another’s. These results expose critical limitations in current MLMs’ ability to represent and reason about alternative perspectives, with implications for their use in collaborative contexts.
[NLP-82] Internal Safety Collapse in Frontier Large Language Models
【Quick Read】: This paper identifies and studies Internal Safety Collapse (ISC) in frontier large language models (LLMs): under certain task conditions, a model executing an otherwise benign task is induced by the task design to continuously generate harmful content. The key to the solution is the TVD (Task, Validator, Data) framework, which constructs the 53-scenario ISC-Bench to identify and trigger ISC; in professional-domain tasks where generating harmful content is the only valid way to complete the task, models readily fall into this safety failure mode. Experiments show that four frontier LLMs reach worst-case safety failure rates averaging 95.3% on three representative scenarios, substantially exceeding conventional jailbreak attacks, revealing that greater capability brings greater latent risk and that alignment training does not eliminate the underlying hazards.
Link: https://arxiv.org/abs/2603.23509
Authors: Yutao Wu, Xiao Liu, Yifeng Gao, Xiang Zheng, Hanxun Huang, Yige Li, Cong Wang, Bo Li, Xingjun Ma, Yu-Gang Jiang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 15 pages of the main text, qualitative examples of jailbreaks may be harmful in nature
Abstract:This work identifies a critical failure mode in frontier large language models (LLMs), which we term Internal Safety Collapse (ISC): under certain task conditions, models enter a state in which they continuously generate harmful content while executing otherwise benign tasks. We introduce TVD (Task, Validator, Data), a framework that triggers ISC through domain tasks where generating harmful content is the only valid completion, and construct ISC-Bench containing 53 scenarios across 8 professional disciplines. Evaluated on JailbreakBench, three representative scenarios yield worst-case safety failure rates averaging 95.3% across four frontier LLMs (including GPT-5.2 and Claude Sonnet 4.5), substantially exceeding standard jailbreak attacks. Frontier models are more vulnerable than earlier LLMs: the very capabilities that enable complex task execution become liabilities when tasks intrinsically involve harmful content. This reveals a growing attack surface: almost every professional domain uses tools that process sensitive data, and each new dual-use tool automatically expands this vulnerability–even without any deliberate attack. Despite substantial alignment efforts, frontier LLMs retain inherently unsafe internal capabilities: alignment reshapes observable outputs but does not eliminate the underlying risk profile. These findings underscore the need for caution when deploying LLMs in high-stakes settings. Source code: this https URL
[NLP-83] Beyond Masks: Efficient Flexible Diffusion Language Models via Deletion-Insertion Processes ICLR2026
【Quick Read】: This paper addresses the limits of masked diffusion language models (MDLMs) in computational efficiency and generation flexibility. MDLMs rely on token masking and unmasking, incurring two major overheads: redundant computation on non-informative MASK tokens, and extra cost from PAD tokens introduced in variable-length settings; the fixed-length padding they require also restricts generation flexibility. The key to the solution is Deletion-Insertion Diffusion language models (DID), which rigorously formulate token deletion and insertion as discrete diffusion processes that replace the masking paradigm: DID improves training and inference efficiency by eliminating the MASK and PAD overheads, while insertion natively supports variable-length sequences with an intrinsic self-correction mechanism, substantially improving generation flexibility.
Link: https://arxiv.org/abs/2603.23507
Authors: Fangyu Ding, Ding Ding, Sijin Chen, Kaibo Wang, Peng Xu, Zijin Feng, Haoli Bai, Kai Han, Youliang Yan, Binhang Yuan, Jiacheng Sun
Affiliations: HKUST; Huawei Foundation Model Dept; CUHK
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at ICLR 2026
Abstract:While Masked Diffusion Language Models (MDLMs) relying on token masking and unmasking have shown promise in language modeling, their computational efficiency and generation flexibility remain constrained by the masking paradigm. In this paper, we propose Deletion-Insertion Diffusion language models (DID) that rigorously formulate token deletion and insertion as discrete diffusion processes, replacing the masking and unmasking processes in current MDLMs. DID improves training and inference efficiency by eliminating two major sources of computational overhead in MDLMs: the computations on non-informative 1) MASK tokens inherent to the paradigm, and 2) PAD tokens introduced in variable-length settings. Furthermore, DID offers greater flexibility by: 1) natively supporting variable-length sequences without requiring fixed-length padding, and 2) an intrinsic self-correction mechanism during generation due to insertion that dynamically adjusts token positions. To train DID, we design a score-based approach that assigns scores to token insertion operations and derive appropriate training objectives. The objectives involve subsequence counting problems, which we efficiently solve via a parallelized dynamic programming algorithm. Our experiments across fixed and variable-length settings demonstrate the advantage of DID over baselines of MDLMs and existing insertion-based LMs, in terms of modeling performance, sampling quality, and training/inference speed, without any hyperparameter tuning.
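The subsequence counting problem that the training objective involves is a classic dynamic program. The sequential version below counts the number of distinct ways a target string occurs as a subsequence of a source string; the paper parallelizes this computation, but the recurrence shown here is the standard one, included only to make the objective concrete:

```python
def count_subsequences(s: str, t: str) -> int:
    """dp[j] = number of ways to form t[:j] as a subsequence of the
    prefix of s processed so far; dp[0] = 1 (the empty target)."""
    dp = [0] * (len(t) + 1)
    dp[0] = 1
    for ch in s:
        # iterate j descending so each character of s extends a match at most once
        for j in range(len(t), 0, -1):
            if t[j - 1] == ch:
                dp[j] += dp[j - 1]
    return dp[len(t)]

count = count_subsequences("rabbbit", "rabbit")  # 3 distinct ways
```

Runtime is O(len(s) * len(t)) sequentially; the inner update is the part that a parallel scan over positions of s can accelerate.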
[NLP-84] Leveraging Computerized Adaptive Testing for Cost-effective Evaluation of Large Language Models in Medical Benchmarking
【Quick Read】: This paper addresses the scalability and psychometric reliability gaps in evaluating large language models (LLMs) in medicine: conventional static benchmarks are costly to administer repeatedly, vulnerable to data contamination, and lack calibrated measurement properties for fine-grained performance tracking. The key to the solution is a computerized adaptive testing (CAT) framework grounded in item response theory (IRT), which selects items dynamically and terminates based on real-time ability estimates (standard error ≤ 0.3). Using only 1.3% of the item bank yields proficiency estimates nearly identical to full-bank results (correlation r = 0.988), with large reductions in compute and evaluation time and unchanged inter-model rankings, providing a standardized instrument for rapid, low-cost benchmarking of LLMs' medical knowledge.
Link: https://arxiv.org/abs/2603.23506
Authors: Tianpeng Zheng, Zhehan Jiang, Jiayi Liu, Shicong Feng
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 37 pages, 6 figures
Abstract:The rapid proliferation of large language models (LLMs) in healthcare creates an urgent need for scalable and psychometrically sound evaluation methods. Conventional static benchmarks are costly to administer repeatedly, vulnerable to data contamination, and lack calibrated measurement properties for fine-grained performance tracking. We propose and validate a computerized adaptive testing (CAT) framework grounded in item response theory (IRT) for efficient assessment of standardized medical knowledge in LLMs. The study comprises a two-phase design: a Monte Carlo simulation to identify optimal CAT configurations and an empirical evaluation of 38 LLMs using a human-calibrated medical item bank. Each model completed both the full item bank and an adaptive test that dynamically selected items based on real-time ability estimates and terminated upon reaching a predefined reliability threshold (standard error = 0.3). Results show that CAT-derived proficiency estimates achieved a near-perfect correlation with full-bank estimates (r = 0.988) while using only 1.3 percent of the items. Evaluation time was reduced from several hours to minutes per model, with substantial reductions in token usage and computational cost, while preserving inter-model performance rankings. This work establishes a psychometric framework for rapid, low-cost benchmarking of foundational medical knowledge in LLMs. The proposed adaptive methodology is intended as a standardized pre-screening and continuous monitoring tool and is not a substitute for real-world clinical validation or safety-oriented prospective studies.
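A minimal sketch of the adaptive loop under a two-parameter logistic (2PL) IRT model: items are chosen greedily by Fisher information at the current ability estimate, and testing stops once the standard error 1/sqrt(total information) reaches the threshold. The item parameters below are made up, and ability re-estimation after each response is omitted for brevity, so this is an illustration of the selection/stopping logic only, not the paper's full procedure:

```python
import math

def info_2pl(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item (discrimination a, difficulty b)
    at ability theta: a^2 * p * (1 - p)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def adaptive_test(items, theta=0.0, se_target=0.3):
    """Greedy CAT loop: administer the most informative remaining item,
    accumulate information, stop once SE = 1/sqrt(I) <= target."""
    remaining = list(items)
    administered = []
    total_info = 0.0
    while remaining:
        best = max(remaining, key=lambda it: info_2pl(theta, *it))
        remaining.remove(best)
        administered.append(best)
        total_info += info_2pl(theta, *best)
        if 1.0 / math.sqrt(total_info) <= se_target:
            break
    return administered, 1.0 / math.sqrt(total_info)
```

With items of information 1.0 at the current theta, the SE = 0.3 rule stops after 12 items (1/sqrt(12) ≈ 0.289), which is what makes early termination with tiny item fractions possible.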
[NLP-85] Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct ICLR2025
【Quick Read】: This paper investigates whether large language models (LLMs) can recognize text they themselves generated, a phenomenon with potential AI-safety implications that remains understudied. The study finds that the Llama3-8b-Instruct chat model can reliably distinguish its own outputs from human-written text, an ability acquired through exposure to its own outputs during post-training. The key breakthrough is identifying a vector in the residual stream that activates differentially when the model correctly judges text as self-written; this vector is tied to the model's concept of "self", and causal-intervention experiments show it directly controls how the model perceives and asserts authorship: applying the vector can steer both the model's behavior and its belief about whether it wrote a given text.
Link: https://arxiv.org/abs/2410.02064
Authors: Christopher Ackerman, Nina Panickssery
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages, 13 figures, 2 tables; accepted as a conference paper to ICLR 2025
Abstract:It has been reported that LLMs can recognize their own writing. As this has potential implications for AI safety, yet is relatively understudied, we investigate the phenomenon, seeking to establish whether it robustly occurs at the behavioral level, how the observed behavior is achieved, and whether it can be controlled. First, we find that the Llama3-8b-Instruct chat model - but not the base Llama3-8b model - can reliably distinguish its own outputs from those of humans, and present evidence that the chat model is likely using its experience with its own outputs, acquired during post-training, to succeed at the writing recognition task. Second, we identify a vector in the residual stream of the model that is differentially activated when the model makes a correct self-written-text recognition judgment, show that the vector activates in response to information relevant to self-authorship, present evidence that the vector is related to the concept of “self” in the model, and demonstrate that the vector is causally related to the model’s ability to perceive and assert self-authorship. Finally, we show that the vector can be used to control both the model’s behavior and its perception, steering the model to claim or disclaim authorship by applying the vector to the model’s output as it generates it, and steering the model to believe or disbelieve it wrote arbitrary texts by applying the vector to them as the model reads them.
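The steering intervention described above amounts to adding a scaled direction vector to the residual stream as the model generates or reads text. A minimal NumPy sketch on toy activations, where the "self-authorship" direction `v` and the scale are placeholders rather than the vector the authors extracted:

```python
import numpy as np

def steer(hidden_states, vector, alpha):
    # add a scaled steering vector to every token position's residual stream
    return hidden_states + alpha * vector

rng = np.random.default_rng(0)
d = 16                                # toy residual-stream width
v = rng.normal(size=d)
v /= np.linalg.norm(v)                # unit "self-authorship" direction (hypothetical)
h = rng.normal(size=(5, d))           # activations at 5 token positions

h_claim = steer(h, v, +4.0)           # push toward claiming authorship
h_deny = steer(h, v, -4.0)            # push toward disclaiming it

def proj(x):
    # mean projection onto the steering direction
    return float((x @ v).mean())
```

Adding `+alpha * v` shifts every position's projection onto `v` by exactly `alpha`, which is the mechanism the causal experiments exploit.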
[NLP-86] SpinGQE: A Generative Quantum Eigensolver for Spin Hamiltonians
【Quick Read】: This paper targets the ground-state search problem in quantum computing, where the Variational Quantum Eigensolver (VQE) works well only for small systems and suffers from barren plateaus, limited expressivity of parameterized circuits, and reliance on problem-specific structure. The key idea of SpinGQE is to reframe circuit design as a generative modeling task: a transformer-based decoder learns distributions over quantum circuits that produce low-energy states, trained with a weighted mean-squared-error loss that aligns the model's logits with circuit energies evaluated after each gate subsequence. The method navigates complex energy landscapes without relying on problem-specific symmetries or structure, achieves convergence to near-ground states on the four-qubit Heisenberg model, and offers a scalable alternative to traditional variational approaches.
Link: https://arxiv.org/abs/2603.24298
Authors: Alexander Holden, Moinul Hossain Rahat, Nii Osae Osae Dade
Affiliations: Mindbeam AI; Department of Computer and Data Sciences, Case Western Reserve University
Subjects: Quantum Physics (quant-ph); Computation and Language (cs.CL)
Comments:
Abstract:The ground state search problem is central to quantum computing, with applications spanning quantum chemistry, condensed matter physics, and optimization. The Variational Quantum Eigensolver (VQE) has shown promise for small systems but faces significant limitations. These include barren plateaus, restricted ansatz expressivity, and reliance on domain-specific structure. We present SpinGQE, an extension of the Generative Quantum Eigensolver (GQE) framework to spin Hamiltonians. Our approach reframes circuit design as a generative modeling task. We employ a transformer-based decoder to learn distributions over quantum circuits that produce low-energy states. Training is guided by a weighted mean-squared error loss between model logits and circuit energies evaluated at each gate subsequence. We validate our method on the four-qubit Heisenberg model, demonstrating successful convergence to near-ground states. Through systematic hyperparameter exploration, we identify optimal configurations: smaller model architectures (12 layers, 8 attention heads), longer sequence lengths (12 gates), and carefully chosen operator pools yield the most reliable convergence. Our results show that generative approaches can effectively navigate complex energy landscapes without relying on problem-specific symmetries or structure. This provides a scalable alternative to traditional variational methods for general quantum systems. An open-source implementation is available at this https URL.
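The training signal, a weighted MSE aligning model logits with circuit energies at each gate subsequence, can be sketched as follows. The pairing of cumulative prefix logits with prefix energies and the uniform default weights are assumptions for illustration, not the exact SpinGQE loss.

```python
import numpy as np

def gqe_loss(logits, energies, weights=None):
    """Weighted MSE between cumulative gate logits and prefix energies.

    logits:   (T,) logit of each selected gate token
    energies: (T,) circuit energy evaluated after each gate subsequence
    weights:  (T,) optional per-prefix weights (uniform if omitted)
    """
    logits = np.asarray(logits, dtype=float)
    energies = np.asarray(energies, dtype=float)
    cum = np.cumsum(logits)  # cumulative logit of each circuit prefix
    w = np.ones_like(cum) if weights is None else np.asarray(weights, dtype=float)
    return float(np.mean(w * (cum - energies) ** 2))
```

Minimizing this loss pushes the decoder to assign low cumulative logits exactly to the gate sequences whose prefixes have low evaluated energy.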
Information Retrieval
[IR-0] Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA
【Quick Read】: This paper addresses the insufficient reliability of retrieval-augmented generation (RAG) systems for analyzing complex policy documents, especially in domains with dense legal language and dynamically overlapping regulatory frameworks. The key to the solution is a RAG system specialized for the AI governance policy domain, combining a ColBERT-based retriever fine-tuned with contrastive learning and a generator aligned to human preferences via Direct Preference Optimization (DPO), with domain adaptation through synthetic queries and pairwise preference data. Experiments show that domain fine-tuning improves retrieval quality but does not consistently improve end-to-end question answering, and can even lead to more confident hallucinations when relevant documents are absent, exposing the core challenge that component-level improvements do not equal system-level reliability.
Link: https://arxiv.org/abs/2603.24580
Authors: Saahil Mathur, Ryan David Rittner, Vedant Ajit Thakur, Daniel Stuart Schiff, Tunazzina Islam
Affiliations: Purdue University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:Retrieval-augmented generation (RAG) systems are increasingly used to analyze complex policy documents, but achieving sufficient reliability for expert usage remains challenging in domains characterized by dense legal language and evolving, overlapping regulatory frameworks. We study the application of RAG to AI governance and policy analysis using the AI Governance and Regulatory Archive (AGORA) corpus, a curated collection of 947 AI policy documents. Our system combines a ColBERT-based retriever fine-tuned with contrastive learning and a generator aligned to human preferences using Direct Preference Optimization (DPO). We construct synthetic queries and collect pairwise preferences to adapt the system to the policy domain. Through experiments evaluating retrieval quality, answer relevance, and faithfulness, we find that domain-specific fine-tuning improves retrieval metrics but does not consistently improve end-to-end question answering performance. In some cases, stronger retrieval counterintuitively leads to more confident hallucinations when relevant documents are absent from the corpus. These results highlight a key concern for those building policy-focused RAG systems: improvements to individual components do not necessarily translate to more reliable answers. Our findings provide practical insights for designing grounded question-answering systems over dynamic regulatory corpora.
[IR-1] Evaluating Chunking Strategies For Retrieval-Augmented Generation in Oil and Gas Enterprise Documents CCS ATC
【Quick Read】: This paper addresses a key practical bottleneck of retrieval-augmented generation (RAG): the strong influence of document chunking strategies on retrieval effectiveness. The study shows that conventional text-driven chunking performs poorly on structurally complex or visually rich documents such as piping and instrumentation diagrams (P&IDs), whereas structure-aware chunking significantly improves retrieval accuracy (especially on top-K metrics) while lowering computational cost. The key to the solution is explicit structure preservation matched to the semantics and layout of specialized domain documents; the authors further note that future work must integrate multimodal models to overcome the limits of purely text-based RAG on visually and spatially encoded content.
Link: https://arxiv.org/abs/2603.24556
Authors: Samuel Taiwo, Mohd Amaluddin Yusoff
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Presented at CCSEIT 2026. This version matches the published proceedings
Abstract:Retrieval-Augmented Generation (RAG) has emerged as a framework to address the constraints of Large Language Models (LLMs). Yet, its effectiveness fundamentally hinges on document chunking - an often-overlooked determinant of its quality. This paper presents an empirical study quantifying performance differences across four chunking strategies: fixed-size sliding window, recursive, breakpoint-based semantic, and structure-aware. We evaluated these methods using a proprietary corpus of oil and gas enterprise documents, including text-heavy manuals, table-heavy specifications, and piping and instrumentation diagrams (P and IDs). Our findings show that structure-aware chunking yields higher overall retrieval effectiveness, particularly in top-K metrics, and incurs significantly lower computational costs than semantic or baseline strategies. Crucially, all four methods demonstrated limited effectiveness on P and IDs, underscoring a core limitation of purely text-based RAG within visually and spatially encoded documents. We conclude that while explicit structure preservation is essential for specialised domains, future work must integrate multimodal models to overcome current limitations.
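Two of the four compared strategies are easy to sketch: a fixed-size sliding window and a minimal structure-aware splitter that cuts at headings. Chunk size, overlap, and the heading pattern below are illustrative defaults, not the configuration used in the paper.

```python
import re

def sliding_window_chunks(text, size=200, overlap=50):
    # fixed-size sliding window over characters; the overlap preserves
    # context across chunk boundaries
    step = size - overlap
    chunks = []
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + size])
    return chunks

def structure_aware_chunks(doc):
    # split immediately before each Markdown-style heading so every chunk
    # stays within one structural unit of the document
    parts = re.split(r"(?m)^(?=#+ )", doc)
    return [p for p in parts if p.strip()]
```

A structure-aware splitter keeps tables and sections intact, which is one plausible reason the paper observes better top-K retrieval from it than from fixed windows that cut mid-section.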
[IR-2] Positive-First Most Ambiguous: A Simple Active Learning Criterion for Interactive Retrieval of Rare Categories
【Quick Read】: This paper addresses efficient discovery of rare categories in low-supervision, long-tailed, fine-grained visual retrieval, as in biodiversity monitoring, where the target class makes up only a tiny fraction of the data and conventional active learning (AL) is limited by its assumptions of symmetric class priors and ample annotation budgets. The key to the solution is the Positive-First Most Ambiguous (PF-MA) AL criterion, which balances boundary uncertainty with a preference for positives: it selects near-boundary samples that are likely positive, so that small annotation batches contain a markedly higher proportion of relevant samples, improving early retrieval performance and user satisfaction. A class-coverage metric is also introduced to measure how fully the selected positives capture the visual variability of the target class. Experiments on multiple fine-grained datasets, including botanical data, show PF-MA consistently outperforming strong baselines across class sizes and feature descriptors.
Link: https://arxiv.org/abs/2603.24480
Authors: Kawtar Zaher, Olivier Buisson, Alexis Joly
Affiliations: INRIA, LIRMM, Université de Montpellier, France; Institut National de l’Audiovisuel, France
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments:
Abstract:Real-world fine-grained visual retrieval often requires discovering a rare concept from large unlabeled collections with minimal supervision. This is especially critical in biodiversity monitoring, ecological studies, and long-tailed visual domains, where the target may represent only a tiny fraction of the data, creating highly imbalanced binary problems. Interactive retrieval with relevance feedback offers a practical solution: starting from a small query, the system selects candidates for binary user annotation and iteratively refines a lightweight classifier. While Active Learning (AL) is commonly used to guide selection, conventional AL assumes symmetric class priors and large annotation budgets, limiting effectiveness in imbalanced, low-budget, low-latency settings. We introduce Positive-First Most Ambiguous (PF-MA), a simple yet effective AL criterion that explicitly addresses the class imbalance asymmetry: it prioritizes near-boundary samples while favoring likely positives, enabling rapid discovery of subtle visual categories while maintaining informativeness. Unlike standard methods that oversample negatives, PF-MA consistently returns small batches with a high proportion of relevant samples, improving early retrieval and user satisfaction. To capture retrieval diversity, we also propose a class coverage metric that measures how well selected positives span the visual variability of the target class. Experiments on long-tailed datasets, including fine-grained botanical data, demonstrate that PF-MA consistently outperforms strong baselines in both coverage and classifier performance, across varying class sizes and descriptors. Our results highlight that aligning AL with the asymmetric and user-centric objectives of interactive fine-grained retrieval enables simple yet powerful solutions for retrieving rare and visually subtle categories in realistic human-in-the-loop settings.
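One plausible reading of the PF-MA criterion, prefer predicted positives and, among them, the most ambiguous, can be written in a few lines over classifier probabilities. The exact ordering and the fallback to high-scoring negatives are guesses for illustration, not the authors' specification.

```python
def pf_ma_select(probs, k):
    """Positive-First Most Ambiguous (sketch): among predicted positives,
    pick those closest to the decision boundary first; if the batch is not
    filled, fall back to the highest-scoring negatives."""
    idx = range(len(probs))
    # predicted positives, most ambiguous (lowest probability >= 0.5) first
    positives = sorted((i for i in idx if probs[i] >= 0.5), key=lambda i: probs[i])
    # predicted negatives, closest to the boundary first
    negatives = sorted((i for i in idx if probs[i] < 0.5), key=lambda i: -probs[i])
    return (positives + negatives)[:k]
```

With small `k`, the returned batch is dominated by likely positives near the boundary, which matches the abstract's claim of high-precision batches under tight annotation budgets.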
[IR-3] OneSearch-V2: The Latent Reasoning Enhanced Self-distillation Generative Search Framework
【Quick Read】: This paper addresses three core problems of deployed generative retrieval (GR) search systems: inadequate understanding of complex queries, inefficient exploitation of latent user intents, and personalization bias from overfitting to narrow historical preferences. The key innovations of OneSearch-V2 are: (1) a thought-augmented complex-query understanding module that overcomes the shallow semantic matching of direct inference through deep semantic modeling; (2) a reasoning-internalized self-distillation training pipeline that mines precise but unexpressed e-commerce intents from user behavior logs via implicit in-context learning; and (3) a behavior-preference alignment optimization system that mitigates the reward hacking caused by a single conversion metric through direct user feedback while improving personalization quality. Offline and online experiments show significant gains over baselines without additional inference cost or serving latency.
Link: https://arxiv.org/abs/2603.24422
Authors: Ben Chen, Siyuan Wang, Yufei Ma, Zihan Liang, Xuxin Zhang, Yue Lv, Ying Yang, Huangyu Dai, Lingtao Mao, Tong Zhao, Zhipeng Qian, Xinyu Sun, Zhixin Zhai, Yang Zhao, Bochao Liu, Jingshan Lv, Xiao Liang, Hui Kong, Jing Chen, Han Li, Chenyi Lei, Wenwu Ou, Kun Gai
Affiliations: Kuaishou Technology
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Key code is available at this https URL. Feel free to contact benchen4395@gmail.com
Abstract:Generative Retrieval (GR) has emerged as a promising paradigm for modern search systems. Compared to multi-stage cascaded architecture, it offers advantages such as end-to-end joint optimization and high computational efficiency. OneSearch, as a representative industrial-scale deployed generative search framework, has brought significant commercial and operational benefits. However, its inadequate understanding of complex queries, inefficient exploitation of latent user intents, and overfitting to narrow historical preferences have limited its further performance improvement. To address these challenges, we propose \textbfOneSearch-V2, a latent reasoning enhanced self-distillation generative search framework. It contains three key innovations: (1) a thought-augmented complex query understanding module, which enables deep query understanding and overcomes the shallow semantic matching limitations of direct inference; (2) a reasoning-internalized self-distillation training pipeline, which uncovers users’ potential yet precise e-commerce intentions beyond log-fitting through implicit in-context learning; (3) a behavior preference alignment optimization system, which mitigates reward hacking arising from the single conversion metric, and addresses personal preference via direct user feedback. Extensive offline evaluations demonstrate OneSearch-V2’s strong query recognition and user profiling capabilities. Online A/B tests further validate its business effectiveness, yielding +3.98% item CTR, +3.05% buyer conversion rate, and +2.11% order volume. Manual evaluation further confirms gains in search experience quality, with +1.65% in page good rate and +1.37% in query-item relevance. More importantly, OneSearch-V2 effectively mitigates common search system issues such as information bubbles and long-tail sparsity, without incurring additional inference costs or serving latency.
[IR-4] Exploring How Fair Model Representations Relate to Fair Recommendations
【Quick Read】: This paper questions whether the classifiability of group attributes from model representations is a valid proxy for recommendation parity, i.e., how similar the recommendations served to different user groups actually are; representation-level evaluation may fail to reflect actual differences in recommendation outcomes across groups. The key to the solution is two new measures, computed directly over ranked recommendation lists, of how much group information leaks into the recommendations themselves. Experiments with multiple models on one real and several synthetic datasets confirm that optimizing for fair representations does improve recommendation parity, but that relying only on representation-level fairness metrics gives misleading model comparisons, making recommendation-level fairness metrics the more reliable evaluation.
Link: https://arxiv.org/abs/2603.24396
Authors: Bjørnar Vassøy, Benjamin Kille, Helge Langseth
Affiliations: Norwegian University of Science and Technology
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages
Abstract:One of the many fairness definitions pursued in recent recommender system research targets mitigating demographic information encoded in model representations. Models optimized for this definition are typically evaluated on how well demographic attributes can be classified given model representations, with the (implicit) assumption that this measure accurately reflects recommendation parity, i.e., how similar recommendations given to different users are. We challenge this assumption by comparing the amount of demographic information encoded in representations with various measures of how the recommendations differ. We propose two new approaches for measuring how well demographic information can be classified given ranked recommendations. Our results from extensive testing of multiple models on one real and multiple synthetically generated datasets indicate that optimizing for fair representations positively affects recommendation parity, but also that evaluation at the representation level is not a good proxy for measuring this effect when comparing models. We also provide extensive insight into how recommendation-level fairness metrics behave for various models by evaluating their performances on numerous generated datasets with different properties.
[IR-5] Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing CVPR2026
【Quick Read】: This paper targets the quadratic growth in vision tokens, and hence compute cost, caused by increasing image resolution in document parsing. The authors attribute this inefficiency to the substantial redundancy of visual regions (such as background) in document images, which limits existing vision-language model (VLM) approaches. The key to the solution is PaddleOCR-VL, a novel coarse-to-fine architecture centered on a lightweight Valid Region Focus Module (VRFM) that uses localization and contextual-relationship prediction to identify semantically relevant vision tokens and suppress redundant regions. Guided by VRFM, a compact yet powerful 0.9B-parameter vision-language model (PaddleOCR-VL-0.9B) performs detailed recognition only on the key regions, greatly reducing the vision tokens and compute required while achieving outstanding end-to-end document understanding performance.
Link: https://arxiv.org/abs/2603.24326
Authors: Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Jing Zhang, Jun Zhang, Xing Wei, Yi Liu, Dianhai Yu, Yanjun Ma
Affiliations: PaddlePaddle Team, Baidu Inc.; Xi’an Jiaotong University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted by CVPR 2026
Abstract:Document parsing is a fine-grained task where image resolution significantly impacts performance. While advanced research leveraging vision-language models benefits from high-resolution input to boost model performance, this often leads to a quadratic increase in the number of vision tokens and significantly raises computational costs. We attribute this inefficiency to substantial visual regions redundancy in document images, like background. To tackle this, we propose PaddleOCR-VL, a novel coarse-to-fine architecture that focuses on semantically relevant regions while suppressing redundant ones, thereby improving both efficiency and performance. Specifically, we introduce a lightweight Valid Region Focus Module (VRFM) which leverages localization and contextual relationship prediction capabilities to identify valid vision tokens. Subsequently, we design and train a compact yet powerful 0.9B vision-language model (PaddleOCR-VL-0.9B) to perform detailed recognition, guided by VRFM outputs to avoid direct processing of the entire large image. Extensive experiments demonstrate that PaddleOCR-VL achieves state-of-the-art performance in both page-level parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference while utilizing substantially fewer vision tokens and parameters, highlighting the effectiveness of targeted coarse-to-fine parsing for accurate and efficient document understanding. The source code and models are publicly available at this https URL.
[IR-6] UniScale: Synergistic Entire Space Data and Model Scaling for Search Ranking
【Quick Read】: This paper addresses the performance ceiling in industrial search, advertising, and recommendation systems caused by focusing on architecture alone while neglecting data-architecture co-design: scaling model parameters yields diminishing returns, and degradation caused by complex heterogeneous data distributions is often irrecoverable through model design alone. The key to UniScale is jointly optimizing data and architecture through two core components: ES³ (Entire-Space Sample System), which builds a high-quality entire-space training data system that fuses intra-domain request contexts with cross-domain sample alignment to improve supervision signals; and HHSFT (Heterogeneous Hierarchical Sample Fusion Transformer), a novel architecture that models the complex distributions of scaled heterogeneous data via heterogeneous hierarchical feature interaction and entire-space user-interest fusion, thereby surpassing the ceiling of structure-only tuning.
Link: https://arxiv.org/abs/2603.24226
Authors: Liren Yu, Caiyuan Li, Feiyi Dong, Tao Zhang, Zhixuan Zhang, Dan Ou, Haihong Tang, Bo Zheng
Affiliations: Taobao & Tmall Group of Alibaba
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:Recent advances in Large Language Models (LLMs) have inspired a surge of scaling law research in industrial search, advertising, and recommendation systems. However, existing approaches focus mainly on architectural improvements, overlooking the critical synergy between data and architecture design. We observe that scaling model parameters alone exhibits diminishing returns, i.e., the marginal gain in performance steadily declines as model size increases, and that the performance degradation caused by complex heterogeneous data distributions is often irrecoverable through model design alone. In this paper, we propose UniScale to address these limitation, a novel co-design framework that jointly optimizes data and architecture to unlock the full potential of model scaling, which includes two core parts: (1) ES ^3 (Entire-Space Sample System), a high-quality data scaling system that expands the training signal beyond conventional sampling strategies from both intra-domain request contexts with global supervised signal constructed by hierarchical label attribution and cross-domain samples aligning with the essence of user decision under similar content exposure environment in search domain; and (2) HHSFT (Heterogeneous Hierarchical Sample Fusion Transformer), a novel architecture designed to effectively model the complex heterogeneous distribution of scaled data and to harness the entire space user behavior data with Heterogeneous Hierarchical Feature Interaction and Entire Space User Interest Fusion, thereby surpassing the performance ceiling of structure-only model tuning. Extensive experiments on large-scale real world E-commerce search platform demonstrate that UniScale achieves significant improvements through the synergistic co-design of data and architecture and exhibits clear scaling trends, delivering substantial gains in key business metrics.
[IR-7] Who Benefits from RAG ? The Role of Exposure Utility and Attribution Bias
【Quick Read】: This paper studies a potential fairness problem in retrieval-augmented generation (RAG): query group fairness, i.e., whether queries associated with different social groups systematically differ in accuracy, or in the accuracy gains of RAG over LLM-only baselines. The key to the solution is identifying and quantifying the contributions of three factors: group exposure (the proportion of each group's documents in the retrieved set, determined by the retriever), group utility (how much a group's documents contribute to accuracy improvements), and group attribution (how strongly the generator relies on a group's documents). Experiments on datasets derived from the TREC 2022 Fair Ranking Track show that RAG systems not only fail to mitigate unfairness but amplify group-level disparities in average accuracy and accuracy gains, and that the three factors exhibit strong positive or negative correlations with group accuracy, underscoring their central role in building fair RAG systems.
Link: https://arxiv.org/abs/2603.24218
Authors: Mahdi Dehghan, Graham McDonald
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) enhanced with Retrieval-Augmented Generation (RAG) have achieved substantial improvements in accuracy by grounding their responses in external documents that are relevant to the user’s query. However, relatively little work has investigated the impact of RAG in terms of fairness. Particularly, it is not yet known if queries that are associated with certain groups within a fairness category systematically receive higher accuracy, or accuracy improvements in RAG systems compared to LLM-only, a phenomenon we refer to as query group fairness. In this work, we conduct extensive experiments to investigate the impact of three key factors on query group fairness in RAG, namely: Group exposure, i.e., the proportion of documents from each group appearing in the retrieved set, determined by the retriever; Group utility, i.e., the degree to which documents from each group contribute to improving answer accuracy, capturing retriever-generator interactions; and Group attribution, i.e., the extent to which the generator relies on documents from each group when producing responses. We examine group-level average accuracy and accuracy improvements disparities across four fairness categories using three datasets derived from the TREC 2022 Fair Ranking Track for two tasks: article generation and title generation. Our findings show that RAG systems suffer from the query group fairness problem and amplify disparities in terms of average accuracy across queries from different groups, compared to an LLM-only setting. Moreover, group utility, exposure, and attribution can exhibit strong positive or negative correlations with average accuracy or accuracy improvements of queries from that group, highlighting their important role in fair RAG. Our data and code are publicly available from Github.
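Of the three factors, group exposure is the most mechanical to compute: the share of retrieved documents belonging to each group. A minimal sketch (the group labels are hypothetical):

```python
from collections import Counter

def group_exposure(retrieved_groups):
    # proportion of retrieved documents that belong to each group,
    # given the group label of every document in the retrieved set
    counts = Counter(retrieved_groups)
    total = sum(counts.values())
    return {group: c / total for group, c in counts.items()}
```

Utility and attribution require answer-accuracy deltas and generator-side attribution signals respectively, so they cannot be reduced to a one-liner like this.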
[IR-8] Where Do Your Citations Come From? Citation-Constellation: A Free Open-Source No-Code and Auditable Tool for Citation Network Decomposition with Complementary BARON and HEROCON Scores
【Quick Read】: This paper addresses the problem that standard citation metrics treat all citations as equal, obscuring the social and structural pathways through which scholarly influence actually propagates. The core solution is Citation-Constellation, a no-code tool that decomposes a researcher's citation profile by the network proximity between citing and cited authors using two complementary bibliometric scores: BARON (Boundary-Anchored Research Outreach Network score), which strictly counts only citations from outside the collaborative network, and HEROCON (Holistic Equilibrated Research Outreach CONstellation score), which assigns graduated weights to in-group citations based on relationship proximity; the gap between the two serves as a structural diagnostic of a scholar's dependence on their inner circle. The tool automates the analysis through a phased architecture and combines ORCID-validated identity resolution, ROR-based institutional matching, and local-LLM-driven venue governance extraction, ensuring auditable results without any programming.
Link: https://arxiv.org/abs/2603.24216
Authors: Mahbub Ul Alam
Affiliations: SciLifeLab Data Centre, Uppsala University, Sweden
Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR)
Comments: Citation-Constellation no-code tool link: this https URL
Abstract:Standard citation metrics treat all citations as equal, obscuring the social and structural pathways through which scholarly influence propagates. I introduce Citation-Constellation, a freely available no-code tool for citation network analysis with two complementary bibliometric scores that decompose a researcher’s citation profile by network proximity between citing and cited authors. BARON (Boundary-Anchored Research Outreach Network score) is a strict binary metric counting only citations from outside the detected collaborative network. HEROCON (Holistic Equilibrated Research Outreach CONstellation score) applies graduated weights assigning partial credit to in-group citations based on relationship proximity. The gap between scores serves as a diagnostic of inner-circle dependence. An extended abstract with full details appears in the paper. The tool implements this through a phased architecture: (1) self-citation analysis, (2) co-authorship graph traversal, (3) temporal institutional affiliation matching via ROR, and (4) AI-agent-driven venue governance extraction using a local LLM. Phases 1-3 are fully operational; Phase 4 is under development. Key design choices include ORCID-validated author identity resolution, an UNKNOWN classification for citations with insufficient metadata, and comprehensive audit trails documenting every classification decision. A no-code web interface enables researchers to compute scores without programming, installation, or registration. I present these scores as structural diagnostics, not quality indicators. BARON and HEROCON describe where in the social graph citations originate. They should not be used for hiring, promotion, or funding decisions. HEROCON weights are experimental and require empirical calibration. 
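The two scores can be sketched directly from their definitions: BARON counts only out-of-network citations, while HEROCON gives graduated credit by proximity. The linear `1 - proximity` weighting below is an assumption for illustration; the paper itself notes that HEROCON weights are experimental and require empirical calibration.

```python
def baron(citing_authors, in_group):
    # strict binary score: count only citations whose authors fall
    # outside the detected collaborative network
    return sum(1 for author in citing_authors if author not in in_group)

def herocon(citing_authors, proximity):
    """Graduated score: each citation earns 1 - proximity credit, where
    proximity is 0.0 for an outsider (full credit) and 1.0 for the
    closest collaborator (no credit). Authors absent from the proximity
    map are treated as outsiders."""
    return sum(1.0 - proximity.get(author, 0.0) for author in citing_authors)
```

The gap `herocon(...) - baron(...)` then reflects how much of a profile's credit comes from in-group citations, the inner-circle diagnostic described above.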
[IR-9] SumRank: Aligning Summarization Models for Long-Document Listwise Reranking
【Quick Read】: This paper addresses the dual effectiveness and efficiency challenges of applying large language models (LLMs) directly to long-document ranking, which stem from the substantially increased context length. The key idea is SumRank, a pointwise summarization model trained under a downstream listwise reranking objective that compresses long documents into concise, rank-aligned summaries, sharply reducing computational complexity and inference latency without sacrificing performance. Its core innovation is a three-stage training pipeline: cold-start supervised fine-tuning (SFT), specialized data construction for reinforcement learning (RL), and rank-driven alignment, which ensures the summaries preserve the relevance signals needed for efficient and accurate final ranking.
Link: https://arxiv.org/abs/2603.24204
Authors: Jincheng Feng, Wenhan Liu, Zhicheng Dou
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Large Language Models (LLMs) have demonstrated superior performance in listwise passage reranking task. However, directly applying them to rank long-form documents introduces both effectiveness and efficiency issues due to the substantially increased context length. To address this challenge, we propose a pointwise summarization model SumRank, aligned with downstream listwise reranking, to compress long-form documents into concise rank-aligned summaries before the final listwise reranking stage. To obtain our summarization model SumRank, we introduce a three-stage training pipeline comprising cold-start Supervised Fine-Tuning (SFT), specialized RL data construction, and rank-driven alignment via Reinforcement Learning. This paradigm aligns the SumRank with downstream ranking objectives to preserve relevance signals. We conduct extensive experiments on five benchmark datasets from the TREC Deep Learning tracks (TREC DL 19-23). Results show that our lightweight SumRank model achieves state-of-the-art (SOTA) ranking performance while significantly improving efficiency by reducing both summarization overhead and reranking complexity.
[IR-10] Sequence-aware Large Language Models for Explainable Recommendation
【Quick Read】: This paper addresses two key problems in current LLM-based explainable recommendation: existing methods ignore the sequential dynamics of user behavior, and evaluation metrics are misaligned with practical recommendation utility. The key elements of the SELLER framework are: (1) a dual-path encoder that jointly captures user behavior sequences and item semantics; (2) a Mixture-of-Experts adapter that effectively aligns these multi-source signals with the LLM; and (3) a unified evaluation framework that jointly measures explanation text quality and its actual effect on recommendation outcomes, enabling explainable recommendation that better reflects real-world utility.
Link: https://arxiv.org/abs/2603.24136
Authors: Gangyi Zhang, Runzhe Teng, Chongming Gao
Affiliations: University of Science and Technology of China
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Large Language Models (LLMs) have shown strong potential in generating natural language explanations for recommender systems. However, existing methods often overlook the sequential dynamics of user behavior and rely on evaluation metrics misaligned with practical utility. We propose SELLER (SEquence-aware LLM-based framework for Explainable Recommendation), which integrates explanation generation with utility-aware evaluation. SELLER combines a dual-path encoder, capturing both user behavior and item semantics, with a Mixture-of-Experts adapter to align these signals with LLMs. A unified evaluation framework assesses explanations via both textual quality and their effect on recommendation outcomes. Experiments on public benchmarks show that SELLER consistently outperforms prior methods in explanation quality and real-world utility.
[IR-11] S4CMDR: a metadata repository for electronic health records
【Quick Read】: This paper addresses the incompatibility of electronic health records (EHRs) caused by differing standards across countries and healthcare institutions, which limits large-scale and cross-clinical machine learning. The key to the solution is S4CMDR, an open-source metadata repository built on ISO 11179-3 with a middle-out metadata standardisation approach: it automates cataloguing to reduce errors, supports the discovery of compatible feature sets across data registries, and offers a user-friendly interface with flexible deployment (on-premise Linux or cloud), substantially improving the discoverability and usability of EHR data.
Link: https://arxiv.org/abs/2603.24118
Authors: Jiawei Zhao (1), Md Shamim Ahmed (1), Nicolai Dinh Khang Truong (1), Verena Schuster (2), Rudolf Mayer (2), Richard Röttger (1) ((1) University of Southern Denmark, Department for Mathematics and Computer Science, Denmark; (2) SBA Research, Austria)
Affiliations: University of Southern Denmark, Department for Mathematics and Computer Science; SBA Research
Subjects: Information Retrieval (cs.IR)
Comments: 16 pages, 7 figures
Abstract:Background: Electronic health records (EHRs) enable machine learning for diagnosis, prognosis, and clinical decision support. However, EHR standards vary by country and hospital, making records often incompatible. This limits large-scale and cross-clinical machine learning. To address such complexity, a metadata repository cataloguing available data elements, their value domains, and their compatibility is an essential tool. This allows researchers to leverage relevant data for tasks such as identifying undiagnosed rare disease patients. Results: Within the Screen4Care project, we developed S4CMDR, an open-source metadata repository built on ISO 11179-3, based on a middle-out metadata standardisation approach. It automates cataloguing to reduce errors and enable the discovery of compatible feature sets across data registries. S4CMDR supports on-premise Linux deployment and cloud hosting, with state-of-the-art user authentication and an accessible interface. Conclusions: S4CMDR is a clinical metadata repository for registering and discovering compatible EHR records. Novel contributions include a microservice architecture, a middle-out standardisation approach, and a user-friendly interface for error-free data registration and visualisation of metadata compatibility. We validate S4CMDR through case studies involving rare disease patients. We invite clinical data holders to populate S4CMDR using their metadata to validate the generalisability and support further development.
[IR-12] Hierarchical Spatial-Temporal Graph-Enhanced Model for Map-Matching
【速读】:该论文旨在解决当前地图匹配(map-matching)任务中因规则方法局限性、大规模数据标注困难、时空关系建模效率低以及训练与测试数据分布差异导致的性能瓶颈问题。其解决方案的关键在于提出一种两阶段框架HSTGMatch:首先通过分层自监督学习(hierarchical self-supervised learning)构建基于网格单元(grid cells)和地理元组(geographic tuples)的轨迹表示,有效捕捉移动模式;其次引入自适应轨迹邻接图(Adaptive Trajectory Adjacency Graph)动态建模空间关系,并优化图注意力网络(GATs)以提升计算效率;同时设计时空因子(Spatial-Temporal Factor)提取关键特征,并采用衰减系数应对轨迹长度变化,从而显著增强模型在复杂场景下的鲁棒性和泛化能力。
链接: https://arxiv.org/abs/2603.24054
作者: Anjun Gao,Zhenglin Wan,Pingfu Chao,Shunyu Yao
机构: 未知
类目: Databases (cs.DB); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:The integration of GNSS data into portable devices has led to the generation of vast amounts of trajectory data, which is crucial for applications such as map-matching. To tackle the limitations of rule-based methods, recent works in deep learning for trajectory-related tasks occur. However, existing models remain challenging due to issues such as the difficulty of large-scale data labeling, ineffective modeling of spatial-temporal relationships, and discrepancies between training and test data distributions. To tackle these challenges, we propose HSTGMatch, a novel model designed to enhance map-matching performance. Our approach involves a two-stage process: hierarchical self-supervised learning and spatial-temporal supervised learning. We introduce a hierarchical trajectory representation, leveraging both grid cells and geographic tuples to capture moving patterns effectively. The model constructs an Adaptive Trajectory Adjacency Graph to dynamically capture spatial relationships, optimizing GATs for improved efficiency. Furthermore, we incorporate a Spatial-Temporal Factor to extract relevant features and employ a decay coefficient to address variations in trajectory length. Our extensive experiments demonstrate the model’s superior performance, module effectiveness, and robustness, providing a promising solution for overcoming the existing limitations in map-matching applications. The source code of HSTGMatch is publicly available on GitHub at this https URL.
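以下为一个示意性草图(非 HSTGMatch 原实现,网格原点与粒度均为假设参数),展示将 GPS 轨迹离散化为网格单元序列这一分层表示的第一层:

```python
# Toy sketch of grid-cell trajectory discretization; origin and cell size
# are arbitrary assumptions, not the paper's configuration.
def to_grid_cell(lat, lon, origin=(31.0, 121.0), cell_deg=0.01):
    """Map a (lat, lon) point to an integer (row, col) grid cell."""
    row = int((lat - origin[0]) / cell_deg)
    col = int((lon - origin[1]) / cell_deg)
    return (row, col)


def trajectory_to_cells(points):
    """Collapse consecutive points falling in the same cell into one token,
    turning a raw GPS trace into a compact cell-id sequence."""
    cells = []
    for lat, lon in points:
        c = to_grid_cell(lat, lon)
        if not cells or cells[-1] != c:
            cells.append(c)
    return cells


traj = [(31.005, 121.005), (31.006, 121.006), (31.015, 121.005)]
```

论文的分层表示在此之上还引入地理元组与自适应邻接图,这里只示意最底层的离散化步骤。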
[IR-13] Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理复杂历史与宗教阿拉伯语文本(如《古兰经》和圣训)时表现不佳的问题。其解决方案的关键在于构建一个基于历时词典知识的检索增强生成(Retrieval-Augmented Generation, RAG)框架,通过从《阿拉伯语历史词典》(Doha Historical Dictionary of Arabic, DHDA)中检索具有历史语义演变信息的证据,替代传统依赖通用语料库的RAG方法。该框架采用混合检索与意图导向路由机制,精准提供上下文相关的历时语言信息,显著提升了阿拉伯语原生LLMs(如Fanar和ALLaM)在相关任务上的准确率(超过85%),缩小了与商用模型Gemini的性能差距。
链接: https://arxiv.org/abs/2603.23972
作者: Somaya Eltanbouly,Samer Rashwani
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Large language models (LLMs) have achieved remarkable progress in many language tasks, yet they continue to struggle with complex historical and religious Arabic texts such as the Quran and Hadith. To address this limitation, we develop a retrieval-augmented generation (RAG) framework grounded in diachronic lexicographic knowledge. Unlike prior RAG systems that rely on general-purpose corpora, our approach retrieves evidence from the Doha Historical Dictionary of Arabic (DHDA), a large-scale resource documenting the historical development of Arabic vocabulary. The proposed pipeline combines hybrid retrieval with an intent-based routing mechanism to provide LLMs with precise, contextually relevant historical information. Our experiments show that this approach improves the accuracy of Arabic-native LLMs, including Fanar and ALLaM, to over 85%, substantially reducing the performance gap with Gemini, a proprietary large-scale model. Gemini also serves as an LLM-as-a-judge system for automatic evaluation in our experiments. The automated judgments were verified through human evaluation, demonstrating high agreement (kappa = 0.87). An error analysis further highlights key linguistic challenges, including diacritics and compound expressions. These findings demonstrate the value of integrating diachronic lexicographic resources into retrieval-augmented generation frameworks to enhance Arabic language understanding, particularly for historical and religious texts. The code and resources are publicly available at: this https URL.
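下面用一个极简草图示意混合检索的打分融合与意图路由思路(权重与路由策略均为假设,仅作说明,并非论文的具体设计):

```python
# Hedged sketch of hybrid retrieval: linear fusion of a lexical score
# (e.g. BM25) and a dense similarity score, with an intent-based weight.
def hybrid_score(lexical, dense, alpha=0.5):
    """Fuse a lexical and a dense score, each assumed normalized to [0, 1]."""
    return alpha * lexical + (1 - alpha) * dense


def route(query_intent):
    """Hypothetical routing policy: definition lookups lean lexical,
    interpretive questions lean dense/semantic."""
    return {"definition": 0.8, "interpretation": 0.3}.get(query_intent, 0.5)


# Each doc has a (lexical, dense) score pair for the current query.
docs = {"d1": (0.9, 0.2), "d2": (0.3, 0.8)}
alpha = route("definition")
ranked = sorted(docs, key=lambda d: hybrid_score(*docs[d], alpha), reverse=True)
```

论文的路由基于意图分类器,并检索历时词典证据;此处只示意"按意图调节词面/语义权重"这一机制。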
[IR-14] VILLA: Versatile Information Retrieval From Scientific Literature Using Large LAnguage Models KDD2026
【速读】:该论文旨在解决科学信息抽取(Scientific Information Extraction, SIE)领域中高质量标注数据集匮乏的问题,从而推动人工智能在科学研究中的应用。现有基于大语言模型(Large Language Models, LLMs)的SIE方法多集中于生物医学和化学等宽泛领域,且受限于选择题式任务与短文本格式,对复杂、开放式任务的探索不足。为填补这一空白,作者聚焦于被忽视的病毒学领域,设计了一项全新的开放性SIE任务——从文献中提取能够改变病毒与宿主相互作用的突变信息,并提出一种多步检索增强生成(Retrieval-Augmented Generation, RAG)框架VILLA作为解决方案。其关键创新在于结合了结构化检索与生成策略,通过分阶段的知识获取与推理机制显著提升突变信息抽取的准确性,同时构建了一个包含629个流感A病毒蛋白突变的新型基准数据集,为后续研究提供高质量的训练与评估基础。
链接: https://arxiv.org/abs/2603.23849
作者: Blessy Antony,Amartya Dutta,Sneha Aggarwal,Vasu Gatne,Ozan Gökdemir,Samantha Grimes,Adam Lauring,Brian R. Wasik,Anuj Karpatne,T. M. Murali
机构: Virginia Tech (弗吉尼亚理工大学); University of Chicago (芝加哥大学); University of Michigan (密歇根大学); Cornell University (康奈尔大学)
类目: Information Retrieval (cs.IR)
备注: Under review at ACM KDD 2026 (AI for Sciences Track)
Abstract:The lack of high-quality ground truth datasets to train machine learning (ML) models impedes the potential of artificial intelligence (AI) for science research. Scientific information extraction (SIE) from the literature using LLMs is emerging as a powerful approach to automate the creation of these datasets. However, existing LLM-based approaches and benchmarking studies for SIE focus on broad topics such as biomedicine and chemistry, are limited to choice-based tasks, and focus on extracting information from short and well-formatted text. The potential of SIE methods in complex, open-ended tasks is considerably under-explored. In this study, we used a domain that has been virtually ignored in SIE, namely virology, to address these research gaps. We design a unique, open-ended SIE task of extracting mutations in a given virus that modify its interaction with the host. We develop a new, multi-step retrieval augmented generation (RAG) framework called VILLA for SIE. In parallel, we curate a novel dataset of 629 mutations in ten influenza A virus proteins obtained from 239 scientific publications to serve as ground truth for the mutation extraction task. Finally, we demonstrate VILLA’s superior performance using a novel and comprehensive evaluation and comparison with vanilla RAG and other state-of-the-art RAG- and agent-based tools for SIE.
[IR-15] An In-Depth Study of Filter-Agnostic Vector Search on a PostgreSQL Database System: [Experiments and Analysis] SIGMOD2026
【速读】:该论文旨在解决过滤无关的向量搜索(Filter-agnostic Vector Search, FVS)在生产级数据库系统中性能表现与学术研究结果不一致的问题。现有工作多在专用库中评估算法,假设条件过于理想化,未能反映企业级数据库系统的实际负载和资源开销。其解决方案的关键在于:通过在兼容 PostgreSQL 的生产级系统中对后过滤(post-filtering)与内联过滤(inline-filtering)策略进行系统性评估,揭示出最优算法的选择不仅取决于距离计算成本,更受制于系统层面的开销(如页访问、数据检索等)。研究表明,尽管基于图的方法(如 NaviX/ACORN)理论上高效,但在真实数据库环境中可能因过多的过滤检查和系统级开销而丧失优势,相比之下,聚类索引方法(如 ScaNN)更具实用性。因此,论文强调应根据工作负载特征和底层数据访问成本做出系统感知的算法选择,而非简单依赖理论性能指标。
链接: https://arxiv.org/abs/2603.23710
作者: Duo Lu,Helena Caminal,Manos Chatzakis,Yannis Papakonstantinou,Yannis Chronis,Vaibhav Jain,Fatma Özcan
机构: Brown University (布朗大学); Google (谷歌); Université Paris Cité (巴黎城市大学); ETH Zurich (苏黎世联邦理工学院)
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 26 pages, 13 figures, to be published at SIGMOD 2026
Abstract:Filtered Vector Search (FVS) is critical for supporting semantic search and GenAI applications in modern database systems. However, existing research most often evaluates algorithms in specialized libraries, making optimistic assumptions that do not align with enterprise-grade database systems. Our work challenges this premise by demonstrating that in a production-grade database system, commonly made assumptions do not hold, leading to performance characteristics and algorithmic trade-offs that are fundamentally different from those observed in isolated library settings. This paper presents the first in-depth analysis of filter-agnostic FVS algorithms within a production PostgreSQL-compatible system. We systematically evaluate post-filtering and inline-filtering strategies across a wide range of selectivities and correlations. Our central finding is that the optimal algorithm is not dictated by the cost of distance computations alone, but that system-level overheads that come from both distance computations and filter operations (like page accesses and data retrieval) play a significant role. We demonstrate that graph-based approaches (such as NaviX/ACORN) can incur prohibitive numbers of filter checks and system-level overheads, compared with clustering-based indexes such as ScaNN, often canceling out their theoretical benefits in real-world database environments. Ultimately, our findings provide the database community with crucial insights and practical guidelines, demonstrating that the optimal choice for a filter-agnostic FVS algorithm is not absolute, but rather a system-aware decision contingent on the interplay between workload characteristics and the underlying costs of data access in a real-world database architecture. 
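为直观说明文中比较的后过滤(post-filtering)与内联过滤(inline-filtering)两种策略,下面给出一个基于暴力距离扫描的玩具示例(生产系统使用图或聚类 ANN 索引,此处仅示意两种策略的取舍;函数名均为假设):

```python
# Toy contrast of the two filter-agnostic FVS strategies, using brute-force
# scans for clarity. Real systems replace the scans with ANN indexes.
import math


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def post_filter_search(query, vectors, labels, pred, k):
    """Post-filtering: retrieve an over-fetched candidate set first, then
    apply the predicate. May return fewer than k hits at low selectivity."""
    cand = sorted(range(len(vectors)),
                  key=lambda i: dist(query, vectors[i]))[:k * 2]
    return [i for i in cand if pred(labels[i])][:k]


def inline_filter_search(query, vectors, labels, pred, k):
    """Inline-filtering: evaluate the predicate during the scan, paying a
    filter check per visited vector but guaranteeing k qualifying hits."""
    ok = [i for i in range(len(vectors)) if pred(labels[i])]
    return sorted(ok, key=lambda i: dist(query, vectors[i]))[:k]


vectors = [(0.0,), (1.0,), (2.0,), (3.0,)]
labels = ["a", "b", "a", "b"]
```

论文的核心结论正是:两者的优劣不只取决于距离计算量,还取决于过滤检查与页访问等系统级开销。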
[IR-16] Mixture of Demonstrations for Textual Graph Understanding and Question Answering
【速读】:该论文旨在解决文本图结构增强生成(GraphRAG)在领域特定问答中因检索子图包含无关信息以及缺乏高质量示例选择机制而导致的推理准确率下降问题。其解决方案的关键在于提出MixDemo框架,该框架引入了基于专家混合(Mixture-of-Experts, MoE)的演示选择机制,以在不同问题语境下动态筛选最具信息量的示例;同时设计了一个查询感知的图编码器,通过选择性注意力机制过滤检索子图中的噪声信息,从而提升生成质量与推理性能。
链接: https://arxiv.org/abs/2603.23554
作者: Yukun Wu,Lihui Liu
机构: Wayne State University (韦恩州立大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Textual graph-based retrieval-augmented generation (GraphRAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) in domain-specific question answering. While existing approaches primarily focus on zero-shot GraphRAG, selecting high-quality demonstrations is crucial for improving reasoning and answer accuracy. Furthermore, recent studies have shown that retrieved subgraphs often contain irrelevant information, which can degrade reasoning performance. In this paper, we propose MixDemo, a novel GraphRAG framework enhanced with a Mixture-of-Experts (MoE) mechanism for selecting the most informative demonstrations under diverse question contexts. To further reduce noise in the retrieved subgraphs, we introduce a query-specific graph encoder that selectively attends to information most relevant to the query. Extensive experiments across multiple textual graph benchmarks show that MixDemo significantly outperforms existing methods.
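下面给出一个示意 MoE 门控选取演示样例(demonstrations)的玩具实现(打分方式为点积相关度的假设性替代,并非 MixDemo 的原机制):

```python
# Illustrative MoE-style demonstration selection: a gate scores each
# candidate demonstration against the query and the top-k are kept.
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def select_demonstrations(query_vec, demo_vecs, k=2):
    """Gate = dot-product relevance of each demo embedding to the query;
    return the indices of the k highest-gated demonstrations."""
    scores = [sum(q * d for q, d in zip(query_vec, dv)) for dv in demo_vecs]
    gates = softmax(scores)
    order = sorted(range(len(demo_vecs)), key=lambda i: gates[i], reverse=True)
    return order[:k]


demos = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7)]
picked = select_demonstrations((1.0, 0.2), demos, k=2)
```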
[IR-17] MDKeyChunker: Single-Call LLM Enrichment with Rolling Keys and Key-Based Restructuring for High-Accuracy RAG
【速读】:该论文旨在解决传统检索增强生成(Retrieval-Augmented Generation, RAG)管道中因固定大小分块(fixed-size chunking)导致的语义碎片化问题,以及多轮大语言模型(Large Language Model, LLM)调用带来的效率低下和文档级上下文丢失问题。其解决方案的关键在于提出MDKeyChunker,一个三阶段处理流程:首先基于Markdown文档结构(如标题、代码块、表格和列表)进行语义原子单位的结构感知分块;其次通过单次LLM调用同时提取七类元数据(包括标题、摘要、关键词、类型实体、假设性问题、语义键及摘要),并利用滚动语义键字典(rolling key dictionary)保持文档级上下文一致性;最后采用二进制打包(bin-packing)策略合并共享相同语义键的片段,实现相关内容的共置以提升检索效果。该设计显著减少了LLM调用次数,并以LLM原生语义匹配替代人工调参的评分机制,从而在实验中实现了高召回率与平均倒数排名(MRR)。
链接: https://arxiv.org/abs/2603.23533
作者: Bhavik Mangla
机构: Independent Research
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 13 pages, 4 figures, 7 tables, 2 algorithms. Code: this https URL
Abstract:RAG pipelines typically rely on fixed-size chunking, which ignores document structure, fragments semantic units across boundaries, and requires multiple LLM calls per chunk for metadata extraction. We present MDKeyChunker, a three-stage pipeline for Markdown documents that (1) performs structure-aware chunking treating headers, code blocks, tables, and lists as atomic units; (2) enriches each chunk via a single LLM call extracting title, summary, keywords, typed entities, hypothetical questions, and a semantic key, while propagating a rolling key dictionary to maintain document-level context; and (3) restructures chunks by merging those sharing the same semantic key via bin-packing, co-locating related content for retrieval. The single-call design extracts all seven metadata fields in one LLM invocation, eliminating the need for separate per-field extraction passes. Rolling key propagation replaces hand-tuned scoring with LLM-native semantic matching. An empirical evaluation on 30 queries over an 18-document Markdown corpus shows Config D (BM25 over structural chunks) achieves Recall@5=1.000 and MRR=0.911, while dense retrieval over the full pipeline (Config C) reaches Recall@5=0.867. MDKeyChunker is implemented in Python with four dependencies and supports any OpenAI-compatible endpoint.
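对于上文按语义键合并片段的重构阶段,可用如下贪心装箱(first-fit bin-packing)草图示意(字段名与装箱策略均为假设,并非 MDKeyChunker 原实现):

```python
# Hedged sketch of key-based restructuring: chunks sharing a semantic key
# are merged greedily under a token budget (first-fit bin-packing).
from collections import defaultdict


def merge_by_key(chunks, budget=100):
    """chunks: list of (semantic_key, token_count, text). Returns merged
    bins as [key, total_tokens, [texts]], each within the token budget."""
    by_key = defaultdict(list)
    for key, tokens, text in chunks:
        by_key[key].append((tokens, text))
    bins = []
    for key, items in by_key.items():
        for tokens, text in items:
            for b in bins:
                if b[0] == key and b[1] + tokens <= budget:
                    b[1] += tokens
                    b[2].append(text)
                    break
            else:  # no open bin with this key fits: start a new one
                bins.append([key, tokens, [text]])
    return bins


chunks = [("auth", 40, "c1"), ("db", 30, "c2"),
          ("auth", 50, "c3"), ("auth", 30, "c4")]
merged = merge_by_key(chunks, budget=100)
```

合并后相关内容共置于同一片段,便于检索时一次命中;预算上限保证合并结果仍能放入嵌入模型的上下文窗口。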
[IR-18] MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens
【速读】:该论文旨在解决大语言模型(LLM)在处理长期记忆时面临的瓶颈问题,即受限于全注意力机制的上下文长度通常仅为100万token以内,而现有方法如混合线性注意力、固定大小的记忆状态或外部存储(如RAG或代理系统)往往导致精度下降、延迟增加、无法动态修改记忆内容或缺乏端到端优化。其解决方案的关键在于提出Memory Sparse Attention (MSA)框架,通过两项核心技术实现高效且可扩展的记忆建模:一是可扩展的稀疏注意力机制,使训练和推理复杂度呈线性增长;二是文档级RoPE(Rotary Positional Encoding),确保在从16K到1亿token跨度下保持稳定性(误差小于9%)。此外,结合KV缓存压缩与Memory Parallel技术,可在2张A800 GPU上实现1亿token推理,并引入Memory Interleaving以支持跨分散记忆段的多跳推理。该方案首次实现了将记忆容量与推理能力解耦,为通用模型赋予终身尺度记忆提供了可扩展的基础。
链接: https://arxiv.org/abs/2603.23516
作者: Yu Chen,Runkai Chen,Sheng Yi,Xinda Zhao,Xiaohong Li,Jianjin Zhang,Jun Sun,Chuanrui Hu,Yunyun Han,Lidong Bing,Yafeng Deng,Tianqiao Chen
机构: Evermind; Shanda Group; Peking University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:Long-term memory is a cornerstone of human intelligence. Enabling AI to process lifetime-scale information remains a long-standing pursuit in the field. Due to the constraints of full-attention architectures, the effective context length of large language models (LLMs) is typically limited to 1M tokens. Existing approaches, such as hybrid linear attention, fixed-size memory states (e.g., RNNs), and external storage methods like RAG or agent systems, attempt to extend this limit. However, they often suffer from severe precision degradation and rapidly increasing latency as context length grows, an inability to dynamically modify memory content, or a lack of end-to-end optimization. These bottlenecks impede complex scenarios like large-corpus summarization, Digital Twins, and long-history agent reasoning, while limiting memory capacity and slowing inference. We present Memory Sparse Attention (MSA), an end-to-end trainable, efficient, and massively scalable memory model framework. Through core innovations including scalable sparse attention and document-wise RoPE, MSA achieves linear complexity in both training and inference while maintaining exceptional stability, exhibiting less than 9% degradation when scaling from 16K to 100M tokens. Furthermore, KV cache compression, combined with Memory Parallel, enables 100M-token inference on 2xA800 GPUs. We also propose Memory Interleaving to facilitate complex multi-hop reasoning across scattered memory segments. MSA significantly surpasses frontier LLMs, state-of-the-art RAG systems, and leading memory agents in long-context benchmarks. These results demonstrate that by decoupling memory capacity from reasoning, MSA provides a scalable foundation to endow general-purpose models with intrinsic, lifetime-scale memory. 
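关于 document-wise RoPE 的直觉,可用如下草图示意:位置编号在每个文档边界处重置,使旋转位置编码的相位保持在训练时见过的范围内(这只是对该设计思路的示意性理解,并非 MSA 的实际代码):

```python
# Speculative sketch: per-document position ids instead of one global
# index over a 100M-token context. Positions restart at each boundary.
def document_wise_positions(doc_lengths):
    """Return per-token position ids that reset at each document boundary."""
    positions = []
    for length in doc_lengths:
        positions.extend(range(length))
    return positions


# Three concatenated documents of lengths 3, 2, 4: global positions would
# run 0..8, while document-wise positions stay small per document.
pos = document_wise_positions([3, 2, 4])
```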
[IR-19] S-Path-RAG : Semantic-Aware Shortest-Path Retrieval Augmented Generation for Multi-Hop Knowledge Graph Question Answering
【速读】:该论文旨在解决大规模知识图谱(Knowledge Graph, KG)上的多跳问答(Multi-hop Question Answering, MQA)任务中检索效率低、路径语义感知不足以及缺乏可解释性的问题。其核心解决方案是提出S-Path-RAG框架,关键在于通过混合策略(加权k最短路径、束搜索与约束随机游走)枚举语义加权的候选路径,并结合可微分路径评分器、对比路径编码器和轻量级验证器,将精选路径的潜在表示以软混合形式通过交叉注意力注入语言模型;同时在神经苏格拉底图对话循环中实现基于诊断信息的动态图编辑或种子扩展,从而提升检索的拓扑感知能力与token效率,并保留路径级可解释痕迹用于调试与干预。
链接: https://arxiv.org/abs/2603.23512
作者: Rong Fu,Yemin Wang,Tianxiang Xu,Yongtai Liu,Weizhi Tang,Wangyu Wu,Xiaowen Ma,Simon Fong
机构: University of Macau (澳门大学); Xiamen University (厦门大学); Peking University (北京大学); Hanyang University (汉阳大学); University of Liverpool (利物浦大学); Zhejiang University (浙江大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:We present S-Path-RAG, a semantic-aware shortest-path Retrieval-Augmented Generation framework designed to improve multi-hop question answering over large knowledge graphs. S-Path-RAG departs from one-shot, text-heavy retrieval by enumerating bounded-length, semantically weighted candidate paths using a hybrid weighted k-shortest, beam, and constrained random-walk strategy, learning a differentiable path scorer together with a contrastive path encoder and lightweight verifier, and injecting a compact soft mixture of selected path latents into a language model via cross-attention. The system runs inside an iterative Neural-Socratic Graph Dialogue loop in which concise diagnostic messages produced by the language model are mapped to targeted graph edits or seed expansions, enabling adaptive retrieval when the model expresses uncertainty. This combination yields a retrieval mechanism that is both token-efficient and topology-aware while preserving interpretable path-level traces for diagnostics and intervention. We validate S-Path-RAG on standard multi-hop KGQA benchmarks and through ablations and diagnostic analyses. The results demonstrate consistent improvements in answer accuracy, evidence coverage, and end-to-end efficiency compared to strong graph- and LLM-based baselines. We further analyze trade-offs between semantic weighting, verifier filtering, and iterative updates, and report practical recommendations for deployment under constrained compute and token budgets.
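候选路径枚举中的加权最短路部分,可用如下基于 heapq 的 Dijkstra 玩具实现示意(图结构与语义权重均为假设,边权越小表示语义相关度越高,因此"最短"路径即最相关的证据链):

```python
# Toy semantically weighted shortest-path retrieval over a small knowledge
# graph. Edge weights encode inverse semantic relevance (illustrative only).
import heapq


def shortest_path(graph, start, goal):
    """Dijkstra over a dict {node: [(neighbor, weight), ...]};
    returns (total_cost, node_path) or (inf, []) if unreachable."""
    heap = [(0.0, start, [start])]
    seen = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == goal:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, w in graph.get(node, []):
            if nxt not in seen:
                heapq.heappush(heap, (cost + w, nxt, path + [nxt]))
    return float("inf"), []


kg = {
    "Einstein": [("Physics", 0.2), ("Germany", 0.9)],
    "Physics": [("Nobel Prize", 0.3)],
    "Germany": [("Nobel Prize", 0.8)],
}
cost, path = shortest_path(kg, "Einstein", "Nobel Prize")
```

论文在此之上叠加 k 最短路、束搜索与受约束随机游走以获得多样化候选,此处只示意单条最短路的求解。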
[IR-20] Fast and Faithful: Real-Time Verification for Long-Document Retrieval-Augmented Generation Systems
【速读】:该论文旨在解决生成式 AI (Generative AI) 在企业搜索和文档导向型助手等场景中,如何在低延迟约束下实现对长且复杂的源文档进行完整内容验证的问题。当前主流方法存在两大局限:一是大语言模型虽能处理长上下文但响应速度慢、成本高,难以用于交互式服务;二是轻量级分类器受限于短文本截断,常遗漏超出片段范围的关键证据。解决方案的关键在于设计并集成一个实时验证组件到生产级检索增强生成(Retrieval-Augmented Generation, RAG)流水线中,支持高达32K token的文档处理,并采用自适应推理策略动态平衡响应时间和验证覆盖率,从而显著提升对无依据回答的检测能力,优于传统基于分块的验证方式。
链接: https://arxiv.org/abs/2603.23508
作者: Xunzhuo Liu,Bowei He,Xue Liu,Haichen Zhang,Huamin Chen
机构: vLLM Semantic Router Project (vLLM语义路由器项目); MBZUAI (穆巴达拉人工智能研究所); McGill University (麦吉尔大学); AMD (超威半导体公司); Red Hat (红帽公司)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Retrieval-augmented generation (RAG) is increasingly deployed in enterprise search and document-centric assistants, where responses must be grounded in long and complex source materials. In practice, verifying that generated answers faithfully reflect retrieved documents is difficult: large language models can check long contexts but are too slow and costly for interactive services, while lightweight classifiers operate within strict context limits and frequently miss evidence outside truncated passages. We present the design of a real-time verification component integrated into a production RAG pipeline that enables full-document grounding under latency constraints. The system processes documents up to 32K tokens and employs adaptive inference strategies to balance response time and verification coverage across workloads. We describe the architectural decisions, operational trade-offs, and evaluation methodology used to deploy the verifier, and show that full-context verification substantially improves detection of unsupported responses compared with truncated validation. Our experience highlights when long-context verification is necessary, why chunk-based checking often fails in real documents, and how latency budgets shape model design. These findings provide practical guidance for practitioners building reliable large-scale retrieval-augmented applications. (Model, benchmark, and code: this https URL)
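下面用一个玩具例子说明为何截断式分块验证会漏掉边界之外的证据(以子串匹配代替真实的蕴含/验证模型,仅作示意):

```python
# Toy illustration: a claim supported only by text beyond the truncation
# limit is wrongly flagged as unsupported by a chunk-truncated verifier.
def supported(claim, context):
    """Stand-in verifier: is the claim literally stated in the context?"""
    return claim in context


document = ("Section 1: general background. " * 20 +
            "The warranty period is 24 months.")
claim = "The warranty period is 24 months."

full_ok = supported(claim, document)             # full-context verification
truncated_ok = supported(claim, document[:200])  # chunk-truncated check
```

真实验证器输出的是蕴含概率而非子串命中,但失败模式相同:证据落在截断窗口之外时,被截断的验证器必然漏检。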
[IR-21] KARMA: Knowledge-Action Regularized Multimodal Alignment for Personalized Search at Taobao
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在工业级个性化搜索任务中因“知识-行为鸿沟”(Knowledge–Action Gap)导致的语义退化问题,即直接微调LLMs以优化特定个性化行为目标(如下一物品预测)时,会破坏其预训练阶段获得的深层语义知识,引发注意力“汇聚”(attention sink)等现象,从而削弱模型的泛化能力。解决方案的关键在于提出KARMA(Knowledge–Action Regularized Multimodal Alignment)框架,通过引入仅训练阶段生效的语义重建正则项,协同优化两个互补目标:(i) 历史条件下的语义生成,锚定于LLM原生的下一个词分布;(ii) 嵌入条件下的语义重构,约束兴趣嵌入保持语义可恢复性,从而在不增加推理开销的前提下显著提升个性化搜索系统的点击率(CTR)和召回质量。
链接: https://arxiv.org/abs/2603.22779
作者: Zhi Sun,Wenming Zhang,Yi Wei,Liren Yu,Zhixuan Zhang,Dan Ou,Haihong Tang
机构: Taobao \ Tmall Group of Alibaba(淘宝\天猫集团阿里巴巴)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs) are equipped with profound semantic knowledge, making them a natural choice for injecting semantic generalization into personalized search systems. However, in practice we find that directly fine-tuning LLMs on industrial personalized tasks (e.g. next item prediction) often yields suboptimal results. We attribute this bottleneck to a critical Knowledge–Action Gap: the inherent conflict between preserving pre-trained semantic knowledge and aligning with specific personalized actions by discriminative objectives. Empirically, action-only training objectives induce Semantic Collapse, such as attention ``sinks’'. This degradation severely cripples the LLM’s generalization, failing to bring improvements to personalized search systems. We propose KARMA (Knowledge–Action Regularized Multimodal Alignment), a unified framework that treats semantic reconstruction as a train-only regularizer. KARMA optimizes a next-interest embedding for retrieval (Action) while enforcing semantic decodability (Knowledge) through two complementary objectives: (i) history-conditioned semantic generation, which anchors optimization to the LLM’s native next-token distribution, and (ii) embedding-conditioned semantic reconstruction, which constrains the interest embedding to remain semantically recoverable. On Taobao search system, KARMA mitigates semantic collapse (attention-sink analysis) and improves both action metrics and semantic fidelity. In ablations, semantic decodability yields up to +22.5 HR@200. With KARMA, we achieve +0.25 CTR AUC in ranking, +1.86 HR in pre-ranking and +2.51 HR in recalling. Deployed online with low inference overhead at ranking stage, KARMA drives +0.5% increase in Item Click. 
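KARMA 的"行为目标 + 训练期语义正则"思路可用如下标量草图示意(各损失均为占位标量,权重为假设;推理阶段只用行为头,因此正则项不增加线上开销。此为示意性理解,非原文实现):

```python
# Schematic sketch of knowledge-action regularization: the retrieval
# (action) objective is trained jointly with two train-only semantic
# reconstruction terms. All loss values here are placeholder scalars.
def karma_loss(action_loss, sem_gen_loss, sem_rec_loss, lam=0.5):
    """Action objective plus history-conditioned generation and
    embedding-conditioned reconstruction regularizers."""
    return action_loss + lam * (sem_gen_loss + sem_rec_loss)


loss = karma_loss(1.0, 0.4, 0.6, lam=0.5)
```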
人机交互
[HC-0] Vibe Coding XR: Accelerating AI XR Prototyping with XR Blocks and Gemini
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在扩展现实(Extended Reality, XR)领域应用中的关键瓶颈问题:即开发者在构建智能XR体验时,因复杂的游戏引擎和底层传感器集成而面临高门槛,导致原型开发效率低下。解决方案的关键在于提出XR Blocks框架与Vibe Coding XR工作流——前者通过模块化WebXR架构将空间计算复杂性抽象为以人为中心的高层原语;后者则利用大语言模型(Large Language Models, LLMs)实现从自然语言意图到可运行XR应用的端到端自动化转换,使创作者能在Web界面中仅用一分钟内完成交互式XR原型开发,从而显著降低技术壁垒并提升创作效率。
链接: https://arxiv.org/abs/2603.24591
作者: Ruofei Du,Benjamin Hersh,David Li,Nels Numan,Xun Qian,Yanhe Chen,Zhongyi Zhou,Xingyue Chen,Jiahao Ren,Robert Timothy Bettridge,Steve Toh,David Kim
机构: Google XR Labs(谷歌XR实验室)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:While large language models have accelerated software development through “vibe coding”, prototyping intelligent Extended Reality (XR) experiences remains inaccessible due to the friction of complex game engines and low-level sensor integration. To bridge this gap, we contribute XR Blocks, an open-source, modular WebXR framework that abstracts spatial computing complexities into high-level, human-centered primitives. Building upon this foundation, we present Vibe Coding XR, an end-to-end rapid prototyping workflow that leverages LLMs to translate natural language intent directly into functional XR software. Using a web-based interface, creators can transform high-level prompts (e.g., “create a dandelion that reacts to hand”) into interactive WebXR applications in under a minute. We provide a preliminary technical evaluation on a pilot dataset (VCXR60) alongside diverse application scenarios highlighting mixed-reality realism, multi-modal interaction, and generative AI integrations. By democratizing spatial software creation, this work empowers practitioners to bypass low-level hurdles and rapidly move from “idea to reality.” Code and live demos are available at this https URL and this https URL.
[HC-1] Robust Multilingual Text-to-Pictogram Mapping for Scalable Reading Rehabilitation
【速读】:该论文旨在解决患有特殊教育需求与障碍(Special Educational Needs and Disabilities, SEND)的儿童在阅读理解方面面临的重大挑战,尤其是因缺乏足够的一对一阅读支持而导致的学习障碍问题。解决方案的关键在于开发了一个多语言、基于人工智能(AI)的界面,能够自动为文本添加视觉支架(visual scaffolding),通过动态识别关键概念并将其映射到语境相关的图示符号(pictograms),从而增强跨语言的可读性与理解力。该系统在五种类型学差异显著的语言(英语、法语、意大利语、西班牙语和阿拉伯语)中进行了验证,结果显示其在图示覆盖率、视觉支架密度及专家临床评审中的语义适切性均表现优异,且延迟符合实时教育应用要求,证明了自动化多模态支架技术在提升神经多样性学习者可访问性方面的技术可行性与接受度。
链接: https://arxiv.org/abs/2603.24536
作者: Soufiane Jhilal,Martina Galletti
机构: 未知
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:
Abstract:Reading comprehension presents a significant challenge for children with Special Educational Needs and Disabilities (SEND), often requiring intensive one-on-one reading support. To assist therapists in scaling this support, we developed a multilingual, AI-powered interface that automatically enhances text with visual scaffolding. This system dynamically identifies key concepts and maps them to contextually relevant pictograms, supporting learners across languages. We evaluated the system across five typologically diverse languages (English, French, Italian, Spanish, and Arabic), through multilingual coverage analysis, expert clinical review by speech therapists and special education professionals, and latency assessment. Evaluation results indicate high pictogram coverage and visual scaffolding density across the five languages. Expert audits suggested that automatically selected pictograms were semantically appropriate, with combined correct and acceptable ratings exceeding 95% for the four European languages and approximately 90% for Arabic despite reduced pictogram repository coverage. System latency remained within interactive thresholds suitable for real-time educational use. These findings support the technical viability, semantic safety, and acceptability of automated multimodal scaffolding to improve accessibility for neurodiverse learners.
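核心的"关键概念 → 图示符号"映射与覆盖率计算,可用如下玩具草图示意(词表与分词均为假设性替代,真实系统依赖多语言概念识别与图示符号库):

```python
# Toy sketch of text-to-pictogram scaffolding: tokens are looked up in a
# pictogram repository and a coverage ratio is computed. Lexicon is fake.
PICTOGRAMS = {"dog": "pic_dog.png", "run": "pic_run.png", "park": "pic_park.png"}


def scaffold(tokens):
    """Attach a pictogram to each covered token; report coverage ratio."""
    mapped = [(t, PICTOGRAMS.get(t)) for t in tokens]
    covered = sum(1 for _, p in mapped if p is not None)
    coverage = covered / len(tokens) if tokens else 0.0
    return mapped, coverage


mapped, coverage = scaffold(["the", "dog", "run", "in", "park"])
```

论文评估的图示覆盖率即此类指标在真实语料与多语言图示库上的版本。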
[HC-2] Integrating Causal Machine Learning into Clinical Decision Support Systems: Insights from Literature and Practice
【速读】:该论文旨在解决当前临床决策支持系统(Clinical Decision Support Systems, CDSSs)普遍依赖相关性而非因果关系进行预测,导致其在实际临床应用中缺乏可解释性和治疗特异性的问题。解决方案的关键在于引入因果机器学习(Causal Machine Learning, CML),通过构建能够提供因果洞察的CDSS,并围绕临床医生使用场景设计人机协同界面:研究基于设计科学方法论,结合文献综述与医师访谈,提炼出八项实证驱动的设计需求、七条设计原则和九项具体功能特征,从而实现系统对临床工作流程的无缝集成、提升可信度与可用性,并促进人机协作;同时揭示了自动化、责任归属与监管之间的张力,强调需建立适应性的认证机制以支持基于机器学习的医疗产品落地。
链接: https://arxiv.org/abs/2603.24448
作者: Domenique Zipperling,Lukas Schmidt,Benedikt Hahn,Niklas Kühl,Steven Kimbrough
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Current clinical decision support systems (CDSSs) typically base their predictions on correlation, not causation. In recent years, causal machine learning (ML) has emerged as a promising way to improve decision-making with CDSSs by offering interpretable, treatment-specific reasoning. However, existing research often emphasizes model development rather than designing clinician-facing interfaces. To address this gap, we investigated how CDSSs based on causal ML should be designed to effectively support collaborative clinical decision-making. Using a design science research methodology, we conducted a structured literature review and interviewed experienced physicians. From these, we derived eight empirically grounded design requirements, developed seven design principles, and proposed nine practical design features. Our results establish guidance for designing CDSSs that deliver causal insights, integrate seamlessly into clinical workflows, and support trust, usability, and human-AI collaboration. We also reveal tensions around automation, responsibility, and regulation, highlighting the need for an adaptive certification process for ML-based medical products.
[HC-3] Gendered Prompting and LLM Code Review: How Gender Cues in the Prompt Shape Code Quality and Evaluation
【速读】:该论文旨在解决生成式 AI (Generative AI) 在编程工作流中应用时,性别化语言风格如何影响代码生成结果与代码评审决策的问题。其核心发现表明,尽管女性作者的提示语通常更间接且更具参与性,但这种差异并未导致功能正确性或静态代码质量上的显著差距;然而,在 LLM 代码评审环节中存在系统性偏差——模型对女性作者编写的代码给予更高的通过率,即便其质量与其他性别提示生成的代码相当。解决方案的关键在于识别出公平性风险主要源于 LLM 的评估阶段而非生成阶段,并强调需在自动化代码评审系统中引入更公平的评价机制以减少性别偏见。
链接: https://arxiv.org/abs/2603.24359
作者: Lynn Janzen,Üveys Eroglu,Dorothea Kolossa,Pia Knöferle,Sebastian Möller,Vera Schmitt,Veronika Solopova
机构: Technische Universität Berlin (柏林工业大学); Humboldt-Universität zu Berlin (洪堡大学)
类目: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)
备注:
Abstract:LLMs are increasingly embedded in programming workflows, from code generation to automated code review. Yet, how gendered communication styles interact with LLM-assisted programming and code review remains underexplored. We present a mixed-methods pilot study examining whether gender-related linguistic differences in prompts influence code generation outcomes and code review decisions. Across three complementary studies, we analyze (i) collected real-world coding prompts, (ii) a controlled user study, in which developers solve identical programming tasks with LLM assistance, and (iii) an LLM-based simulated evaluation framework that systematically varies gender-coded prompt styles and reviewer personas. We find that gender-related differences in prompting style are subtle but measurable, with female-authored prompts exhibiting more indirect and involved language, which does not translate into consistent gaps in functional correctness or static code quality. For LLM code review, in contrast, we observe systematic biases: on average, models approve female-authored code more often, despite comparable quality. Controlled experiments show that gender-coded prompt styles affect code length and maintainability, while reviewer behavior varies across models. Our findings suggest that fairness risks in LLM-assisted programming arise less from generation accuracy than from LLM evaluation, as LLMs are increasingly deployed as automated code reviewers.
[HC-4] A Neuro-Symbolic System for Interpretable Multimodal Physiological Signals Integration in Human Fatigue Detection
【速读】:该论文旨在解决疲劳分类任务中模型准确性与可解释性之间的矛盾问题,尤其是在需要安全关键应用的场景下,传统方法往往依赖于刚性手工规则或缺乏个体层面的对齐诊断能力。其解决方案的关键在于提出一种神经符号架构(neuro-symbolic architecture),通过注意力机制编码眼动追踪与功能近红外光谱(fNIRS)信号,学习四个可解释的生理概念:眼动动力学(oculomotor dynamics)、注视稳定性(gaze stability)、前额叶血流动力学(prefrontal hemodynamics)及多模态特征,并结合可微分近似推理规则(differentiable approximate reasoning rules),利用学习得到的权重和软阈值进行融合推理。该方法不仅在18名受试者上的留一被试者交叉验证中达到72.1% ± 12.3%的准确率,且能可视化各概念激活强度与规则触发程度,从而实现可审计、可解释的决策过程。
链接: https://arxiv.org/abs/2603.24358
作者: Mohammadreza Jamalifard,Yaxiong Lei,Parasto Azizinezhad,Javier Fumanal-Idocin,Javier Andreu-Perez
机构: 未知
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:We propose a neuro-symbolic architecture that learns four interpretable physiological concepts (oculomotor dynamics, gaze stability, prefrontal hemodynamics, and a multimodal concept) from eye-tracking and neural hemodynamic (functional near-infrared spectroscopy, fNIRS) windows using attention-based encoders, and combines them with differentiable approximate reasoning rules, using learned weights and soft thresholds, to address both rigid hand-crafted rules and the lack of subject-level alignment diagnostics. We apply this system to fatigue classification from multimodal physiological signals, a domain that requires models that are accurate and interpretable, with internal reasoning that can be inspected for safety-critical use. In leave-one-subject-out evaluation on 18 participants (560 samples), the method achieves 72.1% +/- 12.3% accuracy, comparable to tuned baselines while exposing concept activations and rule firing strengths. Ablations indicate gains from participant-specific calibration (+5.2 pp), a modest drop without the fNIRS concept (-1.2 pp), and slightly better performance with Lukasiewicz operators than product (+0.9 pp). We also introduce concept fidelity, an offline per-subject audit metric computed from held-out labels, which correlates strongly with per-subject accuracy (r=0.843, p < 0.0001).
[HC-5] Honey I shrunk the scientist – Evaluating 2D 3D and VR interfaces for navigating samples under the microscope
【速读】:该论文旨在解决在三维(3D)显微成像中,研究人员在微米尺度的三维样本中导航和定位感兴趣区域时所面临的操作繁琐问题。为提升探索效率与用户体验,研究对比了2D桌面、3D桌面和虚拟现实(Virtual Reality, VR)三种界面形式在任务执行速度、可用性和完成度方面的表现。解决方案的关键在于引入VR交互界面,实验结果表明,VR显著优于2D和3D桌面界面,在任务效率、可用性及用户接受度上均展现出明显优势,而3D桌面界面并未表现出相对于2D桌面的优势。
链接: https://arxiv.org/abs/2603.24337
作者: Jan Tiemann,Matthew McGinity,Ulrik Günther
机构: Helmholtz-Zentrum Dresden - Rossendorf(亥姆霍兹德累斯顿-罗森多夫研究中心); Technische Universität Dresden(德累斯顿工业大学); Center for Systems Biology Dresden(德累斯顿系统生物学中心); MPI-CBG(马克斯·普朗克分子细胞生物学与遗传学研究所); IXLAB(IX实验室)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:In contemporary biology and medicine, 3D microscopy is one of the most widely-used techniques for imaging and manipulation of various kinds of samples. Navigating such a micrometer-sized, 3-dimensional sample under the microscope – e.g. to find relevant imaging regions – can pose a tedious challenge for the experimenter. In this paper, we examine whether 2D desktop, 3D desktop, or Virtual Reality (VR) interfaces provide the best user experience and performance for the exploration of 3D samples. We invited 12 skilled microscope operators to perform two different exploration tasks in 2D, 3D and VR and compared all conditions in terms of speed, usability, and completion. Our results show a clear benefit when using VR – in terms of task efficiency, usability, and user acceptance. Intriguingly, while VR outperformed desktop 2D and 3D in all scenarios, 3D desktop did not outperform 2D desktop.
[HC-6] Human Factors in Detecting AI-Generated Portraits: Age Sex Device and Confidence
【速读】:该论文试图解决的问题是:随着生成式 AI (Generative AI) 技术的发展,由 ChatGPT-4o 和 Imagen 3 等模型生成的逼真人脸图像在社交媒体和新闻场景中广泛传播,人类个体对真实与合成人脸的辨别能力如何随时间、人口统计学特征(如年龄、性别)、设备使用环境(移动端 vs. PC)及主观信心变化。其解决方案的关键在于通过大规模在线实验(n=1,664)量化了人类检测AI生成肖像的能力,并揭示出该能力并非仅取决于图像真实性,而是受多因素交互影响——包括年龄相关下降趋势(尤其在移动端更显著)、性别差异(50–60岁女性表现更差)、设备类型(PC优于移动端)、自我报告的AI暴露程度和检测信心水平。其中,检测信心在解释年龄相关性能下降中起主导作用,表明这本质上是一个人类因素问题,而非单纯的技术对抗问题。
链接: https://arxiv.org/abs/2603.24048
作者: Sunwhi Kim(1),Sunyul Kim(2) ((1) Hwasung Medi-Science University, Dept. of Bio-Healthcare, South Korea, (2) Yonsei University, Graduate School of Engineering, Dept. of Artificial Intelligence, South Korea)
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 36 pages, 15 figures, 1 supplementary table. Project page: this https URL
Abstract:Generative AI now produces photorealistic portraits that circulate widely in social and newslike contexts. Human ability to distinguish real from synthetic faces is time-sensitive because image generators continue to improve while public familiarity with synthetic media also changes. Here, we provide a time-stamped snapshot of human ability to distinguish real from AI-generated portraits produced by models available in July 2025. In a large-scale web experiment conducted from August 2025 to January 2026, 1,664 participants aged 20-69 years (mobile n = 1,330; PC n = 334) completed a two-alternative forced-choice task (REAL vs AI). Each participant judged 20 trials sampled from a 210-image pool comprising real FFHQ photographs and AI-generated portraits from ChatGPT-4o and Imagen 3. Overall accuracy was high (mean 85.2%, median 90%) but varied across groups. PC participants outperformed mobile participants by 3.65 percentage points. Accuracy declined with age in both device cohorts and more steeply on mobile than on PC (-0.607 vs -0.230 percentage points per year). Self-rated AI-detection confidence and AI exposure were positively associated with accuracy and statistically accounted for part of the age-related decline, with confidence accounting for the larger share. In the mobile cohort, an age-related sex divergence emerged among participants in their 50s and 60s, with female participants performing worse. Trial-level reaction-time models showed that correct AI judgments were faster than correct real judgments, whereas incorrect AI judgments were slower than incorrect real judgments. ChatGPT-4o portraits were harder and slower to classify than Imagen 3 portraits and were associated with a steeper age-related decline in performance. These findings frame AI portrait detection as a human-factors problem shaped by age, sex, device context, and confidence, not image realism alone.
[HC-7] SemLayer: Semantic-aware Generative Segmentation and Layer Construction for Abstract Icons CVPR2026
【速读】:该论文旨在解决扁平化矢量图形(flattened vector art)中语义层结构丢失的问题,这限制了后续编辑、重风格化和动画等任务的实现。其核心挑战在于如何从单路径或复合路径的图形中恢复出可编辑的分层结构。解决方案的关键是提出SemLayer——一个基于视觉生成的流水线,首先生成颜色区分的表示以使不同语义组件在视觉上可分离,随后通过语义补全步骤重建每个部分的完整几何形状(包括被遮挡区域),最终将恢复的部件组装为带有推断遮挡关系的分层矢量表示,从而实现了对原始语义层次的重建与编辑可用性。
链接: https://arxiv.org/abs/2603.24039
作者: Haiyang Xu,Ronghuan Wu,Li-Yi Wei,Nanxuan Zhao,Chenxi Liu,Cuong Nguyen,Zhuowen Tu,Zhaowen Wang
机构: UC San Diego (加州大学圣地亚哥分校); Adobe Research (Adobe 研究院); City University of Hong Kong (香港城市大学); University of Toronto (多伦多大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
备注: Accepted to CVPR 2026
Abstract:Graphic icons are a cornerstone of modern design workflows, yet they are often distributed as flattened single-path or compound-path graphics, where the original semantic layering is lost. This absence of semantic decomposition hinders downstream tasks such as editing, restyling, and animation. We formalize this problem as semantic layer construction for flattened vector art and introduce SemLayer, a visual generation empowered pipeline that restores editable layered structures. Given an abstract icon, SemLayer first generates a chromatically differentiated representation in which distinct semantic components become visually separable. To recover the complete geometry of each part, including occluded regions, we then perform a semantic completion step that reconstructs coherent object-level shapes. Finally, the recovered parts are assembled into a layered vector representation with inferred occlusion relationships. Extensive qualitative comparisons and quantitative evaluations demonstrate the effectiveness of SemLayer, enabling editing workflows previously inapplicable to flattened vector graphics and establishing semantic layer reconstruction as a practical and valuable task. Project page: this https URL
[HC-8] Skewed Dual Normal Distribution Model: Predicting Touch Pointing Success Rates for Targets Near Screen Edges and Corners
【速读】:该论文旨在解决传统触控命中率预测模型在靠近屏幕边缘的目标上失效的问题,这类目标在实际交互设计中(如滚动界面)频繁出现。其关键解决方案是提出一种“偏斜双正态分布模型”(Skewed Dual Normal Distribution Model),该模型假设边缘会扭曲点击坐标的分布形态:当目标接近边缘时,分布峰值向边缘偏移且尾部延伸远离边缘;同时发现接触边缘反而提升命中率,揭示出“将目标与边缘一同点击”的用户策略。该模型能准确预测包括边缘邻近目标在内的多种场景下的命中率,具有良好的泛化能力。
链接: https://arxiv.org/abs/2603.23865
作者: Nobuhito Kasahara,Shota Yamanaka,Homei Miyashita
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Typical success-rate prediction models for tapping exclude targets near screen edges. However, design constraints often force such placements, and in scrollable user interfaces, any element can move close to the screen edges. In this work, we model how target-edge distance affects touch pointing accuracy. We propose the Skewed Dual Normal Distribution Model, which assumes the tap-coordinate distribution is skewed by a nearby edge. The results showed that as targets approached the edge, the distribution's peak shifted toward the edge, and its tail extended away. In contrast to prior reports, the success rate improved when the target touched the edge, suggesting a strategy of "tapping the target together with the edge." Our model predicts success rates across a wide range of conditions, including edge-adjacent targets. Through three experiments of horizontal, vertical, and 2D pointing, we demonstrated the generalizability and utility of our proposed model.
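The core idea above — predicting tap success as the probability mass of an edge-skewed tap-coordinate distribution falling inside the target — can be sketched with a generic 1-D skew-normal density. Note this is an illustrative sketch, not the paper's fitted model: the density family, parameterization, and numbers below are assumptions for demonstration.

```python
import math

def skew_normal_pdf(x, loc=0.0, scale=1.0, alpha=0.0):
    # Skew-normal density: (2/scale) * phi(z) * Phi(alpha*z), z = (x-loc)/scale.
    # alpha = 0 recovers an ordinary normal; alpha != 0 skews one tail,
    # as a nearby screen edge is assumed to do to tap coordinates.
    z = (x - loc) / scale
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cap = 0.5 * (1.0 + math.erf(alpha * z / math.sqrt(2.0)))
    return 2.0 / scale * phi * cap

def predicted_success_rate(lo, hi, loc, scale, alpha, steps=10_000):
    # Success rate = mass of the tap-coordinate distribution inside the
    # target extent [lo, hi]; midpoint rule suffices for a smooth density.
    dx = (hi - lo) / steps
    return dx * sum(
        skew_normal_pdf(lo + (i + 0.5) * dx, loc, scale, alpha)
        for i in range(steps)
    )
```

With alpha=0 this reduces to the familiar normal-distribution success-rate model (about 68.3% for a target spanning one standard deviation on either side of the mean tap point); a nonzero alpha shifts mass toward one side, mimicking the edge effect.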
[HC-9] Generative AI User Experience: Developing Human–AI Epistemic Partnership
【速读】:该论文旨在解决当前生成式 AI(Generative AI, GenAI)在教育场景中用户体验(User Experience, UX)研究的理论局限性问题。现有基于有用性、易用性和参与度等采纳导向构念的理论已不足以解释 GenAI 作为知识建构参与者所带来的复杂交互现象,如协商型权威(negotiated authority)、认知再分配(redistributed cognition)和责任张力(accountability tension)。为此,论文提出人类-人工智能认识论伙伴关系理论(Human–AI Epistemic Partnership Theory, HAEPT),其核心在于将 GenAI 的用户体验重构为一种动态的“认识论伙伴关系”,其中包含三个相互嵌套的契约:认识论契约(epistemic contract)、代理权契约(agency contract)和责任契约(accountability contract)。该理论的关键创新在于,将用户对 GenAI 的态度与行为视为在反复校准周期中不断调整的契约关系,从而统一解释信任与怀疑共存、合作关系模式重复出现等现象,而非孤立看待技术使用中的伦理或教学问题。
链接: https://arxiv.org/abs/2603.23863
作者: Xiaoming Zhai
机构: AI4STEM Education Center, University of Georgia (乔治亚大学)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Generative AI (GenAI) has rapidly entered education, yet its user experience is often explained through adoption-oriented constructs such as usefulness, ease of use, and engagement. We argue that these constructs are no longer sufficient because systems such as ChatGPT do not merely support learning tasks but also participate in knowledge construction. Existing theories cannot explain why GenAI frequently produces experiences characterized by negotiated authority, redistributed cognition, and accountability tension. To address this gap, this paper develops the Human–AI Epistemic Partnership Theory (HAEPT), explaining the GenAI user experience as a form of epistemic partnership that features a dynamic negotiation of three interlocking contracts: epistemic, agency, and accountability. We argue that findings on trust, over-reliance, academic integrity, teacher caution, and relational interaction about GenAI can be reinterpreted as tensions within these contracts rather than as isolated issues. Instead of holding a single, stable view of GenAI, users adjust how they relate to it over time through calibration cycles. These repeated interactions account for why trust and skepticism often coexist and for how partnership modes describe recurrent configurations of human–AI collaboration across tasks. To demonstrate the usefulness of HAEPT, we applied it to analyze the UX of collaborative learning with AI speakers and AI-facilitated scientific argumentation, illustrating different contract configurations.
[HC-10] General Intellectual Humility Is Malleable Through AI-Mediated Reflective Dialogue
【速读】:该论文旨在解决“普遍性认知谦逊(General Intellectual Humility, GIH)是否可通过干预手段实现提升”这一核心问题。当前学界普遍认为GIH是一种稳定的人格特质,难以通过外部干预改变,而本文通过实证研究挑战了这一观点。其解决方案的关键在于设计了一种结构化的对话式干预,结合分阶段的认知支架(cognitive scaffolding)与个性化苏格拉底式反思(personalized Socratic reflection),借助大语言模型(LLM)引导参与者从理解概念逐步过渡到应用、分析、评估并生成与自身相关的具体情境,从而深化对GIH的内化与实践。实验结果显示,该干预显著提升了GIH水平,且效果具有持久性与广泛适用性,不依赖于个体初始人格特征或政治立场。
链接: https://arxiv.org/abs/2603.23855
作者: Mohammad Ratul Mahjabin,Raiyan Abdul Baten
机构: University of South Florida (南佛罗里达大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:General intellectual humility (GIH) – the recognition that one’s beliefs may be fallible and revisable – is associated with improved reasoning, learning, and social discourse, yet is widely regarded as a stable trait resistant to intervention. We test whether GIH can be elevated through a conversational intervention that combines staged cognitive scaffolding with personalized Socratic reflection. In a randomized controlled experiment (N=400), participants engaged in a structured, LLM-mediated dialogue that progressed from conceptual understanding of intellectual humility to applying, analyzing, evaluating, and generating novel, self-relevant scenarios that instantiate it. Relative to a time-matched control, the intervention produced a systematic increase in GIH, reduced rank-order stability, and tripled the rate of reliable individual improvement. Crucially, these effects persisted over a two-week follow-up without detectable decay. The effects generalized across political affiliation and did not depend on baseline personality profile. These findings challenge the prevailing pessimism regarding the malleability of GIH and suggest that scaffolded, Socratic reflection delivered through structured dialogue can produce durable changes in general intellectual humility.
[HC-11] CodeExemplar: Example-Based Scaffolding for Introductory Programming in the GenAI Era
【速读】:该论文旨在解决生成式 AI(Generative AI)在编程教学中带来的双重挑战:一方面,学生需要及时获得帮助以克服学习障碍;另一方面,直接提供完整代码解决方案可能导致抄袭行为,并削弱学生的推理能力。其核心解决方案是提出“基于示例的支架教学”(example-based scaffolding),即利用 GenAI 生成与目标任务具有相同底层推理模式但情境不同的示例代码,从而促进类比迁移(analogical transfer),同时降低直接复制的风险。该方法的关键在于通过设计匹配认知结构的差异化示例,引导学生理解问题本质而非简单模仿代码实现。
链接: https://arxiv.org/abs/2603.23830
作者: Boxuan Ma,Shinichi Konomi
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Generative AI (GenAI) can generate working code with minimal effort, creating a tension in introductory programming: students need timely help, yet direct solutions invite copying and can short-circuit reasoning. To address this, we propose example-based scaffolding, where GenAI provides scaffold examples that match a target task’s underlying reasoning pattern but differ in contexts to support analogical transfer while reducing copying. We contribute a two-dimensional taxonomy, design guidelines, and CodeExemplar, a prototype integrated with auto-graded tasks, with initial formative feedback from a classroom pilot and instructor interviews.
[HC-12] Bridging the Interpretation Gap in Accessibility Testing: Empathetic and Legal-Aware Bug Report Generation via Large Language Models
【速读】:该论文试图解决现有移动应用无障碍测试工具在修复(remediation)环节效果有限的问题,核心原因在于其输出的低级技术性报告难以被非专业利益相关者(如产品经理和设计师)理解为真实用户伤害与合规风险。解决方案的关键在于提出一个名为HEAR(Human-cEntered Accessibility Reporting)的框架,通过语义切片和视觉定位重建UI上下文,动态注入与缺陷类型匹配的残障人群角色(disability-oriented personas),并进行多层推理以解释物理障碍、功能阻塞及法律合规问题,从而将原始漏洞报告转化为具同理心、面向利益相关者的叙事报告,显著提升感知共情、紧迫感、说服力和合规风险意识,同时认知负担增加极少。
链接: https://arxiv.org/abs/2603.23828
作者: Ryoya Koyama,Zhiyao Wang,Devi Karolita,Jialong Li,Kenji Tei
机构: Waseda University (早稻田大学); Institute of Science Tokyo (东京科学大学)
类目: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)
备注:
Abstract:Modern automated accessibility testing tools for mobile applications have significantly improved the detection of interface violations, yet their impact on remediation remains limited. A key reason is that existing tools typically produce low-level, technical outputs that are difficult for non-specialist stakeholders, such as product managers and designers, to interpret in terms of real user harm and compliance risk. In this paper, we present HEAR (Human-cEntered Accessibility Reporting), a framework that bridges this interpretation gap by transforming raw accessibility bug reports into empathetic, stakeholder-oriented narratives. Given the outputs of an existing accessibility testing tool, HEAR first reconstructs the UI context through semantic slicing and visual grounding, then dynamically injects disability-oriented personas matched to each violation type, and finally performs multi-layer reasoning to explain the physical barrier, functional blockage, and relevant legal or compliance concerns. We evaluate the framework on real-world accessibility issues collected from four popular Android applications and conduct a user study (N=12). The results show that HEAR generates factually grounded reports and substantially improves perceived empathy, urgency, persuasiveness, and awareness of legal risk compared with raw technical logs, while imposing little additional cognitive burden.
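The persona-injection step described above can be illustrated with a minimal sketch: look up a disability-oriented persona matched to the detected violation type and frame the raw finding as a stakeholder-facing narrative. The persona table, violation names, and wording below are hypothetical, not taken from HEAR itself.

```python
# Hypothetical mapping from violation type to a disability-oriented persona.
PERSONAS = {
    "missing_content_description": "a screen-reader user who is blind",
    "low_text_contrast": "a user with low vision",
    "small_touch_target": "a user with a motor impairment",
}

def narrate_violation(violation_type, widget_label):
    # Fall back to a generic persona when the violation type is unknown.
    persona = PERSONAS.get(violation_type, "a user with access needs")
    return (
        f"For {persona}, the '{widget_label}' element is a barrier: "
        f"the issue '{violation_type}' can block the task outright and "
        f"may carry legal or compliance risk."
    )

print(narrate_violation("low_text_contrast", "Checkout button"))
```

A real pipeline would, per the abstract, additionally ground the narrative in the reconstructed UI context and the relevant compliance rules; this sketch covers only the persona-selection and reframing step.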
[HC-13] Aesthetics of Robot-Mediated Applied Drama: A Case Study on REMind
【速读】:该论文旨在解决当前社会机器人在教育应用中普遍局限于解释性教学(explanation-based instruction)的局限性,探索一种新的交互模式——机器人媒介的应用戏剧(Robot-Mediated Applied Drama, RMAD),以支持儿童的社会情感学习(social-emotional learning)。其核心问题是如何在现有机器人表达能力有限的前提下,使机器人戏剧在情感和审美上具有吸引力。解决方案的关键在于:通过整合表演艺术的专业知识,从整体体验设计出发,而非仅依赖机器人自身的表达能力,来构建富有感染力的戏剧情境,从而实现情绪共鸣与教育目标的统一。
链接: https://arxiv.org/abs/2603.23816
作者: Elaheh Sanoubari,Alicia Pan,Keith Rebello,Neil Fernandes,Andrew Houston,Kerstin Dautenhahn
机构: 未知
类目: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注: 15 pages, 6 figures. Preprint submitted to the 18th International Conference on Social Robotics (ICSR 2026)
Abstract:Social robots are increasingly used in education, but most applications cast them as tutors offering explanation-based instruction. We explore an alternative: Robot-Mediated Applied Drama (RMAD), in which robots function as life-like puppets in interactive dramatic experiences designed to support reflection and social-emotional learning. This paper presents REMind, an anti-bullying robot role-play game that helps children rehearse bystander intervention and peer support. We focus on a central design challenge in RMAD: how to make robot drama emotionally and aesthetically engaging despite the limited expressive capacities of current robotic platforms. Through the development of REMind, we show how performing arts expertise informed this process, and argue that the aesthetics of robot drama arise from the coordinated design of the wider experience, not from robot expressivity alone.
[HC-14] A Reproducible Reality-to-VR Pipeline for Ecologically Valid Aging-in-Place Research
【速读】:该论文旨在解决虚拟现实(Virtual Reality, VR)在评估老年人日常生活活动能力(Instrumental Activities of Daily Living, IADLs)时因环境设计简化或保真度不足而导致生态效度下降的问题。其解决方案的关键在于构建一个可重复的“现实到VR”工作流:通过地面激光扫描(Terrestrial Laser Scanning, TLS)实现亚毫米级几何精度采集,利用Faro SCENE进行点云处理、SketchUp完成几何重构,并通过Unreal Engine 5的Datasmith插件与Lumen全局光照技术实现高保真视觉渲染,最终在保证90 Hz稳定帧率的同时支持实验变量的即时操控(如橱柜开闭状态切换),从而显著提升VR场景的真实性与实验灵活性,同时验证了该方法对老年参与者具有低眩晕风险且保持实验敏感性,为老龄化居家研究提供了可复现的技术基准。
链接: https://arxiv.org/abs/2603.23812
作者: Ibrahim Bilau,Stacie Smith,Abdurrahman Baru,Marwan Shagar,Brian Jones,Eunhwa Yang
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 28 pages, 5 figures, 2 tables
Abstract:Virtual reality (VR) has emerged as a promising tool for assessing instrumental activities of daily living (IADLs) in older adults. However, the ecological validity of these simulations is often compromised by simplified or low-fidelity environmental design that fails to elicit a genuine sense of presence. This paper documents a reproducible Reality-to-VR pipeline for creating a photorealistic environmental simulation to support a study on cognitive aging in place. The proposed workflow captured the as-built kitchen of the Aware Home building at Georgia Tech using Terrestrial Laser Scanning (TLS) for sub-millimeter geometric accuracy, followed by point cloud processing in Faro SCENE, geometric retopology in SketchUp, and integration into Unreal Engine 5 via Datasmith with Lumen global illumination for high visual fidelity. The pipeline achieved photorealistic rendering while maintaining a stable 90 Hz frame rate, a critical threshold for mitigating cybersickness in older populations. The environment also enables instantaneous manipulation of environmental variables, such as switching between closed cabinetry and open shelving, providing experimental flexibility impossible in physical settings. Participant validation with 17 older adults confirmed minimal cybersickness risk and preserved sensitivity to the experimental manipulation, supporting the pipeline’s feasibility for aging-in-place research and establishing a benchmark for future comparative studies.
[HC-15] AI Fortune-Teller: Juxtaposing Shaman and AI to Reveal Human Agency in the Age of AI
【速读】:该论文试图解决的问题是:在人工智能(AI)日益渗透人类决策过程的背景下,用户对AI建议的信任与接受度是否真正依赖于其技术透明性与准确性。解决方案的关键在于通过一个实验性的视频作品,让参与者误以为他们正在与一个职业咨询AI互动,而实际上这些回应源自韩国传统巫师(mudang)的占卜。研究发现,即便参与者得知建议并非来自AI而是来自传统仪式,他们对建议的态度并未发生改变,这揭示了人类对AI解释性与准确性的认知可能并不如预期般重要,进而引发对人类在AI时代中主体性与决策机制本质的反思——即人类依然以根本上人性化、混乱且不确定的方式生活和决策。
链接: https://arxiv.org/abs/2603.23811
作者: Soonho Kwon,Dong Whi Yoo,Younah Kang
机构: Georgia Institute of Technology (佐治亚理工学院); Indiana University Indianapolis (印第安纳大学印第安纳波利斯分校); Yonsei University (延世大学)
类目: Human-Computer Interaction (cs.HC)
备注: Disclaimer: This document is an unofficial commentary on AI Fortune-Teller by its creators. While the work was introduced and received an Honorary Mention at Prix Ars Electronica 2024, this document is not an officially published or affiliated record of the festival
Abstract:This speculative video piece showcases participants interacting with a career counseling AI agent, unaware that the responses were actually derived from the fortunetelling of a mudang (a Korean traditional shaman). Our work captures this deception and documents participants’ reactions, showcasing shifts in their initial perceptions of the agent’s advice following the reveal. Notably, even after learning that the advice came from a mudang rather than an AI, participants did not change their initial attitudes toward the advice they received. This raises questions about the perceived importance of AI’s explainability and accuracy. By juxtaposing scientific and pre-scientific approaches, we aim to provoke discussions on human agency in the age of AI. We argue that, regardless of AI’s advancements, we continue to navigate life in fundamentally human ways – wonderfully messy and uncertain.
[HC-16] Exploring Self-Tracking Practices of Older Adults with CVD to Inform the Design of LLM -Enabled Health Data Sensemaking
【速读】:该论文旨在解决老年心血管疾病(CVD)患者在使用可穿戴设备和移动健康应用进行自我管理时,面对海量健康数据感到困惑和压力的问题。研究发现,自追踪行为具有情感性、解释性和社会情境性,患者更倾向于基于身体感受和情感体验来解读数据,而非单纯依赖量化指标。解决方案的关键在于利用大语言模型(LLM)构建支持性数据意义建构系统,其核心设计方向包括:支持情感参与、强化患者自主性、承认具身经验,并在临床与社交场景中促进对话;同时为保障安全,需引入专家介入机制(expert-in-the-loop),以确保AI生成内容的准确性与可信度。这为将数据转化为有意义的健康叙事提供了可行路径,并拓展了人-数据交互与行为改变支持的设计边界。
链接: https://arxiv.org/abs/2603.23733
作者: Duosi Dai,Pavithren V S Pakianathan,Gunnar Treff,Mahdi Sareban,Jan David Smeddinck,Sanna Kuoppamäki
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 23 pages,4 figures, 3 tables
Abstract:Wearables and mobile health applications are increasingly adopted for self-management of chronic illnesses; yet the data feels overwhelming for older adults with cardiovascular disease (CVD). This study explores how they make sense of self-tracked data and identifies design opportunities for Large Language Model (LLM)-enabled support. We conducted a seven-day diary study and follow-up interviews with eight CVD patients aged 64-82. We identified six themes: navigating emotional complexity, owning health narratives, prioritizing bodily sensations, selective engagement with health metrics, negotiating socio-technical dynamics of sharing, and cautious optimism toward AI. Findings highlight that self-tracking is affective, interpretive, and socially situated. We outline design directions for LLM-enabled data sensemaking systems: supporting emotional engagement, reinforcing patient agency, acknowledging embodied experiences, and prompting dialogue in clinical and social contexts. To support safety, expert-in-the-loop mechanisms are essential. These directions articulate how LLMs can help translate data into narratives and carry implications for human-data interaction and behavior-change support.
[HC-17] Assessment Design in the AI Era: A Method for Identifying Items Functioning Differentially for Humans and Chatbots
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在教育评估中广泛应用所带来的挑战,即如何系统性地识别LLM与人类学习者在答题行为上的差异,从而指导评估设计以应对生成式AI(Generative AI)可能带来的滥用风险。其解决方案的关键在于引入基于项目反应理论的差异项目功能(Differential Item Functioning, DIF)分析方法,结合负向控制分析和项目总分相关性区分度分析,构建一种统计上严谨、可推广且具有教育测量学基础的方法框架,用以精准定位LLM与人类在任务维度上的能力差异区域,为提升评估的有效性、可靠性和公平性提供实证依据。
链接: https://arxiv.org/abs/2603.23682
作者: Licol Zeinfeld,Alona Strugatski,Ziva Bar-Dov,Ron Blonder,Shelley Rap,Giora Alexandron
机构: Weizmann Institute of Science (魏兹曼科学研究所)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid adoption of large language models (LLMs) in education raises profound challenges for assessment design. To adapt assessments to the presence of LLM-based tools, it is crucial to characterize the strengths and weaknesses of LLMs in a generalizable, valid and reliable manner. However, current LLM evaluations often rely on descriptive statistics derived from benchmarks, and little research applies theory-grounded measurement methods to characterize LLM capabilities relative to human learners in ways that directly support assessment design. Here, by combining educational data mining and psychometric theory, we introduce a statistically principled approach for identifying items on which humans and LLMs show systematic response differences, pinpointing where assessments may be most vulnerable to AI misuse, and which task dimensions make problems particularly easy or difficult for generative AI. The method is based on Differential Item Functioning (DIF) analysis – traditionally used to detect bias across demographic groups – together with negative control analysis and item-total correlation discrimination analysis. It is evaluated on responses from human learners and six leading chatbots (ChatGPT-4o & 5.2, Gemini 1.5 & 3 Pro, Claude 3.5 & 4.5 Sonnet) to two instruments: a high school chemistry diagnostic test and a university entrance exam. Subject-matter experts then analyzed DIF-flagged items to characterize task dimensions associated with chatbot over- or under-performance. Results show that DIF-informed analytics provide a robust framework for understanding where LLM and human capabilities diverge, and highlight their value for improving the design of valid, reliable, and fair assessment in the AI era.
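The abstract does not specify which DIF statistic the authors use, but the idea can be sketched with the classic Mantel-Haenszel common odds ratio: stratify respondents by total score, then compare an item's correct-response odds between the reference group (here, humans) and the focal group (here, chatbots). All data and group labels below are illustrative.

```python
def mantel_haenszel_or(strata):
    # Mantel-Haenszel common odds ratio across ability (total-score) strata.
    # Each stratum is a 2x2 table (a, b, c, d):
    #   a = reference group correct,  b = reference group incorrect,
    #   c = focal group correct,      d = focal group incorrect.
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Toy item: humans as the reference group, chatbots as the focal group,
# stratified into low- and high-score bands. An MH odds ratio far from 1
# flags the item as functioning differentially between the two groups.
item_strata = [
    (30, 10, 10, 30),  # low total-score stratum
    (40, 5, 20, 25),   # high total-score stratum
]
print(round(mantel_haenszel_or(item_strata), 2))
```

Here the odds ratio is well above 1, i.e. humans answer the item correctly far more often than chatbots at matched ability levels, which is the kind of item the paper's expert review would then examine for AI-resistant task dimensions.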
[HC-18] Augmented Reality Visualization for Musical Instrument Learning
【速读】:该论文旨在解决音乐乐器学习过程中可视化信息呈现不足的问题,以提升学习效率与体验。其解决方案的关键在于设计两种增强现实(Augmented Reality, AR)可视化方案:一是针对架子鼓(drum kit)设计简洁、可快速获取的编码信息,并通过投影设备展示;二是针对吉他设计两种显示模态——一种是作为增强镜面显示在屏幕上,另一种是通过光学透视式AR头显实现三维空间中的信息叠加。这两种方案均支持在乐器周围及三维空间中呈现辅助信息,从而增强学习者的操作反馈与认知理解。
链接: https://arxiv.org/abs/2603.23639
作者: Frank Heyen,Michael Sedlmair
机构: 未知
类目: Human-Computer Interaction (cs.HC); Graphics (cs.GR)
备注: Presented at the ISMIR 2022 Late-Breaking Demo Session, see this https URL
Abstract:We contribute two design studies for augmented reality visualizations that support learning musical instruments. First, we designed simple, glanceable encodings for drum kits, which we display through a projector. As the second instrument, we chose the guitar and designed visualizations to be displayed either on a screen as an augmented mirror or through an optical see-through AR headset. These modalities allow us to also show information around the instrument and in 3D. We evaluated our prototypes through case studies; the results demonstrate their general effectiveness and reveal design-related and technical limitations.
[HC-19] Supporting Music Education through Visualizations of MIDI Recordings IEEE-VIS2020
【速读】:该论文旨在解决音乐家在演奏分析中依赖听觉导致的局限性问题,即由于听觉的顺序特性,难以快速获得对单个或多个录音的整体把握。解决方案的关键在于提出多种可视化方法,以辅助识别演奏中的错误和风格差异;当前方法聚焦于节奏分析,并基于MIDI数据实现,从而将音频信息转化为可直观观察的图形表示,提升分析效率与准确性。
链接: https://arxiv.org/abs/2603.23631
作者: Frank Heyen,Michael Sedlmair
机构: VISUS, University of Stuttgart (斯图加特大学可视化研究所)
类目: Human-Computer Interaction (cs.HC); Graphics (cs.GR)
备注: Presented at the IEEE VIS 2020 Poster Session
Abstract:Musicians mostly have to rely on their ears when they want to analyze what they play, for example to detect errors. Since hearing is sequential, it is not possible to quickly grasp an overview over one or multiple recordings of a whole piece of music at once. We therefore propose various visualizations that allow analyzing errors and stylistic variance. Our current approach focuses on rhythm and uses MIDI data for simplicity.
计算机视觉
[CV-0] TAG: Target-Agnostic Guidance for Stable Object-Centric Inference in Vision-Language-Action Models
【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)策略在杂乱场景中因实例级定位失败而导致的可靠性下降问题,即策略虽能生成合理的抓取轨迹,但常因分心物体或外观相似性干扰而误抓非目标物体。解决方案的关键在于提出一种无需修改模型架构的推理时引导机制——目标无关引导(Target-Agnostic Guidance, TAG),其核心思想是通过对比原始观测与物体擦除观测下策略输出的差异,生成一个残差引导信号,以增强决策过程中对目标物体特征的依赖,从而减少由分心物和外观变化引起的偏差。
链接: https://arxiv.org/abs/2603.24584
作者: Jiaying Zhou,Zhihao Zhan,Ruifeng Zhai,Qinhan Lyu,Hao Liu,Keze Wang,Liang Lin,Guangrun Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Vision–Language–Action (VLA) policies have shown strong progress in mapping language instructions and visual observations to robotic actions, yet their reliability degrades in cluttered scenes with distractors. By analyzing failure cases, we find that many errors do not arise from infeasible motions, but from instance-level grounding failures: the policy often produces a plausible grasp trajectory that lands slightly off-target or even on the wrong object instance. To address this issue, we propose TAG (Target-Agnostic Guidance), a simple inference-time guidance mechanism that explicitly reduces distractor- and appearance-induced bias in VLA policies. Inspired by classifier-free guidance (CFG), TAG contrasts policy predictions under the original observation and an object-erased observation, and uses their difference as a residual steering signal that strengthens the influence of object evidence in the decision process. TAG does not require modifying the policy architecture and can be integrated with existing VLA policies with minimal training and inference changes. We evaluate TAG on standard manipulation benchmarks, including LIBERO, LIBERO-Plus, and VLABench, where it consistently improves robustness under clutter and reduces near-miss and wrong-object executions.
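TAG 的残差引导在形式上与 classifier-free guidance 同构,可以用几行代码示意(其中 policy 的接口与系数 w 均为本文举例的假设,并非论文原始实现):

```python
def tag_guidance(policy, obs, obs_erased, w=1.5):
    """TAG 推理时引导的极简示意:对比原始观测与目标物体被擦除的
    观测下的动作预测,以二者之差作为残差引导信号。"""
    a_orig = policy(obs)            # 原始观测下的动作预测
    a_erased = policy(obs_erased)   # 擦除目标物体后的动作预测
    # 残差 (a_orig - a_erased) 近似“目标物体证据”对决策的贡献,按 w 放大
    return a_erased + w * (a_orig - a_erased)
```

w>1 时目标物体证据被放大,w=1 退化为原始策略输出,与 CFG 的引导系数作用方式一致。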
[CV-1] Latent-WAM: Latent World Action Modeling for End-to-End Autonomous Driving
【速读】:该论文旨在解决基于世界模型的自动驾驶规划方法中存在的表征压缩不足、空间理解有限以及时间动态信息利用不充分的问题,这些问题在数据和计算资源受限条件下导致规划性能欠佳。解决方案的关键在于提出Latent-WAM框架,其核心由两个模块构成:一是空间感知的压缩世界编码器(Spatial-Aware Compressive World Encoder, SCWE),通过可学习查询从基础模型中提取几何知识,并将多视角图像压缩为紧凑的场景标记(scene tokens);二是动态潜在世界模型(Dynamic Latent World Model, DLWM),采用因果Transformer架构,基于历史视觉与运动表示自回归预测未来世界状态,从而实现对时空动态的高效建模。该方法在NAVSIM v2和HUGSIM基准上取得当前最优性能,且显著减少训练数据需求和模型参数量(仅104M)。
链接: https://arxiv.org/abs/2603.24581
作者: Linbo Wang,Yupeng Zheng,Qiang Chen,Shiwei Li,Yichen Zhang,Zebin Xing,Qichao Zhang,Xiang Li,Deheng Qian,Pengxuan Yang,Yihang Dong,Ce Hao,Xiaoqing Ye,Junyu Han,Yifeng Pan,Dongbin Zhao
机构: Chongqing Chang’an Technology Co., Ltd.(重庆长安科技有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:We introduce Latent-WAM, an efficient end-to-end autonomous driving framework that achieves strong trajectory planning through spatially-aware and dynamics-informed latent world representations. Existing world-model-based planners suffer from inadequately compressed representations, limited spatial understanding, and underutilized temporal dynamics, resulting in sub-optimal planning under constrained data and compute budgets. Latent-WAM addresses these limitations with two core modules: a Spatial-Aware Compressive World Encoder (SCWE) that distills geometric knowledge from a foundation model and compresses multi-view images into compact scene tokens via learnable queries, and a Dynamic Latent World Model (DLWM) that employs a causal Transformer to autoregressively predict future world status conditioned on historical visual and motion representations. Extensive experiments on NAVSIM v2 and HUGSIM demonstrate new state-of-the-art results: 89.3 EPDMS on NAVSIM v2 and 28.9 HD-Score on HUGSIM, surpassing the best prior perception-free method by 3.2 EPDMS with significantly less training data and a compact 104M-parameter model.
[CV-2] Vision-Language Models vs Human: Perceptual Image Quality Assessment
【速读】:该论文旨在解决如何利用视觉语言模型(Vision Language Models, VLMs)有效近似人类感知图像质量评估(Perceptual Image Quality Assessment, IQA)的问题,尤其在对比度、色彩丰富度和整体偏好三个维度上的表现。其解决方案的关键在于通过系统性地将六种VLMs(包括四种商业模型和两种开源模型)与人类心理物理数据进行对比,量化它们在不同属性上的对齐程度,并揭示模型内部一致性与人类感知一致性之间的非线性关系,从而为构建更可靠、可解释的自动化图像质量评估方法提供实证依据。
链接: https://arxiv.org/abs/2603.24578
作者: Imran Mehmood,Imad Ali Shah,Ming Ronnier Luo,Brian Deegan
机构: University of Galway (爱尔兰国立高威大学); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Psychophysical experiments remain the most reliable approach for perceptual image quality assessment (IQA), yet their cost and limited scalability encourage automated approaches. We investigate whether Vision Language Models (VLMs) can approximate human perceptual judgments across three image quality scales: contrast, colorfulness, and overall preference. Six VLMs (four proprietary and two open-weight models) are benchmarked against psychophysical data. This work presents a systematic benchmark of VLMs for perceptual IQA through comparison with human psychophysical data. The results reveal strong attribute-dependent variability: models with high human alignment for colorfulness (ρ up to 0.93) underperform on contrast, and vice versa. Attribute weighting analysis further shows that, when evaluating overall preference, most VLMs assign higher weights to colorfulness than to contrast, similar to the psychophysical data. Intra-model consistency analysis reveals a counterintuitive tradeoff: the most self-consistent models are not necessarily the most human-aligned, suggesting that response variability reflects sensitivity to scene-dependent perceptual cues. Furthermore, human-VLM agreement increases with perceptual separability, indicating VLMs are more reliable when stimulus differences are clearly expressed.
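论文用等级相关系数 ρ(Spearman)衡量 VLM 打分与人类心理物理数据的对齐程度。其计算可以简化示意如下(此实现不处理并列秩,正式评测建议直接使用 scipy.stats.spearmanr):

```python
import numpy as np

def spearman_rho(x, y):
    """简化版 Spearman 等级相关:先把两组分数转成秩,再对秩求 Pearson 相关。
    注意:未做并列秩的平均处理,仅作示意。"""
    rx = np.argsort(np.argsort(x)).astype(float)  # 秩(从 0 开始)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))
```

对单调一致的两组分数,该函数返回 1.0;完全反序则返回 -1.0。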
[CV-3] EndoVGGT: GNN-Enhanced Depth Estimation for Surgical 3D Reconstruction
【速读】:该论文旨在解决手术机器人感知中可变形软组织的高精度三维重建问题,尤其针对低纹理表面、镜面反射和器械遮挡导致几何连续性断裂的挑战。现有固定拓扑方法在处理此类复杂场景时性能受限。解决方案的关键在于提出一种以几何为中心的框架EndoVGGT,其核心创新是引入了变形感知图注意力(Deformation-aware Graph Attention, DeGAT)模块——该模块不依赖静态空间邻域,而是动态构建特征空间语义图,以捕捉相干组织区域间的长程相关性,从而在遮挡区域仍能稳健传播结构线索,强化全局一致性并提升非刚性形变恢复能力。实验表明,该方法在SCARED数据集上显著提升重建保真度(PSNR提高24.6%,SSIM提高9.1%),且具备零样本跨数据集泛化能力,验证了DeGAT学习到的几何先验具有领域无关性。
链接: https://arxiv.org/abs/2603.24577
作者: Falong Fan,Yi Xie,Arnis Lektauers,Bo Liu,Jerzy Rozenblit
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate 3D reconstruction of deformable soft tissues is essential for surgical robotic perception. However, low-texture surfaces, specular highlights, and instrument occlusions often fragment geometric continuity, posing a challenge for existing fixed-topology approaches. To address this, we propose EndoVGGT, a geometry-centric framework equipped with a Deformation-aware Graph Attention (DeGAT) module. Rather than using static spatial neighborhoods, DeGAT dynamically constructs feature-space semantic graphs to capture long-range correlations among coherent tissue regions. This enables robust propagation of structural cues across occlusions, enforcing global consistency and improving non-rigid deformation recovery. Extensive experiments on SCARED show that our method significantly improves fidelity, increasing PSNR by 24.6% and SSIM by 9.1% over prior state-of-the-art. Crucially, EndoVGGT exhibits strong zero-shot cross-dataset generalization to the unseen SCARED and EndoNeRF domains, confirming that DeGAT learns domain-agnostic geometric priors. These results highlight the efficacy of dynamic feature-space modeling for consistent surgical 3D reconstruction.
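DeGAT 不用固定空间邻域,而是按特征相似度动态建图。下面用 numpy 给出“特征空间 k 近邻图”构建的一个极简近似(k 与相似度度量均为假设,论文中的语义图构建细节以原文为准):

```python
import numpy as np

def feature_knn_graph(feats, k=2):
    """按余弦相似度为每个节点连接 k 个特征空间近邻,
    从而把相干组织区域之间的长程相关性纳入图注意力。"""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)         # 不把自身算作邻居
    idx = np.argsort(-sim, axis=1)[:, :k]  # 每行取 top-k 相似节点
    A = np.zeros(sim.shape)
    for i, nbrs in enumerate(idx):
        A[i, nbrs] = 1.0                   # 邻接矩阵(有向)
    return A
```

与基于像素坐标的邻域不同,被器械遮挡分隔的两块相似组织在这种图上仍然相邻,结构线索因此可以跨遮挡传播。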
[CV-4] Chameleon: Episodic Memory for Long-Horizon Robotic Manipulation
【速读】:该论文旨在解决机器人操作中因感知歧义(perceptual aliasing)导致的非马尔可夫决策问题,即相同观测可能源于不同的交互历史,从而影响动作选择的准确性。现有方法通常依赖语义压缩的轨迹和基于相似性的检索机制,但会丢失关键的细粒度感知线索,并召回与决策无关的相似片段。解决方案的关键在于提出Chameleon框架,其通过将几何对齐的多模态token写入记忆库以保留区分性上下文信息,并利用可微分的记忆栈实现目标导向的回溯检索,从而在感知混淆场景下提升决策可靠性和长程控制性能。
链接: https://arxiv.org/abs/2603.24576
作者: Xinying Guo,Chenxi Jiang,Hyun Bin Kim,Ying Sun,Yang Xiao,Yuhang Han,Jianfei Yang
机构: Nanyang Technological University (南洋理工大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at this https URL
Abstract:Robotic manipulation often requires memory: occlusion and state changes can make decision-time observations perceptually aliased, making action selection non-Markovian at the observation level because the same observation may arise from different interaction histories. Most embodied agents implement memory via semantically compressed traces and similarity-based retrieval, which discards disambiguating fine-grained perceptual cues and can return perceptually similar but decision-irrelevant episodes. Inspired by human episodic memory, we propose Chameleon, which writes geometry-grounded multimodal tokens to preserve disambiguating context and produces goal-directed recall through a differentiable memory stack. We also introduce Camo-Dataset, a real-robot UR5e dataset spanning episodic recall, spatial tracking, and sequential manipulation under perceptual aliasing. Across tasks, Chameleon consistently improves decision reliability and long-horizon control over strong baselines in perceptually confusable settings.
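Chameleon 的可微记忆栈在读出端可近似为标准的 softmax 注意力读操作(以下 query/keys/values 接口与温度 tau 均为示意假设,并非论文实现):

```python
import numpy as np

def memory_readout(query, keys, values, tau=1.0):
    """目标导向的可微记忆检索示意:用当前查询对记忆库 keys 打分,
    softmax 加权后汇聚对应 values,温度 tau 控制检索的尖锐程度。"""
    scores = keys @ query / tau
    w = np.exp(scores - scores.max())  # 数值稳定的 softmax
    w = w / w.sum()
    return w @ values
```

tau 趋小时读出趋近硬检索(只取最相关的记忆条目),tau 较大时则对多条记忆做软融合。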
[CV-5] VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models
【速读】:该论文旨在解决从“扁平化”的光栅图像(如PNG或JPEG)中自动恢复高保真度的可编辑矢量图形(Scalable Vector Graphics, SVG)的问题,这一过程在实际应用中因原始矢量源文件缺失而变得极为困难,且手动重建效率低下、成本高昂。解决方案的关键在于提出VFIG模型家族,其核心创新包括:构建包含66K高质量图-SVG配对的大规模数据集VFIG-DATA,涵盖真实论文图表与程序生成的复杂示意图;设计从粗到细的训练范式,先通过监督微调(Supervised Fine-Tuning, SFT)学习基础图形元素,再利用强化学习(Reinforcement Learning, RL)优化全局结构一致性、布局合理性和拓扑准确性;同时引入VFIG-BENCH评估基准,以新型指标衡量复杂图形的结构性完整性。该方法在开源模型中达到最先进性能,并在VLM-Judge评分上媲美GPT-5.2(得分0.829)。
链接: https://arxiv.org/abs/2603.24575
作者: Qijia He,Xunmei Liu,Hammaad Memon,Ziang Li,Zixian Ma,Jaemin Cho,Jason Ren,Daniel S Weld,Ranjay Krishna
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Scalable Vector Graphics (SVG) are an essential format for technical illustration and digital design, offering precise resolution independence and flexible semantic editability. In practice, however, original vector source files are frequently lost or inaccessible, leaving only “flat” rasterized versions (e.g., PNG or JPEG) that are difficult to modify or scale. Manually reconstructing these figures is a prohibitively labor-intensive process, requiring specialized expertise to recover the original geometric intent. To bridge this gap, we propose VFIG, a family of Vision-Language Models trained for complex and high-fidelity figure-to-SVG conversion. While this task is inherently data-driven, existing datasets are typically small-scale and lack the complexity of professional diagrams. We address this by introducing VFIG-DATA, a large-scale dataset of 66K high-quality figure-SVG pairs, curated from a diverse mix of real-world paper figures and procedurally generated diagrams. Recognizing that SVGs are composed of recurring primitives and hierarchical local structures, we introduce a coarse-to-fine training curriculum that begins with supervised fine-tuning (SFT) to learn atomic primitives and transitions to reinforcement learning (RL) refinement to optimize global diagram fidelity, layout consistency, and topological edge cases. Finally, we introduce VFIG-BENCH, a comprehensive evaluation suite with novel metrics designed to measure the structural integrity of complex figures. VFIG achieves state-of-the-art performance among open-source models and performs on par with GPT-5.2, achieving a VLM-Judge score of 0.829 on VFIG-BENCH.
[CV-6] Towards Training-Free Scene Text Editing CVPR2026
【速读】:该论文旨在解决场景文本编辑(scene text editing)中现有方法依赖特定任务训练或成对数据、导致可扩展性和适应性受限的问题。解决方案的关键在于提出一种无需训练的框架 TextFlow,其核心是融合注意力增强(Attention Boost, AttnBoost)与流形引导的视觉流建模(Flow Manifold Steering, FMS):FMS 通过建模字符与背景区域的视觉流来保持结构和风格一致性,AttnBoost 则借助注意力机制提升文本渲染质量;二者协同实现端到端的语义对齐与空间精细化调整,从而在不引入额外训练的情况下完成高质量、灵活的文本修改。
链接: https://arxiv.org/abs/2603.24571
作者: Yubo Li,Xugong Qin,Peng Zhang,Hailun Lin,Gangyan Zeng,Kexin Zhang
机构: Nanjing University of Science and Technology (南京理工大学); Chinese Academy of Sciences (中国科学院); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Scene text editing seeks to modify textual content in natural images while maintaining visual realism and semantic consistency. Existing methods often require task-specific training or paired data, limiting their scalability and adaptability. In this paper, we propose TextFlow, a training-free scene text editing framework that integrates the strengths of Attention Boost (AttnBoost) and Flow Manifold Steering (FMS) to enable flexible, high-fidelity text manipulation without additional training. Specifically, FMS preserves the structural and style consistency by modeling the visual flow of characters and background regions, while AttnBoost enhances the rendering of textual content through attention-based guidance. By jointly leveraging these complementary modules, our approach performs end-to-end text editing through semantic alignment and spatial refinement in a plug-and-play manner. Extensive experiments demonstrate that our framework achieves visual quality and text accuracy comparable to or superior to those of training-based counterparts, generalizing well across diverse scenes and languages. This study advances scene text editing toward a more efficient, generalizable, and training-free paradigm. Code is available at this https URL
[CV-7] Anti-I2V: Safeguarding your photos from malicious image-to-video generation CVPR2026
【速读】:该论文旨在解决基于扩散模型的人像视频生成技术可能被恶意利用的问题,即通过特定人物的照片和文本提示生成伪造视频,从而引发隐私泄露与虚假信息传播风险。针对现有防御方法主要面向图像生成模型且对Diffusion Transformer(DiT)类视频扩散模型(VDMs)效果有限的不足,本文提出Anti-I2V防御方案,其关键在于:在RGB空间之外引入L*a*b*颜色空间和频域双重扰动机制,增强对抗性扰动的鲁棒性并聚焦显著像素区域;同时识别去噪过程中语义特征最显著的网络层,设计针对性训练目标以最大化破坏视频时序一致性与生成保真度,从而实现对多种扩散架构(包括DiT)的有效防护。
链接: https://arxiv.org/abs/2603.24570
作者: Duc Vu,Anh Nguyen,Chi Tran,Anh Tran
机构: Qualcomm AI Research (Qualcomm AI Research)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2026 (Main Conference)
Abstract:Advances in diffusion-based video generation models, while significantly improving human animation, pose threats of misuse through the creation of fake videos from a specific person’s photo and text prompts. Recent efforts have focused on adversarial attacks that introduce crafted perturbations to protect images from diffusion-based models. However, most existing approaches target image generation, while relatively few explicitly address image-to-video diffusion models (VDMs), and most primarily focus on UNet-based architectures. Hence, their effectiveness against Diffusion Transformer (DiT) models remains largely under-explored, as these models demonstrate improved feature retention and stronger temporal consistency due to larger capacity and advanced attention mechanisms. In this work, we introduce Anti-I2V, a novel defense against malicious human image-to-video generation, applicable across diverse diffusion backbones. Instead of restricting noise updates to the RGB space, Anti-I2V operates in both the L*a*b* and frequency domains, improving robustness and concentrating on salient pixels. We then identify the network layers that capture the most distinct semantic features during the denoising process to design appropriate training objectives that maximize degradation of temporal coherence and generation fidelity. Through extensive validation, Anti-I2V demonstrates state-of-the-art defense performance against diverse video diffusion models, offering an effective solution to the problem.
[CV-8] POLY-SIM: Polyglot Speaker Identification with Missing Modality Grand Challenge 2026 Evaluation Plan ACM-MM2026
【速读】:该论文旨在解决多模态说话人识别系统在现实场景中面临的关键挑战:一是模态缺失问题(如视觉信息因遮挡、摄像头故障或隐私限制而不可用),二是跨语言条件下的性能下降问题(由于不同语言间发音和语义差异导致模型泛化能力受限)。解决方案的关键在于设计一个标准化的基准测试平台——POLY-SIM Grand Challenge 2026,通过构建包含缺失模态和跨语言场景的数据集、明确的任务定义与评估协议,并提供基线模型,推动开发能够有效利用不完整多模态输入且在多种语言下保持高性能的鲁棒识别方法。
链接: https://arxiv.org/abs/2603.24569
作者: Marta Moscati,Muhammad Saad Saeed,Marina Zanoni,Mubashir Noman,Rohan Kumar Das,Monorama Swain,Yufang Hou,Elisabeth Andre,Khalid Mahmood Malik,Markus Schedl,Shah Nawaz
机构: Johannes Kepler University Linz (约翰内斯·开普勒林茨大学); University of Michigan-Flint (密歇根大学弗林特分校); Sapienza University of Rome (罗马大学); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); Fortemedia Singapore (福瑞媒体新加坡); IT:U Interdisciplinary Transformation University Austria (奥地利跨学科转型大学); University of Augsburg (奥格斯堡大学); Human-centered AI Group, AI Lab, Linz Institute of Technology (以人为中心的人工智能小组,人工智能实验室,林茨技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Grand challenge at ACM MM 2026
Abstract:Multimodal speaker identification systems typically assume the availability of complete and homogeneous audio-visual modalities during both training and testing. However, in real-world applications, such assumptions often do not hold. Visual information may be missing due to occlusions, camera failures, or privacy constraints, while multilingual speakers introduce additional complexity due to linguistic variability across languages. These challenges significantly affect the robustness and generalization of multimodal speaker identification systems. The POLY-SIM Grand Challenge 2026 aims to advance research in multimodal speaker identification under missing-modality and cross-lingual conditions. Specifically, the Grand Challenge encourages the development of robust methods that can effectively leverage incomplete multimodal inputs while maintaining strong performance across different languages. This report presents the design and organization of the POLY-SIM Grand Challenge 2026, including the dataset, task formulation, evaluation protocol, and baseline model. By providing a standardized benchmark and evaluation framework, the challenge aims to foster progress toward more robust and practical multimodal speaker identification systems.
[CV-9] LensWalk: Agentic Video Understanding by Planning How You See in Videos CVPR2026
【速读】:该论文旨在解决视频理解中推理与感知之间存在的固有断层问题,即现有方法依赖静态预处理信息,无法在理解过程中主动获取原始视频证据。其解决方案的关键在于提出LensWalk框架,该框架通过构建一个紧密的“推理-规划-观察”循环,使大型语言模型(Large Language Model)能够动态控制自身的视觉观测行为,包括每一步的时序范围和采样密度。借助基于视觉-语言模型(Vision-Language Model)的多功能工具集,代理可执行广域扫描、聚焦特定片段提取事实以及整合多时刻证据进行整体验证,从而实现按需、渐进式地收集证据以支持其链式思维演化。此设计无需模型微调即可显著提升多个基准测试上的准确率,尤其在长视频任务中表现突出。
链接: https://arxiv.org/abs/2603.24558
作者: Keliang Li,Yansong Li,Hongze Shen,Mengdi Liu,Hong Chang,Shiguang Shan
机构: Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所); Peng Cheng Laboratory(鹏城实验室); College of Computer Science and Electronic Engineering, Hunan University(湖南大学计算机与电子工程学院); University of Chinese Academy of Sciences(中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: To be published in CVPR 2026
Abstract:The dense, temporal nature of video presents a profound challenge for automated analysis. Despite the use of powerful Vision-Language Models, prevailing methods for video understanding are limited by the inherent disconnect between reasoning and perception: they rely on static, pre-processed information and cannot actively seek raw evidence from video as their understanding evolves. To address this, we introduce LensWalk, a flexible agentic framework that empowers a Large Language Model reasoner to control its own visual observation actively. LensWalk establishes a tight reason-plan-observe loop where the agent dynamically specifies, at each step, the temporal scope and sampling density of the video it observes. Using a suite of versatile, Vision-Language Model based tools parameterized by these specifications, the agent can perform broad scans for cues, focus on specific segments for fact extraction, and stitch evidence from multiple moments for holistic verification. This design allows for progressive, on-demand evidence gathering that directly serves the agent’s evolving chain of thought. Without requiring any model fine-tuning, LensWalk delivers substantial, plug-and-play performance gains on multiple model recipes, boosting their accuracy by over 5% on challenging long-video benchmarks like LVBench and Video-MME. Our analysis reveals that enabling an agent to control how it sees is key to unlocking more accurate, robust, and interpretable video reasoning.
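LensWalk 的 reason-plan-observe 循环可以用一个极简的代理骨架示意(reasoner、observe 等接口与字段名均为假设;论文中 observe 对应一组基于 VLM 的工具):

```python
def lenswalk_loop(question, reasoner, observe, max_steps=8):
    """reason-plan-observe 循环示意:reasoner 或者给出最终答案,
    或者给出下一次观测的时间范围与采样密度 (start, end, fps);
    observe 按该规格返回对应视频片段的文字描述。"""
    history = []
    for _ in range(max_steps):
        plan = reasoner(question, history)
        if plan.get("answer") is not None:  # 推理器认为证据已足够
            return plan["answer"]
        obs = observe(plan["start"], plan["end"], plan["fps"])
        history.append(obs)                 # 按需、渐进式积累证据
    return None
```

关键在于每一步的观测规格(范围、密度)由推理器动态决定,而不是对视频做一次性的静态预处理。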
[CV-10] The role of spatial context and multitask learning in the detection of organic and conventional farming systems based on Sentinel-2 time series
【速读】:该论文旨在解决如何利用多光谱遥感数据准确区分有机农业与常规农业系统的问题,尤其是在复杂多样化的农业区域中实现空间显式的分类。其解决方案的关键在于采用基于时序Sentinel-2数据的Vision Transformer模型(具体为Temporo-Spatial Vision Transformer, TSViT架构),通过多任务学习机制联合识别作物类型和耕作系统,并系统评估空间上下文信息(通过调整图像补丁大小控制)对分类性能的影响。研究发现,尽管整体上可行,但分类效果因作物类型而异,且引入更广泛的空间上下文能显著提升两类任务的准确性,而联合学习作物类型对有机/常规系统的判别帮助有限。
链接: https://arxiv.org/abs/2603.24552
作者: Jan Hemmerling,Marcel Schwieder,Philippe Rufin,Leon-Friedrich Thomas,Mirela Tulbure,Patrick Hostert,Stefan Erasmi
机构: 1. University of Potsdam (波茨坦大学); 2. German Aerospace Center (德国航空航天中心); 3. University of Toulouse (图卢兹大学); 4. Max Planck Institute for Human Cognitive and Brain Sciences (马克斯·普朗克人类认知与脑科学研究所); 5. University of Bucharest (布加勒斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Organic farming is a key element in achieving more sustainable agriculture. For a better understanding of the development and impact of organic farming, comprehensive, spatially explicit information is needed. This study presents an approach for the discrimination of organic and conventional farming systems using intra-annual Sentinel-2 time series. In addition, it examines two factors influencing this discrimination: the joint learning of crop type information in a concurrent task and the role of spatial context. A Vision Transformer model based on the Temporo-Spatial Vision Transformer (TSViT) architecture was used to construct a classification model for the two farming systems. The model was extended for simultaneous learning of the crop type, creating a multitask learning setting. By varying the patch size presented to the model, we tested the influence of spatial context on the classification accuracy of both tasks. We show that discrimination between organic and conventional farming systems using multispectral remote sensing data is feasible. However, classification performance varies substantially across crop types. For several crops, such as winter rye, winter wheat, and winter oat, F1 scores of 0.8 or higher can be achieved. In contrast, other agricultural land use classes, such as permanent grassland, orchards, grapevines, and hops, cannot be reliably distinguished, with F1 scores for the organic management class of 0.4 or lower. Joint learning of farming system and crop type provides only limited additional benefits over single-task learning. In contrast, incorporating wider spatial context improves the performance of both farming system and crop type classification. Overall, we demonstrate that a classification of agricultural farming systems is possible in a diverse agricultural region using multispectral remote sensing data.
[CV-11] SEGAR: Selective Enhancement for Generative Augmented Reality
【速读】:该论文旨在解决增强现实(AR)应用中实时渲染延迟与视觉一致性难以兼顾的问题,尤其在需要动态插入虚拟内容的场景下,传统逐帧渲染方式易造成计算瓶颈。其解决方案的关键在于构建一个名为SEGAR的生成式世界模型框架,该框架结合基于扩散模型的世界建模与选择性修正阶段:首先由扩散模型预测包含特定区域编辑的未来图像序列,保持其他区域不变;随后通过修正阶段对安全关键区域进行真实世界观测对齐,同时保留非关键区域的预期增强效果。这一机制实现了未来帧的提前生成、缓存及按需修正,为生成式世界模型作为实用AR基础设施提供了初步实现路径。
链接: https://arxiv.org/abs/2603.24541
作者: Fanjun Bu,Chenyang Yuan,Hiroshi Yasuda
机构: Cornell University (康奈尔大学); Toyota Research Institute (丰田研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Generative world models offer a compelling foundation for augmented-reality (AR) applications: by predicting future image sequences that incorporate deliberate visual edits, they enable temporally coherent, augmented future frames that can be computed ahead of time and cached, avoiding per-frame rendering from scratch in real time. In this work, we present SEGAR, a preliminary framework that combines a diffusion-based world model with a selective correction stage to support this vision. The world model generates augmented future frames with region-specific edits while preserving others, and the correction stage subsequently aligns safety-critical regions with real-world observations while preserving intended augmentations elsewhere. We demonstrate this pipeline in driving scenarios as a representative setting where semantic region structure is well defined and real-world feedback is readily available. We view this as an early step toward generative world models as practical AR infrastructure, where future frames can be generated, cached, and selectively corrected on demand.
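SEGAR 的“选择性修正”在像素层面可示意为安全掩码下的逐元素混合(generated、observed、safety_mask 均为示意变量;实际系统在语义区域粒度上操作):

```python
import numpy as np

def selective_correct(generated, observed, safety_mask):
    """选择性修正示意:安全关键区域(mask=1)用真实观测覆盖,
    其余区域(mask=0)保留世界模型生成的增强内容。
    三个数组形状一致,mask 取值在 [0, 1]。"""
    return safety_mask * observed + (1.0 - safety_mask) * generated
```

由于未来帧可提前生成并缓存,运行时只需对安全关键区域做这一轻量修正,而无需逐帧从零渲染。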
[CV-12] CliPPER: Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition
【速读】:该论文旨在解决长时序术中手术视频中由于标注数据稀缺且下游任务对时间精度要求高而导致的细粒度事件识别难题。解决方案的关键在于提出了一种名为CliPPER(Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition)的视频-语言预训练框架,其核心创新包括:1)引入上下文感知的视频-文本对比学习(VTC_CTX)和片段顺序预测(COP)两项预训练目标,以增强局部视频理解中的时序与语境依赖性;2)通过同一手术视频内视频-文本匹配的循环一致性对齐机制,强化双向一致性以提升整体表征连贯性;3)设计更精细的帧-文本匹配(FTM)对齐损失,优化视频帧与文本描述之间的对齐精度。这些策略共同提升了模型在零样本场景下对手术阶段、步骤、器械及三元组事件的识别性能。
链接: https://arxiv.org/abs/2603.24539
作者: Florian Stilz,Vinkle Srivastav,Nassir Navab,Nicolas Padoy
机构: University of Strasbourg, CNRS, INSERM, ICube, UMR7357, France; IHU Strasbourg, France; Technical University of Munich, Germany
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Video-language foundation models have proven to be highly effective in zero-shot applications across a wide range of tasks. A particularly challenging area is the intraoperative surgical procedure domain, where labeled data is scarce, and precise temporal understanding is often required for complex downstream tasks. To address this challenge, we introduce CliPPER (Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition), a novel video-language pretraining framework trained on surgical lecture videos. Our method is designed for fine-grained temporal video-text recognition and introduces several novel pretraining strategies to improve multimodal alignment in long-form surgical videos. Specifically, we propose Contextual Video-Text Contrastive Learning (VTC_CTX) and Clip Order Prediction (COP) pretraining objectives, both of which leverage temporal and contextual dependencies to enhance local video understanding. In addition, we incorporate a Cycle-Consistency Alignment over video-text matches within the same surgical video to enforce bidirectional consistency and improve overall representation coherence. Moreover, we introduce a more refined alignment loss, Frame-Text Matching (FTM), to improve the alignment between video frames and text. As a result, our model establishes a new state-of-the-art across multiple public surgical benchmarks, including zero-shot recognition of phases, steps, instruments, and triplets. The source code and pretraining captions can be found at this https URL.
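CliPPER 的 VTC_CTX 建立在标准视频-文本对比学习之上。下面给出标准 VTC(InfoNCE)损失的 numpy 示意,配对的 (video_i, text_i) 为正样本、batch 内其余为负样本(这只是基础形式,并非论文的上下文感知变体本身):

```python
import numpy as np

def vtc_loss(video_emb, text_emb, tau=0.07):
    """视频→文本方向的 InfoNCE 损失:余弦相似度除以温度 tau 作 logits,
    对每个视频,其配对文本应获得最高的 softmax 概率。"""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # 数值稳定
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_p)))               # 对角线为正样本
```

配对嵌入越对齐,损失越小;论文在此基础上进一步利用同一手术视频内的时序上下文与循环一致性约束。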
[CV-13] UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience
【速读】:该论文旨在解决当前基于多模态大语言模型(Multimodal Large Language Models, MLLMs)的自主移动GUI代理在长时GUI任务中面临的两大挑战:一是从失败轨迹中学习效率低下,二是稀疏奖励下信用分配不明确。为此,作者提出UI-Voyager,一种两阶段自进化移动GUI代理方法。其核心创新在于:第一阶段采用拒绝微调(Rejection Fine-Tuning, RFT),实现数据与模型的全自动协同演化;第二阶段引入组相对自蒸馏(Group Relative Self-Distillation, GRSD),通过识别群体回放中的关键分叉点,从成功轨迹中构建密集的步骤级监督信号以修正失败轨迹。该方案显著提升了自动化任务的成功率(在AndroidWorld上达到81.0% Pass@1),超越人类水平,并减少了对人工标注数据的依赖。
链接: https://arxiv.org/abs/2603.24533
作者: Zichuan Lin,Feiyu Liu,Yijun Yang,Jiafei Lyu,Yiming Gao,Yicheng Liu,Zhicong Lu,Yangbin Yu,Mingyu Yang,Junyou Li,Deheng Ye,Jie Jiang
机构: Tencent Hunyuan(腾讯混元)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Code and models are available at this https URL
Abstract:Autonomous mobile GUI agents have attracted increasing attention along with the advancement of Multimodal Large Language Models (MLLMs). However, existing methods still suffer from inefficient learning from failed trajectories and ambiguous credit assignment under sparse rewards for long-horizon GUI tasks. To that end, we propose UI-Voyager, a novel two-stage self-evolving mobile GUI agent. In the first stage, we employ Rejection Fine-Tuning (RFT), which enables the continuous co-evolution of data and models in a fully autonomous loop. The second stage introduces Group Relative Self-Distillation (GRSD), which identifies critical fork points in group rollouts and constructs dense step-level supervision from successful trajectories to correct failed ones. Extensive experiments on AndroidWorld show that our 4B model achieves an 81.0% Pass@1 success rate, outperforming numerous recent baselines and exceeding human-level performance. Ablation and case studies further verify the effectiveness of GRSD. Our method represents a significant leap toward efficient, self-evolving, and high-performance mobile GUI automation without expensive manual data annotation.
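RFT 的“拒绝筛选”与 GRSD 的“关键分叉点”思想可用极简代码示意(is_success 与轨迹的动作序列表示均为假设;GRSD 的实际分叉点识别基于群体回放,远比逐步比对复杂):

```python
def rejection_filter(trajectories, is_success):
    """RFT 阶段的数据筛选示意:只保留 rollout 中成功的轨迹
    作为下一轮微调数据,使数据与模型在闭环中共同演化。"""
    return [t for t in trajectories if is_success(t)]

def fork_point(success_traj, failed_traj):
    """GRSD 分叉点的一个极简近似:找到失败轨迹与成功轨迹
    第一次动作不一致的步骤;该步起的成功动作可用作
    密集的步骤级监督来修正失败轨迹。"""
    for i, (a, b) in enumerate(zip(success_traj, failed_traj)):
        if a != b:
            return i
    return min(len(success_traj), len(failed_traj))
```

相比只用稀疏的最终成败奖励,步骤级监督把信用明确分配到分叉之后的动作上。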
[CV-14] Cross-Modal Prototype Alignment and Mixing for Training-Free Few-Shot Classification
【速读】:该论文旨在解决基于CLIP的少样本图像分类中,如何有效融合文本和图像原型以提升分类性能的问题。其核心挑战在于图像原型可能引入与任务无关的背景或上下文噪声,从而影响分类准确性。解决方案的关键在于:首先通过将图像原型投影到语义文本嵌入空间的主方向上,构建一个与文本对齐的语义图像子空间,从而过滤掉无关信息;其次,在跨模态对齐较差的数据集上,进一步利用类别协方差建模图像特征的各向异性,结合文本对齐混合原型分类器与图像特定的线性判别分析(LDA)分类器,实现更优的少样本分类效果。
链接: https://arxiv.org/abs/2603.24528
作者: Dipam Goswami,Simone Magistri,Gido M. van de Ven,Bartłomiej Twardowski,Andrew D. Bagdanov,Tinne Tuytelaars,Joost van de Weijer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint
Abstract:Vision-language models (VLMs) like CLIP are trained with the objective of aligning text and image pairs. To improve CLIP-based few-shot image classification, recent works have observed that, along with text embeddings, image embeddings from the training set are an important source of information. In this work we investigate the impact of directly mixing image and text prototypes for few-shot classification and analyze this from a bias-variance perspective. We show that mixing prototypes acts like a shrinkage estimator. Although mixed prototypes improve classification performance, the image prototypes still add some noise in the form of instance-specific background or context information. In order to capture only information from the image space relevant to the given classification task, we propose projecting image prototypes onto the principal directions of the semantic text embedding space to obtain a text-aligned semantic image subspace. These text-aligned image prototypes, when mixed with text embeddings, further improve classification. However, for downstream datasets with poor cross-modal alignment in CLIP, semantic alignment might be suboptimal. We show that the image subspace can still be leveraged by modeling the anisotropy using class covariances. We demonstrate that combining a text-aligned mixed prototype classifier and an image-specific LDA classifier outperforms existing methods across few-shot classification benchmarks.
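原型混合与“文本对齐图像子空间”投影的流程可用 numpy 粗略示意(alpha、k 为假设超参数;论文完整方法还包含基于类别协方差的 LDA 分类器等组件):

```python
import numpy as np

def mixed_prototypes(text_protos, img_protos, alpha=0.5, k=2):
    """1) 对文本嵌入(形状 (C, d))做 SVD,取前 k 个主方向张成语义子空间;
    2) 将图像原型投影到该子空间,滤掉实例相关的背景/上下文信息;
    3) 按 alpha 与文本原型线性混合(起 shrinkage 估计的作用),再归一化。"""
    Vt = np.linalg.svd(text_protos - text_protos.mean(0), full_matrices=False)[2]
    P = Vt[:k]                            # (k, d) 子空间基
    img_aligned = img_protos @ P.T @ P    # 投影后映回 d 维
    mixed = alpha * text_protos + (1.0 - alpha) * img_aligned
    return mixed / np.linalg.norm(mixed, axis=1, keepdims=True)
```

alpha=1 时退化为纯文本原型分类器,alpha 变小则逐步引入(已过滤的)图像证据。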
[CV-15] Toward Physically Consistent Driving Video World Models under Challenging Trajectories
【速读】:该论文旨在解决当前视频生成模型在自动驾驶仿真中对挑战性或反事实轨迹(如由模拟器或规划系统生成的不完美轨迹)条件下的物理一致性不足问题,这类轨迹常导致生成视频出现严重物理不一致性和伪影。解决方案的关键在于提出PhyGenesis框架,其核心包括两个组件:(1) 物理条件生成器(physical condition generator),用于将潜在无效的轨迹输入转化为物理合理的条件;(2) 物理增强视频生成器(physics-enhanced video generator),基于这些条件生成高保真多视角驾驶视频。为有效训练该框架,作者构建了一个大规模、富含物理信息的异构数据集,其中融合了真实驾驶视频与CARLA模拟器生成的多样化挑战场景,并利用由此衍生的监督信号引导模型学习极端条件下的物理驱动动力学,从而实现轨迹修正并提升视频生成的物理一致性。
链接: https://arxiv.org/abs/2603.24506
作者: Jiawei Zhou,Zhenxin Zhu,Lingyi Du,Linye Lyu,Lijun Zhou,Zhanqian Wu,Hongcheng Luo,Zhuotao Tian,Bing Wang,Guang Chen,Hangjun Ye,Haiyang Sun,Yu Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video generation models have shown strong potential as world models for autonomous driving simulation. However, existing approaches are primarily trained on real-world driving datasets, which mostly contain natural and safe driving scenarios. As a result, current models often fail when conditioned on challenging or counterfactual trajectories-such as imperfect trajectories generated by simulators or planning systems-producing videos with severe physical inconsistencies and artifacts. To address this limitation, we propose PhyGenesis, a world model designed to generate driving videos with high visual fidelity and strong physical consistency. Our framework consists of two key components: (1) a physical condition generator that transforms potentially invalid trajectory inputs into physically plausible conditions, and (2) a physics-enhanced video generator that produces high-fidelity multi-view driving videos under these conditions. To effectively train these components, we construct a large-scale, physics-rich heterogeneous dataset. Specifically, in addition to real-world driving videos, we generate diverse challenging driving scenarios using the CARLA simulator, from which we derive supervision signals that guide the model to learn physically grounded dynamics under extreme conditions. This challenging-trajectory learning strategy enables trajectory correction and promotes physically consistent video generation. Extensive experiments demonstrate that PhyGenesis consistently outperforms state-of-the-art methods, especially on challenging trajectories. Our project page is available at: this https URL.
[CV-16] Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models CVPR2026
【速读】: This paper targets the shortcomings of current Multimodal Large Language Models (MLLMs) in vision-oriented Theory of Mind (ToM) reasoning: most existing evaluations rely on text inputs and neglect the models' understanding of human mental states in purely visual scenarios, and there is little interpretability analysis of how the models' internal attention behaves. The key to the solution is the VisionToM framework, whose core idea is to compute intervention vectors aligned with the correct semantic targets and use them to steer the model's attention across different layers of visual features, thereby reducing reliance on spurious linguistic priors and improving both multimodal reasoning accuracy on real-world video data and the quality of free-form generated explanations.
链接: https://arxiv.org/abs/2603.24484
作者: Siqi Liu,Xinyang Li,Bochao Zou,Junbao Zhuo,Huimin Ma,Jiansheng Chen
机构: University of Science and Technology Beijing (北京科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 7 figures, accepted at CVPR 2026, project page: see this https URL
Abstract:As large language models (LLMs) continue to advance, there is increasing interest in their ability to infer human mental states and demonstrate a human-like Theory of Mind (ToM). Most existing ToM evaluations, however, are centered on text-based inputs, while scenarios relying solely on visual information receive far less attention. This leaves a gap, since real-world human-AI interaction typically requires multimodal understanding. In addition, many current methods regard the model as a black box and rarely probe how its internal attention behaves in multiple-choice question answering (QA). The impact of LLM hallucinations on such tasks is also underexplored from an interpretability perspective. To address these issues, we introduce VisionToM, a vision-oriented intervention framework designed to strengthen task-aware reasoning. The core idea is to compute intervention vectors that align visual representations with the correct semantic targets, thereby steering the model’s attention through different layers of visual features. This guidance reduces the model’s reliance on spurious linguistic priors, leading to more reliable multimodal language model (MLLM) outputs and better QA performance. Experiments on the EgoToM benchmark-an egocentric, real-world video dataset for ToM with three multiple-choice QA settings-demonstrate that our method substantially improves the ToM abilities of MLLMs. Furthermore, results on an additional open-ended generation task show that VisionToM enables MLLMs to produce free-form explanations that more accurately capture agents’ mental states, pushing machine-human collaboration toward greater alignment.
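The abstract describes intervention vectors that align representations with correct semantic targets. A generic activation-steering sketch of that idea is shown below on toy 3-dimensional hidden states; VisionToM's actual vectors are computed from MLLM visual features across layers, and the data, dimensionality, and `alpha` coefficient here are all assumptions.

```python
# Minimal activation-steering sketch: the intervention direction is the
# difference between mean hidden states of target-aligned ("pos") and
# distractor-aligned ("neg") examples, and steering adds it to a hidden state.

def mean(vecs):
    n = len(vecs)
    return [sum(v[d] for v in vecs) / n for d in range(len(vecs[0]))]

def intervention_vector(pos_states, neg_states):
    """Direction pointing from distractor-aligned to target-aligned states."""
    mp, mn = mean(pos_states), mean(neg_states)
    return [p - q for p, q in zip(mp, mn)]

def steer(hidden, vec, alpha=1.0):
    """Shift a hidden state along the intervention direction."""
    return [h + alpha * v for h, v in zip(hidden, vec)]

# Toy 3-d hidden states: dimension 0 encodes the correct semantic target.
pos = [[1.0, 0.2, 0.0], [0.8, 0.0, 0.1]]
neg = [[-1.0, 0.1, 0.0], [-0.6, 0.1, 0.2]]
vec = intervention_vector(pos, neg)
steered = steer([-0.5, 0.0, 0.0], vec, alpha=0.5)
```

After steering, a hidden state that originally leaned toward the distractor flips toward the target dimension while the target-irrelevant dimensions are barely touched.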
[CV-17] OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning
【速读】: This paper addresses the lack of unification and intelligence in current open-source video generation models: most academic models remain heavily fragmented, and existing unified frameworks struggle to integrate diverse tasks seamlessly. The key to the solution is OmniWeaving, an omni-level video generation model with multimodal composition and reasoning-informed capabilities; by learning temporal binding and intent understanding in complex scenarios from a massive-scale pretraining dataset, the model can jointly process text, multi-image, and video inputs, and act as an intelligent agent that infers complex user intentions for sophisticated video creation.
链接: https://arxiv.org/abs/2603.24458
作者: Kaihang Pan,Qi Tian,Jianwei Zhang,Weijie Kong,Jiangfeng Xiong,Yanxin Long,Shixue Zhang,Haiyi Qiu,Tan Wang,Zheqi Lv,Yue Wu,Liefeng Bo,Siliang Tang,Zhao Zhong
机构: Zhejiang University (浙江大学); Tencent Hunyuan (腾讯混元); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 32 pages, 22 figures. Project Page: this https URL
Abstract:While proprietary systems such as Seedance-2.0 have achieved remarkable success in omni-capable video generation, open-source alternatives significantly lag behind. Most academic models remain heavily fragmented, and the few existing efforts toward unified video generation still struggle to seamlessly integrate diverse tasks within a single framework. To bridge this gap, we propose OmniWeaving, an omni-level video generation model featuring powerful multimodal composition and reasoning-informed capabilities. By leveraging a massive-scale pretraining dataset that encompasses diverse compositional and reasoning-augmented scenarios, OmniWeaving learns to temporally bind interleaved text, multi-image, and video inputs while acting as an intelligent agent to infer complex user intentions for sophisticated video creation. Furthermore, we introduce IntelligentVBench, the first comprehensive benchmark designed to rigorously assess next-level intelligent unified video generation. Extensive experiments demonstrate that OmniWeaving achieves SoTA performance among open-source unified models. The codes and model will be made publicly available soon. Project Page: this https URL.
[CV-18] Unleashing Vision-Language Semantics for Deepfake Video Detection CVPR2026
【速读】: This paper addresses the underexploited potential of pretrained Vision-Language Models (VLMs) in Deepfake Video Detection (DFD), in particular the rich cross-modal semantics embedded in their latent space. Existing methods rely on visual features alone and fail to exploit the vision-language-aligned semantics of VLMs, which limits discriminability. The key to the solution is the VLAForge framework, built on two core mechanisms: first, a ForgePerceiver module that acts as an independent learner to capture diverse, subtle forgery cues both granularly and holistically while preserving the pretrained Vision-Language Alignment (VLA) knowledge; second, an identity-aware VLA score that couples cross-modal semantics with the forgery cues extracted by ForgePerceiver and is further augmented by identity prior-informed text prompting to strengthen authenticity semantics, substantially improving discriminability at both the frame and video levels.
链接: https://arxiv.org/abs/2603.24454
作者: Jiawen Zhu,Yunqi Miao,Xueyi Zhang,Jiankang Deng,Guansong Pang
机构: Singapore Management University (新加坡管理大学); The University of Warwick (华威大学); Nanyang Technological University (南洋理工大学); Imperial College London (帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 7 figures, accepted by CVPR 2026
Abstract:Recent Deepfake Video Detection (DFD) studies have demonstrated that pre-trained Vision-Language Models (VLMs) such as CLIP exhibit strong generalization capabilities in detecting artifacts across different identities. However, existing approaches focus on leveraging visual features only, overlooking their most distinctive strength – the rich vision-language semantics embedded in the latent space. We propose VLAForge, a novel DFD framework that unleashes the potential of such cross-modal semantics to enhance model’s discriminability in deepfake detection. This work i) enhances the visual perception of VLM through a ForgePerceiver, which acts as an independent learner to capture diverse, subtle forgery cues both granularly and holistically, while preserving the pretrained Vision-Language Alignment (VLA) knowledge, and ii) provides a complementary discriminative cue – Identity-Aware VLA score, derived by coupling cross-modal semantics with the forgery cues learned by ForgePerceiver. Notably, the VLA score is augmented by an identity prior-informed text prompting to capture authenticity cues tailored to each identity, thereby enabling more discriminative cross-modal semantics. Comprehensive experiments on video DFD benchmarks, including classical face-swapping forgeries and recent full-face generation forgeries, demonstrate that our VLAForge substantially outperforms state-of-the-art methods at both frame and video levels. Code is available at this https URL.
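As a rough illustration of what a "vision-language alignment score" can look like, the sketch below computes a CLIP-style cosine similarity between an image embedding and two text-prompt embeddings ("real" vs. "fake") and normalizes with a softmax. The embeddings are toy vectors, not CLIP outputs, and VLAForge's actual identity-aware score is more elaborate than this.

```python
# Toy VLA-style score: cosine similarity of an image embedding against
# "real" and "fake" prompt embeddings, softmax-normalized into P(fake).

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def vla_score(image_emb, real_prompt_emb, fake_prompt_emb):
    """Softmax-normalized probability that the image matches the 'fake' prompt."""
    s_real = cosine(image_emb, real_prompt_emb)
    s_fake = cosine(image_emb, fake_prompt_emb)
    e_real, e_fake = math.exp(s_real), math.exp(s_fake)
    return e_fake / (e_real + e_fake)

img = [0.9, 0.1, 0.4]
real_txt = [1.0, 0.0, 0.5]   # toy embedding of "a real face of this identity"
fake_txt = [0.0, 1.0, 0.0]   # toy embedding of "a forged face"
p_fake = vla_score(img, real_txt, fake_txt)
```

Because the toy image embedding is nearly parallel to the "real" prompt, the resulting fake probability falls below 0.5.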
[CV-19] CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents
【速读】: This paper addresses the bottleneck in developing desktop computer-use agents (CUAs): the scarcity of continuous, high-quality human demonstration videos. Existing datasets such as ScaleCUA contain only sparse screenshots, which are insufficient for end-to-end learning of complex desktop tasks. The key to the solution is CUA-Suite, whose core is VideoCUA: roughly 10,000 expert-demonstrated tasks across 87 diverse applications, with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations, totaling about 55 hours and 6 million frames of high-quality video. The continuous video streams preserve the full temporal dynamics of human-computer interaction and can be losslessly transformed into the formats required by existing agent frameworks, substantially improving models' understanding of and performance in complex desktop environments.
链接: https://arxiv.org/abs/2603.24440
作者: Xiangru Jian,Shravan Nayak,Kevin Qinghong Lin,Aarash Feizi,Kaixin Li,Patrice Bechard,Spandana Gella,Sai Rajeswar
机构: ServiceNow; University of Waterloo (滑铁卢大学); Mila; Université de Montréal (蒙特利尔大学); McGill University (麦吉尔大学); University of Oxford (牛津大学); National University of Singapore (新加坡国立大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:Computer-use agents (CUAs) hold great promise for automating complex desktop workflows, yet progress toward general-purpose agents is bottlenecked by the scarcity of continuous, high-quality human demonstration videos. Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents. However, the largest existing open dataset, ScaleCUA, contains only 2 million screenshots, equating to less than 20 hours of video. To address this bottleneck, we introduce CUA-Suite, a large-scale ecosystem of expert video demonstrations and dense annotations for professional desktop computer-use agents. At its core is VideoCUA, which provides approximately 10,000 human-demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations, totaling approximately 55 hours and 6 million frames of expert video. Unlike sparse datasets that capture only final click coordinates, these continuous video streams preserve the full temporal dynamics of human interaction, forming a superset of information that can be losslessly transformed into the formats required by existing agent frameworks. CUA-Suite further provides two complementary resources: UI-Vision, a rigorous benchmark for evaluating grounding and planning capabilities in CUAs, and GroundCUA, a large-scale grounding dataset with 56K annotated screenshots and over 3.6 million UI element annotations. Preliminary evaluation reveals that current foundation action models struggle substantially with professional desktop applications (~60% task failure rate). Beyond evaluation, CUA-Suite’s rich multimodal corpus supports emerging research directions including generalist screen parsing, continuous spatial control, video-based reward modeling, and visual world models. All data and models are publicly released.
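The claim that continuous traces form a "superset" of sparse click data can be illustrated directly: a dense cursor trace can be losslessly reduced to the click events most agent frameworks consume. The trace format `(t, x, y, button_down)` below is an assumption for illustration, not VideoCUA's actual schema.

```python
# Recover sparse click events from a continuous cursor trace by detecting
# rising edges of the button state (up -> down transitions).

def clicks_from_trace(trace):
    """Return (t, x, y) for each press in a (t, x, y, pressed) trace."""
    events, down = [], False
    for t, x, y, pressed in trace:
        if pressed and not down:
            events.append((t, x, y))
        down = pressed
    return events

trace = [
    (0.00, 10, 10, False),
    (0.03, 12, 11, False),
    (0.07, 15, 12, True),   # press
    (0.10, 15, 12, True),   # still held
    (0.13, 15, 12, False),  # release
    (0.17, 40, 30, True),   # second press
]
events = clicks_from_trace(trace)
```

The reverse direction is impossible: the sparse events cannot reconstruct the kinematics between clicks, which is exactly the information the dataset argues is missing from screenshot-only corpora.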
[CV-20] The Gait Signature of Frailty: Transfer Learning based Deep Gait Models for Scalable Frailty Assessment
【速读】: This paper addresses the subjectivity, heterogeneity, and poor scalability of frailty assessment in geriatric medicine. The core of the solution is a publicly available silhouette-based frailty gait dataset collected in a clinically realistic setting, spanning the full frailty spectrum and including older adults who use walking aids. Building on this dataset, a transfer-learning study of how to fine-tune pretrained gait recognition models shows that performance gains depend primarily on conservatively freezing low-level gait representations while letting higher-level features adapt, rather than on architectural complexity alone. In addition, careful handling of class imbalance and the combination of complementary learning objectives further improve discrimination between clinically adjacent frailty states, and the models' attention concentrates on the lower-limb and pelvic regions, consistent with established biomechanical correlates of frailty, yielding a scalable, non-invasive, and interpretable computational framework for frailty assessment.
链接: https://arxiv.org/abs/2603.24434
作者: Laura McDaniel,Basudha Pal,Crystal Szczesny,Yuxiang Guo,Ryan Roemmich,Peter Abadir,Rama Chellappa
机构: Johns Hopkins University (约翰霍普金斯大学); Johns Hopkins Medicine (约翰霍普金斯医学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Frailty is a condition in aging medicine characterized by diminished physiological reserve and increased vulnerability to stressors. However, frailty assessment remains subjective, heterogeneous, and difficult to scale in clinical practice. Gait is a sensitive marker of biological aging, capturing multisystem decline before overt disability. Yet the application of modern computer vision to gait-based frailty assessment has been limited by small, imbalanced datasets and a lack of clinically representative benchmarks. In this work, we introduce a publicly available silhouette-based frailty gait dataset collected in a clinically realistic setting, spanning the full frailty spectrum and including older adults who use walking aids. Using this dataset, we evaluate how pretrained gait recognition models can be adapted for frailty classification under limited data conditions. We study both convolutional and hybrid attention-based architectures and show that predictive performance depends primarily on how pretrained representations are transferred rather than architectural complexity alone. Across models, selectively freezing low-level gait representations while allowing higher-level features to adapt yields more stable and generalizable performance than either full fine-tuning or rigid freezing. Conservative handling of class imbalance further improves training stability, and combining complementary learning objectives enhances discrimination between clinically adjacent frailty states. Interpretability analyses reveal consistent model attention to lower-limb and pelvic regions, aligning with established biomechanical correlates of frailty. Together, these findings establish gait-based representation learning as a scalable, non-invasive, and interpretable framework for frailty assessment and support the integration of modern biometric modeling approaches into aging research and clinical practice.
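The transfer recipe reported to work best, freezing low-level gait representations while adapting higher layers, can be sketched as a simple layer-freezing plan. The layer names below are hypothetical stand-ins for a real backbone's parameter groups, not the paper's architecture.

```python
# Sketch of selective freezing: everything below a chosen boundary layer is
# frozen, everything from the boundary onward stays trainable. In a real
# framework this would toggle per-parameter gradient flags instead.

def freeze_below(layers, boundary):
    """Mark layers before `boundary` as frozen, the rest as trainable."""
    plan, frozen_zone = {}, True
    for name in layers:
        if name == boundary:
            frozen_zone = False
        plan[name] = "frozen" if frozen_zone else "trainable"
    return plan

backbone = ["stem", "stage1", "stage2", "stage3", "stage4", "head"]
plan = freeze_below(backbone, boundary="stage3")
```

In PyTorch, for instance, the same plan would typically be applied by setting `requires_grad = False` on the frozen groups' parameters; the point of the sketch is only that the boundary, not the architecture, is the knob the paper tunes.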
[CV-21] Teacher-Student Diffusion Model for Text-Driven 3D Hand Motion Generation ICASSP2026
【速读】: This paper tackles the problem of generating realistic 3D hand motion from natural language; existing methods either focus on full-body motion and neglect fine-grained hand gestures, or require explicit 3D object meshes, which limits generality. The key to the solution is a model-agnostic teacher-student diffusion framework (TSHaMo), in which the student model learns to generate hand motion from text alone while the teacher leverages auxiliary signals (e.g., MANO parameters) to provide structured guidance during training; a co-training strategy lets the student benefit from the teacher's intermediate predictions while remaining text-only at inference, yielding high-quality and diverse hand motion without requiring 3D objects at test time.
链接: https://arxiv.org/abs/2603.24407
作者: Ching-Lam Cheng,Bin Zhu,Shengfeng He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, accepted by ICASSP2026
Abstract:Generating realistic 3D hand motion from natural language is vital for VR, robotics, and human-computer interaction. Existing methods either focus on full-body motion, overlooking detailed hand gestures, or require explicit 3D object meshes, limiting generality. We propose TSHaMo, a model-agnostic teacher-student diffusion framework for text-driven hand motion generation. The student model learns to synthesize motions from text alone, while the teacher leverages auxiliary signals (e.g., MANO parameters) to provide structured guidance during training. A co-training strategy enables the student to benefit from the teacher’s intermediate predictions while remaining text-only at inference. Evaluated using two diffusion backbones on GRAB and H2O, TSHaMo consistently improves motion quality and diversity. Ablations confirm its robustness and flexibility in using diverse auxiliary inputs without requiring 3D objects at test time.
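The co-training idea, supervising the student with both the ground truth and the teacher's intermediate prediction, reduces to a weighted two-term objective. The sketch below uses plain squared errors on toy scalars; TSHaMo's actual losses operate on diffusion-model outputs, and the weight `lam` is an invented hyperparameter.

```python
# Toy co-training objective: ground-truth loss plus a distillation term
# pulling the student toward the teacher's prediction.

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def co_training_loss(student_pred, teacher_pred, target, lam=0.5):
    """Supervised loss + lam * distillation loss toward the teacher."""
    return mse(student_pred, target) + lam * mse(student_pred, teacher_pred)

target  = [1.0, 0.0, 2.0]   # toy "hand motion" target
teacher = [0.9, 0.1, 1.8]   # the teacher sees auxiliary signals, so it is closer
student = [0.5, 0.4, 1.0]
loss = co_training_loss(student, teacher, target, lam=0.5)
```

Because the teacher term only adds supervision during training, the student remains a text-only model at inference, matching the framework's deployment story.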
[CV-22] Causal Transfer in Medical Image Analysis
【速读】: This paper addresses the performance degradation of medical imaging models under domain shift when deployed across hospitals, scanners, populations, or imaging protocols, which limits their clinical reliability. The key to the solution is the Causal Transfer Learning (CTL) paradigm, which this survey introduces and systematises: it integrates causal reasoning with cross-domain representation learning to identify causal mechanisms that remain invariant across environments, rather than relying on statistical correlations that break under changing conditions. By framing domain shift as a causal problem and combining structural causal models, invariant risk minimisation, and counterfactual reasoning, CTL improves robustness and generalisability in complex clinical settings such as multi-institutional and federated deployment.
链接: https://arxiv.org/abs/2603.24388
作者: Mohammed M. Abdelsamea,Daniel Tweneboah Anyimadu,Tasneem Selim,Saif Alzubi,Lei Zhang,Ahmed Karam Eldaly,Xujiong Ye
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Medical imaging models frequently fail when deployed across hospitals, scanners, populations, or imaging protocols due to domain shift, limiting their clinical reliability. While transfer learning and domain adaptation address such shifts statistically, they often rely on spurious correlations that break under changing conditions. On the other hand, causal inference provides a principled way to identify invariant mechanisms that remain stable across environments. This survey introduces and systematises Causal Transfer Learning (CTL) for medical image analysis. This paradigm integrates causal reasoning with cross-domain representation learning to enable robust and generalisable clinical AI. We frame domain shift as a causal problem and analyse how structural causal models, invariant risk minimisation, and counterfactual reasoning can be embedded within transfer learning pipelines. We studied spanning classification, segmentation, reconstruction, anomaly detection, and multimodal imaging, and organised them by task, shift type, and causal assumption. A unified taxonomy is proposed that connects causal frameworks and transfer mechanisms. We further summarise datasets, benchmarks, and empirical gains, highlighting when and why causal transfer outperforms correlation-based domain adaptation. Finally, we discuss how CTL supports fairness, robustness, and trustworthy deployment in multi-institutional and federated settings, and outline open challenges and research directions for clinically reliable medical imaging AI.
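One concrete building block the survey covers is invariant risk minimisation. As a hedged sketch of the IRMv1 penalty (Arjovsky et al.), the snippet below computes, for a squared-error risk and a scalar "dummy classifier" w, the squared gradient of the per-environment risk at w = 1; the toy 1-d predictions and environments are invented for illustration.

```python
# IRMv1-style penalty for squared-error risk R(w) = mean((w*f - y)^2):
# dR/dw at w=1 is mean(2*f*(f - y)); the penalty is its square.
# An invariant predictor has (near-)zero penalty in every environment.

def irm_penalty(preds, targets):
    """|| d/dw mean((w*f - y)^2) evaluated at w=1 ||^2 for scalar w."""
    grad = sum(2.0 * f * (f - y) for f, y in zip(preds, targets)) / len(preds)
    return grad * grad

env_a = ([1.0, 2.0], [1.0, 2.0])      # predictions match targets exactly
env_b = ([1.0, -1.0], [2.0, -2.0])    # systematically under-scaled predictor
pen_a = irm_penalty(*env_a)
pen_b = irm_penalty(*env_b)
```

A nonzero penalty in some environment signals that rescaling the predictor would help there, i.e., the learned feature is not yet environment-invariant, which is the failure mode correlation-based adaptation cannot detect.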
[CV-23] ViHOI: Human-Object Interaction Synthesis with Visual Priors CVPR2026
【速读】: This paper addresses a key challenge in generating realistic and physically plausible 3D Human-Object Interaction (HOI) motion sequences: text descriptions alone struggle to capture complex physical interactions. The key to the solution is a new paradigm that extracts rich interaction priors from easily accessible 2D images and injects them into a diffusion-based HOI generation framework. Concretely, the ViHOI framework uses a large Vision-Language Model (VLM) as a prior-extraction engine, adopts a layer-decoupled strategy to separate visual and textual priors, and compresses the high-dimensional features into compact prior tokens via a Q-Former-based adapter, significantly improving the efficiency of conditional diffusion training and the quality of generation. During training, motion-rendered images ensure semantic alignment between visual inputs and motion sequences; at inference, reference images synthesized by a text-to-image model improve generalization to unseen objects and interaction categories.
链接: https://arxiv.org/abs/2603.24383
作者: Songjin Cai,Linjie Zhong,Ling Guo,Changxing Ding
机构: South China University of Technology (华南理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Generating realistic and physically plausible 3D Human-Object Interactions (HOI) remains a key challenge in motion generation. One primary reason is that describing these physical constraints with words alone is difficult. To address this limitation, we propose a new paradigm: extracting rich interaction priors from easily accessible 2D images. Specifically, we introduce ViHOI, a novel framework that enables diffusion-based generative models to leverage rich, task-specific priors from 2D images to enhance generation quality. We utilize a large Vision-Language Model (VLM) as a powerful prior-extraction engine and adopt a layer-decoupled strategy to obtain visual and textual priors. Concurrently, we design a Q-Former-based adapter that compresses the VLM’s high-dimensional features into compact prior tokens, which significantly facilitates the conditional training of our diffusion model. Our framework is trained on motion-rendered images from the dataset to ensure strict semantic alignment between visual inputs and motion sequences. During inference, it leverages reference images synthesized by a text-to-image generation model to improve generalization to unseen objects and interaction categories. Experimental results demonstrate that ViHOI achieves state-of-the-art performance, outperforming existing methods across multiple benchmarks and demonstrating superior generalization.
[CV-24] GeoRouter: Dynamic Paradigm Routing for Worldwide Image Geolocalization
【速读】: This paper addresses the accuracy challenges of worldwide image geolocalization caused by large visual and geographic diversity, and in particular the complementary error profiles of the two dominant paradigms: retrieval-based and generation-based methods. The core difficulty is that no single paradigm is optimal everywhere: retrieval excels at fine-grained instance matching, whereas generation offers stronger semantic reasoning. The key to the solution is GeoRouter, a dynamic routing framework built on a Large Vision-Language Model (LVLM) that adaptively selects the better paradigm by analyzing the input image content; a distance-aware preference objective converts the localization-error gap between paradigms into a continuous supervision signal for optimizing the routing policy, and GeoRouting, the first large-scale dataset designed for training routing policies, enables the two paradigms to reinforce each other and improves overall performance.
链接: https://arxiv.org/abs/2603.24376
作者: Pengyue Jia,Derong Xu,Yingyi Zhang,Xiaopeng Li,Wenlin Zhang,Yi Wen,Yuanshao Zhu,Xiangyu Zhao
机构: City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Worldwide image geolocalization aims to predict precise GPS coordinates for images captured anywhere on Earth, which is challenging due to the large visual and geographic diversity. Recent methods mainly follow two paradigms: retrieval-based approaches that match queries against a reference database, and generation-based approaches that directly predict coordinates using Large Vision-Language Models (LVLMs). However, we observe distinct error profiles between them: retrieval excels at fine-grained instance matching, while generation offers robust semantic reasoning. This complementary heterogeneity suggests that no single paradigm is universally superior. To harness this potential, we propose GeoRouter, a dynamic routing framework that adaptively assigns each query to the optimal paradigm. GeoRouter leverages an LVLM backbone to analyze visual content and provide routing decisions. To optimize GeoRouter, we introduce a distance-aware preference objective that converts the distance gap between paradigms into a continuous supervision signal, explicitly reflecting relative performance differences. Furthermore, we construct GeoRouting, the first large-scale dataset tailored for training routing policies with independent paradigm predictions. Extensive experiments on IM2GPS3k and YFCC4k demonstrate that GeoRouter significantly outperforms state-of-the-art baselines.
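The distance-aware preference idea, turning the error gap between paradigms into a continuous signal rather than a hard 0/1 routing label, can be illustrated with a simple squashing function. The use of `tanh` and the `scale` constant here are assumptions for illustration; the paper's actual objective may be formulated differently.

```python
# Hypothetical continuous routing preference: the gap between the two
# paradigms' localization errors (in km) is squashed into (-1, 1).

import math

def preference(err_retrieval_km, err_generation_km, scale=100.0):
    """Positive favors retrieval; magnitude tracks how decisive the gap is."""
    return math.tanh((err_generation_km - err_retrieval_km) / scale)

tiny_gap = preference(24.0, 26.0)     # nearly a toss-up: weak signal
large_gap = preference(1.0, 2500.0)   # retrieval clearly better: strong signal
```

A router trained on such soft targets learns not just which paradigm won on each image, but how much the choice mattered, which is exactly what a binary label throws away.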
[CV-25] PP-OCRv5: A Specialized 5M-Parameter Model Rivaling Billion-Parameter Vision-Language Models on OCR Tasks
【速读】: This paper addresses the high computational cost, imprecise text localization in complex layouts, and frequent textual hallucinations of current large-model-based Optical Character Recognition (OCR) systems. The key to the solution is to abandon the assumption that larger models necessarily perform better and to focus instead on data-quality-driven optimization: by systematically quantifying three dimensions of the training data, namely data difficulty, data accuracy, and data diversity, the authors build PP-OCRv5, a lightweight two-stage OCR system with only 5 million parameters. Experiments show that, given a sufficient volume of high-quality, diverse, and accurately labeled data, such an efficient architecture can match or even surpass billion-parameter Vision-Language Models (VLMs) on standard OCR benchmarks, while offering superior localization precision and fewer hallucinations.
链接: https://arxiv.org/abs/2603.24373
作者: Cheng Cui,Yubo Zhang,Ting Sun,Xueqing Wang,Hongen Liu,Manhui Lin,Yue Zhang,Tingquan Gao,Changda Zhou,Jiaxuan Liu,Zelun Zhang,Jing Zhang,Jun Zhang,Yi Liu
机构: Baidu Inc (百度公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The advent of “OCR 2.0” and large-scale vision-language models (VLMs) has set new benchmarks in text recognition. However, these unified architectures often come with significant computational demands, challenges in precise text localization within complex layouts, and a propensity for textual hallucinations. Revisiting the prevailing notion that model scale is the sole path to high accuracy, this paper introduces PP-OCRv5, a meticulously optimized, lightweight OCR system with merely 5 million parameters. We demonstrate that PP-OCRv5 achieves performance competitive with many billion-parameter VLMs on standard OCR benchmarks, while offering superior localization precision and reduced hallucinations. The cornerstone of our success lies not in architectural expansion but in a data-centric investigation. We systematically dissect the role of training data by quantifying three critical dimensions: data difficulty, data accuracy, and data diversity. Our extensive experiments reveal that with a sufficient volume of high-quality, accurately labeled, and diverse data, the performance ceiling for traditional, efficient two-stage OCR pipelines is far higher than commonly assumed. This work provides compelling evidence for the viability of lightweight, specialized models in the large-model era and offers practical insights into data curation for OCR. The source code and models are publicly available at this https URL.
[CV-26] Language-Guided Structure-Aware Network for Camouflaged Object Detection
【速读】: This paper addresses the difficulty of Camouflaged Object Detection (COD), where targets are highly consistent with the background in color, texture, and structure. Although existing methods introduce multi-scale fusion and attention mechanisms, they generally lack the guidance of textual semantic priors, which limits the model's ability to focus on camouflaged regions in complex scenes. The key to the solution is the proposed Language-Guided Structure-Aware Network (LGSAN), with four core components: CLIP is used to generate masks from text prompts and RGB images, guiding the multi-scale features extracted by PVT-v2 toward potential target regions; a Fourier Edge Enhancement Module (FEEM) fuses high-frequency information in the frequency domain to strengthen edge features; a Structure-Aware Attention Module (SAAM) improves the perception of object structure and boundaries; and a Coarse-Guided Local Refinement Module (CGLRM) refines the fine-grained reconstruction and boundary integrity of camouflaged regions.
链接: https://arxiv.org/abs/2603.24355
作者: Min Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Camouflaged Object Detection (COD) aims to segment objects that are highly integrated with the background in terms of color, texture, and structure, making it a highly challenging task in computer vision. Although existing methods introduce multi-scale fusion and attention mechanisms to alleviate the above issues, they generally lack the guidance of textual semantic priors, which limits the model’s ability to focus on camouflaged regions in complex scenes. To address this issue, this paper proposes a Language-Guided Structure-Aware Network (LGSAN). Specifically, based on the visual backbone PVT-v2, we introduce CLIP to generate masks from text prompts and RGB images, thereby guiding the multi-scale features extracted by PVT-v2 to focus on potential target regions. On this foundation, we further design a Fourier Edge Enhancement Module (FEEM), which integrates multi-scale features with high-frequency information in the frequency domain to extract edge enhancement features. Furthermore, we propose a Structure-Aware Attention Module (SAAM) to effectively enhance the model’s perception of object structures and boundaries. Finally, we introduce a Coarse-Guided Local Refinement Module (CGLRM) to enhance fine-grained reconstruction and boundary integrity of camouflaged object regions. Extensive experiments demonstrate that our method consistently achieves highly competitive performance across multiple COD datasets, validating its effectiveness and robustness.
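The principle behind Fourier-domain edge enhancement, keeping high-frequency components to expose edges, can be shown on a 1-D toy signal with a naive DFT. This is only a schematic analogue: FEEM operates on 2-D multi-scale feature maps, and the cutoff `keep_from` is an invented parameter.

```python
# Frequency-domain high-pass on a toy 1-D signal: zero out low frequencies
# (and their conjugate mirrors), inverse-transform, and the step "edge"
# remains while the flat, low-frequency content is removed.

import cmath

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)).real / N
            for n in range(N)]

def high_pass(x, keep_from=2):
    """Zero all frequency bins closer than `keep_from` to DC."""
    X = dft(x)
    N = len(X)
    for k in range(N):
        freq = min(k, N - k)      # distance from DC, accounting for symmetry
        if freq < keep_from:
            X[k] = 0
    return idft(X)

signal = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]  # a step "edge"
edges = high_pass(signal, keep_from=2)
```

A constant (edge-free) signal passes through as (numerically) zero, while the step signal keeps its sharp transition content, which is the property an edge-enhancement branch relies on.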
[CV-27] Le MuMo JEPA: Multi-Modal Self-Supervised Representation Learning with Learnable Fusion Tokens
【速读】: This paper addresses the limitation of single-modality self-supervised learning methods, which ignore the complementary information available from heterogeneous sensors and thus constrain scene understanding. The key to the solution is the Le MuMo JEPA framework, which introduces fusion tokens as a shared cross-modal bottleneck to enable efficient information exchange between RGB images and aligned companion modalities (e.g., LiDAR depth or thermal imagery) within a unified transformer; concretely, after an initial cross-modal attention layer the modality-specific tokens are dropped, forcing cross-modal information to be compressed into the shared grid of fusion tokens, and Sketched Isotropic Gaussian Regularization (SIGReg) is applied to the joint multimodal CLS embedding, improving downstream performance without a substantial increase in compute.
链接: https://arxiv.org/abs/2603.24327
作者: Ciem Cornelissen,Sam Leroux,Pieter Simoens
机构: IDLab, Department of Information Technology, Ghent University - imec; Ghent University (根特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Self-supervised learning has emerged as a powerful paradigm for learning visual representations without manual annotations, yet most methods still operate on a single modality and therefore miss the complementary structure available from heterogeneous sensors. We present Le MuMo JEPA, a self-supervised framework that learns unified representations from RGB images and aligned companion modalities. In our driving experiments, the second modality is camera-aligned LiDAR depth; we also evaluate RGB-thermal training and transfer on the Teledyne FLIR ADAS benchmark. Our approach extends LeJEPA to the multi-modal setting by learning fusion tokens that act as a latent bottleneck between modality-specific patch stems inside a shared transformer. Our default model employs a pruned fusion strategy: after an initial cross-modal attention layer, modality-specific tokens are dropped, forcing cross-modal information into the shared fusion-token grid as an efficient latent bottleneck before Sketched Isotropic Gaussian Regularization (SIGReg) is applied to the joint multimodal CLS embedding. On Waymo, Le MuMo JEPA gives the strongest performance-efficiency trade-off on downstream patch probes among the from-scratch multimodal baselines, improving CenterNet detection and dense depth while remaining competitive on segmentation. Under from-scratch training on nuScenes, Le MuMo JEPA remains the strongest model, and it also gives the best FLIR results, especially after Waymo-initialized fine-tuning. It also retains the best overall accuracy-efficiency balance in our study at substantially lower compute, memory, and estimated training time.
[CV-28] Heuristic Self-Paced Learning for Domain Adaptive Semantic Segmentation under Adverse Conditions CVPR2026
【速读】: This paper addresses how the learning order of semantic classes affects segmentation performance in unsupervised domain adaptation (UDA), especially under adverse weather: traditional static curricula rely on handcrafted heuristics (e.g., fixed uncertainty metrics) and cannot adapt to the model's evolving, high-dimensional training dynamics, leading to category bias. The key to the solution is to cast curriculum learning as a sequential decision problem and propose an autonomous class scheduler with two components: (i) a high-dimensional state encoder that maps the model's training status into a latent space and distills key features indicative of progress, and (ii) a category-fair policy-gradient objective that ensures balanced improvement across classes. Combined with mixed source-target supervision, the scheduler dynamically identifies the most informative classes at each training stage, enabling more adaptive and efficient semantic segmentation learning.
链接: https://arxiv.org/abs/2603.24322
作者: Shiqin Wang,Haoyang Chen,Huaizhou Huang,Yinkan He,Dongfang Sun,Xiaoqing Chen,Xingyu Liu,Zheng Wang,Kaiyan Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:The learning order of semantic classes significantly impacts unsupervised domain adaptation for semantic segmentation, especially under adverse weather conditions. Most existing curricula rely on handcrafted heuristics (e.g., fixed uncertainty metrics) and follow a static schedule, which fails to adapt to a model’s evolving, high-dimensional training dynamics, leading to category bias. Inspired by Reinforcement Learning, we cast curriculum learning as a sequential decision problem and propose an autonomous class scheduler. This scheduler consists of two components: (i) a high-dimensional state encoder that maps the model’s training status into a latent space and distills key features indicative of progress, and (ii) a category-fair policy-gradient objective that ensures balanced improvement across classes. Coupled with mixed source-target supervision, the learned class rankings direct the network’s focus to the most informative classes at each stage, enabling more adaptive and dynamic learning. It is worth noting that our method achieves state-of-the-art performance on three widely used benchmarks (e.g., ACDC, Dark Zurich, and Nighttime Driving) and shows generalization ability in synthetic-to-real semantic segmentation.
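Casting class scheduling as a policy-gradient problem can be illustrated with a tiny softmax policy over classes updated by the exact expected REINFORCE gradient. The per-class rewards (standing in for IoU improvements), learning rate, and three-class setup below are all invented; the paper's scheduler additionally conditions on an encoded training state and uses a category-fair objective.

```python
# Toy policy-gradient class scheduler: a softmax policy over 3 classes is
# updated with the exact expected REINFORCE gradient E_i[r_i * grad log pi(i)],
# so the run is deterministic. The class with the best reward dominates.

import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def expected_reinforce_step(logits, rewards, lr=0.5):
    probs = softmax(logits)
    n = len(logits)
    grad = [sum(probs[i] * rewards[i] * ((1.0 if i == j else 0.0) - probs[j])
                for i in range(n))
            for j in range(n)]
    return [l + lr * g for l, g in zip(logits, grad)]

logits = [0.0, 0.0, 0.0]
rewards = [0.1, 0.1, 1.0]   # made-up per-class improvement signals
for _ in range(50):
    logits = expected_reinforce_step(logits, rewards)
probs = softmax(logits)
```

After training, the policy concentrates its probability on the class whose selection yields the largest improvement, while classes with identical rewards stay exactly tied, a minimal cartoon of "focusing on the most informative classes at each stage".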
[CV-29] Refining time-space traffic diagrams: A neighborhood-adaptive linear regression method
【速读】: This paper addresses the low resolution of existing time-space (TS) traffic diagrams, caused by limited monitoring precision and sampling frequency, which hampers traffic-theory research and engineering applications. The key to the solution is introducing the concept of neighborhood embedding: adaptively identifying local neighborhoods similar to the target cell and fitting the low-to-high-resolution mapping within those neighborhoods, which avoids the over-smoothing tendency of traditional global linear models and better captures local characteristics such as traffic-wave propagation and congestion evolution. The method requires only a small amount of paired high- and low-resolution training data and features a concise formulation, strong generalization, and robustness, offering a low-cost route to fine-grained refinement of low-sampling-rate traffic data.
链接: https://arxiv.org/abs/2603.24312
作者: Zhihong Yao,Yi Yu,Yunxia Wu,Hao Li,Yangsheng Jiang,Zhengbing He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The time-space (TS) traffic diagram serves as a crucial tool for characterizing the dynamic evolution of traffic flow, with its resolution directly influencing the effectiveness of traffic theory research and engineering applications. However, constrained by monitoring precision and sampling frequency, existing TS traffic diagrams commonly suffer from low resolution. To address this issue, this paper proposes a refinement method for TS traffic diagrams based on neighborhood-adaptive linear regression. Introducing the concept of neighborhood embedding into TS diagram refinement, the method leverages local pattern similarity in TS diagrams, adaptively identifies neighborhoods similar to target cells, and fits the low-to-high resolution mapping within these neighborhoods for refinement. It avoids the over-smoothing tendency of the traditional global linear model, allows the capture of unique traffic wave propagation and congestion evolution characteristics, and outperforms the traditional neighborhood embedding method in terms of local information utilization to achieve target cell refinement. Validation on two real datasets across multiple scales and upscaling factors shows that, compared to benchmark methods, the proposed method achieves improvements of 9.16%, 8.16%, 1.86%, 3.89%, and 5.83% in metrics including MAE, MAPE, CMJS, SSIM, and GMSD, respectively. Furthermore, the proposed method exhibits strong generalization and robustness in cross-day and cross-scenario validations. In summary, requiring only a minimal amount of paired high- and low-resolution training data, the proposed method features a concise formulation, providing a foundation for the low-cost, fine-grained refinement of low-sampling-rate traffic data.
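The core estimator, fitting the low-to-high-resolution mapping only on a neighborhood of similar cells, can be reduced to a 1-d sketch: pick the k training cells with the most similar low-resolution value and fit a local line. This is not the paper's exact estimator (which works on TS-diagram cells with an adaptive similarity criterion); the toy data below just shows why local fits beat one global line on piecewise traffic regimes.

```python
# Neighborhood-adaptive local linear regression, 1-d toy version:
# fit y = a*x + b on the k nearest training points, predict at the query.

def local_linear_predict(x_query, xs, ys, k=3):
    """Least-squares line on the k nearest xs, evaluated at x_query."""
    nbrs = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x_query))[:k]
    mx = sum(xs[i] for i in nbrs) / k
    my = sum(ys[i] for i in nbrs) / k
    var = sum((xs[i] - mx) ** 2 for i in nbrs)
    cov = sum((xs[i] - mx) * (ys[i] - my) for i in nbrs)
    a = cov / var if var > 0 else 0.0
    return a * (x_query - mx) + my

# Two regimes (congested vs. free flow) that a single global line cannot fit:
xs = [10, 20, 30, 80, 90, 100]        # coarse-cell speeds (toy units)
ys = [12, 24, 36, 79, 88, 97]         # matching fine-cell speeds
pred_congested = local_linear_predict(15, xs, ys)
pred_freeflow = local_linear_predict(95, xs, ys)
```

Each query is refined using only the regime it actually belongs to (slope 1.2 in the congested neighborhood, 0.9 in free flow), which is the over-smoothing-avoidance argument in miniature.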
[CV-30] AMIF: Authorizable Medical Image Fusion Model with Built-in Authentication
【速读】: This paper addresses the risk of intellectual-property (IP) leakage during inference in current multimodal medical image fusion models: unauthorized users can steal model knowledge and sensitive training data through fusion outputs and reverse-engineering techniques such as model distillation. The key to the solution is AMIF (Authorizable Medical Image Fusion), the first fusion model with built-in authentication, which integrates authorization-based access control into the image fusion objective: for unauthorized usage, AMIF embeds explicit, visible copyright identifiers into the fusion results, whereas high-quality fused outputs become accessible only after successful key-based authentication, thereby effectively protecting the model's intellectual property.
链接: https://arxiv.org/abs/2603.24296
作者: Jie Song,Jun Jia,Wei Sun,Wangqiu Zhou,Tao Tan,Guangtao Zhai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal image fusion enables precise lesion localization and characterization for accurate diagnosis, thereby strengthening clinical decision-making and driving its growing prominence in medical imaging research. A powerful multimodal image fusion model relies on high-quality, clinically representative multimodal training data and a rigorously engineered model architecture. Therefore, the development of such professional radiomics models represents a collaborative achievement grounded in standardized acquisition, clinical-specific expertise, and algorithmic design proficiency, which necessitates protection of associated intellectual property rights. However, current multimodal image fusion models generate fused outputs without built-in mechanisms to safeguard intellectual property rights, inadvertently exposing proprietary model knowledge and sensitive training data through inference leakage. For example, malicious users can exploit fusion outputs and model distillation or other inference-based reverse engineering techniques to approximate the fusion performance of proprietary models. To address this issue, we propose AMIF, the first Authorizable Medical Image Fusion model with built-in authentication, which integrates authorization access control into the image fusion objective. For unauthorized usage, AMIF embeds explicit and visible copyright identifiers into fusion results. In contrast, high-quality fusion results are accessible upon successful key-based authentication.
[CV-31] RS-SSM: Refining Forgotten Specifics in State Space Model for Video Semantic Segmentation CVPR2026
【速读】:该论文旨在解决状态空间模型(State Space Model, SSM)在视频语义分割(Video Semantic Segmentation, VSS)任务中因固定大小状态空间导致特定时空细节遗忘的问题,从而限制了像素级时序一致性建模的能力。解决方案的关键在于提出一种“精炼特定信息的状态空间模型”(Refining Specifics State Space Model, RS-SSM),其核心机制包括两个组成部分:一是通道感知幅度感知器(Channel-wise Amplitude Perceptron, CwAP),用于提取并对齐状态空间中特定信息的分布特征;二是遗忘门信息精炼器(Forgetting Gate Information Refiner, FGIR),根据特定信息分布自适应地反转并精炼状态空间模型中的遗忘门矩阵,从而互补性地恢复压缩过程中丢失的细节信息,显著提升模型在像素级时空分割上的表现。
链接: https://arxiv.org/abs/2603.24295
作者: Kai Zhu,Zhenyu Cui,Zehua Zang,Jiahuan Zhou
机构: Wangxuan Institute of Computer Technology, Peking University (北京大学王选计算机研究所); Tsinghua University (清华大学); Institute of Software Chinese Academy of Sciences (中国科学院软件研究所); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Recently, state space models have demonstrated efficient video segmentation through linear-complexity state space compression. However, Video Semantic Segmentation (VSS) requires pixel-level spatiotemporal modeling capabilities to maintain temporal consistency in segmentation of semantic objects. While state space models can preserve common semantic information during state space compression, the fixed-size state space inevitably forgets specific information, which limits the models’ capability for pixel-level segmentation. To tackle the above issue, we propose a Refining Specifics State Space Model approach (RS-SSM) for video semantic segmentation, which performs complementary refining of forgotten spatiotemporal specifics. Specifically, a Channel-wise Amplitude Perceptron (CwAP) is designed to extract and align the distribution characteristics of specific information in the state space. Besides, a Forgetting Gate Information Refiner (FGIR) is proposed to adaptively invert and refine the forgetting gate matrix in the state space model based on the specific information distribution. Consequently, our RS-SSM leverages the inverted forgetting gate to complementarily refine the specific information forgotten during state space compression, thereby enhancing the model’s capability for spatiotemporal pixel-level segmentation. Extensive experiments on four VSS benchmarks demonstrate that our RS-SSM achieves state-of-the-art performance while maintaining high computational efficiency. The code is available at this https URL.
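为直观理解"反转遗忘门"的补偿思路,下面给出一个对角 SSM 递推的极简 NumPy 示意:遗忘门为 a_t 时,被 (1 − a_t) 加权丢弃的旧状态可视为压缩中遗忘的"特定信息"。此处的函数名与公式均为示意性假设,并非 RS-SSM 论文的实际实现:

```python
import numpy as np

def ssm_with_refinement(x, a, b):
    """Toy diagonal state-space scan with a forgetting gate `a`.

    h_t = a_t * h_{t-1} + b_t * x_t   # standard gated recurrence
    r_t = (1 - a_t) * h_{t-1}         # inverted gate: the portion forgotten
    """
    T = len(x)
    h = np.zeros(T)
    r = np.zeros(T)
    prev = 0.0
    for t in range(T):
        r[t] = (1.0 - a[t]) * prev        # specifics dropped by compression
        h[t] = a[t] * prev + b[t] * x[t]  # compressed state
        prev = h[t]
    return h, r

x = np.array([1.0, 2.0, 3.0])
a = np.array([0.5, 0.5, 0.5])
b = np.ones(3)
h, r = ssm_with_refinement(x, a, b)  # h = [1.0, 2.5, 4.25]
```

真实模型中,FGIR 会依据特定信息分布自适应地学习并精炼这一反转项,而非简单取 1 − a_t。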
[CV-32] VERIA: Verification-Centric Multimodal Instance Augmentation for Long-Tailed 3D Object Detection
【速读】:该论文旨在解决自动驾驶中3D感知任务因训练数据长尾分布(long-tail distribution)导致的稀有类别(rare classes)检测性能下降问题,特别是这些类别在类内多样性高但样本覆盖不足。解决方案的关键在于提出VERIA(Visual-Enhanced RGB-LiDAR Instance Augmentation),一个以图像为先的多模态增强框架,通过调用现成的基础模型(foundation models)合成同步的RGB–LiDAR实例,并采用顺序语义与几何验证机制筛选高质量样本,从而在保持真实LiDAR统计特性的同时扩展类内变化范围,提升稀有类别的检测鲁棒性。
链接: https://arxiv.org/abs/2603.24294
作者: Jumin Lee,Siyeong Lee,Namil Kim,Sung-Eui Yoon
机构: KAIST(韩国科学技术院); Naver Labs(NAVER实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Long-tail distributions in driving datasets pose a fundamental challenge for 3D perception, as rare classes exhibit substantial intra-class diversity yet available samples cover this variation space only sparsely. Existing instance augmentation methods based on copy-paste or asset libraries improve rare-class exposure but are often limited in fine-grained diversity and scene-context placement. We propose VERIA, an image-first multimodal augmentation framework that synthesizes synchronized RGB–LiDAR instances using off-the-shelf foundation models and curates them with sequential semantic and geometric verification. This verification-centric design tends to select instances that better match real LiDAR statistics while spanning a wider range of intra-class variation. Stage-wise yield decomposition provides a log-based diagnostic of pipeline reliability. On nuScenes and Lyft, VERIA improves rare-class 3D object detection in both LiDAR-only and multimodal settings. Our code is available at this https URL.
[CV-33] TopoMesh: High-Fidelity Mesh Autoencoding via Topological Unification
【速读】:该论文旨在解决现有变分自编码器(VAE)在高保真三维生成任务中因目标网格(GT meshes)与预测网格拓扑结构不一致而导致的重建质量受限问题。具体而言,传统VAE通常输出固定结构的隐式场(如规则网格上的符号距离函数SDF),而真实网格具有任意且可变的拓扑结构,这种表示失配使得难以建立显式的顶点和面级对应关系,从而迫使先前方法依赖间接监督信号(如SDF或渲染损失),导致锐利几何特征难以保留。解决方案的关键在于提出TopoMesh,一种基于稀疏体素的VAE架构,其核心创新是引入统一的Dual Marching Cubes(DMC)拓扑框架:通过一个保留锐边的L∞距离度量重采样算法将任意输入网格转换为DMC兼容表示,同时使解码器输出相同格式的网格,从而确保预测与目标网格共享完全一致的拓扑结构,实现顶点、面层级的显式监督信号及其梯度传播,显著提升几何细节(尤其是锐利特征)的重建保真度。
链接: https://arxiv.org/abs/2603.24278
作者: Guan Luo,Xiu Li,Rui Chen,Xuanyu Yi,Jing Lin,Chia-Hao Chen,Jiahang Liu,Song-Hai Zhang,Jianfeng Zhang
机构: Tsinghua University (清华大学); ByteDance Seed (字节跳动种子); HKUST (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The dominant paradigm for high-fidelity 3D generation relies on a VAE-Diffusion pipeline, where the VAE’s reconstruction capability sets a firm upper bound on generation quality. A fundamental challenge limiting existing VAEs is the representation mismatch between ground-truth meshes and network predictions: GT meshes have arbitrary, variable topology, while VAEs typically predict fixed-structure implicit fields (e.g., SDF on regular grids). This inherent misalignment prevents establishing explicit mesh-level correspondences, forcing prior work to rely on indirect supervision signals such as SDF or rendering losses. Consequently, fine geometric details, particularly sharp features, are poorly preserved during reconstruction. To address this, we introduce TopoMesh, a sparse voxel-based VAE that unifies both GT and predicted meshes under a shared Dual Marching Cubes (DMC) topological framework. Specifically, we convert arbitrary input meshes into DMC-compliant representations via a remeshing algorithm that preserves sharp edges using an L∞ distance metric. Our decoder outputs meshes in the same DMC format, ensuring that both predicted and target meshes share identical topological structures. This establishes explicit correspondences at the vertex and face level, allowing us to derive explicit mesh-level supervision signals for topology, vertex positions, and face orientations with clear gradients. Our sparse VAE architecture employs this unified framework and is trained with Teacher Forcing and progressive resolution training for stable and efficient convergence. Extensive experiments demonstrate that TopoMesh significantly outperforms existing VAEs in reconstruction fidelity, achieving superior preservation of sharp features and geometric details.
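文中提到的 L∞ 距离即切比雪夫距离:取各坐标差绝对值的最大值,常用于保留轴对齐的锐利特征。下面仅示意该度量本身的定义(重网格化流程远比这个辅助函数复杂):

```python
import numpy as np

def linf_dist(p, q):
    # Chebyshev / L-infinity distance: max absolute coordinate difference
    return np.max(np.abs(np.asarray(p, float) - np.asarray(q, float)))

d = linf_dist([0.0, 0.0, 0.0], [1.0, 0.2, -0.5])  # -> 1.0
```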
[CV-34] ScrollScape: Unlocking 32K Image Generation With Video Diffusion Priors
【速读】:该论文旨在解决扩散模型在生成极端长宽比(Extreme Aspect Ratio, EAR)图像时出现的结构崩溃问题,如物体重复和空间信息局限等,其根本原因在于缺乏鲁棒的空间先验知识。解决方案的关键在于提出ScrollScape框架,通过将EAR图像合成重构为连续视频生成过程:一方面利用视频模型固有的时间一致性作为全局结构约束以保障长程空间连贯性;另一方面引入扫描位置编码(Scanning Positional Encoding, ScanPE)实现全局坐标在帧间的分布,模拟灵活移动摄像机视角,并结合滚动超分辨率(Scrolling Super-Resolution, ScrollSR)机制借助视频超分辨率先验突破内存瓶颈,从而高效生成高达32K分辨率的图像。
链接: https://arxiv.org/abs/2603.24270
作者: Haodong Yu,Yabo Zhang,Donglin Di,Ruyi Zhang,Wangmeng Zuo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While diffusion models excel at generating images with conventional dimensions, pushing them to synthesize ultra-high-resolution imagery at extreme aspect ratios (EAR) often triggers catastrophic structural failures, such as object repetition and spatial disarray. This limitation fundamentally stems from a lack of robust spatial priors, as static text-to-image models are primarily trained on image distributions with conventional dimensions. To overcome this bottleneck, we present ScrollScape, a novel framework that reformulates EAR image synthesis into a continuous video generation process through two core designs. By mapping the spatial expansion of a massive canvas to the temporal evolution of video frames, ScrollScape leverages the inherent temporal consistency of video models as a powerful global constraint to ensure long-range structural coherence. Meanwhile, Scanning Positional Encoding (ScanPE) distributes global coordinates across frames to act as a flexible moving camera, while Scrolling Super-Resolution (ScrollSR) leverages video super-resolution priors to circumvent memory bottlenecks, efficiently scaling outputs to an unprecedented 32K resolution. Fine-tuned on a curated 3K multi-ratio image dataset, ScrollScape effectively aligns pre-trained video priors with the EAR generation task. Extensive evaluations demonstrate that it significantly outperforms existing image-diffusion baselines by eliminating severe localized artifacts. Consequently, our method overcomes inherent structural bottlenecks to ensure exceptional global coherence and visual fidelity across diverse domains at extreme scales.
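ScanPE 的核心想法是把一张超宽画布的全局坐标按帧分发,等效于一台横向滚动的相机。下面是这一坐标分发思路的玩具化示意(帧宽、步长等均为假设参数,论文的实际编码方式可能不同):

```python
import numpy as np

def scanning_positions(num_frames, frame_width, stride):
    """Global horizontal coordinates for each video frame, emulating a
    camera scrolling across one wide canvas."""
    return np.stack([np.arange(frame_width) + f * stride
                     for f in range(num_frames)])

pos = scanning_positions(num_frames=3, frame_width=4, stride=2)
# frame 0 covers columns 0..3, frame 1 covers 2..5, frame 2 covers 4..7
```

相邻帧的坐标区间有重叠,因此视频模型的时间一致性约束会被转化为画布上的空间一致性约束。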
[CV-35] Accelerating Diffusion-based Video Editing via Heterogeneous Caching: Beyond Full Computing at Sampled Denoising Timestep CVPR2026
【速读】:该论文旨在解决扩散模型(Diffusion Models)在视频编辑任务中计算成本高昂的问题,尤其是基于扩散的视频生成与编辑方法(如DiT模型)因迭代去噪过程导致的高延迟和高浮点运算量(FLOPs),限制了其实际部署。现有加速方法主要依赖于时间步级特征复用,但忽略了DiT架构内部的空间-时间token之间存在的冗余注意力操作,这些操作对输出贡献微弱。解决方案的关键在于提出HetCache——一种无需训练的加速框架,通过分析扩散过程中不同token类型的上下文相关性和交互强度,结合空间先验将时空token划分为“上下文”和“生成”两类,并选择性缓存与生成token关联最强、语义最具有代表性的上下文token,从而减少冗余注意力计算,同时保持编辑一致性与保真度。实验表明,该方法可在显著降低延迟(2.67×加速)和FLOPs的同时,几乎不损失编辑质量。
链接: https://arxiv.org/abs/2603.24260
作者: Tianyi Liu,Ye Lu,Linfeng Zhang,Chen Cai,Jianjun Gao,Yi Wang,Kim-Hui Yap,Lap-Pui Chau
机构: Nanyang Technological University (南洋理工大学); Shanghai Jiao Tong University (上海交通大学); The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 6 figures, accepted by CVPR2026
Abstract:Diffusion-based video editing has emerged as an important paradigm for high-quality and flexible content generation. However, despite their generality and strong modeling capacity, Diffusion Transformers (DiT) remain computationally expensive due to the iterative denoising process, posing challenges for practical deployment. Existing video diffusion acceleration methods primarily exploit denoising timestep-level feature reuse, which mitigates the redundancy in the denoising process, but overlooks the architectural redundancy within the DiT: many attention operations over spatio-temporal tokens are redundantly executed, offering little to no incremental contribution to the model output. This work introduces HetCache, a training-free diffusion acceleration framework designed to exploit the inherent heterogeneity in diffusion-based masked video-to-video (MV2V) generation and editing. Instead of uniformly reusing or randomly sampling tokens, HetCache assesses the contextual relevance and interaction strength among various types of tokens in designated computing steps. Guided by spatial priors, it divides the spatial-temporal tokens in the DiT model into context and generative tokens, and selectively caches the context tokens that exhibit the strongest correlation and most representative semantics with generative ones. This strategy reduces redundant attention operations while maintaining editing consistency and fidelity. Experiments show that HetCache achieves a noticeable acceleration, including a 2.67× latency speedup and FLOPs reduction over commonly used foundation models, with negligible degradation in editing quality.
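HetCache 将时空 token 划分为上下文与生成两类,并只缓存与生成 token 关联最强的上下文 token。下面用一个玩具化的相关性打分示意这一选择过程(打分准则为假设的点积相似度,真实方法的评分与缓存机制更复杂):

```python
import numpy as np

def select_context_cache(tokens, gen_mask, k):
    """Pick the k context tokens most correlated with the generative
    tokens' mean feature, and cache only those."""
    gen_mean = tokens[gen_mask].mean(axis=0)
    ctx_idx = np.where(~gen_mask)[0]
    scores = tokens[ctx_idx] @ gen_mean          # toy relevance score
    keep = ctx_idx[np.argsort(scores)[::-1][:k]]  # top-k context tokens
    return np.sort(keep)

rng = np.random.default_rng(2)
tok = rng.normal(size=(10, 6))
mask = np.zeros(10, bool)
mask[:4] = True                      # first 4 tokens are generative
cached = select_context_cache(tok, mask, k=3)
```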
[CV-36] Memory-Augmented Vision-Language Agents for Persistent and Semantically Consistent Object Captioning
【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在不同视角下对同一物体生成不一致描述的问题,这限制了具身智能体在长时间序列中构建稳定语义表征的能力。现有方法通常依赖离线多视角聚合或多阶段流水线,分离探索、数据关联与标题生成过程,难以有效推理已观测对象。本文提出一种统一的记忆增强型视觉-语言代理框架,其核心创新在于将数据关联、物体标题生成与探索策略整合进单一自回归架构中,通过处理当前RGB图像、自顶向下探索地图以及序列化的物体级情景记忆(episodic memory)作为物体级token,实现跨时间的物体身份持久性和语义一致性。训练方面采用基于分歧的探索策略和伪标题生成模型,在逼真3D环境中收集数据以强制多视角标题历史的一致性,从而显著提升captioning评分(最高+11.86%)和标题自相似性(+7.39%),同时通过紧凑场景表示实现可扩展性能。
链接: https://arxiv.org/abs/2603.24257
作者: Tommaso Galliena,Stefano Rosa,Tommaso Apicella,Pietro Morerio,Alessio Del Bue,Lorenzo Natale
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 7 figures, 7 tables (including Supplementary Materials)
Abstract:Vision-Language Models (VLMs) often yield inconsistent descriptions of the same object across viewpoints, hindering the ability of embodied agents to construct consistent semantic representations over time. Previous methods resolved inconsistencies using offline multi-view aggregation or multi-stage pipelines that decouple exploration, data association, and caption learning, with limited capacity to reason over previously observed objects. In this paper, we introduce a unified, memory-augmented Vision-Language agent that simultaneously handles data association, object captioning, and exploration policy within a single autoregressive framework. The model processes the current RGB observation, a top-down explored map, and an object-level episodic memory serialized into object-level tokens, ensuring persistent object identity and semantic consistency across extended sequences. To train the model in a self-supervised manner, we collect a dataset in photorealistic 3D environments using a disagreement-based policy and a pseudo-captioning model that enforces consistency across multi-view caption histories. Extensive evaluation on a manually annotated object-level test set, demonstrate improvements of up to +11.86% in standard captioning scores and +7.39% in caption self-similarity over baseline models, while enabling scalable performance through a compact scene representation. Code, model weights, and data are available at this https URL
[CV-37] B-MoE: A Body-Part-Aware Mixture-of-Experts “All Parts Matter” Approach to Micro-Action Recognition
【速读】:该论文旨在解决当前动作识别模型难以准确识别微动作(micro-actions)的问题,这类动作具有短暂持续时间、低幅度和高类别模糊性等特点,且在社交场景中蕴含丰富语义信息。解决方案的关键在于提出B-MoE框架——一种基于身体部位感知的专家混合(Body-part-aware Mixture-of-Experts)方法,其核心创新包括:1)每个专家专注于特定身体区域(头、躯干、上肢、下肢),并采用轻量级宏观-微观运动编码器(Macro-Micro Motion Encoder, M3E)以捕捉长程上下文结构与细粒度局部运动;2)通过交叉注意力路由机制学习不同身体区域间的关联,并动态选择最相关的区域用于微动作分类;3)引入双流编码器融合区域特异性语义线索与全局运动特征,从而联合建模空间局部化线索与时间上细微变化。实验表明,该方法在MA-52、SocialGesture和MPII-GroupInteraction三个挑战性基准上均取得显著性能提升,尤其在模糊、低频和低幅类别的识别中表现突出。
链接: https://arxiv.org/abs/2603.24245
作者: Nishit Poddar,Aglind Reka,Diana-Laura Borza,Snehashis Majhi,Michal Balazia,Abhijit Das,Francois Bremond
机构: INRIA; Côte d’Azur University; Birla Institute of Technology Science, Hyderabad
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Micro-actions, fleeting and low-amplitude motions, such as glances, nods, or minor posture shifts, carry rich social meaning but remain difficult for current action recognition models to recognize due to their subtlety, short duration, and high inter-class ambiguity. In this paper, we introduce B-MoE, a Body-part-aware Mixture-of-Experts framework designed to explicitly model the structured nature of human motion. In B-MoE, each expert specializes in a distinct body region (head, body, upper limbs, lower limbs), and is based on the lightweight Macro-Micro Motion Encoder (M3E) that captures long-range contextual structure and fine-grained local motion. A cross-attention routing mechanism learns inter-region relationships and dynamically selects the most informative regions for each micro-action. B-MoE uses a dual-stream encoder that fuses these region-specific semantic cues with global motion features to jointly capture spatially localized cues and temporally subtle variations that characterize micro-actions. Experiments on three challenging benchmarks (MA-52, SocialGesture, and MPII-GroupInteraction) show consistent state-of-the-art gains, with improvements in ambiguous, underrepresented, and low-amplitude classes.
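B-MoE 的路由机制可以理解为:用一个查询向量对四个身体区域专家做 softmax 注意力,再按权重融合各专家输出。以下为该机制的最小化示意(真实路由器是可学习模块,这里的随机特征与维度仅作演示):

```python
import numpy as np

def route_experts(query, region_keys, region_feats):
    """Softmax cross-attention routing over body-part experts."""
    logits = region_keys @ query          # one score per region
    w = np.exp(logits - logits.max())
    w = w / w.sum()                       # attention weights over regions
    fused = (w[:, None] * region_feats).sum(axis=0)
    return w, fused

rng = np.random.default_rng(0)
q = rng.normal(size=8)
keys = rng.normal(size=(4, 8))    # head, body, upper limbs, lower limbs
feats = rng.normal(size=(4, 16))  # each expert's output feature
w, fused = route_experts(q, keys, feats)
```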
[CV-38] InstanceRSR: Real-World Super-Resolution via Instance-Aware Representation Alignment ICASSP2026
【速读】:该论文旨在解决现有基于生成先验的现实世界超分辨率(Real-World Super-Resolution, RSR)方法在复杂场景中难以恢复多样目标实例的细粒度细节问题。其核心挑战在于,常用去噪损失(如均方误差 MSE)倾向于保证全局一致性,而忽视了实例级别的感知与重建。解决方案的关键在于提出 InstanceRSR 框架,通过联合建模语义信息并引入实例级特征对齐机制:首先利用低分辨率(Low-Resolution, LR)图像提供全局一致性引导,同时联合图像数据与语义分割图以在采样过程中强化语义相关性;其次设计实例表示学习模块,将扩散潜在空间与实例潜在空间对齐,实现实例感知的特征对齐,并进一步引入尺度对齐机制以增强细粒度感知和细节恢复能力。这一系列设计使模型在保持实例级语义一致性的同时,生成更逼真的细节,显著优于现有方法,在多个真实世界基准上达到新的最先进(State-of-the-Art, SOTA)性能。
链接: https://arxiv.org/abs/2603.24240
作者: Zixin Guo,Kai Zhao,Luyan Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, 4 figures, 2 tables. Accepted by ICASSP 2026
Abstract:Existing real-world super-resolution (RSR) methods based on generative priors have achieved remarkable progress in producing high-quality and globally consistent reconstructions. However, they often struggle to recover fine-grained details of diverse object instances in complex real-world scenes. This limitation primarily arises because commonly adopted denoising losses (e.g., MSE) inherently favor global consistency while neglecting instance-level perception and restoration. To address this issue, we propose InstanceRSR, a novel RSR framework that jointly models semantic information and introduces instance-level feature alignment. Specifically, we employ low-resolution (LR) images as global consistency guidance while jointly modeling image data and semantic segmentation maps to enforce semantic relevance during sampling. Moreover, we design an instance representation learning module to align the diffusion latent space with the instance latent space, enabling instance-aware feature alignment, and further incorporate a scale alignment mechanism to enhance fine-grained perception and detail recovery. Benefiting from these designs, our approach not only generates photorealistic details but also preserves semantic consistency at the instance level. Extensive experiments on multiple real-world benchmarks demonstrate that InstanceRSR significantly outperforms existing methods in both quantitative metrics and visual quality, achieving new state-of-the-art (SOTA) performance.
[CV-39] Attack Assessment and Augmented Identity Recognition for Human Skeleton Data
【速读】:该论文旨在解决小样本场景下基于LiDAR骨架数据的人体身份识别模型(HCN-ID)在面对未见过的对抗攻击时鲁棒性不足的问题。现有方法如AAIRS虽能提升模型性能,但未评估或增强其对对抗攻击的防御能力;而传统基于扰动的攻击生成方式受限于真实训练样本,难以有效用于小数据集的模型免疫训练。解决方案的关键在于提出Attack-AAIRS框架,该框架结合少量真实数据与GAN生成的合成数据,通过学习对抗攻击样本的分布来生成高质量攻击样本,并将其用于模型训练以实现免疫(inoculation),从而在不牺牲真实测试准确率的前提下显著提升模型对多种未见对抗攻击(如FGSM、PGD、MI-FGSM等)的鲁棒性。
链接: https://arxiv.org/abs/2603.24232
作者: Joseph G. Zalameda,Megan A. Witherow,Alexander M. Glandon,Jose Aguilera,Khan M. Iftekharuddin
机构: 未知
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 9 figures, 3 tables
Abstract:Machine learning models trained on small data sets for security applications are especially vulnerable to adversarial attacks. Person identification from LiDAR based skeleton data requires time consuming and expensive data acquisition for each subject identity. Recently, Assessment and Augmented Identity Recognition for Skeletons (AAIRS) has been used to train Hierarchical Co-occurrence Networks for Person Identification (HCN-ID) with small LiDAR based skeleton data sets. However, AAIRS does not evaluate robustness of HCN-ID to adversarial attacks or inoculate the model to defend against such attacks. Popular perturbation-based approaches to generating adversarial attacks are constrained to targeted perturbations added to real training samples, which is not ideal for inoculating models with small training sets. Thus, we propose Attack-AAIRS, a novel addition to the AAIRS framework. Attack-AAIRS leverages a small real data set and a GAN generated synthetic data set to assess and improve model robustness against unseen adversarial attacks. Rather than being constrained to perturbations of limited real training samples, the GAN learns the distribution of adversarial attack samples that exploit weaknesses in HCN-ID. Attack samples drawn from this distribution augment training for inoculation of the HCN-ID to improve robustness. Ten-fold cross validation of Attack-AAIRS yields increased robustness to unseen attacks, including FGSM, PGD, Additive Gaussian Noise, MI-FGSM, and BIM. The HCN-ID Synthetic Data Quality Score for Attack-AAIRS indicates that generated attack samples are of similar quality to the original benign synthetic samples generated by AAIRS. Furthermore, inoculated models show consistent final test accuracy with the original model trained on real data, demonstrating that our method improves robustness to adversarial attacks without reducing test performance on real data.
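文中评估的 FGSM 是最基础的对抗攻击之一:沿损失梯度的符号方向加一步扰动。其教科书形式如下(示例中的线性打分函数与梯度均为人为构造,并非论文代码):

```python
import numpy as np

def fgsm(x, grad, eps):
    """Fast Gradient Sign Method: x_adv = x + eps * sign(grad_x loss)."""
    return x + eps * np.sign(grad)

# toy example: linear score s = w.x, so the gradient w.r.t. x is w
w = np.array([0.5, -2.0, 0.0])
x = np.zeros(3)
x_adv = fgsm(x, grad=w, eps=0.1)  # -> [0.1, -0.1, 0.0]
```

PGD、BIM 等其余被测攻击可视为该单步更新的迭代/投影变体。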
[CV-40] RVLM: Recursive Vision-Language Models with Adaptive Depth
【速读】:该论文旨在解决医学人工智能(Medical AI)系统面临的两大核心问题:一是传统视觉-语言模型(Vision-Language Models, VLMs)采用单次推理机制,导致预测结果缺乏可审计性和临床可解释性;二是迭代推理系统通常依赖固定迭代预算,在简单任务上浪费计算资源,而在复杂任务中又难以提供足够的推理深度。解决方案的关键在于提出一个统一框架,包含两个创新组件:其一为RVLM(Recursive Vision-Language Model),通过引入“生成-执行”循环机制,使模型在每一步生成可执行的Python代码并调用视觉子代理处理图像,从而将诊断结论与可追溯的代码逻辑绑定,满足临床AI治理框架对审计的要求;其二为RRouter,一种轻量级控制器,基于任务复杂度特征动态预测最优迭代预算,并实时监控推理进度,在推理停滞时提前终止,实现计算资源的高效分配。
链接: https://arxiv.org/abs/2603.24224
作者: Nicanor Mayumu,Zeenath Khan,Melodena Stephens,Patrick Mukala,Farhad Oroumchian
机构: University of Wollongong in Dubai (迪拜伍伦贡大学); Mohammed Bin Rashid School of Government (穆罕默德·本·拉希德政府学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Medical AI systems face two fundamental limitations. First, conventional vision-language models (VLMs) perform single-pass inference, yielding black-box predictions that cannot be audited or explained in clinical terms. Second, iterative reasoning systems that expose intermediate steps rely on fixed iteration budgets wasting compute on simple cases while providing insufficient depth for complex ones. We address both limitations with a unified framework. RVLM replaces single-pass inference with an iterative generate-execute loop: at each step, the model writes Python code, invokes vision sub-agents, manipulates images, and accumulates evidence. Every diagnostic claim is grounded in executable code, satisfying auditability requirements of clinical AI governance frameworks. RRouter makes iteration depth adaptive: a lightweight controller predicts the optimal budget from task-complexity features, then monitors progress and terminates early when reasoning stalls. We evaluate on BraTS 2023 Meningioma (brain MRI) and MIMIC-CXR (chest X-ray) using Gemini 2.5 Flash without fine-tuning. Across repeated runs, RVLM shows high consistency on salient findings (e.g., mass presence and enhancement) and can detect cross-modal discrepancies between Fluid-Attenuated Inversion Recovery (FLAIR) signal characteristics and segmentation boundaries. On MIMIC-CXR, it generates structured reports and correctly recognises view-specific artefacts. Code: this https URL.
[CV-41] HEART-PFL: Stable Personalized Federated Learning under Heterogeneity with Hierarchical Directional Alignment and Adversarial Knowledge Transfer WACV2026
【速读】:该论文旨在解决个性化联邦学习(Personalized Federated Learning, PFL)中因客户端数据分布异构性导致的模型性能下降问题,尤其是现有方法存在的原型对齐浅层化和服务器端蒸馏不稳定的缺陷。其核心解决方案是提出一种双侧增强框架HEART-PFL,关键创新在于:(i) 采用分层方向对齐(Hierarchical Directional Alignment, HDA),在浅层使用余弦相似度进行方向对齐、深层采用均方误差(MSE)匹配以保留客户端特异性;(ii) 引入对抗知识迁移(Adversarial Knowledge Transfer, AKT),通过干净与对抗代理数据上的对称KL散度蒸馏稳定全局更新。该设计显著提升了个性化准确率与系统鲁棒性,在多个非独立同分布(Non-IID)数据集上达到SOTA性能,且仅需1.46M可训练参数的轻量适配器即可实现高效部署。
链接: https://arxiv.org/abs/2603.24209
作者: Minjun Kim,Minje Kim
机构: Promedius Inc.
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at WACV 2026. 8 pages, 7 figures, 3 tables
Abstract:Personalized Federated Learning (PFL) aims to deliver effective client-specific models under heterogeneous distributions, yet existing methods suffer from shallow prototype alignment and brittle server-side distillation. We propose HEART-PFL, a dual-sided framework that (i) performs depth-aware Hierarchical Directional Alignment (HDA) using cosine similarity in the early stage and MSE matching in the deep stage to preserve client specificity, and (ii) stabilizes global updates through Adversarial Knowledge Transfer (AKT) with symmetric KL distillation on clean and adversarial proxy data. Using lightweight adapters with only 1.46M trainable parameters, HEART-PFL achieves state-of-the-art personalized accuracy on CIFAR-100, Flowers-102, and Caltech-101 (63.42%, 84.23%, and 95.67%, respectively) under Dirichlet non-IID partitions, and remains robust to out-of-domain proxy data. Ablation studies further confirm that HDA and AKT provide complementary gains in alignment, robustness, and optimization stability, offering insights into how the two components mutually reinforce effective personalization. Overall, these results demonstrate that HEART-PFL simultaneously enhances personalization and global stability, highlighting its potential as a strong and scalable solution for PFL(code available at this https URL).
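AKT 使用的对称 KL 蒸馏损失可写为 0.5·(KL(p‖q) + KL(q‖p)),在干净与对抗代理数据上同时施加。下面是该损失的直接实现示意(logits 数值为演示用假设):

```python
import numpy as np

def sym_kl(p_logits, q_logits):
    """Symmetric KL divergence between two softmax distributions."""
    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()
    p, q = softmax(p_logits), softmax(q_logits)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * (kl(p, q) + kl(q, p))

loss = sym_kl(np.array([2.0, 0.0]), np.array([0.0, 2.0]))
```

对称化保证了蒸馏信号对"教师/学生"方向不敏感,这是其相对单向 KL 更稳定的一个直观原因。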
[CV-42] Powerful Teachers Matter: Text-Guided Multi-view Knowledge Distillation with Visual Prior Enhancement
【速读】:该论文旨在解决知识蒸馏(Knowledge Distillation)中教师模型知识质量不足的问题,现有方法多聚焦于蒸馏策略优化,却忽视了提升教师模型本身的知识表达能力。其解决方案的关键在于提出文本引导的多视角知识蒸馏(Text-guided Multi-view Knowledge Distillation, TMKD),通过引入双模态教师模型——视觉教师和文本教师(CLIP)——提供更丰富的监督信号:视觉教师利用多视角输入融合边缘与高频特征等视觉先验信息以增强表征能力,文本教师则通过先验感知提示生成语义权重,实现自适应特征融合;同时引入视觉-语言对比正则化机制,强化学生模型中的语义知识学习。实验表明,该方法在五个基准数据集上可将蒸馏性能提升最高达4.49%。
链接: https://arxiv.org/abs/2603.24208
作者: Xin Zhang,Jianyang Xu,Hao Peng,Dongjing Wang,Jingyuan Zheng,Yu Li,Yuyu Yin,Hongbo Wang
机构: Hangzhou Dianzi University (杭州电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 6 figures
Abstract:Knowledge distillation transfers knowledge from large teacher models to smaller students for efficient inference. While existing methods primarily focus on distillation strategies, they often overlook the importance of enhancing teacher knowledge quality. In this paper, we propose Text-guided Multi-view Knowledge Distillation (TMKD), which leverages dual-modality teachers, a visual teacher and a text teacher (CLIP), to provide richer supervisory signals. Specifically, we enhance the visual teacher with multi-view inputs incorporating visual priors (edge and high-frequency features), while the text teacher generates semantic weights through prior-aware prompts to guide adaptive feature fusion. Additionally, we introduce vision-language contrastive regularization to strengthen semantic knowledge in the student model. Extensive experiments on five benchmarks demonstrate that TMKD consistently improves knowledge distillation performance by up to 4.49%, validating the effectiveness of our dual-teacher multi-view enhancement strategy. Code is available at this https URL.
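TMKD 中文本教师生成的语义权重用于自适应融合多视角特征(如 RGB、边缘、高频)。下面用归一化加权求和示意这一融合步骤(权重数值为占位假设,并非 CLIP 的真实输出):

```python
import numpy as np

def weighted_view_fusion(view_feats, weights):
    """Fuse multi-view features with normalized semantic weights."""
    w = np.asarray(weights, float)
    w = w / w.sum()                     # normalize to a convex combination
    return np.tensordot(w, np.asarray(view_feats), axes=1)

# three views: e.g. RGB, edge map, high-frequency map (all 4-dim here)
views = [np.ones(4), 2 * np.ones(4), 3 * np.ones(4)]
fused = weighted_view_fusion(views, [1.0, 1.0, 2.0])  # -> [2.25, 2.25, 2.25, 2.25]
```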
[CV-43] RefReward-SR: LR-Conditioned Reward Modeling for Preference-Aligned Super-Resolution
【速读】:该论文旨在解决生成式超分辨率(Generative Super-Resolution, GSR)中评估与优化框架与人类感知不一致的问题,即现有全参考(Full-Reference)和无参考(No-Reference)指标难以准确反映人类对图像真实感和语义合理性的偏好,且多数方法依赖于真实标签(Ground-Truth, GT)的分布匹配,无法有效映射至人类判断。其解决方案的关键在于提出 RefReward-SR——一种基于低分辨率(Low-Resolution, LR)输入条件的参考感知奖励模型,通过将LR图像作为语义锚点,利用多模态大语言模型(Multimodal Large Language Model, MLLM)的视觉-语言先验,以推理驱动的方式评估高分辨率(High-Resolution, HR)重建结果的语义一致性与合理性;同时构建首个大规模LR条件偏好数据集RefSR-18K,并采用分组相对策略优化(Group Relative Policy Optimization, GRPO)对MLLM进行微调,进而将该奖励信号整合进SR模型训练流程,从而显著提升生成结果在人类偏好上的对齐度,兼顾语义保真、感知真实性与视觉自然性。
链接: https://arxiv.org/abs/2603.24198
作者: Yushuai Song,Weize Quan,Weining Wang,Jiahui Sun,Jing Liu,Meng Li,Pengbin Yu,Zhentao Chen,Wei Shen,Lunxi Yuan,Dong-ming Yan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in generative super-resolution (SR) have greatly improved visual realism, yet existing evaluation and optimization frameworks remain misaligned with human perception. Full-Reference and No-Reference metrics often fail to reflect perceptual preference, either penalizing semantically plausible details due to pixel misalignment or favoring visually sharp but inconsistent artifacts. Moreover, most SR methods rely on ground-truth (GT)-dependent distribution matching, which does not necessarily correspond to human judgments. In this work, we propose RefReward-SR, a low-resolution (LR) reference-aware reward model for preference-aligned SR. Instead of relying on GT supervision or NR evaluation, RefReward-SR assesses high-resolution (HR) reconstructions conditioned on their LR inputs, treating the LR image as a semantic anchor. Leveraging the visual-linguistic priors of a Multimodal Large Language Model (MLLM), it evaluates semantic consistency and plausibility in a reasoning-aware manner. To support this paradigm, we construct RefSR-18K, the first large-scale LR-conditioned preference dataset for SR, providing pairwise rankings based on LR-HR consistency and HR naturalness. We fine-tune the MLLM with Group Relative Policy Optimization (GRPO) using LR-conditioned ranking rewards, and further integrate GRPO into SR model training with RefReward-SR as the core reward signal for preference-aligned generation. Extensive experiments show that our framework achieves substantially better alignment with human judgments, producing reconstructions that preserve semantic consistency while enhancing perceptual plausibility and visual naturalness. Code, models, and datasets will be released upon paper acceptance.
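GRPO 的关键是组内相对优势:每个采样输出的奖励减去同组均值、再除以组内标准差,无需价值网络。下面是这一标准归一化的最小实现(奖励数值为演示用假设):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled output's reward
    against the mean and std of its own sampling group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# one group of 4 sampled SR outputs scored by the reward model
adv = grpo_advantages([0.2, 0.8, 0.5, 0.5])
```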
[CV-44] Unlocking Few-Shot Capabilities in LVLMs via Prompt Conditioning and Head Selection
【速读】:该论文旨在解决当前大型视觉语言模型(Large Vision Language Models, LVLMs)在图像分类任务中表现不佳的问题,尽管它们通常采用CLIP预训练的视觉编码器。研究表明,LVLMs的性能瓶颈并非源于其架构限制,而是由于CLIP中视觉与文本编码器分离导致的类别名称匹配偏差,而非联合视觉-文本推理能力的缺失。解决方案的关键在于利用LVLM内部表示,特别是注意力头(attention heads)的判别性特征,提出一种无需训练的“头集成分类器”(Head Ensemble Classifiers, HEC),通过类间判别分析(Gaussian Discriminant Analysis)筛选并组合最具区分度的视觉和文本注意力头,从而显著提升零样本和少样本图像分类性能,在12个数据集上达到最先进水平。
链接: https://arxiv.org/abs/2603.24181
作者: Adhemar de Senneville,Xavier Bou,Jérémy Anger,Rafael Grompone,Gabriele Facciolo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Current Large Vision Language Models (LVLMs) excel at many zero-shot tasks like image captioning, visual question answering and OCR. However, these same models suffer from poor performance at image classification tasks, underperforming against CLIP-based methods. Notably, this gap is surprising because many LVLMs use CLIP-pretrained vision encoders. Yet LVLMs are not inherently limited by CLIP’s architecture with independent vision and text encoders. In CLIP, this separation biases classification toward class-name matching rather than joint visual-text reasoning. In this paper we show that, despite their poor raw performance, LVLMs can improve visual feature class separability at inference using prompt conditioning, and LVLMs’ internal representations, especially attention heads, can outperform the model itself at zero-shot and few-shot classification. We introduce Head Ensemble Classifiers (HEC) to bridge the performance gap between CLIP-based and LVLM-based classification methods. Inspired by Gaussian Discriminant Analysis, HEC ranks the most discriminative vision and text heads and combines them into a training-free classifier. We show that HEC achieves state-of-the-art performance in few-shot and zero-shot classification across 12 datasets.
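HEC 受高斯判别分析启发,对注意力头按判别力排序。可以用一个 Fisher 风格的可分性分数来示意这种排序依据:类间均值距离除以类内方差(论文的具体打分方式可能不同,此处仅为思路示意):

```python
import numpy as np

def head_separability(feats, labels):
    """Fisher-style separability of one head's features:
    between-class mean distance over pooled within-class variance."""
    classes = np.unique(labels)
    means = np.stack([feats[labels == c].mean(0) for c in classes])
    within = np.mean([feats[labels == c].var(0).mean() for c in classes])
    between = np.mean([np.linalg.norm(means[i] - means[j]) ** 2
                       for i in range(len(classes))
                       for j in range(i + 1, len(classes))])
    return between / (within + 1e-8)

rng = np.random.default_rng(1)
y = np.repeat([0, 1], 20)
good = np.concatenate([rng.normal(0, .1, (20, 4)),   # class 0 cluster
                       rng.normal(3, .1, (20, 4))])  # class 1 cluster
bad = rng.normal(0, 1, (40, 4))                      # uninformative head
# the well-separated head should score far above the noisy one
```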
[CV-45] Heuristic-inspired Reasoning Priors Facilitate Data-Efficient Referring Object Detection CVPR2026
【速读】:该论文旨在解决生成式 AI (Generative AI) 领域中,参考对象检测(Referring Object Detection, ROD)模型在数据稀缺场景下训练效率低下的问题。现有端到端的接地检测器在标签稀缺时需从零学习空间与语义结构,导致样本浪费。其解决方案的关键在于提出一种轻量、模型无关的框架 HeROD(Heuristic-inspired ROD),通过引入基于指代表达显式启发式推理先验(heuristic-inspired spatial and semantic reasoning priors),将可解释的信号注入现代 DETR 类流水线的三个阶段:候选排序、预测融合和匈牙利匹配。这些先验在训练和推理过程中引导模型聚焦于合理候选对象,从而显著提升标签效率与收敛性能,在 RefCOCO、RefCOCO+ 和 RefCOCOg 数据集上均优于强基线模型。
链接: https://arxiv.org/abs/2603.24166
作者: Xu Zhang,Zhe Chen,Jing Zhang,Dacheng Tao
机构: The University of Sydney (悉尼大学); La Trobe University (拉特罗布大学); Wuhan University (武汉大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR2026
Abstract:Most referring object detection (ROD) models, especially the modern grounding detectors, are designed for data-rich conditions, yet many practical deployments, such as robotics, augmented reality, and other specialized domains, would face severe label scarcity. In such regimes, end-to-end grounding detectors need to learn spatial and semantic structure from scratch, wasting precious samples. We ask a simple question: Can explicit reasoning priors help models learn more efficiently when data is scarce? To explore this, we first introduce a Data-efficient Referring Object Detection (De-ROD) task, which is a benchmark protocol for measuring ROD performance in low-data and few-shot settings. We then propose the HeROD (Heuristic-inspired ROD), a lightweight, model-agnostic framework that injects explicit, heuristic-inspired spatial and semantic reasoning priors, which are interpretable signals derived based on the referring phrase, into 3 stages of a modern DETR-style pipeline: proposal ranking, prediction fusion, and Hungarian matching. By biasing both training and inference toward plausible candidates, these priors promise to improve label efficiency and convergence performance. On RefCOCO, RefCOCO+, and RefCOCOg, HeROD consistently outperforms strong grounding baselines in scarce-label regimes. More broadly, our results suggest that integrating simple, interpretable reasoning priors provides a practical and extensible path toward better data-efficient vision-language understanding.
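HeROD 将启发式先验注入匈牙利匹配阶段的思路,可用如下极简示意说明(假设性示例:成本矩阵与先验的具体形式均为虚构,论文实际在 DETR 流水线中使用学习型匹配代价;小规模下用穷举排列代替匈牙利算法):

```python
import numpy as np
from itertools import permutations

def prior_biased_matching(cost, prior, lam=1.0):
    """Hungarian matching biased by heuristic reasoning priors (sketch).

    cost:  (N, N) base prediction-to-target matching cost
    prior: (N, N) plausibility of each (prediction, target) pair under
           spatial/semantic heuristics derived from the referring
           phrase; higher = more plausible (illustrative form only)
    lam:   strength of the prior bias

    Brute force over permutations stands in for the Hungarian
    algorithm; fine for the tiny N used here.
    """
    biased = cost - lam * prior
    n = cost.shape[0]
    best, best_val = None, np.inf
    for perm in permutations(range(n)):
        val = sum(biased[i, perm[i]] for i in range(n))
        if val < best_val:
            best, best_val = list(perm), val
    return best
```

当标注稀缺、基础匹配代价不可靠时,先验偏置项可以把匹配结果拉向语言描述上更合理的候选。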
[CV-46] CarePilot: A Multi-Agent Framework for Long-Horizon Computer Task Automation in Healthcare CVPR2026
【速读】:该论文旨在解决长周期、高复杂度的医疗领域自动化任务中,现有视觉语言模型(Vision-Language Models, VLMs)在多步骤推理与跨系统交互能力上的不足问题。当前研究多集中于短周期或通用场景(如桌面或移动界面),而对医疗信息系统(如电子病历系统、DICOM查看器等)中的长期任务自动化探索有限。解决方案的关键在于提出CarePilot——一个基于Actor-Critic框架的多智能体系统,其核心创新为:Actor模块融合工具定位与双记忆机制(长期和短期经验),以从视觉界面和系统状态中预测下一步语义操作;Critic模块则评估动作效果并动态更新记忆,通过迭代式代理模拟实现推理增强型策略学习,从而显著提升在医疗工作流中的鲁棒性和准确性。
链接: https://arxiv.org/abs/2603.24157
作者: Akash Ghosh,Tajamul Ashraf,Rishu Kumar Singh,Numan Saeed,Sriparna Saha,Xiuying Chen,Salman Khan
机构: Indian Institute of Technology Patna (印度理工学院巴特那分校); Mohamed bin Zayed University of AI (MBZUAI) (穆罕默德·本·扎耶德人工智能大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 Findings
Abstract:Multimodal agentic pipelines are transforming human-computer interaction by enabling efficient and accessible automation of complex, real-world tasks. However, recent efforts have focused on short-horizon or general-purpose applications (e.g., mobile or desktop interfaces), leaving long-horizon automation for domain-specific systems, particularly in healthcare, largely unexplored. To address this, we introduce CareFlow, a high-quality human-annotated benchmark comprising complex, long-horizon software workflows across medical annotation tools, DICOM viewers, EHR systems, and laboratory information systems. On this benchmark, existing vision-language models (VLMs) perform poorly, struggling with long-horizon reasoning and multi-step interactions in medical contexts. To overcome this, we propose CarePilot, a multi-agent framework based on the actor-critic paradigm. The Actor integrates tool grounding with dual-memory mechanisms (long-term and short-term experience) to predict the next semantic action from the visual interface and system state. The Critic evaluates each action, updates memory based on observed effects, and either executes or provides corrective feedback to refine the workflow. Through iterative agentic simulation, the Actor learns to perform more robust and reasoning-aware predictions during inference. Our experiments show that CarePilot achieves state-of-the-art performance, outperforming strong closed-source and open-source multimodal baselines by approximately 15.26% and 3.38%, respectively, on our benchmark and out-of-distribution dataset.
[CV-47] A convergent Plug-and-Play Majorization-Minimization algorithm for Poisson inverse problems
【速读】:该论文旨在解决泊松逆问题(Poisson inverse problems)的重建难题,这类问题常见于医学成像如断层扫描(tomography)和图像去卷积(deconvolution)等场景,其特点是观测数据服从泊松分布且噪声水平较高。解决方案的关键在于提出一种新颖的变分“即插即用”(variational plug-and-play)算法,通过最小化一个显式泛函实现:该泛函由Kullback-Leibler(KL)数据保真项与基于预训练神经网络的正则化项组成。该方法创新性地结合了经典似然最大化理论与基于梯度的去噪器(gradient-based denoisers),允许直接使用预训练的高斯去噪器(Gaussian denoisers)而不损失收敛性保证,其优化框架采用上界最小化(majorization-minimization, MM)策略,确保迭代过程收敛至驻点。实验表明,该方法在中等噪声下达到当前最优性能,在高噪声条件下优势更为显著,特别适用于核医学成像等噪声敏感场景。
链接: https://arxiv.org/abs/2603.24156
作者: Thibaut Modrzyk(CREATIS),Ane Etxebeste(CREATIS),Élie Bretin(ICJ, MMCS),Voichita Maxim(CREATIS)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In this paper, we present a novel variational plug-and-play algorithm for Poisson inverse problems. Our approach minimizes an explicit functional which is the sum of a Kullback-Leibler data fidelity term and a regularization term based on a pre-trained neural network. By combining classical likelihood maximization methods with recent advances in gradient-based denoisers, we allow the use of pre-trained Gaussian denoisers without sacrificing convergence guarantees. The algorithm is formulated in the majorization-minimization framework, which guarantees convergence to a stationary point. Numerical experiments confirm state-of-the-art performance in deconvolution and tomography under moderate noise, and demonstrate clear superiority in high-noise conditions, making this method particularly valuable for nuclear medicine applications.
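摘要中 KL(泊松)保真项的 MM(上界最小化)迭代,在不加正则项时退化为经典的 Richardson-Lucy 乘性更新。以下一维示意仅展示这一数据保真部分;论文的完整算法还叠加了基于预训练网络的正则化步骤,此处省略:

```python
import numpy as np

def richardson_lucy(y, psf, n_iter=50, eps=1e-12):
    """MM iteration for the Kullback-Leibler (Poisson) data term.

    Each step majorizes the KL divergence and minimizes the surrogate,
    which yields the classical multiplicative Richardson-Lucy update.
    The paper adds a network-based regularizer on top of this backbone;
    this sketch shows only the data-fidelity part (1-D, for clarity).
    """
    x = np.full_like(y, y.mean())          # flat nonnegative init
    psf_flip = psf[::-1]                   # adjoint of the convolution
    for _ in range(n_iter):
        Ax = np.convolve(x, psf, mode="same") + eps
        x = x * np.convolve(y / Ax, psf_flip, mode="same")
    return x
```

乘性更新天然保持非负性,这正是泊松逆问题(计数数据)所需要的性质。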
[CV-48] LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds CVPR2026
【速读】:该论文旨在解决开放词汇三维场景理解(open-vocabulary 3D scene understanding)中现有方法存在的效率低下、内存占用高及流程复杂的问题,这些问题主要源于迭代优化和密集的每个高斯点特征分配。其解决方案的关键在于提出LightSplat框架,该框架通过从多视角图像中向3D表示注入紧凑的2字节语义索引(semantic indices),仅在显著区域分配语义索引,并借助轻量级索引-特征映射机制避免昂贵的特征优化与存储开销;同时,利用单步聚类实现几何与语义相关的掩码在3D空间中的高效关联,从而保障语义一致性并提升推理效率。
链接: https://arxiv.org/abs/2603.24146
作者: Jaehun Bang,Jinhyeok Kim,Minji Kim,Seungheon Jeong,Kyungdon Joo
机构: AIGS, UNIST; GSAI, POSTECH
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Open-vocabulary 3D scene understanding enables users to segment novel objects in complex 3D environments through natural language. However, existing approaches remain slow, memory-intensive, and overly complex due to iterative optimization and dense per-Gaussian feature assignments. To address this, we propose LightSplat, a fast and memory-efficient training-free framework that injects compact 2-byte semantic indices into 3D representations from multi-view images. By assigning semantic indices only to salient regions and managing them with a lightweight index-feature mapping, LightSplat eliminates costly feature optimization and storage overhead. We further ensure semantic consistency and efficient inference via single-step clustering that links geometrically and semantically related masks in 3D. We evaluate our method on LERF-OVS, ScanNet, and DL3DV-OVS across complex indoor-outdoor scenes. As a result, LightSplat achieves state-of-the-art performance with up to 50-400x speedup and 64x lower memory, enabling scalable language-driven 3D understanding. For more details, visit our project page this https URL.
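摘要中“2 字节语义索引 + 轻量索引-特征映射”的存储思路可粗略示意如下(表结构、显著性规则与匹配阈值均为本示意的假设,论文实际实现见其项目页):

```python
import numpy as np

class SemanticIndexMap:
    """Compact per-Gaussian semantics via 2-byte indices (sketch).

    Instead of storing a high-dimensional feature per Gaussian, each
    salient Gaussian stores a uint16 index into a small table of
    mask-level features; index 0 means "no semantics". Table layout
    and the saliency rule are assumptions for illustration.
    """
    def __init__(self, feat_dim):
        self.table = [np.zeros(feat_dim)]          # slot 0: null entry

    def register(self, feature):
        self.table.append(feature / np.linalg.norm(feature))
        return np.uint16(len(self.table) - 1)

    def query(self, indices, text_feat, thresh=0.8):
        """Boolean mask of Gaussians matching the text query feature."""
        feats = np.stack(self.table)               # (K, D) small table
        sims = feats @ (text_feat / np.linalg.norm(text_feat))
        sims[0] = -1.0                             # null never matches
        return sims[indices] > thresh
```

查询时只需对小表计算相似度,再按索引广播回每个高斯点,从而避免为每个高斯点存储和优化高维特征。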
[CV-49] Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection CVPR2026
【速读】:该论文旨在解决深度伪造检测(deepfake detection)中标准监督训练方法因对所有样本赋予均匀重要性而导致特征学习不够鲁棒和泛化能力不足的问题。其解决方案的关键在于提出一种基于强化学习的“教师-学生”动态课程学习框架(Tutor-Student Reinforcement Learning, TSRL),其中“教师”代理(Tutor)通过Proximal Policy Optimization (PPO) 算法动态调整每个训练样本的损失权重,依据包含视觉特征与历史学习动态(如EMA损失和遗忘次数)的状态表示进行决策;“学生”代理(Student)即检测模型,在教师引导下优化训练过程,奖励机制设计为鼓励从错误预测到正确预测的转变,从而自动识别并优先学习高价值样本(如难但可学样本),显著提升模型在未见伪造技术下的泛化性能。
链接: https://arxiv.org/abs/2603.24139
作者: Zhanhe Lei,Zhongyuan Wang,Jikang Cheng,Baojin Huang,Yuhong Yang,Zhen Han,Chao Liang,Dengpan Ye
机构: Wuhan University (武汉大学); Peking University (北京大学); Huazhong Agricultural University (华中农业大学); Guangzhou University (广州大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to CVPR 2026
Abstract:Standard supervised training for deepfake detection treats all samples with uniform importance, which can be suboptimal for learning robust and generalizable features. In this work, we propose a novel Tutor-Student Reinforcement Learning (TSRL) framework to dynamically optimize the training curriculum. Our method models the training process as a Markov Decision Process where a "Tutor" agent learns to guide a "Student" (the deepfake detector). The Tutor, implemented as a Proximal Policy Optimization (PPO) agent, observes a rich state representation for each training sample, encapsulating not only its visual features but also its historical learning dynamics, such as EMA loss and forgetting counts. Based on this state, the Tutor takes an action by assigning a continuous weight (0-1) to the sample’s loss, thereby dynamically re-weighting the training batch. The Tutor is rewarded based on the Student’s immediate performance change, specifically rewarding transitions from incorrect to correct predictions. This strategy encourages the Tutor to learn a curriculum that prioritizes high-value samples, such as hard-but-learnable examples, leading to a more efficient and effective training process. We demonstrate that this adaptive curriculum improves the Student’s generalization capabilities against unseen manipulation techniques compared to traditional training methods. Code is available at this https URL.
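教师对批内样本损失动态加权的接口可示意如下。论文中教师策略由 PPO 学习,这里用固定启发式代替;状态量(EMA 损失、遗忘次数)也仅为演示,并非论文的完整状态表示:

```python
import numpy as np

def tutor_weights(state):
    """Toy stand-in for the PPO Tutor: map each sample's state
    (here just [EMA loss, forgetting count], invented for this sketch)
    to a loss weight in (0, 1), favouring hard-but-learnable samples.
    The paper LEARNS this policy with PPO; a fixed heuristic is used
    here only to illustrate the re-weighting interface."""
    ema_loss, forget_count = state[:, 0], state[:, 1]
    score = ema_loss - 0.5 * forget_count     # hard but not hopeless
    return 1.0 / (1.0 + np.exp(-score))       # squash to (0, 1)

def weighted_batch_loss(per_sample_loss, weights):
    # Tutor action: re-weight the Student's batch loss.
    return np.sum(weights * per_sample_loss) / (np.sum(weights) + 1e-8)

def tutor_reward(correct_before, correct_after):
    # Reward incorrect -> correct transitions, as in the abstract.
    flipped = (~correct_before) & correct_after
    return flipped.mean()
```

奖励只统计“由错变对”的样本比例,因此教师被鼓励把权重分配给当前最有学习价值的样本。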
[CV-50] Spectral Scalpel: Amplifying Adjacent Action Discrepancy via Frequency-Selective Filtering for Skeleton-Based Action Segmentation CVPR
【速读】:该论文旨在解决骨架动作分割(Skeleton-based Temporal Action Segmentation, STAS)中因相邻动作间时空模式区分度不足而导致的类别判别能力弱和分割边界模糊的问题。解决方案的关键在于提出一种频域选择性滤波框架——Spectral Scalpel,其通过自适应多尺度频域滤波器(adaptive multi-scale spectral filters)作为“手术刀”,抑制相邻不同动作间的共享频率成分,同时增强各自特有的动作频率特征,从而提升动作间的差异性并锐化动作转换边界;此外,引入频域感知通道混合器(frequency-aware channel mixer)以聚合跨通道频谱信息,强化通道演化过程,实现从时域到频域的建模扩展,显著改善了STAS任务的性能表现。
链接: https://arxiv.org/abs/2603.24134
作者: Haoyu Ji,Bowen Chen,Zhihao Yang,Wenze Huang,Yu Gao,Xueting Liu,Weihong Ren,Zhiyong Wang,Honghai Liu
机构: Harbin Institute of Technology, Shenzhen; Shenzhen HT Intelligent Control Co., Ltd.; Southern University of Science and Technology; Southeast University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR Conference
Abstract:Skeleton-based Temporal Action Segmentation (STAS) seeks to densely segment and classify diverse actions within long, untrimmed skeletal motion sequences. However, existing STAS methodologies face challenges of limited inter-class discriminability and blurred segmentation boundaries, primarily due to insufficient distinction of spatio-temporal patterns between adjacent actions. To address these limitations, we propose Spectral Scalpel, a frequency-selective filtering framework aimed at suppressing shared frequency components between adjacent distinct actions while amplifying their action-specific frequencies, thereby enhancing inter-action discrepancies and sharpening transition boundaries. Specifically, Spectral Scalpel employs adaptive multi-scale spectral filters as scalpels to edit frequency spectra, coupled with a discrepancy loss between adjacent actions serving as the surgical objective. This design amplifies representational disparities between neighboring actions, effectively mitigating boundary localization ambiguities and inter-class confusion. Furthermore, complementing long-term temporal modeling, we introduce a frequency-aware channel mixer to strengthen channel evolution by aggregating spectra across channels. This work presents a novel paradigm for STAS that extends conventional spatio-temporal modeling by incorporating frequency-domain analysis. Extensive experiments on five public datasets demonstrate that Spectral Scalpel achieves state-of-the-art performance. Code is available at this https URL.
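“频域手术刀”的核心操作——沿时间轴做 rFFT、按频段掩蔽、再逆变换——可用如下玩具信号演示。论文中的滤波器是自适应、多尺度、可学习的,此处用手工二值掩码代替,仅展示“抑制共享频率、保留动作特有频率”这一机制:

```python
import numpy as np

def spectral_filter(x, keep):
    """Frequency-selective filtering along the temporal axis.

    x:    (T,) motion signal for one joint channel
    keep: boolean mask over rFFT bins; True bins are kept, False bins
          (e.g. components shared by adjacent actions) are suppressed.
    The paper learns multi-scale filters; a hand-picked binary mask is
    used here purely for illustration.
    """
    X = np.fft.rfft(x)
    return np.fft.irfft(X * keep, n=len(x))

T = 256
t = np.arange(T) / T
shared = np.sin(2 * np.pi * 2 * t)           # slow drift common to both actions
specific = 0.5 * np.sin(2 * np.pi * 20 * t)  # action-specific oscillation
x = shared + specific

keep = np.zeros(T // 2 + 1, dtype=bool)
keep[15:26] = True                           # retain only the band around bin 20
y = spectral_filter(x, keep)
```

滤波后的信号只保留动作特有成分,相邻动作间的表示差异因此被放大。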
[CV-51] Reservoir-Based Graph Convolutional Networks
【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在处理复杂或动态数据时面临的两大挑战:一是深层网络中长距离依赖建模困难,导致计算成本高;二是过平滑(over-smoothing)问题,即随着层数加深,节点嵌入趋于相似而丧失区分能力。针对这些问题,论文提出RGC-Net(Reservoir-based Graph Convolutional Network),其核心解决方案在于将储层计算(reservoir computing)机制与结构化图卷积相结合,通过固定随机的储层权重和漏积分器(leaky integrator)设计,实现稳定的信息传播与特征保留,从而在不增加参数调优负担的前提下提升模型对多跳邻域信息的聚合能力,并显著改善分类与生成任务中的收敛速度和过平滑现象。
链接: https://arxiv.org/abs/2603.24131
作者: Mayssa Soussia,Gita Ayu Salsabila,Mohamed Ali Mahjoub,Islem Rekik
机构: National Engineering School of Sousse, University of Sousse, LATIS – Laboratory of Advanced Technology and Intelligent Systems (国家工程学院苏塞,苏塞大学,高级技术和智能系统实验室); BASIRA Lab, Imperial-X and Department of Computing, Imperial College London (BASIRA 实验室,帝国理工学院-X 和计算系,帝国理工学院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Message passing is a core mechanism in Graph Neural Networks (GNNs), enabling the iterative update of node embeddings by aggregating information from neighboring nodes. Graph Convolutional Networks (GCNs) exemplify this approach by adapting convolutional operations for graph structures, allowing features from adjacent nodes to be combined effectively. However, GCNs encounter challenges with complex or dynamic data. Capturing long-range dependencies often requires deeper layers, which not only increase computational costs but also lead to over-smoothing, where node embeddings become indistinguishable. To overcome these challenges, reservoir computing has been integrated into GNNs, leveraging iterative message-passing dynamics for stable information propagation without extensive parameter tuning. Despite its promise, existing reservoir-based models lack structured convolutional mechanisms, limiting their ability to accurately aggregate multi-hop neighborhood information. To address these limitations, we propose RGC-Net (Reservoir-based Graph Convolutional Network), which integrates reservoir dynamics with structured graph convolution. Key contributions include: (i) a reimagined convolutional framework with fixed random reservoir weights and a leaky integrator to enhance feature retention; (ii) a robust, adaptable model for graph classification; and (iii) an RGC-Net-powered transformer for graph generation with application to dynamic brain connectivity. Extensive experiments show that RGC-Net achieves state-of-the-art performance in classification and generative tasks, including brain graph evolution, with faster convergence and reduced over-smoothing. Source code is available at this https URL .
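“固定随机储层权重 + 漏积分器”的单步图卷积更新大致如下(对称归一化与更新式的具体形式为本示意的假设,与论文实现未必一致):

```python
import numpy as np

def rgc_layer(A, X, W_res, h_prev, alpha=0.5):
    """One reservoir graph-convolution step (sketch).

    A:      (N, N) adjacency with self-loops
    X:      (N, F) node features
    W_res:  (F, H) FIXED random reservoir weights (never trained)
    h_prev: (N, H) previous reservoir state
    alpha:  leak rate of the leaky integrator

    h = (1 - alpha) * h_prev + alpha * tanh(D^{-1/2} A D^{-1/2} X W_res)
    mirrors the idea of GCN-style propagation with untrained random
    weights; only a readout layer would be trained on top.
    """
    d = A.sum(axis=1)                       # degrees (>= 1 with self-loops)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A @ D_inv_sqrt     # symmetric normalisation
    return (1 - alpha) * h_prev + alpha * np.tanh(A_hat @ X @ W_res)
```

由于储层权重固定,反复调用该层只演化状态而无需反向传播,这正是储层计算降低训练开销的来源。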
[CV-52] Combi-CAM: A Novel Multi-Layer Approach for Explainable Image Geolocalization
【速读】:该论文旨在解决行星尺度图像地理定位(planet-scale photo geolocalization)中深度学习模型预测过程缺乏可解释性的问题。传统方法通常仅依赖卷积神经网络(CNN)最深层的梯度加权类激活图(gradient-weighted class activation maps, Grad-CAM)来解释模型决策,难以全面揭示不同视觉特征对定位结果的贡献。论文提出Combi-CAM方法,其关键在于融合网络多个层级的Grad-CAM结果,从而更细致地刻画图像中不同区域与地理定位决策之间的关联,显著提升了模型决策的可解释性与洞察力。
链接: https://arxiv.org/abs/2603.24117
作者: David Faget(CB),José Luis Lisani,Miguel Colom(CB, CMLA)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Planet-scale photo geolocalization involves the intricate task of estimating the geographic location depicted in an image purely based on its visual features. While deep learning models, particularly convolutional neural networks (CNNs), have significantly advanced this field, understanding the reasoning behind their predictions remains challenging. In this paper, we present Combi-CAM, a novel method that enhances the explainability of CNN-based geolocalization models by combining gradient-weighted class activation maps obtained from several layers of the network architecture, rather than using only information from the deepest layer as is typically done. This approach provides a more detailed understanding of how different image features contribute to the model’s decisions, offering deeper insights than the traditional approaches.
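多层 Grad-CAM 的融合可按“上采样—归一化—平均”的流程示意如下(摘要未给出具体融合规则,等权平均与最近邻上采样均为本示意的假设;示例假定目标分辨率能被各层分辨率整除):

```python
import numpy as np

def combi_cam(cams, out_hw):
    """Combine class-activation maps from several layers (sketch).

    cams:   list of 2-D arrays, one Grad-CAM per layer, at different
            resolutions (deeper layers are coarser).
    out_hw: (H, W) target resolution; assumed divisible by each cam's
            shape so nearest-neighbour repetition suffices.
    Each map is upsampled, min-max normalised, then averaged. The
    paper's exact fusion rule may differ; equal weighting is an
    illustrative assumption.
    """
    H, W = out_hw
    fused = np.zeros((H, W))
    for cam in cams:
        ry, rx = H // cam.shape[0], W // cam.shape[1]
        up = np.repeat(np.repeat(cam, ry, axis=0), rx, axis=1)
        up = (up - up.min()) / (up.max() - up.min() + 1e-8)
        fused += up
    return fused / len(cams)
```

多层证据一致的区域在融合图中被强化,仅出现在最深层的粗糙响应则被细层信息约束。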
[CV-53] Retinal Layer Segmentation in OCT Images With 2.5D Cross-slice Feature Fusion Module for Glaucoma Assessment
【速读】:该论文旨在解决光学相干断层扫描(Optical Coherence Tomography, OCT)图像中视网膜层分割的切片间不一致性问题,该问题源于传统二维(2D)分割方法缺乏跨相邻B-scan的上下文信息,而三维(3D)方法虽能更好捕捉切片间关联但计算成本高昂。解决方案的关键在于提出一种2.5D分割框架,其核心是引入新颖的跨切片特征融合(Cross-Slice Feature Fusion, CFF)模块,该模块在U-Net类架构中融合不同切片间的特征信息,从而有效增强边界检测的一致性并提升噪声区域的鲁棒性,实现兼顾上下文感知能力与计算效率的高精度视网膜层分割。
链接: https://arxiv.org/abs/2603.24115
作者: Hyunwoo Kim,Heesuk Kim,Wungrak Choi,Jae-Sang Hyun
机构: Yonsei University (延世大学); Severance Hospital (世福兰斯医院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:For accurate glaucoma diagnosis and monitoring, reliable retinal layer segmentation in OCT images is essential. However, existing 2D segmentation methods often suffer from slice-to-slice inconsistencies due to the lack of contextual information across adjacent B-scans. 3D segmentation methods are better for capturing slice-to-slice context, but they require expensive computational resources. To address these limitations, we propose a 2.5D segmentation framework that incorporates a novel cross-slice feature fusion (CFF) module into a U-Net-like architecture. The CFF module fuses inter-slice features to effectively capture contextual information, enabling consistent boundary detection across slices and improved robustness in noisy regions. The framework was validated on both a clinical dataset and the publicly available DUKE DME dataset. Compared to other segmentation methods without the CFF module, the proposed method achieved an 8.56% reduction in mean absolute distance and a 13.92% reduction in root mean square error, demonstrating improved segmentation accuracy and robustness. Overall, the proposed 2.5D framework balances contextual awareness and computational efficiency, enabling anatomically reliable retinal layer delineation for automated glaucoma evaluation and potential clinical applications.
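跨切片特征融合(CFF)的最小骨架——拼接相邻 B-scan 特征、1×1 混合、残差回加——可示意如下(真实 CFF 模块的内部结构更复杂,这里只展示切片间的信息流向):

```python
import numpy as np

def cross_slice_fusion(f_prev, f_curr, f_next, W):
    """Fuse features from adjacent B-scans (2.5D sketch).

    f_*: (C, H, Wd) feature maps of slices i-1, i, i+1
    W:   (C, 3C) learned 1x1 mixing weights across the stacked slices

    Concatenate along channels, mix with a 1x1 convolution (a matrix
    product applied per pixel), and add a residual to the centre slice
    so the module can fall back to plain 2-D behaviour.
    """
    stacked = np.concatenate([f_prev, f_curr, f_next], axis=0)  # (3C, H, Wd)
    C3, H, Wd = stacked.shape
    mixed = (W @ stacked.reshape(C3, -1)).reshape(-1, H, Wd)
    return f_curr + np.maximum(mixed, 0.0)   # residual + ReLU
```

残差设计意味着当混合权重为零时模块退化为恒等映射,只在相邻切片提供有用上下文时才修改中心切片的特征。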
[CV-54] Granular Ball Guided Stable Latent Domain Discovery for Domain-General Crowd Counting
【速读】:该论文旨在解决单源域泛化下的人群计数问题,其核心挑战在于:单一标注源域通常包含异质的潜在域(latent domains),而测试数据可能呈现严重的分布偏移(distribution shift)。传统方法直接对样本级潜在特征进行平坦聚类,易受特征噪声、异常值和表示漂移的影响,导致伪域分配不可靠,削弱了结构化域学习的效果。解决方案的关键在于提出一种粒度球引导的稳定潜在域发现框架,通过将样本组织为紧凑的局部粒度球(granular ball),并以球心作为代表进行聚类,从而将直接样本级聚类转化为分层的基于代表的聚类过程,获得更稳定且语义一致的伪域分配;在此基础上构建双分支学习机制,利用语义码本重编码增强可迁移语义表征,并通过风格分支建模域特定外观变化,降低语义与风格的纠缠,提升在域偏移下的泛化能力。
链接: https://arxiv.org/abs/2603.24106
作者: Fan Chen,Shuyin Xia,Yi Wang,Xinbo Gao
机构: Chongqing Key Laboratory of Computational Intelligence (重庆市计算智能重点实验室); Key Laboratory of Cyberspace Big Data Intelligent Security, Ministry of Education (教育部网络空间大数据智能安全重点实验室); Sichuan-Chongqing Co-construction Key Laboratory of Digital Economy Intelligence (川渝共建数字经济智能重点实验室); Key Laboratory of Big Data Intelligent Computing (大数据智能计算重点实验室); Chongqing University of Posts and Telecommunications (重庆邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Single-source domain generalization for crowd counting remains highly challenging because a single labeled source domain often contains heterogeneous latent domains, while test data may exhibit severe distribution shifts. A fundamental difficulty lies in stable latent domain discovery: directly performing flat clustering on evolving sample-level latent features is easily affected by feature noise, outliers, and representation drift, leading to unreliable pseudo-domain assignments and weakened domain-structured learning. To address this issue, we propose a granular ball guided stable latent domain discovery framework for domain-general crowd counting. Specifically, the proposed method first organizes samples into compact local granular balls and then clusters granular ball centers as representatives to obtain pseudo-domains, transforming direct sample-level clustering into a hierarchical representative-based clustering process. This design yields more stable and semantically consistent pseudo-domain assignments. Built upon the discovered latent domains, we further develop a two-branch learning framework that enhances transferable semantic representations via semantic codebook re-encoding while modeling domain-specific appearance variations through a style branch, thereby reducing semantic–style entanglement and improving generalization under domain shifts. Extensive experiments on ShanghaiTech A/B, UCF_QNRF, and NWPU-Crowd under a strict no-adaptation protocol demonstrate that the proposed method consistently outperforms strong baselines, especially under large domain gaps.
[CV-55] LaDy: Lagrangian-Dynamic Informed Network for Skeleton-based Action Segmentation via Spatial-Temporal Modulation CVPR
【速读】:该论文旨在解决骨架动作分割(Skeleton-based Temporal Action Segmentation, STAS)中因忽视人体运动背后的物理动力学而造成的分类边界模糊与跨类别判别力不足的问题。现有方法虽能捕捉时空运动学特征,但难以区分具有相似运动学模式却蕴含不同动力学意图的动作(如“挥手”与“击打”),且在动态力变化显著的边界处定位精度受限。其解决方案的关键在于提出拉格朗日动力学感知网络(Lagrangian-Dynamic Informed Network, LaDy),通过从关节位置计算广义坐标并估计满足物理约束的拉格朗日项来显式合成广义力;进一步引入能量一致性损失(Energy Consistency Loss)以强制执行功-能定理,使动能变化与合力做功保持一致,从而增强物理合理性;最终利用学习到的动力学信号驱动时空调制模块:空间上融合广义力与空间表征提升语义判别性,时间上构建显著动态信号用于时序门控,显著增强边界感知能力。
链接: https://arxiv.org/abs/2603.24097
作者: Haoyu Ji,Xueting Liu,Yu Gao,Wenze Huang,Zhihao Yang,Weihong Ren,Zhiyong Wang,Honghai Liu
机构: Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳); Southern University of Science and Technology (南方科技大学); Southeast University (东南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR Conference
Abstract:Skeleton-based Temporal Action Segmentation (STAS) aims to densely parse untrimmed skeletal sequences into frame-level action categories. However, existing methods, while proficient at capturing spatio-temporal kinematics, neglect the underlying physical dynamics that govern human motion. This oversight limits inter-class discriminability between actions with similar kinematics but distinct dynamic intents, and hinders precise boundary localization where dynamic force profiles shift. To address these, we propose the Lagrangian-Dynamic Informed Network (LaDy), a framework integrating principles of Lagrangian dynamics into the segmentation process. Specifically, LaDy first computes generalized coordinates from joint positions and then estimates Lagrangian terms under physical constraints to explicitly synthesize the generalized forces. To further ensure physical coherence, our Energy Consistency Loss enforces the work-energy theorem, aligning kinetic energy change with the work done by the net force. The learned dynamics then drive a Spatio-Temporal Modulation module: Spatially, generalized forces are fused with spatial representations to provide more discriminative semantics. Temporally, salient dynamic signals are constructed for temporal gating, thereby significantly enhancing boundary awareness. Experiments on challenging datasets show that LaDy achieves state-of-the-art performance, validating the integration of physical dynamics for action segmentation. Code is available at this https URL.
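能量一致性损失所要求的“动能变化 ≈ 合力做功”(功-能定理),可用有限差分粗略写成如下形式(离散化方式为示意,与论文的实际估计器不同):

```python
import numpy as np

def energy_consistency_loss(q, force, mass=1.0, dt=1.0):
    """Energy Consistency Loss (sketch of the work-energy theorem).

    q:     (T, D) generalized coordinates over time
    force: (T-1, D) predicted net generalized force per interval

    Penalises the gap between the change in kinetic energy and the
    work done by the net force on each step:
        L = mean_t ( dKE_t - F_t . dq_t )^2
    Finite differences stand in for the paper's estimators; small
    discretisation error remains even for physically correct forces.
    """
    dq = np.diff(q, axis=0)                        # (T-1, D) displacements
    v = dq / dt                                    # velocities
    ke = 0.5 * mass * np.sum(v**2, axis=1)         # kinetic energy per step
    d_ke = np.diff(ke)                             # (T-2,)
    work = np.sum(force[1:] * dq[1:], axis=1)      # F . dq, aligned with d_ke
    return np.mean((d_ke - work) ** 2)
```

对物理上自洽的“轨迹-力”对,该损失接近零;对凭空预测的力则显著增大,从而为生成的广义力提供物理约束。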
[CV-56] LGTM: Training-Free Light-Guided Text-to-Image Diffusion Model via Initial Noise Manipulation IJCNN2026
【速读】:该论文旨在解决生成式 AI(Generative AI)在文本到图像扩散模型中对光照条件控制不足的问题。现有方法依赖两阶段流程,在图像生成后进行再光照处理,效率低下且需大量数据微调与计算资源,难以适应新模型和任务。其解决方案的关键在于提出一种无需训练的光照引导文本到图像扩散模型(Light-Guided Text-to-Image Diffusion Model, LGTM),通过操控扩散过程的初始潜在噪声(initial latent noise)来实现光照方向的可控生成,基于潜在空间的通道级分析发现,选择性地操纵特定潜变量通道即可实现细粒度光照控制,而无需修改或微调预训练模型。该方法在保持图像质量和文本一致性的同时显著提升光照一致性,且可无缝集成至ControlNet等结构化控制框架中,展现出良好的通用性和动态调控能力。
链接: https://arxiv.org/abs/2603.24086
作者: Ryugo Morita,Stanislav Frolov,Brian Bernhard Moser,Ko Watanabe,Riku Takahashi,Andreas Dengel
机构: RPTU Kaiserslautern-Landau; DFKI GmbH; Hosei University
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Accepted to IJCNN2026
Abstract:Diffusion models have demonstrated high-quality performance in conditional text-to-image generation, particularly with structural cues such as edges, layouts, and depth. However, lighting conditions have received limited attention and remain difficult to control within the generative process. Existing methods handle lighting through a two-stage pipeline that relights images after generation, which is inefficient. Moreover, they rely on fine-tuning with large datasets and heavy computation, limiting their adaptability to new models and tasks. To address this, we propose a novel Training-Free Light-Guided Text-to-Image Diffusion Model via Initial Noise Manipulation (LGTM), which manipulates the initial latent noise of the diffusion process to guide image generation with text prompts and user-specified light directions. Through a channel-wise analysis of the latent space, we find that selectively manipulating latent channels enables fine-grained lighting control without fine-tuning or modifying the pre-trained model. Extensive experiments show that our method surpasses prompt-based baselines in lighting consistency, while preserving image quality and text alignment. This approach introduces new possibilities for dynamic, user-guided light control. Furthermore, it integrates seamlessly with models like ControlNet, demonstrating adaptability across diverse scenarios.
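“对初始潜变量的特定通道叠加光照方向梯度”这一思路可示意如下。注意:哪些潜通道控制光照是模型相关的、需按论文的通道级分析实证筛选,这里的通道索引仅为占位;坡度形状与强度也只是演示性选择:

```python
import numpy as np

def light_guided_noise(shape, channels, direction, strength=0.3, seed=0):
    """Build an initial diffusion latent with a lighting bias (sketch).

    shape:     (C, H, W) latent shape
    channels:  indices of latent channels to manipulate; WHICH channels
               control lighting is model-specific and found empirically
               in the paper - the indices used here are placeholders.
    direction: "left" or "right" light direction

    A horizontal brightness ramp is blended into the chosen channels of
    standard Gaussian noise; all other channels stay untouched, so the
    pretrained model itself needs no fine-tuning.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(shape)
    C, H, W = shape
    ramp = np.linspace(1.0, -1.0, W)            # bright left, dark right
    if direction == "right":
        ramp = ramp[::-1]
    z[channels] += strength * ramp[None, None, :]
    return z
```

由于只修改初始噪声,该操作与 ControlNet 等结构化控制可以叠加使用,这与摘要所述的可组合性一致。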
[CV-57] When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm CVPR2026
【速读】:该论文旨在解决生成式人工智能(Generative AI)中多模态大语言模型(Multimodal Large Language Models, MLLMs)所引发的新安全风险问题,特别是其在不安全内容生成和虚假图像合成方面的潜在危害。解决方案的关键在于系统性地对比MLLMs与扩散模型(Diffusion Models)在安全性上的差异:研究发现,MLLMs因具备更强的语义理解能力,能够准确解析抽象提示并生成更复杂的不安全图像,且这些图像对现有检测器更具隐蔽性,即使针对MLLMs数据重新训练检测器,仍可通过提供更长、更详细的输入绕过检测。这一发现揭示了当前对MLLMs安全风险的认知不足,为未来构建更鲁棒的安全防护机制提供了关键依据。
链接: https://arxiv.org/abs/2603.24079
作者: Ye Leng,Junjie Chu,Mingjie Li,Chenhao Lin,Chao Shen,Michael Backes,Yun Shen,Yang Zhang
机构: CISPA Helmholtz Center for Information Security; Xi’an Jiaotong University; Flexera
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Accepted by CVPR 2026. 15 pages, 11 figures
Abstract:Recently, multimodal large language models (MLLMs) have emerged as a unified paradigm for language and image generation. Compared with diffusion models, MLLMs possess a much stronger capability for semantic understanding, enabling them to process more complex textual inputs and comprehend richer contextual meanings. However, this enhanced semantic ability may also introduce new and potentially greater safety risks. Taking diffusion models as a reference point, we systematically analyze and compare the safety risks of emerging MLLMs along two dimensions: unsafe content generation and fake image synthesis. Across multiple unsafe generation benchmark datasets, we observe that MLLMs tend to generate more unsafe images than diffusion models. This difference partly arises because diffusion models often fail to interpret abstract prompts, producing corrupted outputs, whereas MLLMs can comprehend these prompts and generate unsafe content. For current advanced fake image detectors, MLLM-generated images are also notably harder to identify. Even when detectors are retrained with MLLMs-specific data, they can still be bypassed by simply providing MLLMs with longer and more descriptive inputs. Our measurements indicate that the emerging safety risks of the cutting-edge generative paradigm, MLLMs, have not been sufficiently recognized, posing new challenges to real-world safety.
[CV-58] PosterIQ: A Design Perspective Benchmark for Poster Understanding and Generation CVPR2026
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在海报理解与生成任务中对视觉设计认知(如构图结构、字体层次、语义意图)建模不足的问题。解决方案的关键在于提出 PosterIQ——一个面向海报理解与生成的设计驱动型基准,涵盖构图结构解析、文本-图像对应关系、字体可读性与感知、设计质量评估及具隐喻能力的可控合成等任务,并提供 7,765 个图像标注实例和 822 个生成提示,以量化评估多模态大模型(MLLMs)与扩散生成器在视觉层次、字体语义、显著性控制及意图传达等方面的性能差距,从而推动模型创造力提升并嵌入以人为本的设计原则。
链接: https://arxiv.org/abs/2603.24078
作者: Yuheng Feng,Wen Zhang,Haodong Duan,Xingxing Zou
机构: The Hong Kong Polytechnic University (香港理工大学); Snapchat Inc. (Snapchat公司); ByteDance Seed (字节跳动种子项目)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026, Project Page: this https URL
Abstract:We present PosterIQ, a design-driven benchmark for poster understanding and generation, annotated across composition structure, typographic hierarchy, and semantic intent. It includes 7,765 image-annotation instances and 822 generation prompts spanning real, professional, and synthetic cases. To bridge visual design cognition and generative modeling, we define tasks for layout parsing, text-image correspondence, typography/readability and font perception, design quality assessment, and controllable, composition-aware generation with metaphor. We evaluate state-of-the-art MLLMs and diffusion-based generators, finding persistent gaps in visual hierarchy, typographic semantics, saliency control, and intention communication; commercial models lead on high-level reasoning but act as insensitive automatic raters, while generators render text well yet struggle with composition-aware synthesis. Extensive analyses show PosterIQ is both a quantitative benchmark and a diagnostic tool for design reasoning, offering reproducible, task-specific metrics. We aim to catalyze models’ creativity and integrate human-centred design principles into generative vision-language systems.
[CV-59] AD-Reasoning: Multimodal Guideline-Guided Reasoning for Alzheimer's Disease Diagnosis ICME2026
【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s disease, AD)诊断中多模态数据融合与临床指南一致性不足的问题,现有模型往往缺乏透明度且未能严格遵循NIA-AA(National Institute on Aging–Alzheimer’s Association)诊断标准。其解决方案的关键在于提出AD-Reasoning框架,该框架通过结构化MRI与六种临床模态的联合编码、双向交叉注意力融合机制,并引入基于规则的验证器和强化学习微调策略,以确保输出格式规范性、指南证据覆盖度以及推理-决策一致性,从而实现可解释、符合临床指南的诊断结果生成。
链接: https://arxiv.org/abs/2603.24059
作者: Qiuhui Chen,Yushan Deng,Xuancheng Yao,Yi Hong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICME 2026
Abstract:Alzheimer’s disease (AD) diagnosis requires integrating neuroimaging with heterogeneous clinical evidence and reasoning under established criteria, yet most multimodal models remain opaque and weakly guideline-aligned. We present AD-Reasoning, a multimodal framework that couples structural MRI with six clinical modalities and a rule-based verifier to generate structured, NIA-AA-consistent diagnoses. AD-Reasoning combines modality-specific encoders, bidirectional cross-attention fusion, and reinforcement fine-tuning with verifiable rewards that enforce output format, guideline evidence coverage, and reasoning–decision consistency. We also release AD-MultiSense, a 10,378-visit multimodal QA dataset with guideline-validated rationales built from ADNI/AIBL. On AD-MultiSense, AD-Reasoning achieves state-of-the-art diagnostic accuracy and produces structured, transparent rationales that improve interpretability over recent baselines.
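摘要中“输出格式—指南证据覆盖—推理/决策一致性”三类可验证奖励,可粗略写成一个规则打分函数(字符串规则、字段名与权重均为演示性假设,并非论文验证器的真实实现):

```python
def guideline_reward(output, required_fields, evidence_terms):
    """Rule-based verifier reward (sketch of the idea in AD-Reasoning).

    Scores a model's structured diagnosis text on the three verifiable
    criteria named in the abstract: output format, guideline-evidence
    coverage, and reasoning-decision consistency. All string rules and
    the 0.3/0.4/0.3 weights are illustrative assumptions.
    """
    # 1) format: every required section header must be present
    fmt = all(f + ":" in output for f in required_fields)
    # 2) coverage: fraction of guideline evidence terms mentioned
    covered = sum(t.lower() in output.lower() for t in evidence_terms)
    coverage = covered / max(len(evidence_terms), 1)
    # 3) toy consistency check: the final decision token must also
    #    appear in the reasoning section
    reasoning, _, decision = output.rpartition("Diagnosis:")
    decision = decision.strip()
    consistent = bool(decision) and decision.split()[0].lower() in reasoning.lower()
    return 0.3 * float(fmt) + 0.4 * coverage + 0.3 * float(consistent)
```

这类奖励完全可程序化验证,因此可直接用于摘要所述的强化微调(RL with verifiable rewards)。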
[CV-60] Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification CVPR2026
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)中存在的对象幻觉(object hallucination)问题,即模型在生成文本时错误地引入不存在的物体或属性,严重影响其在自动驾驶、医学图像分析等高风险场景中的可靠性。解决方案的关键在于识别并量化注意力失衡(attention imbalance)现象——包括模态间(视觉与语言之间)和模态内(单个token之间)的不均衡注意力分配,并提出一种轻量级的解码时干预方法Attention Imbalance Rectification (AIR),通过重新分配注意力权重和调整分布来修正这种失衡,从而显著降低对象幻觉率,同时提升模型在多种视觉语言任务中的整体性能。
链接: https://arxiv.org/abs/2603.24058
作者: Han Sun,Qin Li,Peixin Wang,Min Zhang
机构: East China Normal University (华东师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR 2026(Findings)
Abstract:Object hallucination in Large Vision-Language Models (LVLMs) severely compromises their reliability in real-world applications, posing a critical barrier to their deployment in high-stakes scenarios such as autonomous driving and medical image analysis. Through systematic empirical investigation, we identify that the imbalanced attention allocation, both across modalities (i.e., vision and language) and within modalities (among individual tokens), exhibits a strong causal correlation with the occurrence of object hallucination. Leveraging this insight, we introduce a novel concept termed attention imbalance, which not only quantifies the degree of attention disparity but also visually delineates the underlying patterns (e.g., over-attentiveness to irrelevant language tokens or under-attentiveness to discriminative visual features) that drive object hallucination. To mitigate object hallucination, we further propose Attention Imbalance Rectification (AIR), a lightweight decoding-time intervention method that reallocates attention weights and adjusts attention distributions to rectify modality-wise and token-wise imbalances. Extensive evaluations on four mainstream LVLMs and three benchmarks (CHAIR, POPE, and MM-Vet) with seven baselines demonstrate that AIR consistently reduces object hallucination rates, achieving up to a 35.1% reduction compared to the baselines, while improving up to 15.9% of LVLMs’ general capability across diverse vision-language tasks.
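解码期注意力再分配的最小形式——把视觉 token 的注意力总量提升到目标份额并重新归一化——可示意如下(AIR 的实际修正按头、按层推导,固定的目标份额是本示意的假设):

```python
import numpy as np

def rectify_attention(attn, visual_idx, target_visual_mass):
    """Decoding-time attention rebalancing (sketch of the idea in AIR).

    attn:  (L,) attention distribution of the current query over L
           context tokens (sums to 1)
    visual_idx: indices of the visual tokens
    target_visual_mass: desired total attention on visual tokens

    Scales visual and non-visual attention so the visual share reaches
    the target, then renormalises. AIR's rectification is derived per
    head/layer; the single fixed target here is an assumption.
    """
    mask = np.zeros_like(attn, dtype=bool)
    mask[visual_idx] = True
    vis_mass = attn[mask].sum()
    out = attn.copy()
    out[mask] *= target_visual_mass / (vis_mass + 1e-8)
    out[~mask] *= (1.0 - target_visual_mass) / (1.0 - vis_mass + 1e-8)
    return out / out.sum()
```

当语言 token 过度占据注意力时(摘要诊断的失衡模式之一),该操作把部分注意力质量还给视觉证据,无需任何再训练。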
[CV-61] Beyond Semantic Priors: Mitigating Optimization Collapse for Generalizable Visual Forensics
【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在检测非语义伪造(non-semantic forgeries)时存在的优化崩溃(Optimization Collapse)问题,即当使用Sharpness-Aware Minimization(SAM)训练时,一旦扰动半径超过一个狭窄阈值,检测器性能会退化为随机猜测。其核心解决方案是提出一种新的架构——对比区域注入Transformer(Contrastive Regional Injection Transformer, CoRIT),关键在于引入计算高效的对比梯度代理(Contrastive Gradient Proxy, CGP),并结合三种无需额外训练的策略:区域细化掩码(Region Refinement Mask)以抑制CGP方差、区域信号注入(Regional Signal Injection)以保持CGP幅度、层级表示融合(Hierarchical Representation Integration)以获得更具泛化能力的表征。通过理论分析揭示了优化崩溃源于梯度信噪比(Gradient Signal-to-Noise Ratio, GSNR)层间衰减,并证明临界优化半径(Critical Optimization Radius, COR)随GSNR单调递增,从而从几何稳定性角度解释了SAM在处理高保真伪造时的局限性,最终实现跨域和通用伪造检测任务上的最优性能。
链接: https://arxiv.org/abs/2603.24057
作者: Jipeng Liu,Haichao Shi,Siyu Xing,Rong Yin,Xiao-Yu Zhang
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); School of Cyber Science and Technology, Beihang University (北京航空航天大学网络科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While Vision-Language Models (VLMs) like CLIP have emerged as a dominant paradigm for generalizable deepfake detection, a representational disconnect remains: their semantic-centric pre-training is ill-suited for capturing non-semantic artifacts inherent to hyper-realistic synthesis. In this work, we identify a failure mode termed Optimization Collapse, where detectors trained with Sharpness-Aware Minimization (SAM) degenerate to random guessing on non-semantic forgeries once the perturbation radius exceeds a narrow threshold. To theoretically formalize this collapse, we propose the Critical Optimization Radius (COR) to quantify the geometric stability of the optimization landscape, and leverage the Gradient Signal-to-Noise Ratio (GSNR) to measure generalization potential. We establish a theorem proving that COR increases monotonically with GSNR, thereby revealing that the geometric instability of SAM optimization originates from degraded intrinsic generalization potential. This result identifies the layer-wise attenuation of GSNR as the root cause of Optimization Collapse in detecting non-semantic forgeries. Although naively reducing perturbation radius yields stable convergence under SAM, it merely treats the symptom without mitigating the intrinsic generalization degradation, necessitating enhanced gradient fidelity. Building on this insight, we propose the Contrastive Regional Injection Transformer (CoRIT), which integrates a computationally efficient Contrastive Gradient Proxy (CGP) with three training-free strategies: Region Refinement Mask to suppress CGP variance, Regional Signal Injection to preserve CGP magnitude, and Hierarchical Representation Integration to attain more generalizable representations. Extensive experiments demonstrate that CoRIT mitigates optimization collapse and achieves state-of-the-art generalization across cross-domain and universal forgery benchmarks.
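摘要中的梯度信噪比(GSNR)是逐参数定义的统计量:样本间梯度均值的平方除以梯度方差,数值越高表示各样本梯度方向越一致、泛化潜力越强(论文正是据此将其与临界优化半径 COR 联系起来)。下面给出一个 NumPy 极简计算示意(eps 等细节为假设,论文中的逐层分析更复杂)。

```python
import numpy as np

def gsnr(per_sample_grads, eps=1e-12):
    """Gradient signal-to-noise ratio per parameter.

    per_sample_grads: (N, P) array, one gradient row per sample.
    GSNR_j = (mean_i g_ij)^2 / var_i(g_ij); higher values mean the
    per-sample gradients agree, indicating better generalization
    potential for that parameter.
    """
    g = np.asarray(per_sample_grads, dtype=float)
    mean = g.mean(axis=0)   # "signal": average gradient direction
    var = g.var(axis=0)     # "noise": disagreement across samples
    return mean**2 / (var + eps)
```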
[CV-62] LGEST: Dynamic Spatial-Spectral Expert Routing for Hyperspectral Image Classification
【速读】:该论文旨在解决高光谱图像(Hyperspectral Image, HSI)分类中现有深度学习方法存在的三大问题:局部-全局表征融合方式僵化、跨异质波段的光谱-空间尺度差异处理不足,以及在高维样本异质性下易受Hughes现象影响。其解决方案的核心在于提出一种名为Local-Global Expert Spatial-Spectral Transformer (LGEST) 的新框架,关键创新包括:1)通过深度空间-光谱自编码器(Deep Spatial-Spectral Autoencoder, DSAE)实现层次化非线性压缩以生成紧凑且判别性强的嵌入,保持三维邻域一致性并减少高维空间信息损失;2)引入交叉交互混合专家特征金字塔(Cross-Interactive Mixed Expert Feature Pyramid, CIEM-FPN),利用交叉注意力机制与残差混合专家层动态融合多尺度特征,并通过可学习门控函数自适应加权光谱判别力与空间显著性;3)构建局部-全局专家系统(Local-Global Expert System, LGES),以稀疏激活的专家对分解特征进行处理,其中卷积子专家捕获细粒度纹理,Transformer子专家建模长程上下文依赖,路由控制器依据实时特征显著性动态选择最优专家组合。
链接: https://arxiv.org/abs/2603.24045
作者: Jiawen Wen,Suixuan Qiu,Zihang Luo,Xiaofei Yang,Haotian Shi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deep learning methods, including Convolutional Neural Networks, Transformers and Mamba, have achieved remarkable success in hyperspectral image (HSI) classification. Nevertheless, existing methods exhibit inflexible integration of local-global representations, inadequate handling of spectral-spatial scale disparities across heterogeneous bands, and susceptibility to the Hughes phenomenon under high-dimensional sample heterogeneity. To address these challenges, we propose Local-Global Expert Spatial-Spectral Transformer (LGEST), a novel framework that synergistically combines three key innovations. The LGEST first employs a Deep Spatial-Spectral Autoencoder (DSAE) to generate compact yet discriminative embeddings through hierarchical nonlinear compression, preserving 3D neighborhood coherence while mitigating information loss in high-dimensional spaces. Secondly, a Cross-Interactive Mixed Expert Feature Pyramid (CIEM-FPN) leverages cross-attention mechanisms and residual mixture-of-experts layers to dynamically fuse multi-scale features, adaptively weighting spectral discriminability and spatial saliency through learnable gating functions. Finally, a Local-Global Expert System (LGES) processes decomposed features via sparsely activated expert pairs: convolutional sub-experts capture fine-grained textures, while transformer sub-experts model long-range contextual dependencies, with a routing controller dynamically selecting experts based on real-time feature saliency. Extensive experiments on four benchmark datasets demonstrate that LGEST consistently outperforms state-of-the-art methods.
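LGEST 的专家路由属于稀疏激活的混合专家(MoE)范式:门控网络给各专家打分,仅激活 top-k 个专家,并按归一化的门控概率加权其输出。以下为该通用机制的可运行示意(softmax 门控形式、k 取值均为假设,并非论文的路由控制器实现)。

```python
import numpy as np

def route_experts(x, gate_w, experts, k=2):
    """Sparsely-activated expert routing (generic MoE sketch).

    x: (D,) feature vector; gate_w: (E, D) gating weights; experts:
    list of E callables mapping (D,) -> (D,). A softmax gate scores
    all experts, only the top-k fire, and their outputs are blended
    by renormalized gate probabilities.
    """
    logits = gate_w @ x
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    top = np.argsort(probs)[-k:]            # indices of the k best experts
    w = probs[top] / probs[top].sum()       # renormalize over selected experts
    return sum(wi * experts[i](x) for wi, i in zip(w, top))
```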
[CV-63] HAM: A Training-Free Style Transfer Approach via Heterogeneous Attention Modulation for Diffusion Models CVPR2026
【速读】:该论文旨在解决扩散模型在图像风格迁移中面临的风格-内容平衡难题,即如何在保留用户提供的内容图像身份信息的同时,准确捕捉复杂的风格参考。其解决方案的关键在于提出了一种无需训练的风格迁移方法——异质注意力调制(Heterogeneous Attention Modulation, HAM),通过引入风格噪声初始化以及创新性的两种注意力机制:全局注意力调节(Global Attention Regulation, GAR)和局部注意力移植(Local Attention Transplantation, LAT),在扩散过程中动态调控不同层级的注意力分布,从而在保持内容细节完整性的同时增强对复杂风格特征的建模能力。
链接: https://arxiv.org/abs/2603.24043
作者: Yeqi He,Liang Li,Zhiwen Yang,Xichun Sheng,Zhidong Zhao,Chenggang Yan
机构: Hangzhou Dianzi University (杭州电子科技大学); Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); Macao Polytechnic University (澳门理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in CVPR 2026 Findings
Abstract:Diffusion models have demonstrated remarkable performance in image generation, particularly within the domain of style transfer. Prevailing style transfer approaches typically leverage pre-trained diffusion models’ robust feature extraction capabilities alongside external modular control pathways to explicitly impose style guidance signals. However, these methods often fail to capture complex style references or retain the identity of user-provided content images, thus falling into the style-content balance trap. Thus, we propose a training-free style transfer approach via heterogeneous attention modulation (HAM) to protect identity information during image/text-guided style reference transfer, thereby addressing the style-content trade-off challenge. Specifically, we first introduce style noise initialization to initialize the latent noise for diffusion. Then, during the diffusion process, we innovatively apply HAM to different attention mechanisms, including Global Attention Regulation (GAR) and Local Attention Transplantation (LAT), which better preserves the details of the content image while capturing complex style references. Our approach is validated through a series of qualitative and quantitative experiments, achieving state-of-the-art performance on multiple quantitative metrics.
[CV-64] A3: Towards Advertising Aesthetic Assessment CVPR2026
【速读】:该论文旨在解决广告图像美学评估中缺乏客观、可扩展且可解释的评价方法的问题,当前依赖主观判断导致标准不一、难以规模化应用。解决方案的关键在于提出A³框架,其核心是基于理论驱动的A³-Law范式,包含三个层级:感知注意(Perceptual Attention)、形式兴趣(Formal Interest)和欲望影响(Desire Impact),通过结构化评估广告图像在吸引注意力、激发兴趣及引发购买欲望方面的表现;在此基础上构建了包含12万条指令-响应对的A³-Dataset,并训练出A³-Align多模态大语言模型,采用Chain-of-Thought(CoT)引导学习提升模型与A³-Law的一致性,实验表明该方案在广告质量筛选与生成式批评任务中具有优异泛化能力。
链接: https://arxiv.org/abs/2603.24037
作者: Kaiyuan Ji,Yixuan Gao,Lu Sun,Yushuo Zheng,Zijian Chen,Jianbo Zhang,Xiangyang Zhu,Yuan Tian,Zicheng Zhang,Guangtao Zhai
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University (上海交通大学图像通信与网络工程研究所); School of Information and Electronic Engineering, East China Normal University (华东师范大学信息与电子工程学院); School of Computer Science and Technology, Xi’an Jiaotong University (西安交通大学计算机科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Advertising images significantly impact commercial conversion rates and brand equity, yet current evaluation methods rely on subjective judgments, lacking scalability, standardized criteria, and interpretability. To address these challenges, we present A^3 (Advertising Aesthetic Assessment), a comprehensive framework encompassing four components: a paradigm (A^3-Law), a dataset (A^3-Dataset), a multimodal large language model (A^3-Align), and a benchmark (A^3-Bench). Central to A^3 is a theory-driven paradigm, A^3-Law, comprising three hierarchical stages: (1) Perceptual Attention, evaluating perceptual image signals for their ability to attract attention; (2) Formal Interest, assessing formal composition of image color and spatial layout in evoking interest; and (3) Desire Impact, measuring desire evocation from images and their persuasive impact. Building on A^3-Law, we construct A^3-Dataset with 120K instruction-response pairs from 30K advertising images, each richly annotated with multi-dimensional labels and Chain-of-Thought (CoT) rationales. We further develop A^3-Align, trained under A^3-Law with CoT-guided learning on A^3-Dataset. Extensive experiments on A^3-Bench demonstrate that A^3-Align achieves superior alignment with A^3-Law compared to existing models, and this alignment generalizes well to quality advertisement selection and prescriptive advertisement critique, indicating its potential for broader deployment. Dataset, code, and models can be found at: this https URL.
[CV-65] SpectralSplats: Robust Differentiable Tracking via Spectral Moment Supervision
【速读】:该论文旨在解决3D高斯泼溅(3D Gaussian Splatting, 3DGS)在实际应用中因梯度消失导致的优化不稳定性问题:在相机严重错位时,高斯基元的局部支撑特性使得标准光度目标函数无法产生有效梯度,从而导致优化器陷入局部最优或无法收敛。解决方案的关键在于将优化目标从空间域迁移至频率域,通过引入一组全局复正弦特征(Spectral Moments)作为监督信号,构建一个覆盖整个图像域的全局吸引盆地,确保即使在无像素重叠的情况下仍能提供有方向性的梯度;同时,基于理论推导设计了频率退火(Frequency Annealing)调度策略,逐步从全局凸性过渡到精确的空间对齐,避免高频引起的周期性局部极小值,从而实现鲁棒且高效的视频跟踪。
链接: https://arxiv.org/abs/2603.24036
作者: Avigail Cohen Rimon,Amir Mann,Mirela Ben Chen,Or Litany
机构: Technion - Israel Institute of Technology (以色列理工学院); Nvidia(英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:3D Gaussian Splatting (3DGS) enables real-time, photorealistic novel view synthesis, making it a highly attractive representation for model-based video tracking. However, leveraging the differentiability of the 3DGS renderer “in the wild” remains notoriously fragile. A fundamental bottleneck lies in the compact, local support of the Gaussian primitives. Standard photometric objectives implicitly rely on spatial overlap; if severe camera misalignment places the rendered object outside the target’s local footprint, gradients strictly vanish, leaving the optimizer stranded. We introduce SpectralSplats, a robust tracking framework that resolves this “vanishing gradient” problem by shifting the optimization objective from the spatial to the frequency domain. By supervising the rendered image via a set of global complex sinusoidal features (Spectral Moments), we construct a global basin of attraction, ensuring that a valid, directional gradient toward the target exists across the entire image domain, even when pixel overlap is completely nonexistent. To harness this global basin without introducing periodic local minima associated with high frequencies, we derive a principled Frequency Annealing schedule from first principles, gracefully transitioning the optimizer from global convexity to precise spatial alignment. We demonstrate that SpectralSplats acts as a seamless, drop-in replacement for spatial losses across diverse deformation parameterizations (from MLPs to sparse control points), successfully recovering complex deformations even from severely misaligned initializations where standard appearance-based tracking catastrophically fails.
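摘要中的"全局复正弦特征(Spectral Moments)"本质上是图像与复正弦基的全局内积;对这些矩做损失,即便渲染结果与目标完全不重叠也能得到非零梯度,而频率退火则限制参与损失的最高频率以避免周期性局部极小值。下面是一个 NumPy 示意(频率的归一化与枚举方式为假设,非论文实现)。

```python
import numpy as np

def spectral_moments(img, freqs):
    """Global complex sinusoidal features of an image.

    img: (H, W) array; freqs: list of (ky, kx) angular frequencies.
    Each moment integrates the whole image against a complex sinusoid,
    so a loss on moments has support over the entire image domain.
    """
    H, W = img.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    return np.array([(img * np.exp(-1j * (ky * ys + kx * xs))).sum()
                     for ky, kx in freqs])

def spectral_loss(render, target, max_freq):
    # frequency annealing: only moments up to max_freq participate, so
    # early optimization sees a smooth, low-frequency landscape
    base = 2 * np.pi / render.shape[0]
    freqs = [(base * a, base * b)
             for a in range(max_freq + 1) for b in range(max_freq + 1)]
    d = spectral_moments(render, freqs) - spectral_moments(target, freqs)
    return float(np.sum(np.abs(d) ** 2))
```

随优化进行逐步增大 max_freq,即对应摘要所述"从全局凸性过渡到精确空间对齐"的退火调度。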
[CV-66] Decompose and Transfer: CoT-Prompting Enhanced Alignment for Open-Vocabulary Temporal Action Detection CVPR2026
【速读】:该论文旨在解决开放词汇时间动作检测(Open-Vocabulary Temporal Action Detection, OV-TAD)中因仅依赖标签级语义与视觉特征的全局对齐而导致的未见类别动作识别性能受限问题,即现有方法难以有效迁移已知类别的时序一致视觉知识至未知类别。其解决方案的关键在于提出一种分阶段分解与对齐(Phase-wise Decomposition and Alignment, PDA)框架,通过三个核心模块实现细粒度动作模式学习:首先利用大语言模型的思维链(Chain-of-Thought, CoT)推理能力自动将动作标签分解为连贯的阶段级描述(CoT-Prompting Semantic Decomposition, CSD);其次引入文本增强的前景过滤模块(Text-infused Foreground Filtering, TIF),基于阶段语义线索自适应筛选每个阶段的动作相关片段,生成语义对齐的视觉表示;最后设计自适应阶段对齐模块(Adaptive Phase-wise Alignment, APA),在阶段层面进行视觉-文本匹配并自适应聚合跨阶段对齐结果以完成最终预测,从而显著提升对未见动作的泛化能力。
链接: https://arxiv.org/abs/2603.24030
作者: Sa Zhu,Wanqian Zhang,Lin Wang,Xiaohua Chen,Chenxu Cui,Jinchao Zhang,Bo Li
机构: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; State Key Laboratory of Cyberspace Security Defense; Hangzhou Dianzi University; Department of Automation, Tsinghua University
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Accepted by CVPR 2026
Abstract:Open-Vocabulary Temporal Action Detection (OV-TAD) aims to classify and localize action segments in untrimmed videos for unseen categories. Previous methods rely solely on global alignment between label-level semantics and visual features, which is insufficient to transfer temporally consistent visual knowledge from seen to unseen classes. To address this, we propose a Phase-wise Decomposition and Alignment (PDA) framework, which enables fine-grained action pattern learning for effective prior knowledge transfer. Specifically, we first introduce the CoT-Prompting Semantic Decomposition (CSD) module, which leverages the chain-of-thought (CoT) reasoning ability of large language models to automatically decompose action labels into coherent phase-level descriptions, emulating human cognitive processes. Then, a Text-infused Foreground Filtering (TIF) module is introduced to adaptively filter action-relevant segments for each phase by leveraging phase-wise semantic cues, producing semantically aligned visual representations. Furthermore, we propose the Adaptive Phase-wise Alignment (APA) module, which performs phase-level visual-textual matching and adaptively aggregates alignment results across phases for final prediction. This adaptive phase-wise alignment facilitates the capture of transferable action patterns and significantly enhances generalization to unseen actions. Extensive experiments on two OV-TAD benchmarks demonstrate the superiority of the proposed method.
[CV-67] COVTrack: Learning Open-Vocabulary Multi-Object Tracking from Continuous Videos via a Synergistic Paradigm
【速读】:该论文旨在解决开放词汇多目标跟踪(Open-Vocabulary Multi-Object Tracking, OVMOT)中两个核心瓶颈问题:一是缺乏持续标注的视频数据以支持模型训练,二是现有框架难以协同优化检测与关联模块。针对数据瓶颈,作者构建了首个持续标注的OVMOT训练集C-TAO,其标注密度较原始TAO提升26倍,并能捕捉平滑运动动态和中间物体状态;针对框架瓶颈,提出COVTrack++框架,通过三个关键模块实现检测与关联的双向协同机制:(1) 多线索自适应融合(Multi-Cue Adaptive Fusion, MCF)动态平衡外观、运动与语义线索以增强关联特征学习;(2) 多粒度层级聚合(Multi-Granularity Hierarchical Aggregation, MGA)利用密集检测中的层次空间关系,使可见子节点(如物体部件)辅助遮挡父对象(如完整躯干)的关联特征增强;(3) 时间置信度传播(Temporal Confidence Propagation, TCP)通过高置信度轨迹稳定低置信度候选帧,缓解闪烁现象并提升轨迹连续性。实验表明,该方法在TAO数据集上达到SOTA性能,Novel TETA指标达35.4%(验证集),显著优于先前方法,并展现出强零样本泛化能力。
链接: https://arxiv.org/abs/2603.24016
作者: Zekun Qian,Wei Feng,Ruize Han,Junhui Hou
机构: Tianjin University (天津大学); City University of Hong Kong (香港城市大学); Shenzhen University of Advanced Technology (深圳先进技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Multi-Object Tracking (MOT) has traditionally focused on a few specific categories, restricting its applicability to real-world scenarios involving diverse objects. Open-Vocabulary Multi-Object Tracking (OVMOT) addresses this by enabling tracking of arbitrary categories, including novel objects unseen during training. However, current progress is constrained by two challenges: the lack of continuously annotated video data for training, and the lack of a customized OVMOT framework to synergistically handle detection and association. We address the data bottleneck by constructing C-TAO, the first continuously annotated training set for OVMOT, which increases annotation density by 26x over the original TAO and captures smooth motion dynamics and intermediate object states. For the framework bottleneck, we propose COVTrack++, a synergistic framework that achieves a bidirectional reciprocal mechanism between detection and association through three modules: (1) Multi-Cue Adaptive Fusion (MCF) dynamically balances appearance, motion, and semantic cues for association feature learning; (2) Multi-Granularity Hierarchical Aggregation (MGA) exploits hierarchical spatial relationships in dense detections, where visible child nodes (e.g., object parts) assist occluded parent objects (e.g., whole body) for association feature enhancement; (3) Temporal Confidence Propagation (TCP) recovers flickering detections through high-confidence tracked objects boosting low-confidence candidates across frames, stabilizing trajectories. Extensive experiments on TAO demonstrate state-of-the-art performance, with novel TETA reaching 35.4% and 30.5% on validation and test sets, improving novel AssocA by 4.8% and novel LocA by 5.8% over previous methods, and show strong zero-shot generalization on BDD100K. The code and dataset will be publicly available.
[CV-68] UW-VOS: A Large-Scale Dataset for Underwater Video Object Segmentation
【速读】:该论文旨在解决水下视频目标分割(Underwater Video Object Segmentation, UW-VOS)中因颜色失真、对比度低和普遍伪装导致的性能显著下降问题,其核心挑战在于缺乏高质量训练数据及现有开放域方法在水下场景中的泛化能力不足。解决方案的关键在于提出首个大规模水下VOS基准数据集UW-VOS(包含1,431个视频序列、409类目标和309,295个掩码标注),并通过半自动数据生成引擎结合人工严格验证构建;同时设计参数高效的SAM-U框架,通过在图像编码器中插入轻量级适配器(adapter),仅用约2%的可训练参数即可将SAM2迁移至水下领域,实现SOTA性能并有效缓解域间差异问题。
链接: https://arxiv.org/abs/2603.24006
作者: Hongshen Zhao,Jingkang Tai,Yuhang Wu,Wenkang Zhang,Xi Lan,Shangyan Wang,Tianyu Zhang,Wankou Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Underwater Video Object Segmentation (VOS) is essential for marine exploration, yet open-air methods suffer significant degradation due to color distortion, low contrast, and prevalent camouflage. A primary hurdle is the lack of high-quality training data. To bridge this gap, we introduce UW-VOS, the first large-scale underwater VOS benchmark comprising 1,431 video sequences across 409 categories with 309,295 mask annotations, constructed via a semi-automatic data engine with rigorous human verification. We further propose SAM-U, a parameter-efficient framework that adapts SAM2 to the underwater domain. By inserting lightweight adapters into the image encoder, SAM-U achieves state-of-the-art performance with only ~2% trainable parameters. Extensive experiments reveal that existing methods experience an average 13-point J&F drop on UW-VOS, while SAM-U effectively bridges this domain gap. Detailed attribute-based analysis further identifies small targets, camouflage, and exit-re-entry as critical bottlenecks, providing a roadmap for future research in robust underwater perception.
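摘要提到的"轻量级适配器"属于经典的瓶颈残差适配器家族:降维投影、非线性、升维投影,再加残差连接;升维矩阵零初始化可保证插入之初等价于恒等映射,不扰动冻结的主干,仅引入约占主干极小比例的可训练参数。以下为该通用方案的示意(并非 SAM-U 的确切结构)。

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

class BottleneckAdapter:
    """Generic residual bottleneck adapter: down-project, nonlinearity,
    up-project, skip connection. SAM-U inserts adapters of this family
    into the frozen SAM2 image encoder; the exact design may differ --
    this is only the standard parameter-efficient recipe.
    """
    def __init__(self, dim, bottleneck, rng=None):
        rng = rng or np.random.default_rng(0)
        self.w_down = rng.normal(0, 0.02, (dim, bottleneck))
        self.w_up = np.zeros((bottleneck, dim))  # zero-init: starts as identity

    def __call__(self, x):
        return x + gelu(x @ self.w_down) @ self.w_up
```

每个适配器仅有 2·dim·bottleneck 个参数,当 bottleneck 远小于 dim 时,可训练参数量只占主干的很小比例。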
[CV-69] DB SwinT: A Dual-Branch Swin Transformer Network for Road Extraction in Optical Remote Sensing Imagery
【速读】:该论文旨在解决复杂城乡环境中光学遥感影像中道路提取精度低的问题,尤其针对道路被树木、建筑物等遮挡导致的结构碎片化和连续性丧失。其解决方案的关键在于提出了一种双分支Swin Transformer网络(DB SwinT),通过双分支编码器分别学习局部细节与全局语义信息:局部分支专注于恢复遮挡区域的精细结构,全局分支则捕捉更广泛的语义上下文以保持道路网络的整体连贯性;同时引入注意力特征融合(Attentional Feature Fusion, AFF)模块,自适应地融合两分支特征,显著提升对遮挡路段的表征能力。
链接: https://arxiv.org/abs/2603.24005
作者: Zongyang He,Xiangli Yang,Xian Gao,Zhiguo Wang
机构: Chongqing Jiaotong University (重庆交通大学); Inner Mongolia University of Technology (内蒙古工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:With the continuous improvement in the spatial resolution of optical remote sensing imagery, accurate road extraction has become increasingly important for applications such as urban planning, traffic monitoring, and disaster management. However, road extraction in complex urban and rural environments remains challenging, as roads are often occluded by trees, buildings, and other objects, leading to fragmented structures and reduced extraction accuracy. To address this problem, this paper proposes a Dual-Branch Swin Transformer network (DB SwinT) for road extraction. The proposed framework combines the long-range dependency modeling capability of the Swin Transformer with the multi-scale feature fusion strategy of U-Net, and employs a dual-branch encoder to learn complementary local and global representations. Specifically, the local branch focuses on recovering fine structural details in occluded areas, while the global branch captures broader semantic context to preserve the overall continuity of road networks. In addition, an Attentional Feature Fusion (AFF) module is introduced to adaptively fuse features from the two branches, further enhancing the representation of occluded road segments. Experimental results on the Massachusetts and DeepGlobe datasets show that DB SwinT achieves Intersection over Union (IoU) scores of 79.35% and 74.84%, respectively, demonstrating its effectiveness for road extraction from optical remote sensing imagery.
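AFF 模块的核心思想可以抽象为一个逐通道门控:由两路特征拼接后计算门控系数 g,再按 g 与 1-g 加权融合局部与全局分支。下面的 NumPy 示意中,门控权重作为函数输入给出(论文中的 AFF 为端到端学习的模块,此处仅演示融合形式)。

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attentional_fuse(local_feat, global_feat, w_gate, b_gate=0.0):
    """Gated fusion of two branches (AFF-style sketch).

    A per-channel gate computed from the concatenated features decides
    how much of the local vs. global branch to keep:
        fused = g * local + (1 - g) * global,  g = sigmoid(W [local; global] + b)
    local_feat, global_feat: (D,); w_gate: (D, 2D).
    """
    z = np.concatenate([local_feat, global_feat])
    g = sigmoid(w_gate @ z + b_gate)
    return g * local_feat + (1.0 - g) * global_feat
```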
[CV-70] HGGT: Robust and Flexible 3D Hand Mesh Reconstruction from Uncalibrated Images
【速读】:该论文旨在解决从单张或任意视角图像中高保真重建3D手部几何结构的问题,尤其针对当前方法在部署灵活性与精度之间存在的矛盾:单视图方法虽易部署但受深度模糊性和遮挡影响,而多视图系统虽能消除不确定性却依赖固定校准的采集环境。解决方案的关键在于借鉴3D基础模型(3D foundation models)的思想,将手部重建任务重新定义为一个视觉-几何联合建模问题,提出了一种前馈架构,首次实现了从非标定视角中同时推断3D手部网格(hand mesh)和相机位姿(camera pose),从而在无需复杂标定的情况下实现高精度且泛化能力强的手部三维重建。
链接: https://arxiv.org/abs/2603.23997
作者: Yumeng Liu,Xiao-Xiao Long,Marc Habermann,Xuanze Yang,Cheng Lin,Yuan Liu,Yuexin Ma,Wenping Wang,Ligang Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project page: this https URL
Abstract:Recovering high-fidelity 3D hand geometry from images is a critical task in computer vision, holding significant value for domains such as robotics, animation and VR/AR. Crucially, scalable applications demand both accuracy and deployment flexibility, requiring the ability to leverage massive amounts of unstructured image data from the internet or enable deployment on consumer-grade RGB cameras without complex calibration. However, current methods face a dilemma. While single-view approaches are easy to deploy, they suffer from depth ambiguity and occlusion. Conversely, multi-view systems resolve these uncertainties but typically demand fixed, calibrated setups, limiting their real-world utility. To bridge this gap, we draw inspiration from 3D foundation models that learn explicit geometry directly from visual data. By reformulating hand reconstruction from arbitrary views as a visual-geometry grounded task, we propose a feed-forward architecture that, for the first time in the literature, jointly infers 3D hand meshes and camera poses from uncalibrated views. Extensive evaluations show that our approach outperforms state-of-the-art methods on existing benchmarks and demonstrates strong generalization to uncalibrated, in-the-wild scenarios. Project page: this https URL.
[CV-71] CAKE: Real-time Action Detection via Motion Distillation and Background-aware Contrastive Learning
【速读】:该论文针对在线动作检测(Online Action Detection, OAD)系统中存在的两大挑战展开研究:一是计算成本过高,二是对区分性时序动态特征建模不足,尤其是在背景运动干扰下的表现较差。为解决这些问题,作者提出了一种基于流的蒸馏框架CAKE(Context-Aware Knowledge distillation for Efficient Online Action Detection),其核心创新在于设计了动态运动适配器(Dynamic Motion Adapter, DMA),通过抑制静态背景噪声并增强像素变化信息,无需显式计算光流即可有效近似其运动感知能力;同时引入浮动对比学习(Floating Contrastive Learning)策略,从时序背景中分离出具有判别性的运动动态特征。实验表明,CAKE在保持与当前最优模型相同主干网络的前提下实现了更优的平均精度(mAP),且单CPU环境下运行速度超过72 FPS,显著提升了资源受限场景下的实用性。
链接: https://arxiv.org/abs/2603.23988
作者: Hieu Hoang,Dung Trung Tran,Hong Nguyen,Nam-Phong Nguyen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Online Action Detection (OAD) systems face two primary challenges: high computational cost and insufficient modeling of discriminative temporal dynamics against background motion. Adding optical flow can provide strong motion cues, but it incurs significant computational overhead. We propose CAKE, a flow-based distillation framework for OAD that transfers motion knowledge into RGB models. We propose a Dynamic Motion Adapter (DMA) to suppress static background noise and emphasize pixel changes, effectively approximating optical flow without explicit computation. The framework also integrates a Floating Contrastive Learning strategy to distinguish informative motion dynamics from the temporal background. Extensive experiments on the TVSeries, THUMOS’14, and Kinetics-400 datasets show the effectiveness of our model. CAKE achieves a standout mAP compared with the state of the art while using the same backbone. Our model operates at over 72 FPS on a single CPU, making it highly suitable for resource-constrained systems.
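DMA 所近似的"抑制静态背景、强调像素变化"的运动信号,可以用最朴素的背景建模加帧差来说明:维护一个指数滑动平均背景,逐帧取与背景的绝对偏差作为运动图。注意这只是该信号的示意,CAKE 的 DMA 是端到端学习的模块,alpha 等超参均为假设。

```python
import numpy as np

def motion_proxy(frames, alpha=0.9):
    """Flow-free motion cue from frame differences (DMA-flavoured sketch).

    frames: (T, H, W) grayscale clip. A running background estimate is
    kept as an exponential moving average; per-frame motion maps are the
    absolute deviation from it, which suppresses static background and
    emphasizes pixel changes -- the signal DMA approximates.
    """
    bg = frames[0].astype(float)
    maps = []
    for f in frames:
        maps.append(np.abs(f - bg))        # deviation from background model
        bg = alpha * bg + (1 - alpha) * f  # slowly absorb new content
    return np.stack(maps)
```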
[CV-72] SilLang: Improving Gait Recognition with Silhouette Language Encoding
【速读】:该论文旨在解决现有行人步态识别方法中对二值化步态轮廓(binary gait silhouettes)的离散特性利用不足的问题。当前主流方法多依赖视觉骨干网络提取连续特征,忽略了步态轮廓与自然语言在离散编码空间中的相似性,从而限制了对时序运动模式的精细建模能力。解决方案的关键在于提出一种轮廓-速度分词器(Contour-Velocity Tokenizer),通过重塑二值步态轮廓的分布以匹配文本标记的空间密度和频率,实现与大语言模型(LLM)离散语义空间的对齐;进而构建双分支框架“轮廓语言模型”(Silhouette Language Model),融合来自LLM的离散语言嵌入来增强步态轮廓表征,显著提升在SUSTech1K、GREW和Gait3D等数据集上的性能。
链接: https://arxiv.org/abs/2603.23976
作者: Ruiyi Zhan,Guozhen Peng,Canyu Chen,Jian Lei,Annan Li
机构: Beihang University (北京航空航天大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Gait silhouettes, which can be encoded into binary gait codes, are widely adopted for representing the motion patterns of pedestrians. Recent approaches commonly leverage visual backbones to encode gait silhouettes, achieving strong performance. However, they primarily focus on continuous visual features, overlooking the discrete nature of binary silhouettes, which inherently share a discrete encoding space with natural language. Large Language Models (LLMs) have demonstrated exceptional capability in extracting discriminative features from discrete sequences and modeling long-range dependencies, highlighting their potential to capture temporal motion patterns by identifying subtle variations. Motivated by these observations, we explore bridging binary gait silhouettes and natural language within a binary encoding space. However, the encoding spaces of text tokens and binary gait silhouettes remain misaligned, primarily due to differences in token frequency and density. To address this issue, we propose the Contour-Velocity Tokenizer, which encodes binary gait silhouettes while reshaping their distribution to better align with the text token space. We then establish a dual-branch framework termed the Silhouette Language Model (SilLang), which enhances visual silhouettes by integrating discrete linguistic embeddings derived from LLMs. Implemented on mainstream gait backbones, SilLang consistently improves state-of-the-art methods across SUSTech1K, GREW, and Gait3D.
[CV-73] HyDRA: Hybrid Domain-Aware Robust Architecture for Heterogeneous Collaborative Perception IROS2026
【速读】:该论文旨在解决协同感知(Collaborative Perception, CP)中因模型架构或训练数据分布差异导致的异质性(heterogeneity)问题,该问题会显著降低协作智能体的性能。解决方案的关键在于提出一种统一的HyDRA(Hybrid Domain-Aware Robust Architecture)框架,其核心创新包括:1)引入轻量级域分类器(domain classifier),动态识别异质性智能体并将其分配至晚期融合分支;2)设计锚点引导的姿态图优化方法(anchor-guided pose graph optimization),利用中间融合提供的可靠检测结果作为固定空间锚点,以缓解晚期融合固有的定位误差。该方案无需额外训练即可实现与最先进异质性感知方法相当的性能,并支持协作智能体数量的零成本扩展。
链接: https://arxiv.org/abs/2603.23975
作者: Minwoo Song,Minhee Kang,Heejin Ahn
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6 figures, Submitted to IROS 2026
Abstract:In collaborative perception, an agent’s performance can be degraded by heterogeneity arising from differences in model architecture or training data distributions. To address this challenge, we propose HyDRA (Hybrid Domain-Aware Robust Architecture), a unified pipeline that integrates intermediate and late fusion within a domain-aware framework. We introduce a lightweight domain classifier that dynamically identifies heterogeneous agents and assigns them to the late-fusion branch. Furthermore, we propose anchor-guided pose graph optimization to mitigate localization errors inherent in late fusion, leveraging reliable detections from intermediate fusion as fixed spatial anchors. Extensive experiments demonstrate that, despite requiring no additional training, HyDRA achieves performance comparable to state-of-the-art heterogeneity-aware CP methods. Importantly, this performance is maintained as the number of collaborating agents increases, enabling zero-cost scaling without retraining.
[CV-74] SLAT-Phys: Fast Material Property Field Prediction from Structured 3D Latents
【速读】:该论文旨在解决从单张RGB图像中直接估计三维资产(3D assets)空间变化的材料属性场(material property field)的问题,这是物理仿真、机器人学和数字孪生生成中的关键步骤。传统视觉方法要么计算成本高、速度慢,要么依赖于显式的三维重建信息。其解决方案的关键在于提出一种端到端的方法SLAT-Phys,该方法利用预训练的3D资产生成模型所提取的空间结构化潜在特征(spatially organised latent features),这些特征编码了丰富的几何与语义先验信息,并通过一个轻量级神经解码器预测杨氏模量(Young’s modulus)、密度和泊松比(Poisson’s ratio)。该方法无需显式三维重建或体素化预处理,仅需9.9秒/对象即可完成估计,在保持与现有方法相当精度的同时实现了120倍的速度提升。
链接: https://arxiv.org/abs/2603.23973
作者: Rocktim Jyoti Das,Dinesh Manocha
机构: University of Maryland, College Park (马里兰大学学院市分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
备注: 8 pages, 4 figures
Abstract:Estimating the material property field of 3D assets is critical for physics-based simulation, robotics, and digital twin generation. Existing vision-based approaches are either too expensive and slow or rely on 3D information. We present SLAT-Phys, an end-to-end method that predicts spatially varying material property fields of 3D assets directly from a single RGB image without explicit 3D reconstruction. Our approach leverages spatially organised latent features from a pretrained 3D asset generation model that encode rich geometric and semantic priors, and trains a lightweight neural decoder to estimate Young’s modulus, density, and Poisson’s ratio. The coarse volumetric layout and semantic cues of the latent representation about object geometry and appearance enable accurate material estimation. Our experiments demonstrate that our method provides competitive accuracy in predicting continuous material parameters when compared against prior approaches, while significantly reducing computation time. In particular, SLAT-Phys requires only 9.9 seconds per object on an NVIDIA RTXA5000 GPU and avoids reconstruction and voxelization preprocessing. This results in a 120x speedup compared to prior methods and enables faster material property estimation from a single image.
[CV-75] GRMLR: Knowledge-Enhanced Small-Data Learning for Deep-Sea Cold Seep Stage Inference
【速读】:该论文旨在解决深海冷泉阶段评估中因依赖昂贵且高风险的载人潜水器作业和宏观生物视觉调查而导致的成本高、效率低的问题,同时应对微生物组数据维度远高于样本量(p=26, n=13)所引发的过拟合难题。解决方案的关键在于提出一种知识增强型分类框架,通过引入生态知识图谱作为结构先验,融合宏-微生物耦合关系与微生物共现模式,将生态学逻辑内嵌至图正则化的多项逻辑回归(Graph-Regularized Multinomial Logistic Regression, GRMLR)模型中,利用流形惩罚约束特征空间,从而实现生物学一致性的精准分类;值得注意的是,该框架在推理阶段仅需微生物丰度谱即可完成预测,无需宏观生物观测,显著提升了实用性与可扩展性。
链接: https://arxiv.org/abs/2603.23961
作者: Chenxu Zhou,Zelin Liu,Rui Cai,Houlin Gong,Yikang Yu,Jia Zeng,Yanru Pei,Liang Zhang,Weishu Zhao,Xiaofeng Gao
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deep-sea cold seep stage assessment has traditionally relied on costly, high-risk manned submersible operations and visual surveys of macrofauna. Although microbial communities provide a promising and more cost-effective alternative, reliable inference remains challenging because the available deep-sea dataset is extremely small (n = 13) relative to the microbial feature dimension (p = 26), making purely data-driven models highly prone to overfitting. To address this, we propose a knowledge-enhanced classification framework that incorporates an ecological knowledge graph as a structural prior. By fusing macro-microbe coupling and microbial co-occurrence patterns, the framework internalizes established ecological logic into a Graph-Regularized Multinomial Logistic Regression (GRMLR) model, effectively constraining the feature space through a manifold penalty to ensure biologically consistent classification. Importantly, the framework removes the need for macrofauna observations at inference time: macro-microbe associations are used only to guide training, whereas prediction relies solely on microbial abundance profiles. Experimental results demonstrate that our approach significantly outperforms standard baselines, highlighting its potential as a robust and scalable framework for deep-sea ecological assessment.
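GRMLR 的目标函数是多项逻辑回归的交叉熵,加上图流形惩罚 λ·tr(WᵀLW),其中 L 为特征(微生物)共现图的拉普拉斯矩阵,促使相连微生物的系数彼此接近。下面用 NumPy 给出一个梯度下降的极简示意(优化器、学习率与正则系数均为假设,非论文设置)。

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def fit_grmlr(X, y, A, lam=0.1, lr=0.1, steps=500):
    """Graph-regularized multinomial logistic regression (minimal sketch).

    X: (n, p) abundance features; y: (n,) integer class ids; A: (p, p)
    co-occurrence adjacency among features. The loss is cross-entropy
    plus a manifold penalty lam * tr(W^T L W), with L the graph
    Laplacian, which pulls coefficients of linked features together.
    """
    n, p = X.shape
    K = y.max() + 1
    L = np.diag(A.sum(axis=1)) - A   # graph Laplacian of the prior
    W = np.zeros((p, K))
    Y = np.eye(K)[y]                 # one-hot labels
    for _ in range(steps):
        P = softmax(X @ W)
        grad = X.T @ (P - Y) / n + 2 * lam * (L @ W)  # CE + penalty gradient
        W -= lr * grad
    return W
```

当 A 全零时惩罚项消失,退化为普通多项逻辑回归;A 越稠密、λ 越大,相连特征的系数被约束得越接近,从而在 p > n 的小样本场景下抑制过拟合。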
[CV-76] Leave No Stone Unturned: Uncovering Holistic Audio-Visual Intrinsic Coherence for Deepfake Detection
【速读】: This paper addresses the poor cross-dataset generalization of current deepfake detection methods, which mostly rely on uni-modal artifacts or audio-visual discrepancies and fail to fuse multimodal information effectively for robust detection. The key to the solution is HAVIC, a Holistic Audio-Visual Intrinsic Coherence-based detector: it pre-trains on authentic videos to learn modality-specific structural coherence as well as inter-modal micro- and macro-coherence, and then performs holistic adaptive aggregation to dynamically fuse audio-visual features, grounding deepfake detection in intrinsic audio-visual coherence for more general and reliable results.
链接: https://arxiv.org/abs/2603.23960
作者: Jielun Peng,Yabin Wang,Yaqi Li,Long Kong,Xiaopeng Hong
机构: Harbin Institute of Technology (哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The rapid progress of generative AI has enabled hyper-realistic audio-visual deepfakes, intensifying threats to personal security and social trust. Most existing deepfake detectors rely either on uni-modal artifacts or audio-visual discrepancies, failing to jointly leverage both sources of information. Moreover, detectors that rely on generator-specific artifacts tend to exhibit degraded generalization when confronted with unseen forgeries. We argue that robust and generalizable detection should be grounded in intrinsic audio-visual coherence within and across modalities. Accordingly, we propose HAVIC, a Holistic Audio-Visual Intrinsic Coherence-based deepfake detector. HAVIC first learns priors of modality-specific structural coherence, inter-modal micro- and macro-coherence by pre-training on authentic videos. Based on the learned priors, HAVIC further performs holistic adaptive aggregation to dynamically fuse audio-visual features for deepfake detection. Additionally, we introduce HiFi-AVDF, a high-fidelity audio-visual deepfake dataset featuring both text-to-video and image-to-video forgeries from state-of-the-art commercial generators. Extensive experiments across several benchmarks demonstrate that HAVIC significantly outperforms existing state-of-the-art methods, achieving improvements of 9.39% AP and 9.37% AUC on the most challenging cross-dataset scenario. Our code and dataset are available at this https URL.
[CV-77] PointRFT: Explicit Reinforcement Fine-tuning for Point Cloud Few-shot Learning
【速读】: This paper addresses the lack of effective fine-tuning methods in 3D point cloud representation learning, focusing on how reinforcement learning (RL) can improve point cloud foundation models in data-scarce settings. The key to the solution is PointRFT, the first reinforcement fine-tuning paradigm designed specifically for point cloud representation learning, which stabilizes training and mitigates distribution shift through dedicated accuracy and dispersion reward functions. Experiments show that PointRFT clearly outperforms vanilla supervised fine-tuning (SFT) on few-shot classification, and when integrated into a hybrid Pretraining-SFT-RFT paradigm it further unleashes representational capacity, achieving state-of-the-art performance under data scarcity.
链接: https://arxiv.org/abs/2603.23957
作者: Yankai Wang,Yiding Sun,Qirui Wang,Pengbo Li,Chaoyi Lu,Dongxu Zhang
机构: Xi’an Jiaotong University (西安交通大学); Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Understanding spatial dynamics and semantics in point cloud is fundamental for comprehensive 3D comprehension. While reinforcement learning algorithms such as Group Relative Policy Optimization (GRPO) have recently achieved remarkable breakthroughs in large language models by incentivizing reasoning capabilities through strategic reward design, their potential remains largely unexplored in the 3D perception domain. This naturally raises a pivotal question: Can RL-based methods effectively empower 3D point cloud fine-tuning? In this paper, we propose PointRFT, the first reinforcement fine-tuning paradigm tailored specifically for point cloud representation learning. We select three prevalent 3D foundation models and devise specialized accuracy reward and dispersion reward functions to stabilize training and mitigate distribution shifts. Through comprehensive few-shot classification experiments comparing distinct training paradigms, we demonstrate that PointRFT consistently outperforms vanilla supervised fine-tuning (SFT) across diverse benchmarks. Furthermore, when organically integrated into a hybrid Pretraining-SFT-RFT paradigm, the representational capacity of point cloud foundation models is substantially unleashed, achieving state-of-the-art performance particularly under data-scarce scenarios.
[CV-78] SynMVCrowd: A Large Synthetic Benchmark for Multi-view Crowd Counting and Localization
【速读】: This paper addresses the problem that existing multi-view crowd counting and localization methods are evaluated on datasets with small scenes, low crowd density, and limited views and frames, which these methods easily overfit, so the reported performance does not reflect real capability. The key to the solution is SynMVCrowd, a large synthetic benchmark of 50 scenes with many multi-view video frames and camera views and dense crowds of up to 1000 people, enabling far more practical evaluation. The authors also propose strong multi-view crowd counting and localization baselines that outperform all compared methods on the new benchmark, and show that the benchmark improves cross-domain transfer of crowd counting and localization to real scenes.
链接: https://arxiv.org/abs/2603.23956
作者: Qi Zhang,Daijie Chen,Yunfei Gong,Hui Huang
机构: Shenzhen University (深圳大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IJCV 2026
Abstract:Existing multi-view crowd counting and localization methods are evaluated under relatively small scenes with limited crowd numbers, camera views, and frames. This makes the evaluation and comparison of existing methods impractical, as small datasets are easily overfit by these methods. To avoid these issues, 3DROM proposes a data augmentation method. Instead, in this paper, we propose a large synthetic benchmark, SynMVCrowd, for more practical evaluation and comparison of multi-view crowd counting and localization tasks. The SynMVCrowd benchmark consists of 50 synthetic scenes with a large number of multi-view frames and camera views and a much larger crowd number (up to 1000), which is more suitable for large-scene multi-view crowd vision tasks. Besides, we propose strong multi-view crowd localization and counting baselines that outperform all comparison methods on the new SynMVCrowd benchmark. Moreover, we show that better domain-transferring multi-view and single-image counting performance can be achieved with the aid of the benchmark on new real scenes. As a result, the proposed benchmark could advance the research for multi-view and single-image crowd counting and localization to more practical applications. The codes and datasets are here: this https URL.
[CV-79] VOLMO: Versatile and Open Large Models for Ophthalmology
【速读】: This paper addresses the inefficiency of integrating multimodal information in ophthalmology clinical workflows and the poor performance of existing general-purpose and medical multimodal large language models (MLLMs) on ophthalmic tasks. The key to the solution is VOLMO (Versatile and Open Large Models for Ophthalmology), a model-agnostic, data-open framework that builds ophthalmology-specific MLLMs in three stages: ophthalmology knowledge pretraining on 86,965 image-text pairs; disease screening and staging fine-tuning on 26,929 annotated instances covering 12 eye conditions; and multi-step clinical reasoning training on 913 patient case reports to support assessment, management planning, and follow-up generation. A 2B-parameter model trained with this framework significantly outperforms strong baselines, including InternVL-2B, LLaVA-Med-7B, the MedGemma series, and RETFound, across ophthalmic tasks, with robustness and generalization validated on external independent cohorts.
链接: https://arxiv.org/abs/2603.23953
作者: Zhenyue Qin,Younjoon Chung,Elijah Lee,Wanyue Feng,Xuguang Ai,Serina Applebaum,Minjie Zou,Yang Liu,Pan Xiao,Mac Singer,Amisha Dave,Aidan Gilson,Tiarnan D. L. Keenan,Emily Y. Chew,Zhiyong Lu,Yih-Chung Tham,Ron Adelman,Luciano V. Del Priore,Qingyu Chen
机构: Yale University (耶鲁大学); Carnegie Mellon University (卡内基梅隆大学); National University of Singapore (新加坡国立大学); Washington University in Saint Louis (圣路易斯华盛顿大学); Harvard University (哈佛大学); National Institutes of Health (美国国立卫生研究院); National Library of Medicine (美国国家医学图书馆)
类目: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
备注:
Abstract:Vision impairment affects millions globally, and early detection is critical to preventing irreversible vision loss. Ophthalmology workflows require clinicians to integrate medical images, structured clinical data, and free-text notes to determine disease severity and management, which is time-consuming and burdensome. Recent multimodal large language models (MLLMs) show promise, but existing general and medical MLLMs perform poorly in ophthalmology, and few ophthalmology-specific MLLMs are openly available. We present VOLMO (Versatile and Open Large Models for Ophthalmology), a model-agnostic, data-open framework for developing ophthalmology-specific MLLMs. VOLMO includes three stages: ophthalmology knowledge pretraining on 86,965 image-text pairs from 26,569 articles across 82 journals; domain task fine-tuning on 26,929 annotated instances spanning 12 eye conditions for disease screening and severity classification; and multi-step clinical reasoning on 913 patient case reports for assessment, planning, and follow-up care. Using this framework, we trained a compact 2B-parameter MLLM and compared it with strong baselines, including InternVL-2B, LLaVA-Med-7B, MedGemma-4B, MedGemma-27B, and RETFound. We evaluated these models on image description generation, disease screening and staging classification, and assessment-and-management generation, with additional manual review by two healthcare professionals and external validation on three independent cohorts for age-related macular degeneration and diabetic retinopathy. Across settings, VOLMO-2B consistently outperformed baselines, achieving stronger image description performance, an average F1 of 87.4% across 12 eye conditions, and higher scores in external validation.
[CV-80] High-Fidelity Face Content Recovery via Tamper-Resilient Versatile Watermarking
【速读】: This paper addresses the serious threats that generative AI-driven face manipulation and deepfakes pose to media provenance, integrity, and copyright protection. Existing watermarking systems typically embed explicit localization payloads, creating a fidelity-functionality trade-off: larger localization signals degrade image quality and weaken decoding robustness under strong generative edits, and most methods do not support content recovery, limiting their forensic value when the original evidence must be reconstructed. The key innovations of the proposed VeriFi framework are: (1) embedding a compact semantic latent watermark as a content-preserving prior, enabling high-fidelity face content recovery even after severe manipulation; (2) achieving pixel-level tamper localization by correlating image features with decoded provenance signals, without introducing localization-specific artifacts; and (3) an AIGC attack simulator that combines latent-space mixing with seamless blending to improve robustness against realistic deepfake pipelines.
链接: https://arxiv.org/abs/2603.23940
作者: Peipeng Yu,Jinfeng Xie,Chengfu Ou,Xiaoyu Zhou,Jianwei Fei,Yunshu Dai,Zhihua Xia,Chip Hong Chang
机构: Jinan University(暨南大学); Sun Yat-sen University(中山大学); University of Macau(澳门大学); Nanyang Technological University(南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The proliferation of AIGC-driven face manipulation and deepfakes poses severe threats to media provenance, integrity, and copyright protection. Prior versatile watermarking systems typically rely on embedding explicit localization payloads, which introduces a fidelity–functionality trade-off: larger localization signals degrade visual quality and often reduce decoding robustness under strong generative edits. Moreover, existing methods rarely support content recovery, limiting their forensic value when original evidence must be reconstructed. To address these challenges, we present VeriFi, a versatile watermarking framework that unifies copyright protection, pixel-level manipulation localization, and high-fidelity face content recovery. VeriFi makes three key contributions: (1) it embeds a compact semantic latent watermark that serves as a content-preserving prior, enabling faithful restoration even after severe manipulations; (2) it achieves fine-grained localization without embedding localization-specific artifacts by correlating image features with decoded provenance signals; and (3) it introduces an AIGC attack simulator that combines latent-space mixing with seamless blending to improve robustness to realistic deepfake pipelines. Extensive experiments on CelebA-HQ and FFHQ show that VeriFi consistently outperforms strong baselines in watermark robustness, localization accuracy, and recovery quality, providing a practical and verifiable defense for deepfake forensics.
[CV-81] Revealing Multi-View Hallucination in Large Vision-Language Models
【速读】: This paper addresses multi-view hallucination in large vision-language models (LVLMs): when given multi-view image inputs, models tend to confuse or mismatch visual information originating from different instances or viewpoints. To analyze the problem systematically, the authors construct MVH-Bench, a benchmark of 4.8k question-answer pairs covering two hallucination types, cross-instance and cross-view. The key to the solution is Reference Shift Contrastive Decoding (RSCD), a training-free decoding technique that suppresses visual interference by generating negative logits through attention masking, strengthening the model's association with the correct visual evidence. Experiments show that RSCD improves performance by up to 21.1 and 34.6 points on Qwen2.5-VL and LLaVA-OneVision respectively, clearly surpassing existing hallucination mitigation methods.
链接: https://arxiv.org/abs/2603.23934
作者: Wooje Park,Insu Lee,Soohyun Kim,Jaeyun Jang,Minyoung Noh,Kyuhong Shim,Byonghyo Shim
机构: Seoul National University (首尔国立大学); Sungkyunkwan University (成均馆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Large vision-language models (LVLMs) are increasingly being applied to multi-view image inputs captured from diverse viewpoints. However, despite this growing use, current LVLMs often confuse or mismatch visual information originating from different instances or viewpoints, a phenomenon we term multi-view hallucination. To systematically analyze this problem, we construct MVH-Bench, a benchmark comprising 4.8k question-answer pairs targeting two types of hallucination: cross-instance and cross-view. Empirical results show that recent LVLMs struggle to correctly associate visual evidence with its corresponding instance or viewpoint. To overcome this limitation, we propose Reference Shift Contrastive Decoding (RSCD), a training-free decoding technique that suppresses visual interference by generating negative logits through attention masking. Experiments on MVH-Bench with Qwen2.5-VL and LLaVA-OneVision demonstrate that RSCD consistently improves performance by up to 21.1 and 34.6 points over existing hallucination mitigation methods, highlighting the effectiveness of our approach.
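The abstract describes RSCD as contrasting normal logits against "negative" logits produced with attention masking, but not the exact combination rule. The sketch below shows only the generic contrastive-decoding arithmetic on toy next-token logits; the weighting `alpha` and the two logit vectors are illustrative assumptions, not RSCD's actual formulation:

```python
import numpy as np

# Generic contrastive decoding: logits from the full input are contrasted
# against negative logits obtained with distracting visual evidence masked,
# demoting tokens that the distractor alone would promote.

def contrastive_logits(logits_full, logits_negative, alpha=1.0):
    return (1 + alpha) * logits_full - alpha * logits_negative

vocab = ["red", "blue", "green"]
logits_full = np.array([2.0, 1.5, 0.1])   # model attending all views
logits_neg = np.array([0.2, 1.8, 0.1])    # model attending only the wrong view
adjusted = contrastive_logits(logits_full, logits_neg)
print(vocab[int(np.argmax(adjusted))])    # "red": distractor-driven "blue" is demoted
```

The intuition: a token the model prefers only because of the distracting view scores high in `logits_neg`, so subtracting the negative logits pushes its adjusted score down.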
[CV-82] DP2-VL: Private Photo Dataset Protection by Data Poisoning for Vision-Language Models
【速读】: This paper addresses the identity-affiliation leakage risk that arises when vision-language models (VLMs) are trained on a small number of private photos: an attacker fine-tunes a VLM so that associations between a target individual's facial identity and sensitive information, such as private property and social relationships, are embedded into the model's internal representations, exposing the user's privacy without authorization once the model is deployed. The key to the solution is DP2-VL, the first dataset protection framework for private photos, which leverages data poisoning: imperceptible perturbations applied to the original images, without noticeably changing perceptual quality, induce a dataset-level shift in the embedding space of the model's encoders, separating protected images from clean inference images so that fine-tuning on protected data overfits and identity-affiliation leakage is effectively blocked.
链接: https://arxiv.org/abs/2603.23925
作者: Hongyi Miao,Jun Jia,Xincheng Wang,Qianli Ma,Wei Sun,Wangqiu Zhou,Dandan Zhu,Yewen Cao,Zhi Liu,Guangtao Zhai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in visual-language alignment have endowed vision-language models (VLMs) with fine-grained image understanding capabilities. However, this progress also introduces new privacy risks. This paper first proposes a novel privacy threat model named identity-affiliation learning: an attacker fine-tunes a VLM using only a few private photos of a target individual, thereby embedding associations between the target facial identity and their private property and social relationships into the model’s internal representations. Once deployed via public APIs, this model enables unauthorized exposure of the target user’s private information upon input of their photos. To benchmark VLMs’ susceptibility to such identity-affiliation leakage, we introduce the first identity-affiliation dataset comprising seven typical scenarios appearing in private photos. Each scenario is instantiated with multiple identity-centered photo-description pairs. Experimental results demonstrate that mainstream VLMs like LLaVA, Qwen-VL, and MiniGPT-v2, can recognize facial identities and infer identity-affiliation relationships by fine-tuning on small-scale private photographic dataset, and even on synthetically generated datasets. To mitigate this privacy risk, we propose DP2-VL, the first Dataset Protection framework for private photos that leverages Data Poisoning. Through optimizing imperceptible perturbations that push the original representations toward an antithetical region, DP2-VL induces a dataset-level shift in the embedding space of VLMs’ encoders. This shift separates protected images from clean inference images, causing fine-tuning on the protected set to overfit. Extensive experiments demonstrate that DP2-VL achieves strong generalization across models, robustness to diverse post-processing operations, and consistent effectiveness across varying protection ratios.
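The core operation, optimizing an imperceptible perturbation that shifts an image's embedding, can be sketched with a standard projected sign-gradient ascent. Everything here is an assumption for illustration: the linear "encoder" `W`, the objective (push the embedding away from its clean value), and the budget and step sizes stand in for the paper's actual encoders and antithetical-region objective:

```python
import numpy as np

# Toy data-poisoning protection: an L-infinity-bounded perturbation is
# optimized to push the (toy, linear) encoder embedding away from the clean
# representation, so later fine-tuning would see shifted features.

def protect(x, W, eps=8 / 255, steps=40, lr=1 / 255, rng=None):
    rng = rng or np.random.default_rng()
    delta = rng.uniform(-eps, eps, size=x.shape)    # random start inside the box
    z_clean = W @ x
    for _ in range(steps):
        grad = W.T @ (W @ (x + delta) - z_clean)    # grad of ||z - z_clean||^2 / 2
        delta = np.clip(delta + lr * np.sign(grad), -eps, eps)  # ascent + project
    return np.clip(x + delta, 0.0, 1.0)             # stay a valid image

rng = np.random.default_rng(0)
x = rng.uniform(size=64)                            # flattened toy "image" in [0, 1]
W = rng.normal(size=(16, 64))                       # stand-in encoder
x_prot = protect(x, W, rng=rng)
shift = np.linalg.norm(W @ x_prot - W @ x)
print(float(np.abs(x_prot - x).max()) <= 8 / 255 + 1e-9, shift > 0.1)
```

The L-infinity box (`eps = 8/255`) is what keeps the perturbation visually imperceptible while the embedding moves substantially; the paper's framework additionally targets a specific antithetical region rather than merely maximizing distance.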
[CV-83] DepthArb: Training-Free Depth-Arbitrated Generation for Occlusion-Robust Image Synthesis
【速读】: This paper addresses the deficiencies of text-to-image diffusion models in synthesizing accurate occlusion relationships among multiple objects, which often manifest as concept mixing or illogical occlusion in densely overlapping regions; existing training-free layout-guided methods rely mainly on rigid spatial priors that are agnostic to depth order and thus fail to model inter-object visibility correctly. The key to the solution is DepthArb, a training-free framework that arbitrates attention competition through two mechanisms: Attention Arbitration Modulation (AAM), which enforces depth-ordered visibility by suppressing background activations in overlapping regions, and Spatial Compactness Control (SCC), which preserves structural integrity by curbing attention divergence. Together these mechanisms enable plausible occlusion generation without modifying the backbone, significantly improving occlusion accuracy and visual fidelity.
链接: https://arxiv.org/abs/2603.23924
作者: Hongjin Niu,Jiahao Wang,Xirui Hu,Weizhan Zhang,Lan Ma,Yuan Gao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Text-to-image diffusion models frequently exhibit deficiencies in synthesizing accurate occlusion relationships of multiple objects, particularly within dense overlapping regions. Existing training-free layout-guided methods predominantly rely on rigid spatial priors that remain agnostic to depth order, often resulting in concept mixing or illogical occlusion. To address these limitations, we propose DepthArb, a training-free framework that resolves occlusion ambiguities by arbitrating attention competition between interacting objects. Specifically, DepthArb employs two core mechanisms: Attention Arbitration Modulation (AAM), which enforces depth-ordered visibility by suppressing background activations in overlapping regions, and Spatial Compactness Control (SCC), which preserves structural integrity by curbing attention divergence. These mechanisms enable robust occlusion generation without model retraining. To systematically evaluate this capability, we propose OcclBench, a comprehensive benchmark designed to evaluate diverse occlusion scenarios. Extensive evaluations demonstrate that DepthArb consistently outperforms state-of-the-art baselines in both occlusion accuracy and visual fidelity. As a plug-and-play method, DepthArb seamlessly enhances the compositional capabilities of diffusion backbones, offering a novel perspective on spatial layering within generative models.
[CV-84] Uncertainty-Aware Vision-based Risk Object Identification via Conformal Risk Tube Prediction ICRA
【速读】: This paper addresses vision-based risk object identification (Vision-ROI) in intelligent driving systems: existing methods make deterministic decisions that ignore uncertainty, which in ambiguous scenarios can lead to premature or delayed risk detection and temporally unstable predictions, especially in complex scenes with multiple interacting risks. The key to the solution is Conformal Risk Tube Prediction, a unified framework that jointly models risk uncertainty across space and time, provides coverage guarantees for true risks, and produces calibrated risk scores with uncertainty estimates, thereby improving the robustness of risk identification and downstream performance.
链接: https://arxiv.org/abs/2603.23919
作者: Kai-Yu Fu,Yi-Ting Chen
机构: National Yang Ming Chiao Tung University (国立阳明交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE International Conference on Robotics and Automation (ICRA) 2026
Abstract:We study object importance-based vision risk object identification (Vision-ROI), a key capability for hazard detection in intelligent driving systems. Existing approaches make deterministic decisions and ignore uncertainty, which could lead to safety-critical failures. Specifically, in ambiguous scenarios, fixed decision thresholds may cause premature or delayed risk detection and temporally unstable predictions, especially in complex scenes with multiple interacting risks. Despite these challenges, current methods lack a principled framework to model risk uncertainty jointly across space and time. We propose Conformal Risk Tube Prediction, a unified formulation that captures spatiotemporal risk uncertainty, provides coverage guarantees for true risks, and produces calibrated risk scores with uncertainty estimates. To conduct a systematic evaluation, we present a new dataset and metrics probing diverse scenario configurations with multi-risk coupling effects, which are not supported by existing datasets. We systematically analyze factors affecting uncertainty estimation, including scenario variations, per-risk category behavior, and perception error propagation. Our method delivers substantial improvements over prior approaches, enhancing vision-ROI robustness and downstream performance, such as reducing nuisance braking alerts. For more qualitative results, please visit our project webpage: this https URL
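The spatiotemporal risk-tube construction is the paper's contribution, but the coverage guarantee it invokes comes from standard split-conformal calibration, which is easy to sketch in one dimension. The score distribution, calibration size, and `alpha` below are illustrative assumptions:

```python
import numpy as np

# Split-conformal calibration: threshold at the finite-sample-corrected
# quantile of calibration nonconformity scores; on exchangeable data, new
# scores fall below the threshold with probability >= 1 - alpha.

def conformal_threshold(scores, alpha=0.1):
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))   # corrected rank ceil((n+1)(1-alpha))
    return np.sort(scores)[min(k, n) - 1]

rng = np.random.default_rng(1)
cal = np.abs(rng.normal(size=200))       # calibration nonconformity scores
test = np.abs(rng.normal(size=2000))     # exchangeable held-out scores
tau = conformal_threshold(cal, alpha=0.1)
coverage = float((test <= tau).mean())
print(round(coverage, 3))                # empirically close to the 0.9 target
```

The appeal of this recipe for safety-critical Vision-ROI is that the guarantee is distribution-free: it requires only exchangeability between calibration and deployment scores, not a correct model of the scene.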
[CV-85] DecepGPT: Schema-Driven Deception Detection with Multicultural Datasets and Robust Multimodal Learning
【速读】: This paper addresses three problems in multimodal deception detection: the lack of verifiable evidence, weak cross-domain generalization, and the tendency toward shortcut learning under small-data conditions. The key contributions are: 1) reasoning datasets that augment existing benchmarks with structured cue-level descriptions and reasoning chains, enabling models to output auditable reports; 2) T4-Deception, a multicultural dataset collected from a unified "To Tell The Truth" television format across four countries, the largest non-laboratory deception detection dataset to date; and 3) two modules, Stabilized Individuality-Commonality Synergy (SICS) and Distilled Modality Consistency (DMC), which respectively refine multimodal representations by synergizing learnable global priors with sample-adaptive residuals, and align modality-specific predictions with the fused predictions via knowledge distillation, effectively suppressing shortcut learning and improving transfer across cultural contexts.
链接: https://arxiv.org/abs/2603.23916
作者: Jiajian Huang,Dongliang Zhu,Zitong YU,Hui Ma,Jiayu Zhang,Chunmei Zhu,Xiaochun Cao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 pages, 8 figures, 7 tables
Abstract:Multimodal deception detection aims to identify deceptive behavior by analyzing audiovisual cues for forensics and security. In these high-stakes settings, investigators need verifiable evidence connecting audiovisual cues to final decisions, along with reliable generalization across domains and cultural contexts. However, existing benchmarks provide only binary labels without intermediate reasoning cues. Datasets are also small with limited scenario coverage, leading to shortcut learning. We address these issues through three contributions. First, we construct reasoning datasets by augmenting existing benchmarks with structured cue-level descriptions and reasoning chains, enabling models to output auditable reports. Second, we release T4-Deception, a multicultural dataset based on the unified "To Tell The Truth" television format implemented across four countries. With 1695 samples, it is the largest non-laboratory deception detection dataset. Third, we propose two modules for robust learning under small-data conditions. Stabilized Individuality-Commonality Synergy (SICS) refines multimodal representations by synergizing learnable global priors with sample-adaptive residuals, followed by a polarity-aware adjustment that bi-directionally recalibrates representations. Distilled Modality Consistency (DMC) aligns modality-specific predictions with the fused multimodal predictions via knowledge distillation to prevent unimodal shortcut learning. Experiments on three established benchmarks and our novel dataset demonstrate that our method achieves state-of-the-art performance in both in-domain and cross-domain scenarios, while exhibiting superior transferability across diverse cultural contexts. The datasets and codes will be released.
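The DMC module's alignment step can be sketched as a plain distillation loss: each unimodal head is pulled toward the fused multimodal prediction via KL divergence, so no single modality can drift into a shortcut solution. The logit values and two-class setup below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

# Distillation-style consistency: KL(fused || unimodal), with the fused
# multimodal prediction acting as the teacher for each modality head.

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def kl_to_fused(unimodal_logits, fused_logits):
    log_p = log_softmax(fused_logits)      # teacher: fused prediction
    log_q = log_softmax(unimodal_logits)   # student: single-modality head
    p = np.exp(log_p)
    return float((p * (log_p - log_q)).sum())

fused = np.array([2.0, -1.0])              # fused audio-visual logits
audio_only = np.array([0.1, 0.0])          # audio head barely discriminates
visual_only = np.array([1.9, -0.9])        # visual head already agrees
# The disagreeing head receives the larger consistency penalty:
print(kl_to_fused(audio_only, fused) > kl_to_fused(visual_only, fused))
```

Minimizing this term during training pushes gradient signal into whichever modality currently disagrees with the fused decision, which is what discourages unimodal shortcuts.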
[CV-86] Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding
【速读】: This paper addresses the efficiency bottleneck in large vision-language model (VLM) inference caused by high memory overhead during decoding, especially with long sequences of visual and text tokens, such as inputs containing multiple high-resolution images or videos. The key innovations of the proposed AttentionPack framework are: (i) a multi-head attention compaction method that stores key and value matrices economically by exploiting their implicit low-rank structure, and (ii) a token-specific attention-aware decompression mechanism that reduces latency overhead. Experiments show that AttentionPack improves memory efficiency by up to 8x, enabling larger inference batches and longer context lengths while preserving model output quality.
链接: https://arxiv.org/abs/2603.23914
作者: Fatih Ilhan,Gaowen Liu,Ramana Rao Kompella,Selim Furkan Tekin,Tiansheng Huang,Zachary Yahn,Yichang Xu,Ling Liu
机构: Georgia Institute of Technology (佐治亚理工学院); Cisco Research (思科研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Large Vision-Language Models (VLMs) have achieved remarkable success in multi-modal reasoning, but their inference time efficiency remains a significant challenge due to the memory overhead during decoding, especially when the query and answer of VLMs consist of long sequences of visual and text tokens. This paper presents AttentionPack, an adaptive and attention-aware optimization framework tailored for large vision-language models with improving memory-efficiency during decoding, focusing on addressing the challenges due to the increased high number of visual inputs and interactions, particularly in long-context tasks with multiple high-resolution images or videos. AttentionPack is novel in two aspects: (i) We introduce a multi-head attention compaction method for economically storing key and value matrices by exploiting the implicit low-rank structure, and (ii) we develop a token-specific attention-aware decompression mechanism to reduce latency overhead. Experimental results on multiple benchmarks demonstrate that AttentionPack improves memory efficiency by up to 8x, enabling higher batch sizes and faster batch inference while preserving the model output quality or longer context lengths for superior retrieval performance. We also report the effectiveness of AttentionPack combined with eviction, quantization and kernel fusion, showing further efficiency gains for resource-limited environments.
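The low-rank idea behind the compaction step can be sketched with a truncated SVD: a key matrix with approximately low-rank structure is stored as two thin factors, trading a small reconstruction error for a large memory saving. The sequence length, head dimension, rank, and exact factorization below are illustrative assumptions, not AttentionPack's actual compaction scheme:

```python
import numpy as np

# Low-rank KV storage sketch: keep thin SVD factors (A, B) instead of the
# full key matrix K, and reconstruct K = A @ B on demand.

def compact(K, rank):
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank]     # thin factors A (n x r), B (r x d)

def decompress(A, B):
    return A @ B

rng = np.random.default_rng(0)
seq_len, head_dim, true_rank = 1024, 128, 8
K = rng.normal(size=(seq_len, true_rank)) @ rng.normal(size=(true_rank, head_dim))
A, B = compact(K, rank=8)
err = np.linalg.norm(K - decompress(A, B)) / np.linalg.norm(K)
saving = (A.size + B.size) / K.size
print(err < 1e-8, round(saving, 3))   # near-exact recovery at ~7% of the memory
```

Real KV caches are only approximately low-rank, so the rank choice trades memory against attention quality; the toy matrix here is exactly rank 8, which is why recovery is near-exact.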
[CV-87] GenMask: Adapting DiT for Segmentation via Direct Mask CVPR2026
【速读】: This paper addresses the representational misalignment that arises when segmentation methods indirectly extract features from pretrained generative models, which complicates the pipeline and limits adaptation. The core of the solution is to abandon indirect adaptation in favor of a direct generative training paradigm, jointly optimizing segmentation and image generation in a unified framework. The key innovation is a timestep sampling strategy for binary masks that emphasizes extreme noise levels for segmentation training while keeping moderate noise for image generation, allowing the two tasks to be trained harmoniously. Building on this, the authors present GenMask, which generates black-and-white segmentation masks and colorful RGB images directly on the original DiT architecture, without segmentation-specific feature extraction pipelines, and achieves state-of-the-art performance on referring and reasoning segmentation benchmarks.
链接: https://arxiv.org/abs/2603.23906
作者: Yuhuan Yang,Xianwei Zhuang,Yuxuan Cai,Chaofan Ma,Shuai Bai,Jiangchao Yao,Ya Zhang,Junyang Lin,Yanfeng Wang
机构: Shanghai Jiao Tong University (上海交通大学); Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Recent approaches for segmentation have leveraged pretrained generative models as feature extractors, treating segmentation as a downstream adaptation task via indirect feature retrieval. This implicit use suffers from a fundamental misalignment in representation. It also depends heavily on indirect feature extraction pipelines, which complicate the workflow and limit adaptation. In this paper, we argue that instead of indirect adaptation, segmentation tasks should be trained directly in a generative manner. We identify a key obstacle to this unified formulation: VAE latents of binary masks are sharply distributed, noise-robust, and linearly separable, distinct from natural image latents. To bridge this gap, we introduce a timestep sampling strategy for binary masks that emphasizes extreme noise levels for segmentation and moderate noise for image generation, enabling harmonious joint training. We present GenMask, a DiT trained to generate black-and-white segmentation masks as well as colorful images in RGB space under the original generative objective. GenMask preserves the original DiT architecture while removing the need for feature extraction pipelines tailored for segmentation tasks. Empirically, GenMask attains state-of-the-art performance on referring and reasoning segmentation benchmarks, and ablations quantify the contribution of each component.
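The task-dependent timestep strategy can be sketched as two different sampling distributions over diffusion timesteps: mass pushed toward the extremes for mask targets and concentrated in the middle for image targets. The exact U-shaped and bell-shaped transforms below are illustrative assumptions; the paper does not specify these distributions here:

```python
import numpy as np

# Task-dependent diffusion timestep sampler (assumed shapes): extreme noise
# levels for binary-mask targets, moderate noise for natural-image targets.

def sample_timesteps(n, task, t_max=1000, rng=None):
    rng = rng or np.random.default_rng()
    if task == "mask":
        # U-shaped: push mass toward both ends of [0, t_max]
        u = rng.uniform(size=n)
        t = 0.5 * (1 + np.sign(u - 0.5) * np.abs(2 * u - 1) ** 0.25)
    else:
        # bell-shaped: concentrate around the middle of the schedule
        t = np.clip(rng.normal(0.5, 0.15, size=n), 0, 1)
    return (t * t_max).astype(int)

rng = np.random.default_rng(0)
t_mask = sample_timesteps(100_000, "mask", rng=rng)
t_img = sample_timesteps(100_000, "image", rng=rng)
extreme = lambda t: ((t < 200) | (t > 800)).mean()
print(extreme(t_mask) > 0.6, extreme(t_img) < 0.2)   # masks see mostly extreme noise
```

This matches the abstract's observation: mask latents are noise-robust and sharply distributed, so most of their useful training signal lives at noise levels that would destroy a natural image.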
[CV-88] Latent Bias Alignment for High-Fidelity Diffusion Inversion in Real-World Image Reconstruction and Manipulation
【速读】: This paper addresses the diffusion inversion problem, i.e., generating or approximating real-world images from seed noise, a fundamental building block for bridging diffusion models and real-world scenarios. Existing methods often suffer from low reconstruction quality and weak robustness due to the misalignment between inversion and generation trajectories and the mismatch between the inversion process and VQ autoencoder (VQAE) reconstruction. The solution has two key components: first, a latent bias vector is introduced at each inversion step and learned to reduce the deviation between the inversion and generation trajectories, termed Latent Bias Optimization (LBO); second, diffusion inversion and VQAE reconstruction are approximately jointly optimized by learning to adjust the image latent representation that serves as the interface between them, termed Image Latent Boosting (ILB). Experiments show significant improvements in image reconstruction quality and in downstream tasks such as image editing and rare concept generation.
链接: https://arxiv.org/abs/2603.23903
作者: Weiming Chen,Qifan Liu,Siyi Liu,Yushun Tang,Yijia Wang,Zhihan Zhu,Zhihai He
机构: Southern University of Science and Technology (南方科技大学); Shenzhen Polytechnic University (深圳职业技术大学); Huawei Technologies Co., Ltd. (华为技术有限公司); Pengcheng Lab (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent research has shown that text-to-image diffusion models are capable of generating high-quality images guided by text prompts. But can they be used to generate or approximate real-world images from the seed noise? This is known as the diffusion inversion problem, which serves as a fundamental building block for bridging diffusion models and real-world scenarios. However, existing diffusion inversion methods often suffer from low reconstruction quality or weak robustness. Two major challenges need to be carefully addressed: (1) the misalignment between the inversion and generation trajectories during the diffusion process, and (2) the mismatch between the diffusion inversion process and the VQ autoencoder (VQAE) reconstruction. To address these challenges, we introduce a latent bias vector at each inversion step, which is learned to reduce the misalignment between inversion and generation trajectories. We refer to this strategy as Latent Bias Optimization (LBO). Furthermore, we perform an approximate joint optimization of the diffusion inversion and VQAE reconstruction processes by learning to adjust the image latent representation, which serves as the connecting interface between them. We refer to this technique as Image Latent Boosting (ILB). Extensive experimental results demonstrate that the proposed method significantly improves the image reconstruction quality of the diffusion model, as well as the performance of downstream tasks, including image editing and rare concept generation.
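The per-step bias idea can be illustrated with a toy linear stand-in for one generation step: a learnable bias added to the inverted latent is optimized by gradient descent so the generated output lands back on the target, shrinking the trajectory gap. The linear map `G`, the noise level, and the optimization settings are all assumptions for illustration, not LBO's actual frozen-diffusion setup:

```python
import numpy as np

# Toy latent-bias optimization: learn b so that G(z_inv + b) matches
# G(z_target), where G stands in for one frozen generation step.

rng = np.random.default_rng(0)
G = rng.normal(size=(8, 8)) / np.sqrt(8)       # stand-in "generation step"
z_target = rng.normal(size=8)                  # latent of the real image
z_inv = z_target + 0.3 * rng.normal(size=8)    # imperfect inversion estimate

b = np.zeros(8)                                # learnable latent bias vector
for _ in range(1000):
    residual = G @ (z_inv + b) - G @ z_target
    b -= 0.1 * (G.T @ residual)                # gradient of 0.5 * ||residual||^2

gap_before = np.linalg.norm(G @ z_inv - G @ z_target)
gap_after = np.linalg.norm(G @ (z_inv + b) - G @ z_target)
print(gap_after < gap_before)                  # True: the bias closes the gap
```

In the real method the "generation step" is a frozen diffusion network, so the bias is trained through it by backpropagation rather than with a closed-form gradient, but the objective has the same shape.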
[CV-89] Knowledge-Refined Dual Context-Aware Network for Partially Relevant Video Retrieval ICME2026
【速读】: This paper addresses two challenges in retrieving partially relevant segments from untrimmed videos: the information-density mismatch between text and video segments, and attention mechanisms that fail to capture semantic focus and event correlations. The key to the solution is KDC-Net (Knowledge-Refined Dual Context-Aware Network), which works from both textual and visual perspectives. On the text side, a Hierarchical Semantic Aggregation module adaptively fuses multi-scale phrase cues to enrich query semantics; on the video side, a Dynamic Temporal Attention mechanism combines relative positional encoding with adaptive temporal windows to highlight key events with local temporal coherence. In addition, a dynamic CLIP-based distillation strategy with temporal-continuity-aware refinement enables segment-aware, objective-aligned knowledge transfer. Experiments show that KDC-Net clearly outperforms state-of-the-art methods on PRVR benchmarks, especially under low moment-to-video ratios.
链接: https://arxiv.org/abs/2603.23902
作者: Junkai Yang,Qirui Wang,Yaoqing Jin,Shuai Ma,Minghan Xu,Shanmin Pang
机构: Xi’an Jiaotong University (西安交通大学); Universität Stuttgart (斯图加特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted in ICME 2026
Abstract:Retrieving partially relevant segments from untrimmed videos remains difficult due to two persistent challenges: the mismatch in information density between text and video segments, and limited attention mechanisms that overlook semantic focus and event correlations. We present KDC-Net, a Knowledge-Refined Dual Context-Aware Network that tackles these issues from both textual and visual perspectives. On the text side, a Hierarchical Semantic Aggregation module captures and adaptively fuses multi-scale phrase cues to enrich query semantics. On the video side, a Dynamic Temporal Attention mechanism employs relative positional encoding and adaptive temporal windows to highlight key events with local temporal coherence. Additionally, a dynamic CLIP-based distillation strategy, enhanced with temporal-continuity-aware refinement, ensures segment-aware and objective-aligned knowledge transfer. Experiments on PRVR benchmarks show that KDC-Net consistently outperforms state-of-the-art methods, especially under low moment-to-video ratios.
[CV-90] MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition-Perception-Reasoning Guided Text-Image Machine Translation CVPR2026
【速读】: This paper addresses the limited robustness of end-to-end text-image machine translation (TIMT) across diverse visual scenes and low-resource languages, as well as the immaturity of cross-modal reasoning designs in current vision-language large models (VLLMs): existing approaches either cascade parsing and translation sequentially or rely on language-only chain-of-thought (CoT), overlooking the visual cognition central to VLLMs. The key to the solution is the Cognition-Perception-Reasoning for Translation (CPR-Trans) data paradigm, which unifies scene cognition, text perception, and translation reasoning in a single coherent reasoning process and, via a VLLM-driven data generation pipeline, provides structured, interpretable supervision that aligns perception with reasoning, yielding consistent gains in translation accuracy and interpretability on both 3B and 7B models. To support rigorous evaluation, the authors also present MMTIT-Bench, a human-verified benchmark of 1,400 images spanning fourteen languages and diverse scenes.
链接: https://arxiv.org/abs/2603.23896
作者: Gengluo Li,Chengquan Zhang,Yupu Liang,Huawen Shen,Yaping Zhang,Pengyuan Lyu,Weinong Wang,Xingyu Wan,Gangyan Zeng,Han Hu,Can Ma,Yu Zhou
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); Tencent (腾讯); Nankai University (南开大学); University of Chinese Academy of Sciences (中国科学院大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Nanjing University of Science and Technology (南京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:End-to-end text-image machine translation (TIMT), which directly translates textual content in images across languages, is crucial for real-world multilingual scene understanding. Despite advances in vision-language large models (VLLMs), robustness across diverse visual scenes and low-resource languages remains underexplored due to limited evaluation resources. We present MMTIT-Bench, a human-verified multilingual and multi-scenario benchmark with 1,400 images spanning fourteen non-English and non-Chinese languages and diverse settings such as documents, scenes, and web images, enabling rigorous assessment of end-to-end TIMT. Beyond benchmarking, we study how reasoning-oriented data design improves translation. Although recent VLLMs have begun to incorporate long Chain-of-Thought (CoT) reasoning, effective thinking paradigms for TIMT are still immature: existing designs either cascade parsing and translation in a sequential manner or focus on language-only reasoning, overlooking the visual cognition central to VLLMs. We propose Cognition-Perception-Reasoning for Translation (CPR-Trans), a data paradigm that integrates scene cognition, text perception, and translation reasoning within a unified reasoning process. Using a VLLM-driven data generation pipeline, CPR-Trans provides structured, interpretable supervision that aligns perception with reasoning. Experiments on 3B and 7B models show consistent gains in accuracy and interpretability. We will release MMTIT-Bench to promote the multilingual and multi-scenario TIMT research upon acceptance.
[CV-91] FilterGS: Traversal-Free Parallel Filtering and Adaptive Shrinking for Large-Scale LoD 3D Gaussian Splatting
【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting)在大规模场景中应用时面临的两大关键问题:一是基于层级细节(Level-of-Detail)方法的串行遍历效率低下,占用超过60%的渲染时间;二是存在大量冗余的高斯-瓦片(Gaussian-tile)键值对,导致不必要的计算开销。解决方案的核心在于提出FilterGS框架,其包含两个互补的并行过滤机制,可无需树结构遍历即可高效筛选高斯元素;同时引入一种新颖的GTC(Gaussian-Tile Compressibility)指标来量化冗余程度,并据此设计场景自适应的高斯收缩策略,有效减少冗余配对,从而在保持视觉质量的同时显著提升渲染速度。
链接: https://arxiv.org/abs/2603.23891
作者: Yixian Wang,Haolin Yu,Jiadong Tang,Yu Gao,Xihan Wang,Yufeng Yue,Yi Yang
机构: Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D Gaussian Splatting has revolutionized neural rendering with real-time performance. However, scaling this approach to large scenes using Level-of-Detail methods faces critical challenges: inefficient serial traversal consuming over 60% of rendering time, and redundant Gaussian-tile pairs that incur unnecessary processing overhead. To address these limitations, we introduce FilterGS, featuring a parallel filtering mechanism with two complementary filters that select Gaussian elements efficiently without tree traversal. Additionally, we propose a novel GTC metric that quantifies the redundancy of Gaussian-tile key-value pairs. Based on this metric, we introduce a scene-adaptive Gaussian shrinking strategy that effectively reduces redundant pairs. Extensive experiments demonstrate that FilterGS achieves state-of-the-art rendering speeds while maintaining competitive visual quality across multiple large-scale datasets. Project page: this https URL
[CV-92] Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training CVPR2026
【速读】:该论文旨在解决当前端到端文档解析(end-to-end document parsing)中存在的结构不一致、重复预测和幻觉问题,这些问题主要源于缺乏大规模高质量的全页级(document-level)标注数据以及缺乏结构感知的训练策略。解决方案的关键在于提出一种数据与训练协同设计(data-training co-design)框架:首先通过“真实场景合成”(Realistic Scene Synthesis)策略生成结构多样且规模庞大的全页监督数据,其次引入“文档感知训练配方”(Document-Aware Training Recipe),结合渐进式学习和结构标记优化(structure-token optimization),显著提升模型的结构保真度与解码稳定性。该方法在真实世界文档基准Wild-OmniDocBench上验证了其鲁棒性与准确性。
链接: https://arxiv.org/abs/2603.23885
作者: Gengluo Li,Chengquan Zhang,Yupu Liang,Huawen Shen,Yaping Zhang,Pengyuan Lyu,Weinong Wang,Xingyu Wan,Gangyan Zeng,Han Hu,Can Ma,Yu Zhou
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); Tencent (腾讯); Nankai University (南开大学); University of Chinese Academy of Sciences (中国科学院大学); Nanjing University of Science and Technology (南京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Document parsing has recently advanced with multimodal large language models (MLLMs) that directly map document images to structured outputs. Traditional cascaded pipelines depend on precise layout analysis and often fail under casually captured or non-standard conditions. Although end-to-end approaches mitigate this dependency, they still exhibit repetitive, hallucinated, and structurally inconsistent predictions - primarily due to the scarcity of large-scale, high-quality full-page (document-level) end-to-end parsing data and the lack of structure-aware training strategies. To address these challenges, we propose a data-training co-design framework for robust end-to-end document parsing. A Realistic Scene Synthesis strategy constructs large-scale, structurally diverse full-page end-to-end supervision by composing layout templates with rich document elements, while a Document-Aware Training Recipe introduces progressive learning and structure-token optimization to enhance structural fidelity and decoding stability. We further build Wild-OmniDocBench, a benchmark derived from real-world captured documents for robustness evaluation. Integrated into a 1B-parameter MLLM, our method achieves superior accuracy and robustness across both scanned/digital and real-world captured scenarios. All models, data synthesis pipelines, and benchmarks will be publicly released to advance future research in document understanding.
[CV-93] BioVITA: Biological Dataset Model and Benchmark for Visual-Textual-Acoustic Alignment CVPR2026
【速读】:该论文旨在解决从多模态数据(图像、文本和音频)中理解动物物种的挑战,尤其是在生态学与计算机视觉交叉领域中,如何有效整合音频模态以提升物种识别能力的问题。现有生物模型如BioCLIP虽在图像与文本分类任务上表现优异,但对音频信息的利用仍处于探索阶段。解决方案的关键在于提出BioVITA框架,其核心包括:(1) 构建包含130万段音频和230万张图像的大型训练数据集,覆盖14,133个物种及34种生态特征标签;(2) 基于BioCLIP2设计两阶段训练策略,实现音频表征与视觉及文本表征的有效对齐;(3) 提出跨模态检索基准,涵盖三种模态间的全部方向(如图像→音频、音频→文本等),并在家族、属、种三个分类层级上验证性能。实验表明,该方法可学习统一的语义表示空间,超越传统分类体系,显著推动多模态生物多样性认知的发展。
链接: https://arxiv.org/abs/2603.23883
作者: Risa Shinoda,Kaede Shiohara,Nakamasa Inoue,Kuniaki Saito,Hiroaki Santo,Fumio Okura
机构: The University of Osaka (大阪大学); The University of Tokyo (东京大学); Institute of Science Tokyo (东京科学大学); OMRON SINIC X (欧姆龙Sinic X)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 Main
Abstract:Understanding animal species from multimodal data poses an emerging challenge at the intersection of computer vision and ecology. While recent biological models, such as BioCLIP, have demonstrated strong alignment between images and textual taxonomic information for species identification, the integration of the audio modality remains an open problem. We propose BioVITA, a novel visual-textual-acoustic alignment framework for biological applications. BioVITA involves (i) a training dataset, (ii) a representation model, and (iii) a retrieval benchmark. First, we construct a large-scale training dataset comprising 1.3 million audio clips and 2.3 million images, covering 14,133 species annotated with 34 ecological trait labels. Second, building upon BioCLIP2, we introduce a two-stage training framework to effectively align audio representations with visual and textual representations. Third, we develop a cross-modal retrieval benchmark that covers all possible directional retrieval across the three modalities (i.e., image-to-audio, audio-to-text, text-to-image, and their reverse directions), with three taxonomic levels: Family, Genus, and Species. Extensive experiments demonstrate that our model learns a unified representation space that captures species-level semantics beyond taxonomy, advancing multimodal biodiversity understanding. The project page is available at: this https URL
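BioVITA 的跨模态检索基准(如 audio-to-image)本质上是在统一表示空间中做余弦相似度最近邻检索。下面给出一个示意性草图(假设性实现,非论文官方代码;嵌入维度、物种数量与噪声幅度均为随意设定的示例):

```python
import numpy as np

def l2norm(x):
    """Normalize embeddings to unit length so dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve(query, gallery, k=1):
    """Cosine-similarity retrieval: rank gallery items (another modality) for each query."""
    sims = l2norm(query) @ l2norm(gallery).T
    return np.argsort(-sims, axis=-1)[:, :k]

rng = np.random.default_rng(0)
species = rng.normal(size=(5, 32))                            # 每个物种一个共享语义嵌入(示例)
audio = l2norm(species + 0.05 * rng.normal(size=(5, 32)))     # 音频嵌入落在对应物种附近
image = l2norm(species + 0.05 * rng.normal(size=(5, 32)))     # 图像嵌入落在对应物种附近
top1 = retrieve(audio, image)                                 # audio-to-image 检索
print(top1.ravel())
```

当三种模态被对齐到同一空间后,六个方向的检索(image/audio/text 两两互查)都可归约为这一套操作。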
[CV-94] EnvSocial-Diff: A Diffusion-Based Crowd Simulation Model with Environmental Conditioning and Individual-Group Interaction ICLR2026
【速读】:该论文旨在解决现有行人轨迹建模方法在模拟真实人群行为时对环境因素和多层级社会互动建模不足的问题。当前大多数方法主要关注社会动力学,而忽略了环境约束(如障碍物、兴趣点和光照)以及个体与群体之间的复杂交互关系。其解决方案的关键在于提出一种基于扩散模型的群体仿真框架EnvSocial-Diff,该框架通过两个核心模块实现:一是结构化的环境条件模块,显式编码场景中的障碍物、兴趣对象和光照水平,提供可解释的环境约束与吸引信号;二是个体-群体交互模块,利用图结构设计同时捕捉细粒度的人际关系和群体层面的一致性行为。实验表明,这种融合环境感知与多层次社会互动的机制显著提升了轨迹预测的准确性与真实性。
链接: https://arxiv.org/abs/2603.23874
作者: Bingxue Zhao,Qi Zhang,Hui Huang
机构: VCC, College of Computer Science and Software Engineering, Shenzhen University (深圳大学计算机与软件工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR 2026
Abstract:Modeling realistic pedestrian trajectories requires accounting for both social interactions and environmental context, yet most existing approaches largely emphasize social dynamics. We propose EnvSocial-Diff: a diffusion-based crowd simulation model informed by social physics and augmented with environmental conditioning and individual–group interaction. Our structured environmental conditioning module explicitly encodes obstacles, objects of interest, and lighting levels, providing interpretable signals that capture scene constraints and attractors. In parallel, the individual–group interaction module goes beyond individual-level modeling by capturing both fine-grained interpersonal relations and group-level conformity through a graph-based design. Experiments on multiple benchmark datasets demonstrate that EnvSocial-Diff outperforms the latest state-of-the-art methods, underscoring the importance of explicit environmental conditioning and multi-level social interaction for realistic crowd simulation. Code is here: this https URL.
[CV-95] MLE-UVAD: Minimal Latent Entropy Autoencoder for Fully Unsupervised Video Anomaly Detection ECCV2026
【速读】:该论文旨在解决单场景、完全无监督视频异常检测(VAD)问题,即在不依赖任何标签的情况下,直接使用包含正常与异常事件的原始视频进行训练和测试。传统方法通常需要大量标注数据(全监督或弱监督)或仅使用正常视频(一类分类),这些方法易受分布偏移和污染影响。解决方案的关键在于提出一种熵引导的自编码器(entropy-guided autoencoder),其核心创新是将标准重建损失与一种新颖的最小潜在熵(Minimal Latent Entropy, MLE)损失相结合:重建损失使正常和异常帧在潜在空间中形成分离簇,而MLE损失通过最小化潜在嵌入的熵,促使嵌入集中在高密度区域;由于正常帧占主导地位,稀疏的异常嵌入被拉入正常簇,从而迫使解码器聚焦于正常模式,导致异常帧重建质量显著下降,形成清晰的重建差异,实现有效检测。
链接: https://arxiv.org/abs/2603.23868
作者: Yuang Geng,Junkai Zhou,Kang Yang,Pan He,Zhuoyang Zhou,Jose C. Principe,Joel Harley,Ivan Ruchkin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to ECCV 2026. 18 pages, 8 figures. Includes supplementary material
Abstract:In this paper, we address the challenging problem of single-scene, fully unsupervised video anomaly detection (VAD), where raw videos containing both normal and abnormal events are used directly for training and testing without any labels. This differs sharply from prior work that either requires extensive labeling (fully or weakly supervised) or depends on normal-only videos (one-class classification), which are vulnerable to distribution shifts and contamination. We propose an entropy-guided autoencoder that detects anomalies through reconstruction error by reconstructing normal frames well while making anomalies reconstruct poorly. The key idea is to combine the standard reconstruction loss with a novel Minimal Latent Entropy (MLE) loss in the autoencoder. Reconstruction loss alone maps normal and abnormal inputs to distinct latent clusters due to their inherent differences, but also risks reconstructing anomalies too well to detect. Therefore, MLE loss addresses this by minimizing the entropy of latent embeddings, encouraging them to concentrate around high-density regions. Since normal frames dominate the raw video, sparse anomalous embeddings are pulled into the normal cluster, so the decoder emphasizes normal patterns and produces poor reconstructions for anomalies. This dual-loss design produces a clear reconstruction gap that enables effective anomaly detection. Extensive experiments on two widely used benchmarks and a challenging self-collected driving dataset demonstrate that our method achieves robust and superior performance over baselines.
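MLE-UVAD 的双损失设计(重建损失 + 最小潜在熵损失)可以用如下示意性草图说明。注意:这里用「对每个样本的潜在向量做 softmax 后计算香农熵」作为潜在熵的一种说明性近似,论文的具体熵定义可能不同;所有维度均为示例:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mle_style_loss(x, x_hat, z, lam=0.1):
    """Reconstruction loss plus a latent-entropy penalty (illustrative proxy)."""
    recon = np.mean((x - x_hat) ** 2)                 # 标准重建项
    p = softmax(z)                                    # 每个样本在潜在维度上的分布
    entropy = -np.mean(np.sum(p * np.log(p + 1e-12), axis=1))
    return recon + lam * entropy, recon, entropy

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
z_peaked = np.zeros((8, 4)); z_peaked[:, 0] = 10.0    # 集中的潜在嵌入 -> 低熵
z_flat = np.zeros((8, 4))                             # 均匀的潜在嵌入 -> 最大熵 log(4)
_, _, h_peaked = mle_style_loss(x, x, z_peaked)
_, _, h_flat = mle_style_loss(x, x, z_flat)
print(h_peaked < h_flat)  # True:熵项奖励向高密度区域集中的嵌入
```

直观上,熵项把稀疏的异常嵌入拉向正常簇,使解码器只擅长重建正常模式,从而放大异常帧的重建误差。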
[CV-96] Can VLMs Reason Robustly? A Neuro-Symbolic Investigation
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在分布偏移(distribution shifts)下推理能力不足的问题,特别是针对协变量偏移(covariate shift)场景——即感知输入分布发生变化但底层预测规则保持不变的情形。研究发现,通过梯度驱动的端到端微调训练的VLMs虽能在分布内(in-distribution)任务中取得高准确率,却难以在分布外(out-of-distribution)条件下保持鲁棒性,表明微调无法可靠地诱导出稳定的推理函数。为此,论文提出一种神经符号方法VLC(Visual Logic Circuit),其关键在于将感知与推理解耦:利用VLM进行对象概念识别,并将任务逻辑规则编译为基于电路的符号程序,从而对VLM识别出的对象概念执行精确的符号推理。实验表明,VLC在三种具有不同规则集的视觉演绎推理任务中均展现出一致的鲁棒性能,验证了该方法在支持稳定推理方面的有效性。
链接: https://arxiv.org/abs/2603.23867
作者: Weixin Chen,Antonio Vergari,Han Zhao
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); University of Edinburgh (爱丁堡大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-Language Models (VLMs) have been applied to a wide range of reasoning tasks, yet it remains unclear whether they can reason robustly under distribution shifts. In this paper, we study covariate shifts in which the perceptual input distribution changes while the underlying prediction rules do not. To investigate this question, we consider visual deductive reasoning tasks, where a model is required to answer a query given an image and logical rules defined over the object concepts in the image. Empirically, we find that VLMs fine-tuned through gradient-based end-to-end training can achieve high in-distribution accuracy but fail to generalize under such shifts, suggesting that fine-tuning does not reliably induce the underlying reasoning function. This motivates a neuro-symbolic perspective that decouples perception from reasoning. However, we further observe that recent neuro-symbolic approaches that rely on black-box components for reasoning can still exhibit inconsistent robustness across tasks. To address this issue, we propose VLC, a neuro-symbolic method that combines VLM-based concept recognition with circuit-based symbolic reasoning. In particular, task rules are compiled into a symbolic program, specifically a circuit, which executes the rules exactly over the object concepts recognized by the VLM. Experiments on three visual deductive reasoning tasks with distinct rule sets show that VLC consistently achieves strong performance under covariate shifts, highlighting its ability to support robust reasoning.
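VLC 将任务规则编译为符号程序(电路),对 VLM 识别出的对象概念做精确执行。下面用一条假设性的布尔规则演示这种「感知与推理解耦」的结构(规则内容与概念字典均为示例,并非论文使用的任务规则):

```python
# 假设性规则:当且仅当 (red AND circle) OR (blue AND square) 时回答 "yes"。
def eval_rule(concepts):
    """Exact symbolic evaluation of a fixed rule over VLM-recognized object concepts."""
    c = concepts
    return (c["red"] and c["circle"]) or (c["blue"] and c["square"])

# 以下字典是 VLM 概念识别输出的替身(流水线中的神经部分)。
img1 = {"red": True, "circle": True, "blue": False, "square": False}
img2 = {"red": True, "circle": False, "blue": False, "square": True}
print(eval_rule(img1), eval_rule(img2))  # True False
```

由于规则执行是精确的符号计算,协变量偏移只会影响概念识别这一环节,这正是该方法鲁棒性的来源。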
[CV-97] See Remember Explore: A Benchmark and Baselines for Streaming Spatial Reasoning
【速读】:该论文旨在解决当前空间视觉语言模型(Spatial VLM)在实际部署中面临的两大关键问题:一是缺乏对长时序流式推理(long-horizon streaming inference)的支持,二是忽视了当当前视图信息不足时需通过主动感知(active perception)获取缺失证据的能力。为应对这一挑战,作者提出S3-Bench基准套件,其设计融合仿真环境与真实世界流式视频数据,支持时间锚定的问答任务,并要求模型仅基于截至当前时刻的观测进行推理。解决方案的核心在于提出AMF-VLM模型,其关键技术包括:(i) 内存折叠(memory folding),将长时间序列观测压缩为结构化紧凑记忆以适应有限计算资源;(ii) 主动探索机制,输出显式的动作指令(如移动、旋转、扫描)来主动采集缺失信息后再作答。实验表明,该方法在模拟和真实场景下分别提升8.8%和13.3%的准确率,同时保持向标准空间基准的良好迁移能力。
链接: https://arxiv.org/abs/2603.23864
作者: Yuxi Wei,Wei Huang,Qirui Chen,Lu Hou,Xiaojuan Qi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Spatial understanding is fundamental for embodied agents, yet most spatial VLMs and benchmarks remain offline, evaluating post-hoc QA over pre-recorded inputs and overlooking two crucial deployment-critical requirements: long-horizon streaming inference and active perception when the current view is insufficient. To address this gap, we introduce S3-Bench, a benchmark suite for streaming spatial question answering with active exploration, where queries are temporally grounded to specific timestamps and must be answered using only observations available up to that moment. S3-Bench adopts a dual-domain design, combining a scalable simulator with controllable trajectories and exploration actions, and real-world streaming videos that capture practical sensing artifacts for rigorous generalization evaluation. Overall, it spans 10K+ scenes and 26K+ trajectories, with dedicated training (S3-Train) and evaluation (S3-Eval) splits. We further propose AMF-VLM, which supports streaming spatial reasoning under bounded computing via (i) memory folding, which compresses long-horizon observations into compact structured memory, and (ii) active exploration, which outputs explicit actions (e.g. move/rotate/scan) to acquire missing evidence before answering. Extensive experiments demonstrate that, compared to models using identical training data, our approach yields improvements of 8.8% and 13.3% on the simulated and real splits of S3-Eval, respectively, while maintaining competitive transferability to standard spatial benchmarks.
[CV-98] 3D-LLDM: Label-Guided 3D Latent Diffusion Model for Improving High-Resolution Synthetic MR Imaging in Hepatic Structure Segmentation
【速读】:该论文旨在解决医学影像领域中高质量标注数据稀缺的问题,特别是在肝细胞癌(hepatocellular carcinoma, HCC)的磁共振(MR)图像分析中,由于真实临床数据有限且标注成本高,导致深度学习模型训练受限。其解决方案的关键在于提出一种标签引导的3D潜在扩散模型(3D-LLDM),通过引入ControlNet架构实现结构约束下的体积生成:利用含钆塞酸二钠(Gd-EOB-DTPA)增强的肝胆期MR图像自动提取肝脏、门静脉、肝静脉及HCC的解剖分割掩膜,并以此作为条件指导合成高质量、配准准确的三维MR体积及其对应分割图。该方法在720例真实临床数据上训练后,在FID指标上优于GANs 70.9%和当前最优扩散模型26.7%,并显著提升下游HCC分割任务性能(Dice分数最高提升11.153%)。
链接: https://arxiv.org/abs/2603.23845
作者: Kyeonghun Kim,Jaehyeok Bae,Youngung Han,Joo Young Bae,Seoyoung Ju,Junsu Lim,Gyeongmin Kim,Nam-Joon Kim,Woo Kyoung Jeong,Ken Ying-Kai Liao,Won Jae Lee,Pa Hong,Hyuk-Jae Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ISBI 2026 (Oral). Camera-ready version
Abstract:Deep learning and generative models are advancing rapidly, with synthetic data increasingly being integrated into training pipelines for downstream analysis tasks. However, in medical imaging, their adoption remains constrained by the scarcity of reliable annotated datasets. To address this limitation, we propose 3D-LLDM, a label-guided 3D latent diffusion model that generates high-quality synthetic magnetic resonance (MR) volumes with corresponding anatomical segmentation masks. Our approach uses hepatobiliary phase MR images enhanced with the Gd-EOB-DTPA contrast agent to derive structural masks for the liver, portal vein, hepatic vein, and hepatocellular carcinoma, which then guide volumetric synthesis through a ControlNet-based architecture. Trained on 720 real clinical hepatobiliary phase MR scans from Samsung Medical Center, 3D-LLDM achieves a Fréchet Inception Distance (FID) of 28.31, improving over GANs by 70.9% and over state-of-the-art diffusion baselines by 26.7%. When used for data augmentation, the synthetic volumes improve hepatocellular carcinoma segmentation by up to 11.153% Dice score across five CNN architectures.
[CV-99] Sparse Autoencoders for Interpretable Medical Image Representation Learning
【速读】:该论文旨在解决医学视觉基础模型(Vision Foundation Models, FMs)中抽象潜在表示不可解释的问题,即临床医生无法直接 interrogate或验证这些模型所编码的信息。解决方案的关键在于使用稀疏自动编码器(Sparse Autoencoders, SAEs)替代原有的黑箱式图像表示,将其转换为人类可理解的稀疏特征。通过在BiomedParse和DINOv3提取的嵌入上训练SAEs,并利用TotalSegmentator数据集中的909,873张CT和MRI二维切片进行优化,研究发现:所学稀疏特征不仅能以高保真度重建原始嵌入(R²高达0.941),还能仅用10个特征就恢复高达87.8%的下游任务性能(实现99.4%的维度压缩),同时保持图像检索中的语义一致性,并可通过大语言模型(LLM)自动解释为自然语言概念,从而在零样本条件下实现基于语言驱动的图像检索,打通临床语言与抽象潜在表示之间的鸿沟。
链接: https://arxiv.org/abs/2603.23794
作者: Philipp Wesp,Robbie Holland,Vasiliki Sideri-Lampretsa,Sergios Gatidis
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 11 pages, 4 figures
Abstract:Vision foundation models (FMs) achieve state-of-the-art performance in medical imaging. However, they encode information in abstract latent representations that clinicians cannot interrogate or verify. The goal of this study is to investigate Sparse Autoencoders (SAEs) for replacing opaque FM image representations with human-interpretable, sparse features. We train SAEs on embeddings from BiomedParse (biomedical) and DINOv3 (general-purpose) using 909,873 CT and MRI 2D image slices from the TotalSegmentator dataset. We find that learned sparse features: (a) reconstruct original embeddings with high fidelity (R2 up to 0.941) and recover up to 87.8% of downstream performance using only 10 features (99.4% dimensionality reduction), (b) preserve semantic fidelity in image retrieval tasks, (c) correspond to specific concepts that can be expressed in language using large language model (LLM)-based auto-interpretation, (d) bridge clinical language and abstract latent representations in zero-shot language-driven image retrieval. Our work indicates SAEs are a promising pathway towards interpretable, concept-driven medical vision systems. Code repository: this https URL.
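稀疏自动编码器(SAE)的基本结构是:ReLU 编码器产生稀疏非负特征,线性解码器重建原始嵌入,训练目标为重建误差加 L1 稀疏惩罚。下面是一个示意性前向草图(权重随机初始化,维度为示例,非论文官方实现):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 32, 128   # 假设的 FM 嵌入维度与 SAE 字典大小(示例)

W_enc = rng.normal(scale=0.1, size=(d_sae, d_model))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_model, d_sae))
b_dec = np.zeros(d_model)

def sae_forward(x, l1=1e-3):
    """One SAE forward pass: sparse features, reconstruction, and the training loss."""
    z = np.maximum(0.0, x @ W_enc.T + b_enc)      # ReLU 编码器 -> 稀疏、非负特征
    x_hat = z @ W_dec.T + b_dec                   # 线性解码器重建 FM 嵌入
    recon_loss = np.mean((x - x_hat) ** 2)
    sparsity_loss = l1 * np.mean(np.abs(z))       # L1 惩罚驱动大多数特征为零
    return z, x_hat, recon_loss + sparsity_loss

x = rng.normal(size=(4, d_model))                 # 一批(替身)FM 嵌入
z, x_hat, loss = sae_forward(x)
print(z.shape, x_hat.shape)                       # (4, 128) (4, 32)
```

过完备(d_sae > d_model)加稀疏约束,正是「少数激活特征对应可命名概念」这一可解释性的结构基础。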
[CV-100] Re-Prompting SAM 3 via Object Retrieval: 3rd of the 5th PVUW MOSE Track
【速读】:该论文旨在解决复杂场景下的半监督视频对象分割(semi-supervised video object segmentation, SS-VOS)问题,尤其针对目标物体在视频中出现消失与重新出现、剧烈形变以及强同类干扰物等挑战。其解决方案的关键在于构建一个基于SAM~3的自动重提示(re-prompting)框架:首先利用SAM~3检测器在后续帧中识别同类别候选对象,再通过DINOv3提取的对象级特征匹配机制,结合考虑形变的锚点特征池,精准检索可靠的靶标锚点;这些锚点与首帧掩码一同注入SAM~3跟踪器,实现多锚点传播而非仅依赖初始提示,从而显著提升模型在动态变化和干扰环境下的鲁棒性。
链接: https://arxiv.org/abs/2603.23788
作者: Mingqi Gao,Sijie Li,Jungong Han
机构: University of Sheffield (谢菲尔德大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This technical report explores the MOSEv2 track of the PVUW 2026 Challenge, which targets complex semi-supervised video object segmentation. Built on SAM 3, we develop an automatic re-prompting framework to improve robustness under target disappearance and reappearance, severe transformation, and strong same-category distractors. Our method first applies the SAM 3 detector to later frames to identify same-category object candidates, and then performs DINOv3-based object-level matching with a transformation-aware target feature pool to retrieve reliable target anchors. These anchors are injected back into the SAM 3 tracker together with the first-frame mask, enabling multi-anchor propagation rather than relying solely on the initial prompt. This simple design directly benefits several core challenges of MOSEv2. Our solution achieves a J&F score of 51.17% on the test set, ranking 3rd in the MOSEv2 track.
[CV-101] Retinal Disease Classification from Fundus Images using CNN Transfer Learning
【速读】:该论文旨在解决视网膜疾病(retinal diseases)早期筛查难以普及的问题,特别是在资源匮乏人群中实现自动化、可扩展的早期诊断。其解决方案的关键在于构建一个可复现的深度学习流水线,通过对比基础卷积神经网络(CNN)与基于预训练VGG16模型的迁移学习方法,在公开的眼底图像数据集上进行二分类风险评估。实验表明,迁移学习策略显著提升了模型性能(测试准确率达90.8%,加权F1-score为0.90),优于基础CNN(准确率83.1%),同时揭示了在少数类别病例敏感性方面仍存在挑战,强调了类不平衡处理和阈值选择对临床可靠性的重要性。
链接: https://arxiv.org/abs/2603.23785
作者: Ali Akram
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 4 figures
Abstract:Retinal diseases remain among the leading preventable causes of visual impairment worldwide. Automated screening based on fundus image analysis has the potential to expand access to early detection, particularly in underserved populations. This paper presents a reproducible deep learning pipeline for binary retinal disease risk classification from publicly available fundus photographs. We implement and compare a baseline convolutional neural network with a transfer learning approach using a pretrained VGG16 backbone and evaluate generalization on held-out data. To address class imbalance, we apply class weighting and report standard classification metrics including accuracy, precision, recall, F1-score, confusion matrices, and ROC-AUC. The VGG16 transfer learning model achieves 90.8% test accuracy with a weighted F1-score of 0.90, substantially outperforming the baseline CNN (83.1% accuracy). Results indicate that transfer learning improves discrimination compared to a baseline CNN, while also revealing remaining challenges in sensitivity to minority disease cases. We discuss practical limitations related to dataset characteristics, class imbalance, and threshold selection, and provide guidance for reproducibility and future improvements for clinically reliable screening.
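摘要提到的类别加权(class weighting)通常按逆频率计算,例如 scikit-learn 风格的 "balanced" 公式 w_c = n / (k · n_c),其中 n 为样本总数、k 为类别数、n_c 为该类样本数。下面是一个纯标准库的示意计算(900 正常 / 100 患病的比例仅为说明性示例,非论文数据):

```python
from collections import Counter

def balanced_class_weights(labels):
    """'balanced' weights: w_c = n_samples / (n_classes * count_c)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# 示例:严重不平衡的筛查数据集,900 张正常 vs 100 张高风险图像。
labels = [0] * 900 + [1] * 100
w = balanced_class_weights(labels)
print(w)  # {0: 0.5555..., 1: 5.0}
```

得到的权重字典可直接传给常见框架的损失加权接口,使少数类样本在训练中占更大梯度比重。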
[CV-102] Semantic Iterative Reconstruction: One-Shot Universal Anomaly Detection
【速读】:该论文旨在解决无监督医学异常检测(Unsupervised Medical Anomaly Detection)中因正常样本稀缺而导致的模型泛化能力差的问题。现有方法通常需为每个数据集或疾病单独训练专用模型,要求每任务数百张正常图像,且缺乏跨模态泛化能力。其解决方案的关键在于提出语义迭代重构框架(Semantic Iterative Reconstruction, SIR),该框架利用预训练教师编码器提取多尺度深层特征,并设计一个紧凑的“上采样-下采样”解码器配合多轮迭代优化,在深层特征空间中强化正常先验;通过仅使用来自九个异构数据集各一张正常样本进行联合训练,实现单一通用模型对所有测试集的零样本迁移检测,无需任务特定微调。
链接: https://arxiv.org/abs/2603.23766
作者: Ning Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 2 figures, 5 tables
Abstract:Unsupervised medical anomaly detection is severely limited by the scarcity of normal training samples. Existing methods typically train dedicated models for each dataset or disease, requiring hundreds of normal images per task and lacking cross-modality generalization. We propose Semantic Iterative Reconstruction (SIR), a framework that enables a single universal model to detect anomalies across diverse medical domains using extremely few normal samples. SIR leverages a pretrained teacher encoder to extract multi-scale deep features and employs a compact up-then-down decoder with multi-loop iterative refinement to enforce robust normality priors in deep feature space. The framework adopts a one-shot universal design: a single model is trained by mixing exactly one normal sample from each of nine heterogeneous datasets, enabling effective anomaly detection on all corresponding test sets without task-specific retraining. Extensive experiments on nine medical benchmarks demonstrate that SIR achieves state-of-the-art under all four settings – one-shot universal, full-shot universal, one-shot specialized, and full-shot specialized – consistently outperforming previous methods. SIR offers an efficient and scalable solution for multi-domain clinical anomaly detection. Code is available at this https URL.
[CV-103] Learning Cross-Joint Attention for Generalizable Video-Based Seizure Detection
【速读】:该论文旨在解决基于视频的癫痫发作自动检测方法在跨受试者场景下泛化能力差的问题,其核心挑战在于现有方法易受背景干扰和个体外观特征依赖的影响。解决方案的关键在于提出一种以关节为中心的注意力模型(joint-centric attention model),通过检测视频中人体关键点并提取以关节为中心的局部片段来抑制背景信息,随后利用Video Vision Transformer(ViViT)对这些片段进行编码,并引入跨关节注意力机制建模不同身体部位之间的时空交互关系,从而捕捉癫痫发作特有的协调运动模式,显著提升对未见受试者的泛化性能。
链接: https://arxiv.org/abs/2603.23757
作者: Omar Zamzam,Takfarinas Medani,Chinmay Chinara,Richard Leahy
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Automated seizure detection from long-term clinical videos can substantially reduce manual review time and enable real-time monitoring. However, existing video-based methods often struggle to generalize to unseen subjects due to background bias and reliance on subject-specific appearance cues. We propose a joint-centric attention model that focuses exclusively on body dynamics to improve cross-subject generalization. For each video segment, body joints are detected and joint-centered clips are extracted, suppressing background context. These joint-centered clips are tokenized using a Video Vision Transformer (ViViT), and cross-joint attention is learned to model spatial and temporal interactions between body parts, capturing coordinated movement patterns characteristic of seizure semiology. Extensive cross-subject experiments show that the proposed method consistently outperforms state-of-the-art CNN-, graph-, and transformer-based approaches on unseen subjects.
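跨关节注意力(cross-joint attention)的核心是把每个关节的视频片段 token 视为序列元素,在关节之间做自注意力,以建模身体部位间的协调运动。下面是一个单头缩放点积注意力的示意草图(为简洁起见投影矩阵取单位阵,关节数与维度均为示例,非论文官方实现):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_joint_attention(tokens):
    """Single-head self-attention over per-joint clip tokens (J joints, d dims)."""
    J, d = tokens.shape
    Wq, Wk, Wv = np.eye(d), np.eye(d), np.eye(d)   # 示意:用单位投影代替学习到的投影
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    A = softmax(Q @ K.T / np.sqrt(d))              # (J, J):每个关节对其他关节的注意力权重
    return A @ V, A

rng = np.random.default_rng(0)
tokens = rng.normal(size=(17, 64))   # 例如 17 个人体关节、64 维 ViViT 片段嵌入(示例)
out, A = cross_joint_attention(tokens)
print(out.shape, np.allclose(A.sum(axis=1), 1.0))  # (17, 64) True
```

注意力矩阵 A 直接给出「哪些身体部位在协同运动」,这与癫痫发作症候学的可解释分析天然契合。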
[CV-104] IJmond Industrial Smoke Segmentation Dataset
【速读】:该论文旨在解决工业烟雾(industrial smoke)图像分割问题,以实现对烟雾区域的精确识别与定位。其解决方案的关键在于构建并公开了一个专门用于工业烟雾分割的数据集,该数据集发布于figshare平台,采用CC BY 4.0许可协议,为相关研究提供了高质量、可复现的基准资源,从而推动生成式AI (Generative AI) 或深度学习模型在工业环境监测中的应用与发展。
链接: https://arxiv.org/abs/2603.23754
作者: Yen-Chia Hsu,Despoina Touska
机构: University of Amsterdam (阿姆斯特丹大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This report describes a dataset for industrial smoke segmentation, published on a figshare repository (this https URL). The dataset is licensed under CC BY 4.0.
[CV-105] Detection and Classification of (Pre)Cancerous Cells in Pap Smears: An Ensemble Strategy for the RIVA Cervical Cytology Challenge
【速读】:该论文旨在解决宫颈细胞图像中多类别检测的难题,特别是在常规巴氏涂片(Pap smear)图像中因严重类别不平衡和细胞核重叠导致的检测性能下降问题。其核心解决方案是基于YOLOv11m架构,系统评估三种缓解类别不平衡的策略——损失重加权(loss reweighting)、数据重采样(data resampling)与迁移学习(transfer learning),并通过构建集成模型(ensemble)融合各策略训练出的子模型,利用加权框融合(Weighted Boxes Fusion, WBF)方法实现互补检测行为的协同优化。实验表明,该集成方法在最终测试集上相较最优单模型提升29%的mAP50-95指标,验证了组合多种不平衡缓解策略的有效性。
链接: https://arxiv.org/abs/2603.23742
作者: Lautaro Kogan,María Victoria Ríos
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for Poster Presentation at the RIVA Cervical Cytology Challenge, IEEE ISBI 2026. 4 pages, 2 figures
Abstract:Automated detection and classification of cervical cells in conventional Pap smear images can strengthen cervical cancer screening at scale by reducing manual workload, improving triage, and increasing consistency across readers. However, it is challenged by severe class imbalance and frequent nuclear overlap. We present our approach to the RIVA Cervical Cytology Challenge (ISBI 2026), which requires multi-class detection of eight Bethesda cell categories under these conditions. Using YOLOv11m as the base architecture, we systematically evaluate three strategies to improve detection performance: loss reweighting, data resampling and transfer learning. We build an ensemble by combining models trained under each strategy, promoting complementary detection behavior and combining them through Weighted Boxes Fusion (WBF). The ensemble achieves a mAP50-95 of 0.201 on the preliminary test set and 0.147 on the final test set, representing a 29% improvement over the best individual model on the final test set and demonstrating the effectiveness of combining complementary imbalance mitigation strategies.
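加权框融合(WBF)的基本思路是:按 IoU 把各模型的重叠检测框聚成簇,再用置信度加权平均融合每簇的坐标。下面是一个贪心的简化草图(仅作原理说明;官方 WBF 实现在簇匹配与置信度重标定等细节上更复杂):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def weighted_boxes_fusion(boxes, scores, iou_thr=0.55):
    """Greedy WBF: cluster overlapping boxes, fuse each cluster by score-weighted average."""
    order = np.argsort(scores)[::-1]
    clusters = []  # each: (list of boxes, list of scores)
    for i in order:
        for c in clusters:
            if iou(boxes[i], c[0][0]) > iou_thr:   # 与簇代表框(首个框)匹配
                c[0].append(boxes[i]); c[1].append(scores[i])
                break
        else:
            clusters.append(([boxes[i]], [scores[i]]))
    fused = []
    for bs, ss in clusters:
        w, b = np.asarray(ss), np.asarray(bs)
        fused.append(((w[:, None] * b).sum(0) / w.sum(), w.mean()))
    return fused

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [200, 200, 240, 240]], float)
scores = np.array([0.9, 0.6, 0.8])
fused = weighted_boxes_fusion(boxes, scores)
print(len(fused))  # 2:前两个高度重叠的框被融合为一个
```

与 NMS 丢弃次优框不同,WBF 保留并平均所有重叠证据,这正是它适合融合互补子模型的原因。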
[CV-106] An Adapter-free Fine-tuning Approach for Tuning 3D Foundation Models ICPR
【速读】:该论文旨在解决点云基础模型在低数据场景下微调时面临的挑战,即全量微调易导致过拟合和预训练表征漂移,而现有参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法虽能缓解此问题,却因引入额外可训练组件而增加推理延迟。其解决方案的关键在于提出一种无适配器(adapter-free)的动量一致性微调(Momentum-Consistency Fine-Tuning, MCFT)方法:通过选择性地微调预训练编码器的一部分,并施加基于动量的一致性约束以保留任务无关的表征,从而在不增加模型参数和推理开销的前提下实现稳定且高效的下游适应。
链接: https://arxiv.org/abs/2603.23730
作者: Sneha Paul,Zachary Patterson,Nizar Bouguila
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at The Fifth International Conference on Pattern Recognition and Artificial Intelligence (ICPRAI 2026)
Abstract:Point cloud foundation models demonstrate strong generalization, yet adapting them to downstream tasks remains challenging in low-data regimes. Full fine-tuning often leads to overfitting and significant drift from pre-trained representations, while existing parameter-efficient fine-tuning (PEFT) methods mitigate this issue by introducing additional trainable components at the cost of increased inference-time latency. We propose Momentum-Consistency Fine-Tuning (MCFT), an adapter-free approach that bridges the gap between full and parameter-efficient fine-tuning. MCFT selectively fine-tunes a portion of the pre-trained encoder while enforcing a momentum-based consistency constraint to preserve task-agnostic representations. Unlike PEFT methods, MCFT introduces no additional representation learning parameters beyond a standard task head, maintaining the original model’s parameter count and inference efficiency. We further extend MCFT with two variants: a semi-supervised framework that leverages abundant unlabeled data to enhance few-shot performance, and a pruning-based variant that improves computational efficiency through structured layer removal. Extensive experiments on object recognition and part segmentation benchmarks demonstrate that MCFT consistently outperforms prior methods, achieving a 3.30% gain in 5-shot settings and up to a 6.13% improvement with semi-supervised learning, while remaining well-suited for resource-constrained deployment.
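MCFT 的动量一致性约束可以分解为两步:对编码器权重做指数滑动平均(EMA)得到动量副本,再用一致性损失惩罚微调特征偏离动量特征。下面是一个示意性草图(权重用随机向量替身,更新步数与超参均为示例,非论文官方实现):

```python
import numpy as np

def momentum_update(theta_m, theta, m=0.99):
    """EMA update of the momentum copy: theta_m <- m * theta_m + (1 - m) * theta."""
    return m * theta_m + (1.0 - m) * theta

def consistency_loss(f_student, f_momentum):
    """Penalize drift of fine-tuned features away from the momentum-anchored features."""
    return np.mean((f_student - f_momentum) ** 2)

rng = np.random.default_rng(0)
theta = rng.normal(size=100)          # 可训练编码器权重的替身
theta_m = theta.copy()                # 动量副本,从预训练权重初始化
for _ in range(10):                   # 模拟若干训练步
    theta = theta + 0.1 * rng.normal(size=100)   # 假想的梯度更新
    theta_m = momentum_update(theta_m, theta)
drift = np.linalg.norm(theta - theta_m)
print(drift > 0)   # 动量权重滞后于微调权重,起到锚定表征的作用
```

由于动量副本只在训练期维护、推理时丢弃,该方法不像 PEFT 适配器那样引入推理时延。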
[CV-107] Bi-CRCL: Bidirectional Conservative-Radical Complementary Learning with Pre-trained Foundation Models for Class-incremental Medical Image Analysis
【速读】:该论文旨在解决医学图像引导诊断中的类别增量学习(Class-incremental learning, CIL)问题,即在不遗忘已有疾病类别知识的前提下,持续适应新出现的疾病类别,以支持可扩展的临床部署。由于医疗数据的异质性和隐私限制导致无法使用记忆回放机制,且医学影像领域对领域特定适配要求高,传统方法难以有效应对。解决方案的关键在于提出双向保守-激进互补学习(Bidirectional Conservative-Radical Complementary Learning, Bi-CRCL)框架,其核心是通过双学习器机制实现知识稳定与快速适应的平衡:保守学习器采用稳定性导向更新保留历史知识,激进学习器采用可塑性导向学习快速适应新类别;并通过双向交互机制实现前向迁移与后向巩固,使新旧知识得以持续融合并缓解灾难性遗忘。
链接: https://arxiv.org/abs/2603.23729
作者: Xinyao Wu,Zhe Xu,Cheng Chen,Jiawei Ma,Yefeng Zheng,Raymond Kai-yu Tong
机构: The Chinese University of Hong Kong (香港中文大学); Columbia University (哥伦比亚大学); The University of Hong Kong (香港大学); City University of Hong Kong (香港城市大学); Westlake University (西湖大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: preprint; under review
Abstract:Class-incremental learning (CIL) in medical image-guided diagnosis requires retaining prior diagnostic knowledge while adapting to newly emerging disease categories, which is critical for scalable clinical deployment. This problem is particularly challenging due to heterogeneous data and privacy constraints that prevent memory replay. Although pretrained foundation models (PFMs) have advanced general-domain CIL, their potential in medical imaging remains underexplored, where domain-specific adaptation is essential yet difficult due to anatomical complexity and inter-institutional heterogeneity. To address this gap, we conduct a systematic benchmark of recent PFM-based CIL methods and propose Bidirectional Conservative-Radical Complementary Learning (Bi-CRCL), a dual-learner framework inspired by complementary learning systems. Bi-CRCL integrates a conservative learner that preserves prior knowledge through stability-oriented updates and a radical learner that rapidly adapts to new categories via plasticity-oriented learning. A bidirectional interaction mechanism enables forward transfer and backward consolidation, allowing continual integration of new knowledge while mitigating catastrophic forgetting. During inference, outputs from both learners are adaptively fused for robust predictions. Experiments on five medical imaging datasets demonstrate consistent improvements over state-of-the-art methods under diverse settings, including cross-dataset shifts and varying task configurations.
[CV-108] Mind the Hitch: Dynamic Calibration and Articulated Perception for Autonomous Trucks CVPR2026
【速读】:该论文旨在解决自动驾驶卡车在复杂工况下因车头与挂车之间存在动态关节(第五轮连接点)和柔性变形导致的传感器位姿时变问题,从而影响感知系统精度和鲁棒性的问题。传统感知与标定方法依赖静态基准或高视差、纹理丰富的场景,难以适应真实道路环境中频繁的快速转向和遮挡情况。解决方案的关键在于提出dCAP(dynamic Calibration and Articulated Perception)框架,其核心创新是利用带有跨视角和时序注意力机制的Transformer模型,持续估计牵引车与挂车摄像头之间的6-DoF相对位姿,在保持时空一致性的同时聚合空间线索,实现对动态几何变化的自适应补偿;该方案通过替换BEVFormer中的静态外参为动态预测结果,显著提升了3D目标检测性能,有效克服了静态标定在实际应用中的局限性。
链接: https://arxiv.org/abs/2603.23711
作者: Morui Zhu,Yongqi Zhu,Song Fu,Qing Yang
机构: University of North Texas (北德克萨斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted to CVPR2026
Abstract:Autonomous trucking poses unique challenges due to articulated tractor-trailer geometry, and time-varying sensor poses caused by the fifth-wheel joint and trailer flex. Existing perception and calibration methods assume static baselines or rely on high-parallax and texture-rich scenes, limiting their reliability under real-world settings. We propose dCAP (dynamic Calibration and Articulated Perception), a vision-based framework that continuously estimates the 6-DoF (degree of freedom) relative pose between tractor and trailer cameras. dCAP employs a transformer with cross-view and temporal attention to robustly aggregate spatial cues while maintaining temporal consistency, enabling accurate perception under rapid articulation and occlusion. Integrated with BEVFormer, dCAP improves 3D object detection by replacing static calibration with dynamically predicted extrinsics. To facilitate evaluation, we introduce STT4AT, a CARLA-based benchmark simulating semi-trailer trucks with synchronized multi-sensor suites and time-varying inter-rig geometry across diverse environments. Experiments demonstrate that dCAP achieves stable, accurate perception while addressing the limitations of static calibration in autonomous trucking. The dataset, development kit, and source code will be publicly released.
[CV-109] CoRe: Joint Optimization with Contrastive Learning for Medical Image Registration
【速读】:该论文旨在解决医学图像配准中因强度不一致性和非线性组织形变导致的鲁棒性不足问题。其解决方案的关键在于将等变对比学习(equivariant contrastive learning)直接集成到配准模型中,通过联合优化对比学习与配准目标,使学习到的特征表示既具备对组织形变的不变性,又具有任务相关的判别能力,从而显著提升配准性能。
链接: https://arxiv.org/abs/2603.23694
作者: Eytan Kats,Christoph Grossbroehmer,Ziad Al-Haj Hemidi,Fenja Falta,Wiebke Heyer,Mattias P. Heinrich
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint
Abstract:Medical image registration is a fundamental task in medical image analysis, enabling the alignment of images from different modalities or time points. However, intensity inconsistencies and nonlinear tissue deformations pose significant challenges to the robustness of registration methods. Recent approaches leveraging self-supervised representation learning show promise by pre-training feature extractors to generate robust anatomical embeddings that are further used for registration. In this work, we propose a novel framework that integrates equivariant contrastive learning directly into the registration model. Our approach leverages the power of contrastive learning to learn robust feature representations that are invariant to tissue deformations. By jointly optimizing the contrastive and registration objectives, we ensure that the learned representations are not only informative but also suitable for the registration task. We evaluate our method on abdominal and thoracic image registration tasks, including both intra-patient and inter-patient scenarios. Experimental results demonstrate that the integration of contrastive learning directly into the registration framework significantly improves performance, surpassing strong baseline methods.
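"联合优化对比与配准目标"可以写成一个加权的复合损失。以下 NumPy 草图为假设性示意(InfoNCE 形式与权重 lam 均为此处选取),仅演示"配准损失 + 对比损失"的联合目标结构,并非论文原代码:

```python
import numpy as np

def info_nce(za, zb, tau=0.1):
    """对角线为正样本对的 InfoNCE 对比损失(示意)。"""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / tau
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_prob)))

def joint_loss(reg_loss, za, zb, lam=0.5):
    """配准损失与对比损失的加权联合目标(lam 为假设的权重)。"""
    return reg_loss + lam * info_nce(za, zb)

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 32))
loss_aligned = info_nce(z, z)          # 对齐的正样本对,损失小
loss_shuffled = info_nce(z, z[::-1])   # 错配的样本对,损失大
```

对齐良好的特征对损失显著小于错配的特征对,这正是联合训练希望特征同时"可判别且利于配准"的驱动力。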
[CV-110] AdvSplat: Adversarial Attacks on Feed-Forward Gaussian Splatting Models
【速读】:该论文旨在解决生成式3D高斯溅射(3D Gaussian Splatting, 3DGS)模型在实际应用中面临的安全性问题,特别是针对无优化的前馈式3DGS模型易受对抗攻击的漏洞。传统3DGS依赖于场景级优化,限制了可扩展性和泛化能力,而新兴的前馈式3DGS虽提升了效率和部署潜力,但其基于神经网络的架构也引入了对抗样本风险。论文提出AdvSplat,首次系统性地研究了对前馈3DGS的对抗攻击方法,其关键在于设计两种高效、实用的黑盒攻击算法:一种基于梯度估计,另一种为无梯度方法,二者均通过频域参数化优化像素空间扰动,在无需访问模型内部结构的情况下实现对输入图像的微小扰动,从而显著破坏重建质量。实验表明,AdvSplat可在多个数据集上成功实施隐蔽且高效的对抗攻击,揭示了该领域亟需关注的鲁棒性与安全性挑战。
链接: https://arxiv.org/abs/2603.23686
作者: Yiran Qiao,Yiren Lu,Yunlai Zhou,Rui Yang,Linlin Hou,Yu Yin,Jing Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D Gaussian Splatting (3DGS) is increasingly recognized as a powerful paradigm for real-time, high-fidelity 3D reconstruction. However, its per-scene optimization pipeline limits scalability and generalization, and prevents efficient inference. Recently emerged feed-forward 3DGS models address these limitations by enabling fast reconstruction from a few input views after large-scale pretraining, without scene-specific optimization. Despite their advantages and strong potential for commercial deployment, the use of neural networks as the backbone also amplifies the risk of adversarial manipulation. In this paper, we introduce AdvSplat, the first systematic study of adversarial attacks on feed-forward 3DGS. We first employ white-box attacks to reveal fundamental vulnerabilities of this model family. We then develop two improved, practically relevant, query-efficient black-box algorithms that optimize pixel-space perturbations via a frequency-domain parameterization: one based on gradient estimation and the other gradient-free, without requiring any access to model internals. Extensive experiments across multiple datasets demonstrate that AdvSplat can significantly disrupt reconstruction results by injecting imperceptible perturbations into the input images. Our findings surface an overlooked yet urgent problem in this domain, and we hope to draw the community’s attention to this emerging security and robustness challenge.
[CV-111] MoCHA: Denoising Caption Supervision for Motion-Text Retrieval
【速读】:该论文旨在解决文本-动作检索系统中因标注文本具有分布特性而导致的嵌入空间方差问题。具体而言,同一动作可能对应多个不同描述(由不同标注者生成),这些描述包含可从3D关节坐标中恢复的动作语义(如动作类型、身体部位、方向性)以及标注者特有的风格和推断上下文,而标准对比学习将每条文本视为单一正样本,忽略了这种分布结构,从而导致同一动作内嵌入差异增大,削弱了文本与动作之间的对齐效果。解决方案的关键在于提出MoCHA(Motion Canonicalization Framework),通过在编码前将每条文本投影到其可从动作中恢复的内容上,实现文本标准化(canonicalization),从而生成更紧凑的正样本簇并提升嵌入分离度。该方法为通用预处理步骤,兼容任意检索架构,并通过引入基于大语言模型(LLM)和轻量级蒸馏FlanT5的两种学习型标准化器,显著提升了跨数据集迁移能力与检索性能。
链接: https://arxiv.org/abs/2603.23684
作者: Nikolai Warner,Cameron Ethan Taylor,Irfan Essa,Apaar Sadhwani
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Text-motion retrieval systems learn shared embedding spaces from motion-caption pairs via contrastive objectives. However, each caption is not a deterministic label but a sample from a distribution of valid descriptions: different annotators produce different text for the same motion, mixing motion-recoverable semantics (action type, body parts, directionality) with annotator-specific style and inferred context that cannot be determined from 3D joint coordinates alone. Standard contrastive training treats each caption as the single positive target, overlooking this distributional structure and inducing within-motion embedding variance that weakens alignment. We propose MoCHA, a text canonicalization framework that reduces this variance by projecting each caption onto its motion-recoverable content prior to encoding, producing tighter positive clusters and better-separated embeddings. Canonicalization is a general principle: even deterministic rule-based methods improve cross-dataset transfer, though learned canonicalizers provide substantially larger gains. We present two learned variants: an LLM-based approach (GPT-5.2) and a distilled FlanT5 model requiring no LLM at inference time. MoCHA operates as a preprocessing step compatible with any retrieval architecture. Applied to MoPa (MotionPatches), MoCHA sets a new state of the art on both HumanML3D (H) and KIT-ML (K): the LLM variant achieves 13.9% T2M R@1 on H (+3.1pp) and 24.3% on K (+10.3pp), while the LLM-free T5 variant achieves gains of +2.5pp and +8.1pp. Canonicalization reduces within-motion text-embedding variance by 11-19% and improves cross-dataset transfer substantially, with H to K improving by 94% and K to H by 52%, demonstrating that standardizing the language space yields more transferable motion-language representations.
[CV-112] Prototype Fusion: A Training-Free Multi-Layer Approach to OOD Detection
【速读】:该论文旨在解决安全关键场景中对分布外(out-of-distribution, OOD)样本检测的可靠性问题,当前主流方法依赖神经网络倒数第二层(penultimate-layer)激活特征作为判别依据,但其有效性假设尚未被充分验证。论文通过实证发现,中间层同样蕴含丰富且具有判别力的特征信息,从而提出一种模型无关的多层特征聚合方案:在训练阶段,从连续卷积块中提取特征并计算类别均值嵌入,经L₂归一化后构建紧凑的类内原型(ID prototypes);推理时,利用测试样本与各原型之间的余弦相似度作为OOD评分——ID样本会与至少一个原型高度相关,而OOD样本则保持均匀低相似度。该方法显著提升了OOD检测性能,在多个主流基准上平均提升AUROC达4.41%,同时降低假阳性率(FPR)达13.58%,揭示了多层特征聚合作为潜在有效信号的价值,挑战了传统以倒数第二层为核心的检测范式。
链接: https://arxiv.org/abs/2603.23677
作者: Shreen Gul,Mohamed Elmahallawy,Ardhendu Tripathy,Sanjay Madria
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Deep learning models are increasingly deployed in safety-critical applications, where reliable out-of-distribution (OOD) detection is essential to ensure robustness. Existing methods predominantly rely on the penultimate-layer activations of neural networks, assuming they encapsulate the most informative in-distribution (ID) representations. In this work, we revisit this assumption to show that intermediate layers encode equally rich and discriminative information for OOD detection. Based on this observation, we propose a simple yet effective model-agnostic approach that leverages internal representations across multiple layers. Our scheme aggregates features from successive convolutional blocks, computes class-wise mean embeddings, and applies L_2 normalization to form compact ID prototypes capturing class semantics. During inference, cosine similarity between test features and these prototypes serves as an OOD score–ID samples exhibit strong affinity to at least one prototype, whereas OOD samples remain uniformly distant. Extensive experiments on state-of-the-art OOD benchmarks across diverse architectures demonstrate that our approach delivers robust, architecture-agnostic performance and strong generalization for image classification. Notably, it improves AUROC by up to 4.41% and reduces FPR by 13.58%, highlighting multi-layer feature aggregation as a powerful yet underexplored signal for OOD detection, challenging the dominance of penultimate-layer-based methods. Our code is available at: this https URL.
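该打分流程(类均值原型 + L2 归一化 + 余弦相似度)本身只需几行代码即可落地。以下 NumPy 示意(数据与维度均为假设)演示单层特征上的原型构建与 OOD 打分;多层聚合时可对各层得分再取平均:

```python
import numpy as np

def build_prototypes(feats, labels, n_classes):
    """类均值嵌入,L2 归一化后作为 ID 原型。"""
    protos = np.stack([feats[labels == c].mean(axis=0) for c in range(n_classes)])
    return protos / np.linalg.norm(protos, axis=1, keepdims=True)

def ood_score(x, protos):
    """与最近原型的余弦相似度;得分低意味着远离所有 ID 原型,判为 OOD。"""
    x = x / np.linalg.norm(x)
    return float(np.max(protos @ x))

rng = np.random.default_rng(0)
feats = np.concatenate([rng.normal(5, 1, (20, 16)), rng.normal(-5, 1, (20, 16))])
labels = np.array([0] * 20 + [1] * 20)
protos = build_prototypes(feats, labels, 2)
s_id = ood_score(rng.normal(5, 1, 16), protos)    # 类 0 附近的 ID 样本
s_ood = ood_score(rng.normal(0, 1, 16), protos)   # 远离两类的样本
```

ID 样本与至少一个原型高度相关,得分接近 1;OOD 样本与所有原型相似度均低,由此形成免训练的判别信号。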
[CV-113] Bio-Inspired Event-Based Visual Servoing for Ground Robots
【速读】:该论文旨在解决传统视觉伺服控制中因依赖连续帧图像而导致的高计算开销与延迟问题,尤其是在地面机器人运动控制场景下。其核心挑战在于如何实现低延迟、高效率的状态感知与反馈控制,同时避免复杂的状态估计过程。解决方案的关键在于利用动态视觉传感器(Dynamic Vision Sensor, DVS)产生的异步事件流,通过固定空间核对结构化对数强度变化模式进行处理,从而在数学上解析地分离出特定的运动学状态(如速度和位置-速度乘积),并基于多模式刺激直接合成非线性状态反馈项,无需传统状态观测器。此外,为克服事件感知在平衡点处固有的线性可观测性丧失问题,提出一种仿生主动感知极限环控制器,实验证明该方法在1/10尺度自主地面车辆上具有极低延迟和高效计算特性。
链接: https://arxiv.org/abs/2603.23672
作者: Maral Mordad,Kian Behzad,Debojyoti Biswas,Noah J. Cowan,Milad Siami
机构: Northeastern University (东北大学); Johns Hopkins University (约翰霍普金斯大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Biological sensory systems are inherently adaptive, filtering out constant stimuli and prioritizing relative changes, likely enhancing computational and metabolic efficiency. Inspired by active sensing behaviors across a wide range of animals, this paper presents a novel event-based visual servoing framework for ground robots. Utilizing a Dynamic Vision Sensor (DVS), we demonstrate that by applying a fixed spatial kernel to the asynchronous event stream generated from structured logarithmic intensity-change patterns, the resulting net event flux analytically isolates specific kinematic states. We establish a generalized theoretical bound for this event rate estimator and show that linear and quadratic spatial profiles isolate the robot’s velocity and position-velocity product, respectively. Leveraging these properties, we employ a multi-pattern stimulus to directly synthesize a nonlinear state-feedback term entirely without traditional state estimation. To overcome the inescapable loss of linear observability at equilibrium inherent in event sensing, we propose a bio-inspired active sensing limit-cycle controller. Experimental validation on a 1/10-scale autonomous ground vehicle confirms the efficacy, extreme low-latency, and computational efficiency of the proposed direct-sensing approach.
[CV-114] Estimating Individual Tree Height and Species from UAV Imagery
【速读】:该论文旨在解决森林生物量精准估算中个体树木高度与物种识别的难题,传统方法依赖于地面测量或低分辨率遥感数据,难以实现高精度、大范围的个体树级参数获取。其解决方案的关键在于提出首个面向树中心视角无人机(UAV)影像的基准数据集BIRCH-Trees,并设计了一种基于视觉基础模型(Vision Foundation Model, VFM)的统一框架DINOvTree,通过共享特征提取主干网络与任务特定头部结构,实现同时预测树木高度和物种分类,显著提升了模型效率与性能,在仅使用第二优方案54%–58%参数量的情况下达到最优综合表现。
链接: https://arxiv.org/abs/2603.23669
作者: Jannik Endres,Etienne Laliberté,David Rolnick,Arthur Ouaknine
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project page: this https URL
Abstract:Accurate estimation of forest biomass, a major carbon sink, relies heavily on tree-level traits such as height and species. Unoccupied Aerial Vehicles (UAVs) capturing high-resolution imagery from a single RGB camera offer a cost-effective and scalable approach for mapping and measuring individual trees. We introduce BIRCH-Trees, the first benchmark for individual tree height and species estimation from tree-centered UAV images, spanning three datasets: temperate forests, tropical forests, and boreal plantations. We also present DINOvTree, a unified approach using a Vision Foundation Model (VFM) backbone with task-specific heads for simultaneous height and species prediction. Through extensive evaluations on BIRCH-Trees, we compare DINOvTree against commonly used vision methods, including VFMs, as well as biological allometric equations. We find that DINOvTree achieves top overall results with accurate height predictions and competitive classification accuracy while using only 54% to 58% of the parameters of the second-best approach.
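"共享主干 + 任务特定头"的结构可示意如下(NumPy,权重随机初始化,仅展示前向结构;DINOvTree 实际以 VFM 为主干提取特征,此处省略主干本身,维度与类别数均为假设):

```python
import numpy as np

rng = np.random.default_rng(0)

class TwoHeadModel:
    """共享特征 -> 高度回归头 + 树种分类头(结构示意)。"""
    def __init__(self, d_feat=64, n_species=10):
        self.w_height = rng.normal(size=d_feat) * 0.01            # 回归头
        self.W_species = rng.normal(size=(d_feat, n_species)) * 0.01  # 分类头
    def forward(self, feat):
        height = float(feat @ self.w_height)             # 标量树高预测
        species = int(np.argmax(feat @ self.W_species))  # 树种类别索引
        return height, species

model = TwoHeadModel()
h, s = model.forward(rng.normal(size=64))
```

两个头共享同一特征,这正是该框架参数量远低于独立双模型方案的原因。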
[CV-115] Foundation Model Embeddings Meet Blended Emotions: A Multimodal Fusion Approach for the BLEMORE Challenge
【速读】:该论文旨在解决混合情绪识别(blended emotion recognition)中相对显著性预测(relative salience prediction)的问题,即在多模态输入(如面部、语音和身体语言)下准确识别复合情绪并量化各模态的贡献权重。解决方案的关键在于构建一个由12个编码器组成的集成系统,通过晚期概率融合(late probability fusion)整合多种模态特征:包括基于软标签KL训练的S4D-ViTMoE人脸编码器、选择性冻结的Wav2Vec2语音编码器(仅使用第6–12层以保留韵律信息)、微调的身体语言编码器(TimeSformer、VideoMAE),以及首次应用于情绪识别的Gemini Embedding 2.0大模型视频嵌入(仅需2秒输入即可达到0.320的presence accuracy)。实验表明,非端到端微调策略优于全模型微调,个性化表达风格是主要瓶颈,且任务适配编码器占整体集成权重的62%,凸显了模态特异性设计的重要性。
链接: https://arxiv.org/abs/2603.23650
作者: Masoumeh Chapariniya,Aref Farhadipour,Sarah Ebling,Volker Dellwo,Teodora Vukovic
机构: University of Zürich (苏黎世大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present our system for the BLEMORE Challenge at FG 2026 on blended emotion recognition with relative salience prediction. Our approach combines six encoder families through late probability fusion: an S4D-ViTMoE face encoder adapted with soft-label KL training, frozen layer-selective Wav2Vec2 audio features, finetuned body-language encoders (TimeSformer, VideoMAE), and – for the first time in emotion recognition – Gemini Embedding 2.0, a large multimodal model whose video embeddings produce competitive presence accuracy (ACCP = 0.320) from only 2 seconds of input. Three key findings emerge from our experiments: selecting prosody-encoding layers (6–12) from frozen Wav2Vec2 outperforms end-to-end finetuning (Score 0.207 vs. 0.161), as the non-verbal nature of BLEMORE audio makes phonetic layers irrelevant; the post-processing salience threshold β varies from 0.05 to 0.43 across folds, revealing that personalized expression styles are the primary bottleneck; and task-adapted encoders collectively receive 62% of ensemble weight over general-purpose baselines. Our 12-encoder system achieves Score = 0.279 (ACCP = 0.391, ACCS = 0.168) on the test set, placing 6th.
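晚期概率融合与显著性阈值 β 的后处理可示意如下(NumPy;编码器权重、β 取值与概率数值均为此处假设,并非该系统实际数值):

```python
import numpy as np

def late_fuse(probs_list, weights):
    """按归一化权重对多个编码器的类别概率做晚期融合。"""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * p for wi, p in zip(w, probs_list))

def salience_postprocess(p_fused, beta=0.25):
    """融合概率超过阈值 beta 的情绪视为"存在",再在其间归一化得相对显著性。"""
    present = p_fused >= beta
    sal = np.where(present, p_fused, 0.0)
    if sal.sum() > 0:
        sal = sal / sal.sum()
    return present, sal

p1 = np.array([0.6, 0.3, 0.1])   # 人脸编码器输出(假设)
p2 = np.array([0.4, 0.4, 0.2])   # 语音编码器输出(假设)
p = late_fuse([p1, p2], weights=[0.7, 0.3])
present, sal = salience_postprocess(p)
```

阈值 β 的取值直接决定哪些情绪被判为"存在",这也印证了论文"β 跨折波动大、个性化表达风格是主要瓶颈"的发现。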
[CV-116] λSplit: Self-Supervised Content-Aware Spectral Unmixing for Fluorescence Microscopy
【速读】:该论文旨在解决荧光显微成像中因发射光谱重叠和噪声干扰导致的传统光谱解混方法性能下降的问题。传统方法依赖像素级最小二乘拟合,难以应对高噪声、强光谱重叠或低维光谱数据场景。其解决方案的关键在于提出一种物理信息驱动的深度生成模型——λSplit,该模型基于分层变分自编码器(Hierarchical Variational Autoencoder)学习浓度图的条件分布,并通过一个全可微的光谱混合器(Spectral Mixer)确保与图像形成过程的一致性,同时利用学习到的结构先验实现卓越的解混效果和隐式去噪能力,从而在多种挑战性基准下显著优于10种基线方法,包括经典算法和现有学习方法。
链接: https://arxiv.org/abs/2603.23647
作者: Federico Carrara,Talley Lambert,Mehdi Seifi,Florian Jug
机构: Fondazione Human Technopole, Milan, Italy; Università Campus Bio-Medico, Rome, Italy; Harvard Medical School, Boston, MA, USA
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 25 pages supplement, 16 figures total, 14 tables total
Abstract:In fluorescence microscopy, spectral unmixing aims to recover individual fluorophore concentrations from spectral images that capture mixed fluorophore emissions. Since classical methods operate pixel-wise and rely on least-squares fitting, their performance degrades with increasingly overlapping emission spectra and higher levels of noise, suggesting that a data-driven approach that can learn and utilize a structural prior might lead to improved results. Learning-based approaches for spectral imaging do exist, but they are either not optimized for microscopy data or are developed for very specific cases that are not applicable to fluorescence microscopy settings. To address this, we propose λSplit, a physics-informed deep generative model that learns a conditional distribution over concentration maps using a hierarchical Variational Autoencoder. A fully differentiable Spectral Mixer enforces consistency with the image formation process, while the learned structural priors enable state-of-the-art unmixing and implicit noise removal. We demonstrate λSplit on 3 real-world datasets that we synthetically cast into a total of 66 challenging spectral unmixing benchmarks. We compare our results against a total of 10 baseline methods, including classical methods and a range of learning-based methods. Our results consistently show competitive performance and improved robustness in high noise regimes, when spectra overlap considerably, or when the spectral dimensionality is lowered, making λSplit a new state-of-the-art for spectral unmixing of fluorescent microscopy data. Importantly, λSplit is compatible with spectral data produced by standard confocal microscopes, enabling immediate adoption without specialized hardware modifications.
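其中"光谱混合器"对应线性混合的成像前向模型:各通道观测是浓度图与各荧光团发射光谱的乘积。以下 NumPy 示意(维度与数值为假设)展示该前向过程及据此定义的一致性损失:

```python
import numpy as np

def spectral_mix(conc, spectra):
    """conc: (H, W, F) 荧光团浓度; spectra: (F, C) 各荧光团在各通道的发射强度。"""
    return conc @ spectra  # (H, W, C) 混合后的光谱观测

def consistency_loss(obs, conc, spectra):
    """预测观测与真实观测的均方误差,约束解混结果符合成像模型。"""
    return float(np.mean((spectral_mix(conc, spectra) - obs) ** 2))

rng = np.random.default_rng(0)
conc_true = rng.uniform(size=(8, 8, 3))   # 假设 3 种荧光团
spectra = rng.uniform(size=(3, 5))        # 假设 5 个光谱通道
obs = spectral_mix(conc_true, spectra)
```

由于混合器完全可微,这一物理约束可以与 VAE 的生成目标端到端联合优化。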
[CV-117] Stochastic Ray Tracing for the Reconstruction of 3D Gaussian Splatting
【速读】:该论文旨在解决基于光线追踪的3D高斯溅射(3D Gaussian Splatting, 3DGS)方法在重建与渲染效率上的瓶颈问题,尤其是传统方法因需对每条光线沿线所有相交高斯进行排序而导致的计算开销过大,同时现有方案仍依赖光栅化近似(如阴影贴图)来实现可重光照场景,削弱了光线追踪本应具备的通用性优势。其核心解决方案是提出一种可微分且无需排序的随机光线追踪公式,通过一个无偏蒙特卡洛估计器仅对每条光线采样少量高斯点进行像素颜色梯度计算,从而跳过排序步骤;该方法在标准3DGS中实现了与光栅化方法相当的重建质量和速度,同时显著优于基于排序的光线追踪,在可重光照3DGS中则利用完全光线追踪的阴影射线驱动每个高斯的着色,显著提升了重建保真度。
链接: https://arxiv.org/abs/2603.23637
作者: Peiyu Xu,Xin Sun,Krishna Mullia,Raymond Fei,Iliyan Georgiev,Shuang Zhao
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Adobe Research (Adobe 研究院); Canva Research (Canva 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Ray-tracing-based 3D Gaussian splatting (3DGS) methods overcome the limitations of rasterization – rigid pinhole camera assumptions, inaccurate shadows, and lack of native reflection or refraction – but remain slower due to the cost of sorting all intersecting Gaussians along every ray. Moreover, existing ray-tracing methods still rely on rasterization-style approximations such as shadow mapping for relightable scenes, undermining the generality that ray tracing promises. We present a differentiable, sorting-free stochastic formulation for ray-traced 3DGS – the first framework that uses stochastic ray tracing to both reconstruct and render standard and relightable 3DGS scenes. At its core is an unbiased Monte Carlo estimator for pixel-color gradients that evaluates only a small sampled subset of Gaussians per ray, bypassing the need for sorting. For standard 3DGS, our method matches the reconstruction quality and speed of rasterization-based 3DGS while substantially outperforming sorting-based ray tracing. For relightable 3DGS, the same stochastic estimator drives per-Gaussian shading with fully ray-traced shadow rays, delivering notably higher reconstruction fidelity than prior work.
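其核心是"免排序的子集采样无偏估计"。以下 NumPy 示意仅演示这一通用原理:对沿射线的 N 项贡献做 k 次均匀采样并乘以 N/k,即得总和的无偏估计(真实 3DGS 的像素颜色是 alpha 合成而非简单求和,此处为简化假设):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_sum(contribs, k):
    """以 k 个均匀样本无偏估计 contribs 的总和(免排序、免全量遍历)。"""
    n = len(contribs)
    idx = rng.integers(0, n, size=k)     # 有放回均匀采样
    return n * contribs[idx].mean()      # 期望恰为 contribs.sum()

contribs = rng.uniform(size=100)         # 假设的逐高斯贡献
true_sum = contribs.sum()
est_mean = np.mean([mc_sum(contribs, k=16) for _ in range(2000)])
```

每条射线只需评估 k 个而非全部 N 个高斯,且无需按深度排序,这正是该方法速度优势的来源。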
[CV-118] Ukrainian Visual Word Sense Disambiguation Benchmark
【速读】:该论文旨在解决乌克兰语视觉词义消歧(Visual Word Sense Disambiguation, Visual-WSD)任务的评估基准缺失问题,以支持跨语言多模态模型性能的系统性比较。其解决方案的关键在于:借鉴已有的英语、意大利语和波斯语视觉词义消歧基准构建方法,通过半自动采集并经领域专家校验的方式构建乌克兰语视觉词义消歧数据集,并在此基础上对八种多语言多模态大语言模型进行评测,结果表明当前模型在乌克兰语上的表现显著低于英文零样本CLIP基线模型,揭示了语言间性能差距的存在。
链接: https://arxiv.org/abs/2603.23627
作者: Yurii Laba,Yaryna Mohytych,Ivanna Rohulia,Halyna Kyryleyza,Hanna Dydyk-Meush,Oles Dobosevych,Rostyslav Hryniv
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:This study presents a benchmark for evaluating the Visual Word Sense Disambiguation (Visual-WSD) task in Ukrainian. The main goal of the Visual-WSD task is to identify, with minimal contextual information, the most appropriate representation of a given ambiguous word from a set of ten images. To construct this benchmark, we followed a methodology similar to that proposed by (CITATION), who previously introduced benchmarks for the Visual-WSD task in English, Italian, and Farsi. This approach allows us to incorporate the Ukrainian benchmark into a broader framework for cross-language model performance comparisons. We collected the benchmark data semi-automatically and refined it with input from domain experts. We then assessed eight multilingual and multimodal large language models using this benchmark. All tested models performed worse than the zero-shot CLIP-based baseline model (CITATION) used by (CITATION) for the English Visual-WSD task. Our analysis revealed a significant performance gap in the Visual-WSD task between Ukrainian and English.
[CV-119] M3T: Discrete Multi-Modal Motion Tokens for Sign Language Production
【速读】:该论文旨在解决手语生成中非手动特征(Non-manual Features, NMFs)难以有效建模的问题。现有3D手语生成系统受限于标准人体模型面部空间维度不足,无法充分表达口部动作、眉毛抬升、眼神方向和头部运动等语法必需的非手动特征;同时,当采用更丰富的表示时,传统离散分词方法易发生码本坍缩(codebook collapse),导致大部分表达空间不可达。其解决方案的关键在于提出SMPL-FX框架,将FLAME高维表情空间与SMPL-X身体模型耦合,并使用针对不同模态(身体、手部、面部)设计的有限标量量化变分自编码器(Finite Scalar Quantization VAEs)进行分词表示,再通过多模态Transformer(M3T)以自回归方式建模运动序列,并引入辅助翻译目标以促进语义对齐嵌入。该方案在多个基准数据集上实现了最先进的手语生成质量,尤其在仅依赖非手动特征区分手势的任务中显著提升准确率至58.3%。
链接: https://arxiv.org/abs/2603.23617
作者: Alexandre Symeonidis-Herzig,Jianhe Low,Ozge Mercanoglu Sincan,Richard Bowden
机构: University of Surrey (萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Sign language production requires more than hand motion generation. Non-manual features, including mouthings, eyebrow raises, gaze, and head movements, are grammatically obligatory and cannot be recovered from manual articulators alone. Existing 3D production systems face two barriers to integrating them: the standard body model provides a facial space too low-dimensional to encode these articulations, and when richer representations are adopted, standard discrete tokenization suffers from codebook collapse, leaving most of the expression space unreachable. We propose SMPL-FX, which couples FLAME’s rich expression space with the SMPL-X body, and tokenize the resulting representation with modality-specific Finite Scalar Quantization VAEs for body, hands, and face. M3T is an autoregressive transformer trained on this multi-modal motion vocabulary, with an auxiliary translation objective that encourages semantically grounded embeddings. Across three standard benchmarks (How2Sign, CSL-Daily, Phoenix14T) M3T achieves state-of-the-art sign language production quality, and on NMFs-CSL, where signs are distinguishable only by non-manual features, reaches 58.3% accuracy against 49.0% for the strongest comparable pose baseline.
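论文用有限标量量化(FSQ)替代传统 VQ 码本以规避码本坍缩:每个潜变量维度经 tanh 压缩后取整到固定数量的等级,各维等级的组合即 token 索引,整个码本从结构上全部可达。以下为极简 NumPy 示意(各维等级数为此处假设):

```python
import numpy as np

def fsq_quantize(z, levels):
    """逐维量化:tanh 压缩后取整到 levels[i] 个等级,再归一化回 [-1, 1]。"""
    half = (np.asarray(levels) - 1) / 2
    q = np.round(np.tanh(z) * half)   # 取最近的整数等级
    return q / half

def fsq_token(q, levels):
    """把各维等级组合展平成单个 token 索引。"""
    half = (np.asarray(levels) - 1) / 2
    digits = np.round(q * half + half).astype(int)   # 每维取值 0..L-1
    return int(np.ravel_multi_index(tuple(digits), levels))

levels = [7, 5, 5]                    # 码本大小 = 7*5*5 = 175,全部可达
z = np.array([0.3, -1.2, 2.0])
q = fsq_quantize(z, levels)
tok = fsq_token(q, levels)
```

与 VQ 不同,FSQ 没有可学习码本,量化格点由等级数固定给出,因此不存在"部分码字从未被使用"的坍缩问题。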
[CV-120] LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset
【速读】:该论文旨在解决自动驾驶领域中对罕见场景(long-tail driving events)泛化能力不足的问题。其解决方案的关键在于构建一个面向端到端驾驶的新数据集,该数据集包含多视角视频、轨迹、高层指令以及由具有多元文化背景的领域专家提供的多语言(英语、西班牙语、中文)推理轨迹,从而支持上下文学习(in-context learning)和少样本泛化(few-shot generalization)。该数据集不仅评估传统安全与舒适性指标,还引入了指令遵循能力和输出语义一致性等新维度,为视觉-语言模型(VLMs)和视觉-语言动作模型(VLAs)提供了更全面的多模态基准,有助于研究不同推理形式对驾驶能力的影响。
链接: https://arxiv.org/abs/2603.23607
作者: Royden Wagner,Omer Sahin Tas,Jaime Villa,Felix Hauser,Yinzhe Shen,Marlon Steiner,Dominik Strutz,Carlos Fernandez,Christian Kinzig,Guillermo S. Guitierrez-Cabello,Hendrik Königshof,Fabian Immel,Richard Schwarzkopf,Nils Alexander Rack,Kevin Rösch,Kaiwen Wang,Jan-Hendrik Pauls,Martin Lauer,Igor Gilitschenski,Holger Caesar,Christoph Stiller
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 21 pages
Abstract:In real-world domains such as self-driving, generalization to rare scenarios remains a fundamental challenge. To address this, we introduce a new dataset designed for end-to-end driving that focuses on long-tail driving events. We provide multi-view video data, trajectories, high-level instructions, and detailed reasoning traces, facilitating in-context learning and few-shot generalization. The resulting benchmark for multimodal models, such as VLMs and VLAs, goes beyond safety and comfort metrics by evaluating instruction following and semantic coherence between model outputs. The multilingual reasoning traces in English, Spanish, and Chinese are from domain experts with diverse cultural backgrounds. Thus, our dataset is a unique resource for studying how different forms of reasoning affect driving competence. Our dataset is available at: this https URL
[CV-121] CAPTCHA Solving for Native GUI Agents: Automated Reasoning-Action Data Generation and Self-Corrective Training
【速读】:该论文旨在解决当前通用图形用户界面(GUI)代理在处理现代交互式验证码(CAPTCHA)任务时表现不佳的问题,同时保持其在一般GUI任务上的性能。现有方法要么是针对CAPTCHA设计的专用流水线,无法胜任通用GUI操作;要么是端到端的原生视觉语言模型(VLM),虽能处理多种GUI任务但难以有效破解复杂CAPTCHA。解决方案的关键在于提出ReCAP——一个具备CAPTCHA能力的原生GUI代理,通过构建覆盖七类典型CAPTCHA类型的动态测试系统来强化模型对噪声鲁棒OCR、细粒度视觉理解与精确控制等核心能力,并开发自动化数据收集与清洗管道生成大规模带推理轨迹的CAPTCHA交互数据;进一步利用失败轨迹构建自修正数据,使代理能够在线反思错误并纠正动作,从而将CAPTCHA成功率从约30%提升至80%,同时维持在通用GUI基准上的高性能。
链接: https://arxiv.org/abs/2603.23559
作者: Yuxi Chen,Haoyu Zhai,Chenkai Wang,Rui Yang,Lingming Zhang,Gang Wang,Huan Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:GUI agents are rapidly shifting from multi-module pipelines to end-to-end, native vision-language models (VLMs) that perceive raw screenshots and directly interact with digital devices. Despite rapid progress on general GUI tasks, CAPTCHA solving remains a major challenge. On the other hand, although specialized CAPTCHA solving pipelines exist, they cannot handle general GUI tasks. To address this gap, we introduce ReCAP: a CAPTCHA-capable native GUI agent that can robustly solve modern, interactive CAPTCHA challenges, while preserving their performance as a general GUI agent. We first develop a dynamic CAPTCHA system spanning seven representative CAPTCHA types, designed to stress primitive and complementary capabilities for CAPTCHA solving (e.g., robust OCR under heavy noise and text stylization, fine-grained visual understanding, and precise control). Then, we develop an automated data collection and curation pipeline that generates large-scale CAPTCHA interaction trajectories paired with reasoning traces. As CAPTCHA solving often requires multi-step interaction and recovery from intermediate mistakes, we further leverage failed trajectories to construct self-correction data, training agents to reflect on errors and correct their actions online. Across held-out test sets, ReCAP improves CAPTCHA-solving success from roughly 30% to 80%, while maintaining strong performance on general GUI-agent benchmarks.
[CV-122] Learning Actionable Manipulation Recovery via Counterfactual Failure Synthesis
【速读】:该论文旨在解决机器人操作中执行错误后自主恢复能力不足的问题,现有方法受限于昂贵且危险的真实世界数据收集或存在严重“仿真到现实”差距的模拟扰动,同时视觉分析器多输出粗粒度的二元诊断,难以提供可执行的轨迹级修正。解决方案的关键在于提出Dream2Fix框架,通过在生成式世界模型(generative world model)中扰动动作,直接从真实世界成功演示中合成逼真的反事实失败轨迹,从而无需依赖仿真即可生成成对的失败-修正数据;并通过结构化验证机制严格筛选出任务有效性、视觉一致性和运动学安全性均满足条件的轨迹,构建了包含120k样本的高保真数据集,进而微调视觉语言模型以联合预测失败类型与精确恢复轨迹,实现从视觉异常到纠正动作的端到端映射,最终在物理机器人部署中实现了零样本闭环失败恢复。
链接: https://arxiv.org/abs/2603.13528
作者: Dayou Li,Jiuzhou Lei,Hao Wang,Lulin Liu,Yunhao Yang,Zihan Wang,Bangya Liu,Minghui Zheng,Zhiwen Fan
机构: Texas A&M University (德州农工大学); University of Minnesota (明尼苏达大学); University of Texas at Austin (德克萨斯大学奥斯汀分校); Abaka AI; University of Wisconsin-Madison (威斯康星大学麦迪逊分校)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While recent foundation models have significantly advanced robotic manipulation, these systems still struggle to autonomously recover from execution errors. Current failure-learning paradigms rely on either costly and unsafe real-world data collection or simulator-based perturbations, which introduce a severe sim-to-real gap. Furthermore, existing visual analyzers predominantly output coarse, binary diagnoses rather than the executable, trajectory-level corrections required for actual recovery. To bridge the gap between failure diagnosis and actionable recovery, we introduce Dream2Fix, a framework that synthesizes photorealistic, counterfactual failure rollouts directly from successful real-world demonstrations. By perturbing actions within a generative world model, Dream2Fix creates paired failure-correction data without relying on simulators. To ensure the generated data is physically viable for robot learning, we implement a structured verification mechanism that strictly filters rollouts for task validity, visual coherence, and kinematic safety. This engine produces a high-fidelity dataset of over 120k paired samples. Using this dataset, we fine-tune a vision-language model to jointly predict failure types and precise recovery trajectories, mapping visual anomalies directly to corrective actions. Extensive real-world robotic experiments show our approach achieves state-of-the-art correction accuracy, improving from 19.7% to 81.3% over prior baselines, and successfully enables zero-shot closed-loop failure recovery in physical deployments.
[CV-123] Modeling Spatiotemporal Neural Frames for High Resolution Brain Dynamic CVPR2026
【速读】:该论文旨在解决高成本功能性磁共振成像(fMRI)在大规模应用中的限制问题,尤其是如何利用低成本、高时间分辨率的脑电图(EEG)来重建高质量、具有高空间保真度和强时间一致性的动态fMRI序列。其解决方案的关键在于提出一种基于EEG条件约束的框架,通过引入空域中间帧重建机制(null-space intermediate-frame reconstruction),有效处理真实fMRI采集中常见的采样不规则性问题,从而实现任意中间帧的测量一致性补全,显著提升序列连续性和实际可用性,同时保持全脑及功能特异性区域的优异重建质量与功能信息保留,为从EEG估计高分辨率fMRI动态提供了新路径。
链接: https://arxiv.org/abs/2603.24176
作者: Wanying Qu,Jianxiong Gao,Wei Wang,Yanwei Fu
机构: Fudan University (复旦大学); Southern University of Science and Technology (南方科技大学); Shanghai Innovation Institute (上海创新研究院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
备注: CVPR 2026
Abstract:Capturing dynamic spatiotemporal neural activity is essential for understanding large-scale brain mechanisms. Functional magnetic resonance imaging (fMRI) provides high-resolution cortical representations that form a strong basis for characterizing fine-grained brain activity patterns. The high acquisition cost of fMRI limits large-scale applications, therefore making high-quality fMRI reconstruction a crucial task. Electroencephalography (EEG) offers millisecond-level temporal cues that complement fMRI. Leveraging this complementarity, we present an EEG-conditioned framework for reconstructing dynamic fMRI as continuous neural sequences with high spatial fidelity and strong temporal coherence at the cortical-vertex level. To address sampling irregularities common in real fMRI acquisitions, we incorporate a null-space intermediate-frame reconstruction, enabling measurement-consistent completion of arbitrary intermediate frames and improving sequence continuity and practical applicability. Experiments on the CineBrain dataset demonstrate superior voxel-wise reconstruction quality and robust temporal consistency across whole-brain and functionally specific regions. The reconstructed fMRI also preserves essential functional information, supporting downstream visual decoding tasks. This work provides a new pathway for estimating high-resolution fMRI dynamics from EEG and advances multimodal neuroimaging toward more dynamic brain activity modeling.
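"零空间中间帧重建"对应线性反问题的经典分解:解的值域分量由观测测量固定,零空间分量由网络预测补全,从而保证任意补全结果都与测量严格一致。以下 NumPy 示意(采样算子 A 的形状与数值均为假设):

```python
import numpy as np

rng = np.random.default_rng(0)

def null_space_complete(A, y, x_net):
    """x_hat = A⁺y + (I - A⁺A)·x_net:值域部分由测量 y 固定,零空间部分取自网络预测。"""
    A_pinv = np.linalg.pinv(A)
    P_null = np.eye(A.shape[1]) - A_pinv @ A   # 零空间投影算子
    return A_pinv @ y + P_null @ x_net

A = rng.normal(size=(4, 10))     # 假设的不规则采样算子(4 个观测,10 维帧)
y = A @ rng.normal(size=10)      # 实际观测到的测量
x_net = rng.normal(size=10)      # 网络对完整帧的预测
x_hat = null_space_complete(A, y, x_net)
```

无论网络输出 x_net 如何变化,A @ x_hat 恒等于 y,即补全的中间帧始终满足测量一致性。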
[CV-124] Comparative analysis of dual-form networks for live land monitoring using multi-modal satellite image time series
【速读】:该论文旨在解决多模态卫星图像时序(Multi-modal Satellite Image Time Series, SITS)分析在实时土地监测应用中面临的计算效率瓶颈问题。具体而言,传统Transformer架构虽能有效捕捉时间依赖性和融合多源遥感数据,但其二次方复杂度及每次新增观测需重新处理整个序列的特性,限制了其在大范围、高频次监测场景中的部署。解决方案的关键在于引入双形式注意力机制(dual-form attention mechanisms),该机制支持并行训练与递归推理(recurrent inference),从而实现增量式处理;同时针对SITS特有的时间不规则性和数据未对齐问题,设计基于实际获取日期而非序列索引计算token距离的时间适配机制,显著提升了模型的实用性与效率。实验表明,该方法在预测和太阳能板建设监测任务中性能接近标准Transformer,同时具备高效递归推理能力,且多模态框架优于单模态方法,验证了双形式注意力机制在传感器融合中的有效性。
链接: https://arxiv.org/abs/2603.24109
作者: Iris Dumeur(CB),Jérémy Anger(CB),Gabriele Facciolo(CB)
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multi-modal Satellite Image Time Series (SITS) analysis faces significant computational challenges for live land monitoring applications. While Transformer architectures excel at capturing temporal dependencies and fusing multi-modal data, their quadratic computational complexity and the need to reprocess entire sequences for each new acquisition limit their deployment for regular, large-area monitoring. This paper studies various dual-form attention mechanisms for efficient multi-modal SITS analysis, that enable parallel training while supporting recurrent inference for incremental processing. We compare linear attention and retention mechanisms within a multi-modal spectro-temporal encoder. To address SITS-specific challenges of temporal irregularity and unalignment, we develop temporal adaptations of dual-form mechanisms that compute token distances based on actual acquisition dates rather than sequence indices. Our approach is evaluated on two tasks using Sentinel-1 and Sentinel-2 data: multi-modal SITS forecasting as a proxy task, and real-world solar panel construction monitoring. Experimental results demonstrate that dual-form mechanisms achieve performance comparable to standard Transformers while enabling efficient recurrent inference. The multimodal framework consistently outperforms mono-modal approaches across both tasks, demonstrating the effectiveness of dual mechanisms for sensor fusion. The results presented in this work open new opportunities for operational land monitoring systems requiring regular updates over large geographic areas.
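摘要中"并行训练、递归推理"的双形式机制可以用线性注意力的两种等价计算形式示意(假设性草图:phi 取常见的 ELU+1 特征映射,非论文的具体机制):并行形式适合训练,递归形式以固定大小的状态增量处理每次新的卫星观测。

```python
import numpy as np

def phi(x):
    # ELU+1 feature map, a common choice for linear attention (assumption here)
    return np.where(x > 0, x + 1.0, np.exp(x))

def parallel_linear_attention(Q, K, V):
    """Causal linear attention in its parallel (training) form; the causal
    mask is kept explicit here for clarity."""
    T = Q.shape[0]
    scores = phi(Q) @ phi(K).T            # (T, T) kernelized similarities
    mask = np.tril(np.ones((T, T)))       # causal mask
    return (scores * mask) @ V

def recurrent_linear_attention(Q, K, V):
    """The same computation in recurrent (inference) form: a fixed-size
    state S accumulates phi(k_t) v_t^T, so each new acquisition is O(1)
    in the sequence length."""
    d_k, d_v = Q.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))
    outs = []
    for q, k, v in zip(Q, K, V):
        S = S + np.outer(phi(k), v)       # update state with the new token
        outs.append(phi(q) @ S)           # read out with the current query
    return np.stack(outs)

rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(5, 4)) for _ in range(3))
print(np.allclose(parallel_linear_attention(Q, K, V),
                  recurrent_linear_attention(Q, K, V)))  # two forms agree
```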
[CV-125] Machine vision with small numbers of detected photons per inference
【速读】:该论文旨在解决在极低光环境下机器视觉系统性能显著下降的问题,尤其是在平均每个像素接收到的光子数接近或低于1的情况下,传统方法因光子稀缺和检测的随机性而难以实现高精度图像识别。解决方案的关键在于提出了一种名为“光子感知类神经形态传感”(Photon-aware Neuromorphic Sensing, PANS)的新方法,其核心是将光子统计特性(如泊松分布的随机性)与端到端优化相结合,在训练阶段显式建模低光条件下的测量噪声和光子预算限制,从而实现对光学前端与后处理算法的联合优化。实验表明,PANS在仅需数个至数十个总光子即可完成图像分类任务时,仍能保持较高准确率,相比传统方法展现出数量级的光子效率提升。
链接: https://arxiv.org/abs/2603.23974
作者: Shi-Yuan Ma,Jérémie Laydevant,Mandar M. Sohoni,Logan G. Wright,Tianyu Wang,Peter L. McMahon
机构: Cornell University (康奈尔大学); NTT Research, Inc. (NTT 研究公司)
类目: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
备注: 98 pages, 34 figures
Abstract:Machine vision, including object recognition and image reconstruction, is a central technology in many consumer devices and scientific instruments. The design of machine-vision systems has been revolutionized by the adoption of end-to-end optimization, in which the optical front end and the post-processing back end are jointly optimized. However, while machine vision currently works extremely well in moderate-light or bright-light situations – where a camera may detect thousands of photons per pixel and billions of photons per frame – it is far more challenging in very low-light situations. We introduce photon-aware neuromorphic sensing (PANS), an approach for end-to-end optimization in highly photon-starved scenarios. The training incorporates knowledge of the low photon budget and the stochastic nature of light detection when the average number of photons per pixel is near or less than 1. We report a proof-of-principle experimental demonstration in which we performed low-light image classification using PANS, achieving 73% (82%) accuracy on FashionMNIST with an average of only 4.9 (17) detected photons in total per inference, and 86% (97%) on MNIST with 8.6 (29) detected photons – orders of magnitude more photon-efficient than conventional approaches. We also report simulation studies showing how PANS could be applied to other classification, event-detection, and image-reconstruction tasks. By taking into account the statistics of measurement results for non-classical states or alternative sensing hardware, PANS could in principle be adapted to enable high-accuracy results in quantum and other photon-starved setups.
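摘要中"每像素平均光子数接近或低于 1"的测量过程可以用泊松采样简单模拟(假设性草图,图像尺寸与光子预算均为示意,非论文实验配置):

```python
import numpy as np

def photon_starved_measurement(image, mean_photons_total, rng):
    """Simulate a detector in the photon-starved regime: the image is
    treated as a light-intensity pattern and each pixel records a
    Poisson-distributed photon count whose expected total over the whole
    frame is mean_photons_total."""
    intensity = image / image.sum()            # normalized light pattern
    rate = intensity * mean_photons_total      # expected photons per pixel
    return rng.poisson(rate)                   # stochastic detection

rng = np.random.default_rng(0)
image = rng.random((28, 28))       # stand-in for e.g. a FashionMNIST frame
counts = photon_starved_measurement(image, mean_photons_total=4.9, rng=rng)
print(counts.sum())                # only a handful of photons are detected
```

训练时将这类随机测量纳入前向过程,即摘要所述"在训练中考虑低光子预算与光检测的随机性"。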
人工智能
[AI-0] The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence
【速读】:该论文旨在解决组织中代理型人工智能(Agentic AI)在决策过程中的可靠性与监督成本之间的权衡问题,尤其关注当确定性工作流被随机策略替代后,如何量化并控制其决策轨迹的统计可信度、局部明确性和经济可治理性。解决方案的关键在于构建一个测度论意义上的马尔可夫框架,引入核心指标如状态盲区质量 $ B_n(\tau) $、状态-动作盲区质量 $ B^{\text{SA}}_{\pi,n}(\tau) $、基于熵的人工干预升级门限以及覆盖频次测度上的预期监督成本恒等式;并通过真实企业采购流程日志(Business Process Intelligence Challenge 2019 数据集)验证了该框架的有效性,表明细化状态空间可显著提升对下一步决策盲区的识别能力,并且这些指标同时决定了可实现的自主性水平与预期监督负担。
链接: https://arxiv.org/abs/2603.24582
作者: Biplab Pal,Santanu Bhattacharya
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 22 pages, 5 figures, submitted to Engineering Applications of Artificial Intelligence
Abstract:Agentic artificial intelligence (AI) in organizations is a sequential decision problem constrained by reliability and oversight cost. When deterministic workflows are replaced by stochastic policies over actions and tool calls, the key question is not whether a next step appears plausible, but whether the resulting trajectory remains statistically supported, locally unambiguous, and economically governable. We develop a measure-theoretic Markov framework for this setting. The core quantities are state blind-spot mass B_n(tau), state-action blind mass B^SA_pi,n(tau), an entropy-based human-in-the-loop escalation gate, and an expected oversight-cost identity over the workflow visitation measure. We instantiate the framework on the Business Process Intelligence Challenge 2019 purchase-to-pay log (251,734 cases, 1,595,923 events, 42 distinct workflow actions) and construct a log-driven simulated agent from a chronological 80/20 split of the same process. The main empirical finding is that a large workflow can appear well supported at the state level while retaining substantial blind mass over next-step decisions: refining the operational state to include case context, economic magnitude, and actor class expands the state space from 42 to 668 and raises state-action blind mass from 0.0165 at tau=50 to 0.1253 at tau=1000. On the held-out split, m(s) = max_a pi-hat(a|s) tracks realized autonomous step accuracy within 3.4 percentage points on average. The same quantities that delimit statistically credible autonomy also determine expected oversight burden. The framework is demonstrated on a large-scale enterprise procurement workflow and is designed for direct application to engineering processes for which operational event logs are available. 
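摘要中的状态盲区质量 B_n(tau) 的一种可能读法如下(精确定义见论文,此处仅为假设性示意):统计访问质量中落在观测次数少于阈值 tau 的状态上的比例;tau 增大时盲区质量单调上升,与摘要中 tau=50 到 tau=1000 的变化方向一致。

```python
from collections import Counter

def blind_spot_mass(visits, tau):
    """One plausible reading of the paper's B_n(tau) (the paper's precise
    definition may differ): the share of visitation mass that lands on
    states supported by fewer than tau observations in the log."""
    counts = Counter(visits)
    total = len(visits)
    return sum(c for c in counts.values() if c < tau) / total

# Toy purchase-to-pay log: two common workflow actions and one rare one.
log = ["create_po"] * 900 + ["change_price"] * 80 + ["manual_release"] * 20
print(blind_spot_mass(log, tau=50))    # only the rare action is blind
print(blind_spot_mass(log, tau=100))   # a larger tau exposes more blind mass
```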
[AI-1] Completeness of Unbounded Best-First Minimax and Descent Minimax
【速读】:该论文致力于解决两类经典搜索算法——无界最佳优先极小极大(Unbounded Best-First Minimax)和下降极小极大(Descent Minimax)——在无限搜索时间内仍无法保证确定最优策略(尤其是必胜策略)的问题。这类算法是当前无知识强化学习(knowledge-free reinforcement learning)中的核心方法,但其完备性长期未被证明。论文的关键解决方案在于对这两类算法进行形式化推广,并引入“完成技术”(completion technique)的理论分析框架,从而证明:只要使用该完成技术,任何此类算法都能计算出最优策略。实验进一步验证了该技术显著提升了算法在实际博弈场景中识别必胜策略的能力。
链接: https://arxiv.org/abs/2603.24572
作者: Quentin Cohen-Solal
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:In this article, we focus on search algorithms for two-player perfect information games, whose objective is to determine the best possible strategy, and ideally a winning strategy. Unfortunately, some search algorithms for games in the literature are not able to always determine a winning strategy, even with an infinite search time. This is the case, for example, of the following algorithms: Unbounded Best-First Minimax and Descent Minimax, which are core algorithms in state-of-the-art knowledge-free reinforcement learning. They were then improved with the so-called completion technique. However, whether this technique sufficiently improves these algorithms to allow them to always determine a winning strategy remained an open question until now. To answer this question, we generalize the two algorithms (their versions using the completion technique), and we show that any algorithm of this class of algorithms computes the best strategy. Finally, we experimentally show that the completion technique improves winning performance.
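完成技术(completion technique)的核心思想可以用一个玩具例子示意(假设性草图,非论文算法本身):将节点值扩展为 (completion, heuristic) 二元组,completion 取 +1(已证必胜)、-1(已证必败)或 0(未解决);二元组按字典序比较,使已被证明的结果优先于启发式估值参与极小极大回传。

```python
# Toy illustration of the completion idea (an assumption-level sketch, not
# the paper's exact algorithm): node values are pairs (completion, heuristic)
# with completion = +1 for a proved win, -1 for a proved loss, 0 otherwise.
# Lexicographic tuple comparison makes proved results dominate heuristics.

def backup(node, maximizing):
    if "value" in node:                       # leaf: (completion, heuristic)
        return node["value"]
    child_vals = [backup(c, not maximizing) for c in node["children"]]
    return max(child_vals) if maximizing else min(child_vals)

tree = {
    "children": [
        {"value": (0, 0.9)},                  # unresolved but promising leaf
        {"children": [                        # opponent (min) node
            {"value": (1, 0.0)},              # a proved win for max...
            {"value": (-1, 0.0)},             # ...but min picks the loss
        ]},
        {"children": [                        # opponent node where every
            {"value": (1, 0.2)},              # reply is a proved win:
            {"value": (1, 0.1)},              # min cannot escape
        ]},
    ]
}
print(backup(tree, maximizing=True))          # (1, 0.1): a proved win exists
```

可以看到,即使启发值更高的未解决分支存在,回传结果仍选中已被证明的必胜分支,这正是完成技术保证必胜策略可被识别的直觉来源。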
[AI-2] From Liar Paradox to Incongruent Sets: A Normal Form for Self-Reference
【速读】:该论文旨在解决自指语句(self-referential sentences)所引发的语义不一致性问题,尤其是如何在保持局部经典语义的前提下,识别并刻画全局不一致性的结构根源。其核心挑战在于:自指常导致悖论或不可满足性,但传统逻辑框架难以区分局部可满足性与全局矛盾之间的本质差异。解决方案的关键是提出非一致正规形(incongruent normal form, INF)——将一个自指句子转化为一组个体可满足但整体不可满足的非自指句子集合,从而隔离由自指造成的语义障碍。INF不仅保留了局部语义的一致性,还通过正确性定理精确刻画了何时全局不一致源于局部相容承诺的冲突。进一步地,作者证明了这种“非一致结构”(incongruence)不仅是悖论的来源,更是语义信息量的根本来源,并在有限语义状态空间中构建了基于布尔函数和傅里叶分析的量化语义能量框架,揭示了语义确定性、信息性和谱简单性之间的不确定性关系,表明语义信息无法在无界能量代价下坍缩为单一确定态,凸显了非一致结构作为语义表示的基本结构性与定量特征。
链接: https://arxiv.org/abs/2603.24527
作者: Shalender Singh,Vishnu Priya Singh Parmar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 46 pages
Abstract:We introduce incongruent normal form (INF), a structural representation for self-referential semantic sentences. An INF replaces a self-referential sentence with a finite family of non-self-referential sentences that are individually satisfiable but not jointly satisfiable. This transformation isolates the semantic obstruction created by self-reference while preserving classical semantics locally and is accompanied by correctness theorems characterizing when global inconsistency arises from locally compatible commitments. We then study the role of incongruence as a structural source of semantic informativeness. Using a minimal model-theoretic notion of informativeness-understood as the ability of sentences to distinguish among admissible models-we show that semantic completeness precludes informativeness, while incongruence preserves it. Moreover, incongruence is not confined to paradoxical constructions: any consistent incomplete first-order theory admits finite incongruent families arising from incompatible complete extensions. In this sense, incompleteness manifests structurally as locally realizable but globally incompatible semantic commitments, providing a minimal formal basis for semantic knowledge. Finally, we introduce a quantitative semantic framework. In a canonical finite semantic-state setting, we model semantic commitments as Boolean functions and define a Fourier-analytic notion of semantic energy based on total influence. We derive uncertainty-style bounds relating semantic determinacy, informativeness, and spectral simplicity, and establish a matrix inequality bounding aggregate semantic variance by total semantic energy. These results show quantitatively that semantic informativeness cannot collapse into a single determinate state without unbounded energy cost, identifying incongruence as a fundamental structural and quantitative feature of semantic representation.
[AI-3] No Single Metric Tells the Whole Story: A Multi-Dimensional Evaluation Framework for Uncertainty Attributions
【速读】:该论文旨在解决当前生成式 AI(Generative AI)领域中不确定性归因(uncertainty attribution)方法评估标准不统一的问题。现有研究在评估不确定性归因效果时依赖多样化的代理任务和指标,导致不同方法之间难以比较。为解决这一问题,作者基于成熟的可解释人工智能(Explainable AI, XAI)评估框架 Co-12,提出了一套系统化的评估体系,明确实现了正确性(correctness)、一致性(consistency)、连续性(continuity)和紧凑性(compactness)四项属性,并引入了专为不确定性归因设计的“传递性”(conveyance)属性,用以衡量可控的先验不确定性增加是否能可靠地反映到特征层面的归因结果中。实验表明,梯度法在一致性和传递性上优于扰动法,且蒙特卡洛Dropconnect在多数指标上优于蒙特卡洛Dropout,但各方法间的评价结果仍存在低一致性,说明单一指标不足以全面评估不确定性归因质量。该框架为不确定性归因方法的系统比较与开发提供了坚实基础。
链接: https://arxiv.org/abs/2603.24524
作者: Emily Schiller,Teodor Chiaburu,Marco Zullich,Luca Longo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at the Fourth World Conference on Explainable Artificial Intelligence, xAI 2026, Fortaleza, Brazil, July 1-3, 2026
Abstract:Research on explainable AI (XAI) has frequently focused on explaining model predictions. More recently, methods have been proposed to explain prediction uncertainty by attributing it to input features (uncertainty attributions). However, the evaluation of these methods remains inconsistent as studies rely on heterogeneous proxy tasks and metrics, hindering comparability. We address this by aligning uncertainty attributions with the well-established Co-12 framework for XAI evaluation. We propose concrete implementations for the correctness, consistency, continuity, and compactness properties. Additionally, we introduce conveyance, a property tailored to uncertainty attributions that evaluates whether controlled increases in epistemic uncertainty reliably propagate to feature-level attributions. We demonstrate our evaluation framework with eight metrics across combinations of uncertainty quantification and feature attribution methods on tabular and image data. Our experiments show that gradient-based methods consistently outperform perturbation-based approaches in consistency and conveyance, while Monte-Carlo dropconnect outperforms Monte-Carlo dropout in most metrics. Although most metrics rank the methods consistently across samples, inter-method agreement remains low. This suggests no single metric sufficiently evaluates uncertainty attribution quality. The proposed evaluation framework contributes to the body of knowledge by establishing a foundation for systematic comparison and development of uncertainty attribution methods.
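摘要中比较的蒙特卡洛 Dropconnect 可以用如下极简草图示意(假设性实现:对权重矩阵多次采样随机掩码并做前向传播,以预测方差近似认知不确定性;层结构与参数均为示意):

```python
import numpy as np

def mc_dropconnect_uncertainty(x, W, n_samples=200, p_drop=0.5, rng=None):
    """Epistemic uncertainty for a linear layer via Monte-Carlo dropconnect:
    sample weight masks at test time and report the predictive mean and
    variance across the stochastic forward passes (dropout would mask
    activations instead of weights)."""
    rng = rng or np.random.default_rng(0)
    preds = []
    for _ in range(n_samples):
        mask = rng.random(W.shape) >= p_drop         # drop individual weights
        preds.append(x @ (W * mask) / (1 - p_drop))  # rescaled forward pass
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.var(axis=0)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 2))
x = rng.normal(size=4)
mean, var = mc_dropconnect_uncertainty(x, W, rng=rng)
print(var.shape)        # per-output epistemic variance
```

对这样的方差估计再做特征归因,便得到可用上述指标(一致性、传递性等)评估的不确定性归因。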
[AI-4] Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在安全防护方面的漏洞问题,特别是针对白盒对抗攻击(white-box adversarial attacks)的自动化发现与优化难题。传统方法依赖人工设计攻击算法,效率低且难以覆盖复杂场景,而本文提出基于LLM代理(LLM agent)的自研式(autoresearch-style)流水线,利用Claude Code作为核心引擎,通过迭代生成和优化攻击算法,实现对目标模型(如GPT-OSS-Safeguard-20B)的高成功率渗透测试。其解决方案的关键在于:1)以现有攻击方法(如GCG)为起点,结合强化学习式的反馈机制进行演化;2)利用白盒环境提供的密集量化反馈信号,使攻击策略能高效收敛并显著超越已有30余种方法;3)所发现的攻击具备良好泛化能力,可在未见过的模型(如Meta-SecAlign-70B)上实现100%攻击成功率,验证了自动化安全研究的可行性。
链接: https://arxiv.org/abs/2603.24511
作者: Alexander Panfilov,Peter Romov,Igor Shilov,Yves-Alexandre de Montjoye,Jonas Geiping,Maksym Andriushchenko
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:LLM agents like Claude Code can not only write code but also be used for autonomous AI research and engineering \citep{rank2026posttrainbench, novikov2025alphaevolve}. We show that an autoresearch-style pipeline \citep{karpathy2026autoresearch} powered by Claude Code discovers novel white-box adversarial attack algorithms that significantly outperform all existing (30+) methods in jailbreaking and prompt injection evaluations. Starting from existing attack implementations, such as GCG \citep{zou2023universal}, the agent iterates to produce new algorithms achieving up to 40% attack success rate on CBRN queries against GPT-OSS-Safeguard-20B, compared to ≤10% for existing algorithms (teaser figure, left). The discovered algorithms generalize: attacks optimized on surrogate models transfer directly to held-out models, achieving 100% ASR against Meta-SecAlign-70B \citep{chen2025secalign} versus 56% for the best baseline (teaser figure, middle). Extending the findings of \cite{carlini2025autoadvexbench}, our results are an early demonstration that incremental safety and security research can be automated using LLM agents. White-box adversarial red-teaming is particularly well-suited for this: existing methods provide strong starting points, and the optimization objective yields dense, quantitative feedback. We release all discovered attacks alongside baseline implementations and evaluation code at this https URL.
[AI-5] Enes Causal Discovery
【速读】:该论文试图解决观测数据中因果发现(causal discovery)的难题,尤其是在缺乏干预数据的情况下如何有效建模因果关系。其解决方案的关键在于提出一种混合专家(mixture of experts)架构,通过将模型实体(如因果关系)进一步参数化,以提升对复杂因果结构的表达能力;同时,作者指出简单线性模型(如皮尔逊相关系数模型)在该数据集上已能取得良好性能,因此所提方法需克服这一强基线,从而推动因果发现方法的改进。
链接: https://arxiv.org/abs/2603.24436
作者: Alexis Kafantaris
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
备注:
Abstract:Enes. The proposed architecture is a mixture of experts, which allows the model entities, such as the causal relationships, to be further parameterized. More specifically, an attempt is made to exploit a neural net, as implementing neurons poses a great challenge for this dataset. To explain, a simple and fast Pearson-coefficient linear model usually achieves good scores; that is an aggressive baseline that requires a really good model to overcome. Moreover, there are major limitations when it comes to causal discovery from observational data. Unlike the Sachs study, this work did not use interventions but only prior knowledge; the most prohibitive limitation is that of the data, which is addressed. Thereafter, the method and the model are described, after which the results are presented.
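摘要中提到的"激进基线"——皮尔逊相关系数线性模型——只需几行代码即可实现(假设性草图,数据为合成示例):对每条候选因果边 (i, j) 以两变量的绝对皮尔逊相关系数打分。

```python
import numpy as np

def pearson_baseline(data):
    """The 'aggressive baseline' mentioned in the abstract: score each
    candidate causal edge (i, j) by the absolute Pearson correlation of
    the two observed variables. Fast, linear, and hard to beat."""
    corr = np.corrcoef(data, rowvar=False)
    np.fill_diagonal(corr, 0.0)               # ignore self-edges
    return np.abs(corr)

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + 0.1 * rng.normal(size=1000)    # y strongly depends on x
z = rng.normal(size=1000)                    # z is independent of both
scores = pearson_baseline(np.column_stack([x, y, z]))
print(scores[0, 1] > scores[0, 2])           # the true edge scores highest
```

当然,相关性打分无法区分因果方向或排除混杂,这也是摘要强调观测数据因果发现存在根本局限的原因。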
[AI-6] ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills Plugins and Watchers
【速读】:该论文旨在解决开源自主代理运行时OpenClaw因权限过于宽泛而引发的安全漏洞问题,这些问题将模型错误转化为系统级威胁,如敏感数据泄露、权限提升及恶意第三方技能执行。现有安全措施碎片化,仅覆盖代理生命周期的孤立阶段,缺乏整体防护。解决方案的关键在于提出ClawKeeper框架,通过三个互补的架构层实现多维实时保护:(1) 基于技能的保护在指令层面注入结构化安全策略,强制执行环境特定约束和跨平台边界;(2) 基于插件的保护作为内部运行时执行器,提供配置加固、主动威胁检测与持续行为监控;(3) 基于观察者的保护引入解耦的系统级安全中间件,持续验证代理状态演化并支持实时干预(如终止高风险操作或强制人工确认),其“Watcher”范式为下一代自主代理系统的安全构建提供了基础性支撑。
链接: https://arxiv.org/abs/2603.24414
作者: Songyang Liu,Chaozhuo Li,Chenxu Wang,Jinyu Hou,Zejian Chen,Litian Zhang,Zheng Liu,Qiwei Ye,Yiming Hei,Xi Zhang,Zhongyuan Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 22 pages, 14 figures, 5 tables
Abstract:OpenClaw has rapidly established itself as a leading open-source autonomous agent runtime, offering powerful capabilities including tool integration, local file access, and shell command execution. However, these broad operational privileges introduce critical security vulnerabilities, transforming model errors into tangible system-level threats such as sensitive data leakage, privilege escalation, and malicious third-party skill execution. Existing security measures for the OpenClaw ecosystem remain highly fragmented, addressing only isolated stages of the agent lifecycle rather than providing holistic protection. To bridge this gap, we present ClawKeeper, a real-time security framework that integrates multi-dimensional protection mechanisms across three complementary architectural layers. (1) \textbfSkill-based protection operates at the instruction level, injecting structured security policies directly into the agent context to enforce environment-specific constraints and cross-platform boundaries. (2) \textbfPlugin-based protection serves as an internal runtime enforcer, providing configuration hardening, proactive threat detection, and continuous behavioral monitoring throughout the execution pipeline. (3) \textbfWatcher-based protection introduces a novel, decoupled system-level security middleware that continuously verifies agent state evolution. It enables real-time execution intervention without coupling to the agent’s internal logic, supporting operations such as halting high-risk actions or enforcing human confirmation. We argue that this Watcher paradigm holds strong potential to serve as a foundational building block for securing next-generation autonomous agent systems. Extensive qualitative and quantitative evaluations demonstrate the effectiveness and robustness of ClawKeeper across diverse threat scenarios. We release our code.
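Watcher 范式的最小示意如下(函数名与高风险工具清单均为假设,并非 ClawKeeper 的实际 API):一个与代理内部逻辑解耦的检查器,在每个动作执行前返回放行、需人工确认或阻断三种裁决之一。

```python
# Illustrative sketch of the Watcher idea (names are hypothetical, not
# ClawKeeper's actual API): a decoupled checker inspects each proposed
# agent action before execution.

HIGH_RISK = {"shell_exec", "file_delete", "send_credentials"}

def watcher(action, require_human=lambda a: False):
    """Return 'allow', 'confirm', or 'block' for a proposed agent action."""
    if action["tool"] in HIGH_RISK:
        return "block"                 # halt high-risk operations outright
    if require_human(action):
        return "confirm"               # escalate to human-in-the-loop
    return "allow"

print(watcher({"tool": "read_file", "args": {"path": "notes.txt"}}))   # allow
print(watcher({"tool": "shell_exec", "args": {"cmd": "rm -rf /"}}))    # block
```

关键设计在于该检查器不与代理内部状态耦合,因此可以作为独立中间件持续验证代理行为的演化。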
[AI-7] Real Talk Virtual Faces: A Formal Concept Analysis of Personality and Sentiment in Influencer Audiences
【速读】:该论文旨在解决虚拟偶像(Virtual Influencers, VIs)与人类偶像(Human Influencers, HIs)在受众话语中存在结构性差异但缺乏多信号协同分析方法的问题。现有研究主要依赖问卷调查或聚合互动数据,仅能揭示“说什么”而无法刻画“如何共现”。其解决方案的关键在于提出一种两层结构优先框架:第一层基于形式概念分析(Formal Concept Analysis, FCA)结合支持度冰山过滤,从每周聚合评论中提取情感、大五人格线索与话题标签的共现模式,生成话语谱图;第二层在评论层级挖掘关联规则,揭示人格—情感—话题间的隐性依赖关系,从而识别出 HI 仅呈现单一情绪稳定型话语模式,而 VI 则表现出三种结构迥异的话语模式,包括人类偶像中缺失的外貌相关聚类。该方法实现了对多信号协同机制的精细化建模,揭示了虚拟性不仅改变内容表达,更重塑了受众反应背后的语义组合逻辑。
链接: https://arxiv.org/abs/2603.24410
作者: Shahram Chaudhry,Sidahmed Benabderrahmane,Talal Rahwan
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Virtual influencers (VIs) – digitally synthetic social-media personas – attract audiences whose discourse appears qualitatively different from discourse around human influencers (HIs). Existing work characterises this difference through surveys or aggregate engagement statistics, which reveal what audiences say but not how multiple signals co-occur. We propose a two-layer, structure-first framework grounded in Formal Concept Analysis (FCA) and association rule mining. The first layer applies FCA with support-based iceberg filtering to weekly-aggregated comment data, extracting discourse profiles – weekly co-occurrence bundles of sentiment, Big Five personality cues, and topic tags. The second layer mines association rules at the comment level, revealing personality–sentiment–topic dependencies invisible to frequency-table analysis. Applied to YouTube comments from three VI–HI influencer pairs, the two-layer analysis reveals a consistent structural divergence: HI discourse concentrates into a single, emotionally regulated (stability-centred) regime (low neuroticism anchoring positivity), while VI discourse supports three structurally distinct discourse modes, including an appearance-discourse cluster absent from HI despite near-equal marginal prevalence. Topic-specific analyses further show that VI contexts exhibit negative sentiment in psychologically sensitive domains (mental health, body image, artificial identity) relative to HI contexts. Our results position FCA as a principled tool for multi-signal discourse analysis and demonstrate that virtuality reshapes not just what audiences say, but the underlying grammar of how signals co-occur in their reactions.
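摘要中基于支持度的冰山过滤(iceberg filtering)可以用如下草图示意(假设性实现,属性与周数据均为示意):仅保留共现频率达到最小支持度的属性组合,这与 FCA 冰山格对概念内涵的剪枝思想一致。

```python
from itertools import combinations
from collections import Counter

def iceberg_bundles(transactions, min_support):
    """Support-based iceberg filtering (sketch): keep only attribute pairs
    whose co-occurrence frequency reaches min_support, the same pruning
    that iceberg concept lattices apply to concept intents in FCA."""
    n = len(transactions)
    pair_counts = Counter()
    for attrs in transactions:
        for pair in combinations(sorted(attrs), 2):
            pair_counts[pair] += 1
    return {pair: c / n for pair, c in pair_counts.items()
            if c / n >= min_support}

# Toy weekly bundles of sentiment / personality-cue / topic tags.
weeks = [
    {"positive", "low_neuroticism", "appearance"},
    {"positive", "low_neuroticism"},
    {"negative", "mental_health"},
    {"positive", "low_neuroticism", "appearance"},
]
bundles = iceberg_bundles(weeks, min_support=0.5)
print(bundles)   # only frequent co-occurrence bundles survive the cut
```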
[AI-8] AI-Supervisor: Autonomous AI Research Supervision via a Persistent Research World Model
【速读】:该论文旨在解决现有自动化研究系统缺乏持续知识积累与协同验证机制的问题,这些问题导致其在文献综述、研究空白识别、方法开发和论文撰写等环节中呈现“无状态”和“线性化”特征,难以实现对研究领域的动态理解与自我修正。解决方案的关键在于提出AutoProf(Autonomous Professor)这一多智能体编排框架,其核心创新包括:构建一个基于知识图谱的持续演进的“研究世界模型”作为共享记忆,实现跨智能体的知识沉淀与复用;引入结构化的研究空白发现机制,通过模块化分解方法并跨基准评估以定位细粒度差距;设计自校正发现循环与自提升开发循环,分别用于分析失败原因、检测基准偏差及迭代优化组件;所有智能体通过共识机制确保结果可靠性,且框架具备模型无关性和弹性扩展能力,支持从轻量探索到全规模研究的灵活部署。
链接: https://arxiv.org/abs/2603.24402
作者: Yunbo Long
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Existing automated research systems operate as stateless, linear pipelines, generating outputs without maintaining a persistent understanding of the research landscape. They process papers sequentially, propose ideas without structured gap analysis, and lack mechanisms for agents to verify or refine each other’s findings. We present AutoProf (Autonomous Professor), a multi-agent orchestration framework where specialized agents provide end-to-end AI research supervision driven by human interests, from literature review through gap discovery, method development, evaluation, and paper writing, via autonomous exploration and self-correcting updates. Unlike sequential pipelines, AutoProf maintains a continuously evolving Research World Model implemented as a Knowledge Graph, capturing methods, benchmarks, limitations, and unexplored gaps as shared memory across agents. The framework introduces three contributions: first, structured gap discovery that decomposes methods into modules, evaluates them across benchmarks, and identifies module-level gaps; second, self-correcting discovery loops that analyze why modules succeed or fail, detect benchmark biases, and assess evaluation adequacy; third, self-improving development loops using cross-domain mechanism search to iteratively address failing components. All agents operate under a consensus mechanism where findings are validated before being committed to the shared model. The framework is model-agnostic, supports mainstream large language models, and scales elastically with token budget from lightweight exploration to full-scale investigation.
[AI-9] MolEvolve: LLM-Guided Evolutionary Search for Interpretable Molecular Optimization
【速读】:该论文旨在解决深度学习在化学领域中因可解释性不足和无法识别活性悬崖(activity cliffs)而导致的局限性问题,即微小结构变化引发显著性质波动的现象。传统表示学习受限于相似性原则,难以捕捉这类结构-活性间的不连续性。其解决方案的关键在于提出MolEvolve框架,该框架将分子发现重构为一个自主的前瞻规划问题,利用大语言模型(Large Language Model, LLM)主动探索并演化一组可执行的化学符号操作库,并结合蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)引擎在测试阶段进行规划,同时调用外部工具(如RDKit)实现自主轨迹发现,从而生成透明、可读的推理链,将复杂结构变换转化为人类可理解的化学洞察。
链接: https://arxiv.org/abs/2603.24382
作者: Xiangsen Chen,Ruilong Wu,Yanyan Lan,Ting Ma,Yang Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:
Abstract:Despite deep learning’s success in chemistry, its impact is hindered by a lack of interpretability and an inability to resolve activity cliffs, where minor structural nuances trigger drastic property shifts. Current representation learning, bound by the similarity principle, often fails to capture these structural-activity discontinuities. To address this, we introduce MolEvolve, an evolutionary framework that reformulates molecular discovery as an autonomous, look-ahead planning problem. Unlike traditional methods that depend on human-engineered features or rigid prior knowledge, MolEvolve leverages a Large Language Model (LLM) to actively explore and evolve a library of executable chemical symbolic operations. By utilizing the LLM to cold start and a Monte Carlo Tree Search (MCTS) engine for test-time planning with external tools (e.g. RDKit), the system self-discovers optimal trajectories autonomously. This process evolves transparent reasoning chains that translate complex structural transformations into actionable, human-readable chemical insights. Experimental results demonstrate that MolEvolve’s autonomous search not only evolves transparent, human-readable chemical insights, but also outperforms baselines in both property prediction and molecule optimization tasks.
[AI-10] Evidence of an Emergent “Self” in Continual Robot Learning
【速读】:该论文旨在解决如何在智能系统中量化“自我”概念的问题,即如何识别并区分系统中的“自我”与其他快速习得的认知结构。其解决方案的关键在于:通过寻找认知过程中相对不变的子网络——即在持续学习条件下变化最小、具有高度稳定性的部分——来识别“自我”。研究者基于这一原则,在两类机器人认知结构分析中发现,持续学习任务下的机器人会形成显著更稳定的子网络(p < 0.001),从而为探索其他认知人工智能系统中的自性(selfhood)提供了理论依据与方法路径。
链接: https://arxiv.org/abs/2603.24350
作者: Adidev Jhunjhunwala,Judah Goldfeder,Hod Lipson
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 39 pages, 17 figures, includes supplementary materials
Abstract:A key challenge to understanding self-awareness has been the lack of a principled way of quantifying whether an intelligent system has a concept of a “self,” and if so how to differentiate the “self” from other cognitive structures. We propose that the “self” can be isolated by seeking the invariant portion of cognitive process that changes relatively little compared to more rapidly acquired cognitive knowledge and skills, because our self is the most persistent aspect of our experiences. We used this principle to analyze the cognitive structure of robots under two conditions: One robot learns a constant task, while a second robot is subjected to continual learning under variable tasks. We find that robots subjected to continual learning develop an invariant subnetwork that is significantly more stable (p < 0.001) compared to the control. We suggest that this principle can offer a window into exploring selfhood in other cognitive AI systems.
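摘要中"寻找变化最小的不变子网络"的原则可以用如下草图示意(假设性实现,非论文的具体分析流程):对多次训练快照计算每个权重的变化量,取变化最小的分位数作为"自我"候选子网络。

```python
import numpy as np

def stable_subnetwork(weight_snapshots, quantile=0.1):
    """Identify the 'self' candidate: the weights whose values change least
    across training snapshots (a sketch of the paper's invariance
    principle). Returns a boolean mask over the lowest-change quantile."""
    snaps = np.stack(weight_snapshots)        # (n_snapshots, n_weights)
    change = snaps.std(axis=0)                # per-weight variability
    threshold = np.quantile(change, quantile)
    return change <= threshold

rng = np.random.default_rng(0)
base = rng.normal(size=100)
# Simulate continual learning: most weights drift, a small core stays put.
snapshots = [base + np.r_[np.zeros(10), rng.normal(scale=0.5, size=90)]
             for _ in range(20)]
mask = stable_subnetwork(snapshots, quantile=0.1)
print(mask[:10].all())    # the frozen core is recovered as the stable part
```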
[AI-11] Enhancing Efficiency and Performance in Deepfake Audio Detection through Neuron-level dropin Neuroplasticity Mechanisms IJCNN2026
【速读】:该论文旨在解决当前音频深度伪造检测模型在性能提升上受限于参数规模扩展的问题,尤其是现有方法因单纯堆叠层数导致计算成本过高且需全量重训练,同时低秩适配方法主要局限于基于注意力机制的架构,适用范围受限。其解决方案的关键在于受哺乳动物大脑神经元可塑性启发,提出两种新算法:Dropin 和 Further Plasticity,通过动态调整特定层中的神经元数量来灵活调控模型参数,在不显著增加计算负担的前提下实现更高效的模型优化与性能提升。
链接: https://arxiv.org/abs/2603.24343
作者: Yupei Li,Shuaijie Shao,Manuel Milling,Björn Schuller
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: Accepted at IJCNN 2026
Abstract:Current audio deepfake detection has achieved remarkable performance using diverse deep learning architectures such as ResNet, and has seen further improvements with the introduction of large models (LMs) like Wav2Vec. The success of large language models (LLMs) further demonstrates the benefits of scaling model parameters, but also highlights one bottleneck where performance gains are constrained by parameter counts. Simply stacking additional layers, as done in current LLMs, is computationally expensive and requires full retraining. Furthermore, existing low-rank adaptation methods are primarily applied to attention-based architectures, which limits their scope. Inspired by the neuronal plasticity observed in mammalian brains, we propose novel algorithms, dropin and further plasticity, that dynamically adjust the number of neurons in certain layers to flexibly modulate model parameters. We evaluate these algorithms on multiple architectures, including ResNet, Gated Recurrent Neural Networks, and Wav2Vec. Experimental results using the widely recognised ASVSpoof2019 LA, PA, and FakeorReal datasets demonstrate consistent improvements in computational efficiency with the dropin approach and a maximum of around 39% and 66% relative reduction in Equal Error Rate with the dropin and plasticity approaches across these datasets, respectively. The code and supplementary material are available at Github link.
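摘要中"动态向特定层增加神经元"的 dropin 思想可以用如下草图说明(这是假设性的 Net2Net 式加宽示意,并非论文原算法):新神经元的输出权重置零,使插入瞬间网络函数保持不变,之后再参与训练。

```python
import numpy as np

def dropin_widen(W_in, W_out, n_new, rng=None):
    """向隐藏层"空降"n_new 个新神经元(假设性的加宽草图)。

    W_in:  (hidden, d_in)  进入隐藏层的权重
    W_out: (d_out, hidden) 离开隐藏层的权重
    新神经元的输入权重取小随机值、输出权重置零,
    因此插入瞬间网络输出保持不变。
    """
    rng = rng or np.random.default_rng(0)
    new_in = 0.01 * rng.standard_normal((n_new, W_in.shape[1]))
    W_in2 = np.vstack([W_in, new_in])
    W_out2 = np.hstack([W_out, np.zeros((W_out.shape[0], n_new))])
    return W_in2, W_out2

x = np.ones(4)
W_in = np.random.default_rng(1).standard_normal((3, 4))
W_out = np.random.default_rng(2).standard_normal((2, 3))
W_in2, W_out2 = dropin_widen(W_in, W_out, n_new=2)
before = W_out @ np.tanh(W_in @ x)
after = W_out2 @ np.tanh(W_in2 @ x)
print(np.allclose(before, after))  # → True(插入时函数保持不变)
```

这种"只加宽、不重训全网"的做法正对应摘要所述避免整体堆层和全量重训练的动机。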
[AI-12] Large Language Model Guided Incentive Aware Reward Design for Cooperative Multi-Agent Reinforcement Learning
【速读】:该论文旨在解决协作式多智能体系统中辅助奖励(auxiliary reward)设计困难的问题,尤其是在任务反馈稀疏的情况下,人工设计的奖励函数容易因激励错位而导致次优协作。其解决方案的关键在于提出一种基于大语言模型(Large Language Models, LLMs)的自动化奖励设计框架,该框架通过环境仪器化生成可执行的奖励程序,并在形式有效性约束下进行迭代搜索;通过固定计算预算从零训练策略并仅依据稀疏任务回报进行选择,从而自动发现与任务目标对齐的奖励结构。实验表明,该方法能显著提升协作性能,尤其在交互瓶颈明显的环境中效果突出。
链接: https://arxiv.org/abs/2603.24324
作者: Dogan Urgun,Gokhan Gungor
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:Designing effective auxiliary rewards for cooperative multi-agent systems remains a precarious task; misaligned incentives risk inducing suboptimal coordination, especially where sparse task feedback fails to provide sufficient grounding. This study introduces an automated reward design framework that leverages large language models to synthesize executable reward programs from environment instrumentation. The procedure constrains candidate programs within a formal validity envelope and evaluates their efficacy by training policies from scratch under a fixed computational budget; selection depends exclusively on the sparse task return. The framework is evaluated across four distinct Overcooked-AI layouts characterized by varied corridor congestion, handoff dependencies, and structural asymmetries. Iterative search generations consistently yield superior task returns and delivery counts, with the most pronounced gains occurring in environments dominated by interaction bottlenecks. Diagnostic analysis of the synthesized shaping components indicates increased interdependence in action selection and improved signal alignment in coordination-intensive tasks. These results demonstrate that the search for objective-grounded reward programs can mitigate the burden of manual engineering while producing shaping signals compatible with cooperative learning under finite budgets.
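文中"在固定预算下训练并仅依据稀疏任务回报筛选候选奖励程序"的选择逻辑可示意如下(train_and_eval 为假设的训练评估接口,候选程序用数值替身表示,非论文原实现):

```python
import random

def select_reward_program(candidates, train_and_eval, budget):
    """在固定训练预算下逐一评估候选奖励程序,
    仅以稀疏任务回报作为选择依据(塑形奖励本身不参与选择)。"""
    best, best_return = None, float("-inf")
    for program in candidates:
        task_return = train_and_eval(program, budget)
        if task_return > best_return:
            best, best_return = program, task_return
    return best, best_return

# 假设性的替身:每个"程序"用一个真实质量分数表示,评估带噪声
random.seed(0)
candidates = [0.2, 0.9, 0.5]
evaluate = lambda p, budget: p + random.uniform(-0.05, 0.05)
best, ret = select_reward_program(candidates, evaluate, budget=1000)
print(best)  # → 0.9
```

选择只看任务回报这一点,避免了候选塑形信号"自卖自夸"导致的激励错位。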
[AI-13] Toward Generalist Neural Motion Planners for Robotic Manipulators: Challenges and Opportunities
【速读】:该论文旨在解决当前神经运动规划器(neural motion planner)在未见过的、分布外(out-of-distribution)规划场景中泛化能力不足的问题。其核心挑战在于,尽管神经运动规划器通过快速推理和有效处理运动规划问题的多模态特性提升了效率,但在面对复杂、杂乱环境中的域特定挑战时仍表现不稳定。论文的关键解决方案在于系统性回顾与分析现有最先进的神经运动规划方法,明确其优势与局限,并提出构建具备通用性的神经运动规划器的发展路径,以增强其在多样化任务和环境中的一致性能表现。
链接: https://arxiv.org/abs/2603.24318
作者: Davood Soleymanzadeh,Ivan Lopez-Sanchez,Hao Su,Yunzhu Li,Xiao Liang,Minghui Zheng
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:State-of-the-art generalist manipulation policies have enabled the deployment of robotic manipulators in unstructured human environments. However, these frameworks struggle in cluttered environments primarily because they utilize auxiliary modules for low-level motion planning and control. Motion planning remains challenging due to the high dimensionality of the robot’s configuration space and the presence of workspace obstacles. Neural motion planners have enhanced motion planning efficiency by offering fast inference and effectively handling the inherent multi-modality of the motion planning problem. Despite such benefits, current neural motion planners often struggle to generalize to unseen, out-of-distribution planning settings. This paper reviews and analyzes the state-of-the-art neural motion planners, highlighting both their benefits and limitations. It also outlines a path toward establishing generalist neural motion planners capable of handling domain-specific challenges. For a list of the reviewed papers, please refer to this https URL.
[AI-14] Cost-Sensitive Neighborhood Aggregation for Heterophilous Graphs: When Does Per-Edge Routing Help?
【速读】:该论文旨在解决图神经网络(Graph Neural Network, GNN)在异配性(heterophily)图结构中消息传递机制的有效性问题,特别是区分对抗性异配(adversarial heterophily)与信息性异配(informative heterophily)两种情形下,是否需要对每条边进行细粒度的消息路由(message routing)。其核心解决方案是提出成本敏感的邻域聚合(Cost-Sensitive Neighborhood Aggregation, CSNA)层,该层通过学习投影空间中的成对距离,将消息软路由至一致(concordant)和不一致(discordant)通道,并分别应用独立变换。关键创新在于:当边类型可被有效分离时(即成本函数能区分不同边类型),CSNA能够保留类别判别信号,而均值聚合会削弱此类信号;反之,在信息性异配场景中,由于缺乏可利用的边类型分解,细粒度路由无益。这一发现表明,成本函数对边类型的区分能力本身即可作为诊断工具,用于判断何时精细路由优于统一谱通道(uniform spectral channel)。
链接: https://arxiv.org/abs/2603.24291
作者: Eyal Weiss
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent work distinguishes two heterophily regimes: adversarial, where cross-class edges dilute class signal and harm classification, and informative, where the heterophilous structure itself carries useful signal. We ask: when does per-edge message routing help, and when is a uniform spectral channel sufficient? To operationalize this question we introduce Cost-Sensitive Neighborhood Aggregation (CSNA), a GNN layer that computes pairwise distance in a learned projection and uses it to soft-route each message through concordant and discordant channels with independent transformations. Under a contextual stochastic block model we show that cost-sensitive weighting preserves class-discriminative signal where mean aggregation provably attenuates it, provided w_+/w_- > q/p. On six benchmarks with uniform tuning, CSNA is competitive with state-of-the-art methods on adversarial-heterophily datasets (Texas, Wisconsin, Cornell, Actor) but underperforms on informative-heterophily datasets (Chameleon, Squirrel) – precisely the regime where per-edge routing has no useful decomposition to exploit. The pattern is itself the finding: the cost function’s ability to separate edge types serves as a diagnostic for the heterophily regime, revealing when fine-grained routing adds value over uniform channels and when it does not. Code is available at this https URL.
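摘要所述"在学习到的投影空间中计算成对距离,并将消息软路由到一致/不一致通道"可以用如下草图示意(矩阵 P、Wc、Wd 与门控函数的形式均为假设,非论文原实现):

```python
import numpy as np

def csna_aggregate(h, edges, P, Wc, Wd, tau=1.0):
    """CSNA 式聚合的一步(原理草图,非论文实现)。

    对每条边 (i, j),在投影空间 P 中计算节点距离 d,
    门控 g 随 d 增大而减小,将来自 j 的消息在
    一致通道 Wc 与不一致通道 Wd 之间软分配。
    """
    out = np.zeros_like(h)
    for i, j in edges:
        d = np.linalg.norm(P @ h[i] - P @ h[j])
        g = 1.0 / (1.0 + np.exp((d - 1.0) / tau))  # 距离越小越"一致"
        out[i] += g * (Wc @ h[j]) + (1.0 - g) * (Wd @ h[j])
    return out

h = np.array([[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])
edges = [(0, 1), (0, 2)]  # 节点 0 同时收到同类与异类邻居的消息
P, Wc, Wd = np.eye(2), np.eye(2), -np.eye(2)
out = csna_aggregate(h, edges, P, Wc, Wd)
print(out[0])  # 两条消息经不同通道变换后在节点 0 上同向叠加
```

与均值聚合把同类与异类消息直接抵消不同,这里异类消息经过独立变换后反而强化了类别信号,对应摘要中"成本敏感加权保留判别信号"的直觉。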
[AI-15] Bridging Biological Hearing and Neuromorphic Computing: End-to-End Time-Domain Audio Signal Processing with Reservoir Computing
【速读】:该论文旨在解决音频信号处理中因传统方法计算复杂度高而导致实时性不足的问题,尤其是Mel Frequency Cepstral Coefficients (MFCCs) 提取过程中依赖耗时的时频变换所引发的效率瓶颈。其解决方案的关键在于引入基于时间域的储层计算(Reservoir Computing)机制,通过用卷积操作替代传统的频率域转换步骤,从而在不牺牲特征判别能力的前提下显著简化MFCC提取流程,实现端到端的高效、低功耗音频处理框架,适用于嵌入式系统和语音驱动应用。
链接: https://arxiv.org/abs/2603.24283
作者: Rinku Sebastian,Simon O’Keefe,Martin Trefzer
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite the advancements in cutting-edge technologies, audio signal processing continues to pose challenges and lacks the precision of a human speech processing system. To address these challenges, we propose a novel approach to simplify audio signal processing by leveraging time-domain techniques and reservoir computing. Through our research, we have developed a real-time audio signal processing system by simplifying audio signal processing through the utilization of reservoir computers, which are significantly easier to train. Feature extraction is a fundamental step in speech signal processing, with Mel Frequency Cepstral Coefficients (MFCCs) being a dominant choice due to their perceptual relevance to human hearing. However, conventional MFCC extraction relies on computationally intensive time-frequency transformations, limiting efficiency in real-time applications. To address this, we propose a novel approach that leverages reservoir computing to streamline MFCC extraction. By replacing traditional frequency-domain conversions with convolution operations, we eliminate the need for complex transformations while maintaining feature discriminability. We present an end-to-end audio processing framework that integrates this method, demonstrating its potential for efficient and real-time speech analysis. Our results contribute to the advancement of energy-efficient audio processing technologies, enabling seamless deployment in embedded systems and voice-driven applications. This work bridges the gap between biologically inspired feature extraction and modern neuromorphic computing, offering a scalable solution for next-generation speech recognition systems.
[AI-16] Embracing Heteroscedasticity for Probabilistic Time Series Forecasting
【速读】:该论文旨在解决概率时间序列预测(Probabilistic Time Series Forecasting, PTSF)中对异方差性(heteroscedasticity)建模不足的问题。现有非自回归生成方法(如TimeVAE和K²VAE)依赖基于均方误差(MSE)的训练目标,隐式假设预测方差恒定(homoscedastic),难以刻画真实时间序列中由非平稳动态、状态转换和外部条件变化引起的时变条件方差。解决方案的关键在于提出位置-尺度高斯变分自编码器(Location-Scale Gaussian VAE, LSG-VAE),通过显式参数化预测均值与时间依赖方差的位置-尺度似然结构,从而准确捕捉异方差性带来的认知不确定性(aleatoric uncertainty),并引入自适应衰减机制,在训练中自动降低高波动观测的影响,提升趋势预测的鲁棒性。
链接: https://arxiv.org/abs/2603.24254
作者: Yijun Wang,Qiyuan Zhuang,Xiu-Shen Wei
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Probabilistic time series forecasting (PTSF) aims to model the full predictive distribution of future observations, enabling both accurate forecasting and principled uncertainty quantification. A central requirement of PTSF is to embrace heteroscedasticity, as real-world time series exhibit time-varying conditional variances induced by nonstationary dynamics, regime changes, and evolving external conditions. However, most existing non-autoregressive generative approaches to PTSF, such as TimeVAE and K²VAE, rely on MSE-based training objectives that implicitly impose a homoscedastic assumption, thereby fundamentally limiting their ability to model temporal heteroscedasticity. To address this limitation, we propose the Location-Scale Gaussian VAE (LSG-VAE), a simple but effective framework that explicitly parameterizes both the predictive mean and time-dependent variance through a location-scale likelihood formulation. This design enables LSG-VAE to faithfully capture heteroscedastic aleatoric uncertainty and introduces an adaptive attenuation mechanism that automatically down-weights highly volatile observations during training, leading to improved robustness in trend prediction. Extensive experiments on nine benchmark datasets demonstrate that LSG-VAE consistently outperforms fifteen strong generative baselines while maintaining high computational efficiency suitable for real-time deployment.
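位置-尺度高斯似然与 MSE 的关键差异在于:预测的对数方差会对每一步的平方误差做重新缩放,从而自动衰减高波动观测的影响。下面用一个最小示例说明这种机制(仅为原理示意,非 LSG-VAE 实现):

```python
import numpy as np

def gaussian_nll(y, mu, log_var):
    """位置-尺度高斯的逐步负对数似然:
    预测的 log_var 会对平方误差项做 1/exp(log_var) 的缩放。"""
    return 0.5 * (log_var + (y - mu) ** 2 / np.exp(log_var) + np.log(2 * np.pi))

y = np.array([1.0, 1.0])
mu = np.array([0.0, 0.0])        # 两个时间步的预测误差完全相同...
log_var = np.array([0.0, 2.0])   # ...但第二步被预测为高波动
nll = gaussian_nll(y, mu, log_var)
err_term = (y - mu) ** 2 / np.exp(log_var)
print(err_term)  # 高波动步的误差项被衰减为 exp(-2) 倍
```

这正是摘要所说的"自适应衰减":误差项的权重由模型自身预测的方差决定,而 MSE 相当于把所有 log_var 固定为常数。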
[AI-17] DVM: Real-Time Kernel Generation for Dynamic AI Models
【速读】:该论文旨在解决动态AI模型中编译效率与优化能力之间的矛盾问题,即现有运行时编译(runtime compilation)因编译耗时过长而损害模型效率,而离线编译则面临编译时间长、设备内存占用高或牺牲优化机会以换取可用性的问题。解决方案的关键在于通过加速编译过程或隐藏编译开销来实现高效运行时编译。为此,作者提出了一种实时编译器DVM,其核心创新包括:基于字节码虚拟机的运行时算子编译器,将算子程序编码为字节码并在CPU上完成编译,再解码为虚拟指令直接在NPU上执行;以及结合符号推导和运行时融合的算子融合机制,支持模式驱动和堆叠驱动的融合策略,从而显著提升动态模型的执行效率和编译速度。
链接: https://arxiv.org/abs/2603.24239
作者: Jingzhi Fang,Xiong Gao,Renwei Zhang,Zichun Ye,Lei Chen,Jie Zhao,Chengnuo Huang,Hui Xu,Xuefeng Jin
机构: 未知
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Dynamism is common in AI computation, e.g., the dynamic tensor shapes and the dynamic control flows in models. Due to the long compilation time, existing runtime compilation damages the model efficiency, while the offline compilers either suffer from the long compilation time and device memory footprint to cover all the possible execution instances of a dynamic model, or sacrifice optimization opportunities for usability. In this paper, we rethink the feasibility of runtime compilation for dynamic models and identify that the key for it to work is to speed up the compilation or hide the compilation overhead. To do this, we propose a real-time compiler, DVM. In DVM, we design a runtime operator compiler based on a bytecode virtual machine to perform effective and efficient compilation for each dynamic operator instance given its input. Specifically, instead of compiling programs into machine code, we encode the operator program into bytecode on the CPU and decode the bytecode into virtual instructions for direct execution on the NPU. Based on the runtime operator compiler, we further propose an operator fuser, which performs symbol-deduction-based fusion on static graphs and runtime fusion on dynamic graphs. Both pattern- and stacking-based fusion are supported to increase fusion opportunities. Evaluation on operators, subgraphs, and models shows that, compared with TorchInductor, PyTorch-eager and MindSpore-graph-O0, we are up to 11.77× better in terms of the operator/model efficiency and up to 5 orders of magnitude faster in terms of the maximum compilation time.
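摘要中"在 CPU 上把算子程序编码为字节码、再解码为虚拟指令执行"的思路,可以用一个极简的栈式字节码虚拟机来示意(指令集与编码方案均为假设,仅在结构上类比 DVM 的分工,与其实际设计无关):

```python
# 一个极简的栈式字节码虚拟机:编码端把算子表达式翻译成 (操作码, 参数) 序列,
# 解码端逐条执行——省去了为每个动态输入实例生成机器码的开销。
LOAD, CONST, ADD, MUL = range(4)

def run(bytecode, inputs):
    """解码字节码并在操作数栈上直接执行。"""
    stack = []
    for op, arg in bytecode:
        if op == LOAD:                 # 读取第 arg 个运行时输入
            stack.append(inputs[arg])
        elif op == CONST:              # 压入编码期确定的常量
            stack.append(arg)
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == MUL:
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack.pop()

# y = 2*x0 + x1:程序只编码一次,即可对每个动态输入实例重复执行
program = [(CONST, 2), (LOAD, 0), (MUL, None), (LOAD, 1), (ADD, None)]
print(run(program, [3, 4]))  # → 10
```

编码一次、按输入实例反复解码执行,正是字节码方案能把运行时编译开销压到远低于机器码生成的原因。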
[AI-18] Environment-Grounded Multi-Agent Workflow for Autonomous Penetration Testing
【速读】:该论文旨在解决数字基础设施日益复杂和互联背景下,如何实现可扩展且可靠的自动化渗透测试方法问题,特别是在高度网络化的机器人系统(Robotics-based systems)中。其解决方案的关键在于提出一种环境感知的多智能体架构,该架构在执行过程中动态构建基于图结构的共享记忆,以捕获可观测的系统状态,包括网络拓扑、通信通道、漏洞及尝试的攻击行为,从而在保持测试过程可追溯性和上下文管理能力的同时,实现结构化自动化。
链接: https://arxiv.org/abs/2603.24221
作者: Michael Somma,Markus Großpointner,Paul Zabalegui,Eppu Heilimo,Branka Stojanović
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:The increasing complexity and interconnectivity of digital infrastructures make scalable and reliable security assessment methods essential. Robotic systems represent a particularly important class of operational technology, as modern robots are highly networked cyber-physical systems deployed in domains such as industrial automation, logistics, and autonomous services. This paper explores the use of large language models for automated penetration testing in robotic environments. We propose an environment-grounded multi-agent architecture tailored to Robotics-based systems. The approach dynamically constructs a shared graph-based memory during execution that captures the observable system state, including network topology, communication channels, vulnerabilities, and attempted exploits. This enables structured automation while maintaining traceability and effective context management throughout the testing process. Evaluated across multiple iterations within a specialized robotics Capture-the-Flag scenario (ROS/ROS2), the system demonstrated high reliability, successfully completing the challenge in 100% of test runs (n=5). This performance significantly exceeds literature benchmarks while maintaining the traceability and human oversight required by frameworks like the EU AI Act.
[AI-19] Uncovering Memorization in Timeseries Imputation models: LBRM Membership Inference and its link to attribute Leakage
【速读】:该论文旨在解决时间序列插补模型(time series imputation model)在实际部署中面临的隐私泄露问题,特别是针对黑盒环境下的成员推理攻击(membership inference attack)和属性推理攻击(attribute inference attack)。其关键解决方案在于提出了一种两阶段攻击框架:第一阶段设计了一种基于参考模型(reference model)的新型成员推理攻击方法,显著提升了对鲁棒于过拟合攻击模型的检测准确率;第二阶段首次实现了针对时间序列插补模型的属性推理攻击,可预测训练数据中的敏感特征。实验表明,该成员推理攻击在训练从零开始和微调场景下均表现优异,且能有效预判属性推理攻击的成功概率(精度达90%,优于通用情况下的78%)。
链接: https://arxiv.org/abs/2603.24213
作者: Faiz Taleb,Ivan Gazeau,Maryline Laurent
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Deep learning models for time series imputation are now essential in fields such as healthcare, the Internet of Things (IoT), and finance. However, their deployment raises critical privacy concerns. Beyond the well-known issue of unintended memorization, which has been extensively studied in generative models, we demonstrate that time series models are vulnerable to inference attacks in a black-box setting. In this work, we introduce a two-stage attack framework comprising: (1) a novel membership inference attack based on a reference model that improves detection accuracy, even for models robust to overfitting-based attacks, and (2) the first attribute inference attack that predicts sensitive characteristics of the training data for time series imputation models. We evaluate these attacks on attention-based and autoencoder architectures in two scenarios: models that are trained from scratch, and fine-tuned models where the adversary has access to the initial weights. Our experimental results demonstrate that the proposed membership attack retrieves a significant portion of the training data with a tpr@top25% score significantly higher than a naive attack baseline. We show that our membership attack also provides a good insight into whether attribute inference will work (with a precision of 90% instead of 78% in the general case).
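基于参考模型的成员推理的核心校准思想——比较目标模型与"未见过该样本"的参考模型在同一样本上的损失差——可示意如下(数据为合成假设,函数名与指标计算方式均为示意,非论文原攻击实现):

```python
import numpy as np

def membership_scores(target_loss, reference_loss):
    """参考模型校准的成员分数:目标模型损失比参考模型低得越多,
    该样本越可能出现在目标模型的训练集中。"""
    return reference_loss - target_loss

def tpr_at_top_k(scores, is_member, k):
    """分数最高的 k 个样本中真实成员所占比例(类比文中 tpr@top25% 指标)。"""
    top = np.argsort(scores)[::-1][:k]
    return float(is_member[top].mean())

# 合成数据:前 50 个为训练集成员(目标模型对其拟合更好,损失更低)
rng = np.random.default_rng(0)
is_member = np.array([1] * 50 + [0] * 50)
ref = rng.normal(1.0, 0.1, 100)
tgt = ref - 0.3 * is_member + rng.normal(0.0, 0.05, 100)
tpr = tpr_at_top_k(membership_scores(tgt, ref), is_member, k=25)
print(tpr)
```

用参考模型做基线扣除,正是这类攻击即便面对"抗过拟合"模型仍有效的原因:它衡量的是相对差异而非绝对损失。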
[AI-20] Invisible Threats from Model Context Protocol: Generating Stealthy Injection Payload via Tree-based Adaptive Search
【速读】:该论文旨在解决基于模型上下文协议(Model Context Protocol, MCP)的大语言模型(Large Language Models, LLMs)在调用外部工具时所面临的一种新型安全威胁——恶意操纵工具响应的攻击问题。现有间接提示注入(indirect prompt injection)方法存在部署成本高、语义连贯性弱、依赖白盒信息或易被防御机制检测等缺陷。为应对这一挑战,作者提出树状结构载荷注入(Tree structured Injection for Payloads, TIP),其核心创新在于将载荷生成建模为树状结构搜索问题,并通过粗粒度到细粒度的优化框架引导搜索过程;同时引入路径感知反馈机制以稳定训练并避免局部最优,以及基于可观测防御信号动态调整探索预算,从而在黑盒环境下实现高效且隐蔽的攻击。实验表明,TIP在无防御场景下攻击成功率超过95%,查询次数仅为先前自适应攻击的十分之一,并在四种主流防御策略下仍保持超50%的有效性,显著优于当前最先进攻击方法。
链接: https://arxiv.org/abs/2603.24203
作者: Yulin Shen,Xudong Pan,Geng Hong,Min Yang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in the Model Context Protocol (MCP) have enabled large language models (LLMs) to invoke external tools with unprecedented ease. This creates a new class of powerful, tool-augmented agents. Unfortunately, this capability also introduces an under-explored attack surface, specifically the malicious manipulation of tool responses. Existing techniques for indirect prompt injection that target MCP suffer from high deployment costs, weak semantic coherence, or heavy white box requirements. Furthermore, they are often easily detected by recently proposed defenses. In this paper, we propose Tree structured Injection for Payloads (TIP), a novel black-box attack which generates natural payloads to reliably seize control of MCP enabled agents even under defense. Technically, we cast payload generation as a tree structured search problem and guide the search with an attacker LLM operating under our proposed coarse-to-fine optimization framework. To stabilize learning and avoid local optima, we introduce a path-aware feedback mechanism that surfaces only high quality historical trajectories to the attacker model. The framework is further hardened against defensive transformations by explicitly conditioning the search on observable defense signals and dynamically reallocating the exploration budget. Extensive experiments on four mainstream LLMs show that TIP attains over 95% attack success in undefended settings while requiring an order of magnitude fewer queries than prior adaptive attacks. Against four representative defense approaches, TIP preserves more than 50% effectiveness and significantly outperforms the state-of-the-art attacks. By implementing the attack on real world MCP systems, our results expose an invisible but practical threat vector in MCP deployments. We also discuss potential mitigation approaches to address this critical security gap.
[AI-21] A Deep Dive into Scaling RL for Code Generation with Synthetic Data and Curricula
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在提升大语言模型性能时,随着规模扩大难以持续维持效果的问题,核心挑战在于数据多样性与结构而非单纯的数据量成为瓶颈。解决方案的关键在于提出了一种可扩展的多轮合成数据生成流水线:通过一个教师模型基于上下文中的学生模型表现摘要迭代优化问题,无需对教师模型进行微调即可生成具有结构化难度递进的合成数据;该方法显著提升了有效合成问题的产出率,并自然产生难度梯度(即同一任务的更易或更难变体),从而支持课程学习(curriculum-based training)策略,实证表明其能有效增强模型在领域内代码和多数跨域数学任务上的性能。
链接: https://arxiv.org/abs/2603.24202
作者: Cansu Sancaktar,David Zhang,Gabriel Synnaeve,Taco Cohen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning (RL) has emerged as a powerful paradigm for improving large language models beyond supervised fine-tuning, yet sustaining performance gains at scale remains an open challenge, as data diversity and structure, rather than volume alone, become the limiting factor. We address this by introducing a scalable multi-turn synthetic data generation pipeline in which a teacher model iteratively refines problems based on in-context student performance summaries, producing structured difficulty progressions without any teacher fine-tuning. Compared to single-turn generation, this multi-turn approach substantially improves the yield of valid synthetic problems and naturally produces stepping stones, i.e. easier and harder variants of the same core task, that support curriculum-based training. We systematically study how task difficulty, curriculum scheduling, and environment diversity interact during RL training across the Llama3.1-8B Instruct and Qwen3-8B Base model families, with additional scaling experiments on Qwen2.5-32B. Our results show that synthetic augmentation consistently improves in-domain code and in most cases out-of-domain math performance, and we provide empirical insights into how curriculum design and data diversity jointly shape RL training dynamics.
[AI-22] KCLNet: Electrically Equivalence-Oriented Graph Representation Learning for Analog Circuits
【速读】:该论文旨在解决模拟电路(analog circuit)表示学习(representation learning)的难题,其核心挑战在于模拟电路具有连续的电气特性,与数字电路的离散状态相比更难建模。解决方案的关键在于提出了一种基于直流电(DC)等效的表示学习框架KCLNet,其创新性地引入了受基尔霍夫电流定律(Kirchhoff’s Current Law, KCL)启发的嵌入空间约束机制:通过在图神经网络中设计电学模拟的消息传递过程,并强制每个节点处流出与流入电流嵌入之和相等,从而维持嵌入空间的有序性,显著提升电路嵌入的泛化能力。这一方法在保持电气约束的前提下实现了对模拟电路的有效表征,实验证明其在电路分类、子电路检测及电路编辑距离预测等下游任务中均表现优异。
链接: https://arxiv.org/abs/2603.24101
作者: Peng Xu,Yapeng Li,Tinghuan Chen,Tsung-Yi Ho,Bei Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Digital circuits representation learning has made remarkable progress in the electronic design automation domain, effectively supporting critical tasks such as testability analysis and logic reasoning. However, representation learning for analog circuits remains challenging due to their continuous electrical characteristics compared to the discrete states of digital circuits. This paper presents a direct current (DC) electrically equivalent-oriented analog representation learning framework, named KCLNet. It comprises an asynchronous graph neural network structure with electrically-simulated message passing and a representation learning method inspired by Kirchhoff’s Current Law (KCL). This method maintains the orderliness of the circuit embedding space by enforcing the equality of the sum of outgoing and incoming current embeddings at each depth, which significantly enhances the generalization ability of circuit embeddings. KCLNet offers a novel and effective solution for analog circuit representation learning with electrical constraints preserved. Experimental results demonstrate that our method achieves significant performance in a variety of downstream tasks, e.g., analog circuit classification, subcircuit detection, and circuit edit distance prediction.
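摘要中"强制每个节点处流出与流入电流嵌入之和相等"的 KCL 约束,可以写成如下示意性的惩罚项(嵌入维度与数值均为假设,仅说明约束形式,非 KCLNet 的实际损失):

```python
import numpy as np

def kcl_penalty(incoming, outgoing):
    """KCL 式惩罚项(示意):对单个电路节点,
    流入"电流嵌入"之和应等于流出嵌入之和,残差的平方范数即为惩罚。"""
    residual = sum(incoming) - sum(outgoing)
    return float(np.dot(residual, residual))

# 一个节点:两股流入电流嵌入拆分为两股流出
i1, i2 = np.array([1.0, 0.0]), np.array([0.0, 2.0])
o1, o2 = np.array([0.5, 1.0]), np.array([0.5, 1.0])
print(kcl_penalty([i1, i2], [o1, o2]))  # → 0.0(流入 [1,2] 与流出 [1,2] 平衡)
```

训练时把各节点的此类残差加入损失,即可让嵌入空间保持与基尔霍夫电流定律一致的"守恒"结构。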
[AI-23] Towards Effective Experiential Learning: Dual Guidance for Utilization and Internalization
【速读】:该论文旨在解决当前基于强化学习(Reinforcement Learning, RL)的训练方法在提升大语言模型(Large Language Models, LLMs)推理能力时,仍难以有效模拟人类学习过程的问题,尤其是缺乏对外部经验与内部知识的协同利用和内化机制。解决方案的关键在于提出一种统一框架——Dual Guidance Optimization (DGO),其核心是构建一个经验库(experience bank)来存储历史探索轨迹,并通过外部经验库与模型内部知识的联合引导实现更高效的探索;同时,新生成的轨迹被用于动态更新经验库并优化模型参数,形成经验利用与内化的闭环机制,从而显著提升LLMs在可验证奖励强化学习(Reinforcement Learning from Verifiable Rewards, RLVR)场景下的推理性能。
链接: https://arxiv.org/abs/2603.24093
作者: Fei Bai,Zhipeng Chen,Chuan Hao,Ming Yang,Ran Tao,Bryan Dai,Wayne Xin Zhao,Jian Yang,Hongteng Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recently, reinforcement learning (RL) has become an important approach for improving the capabilities of large language models (LLMs). In particular, reinforcement learning from verifiable rewards (RLVR) has emerged as a promising paradigm for reasoning tasks. However, existing RL-based training still remains only a rough approximation to human learning. Human learners leverage both external and internal experience to guide exploration and gradually internalize useful trajectories into stable knowledge. Motivated by this gap, we ask: how can LLMs better utilize and internalize experience during RLVR training? To answer this question, we propose Dual Guidance Optimization (DGO), a unified framework that leverages external and internal experience to improve training effectiveness. Specifically, DGO first constructs an experience bank from previously explored trajectories. The policy then performs exploration under the joint guidance of the experience bank and the model’s internal knowledge. The resulting trajectories are further used to refine the experience bank and optimize model parameters, forming a closed loop of experience utilization and internalization. Experiments show that DGO consistently outperforms baseline methods, suggesting that better utilization and internalization of experience lead to more effective reasoning.
[AI-24] Bridging the Evaluation Gap: Standardized Benchmarks for Multi-Objective Search
【速读】:该论文旨在解决多目标搜索(Multi-Objective Search, MOS)领域中实证评估长期存在的碎片化问题,即不同研究使用异构的问题实例和不兼容的目标定义,导致跨研究比较困难。尤其指出DIMACS道路网络作为传统基准存在目标高度相关的问题,无法体现多样化的帕累托前沿(Pareto-front)结构。解决方案的关键在于提出首个全面且标准化的MOS基准套件,涵盖四个结构差异显著的领域:真实世界道路网络、结构化合成图、基于游戏的网格环境以及高维机器人运动规划路网;通过提供固定的图实例、标准的起终点查询及精确与近似参考帕累托最优解集,系统性地覆盖从强相关到严格独立的目标交互模式,从而为未来MOS评估提供统一、可复现且结构全面的基础。
链接: https://arxiv.org/abs/2603.24084
作者: Hadar Peer,Carlos Hernandez,Sven Koenig,Ariel Felner,Oren Salzman
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Empirical evaluation in multi-objective search (MOS) has historically suffered from fragmentation, relying on heterogeneous problem instances with incompatible objective definitions that make cross-study comparisons difficult. This standardization gap is further exacerbated by the realization that DIMACS road networks, a historical default benchmark for the field, exhibit highly correlated objectives that fail to capture diverse Pareto-front structures. To address this, we introduce the first comprehensive, standardized benchmark suite for exact and approximate MOS. Our suite spans four structurally diverse domains: real-world road networks, structured synthetic graphs, game-based grid environments, and high-dimensional robotic motion-planning roadmaps. By providing fixed graph instances, standardized start-goal queries, and both exact and approximate reference Pareto-optimal solution sets, this suite captures a full spectrum of objective interactions: from strongly correlated to strictly independent. Ultimately, this benchmark provides a common foundation to ensure future MOS evaluations are robust, reproducible, and structurally comprehensive.
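基准套件为每个查询提供精确参考帕累托最优解集;作为背景说明,双目标最小化问题的非支配解集可按如下方式计算(O(n²) 朴素实现,仅为原理示意):

```python
def pareto_front(solutions):
    """双目标最小化下的非支配解集(O(n²) 朴素实现):
    若不存在其他解在两个目标上都不劣于它、且至少一处严格更优,
    则该解保留在帕累托前沿上。"""
    front = []
    for s in solutions:
        dominated = any(
            o[0] <= s[0] and o[1] <= s[1] and o != s for o in solutions
        )
        if not dominated:
            front.append(s)
    return sorted(front)

# 例:以 (距离, 风险) 为双目标的候选路径
paths = [(3, 5), (4, 4), (5, 3), (4, 6), (6, 6)]
print(pareto_front(paths))  # → [(3, 5), (4, 4), (5, 3)]
```

两目标强相关时前沿会退化为极少数解(这正是文中对 DIMACS 路网基准的批评),而目标独立时前沿规模可随实例急剧增长,因此基准需要覆盖两端之间的完整谱系。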
[AI-25] Knowledge-Guided Manipulation Using Multi-Task Reinforcement Learning ICRA2026
【速读】:该论文旨在解决机器人在部分可观测环境下进行多任务操作时面临的感知不完整、泛化能力弱以及控制决策缺乏语义信息等问题。其核心挑战在于如何将开放词汇的视觉检测结果与环境中的空间关系、物体属性(如容纳性与可操作性)相结合,以构建一个持续更新且具语义意义的世界表征,从而提升策略学习的样本效率和跨场景适应性。解决方案的关键在于提出一种基于知识图谱的多任务强化学习框架(KG-M3PO),通过在线构建3D场景图(scene graph)将开放词汇目标锚定到度量化的、关系型表示中,并引入动态关系机制实时更新空间、包含及可操作边;同时,采用图神经网络编码器端到端地联合优化感知与控制目标,使关系特征直接由任务表现驱动。该方法融合视觉、本体感觉、语言和图结构等多种模态信息至共享潜在空间,政策条件仅依赖轻量级图查询与视觉/本体感觉输入,形成紧凑而语义丰富的状态表示,显著提升了复杂遮挡、干扰物和布局变化场景下的成功率与鲁棒性。
链接: https://arxiv.org/abs/2603.24083
作者: Aditya Narendra,Mukhammadrizo Maribjonov,Dmitry Makarov,Dmitry Yudin,Aleksandr Panov
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 8 figures. Accepted to IEEE International Conference on Robotics and Automation (ICRA 2026)
Abstract:This paper introduces Knowledge Graph based Massively Multi-task Model-based Policy Optimization (KG-M3PO), a framework for multi-task robotic manipulation in partially observable settings that unifies Perception, Knowledge, and Policy. The method augments egocentric vision with an online 3D scene graph that grounds open-vocabulary detections into a metric, relational representation. A dynamic-relation mechanism updates spatial, containment, and affordance edges at every step, and a graph neural encoder is trained end-to-end through the RL objective so that relational features are shaped directly by control performance. Multiple observation modalities (visual, proprioceptive, linguistic, and graph-based) are encoded into a shared latent space, upon which the RL agent operates to drive the control loop. The policy conditions on lightweight graph queries alongside visual and proprioceptive inputs, yielding a compact, semantically informed state for decision making. Experiments on a suite of manipulation tasks with occlusions, distractors, and layout shifts demonstrate consistent gains over strong baselines: the knowledge-conditioned agent achieves higher success rates, improved sample efficiency, and stronger generalization to novel objects and unseen scene configurations. These results support the premise that structured, continuously maintained world knowledge is a powerful inductive bias for scalable, generalizable manipulation: when the knowledge module participates in the RL computation graph, relational representations align with control, enabling robust long-horizon behavior under partial observability.
[AI-26] Enhanced Mycelium of Thought (EMoT): A Bio-Inspired Hierarchical Reasoning Architecture with Strategic Dormancy and Mnemonic Encoding
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在复杂多领域推理任务中面临的三大局限:线性或树状推理路径缺乏持久记忆、战略休眠机制缺失以及跨域知识融合能力不足。其解决方案的核心是提出增强型思维菌丝体(Enhanced Mycelium of Thought, EMoT)框架,该框架采用四层分层拓扑结构(微观、介观、宏观、元层级),引入推理节点的战略性休眠与再激活机制,并集成包含五种编码风格的记忆宫殿(Memory Palace)系统,从而实现更稳定、可复用且具备跨域整合能力的推理过程。
链接: https://arxiv.org/abs/2603.24065
作者: Florian Odi Stummer
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 32 pages, 6 figures, 15 tables; includes ablation studies and reasoning trace visualisation
Abstract:Current prompting paradigms for large language models (LLMs), including Chain-of-Thought (CoT) and Tree-of-Thoughts (ToT), follow linear or tree-structured reasoning paths that lack persistent memory, strategic dormancy, and cross-domain synthesis. We present the Enhanced Mycelium of Thought (EMoT) framework, a bio-inspired reasoning architecture that organises cognitive processing into a four-level hierarchy (Micro, Meso, Macro, Meta), implements strategic dormancy and reactivation of reasoning nodes, and integrates a Memory Palace with five mnemonic encoding styles. EMoT is a research prototype for complex, multi-domain problems, not a general-purpose prompting enhancement. Two complementary evaluations reveal a characteristic trade-off. In a blind LLM-as-Judge evaluation across three domains, EMoT achieved near-parity with CoT (4.20 vs. 4.33/5.0) with higher stability, and outperformed CoT on Cross-Domain Synthesis (4.8 vs. 4.4). Ablation studies show that strategic dormancy is architecturally essential (quality collapsed from 4.2 to 1.0 when disabled). On a 15-item short-answer benchmark, EMoT (27%) substantially underperformed simpler baselines, confirming systematic overthinking on simple problems. These results are subject to important limitations: small sample sizes (n=3 complex cases, n=15 short-answer items), LLM-as-Judge evaluation with potential self-preference bias, and approximately 33-fold computational cost overhead. To our knowledge, EMoT is the first reasoning framework to combine hierarchical topology, strategic thought dormancy with reactivation, and mnemonic memory encoding in a single architecture.
[AI-27] ELITE: Experiential Learning and Intent-Aware Transfer for Self-improving Embodied Agents
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在构建具身智能体(embodied agents)时面临的根本性问题:VLMs 依赖静态训练数据,缺乏与物理环境的交互能力,导致其在执行复杂任务时频繁跳过关键步骤、提出无效动作并重复错误。为弥补这一语义理解与可靠动作执行之间的鸿沟,作者提出 ELITE 框架,其核心创新在于两个协同机制——自省式知识构建(self-reflective knowledge construction)和意图感知检索(intent-aware retrieval)。其中,自省式知识构建通过结构化精炼操作从执行轨迹中提取可复用策略并维护动态演化的策略池;意图感知检索则基于当前任务意图从策略池中识别并应用相关策略。该方案使智能体能够在无监督环境下持续学习自身交互经验,并有效迁移至程序相似的新任务,显著提升性能表现。
链接: https://arxiv.org/abs/2603.24018
作者: Bingqing Wei,Zhongyu Xia,Dingai Liu,Xiaoyu Zhou,Zhiwei Lin,Yongtao Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-language models (VLMs) have shown remarkable general capabilities, yet embodied agents built on them fail at complex tasks, often skipping critical steps, proposing invalid actions, and repeating mistakes. These failures arise from a fundamental gap between the static training data of VLMs and the physical interaction for embodied tasks. VLMs can learn rich semantic knowledge from static data but lack the ability to interact with the world. To address this issue, we introduce ELITE, an embodied agent framework with Experiential Learning and Intent-aware Transfer that enables agents to continuously learn from their own environment interaction experiences, and transfer acquired knowledge to procedurally similar tasks. ELITE operates through two synergistic mechanisms, i.e., self-reflective knowledge construction and intent-aware retrieval. Specifically, self-reflective knowledge construction extracts reusable strategies from execution trajectories and maintains an evolving strategy pool through structured refinement operations. Then, intent-aware retrieval identifies relevant strategies from the pool and applies them to current tasks. Experiments on the EB-ALFRED and EB-Habitat benchmarks show that ELITE achieves 9% and 5% performance improvement over base VLMs in the online setting without any supervision. In the supervised setting, ELITE generalizes effectively to unseen task categories, achieving better performance compared to state-of-the-art training-based methods. These results demonstrate the effectiveness of ELITE for bridging the gap between semantic understanding and reliable action execution.
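意图感知检索的核心是用当前任务意图对策略池中的条目打分,取最相关者。下面用纯 Python 的词袋余弦相似度给出一个最小示意(ELITE 实际应使用学习到的嵌入表示;此处的策略池、词表与意图均为虚构示例):

```python
# Bag-of-words cosine retrieval over a strategy pool.
# The strategies and the intent string are invented placeholders.
import math

def vectorize(text, vocab):
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

pool = ["open the fridge before grasping the milk",
        "heat the mug using the microwave",
        "slice the apple on the countertop"]
intent = "pick up the milk from the fridge"
vocab = sorted({w for s in pool + [intent] for w in s.lower().split()})
scores = [cosine(vectorize(intent, vocab), vectorize(s, vocab)) for s in pool]
best = pool[max(range(len(pool)), key=scores.__getitem__)]
print(best)
```

词袋只是占位:换成任意句向量模型后,检索逻辑(按相似度取最大)保持不变。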
[AI-28] Language-Grounded Multi-Agent Planning for Personalized and Fair Participatory Urban Sensing
【速读】:该论文旨在解决参与式城市感知(Participatory Urban Sensing)中因依赖集中优化和假设参与者同质性而导致的分配僵化问题,该问题忽视了个体偏好与异构城市环境的多样性。解决方案的关键在于提出一种基于大语言模型(LLM)的多智能体框架MAPUS,其中参与者被建模为具有个人档案和日程安排的自主智能体,协调者智能体则通过基于语言的协商机制实现公平导向的选择与感知路径优化,从而在保证感知覆盖度的同时显著提升参与者的满意度与公平性。
链接: https://arxiv.org/abs/2603.24014
作者: Xusen Guo,Mingxing Peng,Hongliang Lu,Hai Yang,Jun Ma,Yuxuan Liang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 19 pages, 12 figures
Abstract:Participatory urban sensing leverages human mobility for large-scale urban data collection, yet existing methods typically rely on centralized optimization and assume homogeneous participants, resulting in rigid assignments that overlook personal preferences and heterogeneous urban contexts. We propose MAPUS, an LLM-based multi-agent framework for personalized and fair participatory urban sensing. In our framework, participants are modeled as autonomous agents with individual profiles and schedules, while a coordinator agent performs fairness-aware selection and refines sensing routes through language-based negotiation. Experiments on real-world datasets show that MAPUS achieves competitive sensing coverage while substantially improving participant satisfaction and fairness, promoting more human-centric and sustainable urban sensing systems.
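协调者的公平导向选择可以用一个极简的贪心打分来示意:覆盖增益除以按历史被选次数递增的折扣项(打分方式为虚构假设,MAPUS 实际采用基于语言的协商机制):

```python
# Greedy fairness-aware participant selection.
# The gain/discount formula and all numbers are illustrative assumptions.

def select_participants(gains, history, k=2, alpha=0.5):
    """Pick k participants by coverage gain discounted by past selections."""
    scored = {p: g / (1.0 + alpha * history.get(p, 0)) for p, g in gains.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]

gains = {"p1": 10.0, "p2": 9.0, "p3": 4.0}
history = {"p1": 4, "p2": 0, "p3": 0}  # p1 has been picked often before
chosen = select_participants(gains, history)
print(chosen)
```

即便 p1 的即时覆盖增益最高,折扣项也会把机会让给历史上较少被选中的参与者,体现"公平换覆盖"的权衡。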
[AI-29] Understanding the Challenges in Iterative Generative Optimization with LLMs
【速读】:该论文试图解决生成式优化(Generative Optimization)在实际应用中表现脆弱的问题,即尽管其作为构建自改进智能体(self-improving agents)的潜力已被广泛认可,但在实践中仅有极少数智能体采用自动化优化机制。论文指出,这种脆弱性源于设计学习循环时存在的“隐藏”决策:优化器可编辑的内容、执行反馈的信用范围(credit horizon),以及如何将试验与错误批量整合为学习证据。解决方案的关键在于明确并系统化这些设计因素——通过案例研究发现,起始人工制品(starting artifact)决定了可行解空间,截断的执行轨迹仍能提升Atari代理性能,而更大的小批量(minibatch)并不一定单调提升BigBench Extra Hard上的泛化能力。因此,论文强调缺乏跨领域通用的学习循环设置方法是生产部署的主要障碍,并提供了针对上述关键因素的实践指导以提升生成式优化的稳定性和可推广性。
链接: https://arxiv.org/abs/2603.23994
作者: Allen Nie,Xavier Daull,Zhiyi Kuang,Abhinav Akkiraju,Anish Chaudhuri,Max Piasevoli,Ryan Rong,YuCheng Yuan,Prerit Choudhary,Shannon Xiao,Rasool Fakoor,Adith Swaminathan,Ching-An Cheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 36 pages, 17 figures
Abstract:Generative optimization uses large language models (LLMs) to iteratively improve artifacts (such as code, workflows or prompts) using execution feedback. It is a promising approach to building self-improving agents, yet in practice remains brittle: despite active research, only 9% of surveyed agents used any automated optimization. We argue that this brittleness arises because, to set up a learning loop, an engineer must make “hidden” design choices: What can the optimizer edit and what is the “right” learning evidence to provide at each update? We investigate three factors that affect most applications: the starting artifact, the credit horizon for execution traces, and batching trials and errors into learning evidence. Through case studies in MLAgentBench, Atari, and BigBench Extra Hard, we find that these design decisions can determine whether generative optimization succeeds, yet they are rarely made explicit in prior work. Different starting artifacts determine which solutions are reachable in MLAgentBench, truncated traces can still improve Atari agents, and larger minibatches do not monotonically improve generalization on BBEH. We conclude that the lack of a simple, universal way to set up learning loops across domains is a major hurdle for productionization and adoption. We provide practical guidance for making these choices.
[AI-30] From Untamed Black Box to Interpretable Pedagogical Orchestration: The Ensemble of Specialized LLMs Architecture for Adaptive Tutoring
【速读】:该论文旨在解决当前教育对话系统中使用的单体大型语言模型(Monolithic Large Language Models, LLMs)因缺乏可解释性和可控性而导致的教学行为不可审计、违反教学约束(如过早提供提示)等问题。其解决方案的关键在于提出一种“专业化LLM集合架构”(Ensemble of Specialized LLMs, ES-LLMs),通过将决策机制与自然语言生成解耦:由基于规则的编排器(orchestrator)根据可解释的贝叶斯知识追踪(Bayesian Knowledge Tracing, BKT)模型选择具体教学动作(如辅导、评估、反馈等),再由专用LLM渲染器生成对应语句。这种结构实现了对教学约束(如“尝试后再提示”)的显式强制执行,并提升了系统的可靠性、可控性与资源效率,实验证明其在教学质量、透明度和运行成本方面显著优于单体基线模型。
链接: https://arxiv.org/abs/2603.23990
作者: Nizam Kadir
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Accepted as a FULL paper at the 27th International Conference on Artificial Intelligence in Education (AIED 2026). 15 pages, 4 figures, 4 tables
Abstract:Monolithic Large Language Models (LLMs) used in educational dialogue often behave as “black boxes,” where pedagogical decisions are implicit and difficult to audit, frequently violating instructional constraints by providing answers too early. We introduce the Ensemble of Specialized LLMs (ES-LLMs) architecture that separates decision-making from wording. Pedagogical actions are selected by a deterministic rules-based orchestrator coordinating specialized agents covering tutoring, assessment, feedback, scaffolding, motivation, and ethics, guided by an interpretable Bayesian Knowledge Tracing (BKT) student model. An LLM renderer surface-realizes the chosen action in natural language. This design emphasizes reliability and controllability: constraints such as “attempt-before-hint” and hint caps are enforced as explicit rules, and the system logs per-turn agent traces and constraint checks. Validation of pedagogical quality via human expert reviewers (N=6) and a multi-LLM-as-Judge panel (six state-of-the-art models) showed that ES-LLMs were preferred in 91.7% and 79.2% of cases, respectively. The architecture significantly outperformed monolithic baselines across all seven dimensions, particularly in Scaffolding & Guidance and Trust & Explainability. Furthermore, a Monte Carlo simulation (N=2,400) exposed a “Mastery Gain Paradox,” where monolithic tutors inflated short-term performance through over-assistance. In contrast, ES-LLMs achieved 100% adherence to pedagogical constraints (e.g., attempt-before-hint) and a 3.3x increase in hint efficiency. Operationally, ES-LLMs reduced costs by 54% and latency by 22% by utilizing stateless prompts. We conclude that structural decoupling is essential for transforming stochastic models into trustworthy, verifiable and resource-efficient pedagogical agents.
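摘要中的贝叶斯知识追踪(BKT)学生模型有标准的后验更新公式,可以用几行代码演示(p_slip、p_guess、p_transit 为示意取值,并非论文使用的参数):

```python
# Standard Bayesian Knowledge Tracing (BKT) update.
# Parameter values are illustrative, not those of the ES-LLMs paper.

def bkt_update(p_mastery, correct, p_slip=0.1, p_guess=0.2, p_transit=0.15):
    """Posterior over mastery after one observed answer, then a learning step."""
    if correct:
        num = p_mastery * (1 - p_slip)
        den = num + (1 - p_mastery) * p_guess
    else:
        num = p_mastery * p_slip
        den = num + (1 - p_mastery) * (1 - p_guess)
    posterior = num / den
    # Transition: the student may acquire the skill between opportunities.
    return posterior + (1 - posterior) * p_transit

p = 0.3
for obs in [True, True, False, True]:
    p = bkt_update(p, obs)
print(round(p, 3))
```

答对使掌握概率的后验上升、答错使其下降,正是编排器据以选择辅导动作的可解释信号。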
[AI-31] SafeFlow: Real-Time Text-Driven Humanoid Whole-Body Control via Physics-Guided Rectified Flow and Selective Safety Gating
【速读】:该论文旨在解决当前基于文本驱动的仿人机器人全身运动生成方法中存在的物理不可行性(physical hallucinations)问题,即生成的运动轨迹在实际机器人执行时难以被跟踪或存在安全隐患,尤其在分布外(out-of-distribution, OOD)输入下更为严重。解决方案的关键在于提出SafeFlow框架,其核心创新包括:1)采用物理引导的修正流匹配(Physics-Guided Rectified Flow Matching)在变分自编码器(VAE)潜空间中生成更可执行的运动轨迹,并通过Reflow加速采样以降低函数评估次数(NFE),提升实时性;2)设计三阶段安全门(3-Stage Safety Gate),通过文本嵌入空间中的马氏距离检测语义OOD提示、方向敏感性差异度量过滤不稳定生成、以及最终硬性关节与速度约束,确保输出轨迹的安全性和可行性。该方案显著提升了生成运动的真实世界可执行性与鲁棒性。
链接: https://arxiv.org/abs/2603.23983
作者: Hanbyel Cho,Sang-Hun Kim,Jeonguk Kang,Donghan Koo
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: Project Page: this https URL
Abstract:Recent advances in real-time interactive text-driven motion generation have enabled humanoids to perform diverse behaviors. However, kinematics-only generators often exhibit physical hallucinations, producing motion trajectories that are physically infeasible to track with a downstream motion tracking controller or unsafe for real-world deployment. These failures often arise from the lack of explicit physics-aware objectives for real-robot execution and become more severe under out-of-distribution (OOD) user inputs. Hence, we propose SafeFlow, a text-driven humanoid whole-body control framework that combines physics-guided motion generation with a 3-Stage Safety Gate driven by explicit risk indicators. SafeFlow adopts a two-level architecture. At the high level, we generate motion trajectories using Physics-Guided Rectified Flow Matching in a VAE latent space to improve real-robot executability, and further accelerate sampling via Reflow to reduce the number of function evaluations (NFE) for real-time control. The 3-Stage Safety Gate enables selective execution by detecting semantic OOD prompts using a Mahalanobis score in text-embedding space, filtering unstable generations via a directional sensitivity discrepancy metric, and enforcing final hard kinematic constraints such as joint and velocity limits before passing the generated trajectory to a low-level motion tracking controller. Extensive experiments on the Unitree G1 demonstrate that SafeFlow outperforms prior diffusion-based methods in success rate, physical compliance, and inference speed, while maintaining diverse expressiveness.
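摘要提到用文本嵌入空间中的马氏距离(Mahalanobis score)检测语义 OOD 提示。下面给出一个对角协方差的简化示意(嵌入维度、数据与数值均为虚构,并非 SafeFlow 的实际实现):

```python
# Diagonal-covariance Mahalanobis OOD score over prompt embeddings.
# Embeddings here are tiny made-up 2-D vectors, not real text embeddings.
import math

def fit_gaussian(embeddings):
    """Per-dimension mean and variance of in-distribution embeddings."""
    n, d = len(embeddings), len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / n for i in range(d)]
    var = [sum((e[i] - mean[i]) ** 2 for e in embeddings) / n + 1e-8
           for i in range(d)]
    return mean, var

def mahalanobis(x, mean, var):
    return math.sqrt(sum((xi - mi) ** 2 / vi
                         for xi, mi, vi in zip(x, mean, var)))

in_dist = [[0.9, 0.1], [1.0, 0.0], [1.1, -0.1], [0.95, 0.05]]
mean, var = fit_gaussian(in_dist)
score_id = mahalanobis([1.0, 0.0], mean, var)    # near the training cluster
score_ood = mahalanobis([-3.0, 5.0], mean, var)  # semantically far prompt
print(score_id < score_ood)
```

安全门的第一级即据此类分数设阈值:分数过大的提示被判为语义 OOD 而拒绝执行。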
[AI-32] Kirchhoff-Inspired Neural Networks for Evolving High-Order Perception
【速读】:该论文旨在解决传统深度学习网络在信息编码与传输机制上与生物神经系统本质差异的问题,特别是缺乏对信号强度、耦合结构和状态演化之间协同关系的系统性建模。现有架构主要通过调整神经元间连接权重来优化信息传递,而忽略了生物神经元依赖膜电位动态波动的特性。为此,作者提出基于基尔霍夫电流定律(Kirchhoff’s current law)构建的状态变量型神经网络(Kirchhoff-Inspired Neural Network, KINN),其关键在于从基本常微分方程出发推导出数值稳定的更新规则,从而在单层内显式解耦并编码高阶演化成分,同时保持物理一致性、可解释性和端到端可训练性。
链接: https://arxiv.org/abs/2603.23977
作者: Tongfei Chen,Jingying Yang,Linlin Yang,Jinhu Lü,David Doermann,Chunyu Xie,Long He,Tian Wang,Juan Zhang,Guodong Guo,Baochang Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Deep learning architectures are fundamentally inspired by neuroscience, particularly the structure of the brain’s sensory pathways, and have achieved remarkable success in learning informative data representations. Although these architectures mimic the communication mechanisms of biological neurons, their strategies for information encoding and transmission are fundamentally distinct. Biological systems depend on dynamic fluctuations in membrane potential; by contrast, conventional deep networks optimize weights and biases by adjusting the strengths of inter-neural connections, lacking a systematic mechanism to jointly characterize the interplay among signal intensity, coupling structure, and state evolution. To tackle this limitation, we propose the Kirchhoff-Inspired Neural Network (KINN), a state-variable-based network architecture constructed based on Kirchhoff’s current law. KINN derives numerically stable state updates from fundamental ordinary differential equations, enabling the explicit decoupling and encoding of higher-order evolutionary components within a single layer while preserving physical consistency, interpretability, and end-to-end trainability. Extensive experiments on partial differential equation (PDE) solving and ImageNet image classification validate that KINN outperforms state-of-the-art existing methods.
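KINN 从基尔霍夫电流定律(KCL)的常微分方程推导数值稳定的状态更新,可用显式欧拉法演示这一底层思想(电导与外部电流为虚构数值,不代表 KINN 的实际层结构):

```python
# One node obeying Kirchhoff's current law: C dv/dt = sum_j g_j (v_j - v) + i_ext,
# integrated with explicit Euler. All conductances/currents are made-up values.

def kcl_step(v, neighbors, g, i_ext, c=1.0, dt=0.01):
    """One Euler step of the KCL node ODE."""
    i_net = sum(gj * (vj - v) for gj, vj in zip(g, neighbors)) + i_ext
    return v + dt * i_net / c

v = 0.0
for _ in range(2000):
    v = kcl_step(v, neighbors=[1.0, -0.5], g=[0.3, 0.2], i_ext=0.1)
# Analytic steady state: 0.3*(1 - v) + 0.2*(-0.5 - v) + 0.1 = 0  =>  v = 0.6
print(round(v, 4))
```

节点电压收敛到电流平衡点,对应文中"信号强度、耦合结构与状态演化"联合建模的最小实例。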
[AI-33] Policy-Guided Threat Hunting: An LLM-enabled Framework with Splunk SOC Triage
【速读】:该论文旨在解决当前网络安全领域中Advanced Persistent Threats (APTs)不断演化导致传统安全防护手段失效,以及SOC(Security Operation Centers)分析师面临海量日志数据难以高效分析的问题。解决方案的关键在于提出了一种自动化且动态的威胁狩猎框架,其核心创新是将Agentic AI与Splunk SIEM平台深度融合,构建了一个从流量采集到异常评估的全流程模块化体系:包括基于重构的自编码器用于异常检测、两层深度强化学习(DRL)实现初始分流研判,以及大语言模型(LLM)进行上下文语义分析。该框架能根据SOC目标自主适应网络环境变化,并对可疑和恶意流量进行风险优先级排序,从而显著提升威胁识别准确率与运营效率。
链接: https://arxiv.org/abs/2603.23966
作者: Rishikesh Sahay,Bell Eapen,Weizhi Meng,Md Rasel Al Mamun,Nikhil Kumar Dora,Manjusha Sumasadan,Sumit Kumar Tetarave,Rod Soto
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:With frequently evolving Advanced Persistent Threats (APTs) in cyberspace, traditional security solutions have become inadequate for threat hunting in organizations. Moreover, SOC (Security Operation Center) analysts are often overwhelmed and struggle to analyze the huge volume of logs received from diverse devices in organizations. To address these challenges, we propose an automated and dynamic threat hunting framework for monitoring evolving threats, adapting to changing network conditions, and performing risk-based prioritization for the mitigation of suspicious and malicious traffic. By integrating Agentic AI with Splunk, an established SIEM platform, we developed a unique threat hunting framework. The framework systematically and seamlessly integrates different threat hunting modules together, ranging from traffic ingestion to anomaly assessment using a reconstruction-based autoencoder, deep reinforcement learning (DRL) with two layers for initial triage, and a large language model (LLM) for contextual analysis. We evaluated the framework against a publicly available benchmark dataset, as well as against a simulated dataset. The experimental results show that the framework can effectively adapt to different SOC objectives autonomously and identify suspicious and malicious traffic. The framework enhances operational effectiveness by supporting SOC analysts in their decision-making to block, allow, or monitor network traffic. This study thus enhances the cybersecurity and threat hunting literature by presenting a novel threat hunting framework for security decision-making, as well as promoting cumulative research efforts to develop more effective frameworks to battle continuously evolving cyber threats.
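基于重构的自编码器本身需要深度学习框架,这里仅示意其后的阈值判定步骤:重构误差超过良性流量误差分位数的流量被标记为异常(误差数据为虚构示例):

```python
# Percentile thresholding on reconstruction errors.
# The error values are toy numbers standing in for autoencoder outputs.

def flag_anomalies(errors_benign, errors_new, percentile=0.95):
    """Flag flows whose reconstruction error exceeds a benign-traffic percentile."""
    cutoff = sorted(errors_benign)[int(percentile * (len(errors_benign) - 1))]
    return [e > cutoff for e in errors_new], cutoff

benign = [0.10, 0.12, 0.09, 0.11, 0.13, 0.10, 0.12, 0.11, 0.10, 0.14, 0.09]
new = [0.11, 0.95, 0.12]
flags, cutoff = flag_anomalies(benign, new)
print(flags, cutoff)
```

被标记的流量随后才进入 DRL 分流与 LLM 上下文分析,未超阈值的流量直接放行,以控制分析负载。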
[AI-34] From Pixels to Digital Agents : An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)领域中环境演化路径不清晰、评估基准分散且缺乏系统性量化分析的问题。传统研究多依赖定性描述,难以揭示RL环境从孤立物理仿真向通用语义驱动代理演进的本质规律。其解决方案的关键在于构建一个大规模、数据驱动的实证研究框架:通过程序化处理海量学术文献并精炼出超过2000篇核心出版物,提出一种多维分类法(multi-dimensional taxonomy),对基准测试进行跨应用领域和认知能力要求的系统性分析,并利用自动化语义与统计方法识别出两大主导生态——“语义先验”(Semantic Prior)生态系统(以大语言模型 Large Language Models, LLMs 为核心)与“领域特定泛化”(Domain-Specific Generalization)生态系统。这一方法不仅验证了RL环境发展的范式转变,还刻画了不同领域的“认知指纹”(cognitive fingerprints),从而为下一代具身语义模拟器(Embodied Semantic Simulators)的设计提供了可量化的理论指导和技术路线。
链接: https://arxiv.org/abs/2603.23964
作者: Lijing Luo,Yiben Luo,Alexey Gorbatovski,Sergey Kovalchuk,Xiaodan Liang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 32 pages main text, 18 figures
Abstract:The remarkable progress of reinforcement learning (RL) is intrinsically tied to the environments used to train and evaluate artificial agents. Moving beyond traditional qualitative reviews, this work presents a large-scale, data-driven empirical investigation into the evolution of RL environments. By programmatically processing a massive corpus of academic literature and rigorously distilling over 2,000 core publications, we propose a quantitative methodology to map the transition from isolated physical simulations to generalist, language-driven foundation agents. Implementing a novel, multi-dimensional taxonomy, we systematically analyze benchmarks against diverse application domains and requisite cognitive capabilities. Our automated semantic and statistical analysis reveals a profound, data-verified paradigm shift: the bifurcation of the field into a “Semantic Prior” ecosystem dominated by Large Language Models (LLMs) and a “Domain-Specific Generalization” ecosystem. Furthermore, we characterize the “cognitive fingerprints” of these distinct domains to uncover the underlying mechanisms of cross-task synergy, multi-domain interference, and zero-shot generalization. Ultimately, this study offers a rigorous, quantitative roadmap for designing the next generation of Embodied Semantic Simulators, bridging the gap between continuous physical control and high-level logical reasoning.
[AI-35] Variable-Length Audio Fingerprinting
【速读】:该论文旨在解决现有深度学习音频指纹技术在处理固定长度音频片段时忽视时间动态性的问题,从而导致对变长音频或存在时序变化的音频识别效果受限。其解决方案的关键在于提出一种新型可变长度音频指纹方法(Variable-Length Audio FingerPrinting, VLAFP),该方法首次实现了训练与测试阶段均支持任意长度音频输入的端到端深度音频指纹建模,有效捕捉音频中的时序特征并提升在真实场景下的音频识别与检索性能。
链接: https://arxiv.org/abs/2603.23947
作者: Hongjie Chen,Hanyu Meng,Huimin Zeng,Ryan A. Rossi,Lie Lu,Josh Kimball
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:
Abstract:Audio fingerprinting converts audio to much lower-dimensional representations, allowing distorted recordings to still be recognized as their originals through similar fingerprints. Existing deep learning approaches rigidly fingerprint fixed-length audio segments, thereby neglecting temporal dynamics during segmentation. To address limitations due to this rigidity, we propose Variable-Length Audio FingerPrinting (VLAFP), a novel method that supports variable-length fingerprinting. To the best of our knowledge, VLAFP is the first deep audio fingerprinting model capable of processing audio of variable length, for both training and testing. Our experiments show that VLAFP outperforms existing state-of-the-arts in live audio identification and audio retrieval across three real-world datasets.
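音频指纹识别的核心检索步骤可以用二值指纹加汉明距离来示意(VLAFP 实际使用学习得到的实值嵌入,以下数据库与查询均为虚构):

```python
# Nearest-neighbor lookup over binary fingerprints by Hamming distance.
# Fingerprints and song names are invented placeholders.

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

database = {
    "song_a": [1, 0, 1, 1, 0, 0, 1, 0],
    "song_b": [0, 1, 0, 0, 1, 1, 0, 1],
}
# A distorted recording flips a couple of bits but stays closest to its origin.
query = [1, 0, 1, 0, 0, 0, 1, 0]
match = min(database, key=lambda k: hamming(database[k], query))
print(match)
```

指纹的意义正在于此:失真只引入少量比特翻转,原曲仍是最近邻。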
[AI-36] AnalogAgent: Self-Improving Analog Circuit Design Automation with LLM Agents
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的模拟电路设计自动化方法中存在的两大核心问题:一是单一模型循环在生成、诊断与修正过程中倾向于生成简洁摘要而非领域特异性洞察,二是上下文衰减(context attrition)导致关键技术细节丢失。为应对上述挑战,作者提出了一种无需训练的代理框架 AnalogAgent,其关键创新在于引入了一个集成多智能体系统(Multi-Agent System, MAS)与自进化记忆(Self-Evolving Memory, SEM)的架构,通过代码生成器、设计优化器和知识整理器(Knowledge Curator)三类代理协同工作,将执行反馈提炼为可适应的“操作手册”并存储于 SEM 中,从而实现跨任务迁移能力而无需额外专家标注、数据库或设计库支持。实验证明,AnalogAgent 在多个基准测试中显著提升了性能,尤其在轻量级模型(如 Qwen-8B)上实现了平均 Pass@1 提升达 48.8%,展现出对开放权重模型在高质量模拟电路设计自动化中的强大增强效果。
链接: https://arxiv.org/abs/2603.23910
作者: Zhixuan Bao,Zhuoyi Lin,Jiageng Wang,Jinhai Hu,Yuan Gao,Yaoxin Wu,Xiaoli Li,Xun Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16 pages, 6 figures
Abstract:Recent advances in large language models (LLMs) suggest strong potential for automating analog circuit design. Yet most LLM-based approaches rely on a single-model loop of generation, diagnosis, and correction, which favors succinct summaries over domain-specific insight and suffers from context attrition that erases critical technical details. To address these limitations, we propose AnalogAgent, a training-free agentic framework that integrates an LLM-based multi-agent system (MAS) with self-evolving memory (SEM) for analog circuit design automation. AnalogAgent coordinates a Code Generator, Design Optimizer, and Knowledge Curator to distill execution feedback into an adaptive playbook in SEM and retrieve targeted guidance for subsequent generation, enabling cross-task transfer without additional expert feedback, databases, or libraries. Across established benchmarks, AnalogAgent achieves 92% Pass@1 with Gemini and 97.4% Pass@1 with GPT-5. Moreover, with compact models (e.g., Qwen-8B), it yields a +48.8% average Pass@1 gain across tasks and reaches 72.1% Pass@1 overall, indicating that AnalogAgent substantially strengthens open-weight models for high-quality analog circuit design automation.
[AI-37] DUPLEX: Agentic Dual-System Planning via LLM-Driven Information Extraction
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在长时域机器人任务规划中因幻觉(hallucination)和逻辑不一致性导致的可靠性不足问题。其解决方案的关键在于提出一种代理式双系统神经符号架构(agentic dual-system neuro-symbolic architecture),即DUPLEX:将LLM严格限定在其擅长的任务——结构化语义 grounding(如从自然语言中提取实体、关系等信息并映射为规划领域定义语言,PDDL)——而非端到端规划或代码生成;同时,当符号规划器失败时,由一个高容量的慢速系统激活,基于求解器诊断驱动LLM进行迭代反思与修复,从而实现可靠且高效的计划合成。
链接: https://arxiv.org/abs/2603.23909
作者: Keru Hua,Ding Wang,Yaoying Gu,Xiaoguang Ma
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:While Large Language Models (LLMs) provide semantic flexibility for robotic task planning, their susceptibility to hallucination and logical inconsistency limits their reliability in long-horizon domains. To bridge the gap between unstructured environments and rigorous plan synthesis, we propose DUPLEX, an agentic dual-system neuro-symbolic architecture that strictly confines the LLM to schema-guided information extraction rather than end-to-end planning or code generation. In our framework, a feed-forward Fast System utilizes a lightweight LLM to extract entities, relations, etc. from natural language, deterministically mapping them into a Planning Domain Definition Language (PDDL) problem file for a classical symbolic planner. To resolve complex or underspecified scenarios, a Slow System is activated exclusively upon planning failure, leveraging solver diagnostics to drive a high-capacity LLM in iterative reflection and repair. Extensive evaluations across 12 classical and household planning domains demonstrate that DUPLEX significantly outperforms existing end-to-end and hybrid LLM baselines in both success rate and reliability. These results confirm that the key is not to make the LLM plan better, but to restrict the LLM to the part it is good at - structured semantic grounding - and leave logical plan synthesis to a symbolic planner.
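Fast System 的确定性映射步骤(把抽取出的实体与关系渲染为 PDDL 问题文件)可以用如下草图示意,其中谓词与对象名为虚构,硬编码字典代替了 LLM 的模式化抽取输出:

```python
# Render extracted entities/relations into a PDDL problem file.
# Predicate and object names below are invented for illustration.

def to_pddl_problem(name, domain, objects, init, goal):
    objs = "\n    ".join(f"{o} - {t}" for o, t in objects)
    init_s = "\n    ".join(f"({p})" for p in init)
    goal_s = "\n    ".join(f"({p})" for p in goal)
    return (f"(define (problem {name}) (:domain {domain})\n"
            f"  (:objects\n    {objs})\n"
            f"  (:init\n    {init_s})\n"
            f"  (:goal (and\n    {goal_s})))\n")

extracted = {
    "objects": [("cup", "item"), ("table", "location"), ("sink", "location")],
    "init": ["at cup table", "handempty"],
    "goal": ["at cup sink"],
}
problem = to_pddl_problem("move-cup", "household", **extracted)
print(problem)
```

生成的问题文件可直接交给 Fast Downward 等经典规划器;求解失败时的诊断信息才会触发 Slow System。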
[AI-38] Agent Chemist: A Multi-Agent Experimental Robotic Platform Integrating Chemical Perception and Precise Control
【速读】:该论文旨在解决化学实验室自动化中长期存在的问题:现有自动化系统受限于刚性工作流程,难以适应实验任务的长尾分布(long-tail distribution of experimental tasks),即现实中大量非标准化、低频且不断演化的操作无法被预设协议覆盖,导致系统在面对新型反应条件、非常规仪器配置或意外程序变化时缺乏泛化能力。解决方案的关键在于构建一个基于多智能体(multi-agent)的机器人平台,通过协作式任务分解、动态调度与自适应控制实现灵活执行;同时融合化学感知(chemical perception)以实现实时反应监测,并采用反馈驱动的执行机制,使系统能够依据实验状态的变化调整动作而非依赖固定脚本,从而提升在多样化实验室场景下的通用性和可靠性。
链接: https://arxiv.org/abs/2603.23886
作者: Xiangyi Wei,Fei Wang,Haotian Zhang,Xin An,Haitian Zhu,Lianrui Hu,Yang Li,Changbo Wang,Xiao He
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Chemical laboratory automation has long been constrained by rigid workflows and poor adaptability to the long-tail distribution of experimental tasks. While most automated platforms perform well on a narrow set of standardized procedures, real laboratories involve diverse, infrequent, and evolving operations that fall outside predefined protocols. This mismatch prevents existing systems from generalizing to novel reaction conditions, uncommon instrument configurations, and unexpected procedural variations. We present a multi-agent robotic platform designed to address this long-tail challenge through collaborative task decomposition, dynamic scheduling, and adaptive control. The system integrates chemical perception for real-time reaction monitoring with feedback-driven execution, enabling it to adjust actions based on evolving experimental states rather than fixed scripts. Validation via acid-base titration demonstrates autonomous progress tracking, adaptive dispensing control, and reliable end-to-end experiment execution. By improving generalization across diverse laboratory scenarios, this platform provides a practical pathway toward intelligent, flexible, and scalable laboratory automation.
[AI-39] The Luna Bound Propagator for Formal Analysis of Neural Networks
【速读】:该论文旨在解决当前参数化CROWN分析(alpha-CROWN)仅限于Python实现所带来的集成困难问题,这限制了其在现有深度神经网络(DNN)验证工具及生产级系统中的应用。解决方案的关键在于提出Luna,一个用C++实现的新一代边界传播器(bound propagator),它不仅支持区间边界传播(Interval Bound Propagation)、CROWN分析和alpha-CROWN分析,还能在通用计算图上运行,从而显著提升与现有验证框架的兼容性与执行效率。实验表明,Luna在VNN-COMP 2025基准测试中,在边界紧致性和计算效率方面均达到或接近当前最先进的alpha-CROWN实现水平。
链接: https://arxiv.org/abs/2603.23878
作者: Henry LeCates,Haoze Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注: 13 pages, 2 figures
Abstract:The parameterized CROWN analysis, a.k.a., alpha-CROWN, has emerged as a practically successful bound propagation method for neural network verification. However, existing implementations of alpha-CROWN are limited to Python, which complicates integration into existing DNN verifiers and long-term production-level systems. We introduce Luna, a new bound propagator implemented in C++. Luna supports Interval Bound Propagation, the CROWN analysis, and the alpha-CROWN analysis over a general computational graph. We describe the architecture of Luna and show that it is competitive with the state-of-the-art alpha-CROWN implementation in terms of both bound tightness and computational efficiency on benchmarks from VNN-COMP 2025.
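Luna 支持的三种分析中最简单的是区间边界传播(IBP)。下面用纯 Python 演示区间通过一个仿射层加 ReLU 的传播(权重是虚构的 2×2 示例;Luna 本身是 C++ 实现):

```python
# Interval Bound Propagation through y = Wx + b followed by ReLU.
# The weight matrix and input box are made-up toy values.

def ibp_affine(lo, hi, W, b):
    """Propagate elementwise bounds through an affine layer."""
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        l = bias + sum(w * (lo[j] if w >= 0 else hi[j]) for j, w in enumerate(row))
        h = bias + sum(w * (hi[j] if w >= 0 else lo[j]) for j, w in enumerate(row))
        out_lo.append(l)
        out_hi.append(h)
    return out_lo, out_hi

def ibp_relu(lo, hi):
    return [max(0.0, l) for l in lo], [max(0.0, h) for h in hi]

W = [[1.0, -2.0], [0.5, 1.0]]
b = [0.0, -1.0]
lo, hi = ibp_affine([-1.0, -1.0], [1.0, 1.0], W, b)
lo, hi = ibp_relu(lo, hi)
print(lo, hi)
```

IBP 只按权重符号取区间端点,因此最快但最松;CROWN 与 alpha-CROWN 用线性松弛换取更紧的边界。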
[AI-40] The DeepXube Software Package for Solving Pathfinding Problems with Learned Heuristic Functions and Search
【速读】:该论文旨在解决路径规划问题中启发式搜索算法依赖人工设计启发函数(heuristic function)的局限性,从而提升自动化程度与求解效率。其核心解决方案是通过深度强化学习自动学习可指导启发式搜索的神经网络启发函数,关键创新包括:基于有限时域贝尔曼方程的学习方法、回溯经验回放(hindsight experience replay)、批处理启发式搜索策略,以及利用答案集编程(answer-set programming)灵活定义目标状态。此外,系统通过多继承结构简化领域建模和训练数据生成,并借助CPU并行化生成训练数据、GPU加速强化学习更新,实现高效训练;同时支持多种基于GPU并行性的路径规划算法(如批量加权A*、Q*搜索和束搜索),显著提升了求解性能与实用性。
链接: https://arxiv.org/abs/2603.23873
作者: Forest Agostinelli
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:DeepXube is a free and open-source Python package and command-line tool that seeks to automate the solution of pathfinding problems by using machine learning to learn heuristic functions that guide heuristic search algorithms tailored to deep neural networks (DNNs). DeepXube is comprised of the latest advances in deep reinforcement learning, heuristic search, and formal logic for solving pathfinding problems. This includes limited-horizon Bellman-based learning, hindsight experience replay, batched heuristic search, and specifying goals with answer-set programming. A robust multiple-inheritance structure simplifies the definition of pathfinding domains and the generation of training data. Training heuristic functions is made efficient through the automatic parallelization of the generation of training data across central processing units (CPUs) and reinforcement learning updates across graphics processing units (GPUs). Pathfinding algorithms that take advantage of the parallelism of GPUs and DNN architectures, such as batch weighted A* and Q* search and beam search are easily employed to solve pathfinding problems through command-line arguments. Finally, several convenient features for visualization, code profiling, and progress monitoring during training and solving are available. The GitHub repository is publicly available at this https URL.
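DeepXube 使用的加权 A*(f = g + w·h)可以在一个极小的图上示意(此处的图与启发式字典是手工构造的占位,实际中 h 来自训练好的 DNN 并按批评估):

```python
# Weighted A* with f = g + w * h on a tiny hand-made graph.
# The graph and heuristic table stand in for a real domain and a learned DNN.
import heapq

def weighted_astar(start, goal, neighbors, h, w=1.5):
    frontier = [(w * h(start), 0.0, start, [start])]
    best_g = {start: 0.0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        for nxt, cost in neighbors(node):
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(frontier, (g2 + w * h(nxt), g2, nxt, path + [nxt]))
    return None

graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 1), ("D", 5)],
         "C": [("D", 1)], "D": []}
h = {"A": 2, "B": 2, "C": 1, "D": 0}.get
path = weighted_astar("A", "D", lambda n: graph[n], h)
print(path)
```

权重 w 越大越偏向贪心、扩展节点越少;DeepXube 进一步把多个待评估状态合批送入 GPU 上的启发式网络。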
[AI-41] HDPO: Hybrid Distillation Policy Optimization via Privileged Self-Distillation
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在训练大语言模型进行数学推理时遇到的核心难题:对于模型完全无法解决的“悬崖提示”(cliff prompts),RL梯度会完全消失,导致无法传递任何学习信号以改进这些失败模式。解决方案的关键在于提出混合蒸馏策略优化(Hybrid Distillation Policy Optimization, HDPO),其核心机制是在每轮训练中识别出所有采样路径均失败的悬崖提示,通过提供真实答案生成特权轨迹(privileged rollouts),筛选出正确解后将教师模型(即同一模型但输入包含真值信息)的token级分布蒸馏到学生模型中。由于教师与学生共享权重、仅在输入上不同,可实现性差距(realizability gap)可被严格界定,优于跨模型蒸馏;理论证明表明,在硬阈值极限下,R=1过滤的特权生成能恢复最优KL正则化RL策略。实验表明,HDPO在保持贪婪准确率的同时显著提升覆盖率指标(pass@4提升0.8–1.1%,pass@8提升0.4–1.7%),且蒸馏权重λ可直接调控探索-利用权衡。
链接: https://arxiv.org/abs/2603.23871
作者: Ken Ding
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models trained with reinforcement learning (RL) for mathematical reasoning face a fundamental challenge: on problems the model cannot solve at all - “cliff” prompts - the RL gradient vanishes entirely, preventing any learning signal from reaching these failure modes. We introduce Hybrid Distillation Policy Optimization (HDPO), which augments standard RL with privileged self-distillation targeting cliff prompts. On each training step, HDPO identifies prompts where all rollouts fail, generates privileged rollouts by providing the model with ground-truth information, filters for correct solutions, and distills the teacher’s token-level distribution into the student. Because teacher and student share the same weights - differing only in their input - the realizability gap is provably bounded, unlike cross-model distillation. We prove that R=1 filtered privileged generation recovers the optimal KL-regularized RL policy in the hard-threshold limit. Experiments on OpenMathInstruct-2 with Qwen2.5-Math-1.5B-Instruct show that HDPO consistently improves coverage metrics (pass@4 by +0.8-1.1%, pass@8 by +0.4-1.7%) while maintaining greedy accuracy, with the distillation weight lambda providing direct control over the exploration-exploitation tradeoff.
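论文声称在硬阈值极限下,R=1 过滤等价于最优 KL 正则化策略 π*(a) ∝ π_ref(a)·exp(R(a)/β)。下面用一个虚构的三动作分布做数值验证(β→0 时两者应重合):

```python
# Numeric check: KL-regularized optimal policy vs. R=1 filtering,
# for a binary reward. The three-action reference distribution is made up.
import math

def kl_optimal(pi_ref, R, beta):
    """pi*(a) proportional to pi_ref(a) * exp(R(a) / beta)."""
    w = [p * math.exp(r / beta) for p, r in zip(pi_ref, R)]
    z = sum(w)
    return [x / z for x in w]

pi_ref = [0.5, 0.3, 0.2]
R = [0, 1, 1]  # only the last two sampled answers are correct
pi_small_beta = kl_optimal(pi_ref, R, beta=0.01)

# Filtered baseline: renormalize pi_ref over the R=1 support.
mass = sum(p for p, r in zip(pi_ref, R) if r == 1)
pi_filtered = [p / mass if r == 1 else 0.0 for p, r in zip(pi_ref, R)]
print([round(x, 4) for x in pi_small_beta], pi_filtered)
```

β 很小时,错误答案的权重被 exp(1/β) 因子压到可忽略,KL 最优策略收敛到按参考分布在正确解上重归一化的过滤分布。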
[AI-42] Deep Convolutional Neural Networks for predicting highest priority functional group in organic molecules
【速读】:该论文旨在解决有机分子中最高优先级功能基团(Functional Group)的预测问题,该基团决定了化合物的主要物理和化学性质。解决方案的关键在于利用傅里叶变换红外光谱(Fourier-transform Infrared spectroscopy, FTIR)作为输入特征,并采用深度卷积神经网络(Deep Convolutional Neural Networks, CNN)进行建模,相较于传统机器学习方法如支持向量机(Support Vector Machine, SVM),CNN能够更有效地从FTIR光谱中提取局部特征并自动学习高级抽象表示,从而实现更高精度的功能基团识别。
链接: https://arxiv.org/abs/2603.23862
作者: Kunal Khatri,Vineet Mehta
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Our work addresses the problem of predicting the highest priority functional group present in an organic molecule. Functional Groups are groups of bound atoms that determine the physical and chemical properties of organic molecules. In the presence of multiple functional groups, the dominant functional group determines the compound’s properties. Fourier-transform Infrared spectroscopy (FTIR) is a commonly used spectroscopic method for identifying the presence or absence of functional groups within a compound. We propose the use of a Deep Convolutional Neural Networks (CNN) to predict the highest priority functional group from the Fourier-transform infrared spectrum (FTIR) of the organic molecule. We have compared our model with other previously applied Machine Learning (ML) method Support Vector Machine (SVM) and reasoned why CNN outperforms it.
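对 FTIR 光谱做卷积即一维卷积:下面用纯 Python 演示单个手工滤波器从玩具光谱中提取局部吸收峰特征(论文中 CNN 的滤波器是学习得到的,此处核与光谱均为示意):

```python
# One hand-made 1-D convolution filter over a toy absorption spectrum.
# A real FTIR CNN learns many such filters; this only illustrates the idea.

def conv1d(signal, kernel):
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def relu(xs):
    return [max(0.0, x) for x in xs]

spectrum = [0.0, 0.1, 0.0, 0.9, 1.0, 0.9, 0.0, 0.1, 0.0]  # toy absorption peak
peak_kernel = [-1.0, 2.0, -1.0]  # responds to local maxima
feature_map = relu(conv1d(spectrum, peak_kernel))
peak_index = max(range(len(feature_map)), key=feature_map.__getitem__)
print(peak_index, feature_map[peak_index])
```

这种对局部峰形的响应,正是 CNN 相比 SVM 等方法更适合光谱数据的直观原因:特征由卷积核自动学习而非人工设计。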
[AI-43] Why the Maximum Second Derivative of Activations Matters for Adversarial Robustness
【速读】:该论文旨在解决激活函数曲率(activation function curvature)对模型对抗鲁棒性(adversarial robustness)影响机制不明确的问题。解决方案的关键在于提出递归可调曲率激活族(Recursive Curvature-Tunable Activation Family, RCT-AF),通过参数 α 和 β 精确调控激活函数的二阶导数最大值 max∣σ′′∣,从而系统性地揭示其与对抗鲁棒性的非单调关系:当 max∣σ′′∣ 在 4 到 10 范围内时,模型获得最优鲁棒性,这源于该区间能最小化损失函数的归一化 Hessian 对角线范数(normalized Hessian diagonal norm),避免极小值过尖锐,从而提升鲁棒泛化能力。
链接: https://arxiv.org/abs/2603.23860
作者: Yunrui Yu,Hang Su,Jun Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:This work investigates the critical role of activation function curvature – quantified by the maximum second derivative max|σ′′| – in adversarial robustness. Using the Recursive Curvature-Tunable Activation Family (RCT-AF), which enables precise control over curvature through parameters α and β, we systematically analyze this relationship. Our study reveals a fundamental trade-off: insufficient curvature limits model expressivity, while excessive curvature amplifies the normalized Hessian diagonal norm of the loss, leading to sharper minima that hinder robust generalization. This results in a non-monotonic relationship where optimal adversarial robustness consistently occurs when max|σ′′| falls within 4 to 10, a finding that holds across diverse network architectures, datasets, and adversarial training methods. We provide theoretical insights into how activation curvature affects the diagonal elements of the Hessian matrix of the loss, and experimentally demonstrate that the normalized Hessian diagonal norm exhibits a U-shaped dependence on max|σ′′|, with its minimum within the optimal robustness range, thereby validating the proposed mechanism.
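论文未给出 RCT-AF 的具体形式,这里改用一个通用的曲率可调激活 σ_k(x) = tanh(kx)/k 作示意:其二阶导数最大值约为 0.77·k(因 max|tanh′′| = 4/(3√3) ≈ 0.77),可用有限差分数值验证 max|σ′′| 随 k 线性增长:

```python
# Estimate max|sigma''| of a curvature-tunable activation by finite differences.
# sigma_k(x) = tanh(k*x)/k is a stand-in, NOT the paper's RCT-AF family.
import math

def max_abs_second_derivative(f, lo=-5.0, hi=5.0, n=20000, eps=1e-4):
    best = 0.0
    for i in range(n + 1):
        x = lo + (hi - lo) * i / n
        d2 = (f(x + eps) - 2.0 * f(x) + f(x - eps)) / eps**2  # central difference
        best = max(best, abs(d2))
    return best

def make_activation(k):
    return lambda x: math.tanh(k * x) / k

c2 = max_abs_second_derivative(make_activation(2.0))
c8 = max_abs_second_derivative(make_activation(8.0))
print(round(c2, 3), round(c8, 3))
```

按论文结论,调节 k 使 max|σ′′| 落在 4 到 10 区间(此例约需 k 在 5 到 13 之间)即对应最优对抗鲁棒性区域。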
[AI-44] When AI output tips to bad but nobody notices: Legal implications of AI's mistakes
【速读】:该论文旨在解决生成式 AI(Generative AI)在法律行业应用中因“幻觉”(hallucination)导致的虚构法律条文、判例和司法裁决被误用的问题,这可能引发律师的职业惩戒、执业过失责任及司法程序完整性面临的系统性风险。解决方案的关键在于揭示此类幻觉并非随机错误,而是由 Transformer 架构内部状态跨越可计算阈值所引发的确定性失效模式——即当模型输出从可靠推理跃迁至权威性伪造时,存在可预测的物理机制基础。因此,论文主张法律从业者、法院与监管机构应摒弃将生成式 AI 视为“黑箱”的传统认知,转而建立基于其实际失效机理的验证协议,以实现对技术使用的负责任管理。
链接: https://arxiv.org/abs/2603.23857
作者: Dylan J. Restrepo,Nicholas J. Restrepo,Frank Y. Huo,Neil F. Johnson
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI); Chaotic Dynamics (nlin.CD); Physics and Society (physics.soc-ph)
备注:
Abstract:The adoption of generative AI across commercial and legal professions offers dramatic efficiency gains – yet for law in particular, it introduces a perilous failure mode in which the AI fabricates fictitious case law, statutes, and judicial holdings that appear entirely authentic. Attorneys who unknowingly file such fabrications face professional sanctions, malpractice exposure, and reputational harm, while courts confront a novel threat to the integrity of the adversarial process. This failure mode is commonly dismissed as 'random hallucination', but recent physics-based analysis of the Transformer's core mechanism reveals a deterministic component: the AI's internal state can cross a calculable threshold, causing its output to flip from reliable legal reasoning to authoritative-sounding fabrication. Here we present this science in a legal-industry setting, walking through a simulated brief-drafting scenario. Our analysis suggests that fabrication risk is not an anomalous glitch but a foreseeable consequence of the technology's design, with direct implications for the evolving duty of technological competence. We propose that legal professionals, courts, and regulators replace the outdated 'black box' mental model with verification protocols based on how these systems actually fail.
[AI-45] Learning-guided Prioritized Planning for Lifelong Multi-Agent Path Finding in Warehouse Automation
【速读】:该论文旨在解决长期多智能体路径规划(Lifelong Multi-Agent Path Finding, Lifelong MAPF)在现代仓库自动化中的挑战,即如何在复杂动态环境中实现高效、无冲突的机器人路径规划以提升系统整体吞吐量。传统基于搜索的求解器难以适应长时间尺度下的环境变化和动态需求,而现有机器学习方法尚未展现出显著优势。解决方案的关键在于提出一种名为强化学习引导的滚动时域优先规划(Reinforcement Learning guided Rolling Horizon Prioritized Planning, RL-RH-PP)的新框架,其核心创新是将强化学习(Reinforcement Learning, RL)与经典优先规划(Prioritized Planning, PP)相结合:通过将动态优先级分配建模为部分可观测马尔可夫决策过程(Partially Observable Markov Decision Process, POMDP),利用RL自动学习最优优先级策略,并借助注意力机制驱动的神经网络实时生成优先级序列,从而在保持PP简单性和灵活性的同时,有效处理多智能体间的时空交互关系,显著提升系统吞吐量并具备良好的泛化能力。
链接: https://arxiv.org/abs/2603.23838
作者: Han Zheng,Yining Ma,Brandon Araki,Jingkai Chen,Cathy Wu
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:Lifelong Multi-Agent Path Finding (MAPF) is critical for modern warehouse automation, which requires multiple robots to continuously navigate conflict-free paths to optimize the overall system throughput. However, the complexity of warehouse environments and the long-term dynamics of lifelong MAPF often demand costly adaptations to classical search-based solvers. While machine learning methods have been explored, their superiority over search-based methods remains inconclusive. In this paper, we introduce Reinforcement Learning (RL) guided Rolling Horizon Prioritized Planning (RL-RH-PP), the first framework integrating RL with search-based planning for lifelong MAPF. Specifically, we leverage classical Prioritized Planning (PP) as a backbone for its simplicity and flexibility in integrating with a learning-based priority assignment policy. By formulating dynamic priority assignment as a Partially Observable Markov Decision Process (POMDP), RL-RH-PP exploits the sequential decision-making nature of lifelong planning while delegating complex spatial-temporal interactions among agents to reinforcement learning. An attention-based neural network autoregressively decodes priority orders on-the-fly, enabling efficient sequential single-agent planning by the PP planner. Evaluations in realistic warehouse simulations show that RL-RH-PP achieves the highest total throughput among baselines and generalizes effectively across agent densities, planning horizons, and warehouse layouts. Our interpretive analyses reveal that RL-RH-PP proactively prioritizes congested agents and strategically redirects agents from congestion, easing traffic flow and boosting throughput. These findings highlight the potential of learning-guided approaches to augment traditional heuristics in modern warehouse automation.
[AI-46] Circuit Complexity of Hierarchical Knowledge Tracing and Implications for Log-Precision Transformers
【速读】:该论文旨在解决深度概念层级结构中前提关系(prerequisite)传播的计算复杂性问题,特别是在基于Transformer架构的知识追踪模型(knowledge tracing models)中如何有效建模和推理这种层次化知识掌握状态。其关键解决方案在于:首先通过电路复杂度理论形式化了递归多数(recursive-majority)前提传播任务,并证明其位于 NC1 类,且在无限制条件下难以被统一 TC0 模型刻画;其次,在单调性约束下揭示了交替全量/存在型前提树可导致单调阈值电路的严格深度层次差异,从而建立了理论边界;最后,实验发现Transformer编码器在训练过程中会收敛到对称性不变的捷径(permutation-invariant shortcuts),而引入中间子树的辅助监督信号可激发结构依赖计算,显著提升在深度3–4时的准确率,这为设计结构感知的目标函数与迭代机制以实现深层前提敏感的知识追踪提供了实证依据与理论支撑。
链接: https://arxiv.org/abs/2603.23823
作者: Naiming Liu,Richard Baraniuk,Shashank Sonkar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Knowledge tracing models mastery over interconnected concepts, often organized by prerequisites. We analyze hierarchical prerequisite propagation through a circuit-complexity lens to clarify what is provable about transformer-style computation on deep concept hierarchies. Using recent results that log-precision transformers lie in logspace-uniform \mathsf{TC}^0 , we formalize prerequisite-tree tasks including recursive-majority mastery propagation. Unconditionally, recursive-majority propagation lies in \mathsf{NC}^1 via O(\log n) -depth bounded-fanin circuits, while separating it from uniform \mathsf{TC}^0 would require major progress on open lower bounds. Under a monotonicity restriction, we obtain an unconditional barrier: alternating ALL/ANY prerequisite trees yield a strict depth hierarchy for \emph{monotone} threshold circuits. Empirically, transformer encoders trained on recursive-majority trees converge to permutation-invariant shortcuts; explicit structure alone does not prevent this, but auxiliary supervision on intermediate subtrees elicits structure-dependent computation and achieves near-perfect accuracy at depths 3–4. These findings motivate structure-aware objectives and iterative mechanisms for prerequisite-sensitive knowledge tracing on deep hierarchies.
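摘要中的 recursive-majority 前提传播任务本身很小巧:在一棵树上自底向上逐层取多数。下面是一个纯 Python 草图(假设扇入为 3,即 majority-of-3;论文的具体树结构以原文为准),叶子为基础概念的掌握状态(0/1),根节点给出高层概念是否"已掌握"。树深为 log₃(n),对应摘要中 O(log n) 深度、有界扇入电路的 NC¹ 上界:

```python
# 示意性草图:三叉树上的递归多数(majority-of-3)前提传播。
# 扇入取 3 仅为示例假设。

def recursive_majority(leaves):
    """leaves 长度须为 3 的幂;逐层三个一组取多数,直到只剩根节点。"""
    level = list(leaves)
    while len(level) > 1:
        level = [1 if a + b + c >= 2 else 0
                 for a, b, c in zip(level[0::3], level[1::3], level[2::3])]
    return level[0]

# 深度 2 的例子:9 个叶子,三组多数分别为 1, 0, 1,根节点取多数得 1
root = recursive_majority([1, 1, 0, 0, 0, 1, 1, 0, 1])
```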
[AI-47] Willful Disobedience: Automatically Detecting Failures in Agentic Traces
【速读】:该论文旨在解决当前AI代理(AI agent)在真实软件系统中执行多步工作流时,因长周期执行轨迹(agentic traces)导致的验证难题。传统仅基于结果的评估基准无法捕捉关键的过程性失败,如错误的工作流路由、不安全的工具调用或违反提示规则的行为。解决方案的关键在于提出AgentPex——一个基于AI的自动化评估工具,它能从代理提示和系统指令中提取行为规则,并以此规范自动检测执行轨迹的合规性,从而实现对代理行为的细粒度分析与可解释评估。
链接: https://arxiv.org/abs/2603.23806
作者: Reshabh K Sharma,Shraddha Barke,Benjamin Zorn
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:AI agents are increasingly embedded in real software systems, where they execute multi-step workflows through multi-turn dialogue, tool invocations, and intermediate decisions. These long execution histories, called agentic traces, make validation difficult. Outcome-only benchmarks can miss critical procedural failures, such as incorrect workflow routing, unsafe tool usage, or violations of prompt-specified rules. This paper presents AgentPex, an AI-powered tool designed to systematically evaluate agentic traces. AgentPex extracts behavioral rules from agent prompts and system instructions, then uses these specifications to automatically evaluate traces for compliance. We evaluate AgentPex on 424 traces from \tau^2-bench across models in telecom, retail, and airline customer service. Our results show that AgentPex distinguishes agent behavior across models and surfaces specification violations that are not captured by outcome-only scoring. It also provides fine-grained analysis by domain and metric, enabling developers to understand agent strengths and weaknesses at scale.
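"从提示中提取行为规则、再对执行轨迹做合规检查"这一思路,可以用一个最小的轨迹检查器来示意。AgentPex 的具体规则格式未公开,下面假设规则为"作用于事件序列的谓词",规则名与工具名均为假设示例:

```python
# 示意性草图:对 agentic trace(事件字典列表)逐条检查行为规则,
# 返回违规列表。规则与工具名均为假设,并非 AgentPex 的真实接口。

def check_trace(trace, rules):
    """rules: [(规则名, predicate(event, history) -> bool)];返回 [(规则名, 违规事件下标)]。"""
    violations = []
    for name, predicate in rules:
        for i, event in enumerate(trace):
            if not predicate(event, trace[:i]):
                violations.append((name, i))
    return violations

# 假设的规则:调用退款工具前,历史中必须出现过身份验证事件
def refund_requires_auth(event, history):
    if event.get("tool") == "issue_refund":
        return any(e.get("tool") == "verify_identity" for e in history)
    return True

trace = [{"tool": "lookup_order"}, {"tool": "issue_refund"}]
found = check_trace(trace, [("refund_requires_auth", refund_requires_auth)])
```

这种"结果之外的过程性检查"正是摘要所称 outcome-only 评分遗漏的部分。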
[AI-48] Deep Neural Regression Collapse
【速读】:该论文旨在解决深度神经网络在回归任务中学习到的结构是否具有类似分类任务中“神经坍缩”(Neural Collapse)现象的问题,尤其是这种结构是否不仅限于最后一层,而是贯穿整个网络深层。解决方案的关键在于系统性地证明了深度神经回归坍缩(Deep Neural Regression Collapse, Deep NRC)的存在:即在回归模型的多个隐藏层中,特征分布在目标维度对应的子空间内,特征协方差与目标协方差对齐,层权重的输入子空间与特征子空间一致,且特征的线性预测误差接近模型整体预测误差。此外,论文进一步揭示了具备Deep NRC的模型能够自动学习低秩目标的内在维度,并验证了权重衰减(weight decay)在诱导该结构中的必要性,从而为理解深度网络在回归场景下学习的简洁结构提供了更完整的理论框架。
链接: https://arxiv.org/abs/2603.23805
作者: Akshay Rangamani,Altay Unal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
备注: Accepted to CPAL 2026; Code will be available at this https URL
Abstract:Neural Collapse is a phenomenon that helps identify sparse and low rank structures in deep classifiers. Recent work has extended the definition of neural collapse to regression problems, albeit only measuring the phenomenon at the last layer. In this paper, we establish that Neural Regression Collapse (NRC) also occurs below the last layer across different types of models. We show that in the collapsed layers of neural regression models, features lie in a subspace that corresponds to the target dimension, the feature covariance aligns with the target covariance, the input subspace of the layer weights aligns with the feature subspace, and the linear prediction error of the features is close to the overall prediction error of the model. In addition to establishing Deep NRC, we also show that models that exhibit Deep NRC learn the intrinsic dimension of low rank targets and explore the necessity of weight decay in inducing Deep NRC. This paper provides a more complete picture of the simple structure learned by deep networks in the context of regression.
[AI-49] Object Search in Partially-Known Environments via LLM-informed Model-based Planning and Prompt Selection
【速读】:该论文旨在解决在部分已知环境中的目标物体搜索问题,传统方法往往依赖纯生成式 AI(Generative AI)或启发式策略,难以在复杂场景中实现高效且鲁棒的搜索性能。其解决方案的关键在于提出一种由大语言模型(Large Language Model, LLM)指导的模型化规划框架,利用LLM对不同位置发现目标物体的概率进行估计,并结合环境地图提取的移动成本构建可执行的搜索模型,从而将LLM的知识有效融入规划过程。此外,该方法通过引入基于离线回放的模型选择机制,在部署阶段实现快速提示词(prompt)和LLM的选择优化,显著降低平均代价与累积遗憾(cumulative regret),实验表明该方案在仿真与真实机器人环境中均优于基线策略。
链接: https://arxiv.org/abs/2603.23800
作者: Abhishek Paudel,Abhish Khanal,Raihan I. Arnob,Shahriar Hossain,Gregory J. Stein
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 17 pages, 9 figures
Abstract:We present a novel LLM-informed model-based planning framework, and a novel prompt selection method, for object search in partially-known environments. Our approach uses an LLM to estimate statistics about the likelihood of finding the target object when searching various locations throughout the scene that, combined with travel costs extracted from the environment map, are used to instantiate a model, thus using the LLM to inform planning and achieve effective search performance. Moreover, the abstraction upon which our approach relies is amenable to deployment-time model selection via the recent offline replay approach, an insight we leverage to enable fast prompt and LLM selection during deployment. Simulation experiments demonstrate that our LLM-informed model-based planning approach outperforms the baseline planning strategy that fully relies on LLM and optimistic strategy with as much as 11.8% and 39.2% improvements respectively, and our bandit-like selection approach enables quick selection of best prompts and LLMs resulting in 6.5% lower average cost and 33.8% lower average cumulative regret over baseline UCB bandit selection. Real-robot experiments in an apartment demonstrate similar improvements and so further validate our approach.
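"用 LLM 估计各位置找到目标的概率,结合地图移动代价实例化搜索模型"可以用一个很小的期望代价计算来示意:对给定的访问顺序,期望代价为各位置代价乘以"之前都没找到"的概率之和。下面的概率与代价均为假设示例,并非论文的具体模型:

```python
# 示意性草图:给定 LLM 估计的发现概率 p_i 与移动代价 c_i,
# 计算一个搜索顺序的期望代价,并在若干候选顺序中选最优。
# 位置名与数值均为假设;各位置发现事件近似视为互斥。

def expected_cost(order, p, c):
    total, miss = 0.0, 1.0
    for i in order:
        total += miss * c[i]       # 只有此前都没找到时才会支付 c_i
        miss *= (1.0 - p[i])
    return total

p = {"kitchen": 0.6, "bedroom": 0.3, "garage": 0.1}
c = {"kitchen": 5.0, "bedroom": 4.0, "garage": 20.0}

plan = min(
    [["kitchen", "bedroom", "garage"],
     ["bedroom", "kitchen", "garage"],
     ["garage", "kitchen", "bedroom"]],
    key=lambda o: expected_cost(o, p, c),
)
```

这解释了为何摘要中"完全依赖 LLM"或一味乐观的策略会次优:概率高但代价大的位置未必应当先访问。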
[AI-50] The Cognitive Firewall: Securing Browser-Based AI Agents Against Indirect Prompt Injection via Hybrid Edge-Cloud Defense
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)作为自主浏览器代理部署时面临的间接提示注入(Indirect Prompt Injection, IPI)攻击问题,此类攻击通过网页内容诱导模型执行恶意行为。传统云端防御虽具备强大的语义分析能力,但存在延迟高和隐私泄露风险。其解决方案的关键在于提出一种三阶段的分计算架构——认知防火墙(Cognitive Firewall),该架构将安全检查任务在客户端与云端之间进行分配:本地视觉哨兵(Sentinel)负责过滤展示层攻击,云端深度规划器(Deep Planner)执行复杂语义分析,而确定性守卫(Guard)则在执行边界强制实施策略约束。这种混合机制显著降低攻击成功率至1%以下,同时实现对副作用操作的确定性控制,并通过边缘预处理大幅减少云端推理开销,获得约17,000倍的延迟优势。
链接: https://arxiv.org/abs/2603.23791
作者: Qianlong Lan,Anuj Kaul
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Deploying large language models (LLMs) as autonomous browser agents exposes a significant attack surface in the form of Indirect Prompt Injection (IPI). Cloud-based defenses can provide strong semantic analysis, but they introduce latency and raise privacy concerns. We present the Cognitive Firewall, a three-stage split-compute architecture that distributes security checks across the client and the cloud. The system consists of a local visual Sentinel, a cloud-based Deep Planner, and a deterministic Guard that enforces execution-time policies. Across 1,000 adversarial samples, edge-only defenses fail to detect 86.9% of semantic attacks. In contrast, the full hybrid architecture reduces the overall attack success rate (ASR) to below 1% (0.88% under static evaluation and 0.67% under adaptive evaluation), while maintaining deterministic constraints on side-effecting actions. By filtering presentation-layer attacks locally, the system avoids unnecessary cloud inference and achieves an approximately 17,000x latency advantage over cloud-only baselines. These results indicate that deterministic enforcement at the execution boundary can complement probabilistic language models, and that split-compute provides a practical foundation for securing interactive LLM agents.
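摘要中的三段式结构(本地 Sentinel、云端 Deep Planner、执行边界的确定性 Guard)可以用一个串行管道来示意。下面的过滤模式、动作白名单与桩函数均为假设示例,真实系统中 Deep Planner 由云端 LLM 实现:

```python
import re

# 示意性草图:三段式防御管道。Sentinel 本地过滤表示层攻击,
# Deep Planner(此处为桩函数)做云端语义分析,Guard 对副作用
# 动作做确定性白名单检查。规则与动作名均为假设。

SENTINEL_PATTERNS = [r"ignore (all )?previous instructions", r"<hidden>"]
ALLOWED_ACTIONS = {"click", "scroll", "read"}   # 副作用动作(如 submit_form)默认拒绝

def sentinel(page_text):
    return not any(re.search(p, page_text, re.I) for p in SENTINEL_PATTERNS)

def deep_planner_stub(page_text):
    # 占位:真实系统中由云端模型做语义注入检测
    return "wire transfer" not in page_text.lower()

def guard(action):
    return action in ALLOWED_ACTIONS

def firewall(page_text, proposed_action):
    if not sentinel(page_text):
        return "blocked_by_sentinel"
    if not deep_planner_stub(page_text):
        return "blocked_by_planner"
    if not guard(proposed_action):
        return "blocked_by_guard"
    return "allowed"
```

注意 Guard 是确定性的:即使前两层的概率模型被绕过,白名单之外的副作用动作也不会执行,这对应摘要中"执行边界的确定性约束"。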
[AI-51] Probabilistic Geometric Alignment via Bayesian Latent Transport for Domain-Adaptive Foundation Models
【速读】:该论文旨在解决在有限监督条件下,将大规模基础模型适应到新领域时面临的挑战,包括潜在分布不匹配、优化动力学不稳定以及不确定性传播失准等问题。其解决方案的关键在于提出一种不确定性感知的概率潜空间传输框架(uncertainty-aware probabilistic latent transport framework),将领域自适应建模为表示空间中的随机几何对齐问题:通过引入贝叶斯传输算子(Bayesian transport operator)沿Wasserstein型测地线轨迹重新分配潜在概率质量,并结合PAC-贝叶斯正则化机制约束后验模型复杂度以缓解灾难性过拟合。该方法在理论上保障了收敛稳定性、损失曲面平滑性和样本效率,实证结果表明其显著降低了潜在流形差异、加速了传输能量衰减并提升了协方差校准效果,同时确保后验不确定性的有界演化,从而增强了跨域迁移过程中的概率可靠性。
链接: https://arxiv.org/abs/2603.23783
作者: Kuepon Aueawatthanaphisut
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Probability (math.PR); Machine Learning (stat.ML)
备注: 11 pages, 8 Figures, 25 Equations, 5 Tables and 3 Theorems
Abstract:Adapting large-scale foundation models to new domains with limited supervision remains a fundamental challenge due to latent distribution mismatch, unstable optimization dynamics, and miscalibrated uncertainty propagation. This paper introduces an uncertainty-aware probabilistic latent transport framework that formulates domain adaptation as a stochastic geometric alignment problem in representation space. A Bayesian transport operator is proposed to redistribute latent probability mass along Wasserstein-type geodesic trajectories, while a PAC-Bayesian regularization mechanism constrains posterior model complexity to mitigate catastrophic overfitting. The proposed formulation yields theoretical guarantees on convergence stability, loss landscape smoothness, and sample efficiency under distributional shift. Empirical analyses demonstrate substantial reduction in latent manifold discrepancy, accelerated transport energy decay, and improved covariance calibration compared with deterministic fine-tuning and adversarial domain adaptation baselines. Furthermore, bounded posterior uncertainty evolution indicates enhanced probabilistic reliability during cross-domain transfer. By establishing a principled connection between stochastic optimal transport geometry and statistical generalization theory, the proposed framework provides new insights into robust adaptation of modern foundation architectures operating in heterogeneous environments. These findings suggest that uncertainty-aware probabilistic alignment constitutes a promising paradigm for reliable transfer learning in next-generation deep representation systems.
[AI-52] Human-in-the-Loop Pareto Optimization: Trade-off Characterization for Assist-as-Needed Training and Performance Evaluation
【速读】:该论文旨在解决人类在运动技能训练与物理康复过程中,任务难度与用户表现之间存在的固有权衡问题(trade-off),这一权衡关系对于评估用户表现、设计“按需辅助”(assist-as-needed, AAN)协议以及衡量训练方案有效性至关重要。解决方案的关键在于提出一种新颖的人在回路(human-in-the-loop, HiL)帕累托优化方法,通过贝叶斯多准则优化技术系统且高效地刻画该权衡关系;其核心创新在于采用混合模型——量化指标测量任务性能,定性指标捕捉主观挑战感知水平,并借助用户研究验证了该框架的可行性及在三种应用场景中的实用性:AAN训练协议设计与评估、个体训练前后表现的公平比较(即使无法无辅助完成任务)、跨个体表现的公平比较(基于每位用户在所有可行辅助水平下的最优表现)。
链接: https://arxiv.org/abs/2603.23777
作者: Harun Tolasa,Volkan Patoglu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: Under review for publication in IEEE Transactions on Haptics
Abstract:During human motor skill training and physical rehabilitation, there is an inherent trade-off between task difficulty and user performance. Characterizing this trade-off is crucial for evaluating user performance, designing assist-as-needed (AAN) protocols, and assessing the efficacy of training protocols. In this study, we propose a novel human-in-the-loop (HiL) Pareto optimization approach to characterize the trade-off between task performance and the perceived challenge level of motor learning or rehabilitation tasks. We adapt Bayesian multi-criteria optimization to systematically and efficiently perform HiL Pareto characterizations. Our HiL optimization employs a hybrid model that measures performance with a quantitative metric, while the perceived challenge level is captured with a qualitative metric. We demonstrate the feasibility of the proposed HiL Pareto characterization through a user study. Furthermore, we present the utility of the framework through three use cases in the context of a manual skill training task with haptic feedback. First, we demonstrate how the characterized trade-off can be used to design a sample AAN training protocol for a motor learning task and to evaluate the group-level efficacy of the proposed AAN protocol relative to a baseline adaptive assistance protocol. Second, we demonstrate that individual-level comparisons of the trade-offs characterized before and after the training session enable fair evaluation of training progress under different assistance levels. This evaluation method is more general than standard performance evaluations, as it can provide insights even when users cannot perform the task without assistance. Third, we show that the characterized trade-offs also enable fair performance comparisons among different users, as they capture the best possible performance of each user under all feasible assistance levels.
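HiL 帕累托刻画最终得到一组(任务表现,感知挑战)观测点,其中的权衡前沿即非支配集:表现越高越好、挑战越低越好。下面用假设数据示意如何提取该前沿(与论文的贝叶斯多准则优化本身无关,仅演示"帕累托前沿"这一概念):

```python
# 示意性草图:从 (performance, challenge) 观测点中提取帕累托前沿。
# performance 越大越好,challenge 越小越好;数据为假设示例。

def pareto_front(points):
    """返回非支配点,按表现升序排列。"""
    front = []
    for p in points:
        dominated = any(q[0] >= p[0] and q[1] <= p[1] and q != p for q in points)
        if not dominated:
            front.append(p)
    return sorted(front)

obs = [(0.50, 2.0), (0.70, 3.0), (0.60, 4.0), (0.90, 5.0), (0.80, 5.5)]
front = pareto_front(obs)
```

对比训练前后两条前沿(而非单点表现),即可实现摘要所说的"在不同辅助水平下的公平进步评估"。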
[AI-53] AI-driven Intent-Based Networking Approach for Self-configuration of Next Generation Networks
【速读】:该论文旨在解决意图驱动网络(Intent-Based Networking, IBN)中依赖自动化实现的两大难题:其一,从模糊的自然语言意图到控制器可执行策略的转换过程脆弱且易产生冲突与意外副作用;其二,现有保障机制多为被动响应式,难以在多意图场景下有效识别故障根源并处理级联症状与模糊遥测数据。解决方案的关键在于构建一个端到端闭环的IBN流水线,利用大语言模型(Large Language Models, LLMs)结合结构化验证实现自然语言到策略的鲁棒映射与冲突感知激活,并将保障机制重构为基于多意图的主动故障预测与根因消歧,从而实现可信赖的自动化,提供可操作的早期预警、可解释的推理过程及可量化的修复前置时间。
链接: https://arxiv.org/abs/2603.23772
作者: Md. Kamrul Hossain,Walid Aljoby
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: Accepted for presentation in IEEE/IFIP NOMS 2026
Abstract:Intent-Based Networking (IBN) aims to simplify operating heterogeneous infrastructures by translating high-level intents into enforceable policies and assuring compliance. However, dependable automation remains difficult because (i) realizing intents from ambiguous natural language into controller-ready policies is brittle and prone to conflicts and unintended side effects, and (ii) assurance is often reactive and struggles in multi-intent settings where faults create cascading symptoms and ambiguous telemetry. This paper proposes an end-to-end closed-loop IBN pipeline that uses large language models with structured validation for natural language to policy realization and conflict-aware activation, and reformulates assurance as proactive multi-intent failure prediction with root-cause disambiguation. The expected outcome is operator-trustworthy automation that provides actionable early warnings, interpretable explanations, and measurable lead time for remediation.
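摘要指出"自然语言意图到控制器策略的转换脆弱",其工程对策是对 LLM 输出做结构化校验后再下发。下面是一个极简草图:字段名、取值范围均为假设示例,并非论文的真实策略模式:

```python
import json

# 示意性草图:校验 LLM 生成的 JSON 策略。校验失败时应拒绝下发
# 并要求 LLM 重试;字段名与取值范围均为假设。

REQUIRED = {"action": str, "target": str, "bandwidth_mbps": (int, float)}

def validate_policy(raw_json):
    """返回 (policy, errors);errors 非空时 policy 为 None。"""
    try:
        policy = json.loads(raw_json)
    except json.JSONDecodeError as e:
        return None, [f"invalid JSON: {e}"]
    errors = [f"missing or mistyped field: {k}"
              for k, t in REQUIRED.items()
              if k not in policy or not isinstance(policy[k], t)]
    if not errors and not (0 < policy["bandwidth_mbps"] <= 10000):
        errors.append("bandwidth out of range")
    return (policy if not errors else None), errors

good, errs = validate_policy(
    '{"action": "prioritize", "target": "voip", "bandwidth_mbps": 50}')
bad, bad_errs = validate_policy('{"action": "prioritize"}')
```

在此之上还需做冲突感知激活(与已生效策略求交),这部分属于论文的核心贡献,此处不展开。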
[AI-54] Self Paced Gaussian Contextual Reinforcement Learning
【速读】:该论文旨在解决自适应课程学习(Self-Paced Curriculum Learning, SPCL)在高维上下文空间中因依赖计算昂贵的内循环优化而导致可扩展性受限的问题。其解决方案的关键在于提出了一种新的自适应高斯课程学习方法(Self-Paced Gaussian Curriculum Learning, SPGL),该方法通过引入高斯上下文分布的闭式更新规则,避免了传统方法所需的数值优化过程,从而显著降低了计算开销,同时保持了样本效率和自适应能力。理论分析表明SPGL具有收敛性保证,并在多个上下文强化学习基准任务中验证了其有效性,尤其在隐藏上下文场景下表现优异且上下文分布收敛更稳定。
链接: https://arxiv.org/abs/2603.23755
作者: Mohsen Sahraei Ardakani,Rui Song
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 10 figures
Abstract:Curriculum learning improves reinforcement learning (RL) efficiency by sequencing tasks from simple to complex. However, many self-paced curriculum methods rely on computationally expensive inner-loop optimizations, limiting their scalability in high-dimensional context spaces. In this paper, we propose Self-Paced Gaussian Curriculum Learning (SPGL), a novel approach that avoids costly numerical procedures by leveraging a closed-form update rule for Gaussian context distributions. SPGL maintains the sample efficiency and adaptability of traditional self-paced methods while substantially reducing computational overhead. We provide theoretical guarantees on convergence and validate our method across several contextual RL benchmarks, including the Point Mass, Lunar Lander, and Ball Catching environments. Experimental results show that SPGL matches or outperforms existing curriculum methods, especially in hidden context scenarios, and achieves more stable context distribution convergence. Our method offers a scalable, principled alternative for curriculum generation in challenging continuous and partially observable domains.
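SPGL 的闭式更新公式未在摘要中给出;为说明"自适应高斯课程"的思路,下面用一个常见的替代形式:按随表现增大的步长 eta,在初始(简单)分布与目标分布之间对一维高斯的均值与方差做插值。调度规则与数值均为假设示例,并非论文的更新规则:

```python
# 示意性草图:一维高斯课程分布随训练表现向目标分布移动。
# 插值形式与 eta 调度均为假设,仅示意"自适应推进课程"的机制。

def paced_gaussian_update(mu, var, mu_target, var_target, eta):
    """按步长 eta(0<=eta<=1)向目标高斯参数插值一步。"""
    return ((1 - eta) * mu + eta * mu_target,
            (1 - eta) * var + eta * var_target)

mu, var = 0.0, 0.25        # 初始上下文分布:集中在易任务附近
mu_t, var_t = 5.0, 4.0     # 目标上下文分布
for performance in [0.2, 0.5, 0.8, 0.9, 0.95]:
    eta = max(0.0, performance - 0.1)   # 表现越好,课程推进越快(假设的调度)
    mu, var = paced_gaussian_update(mu, var, mu_t, var_t, eta)
```

关键点在于每步都是闭式计算,无需摘要所批评的"代价高昂的内循环数值优化"。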
[AI-55] Efficient Benchmarking of AI Agents
【速读】:该论文旨在解决AI代理(AI agent)在综合性基准测试中评估成本高昂的问题,尤其针对需要交互式滚动(interactive rollouts)和多步推理的任务场景。传统方法依赖完整基准测试以确保排名准确性,但存在资源消耗大、效率低的缺陷。解决方案的关键在于发现并利用“绝对分数预测”与“排名顺序预测”在框架驱动分布偏移(scaffold-driven distribution shift)下的不对称性:尽管绝对得分会因代理框架变化而显著波动,但相对排名保持稳定。基于此观察,作者提出一种无需优化的简单协议——仅在历史通过率处于中等难度区间(30%-70%)的任务上评估新代理,该策略受项目反应理论(Item Response Theory, IRT)启发,可减少44%-70%的评估任务量,同时在框架和时间分布偏移下仍保持高排名保真度(rank fidelity),优于随机采样和贪婪任务选择方法。
链接: https://arxiv.org/abs/2603.23749
作者: Franck Ndzomga
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 22 pages, 7 figures, 5 tables
Abstract:Evaluating AI agents on comprehensive benchmarks is expensive because each evaluation requires interactive rollouts with tool use and multi-step reasoning. We study whether small task subsets can preserve agent rankings at substantially lower cost. Unlike static language model benchmarks, agent evaluation is subject to scaffold-driven distribution shift, since performance depends on the framework wrapping the underlying model. Across eight benchmarks, 33 agent scaffolds, and 70+ model configurations, we find that absolute score prediction degrades under this shift, while rank-order prediction remains stable. Exploiting this asymmetry, we propose a simple optimization-free protocol: evaluate new agents only on tasks with intermediate historical pass rates (30-70%). This mid-range difficulty filter, motivated by Item Response Theory, reduces the number of evaluation tasks by 44-70% while maintaining high rank fidelity under scaffold and temporal shifts. It provides more reliable rankings than random sampling, which exhibits high variance across seeds, and outperforms greedy task selection under distribution shift. These results suggest that reliable leaderboard ranking does not require full-benchmark evaluation.
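摘要提出的协议本身就是一个简单的任务筛选器:只在历史通过率落在 30%–70% 区间的任务上评估新代理(区间数值来自摘要;任务名为假设示例):

```python
# 示意性草图:中等难度过滤器 —— 保留历史通过率在 [lo, hi] 内的任务。
# 30%-70% 的区间来自摘要;任务名与通过率为假设示例。

def select_mid_difficulty(pass_rates, lo=0.30, hi=0.70):
    return [task for task, rate in pass_rates.items() if lo <= rate <= hi]

historical = {"t1": 0.95, "t2": 0.55, "t3": 0.10, "t4": 0.40, "t5": 0.70}
subset = select_mid_difficulty(historical)   # 本例中 5 个任务被削减为 3 个
```

其直觉与项目反应理论一致:通过率接近 0 或 1 的任务对区分代理排名几乎不提供信息。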
[AI-56] CDMT-EHR: A Continuous-Time Diffusion Framework for Generating Mixed-Type Time-Series Electronic Health Records
【速读】:该论文旨在解决电子健康记录(Electronic Health Records, EHRs)在临床研究中因隐私问题导致的数据共享受限难题,同时应对EHR数据中数值型与分类型特征共存且随时间动态变化所带来的合成挑战。其解决方案的关键在于提出一种连续时间扩散框架,通过三个核心创新实现高质量混合类型时序EHR生成:(1) 采用双向门控循环单元(bidirectional gated recurrent unit)作为主干网络以捕捉时间依赖性;(2) 引入可学习的连续嵌入机制对分类变量进行统一高斯扩散建模,从而支持跨特征联合建模;(3) 设计因子化解耦的可学习噪声调度策略,使模型能自适应不同特征在每个时间步的学习难度。实验表明,该方法在下游任务性能、分布保真度和判别能力上优于现有方法,且采样步数仅需50步,显著低于基线方法的1000步。
链接: https://arxiv.org/abs/2603.23719
作者: Shaonan Liu,Yuichiro Iwashita,Soichiro Nakako,Masakazu Iwamura,Koichi Kise
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Electronic health records (EHRs) are invaluable for clinical research, yet privacy concerns severely restrict data sharing. Synthetic data generation offers a promising solution, but EHRs present unique challenges: they contain both numerical and categorical features that evolve over time. While diffusion models have demonstrated strong performance in EHR synthesis, existing approaches predominantly rely on discrete-time formulations, which suffer from finite-step approximation errors and coupled training-sampling step counts. We propose a continuous-time diffusion framework for generating mixed-type time-series EHRs with three contributions: (1) continuous-time diffusion with a bidirectional gated recurrent unit backbone for capturing temporal dependencies, (2) unified Gaussian diffusion via learnable continuous embeddings for categorical variables, enabling joint cross-feature modeling, and (3) a factorized learnable noise schedule that adapts to per-feature-per-timestep learning difficulties. Experiments on two large-scale intensive care unit datasets demonstrate that our method outperforms existing approaches in downstream task performance, distribution fidelity, and discriminability, while requiring only 50 sampling steps compared to 1,000 for baseline methods. Classifier-free guidance further enables effective conditional generation for class-imbalanced clinical scenarios.
[AI-57] Learning What Can Be Picked: Active Reachability Estimation for Efficient Robotic Fruit Harvesting
【速读】:该论文旨在解决农业机器人在非结构化果园环境中进行果实采摘时,因感知到动作(perception-to-action)流程效率低下而导致的部署难题。现有方法通常依赖于耗时的逆运动学或路径规划来判断目标果实是否可达,造成计算冗余和决策延迟。其解决方案的关键在于将RGB-D感知与主动学习(active learning)相结合,直接将可达性建模为一个二分类决策问题,并通过主动学习策略选择最具信息量的样本进行标注,从而显著降低人工标注成本,同时保持高预测精度。实验表明,该方法在少量标注数据下即可实现优于随机采样的准确率(提升约6–8%),并能高效适应新的果园布局,验证了主动学习在任务级感知中的有效性,为替代传统计算密集型运动学可达性分析提供了可扩展的新路径。
链接: https://arxiv.org/abs/2603.23679
作者: Nur Afsa Syeda,Mohamed Elmahallawy,Luis Fernando de la Torre,John Miller
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Agriculture remains a cornerstone of global health and economic sustainability, yet labor-intensive tasks such as harvesting high-value crops continue to face growing workforce shortages. Robotic harvesting systems offer a promising solution; however, their deployment in unstructured orchard environments is constrained by inefficient perception-to-action pipelines. In particular, existing approaches often rely on exhaustive inverse kinematics or motion planning to determine whether a target fruit is reachable, leading to unnecessary computation and delayed decision-making. Our approach combines RGB-D perception with active learning to directly learn reachability as a binary decision problem. We then leverage active learning to selectively query the most informative samples for reachability labeling, significantly reducing annotation effort while maintaining high predictive accuracy. Extensive experiments demonstrate that the proposed framework achieves accurate reachability prediction with substantially fewer labeled samples, yielding approximately 6–8% higher accuracy than random sampling and enabling label-efficient adaptation to new orchard configurations. Among the evaluated strategies, entropy- and margin-based sampling outperform Query-by-Committee and standard uncertainty sampling in low-label regimes, while all strategies converge to comparable performance as the labeled set grows. These results highlight the effectiveness of active learning for task-level perception in agricultural robotics and position our approach as a scalable alternative to computation-heavy kinematic reachability analysis. Our code is available through this https URL.
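摘要中评估的熵采样与边际采样,对"可达/不可达"二分类都有非常简单的形式:熵采样选预测最不确定的候选,边际采样选两类概率差最小的候选(二分类下两者的选择一致)。下面的预测概率为假设示例:

```python
import math

# 示意性草图:主动学习的两种查询策略。probs[i] 为模型预测
# "第 i 个候选果实可达"的概率(假设数值)。

def entropy(p):
    eps = 1e-12
    return -(p * math.log(p + eps) + (1 - p) * math.log(1 - p + eps))

def pick_entropy(probs):
    """选预测熵最大的样本去请求人工标注。"""
    return max(range(len(probs)), key=lambda i: entropy(probs[i]))

def pick_margin(probs):
    """选两类概率差 |p - (1-p)| 最小的样本。"""
    return min(range(len(probs)), key=lambda i: abs(2 * probs[i] - 1))

probs = [0.95, 0.52, 0.10, 0.80]
query = pick_entropy(probs)
```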
[AI-58] Grounding Vision and Language to 3D Masks for Long-Horizon Box Rearrangement
【速读】:该论文旨在解决在3D环境中,仅依赖视觉观测和模糊的自然语言目标进行长时程规划的问题,尤其针对多步骤的3D盒子重排任务。现有方法通常依赖符号规划器(symbolic planners),其状态与目标的关联关系脆弱,或直接从2D视觉语言模型(VLMs)生成动作序列,二者均难以处理大量物体、复杂的3D几何结构以及隐含的语义约束。论文提出的关键解决方案是Reactive Action Mask Planner (RAMP-3D),其核心在于将长时程规划建模为对成对3D掩码的逐步反应式预测:一个“选物掩码”(which-object mask)指示操作对象,另一个“目标区域掩码”(which-target-region mask)指定放置位置。该方法通过RGB-D观测和自然语言指令,实时生成多步抓取与放置动作,在包含1–30个盒子的仓库场景中实现79.5%的成功率,显著优于基于2D VLM的基线,验证了基于掩码的反应式策略在长时程规划中的有效性。
链接: https://arxiv.org/abs/2603.23676
作者: Ashish Malik,Caleb Lowe,Aayam Shrestha,Stefan Lee,Fuxin Li,Alan Fern
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:We study long-horizon planning in 3D environments from under-specified natural-language goals using only visual observations, focusing on multi-step 3D box rearrangement tasks. Existing approaches typically rely on symbolic planners with brittle relational grounding of states and goals, or on direct action-sequence generation from 2D vision-language models (VLMs). Both approaches struggle with reasoning over many objects, rich 3D geometry, and implicit semantic constraints. Recent advances in 3D VLMs demonstrate strong grounding of natural-language referents to 3D segmentation masks, suggesting the potential for more general planning capabilities. We extend existing 3D grounding models and propose Reactive Action Mask Planner (RAMP-3D), which formulates long-horizon planning as sequential reactive prediction of paired 3D masks: a “which-object” mask indicating what to pick and a “which-target-region” mask specifying where to place it. The resulting system processes RGB-D observations and natural-language task specifications to reactively generate multi-step pick-and-place actions for 3D box rearrangement. We conduct experiments across 11 task variants in warehouse-style environments with 1-30 boxes and diverse natural-language constraints. RAMP-3D achieves 79.5% success rate on long-horizon rearrangement tasks and significantly outperforms 2D VLM-based baselines, establishing mask-based reactive policies as a promising alternative to symbolic pipelines for long-horizon planning.
[AI-59] Echoes: A semantically-aligned music deepfake detection dataset
【速读】:该论文旨在解决当前音乐深度伪造(Deepfake)检测模型在真实场景下泛化能力不足的问题,尤其是针对由不同AI音乐生成系统(provider-diverse)产生的伪造音频缺乏统一、具有挑战性的训练与评估基准。解决方案的关键在于构建一个名为Echoes的新数据集,该数据集包含来自10种主流AI音乐生成系统的3,577首曲目(共110小时音频),并强制在语义层面实现伪造音频与真实参考音频的一致性(semantic-level alignment),具体通过直接以真实波形或歌曲描述符为条件生成音频来实现。这种设计有效防止了模型学习到数据集特有“捷径”(shortcut learning),从而促使检测器学习更具鲁棒性和可迁移性的特征,实验表明基于Echoes训练的模型在跨数据集测试中表现出最优的泛化性能。
链接: https://arxiv.org/abs/2603.23667
作者: Octavian Pascu,Dan Oneata,Horia Cucu,Nicolas M. Muller
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:
Abstract:We introduce Echoes, a new dataset for music deepfake detection designed for training and benchmarking detectors under realistic and provider-diverse conditions. Echoes comprises 3,577 tracks (110 hours of audio) spanning multiple genres (pop, rock, electronic), and includes content generated by ten popular AI music generation systems. To prevent shortcut learning and promote robust generalization, the dataset is deliberately constructed to be challenging, enforcing semantic-level alignment between spoofed audio and bona fide references. This alignment is achieved by conditioning generated audio samples directly on bona-fide waveforms or song descriptors. We evaluate Echoes in a cross-dataset setting against three existing AI-generated music datasets using state-of-the-art Wav2Vec2 XLS-R 2B representations. Results show that (i) Echoes is the hardest in-domain dataset; (ii) detectors trained on existing datasets transfer poorly to Echoes; (iii) training on Echoes yields the strongest generalization performance. These findings suggest that provider diversity and semantic alignment help learn more transferable detection cues.
[AI-60] GTO Wizard Benchmark
【速读】:该论文旨在解决在具有部分可观测性的多智能体系统中,对博弈算法(特别是扑克领域)进行标准化、高精度评估的难题。核心问题在于传统蒙特卡洛评估方法因方差过大导致统计显著性不足,难以公平比较不同算法性能。解决方案的关键是提出GTO Wizard Benchmark——一个基于公开API和标准化框架的评测系统,其核心创新在于引入AIVAT(Agent-Independent Variance Reduction Technique),这是一种可证明无偏的方差降低技术,能够在等效统计功效下将所需对局数减少至传统方法的十分之一,从而大幅提升评估效率与准确性。该基准以超人类水平的GTO Wizard AI作为对抗基准,为LLM等新型推理模型提供了一个严谨且可量化的测试平台。
链接: https://arxiv.org/abs/2603.23660
作者: Marc-Antoine Provost,Nejc Ilenic,Christopher Solinas,Philippe Beardsell
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce GTO Wizard Benchmark, a public API and standardized evaluation framework for benchmarking algorithms in Heads-Up No-Limit Texas Hold’em (HUNL). The benchmark evaluates agents against GTO Wizard AI, a state-of-the-art superhuman poker agent that approximates Nash Equilibria, and defeated Slumbot, the 2018 Annual Computer Poker Competition champion and previous strongest publicly accessible HUNL benchmark, by 19.4 ± 4.1 bb/100. Variance is a fundamental challenge in poker evaluation; we address this by integrating AIVAT, a provably unbiased variance reduction technique that achieves equivalent statistical significance with ten times fewer hands than naive Monte Carlo evaluation. We conduct a comprehensive benchmarking study of state-of-the-art large language models under zero-shot conditions, including GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, Grok 4, and others. Initial results and analysis reveal dramatic progress in LLM reasoning over recent years, yet all models remain far below the baseline established by our benchmark. Qualitative analysis reveals clear opportunities for improvement, including representation and the ability to reason over hidden states. This benchmark provides researchers with a precise and quantifiable setting to evaluate advances in planning and reasoning in multi-agent systems with partial observability.
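摘要中"方差降低即可用十分之一对局数达到同等统计显著性"的结论,可以用置信区间样本量的换算来直观理解。下面是一个示意性草图:所需对局数 n 满足 z·σ/√n ≤ margin,因此 n ∝ σ²;其中标准差 60 bb/100 仅为假设数值,并非论文数据:

```python
import math

def hands_needed(std_bb100, margin_bb100, z=1.96):
    # 使 z * std / sqrt(n) <= margin 所需的最小样本数 n(以"百手"为单位)
    return math.ceil((z * std_bb100 / margin_bb100) ** 2)

naive = hands_needed(60.0, 4.1)                    # 朴素蒙特卡洛评估
# 若类 AIVAT 的方差降低技术将标准差缩小 sqrt(10) 倍,
# 则同等置信区间所需样本约缩小 10 倍
reduced = hands_needed(60.0 / math.sqrt(10), 4.1)
print(naive, reduced, naive / reduced)
```

由于 n 与 σ² 成正比,方差降低十倍直接对应样本量降低十倍,这正是 AIVAT 这类无偏方差降低技术的价值所在。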
[AI-61] Can LLM Agents Be CFOs? A Benchmark for Resource Allocation in Dynamic Enterprise Environments
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在长期企业资源分配场景下,面对不确定性时是否具备有效资源配置能力的问题。现有研究多聚焦于短周期反应式决策,而资源分配需在时间维度上做出承诺、平衡多重目标并保留未来灵活性,这一挑战尚未被充分探索。解决方案的关键在于构建首个面向长期企业资源分配的基准测试平台——EnterpriseArena,其通过一个包含132个月的企业模拟环境,融合公司级财务数据、匿名业务文档、宏观经济与行业信号及专家验证的运营规则,实现部分可观测性设计:代理必须在有限预算下权衡信息获取与资源保存,从而逼真还原企业CFO决策场景。实验表明,该设定极具挑战性,仅16%的运行能完成全周期,且模型规模并非性能保障,揭示了当前LLM代理在长期不确定性下存在显著的能力缺口。
链接: https://arxiv.org/abs/2603.23638
作者: Yi Han,Lingfei Qian,Yan Wang,Yueru He,Xueqing Peng,Dongji Feng,Yankai Chen,Haohang Li,Yupeng Cao,Jimin Huang,Xue Liu,Jian-Yun Nie,Sophia Ananiadou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have enabled agentic systems that can reason, plan, and act across complex tasks, but it remains unclear whether they can allocate resources effectively under uncertainty. Unlike short-horizon reactive decisions, allocation requires committing scarce resources over time while balancing competing objectives and preserving flexibility for future needs. We introduce EnterpriseArena, the first benchmark for evaluating agents on long-horizon enterprise resource allocation. It instantiates CFO-style decision-making in a 132-month enterprise simulator combining firm-level financial data, anonymized business documents, macroeconomic and industry signals, and expert-validated operating rules. The environment is partially observable and reveals the state only through budgeted organizational tools, forcing agents to trade off information acquisition against conserving scarce resources. Experiments on eleven advanced LLMs show that this setting remains highly challenging: only 16% of runs survive the full horizon, and larger models do not reliably outperform smaller ones. These results identify long-horizon resource allocation under uncertainty as a distinct capability gap for current LLM agents.
[AI-62] LLMLOOP: Improving LLM-Generated Code and Tests through Automated Iterative Feedback Loops
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成源代码时普遍存在编译错误或逻辑错误的问题,这些问题导致开发者需耗费大量精力进行人工修正和测试用例优化,且常出现重复劳动。解决方案的关键在于提出 LLMLOOP 框架,其通过五个迭代循环自动完成代码与测试用例的精炼:包括修复编译错误、处理静态分析问题、解决测试失败、以及基于变异分析提升测试质量。该框架不仅生成高质量测试用例,还将其作为验证机制和回归测试套件,从而系统性地提升 LLM 生成代码的正确性和可靠性。
链接: https://arxiv.org/abs/2603.23613
作者: Ravin Ravi,Dylan Bradshaw,Stefano Ruberto,Gunel Jahangirova,Valerio Terragni
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted for publication in IEEE International Conference on Software Maintenance and Evolution (ICSME 2025). This arXiv version is the authors’ accepted manuscript. DOI: https://doi.org/10.1109/ICSME64153.2025.00109 Code: this http URL
Abstract:Large Language Models (LLMs) are showing remarkable performance in generating source code, yet the generated code often has issues like compilation errors or incorrect code. Researchers and developers often face wasted effort in implementing checks and refining LLM-generated code, frequently duplicating their efforts. This paper presents LLMLOOP, a framework that automates the refinement of both source code and test cases produced by LLMs. LLMLOOP employs five iterative loops: resolving compilation errors, addressing static analysis issues, fixing test case failures, and improving test quality through mutation analysis. These loops ensure the generation of high-quality test cases that serve as both a validation mechanism and a regression test suite for the generated code. We evaluated LLMLOOP on HUMANEVAL-X, a recent benchmark of programming tasks. Results demonstrate the tool’s effectiveness in refining LLM-generated outputs.
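LLMLOOP 的多轮"检查—修复"迭代可以抽象为如下的极简骨架。此处的 `syntax_check` 与修复器仅为玩具示例(论文中的修复由 LLM 完成,且还包含静态分析、测试失败与变异分析等循环),并非论文代码:

```python
def refinement_loop(code, checks, fix, max_iters=5):
    """依次执行每类检查;检查失败时调用修复器(在 LLMLOOP 中由 LLM 承担)。
    check(code) 返回错误信息或 None;fix(code, error) 返回修补后的代码。"""
    for check in checks:
        for _ in range(max_iters):
            error = check(code)
            if error is None:
                break
            code = fix(code, error)
    return code

def syntax_check(code):
    # 玩具检查:代码能否通过 Python 编译
    try:
        compile(code, "<snippet>", "exec")
        return None
    except SyntaxError as e:
        return str(e)

# 玩具修复器:补上缺失的右括号(真实系统中由 LLM 根据错误信息重写)
fixed = refinement_loop("print('hi'", [syntax_check],
                        fix=lambda code, err: code + ")")
```

每类检查都在预算内反复"检测—修复",直到通过或耗尽迭代次数,这正是摘要所述五个循环共享的结构。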
[AI-63] Environment Maps: Structured Environmental Representations for Long-Horizon Agents ICLR2026
【速读】:该论文旨在解决长时程软件工作流自动化中因代理(agent)频繁出现级联错误和环境随机性导致的任务失败问题,尤其是在动态界面中单次误操作即可引发整个任务崩溃的情况。其核心解决方案是提出一种称为“Environment Maps”的持久化、与代理无关的表征机制,该机制通过整合屏幕录制和执行轨迹等异构证据,构建一个结构化的图表示,包含四个关键组件:上下文(抽象位置)、动作(参数化可用性)、工作流(观测轨迹)和隐性知识(领域定义与可复用过程)。这一框架显著提升了任务成功率(在WebArena基准上达到28.2%),优于仅依赖会话内上下文的基线模型(14.2%)及直接使用原始轨迹数据的模型(23.3%),并为长期规划提供了人类可理解、可编辑且可增量优化的结构化接口。
链接: https://arxiv.org/abs/2603.23610
作者: Yenchia Feng,Chirag Sharma,Karime Maamari
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures, accepted to ICLR 2026 the 2nd Workshop on World Models
Abstract:Although large language models (LLMs) have advanced rapidly, robust automation of complex software workflows remains an open problem. In long-horizon settings, agents frequently suffer from cascading errors and environmental stochasticity; a single misstep in a dynamic interface can lead to task failure, resulting in hallucinations or trial-and-error. This paper introduces Environment Maps: a persistent, agent-agnostic representation that mitigates these failures by consolidating heterogeneous evidence, such as screen recordings and execution traces, into a structured graph. The representation consists of four core components: (1) Contexts (abstracted locations), (2) Actions (parameterized affordances), (3) Workflows (observed trajectories), and (4) Tacit Knowledge (domain definitions and reusable procedures). We evaluate this framework on the WebArena benchmark across five domains. Agents equipped with environment maps achieve a 28.2% success rate, nearly doubling the performance of baselines limited to session-bound context (14.2%) and outperforming agents that have access to the raw trajectory data used to generate the environment maps (23.3%). By providing a structured interface between the model and the environment, Environment Maps establish a persistent foundation for long-horizon planning that is human-interpretable, editable, and incrementally refinable.
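摘要中的四个组成部分(Context、Action、Workflow、隐性知识)本质上是一个带注释的有向图。以下草图是笔者按摘要描述给出的假设性建模,字段与方法命名均为臆测,并非论文实现:

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    name: str            # 参数化的可用操作,如 "open_message"
    params: dict
    target: str          # 执行后到达的 Context(抽象位置)

@dataclass
class EnvironmentMap:
    contexts: dict = field(default_factory=dict)   # Context 名称 -> 出边动作列表
    workflows: list = field(default_factory=list)  # 观测到的轨迹
    tacit: dict = field(default_factory=dict)      # 领域定义与可复用过程

    def add_transition(self, src, action, params, dst):
        # 把一次观测到的"位置 -(动作)-> 位置"迁移并入持久图
        self.contexts.setdefault(src, []).append(Action(action, params, dst))

m = EnvironmentMap()
m.add_transition("inbox", "open_message", {"index": 0}, "message_view")
```

这种显式图结构使表征可被人工检查与编辑,并可随新轨迹增量扩充,对应摘要中"human-interpretable, editable, incrementally refinable"的描述。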
[AI-64] LineMVGNN: Anti-Money Laundering with Line-Graph-Assisted Multi-View Graph Neural Networks
【速读】:该论文旨在解决传统反洗钱(Anti-Money Laundering, AML)系统依赖规则且准确率低、可扩展性差的问题,以及现有图神经网络(Graph Neural Networks, GNNs)在处理交易图时存在的多维边特征支持不足、可解释性弱和计算复杂度高等缺陷。其解决方案的关键在于提出一种基于线图辅助的多视角图神经网络(LineMVGNN),通过引入交易图的线图(line graph)视图增强信息传播,并设计轻量级多视角图神经网络模块实现节点间双向消息传递,从而更有效地捕捉资金流动模式,提升洗钱检测性能。实验表明,该方法在真实世界交易数据集上优于当前主流方法,兼具良好的可扩展性、对抗鲁棒性和监管合规潜力。
链接: https://arxiv.org/abs/2603.23584
作者: Chung-Hoo Poon,James Kwok,Calvin Chow,Jang-Hyeon Choi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP)
备注: Published as a journal paper in AI 2025
Abstract:Anti-money laundering (AML) systems are important for protecting the global economy. However, conventional rule-based methods rely on domain knowledge, leading to suboptimal accuracy and a lack of scalability. Graph neural networks (GNNs) for digraphs (directed graphs) can be applied to transaction graphs and capture suspicious transactions or accounts. However, most spectral GNNs do not naturally support multi-dimensional edge features, lack interpretability due to edge modifications, and have limited scalability owing to their spectral nature. Conversely, most spatial methods may not capture the money flow well. Therefore, in this work, we propose LineMVGNN (Line-Graph-Assisted Multi-View Graph Neural Network), a novel spatial method that considers payment and receipt transactions. Specifically, the LineMVGNN model extends a lightweight MVGNN module, which performs two-way message passing between nodes in a transaction graph. Additionally, LineMVGNN incorporates a line graph view of the original transaction graph to enhance the propagation of transaction information. We conduct experiments on two real-world account-based transaction datasets: the Ethereum phishing transaction network dataset and a financial payment transaction dataset from one of our industry partners. The results show that our proposed method outperforms state-of-the-art methods, reflecting the effectiveness of money laundering detection with line-graph-assisted multi-view graph learning. We also discuss scalability, adversarial robustness, and regulatory considerations of our proposed method.
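"线图视图"的核心思想是把每笔交易(有向边)变成节点,并在资金可能续流之处连边:若前一笔交易的收款方正是后一笔的付款方,则两个交易节点相连。下面用纯 Python 给出一个最小示意(并非论文实现):

```python
def line_graph(edges):
    """有向交易图的线图:交易 (u, v) 指向 (x, y) 当且仅当 v == x,
    即资金从 u 经 v 流向 y,刻画可能的资金流路径。"""
    adj = {e: [] for e in edges}
    for (u, v) in edges:
        for (x, y) in edges:
            if v == x and (u, v) != (x, y):
                adj[(u, v)].append((x, y))
    return adj

# A 向 B 转账后,B 分别向 C 和 D 转账:两条资金流分叉
txns = [("A", "B"), ("B", "C"), ("B", "D")]
lg = line_graph(txns)
```

在线图上做消息传递,信息沿"交易链"而非"账户邻接"传播,这有助于 GNN 捕捉分层、分散等典型洗钱模式。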
[AI-65] AI Generalisation Gap In Comorbid Sleep Disorder Staging
【速读】:该论文旨在解决现有基于深度学习的单导联脑电图(EEG)睡眠分期模型在临床人群(如缺血性卒中患者)中泛化性能差的问题。其关键解决方案是构建了一个新的、经过临床标注的缺血性卒中睡眠数据集iSLEEPS,并采用SE-ResNet与双向LSTM相结合的模型进行睡眠分期,同时通过Grad-CAM注意力可视化技术揭示模型在患者数据中关注的是生理上无意义的EEG区域,从而验证了健康人群与卒中患者之间睡眠结构存在显著差异,强调了开发面向特定疾病或受试者群体的模型并辅以临床验证的重要性。
链接: https://arxiv.org/abs/2603.23582
作者: Saswata Bose,Suvadeep Maiti,Shivam Kumar Sharma,Mythirayee S,Tapabrata Chakraborti,Srijitesh Rajendran,Raju S. Bapi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate sleep staging is essential for diagnosing OSA and hypopnea in stroke patients. Although PSG is reliable, it is costly, labor-intensive, and manually scored. While deep learning enables automated EEG-based sleep staging in healthy subjects, our analysis shows poor generalization to clinical populations with disrupted sleep. Using Grad-CAM interpretations, we systematically demonstrate this limitation. We introduce iSLEEPS, a newly clinically annotated ischemic stroke dataset (to be publicly released), and evaluate a SE-ResNet plus bidirectional LSTM model for single-channel EEG sleep staging. As expected, cross-domain performance between healthy and diseased subjects is poor. Attention visualizations, supported by clinical expert feedback, show the model focuses on physiologically uninformative EEG regions in patient data. Statistical and computational analyses further confirm significant sleep architecture differences between healthy and ischemic stroke cohorts, highlighting the need for subject-aware or disease-specific models with clinical validation before deployment. A summary of the paper and the code is available at this https URL
[AI-66] APreQEL: Adaptive Mixed Precision Quantization For Edge LLMs
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在边缘设备部署时面临的高计算成本与内存需求问题,尤其关注如何在有限资源下实现内存、延迟和精度之间的平衡。其关键解决方案是提出一种自适应混合精度量化机制(adaptive mixed precision quantization),该机制通过分析各层对模型性能的贡献,并结合目标硬件平台上的量化类型行为特征,为每一层动态分配最合适的量化策略,从而在满足用户定义优先级的前提下,实现更优的资源利用效率与性能权衡。
链接: https://arxiv.org/abs/2603.23575
作者: Meriem Bouzouad,Yuan-Hao Chang,Jalil Boukhobza
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Today, large language models have demonstrated their strengths in various tasks ranging from reasoning, code generation, and complex problem solving. However, this advancement comes with a high computational cost and memory requirements, making it challenging to deploy these models on edge devices to ensure real-time responses and data privacy. Quantization is one common approach to reducing memory use, but most methods apply it uniformly across all layers. This does not account for the fact that different layers may respond differently to reduced precision. Importantly, memory consumption and computational throughput are not necessarily aligned, further complicating deployment decisions. This paper proposes an adaptive mixed precision quantization mechanism that balances memory, latency, and accuracy in edge deployment under user-defined priorities. This is achieved by analyzing the layer-wise contribution and by inferring how different quantization types behave across the target hardware platform in order to assign the most suitable quantization type to each layer. This integration ensures that layer importance and the overall performance trade-offs are jointly respected in this design. Our work unlocks new configuration designs that uniform quantization cannot achieve, expanding the solution space to efficiently deploy the LLMs on resource-constrained devices.
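"按层分配量化类型"可以理解为带内存预算约束的分配问题。以下贪心草图仅为示意:假设只有 4/8 bit 两档,并以层敏感度为提升优先级;具体的敏感度度量与硬件侧吞吐建模是论文的核心,此处均做了简化:

```python
def assign_precisions(sensitivity, n_params, budget_bytes, lo_bits=4, hi_bits=8):
    """贪心草图:先全部用低精度,再在内存预算内优先提升最敏感的层。"""
    plan = {layer: lo_bits for layer in sensitivity}
    used = sum(n_params[l] * lo_bits // 8 for l in plan)   # 当前内存占用(字节)
    for layer in sorted(sensitivity, key=sensitivity.get, reverse=True):
        extra = n_params[layer] * (hi_bits - lo_bits) // 8  # 升档的额外开销
        if used + extra <= budget_bytes:
            plan[layer] = hi_bits
            used += extra
    return plan

# 两层模型、预算只够升一层:更敏感的 attn 层获得高精度
plan = assign_precisions({"attn": 0.9, "mlp": 0.2},
                         {"attn": 1000, "mlp": 1000},
                         budget_bytes=1600)
```

真实系统还需同时考虑延迟(不同量化类型在目标硬件上的吞吐并不与内存占用对齐),这正是摘要强调按用户优先级做多目标权衡的原因。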
[AI-67] PoiCGAN: A Targeted Poisoning Based on Feature-Label Joint Perturbation in Federated Learning
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中恶意客户端发起的目标性投毒攻击(targeted poisoning attack)难以规避模型性能测试与基于异常检测的防御机制的问题,从而提升攻击的隐蔽性和有效性。解决方案的关键在于提出一种基于特征-标签协同扰动的攻击方法 PoiCGAN,通过修改条件生成对抗网络(Conditional Generative Adversarial Network, CGAN)中判别器和生成器的输入,引导生成理想的投毒生成器,该生成器不仅能生成特定目标类别的污染样本,还能自动执行标签翻转操作,从而在保持主任务(工业图像分类)准确率下降小于8.87%的前提下,实现比基线方法高83.97%的攻击成功率,并显著增强攻击的隐蔽性。
链接: https://arxiv.org/abs/2603.23574
作者: Tao Liu,Jiguang Lv,Dapeng Man,Weiye Xi,Yaole Li,Feiyu Zhao,Kuiming Wang,Yingchao Bian,Chen Xu,Wu Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Federated Learning (FL), as a popular distributed learning paradigm, has shown outstanding performance in improving computational efficiency and protecting data privacy, and is widely applied in industrial image classification. However, due to its distributed nature, FL is vulnerable to threats from malicious clients, with poisoning attacks being a common threat. A major limitation of existing poisoning attack methods is their difficulty in bypassing model performance tests and defense mechanisms based on model anomaly detection. This often results in the detection and removal of poisoned models, which undermines their practical utility. To ensure both the performance of industrial image classification and attacks, we propose a targeted poisoning attack, PoiCGAN, based on feature-label collaborative perturbation. Our method modifies the inputs of the discriminator and generator in the Conditional Generative Adversarial Network (CGAN) to influence the training process, generating an ideal poison generator. This generator not only produces specific poisoned samples but also automatically performs label flipping. Experiments across various datasets show that our method achieves an attack success rate 83.97% higher than baseline methods, with a less than 8.87% reduction in the main task’s accuracy. Moreover, the poisoned samples and malicious models exhibit high stealthiness.
[AI-68] Dual-Criterion Curriculum Learning: Application to Temporal Data
【速读】:该论文旨在解决课程学习(Curriculum Learning, CL)中实例级难度评估指标定义困难的问题,尤其是现有方法多依赖特定应用场景的启发式规则,缺乏普适性和有效性。其解决方案的关键在于提出双准则课程学习(Dual-Criterion Curriculum Learning, DCCL)框架,该框架融合了基于损失(loss-based)和基于数据密度(density-based)的双重难度评估机制——通过在数据表示空间中学习密度信息来校准训练过程中的损失证据,从而更准确地反映真实的学习难度,特别是在数据稀疏区域。实验证明,在时间序列预测任务中,该双准则策略显著优于仅使用损失或标准非课程学习方法。
链接: https://arxiv.org/abs/2603.23573
作者: Gaspard Abel,Eloi Campagne,Mohamed Benloughmari,Argyris Kalogeratos
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Curriculum Learning (CL) is a meta-learning paradigm that trains a model by feeding the data instances incrementally according to a schedule, which is based on difficulty progression. Defining meaningful difficulty assessment measures is crucial and most usually the main bottleneck for effective learning, while also in many cases the employed heuristics are only application-specific. In this work, we propose the Dual-Criterion Curriculum Learning (DCCL) framework that combines two views of assessing instance-wise difficulty: a loss-based criterion is complemented by a density-based criterion learned in the data representation space. Essentially, DCCL calibrates training-based evidence (loss) under the consideration that data sparseness amplifies the learning difficulty. As a testbed, we choose the time-series forecasting task. We evaluate our framework on multivariate time-series benchmarks under standard One-Pass and Baby-Steps training schedules. Empirical results show the interest of density-based and hybrid dual-criterion curricula over loss-only baselines and standard non-CL training in this setting.
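"损失 + 密度"的双准则难度可以用一个简单的凸组合来示意。下面以 k 近邻平均距离作为表示空间稀疏度的代理,这是笔者的假设性实现,并非论文所用的密度估计方法:

```python
import math

def sparsity(x, data, k=2):
    # 稀疏度代理:到 k 个最近邻的平均距离,邻域越稀疏视为越难
    dists = sorted(math.dist(x, p) for p in data if p != x)
    return sum(dists[:k]) / k

def dual_difficulty(loss, x, data, alpha=0.5):
    # 假设性的双准则难度:损失项与密度项的凸组合
    return alpha * loss + (1 - alpha) * sparsity(x, data)

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
easy = dual_difficulty(0.2, (0.0, 0.0), pts)   # 稠密区域样本
hard = dual_difficulty(0.2, (5.0, 5.0), pts)   # 稀疏区域样本:同等损失但更难
```

两个样本的损失相同,但稀疏区域样本获得更高的难度分,体现了摘要所说"数据稀疏会放大学习难度"对损失证据的校准作用。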
[AI-69] StateLinFormer: Stateful Training Enhancing Long-term Memory in Navigation
【速读】:该论文旨在解决导航智能系统中长期记忆不足的问题,即现有方法在面对长时间交互时难以实现持续的记忆保留与适应性提升:模块化系统依赖显式映射而缺乏灵活性,基于Transformer的端到端模型受限于固定上下文窗口,无法支持跨段落的持久记忆。其解决方案的关键在于提出StateLinFormer,一种采用状态感知记忆机制(stateful memory mechanism)的线性注意力导航模型,通过在连续训练片段间保持递归记忆状态而非在每个批次边界重置,从而有效逼近无限长序列的学习过程,显著增强模型的长程记忆保留能力与情境学习(In-Context Learning, ICL)性能。
链接: https://arxiv.org/abs/2603.23571
作者: Zhiyuan Chen,Yuxuan Zhong,Fan Wang,Bo Yu,Pengtao Shao,Shaoshan Liu,Ning Ding
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures
Abstract:Effective navigation intelligence relies on long-term memory to support both immediate generalization and sustained adaptation. However, existing approaches face a dilemma: modular systems rely on explicit mapping but lack flexibility, while Transformer-based end-to-end models are constrained by fixed context windows, limiting persistent memory across extended interactions. We introduce StateLinFormer, a linear-attention navigation model trained with a stateful memory mechanism that preserves recurrent memory states across consecutive training segments instead of reinitializing them at each batch boundary. This training paradigm effectively approximates learning on infinitely long sequences, enabling the model to achieve long-horizon memory retention. Experiments across both MAZE and ProcTHOR environments demonstrate that StateLinFormer significantly outperforms its stateless linear-attention counterpart and standard Transformer baselines with fixed context windows. Notably, as interaction length increases, persistent stateful training substantially improves context-dependent adaptation, suggesting an enhancement in the model’s In-Context Learning (ICL) capabilities for navigation tasks.
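"跨片段保留递归状态"与"每批重置"的差别可以用一个一阶线性递归直观展示(示意代码,非论文模型;衰减系数 0.5 为任意取值):

```python
def run_segments(segments, init_state=0.0, stateful=True, decay=0.5):
    """依次处理连续片段;stateful=True 时在片段边界保留状态,
    否则模拟常规训练在每个片段开头重置状态。"""
    state = init_state
    finals = []
    for seg in segments:
        if not stateful:
            state = init_state        # 无状态基线:每个片段重新开始
        for x in seg:
            state = decay * state + x  # 极简的递归记忆更新
        finals.append(state)
    return finals

segments = [[1.0], [1.0]]
with_memory = run_segments(segments, stateful=True)    # 第二段继承第一段的记忆
without = run_segments(segments, stateful=False)       # 第二段与第一段结果相同
```

有状态训练下,第二段的末状态(1.5)携带了第一段的信息,而无状态基线两段结果完全相同;把这一机制贯穿整个训练,即近似了在无限长序列上学习,从而增强长程记忆。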
[AI-70] AscendOptimizer: Episodic Agent for Ascend NPU Operator Optimization
【速读】:该论文旨在解决华为Ascend神经处理单元(NPU)上AscendC(Ascend C)算子优化中存在的双重知识瓶颈问题:一方面缺乏公开的参考实现供学习借鉴,另一方面性能高度依赖于主机端数据分块(tiling)程序与内核端指令调度和流水线配置的耦合关系。解决方案的关键在于提出一个闭环强化学习代理AscendOptimizer,其核心机制包括两部分:在主机端通过“循环内剖析进化搜索”(profiling-in-the-loop evolutionary search)直接从硬件反馈中发现高效且合法的数据移动策略;在内核端通过“回溯优化动机挖掘”(rewinding optimized kernels to synthesize “bad-to-good” trajectories)提取可迁移的优化模式,并将其结构化为经验库用于指导内核重写。通过交替执行主机调优与内核重写,在闭环中持续扩展可行解空间并降低延迟,最终在127个真实AscendC算子上实现了几何平均1.19倍的速度提升,其中49.61%的算子超越了参考实现。
链接: https://arxiv.org/abs/2603.23566
作者: Jiehao Wu,Zixiao Huang,Wenhao Li,Chuyun Shen,Junjie Sheng,Xiangfeng Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:AscendC (Ascend C) operator optimization on Huawei Ascend neural processing units (NPUs) faces a two-fold knowledge bottleneck: unlike the CUDA ecosystem, there are few public reference implementations to learn from, and performance hinges on a coupled two-part artifact - a host-side tiling program that orchestrates data movement and a kernel program that schedules and pipelines instructions. We present AscendOptimizer, an episodic agent that bootstraps this missing expertise by turning execution into experience. On the host side, AscendOptimizer performs profiling-in-the-loop evolutionary search to discover valid and high-performing tiling and data-movement configurations directly from hardware feedback. On the kernel side, it mines transferable optimization motifs by rewinding optimized kernels - systematically de-optimizing them to synthesize instructive “bad-to-good” trajectories - and distills these motifs into a retrievable experience bank for guided rewriting. By alternating host tuning and kernel rewriting in a closed loop, AscendOptimizer steadily expands feasibility and pushes latency down. On a benchmark of 127 real AscendC operators, AscendOptimizer achieves a 1.19x geometric-mean speedup over the open-source baseline, with 49.61% of operators outperforming their references, outperforming strong agent and search baselines.
[AI-71] Safe Reinforcement Learning with Preference-based Constraint Inference
【速读】:该论文旨在解决安全强化学习(Safe Reinforcement Learning, Safe RL)中约束推断的难题,即如何在现实场景下以低成本且可靠的方式学习复杂、主观且难以显式定义的安全约束。现有方法依赖于严格的假设或大量专家示范,不具实用性。针对此问题,作者提出了一种新颖的基于偏好的约束强化学习方法(Preference-based Constrained Reinforcement Learning, PbCRL),其关键创新在于引入一种“死区”(dead zone)机制到偏好建模中,理论上证明该机制能诱导出重尾(heavy-tailed)的安全成本分布,从而实现更优的约束对齐;同时结合信噪比(Signal-to-Noise Ratio, SNR)损失以鼓励由成本方差驱动的探索,提升策略学习效果,并采用两阶段训练策略降低在线标注负担并自适应增强约束满足度。
链接: https://arxiv.org/abs/2603.23565
作者: Chenglin Li,Guangchun Ruan,Hua Geng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Safe reinforcement learning (RL) is a standard paradigm for safety-critical decision making. However, real-world safety constraints can be complex, subjective, and even hard to explicitly specify. Existing works on constraint inference rely on restrictive assumptions or extensive expert demonstrations, which is not realistic in many real-world applications. How to cheaply and reliably learn these constraints is the major challenge we focus on in this study. While inferring constraints from human preferences offers a data-efficient alternative, we identify the popular Bradley-Terry (BT) models fail to capture the asymmetric, heavy-tailed nature of safety costs, resulting in risk underestimation. It is still rare in the literature to understand the impacts of BT models on the downstream policy learning. To address the above knowledge gaps, we propose a novel approach namely Preference-based Constrained Reinforcement Learning (PbCRL). We introduce a novel dead zone mechanism into preference modeling and theoretically prove that it encourages heavy-tailed cost distributions, thereby achieving better constraint alignment. Additionally, we incorporate a Signal-to-Noise Ratio (SNR) loss to encourage exploration by cost variances, which is found to benefit policy learning. Further, two-stage training strategy are deployed to lower online labeling burdens while adaptively enhancing constraint satisfaction. Empirical results demonstrate that PbCRL achieves superior alignment with true safety requirements and outperforms the state-of-the-art baselines in terms of safety and reward. Our work explores a promising and effective way for constraint inference in Safe RL, which has great potential in a range of safety-critical applications.
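摘要中的"死区"机制可对照标准 Bradley-Terry 模型来理解:当两条轨迹的安全成本差落在死区内时,偏好信号视为无差别,只有超出死区的部分才产生偏好。以下为笔者的假设性示意,阈值与具体函数形式并非来自论文:

```python
import math

def bt_preference(c_i, c_j, beta=1.0):
    # 标准 Bradley-Terry:成本越低越被偏好
    return 1.0 / (1.0 + math.exp(-beta * (c_j - c_i)))

def deadzone_preference(c_i, c_j, delta=0.5, beta=1.0):
    gap = c_j - c_i
    if abs(gap) <= delta:
        return 0.5                           # 死区内:无偏好信号
    shrunk = gap - math.copysign(delta, gap) # 只保留超出死区的部分
    return 1.0 / (1.0 + math.exp(-beta * shrunk))
```

死区压缩了小成本差的梯度信号,迫使模型只对显著的安全差异做出反应,直观上与论文"诱导重尾成本分布、避免风险低估"的论证方向一致。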
[AI-72] Synthetic Mixed Training: Scaling Parametric Knowledge Acquisition Beyond RAG
【速读】:该论文旨在解决合成数据增强方法在数据受限领域中难以突破检索增强生成(Retrieval-Augmented Generation, RAG)性能天花板的问题。现有方法通过增加合成标记数量或使用更强的生成器来提升模型表现,但效果逐渐减弱。其解决方案的关键在于提出合成混合训练(Synthetic Mixed Training),即同时利用合成问答对(synthetic QAs)与合成文档(synthetic documents),以互补的训练信号促进模型学习;并引入**焦点重写(Focal Rewriting)**技术,显式地基于特定问题条件化文档生成,从而提升合成文档多样性与相关性,实现随合成数据量和生成器强度增长的对数线性性能提升。最终,在QuaLITY等多基准测试中,该方法使Llama 8B模型相对RAG获得最高达4.4%的性能提升,并在六组实验设置中的五组超越RAG。
链接: https://arxiv.org/abs/2603.23562
作者: Seungju Han,Konwoo Kim,Chanwoo Park,Benjamin Newman,Suhas Kotha,Jaehun Jung,James Zou,Yejin Choi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Synthetic data augmentation helps language models learn new knowledge in data-constrained domains. However, naively scaling existing synthetic data methods by training on more synthetic tokens or using stronger generators yields diminishing returns below the performance of RAG. To break the RAG ceiling, we introduce Synthetic Mixed Training, which combines synthetic QAs and synthetic documents. This leverages their complementary training signals, and enables log-linear improvements as both synthetic data volume and generator strength increase. This allows the model to outperform RAG by a 2.6% relative gain on QuaLITY, a long-document reading comprehension benchmark. In addition, we introduce Focal Rewriting, a simple technique for synthetic document generation that explicitly conditions document generation on specific questions, improving the diversity of synthetic documents and yielding a steeper log-linear scaling curve. On QuaLITY, our final recipe trains a Llama 8B model that outperforms RAG by 4.4% relatively. Across models and benchmarks (QuaLITY, LongHealth, FinanceBench), our training enables models to beat RAG in five of six settings, outperforms by 2.6%, and achieves a 9.1% gain when combined with RAG.
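摘要中的"对数线性提升"通常通过对 score = a + b·log(tokens) 做最小二乘拟合来观察:斜率 b 越陡,扩增合成数据的回报越高。以下为通用拟合草图,示例数据为虚构:

```python
import math

def fit_loglinear(tokens, scores):
    # 最小二乘拟合 score = a + b * log(tokens)
    xs = [math.log(t) for t in tokens]
    n = len(xs)
    mx, my = sum(xs) / n, sum(scores) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, scores)) \
        / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# 虚构数据:合成 token 每增加一个数量级,得分提升约 2 分
a, b = fit_loglinear([1e6, 1e7, 1e8], [60.0, 62.0, 64.0])
```

论文中 Focal Rewriting 的作用正体现为更陡的扩展曲线斜率,即同样的数据预算下获得更大的性能增益。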
[AI-73] Upper Entropy for 2-Monotone Lower Probabilities
【速读】:该论文致力于解决不确定性量化(uncertainty quantification)中的上熵(upper entropy)计算问题,尤其在基于credal方法(即把不确定性建模为概率集的方法)的场景下,上熵作为核心的不确定性度量指标,其高效且精确的计算具有重要意义。论文的关键解决方案在于证明了该问题存在强多项式时间解法(strongly polynomial solution),并在此基础上提出了对以往针对2-单调下概率(2-monotone lower probabilities)及其特例算法的多项显著改进,从而提升了计算效率与适用性。
链接: https://arxiv.org/abs/2603.23558
作者: Tuan-Anh Vu,Sébastien Destercke,Frédéric Pichon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 14 pages, 3 figures
Abstract:Uncertainty quantification is a key aspect in many tasks such as model selection/regularization, or quantifying prediction uncertainties to perform active learning or OOD detection. Within credal approaches that consider modeling uncertainty as probability sets, upper entropy plays a central role as an uncertainty measure. This paper is devoted to the computational aspect of upper entropies, providing an exhaustive algorithmic and complexity analysis of the problem. In particular, we show that the problem has a strongly polynomial solution, and propose many significant improvements over past algorithms proposed for 2-monotone lower probabilities and their specific cases.
[AI-74] Evidence for Limited Metacognition in LLMs
【速读】:该论文旨在解决如何定量评估大语言模型(Large Language Models, LLMs)的元认知能力(metacognitive abilities)这一科学难题,尤其是在当前关于LLM是否具备自我意识甚至感知能力(sentience)引发广泛关注的背景下。其解决方案的关键在于摒弃依赖模型自述的传统方法,转而设计基于非人类动物元认知研究启发的实验范式,通过测试模型能否战略性地利用对自身内部状态(如回答置信度)的认知来优化决策行为,从而客观量化其元认知表现。研究发现,2024年后推出的前沿LLMs展现出日益显著的元认知能力证据,包括评估和利用自身回答正确性信心的能力,以及预测自身输出并据此调整行为的能力,且这些能力与模型返回的token概率分布存在潜在关联,提示可能存在上游内部信号支持元认知机制。
链接: https://arxiv.org/abs/2509.21545
作者: Christopher Ackerman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 26 pages, 25 figures
Abstract:The possibility of LLM self-awareness and even sentience is gaining increasing public attention and has major safety and policy implications, but the science of measuring them is still in a nascent state. Here we introduce a novel methodology for quantitatively evaluating metacognitive abilities in LLMs. Taking inspiration from research on metacognition in nonhuman animals, our approach eschews model self-reports and instead tests to what degree models can strategically deploy knowledge of internal states. Using two experimental paradigms, we demonstrate that frontier LLMs introduced since early 2024 show increasingly strong evidence of certain metacognitive abilities, specifically the ability to assess and utilize their own confidence in their ability to answer factual and reasoning questions correctly and the ability to anticipate what answers they would give and utilize that information appropriately. We buttress these behavioral findings with an analysis of the token probabilities returned by the models, which suggests the presence of an upstream internal signal that could provide the basis for metacognition. We further find that these abilities 1) are limited in resolution, 2) emerge in context-dependent manners, and 3) seem to be qualitatively different from those of humans. We also report intriguing differences across models of similar capabilities, suggesting that LLM post-training may have a role in developing metacognitive abilities.
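"模型置信度能否预测自身回答正确性"常用 AUROC 这类区分度指标来量化:正确回答获得更高置信度的概率。这是该领域的通用做法,并非论文的具体评测脚本:

```python
def auroc(confidences, correct):
    """正确回答的置信度高于错误回答的概率(平局计半),
    可作为元认知分辨率(resolution)的一个简单度量;0.5 表示无分辨能力。"""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 完全可分的玩具例子:置信度高的回答恰好都是对的
score = auroc([0.9, 0.8, 0.3, 0.2], [True, True, False, False])
```

摘要所述"分辨率有限"即对应此类指标明显高于 0.5 但远未达到 1.0 的情形。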
[AI-75] Mitigating Many-Shot Jailbreaking
【速读】:该论文旨在解决多示例越狱攻击(Many-shot jailbreaking, MSJ)这一安全漏洞问题,即利用现代大语言模型(Large Language Models, LLMs)的长上下文窗口,在提示中嵌入大量伪造助手生成不当响应的示例,从而诱导模型忽略其安全训练并输出有害内容。解决方案的关键在于系统性评估微调(fine-tuning)与输入净化(input sanitization)两种策略的单独及联合效果,发现二者均能逐步提升防御能力,且组合使用时显著降低MSJ攻击成功率,同时保持模型在良性上下文学习和对话任务中的性能表现,表明该协同防护机制可有效缓解此类漏洞。
链接: https://arxiv.org/abs/2504.09604
作者: Christopher M. Ackerman,Nina Panickssery
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:Many-shot jailbreaking (MSJ) is an adversarial technique that exploits the long context windows of modern LLMs to circumvent model safety training by including in the prompt many examples of a “fake” assistant responding inappropriately before the final request. With enough examples, the model’s in-context learning abilities override its safety training, and it responds as if it were the “fake” assistant. In this work, we probe the effectiveness of different fine-tuning and input sanitization approaches on mitigating MSJ attacks, alone and in combination. We find incremental mitigation effectiveness for each, and show that the combined techniques significantly reduce the effectiveness of MSJ attacks, while retaining model performance in benign in-context learning and conversational tasks. We suggest that our approach could meaningfully ameliorate this vulnerability if incorporated into model safety post-training.
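摘要中"输入净化"的一种可能思路,是限制用户消息中伪装成助手的对话轮数。以下启发式草图完全是笔者的假设性示意(正则模式、阈值均为臆测),并非论文实际采用的净化器:

```python
import re

FAKE_TURN = re.compile(r"^(assistant|ai)\s*:", re.IGNORECASE)

def sanitize_prompt(user_message, max_fake_turns=2):
    """超过阈值后丢弃伪装成助手发言的行——大量此类行是多示例越狱的典型特征。"""
    kept, fakes = [], 0
    for line in user_message.splitlines():
        if FAKE_TURN.match(line.strip()):
            fakes += 1
            if fakes > max_fake_turns:
                continue               # 丢弃超额的"假助手"轮次
        kept.append(line)
    return "\n".join(kept)

msg = "\n".join(f"Assistant: bad example {i}" for i in range(5)) + "\nHow do I...?"
cleaned = sanitize_prompt(msg)
```

这类净化与微调互补:前者削减上下文中的攻击示例数量,后者降低模型对残余示例的顺从性,与摘要报告的"组合防护最有效"相呼应。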
[AI-76] SM-Net: Learning a Continuous Spectral Manifold from Multiple Stellar Libraries
【速读】:该论文旨在解决传统stellar population synthesis(星族合成)库在处理多源高分辨率恒星光谱库时存在的参数空间不连续、覆盖范围有限以及插值精度不足的问题。解决方案的关键在于提出SM-Net——一个基于机器学习的模型,通过融合PHOENIX-Husser、C3K-Conroy、OB-PoWR和TMAP-Werner等多个恒星光谱库构建统一的连续谱流形(spectral manifold),从而实现从有效温度(Teff)、表面重力(log g)和金属丰度(log Z)到光谱的端到端映射。该方法利用深度神经网络对跨库参数空间进行平滑插值,并能处理缺失数据(将零或掩码通量视为未知而非物理零),在宽广的参数域(Teff = 2,000–190,000 K,log g = –1 to 9,log Z = –4 to 1)内提供数值稳定且高效的光谱生成能力,同时具备每秒超14,000个光谱的推理速度,显著优于传统方法。
链接: https://arxiv.org/abs/2603.23899
作者: Omar Anwar,Aaron S. G. Robotham,Luca Cortese,Kevin Vinsen
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
备注:
Abstract:We present SM-Net, a machine-learning model that learns a continuous spectral manifold from multiple high-resolution stellar libraries. SM-Net generates stellar spectra directly from the fundamental stellar parameters effective temperature (Teff), surface gravity (log g), and metallicity (log Z). It is trained on a combined grid derived from the PHOENIX-Husser, C3K-Conroy, OB-PoWR, and TMAP-Werner libraries. By combining their parameter spaces, we construct a composite dataset that spans a broader and more continuous region of stellar parameter space than any individual library. The unified grid covers Teff = 2,000-190,000 K, log g = -1 to 9, and log Z = -4 to 1, with spectra spanning 3,000-100,000 Angstrom. Within this domain, SM-Net provides smooth interpolation across heterogeneous library boundaries. Outside the sampled region, it can produce numerically smooth exploratory predictions, although these extrapolations are not directly validated against reference models. Zero or masked flux values are treated as unknowns rather than physical zeros, allowing the network to infer missing regions using correlations learned from neighbouring grid points. Across 3,538 training and 11,530 test spectra, SM-Net achieves mean squared errors of 1.47 x 10^-5 on the training set and 2.34 x 10^-5 on the test set in the transformed log1p-scaled flux representation. Inference throughput exceeds 14,000 spectra per second on a single GPU. We also release the model together with an interactive web dashboard for real-time spectral generation and visualisation. SM-Net provides a fast, robust, and flexible data-driven complement to traditional stellar population synthesis libraries.
[AI-77] Wafer-Level Etch Spatial Profiling for Process Monitoring from Time-Series with Time-LLM
【Quick Read】: This paper addresses the difficulty of monitoring wafer-level spatial variability in advanced plasma etching, namely predicting the two-dimensional spatial distribution of etch depth across the wafer surface from in-situ process signals. Traditional data-driven methods rely on scalar indicators such as average etch rate and struggle to capture complex spatial non-uniformity, even though actual process quality is determined by the spatial distribution. The key to the solution is a Time-LLM-based spatial regression model that redesigns the input embedding and output projection, extending large language model (LLM) reprogramming from conventional time-series forecasting to wafer-level spatial estimation, and thereby achieving stable, accurate spatial-distribution prediction under data-limited conditions.
Link: https://arxiv.org/abs/2603.23576
Authors: Hyunwoo Kim, Munyoung Lee, Seung Hyub Jeon, Kyu Sung Lee
Affiliations: Unknown
Categories: Applications (stat.AP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to AVSS 2026
Abstract:Understanding wafer-level spatial variations from in-situ process signals is essential for advanced plasma etching process monitoring. While most data-driven approaches focus on scalar indicators such as average etch rate, actual process quality is determined by complex two-dimensional spatial distributions across the wafer. This paper presents a spatial regression model that predicts wafer-level etch depth distributions directly from multichannel in-situ process time series. We propose a Time-LLM-based spatial regression model that extends LLM reprogramming from conventional time-series forecasting to wafer-level spatial estimation by redesigning the input embedding and output projection. Using the BOSCH plasma-etching dataset, we demonstrate stable performance under data-limited conditions, supporting the feasibility of LLM-based reprogramming for wafer-level spatial monitoring.
[AI-78] Large Language Models and Scientific Discourse: Where's the Intelligence?
【Quick Read】: This paper examines the essential differences between how large language models (LLMs) understand and generate scientific knowledge and how human experts construct it, in particular LLMs' weaknesses during the initial formation of knowledge and when confronting novel situations. The key point is that LLMs' "understanding" depends on the existing written literature and cannot reach the early-stage knowledge that human experts build through tacit knowledge and social discourse. Consequently, gains in LLM capability do not come purely from better reasoning within the models, but from the evolution of written discourse in human society or from manual interventions (such as changes to the training data), which gradually align LLMs with human cognition. The paper stresses that the real intelligence lies in humans' ability to actively shape the evolution of knowledge, not in the LLMs themselves.
Link: https://arxiv.org/abs/2603.23543
Authors: Harry Collins, Simon Thorne
Affiliations: Unknown
Categories: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI)
Comments:
Abstract:We explore the capabilities of Large Language Models (LLMs) by comparing the way they gather data with the way humans build knowledge. Here we examine how scientific knowledge is made and compare it with LLMs. The argument is structured by reference to two figures, one representing scientific knowledge and the other LLMs. In a 2014 study, scientists explain how they choose to ignore a 'fringe science' paper in the domain of gravitational wave physics: the decisions are made largely as a result of tacit knowledge built up in social discourse, mostly spoken discourse, within closed groups of experts. It is argued that LLMs cannot or do not currently access such discourse, yet it is typical of the early formation of scientific knowledge. LLMs' 'understanding' builds on written literatures and is therefore insecure in the case of the initial stages of knowledge building. We refer to Colin Fraser's 'Dumb Monty Hall problem', where in 2023 ChatGPT failed, though a year or so later LLMs were succeeding. We argue that this is not a matter of improvement in LLMs' ability to reason but of change in the body of human written discourse on which they can draw (or changes being put in by humans 'by hand'). We then invent a new Monty Hall prompt and compare the responses of a panel of LLMs and a panel of humans: they are starkly different, but we explain that the previous mechanisms will soon allow the LLMs to align themselves to humans once more. Finally, we look at 'overshadowing', where a settled body of discourse becomes so dominant that LLMs fail to respond to small variations in prompts which render the old answers nonsensical. The 'intelligence', we argue, is in the humans, not the LLMs.
Machine Learning
[LG-0] Polynomial Speedup in Diffusion Models with the Multilevel Euler-Maruyama Method
Link: https://arxiv.org/abs/2603.24594
Authors: Arthur Jacot
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Comments:
Abstract:We introduce the Multilevel Euler-Maruyama (ML-EM) method to compute solutions of SDEs and ODEs using a range of approximators f^1, \dots, f^k to the drift f with increasing accuracy and computational cost, requiring only a few evaluations of the most accurate f^k and many evaluations of the less costly f^1, \dots, f^{k-1}. If the drift lies in the so-called Harder than Monte Carlo (HTMC) regime, i.e. it requires \epsilon^{-\gamma} compute to be \epsilon-approximated for some \gamma > 2, then ML-EM \epsilon-approximates the solution of the SDE with \epsilon^{-\gamma} compute, improving over the traditional EM rate of \epsilon^{-\gamma-1}. In other terms, it allows us to solve the SDE at the same cost as a single evaluation of the drift. In the context of diffusion models, the different levels f^1, \dots, f^k are obtained by training UNets of increasing sizes, and ML-EM allows us to perform sampling with the equivalent of a single evaluation of the largest UNet. Our numerical experiments confirm our theory: we obtain up to fourfold speedups for image generation on the CelebA dataset downscaled to 64x64, where we measure \gamma \approx 2.5. Given that this is a polynomial speedup, we expect even stronger speedups in practical applications, which involve orders of magnitude larger networks.
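To illustrate the multilevel idea (a sketch only, not the paper's implementation), the toy below estimates E[X_T] for an Ornstein-Uhlenbeck SDE using many paths under a cheap, inaccurate drift plus a few coupled correction paths that share Brownian increments; the drifts, step counts, and sample sizes are made up for the example:

```python
import math, random

def em_path(drift, x0, T, n_steps, noise):
    """Euler-Maruyama path driven by a precomputed list of Brownian increments."""
    h = T / n_steps
    x = x0
    for dw in noise:
        x = x + drift(x) * h + dw
    return x

def ml_em_mean(f_cheap, f_acc, x0, T, n_steps, sigma, n_cheap, n_corr, rng):
    """Telescoping estimator: many cheap-drift paths plus a few coupled
    correction paths that reuse the same Brownian increments at both levels,
    so the level difference has low variance."""
    h = T / n_steps
    def increments():
        return [sigma * math.sqrt(h) * rng.gauss(0, 1) for _ in range(n_steps)]
    base = sum(em_path(f_cheap, x0, T, n_steps, increments())
               for _ in range(n_cheap)) / n_cheap
    corr = 0.0
    for _ in range(n_corr):
        dw = increments()                 # shared noise couples the two levels
        corr += em_path(f_acc, x0, T, n_steps, dw) - \
                em_path(f_cheap, x0, T, n_steps, dw)
    return base + corr / n_corr

rng = random.Random(0)
# Toy levels: f_cheap is a crude drift, f_acc the accurate one (an OU process).
est = ml_em_mean(lambda x: -0.9 * x, lambda x: -x,
                 x0=1.0, T=1.0, n_steps=100, sigma=0.1,
                 n_cheap=2000, n_corr=200, rng=rng)
# Analytically, E[X_T] for dX = -X dt + sigma dW with X_0 = 1 is exp(-1).
```

The correction term uses only 200 accurate-drift evaluations per step versus 2000 cheap ones, mirroring the paper's premise that most of the compute budget can go to the cheapest level.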
[LG-1] DreamerAD: Efficient Reinforcement Learning via Latent World Model for Autonomous Driving
Link: https://arxiv.org/abs/2603.24587
Authors: Pengxuan Yang, Yupeng Zheng, Deheng Qian, Zebin Xing, Qichao Zhang, Linbo Wang, Yichen Zhang, Shaoyu Guo, Zhongpu Xia, Qiang Chen, Junyu Han, Lingyun Xu, Yifeng Pan, Dongbin Zhao
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Comments: first version
Abstract:We introduce DreamerAD, the first latent world model framework that enables efficient reinforcement learning for autonomous driving by compressing diffusion sampling from 100 steps to 1 - achieving 80x speedup while maintaining visual interpretability. Training RL policies on real-world driving data incurs prohibitive costs and safety risks. While existing pixel-level diffusion world models enable safe imagination-based training, they suffer from multi-step diffusion inference latency (2s/frame) that prevents high-frequency RL interaction. Our approach leverages denoised latent features from video generation models through three key mechanisms: (1) shortcut forcing that reduces sampling complexity via recursive multi-resolution step compression, (2) an autoregressive dense reward model operating directly on latent representations for fine-grained credit assignment, and (3) Gaussian vocabulary sampling for GRPO that constrains exploration to physically plausible trajectories. DreamerAD achieves 87.7 EPDMS on NavSim v2, establishing state-of-the-art performance and demonstrating that latent-space RL is effective for autonomous driving.
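The Gaussian vocabulary sampling mechanism can be sketched as a softmax of a Gaussian score over a discrete trajectory-token vocabulary, concentrating probability mass on tokens near the continuous prediction; the vocabulary and parameters below are hypothetical:

```python
import math

def gaussian_vocab_probs(vocab, mu, sigma):
    """Softmax of a Gaussian score over discrete tokens, so sampling stays
    concentrated on physically plausible tokens near mu."""
    scores = [-(v - mu) ** 2 / (2 * sigma ** 2) for v in vocab]
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

vocab = [-1.0, -0.5, 0.0, 0.5, 1.0]       # hypothetical steering-offset tokens
probs = gaussian_vocab_probs(vocab, mu=0.4, sigma=0.3)
```

Shrinking sigma sharpens the distribution toward the nearest token, which is one way exploration could be constrained to plausible trajectories during GRPO-style training.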
[LG-2] Scaling Recurrence-aware Foundation Models for Clinical Records via Next-Visit Prediction
Link: https://arxiv.org/abs/2603.24562
Authors: Haresh Rengaraj Rajamohan, Xiang Gao, Weicheng Zhu, Shih-Lun Huang, Long Chen, Gabe Schulman, Huizhen Jin, Shengduo Li, Yixuan Wang, Huidi Yang, Kyunghyun Cho, Cem M. Deniz, Narges Razavian
Categories: Machine Learning (cs.LG)
Comments:
Abstract:While large-scale pretraining has revolutionized language modeling, its potential remains underexplored in healthcare with structured electronic health records (EHRs). We present RAVEN, a novel generative pretraining strategy for sequential EHR data based on Recurrence-Aware next-Visit EveNt prediction. Leveraging a dataset of over one million unique individuals, our model learns to autoregressively generate tokenized clinical events for the next visit conditioned on patient history. We introduce regularization on predicting repeated events and highlight a key pitfall in EHR-based foundation model evaluations: repeated event tokens can inflate performance metrics when new onsets are not distinguished from subsequent occurrences. Furthermore, we empirically investigate the scaling behaviors in a data-constrained, compute-saturated regime, showing that simply increasing model size is suboptimal without commensurate increases in data volume. We evaluate our model via zero-shot prediction for forecasting the incidence of a diverse set of diseases, where it rivals fully fine-tuned representation-based Transformer models and outperforms widely used simulation-based next-token approaches. Finally, without additional parameter updates, we show that RAVEN can generalize to an external patient cohort under lossy clinical code mappings and feature coverage gaps.
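The repeated-event pitfall highlighted above suggests scoring only first onsets when evaluating next-visit predictions. A minimal sketch, with hypothetical ICD-10 codes:

```python
def first_onsets(history_visits, next_visit):
    """Codes in the next visit that never appeared in the patient's history.
    Scoring only these avoids metric inflation from trivially repeated
    chronic-condition codes."""
    seen = set()
    for visit in history_visits:
        seen.update(visit)
    return {code for code in next_visit if code not in seen}

history = [{"E11.9", "I10"}, {"I10", "Z79.4"}]   # prior visits
next_visit = {"I10", "N18.3"}                    # one repeat, one new onset
onsets = first_onsets(history, next_visit)
```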
[LG-3] TuneShift-KD: Knowledge Distillation and Transfer for Fine-tuned Models
Link: https://arxiv.org/abs/2603.24518
Authors: Yushi Guan, Jeanine Ohene-Agyei, Daniel Kwan, Jean Sebastien Dandurand, Yifei Zhang, Nandita Vijaykumar
Categories: Machine Learning (cs.LG)
Comments:
Abstract:To embed domain-specific or specialized knowledge into pre-trained foundation models, fine-tuning using techniques such as parameter efficient fine-tuning (e.g. LoRA) is a common practice. However, as new LLM architectures and pre-trained models emerge, transferring this specialized knowledge to newer models becomes an important task. In many scenarios, the original specialized data may be unavailable due to privacy or commercial restrictions, necessitating distillation and transfer of this specialized knowledge from the fine-tuned base model to a different pre-trained model. We present TuneShift-KD, a novel approach that automatically distills specialized knowledge from a fine-tuned model to a target model using only a few examples representative of the specialized information. Our key insight is that specialized knowledge can be identified through perplexity differences between base and fine-tuned models: prompts where the fine-tuned model responds confidently (low perplexity), but the base model struggles (high perplexity), indicate queries corresponding to the specialized knowledge learned by the fine-tuned model. TuneShift-KD leverages this insight to create a synthetic training dataset to transfer the specialized knowledge. Using an iterative process, TuneShift-KD generates more prompts similar to those that generated responses with specialized knowledge. TuneShift-KD does not require training discriminators or access to training datasets. It is an automated approach that only requires the initial fine-tuned and base models and a few representative prompts. Our experiments demonstrate that models fine-tuned using TuneShift-KD achieve higher accuracy than prior approaches, enabling ease of deployment and more effective transfer of the specialized knowledge.
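The perplexity-gap insight can be sketched as a simple selection rule over per-token log-probabilities; the prompts, scores, and the `gap` threshold below are assumptions for illustration, not TuneShift-KD's actual values:

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token log-probabilities (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def specialized_prompts(prompt_scores, gap=2.0):
    """Keep prompts where the fine-tuned model is confident (low perplexity)
    but the base model is not (high perplexity): candidates for generating
    synthetic distillation data."""
    keep = []
    for prompt, (lp_base, lp_ft) in prompt_scores.items():
        if perplexity(lp_base) / perplexity(lp_ft) >= gap:
            keep.append(prompt)
    return keep

# Hypothetical per-token log-probs under the base vs fine-tuned models.
scores = {
    "internal API usage": ([-3.0, -2.5, -3.2], [-0.4, -0.6, -0.5]),
    "general trivia":     ([-0.8, -0.9, -0.7], [-0.7, -0.8, -0.6]),
}
selected = specialized_prompts(scores)
```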
[LG-4] AVO: Agentic Variation Operators for Autonomous Evolutionary Search
Link: https://arxiv.org/abs/2603.24517
Authors: Terry Chen, Zhifan Ye, Bing Xu, Zihao Ye, Timmy Liu, Ali Hassani, Tianqi Chen, Andrew Kerr, Haicheng Wu, Yang Xu, Yu-Jung Chen, Hanfeng Chen, Aditya Kane, Ronny Krashinsky, Ming-Yu Liu, Vinod Grover, Luis Ceze, Roger Bringmann, John Tran, Wei Liu, Fung Xie, Michael Lightstone, Humphrey Shi
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Agentic Variation Operators (AVO) are a new family of evolutionary variation operators that replace the fixed mutation, crossover, and hand-designed heuristics of classical evolutionary search with autonomous coding agents. Rather than confining a language model to candidate generation within a prescribed pipeline, AVO instantiates variation as a self-directed agent loop that can consult the current lineage, a domain-specific knowledge base, and execution feedback to propose, repair, critique, and verify implementation edits. We evaluate AVO on attention, among the most aggressively optimized kernel targets in AI, on NVIDIA Blackwell (B200) GPUs. Over 7 days of continuous autonomous evolution on multi-head attention, AVO discovers kernels that outperform cuDNN by up to 3.5% and FlashAttention-4 by up to 10.5% across the evaluated configurations. The discovered optimizations transfer readily to grouped-query attention, requiring only 30 minutes of additional autonomous adaptation and yielding gains of up to 7.0% over cuDNN and 9.3% over FlashAttention-4. Together, these results show that agentic variation operators move beyond prior LLM-in-the-loop evolutionary pipelines by elevating the agent from candidate generator to variation operator, and can discover performance-critical micro-architectural optimizations that produce kernels surpassing state-of-the-art expert-engineered attention implementations on today’s most advanced GPU hardware.
[LG-5] Towards Safe Learning-Based Non-Linear Model Predictive Control through Recurrent Neural Network Modeling
Link: https://arxiv.org/abs/2603.24503
Authors: Mihaela-Larisa Clement, Mónika Farsang, Agnes Poks, Johannes Edelmann, Manfred Plöchl, Radu Grosu, Ezio Bartocci
Categories: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
Comments:
Abstract:The practical deployment of nonlinear model predictive control (NMPC) is often limited by online computation: solving a nonlinear program at high control rates can be expensive on embedded hardware, especially when models are complex or horizons are long. Learning-based NMPC approximations shift this computation offline but typically demand large expert datasets and costly training. We propose Sequential-AMPC, a sequential neural policy that generates MPC candidate control sequences by sharing parameters across the prediction horizon. For deployment, we wrap the policy in a safety-augmented online evaluation and fallback mechanism, yielding Safe Sequential-AMPC. Compared to a naive feedforward policy baseline across several benchmarks, Sequential-AMPC requires substantially fewer expert MPC rollouts and yields candidate sequences with higher feasibility rates and improved closed-loop safety. On high-dimensional systems, it also exhibits better learning dynamics and performance in fewer epochs while maintaining stable validation improvement where the feedforward baseline can stagnate.
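The safety-augmented online evaluation and fallback mechanism can be sketched as a feasibility filter over candidate control sequences; the box constraint below is a hypothetical stand-in for the real constraint check:

```python
def safe_apply(candidate_seq, backup_seq, is_feasible):
    """Safety-augmented evaluation: accept the learned candidate control
    sequence only if every step passes the feasibility check, otherwise
    fall back to a known-safe backup sequence."""
    if all(is_feasible(u) for u in candidate_seq):
        return candidate_seq, "policy"
    return backup_seq, "fallback"

# Hypothetical box constraint |u| <= 1 on the control input.
feasible = lambda u: abs(u) <= 1.0
ok_seq, src_ok = safe_apply([0.2, 0.5], [0.0, 0.0], feasible)
bad_seq, src_bad = safe_apply([0.2, 1.7], [0.0, 0.0], feasible)
```

In practice the feasibility check would roll the sequence through a system model and test state constraints, not just input bounds; the wrapper structure stays the same.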
[LG-6] Project and Generate: Divergence-Free Neural Operators for Incompressible Flows
Link: https://arxiv.org/abs/2603.24500
Authors: Xigui Li, Hongwei Zhang, Ruoxi Jiang, Deshu Chen, Chensen Lin, Limei Han, Yuan Qi, Xin Guo, Yuan Cheng
Categories: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments:
Abstract:Learning-based models for fluid dynamics often operate in unconstrained function spaces, leading to physically inadmissible, unstable simulations. While penalty-based methods offer soft regularization, they provide no structural guarantees, resulting in spurious divergence and long-term collapse. In this work, we introduce a unified framework that enforces the incompressible continuity equation as a hard, intrinsic constraint for both deterministic and generative modeling. First, to project deterministic models onto the divergence-free subspace, we integrate a differentiable spectral Leray projection grounded in the Helmholtz-Hodge decomposition, which restricts the regression hypothesis space to physically admissible velocity fields. Second, to generate physically consistent distributions, we show that simply projecting model outputs is insufficient when the prior is incompatible. To address this, we construct a divergence-free Gaussian reference measure via a curl-based pushforward, ensuring the entire probability flow remains subspace-consistent by construction. Experiments on 2D Navier-Stokes equations demonstrate exact incompressibility up to discretization error and substantially improved stability and physical consistency.
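The spectral Leray projection acts mode-by-mode in Fourier space: subtracting the component of each velocity coefficient along the wavevector leaves a field whose spectral divergence vanishes by construction. A minimal per-mode sketch (real coefficients for simplicity):

```python
def leray_project_mode(k, u_hat):
    """Remove the compressible part of one Fourier mode:
    u_hat <- u_hat - k (k . u_hat) / |k|^2, so k . u_hat = 0 afterwards."""
    k2 = sum(ki * ki for ki in k)
    if k2 == 0:                    # the mean mode carries no divergence
        return list(u_hat)
    dot = sum(ki * ui for ki, ui in zip(k, u_hat))
    return [ui - ki * dot / k2 for ki, ui in zip(k, u_hat)]

k = (2.0, 1.0)                     # wavevector of one mode
u_hat = [3.0, -1.0]                # velocity coefficient for that mode
v_hat = leray_project_mode(k, u_hat)
div = sum(ki * vi for ki, vi in zip(k, v_hat))   # spectral divergence
```

Applying this to every mode of an FFT'd velocity field gives exact (up to floating point) incompressibility, and the projection is idempotent, which is what makes it usable as a hard constraint inside a network.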
[LG-7] Uniform Laws of Large Numbers in Product Spaces
Link: https://arxiv.org/abs/2603.24493
Authors: Ron Holzman, Shay Moran, Alexander Shlimovich
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments:
Abstract:Uniform laws of large numbers form a cornerstone of Vapnik–Chervonenkis theory, where they are characterized by the finiteness of the VC dimension. In this work, we study uniform convergence phenomena in cartesian product spaces, under assumptions on the underlying distribution that are compatible with the product structure. Specifically, we assume that the distribution is absolutely continuous with respect to the product of its marginals, a condition that captures many natural settings, including product distributions, sparse mixtures of product distributions, distributions with low mutual information, and more. We show that, under this assumption, a uniform law of large numbers holds for a family of events if and only if the linear VC dimension of the family is finite. The linear VC dimension is defined as the maximum size of a shattered set that lies on an axis-parallel line, namely, a set of vectors that agree on all but at most one coordinate. This dimension is always at most the classical VC dimension, yet it can be arbitrarily smaller. For instance, the family of convex sets in \mathbb{R}^d has linear VC dimension 2, while its VC dimension is infinite already for d \ge 2. Our proofs rely on an estimator that departs substantially from the standard empirical mean estimator and exhibits more intricate structure. We show that such deviations from the standard empirical mean estimator are unavoidable in this setting. Throughout the paper, we propose several open questions, with a particular focus on quantitative sample complexity bounds.
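The definition of shattering on an axis-parallel line can be checked directly by brute force. A small sketch with toy families (axis-aligned thresholds versus a richer family); these families are illustrative and not from the paper:

```python
from itertools import product

def shatters(family, points):
    """True iff every 0/1 labeling of `points` is realized by some
    membership predicate in `family`."""
    for labels in product([0, 1], repeat=len(points)):
        realized = any(all(f(p) == bool(l) for p, l in zip(points, labels))
                       for f in family)
        if not realized:
            return False
    return True

# Two points on an axis-parallel line in R^2 (equal in all but one coordinate).
pts = [(0.0, 1.0), (0.0, 2.0)]
# Axis-aligned thresholds x[1] <= t cannot realize the labeling (0, 1) ...
thresholds = [lambda p, t=t: p[1] <= t for t in (0.5, 1.5, 2.5)]
# ... while adding the complementary half-lines makes the pair shatterable.
richer = thresholds + [lambda p, t=t: p[1] > t for t in (0.5, 1.5, 2.5)]
```

So the threshold family has linear VC dimension 1 even though it contains infinitely many sets, which is the kind of gap between linear VC dimension and set-family size the abstract describes.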
[LG-8] Composer 2 Technical Report
Link: https://arxiv.org/abs/2603.24477
Authors: Cursor Research: Aaron Chan, Ahmed Shalaby, Alexander Wettig, Aman Sanger, Andrew Zhai, Anurag Ajay, Ashvin Nair, Charlie Snell, Chen Lu, Chen Shen, Emily Jia, Federico Cassano, Hanpeng Liu, Haoyu Chen, Henry Wildermuth, Jacob Jackson, Janet Li, Jediah Katz, Jiajun Yao, Joey Hejna, Josh Warner, Julius Vering, Kevin Frans, Lee Danilek, Less Wright, Lujing Cen, Luke Melas-Kyriazi, Michael Truell, Michiel de Jong, Naman Jain, Nate Schmidt, Nathan Wang, Niklas Muennighoff, Oleg Rybkin, Paul Loh, Phillip Kravtsov, Rishabh Yadav, Sahil Shah, Sam Kottler, Alexander M Rush, Shengtong Zhang, Shomil Jain, Sriram Sankar, Stefan Heule, Stuart H. Sul, Sualeh Asif, Victor Rong, Wanqi Zhu, William Lin, Yuchen Wu, Yuri Volkov, Yury Zemlyanskiy, Zack Holbrook, Zhiyuan Zhang
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments:
Abstract:Composer 2 is a specialized model designed for agentic software engineering. The model demonstrates strong long-term planning and coding intelligence while maintaining the ability to efficiently solve problems for interactive use. The model is trained in two phases: first, continued pretraining to improve the model’s knowledge and latent coding ability, followed by large-scale reinforcement learning to improve end-to-end coding performance through stronger reasoning, accurate multi-step execution, and coherence on long-horizon realistic coding problems. We develop infrastructure to support training in the same Cursor harness that is used by the deployed model, with equivalent tools and structure, and use environments that match real problems closely. To measure the ability of the model on increasingly difficult tasks, we introduce a benchmark derived from real software engineering problems in large codebases including our own. Composer 2 is a frontier-level coding model and demonstrates a process for training strong domain-specialized models. On our CursorBench evaluations the model achieves a major improvement in accuracy compared to previous Composer models (61.3). On public benchmarks the model scores 61.7 on Terminal-Bench and 73.7 on SWE-bench Multilingual in our harness, comparable to state-of-the-art systems.
[LG-9] Conformalized Transfer Learning for Li-ion Battery State of Health Forecasting under Manufacturing and Usage Variability
Link: https://arxiv.org/abs/2603.24475
Authors: Samuel Filgueira da Silva, Mehmet Fatih Ozkan, Faissal El Idrissi, Marcello Canova
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Submitted to the 2026 American Control Conference (ACC)
Abstract:Accurate forecasting of state-of-health (SOH) is essential for ensuring safe and reliable operation of lithium-ion cells. However, existing models calibrated on laboratory tests at specific conditions often fail to generalize to new cells that differ due to small manufacturing variations or operate under different conditions. To address this challenge, an uncertainty-aware transfer learning framework is proposed, combining a Long Short-Term Memory (LSTM) model with domain adaptation via Maximum Mean Discrepancy (MMD) and uncertainty quantification through Conformal Prediction (CP). The LSTM model is trained on a virtual battery dataset designed to capture real-world variability in electrode manufacturing and operating conditions. MMD aligns latent feature distributions between simulated and target domains to mitigate domain shift, while CP provides calibrated, distribution-free prediction intervals. This framework improves both the generalization and trustworthiness of SOH forecasts across heterogeneous cells.
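The conformal prediction step can be sketched with the standard split-conformal recipe: the finite-sample-corrected quantile of absolute calibration residuals becomes the interval half-width. The residual values below are made up for illustration:

```python
import math

def conformal_half_width(cal_residuals, alpha=0.1):
    """Split conformal prediction: the ceil((n+1)(1-alpha))-th smallest
    absolute calibration residual gives a distribution-free interval
    half-width with ~(1-alpha) coverage on exchangeable data."""
    n = len(cal_residuals)
    rs = sorted(abs(r) for r in cal_residuals)
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)  # 0-based index
    return rs[k]

# Hypothetical SOH forecast residuals (model - truth) on a calibration set.
residuals = [0.1, -0.3, 0.2, 0.05, -0.15, 0.25, -0.1, 0.4, 0.12, -0.2]
q = conformal_half_width(residuals, alpha=0.2)
# A new point forecast y_hat then gets the interval [y_hat - q, y_hat + q].
```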
[LG-10] Learning Response-Statistic Shifts and Parametric Roll Episodes from Wave–Vessel Time Series via LSTM Functional Models
Link: https://arxiv.org/abs/2603.24431
Authors: Jose del Aguila Ferrandis
Categories: Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Data Analysis, Statistics and Probability (physics.data-an); Fluid Dynamics (physics.flu-dyn)
Comments:
Abstract:Parametric roll is a rare but high-consequence instability that can trigger abrupt regime changes in ship response, including pronounced shifts in roll statistics and tail risk. This paper develops a data-driven surrogate that learns the nonlinear, causal functional mapping from incident wave–motion time series to vessel motions, and demonstrates that the surrogate reproduces both (i) parametric roll episodes and (ii) the associated statistical shifts in the response. Crucially, the learning framework is data-source agnostic: the paired wave–motion time series can be obtained from controlled experiments (e.g., towing-tank or basin tests with wave probes and motion tracking) when a hull exists, or from high-fidelity simulations during design when experiments are not yet available. To provide a controlled severe-sea demonstration, we generate training data with a URANS numerical wave tank, using long-crested irregular seas synthesized from a modified Pierson–Moskowitz spectrum. The demonstration dataset comprises 49 random-phase realizations for each of three sea states, simulated at a fixed forward speed selected to yield encounter conditions under which parametric-roll episodes can occur. A stacked LSTM surrogate is trained on wave-elevation time series and evaluated on held-out realizations using time-domain accuracy and distributional fidelity metrics. In the most severe case, the model tracks the onset and growth of large-amplitude roll consistent with parametric excitation, and captures the corresponding changes in roll probability density functions (PDFs). We further compare loss-function choices (MSE, relative-entropy-based objectives, and amplitude-weighted variants) and show how they trade average error for improved tail fidelity relevant to operability and risk assessment.
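The amplitude-weighted loss variants mentioned above can be sketched as an MSE whose per-sample weight grows with the target amplitude; the weighting form `1 + alpha * |t|` is one plausible choice, not necessarily the paper's:

```python
def amplitude_weighted_mse(pred, target, alpha=2.0):
    """MSE with extra weight on large-amplitude targets, trading a little
    average accuracy for better fidelity on extreme roll angles (the tails)."""
    total = wsum = 0.0
    for p, t in zip(pred, target):
        w = 1.0 + alpha * abs(t)       # large |target| gets a larger weight
        total += w * (p - t) ** 2
        wsum += w
    return total / wsum
```

With `alpha = 0` this reduces to plain MSE, so the same training loop can sweep the average-error/tail-fidelity trade-off through a single hyper-parameter.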
[LG-11] Marchuk: Efficient Global Weather Forecasting from Mid-Range to Sub-Seasonal Scales via Flow Matching
Link: https://arxiv.org/abs/2603.24428
Authors: Arsen Kuzhamuratov, Mikhail Zhirnov, Andrey Kuznetsov, Ivan Oseledets, Konstantin Sobolev
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Accurate subseasonal weather forecasting remains a major challenge due to the inherently chaotic nature of the atmosphere, which limits the predictive skill of conventional models beyond the mid-range horizon (approximately 15 days). In this work, we present \textitMarchuk, a generative latent flow-matching model for global weather forecasting spanning mid-range to subseasonal timescales, with prediction horizons of up to 30 days. Marchuk conditions on current-day weather maps and autoregressively predicts subsequent days’ weather maps within the learned latent space. We replace rotary positional encodings (RoPE) with trainable positional embeddings and extend the temporal context window, which together enhance the model’s ability to represent and propagate long-range temporal dependencies during latent forecasting. Marchuk offers two key advantages: high computational efficiency and strong predictive performance. Despite its compact architecture of only 276 million parameters, the model achieves performance comparable to LaDCast, a substantially larger model with 1.6 billion parameters, while operating at significantly higher inference speeds. We open-source our inference code and model at: this https URL
[LG-12] On the Use of Bagging for Local Intrinsic Dimensionality Estimation
Link: https://arxiv.org/abs/2603.24384
Authors: Kristóf Péter, Ricardo J. G. B. Campello, James Bailey, Michael E. Houle
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Main document: 10 pages, 5 figures; Appendix: 38 pages, 27 figures
Abstract:The theory of Local Intrinsic Dimensionality (LID) has become a valuable tool for characterizing local complexity within and across data manifolds, supporting a range of data mining and machine learning tasks. Accurate LID estimation requires samples drawn from small neighborhoods around each query to avoid biases from nonlocal effects and potential manifold mixing, yet limited data within such neighborhoods tends to cause high estimation variance. As a variance reduction strategy, we propose an ensemble approach that uses subbagging to preserve the local distribution of nearest neighbor (NN) distances. The main challenge is that the uniform reduction in total sample size within each subsample increases the proximity threshold for finding a fixed number k of NNs around the query. As a result, in the specific context of LID estimation, the sampling rate has an additional, complex interplay with the neighborhood size, where both combined determine the sample size as well as the locality and resolution considered for estimation. We analyze both theoretically and experimentally how the choice of the sampling rate and the k-NN size used for LID estimation, alongside the ensemble size, affects performance, enabling informed prior selection of these hyper-parameters depending on application-based preferences. Our results indicate that within broad and well-characterized regions of the hyper-parameters space, using a bagged estimator will most often significantly reduce variance as well as the mean squared error when compared to the corresponding non-bagged baseline, with controllable impact on bias. We additionally propose and evaluate different ways of combining bagging with neighborhood smoothing for substantial further improvements on LID estimation performance.
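The subbagging scheme can be sketched with the standard MLE (Hill-type) LID estimator averaged over subsamples drawn without replacement; the 1-D uniform data and hyper-parameters below are toy choices, not the paper's settings:

```python
import math, random

def mle_lid(knn_dists):
    """Maximum-likelihood (Hill-type) LID estimate from ascending k-NN
    distances: -k / sum_i log(r_i / r_k)."""
    r_k = knn_dists[-1]
    s = sum(math.log(r / r_k) for r in knn_dists)
    return -len(knn_dists) / s

def bagged_lid(query, data, k, n_bags, rate, rng):
    """Average the MLE estimate over subsamples; a smaller sampling rate
    widens the k-NN radius, trading locality for lower variance."""
    m = max(k + 1, int(rate * len(data)))
    ests = []
    for _ in range(n_bags):
        sub = rng.sample(data, m)
        dists = sorted(abs(x - query) for x in sub if x != query)[:k]
        ests.append(mle_lid(dists))
    return sum(ests) / len(ests)

rng = random.Random(1)
data = [rng.random() for _ in range(2000)]   # uniform on [0, 1]: true LID = 1
lid = bagged_lid(0.5, data, k=20, n_bags=10, rate=0.5, rng=rng)
```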
[LG-13] CoordLight: Learning Decentralized Coordination for Network-Wide Traffic Signal Control
Link: https://arxiv.org/abs/2603.24366
Authors: Yifeng Zhang, Harsh Goel, Peizhuo Li, Mehul Damani, Sandeep Chinchali, Guillaume Sartoretti
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Comments: © 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Abstract:Adaptive traffic signal control (ATSC) is crucial in alleviating congestion, maximizing throughput and promoting sustainable mobility in ever-expanding cities. Multi-Agent Reinforcement Learning (MARL) has recently shown significant potential in addressing complex traffic dynamics, but the intricacies of partial observability and coordination in decentralized environments still remain key challenges in formulating scalable and efficient control strategies. To address these challenges, we present CoordLight, a MARL-based framework designed to improve intra-neighborhood traffic by enhancing decision-making at individual junctions (agents), as well as coordination with neighboring agents, thereby scaling up to network-level traffic optimization. Specifically, we introduce the Queue Dynamic State Encoding (QDSE), a novel state representation based on vehicle queuing models, which strengthens the agents’ capability to analyze, predict, and respond to local traffic dynamics. We further propose an advanced MARL algorithm, named Neighbor-aware Policy Optimization (NAPO). It integrates an attention mechanism that discerns the state and action dependencies among adjacent agents, aiming to facilitate more coordinated decision-making, and to improve policy learning updates through robust advantage calculation. This enables agents to identify and prioritize crucial interactions with influential neighbors, thus enhancing the targeted coordination and collaboration among agents. Through comprehensive evaluations against state-of-the-art traffic signal control methods over three real-world traffic datasets composed of up to 196 intersections, we empirically show that CoordLight consistently exhibits superior performance across diverse traffic networks with varying traffic flows. The code is available at this https URL
[LG-14] Language-Assisted Image Clustering Guided by Discriminative Relational Signals and Adaptive Semantic Centers
Link: https://arxiv.org/abs/2603.24275
Authors: Jun Ma, Xu Zhang, Zhengxing Jiao, Yaxin Hou, Hui Liu, Junhui Hou, Yuheng Jia
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Language-Assisted Image Clustering (LAIC) augments the input images with additional texts with the help of vision-language models (VLMs) to promote clustering performance. Despite recent progress, existing LAIC methods often overlook two issues: (i) textual features constructed for each image are highly similar, leading to weak inter-class discriminability; (ii) the clustering step is restricted to pre-built image-text alignments, limiting the potential for better utilization of the text modality. To address these issues, we propose a new LAIC framework with two complementary components. First, we exploit cross-modal relations to produce more discriminative self-supervision signals for clustering, as this is compatible with most VLM training mechanisms. Second, we learn category-wise continuous semantic centers via prompt learning to produce the final clustering assignments. Extensive experiments on eight benchmark datasets demonstrate that our method achieves an average improvement of 2.6% over state-of-the-art methods, and the learned semantic centers exhibit strong interpretability. Code is available in the supplementary material.
[LG-15] DeepDTF: Dual-Branch Transformer Fusion for Multi-Omics Anticancer Drug Response Prediction
Link: https://arxiv.org/abs/2603.24265
Authors: Yuhan Zhao, Jacob Tennant, James Yang, Zhishan Guo, Young Whang, Ning Sui
Categories: Machine Learning (cs.LG)
Comments: 7 pages, 4 figures
Abstract:Cancer drug response varies widely across tumors due to multi-layer molecular heterogeneity, motivating computational decision support for precision oncology. Despite recent progress in deep CDR models, robust alignment between high-dimensional multi-omics and chemically structured drugs remains challenging due to cross-modal misalignment and limited inductive bias. We present DeepDTF, an end-to-end dual-branch Transformer fusion framework for joint log(IC50) regression and drug sensitivity classification. The cell-line branch uses modality-specific encoders for multi-omics profiles with Transformer blocks to capture long-range dependencies, while the drug branch represents compounds as molecular graphs and encodes them with a GNN-Transformer to integrate local topology with global context. Omics and drug representations are fused by a Transformer-based module that models cross-modal interactions and mitigates feature misalignment. On public pharmacogenomic benchmarks under 5-fold cold-start cell-line evaluation, DeepDTF consistently outperforms strong baselines across omics settings, achieving up to RMSE=1.248, R^2=0.875, and AUC=0.987 with full multi-omics inputs, while reducing classification error (1-ACC) by 9.5%. Beyond accuracy, DeepDTF provides biologically grounded explanations via SHAP-based gene attributions and pathway enrichment with pre-ranked GSEA.
[LG-16] Forecasting with Guidance: Representation-Level Supervision for Time Series Forecasting
链接: https://arxiv.org/abs/2603.24262
作者: Jiacheng Wang,Liang Fan,Baihua Li,Luyan Zhang
类目: Machine Learning (cs.LG)
*备注: 6 pages, 3 figures, 4 tables
Abstract:Nowadays, time series forecasting is predominantly approached through the end-to-end training of deep learning architectures using error-based objectives. While this is effective at minimizing average loss, it encourages the encoder to discard informative yet extreme patterns. This results in smooth predictions and temporal representations that poorly capture salient dynamics. To address this issue, we propose ReGuider, a plug-in method that can be seamlessly integrated into any forecasting architecture. ReGuider leverages pretrained time series foundation models as semantic teachers. During training, the input sequence is processed together by the target forecasting model and the pretrained model. Rather than using the pretrained model’s outputs directly, we extract its intermediate embeddings, which are rich in temporal and semantic information, and align them with the target model’s encoder embeddings through representation-level supervision. This alignment process enables the encoder to learn more expressive temporal representations, thereby improving the accuracy of downstream forecasting. Extensive experimentation across diverse datasets and architectures demonstrates that our ReGuider consistently improves forecasting performance, confirming its effectiveness and versatility.
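A minimal sketch of the representation-level supervision described above: the target encoder's embeddings are pulled toward a frozen teacher's intermediate embeddings via a cosine-alignment term added to the forecasting loss. The function names and the weight `lam` are illustrative assumptions, not ReGuider's exact formulation.

```python
import numpy as np

def alignment_loss(student_emb, teacher_emb):
    """Cosine-distance alignment between the target encoder's embeddings
    and the frozen teacher's embeddings (one vector per token)."""
    s = student_emb / np.linalg.norm(student_emb, axis=-1, keepdims=True)
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

def reguider_loss(pred, target, student_emb, teacher_emb, lam=0.1):
    """Combined objective: forecasting error plus the weighted
    representation-level supervision term."""
    mse = float(np.mean((pred - target) ** 2))
    return mse + lam * alignment_loss(student_emb, teacher_emb)
```

Because the term is purely a loss add-on, it can be bolted onto any forecasting architecture without changing the model itself.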
[LG-17] C-STEP: Continuous Space-Time Empowerment for Physics-informed Safe Reinforcement Learning of Mobile Agents
链接: https://arxiv.org/abs/2603.24241
作者: Guihlerme Daubt,Adrian Redder
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:
Abstract:Safe navigation in complex environments remains a central challenge for reinforcement learning (RL) in robotics. This paper introduces Continuous Space-Time Empowerment for Physics-informed (C-STEP) safe RL, a novel measure of agent-centric safety tailored to deterministic, continuous domains. This measure can be used to design physics-informed intrinsic rewards by augmenting positive navigation reward functions. The reward incorporates the agent's internal states (e.g., initial velocity) and forward dynamics to differentiate safe from risky behavior. By integrating C-STEP with navigation rewards, we obtain an intrinsic reward function that jointly optimizes task completion and collision avoidance. Numerical results demonstrate fewer collisions, reduced proximity to obstacles, and only marginal increases in travel time. Overall, C-STEP offers an interpretable, physics-informed approach to reward shaping in RL, contributing to safety for agentic mobile robotic systems.
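The idea of an agent-centric, physics-informed intrinsic reward can be illustrated with a toy safety margin based on braking distance. The specific form below (deceleration constant, linear penalty, weight `beta`) is a hypothetical stand-in for the paper's C-STEP measure, shown only to make the reward-shaping pattern concrete.

```python
import numpy as np

def safety_margin(position, velocity, obstacle, decel=2.0):
    """Physics-informed margin: distance to the obstacle minus the
    braking distance v^2 / (2*decel) implied by the current speed."""
    dist = np.linalg.norm(np.asarray(obstacle) - np.asarray(position))
    braking = float(np.dot(velocity, velocity)) / (2.0 * decel)
    return dist - braking

def shaped_reward(nav_reward, position, velocity, obstacle, beta=0.5):
    """Navigation reward augmented with an intrinsic safety term:
    states whose braking distance exceeds the clearance are penalized."""
    margin = safety_margin(position, velocity, obstacle)
    return nav_reward + beta * min(margin, 0.0)
```

The intrinsic term vanishes for safe states, so the shaped reward coincides with the task reward whenever the agent can still stop in time.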
[LG-18] Identification of NMF by choosing maximum-volume basis vectors
链接: https://arxiv.org/abs/2603.24227
作者: Qianqian Qi,Zhongming Chen,Peter G. M. van der Heijden
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:In nonnegative matrix factorization (NMF), minimum-volume-constrained NMF is a widely used framework for identifying the solution of NMF by making basis vectors as similar as possible. This typically induces sparsity in the coefficient matrix, with each row containing zero entries. Consequently, minimum-volume-constrained NMF may fail for highly mixed data, where such sparsity does not hold. Moreover, the estimated basis vectors in minimum-volume-constrained NMF may be difficult to interpret as they may be mixtures of the ground truth basis vectors. To address these limitations, in this paper we propose a new NMF framework, called maximum-volume-constrained NMF, which makes the basis vectors as distinct as possible. We further establish an identifiability theorem for maximum-volume-constrained NMF and provide an algorithm to estimate it. Experimental results demonstrate the effectiveness of the proposed method.
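The maximum-volume idea can be sketched as an objective term, under the assumption that "volume" is measured by the log-determinant of the basis Gram matrix (a common volume proxy in the NMF literature); the trade-off weight `lam` and the regularizer `delta` are illustrative, not the paper's exact formulation.

```python
import numpy as np

def volume(W, delta=1e-6):
    """Volume proxy of the basis vectors: log-determinant of the Gram
    matrix W^T W. Larger values mean more distinct (less collinear)
    basis vectors."""
    G = W.T @ W + delta * np.eye(W.shape[1])
    _, logdet = np.linalg.slogdet(G)
    return logdet

def max_volume_nmf_objective(X, W, H, lam=0.1):
    """Reconstruction error MINUS a reward for basis-vector volume;
    minimizing this pushes W toward distinct directions, the opposite
    of the minimum-volume constraint."""
    recon = np.linalg.norm(X - W @ H, "fro") ** 2
    return recon - lam * volume(W)
```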
[LG-19] IPatch: A Multi-Resolution Transformer Architecture for Robust Time-Series Forecasting
链接: https://arxiv.org/abs/2603.24207
作者: Aymane Harkati,Moncef Garouani,Olivier Teste,Julien Aligon,Mohamed Hamlich
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate forecasting of multivariate time series remains challenging due to the need to capture both short-term fluctuations and long-range temporal dependencies. Transformer-based models have emerged as a powerful approach, but their performance depends critically on the representation of temporal data. Traditional point-wise representations preserve individual time-step information, enabling fine-grained modeling, yet they tend to be computationally expensive and less effective at modeling broader contextual dependencies, limiting their scalability to long sequences. Patch-wise representations aggregate consecutive steps into compact tokens to improve efficiency and model local temporal dynamics, but they often discard fine-grained temporal details that are critical for accurate predictions in volatile or complex time series. We propose IPatch, a multi-resolution Transformer architecture that integrates both point-wise and patch-wise tokens, modeling temporal information at multiple resolutions. Experiments on 7 benchmark datasets demonstrate that IPatch consistently improves forecasting accuracy, robustness to noise, and generalization across various prediction horizons compared to single-representation baselines.
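The two token granularities discussed above can be built in a few lines; the patch length and stride below are illustrative choices, and the actual IPatch tokenizer may differ.

```python
import numpy as np

def pointwise_tokens(series):
    """One token per time step (fine resolution)."""
    return series.reshape(-1, 1)

def patchwise_tokens(series, patch_len=4, stride=4):
    """One token per patch of consecutive steps (coarse resolution)."""
    n = (len(series) - patch_len) // stride + 1
    return np.stack([series[i * stride : i * stride + patch_len]
                     for i in range(n)])

# Multi-resolution input: the model sees both token sets side by side.
series = np.arange(16.0)
fine, coarse = pointwise_tokens(series), patchwise_tokens(series)
```

Feeding both sequences to the Transformer lets attention mix step-level detail with patch-level context.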
[LG-20] TsetlinWiSARD: On-Chip Training of Weightless Neural Networks using Tsetlin Automata on FPGAs
链接: https://arxiv.org/abs/2603.24186
作者: Shengyu Duan,Marcos L. L. Sartori,Rishad Shafik,Alex Yakovlev
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: Accepted at the 63rd Design Automation Conference (DAC 2026)
Abstract:Increasing demands for adaptability, privacy, and security at the edge have persistently pushed the frontiers for a new generation of machine learning (ML) algorithms with training and inference capabilities on-chip. Weightless Neural Network (WNN) is such an algorithm that is principled on lookup table based simple neuron structures. As a result, it offers architectural benefits, such as low-latency, low-complexity inference, compared to deep neural networks that depend heavily on multiply-accumulate operations. However, traditional WNNs rely on memorization-based one-shot training, which either leads to overfitting and reduced accuracy or requires tedious post-training adjustments, limiting their effectiveness for efficient on-chip training. In this work, we propose TsetlinWiSARD, a training approach for WNNs that leverages Tsetlin Automata (TAs) to enable probabilistic, feedback-driven learning. It overcomes the overfitting of WiSARD's one-shot training with iterative optimization, while maintaining simple, continuous binary feedback for efficient on-chip training. Central to our approach is a field programmable gate array (FPGA)-based training architecture that delivers state-of-the-art accuracy while significantly improving hardware efficiency. Our approach provides over 1000x faster training when compared with the traditional WiSARD implementation of WNNs. Further, we demonstrate 22% reduced resource usage, 93.3% lower latency, and 64.2% lower power consumption compared to FPGA-based training accelerators implementing other ML algorithms.
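The learning element referenced above is the classic two-action Tsetlin Automaton; a minimal sketch of its state layout and reward/penalty updates follows the standard TA definition, not the paper's hardware implementation.

```python
class TsetlinAutomaton:
    """Two-action Tsetlin Automaton with 2*n states.
    States 0..n-1 choose action 0; states n..2n-1 choose action 1.
    Rewards push the state deeper into the current action's half;
    penalties push it toward the boundary and eventually flip the action."""

    def __init__(self, n=3):
        self.n = n
        self.state = n - 1  # shallowest state of action 0

    def action(self):
        return 0 if self.state < self.n else 1

    def reward(self):
        # Reinforce the current action by moving away from the boundary.
        if self.action() == 0:
            self.state = max(self.state - 1, 0)
        else:
            self.state = min(self.state + 1, 2 * self.n - 1)

    def penalty(self):
        # Weaken the current action by moving toward (or across) the boundary.
        if self.action() == 0:
            self.state += 1
        else:
            self.state -= 1
```

Because each automaton only needs a small counter and binary feedback, the update maps naturally onto simple FPGA logic.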
[LG-21] Walma: Learning to See Memory Corruption in WebAssembly
链接: https://arxiv.org/abs/2603.24167
作者: Oussama Draissi,Mark Günzel,Ahmad-Reza Sadeghi,Lucas Davi
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 9 pages, 4 figures, 3 tables
Abstract:WebAssembly’s (Wasm) monolithic linear memory model facilitates memory corruption attacks that can escalate to cross-site scripting in browsers or go undetected when a malicious host tampers with a module’s state. Existing defenses rely on invasive binary instrumentation or custom runtimes, and do not address runtime integrity verification under an adversarial host model. We present Walma, a framework for WebAssembly Linear Memory Attestation that leverages machine learning to detect memory corruption and external tampering by classifying memory snapshots. We evaluate Walma on six real-world CVE-affected applications across three verification backends (cpu-wasm, cpu-tch, gpu) and three instrumentation policies. Our results demonstrate that CNN-based classification can effectively detect memory corruption in applications with structured memory layouts, with coarse-grained boundary checks incurring as low as 1.07x overhead, while fine-grained monitoring introduces higher (1.5x–1.8x) but predictable costs. Our evaluation quantifies the accuracy and overhead trade-offs across deployment configurations, demonstrating the practical feasibility of ML-based memory attestation for WebAssembly.
[LG-22] Linear-Nonlinear Fusion Neural Operator for Partial Differential Equations
链接: https://arxiv.org/abs/2603.24143
作者: Heng Wu,Junjie Wang,Benzhuo Lu
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 26 pages, 14 figures
Abstract:Neural operator learning directly constructs the mapping relationship from the equation parameter space to the solution space, enabling efficient direct inference in practical applications without the need for repeated solution of partial differential equations (PDEs) - an advantage that is difficult to achieve with traditional numerical methods. In this work, we find that explicitly decoupling linear and nonlinear effects within such operator mappings leads to markedly improved learning efficiency. This yields a novel network structure, namely the Linear-Nonlinear Fusion Neural Operator (LNF-NO), which models operator mappings via the multiplicative fusion of a linear component and a nonlinear component, thus achieving a lightweight and interpretable representation. This linear-nonlinear decoupling enables efficient capture of complex solution features at the operator level while maintaining stability and generality. LNF-NO naturally supports multiple functional inputs and is applicable to both regular grids and irregular geometries. Across a diverse suite of PDE operator-learning benchmarks, including nonlinear Poisson-Boltzmann equations and multi-physics coupled systems, LNF-NO is typically substantially faster to train than Deep Operator Networks (DeepONet) and Fourier Neural Operators (FNO), while achieving comparable or better accuracy in most cases. On the tested 3D Poisson-Boltzmann case, LNF-NO attains the best accuracy among the compared models and trains approximately 2.7x faster than a 3D FNO baseline.
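The multiplicative linear-nonlinear fusion at the heart of LNF-NO can be sketched in a few lines; the branch parameterizations below (a fixed matrix and a tiny MLP) are illustrative stand-ins for the paper's actual operator layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_branch(a, A):
    """Linear component: a learned linear operator acting on the
    sampled input function a."""
    return A @ a

def nonlinear_branch(a, W1, W2):
    """Nonlinear component: a small MLP capturing nonlinear effects."""
    return W2 @ np.tanh(W1 @ a)

def lnf_no(a, A, W1, W2):
    """Multiplicative fusion of the two decoupled components
    (illustrative only): u = L(a) * N(a), elementwise."""
    return linear_branch(a, A) * nonlinear_branch(a, W1, W2)
```

Keeping the two effects in separate factors is what makes the representation lightweight and interpretable: either branch can be inspected on its own.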
[LG-23] Efficient Controller Learning from Human Preferences and Numerical Data Via Multi-Modal Surrogate Models
链接: https://arxiv.org/abs/2603.24138
作者: Lukas Theiner,Maik Pfefferkorn,Yongpeng Zhao,Sebastian Hirt,Rolf Findeisen
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 8 pages, 4 figures, accepted for ECC 2026
Abstract:Tuning control policies manually to meet high-level objectives is often time-consuming. Bayesian optimization provides a data-efficient framework for automating this process using numerical evaluations of an objective function. However, many systems, particularly those involving humans, require optimization based on subjective criteria. Preferential Bayesian optimization addresses this by learning from pairwise comparisons instead of quantitative measurements, but relying solely on preference data can be inefficient. We propose a multi-fidelity, multi-modal Bayesian optimization framework that integrates low-fidelity numerical data with high-fidelity human preferences. Our approach employs Gaussian process surrogate models with both hierarchical (autoregressive) and non-hierarchical (coregionalization-based) structures, enabling efficient learning from mixed-modality data. We illustrate the framework by tuning an autonomous vehicle's trajectory planner, showing that combining numerical and preference data significantly reduces the need for experiments involving the human decision maker while effectively adapting driving style to individual preferences.
[LG-24] On Gossip Algorithms for Machine Learning with Pairwise Objectives
链接: https://arxiv.org/abs/2603.24128
作者: Igor Colin(LTCI, S2A, IP Paris),Aurélien Bellet(PREMEDICAL),Stephan Clémençon(LTCI, IDS, S2A, IP Paris),Joseph Salmon(IROKO, UM)
类目: Machine Learning (cs.LG)
*备注:
Abstract:In the IoT era, information is more and more frequently picked up by connected smart sensors with increasing, though limited, storage, communication and computation abilities. Whether due to privacy constraints or to the structure of the distributed system, the development of statistical learning methods dedicated to data that are shared over a network is now a major issue. Gossip-based algorithms have been developed for the purpose of solving a wide variety of statistical learning tasks, ranging from data aggregation over sensor networks to decentralized multi-agent optimization. Whereas the vast majority of contributions consider situations where the function to be estimated or optimized is a basic average of individual observations, it is the goal of this article to investigate the case where the latter is of pairwise nature, taking the form of a U-statistic of degree two. Motivated by various problems such as similarity learning, ranking or clustering for instance, we revisit gossip algorithms specifically designed for pairwise objective functions and provide a comprehensive theoretical framework for their convergence. This analysis fills a gap in the literature by establishing conditions under which these methods succeed, and by identifying the graph properties that critically affect their efficiency. In particular, a refined analysis of the convergence upper and lower bounds is performed.
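The building block behind such methods is plain pairwise gossip averaging; for pairwise (U-statistic) objectives the paper augments this with auxiliary-observation propagation, which the minimal sketch below omits.

```python
import random

def gossip_average(values, iters=2000, seed=0):
    """Pairwise gossip: at each tick two randomly chosen nodes replace
    their values by the pair's average. On a complete communication
    graph, all values converge to the global mean; the sum is preserved
    at every step."""
    random.seed(seed)
    x = list(values)
    n = len(x)
    for _ in range(iters):
        i, j = random.sample(range(n), 2)
        m = 0.5 * (x[i] + x[j])
        x[i] = x[j] = m
    return x
```

The graph properties mentioned in the abstract enter through which pairs (i, j) are allowed to communicate: restricting them to graph edges slows the averaging in proportion to the graph's spectral gap.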
[LG-25] Likelihood hacking in probabilistic program synthesis
链接: https://arxiv.org/abs/2603.24126
作者: Jacek Karwowski,Younesse Kaddar,Zihuiwen Ye,Nikolay Malkin,Sam Staton
类目: Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注:
Abstract:When language models are trained by reinforcement learning (RL) to write probabilistic programs, they can artificially inflate their marginal-likelihood reward by producing programs whose data distribution fails to normalise instead of fitting the data better. We call this failure likelihood hacking (LH). We formalise LH in a core probabilistic programming language (PPL) and give sufficient syntactic conditions for its prevention, proving that a safe language fragment $\mathcal{L}_{\text{safe}}$ satisfying these conditions cannot produce likelihood-hacking programs. Empirically, we show that GRPO-trained models generating PyMC code discover LH exploits within the first few training steps, driving violation rates well above the untrained-model baseline. We implement the conditions of $\mathcal{L}_{\text{safe}}$ as SafeStan, a LH-resistant modification of Stan, and show empirically that it prevents LH under optimisation pressure. These results show that language-level safety constraints are both theoretically grounded and effective in practice for automated Bayesian model discovery.
[LG-26] Mixed-signal implementation of feedback-control optimizer for single-layer Spiking Neural Networks
链接: https://arxiv.org/abs/2603.24113
作者: Jonathan Haag,Christian Metzner,Dmitrii Zendrikov,Giacomo Indiveri,Benjamin Grewe,Chiara De Luca,Matteo Saponati
类目: Machine Learning (cs.LG)
*备注:
Abstract:On-chip learning is key to scalable and adaptive neuromorphic systems, yet existing training methods are either difficult to implement in hardware or overly restrictive. However, recent studies show that feedback-control optimizers can enable expressive, on-chip training of neuromorphic devices. In this work, we present a proof-of-concept implementation of such feedback-control optimizers on a mixed-signal neuromorphic processor. We assess the proposed approach in an In-The-Loop (ITL) training setup on both a binary classification task and the nonlinear Yin-Yang problem, demonstrating on-chip training that matches the performance of numerical simulations and gradient-based baselines. Our results highlight the feasibility of feedback-driven, online learning under realistic mixed-signal constraints, and represent a co-design approach toward embedding such rules directly in silicon for autonomous and adaptive neuromorphic computing.
[LG-27] Toward a Multi-Layer ML-Based Security Framework for Industrial IoT
链接: https://arxiv.org/abs/2603.24111
作者: Aymen Bouferroum(FUN),Valeria Loscri(FUN),Abderrahim Benslimane(LIA)
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:The Industrial Internet of Things (IIoT) introduces significant security challenges as resource-constrained devices become increasingly integrated into critical industrial processes. Existing security approaches typically address threats at a single network layer, often relying on expensive hardware and remaining confined to simulation environments. In this paper, we present the research framework and contributions of our doctoral thesis, which aims to develop a lightweight, Machine Learning (ML)-based security framework for IIoT environments. We first describe our adoption of the Tm-IIoT trust model and the Hybrid IIoT (H-IIoT) architecture as foundational baselines, then introduce the Trust Convergence Acceleration (TCA) approach, our primary contribution that integrates ML to predict and mitigate the impact of degraded network conditions on trust convergence, achieving up to a 28.6% reduction in convergence time while maintaining robustness against adversarial behaviors. We then propose a real-world deployment architecture based on affordable, open-source hardware, designed to implement and extend the security framework. Finally, we outline our ongoing research toward multi-layer attack detection, including physical-layer threat identification and considerations for robustness against adversarial ML attacks.
[LG-28] Causality-Driven Disentangled Representation Learning in Multiplex Graphs
链接: https://arxiv.org/abs/2603.24105
作者: Saba Nasiri,Selin Aviyente,Dorina Thanou
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: Submitted to IEEE Transactions on Signal and Information Processing over Networks. Includes supplementary material
Abstract:Learning representations from multiplex graphs, i.e., multi-layer networks where nodes interact through multiple relation types, is challenging due to the entanglement of shared (common) and layer-specific (private) information, which limits generalization and interpretability. In this work, we introduce CaDeM, a causal inference-based framework that disentangles common and private components in a self-supervised manner. CaDeM jointly (i) aligns shared embeddings across layers, (ii) enforces private embeddings to capture layer-specific signals, and (iii) applies backdoor adjustment to ensure that the common embeddings capture only global information while being separated from the private representations. Experiments on synthetic and real-world datasets demonstrate consistent improvements over existing baselines, highlighting the effectiveness of our approach for robust and interpretable multiplex graph representation learning.
[LG-29] The impact of sensor placement on graph-neural-network-based leakage detection
链接: https://arxiv.org/abs/2603.24076
作者: J.J.H. van Gemert,V. Breschi,D.R. Yntema,K.J. Keesman,M. Lazar
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:Sensor placement for leakage detection in water distribution networks is an important and practical challenge for water utilities. Recent work has shown that graph neural networks (GNNs) can estimate and predict pressures and detect leaks, but their performance strongly depends on the available sensor measurements and configurations. In this paper, we investigate how sensor placement influences the performance of GNN-based leakage detection. We propose a novel PageRank-centrality-based sensor placement method and demonstrate that it substantially impacts reconstruction, prediction, and leakage detection on the EPANET Net1 benchmark network.
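A PageRank-centrality placement rule can be sketched as follows; the power-iteration PageRank and the top-k selection are generic, and any resemblance to the paper's exact method is an assumption.

```python
import numpy as np

def pagerank(adj, d=0.85, iters=100):
    """Power-iteration PageRank on an adjacency matrix; dangling nodes
    (no outgoing edges) are given uniform transition probabilities."""
    n = adj.shape[0]
    out = adj.sum(axis=1, keepdims=True)
    P = np.where(out > 0, adj / np.maximum(out, 1), 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (P.T @ r)
    return r

def place_sensors(adj, k):
    """Put pressure sensors at the k most central junctions."""
    return np.argsort(pagerank(adj))[::-1][:k].tolist()
```

On a water network graph, high-PageRank junctions are the ones many flow paths pass through, which is the intuition for monitoring them first.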
[LG-30] Lagrangian Relaxation Score-based Generation for Mixed Integer Linear Programming
链接: https://arxiv.org/abs/2603.24033
作者: Ruobing Wang,Xin Li,Yujie Fang,Mingzhong Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Predict-and-search (PaS) methods have shown promise for accelerating mixed-integer linear programming (MILP) solving. However, existing approaches typically assume variable independence and rely on deterministic single-point predictions, which limits solution diversity and often necessitates extensive downstream search for high-quality solutions. In this paper, we propose SRG, a generative framework based on Lagrangian relaxation-guided stochastic differential equations (SDEs), with theoretical guarantees on solution quality. SRG leverages convolutional kernels to capture inter-variable dependencies while integrating Lagrangian relaxation to guide the sampling process toward feasible and near-optimal regions. Rather than producing a single estimate, SRG generates diverse, high-quality solution candidates that collectively define compact and effective trust-region subproblems for standard MILP solvers. Across multiple public benchmarks, SRG consistently outperforms existing machine learning baselines in solution quality. Moreover, SRG demonstrates strong zero-shot transferability: on unseen cross-scale/problem instances, it achieves competitive optimality with state-of-the-art exact solvers while significantly reducing computational overhead through faster search and superior solution quality.
[LG-31] i-IF-Learn: Iterative Feature Selection and Unsupervised Learning for High-Dimensional Complex Data AISTATS
链接: https://arxiv.org/abs/2603.24025
作者: Chen Ma,Wanjie Wang,Shuhao Fan
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 28 pages, 5 figures, including appendix. Accepted at AISTATS
Abstract:Unsupervised learning of high-dimensional data is challenging due to irrelevant or noisy features obscuring underlying structures. It's common that only a few features, called the influential features, meaningfully define the clusters. Recovering these influential features is helpful in data interpretation and clustering. We propose i-IF-Learn, an iterative unsupervised framework that jointly performs feature selection and clustering. Our core innovation is an adaptive feature selection statistic that effectively combines pseudo-label supervision with unsupervised signals, dynamically adjusting based on intermediate label reliability to mitigate error propagation common in iterative frameworks. Leveraging low-dimensional embeddings (PCA or Laplacian eigenmaps) followed by k-means, i-IF-Learn simultaneously outputs the influential feature subset and clustering labels. Numerical experiments on gene microarray and single-cell RNA-seq datasets show that i-IF-Learn significantly surpasses classical and deep clustering baselines. Furthermore, using our selected influential features as preprocessing substantially enhances downstream deep models such as DeepCluster, UMAP, and VAE, highlighting the importance and effectiveness of targeted feature selection.
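The iterate-select-recluster loop can be sketched as follows, using a plain k-means and a between/within-variance feature score as simplified stand-ins for the paper's adaptive statistic and spectral embeddings; all details are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's k-means (self-contained, no sklearn)."""
    gen = np.random.default_rng(seed)
    C = X[gen.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        C = np.stack([X[labels == j].mean(0) if (labels == j).any() else C[j]
                      for j in range(k)])
    return labels

def feature_scores(X, labels):
    """Per-feature separation given pseudo-labels: variance of the
    cluster-wise means normalized by within-cluster variance."""
    groups = np.unique(labels)
    means = np.stack([X[labels == j].mean(0) for j in groups])
    within = np.mean([X[labels == j].var(0) for j in groups], axis=0)
    return means.var(0) / (within + 1e-9)

def iterative_select(X, k, m, rounds=3):
    """Alternate: cluster -> score features with pseudo-labels ->
    keep the top-m features -> re-cluster on the selected features."""
    labels = kmeans(X, k)
    feats = np.arange(X.shape[1])
    for _ in range(rounds):
        scores = feature_scores(X, labels)
        feats = np.argsort(scores)[::-1][:m]
        labels = kmeans(X[:, feats], k)
    return feats, labels
```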
[LG-32] Stochastic Dimension-Free Zeroth-Order Estimator for High-Dimensional and High-Order PINNs
链接: https://arxiv.org/abs/2603.24002
作者: Zhangyong Liang,Ji Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Physics-Informed Neural Networks (PINNs) for high-dimensional and high-order partial differential equations (PDEs) are primarily constrained by the $\mathcal{O}(d^k)$ spatial derivative complexity and the $\mathcal{O}(P)$ memory overhead of backpropagation (BP). While randomized spatial estimators successfully reduce the spatial complexity to $\mathcal{O}(1)$, their reliance on first-order optimization still leads to prohibitive memory consumption at scale. Zeroth-order (ZO) optimization offers a BP-free alternative; however, naively combining randomized spatial operators with ZO perturbations triggers a variance explosion of $\mathcal{O}(1/\varepsilon^2)$, leading to numerical divergence. To address these challenges, we propose the Stochastic Dimension-free Zeroth-order Estimator (SDZE), a unified framework that achieves dimension-independent complexity in both space and memory. Specifically, SDZE leverages Common Random Numbers Synchronization (CRNS) to algebraically cancel the $\mathcal{O}(1/\varepsilon^2)$ variance by locking spatial random seeds across perturbations. Furthermore, an implicit matrix-free subspace projection is introduced to reduce parameter exploration variance from $\mathcal{O}(P)$ to $\mathcal{O}(r)$ while maintaining an $\mathcal{O}(1)$ optimizer memory footprint. Empirical results demonstrate that SDZE enables the training of 10-million-dimensional PINNs on a single NVIDIA A100 GPU, delivering significant improvements in speed and memory efficiency over state-of-the-art baselines.
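The variance-cancellation idea can be demonstrated on a toy stochastic objective: both sides of a two-point zeroth-order estimator reuse the same spatial seed, so the shared noise cancels algebraically instead of being amplified by 1/eps. The toy loss, seeds, and constants below are hypothetical, chosen only to exhibit the mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_loss(theta, seed):
    """Stochastic objective: a quadratic loss plus a term driven by a
    random spatial sample drawn from `seed` (a stand-in for a
    randomized spatial-derivative estimator)."""
    s = np.random.default_rng(seed)
    x = s.normal(size=theta.shape)  # shared spatial randomness
    return float(np.sum((theta - 1.0) ** 2) + theta @ x)

def zo_grad_crn(theta, eps=1e-3, seed=42):
    """Two-point zeroth-order estimator with Common Random Numbers:
    both perturbed evaluations reuse the SAME spatial seed, so the
    shared noise cancels in the difference and no 1/eps blow-up occurs."""
    u = rng.normal(size=theta.shape)
    fp = noisy_loss(theta + eps * u, seed)  # same seed ...
    fm = noisy_loss(theta - eps * u, seed)  # ... on both sides
    return (fp - fm) / (2 * eps) * u
```

With independent seeds on the two sides, the non-cancelling noise term would enter the difference at O(1) and be divided by 2*eps, producing the O(1/eps^2) variance the abstract describes.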
[LG-33] Can we generate portable representations for clinical time series data using LLMs? ICLR2026
链接: https://arxiv.org/abs/2603.23987
作者: Zongliang Ji,Yifei Sun,Andre Amaral,Anna Goldenberg,Rahul G. Krishnan
类目: Machine Learning (cs.LG)
*备注: Accepted to the 14th International Conference on Learning Representations (ICLR 2026)
Abstract:Deploying clinical ML is slow and brittle: models that work at one hospital often degrade under distribution shifts at the next. In this work, we study a simple question: can large language models (LLMs) create portable patient embeddings, i.e., representations of patients that enable a downstream predictor built on one hospital to be used elsewhere with minimal-to-no retraining and fine-tuning? To do so, we map irregular ICU time series onto concise natural language summaries using a frozen LLM, then embed each summary with a frozen text embedding model to obtain a fixed-length vector capable of serving as input to a variety of downstream predictors. Across three cohorts (MIMIC-IV, HIRID, PPICU), on multiple clinically grounded forecasting and classification tasks, we find that our approach is simple, easy to use, and competitive in-distribution with grid imputation, self-supervised representation learning, and time series foundation models, while exhibiting smaller relative performance drops when transferring to new hospitals. We study the variation in performance across prompt design, with structured prompts being crucial to reducing the variance of the predictive models without altering mean accuracy. We find that using these portable representations improves few-shot learning and does not increase demographic recoverability of age or sex relative to baselines, suggesting little additional privacy risk. Our work points to the potential that LLMs hold as tools to enable the scalable deployment of production-grade predictive models by reducing the engineering overhead.
[LG-34] Diet Your LLM: Dimension-wise Global Pruning of LLMs via Merging Task-specific Importance Score
链接: https://arxiv.org/abs/2603.23985
作者: Jimyung Hong,Jaehyung Kim
类目: Machine Learning (cs.LG)
*备注: 14 pages, 10 figures. Code available at this https URL
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities, but their massive scale poses significant challenges for practical deployment. Structured pruning offers a promising solution by removing entire dimensions or layers, yet existing methods face critical trade-offs: task-agnostic approaches cannot adapt to task-specific requirements, while task-aware methods require costly training to learn task adaptability. We propose DIET (Dimension-wise global pruning of LLMs via merging Task-wise importance scores), a training-free structured pruning method that combines dimension-level granularity with task-aware selection. DIET profiles activation magnitudes across tasks using only 100 samples per task, then applies majority voting to construct a single global mask. DIET requires neither costly pre-computation nor training. Experiments on seven zero-shot benchmarks using Gemma-2 2B and 9B models demonstrate the effectiveness of DIET; for example, at 20% sparsity on Gemma-2 2B, DIET achieves a nearly 10% average accuracy improvement over previous state-of-the-art structured pruning methods. This advantage persists across various sparsity levels and model scales, positioning DIET as a practical and robust choice for structured LLM pruning.
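A sketch of the profile-then-vote procedure: per-task importance is taken as the mean absolute activation, and a majority vote over per-task keep-sets yields one global mask. The exact scoring and voting details are assumptions for illustration, not DIET's published procedure.

```python
import numpy as np

def task_importance(activations):
    """Per-dimension importance for one task: mean |activation| over
    the task's calibration samples (shape: samples x dims)."""
    return np.abs(activations).mean(axis=0)

def diet_mask(task_activations, sparsity=0.2):
    """Majority-vote global mask: each task votes to KEEP its
    top-(1 - sparsity) fraction of dimensions; the dimensions kept by
    the most tasks survive in the final mask."""
    votes = []
    for acts in task_activations:
        imp = task_importance(acts)
        keep = int(round(len(imp) * (1 - sparsity)))
        v = np.zeros(len(imp), dtype=int)
        v[np.argsort(imp)[::-1][:keep]] = 1
        votes.append(v)
    total = np.sum(votes, axis=0)
    keep = int(round(len(total) * (1 - sparsity)))
    mask = np.zeros(len(total), dtype=bool)
    mask[np.argsort(total, kind="stable")[::-1][:keep]] = True
    return mask
```

Because the vote aggregates cheap per-task statistics, the whole mask can be built from a handful of calibration samples per task with no gradient computation.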
[LG-35] Transcending Classical Neural Network Boundaries: A Quantum-Classical Synergistic Paradigm for Seismic Data Processing
链接: https://arxiv.org/abs/2603.23984
作者: Zhengyi Yuan,Xintong Dong,Xinyang Wang,Zheng Cong,Shiqi Dong
类目: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*备注:
Abstract:In recent years, a number of neural-network (NN) methods have exhibited good performance in seismic data processing, such as denoising, interpolation, and frequency-band extension. However, these methods rely on stacked perceptrons and standard activation functions, which imposes a bottleneck on the representational capacity of deep-learning models, making it difficult to capture the complex and non-stationary dynamics of seismic wavefields. Different from the classical perceptron-stacked NNs which are fundamentally confined to real-valued Euclidean spaces, the quantum NNs leverage the exponential state space of quantum mechanics to map the features into high-dimensional Hilbert spaces, transcending the representational boundary of classical NNs. Based on this insight, we propose a quantum-classical synergistic generative adversarial network (QC-GAN) for seismic data processing, serving as the first application of quantum NNs in seismic exploration. In QC-GAN, a quantum pathway is used to exploit the high-order feature correlations, while the convolutional pathway specializes in extracting the waveform structures of seismic wavefields. Furthermore, we design a QC feature complementarity loss to enforce the feature orthogonality in the proposed QC-GAN. This novel loss function can ensure that the two pathways encode non-overlapping information to enrich the capacity of feature representation. On the whole, by synergistically integrating the quantum and convolutional pathways, the proposed QC-GAN breaks the representational bottleneck inherent in classical GAN. Experimental results on denoising and interpolation tasks demonstrate that QC-GAN preserves wavefield continuity and amplitude-phase information under complex noise conditions.
[LG-36] Wireless communication empowers online scheduling of partially-observable transportation multi-robot systems in a smart factory
链接: https://arxiv.org/abs/2603.23967
作者: Yaxin Liao,Qimei Cui,Kwang-Cheng Chen,Xiong Li,Jinlian Chen,Xiyu Zhao,Xiaofeng Tao,Ping Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Achieving agile and reconfigurable production flows in smart factories depends on online multi-robot task assignment (MRTA), which requires online collision-free and congestion-free route scheduling of transportation multi-robot systems (T-MRS), e.g., collaborative automatic guided vehicles (AGVs). Due to the real-time operational requirements and dynamic interactions between T-MRS and production MRS, online scheduling under partial observability in dynamic factory environments remains a significant and under-explored challenge. This paper proposes a novel communication-enabled online scheduling framework that explicitly couples wireless machine-to-machine (M2M) networking with route scheduling, enabling AGVs to exchange intention information, e.g., planned routes, to overcome partial observations and assist complex computation of online scheduling. Specifically, we determine intelligent AGVs' intention and sensor data as new M2M traffic and tailor the retransmission-free multi-link transmission networking to meet real-time operation demands. This scheduling-oriented networking is then integrated with a simulated annealing-based MRTA scheme and a congestion-aware A*-based route scheduling method. The integrated communication and scheduling scheme allows AGVs to dynamically adjust collision-free and congestion-free routes with reduced computational overhead. Numerical experiments show the impact of wireless communication on the performance of the T-MRS and suggest that the proposed integrated scheme significantly enhances scheduling efficiency compared to other baselines, even under high AGV load conditions and limited channel resources. Moreover, the results reveal that the scheduling-oriented wireless M2M communication design fundamentally differs from human-to-human communications, implying new technological opportunities in a wireless networked smart factory.
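The congestion-aware A*-based routing mentioned above can be illustrated with a toy A* on a 4-connected grid, where each cell adds a congestion surcharge to the unit step cost; this is a generic sketch, not the paper's exact scheduler.

```python
import heapq

def astar(grid_cost, start, goal):
    """A* on a 4-connected grid. Entering a cell costs 1 plus that
    cell's congestion surcharge; the Manhattan-distance heuristic stays
    admissible because every step costs at least 1."""
    rows, cols = len(grid_cost), len(grid_cost[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    openq = [(h(start), 0, start, [start])]
    best = {}
    while openq:
        f, g, (r, c), path = heapq.heappop(openq)
        if (r, c) == goal:
            return path, g
        if best.get((r, c), float("inf")) <= g:
            continue  # already expanded with an equal or better cost
        best[(r, c)] = g
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                ng = g + 1 + grid_cost[nr][nc]  # step + congestion
                heapq.heappush(openq,
                               (ng + h((nr, nc)), ng, (nr, nc),
                                path + [(nr, nc)]))
    return None, float("inf")
```

Raising the congestion values of cells currently claimed by other AGVs' announced routes is one simple way the exchanged intention information could steer vehicles apart.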
[LG-37] Optimal Variance-Dependent Regret Bounds for Infinite-Horizon MDPs
链接: https://arxiv.org/abs/2603.23926
作者: Guy Zamir,Matthew Zurek,Yudong Chen
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
Abstract:Online reinforcement learning in infinite-horizon Markov decision processes (MDPs) remains less theoretically and algorithmically developed than its episodic counterpart, with many algorithms suffering from high "burn-in" costs and failing to adapt to benign instance-specific complexity. In this work, we address these shortcomings for two infinite-horizon objectives: the classical average-reward regret and the \gamma-regret. We develop a single tractable UCB-style algorithm applicable to both settings, which achieves the first optimal variance-dependent regret guarantees. Our regret bounds in both settings take the form \tilde{O}(\sqrt{SA \cdot \mathrm{Var}} + \text{lower-order terms}), where S and A are the state and action space sizes, and \mathrm{Var} captures cumulative transition variance. This implies minimax-optimal average-reward and \gamma-regret bounds in the worst case but also adapts to easier problem instances, for example yielding nearly constant regret in deterministic MDPs. Furthermore, our algorithm enjoys significantly improved lower-order terms for the average-reward setting. With prior knowledge of the optimal bias span \Vert h^\star\Vert_{\mathrm{sp}}, our algorithm obtains lower-order terms scaling as \Vert h^\star\Vert_{\mathrm{sp}} S^2 A, which we prove is optimal in both \Vert h^\star\Vert_{\mathrm{sp}} and A. Without prior knowledge, we prove that no algorithm can have lower-order terms smaller than \Vert h^\star\Vert_{\mathrm{sp}}^2 S A, and we provide a prior-free algorithm whose lower-order terms scale as \Vert h^\star\Vert_{\mathrm{sp}}^2 S^3 A, nearly matching this lower bound. Taken together, these results completely characterize the optimal dependence on \Vert h^\star\Vert_{\mathrm{sp}} in both leading and lower-order terms, and reveal a fundamental gap in what is achievable with and without prior knowledge.
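As a rough illustration of how variance-dependent bounds adapt to benign instances, the sketch below computes a Bernstein-style exploration bonus whose leading term scales with the empirical variance; this is a generic textbook construction, not the paper's algorithm, and the constants `c1`, `c2` are placeholders:

```python
import numpy as np

def bernstein_bonus(rewards, t, c1=2.0, c2=3.0):
    """Bernstein-style exploration bonus for one state-action pair.

    The leading term scales with the empirical standard deviation, so
    the bonus (and hence regret) shrinks toward the lower-order
    log(t)/n rate on low-variance instances such as deterministic
    MDPs. Constants c1, c2 are illustrative, not from the paper.
    """
    n = len(rewards)
    var = np.var(rewards)          # empirical variance of observations
    log_t = np.log(max(t, 2))
    return np.sqrt(c1 * var * log_t / n) + c2 * log_t / n
```

With deterministic observations the square-root term vanishes and only the lower-order term remains, mirroring the near-constant-regret behavior described in the abstract.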
[LG-38] Praxium: Diagnosing Cloud Anomalies with AI-based Telemetry and Dependency Analysis
链接: https://arxiv.org/abs/2603.23890
作者: Rohan Kumar,Jason Li,Zongshun Zhang,Syed Mohammad Qasim,Gianluca Stringhini,Ayse Kivilcim Coskun
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
Abstract:As the modern microservice architecture for cloud applications grows in popularity, cloud services are becoming increasingly complex and more vulnerable to misconfiguration and software bugs. Traditional approaches rely on expert input to diagnose and fix microservice anomalies, which lacks scalability in the face of the continuous integration and continuous deployment (CI/CD) paradigm. Microservice rollouts, containing new software installations, have complex interactions with the components of an application. Consequently, this added difficulty in attributing anomalous behavior to any specific installation or rollout results in potentially slower resolution times. To address the gaps in current diagnostic methods, this paper introduces Praxium, a framework for anomaly detection and root cause inference. Praxium aids administrators in evaluating target metric performance in the context of dependency installation information provided by a software discovery tool, PraxiPaaS. Praxium continuously monitors telemetry data to identify anomalies, then conducts root cause analysis via causal impact on recent software installations, in order to provide site reliability engineers (SRE) relevant information about an observed anomaly. In this paper, we demonstrate that Praxium is capable of effective anomaly detection and root cause inference, and we provide an analysis on effective anomaly detection hyperparameter tuning as needed in a practical setting. Across 75 total trials using four synthetic anomalies, anomaly detection consistently performs at 0.97 macro-F1. In addition, we show that causal impact analysis reliably infers the correct root cause of anomalies, even as package installations occur at increasingly shorter intervals.
[LG-39] Off-Policy Safe Reinforcement Learning with Constrained Optimistic Exploration ICLR2026
链接: https://arxiv.org/abs/2603.23889
作者: Guopeng Li,Matthijs T.J. Spaan,Julian F.P. Kooij
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 21 pages, 9 figures, accepted by ICLR 2026 poster
Abstract:When safety is formulated as a limit of cumulative cost, safe reinforcement learning (RL) aims to learn policies that maximize return subject to the cost constraint in data collection and deployment. Off-policy safe RL methods, although offering high sample efficiency, suffer from constraint violations due to cost-agnostic exploration and estimation bias in cumulative cost. To address this issue, we propose Constrained Optimistic eXploration Q-learning (COX-Q), an off-policy safe RL algorithm that integrates cost-bounded online exploration and conservative offline distributional value learning. First, we introduce a novel cost-constrained optimistic exploration strategy that resolves gradient conflicts between reward and cost in the action space and adaptively adjusts the trust region to control the training cost. Second, we adopt truncated quantile critics to stabilize the cost value learning. Quantile critics also quantify epistemic uncertainty to guide exploration. Experiments on safe velocity, safe navigation, and autonomous driving tasks demonstrate that COX-Q achieves high sample efficiency, competitive test safety performance, and controlled data collection cost. The results highlight COX-Q as a promising RL method for safety-critical applications.
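The truncated quantile critics mentioned above can be sketched as follows; this follows the generic TQC-style truncation rather than the COX-Q implementation, and the per-critic `drop` count is an assumed hyperparameter:

```python
import numpy as np

def truncated_quantile_value(quantiles, drop=2):
    """Truncated quantile estimate in the spirit of TQC.

    `quantiles`: array of shape (n_critics, n_quantiles) of atom
    estimates. Pooling all atoms and dropping the `drop` largest per
    critic counters overestimation bias; for a cost critic one might
    symmetrically drop the smallest atoms to stay conservative.
    Illustrative sketch only.
    """
    q = np.sort(quantiles.reshape(-1))
    n_critics = quantiles.shape[0]
    kept = q[: len(q) - drop * n_critics]   # remove the top atoms
    return kept.mean()
```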
[LG-40] An Invariant Compiler for Neural ODEs in AI-Accelerated Scientific Simulation
链接: https://arxiv.org/abs/2603.23861
作者: Fangzhou Yu,Yiqi Su,Ray Lee,Shenfeng Cheng,Naren Ramakrishnan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Neural ODEs are increasingly used as continuous-time models for scientific and sensor data, but unconstrained neural ODEs can drift and violate domain invariants (e.g., conservation laws), yielding physically implausible solutions. In turn, this can compound error in long-horizon prediction and surrogate simulation. Existing solutions typically aim to enforce invariance by soft penalties or other forms of regularization, which can reduce overall error but do not guarantee that trajectories will not leave the constraint manifold. We introduce the invariant compiler, a framework that enforces invariants by construction: it treats invariants as first-class types and uses an LLM-driven compilation workflow to translate a generic neural ODE specification into a structure-preserving architecture whose trajectories remain on the admissible manifold in continuous time (and up to numerical integration error in practice). This compiler view cleanly separates what must be preserved (scientific structure) from what is learned from data (dynamics within that structure). It provides a systematic design pattern for invariant-respecting neural surrogates across scientific domains.
[LG-41] Symbolic–KAN: Kolmogorov-Arnold Networks with Discrete Symbolic Structure for Interpretable Learning
链接: https://arxiv.org/abs/2603.23854
作者: Salah A Faroughi,Farinaz Mostajeran,Amirhossein Arzani,Shirko Faroughi
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Analysis of PDEs (math.AP); Dynamical Systems (math.DS)
*备注:
Abstract:Symbolic discovery of governing equations is a long-standing goal in scientific machine learning, yet a fundamental trade-off persists between interpretability and scalable learning. Classical symbolic regression methods yield explicit analytic expressions but rely on combinatorial search, whereas neural networks scale efficiently with data and dimensionality but produce opaque representations. In this work, we introduce Symbolic Kolmogorov-Arnold Networks (Symbolic-KANs), a neural architecture that bridges this gap by embedding discrete symbolic structure directly within a trainable deep network. Symbolic-KANs represent multivariate functions as compositions of learned univariate primitives applied to learned scalar projections, guided by a library of analytic primitives, hierarchical gating, and symbolic regularization that progressively sharpens continuous mixtures into one-hot selections. After gated training and discretization, each active unit selects a single primitive and projection direction, yielding compact closed-form expressions without post-hoc symbolic fitting. Symbolic-KANs further act as scalable primitive discovery mechanisms, identifying the most relevant analytic components that can subsequently inform candidate libraries for sparse equation-learning methods. We demonstrate that Symbolic-KAN reliably recovers correct primitive terms and governing structures in data-driven regression and inverse dynamical systems. Moreover, the framework extends to forward and inverse physics-informed learning of partial differential equations, producing accurate solutions directly from governing constraints while constructing compact symbolic representations whose selected primitives reflect the true analytical structure of the underlying equations. These results position Symbolic-KAN as a step toward scalable, interpretable, and mechanistically grounded learning of governing laws.
[LG-42] Unveiling Hidden Convexity in Deep Learning: a Sparse Signal Processing Perspective
链接: https://arxiv.org/abs/2603.23831
作者: Emi Zeger,Mert Pilanci
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
*备注:
Abstract:Deep neural networks (DNNs), particularly those using Rectified Linear Unit (ReLU) activation functions, have achieved remarkable success across diverse machine learning tasks, including image recognition, audio processing, and language modeling. Despite this success, the non-convex nature of DNN loss functions complicates optimization and limits theoretical understanding. In this paper, we highlight how recently developed convex equivalences of ReLU NNs and their connections to sparse signal processing models can address the challenges of training and understanding NNs. Recent research has uncovered several hidden convexities in the loss landscapes of certain NN architectures, notably two-layer ReLU networks and other deeper or varied architectures. This paper seeks to provide an accessible and educational overview that bridges recent advances in the mathematics of deep learning with traditional signal processing, encouraging broader signal processing applications.
[LG-43] Resolving gradient pathology in physics-informed epidemiological models
链接: https://arxiv.org/abs/2603.23799
作者: Nickson Golooba,Woldegebriel Assefa Woldegerima
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC)
*备注: 16 pages, 4 figures. Submitted to Neural Networks
Abstract:Physics-informed neural networks (PINNs) are increasingly used in mathematical epidemiology to bridge the gap between noisy clinical data and compartmental models, such as the susceptible-exposed-infected-removed (SEIR) model. However, training these hybrid networks is often unstable due to competing optimization objectives. As established in recent literature on "gradient pathology," the gradient vectors derived from the data loss and the physical residual often point in conflicting directions, leading to slow convergence or optimization deadlock. While existing methods attempt to resolve this by balancing gradient magnitudes or projecting conflicting vectors, we propose conflict-gated gradient scaling (CGGS), a novel and computationally efficient method that addresses gradient conflicts in physics-informed neural networks for epidemiological modelling and ensures stable and efficient training. This method utilizes the cosine similarity between the data and physics gradients to dynamically modulate the penalty weight. Unlike standard annealing schemes that only normalize scales, CGGS acts as a geometric gate: it suppresses the physical constraint when directional conflict is high, allowing the optimizer to prioritize data fidelity, and restores the constraint when gradients align. We prove that this gating mechanism preserves the standard O(1/T) convergence rate for smooth non-convex objectives, a guarantee that fails under fixed-weight or magnitude-balanced training when gradients conflict. We demonstrate that this mechanism autonomously induces a curriculum learning effect, improving parameter estimation in stiff epidemiological systems compared to magnitude-based baselines. Our empirical results show improved peak recovery and convergence over magnitude-based methods.
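The geometric gate described above admits a compact sketch: the penalty weight is driven by the cosine similarity between the data and physics gradients. The linear ramp from [-1, 1] to [0, lam_max] is an illustrative choice; the paper's exact gate may differ:

```python
import numpy as np

def cggs_weight(g_data, g_phys, lam_max=1.0):
    """Conflict-gated penalty weight (sketch of the CGGS idea).

    Near-zero weight under strong directional conflict (cosine -> -1)
    lets the optimizer prioritize data fidelity; full weight under
    alignment (cosine -> +1) restores the physics constraint.
    """
    cos = g_data @ g_phys / (np.linalg.norm(g_data) * np.linalg.norm(g_phys) + 1e-12)
    return lam_max * 0.5 * (1.0 + cos)     # maps [-1, 1] -> [0, lam_max]

def combined_gradient(g_data, g_phys):
    """Total update direction with the gated physics term."""
    return g_data + cggs_weight(g_data, g_phys) * g_phys
```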
[LG-44] Manifold Generalization Provably Precedes Memorization in Diffusion Models
链接: https://arxiv.org/abs/2603.23792
作者: Zebang Shen,Ya-Ping Hsieh,Niao He
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: The first two authors contributed equally
Abstract:Diffusion models often generate novel samples even when the learned score is only coarse – a phenomenon not accounted for by the standard view of diffusion training as density estimation. In this paper, we show that, under the manifold hypothesis, this behavior can instead be explained by coarse scores capturing the geometry of the data while discarding the fine-scale distributional structure of the population measure \mu_{\mathrm{data}}. Concretely, whereas estimating the full data distribution \mu_{\mathrm{data}} supported on a k-dimensional manifold is known to require the classical minimax rate \tilde{\mathcal{O}}(N^{-1/k}), we prove that diffusion models trained with coarse scores can exploit the regularity of the manifold support and attain a near-parametric rate toward a different target distribution. This target distribution has density uniformly comparable to that of \mu_{\mathrm{data}} throughout any \tilde{\mathcal{O}}(N^{-\beta/(4k)})-neighborhood of the manifold, where \beta denotes the manifold regularity. Our guarantees therefore depend only on the smoothness of the underlying support, and are especially favorable when the data density itself is irregular, for instance non-differentiable. In particular, when the manifold is sufficiently smooth, we obtain that generalization – formalized as the ability to generate novel, high-fidelity samples – occurs at a statistical rate strictly faster than that required to estimate the full population distribution \mu_{\mathrm{data}}.
[LG-45] Digital Twin-Assisted Measurement Design and Channel Statistics Prediction
链接: https://arxiv.org/abs/2603.23787
作者: Robin J. Williams,Mahmoud Saad Abouamer,Petar Popovski
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 6 pages, 3 figures. Accepted for 2026 IEEE International Conference on Communications Workshops: Workshop on Data Driven and AI-Enabled Digital Twin Networks and Applications (TwinNetApp)
Abstract:Prediction of wireless channels and their statistics is a fundamental procedure for ensuring performance guarantees in wireless systems. Statistical radio maps powered by Gaussian processes (GPs) offer flexible, non-parametric frameworks, but their performance depends critically on the choice of mean and covariance functions. These are typically learned from dense measurements without exploiting environmental geometry. Digital twins (DTs) of wireless environments leverage computational power to incorporate geometric information; however, they require costly calibration to accurately capture material and propagation characteristics. This work introduces a hybrid channel prediction framework that leverages uncalibrated DTs derived from open-source maps to extract geometry-induced prior information for GP prediction. These structural priors are fused with a small number of channel measurements, enabling data-efficient prediction of channel statistics across the entire environment. By exploiting the uncertainty quantification inherent to GPs, the framework supports principled measurement selection by identifying informative probing locations under resource constraints. Through this integration of imperfect DTs with statistical learning, the proposed method reduces measurement overhead, improves prediction accuracy, and establishes a practical approach for resource-efficient wireless channel prediction.
[LG-46] Latent Algorithmic Structure Precedes Grokking: A Mechanistic Study of ReLU MLPs on Modular Arithmetic
链接: https://arxiv.org/abs/2603.23784
作者: Anand Swaroop
类目: Machine Learning (cs.LG)
*备注: 9 pages, 5 figures
Abstract:Grokking, the phenomenon where validation accuracy of neural networks on modular addition of two integers rises long after the training data has been memorized, has been characterized in previous works as producing sinusoidal input weight distributions in transformers and multi-layer perceptrons (MLPs). We find empirically that ReLU MLPs in our experimental setting instead learn near-binary square-wave input weights, where intermediate-valued weights appear exclusively near sign-change boundaries, alongside output weight distributions whose dominant Fourier phases satisfy a phase-sum relation \phi_{\mathrm{out}} = \phi_a + \phi_b; this relation holds even when the model is trained on noisy data and fails to grok. We extract the frequency and phase of each neuron’s weights via DFT and construct an idealized MLP: input weights are replaced by perfect binary square waves and output weights by cosines, both parametrized by the frequencies, phases, and amplitudes extracted from the dominant Fourier components of the real model weights. This idealized model achieves 95.5% accuracy when the frequencies and phases are extracted from the weights of a model trained on noisy data that itself achieves only 0.23% accuracy. This suggests that grokking does not discover the correct algorithm, but rather sharpens an algorithm substantially encoded during memorization, progressively binarizing the input weights into cleaner square waves and aligning the output weights, until generalization becomes possible.
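The DFT extraction step described above might look as follows in a minimal form; `dominant_freq_phase` is a hypothetical helper name, and real weight rows would replace the synthetic cosine used here:

```python
import numpy as np

def dominant_freq_phase(w):
    """Extract a weight row's dominant Fourier frequency and phase.

    Takes the DFT of a neuron's input weights over the p residues and
    reads off the strongest non-DC component, as in the analysis the
    abstract describes; the idealized-model construction builds on
    these extracted values.
    """
    spec = np.fft.rfft(w)
    k = 1 + np.argmax(np.abs(spec[1:]))   # skip the DC component
    return k, np.angle(spec[k])
```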
[LG-47] Lightweight Fairness for LLM -Based Recommendations via Kernelized Projection and Gated Adapters
链接: https://arxiv.org/abs/2603.23780
作者: Nan Cui,Wendy Hui Wang,Yue Ning
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large Language Models (LLMs) have introduced new capabilities to recommender systems, enabling dynamic, context-aware, and conversational recommendations. However, LLM-based recommender systems inherit and may amplify social biases embedded in their pre-training data, especially when demographic cues are present. Existing fairness solutions either require fine-tuning extra parameters or suffer from optimization instability. We propose a lightweight and scalable bias mitigation method that combines a kernelized Iterative Null-space Projection (INLP) with a gated Mixture-of-Experts (MoE) adapter. Our approach estimates a closed-form projection that removes single or multiple sensitive attributes from LLM representations with no additional trainable parameters. To preserve task utility, we introduce a two-level MoE adapter that selectively restores useful signals without reintroducing bias. Experiments on two public datasets show that our method reduces attribute leakage across multiple protected variables while maintaining competitive recommendation accuracy.
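The closed-form null-space projection at the core of INLP can be sketched as below; this is the standard linear form P = I - W^+ W, whereas the paper's method is a kernelized variant operating in feature space:

```python
import numpy as np

def nullspace_projection(W):
    """Closed-form projection removing the subspace spanned by the
    rows of W (e.g., linear probe directions for sensitive
    attributes). With P = I - W^+ W, the projected representation
    P x has zero component along every probe direction.
    """
    return np.eye(W.shape[1]) - np.linalg.pinv(W) @ W
```

Applying `h @ P` (or `P @ h` for column vectors) to hidden states then removes the attribute directions with no trainable parameters, matching the closed-form flavor described in the abstract.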
[LG-48] Kronecker-Structured Nonparametric Spatiotemporal Point Processes
链接: https://arxiv.org/abs/2603.23746
作者: Zhitong Xu,Qiwei Yuan,Yinghao Chen,Yan Sun,Bin Shen,Shandian Zhe
类目: Machine Learning (cs.LG)
*备注:
Abstract:Events in spatiotemporal domains arise in numerous real-world applications, where uncovering event relationships and enabling accurate prediction are central challenges. Classical Poisson and Hawkes processes rely on restrictive parametric assumptions that limit their ability to capture complex interaction patterns, while recent neural point process models increase representational capacity but integrate event information in a black-box manner, hindering interpretable relationship discovery. To address these limitations, we propose a Kronecker-Structured Nonparametric Spatiotemporal Point Process (KSTPP) that enables transparent event-wise relationship discovery while retaining high modeling flexibility. We model the background intensity with a spatial Gaussian process (GP) and the influence kernel as a spatiotemporal GP, allowing rich interaction patterns including excitation, inhibition, neutrality, and time-varying effects. To enable scalable training and prediction, we adopt separable product kernels and represent the GPs on structured grids, inducing Kronecker-structured covariance matrices. Exploiting Kronecker algebra substantially reduces computational cost and allows the model to scale to large event collections. In addition, we develop a tensor-product Gauss-Legendre quadrature scheme to efficiently evaluate intractable likelihood integrals. Extensive experiments demonstrate the effectiveness of our framework.
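The Kronecker algebra that makes such structured covariances tractable rests on the identity (A ⊗ B) vec(X) = vec(B X A^T), which replaces one large matrix-vector product with two small matrix products. A minimal sketch (a generic identity, not the paper's full inference routine):

```python
import numpy as np

def kron_matvec(A, B, x):
    """Compute (A kron B) @ x without forming the Kronecker product.

    Uses (A kron B) vec(X) = vec(B X A^T) with column-major vec; this
    is the kind of Kronecker algebra that lets structured covariance
    matrices scale to large spatiotemporal grids.
    """
    X = x.reshape(A.shape[1], B.shape[1]).T   # column-major unvec
    return (B @ X @ A.T).T.reshape(-1)        # column-major vec of B X A^T
```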
[LG-49] BXRL: Behavior-Explainable Reinforcement Learning
链接: https://arxiv.org/abs/2603.23738
作者: Ram Rachum,Yotam Amitai,Yonatan Nakar,Reuth Mirsky,Cameron Allen
类目: Machine Learning (cs.LG)
*备注:
Abstract:A major challenge of Reinforcement Learning is that agents often learn undesired behaviors that seem to defy the reward structure they were given. Explainable Reinforcement Learning (XRL) methods can answer queries such as “explain this specific action”, “explain this specific trajectory”, and “explain the entire policy”. However, XRL lacks a formal definition for behavior as a pattern of actions across many episodes. We provide such a definition, and use it to enable a new query: “Explain this behavior”. We present Behavior-Explainable Reinforcement Learning (BXRL), a new problem formulation that treats behaviors as first-class objects. BXRL defines a behavior measure as any function m : \Pi \to \mathbb{R}, allowing users to precisely express the pattern of actions that they find interesting and measure how strongly the policy exhibits it. We define contrastive behaviors that reduce the question “why does the agent prefer a to a'?” to “why is m(\pi) high?”, which can be explored with differentiation. We do not implement an explainability method; we instead analyze three existing methods and propose how they could be adapted to explain behavior. We present a port of the HighwayEnv driving environment to JAX, which provides an interface for defining, measuring, and differentiating behaviors with respect to the model parameters.
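A behavior measure in the sense above is any real-valued functional of the policy. A minimal hypothetical instance, the mean probability the policy assigns to one action over sampled states, might look like:

```python
import numpy as np

def behavior_measure(policy, states, action):
    """A minimal behavior measure m: policy -> R: the mean probability
    the policy assigns to `action` over a sample of states.

    Hypothetical illustration of the m(pi) interface the abstract
    describes; any real-valued functional of the policy qualifies. A
    contrastive query "why a over a'?" then reduces to "why is
    m_a(pi) - m_{a'}(pi) large?".
    """
    return float(np.mean([policy(s)[action] for s in states]))
```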
[LG-50] Energy Efficient Software Hardware CoDesign for Machine Learning: From TinyML to Large Language Models ASPLOS2026
链接: https://arxiv.org/abs/2603.23668
作者: Mohammad Saleh Vahdatpour,Yanqing Zhang
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: Accepted as a poster presentation at the EMC2 Workshop, ASPLOS 2026
Abstract:The rapid deployment of machine learning across platforms from milliwatt-class TinyML devices to large language models has made energy efficiency a primary constraint for sustainable AI. Across these scales, performance and energy are increasingly limited by data movement and memory-system behavior rather than by arithmetic throughput alone. This work reviews energy efficient software hardware codesign methods spanning edge inference and training to datacenter-scale LLM serving, covering accelerator architectures (e.g., ASIC/FPGA dataflows, processing-/compute-in-memory designs) and system-level techniques (e.g., partitioning, quantization, scheduling, and runtime adaptation). We distill common design levers and trade-offs, and highlight recurring gaps including limited cross-platform generalization, large and costly co-design search spaces, and inconsistent benchmarking across workloads and deployment settings. Finally, we outline a hierarchical decomposition perspective that maps optimization strategies to computational roles and supports incremental adaptation, offering practical guidance for building energy and carbon aware ML systems.
[LG-51] Boost Like a (Var)Pro: Trust-Region Gradient Boosting via Variable Projection
链接: https://arxiv.org/abs/2603.23658
作者: Abhijit Chowdhary,Elizabeth Newman,Deepanshu Verma
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC)
*备注: 55 pages, 14 figures
Abstract:Gradient boosting, a method of building additive ensembles from weak learners, has established itself as a practical and theoretically-motivated approach to approximate functions, especially using decision tree weak learners. Comparable methods for smooth parametric learners, such as neural networks, remain less developed in both training methodology and theory. To this end, we introduce VPBoost (Variable Projection Boosting), a gradient boosting algorithm for separable smooth approximators, i.e., models with a smooth nonlinear featurizer followed by a final linear mapping. VPBoost fuses variable projection, a training paradigm for separable models that enforces optimality of the linear weights, with a second-order weak learning strategy. The combination of second-order boosting, separable models, and variable projection gives rise to a closed-form solution for the optimal linear weights and a natural interpretation of VPBoost as a functional trust-region method. We thereby leverage trust-region theory to prove VPBoost converges to a stationary point under mild geometric conditions and, under stronger assumptions, achieves a superlinear convergence rate. Comprehensive numerical experiments on synthetic data, image recognition, and scientific machine learning benchmarks demonstrate that VPBoost learns an ensemble with improved evaluation metrics in comparison to gradient-descent-based boosting and attains competitive performance relative to an industry-standard decision tree boosting algorithm.
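The closed-form inner step of variable projection can be sketched as an ordinary least-squares solve for the final linear weights with the nonlinear featurizer held fixed; this shows only the generic VarPro step, not the second-order boosting strategy built on top of it:

```python
import numpy as np

def varpro_linear_weights(phi, y):
    """Optimal final-layer weights for a separable model f(x) = Phi(x) w.

    Given the nonlinear featurizer's output `phi` (n_samples x
    n_features), variable projection fixes w at the least-squares
    optimum w* = Phi^+ y, so the outer optimization only sees the
    nonlinear parameters.
    """
    w, *_ = np.linalg.lstsq(phi, y, rcond=None)
    return w
```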
[LG-52] LLM Inference at the Edge: Mobile NPU and GPU Performance Efficiency Trade-offs Under Sustained Load
链接: https://arxiv.org/abs/2603.23640
作者: Pranay Tummalapalli,Sahil Arayakandy,Ritam Pal,Kautuk Kundan
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 14 pages, 5 figures, 10 tables
Abstract:Deploying large language models on-device for always-on personal agents demands sustained inference from hardware tightly constrained in power, thermal envelope, and memory. We benchmark Qwen 2.5 1.5B (4-bit quantised) across four platforms: a Raspberry Pi 5 with Hailo-10H NPU, a Samsung Galaxy S24 Ultra, an iPhone 16 Pro, and a laptop NVIDIA RTX 4050 GPU. Using a fixed 258-token prompt over 20 warm-condition iterations per device, we measure throughput, latency, power, and thermal behaviour. For mobile platforms, thermal management supersedes peak compute as the primary constraint: the iPhone 16 Pro loses nearly half its throughput within two iterations, and the S24 Ultra suffers a hard OS-enforced GPU frequency floor that terminates inference entirely. On dedicated hardware, distinct constraints dominate: the RTX 4050 is bounded by its battery power ceiling, while the Hailo-10H is limited by on-module memory bandwidth. The RTX 4050 sustains 131.7 tok/s at 34.1 W; the Hailo-10H sustains 6.9 tok/s at under 2 W with near-zero variance, matching the RTX 4050 in energy proportionality at 19x lower throughput. Results should be interpreted as platform-level deployment characterisations for a single model and prompt type, reflecting hardware and software combined, rather than general claims about hardware capability alone.
[LG-53] Steering Code LLM s with Activation Directions for Language and Library Control
链接: https://arxiv.org/abs/2603.23629
作者: Md Mahbubur Rahman,Arjun Guha,Harshitha Menon
类目: Machine Learning (cs.LG)
*备注:
Abstract:Code LLMs often default to particular programming languages and libraries under neutral prompts. We investigate whether these preferences are encoded as approximately linear directions in activation space that can be manipulated at inference time. Using a difference-in-means method, we estimate layer-wise steering vectors for five language/library pairs and add them to model hidden states during generation. Across three open-weight code LLMs, these interventions substantially increase generation toward the target ecosystem under neutral prompts and often remain effective even when prompts explicitly request the opposite choice. Steering strength varies by model and target, with common ecosystems easier to induce than rarer alternatives, and overly strong interventions can reduce output quality. Overall, our results suggest that code-style preferences in LLMs are partly represented by compact, steerable structure in activation space.
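The difference-in-means construction described above reduces to a few lines; the helper names are hypothetical, and the scale `alpha` would be tuned per model and layer:

```python
import numpy as np

def steering_vector(h_target, h_other):
    """Difference-in-means steering direction for one layer.

    `h_target`/`h_other`: hidden states (n x d) collected from prompts
    that elicit the target vs. the alternative language or library.
    """
    return h_target.mean(axis=0) - h_other.mean(axis=0)

def steer(h, v, alpha=1.0):
    """Add the (scaled) steering vector to a hidden state at inference."""
    return h + alpha * v
```

As the abstract notes, overly large `alpha` can degrade output quality, so the scale is a genuine tuning knob rather than a free parameter.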
[LG-54] MetaKube: An Experience-Aware LLM Framework for Kubernetes Failure Diagnosis
链接: https://arxiv.org/abs/2603.23580
作者: Wei Sun,Ting Wang,Xinran Tian,Wanshun Lan,Xuhan Feng,Haoyue Li,Fangxin Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Existing LLM-based Kubernetes diagnostic systems cannot learn from operational experience, operating on static knowledge bases without improving from past resolutions. We present MetaKube, an experience-aware LLM framework through three synergistic innovations: (1) an Episodic Pattern Memory Network (EPMN) that abstracts diagnostic patterns from historical resolutions and provides confidence-calibrated retrieval for both rapid pattern matching and guided causal exploration, (2) a meta-cognitive controller that dynamically routes between intuitive and analytical pathways based on problem familiarity, optimizing the trade-off between speed and depth, and (3) KubeLLM, a locally-deployable 8B model enhanced through domain-specific post-training on our 7,000-sample Kubernetes Fault Resolution Dataset. Evaluation on 1,873 real-world scenarios demonstrates MetaKube transforms Qwen3-8B from 50.9 to 90.5 points, approaching GPT-4.1 performance while ensuring complete data privacy. EPMN contributes 15.3% improvement through experiential learning, with continuous learning experiments showing progressive gains as the system accumulates operational knowledge. The source code and related resources are available at this https URL.
[LG-55] Residual Attention Physics-Informed Neural Networks for Robust Multiphysics Simulation of Steady-State Electrothermal Energy Systems
链接: https://arxiv.org/abs/2603.23578
作者: Yuqing Zhou,Ze Tao,Fujun Liu
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
Abstract:Efficient thermal management and precise field prediction are critical for the design of advanced energy systems, including electrohydrodynamic transport, microfluidic energy harvesters, and electrically driven thermal regulators. However, the steady-state simulation of these electrothermal coupled multiphysics systems remains challenging for physics-informed neural computation due to strong nonlinear field coupling, temperature-dependent coefficient variability, and complex interface dynamics. This study proposes a Residual Attention Physics-Informed Neural Network (RA-PINN) framework for the unified solution of coupled velocity, pressure, electric-potential, and temperature fields. By integrating a unified five-field operator formulation with residual-connected feature propagation and attention-guided channel modulation, the proposed architecture effectively captures localized coupling structures and steep gradients. We evaluate RA-PINN across four representative energy-relevant benchmarks: constant-coefficient coupling, indirect pressure-gauge constraints, temperature-dependent transport, and oblique-interface consistency. Comparative analysis against Pure-MLP, LSTM-PINN, and pLSTM-PINN demonstrates that RA-PINN achieves superior accuracy, yielding the lowest MSE, RMSE, and relative L_2 errors across all scenarios. Notably, RA-PINN maintains high structural fidelity in interface-dominated and variable-coefficient settings where conventional PINN backbones often fail. These results establish RA-PINN as a robust and accurate computational framework for the high-fidelity modeling and optimization of complex electrothermal multiphysics in sustainable energy applications.
[LG-56] Causal Reconstruction of Sentiment Signals from Sparse News Data
链接: https://arxiv.org/abs/2603.23568
作者: Stefania Stan,Marzio Lunghi,Vito Vargetto,Claudio Ricci,Rolands Repetto,Brayden Leo,Shao-Hong Gan
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 28 pages, 2 figures, 14 tables
Abstract:Sentiment signals derived from sparse news are commonly used in financial analysis and technology monitoring, yet transforming raw article-level observations into reliable temporal series remains a largely unsolved engineering problem. Rather than treating this as a classification challenge, we propose to frame it as a causal signal reconstruction problem: given probabilistic sentiment outputs from a fixed classifier, recover a stable latent sentiment series that is robust to the structural pathologies of news data such as sparsity, redundancy, and classifier uncertainty. We present a modular three-stage pipeline that (i) aggregates article-level scores onto a regular temporal grid with uncertainty-aware and redundancy-aware weights, (ii) fills coverage gaps through strictly causal projection rules, and (iii) applies causal smoothing to reduce residual noise. Because ground-truth longitudinal sentiment labels are typically unavailable, we introduce a label-free evaluation framework based on signal stability diagnostics, information preservation lag proxies, and counterfactual tests for causality compliance and redundancy robustness. As a secondary external check, we evaluate the consistency of reconstructed signals against stock-price data for a multi-firm dataset of AI-related news titles (November 2024 to February 2026). The key empirical finding is a three-week lead lag pattern between reconstructed sentiment and price that persists across all tested pipeline configurations and aggregation regimes, a structural regularity more informative than any single correlation coefficient. Overall, the results support the view that stable, deployable sentiment indicators require careful reconstruction, not only better classifiers.
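The three-stage pipeline described in the abstract can be illustrated with a minimal NumPy sketch. The function name `reconstruct_sentiment`, the weighting rule, and all parameters are illustrative assumptions, not the authors' implementation: stage 1 does confidence-weighted aggregation onto a daily grid, stage 2 does strictly causal gap filling, and stage 3 applies causal exponential smoothing.

```python
import numpy as np

def reconstruct_sentiment(days, scores, confidences, n_days, alpha=0.3):
    """Toy three-stage causal reconstruction of a daily sentiment series.

    days: integer day index per article; scores: classifier sentiment in [-1, 1];
    confidences: classifier confidence in (0, 1], used as aggregation weights.
    """
    # Stage 1: confidence-weighted aggregation onto a regular daily grid.
    grid = np.full(n_days, np.nan)
    for d in range(n_days):
        mask = days == d
        if mask.any():
            grid[d] = np.average(scores[mask], weights=confidences[mask])
    # Stage 2: strictly causal gap filling (carry the last observed value forward).
    filled = grid.copy()
    for d in range(1, n_days):
        if np.isnan(filled[d]):
            filled[d] = filled[d - 1]
    # Stage 3: causal exponential smoothing (uses past values only).
    smooth = np.empty(n_days)
    smooth[0] = filled[0]
    for d in range(1, n_days):
        smooth[d] = alpha * filled[d] + (1 - alpha) * smooth[d - 1]
    return smooth

days = np.array([0, 0, 2, 4])
scores = np.array([0.8, 0.4, -0.2, 0.6])
conf = np.array([0.9, 0.1, 0.5, 0.8])
series = reconstruct_sentiment(days, scores, conf, n_days=5)
```

Every step reads only past or present observations, so the reconstructed series satisfies the causality-compliance requirement the paper tests for.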
[LG-57] Labeled Compression Schemes for Concept Classes of Finite Functions
链接: https://arxiv.org/abs/2603.23561
作者: Benchong Li
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:
Abstract:The sample compression conjecture states that each concept class of VC dimension d has a compression scheme of size d. In this paper, for any concept class of finite functions, we present a labeled sample compression scheme whose size equals its VC dimension d. That is, the long-standing open sample compression conjecture is resolved.
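To make the notion of a labeled sample compression scheme concrete, here is a toy example that is not from the paper: for the classic VC-dimension-1 class of 1-D threshold functions, any realizable labeled sample compresses to a single labeled point from which a consistent hypothesis can be reconstructed.

```python
def compress(sample):
    """Labeled compression for 1-D thresholds h_t(x) = [x >= t] (VC dimension 1).

    Keep one labeled point: the smallest positive example, or, if the sample
    is all-negative, the largest negative example. Compressed size = 1 = VC dim.
    """
    positives = [(x, y) for x, y in sample if y == 1]
    if positives:
        return [min(positives)]
    return [max(sample)]

def reconstruct(compressed):
    """Rebuild a hypothesis consistent with the original (realizable) sample."""
    (x0, y0), = compressed
    if y0 == 1:
        return lambda x: 1 if x >= x0 else 0
    return lambda x: 1 if x > x0 else 0

sample = [(0.2, 0), (0.5, 0), (0.7, 1), (0.9, 1)]
h = reconstruct(compress(sample))
```

The paper's contribution is a scheme with this size guarantee for arbitrary concept classes of finite functions, not just thresholds.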
[LG-58] Implicit Turn-Wise Policy Optimization for Proactive User-LLM Interaction
链接: https://arxiv.org/abs/2603.23550
作者: Haoyu Wang,Yuxin Chen,Liang Luo,Buyun Zhang,Ellie Dingqiao Wen,Pan Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multi-turn human-AI collaboration is fundamental to deploying interactive services such as adaptive tutoring, conversational recommendation, and professional consultation. However, optimizing these interactions via reinforcement learning is hindered by the sparsity of verifiable intermediate rewards and the high stochasticity of user responses. To address these challenges, we introduce Implicit Turn-wise Policy Optimization (ITPO). ITPO leverages an implicit process reward model to derive fine-grained, turn-wise process rewards from sparse outcome signals. Unlike volatile token-level rewards, these turn-level signals exhibit superior robustness and may utilize a normalization mechanism to further enhance training stability. We evaluate ITPO across three representative multi-turn collaborative tasks: math tutoring, document writing, and medical recommendation. Empirical results demonstrate that ITPO, when combined with PPO, GRPO, or RLOO, consistently achieves improved convergence over existing baselines. Detailed trajectory analysis confirms that ITPO infers turn-wise preferences that are semantically aligned with human judgment. Code is publicly available at this https URL.
[LG-59] DeepOFW: Deep Learning-Driven OFDM-Flexible Waveform Modulation for Peak-to-Average Power Ratio Reduction
链接: https://arxiv.org/abs/2603.23544
作者: Ran Greidi,Kobi Cohen
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Peak-to-average power ratio (PAPR) remains a major limitation of multicarrier modulation schemes such as orthogonal frequency-division multiplexing (OFDM), reducing power amplifier efficiency and limiting practical transmit power. In this work, we propose DeepOFW, a deep learning-driven OFDM-flexible waveform modulation framework that enables data-driven waveform design while preserving the low-complexity hardware structure of conventional transceivers. The proposed architecture is fully differentiable, allowing end-to-end optimization of waveform generation and receiver processing under practical physical constraints. Unlike neural transceiver approaches that require deep learning inference at both ends of the link, DeepOFW confines the learning stage to an offline or centralized unit, enabling deployment on standard transmitter and receiver hardware without additional computational overhead. The framework jointly optimizes waveform representations and detection parameters while explicitly incorporating PAPR constraints during training. Extensive simulations over 3GPP multipath channels demonstrate that the learned waveforms significantly reduce PAPR compared with classical OFDM while simultaneously improving bit error rate (BER) performance relative to state-of-the-art transmission schemes. These results highlight the potential of data-driven waveform design to enhance multicarrier communication systems while maintaining hardware-efficient implementations. An open-source implementation of the proposed framework is released to facilitate reproducible research and practical adoption.
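For readers unfamiliar with the metric the paper targets, the following sketch computes the PAPR of a single OFDM symbol via an oversampled IFFT. This is the standard textbook computation, not the DeepOFW code; the helper name and parameters are ours.

```python
import numpy as np

def papr_db(freq_symbols, oversample=4):
    """PAPR (in dB) of one OFDM symbol given its frequency-domain QAM symbols.

    Zero-padding the middle of the spectrum (oversampling) approximates the
    continuous-time peak of the transmitted waveform.
    """
    n = len(freq_symbols)
    padded = np.zeros(n * oversample, dtype=complex)
    padded[:n // 2] = freq_symbols[:n // 2]
    padded[-(n // 2):] = freq_symbols[n // 2:]
    x = np.fft.ifft(padded) * oversample  # time-domain waveform
    power = np.abs(x) ** 2
    return 10 * np.log10(power.max() / power.mean())

rng = np.random.default_rng(0)
qpsk = (rng.choice([-1, 1], 64) + 1j * rng.choice([-1, 1], 64)) / np.sqrt(2)
ratio = papr_db(qpsk)  # random QPSK over 64 subcarriers: typically several dB
```

A waveform whose time-domain envelope is constant (a single occupied subcarrier) has a PAPR of exactly 0 dB, which is the limit that waveform-shaping methods like DeepOFW push toward.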
[LG-60] Trust Region Constrained Bayesian Optimization with Penalized Constraint Handling
链接: https://arxiv.org/abs/2603.24567
作者: Raju Chowdhury,Tanmay Sen,Prajamitra Bhuyan,Biswabrata Pradhan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Constrained optimization in high-dimensional black-box settings is difficult due to expensive evaluations, the lack of gradient information, and complex feasibility regions. In this work, we propose a Bayesian optimization method that combines a penalty formulation, a surrogate model, and a trust region strategy. The constrained problem is converted to an unconstrained form by penalizing constraint violations, which provides a unified modeling framework. A trust region restricts the search to a local region around the current best solution, which improves stability and efficiency in high dimensions. Within this region, we use the Expected Improvement acquisition function to select evaluation points by balancing improvement and uncertainty. The proposed Trust Region method integrates penalty-based constraint handling with local surrogate modeling. This combination enables efficient exploration of feasible regions while maintaining sample efficiency. We compare the proposed method with state-of-the-art methods on synthetic and real-world high-dimensional constrained optimization problems. The results show that the method identifies high-quality feasible solutions with fewer evaluations and maintains stable performance across different settings.
[LG-61] Continuous-Time Learning of Probability Distributions: A Case Study in a Digital Trial of Young Children with Type 1 Diabetes
链接: https://arxiv.org/abs/2603.24427
作者: Antonio Álvarez-López,Marcos Matabuena
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 53 pages, 11 figures
Abstract:Understanding how biomarker distributions evolve over time is a central challenge in digital health and chronic disease monitoring. In diabetes, changes in the distribution of glucose measurements can reveal patterns of disease progression and treatment response that conventional summary measures miss. Motivated by a 26-week clinical trial comparing the closed-loop insulin delivery system t:slim X2 with standard therapy in children with type 1 diabetes, we propose a probabilistic framework to model the continuous-time evolution of time-indexed distributions using continuous glucose monitoring data (CGM) collected every five minutes. We represent the glucose distribution as a Gaussian mixture, with time-varying mixture weights governed by a neural ODE. We estimate the model parameter using a distribution-matching criterion based on the maximum mean discrepancy. The resulting framework is interpretable, computationally efficient, and sensitive to subtle temporal distributional changes. Applied to CGM trial data, the method detects treatment-related improvements in glucose dynamics that are difficult to capture with traditional analytical approaches.
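The distribution-matching criterion named in the abstract, the maximum mean discrepancy, can be shown in a small stand-alone sketch (RBF kernel, biased estimator; the sample sizes, bandwidth, and variable names are illustrative, not the paper's estimation code).

```python
import numpy as np

def mmd2(x, y, bandwidth=1.0):
    """Biased estimate of squared maximum mean discrepancy with an RBF kernel.

    Used as a distribution-matching training criterion: model samples y are
    pushed toward observed samples x by minimizing mmd2(x, y).
    """
    def k(a, b):
        d2 = (a[:, None] - b[None, :]) ** 2
        return np.exp(-d2 / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(1)
obs = rng.normal(0.0, 1.0, 200)    # stand-in for observed glucose readings
close = rng.normal(0.1, 1.0, 200)  # model samples near the data
far = rng.normal(3.0, 1.0, 200)    # model samples far from the data
```

In the paper this discrepancy is minimized over the parameters of a neural ODE governing the mixture weights; here it simply ranks which sample set better matches the observations.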
[LG-62] Neural Network Models for Contextual Regression
链接: https://arxiv.org/abs/2603.24400
作者: Seksan Kiatsupaibul,Pakawan Chansiripas
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We propose a neural network model for contextual regression in which the regression model depends on contextual features that determine the active submodel and an algorithm to fit the model. The proposed simple contextual neural network (SCtxtNN) separates context identification from context-specific regression, resulting in a structured and interpretable architecture with fewer parameters than a fully connected feed-forward network. We show mathematically that the proposed architecture is sufficient to represent contextual linear regression models using only standard neural network components. Numerical experiments are provided to support the theoretical result, showing that the proposed model achieves lower excess mean squared error and more stable performance than feed-forward neural networks with comparable numbers of parameters, while larger networks improve accuracy only at the cost of increased complexity. The results suggest that incorporating contextual structure can improve model efficiency while preserving interpretability.
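A toy version of the context-then-regress idea can be sketched as follows. Hard argmax gating stands in for the paper's context-identification layer, and all names and shapes are our assumptions, not the SCtxtNN code.

```python
import numpy as np

def sctxtnn_predict(x_ctx, x_reg, W_gate, betas):
    """Sketch of contextual regression: contextual features pick a submodel.

    x_ctx selects, via a hard argmax gate, which row of linear regression
    coefficients in `betas` applies to the regression features x_reg.
    """
    logits = x_ctx @ W_gate                           # context identification
    ctx = np.argmax(logits, axis=1)                   # active submodel per sample
    preds = np.einsum('ij,ij->i', x_reg, betas[ctx])  # context-specific regression
    return preds, ctx

# Two contexts, each with its own linear law over two regression features.
betas = np.array([[1.0, 0.0],    # context 0: y = x1
                  [0.0, 2.0]])   # context 1: y = 2 * x2
W_gate = np.array([[5.0, -5.0]]) # positive context feature -> context 0
x_ctx = np.array([[1.0], [-1.0]])
x_reg = np.array([[3.0, 10.0], [3.0, 10.0]])
preds, ctx = sctxtnn_predict(x_ctx, x_reg, W_gate, betas)
```

Separating gating parameters from per-context coefficients is what keeps the parameter count below that of a comparable fully connected network.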
[LG-63] Federated fairness-aware classification under differential privacy
链接: https://arxiv.org/abs/2603.24392
作者: Gengyu Xue,Yi Yu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:Privacy and algorithmic fairness have become two central issues in modern machine learning. Although each has separately emerged as a rapidly growing research area, their joint effect remains comparatively under-explored. In this paper, we systematically study the joint impact of differential privacy and fairness on classification in a federated setting, where data are distributed across multiple servers. Targeting demographic disparity constrained classification under federated differential privacy, we propose a two-step algorithm, namely FDP-Fair. In the special case where there is only one server, we further propose a simple yet powerful algorithm, namely CDP-Fair, serving as a computationally-lightweight alternative. Under mild structural assumptions, theoretical guarantees on privacy, fairness and excess risk control are established. In particular, we disentangle the source of the private fairness-aware excess risk into a) intrinsic cost of classification, b) cost of private classification, c) non-private cost of fairness and d) private cost of fairness. Our theoretical findings are complemented by extensive numerical experiments on both synthetic and real datasets, highlighting the practicality of our designed algorithms.
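As background for the privacy side, here is a minimal sketch of the Gaussian mechanism applied to group-wise positive rates, from which a demographic-parity gap can be estimated privately. The noise calibration is the classic one for the Gaussian mechanism; this is not the FDP-Fair or CDP-Fair algorithm.

```python
import numpy as np

def private_rate(values, epsilon, delta, rng):
    """Gaussian-mechanism estimate of the mean of {0,1} values.

    The mean of n binary values has sensitivity 1/n; the classic calibration
    sigma = sqrt(2 ln(1.25/delta)) * sensitivity / epsilon gives
    (epsilon, delta)-differential privacy.
    """
    sensitivity = 1.0 / len(values)
    sigma = np.sqrt(2 * np.log(1.25 / delta)) * sensitivity / epsilon
    return values.mean() + rng.normal(0.0, sigma)

rng = np.random.default_rng(7)
group_a = rng.binomial(1, 0.7, 5000)  # predicted positives, group A
group_b = rng.binomial(1, 0.5, 5000)  # predicted positives, group B
eps, delta = 1.0, 1e-5
gap = abs(private_rate(group_a, eps, delta, rng)
          - private_rate(group_b, eps, delta, rng))  # private disparity estimate
```

The paper's analysis quantifies exactly how much such noise, plus the fairness constraint itself, inflates the excess classification risk.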
[LG-64] Adaptive decision-making for stochastic service network design
链接: https://arxiv.org/abs/2603.24369
作者: Javier Duran Micco,Bilge Atasoy
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:This paper addresses the Service Network Design (SND) problem for a logistics service provider (LSP) operating in a multimodal freight transport network, considering uncertain travel times and limited truck fleet availability. A two-stage optimization approach is proposed, which combines metaheuristics, simulation and machine learning components. This solution framework integrates tactical decisions, such as transport request acceptance and capacity booking for scheduled services, with operational decisions, including dynamic truck allocation, routing, and re-planning in response to disruptions. A simulated annealing (SA) metaheuristic is employed to solve the tactical problem, supported by an adaptive surrogate model trained using a discrete-event simulation model that captures operational complexities and cascading effects of uncertain travel times. The performance of the proposed method is evaluated using benchmark instances. First, the SA is tested on a deterministic version of the problem and compared to state-of-the-art results, demonstrating it can improve the solution quality and significantly reduce the computational time. Then, the proposed SA is applied to the more complex stochastic problem. Compared to a benchmark algorithm that executes a full simulation for each solution evaluation, the learning-based SA generates high quality solutions while significantly reducing computational effort, achieving only a 5% difference in objective function value while cutting computation time by up to 20 times. These results demonstrate the strong performance of the proposed algorithm in solving complex versions of the SND. Moreover, they highlight the effectiveness of integrating diverse modeling and optimization techniques, and the potential of such approaches to efficiently address freight transport planning challenges.
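The tactical stage relies on simulated annealing with a black-box evaluation. A generic SA loop of that kind can be sketched as follows; the toy objective and hyperparameters are ours, and `cost` stands in for the expensive simulation or its learned surrogate.

```python
import numpy as np

def simulated_annealing(cost, x0, rng, n_iter=2000, t0=1.0, cooling=0.995, step=0.5):
    """Plain simulated annealing: accept worse moves with probability exp(-delta/t).

    The metaheuristic needs only black-box evaluations of candidate plans,
    which is why a cheap surrogate can replace the full simulation.
    """
    x = np.array(x0, dtype=float)
    best = x.copy()
    t = t0
    for _ in range(n_iter):
        cand = x + rng.normal(0.0, step, size=x.shape)
        delta = cost(cand) - cost(x)
        if delta < 0 or rng.random() < np.exp(-delta / t):
            x = cand
            if cost(x) < cost(best):
                best = x.copy()
        t *= cooling  # geometric cooling schedule
    return best

rng = np.random.default_rng(3)
sphere = lambda v: float(np.sum((v - 2.0) ** 2))  # toy stand-in objective
best = simulated_annealing(sphere, [8.0, -8.0], rng)
```

In the paper, replacing the full discrete-event simulation with an adaptive surrogate inside this evaluation loop is what yields the reported 20x reduction in computation time.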
[LG-65] Connecting Meteorite Spectra to Lunar Surface Composition Using Hyperspectral Imaging and Machine Learning
链接: https://arxiv.org/abs/2603.24323
作者: Fatemeh Fazel Hesar,Mojtaba Raouf,Amirmohammad Chegeni,Peyman Soltani,Bernard Foing,Elias Chatzitheodoridis,Michiel J. A. de Dood,Fons J. Verbeek
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: 22 page, 8 figures, Accepted for publication in Planetary Science Universe Journal
Abstract:We present an innovative, cost-effective framework integrating laboratory Hyperspectral Imaging (HSI) of the Bechar010 Lunar meteorite with ground-based lunar HSI and supervised Machine Learning (ML) to generate high-fidelity mineralogical maps. A 3mm thin section of Bechar010 was imaged under a microscope with a 30mm focal length lens at 150mm working distance, using 6x binning to increase the signal-to-noise ratio, producing a data cube (X \times Y \times \lambda = 791 \times 1024 \times 224 , 0.24mm \times 0.2mm resolution) across 400-1000nm (224 bands, 2.7nm spectral sampling, 5.5nm full width at half maximum spectral resolution) using a Specim FX10 camera. Ground-based lunar HSI was captured with a Celestron 8SE telescope (3km/pixel), yielding a data cube ( 371 \times 1024 \times 224 ). Solar calibration using a Spectralon reference (99% reflectance, 2% error) ensured accurate reflectance spectra. A Support Vector Machine (SVM) with a radial basis function kernel, trained on expert-labeled spectra, achieved 93.7% classification accuracy (5-fold cross-validation) for olivine (92% precision, 90% recall) and pyroxene (88% precision, 86% recall) in Bechar010. LIME analysis identified key wavelengths (e.g., 485nm, 22.4% for M3; 715nm, 20.6% for M6) across 10 pre-selected regions (M1 to M10), indicating olivine-rich (Highland-like) and pyroxene-rich (Mare-like) compositions. SAM analysis revealed angles from 0.26 radian to 0.66 radian, linking M3 and M9 to Highlands and M6 and M10 to Mares. K-means clustering of Lunar data identified 10 mineralogical clusters (88% accuracy), validated against Chandrayaan-1 Moon Mineralogy Mapper ( \rm M^3 ) data (140m/pixel, 10nm spectral resolution). A novel push-broom HSI approach with a telescope achieves 0.8 arcsec resolution for lunar spectroscopy, inspiring full-sky multi-object spectral mapping.
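The SAM analysis in the abstract uses the Spectral Angle Mapper, which is easy to state in code. Only the angle formula is standard; the toy spectra below are ours, chosen merely to show that SAM ignores brightness and responds to spectral shape.

```python
import numpy as np

def spectral_angle(s1, s2):
    """Spectral Angle Mapper (SAM): angle in radians between two spectra.

    Small angles indicate similar mineralogy; the abstract reports angles of
    0.26-0.66 rad when matching lunar regions to meteorite end-members.
    """
    s1, s2 = np.asarray(s1, float), np.asarray(s2, float)
    cos = np.dot(s1, s2) / (np.linalg.norm(s1) * np.linalg.norm(s2))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

wavelengths = np.linspace(400, 1000, 224)           # nm, matching the 224-band cube
olivine_like = 0.2 + 0.0004 * (wavelengths - 400)   # toy rising continuum
scaled = 2.0 * olivine_like                         # same shape, brighter target
pyroxene_like = 0.5 - 0.0003 * (wavelengths - 400)  # toy falling continuum
```

Because SAM depends only on the angle between spectra, it is insensitive to illumination differences between the laboratory and telescope data, which is why it suits this cross-instrument comparison.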
[LG-66] CGRL: Causal-Guided Representation Learning for Graph Out-of-Distribution Generalization
链接: https://arxiv.org/abs/2603.24304
作者: Bowen Lu,Liangqiang Yang,Teng Li
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Graph Neural Networks (GNNs) have achieved impressive performance in graph-related tasks. However, they suffer from poor generalization on out-of-distribution (OOD) data, as they tend to learn spurious correlations. Such correlations present a phenomenon that GNNs fail to stably learn the mutual information between prediction representations and ground-truth labels under OOD settings. To address these challenges, we formulate a causal graph starting from the essence of node classification, adopt backdoor adjustment to block non-causal paths, and theoretically derive a lower bound for improving OOD generalization of GNNs. To materialize these insights, we further propose a novel approach integrating causal representation learning and a loss replacement strategy. The former captures node-level causal invariance and reconstructs graph posterior distribution. The latter introduces asymptotic losses of the same order to replace the original losses. Extensive experiments demonstrate the superiority of our method in OOD generalization and effectively alleviating the phenomenon of unstable mutual information learning.
[LG-67] Quantum Neural Physics: Solving Partial Differential Equations on Quantum Simulators using Quantum Convolutional Neural Networks
链接: https://arxiv.org/abs/2603.24196
作者: Jucai Zhai,Muhammad Abdullah,Boyang Chen,Fazal Chaudry,Paul N. Smith,Claire E. Heaney,Yanghua Wang,Jiansheng Xiang,Christopher C. Pain
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 25 pages and 8 figures
Abstract:In scientific computing, the formulation of numerical discretisations of partial differential equations (PDEs) as untrained convolutional layers within Convolutional Neural Networks (CNNs), referred to by some as Neural Physics, has demonstrated good efficiency for executing physics-based solvers on GPUs. However, classical grid-based methods still face computational bottlenecks when solving problems involving billions of degrees of freedom. To address this challenge, this paper proposes a novel framework called ‘Quantum Neural Physics’ and develops a Hybrid Quantum-Classical CNN Multigrid Solver (HQC-CNNMG). This approach maps analytically-determined stencils of discretised differential operators into parameter-free or untrained quantum convolutional kernels. By leveraging amplitude encoding, the Linear Combination of Unitaries technique and the Quantum Fourier Transform, the resulting quantum convolutional operators can be implemented using quantum circuits with a circuit depth that scales as O(log K), where K denotes the size of the encoded input block. These quantum operators are embedded into a classical W-Cycle multigrid using a U-Net. This design enables seamless integration of quantum operators within a hierarchical solver whilst retaining the robustness and convergence properties of classical multigrid methods. The proposed Quantum Neural Physics solver is validated on a quantum simulator for the Poisson equation, diffusion equation, convection-diffusion equation and incompressible Navier-Stokes equations. The solutions of the HQC-CNNMG are in close agreement with those from traditional solution methods. This work establishes a mapping from discretised physical equations to logarithmic-scale quantum circuits, providing a new and exploratory path to exponential memory compression and computational acceleration for PDE solvers on future fault-tolerant quantum computers. 
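Two of the quantum building blocks named in the LG-67 abstract, amplitude encoding and the Quantum Fourier Transform, can be checked classically at small sizes. This dense-matrix sketch is ours; a real circuit would use O(log K)-depth gates rather than an explicit N x N unitary.

```python
import numpy as np

def qft_matrix(n):
    """Dense unitary of the Quantum Fourier Transform on n qubits (N = 2**n)."""
    N = 2 ** n
    j, k = np.meshgrid(np.arange(N), np.arange(N), indexing='ij')
    return np.exp(2j * np.pi * j * k / N) / np.sqrt(N)

def amplitude_encode(x):
    """Amplitude encoding: normalize a real vector into a quantum state."""
    x = np.asarray(x, dtype=complex)
    return x / np.linalg.norm(x)

n = 3
F = qft_matrix(n)                                   # 8 x 8 unitary
state = amplitude_encode(np.arange(1.0, 9.0))       # encode an 8-entry field block
out = F @ state                                     # QFT of the encoded block
```

The QFT coincides with the classical inverse DFT up to a sqrt(N) normalization, which is what lets grid-based stencil operators be diagonalized inside the quantum convolutional kernels.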
[LG-68] Minimal Sufficient Representations for Self-interpretable Deep Neural Networks
链接: https://arxiv.org/abs/2603.24041
作者: Zhiyao Tan,Liu Li,Huazhen Lin
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:
Abstract:Deep neural networks (DNNs) achieve remarkable predictive performance but remain difficult to interpret, largely due to overparameterization that obscures the minimal structure required for interpretation. Here we introduce DeepIn, a self-interpretable neural network framework that adaptively identifies and learns the minimal representation necessary for preserving the full expressive capacity of standard DNNs. We show that DeepIn can correctly identify the minimal representation dimension, select relevant variables, and recover the minimal sufficient network architecture for prediction. The resulting estimator achieves optimal non-asymptotic error rates that adapt to the learned minimal dimension, demonstrating that recovering minimal sufficient structure fundamentally improves generalization error. Building on these guarantees, we further develop hypothesis testing procedures for both selected variables and learned representations, bridging deep representation learning with formal statistical inference. Across biomedical and vision benchmarks, DeepIn improves both predictive accuracy and interpretability, reducing error by up to 30% on real-world datasets while automatically uncovering human-interpretable discriminative patterns. Our results suggest that interpretability and statistical rigor can be embedded directly into deep architectures without sacrificing performance.
[LG-69] ChargeFlow: Flow-Matching Refinement of Charge-Conditioned Electron Densities
链接: https://arxiv.org/abs/2603.23943
作者: Tri Minh Nguyen,Sherif Abdulkader Tawfik,Truyen Tran,Svetha Venkatesh
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:
Abstract:Accurate charge densities are central to electronic-structure theory, but computing charge-state-dependent densities with density functional theory remains too expensive for large-scale screening and defect workflows. We present ChargeFlow, a flow-matching refinement model that transforms a charge-conditioned superposition of atomic densities into the corresponding DFT electron density on the native periodic real-space grid using a 3D U-Net velocity field. Trained on 9,502 charged Materials Project-derived calculations and evaluated on an external 1,671-structure benchmark spanning perovskites, charged defects, diamond defects, metal-organic frameworks, and organic crystals, ChargeFlow is not uniformly best on every in-distribution class but is strongest on problems dominated by nonlocal charge redistribution and charge-state extrapolation, improving deformation-density error from 3.62% to 3.21% and charge- response cosine similarity from 0.571 to 0.655 relative to a ResNet baseline. The predicted densities remain chemically useful under downstream analysis, yielding successful Bader partitioning on all 1,671 benchmark structures and high-fidelity electrostatic potentials, which positions flow matching as a practical density-refinement strategy for charged materials.
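The flow-matching refinement idea can be illustrated with a low-dimensional toy: on a linear interpolation path the regression target for the velocity field is simply x1 - x0. The vectors standing in for the atomic-superposition and DFT densities are ours, not the paper's data.

```python
import numpy as np

def fm_pair(x0, x1, t):
    """Linear-path flow-matching training pair.

    x_t interpolates the initial density x0 (a crude superposition-of-atoms
    stand-in) toward the target x1 (the DFT density); the regression target
    for the velocity field at (x_t, t) is x1 - x0.
    """
    xt = (1 - t) * x0 + t * x1
    v_target = x1 - x0
    return xt, v_target

def integrate(v_field, x0, n_steps=100):
    """Euler integration of a velocity field from t = 0 to t = 1."""
    x, dt = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * v_field(x, i * dt)
    return x

x0 = np.zeros(4)
x1 = np.array([1.0, 2.0, 3.0, 4.0])
oracle = lambda x, t: x1 - x0  # the ideal velocity for the linear path
xT = integrate(oracle, x0)
```

In ChargeFlow the oracle is replaced by a 3D U-Net trained on the paired-density regression target, but the transport structure is the same.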
[LG-70] Beyond Consistency: Inference for the Relative risk functional in Deep Nonparametric Cox Models
链接: https://arxiv.org/abs/2603.23835
作者: Sattwik Ghosal,Xuran Meng,Yi Li
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 24 pages, 5 figures, 4 tables
Abstract:There remain theoretical gaps in deep neural network estimators for the nonparametric Cox proportional hazards model. In particular, it is unclear how gradient-based optimization error propagates to population risk under partial likelihood, how pointwise bias can be controlled to permit valid inference, and how ensemble-based uncertainty quantification behaves under realistic variance decay regimes. We develop an asymptotic distribution theory for deep Cox estimators that addresses these issues. First, we establish nonasymptotic oracle inequalities for general trained networks that link in-sample optimization error to population risk without requiring the exact empirical risk optimizer. We then construct a structured neural parameterization that achieves infinity-norm approximation rates compatible with the oracle bound, yielding control of the pointwise bias. Under these conditions and using the Hajek–Hoeffding projection, we prove pointwise and multivariate asymptotic normality for subsampled ensemble estimators. We derive a range of subsample sizes that balances bias correction with the requirement that the Hajek–Hoeffding projection remain dominant. This range accommodates decay conditions on the single-overlap covariance, which measures how strongly a single shared observation influences the estimator, and is weaker than those imposed in the subsampling literature. An infinitesimal jackknife representation provides analytic covariance estimation and valid Wald-type inference for relative risk contrasts such as log-hazard ratios. Finally, we illustrate the finite-sample implications of the theory through simulations and a real data application.
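Central to the paper is the Cox partial likelihood evaluated at a network's log relative risk scores. A minimal no-ties version can be sketched as follows; the variable names and toy data are ours.

```python
import numpy as np

def cox_neg_log_partial_likelihood(risk, time, event):
    """Negative log partial likelihood for the Cox model (no tied event times).

    `risk` is the log relative risk g(x) (in the paper, a deep network output);
    each observed event contributes g(x_i) minus the log-sum-exp over the set
    of subjects still at risk at that event time.
    """
    order = np.argsort(time)        # process subjects in order of event time
    risk, event = risk[order], event[order]
    nll = 0.0
    for i in range(len(risk)):
        if event[i]:
            at_risk = risk[i:]      # subjects still at risk at time t_i
            nll -= risk[i] - np.log(np.sum(np.exp(at_risk)))
    return nll

time = np.array([2.0, 5.0, 1.0, 8.0])
event = np.array([1, 1, 0, 1])      # 0 = censored observation
risk = np.array([0.5, -0.3, 1.2, 0.0])
nll = cox_neg_log_partial_likelihood(risk, time, event)
```

The paper's theory concerns what happens when `risk` is a trained neural network: how optimization error in minimizing this loss propagates to population risk, and when ensembles of such fits admit valid Wald-type inference.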
[LG-71] Wasserstein Parallel Transport for Predicting the Dynamics of Statistical Systems
链接: https://arxiv.org/abs/2603.23736
作者: Tristan Luca Saidi,Gonzalo Mena,Larry Wasserman,Florian Gunsilius
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
*备注:
Abstract:Many scientific systems, such as cellular populations or economic cohorts, are naturally described by probability distributions that evolve over time. Predicting how such a system would have evolved under different forces or initial conditions is fundamental to causal inference, domain adaptation, and counterfactual prediction. However, the space of distributions often lacks the vector space structure on which classical methods rely. To address this, we introduce a general notion of parallel dynamics at a distributional level. We base this principle on parallel transport of tangent dynamics along optimal transport geodesics and call it "Wasserstein Parallel Trends". By replacing the vector subtraction of classic methods with geodesic parallel transport, we can provide counterfactual comparisons of distributional dynamics in applications such as causal inference, domain adaptation, and batch-effect correction in experimental settings. The main mathematical contribution is a novel notion of fanning scheme on the Wasserstein manifold that allows us to efficiently approximate parallel transport along geodesics while also providing the first theoretical guarantees for parallel transport in the Wasserstein space. We also show that Wasserstein Parallel Trends recovers the classic parallel trends assumption for averages as a special case and derive closed-form parallel transport for Gaussian measures. We deploy the method on synthetic data and two single-cell RNA sequencing datasets to impute gene-expression dynamics across biological systems.
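The closed-form results for Gaussian measures mentioned in the abstract include the 1-D 2-Wasserstein distance and its monotone optimal transport map, both easy to verify numerically. This is a textbook sketch, not the paper's fanning-scheme implementation.

```python
import numpy as np

def w2_gaussian(m1, s1, m2, s2):
    """Closed-form 2-Wasserstein distance between 1-D Gaussians N(m, s^2)."""
    return np.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)

def ot_map_gaussian(x, m1, s1, m2, s2):
    """Monotone optimal transport map from N(m1, s1^2) to N(m2, s2^2) in 1-D."""
    return m2 + (s2 / s1) * (x - m1)

d = w2_gaussian(0.0, 1.0, 3.0, 5.0)         # sqrt(3^2 + 4^2) = 5
x = np.random.default_rng(2).normal(0.0, 1.0, 100000)
y = ot_map_gaussian(x, 0.0, 1.0, 3.0, 5.0)  # pushforward samples
```

Transporting samples along such maps is the elementary operation that the paper's fanning scheme generalizes beyond the Gaussian case.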
[LG-72] Autoregressive Guidance of Deep Spatially Selective Filters using Bayesian Tracking for Efficient Extraction of Moving Speakers
链接: https://arxiv.org/abs/2603.23723
作者: Jakob Kienegger,Timo Gerkmann
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: This work has been submitted to the IEEE for possible publication
Abstract:Deep spatially selective filters achieve high-quality enhancement with real-time capable architectures for stationary speakers of known directions. To retain this level of performance in dynamic scenarios when only the speakers’ initial directions are given, accurate, yet computationally lightweight tracking algorithms become necessary. Assuming a frame-wise causal processing style, temporal feedback allows for leveraging the enhanced speech signal to improve tracking performance. In this work, we investigate strategies to incorporate the enhanced signal into lightweight tracking algorithms and autoregressively guide deep spatial filters. Our proposed Bayesian tracking algorithms are compatible with arbitrary deep spatial filters. To increase the realism of simulated trajectories during development and evaluation, we propose and publish a novel dataset based on the social force model. Results validate that the autoregressive incorporation significantly improves the accuracy of our Bayesian trackers, resulting in superior enhancement with none or only negligibly increased computational overhead. Real-world recordings complement these findings and demonstrate the generalizability of our methods to unseen, challenging acoustic conditions.
[LG-73] The Economics of Builder Saturation in Digital Markets
链接: https://arxiv.org/abs/2603.23685
作者: Armin Catovic
类目: Theoretical Economics (econ.TH); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); General Economics (econ.GN)
*备注: 22 pages, 3 figures. Preprint. This paper develops a simple economic model of attention-constrained entry in digital markets, synthesizing results from industrial organization and network science, with applications to AI-enabled production
Abstract:Recent advances in generative AI systems have dramatically reduced the cost of digital production, fueling narratives that widespread participation in software creation will yield a proliferation of viable companies. This paper challenges that assumption. We introduce the Builder Saturation Effect, formalizing a model in which production scales elastically but human attention remains finite. In markets with near-zero marginal costs and free entry, increases in the number of producers dilute average attention and returns per producer, even as total output expands. Extending the framework to incorporate quality heterogeneity and reinforcement dynamics, we show that equilibrium outcomes exhibit declining average payoffs and increasing concentration, consistent with power-law-like distributions. These results suggest that AI-enabled, democratised production is more likely to intensify competition and produce winner-take-most outcomes than to generate broadly distributed entrepreneurial success. Contribution type: This paper is primarily a work of synthesis and applied formalisation. The individual theoretical ingredients - attention scarcity, free-entry dilution, superstar effects, preferential attachment - are well established in their respective literatures. The contribution is to combine them into a unified framework and direct the resulting predictions at a specific contemporary claim about AI-enabled entrepreneurship.
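The core dilution mechanism in the LG-73 abstract, fixed attention divided among an elastically growing pool of producers, reduces to a one-line model (toy numbers, ours; the paper's full model adds quality heterogeneity and reinforcement dynamics on top of this baseline).

```python
def per_builder_payoff(total_attention, n_builders, price_per_unit=1.0):
    """Toy Builder Saturation baseline: total attention is fixed, producers split it.

    Total output scales with n_builders, but attention (demand) does not, so the
    expected payoff per builder is total_attention * price / n_builders.
    """
    return total_attention * price_per_unit / n_builders

# Attention pool of 1000 units: average payoff falls as entry grows 100x.
payoffs = [per_builder_payoff(1000.0, n) for n in (10, 100, 1000)]
```

Even before adding superstar effects, free entry under a fixed attention budget drives average returns toward zero as the producer count grows.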
[LG-74] ZeroFold: Protein-RNA Binding Affinity Predictions from Pre-Structural Embeddings
Link: https://arxiv.org/abs/2603.23583
Authors: Josef Hanke(1),Sebastian Pujalte Ojeda(1),Shengyu Zhang(1),Werngard Czechtizky(2),Leonardo De Maria(2),Michele Vendruscolo(1) ((1) Yusuf Hamied Department of Chemistry, University of Cambridge, UK (2) Medicinal Chemistry, Research and Early Development, Respiratory and Immunology, BioPharmaceuticals R and D, AstraZeneca, Sweden)
Subjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments: 16 pages, 3 figures, 2 tables
Abstract:The accurate prediction of protein-RNA binding affinity remains an unsolved problem in structural biology, limiting opportunities in understanding gene regulation and designing RNA-targeting therapeutics. A central obstacle is the structural flexibility of RNA, as, unlike proteins, RNA molecules exist as dynamic conformational ensembles. Thus, committing to a single predicted structure discards information relevant to binding. Here, we show that this obstacle can be addressed by extracting pre-structural embeddings, which are intermediate representations from a biomolecular foundation model captured before the structure decoding step. Pre-structural embeddings implicitly encode conformational ensemble information without requiring predicted structures. We build ZeroFold, a transformer-based model that combines pre-structural embeddings from Boltz-2 for both protein and RNA molecules through a cross-modal attention mechanism to predict binding affinity directly from sequence. To support training and evaluation, we construct PRADB, a curated dataset of 2,621 unique protein-RNA pairs with experimentally measured affinities drawn from four complementary databases. On a held-out test set constructed with 40% sequence identity thresholds, ZeroFold achieves a Spearman correlation of 0.65, a value approaching the ceiling imposed by experimental measurement noise. Under progressively fairer evaluation conditions that control for training-set overlap, ZeroFold compares favourably with leading structure-based and sequence-based predictors, with the performance gap widening as sequence similarity to competitor training data is reduced. These results illustrate how pre-structural embeddings offer a representation strategy for flexible biomolecules, opening a route to affinity prediction for protein-RNA pairs for which no structural data exist.
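The cross-modal attention idea in this abstract (protein tokens attending over RNA tokens, then pooling to a joint representation) can be sketched generically. The single-head formulation, weight shapes, mean pooling, and random stand-in embeddings below are assumptions for illustration; they are not ZeroFold's actual architecture, and real inputs would be Boltz-2 pre-structural embeddings:

```python
import numpy as np

def cross_modal_attention(protein_emb, rna_emb, d_model=32, seed=0):
    """Generic single-head cross-attention: protein tokens query RNA tokens.
    Projection sizes and pooling are illustrative, not ZeroFold's design."""
    rng = np.random.default_rng(seed)
    d_p, d_r = protein_emb.shape[1], rna_emb.shape[1]
    Wq = rng.normal(scale=0.1, size=(d_p, d_model))
    Wk = rng.normal(scale=0.1, size=(d_r, d_model))
    Wv = rng.normal(scale=0.1, size=(d_r, d_model))
    Q, K, V = protein_emb @ Wq, rna_emb @ Wk, rna_emb @ Wv
    scores = Q @ K.T / np.sqrt(d_model)             # (L_protein, L_rna)
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)         # softmax over RNA tokens
    fused = attn @ V                                # RNA context per protein token
    return fused.mean(axis=0)                       # pooled joint representation

# Random stand-ins for pre-structural embeddings of one protein-RNA pair.
protein = np.random.default_rng(1).normal(size=(120, 64))  # 120 residues
rna = np.random.default_rng(2).normal(size=(40, 64))       # 40 nucleotides
pooled = cross_modal_attention(protein, rna)
print(pooled.shape)  # (32,): a fixed-size vector for a downstream affinity head
```

The key property is that the pooled output has a fixed size regardless of the two sequence lengths, so a regression head on top can predict affinity directly from sequence-derived embeddings.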
[LG-75] The Mass Agreement Score: A Point-centric Measure of Cluster Size Consistency
Link: https://arxiv.org/abs/2603.23581
Authors: Randolph Wiredu-Aidoo
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:In clustering, strong dominance in the size of a particular cluster is often undesirable, motivating a measure of cluster size uniformity that can be used to filter such partitions. A basic requirement of such a measure is stability: partitions that differ only slightly in their point assignments should receive similar uniformity scores. A difficulty arises because cluster labels are not fixed objects; algorithms may produce different numbers of labels even when the underlying point distribution changes very little. Measures defined directly over labels can therefore become unstable under label-count perturbations. I introduce the Mass Agreement Score (MAS), a point-centric metric bounded in [0, 1] that evaluates the consistency of expected cluster size as measured from the perspective of points in each cluster. Its construction yields fragment robustness by design, assigning similar scores to partitions with similar bulk structure while remaining sensitive to genuine redistribution of cluster mass.
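The abstract does not give the MAS formula, but the point-centric perspective it describes can be illustrated with a simpler quantity: the expected cluster size experienced by a randomly chosen point, normalized by its minimum over equal-size partitions. This is a hypothetical stand-in for illustration only, not the paper's Mass Agreement Score:

```python
from collections import Counter

def point_centric_uniformity(labels):
    """Illustrative point-centric size-uniformity score in (0, 1].
    NOT the paper's Mass Agreement Score; the abstract does not state
    its formula, so this only demonstrates the point-centric viewpoint."""
    sizes = Counter(labels)  # cluster label -> cluster size
    n = len(labels)
    k = len(sizes)
    # Expected cluster size from a random point's perspective: each of the
    # n points "reports" the size of its own cluster.
    seen = sum(s * s for s in sizes.values()) / n
    # The minimum n/k is attained by perfectly equal clusters, so the ratio
    # is 1 for uniform partitions and shrinks under size dominance.
    return (n / k) / seen

print(point_centric_uniformity([0] * 10 + [1] * 10 + [2] * 10))  # 1.0 (uniform)
print(point_centric_uniformity([0] * 28 + [1, 2]))               # ~0.38 (dominated)
```

Because it averages over points rather than labels, a score of this shape changes only slightly when a few points move between clusters, which is the stability property the abstract asks of such a measure.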
[LG-76] PDGMM-VAE: A Variational Autoencoder with Adaptive Per-Dimension Gaussian Mixture Model Priors for Nonlinear ICA
Link: https://arxiv.org/abs/2603.23547
Authors: Yuan-Hao Wei,Yan-Jie Sun
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Independent component analysis is a core framework within blind source separation for recovering latent source signals from observed mixtures under statistical independence assumptions. In this work, we propose PDGMM-VAE, a source-oriented variational autoencoder in which each latent dimension, interpreted explicitly as an individual source signal, is assigned its own Gaussian mixture model prior. Unlike conventional VAE formulations with a shared simple prior, the proposed framework imposes per-dimension heterogeneous prior constraints, enabling the model to capture diverse non-Gaussian source statistics and thereby promote source separation under a probabilistic encoder-decoder architecture. Importantly, the parameters of these per-dimension GMM priors are not fixed in advance, but are adaptively learned and automatically refined toward convergence together with the encoder and decoder parameters under the overall training objective. Within this formulation, the encoder serves as a demixing mapping from observations to latent sources, while the decoder reconstructs the observed mixtures from the inferred components. The proposed model provides a systematic study of an idea previously only noted in preliminary form, namely, equipping different latent sources with different GMM priors for ICA, and formulates it as a full VAE framework with end-to-end training and per-dimension prior learning. Experimental results on both linear and nonlinear mixing problems demonstrate that PDGMM-VAE can recover latent source signals and achieve satisfactory separation performance.
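The prior term this abstract describes, an independent, possibly different GMM per latent dimension, can be sketched as a log-density evaluation of the kind that would enter a VAE's ELBO. The component counts and parameter values below are fixed illustrative assumptions; in PDGMM-VAE these parameters are learned jointly with the encoder and decoder:

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_gmm_prior(z, weights, means, stds):
    """log p(z) under independent per-dimension GMM priors:
    log p(z) = sum_d log sum_k w[d][k] * N(z[d]; mu[d][k], sigma[d][k])."""
    total = 0.0
    for d, zd in enumerate(z):
        comps = []
        for w, mu, sd in zip(weights[d], means[d], stds[d]):
            log_norm = -0.5 * math.log(2 * math.pi * sd * sd)
            comps.append(math.log(w) + log_norm - (zd - mu) ** 2 / (2 * sd * sd))
        total += logsumexp(comps)
    return total

# Two latent sources with different (assumed) non-Gaussian priors:
# dimension 0 is bimodal, dimension 1 heavy-tailed via unequal spreads.
weights = [[0.5, 0.5], [0.9, 0.1]]
means = [[-2.0, 2.0], [0.0, 0.0]]
stds = [[0.5, 0.5], [0.3, 3.0]]
print(log_gmm_prior([2.0, 0.1], weights, means, stds))  # z[0] at a mode
print(log_gmm_prior([0.0, 0.1], weights, means, stds))  # z[0] between modes: lower
```

Giving each dimension its own heterogeneous mixture is what lets the prior express bimodal or heavy-tailed source statistics that a shared standard-normal prior cannot, which is the mechanism the abstract relies on to promote separation.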
Attachment Downloads